SOURCETRACKER: Hybrid Vector-Fingerprint System Enables Scalable Provenance Tracking for LLM-Generated Code

28/05/2026 · SOURCETRACKER THESTACKV2 Winnowing code provenance large language models

Provenance Tracking at Scale

Large language models (LLMs) for code completion often reproduce training examples verbatim, raising legal and ethical concerns about plagiarism and license compliance. Classical fingerprint-based methods like Winnowing are effective for exact-match detection, but they require a linear-time search over the corpus, which is impractical at the billion-snippet scale of modern code LLM training sets. A new paper introduces SOURCETRACKER, a 300M-parameter encoder for code retrieval, and a hybrid pipeline called HYBRIDSOURCETRACKER (HST) that combines vector search with Winnowing to achieve scalable, high-precision provenance tracking. The system first narrows the corpus down to a small set of candidate snippets via vector search, then re-ranks those candidates with exact fingerprint matching, preserving logarithmic-time query complexity.
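For readers unfamiliar with Winnowing, the following is a minimal sketch of the classical algorithm: hash every run of k consecutive tokens, slide a window over those hashes, and keep the minimum of each window as a fingerprint. The function names, the whitespace tokenization, the choice of BLAKE2 as the hash, the parameter values, and the Jaccard overlap score are all assumptions made for illustration; the article does not describe the paper's actual configuration.

```python
import hashlib


def kgram_hashes(tokens, k):
    """Hash every run of k consecutive tokens (hash choice is an assumption)."""
    hashes = []
    for i in range(len(tokens) - k + 1):
        gram = " ".join(tokens[i:i + k]).encode("utf-8")
        digest = hashlib.blake2b(gram, digest_size=8).digest()
        hashes.append(int.from_bytes(digest, "big"))
    return hashes


def winnow(tokens, k=5, window=4):
    """Return the set of (hash, position) fingerprints chosen by Winnowing."""
    hashes = kgram_hashes(tokens, k)
    if not hashes:
        return set()
    window = min(window, len(hashes))  # short inputs: use one smaller window
    selected = set()
    for start in range(len(hashes) - window + 1):
        win = hashes[start:start + window]
        smallest = min(win)
        # On ties, keep the rightmost minimum, as in the classical algorithm.
        offset = window - 1 - win[::-1].index(smallest)
        selected.add((smallest, start + offset))
    return selected


def fingerprint_overlap(a_tokens, b_tokens, k=5, window=4):
    """Jaccard overlap of the two fingerprint hash sets (illustrative score)."""
    a = {h for h, _ in winnow(a_tokens, k, window)}
    b = {h for h, _ in winnow(b_tokens, k, window)}
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


snippet = "def add ( a , b ) : return a + b".split()
document = ("import os\n" + " ".join(snippet) + "\nprint ( add ( 1 , 2 ) )").split()
print(fingerprint_overlap(snippet, document))
```

Comparing one query against every document this way is the linear scan the article refers to: the cost grows with the number of snippets in the corpus.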

How the Hybrid Pipeline Works

The researchers trained and evaluated SOURCETRACKER on a 10M-snippet subset of the THESTACKV2 dataset, which includes both verbatim snippets and adapted snippets that simulate realistic identifier renaming. In the hybrid approach, vector search quickly retrieves a small set of candidates from the entire corpus, dramatically reducing the search space before the more computationally expensive fingerprint matching step. This two-stage design addresses the core limitation of standalone fingerprinting: although Winnowing is highly accurate, its linear scan over every snippet in the training set does not scale. By contrast, vector search enables logarithmic-time nearest-neighbor retrieval, but on its own it may miss subtle matches that fingerprints catch. Combining the two yields both scalability and precision.
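The two-stage idea can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: `embed` is a toy hashed bag-of-tokens vector standing in for the learned SOURCETRACKER encoder, stage 1 uses a brute-force cosine scan where a real system would query an approximate nearest-neighbor index to get the logarithmic-time behavior, and the corpus, function names, and parameters are all hypothetical.

```python
import hashlib
import math
from collections import Counter


def _hash(text, size):
    return int.from_bytes(hashlib.blake2b(text.encode("utf-8"), digest_size=size).digest(), "big")


def embed(tokens, dim=256):
    """Toy stand-in for a learned encoder: hashed bag of tokens, L2-normalized."""
    vec = [0.0] * dim
    for tok, count in Counter(tokens).items():
        vec[_hash(tok, 4) % dim] += count
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    """Dot product of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def fingerprints(tokens, k=3, window=3):
    """Winnowing-style fingerprints: minimum k-gram hash of each window."""
    hashes = [_hash(" ".join(tokens[i:i + k]), 8) for i in range(len(tokens) - k + 1)]
    if not hashes:
        return set()
    window = min(window, len(hashes))
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


def hybrid_search(query, corpus, top_k=3):
    """Stage 1: vector search for candidates. Stage 2: fingerprint re-ranking."""
    q_vec = embed(query)
    # Brute force here for clarity; an ANN index would replace this scan.
    by_vector = sorted(corpus, key=lambda name: cosine(q_vec, embed(corpus[name])), reverse=True)
    candidates = by_vector[:top_k]

    q_fp = fingerprints(query)

    def fp_score(name):
        # Fraction of the query's fingerprints found in the candidate.
        return len(q_fp & fingerprints(corpus[name])) / len(q_fp) if q_fp else 0.0

    # sorted() is stable, so ties keep their vector-search order.
    return sorted(candidates, key=fp_score, reverse=True)


corpus = {
    "add": "def add ( a , b ) : return a + b".split(),
    "sub": "def sub ( x , y ) : return x - y".split(),
    "greet": "def greet ( name ) : print ( name )".split(),
    "loop": "for i in range ( 10 ) : total += i".split(),
}
query = "( a , b ) : return a + b".split()
print(hybrid_search(query, corpus))
```

The design point is that the expensive fingerprint comparison only ever runs on `top_k` candidates, so its cost no longer depends on the size of the corpus.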

Performance Benchmarks and Key Findings

In an in vitro experiment using a 100k-snippet search space with adapted queries, the hybrid approach achieved mean reciprocal rank on par with Winnowing for 30-token fragments. For windows of 60 tokens or larger, the hybrid system consistently outperformed pure Winnowing by up to 5.4%, while maintaining logarithmic-time query complexity. This is a significant result: longer code fragments are often the most problematic for copyright infringement because they contain more original expression. The team also conducted a complementary evaluation using an LLM-based judge. They discovered that many retrieved snippets not labeled as ground truth were still highly similar to the expected sources, particularly with longer context windows. This means that even when exact provenance cannot be established, users still receive useful information about potential source code origins, improving practical usability.
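Mean reciprocal rank (MRR), the metric cited above, averages 1/rank of the correct source over all queries, counting a query as 0 when the correct source is not retrieved at all. The sketch below shows the standard definition; the function name and the toy data are invented for illustration and are not results from the paper.

```python
def mean_reciprocal_rank(ranked_results, ground_truth):
    """Average of 1/rank of the true source per query (0 if it is not retrieved)."""
    if not ranked_results:
        return 0.0
    total = 0.0
    for query_id, ranking in ranked_results.items():
        target = ground_truth[query_id]
        for rank, candidate in enumerate(ranking, start=1):
            if candidate == target:
                total += 1.0 / rank
                break
    return total / len(ranked_results)


# Toy data: q1 finds its source at rank 1, q2 at rank 2, q3 misses it.
ranked = {
    "q1": ["repo_a/util.py", "repo_b/math.py"],
    "q2": ["repo_c/io.py", "repo_d/net.py"],
    "q3": ["repo_e/cli.py", "repo_f/db.py"],
}
truth = {"q1": "repo_a/util.py", "q2": "repo_d/net.py", "q3": "repo_z/missing.py"}
print(mean_reciprocal_rank(ranked, truth))  # (1 + 0.5 + 0) / 3 = 0.5
```

Because MRR only rewards the single labeled source, near-duplicates ranked above it lower the score, which is one reason a complementary LLM-judge evaluation like the one described above can be informative.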

Implications for AI-Assisted Coding

The findings have direct implications for developers and enterprises using code completion tools. As LLMs increasingly generate substantial code blocks, the ability to attribute snippets back to their training data becomes essential for complying with open-source licenses and avoiding plagiarism. Current solutions like GitHub Copilot's code referencing feature rely on fuzzy matching, which can miss adapted versions. HST's hybrid method offers a more rigorous alternative by combining semantic similarity (vector search) with exact pattern matching (fingerprinting). Moreover, the logarithmic-time complexity means the system can scale to billion-snippet corpora, making it feasible for continuous integration pipelines or IDE plugins that need real-time checks.

Limitations and Future Work

The paper acknowledges that the evaluation was limited to the THESTACKV2 dataset, which may not fully represent the diversity of code in production LLM training sets. Additionally, the adapted snippets used in testing cover only identifier renaming, not other transformations such as control-flow changes or comment insertion. However, the hybrid framework is modular: the vector encoder can be retrained on domain-specific code, and the fingerprinting component can be swapped for other hashing schemes. Looking ahead, the researchers propose extending SOURCETRACKER to support multi-language corpora and integrating it into a public provenance API for LLM providers. As regulatory scrutiny around AI-generated content intensifies, tools like HST will likely become standard components in responsible AI deployment.
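To make the "identifier renaming" adaptation concrete, here is one way such an adapted snippet could be produced for Python source using the standard `tokenize` module. This is a hypothetical sketch: the article does not say how the paper generated its adapted snippets, and the `rename_identifiers` helper and the hand-written mapping are assumptions for illustration.

```python
import io
import tokenize


def rename_identifiers(source, mapping):
    """Return `source` with NAME tokens renamed according to `mapping`.

    Only names listed in `mapping` change; comments and string literals
    are separate token types and are left untouched.
    """
    renamed = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and tok.string in mapping:
            tok = tok._replace(string=mapping[tok.string])
        renamed.append(tok)
    return tokenize.untokenize(renamed)


original = (
    "def area(width, height):\n"
    "    # width times height\n"
    "    return width * height\n"
)
adapted = rename_identifiers(original, {"area": "f", "width": "w", "height": "h"})
print(adapted)
```

A renamed snippet like this shares structure with its source but fewer exact token runs, which is the kind of query where exact fingerprints alone have less to match on.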

Source: HuggingFace Papers

345tool Editorial Team

We are a team of AI technology enthusiasts and researchers dedicated to discovering, testing, and reviewing the latest AI tools to help users find the right solutions for their needs.

