When developers build applications, they often use entire open source packages. But they may sometimes also pull in individual files or parts of files. These smaller parts of files or lines of code are known as snippets.
Historically, programmers have included snippets from sites like Stack Overflow. AI coding tools like GitHub Copilot are also potential snippet sources; the AI tool may output a snippet from an open source library on which it was trained.
Although a snippet is only part of a larger file, the open source license that governs the full file still applies to the snippet. If, say, a file is licensed under GPLv3, a snippet from that file would be as well.
Snippet scanning tools work by analyzing a snippet and matching it against a database of known full open source components. The goal is to identify the full file (and accompanying license) where the snippet originated, along with potential vulnerabilities that impact it.
The commonly used SPDX and CycloneDX SBOM formats both support snippets.
How Snippet Scanning Works: A Technical Overview
Snippet scanning tools generally use one of two methodologies to match snippets to their parent files:
- Function-level matching
- Expression-level matching
These are significantly different snippet-scanning implementations. Function-level matching detects whole functions in a file, whereas expression-level matching works on expressions, of which there are commonly several per line.
Function-level matching would, for example, index the function “levenshtein” in the code block below, and then check whether this function exists in the user’s codebase.
Expression-level matching would treat many permutations of the expressions inside this function as separate snippets. For example, depending on how aggressively the expression extraction is tuned, each line in this example function could yield multiple snippets.
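For illustration, a function with the signature used in this example might look like the sketch below. The body is an assumption: it is a standard single-row dynamic-programming implementation of edit distance, and the helper `min3` is a hypothetical name introduced here.

```c
#include <stdlib.h>
#include <string.h>

/* Smallest of three values (hypothetical helper). */
static size_t min3(size_t x, size_t y, size_t z) {
    size_t m = x < y ? x : y;
    return m < z ? m : z;
}

/* Edit distance between two NUL-terminated strings, using one rolling row.
 * Returns (size_t)-1 if memory allocation fails. */
size_t levenshtein(const char *a, const char *b) {
    size_t la = strlen(a), lb = strlen(b);
    size_t *row = malloc((lb + 1) * sizeof *row);
    if (row == NULL) return (size_t)-1;

    /* Distance from the empty string to the first j characters of b. */
    for (size_t j = 0; j <= lb; j++) row[j] = j;

    for (size_t i = 1; i <= la; i++) {
        size_t diag = row[0];          /* previous row, column j - 1 */
        row[0] = i;
        for (size_t j = 1; j <= lb; j++) {
            size_t up = row[j];        /* previous row, column j */
            size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            row[j] = min3(row[j - 1] + 1,  /* insertion */
                          up + 1,          /* deletion */
                          diag + cost);    /* substitution or match */
            diag = up;
        }
    }

    size_t distance = row[lb];
    free(row);
    return distance;
}
```

Calling `levenshtein("kitten", "sitting")` on this sketch returns 3.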
Given this example, function-level matching would evaluate a single snippet: the entire content of the function (such scanners usually normalize whitespace before comparing).
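Function-level fingerprinting of this kind can be sketched as follows. This is a minimal illustration, not how any particular scanner works: the names `normalize_spaces` and `function_fingerprint`, the 4096-byte buffer, and the choice of the FNV-1a hash are all assumptions made for the example.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Copy src into dst, collapsing each run of whitespace into one space and
 * dropping leading and trailing whitespace. Output is truncated to fit cap.
 * Returns the number of bytes written, not counting the terminator. */
size_t normalize_spaces(const char *src, char *dst, size_t cap) {
    size_t n = 0;
    int pending_space = 0;
    for (; *src != '\0'; src++) {
        if (isspace((unsigned char)*src)) {
            pending_space = (n > 0);   /* ignore leading whitespace */
            continue;
        }
        if (pending_space && n + 1 < cap) dst[n++] = ' ';
        pending_space = 0;
        if (n + 1 < cap) dst[n++] = *src;
    }
    if (cap > 0) dst[n] = '\0';
    return n;
}

/* 64-bit FNV-1a hash of a NUL-terminated string. */
uint64_t fnv1a64(const char *s) {
    uint64_t h = 0xcbf29ce484222325ULL;    /* FNV offset basis */
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 0x100000001b3ULL;             /* FNV prime */
    }
    return h;
}

/* Fingerprint of a whole function: hash of its whitespace-normalized text.
 * A scanner would look this value up in an index of known functions. */
uint64_t function_fingerprint(const char *function_text) {
    char buf[4096];                        /* assumed upper bound for the sketch */
    normalize_spaces(function_text, buf, sizeof buf);
    return fnv1a64(buf);
}
```

Two copies of a function that differ only in indentation or line breaks produce the same fingerprint here, while any change to a token produces a different one. Note that this simple normalization does not treat `a+b` and `a + b` as equal; a real tool would likely normalize at the token level.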
Meanwhile, expression-level matching would attempt to extract many smaller snippets from the function. For example, the following are all possible snippets that an expression-level snippet scanner would consider:
size_t levenshtein
size_t levenshtein(const char *a
size_t levenshtein(const char *a, const char *b)
levenshtein(const char *a, const char *b)
So, there are many snippets just for the function signature before we even get into the body. Some implementations have ways to filter noisy snippets, so it’s not always as noisy as one might initially infer based on this example.
But, of course, “this is a noisy match” doesn’t automatically mean it’s always safe to ignore: licenses do not suddenly lose their effect just because the code they cover is often copied.
Discarding noisy matches is safest when the match is noisy because it reflects a common coding convention rather than because the snippet has simply been copied a lot. But there’s no good way for a scanner to reliably infer the difference; that’s in the domain of human review.
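The expression-level extraction described above can be sketched as enumerating contiguous token windows. This is only an illustration under stated assumptions: the tokenizer is deliberately naive, the names `tokenize`, `count_candidate_snippets`, and `candidate_text` are hypothetical, and the `min_tokens` threshold stands in for the kind of noise filter some implementations apply.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKENS 256

/* A token is a half-open byte range [begin, end) in the source text. */
typedef struct { size_t begin, end; } Token;

static int is_word(char c) { return isalnum((unsigned char)c) || c == '_'; }

/* Split src into word tokens and single-character punctuation tokens. */
size_t tokenize(const char *src, Token *out, size_t cap) {
    size_t n = 0, i = 0;
    while (src[i] != '\0' && n < cap) {
        if (isspace((unsigned char)src[i])) { i++; continue; }
        size_t start = i;
        if (is_word(src[i])) {
            while (is_word(src[i])) i++;
        } else {
            i++;                        /* one punctuation character */
        }
        out[n].begin = start;
        out[n].end = i;
        n++;
    }
    return n;
}

/* Number of contiguous token windows containing at least min_tokens tokens. */
size_t count_candidate_snippets(const char *src, size_t min_tokens) {
    Token toks[MAX_TOKENS];
    size_t n = tokenize(src, toks, MAX_TOKENS);
    size_t count = 0;
    if (min_tokens == 0) min_tokens = 1;
    for (size_t len = min_tokens; len <= n; len++) count += n - len + 1;
    return count;
}

/* Copy the source text spanned by tokens [first, first + len) into out.
 * Returns the number of bytes written, or 0 if the window is out of range. */
size_t candidate_text(const char *src, size_t first, size_t len,
                      char *out, size_t cap) {
    Token toks[MAX_TOKENS];
    size_t n = tokenize(src, toks, MAX_TOKENS);
    if (cap == 0) return 0;
    out[0] = '\0';
    if (len == 0 || first >= n || len > n - first) return 0;
    size_t begin = toks[first].begin;
    size_t bytes = toks[first + len - 1].end - begin;
    if (bytes >= cap) bytes = cap - 1;
    memcpy(out, src + begin, bytes);
    out[bytes] = '\0';
    return bytes;
}
```

With this tokenizer, the signature `size_t levenshtein(const char *a, const char *b)` splits into 13 tokens, which gives 91 candidate windows in total and 78 once single-token windows are filtered out. The window covering the first two tokens is `size_t levenshtein`, the first candidate listed above.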
Accepting Snippet Matches
Regardless of the type of matching utilized, once a snippet scanning tool analyzes the code and surfaces potential matches, the end-user may be required to take action to confirm the appropriate one.
Particularly for shorter snippets, it’s virtually impossible for a scanning tool to identify the source of the snippet with 100% confidence. A large snippet might match only a few open source libraries, but a small snippet can match any number of them, and every one of those matches requires triage. Tools can and do attempt to analyze and report which full files are likely the original source, but it’s ultimately up to the end-user to review and confirm.
Depending on the number of snippets analyzed (and the number of potential matches for each snippet), this can either be a relatively quick process — or a rather lengthy and painful one.
Do I Need Snippet Scanning?
There is no universal truth when it comes to the necessity of snippet scanning. Different organizations in different industries and with different levels of risk tolerance will have different requirements.
However, at FOSSA, our view is that most teams and companies would be best served by deprioritizing snippet scanning and instead investing in other open source license compliance initiatives. There are several reasons for this belief.
- Snippet scanning is generally plagued by false positives and requires manual intervention
As we discussed earlier in this post, having an end-user accept possible matches between snippet and parent library is a critical part of the snippet scanning process.
But since snippet scanning tools provide only “likely” matches (in many cases), you will inevitably come across many false positives.
- Many “likely” matches are actually correct: the snippet truly appears in multiple projects, so you then need to investigate provenance, that is, which project the code appeared in first.
- Most CVEs are identified based on the overall package and version and not on the specific snippet.
Although it is possible to fine-tune the results of snippet scanning tools to reduce some of the noise, even the best-case outcome requires a meaningful amount of manual work to review potential matches and discard false positives.
Kate Downing, a leading IP law attorney (and FOSSA advisor), shared a real-world example of the challenges in matching a snippet to its parent file on her blog.
“I’ve personally seen code scanners attribute snippets to very large, very popular projects, when the snippet is actually found in a subcomponent owned by someone else entirely, written long before the popular project came into existence.”
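The provenance question (which of several matching projects had the code first) can be sketched as a simple selection over candidate matches. Everything here is hypothetical: the `SnippetMatch` fields, the idea that a tool exposes a first-seen year, and the function name `earliest_match` are assumptions for illustration, and a real review would still need a human to confirm the result.

```c
#include <stddef.h>

/* One candidate source for a snippet (hypothetical record layout). */
typedef struct {
    const char *project;      /* name of the matching project */
    const char *license;      /* license identifier reported for the match */
    int first_seen_year;      /* year the code first appeared, 0 if unknown */
} SnippetMatch;

/* Return the match with the earliest known first-seen year, or NULL if no
 * match has a known year. Ties keep the earlier entry in the array. */
const SnippetMatch *earliest_match(const SnippetMatch *matches, size_t n) {
    const SnippetMatch *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (matches[i].first_seen_year == 0) continue;   /* unknown date */
        if (best == NULL ||
            matches[i].first_seen_year < best->first_seen_year) {
            best = &matches[i];
        }
    }
    return best;
}
```

In the situation Downing describes, a scanner that ranks by popularity would surface the large project, while a selection like this one would point at the older subcomponent, provided the dates are known and accurate.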
- Many snippets aren’t copyrightable
Beyond the inherently manual nature of reviewing snippet matches, there’s also the legal question of snippets and copyright law. For a work to be copyrightable, it must meet certain thresholds for originality and creativity. A basic function or control flow statement — or similar purely functional piece of code — is unlikely to be copyrightable.
As Downing writes:
“Even when a snippet scanner turns up an exact match, it might mean very little. The snippet may not be copyrightable, or may reflect a common code pattern used by many projects.”
- AI code completion tools like GitHub Copilot have built-in safeguards
Most of our customers who talk to us about snippet scanning do so in the context of analyzing AI-produced code. If you are in this category, we’d highly recommend taking advantage of the protections tools like GitHub Copilot offer. These include a “Duplication Detection” filter that, if enabled, ensures Copilot will not suggest an exact match of code upon which it was trained.
Moreover, Microsoft’s Copilot Copyright Commitment indemnifies paying customers against copyright claims as long as those customers enable Duplication Detection and other safeguards.
Of course, there are more AI code completion tools than just GitHub Copilot, but, in our opinion, Copilot sets market precedent here. You can work with the company providing your code completion tool to ensure it covers you the same way, or, if this peace of mind is important to you, choose a tool that does provide it.
- There has been no known legal enforcement action based on snippets
Most in-house legal and engineering teams aren’t exactly swimming in excess bandwidth. In practice, this means managing IP risks becomes an exercise in prioritizing which risks to address.
As such, it is important to consider the realities of snippet-related IP legal risks.
As of this writing, we’re not aware of any cases based on snippets.
As Downing explains:
“One has to look practically at the possibility of actual legal enforcement in this context. I’m not aware of any litigation based merely on snippets. Every open source-related litigation I’m aware of involved taking substantial portions of libraries, drivers, even operating systems without proper attribution or source code offers. Even if one were in the business of trolling, trolling merely on the basis of snippets and nothing more is just not profitable. There are so many companies out there not doing even basic compliance with entire Linux distributions, that there’s really no reason to spend time and money arguing about much more gray cases like snippets, which the plaintiff is less likely to win and which will be more costly because the plaintiff will need to bring in evidence and experts to defend the copyrightability of the snippet.”
When you combine these factors (false positives, manual intervention, questions over copyrightability, and the lack of real-world legal cases), our view is that most organizations will find investing in other open source license compliance activities to be a better use of limited resources. These include container scanning and accurate CI scanning. Both take effort to execute properly, but that effort is better spent there than on hunting snippets.
For more background on snippet scanning and FOSSA’s open source license compliance solutions, please reach out to our experts.
Editor's note: A version of this post was published in October 2023. It was updated in September 2024 with additional details on how snippet scanning works and perspective on the necessity of snippet scanning.