Snippet Scanning, Explained

September 30, 2024 · 7 min read · Jessica Black
When developers build applications, they often use entire open source packages. But they may sometimes also pull in individual files or parts of files. These smaller parts of files or lines of code are known as snippets.

Historically, programmers have included snippets from sites like Stack Overflow. AI coding tools like GitHub Copilot are also potential snippet sources; the AI tool may output a snippet of an open source library that was part of its training data.

Although a snippet is only part of a larger file, the open source license that governs the full file still applies to the snippet. If, say, a file is licensed under GPL v3, a snippet from that file would be as well.

Snippet scanning tools work by analyzing a snippet and matching it against a database of known full open source components. The goal is to identify the full file (and accompanying license) where the snippet originated, along with potential vulnerabilities that impact it.

The commonly used SPDX and CycloneDX SBOM formats both support snippets.
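As a rough illustration of what snippet support can look like in an SBOM, the sketch below builds one entry shaped like the "snippets" array of the SPDX 2.3 JSON serialization. The element IDs, byte offsets, and license value are made-up placeholders for illustration, not output from a real scan, and the helper name make_spdx_snippet is our own.

```python
import json

# Hypothetical SPDX element ID; a real SBOM would define this file element elsewhere.
FILE_ID = "SPDXRef-VendoredFile"


def make_spdx_snippet(snippet_id, file_id, start_byte, end_byte, license_id):
    """Return a dict shaped like one entry of the SPDX 2.3 JSON "snippets" array."""
    return {
        "SPDXID": snippet_id,
        "snippetFromFile": file_id,
        # Byte range of the snippet inside the file it was found in.
        "ranges": [
            {
                "startPointer": {"offset": start_byte, "reference": file_id},
                "endPointer": {"offset": end_byte, "reference": file_id},
            }
        ],
        "licenseConcluded": license_id,
        "licenseInfoInSnippets": [license_id],
        "copyrightText": "NOASSERTION",
    }


# Placeholder values only: offsets 310-420 and GPL-3.0-only are illustrative.
snippet = make_spdx_snippet("SPDXRef-Snippet-1", FILE_ID, 310, 420, "GPL-3.0-only")
print(json.dumps(snippet, indent=2))
```

The point of the structure is that a snippet is recorded as a range inside a file, with its own license conclusion that can differ from the license of the file that contains it.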

How Snippet Scanning Works: A Technical Overview

Snippet scanning tools generally use one of two methodologies to match snippets to their parent files:

  • Function-level matching
  • Expression-level matching

These are significantly different snippet-scanning implementations. Function-level matching detects whole functions in a file, whereas expression-level matching works on expressions, of which there are commonly several per line.

Function-level matching would, for example, index the function "levenshtein" in the code block below and then check whether this function exists in the user's codebase.

Expression-level matching would treat many permutations of the expressions inside this function as separate snippets. For example, depending on how aggressively the expression extraction is tuned, each line in this example function could yield multiple snippets.

[Figure: code snippet example, a C function with the signature size_t levenshtein(const char *a, const char *b)]

Given this example, function-level matching would evaluate a single snippet: the entire content of the function (such scanners usually normalize whitespace before comparing).
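To make the function-level idea concrete, here is a minimal sketch, not a description of how any particular scanner works. It assumes function boundaries have already been detected, collapses runs of whitespace, and uses a SHA-256 hash as the fingerprint. The C body in UPSTREAM_SOURCE is a hypothetical stand-in (only the signature comes from the example above), and the component name and license in the index are invented.

```python
import hashlib
import re

# Hypothetical stand-in for the example function: the signature is from the
# article, the body is an ordinary two-row Levenshtein implementation.
UPSTREAM_SOURCE = """
size_t levenshtein(const char *a, const char *b) {
    size_t la = strlen(a), lb = strlen(b);
    size_t *prev = malloc((lb + 1) * sizeof *prev);
    size_t *curr = malloc((lb + 1) * sizeof *curr);
    for (size_t j = 0; j <= lb; j++) prev[j] = j;
    for (size_t i = 1; i <= la; i++) {
        curr[0] = i;
        for (size_t j = 1; j <= lb; j++) {
            size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            size_t best = prev[j - 1] + cost;
            if (prev[j] + 1 < best) best = prev[j] + 1;
            if (curr[j - 1] + 1 < best) best = curr[j - 1] + 1;
            curr[j] = best;
        }
        size_t *tmp = prev; prev = curr; curr = tmp;
    }
    size_t result = prev[lb];
    free(prev); free(curr);
    return result;
}
"""


def fingerprint(function_source: str) -> str:
    """Collapse every run of whitespace to one space, then hash the result."""
    normalized = re.sub(r"\s+", " ", function_source).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Index of known functions: fingerprint -> (component, license). Illustrative entry.
index = {fingerprint(UPSTREAM_SOURCE): ("example-strings-lib", "GPL-3.0-only")}

# A copy in the user's codebase that was re-indented with tabs and blank lines.
user_copy = UPSTREAM_SOURCE.replace("    ", "\t").replace("\n", "\n\n")
match = index.get(fingerprint(user_copy))
print(match)
```

Because the fingerprint is an exact hash, this sketch tolerates reformatting but not edits: renaming a single variable in the copied function would produce a different fingerprint and no match.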

Meanwhile, expression-level matching would attempt to extract many smaller snippets from the function. For example, the following are all snippets that an expression-level snippet scanner might consider:

  • size_t levenshtein
  • size_t levenshtein(const char *a
  • size_t levenshtein(const char *a, const char *b)
  • levenshtein(const char *a, const char *b)

So, there are many snippets just for the function signature before we even get into the body. Some implementations have ways to filter noisy snippets, so it's not always as noisy as one might initially infer based on this example.
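A deliberately naive sketch shows how quickly the candidates multiply. It assumes a simple regex tokenizer and treats every contiguous run of at least two tokens as a candidate snippet; real tools are tuned far more carefully, and the function name candidate_snippets is our own.

```python
import re

# Identifiers, numbers, or any single non-space character count as one token.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def candidate_snippets(code: str, min_tokens: int = 2) -> list[str]:
    """Return the source text of every contiguous window of >= min_tokens tokens."""
    spans = [m.span() for m in TOKEN.finditer(code)]
    candidates = []
    for i in range(len(spans)):
        for j in range(i + min_tokens - 1, len(spans)):
            # Slice the original text so spacing inside the window is preserved.
            candidates.append(code[spans[i][0]:spans[j][1]])
    return candidates


signature = "size_t levenshtein(const char *a, const char *b)"
snippets = candidate_snippets(signature)
print(len(snippets))  # 78 windows from a 13-token signature
print("size_t levenshtein(const char *a" in snippets)  # True
```

All four of the snippets listed above appear among these windows, alongside many fragments such as "const char" that could match almost any C codebase.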

But, of course, "this is a noisy match" doesn't automatically mean it's always safe to ignore: licenses do not suddenly lose their effect just because the code they cover is often copied.

Discarding noisy matches is safest when the match is noisy because it reflects a common coding convention, not because the snippet has simply been copied a lot. There's no good way for a scanner to reliably infer the difference, though; that's in the domain of human review.
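One way to act on that distinction is to flag frequent snippets for review instead of dropping them. The sketch below is an assumption about how such triage could look, not a documented feature of any tool: the threshold, the project names, and the labels are all hypothetical.

```python
def triage(snippet_sources: dict[str, set[str]], max_projects: int = 2) -> dict[str, str]:
    """Route snippets seen in many projects to a human instead of discarding them."""
    decisions = {}
    for snippet, projects in snippet_sources.items():
        if len(projects) > max_projects:
            # Frequent: could be a coding convention OR widely copied licensed code.
            decisions[snippet] = "needs_human_review"
        else:
            decisions[snippet] = "report_match"
    return decisions


# Hypothetical index: snippet text -> projects in which it was observed.
observed = {
    "for (size_t i = 0; i < n; i++)": {"proj-a", "proj-b", "proj-c", "proj-d"},
    "size_t levenshtein(const char *a, const char *b)": {"proj-a"},
}
decisions = triage(observed)
print(decisions)
```

The frequency count can tell the scanner that a snippet is common, but not why it is common, which is why the noisy branch ends in review and not deletion.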

Accepting Snippet Matches

Regardless of the type of matching utilized, once a snippet scanning tool analyzes the code and surfaces potential matches, the end-user may be required to take action. The action will differ depending on the snippet scanning tool, however.

Some tools provide a list of potential matches and ask the end-user to confirm accurate ones and reject false positives.

Other tools take an inverse approach, in which all snippets are “confirmed” by default and a user needs to take action to reject incorrect or low-confidence matches.

Regardless of workflow specifics, the need for human intervention puts a premium on detection accuracy, and in particular on keeping false positives to a minimum.
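The difference between the two review workflows can be sketched as a single default. This is a simplified model under our own assumptions; the class names, statuses, and sample matches are hypothetical and do not reflect any specific product.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    REJECTED = "rejected"


@dataclass
class Match:
    snippet: str
    component: str
    status: Status


def surface(raw_matches: list[tuple[str, str]], confirm_by_default: bool) -> list[Match]:
    """Create review items; the workflow decides the status before any human acts."""
    initial = Status.CONFIRMED if confirm_by_default else Status.PENDING
    return [Match(snippet, component, initial) for snippet, component in raw_matches]


def effective(matches: list[Match]) -> list[Match]:
    """Only confirmed matches count toward compliance obligations."""
    return [m for m in matches if m.status is Status.CONFIRMED]


raw = [
    ("levenshtein body", "example-strings-lib"),
    ("config parser loop", "example-config-lib"),
    ("for-loop header", "example-misc-lib"),  # imagine this one is a false positive
]

opt_in = surface(raw, confirm_by_default=False)   # user must confirm each match
opt_out = surface(raw, confirm_by_default=True)   # user must reject bad matches
print(len(effective(opt_in)), len(effective(opt_out)))  # 0 3

opt_out[2].status = Status.REJECTED  # the reviewer rejects the false positive
print(len(effective(opt_out)))  # 2
```

In the confirm-by-default workflow, a false positive that nobody reviews stays in the results, so every inaccurate match translates directly into review work or into a wrong obligation.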

Do I Need Snippet Scanning?

There is no universal truth when it comes to the necessity of snippet scanning. Different organizations in different industries and with different levels of risk tolerance will have different requirements. How organizations use AI tools in coding (and which tools they use) also has an impact.

Our view at FOSSA on the importance of snippet scanning has evolved. While, in past years, we recommended that our customers deprioritize the initiative, we now think it is a worthwhile investment, with certain caveats. (This is why we’re working on developing a snippet scanning product, which you can learn more about here.)

First, the caveats. We do continue to think that organizations early in their license compliance journeys should focus on getting non-snippet-related foundations in place. These include container scanning and accurate CI scanning, which require effort to execute properly. The ability to generate (and update) attribution notices at scale is also critical.

With that said, we do think there are several important reasons why more mature license compliance organizations should consider snippet scanning.

  1. Generative AI coding tools are rapidly gaining adoption in software development

The rapid adoption of generative AI coding tools presents new risks for organizations. While some established tools like GitHub Copilot offer copyright indemnification under specific conditions (including requirements like using the Duplication Detection feature), this protection is not universal across the growing landscape of AI coding assistants. These AI tools can generate code snippets that may be derivative of existing open source code, making users responsible for understanding and complying with the original licenses when distributing such generated code. The increased availability of numerous AI coding tools, compared to when Copilot held a larger market share, underscores the importance of user awareness regarding licensing obligations.

  2. Snippet scanning tools and techniques have improved

Historically, one of the big issues with snippet scanning tools has been their high rate of false positives — and the need for extensive manual intervention to confirm or reject potential matches. The nature of snippet scanning is such that some false positives will always exist, but tools are getting better at reducing them. Another area of improvement is in new automations that significantly cut down the amount of time teams need to spend on review. FOSSA is prioritizing both of these capabilities as we bring our new snippet scanning product to market.

  3. Real-world legal outcomes in the AI age are largely unknown

One of the reasons we previously cited when recommending against prioritizing snippet scanning was that there hasn’t been legal enforcement action based on snippets (that we’re aware of, anyway). This remains the case, but we should also acknowledge that the rise of AI means the past is not necessarily a predictor of the future. For one, copyright trolls now have many more tools at their disposal to analyze and identify potential license compliance issues in an extremely large volume of code. Additionally, it’s very hard to know how courts will view matters of protecting intellectual property given the ongoing debates about how models are trained.

Ultimately, as we mentioned earlier, we don’t think there’s a one-size-fits-all approach to the snippet scanning decision. But the current AI-powered coding landscape is such that there’s a stronger argument to prioritize the initiative than there was in past years.

For more background on snippet scanning and FOSSA's open source license compliance solutions, please reach out to our experts.

Editor's note: Previous versions of this post were published in October 2023 and September 2024. It was updated again in May 2025 with additional perspectives on the value of snippet scanning in modern software development.
