
Heather Meeker on AI Coding Assistants and OSS License Compliance

October 31, 2025 · 7 min read · Heather Meeker

There’s an old joke that may be familiar to longtime open source license compliance professionals. It goes something like: 80% of developers say they’re using open source code. And the other 20% lied about it.

I think that's where we are with AI coding assistants today: they're ubiquitous and, just like open source, unbelievably helpful. So their penetration into coding over the last few years has been a landslide.

Of course, there are IP concerns to monitor when it comes to using these tools. Below, I’ll outline risks associated with model training vs. output, what guardrails exist (and their limitations), and practical ways to extend open source license compliance to AI-generated code.

This post is based on a webinar I recently conducted with FOSSA; if you're interested in this topic, I'd point you to the on-demand recording for a more in-depth discussion.

The OSS License Compliance Implications of How Coding Assistants Are Trained

There are two different sources of potential IP risk associated with AI assistants: input (training) and output (what the tool produces).

Model Input (Training)

Open source licenses, by definition, allow you to do anything you want with the licensed code, with conditions that mostly trigger on distribution (or SaaS-style substitutes). So, particularly for software that is released under OSS licenses, the training risk is probably not significant. Training is, literally, permitted by the license.

However, not all public code is open source. There's a lot of public code on GitHub, for example, covered by other licenses that might very well specifically prohibit AI training, grant only limited rights, or grant no rights at all. If you're training or fine-tuning a model, you need to consider your liability position on training. Most companies training on code therefore limit their training to code under open source licenses — particularly permissive licenses, for reasons outlined below.

However, it's not entirely clear that a license is even necessary for training. Open source licenses clearly allow someone training an ML model to make local copies in preparation for training, and they would clearly cover the training activity itself as well.

But many experts think that properly training an ML model is fair use. In the U.S., fair use is a balancing test, and one of the important factors is sometimes called the transformation element. Today's models are literally created with a technique called "transformer" architecture. Moreover, another important element of fair use depends on the type of work being used, and functional works like software tend to be more susceptible to fair use than more expressive works like music, books, or videos. This question is also important to the next area of risk, which is model output.

Model Output

Output is different. If a user prompts the model and it spits out code, it is possible that the output code looks substantially similar to training data.

A well-trained model shouldn’t regurgitate, but similarities can appear — sometimes because there might not be many ways to express a certain statement. After all, coders have been borrowing ideas and coding conventions from each other for decades, and there are many similarities between various examples of human-written code that do the same thing.

If output resembles a licensed work to the point where it would be considered a derivative, the end-user’s ability to use the output might be conditioned on requirements that range from producing notice files to copyleft source code disclosure.

This is where the fair use question comes back into play. If training is fair use, then no license is necessary, and the notice and source code sharing requirements of open source licenses would not apply.

But the jury is still out on this question — or more accurately, the courts have not spoken as a matter of law. There are over 50 cases pending in the U.S. today about ML training, and not only is there no consensus on the fair use question, very few of the pending cases concern software. Additionally, if there are very substantial similarities between model output and the licensed open source code the model was trained on, a fair use finding on training probably would not inoculate the user of the output from liability.

There are practical aspects to applying open source requirements to ML output as well. The model itself can’t provide the provenance for a specific code suggestion; it’s a black box in that respect. The fact that models don’t preserve provenance information is why independent code scanning tools (particularly ones with snippet scanning capabilities) are so important.

Built-In Guardrails from License Compliance Tools and Their Limitations

The good news is that, while AI coding assistant output does raise certain OSS license compliance concerns, there are also ways to mitigate those risks. This includes built-in guardrails offered by the tools themselves.

Early commercial tools introduced filters that let users block suggestions exceeding a certain length, or that try to match output against the training input and suppress close matches.

Additionally, paid tools may offer indemnification against infringement claims, though there are a few important conditions and limitations to keep in mind.

  1. If you use the free version, you likely won't be covered.
  2. Tools may require certain filters to be enabled for the indemnification offer to stand.
  3. Even with a paid subscription, indemnities are limited in nature, often capped at subscription fees paid (which won't get you very far in defending a copyright infringement claim).

Ultimately, do make sure that your developers are using paid versions, and turn on all of those features if you are risk-averse — doing so can even improve your indemnity position. (If a tool offers indemnification, you may also consider trying to negotiate terms if you're a large customer.)

But remember: AI assistants do not preserve provenance. That’s why independent scanning and compliance tools matter. They are probably the only way, and the last resort, for fixing any infringement that might be introduced by ML coding tools.

Strategies for Extending OSS License Compliance Workflows to Cover AI Coding Assistant Output

Along with making use of built-in guardrails, it’s important to consider how to apply license compliance tools and workflows to AI output.

Snippet scanning tools are particularly valuable in this new landscape; they're becoming essential because of the need to identify small, matching fragments of code that might originate from open source projects.
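To make the idea concrete, here is a minimal sketch in Python of how fragment fingerprinting can work; it is purely illustrative and not how any particular scanner is implemented. The approach: normalize the code, hash overlapping windows of lines, and compare the resulting fingerprints against fingerprints computed from known open source files. The window size and function names are hypothetical.

import hashlib

def fingerprints(source: str, window: int = 5) -> set:
    """Hash overlapping windows of normalized lines (a toy stand-in for real snippet matching)."""
    # Strip whitespace and drop blank lines so trivial formatting differences don't defeat a match.
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    hashes = set()
    for i in range(max(len(lines) - window + 1, 1)):
        chunk = "\n".join(lines[i:i + window])
        hashes.add(hashlib.sha256(chunk.encode()).hexdigest())
    return hashes

def overlap(generated_code: str, known_oss_code: str) -> float:
    """Fraction of the generated code's fingerprints that also appear in a known OSS file."""
    gen = fingerprints(generated_code)
    oss = fingerprints(known_oss_code)
    return len(gen & oss) / len(gen) if gen else 0.0

Real snippet scanners match against large indexed corpora with far more robust normalization, but the principle is the same: small fragments of AI-generated code can be traced back to licensed sources even when the tool that produced them cannot tell you where they came from.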

It’s also important to bring your license compliance policies (e.g. allowed, flagged, and denied licenses) into your snippet scanning tool. This ensures the tool will create an issue if it detects a license that conflicts with your policy.

Then, from a workflow perspective, if your tool detects a match and the license conflicts with your policy, treat it like any other compliance issue: remove, remediate, rewrite, or relicense.
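As a sketch of what that policy-driven workflow can look like in practice, here is a small Python example; the license identifiers, categories, and suggested actions are hypothetical illustrations, not legal guidance or any specific tool's behavior.

# Hypothetical policy: map detected licenses to a status your organization has chosen.
POLICY = {
    "MIT": "allowed",
    "Apache-2.0": "allowed",
    "GPL-3.0-only": "flagged",
    "AGPL-3.0-only": "denied",
}

def review_snippet_match(detected_license: str) -> str:
    """Translate a snippet-scanner finding into the next compliance step."""
    status = POLICY.get(detected_license, "flagged")  # unknown licenses default to human review
    if status == "denied":
        return f"{detected_license}: block the merge; remove, rewrite, or seek an alternative license"
    if status == "flagged":
        return f"{detected_license}: open an issue for legal/compliance review before shipping"
    return f"{detected_license}: allowed; preserve notices and attribution as the license requires"

Wiring a check like this into CI means an AI-generated snippet gets handled the same way as any other third-party code entering your codebase.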

In hard cases (e.g., an AGPL-licensed snippet), consider rewriting using a clean room process, getting an alternative license via dual licensing, or removing or replacing the code. It's almost never a situation you can't find your way out of; it's a question of how long it will take and how expensive it will be.

A Final Word

Adoption of AI coding assistants is widespread — so widespread that warnings not to use these tools may simply push teams to surreptitiously use the free versions, which won’t offer built-in guardrails.

The defensible path is to choose paid tools, enable guardrails, use snippet scanning, and apply your existing license policies to AI outputs. That won’t eliminate all risk, but it brings the proven software compliance discipline to a new generation of coding tools.
