The software engineering world has been buzzing in recent days following the release of GitHub Copilot — a machine learning-based programming assistant. Copilot aims to help developers work faster and more efficiently by auto-suggesting lines of code and/or entire functions.

However, the matter of how GitHub Copilot generates these suggestions has been the subject of some controversy. In short, Copilot’s machine learning model has been trained on public code. This raises two questions:

  1. Does GitHub need permission from the code copyright owner to train Copilot on that code?
  2. Copilot’s suggestions are based on a massive amount of public code, some of which is covered by strong copyleft licenses. Will using Copilot count as creating a derivative work of the original copyleft-licensed code?

To answer these questions (and to put Copilot in the appropriate legal context), we turned to Kate Downing, an IP lawyer who specializes in helping software companies navigate areas like open source compliance.

October 2022 Update: The license compliance controversy surrounding GitHub Copilot took a recent turn when news surfaced of a potential lawsuit. On Oct. 17, programmer and lawyer Matthew Butterick announced that he and a team of lawyers are looking into filing a lawsuit against Copilot. You can visit their website for more information.

Is GitHub Copilot Committing Copyright Infringement?

Before exploring the license compliance ramifications of GitHub Copilot, we’ll start with a topic that goes to the very heart of whether GitHub can train Copilot on publicly available code without the copyright holder’s permission.

According to Downing, the answer depends to a certain extent on where that code is hosted. If it’s on GitHub, there very clearly would not be copyright infringement.

“If you look at the GitHub Terms of Service, no matter what license you use, you give GitHub the right to host your code and to use your code to improve their products and features,” Downing says. “So with respect to code that’s already on GitHub, I think the answer to the question of copyright infringement is fairly straightforward.”

Things aren’t quite as clear-cut in a scenario where Copilot is trained on code that is hosted outside of GitHub. In that situation, the copyright infringement question would hinge largely on the concept of fair use.

“If Copilot is being trained on code outside of GitHub, we accept that at least some of what they’re looking at is copyrightable work,” Downing says. “So, the question then becomes if it’s fair use. Now, you ultimately can’t conclude definitively that something is fair use until you go to court and a judge agrees with your assessment. But I think there’s a strong case to be made that Copilot’s use of code is very transformative — a point which would favor the fair use argument.

“There is precedent for this sort of situation. Take the case of Google Books, for example. Google scanned millions of books, provided people who were doing research with the ability to search the book, and provided the user a small snippet of the text that the user was searching for in the book itself. The court did in fact find that was fair use. The use was very transformative. It allowed people to search millions of books. It didn’t substitute for the book itself. It didn’t really take away anything from the copyright holders; in fact, it made it easier for readers to access the work and actually opened a broader market for book authors. And, it was a huge value add on top of the copyrighted corpus.

“I think those arguments are really strong and applicable to the Copilot example, but like I said, nothing is really fair use until a court decides that’s the case. So, despite the fact that we do have precedent in that direction, it remains an open question.”

To summarize:

  • Downing does not think GitHub is committing copyright infringement by training Copilot on code hosted on GitHub.
  • For code not hosted on GitHub (and thus not governed by GitHub’s terms of service): Downing thinks there’s a strong case that Copilot uses that code in a transformative manner, which would support a fair use argument that there is no copyright infringement. Ultimately, however, we can’t be confident one way or the other until the matter is settled in a court of law.

What About GitHub Copilot and License Compliance?

As we mentioned, GitHub trains Copilot on numerous pieces of public code, many of which are covered by strong copyleft licenses (e.g., GPL v2, GPL v3). Copyleft licenses require derivative works of the copyleft-licensed code to carry the same license as the original code.

In other words, if an organization builds and distributes a piece of software (let’s call it “Jim’s Product”) that includes code licensed under GPL v3, “Jim’s Product” will likely also need to be licensed under GPL v3 (there are a handful of exceptions). That can be a problem for an organization that wants to keep its source code proprietary, because the GPL requires source code to be made available to recipients.

The question, then, is whether GitHub Copilot users will inadvertently be creating derivative works of copyleft-licensed code. As Downing sees it, the answer to the derivative work question depends on the precise nature of Copilot’s suggestions.

“This is really case-specific and remains to be seen,” Downing says. “A lot depends on the thoroughness and the length of Copilot’s suggestions. The more complex and lengthy the suggestion, the more likely it has some sort of copyrightable expression. If a suggestion is short enough, the fact that it repeats something in someone else’s code may not make it copyrightable expression.

“This is the case because of the way programming languages work. In certain languages, there are specific class names and there are specific function names. There are a lot of pieces that get reused throughout code, almost like LEGO blocks. So if the suggestion is fairly small, it probably doesn’t have any copyrightable expression in it in the first place; the suggestion may be purely functional (i.e., this is the only way to do x in this language).

“There’s also the question of whether what’s being produced is actually a copy of what’s in the corpus. That’s a little unclear right now. GitHub reports that Copilot is mostly producing brand-new material, only regurgitating copies of learned code 0.1% of the time. But, we have seen certain examples online of the suggestions and those suggestions include fairly large amounts of code and code that clearly is being copied because it even includes comments from the original source code.”

To summarize:

  • Copilot is far from a finished product, and the complexity, length, and thoroughness of its code suggestions seem to vary.
  • For this reason — and the fact that code suggestions must be sufficiently original to meet the standard of copyrightable expression — it’s difficult to assess with confidence whether using Copilot would result in the creation of a derivative work.

A Practical Take on Copilot and the Law

Because GitHub Copilot is just a few weeks old, there are still more questions than answers about potential copyright infringement and license compliance. With that said, Downing suggests Copilot users take a few simple precautions to guard against potential license compliance issues.

“I’d caution anyone using Copilot right now to help write code to pay close attention to the nature of Copilot’s suggestions,” Downing says. “To the extent you see a piece of suggested code that’s very clearly regurgitated from another source — perhaps it still has comments attached to it, for example — use your common sense and don’t use those kinds of suggestions.”
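For teams that want a rough, automated nudge toward that kind of common-sense review, the check Downing describes can be approximated with a crude heuristic. The sketch below is only an illustration: the function names, the marker list, and the comment-line threshold are all assumptions made for this example, not anything provided by GitHub or recommended by Downing, and a match (or a miss) says nothing about whether a suggestion is legally a copy.

```python
import re

# Hypothetical markers that often appear in copied file headers.
# This list is an assumption for illustration; tune it for your own needs.
LICENSE_MARKERS = re.compile(
    r"copyright|licen[sc]e|all rights reserved|spdx-license-identifier",
    re.IGNORECASE,
)

# Common comment prefixes. This is a crude test: it ignores trailing
# comments and does not understand any language's real syntax.
COMMENT_PREFIXES = ("# ", "//", "/*", "* ")


def comment_lines(suggestion: str) -> list[str]:
    """Return the lines that begin with a common comment prefix."""
    stripped = (line.strip() for line in suggestion.splitlines())
    return [line for line in stripped if line.startswith(COMMENT_PREFIXES)]


def needs_manual_review(suggestion: str, max_comment_lines: int = 2) -> bool:
    """Flag a suggestion that carries license text or many comment lines.

    The threshold is an arbitrary assumption, not a legal standard.
    """
    comments = comment_lines(suggestion)
    if any(LICENSE_MARKERS.search(line) for line in comments):
        return True
    return len(comments) > max_comment_lines


if __name__ == "__main__":
    suggestion = "// Copyright (C) 1999 Example Corp\nfloat f(float x) { return x; }"
    if needs_manual_review(suggestion):
        print("Suggestion carries comments or license text; review before accepting.")
```

A flag from a check like this is a prompt for a human to look more closely, not a verdict. Short suggestions with no comments can still be copied, and heavily commented ones can be original.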
