Celebrated Writers File Copyright Lawsuit Against AI
Whether you're a Luddite who's never used ChatGPT or you revere modern artificial intelligence like His Holiness the DALL-E LLaMA, the past year has taught us that AI is powerful and here to stay. As with any technology that grows faster than regulators can keep up with, it has left a lot of legal questions in its wake.
It started with questions about copyright in the visual arts during the "AI-art gold rush," when Christie's auction house sold a paintbot-generated print for almost half a million dollars. Then, attorneys in the music industry were up in arms when AI platforms were used to create "original" songs that seemed to be ripped off from actual musicians such as Drake.
It wasn't just the creation of new art too similar to an existing artist's work that copyright holders found problematic, though. It was the use of any part of their copyrighted works in any way – notably, for training the AI. Comedian Sarah Silverman, for example, took a couple of generative AI companies to court back in July. She objected to the companies using her copyrighted works to train their machine learning models in the first place.
Now, the literary industry is following [law]suit.
Renowned Writers Rally
Michael Chabon is a beloved writer who's won the Pulitzer, Hugo, and Nebula awards for his witty fiction. Another celebrated writer is David Henry Hwang, the playwright who has centered many of his plots on contemporary Asian American identity. If you had the privilege of hitting up Broadway in 2017, you might have caught the revival of his prized play, M. Butterfly. How about Matthew Klam, Rachel Louise Snyder, and Ayelet Waldman? These creatives have spent years writing and refining their craft, earning plenty of popular and critical acclaim for it.
What do all of these authors have in common, other than that two of them are married to each other? They are the latest to join the copyright tidal wave that's been surging against AI since the spring. These writers have joined together as plaintiffs in two separate but similar lawsuits against OpenAI (creator of ChatGPT) and Meta (the company that owns Facebook and Instagram, as well as the language model LLaMA). Most of you will be familiar with ChatGPT by now, at least in passing. You may be less familiar with LLaMA. We'll give you a quick run-down of the relevant technology behind these chatty cathode-powered Cathys.
The Bots' Bios
ChatGPT (Chat Generative Pre-trained Transformer) is the large language model-based chatbot from OpenAI. Released at the end of last November, it gave the world a holiday season full of playing with our new Christmas present. But Mark Zuckerberg, not wanting to be left out of the party, had already begun developing Meta's answer to this shiny new toy. In February, Meta released LLaMA, which stands for the rather redundantly named "Large Language Model Meta AI" (unfortunately, Zuckerberg did not follow his tech brothers in leaving forced acronyms well enough alone). Meta also released LLaMA 2 under a community license in July.
A large language model (or LLM, not to be confused with the law degree some of you may have) is an artificial intelligence program that can generate "convincingly natural" responses in text based on prompts made by the human user. OpenAI's LLMs are the different models that ChatGPT itself runs on, such as the currently free GPT-3.5 and the premium GPT-4, released in March.
So LLaMA, while not a chatbot itself, is similar to OpenAI's GPT models, with some key differences. For one, LLaMA is much "smaller" than other LLMs in that it operates on fewer parameters (depending on the model, up to 70 billion parameters). Compare that with the 1.7 trillion parameters reported for GPT-4, a figure OpenAI has not confirmed. While more parameters can generate a more sophisticated response, a leaner system like LLaMA is more efficient and less resource-intensive, which is what it was designed for.
And LLaMA is crucially different in one other regard: where it sources its data from. LLaMA sources its learning from a broader range of primary sources than, say, ChatGPT. For example, LLaMA will include news articles and scientific articles, whereas ChatGPT will mostly use internet-based text, like posts from social media and websites. Overall, the consensus seems to be that LLaMA is more accurate for researching facts, while ChatGPT can produce more sophisticated language and is more creative.
That might be way more tech talk than you signed up for, but it's these kinds of details that will make or break the lawsuits here. The way the chatbots were built, particularly the data on which they were trained, is crucial to the legal inquiry.
Dubious Databases
What are the primary sources that these chatbots are "learning" from, and are they under copyright? It's hard to say.
At first, things seemed to be less worrisome. In its prototypical version of ChatGPT, OpenAI revealed that the bot was trained on, in addition to web pages, a "BookCorpus" of more than 7,000 unpublished books from different genres. These books were, at the time, freely available on the website of an "indie book distributor" called Smashwords.com. Similarly, Meta revealed that about 9% of its data set was sourced from Project Gutenberg and ThePile, collections of books that are in the public domain and available for free.
In other words, both platforms held themselves out as mining their data legally, from copyright-free sources. Whether or not that's true is debatable, but that doesn't matter for purposes of these lawsuits, since the plaintiffs' works weren't on there. These authors' works were copyrighted, so they wouldn't be in any of the free collections that OpenAI and Meta claimed they used. "Then what's the problem?" you might ask.
The problem is that the plaintiffs don't buy it. According to them, from their own digging, the AI creators' explanations don't hold water. OpenAI and Meta both reveal only some information about where a portion of their data comes from, and they're conveniently disclosing the public sources. But their data sets are a lot larger than what could be extracted from the free sources cited.
Plaintiffs suspect, based on the size of the overall data sets, that they are derived in large part from "shadow libraries." These are sites like Library Genesis or Bibliotik, which host enormous collections of pirated (i.e., published and copyrighted) books and other texts, including the works written by the writers bringing the lawsuits. These writers never gave their permission for such sites to offer their works for download without payment in the first place, so these shadow libraries are illegal – at least in the U.S. The only reason they're still up and running on the internet is that they're hosted on servers in other countries, such as Russia, where IP law is not always enforced.
The gist of the legal theory against the AI companies is that if they're using these shadow libraries to train their bots, then the bots are built on copyright infringement. When the bots use illegally obtained material not only to train themselves but also as a source to generate new content from, it gets even more complicated. In any event, plaintiffs argue, somewhere in all of this is a copywrong.
The Copyright Claims: Not Quite in the Can
Many saw these kinds of copyright issues coming months ago. No doubt this kind of quasi-pirating will rub some of us the wrong way, especially devoted readers of the plaintiffs (the writer of this blog happens to be a big fan of Chabon's The Amazing Adventures of Kavalier & Clay). At an ethical level, most of us don't want to see our favorite authors ripped off. But when it comes to copyright claims, do they have a legal leg to stand on?
In a previous blog post, before all these lawsuits were brought, FindLaw investigated the question of whether someone can sue an AI company for copyright infringement. It was a question courts hadn't had to deal with before. The answer wasn't totally clear then, and it's not much clearer now. What is clear is that with creatives across just about every industry starting to bring lawsuits, this question will soon have to be addressed formally by the federal C-suite of Congress, the courts, and the Copyright Office.
Related Resources:
- Legal (and Moral) Issues in AI-Generated Content (FindLaw's Law and Daily Life)
- Copyrights (FindLaw's Learn About the Law)
- Silverman Goes for the Gold (FindLaw's Practice of Law)