Opinion: The Copyright Office is making a mistake on AI-generated art

greenskye@beehaw.org · 1 year ago

Beej Jorgensen@lemmy.sdf.org · 1 year ago

I think this nails it. It’s probably the attack authors will use against OpenAI.

But the copyright office clearly states otherwise, so we’re in for a showdown.

Personally, I think the AI stuff seems more akin to writing a book in the style of another author, which is completely legal. And, to be clear, my opinion has no legal effect here whatsoever. 😅

FlowVoid@midwest.social · edit-2 1 year ago

There are two separate issues here. First, can you copyright art that is completely AI-generated? The answer is no. So OpenAI cannot claim a copyright for its output, no matter how it was trained.

The other issue is whether OpenAI violated a copyright. It’s true that if you write a book in the style of another author, then you aren’t violating copyright. And the same is true of OpenAI.

But that’s not really what the OpenAI lawsuit alleges. The issue is not what it produces today, but how it was originally trained. The authors point out that in the process of training OpenAI’s models, the developers illegally downloaded their works. You can’t illegally download copyrighted material, period. It doesn’t matter what you do with it afterwards. And AI developers don’t get a free pass.

Illegally downloading copyrighted books for pleasure reading is illegal. Illegally downloading copyrighted books for training an AI is equally illegal.

Even_Adder@lemmy.dbzer0.com · 1 year ago

You actually can. I recommend reading this article by Kit Walsh, a senior staff attorney at the EFF, if you haven’t already. The EFF is a digital rights group that most recently won a historic case: border guards now need a warrant to search your phone.

Here’s an excerpt:

Like copying to create search engines or other analytical uses, downloading images to analyze and index them in service of creating new, noninfringing images is very likely to be fair use. When an act potentially implicates copyright but is a necessary step in enabling noninfringing uses, it frequently qualifies as a fair use itself. After all, the right to make a noninfringing use of a work is only meaningful if you are also permitted to perform the steps that lead up to that use. Thus, as both an intermediate use and an analytical use, scraping is not likely to violate copyright law.

The article I linked is about image generation, but this part about scraping applies here as well. Copyright forbids a lot of things, but it also allows much more than people think. Fair use is vital to protecting creativity, innovation, and our freedom of expression. We shouldn’t be trying to weaken it.

You should also read this open letter by artists that have been using generative AI for years, some for decades. I’d like to hear your thoughts.

FlowVoid@midwest.social · edit-2 1 year ago

When determining whether something is fair use, the key questions are often whether the use of the work (a) is commercial, or (b) may substitute for the original work. The amount of the work copied is also considered.

Search engine scrapers are fair use, because they only copy a snippet of a work and a search result cannot substitute for the work itself. Likewise, copying an excerpt of a movie in order to critique it is fair use, because consumers don’t watch reviews as a substitute for watching movies.

On the other hand, OpenAI is accused of copying entire works, and its product is explicitly intended as a replacement for hiring actual writers. I think it is unlikely to be considered fair use.

And in practice, fair use is not easy to establish.

Even_Adder@lemmy.dbzer0.com · 1 year ago

You should know that the statistical models don’t contain copies of their training data. During training, the data is used just to give a bump to the numbers in the model. This is all in service of getting LLMs to generate cohesive text that is original and doesn’t occur in their training sets. It’s also very hard if not impossible to get them to quote back copyrighted source material to you verbatim. If they’re going with the copying angle, this is going to be an uphill battle for them.

FlowVoid@midwest.social · edit-2 1 year ago

I know the model doesn’t contain a copy of the training data, but it doesn’t matter.

If the copyrighted data is downloaded at any point during training, that’s an IP violation, even if it is immediately deleted after being processed by the model.

As an analogy, if you illegally download a Disney movie, watch it, write a movie review, and then delete the file … then you still violated copyright. The movie review doesn’t contain the Disney movie and your computer no longer has a copy of the Disney movie. But at one point it did, and that’s all that matters.

Even_Adder@lemmy.dbzer0.com · 1 year ago

Read the article I linked; it goes over this.

FlowVoid@midwest.social · 1 year ago

No, it doesn’t.

It defends web scraping (downloading copyrighted works) as legal if necessary for fair use. But fair use is not a foregone conclusion.

In fact, there was a recent case in which a company was sued for scraping images and texts from Facebook users. Their goal was to analyze them and create a database of advertising trackers, in competition with Facebook. The case settled, but not before the judge noted that the web scraping was not fair use and was very likely infringing IP.

Even_Adder@lemmy.dbzer0.com · 1 year ago

The whole thing hinges on whether this is fair use or not, so, yes, it does.

Beej Jorgensen@lemmy.sdf.org · 1 year ago

I’m happy with the illegal downloading being illegal. Where things get murky for me is what algorithms you’re allowed to use on the data.

I get the impression that if they’d bought all the books legally, the lawsuit would still be happening.

FlowVoid@midwest.social · 1 year ago

If they bought physical books then the lawsuit might happen, but it would be much harder to win.

If they bought e-books, then it might not have helped the AI developers. When you buy an e-book you are just buying a license, and the license might restrict what you can do with the text. If an e-book license prohibits AI training (and such licenses will in the future, if they don’t already), then buying the e-book makes no difference.

Anyway, I expect that in the future publishers will make sets of curated data available for AI developers who are willing to pay. Authors who want to participate will get royalties, and developers will have a clear license to use the data they paid for.