Generative AI Could Leave Users Holding the Bag for Copyright Violations

Yves here. Many experts have raised liability issues with tech standing in for humans, such as self-driving cars and AIs making decisions that have consequences, like denying pre-authorizations for medical procedures. But a potentially bigger (in aggregate) and more pervasive risk is users, as in any user, being exposed to copyright violations via the AI having made meaningful use of a training set that included copyrighted material. Most of what passes for knowledge is copyrighted. For instance, you have a copyright interest in the e-mails you send. This is not an idle issue; we have contacts who publish a small but prestigious online publication who got in a dustup about how another site misrepresented their work. Things got ugly to the degree that lawyers got involved. My colleagues very much wanted to post e-mails from earlier exchanges, which undermined later claims made by the counterparty, but were advised strongly not to.

By Anjana Susarla, Professor of Information Systems, Michigan State University. Originally published at The Conversation

Generative artificial intelligence has been hailed for its potential to transform creativity, especially by lowering the barriers to content creation. While the creative potential of generative AI tools has often been highlighted, the popularity of these tools poses questions about intellectual property and copyright protection.

Generative AI tools such as ChatGPT are powered by foundational AI models, or AI models trained on vast quantities of data. Generative AI is trained on billions of pieces of data taken from text or images scraped from the internet.

Generative AI uses very powerful machine learning methods such as deep learning and transfer learning on such vast repositories of data to understand the relationships among those pieces of data – for instance, which words tend to follow other words. This allows generative AI to perform a broad range of tasks that can mimic cognition and reasoning.
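
As a toy illustration of "which words tend to follow other words," the sketch below counts word pairs in a tiny made-up corpus and returns the most frequent follower of a given word. It is a deliberately simplified stand-in, not how ChatGPT or any production model works; real systems use deep neural networks trained on vastly more data. The function names and the sample sentence are invented for this example.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def most_likely_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical miniature "training set"
corpus = "the cat sat on the mat and the cat slept"
model = train_bigrams(corpus)
print(most_likely_next(model, "the"))
```

Even this crude counter can only echo patterns present in its training text, which is why what goes into the training data matters for what comes out.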

One problem is that output from an AI tool can be very similar to copyright-protected materials. Leaving aside how generative models are trained, the challenge that widespread use of generative AI poses is how individuals and companies could be held liable when generative AI outputs infringe on copyright protections.

When Prompts Result in Copyright Violations

Researchers and journalists have raised the possibility that through selective prompting strategies, people can end up creating text, images or video that violates copyright law. Typically, generative AI tools output an image, text or video but do not provide any warning about potential infringement. This raises the question of how to ensure that users of generative AI tools do not unknowingly end up infringing copyright protection.

The legal argument advanced by generative AI companies is that AI trained on copyrighted works is not an infringement of copyright since these models are not copying the training data; rather, they are designed to learn the associations between the elements of writings and images like words and pixels. AI companies, including Stability AI, maker of image generator Stable Diffusion, contend that output images provided in response to a particular text prompt are not likely to be a close match for any specific image in the training data.

Builders of generative AI tools have argued that prompts do not reproduce the training data, which should protect them from claims of copyright violation. Some audit studies have shown, though, that end users of generative AI can issue prompts that result in copyright violations by producing works that closely resemble copyright-protected content.

Establishing infringement requires detecting a close resemblance between expressive elements of a stylistically similar work and original expression in particular works by that artist. Researchers have shown that methods such as training data extraction attacks, which involve selective prompting strategies, and extractable memorization, which tricks generative AI systems into revealing training data, can recover individual training examples ranging from photographs of individuals to trademarked company logos.

Audit studies such as the one conducted by computer scientist Gary Marcus and artist Reid Southen provide several examples where there can be little ambiguity about the degree to which visual generative AI models produce images that infringe on copyright protection. The New York Times provided a similar comparison of images showing how generative AI tools can violate copyright protection.

How to Build Guardrails

Legal scholars have dubbed the challenge of building guardrails against copyright infringement into AI tools the “Snoopy problem.” The more a copyrighted work protects a likeness – for example, the cartoon character Snoopy – the more likely it is that a generative AI tool will copy it, compared with copying a specific image.

Researchers in computer vision have long grappled with the issue of how to detect copyright infringement, such as logos that are counterfeited or images that are protected by patents. Researchers have also examined how logo detection can help identify counterfeit products. These methods can be helpful in detecting violations of copyright. Methods to establish content provenance and authenticity could be helpful as well.

With respect to model training, AI researchers have suggested methods for making generative AI models unlearn copyrighted data. Some AI companies such as Anthropic have announced pledges to not use data produced by their customers to train advanced models such as Anthropic’s large language model Claude. Methods for AI safety such as red teaming – attempts to force AI tools to misbehave – or ensuring that the model training process reduces the similarity between the outputs of generative AI and copyrighted material may help as well.

Role for Regulation

Human creators know to decline requests to produce content that violates copyright. Can AI companies build similar guardrails into generative AI?

There are no established approaches to building such guardrails into generative AI, nor are there any public tools or databases that users can consult to establish copyright infringement. Even if tools like these were available, they could put an excessive burden on both users and content providers.

Given that naive users can’t be expected to learn and follow best practices to avoid infringing copyrighted material, there are roles for policymakers and regulation. It may take a combination of legal and regulatory guidelines to ensure best practices for copyright safety.

For example, companies that build generative AI models could use filtering or restrict model outputs to limit copyright infringement. Similarly, regulatory intervention may be necessary to ensure that builders of generative AI models build datasets and train models in ways that reduce the risk that the output of their products infringes creators’ copyrights.
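
One way to picture output filtering is a check that withholds generated text when too much of it overlaps word-for-word with a known protected work. The sketch below is a hypothetical illustration under strong assumptions: the function names, the five-word window and the 0.5 threshold are all invented here, no vendor is known to use this exact approach, and verbatim overlap is not a legal test for infringement.

```python
def word_ngrams(text, n):
    """Return the set of n-word sequences appearing in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(output, protected, n=5):
    """Fraction of the output's n-grams that also appear in the protected text."""
    out = word_ngrams(output, n)
    if not out:
        return 0.0
    return len(out & word_ngrams(protected, n)) / len(out)


def should_block(output, protected_texts, n=5, threshold=0.5):
    """Flag the output if it overlaps heavily with any protected text.

    The threshold is an arbitrary illustrative value, not a legal standard.
    """
    return any(overlap_ratio(output, p, n) >= threshold for p in protected_texts)


# Hypothetical reference set standing in for a database of protected works
protected_texts = ["it was the best of times it was the worst of times"]

copied = "it was the best of times it was the worst of times"
fresh = "a completely different sentence about the weather in spring today"
print(should_block(copied, protected_texts), should_block(fresh, protected_texts))
```

A filter like this catches only near-verbatim text. It would miss paraphrases, images and character likenesses of the kind the “Snoopy problem” describes, which is part of why such guardrails are hard to build.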



    1. The Rev Kev

      Maybe because all too often it will just make up sources mentioned in any footnotes and bibliographies. I think that I read a coupla days ago how a judge gave a lawyer a slap on the wrist as two points of law in his case were just made up ones by an AI. You cannot trust sources listed by an AI but must see if they actually exist. If a human had done that they would probably have been charged with contempt of court.

      1. GramSci

        The Brave search engine’s chatbot supplies footnotes that link back to a matching text. In my anecdotal experience (no RCTs) it supplies valid links, perhaps because it has been trained on a small corpus, largely Wikipedia.

      2. vao

        I have read cases of physicians and scientists delighted to see Chat-GPT returning article references in their field of interest that they were unaware of — and later, after fruitless searches, realizing they had been entirely concocted by AI. The titles were believable, the authors existed and worked in the area, the journals were actual ones, reputable, and in scope, the dates of publication appropriate — but the issues did not exist, the authors had never worked together, the articles were pure invention, etc.

        Every time one prompts those generative AI to produce something corresponding to some specification, one should always keep in mind that the output is “something that could look like” what one asked for — i.e. it is a plausible estimate of what the end-user is looking for, it is definitely not knowledge or a higher-level abstraction of the data ingested by the AI system. At best, the system will return what it has been trained on, if there is an exact match.


        The summary of a book by Chat-GPT is not a summary, but what a summary would reasonably look like.

        A diagnostic is not a diagnostic, but a plausible rendition of a diagnostic given the symptoms.

        A code snippet is not genuine source code, but what a program could look like given the prompt.

        Bibliographic references are not true references, but a credible presentation of a reference for the topic of choice.

        The information given by the customer support is not actual support (with diagnostic and proper steps to solve a problem), but an interpretation of what such advice could look like given the customer query.

        The picture of a horse-riding cosmonaut in the style of Picasso is not a painting by Picasso, but what such a painting could possibly look like.

        An audio of Trump singing “La donna è mobile” is not a recording, but what it might, possibly, hypothetically, sound like.

        To make a comparison, one does not expect a statistical estimator of ice-cream sales to return the actual figures, but a reasonably accurate approximation (including trend and seasonal variations) for them given the historical data to train the MLE, ARIMA, or whatever regression technique is used. Nobody is aghast and says the estimator “hallucinates” when it does not give the exact value for a past observation.

        What I find disquieting is that people seriously think that the output of those newfangled AI systems actually represents knowledge based on underlying reasoning, steps of induction/deduction/abduction, abstraction and imagination.

        Those generative AI are just highly multidimensional, inscrutable, exotic regression techniques. Except that none of those systems ever gives you the confidence interval for what it computes…

        1. Acacia

          Yep. All of which just makes it clear that the people who are developing this tech aren’t actually serious about knowledge at all.

          1. vao

            The really scary thing is that the people using that tech do not seem to care about knowledge, either.

            1. Jeff Farkas

              Amen. Those folks don’t understand the difference between a sorting algorithm in a data base application and comprehending the implications of the data sort. But, everyone still gets an “A” in the course.

      3. Keith

        Like Ted Chiang says, these models and the underlying technology essentially generate a compressed, or “lossy,” version of a chunk of internet text. If what one receives is akin to a “blurry JPEG of all the text on the internet,” source citations are not to be trusted and as Rev Kev suggests, will be improvised when push comes to shove.

        1. Grebo

          These models supposedly train on Terabytes or Petabytes of data but take up only Gigabytes. If they can faithfully or even approximately reproduce all that input it would be a revolution in data compression, never mind AI.

          I suspect that those things which can be recognisably recovered were highly generic to begin with.

          1. Keith

            Yes, all correct! The big players will train these models on terabytes of data which basically compresses down to a few hundred gigabytes. Your compression ratios are where the great ML teams separate themselves from the mediocre ML teams using a variety of techniques that are changing daily.

            It’s a weird roundabout way to compress data (knowledge?) and really one of the things that are pretty remarkable about this tech.

  1. Keith

    In the realm of coding and programming, this issue presents a significant dilemma. Consider a scenario in which an individual dedicates hundreds of hours to developing open-source software on GitHub. Subsequently, a corporation such as Microsoft, which owns the platform, might utilize this data to develop a coding assistant service, such as Copilot, charging $20 per month. Although the software was shared with the world under an MIT license, the intent was not to provide material for a cloud-based monopoly to exploit as part of a paid service. This raises the question of whether such actions constitute copyright infringement, which remains ambiguous.

    Furthermore, the justification that language models leverage open-source code merely to “learn coding” on their own is not convincing. These coding assistants do not possess an inherent understanding of coding; instead, they regurgitate their stolen training data and often struggle with tasks that include even a slight degree of complexity. Therefore, it appears they are essentially appropriating your code and repurposing it into responses for a subscription fee.

    1. The Rev Kev

      Sounds like another variation of an exploitation of the commons. In this case open-source software is the commons which Microsoft exploited for their own use and then monetized. It is just theft of a common good.

    2. Piotr Berman

      I guess it is within reasonable expectations. So far, people, EVEN PROGRAMMERS, have to do some thinking. ChatGPT is much faster than using thick books or Google searches in answering questions about scripting languages or data bases and so on, but at the end of the day, AI is not intelligence.

      When writing social network posts, articles etc., you need to have your own sense of style too, check the references on your own, as in the “dark age before generative AI”.

    3. BrooklinBridge

      The potential problem I see isn’t just infringement, but the likelihood/certainty that the current web sites that already provide free how-to info, usually from software developers, for a remarkable range of problems from the most general to remarkably specific tasks will wither on the vine and be no longer available when AI pollutes the well with answers and then dries up itself in legal battles or where some “big player” decides to monetize it, say into paying subscriptions, for even the most trivial information. Bad bears!

      The one advantage of AI that I’ve been told by at least one developer, is that AI comes up with a good answer almost instantly whereas searching web sites can take a little time (often remarkably little but still). So as usual, we will follow the pied piper of convenience till suddenly there is no alternative and the piper suddenly has a sign saying, “CASH ONLY”…

      1. BrooklinBridge

        Note, AI can (and does) offer entire code blocks to do exactly what you want and it will even offer to plug them in for you… The potential problem with that (basically the point of this article) should be obvious. I’m assuming a non trivial risk that down the road one may have to pay big time for copyright infringement in such cases.

  2. Acacia

    As Lambert might say: “there’s that word ‘guardrails’ again.”

    Given that naive users can’t be expected to learn and follow best practices to avoid infringing copyrighted material, there are roles for policymakers and regulation.

    Depends how we define “naive users”. In the case of students, actually they ARE absolutely expected to learn and follow guidelines to avoid infringing copyrighted material. In the field of education, there are well-established policies and regulation for this: academic honesty and punishment for academic misconduct.

    The most typical form of academic misconduct is plagiarism, and it doesn’t matter if it’s intentional or not. That’s what the practices of academic writing are about: if you “unintentionally” plagiarize, it’s because you f*cked up and didn’t follow the correct practices of academic writing. Not excusable.

    So if a college student uses generative AI to “help” with writing and pastes the output into his/her essay, following the guidelines of academic honesty, that text should be cited as coming from a source, i.e., ChatGPT. Failure to use quotation marks and attribute the source is plagiarism.

    If the school or instructor has a “NO generative AI” policy, of course some students may still try to cheat, hoping the plagiarism won’t be detected, but given the chances that generative AI may infringe copyright, it becomes more akin to “do you feel lucky… ?”

  3. Bsn

    Well, like so many discussions about tech and its intrusions, illegalities and abuses, there are lots of “could” in the article. Essentially we “could” “if” we, etc. As they say, “ain’t gonna happen”. There are very few examples of tech being given guardrails or limits. It is killing off many professions, one by one (and faster and faster). Remember music? When’s the last time you bought music? Musicians are getting ripped off by tech via Spitify, Yoo Hoo Toob, etc. to cite one example. I am beginning to believe that tech will be the death of us all (as a recent article in NC proposed – can’t find its link).

  4. JustTheFacts

    A lot of tech these days is based on magical thinking that there’s a free lunch: One can use other people’s copyrighted works to build the manifold underlying the language model by learning to correctly predict what the next word in a copyrighted text is given the previous words, but yet, magically, not violate copyright, and claim “it’s fair use”. Yes, there’s generalization, so it’s not exactly lossy compression, but you’re skating pretty close to it.
