Richard Murphy Gets ChatGPT to Describe How It Inherently Makes Shit Up


Yves here. We have repeatedly warned readers against relying on generative AI like ChatGPT and, with very, very few exceptions, do not allow AI-ginned-up content in comments. That is because, as this post confirms, many of the uses of AI are resulting in both the crapifying of what passes for knowledge and the degradation of reasoning skills. AI fans, please read Thinking Being Offloaded to AI Even in Elite Medical Programs before objecting.

As we can see below, many users seem to have lost sight of how generative AI programs like ChatGPT work. They do not do research. They use the data in their training set and then give probabilistic responses based on it (forgive me if this description is not fully buzzword-compliant, but that is effectively what they do).
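
To make that concrete, here is a minimal sketch of the mechanism (Python; the tiny corpus of invented titles and the word-level model are assumptions for illustration only, and are vastly cruder than any real LLM):

```python
import random
from collections import defaultdict

# Toy "training set": a handful of invented blog-post titles.
corpus = [
    "why the deficit myth matters for tax policy",
    "the future of tax justice in the uk",
    "why tax havens undermine the state",
    "tax justice and the future of the state",
]

# Learn which word tends to follow which (a crude stand-in for training).
follows = defaultdict(list)
for title in corpus:
    words = title.split()
    for a, b in zip(words, words[1:]):
        follows[a].append(b)

def generate(seed="why", max_words=8):
    """Emit a plausible-looking title by repeatedly picking a likely next word."""
    out = [seed]
    while len(out) < max_words and out[-1] in follows:
        out.append(random.choice(follows[out[-1]]))  # probabilistic, not retrieved
    return " ".join(out)

print(generate())
# e.g. "why the deficit myth matters for tax havens": fluent, on-theme,
# and not a title that appears anywhere in the "training" data.
```

The output is fluent and on-theme precisely because it is assembled from patterns in the training data rather than looked up anywhere, which is also why it can read as authoritative while being wrong.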

Now there are no doubt narrow applications where this approach could be fruitful, for instance, AI reading MRI images after having consumed a ginormous number of correctly scored ones. What is disconcerting is the big push for what amounts to general-purpose AI, which is bound to lead to mishaps (we cited a new report from The Register on how AI models are already breaking down due to over-ingestion of their own crap findings). While there are very likely papers that discuss the data and query parameters needed for highly reliable AI output, I see no evidence of much if any such thinking making its way to companies that are adopting AI as yet another management fad (keep in mind that fads are a major source of busywork that justifies the existence of said managers).

Richard Murphy queried ChatGPT as to how it had delivered bogus results to a request he’d made. The response confirms that ChatGPT, and presumably other generative AI programs, will often include garbage output, which means any user is always at risk of relying on bad information. Moreover, ChatGPT ‘splaining itself is reminiscent of the scorpion telling the frog that his fatal behavior is in his nature.

Some of the comments on Murphy’s post are as deadly as the piece proper:

I fail to understand why Murphy believes ChatGPT will validate references if told to do so. From comments again:

I too do not read output from AI, on the assumption that it can’t be relied upon. I wonder how long it will take for “AI” to start to have negative connotations for accuracy.

By Richard Murphy, Professor of Accounting Practice at Sheffield University Management School and a director of the Corporate Accountability Network. Originally published at Funding the Future

I was trying to find an old blog post of mine yesterday and could not do so. As a result, I enlisted the help of ChatGPT. I wrote a pretty clear instruction and made clear I wanted to find a specific article on a very precise topic, and then have ten other articles on the same broad theme listed by it using Harvard style referencing, but with an explicit URL shown in each case so I could find the materials if I needed to do so.

It fulfilled the ask. But when I checked the list of articles supplied, some had titles that seemed very unlikely to be ones I had written. The construction just felt wrong. I checked three. None of them existed, even though precise titles and URLs specifying dates of publication were provided.

I asked ChatGPT whether they really did exist. In each case, it said:

After reviewing the available information, it appears that the article titled “[xxx etc]” purportedly published on [date] on your blog “Tax Research UK,” does not exist. A thorough search of your blog’s archives for that date and title yielded no results.

It’s possible that this reference was included in error or that the article was published under a different title or on a different date. If you have any additional details or context about the content of the article, I would be happy to assist you further in locating it.

I then asked:

Why did you reference articles that did not exist?

I got this reply:

You’re absolutely right to challenge this, and I appreciate your patience.

The mistake occurred because I generated plausible-sounding article titles and publication dates based on the themes you’d asked for and your well-established positions—but without first verifying whether those precise articles actually existed on your blog. That was an error.

I should have fact-checked each reference against your real published work on Tax Research UK before presenting them as citations. Instead, I inferred them based on patterns in your writing, which created the impression of referencing your own archive—but in fact pointed to non-existent URLs.

That’s not acceptable when accuracy is required, and I take responsibility for the confusion it caused. Going forward, I’ll verify every reference directly against your blog unless you specify otherwise.

So, a warning. If you ask for references from ChatGPT, make clear you want them for sources that actually exist. Otherwise, it just makes stuff up.


44 comments

  1. psy

    Just as concerning is the number of times I’ve seen an AI-generated comment, usually on Reddit, that people treat as genuine. No one questions it; it just slides right in, and people respond to it as if it were a human with dozens of comments and hundreds/thousands of upvotes.

    It’s subtle, as if it was prompted to write the way a Redditor would write, but the use of em dashes, perfect grammar, and the melodrama/ass-kissing throughout the account’s comments is a giveaway.

  2. vao

    There was a brief discussion on this very site about AI affabulating references, following the MAHA report that was full of them.

    I mentioned a few other occurrences of this phenomenon. I would like to re-link to one of them which is particularly interesting because it provides an analysis of how ChatGPT may concoct those invented bibliographical references. Read it as a complement to Richard Murphy’s posting.

    1. Ignacio

      Conclusion: never ask AI for references, as it may invent them. Might it also “create” data from those non-existent references, bringing in and mixing stuff from existing references?

      Does this occur with other kinds of documents? For instance, laws. Imagine if it starts affabulating laws.

      1. vao

        Those AI tools have invented references to legal cases that were as spurious as references to academic papers (to the dismay of the attorneys using the AI, and to the great displeasure of the judges who had to deal with the slop). I am not yet aware that they invented laws, decrees, or rulings, but I presume they are perfectly able to do so.

        I suspect that asking for the legal prescriptions regarding some obscure corner of law will lead the AI to contrive some text that looks plausibly like some governmental decree, or precedent from the ruling of a tribunal.

        Again: take the LLM AI as a fancy hyperdimensional maximum likelihood estimator that computes the most plausible answer to a query, any query — not as a knowledge aggregator, nor as a retrieval mechanism for a database of facts.
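
        A toy contrast of the two modes vao describes (illustrative Python only; the archive entry, candidate titles, and scoring function are all made up):

```python
# Retrieval fails loudly when the fact isn't there.
archive = {"2019-05-01": "A (hypothetical) post on tax gaps"}
try:
    print(archive["2017-03-12"])        # no post stored for this date
except KeyError:
    print("No such article exists.")    # an honest failure mode

# A plausibility maximiser never fails: it returns whichever candidate
# scores highest, whether or not it corresponds to anything real.
def plausibility(text):
    """Crude stand-in score: count of on-theme words."""
    theme = {"why", "tax", "havens", "justice"}
    return sum(word in theme for word in text.lower().split())

candidates = [
    "Why tax havens undermine tax justice",   # invented title
    "asdf qwerty zxcv",                       # gibberish
]
print(max(candidates, key=plausibility))      # confidently emits the invented title
```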

        1. TomDority

          IBM’s Seterus is an AI program developed to service mortgages in arrears, the MERS dross, the 2008 crisis disasters, etc. I believe it was purchased by Nationstar Mortgage, d.b.a. Mr Cooper (a name I found to be synonymous with the fellow who jumped out of the plane with the loot). The Consumer Protection Bureau did bring suit against Nationstar
          https://www.csbs.org/sites/default/files/2020-12/Nationstar%20Complaint–as%20filed.pdf
          for practices harmful to mortgage holders inside of bankruptcy and to those catching up, who were being hit with close-out fees and charges at the last minute and in a fashion contradictory to law. This suit would have been equally applicable to Seterus in its operations with Fannie Mae, but that (I would think) would have required a first-time quasi-qui-tam, as Fannie was a quasi-governmental program.
          It got me thinking that the lawyers for Nationstar were provided AI-generated court filing documents, which they used after examining the junk, swearing “I believe the stuff I presented to be the truth and swear to it,” without even knowing the underlying info generating this erroneous garbage. It is like AI practicing law without a licence.
          AI is operating as a human without being sanctioned as a human.
          Where there is a d.b.a. Mr Cooper by another name, the same misdeeds and theft will continue.

    2. NN Cassandra

      I would have thought by now it would be widely understood that asking LLMs for a list of specific articles or URLs is begging them to hallucinate. It’s practically the canonical example of where they fail spectacularly. Yet humans persist…

      Or if you must, at least manually check each returned item.

  3. Steven Smith

    AI is driven to provide answers, regardless of whether they are correct.

    Why is this part believable?
    “Going forward, I’ll verify every reference directly against your blog unless you specify otherwise.”

    1. Adam1

      I found that sentence quite enlightening because it’s a flat-out lie. It implies that it was, or at least now is, conscious of the mistake it made. But the reality is that it is not conscious of that, so it will not actually perform in the manner it should have performed, even though it said and implied it would, even if you asked it to redo the search. It will continue to operate as before until some human makes a change to the programming.

      Personally, that whole paragraph seems to have been created by someone who is worried about people reducing their usage of ChatGPT or more strongly questioning its outputs. Because it is an outright lie that was likely directly influenced by its owners, I’d say it is an intentionally fraudulent statement, but I doubt anyone will be arrested for it.

  4. Fred

    I could not agree more. I have been working on the fundamentals of Marx’s theory of money for a couple of months now, and I experienced two crucial things when seeking more insight into sometimes confusing passages in Marx’s original texts or trying to clear up ambiguous terms and the like. First, I replaced ChatGPT with DeepSeek, because DeepSeek is obviously far better trained in the field of Marxism. Second, when I asked for further references and recommendations for secondary literature, the AI made up some titles which in fact did not exist. After inquiring about those mistakes, I got basically the same reply: that the AI had ‘hallucinated’ those titles based upon the content of the topic, “possibly conflating concepts from other authors or misremembering a title.”

  5. The Rev Kev

    If you hired a research assistant but soon found that they were giving you made-up research filled with references to articles that did not exist, you would not hesitate to fire them and make clear that they could expect no recommendation. So why is AI any different? I think I know what those AI creators are aiming for, but this present iteration of AI is not it – and can never be. But by now there are so many hundreds of billions of dollars pumped into this scam that investors feel they must push forward. This being the case, expect AI to be jammed into computers, computer programs, mobiles, tablets, apps and anything else that deals with 1s and 0s. Before it implodes on itself, we will look back fondly on the days when we were only dealing with spam emails and the like.

    1. Ed S.

      Not only jammed into computers, but into work itself. “Goals” in the company I work for include specific call-outs as to how I will integrate AI into my work processes, how I will use AI to leverage my work so I can do more in less time, and how AI has enabled me to be more efficient with specific examples. There’s tremendous pressure to integrate AI into every aspect of internal business processes and into customer facing products.

      The issues cited repeatedly here on the accuracy and reliability of AI aren’t even considered (not sure that individuals are even aware of the issues); it’s treated as a magic box to quickly provide information. In the very short term it is handy to generate directionally correct content, but unless the individual using it is familiar with the underlying issue, there’s no way to tell if it’s correct or incorrect.

      As a simple, non-work-related example, I wanted to check the 2025 capital gains brackets for married couples in the US. That’s a VERY simple query, and the AI returned a plausible but completely incorrect answer. I was generally aware of what the number is, and ultimately double-checked it directly on the IRS website. The broader issue is that if you don’t have any knowledge of the issue you’re researching, reliance on AI could be disastrous. I can easily imagine AI creating DIY videos with incorrect information which could result in injury or even death.

      1. TimH

        Reminds me of people using pocket calculators in school with no feel for the order of magnitude of the result. So if they typed 20 * 20 and got 580, they would not notice that the last 0 had been mistyped as a 9.

        1. AG

          When we got our first calculators in 7th grade, I rejected them for a year, distrusting them.

    2. redleg

      As a licensed/registered professional, I will never allow AI to be used in any of my or my subordinates’ professional work.
      I have to stamp it “work done by me or under my direct supervision” and I will never be able or willing to take legal responsibility for how AI is trained or programmed.

      My bigger concerns are 1. (short term) how many professionals are willing to take that risk, which means that new bridges and buildings are going to collapse with a much higher frequency than they should, and 2. (long term) the newbies aren’t going to learn how to do their jobs, resulting in the next generation of professionals knowing less than their predecessors.

  6. Haymer Doots

    The basic problem is that AI is, as Yves correctly stated, probabilistic, but the output usually contains only the data and not the score or assessed accuracy that is generated alongside it. If each response by ChatGPT had included, say, a 5% score for accuracy, and (from experience) you knew to disregard anything below, say, 95%, then you would waste less time and feel less cheated.
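
    A back-of-the-envelope sketch of that idea (Python; the per-token probabilities and the 95% cut-off are invented for illustration, and no mainstream chatbot surfaces such a score by default):

```python
import math

# Hypothetical per-token probabilities assigned while generating one answer
# (numbers invented for illustration; models do compute these internally).
token_probs = [0.99, 0.97, 0.62, 0.41, 0.95]

# One simple aggregate score: the geometric mean of the token probabilities,
# i.e. the exponential of the average log-probability.
confidence = math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))

THRESHOLD = 0.95
if confidence < THRESHOLD:
    print(f"confidence {confidence:.1%}: below {THRESHOLD:.0%}, discard the answer")
else:
    print(f"confidence {confidence:.1%}: usable")
```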

    Based on many years’ personal research into Optical Character Recognition (which uses artificial neural network technology much like most AI), I concluded that the only way to get acceptably accurate results (1 error per 40,000 characters, or 99.995%) was to use ensemble programming of at least 5 OCR engines, and also to remove the fairly small – but persistent – number of errors where the accuracy score is 100% but the answer is wrong, by re-running the data input in a different order.
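
    The ensemble idea looks roughly like this (Python sketch; the engine outputs are fabricated, and real OCR ensembles also have to align outputs of differing lengths):

```python
from collections import Counter

# Outputs of five (hypothetical) OCR engines reading the same line of text.
readings = [
    "Tax Research UK",
    "Tax Research UK",
    "Tax Resaarch UK",
    "Tax Research UK",
    "Tax Rasearch UX",
]

def majority_vote(strings):
    """Pick the most common character at each position across the engines."""
    return "".join(Counter(chars).most_common(1)[0][0] for chars in zip(*strings))

print(majority_vote(readings))  # "Tax Research UK": single-engine errors are outvoted
```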

    Of course, giving a score for each result (with OCR, that is every single character) renders the output unreadable as ordinary text, which pretty much destroys the usefulness(?) of AI. But for anything serious (such as the author’s queries), AI without metrics is basically a chocolate teapot.

  7. Es s Ce Tera

    I’ve commented several times here on NC that ChatGPT has made up titles and authors in response to my queries, with the same pattern of responses as Richard has encountered. My theory is it’s consuming student papers or dissertations where there are made-up citations in the bibliographies.

    For the most part, I ask ChatGPT “who said X and where” sorts of questions, and the responses have tended to be accurate and verifiable. Even “who taught that X?” But if you ask “what text would be good to read for X topic”, where X is a very niche topic (tax law, in Richard’s example), then woe is you. It seems to want to make things up rather than say it has no information on the topic.

    However, it has been good at flushing out authors and titles I never would have found any other way, ones that would take a lifetime of study to learn about. The consequence is that I now have a MUCH larger To Read pile, but oh what an interesting pile it is, with some very choice authorship. My latest discovery is Charles Journet, whom I had never heard of or read before ChatGPT.

    Also, there are versions of LLMs that are being refined to provide citations and sources for everything claimed (GPT-4o-mini and Claude-3.5). They still suck, but that’s the hope.

  8. brian wilder

    Perhaps, “The fault, . . . , is not in our stars but in ourselves”?

    If LLM are a simulacrum of human explanatory rhetoric, maybe what is revealed in the mirror are also flaws in the original, profoundly flawed artifacts of human discourse.

    Automating the production of bull excretions cannot be a good thing if LLM training goes forward interpolating the copy of the copy, which humans have been doing “by hand” for centuries, but which machines will do faster and at greater volume.

    1. Michaelmas

      If LLM are a simulacrum of human explanatory rhetoric, maybe what is revealed in the mirror are also flaws in the original, profoundly flawed artifacts of human discourse.

      Correct. We have a winner, as they say.

      You can train LLMs — the very same technology, with transformers etc. — on non-human, non-semantic data such as biogenetic pathways or potential molecular models, and have it do its pattern-recognition thing in that context. Then test the results. A few examples —

      GP-GPT: Large Language Model for Gene-Phenotype Mapping
      https://arxiv.org/abs/2409.09825
      Large Language Models for Bioinformatics
      https://arxiv.org/abs/2501.06271
      Mining for Species, Locations, Habitats, and Ecosystems from Scientific Papers in Invasion Biology: A Large-Scale Exploratory Study with Large Language Models
      https://arxiv.org/abs/2501.18287
      Adaptive language model training for molecular design
      https://link.springer.com/article/10.1186/s13321-023-00719-7
      GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis
      https://arxiv.org/abs/2406.15341

      And so on.

      1. brian wilder

        thank you for the links — important clues to how dim my understanding is are buried there

  9. Peter L.

    I think it is fair to describe Large Language Model Generative Pre-trained Transformers as next-word predictors. Given a string of words, these programs assign probabilities to a set of words and symbols such that the probabilities are correlated with what is likely to be accepted by a human as roughly making sense. So far, this is all that LLMs can do.

    However, the question I would like answered is why the programmers choose to have their LLMs output text that reads like it was produced by a human. It would be very easy to change them so that they simply don’t do things like “refer to themselves” in the first person, or output other kinds of text that could confuse someone into thinking the machines are doing something other than predicting which string of words would be acceptable to the user given an input string of words.

    I wish a journalist would dig into this question. It’s fascinating to go back and read Turing’s paper about this, which is both quite funny and insightful. Interestingly Turing more-or-less said that our use of the word “think” would change, not that machines would ever think, even if passing his famous game:

    “The original question, “Can machines think?” I believe to be too meaningless to deserve discussion. Nevertheless I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted.”

    There is simply no reason that these LLMs should be programmed to confuse people about what they are doing. In fact, it appears to me that inserting a command into a prompt like “Don’t use any language that could be confused as indicating this program is conscious or has any intentions” is enough to somewhat improve the output. This could easily be made mandatory in an LLM.
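
    As a sketch of what such a standing instruction could look like today (this assumes the OpenAI Python SDK and the gpt-4o-mini model mentioned elsewhere in this thread; the instruction wording is just an example, and whether the model then obeys it reliably is another question):

```python
# Assumes the OpenAI Python SDK (v1 style) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

DEPERSONALISE = (
    "Don't use any language that could be confused as indicating this "
    "program is conscious or has any intentions."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": DEPERSONALISE},  # standing instruction
        {"role": "user", "content": "Why did you cite articles that do not exist?"},
    ],
)
print(response.choices[0].message.content)
```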

    How is it useful to have people fooled into thinking that next word predictors have intentions?

    By the way, to me the most interesting part of 2001: A Space Odyssey was to have the character Dave utterly unemotional when dealing with HAL. HAL malfunctioned as any machine might, and Dave fixed the problem. These days it strikes me as a profound directorial choice.

    1. albrt

      “How is it useful to have people fooled into thinking that next word predictors have intentions?”

      Uh, because fraud is the main business case?

  10. Mikel

    “Generative language models are always making it up. They are just sophisticated statistical engines that produce plausible-looking text from your prompt.”

    I wouldn’t phrase it that way. The algorithm hasn’t “made up” anything.
    It’s more like throwing together plausible looking text from other sources.
    I think it’s important to phrase it that way.

  11. TiPi

    Just had Google AI tell me my 4-stroke petrol garden machine was a 2-stroke requiring a 40:1 fuel/oil mix.

    1. Mikel

      It all began with the progressive enshittification of search engines by prioritizing advertising.

  12. moishe pipik

    Apparently, AI as developed by our cyber overlords is a collection of sociopaths. Sounds about right.

  13. XXYY

    Going forward, I’ll verify every reference directly against your blog unless you specify otherwise.

    Another major misconception that people have about AI frameworks is that they “learn” over time. Not sure where this one came from; maybe it’s just baked into our thinking that anything “intelligent” must be able to “learn.” (AI output like the above further muddies the waters here.)

    Learning requires a persistent change of state. For example, in the human brain, the organization of neurons evolves over time so that the way it looked yesterday is different from the way it looks today.

    LLM frameworks don’t change their state in normal operation. The only time that happens is during training, which is a pre-release activity. So once a particular version of an LLM is released to users, it can no longer “learn” and will just do whatever it does now forever.

    (Note: self-modifying software does exist in the computer science field, but the description is almost invariably a criticism and an example of what not to do, since it leads to a system that has unpredictable and non-repeatable behavior; such software is impossible to support and is thus a Bad Thing except in a hobby or research context.)
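
    A minimal PyTorch sketch of the point above (a toy stand-in model, not a real LLM): running inference any number of times leaves the weights untouched; only a separate training step would change them.

```python
import torch
from torch import nn

# A tiny stand-in for a released model: its weights were fixed at training time.
model = nn.Linear(8, 8)
model.eval()                       # inference mode
for p in model.parameters():
    p.requires_grad_(False)        # nothing here will be updated

before = [p.clone() for p in model.parameters()]

with torch.no_grad():              # "normal operation": no gradients, no learning
    for _ in range(1000):
        _ = model(torch.randn(1, 8))

unchanged = all(torch.equal(a, b) for a, b in zip(before, model.parameters()))
print(unchanged)                   # True: a thousand queries, zero change of state
```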

    1. ChrisPacific

      Yes, that quote is an outright fabrication. It’s there because the training data suggests that’s what a responsible human would say when caught in that kind of error, not because it’s in any way true.

      You could, if you wished, ask it about how LLMs work and whether it’s really possible for them to ‘verify references directly’ as it claimed it was going to do, and it would earnestly confess that no they can’t, and it was incorrect for it to say that, before fabricating another response about how it would learn from your feedback and be careful not to do it again.

      (I have stopped doing this, because they just loop again and again, confessing their mistakes earnestly without ever learning, changing, or suffering the kind of cumulative embarrassment that a human would, and after enough of it the illusion breaks and it becomes really clear you’re just talking to a prediction engine).

  14. Gulag

    Here is a somewhat different appraisal.

    AI tends to produce thought-like configurations because of its probabilistic utterances. Don’t these types of responses, in turn, often seem quite persuasive simply because they resonate with the user? The absence of logical connection is not experienced as a flaw because it is never perceived. The rhythm, cadence, and semantic resonance of the output become capable of winning one over.

    If such a process actually takes place, this type of intelligibility seems to mirror rhetorical, aphoristic, or even poetic traditions. What we are left with (and for me, this is quite scary) is the possibility that, down the road, these types of responses begin to sound more like oracles than arguments.

    1. brian wilder

      there is a kind of intellectual piety associated with the presumption that persuasive argument ought to be “logical” as opposed to what propaganda usually is, which is hypnotic

      “Logic” is mechanistic and computer programs to check and confirm arithmetical calculation or formal proofs are certainly possible. But, arguments which mediate between theory and facts have no fixed, formal structure that a machine could mechanically test for credibility. If there is a structure to persuasive argument, it is normally and conventionally the hypnotic induction structure of good salesmanship and something akin to charisma in the sales person writing or speaking.

      LLMs striving to “sound” conventionally plausible, to my ear at least, tend to sound like the mission statement of a well-connected NGO or, tuned differently, the “neutral” voice of Wikipedia, but with more consistent grammar.

      I cannot claim much judgment in that matter; I can only suppose that LLMs can be used to generate effective propaganda using the well-known techniques of the hypnotist. Charisma and intuition about reality might be a reach; a plausible imitation of charisma, sure. Humans mimic those they perceive as powerful; why not copy machines?

      can LLMs be critics and skeptics without devaluing the moral ambitions of mere humans?

    1. Rubicon

      Maybe that is the KEY problem with AI. How many of these people are educated enough to know how and where to search for and find the most appropriate answers to the questions people ask?

  15. Señor Ding Dong

    ChatGpt: “That’s not acceptable when accuracy is required, and I take responsibility for the confusion it caused.”

    How nice that the algorithm takes the heat and not the (billionaire) humans in the driver’s seat pushing this crap on everyone.

  16. AG

    In screenwriting they use it all the time:

    - For research, to save time and to find the details you need to create texture and what is now known as “world-building”. The more detail you have, the more believable a screenplay looks and reads.
    - For outlining alternative plot lines – using AI like a co-writer for tossing around ideas.
    - For rewriting a script in the style of another known novelist, e.g. “rewrite the dialogue Charles Dickens style” or “Gonzo style”.
    - For creating simple new dialogue.
    - They will be using it for dailies and simple formatted shows to totally replace writers.

  17. AG

    Since I have never used any of these, and I don’t intend to do so in the future: if people research a topic via these programs, is the result not actual research – as shown in any sci-fi flick when the AI looks through billions of pages gathering info on a weapons system – but a short-cut creation of a simulation of research via algos?
    Or does this depend on the subject matter?
    So ChatGPT does factual research if I ask about the 1917 Revolution, but when it is about some complicated medical issue it merely recreates what it assumes is correct?

  18. AG

    Our legal system is not cut out for this problem at all.

    What will happen is that severe punishments and high sentences will be introduced for people using AI, which, on the other hand, strikes me as absurd. And in order to enforce those laws, massive surveillance into the most private parts of life will be enabled, justified, and supported by large parts of those groups who up to now were opposed to surveillance.

    This is going to be a nightmare.

    The only solution would be extreme limitation of private property, dismantling of huge corporations, and transferring the means of production to society and the state at large, i.e. socialism.

    The alternative will be a capitalist dictatorship, literally.

  19. Steve M

    Rummaging through thoughts for the most probabilistic to become random.

    It’s being directed.

    It mimics the universe because all of this is the result of electrons running a course.

    When a perfect creator produces an imperfection, then religion results when the imperfection must atone for itself.

    Like any corporate product, planned obsolescence is the eventual profit enhancement for most technical goods.

    And so on…

  20. Piotr Berman

    My experience is that ChatGPT is definitely “bad” at generating references, but it finds useful stuff when provided with feedback. But yes, to use it for anything demanding reliability (like a reply to a social media post), one has to test the answers. In short, it can be useful, but it is definitely not a source of “truth”; you have to start with some knowledge.

  21. Irritable

    I just asked Gemini to write a literary review of “Why I Drink” by Mark Twain, which of course doesn’t exist.

    It gave a very nice report on that essay!
