The Challenge of Preserving Good Data in the Age of AI

Yves here. The challenge of preserving good data existed before AI, but AI looks set to make the problem worse. There are several layers to this conundrum. One is the pre- versus post-Internet divide: pre-Internet print sources are being devalued and may be becoming even harder to access, in that efforts to maintain them are probably weakening. Second, Internet-era material disappears all the time. For instance, even as of the early ’teens, I was regularly told Naked Capitalism was a critical source for research on the financial crisis because so much source material had disappeared or become difficult to access. The piece below explains how AI produces new problems by creating an “information” flood and increasing the difficulty of determining what older material to preserve.

By Peter Hall, a computer science graduate student at the New York University Courant Institute of Mathematical Sciences. His research is focused on the theoretical foundations of cryptography and technology policy. Originally published at Undark

Growing up, people of my generation were told to be careful about what we posted online, because “the internet is forever.” But in reality, people lose family photos shared to social media accounts they’ve long since been locked out of. Streaming services pull access to beloved shows, content that was never even possible to own. Journalists, animators, and developers lose years of work when web companies and technology platforms die.

At the same time, artificial intelligence-driven tools such as ChatGPT and the image creator Midjourney have grown in popularity, and some believe they will one day replace work that humans have traditionally done, like writing copy or filming video B-roll. Regardless of their actual ability to perform these tasks, though, one thing is certain: The internet is about to become deluged with a mass of low-effort, AI-generated content, potentially drowning out human work. This oncoming wave poses a problem to computer scientists like me who think about data privacy, fidelity, and dissemination daily. But everyone should be paying attention. Without clear preservation plans in place, we’ll lose a lot of good data and information.

Ultimately, data preservation is a question of resources: Who will be responsible for storing and maintaining information, and who will pay for these tasks to be done? Further, who decides what is worth keeping? Companies developing so-called foundation AI models are some of the key players wanting to catalog online data, but their interests are not necessarily aligned with those of the average person.

The costs of electricity and server space needed to keep data indefinitely add up over time. Data infrastructure must be maintained, in the same way bridges and roads are. Especially for small-scale content publishers, these costs can be onerous. Even if we could just download and back up the entirety of the internet periodically, though, that’s not enough. Just as a library is useless without some sort of organizational structure, any preserved data must be archived mindfully. Compatibility is also an issue. If someday we move on from saving our documents as PDFs, for example, we will need to keep older computers (with compatible software) around.

When saving all these files and digital content, though, we must also respect and work with copyright holders. Spotify spent over $9 billion on music licensing last year, for example; any public-facing data archival system would hold many times this amount of value. A data preservation system is useless if it’s bankrupted by lawsuits. This can be especially tricky if the content was made by a group, or if it has changed hands a few times – even if the original creator of a work approves, someone may still be out there looking to protect the copyright they bought.

Finally, we must be careful to archive only true and useful information, a task that has become increasingly difficult in the internet age. Before the internet, the cost to produce physical media — books, newspapers, magazines, board games, DVDs, CDs, and so on — naturally limited the flow of information. Online, the barriers to publishing are much lower, and thus a lot of false or useless information can be disseminated every day. When data is decentralized, as it is on the internet, we still need some way to make sure that we are promoting the best of it, however that is defined.

This has never been more relevant than now, on an internet plagued with AI-generated babble. Generative AI models such as ChatGPT have been shown to unintentionally memorize training data (leading to a lawsuit brought by The New York Times), hallucinate false information, and at times offend human sensibilities, all while AI-generated content has become increasingly prevalent on websites and social media apps.

My opinion is that because AI-generated content can simply be reproduced, we don’t need to preserve it. While many of the leading AI developers do not want to give away the secrets of how they collected their training data, it seems overwhelmingly likely that these models are trained on vast amounts of data scraped from the internet, so even AI companies are wary of so-called synthetic data online degrading the quality of their models.

While manufacturers, developers, and average people can solve some of these problems, the government is in the unique position of having the funds and legal power to save the breadth of our collective intelligence. Libraries save and document countless books, movies, music, and other forms of physical media. The Library of Congress even keeps some web archives, mainly historical and cultural documents. However, this is not nearly enough.

The scale of the internet, or even just digital-only media, almost certainly far outpaces the current digital stores of the Library of Congress. Not only this, but digital platforms — think software like the now-obsolete Adobe Flash — must also be preserved. Much like conservationists maintain and care for the books and other physical goods they handle, digital goods need technicians who care for and keep original computers and operating systems in working order. While the Library of Congress does have some practices in place for digitization of old media formats, they fail to meet the preservation demands of the vast landscape that is computing.

Groups like the Wikimedia Foundation and the Internet Archive do a great job of picking up the slack. The latter in particular keeps a thorough record of deprecated software and websites. However, these platforms face serious obstacles to their archival goals. Wikipedia often asks for donations and relies on volunteer input for writing and vetting articles. This has a host of problems, not least of which are the biases in what articles get written, and how they are written. The Internet Archive also relies on user input, for example with its Wayback Machine, which may limit what data gets archived, and when. The Internet Archive has also faced legal challenges from copyright holders, which threaten its scope and livelihood.

Government, however, is not nearly so constrained. In my view, the additional funding and resources needed to expand the goals of the Library of Congress to archive web data would be almost negligible to the U.S. budget. The government also has the power to create necessary carve-outs to intellectual property in a way that is beneficial for all parties — see, for example, the New York Public Library’s Theatre on Film and Tape Archive, which has preserved many Broadway and off-Broadway productions for educational and research purposes, even though these shows otherwise strictly forbid photography and video recording. Finally, the government is, in theory, the steward of the public will and interest, which must include our collective knowledge and facts. Since any form of archiving involves some form of choosing what gets preserved (and, by extension, what doesn’t), I don’t see a better option than an accountable public body making that decision.

Of course, just as analog recordkeeping did not end with physical libraries, data archiving should not end with this proposal. But it is a good start. Especially as politicians let libraries wither away (as they are doing in my home of New York City), it is more important than ever that we right the course. We must refocus our attention on updating our libraries, centers of information that they are, to the Information Age.


13 comments

  1. Acacia

    Pre-AI online content – save and preserve it, as this article proposes, under the supervision of an accountable public body.

    Post-AI content – que sera sera.

  2. The Rev Kev

    I fully support preserving the pre-AI internet. But if they start to run short of storage space, they could always delete all those selfies and images of meals that people take to make more room.

  3. Mikerw0

    A close friend of mine was one of the two post-doc candidates who ran the Playboy sex survey. Other than Kinsey, it is the only major database predating the STD crisis of the ’80s. A lot of time and effort was spent to preserve the database. They had issues getting people to take it seriously and plan for the changes.

    A small parable.

    1. Cetzer

      I have this and lots of other stuff (offline versions of Wikipedia in different languages, text only) on M-DISC DVDs, which are advertised to last a thousand years. Unfortunately the DVD version is unavailable, and I had decided some time ago not to bumble about with Blu-ray.
      But who can read them in a thousand (or thousands of) years? Take a look at the deciphering of ancient, mostly burned scrolls from Pompeii with assorted high tech. Aliens able to visit the earth might decipher those DVDs even in the far future, if so inclined after they saw the mess we left.

  4. scott s.

    Knowing what to preserve and how to do it is a major issue. Agree with the author that it is hard to tell if AI-generated data/content is pollution or beneficial. But as the post describes, preserving and accessing are two different things.

    As copyright law has continually expanded, the ability to access becomes increasingly constrained, at least in legal theory, though the ability to enforce legal theories in actual systems such as government courts is limited by the sheer volume.

  5. herman_sampson

    So family history/genealogy research will be even more challenging in terms of accuracy. It’s bad enough contending with human error. Might lead to George Washington being the ancestor of everyone in the U.S., as “he was the father of his country.” Of course, WWIII would render the whole problem moot.

  6. AG

    ChatGPT turns out to be a fucking dystopian nightmare already.
    The last nail in the coffin of what once was critical knowledge.

    Colleagues of mine who are not as crazy as me and not as Matt-Taibbi-savvy (they have never heard of him or the Twitter Files) told me they are using ChatGPT to save time on online research.

    US Representative Tom Massie, in reference to useless CIA briefings on Ukraine, has this very telling point to make:

    “(…)
    I’m in the Speaker’s office, and I asked the Speaker if he knew the number of casualties in Ukraine? He began telling me how many Russian casualties there were. I said, Do you know how many Ukrainian casualties there are? And he said, No. I said, Have they ever told you? And he said, No. I said, Have you ever asked? He said, I should ask. So the Speaker of the House doesn’t even know. Didn’t even know at the time how many casualties there were. And if you ask, for instance, an AI assistant, how many Ukrainian casualties there are, even the AI assistants here in our country can’t tell you how many. At the time when I drafted this, AIs came up with wildly different numbers because they’re relying on everything that’s been published. And there’s virtually nothing reliable that’s been published.

    see: https://responsiblestatecraft.org/massie-ukraine-casualty-numbers/

    So you pile shitloads of useless propaganda onto each other and people are consuming it.
    And NOBODY will realize.

    Frankly this is terrifying.

    p.s. records from the pre-Web and Web 2.0 era are increasingly endangered.
    Once-safe links are disappearing.
    And I am talking about serious FOIA stuff, secret intelligence, governmental records and so on.

    And what if some evil conspiracy decides to totally destroy the archive.is archived content???

  7. AG

    A less sophisticated question: which search engine on Earth am I supposed to use?

    I am currently juggling Qwant, Google, and DuckDuckGo, and sometimes Yandex.
    The first three are garbage. Yandex I haven’t used enough to judge.

    1. Cetzer

      You might take a look at metasearch engines. Sometimes search engines like DuckDuckGo are weak in languages other than English. Sometimes a search in another language is the way to go.
      Anyway, Google has announced that it will kind of capitulate to the AI babble and no longer try to collect (on spidery legs) the whole Internet, but only trusted sites; the other engines¹ will probably follow suit and morph back into a directory like Yahoo once was.
      In difficult cases: Why not ask other humans in an appropriate forum or in your favorite begging place: “Wise man, can you spare me an insight?”

      ¹An evil genius could offer a search over the whole internet for paying (cash not personal data) clients only

  8. Polar Socialist

    I thought most of the Internet was already in Chinese? Anyway, since most of the net I use is either NC or sites based on sources published on dead trees (or maybe AI can already falsify poor-quality images of original sources) that I use to learn more – that is, to build on a foundation that already exists and that AI can’t comprehend or accommodate – I assume I can keep on using the polluted Internet for the rest of my days.

    It’s not like there haven’t always been sites with absolutely ridiculous noise-to-signal ratios (YouTube comments, anyone?) provided by the thousands of humanoid internet warriors. That is one of the reasons, I know, why NC feels like such a good place amidst all the chaos that existed everywhere even before ChatGPT was released.

  9. hazelbrew

    I like the topic; it is broad and at the heart of what I’ve done professionally for years, and for a Comp Sci graduate trained years ago, understanding data, information, knowledge, know-how, algorithms, etc. is at the heart of the discipline.

    Yves, your intro is good – the challenge of data preservation has always been there. The nuance is how that has changed with AI (gen AI in particular).

    So let’s work through an example.
    Let’s say I am recording my training times when running, as I am trying to finish a half marathon.

    Each training run contains data – the distance, time taken, strides taken, heart rate, etc. That is pure data: a recorded fact based on the sensors I take with me. It is also something that gets regularly pruned and discarded. Why? Because it’s not that useful until processed.

    What can I do with it?
    Comparing the training runs allows me to answer questions like “Am I getting faster? Am I able to run for longer? What is my efficiency?” That is now the domain of information.
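
    To make that data-to-information step concrete, here is a minimal sketch in Python – the run records, field names, and numbers are made up purely for illustration:

      # Raw data: one record per training run (all values hypothetical).
      from datetime import date

      runs = [
          {"day": date(2024, 3, 2), "distance_km": 8.0, "minutes": 52.0},
          {"day": date(2024, 3, 9), "distance_km": 10.0, "minutes": 63.5},
          {"day": date(2024, 3, 16), "distance_km": 12.0, "minutes": 74.0},
      ]

      # Derived information: pace per run, and whether pace improves run over run.
      paces = [(r["day"], r["minutes"] / r["distance_km"]) for r in runs]
      for day, pace in paces:
          print(f"{day}: {pace:.2f} min/km")

      getting_faster = all(later <= earlier
                           for (_, earlier), (_, later) in zip(paces, paces[1:]))
      print("Am I getting faster?", "yes" if getting_faster else "not consistently")

    The same raw records could just as easily feed the shoe-mileage question below; only the derived question changes.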

    Go on…
    If I want to understand whether my training plan is delivering on my expectations, then the information about “am I getting faster?” can be set against the training plan – is the plan working? Is it effective? Do I need to do something different to hit my goals? That is now the domain of knowledge. I am now making decisions off the information.

    But the same base data can be used to create different types of knowledge.
    For example, if I want to track the usage of particular shoes, gauge their effectiveness over time, and replace them at the appropriate point in their life, I use the same base data points – mileage run.

    So what? (Let me try to get to the point!)

    How does this relate to AI, or specifically generative AI, and to data?

    Let’s take the example further.

    If I use AI to create some novel, useful analysis of, say, my stride pattern, or training times, or time of day, etc., then that is useful and worth preserving for me,
    whereas the base data is not always worth preserving.

    Or, in other words – keeping base data is not always good, and keeping AI-derived analysis is not always bad.

    I think some of the challenges with gen AI perception and usage relate to these distinctions, i.e., we should not have the expectation that a gen AI tool is good at fact retrieval – that is what a search engine is for. Unfortunately, perception is running ahead of what the tech could or should be used for.

