Developers are Crabgrass : Model Collapse. When LLM bots create content, and feed on their own stuff or output of other bots, things happen.

Monday, March 11, 2024

I asked Micro$oft's CoPilot preview version about it, and got:

Model collapse, also known as AI collapse, is a phenomenon observed in generative artificial intelligence (AI) models. Let me break it down for you:
What is Model Collapse?
Model collapse refers to the gradual degradation in the output of a generative AI model trained on synthetic data, which includes the outputs of other models (including prior versions of itself) ¹.
In essence, as an AI model is exposed to more AI-generated data, its performance worsens over time. It produces more errors in responses and content, resulting in less non-erroneous variety in its outputs ².
Why Does Model Collapse Occur?
Initially, large language models (LLMs) like ChatGPT are trained on human-generated content (books, articles, etc.).
However, as more people use AI to produce and publish content, AI-generated data proliferates on the internet.
When LLMs are retrained on their own outputs or other AI-generated data, they suffer from irreversible defects, leading to model collapse ².
Consequences of Model Collapse:
Over time, mistakes in generated data accumulate, causing models to misperceive reality further.
The variety of non-erroneous responses narrows down, resulting in gibberish or nonsensical output ².
Handling Model Collapse:
Researchers propose strategies to mitigate model collapse:
Retain Original Human-Produced Data: Maintain a pristine copy of exclusively or nominally human-produced data. Avoid contaminating it with AI-generated content ².
Adaptive Regularization: Apply regularization techniques during training to prevent overfitting to AI-generated data ³.
The Challenge Ahead:
As AI-generated content proliferates online, preventing model collapse becomes crucial.
Researchers continue to explore ways to maintain model quality and robustness in the face of increasing web content generated by LLMs ⁴.
In summary, model collapse poses a significant challenge, but ongoing research aims to address it and ensure the reliability of AI-generated content.
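
To make the bot's description concrete, here is a minimal toy sketch in Python. It is not taken from Copilot or from any of the linked sources. It assumes the "model" is nothing more than a bell curve (a mean and a spread) fitted to a handful of numbers, which is far simpler than an LLM, but it shows the same feedback loop: each generation is trained only on what the previous generation produced. The function name `simulate_collapse` and its parameters (`generations`, `sample_size`, `human_fraction`, `seed`) are made up for illustration. Setting `human_fraction` above zero mimics the "retain original human-produced data" idea by mixing some of the original data back in at every round.

```python
import random
import statistics


def simulate_collapse(generations=500, sample_size=20, human_fraction=0.0, seed=0):
    """Toy model-collapse loop: fit a bell curve, sample from it, refit, repeat.

    Returns the fitted spread (standard deviation) at every generation.
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Stand-in for "human-generated" data: draws from a standard bell curve.
    human_data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    n_human = round(human_fraction * sample_size)

    training_data = list(human_data)
    spreads = []
    for _ in range(generations):
        # "Training": estimate the mean and spread of the current data.
        mu = statistics.fmean(training_data)
        sigma = statistics.pstdev(training_data)
        spreads.append(sigma)
        # "Publishing": the model emits synthetic data from what it learned.
        synthetic = [rng.gauss(mu, sigma) for _ in range(sample_size - n_human)]
        # Optionally mix some retained human data back in.
        retained = rng.sample(human_data, n_human)
        training_data = synthetic + retained
    return spreads


if __name__ == "__main__":
    pure = simulate_collapse(human_fraction=0.0)
    mixed = simulate_collapse(human_fraction=0.5)
    print(f"synthetic only:  spread {pure[0]:.3f} -> {pure[-1]:.3g}")
    print(f"half human data: spread {mixed[0]:.3f} -> {mixed[-1]:.3g}")
```

In this toy, the synthetic-only run should see its spread shrink toward zero over the generations, which is the "variety narrows" effect in miniature, while the run that keeps half human data should stay in the neighborhood of the original spread. It says nothing about how quickly, or whether, a real LLM would degrade.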

Learn more - links returned by Copilot:

1. en.wikipedia.org
2. venturebeat.com
3. arxiv.org
4. computerweekly.com
5. towardsdatascience.com
6. appinventiv.com
7. theregister.com
8. arxiv.org

That was not where my considering it started. What triggered questioning a bot about bot difficulties? I watched "The Turing Lectures: The future of generative AI," which is an hour and a half long and is recommended to any reader curious about the topic.

(Coincidentally, apart from AI bot considerations, an hour-long interview at Stanford with Jensen Huang, Founder and CEO of Nvidia, is also worth the time taken to view it. As someone who has not written code, I find it refreshing that each linked video mentions that the future will be bots doing the coding, which rings my bell and should resonate with others interested in new directions without having to write code.)

So, this is basically a pitch for two videos, with text in support. Next come online items beyond what the bot returned, perhaps overlapping with it, found via two search engine queries:

https://duckduckgo.com/?t=ffab&q=model+collapse&ia=web

https://scholar.google.com/scholar?hl=en&as_sdt=0%2C24&q=LLM+model+collapse

The two arXiv items CoPilot returned are technical computer science papers, 18 and 16 pages long, by credible scholars, and such papers are difficult to read; the other items are popular reports any intelligent person can follow. The Google Scholar results are likewise from a technical literature search, while the DuckDuckGo results are popular news and analyses.

All for now. After posting, Crabgrass will be studying stuff about Model Collapse.

Some readers are expected not to care to pursue things beyond, at most, trying out the videos, but others may be intrigued, so have at it.

__________UPDATE__________

Can you imagine an image search for building architecture becoming badly slanted because a web crawler picked up and retained too many fresh images from Gaza in the original training data set?

Leaving that thought to settle in: OpenAI's systems (and the derivative Micro$oft CoPilot stuff) are proprietary, closed, black box systems. It's technological "Trust me."

Is open source any answer? Or is it just a distinction without a difference in terms of model collapse, as more and more web content becomes artificially generated because that is cheaper than using human beings to write stuff?

Then, think, is life more interesting with dilemmas, or is certitude best? Or is your answer biased by personality traits, such as tidiness, or drug use?