Developers are Crabgrass: Using AI-generated data to train AI? Great idea?

If you cannot trust MIT, who can you trust?

The trouble is, the types of data typically used for training language models may be used up in the near future—as early as 2026, according to a paper by researchers from Epoch, an AI research and forecasting organization, that is yet to be peer reviewed. The issue stems from the fact that, as researchers build more powerful models with greater capabilities, they have to find ever more texts to train them on. Large language model researchers are increasingly concerned that they are going to run out of this sort of data, says Teven Le Scao, a researcher at AI company Hugging Face, who was not involved in Epoch’s work.

The issue stems partly from the fact that language AI researchers filter the data they use to train models into two categories: high quality and low quality. The line between the two categories can be fuzzy, says Pablo Villalobos, a staff researcher at Epoch and the lead author of the paper, but text from the former is viewed as better-written and is often produced by professional writers.

Data from low-quality categories consists of texts like social media posts or comments on websites like 4chan, and these examples greatly outnumber those considered to be high quality. Researchers typically only train models using data that falls into the high-quality category because that is the type of language they want the models to reproduce. This approach has resulted in some impressive results for large language models such as GPT-3.

One way to overcome these data constraints would be to reassess what’s defined as “low” and “high” quality, according to Swabha Swayamdipta, a University of Southern California machine learning professor who specializes in data-set quality. If data shortages push AI researchers to incorporate more diverse data sets into the training process, it would be a “net positive” for language models, Swayamdipta says.

In the old days, when processing power and memory were in shorter supply, the word was "garbage in, garbage out."

Now MIT can say it with many more words. We have come a long way, and the way out is obscure.

Ask the Unabomber?

google = use AI output to train AI

See what you get.

UPDATE: I was contemplating a post, with well-linked authority, about something I'd not yet seen posted about, though I have been expecting it for some months.

The Epstein and the Ehud Barak

Mossad. That's what some say about who created Epstein and the honey pot.

Epstein was Jewish. The Victoria's Secret guy was Jewish. Leon Black? Jewish. So it's low-grade easy to say it was all part of the Jewish conspiracy, which gets mentioned a lot on various web segments.

What interests me, and I have nothing like proof either way beyond the circumstantial -

If you web search = Epstein Barak business ventures

See what you get. Tune the search if you feel that's needed. See what their mutual greed produced. Search Barak separately to get his Wiki Bio.

Then try Wikipedia for any names you turn up in connection with money-making ideas Epstein and Barak may have shared, trial ventures they considered, something that may have hit the shoals or may still be going.

Something perhaps existing for one purpose, but having commercialization possibility elsewise?

There is stuff online, but I will let you find it and make inferences from it. Suffice it to say: suppose Epstein had his honey pots stocked with cameras having a remote feed, via whatever internet tech then existed, such as the then leading-edge tech one might market for feeding video to a 911 site to aid its responsiveness. Wouldn't that possibly suggest streaming the activity at the honey pot to remote storage somewhere, so that if the FBI or US intelligence people had the right warrant to look (warrantless searches, they don't happen, do they?), they might not find any incriminating video evidence on site at the honey pot? And if live-streamed somewhere, where would you guess? Ventures, patents: there might be a circumstantial trail. But hard evidence, if it happened that way, which is uncertain, exists only if you know the remote data storage site being streamed to and have access to it.

And, have a nice day. Technology is your friend. Look how it helped Kash and the Mormons track down Robinson. It can do that too, with you. The Unabomber wrote about yesterday's technology, and we need to contemplate its movement given the tons of money, hardware, brain power, and electricity being invested into tomorrow's AI. We have a future, but who other than you is charting it and bringing it into being? And for what reasons are they routing things in the directions things are moving? With all of your best interests in mind? Making a better life for the children?

FURTHER: Selfless benevolence is the major human motive. What being "human" means.

Only the good die young.

Reconciling opposing thoughts is sometimes easy, sometimes hard, right?

See what you find if you search. See what you think about it.

Wednesday, September 17, 2025