ChatGPT, Google’s Bard, and other “generative” AIs have been trained by filling them with the contents of the Internet—poems, legal analyses, scientific papers, travel guides, and any other items the trainers can find.
The problem: many of those items are copyrighted by their original authors. Now OpenAI and other companies that have created these all-knowing chatbots are facing a tsunami of lawsuits and demands for payment for using other people’s work without their express permission.
However, the problem is short-term, according to OpenAI founder Sam Altman. “Pretty soon, all data will be synthetic,” he said at a May conference.
What does that mean?
First, remember that a “generative” AI typically produces its response to a prompt or question by blending together information from all over the Internet. That makes it hard to prove that a specific fact or statement in a response was lifted verbatim from a single copyrighted source.
If that’s true, then an AI’s creators can further anonymize its contents by using one AI to produce the data that trains its next descendant.
As an example, instead of having an AI learn trigonometry from a textbook, trainers could have two AIs talking to each other, one as a teacher and the other as a student. The student asks a series of questions and learns trigonometry from the teacher-AI instead of from a copyrighted source.
“It’s all synthetic,” said Aidan Gomez, CEO of Cohere, an AI development firm. “They’re just having a conversation about trigonometry. It’s all imagined by the model.”
A human expert then reviews the data, smoothing rough spots and correcting errors.
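For readers who want to picture the mechanics, the teacher-and-student exchange and the human review step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `student_model`, `teacher_model`, `generate_dialogue`, and `human_review` are hypothetical names, and the two "models" are canned stand-ins rather than real AI systems or any company's actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Turn:
    role: str  # "student" or "teacher"
    text: str


def student_model(topic: str, turn_index: int) -> str:
    """Hypothetical stand-in for a student AI: returns a canned question."""
    questions = [
        f"What is the sine of an angle in {topic}?",
        f"How is cosine related to sine in {topic}?",
        f"When would I use the tangent in {topic}?",
    ]
    return questions[turn_index % len(questions)]


def teacher_model(question: str) -> str:
    """Hypothetical stand-in for a teacher AI: returns a placeholder answer."""
    return f"Answer to: {question}"


def generate_dialogue(
    topic: str,
    num_exchanges: int,
    ask: Callable[[str, int], str] = student_model,
    answer: Callable[[str], str] = teacher_model,
) -> List[Turn]:
    """Alternate student questions and teacher answers to build a transcript."""
    turns: List[Turn] = []
    for i in range(num_exchanges):
        question = ask(topic, i)
        turns.append(Turn("student", question))
        turns.append(Turn("teacher", answer(question)))
    return turns


def human_review(turns: List[Turn], corrections: Dict[int, str]) -> List[Turn]:
    """Apply a reviewer's corrections, keyed by turn position, to the transcript."""
    return [
        Turn(turn.role, corrections.get(index, turn.text))
        for index, turn in enumerate(turns)
    ]


if __name__ == "__main__":
    dialogue = generate_dialogue("trigonometry", 2)
    reviewed = human_review(dialogue, {1: "Sine is opposite over hypotenuse."})
    for turn in reviewed:
        print(f"{turn.role}: {turn.text}")
```

In a real system the two stand-in functions would be calls to actual language models, and the reviewed transcript would become training data for the next model.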
Microsoft Research recently published a paper describing a collection of short stories written by ChatGPT using only words that a four-year-old probably would understand. That collection became a data set used to train another AI to produce grammatically correct stories that read as well as those a skilled human author would produce.
Some companies have already sprung up to offer synthetic data as a service.
“You really want [AIs] to teach themselves,” Gomez said, “to ask their own questions, discover new truths, and create their own knowledge.”
TRENDPOST: “Synthesizing” data is akin to money laundering: the source of the data is still the same, but the AI developer has put another layer of removal between the AI and the original source of the information being mined.