How ChatGPT and Our Language Models Are Developed
Written by Michael Schade - 31 July 2023
OpenAI’s large language models, including the models that power ChatGPT, are developed using three primary sources of information: (1) information that is publicly available on the internet, (2) information that we license from third parties, and (3) information that our users or our human trainers provide.
This article provides an overview of the publicly available information we use to help develop our models and how we collect and use that information in compliance with privacy laws. To understand how we collect and use information from users of our services, including how to opt out of having ChatGPT conversations used to help teach our models, please see our Privacy Policy and this help center article.
What is ChatGPT, and how does it work?
ChatGPT is an artificial intelligence-based service that you can access via the Internet. You can use ChatGPT to organize or summarize text or to write new text. ChatGPT has been developed in a way that allows it to understand and respond to user questions and instructions. It does this by “reading” a large amount of existing text and learning how words tend to appear in context with other words.
It then uses what it has learned to predict the next most likely word that might appear in response to a user request, and each subsequent word after that. This is similar to auto-complete capabilities on search engines, smartphones, and email programs.
As an example, during the model learning process (called “training”), we might have a model try to complete the sentence: “Instead of turning left, she turned ___.” Before training, the model will respond with random words, but as it reads and learns from many lines of text, it better understands this type of sentence and can predict the next word more accurately. It then repeats this process across a very large number of sentences.
Because there are many possible words that could come next in this sentence (e.g., instead of turning left, she turned “right,” “around,” or “back”), there is an element of randomness in the way a model can respond, and in many cases, our models will answer the same question in different ways.
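To make the autocomplete analogy concrete, here is a minimal, hypothetical sketch of weighted next-word sampling. It is not OpenAI’s actual code; the candidate words and their probabilities are invented for illustration, but it shows why the same prompt can come out differently each time.

```python
# A minimal sketch (not OpenAI's actual code) of how next-word prediction
# with a bit of randomness can yield different completions for the same prompt.
import random

# Hypothetical probabilities a trained model might assign to the next word
# after "Instead of turning left, she turned ..."
next_word_probs = {"right": 0.55, "around": 0.25, "back": 0.15, "away": 0.05}

def sample_next_word(probs):
    """Pick one word at random, weighted by the model's predicted probabilities."""
    words = list(probs.keys())
    weights = list(probs.values())
    return random.choices(words, weights=weights, k=1)[0]

prompt = "Instead of turning left, she turned"
for _ in range(3):
    # The same prompt can end differently on each run because sampling is random.
    print(prompt, sample_next_word(next_word_probs))
```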
Machine learning models are made up of large strings of numbers, called “weights” or “parameters,” and code that interprets and executes those numbers. Models do not contain or store copies of information that they learn from. Instead, as a model learns, some of the numbers that make up the model change slightly to reflect what it has learned.
In the example above, the model read the information that helped it improve from predicting random incorrect words to predicting more accurate words, but all that actually happened in the model itself was that the numbers changed slightly. The model did not store or copy the sentences that it read.
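As a rough illustration of that point, the toy sketch below (which is not how our models are actually trained, and uses made-up numbers) nudges a few numeric weights toward the words seen in example sentences and then discards the examples themselves; only the numbers persist.

```python
# A toy illustration (not OpenAI's actual training code) of the idea that
# "learning" means nudging numeric weights, not storing the training text.

# Hypothetical scores the model assigns to candidate next words.
weights = {"right": 0.0, "around": 0.0, "back": 0.0}

learning_rate = 0.1
training_examples = ["right", "around", "right", "right", "back"]  # observed next words

for observed_word in training_examples:
    # Nudge the weight for the word that actually appeared; the sentence
    # itself is thrown away after this step -- only the numbers change.
    weights[observed_word] += learning_rate

print(weights)  # the weights have shifted slightly toward the words seen during training
```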
What type of information is used to teach ChatGPT?
As noted above, ChatGPT and our other services are developed using (1) information that is publicly available on the internet, (2) information that we license from third parties, and (3) information that our users or human trainers provide. This article focuses on the first set: information that is publicly available on the internet.
For this set of information, we only use publicly available information that is freely and openly available on the internet – for example, we do not seek information behind paywalls or from the “dark web.” We apply filters and remove information that we do not want our models to learn from or output, such as hate speech, adult content, sites that primarily aggregate personal information, and spam. We then use the information to teach our models.
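As a simplified, hypothetical sketch of that kind of filtering, the snippet below drops pages from invented blocked domains and pages containing placeholder keywords. Every domain, keyword, and function name here is illustrative only; the real filters are far more extensive.

```python
# A simplified, hypothetical sketch of pre-training data filtering --
# the categories mirror the ones mentioned above, but the rules and
# names are invented for illustration.

BLOCKED_DOMAINS = {"example-spam-site.test", "example-people-search.test"}  # hypothetical
BLOCKED_KEYWORDS = {"hate-speech-marker", "adult-content-marker"}           # placeholders

def keep_document(url, text):
    """Return True if a publicly available page should be kept for training."""
    domain = url.split("/")[2] if "//" in url else url
    if domain in BLOCKED_DOMAINS:
        return False  # drop spam sites and personal-information aggregators
    lowered = text.lower()
    if any(keyword in lowered for keyword in BLOCKED_KEYWORDS):
        return False  # drop pages flagged for unwanted content
    return True

docs = [
    ("https://example.org/article", "An ordinary public article about cooking."),
    ("https://example-spam-site.test/page", "Buy now! Buy now! Buy now!"),
]
filtered = [(url, text) for url, text in docs if keep_document(url, text)]
print(len(filtered), "of", len(docs), "documents kept")
```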
As mentioned in the previous section, ChatGPT does not copy or store training information in a database. Instead, it learns about associations between words, and those learnings help the model update its numbers/weights. The model then uses those weights to predict and generate new words in response to a user request. It does not “copy and paste” training information – much like a person who has read a book and set it down, our models do not have access to training information after they have learned from it.
Is personal information used to teach ChatGPT?
A large amount of data on the internet relates to people, so our training information does incidentally include personal information. We don’t actively seek out personal information to train our models.
We use training information only to help our models learn about language and how to understand and respond to it. We do not and will not use any personal information in training information to build profiles about people, to contact them, to advertise to them, to try to sell them anything or to sell the information itself.
Our models may learn from personal information to understand how things like names and addresses fit within language and sentences or to learn about famous people and public figures. This makes our models better at providing relevant responses.
How does the development of ChatGPT comply with privacy laws?
We use training information lawfully. Large language models have many applications that provide significant benefits and are already helping people create content, improve customer service, develop software, customize education, support scientific research, and much more. These benefits cannot be realized without a large amount of information to teach the models. In addition, our use of training information is not meant to negatively impact individuals, and the sources of this training information are already publicly available.
For these reasons, we base our collection and use of personal information that is included in training information on legitimate interests according to privacy laws like the GDPR. To fulfill our compliance obligations, we have also completed a data protection impact assessment to help ensure we are collecting and using this information legally and responsibly.
We respond to objection requests and similar rights. As a result of this learning process, ChatGPT responses may sometimes include personal information about individuals whose personal information appears multiple times on the public internet (for example, public figures). Individuals in certain jurisdictions can object to the processing of their personal information by our models by filling out this form. Individuals may also have the right to access, correct, restrict, delete, or transfer their personal information that may be included in our training information. You can exercise these rights by reaching out to dsar@openai.com.
Please be aware that, in accordance with privacy laws, some rights may not be absolute. We may decline a request if we have a lawful reason for doing so. However, we strive to prioritize the protection of personal information and comply with all applicable privacy laws. If you feel we have not adequately addressed an issue, you have the right to lodge a complaint with your local supervisory authority.
We protect training information and limit how it is used and shared. To keep this information safe, we use commercially reasonable technical, physical, and administrative measures like access controls, audit logs, read-only permissions, and encrypting stored data. For more information on our security practices, please visit https://www.openai.com/security.
We also take steps to reduce the processing of personal information when training our models. For example, we remove websites that aggregate large volumes of personal information and we try to train our models to reject requests for private or sensitive information about people.
We do not sell training information to third parties and only disclose portions of the information when necessary and consistent with our Privacy Policy.
We only keep this information for as long as we need it to serve its intended purpose. How long we keep this information hinges on factors like its quantity, type, and sensitivity, the risk of harm from unauthorized use or sharing, whether the information is still necessary or useful to train or update our models, and any legal requirements.
Our data controller under the GDPR is OpenAI OpCo, LLC at 3180 18th Street, San Francisco, CA, United States. For information about our EEA and UK representative for data protection matters, please see our Privacy Policy.
Our Data Protection Officer can be contacted at privacy@openai.com.
The most popular large language model right now is ChatGPT, which is owned by OpenAI and runs on a generative pre-trained transformer, or GPT. Basically, it’s trained on a whole lot of text taken from all over the web, which it then draws on to respond in ways that seem naturally human. Currently, ChatGPT is running on GPT-3.5 and GPT-4, and it’s also integrated into the AI interface for Microsoft’s Bing search engine and Edge web browser.
Though you’ve probably heard that GPT language models are — or could become — sentient, you don’t really have to worry. They work the same way the autocomplete function on your phone works, except they’re using massive chunks of the internet to reply to you. But generative AI can’t generate something it doesn’t know.
For instance, ChatGPT still currently has a “knowledge cutoff” of 2021. There have been instances of current events slipping into its data set — it seems to know that the queen of England died — but for the most part, it still thinks it’s 2021 when you talk to it. When an AI doesn’t know something but is pushed to answer a question about it, it will “hallucinate,” or spit out nonsense errors. Users on sites like Reddit will also often share ways of “jailbreaking” these chatbots so they can answer questions that are banned by their terms of service.
Meanwhile, Bing’s version of GPT-4 is actually connected to a live feed of the internet. It can summarize current events easily, though when it first launched, it threatened journalists who were writing adversarial pieces about it and begged users to kill it. A universal reaction to using the internet, whether you’re a man or a machine, apparently. (It’s fixed now.)
Though GPT-4, OpenAI’s newest language model, is still only a few months old, we’ve already got a good handle on what it is and isn’t good at. It’s pretty bad at creating anything. If you ask it to tell you a joke or write you a song, it will, but the results aren’t great. It’s the same if you ask it to write, say, a basic scene of two people having a meet-cute in a romantic comedy. But what it is good at is summarizing huge amounts of data.
One user asked it to read the entirety of J.R.R. Tolkien’s work on Middle-earth and then asked the AI if there was any evidence that people in Middle-earth poop (there isn’t). Another user asked it to summarize — and make a diagnosis based on — his dog’s blood charts (the diagnosis was largely correct). And back in February, a Twitch stream called Nothing, Forever launched, which used GPT-3 to endlessly generate Seinfeld scripts that would then be “acted out” by AI 24 hours a day, seven days a week. The scripts were mostly gibberish, but the channel was a huge hit… until its content became transphobic and its creators took it down for maintenance.
The trainability of these models is another, thornier risk for creatives trying to regulate this technology. Doctorow said that if a right to train an AI were created (as in, if writers suddenly had a legal right to say who could and couldn’t train an AI on their writing), that right could become a demand from prospective employers.
“All the employers will demand that you assign that right to them as a condition of working for them,” he said. “It’s just a roundabout way of saying that large corporations who are in a buyer's market for creative labor will be the only people in a position to build models that can be used to fire all the creators.”
But exactly how good is this technology? Well, this month a Disney fan blog wrote a story titled “An AI rewrites the Star Wars Sequel Trilogy. It’s more coherent than Disney’s.” The AI was asked to “pretend that Disney did not create any more Star Wars movies after Return of the Jedi” and then imagine the plots of movies that George Lucas would have created instead of the sequel trilogy we got.
It came up with three movies: Episode VII: The Force Unleashed, Episode VIII: Rise of the Sith, and Episode IX: Balance of the Force. It even described a pretty tight narrative arc across the three movies, featuring a trio of main characters, including “a young Jedi named Kira,” “a former stormtrooper named Sam,” and “a wise old Jedi named Ben.” Honestly, the AI’s new sequel trilogy sort of reads like the one that actually happened, but it moves the discovery of an evil Sith home world and the reveal of a Sith master villain to the second movie, not the third.
The AI also offered up casting for the main characters, descriptions of new planets and vehicles, and even took a stab at writing a scene where Luke Skywalker, Princess Leia, and Han Solo show up for cameos. Is any of this perfect? Absolutely not. Is it good enough for a human to potentially clean it up for actual production or publication? No question.
Users have found that generative AI chatbots are particularly good at brainstorming genre entertainment because there are more agreed-upon tropes to mix and match. But, as the Hollywood Reporter pointed out last week, it could also be used to cross the picket line during the strike for long-running shows like Law & Order. The more scripts the AI has to summarize, the more novel ways it can remix them.
But, also, let’s be clear about what we’re talking about here. A generative AI like GPT-4 can’t create anything new. It can only reorganize what already exists into new combinations. A film industry where chatbots are allowed in writers' rooms — or allowed to replace them entirely — is an industry that has quite literally run out of ideas.
But Hollywood isn’t the only sector of the American workforce in the midst of a generative AI upheaval. IBM has completely paused its hiring, and one executive at the company recently speculated that almost a third of its employees could be replaced with a chatbot of some kind. And retail isn’t safe either. Wendy’s is testing a drive-thru chatbot as we speak. A recent Bloomberg article estimated that at least a quarter of the American workforce will have their jobs impacted by AI in the next five years.
And there are non-unionized creative industries that are already beginning to flirt with how to use these tools in a writer's room. Ashley Cooper, a narrative lead and video game developer, told Polygon that she was approached by an indie video game studio recently that was very interested in using AI to write scripts.
“An indie studio I’d done some work for a few years back emailed me asking, ‘We are looking for a writer to use some AI and get us some back-and-forth dialogue,’” she said.
Cooper said that the games industry is already pretty bad at hiring new creatives, and she fears that using AI could destroy ways for young people to get a foothold.
“At its core, [AI] doesn’t exist to make the lives of writers easier; it exists to minimize a studio’s need for writers,” she said.