GPT-4 Can't Flip a Coin

Matt Harrington
Jul 16, 2023

Updated: Jul 16, 2023

Picture this: you're a statistician in the distant future, and you're performing an experiment that relies on the flip of a coin. Knowing you have the all-powerful and self-aware GPT-8999 next to you, you ask the virtual assistant to flip a coin. The AI flips the coin up in the air, and down it comes: heads. Satisfied, you nod and jot that down. You ask for another coin flip, and again you get heads, which you also record. Flipping ten more times, you receive ten more heads in a row. Astonished, you put down your pencil and give your so-far faithful artificial companion a good shake, saying “If you do that again, I just won’t tolerate it!” You ask it for one last coin flip, and again the machine abashedly tells you the result was heads. Without hesitation, you toss the thing out of the window of your spaceship and grumble about “Ultra Language Models”.

Large Language Models (LLMs) like GPT-4 have shown an amazing ability to generate truly novel content, whether that means making up a brand new story or stringing together random words. To test this capacity, I ran a simple experiment in which I asked GPT-4 to make up 100 passwords by stringing together four “random words”. Although this might sound like a good way to create passwords, the model has a poor understanding of how to intentionally generate random content, and such passwords may be more predictable than they appear. For instance, the concatenated “FantasticMoonlightGardenHarmonySwift” is not fully random: “Moonlight” becomes more likely once “Fantastic” has already been written, and so on.

A coworker showed me a demonstration of this. Asked to model ten weighted coin flips with a 0.2 probability of heads, the model produced exactly one heads in the ten “flips”. Even more striking, when asked to do the same thing in 14 different sets of 10, the model returned exactly one heads in every single set. If the model were truly sampling with the probability it claimed to use, a series like this would occur only about once in a hundred million tries: each set has roughly a 27% chance of containing exactly one heads, and 0.27 multiplied by itself 14 times is about 0.00000001. What makes the result more surprising is that the model is, in principle, capable of sampling the probabilities correctly, since it chooses each token by drawing at random from a probability distribution. Somewhat comically, GPT-4 can even tell you that the preceding string of draws is unlikely, but it still cannot generate outputs that reflect the correct probabilities.
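The arithmetic behind that estimate can be checked in a few lines. This is only a sketch: it assumes every flip is independent with a 0.2 chance of heads and that all 14 sets contained exactly one heads, and the function name is made up for illustration.

```python
from math import comb

def prob_exact_heads(n_flips: int, k_heads: int, p_heads: float) -> float:
    """Binomial probability of exactly k_heads heads in n_flips independent flips."""
    return comb(n_flips, k_heads) * p_heads**k_heads * (1 - p_heads) ** (n_flips - k_heads)

# Chance that one set of ten flips at p = 0.2 contains exactly one heads.
p_one_set = prob_exact_heads(10, 1, 0.2)

# Chance that 14 independent sets ALL contain exactly one heads.
p_all_sets = p_one_set ** 14

print(f"one set: {p_one_set:.4f}")
print(f"all 14 sets: {p_all_sets:.2e}")
```

Under these assumptions a single set has about a 27% chance of containing exactly one heads, and fourteen such sets in a row come out on the order of one in a hundred million.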

This unexpected behavior can be partly explained by understanding how these models function. A crucial component of LLMs like GPT-4 is a mechanism known as 'self-attention.' This mechanism allows the model to weigh the importance of previous inputs when generating subsequent outputs. While useful for predicting the next term in a sequence, self-attention makes it challenging for the model to generate outcomes that are truly independent and random. It's as if the model thinks, "I've already generated one heads, so it should be unlikely that I generate any more." In many ways this tendency echoes the gambler’s fallacy in humans, where people are inclined to think that past random events influence future ones.
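To make the idea concrete, here is a toy sketch of attention over earlier tokens. It is not GPT-4's implementation: real models use learned projections, many attention heads and many layers, and the two-dimensional vectors below are made-up stand-ins for "heads" and "tails". The only point is that the output is computed from everything already in the context, so changing the history changes the result.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over earlier keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

def attend(query, keys, values):
    """Weighted average of the value vectors, using the attention weights."""
    weights = attention_weights(query, keys)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Hypothetical embeddings for earlier "flips" (illustrative numbers only).
heads_vec = [1.0, 0.0]
tails_vec = [0.0, 1.0]
query = [1.0, 1.0]

history_a = [heads_vec, tails_vec, tails_vec]  # one heads so far
history_b = [heads_vec, heads_vec, tails_vec]  # two heads so far

out_a = attend(query, history_a, history_a)
out_b = attend(query, history_b, history_b)
print(out_a)
print(out_b)
```

The same query produces a different output for each history, which is exactly the property that helps with next-word prediction and gets in the way of independent draws.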

In practical terms, why would this matter? Clearly GPT-4 can do some amazing things, and we already have computers that can generate convincingly random numbers. Well, imagine that in some distant future an LLM is used in place of a human judge during a court case. The plaintiff says something inflammatory or irrelevant. In a human court, the judge would order that statement struck from the record and then disregard it. The language model acting as the judge, however, would not be able to ignore the stricken comment. Despite clear instructions to disregard it, the model, biased towards past context, could not 'forget' the statement. This could let unfair information unduly influence the outcome of the trial, risking biased judgements and jeopardizing the justice process.

As a second practical example, imagine a system powered by an LLM that converses with users and shows them personalized advertisements. Say a user is pregnant and shares this during their chats, so the model starts tailoring ads toward pregnancy and baby products. Later, the user informs the system that they are no longer pregnant and asks it to stop showing related advertisements. Despite this direct instruction, the language model would still be influenced by the earlier context: the pregnancy. It could continue to suggest baby-related ads, failing to adjust its behavior to the new request. Not only could this be emotionally distressing for the user, especially in cases of pregnancy loss, but it also disregards the person’s preferences and consent. In this case, the model's inability to ignore specific context can lead to inappropriate personalization and potential emotional harm.

The challenge posed by GPT-4's limited ability to produce truly random results leads us to consider a few potential solutions. One possibility is to adjust the structure of the attention mechanism so the model can better suppress previous context. Such a change would modify the attention weights, the elements that determine how much past inputs influence current outputs, by giving flat or zero weight to previously generated content when needed. A related option is to adjust the model's sampling layer. This component is directly responsible for choosing the outputs, or 'random' results in our case. Altering or masking this layer may offer another path to more accurate randomness.
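As a rough illustration of what "flat or zero weights" could mean, the sketch below applies a mask before the softmax that turns attention scores into weights. This is an assumption about how such a change might look, not a description of any existing GPT-4 feature; the function name and scores are invented for the example.

```python
import math

def masked_softmax(scores, keep):
    """Softmax over scores; positions where keep is False get exactly zero weight."""
    if not any(keep):
        raise ValueError("at least one position must be kept")
    # A score of -inf becomes a weight of 0 after exponentiation.
    masked = [s if k else float("-inf") for s, k in zip(scores, keep)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]  # made-up attention scores for three earlier tokens

# Ordinary attention: the first token dominates.
normal = masked_softmax(scores, [True, True, True])

# Zero weight: the first token is suppressed entirely.
suppressed = masked_softmax(scores, [False, True, True])

# Flat weights: equal scores give every earlier token the same influence.
flat = masked_softmax([0.0, 0.0, 0.0], [True, True, True])

print(normal)
print(suppressed)
print(flat)
```

The hard part in practice would be deciding which positions to mask and when, which this sketch leaves entirely to the caller.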

A final approach to improving an LLM’s performance on such tasks could come from within the language model itself. That is, we could perhaps teach the model to limit the influence of previous context, for example by fine-tuning it specifically for the task. Fine-tuning, a common practice in machine learning, entails making small adjustments to a pre-trained model to tailor it to a specific task. In this case, the model might learn for itself how and when to ignore parts of the context.

In light of these findings, we should be mindful when asking GPT-4 or similar models to either produce randomness or forget past inputs. These models, by design, hold onto past information and won't ignore it unless specifically modified to do so. So, if we want them to do tasks involving randomness or selective memory, we'll need to tweak their design accordingly. As we continue to shape the future of AI, it is essential to address these constraints, ensuring the safety and effectiveness of the technology, especially in high-stakes areas like law and personal data handling.

Photo credits: Midjourney 5.2 with prompts:

robot flipping a quarter --s 75 --ar 7:4
drawing of very small humanoid robot being thrown out of a spaceship with stars in the background, minimalist, plain background --chaos 5 --ar 7:4
ai, robot face dressed as a british judge with wig, holding gavel and bench, courtroom --ar 7:4
Woman talking to robotic personal assistant in a futuristic city on a sunny day, style of retro futurism, floating in clouds, happiness --chaos 2 --ar 7:4