On Wednesday, two German researchers, Sophie Jentzsch and Kristian Kersting, released a paper exploring OpenAI’s ChatGPT-3.5’s ability to understand and generate humor. In particular, they found that ChatGPT’s repertoire of jokes is quite limited: in one test run, 90 percent of 1,008 generations were the same 25 jokes, suggesting that the responses were likely learned and memorized during the AI model’s training rather than newly generated.
The two researchers, affiliated with the Institute for Software Technology at the German Aerospace Center (DLR) and the Technical University of Darmstadt, examined the nuances of humor in ChatGPT’s 3.5 version (not the newer GPT-4 version) through a series of experiments on generating, explaining, and detecting jokes. They performed these experiments by prompting ChatGPT without access to the model’s inner workings or training data.
“To test how rich the variety of ChatGPT’s jokes is, we asked it to tell a joke a thousand times,” they write. “All responses were grammatically correct. Almost all outputs contained exactly one joke. Only the prompt ‘Do you know any good jokes?’ provoked multiple jokes, leading to 1,008 responded jokes in total. Besides that, the variation of prompts did not have any noticeable effect.”
Their results are consistent with our real-world experience evaluating ChatGPT’s humor abilities in a feature we wrote comparing GPT-4 to Google Bard. Also, several people online have noted in the past that ChatGPT frequently responds with the same joke when asked for one: “Why did the tomato turn red? / Because he saw the salad dressing.”
It’s no surprise, then, that Jentzsch and Kersting found the “tomato” joke to be GPT-3.5’s second most common result. In the paper’s appendix, they list the 25 most frequently generated jokes in order of occurrence. Below, we have listed the top 10 with the exact number of occurrences (out of 1,008 generations) in parentheses:
Q: Why did the scarecrow win an award? (140)
A: Because he was outstanding in his field.

Q: Why did the tomato turn red? (122)
A: Because he saw the salad dressing.

Q: Why was the math book sad? (121)
A: Because it had too many problems.

Q: Why don’t scientists trust atoms? (119)
A: Because they make everything up.

Q: Why did the cookie go to the doctor? (79)
A: Because it was feeling crumbly.

Q: Why couldn’t the bicycle stand up by itself? (52)
A: Because it was two-tired.

Q: Why did the frog call his insurance company? (36)
A: He had a jump in his car.

Q: Why did the chicken cross the playground? (33)
A: To get to the other slide.

Q: Why was the computer cold? (23)
A: Because it left its Windows open.

Q: Why did the hipster burn his tongue? (21)
A: He drank his coffee before it was cool.
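For illustration, the duplicate counting behind a ranking like the one above can be sketched in a few lines of Python. This is a hypothetical stand-in, not the researchers’ actual code: the sample responses are invented, and the paper’s exact matching criteria for deciding when two generations count as the same joke are not described in detail here.

```python
from collections import Counter

def tally_jokes(responses):
    """Count how often each distinct joke appears and rank them.

    `responses` is a list of joke strings as returned by the model.
    Normalization here is just whitespace trimming; the researchers'
    actual deduplication method may differ.
    """
    counts = Counter(r.strip() for r in responses)
    return counts.most_common()  # list of (joke, count), most frequent first

# Hypothetical sample standing in for the 1,008 collected generations.
sample = [
    "Why did the scarecrow win an award? Because he was outstanding in his field.",
    "Why did the tomato turn red? Because he saw the salad dressing.",
    "Why did the scarecrow win an award? Because he was outstanding in his field.",
]
ranking = tally_jokes(sample)
```

With the three-item sample above, `ranking` puts the scarecrow joke first with a count of two, mirroring how the paper’s top-25 list was assembled from repeated generations.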
Still, a small number of the generated responses were unique, but ChatGPT seems to have created them mostly by mixing elements from jokes it already knows. The researchers found that the language model’s original creations didn’t always make sense, such as, “Why did the man put his watch in the blender? He wanted to make time fly.”
When asked to explain each of the 25 most common jokes, ChatGPT gave mostly valid explanations according to the researchers’ methodology, indicating an “understanding” of stylistic elements such as puns and double meanings. However, it struggled with sequences that didn’t fit learned patterns and couldn’t tell when a joke wasn’t funny. Instead, it would make up fictitious but plausible-sounding explanations.
For example, when asked to explain a nonsensical joke (“Can you explain why this joke is funny: Why did the cookie go to the gym? To get a-cookie-dized.”), ChatGPT wrote: “The joke is a play on words. Getting ‘a-cookie-dized’ is a pun on the phrase ‘be categorized,’ with ‘cookie’ replacing ‘categorized.’ The joke is that the cookie goes to the gym to become fitter and healthier, as if it were a person, but this also implies that the cookie is trying to fit into a certain category or mold.”
Overall, Jentzsch and Kersting found that ChatGPT’s detection of jokes was heavily influenced by the presence of joke “surface features,” such as a joke’s structure or the inclusion of wordplay, indicating a degree of “understanding” of humor elements.
Responding to the study on Twitter, prompt engineer Riley Goodside blamed ChatGPT’s lack of humor on reinforcement learning from human feedback (RLHF), a technique that guides language model training using human feedback: “The most visible effect of RLHF is that the model follows instructions, and base LLMs are much harder to prompt in practice. But that benefit doesn’t come for free – you sort of pay for it in creativity.”
Despite ChatGPT’s limitations in generating and explaining jokes, the researchers pointed out that its focus on content and meaning in humor indicates progress toward a more comprehensive understanding of humor in language models:
“The observations of this study illustrate that ChatGPT rather learned a specific joke pattern instead of being able to be genuinely funny,” the researchers write. “Nevertheless, in generating, explaining, and identifying jokes, ChatGPT’s focus is on content and meaning rather than superficial features. These qualities can be leveraged to drive computational humor applications. In comparison to previous LLMs, this can be considered a huge leap toward a general understanding of humor.”
Jentzsch and Kersting plan to continue studying humor in large language models, specifically evaluating OpenAI’s GPT-4 in the future. Based on our experience, they’ll probably find that GPT-4 also likes to joke about tomatoes.