The Words That Give Away Generative AI Text

    So far even AI companies have struggled to come up with tools that can reliably detect when a piece of text was generated using a large language model. Now, a group of researchers has developed a new method for estimating LLM usage in a large set of scientific texts by measuring which “redundant words” were much more common during the LLM era (i.e., 2023 and 2024). The results “suggest that at least 10 percent of 2024 abstracts were processed using LLMs,” the researchers said.

    In a preprint paper published earlier this month, four researchers from Germany’s University of Tübingen and Northwestern University said they were inspired by studies that measured the impact of the Covid-19 pandemic by looking at excess deaths compared to the recent past. Taking a similar look at “excessive word use” after LLM writing tools became widely available in late 2022, the researchers found that “the rise of LLMs led to an abrupt increase in the frequency of certain style words” that was “unprecedented in both quality and quantity.”

    Dive in

    To measure these changes in vocabulary, the researchers analyzed 14 million paper abstracts published on PubMed between 2010 and 2024, tracking the relative frequency of each word as it appeared in each year. They then compared the expected frequency of those words (based on the pre-2023 trend line) with the actual frequency of those words in abstracts from 2023 and 2024, when LLMs were more widespread.
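    The core of that comparison can be sketched in a few lines of code. The snippet below is a minimal illustration of the excess-frequency idea, with invented yearly frequencies rather than the researchers' actual PubMed data or code: fit a linear trend to a word's pre-LLM frequencies, extrapolate it to 2024, and compare against the observed value.

```python
# Sketch of the excess-frequency idea: extrapolate a word's pre-2023
# trend and compare it to the observed post-LLM frequency.
# All frequencies below are invented for illustration, not real PubMed data.

def expected_frequency(years, freqs, target_year):
    """Least-squares linear fit over the pre-LLM years, extrapolated forward."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(freqs) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(years, freqs))
             / sum((x - mean_x) ** 2 for x in years))
    intercept = mean_y - slope * mean_x
    return intercept + slope * target_year

# Hypothetical share of abstracts containing some word, 2018-2022.
pre_llm_years = [2018, 2019, 2020, 2021, 2022]
pre_llm_freqs = [0.0010, 0.0011, 0.0012, 0.0013, 0.0014]

expected_2024 = expected_frequency(pre_llm_years, pre_llm_freqs, 2024)
observed_2024 = 0.040  # invented post-LLM frequency

excess_ratio = observed_2024 / expected_2024
print(f"excess usage ratio: {excess_ratio:.1f}x")
```

    A ratio far above 1 is the signature the study looks for: the word appears many times more often than its own historical trend predicts.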

    The analysis turned up a number of words that were extremely uncommon in these scholarly abstracts before 2023 but that suddenly surged in popularity after LLMs were introduced. For example, the word “delves” appears in 25 times as many 2024 papers as the pre-LLM trend would predict, while words like “showcasing” and “underscores” increased roughly ninefold. Other, already-common words also became noticeably more frequent in post-LLM abstracts: the frequency of “potential” increased by 4.1 percentage points, “findings” by 2.7 percentage points, and “crucial” by 2.6 percentage points, for example.

    Of course, these kinds of changes in word usage can happen independently of LLM usage: the natural evolution of language means that words sometimes go in and out of fashion. However, the researchers found that in the pre-LLM era, such huge and sudden year-on-year increases were only seen for words associated with major global health events: “ebola” in 2015; “zika” in 2017; and words like “coronavirus,” “lockdown,” and “pandemic” in the period 2020 to 2022.

    In the post-LLM period, however, the researchers found hundreds of words with sudden, pronounced increases in scientific usage that had no common link to world events. While the excess words during the Covid pandemic were overwhelmingly nouns, the researchers found that the words with a post-LLM frequency bump were overwhelmingly “style words” such as verbs, adjectives, and adverbs (a small sample: “about, in addition, extensively, crucially, improving, exhibiting, insights, in particular, particularly, within”).

    This is not an entirely new finding: the increased prevalence of “delve” in scientific papers, for example, has been noted many times in the recent past. But previous studies have typically relied on comparisons with “ground truth” human writing samples or lists of predefined LLM markers obtained outside the study. Here, the pre-2023 set of abstracts acts as its own effective control group to show how vocabulary choice has generally changed in the post-LLM era.

    A complicated interplay

    By highlighting hundreds of these so-called “marker words” that have become significantly more common in the post-LLM era, the researchers make the telltale signs of LLM usage easier to spot. Take this example abstract line cited by the researchers, with the marker words highlighted: “A comprehensive grasp of the intricate interplay between […] and […] is pivotal for effective therapeutic strategies.”

    After performing statistical measurements of how often marker words occur in individual papers, the researchers estimate that at least 10 percent of papers in the post-2022 PubMed corpus were written with at least some LLM assistance. The true number could be even higher, the researchers say, because their marker-word list may miss LLM-assisted abstracts that don’t contain any of the words they identified.
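    A toy version of that counting logic might look like the following. This is not the researchers' actual statistical procedure; the marker words, abstracts, and baseline rate are all illustrative, and the idea is simply to flag abstracts containing any marker word and subtract the rate at which such matches occurred incidentally before 2023.

```python
# Toy lower-bound estimate: flag abstracts containing any marker word,
# then subtract an (invented) pre-LLM baseline rate of incidental matches.
# Marker words and abstracts are illustrative, not the study's data.

MARKER_WORDS = {"delves", "showcasing", "underscores", "pivotal", "intricate"}

def contains_marker(abstract):
    """True if the abstract contains at least one marker word."""
    words = {w.strip(".,;:").lower() for w in abstract.split()}
    return bool(words & MARKER_WORDS)

post_llm_abstracts = [
    "This study delves into the intricate mechanisms of repair.",
    "We measured enzyme activity under three conditions.",
    "Our results are pivotal, underscoring the role of signaling.",
    "Patients were randomized into two groups.",
]

flagged = sum(contains_marker(a) for a in post_llm_abstracts)
baseline_rate = 0.05  # invented pre-2023 rate of incidental marker matches
estimate = flagged / len(post_llm_abstracts) - baseline_rate
print(f"estimated LLM-assisted share (lower bound): {estimate:.0%}")
```

    As the article notes, this kind of count is a floor, not a ceiling: any LLM-assisted abstract that happens to avoid every word on the marker list slips through undetected.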