Perplexity and Burstiness — How AI Detection Actually Works
2026-03-09 · 7 min read
When you run text through an AI detector, the result is not magic — it is math. Behind that "78% AI probability" score are specific algorithms measuring specific properties of the text. If you understand what those algorithms look for, you can make sense of detection results instead of just trusting or distrusting them blindly.
The two most commonly discussed metrics are perplexity and burstiness, but they are only part of the picture. Here is how each metric works, what it catches, and where it falls short.
Perplexity: How Surprising Is the Text?
Perplexity measures how predictable a sequence of words is. In technical terms, it is the exponentiated average negative log-likelihood of each token given the preceding context. In plain language: if you covered up each word in a sentence and tried to guess it from what came before, perplexity measures how hard that guessing game would be.
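The definition above translates directly into a few lines of code. This is a minimal sketch assuming you already have per-token log-probabilities from some scoring model; the function name and example values are illustrative, not part of any real detector's API.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood.

    token_logprobs: natural-log probabilities assigned to each token
    given its preceding context by a scoring model.
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Predictable text: every token was highly likely (p = 0.9)
predictable = [math.log(0.9)] * 10
# Surprising text: every token was unlikely (p = 0.1)
surprising = [math.log(0.1)] * 10

print(perplexity(predictable))  # ≈ 1.11 (low: easy guessing game)
print(perplexity(surprising))   # 10.0 (high: hard guessing game)
```

Note that perplexity of a uniformly likely sequence collapses to 1/p, which matches the intuition: a model that assigns every word a 10% chance is, on average, choosing among ten equally plausible options.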
Low perplexity means the word choices are predictable. High perplexity means the text contains surprises — unusual word choices, unexpected turns of phrase, or creative combinations that a language model would not rank as the most likely next token.
This matters because of how AI generates text. Models like GPT-4, Claude, and Gemini work by predicting the most probable next token at each step. Even with temperature sampling, the output gravitates toward high-probability sequences. The result is fluent and coherent but statistically predictable.
Human writing works differently. We choose words for reasons beyond probability — emphasis, humor, rhythm, personal habit, or simply because we like how a word sounds. A human might write "the algorithm chewed through the dataset" where a language model would write "the algorithm processed the dataset." Both are correct, but "chewed" is a lower-probability choice that raises perplexity.
N-Gram Perplexity
Basic perplexity looks at individual words, but n-gram perplexity examines sequences of two, three, or more words together. This is more powerful because AI's predictability shows up most clearly in phrases, not just individual word choices. The phrase "it is important to note that" has very low n-gram perplexity because it is a cliche that language models produce constantly. A human might write "worth flagging here" instead — same meaning, higher perplexity, more distinctive voice.
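To make the n-gram idea concrete, here is a toy bigram scorer. It estimates probabilities from word-pair counts with add-one smoothing; real detectors use far larger models and longer n-grams, so treat the corpus, function name, and smoothing choice as illustrative assumptions.

```python
from collections import Counter
import math

def bigram_perplexity(text, corpus):
    """Score `text` with a bigram model estimated from `corpus`.

    Add-one smoothing gives unseen bigrams a small nonzero
    probability so the log never blows up.
    """
    corpus_words = corpus.lower().split()
    unigrams = Counter(corpus_words)
    bigrams = Counter(zip(corpus_words, corpus_words[1:]))
    vocab = len(unigrams)

    words = text.lower().split()
    logprob = 0.0
    for prev, cur in zip(words, words[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        logprob += math.log(p)
    return math.exp(-logprob / (len(words) - 1))

# A corpus saturated with a stock phrase
corpus = "it is important to note that the model predicts the next token " * 20

print(bigram_perplexity("it is important to note that", corpus))  # low: cliche
print(bigram_perplexity("worth flagging here instead", corpus))   # high: novel
```

The stock phrase scores low because every word pair in it is heavily attested in the corpus, while the fresher phrasing falls back to smoothed probabilities and scores high.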
Burstiness: How Much Does Complexity Vary?
If perplexity measures the average predictability of text, burstiness measures how much that predictability varies from sentence to sentence.
Human writing is naturally bursty. We write a short, blunt sentence. Then we follow it with a longer one that qualifies and extends the point, adding subordinate clauses and parenthetical details before arriving at the conclusion we were building toward. Then another short one for emphasis. This variation in sentence length, complexity, and information density is a fundamental feature of how people write.
AI-generated text is remarkably uniform by comparison. Sentence lengths cluster in a narrow range — typically between 15 and 25 words. The complexity stays consistent paragraph after paragraph. There are no three-word punches, no 50-word sprawls. The rhythm is flat, like a metronome.
Burstiness is typically measured as the variance or standard deviation of sentence-level perplexity scores across a document. High variance means bursty (human-like); low variance means uniform (AI-like).
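A measurement along those lines fits in one function. The per-sentence perplexity values below are hypothetical inputs, standing in for whatever sentence-level scores a detector has already computed.

```python
import statistics

def burstiness(sentence_perplexities):
    """Standard deviation of per-sentence perplexity scores.

    High values suggest bursty, human-like variation; low values
    suggest the flat uniformity typical of AI output.
    """
    return statistics.stdev(sentence_perplexities)

# Hypothetical per-sentence perplexities for two documents
human_doc = [12.0, 48.0, 9.0, 35.0, 60.0, 14.0]   # short punches, long sprawls
ai_doc    = [22.0, 25.0, 21.0, 24.0, 23.0, 22.0]  # metronome rhythm

print(burstiness(human_doc))  # high
print(burstiness(ai_doc))     # low
```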
This metric is particularly useful because it is hard for AI to fake. Even when prompted to "vary your sentence lengths," the variation AI produces is less extreme than genuine human variation. Real burstiness comes from shifts in thinking — you slow down to explain something complex, speed up when the point is obvious, and throw in a fragment when you want impact. AI does not think this way.
Entropy: Information Density
Entropy in text analysis measures the diversity of vocabulary and phrasing. Borrowed from information theory, it quantifies the average amount of information per element in a message.
High-entropy text uses a wide range of vocabulary, draws from different registers (formal, casual, technical, colloquial), and distributes word usage broadly. Low-entropy text relies on a smaller set of words and phrases, repeating patterns more frequently.
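The standard way to quantify this is Shannon entropy over the word-frequency distribution. A sketch, with two contrived example strings to show the direction of the effect:

```python
from collections import Counter
import math

def word_entropy(text):
    """Shannon entropy of the word-frequency distribution, in bits.

    H = -sum(p * log2(p)) over each distinct word's relative
    frequency. A wider, more even vocabulary yields higher entropy.
    """
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

narrow = "the model is significant the model is significant"
broad  = "the algorithm chewed through a hefty tangled dataset"

print(word_entropy(narrow))  # 2.0 bits (four words, heavily repeated)
print(word_entropy(broad))   # 3.0 bits (eight distinct words)
```

Eight words drawn evenly from only four types give exactly 2 bits, while eight distinct words give log2(8) = 3 bits, which is the maximum for a text of that length.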
AI-generated text tends toward lower entropy because language models converge on a "standard" vocabulary. They use "significant" instead of "hefty," "implement" instead of "pull off," "demonstrate" instead of "show." The vocabulary is correct but narrow. Human writers pull from a personal vocabulary shaped by reading, dialect, profession, and personality — producing higher entropy.
Entropy complements perplexity because it captures a different dimension. A text could have moderately high perplexity but low entropy. Measuring both gives a more complete picture.
Zipf's Law: The Frequency Fingerprint
Zipf's law states that in any large body of natural text, the frequency of a word is inversely proportional to its rank. The most common word appears roughly twice as often as the second most common, three times as often as the third, and so on. Plot word frequency against rank on a log-log scale and natural language produces a remarkably straight line.
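That straight line has a slope of roughly -1, which suggests a simple test: fit log-frequency against log-rank and inspect the slope. A least-squares sketch, using a synthetic word list built to be exactly Zipfian:

```python
from collections import Counter
import math

def zipf_slope(words):
    """Least-squares slope of log(frequency) vs. log(rank).

    Natural text approximates a slope of -1; a flatter slope
    (closer to 0) hints at over-use of mid-frequency words.
    """
    freqs = sorted(Counter(words).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Frequencies 100, 50, 33, 25, 20: inversely proportional to rank
zipfian = (["the"] * 100 + ["of"] * 50 + ["and"] * 33
           + ["to"] * 25 + ["in"] * 20)
print(zipf_slope(zipfian))  # ≈ -1.0
```

On real documents you would fit over hundreds of word types, not five, and compare the slope (and the residuals around the line) against the Zipfian ideal.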
Human writing follows this distribution naturally because it reflects the organic structure of language. We use common words like "the," "is," and "of" constantly, sprinkle in moderately common words, and reach for rare words only occasionally.
AI-generated text deviates from Zipf's law in subtle but measurable ways. Language models tend to over-use words in the middle frequency range, producing a distribution that is slightly flatter than the Zipfian ideal. You would never notice it reading the text, but a statistical test can pick it up, especially over longer documents.
Compliance with Zipf's law is a valuable signal because it is independent of content. The frequency distribution is a structural property of how the text was generated, not what it says.
Stylometry: The Writing Fingerprint
Stylometry is the oldest form of text analysis, predating AI by well over a century. It was originally used to attribute anonymous works to known authors by measuring unconscious writing habits — sentence length, punctuation patterns, function word usage, and vocabulary richness.
In AI detection, stylometry checks whether a text exhibits the personality of a human writer. Key features include function word ratios (humans use "but," "yet," "though" in idiosyncratic patterns), punctuation habits (semicolons, em dashes, and parenthetical asides appear at different rates in human vs. AI text), vocabulary richness measured by the type-token ratio, and paragraph length variation.
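The features above are cheap to compute. Here is a sketch extracting three of them; the particular function-word set and the rough sentence count are simplifying assumptions, not a production feature set.

```python
def stylometric_features(text):
    """A few lightweight stylometric signals.

    type_token_ratio: distinct words / total words (vocabulary richness).
    function_word_rate: share of a small set of connective words.
    semicolons_per_sentence: punctuation habit, per sentence-ish unit.
    """
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    function_words = {"but", "yet", "though", "however", "although"}
    # Crude sentence count: terminal punctuation marks, minimum one
    sentences = max(1, text.count(".") + text.count("!") + text.count("?"))
    return {
        "type_token_ratio": len(set(words)) / len(words),
        "function_word_rate": sum(w in function_words for w in words) / len(words),
        "semicolons_per_sentence": text.count(";") / sentences,
    }

sample = "I liked it; but the ending, though clever, fell flat."
print(stylometric_features(sample))
```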
Stylometric features are individually weak signals, but they complement the other metrics by capturing patterns that perplexity and burstiness miss. For a deeper look at how these differences manifest across models, see our comparison of ChatGPT vs. Claude vs. Gemini writing styles.
Why No Single Metric Is Enough
Each of these six signals has blind spots — which is why AI detector accuracy depends on how these metrics are combined. Perplexity alone flags formal academic writing as AI because it is naturally predictable. Burstiness alone misses AI text that has been lightly edited to vary sentence lengths. Entropy alone cannot distinguish between a technical paper with narrow vocabulary (human) and a general essay with narrow vocabulary (AI). Zipf's deviation is subtle enough that short texts produce noisy measurements. Stylometric features vary widely between individual human writers.
The power is in combination. ShaamAI Detector runs all six analyses simultaneously and synthesizes them into a single score with sentence-level detail. When perplexity, n-gram perplexity, burstiness, entropy, Zipf's compliance, and stylometric features all point in the same direction, the confidence is high. When they disagree, the ambiguity is visible — and that is useful information too.
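As a sketch of the combination step, suppose each metric has already been normalized into a 0-to-1 AI-likelihood score. A plain weighted average then blends them; the metric names, scores, and equal default weights below are purely illustrative, and a real detector would calibrate the weights against labeled data.

```python
def combined_score(signals, weights=None):
    """Blend per-metric AI-likelihood scores (each in [0, 1]) into one.

    `signals` maps metric name -> score, where 1.0 means "looks AI".
    Defaults to equal weights.
    """
    if weights is None:
        weights = {name: 1.0 for name in signals}
    total_weight = sum(weights[name] for name in signals)
    return sum(signals[name] * weights[name] for name in signals) / total_weight

doc = {
    "perplexity": 0.8,   # low perplexity: AI-like
    "burstiness": 0.7,   # flat rhythm: AI-like
    "entropy":    0.6,
    "zipf":       0.5,
    "stylometry": 0.6,
}
print(combined_score(doc))  # 0.64
```

When all the scores sit on the same side of 0.5, the blend is confidently one-sided; when they straddle it, the blended score lands near the middle, surfacing exactly the ambiguity described above.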
This multi-signal approach is why statistical detection works well despite running entirely in your browser with no neural network models. Each metric is lightweight to compute, but together they capture the statistical fingerprint of AI-generated text from multiple independent angles.
What This Means for You
Understanding these metrics makes you a better writer. The qualities that detectors look for — varied sentence lengths, surprising word choices, personal vocabulary, natural rhythm — are the same qualities that make writing engaging to read.
If you run your own text through a detector and see high AI probability on certain sentences, look at why. Is that sentence generic? Is it structured like something ChatGPT would write? You can check your text with ShaamAI Detector to see exactly which sentences trigger each metric. Revising those sentences with more voice and specificity will not only lower your AI score — it will make your writing stronger.
The math behind AI detection is not a mystery. It measures real properties of language that distinguish human creativity from statistical prediction — and understanding these properties also explains why false positives happen when human writing shares statistical traits with AI output. The more you understand those properties, the better you can write, evaluate, and think critically about the text you encounter.