What AI detection scores can and can't tell you

title: "What AI detection scores can and can't tell you"
description: "A follow-up to our heuristic-detector post. Three illustrative iteration examples showing what the score moves on, and what it doesn't."
date: "2026-05-13"
tag: "Product"
author: "Inksong"

Inksong shows a before-and-after AI-likelihood score on every document. The previous post explained what's inside the number — burstiness, hedge density, function-word ratios, and the rest. This post is about what to do with the number once you have it. Three short scenarios from our own iteration runs, anonymized and lightly composited so they're useful without pretending to be specific customer claims. The thread connecting them is the same: the score is one signal among three, and reading it as the only signal is how you end up over-tuning a document into a worse one.

Case 1: an academic abstract that wouldn't budge

Anonymized example. A 200-word biology abstract, opening like this:

We investigated the effect of varying light wavelengths on the growth rate of Chlorella vulgaris under controlled laboratory conditions. Cultures were exposed to monochromatic light at 450, 525, and 660 nm over a 14-day period. Biomass accumulation was measured every 48 hours using optical density at 680 nm. Results indicate that blue light (450 nm) produced significantly higher growth rates than red or green wavelengths, suggesting wavelength-specific photosynthetic efficiency in this strain.

Initial Inksong score: 78. Default-settings humanization (tone: academic, domain: academic, humanness: 40) brought it to 41. The author wanted lower. They pushed humanness to 70 and ran a second pass.

Score dropped to 22. Looked great on the dashboard. The prose, less so. Sentence length now varied dramatically, with some sentences short and punchy and others long chains of loosely strung subordinate clauses, which is exactly what burstiness rewards. But abstracts aren't supposed to read that way. The genre demands uniform, declarative sentence rhythm. The high-humanness pass had pushed the document out of its own genre. The author reverted to the humanness-40 version.

The lesson: for some content, a score floor is structural. Abstracts are uniform-by-design. Conference summaries, executive summaries, regulatory filings, patent claims — same pattern. Uniform sentence length there is a function of genre, not of AI generation. Iterating past the structural floor doesn't make the document more human. It makes the document worse at its job, and the score doesn't know that.
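To make the structural floor concrete, here is a minimal sketch of a burstiness-style measure. It assumes burstiness can be approximated as the coefficient of variation of sentence length; that is an illustrative stand-in, not the formula Inksong's score uses, and the function names and sample texts are hypothetical.

```python
import re
import statistics


def sentence_lengths(text: str) -> list[int]:
    """Word count per sentence, using a naive split after ., ! or ?."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def burstiness(text: str) -> float:
    """Assumed proxy: population std dev of sentence length divided by the mean."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0  # one sentence has no variation to measure
    return statistics.pstdev(lengths) / statistics.mean(lengths)


# Abstract-style prose: three sentences of seven words each.
uniform = (
    "We grew the cultures for two weeks. "
    "We measured the biomass every two days. "
    "We compared the rates across three wavelengths."
)

# The same content with deliberately uneven rhythm.
varied = (
    "We grew cultures. "
    "Every two days, for the full two weeks of the experiment, "
    "we measured biomass by optical density. "
    "Blue won."
)

print(burstiness(uniform))  # 0.0, since every sentence has the same length
print(burstiness(varied))
```

Under this proxy the uniform text scores exactly zero and the varied text scores well above it, even though the uniform version is the one that reads like an abstract. A measure like this cannot tell genre-driven uniformity from machine-driven uniformity.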

Case 2: a blog draft where the score lied

A marketing blog draft, around 1,400 words. Initial score: 35. After one default-settings pass: 14. Dramatic drop. If you were grading by score alone, you'd ship it.

The author read it. Paragraphs 1 through 3 were fine. Paragraph 4 had a problem. The original draft opened that section with a punchy, deliberately short sentence, a stylistic choice by the writer. The rewrite, optimizing for the kind of varied-but-not-too-varied rhythm that the burstiness signal rewards, had flattened it into something competent and forgettable: five sentences of medium length, all syntactically similar.

The fix took two minutes — restore the punchy opener, leave the rest. But it wouldn't have happened from looking at the score. The score had already declared victory.

The lesson: the score is necessary but not sufficient. Look at the diff. Inksong's review view shows you, paragraph by paragraph, what changed. Spend the time reading it, especially on content where voice carries weight — marketing copy, opinion writing, anything with intentional rhythm.
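If you keep drafts outside Inksong, a rough version of the same paragraph-by-paragraph check can be sketched with Python's standard `difflib`. This is not how Inksong's review view works; the function name, the 0.9 threshold, and the assumption that both versions have the same number of blank-line-separated paragraphs are all illustrative.

```python
import difflib


def changed_paragraphs(
    before: str, after: str, threshold: float = 0.9
) -> list[tuple[int, float]]:
    """Return (paragraph number, similarity) for paragraphs that changed a lot.

    Assumes paragraphs are separated by blank lines and line up one-to-one
    between the two versions. The threshold is an arbitrary starting point.
    """
    old = [p.strip() for p in before.split("\n\n") if p.strip()]
    new = [p.strip() for p in after.split("\n\n") if p.strip()]
    flagged = []
    for number, (a, b) in enumerate(zip(old, new), start=1):
        # Compare word sequences so small punctuation edits don't dominate.
        ratio = difflib.SequenceMatcher(None, a.split(), b.split()).ratio()
        if ratio < threshold:
            flagged.append((number, round(ratio, 2)))
    return flagged


original = (
    "Intro stays the same.\n\n"
    "Stop. Read that again before you ship anything."
)
rewrite = (
    "Intro stays the same.\n\n"
    "It is worth pausing to read that again before shipping."
)

for number, ratio in changed_paragraphs(original, rewrite):
    print(f"Paragraph {number} changed (similarity {ratio}); read it before shipping.")
```

The output is only a list of places to look. Whether a flagged change is an improvement or a flattened stylistic choice is still a judgment you make by reading.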

Case 3: a sermon where a voice profile fixed what humanness couldn't

Pastoral content. Initial score on an AI-assisted sermon draft: 82. The author ran it through humanization at humanness 60, tone: pastoral, domain: pastoral. Score dropped to 35. A real improvement on paper.

But the rewrite read generic-warm rather than pulpit-specific. The pastoral tone preset knows about warmth, second-person address, scripture references, gentle hedging. What it doesn't know is this preacher's cadence — the specific way they alternate long meditative sentences with two-word imperatives, the recurring phrases, the rhythm of build-and-release that their congregation recognizes.

So the author added a Voice Profile trained on three prior sermons they'd preached themselves — about 8,000 words of transcript. Re-ran the humanization with the profile applied. Score dropped further, from 35 to 22, but more importantly the register came back. The rewrite now sounded like a sermon by that specific preacher, not a sermon by a competent stranger.

The lesson: voice profiles do work that humanness alone cannot. Humanness moves toward "any human." A voice profile moves toward "you." When the score plateaus and the prose still feels off, that's usually a voice problem, not a humanness problem.

The general rule

Treat the score as one of three signals.

The score itself. A meaningful drop — say 50 points or more — tells you the document moved on the axes the heuristic measures. A small drop or a plateau tells you you've hit a floor.
The diff. Read it. The score doesn't know what the document is supposed to sound like. You do. If the rewrite has flattened a deliberate stylistic choice or shifted a key argument, you'll see it in the diff before any detector flags it.
Any third-party detector your workflow requires. If your institution gates on Turnitin or GPTZero, run those externally and iterate against their feedback. Our score is internal and won't agree perfectly with theirs.

When the score plateaus, stop pushing humanness higher. Pushing humanness past the point where the score stops moving is where documents start to read worse. The right next move is usually one of three things: apply or improve a voice profile, switch the domain preset, or hand-edit the specific paragraphs the diff flags as awkward.
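The plateau rule can be written down as a small decision helper. This is a sketch under stated assumptions: the function name, the 5-point `min_drop` cutoff for what counts as a plateau, and the ordering of the suggestions are all hypothetical, not part of Inksong.

```python
def next_move(
    scores: list[int], min_drop: int = 5, has_voice_profile: bool = False
) -> str:
    """Suggest a next step from per-pass scores (lower means less AI-like).

    `min_drop` is an assumed cutoff: a drop smaller than this between the
    last two passes is treated as a plateau.
    """
    if len(scores) < 2:
        return "run a humanization pass"
    last_drop = scores[-2] - scores[-1]
    if last_drop >= min_drop:
        # The score is still moving, so check what the pass did to the prose.
        return "read the diff before iterating further"
    # Plateau: raising humanness further is the one move not on the list.
    if not has_voice_profile:
        return "apply a voice profile"
    return "switch the domain preset or hand-edit the paragraphs the diff flags"


print(next_move([82, 35]))
print(next_move([41, 40]))
print(next_move([41, 40], has_voice_profile=True))
```

Note that none of the branches returns "raise humanness". The helper also never replaces the diff: a moving score sends you to read the rewrite, not to ship it.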

Closer

The score is a useful instrument. It's not the only one in the kit, and it isn't the goal. The goal is a document that reads the way you want it to read and survives whatever scrutiny it will face downstream. For the technical breakdown of what the score actually contains, see "Honest about AI detectors."