Format preservation at scale: parsing DOCX without breaking the formatting

title: "Format preservation at scale: parsing DOCX without breaking the formatting" description: "Humanized prose is only useful if it lands back in your workflow with the formatting intact. Here's the parser-rebuilder pipeline." date: "2026-05-13" tag: "Engineering" author: "Inksong"

The quiet productivity killer in most humanizers is the input field. You paste your document, you get plain text back, and you spend the next twenty minutes putting the bold and the bullets and the headings back where they belong. The rewrite saved you nothing. Inksong takes DOCX in and gives DOCX out with formatting preserved. Doing that well turned out to be a much larger engineering problem than humanizing the prose itself.

This post is about the parser-rebuilder pipeline that sits between the file upload and the Claude call, and the specific places where it gets hard.

What DOCX actually is

A .docx file is a zip archive. Unzip one and you get a directory of XML files plus images and a manifest. The XML files describe the document's structure: paragraphs, the runs inside them, the styles applied to those runs, the section properties, the numbering definitions, and everything else Word needs to render the file the same way twice.

The hierarchy that matters for humanization is: paragraph → run → text. A paragraph is a block-level container. A run is a span inside the paragraph with uniform formatting. The text is the actual characters. Formatting — bold, italic, font, size, color, underline, strikethrough — lives at the run level, not the paragraph level and not the text level.

This means a single visually simple paragraph can be a surprising number of runs. The sentence "This is bold and italic and normal text" is at least five runs. Word splits and merges runs aggressively as the user edits, so even paragraphs that look clean can be a mess of single-character runs left over from old edits.

The implication for us is that we can't humanize at the run level. The model needs whole sentences. But we can't lose run-level formatting on the way back, either.

The parsing strategy

We use python-docx to walk the document. For each paragraph, two things happen in parallel.

First, we concatenate the text across all runs to produce the paragraph as a single string. This is what gets sent to Claude.

Second, we record a parallel structure: for each run in the paragraph, we store its starting character offset, its ending character offset, and the formatting it carries. This is a flat list of (start, end, formatting) tuples per paragraph.

The model humanizes the concatenated text. We get back the rewritten paragraph, also as a single string.

The hard part is mapping the rewritten string back to runs. We do this by re-aligning on a longest-common-subsequence basis: which characters in the new string correspond to which characters in the old string, where possible? Where alignment is clear, formatting carries over. Where alignment is ambiguous, we apply a tie-break rule that favors preserving formatting boundaries at word edges rather than mid-word.

The output is a rebuilt paragraph with new text and old formatting attached as faithfully as we can manage.

The hard case: text spans formatting boundaries

The honest failure mode is this. The source has "this is bold important" — three runs, the middle word bolded. The humanizer rewrites the whole phrase as "this matters." The word "bold" is gone. Where does the bold formatting attach now?

There is no correct answer. We use a heuristic: if the humanized version preserves at least the first character of the formatted span, formatting attaches to whatever word begins at that position. If it does not, the formatting boundary is dropped and the span becomes a single unformatted run.

This produces output that is right most of the time and visibly wrong sometimes. A reviewer who knows the original might notice that an emphasis they cared about has moved or disappeared. We don't claim to preserve every inline formatting decision on aggressive rewrites. We do preserve block-level formatting — paragraph styles, headings, lists, tables — reliably.

Users have asked for a "preserve formatting strictly" mode where we refuse to humanize spans that cross formatting boundaries. We've thought about it. The downside is that aggressive rewrites become impossible in formatted documents. It's a tradeoff we may expose as a setting.

PDFs: why we return DOCX

PDFs are the question we get most often. The answer is unsatisfying: PDFs aren't really a content format. They're a layout language. A PDF describes which glyphs appear at which coordinates on which page. There is no concept of "this paragraph reflows when the text changes."

We tried returning PDFs early on. We used pdfminer.six to extract the text, ran it through the humanizer, and rebuilt a PDF with the new text in the original layout. The output was always broken. Sentences ran off the page, line breaks happened mid-word, columns no longer aligned, footnote anchors pointed to the wrong line. Every approach to in-place text replacement in a PDF produces something you can't actually use.

So we return DOCX. The user uploads a PDF, we extract the text, we humanize, we emit a fresh DOCX with reasonable defaults — Times New Roman, 11pt, 1.15 spacing — and the user opens it in Word and applies whatever house style they need. We considered going to plain text as the safer default; DOCX won because it gives users a document, not a transcript.

TXT and Markdown

TXT is the easy case. We read the file, humanize the text, write the result back. No structure to preserve.

Markdown is preserved verbatim. The prompt explicitly instructs Claude not to alter *, _, #, list markers, code fences, or link syntax. Headings stay as headings. Bullets stay as bullets. Inline code stays as inline code. We don't reformat the Markdown either way — the user's choice of * or _ for emphasis is preserved.

The one Markdown case that's still imperfect is heavily-nested lists. The model occasionally collapses two levels of indentation into one. We catch most of this in a post-parse validation pass and re-run when we detect it.

What's still hard

Footnotes inside tables are rare and degrade gracefully. The footnote text is preserved, the anchor is preserved, but the footnote may not render in exactly the same position relative to the table cell. Most users don't notice; the people who do tend to fix it in Word in seconds.

Embedded images and shapes are preserved by position. We don't touch them. We don't humanize captions that live inside text frames, because we can't reliably tell text from layout in that case. Captions that are normal paragraphs do get humanized.

Tracked changes are accepted before humanization. If you upload a document with pending edits, those edits are applied as part of preprocessing. We don't yet have a mode that preserves tracked changes through the rewrite; it's on the list.

Closer

The format pipeline is the part of Inksong nobody asks about until something breaks. That's roughly the point — preserving formatting should be invisible, and we've spent more engineering time on it than on anything else. Pair this with the prompt strategy post for the rest of the picture.