How PDFs Actually Work: Structure, Fonts, and Why They're So Difficult to Edit

PDF is the most universal document format — and one of the most misunderstood. Understanding its structure explains why editing PDFs is hard, why fonts go missing, and how tools like compression and redaction actually work.

What Is a PDF File?

PDF (Portable Document Format) was created by Adobe in 1993 with a specific goal: a document that looks identical on any device, any operating system, any printer. The format achieves this by describing each page as a set of drawing instructions rather than a reflowable stream of text.

When you open a PDF, you're not reading a "document" the way HTML is a document. The viewer is interpreting a list of drawing instructions that place text, vector shapes, and images at specific coordinates on the page.

The Structure of a PDF File

A PDF is composed of objects. Every piece of content in the file — text strings, images, fonts, page properties — is stored as an object with a unique number.

At the end of the file is a cross-reference table (xref) that lists the byte offset of every object, so the PDF reader can jump directly to any object without reading the whole file sequentially.
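To make the lookup concrete, here is a minimal Python sketch that assembles a tiny PDF-like file in memory and then finds its objects the way a reader would: read `startxref`, jump to the table, and follow the offsets. The helper names (`build_minimal_pdf`, `read_xref`, `read_object`) are made up for this illustration, and the parser assumes a classic single-section xref table; newer files may use compressed cross-reference streams, which this does not handle.

```python
def build_minimal_pdf() -> bytes:
    """Assemble a tiny PDF-like file, recording each object's byte offset."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # where "N 0 obj" starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is the head of the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def read_xref(pdf: bytes) -> dict[int, int]:
    """Map object number -> byte offset (assumes one classic xref section)."""
    # The last 'startxref' keyword is followed by the table's byte offset.
    start = pdf.rindex(b"startxref")
    xref_pos = int(pdf[start + len(b"startxref"):].split()[0])
    lines = pdf[xref_pos:].splitlines()
    assert lines[0].strip() == b"xref"
    first, count = map(int, lines[1].split())
    table = {}
    for i in range(count):
        offset, _generation, kind = lines[2 + i].split()
        if kind == b"n":  # 'n' = in use, 'f' = free
            table[first + i] = int(offset)
    return table


def read_object(pdf: bytes, offset: int) -> bytes:
    """Return the raw bytes of the object that starts at the given offset."""
    return pdf[offset:pdf.index(b"endobj", offset)]


pdf = build_minimal_pdf()
table = read_xref(pdf)
print(table)
print(read_object(pdf, table[3]).decode("latin-1"))
```

The reader never scans the file from the top: it reads a few bytes at the end, then seeks straight to the object it needs.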

The document catalog (the file's root dictionary) points to the page tree, which lists the pages. Each page references a content stream: a sequence of drawing commands in PDF's postfix notation, where operands come before the operator:

    BT                  % Begin Text
    /Helvetica 12 Tf    % Set font and size
    100 700 Td          % Move to position x=100, y=700
    (Hello, World) Tj   % Draw text
    ET                  % End Text

Text isn't stored as "a paragraph at position X" — it's stored as "draw this string at these exact coordinates."
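As a rough illustration of how a reader interprets the content stream shown above, the following Python sketch tokenizes an uncompressed stream and groups operands with the operator that follows them. The tokenizer is deliberately simplified (an assumption of this sketch): it ignores hex strings, arrays, dictionaries, and nested parentheses, all of which real content streams can contain.

```python
import re

# Simplified token grammar for an uncompressed content stream.
TOKEN = re.compile(rb"""
    %[^\r\n]*               # comment, runs to end of line
  | \((?:\\.|[^\\()])*\)    # literal string, no nested parentheses here
  | /[^\s/()<>\[\]%]+       # name such as /Helvetica
  | [+-]?\d*\.?\d+          # number
  | [A-Za-z'"*]+            # operator such as BT, Tf, Td, Tj
""", re.VERBOSE)


def parse_content_stream(data: bytes):
    """Yield (operator, operands) pairs; operands precede their operator."""
    operands = []
    for match in TOKEN.finditer(data):
        tok = match.group().decode("latin-1")
        if tok.startswith("%"):
            continue  # comments carry no drawing information
        if tok[0] in "(/+-.0123456789":
            operands.append(tok)  # hold until an operator arrives
        else:
            yield tok, operands
            operands = []


STREAM = b"""BT                 % Begin Text
/Helvetica 12 Tf   % Set font and size
100 700 Td         % Move to x=100, y=700
(Hello, World) Tj  % Draw text
ET                 % End Text
"""

for operator, args in parse_content_stream(STREAM):
    print(operator, args)
```

Nothing in the parsed result says "paragraph" or "sentence": the only facts available are a font, a position, and a string to paint.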

Why Editing PDFs Is Hard

When you edit a Word document, you're editing a semantic structure: this is a heading, these are paragraphs, this is a table. The renderer figures out the layout.

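A PDF content stream has none of that structure. The word "Hello" is stored as a text-drawing command at page coordinates, measured in points (1/72 inch by default) rather than pixels. There are no paragraphs, no flow, no concept of "the sentence that contains this word." Tagged PDFs can carry a separate structure tree, but the drawing commands themselves remain purely positional. To edit a sentence, a PDF editor needs to: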

1. Find all the drawing commands that constitute the sentence
2. Calculate the new text's width (which depends on the specific font)
3. Reposition all the subsequent text
4. Handle line wrapping if the new text is longer

This is genuinely hard, which is why most PDF "editors" are really annotation tools that add a layer over the original content rather than modifying the underlying drawing commands.
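Steps 2 to 4 can be sketched in a few lines of Python. The glyph widths below are assumed values for illustration, expressed in thousandths of the font size; a real editor would read them from the font's width table or from the embedded font program. The function names are hypothetical as well.

```python
# Advance widths in 1/1000 of the font size. Assumed values for illustration;
# real widths come from the font's /Widths array or the font program itself.
ASSUMED_WIDTHS = {
    "H": 722, "W": 944, "e": 556, "l": 222, "o": 556,
    "r": 333, "d": 556, ",": 278, " ": 278,
}


def text_width(text: str, font_size: float, default: int = 500) -> float:
    """Width of a string in points, summing per-glyph advance widths."""
    units = sum(ASSUMED_WIDTHS.get(ch, default) for ch in text)
    return units * font_size / 1000


def shift_after_edit(old: str, new: str, font_size: float,
                     following_x: list[float]) -> list[float]:
    """Move every later text position on the line by the change in width."""
    delta = text_width(new, font_size) - text_width(old, font_size)
    return [x + delta for x in following_x]


def overflows(line_x: float, text: str, font_size: float,
              right_margin: float) -> bool:
    """True if the text no longer fits and the line would need re-wrapping."""
    return line_x + text_width(text, font_size) > right_margin


# Replacing "Hello" with "Hello, World" at 12 pt pushes later text rightwards.
print(shift_after_edit("Hello", "Hello, World", 12, [130.0, 160.0]))
print(overflows(100, "Hello, World", 12, right_margin=150))
```

Even this toy version shows the dependency chain: one changed word alters a width, which moves everything after it, which may force a new line break that the file has no way to express.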

Fonts in PDFs

PDF handles fonts in three ways:

**Embedded fonts**: the font data is included inside the PDF, so the document renders identically everywhere. This is the correct approach, and most PDFs from modern tools do it. Embedding adds file size but guarantees accuracy.

**Font substitution**: the PDF references a font by name but doesn't embed it, so the PDF reader substitutes a similar font. This can cause subtle layout changes and missing or incorrect characters.

**Font subsetting**: a form of embedding in which only the glyphs (characters) actually used in the document are included. This reduces file size, but it means you can't add new text in that font without the original font file.
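One practical way to tell these cases apart is the font's name inside the file: a subset font's `/BaseFont` name carries a tag of six uppercase letters followed by a plus sign (for example `ABCDEF+Helvetica`), and an embedded font has a font program attached to its descriptor (`/FontFile`, `/FontFile2`, or `/FontFile3`). The Python sketch below classifies a font from those two facts. The function name and the way the inputs are supplied are assumptions of this example; extracting them from a real file requires a PDF parser.

```python
import re

# A subset tag is six uppercase letters followed by "+".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+(.+)$")


def describe_font(base_font: str, has_font_file: bool) -> str:
    """Classify a font from its /BaseFont name and whether its descriptor
    carries an embedded font program."""
    match = SUBSET_TAG.match(base_font)
    if match:
        return f"subset of {match.group(1)}"
    if has_font_file:
        return f"fully embedded {base_font}"
    return f"not embedded, reader will substitute for {base_font}"


print(describe_font("ABCDEF+Helvetica", True))
print(describe_font("TimesNewRomanPSMT", True))
print(describe_font("Helvetica", False))
```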

Metadata in PDFs

PDFs contain metadata that most people don't know is there:

  • Title, author, subject, keywords (Document Properties)
  • Creation date and modification date
  • Application that created the file (e.g., "Microsoft Word 2021")
  • Sometimes: the username of the person who created it

This metadata is often inadvertently included when sharing PDFs externally. Law firms have leaked client information through PDF metadata. Journalists have exposed sources by forgetting to strip metadata before publishing documents.
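To see how exposed this information can be, here is a rough Python sketch that reads the classic Info dictionary entries straight out of the raw bytes and blanks selected values. It rests on several assumptions: the file is unencrypted, the Info dictionary is not stored in a compressed object stream, and values are simple literal strings. The sample bytes and the username `jdoe` are invented for the example. Blanked values are overwritten with spaces of the same length so that the byte offsets in the cross-reference table stay valid.

```python
import re

# Matches entries such as: /Author (jdoe)
INFO_ENTRY = re.compile(
    rb"/(Title|Author|Subject|Keywords|Creator|Producer|CreationDate|ModDate)"
    rb"\s*\(((?:\\.|[^\\()])*)\)"
)


def read_info(pdf: bytes) -> dict[str, str]:
    """Collect Info-dictionary style entries found in the raw bytes."""
    return {key.decode("ascii"): value.decode("latin-1")
            for key, value in INFO_ENTRY.findall(pdf)}


def blank_info(pdf: bytes, keys=("Author", "Creator", "Producer")) -> bytes:
    """Overwrite selected values with spaces, keeping the file length."""
    def blank(match):
        if match.group(1).decode("ascii") not in keys:
            return match.group(0)
        prefix = match.group(0)[: match.start(2) - match.start(0)]
        return prefix + b" " * len(match.group(2)) + b")"
    return INFO_ENTRY.sub(blank, pdf)


# Hypothetical Info object for illustration.
SAMPLE = (b"5 0 obj\n<< /Title (Q3 Report) /Author (jdoe) "
          b"/Creator (Microsoft Word 2021) "
          b"/CreationDate (D:20240102030405Z) >>\nendobj\n")

print(read_info(SAMPLE))
print(blank_info(SAMPLE).decode("latin-1"))
```

This is not a complete scrubber. The same fields are often duplicated in an XMP metadata stream, and files saved with incremental updates can retain earlier versions of the Info dictionary, so a dedicated tool is the safer choice for anything sensitive.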

NoxaKit's PDF Metadata Editor lets you view and remove this metadata. The PDF Compress Preset strips metadata as part of the compression process.

Layers and Annotations

PDFs support optional content groups (layers) — you can have content that's visible only with certain layers active. This is used for engineering drawings, multilingual documents, and print vs screen versions.

Annotations are additions to a page (comments, highlights, form fields) that exist as a separate layer over the original content. Flattening a PDF merges annotations into the page content, making them permanent and uneditable.

Security

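PDF passwords work at two levels. The owner password controls permissions such as printing, copying, and editing. When only an owner password is set, the file can still be opened without any password, so those restrictions depend on the reader choosing to honor them and can be bypassed with readily available tools. The user password is different: the content is encrypted with a key derived from it, and it is required to open the file. AES-256 encryption, introduced as an Adobe extension to PDF 1.7 and standardized in PDF 2.0, is genuinely secure when paired with a strong password.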
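The permissions themselves are stored as a single integer, the `/P` entry of the file's encryption dictionary, in which each bit grants one capability. The Python sketch below decodes that integer using the bit positions from the PDF 1.7 standard security handler (bits are numbered from 1). The function name and the sample values are chosen for illustration.

```python
# Bit positions (1-based) of the user access permissions in the /P integer.
PERMISSION_BITS = {
    3: "print",
    4: "modify contents",
    5: "copy or extract text",
    6: "add annotations / fill forms",
    9: "fill existing form fields",
    10: "extract for accessibility",
    11: "assemble pages",
    12: "print at full quality",
}


def decode_permissions(p: int) -> list[str]:
    """List the capabilities granted by a /P value (a signed 32-bit integer)."""
    flags = p & 0xFFFFFFFF  # view the signed value as 32 raw bits
    return [name for bit, name in PERMISSION_BITS.items()
            if flags & (1 << (bit - 1))]


print(decode_permissions(-3904))  # nothing granted
print(decode_permissions(-1852))  # printing only
print(decode_permissions(-4))     # everything granted
```

Because these are just flags that a viewer reads and chooses to respect, they work as a policy statement rather than a lock.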

Redaction has a similar trap. Drawing a black rectangle over text does not redact it: the text is still in the content stream underneath and can be copied or extracted. Proper redaction removes the underlying content from the file entirely.
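The difference is easy to demonstrate. In the Python sketch below, a hypothetical content stream draws a line of text and then paints a black rectangle over it; a simple extractor still recovers the text. The `redact` function is a crude stand-in for real redaction that empties every string shown with `Tj`. It assumes an uncompressed stream and ignores other text operators, images, and fonts, which a real redaction tool must also handle.

```python
import re

# A literal string followed by the Tj (show text) operator.
STRING_SHOWN = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")

# Hypothetical page content: text first, then a black box painted on top.
COVERED = b"""BT /F1 12 Tf 100 700 Td (Account 12345678) Tj ET
0 0 0 rg            % black fill colour
95 695 130 18 re f  % filled rectangle drawn over the text
"""


def extract_shown_text(content: bytes) -> list[str]:
    """Return every string painted with Tj, covered up or not."""
    return [s.decode("latin-1") for s in STRING_SHOWN.findall(content)]


def redact(content: bytes) -> bytes:
    """Crude sketch: replace every shown string with an empty one."""
    return STRING_SHOWN.sub(b"() Tj", content)


print(extract_shown_text(COVERED))          # the "hidden" text is still there
print(extract_shown_text(redact(COVERED)))  # the string itself is gone
```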
