How it works
The Text Cleaner performs multiple cleanup operations on text in one pass: removing extra spaces, stripping special characters, normalizing line endings, removing HTML tags, fixing smart quotes, and removing zero-width characters. It's a catch-all tool for text that arrives from messy sources.
Text imported from PDFs, Word documents, web scraping, OCR, or copy-paste often contains characters that are invisible or easy to mistake for their plain ASCII equivalents: zero-width spaces (U+200B), non-breaking spaces (U+00A0), smart/curly quotes (“ ”), em dashes (—) instead of hyphens, stray HTML entities, and Windows-style line endings (CRLF) instead of Unix (LF). These characters cause failures in string comparison, JSON parsing, regex matching, and database lookups.
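To see which of these characters a given string actually contains, a small inspection helper can help. The following is a minimal Python sketch, not the Text Cleaner's own code: `find_suspect_chars` and the chosen set of Unicode categories are assumptions made for illustration.

```python
import unicodedata

# Assumed set of Unicode general categories worth flagging:
# Cf = format characters (zero-width space, BOM), Zs = space separators,
# Pi/Pf = opening/closing curly quotes, Pd = dashes.
SUSPECT_CATEGORIES = {"Cf", "Zs", "Pi", "Pf", "Pd"}

def find_suspect_chars(text):
    """Return (index, code point, Unicode name) for each non-ASCII suspect character."""
    hits = []
    for i, ch in enumerate(text):
        if ord(ch) < 128:
            continue  # plain ASCII is never flagged
        if unicodedata.category(ch) in SUSPECT_CATEGORIES:
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")))
    return hits

sample = "price:\u00a0\u201c10\u201d\u200b \u2014 ok"
for hit in find_suspect_chars(sample):
    print(hit)
# (6, 'U+00A0', 'NO-BREAK SPACE')
# (7, 'U+201C', 'LEFT DOUBLE QUOTATION MARK')
# (10, 'U+201D', 'RIGHT DOUBLE QUOTATION MARK')
# (11, 'U+200B', 'ZERO WIDTH SPACE')
# (13, 'U+2014', 'EM DASH')
```

Carriage returns are ASCII, so this sketch does not flag CRLF line endings. A check such as `"\r\n" in text` covers that case.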
How to use it: paste your text, then toggle the cleanup operations you want. Each operation shows a before/after character count so you can see how much it changed. Operations include: Remove HTML tags, Normalize whitespace, Convert smart quotes to straight quotes, Remove zero-width characters, Fix line endings (CRLF → LF), Strip non-ASCII characters, and Trim each line.
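The toggle-and-count workflow can be approximated in a few lines of Python. This is a sketch under stated assumptions: the operation names, their order, and the `clean` function are hypothetical, and only four of the listed operations are included.

```python
import re

# Hypothetical implementations; the tool's real operations may differ.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\ufeff]")
SMART_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})

# Insertion order of this dict is the order in which operations run.
OPERATIONS = {
    "remove_zero_width": lambda s: ZERO_WIDTH.sub("", s),
    "straighten_quotes": lambda s: s.translate(SMART_QUOTES),
    "fix_line_endings": lambda s: s.replace("\r\n", "\n"),
    "trim_each_line": lambda s: "\n".join(line.strip() for line in s.split("\n")),
}

def clean(text, enabled):
    """Apply the enabled operations in order.

    Returns the cleaned text and a list of (name, chars_before, chars_after).
    """
    report = []
    for name, op in OPERATIONS.items():
        if name not in enabled:
            continue
        before = len(text)
        text = op(text)
        report.append((name, before, len(text)))
    return text, report

raw = "  \u201chello\u201d\u200b  \r\nworld \r\n"
cleaned, report = clean(raw, set(OPERATIONS))
for name, before, after in report:
    print(f"{name}: {before} -> {after}")
# remove_zero_width: 22 -> 21
# straighten_quotes: 21 -> 21
# fix_line_endings: 21 -> 19
# trim_each_line: 19 -> 14
```

In this sketch a one-for-one substitution such as quote straightening leaves the character count unchanged, so comparing the text before and after is the more reliable check for that operation.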
Development use case: when a regex pattern that works in your test suite fails on real user input, one of these hidden characters is often the cause. Run the input through the Text Cleaner to strip them, then test again.
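A short Python illustration of this failure mode follows. The pattern and the input string are made up for the example, and the zero-width removal is a stand-in for what the tool does rather than its actual implementation.

```python
import re

pattern = re.compile(r"^order-\d+$")

# Looks like "order-1234" on screen, but hides a zero-width space (U+200B).
user_input = "order-\u200b1234"

print(ascii(user_input))                   # 'order-\u200b1234'
print(pattern.match(user_input))           # None

# Remove zero-width characters, then retry the same pattern.
cleaned = re.sub("[\u200b\u200c\u200d\ufeff]", "", user_input)
print(pattern.match(cleaned) is not None)  # True
```

Printing `ascii(value)` is a quick way to expose hidden characters in a log or debugger, because every non-ASCII character appears as an escape sequence.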
Frequently Asked Questions
- Zero-width characters (such as U+200B ZERO WIDTH SPACE, U+200C ZERO WIDTH NON-JOINER, and U+FEFF ZERO WIDTH NO-BREAK SPACE, which also serves as the byte order mark) are invisible characters that often appear in text copy-pasted from web pages, PDFs, and Word documents. They cause regex failures, string comparison mismatches, and unexpected behavior in code.
- Smart quotes are the typographic curved quotation marks (“ ” ‘ ’) used by Word, Pages, and many other word processors. They look better in print, but in code, JSON, and YAML only straight quotes delimit strings, so smart quotes cause syntax errors or end up as literal characters inside the value.
- The 'Remove HTML tags' option removes all <tag> elements from the text while preserving the inner text content, giving you clean, readable text from HTML source.
- The 'Normalize whitespace' option replaces Unicode whitespace variants (non-breaking spaces, ideographic spaces, hair spaces, thin spaces) with standard ASCII spaces and collapses runs of consecutive spaces into one.
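The HTML tag removal and whitespace normalization described in the last two answers can be sketched in Python as follows. The function names are hypothetical. Decoding entities such as `&amp;` and treating every character in Unicode category Zs as a space are assumptions of this sketch, not confirmed behavior of the tool.

```python
import re
import unicodedata
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text content and ignores the tags themselves."""

    def __init__(self):
        # convert_charrefs=True decodes entities like &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html_tags(html):
    """Remove tags and keep the inner text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

def normalize_whitespace(text):
    """Map Unicode space separators (category Zs) to ASCII spaces, then collapse runs."""
    mapped = "".join(" " if unicodedata.category(ch) == "Zs" else ch for ch in text)
    return re.sub(" {2,}", " ", mapped)

print(strip_html_tags("<p>Hello <b>world</b> &amp; co</p>"))            # Hello world & co
print(normalize_whitespace("a\u00a0\u3000b\u2009\u200ac  d"))            # a b c d
```

Newlines and tabs are not in category Zs, so this version of `normalize_whitespace` leaves line structure intact.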