What is ftfy?

ftfy is a Python library that aims to automatically fix Unicode text that has common encoding errors. It's particularly useful for cleaning up text that has been mangled due to issues like:

  • Mojibake: Repairing text that appears garbled because its bytes were decoded with the wrong encoding (e.g., "é" showing up as "Ã©").

  • Double Encoding: Resolving instances where text has been accidentally encoded multiple times (e.g., UTF-8 bytes misread as Latin-1 and then encoded as UTF-8 again).

  • HTML Entities: Converting HTML entities (like &amp;) back to their corresponding characters.

  • Whitespace Problems: Normalizing whitespace characters.
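To make the first three problems concrete, here is a minimal sketch that reproduces them using only the Python standard library. It is not how ftfy works internally; the sample strings and variable names are illustrative assumptions, and the "repair" step only works here because the wrong codec (Latin-1) is known in advance, which is exactly the part ftfy tries to work out for you.

```python
import html

original = "café"

# Mojibake: the UTF-8 bytes of the text are decoded with the wrong codec.
mojibake = original.encode("utf-8").decode("latin-1")
print(mojibake)  # cafÃ©

# Double encoding: the already-garbled text is encoded as UTF-8 again,
# so each original non-ASCII character now takes even more bytes.
double_encoded = mojibake.encode("utf-8")
print(double_encoded)  # b'caf\xc3\x83\xc2\xa9'

# Manual repair: reverse the wrong decode, then decode correctly.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired)  # café

# HTML entities: turn an escaped entity back into its character.
unescaped = html.unescape("Fish &amp; chips")
print(unescaped)  # Fish & chips
```

In real data you rarely know which wrong codec was applied, or how many times, which is why a heuristic tool is useful.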

The core function is ftfy.fix_text(), which takes a string as input and returns a corrected string. The library also exposes individual, configurable fixes for specific tasks, such as unescaping HTML and XML entities. ftfy tries to repair corrupted Unicode while changing the original text as little as possible, so that the result stays readable.
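A short usage sketch of ftfy.fix_text() follows. ftfy is a third-party package (typically installed with pip install ftfy), so the import is guarded here; the clean() wrapper is a hypothetical helper added for illustration, not part of ftfy.

```python
try:
    import ftfy  # third-party: pip install ftfy
except ImportError:
    ftfy = None


def clean(text: str) -> str:
    """Hypothetical helper: repair text with ftfy, or return it unchanged
    if ftfy is not installed."""
    if ftfy is None:
        return text
    return ftfy.fix_text(text)


# This escaped string displays as 'âœ” No problems', which is the
# mojibake form of a check mark followed by ' No problems'.
garbled = "\u00e2\u0153\u201d No problems"

# With ftfy installed, this prints: ✔ No problems
print(clean(garbled))

# Text with nothing wrong is expected to pass through unchanged.
print(clean("plain ASCII text"))
```

Wrapping the call in one helper keeps the dependency in a single place, which makes it easy to swap in different cleaning options later.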