Monday, 17 April 2017

How to reliably transform ISO-8859 encoded characters into HTML entities with NodeJS?

The Expedia Hotel Database provides some of its data in the ISO-8859 encoding:

  1. Files with ONLY English content are ISO-8859.

However:

ISO/IEC 8859 is a joint ISO and IEC series of standards for 8-bit character encodings. The series of standards consists of numbered parts, such as ISO/IEC 8859-1, ISO/IEC 8859-2, etc. There are 15 parts, excluding the abandoned ISO/IEC 8859-12. The ISO working group maintaining this series of standards has been disbanded.

So it is a series of different encodings with notable differences, rather than a single one. My problem:

How can I convert their "ONLY English content" data into a safer form to store in my database, one that I can reliably deliver to the user's browser without worrying that the data gets corrupted at the user's end?

I am thinking of trying to convert all data from ISO/IEC 8859-X (for each X = 1,...,16, skipping the abandoned part 12) into HTML entities first, and then checking for the presence of non-ASCII characters, which would mean the encoding was not correct and I have to try the next X. If none of the X works, that means this data entry is corrupted and should be discarded, I suppose, as it is unlikely to be displayed correctly. The whole task feels somewhat cumbersome, so I am wondering if there are simpler ways.
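
For a single part, the conversion step described above could look roughly like the following. This is only a minimal sketch, assuming the input is ISO-8859-1: it relies on Node's built-in `latin1` Buffer encoding, which maps each byte to the code point of the same value. The function name `latin1ToHtmlEntities` and the sample bytes are made up for illustration, and the other parts would need a decoder that knows their tables.

```javascript
// Decode bytes as ISO-8859-1 (assumption: the data really is part 1),
// then replace every non-ASCII character with a numeric HTML entity.
function latin1ToHtmlEntities(bytes) {
  // 'latin1' maps byte 0xNN to code point U+00NN.
  const text = Buffer.from(bytes).toString('latin1');
  // Everything outside the 7-bit ASCII range becomes &#<code point>;
  return text.replace(/[^\x00-\x7F]/g, (ch) => `&#${ch.codePointAt(0)};`);
}

// Hypothetical sample: "Hôtel Café" encoded as ISO-8859-1 bytes.
const sample = Buffer.from([0x48, 0xf4, 0x74, 0x65, 0x6c, 0x20, 0x43, 0x61, 0x66, 0xe9]);
console.log(latin1ToHtmlEntities(sample)); // H&#244;tel Caf&#233;
```

The output of this sketch is always pure ASCII, whatever the input bytes were. It also leaves markup characters such as `&` and `<` untouched, so those would need separate escaping before the text is placed into HTML.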

Note that even though the content is declared "ONLY English content", many data entries do actually contain accented characters that might get corrupted when decoded with the wrong encoding.



via Dmitri Zaitsev
