The Expedia Hotel Database provides some of its data in the ISO-8859 encoding:
- Files with ONLY English content are ISO-8859.
ISO/IEC 8859 is a joint ISO and IEC series of standards for 8-bit character encodings. The series of standards consists of numbered parts, such as ISO/IEC 8859-1, ISO/IEC 8859-2, etc. There are 15 parts, excluding the abandoned ISO/IEC 8859-12. The ISO working group maintaining this series of standards has been disbanded.
So it is a series of different encodings with notable differences, rather than a single one. My problem:
How can I convert their "ONLY English content" data into a safer form to store in my database, one that I can reliably deliver to the user's browser without worrying that the data gets corrupted at the user's end?
I am thinking of trying to convert all data from ISO/IEC 8859-X (for each X = 1,...,16) into HTML entities first, and then checking for the presence of non-ASCII characters, which would mean the encoding was not correct and I have to try the next X. If none of the X works, the data entry is corrupted and should be discarded, I suppose, as it is unlikely to be displayed correctly. The whole task feels somewhat cumbersome, so I am wondering if there are simpler ways.
Note that even though the content is declared "ONLY English content", many data entries do actually contain accented characters that might get corrupted if decoded with the wrong encoding.
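To make the idea above concrete, here is a minimal sketch of the trial decoding in Python. The language is my own choice for illustration, and the helper names `decode_candidates` and `to_html_entities` are hypothetical. One caveat it makes visible: several parts of the series (ISO-8859-1 among them) assign a character to every byte value, so a decode that succeeds does not by itself prove that X was the right part.

```python
# Sketch only: helper names are hypothetical, not part of any Expedia tooling.

# Parts of ISO/IEC 8859 to try; part 12 was abandoned and has no codec.
PARTS = [n for n in range(1, 17) if n != 12]


def decode_candidates(raw):
    """Return {codec name: decoded text} for every part that decodes raw without error."""
    results = {}
    for n in PARTS:
        name = f"iso8859_{n}"
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # This part leaves at least one of the bytes undefined.
            pass
    return results


def to_html_entities(text):
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Assumed sample: an "English" entry that still contains accented characters.
sample = "Hôtel Café".encode("iso8859_1")
candidates = decode_candidates(sample)
for name, text in candidates.items():
    print(name, to_html_entities(text))
```

Running this on the sample shows more than one part decoding the same bytes without error, each to possibly different characters, so picking the right X would still need some other signal, for example a check against the characters expected in the data.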
via Dmitri Zaitsev