Convert ISO/Windows Charsets to UTF-8 in JavaScript

I'm developing a Firefox plugin and I fetch web pages to do some analysis for the user. The problem is when I try to get (XMLHttpRequest) pages that are not UTF-8 encoded the strin

Solution 1:

Once XMLHttpRequest has tried to decode a non-UTF-8 string using UTF-8, you've already lost. The byte sequences in the page that weren't valid UTF-8 sequences will have been mangled (typically converted to , the U+FFFD replacement character). No amount of re-encoding/decoding will get them back.

Pages that specify a Content-Type: text/html;charset=something HTTP header should be OK. Pages that don't have a real HTTP header but do have a <meta> version of it won't be, because XMLHttpRequest doesn't know about parsing HTML so it won't see the meta. If you know in advance the charset you want, you can tell XMLHttpRequest and it'll use it:

xhr.open(...);
xhr.overrideMimeType('text/html;charset=gb2312');
xhr.send();

(overrideMimeType began as a non-standard Mozilla extension; it has since been added to the XMLHttpRequest specification.)

If you don't know the charset in advance, you can request the page once, scan the start of the response for a <meta> charset declaration, parse the charset name out of it and request the page again with that charset.
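A rough sketch of the sniffing step, in case it helps. The helper name sniffMetaCharset and the 4096-character cutoff are my own choices, not part of any API. It relies on the charset declaration being plain ASCII, so it should still be readable in the first (wrongly decoded) response as long as the page's real encoding is ASCII-compatible:

```javascript
// Hypothetical helper: find a charset declared in a <meta> tag.
// Handles both <meta charset="x"> and
// <meta http-equiv="Content-Type" content="text/html; charset=x">.
function sniffMetaCharset(html) {
  // Only look at the start of the document; the cutoff is an arbitrary choice.
  var head = html.slice(0, 4096);
  var metaTags = head.match(/<meta\b[^>]*>/gi) || [];
  for (var i = 0; i < metaTags.length; i++) {
    var m = /charset\s*=\s*["']?\s*([\w.:-]+)/i.exec(metaTags[i]);
    if (m) return m[1].toLowerCase();
  }
  return null; // no declaration found
}
```

If it returns a name, the second request can pass 'text/html;charset=' + name to xhr.overrideMimeType before calling send(). A regex is not a real HTML parser, so treat the result as a best guess.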

In theory you could get a binary response in a single request:

xhr.overrideMimeType('text/html;charset=iso-8859-1');

and then convert that from bytes-as-chars to UTF-8. However, iso-8859-1 wouldn't work for this because the browser interprets that charset as really being Windows code page 1252.
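To make the problem concrete, here is a small sketch (the function name naiveCharsToBytes is made up for illustration) of the obvious bytes-as-chars conversion. It is only correct if every byte N was decoded to the character U+00NN, which is what real ISO-8859-1 would do:

```javascript
// Naive conversion: assume each character's code point equals the original byte.
function naiveCharsToBytes(text) {
  var bytes = new Uint8Array(text.length);
  for (var i = 0; i < text.length; i++) {
    bytes[i] = text.charCodeAt(i) & 0xFF;
  }
  return bytes;
}

// Byte 0xE9 decodes to "é" (U+00E9) in both charsets, so it survives:
naiveCharsToBytes('\u00E9')[0]; // 0xE9
// Byte 0x80 decodes to "€" (U+20AC) under code page 1252, so it does not:
naiveCharsToBytes('\u20AC')[0]; // 0xAC, not 0x80
```

Everything in the 0x80 to 0x9F range is affected in the same way, which matters because those bytes are common in multi-byte encodings.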

You could maybe use another codepage that maps every byte to a character, and do a load of tedious character replacements to map every character in that codepage to the character it would have been in real-ISO-8859-1, then do the conversion. Most encodings don't map every byte, but Arabic (cp1256) might be a candidate for this?
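For what it's worth, here is a sketch of what the "tedious character replacements" could look like if code page 1252 itself is used as the carrier. It assumes the decoder passes the five bytes that cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) through as the matching C1 control characters, which is what the WHATWG Encoding Standard describes; an older decoder might turn them into U+FFFD instead, and then this breaks. The names CP1252_HIGH and cp1252TextToBytes are made up for illustration:

```javascript
// Characters that windows-1252 assigns to bytes 0x80-0x9F, in byte order.
// \u0081, \u008D, \u008F, \u0090 and \u009D stand in for the undefined bytes
// (assumption: the decoder maps those bytes to the same-numbered C1 controls).
var CP1252_HIGH =
  '\u20AC\u0081\u201A\u0192\u201E\u2026\u2020\u2021' + // 0x80-0x87
  '\u02C6\u2030\u0160\u2039\u0152\u008D\u017D\u008F' + // 0x88-0x8F
  '\u0090\u2018\u2019\u201C\u201D\u2022\u2013\u2014' + // 0x90-0x97
  '\u02DC\u2122\u0161\u203A\u0153\u009D\u017E\u0178';  // 0x98-0x9F

// Recover the original bytes from text that was decoded as windows-1252.
function cp1252TextToBytes(text) {
  var bytes = new Uint8Array(text.length);
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    if (code > 0xFF || (code >= 0x80 && code <= 0x9F)) {
      // Not a byte-equals-code-point character: look it up in the table.
      var idx = CP1252_HIGH.indexOf(text.charAt(i));
      if (idx === -1) {
        throw new Error('Not a windows-1252 character at index ' + i);
      }
      code = 0x80 + idx;
    }
    bytes[i] = code;
  }
  return bytes;
}
```

With the bytes recovered, a TextDecoder for the page's real charset can turn them into a proper string, e.g. new TextDecoder('utf-8').decode(cp1252TextToBytes(xhr.responseText)), assuming the response was fetched with overrideMimeType set to windows-1252. Which charset labels TextDecoder accepts depends on the environment.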
