JavaScript utf8_decode
Converts a UTF-8 encoded string to ISO-8859-1
function utf8_decode ( str_data ) {
    // Converts a UTF-8 encoded string to ISO-8859-1
    //
    // version: 909.322
    // discuss at: http://phpjs.org/functions/utf8_decode
    // +   original by: Webtoolkit.info (http://www.webtoolkit.info/)
    // +      input by: Aman Gupta
    // +   improved by: Kevin van Zonneveld (http://kevin.vanzonneveld.net)
    // +   improved by: Norman "zEh" Fuchs
    // +   bugfixed by: hitwork
    // +   bugfixed by: Onno Marsman
    // +      input by: Brett Zamir (http://brett-zamir.me)
    // +   bugfixed by: Kevin van Zonneveld (http://kevin.vanzonneveld.net)
    // *     example 1: utf8_decode('Kevin van Zonneveld');
    // *     returns 1: 'Kevin van Zonneveld'
    var tmp_arr = [], i = 0, ac = 0, c1 = 0, c2 = 0, c3 = 0;

    str_data += '';

    while ( i < str_data.length ) {
        c1 = str_data.charCodeAt(i);
        if (c1 < 128) {
            tmp_arr[ac++] = String.fromCharCode(c1);
            i++;
        } else if ((c1 > 191) && (c1 < 224)) {
            c2 = str_data.charCodeAt(i+1);
            tmp_arr[ac++] = String.fromCharCode(((c1 & 31) << 6) | (c2 & 63));
            i += 2;
        } else {
            c2 = str_data.charCodeAt(i+1);
            c3 = str_data.charCodeAt(i+2);
            tmp_arr[ac++] = String.fromCharCode(((c1 & 15) << 12) | ((c2 & 63) << 6) | (c3 & 63));
            i += 3;
        }
    }

    return tmp_arr.join('');
}
Examples
Running
utf8_decode('Kevin van Zonneveld');
Should return
'Kevin van Zonneveld'
Dependencies
No dependencies, you can use this function standalone.
Open syntax issues
php.js uses JsLint to help us keep our code consistent and prevent some common bugs.
Eventually we want all code to pass or at least take into consideration most fixes suggested by JsLint, following this JsLint configuration we’ve decided on.
Authors
Thanks to the following developers, you get to have utf8_decode goodness in JavaScript.
@Chris Ahrweiler: Are you sure you are using it correctly?
Please note that these functions in php.js are not related to whether the current page itself is in UTF-8 or not; JavaScript always uses UTF-16 internally.
As a little background, if you weren't aware, JavaScript internally uses two bytes to represent every character (or four bytes for some comparatively rare characters, stored as surrogate pairs: two pseudo-characters joined together to form a single character, but whose string length is 2). UTF-8, on the other hand, uses 1-4 bytes for each character, while ISO-8859-1 uses exactly 1 byte (it can do that because it is far more limited, handling only Latin characters). UTF-8 uses 1-2 bytes for the same range as ISO-8859-1 (1 byte for ASCII); the reason it must use 2 bytes in some cases, even though Unicode assigns the same code points as those used in ISO-8859-1, is that the high bit of a UTF-8 byte is reserved for indicating whether the byte is a complete 1-byte sequence or part of a multi-byte sequence.
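To make those byte counts concrete, here is a small sketch that reports how many UTF-16 code units and how many UTF-8 bytes a few sample characters need. The helper name utf8ByteLength is made up for this illustration and is not part of php.js, and the sketch assumes a runtime that has String.prototype.codePointAt (ES2015).

```javascript
// Hypothetical helper (not part of php.js): number of bytes UTF-8
// needs to encode a single Unicode code point.
function utf8ByteLength(codePoint) {
    if (codePoint < 0x80) { return 1; }      // ASCII
    if (codePoint < 0x800) { return 2; }     // includes the rest of ISO-8859-1
    if (codePoint < 0x10000) { return 3; }   // rest of the Basic Multilingual Plane
    return 4;                                // characters stored as surrogate pairs
}

// 'A', sharp-S, euro sign, and an emoji written as a surrogate pair.
var samples = ['A', '\u00DF', '\u20AC', '\uD83D\uDE00'];
var report = [];
for (var k = 0; k < samples.length; k++) {
    report.push({
        utf16Units: samples[k].length,                          // what .length counts
        utf8Bytes: utf8ByteLength(samples[k].codePointAt(0))
    });
}
console.log(report);
```

The sharp-S is the interesting row: one code unit in JavaScript, one byte in ISO-8859-1, but two bytes in UTF-8.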
So, in order to represent UTF-8 in JavaScript (and even ISO-8859-1), we are forced to use either an array of numbers, where each number represents the value of a single byte, or a string where each character has the code point value of a single byte.
Internally, JavaScript will use two bytes (or rarely 4 bytes), but we can use its strings to represent a sequence of single byte characters (ISO-8859-1) or a sequence of 1-4 byte character sequences (UTF-8). (If you put the text in an alert, our "UTF-8" will not be human-readable, while our "ISO-8859-1" will be.)
In JavaScript you can easily discover what the UTF-8 sequence for a character should be by using encodeURIComponent. For the sharp-S, it gives "%C3%9F", which, written as a string of single-byte characters, is "\u00C3\u009F" or, more visually, "Ã\u009F".
In ISO-8859-1, it should simply be "\u00DF" or "ß".
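A minimal sketch of that lookup, assuming nothing beyond standard JavaScript. The helper name percentToByteString is hypothetical; it only turns each %XX escape into a character with that byte value, which is the "one character per byte" representation described above.

```javascript
// Hypothetical helper: replace every %XX escape with the character
// whose code equals that byte. Unescaped ASCII characters are left
// alone, since each of them already stands for its own single byte.
function percentToByteString(encoded) {
    return encoded.replace(/%([0-9A-Fa-f]{2})/g, function (match, hex) {
        return String.fromCharCode(parseInt(hex, 16));
    });
}

var encoded = encodeURIComponent('\u00DF');     // sharp-S
var byteString = percentToByteString(encoded);

console.log(encoded);                           // %C3%9F
console.log(byteString.length);                 // 2
console.log(byteString === '\u00C3\u009F');     // true
```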
In other words, we can treat a regular JavaScript string as though it were ISO-8859-1, since both have a fixed correspondence between a character and a number of bytes (ISO-8859-1 is 1-to-1 while UTF-16 is 1-to-2, except for rare (and non-Latin) characters not present in ISO-8859-1, where the correspondence is 1-to-4; 4-byte characters are necessary in some cases since even 2 bytes are not enough to represent all of the scripts supported by Unicode).
So, "utf8_decode" would only be used if you already have a string of characters in this artificial "UTF-8" where each "character" has up to a full byte of value (0-255)--representing a single byte in a UTF-8 1-4 byte sequence.
"utf8_encode", on the other hand, would let you encode an ISO-8859-1 string (or Latin UTF-16 strings) into such an artificial "UTF-8" sequence of 1-2 byte sequences.
If the above doesn't help, feel free to clarify exactly what you are trying to do and why.
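For comparison, the two directions described in this comment can be sketched without php.js, assuming a runtime that provides TextEncoder and TextDecoder (modern browsers and recent Node.js). The names toByteString and fromByteString are hypothetical, and this is not how utf8_encode and utf8_decode are implemented; it only illustrates the "one character per UTF-8 byte" idea.

```javascript
// Hypothetical counterpart of utf8_encode: encode the text as UTF-8
// and store each resulting byte as one character (code 0-255).
function toByteString(text) {
    var bytes = new TextEncoder().encode(text);
    return Array.from(bytes, function (b) {
        return String.fromCharCode(b);
    }).join('');
}

// Hypothetical counterpart of utf8_decode: read each character as one
// byte and decode the byte sequence as UTF-8. Character codes above
// 255 are not valid input here (they would be truncated to 8 bits).
function fromByteString(byteString) {
    var bytes = Uint8Array.from(byteString, function (ch) {
        return ch.charCodeAt(0);
    });
    return new TextDecoder('utf-8').decode(bytes);
}

var sharpS = '\u00DF';
var asBytes = toByteString(sharpS);      // two characters: \u00C3 \u009F
var roundTrip = fromByteString(asBytes); // back to the single sharp-S

console.log(asBytes.length, roundTrip === sharpS);
```

Unlike a strict ISO-8859-1 conversion, fromByteString returns whatever characters the bytes decode to, including ones outside the Latin-1 range.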
Hello,
I tried the function and it looks good, but the German character "ß" is not converted correctly. Is an improved version of this function available?
Regards, Chris.
@Walessio: The declaration was already there (and for some more variables too), but I think the syntax highlighter doesn't accurately insert newlines, so when you copy-paste, it doesn't always preserve them. To be safer, just paste from the "raw js source" link.
@Kevin, I think here's some proof that it'd be better to go back to the old commenting system... The function shows the declaration, but when people copy-paste it, some lines get merged into comments, etc. Sorry for all of these "unfunded" requests... :)
Bugs found from Firefox 3.0.16 (on Windows Vista):
i is not defined
tmp_arr is not defined
add this to solve the problem:
var i;
var tmp_arr = new Array();
also, unserialize does not unserialize serialized associative arrays
@Michel Corne: Thank you for the function. The site still has some apparent problems with Unicode in comments (if not the function!), so would you mind submitting your code to http://pastebin.com/ and sending us a link?
I am afraid there is a problem with this function. Try to convert "à", which is a multibyte character. "str_data.length" equals 1 (one character), since JavaScript is UTF-8 compliant. But c1 = 224, which is greater than 128, so you end up playing with c2 and c3, which don't even exist :-) Here is the code I propose. It simply translates the string from UTF-8 to ISO-8859-15 (improved from ISO-8859-1) with a translation table for all non-ASCII characters. Let me know what you think. Thanks, MC.
[code lang="javascript"]
function utf8_decode (utf8) {
    // control characters are left for alignment reasons, they will not be used anyway!
    var i, ch, iso885915 = '',
        utf8ToIso885915 = {
            'NBSP': '\xA0', '¡': '\xA1', '¢': '\xA2', '£': '\xA3', '€': '\xA4', '¥': '\xA5', 'Š': '\xA6', '§': '\xA7',
            'š': '\xA8', '©': '\xA9', 'ª': '\xAA', '«': '\xAB', '¬': '\xAC', 'SHY': '\xAD', '®': '\xAE', '¯': '\xAF',
            '°': '\xB0', '±': '\xB1', '²': '\xB2', '³': '\xB3', 'Ž': '\xB4', 'µ': '\xB5', '¶': '\xB6', '·': '\xB7',
            'ž': '\xB8', '¹': '\xB9', 'º': '\xBA', '»': '\xBB', 'Œ': '\xBC', 'œ': '\xBD', 'Ÿ': '\xBE', '¿': '\xBF',
            'À': '\xC0', 'Á': '\xC1', 'Â': '\xC2', 'Ã': '\xC3', 'Ä': '\xC4', 'Å': '\xC5', 'Æ': '\xC6', 'Ç': '\xC7',
            'È': '\xC8', 'É': '\xC9', 'Ê': '\xCA', 'Ë': '\xCB', 'Ì': '\xCC', 'Í': '\xCD', 'Î': '\xCE', 'Ï': '\xCF',
            'Ð': '\xD0', 'Ñ': '\xD1', 'Ò': '\xD2', 'Ó': '\xD3', 'Ô': '\xD4', 'Õ': '\xD5', 'Ö': '\xD6', '×': '\xD7',
            'Ø': '\xD8', 'Ù': '\xD9', 'Ú': '\xDA', 'Û': '\xDB', 'Ü': '\xDC', 'Ý': '\xDD', 'Þ': '\xDE', 'ß': '\xDF',
            'à': '\xE0', 'á': '\xE1', 'â': '\xE2', 'ã': '\xE3', 'ä': '\xE4', 'å': '\xE5', 'æ': '\xE6', 'ç': '\xE7',
            'è': '\xE8', 'é': '\xE9', 'ê': '\xEA', 'ë': '\xEB', 'ì': '\xEC', 'í': '\xED', 'î': '\xEE', 'ï': '\xEF',
            'ð': '\xF0', 'ñ': '\xF1', 'ò': '\xF2', 'ó': '\xF3', 'ô': '\xF4', 'õ': '\xF5', 'ö': '\xF6', '÷': '\xF7',
            'ø': '\xF8', 'ù': '\xF9', 'ú': '\xFA', 'û': '\xFB', 'ü': '\xFC', 'ý': '\xFD', 'þ': '\xFE', 'ÿ': '\xFF'
        };
    for (i = 0; i < utf8.length; i++) {
        // charAt is used instead of utf8[i] so that older browsers work too
        ch = utf8.charAt(i);
        iso885915 += utf8ToIso885915[ch] ? utf8ToIso885915[ch] : ch;
    }
    return iso885915;
}
[/code]
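The disagreement in this comment comes down to what kind of string is passed in. A decoded "à" is one character with code 224, not a UTF-8 byte sequence, so the original utf8_decode reads it as the lead byte of a 3-byte sequence whose continuation bytes are missing. A hedged sketch of a guard that callers could run first is shown below; the function name looksLikeUtf8ByteString is hypothetical, and the check is deliberately simplified (it does not reject overlong forms or encoded surrogates).

```javascript
// Hypothetical guard (not part of php.js): returns true when every
// character code is a byte (0-255) and every lead byte is followed by
// the expected number of continuation bytes (0x80-0xBF).
function looksLikeUtf8ByteString(s) {
    var i = 0, c, extra;
    while (i < s.length) {
        c = s.charCodeAt(i);
        if (c < 0x80) { extra = 0; }
        else if (c >= 0xC2 && c <= 0xDF) { extra = 1; }
        else if (c >= 0xE0 && c <= 0xEF) { extra = 2; }
        else if (c >= 0xF0 && c <= 0xF4) { extra = 3; }
        else { return false; }  // stray continuation byte, invalid lead, or code > 255
        i++;
        while (extra-- > 0) {
            c = s.charCodeAt(i++);  // NaN when the string ends too early
            if (!(c >= 0x80 && c <= 0xBF)) { return false; }
        }
    }
    return true;
}

var decodedChar = '\u00E0';        // "à" as an ordinary JavaScript character
var utf8Bytes = '\u00C3\u00A0';    // the same character as two UTF-8 "bytes"

console.log(decodedChar.length, decodedChar.charCodeAt(0)); // 1 224
console.log(looksLikeUtf8ByteString(decodedChar));          // false
console.log(looksLikeUtf8ByteString(utf8Bytes));            // true
```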
@ Ben Pettit: There has been some talk about it over at the utf8_encode function. Let's work it out over there, and then - if needed - we'll fix utf8_decode accordingly.
This function had trouble with the utf8_encode function when I included the transport.min.js file in my Adobe JavaScript plugin while using the md5 function.
The way I fixed it was by adding this as the first line:
str_data = str_data.valueOf();
Thanks for the great open source library! I'm just trying to give back a little bit.
Cheers,
Ben.
@otto
PHP
utf8_encode('Sihlhölzli')
returns 'Sihlhölzli'
JAVASCRIPT
utf8_decode('Sihlhölzli');
returns 'Sihlhölzli'
Seems fine to me...?
I have a folder named "Sihlhölzli", which I utf8_encode on the server with PHP. This gives me "Sihlh\u00f6lzli", and if it's displayed by the browser it nicely shows "Sihlhölzli". Of course I can't use this name to load an image, so I use utf8_decode in JavaScript to decode the name. Unfortunately that doesn't work; not even the browser can display it. Any idea what's wrong?
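One possible reading of the question above is that the "\u00f6" in the server output is a JSON-style escape rather than a pair of UTF-8 bytes, in which case utf8_decode would not be the right tool. The sketch below assumes exactly that; the JSON text, the images/ path and the file name are made up for illustration and do not come from the original post.

```javascript
// Assumption: the server sent the folder name as a JSON string literal,
// so the backslash escape is literally present in the received text.
var jsonFromServer = '"Sihlh\\u00f6lzli"';

// Parsing the JSON turns the escape back into the real character.
var folderName = JSON.parse(jsonFromServer);

// For use in a URL, percent-encode the name (this produces UTF-8 bytes).
// The path and file name here are hypothetical.
var imageUrl = 'images/' + encodeURIComponent(folderName) + '/photo.jpg';

console.log(folderName);
console.log(imageUrl);
```

If the name really does arrive as a string of UTF-8 bytes instead, the explanation of utf8_decode given earlier in this thread applies.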


Brett Zamir
Jan 31st