Use PHP functions in JavaScript

JavaScript utf8_decode

Converts a UTF-8 encoded string to ISO-8859-1

function utf8_decode ( str_data ) {
    // Converts a UTF-8 encoded string to ISO-8859-1
    //
    // version: 909.322
    // discuss at: http://phpjs.org/functions/utf8_decode
    // +   original by: Webtoolkit.info (http://www.webtoolkit.info/)
    // +      input by: Aman Gupta
    // +   improved by: Kevin van Zonneveld (http://kevin.vanzonneveld.net)
    // +   improved by: Norman "zEh" Fuchs
    // +   bugfixed by: hitwork
    // +   bugfixed by: Onno Marsman
    // +      input by: Brett Zamir (http://brett-zamir.me)
    // +   bugfixed by: Kevin van Zonneveld (http://kevin.vanzonneveld.net)
    // *     example 1: utf8_decode('Kevin van Zonneveld');
    // *     returns 1: 'Kevin van Zonneveld'
    var tmp_arr = [], i = 0, ac = 0, c1 = 0, c2 = 0, c3 = 0;

    // Coerce the argument to a primitive string
    str_data += '';

    while ( i < str_data.length ) {
        c1 = str_data.charCodeAt(i);
        if (c1 < 128) {
            // 1-byte sequence (ASCII)
            tmp_arr[ac++] = String.fromCharCode(c1);
            i++;
        } else if ((c1 > 191) && (c1 < 224)) {
            // 2-byte sequence: 5 bits from the lead byte, 6 from the continuation byte
            c2 = str_data.charCodeAt(i+1);
            tmp_arr[ac++] = String.fromCharCode(((c1 & 31) << 6) | (c2 & 63));
            i += 2;
        } else {
            // Everything else is treated as a 3-byte sequence
            c2 = str_data.charCodeAt(i+1);
            c3 = str_data.charCodeAt(i+2);
            tmp_arr[ac++] = String.fromCharCode(((c1 & 15) << 12) | ((c2 & 63) << 6) | (c3 & 63));
            i += 3;
        }
    }

    return tmp_arr.join('');
}
external links: original PHP docs | raw js source

Examples

Running

utf8_decode('Kevin van Zonneveld');

Should return

'Kevin van Zonneveld'

Dependencies

No dependencies, you can use this function standalone.

Open syntax issues

php.js uses JsLint to help us keep our code consistent and prevent some common bugs.

Eventually we want all code to pass or at least take into consideration most fixes suggested by JsLint, following this JsLint configuration we’ve decided on.


Authors

Thanks to the following developers, you get to have utf8_decode goodness in JavaScript.

Comments


Brett Zamir
Jan 31st

By the way, if you change the functions to avoid String.fromCharCode() and charCodeAt(), and just deal with an array of numbers, you can convert between the character sets more efficiently (unless you need to print to string format).
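As a rough illustration of the array-of-numbers idea, here is a minimal sketch, not code from php.js. The helper name utf8BytesToCodePoints is hypothetical, and the sketch assumes the input is an array of byte values (0-255). It does no validation of continuation bytes.

```javascript
// Hypothetical helper (not part of php.js): decode an array of UTF-8 byte
// values (0-255) into an array of Unicode code points, without building
// any intermediate strings.
function utf8BytesToCodePoints(bytes) {
    var out = [], i = 0, b1;
    while (i < bytes.length) {
        b1 = bytes[i];
        if (b1 < 128) {
            // 1-byte sequence (ASCII)
            out.push(b1);
            i += 1;
        } else if (b1 > 191 && b1 < 224) {
            // 2-byte sequence
            out.push(((b1 & 31) << 6) | (bytes[i + 1] & 63));
            i += 2;
        } else if (b1 > 223 && b1 < 240) {
            // 3-byte sequence
            out.push(((b1 & 15) << 12) | ((bytes[i + 1] & 63) << 6) | (bytes[i + 2] & 63));
            i += 3;
        } else if (b1 > 239 && b1 < 248) {
            // 4-byte sequence (code points above U+FFFF)
            out.push(((b1 & 7) << 18) | ((bytes[i + 1] & 63) << 12) |
                     ((bytes[i + 2] & 63) << 6) | (bytes[i + 3] & 63));
            i += 4;
        } else {
            // Stray continuation byte or invalid lead byte: emit U+FFFD and move on
            out.push(0xFFFD);
            i += 1;
        }
    }
    return out;
}

// Only convert to a string at the very end, if a string is needed at all
var sharpS = String.fromCharCode.apply(null, utf8BytesToCodePoints([0xC3, 0x9F])); // "ß"
```

Note that String.fromCharCode only handles values up to 0xFFFF, so code points from the 4-byte branch would need to be split into surrogate pairs before being turned into a string.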

Brett Zamir
Jan 31st

@Chris Ahrweiler: Are you sure you are using it correctly?

Please note that these functions in php.js are not related to whether the current page itself is in UTF-8 or not; JavaScript always uses UTF-16 internally.

As a little background, if you weren't aware: JavaScript internally uses two bytes to represent every character (or four bytes for some rarer characters, which are stored as surrogate pairs: two pseudo-characters joined together to form a single character, but whose string length is 2). UTF-8, on the other hand, uses 1-4 bytes for each character, while ISO-8859-1 uses exactly 1 byte (it can do that because it is far more limited, handling only Latin characters). UTF-8 uses 1-2 bytes for the same range as ISO-8859-1 (1 byte for ASCII); the reason it must use 2 bytes in some cases, even though Unicode assigns the same code points as ISO-8859-1, is that the high bit of a UTF-8 byte is reserved for indicating whether the byte is a complete 1-byte sequence or part of a multi-byte sequence.

So, in order to represent UTF-8 in JavaScript (and even ISO-8859-1), we are forced to use either an array of numbers, where each number represents the value of a single byte, or a string where each character has the code point value of a single byte.

Internally, JavaScript will use two bytes (or rarely 4 bytes), but we can use its strings to represent a sequence of single byte characters (ISO-8859-1) or a sequence of 1-4 byte character sequences (UTF-8). (If you put the text in an alert, our "UTF-8" will not be human-readable, while our "ISO-8859-1" will be.)

In JavaScript you can easily discover what the UTF-8 sequence for a character should be by using encodeURIComponent. For the sharp-S, it gives "%C3%9F", which, translated into a string of single-byte characters, is "\u00C3\u009F" or, more visually, "Ã\u009F".

In ISO-8859-1, it should simply be "\u00DF" or "ß".
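The encodeURIComponent trick can be sketched in a few lines. This is only an illustration under stated assumptions: toUtf8ByteString is a hypothetical helper name, not a php.js function, and it relies on encodeURIComponent emitting every "%" in its output as part of a two-digit hex escape.

```javascript
// Hypothetical helper: turn a regular JavaScript string into a "byte string"
// in which each character's code is one byte of the UTF-8 encoding.
function toUtf8ByteString(str) {
    // encodeURIComponent percent-escapes every UTF-8 byte it does not leave
    // as plain ASCII; a literal "%" in the input becomes "%25", so every "%"
    // in the result starts a two-digit hex escape.
    return encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function (match, hex) {
        return String.fromCharCode(parseInt(hex, 16));
    });
}

var iso = '\u00DF';                      // "ß" as a regular JavaScript string
var utf8 = toUtf8ByteString(iso);        // "\u00C3\u009F"
```

Here iso.length is 1 while utf8.length is 2, which is the difference between the two representations described above.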

In other words, we can treat a regular JavaScript string as though it were ISO-8859-1, since both have a fixed correspondence between a character and a number of bytes: ISO-8859-1 is 1-to-1, while UTF-16 is 1-to-2 (except for rare, non-Latin characters not present in ISO-8859-1, where the correspondence is 1-to-4; 4-byte characters are necessary in some cases since even 2 bytes are not enough to represent all of the scripts supported by Unicode).

So, "utf8_decode" would only be used if you already have a string of characters in this artificial "UTF-8" where each "character" has up to a full byte of value (0-255)--representing a single byte in a UTF-8 1-4 byte sequence.

"utf8_encode", on the other hand, would let you encode an ISO-8859-1 string (or Latin UTF-16 strings) into such an artificial "UTF-8" sequence of 1-2 byte sequences.
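To see why the input must already be such an artificial "UTF-8" byte string, here is a small sketch that decodes one using the built-in decodeURIComponent rather than the php.js loop. The helper name fromUtf8ByteString is hypothetical, and the strictness shown (throwing on bad input) is a property of this sketch, not of utf8_decode.

```javascript
// Hypothetical helper: decode a "byte string" (every char code 0-255, each one
// a UTF-8 byte) back into a regular JavaScript string.
function fromUtf8ByteString(byteStr) {
    var escaped = '', i, code;
    for (i = 0; i < byteStr.length; i++) {
        code = byteStr.charCodeAt(i);
        if (code > 255) {
            throw new RangeError('Not a byte string: char code ' + code + ' at index ' + i);
        }
        // Percent-escape every byte as two hex digits
        escaped += '%' + (code < 16 ? '0' : '') + code.toString(16);
    }
    // decodeURIComponent throws a URIError if the bytes are not valid UTF-8
    return decodeURIComponent(escaped);
}

var decoded = fromUtf8ByteString('\u00C3\u009F');   // "ß"

// Passing an already-decoded "ß" (a single 0xDF byte) is not valid UTF-8
var rejected = false;
try {
    fromUtf8ByteString('\u00DF');
} catch (e) {
    rejected = e instanceof URIError;
}
```

The second call fails because 0xDF is a 2-byte lead byte with no continuation byte after it, which is the situation that arises when a normal string is passed in place of a byte string.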

If the above doesn't help, feel free to clarify exactly what you are trying to do and why.

Chris Ahrweiler
Jan 27th

Hello,

Tried the function, looking good, but the German character "ß" wouldn't be converted correctly. Is an improved version of this function available?

Regards, Chris.

Brett Zamir
30 Dec '09

@Walessio: The declaration was already there (and for some more variables too), but I think the syntax highlighter doesn't accurately insert newlines, so when you copy-paste, it doesn't always preserve them. To be safer, just paste from the "raw js source" link.
@Kevin, I think here's some proof that it'd be better to go back to the old commenting system... The function shows the declaration, but when people copy-paste it, some lines get merged into comments, etc. Sorry for all of these "unfunded" requests... :)

Walessio
29 Dec '09

Bugs found in Firefox 3.0.16 (on Windows Vista):
i is not defined
tmp_arr is not defined

add this to solve the problem:
var i;
var tmp_arr = new Array();

also, unserialize does not unserialize serialized associative arrays

Brett Zamir
24 Sep '09

@Michel Corne: Thank you for the function. The site still has some apparent problems with Unicode in comments (if not the function!), so would you mind submitting your code to http://pastebin.com/ and sending us a link?

Michel Corne
23 Sep '09

I am afraid there is a problem with this function. Try to convert "à", which is a multibyte character. "str_data.length" equals "1" (char) since JavaScript is UTF-8 compliant. But c1 = 224, which is > 128, and you end up playing with c2 and c3 that don't even exist :-) Here is the code I propose. It simply translates the string from UTF-8 to ISO-8859-15 (improved from ISO-8859-1) with a translation table for all non-ASCII characters. Let me know what you think. Thanks, MC.

[code lang="javascript"]
function utf8_decode (utf8) {
    // control characters are left for alignment reasons, they will not be used anyway!
    var i, ch, iso885915 = '',
    utf8ToIso885915 = {
        'NBSP': '\xA0', '¡': '\xA1', '¢': '\xA2', '£': '\xA3', '€': '\xA4', '¥': '\xA5', 'Š': '\xA6', '§': '\xA7',
        'š': '\xA8', '©': '\xA9', 'ª': '\xAA', '«': '\xAB', '¬': '\xAC', 'SHY': '\xAD', '®': '\xAE', '¯': '\xAF',
        '°': '\xB0', '±': '\xB1', '²': '\xB2', '³': '\xB3', 'Ž': '\xB4', 'µ': '\xB5', '¶': '\xB6', '·': '\xB7',
        'ž': '\xB8', '¹': '\xB9', 'º': '\xBA', '»': '\xBB', 'Œ': '\xBC', 'œ': '\xBD', 'Ÿ': '\xBE', '¿': '\xBF',
        'À': '\xC0', 'Á': '\xC1', 'Â': '\xC2', 'Ã': '\xC3', 'Ä': '\xC4', 'Å': '\xC5', 'Æ': '\xC6', 'Ç': '\xC7',
        'È': '\xC8', 'É': '\xC9', 'Ê': '\xCA', 'Ë': '\xCB', 'Ì': '\xCC', 'Í': '\xCD', 'Î': '\xCE', 'Ï': '\xCF',
        'Ð': '\xD0', 'Ñ': '\xD1', 'Ò': '\xD2', 'Ó': '\xD3', 'Ô': '\xD4', 'Õ': '\xD5', 'Ö': '\xD6', '×': '\xD7',
        'Ø': '\xD8', 'Ù': '\xD9', 'Ú': '\xDA', 'Û': '\xDB', 'Ü': '\xDC', 'Ý': '\xDD', 'Þ': '\xDE', 'ß': '\xDF',
        'à': '\xE0', 'á': '\xE1', 'â': '\xE2', 'ã': '\xE3', 'ä': '\xE4', 'å': '\xE5', 'æ': '\xE6', 'ç': '\xE7',
        'è': '\xE8', 'é': '\xE9', 'ê': '\xEA', 'ë': '\xEB', 'ì': '\xEC', 'í': '\xED', 'î': '\xEE', 'ï': '\xEF',
        'ð': '\xF0', 'ñ': '\xF1', 'ò': '\xF2', 'ó': '\xF3', 'ô': '\xF4', 'õ': '\xF5', 'ö': '\xF6', '÷': '\xF7',
        'ø': '\xF8', 'ù': '\xF9', 'ú': '\xFA', 'û': '\xFB', 'ü': '\xFC', 'ý': '\xFD', 'þ': '\xFE', 'ÿ': '\xFF'
    };

    for (i = 0; i < utf8.length; i++) {
        // charAt() instead of utf8[i]: older versions of IE do not support string indexing
        ch = utf8.charAt(i);
        iso885915 += utf8ToIso885915[ch] ? utf8ToIso885915[ch] : ch;
    }

    return iso885915;
}
[/code]
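One way to tell the two situations in this thread apart (a regular string such as "à" versus a string whose characters stand for UTF-8 bytes) is to check the input before decoding. The following is a loose sketch under stated assumptions: isUtf8ByteString is a hypothetical helper, not part of php.js, and it only checks lead and continuation byte ranges, not overlong or otherwise illegal sequences.

```javascript
// Hypothetical helper: loosely check whether every char code in `str` could be
// a byte of a well-formed UTF-8 sequence (lead byte followed by the right
// number of continuation bytes).
function isUtf8ByteString(str) {
    var i = 0, c, need;
    while (i < str.length) {
        c = str.charCodeAt(i);
        if (c < 128) {
            need = 0;                 // ASCII
        } else if (c > 191 && c < 224) {
            need = 1;                 // lead byte of a 2-byte sequence
        } else if (c > 223 && c < 240) {
            need = 2;                 // lead byte of a 3-byte sequence
        } else if (c > 239 && c < 248) {
            need = 3;                 // lead byte of a 4-byte sequence
        } else {
            return false;             // stray continuation byte or value above 247
        }
        i++;
        while (need-- > 0) {
            c = str.charCodeAt(i++);  // NaN when the string ends too early
            if (!(c > 127 && c < 192)) {
                return false;
            }
        }
    }
    return true;
}

var plain = isUtf8ByteString('\u00E0');        // false: "à" as a regular string
var bytes = isUtf8ByteString('\u00C3\u00A0');  // true: the two UTF-8 bytes of "à"
```

With the regular string "à", the lead byte 224 announces two continuation bytes that are not there, which matches the missing c2 and c3 described in the comment above.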

Kevin van Zonneveld
12 May '09

@ Ben Pettit: There has been some talk about it over at the utf8_encode function. Let's work it out over there, and then - if needed - we'll fix utf8_decode accordingly.

Ben Pettit
5 May '09

This function had trouble with the utf8_encode function when I included the transport.min.js file in my Adobe JavaScript plugin while using the md5 function.
The way I fixed it was by adding this as the first line:

str_data = str_data.valueOf();


Thanks for the great open source library! I'm just trying to give back a little bit.
Cheers,
Ben.

Kevin van Zonneveld
3 Apr '09

@ Tim de Koning: thx for helping out man!
@ Otto Wyss: Can you confirm this?

Tim de Koning
1 Apr '09

@otto

PHP

utf8_encode('Sihlhölzli')

returns 'SihlhÃ¶lzli'

JAVASCRIPT

utf8_decode('SihlhÃ¶lzli');

returns 'Sihlhölzli'

Seems fine to me...?

Otto Wyss
1 Apr '09

I have a folder named "Sihlhölzli" which I utf8_encode on the server with PHP. This gives me "Sihlh\u00f6lzli", and if it's displayed by the browser it nicely shows "Sihlhölzli". Of course I can't use this name to load an image, so I use utf8_decode in JavaScript to decode the name. Unfortunately that doesn't work; not even the browser can display it. Any idea what's wrong?

Kevin van Zonneveld
6 Oct '08

@ hitwork: Wow thank you hitwork!

hitwork
5 Oct '08

There is an error in this function on the following line:

} else if ((c1 > 191) && (c < 224)) {

it should be c1<224

br,
margus

Kevin van Zonneveld
21 Sep '08

@ Norman: Thank you, fixed. What do you mean by 2 times htmlentities though?

Norman "zEh" Fuchs
18 Sep '08

.. and c3 declaration is missing :D

Norman "zEh" Fuchs
18 Sep '08

2 times htmlentities -.-

#3. .. &quot;zEh&quot;

Norman "zEh" Fuchs
18 Sep '08

Unused c1 - fix it : )

Kevin van Zonneveld
8 May '08

@ Aman Gupta: Thanks, I've updated all of the base64 & utf functions based on your findings.

Aman Gupta
8 May '08

This implementation is extremely slow in IE due to string concatenation. It is much faster to push onto an array and return array.join('').

