JavaScript utf8_encode
Encodes an ISO-8859-1 string to UTF-8
function utf8_encode (argString) {
    // Encodes an ISO-8859-1 string to UTF-8
    //
    // version: 1109.2015
    // discuss at: http://phpjs.org/functions/utf8_encode
    // +   original by: Webtoolkit.info (http://www.webtoolkit.info/)
    // +   improved by: Kevin van Zonneveld (http://kevin.vanzonneveld.net)
    // +   improved by: sowberry
    // +    tweaked by: Jack
    // +   bugfixed by: Onno Marsman
    // +   improved by: Yves Sucaet
    // +   bugfixed by: Onno Marsman
    // +   bugfixed by: Ulrich
    // +   bugfixed by: Rafal Kukawski
    // *     example 1: utf8_encode('Kevin van Zonneveld');
    // *     returns 1: 'Kevin van Zonneveld'
    if (argString === null || typeof argString === "undefined") {
        return "";
    }

    var string = (argString + ''); // .replace(/\r\n/g, "\n").replace(/\r/g, "\n");
    var utftext = "",
        start, end, stringl = 0;

    start = end = 0;
    stringl = string.length;
    for (var n = 0; n < stringl; n++) {
        var c1 = string.charCodeAt(n);
        var enc = null;

        if (c1 < 128) {
            end++;
        } else if (c1 > 127 && c1 < 2048) {
            enc = String.fromCharCode((c1 >> 6) | 192) + String.fromCharCode((c1 & 63) | 128);
        } else {
            enc = String.fromCharCode((c1 >> 12) | 224) + String.fromCharCode(((c1 >> 6) & 63) | 128) + String.fromCharCode((c1 & 63) | 128);
        }
        if (enc !== null) {
            if (end > start) {
                utftext += string.slice(start, end);
            }
            utftext += enc;
            start = end = n + 1;
        }
    }

    if (end > start) {
        utftext += string.slice(start, stringl);
    }

    return utftext;
}
Examples
Running
utf8_encode('Kevin van Zonneveld');
Should return
'Kevin van Zonneveld'
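The documented example is pure ASCII, so its output equals its input. A non-ASCII character shows the byte layout more clearly. The sketch below is only an illustration and not part of php.js: `utf8Bytes` is a hypothetical helper that applies the same shifts and masks as the function to a single UTF-16 code unit and returns the resulting byte values.

```javascript
// Hypothetical helper (not part of php.js): returns the UTF-8 byte values
// for one UTF-16 code unit, using the same shifts and masks as utf8_encode.
function utf8Bytes(codeUnit) {
    if (codeUnit < 128) {
        return [codeUnit];                      // 0xxxxxxx
    }
    if (codeUnit < 2048) {
        return [(codeUnit >> 6) | 192,          // 110xxxxx
                (codeUnit & 63) | 128];         // 10xxxxxx
    }
    return [(codeUnit >> 12) | 224,             // 1110xxxx
            ((codeUnit >> 6) & 63) | 128,       // 10xxxxxx
            (codeUnit & 63) | 128];             // 10xxxxxx
}

utf8Bytes('K'.charCodeAt(0));      // [75]            (0x4B)
utf8Bytes('\u00E9'.charCodeAt(0)); // [195, 169]      (0xC3 0xA9)
utf8Bytes('\u20AC'.charCodeAt(0)); // [226, 130, 172] (0xE2 0x82 0xAC)
```

utf8_encode returns these byte values as characters of a string, so `'\u00E9'` would come back as the two-character string `'\u00C3\u00A9'`.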
Dependencies
No dependencies, you can use this function standalone.
Open syntax issues
php.js uses JsLint to help us keep our code consistent and prevent some common bugs.
Eventually we want all code to pass or at least take into consideration most fixes suggested by JsLint, following this JsLint configuration we’ve decided on.
Authors
Thanks to the following developers, you get to have utf8_encode goodness in JavaScript.
@Gajus: JavaScript uses Unicode internally, even if your document is encoded in ISO-8859-1. This function should only be needed if you have a string already using that encoding (or otherwise you are double-encoding). You could provide the codepoints you have, and what you expect it to return.
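To illustrate the double-encoding risk mentioned above, here is a small sketch. It assumes `unescape(encodeURIComponent(...))` as a stand-in for utf8_encode on well-formed input; `toUtf8` is a hypothetical name used only for this example.

```javascript
// Stand-in for utf8_encode, assumed equivalent for well-formed input.
function toUtf8(str) {
    return unescape(encodeURIComponent(str));
}

var original = '\u00E9';      // "é" as an ordinary JavaScript string
var once = toUtf8(original);  // "\u00C3\u00A9": two characters, one per UTF-8 byte
var twice = toUtf8(once);     // each of those two characters is encoded again

// once.length === 2, twice.length === 4
```

Encoding a string that already holds UTF-8 bytes produces the second result, which typically shows up as mojibake such as "Ã©" when displayed.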
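function utf8_encode(){
    var str = arguments[0] + "",
        len = str.length - 1,
        i = -1,
        result = "";
    while( !!(i++ - len) ){
        var c = str.charCodeAt(i),
            // candidate byte sequences for 1-, 2- and 3-byte encodings
            ops = [
                [c],
                [c >> 6 | 192, c & 63 | 128],
                [c >> 12 | 224, c >> 6 & 63 | 128, c & 63 | 128]
            ],
            // separate selector, so the loop index i is not overwritten
            k = c < 128 ? 0 : c < 2048 ? 1 : 2;
        // pass the byte values as separate arguments instead of summing them
        result += String.fromCharCode.apply(null, ops[k]);
    }
    return result;
}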
Patch using Eli's suggestion (though, he posted the equivalent for utf8_decode):
--- utf8_encode.js.old 2011-01-09 12:23:22.000000000 -0500
+++ utf8_encode.js 2011-01-09 12:23:49.000000000 -0500
@@ -11,6 +11,10 @@
// * example 1: utf8_encode('Kevin van Zonneveld');
// * returns 1: 'Kevin van Zonneveld'
+ if (typeof window.encodeURIComponent !== 'undefined') {
+ return unescape( window.encodeURIComponent( argString ));
+ }
+
var string = (argString+''); // .replace(/\r\n/g, "\n").replace(/\r/g, "\n");
var utftext = "";
Eli, that's not cross-browser portable, and it fails on some input, e.g., Firefox throws "malformed URI sequence" errors.
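One input that is known to trigger this kind of error is an unpaired surrogate, for which `encodeURIComponent` throws a `URIError`. The sketch below shows one possible way to guard the shortcut; `utf8EncodeNative` and its `fallback` parameter are hypothetical names, not part of php.js.

```javascript
// Hypothetical wrapper: use the built-ins when they accept the input and
// fall back to another encoder (supplied by the caller) when they throw.
function utf8EncodeNative(str, fallback) {
    try {
        return unescape(encodeURIComponent(str));
    } catch (e) {
        // encodeURIComponent throws URIError on unpaired surrogates
        return fallback(str);
    }
}

var loneSurrogate = '\uD800';
var threw = false;
try {
    encodeURIComponent(loneSurrogate);
} catch (e) {
    threw = e instanceof URIError;
}
// threw is now true
```

The manual loop in the function body could serve as the fallback, since it never throws on such input.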
This entire function can be replaced with the following.
function utf8_encode (argString) {
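    // encodeURIComponent emits %XX escapes for the UTF-8 bytes; unescape turns each into one character
    return unescape(encodeURIComponent(argString));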
}
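As a quick check of which built-in pairing encodes and which decodes, the following sketch round-trips a short string. The variable names are illustrative only.

```javascript
var text = 'Zo\u00EB';                               // "Zoë"

// Encoding direction: string -> UTF-8 bytes held as characters
var encoded = unescape(encodeURIComponent(text));

// Decoding direction: UTF-8 bytes held as characters -> string
var decoded = decodeURIComponent(escape(encoded));

// encoded === 'Zo\u00C3\u00AB' and decoded === text
```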
@Cristián: I think what happened is that you must have copied the text directly from this page, but the commenting code often messes up the new lines. To get a pristine version (and the latest version), use the link "raw js source"...
Hey, you have an error.
The code uses the variable string (string.*), but the argument is named "argString", not "string".
With argString in place of string, it works correctly.
This function will throw an exception if passed an empty string.
I think its contents need to be wrapped in "try {} catch(e) {} return '';", with the following line added at the start:
if (argString == '') return '';
I made a fix so this function ran correctly in adobe javascript.
function utf8_encode ( string ) {
    // Encodes an ISO-8859-1 string to UTF-8
    //
    // version: 812.316
    // discuss at: http://phpjs.org/functions/utf8_encode
    // + original by: Webtoolkit.info (http://www.webtoolkit.info/)
    // + improved by: Kevin van Zonneveld (http://kevin.vanzonneveld.net)
    // + improved by: sowberry
    // + tweaked by: Jack
    // + bugfixed by: Onno Marsman
    // + improved by: Yves Sucaet
    // + bugfixed by: Onno Marsman
    // + adobe js by: Ben Pettit
    // * example 1: utf8_encode('Kevin van Zonneveld');
    // * returns 1: 'Kevin van Zonneveld'
    string = string.valueOf(); // <-bp: I added this line.
    string = (string+'').replace(/\r\n/g, "\n").replace(/\r/g, "\n");
    var utftext = "";
    var start, end;
    var stringl = 0;
This is just weird. Of course the extra (string+'') is not necessary. The following would do exactly the same:
[CODE="Javascript"]
string = (string+'').replace(/\r\n/g, "\n").replace(/\r/g, "\n");
[/CODE]
or even something like (not tested):
[CODE="Javascript"]
string = (string+'').replace(/\r\n?/g, "\n");
[/CODE]
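Since the single-pass variant above was posted untested, here is a small sketch comparing it with the two-step form on a string that mixes all three line-ending styles. The function names `normalizeTwoStep` and `normalizeOnePass` are made up for this example.

```javascript
// Two-step normalization, as in the chained version above
function normalizeTwoStep(s) {
    return (s + '').replace(/\r\n/g, "\n").replace(/\r/g, "\n");
}

// Single-pass variant: a \r optionally followed by \n
function normalizeOnePass(s) {
    return (s + '').replace(/\r\n?/g, "\n");
}

var sample = "a\r\nb\rc\nd\r\r\ne";

normalizeTwoStep(sample); // "a\nb\nc\nd\n\ne"
normalizeOnePass(sample); // "a\nb\nc\nd\n\ne"
```

On this sample the two forms agree, including the `\r\r\n` case, where both yield two newlines.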
I think it makes sense to replace
[CODE="Javascript"]
string = (string+'').replace(/\r\n/g,"\n");
[/CODE]
with
[CODE="Javascript"]
string = (string+'').replace(/\r\n/g,"\n");
string = (string+'').replace(/\r/g,"\n");
[/CODE]
While looking for a javascript crc script, I found the version on webtoolkit.info as well as your subsequent modification.
Testing with a chunk of text a couple hundred characters long, with just a couple non-ascii values, I saw no significant improvement with your approach of using an array as a pseudo-StringBuilder. The issue is the use of String.fromCharCode for even ascii values, which forces too many string creations. The code below is about 3 times faster in my tests:
[CODE="Javascript"]
function utf8_encode(string) {
    string = string.replace(/\r\n/g,"\n");
    var utftext = "";
    var start, end;
    start = end = 0;
    for (var n = 0; n < string.length; n++) {
        var c = string.charCodeAt(n);
        var enc = null;
        if (c < 128) {
            end++;
        }
        else if((c > 127) && (c < 2048)) {
            enc = String.fromCharCode((c >> 6) | 192) + String.fromCharCode((c & 63) | 128);
        }
        else {
            enc = String.fromCharCode((c >> 12) | 224) + String.fromCharCode(((c >> 6) & 63) | 128) + String.fromCharCode((c & 63) | 128);
        }
        if (enc != null)
        {
            if (end > start)
            {
                utftext += string.substring(start, end);
            }
            utftext += enc;
            start = end = n+1;
        }
    }
    if (end > start)
    {
        utftext += string.substring(start, string.length);
    }
    return utftext;
}
[/CODE]
Please feel free to post this to the various script repositories, as I am not especially active on the web. Thanks.
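The comparison described above could be reproduced with a rough harness like the one below. This is a sketch under assumptions: `encodePerChar` is a reconstruction of the "one String.fromCharCode per character" style rather than the exact webtoolkit.info code, `encodeRuns` condenses the run-copying idea, and `timeIt` is a hypothetical helper. Timings vary widely by engine and input, so no numbers are claimed here.

```javascript
// Per-character variant: one String.fromCharCode call for every character
function encodePerChar(s) {
    var out = "";
    for (var n = 0; n < s.length; n++) {
        var c = s.charCodeAt(n);
        if (c < 128) {
            out += String.fromCharCode(c);
        } else if (c < 2048) {
            out += String.fromCharCode((c >> 6) | 192, (c & 63) | 128);
        } else {
            out += String.fromCharCode((c >> 12) | 224, ((c >> 6) & 63) | 128, (c & 63) | 128);
        }
    }
    return out;
}

// Run-copying variant: each ASCII run is appended with one substring call
function encodeRuns(s) {
    var out = "", start = 0;
    for (var n = 0; n < s.length; n++) {
        var c = s.charCodeAt(n);
        if (c < 128) { continue; }
        out += s.substring(start, n);
        out += c < 2048
            ? String.fromCharCode((c >> 6) | 192, (c & 63) | 128)
            : String.fromCharCode((c >> 12) | 224, ((c >> 6) & 63) | 128, (c & 63) | 128);
        start = n + 1;
    }
    return out + s.substring(start);
}

// Hypothetical timing helper: elapsed milliseconds for `rounds` calls
function timeIt(fn, input, rounds) {
    var t0 = Date.now();
    for (var i = 0; i < rounds; i++) { fn(input); }
    return Date.now() - t0;
}

var mostlyAscii = new Array(200).join("plain text ") + "\u00E9\u20AC";
var perCharMs = timeIt(encodePerChar, mostlyAscii, 1000);
var runsMs = timeIt(encodeRuns, mostlyAscii, 1000);
```

Comparing `perCharMs` and `runsMs` across a few inputs (mostly ASCII versus mostly non-ASCII) gives a feel for when the run-copying approach pays off.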


kirilloid
Feb 8th
Replacing lines 34-36 with
enc = String.fromCharCode((c1 >> 6) | 192, (c1 & 63) | 128);
} else {
enc = String.fromCharCode((c1 >> 12) | 224, ((c1 >> 6) & 63) | 128, (c1 & 63) | 128);
may reduce execution time from 20x to 12x on mostly non-ASCII strings (e.g. Cyrillic text).
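The suggestion relies on `String.fromCharCode` accepting several code values in one call. A minimal check that the single-call form yields the same string as the concatenated form, using a Cyrillic letter as the sample (the variable names are illustrative):

```javascript
var c1 = 0x0416; // "Ж", a Cyrillic letter in the two-byte range

// Original form: one call per byte, then string concatenation
var concatenated = String.fromCharCode((c1 >> 6) | 192) +
                   String.fromCharCode((c1 & 63) | 128);

// Suggested form: a single call with both byte values
var singleCall = String.fromCharCode((c1 >> 6) | 192, (c1 & 63) | 128);

// Both are "\u00D0\u0096" (UTF-8 bytes 0xD0 0x96)
```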