JavaScript html_entity_decode
Convert all HTML entities to their applicable characters
function html_entity_decode (string, quote_style) {
    // Convert all HTML entities to their applicable characters
    //
    // version: 1008.1718
    // discuss at: http://phpjs.org/functions/html_entity_decode
    // +   original by: john (http://www.jd-tech.net)
    // +      input by: ger
    // +   improved by: Kevin van Zonneveld (http://kevin.vanzonneveld.net)
    // +    revised by: Kevin van Zonneveld (http://kevin.vanzonneveld.net)
    // +   bugfixed by: Onno Marsman
    // +   improved by: marc andreu
    // +    revised by: Kevin van Zonneveld (http://kevin.vanzonneveld.net)
    // +      input by: Ratheous
    // +   bugfixed by: Brett Zamir (http://brett-zamir.me)
    // +      input by: Nick Kolosov (http://sammy.ru)
    // +   bugfixed by: Fox
    // -    depends on: get_html_translation_table
    // *     example 1: html_entity_decode('Kevin &amp; van Zonneveld');
    // *     returns 1: 'Kevin & van Zonneveld'
    // *     example 2: html_entity_decode('&lt;');
    // *     returns 2: '<'
    var hash_map = {}, symbol = '', tmp_str = '', entity = '';
    tmp_str = string.toString();

    if (false === (hash_map = this.get_html_translation_table('HTML_ENTITIES', quote_style))) {
        return false;
    }

    // fix &amp; problem
    // http://phpjs.org/functions/get_html_translation_table:416#comment_97660
    // Re-adding the key moves '&' to the end of the iteration order, so &amp; is decoded last.
    delete(hash_map['&']);
    hash_map['&'] = '&amp;';

    for (symbol in hash_map) {
        entity = hash_map[symbol];
        tmp_str = tmp_str.split(entity).join(symbol);
    }
    // The table only holds &#39; for the apostrophe; handle &#039; as well.
    tmp_str = tmp_str.split('&#039;').join("'");

    return tmp_str;
}
Examples
» Example 1
Running
html_entity_decode('Kevin &amp; van Zonneveld');
Should return
'Kevin & van Zonneveld'
» Example 2
Running
html_entity_decode('&lt;');
Should return
'<'
Dependencies
In order to use this function, you also need: get_html_translation_table
Open syntax issues
php.js uses JsLint to help us keep our code consistent and prevent some common bugs.
Eventually we want all code to pass or at least take into consideration most fixes suggested by JsLint, following this JsLint configuration we’ve decided on.
Authors
Thanks to the following developers, you get to have html_entity_decode goodness in JavaScript.
@ Brett Zamir: Yeah, I already have:
class DATABASE_CONFIG {
    var $default = array(
        'driver' => 'mysql',
        '....',
        'encoding' => 'utf8',
    );
}
in my cake datasource, which should execute that statement every time. I'm kind of puzzled about what else I need to make utf8-aware to avoid these question marks.
@ Brett Zamir: Good job man! I'm thinking the only place left that could screw us with unicode is mysql. I've changed the table collation to utf8_unicode_ci. Let's see if things improve.
Hello ?ukasz (Kevin, a Unicode bug?--otherwise, I can't credit this person for "input by"),
I did modify get_html_translation_table() to keep the order of what PHP returns for that function (and as a result removed the hack within this and other functions for adding &amp; at the end). One catch is that although get_html_translation_table() returns &#39;, the functions we use like htmlspecialchars, return &#039;. But we cannot modify get_html_translation_table() to add &#039; since that histogram (correctly) is keyed with an apostrophe leading necessarily to only one value (&#39;).
So, we have to modify the functions to work with &#039; as well (which is not a problem really since this is the only numeric character reference in the list (&apos; is XML-only, so it couldn't be used)).
So, I've fixed htmlspecialchars_decode() and html_entity_decode() to work with both &#039; and &#39; and also "fixed" htmlspecialchars() and htmlentities() to use &#039; for output as they do in PHP (without modifying get_html_translation_table() which uses &#39;).
I think that should address all the issues.
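The constraint described above, that a table keyed by character can hold only one entity per character, can be sketched roughly as follows. The `table` object and the `decodeApostrophes` helper are hypothetical names used for illustration only; they are not the php.js implementation.

```javascript
// Hypothetical table keyed by character, in the spirit of what
// get_html_translation_table() is described as returning.
var table = {};
table["'"] = '&#039;';
table["'"] = '&#39;'; // overwrites the previous value: one key, one entity

// Because the table can only carry one spelling, the second spelling
// is handled in a separate pass.
function decodeApostrophes(str) {
    return String(str)
        .split(table["'"]).join("'") // '&#39;', taken from the table
        .split('&#039;').join("'");  // the form htmlspecialchars() outputs
}

decodeApostrophes('it&#39;s and it&#039;s'); // "it's and it's"
```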
I have noticed that &#39; is decoded by html_entity_decode() as ' (apostrophe), but &#039; isn't!!! (of course, when using 'ENT_QUOTES'). The same problem exists with htmlspecialchars_decode(). I have checked that PHP decodes both &#39; and &#039;. I tried to find the code in the PHP sources, but they seem to be very complicated. I have only found a structure that stores several entities, those decoded by htmlspecialchars_decode:
php-5.2.9.tar.bz2/ext/standard/html.c, lines 454-466
static const struct {
    unsigned short charcode;
    char *entity;
    int entitylen;
    int flags;
} basic_entities[] = {
    { '"', "&quot;", 6, ENT_HTML_QUOTE_DOUBLE },
    { '\'', "&#039;", 6, ENT_HTML_QUOTE_SINGLE },
    { '\'', "&#39;", 5, ENT_HTML_QUOTE_SINGLE },
    { '<', "&lt;", 4, 0 },
    { '>', "&gt;", 4, 0 },
    { 0, NULL, 0, 0 }
};
As you can see, both &#039; and &#39; are listed.
In the case of the JS code of these two functions (in fact I think we should modify get_html_translation_table), the modification is quite complicated...
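One possible way to mirror the basic_entities structure quoted above in JavaScript is a list of entries instead of an object keyed by character, so the same character can appear more than once. This is only a sketch: the `QUOTE_*` flags, `basicEntities` and `decodeBasicEntities` are assumed names standing in for PHP's ENT_HTML_QUOTE_* constants and are not part of php.js. The ampersand is left out because the quoted structure does not list it either.

```javascript
// Illustrative stand-ins for PHP's ENT_HTML_QUOTE_* flags.
var QUOTE_NONE = 0, QUOTE_DOUBLE = 1, QUOTE_SINGLE = 2;

// A list allows two entries for the apostrophe, which an object
// keyed by character cannot express.
var basicEntities = [
    { ch: '"', entity: '&quot;', flag: QUOTE_DOUBLE },
    { ch: "'", entity: '&#039;', flag: QUOTE_SINGLE },
    { ch: "'", entity: '&#39;', flag: QUOTE_SINGLE },
    { ch: '<', entity: '&lt;', flag: QUOTE_NONE },
    { ch: '>', entity: '&gt;', flag: QUOTE_NONE }
];

// Decode every entry whose flag is either unconditional or enabled
// in the quoteFlags bitmask.
function decodeBasicEntities(str, quoteFlags) {
    var out = String(str), i, e;
    for (i = 0; i < basicEntities.length; i++) {
        e = basicEntities[i];
        if (e.flag === QUOTE_NONE || (quoteFlags & e.flag)) {
            out = out.split(e.entity).join(e.ch);
        }
    }
    return out;
}

// Both apostrophe spellings decode when single quotes are enabled.
decodeBasicEntities('&lt;a title=&quot;it&#039;s&#39;&quot;&gt;', QUOTE_DOUBLE | QUOTE_SINGLE);
// '<a title="it\'s\'">'
```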
@ Azriel Fasten: Yes but that would also make it harder for people to just copy 1 function:
http://trac.plutonia.nl/projects/phpjs/wiki/DeveloperGuidelines#DependencyvsRedundancy
The fewer dependencies the better, but of course we are not about to duplicate the histogram from get_html_translation_table 4 times, so dependencies are already made in this function family.
I think we should probably first come up with the fastest possible str_replace, and base our decision (Dependency vs Redundancy) on the final algorithm used.
I think that perhaps the replace should be relegated to str_replace, and that function should be highly optimized. Many other parts of the library all use different ways of replacing. These should all use str_replace.
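For reference, a regex-free literal replace of the kind discussed here can be sketched with split/join, the same idiom html_entity_decode uses internally. `replaceAllLiteral` is a hypothetical helper name, its argument order merely imitates PHP's str_replace, and no claim is made that it is the fastest option.

```javascript
// Replace every literal occurrence of `search` in `subject`.
// No regex is involved, so characters like '&' or '.' need no escaping.
function replaceAllLiteral(search, replacement, subject) {
    subject = String(subject);
    if (search === '') {
        // split('') would break the string into single characters.
        return subject;
    }
    return subject.split(search).join(replacement);
}

// String.prototype.replace with a string pattern stops after the first match:
'a&lt;b&lt;c'.replace('&lt;', '<');          // 'a<b&lt;c'
// split/join replaces all of them:
replaceAllLiteral('&lt;', '<', 'a&lt;b&lt;c'); // 'a<b<c'
```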
@ Azriel Fasten: You reported a bug by mail, that is exactly the same as the real PHP encountered at one point: http://bugs.php.net/bug.php?id=25707
I've read the bug report more thoroughly, and applied the same fix as was proposed there.
I put the &amp; entity at the bottom of the histogram.
Faster ways to replace (without using regex) can still be explored.
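Why the position of &amp; matters can be shown with a small sketch. The `pairs` list and `decodeInOrder` are hypothetical and cover only three entities; the point is just the ordering, not the full table.

```javascript
// Entity/character pairs, with &amp; deliberately last.
var pairs = [
    ['&lt;', '<'],
    ['&gt;', '>'],
    ['&amp;', '&']
];

// Apply the replacements strictly in the order given.
function decodeInOrder(str, orderedPairs) {
    var out = String(str), i;
    for (i = 0; i < orderedPairs.length; i++) {
        out = out.split(orderedPairs[i][0]).join(orderedPairs[i][1]);
    }
    return out;
}

// Same pairs, but with &amp; moved to the front.
var ampFirst = [pairs[2], pairs[0], pairs[1]];

decodeInOrder('&amp;lt;', pairs);    // '&lt;' (decoded once, as intended)
decodeInOrder('&amp;lt;', ampFirst); // '<'    (decoded twice: &amp;lt; -> &lt; -> <)
```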
@ marc andreu: I've revised all of the functions like get_html_translation_table, htmlentities & htmlspecialchars and their decoding counterparts, they now also support your second argument. Thank you!
Hi, I needed to deal with the second parameter of the html_entity_decode() function, and I added it as follows. I hope I got it right; in any case, it's a suggestion. That's all folks.
// {{{ html_entity_decode
function html_entity_decode(string, quote_style ) {
// Convert all HTML entities to their applicable characters
//
// + discuss at: http://kevin.vanzonneveld.net/techblog/article/javascript_equivalent_for_phps_html_entity_decode/
// + version: 810.621
// + original by: john (http://www.jd-tech.net)
// + input by: ger
// + improved by: Kevin van Zonneveld (http://kevin.vanzonneveld.net)
// + revised by: Kevin van Zonneveld (http://kevin.vanzonneveld.net)
// + bugfixed by: Onno Marsman
// % note: table from http://www.the-art-of-web.com/html/character-codes/
// * example 1: html_entity_decode('Kevin &amp; van Zonneveld');
// * returns 1: 'Kevin & van Zonneveld'
var histogram = {}, histogram_r = {}, code = 0;
var entity = ''; // the unused 'chr' was dropped; 'entity = chr = ...' created an implicit global
histogram['34'] = 'quot';
histogram['38'] = 'amp';
histogram['60'] = 'lt';
histogram['62'] = 'gt';
histogram['160'] = 'nbsp';
histogram['161'] = 'iexcl';
histogram['162'] = 'cent';
histogram['163'] = 'pound';
histogram['164'] = 'curren';
histogram['165'] = 'yen';
histogram['166'] = 'brvbar';
histogram['167'] = 'sect';
histogram['168'] = 'uml';
histogram['169'] = 'copy';
histogram['170'] = 'ordf';
histogram['171'] = 'laquo';
histogram['172'] = 'not';
histogram['173'] = 'shy';
histogram['174'] = 'reg';
histogram['175'] = 'macr';
histogram['176'] = 'deg';
histogram['177'] = 'plusmn';
histogram['178'] = 'sup2';
histogram['179'] = 'sup3';
histogram['180'] = 'acute';
histogram['181'] = 'micro';
histogram['182'] = 'para';
histogram['183'] = 'middot';
histogram['184'] = 'cedil';
histogram['185'] = 'sup1';
histogram['186'] = 'ordm';
histogram['187'] = 'raquo';
histogram['188'] = 'frac14';
histogram['189'] = 'frac12';
histogram['190'] = 'frac34';
histogram['191'] = 'iquest';
histogram['192'] = 'Agrave';
histogram['193'] = 'Aacute';
histogram['194'] = 'Acirc';
histogram['195'] = 'Atilde';
histogram['196'] = 'Auml';
histogram['197'] = 'Aring';
histogram['198'] = 'AElig';
histogram['199'] = 'Ccedil';
histogram['200'] = 'Egrave';
histogram['201'] = 'Eacute';
histogram['202'] = 'Ecirc';
histogram['203'] = 'Euml';
histogram['204'] = 'Igrave';
histogram['205'] = 'Iacute';
histogram['206'] = 'Icirc';
histogram['207'] = 'Iuml';
histogram['208'] = 'ETH';
histogram['209'] = 'Ntilde';
histogram['210'] = 'Ograve';
histogram['211'] = 'Oacute';
histogram['212'] = 'Ocirc';
histogram['213'] = 'Otilde';
histogram['214'] = 'Ouml';
histogram['215'] = 'times';
histogram['216'] = 'Oslash';
histogram['217'] = 'Ugrave';
histogram['218'] = 'Uacute';
histogram['219'] = 'Ucirc';
histogram['220'] = 'Uuml';
histogram['221'] = 'Yacute';
histogram['222'] = 'THORN';
histogram['223'] = 'szlig';
histogram['224'] = 'agrave';
histogram['225'] = 'aacute';
histogram['226'] = 'acirc';
histogram['227'] = 'atilde';
histogram['228'] = 'auml';
histogram['229'] = 'aring';
histogram['230'] = 'aelig';
histogram['231'] = 'ccedil';
histogram['232'] = 'egrave';
histogram['233'] = 'eacute';
histogram['234'] = 'ecirc';
histogram['235'] = 'euml';
histogram['236'] = 'igrave';
histogram['237'] = 'iacute';
histogram['238'] = 'icirc';
histogram['239'] = 'iuml';
histogram['240'] = 'eth';
histogram['241'] = 'ntilde';
histogram['242'] = 'ograve';
histogram['243'] = 'oacute';
histogram['244'] = 'ocirc';
histogram['245'] = 'otilde';
histogram['246'] = 'ouml';
histogram['247'] = 'divide';
histogram['248'] = 'oslash';
histogram['249'] = 'ugrave';
histogram['250'] = 'uacute';
histogram['251'] = 'ucirc';
histogram['252'] = 'uuml';
histogram['253'] = 'yacute';
histogram['254'] = 'thorn';
histogram['255'] = 'yuml';
// Reverse table. Cause for maintainability purposes, the histogram is
// identical to the one in htmlentities.
for (code in histogram) {
entity = histogram[code];
histogram_r[entity] = code;
}
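// Digits are allowed in the name so that entities like 'sup2' and 'frac14' match.
var retTemp = (string+'').replace(/(&([a-zA-Z0-9]+);)/g, function(full, m1, m2){
if (m2 in histogram_r) {
return String.fromCharCode(histogram_r[m2]);
} else {
// Unknown entity: leave it untouched instead of dropping the '&' and ';'.
return full;
}
});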
//Add for Marc Andreu Fernadnez. To decode quotes.
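// Decode depending on quote_style. Global regexes are used because
// replace() with a string pattern only replaces the first occurrence.
// Note: 'quot' is also in the histogram, so the pass above already decodes &quot;.
if (quote_style == 'ENT_QUOTES') {
retTemp = retTemp.replace(/&quot;/g,'"');
retTemp = retTemp.replace(/&#039;/g,"'");
} else if (quote_style != 'ENT_NOQUOTES') {
// All other cases (ENT_COMPAT, default, but not ENT_NOQUOTES)
retTemp = retTemp.replace(/&quot;/g,'"');
}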
return retTemp;
}// }}}
Thanks for the code!
But shouldn't you destroy tarea (otherwise we will end up with n textareas floating around in the DOM's hyperspace)?
@lubber: You sure did! And as I said, as soon as php.js supports optional components, I will include them. Thanks again!
@Kevin: I use these functions to shrink my GET parameters in cases where POST wasn't possible (imagine an img tag that generates a custom picture, where the parameters exceed the 2048-character URL limit in IE; that was the case for me). Anyway, I just wanted to contribute my 2 cents to this project :)
@ lubber: Wow, that is some awesome code and I will definitely save the links. However, the 2 functions are probably rarely used in JavaScript. That hasn't stopped me before, but in this case the 2 functions alone (72kB) will increase the total project size by 52%. That's a bit too much for now.
However, when php.js gets a page for component customization, I will include the functions and just leave them unchecked by default. Sounds good?


pedro
20 Aug '09