JavaScript str_word_count
Counts the number of words inside a string. If a format of 1 is specified, the function returns an array containing all the words found in the string. If a format of 2 is specified, it returns an associative array (a plain object in JavaScript) in which the key is the position of the word in the string and the value is the word itself. For the purposes of this function, a 'word' is defined as a locale-dependent string of alphabetic characters, which may also contain, but not start with, the "'" and "-" characters.
    function str_word_count (str, format, charlist) {
        // Counts the number of words inside a string. If format of 1 is specified, then the function will return an array containing all the words found inside the string. If format of 2 is specified, then the function will return an associated array where the position of the word is the key and the word itself is the value. For the purpose of this function, 'word' is defined as a locale dependent string containing alphabetic characters, which also may contain, but not start with "'" and "-" characters.
        //
        // version: 909.322
        // discuss at: http://phpjs.org/functions/str_word_count
        // +   original by: Ole Vrijenhoek
        // +   bugfixed by: Kevin van Zonneveld (http://kevin.vanzonneveld.net)
        // +   bugfixed by: Brett Zamir (http://brett-zamir.me)
        // %        note 1: Original author stated that "charlist parameter works correct but the last word in the given string will not be counted", but seems to work
        // *     example 1: str_word_count('Hello fri3nd, youre looking good today!', 1, 'àáãç3');
        // *     returns 1: ['Hello', 'fri3nd', 'youre', 'looking', 'good', 'today']
        // A word is valid when it contains a-z A-Z ' - Ole Vrijenhoek
        var l = str.length, tmpStr = "";
        var i = 0;
        var c = '';
        var wArr = [], wC = 0;
        var assArr = {}, aC = 0, reg = "";

        if (charlist) {
            for (i = 0; i <= charlist.length - 1; i++) {
                if (i != charlist.length - 1) {
                    reg = reg + charlist.charCodeAt(i) + "|";
                } else {
                    reg = reg + charlist.charCodeAt(i);
                }
            }
            reg = new RegExp(reg);
        }

        for (i = 0; i <= l - 1; i++) {
            c = str.charCodeAt(i);
            if ((c < 91 && c > 64) || (c < 123 && c > 96) || c == 45 || c == 39 || (reg && reg.test(c))) {
                if (tmpStr == "" && format == 2) {
                    aC = i;
                }
                tmpStr = tmpStr + String.fromCharCode(c);
            } else if (tmpStr != "") {
                if (format != 2) {
                    wArr[wArr.length] = tmpStr;
                } else {
                    assArr[aC] = tmpStr;
                }
                tmpStr = "";
                wC++;
            }
        }

        if (!format) {
            return wC;
        } else if (format == 1) {
            return wArr;
        } else if (format == 2) {
            return assArr;
        }
        throw 'You have supplied an incorrect format';
    }
Examples
Running
    str_word_count('Hello fri3nd, youre looking good today!', 1, 'àáãç3');
Should return
    ['Hello', 'fri3nd', 'youre', 'looking', 'good', 'today']
Dependencies
No dependencies; you can use this function standalone.
Open syntax issues
php.js uses JSLint to help us keep our code consistent and prevent some common bugs.
Eventually we want all code to pass, or at least to take into account most of the fixes JSLint suggests, under the JSLint configuration we've decided on.
Authors
Thanks to the following developers, you get to have str_word_count goodness in JavaScript.
The JavaScript function returns 5 words for this string:
    Lorem ipsum dolor asdf asdf asdf
And 6 words for this one:
    Lorem ipsum dolor asdf asdf asdf.
The PHP function returns 6 for both.
Cheers,
Chris
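The difference between those two strings is only the trailing period, which suggests that a word is recorded only when a non-word character follows it. Here is a minimal sketch of one possible fix, assuming the same ASCII word rule as the listing above (a-z, A-Z, "'" and "-") and ignoring the charlist and format parameters. The name countWordsWithFlush is hypothetical and not part of php.js.

```javascript
// Hypothetical helper, not part of php.js: counts words using the same
// ASCII rule as str_word_count above, but also counts a word that is
// still pending when the end of the string is reached.
function countWordsWithFlush(str) {
    var count = 0;
    var inWord = false;
    for (var i = 0; i < str.length; i++) {
        var c = str.charCodeAt(i);
        var isWordChar = (c > 64 && c < 91) ||   // A-Z
                         (c > 96 && c < 123) ||  // a-z
                         c === 45 ||             // "-"
                         c === 39;               // "'"
        if (isWordChar) {
            inWord = true;
        } else if (inWord) {
            // A non-word character closes the current word.
            count++;
            inWord = false;
        }
    }
    // Without this final check, a word that ends the string is never counted.
    if (inWord) {
        count++;
    }
    return count;
}

console.log(countWordsWithFlush('Lorem ipsum dolor asdf asdf asdf'));  // 6
console.log(countWordsWithFlush('Lorem ipsum dolor asdf asdf asdf.')); // 6
```

The same end-of-loop check could be applied to tmpStr in the full function so that formats 1 and 2 also pick up the last word.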
I believe that to do this correctly, not only for Chinese but for other languages too, we'll need to take a good look at the PHP source code, specifically for PHP 6, since that is where full Unicode support is being added.
We might be able to use XRegExp (see http://stevenlevithan.com/regex/xregexp/ ) and its Unicode plug-in (at http://blog.stevenlevithan.com/archives/xregexp-unicode-plugin ) for our preg_ functions and then make str_word_count() dependent on it. That won't really help determine what a "word" is (in Chinese, a character is technically only a graphical morpheme, and not necessarily an independent word), but at least it will tell us definitively what a "letter" is. Of course, we can just go back to the source to see how PHP interprets a "word", since we're aiming for that anyway, but again, that will take some work, especially if we wish to make it work for other languages as well as Chinese. I'm pretty busy for now, but feel free to take a shot at it if you like.
FYI, as you can see from your Chinese characters getting mangled, the site is having some problems with Unicode characters at the moment, so if you need to refer to any in the future, maybe you could try using entities or JavaScript Unicode escape sequences instead (e.g., \u0020). But I think the issue goes beyond just Chinese. Chinese does, though, raise the particular need to handle characters beyond the Basic Multilingual Plane (BMP), since some Chinese characters fall outside this plane. In JavaScript, such characters must be represented by two 16-bit code units called surrogates (which are not used outside of such pairs), so we can't rely on the length of the string; see http://phpjs.org/functions/strlen for a solution.


Brett Zamir
Feb 13th
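To illustrate the surrogate point, here is a minimal sketch of counting characters rather than UTF-16 code units. It is not the php.js strlen implementation; the name codePointLength is hypothetical, and the sample string is written with escape sequences as suggested above.

```javascript
// Hypothetical helper, not part of php.js: counts a high surrogate
// followed by a low surrogate as one character instead of two.
function codePointLength(str) {
    var count = 0;
    for (var i = 0; i < str.length; i++) {
        var c = str.charCodeAt(i);
        // High surrogates occupy the range 0xD800-0xDBFF.
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < str.length) {
            var next = str.charCodeAt(i + 1);
            // Low surrogates occupy the range 0xDC00-0xDFFF.
            if (next >= 0xDC00 && next <= 0xDFFF) {
                i++; // skip the second half of the pair
            }
        }
        count++;
    }
    return count;
}

// One character outside the BMP (a surrogate pair) followed by one BMP character.
var sample = '\uD842\uDFB7\u91CE';
console.log(sample.length);           // 3
console.log(codePointLength(sample)); // 2
```

A word splitter that walks the string by index would need the same pair handling, so that a character outside the BMP is never cut in half.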