This function calculates the density of the words in a text. Since many words have fewer than 3 characters, I’ve added a filter that ignores words shorter than a given number of characters (examples: if, or, is, it etc.). You can also set up an array with a list of words that you want to leave out of the ranking calculation. Here’s the function (I’ll explain how it works below):
<?php
function calculate_word_popularity($string, $min_word_char = 2, $exclude_words = array())
{
    $string = strip_tags($string);
    $initial_words_array = str_word_count($string, 1);
    $total_words = sizeof($initial_words_array);
    $new_string = $string;
    foreach($exclude_words as $filter_word)
    {
        // strip excluded words; preg_quote() keeps regex characters in a word from breaking the pattern
        $new_string = preg_replace("/\b".preg_quote($filter_word, '/')."\b/i", "", $new_string);
    }
    $words_array = str_word_count($new_string, 1);
    // keep only the words with at least $min_word_char characters
    // (an anonymous function replaces create_function(), which was removed in PHP 8)
    $words_array = array_filter($words_array, function ($var) use ($min_word_char) {
        return strlen($var) >= $min_word_char;
    });
    $popularity = array();
    $unique_words_array = array_unique($words_array);
    foreach($unique_words_array as $key => $word)
    {
        preg_match_all('/\b'.preg_quote($word, '/').'\b/i', $string, $out);
        $count = count($out[0]);
        $percent = number_format((($count * 100) / $total_words), 2);
        $popularity[$key]['word'] = $word;
        $popularity[$key]['count'] = $count;
        $popularity[$key]['percent'] = $percent.'%';
    }
    // sort by 'count', lowest first; a named cmp() declared here would trigger
    // a "cannot redeclare" error the second time the function is called
    usort($popularity, function ($a, $b) {
        return $a['count'] - $b['count'];
    });
    return $popularity;
}
?>
How does it work?
The function has 3 parameters:
1. $string – which is the text that will be analyzed
2. $min_word_char – the minimum number of characters a word must have to be taken into consideration. This is used to avoid calculating importance for very common words (ex: a, in, to etc.)
3. $exclude_words – an array of specific words to exclude from the rank calculation (ex: array('company', 'friends', 'values')).
First, we will strip any tags found in the string using strip_tags():
$string = strip_tags($string);
Let’s count the number of words in the text:
$initial_words_array = str_word_count($string, 1);
$total_words = sizeof($initial_words_array);
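To make the counting step concrete, here is a minimal sketch on a made-up sentence (the sample string is an assumption, not part of the original script). It shows that str_word_count() only picks up alphabetic words, so version numbers such as 4.4.9 are not counted:

```php
<?php
// Hypothetical sample input; the markup and the numbers show what gets dropped.
$string = strip_tags("<p>PHP 4.4.9 is the <b>last</b> PHP 4.4 release.</p>");

// Format 1 returns an array with every word found in the string.
$initial_words_array = str_word_count($string, 1);
$total_words = sizeof($initial_words_array);

print_r($initial_words_array); // PHP, is, the, last, PHP, release
echo $total_words . "\n";      // 6
```

The total is taken before any filter is applied, so the percentages computed later are relative to every word in the text, not only to the words that survive the filters.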
Now, let’s apply the ‘exclude words’ filter:
$new_string = $string;
foreach($exclude_words as $filter_word)
{
    // strip excluded words; preg_quote() keeps regex characters in a word from breaking the pattern
    $new_string = preg_replace("/\b".preg_quote($filter_word, '/')."\b/i", "", $new_string);
}
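The \b word boundaries are what keep this filter from damaging longer words. A small sketch with a made-up sentence (an assumption for illustration) shows that excluding "the" leaves "theme" and "other" untouched. The sketch also passes each word through preg_quote(), a precaution so that a word containing regex characters is matched literally:

```php
<?php
// Hypothetical input: 'the' appears on its own and inside other words.
$new_string = "The theme of the other thread";
$exclude_words = array('the');

foreach ($exclude_words as $filter_word) {
    // \b matches only at word boundaries; the i flag ignores case.
    $pattern = '/\b' . preg_quote($filter_word, '/') . '\b/i';
    $new_string = preg_replace($pattern, '', $new_string);
}

$remaining = str_word_count($new_string, 1);
print_r($remaining); // theme, of, other, thread
```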
After applying the filter, let’s extract the words again from the (filtered) text and apply the ‘minimum characters’ filter:
$words_array = str_word_count($new_string, 1);
// an anonymous function replaces create_function(), which was removed in PHP 8
$words_array = array_filter($words_array, function ($var) use ($min_word_char) {
    return strlen($var) >= $min_word_char;
});
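As a rough illustration of the length filter, the sketch below runs it on a made-up word list (the list is an assumption). It uses an anonymous function rather than create_function(), which was deprecated in PHP 7.2 and removed in PHP 8.0:

```php
<?php
$min_word_char = 3;
// Hypothetical word list, as str_word_count($new_string, 1) might return it.
$words_array = array('PHP', 'is', 'in', 'wide', 'use');

// Keep only the words with at least $min_word_char characters.
$words_array = array_filter($words_array, function ($var) use ($min_word_char) {
    return strlen($var) >= $min_word_char;
});

print_r($words_array); // [0] => PHP, [3] => wide, [4] => use
```

Note that array_filter() preserves the original keys, so the result has gaps (0, 3, 4). Also, strlen() counts bytes, so an accented character in a UTF-8 string counts as more than one.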
All the filters have been applied. Now we will build the ‘popularity’ array, in which each entry contains the word, the number of occurrences in the text, and the percentage (for instance, if ‘software’ is found 10 times in a text with 40 words, it will have 25%).
$popularity = array();
$unique_words_array = array_unique($words_array);
foreach($unique_words_array as $key => $word)
{
    // count the whole-word, case-insensitive matches in the original text
    preg_match_all('/\b'.preg_quote($word, '/').'\b/i', $string, $out);
    $count = count($out[0]);
    $percent = number_format((($count * 100) / $total_words), 2);
    $popularity[$key]['word'] = $word;
    $popularity[$key]['count'] = $count;
    $popularity[$key]['percent'] = $percent.'%';
}
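The ‘software’ example can be checked with a short sketch. The 40-word text is generated with str_repeat() purely for illustration (an assumption, not real data):

```php
<?php
// Hypothetical 40-word text in which 'software' occurs 10 times.
$string = trim(str_repeat('software runs on hardware ', 10));
$total_words = str_word_count($string);

// Count whole-word, case-insensitive matches, as the loop above does.
preg_match_all('/\bsoftware\b/i', $string, $out);
$count = count($out[0]);

// (10 * 100) / 40 = 25, formatted with two decimals.
$percent = number_format(($count * 100) / $total_words, 2);
echo $percent . '%'; // 25.00%
```

Because of number_format(), the stored value is the string "25.00" with a "%" appended, which is convenient for display but not for further arithmetic.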
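Finally, we will sort the array by the ‘count’ value (lowest first):
// an anonymous function is used because a named cmp() declared inside
// calculate_word_popularity() would be redeclared on every call
usort($popularity, function ($a, $b) {
    return $a['count'] - $b['count'];
});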
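To see what the sort produces, here is a small sketch on made-up entries (the words and counts are assumptions). It uses an anonymous comparison function in place of a named cmp(), and then applies krsort() the way the usage example does:

```php
<?php
// Hypothetical entries, shaped like the ones the loop builds.
$popularity = array(
    array('word' => 'release', 'count' => 2),
    array('word' => 'PHP',     'count' => 4),
    array('word' => 'branch',  'count' => 1),
);

// usort() orders by count, lowest first, and reindexes the keys from 0.
usort($popularity, function ($a, $b) {
    return $a['count'] - $b['count'];
});

// Reversing the keys puts the highest count first.
krsort($popularity);

$words = array_column($popularity, 'word');
print_r($words); // PHP, release, branch
```

Since usort() returns the entries in ascending order, the caller has to reverse them to get a ranking; sorting in descending order inside the comparison function would be an alternative design.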
Here is a usage example of this function:
<?php
$text = "The PHP development team would like to announce the immediate availability
of PHP 4.4.9. It continues to improve the security and the stability of the 4.4 branch
and all users are strongly encouraged to upgrade to it as soon as possible.
This release wraps up all the outstanding patches for the PHP 4.4 series,
and is therefore the last PHP 4.4 release.";
$exclude_words = array('would','the','and','all','for','are');
$popularity = calculate_word_popularity($text, 3, $exclude_words);
// Count words with >= 3 characters and exclude the words from the $exclude_words array
krsort($popularity); // reverse the key order so the highest counts come first
$key = 1;
echo 'Total words in the text: '.str_word_count($text).'<br><br>';
echo 'Word / Popularity / Count<br><br>';
foreach($popularity as $value)
{
echo $key.".<b>".$value['word'].'</b> - <font color="green">'.$value['percent'].'</font> ('.$value["count"].')'. "<br>\n";
$key++;
}
?>

- November 7, 2008
- article by Gabriel C.
- 11 comments
11 Replies to "How to Create a PHP Word Popularity Script"
December 15, 2008 at 2:52 PM
Good Tutorial! It was chosen for the home page of http://www.tutorialsroom.com
Waiting for your next tutorial :)
February 23, 2009 at 6:06 AM
what about array_count_values?
March 13, 2009 at 7:15 PM
Thank you .
September 14, 2009 at 7:39 AM
Amazing! this is exactly what I had been looking for till now.
Thanks man
March 18, 2010 at 1:16 AM
Would absolutely love to see a similar function that would do 2 to 4 word phrases. I’ve attempted writing one, but it got complex too quickly for me!
July 14, 2010 at 7:33 PM
Hey there,
Thanks a lot for this function. I was surprised not to see more of them on the net; this is very useful. Here is a small enhancement, certainly not beautifully coded, but now the function can calculate the popularity of groups of 2, 3, 4... words
function calculate_word_popularity($string, $min_word_char = 3, $exclude_words = array(), $nbre_words = 1 )
{
$string = strip_tags(strtolower($string));
$initial_words_array = str_word_count($string, 1);
$total_words = sizeof($initial_words_array);
$new_string = $string;
foreach($exclude_words as $filter_word)
$new_string = preg_replace("/\b".$filter_word."\b/i", "", $new_string); // strip excluded words
$words_array = str_word_count($new_string, 1,'àèéêïîùç');
$words_array = array_filter($words_array, create_function('$var', 'return (strlen($var) >= '.$min_word_char.');'));
// collect the combinations of 2 or more words
if ( $nbre_words > 1 ) {
// walk the words array (words shorter than the limit already removed), from 0 up to max - nbre_words + 1
foreach ( $words_array as $key => $value ) {
// temporary variable that will hold the x consecutive words
$temp = '' ;
$flag = true ;
// for j going from 0 to the size of a word - 1, concatenate the words found
for ( $j = 0 ; $j = '.$nbre_words.';'));
}
$popularity = array();
$unique_words_array = array_unique($words_array);
foreach($unique_words_array as $key => $word)
{
preg_match_all('/\b'.$word.'\b/i', $string, $out);
$count = count($out[0]);
$percent = number_format((($count * 100) / $total_words), 2);
if ( $count ) {
$popularity[$key]['word'] = $word;
$popularity[$key]['count'] = $count;
$popularity[$key]['percent'] = $percent.'%';
}
}
function cmp($a, $b)
{
return ($a['count'] > $b['count']) ? +1 : -1;
}
usort($popularity, "cmp");
return $popularity;
}
January 8, 2011 at 2:57 AM
Hi,
but your code doesn’t work because it has an error. Do you have the correct code?
Thank you a lot
Bye
November 24, 2011 at 4:56 AM
Did anyone get the correct code? This looks interesting, but it’s incomplete…
Thanks!!
July 19, 2011 at 1:39 AM
Thank you very much! Came here from a Stack Overflow link.
Even as a C# developer, your code helped me a lot :-)
October 20, 2011 at 8:33 PM
Hi,
Thanks for sharing this, very useful. I was wondering how I could add a text area to this script with a “check text” button, and also make it work for foreign languages with accents. I have tested the script in a different language, and when there is an accent the rest of the word is cut.
Thanks for your help,
October 24, 2011 at 1:06 AM
Hi,
Thanks, very useful script. How can I make it work with foreign languages? At the moment words with accents are cut.
Many Thanks,