How to Create a PHP Word Popularity Script
Posted on November 7, 2008. Filed under PHP.
This function calculates the density of the words in a text, that is, how often each word occurs relative to the total word count. Since many words are very short (examples: if, or, is, it), I've added a filter that ignores words shorter than a given number of characters. You can also set up an array of words that you do not want included in the ranking. Here's the function (I'll explain how it works below):
<?php
function calculate_word_popularity($string, $min_word_char = 2, $exclude_words = array())
{
    $string = strip_tags($string);
    $initial_words_array = str_word_count($string, 1);
    $total_words = sizeof($initial_words_array);
    $new_string = $string;
    foreach ($exclude_words as $filter_word)
    {
        // strip excluded words; preg_quote() escapes regex metacharacters in the word
        $new_string = preg_replace('/\b'.preg_quote($filter_word, '/').'\b/i', '', $new_string);
    }
    $words_array = str_word_count($new_string, 1);
    // keep only words with at least $min_word_char characters
    // (a closure replaces create_function(), which was removed in PHP 8)
    $words_array = array_filter($words_array, function ($var) use ($min_word_char) {
        return strlen($var) >= $min_word_char;
    });
    $popularity = array();
    $unique_words_array = array_unique($words_array);
    foreach ($unique_words_array as $key => $word)
    {
        // count occurrences in the original text, ignoring case
        preg_match_all('/\b'.preg_quote($word, '/').'\b/i', $string, $out);
        $count = count($out[0]);
        $percent = number_format((($count * 100) / $total_words), 2);
        $popularity[$key]['word'] = $word;
        $popularity[$key]['count'] = $count;
        $popularity[$key]['percent'] = $percent.'%';
    }
    // sort ascending by count (requires PHP 7+ for <=>); an anonymous function
    // avoids a "Cannot redeclare cmp()" error when this function is called twice
    usort($popularity, function ($a, $b) {
        return $a['count'] <=> $b['count'];
    });
    return $popularity;
}
?>
How does it work?
The function has 3 parameters:
1. $string – the text that will be analyzed.
2. $min_word_char – the minimum number of characters a word must have to be taken into consideration. This keeps very common short words (e.g. a, in, to) out of the ranking.
3. $exclude_words – an array of specific words to exclude from the ranking (e.g. array('company', 'friends', 'values')).
First, we will strip any tags found in the string using strip_tags():
$string = strip_tags($string);
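As a quick illustration of this step, here is a minimal sketch with a made-up HTML fragment (the `$html` and `$plain` names are only for this example):

```php
<?php
// Hypothetical input: a short HTML fragment.
$html = '<p>PHP is <b>fast</b> and <i>popular</i>.</p>';

// strip_tags() removes the markup and keeps the text between the tags.
$plain = strip_tags($html);

echo $plain; // PHP is fast and popular.
```

Note that strip_tags() does not insert spaces where tags were removed, so markup such as `<p>one</p><p>two</p>` becomes `onetwo` and would be counted as a single word.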
Next, let's count the number of words in the text:
$initial_words_array = str_word_count($string, 1);
$total_words = sizeof($initial_words_array);
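To see what str_word_count() returns with format 1, here is a small sketch on a made-up sentence. It also shows that, with the default settings, digits are not treated as part of a word, so version numbers such as 4.4 do not count toward the total:

```php
<?php
// Hypothetical sample sentence.
$text = 'PHP 4.4 is the last release of PHP 4';

// Format 1 returns an array containing every word found in the string.
$words = str_word_count($text, 1);

print_r($words);      // PHP, is, the, last, release, of, PHP
echo count($words);   // 7
```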
Now, let’s apply the ‘exclude words’ filter:
$new_string = $string;
foreach ($exclude_words as $filter_word)
{
    // strip excluded words; preg_quote() escapes regex metacharacters in the word
    $new_string = preg_replace('/\b'.preg_quote($filter_word, '/').'\b/i', '', $new_string);
}
After applying the filter, let's extract the words again from the (filtered) text and apply the ‘minimum characters’ filter:
$words_array = str_word_count($new_string, 1);
// a closure replaces create_function(), which was removed in PHP 8
$words_array = array_filter($words_array, function ($var) use ($min_word_char) {
    return strlen($var) >= $min_word_char;
});
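create_function() was deprecated in PHP 7.2 and removed in PHP 8.0, so on current PHP versions the minimum-length filter needs an anonymous function. A minimal sketch, assuming PHP 5.3 or later and a made-up word list:

```php
<?php
// Hypothetical word list and threshold.
$words = array('PHP', 'is', 'a', 'popular', 'language');
$min_word_char = 3;

// The closure captures $min_word_char with "use" instead of
// building PHP code from a string.
$filtered = array_filter($words, function ($word) use ($min_word_char) {
    return strlen($word) >= $min_word_char;
});

// array_filter() preserves the original keys, so re-index for a clean list.
$filtered = array_values($filtered);

print_r($filtered); // PHP, popular, language
```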
All the filters have been applied. Now we will build the ‘popularity’ array. Each entry contains the word, the number of times it occurs in the text, and its percentage of the total word count (for instance, if ‘software’ is found 10 times in a text with 40 words, it gets 25%).
$popularity = array();
$unique_words_array = array_unique($words_array);
foreach ($unique_words_array as $key => $word)
{
    // count occurrences in the original text, ignoring case
    preg_match_all('/\b'.preg_quote($word, '/').'\b/i', $string, $out);
    $count = count($out[0]);
    $percent = number_format((($count * 100) / $total_words), 2);
    $popularity[$key]['word'] = $word;
    $popularity[$key]['count'] = $count;
    $popularity[$key]['percent'] = $percent.'%';
}
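The counting and percentage arithmetic can be checked on a small input. The sketch below uses a made-up sentence and wraps the word in preg_quote(), an addition that keeps characters with a special meaning in regular expressions from breaking the pattern:

```php
<?php
// Hypothetical sample text; digits are not counted as words.
$text = 'PHP 4.4.9 is the last PHP 4.4 release. Upgrade to PHP 5.';
$total_words = str_word_count($text); // 9
$word = 'php';

// The second argument of preg_quote() also escapes the "/" delimiter.
$pattern = '/\b' . preg_quote($word, '/') . '\b/i';

// preg_match_all() returns the number of full matches.
$count = preg_match_all($pattern, $text, $matches); // 3

$percent = number_format(($count * 100) / $total_words, 2);
echo "$word: $count of $total_words words ($percent%)\n"; // php: 3 of 9 words (33.33%)

// The example from the text: 10 occurrences in 40 words.
echo number_format((10 * 100) / 40, 2); // 25.00
```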
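Finally, we will sort the array by the ‘count’ value (ascending):
// an anonymous function avoids a "Cannot redeclare cmp()" error
// when calculate_word_popularity() is called more than once
usort($popularity, function ($a, $b) {
    return $a['count'] <=> $b['count'];
});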
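A comparator can also sort in descending order directly, which makes a later krsort() unnecessary. The sketch below assumes PHP 7 or later for the spaceship operator and uses made-up data. Writing the comparator as an anonymous function matters because a named function declared inside another function is redeclared each time the outer function runs, which is a fatal error on the second call:

```php
<?php
// Hypothetical result rows in the same shape the function builds.
$popularity = array(
    array('word' => 'release', 'count' => 2),
    array('word' => 'php',     'count' => 4),
    array('word' => 'branch',  'count' => 1),
);

// Descending by count; <=> returns -1, 0 or 1, so ties are handled too.
usort($popularity, function ($a, $b) {
    return $b['count'] <=> $a['count'];
});

echo implode(', ', array_column($popularity, 'word')); // php, release, branch
```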
Here is a usage example for this function:
<?php
$text = "The PHP development team would like to announce the immediate availability
of PHP 4.4.9. It continues to improve the security and the stability of the 4.4 branch
and all users are strongly encouraged to upgrade to it as soon as possible.
This release wraps up all the outstanding patches for the PHP 4.4 series,
and is therefore the last PHP 4.4 release.";
$exclude_words = array('would','the','and','all','for','are');
// Count words with 3 or more characters and skip the words in the $exclude_words array
$popularity = calculate_word_popularity($text, 3, $exclude_words);
krsort($popularity); // reverse the ascending result (from higher to lower)
$key = 1;
echo 'Total words in the text: '.str_word_count($text).'<br><br>';
echo 'Word / Popularity / Count<br><br>';
foreach ($popularity as $value)
{
    echo $key.'. <b>'.$value['word'].'</b> - <font color="green">'.$value['percent'].'</font> ('.$value['count'].')'."<br>\n";
    $key++;
}
?>

- November 7, 2008
- article by Gabriel C.
- 6 comments
6 Replies to "How to Create a PHP Word Popularity Script"
December 15, 2008 at 2:52 PM
Good Tutorial! It was chosen for the home page of http://www.tutorialsroom.com
Waiting for your next tutorial
February 23, 2009 at 6:06 AM
what about array_count_values?
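For reference, the array_count_values() approach suggested here could look roughly like the sketch below. It is not the article's method: it counts the extracted words in one pass instead of running one regular expression per unique word, and the sample text is made up.

```php
<?php
// Hypothetical sample text.
$text = 'PHP is popular. Many sites run PHP, and PHP is easy to learn.';

// Lower-case first: array_count_values() is case-sensitive.
$words = str_word_count(strtolower($text), 1);

$counts = array_count_values($words); // word => number of occurrences
arsort($counts);                      // highest count first, keys preserved

print_r(array_slice($counts, 0, 2, true)); // php => 3, is => 2
```

The minimum-length and excluded-words filters would still have to be applied to `$words` before counting.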
March 13, 2009 at 7:15 PM
Thank you.
September 14, 2009 at 7:39 AM
Amazing! This is exactly what I had been looking for till now.
Thanks man
March 18, 2010 at 1:16 AM
Would absolutely love to see a similar function that would do 2 to 4 word phrases. I’ve attempted writing one, but it got complex too quickly for me!
July 14, 2010 at 7:33 PM
Hey there,
Thanks a lot for this function. I was surprised not to see more of them on the net; this is very useful. Here is a small enhancement, certainly not beautifully coded, but now the function can calculate the popularity of groups of 2, 3, 4... words:
function calculate_word_popularity($string, $min_word_char = 3, $exclude_words = array(), $nbre_words = 1 )
{
    $string = strip_tags(strtolower($string));
    $initial_words_array = str_word_count($string, 1);
    $total_words = sizeof($initial_words_array);
    $new_string = $string;
    foreach($exclude_words as $filter_word)
        $new_string = preg_replace("/\b".$filter_word."\b/i", "", $new_string); // strip excluded words
    $words_array = str_word_count($new_string, 1,'àèéêïîùç');
    $words_array = array_filter($words_array, create_function('$var', 'return (strlen($var) >= '.$min_word_char.');'));
    // get the combinations of 2 or more words
    if ( $nbre_words > 1 ) {
        // walk through the word array (words shorter than the limit already removed), from 0 up to max - nbre_words + 1
        foreach ( $words_array as $key => $value ) {
            // temporary variable that stores the x consecutive words
            $temp = '' ;
            $flag = true ;
            // for j going from 0 to the size of a word - 1, concatenate the words found
            for ( $j = 0 ; $j < [...] >= '.$nbre_words.';'));
    }
    $popularity = array();
    $unique_words_array = array_unique($words_array);
    foreach($unique_words_array as $key => $word)
    {
        preg_match_all('/\b'.$word.'\b/i', $string, $out);
        $count = count($out[0]);
        $percent = number_format((($count * 100) / $total_words), 2);
        if ( $count ) {
            $popularity[$key]['word'] = $word;
            $popularity[$key]['count'] = $count;
            $popularity[$key]['percent'] = $percent.'%';
        }
    }
    function cmp($a, $b)
    {
        return ($a['count'] > $b['count']) ? +1 : -1;
    }
    usort($popularity, "cmp");
    return $popularity;
}
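The idea of ranking groups of consecutive words can be sketched independently of the code in this comment. The helper below, `count_phrases()`, is a hypothetical name and not part of the article or the comment. It slides a window of `$n` words over the text and counts each phrase. It applies neither the minimum-length nor the excluded-words filter, and phrases may span sentence boundaries.

```php
<?php
// Hypothetical helper: counts phrases of $n consecutive words.
function count_phrases($string, $n = 2)
{
    $words = str_word_count(strtolower(strip_tags($string)), 1);
    $phrases = array();

    // Slide a window of $n words over the word list.
    for ($i = 0; $i + $n <= count($words); $i++) {
        $phrases[] = implode(' ', array_slice($words, $i, $n));
    }

    $counts = array_count_values($phrases); // phrase => occurrences
    arsort($counts);                        // most frequent first
    return $counts;
}

$counts = count_phrases('the last release of the last branch', 2);
print_r($counts); // "the last" occurs twice, every other pair once
```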