How to Create a PHP Word Popularity Script

Posted on November 7, 2008, under PHP 

This function calculates the density of the words in a text. Since many words have fewer than 3 characters, I’ve added a filter that ignores words shorter than (X) characters (for example: if, or, is, it). You can also set up an array of words that you want to leave out of the ranking calculation. Here’s the function (I’ll explain how it works below):

<?php
function calculate_word_popularity($string, $min_word_char = 2, $exclude_words = array())
{
    $string = strip_tags($string);

    // total number of words in the unfiltered text (used for the percentages)
    $initial_words_array = str_word_count($string, 1);
    $total_words = sizeof($initial_words_array);

    $new_string = $string;

    foreach($exclude_words as $filter_word)
    {
        // strip excluded words; preg_quote() escapes any regex metacharacters
        $new_string = preg_replace('/\b'.preg_quote($filter_word, '/').'\b/i', '', $new_string);
    }

    $words_array = str_word_count($new_string, 1);

    // keep only the words that have at least $min_word_char characters
    // (an anonymous function replaces create_function(), which was removed in PHP 8)
    $words_array = array_filter($words_array, function ($var) use ($min_word_char) {
        return strlen($var) >= $min_word_char;
    });

    $popularity = array();

    // lowercase first so that 'The' and 'the' are ranked as one word
    $unique_words_array = array_unique(array_map('strtolower', $words_array));

    foreach($unique_words_array as $key => $word)
    {
        preg_match_all('/\b'.preg_quote($word, '/').'\b/i', $string, $out);

        $count = count($out[0]);

        $percent = number_format((($count * 100) / $total_words), 2);

        $popularity[$key]['word'] = $word;
        $popularity[$key]['count'] = $count;
        $popularity[$key]['percent'] = $percent.'%';
    }

    // sort by 'count', from lower to higher; a closure avoids redeclaring
    // a named cmp() function every time this function is called
    usort($popularity, function ($a, $b) {
        return $a['count'] <=> $b['count'];
    });

    return $popularity;
}
?>

How does it work?

The function has 3 parameters:

1. $string – which is the text that will be analyzed
2. $min_word_char – the minimum number of characters a word must have to be taken into account. This avoids ranking very common short words (e.g. a, in, to).
3. $exclude_words – an array of specific words to leave out of the ranking (e.g. array('company', 'friends', 'values')).

First, we will strip any tags found in the string using strip_tags():

$string = strip_tags($string);
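
As a quick illustration of this step, here is a minimal sketch. The sample markup is made up for the example; strip_tags() removes the tags and keeps the text between them:

```php
<?php
// Hypothetical sample input, not taken from the article
$html = '<p>PHP is a <b>popular</b> scripting language.</p>';

// Remove the HTML tags, keep the text content
$plain = strip_tags($html);

echo $plain;
```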

Next, let’s count the number of words in the text:

$initial_words_array  =  str_word_count($string, 1);  
$total_words = sizeof($initial_words_array);
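
To see what str_word_count() returns when its second argument is 1, here is a small sketch with an illustrative sentence (an assumption for this example). Digits are not treated as word characters, so version numbers such as 4.4.9 do not add to the total:

```php
<?php
// Illustrative sentence; the version numbers contain no letters
$sample = 'PHP 4.4.9 is the last PHP 4.4 release.';

// Format 1 returns the words as an array instead of just a count
$words = str_word_count($sample, 1);

// Same total that the function stores in $total_words
$total_words = sizeof($words);

print_r($words);
echo $total_words;
```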

Now, let’s apply the ‘exclude words’ filter:

$new_string = $string;

foreach($exclude_words as $filter_word)
{
    // strip excluded words; preg_quote() escapes any regex metacharacters
    $new_string = preg_replace('/\b'.preg_quote($filter_word, '/').'\b/i', '', $new_string);
}
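
The sketch below shows the effect of the word boundaries (\b) on a made-up sentence: whole words are removed regardless of case, while a longer word that merely starts with an excluded word (here "theory") is left alone. preg_quote() is used as a precaution so that an excluded word containing regex metacharacters is matched literally.

```php
<?php
// Hypothetical sample text and exclusion list
$text = 'The theory and the users are happy';
$exclude_words = array('the', 'and', 'are');

foreach ($exclude_words as $filter_word) {
    // \b...\b matches whole words only; the i flag ignores case
    $pattern = '/\b' . preg_quote($filter_word, '/') . '\b/i';
    $text = preg_replace($pattern, '', $text);
}

// Split what is left into words again
$remaining = str_word_count($text, 1);

print_r($remaining);
```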

After applying the filter, let’s extract the words again from the (filtered) text and apply the ‘minimum characters’ filter:

$words_array = str_word_count($new_string, 1);

// an anonymous function replaces create_function(), which was removed in PHP 8
$words_array = array_filter($words_array, function ($var) use ($min_word_char) {
    return strlen($var) >= $min_word_char;
});
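
Here is a short sketch of the length filter on an illustrative word list. The callback is written as an anonymous function (PHP 5.3 or later). Note that array_filter() keeps the original keys, so the result can have gaps in its numbering:

```php
<?php
$min_word_char = 3;

// Hypothetical word list, as str_word_count($text, 1) might return it
$words_array = array('PHP', 'is', 'the', 'last', 'of', 'release');

// Keep only words with at least $min_word_char characters
$filtered = array_filter($words_array, function ($var) use ($min_word_char) {
    return strlen($var) >= $min_word_char;
});

print_r($filtered);
```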

All the filters have been applied. Now we will build the ‘popularity’ array, each entry containing the word, the number of occurrences in the text, and the percentage (for instance, if ‘software’ is found 10 times in a text with 40 words, it will have 25%).

$popularity = array();

// lowercase first so that 'The' and 'the' are ranked as one word
$unique_words_array = array_unique(array_map('strtolower', $words_array));

foreach($unique_words_array as $key => $word)
{
    // count the occurrences in the original text, ignoring case
    preg_match_all('/\b'.preg_quote($word, '/').'\b/i', $string, $out);

    $count = count($out[0]);

    $percent = number_format((($count * 100) / $total_words), 2);

    $popularity[$key]['word'] = $word;
    $popularity[$key]['count'] = $count;
    $popularity[$key]['percent'] = $percent.'%';
}
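
The percentage step can be checked in isolation. This sketch uses the numbers from the ‘software’ example above (10 occurrences in 40 words); number_format() returns a string rounded to two decimals:

```php
<?php
// Numbers from the 'software' example: 10 occurrences in a 40-word text
$count = 10;
$total_words = 40;

// Share of the whole text, formatted with two decimals
$percent = number_format(($count * 100) / $total_words, 2);

echo $percent . '%';
```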

Finally, we sort the array by the ‘count’ value, from lower to higher:

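// a closure avoids redeclaring a named function on every call;
// <=> returns -1, 0 or 1, so equal counts are handled correctly
usort($popularity, function ($a, $b) {
    return $a['count'] <=> $b['count'];
});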
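
The following sketch shows the sort on a small hand-made array (the data is hypothetical). The comparison callback is an anonymous function using the <=> operator (PHP 7 or later). usort() orders the entries from the lowest to the highest count and renumbers the keys from 0, which is why reversing the result gives the ranking from most to least frequent:

```php
<?php
// Hypothetical entries in the same shape as the 'popularity' array
$popularity = array(
    array('word' => 'php',     'count' => 4),
    array('word' => 'team',    'count' => 1),
    array('word' => 'release', 'count' => 2),
);

// Ascending sort by 'count'; keys are renumbered 0, 1, 2
usort($popularity, function ($a, $b) {
    return $a['count'] <=> $b['count'];
});

$ascending  = array_column($popularity, 'word');
$descending = array_column(array_reverse($popularity), 'word');

print_r($ascending);
print_r($descending);
```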

Here is a usage example for this function:

<?php
$text = "The PHP development team would like to announce the immediate availability 
of PHP 4.4.9. It continues to improve the security and the stability of the 4.4 branch 
and all users are strongly encouraged to upgrade to it as soon as possible. 
This release wraps up all the outstanding patches for the PHP 4.4 series, 
and is therefore the last PHP 4.4 release.";

$exclude_words = array('would','the','and','all','for','are');

$popularity = calculate_word_popularity($text, 3, $exclude_words);

// Check words with >= 3 characters and exclude the words from the $exclude_words array

krsort($popularity); // sort array (from higher to lower)

$key = 1;

echo 'Total words in the text: '.str_word_count($text).'<br><br>';

echo 'Word / Popularity / Count<br><br>';

foreach($popularity as $value)
{
    echo $key.".<b>".$value['word'].'</b> - <font color="green">'.$value['percent'].'</font> ('.$value["count"].')'. "<br>\n";
    $key++;
}
?>
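
For longer texts, running one regular expression per unique word can get slow. As an alternative (not the method used above), here is a sketch that counts all words in a single pass with array_count_values(). The function name count_words_single_pass() is made up for this example, and the counts may differ slightly from the regex version, for instance with hyphenated words, which str_word_count() treats as one word.

```php
<?php
// Hypothetical alternative: count every word once instead of one regex per word
function count_words_single_pass($string, $min_word_char = 2, $exclude_words = array())
{
    // Lowercase everything so that 'PHP' and 'php' are counted together
    $words = array_map('strtolower', str_word_count(strip_tags($string), 1));
    $total_words = count($words);
    $exclude = array_map('strtolower', $exclude_words);

    // word => number of occurrences, highest count first
    $counts = array_count_values($words);
    arsort($counts);

    $result = array();
    foreach ($counts as $word => $count) {
        // Apply the same two filters: minimum length and excluded words
        if (strlen($word) < $min_word_char || in_array($word, $exclude, true)) {
            continue;
        }
        $result[] = array(
            'word'    => $word,
            'count'   => $count,
            'percent' => number_format(($count * 100) / $total_words, 2) . '%',
        );
    }

    return $result;
}

$ranking = count_words_single_pass('PHP is great. php rocks, PHP wins', 3);
print_r($ranking[0]);
```

Because the array is already sorted from higher to lower, no extra krsort() call is needed with this variant.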

