How to Create a PHP Word Popularity Script

This function calculates the density of the words in a text. Since many words have fewer than 3 characters (for example: if, or, is, it), I’ve added a filter that ignores words shorter than a given number of characters. You can also pass an array of words that you want to leave out of the ranking calculation. Here’s the function (I’ll explain how it works below):

<?php
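function calculate_word_popularity($string, $min_word_char = 2, $exclude_words = array())
{
    // Remove HTML and PHP tags so only the visible text is analyzed
    $string = strip_tags($string);

    // Total number of words in the unfiltered text (used for the percentages)
    $initial_words_array = str_word_count($string, 1);
    $total_words = count($initial_words_array);

    $new_string = $string;

    foreach ($exclude_words as $filter_word) {
        // Strip excluded words; preg_quote() escapes regex metacharacters in the word
        $new_string = preg_replace('/\b' . preg_quote($filter_word, '/') . '\b/i', '', $new_string);
    }

    $words_array = str_word_count($new_string, 1);

    // Keep only the words that have at least $min_word_char characters
    // (a closure replaces create_function(), which was removed in PHP 8)
    $words_array = array_filter($words_array, function ($var) use ($min_word_char) {
        return strlen($var) >= $min_word_char;
    });

    $popularity = array();

    $unique_words_array = array_unique($words_array);

    foreach ($unique_words_array as $key => $word) {
        // Count case-insensitive whole-word matches in the original text
        preg_match_all('/\b' . preg_quote($word, '/') . '\b/i', $string, $out);

        $count = count($out[0]);

        $percent = number_format(($count * 100) / $total_words, 2);

        $popularity[$key]['word'] = $word;
        $popularity[$key]['count'] = $count;
        $popularity[$key]['percent'] = $percent . '%';
    }

    // Sort by count, lowest first. An anonymous comparator avoids the
    // "cannot redeclare cmp()" error on a second call to this function.
    usort($popularity, function ($a, $b) {
        return $a['count'] - $b['count'];
    });

    return $popularity;
}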
?>

How does it work?

The function has 3 parameters:

1. $string – the text that will be analyzed
2. $min_word_char – the minimum number of characters a word must have to be taken into consideration. This keeps very common short words (e.g. a, in, to) out of the ranking.
3. $exclude_words – an array of specific words to exclude from the ranking (e.g. array('company', 'friends', 'values')).

First, we will strip any tags found in the string using strip_tags():

$string = strip_tags($string);

Let’s count the number of words in the text:

$initial_words_array  =  str_word_count($string, 1);  
$total_words = sizeof($initial_words_array);
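To see what this count includes, here is a minimal sketch with a made-up sentence. It assumes the default locale, where str_word_count() treats only letters (plus ' and -) as word characters, so a version number such as 4.4.9 is not counted as a word:

```php
<?php
// Made-up sample sentence, for illustration only
$sample = 'PHP 4.4.9 is out';

// Format 1 returns the words as an array; the digits and dots are skipped
$words = str_word_count($sample, 1);
$total_words = count($words);

echo implode(', ', $words) . "\n"; // PHP, is, out
echo $total_words . "\n";          // 3
```

This is why the total printed for a text can be lower than the number of space-separated tokens in it.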

Now, let’s apply the ‘exclude words’ filter:

$new_string = $string;

foreach ($exclude_words as $filter_word) {
    // strip excluded words; preg_quote() escapes regex metacharacters in the word
    $new_string = preg_replace('/\b' . preg_quote($filter_word, '/') . '\b/i', '', $new_string);
}
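The \b word boundaries matter here: they make sure only whole words are stripped. A small sketch with a made-up sentence shows that excluding ‘the’ leaves ‘therefore’ untouched (the preg_quote() call is an added safeguard against regex metacharacters in the excluded word):

```php
<?php
// Made-up sample text and exclusion list, for illustration only
$new_string = 'The last release is therefore the final one';
$exclude_words = array('the');

foreach ($exclude_words as $filter_word) {
    // \b...\b matches whole words only; the i modifier ignores case
    $new_string = preg_replace('/\b' . preg_quote($filter_word, '/') . '\b/i', '', $new_string);
}

$remaining = str_word_count($new_string, 1);
echo implode(' ', $remaining) . "\n"; // last release is therefore final one
```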

After applying the filter, let’s extract the words again from the (filtered) text and apply the ‘minimum characters’ filter:

$words_array = str_word_count($new_string, 1);

// a closure replaces create_function(), which was removed in PHP 8
$words_array = array_filter($words_array, function ($var) use ($min_word_char) {
    return strlen($var) >= $min_word_char;
});
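One limitation of this step: str_word_count() is locale dependent and works on bytes, so in UTF-8 text accented letters can be treated as word breaks. If you need accented words, one possible workaround is a Unicode-aware regular expression. The helper below, unicode_words(), is a hypothetical name and not part of the original script. It is only a sketch: it assumes UTF-8 input, and unlike str_word_count() it splits on apostrophes and hyphens.

```php
<?php
// Hypothetical helper (not part of the original script): returns the words
// of a UTF-8 string that have at least $min_word_char letters.
function unicode_words($string, $min_word_char = 1)
{
    $min = max(1, (int) $min_word_char);

    // \p{L} matches any Unicode letter; the u modifier treats input as UTF-8.
    // The {min,} quantifier applies the minimum-length filter in the same pass.
    preg_match_all('/\p{L}{' . $min . ',}/u', strip_tags($string), $matches);

    return $matches[0];
}

$words = unicode_words('Le café était fermé à midi', 3);
echo implode(', ', $words) . "\n"; // café, était, fermé, midi
```

The rest of the function would also need the u modifier on its preg_match_all() pattern for the counts to line up with these words.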

All the filters have been applied. Now we build the ‘popularity’ array. Each entry contains the word, the number of times it occurs in the text, and its percentage of the total word count (for instance, if ‘software’ is found 10 times in a text with 40 words, it gets 25%).

$popularity = array();

$unique_words_array = array_unique($words_array);

foreach ($unique_words_array as $key => $word) {
    // count case-insensitive whole-word matches in the original text
    preg_match_all('/\b' . preg_quote($word, '/') . '\b/i', $string, $out);

    $count = count($out[0]);

    $percent = number_format(($count * 100) / $total_words, 2);

    $popularity[$key]['word'] = $word;
    $popularity[$key]['count'] = $count;
    $popularity[$key]['percent'] = $percent . '%';
}
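As a quick check of the percentage arithmetic, here is a minimal sketch using the numbers from the ‘software’ example (the values are hard-coded for illustration):

```php
<?php
// Numbers from the 'software' example: 10 occurrences in a 40-word text
$count = 10;
$total_words = 40;

// number_format() with 2 decimals returns a string such as "25.00"
$percent = number_format(($count * 100) / $total_words, 2);

$entry = array(
    'word'    => 'software',
    'count'   => $count,
    'percent' => $percent . '%',
);

echo $entry['percent'] . "\n"; // 25.00%
```

Note that the percentage is always relative to the total word count of the unfiltered text, not to the number of words left after filtering.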

Finally, we sort the array by the ‘count’ value, in ascending order:

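// An anonymous comparator is used because a named function declared inside
// another function cannot be declared again on a second call.
usort($popularity, function ($a, $b) {
    return $a['count'] - $b['count'];
});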
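Because this comparator sorts from the lowest count to the highest, the usage example has to reverse the result with krsort(). If you would rather get the most popular words first straight away, one option is to swap the operands in the comparator. The sketch below uses made-up entries in the same shape as the function’s result:

```php
<?php
// Made-up sample entries, for illustration only
$popularity = array(
    array('word' => 'release', 'count' => 2),
    array('word' => 'php',     'count' => 4),
    array('word' => 'branch',  'count' => 1),
);

// Subtracting $a from $b (instead of $b from $a) puts the highest count first
usort($popularity, function ($a, $b) {
    return $b['count'] - $a['count'];
});

$words = array_column($popularity, 'word');
echo implode(', ', $words) . "\n"; // php, release, branch
```

With this variant, the krsort() call in the usage example would no longer be needed.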

Here is a usage example for this function:

<?php
$text = "The PHP development team would like to announce the immediate availability 
of PHP 4.4.9. It continues to improve the security and the stability of the 4.4 branch 
and all users are strongly encouraged to upgrade to it as soon as possible. 
This release wraps up all the outstanding patches for the PHP 4.4 series, 
and is therefore the last PHP 4.4 release.";

$exclude_words = array('would','the','and','all','for','are');

$popularity = calculate_word_popularity($text, 3, $exclude_words);

// Only words with 3 or more characters are counted, and the words from the $exclude_words array are skipped

krsort($popularity); // reverse the array by key, so the highest counts come first

$key = 1;

echo 'Total words in the text: '.str_word_count($text).'<br><br>';

echo 'Word / Popularity / Count<br><br>';

foreach ($popularity as $value)
{
    // htmlspecialchars() keeps the word from breaking the surrounding HTML
    echo $key.". <b>".htmlspecialchars($value['word']).'</b> - <span style="color: green">'.$value['percent'].'</span> ('.$value['count'].')'."<br>\n";
    $key++;
}
?>


11 Comments

  1. Seb says

    Hi,

    Thanks, very useful script. How can I make it work with foreign languages? At the moment, words with accents are cut.

    Many Thanks,

  2. seb says

    Hi,

    Thanks for sharing this, very useful. I was wondering how I could add a text area to this script with a “check text” button, and also make it work for foreign languages with accents. I have tested the script in a different language, and when there is an accent the rest of the word is cut.

    Thanks for your help,

  3. Jim says

    Would absolutely love to see a similar function that would do 2 to 4 word phrases. I’ve attempted writing one, but it got complex too quickly for me!

    • Cobalt says

      Hey there,

      Thanks a lot for this function. I was surprised not to see more of them on the net; this is very useful. Here is a small enhancement, certainly not beautifully coded, but now the function can calculate the popularity of groups of 2, 3, 4... words:


      function calculate_word_popularity($string, $min_word_char = 3, $exclude_words = array(), $nbre_words = 1)
      {
          $string = strip_tags(strtolower($string));

          $initial_words_array = str_word_count($string, 1);
          $total_words = count($initial_words_array);

          $new_string = $string;

          foreach ($exclude_words as $filter_word) {
              // strip excluded words
              $new_string = preg_replace('/\b' . preg_quote($filter_word, '/') . '\b/i', '', $new_string);
          }

          $words_array = str_word_count($new_string, 1, 'àèéêïîùç');

          // array_values() re-indexes the filtered words so neighbours have consecutive keys
          $words_array = array_values(array_filter($words_array, function ($var) use ($min_word_char) {
              return strlen($var) >= $min_word_char;
          }));

          // build the combinations of 2 or more words
          // NOTE: the loop body was truncated in the posted comment; what follows is a
          // reconstruction based on the surviving code comments, not the exact original.
          if ($nbre_words > 1) {
              $phrases = array();
              // walk the word array, from which the words below the length limit were removed
              foreach ($words_array as $key => $value) {
                  // temporary variable that stores the x consecutive words
                  $temp = '';
                  $flag = true;
                  // for j from 0 to nbre_words - 1, append the words found
                  for ($j = 0; $j < $nbre_words; $j++) {
                      if (!isset($words_array[$key + $j])) {
                          $flag = false; // not enough words left for a full phrase
                          break;
                      }
                      $temp .= ($j > 0 ? ' ' : '') . $words_array[$key + $j];
                  }
                  if ($flag) {
                      $phrases[] = $temp;
                  }
              }
              $words_array = $phrases;
          }

          $popularity = array();

          $unique_words_array = array_unique($words_array);

          foreach ($unique_words_array as $key => $word) {
              preg_match_all('/\b' . preg_quote($word, '/') . '\b/i', $string, $out);

              $count = count($out[0]);

              $percent = number_format(($count * 100) / $total_words, 2);

              // phrases that do not occur verbatim in the original text are skipped
              if ($count) {
                  $popularity[$key]['word'] = $word;
                  $popularity[$key]['count'] = $count;
                  $popularity[$key]['percent'] = $percent . '%';
              }
          }

          // anonymous comparator, so the function can be called more than once
          usort($popularity, function ($a, $b) {
              return $a['count'] - $b['count'];
          });

          return $popularity;
      }
