How to Create a PHP Word Popularity Script

Posted on November 7, 2008, under PHP 

This function calculates the density of the words in a text. Since many words have fewer than 3 characters, I’ve added a filter that ignores words shorter than a given number of characters (examples: if, or, is, it etc.). You can also set up an array with a list of words that you do not want included in the ranking calculation. Here’s the function (I’ll explain how it works below):

<?php
function calculate_word_popularity($string, $min_word_char = 2, $exclude_words = array())
{
    $string = strip_tags($string);

    // total number of words in the unfiltered text (the base for the percentages)
    $initial_words_array = str_word_count($string, 1);
    $total_words = sizeof($initial_words_array);

    $new_string = $string;

    foreach ($exclude_words as $filter_word) {
        // strip excluded words; preg_quote() escapes any regex metacharacters
        $new_string = preg_replace('/\b' . preg_quote($filter_word, '/') . '\b/i', '', $new_string);
    }

    $words_array = str_word_count($new_string, 1);

    // keep only the words with at least $min_word_char characters
    // (an anonymous function replaces create_function(), which was removed in PHP 8.0)
    $words_array = array_filter($words_array, function ($var) use ($min_word_char) {
        return strlen($var) >= $min_word_char;
    });

    $popularity = array();

    $unique_words_array = array_unique($words_array);

    foreach ($unique_words_array as $key => $word) {
        // count the occurrences in the original text (case-insensitive)
        preg_match_all('/\b' . preg_quote($word, '/') . '\b/i', $string, $out);

        $count = count($out[0]);

        $percent = number_format(($count * 100) / $total_words, 2);

        $popularity[$key]['word'] = $word;
        $popularity[$key]['count'] = $count;
        $popularity[$key]['percent'] = $percent . '%';
    }

    // sort by count, lowest first; a closure avoids the "cannot redeclare cmp()"
    // fatal error that a nested named function causes on the second call
    usort($popularity, function ($a, $b) {
        return $a['count'] - $b['count'];
    });

    return $popularity;
}
?>

How does it work?

The function has 3 parameters:

1. $string – the text that will be analyzed
2. $min_word_char – the minimum number of characters a word must have to be taken into consideration (the default is 2). This is used to avoid calculating importance for very common short words (ex: a, in, to etc.)
3. $exclude_words – an array of specific words that you want to exclude from the rank calculation (ex: array('company', 'friends', 'values')).

First, we will strip any tags found in the string using strip_tags():

$string = strip_tags($string);

Let’s count the number of words from the text:

$initial_words_array  =  str_word_count($string, 1);  
$total_words = sizeof($initial_words_array);
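To see what these two lines produce, here is a small sketch. The sample sentence is made up for the illustration. With the second argument set to 1, str_word_count() returns the words themselves instead of a number, and by default it does not treat digits as part of a word, so version numbers such as "4.4" are not counted.

```php
<?php
// Sample text invented for this illustration.
$string = strip_tags('<p>PHP 4.4 is the <b>last</b> release</p>');

// Format 1 returns an array of words; "4.4" is skipped because digits
// are not word characters by default.
$initial_words_array = str_word_count($string, 1);
$total_words = sizeof($initial_words_array);

print_r($initial_words_array); // PHP, is, the, last, release
echo $total_words, "\n";       // 5
```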

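Now, let’s apply the ‘exclude words’ filter. Each word is passed through preg_quote() so that any special regex characters in it are matched literally:

$new_string = $string;

foreach ($exclude_words as $filter_word)
{
    $new_string = preg_replace('/\b' . preg_quote($filter_word, '/') . '\b/i', '', $new_string); // strip excluded words
}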
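The word boundaries (\b) in the pattern matter: they keep the filter from cutting "the" out of "therefore". Here is a minimal sketch, using a fragment of the sample text from the usage example further down. The preg_quote() call in the sketch makes sure an excluded word containing regex metacharacters is matched literally.

```php
<?php
$new_string = 'and is therefore the last release';
$exclude_words = array('the', 'and');

foreach ($exclude_words as $filter_word) {
    // \b matches only at word boundaries, so "therefore" is left untouched
    $pattern = '/\b' . preg_quote($filter_word, '/') . '\b/i';
    $new_string = preg_replace($pattern, '', $new_string);
}

$remaining_words = str_word_count($new_string, 1);
print_r($remaining_words); // is, therefore, last, release
```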

After applying the filter, let’s extract the words again from the (filtered) text and apply the ‘minimum characters’ filter. An anonymous function is used here because create_function() was deprecated in PHP 7.2 and removed in PHP 8.0:

$words_array = str_word_count($new_string, 1);

$words_array = array_filter($words_array, function ($var) use ($min_word_char) {
    return strlen($var) >= $min_word_char;
});
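One detail worth knowing about this step: array_filter() keeps the original keys, so the filtered array can have gaps in its numbering. The sketch below shows this with a made-up word list; it uses an anonymous function (PHP 5.3+), which works on current PHP versions where create_function() is no longer available.

```php
<?php
$min_word_char = 3;
$words_array = array('is', 'therefore', 'it', 'last', 'release');

// keep only the words with at least $min_word_char characters
$words_array = array_filter($words_array, function ($var) use ($min_word_char) {
    return strlen($var) >= $min_word_char;
});

print_r($words_array); // keys 1, 3 and 4 survive; the array is not reindexed
```

Note that strlen() counts bytes, so this length check is only exact for single-byte characters.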

All the filters have been applied. Now we will build the ‘popularity’ array, each entry containing the word, the number of occurrences in the text, and the percentage (for instance, if ‘software’ is found 10 times in a text with 40 words it will have 25%). Note that the occurrences are counted in the original text and the percentage is relative to the total number of words before filtering.

$popularity = array();

$unique_words_array = array_unique($words_array);

foreach ($unique_words_array as $key => $word)
{
    preg_match_all('/\b' . preg_quote($word, '/') . '\b/i', $string, $out);

    $count = count($out[0]);

    $percent = number_format(($count * 100) / $total_words, 2);

    $popularity[$key]['word'] = $word;
    $popularity[$key]['count'] = $count;
    $popularity[$key]['percent'] = $percent . '%';
}
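The 25% figure from the example above can be checked with a small sketch. The sample text is made up for the purpose: "software" repeated 10 times plus 30 filler words, 40 words in total.

```php
<?php
// Invented sample: 10 x "software" + 30 x "tool" = 40 words.
$string = str_repeat('software ', 10) . str_repeat('tool ', 30);
$total_words = sizeof(str_word_count($string, 1));

// count the whole-word, case-insensitive matches
preg_match_all('/\bsoftware\b/i', $string, $out);
$count = count($out[0]);

$percent = number_format(($count * 100) / $total_words, 2);

echo $count, ' of ', $total_words, ' words = ', $percent, "%\n"; // 10 of 40 words = 25.00%
```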

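Finally, we will sort the array by the ‘count’ value, from the lowest to the highest. The comparison is passed to usort() as an anonymous function: a named function declared inside calculate_word_popularity() would trigger a “cannot redeclare” fatal error the second time the function is called.

usort($popularity, function ($a, $b) {
    return $a['count'] - $b['count'];
});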

Here is a usage example of this function:

<?php
$text = "The PHP development team would like to announce the immediate availability 
of PHP 4.4.9. It continues to improve the security and the stability of the 4.4 branch 
and all users are strongly encouraged to upgrade to it as soon as possible. 
This release wraps up all the outstanding patches for the PHP 4.4 series, 
and is therefore the last PHP 4.4 release.";

$exclude_words = array('would','the','and','all','for','are');

$popularity = calculate_word_popularity($text, 3, $exclude_words);

// Count words with >= 3 characters and exclude the words from the $exclude_words array

krsort($popularity); // reverse the ascending order returned by the function (from higher to lower)

$key = 1;

echo 'Total words in the text: '.str_word_count($text).'<br><br>';

echo 'Word / Popularity / Count<br><br>';

foreach($popularity as $value)
{
    echo $key.".<b>".$value['word'].'</b> - <font color="green">'.$value['percent'].'</font> ('.$value["count"].')'. "<br>\n";
    $key++;
}
?>
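
As an alternative, PHP’s built-in array_count_values() can do the counting in a single pass instead of running one regular expression per unique word. The following is only a sketch under a few assumptions, not a drop-in replacement: the function name word_popularity_by_count_values is hypothetical, words are lower-cased before counting, excluded words are compared as whole lower-cased words, and the result comes back already sorted from the highest to the lowest count.

```php
<?php
// Hypothetical helper, not part of the original function.
function word_popularity_by_count_values($string, $min_word_char = 2, $exclude_words = array())
{
    // lower-case everything so that "PHP" and "php" are counted together
    $words = str_word_count(strtolower(strip_tags($string)), 1);
    $total_words = count($words);

    $exclude = array_map('strtolower', $exclude_words);
    $kept = array_filter($words, function ($word) use ($min_word_char, $exclude) {
        return strlen($word) >= $min_word_char && !in_array($word, $exclude, true);
    });

    $counts = array_count_values($kept); // word => number of occurrences
    arsort($counts);                     // highest count first

    $popularity = array();
    foreach ($counts as $word => $count) {
        $popularity[] = array(
            'word'    => $word,
            'count'   => $count,
            'percent' => number_format(($count * 100) / $total_words, 2) . '%',
        );
    }

    return $popularity;
}

$result = word_popularity_by_count_values('PHP is fun and php is fast', 3, array('and'));
print_r($result[0]); // word: php, count: 2, percent: 28.57%
```

Because the list is already in descending order, the krsort() call from the usage example is not needed with this variant.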



11 Replies to "How to Create a PHP Word Popularity Script"

  1. Good Tutorial! It was chosen for the home page of http://www.tutorialsroom.com

    Waiting for your next tutorial :)

  2. what about array_count_values?

  3. Amazing! this is exactly what I had been looking for till now.
    Thanks man

  4. Would absolutely love to see a similar function that would do 2 to 4 word phrases. I’ve attempted writing one, but it got complex too quickly for me!

    1. Hey there,

      Thanks a lot for this function. I was surprised not to see more of them on the net; this is very useful. Here is a small enhancement, certainly not beautifully coded, but now the function can calculate the popularity of groups of 2, 3, 4... words


      function calculate_word_popularity($string, $min_word_char = 3, $exclude_words = array(), $nbre_words = 1 )
      {
      $string = strip_tags(strtolower($string));

      $initial_words_array = str_word_count($string, 1);
      $total_words = sizeof($initial_words_array);

      $new_string = $string;

      foreach($exclude_words as $filter_word)
      $new_string = preg_replace("/\b".$filter_word."\b/i", "", $new_string); // strip excluded words

      $words_array = str_word_count($new_string, 1,'àèéêïîùç');

      $words_array = array_filter($words_array, create_function('$var', 'return (strlen($var) >= '.$min_word_char.');'));

      // get the combinations of 2 or more words
      if ( $nbre_words > 1 ) {
      // loop over the word array, from which the words shorter than the limit have been removed, from 0 up to max-nbre_words+1
      foreach ( $words_array as $key => $value ) {
      // temporary variable in which we will store the x consecutive words
      $temp = '' ;
      $flag = true ;
      // for j going from 0 to the size of a word - 1, we join the words found
      for ( $j = 0 ; $j = '.$nbre_words.';'));
      }

      $popularity = array();

      $unique_words_array = array_unique($words_array);

      foreach($unique_words_array as $key => $word)
      {
      preg_match_all('/\b'.$word.'\b/i', $string, $out);

      $count = count($out[0]);

      $percent = number_format((($count * 100) / $total_words), 2);

      if ( $count ) {
      $popularity[$key]['word'] = $word;
      $popularity[$key]['count'] = $count;
      $popularity[$key]['percent'] = $percent.'%';
      }
      }

      function cmp($a, $b)
      {
      return ($a['count'] > $b['count']) ? +1 : -1;
      }

      usort($popularity, "cmp");

      return $popularity;
      }

      1. Hi,
        but your code doesn’t work because it has an error. Do you have the correct code?

        Thank you a lot

        Bye

        Did anyone get the correct code? This looks interesting, but it’s incomplete…

        Thanks!!

  5. Thank you very much! Came here from a Stack Overflow link.

    Even as a C# developer, your code helped me a lot :-)

  6. Hi,

    Thanks for sharing this, very useful. I was wondering how I could add a drop text area to this script with a “check text” button and also make it available for foreign languages with accents. I have tested the script in a different language, and when there is an accent the rest of the word is cut off.

    Thanks for your help,

  7. Hi,

    Thanks, very useful script. How can I make it work with foreign languages? At the moment words with accents are cut off.

    Many Thanks,
