How to Create an Advanced PHP (bad, naughty) Words Filter
Posted on October 26, 2008, under PHP
This PHP class filters (bad, naughty) words out of a string, whether it is plain text or contains HTML tags.
Here’s the complete source code (we will explain it below):
filter.string.class.php
<?php
/*
Credits: Bit Repository
*/
class Filter_String {
    public $strings = array();                    // word (string) or list of words (array) to filter
    public $text = '';                            // input text, may contain HTML tags
    public $keep_first_last = false;              // true: keep the first and last letter of each word
    public $replace_matches_inside_words = false; // true: also replace matches inside longer words

    function filter()
    {
        $new_text = '';
        $regex = '/<\/?(?:\w+(?:=["\'][^\'"]*["\'])?\s*)*>/'; // Tag Extractor
        preg_match_all($regex, $this->text, $out, PREG_OFFSET_CAPTURE);
        $array = $out[0]; // each entry is array(matched tag, byte offset)
        if(!empty($array))
        {
            // Text before the first tag
            if($array[0][1] > 0)
            {
                $new_text .= $this->do_filter(substr($this->text, 0, $array[0][1]));
            }
            foreach($array as $i => $value)
            {
                $tag = $value[0];
                $offset = $value[1];
                $strlen = strlen($tag); // length of the tag (in bytes)
                $start_str_pos = ($offset + $strlen); // start position for the non-tag element
                // end position for the non-tag element: the offset of the next tag.
                // No next tag? This is the last text from the string and it is not followed by any tags
                $end_str_pos = isset($array[$i + 1]) ? $array[$i + 1][1] : strlen($this->text);
                // Start constructing the new resulted string. We'll add tags now!
                $new_text .= $tag;
                $diff = ($end_str_pos - $start_str_pos);
                // Is there text between this tag and the next one? Apply the filter to it
                if($diff > 0)
                {
                    $str = substr($this->text, $start_str_pos, $diff);
                    $str = $this->do_filter($str);
                    $new_text .= $str; // Continue constructing the text with the (filtered) text
                }
            }
        }
        else // No tags were found in the string? Just apply the filter
        {
            $new_text = $this->do_filter($this->text);
        }
        return $new_text;
    }

    function do_filter($var)
    {
        // Accept a single word as well as an array of words
        $strings = is_array($this->strings) ? $this->strings : array($this->strings);
        foreach($strings as $word)
        {
            $word = trim($word);
            if($word === '') continue; // skip empty entries
            $length = strlen($word);
            // Keeping the first and last letter only makes sense for words of 3+ characters
            $keep = ($this->keep_first_last && $length > 2);
            $first = $keep ? $word[0] : '';
            $last = $keep ? $word[$length - 1] : '';
            $replacement = str_repeat('*', $keep ? $length - 2 : $length);
            if($this->replace_matches_inside_words)
            {
                // case-insensitive, to match the behaviour of the /i branch below
                $var = str_ireplace($word, $first.$replacement.$last, $var);
            }
            else
            {
                // preg_quote() escapes regex metacharacters (and the "/" delimiter) in the word
                $var = preg_replace('/\b'.preg_quote($word, '/').'\b/i', $first.$replacement.$last, $var);
            }
        }
        return $var;
    }
}
?>
How does it work?
First, a regex extracts the HTML tags (if any). To do that we call preg_match_all() with the PREG_OFFSET_CAPTURE flag, which returns the position (byte offset) of every match along with the matched text. From these offsets we work out where the non-tag text starts and ends, and apply the filter only to those parts.
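To make the offset arithmetic concrete, here is a minimal sketch that splits a string into tag and text segments. It reuses the tag pattern from the class; split_segments() is a hypothetical helper written for illustration only and is not part of Filter_String.

```php
<?php
// Sketch: split a string into tag and text segments using match offsets.
// split_segments() is a hypothetical helper, not part of Filter_String.
function split_segments($text)
{
    $regex = '/<\/?(?:\w+(?:=["\'][^\'"]*["\'])?\s*)*>/'; // same tag pattern as the class
    preg_match_all($regex, $text, $out, PREG_OFFSET_CAPTURE);

    $segments = array();
    $cursor = 0; // byte position reached so far
    foreach ($out[0] as $match) {
        list($tag, $offset) = $match; // array(matched text, byte offset)
        if ($offset > $cursor) {
            // text between the previous tag (or the start) and this tag
            $segments[] = array('text', substr($text, $cursor, $offset - $cursor));
        }
        $segments[] = array('tag', $tag);
        $cursor = $offset + strlen($tag);
    }
    if ($cursor < strlen($text)) {
        $segments[] = array('text', substr($text, $cursor)); // trailing text
    }
    return $segments;
}

$segments = split_segments('Hi <b>there</b>!');
foreach ($segments as $segment) {
    echo $segment[0], ': ', $segment[1], "\n";
}
```

Only the segments marked 'text' would be passed to the word filter; the 'tag' segments are copied to the result unchanged.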
How do you use it?
In this example the script replaces every word in the array, keeping the first and last letter of each one and replacing the letters in between with stars (*). For example, ‘turpis’ becomes ‘t****s’ (four stars for the four middle letters). The filter ignores HTML tags, so the ‘href’ attribute inside the A tag is left alone even though ‘href’ is in the word list.
<?php
error_reporting (E_ALL ^ E_NOTICE);
include 'filter.string.class.php';
$filter = new Filter_String;
$filter->strings = array('consectetuer','consequat','turpis', 'href');
$filter->text = 'Lorem ipsum dolor sit amet, href <a href="http://www.domain.com/">consectetuer</a> adipiscing elit. Nulla mi nunc, consequat vitae, condimentum at, iaculis at, turpis. Praesent suscipit. Maecenas et lectus.';
$filter->keep_first_last = true; // keep the first and last letter of each word
$filter->replace_matches_inside_words = false;
$new_text = $filter->filter();
echo $new_text;
?>
Output (as rendered by the browser; the A tag around ‘c**********r’ is passed through untouched):
Lorem ipsum dolor sit amet, h**f c**********r adipiscing elit. Nulla mi nunc, c*******t vitae, condimentum at, iaculis at, t****s. Praesent suscipit. Maecenas et lectus.
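The masking rule described above can be sketched in isolation. mask_word() is a hypothetical helper used only for this illustration, and the guard for words shorter than three characters is an assumption about how such short words should be handled.

```php
<?php
// mask_word() is a hypothetical helper that mirrors the masking rule:
// stars replace every letter, or every letter except the first and last.
function mask_word($word, $keep_first_last)
{
    $length = strlen($word); // byte length; fine for ASCII words
    if ($keep_first_last && $length > 2) {
        return $word[0] . str_repeat('*', $length - 2) . $word[$length - 1];
    }
    // Assumption: words of one or two characters are masked completely
    return str_repeat('*', $length);
}

echo mask_word('turpis', true), "\n";  // t****s
echo mask_word('turpis', false), "\n"; // ******
echo mask_word('at', true), "\n";      // **
```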
Suppose you need to filter the word ‘eat’ and the text contains the word ‘create’. If $filter->replace_matches_inside_words is set to true, the ‘eat’ inside ‘create’ is replaced as well, giving ‘cr***e’ (or ‘cre*te’ when keep_first_last is enabled). If it is set to false, ‘create’ is left unchanged and only the standalone word ‘eat’ is filtered.
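The difference between the two modes can be seen in a short standalone sketch. It uses the case-insensitive str_ireplace() for the substring case and a \b-anchored pattern for the whole-word case; the sample sentence is made up for the illustration.

```php
<?php
$text = 'Do not eat while you create.';
$word = 'eat';
$stars = str_repeat('*', strlen($word)); // "***"

// Substring replacement: hits "eat" anywhere, including inside "create"
$inside = str_ireplace($word, $stars, $text);

// Whole-word replacement: \b anchors the match to word boundaries
$whole = preg_replace('/\b' . preg_quote($word, '/') . '\b/i', $stars, $text);

echo $inside, "\n"; // Do not *** while you cr***e.
echo $whole, "\n";  // Do not *** while you create.
```

Whole-word matching avoids mangling innocent words, at the cost of missing a filtered word that is glued to other letters.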
- October 26, 2008
- article by Gabriel C.
- 8 comments
8 Replies to "How to Create an Advanced PHP (bad, naughty) Words Filter"
October 27, 2008 at 12:48 PM
Just taking a quick glance at this… The regex on line 17 is not going to catch tags which don’t have quotes around attributes. While not valid, browsers will accept it in some circumstances…
You could still easily <script src=mysite.com/script.js></script> or whatever for your XSS.
Stuff like this is hard. REALLY hard.
October 27, 2008 at 4:36 PM
People can use this class on their own sites (and many others) where they know that standard is respected & only VALID syntax is used. I’ve seen many word filters in PHP that didn’t even check any tags in the string. This script is GOOD ENOUGH. I realize this stuff is really hard.
July 24, 2009 at 2:53 PM
Hi, I have written a similar bad word filtering script but I have also added phonetic filtering. You can check it out here:
http://webroy.blogspot.com/2009/07/advanced-bad-word-filter-in-php.html
August 27, 2009 at 3:10 AM
On line 73, the variable $words throws up an ‘Undefined variable’ error, which makes sense, as it hasn’t been defined!
;-)
Should it have been in the singular?
August 28, 2009 at 5:16 AM
Yes, you are right. It should have been is_string($this->strings). It can be used in case you need only to filter a word, not multiple words (elements) that are in an array.
The result is the same anyway in either of these cases:
$filter->strings = 'word here';
$filter->strings = array('word here');
August 17, 2010 at 6:55 PM
I am facing an issue with Arabic text in UTF-8 encoding.
$str = preg_replace('/\b('.$row[0].'\b)/i', "" . $row[0] . "", $str);
It does not highlight/replace the match in RED color.
November 24, 2010 at 8:00 PM
if you need the script to be case insensitive, edit like this:
if($this->replace_matches_inside_words)
{
// $var = str_replace($word, $first.$replacement.$last, $var);
$var = str_ireplace($word, $first.$replacement.$last, $var);
}
else
{
// no change needed here, the /i modifier already makes it case insensitive
$var = preg_replace('/\b'.$word.'\b/i', $first.$replacement.$last, $var);
}
note: eregi_replace is not an option here. it is deprecated since php 5.3 and removed in php 7, and it does not understand the /.../i delimiters anyway. str_ireplace is the case insensitive version of str_replace.
April 17, 2011 at 8:49 PM
how can i put this to my website?
what will i put in the $filter->text
i want to filter the bad words
in a particular page.