Creating a simple web data (spider) extractor
by Gabriel on 14/09/08 at 11:41 am
Greetings coders,
In this tutorial we will learn how to create a simple web spider that will extract specific information from a web page. Our script will have 2 files: index.php & functions.php. In our sample, the extractor will check how many pages from a site are indexed by Google.
First, we will create the library file which will have 2 functions: one to fetch the content from our pages and the other one to extract content between two strings (delimiters).
functions.php
<?php
// Fetch a page with cURL and return the response as a string (false on failure).
function LoadCURLPage($url, $agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)",
    $cookie = '', $referer = '', $post_fields = '', $return_transfer = 1,
    $follow_location = 1, $ssl = '', $curlopt_header = 0)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    if($ssl)
    {
        // Check that the certificate's host name matches the requested host
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    }
    // Include the response headers in the output only if requested
    curl_setopt($ch, CURLOPT_HEADER, $curlopt_header);
    if($agent)
    {
        curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    }
    if($post_fields)
    {
        // Send the request as a POST with the given fields
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
    }
    // Honour the function arguments instead of hard-coding 1
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, $return_transfer);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $follow_location);
    if($referer)
    {
        curl_setopt($ch, CURLOPT_REFERER, $referer);
    }
    if($cookie)
    {
        // Read cookies from, and write them back to, the same file
        curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
    }
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
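// Return the text found between $start and $end (case-insensitive),
// or an empty string if either delimiter is missing.
function extract_unit($string, $start, $end)
{
    $pos = stripos($string, $start);
    if($pos === false)
    {
        return ''; // start delimiter not found
    }
    $str_two = substr($string, $pos + strlen($start));
    $second_pos = stripos($str_two, $end);
    if($second_pos === false)
    {
        return ''; // end delimiter not found
    }
    $str_three = substr($str_two, 0, $second_pos);
    $unit = trim($str_three); // remove surrounding whitespace
    return $unit;
}
?>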
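To see what extract_unit() returns without touching the network, it can be run against a hard-coded string. The snippet below is only a minimal sketch: the $sample markup and the number in it are made up for illustration, and the function is repeated (behind a function_exists() guard) purely so that the snippet runs on its own.

```php
<?php
// Copy of extract_unit(), repeated so this snippet is self-contained;
// the guard avoids a redeclaration if functions.php is already loaded.
if (!function_exists('extract_unit'))
{
    function extract_unit($string, $start, $end)
    {
        $pos = stripos($string, $start);
        $str = substr($string, $pos);
        $str_two = substr($str, strlen($start));
        $second_pos = stripos($str_two, $end);
        $str_three = substr($str_two, 0, $second_pos);
        return trim($str_three);
    }
}

// Made-up sample markup (not real search output)
$sample = 'Results <b>1</b> - <b>10</b> of about <b> 1,230,000 </b> for site';

// Take the text between the two delimiters and trim the spaces around it
$demo_info = extract_unit($sample, '</b> of about <b>', '</b>');
echo $demo_info . "\n";
```

The first delimiter is matched after "10", so only the text up to the next closing tag is kept.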
Let’s continue creating the index.php file. We will start by including the functions file & setting up some configuration variables:
index.php
<?php
error_reporting (E_ALL ^ E_NOTICE);
include 'functions.php';
// Site to check
$site = 'www.microsoft.com';
// Connect to this url using CURL
$url = 'http://www.google.com/search?hl=en&q=site%3A'.$site.'&btnG=Search';
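The query string above is assembled by hand, with the colon in "site:" already encoded as %3A. As a hedged alternative, PHP's built-in http_build_query() can take care of the encoding. The sketch below assumes the same three parameters as the URL above; the names $params and $built_url are only illustrative.

```php
<?php
$site = 'www.microsoft.com';

// Same parameters as the hand-written URL; http_build_query() URL-encodes
// each value, so the ':' in "site:" becomes %3A.
$params = array(
    'hl'   => 'en',
    'q'    => 'site:' . $site,
    'btnG' => 'Search',
);

// Pass '&' explicitly so the separator does not depend on the
// arg_separator.output ini setting.
$built_url = 'http://www.google.com/search?' . http_build_query($params, '', '&');

echo $built_url . "\n";
```

Building the query this way also keeps the URL valid if $site ever contains characters that need escaping.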
Let’s use cURL to connect to the $url.
$data = LoadCURLPage($url);
Now $data contains the HTML output for $url.
We will use extract_unit() to get the information between two strings. In our case, the total number of indexed pages for $site sits between '</b> of about <b>' and '</b>'.
// Extract information between STRING 1 & STRING 2
$string_one = '</b> of about <b>';
$string_two = '</b>';
$info = extract_unit($data, $string_one, $string_two);
Output our result:
echo 'Google has indexed '.$info.' pages for '.$site.'.';
?>
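The script will output a sentence of the form "Google has indexed N pages for www.microsoft.com.", where N is the value extracted from the results page.
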
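extract_unit() returns only the first value found between the two delimiters. When a page contains several such values, the same idea can be repeated, starting each search where the previous match ended. The helper below, extract_all_units(), is hypothetical (it is not part of functions.php) and the sample markup is made up, so treat it as a sketch rather than a finished implementation.

```php
<?php
// Hypothetical helper: collect every value found between $start and $end.
function extract_all_units($string, $start, $end)
{
    $units = array();

    // Empty delimiters cannot be searched for in a meaningful way
    if ($start === '' || $end === '')
    {
        return $units;
    }

    $offset = 0;

    // Look for the next start delimiter, beginning where the last match ended
    while (($pos = stripos($string, $start, $offset)) !== false)
    {
        $value_pos = $pos + strlen($start);
        $end_pos = stripos($string, $end, $value_pos);
        if ($end_pos === false)
        {
            break; // no closing delimiter, stop searching
        }

        $units[] = trim(substr($string, $value_pos, $end_pos - $value_pos));
        $offset = $end_pos + strlen($end);
    }

    return $units;
}

// Made-up sample markup
$sample = '<li> one </li><li>two</li><LI>three</li>';
print_r(extract_all_units($sample, '<li>', '</li>'));
```

Like extract_unit(), the search is case-insensitive, and an empty array is returned when no pair of delimiters is found.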
7 Comments
Scott
Sep 15th, 2008
I was just searching on something like this, came across your post.
Any idea how PHP would compare to Python for performance in an instance like this?
Also, how might this work for more advanced operations like indexing entire pages? Or pulling snippets of content via some sort of fuzzy logic or other AI methodology?
Cheers
Gabriel
Sep 16th, 2008
I haven’t used Python for creating web fetching scripts, so I can’t make a comparison (consider googling “python web fetch”). You can use regular expressions in PHP to fetch snippets of content based on specific patterns. I’ve also written a script that extracts URLs from links, which can be found here: http://www.bitrepository.com/web-programming/php/extract-urls-from-links.html. I hope it will give you an idea of how to write other scripts that fetch data.
geo
Sep 28th, 2008
If you don’t want to code a Web spider yourself, MetaSeeker is a good choice.
MetaSeeker ( http://www.gooseeker.com/en/node/product/front ) is a toolkit to precisely extract data from the Web.
Noise is effectively filtered out of the results.
The result files, in XML format, are like tables in a relational database, where each field of the table holds only content with an exact meaning. Further manipulation of the results becomes very straightforward with the help of the data schema. At the same time, the manipulation doesn’t need a person’s involvement to handle special cases. This is a must for manipulating and mining data in bulk.
Please visit http://www.gooseeker.com/en for more information.
reza
Feb 24th, 2009
Hi dear friend, that’s all I need. Thank you very much.
meir peres
Feb 26th, 2009
Thanks for the code.
How can I get the URLs from a file and then fetch them?
I don’t want to enter one URL at a time.
Thank you
Vedran
Mar 23rd, 2009
How can I loop through $data to find more of the same matches?
E.g. I have more than one result and I want to display all of them.
Thanks a lot!
moody
Apr 30th, 2009
This is great code. When I tried it, it gave me “Please Enable Cookies”. Any suggestions?