In this tutorial we will create a simple web data extractor (spider) that pulls a specific piece of information from a web page. The script consists of two files: index.php and functions.php. In our example, the extractor checks how many pages of a site are indexed by Google.
First, we will create the library file which will have 2 functions: one to fetch the content from our pages and the other one to extract content between two strings (delimiters).
functions.php
<?php
function LoadCURLPage($url, $agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)",
    $cookie = '', $referer = '', $post_fields = '', $return_transfer = 1,
    $follow_location = 1, $ssl = '', $curlopt_header = 0)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    if ($ssl)
    {
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    }
    // Include the response headers in the output only when requested
    curl_setopt($ch, CURLOPT_HEADER, $curlopt_header);
    if ($agent)
    {
        curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    }
    if ($post_fields)
    {
        // Send a POST request with the given fields
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
    }
    // Honor the caller's settings instead of hard-coding 1
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, $return_transfer);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $follow_location);
    if ($referer)
    {
        curl_setopt($ch, CURLOPT_REFERER, $referer);
    }
    if ($cookie)
    {
        // Read cookies from and write them back to the same file
        curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
    }
    $result = curl_exec($ch); // false on failure
    curl_close($ch);
    return $result;
}
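function extract_unit($string, $start, $end)
{
    // Locate the start delimiter (case-insensitive)
    $pos = stripos($string, $start);
    if ($pos === false)
    {
        return ''; // start delimiter not found
    }
    // Keep only the text after the start delimiter
    $str_two = substr($string, $pos + strlen($start));
    // Locate the end delimiter in the remaining text
    $second_pos = stripos($str_two, $end);
    if ($second_pos === false)
    {
        return ''; // end delimiter not found
    }
    $unit = trim(substr($str_two, 0, $second_pos)); // remove surrounding whitespace
    return $unit;
}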
?>
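extract_unit() returns only the first piece of text it finds between the two delimiters. When a page contains several such pieces, a looping variant can collect all of them. The following is a minimal sketch, not part of the original functions.php; the name extract_all_units and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper (not in the original functions.php):
// returns every substring found between $start and $end, case-insensitively.
function extract_all_units($string, $start, $end)
{
    $units = array();
    if ($start === '' || $end === '') {
        return $units; // empty delimiters cannot be matched meaningfully
    }
    $offset = 0;
    while (($pos = stripos($string, $start, $offset)) !== false) {
        $from = $pos + strlen($start);        // first character after the start delimiter
        $to = stripos($string, $end, $from);  // next end delimiter after that point
        if ($to === false) {
            break; // no closing delimiter left, stop scanning
        }
        $units[] = trim(substr($string, $from, $to - $from));
        $offset = $to + strlen($end);         // continue after this match
    }
    return $units;
}

// Made-up sample markup, used only to illustrate the call
$html = '<li><b>alpha</b></li><li><b> beta </b></li>';
print_r(extract_all_units($html, '<b>', '</b>'));
```

The function returns a plain array, so the results can be walked with an ordinary foreach loop.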
Let’s continue by creating the index.php file. We will start by including the functions file and setting up some configuration variables:
index.php
<?php
error_reporting(E_ALL ^ E_NOTICE);
include 'functions.php';
// Site to check
$site = 'www.microsoft.com';
// Connect to this URL using cURL
$url = 'http://www.google.com/search?hl=en&q=site%3A'.$site.'&btnG=Search';
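The search URL in index.php is encoded by hand (%3A is the URL-encoded colon in "site:"). As an alternative sketch, PHP's http_build_query() can assemble the same query string and take care of the encoding, which may help if $site ever contains characters that need escaping. The parameter names are the ones already used in the URL.

```php
<?php
// Site to check (same value as in index.php)
$site = 'www.microsoft.com';

// http_build_query() URL-encodes each value, so ':' becomes %3A automatically.
$query = http_build_query(array(
    'hl'   => 'en',
    'q'    => 'site:' . $site,
    'btnG' => 'Search',
));
$url = 'http://www.google.com/search?' . $query;

echo $url, "\n";
```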
Let’s use cURL to connect to the $url.
$data = LoadCURLPage($url);
Now $data contains the HTML output for $url.
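Note that curl_exec() returns false when the request fails, so $data is not guaranteed to be a string. A small guard can stop the script early in that case. This is only a sketch; the function name ensure_page_loaded and the example URL are hypothetical.

```php
<?php
// Hypothetical guard: throws when the fetched page is missing or empty,
// otherwise hands the page content back unchanged.
function ensure_page_loaded($data, $url)
{
    if ($data === false || trim($data) === '') {
        throw new RuntimeException('Could not load ' . $url);
    }
    return $data;
}

// Simulate a failed fetch (false is what curl_exec() returns on failure)
try {
    $page = ensure_page_loaded(false, 'http://www.example.com/');
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```

In index.php the call would wrap the fetched value, for example ensure_page_loaded($data, $url).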
We will use extract_unit() to get the text between two strings. In our case, the total number of indexed pages for $site sits between ‘</b> of about <b>‘ and ‘</b>‘.
// Extract information between STRING 1 & STRING 2
$string_one = '</b> of about <b>';
$string_two = '</b>';
$info = extract_unit($data, $string_one, $string_two);
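The extracted value is plain text, and it may be empty or unexpected if the page markup differs from the delimiters. A possible sketch for turning it into an integer before using it is shown here. The helper name parse_count is hypothetical, and treating commas, dots and spaces as thousands separators is an assumption about how the count is formatted.

```php
<?php
// Hypothetical helper: converts an extracted count such as "1,234,000" to an integer.
// Assumes ',', '.' and ' ' only appear as thousands separators.
// Returns null when the remaining text is not made of digits.
function parse_count($text)
{
    $digits = str_replace(array(',', '.', ' '), '', trim($text));
    if (!preg_match('/^\d+$/', $digits)) {
        return null;
    }
    return (int) $digits;
}

$count = parse_count('1,234,000');
if ($count === null) {
    echo "No page count found.\n";
} else {
    echo $count, "\n";
}
```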
Output our result:
echo 'Google has indexed '.$info.' pages for '.$site.'.';
?>
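The script will output a sentence of the form "Google has indexed N pages for www.microsoft.com.", where N is the count extracted from the results page.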

- September 14, 2008
- article by Gabriel C.
- 12 comments
12 Replies to "PHP: Creating a simple web data (spider) extractor"
September 15, 2008 at 12:39 PM
I was just searching on something like this, came across your post.
Any idea how PHP would compare to Python for performance in an instance like this?
Also, how might this work for more advanced operations like indexing entire pages? Or pulling snippets of content via some sort of fuzzy logic or other AI methodology?
Cheers
September 16, 2008 at 4:29 AM
I haven’t used Python for creating web fetching scripts, so I can’t make a comparison (consider googling “python web fetch”). You can use regular expressions in PHP to fetch snippets of content based on specific patterns. I’ve even written a script that can extract URLs from links, which can be found here: http://www.bitrepository.com/web-programming/php/extract-urls-from-links.html. I hope it will give you an idea of how to make other scripts that fetch data.
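As a rough illustration of the regular-expression approach (this is a sketch, not the script from the linked article), preg_match_all() can collect every href value from the anchor tags in a page. The helper name extract_hrefs and the sample markup are hypothetical, and a pattern this simple will miss unquoted attributes.

```php
<?php
// Hypothetical helper: collects href values from <a> tags with a regular expression.
// Handles single- and double-quoted attributes, case-insensitively.
function extract_hrefs($html)
{
    preg_match_all('/<a\s[^>]*href\s*=\s*(["\'])(.*?)\1/is', $html, $matches);
    return $matches[2]; // second capture group holds the URL itself
}

// Made-up sample markup
$html = '<a href="http://example.com/a">A</a> <A class="x" HREF=\'/b\'>B</A>';
print_r(extract_hrefs($html));
```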
September 28, 2008 at 8:33 PM
If you don’t want to code a Web spider yourself, MetaSeeker is a good choice.
MetaSeeker ( http://www.gooseeker.com/en/node/product/front ) is a toolkit for precisely extracting data from the Web.
Noise is effectively filtered out of the results.
The result files, in XML format, are like tables in a relational database where each field holds content with exact semantics. Further manipulation of the results becomes very straightforward with the help of the data schema. At the same time, the manipulation doesn’t need a person’s involvement to handle special cases. This is a must for manipulating and mining data in bulk.
Please visit http://www.gooseeker.com/en for more information.
February 24, 2009 at 3:10 PM
hi dear friend , thats all i need . thanks you very much.
February 26, 2009 at 11:51 AM
thanks for the code
how can i get the url from a file and then fetch it?
i dont want to put 1 url at a time
thank you
March 23, 2009 at 12:41 PM
How can I loop through $data to find more matches of the same kind?
E.g. I have more than one result and I want to display all of them.
Thanks a lot!
April 30, 2009 at 8:26 PM
This is great code. When I tried it, it gave me “Please Enable Cookies”. Any suggestions?
November 1, 2009 at 5:24 AM
ur the man.
thanks for the inspiration and keep on posting this kind of stuff.
February 21, 2010 at 7:38 AM
hi, are U going to post more difficult examples of web extractors?
March 13, 2010 at 9:23 AM
Hey Vedrin, did you find your answer about getting all matches on the pages? You need to use a foreach loop, but I’m still having trouble getting it to work. Any help?
March 20, 2011 at 4:02 PM
Thanks so much for posting this. It is extremely useful.
May 2, 2013 at 6:29 AM
hi,
Can anyone please help me out with getting the images from websites?
In fact, I am getting them, but each website has its own method of uploading the images; some post them along with the domain, some don’t.
Can anyone tell me how I can check that and append the domain if it is not present?
thanks!