PHP: Creating a simple web data (spider) extractor

In this tutorial we will learn how to create a simple web spider that will extract specific information from a web page. Our script will have 2 files: index.php & functions.php. In our sample, the extractor will check how many pages from a site are indexed by Google.

First, we will create the library file which will have 2 functions: one to fetch the content from our pages and the other one to extract content between two strings (delimiters).


function LoadCURLPage($url, $agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)",
    $cookie = '', $referer = '', $post_fields = '', $return_transfer = 1,
    $follow_location = 1, $ssl = '', $curlopt_header = 0)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, $curlopt_header);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, $return_transfer);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $follow_location);

    // Only switch to POST when there is something to send
    if ($post_fields != '') {
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
    }

    if ($referer != '') {
        curl_setopt($ch, CURLOPT_REFERER, $referer);
    }

    // Verify the certificate's host name on HTTPS connections
    if ($ssl != '') {
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    }

    if ($cookie != '') {
        curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
    }

    $result = curl_exec($ch);

    curl_close($ch);

    return $result;
}

function extract_unit($string, $start, $end)
{
    // Position of the opening delimiter (case-insensitive)
    $pos = stripos($string, $start);
    if ($pos === false) {
        return '';
    }

    // Everything after the opening delimiter
    $str_two = substr($string, $pos + strlen($start));

    // Position of the closing delimiter
    $second_pos = stripos($str_two, $end);
    if ($second_pos === false) {
        return '';
    }

    $str_three = substr($str_two, 0, $second_pos);

    $unit = trim($str_three); // remove whitespace

    return $unit;
}
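Before wiring the functions up to a live request, extract_unit() can be sanity-checked against a literal string. The function body is repeated below so the snippet runs on its own; the sample string and delimiters are made up for illustration:

```php
<?php
// Standalone copy of extract_unit() so this snippet runs by itself
function extract_unit($string, $start, $end)
{
    $pos = stripos($string, $start);
    if ($pos === false) {
        return '';
    }
    $str_two = substr($string, $pos + strlen($start));
    $second_pos = stripos($str_two, $end);
    if ($second_pos === false) {
        return '';
    }
    return trim(substr($str_two, 0, $second_pos));
}

// Made-up sample: grab the text between <b> and </b>
$html = 'Price: <b>42 USD</b> per unit';
echo extract_unit($html, '<b>', '</b>'); // prints "42 USD"
```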

Let’s continue by creating the index.php file. We will start by including the functions file and setting up some configuration variables:


error_reporting(E_ALL ^ E_NOTICE);

include 'functions.php';

// Site to check
$site = '';

// Connect to this url using CURL
$url = ''.$site.'&btnG=Search';

Let’s use cURL to connect to the $url.

$data = LoadCURLPage($url);

Now $data contains the HTML output for $url.

We will use extract_unit() to get the information between two strings. In our case, the total number of indexed pages for $site sits between ‘</b> of about <b>’ and ‘</b>’. The relevant line of Google’s results page looks like this:

Results <b>1</b> – <b>10</b> of about <b>678,000</b> from <b></b>. (<b>0.04</b> seconds)
// Extract information between STRING 1 & STRING 2

$string_one = '</b> of about <b>';
$string_two = '</b>';

$info = extract_unit($data, $string_one, $string_two);

Output our result:

echo 'Google has indexed '.$info.' pages for '.$site.'.';

The script will output something like this:

Google has indexed 678,000 pages for
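One limitation worth noting: extract_unit() only returns the first occurrence between the delimiters. If a page contains several values wrapped in the same markers, preg_match_all() is one way to collect all of them. This is a sketch on a made-up string, not part of the original script:

```php
<?php
// Collect every value wrapped in <b>...</b> from a sample string
$html = 'Results <b>1</b> - <b>10</b> of about <b>678,000</b>';

// Non-greedy pattern: capture the shortest run between the tags
preg_match_all('/<b>(.*?)<\/b>/i', $html, $matches);

print_r($matches[1]); // $matches[1] is array('1', '10', '678,000')
```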

  1. channa says

    Can anyone please help me out with getting the images from the websites?
    In fact, I am getting them, but each website has its own method of uploading images; some post them along with the domain, some don’t.
    Can anyone tell me how I can check that and append the domain if it is not present?

  2. nonecantest says

    Hey Vedran, did you find your answer about getting all matches on the pages? You need to use a foreach loop, but I’m still having trouble getting it to work. Any help?

  3. Vedran says

    How can I loop through $data to find more of the same matches?

    E.g. I have more than one result and I want to display all of them.

    Thanks a lot!

  4. geo says

    If you don’t want to code a Web spider yourself, MetaSeeker is a good choice.

    MetaSeeker( ) is a toolkit to precisely extract data from the Web.

    Noise information is effectively filtered out of the results.

    The result files, in XML format, are like tables in a relational database, where each field of the table holds content with exact semantics. Further manipulation of the results becomes very straightforward with the help of the data schema. At the same time, the manipulation doesn’t need a person’s involvement to handle special cases. This is a must for manipulating and mining data in bulk.

    Please visit for more information.

  5. GabrielGabriel says

    I haven’t used Python for creating web-fetching scripts, so I can’t make a comparison (consider googling “python web fetch”). You can use regular expressions in PHP to fetch snippets of content based on specific patterns. I’ve even written a script that can extract URLs from links, which can be found here: I hope it will give you an idea of how to make other scripts that fetch data.

  6. Scott says

    I was just searching for something like this and came across your post.

    Any idea how PHP would compare to Python for performance in an instance like this?

    Also, how might this work for more advanced operations like indexing entire pages? Or pulling snippets of content via some sort of fuzzy logic or other AI methodology?

