PHP: Creating a simple web data (spider) extractor

Posted on September 14, 2008, under PHP 

In this tutorial we will learn how to create a simple web spider that extracts specific information from a web page. Our script will consist of two files: index.php and functions.php. In our example, the extractor will check how many pages from a site are indexed by Google.

First, we will create the library file, which will contain two functions: one to fetch the content of a page and another to extract the content between two strings (delimiters).

functions.php

<?php
// Fetch a page with cURL and return the response body as a string.
function LoadCURLPage($url, $agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)",
$cookie = '', $referer = '', $post_fields = '', $return_transfer = 1,
$follow_location = 1, $ssl = '', $curlopt_header = 0)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, $url);

    if ($ssl)
    {
        // Check that the certificate's common name matches the host
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    }

    // Include the response headers in the output only if requested
    curl_setopt($ch, CURLOPT_HEADER, $curlopt_header);

    if ($agent)
    {
        curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    }

    if ($post_fields)
    {
        // Switch to a POST request and attach the fields
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
    }

    // Honor the corresponding parameters instead of hard-coding 1
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, $return_transfer);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $follow_location);

    if ($referer)
    {
        curl_setopt($ch, CURLOPT_REFERER, $referer);
    }

    if ($cookie)
    {
        // Read cookies from and save them back to the same file
        curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
    }

    $result = curl_exec($ch);

    curl_close($ch);

    return $result;
}

// Return the trimmed text found between the first occurrence of $start
// and the following occurrence of $end (case-insensitive), or an empty
// string if either delimiter is missing.
function extract_unit($string, $start, $end)
{
    $pos = stripos($string, $start);

    if ($pos === false)
    {
        return '';
    }

    // Everything after the opening delimiter
    $str = substr($string, $pos + strlen($start));

    $second_pos = stripos($str, $end);

    if ($second_pos === false)
    {
        return '';
    }

    // Keep only the text before the closing delimiter, without whitespace
    return trim(substr($str, 0, $second_pos));
}
?>
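Before wiring these helpers into index.php, here is a quick usage sketch. The extract_unit() call runs on a hard-coded sample string, and the LoadCURLPage() call shows how the optional parameters are meant to be passed; the login URL, field names, and cookie file are hypothetical:

<?php
include 'functions.php';

// extract_unit(): pull "678,000" out of a sample string
$sample = 'Results <b>1</b> - <b>10</b> of about <b>678,000</b> from <b>example.com</b>';
echo extract_unit($sample, '</b> of about <b>', '</b>'); // prints 678,000

// LoadCURLPage(): a POST request with a cookie jar and a referer
$html = LoadCURLPage(
    'http://www.example.com/login.php', // $url (hypothetical)
    'Mozilla/5.0',                      // $agent
    '/tmp/cookies.txt',                 // $cookie (file + jar)
    'http://www.example.com/',          // $referer
    'user=demo&pass=secret'             // $post_fields (switches to POST)
);
?>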

Let’s continue by creating the index.php file. We will start by including the functions file and setting up some configuration variables:

index.php

<?php
error_reporting(E_ALL ^ E_NOTICE);

include 'functions.php';

// Site to check
$site = 'www.microsoft.com';

// Connect to this URL using cURL
$url = 'http://www.google.com/search?hl=en&q=site%3A'.$site.'&btnG=Search';

Let’s use cURL to connect to $url:

$data = LoadCURLPage($url);

Now $data contains the HTML output for $url.
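LoadCURLPage() returns false when the cURL request fails, so it is worth a quick check before parsing (a small addition, not in the original script):

if ($data === false)
{
    die('The request failed; check the URL or your connection.');
}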

We will use extract_unit() to get the information between two strings. In our case, the total number of indexed pages for $site sits between '</b> of about <b>' and '</b>', as in this fragment of Google's result markup:

Results <b>1</b> – <b>10</b> of about <b>678,000</b> from <b>www.microsoft.com</b>. (<b>0.04</b> seconds)

// Extract the information between STRING 1 & STRING 2
$string_one = '</b> of about <b>';
$string_two = '</b>';

$info = extract_unit($data, $string_one, $string_two);

Output our result:

echo 'Google has indexed '.$info.' pages for '.$site.'.';
?>

The script will output something like this:

Google has indexed 678,000 pages for www.microsoft.com.
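Keep in mind that this approach depends on Google's exact result markup, which changes over time; when the delimiter strings no longer appear, extract_unit() returns an empty string. A regular expression can make the match slightly more tolerant. Here is a sketch of that variant (not part of the original script) that could replace the extract_unit() call in index.php:

// Match "of about <b>NUMBER</b>" anywhere in $data
if (preg_match('/of about <b>([\d,]+)<\/b>/i', $data, $m))
{
    echo 'Google has indexed '.$m[1].' pages for '.$site.'.';
}
else
{
    echo 'Could not find the result count; the markup may have changed.';
}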


12 Replies to "PHP: Creating a simple web data (spider) extractor"

  1. I was just searching for something like this and came across your post.

    Any idea how PHP would compare to Python for performance in an instance like this?

    Also, how might this work for more advanced operations like indexing entire pages? Or pulling snippets of content via some sort of fuzzy logic or other AI methodology?

    Cheers

  2. I haven’t used Python for creating web-fetching scripts, so I can’t make a comparison (consider googling “python web fetch”). You can use regular expressions in PHP to fetch snippets of content based on specific patterns. I’ve even written a script that extracts URLs from links, which can be found here: http://www.bitrepository.com/web-programming/php/extract-urls-from-links.html. I hope it gives you an idea of how to write other scripts that fetch data.

  3. If you don’t want to code a web spider yourself, MetaSeeker is a good choice.

    MetaSeeker ( http://www.gooseeker.com/en/node/product/front ) is a toolkit for precisely extracting data from the Web.

    Noise is effectively filtered out of the results.

    The result files, in XML format, are like tables in a relational database, where each field holds content with exact semantics. Further manipulation of the results becomes very straightforward with the help of the data schema, and it doesn’t require manual intervention to handle special cases. This is a must for manipulating and mining data in bulk.

    Please visit http://www.gooseeker.com/en for more information.

  4. Hi dear friend, that’s all I need. Thank you very much.

  5. Thanks for the code.

    How can I read the URLs from a file and then fetch them? I don’t want to enter one URL at a time. (See the sketch after these replies.)

    Thank you.

  6. How can I loop through $data to find more matches? E.g., when I have more than one result, how can I display all of them? (See the sketch after these replies.)

    Thanks a lot!

  7. This is great code, but when I tried it, it gave me “Please Enable Cookies”. Any suggestions?

  8. You’re the man.
    Thanks for the inspiration, and keep on posting this kind of stuff.

  9. Hi, are you going to post more difficult examples of web extractors?

  10. Hey Vedrin, did you find your answer about getting all matches on the page? You need to use a foreach loop, but I’m still having trouble getting it to work. Any help? (The sketch after these replies shows one approach.)

  11. Thanks so much for posting this. It is extremely useful.

  12. Hi,
    can anyone please help me with getting images from websites?
    In fact, I am getting them, but each website references its images differently: some include the domain and some don’t.
    Can anyone tell me how to check for that and append the domain when it’s missing? (See the sketch after these replies.)
    Thanks!
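Several of the replies above (5, 6, 10 and 12) ask how to fetch more than one URL, collect every match, and fix relative image paths. Here is a minimal sketch for the first two questions; the file name urls.txt and the pattern are hypothetical, so adjust them to your data:

<?php
include 'functions.php';

// One URL per line in a hypothetical urls.txt
$urls = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

foreach ($urls as $url)
{
    $data = LoadCURLPage($url);

    // Collect every bold value instead of just the first match
    preg_match_all('/<b>([^<]+)<\/b>/i', $data, $matches);

    foreach ($matches[1] as $match)
    {
        echo $url.': '.$match."\n";
    }
}
?>

And for reply 12, a simplified helper that prepends a base URL when an image path has no scheme of its own. Note the assumption: it treats every relative path as root-relative, so paths relative to the current page's directory would need extra handling:

<?php
function absolutize($src, $base)
{
    // Leave absolute URLs (http://..., https://...) untouched
    if (parse_url($src, PHP_URL_SCHEME))
    {
        return $src;
    }

    // Assumption: treat the path as relative to the site root
    return rtrim($base, '/').'/'.ltrim($src, '/');
}

// Prints http://www.example.com/images/logo.gif
echo absolutize('/images/logo.gif', 'http://www.example.com');
?>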
