PHP: Creating a simple web data (spider) extractor

Posted on September 14, 2008, under PHP 

In this tutorial we will learn how to create a simple web spider that extracts specific information from a web page. The script consists of two files: index.php and functions.php. In our example, the extractor checks how many pages of a site are indexed by Google.

First, we will create the library file, which contains two functions: one to fetch the content of a page and one to extract the content between two strings (delimiters).

functions.php

<?php
function LoadCURLPage($url, $agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)",
$cookie = '', $referer = '', $post_fields = '', $return_transfer = 1,
$follow_location = 1, $ssl = '', $curlopt_header = 0)
{
$ch = curl_init(); 

curl_setopt($ch, CURLOPT_URL, $url);

if($ssl)
{
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST,  2);
}

curl_setopt ($ch, CURLOPT_HEADER, $curlopt_header);

if($agent)
{
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
}

if($post_fields)
{
curl_setopt($ch, CURLOPT_POST, 1); 
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields); 
}

// Use the function arguments instead of hard-coded values
curl_setopt($ch, CURLOPT_RETURNTRANSFER, $return_transfer);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $follow_location);

if($referer)
{
curl_setopt($ch, CURLOPT_REFERER, $referer);
}

if($cookie)
{
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
}

$result = curl_exec ($ch);

curl_close ($ch);

return $result;
}

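function extract_unit($string, $start, $end)
{
// Find the start delimiter (case-insensitive)
$pos = stripos($string, $start);

if ($pos === false)
{
return ''; // start delimiter not found
}

// Keep everything after the start delimiter
$str = substr($string, $pos + strlen($start));

// Find the end delimiter in the remaining text
$second_pos = stripos($str, $end);

if ($second_pos === false)
{
return ''; // end delimiter not found
}

$unit = trim(substr($str, 0, $second_pos)); // remove surrounding whitespace

return $unit;
}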
?>
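
The delimiter extraction can also be written with a regular expression. The following is only a sketch of that alternative, not the method used in functions.php: extract_unit_regex() is a hypothetical name, and the sample string is a shortened version of the kind of result line this tutorial works with.

```php
<?php
// Hypothetical helper (not part of the original functions.php):
// returns the trimmed text between $start and $end, or null when nothing matches.
function extract_unit_regex($string, $start, $end)
{
    // preg_quote() escapes regex metacharacters in the delimiters;
    // its second argument also escapes the '/' used as pattern delimiter.
    // Flags: i = case-insensitive (like stripos), s = dot also matches newlines.
    $pattern = '/' . preg_quote($start, '/') . '(.*?)' . preg_quote($end, '/') . '/is';

    if (preg_match($pattern, $string, $matches)) {
        return trim($matches[1]);
    }

    return null;
}

// Sample input made up for illustration
$sample = 'Results <b>1</b> - <b>10</b> of about <b>678,000</b> from <b>www.microsoft.com</b>.';

echo extract_unit_regex($sample, '</b> of about <b>', '</b>'); // 678,000
```

The non-greedy (.*?) stops at the first occurrence of the end delimiter, which mirrors what the substr()-based version does.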

Let’s continue creating the index.php file. We will start by including the functions file & setting up some configuration variables:

index.php

<?php
error_reporting (E_ALL ^ E_NOTICE);

include 'functions.php';

// Site to check
$site = 'www.microsoft.com';

// Connect to this url using CURL
$url = 'http://www.google.com/search?hl=en&q=site%3A'.$site.'&btnG=Search';
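
The query string above is assembled by hand. As a minimal sketch, PHP's built-in http_build_query() can take care of the URL-encoding instead. The $params array name is our own choice for illustration and is not part of the original script.

```php
<?php
$site = 'www.microsoft.com';

// $params is an illustrative name, not part of the original script.
$params = array(
    'hl'   => 'en',
    'q'    => 'site:' . $site, // http_build_query() URL-encodes the colon
    'btnG' => 'Search',
);

// Pass '&' explicitly so the separator does not depend on php.ini settings.
$url = 'http://www.google.com/search?' . http_build_query($params, '', '&');

echo $url; // http://www.google.com/search?hl=en&q=site%3Awww.microsoft.com&btnG=Search
```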

Let’s use cURL to fetch $url:

$data = LoadCURLPage($url);

Now $data contains the HTML output of $url.
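
curl_exec() returns false when a request fails, and LoadCURLPage() passes that value straight back, so it can help to check $data before extracting anything from it. A minimal sketch, assuming a hypothetical helper named has_page_content() that is not part of functions.php:

```php
<?php
// Hypothetical guard; $data stands for the return value of LoadCURLPage().
function has_page_content($data)
{
    // curl_exec() returns false on failure; a blank string means an empty body.
    return is_string($data) && trim($data) !== '';
}

$data = false; // simulated failed request, for illustration only

if (!has_page_content($data)) {
    echo 'Could not fetch the page.';
}
```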

We will use extract_unit() to get the information between two strings. In our case the total number of indexed pages for $site sits between '</b> of about <b>' and '</b>', as in this fragment of the results page:

Results <b>1</b> – <b>10</b> of about <b>678,000</b> from <b>www.microsoft.com</b>. (<b>0.04</b> seconds)
// Extract information between STRING 1 & STRING 2

$string_one = '</b> of about <b>';
$string_two = '</b>';

$info = extract_unit($data, $string_one, $string_two);

Output our result:

echo 'Google has indexed '.$info.' pages for '.$site.'.';
?>

The script will output something like this:

Google has indexed 678,000 pages for www.microsoft.com.
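
The extracted value is a formatted string such as "678,000". If the count is needed as a number, for comparisons or storage, a small conversion step helps. This is a sketch under the assumption that the value contains only digits and thousands separators; parse_result_count() is a hypothetical name, not part of the original script.

```php
<?php
// Hypothetical helper: turn a formatted count such as "678,000" into an integer.
function parse_result_count($text)
{
    // Drop everything that is not a digit (commas, dots, spaces).
    $digits = preg_replace('/\D/', '', $text);

    // Return null when no digits were found at all.
    return $digits === '' ? null : (int) $digits;
}

echo parse_result_count('678,000'); // 678000
```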
