Saturday, July 4th, 2009

Creating a simple web data (spider) extractor

by Gabriel on 14/09/08 at 11:41 am

Greetings coders,

In this tutorial we will learn how to create a simple web spider that extracts specific information from a web page. Our script will consist of two files: index.php and functions.php. In our example, the extractor will check how many pages of a site are indexed by Google.

First, we will create the library file which will have 2 functions: one to fetch the content from our pages and the other one to extract content between two strings (delimiters).

functions.php

<?php
// Fetch $url with cURL. With the default arguments this returns the page
// as a string, or false if the request fails.
function LoadCURLPage($url,
    $agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)',
    $cookie = '', $referer = '', $post_fields = '', $return_transfer = 1,
    $follow_location = 1, $ssl = '', $curlopt_header = 0)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, $url);

    // For HTTPS, check that the certificate matches the host name
    if ($ssl)
    {
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    }

    // Set to 1 to include the response headers in the output
    curl_setopt($ch, CURLOPT_HEADER, $curlopt_header);

    if ($agent)
    {
        curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    }

    // Send a POST request when there are fields to submit
    if ($post_fields)
    {
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
    }

    // Return the response as a string and follow redirects, as requested
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, $return_transfer);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $follow_location);

    if ($referer)
    {
        curl_setopt($ch, CURLOPT_REFERER, $referer);
    }

    // Read cookies from, and save them to, the same file
    if ($cookie)
    {
        curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
    }

    $result = curl_exec($ch);

    curl_close($ch);

    return $result;
}

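// Return the text found between $start and $end (case-insensitive search),
// or an empty string when either delimiter is missing.
function extract_unit($string, $start, $end)
{
    $pos = stripos($string, $start);

    if ($pos === false)
    {
        return '';
    }

    // Keep everything after the start delimiter
    $str = substr($string, $pos + strlen($start));

    $second_pos = stripos($str, $end);

    if ($second_pos === false)
    {
        return '';
    }

    $unit = trim(substr($str, 0, $second_pos)); // remove surrounding whitespace

    return $unit;
}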
?>
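
To see what extract_unit() returns without making any network request, it can be run against a hard-coded string. The snippet below is only a minimal sketch: it repeats a compact copy of the function so that it runs on its own, and the sample markup and delimiters are made up for illustration.

```php
<?php
// Compact copy of extract_unit() from functions.php, repeated here only so
// that this sketch runs on its own.
function extract_unit($string, $start, $end)
{
    $str = substr($string, stripos($string, $start) + strlen($start));

    return trim(substr($str, 0, stripos($str, $end)));
}

// Made-up sample markup (an assumption, not taken from a real page)
$html = '<html><head><TITLE> My Test Page </TITLE></head><body></body></html>';

// stripos() is case-insensitive, so lowercase delimiters still match <TITLE>
$title = extract_unit($html, '<title>', '</title>');

echo $title; // My Test Page
```

Because the search is case-insensitive and the result is trimmed, the delimiters do not have to match the page's letter case or spacing exactly.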

Let’s continue by creating the index.php file. We will start by including the functions file and setting up some configuration variables:

index.php

<?php
error_reporting (E_ALL ^ E_NOTICE);

include 'functions.php';

// Site to check
$site = 'www.microsoft.com';

// Connect to this url using CURL
$url = 'http://www.google.com/search?hl=en&q=site%3A'.$site.'&btnG=Search';

Let’s use cURL to connect to the $url.

$data = LoadCURLPage($url);

Now $data contains the HTML output for $url.
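
LoadCURLPage() hands back whatever curl_exec() returns, and that is false when the request fails. The sketch below shows one way to tell a failed fetch from a real page before extracting anything. The helper name fetch_page_or_error(), the 10-second timeout and the returned array layout are assumptions made for this example; they are not part of the tutorial's files.

```php
<?php
// Hypothetical helper (not part of functions.php): fetch a URL and report
// failures instead of passing false on to the extraction step.
function fetch_page_or_error($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10); // assumed limit, in seconds

    $body  = curl_exec($ch);                        // false when the request fails
    $error = curl_error($ch);                       // empty string when nothing went wrong
    $code  = curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 when no HTTP response arrived

    if ($body === false)
    {
        return array('ok' => false, 'body' => '', 'error' => $error);
    }

    if ($code >= 400)
    {
        return array('ok' => false, 'body' => $body, 'error' => 'HTTP status '.$code);
    }

    return array('ok' => true, 'body' => $body, 'error' => '');
}

// An unsupported scheme fails without touching the network
$result = fetch_page_or_error('nosuchscheme://example.invalid/');

if (!$result['ok'])
{
    echo 'Fetch failed: '.$result['error'];
}
```
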

We will use extract_unit() to get the information between two strings. In our case, the total number of indexed pages for $site sits between '</b> of about <b>' and '</b>' in the results line of Google's HTML:

Results <b>1</b> – <b>10</b> of about <b>678,000</b> from <b>www.microsoft.com</b>. (<b>0.04</b> seconds)

// Extract information between STRING 1 & STRING 2

$string_one = '</b> of about <b>';
$string_two = '</b>';

$info = extract_unit($data, $string_one, $string_two);

Output our result:

echo 'Google has indexed '.$info.' pages for '.$site.'.';
?>

The script will output something like this:

Google has indexed 678,000 pages for www.microsoft.com.
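
extract_unit() stops at the first match. When a page contains several values between the same pair of delimiters, the same stripos()/substr() idea can be repeated in a loop. The following is a rough sketch only; extract_all_units() is a hypothetical helper that is not part of functions.php, and the test string is adapted from the sample results line above.

```php
<?php
// Hypothetical helper (not part of functions.php): collect every piece of
// text that sits between $start and $end, not just the first one.
function extract_all_units($string, $start, $end)
{
    $units  = array();
    $offset = 0;

    // Search from $offset so that each pass picks up the next occurrence
    while (($pos = stripos($string, $start, $offset)) !== false)
    {
        $from = $pos + strlen($start);
        $to   = stripos($string, $end, $from);

        if ($to === false)
        {
            break; // no closing delimiter left
        }

        $units[] = trim(substr($string, $from, $to - $from));
        $offset  = $to + strlen($end);
    }

    return $units;
}

// Adapted from the sample results line quoted in the tutorial
$line = 'Results <b>1</b> - <b>10</b> of about <b>678,000</b> from '
      . '<b>www.microsoft.com</b>. (<b>0.04</b> seconds)';

$units = extract_all_units($line, '<b>', '</b>');

print_r($units); // 1, 10, 678,000, www.microsoft.com, 0.04
```
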



7 Comments

Scott

Sep 15th, 2008

I was just searching on something like this, came across your post.

Any idea how PHP would compare to Python for performance in an instance like this?

Also, how might this work for more advanced operations like indexing entire pages? Or pulling snippets of content via some sort of fuzzy logic or other AI methodology?

Cheers

Gabriel

Sep 16th, 2008

I haven’t used Python for creating web fetching scripts, so I can’t make a comparison (consider googling “python web fetch”). You can use regular expressions in PHP to fetch snippets of content based on specific patterns. I’ve even written a script that extracts URLs from links, which can be found here: http://www.bitrepository.com/web-programming/php/extract-urls-from-links.html. I hope it will give you an idea of how to make other scripts that fetch data.

geo

Sep 28th, 2008

If you don’t want to code a Web spider by yourself, MetaSeeker is a good choice.

MetaSeeker (http://www.gooseeker.com/en/node/product/front) is a toolkit to precisely extract data from the Web.

Noise information is effectively kept out of the results.

The result files, in XML format, are like tables in a relational database, where each field of the table holds content with exact semantics. Further manipulation of the results becomes very straightforward with the help of the data schema. At the same time, the manipulation doesn’t need a person’s involvement to handle special cases. This is a must for manipulating and mining data in bulk.

Please visit http://www.gooseeker.com/en for more information.

reza

Feb 24th, 2009

Hi dear friend, that’s all I need. Thank you very much.

meir peres

Feb 26th, 2009

Thanks for the code.

How can I get the URLs from a file and then fetch them?

I don’t want to enter one URL at a time.

Thank you.

Vedran

Mar 23rd, 2009

How can I loop through $data to find more matches of the same kind?

E.g., I have more than one result and I want to display all of them.

Thanks a lot!

moody

Apr 30th, 2009

This is great code. When I tried it, it gave me “Please Enable Cookies”. Any suggestions?
