Posted on December 17, 2008, Filed under PHP
The aim of this tutorial is to help you create a web-fetching script that can extract content from a password-protected area using the necessary login credentials. I will use the well-known cURL library to connect to the protected area. PHP's cURL functions are built on libcurl, so the cURL extension must be enabled for the script to work.
The following script signs in to YouTube and fetches the latest favorite videos.
First, let’s create the file functions.php, which contains a general-purpose cURL function and a second function that extracts the content between two delimiters:
<?php
function LoadCURLPage($url, $agent='', $cookie='', $referer='', $post_fields='', $ssl='')
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
if($ssl) curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
curl_setopt ($ch, CURLOPT_HEADER, 0);
if($agent) curl_setopt($ch, CURLOPT_USERAGENT, $agent);
if($post_fields)
{
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
}
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
if($referer) curl_setopt($ch, CURLOPT_REFERER, $referer);
if($cookie)
{
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
}
$result = curl_exec ($ch);
curl_close ($ch);
return $result;
}
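function extract_unit($string, $start, $end)
{
// Find the start delimiter (case-insensitive); give up if it is missing
$pos = stripos($string, $start);
if ($pos === false) return '';
// Keep everything after the start delimiter
$str = substr($string, $pos + strlen($start));
// Find the end delimiter; give up if it is missing
$second_pos = stripos($str, $end);
if ($second_pos === false) return '';
$unit = trim(substr($str, 0, $second_pos)); // remove surrounding whitespace
return $unit;
}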
?>
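To see what extract_unit() returns, it can be run on a literal string instead of a fetched page. The snippet below is a minimal sketch: it repeats the same extraction logic so that it runs on its own, and the sample markup and delimiters are made up for illustration, not taken from any real page.

```php
<?php
// Same extraction logic as extract_unit() in functions.php,
// repeated here so the snippet is self-contained.
function extract_unit($string, $start, $end)
{
    $pos = stripos($string, $start);
    $str = substr($string, $pos);
    $str_two = substr($str, strlen($start));
    $second_pos = stripos($str_two, $end);
    $str_three = substr($str_two, 0, $second_pos);
    return trim($str_three);
}

// Hypothetical markup, used only for illustration.
$page = '<html><TITLE>  My favorite videos </TITLE><body>...</body></html>';

// stripos() is case-insensitive, so lowercase delimiters still match <TITLE>.
$title = extract_unit($page, '<title>', '</title>');

echo $title; // prints "My favorite videos"
```

The delimiters are matched case-insensitively and the result is trimmed, so differences in tag case or padding on the page do not affect the extracted text.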
Posted on December 12, 2008, Filed under PHP
PHP Simple HTML DOM Parser, written for PHP 5+, lets you manipulate HTML very easily. Because it tolerates invalid HTML, this parser is more robust than PHP scripts that rely on complicated regular expressions to extract information from web pages.
Before extracting anything, create a DOM from either a URL or a file. The following script extracts links and images from a website:
// Create DOM from URL or file
$html = file_get_html('http://www.microsoft.com/');
// Extract links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
// Extract images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
The parser can also be used to modify HTML elements:
// Create DOM from string
$html = str_get_html('<div id="simple">Simple</div><div id="parser">Parser</div>');
$html->find('div', 1)->class = 'bar';
$html->find('div[id=simple]', 0)->innertext = 'Foo';
// Output: <div id="simple">Foo</div><div id="parser" class="bar">Parser</div>
echo $html;
Posted on December 10, 2008, Filed under AJAX, JavaScript
In this tutorial I will show you how to display Yahoo! Weather on your website using Magpie RSS, a PHP RSS parser, and jQuery (AJAX). The Weather RSS feed takes two parameters: ‘p’ for the location and ‘u’ for the degree units. The base URL of the feed is http://weather.yahooapis.com/forecastrss. For example, the address that returns the weather forecast for Sunnyvale, CA (zip code: 94089) in Celsius (c) is http://weather.yahooapis.com/forecastrss?p=94089&u=c.
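The feed address can be assembled from its two parameters rather than typed by hand. The following is a minimal sketch: the base URL and the parameter names ‘p’ and ‘u’ come from the description above, while the helper name build_weather_url() is hypothetical and not part of Magpie RSS or any Yahoo! library.

```php
<?php
// Base URL of the Weather RSS feed, as given in the text.
$base_url = 'http://weather.yahooapis.com/forecastrss';

// Hypothetical helper: builds the feed URL for a location and degree unit.
function build_weather_url($base_url, $location, $units = 'c')
{
    // http_build_query() URL-encodes the values and joins the pairs with '&'.
    $query = http_build_query(array('p' => $location, 'u' => $units), '', '&');
    return $base_url . '?' . $query;
}

$feed_url = build_weather_url($base_url, '94089', 'c');

echo $feed_url; // prints "http://weather.yahooapis.com/forecastrss?p=94089&u=c"
```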
Posted on September 14, 2008, Filed under PHP
In this tutorial we will learn how to create a simple web spider that will extract specific information from a web page. Our script will have 2 files: index.php & functions.php. In our sample, the extractor will check how many pages from a site are indexed by Google.
First, we will create the library file which will have 2 functions: one to fetch the content from our pages and the other one to extract content between two strings (delimiters).
functions.php
<?php
// The default user agent is kept on one line: a line break inside the string would be sent as part of the header
function LoadCURLPage($url,
$agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)',
$cookie = '', $referer = '', $post_fields = '', $return_transfer = 1,
$follow_location = 1, $ssl = '', $curlopt_header = 0)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
if($ssl)
{
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
}
curl_setopt ($ch, CURLOPT_HEADER, $curlopt_header);
if($agent)
{
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
}
if($post_fields)
{
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
}
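// Honor the $return_transfer and $follow_location arguments instead of hard-coding 1
curl_setopt($ch, CURLOPT_RETURNTRANSFER, $return_transfer);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $follow_location);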
if($referer)
{
curl_setopt($ch, CURLOPT_REFERER, $referer);
}
if($cookie)
{
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
}
$result = curl_exec ($ch);
curl_close ($ch);
return $result;
}
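function extract_unit($string, $start, $end)
{
// Find the start delimiter (case-insensitive); give up if it is missing
$pos = stripos($string, $start);
if ($pos === false) return '';
// Keep everything after the start delimiter
$str = substr($string, $pos + strlen($start));
// Find the end delimiter; give up if it is missing
$second_pos = stripos($str, $end);
if ($second_pos === false) return '';
$unit = trim(substr($str, 0, $second_pos)); // remove surrounding whitespace
return $unit;
}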
?>
Let’s continue creating the index.php file. We will start by including the functions file & setting up some configuration variables:
index.php
<?php
error_reporting (E_ALL ^ E_NOTICE);
include 'functions.php';
// Site to check
$site = 'www.microsoft.com';
// Connect to this url using CURL
$url = 'http://www.google.com/search?hl=en&q=site%3A'.$site.'&btnG=Search';
Let’s use cURL to connect to the $url.
$data = LoadCURLPage($url);
Now $data contains the HTML output for $url.
We will use extract_unit() to get the text between two strings. In our case, the total number of indexed pages for $site sits between ‘</b> of about <b>’ and ‘</b>’, in a line such as:
Results <b>1</b> – <b>10</b> of about <b>678,000</b> from <b>www.microsoft.com</b>. (<b>0.04</b> seconds)
// Extract information between STRING 1 & STRING 2
$string_one = '</b> of about <b>';
$string_two = '</b>';
$info = extract_unit($data, $string_one, $string_two);
Output our result:
echo 'Google has indexed '.$info.' pages for '.$site.'.';
?>
The script will output something like this:
Google has indexed 678,000 pages for www.microsoft.com.
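The extracted value is still a formatted string such as "678,000". If it needs to be compared or stored as a number, it can be converted first. The snippet below is a minimal sketch; count_to_int() is a hypothetical helper, not part of the original script, and it assumes the count uses commas as thousands separators, as in the sample output above.

```php
<?php
// Hypothetical helper: turns a formatted count such as '678,000' into an integer.
// Returns null when the text is not a plain or comma-grouped number, which is
// what happens if the page layout changed and the delimiters no longer match.
function count_to_int($text)
{
    $text = trim($text);
    if (!preg_match('/^(\d{1,3}(,\d{3})*|\d+)$/', $text)) {
        return null;
    }
    return (int) str_replace(',', '', $text);
}

$info = '678,000'; // value taken from the sample output above
$pages = count_to_int($info);

if ($pages === null) {
    echo 'Could not read the page count.';
} else {
    echo 'Google has indexed ' . number_format($pages) . ' pages.';
}
```

Rejecting anything that does not look like a count keeps stray HTML from being reported as a result when the extraction fails.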
Posted on September 2, 2008, Filed under PHP
A common cURL function that can be used for multiple purposes:
<?php
function LoadCURLPage($url, $agent = '', $cookie = "", $referer = "",
$post_fields = "", $return_transfer = 1, $follow_location = 1)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt ($ch, CURLOPT_HEADER, 0);
if($agent)
{
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
}
if($post_fields)
{
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
}
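// Honor the $return_transfer and $follow_location arguments instead of hard-coding 1
curl_setopt($ch, CURLOPT_RETURNTRANSFER, $return_transfer);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $follow_location);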
if($referer)
{
curl_setopt($ch, CURLOPT_REFERER, $referer);
}
if($cookie)
{
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
}
$result = curl_exec ($ch);
curl_close ($ch);
return $result;
}
### Usage Examples ####
/* LOGIN (POST Method) */
$login_url = 'http://www.domain.com/login';
// Login
$post_info = array("username" => "username_here",
"password" => "password_here");
// http_build_query() URL-encodes each value and joins the pairs with '&'
$post_data = http_build_query($post_info, '', '&');
// Keep the user agent on one line: a line break inside the string would be sent as part of the header
$agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)";
$cookie_file_path = 'temp/cookie.txt';
// The original passed an undefined $login_page as the referer; the login URL is assumed here
$result = LoadCURLPage($login_url, $agent, $cookie_file_path,
$login_url, $post_data);
echo $result;
/* Fetch a web page (GET Method) */
$result = LoadCURLPage('http://www.yahoo.com/', $agent);
echo $result;
?>
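When the POST body is passed to CURLOPT_POSTFIELDS as a string, each value has to be URL-encoded, otherwise characters such as ‘&’ and ‘=’ inside a password change what the server receives. The snippet below is a minimal sketch of the difference; the credentials are made up, and parse_str() stands in for the server-side decoding.

```php
<?php
// Hypothetical credentials; the password deliberately contains '&' and '='.
$post_info = array('username' => 'username_here', 'password' => 'p&ss=word');

// Naive concatenation, without encoding the values.
$naive = '';
foreach ($post_info as $name => $value) {
    $naive .= $name . '=' . $value . '&';
}
$naive = substr($naive, 0, -1); // drop the trailing '&'

// http_build_query() percent-encodes every name and value.
$encoded = http_build_query($post_info, '', '&');

// parse_str() decodes a query string into an array, as a receiving script would.
parse_str($naive, $seen_naive);
parse_str($encoded, $seen_encoded);

echo $naive . "\n";   // username=username_here&password=p&ss=word
echo $encoded . "\n"; // username=username_here&password=p%26ss%3Dword
var_dump($seen_naive['password'] === 'p&ss=word');   // bool(false)
var_dump($seen_encoded['password'] === 'p&ss=word'); // bool(true)
```

With the unencoded string the password arrives truncated to "p" and an extra field named "ss" appears, which is why the encoded form is the safer default for login requests.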