Scraping Data: PHP Simple HTML DOM Parser

Posted on December 12, 2008. Filed under PHP.

PHP Simple HTML DOM Parser, written for PHP 5+, lets you manipulate HTML with very little code. Because it tolerates invalid HTML, it is a better choice than PHP scripts that rely on complicated regular expressions to extract information from web pages.

Before extracting anything, create a DOM from either a URL or a file. The following script extracts links and images from a website:

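// Load the library (adjust the path to wherever simple_html_dom.php is installed)
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.microsoft.com/');

// Extract links: print the href attribute of every <a> element
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}

// Extract images: print the src attribute of every <img> element
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}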

The parser can also be used to modify HTML elements:

// Create DOM from string
$html = str_get_html('<div id="simple">Simple</div><div id="parser">Parser</div>');

$html->find('div', 1)->class = 'bar';

$html->find('div[id=simple]', 0)->innertext = 'Foo';

// Output: <div id="simple">Foo</div><div id="parser" class="bar">Parser</div>
echo $html;


PHP: Some Ways to Scan a Directory

Posted on December 12, 2008. Filed under PHP.

This is a list of methods for retrieving the contents of a directory (folder). The following functions return both the files and the folders located in a given directory.

scandir()

array scandir ( string $directory [, int $sorting_order [, resource $context]] )

This function is available only in PHP 5+. It returns an array of the files and folders found in the specified path. Here’s a usage example:

<?php
$dir    = '/some_files';
$files1 = scandir($dir); // sort in alphabetical order (Ascending)
$files2 = scandir($dir, 1); // sort in alphabetical order (Descending)

echo "<pre>"; print_r($files1); echo "</pre>";
echo "<pre>"; print_r($files2); echo "</pre>";
?>

The example above would result in something like:

Array
(
    [0] => .
    [1] => ..
    [2] => images
    [3] => index.php
    [4] => text_file.txt
)
Array
(
    [0] => text_file.txt
    [1] => index.php
    [2] => images
    [3] => ..
    [4] => .
)
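
The listing from scandir() always includes the special entries "." and "..", and it mixes files with folders. The sketch below shows one possible way to drop the dot entries and separate the two kinds. The helper name list_directory() and the throwaway directory it scans are assumptions made for illustration; they are not part of the original post.

```php
<?php
// Hypothetical helper (not from the original post): list a directory
// without "." and "..", split into files and folders.
function list_directory($dir)
{
    $entries = scandir($dir);
    if ($entries === false) {
        return false; // directory missing or unreadable
    }

    // Drop the two special entries; array_values() reindexes from 0.
    $entries = array_values(array_diff($entries, array('.', '..')));

    $result = array('files' => array(), 'folders' => array());
    foreach ($entries as $entry) {
        // scandir() returns bare names, so rebuild the full path for is_dir().
        $path = $dir . DIRECTORY_SEPARATOR . $entry;
        $key  = is_dir($path) ? 'folders' : 'files';
        $result[$key][] = $entry;
    }

    return $result;
}

// Build a throwaway directory so the sketch runs anywhere.
$dir = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'scandir_demo_' . uniqid();
mkdir($dir);
mkdir($dir . DIRECTORY_SEPARATOR . 'images');
touch($dir . DIRECTORY_SEPARATOR . 'index.php');
touch($dir . DIRECTORY_SEPARATOR . 'text_file.txt');

$listing = list_directory($dir);

echo "<pre>"; print_r($listing); echo "</pre>";
```

The is_dir() check needs the full path because scandir() returns names relative to the scanned directory, not to the script's working directory.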
