Scraping Data: PHP Simple HTML DOM Parser

Posted on December 12, 2008, under PHP 

PHP Simple HTML DOM Parser, written for PHP 5+, lets you manipulate HTML very easily. Because it tolerates invalid HTML, it is a better choice than PHP scripts that rely on complicated regular expressions to extract information from web pages.

Before extracting anything, create a DOM object from a URL or a file. The following script extracts the links and images from a web page:

// Create DOM from URL or file
$html = file_get_html('http://www.microsoft.com/');

// Extract links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}

// Extract images
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}
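
In practice you often want the links in an array rather than echoed directly, and some anchors have no href at all. The following is a minimal sketch of that idea, not part of the parser itself: collect_links() is a hypothetical helper name, the markup is made up for illustration, and simple_html_dom.php is assumed to sit next to the script.

```php
<?php
// Assumption: simple_html_dom.php is in the same directory as this script.
include 'simple_html_dom.php';

// Hypothetical helper: return the unique, non-empty href values found in a DOM.
function collect_links($html)
{
    $links = array();
    foreach ($html->find('a') as $element) {
        $href = $element->href;
        // Anchors without an href attribute do not yield a string; skip them.
        if (is_string($href) && $href !== '') {
            $links[] = $href;
        }
    }
    // Drop duplicates and reindex from 0.
    return array_values(array_unique($links));
}

// Inline markup keeps the sketch independent of any live website.
$html = str_get_html(
    '<a href="/a">A</a><a href="/a">A again</a><a>No href</a>'
    . '<a href="http://example.com/b">B</a>'
);
print_r(collect_links($html));
```

The same helper should work on a DOM created with file_get_html(); keep in mind that href values are returned as written in the page, so relative links stay relative.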

The parser can also be used to modify HTML elements:

// Create DOM from string
$html = str_get_html('<div id="simple">Simple</div><div id="parser">Parser</div>');

$html->find('div', 1)->class = 'bar';

$html->find('div[id=simple]', 0)->innertext = 'Foo';

// Output: <div id="simple">Foo</div><div id="parser" class="bar">Parser</div>
echo $html; 
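
One thing to watch for when modifying elements: if find() is given an index that matches nothing, there is no element to assign to, and setting a property on the result fails. A small guard avoids that. This is only a sketch; set_attribute_if_found() is a hypothetical helper, and simple_html_dom.php is assumed to be in the same directory.

```php
<?php
// Assumption: simple_html_dom.php is in the same directory as this script.
include 'simple_html_dom.php';

// Hypothetical helper: set an attribute only when the selector and index
// actually match an element. Returns true on success, false otherwise.
function set_attribute_if_found($html, $selector, $index, $name, $value)
{
    $element = $html->find($selector, $index);
    if (!$element) {
        return false; // nothing matched at that index
    }
    $element->$name = $value;
    return true;
}

$html = str_get_html('<div id="simple">Simple</div><div id="parser">Parser</div>');

// The second div exists, so the class is set.
var_dump(set_attribute_if_found($html, 'div', 1, 'class', 'bar'));

// There is no sixth div, so nothing is changed and no error is raised.
var_dump(set_attribute_if_found($html, 'div', 5, 'class', 'bar'));

echo $html;
```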


To retrieve a page's content without any tags, use the plaintext property:

echo file_get_html('http://www.yahoo.com/')->plaintext;

The parser's package files (http://simplehtmldom.sourceforge.net/) include scraping examples for Digg, IMDb and Slashdot. Let’s create one that extracts the titles of the first 10 results for the keyword “php” from Google:

$url = 'http://www.google.com/search?hl=en&q=php&btnG=Search';

// Create DOM from URL
$html = file_get_html($url);

// Match all 'a' tags whose class attribute is equal to 'l'
foreach ($html->find('a[class=l]') as $key => $info) {
    echo ($key + 1) . '. ' . $info->plaintext . "<br />\n";
}
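
The loop above prints every match on the page, so the "first 10" limit depends on how many results the page happens to return. If you want to enforce the limit yourself, one option is to slice the array that find() returns. The sketch below assumes simple_html_dom.php is in the same directory; first_titles() is a hypothetical helper, and the markup is generated stand-in data rather than a real Google response (the class name 'l' is taken from the example above and may not match the markup Google serves today).

```php
<?php
// Assumption: simple_html_dom.php is in the same directory as this script.
include 'simple_html_dom.php';

// Hypothetical helper: plain-text titles of the first $limit elements
// matching $selector.
function first_titles($html, $selector, $limit = 10)
{
    $titles = array();
    // find() without an index returns an array, so array_slice() can cap it.
    foreach (array_slice($html->find($selector), 0, $limit) as $info) {
        $titles[] = trim($info->plaintext);
    }
    return $titles;
}

// Stand-in markup with 12 matching links, so the limit has something to cut.
$markup = '';
for ($i = 1; $i <= 12; $i++) {
    $markup .= '<a class="l" href="#">Result ' . $i . '</a>';
}

$html = str_get_html($markup);
foreach (first_titles($html, 'a[class=l]') as $key => $title) {
    echo ($key + 1) . '. ' . $title . "<br />\n";
}
```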

NOTE: Make sure to include the parser before using any of its functions:

include 'simple_html_dom.php';

For more information on using the parser, see the ‘PHP Simple HTML DOM Parser’ manual. To download the package files, use the following URL: http://sourceforge.net/project/showfiles.php?group_id=218559.

39 Replies to "Scraping Data: PHP Simple HTML DOM Parser"

  1. Thank you for the information. I’ve been trying the wp-scrapper and feedWordpress plugins with some customization to serve my purpose, but nothing worked.
    Thanks again. :)

  2. With the following code:

    <div class="foreCondition">Chance of T-storms<div>70% chance of precipitation</div></div>
    </div>
    <div class="foreGlance">
    <div class="titleSubtle">Wednesday</div>
    <div class="foreSummary">
    <a class="iconSwitchSmall"><img src="http://icons-ak.wxug.com/i/c/g/chancetstorms.gif" alt="Chance of a Thunderstorm" title="Chance of a Thunderstorm" class="condIcon" /></a>
    <span class="b">28</span> | 21 &deg;C
    </div>

    How do I extract only the "21 &deg;C"?

    Thanks!
