Scraping Data: PHP Simple HTML DOM Parser

Posted on December 12, 2008. Filed under PHP.

PHP Simple HTML DOM Parser, written for PHP 5+, lets you manipulate HTML very easily. Because it tolerates invalid HTML, it is a better choice than PHP scripts that rely on complicated regular expressions to extract information from web pages.

Before extracting anything, create a DOM object from either a URL or a file. The following script extracts links and images from a website:

// Create DOM from URL or file
$html = file_get_html('http://www.microsoft.com/');

// Extract links
foreach ($html->find('a') as $element)
{
    echo $element->href . '<br>';
}

// Extract images
foreach ($html->find('img') as $element)
{
    echo $element->src . '<br>';
}

The parser can also be used to modify HTML elements:

// Create DOM from string
$html = str_get_html('<div id="simple">Simple</div><div id="parser">Parser</div>');

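// Add a class attribute to the second div (index 1)
$html->find('div', 1)->class = 'bar';

// Replace the inner text of the div whose id is "simple"
$html->find('div[id=simple]', 0)->innertext = 'Foo';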

// Output: <div id="simple">Foo</div><div id="parser" class="bar">Parser</div>
echo $html;


To retrieve a page's content without any tags, read the plaintext property:

echo file_get_html('http://www.yahoo.com/')->plaintext;
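
The plaintext property returns one long string. If an array of lines is easier to work with, plain PHP can split it. The following is only a minimal sketch: plaintext_to_lines is a hypothetical helper, not part of the parser, and the sample string stands in for whatever plaintext returns for a real page.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): split a block of
// plain text into an array of trimmed, non-empty lines.
function plaintext_to_lines($text)
{
    // Split on any line ending: \r\n, \r or \n
    $lines = preg_split('/\r\n|\r|\n/', $text);

    // Trim surrounding whitespace from every line
    $lines = array_map('trim', $lines);

    // Drop lines that are empty after trimming, then reindex from 0
    return array_values(array_filter($lines, 'strlen'));
}

// Sample input standing in for the value of ->plaintext
$sample = "Yahoo!\n\n   Mail  \r\nNews\n";
print_r(plaintext_to_lines($sample));
```

For the sample string this gives an array with the three entries Yahoo!, Mail and News. With a real page, the same call would take the plaintext value as its argument.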

The parser's package files (http://simplehtmldom.sourceforge.net/) include some scraping examples for Digg, IMDb and Slashdot. Let’s create one that extracts the titles of the first 10 results for the keyword “php” from Google:

$url = 'http://www.google.com/search?hl=en&q=php&btnG=Search';

// Create DOM from URL
$html = file_get_html($url);

// Match all <a> tags whose class attribute equals 'l'
foreach ($html->find('a[class=l]') as $key => $info)
{
    echo ($key + 1) . '. ' . $info->plaintext . "<br />\n";
}
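
Hand-written query strings break easily once the keyword contains spaces or special characters. As a hedged sketch, the URL above could be assembled with PHP's http_build_query() instead. The helper name build_search_url is hypothetical, and the optional num parameter is an assumption taken from a reader comment below rather than something verified here. Whether Google still accepts these parameters or still marks result links with class 'l' is likewise not guaranteed.

```php
<?php
// Hypothetical helper: build the search URL used above from a keyword.
// The optional "num" parameter is an assumption based on a reader comment.
function build_search_url($keyword, $num = null)
{
    $params = array(
        'hl'   => 'en',
        'q'    => $keyword,
        'btnG' => 'Search',
    );

    if ($num !== null) {
        $params['num'] = (int) $num;
    }

    // http_build_query() URL-encodes each value; the '&' separator is passed
    // explicitly so the result does not depend on the arg_separator.output setting
    return 'http://www.google.com/search?' . http_build_query($params, '', '&');
}

echo build_search_url('php'), "\n";
echo build_search_url('simple html dom', 50), "\n";
```

The first call reproduces the $url value used in the example above, so the result can be passed straight to file_get_html().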

NOTE: Include the parser before calling any of its functions:

include 'simple_html_dom.php';
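
Since the parser needs PHP 5 or newer, it can help to check the running version before loading it, so that an old server reports a readable message instead of a parse error. This is only a sketch under assumptions: php_is_at_least is a hypothetical helper, and the file path assumes simple_html_dom.php sits next to the calling script.

```php
<?php
// Hypothetical guard: returns true when the running PHP version is at
// least $minimum (PHP 5+ is the stated requirement for the parser).
function php_is_at_least($minimum)
{
    return version_compare(PHP_VERSION, $minimum, '>=');
}

// Path is an assumption: adjust it to wherever simple_html_dom.php lives
$parserFile = dirname(__FILE__) . '/simple_html_dom.php';

if (!php_is_at_least('5.0.0')) {
    echo "PHP 5 or newer is required, found " . PHP_VERSION . "\n";
} elseif (!is_file($parserFile)) {
    echo "Parser file not found: $parserFile\n";
} else {
    // require_once stops with a fatal error if loading fails and avoids
    // declaring the parser's functions twice
    require_once $parserFile;
}
```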

For more information on using the parser, see the ‘PHP Simple HTML DOM Parser’ manual. To download the package files, use the following URL: http://sourceforge.net/project/showfiles.php?group_id=218559.

21 Replies to "Scraping Data: PHP Simple HTML DOM Parser"

  1. [...] 14. Scraping data with PHP Simple HTML DOM Parser [...]

  2. What about when you want to scrape data from search results that are submitted with the POST method?

  3. I can not make it work. I simply put the files on my server and it gave me the error:

    "Parse error: parse error, expecting `T_OLD_FUNCTION' or `T_FUNCTION' or `T_VAR' or `'}'' in c:\easyphp\www\r2\3\simple_html_dom.php on line 84 "

    How can I make it work?

  4. It should work 100%. Make sure you didn't make any changes to simple_html_dom.php and that you have PHP 5 installed on your server.

  5. Is PHP 5 required for simple_html_dom? The website seems to say PHP 4, with no mention of minimum requirements, but I'm getting the same parse error as popescu. It works on my local test server (PHP 5) but not in production (PHP 4).

  6. Woooops..nevermind. I read it wrong. It does say php5 required.

    Go on about your normal daily business.

  7. I see that "file_get_html('http://www.yahoo.com/')->plaintext;"
    returns one big plain string. Is there a way to get an array of strings instead?

  8. How can I scrape a page with a charset other than UTF-8? In my case I want to scrape a page with charset=windows-1257, but in the results I get unknown symbols for some non-Latin letters.

  9. OK, I found the solution. The iconv function solves the problem: http://uk.php.net/manual/en/function.iconv.php

  10. I cannot run file_get_html. What are the requirements for this HTML DOM parser?

    1. PHP5+ is required for the PHP HTML Dom Parser.

  11. Is there an easy way to get more than 10 results from Google? Let's say you want 50. The URL would get &start=10 for page 2, &start=20 for page 3, etc.

    1. Add &num=50 to the Google URL, or any number you want between 1 and 100.

  12. This looks great. I just had a go but couldn't really get it working :(

    would really appreciate some help.

    What I'm trying to do is parse some information within a particular div with all tags intact, e.g. grab everything as it is within the

    The particular div has a table in it, and I want to copy it all intact. I don't think I need a for loop, as I'm not repeating the search. I just want to copy the contents of the first element with id test that it hits (there is only one on the page anyway).

    thank you for your help, really appreciate it :)

    fl3×7

  13. Ahhh, sorry, your comment box seems to have stripped some of the code I put in :(

  14. This plugin is just brilliant, so easy to sort out large amounts of data :D <3 this plugin!

  15. What if I want to find a tag with multiple attributes? Let's say I have a table with width, colspan, cellpadding, border… attributes, and I want to find that particular table. In XPath this can be done with a conjunction. What is the equivalent in simple_html_dom?

  16. I can't use it, because to read another URL I need cURL. Can I add it to the file?

  17. This DOM creator is fantastic! I just spent the last several hours messing around with very complex regex, curl, and preg_match code trying to accomplish only 1/2 of what I was able to accomplish with your script in 2 minutes! Thank you so much!
