Scraping Data: PHP Simple HTML DOM Parser

Posted on December 12, 2008, under PHP

PHP Simple HTML DOM Parser, written for PHP 5+, lets you manipulate HTML very easily. Because it tolerates invalid HTML, it is a better choice than PHP scripts that rely on complicated regexes to extract information from web pages.

Before extracting anything, create a DOM from either a URL or a file. The following script extracts the links and images from a website:

// Create DOM from URL or file
$html = file_get_html('http://www.microsoft.com/');

// Extract links
foreach($html->find('a') as $element)
       echo $element->href . '<br>'; 

// Extract images
foreach($html->find('img') as $element)
       echo $element->src . '<br>';
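
A fetch can fail, and some anchors carry no href at all, so a slightly more defensive variant may be useful. The following is only a sketch, assuming simple_html_dom.php sits in the same directory; the helper names extract_links and fetch_links are hypothetical and not part of the library, and the exact value returned on failure may differ between library versions:

```php
<?php
include 'simple_html_dom.php';

// Hypothetical helper, not part of the library: collects the href of every
// link in an HTML string and returns an empty array if nothing can be parsed.
function extract_links($markup)
{
    $html = str_get_html($markup);
    if (!$html) {
        return array();
    }

    $links = array();
    foreach ($html->find('a') as $element) {
        // Anchors without an href do not yield a string here, so skip them
        if (is_string($element->href) && $element->href !== '') {
            $links[] = $element->href;
        }
    }

    $html->clear(); // free the memory held by the DOM
    return $links;
}

// Hypothetical helper: fetches the page with a built-in function first, so a
// failed request is reported as false rather than as an error further down.
function fetch_links($url)
{
    $markup = @file_get_contents($url);
    return ($markup === false) ? false : extract_links($markup);
}

// Demo on an inline string, so no network access is needed
print_r(extract_links('<a href="/a">A</a> <a name="top">no href</a> <a href="/b">B</a>'));
```

Calling fetch_links('http://www.microsoft.com/') would then return either an array of links or false when the page could not be retrieved.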

The parser can also be used to modify HTML elements:

// Create DOM from string
$html = str_get_html('<div id="simple">Simple</div><div id="parser">Parser</div>');

$html->find('div', 1)->class = 'bar';

$html->find('div[id=simple]', 0)->innertext = 'Foo';

// Output: <div id="simple">Foo</div><div id="parser" class="bar">Parser</div>
echo $html;
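
Besides echoing the DOM, the modified markup can be captured as a string or written to a file with the parser's save() method. A short sketch, assuming simple_html_dom.php is in the same directory; the output path is a hypothetical location chosen for illustration:

```php
<?php
include 'simple_html_dom.php';

$html = str_get_html('<div id="simple">Simple</div><div id="parser">Parser</div>');
$html->find('div[id=parser]', 0)->innertext = 'Bar';

// save() without an argument returns the modified document as a string
$markup = $html->save();

// save() with a path also writes the document to that file
$path = sys_get_temp_dir() . '/modified.html'; // hypothetical output location
$html->save($path);

echo $markup;
```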


To retrieve a page's content without any tags, use the plaintext property:

echo file_get_html('http://www.yahoo.com/')->plaintext;
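
The plaintext property of the whole document is one long string. If separate pieces of text are needed, one option is to read plaintext per element and collect the results in an array. A minimal sketch on an inline string, assuming simple_html_dom.php is in the same directory and that paragraphs are the unit of interest:

```php
<?php
include 'simple_html_dom.php';

$html = str_get_html('<div><p>First paragraph</p><p>Second <b>bold</b> one</p></div>');

// Collect the tag-free text of each paragraph as a separate array entry
$paragraphs = array();
foreach ($html->find('p') as $p) {
    $paragraphs[] = trim($p->plaintext);
}

print_r($paragraphs);
```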

The parser's package files (http://simplehtmldom.sourceforge.net/) include scraping examples for Digg, IMDb and Slashdot. Let’s create one that extracts the titles of the first 10 results for the keyword “php” from Google:

$url = 'http://www.google.com/search?hl=en&q=php&btnG=Search';

// Create DOM from URL
$html = file_get_html($url);

// Match all 'a' tags whose class attribute equals 'l'
foreach($html->find('a[class=l]') as $key => $info)
{
    echo ($key + 1).'. '.$info->plaintext."<br />\n";
}
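
The query string in $url can also be assembled programmatically. A minimal sketch using PHP's built-in http_build_query; the helper name build_search_url is hypothetical, and the num and start parameters are the ones readers mention in the comments on this post (whether Google still honors them, or still marks result links with the class 'l', is not guaranteed):

```php
<?php
// Hypothetical helper: builds a Google search URL for a keyword.
// $num is clamped to 1-100 and $start to a non-negative offset.
function build_search_url($keyword, $num = 10, $start = 0)
{
    $params = array(
        'hl'    => 'en',
        'q'     => $keyword,
        'num'   => max(1, min(100, (int) $num)),
        'start' => max(0, (int) $start),
    );

    // Pass '&' explicitly so the separator does not depend on php.ini settings
    return 'http://www.google.com/search?' . http_build_query($params, '', '&');
}

echo build_search_url('php scraping', 50);
```

The resulting URL can be passed to file_get_html() exactly as $url is in the example above.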

NOTE: Include the parser before calling any of its functions:

include 'simple_html_dom.php';

For more information on using the parser, consult the ‘PHP Simple HTML DOM Parser’ manual. To download the package files, use the following URL: http://sourceforge.net/project/showfiles.php?group_id=218559.


35 Replies to "Scraping Data: PHP Simple HTML DOM Parser"

  1. [...] 14. Scraping data with PHP Simple HTML DOM Parser [...]

  2. How about when you want to scrape data from a search result that uses the POST method?

  3. I can not make it work. I simply put the files on my server and it gave me the error:

    "Parse error: parse error, expecting `T_OLD_FUNCTION' or `T_FUNCTION' or `T_VAR' or `'}'' in c:\easyphp\www\r2\3\simple_html_dom.php on line 84 "

    how can I make it to work?

  4. It should work 100%. Make sure you didn't make any changes on simple_html_dom.php and you have PHP5 installed on your server.

  5. Is php 5 required for simple_html_dom – says on the website that it's php4 and there is no mention of minimum requirements but I'm getting the same parse error as popescu is getting. It works on my local test server php5 but not on production php4

  6. Woooops..nevermind. I read it wrong. It does say php5 required.

    Go on about your normal daily business.

  7. I see "file_get_html('http://www.yahoo.com/')->plaintext;"
    returns one big plain string. Is there a way I can get an array of strings instead?

  8. How to scrape page with different charset than utf-8? In my case I want to scrape page with charset=windows-1257 but in the results I get unknown symbols for some non latin letters

  9. OK, I found the solution: the iconv function can solve the problem. http://uk.php.net/manual/en/function.iconv.php

  10. I cannot run file_get_html. What are the requirements for this HTML DOM parser?

    1. PHP5+ is required for the PHP HTML Dom Parser.

  11. Is there an easy way to get more than 10 results from Google? Lets say you want 50. The url would get &start=10 for page 2, &start=20 for page 3, etc.

    1. Add &num=50 to the Google URL, or any number you want between 1 and 100.

      1. a simple array will generate 1-1000

  12. this looks great. i just had a go but couldnt really get it working :(

    would really appreciate some help.

    what im trying to do is parse some information within a particular div with all tags intact. Eg grab everything as it is within the

    The particular div has a table in it. i want it to copy parse it all intact. i dont need a for loop i dont think as im not repeating the search. just want it to copy the contents of the first id test it hits (as there is only one on the page anyway)

    thank you for your help, really appreciate it :)

    fl3x7

  13. ahhh sorry your comments box seems to have stripped some of the code i put in :(

  14. This plugin is just brilliant, so easy to sort out large amounts of data :D <3 this plugin!

  15. What if I want to find a tag with multiple attributes? Let's say I have a table with width, colspan, cellpadding, border… attributes, and I want to find that particular table. In XPath this can be done with a conjunction; what is the conjunction in simple_html_dom?

  16. I can’t use it, because to read another URL I need cURL. Can I insert it in the file?

  17. This DOM creator is fantastic! I just spent the last several hours messing around with very complex regex, curl, and preg_match code trying to accomplish only 1/2 of what I was able to accomplish with your script in 2 minutes! Thank you so much!

  18. this parser is excellent.
    10/10 for this parser

  19. i noticed you had ways of getting data from div, a, img. is there a way to get data from that has an id associated with it?

    1. Does this work?

      $html->find('div[id=EnterIdHere]', 0);

  20. I can’t get by the example code for scraping slashdot.org it gives errors then appears to display data. dont understand why it complains with errors and then acts like it works and displays data?

    Notice: Trying to get property of non-object in C:\Users\owner\websites\testcode\simplehtmldom\example\scraping\example_scraping_slashdot.php on line 11

    Notice: Trying to get property of non-object in C:\Users\owner\websites\testcode\simplehtmldom\example\scraping\example_scraping_slashdot.php on line 13

    1. I’ve tested the same code, and indeed, it doesn’t work properly. That code was written back in 2008. Meanwhile, the http://slashdot.org website, changed its HTML structure. You get those notices because the data can’t be extracted properly.

      I have changed the scraping_slashdot() function to parse correctly with the new site. Here it is:

      function scraping_slashdot() {
          // create HTML DOM
          $html = file_get_html('http://slashdot.org/');

          // get article block
          foreach($html->find('div[id^=firehose-]') as $article) {

              if(isSet($article->find('a.datitle', 0)->plaintext)) {

                  if(isSet($article->find('a.datitle', 0)->plaintext)) {
                      // get title
                      $item['title'] = trim($article->find('a.datitle', 0)->plaintext);
                  }

                  if(isSet($article->find('div[id^=text-]', 0)->plaintext)) {
                      // get body
                      $item['body'] = trim($article->find('div[id^=text-]', 0)->plaintext);
                  } else {
                      $item['body'] = 'No Information';
                  }

                  $ret[] = $item;
              }
          }

          // clean up memory
          $html->clear();
          unset($html);

          return $ret;
      }
      1. I removed the second check you do for a.datitle

        function scraping_slashdot() {
            // create HTML DOM
            $html = file_get_html('http://slashdot.org/');

            // get article block
            foreach($html->find('div[id^=firehose-]') as $article) {

                if(isSet($article->find('a.datitle', 0)->plaintext)) {

                    // get title
                    $item['title'] = trim($article->find('a.datitle', 0)->plaintext);

                    if(isSet($article->find('div[id^=text-]', 0)->plaintext)) {
                        // get body
                        $item['body'] = trim($article->find('div[id^=text-]', 0)->plaintext);
                    } else {
                        $item['body'] = 'No Information';
                    }
                    $ret[] = $item;
                }

            }

            // clean up memory
            $html->clear();
            unset($html);

            return $ret;
        }

  21. I edited the above just to play with this scraping stuff. changed site to my site and want to check for the div with id =main and return all the classes named pagetext. errors were :

    Notice: Undefined variable: ret in C:\Users\owner\websites\testcode\simplehtmldom\example\scraping\scraping_slashdot.php on line 26

    Warning: Invalid argument supplied for foreach() in C:\Users\owner\websites\testcode\simplehtmldom\example\scraping\scraping_slashdot.php on line 33

    code is

    find('#main]') as $article) {

    if(isSet($article->find('.pagetext', 0)->plaintext)) {

    // get title
    $item['body'] = trim($article->find('.pagetext', 0)->plaintext);

    } else {
    $item['body'] = 'No Information';
    }
    $ret[] = $item;
    }

    // clean up memory
    $html->clear();
    unset($html);

    return $ret;
    }
    // -----------------------------------------------------------------------------
    // test it!

    $ret = scraping_slashdot();

    foreach($ret as $v) {

    echo '';
    echo ''.$v['body'].'';
    echo '';
    }
    ?>

    why is $ret coming back empty?

  22. I believe it may have something to do with the fact that may not have plaintext. The check for plaintext in is coming up false. This may be it.

    so how does one scrape the text in between tags?

  23. I used this in a project recently, I especially like the jquery style selectors. This library was much needed. Before I had been using xml parsers, but they aren’t suited for html DOM.

    @ Gabriel – thanks for rewriting the slashdot scraper!

  24. I can’t understand how i can fetch my data if its has div based structure? Can you guide me plz

    1. Check out the following page: http://simplehtmldom.sourceforge.net/

      Click on the “Scraping Slashdot!” tab and you will see an example of scraping DIVs.

  25. On:
    $url = 'http://www.google.com/search?hl=en&q=php&btnG=Search';

    // Create DOM from URL
    $html = file_get_html($url);

    // Match all 'A' tags that have the class attribute equal with 'l'
    foreach($html->find('a[class=l]') as $key => $info)
    {
    echo ($key + 1).'. '.$info->plaintext."<br />\n";
    }

    Can someone give me a hint on how can I output the data on each table cell.
    Example:

    <table width="100%" border="0" cellspacing="0" cellpadding="0">
    <tr>
    <td>Data 1</td>
    <td>Data 2</td>
    <td>Data 3</td>
    <td>Data 4</td>
    </tr>
    </table>

    Thanks in advance.

  26. I am trying to scrape a page with simple_html_dom.php but have run into a problem. I am looking for an HTML tag, but the page only has an opening tag on some of the elements to be scraped. E.g.:
    <p class="blue"> blaa blaa blaa<p>
    <p class="blue">hey hey hey
    <p class="blue">ha ha ha<p>
    Note the missing p tag on the second element. How can I scrape this? I cannot change the HTML.
