Scraping Data: PHP Simple HTML DOM Parser

PHP Simple HTML DOM Parser, written for PHP 5+, lets you manipulate HTML very easily. Because it tolerates invalid HTML, it is a better choice than PHP scripts that rely on complicated regexes to extract information from web pages.

Before extracting anything, create a DOM from either a URL or a file. The following script extracts links and images from a website:

// Create DOM from URL or file
$html = file_get_html('http://www.microsoft.com/');

// Extract links
foreach($html->find('a') as $element)
       echo $element->href . '<br>'; 

// Extract images
foreach($html->find('img') as $element)
       echo $element->src . '<br>';

The parser can also be used to modify HTML elements:

// Create DOM from string
$html = str_get_html('<div id="simple">Simple</div><div id="parser">Parser</div>');

$html->find('div', 1)->class = 'bar';

$html->find('div[id=simple]', 0)->innertext = 'Foo';

// Output: <div id="simple">Foo</div><div id="parser" class="bar">Parser</div>
echo $html; 


Do you wish to retrieve content without any tags?

echo file_get_html('http://www.yahoo.com/')->plaintext;

The package files of this parser (http://simplehtmldom.sourceforge.net/) include scraping examples for Digg, IMDb, and Slashdot. Let’s create one that extracts the first 10 results (titles only) for the keyword “php” from Google:

$url = 'http://www.google.com/search?hl=en&q=php&btnG=Search';

// Create DOM from URL
$html = file_get_html($url);

// Match all 'a' tags whose class attribute equals 'l'
foreach($html->find('a[class=l]') as $key => $info)
{
    echo ($key + 1).'. '.$info->plaintext."<br />\n";
}
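
To go beyond the first 10 results, Google's result pages can be addressed with a start offset (start=10 for page 2, start=20 for page 3, and so on). The sketch below only builds those URLs. google_result_urls() is a hypothetical helper, not part of Simple HTML DOM, and it assumes 10 results per page.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): returns one search URL
// per results page, assuming 10 results per page and a 'start' offset.
function google_result_urls($keyword, $total, $perPage = 10)
{
    $urls = array();
    for ($start = 0; $start < $total; $start += $perPage) {
        $params = array('hl' => 'en', 'q' => $keyword, 'btnG' => 'Search');
        if ($start > 0) {
            // Page 1 needs no offset; later pages start at 10, 20, ...
            $params['start'] = $start;
        }
        $urls[] = 'http://www.google.com/search?' . http_build_query($params, '', '&');
    }
    return $urls;
}

// Print the five page URLs needed for 50 results
foreach (google_result_urls('php', 50) as $url) {
    echo $url, "\n";
}
```

Each URL can then be passed to file_get_html() and scanned with the same find() loop used for the first page.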

NOTE: Make sure to include the parser before calling any of its functions:

include 'simple_html_dom.php';
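
Since the parser targets PHP 5+, it can help to check the PHP version and the file location before including it. This is only a sketch: can_load_simple_html_dom() is a hypothetical helper, and the path assumes simple_html_dom.php sits next to your script.

```php
<?php
// Hypothetical helper: true when the given PHP version is 5 or newer
// and the parser file can be read at the given path.
function can_load_simple_html_dom($phpVersion, $parserPath)
{
    return version_compare($phpVersion, '5.0.0', '>=') && is_readable($parserPath);
}

// Assumed location: the parser file lives in the same directory as this script
$parserPath = dirname(__FILE__) . '/simple_html_dom.php';

if (can_load_simple_html_dom(PHP_VERSION, $parserPath)) {
    include $parserPath;
} else {
    echo "simple_html_dom.php needs PHP 5+ and must be readable at $parserPath\n";
}
```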

For more on using the parser, see the ‘PHP Simple HTML DOM Parser’ manual. To download the package files, use the following URL: http://sourceforge.net/project/showfiles.php?group_id=218559.

39 Comments

  1. says

    I can not make it work. I simply put the files on my server and it gave me the error:

    "Parse error: parse error, expecting `T_OLD_FUNCTION' or `T_FUNCTION' or `T_VAR' or `'}'' in c:\easyphp\www\r2\3\simple_html_dom.php on line 84 "

    how can I make it to work?

  2. Gabriel says

    It should work 100%. Make sure you didn't make any changes on simple_html_dom.php and you have PHP5 installed on your server.

  3. says

    Is php 5 required for simple_html_dom – says on the website that it's php4 and there is no mention of minimum requirements but I'm getting the same parse error as popescu is getting. It works on my local test server php5 but not on production php4

  4. anonymous says

    How to scrape page with different charset than utf-8? In my case I want to scrape page with charset=windows-1257 but in the results I get unknown symbols for some non latin letters

  5. Bob Brodie says

    Is there an easy way to get more than 10 results from Google? Lets say you want 50. The url would get &start=10 for page 2, &start=20 for page 3, etc.

  6. fl3x7 says

    this looks great. i just had a go but couldnt really get it working :(

    would really appreciate some help.

    what im trying to do is parse some information within a particular div with all tags intact. Eg grab everything as it is within the

    The particular div has a table in it. i want it to copy parse it all intact. i dont need a for loop i dont think as im not repeating the search. just want it to copy the contents of the first id test it hits (as there is only one on the page anyway)

    thank you for your help, really appreciate it :)

    fl3x7

  7. Vasile says

    What if I want to find a tag with multiple attributes? Let's say I have a table with width, colspan, cellpadding, and border attributes, and I want to find that particular table. In XPath this can be done with a conjunction; what is the equivalent in simple_html_dom?

  8. Hawknut says

    This DOM creator is fantastic! I just spent the last several hours messing around with very complex regex, curl, and preg_match code trying to accomplish only 1/2 of what I was able to accomplish with your script in 2 minutes! Thank you so much!

  9. Jeff says

    i noticed you had ways of getting data from div, a, img. is there a way to get data from that has an id associated with it?

  10. Norm says

    I can’t get by the example code for scraping slashdot.org it gives errors then appears to display data. dont understand why it complains with errors and then acts like it works and displays data?

    Notice: Trying to get property of non-object in C:\Users\owner\websites\testcode\simplehtmldom\example\scraping\example_scraping_slashdot.php on line 11

    Notice: Trying to get property of non-object in C:\Users\owner\websites\testcode\simplehtmldom\example\scraping\example_scraping_slashdot.php on line 13

    • Gabriel C. says

      I’ve tested the same code and, indeed, it doesn’t work properly. That code was written back in 2008. Meanwhile, the http://slashdot.org website changed its HTML structure. You get those notices because the data can’t be extracted properly.

      I have changed the scraping_slashdot() function to parse correctly with the new site. Here it is:

      function scraping_slashdot() {
          // create HTML DOM
          $html = file_get_html('http://slashdot.org/');

          // collect results here so an empty array is returned when nothing matches
          $ret = array();

          // get article block
          foreach($html->find('div[id^=firehose-]') as $article) {

              if(isset($article->find('a.datitle', 0)->plaintext)) {

                  if(isset($article->find('a.datitle', 0)->plaintext)) {
                      // get title
                      $item['title'] = trim($article->find('a.datitle', 0)->plaintext);
                  }

                  if(isset($article->find('div[id^=text-]', 0)->plaintext)) {
                      // get body
                      $item['body'] = trim($article->find('div[id^=text-]', 0)->plaintext);
                  } else {
                      $item['body'] = 'No Information';
                  }

                  $ret[] = $item;
              }
          }

          // clean up memory
          $html->clear();
          unset($html);

          return $ret;
      }
      • says

        I removed the second check you do for a.datitle

        function scraping_slashdot() {
            // create HTML DOM
            $html = file_get_html('http://slashdot.org/');

            // collect results here so an empty array is returned when nothing matches
            $ret = array();

            // get article block
            foreach($html->find('div[id^=firehose-]') as $article) {

                if(isset($article->find('a.datitle', 0)->plaintext)) {
                    // get title
                    $item['title'] = trim($article->find('a.datitle', 0)->plaintext);

                    if(isset($article->find('div[id^=text-]', 0)->plaintext)) {
                        // get body
                        $item['body'] = trim($article->find('div[id^=text-]', 0)->plaintext);
                    } else {
                        $item['body'] = 'No Information';
                    }
                    $ret[] = $item;
                }
            }

            // clean up memory
            $html->clear();
            unset($html);

            return $ret;
        }

  11. Norm says

    I edited the above just to play with this scraping stuff. changed site to my site and want to check for the div with id =main and return all the classes named pagetext. errors were :

    Notice: Undefined variable: ret in C:\Users\owner\websites\testcode\simplehtmldom\example\scraping\scraping_slashdot.php on line 26

    Warning: Invalid argument supplied for foreach() in C:\Users\owner\websites\testcode\simplehtmldom\example\scraping\scraping_slashdot.php on line 33

    code is

    find('#main]') as $article) {

    if(isSet($article->find('.pagetext', 0)->plaintext)) {

    // get title
    $item['body'] = trim($article->find('.pagetext', 0)->plaintext);

    } else {
    $item['body'] = 'No Information';
    }
    $ret[] = $item;
    }

    // clean up memory
    $html->clear();
    unset($html);

    return $ret;
    }
    // -----------------------------------------------------------------------------
    // test it!

    $ret = scraping_slashdot();

    foreach($ret as $v) {

    echo '';
    echo ''.$v['body'].'';
    echo '';
    }
    ?>

    why is $ret coming back empty?

  12. Norm says

    I believe it may have something to do with the fact that may not have plaintext. The check for plaintext in is coming up false. This may be it.

    so how does one scrape the text in between tags?

  13. says

    I used this in a project recently, I especially like the jquery style selectors. This library was much needed. Before I had been using xml parsers, but they aren’t suited for html DOM.

    @ Gabriel – thanks for rewriting the slashdot scraper!

  14. saijin says

    On:
    $url = 'http://www.google.com/search?hl=en&q=php&btnG=Search';

    // Create DOM from URL
    $html = file_get_html($url);

    // Match all 'A' tags that have the class attribute equal with 'l'
    foreach($html->find('a[class=l]') as $key => $info)
    {
    echo ($key + 1).'. '.$info->plaintext."<br />\n";
    }

    Can someone give me a hint on how can I output the data on each table cell.
    Example:

    <table width="100%" border="0" cellspacing="0" cellpadding="0">
    <tr>
    <td>Data 1</td>
    <td>Data 2</td>
    <td>Data 3</td>
    <td>Data 4</td>
    </tr>
    </table>

    Thanks in advance.

  15. Mark says

    I am trying to scrape a page with simple_html_dom.php but have run into a problem. I am looking for an html tag but the page only has an opening tag on some of the elements to be scraped. EG
    <p class="blue"> blaa blaa blaa<p>
    <p class="blue">hey hey hey
    <p class="blue">ha ha ha<p>
    Note the missing p tag on the second element. How can i scrape this??? I cannot change the html.

  16. Andrea Vartanian says

    I love you dude…
    You saved my life!!! :D
    I have a same project ( to university ) for tomorrow and i didn’t write anything yet, even a line! :-p

    and you saved my life… :D
    Tnx dude…

  17. says

    Thank you for the information. I’ve been trying wp-scrapper and feedWordpress plugins with some customization to work out my purpose but nothing worked out.
    Thanks again. :)

  18. says

    With the following code:

    <div class="foreCondition">Chance of T-storms<div>70% chance of precipitation</div></div>
    </div>
    <div class="foreGlance">
    <div class="titleSubtle">Wednesday</div>
    <div class="foreSummary">
    <a class="iconSwitchSmall"><img src="http://icons-ak.wxug.com/i/c/g/chancetstorms.gif" alt="Chance of a Thunderstorm" title="Chance of a Thunderstorm" alt="Chance of a Thunderstorm" title="Chance of a Thunderstorm" class="condIcon" /></a>
    <span class="b">28</span> | 21 &deg;C
    </div>

    How do you extract to get only the "21 &deg;C"

    Thanks!
