Scraping Data: PHP Simple HTML DOM Parser
Posted on December 12, 2008, under PHP,
PHP Simple HTML DOM Parser, written for PHP 5+, lets you manipulate HTML very easily. Because it tolerates invalid HTML, this parser is better than PHP scripts that rely on complicated regexes to extract information from web pages.
Before extracting anything, create a DOM from either a URL or a file. The following script extracts links and images from a website:
// Create DOM from URL or file
$html = file_get_html('http://www.microsoft.com/');
// Extract links
foreach($html->find('a') as $element)
    echo $element->href . '<br>';
// Extract images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';
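To try the same idea without network access, here is a minimal sketch that collects links and images from an inline string. Note the assumptions: it uses PHP's bundled DOM extension (DOMDocument), which is a different parser from Simple HTML DOM, and the sample markup is made up for illustration.

```php
<?php
// Assumption: PHP's bundled DOM extension is used here, not Simple HTML DOM.
// The markup is made-up sample input.
$markup = '<html><body>'
        . '<a href="/about">About</a>'
        . '<a href="/contact">Contact</a>'
        . '<img src="/logo.png" alt="Logo">'
        . '</body></html>';

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($markup);
libxml_clear_errors();

// Gather the href of every <a> element.
$links = array();
foreach ($dom->getElementsByTagName('a') as $element) {
    $links[] = $element->getAttribute('href');
}

// Gather the src of every <img> element.
$images = array();
foreach ($dom->getElementsByTagName('img') as $element) {
    $images[] = $element->getAttribute('src');
}

echo implode("\n", $links), "\n";
echo implode("\n", $images), "\n";
```

Collecting the values into arrays first, rather than echoing inside the loop, makes it easier to filter or de-duplicate them before output.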
The parser can also be used to modify HTML elements:
// Create DOM from string
$html = str_get_html('<div id="simple">Simple</div><div id="parser">Parser</div>');
$html->find('div', 1)->class = 'bar';
$html->find('div[id=simple]', 0)->innertext = 'Foo';
// Output: <div id="simple">Foo</div><div id="parser" class="bar">Parser</div>
echo $html;
Do you wish to retrieve content without any tags?
echo file_get_html('http://www.yahoo.com/')->plaintext;
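As a rough illustration of what "content without any tags" means, the sketch below strips the tags from a made-up snippet using only PHP built-ins (strip_tags and html_entity_decode). This is an assumed stand-in for experimenting offline, not how the parser's plaintext property is implemented.

```php
<?php
// Made-up sample markup; not taken from any real page.
$markup = '<div id="simple"><b>Simple</b> &amp; <i>easy</i></div>';

// Remove the tags, then turn entities such as &amp; back into characters.
$text = html_entity_decode(strip_tags($markup), ENT_QUOTES, 'UTF-8');

echo $text, "\n";
```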
In the package files of this parser (http://simplehtmldom.sourceforge.net/) you can find some scraping examples for Digg, IMDb, and Slashdot. Let’s create one that extracts the first 10 results (titles only) for the keyword “php” from Google:
$url = 'http://www.google.com/search?hl=en&q=php&btnG=Search';
// Create DOM from URL
$html = file_get_html($url);
// Match all 'a' tags whose class attribute is equal to 'l'
foreach($html->find('a[class=l]') as $key => $info)
{
    echo ($key + 1).'. '.$info->plaintext."<br />\n";
}
NOTE: Make sure to include the parser before using any of its functions:
include 'simple_html_dom.php';
For more information on using this parser, check the ‘PHP Simple HTML DOM Parser’ manual. To download the package files, use the following URL: http://sourceforge.net/project/showfiles.php?group_id=218559.
- December 12, 2008
- article by Gabriel C.
- 35 comments

35 Replies to "Scraping Data: PHP Simple HTML DOM Parser"
January 21, 2009 at 11:57 PM
What about when you want to scrape data from a search result that uses the POST method?
January 22, 2009 at 6:10 AM
In this case, I recommend the following post: Create a PHP script that logins in to a password protected area.
January 23, 2009 at 7:38 AM
I can not make it work. I simply put the files on my server and it gave me the error:
"Parse error: parse error, expecting `T_OLD_FUNCTION' or `T_FUNCTION' or `T_VAR' or `'}'' in c:\easyphp\www\r2\3\simple_html_dom.php on line 84 "
How can I make it work?
January 29, 2009 at 7:09 AM
It should work 100%. Make sure you didn't make any changes on simple_html_dom.php and you have PHP5 installed on your server.
February 1, 2009 at 6:21 PM
Is php 5 required for simple_html_dom – says on the website that it's php4 and there is no mention of minimum requirements but I'm getting the same parse error as popescu is getting. It works on my local test server php5 but not on production php4
February 2, 2009 at 7:08 AM
Woooops..nevermind. I read it wrong. It does say php5 required.
Go on about your normal daily business.
February 3, 2009 at 1:24 PM
I see that "file_get_html('http://www.yahoo.com/')->plaintext;"
returns one big plain string. Is there a way I can get an array of strings instead?
March 26, 2009 at 10:56 PM
How do I scrape a page with a charset other than UTF-8? In my case I want to scrape a page with charset=windows-1257, but in the results I get unknown symbols for some non-Latin letters.
March 28, 2009 at 10:58 AM
OK, I found the solution: the iconv function can solve the problem. http://uk.php.net/manual/en/function.iconv.php
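The iconv approach mentioned above can be sketched as follows. The helper name to_utf8 is hypothetical, and the two sample bytes stand in for a downloaded page body. After conversion, the UTF-8 string could be handed to str_get_html as usual.

```php
<?php
// Hypothetical helper: converts text from a known source charset to UTF-8.
function to_utf8($bytes, $sourceCharset)
{
    // iconv() returns false on failure; fall back to the untouched input.
    $converted = @iconv($sourceCharset, 'UTF-8', $bytes);
    return $converted === false ? $bytes : $converted;
}

// "\xF0" and "\xFE" are the Windows-1257 bytes for "š" and "ž".
$raw  = "\xF0\xFE";
$utf8 = to_utf8($raw, 'WINDOWS-1257');

echo $utf8, "\n";
```

The source charset has to be known in advance (for example from the page's Content-Type header or meta tag). This sketch does not try to detect it.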
April 25, 2009 at 3:54 PM
I cannot run file_get_html. What are the requirements for this HTML DOM parser?
April 26, 2009 at 8:54 PM
PHP5+ is required for the PHP HTML Dom Parser.
May 13, 2009 at 5:07 PM
Is there an easy way to get more than 10 results from Google? Let's say you want 50. The URL would get &start=10 for page 2, &start=20 for page 3, etc.
February 11, 2010 at 12:45 PM
Add &num=50 to the Google URL, or any number you want between 1 and 100.
March 12, 2010 at 10:13 AM
a simple array will generate 1-1000
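The two suggestions above (num for page size, start for the offset) can be combined when building the URLs. This is a minimal sketch: the helper name build_search_urls is hypothetical, and the parameter behaviour is taken from the comments in this thread, so it may not match what Google accepts today.

```php
<?php
// Hypothetical helper: builds one search URL per page of results.
// The 'num' and 'start' parameters are assumptions based on the comments above.
function build_search_urls($keyword, $perPage, $pages)
{
    $urls = array();
    for ($page = 0; $page < $pages; $page++) {
        $query = http_build_query(array(
            'hl'    => 'en',
            'q'     => $keyword,
            'num'   => $perPage,
            'start' => $page * $perPage, // offset of the first result on this page
        ), '', '&');
        $urls[] = 'http://www.google.com/search?' . $query;
    }
    return $urls;
}

foreach (build_search_urls('php', 10, 3) as $url) {
    echo $url, "\n";
}
```

Each URL could then be passed to file_get_html in turn, ideally with a pause between requests.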
July 20, 2009 at 5:03 PM
This looks great. I just had a go but couldn't really get it working :(
I would really appreciate some help.
What I'm trying to do is parse some information within a particular div with all tags intact. E.g. grab everything as it is within the
The particular div has a table in it. I want to copy it all intact. I don't think I need a for loop, as I'm not repeating the search. I just want to copy the contents of the first id test it hits (there is only one on the page anyway).
Thank you for your help, really appreciate it :)
fl3x7
July 20, 2009 at 5:06 PM
Ahhh sorry, your comments box seems to have stripped some of the code I put in :(
September 14, 2009 at 9:33 AM
This plugin is just brilliant, so easy to sort out large amounts of data :D <3 this plugin!
November 16, 2009 at 8:29 PM
What if I want to find a tag with multiple attributes? Let's say that I have a table with width, colspan, cellpadding, and border attributes, and I want to find that particular table. In XPath this can be done with a conjunction. What is the conjunction in simple_html_dom?
January 26, 2010 at 9:15 PM
I can’t use it, because to read another URL I need cURL. Can I insert it in the file?
February 9, 2010 at 2:59 PM
This DOM creator is fantastic! I just spent the last several hours messing around with very complex regex, curl, and preg_match code trying to accomplish only 1/2 of what I was able to accomplish with your script in 2 minutes! Thank you so much!
August 8, 2010 at 8:05 PM
this parser is excellent.
10/10 for this parser
August 14, 2010 at 11:08 AM
I noticed you had ways of getting data from div, a, and img. Is there a way to get data from an element that has an id associated with it?
October 22, 2010 at 9:42 AM
Does this work?
$html->find('div[id=EnterIdHere]', 0);
December 21, 2010 at 10:09 PM
I can’t get past the example code for scraping slashdot.org. It gives errors, then appears to display data. I don't understand why it complains with errors and then acts like it works and displays data.
Notice: Trying to get property of non-object in C:\Users\owner\websites\testcode\simplehtmldom\example\scraping\example_scraping_slashdot.php on line 11
Notice: Trying to get property of non-object in C:\Users\owner\websites\testcode\simplehtmldom\example\scraping\example_scraping_slashdot.php on line 13
December 21, 2010 at 10:48 PM
I’ve tested the same code, and indeed, it doesn’t work properly. That code was written back in 2008. Meanwhile, the http://slashdot.org website changed its HTML structure. You get those notices because the data can’t be extracted properly.
I have changed the scraping_slashdot() function to parse correctly with the new site. Here it is:
function scraping_slashdot() {
    // create HTML DOM
    $html = file_get_html('http://slashdot.org/');
    // get article block
    foreach($html->find('div[id^=firehose-]') as $article) {
        if(isSet($article->find('a.datitle', 0)->plaintext)) {
            if(isSet($article->find('a.datitle', 0)->plaintext)) {
                // get title
                $item['title'] = trim($article->find('a.datitle', 0)->plaintext);
            }
            if(isSet($article->find('div[id^=text-]', 0)->plaintext)) {
                // get body
                $item['body'] = trim($article->find('div[id^=text-]', 0)->plaintext);
            } else {
                $item['body'] = 'No Information';
            }
            $ret[] = $item;
        }
    }
    // clean up memory
    $html->clear();
    unset($html);
    return $ret;
}
December 22, 2010 at 3:29 AM
I removed the second check you do for a.datitle
function scraping_slashdot() {
    // create HTML DOM
    $html = file_get_html('http://slashdot.org/');
    // start with an empty list so an array is returned even when nothing matches
    $ret = array();
    // get article block
    foreach($html->find('div[id^=firehose-]') as $article) {
        if(isSet($article->find('a.datitle', 0)->plaintext)) {
            // start a fresh item for each article
            $item = array();
            // get title
            $item['title'] = trim($article->find('a.datitle', 0)->plaintext);
            if(isSet($article->find('div[id^=text-]', 0)->plaintext)) {
                // get body
                $item['body'] = trim($article->find('div[id^=text-]', 0)->plaintext);
            } else {
                $item['body'] = 'No Information';
            }
            $ret[] = $item;
        }
    }
    // clean up memory
    $html->clear();
    unset($html);
    return $ret;
}
December 22, 2010 at 8:34 AM
I edited the above just to play with this scraping stuff. I changed the site to my own and want to check for the div with id=main and return all the elements with the class pagetext. The errors were:
Notice: Undefined variable: ret in C:\Users\owner\websites\testcode\simplehtmldom\example\scraping\scraping_slashdot.php on line 26
Warning: Invalid argument supplied for foreach() in C:\Users\owner\websites\testcode\simplehtmldom\example\scraping\scraping_slashdot.php on line 33
The code is:
find('#main]') as $article) {
if(isSet($article->find('.pagetext', 0)->plaintext)) {
// get title
$item['body'] = trim($article->find('.pagetext', 0)->plaintext);
} else {
$item['body'] = 'No Information';
}
$ret[] = $item;
}
// clean up memory
$html->clear();
unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$ret = scraping_slashdot();
foreach($ret as $v) {
echo '';
echo ''.$v['body'].'';
echo '';
}
?>
Why is $ret coming back empty?
December 22, 2010 at 8:56 AM
I believe it may have something to do with the fact that may not have plaintext. check for plaintext in is coming up false. This may be it.
So how does one scrape the text in between tags?
April 3, 2011 at 6:32 PM
I used this in a project recently. I especially like the jQuery-style selectors. This library was much needed. Before, I had been using XML parsers, but they aren’t suited for the HTML DOM.
@ Gabriel – thanks for rewriting the slashdot scraper!
July 22, 2011 at 7:22 PM
I can’t understand how I can fetch my data if it has a div-based structure. Can you guide me, please?
July 22, 2011 at 7:26 PM
Check out the following page: http://simplehtmldom.sourceforge.net/
Click on the “Scraping Slashdot!” tab and you will see an example of scraping DIVs.
September 22, 2011 at 11:25 AM
On:
$url = 'http://www.google.com/search?hl=en&q=php&btnG=Search';
// Create DOM from URL
$html = file_get_html($url);
// Match all 'a' tags whose class attribute is equal to 'l'
foreach($html->find('a[class=l]') as $key => $info)
{
    echo ($key + 1).'. '.$info->plaintext."<br />\n";
}
Can someone give me a hint on how I can output the data from each table cell?
Example:
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td>Data 1</td>
<td>Data 2</td>
<td>Data 3</td>
<td>Data 4</td>
</tr>
</table>
Thanks in advance.
January 26, 2012 at 8:02 AM
I am trying to scrape a page with simple_html_dom.php but have run into a problem. I am looking for an HTML tag, but the page only has an opening tag on some of the elements to be scraped. E.g.
<p class="blue"> blaa blaa blaa<p>
<p class="blue">hey hey hey
<p class="blue">ha ha ha<p>
Note the missing p tag on the second element. How can I scrape this? I cannot change the HTML.