[RESOLVED] Getting What I Need With "new DOMDocument"
Hello to all. I'm creating a miniature sports site where I'm referencing some schedule information from a professional sports team. I'm trying to grab some HTML from another site and I'm using cURL to do it. I've successfully grabbed the HTML and created a DOM instance as well as successfully loaded the HTML using:
PHP Code:
$bears_games = new DOMDocument();
$bears_games->loadHTML($bears_html);
The problem I'm having is that certain <div> tags that I'm targeting are coming up as NULL. I'm referencing certain IDs like "element1" and "element6" and I'm getting nothing. I suspect I'm not referencing what I want correctly. I do know that what is returned is an object, but I'm missing something. Any suggestions?
If the tags have ID attributes, then I'd suggest using getElementById(), which will directly get that node instead of giving you a node list.
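Something along these lines, as a minimal sketch. It assumes the cURL result is already in $bears_html; the sample markup and text below are made up for illustration and are not taken from the real page.

```php
<?php
// Hypothetical stand-in for the HTML string returned by cURL.
$bears_html = '<html><body>'
    . '<div id="element1">Sample game 1</div>'
    . '<div id="element6">Sample game 6</div>'
    . '</body></html>';

$bears_games = new DOMDocument();
$bears_games->loadHTML($bears_html);

// getElementById() returns a single DOMElement, or null if no element has that ID.
$node = $bears_games->getElementById('element1');
$first_game = ($node !== null) ? trim($node->textContent) : null;

echo $first_game, "\n";
```

Checking for null before touching textContent avoids a fatal error when the ID is missing from the page.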
"Please give us a simple answer, so that we don't have to think, because if we think, we might find answers that don't fit the way we want the world to be."
~ Terry Pratchett in Nation
Ooh...I don't think you want that htmlspecialchars() call on the cURL result, as you will no longer have any [x]HTML mark-up if you do that.
This is the string that I'm loading into the DOM variable. This line
PHP Code:
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
tells cURL to return the page as a string instead of "outputting" the page to the browser. From my research, I read that I needed to use htmlspecialchars() with this because it would return predefined characters for < and > instead of the HTML entities themselves. Looking at the page link above, is it simply too much to load and too hard to navigate through in a DOM instance?
I'm 99.99999% sure you do not want htmlspecialchars there. The cURL call will get exactly the same thing that would be sent to your browser, which would be the raw HTML text, where tags would be enclosed in literal "<" and ">" characters. If you apply htmlspecialchars() to it then this...
Code:
<div id="foo">
...would become this...
Code:
&lt;div id=&quot;foo&quot;&gt;
...which ain't gonna be parsed as a DIV tag.
PS: You would only want to apply htmlspecialchars() if/when you wanted to output the result to the browser in a way that would show the actual mark-up (such as within a <pre> element if you were demonstrating something about the HTML mark-up of that page).
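To see the difference for yourself, here is a small sketch that loads the same snippet both ways and counts the <div> elements the parser finds. The sample string and the count_divs() helper are made up for illustration.

```php
<?php
// Hypothetical sample of what cURL hands back: raw markup with literal < and >.
$raw = '<div id="foo">Kickoff</div>';

// What ends up being parsed if htmlspecialchars() is applied first.
$escaped = htmlspecialchars($raw);

// Illustrative helper: parse a string and count the <div> elements found.
function count_divs(string $html): int
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    return $doc->getElementsByTagName('div')->length;
}

$raw_divs = count_divs($raw);
$escaped_divs = count_divs($escaped);

echo $escaped, "\n";
echo "raw: $raw_divs, escaped: $escaped_divs\n";
```

The escaped string is just text as far as the parser is concerned, so there is no DIV node to find in it.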
What you are saying makes perfect sense and it's right because I'm getting something now, but for some reason I'm getting an error when I remove the htmlspecialchars().
PHP Code:
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 1221 in /home/isaiahb/public_html/game_schedules_bears.php on line 10
Here's the full line of code that I'm tampering with:
This is weird because I wasn't getting anything when I echoed the nodeValue, but when I removed htmlspecialchars I get something even though it's not what I want. If you click on the link you'll see what I mean.
Sounds like the page you're accessing is malformed -- one of the perils of screen-scraping. You could try to repair it on the fly with the PHP Tidy extension.
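A rough sketch of what that could look like, assuming the Tidy extension is installed. The repair_html() helper name, the config options chosen, and the sample markup are my own assumptions, not anything from the page in question.

```php
<?php
// Hypothetical helper: run the markup through Tidy when the extension is
// available, otherwise hand back the original string untouched.
function repair_html(string $html): string
{
    if (!function_exists('tidy_repair_string')) {
        return $html;
    }
    // 'output-xhtml' and 'wrap' are standard Tidy options; adjust to taste.
    $config = ['output-xhtml' => true, 'wrap' => 0];
    $repaired = tidy_repair_string($html, $config, 'utf8');
    return ($repaired === false) ? $html : $repaired;
}

// Made-up sample with an unescaped "&" in a URL, a common cause of entity warnings.
$bears_html = '<div id="element1"><a href="?team=bears&week=1">Week 1</a></div>';
$clean_html = repair_html($bears_html);

$bears_games = new DOMDocument();
$bears_games->loadHTML($clean_html);
```

Tidy cannot fix everything, so it is still worth checking that the nodes you expect are actually there after loading.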
You can add this before loading the document. It will allow DOMDocument to continue parsing through malformed markup in the same fashion as browsers that error correct.
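The snippet itself is not shown in this post. Going by the function named in the follow-up reply, it was presumably a call to libxml_use_internal_errors(), roughly like this sketch. The sample markup is made up, and the error inspection at the end is an optional extra of mine.

```php
<?php
// Made-up sample: an unescaped "&" in a URL is the kind of input that can
// trigger an htmlParseEntityRef warning.
$bears_html = '<div id="element1"><a href="?team=bears&week=1">Week 1</a></div>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$bears_games = new DOMDocument();
$bears_games->loadHTML($bears_html);

// Optional: look at what the parser complained about, then clear the buffer
// and restore the previous error-handling setting.
$parse_errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$node = $bears_games->getElementById('element1');
echo count($parse_errors), " parse problem(s) collected\n";
```

How many problems get collected depends on the libxml version, so don't rely on an exact count.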
The combination of getting rid of htmlspecialchars(), passing the HTML string through PHP tidy, and using libxml_use_internal_errors() allowed me to use DOMXPath to grab the class I need and traverse to get the node values. Thanks again.
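For anyone who finds this later, the DOMXPath part looked roughly like the sketch below. The markup and the "game" class name are placeholders, not the real page's structure, and the Tidy step is left out to keep it short.

```php
<?php
// Placeholder markup; the class name "game" is hypothetical.
$bears_html = '<div class="schedule">'
    . '<div class="game">Sample game 1</div>'
    . '<div class="game away">Sample game 2</div>'
    . '<a href="?team=bears&week=1">details</a>'
    . '</div>';

// Keep parser complaints about malformed markup out of the output.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($bears_html);
libxml_clear_errors();

// Match any div whose class list contains "game" as a whole word.
$xpath = new DOMXPath($doc);
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " game ")]';

$games = [];
foreach ($xpath->query($query) as $node) {
    $games[] = trim($node->nodeValue);
}

print_r($games);
```

The concat/normalize-space trick is there so that a class like "game away" still matches while something like "pregame" does not.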