[RESOLVED] Getting What I Need With "new DOMDocument"
Hello to all. I'm creating a miniature sports site where I'm referencing some schedule information from a professional sports team. I'm trying to grab some HTML from another site and I'm using cURL to do it. I've successfully grabbed the HTML and created a DOM instance as well as successfully loaded the HTML using:
$bears_games = new DOMDocument();
The problem I'm having is that certain <div> tags I'm targeting are coming back as NULL. I'm referencing ids like "element1" and "element6" and getting nothing, so I suspect I'm not referencing what I want correctly. I do know that what is returned is an object, but I'm missing something. Any suggestions?
This is the string that I'm loading into the DOM variable. This line
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
tells cURL to return the page as a string instead of "outputting" the page to the browser. From my research, I read that I needed to use htmlspecialchars() with this because it would return predefined characters for < and > instead of the HTML entities themselves. Looking at the page link above, is it that this is too much to load and too hard to navigate through in a DOM instance?
I'm 99.99999% sure you do not want htmlspecialchars there. The cURL call will get exactly the same thing that would be sent to your browser, which would be the raw HTML text, where tags would be enclosed in literal "<" and ">" characters. If you apply htmlspecialchars() to it then this...
...would become this...
...which ain't gonna be parsed as a DIV tag.
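Here is a minimal sketch of that effect. The markup is a made-up stand-in for the fetched page, and find_by_id() is a hypothetical helper written only for this illustration; it uses an XPath lookup on the id attribute.

```php
<?php
// Made-up markup standing in for the fetched page (not the real site's HTML).
$raw = '<html><body><div id="element1">Week 1</div></body></html>';

// htmlspecialchars() turns the markup characters into entities.
$escaped = htmlspecialchars($raw);

// Hypothetical helper: parse a string as HTML and look up an element by its id.
function find_by_id(string $html, string $id): ?DOMNode
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // $id is assumed to contain no double quotes.
    return $xpath->query('//*[@id="' . $id . '"]')->item(0);
}

echo $escaped, PHP_EOL;                               // the tags now show up as entities
var_dump(find_by_id($raw, 'element1') !== null);      // raw markup: the div is found
var_dump(find_by_id($escaped, 'element1') === null);  // escaped text: no div to find
```

With the escaped string, the parser only sees one long run of text, so any lookup by id comes back NULL.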
PS: You would only want to apply htmlspecialchars() if/when you wanted to output the result to the browser in a way that would show the actual mark-up (such as within a <pre> element if you were demonstrating something about the HTML mark-up of that page).
This is weird: I wasn't getting anything when I echoed the nodeValue, but once I removed htmlspecialchars() I got something, even though it's not what I want. If you click on the link you'll see what I mean.
The combination of getting rid of htmlspecialchars(), passing the HTML string through PHP tidy, and using libxml_use_internal_errors() allowed me to use DOMXPath to grab the class I need and traverse to get the node values. Thanks again.
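For anyone who finds this thread later, a rough sketch of that combination follows. It is not the exact code from my site: the markup, the "game" class name, and the node_values_by_class() helper are all placeholders, and the tidy step only runs if the tidy extension happens to be loaded.

```php
<?php
// Placeholder markup with an unclosed <div>, standing in for the fetched page.
$html = '<div class="game">Week 1: Opponent A</div>'
      . '<div class="note">Bye week</div>'
      . '<div class="game featured">Week 2: Opponent B';

// Hypothetical helper: return the text of every element carrying a given class.
function node_values_by_class(string $html, string $class): array
{
    // Optional clean-up pass through tidy, skipped when the extension is missing.
    if (function_exists('tidy_repair_string')) {
        $tidied = tidy_repair_string($html, ['output-html' => true, 'wrap' => 0], 'utf8');
        if (is_string($tidied) && $tidied !== '') {
            $html = $tidied;
        }
    }

    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Match the class as a whole word inside the class attribute.
    // $class is assumed to contain no double quotes.
    $xpath = new DOMXPath($doc);
    $query = '//*[contains(concat(" ", normalize-space(@class), " "), " ' . $class . ' ")]';

    $values = [];
    foreach ($xpath->query($query) as $node) {
        // Collapse whitespace so line breaks added by tidy do not matter.
        $values[] = trim(preg_replace('/\s+/', ' ', $node->nodeValue));
    }
    return $values;
}

$values = node_values_by_class($html, 'game');
print_r($values);
```

The concat/normalize-space trick is there so that "game" matches class="game featured" but would not match something like class="gameday".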