www.webdeveloper.com
Thread: [RESOLVED] Getting What I Need With "new DOMDocument"

  1. #1
    Join Date
    Sep 2008
    Posts
    260

    Hello to all. I'm creating a miniature sports site where I'm referencing some schedule information from a professional sports team. I'm trying to grab some HTML from another site and I'm using cURL to do it. I've successfully grabbed the HTML and created a DOM instance as well as successfully loaded the HTML using:
    PHP Code:
    $bears_games = new DOMDocument();
    $bears_games->loadHTML($bears_html); 
    The problem I'm having is that certain <div> tags that I'm targeting are coming up as NULL. I'm referencing certain IDs like "element1" and "element6" and I'm getting nothing. I'm thinking that I'm not referencing what I want correctly. I do know that what is returned is an object, but I'm missing something. Any suggestions?

    Here's my full cURL code:

    PHP Code:
    $ch = curl_init('http://www.chicagobears.com/gameday/schedule.html');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    $bears_html = htmlspecialchars(curl_exec($ch));
            
    $bears_games = new DOMDocument();
    $bears_games->loadHTML($bears_html);

    var_dump($bears_games->getElementsByTagName('div')->item(0));
    curl_close($ch); 

  2. #2
    Join Date
    Aug 2004
    Location
    Ankh-Morpork
    Posts
    19,221
    If the tags have ID attributes, then I'd suggest using getElementById(), which will directly get that node instead of giving you a node list.
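A minimal sketch of that suggestion, assuming the markup has been loaded with loadHTML() so that id attributes are recognised as IDs. The IDs "element1" and "element6" come from the question; the markup and the "Week" text are made up for illustration.

```php
<?php
// Hypothetical stand-in for the scraped page.
$html = '<html><body>'
      . '<div id="element1">Week 1</div>'
      . '<div id="element6">Week 6</div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// getElementById() returns a single DOMElement, or null when no element has that id.
$node = $doc->getElementById('element6');

echo $node !== null ? $node->nodeValue : 'not found', "\n";
```

By contrast, getElementsByTagName('div') returns a node list that has to be indexed with item().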
    "Please give us a simple answer, so that we don't have to think, because if we think, we might find answers that don't fit the way we want the world to be."
    ~ Terry Pratchett in Nation

    eBookworm.us

  3. #3
    Join Date
    Sep 2008
    Posts
    260
    Thanks for your reply NogDog,

    That's what I did and I'm getting a NULL value.

  4. #4
    Join Date
    Aug 2004
    Location
    Ankh-Morpork
    Posts
    19,221
    Ooh...I don't think you want that htmlspecialchars() call on the cURL result, as you will no longer have any [x]HTML mark-up if you do that.

  5. #5
    Join Date
    Sep 2008
    Posts
    260
    Take a look at this link http://sociallyaffluent.com/game_schedules_bears.php

    This is the string that I'm loading into the DOM variable. This line
    PHP Code:
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    tells cURL to return the page as a string instead of "outputting" the page to the browser. From my research, I read that I needed to use htmlspecialchars() with this because it would return predefined characters for < and > instead of the HTML entities themselves. When looking at the page link above, is it that this is too much to load and too hard to navigate through in a DOM instance?

  6. #6
    Join Date
    Aug 2004
    Location
    Ankh-Morpork
    Posts
    19,221
    I'm 99.99999% sure you do not want htmlspecialchars there. The cURL call will get exactly the same thing that would be sent to your browser, which would be the raw HTML text, where tags would be enclosed in literal "<" and ">" characters. If you apply htmlspecialchars() to it then this...
    Code:
    <div id="foo">
    ...would become this...
    Code:
    &lt;div id="foo"&gt;
    ...which ain't gonna be parsed as a DIV tag.

    PS: You would only want to apply htmlspecialchars() if/when you wanted to output the result to the browser in a way that would show the actual mark-up (such as within a <pre> element if you were demonstrating something about the HTML mark-up of that page).
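The effect described above can be checked with a small sketch. The markup here is a made-up stand-in for the cURL result, and the variable names are only illustrative.

```php
<?php
// Hypothetical stand-in for what curl_exec() returns.
$html = '<html><body><div id="foo">Week 1</div></body></html>';

// Raw markup: the parser sees a real <div> element.
$raw = new DOMDocument();
$raw->loadHTML($html);
$raw_count = $raw->getElementsByTagName('div')->length;

// Escaped markup: "<" and ">" have become "&lt;" and "&gt;",
// so the parser sees plain text and no elements from the original page.
$escaped = new DOMDocument();
$escaped->loadHTML(htmlspecialchars($html));
$escaped_count = $escaped->getElementsByTagName('div')->length;

var_dump($raw_count);     // int(1)
var_dump($escaped_count); // int(0)
```

This is why every lookup on the escaped string came back as NULL: there were no div nodes in the document at all.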

  7. #7
    Join Date
    Sep 2008
    Posts
    260
    What you are saying makes perfect sense and it's right because I'm getting something now, but for some reason I'm getting an error when I remove the htmlspecialchars().

    PHP Code:
    Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 1221 in /home/isaiahb/public_html/game_schedules_bears.php on line 10
    Here's the full line of code that I'm tampering with:

    PHP Code:

    <?php

    $ch = curl_init('http://www.chicagobears.com/gameday/schedule.html');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    $bears_html = curl_exec($ch);

    $bears_games = new DOMDocument();
    $bears_games->preserveWhiteSpace = false;
    $bears_games->loadHTML($bears_html);

    echo $bears_games->getElementsByTagName('div')->item(0)->nodeValue;

    curl_close($ch);

    ?>
    This is weird because I wasn't getting anything when I echoed the nodeValue, but when I removed htmlspecialchars I get something even though it's not what I want. If you click on the link you'll see what I mean.
    Last edited by ChuckB; 10-12-2012 at 03:49 PM.

  8. #8
    Join Date
    Aug 2004
    Location
    Ankh-Morpork
    Posts
    19,221
    Sounds like the page you're accessing is malformed -- one of the perils of screen-scraping. You could try to repair it on the fly with the PHP Tidy extension.

  9. #9
    Join Date
    Jul 2003
    Location
    The City of Roses
    Posts
    2,503
    You can add this before loading the document. It will allow DOMDocument to continue parsing through malformed markup in the same fashion as browsers that error correct.

    libxml_use_internal_errors(true);
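A rough sketch of how that call might be used, assuming a page with an unescaped "&" in a link, which is the kind of markup that usually produces the htmlParseEntityRef warning on older libxml versions. The markup is invented for illustration, and the errors reported (if any) depend on the installed libxml version.

```php
<?php
// Collect libxml parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical malformed markup: the "&" in the URL is not written as "&amp;".
$html = '<html><body><div id="element1">'
      . '<a href="schedule.html?week=1&team=bears">Bears at Packers</a>'
      . '</div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Inspect whatever the parser complained about, then empty the error buffer.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
libxml_clear_errors();

// The document is still usable despite the malformed input.
$text = $doc->getElementById('element1')->nodeValue;
echo $text, "\n";
```

Clearing the buffer with libxml_clear_errors() keeps old errors from piling up when several documents are parsed in one request.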

  10. #10
    Join Date
    Sep 2008
    Posts
    260
    Thanks a lot NogDog and Jeff Mott. It's solved.

    The combination of getting rid of htmlspecialchars(), passing the HTML string through PHP tidy, and using libxml_use_internal_errors() allowed me to use DOMXPath to grab the class I need and traverse to get the node values. Thanks again.
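The final code was not posted, so the following is only a sketch of how the three pieces could fit together. The inline markup stands in for the cURL result, and the class name "game" and the span structure are assumptions, not the real page's markup. The Tidy step is skipped when the extension is not installed.

```php
<?php
// Hypothetical stand-in for the string returned by curl_exec(), with no htmlspecialchars().
// The second <span> is deliberately left unclosed.
$bears_html = '<html><body><div class="game">'
            . '<span>Week 1</span><span>Packers'
            . '</div></body></html>';

// Optional repair step with the Tidy extension.
if (function_exists('tidy_repair_string')) {
    $repaired = tidy_repair_string($bears_html, [], 'utf8');
    if ($repaired !== false) {
        $bears_html = $repaired;
    }
}

// Let the parser carry on past any remaining malformed markup without warnings.
libxml_use_internal_errors(true);
$bears_games = new DOMDocument();
$bears_games->loadHTML($bears_html);
libxml_clear_errors();

// Select every <div> whose class attribute contains the token "game".
$xpath = new DOMXPath($bears_games);
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' game ')]";

$values = [];
foreach ($xpath->query($query) as $game) {
    foreach ($game->getElementsByTagName('span') as $cell) {
        $values[] = trim($cell->nodeValue);
    }
}

print_r($values);
```

The concat/normalize-space pattern matches a class token exactly, so a div with class "game away" matches but one with class "gameday" does not.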
