Why Does cURL Fail to Extract the Title Tag?

@developer_web, May 09, 2020

PHP gurus,

Why does the following web crawler code always manage to grab the title of 1.php but not 2.php or 10.php?
Look at the image below, under the columns path, title, description, h1, referer, download_time.
The crawler only writes to these columns in MySQL.

[upl-image-preview url=https://www.webdeveloper.com/assets/files/2020-05-09/1589066222-481332-crawler-5.png]

Can you see that it managed to fetch the title of 1.php but not 2.php or 10.php? Why is that?
1.php looks like this:

[code]
<html>
<head>
<title> 1st Page Title </title>
<head>
<meta charset="UTF-8">
<meta name="description" content="Page 1 description is this">
<meta name="keywords" content="page, 1, one">
<meta name="author" content="Mr 1">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
This is page <a href="2.php"> 2</a> or <a href="2.php">two</a>.
This is page <a href="10.php"> 10</a> or <a href="10.php">ten</a>.
</body>
</html>
[/code]

2.php looks like this:

[code]
<html>
<head>
<title> 2nd Page Title </title>
<head>
<meta charset="UTF-8">
<meta name="description" content="Page 2 description is this">
<meta name="keywords" content="page, 2, two">
<meta name="author" content="Mr 2">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
This is page <a href="3.php"> 3</a> or <a href="3.php">three</a>.
This is page <a href="20.php"> 20</a> or <a href="20.php">twenty</a>.
</body>
</html>
[/code]

10.php looks like this:

[code]
<html>
<head>
<title> 10th Page Title </title>
<head>
<meta charset="UTF-8">
<meta name="description" content="Page 10 description is this">
<meta name="keywords" content="page, 10, ten">
<meta name="author" content="Mr 10">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
This is page <a href="20.php"> 20</a> or <a href="20.php">twenty</a>.
This is page <a href="2.php"> 2</a> or <a href="2.php">two</a>.
</body>
</html>
[/code]

As you can see, all 3 files have the same structure, and each one has a title tag.

Full Script:

[code]
<?php

//https://potentpages.com/web-crawler-development/tutorials/php/creating-a-simple-php-website-crawler
$mysql_host = 'localhost';
$mysql_username = 'root';
$mysql_password = '';
$mysql_database = 'powerpage';
$mysql_conn = mysqli_connect( $mysql_host, $mysql_username, $mysql_password, $mysql_database );
if ( !$mysql_conn ) {
    echo "Error: Unable to connect to MySQL." . PHP_EOL;
    echo "Debugging errno: " . mysqli_connect_errno() . PHP_EOL;
    echo "Debugging error: " . mysqli_connect_error() . PHP_EOL;
    exit;
}

/**
 * Download a Webpage via the HTTP GET Protocol using libcurl
 * Developed By: Potent Pages, LLC (https://potentpages.com/)
 */
function _http ( $target, $referer ) {
    //Initialize Handle
    $handle = curl_init();
    //Define Settings
    curl_setopt ( $handle, CURLOPT_HTTPGET, true );
    curl_setopt ( $handle, CURLOPT_HEADER, true );
    curl_setopt ( $handle, CURLOPT_COOKIEJAR, "cookie_jar.txt" );
    curl_setopt ( $handle, CURLOPT_COOKIEFILE, "cookies.txt" );
    curl_setopt ( $handle, CURLOPT_USERAGENT, "web-crawler-tutorial-test" );
    curl_setopt ( $handle, CURLOPT_URL, $target );
    curl_setopt ( $handle, CURLOPT_REFERER, $referer );
    curl_setopt ( $handle, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt ( $handle, CURLOPT_MAXREDIRS, 4 );
    curl_setopt ( $handle, CURLOPT_RETURNTRANSFER, true );
    //Execute Request
    $output = curl_exec ( $handle );
    //Close cURL handle
    curl_close ( $handle );
    //Separate Header and Body
    $separator = "\r\n\r\n";
    $header = substr( $output, 0, strpos( $output, $separator ) );
    $body_start = strlen( $header ) + strlen( $separator );
    $body = substr( $output, $body_start, strlen( $output ) - $body_start );
    //Parse Headers
    $header_array = Array();
    foreach ( explode ( "\r\n", $header ) as $i => $line ) {
        if($i === 0) {
            $header_array['http_code'] = $line;
            $status_info = explode( " ", $line );
            $header_array['status_info'] = $status_info;
        } else {
            list ( $key, $value ) = explode ( ': ', $line );
            $header_array[$key] = $value;
        }
    }
    //Form Return Structure
    $ret = Array("headers" => $header_array, "body" => $body );
    return $ret;
}

/**
 * Convert Relative to Absolute URL
 * Developed By: Potent Pages, LLC (https://potentpages.com/)
 * From: https://potentpages.com/web-crawler-development/tutorials/php/simple-php-web-spider
 * Based On: https://stackoverflow.com/questions/4444475/transfrom-relative-path-into-absolute-url-using-php
 */
function relativeToAbsolute( $relative, $base ) {
    if($relative == "" || $base == "") return "";
    //Check Base
    $base_parsed = parse_url($base);
    if( !array_key_exists( 'scheme', $base_parsed ) || !array_key_exists( 'host', $base_parsed ) || !array_key_exists( 'path', $base_parsed ) ) {
        echo "Base Path \"$base\" Not Absolute Link\n";
        return "";
    }
    //Parse Relative
    $relative_parsed = parse_url($relative);
    //If relative URL already has a scheme, it's already absolute
    if( array_key_exists( 'scheme', $relative_parsed ) && $relative_parsed['scheme'] != '' ) {
        return $relative;
    }
    //If only a query or a fragment, return base (without any fragment or query) + relative
    if( !array_key_exists( 'scheme', $relative_parsed ) && !array_key_exists( 'host', $relative_parsed ) && !array_key_exists( 'path', $relative_parsed ) ) {
        return $base_parsed['scheme']. '://'. $base_parsed['host']. $base_parsed['path']. $relative;
    }
    //Remove non-directory portion from path
    $path = preg_replace( '#/[^/]*$#', '', $base_parsed['path'] );
    //If relative path already points to root, remove base return absolute path
    if( $relative[0] == '/' ) {
        $path = '';
    }
    //Working Absolute URL
    $abs = '';
    //If user in URL
    if( array_key_exists( 'user', $base_parsed ) ) {
        $abs .= $base_parsed['user'];
        //If password in URL as well
        if( array_key_exists( 'pass', $base_parsed ) ) {
            $abs .= ':'. $base_parsed['pass'];
        }
        //Append location prefix
        $abs .= '@';
    }
    //Append Host
    $abs .= $base_parsed['host'];
    //If port in URL
    if( array_key_exists( 'port', $base_parsed ) ) {
        $abs .= ':'. $base_parsed['port'];
    }
    //Append New Relative Path
    $abs .= $path. '/'. $relative;
    //Replace any '//' or '/./' or '/foo/../' with '/'
    $regex = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
    for( $n=1; $n>0; $abs = preg_replace( $regex, '/', $abs, -1, $n ) ) {}
    //Return Absolute URL
    return $base_parsed['scheme']. '://'. $abs;
}

function parsePage( $target, $referer ) {
    global $mysql_conn;
    //Parse URL and get Components
    $url_components = parse_url( $target );
    if($url_components === false) {
        die( 'Unable to Parse URL' );
    }
    $url_host = $url_components['host'];
    $url_path = '';
    if(array_key_exists( 'path', $url_components ) == false) {
        //If not a valid path, mark as done
        $query = "INSERT INTO pages (path, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $target ). "\", NOW()) ON DUPLICATE KEY UPDATE download_time=NOW()";
        if( !mysqli_query($mysql_conn, $query) ) {
            echo "Line 128";//MY OWN ADDED LINE
            die( "Error: Unable to perform Download Time Update Query (path)\n" );
        }
        return false;
    } else {
        $url_path = $url_components['path'];
    }
    //Download Page
    echo "Downloading: $target\n";
    $contents = _http ( $target, $referer );
    echo "Done\n";
    //Check Status
    if( $contents['headers']['status_info'][1] != 200 ) {
        //If not ok, mark as downloaded but skip
        $query = "INSERT INTO pages (path, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $url_path ). "\", NOW()) ON DUPLICATE KEY UPDATE download_time=NOW()";
        if( !mysqli_query($mysql_conn, $query) ) {
            echo "Line 144";//MY OWN ADDED LINE
            die( "Error: Unable to perform Download Time Update Query (http status)\n" );
        }
        return false;
    }
    //Parse Contents
    $doc = new DOMDocument();
    libxml_use_internal_errors( true );
    $doc->loadHTML( $contents['body'] );
    //Get title
    $title = '';
    $titleTags = $doc->getElementsByTagName('title');
    if( count( $titleTags ) > 0 ) {
        $title = mysqli_real_escape_string( $mysql_conn, $titleTags[0]->nodeValue );
    }
    //Get Description
    $description = '';
    $metaTags = $doc->getElementsByTagName('meta');
    foreach( $metaTags as $tag ) {
        if( $tag->getAttribute('name') == 'description' ) {
            $description = mysqli_real_escape_string( $mysql_conn, $tag->getAttribute( 'content' ) );
        }
    }
    //Get first h1
    $h1 = '';
    $h1Tags = $doc->getElementsByTagName('h1');
    if( count( $h1Tags ) > 0 ) {
        $h1 = mysqli_real_escape_string( $mysql_conn, $h1Tags[0]->nodeValue );
    }
    //Insert/Update Page Data
    $query = "INSERT INTO pages (path, title, description, h1, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $url_path ). "\", \"$title\", \"$description\", \"$h1\", NOW()) ON DUPLICATE KEY UPDATE title=\"$title\", description=\"$description\", h1=\"$h1\", download_time=NOW()";
    if( !mysqli_query($mysql_conn, $query) ) {
        echo "Line 176";//MY OWN ADDED LINE
        die( "Error: Unable to perform Insert Query\n" );
    }
    //Get Links
    $links = Array();
    $link_tags = $doc->getElementsByTagName( 'a' );
    foreach( $link_tags as $tag ) {
        if( ($href_value = $tag->getAttribute( 'href' ))) {
            $link_absolute = relativeToAbsolute( $href_value, $target );
            $link_parsed = parse_url( $link_absolute );
            if($link_parsed === null || $link_parsed === false) {
                die( 'Unable to Parse Link URL' );
            }
            if(( !array_key_exists( 'host', $link_parsed ) || $link_parsed['host'] == "" || $link_parsed['host'] == $url_host ) && array_key_exists( 'path', $link_parsed ) && $link_parsed['path'] != "" && array_search( $link_parsed['path'], $links ) === false) {
                $links[] = $link_parsed['path'];
            }
        }
    }
    //Insert Links
    foreach($links as $link) {
        $link_escaped = mysqli_real_escape_string( $mysql_conn, $link );
        $query = "INSERT IGNORE INTO pages (path, referer, download_time) VALUES (\"$link_escaped\", \"". mysqli_real_escape_string( $mysql_conn, $target ). "\", NULL)";
        if( !mysqli_query($mysql_conn, $query) ) {
            echo "Line 199";//MY OWN ADDED LINE
            die( "Error: Unable to perform Insert Link Value Query\n" );
        }
    }
    return true;
}

//Define Seed Settings
$seed_url = "http://localhost/test/crawler/5/1.php";
$seed_components = parse_url( $seed_url );
if($seed_components === false) {
    die( 'Unable to Seed Parse URL' );
}
$seed_scheme = $seed_components['scheme'];
$seed_host = $seed_components['host'];
$url_start = $seed_scheme. '://'. $seed_host;
//Download Seed URL
parsePage( $seed_url, "" );
//Loop through all pages on site.
while(1) {
    $counter = 0;
    $select_query = "SELECT * FROM pages WHERE download_time IS NULL";
    if(($select_result = mysqli_query( $mysql_conn, $select_query )) !== false) {
        if( ($rowCount = mysqli_num_rows($select_result)) > 0 ) {
            for( $i = 0; $i < $rowCount; $i++ ) {
                if( ( $row = mysqli_fetch_assoc($select_result) ) !== false ) {
                    $path = $row['path'];
                    $referer = $row['referer'];
                    //Check if first character isn't a '/'
                    if( $path[0] != '/' ) {
                        continue;
                    }
                    $path = $row['path'];
                    $referer = $row['referer'];
                    if( parsePage( $url_start. $path, $referer ) ) {
                        $counter++;
                    }
                    sleep(1);
                }
            }
        } else {
            break;
        }
    } else {
        die( "Unable to select un-downloaded pages\n" );
    }
    if($counter == 0) {
        break;
    }
}

?>
[/code]

11 Comments

@developer_web (author), May 10, 2020: Bump!

@Steve_R_Jones (moderator), May 10, 2020: developer_web, food for thought: while you are waiting for help, why don't you look to see if YOU can return the favor and help other people?

@NogDog, May 10, 2020: And try to narrow down where the problem is and only post that code. Most of us have neither the time nor inclination to read through hundreds of lines of code (for free). In fact, one way to debug is to try to write the smallest amount of code that still has the same problem -- at which point it may then already be obvious to you where the mistake is.
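One way to act on this advice is to test the title extraction on its own, with no cURL and no MySQL involved. The following is a minimal sketch, not a diagnosis: `extractTitle` is a hypothetical helper name, and `$html` is an abridged copy of the 2.php markup from the original post (assuming the file on disk matches what was posted).

```php
<?php
// Minimal reproduction sketch: parse one page's HTML and pull out the title.
// $html is an abridged copy of the posted 2.php markup (an assumption).
$html = <<<'HTML'
<html> <head> <title> 2nd Page Title </title> <head>
<meta charset="UTF-8">
<meta name="description" content="Page 2 description is this">
</head> <body> This is page <a href="3.php"> 3</a>. </body> </html>
HTML;

// Hypothetical helper: returns the trimmed text of the first <title>, or ''.
function extractTitle(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $titleTags = $doc->getElementsByTagName('title');
    if ($titleTags->length === 0) {
        return '';
    }
    return trim($titleTags->item(0)->nodeValue);
}

echo extractTitle($html), PHP_EOL;
```

If this prints the title for every test page, the parsing step is probably fine for that markup, and the next place to look is what the crawler actually downloads and what it writes to the table.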

@developer_web (author), May 11, 2020: @Steve_R_Jones#1618277

Lol! I'm not good enough to help others, as I might suggest buggy lines of code. Still learning. Consider me still at the beginner stage.

However, I may contribute code now and then when I'm sure the finished work is free of bugs.

@developer_web (author), May 11, 2020: @NogDog#1618299

If you look carefully, I always do give the snippet (one line or a few lines) where the trouble is.

But at the end of the post I also give the "context" (the long code), just in case someone is wondering why I wrote the snippet the way I did.

@developer_web (author), May 11, 2020: Fellows,

I got this web crawler code from a tutorial:

https://potentpages.com/web-crawler-development/tutorials/php/creating-a-simple-php-website-crawler

Running the code, I get no errors. It's just that the crawler sometimes manages to extract the title from the title tag and sometimes doesn't. Why is that? That was my big question.

Here is the bit of code that extracts the title:

[code]
//Parse Contents
$doc = new DOMDocument();
libxml_use_internal_errors( true );
$doc->loadHTML( $contents['body'] );
//Get title
$title = '';
$titleTags = $doc->getElementsByTagName('title');
if( count( $titleTags ) > 0 ) {
    $title = mysqli_real_escape_string( $mysql_conn, $titleTags[0]->nodeValue );
}
[/code]

Remember, the pages whose titles the crawler fails to extract do have titles. They are my own pages on localhost, and I showed their HTML in my original post so you can see for yourself that they contain the title tag.

If you want to look through the full crawler, check my original post.

Thanks!

@NogDog, May 11, 2020:

> @developer_web#1618314 $doc->loadHTML( $contents['body'] );

Start by testing if $doc is actually populated after that command. If it is false or empty, look at $contents['body'] to make sure it's actually populated.
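A minimal sketch of such a check, assuming PHP 7 or later. `inspectBody` is a hypothetical helper name, not part of the tutorial script. Note that `$doc` itself is always an object, so the things worth inspecting are the return value of `loadHTML()`, the parser errors libxml collected, and how many `title` elements were found.

```php
<?php
// Diagnostic sketch (hypothetical helper, not part of the tutorial script):
// report what DOMDocument actually made of a response body.
function inspectBody(string $body): array
{
    $report = [
        'bytes'      => strlen($body),
        'loaded'     => false,
        'titleCount' => 0,
        'errors'     => [],
    ];
    if ($body === '') {
        // loadHTML() rejects an empty string, so stop here.
        return $report;
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // buffer parser errors
    $report['loaded'] = $doc->loadHTML($body);
    foreach (libxml_get_errors() as $error) {
        $report['errors'][] = trim($error->message) . ' (line ' . $error->line . ')';
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the earlier setting

    if ($report['loaded']) {
        $report['titleCount'] = $doc->getElementsByTagName('title')->length;
    }
    return $report;
}

// Example: a body with a title, and an empty body.
print_r(inspectBody('<html><head><title>Test</title></head><body></body></html>'));
print_r(inspectBody(''));
```

In the crawler, a call such as `print_r(inspectBody($contents['body']));` placed just before the existing `loadHTML()` line would show, for each downloaded URL, whether the body is empty and whether a title was found.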

And as always, turn on all warnings at the top of your script while in debug mode.

[code]
<?php
ini_set('display_errors', true);
error_reporting(E_ALL);
[/code]

Debugging is one of the major activities for any software developer: you need to think about coding defensively and being suspicious of everything, and never assuming anything should work -- especially other people's code that you've copied. ;)

@developer_web (author), May 12, 2020: @NogDog#1618338

Tell me about copying other people's code from tutorials!

I had to fix 2 that had typos.

Some refer to functions but don't provide those functions in the tutorial code.

And so on. Even when commenters ask for the missing pieces, the authors don't reply. That's how they lose subscribers. Sloppy service to the public, wasting people's time.

I'll be mentioning some of these 'tutorial codes' in this forum when I hit a dead end trying to fix them.

I mean, look at this code ...

Do you see any typos at the end? ;)

http://timvanosch.blogspot.com/2013/02/php-tutorial-making-webcrawler.html

Remember, I need to learn to build a basic web crawler.

@developer_web (author), May 12, 2020: @NogDog#1618338

Just below this line:

[code]
$doc->loadHTML( $contents['body'] );
[/code]

I added:

[code]
echo "Line 157"; echo $contents['body']; echo "<br>";
[/code]

Guess what? It's populated, because it echoes the page content on my screen.

Remember, the pages I am running the crawler on are my own pages on localhost. I created populated pages for testing purposes, not blank ones.

This is a mystery now.

@developer_web (author), May 12, 2020: I echoed $contents['body'] and the page contents were printed. Originally, I said that I pointed the crawler at 1.php and it extracted that page's title but not the titles of the links it followed (2.php and 10.php). Now I pointed the crawler at 2.php (a page that failed before), and this time it extracted that page's title but failed to extract the titles of the links it found on 2.php and followed. It seems the crawler only manages to extract the title from the seed URL ($seed_url) and fails for every link it follows. Can anyone figure out the cause of this?

On which line of the code should I add what, so that the script extracts the title not only from $seed_url but from the links it finds too?

@developer_web (author), May 12, 2020: Folks,

What does this bit of code do?

[code]
//Define Seed Settings
$seed_url = "http://localhost/test/crawler/5/3.php";
$seed_components = parse_url( $seed_url );
if($seed_components === false) {
    die( 'Unable to Seed Parse URL' );
}
$seed_scheme = $seed_components['scheme'];
$seed_host = $seed_components['host'];
$url_start = $seed_scheme. '://'. $seed_host;
//Download Seed URL
parsePage( $seed_url, "" );
//Loop through all pages on site.
while(1) {
    $counter = 0;
    $select_query = "SELECT * FROM pages WHERE download_time IS NULL";
    if(($select_result = mysqli_query( $mysql_conn, $select_query )) !== false) {
        if( ($rowCount = mysqli_num_rows($select_result)) > 0 ) {
            for( $i = 0; $i < $rowCount; $i++ ) {
                if( ( $row = mysqli_fetch_assoc($select_result) ) !== false ) {
                    $path = $row['path'];
                    $referer = $row['referer'];
                    //Check if first character isn't a '/'
                    if( $path[0] != '/' ) {
                        continue;
                    }
                    $path = $row['path'];
                    $referer = $row['referer'];
                    if( parsePage( $url_start. $path, $referer ) ) {
                        $counter++;
                    }
                    sleep(1);
                }
            }
        } else {
            break;
        }
    } else {
        die( "Unable to select un-downloaded pages\n" );
    }
    if($counter == 0) {
        break;
    }
}

?>
[/code]

What is the purpose of querying the MySQL table for pages here?

Which pages is the query looking for? The $seed_url, or the links found on the $seed_url page?

If you need the full script, check my original post.

Thanks!
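Going by the full script in the original post, the `pages` table appears to double as a work queue: `parsePage()` stores each link it finds with `download_time` set to NULL, and the loop selects the rows where `download_time IS NULL`, which are the discovered links that have not been downloaded yet. The sketch below imitates that bookkeeping in memory. It is an illustration under assumptions, not the tutorial's code: the array stands in for the table, the link map is made up from the posted test pages, `crawl` is a hypothetical name, and `path` is assumed to be a unique key (which `ON DUPLICATE KEY UPDATE` and `INSERT IGNORE` rely on).

```php
<?php
// In-memory sketch of the crawl loop's bookkeeping. No HTTP or MySQL involved.
// Link map made up from the test pages in the original post (paths shortened).
$linksOnPage = [
    '/1.php'  => ['/2.php', '/10.php'],
    '/2.php'  => ['/3.php', '/20.php'],
    '/10.php' => ['/20.php', '/2.php'],
];

$pages = []; // path => download_time; null means "discovered, not yet downloaded"
$order = []; // the order in which pages were processed

function crawl(string $path, array $linksOnPage, array &$pages, array &$order): void
{
    $pages[$path] = time();         // like INSERT ... ON DUPLICATE KEY UPDATE download_time=NOW()
    $order[] = $path;
    foreach ($linksOnPage[$path] ?? [] as $link) {
        if (!array_key_exists($link, $pages)) {
            $pages[$link] = null;   // like INSERT IGNORE ... with download_time NULL
        }
    }
}

crawl('/1.php', $linksOnPage, $pages, $order); // the seed URL

while (true) {
    // like SELECT * FROM pages WHERE download_time IS NULL
    $pending = array_keys(array_filter($pages, 'is_null'));
    if ($pending === []) {
        break; // nothing left to download
    }
    foreach ($pending as $path) {
        crawl($path, $linksOnPage, $pages, $order);
    }
}

echo implode(' ', $order), PHP_EOL;
```

With this link map the pages are processed in the order /1.php, /2.php, /10.php, /3.php, /20.php: the seed first, then each batch of links discovered so far.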
