Dead links in php txt file


netsavage
01-04-2004, 12:28 PM
Hi

Wondering if anyone can help me with this, and sorry if
this is the wrong place to ask it.

I need to check for dead links in a list of
hundreds of URLs which I have saved in txt
format.

My site accesses this list through a link
(gallery generator) with an exit php like this:

<?php
$urls = file($QUERY_STRING.'.txt');
header("Location: ".$urls[rand(0,count($urls)-1)]);
?>

Once in a while a 404 not found page comes up and I need
an easy way to find these dead links and delete them.

It would take forever to manually check each one, so
I need to automate the process somehow.

Any suggestions?

Thank you

pnaj
01-04-2004, 02:48 PM
There's a section in the manual about remote files (http://www.phpbuilder.com/manual/features.remote-files.php) which should be part of the solution
(look at the bit titled Example 19-1. Getting the title of a remote page).

Then you could do something like ...

$old_urls = file($QUERY_STRING.'.txt');
$new_urls = array();
foreach($old_urls as $url){
    $file = fopen($url, "r");
    if ($file) {
        $new_urls[] = $url;
    }
    fclose($file);
}
$file = fopen($QUERY_STRING.'.txt', "w");
fwrite($file, join("\n", $new_urls));
fclose($file);


I haven't debugged this, but it should be a start,at least.

netsavage
01-04-2004, 03:04 PM
pnaj

Thanks a lot for the reply.

I don't have a clue what to do with the code you provided. I assume it will weed out the dead links?

Can you email me with more of an explanation? I am quite ignorant about much else than simple html or java. I'm willing to learn more though.

By the way pnaj we share the same birthday



:D

pnaj
01-04-2004, 06:08 PM
Yeah ... I sort of assumed that you knew PHP but just didn't know a way to solve this particular problem.

OK ...

1) I'm assuming that your server runs PHP (seems so, 'cause of the code you showed us in the first place) so you could run a script page on your server.

2) Open up any text editor, paste in the following code, and save as check_urls.php (or whatever you want). You put this on the same server that is displaying the links ... at the root say, but anywhere will do.

3) This is not the same as the first lot I sent. The first lot assumed that you knew what was going on but this will just produce a list of 'dead' links for now. Get this going first and then we can fix it to actually remove entries.

4) You will have to replace 'full_path_to_url_file' with whatever the path/name of the urls text file is. eg. If the check_urls.php is at the root and urls.txt is in the docs folder then replace 'full_path_to_url_file' with './docs/urls.txt'.

5) Upload check_urls.php to your server, navigate to it using any browser and see if it displays the links as dead or live correctly. You can try them.

6) See how this works ... come back to let me know how you get on and we can go to next stage.


$old_urls = file('full_path_to_url_file');
$live_urls = array();
$dead_urls = array();

foreach($old_urls as $url){
    $a_url = '<a href="'.$url.'" target="_blank">'.$url.'</a>';
    $file = fopen($url, "r");
    if ($file) {
        $live_urls[] = $a_url;
    } else {
        $dead_urls[] = $a_url;
    }
    fclose($file);
}

echo '<h1>Dead URLs</h1>'.join("<br>", $dead_urls));
echo '<h1>Live URLs</h1>'.join("<br>", $live_urls));
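
For the path question in step 4, one way to avoid guessing at relative paths is to build the path from the script's own folder. This is only a minimal sketch, assuming the list is called urls.txt and sits beside check_urls.php; path_beside_script() is a made-up helper name, not part of PHP.

```php
<?php
// Hypothetical helper: build a path to a file that sits in the same
// folder as this script, whatever the current working directory is.
function path_beside_script($filename)
{
    // dirname(__FILE__) is the folder that contains the current script.
    return dirname(__FILE__) . '/' . $filename;
}

// 'urls.txt' is an assumed file name; use whatever your list is called.
$list_path = path_beside_script('urls.txt');

if (is_readable($list_path)) {
    $old_urls = file($list_path);
    echo count($old_urls) . " lines loaded\n";
} else {
    // Report the full path so a wrong folder is easy to spot.
    echo "Cannot read " . $list_path . "\n";
}
```

If the script prints the "Cannot read" line, the full path in the message shows where PHP was actually looking.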

netsavage
01-04-2004, 06:28 PM
This is much appreciated.

I followed your directions, this is the url to the check_urls.php

http://www.netsavage.com/galleries/check_urls.php

I think I see what you are getting at here and it's just awesome!

Guess it needs a little tweaking though :)

Standing by

pnaj
01-04-2004, 06:48 PM
Yup. we need to enclose ALL the code in PHP tags like ...

<?php
... PUT ALL CODE HERE ...
?>

Also, you've got to change the 'full_path_to_url_file' to the text file that contains the URLs.

Anyway, the whole page should look like:
<?php
$old_urls = file('full_path_to_url_file');
$live_urls = array();
$dead_urls = array();

foreach($old_urls as $url){
    $a_url = '<a href="'.$url.'" target="_blank">'.$url.'</a>';
    $file = fopen($url, "r");
    if ($file) {
        $live_urls[] = $a_url;
    } else {
        $dead_urls[] = $a_url;
    }
    fclose($file);
}

echo '<h1>Dead URLs</h1>'.join("<br>", $dead_urls));
echo '<h1>Live URLs</h1>'.join("<br>", $live_urls));
?>

pnaj
01-04-2004, 07:01 PM
Just noticed another tweak.

The last two lines (starting with echo) both have one close-parenthesis too many ... remove one from each line.

netsavage
01-04-2004, 07:13 PM
Having some problems.

The sites.txt is in the same place as the check_urls.php so I'm not sure if I just call it './sites.txt' or include the folder they are in, './galleries/sites.txt'. Anyway, I uploaded a few combinations; this is the one I assume is correct:

<?php
$old_urls = file('/sites.txt');
$live_urls = array();
$dead_urls = array();

foreach($old_urls as $url){
$a_url = '<a href="'.$url.'" target="_blank">'.$url.'</a>';
$file = fopen($url, "r");
if ($file) {
$live_urls[] = $a_url;
} else {
$dead_urls[] = $a_url;
}
fclose($file);
}

echo '<h1>Dead URLs</h1>'.join("<br>", $dead_urls);
echo '<h1>Live URLs</h1>'.join("<br>", $live_urls);


But I keep getting this:

Parse error: parse error in /home/netsava/public_html/galleries/check_urls.php on line 7

UPDATE:

I cleaned this up ('/sites.txt'); and things got a little better

I must be doing something else wrong now I'm getting this:

Warning: file(/sites.txt): failed to open stream: No such file or directory in /home/netsava/public_html/galleries/check_urls.php on line 2

Warning: Invalid argument supplied for foreach() in /home/netsava/public_html/galleries/check_urls.php on line 6

Dead URLs
Live URLs

UPDATE:

I removed the / from sites.txt and things are looking up.
I now get a full list of all the urls, but all lines have an error message like this:

Warning: fopen(http://www.picsurfer.com/m/bravegirls-069/index1.htm?t1/revs=netsavage ): failed to open stream: HTTP request failed! HTTP/1.1 400 Bad Request in /home/netsava/public_html/galleries/check_urls.php on line 8

Warning: fclose(): supplied argument is not a valid stream resource in /home/netsava/public_html/galleries/check_urls.php on line 14

pnaj
01-05-2004, 04:26 AM
Hi there,

I'll have a look and see if I can find what's causing those errors but will have to be later.

pnaj
01-05-2004, 02:57 PM
Ok. I'm back.

I had a look at the page myself ... noticed that it was taking absolutely ages to load up. Also, the messages you're getting are actually just warnings.

There might still be some (ugh!) further problems, but try this slightly different version. This is a complete replacement and should be put in the same place as before.

The @ in front of the fopen will kill all those warning messages.

(I've put in a few comments, so you can see what's going on.)

<?php
$old_urls = file('sites.txt'); // Load all urls into a list
$live_urls = array(); // Create empty list for live urls
$dead_urls = array(); // Create empty list for dead urls

foreach($old_urls as $url){ //cycle thru all items on list
    // Build an HTML link for later
    $a_url = '<a href="'.$url.'" target="_blank">'.$url.'</a>';
    $file = @fopen($url, "r"); // Try opening the url
    if ($file !== FALSE) { // If it does open ...
        fclose($file); // ... close it ...
        $live_urls[] = $a_url; // ... and add HTML link to live list
    } else { // If it doesn't open ...
        $dead_urls[] = $a_url; // ... add it to dead list
    }
}
// Display all the urls
echo '<h1>Dead URLs</h1>'.join("<br>", $dead_urls);
echo '<h1>Live URLs</h1>'.join("<br>", $live_urls);
?>

netsavage
01-05-2004, 06:31 PM
Good day

I tried this new one and the page loads so slow it times out.
There are around 1500 urls, maybe it takes too long to check them all and the page times out?

pnaj
01-06-2004, 06:35 AM
Put this on the line just before the foreach(...) line.

-------------------

set_time_limit(300);

-------------------

It allows for a longer execution time. If it still times out, try
upping the number of seconds to 500.
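
A rough sketch of the same idea applied per URL rather than once for the whole run: each call to set_time_limit() restarts the timer from zero, so calling it inside the loop gives every URL its own budget. The helper names check_all() and url_opens() and the numbers 10 and 5 are assumptions for illustration only. default_socket_timeout is a real php.ini setting that caps how long a socket-based stream waits, which may matter more than the script limit when a remote host is slow to answer.

```php
<?php
// Assumed values; tune them for your own server.
$per_url_seconds = 10; // script-time budget granted before each URL
$socket_seconds  = 5;  // how long a socket-based stream may wait

ini_set('default_socket_timeout', (string) $socket_seconds);

// Hypothetical helper: run $checker on every URL, restarting the
// execution timer before each one.
function check_all($urls, $per_url_seconds, $checker)
{
    $results = array();
    foreach ($urls as $url) {
        set_time_limit($per_url_seconds); // timer restarts from zero here
        $results[$url] = call_user_func($checker, $url);
    }
    return $results;
}

// Hypothetical checker: true if the URL can be opened for reading.
function url_opens($url)
{
    $fp = @fopen($url, 'r');
    if ($fp === false) {
        return false;
    }
    fclose($fp);
    return true;
}

// Usage (needs network access, so it is left commented out):
// $results = check_all(array_map('trim', file('sites.txt')), $per_url_seconds, 'url_opens');
```

Even with this in place, the web server or the browser may have its own timeout that PHP cannot change, so a very long list might still need to be checked in smaller batches.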

netsavage
01-06-2004, 10:52 AM
Hi pnaj

I tried what you said, set up to 800 but am still getting nothing. Did I put the set_time_limit(300); in the right spot?


<?php
$old_urls = file('sites.txt'); // Load all urls into an list
$live_urls = array(); // Create empty list for live urls
$dead_urls = array(); // Create empty list for dead urls

foreach($old_urls as $url){ //cycle thru all items on list
// Build an HTML link for later
$a_url = '<a href="'.$url.'" target="_blank">'.$url.'</a>';
$file = @fopen($url, "r"); // Try opening the url
set_time_limit(800);if ($file !== FALSE) { // If it does open ...
set_time_limit(800);fclose($file); // ... close it ...

set_time_limit(800);$live_urls[] = $a_url; // ... and add HTML link to live list
} else { // If it doesn't open ...

set_time_limit(800);$dead_urls[] = $a_url; // ... add it to dead list
}
}
// Display all the urls
echo '<h1>Dead URLs</h1>'.join("<br>", $dead_urls);
echo '<h1>Live URLs</h1>'.join("<br>", $live_urls);
?>

pnaj
01-06-2004, 10:57 AM
Here it is all again, with the set_time_limit(300) in place ...


<?php
$old_urls = file('sites.txt'); // Load all urls into a list
$live_urls = array(); // Create empty list for live urls
$dead_urls = array(); // Create empty list for dead urls

set_time_limit(300);
foreach($old_urls as $url){ //cycle thru all items on list
    // Build an HTML link for later
    $a_url = '<a href="'.$url.'" target="_blank">'.$url.'</a>';
    $file = @fopen($url, "r"); // Try opening the url
    if ($file !== FALSE) { // If it does open ...
        fclose($file); // ... close it ...
        $live_urls[] = $a_url; // ... and add HTML link to live list
    } else { // If it doesn't open ...
        $dead_urls[] = $a_url; // ... add it to dead list
    }
}
// Display all the urls
echo '<h1>Dead URLs</h1>'.join("<br>", $dead_urls);
echo '<h1>Live URLs</h1>'.join("<br>", $live_urls);
?>

netsavage
01-06-2004, 11:05 AM
No luck, she's still timing out at 500.

pnaj
01-06-2004, 11:14 AM
Not gonna give up ...

Two things:

1) Just to see if the code works, make a cut-down copy of the sites.txt file to about 100 entries (first back up your original, of course).

2) I've attached a php page that will display, if your web server allows it, a page showing the current configuration of PHP. Put it in the same folder as check_urls.php and let me know.

pnaj
01-06-2004, 11:17 AM
Doh! Forgot to attach the php page ... here it is ... save it as phpinfo.php.


<html>
<body>
<?php
phpinfo();
?>
</body>
</html>

netsavage
01-06-2004, 01:35 PM
Getting there.

I made a list of around 100 urls and it worked, well kinda.
The urls that its saying are dead are still good, try it.

http://www.netsavage.com/galleries/check_urls.php

Thanks so much for your time with this.

pnaj
01-06-2004, 01:51 PM
Hey, don't worry about taking up my time ... I'm learning about things as I go along as well!!

Anyway, not bad ... at least the code is essentially doing the right thing.

It seems most of the dead list comprises items that don't actually reference a file directly, but refer to some kind of index page (i.e. end with a '/'). We might be able to do something about that as well, but before that, would you check and see exactly which of the other links (ones that don't end in '/') are actually dead?

You might also want to check a few of the supposed live ones as well.

P.S. Due to the nature of the content, I don't really want to check links myself ... hope you're OK with that.

Also, did you do the phpinfo page as well? That will help us find out whether or not we can fix the execution time problem.

netsavage
01-06-2004, 02:13 PM
Well every one of those 79 links is live, even the ones listed as dead. I kinda figured all the links would be good but I must have 1 or 2 bad ones in the original list. It would take forever to check the original list manually, my mouse finger is already numb from clicking on the 79.

I thank you for helping me even though you may not approve of the content I am working with.

"Also, did you do the phpinfo page as well. That will help us find out whether or not we can fix the execution time problem"

I pasted that little php code to a page in note pad and saved it as phpinfo.php then uploaded to the server.

This is fun!

pnaj
01-06-2004, 03:16 PM
It's strange ... there doesn't seem to be an obvious pattern. I'm surprised that the one on your server (http://www.netsavage.com/flirt4free.html) is being treated as dead. That's definitely there, isn't it?

One thing that could be causing problems is whether there is some sort of re-direction going on.

You're going to have to get your clicking finger out again! Run the page again and click one of our supposed dead links. Have a look at the browser's address bar. Have you been re-directed somewhere (is this the same url as in your list)?
Do the same for a few of each of the dead and live ones just to see.
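
One way to look for redirection without clicking, sketched under the assumption of PHP 5 or later, where get_headers() is available: the function returns the response header lines for a URL, and when a redirect is followed the list typically contains one status line per hop. status_codes() and describe_url() are hypothetical helper names.

```php
<?php
// Hypothetical helper: pull every HTTP status code out of a list of
// header lines such as the one get_headers() returns.
function status_codes($headers)
{
    $codes = array();
    foreach ($headers as $line) {
        // Status lines look like "HTTP/1.1 302 Found".
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $codes[] = (int) $m[1];
        }
    }
    return $codes;
}

// Hypothetical helper: summarise the chain of status codes for one URL.
function describe_url($url)
{
    $headers = @get_headers(trim($url)); // false when the request fails
    if ($headers === false) {
        return 'no response';
    }
    return implode(' -> ', status_codes($headers));
}

// Usage (needs network access, so it is left commented out):
// echo describe_url('http://example.com/') . "\n";
```

A chain that starts with a 301 or 302 would suggest the listed URL is being re-directed before the final page is served.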

In the meantime, I'll stick my thinking cap on again.

Also, that phpinfo page is probably best taken off the server once your site's live. Keep hold of it and if you need it again later, put it back. There's nothing on it that can't be got in other ways, but it puts all your server/php/etc info together on one page ... you might not want anyone else to sit and browse it.

What I was looking for (in the phpinfo file) was whether or not PHP was running in safe mode. It's not, but if it was then it would have stopped us from using the set_time_limit() function.

So, the other thing you can do is start pushing the limits of the execution time ... to get a feel for how long we need.

Below is a new version. Load it up and at the same time, start adding more and more urls (50 at a time) to the sites.txt file.
See where we get to before it gives up.

<?php
$old_urls = file('sites.txt'); // Load all urls into a list
$live_urls = array(); // Create empty list for live urls
$dead_urls = array(); // Create empty list for dead urls

//set_time_limit(300);
foreach($old_urls as $url){ //cycle thru all items on list
    // Build an HTML link for later
    $a_url = '<a href="'.$url.'" target="_blank">'.$url.'</a>';
    $file = @fopen($url, "r"); // Try opening the url
    if ($file !== FALSE) { // If it does open ...
        fclose($file); // ... close it ...
        $live_urls[] = $a_url; // ... and add HTML link to live list
    } else { // If it doesn't open ...
        $dead_urls[] = $a_url; // ... add it to dead list
    }
}
// Display all the urls
$retval = '
<html>
<body>
<b>URL Count: '.count($old_urls).'</b>
<hr>
<h1>Dead URLs</h1>
'.join("<br>", $dead_urls).'
<h1>Live URLs</h1>
'.join("<br>", $live_urls).'
</body>
</html>
';
?>


Laters.

netsavage
01-06-2004, 04:02 PM
Hi

I uploaded the new code and now I'm getting a blank screen.

URL removed by moderator is a cloaked url but I have never had a problem with any of the urls I have cloaked, besides 99% of the urls in the sites.txt are straight urls not redirects. A lot of the urls are rotating galleries and pictures, maybe that has something to do with it?

I think I might streamline the original list of urls, might make things easier to manage.

pnaj
01-06-2004, 04:09 PM
Sh*t, I always miss a bit! Here it is again ...

<?php
$old_urls = file('sites.txt'); // Load all urls into a list
$live_urls = array(); // Create empty list for live urls
$dead_urls = array(); // Create empty list for dead urls

//set_time_limit(300);
foreach($old_urls as $url){ //cycle thru all items on list
    // Build an HTML link for later
    $a_url = '<a href="'.$url.'" target="_blank">'.$url.'</a>';
    $file = @fopen($url, "r"); // Try opening the url
    if ($file !== FALSE) { // If it does open ...
        fclose($file); // ... close it ...
        $live_urls[] = $a_url; // ... and add HTML link to live list
    } else { // If it doesn't open ...
        $dead_urls[] = $a_url; // ... add it to dead list
    }
}
// Display all the urls
$retval = '
<html>
<body>
<b>URL Count: '.count($old_urls).'</b>
<hr>
<h1>Dead URLs</h1>
'.join("<br>", $dead_urls).'
<h1>Live URLs</h1>
'.join("<br>", $live_urls).'
</body>
</html>
';

echo $retval;
?>


Ok, if you're sure about the re-directs then as I said last time, I'll have to have a think.

One thing though - what does 'cloaked' actually mean?

netsavage
01-06-2004, 05:11 PM
Hey

That worked, but there's still the issue with non-dead links. I cannot access more than about 127 urls, and that's with the limit set at (900).

There are some urls in the live links which have the same url string as some which the php is calling dead, but the dead ones are not dead.

Too bad there was not some kind of utility which I could just cut and paste the urls into and get them checked that way.
I use cute html pro and thought maybe there was a dead link checker on that but you have to enter the urls one at a time or something like that.

Link cloaking- I have some software which when you have a url like this Link Removed

you can change it into this (or whatever you want)

Link Removed

Along with having a nice url in the address bar your affiliate id's are hidden. You can add meta tags and dynamic content so the se's will index the cloaked html.

This site explains it all, it's pretty cool and I have not run into any problems with it yet. Link Removed

pyro
01-06-2004, 06:19 PM
If you continue posting links of that nature, you will leave me no option but to remove your posting privileges at these forums. Please read the AUP (http://www.internet.com/corporate/privacy/aup.html).

pnaj
01-06-2004, 06:33 PM
Pyro,

Sorry about that ... I shouldn't have mentioned the name of the file in the first place.

I must admit, I was a bit concerned about dealing with this thread in the first place, but netsavage's problem probably has quite a useful solution.

As long as we don't show any links, can we carry on?

In fact, have you got any ideas why some 'live' links might not be able to be opened by fopen()?

pyro
01-06-2004, 06:42 PM
Indeed, please carry on. Did you get my PM? If not, please read it, as there is another request there. Thanks...

netsavage
01-06-2004, 06:47 PM
Sorry Guys...

pnaj
01-06-2004, 06:51 PM
Pyro,

I hadn't got private messages turned on ... but I've turned them on now.

pyro
01-06-2004, 06:58 PM
Originally posted by pnaj
Pyro,

I hadn't got private messages turned on ... but I've turned them on now.

Oops. I thought your above post was by netsavage. My PM was directed to him, in relation to his signature and homepage URL - they must be removed. Netsavage, please remove them, or our administrator will do it for you.

pyro
01-06-2004, 06:59 PM
Originally posted by pnaj
In fact, have you got any ideas why some 'live' links might not be able to be opened by fopen()?

Try running a trim() around the $url in your fopen. I whipped up this, and it worked for me (granted, I only tested on about 10 links...):

<?PHP
$urls = file("urls.txt");

$valid = array();
$invalid = array();

foreach ($urls as $line_num => $url) {
    if ($fp = @fopen(trim($url), "r")) {
        $valid[] = $line_num.": ".$url;
        fclose($fp);
    }
    else {
        $invalid[] = $line_num.": ".$url;
    }
}
?>

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Link Checker</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
<h1 style="font-size: medium; color: #060;">Valid Links</h1>
<p>
<?PHP
foreach ($valid as $url) {
echo $url."<br />";
}
?>
</p>
<h1 style="font-size: medium; color: #990000;">Invalid Links</h1>
<p>
<?PHP
foreach ($invalid as $url) {
echo $url."<br />";
}
?>
</p>
</body>
</html>

I'd be interested to hear if that works any better...

edit - Note, the number that appears before the URL is the line number that the url appears on in the .txt file. I thought this would help make it easier to remove broken links.
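
Why trim() is likely to help: file() keeps the line ending on every element, so each URL reaches fopen() with a trailing newline (the stray space before the closing parenthesis in the 400 Bad Request warning earlier in the thread is consistent with that). A small sketch using a temporary file and placeholder example.com URLs; FILE_IGNORE_NEW_LINES and FILE_SKIP_EMPTY_LINES are flags from PHP 5 onward, so relying on them is an assumption about the server.

```php
<?php
// Write a small sample list to a temporary file.
// The example.com URLs are placeholders, not real entries.
$tmp = tempnam(sys_get_temp_dir(), 'urls');
file_put_contents($tmp, "http://example.com/a.html\nhttp://example.com/b.html\n");

$raw     = file($tmp);              // each element still ends with "\n"
$clean   = array_map('trim', $raw); // strip the newline and stray spaces
$flagged = file($tmp, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

unlink($tmp);

// Show the hidden character: the raw entry is one byte longer.
echo strlen($raw[0]) . " bytes raw, " . strlen($clean[0]) . " bytes trimmed\n";
```

Either the array_map('trim', ...) form or the flags should give fopen() a clean URL; trim() also removes spaces and carriage returns left by Windows editors, which the flags alone do not.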

netsavage
01-06-2004, 08:23 PM
pyro & pnaj

That works great, still a little problem with the load time, what does it take to increase the loading time? I cut down the number of urls to about 300.

It spat out 150 when I reduced the urls even more but it would be nice to have the whole 300 checked. The number system is a great idea.

The only dead links that showed up were really dead so I figure between the 2 of you good fellows we got it licked.

Sorry for breaking the rules back there.

pyro
01-06-2004, 08:51 PM
Try using my script, and upping the set_time_limit(). Alternately, send me the .txt file, so I have a list of the URLs (as I don't particularly feel like typing that many out :p ) and I'll see what I can do to get it working. :)
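
Once the checker has reported which entries are dead, removing them from the list can be sketched as below. drop_lines() and save_lines() are hypothetical helper names, and this is only one possible approach. The keys that file() assigns start at 0, so the numbers printed by the script above are one less than the line numbers a text editor shows. Back up the original list before overwriting it.

```php
<?php
// Hypothetical helper: return the list without the entries at the given
// zero-based positions (the keys file() assigns to each line).
function drop_lines($lines, $dead_keys)
{
    $kept = array();
    foreach ($lines as $key => $line) {
        if (!in_array($key, $dead_keys, true)) {
            $kept[] = trim($line); // store without the line ending
        }
    }
    return $kept;
}

// Hypothetical helper: write the surviving URLs back, one per line.
function save_lines($path, $lines)
{
    $fp = fopen($path, 'w');
    if ($fp === false) {
        return false;
    }
    fwrite($fp, count($lines) ? implode("\n", $lines) . "\n" : '');
    fclose($fp);
    return true;
}

// Usage, assuming entries 3 and 17 were reported as invalid:
// $kept = drop_lines(file('urls.txt'), array(3, 17));
// save_lines('urls.txt', $kept);
```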