App to parse web files in a directory (preferably directly on the server folder)


adj
12-22-2009, 10:21 AM
I'm looking for an application that can parse the contents of files. I want to select a directory on a computer and have the program look through all the webpage-type files (.htm, .asp, etc.) and write a list of all the links and images in each page to a text file or to Excel. Do you know of any program that can do that?
Example:

page1.html

links
http://www.google.com
archive/page.asp


images
images/1.jpg
images/2.gif
images/category/abc.png

page2.asp

links
anotherpage.asp

images
images/picture.jpg

I'd like to run this against the directory on the server, not via the web, because there may be some pages sitting out there that are not linked to or from anything.

It seems like such a useful thing that I'm sure somebody has written an application like this and put it out there for download; I'm just having trouble finding it. (And it will save me a lot of time if I can find a pre-made one rather than having to write it myself.)

Thanks!

donatello
12-22-2009, 05:13 PM
I have a couple... this one works and I use it to extract links.

You can modify it to grab emails or whatever.


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>PHP Website Crawler - Link extractor</title></head>
<body>
<font face="verdana" color="#66ccff">
<form id="crawl" method="post" action="">

<label>URL:
<input name="url" type="text" id="url" value="<?php echo htmlspecialchars(isset($_POST['url']) ? $_POST['url'] : 'http://website.com'); ?>" size="70" maxlength="255" />
</label>
<br />
<br />
<label>
<input type="submit" name="Submit" value="Crawl!" />
</label>
<br />
</form>
</font>
</body>
</html>
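<?php
// Runs only after the form has been submitted.
if (isset($_POST['url'])) {
    $url = $_POST['url'];
    // Opening a remote URL with fopen() requires allow_url_fopen to be enabled.
    $f = @fopen($url, "r");
    if ($f === false) {
        print "Could not open " . htmlspecialchars($url) . "<br>";
    } else {
        // Read one line per iteration; a tag split across two lines will be missed.
        while (($buf = fgets($f, 4096)) !== false) {
            // Capture group 1 holds the href value of each <a> tag on the line.
            preg_match_all("/<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]/isU", $buf, $words);
            foreach ($words[1] as $cur_word) {
                // Keep the original case, since URL paths can be case-sensitive.
                print htmlspecialchars($cur_word) . "<br>";
            }
        }
        fclose($f);
    }
}
?>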
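
The script above fetches one URL at a time. For the directory case in the original question, here is a rough sketch that walks a folder on disk and lists the links and images per file, using PHP's built-in DOMDocument in place of the regex. The function names (extractLinksAndImages, buildReport), the extension list, and the scan.php file name are my own assumptions, not part of the script above. It only sees what is literally in the markup, so links that ASP or PHP code builds at runtime will not show up.

```php
<?php
// Sketch only: list the links and images found in web files under a directory.
// Function names and the extension list are illustrative assumptions.

// Return array('links' => [...], 'images' => [...]) for one HTML string.
function extractLinksAndImages($html)
{
    $result = array('links' => array(), 'images' => array());
    if (trim($html) === '') {
        return $result; // nothing to parse
    }
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $result['links'][] = $a->getAttribute('href');
        }
    }
    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            $result['images'][] = $img->getAttribute('src');
        }
    }
    return $result;
}

// Walk $dir recursively and build the report text for files with a matching extension.
function buildReport($dir, $extensions = array('htm', 'html', 'asp', 'php'))
{
    $report = '';
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($files as $file) {
        if (!$file->isFile()
            || !in_array(strtolower($file->getExtension()), $extensions)) {
            continue;
        }
        $found = extractLinksAndImages(file_get_contents($file->getPathname()));
        $report .= $file->getPathname() . "\n\nlinks\n";
        $report .= implode("\n", $found['links']) . "\n\nimages\n";
        $report .= implode("\n", $found['images']) . "\n\n";
    }
    return $report;
}

// Command-line use: php scan.php /path/to/site > report.txt
if (isset($argv[1])) {
    echo buildReport($argv[1]);
}
```

Run it from the command line on the server and redirect the output to a text file. If you want it in Excel instead, writing each link or image as a row with fputcsv() would be a small change.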

chris22
12-23-2009, 03:14 PM
Hi, this is a simple and nice PHP class that will help to extract links, images etc.:

http://simplehtmldom.sourceforge.net/