trace through entire site?
andersonbd1
02-25-2003, 08:39 PM
I have taken over as webmaster of a site that has outgrown its roots. Because space on the server is an issue, I'd like to get rid of files that are no longer needed or accessed. I was going to write a Java program that parses the files, starting at index.htm, opening each link it finds, parsing that file, and so on. It would generate a list of every page (and image) that is "on" the site. I would imagine that this is a common issue and that someone has already written a program like this. I searched the internet for a while, but didn't really know how to search for such a thing and didn't find much. Does anyone know of such a thing, or have a better solution? It doesn't matter what language.
Thanks,
Ben Anderson
DaiWelsh
02-26-2003, 05:34 AM
Hmmm, I wrote something like this in Perl once, as I recall, to grab a client site I was going to have to take over.
It works fine for straight hrefs and images, but I don't recall whether I coded for forms, and I certainly didn't allow for JavaScript redirects/file opens. It is also not possible to pick up all possible routes through many dynamic pages. In other words, you need to decide first whether the site in question is suitable for this approach. A 'straight' HTML site should be OK; anything built by a more....hmmm..... 'experimental' mind will be difficult to do.
If you want what I had (it was a long time ago so no guarantees about how well it is written) I can try to dig it out when I am back in the office?
Alternatively you could do the sensible thing ;) and search for products at download.com, using searches like 'spider' or 'offline browser', e.g. http://download.com.com/3000-2377-9465860.html?tag=lst-0-8 or http://download.com.com/3000-2377-3359674.html?tag=lst-0-2
Regards,
Dai
andersonbd1
03-01-2003, 06:47 AM
Sorry, I forgot to mention that it is a static site. I'm converting to MySQL/PHP, so that's why I want to clean all the junk off.
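For a static site, the crawl discussed in this thread can be sketched in a few lines of Python. This is only a rough illustration, not the Perl script mentioned above: it assumes the site is available as a local copy on disk, that pages end in .htm or .html, and that links appear as plain href, src, or background attributes. The names LinkCollector, reachable_files, and unused_files are made up for this example. It will not see links built by JavaScript, and a link to a bare directory (e.g. "sub/") is not resolved to that directory's index page.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlsplit, unquote


class LinkCollector(HTMLParser):
    """Collects the values of href, src and background attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src", "background") and value:
                self.links.append(value)


def reachable_files(root, start="index.htm"):
    """Return the set of local files reachable from the start page."""
    root = os.path.realpath(root)
    seen = set()
    queue = [os.path.join(root, start)]
    while queue:
        path = os.path.realpath(queue.pop())
        if path in seen or not os.path.isfile(path):
            continue
        # Ignore links that climb out of the site directory.
        if os.path.commonpath([root, path]) != root:
            continue
        seen.add(path)
        # Only HTML pages are parsed; images etc. are just recorded.
        if not path.lower().endswith((".htm", ".html")):
            continue
        parser = LinkCollector()
        # latin-1 never fails to decode; an assumption for old pages.
        with open(path, encoding="latin-1") as f:
            parser.feed(f.read())
        for link in parser.links:
            parts = urlsplit(link)
            # Skip external URLs, mailto:, javascript:, and "#anchor" links.
            if parts.scheme or parts.netloc or not parts.path:
                continue
            rel = unquote(parts.path)
            if rel.startswith("/"):
                target = os.path.join(root, rel.lstrip("/"))
            else:
                target = os.path.join(os.path.dirname(path), rel)
            queue.append(target)
    return seen


def unused_files(root, start="index.htm"):
    """List files under root (relative paths) that the crawl never reached."""
    root = os.path.realpath(root)
    used = reachable_files(root, start)
    every = set()
    for folder, _dirs, names in os.walk(root):
        for name in names:
            every.add(os.path.realpath(os.path.join(folder, name)))
    return sorted(os.path.relpath(p, root) for p in every - used)
```

Calling unused_files on the directory holding the local copy returns the candidates for deletion. It is probably wise to move those files aside rather than delete them outright, since anything linked only from scripts, stylesheets, or other sites will show up as unused.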