I meant: ignore the Disallow instruction, go ahead and retrieve the page.
I tried Sphider, Sphider-plus and some mods to make it "ignore robots", but that doesn't seem to be enough.
I'm trying to index a third-party website to help users find each other's posts, since the owner seems too busy with the "sales, sales, sales" part.
The problem is that they seem to deliberately not want us to find help, because they also added some "Disallow" rule.
I can browse the pages myself, and I even changed Sphider's user agent to Firefox's, with no success.
Is it even possible to fetch a website the way a browser does, other than by faking the user agent? In other words: how many ways does a server have to figure out whether it's a robot or a person reading the pages?
What I'm stating could be wrong, and there could be other instructions/rules in robots.txt or somewhere else, but bear with me.
Thanks.
Last edited by sergiozambrano; 03-22-2012 at 06:22 AM.
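On the robots.txt question above: robots.txt is only advisory. The crawler reads it and decides whether to obey; the server does not enforce it. So a quick way to see what a Disallow rule actually covers is to parse the file yourself. Below is a minimal sketch using Python's standard `urllib.robotparser`. The rules in `SAMPLE_ROBOTS`, the `example.com` URLs and the helper name `is_allowed` are made up for illustration and are not taken from the site in question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, NOT the real file from the site discussed here.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /forum/
"""


def is_allowed(robots_text, user_agent, url):
    """Return True if robots_text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


# The agent name "Sphider" and both URLs are placeholders.
print(is_allowed(SAMPLE_ROBOTS, "Sphider", "http://example.com/forum/post1"))
print(is_allowed(SAMPLE_ROBOTS, "Sphider", "http://example.com/index.html"))
```

If a crawler that ignores these rules still retrieves nothing, the cause is probably somewhere other than robots.txt.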
The internet is designed to spread information, not keep it safe. As with life itself, the best that you can do is ask politely for the spiders to leave you alone.
“The power of the Web is in its universality. Access by everyone regardless of disability is an essential aspect.”
—Tim Berners-Lee, W3C Director and inventor of the World Wide Web
The internet is designed to spread information, not keep it safe. As with life itself, the best that you can do is ask politely for the spiders to leave you alone.
Ahem… amen?
What?
Did you read my description or just the title?
Is that an answer, or your signature in an empty post?
Stupidly, I didn't check HOW the links work, just where they pointed to.
It seems the links open the pages I want with JavaScript, which Sphider can't process.
At least I know what the pages are called, and I can increment the query string while downloading. That won't index the original pages, but I'll be able to create a DB I can work with.
Is there any PHP script or Mac software (or Firefox/Chrome extension?) to download web pages from a URL range?
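For the "increment the query string" idea, a rough sketch in Python (standard library only) could look like the following. Everything specific here is an assumption for illustration: the base URL, the parameter name `id`, the output folder, the user-agent string and the function names `build_urls` and `download_range` are placeholders, since the thread doesn't say what the real pages are called.

```python
import os
import time
import urllib.error
import urllib.request
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_urls(base_url, param, start, stop):
    """Return base_url with `param` set to each integer from start to stop inclusive."""
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))  # keep any other query parameters
    urls = []
    for n in range(start, stop + 1):
        query[param] = str(n)
        urls.append(urlunsplit(parts._replace(query=urlencode(query))))
    return urls


def download_range(base_url, param, start, stop, out_dir="pages", delay=1.0):
    """Save each page in the range as <n>.html inside out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    numbered = zip(range(start, stop + 1), build_urls(base_url, param, start, stop))
    for n, url in numbered:
        # Placeholder user-agent string; adjust as needed.
        request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                body = response.read()
        except urllib.error.URLError as exc:  # also covers HTTP errors such as 404
            print(f"skipped {url}: {exc}")
            continue
        with open(os.path.join(out_dir, f"{n}.html"), "wb") as fh:
            fh.write(body)
        time.sleep(delay)  # pause between requests to go easy on the server


# Example call (hypothetical URL and range):
# download_range("http://example.com/viewpost.php?id=1", "id", 1, 500)
```

The saved files can then be loaded into a local DB or pointed at by an indexer. Note this only works if the server returns the post content for those URLs directly, without needing JavaScript to run.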