I meant: ignore the disallow instruction, go ahead and retrieve the page.
I tried Sphider, Sphider-plus and some mods to make them "ignore robots", but that doesn't seem to be enough.
I'm trying to index a third-party website to help users find each other's posts, since the owner seems too busy with the "sales, sales, sales" part.
The problem is that they seem to deliberately not want us to find help, because they also added a "disallow" rule.
I can browse the pages myself, and I even changed Sphider's user agent to Firefox's, with no success.
Is it even possible to read a website the way a browser does, other than by faking the user agent? In other words: how many ways does a server have to figure out whether what is reading the pages is a robot or not?
What I'm stating could be wrong, and there could be other instructions/rules in robots.txt or somewhere else, but bear with me.
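To make the robots.txt part concrete, here is a minimal sketch of how a well-behaved crawler decides whether a page is allowed, using Python's standard urllib.robotparser. The robots.txt content, the URLs and the is_allowed helper are made up for illustration; the real file on the site may contain different rules. It also hints at why swapping the user agent alone may not help: as long as the crawler itself still checks robots.txt, a Firefox-style agent string simply falls under the "User-agent: *" rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, NOT taken from the actual site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /forum/

User-agent: Googlebot
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://example.com/forum/topic-1"  # placeholder URL
    for agent in ("Sphider", "Mozilla/5.0 Firefox/10.0", "Googlebot"):
        print(agent, "->", is_allowed(SAMPLE_ROBOTS_TXT, agent, url))
```

With these sample rules, only the agent that has its own more permissive entry is allowed into /forum/; every other agent string, browser-like or not, is matched by the wildcard entry. This only models the robots.txt check inside the crawler, not anything the server itself might do.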
Last edited by sergiozambrano; 03-22-2012 at 06:22 AM.