www.webdeveloper.com

Thread: Making Sphider ignore disallowed pages?

  1. #1
    Join Date
    Jun 2009
    Posts
    14

Making Sphider ignore disallowed pages?

I meant: ignore the disallow instruction and go ahead and retrieve the page anyway.

I tried Sphider, Sphider-plus, and some mods to make it "ignore robots", but that doesn't seem to be enough.
I'm trying to index a third-party website to help users find other people's posts, since the owner seems too busy with the "sales, sales, sales" side of things.
The problem is that they seem to deliberately want to keep us from finding help, because they also added a "disallow" rule.

I can browse the pages, and I even changed Sphider's user agent to Firefox's, with no success.

Is it even possible to crawl a website as if it were a browser, beyond faking the user agent? In other words: how many ways does a server have to figure out whether it's a robot or a person reading the pages?

What I'm stating could be wrong, and there could be other instructions/rules in robots.txt or somewhere else, but bear with me.

    Thanks.
    Last edited by sergiozambrano; 03-22-2012 at 07:22 AM.
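The question above boils down to sending requests that carry browser-like headers. A minimal sketch in Python (the thread is about the PHP-based Sphider, so this is only an illustration, and the URL and UA string are placeholder values): the User-Agent header is just one signal, and a server can still spot a crawler through missing cookies, the absence of JavaScript execution, request rate, and header patterns.

```python
import urllib.request

# Example browser-like UA string (a placeholder, not a current Firefox UA).
BROWSER_UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:10.0) "
              "Gecko/20100101 Firefox/10.0")

def build_request(url: str) -> urllib.request.Request:
    """Build a GET request that presents browser-like headers."""
    return urllib.request.Request(
        url,
        headers={
            "User-Agent": BROWSER_UA,
            "Accept": "text/html,application/xhtml+xml",
            "Accept-Language": "en-US,en;q=0.5",
        },
    )

def fetch(url: str) -> bytes:
    """Download the page body directly, without consulting robots.txt."""
    with urllib.request.urlopen(build_request(url), timeout=10) as resp:
        return resp.read()
```

Note that this only changes what the request looks like; it does nothing about server-side checks such as rate limiting or JavaScript challenges, which is consistent with the poster's observation that faking the user agent alone was not enough.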

  2. #2
    Join Date
    Nov 2002
    Location
    Baltimore, Maryland
    Posts
    12,278
    The internet is designed to spread information, not keep it safe. As with life itself, the best that you can do is ask politely for the spiders to leave you alone.
    “The power of the Web is in its universality. Access by everyone regardless of disability is an essential aspect.”
    —Tim Berners-Lee, W3C Director and inventor of the World Wide Web

  3. #3
    Join Date
    Jun 2009
    Posts
    14

    What?

    Quote Originally Posted by Charles
    The internet is designed to spread information, not keep it safe. As with life itself, the best that you can do is ask politely for the spiders to leave you alone.
    ahem… amen?

    What?

    Did you read my description or just the title?
    Is that an answer, or just your signature in an empty post?

  4. #4
    Join Date
    Jun 2009
    Posts
    14

Update

Stupidly, I didn't check HOW the links appear, only where they pointed.

    It seems the links open the pages I want with JavaScript, which Sphider can't process.

At least I know how the page URLs are formed, and I can increment the query string while downloading. That won't index the original pages, but I'll be able to create a DB I can work with.

Is there any PHP script or Mac software (or a Firefox/Chrome extension?) to download webpages from a URL range?

    Any idea?
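The "increment the query string" idea described above can be sketched in a few lines of Python. This is only an illustration under assumed names: the base URL and the `id` parameter are placeholders for whatever the real site uses.

```python
import urllib.request

# Placeholder URL pattern; substitute the real site's query-string scheme.
BASE = "http://example.com/showpost.php?id={}"

def page_urls(start: int, stop: int):
    """Yield one URL per id in the half-open range [start, stop)."""
    for page_id in range(start, stop):
        yield BASE.format(page_id)

def download_range(start: int, stop: int, dest_dir: str = ".") -> list:
    """Fetch each URL in the range and save it as page_<id>.html."""
    saved = []
    for page_id in range(start, stop):
        url = BASE.format(page_id)
        with urllib.request.urlopen(url, timeout=10) as resp:
            data = resp.read()
        path = f"{dest_dir}/page_{page_id}.html"
        with open(path, "wb") as fh:
            fh.write(data)
        saved.append(path)
    return saved
```

For a no-code alternative, curl's URL globbing does the same thing from a shell, e.g. `curl "http://example.com/showpost.php?id=[1-100]" -o "page_#1.html"` (again with a placeholder URL).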
