Thread: Files for a Crawler to ignore

  1. #1
    Join Date
    Nov 2005

    Links a Crawler should ignore

    Hi, I have developed some code that crawls web pages looking for links. I need to filter out irrelevant links, such as those that refer to CSS files, JavaScript functions, and favicons; this is simple enough to achieve with regex. What I need to know is: what other irrelevant links am I likely to find on web pages?
    Also, is there a name for links of the following form?

    http://www. bbc.co.uk/go/homepage/www/lht/h2/t/-/http://www.tvlicensing.co.uk/index.jsp
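
For illustration, here is a minimal sketch of the kind of filter described above, plus a check for links that carry a second absolute URL inside their path, as the example link does. The extension list, the function names `is_irrelevant` and `embedded_target`, and the regex are all assumptions made for this sketch, not code from the crawler itself.

```python
import re
from urllib.parse import urlparse

# Assumed extensions for the resource types named in the post
# (stylesheets, script files, favicons); extend as needed.
IGNORED_EXTENSIONS = (".css", ".js", ".ico")

# Hypothetical pattern: an absolute URL that contains a second
# absolute URL somewhere after its own scheme and host.
EMBEDDED_URL = re.compile(r"https?://.+?(https?://.+)$")


def is_irrelevant(link: str) -> bool:
    """Return True for links the crawler should skip."""
    parsed = urlparse(link.strip())
    # href="javascript:..." pseudo-links call script, not pages.
    if parsed.scheme.lower() == "javascript":
        return True
    # Compare the path only, so query strings like ?v=2 are ignored.
    return parsed.path.lower().endswith(IGNORED_EXTENSIONS)


def embedded_target(link: str):
    """Return the URL embedded in the link's path, or None."""
    match = EMBEDDED_URL.match(link.strip())
    return match.group(1) if match else None
```

Calling `embedded_target` on a link shaped like the one above returns the trailing `http://...` part, which is presumably the page the link finally leads to; for an ordinary link it returns `None`.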

    Last edited by Solaar; 10-06-2006 at 02:56 PM. Reason: typo
