www.webdeveloper.com

Thread: Can I develop my own standalone search engine?

  1. #1
    Join Date
    Jan 2007
    Posts
    196

    Can I develop my own standalone search engine?

    I have about 80 old school magazines in searchable pdf files. I want to put these on a disc and then add a search facility specifically for the files on the disc.

    I do not want to use an external facility, although I would like it to look like those on offer.

    A very long-winded way would be to use the Windows Search facility to locate the files and then search each file it finds individually.

    Any suggestions? Is this the right forum?

  2. #2
    Join Date
    Mar 2010
    Location
    Singapore
    Posts
    367
    Quote: Originally Posted by spiresgate
    I have about 80 old school magazines in searchable pdf files. I want to put these on a disc and then add a search facility specifically for the files on the disc.

    I do not want to use an external facility, although I would like it to look like those on offer.

    A very long-winded way would be to use the Windows Search facility to locate the files and then search each file it finds individually.

    Any suggestions? Is this the right forum?
    Do you want to search for a particular PDF file, or do you want to search for keywords within each PDF file?

    Searching for a particular PDF file should not be very difficult, depending on the programming language and platform you choose.
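
    As a rough illustration of the file-name case, here is a minimal sketch in plain Java (standard library only). The class name PdfNameSearch and the method findByName are made up for this example, and it assumes the magazines are ordinary .pdf files somewhere under one folder:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: lists PDF files under a folder whose file name contains a term.
class PdfNameSearch {

    static List<Path> findByName(Path root, String term) throws IOException {
        String needle = term.toLowerCase(Locale.ROOT);
        // Files.walk visits the folder and all of its subfolders.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> {
                    String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
                    return name.endsWith(".pdf") && name.contains(needle);
                })
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage (assumed): java PdfNameSearch.java <folder> <term>
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        String term = args.length > 1 ? args[1] : "";
        for (Path p : findByName(root, term)) {
            System.out.println(p);
        }
    }
}
```

    This only matches file names, case-insensitively; it does not look inside the PDFs.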

    Searching for keywords within each PDF file is trickier, but I have explored an index/search engine from Apache called Lucene.

    Unfortunately, Lucene does not come with text-extraction features. You may need the Tika sub-project to help with that.

    Step 1
    Use Tika to extract the text from your PDF files

    Step 2
    Feed the text extracted in Step 1 into the Lucene engine

    Step 3
    Use Lucene to build the index, then search for keywords using Lucene
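
    To give a feel for what "index, then search" means, here is a toy sketch in plain Java. It is NOT the Lucene or Tika API: TinyIndex is a made-up class, and the text passed to add() stands in for whatever an extraction tool would pull out of each PDF. It only shows the idea of an inverted index, which maps each word to the files that contain it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index for illustration only; not the Lucene API.
class TinyIndex {
    // word -> names of the documents that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Steps 2 and 3 (indexing): split already-extracted text into words and record them.
    void add(String docName, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docName);
            }
        }
    }

    // Step 3 (searching): return documents containing every word of the query.
    List<String> search(String query) {
        Set<String> result = null;
        for (String token : query.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) continue;
            Set<String> docs = postings.getOrDefault(token, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        // The strings below are invented sample text, not real magazine content.
        index.add("1962-summer.pdf", "Sports day results and the cricket team report");
        index.add("1963-winter.pdf", "School play review; cricket fixtures for next term");
        System.out.println(index.search("cricket"));        // [1962-summer.pdf, 1963-winter.pdf]
        System.out.println(index.search("cricket report")); // [1962-summer.pdf]
    }
}
```

    A real engine such as Lucene adds much more on top of this (ranking, stemming, phrase queries, an on-disk index), which is why using the library is usually preferable to rolling your own.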

    The above assumes you are a developer and comfortable with Java. Lucene is a library/API, not a complete product; you need to write code to interface with it. If you want an out-of-the-box index/search server built on top of Lucene, you can try Apache Solr or Apache Nutch, which are finished products ready for use.

    You can visit the websites below to learn more.

    http://lucene.apache.org/
    http://nutch.apache.org/ - promoted to top-level Apache project 11 May 2010
    http://tika.apache.org/ - promoted to top-level Apache project 11 May 2010

    In time, all of them should be good open-source alternatives.

  3. #3
    Join Date
    Jan 2007
    Posts
    196
    Thanks so much for the quick reply. There's much food for thought and I will explore Lucene.
