Can I develop my own standalone search engine?


spiresgate
05-14-2010, 01:20 AM
I have about 80 old school magazines in searchable pdf files. I want to put these on a disc and then add a search facility specifically for the files on the disc.

I do not want to use an external facility, although I would like it to look like those on offer.

A very long-winded way would be to use the Windows Search facility to locate the files and then search each revealed file individually.

Any suggestions? Is this the right forum?

sohguanh
05-14-2010, 03:04 AM
Do you want to search for a particular PDF file, or do you want to search for keywords within each PDF file?

Searching for a particular PDF file should not be very difficult, whichever programming language and platform you choose.
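As a rough illustration of the file-name case, here is a minimal Python sketch using only the standard library. The function name `find_pdfs` and the folder name are made up for the example; it simply matches a term against PDF file names, ignoring case.

```python
from pathlib import Path


def find_pdfs(root, term):
    """Return PDF files under root whose file name contains term (case-insensitive)."""
    term = term.lower()
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf" and term in p.stem.lower()
    )


if __name__ == "__main__":
    # "magazines" is a hypothetical folder name; point it at the disc contents.
    for path in find_pdfs("magazines", "1965"):
        print(path)
```

This only looks at file names, so it does not help with finding words inside the magazines.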

Searching for keywords within each PDF file is trickier, but I have explored an index/search engine offered by Apache called Lucene.

Unfortunately, Lucene does not come with extraction features. You may need the sub-project Tika to help you.

Step 1
Use Tika to parse your PDF files and extract their text

Step 2
Feed the text extracted in Step 1 into the Lucene engine

Step 3
Use Lucene to build the index, and then search for keywords using Lucene
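To make the index-then-search idea concrete, here is a small Python sketch of an inverted index using only the standard library. It is not Lucene's or Tika's API. The extraction step is replaced by a hypothetical dictionary of already-extracted text, and the names `tokenize`, `build_index` and `search` are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the sorted names of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical stand-in for Step 1: text already extracted from each PDF.
    extracted = {
        "1962_spring.pdf": "Sports day results and the school play",
        "1963_autumn.pdf": "New science block opens; cricket results",
    }
    index = build_index(extracted)
    print(search(index, "results"))
    print(search(index, "cricket results"))
```

A real engine such as Lucene does much more than this, for example relevance ranking and storing the index on disk, but the overall flow of extracting text, indexing it, and then querying the index is the same.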

The above assumes you are a developer and comfortable with Java. Lucene is a library/API, not a complete product, so you need to write code to interface with it. If you want an out-of-the-box index/search server that uses Lucene underneath, you can try Apache Solr or Apache Nutch, which are finished products ready for use.

You can visit the websites below to learn more.

http://lucene.apache.org/
http://nutch.apache.org/ - promoted to top-level Apache project 11 May 2010
http://tika.apache.org/ - promoted to top-level Apache project 11 May 2010

In time, all of them should be good open source alternatives.

spiresgate
05-14-2010, 05:58 AM
Thanks so much for the quick reply. There's much food for thought and I will explore Lucene.