Website - Lucene or ??


Boblebad
08-08-2011, 10:06 AM
Hi

I'm going to build a historical website, which is supposed to hold all of his writings, several books, letters and a magazine, photos (gallery), videos, mp3 lectures, a blog and a forum ..

I've been looking around for a couple of days now, reading about different ways to approach the task, and MySQL alone won't do it because of the need to search through thousands of text pages (books, letters, magazine etc.), so I found Lucene ..

But is this the right app/engine for my project ??

Now, there won't be much traffic on the site at first, but in the future there will be, and I need a system that can handle the load ..

I've read about running Lucene alone and letting it use flat files, and about hooking it together with MySQL, where the database holds the data and Lucene just does the searching - but which is best ??

I would like to work with PDFs, and I would like to have original scans of the books etc. for display as ebooks, where you read it as the actual book, but I would also like the text found by a search to be displayed to users as plain text for quick online reading ..

The search needs to handle single words, multiple words and phrases .. and it needs to work across books and photo and video descriptions too, with the option to be specific about what to search (books, letters, photos etc.)

I know there's a PHP client for Lucene, but I haven't checked it out yet. I would like to use what's around if it fits my project, to cut time and work, especially because I'm not a high-end PHP programmer. I know the basics, and I don't know SQL yet, but if needed I'll read and learn ;-) - I have only worked with databases like Access etc. ..

I hope I have described my project in enough detail that someone can point me in a direction to work in :)

What's the best setup for my website ??

Best regards
Carsten, Denmark

likethecolor
08-14-2011, 05:49 PM
This can certainly be done using Lucene, and it is an outstanding way to get fast full-text search results. Lucene is a library, though, and building what you want directly on top of it will take you a long time. I suggest an application called SOLR (http://lucene.apache.org/solr/tutorial.html). SOLR is built on top of Lucene and handles a lot of the low-level dirty work. It also provides you with some really nice features out of the box. It will handle single words, multiple words and phrases (as well as things you may not have thought about like stemming, facets and filtering out stop words like "an", "a", "the" - e.g., you generally don't want all results containing the term "the", if that makes sense).

SOLR provides a web interface for searching and can return the results in a number of formats (e.g., XML - the default, JSON, PHP code, serialized PHP, and others). It also provides what is known as a Data Import Handler (DIH). Using DIH, SOLR can create a Lucene index from various data sources with little effort.
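To make the web interface a little more concrete, here is a minimal sketch of how a search URL could be put together. It is written in Python only for illustration (the same idea carries over to a PHP curl request). The host, port and path in `SOLR_SELECT` are assumptions that depend on your install, `build_search_url` is a made-up helper, and `type` is a hypothetical field you would have to define yourself to tell books, letters, photos and so on apart. The `q`, `fq`, `wt` and `rows` parameters are standard SOLR query parameters.

```python
from urllib.parse import urlencode

# Assumed location of the select handler; adjust to your own install.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_search_url(text, doc_type=None, fmt="json", rows=10):
    """Build a select URL for a word or phrase search.

    doc_type filters on a hypothetical "type" field (book, letter, photo ...).
    fmt is the response format, e.g. "xml", "json", "php" or "phps".
    """
    # Wrap multi-word input in double quotes so it is searched as a phrase.
    if " " in text:
        text = '"' + text + '"'
    params = [("q", text), ("wt", fmt), ("rows", rows)]
    if doc_type is not None:
        # fq narrows the result set without affecting the relevance score.
        params.append(("fq", "type:" + doc_type))
    return SOLR_SELECT + "?" + urlencode(params)


url = build_search_url("old letters", doc_type="book")
print(url)
```

Leaving `doc_type` out searches everything, which covers the "search across books, photos and videos" case from the question.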

To get started I suggest you use a hybrid approach. Use SOLR for searching and MySQL to store the data for display. SOLR can certainly handle large data sets (we have one index with >20,000,000 documents and still get results in milliseconds). However, keep in mind that a smaller search index will get you faster results. Also, since it sounds like you are more familiar with MySQL than SOLR, the more time you spend in a familiar area the faster the final product will be built (get it working, then tweak/optimize as you get more familiar with SOLR).

Here's a very simplistic example to give you an idea. Say you have one database table indexed on a unique id. Have at least 2 fields in SOLR (for the future, keep in mind that the number of fields is not really limited): ID and DATA. The first call would be to SOLR to get the ID (e.g., a PHP curl request). The next call is to MySQL to fetch the row(s) matching the id(s) returned by SOLR.
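The two calls described above could look roughly like the sketch below. It is Python rather than PHP, and several things are assumptions made so the example runs on its own: the SOLR reply is a canned string in the shape the JSON response writer returns (a real one would come from an HTTP request), an in-memory SQLite database stands in for MySQL, and the `writings` table with its `title` and `body` columns is hypothetical.

```python
import json
import sqlite3

# Canned reply standing in for a real SOLR request; only the ID is returned.
SOLR_RESPONSE = (
    '{"responseHeader": {"status": 0, "QTime": 1},'
    ' "response": {"numFound": 2, "start": 0,'
    ' "docs": [{"id": "3"}, {"id": "1"}]}}'
)


def extract_ids(solr_json):
    """Step 1: pull the matching IDs out of the SOLR reply, best match first."""
    docs = json.loads(solr_json)["response"]["docs"]
    return [int(doc["id"]) for doc in docs]


def fetch_rows(conn, ids):
    """Step 2: fetch the display data for those IDs from the database."""
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    sql = "SELECT id, title, body FROM writings WHERE id IN (%s)" % placeholders
    by_id = {row[0]: row for row in conn.execute(sql, ids)}
    # The database returns rows in its own order; restore SOLR's ranking.
    return [by_id[i] for i in ids if i in by_id]


# SQLite in memory is only a stand-in for the MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE writings (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO writings VALUES (?, ?, ?)",
    [(1, "Letter A", "text of letter A"),
     (2, "Book B", "text of book B"),
     (3, "Magazine C", "text of magazine C")],
)

results = fetch_rows(conn, extract_ids(SOLR_RESPONSE))
for row in results:
    print(row[0], row[1])
```

Re-sorting the rows by the ID list matters because SOLR returns hits by relevance, while an `IN (...)` query gives no such guarantee.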

A few hints on those two fields. SOLR has a concept of an indexed field and a stored field. An indexed field is one that will be used to search against. A stored field is a field that SOLR can return in human-readable form (e.g., the name of a person, historic accounts). Except for debugging (checking by ID to see if a document is in the index) you probably don't need to search on the ID field. However, you will want to return the ID value so you can use it in the database query (i.e., indexed="false" stored="true"). The DATA field is just the opposite. The DATA field will only be used for searching and doesn't need to be returned since you're getting that from the database (indexed="true" stored="false"). It's not essential, but thinking about these things can save space and give a faster response.
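If the indexed/stored distinction is still fuzzy, the toy model below may help. It is a sketch in Python and not how SOLR is actually implemented: `ToyIndex` is a made-up class that keeps a lookup table for indexed fields (what you can search) and a separate copy of stored fields (what you get back).

```python
from collections import defaultdict


class ToyIndex:
    """Toy model of indexed vs. stored fields; not SOLR's real internals."""

    def __init__(self, indexed_fields, stored_fields):
        self.indexed_fields = set(indexed_fields)
        self.stored_fields = set(stored_fields)
        self.postings = defaultdict(set)  # (field, term) -> document numbers
        self.stored = []                  # document number -> stored values

    def add(self, doc):
        doc_no = len(self.stored)
        # Indexed fields are split into terms and become searchable.
        for field in self.indexed_fields:
            for term in str(doc.get(field, "")).lower().split():
                self.postings[(field, term)].add(doc_no)
        # Stored fields are kept as-is so they can be returned in results.
        self.stored.append({f: doc[f] for f in self.stored_fields if f in doc})

    def search(self, field, term):
        hits = sorted(self.postings.get((field, term.lower()), ()))
        return [self.stored[n] for n in hits]


# DATA is searchable but not returned; ID is returned but not searchable.
index = ToyIndex(indexed_fields=["data"], stored_fields=["id"])
index.add({"id": 1, "data": "Letters from Copenhagen"})
index.add({"id": 2, "data": "Lecture notes"})
index.add({"id": 3, "data": "More letters"})

print(index.search("data", "letters"))  # only the IDs come back
print(index.search("id", "1"))          # nothing: ID was never indexed
```

The text of DATA is never kept in readable form here, which is where the space saving comes from.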

I hope all this made sense. Feel free to contact me if you have any questions (I work with SOLR on a daily basis at work and have been doing so for about 2 years): http://www.likethecolor.com/

-Dan