The company I work for has a web site linked to an Index Server catalog. Queries are issued to the catalog via SQL Server, and results are returned for matching documents. The problem is that when documents contain Chinese characters, Index Server is unable to find the Chinese text we search for.
I believe the issue is caused by the word breaker being used on the files. Chinese writing does not require the same delimiting characters (like spaces and carriage returns) as English, and so Index Server treats an entire Chinese sentence as one word. This makes searching for a couple of characters in a long string impossible using our current system.
The Index Server catalog is on a Windows 2003 Server. The database used to issue the queries to Index Server is SQL Server 2005 SP2. A Linked Server has been set up in SQL Server to allow access to Index Server.
The queries we execute look like the following:
Select *
from OpenQuery(FileSystem2 , 'Select FileName , Directory , Rank From Scope('' DEEP TRAVERSAL OF "e:\indextest\"'')
Where CONTAINS ( Contents , ''藏民主改革五'' ) ' ) as q
For example, the query above does not find a document containing the string "改革五藏民主改革五改革五". (Note that the characters being searched for are in the middle of the document's contents.)
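To illustrate what I think is happening, here is a minimal Python sketch. It is not Index Server code: `whitespace_break` and `per_character_break` are hypothetical stand-ins for a word breaker that only splits on whitespace and one that emits every character as its own token, and `contains_phrase` is an assumed, simplified model of how a phrase match works against the resulting tokens.

```python
def whitespace_break(text):
    """Hypothetical stand-in: a breaker that only splits on whitespace."""
    return text.split()


def per_character_break(text):
    """Hypothetical stand-in: a breaker that emits each non-space character as a token."""
    return [ch for ch in text if not ch.isspace()]


def contains_phrase(tokens, query_tokens):
    """Return True if query_tokens occurs as a contiguous run inside tokens."""
    n = len(query_tokens)
    if n == 0:
        return False
    return any(tokens[i:i + n] == query_tokens
               for i in range(len(tokens) - n + 1))


document = "改革五藏民主改革五改革五"
query = "藏民主改革五"

# Whitespace breaking sees the whole document as a single token,
# so the shorter query token never equals it.
whole_word_hit = contains_phrase(whitespace_break(document),
                                 whitespace_break(query))

# Per-character breaking lets the query match as a run of characters.
per_char_hit = contains_phrase(per_character_break(document),
                               per_character_break(query))

# English content still matches under whitespace breaking.
english_hit = contains_phrase(whitespace_break("notes on democratic reform"),
                              whitespace_break("democratic reform"))

print(whole_word_hit, per_char_hit, english_hit)
```

Under these assumptions the whitespace breaker misses the Chinese match while the per-character breaker finds it, which is the behaviour I am hoping the real Chinese word breaker would give us.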
From what I've read on the subject, a Chinese word breaker does exist in Index Server, and it breaks text up into individual characters, which I think might be what we want. However, it appears this isn't being used. Documentation on MSDN explains that it's possible to put an "MS.locale" meta tag in HTML documents to make Index Server use the desired word breaker. Unfortunately, the documents we have are all in MS Word format.
So I've got a few questions. How does Index Server decide which word breaker to use on an MS Word file? Is it possible to put wildcards into a search of the "Contents" property so we can find partial words? Are there any other solutions to this issue? We obviously don't want the fix to affect the system's ability to find documents containing English content.