Seek And Ye Shall Be Found -- Or Else
by Eric C. Richardson
Big companies like Microsoft have them. Itty-bitty companies with itty-bitty Web sites have them, too. And if you don't have one, you may find your site has become an unreachable island in the vast sea that is the World Wide Web.
What we're talking about here is a search engine running on your server. If you can't offer people visiting your site a way to search the contents of the site, they cannot be expected to easily move through the pages looking for things they find interesting. And if they can't find what they're looking for, there's really no realistic hope of them coming back. In this article, we'll take a look at what you need to do to get your search engine going, and also examine what's on the market, from freeware to the priciest commercial programs.
How It Works
The current "standard" in creating a Web site search is based on WAIS (Wide Area Information Server), which in turn builds on the Z39.50 standard. WAIS sets forth the basic method for reading the contents of a directory (or directories) and creating an index of words. Although WAIS itself never really became the standard in practice, it did create the basic structure of the search engine that programmers follow, using separately executable modules instead of one large program.
Search engines usually have two major modules that work with each other. The first piece is the indexer module, which is responsible for reading all the information in your database. The database can consist of all of your HTML files as well as a portion of other files (usually ASCII or RTF). The finished index file will contain pointers to the exact place in each document where the indexed words can be found. The index file can grow to monstrous size if the amount of initial data to be indexed is also very large, though this also depends on the indexing algorithm.
The second module is the actual search engine itself--that is, the module that will interrogate the indexed file. Once the indexer creates the index--a file with a count of words and the frequency they appear in the area to be searched--the search module will read the index file and compare words entered by the user to words in the index file, returning the results.
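The two-module split described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the code of any product covered here: the function names (`build_index`, `search`), the in-memory sample pages, and the ranking rule (total occurrences of the query words) are all invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and yield (word, character offset) pairs."""
    for match in re.finditer(r"[a-z0-9]+", text.lower()):
        yield match.group(), match.start()


def build_index(documents):
    """Indexer module: map each word to the documents and offsets where it occurs.

    `documents` is a dict of {name: text}; a real indexer would read files from disk.
    """
    index = defaultdict(lambda: defaultdict(list))
    for name, text in documents.items():
        for word, offset in tokenize(text):
            index[word][name].append(offset)
    return index


def search(index, query):
    """Search module: return documents containing every query word, most hits first."""
    words = [word for word, _ in tokenize(query)]
    if not words or any(word not in index for word in words):
        return []
    # Keep only the documents that contain all of the query words.
    matches = set(index[words[0]])
    for word in words[1:]:
        matches &= set(index[word])

    def hits(name):
        # Total occurrences of the query words in one document.
        return sum(len(index[word][name]) for word in words)

    # Most occurrences first; ties are broken alphabetically by name.
    return sorted(matches, key=lambda name: (-hits(name), name))


# Hypothetical three-page site used only for this example.
pages = {
    "index.html": "Welcome to our site. Search the site for products.",
    "products.html": "Our products include search software.",
    "contact.html": "Contact us by e-mail.",
}
site_index = build_index(pages)
print(search(site_index, "search products"))
```

Because the index stores an offset for every occurrence of every word, it is easy to see how an index built this way could approach the size of the original data.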
Although all search engines run by these basic principles, they do differ. Some use phrase/word combinations, some will return an abstract of the returned document, and others will just return a small link to the document. Before installing a search engine, decide which is best from the standpoints of ease of use, machine and financial resources, and how much trouble and time it takes to re-index all the documents when any of them change.
Freeware and Shareware Search Engines
Though choosing a freeware or shareware search engine makes good economic sense for smaller sites, beware: It can take longer to install, and technical support may be difficult to obtain. And integrating a shareware search engine usually requires some programming skill. But if the need for a Web search utility is not extreme, going low-budget may be the right move.
When installing freeware search engines, the settings in the source code or makefile usually reflect the specific server it was developed on. Whoever downloads the software and attempts to install it will need to make sure that all the paths and specific directory structure variables reflect the local server. There is usually not a high degree of programming skill needed to alter the source code, but it is common to find the help files very unhelpful when trying to figure things out. Usually, familiarity with Perl or C++ is needed to work on the source code of search engines. Finally, it can take longer than expected to set freeware up, so if time is crucial, commercial search engines are probably the route to take.
Harvest. This is a very good platform for generating search results formatted in HTML; it was designed to return all hits as URLs and has the advantage of working well with SGML (Standard Generalized Markup Language). Like WAIS, Harvest has distinct major components: The gatherer acts as the indexer, and the broker acts as the search engine.
Harvest is one of the easiest search engines to add to your site. Both the gatherer and the broker need to be edited to reflect the specific directory structure of the host computer; making those changes is well documented in the source code and in the online help available at the Harvest site. Once the programs are compiled, they need little alteration.
Harvest is written in C++ and designed to run on Sun operating systems (SunOS and Solaris) or DEC's OSF/1, but it also supports a wide range of other Unix-based systems such as Linux, AIX, and BSDI. The Harvest site offers abundant notes on porting the search engine to other systems. The major problem with Harvest lies not in use or ease of installation, but in the index file: It can grow as large as the original database, a 1:1 ratio. Harvest is therefore not for you if hard-drive space is limited.
ICE. ICE is a search engine developed by Christian Neuss, a German programmer. Neuss is trying to market ICE, but as of now it is still shareware, with a $50 registration fee requested. ICE is a simple, efficient search utility that allows basic searches on the Web server.
ICE is a bit difficult to set up, as it requires a large amount of configuration in its source code, which is written in Perl 4.0. Once that is done, the indexer is not very difficult to run from the command line. Like any simple search engine, ICE has few automatic features, so the Webmaster will need to run the indexer manually whenever the site is updated or changed. The finished index file is about 60 percent of the size of the original database, depending on the information indexed. Once configured, ICE is moderately easy to run. It is very small (less than 30 KB) and gets fair results.
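For any engine whose indexer must be run by hand, a small check can tell the Webmaster when the index has gone stale. The sketch below is not part of ICE; it simply compares file modification times, and the file names (`index.html`, `site.idx`) and the function `needs_reindex` are assumptions made up for the example.

```python
import os
import tempfile


def needs_reindex(index_path, document_paths):
    """Return True if the index is missing or older than any document."""
    if not os.path.exists(index_path):
        return True
    index_time = os.path.getmtime(index_path)
    return any(os.path.getmtime(path) > index_time for path in document_paths)


# Demonstration in a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as site:
    page = os.path.join(site, "index.html")
    index_file = os.path.join(site, "site.idx")
    with open(page, "w") as handle:
        handle.write("<html>hello</html>")

    missing_index = needs_reindex(index_file, [page])  # no index exists yet

    with open(index_file, "w") as handle:
        handle.write("placeholder index")

    # Set explicit timestamps so the comparison does not depend on timing.
    os.utime(index_file, (1_000_002_000, 1_000_002_000))
    os.utime(page, (1_000_001_000, 1_000_001_000))
    fresh = needs_reindex(index_file, [page])  # page is older than the index

    os.utime(page, (1_000_003_000, 1_000_003_000))
    stale = needs_reindex(index_file, [page])  # page changed after indexing

print(missing_index, fresh, stale)
```

A check like this could be run on a schedule, with the indexer started only when it returns True.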
Glimpse/WebGlimpse. The twin programs of Glimpse (the indexer) and WebGlimpse (the user interface and search engine) are a product of the University of Arizona's computer science department. Designed for SPARC systems and DEC Alphas, Glimpse was initially a WAIS-like searcher.
When WebGlimpse was added to augment the search utility, the programs took on a very professional look. Unlike many of the low-end or free search utilities, WebGlimpse lets the Webmaster set the index file size. Depending on how fast the searches need to be, the index file can be generated at 2 percent to 3 percent; 7 percent to 9 percent; or 20 percent to 30 percent of the size of the original data. Webmasters short on hard-drive space will be able to use this product set to the lowest index size and still reap the benefits of Glimpse, though with a slight sacrifice in search speed.
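To see what those three settings mean in disk space, the percentages can be applied to a site of a given size. The sketch below uses only the ranges quoted above; the 500 MB site and the labels "smallest," "middle," and "largest" are assumptions for illustration, not Glimpse option names.

```python
def index_size_range(data_mb, low_percent, high_percent):
    """Return the (low, high) index size in MB for a given amount of data."""
    return data_mb * low_percent / 100, data_mb * high_percent / 100


# Percent ranges quoted in the article; the labels are hypothetical.
INDEX_SETTINGS = {
    "smallest": (2, 3),
    "middle": (7, 9),
    "largest": (20, 30),
}

site_mb = 500  # assumed size of the data to be indexed
for label, (low, high) in INDEX_SETTINGS.items():
    low_mb, high_mb = index_size_range(site_mb, low, high)
    print(f"{label}: {low_mb:.0f} MB to {high_mb:.0f} MB")
```

For the assumed 500 MB site, the smallest setting needs roughly 10 MB to 15 MB of index, while the largest needs 100 MB to 150 MB.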
Mid-Range Search Utilities
Excite. Not surprisingly, some of the high-end Web search sites have very solid search engines running behind their interfaces. Excite is no different, but what makes Excite unique is that it markets a very inexpensive version of its search engine. The Excite Web Search (EWS) utility uses concept-based searching. As Excite states on its Web site: "When EWS goes through its indexing process, it uses probabilistic techniques to analyze the interrelationships between words within a collection of documents. This index supports concepts-based capabilities." In other words, EWS aims for a more "human" search technique, looking for patterns in the documents rather than simply counting occurrences of words.
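Excite does not spell out its technique beyond the statement quoted above, so the following is not EWS's algorithm. It is a much cruder stand-in that shows the general idea of using relationships between words: count how often words appear in the same document, then widen a query with the words most often seen alongside it. The sample documents and the function names `cooccurrence` and `expand_query` are invented for the example.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations


def words_in(text):
    """Return the set of distinct lowercase words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def cooccurrence(documents):
    """Count, for each pair of words, how many documents contain both."""
    related = defaultdict(Counter)
    for text in documents:
        for first, second in combinations(sorted(words_in(text)), 2):
            related[first][second] += 1
            related[second][first] += 1
    return related


def expand_query(related, word, limit=2):
    """Return the word plus up to `limit` words that most often appear with it."""
    # Most frequent companions first; ties are broken alphabetically.
    ranked = sorted(related[word].items(), key=lambda item: (-item[1], item[0]))
    return [word] + [other for other, _ in ranked[:limit]]


# Hypothetical document collection.
docs = [
    "car engine repair",
    "car engine oil",
    "automobile engine repair",
    "garden flowers",
]
related = cooccurrence(docs)
print(expand_query(related, "car"))
```

A search for "car" widened this way would also reach the "automobile engine repair" document through the shared word "engine," which a plain word match would miss.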
Excite 1.0 is free to download and use, though an annual support and update contract will cost you $995 if you need it. EWS works on a wide variety of platforms, including Solaris, SGI Irix, and Windows NT. It is one of the few search engines to run well on Unix and NT platforms, which is a very important extra and will make migration from one OS to the other that much easier.
AskSam/askSam Web Publisher. Just as Glimpse and WebGlimpse use two separate programs together, askSam has created an interesting team of programs that goes far beyond the simple indexer-searcher.
The askSam program, which is in part an indexer, has a wide range of filters built into it to accept many different data types. You can import a Word document or Eudora mailbox, and askSam will automatically convert it into a proprietary askSam database. The great thing is that the documents in that database can easily be exported directly to HTML if you want. This is an excellent labor-saver and keeps the MIS team from acting as text-to-HTML converters for the many different formats available; askSam's user interface is virtually the same as a high-end word processing program, so that anyone can use it.
The index file runs in the neighborhood of 30 percent of the original askSam database file. AskSam Web Publisher may be used along with your server software to publish the database directly to the Web, providing search capabilities without having to export the documents individually to HTML.
A unique feature of askSam Web Publisher is that, if a database holds a relatively small amount of information, it can search the whole database quickly on the fly. Because no index file is required, it's quite possible to set it up to run on a server with limited resources. The askSam products will run on either a Windows 95 or Windows NT system, and the two together take up less than 6.5 MB of disk space fully installed. The askSam Web Publisher costs about $1,495, while askSam's price is $395. Both software packages are needed in order to publish directly to the Web; the fact that they are sold as different packages might be confusing to some users.
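Searching without an index amounts to reading every record at query time. The sketch below shows that general approach in Python; it says nothing about how askSam itself is built, and the record names and the function `scan_search` are made up for the example.

```python
def scan_search(records, query):
    """Return the names of records whose text contains every query word.

    No index is built: every record is read on each search. Matching is a
    simple case-insensitive substring test.
    """
    words = query.lower().split()
    if not words:
        return []
    return [
        name
        for name, text in records.items()
        if all(word in text.lower() for word in words)
    ]


# Hypothetical small database.
records = {
    "memo-1": "Quarterly sales figures for the Western region",
    "memo-2": "Sales meeting moved to Thursday",
    "memo-3": "Parking lot will be repaved",
}
print(scan_search(records, "sales region"))
```

The cost of each search grows with the total size of the database, which is why this approach suits small collections and an index pays off for large ones.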
Although askSam works very well on the Internet, it really shines on intranets. By using so many filters, askSam allows many different people with no HTML skills to turn new or legacy data into useful databases created by askSam. Accounting can input Excel files, while personnel can enter Word files, and askSam will take care of converting.
AnchorPage. Once you've installed and run Iconovex's AnchorPage a few times, you find that it almost runs by itself: Just fire it up and collect your finished index at the other end as HTML, ready to go on your server. AnchorPage works from a control list, which is a series of words given extra weight during a search. To narrow search areas, you simply edit this very large list (with an included editor) during installation to bring about the best results. You can also define nebulous ideas such as concepts, which you can then control with a built-in "threshold" setting that works in conjunction with the control list. AnchorPage databases are Rich Text Format (RTF) files, and the program converts the RTF into HTML as it indexes. It will extract phrases and concepts from searched documents and then return a page with proper HTML links to your pages. The final index file is in the area of 20 percent to 30 percent of the size of the parent database.
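How a weighted word list and a threshold can work together is easiest to see in a small example. The sketch below is one plausible reading of the idea, not AnchorPage's actual method, which is not documented here: each word on the list carries a weight, a document's score is the sum of the weights of the listed words it contains, and documents scoring below the threshold are dropped. The weights, file names, and function names are all assumptions.

```python
import re


def score(text, control_list):
    """Sum the weights of every control-list word occurring in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(control_list.get(word, 0) for word in words)


def select(documents, control_list, threshold):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, score(text, control_list)) for name, text in documents.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical control list: words with extra weight.
control_list = {"invoice": 3, "payment": 2, "overdue": 1}

# Hypothetical documents.
documents = {
    "billing.rtf": "Invoice and payment terms. Payment is due in 30 days.",
    "reminder.rtf": "Your payment is overdue.",
    "picnic.rtf": "Company picnic on Friday.",
}

print(select(documents, control_list, threshold=3))
print(select(documents, control_list, threshold=5))
```

Raising the threshold narrows the results to the documents that match the weighted words most strongly; editing the weights changes which documents those are.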
There are two features about this software that make it an excellent choice for many sites: the automatic indexing features associated with the software, and the price. For $295, you get a full version of AnchorPage with all appropriate support. One additional feature is that AnchorPage will even run on a 486 running Windows 3.x or Windows 95.
Magnet Find For Web Servers. Formerly known as CompasSearch Web server, this is one of a large suite of programs from Compassware. Compassware defines its software as a "find" engine, not a search engine. The company has created its own search algorithm and attaches the "magnet" term to all of its software. The magnet search engine at the heart of Magnet Find For Web Servers (MFWS) goes far beyond the Boolean searches we are all used to making. MFWS uses conceptual querying, which assigns weights to concepts and links; those concepts are the basis for the searching.
There are many options that the end user can apply to this software, such as accessing the full text of a document's content; flagging documents for later printout; and setting the exact threshold for the conceptual queries.
MFWS is a stand-alone application that works extremely well for Web servers. Compassware also provides a developer's kit to create an array of Magnet Find-based programs designed to link with each other.
MFWS runs on Windows NT 3.51 and higher, and on Solaris 2.4 and up. Compassware is in the process of porting the software to other platforms and will release those versions as the need arises. One area in which MFWS stands out is the size of its index file relative to the database. The company currently states that a 2 GB database can be reduced to a 100 MB index file--a whopping 95 percent smaller than the original.
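The 95 percent figure can be checked with a line of arithmetic. The short sketch below works it out both ways a "2 GB" database might be counted (2,000 MB or 2,048 MB); the function name `reduction_percent` is just a label for the example.

```python
def reduction_percent(original_mb, index_mb):
    """Return how much smaller the index is than the original, in percent."""
    return (original_mb - index_mb) / original_mb * 100


# A 2 GB database counted as 2,000 MB and as 2,048 MB.
print(round(reduction_percent(2000, 100), 1))
print(round(reduction_percent(2048, 100), 1))
```

Either way the index comes to about 5 percent of the original data, well below the 20 percent to 100 percent figures quoted for the other products in this article.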
Surfboard. Fulcrum Technologies Inc.'s popular standard, Surfboard, is also part of a large suite of programs for searching your site. Surfboard has filters that allow HTML, MS Office documents, and SQL databases to be indexed. Surfboard will convert from the database to HTML on the fly, reducing the time it takes to return a query.
You can begin a Surfboard search by using natural language, without the odd syntax of Boolean modifiers. Surfboard uses intuitive searching, which lets users basically teach the search engine what they like and don't like in the way of returned information. This makes search results very high in quality for the end user.
What really sets Surfboard apart from its Windows NT competition is the inclusion (as of September) of an ActiveX control interface. This ActiveX viewer, sold as a workstation add-on with the purchase of Surfboard, will be a large help to Microsoft's attempts to establish ActiveX as a standard.
Pricing for Surfboard has come down considerably in the past few months to $6,250 for a server license and $995 for a bulk order of 20 workstation ActiveX viewers. Sites running Windows NT 4.0 will find that Surfboard has very definite advantages.
Making price your leading consideration is an option if you have time to set up a search engine, but if time itself is your leading worry, the cost of the commercial search engines will surely be offset by the support and documentation available for your site. While it is tempting to use the freeware and shareware, it is not at all recommended unless someone in-house has excellent programming skills for extending the search engine.
The cost of the larger search engines is decreasing, and as the technology behind the search engines changes and improves, more companies will be getting into commercial search engine coding. This competition will make high-quality search engines less expensive and more prevalent. A year ago, the Surfboard search engine (version 1.0) from Fulcrum cost $15,000; the new version is less than half that price now. At this rate, it will not be a stretch for smaller companies to be able to afford quality search engines with proper support. The winners? Users and Webmasters alike.