Chapter 12:  Indexing and Searching


This chapter explains how to use SWISH and WWWWAIS, Stronghold's standalone indexing and search facilities, including how to configure each program, create a site index, and create a search interface.


SWISH and WWWWAIS

Stronghold comes with two standalone programs for indexing your site and offering search capabilities to users: SWISH, which creates site indexes, and WWWWAIS, which searches the indexes that SWISH creates.

Configuring SWISH

Stronghold's installation program places SWISH in ServerRoot/swish, and its configuration file is ServerRoot/conf/swish.conf. Like httpd.conf, it's a simple text file, but it does not use containers.

This section explains the SWISH configuration directives, including ReplaceRules and FileRules, the directives that set the indexing parameters.

IndexDir

Syntax: IndexDir directory
Context: swish.conf

IndexDir sets the directory that SWISH indexes. Usually, this is the same as the value you set for the DocumentRoot directive in httpd.conf. You can use more than one of these directives, although each can take only one value. For example, if the Web documents for your virtual hosts are in different directories than the main server documents, you can enter something like this:

IndexDir /usr/local/ww/htdocs
IndexDir /usr/local/ww/vhosts/vhost1
IndexDir /usr/local/ww/vhosts/vhost2
...

SWISH indexes each directory recursively.

IndexFile

Syntax: IndexFile filename.swish
Context: swish.conf

This directive sets the path to the index file, where SWISH saves the results of each indexing sweep. Be sure to include a .swish filename suffix.

IndexOnly

Syntax: IndexOnly .suffix1 [.suffix2 .suffix3 ...]
Context: swish.conf

IndexOnly specifies the types of files that SWISH is allowed to index, identified by their filename suffixes. For example, you can limit the index to HTML and PHP files like this:

IndexOnly .html .phtml .php

This directive is case-sensitive. If you omit it, SWISH indexes every file in the directories specified by IndexDir.

IndexReport

Syntax: IndexReport 0|1|2|3
Context: swish.conf

This directive sets the reporting option, which can be an integer from 0 through 3, 3 being the most verbose output option.

FollowSymLinks

Syntax: FollowSymLinks yes|no
Context: swish.conf

When FollowSymLinks is set to "yes," SWISH follows symbolic links while indexing. When it is set to "no," SWISH ignores them.

NoContents

Syntax: NoContents .suffix1 [.suffix2 .suffix3 ...]
Context: swish.conf

SWISH can index entire files or only their filenames. NoContents sets the suffixes of files whose contents SWISH should ignore. For example:

NoContents .ps .gif .au .hqx .xbm .mpg .mpeg .pict .jpg .jpeg ...

For files with these suffixes, SWISH indexes only the filenames. This directive is case-sensitive.

IgnoreWords

Syntax: IgnoreWords word1 [word2 word3 ...]
Context: swish.conf

Certain frequently-occurring words are irrelevant for indexing and searching purposes, such as prepositions, pronouns, articles, and indexicals. IgnoreWords sets the list of words that SWISH ignores. SWISH comes with a default list of several hundred common words, and it adds this list if one of the values for IgnoreWords is "SWISHDefault."

Along with IgnoreLimit, this directive can save CPU resources and control the size of the index file.

IgnoreLimit

Syntax: IgnoreLimit percentage integer
Context: swish.conf

IgnoreLimit, like IgnoreWords, provides a method of filtering out frequently occurring words when indexing. Each IgnoreLimit directive takes two values: a percentage of files and a number of files. SWISH ignores a word only when it meets both thresholds.

For example:

IgnoreLimit 80 256
IgnoreLimit 50 50

The first instance instructs SWISH to ignore words that occur in at least 80 percent of files and in at least 256 separate files. The second instance instructs SWISH to ignore words that occur in at least 50 percent of files and in at least 50 separate files.
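The two-threshold test can be sketched in a few lines of Python. This is only an illustration of the rule as described in this section, not SWISH's own code; the function name and its parameters are hypothetical.

```python
def should_ignore(files_with_word, total_files, percentage, min_files):
    """Return True when a word meets both IgnoreLimit thresholds.

    Hypothetical helper: it models the rule described in the text and
    is not taken from the SWISH source.
    """
    if total_files == 0:
        return False
    # Integer comparison avoids rounding problems with percentages.
    meets_percentage = files_with_word * 100 >= percentage * total_files
    meets_file_count = files_with_word >= min_files
    return meets_percentage and meets_file_count


# IgnoreLimit 80 256: a word found in 900 of 1000 files meets both limits.
print(should_ignore(900, 1000, 80, 256))

# The same share on a small site (90 of 100 files) fails the second
# limit, because the word does not occur in at least 256 files.
print(should_ignore(90, 100, 80, 256))
```

The second threshold keeps small sites from losing useful words that merely happen to appear on most of a handful of pages.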

IndexName

Syntax: IndexName "name"
Context: swish.conf

IndexName sets the title of the index file.

IndexDescription

Syntax: IndexDescription "description"|URL
Context: swish.conf

IndexDescription is a short description of the index file, or the URL of a description file.

IndexPointer

Syntax: IndexPointer URL
Context: swish.conf

URL is the location of your site's home page.

IndexAdmin

Syntax: IndexAdmin "administrator information"
Context: swish.conf

IndexAdmin gives descriptive information about the administrator responsible for Web indexing. SWISH includes this information in the index file.

ReplaceRules

Syntax: ReplaceRules replace|append|prepend "string" ["replace-string"]
Context: swish.conf

Since SWISH does not read the httpd.conf file, it knows nothing about the aliases and virtual hosts you've set up for your site.

ReplaceRules operates on the paths of the files that SWISH indexes, converting them into URLs. For example:

ReplaceRules prepend "http://"
ReplaceRules replace "/usr/local/httpd/htdocs/" "www.mainhost.com/"
ReplaceRules replace "/usr/local/httpd/vhost/vhost1/" "www.vhost1.com/"
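To see how rules of this kind turn a file path into a URL, consider the following Python sketch. It assumes that the rules are applied in the order listed and that "replace" performs a plain substring substitution; both are assumptions made for illustration, and SWISH itself may behave differently. The function name and the sample path are hypothetical.

```python
def apply_rules(path, rules):
    """Apply (operation, string, replacement) rules to a path, in order.

    Illustrative model only: assumes in-order application and plain
    substring replacement.
    """
    for operation, string, replacement in rules:
        if operation == "replace":
            path = path.replace(string, replacement)
        elif operation == "prepend":
            path = string + path
        elif operation == "append":
            path = path + string
        else:
            raise ValueError("unknown operation: " + operation)
    return path


rules = [
    ("prepend", "http://", None),
    ("replace", "/usr/local/httpd/htdocs/", "www.mainhost.com/"),
]

# Prints http://www.mainhost.com/docs/index.html
print(apply_rules("/usr/local/httpd/htdocs/docs/index.html", rules))
```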

FileRules

Syntax: FileRules operator string1 [string2 string3 ...]
Context: swish.conf

FileRules is the opposite of IndexOnly; instead of limiting the indexed files inclusively, it limits them exclusively. SWISH ignores all files that match the parameters you have set with this directive. Operator is one of the following:

Operator Description

pathname contains

This operator takes one or more strings, and SWISH ignores any directory or file whose path contains one of these strings.

filename

This operator is case-insensitive and specifies a single, exact filename, without a path.

filename contains

This operator is case-sensitive and specifies one or more filename strings. SWISH ignores any file whose filename (not path) contains one of these strings.

title contains

This operator is case-insensitive and specifies one or more strings. SWISH reads the contents of each <TITLE> tag in HTML, and ignores any file whose title contains one of these strings.

directory contains

This operator is case-sensitive, and specifies one or more filename strings. SWISH ignores an entire directory if it contains a filename that includes one of these strings.


Configuring WWWWAIS

Stronghold's installation program places WWWWAIS in ServerRoot/cgi-bin, and its configuration file is ServerRoot/conf/wwwwais.conf. Like swish.conf, it's a simple text file that does not use containers.

This section explains the WWWWAIS configuration directives.

PageTitle

Syntax: PageTitle "title"|filename
Context: wwwwais.conf

This sets the title of the search results file. If the quoted value is a string, only the string is used. If the value is a filename, WWWWAIS prepends the contents of the file to the search results.

SelfURL

Syntax: SelfURL "URL"
Context: wwwwais.conf

This is the URL of the WWWWAIS search engine.

MaxHits

Syntax: MaxHits n
Context: wwwwais.conf

The integer value for MaxHits is the maximum number of search results WWWWAIS is allowed to return.

SortType

Syntax: SortType type
Context: wwwwais.conf

WWWWAIS sorts its search results according to the value for SortType. Type can be one of the following:


Type Description

score

Sort the results according to the relevancy score assigned to each, and list the most relevant results first.

lines

Sort the results according to the lengths of the resulting documents, measured by the number of lines in each.

bytes

Sort the results according to the byte size of each file found.

title

Sort the results alphabetically by title.

type

Sort the results by filetype.

AddrMask

Syntax: AddrMask all|IP1 [ IP2 ...]
Context: wwwwais.conf

AddrMask specifies the IP addresses that are authorized to access the search gateway. The value can be "all," in which case any host can use the gateway, or a list of IP addresses. IP values can include wildcard strings.

SwishBin

Syntax: SwishBin path
Context: wwwwais.conf

SwishBin sets the path to the SWISH indexing engine.

SwishSource

Syntax: SwishSource path "description"
Context: wwwwais.conf

This directive sets the path and description of the SWISH index file.

SourceRules

Syntax: SourceRules replace|append|prepend arg1 [arg2 ...]
Context: wwwwais.conf

Search results are pathnames from the index file generated by SWISH. To make this information useful to users, you can use SourceRules to modify the paths. For example:

SourceRules replace "/www/" "http://your.host.com/"

This converts document root paths to proper URLs that take advantage of the DocumentRoot alias.

WaisSource

Syntax: WaisSource path "description"
Context: wwwwais.conf

This directive sets the source descriptions for WAIS sources. For WAISSEARCH sources, the syntax is:

WaisSource hostname port path "description"

UseIcons

Syntax: UseIcons yes|no
Context: wwwwais.conf

WWWWAIS can include icons in the search results page, according to the option set in UseIcons.

IconUrl

Syntax: IconUrl URL
Context: wwwwais.conf

This directive sets the location of your icons directory. WWWWAIS uses the icons in this location if UseIcons is set to "yes."

TypeDef

Syntax: TypeDef .suffix "type description" iconfile MIME-type
Context: wwwwais.conf

TypeDef matches filename suffixes to filetype descriptions, icon files, and MIME types. WWWWAIS uses this information to generate and sort search response pages. Use as many of these directives as you need.


Creating a Site Index

Each site on your server platform requires its own site index. For example, if you have many virtual hosts, users who access one host must be able to search that host without receiving results from other hosts. Before you can use the WWWWAIS search facility, you must use SWISH to create a site index for each virtual host on your server. You must also update these indexes periodically to ensure that search results are current. With a little creativity, you can create scripts that automate these tasks for you.

The SWISH executable is ServerRoot/swish/swish. To create a site index, run SWISH from the command line, using the following flags to specify options:


Flag Description

-i dir file

Create an index of the specified directory or file. Use this flag to index the directory that contains an individual virtual host's files. SWISH indexes directories recursively. You can specify multiple directories or files, separated by spaces.

-w word1 [word2 ...]

Search only for the specified keywords. You can use "and," "or," "not," and parentheses. The search is case-insensitive, and SWISH evaluates your criteria from left to right.

-t H|B|t|h|e|c

Search only the specified HTML tags:

H=<HEAD>

B=<BODY>

t=<TITLE>

h=<hn>

e=<EM>

c=<!--comment-->

-f filename1 [filename2]

Create or search the specified file. If you are indexing, you can specify only one file: the index file for SWISH to create. The default is index.swish in the current directory. If you are searching, you can give a list of files to search.

-c filename

Index using the specified configuration. You can use this flag to implement custom SWISH configuration for individual hosts, or to eliminate the need for command-line flags by including all options in the configuration file.

-v 0|1|2|3

Run SWISH with the specified level of verbosity. 0 is silence, and 3 is the most verbose. The default is 3.

-l

Follow symbolic links when indexing.

-m n

Return no more than n, the maximum number of results. When n is not specified, SWISH assumes 40. To set no limit on the number of results, specify 0 or "all," or omit this flag. The value given in the configuration file overrides any value you specify with this flag.

-M index1 index2 ... filename

Merge the specified index files into one file, specified by filename. SWISH removes all redundant data during this operation.

-D filename

Decode the specified index file.

-V

Print the current version number.

Once you have an index file for a host, any HTML search interface for that host must reference the appropriate index file. Make sure the administrator of each virtual host has access to that host's index file and to the instructions contained in the next section, "Creating a Search Interface."
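As one possible starting point for automating index creation, the following Python sketch builds a SWISH command line for each virtual host using the -c, -i, -f, and -v flags described above. The installation path, the host names, the directory layout, and the helper function are all hypothetical; substitute the values for your own server. The sketch only prints the commands, so you can review them before running anything.

```python
# Hypothetical location of the SWISH executable (ServerRoot/swish/swish).
SWISH = "/usr/local/www/swish/swish"


def build_index_command(doc_dir, index_file, config_file=None, verbosity=1):
    """Return the argument list for one indexing run (hypothetical helper)."""
    command = [SWISH]
    if config_file:
        command += ["-c", config_file]
    command += ["-i", doc_dir, "-f", index_file, "-v", str(verbosity)]
    return command


# Hypothetical virtual hosts and their document directories.
vhosts = {
    "vhost1": "/usr/local/www/vhosts/vhost1",
    "vhost2": "/usr/local/www/vhosts/vhost2",
}

for name, doc_dir in sorted(vhosts.items()):
    # Assumed layout: each host keeps its index in a swish subdirectory.
    index_file = doc_dir + "/swish/index.swish"
    print(" ".join(build_index_command(doc_dir, index_file)))
```

To run the commands rather than print them, you could pass each argument list to subprocess.run, and schedule the script with cron so that the indexes stay current.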


Creating a Search Interface

WWWWAIS searches the site indexes created by SWISH. It supports the Boolean operators AND and OR, and can use either GET or POST. The basic form that passes search parameters to WWWWAIS looks like this:

<FORM METHOD=GET
ACTION="/cgi-bin/wwwwais?sourcedir=/usr/local/www/htdocs/vhost/swish&source=index.swish">
Search for:
<INPUT TYPE=TEXT NAME="keywords" SIZE=40>
<INPUT TYPE=SUBMIT VALUE="Search">
</FORM>

You can customize any search form by appending search options to the URL in the ACTION field. You can use as many options as you like in a single form. Option strings must be separated from the ACTION path by a question mark (?). Options must be separated from each other by ampersands (&). For example:

ACTION="/cgi-bin/wwwwais?source=index.src&keywords=sample+search"

The rest of this section lists the available search options. Many of them are equivalent to certain configuration directives and can be used to override the settings in the configuration file. If you have multiple virtual hosts, it's important to specify the source index file in each search form using the sourcedir and source options. If you have only one host, you can simply set the source index in the configuration file.
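If you generate search links or forms from a script, the separator rules above are easy to get wrong by hand. The following Python sketch assembles an ACTION URL with the standard urllib.parse.urlencode function. The helper name, the index path, and the option values are hypothetical examples.

```python
from urllib.parse import urlencode


def search_url(action, **options):
    """Join search options to the ACTION path with '?' and '&'.

    Hypothetical helper. Spaces in values become '+', and slashes are
    left unescaped so that paths stay readable.
    """
    return action + "?" + urlencode(options, safe="/")


url = search_url(
    "/cgi-bin/wwwwais",
    sourcedir="/usr/local/www/htdocs/vhost/swish",
    source="index.swish",
    keywords="sample search",
    maxhits=20,
)

# Prints /cgi-bin/wwwwais?sourcedir=/usr/local/www/htdocs/vhost/swish&source=index.swish&keywords=sample+search&maxhits=20
print(url)
```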

sourcedir

Syntax: sourcedir=path
Context: ACTION

This option specifies the directory that contains the index database. This setting is especially useful if you have several databases in the same directory, in which case WWWWAIS searches them all.

source

Syntax: source=filename
Context: ACTION

The source option specifies which index database the search engine should search.

maxhits

Syntax: maxhits=n
Context: ACTION

This option sets the maximum number of URLs WWWWAIS returns to the user.

keywords

Syntax: keywords=keyword1[+keyword2+keyword3 ...]
Context: ACTION

The keywords option specifies the keywords to search for.

isindex

Syntax: isindex=keyword1[+keyword2+keyword3 ...]
Context: ACTION

The isindex option works identically to the keywords option.

sorttype

Syntax: sorttype=score|lines|bytes|title|type
Context: ACTION

This option sets the criteria that WWWWAIS uses to sort the results of a search.

version

Syntax: version=true|false
Context: ACTION

When this option is set, WWWWAIS returns its version information and that of WAISQ or WAISSEARCH.

host

Syntax: host=hostname
Context: ACTION

If you are using WAISSEARCH as your search mechanism, you can use this option to specify a remote host to search instead of your local host.

port

Syntax: port=n
Context: ACTION

If you are using WAISSEARCH and the host option, you can use this option to specify the port number to access on the remote host you are searching.

useicons

Syntax: useicons=yes|no
Context: ACTION

When the useicons option is set to "yes," WWWWAIS displays icons with the different files in the search results.

iconurl

Syntax: iconurl=URL
Context: ACTION

If you use the useicons option, use iconurl to specify the location of your icon files.

selection

Syntax: URL?selection="source+description"
Context: ACTION

This option specifies a source description, which must be one of the values set for WaisSource or SwishSource in the wwwwais.conf file.

searchprog

Syntax: URL?searchprog=waisq|waissearch|swish
Context: ACTION

You can use searchprog to specify one of these alternative search programs: waisq, waissearch, or swish.

Only SWISH comes with Stronghold. If you want to use WAISQ or WAISSEARCH, you must install them separately.