Architecture

Search Tool Architecture

The Search Tool consists of two parts: the front-end tool and the back-end service.

The front-end tool is a simple MVC, JSP-based webapp that uses the Search Service back end as the model, simple JSP pages as the view, and a simple Servlet as the controller.
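
As a rough sketch of this wiring, the controller might look like the following. The SearchService interface and the names here are illustrative placeholders, not the actual Sakai search API.

    import java.io.IOException;
    import java.util.List;

    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    /**
     * Illustrative controller servlet: delegates the query to the Search
     * Service (the model) and forwards the matches to a JSP (the view).
     */
    public class SearchControllerServlet extends HttpServlet {

        // In Sakai this would be obtained from the component manager.
        private SearchService searchService;

        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            String query = req.getParameter("q");
            List<String> matches = searchService.search(query, 0, 10);
            req.setAttribute("matches", matches);
            req.getRequestDispatcher("/WEB-INF/jsp/results.jsp")
               .forward(req, resp);
        }
    }

    /** Placeholder for the back-end search interface. */
    interface SearchService {
        List<String> search(String terms, int start, int end);
    }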

The Search Service that provides the back end is somewhat more complex. It consists of three parts:

  1. A Search Service
  2. A Search Index Builder
  3. A supporting persistence model

The API for the above components is deployed in shared. The model, which uses Hibernate, consists of POJOs expressing an anemic domain model; it is also deployed in shared. The implementations of both the Search Service and the Search Index Builder are deployed as Components.

The Search Service

The Search Service exposes three main APIs:

  1. An API to provide search capabilities.
  2. An API to allow administration of the search index.
  3. An API to allow Entities to register to be indexed.

The Search Service Search API

This API allows a caller to search the index using a search string and get an unfiltered list of search matches. The search string may specify a simple free-text search or a complex combination of free text and keywords. Currently keywords include items like the work site ID and Sakai Entity properties.
When specifying the search it is also possible to specify a starting point within the search result set and the number of items to return. This enables the Tool using the search service to request a window into the search result set without incurring the overhead of paging through preceding results.
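
For example, a tool rendering the fourth page of ten results per page would ask only for that window, reusing the placeholder SearchService interface sketched above (the exact signature in the real search API may differ):

    // Zero-based page number and page size chosen by the tool.
    int page = 3;
    int pageSize = 10;
    int start = page * pageSize;    // first match to return (30)
    int end = start + pageSize;     // one past the last match (40)

    // Only matches [30, 40) are returned; the tool never has to page
    // through the preceding results itself.
    List<String> window = searchService.search("cowslip", start, end);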

The Search Service Administration API

The Search Service has an administrative section in its API that allows a Tool to request a rebuild of the parts of the index associated with a work site, or of the index for the entire Sakai instance. It is the responsibility of the tool to implement the security checks this API requires.
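
The shape of that administrative surface might be along the following lines; the names are hypothetical, and the real API should be checked before use.

    /**
     * Illustrative administrative API. The service performs no security
     * checks of its own; the calling tool must gate access itself.
     */
    interface SearchAdministration {

        /** Queue a rebuild of the index entries for a single work site. */
        void rebuildSiteIndex(String siteId);

        /** Queue a rebuild of the index for the entire Sakai instance. */
        void rebuildInstanceIndex();
    }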

Entity Registration

The Search Service contains a sub-component that registers itself as a listener on the Sakai Event Service. Any Sakai Service may post events to this service. If the poster of the event also implements an EntityContentProducer (search API) that responds to the event, its content may be indexed.
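
A minimal sketch of a producer for a hypothetical notes tool follows. The method names mirror the pattern described here, but the real EntityContentProducer interface has more methods and should be consulted directly.

    /**
     * Sketch of a content producer for a hypothetical notes tool. Only the
     * two methods relevant to this discussion are shown.
     */
    public class NotesContentProducer {

        /** Claim only those references that belong to this tool. */
        public boolean matches(String reference) {
            return reference.startsWith("/notes/");
        }

        /** Supply the entity's content as plain text for indexing. */
        public String getContent(String reference) {
            // Retrieval is elided; a real producer would load the entity
            // and digest it to plain text here.
            return lookupNoteText(reference);
        }

        private String lookupNoteText(String reference) {
            return "note body for " + reference;   // stand-in for retrieval
        }
    }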

The Search Index Builder

When the Search Service receives a message, it persists that message in the persistence model. This may happen as a result of an administrative action or as a result of an Event notifying a change to a Sakai entity.

The Search Index Builder monitors the persistence model and responds to a queue of indexing tasks to be performed. In an asynchronous batch mode it wakes up, takes groups of actions from the queue and performs the necessary indexing operations. These operations take one of the following forms (one cycle is sketched after the list):

  1. index an entity reference
  2. re-index registered content (either for the whole instance or for a work site)
  3. re-build the index from scratch (either for the whole instance or for a work site)
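
One indexing cycle might look roughly like the sketch below; every name is illustrative.

    import java.util.List;

    /** Sketch of the builder's batch cycle. */
    abstract class IndexBuilderCycle {

        enum Op { INDEX_REFERENCE, REINDEX_CONTEXT, REBUILD_CONTEXT }

        static class Task {
            Op op;
            String target;   // an entity reference, or a site/instance context
        }

        /** Drain one batch from the queue, apply it, and publish the result. */
        void runOneCycle(List<Task> batch) {
            for (Task task : batch) {
                switch (task.op) {
                    case INDEX_REFERENCE: indexReference(task.target); break;
                    case REINDEX_CONTEXT: reindexContext(task.target); break;
                    case REBUILD_CONTEXT: rebuildContext(task.target); break;
                }
            }
            writeIndexToStorage();       // end of cycle: flush the updated index
            notifySearchersToReload();   // search services pick up the new index
        }

        abstract void indexReference(String reference);
        abstract void reindexContext(String context);
        abstract void rebuildContext(String context);
        abstract void writeIndexToStorage();
        abstract void notifySearchersToReload();
    }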

Where an individual event reference is in the queue, the Index Builder queries all registered EntityContentProducers. If one of them claims responsibility for the reference, the Search Index Builder uses that EntityContentProducer to perform the index operation.

Where there is a re-index operation, the Search Index Builder retrieves a list of all indexed items within the specified context (site or instance) and re-indexes each of them.

Where there is a re-build operation, the Search Index Builder queries all EntityContentProducers to build a list of all known items of that type within the context (site or instance), then indexes all of these items.
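
In sketch form, resolving a queued reference and a rebuild against the registered producers might look like this; the stand-in interface below only models the behaviour described above, not the real search API.

    import java.util.Iterator;
    import java.util.List;

    /** Illustrative dispatch of queued work to the registered producers. */
    class ProducerDispatch {

        /** Minimal stand-in for the producer interface described above. */
        interface ContentProducer {
            boolean matches(String reference);
            String getContent(String reference);
            Iterator<String> getContentIterator(String context);
        }

        private List<ContentProducer> producers;

        /** Index one entity reference: the first producer to claim it wins. */
        void indexReference(String reference) {
            for (ContentProducer producer : producers) {
                if (producer.matches(reference)) {
                    addToIndex(reference, producer.getContent(reference));
                    return;
                }
            }
        }

        /** Rebuild: ask every producer to enumerate its items in the context. */
        void rebuildContext(String context) {
            for (ContentProducer producer : producers) {
                Iterator<String> refs = producer.getContentIterator(context);
                while (refs.hasNext()) {
                    indexReference(refs.next());
                }
            }
        }

        void addToIndex(String reference, String content) {
            // Lucene document construction; see the Index Technology section.
        }
    }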

The Search Index Builder operates in an asynchronous batch mode, writing the updated index to storage at the end of each cycle. Once the index is written to storage, all search services are notified to reload it.

There is also a newer concurrent indexer that allows all nodes to perform indexing, adding journal entries of segments to a shared journal. That journal is replayed on each application server to reconstruct the full index.
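
A rough sketch of the replay step on one node, with entirely hypothetical names:

    /** Illustrative journal replay; none of these names are from the real API. */
    void catchUpWithJournal() {
        // Each completed indexing operation appends a segment entry to the
        // shared journal with a monotonically increasing sequence number.
        long lastApplied = localIndex.getLastAppliedJournalEntry();
        for (JournalEntry entry : journal.entriesAfter(lastApplied)) {
            localIndex.mergeSegment(entry.getSegment());   // apply the delta
            localIndex.setLastAppliedJournalEntry(entry.getSequence());
        }
    }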

Cluster Operation

The older indexer used a cluster-wide mutex to restrict index writing to a single node.

The Search Index Builder operates in a clustered environment. The Search Services only need to read the index. The Search Index Builder needs to write to the index. With the current indexing layout, there can only be a single index writer. I have therefore implemented a distributed lock manager that uses the database and its transaction manager as its mutex.

All Search Index Builders compete for permission to write the index. The winner locks out all other Search Index Builders and performs one cycle before returning the mutex to the cluster.
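
A sketch of such a mutex, using a row lock held for the length of a transaction; the table name and the blocking approach are illustrative, not necessarily what the lock manager actually does.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    /** Illustrative database-backed mutex for the single index writer. */
    class IndexWriterLock {

        /** Run one indexing cycle only if this node wins the cluster lock. */
        void runLockedCycle(Connection conn, Runnable cycle) throws SQLException {
            conn.setAutoCommit(false);
            try (PreparedStatement lock = conn.prepareStatement(
                    "SELECT id FROM search_writer_lock WHERE id = 1 FOR UPDATE")) {
                lock.executeQuery();   // blocks until this node owns the row
                cycle.run();           // only one node in the cluster gets here
                conn.commit();         // ending the transaction frees the mutex
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }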

The newer indexer uses a journalled transaction to allow all nodes to perform indexing operations concurrently.

More information on cluster options can be found at IndexClusterOperation.

Index Technology

We are using Lucene to perform the index operation. We add documents to the index in a plain-text digested form, and we add a number of keywords to the index based on properties delivered by the EntityContentProducer (sketched at the end of this section). The index storage is on disk. In a clustered environment a number of strategies might be employed:

  1. Share the index file space using SAN or NAS technologies.
  2. Ship the index out to each node on rebuild and reload.
  3. Store the index in the Sakai Database.

The last option is likely to be more efficient, but it might slow the rate of index updates.
Should the indexing operation become expensive, we might employ the MapReduce algorithm, which has been published by Google and is used by Nutch.
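
To make the Lucene side concrete, here is how a digested entity might be added to an on-disk index. The sketch uses a recent Lucene API rather than the version in use at the time, and the field names are illustrative.

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    /** Add one digested entity to an on-disk Lucene index. */
    class EntityIndexer {

        void indexEntity(String reference, String siteId, String plainText)
                throws Exception {
            try (FSDirectory dir = FSDirectory.open(Paths.get("/var/sakai/search-index"));
                 IndexWriter writer = new IndexWriter(dir,
                         new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                // Keywords from the EntityContentProducer, indexed verbatim.
                doc.add(new StringField("reference", reference, Store.YES));
                doc.add(new StringField("site", siteId, Store.YES));
                // The plain-text digest, analysed for free-text search.
                doc.add(new TextField("contents", plainText, Store.NO));
                writer.addDocument(doc);
            }
        }
    }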

Word Analysis

Lucene provides token analyzers for both the content and the searching operation. It is possible to select at runtime which analyzers are used for searching, on a per-request basis. Unfortunately each document in the index is indexed with a single analyzer. This could be configured on a work-site-by-work-site basis; finer-grained control is unlikely to be productive. At the moment an English Snowball stemming analyzer is configured, which enables search to recognize cowslip and cowslips as the same word, since they share the same stem.
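
As a quick illustration of the stemming behaviour, the following uses the SnowballFilter from a recent Lucene analysis module (the class and usage reflect current Lucene, not necessarily the version configured here):

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.snowball.SnowballFilter;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    /** Print the stems the English Snowball stemmer produces. */
    class StemDemo {

        // printStems("cowslip cowslips") prints "cowslip" twice: both words
        // reduce to the same stem, so they match the same index terms.
        void printStems(String text) throws IOException {
            WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
            tokenizer.setReader(new StringReader(text));
            try (TokenStream stems = new SnowballFilter(tokenizer, "English")) {
                CharTermAttribute term = stems.addAttribute(CharTermAttribute.class);
                stems.reset();
                while (stems.incrementToken()) {
                    System.out.println(term.toString());
                }
                stems.end();
            }
        }
    }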

In the future it will be possible to use ontology-based analyzers for indexing and searching at the work site level.

Index Storage

There are currently three mechanisms implemented for index storage. These can be configured from sakai.properties; see the configuration page for more information.