Sakai Search Tool

Aim

To provide a tool that allows a Google-like search of all content in a Sakai instance.

Requirements

The Tool must provide a simple search interface that can be added to a work site. (search-tool)
The Tool must provide a back end component that can be used by other tools to provide search functionality.

Tool requirements

The Tool must provide a simple search box into which the user types what they want to search for.
The Tool must respond with a paged list of search matches, ranked by probability of match.
The search matches must contain a link to the real content.
The search matches must show a highlighted digest of the matching content.
The search must only show search matches that the user can read.
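
The paging and permission requirements interact: filtering for readability has to happen before paging, otherwise a page could come back short. A minimal sketch in Python, where `Match`, `page_of_results` and `can_read` are hypothetical names chosen for illustration and are not part of the Sakai search API:

```python
from dataclasses import dataclass


@dataclass
class Match:
    url: str      # link to the real content
    score: float  # relevance score, higher is better
    digest: str   # highlighted digest of the matching content


def page_of_results(matches, can_read, page, page_size=10):
    """Return one page of matches the user may read, best match first."""
    # Drop unreadable matches first so every page is filled from readable ones.
    readable = [m for m in matches if can_read(m.url)]
    ranked = sorted(readable, key=lambda m: m.score, reverse=True)
    start = page * page_size
    return ranked[start:start + page_size]
```

Here `can_read` stands in for whatever permission check the hosting tool provides.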

Update: 2/6/2014

As of Sakai 10 the legacy search implementation described here has been replaced with an embedded Elasticsearch implementation; see the Elasticsearch page.

Back End Requirements

The back end must contain an index that can be queried using free text searches.
It must be possible to pre-filter the search space using one or more context descriptions.
1. work site context
2. tool type
3. content type
4. Combination
The back end must update its search index automatically as content is added.
The back end must be able to re-index:
1. A work site
2. An entire Sakai instance
Once an index is rebuilt, the search components must automatically use the new index.
The back end must maintain a persistent record of the documents that it has indexed or is in the process of indexing
The back end must be able to operate in a Sakai cluster
The back end must be resilient, so that if there is a failure to index, it will restart and recover automatically.
Any indexing operation should be detached from normal request processing to avoid adding time to the request processing cycle.
Where the Sakai instance is a cluster, the back end must work correctly and produce an index that is the same on all nodes in the cluster.
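
The requirement to detach indexing from request processing can be sketched as a queue drained by a background worker. This is an illustration only: `IndexQueue` and its methods are hypothetical names, the queue is held in memory (the real back end must keep a persistent record, as required above), and the "indexing" step is a stand-in.

```python
import queue
import threading


class IndexQueue:
    """Accepts content references quickly and indexes them on a worker thread."""

    def __init__(self):
        self._pending = queue.Queue()
        self.indexed = []  # stand-in for the real search index
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def content_added(self, reference):
        # Called from request processing: only enqueues, so it returns at once.
        self._pending.put(reference)

    def _run(self):
        while True:
            reference = self._pending.get()
            try:
                self.indexed.append(reference)  # placeholder for real indexing
            finally:
                self._pending.task_done()

    def wait_until_idle(self):
        # Blocks until every queued reference has been processed.
        self._pending.join()
```

A request handler would call `content_added` and return; `wait_until_idle` exists only so that a test or shutdown hook can wait for the worker.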

Implementation Details

Hardware Requirements

It is quite difficult to estimate the storage needed for the search index, as it depends on the number of words in each entity. However, there are some rules of thumb that make sense. The Sakai search engine is similar in structure to Nutch, so it is worth looking at the Nutch hardware requirements to get some idea of what is required. The main difference is that Nutch uses a distributed MapReduce file system, whereas the Sakai search engine is considerably less sophisticated: it uses a replicated file system on the nodes, requiring sufficient storage on each node for the entire index.

The Nutch hardware requirements page at http://wiki.apache.org/nutch/HardwareRequirements estimates that 100M documents will require 1TB of storage. Since Sakai search does not store the content in the index, its requirement should be less than this. The page also suggests that a single CPU should be able to handle an index of 20M documents at 1 search query per second, and that a node holding 20M documents might need 4GB to handle 20 queries per second. These figures are for a much larger search load and index size than we are likely to encounter with Sakai, where the larger installations will not exceed 10M entities.
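
As a worked example of the rule of thumb, the Nutch ratio can be scaled linearly to a Sakai-sized index. This is a rough sketch that assumes 1TB means 10^12 bytes and that storage grows in proportion to document count; the function name is illustrative. Because Sakai does not store content in the index, the result should be read as an upper bound derived from the Nutch figure, not a measurement.

```python
NUTCH_DOCUMENTS = 100_000_000      # 100M documents (Nutch estimate)
NUTCH_STORAGE_BYTES = 10**12       # 1TB, taken here as 10^12 bytes


def estimate_index_bytes(document_count):
    """Scale the Nutch storage estimate linearly to document_count."""
    bytes_per_document = NUTCH_STORAGE_BYTES // NUTCH_DOCUMENTS  # 10,000 bytes
    return document_count * bytes_per_document


# A large Sakai installation with 10M entities:
upper_bound = estimate_index_bytes(10_000_000)
print(upper_bound // 10**9, "GB per node")  # prints "100 GB per node"
```

Since every node holds the whole index, this figure applies to each node rather than to the cluster as a whole.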

Configuration

For the very brave, there are the spring components files. For the less brave, and more sensible, there is sakai.properties.

There was a major rewrite of the core indexer in 2.5 which changed many of the settings, so the settings are split into 2 versions: the node locked indexer, which was in all versions up to and including 2.4, and the concurrent indexer, which is in 2.5.0 and is available as a patch to 2.4.x.

Node Locked Indexer, 2.2 - 2.4

As of the 2.2 release there were some issues with deployment of search in production. These issues were largely sorted out when search was put into production at both Cape Town and Cambridge. Thanks go to Stephen Marquard for his help analysing the problems encountered at Cape Town.

For 2.2 it is recommended that you take search from the 2.2.x branch or from trunk. Post 2.2 there is a version of search in the release as a provisional tool.

You may change the following settings in sakai.properties.
In 2.2.x the settings were:

search.experimental = true to turn search on. As of 2.4.009 search is on by default; use search.enable=false to disable it.
sharedSegments@org.sakaiproject.search.api.SearchService.SegmentStore=/tmp/segments/ to specify a shared segments location

You may omit the sharedSegments location if you want the shared segments to be stored in the database, but this may slow the database down and has shown poor performance on MySQL. As of 2.4.009, storing segments in the database has to be enabled in the components file by unsetting sharedSegments.

In trunk you may also set the location of the local indexes

location@org.sakaiproject.search.api.SearchService.SegmentStore=/tmp/index/

Post r33751 in trunk (i.e. 2.5), where the data is stored is controlled as follows.

For a non-clustered deployment this is
- indexStorageName@org.sakaiproject.search.index.IndexStorage=filesystem
- location@org.sakaiproject.search.api.SearchService.LocalIndexStorage=
and for a cluster this is
- indexStorageName@org.sakaiproject.search.index.IndexStorage=cluster default
- location@org.sakaiproject.search.api.SearchService.SegmentStore=$sakai.home/segments/ default
- sharedSegments@org.sakaiproject.search.api.SearchService.SegmentStore=$sakai.home/searchindex/ default

Other settings.

localStructuredStorage@org.sakaiproject.search.api.SearchService.SegmentStore=false default
sharedStructuredStorage@org.sakaiproject.search.api.SearchService.SegmentStore=false default
localSegmentsOnly@org.sakaiproject.search.api.SearchService.SegmentStore=false default
indexStorageName@org.sakaiproject.search.index.IndexStorage = filesystem
indexStorageName@org.sakaiproject.search.index.IndexStorage = cluster default
indexStorageName@org.sakaiproject.search.index.IndexStorage = db
recoverCorruptedIndex@org.sakaiproject.search.index.IndexStorage = false
location@org.sakaiproject.search.index.IndexStorage = tableName|localDirectory
onlyIndexSearchToolSites@org.sakaiproject.search.api.SearchIndexBuilder=false default
search.enable = true default
sharedSegments@org.sakaiproject.search.api.SearchService.SegmentStore=$sakai.home/segments/ default
location@org.sakaiproject.search.api.SearchService.SegmentStore=$sakai.home/searchindex/ default
segmentThreshold@org.sakaiproject.search.api.SearchService.ClusterIndexStorage=20971520 default 20M
maxSegmentSize@org.sakaiproject.search.api.SearchService.ClusterIndexStorage=1572864000 default 1.5G
maxMegeSegmentSize@org.sakaiproject.search.api.SearchService.ClusterIndexStorage=1258291200 default 1.2G
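
The three size settings that close the list above are given in bytes, and the defaults are whole multiples of 1024 x 1024, so "20M", "1.5G" and "1.2G" are 20, 1500 and 1200 MiB respectively. The sketch below checks that arithmetic; the two helper functions are hypothetical illustrations of how such limits are typically applied and are not taken from the Sakai source.

```python
MIB = 1024 * 1024

SEGMENT_THRESHOLD = 20 * MIB          # segmentThreshold default
MAX_SEGMENT_SIZE = 1500 * MIB         # maxSegmentSize default
MAX_MERGE_SEGMENT_SIZE = 1200 * MIB   # maxMegeSegmentSize default


def should_start_new_segment(current_bytes, threshold=SEGMENT_THRESHOLD):
    """Hypothetical check: a segment at or over the threshold is closed."""
    return current_bytes >= threshold


def accepts_more_merges(merged_bytes, limit=MAX_MERGE_SEGMENT_SIZE):
    """Hypothetical check: stop adding segments once the merge limit is hit."""
    return merged_bytes < limit


print(SEGMENT_THRESHOLD, MAX_SEGMENT_SIZE, MAX_MERGE_SEGMENT_SIZE)
```

When overriding the defaults, keeping the values as multiples of 1048576 makes them easy to read back as MiB.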

The indexStorageName is the name of the configured class that is used for index storage.

'filesystem' stores the index on the local filesystem and assumes the system is being run either as a single node, or with a shared filesystem.
'db' stores the index in the Sakai database. Performance of this mechanism is only just bearable on Oracle.
'cluster' stores the index on local filesystems but performs cluster-wide updates as the index is updated. This is the recommended approach and works out of the box.

recoverCorruptedIndex controls what happens if the index is found to be corrupt. If true, the index can be automatically deleted and restarted, but it will have no content. The content can be re-initialised by admin rebuilding the index of the whole instance. If false a corrupted index will cause an exception to be reported to the end user. This is true by default, and best left that way.

location is the location of the index. If the driver is a database driver, the location will be a database table. If the driver is a filesystem it will be a directory where the index files are stored.

localStructuredStorage if true will place the search index directories on local disk into a hierarchical structure. On startup the node will migrate its data, but all nodes in the cluster must use the same setting.

sharedStructuredStorage when used in conjunction with sharedSegments will cause the segment backup files to be stored in a structured storage with no more than 100 directories at the base level. Again, all nodes in the cluster must have the same setting. This feature is intended for large installations where a large number of files in a directory might cause problems with the underlying filing system (e.g. AFS).
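
One way to keep the base level to at most 100 directories is to bucket each segment by a stable hash of its name. The sketch below is only an illustration of that idea: the actual directory layout Sakai uses is not described here, and `structured_path`, the CRC-based bucketing and the two-digit directory names are all assumptions.

```python
import posixpath
import zlib

BASE_DIRECTORIES = 100  # "no more than 100 directories at the base level"


def structured_path(base, segment_name):
    """Map a segment name to base/<bucket>/<segment_name> (hypothetical layout)."""
    # crc32 is stable across processes and nodes, unlike Python's built-in hash().
    bucket = zlib.crc32(segment_name.encode("utf-8")) % BASE_DIRECTORIES
    return posixpath.join(base, f"{bucket:02d}", segment_name)
```

Any scheme of this kind must give the same answer on every node, which is why all nodes in the cluster have to share the setting.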

localSegmentsOnly when true will only put segments on the local disk and not use shared storage. This is useful if you have a deployment that will only ever use a single app server node, or a situation where you are using a single search server inside a larger cluster. DO NOT use it if more than one node in the cluster is updating the segments in the search index, otherwise you will find that each node has a different copy of the search index. Also, you MUST provide your own backup and recovery of the search index, as it will not self-heal in the event of a failure on a node. The local store must always be placed on a local disk subsystem.

onlyIndexSearchToolSites when true will cause the indexing operation to only index sites that contain a search tool. All other sites will be ignored.

search.enable = false will disable search altogether; the default is true as of 2.4.009.

Concurrent Indexer, 2.5 and as a special patch to 2.4.x

All settings have been moved into a single configuration settings bean.

search.enable = true
- If set to false, search indexing and searching are totally disabled.

localIndexBase@org.sakaiproject.search.api.JournalSettings=${sakai.home}indexwork
- The location on local disk of the index. This will contain the state of the local index node, and will require a rebuild if the node is destroyed. The node should assume that this space is required to persist between restarts of the application server, but it does not need to be backed up: if the node is recreated with a different name, the journal will be re-run to create the local index. Unlike previous versions of the indexer, the local index is not saved back to a central store but acts as an accumulator for journaled index segments.
sharedJournalBase@org.sakaiproject.search.api.JournalSettings=${sakai.home}searchjournal
- The location of central storage for journaled search segments. This location must be accessible from all nodes.
minimumOptimizeSavePoints@org.sakaiproject.search.api.JournalSettings=10
- This defines the minimum number of segments required before the indexer will attempt to merge older journal records into a single master journal record. The count of segments only considers journals that have been applied to all application servers.
optimizMergeSize@org.sakaiproject.search.api.JournalSettings=5
- This defines the minimum number of local segments required to trigger a local merge and optimize operation. The indexer will try to keep the number of open segments at this number + 1, hence limiting the number of files open when the index is opened.
soakTest@org.sakaiproject.search.api.JournalSettings=false
- Setting Soak test to true will cause the index queue to be reset to be all pending when there are no items left to index. This will effectively cause the indexer to run continuously. This should not be used in production.
onlyIndexSearchToolSites@org.sakaiproject.search.api.SearchIndexBuilder=false
- As in earlier versions: when true will cause the indexing operation to only index sites that contain a search tool. All other sites will be ignored.
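
The two merge thresholds above can be read as simple predicates. The sketch below is a hypothetical model, not the indexer's actual logic: it assumes journal savepoints are increasing integers and that each node reports the highest savepoint it has applied; the function and parameter names are invented for illustration.

```python
def should_merge_shared_journal(savepoints, applied_by_node, minimum=10):
    """True when enough savepoints have been applied by every node.

    minimum mirrors minimumOptimizeSavePoints (default 10).
    """
    if not applied_by_node:
        return False
    # Only savepoints that the slowest node has already applied are counted.
    lowest_applied = min(applied_by_node.values())
    eligible = [s for s in savepoints if s <= lowest_applied]
    return len(eligible) >= minimum


def should_merge_local(open_segments, optimize_merge_size=5):
    """True when the local index holds enough segments to merge.

    optimize_merge_size mirrors optimizMergeSize (default 5).
    """
    return open_segments >= optimize_merge_size
```

In this model a single node that lags behind holds back the shared merge for the whole cluster, which matches the statement that only journals applied to all application servers are counted.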

Search Server

In large clusters with a large number of entities or a large number of sites you may want to limit the scope of search. There are 2 approaches: limit the number of things indexed and/or limit the number of nodes performing search.

In version 2.5 you may still run the client server indexer, but you will have to enable it by editing sakai-search-pack/WEB-INF/components.xml to include coreSearchComponents.xml instead of parallelIndexComponents.xml.

Search Server/Search Client

A new feature post r20035 (2.4) allows you to specify if a node is a search server or a search client. Search servers have a copy of the index and provide an HTTP-XML search web service to search clients. Search clients do not have a copy of the index (for search purposes) and use the configured search service to perform the search operation. For more details of the implementation see SearchServerImpl.

searchServer@org.sakaiproject.search.api.SearchService=true

will make the node in question a search server and expose an HTTP-XML web service to other clients. This service is relatively insecure; it can be made more secure using a shared key, but it should not be exposed outside the cluster.

sharedKey@org.sakaiproject.search.api.SearchService=<some shared secret>

will set the shared key which must be the same on both the search server and its clients.

On the client you must indicate which search service to use with a URL.

searchServerUrl@org.sakaiproject.search.api.SearchService=http://<host>:<port>/sakai-search-tool/xmlsearch/

Additionally you will want to stop the client from participating in index building with

search.indexbuild=false

segmentThreshold is the maximum size a segment is allowed to grow to before a new segment is started.

maxSegmentSize is the maximum size a segment will grow to as a result of the merging process.

maxMegeSegmentSize is the maximum size that the segment will grow to before no more segments are added in the merging process.

So the server config might be

searchServer@org.sakaiproject.search.api.SearchService=true
sharedKey@org.sakaiproject.search.api.SearchService=SharedKetToStopSecurity6423134Leaks

# If you only have 1 node acting as a dedicated search server you may also want to
# disable Shared Segments (but read the warnings above first)
#
# localSegmentsOnly@org.sakaiproject.search.api.SearchService.SegmentStore=true

and the client(s)

searchServerUrl@org.sakaiproject.search.api.SearchService=http://<host>:<port>/sakai-search-tool/xmlsearch/
search.indexbuild=false
sharedKey@org.sakaiproject.search.api.SearchService=SharedKetToStopSecurity6423134Leaks
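
Because the server and client settings live in different sakai.properties files, it is easy for them to drift apart. The following sketch checks a server/client pair for the rules described above. It is an illustration only: the parser handles plain `key=value` lines and `#` comments, not the full Java properties syntax, and the function names are hypothetical.

```python
SERVICE = "@org.sakaiproject.search.api.SearchService"


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_server_client(server, client):
    """Return a list of problems found in a server/client pair of settings."""
    problems = []
    if server.get("searchServer" + SERVICE) != "true":
        problems.append("server does not set searchServer=true")
    if server.get("sharedKey" + SERVICE) != client.get("sharedKey" + SERVICE):
        problems.append("sharedKey differs between server and client")
    if not client.get("searchServerUrl" + SERVICE):
        problems.append("client has no searchServerUrl")
    if client.get("search.indexbuild") != "false":
        problems.append("client still takes part in index building")
    return problems
```

An empty list means the pair is consistent with the example configuration; each entry otherwise names one rule that is broken.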

Limiting what is indexed.

If you have a deployment with a large number of sites and you want to limit the number of sites that are indexed, you can control which sites are indexed with

onlyIndexSearchToolSites@org.sakaiproject.search.api.SearchIndexBuilder=true

This will cause the indexer to consider only candidates for indexing from sites that have a search tool placed within them.
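
The effect of the setting can be pictured as a filter over sites. This is a sketch under assumptions: the tool identifier `sakai.search` and the `site_tools` mapping of site ID to tool IDs are illustrative, and the real indexer works from Sakai's site service, not from a dictionary.

```python
SEARCH_TOOL_ID = "sakai.search"  # assumed identifier of the search tool


def sites_to_index(site_tools, only_search_tool_sites):
    """Return the site IDs whose content is a candidate for indexing."""
    if not only_search_tool_sites:
        return sorted(site_tools)
    # Keep only sites that have the search tool placed in them.
    return sorted(
        site for site, tools in site_tools.items() if SEARCH_TOOL_ID in tools
    )
```

With the setting at its default of false every site is a candidate; with true, only the sites containing the search tool remain.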

Applications indexed

Resources
Message Center
Assignments
Wiki

Additional tools, as of Sakai 2.5:

Announcements
Email Archive
Chat
Worksite Setup

Testing Details

Sites that have search running in production include the following. If your site is listed incorrectly, please let me know.

Cambridge, 19K documents, 6G index.
Cape Town, 14K documents, 1.3G index.

Future development

I am looking into providing RDF based search to provide discovery in a similar way to longwell2 (simile.mit.edu)

Another project is looking into extending Search to offer other text mining functionalities. These will include support for plagiarism detection, clustering and visualization.
