Index the sakai_event table

Description

The sakai_event table currently has no indexes apart from the unique index on event_id.

Queries against this table therefore do a full table scan.

A new feature is coming soon that uses this data to determine an engagement score for a student based on their activity in a site, adjusted to the student's local time.

This data is not available through SiteStats, as SiteStats aggregates events per day/per site and does not include student info. Events are also not localised to a student's local timezone: they are stamped with server time, so if the server were in Australia and a user in New York generated an event at 9pm, the event would be recorded under the next day (server time).

If SiteStats did collect this data, it would effectively be a copy of the sakai_event table anyway, and would likely mean an additional database write for every event that occurred.

This ticket adds a multi-column index to the sakai_event table. This does not significantly impact writes to the table; in any case, the writes are batched up and run in a separate thread by the ClusterEventTracking service.
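For illustration, the index might look something like the following. The index name and column list here are assumptions for the sketch, not the actual patch; the real column order should match the columns the engagement query filters on.

```sql
-- Hypothetical shape of the multi-column index (names assumed):
CREATE INDEX ie_sakai_event_combo
  ON sakai_event (event_date, event, context, session_id);
```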

The sample PL/SQL app below generates thousands of inserts and batches them up for commit. See results below.

Sample app:
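The original attachment is not reproduced here; below is a minimal sketch of the kind of PL/SQL timing harness described, assuming the standard Oracle Sakai schema (the sakai_event column list and the SAKAI_EVENT_SEQ sequence are assumptions; adjust to your schema).

```sql
-- Sketch: time a batch of inserts into sakai_event, committed once at
-- the end, roughly mirroring ClusterEventTracking's batched writes.
SET SERVEROUTPUT ON
DECLARE
  l_start NUMBER;
BEGIN
  l_start := DBMS_UTILITY.GET_TIME;  -- hundredths of a second
  FOR i IN 1 .. 100000 LOOP
    INSERT INTO sakai_event
      (event_id, event_date, event, ref, context, session_id, event_code)
    VALUES
      (SAKAI_EVENT_SEQ.NEXTVAL, SYSDATE, 'test.event',
       '/test/ref/' || i, 'test-site', 'test-session-id', 'm');
  END LOOP;
  COMMIT;
  DBMS_OUTPUT.PUT_LINE('Elapsed: ' ||
    (DBMS_UTILITY.GET_TIME - l_start) / 100 || 's');
END;
/
```

Run once against a copy of the table without the index and once with it to reproduce the comparison below.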

Results:
100 inserts
Without index, 0.020s
With index, 0.028s

500 inserts
Without index, 0.09s
With index, 0.10s

5000 inserts
Without index, 0.75s
With index, 0.76s

100,000 inserts
Without index, 14s
With index, 17s

Typically, the number of events would be on the lower end of these results as the check runs every 5 seconds to insert the current batch.

Query that uses the index:
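The exact query is not preserved here; the following is a hedged example of the kind of per-student activity query the index is meant to serve. Table and column names are taken from the standard Sakai schema; the join to sakai_session resolves the user for each event.

```sql
-- Example: all events for a site within a time window, with the user
-- who generated each one (bind variables :site_id, :window_start,
-- :window_end are placeholders).
SELECT s.session_user, e.event, e.ref, e.event_date
  FROM sakai_event e
  JOIN sakai_session s ON s.session_id = e.session_id
 WHERE e.context    =  :site_id
   AND e.event_date >= :window_start
   AND e.event_date <  :window_end;
```

Without an index on (event_date, context, ...) this predicate forces a full scan of sakai_event, which is the cost difference described below.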

Without the index and 1.6 million rows in the event table, the cost of this query is about 2000 times more than with it.

Activity


Paul Lukasewych March 23, 2016 at 8:26 AM

We use the aggregator job here to process the events, not the event listener, and we have a separate sitestats database, so it should have little to no impact on users. Currently we're processing 50,000 events at a time, and it takes about 2 minutes at most. I don't anticipate a huge increase in processing time at first to add writing to the new table, but as the table grows and updating the indices becomes more costly, it could be a problem. I'm just going to have to wait and see how things go with real data.

Steve Swinsburg March 22, 2016 at 9:12 PM

@Paul, that could be useful, interested to see how you get on. What's the approach for managing the increased db writes for the events you are collecting? Also, is it threaded?
@Earle, we've actually culled the list of events down considerably so I think we are just going to go with an event listener approach for now, and then can look at Paul's backend and see if we want to swap it out. One of the challenges with time based data is getting the data localised to the student, not the server, as the events currently are. Would be interested to have a chat about this though.

Paul Lukasewych March 22, 2016 at 9:22 AM
Edited

I'm working on local modifications right now to collect a subset of this data in the SiteStats database. Essentially, it will record the full event data for every event that SiteStats counts. The goal is to expose this extra detail in the SiteStats tool, so you can click on the event count and actually see the individual events with the proper timestamp, and possibly retrieve additional info about the event via Sakai API calls (e.g. find out the name of the assignment that was submitted to for an asn.submit event).

We keep data in the sakai_event table for 30 days. At our institution this results in about 35 million records in this table. Of these, maybe only half are of interest to SiteStats (i.e. are tied to a site and session). There is so much noise in this table that I doubt querying it directly would be worthwhile, even with indexing.

We'll be releasing the collection phase of the modifications in May, but the UI is going to take some more work because of performance concerns.

Earle Nietzel March 22, 2016 at 8:41 AM

So ElasticSearch is a full-fledged search engine, and everything you do is JSON based, so you can query it right from the front end.

I was thinking, for the event data, of a time-based approach to storing it in ES that would be immediately available to query and is fast; see https://www.elastic.co/guide/en/elasticsearch/guide/current/time-based.html

If this sounds like something you're interested in, we should discuss it more.

Steve Swinsburg March 21, 2016 at 2:55 PM
Edited

That's what I considered also; however, this data will be collected on a daily basis and then no longer needed. 100M rows is a bit of a stretch, see figures in KNL-575. But I agree that it is a log. Have you guys got more info on the Tin Can/ElasticSearch approach you mentioned?

Maybe in the interim the data could be replicated to a separate reporting table every 6 hours or so, and even denormalised (replacing session_id with the user_id to avoid the additional join to sakai_session). Then the data could be truncated once it has been used.
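The interim idea above could be sketched as follows. The event_report table name and its columns are hypothetical; the point is that the join to sakai_session is paid once at copy time, so the reporting table already carries user_id.

```sql
-- Sketch: copy the last ~6 hours of events into a denormalised
-- reporting table (all names here are assumptions for illustration).
INSERT INTO event_report (user_id, event, ref, context, event_date)
SELECT s.session_user, e.event, e.ref, e.context, e.event_date
  FROM sakai_event e
  JOIN sakai_session s ON s.session_id = e.session_id
 WHERE e.event_date >= SYSDATE - (6 / 24);
```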

Details

Created March 20, 2016 at 4:50 AM
Updated April 25, 2018 at 3:34 PM