JSR-170

Overview

There is currently a beta-quality JCR implementation of Content Hosting in trunk. This work was originally done and documented in SAK-10366. Some good historical information is also available, but it is now deprecated. This page deals specifically with JCR information related to the ContentHostingService and parts of the Resources tool. For general information about JSR-170 support in Sakai, see here.

The very first phase of JCR integration for Resources is an implementation of the existing ContentHostingService API using a JCR backend. With this initial support, the Resources Tool and Sakai DAV are meant to operate as they stand with no changes to their code.

Installing the JCR ContentHostingService

Sakai Trunk

The Content Hosting service on JCR is available in trunk and can be built as part of the framework profile. Deploy the framework with:

mvn -Pframework clean install sakai:deploy -Dmaven.tomcat.home=/tomcathome; 
rm -rf /tomcathome/components/sakai-content-pack; 
cd content; 
mvn -Pframework-jcr clean install sakai:deploy  -Dmaven.tomcat.home=/tomcathome 

In addition, you can use the full-jcr maven profile to install everything with JCR including the content-providers, etc.

  1. If Sakai is already built and deployed, remove the following directory from the Tomcat deployment:
    tomcat/components/sakai-content-pack
  2. Then check out and build the content branch:
    svn co https://source.sakaiproject.org/svn/content/branches/SAK-12105
    mvn clean install sakai:deploy -f SAK-12105/pom.xml -PJCR
    
  3. At the moment, there is an issue with the content provider component. After installing, remove components/sakai-content-providers-pack.

Sakai 2.5.x

TODO: This is probably the same as the Sakai Trunk instructions; it just hasn't been tried out yet.

Sakai 2.4.x

This needs to be built with Maven 2, and almost certainly requires pulling in the DB and Entity modules from trunk. It also requires pulling in util, but we did that above when installing the Jackrabbit service. If you are using a different implementation, such as Xythos, you will also need to install trunk util as described above.

  1. Check out and install DB from trunk
    svn co https://source.sakaiproject.org/svn/db/trunk db
    mvn clean install sakai:deploy -f db/pom.xml
    
    • If you can, run maven und from 2.4.x db; otherwise, remove the following from your Tomcat:
      • tomcat/components/sakai-db-pack
      • tomcat/shared/lib/sakai-db-api
  2. Check out and install entity from trunk
    svn co https://source.sakaiproject.org/svn/entity/trunk entity
    mvn clean install sakai:deploy -f entity/pom.xml
    
    • Again, if you can't use maven und, manually remove the following:
      • tomcat/components/sakai-entity-pack
      • tomcat/shared/lib/sakai-entity-api
  3. At this point there might be two versions of Hibernate in tomcat/shared/lib.
    Remove Hibernate 3.1.3 and leave 3.2.5ga.
  4. Check out and install the new content code
    svn co https://source.sakaiproject.org/svn/content/branches/SAK-12105
    mvn clean install sakai:deploy -f SAK-12105/pom.xml -PJCR
    
    • If you can, run maven und from 2.4.x content; otherwise, remove the following from your Tomcat:
      • tomcat/components/sakai-content-pack
  5. At the moment, there is an issue with the content provider component. After installing, remove components/sakai-content-providers-pack.

Enabling JCR Content

JCR is disabled by default. To enable it, you can either switch over the component beans or use the JCR Inspector to switch over in real time.

Using JCR Inspector to switch over in real time

The JCR Inspector has controls to switch over to using the JCR Content Service (and switch back). They are located on the Import Legacy CHS Data view. Click on the buttons to switch.

Content Migration

Unlike some of the other data upgrades that have been performed on Content Hosting, the conversion to a JCR implementation requires copying all of the data over to another repository. Below is a description of the first algorithm we are working on to perform the migration.

This first version:

  • Doesn't try to do anything fancy.
  • Is not parallelized in any way. It can only run on one node in the cluster at a time.

Depending on how testing goes and the needs of other universities, this may be spiffed up a bit. Otherwise, if it runs in a reasonable amount of time and tests out, it may stand as is.

Migration algorithm in a Nutshell

  1. Create a table to store path and migration status for each item. The migration will run through the table top to bottom, copying the items over and marking them finished when done.
  2. Copy all the entries in CONTENT_COLLECTION to the migration table. Then copy all the entries in CONTENT_RESOURCE to the migration table. This means that we will copy all the collections/folders over first, so they exist when we try to copy the files over.
  3. Start copying items over.
  4. Listen for Content Events that create, edit, or delete content items, and append these to the end of the migration table.
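
Steps 1 and 2 above can be sketched in a few lines of Java. This is only an illustration under assumptions: the MigrationTableSeed class, its Row type, and the in-memory list are hypothetical stand-ins for the real migration table, not code from the Sakai tree.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for the migration table; not Sakai code.
class MigrationTableSeed {

    /** One row: the content id plus a finished flag (false = not started). */
    static final class Row {
        final String contentId;
        boolean finished;

        Row(String contentId) {
            this.contentId = contentId;
        }
    }

    /**
     * Builds the initial queue: every collection first, then every resource,
     * so a folder is always queued before the files that live inside it.
     */
    static List<Row> seed(List<String> collectionIds, List<String> resourceIds) {
        List<Row> table = new ArrayList<>();
        for (String id : collectionIds) {
            table.add(new Row(id));
        }
        for (String id : resourceIds) {
            table.add(new Row(id));
        }
        return table;
    }
}
```

Calling MigrationTableSeed.seed with the ids from CONTENT_COLLECTION and CONTENT_RESOURCE would give a queue that can then be walked from top to bottom.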

Detailed Description of Algorithm

Each row in the MIGRATE_CHS_CONTENT_TO_JCR table contains the CONTENT_ID, the Status, and the Event Type. It's important that the table be interpreted as existing linearly in time. Any row further down the table is expected to model a content event that occurred after the ones above it. Processing the table rows in order from top to bottom is required.

Status         Code
Not Started    0
Finished       1

Those are the only two. In the event that we add the ability to run this in parallel on multiple cluster nodes, we will certainly have to add more status types.

Event Type            Description
ORIGINAL_MIGRATION    The row was part of the original table copy.
content.add           The row was added to the migration table as the result of receiving a content.add event.
content.write         Same idea as above, for writes.
content.delete        Same idea as above, for deletes.
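
For anyone reading the table from code, the status codes and event types above could be mirrored as Java enums. This is a sketch only; the enum names and the fromCode, fromLabel, and isCopy helpers are made up for illustration and are not taken from the migration code.

```java
// Hypothetical Java mirror of the status codes and event types above.
enum MigrationStatus {
    NOT_STARTED(0), FINISHED(1);

    final int code;

    MigrationStatus(int code) {
        this.code = code;
    }

    /** Maps the integer stored in the STATUS column back to the enum. */
    static MigrationStatus fromCode(int code) {
        for (MigrationStatus s : values()) {
            if (s.code == code) {
                return s;
            }
        }
        throw new IllegalArgumentException("Unknown status code: " + code);
    }
}

enum MigrationEventType {
    ORIGINAL_MIGRATION("ORIGINAL_MIGRATION"),
    ADD("content.add"),
    WRITE("content.write"),
    DELETE("content.delete");

    final String label;

    MigrationEventType(String label) {
        this.label = label;
    }

    /** Maps the string stored in the EVENT_TYPE column back to the enum. */
    static MigrationEventType fromLabel(String label) {
        for (MigrationEventType t : values()) {
            if (t.label.equals(label)) {
                return t;
            }
        }
        throw new IllegalArgumentException("Unknown event type: " + label);
    }

    /** Deletes remove the JCR node; every other type copies the item over. */
    boolean isCopy() {
        return this != DELETE;
    }
}
```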

Below is a small example of what the table would look like during migration:

CONTENT_ID                          STATUS    EVENT_TYPE
/myfolder/                          1         ORIGINAL_MIGRATION
/myfolder/file1.txt                 1         ORIGINAL_MIGRATION
/myfolder/file2.txt                 1         ORIGINAL_MIGRATION
/myfolder/Music/                    0         ORIGINAL_MIGRATION
/myfolder/Music/whittyBanter.mp3    0         ORIGINAL_MIGRATION
/myfolder/Music/moreBanter.mp3      0         content.add
/myfolder/file1.txt                 0         content.write

In this scenario, we've already copied myfolder and the two files that were in Resources when we started migrating. We still need to copy over the Music folder and my witty banter mp3. Also, since the migration started, I've added even more banter to my Music folder and changed some of the text in one of my files. These changes are now in the queue and will be processed in good time.

  1. Starting Migration
    • Check to see if migration has ever started before. This is done by counting the rows in MIGRATE_CHS_CONTENT_TO_JCR. If there are any rows in the table it means the migration has been started previously.
    • If the migration is starting for the first time, the existing CHS data is added to the table.
      • The COLLECTION_ID values from CONTENT_COLLECTION are added to MIGRATE_CHS_CONTENT_TO_JCR with a status of 0 and an event type of ORIGINAL_MIGRATION.
      • The RESOURCE_ID values from CONTENT_RESOURCE are added to MIGRATE_CHS_CONTENT_TO_JCR with a status of 0 and an event type of ORIGINAL_MIGRATION.
  2. During Migration
    Each round of data migration consists of starting a TimerTask, which fetches n unfinished items from the MIGRATE_CHS_CONTENT_TO_JCR table and copies them to the JCR repository. The timer tasks all use one Timer, and each task does not start until the previous one finishes. A configurable delay time t specifies how long to wait between batches.
    • Fetch the next n unfinished items from the MIGRATE_CHS_CONTENT_TO_JCR table.
    • For each item:
      • If the item is a ContentCollection and the event type is ORIGINAL_MIGRATION, content.add, or content.write, copy the ContentCollection to JCR. If the collection already exists in JCR, do not delete and re-add it; just overwrite the metadata properties and remove any properties that are not in the source collection.
      • If the item is a ContentCollection and the event type is content.delete, remove the collection node from JCR. If the collection was later re-added in Resources, the content.add event for it will be further down the queue, so it will be recreated at that point.
      • If the item is a ContentResource and the event type is ORIGINAL_MIGRATION, content.add, or content.write, delete the file node in JCR and recreate it by copying the resource over from CHS. This differs from the ContentCollection case, where we did not remove the node before recreating it because it was a folder and we did not want to destroy the files and folders inside it. A resource file, by contrast, will never have children. (In a pure JCR world it is possible for a file node to have children, but the original ContentHosting has nothing modeled like this.)
      • If the item is a ContentResource and the event type is content.delete, then we delete the file node from JCR completely.
      • After operating on the item, we update its row in MIGRATE_CHS_CONTENT_TO_JCR and set the status to 1, finished.
    • After finishing all the content items in the batch, we reschedule this TimerTask, setting the delay to the configurable batch delay t.
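
The batch loop described above might look roughly like the following. This is a sketch under several assumptions, not the actual migration code: plain maps stand in for CHS and the JCR repository, a list stands in for MIGRATE_CHS_CONTENT_TO_JCR, collections are recognised by a trailing slash in their id (as in the example table), and all class and method names are hypothetical. It also stops the timer once the table is drained, which is a simplification.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Timer;
import java.util.TimerTask;

// Sketch only: plain maps stand in for CHS and the JCR repository, and a list
// stands in for MIGRATE_CHS_CONTENT_TO_JCR. Every name here is hypothetical.
class BatchMigrationSketch {

    static final class Row {
        final String contentId;
        final String eventType;
        int status = 0; // 0 = not started, 1 = finished

        Row(String contentId, String eventType) {
            this.contentId = contentId;
            this.eventType = eventType;
        }
    }

    final List<Row> table = new ArrayList<>();
    final Map<String, String> chs = new HashMap<>(); // source items
    final Map<String, String> jcr = new HashMap<>(); // migrated items

    /** Processes up to n unfinished rows in table order; returns how many ran. */
    int runBatch(int n) {
        int processed = 0;
        for (Row row : table) {
            if (processed == n) {
                break;
            }
            if (row.status != 0) {
                continue;
            }
            apply(row);
            row.status = 1;
            processed++;
        }
        return processed;
    }

    private void apply(Row row) {
        String id = row.contentId;
        boolean isCollection = id.endsWith("/"); // assumed id convention
        if ("content.delete".equals(row.eventType)) {
            if (isCollection) {
                // Removing a folder node takes its subtree with it.
                jcr.keySet().removeIf(key -> key.startsWith(id));
            } else {
                jcr.remove(id);
            }
        } else if (chs.containsKey(id)) {
            if (!isCollection) {
                jcr.remove(id); // files are deleted and then recreated
            }
            // Folders are overwritten in place so their children survive.
            jcr.put(id, chs.get(id));
        }
        // Assumption: if the source item is already gone, a later
        // content.delete row in the table will clean up the target.
    }

    /** Runs one batch after delayMillis, then reschedules itself. */
    void schedule(final Timer timer, final int n, final long delayMillis) {
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                if (runBatch(n) == 0) {
                    timer.cancel(); // simplification: stop once drained
                } else {
                    schedule(timer, n, delayMillis);
                }
            }
        }, delayMillis);
    }
}
```

Something like new BatchMigrationSketch().schedule(new Timer(), 50, 1000L) would process 50 rows per batch with a one second pause between batches. Because a java.util.Timer runs its tasks on a single thread and each task schedules its successor only after finishing, batches never overlap.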

Edge cases

What if the server crashes?
The server crashes during a batch of copies. When it starts up, the copy that was in progress will still be marked 0 in the table. The copier always handles the case where the node already exists in JCR for some reason, and will just overwrite it and continue.

How do I switch when the conversion is done?
This is a tricky situation. The migration appears to be done when there are no entries in the table with status 0. But whenever someone causes a content event to happen, a new migration entry appears. The very end of the migration may need to coincide with some sort of downtime to seal the deal and switch implementations.

Is it possible for the content events to be added out of order?
What would happen if someone added a folder and some files in the folder to Content Hosting, and for some reason the content.add for the child file was triggered before the content.add for the parent folder?

I think I'm going to need to add a timestamp column to the migration table, and always sort on it before fetching the next batch of items to copy.
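
As a rough sketch of that idea, assuming each row carries a timestamp for when the content event occurred (the TimestampOrdering class, its Row type, and the timestamp field are all hypothetical, since the column does not exist yet):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of fetching a batch ordered by a timestamp column.
class TimestampOrdering {

    static final class Row {
        final String contentId;
        final String eventType;
        final long timestamp; // assumed: when the content event occurred

        Row(String contentId, String eventType, long timestamp) {
            this.contentId = contentId;
            this.eventType = eventType;
            this.timestamp = timestamp;
        }
    }

    /** Returns up to n rows ordered by event time rather than arrival order. */
    static List<Row> nextBatch(List<Row> unfinished, int n) {
        List<Row> sorted = new ArrayList<>(unfinished);
        // List.sort is stable, so rows with equal timestamps keep table order.
        sorted.sort(Comparator.comparingLong((Row r) -> r.timestamp));
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }
}
```

With this ordering, a folder whose content.add event happened first is copied before its child file even if the child's row was appended to the table earlier.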

Running the migration

Currently, for testing and development, the only hooks for actually starting/stopping the migration are in the JCRInspector GUI. There will be some hooks added for starting it automatically on Tomcat boot like other upgrade scripts do. The migration hooks are also exposed in an API so they could be triggered via Quartz or a web service, etc.

TODO add API details

Testing Migration Integrity

In order to test the integrity of the migration, a random sampling of files/folders will be chosen for comparison. To compare these, we will fetch the ContentCollection or ContentResource for each of them from both implementations, and then compare the properties and some of the methods that determine properties, such as conditional release. Occasionally the byte streams will be compared as well, but perhaps not as often, depending on how long it takes for each one.
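
The sampling and comparison could be sketched like this. It is an illustration only: each implementation is modeled as a map from content id to that item's properties, whereas the real check would go through the ContentHostingService API on both sides, and the IntegritySampler class and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch: each implementation is modeled as a map from content
// id to that item's properties. Byte stream comparison is left out.
class IntegritySampler {

    /** Picks up to sampleSize ids at random from the given list. */
    static List<String> sample(List<String> ids, int sampleSize, Random random) {
        List<String> shuffled = new ArrayList<>(ids);
        Collections.shuffle(shuffled, random);
        return new ArrayList<>(shuffled.subList(0, Math.min(sampleSize, shuffled.size())));
    }

    /** Returns the sampled ids whose properties differ between the two sides. */
    static List<String> mismatches(List<String> sampledIds,
                                   Map<String, Map<String, String>> legacy,
                                   Map<String, Map<String, String>> jcr) {
        List<String> bad = new ArrayList<>();
        for (String id : sampledIds) {
            Map<String, String> expected = legacy.get(id);
            Map<String, String> actual = jcr.get(id);
            // A missing item on either side counts as a mismatch.
            if (expected == null || !expected.equals(actual)) {
                bad.add(id);
            }
        }
        return bad;
    }
}
```

Passing a seeded Random makes a failing sample reproducible, which helps when chasing down a mismatch.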

Testing Performance and Load

TODO Instructions for running the tests with the test runner. Also, Steve (me) needs to write the Maven 1 build for the GUI tool, since it is using a grotesque, yet delightfully reusable, module layout.