Database Refactoring

We plan to refactor the database schema for ContentHostingService in Sakai 2.4. Among the goals and objectives of that refactoring are:

  • Use arbitrary "long" integer 36-byte UUIDs as keys for relationships among tables used by CHS.
  • Totally eliminate XML.
  • Use SQL rather than Hibernate for all DB operations to support the CHS APIs (see To Hibernate Or Not To Hibernate).
  • Keep main entries for collections and resources in one table
  • Include all common properties/attributes in a main table (the CONTENT_ENTITY table) with a long integer key.
    • IN_COLLECTION (links to the containing collection)
    • ENTITY_PATH (the full path to the resource/collection within CHS)
    • ENTITY_ID (the SHA1 encoding of ENTITY_PATH)
    • ENTITY_UUID (generated by CHS; the primary key for this and other tables)
    • VERSION
    • DISPLAY_NAME (Use last "part" of ENTITY_PATH as display name instead of allowing users to partially "rename" a resource)
    • RESOURCE_TYPE (folder, fileUpload, textDocument, htmlDocument, etc)
    • CREATED_TIME
    • CREATED_BY
    • MODIFIED_TIME
    • MODIFIED_BY
    • ACCESS_MODE (grouped, site, inherited, public)
    • HIDDEN
    • RELEASE_DATE
    • RETRACT_DATE
    • HAS_PRIORITY_SORT
    • PRIORITY_SORT_ORDER
  • Include optional properties in a separate table as triples (the CONTENT_ENTITY_PROPERTIES table).
  • Include list of groups for GROUPED entities in separate table (the CONTENT_ENTITY_GROUPS table).
  • Include file-path or file-body for resources in separate table (the CONTENT_ENTITY_BODY_FILE and CONTENT_ENTITY_BODY_BINARY tables).
  • Include information about earlier versions in a separate table (the CONTENT_ENTITY_VERSIONS table). Will also need version tables for content body, probably)
  • Increase the size of ENTITY_ID ENTITY_PATH field from 255 to 1024 4096 (this is under discussion).
  • Primary key in all but CONTENT_ENTITY table is a BIGINT (long) 36-byte String, which is a UUID.
  • DB should assign those unique BIGINT keys in CONTENT_ENTITY table before we add rows to other tables.
  • Need to add indexes/constraints for common queries.
  • Promote Resource-type to field. Demote mime-type to property.

The first draft version of an autoddl file for HSQL is at "Draft of HSQL autoddl file". A revised version ("sakai_content_2_4_0.sql v2") is shown below, followed by some comments about what changed between these versions.

-----------------------------------------------------------------------------
-- CONTENT_ENTITY
-----------------------------------------------------------------------------

CREATE TABLE CONTENT_ENTITY 
(
	ENTITY_UUID VARCHAR (36) NOT NULL,
	VERSION SMALLINT,
	ENTITY_ID VARCHAR (255) NOT NULL,
	ENTITY_PATH VARCHAR (4096) NOT NULL,
	IN_COLLECTION VARCHAR (36),
	RESOURCE_TYPE VARCHAR (99),
	-- DISPLAY_NAME VARCHAR (1024),	
	CREATED_TIME TIMESTAMP,
	CREATED_BY VARCHAR (99),
	MODIFIED_TIME TIMESTAMP,
	MODIFIED_BY VARCHAR (99),
	ACCESS_MODE VARCHAR (16),
	HIDDEN BOOLEAN,
	RELEASE_DATE DATETIME (0),
	RETRACT_DATE DATETIME (0),
	HAS_PRIORITY_SORT BOOLEAN,
	PRIORITY_SORT_ORDER SMALLINT,
	CHANGES VARCHAR (1024),
	
	CONSTRAINT CONTENT_ENTITY_INDEX UNIQUE (ENTITY_ID)
);

-----------------------------------------------------------------------------
-- CONTENT_ENTITY_VERSIONS
-----------------------------------------------------------------------------

CREATE TABLE CONTENT_ENTITY_VERSIONS 
(
	ENTITY_UUID VARCHAR (36) NOT NULL,
	VERSION SMALLINT,
	IN_COLLECTION VARCHAR (36),
	RESOURCE_TYPE VARCHAR (99),
	-- DISPLAY_NAME VARCHAR (1024),	
	-- CREATED_TIME TIMESTAMP,
	-- CREATED_BY VARCHAR (99),
	MODIFIED_TIME TIMESTAMP,
	MODIFIED_BY VARCHAR (99),
	ACCESS_MODE VARCHAR (16),
	HIDDEN BOOLEAN,
	RELEASE_DATE DATETIME (0),
	RETRACT_DATE DATETIME (0),
	HAS_PRIORITY_SORT BOOLEAN,
	PRIORITY_SORT_ORDER SMALLINT,
	CHANGES VARCHAR (1024),
	
	CONSTRAINT CONTENT_ENTITY_INDEX UNIQUE (ENTITY_ID)
);

-----------------------------------------------------------------------------
-- CONTENT_ENTITY_GROUPS
-----------------------------------------------------------------------------

CREATE TABLE CONTENT_ENTITY_GROUPS 
(
	ENTITY_UUID VARCHAR (36),
	VERSION SMALLINT,
	GROUP_ID VARCHAR (99)
);

-----------------------------------------------------------------------------
-- CONTENT_ENTITY_PROPERTIES
-----------------------------------------------------------------------------

CREATE TABLE CONTENT_ENTITY_PROPERTIES 
(
	ENTITY_UUID VARCHAR (36),
	VERSION SMALLINT,
	ORDER_IN_LIST TINYINT,
	PROPERTY_ID VARCHAR (255),
	PROPERTY_VALUE LONGVARCHAR
);

-----------------------------------------------------------------------------
-- CONTENT_ENTITY_BODY_FILE
-----------------------------------------------------------------------------

CREATE TABLE CONTENT_ENTITY_BODY_FILE 
(
	ENTITY_UUID VARCHAR (36),
	VERSION SMALLINT,
	FILE_PATH VARCHAR (128)
);

-----------------------------------------------------------------------------
-- CONTENT_ENTITY_BODY_BINARY
-----------------------------------------------------------------------------

CREATE TABLE CONTENT_ENTITY_BODY_BINARY
(
	ENTITY_KEY BIGINT NOT NULL,
	VERSION SMALLINT,
	BODY BINARY
);

Changes:

In this version, the BIGINT (64-bit integer) keys have been dropped, and UUIDs are being used as primary keys in the various tables.

The full path is included as the ENTITY_PATH. When adding a new entity, we will use SHA1 encoding to get a 64-byte hash of the ENTITY_PATH and that will be the ENTITY_ID. There is an infinitesimal chance of a collision for two strings that are not the same, so maybe each time we attempt to create or access an entity based on a freshly computed SHA1 encoding of the full path, we also check to see whether the entity already exists and (if so) whether the stored path is the same as the path we started from. If it's the same, we're fine. Otherwise, we can use some additional method to calculate a hash.

There's a new table for "version control" information. This is a placeholder for the possibility that we implement some version-control features in Sakai 2.4. This is likely to change, depending on what version-control features we need to support in the Resources tool and in the CHS API.

The DISPLAY_NAME is gone. Instead, we will use the last "part" of the ENTITY_PATH as a display name. This will avoid the possibility of a user renaming a resource without changing its id, which later leads to confusion. This leaves the question of what "display name" is used for the top-level collections within a site, since those use the site's id (a UUID) as the ENTITY_PATH.

Notes:

To retrieve an entity (resource or collection) requires getting all rows in all tables that have the same ENTITY_KEY. Generally, we will be getting it from the main table based on a search for the ENTITY_ID.

We need to anticipate the most common queries and support them with indexes or constraints to make them very efficient. We also need to support sorting in the query rather than sorting in memory after retrieval (which is why DISPLAY_NAME is included in the main table, for example, rather than treated as a property).

I think this might be an improvement over the current schema, but I don't think it helps with two very common queries:

  1. Any ContentEntity has a place somewhere in a tree/hierarchy, where the parent node is a collection that "contains" the child and all its siblings and all its children. Because of groups, to fully resolve permissions on any entity in this hierarchy, we need information about the entity and every collection that contains it (i.e. every node above it in the tree/hierarchy on a direct path back to the root). It would be very cool if we had a single query that retrieved all the records for an entity AND the collections that contain it (all the way up the hierarchy). Is it possible that a BIGINT value (or a combination of BIGINT values) could express containment in a meaningful way without limiting breadth or depth of the hierarchy?
  2. We need an efficient way to retrieve a collection and everything it contains. Right now we do this programmatically. Such a query would have to be used very judiciously. And maybe it should be paged.

Comments, suggestions welcome.