Options for more efficient SQL storage

The Current State

Currently, use of the SQL storage backend for Citations and CitationLists is inefficient with both connections (which are cheap under MySQL but really expensive under Oracle if we're forcing new ones to be made) and with transactions.

If one edits a a citation, for example, what basically happens is:

  1. Data is pulled from the webpage on submit, including the ciatation and citationlist ids
  2. Citation list metadata is pulled from database (single db statement)
  3. Each individual citation is pulled from database (num citations statements)
  4. In-memory citation list object is modified to include the new/modified citation
  5. All citations in citation list are delete from datbaase (single db statement)
  6. For each individual citation
  7. ...for each individual field
  8. ......save the triple to the database. (num citations x num fields db statements)

A few notes

There are a bunch of places where we're reading data we already know, e.g., "select id, key, value where id=?" – we already know the id, so there's no sense in pulling it out of the DB again. That's a marginal speed issue, though, since the ids tend to be small and result in little network traffic.

Part A: Recommended solution: Edits/additions

The idealized solution — keeping track of which individual fields of which individual citations have been added modified and only deal with those — adds a ridiculous amount of logical overhead (esp. with regards to multi-valued fields).

A happier middle-ground would be to only delete/save citations that are "dirty" – either new or modified. We're still saving a lot of fields that we don't need to, but things should be a lot more efficient.

I would recommend the following:

  1. Add a boolean citation.isDirty = false to the citation object
  2. Modify BasicCitation.addPropteryValue, updateCitationProperty, and updateCustomUrl so they set isDirty = true
  3. Modify DbCitationService.retrieveCitation so that it sets isDirty == false before returning the fresh-from-the-storage ciatation (e.g., reset to false after all the calls to addPropertyValue set it to true, indicating that this is a pristine citation)
  4. Change DbCitationService.commitCitation to bail out unless the citation isDirty

Part B1: Recommended solution, retrieveCollection

If I'm reading it correctly, it looks like DbCiatationService.retrieveCollection is pulling each individual citation out of the database one at a time, resulting in n transcations.

We could build up a set of citationsids from the PROPERTY_HAS_CITATION triples into a comma-delimited list citationids, get all the ciatation data in one swoop with select CITATION_ID, PROPERTY_NAME, PROPERTY_VALUE from " + m_citationTableName + " where (CITATION_ID IN ?). It would look like this:

  1. Add a hasCitation(String citationID) method to CitationCollection to test if we're already storing a citation under that id (looks like there's already a contains method)
  2. Build list of ciataion ids and produce comma-delimited string
  3. Get all triples for all citations in the list
  4. Foreach triple
  5. ...unless citataionlist.hasCitation(citattionID), create a new citaiton and add it to the list
  6. ...add the property to the citation

What I don't understand about this is exactly how the ordering stuff is working; that may need all sorts of screwing around that we can't take into account if we don't do the citations one at a time.

If implemented, this should make pulling out and filling a citation list extrememly fast.

Part B2: An even better but slightly tricker solution for retrieveCollection: lazy loading of citations

Right now, the only interface to an individual citation is to load the entire citation list and then ask for the citation. Anything but listing the cites involves exactly one citation, so we're going through a lot of screwing around we don't have to.

We could implement another signature for DbCiatationService.retrieveCollection (I guess it would go in an interface somewhere??? I'm still getting a handle on Java) that takes two or three arguments:

CitationCollection retrieveCollection(String collectionID, List citationIDs)

Those citationIDs in the list (if any) will be loaded immediately; all others will be loaded on demand.

If we pass in a list of all the citation IDs as the second argument, we've essentially given ourselves a framework in which we can implement the previous recommendation (e.g., get the metadata for the CiatationCollection which includes the ability to get a list of included citationIDs, and then make a single call to this method to get those citations in one SQL statemetnt).

On the flip side, we very often don't need everything; we need exactly one citation. So, a call like:

collection = retrieveCollection(collID, [a list of a single citeid]

...would pull up the collection metadata and the data for just the single given citation, which could then be edited, viewed, whatever. Since it's the only thing available, it's the only one that will be potentially be "dirty" as well and saving will be fast.

Obviously, this will require an implementation of getCitation that will actually got out and fetch the citation from the database when necessary.