Mail digests can cause DB crash

Fixed

Description

CARET is seeing an issue related to the mail digests. Here is what we think we have figured out at this point:

PROBLEM:
org.sakaiproject.email.impl.BaseDigestService.java
This is loading the entire digest table every 1 second. We have 6 app servers so our DB is getting hammered pretty hard (it is our slowest query).
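In outline (paraphrased and simplified, not the exact source; names here are illustrative), the service runs a dedicated polling thread that wakes on a short fixed period and re-reads every digest on each pass:

// Paraphrased sketch of the polling pattern in BaseDigestService; not the exact source.
public class DigestPoller implements Runnable
{
    private volatile boolean threadStop = false;
    private final long periodMs = 1000L; // short fixed period; ~1 second in our deployment

    public void run()
    {
        while (!threadStop)
        {
            try
            {
                sendDigests(); // re-reads the whole digest table on every pass
            }
            catch (Throwable t)
            {
                // ignored; just keep polling
            }
            try
            {
                Thread.sleep(periodMs);
            }
            catch (InterruptedException ignore)
            {
            }
        }
    }

    public void stop()
    {
        threadStop = true;
    }

    protected void sendDigests()
    {
        // the real digest processing lives in BaseDigestService.sendDigests()
    }
}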

sendDigests has this (the key here is the empty catch block):
}
// if in use, missing, whatever, skip on
catch (Throwable any)
{
}
finally
{
    if (edit != null)
    {
        cancel(edit);
        edit = null;
    }
}
This causes the digest process to attempt to send the digest over and over until the locks are cleared from the database.
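Purely as a hypothetical sketch (this is not the committed patch), the same structure with the failure at least logged would make the endless retries visible in the logs; "M_log" stands in for the class's logger, and "edit"/"cancel(...)" are the existing digest edit handling:

try
{
    // ... attempt to send and clear the digest (existing code) ...
}
// if in use, missing, whatever, log it and skip on
catch (Throwable any)
{
    M_log.warn("sendDigests: could not send/clear a digest, will retry later: " + any);
}
finally
{
    if (edit != null)
    {
        cancel(edit);
        edit = null;
    }
}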

org.sakaiproject.util.BaseDbSingleStorage.java
editResource in here seems to be the place where we attempt to update the status of the digest but cannot because of the locks. This is also where something else troublesome happens:
String sessionId = UsageSessionService.getSessionId();
if (sessionId == null)
{
    sessionId = "";
}

This sessionId gets set to an empty string for all the digests.

Down a little further:
org.sakaiproject.db.impl.BasicSqlService.java
dbWrite(String sql, Object[] fields, String lastField, Connection callerConnection, boolean failQuiet) calls through to prepareStatement(PreparedStatement pstmt, Object[] fields).
In this method we do the following:
if (fields[i] == null || (fields[i] instanceof String && ((String) fields[i]).length() == 0))
{
    // treat a Java null as an SQL null,
    // and ALSO treat a zero-length Java string as an SQL null
    // This makes sure that Oracle vs MySQL use the same value
    // for null.
    pstmt.setObject(pos, null);
}

This converts the sessionId into a null. That would not be a problem if the code never failed while processing the digest. However, if it does fail, we are still supposed to be OK because we have code which cleans up locks. Unfortunately, in org.sakaiproject.cluster.impl.SakaiClusterService.java we have code like this in run():
// Delete any orphaned locks from closed or missing sessions.
statement = clusterServiceSql.getOrphanedLockSessionsSql();
List sessions = m_sqlService.dbRead(statement);
if (sessions.size() > 0) {

Here is the SQL that goes with that:
select distinct l.USAGE_SESSION_ID from SAKAI_LOCKS l left join SAKAI_SESSION s on l.USAGE_SESSION_ID = s.SESSION_ID where s.SESSION_ACTIVE is null

Unfortunately, this does not seem to be able to deal with a null session_id, so the null-session locks stay in the tables forever. This leads to huge load and growing logs, which eventually cause the database to run out of disk space and die.
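As a one-off manual cleanup (a sketch only, not part of any patch), the rows that the orphan check can never reach could be removed directly:

delete from SAKAI_LOCKS where USAGE_SESSION_ID is null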

POSSIBLE SOLUTION (this is what we are doing at CARET but I would like to see this patched into trunk and 2.4.x if people think it is worthwhile):

1) Increase the timing on the digest check to once every 12 hours (I cannot see any reason to have it any shorter than this, but I might be missing something; is there a reason?). Make this configurable in sakai.properties. This should use a Timer instead of a Thread (see the first sketch after this list).

2) Add a check to the cluster service which purges all locks that are over X hours old. I am thinking something like 24 hours, or maybe less, as the default. This would be configurable in sakai.properties (see the second sketch after this list).
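As a rough illustration of suggestion 1 (a sketch only; the class name, defaults, and the way the period gets injected from sakai.properties or Spring are assumptions, not the actual patch), the check could be driven by a java.util.Timer with a configurable period:

import java.util.Timer;
import java.util.TimerTask;

// Sketch: schedule the digest check on a Timer instead of a 1-second polling thread.
public class DigestCheckScheduler
{
    // default to once every 12 hours; intended to be overridden from configuration
    private long periodMs = 12 * 60 * 60 * 1000L;

    private final Timer timer = new Timer(true); // daemon timer

    public void setPeriodSeconds(long seconds)
    {
        this.periodMs = seconds * 1000L;
    }

    public void init()
    {
        timer.schedule(new TimerTask()
        {
            public void run()
            {
                sendDigests();
            }
        }, periodMs, periodMs);
    }

    public void destroy()
    {
        timer.cancel();
    }

    protected void sendDigests()
    {
        // the existing digest processing would be invoked here
    }
}

And a rough illustration of suggestion 2 (again a sketch; the LOCK_TIME column name and the JDBC wiring are assumptions that would need checking against the real SAKAI_LOCKS schema and the cluster service code):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

// Sketch: purge locks older than a configurable number of hours.
public class StaleLockPurger
{
    private long maxAgeHours = 24; // default; intended to be configurable in sakai.properties

    public void setMaxAgeHours(long hours)
    {
        this.maxAgeHours = hours;
    }

    public int purge(Connection conn) throws SQLException
    {
        long cutoff = System.currentTimeMillis() - (maxAgeHours * 60L * 60L * 1000L);
        PreparedStatement stmt = conn.prepareStatement(
            "delete from SAKAI_LOCKS where LOCK_TIME < ?"); // LOCK_TIME is an assumed column name
        try
        {
            stmt.setTimestamp(1, new Timestamp(cutoff));
            return stmt.executeUpdate(); // number of stale locks removed
        }
        finally
        {
            stmt.close();
        }
    }
}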

We are also seeing locks in the tables as old as 1.5 years (early 2006). This tells me that the current cleanup code is not completely trustworthy. I think this needs more investigation in the long run, but in order to keep our servers from dying this fix should be ok in the short run.

(from Stephen Marquard)
The cleanup code you're looking at in trunk (SAK-9857) is new for 2.5 - as you say, the 2.4.x and earlier cleanup code is indeed not trustworthy. We also have some of those (though none with null USAGE_SESSION_IDs).

It would probably be a good idea to add:
DELETE FROM SAKAI_LOCKS;

to the 2.4 > 2.5 conversion scripts.

Some final notes for everyone out there now that the issue is solved for us:
1) The dates in the XML blob that stores digests got out of sync for us, and this was causing even worse problems. The code was checking and seeing that the date range was not what it expected, so it would update the row in the DB for each digest that did not match (unfortunately it updated it to the same thing that was already there). This basically ensured it would loop forever, and the updates would crush our DB server, with us seeing NumAppServers * NumCorruptDigestEntries updates per second.
With the fixes in, our load has dropped way down to easily acceptable levels.

If you see something like this happening (you will probably only notice it in the DB logs when you start to run out of space), the solution is to either correct the dates in the XML or to fix the XML by removing the <message> tags (basically emptying out everything in the XML blob except a couple of tags). This will be much harder or impossible if you are running trunk because of the new encoding, so be aware of that.

Attachments

1

Details

Created October 8, 2007 at 6:46 AM
Updated October 23, 2008 at 6:29 AM
Resolved November 6, 2007 at 5:04 AM

Activity

Megan May October 12, 2007 at 10:48 AM

in .007 tag

Ian Boston October 10, 2007 at 2:58 AM

Patch has been applied to trunk. You should contact the branch managers to get it into the various branches.

Peter Knoop October 9, 2007 at 2:09 PM

Ian, can you review this patch for inclusion in 2.5? Thanks.

Aaron Zeckoski October 9, 2007 at 8:04 AM

Patch to address some issues is attached
1) This makes the digest timing configurable in sakai.properties
2) It also makes it configurable via Spring
3) It makes the ranges do a nearest-match check in case things get out of sync
4) It switches from using a Thread to a Timer

Aaron Zeckoski October 9, 2007 at 12:37 AM

In our case it got out of sync because some of our 6 app servers started up with the wrong timezone and we did not catch it immediately. Part of the fix I am working on for this will correct the dates automatically when they seem to not match anymore.
In a nutshell, when the app server read the digest XML blob it processed it correctly, but then it put a date/time string back into the blob that would no longer work on the servers with the right timezone (it was off by one hour). Since the time match here is basically a string match, part of the time it will never be able to match it again, because it assumes the hour has not changed. Since it cannot match, it loops through this digest forever.
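To make the "nearest match" idea concrete (a sketch only; the attached patch is authoritative, and the names and tolerance here are illustrative), the comparison can be done on parsed times with a tolerance instead of on the formatted strings:

// Sketch: compare the stored and expected range start times numerically,
// accepting a small offset (for example a one-hour timezone skew),
// rather than requiring the formatted date/time strings to match exactly.
public class DigestRangeMatcher
{
    private static final long TOLERANCE_MS = 60L * 60L * 1000L; // one hour

    public static boolean nearestMatch(long storedRangeStart, long expectedRangeStart)
    {
        return Math.abs(storedRangeStart - expectedRangeStart) <= TOLERANCE_MS;
    }
}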
