Resources: Performance issues zipping large file sizes

Description

We experienced a Sakai 23 Tomcat crash when a user attempted to zip a folder in Resources containing 1.8 GB of content within a course site where the quota is set to 2 GB. In production, under active load, this led to a JVM pause of 50 seconds, which in turn triggered Apache Ignite to shut down the node.

Our Tomcat nodes have 20 GB of RAM and the following JVM memory settings: -Xms10g -Xmx10g -XX:NewSize=4g -XX:MaxNewSize=4g -Xss512k

When reproducing this scenario in a developer (non-production) environment with the same memory settings but without the active load, I’ve observed JVM pauses in excess of 20 seconds, which could lead to a shutdown via Ignite.

In the same developer environment, setting the Resources quota to Unlimited sometimes still does not yield a zip file for the 1.8 GB test case. For instance, in one case we got a nested java.lang.OutOfMemoryError.

I’ve attached a screencast (TestCaseWithoutMuchLoad.mp4) showing an attempt in the same developer environment where the JVM pausing wasn’t as extreme, at 1-2 seconds at most. (In the screencast I round incorrectly. That said, I’ve seen pauses of over 20 seconds in prior attempts that were not recorded.) For a system under production load, I would expect the pausing to be longer, as in the previously cited 50-second example.
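To put numbers on the pauses instead of reading them off a screencast, a small in-JVM probe could be run alongside the test. The following is a minimal sketch, not Sakai or Ignite code: the class name PauseProbe and its methods are hypothetical. It sleeps for a fixed interval and records how much longer than requested each sleep took, which roughly approximates a stop-the-world pause (scheduler delay is included too, so treat the number as an upper bound).

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper for illustration; not part of Sakai or Ignite. */
final class PauseProbe implements Runnable {
    private final long intervalMs;
    private final AtomicLong maxPauseMs = new AtomicLong();

    PauseProbe(long intervalMs) {
        this.intervalMs = intervalMs;
    }

    /** Delay beyond the requested sleep interval, in ms; never negative. */
    static long overshootMs(long startNanos, long endNanos, long intervalMs) {
        long elapsedMs = (endNanos - startNanos) / 1_000_000L;
        return Math.max(0L, elapsedMs - intervalMs);
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop probing.
                Thread.currentThread().interrupt();
                return;
            }
            long pause = overshootMs(start, System.nanoTime(), intervalMs);
            // Keep the largest overshoot seen so far.
            maxPauseMs.accumulateAndGet(pause, Math::max);
        }
    }

    /** Largest overshoot observed since the probe started. */
    long maxPauseMs() {
        return maxPauseMs.get();
    }
}
```

Started on a daemon thread before triggering the zip (for example `new Thread(new PauseProbe(100))`), the value from `maxPauseMs()` afterwards gives a rough figure to compare against the 20, 30 and 50 second pauses described here.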

ADDENDUM: Considering the user’s events in SAKAI_EVENTS, I’ve recorded this newer screencast (MoreLikelyCaseExacerbatingJvmPause.mp4), which shows what likely happened during the node crash and how a JVM pause in excess of 30 seconds can occur on a non-production system. Ignite didn’t take down the node in this case, though.

While I’m not opposed to tweaking server specs and/or JVM specs as mitigation steps, I’m wondering if there might be some Sakai design or coding guardrails to develop and implement that would more effectively mitigate risks surfaced by this case.
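As one illustration of what a coding guardrail could look like, here is a minimal sketch under stated assumptions. It assumes the pauses come from memory pressure while the archive is being built, which I have not confirmed. It also uses plain java.nio file paths as a stand-in for Resources content, so GuardedZipper, withinLimit and zipFolder are hypothetical names, not Sakai APIs. The idea is to (1) refuse the request up front when the folder exceeds a configured size limit, and (2) copy each file through a small fixed buffer so heap use stays flat regardless of folder size.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical sketch; not Sakai's actual zip code path. */
final class GuardedZipper {
    private static final int BUFFER_BYTES = 64 * 1024;

    /** True if the sizes sum to at most maxBytes. */
    static boolean withinLimit(long[] sizes, long maxBytes) {
        long total = 0;
        for (long size : sizes) {
            total += size;
            if (total > maxBytes) {
                return false;
            }
        }
        return true;
    }

    /** Zips all regular files under folder into out; closes out when done. */
    static void zipFolder(Path folder, OutputStream out, long maxBytes) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(folder)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        long[] sizes = new long[files.size()];
        for (int i = 0; i < sizes.length; i++) {
            sizes[i] = Files.size(files.get(i));
        }
        // Guardrail 1: reject oversized requests before doing any work.
        if (!withinLimit(sizes, maxBytes)) {
            throw new IOException("Folder exceeds zip limit of " + maxBytes + " bytes");
        }
        // Guardrail 2: one reusable buffer, so memory use does not grow with content size.
        byte[] buffer = new byte[BUFFER_BYTES];
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Path file : files) {
                String name = folder.relativize(file).toString().replace('\\', '/');
                zip.putNextEntry(new ZipEntry(name));
                try (InputStream in = Files.newInputStream(file)) {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        zip.write(buffer, 0, n);
                    }
                }
                zip.closeEntry();
            }
        }
    }
}
```

In a real fix the limit would presumably come from a configurable property and the output would go to a temporary file or directly to storage, but where those hooks belong in Sakai is something I’d defer to the maintainers on.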

This might be a problem in Sakai 25 too. I just haven’t tested that case yet.

Attachments

2
  • 27 Feb 2025, 12:57 AM
  • 27 Feb 2025, 12:11 AM

Activity

Sean Horner February 27, 2025 at 12:33 AM
Edited

Sent log to Earle via Slack. It’s big.

Earle Nietzel February 27, 2025 at 12:19 AM

can you provide a link to the log with the failure?

Details

Created February 27, 2025 at 12:11 AM
Updated February 27, 2025 at 11:31 PM
