Performance issues in site visits and retrieval (doSite, getSites)
Description
Activity
Neal Caidin April 28, 2014 at 1:04 PM
Can this one be closed, or is there more to do or QA on this ticket? Thanks.
Matthew Buckett March 4, 2014 at 4:44 AM
I noticed this against trunk and reported it; Matt J then linked that report to this ticket.
Curtis Van Osch September 6, 2013 at 4:00 PM
Has anyone else noticed if this patch breaks the paging in Worksite Setup?
Since I've integrated it, the first page seems to render fine (displaying 1 - 20 of 15017), but clicking through to the next page sometimes displays fewer than 20 sites (though the message still says displaying 21 - 40). The number of results decreases with each page until the message "No sites were found" is displayed.
If I revert the patch, the problem disappears.
Curtis Van Osch September 5, 2013 at 2:13 PM
I just applied the v2 patch to our 2.9.1 code without issue, though I did have to specify our kernel version (1.3.1) in the pom.xml for entitybroker/core-providers, portal-impl and portal-service-impl. Otherwise Maven seemed to be using sakai-kernel-api-1.3.0.jar from its repo (even though the master/pom.xml specifies 1.3.1) and the build would fail (cannot find LazySite, among others).
If our tests pass, this will be going into production this evening to try and fix some performance issues we've been experiencing.
Lydia Li September 4, 2013 at 6:41 PM
Greg, were you able to resolve the compile error you saw? We (at Stanford) are considering merging this patch into our 2.9.x. Is the v2 patch good to use?
There are various, rather serious performance issues with Sites at scale (size and usage). They are exposed on very common actions such as logging in (initial site visit, not actual login), all site visits, and tool requests for those that can merge content (announcements, schedule, etc.).
The most fundamental issue is that SiteService.getSites is called in very many places throughout the code base to retrieve the list of all sites for the current user, and this is never cached. On some basic requests like site visits, this is called upwards of three or four times by the portal (for finding the user's default site [after SAK-22386], calculating tabs, subsites, and so on).
At small scale (few concurrent users, few sites per user, short descriptions), the penalty is not obvious. However, it can be dramatic if, for example, users belong to hundreds of sites or site descriptions are long (which happens easily when content is pasted from Word, for example). The long site descriptions are not the problem in and of themselves; they are just retrieved far too often, and on operations that underlie nearly every other activity.
The getSites operation should be optimized to provide a way to retrieve sites without costly data not relevant to how they will be used, as in the list of site titles used for rendering navigational tabs. There should also be some caching to account for other areas of the code base that may call getSites repeatedly.
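As a rough illustration of retrieving sites without their costly data, the sketch below defers loading a description until something actually asks for it. The class name LazyDescriptionSite and the Supplier-based loader are hypothetical and chosen only for this example; they are not the types or signatures used in the patch.

```java
import java.util.function.Supplier;

// Hypothetical lightweight site record; not the Sakai Site interface.
final class LazyDescriptionSite {
    private final String id;
    private final String title;
    // Assumed hook that fetches the (possibly large) description on demand.
    private final Supplier<String> descriptionLoader;
    private String description;
    private boolean loaded;

    LazyDescriptionSite(String id, String title, Supplier<String> descriptionLoader) {
        this.id = id;
        this.title = title;
        this.descriptionLoader = descriptionLoader;
    }

    String getId() { return id; }

    // Cheap: enough for rendering navigation tabs.
    String getTitle() { return title; }

    // Expensive data is fetched at most once, and only if requested.
    synchronized String getDescription() {
        if (!loaded) {
            description = descriptionLoader.get();
            loaded = true;
        }
        return description;
    }

    synchronized boolean isDescriptionLoaded() { return loaded; }
}
```

With a record like this, code that only needs titles never triggers the description fetch, while existing callers that read the description still get it.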
Upon investigation, it was also found that the portal spends significant time (equivalent to all duplicative retrieval in load testing) escaping site information for HTML. This occurs regardless of any caching since it is at the time of viewing. This HTML-safe text should be precalculated and exposed as part of the site information.
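A minimal sketch of the precalculation idea follows: escape a description once and keep the result, so repeated page renders reuse it. The EscapedDescription class is hypothetical, and its small escape helper is only a stand-in for the portal's existing escaping (Web.escapeHtml), whose exact behavior is not reproduced here.

```java
// Hypothetical holder for one description; not the Sakai Site implementation.
final class EscapedDescription {
    private final String plain;
    private String html; // escaped form, computed on first request and then reused

    EscapedDescription(String plain) {
        this.plain = plain;
    }

    String getDescription() { return plain; }

    // Returns the HTML-safe text, escaping only the first time it is requested.
    synchronized String getHtmlDescription() {
        if (html == null && plain != null) {
            html = escape(plain);
        }
        return html;
    }

    // Minimal escaper used for illustration only (assumption: four characters suffice here).
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

The trade-off is a second copy of each description held in memory in exchange for not re-escaping it on every view.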
The University of Michigan has invested significant energy in profiling and analyzing these issues after observing some performance degradation. In the original scenario, there was also some unknown degradation of connections within the database pool. Specifically, borrowed connections were transferring CLOB data (site descriptions, announcement bodies, others) extremely slowly, while new connections created on the affected servers performed at expected rates. It is not clear what the exact trigger was for this condition, but the pool (very outdated DBCP) or long-established database sessions are likely involved. This degradation has not yet been reproduced outside of the original window, so energy has been directed at resolving the glaring efficiency issues in SiteService and Portal.
Specific notes:
SiteHandler goes through SiteNeighbourhoodServiceImpl.getAllSites, which retrieves all accessible sites regardless of the active context (sites, tabs, or other) because getSitesAtNode delegates unconditionally.
Solutions proposed:
Leave SiteService.getSites behavior unchanged for existing calls (except potentially faster because of caching)
Add a SiteService.getSites signature to retrieve records without processing descriptions
Implement a four-stage approach in SiteService.getSites: get the site IDs; check the cache for each; query by ID for only those not cached; cache the newly loaded sites
For sites loaded without descriptions, lazily load descriptions on loadAll and getSite (which retrieves all pages, etc. and caches a single site fully)
Add a new SiteService method, getUserSites, that caches the complete list of sites for a user
Call getUserSites from SiteNeighbourhoodServiceImpl.getAllSites to take advantage of cache
Call getUserSites from MergedList in site/mergedlist-util (used for announcements, etc.)
Add new methods to Site, getHtmlShortDescription and getHtmlDescription, that escape the descriptions and retain the escaped text internally
Call the new HTML-safe description getters from PortalSiteHelperImpl instead of Web.escapeHtml on the plaintext descriptions
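The four-stage lookup proposed above for SiteService.getSites could look roughly like the following. This is a sketch under stated assumptions, not the patch itself: CachedSite, FourStageSiteLookup, and the loadByIds function are hypothetical names, and a plain ConcurrentHashMap stands in for whatever cache the kernel actually uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical stand-in for a cached site record; not the Sakai Site interface.
final class CachedSite {
    final String id;
    final String title;

    CachedSite(String id, String title) {
        this.id = id;
        this.title = title;
    }
}

// Illustrative lookup: IDs in, cache check, query only the misses, cache what was loaded.
final class FourStageSiteLookup {
    // Assumption: a simple in-memory map stands in for the real cache.
    private final Map<String, CachedSite> cache = new ConcurrentHashMap<>();
    // Hypothetical bulk loader: receives the uncached IDs, returns the matching sites.
    private final Function<List<String>, List<CachedSite>> loadByIds;

    FourStageSiteLookup(Function<List<String>, List<CachedSite>> loadByIds) {
        this.loadByIds = loadByIds;
    }

    List<CachedSite> getSites(List<String> siteIds) {
        // Stage 1 (done by the caller here): siteIds is the result of a cheap ID-only query.
        Map<String, CachedSite> found = new HashMap<>();
        List<String> misses = new ArrayList<>();

        // Stage 2: check the cache for each ID.
        for (String id : siteIds) {
            CachedSite hit = cache.get(id);
            if (hit != null) {
                found.put(id, hit);
            } else {
                misses.add(id);
            }
        }

        // Stage 3: query by ID for only the uncached sites.
        if (!misses.isEmpty()) {
            for (CachedSite site : loadByIds.apply(misses)) {
                // Stage 4: cache the newly loaded sites.
                cache.put(site.id, site);
                found.put(site.id, site);
            }
        }

        // Return results in the order the IDs were given, skipping any that were not found.
        List<CachedSite> ordered = new ArrayList<>();
        for (String id : siteIds) {
            CachedSite site = found.get(id);
            if (site != null) {
                ordered.add(site);
            }
        }
        return ordered;
    }
}
```

On a repeat request for the same user, only sites not already cached reach the loader, which is the saving the proposal is after. Cache invalidation when a site changes is deliberately left out of this sketch.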