LDN - CodeStorm 2009

Poor Portal Performance

We are moving to Parallel deployment at Clayton State University. This came about as a response to the continuous performance and reliability problems we are experiencing with our Portal.

We have patched to the bleeding edge (4.2.1.0 at the time of this writing) and are still experiencing show-stopping performance problems. I am curious to know what load times other institutions are averaging, and with what settings and hardware specs.

Here is a little about us:

  • 6,500 students, 600 staff
  • Average load is 200-300 people logged in at the same time
  • Each user averages 10 tabs, with 13 channels present on the "default" tab
  • Of these channels, 3 are TA channels, the rest are TCC channels, and 3 of those are loaded via AJAX from other servers
  • 211 Channels
  • Single Sign on icons average 11 per user

 

And the system crawls. We are seeing load times of upwards of 60 seconds from submitting the login form to the tabs and channels finishing rendering. Our new resource tier in our parallel deployment setup has reduced this load time to about 15 seconds, but this still seems like a ridiculous load time for a website.

Here are the specs on our tiers: 

  • VMware ESX servers; the resource tier and portal tiers are VMs
  • RHEL 4
  • Quad-Core AMD Opteron Processor 8384 w/ 4 dedicated CPUs to resource tier
  • 6GB memory for resource tier
  • SAN storage device (connected via gigabit ethernet) to 15k SCSI HDD in RAID 5

 

With this hardware, we are not seeing a spike in any particular resource (CPU, memory usage, IO), but the login just hangs. This occurs with only one person logged in (after the cache hit occurs). We are testing with just the resource tier, outside of the load balancer before we incorporate this along with additional portal tiers.

We have read the L****** optimization guide, and the only relevant piece of information it contained seemed to be changing the "tomcat-cp-conf" file so that the Xmx and Xms settings use "246..." instead of "123..." bytes of memory.

I took a session capture with Firebug to show the size of the data and the number of requests after login and before rendering the Portal's main tab. I took a screenshot of the summary, and have uploaded it here. You can see that it's reporting a 17 second login with only 100KB of data transferred.

Our portal log files just show the usual errors, and our OS log files report everything as normal.

Ideally, I would love to get as short a login delay as possible (less than one second). Has anyone achieved this? What settings are being used?


Login/initial load

Hi,

It's not the login that takes 17 seconds, but the initial load. Your Firebug screenshot is showing the home tab load time: the "render.user..." page.

By your description, it sounds like the cause is your content volume: mainly the number of channels on the user's tab, but also likely the total number of channels and so on.

From your post: "Each user averages 10 tabs, with 13 channels present on the 'default' tab."

That means on the initial load you need to render and cache 10 channels. Depending on the channel, it may reach back to a database, another website, or IMAP (often the slowest of them all). So the user's single GET means the portal is doing 10+ GETs and processing. Then if you are using targeted content channels with multiple sections you have even more GETs, and if you have channels in the master XSLT, then you have even more processing.
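
To make the fan-out concrete, here is a rough sketch of how one user GET turns into a total wait. The channel names and latencies are made up for illustration (they are not measurements from any portal), and whether the portal fetches channels one after another or concurrently is an assumption you would need to confirm, so both cases are shown.

```python
# Hypothetical per-channel back-end latencies in seconds.
# These numbers are illustrative only, not measured values.
channel_latency = {
    "announcements_db": 0.5,
    "campus_news_http": 1.0,
    "email_imap": 6.0,
    "calendar_db": 0.5,
    "library_http": 1.5,
    "bookmarks_db": 0.25,
    "weather_http": 0.75,
    "course_list_db": 2.0,
    "quick_links": 0.5,
    "events_http": 1.0,
}


def sequential_render_time(latencies):
    """Total wait if channels are fetched one after another."""
    return sum(latencies.values())


def parallel_render_time(latencies):
    """Total wait if all channels are fetched concurrently:
    the page is only as fast as its slowest channel."""
    return max(latencies.values(), default=0.0)


def slowest(latencies, n=3):
    """Return the n slowest channels, slowest first."""
    return sorted(latencies.items(), key=lambda kv: kv[1], reverse=True)[:n]


if __name__ == "__main__":
    print("sequential:", sequential_render_time(channel_latency), "s")
    print("parallel:  ", parallel_render_time(channel_latency), "s")
    print("slowest:   ", slowest(channel_latency))
```

Either way, a single slow channel (IMAP in this made-up data) dominates, which is why trimming or replacing the slowest channels on the home tab is usually the first thing to try.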

Then on initial login, the portal caches your access to channels, credentials, personal information, and other such objects.

If these numbers are correct, I'd say the 17 seconds sounds about right.

Sungard should be able to tell you if that's the case.

Hope that helps sort things out.

 

Cheers

 

Dave Wolowicz

Amount of caching

Does the Portal cache permissions for ALL channels, sections, tabs, etc on login, or just for the default tab? I would gather it runs through all of them, because when you add new channels, the lists are pre-populated based on your access.

If this is the case, then it is doing 211 channels worth of permissions querying, plus any TCC section permissions. This could easily double - maybe even triple our number of permissions checks. It seems like a bad handicap to check for ALL of that on login.
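
As a back-of-the-envelope check on that worry, the sketch below works out the number of checks and what each one would have to cost if the whole 17 second load were spent on permissions. The 211 and 17 come from this thread; the 2x and 3x multipliers are my guess above, and attributing the entire load time to permission checks is an assumption made purely to get an upper bound.

```python
TOTAL_CHANNELS = 211      # from our channel count
OBSERVED_LOAD_SECONDS = 17  # from the Firebug capture


def permission_checks(channels, multiplier):
    """Number of checks if TCC sections multiply the per-channel work."""
    return channels * multiplier


def per_check_budget_ms(total_seconds, checks):
    """Milliseconds each check could take if checks alone filled the load time."""
    return total_seconds * 1000.0 / checks


if __name__ == "__main__":
    for multiplier in (1, 2, 3):
        checks = permission_checks(TOTAL_CHANNELS, multiplier)
        budget = per_check_budget_ms(OBSERVED_LOAD_SECONDS, checks)
        print(f"x{multiplier}: {checks} checks, about {budget:.1f} ms each")
```

If individual checks are much cheaper than these budgets, the permissions pass alone would not explain the delay.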

How many channels do you have? How long is your load time?

Thanks for the help!

Have you tuned your Apache Tomcat settings as well?

We ran load testing and failed, but with tuning in place we were more than fine. Make sure you check that. I have a copy of the guide if you want to email me.

 

-Tom Galanis (@tgalanis)

Channel Caching

I believe that it caches your permissions, but not the channels themselves. I believe it will cache the channel content on the tabs as you view them, and that will depend on the channel type, as some channels have minimal caching.

I know in 3.3 as you load a tab it will sometimes start rendering and will stop on a certain channel. That means that channel is taking some time. Other times it will wait and dump the entire render. I would trim down your home tab and see how your portal reacts.

 

Cheers

 

Dave

Thanks everyone

Our upgrade is right around the corner and we have resolved this nasty slowdown issue. Thanks to everyone who replied to help us out on this one.

We did this by matching the version of a new installation of our resource tier as closely as possible to the version of our source resource tier when we ran the replication process.

After doing a fresh import, we are looking at about one second to login despite having the same number of channels, tabs, and the same layout.

If I had to guess, something in version 4.2.1.0 knocks out some performance optimization when importing files from a replication.

We will look at fine-tuning this number even further in the future by condensing channels and tabs, and removing many that never get updated.

Also on the list of things to do is our Mobile view of the Portal.

Check your DB

See my post at http://www.lumdev.net/node/3070 under issues encountered.

The slowdowns we experienced were similar to what you describe. We resolved it by indexing the userId field in ta_x_user. If you prune the table you will not see any slowdowns initially but you may see it slow down later on when more and more Personal TAs are posted to Luminis.
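
For anyone who wants to see why the index matters, here is a small self-contained sketch using Python's built-in sqlite3 as a stand-in database. Only the table name ta_x_user and the userId column come from the post above; the ta_id column, the index name, and the row counts are hypothetical, and your portal's real database will report its query plans differently.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's query plan for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")

# Hypothetical two-column layout; the real table has its own schema.
conn.execute("CREATE TABLE ta_x_user (ta_id INTEGER, userId INTEGER)")
conn.executemany(
    "INSERT INTO ta_x_user VALUES (?, ?)",
    ((i, i % 7000) for i in range(50000)),
)

lookup = "SELECT ta_id FROM ta_x_user WHERE userId = ?"

# Without an index, the lookup has to scan every row.
plan_before = query_plan(conn, lookup, (42,))

# Index name is made up for this example.
conn.execute("CREATE INDEX idx_ta_x_user_userid ON ta_x_user (userId)")

# With the index, the lookup becomes a search on userId.
plan_after = query_plan(conn, lookup, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

The full scan grows with the table, which matches the pattern described above: pruning hides the problem for a while, and it returns as more rows accumulate.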