Help!! Luminis 4.1

Before we "ditch Luminis for static html" we thought we'd try asking this community for help. On August 22, 2008, we upgraded to Luminis 4.1.0.25 and have not had a stable system since.

Our current status is hopeless, but more technically, we are experiencing spikes in heap memory usage that ultimately lead to a crash. Sungard has said that this is a memory leak, but they have been unable to find it. We installed a recent patch (4.1.0.29) from them that was designed to reduce the number of items cached in the user session. Since installing it a week ago, we have had 4 crashes.

We run Lambda Probe daily and consistently see a spike in memory usage followed by failed garbage collections, which indicates the system is about to crash.

We have about 1500 concurrent user sessions, we are NOT on parallel deployment, and we run Solaris 9 on a v490 with 4 CPUs and 12 GB of RAM. We have seen no correlation between the number of concurrent users and crashes. Tuning has been applied, and Sungard has worked with us for weeks now on everything they can think of.

We are beginning to wonder if it is possibly one user accessing a file, link, channel, etc., that is bringing the system down. Unfortunately, Sungard currently has no diagnostic tools to help us in this regard.

A few questions:

Has anyone experienced similar problems with Luminis 4.1.0?
Does anyone have any diagnostic tools to help locate user actions which may be causing problems? Or to monitor channel use, layout changes?
Suggestions, advice, good drink recipes?
We're open to anything that might help at this point.

Thanks,
Denise

Comments

verbose:gc

http://developers.sun.com/mobility/midp/articles/garbagecollection2/#9

We were able to pinpoint the leak for Sungard by looking at a verbose:gc log way back on Luminis 3. There are many tools to parse the verbose:gc log out there. At the time, I think we used HPjtune.

What we saw was a slow, steady decrease in the available heap after each garbage collection, and occasional crashes caused by out-of-control object creation. The 'one user clicks something and it's frozen' problem.

We showed the object in question to Sungard and they were able to patch it. The slow leak leading to a crash is harder, as it might be multiple tiny leaks.

If you are crashing that often though, I'm guessing there are some out-of-control object creation events. So enable verbose:gc logging in your Java parameters (-verbose:gc), and then take a look at the resulting gc.out log for big spikes in object creation.
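If you don't have a log analyzer handy, a first pass over gc.out can be scripted. The following is only a rough sketch in Python, assuming the classic one-line HotSpot -verbose:gc format (`[GC before->after(total), secs]`); logs written with extra detail flags look different and would need another pattern. The function names and the 90% threshold are made up for illustration.

```python
import re

# Classic HotSpot -verbose:gc lines look like:
#   [GC 325407K->83000K(776768K), 0.2300771 secs]
#   [Full GC 267628K->83769K(776768K), 1.8479984 secs]
GC_LINE = re.compile(
    r"\[(Full GC|GC)\s+(\d+)K->(\d+)K\((\d+)K\),\s+([\d.]+) secs\]"
)

def parse_gc_log(lines):
    """Return one dict per GC event found in the given log lines."""
    events = []
    for line in lines:
        m = GC_LINE.search(line)
        if m:
            kind, before, after, total, secs = m.groups()
            events.append({
                "full": kind == "Full GC",
                "before_kb": int(before),
                "after_kb": int(after),
                "total_kb": int(total),
                "pause_s": float(secs),
            })
    return events

def ineffective_full_gcs(events, threshold=0.90):
    """Full GCs that still leave the heap more than `threshold` full.

    The threshold is an arbitrary starting point, not a recommendation.
    """
    return [e for e in events
            if e["full"] and e["after_kb"] / e["total_kb"] > threshold]

def post_gc_floor_growth(events):
    """Growth (KB) in post-full-GC heap use between first and last full GC."""
    floors = [e["after_kb"] for e in events if e["full"]]
    return floors[-1] - floors[0] if len(floors) >= 2 else 0

if __name__ == "__main__":
    with open("gc.out") as fh:  # path is an assumption
        events = parse_gc_log(fh)
    print("events:", len(events))
    print("full GCs that freed almost nothing:", len(ineffective_full_gcs(events)))
    print("post-GC floor growth (KB):", post_gc_floor_growth(events))
```

A floor that creeps up over days points toward a slow leak; a full GC that suddenly frees almost nothing points toward a spike like the one described above.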

Another even trickier problem we had with HPUX once (now we are on Solaris) was caused by the Sun compiled build of the webserver not being correct for HPUX memory quadrants. In the end, with HP's help, we figured out how to force a core dump of the web server memory at the time of the hang/crash, and sent each core to HP/Sun for analysis.

Do you have any support contracts for software with Sun? You might get them to help you output core dumps, and they could take a look at it and see what the heap is filled up with.

We had similar issue with LP 4.0.2.x

Hello Denise,
We had a similar issue with LP 4.0.1.x. Login was slow, channels took a long time to load, and navigating between tabs took a long time: even on a good day we could count 40 seconds.

We reported the problem to Sungard and they asked us to move to the higher version. We did that. We moved to Luminis 4.0.2 and we didn't have that issue. We are currently on LP 4.1.0.15 and we do not have that lag time.

I suggest you take off the LP 4.1.0.25 patch and install LP 4.1.0.15 and let me know how the performance is.

Update

Thank you for your comments. We already have a verbose:gc log. We'd like to find some time to get HPjmeter (which now encompasses HPjtune).

As for backing off the patch, Sungard's recommendation at this point is that we keep the patch installed, as it has helped us more clearly identify that we are having object-related events (much like Jason described).

I will keep you up to date on the status of any fix we receive. In the meantime, we appreciate any and all comments as they give us ideas to look at and pursue.

Thanks again,
Denise

Denise Anderson
Portal Administrator
Wright State University

More info on patch

Denise,
We have had performance problems with Luminis 4 since load increased at the start of the Fall term.
However, we could not get even 300 users on before crashing. (Solaris 10, T2000, lots of memory).

RC identified database pooling issues through thread dumps that we submitted.
These problems were addressed in 4.1.0.29, which we installed.
This seems to have stabilized things for the time being. We can at least successfully host 400 users now.
(I thought the 4.1.0.29 was created just for us, but it would appear that some object creation fixes are in there as well.
The database pooling changes were to take some 'expensive' operations, such as reading and writing stack dumps and validation of connections, out of locked synchronized loops).

In terms of investigating what's going on, we are still doing that.
We have installed and are looking at the Lambda probe. Not quite sure what to get from that.
As well, the system monitor channel has provided some information. I removed the email channel from the home tab because it had the highest (mean_load_time * count) value.
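For anyone wanting to repeat that ranking, here is a small sketch in Python. It assumes you have copied each channel's mean load time and request count out of the system monitor channel by hand; the channel names and numbers in the example are invented, and the time unit is whatever your monitor reports.

```python
def rank_channels(stats):
    """Rank channels by total time spent loading them.

    stats: iterable of (channel_name, mean_load_time, count) tuples.
    Returns (channel_name, mean_load_time * count) pairs, worst first.
    """
    scored = [(name, mean * count) for name, mean, count in stats]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    sample = [
        ("email", 2400, 1500),
        ("announcements", 300, 1400),
        ("targeted content", 1800, 900),
    ]
    for name, total in rank_channels(sample):
        print(name, total)
```

A channel with a modest mean but a very high count can outrank a slow but rarely used one, which is why the product is more telling than either number alone.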

We are still having problems rendering Targeted Content channels. I have re-created our most important Targeted Content channel in Luminis 4, and am now using this as opposed to the previous 'migrated from III' version. The premise being that a native 4 TC may perform better than a migrated III TC. There is no basis for this other than superstition :-)

As you are doing, I am continuing to look for new ways to observe Luminis, in an attempt to understand what our issues are, and how our specific setup is affecting performance. Can't offer more than that right now.
Bob

interesting post

Bob,

Though it's not good to hear that you've been in a similar situation, it is good to hear that things have stabilized somewhat. Like you, we've removed the email channel from the Home tab. On a side note, are you using integrated email? We are using Sun One Communications Express for non-integrated email and calendar. We upgraded to this non-integrated solution at the same time we installed Luminis IV in production -- just to further complicate things :).

Monitoring the Tenured Gen graph of the Lambda Probe has helped us know when the system is trashed beyond recovery and must be restarted. It has also shown us that, instead of the slow memory leak we suspected, we are seeing a sudden spike in garbage collection followed by a lack of system recovery. This leads us to believe that a single user may be at the heart of the problem. Comparing logins in session.log prior to each daily crash has given us the name of one user who is common to each crash. We are trying to track him down. I'll let you know if that comes to anything.
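In case it is useful to anyone else, the login comparison boils down to a set intersection. Below is a minimal sketch in Python. It assumes the logins have already been pulled out of session.log as (timestamp, username) pairs; the parsing step is left out on purpose, and the 30-minute window and the function names are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta

def users_before(logins, crash_time, window):
    """Users who logged in during `window` leading up to `crash_time`."""
    start = crash_time - window
    return {user for ts, user in logins if start <= ts <= crash_time}

def common_users(logins, crash_times, window=timedelta(minutes=30)):
    """Users who logged in shortly before every one of the crashes."""
    sets = [users_before(logins, crash, window) for crash in crash_times]
    return set.intersection(*sets) if sets else set()

if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    logins = [
        (datetime(2008, 10, 1, 9, 50), "alice"),
        (datetime(2008, 10, 1, 9, 55), "bob"),
        (datetime(2008, 10, 2, 14, 40), "bob"),
        (datetime(2008, 10, 2, 14, 45), "carol"),
    ]
    crashes = [datetime(2008, 10, 1, 10, 0), datetime(2008, 10, 2, 15, 0)]
    print(common_users(logins, crashes))
```

With a busy portal the intersection may still contain many names after only a few crashes; each additional crash narrows it further.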

Finally, we are looking at installing HPjtune, as Jason suggested, to see if we can get any further information. Unlike our pre-patch situation, we no longer see any correlation between concurrent user counts and crashes. We can crash with 400 users or 1500 users.

If you'd like to contact me off list with any questions, please email me at denise.anderson@wright.edu. I truly appreciated your post.

Thanks,
Denise

Denise Anderson
Portal Administrator
Wright State University