Luminis not responding after one hour, reboot required

This week our Luminis web server started having problems. It runs Luminis 4.2.2.34 on a Sun T5220 with Solaris 10. After running for an hour to an hour and a half, it becomes unstable and stops responding. Banner channels fail to render. Toward the end of the slowdown, users trying to log in get the "too many simultaneous logins" message, and by that point the system is unusable. Just before it becomes unstable we see ping packets to the Luminis server being lost. There are no memory errors in the log files and no error messages in the system logs. We have to reboot the system before Luminis becomes stable again.

The issue only occurs between 8am and 5pm. From 5pm to 8am the number of users is under 100 and no problems occur. Lambda Probe shows total memory increasing steadily from start to end, peaking around 550 MB.

Has anyone encountered this situation and can provide insight?

Thanks!

We are only on luminis3 so

We are only on luminis3 so this may be irrelevant...

Our system becomes unstable if one of the database servers that one of our custom channels connects to "goes away".
The system seems to get stuck waiting for the external db to respond.
The load on the server goes through the roof and then the platform becomes unstable. A reboot fixes it, but the problem will recur unless the offending channel is removed.

Banner Channels

Banner Channels are surely the culprit. Check your Banner DB for the time period when this was happening and look at the concurrency for the BANPROXY user. We ran into this a few times and decided to ditch the Banner Channels completely. We eventually rewrote quite a few using jQuery, but we will always stay away from the CAR file type Banner channels.
By the way, this problem is load related. We have about 170,000 users in our portal and hit this Banner channel problem at about 5000-6000 simultaneous sessions. Once we got rid of the channels, we easily scaled up to more than 10,000 simultaneous sessions on our 13-webserver PD environment (ver 4.2.1.96).

Seconding Banner Channels

Since moving to Banner 8 we've been running into this issue. You can confirm that Banner Channels are the culprit by shutting down the Banproxy app on the application server; the portal should recover shortly thereafter. To work around this, try killing the active BANPROXY database sessions or restarting the Banproxy app.

We've had to replace the channels with direct links into SSB. We have received no fix or explanation as to why this is happening from SGHE.

Could you share the jQuery channels you've created?

I wonder....

Did you get this resolved?

I'd be curious to see your gc log from /opt/luminis/products/tomcat/tomcat-cp/logs...

Resolution

The issue came down to the web server's firewall dropping packets.

We use the Sun ipf firewall, which keeps state information on incoming connections. When the number of connections was high, the firewall was dropping connection state information.

We found that the out-of-the-box ipf state table size is small:
# ipf -T list|grep fr_state
fr_statemax min 0x1 max 0x7fffffff current 4013
fr_statesize min 0x1 max 0x7fffffff current 5737

The command:
ipf -T list|grep state
will show the current ipf state information table size.
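
If you want to log these values over time, a small filter can pull just the current settings out of that listing. This is only a sketch: the function name state_tunables is made up for illustration, and it assumes the output looks like the two lines shown above (on Solaris you may need nawk instead of awk).

```shell
# Hypothetical helper (not part of ipf): reads `ipf -T list` output on
# stdin and prints name=value for the two state-table tunables.
state_tunables() {
    awk '$1 == "fr_statemax" || $1 == "fr_statesize" {
        # The value follows the word "current" on each line.
        for (i = 2; i < NF; i++)
            if ($i == "current") print $1 "=" $(i + 1)
    }'
}
```

Usage would be something like: ipf -T list | state_tunables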

Increase ipf state size:
ipf -D -T fr_statemax=123113,fr_statesize=160031,fr_statemax,fr_statesize -E -T fr_statemax,fr_statesize

* IPSTATE_MAX (=fr_statemax) should be ~70% of IPSTATE_SIZE
* IPSTATE_SIZE (=fr_statesize) has to be a prime number
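
The two rules above can be turned into a quick calculation. The sketch below is an illustration only: the function names are invented, it takes "~70%" literally as integer 70/100, and it uses plain trial division to find the next prime at or above the size you ask for.

```shell
# Hypothetical helpers; the names are illustrative and not part of ipf.

# Succeeds (returns 0) if $1 is prime, using trial division.
is_prime() {
    n=$1
    if [ "$n" -lt 2 ]; then
        return 1
    fi
    i=2
    while [ $((i * i)) -le "$n" ]; do
        if [ $((n % i)) -eq 0 ]; then
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}

# Prints the smallest prime that is >= $1.
next_prime() {
    p=$1
    until is_prime "$p"; do
        p=$((p + 1))
    done
    echo "$p"
}

# Given a desired table size, prints a prime fr_statesize and an
# fr_statemax of 70% of it (rounded down), in the form ipf -T accepts.
suggest_state_sizes() {
    size=$(next_prime "$1")
    max=$((size * 70 / 100))
    echo "fr_statemax=$max,fr_statesize=$size"
}
```

For example, suggest_state_sizes 150000 prints a pair that could be passed to ipf -T. Check the result against your own connection counts before applying it.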

Sun has an ipf patch whose description reads:
6900850 limit for number of states in the state table is too low by default

Summary:

1. Increased the ipf state table size and put the new settings in /usr/kernel/drv/ipf.conf.
2. Optimized the firewall rules in /etc/ipf/ipf.conf: combined some rules and removed old rules that were no longer needed.
3. Applied Sun ipf patch 141506-09 (required reboot).
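
For step 1, the persistent setting might look roughly like the line below. Treat it as an assumption rather than the exact file from this system: it follows the usual Solaris driver.conf property syntax and reuses the values from the ipf -T command earlier in this post. Back up the original file first and check it against your own Solaris release.

```
# /usr/kernel/drv/ipf.conf (sketch; values taken from the ipf -T example)
name="ipf" parent="pseudo" instance=0 fr_statemax=123113 fr_statesize=160031;
```

After the change takes effect, ipf -T list|grep fr_state should report the new values.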