Clustering Resource Tier of Luminis

I decided that I really needed to cluster the resource tier of Luminis, especially if we were going to start closing off other avenues to resources and forcing people on our campus to go through Luminis. So I found some test hardware and decided to have at it. I installed Luminis III.2 as the base platform to test failover, then upgraded to III.3.1 patch 21 to see how patching works in a clustered environment. It’s all done with a couple of Sun Solaris boxes (V240s) running Solaris 9 and Sun Cluster 3.1. I won’t go into too much detail about Sun Cluster, as I’ll assume that if you are attempting this you are familiar with it, but as always I’m happy to answer any questions. :)

1. Install & patch Solaris on both nodes.
2. Install & patch Sun Cluster on both nodes (after this step you should have a two node cluster)
3. On our systems, Luminis is installed in /opt/pipeline. For convenience, I’ve soft-linked /opt/pipeline to the global storage where Luminis will ultimately reside (/global/luminis). This soft-link should exist on both nodes.
4. Create a resource group for Luminis (e.g. luminis-rg). This can be accomplished with the command:
# scrgadm -a -g luminis-rg -y nodelist=node1,node2

5. Add the logical hostname to the resource group and bring it online on node 1. This can be done with the commands:
# scrgadm -a -t SUNW.LogicalHost
# scrgadm -a -L -g luminis-rg -j logicalhostname
# scswitch -Z -g luminis-rg

6. Add the necessary users and groups for the luminis install (typical install stuff).
7. Install Luminis III.2 on node 1, making sure your custom.conf uses the logical hostname rather than the physical hostname, and install to global storage that is accessible to both node 1 and node 2 (e.g. /global/luminis).
8. After the install is done, make sure you can bring it up on that node.
9. If everything looks good, go ahead and bring luminis down on node 1.
10. Fail the logical hostname over to Node 2 with the command:
# scswitch -z -g luminis-rg -h node2

11. Install Luminis III.2 on node 2, again making sure your custom.conf uses the logical hostname rather than the physical hostname and that you install to global storage accessible to both nodes (e.g. /global/luminis). The global storage can be the same location you used for node 1. The install will just overwrite what’s already there (this should still be a base install anyway, so no biggie).
12. Again, bring up Luminis on node 2 and make sure everything is OK. If it looks good, bring it down again.
13. At this point you should have /global/luminis filled with the portal, but there are pieces that get installed outside of /opt/pipeline. From either node 1 or node 2, copy /var/opt/SUNWics5 to somewhere on global storage. We used /global/luminis/var/opt/SUNWics5 to keep the directory structure similar.
14. After you have the calendar files copied to global storage, create soft-links on both node 1 and node 2 from /var/opt/SUNWics5 to wherever you copied the directory on global storage (e.g. /global/luminis/var/opt/SUNWics5).
15. At this point everything (except for message queue) is on global storage. I tried moving message queue to global storage, but I think it’s way too embedded in the OS to do so, which is why we had to install Luminis on both nodes.
16. Now you need to make calendar server use the logical hostname. For whatever reason, all of Luminis is happy using the logical hostname except for calendar server, which likes to use the physical hostname. So edit your ics.conf file (/global/luminis/products/SUNWics5/cal/bin/config/ics.conf) and change the following two lines:
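Steps 13 and 14 can be sketched as one small shell function that is run on each node. This is only an illustration, not the exact commands used above: the function name `relocate_to_global` and the `.local` suffix for the set-aside copy are made up for the example, and it assumes a POSIX-style shell such as ksh.

```shell
# relocate_to_global SRC DEST   (hypothetical helper, run as root on each node)
#   SRC  - local path the product expects, e.g. /var/opt/SUNWics5
#   DEST - copy on global storage, e.g. /global/luminis/var/opt/SUNWics5
relocate_to_global() {
    rtg_src=$1
    rtg_dest=$2

    # Already a soft-link: nothing to do, so the function is safe to re-run.
    if [ -h "$rtg_src" ]; then
        return 0
    fi

    # Copy to global storage only if nothing is there yet, so the second
    # node does not overwrite what the first node already copied.
    if [ ! -d "$rtg_dest" ]; then
        mkdir -p "$(dirname "$rtg_dest")" || return 1
        cp -rp "$rtg_src" "$rtg_dest" || return 1
    fi

    # Set the local copy aside as a fallback, then link the old path
    # to the copy on global storage.
    if [ -d "$rtg_src" ]; then
        mv "$rtg_src" "$rtg_src.local" || return 1
    fi
    ln -s "$rtg_dest" "$rtg_src"
}

# Example call on a node (left commented out on purpose):
# relocate_to_global /var/opt/SUNWics5 /global/luminis/var/opt/SUNWics5
```

Keeping the local directory under a different name costs some disk space but makes it easy to back out if the link turns out to cause trouble.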
local.servername = "node1" --> local.servername = "logicalhostname"
service.ens.host = "node1" --> service.ens.host = "logicalhostname"

Make sure that you keep the original lines in the file (just add a ! in front of them.) You’ll need to change these files back to perform any upgrades/patching.
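The edit in step 16 (comment out the original line with a `!`, add a line with the logical hostname) could be scripted roughly as follows. This is a sketch under assumptions: the helper name `set_ics_host` and the `.orig` backup suffix are invented for the example, it expects the `key = "value"` layout shown above, and on Solaris you would probably need `nawk` instead of `awk` for the `-v` option.

```shell
# set_ics_host FILE KEY VALUE   (hypothetical helper)
# Comments out the active "KEY = ..." line with a leading "!" and writes
# KEY = "VALUE" right below it. A one-time backup is kept as FILE.orig.
set_ics_host() {
    sih_file=$1
    sih_key=$2
    sih_val=$3

    [ -f "$sih_file.orig" ] || cp -p "$sih_file" "$sih_file.orig" || return 1

    awk -v key="$sih_key" -v val="$sih_val" '
        BEGIN { want = key " = \"" val "\"" }
        # Already set to the wanted value: leave the line alone.
        $0 == want { print; next }
        # Active line for this key: keep it commented out, add the new one.
        index($0, key) == 1 && substr($0, length(key) + 1) ~ /^[ \t]*=/ {
            print "!" $0
            print want
            next
        }
        { print }
    ' "$sih_file" > "$sih_file.tmp" || return 1

    # Write back through cat so the file keeps its owner and permissions.
    cat "$sih_file.tmp" > "$sih_file" && rm -f "$sih_file.tmp"
}

# Example calls (left commented out on purpose):
# ICS=/global/luminis/products/SUNWics5/cal/bin/config/ics.conf
# set_ics_host "$ICS" local.servername logicalhostname
# set_ics_host "$ICS" service.ens.host logicalhostname
```

Running it a second time with the same value leaves the file unchanged, which helps if the step has to be repeated after a patch.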
17. You will also need to edit the ics.conf.bak that’s in the same directory and change the following entries (again keeping originals handy for upgrades):
local.hostname = "node1" --> local.hostname = "logicalhostname"
local.servername = "node1" --> local.servername = "logicalhostname"
service.ens.host = "node1" --> service.ens.host = "logicalhostname"
service.http.calendarhostname = "node1.myschool.edu" --> service.http.calendarhostname = "logicalhostname.myschool.edu"

18. In your web server’s alias directory (/global/luminis/products/ws/alias) you have cert db files. You will need to make copies of all the db files in there so that there is a set for node 1 and a set for node 2 (i.e. you should have https-cp-node1-cert7.db and https-cp-node2-cert7.db for every db file in there, including the mb server). A simple cp will work for this. When you renew certs for your portal, you will need to remember to copy these db files again after you’ve installed the updated cert.
19. At this point, you can manually bring up the portal on either node. Add data to the portal while running on node 1, then bring it down on node 1 and bring it up on node 2. The data you added should be there just as if you were still running on node 1.
20. We took this one step further and made the portal a resource within Sun Cluster. To do this, you can use the Generic Data Service (GDS) module for Sun Cluster. You give it a start script and a stop script (again, these should be on global storage, like /global/luminis/scripts/start.ksh or something similar). Your portal HAS to be configured to use sudo for this to work. SunGard HE has a doc note on how to do this. Remember to configure sudo on both nodes. Commands to do this are:
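The "simple cp" from step 18 can be wrapped in a loop so that no db file gets missed, which is handy when certs are renewed. A minimal sketch, assuming every db file carries the node name between hyphens as in https-cp-node1-cert7.db; the function name `copy_cert_dbs` is made up for the example.

```shell
# copy_cert_dbs ALIAS_DIR FROM_NODE TO_NODE   (hypothetical helper)
# Copies every *-FROM_NODE-*.db file to the same name with TO_NODE instead.
copy_cert_dbs() {
    ccd_dir=$1
    ccd_from=$2
    ccd_to=$3

    for ccd_src in "$ccd_dir"/*-"$ccd_from"-*.db; do
        [ -f "$ccd_src" ] || continue    # the glob matched nothing
        ccd_name=$(basename "$ccd_src")
        # Swap the first "-FROM-" in the file name for "-TO-".
        ccd_new=$(printf '%s\n' "$ccd_name" | sed "s/-$ccd_from-/-$ccd_to-/")
        cp -p "$ccd_src" "$ccd_dir/$ccd_new" || return 1
        echo "copied $ccd_name -> $ccd_new"
    done
}

# Example call (left commented out on purpose):
# copy_cert_dbs /global/luminis/products/ws/alias node1 node2
```

The copy overwrites any existing node 2 files, which is the behavior you want after installing a renewed cert on node 1.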
# scrgadm -a -t SUNW.gds
# scrgadm -a -j Portal -g luminis-rg -y Port_list="80/tcp 636/tcp 443/tcp" -x Failover_enabled=true -x Start_command="/global/luminis/scripts/cluster_agent_start.ksh" -x Stop_command="/global/luminis/scripts/cluster_agent_stop.ksh"

The Port_list is just what we ended up using. I think the GDS module just wants SOMETHING in there. It doesn’t matter much what we put in there because we turn off the pmf monitoring after the startup script is done anyway. There are many other GDS options that you can tailor to your environment, including custom monitoring.
21. With your startup script, I found I had to do a couple of things. I had to bring in all of cpadmin’s environment (.cprc and any customizations you’ve added). You also need to remember that cluster stuff runs as root, so in your startup script all your normal lines (like startcp, runsesev, runsesau and runevents) need to be preceded by su - cpadmin -c "...". And finally, because of the way Luminis works, the actual startup process pid goes away once the portal is done starting. Cluster doesn’t like this and thinks that the resource has failed. So you need to turn off pmf:
# pmfadm -s luminis-rg,Portal,0.svc HUP

luminis-rg is my resource group and Portal is the actual portal GDS resource. luminis-rg contains the logical hostname and the portal resource, and this is what actually fails over. The logical hostname has to be on the same box that the portal is trying to run on, or it won’t work, so it makes sense to group them together.
22. The stop script is much easier: you just put all your stop commands in there, preceded by the same su - cpadmin -c "...".
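A start/stop pair along the lines of steps 21 and 22 might look like the sketch below. It is not the script used here, just an outline under assumptions: cpadmin's shell is sh/ksh-compatible, .cprc sits in cpadmin's home directory, and the helper names (`run`, `as_portal_user`, `start_portal`, `stop_portal`) plus the `DRY_RUN` switch are invented for the example. The stop commands are deliberately left empty because they differ per site.

```shell
# Sketch of cluster_agent_start.ksh / cluster_agent_stop.ksh logic.
# DRY_RUN=1 (the default in this sketch) only prints the commands;
# set DRY_RUN=0 on a real node, where the script runs as root.
DRY_RUN=${DRY_RUN:-1}
PORTAL_USER=cpadmin
PMF_TAG="luminis-rg,Portal,0.svc"      # resource group, resource, from the post
STOP_COMMANDS=""                       # fill in your own stop commands

# Run a command line, or just print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run one portal command as cpadmin. "su -" starts in the home directory,
# and .cprc is sourced explicitly to pull in the portal environment.
as_portal_user() {
    run su - "$PORTAL_USER" -c ". ./.cprc; $1"
}

start_portal() {
    for sp_cmd in startcp runsesev runsesau runevents; do
        as_portal_user "$sp_cmd" || return 1
    done
    # The start process exits once the portal is up, so tell PMF to stop
    # monitoring it (same pmfadm call as shown above).
    run pmfadm -s "$PMF_TAG" HUP
}

stop_portal() {
    for sp_cmd in $STOP_COMMANDS; do
        as_portal_user "$sp_cmd" || return 1
    done
}

# A real start script would end with:  start_portal
# A real stop script would end with:   stop_portal
```

Leaving the dry-run mode on first lets you read the exact command lines cluster would run as root before pointing the GDS resource at the scripts.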
23. After all this is done, you finally have a portal that is resource clustered. If you do a scstat -g you should see something like this:
# scstat -g

-- Resource Groups and Resources --

            Group Name    Resources
            ----------    ---------
 Resources: luminis-rg    logicalhostname Portal

-- Resource Groups --

            Group Name    Node Name    State
            ----------    ---------    -----
     Group: luminis-rg    Node1        Offline
     Group: luminis-rg    Node2        Online

-- Resources --

            Resource Name      Node Name    State      Status Message
            -------------      ---------    -----      --------------
  Resource: logicalhostname    node1        Offline    Offline - LogicalHostname offline.
  Resource: logicalhostname    node2        Online     Online - LogicalHostname online.

  Resource: Portal             node1        Offline    Offline
  Resource: Portal             node2        Online     Online

24. A couple of notes on upgrading when you are in this configuration:
-- You can leave your resource online, but remember to unmonitor it any time you patch (or even manually restart) the portal. That way cluster is not trying to restart the portal while you are trying to take it down.
-- Remember to undo sudo before you attempt to patch/upgrade.
-- Remember to change the ics.conf and ics.conf.bak files back to use the physical hostname.
-- Remember to copy the db files in the alias directory when you update your certs.
-- I’ve upgraded from III.2 to III.3.1 patch level 234209342 and it all seems to work quite well: I only had to patch once on one node, and the patches went along when the portal failed over to the other node. So it’s a little more setup in the beginning but not much more maintenance in the long run.
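For the "change the ics.conf files back" note, the reverse edit can be scripted too, as long as the originals were kept as `!`-commented lines as described in step 16. A rough sketch with a made-up helper name (`restore_ics_key`); it refuses to touch the file if no commented original is found. On Solaris, `nawk` would likely be needed for `-v`.

```shell
# restore_ics_key FILE KEY   (hypothetical helper)
# Drops the active "KEY = ..." line and re-activates the "!KEY = ..." line.
# Returns non-zero and leaves FILE untouched if no "!KEY" line exists.
restore_ics_key() {
    rik_file=$1
    rik_key=$2

    awk -v key="$rik_key" '
        { line[NR] = $0 }
        # Remember whether a commented-out original exists for this key.
        index($0, "!" key) == 1 && substr($0, length(key) + 2) ~ /^[ \t]*=/ { saved = 1 }
        END {
            if (!saved) exit 1
            for (i = 1; i <= NR; i++) {
                l = line[i]
                # Active line (logical hostname): drop it.
                if (index(l, key) == 1 && substr(l, length(key) + 1) ~ /^[ \t]*=/) continue
                # Commented original: strip the leading "!".
                if (index(l, "!" key) == 1 && substr(l, length(key) + 2) ~ /^[ \t]*=/) {
                    print substr(l, 2)
                    continue
                }
                print l
            }
        }
    ' "$rik_file" > "$rik_file.tmp" || { rm -f "$rik_file.tmp"; return 1; }

    cat "$rik_file.tmp" > "$rik_file" && rm -f "$rik_file.tmp"
}

# Before patching, something like this (commented out on purpose). The
# scswitch line is how monitoring is usually disabled for a resource in
# Sun Cluster 3.x; check it against your release:
# scswitch -n -M -j Portal
# restore_ics_key /global/luminis/products/SUNWics5/cal/bin/config/ics.conf local.servername
```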

There you have it! This will be put into production here at DU in the coming months. New hardware is arriving and we’ll be migrating to this clustered solution. The nice thing is that since you have two nodes, you can have your master node as one resource group and, if you want to run parallel deployment, that as another group, and they can both fail over to each other. It makes hardware maintenance or entire computer room outages very easy to handle (we have a main computer room and a redundant computer room in another building on campus). Just fail it over y0! :)

I imagine something similar could be done for those of you running in the Windows world. I believe Windows has the notion of a “global” file system, although I don’t know about the whole resource group/resource stuff with Windows.

Anyway, there it is in a nutshell (more like a coconut shell, but hey, whatever works). I’d be happy to answer any questions you guys might have.

Comments

Excellent Work! Here are my comments...

I came up with similar findings when testing how to do this on Veritas Cluster. A few comments:

Step 3: I don't think this will impact you, since your target is always going to be the same on both nodes, but use caution here. When I used a symbolic link to point /opt/pipeline somewhere else it caused the pkgmgr to hardcode the target rather than the link. It seemed to cache this information in some pkgmgr config files, which I then had to hunt down and modify. In my case, I was moving an install to a server which had a different target for the link, therefore it affected me.

Step 5: The logical hostname has always been a source of frustration to me. By default, Luminis looks for lots of things running on localhost at startup. The application itself seems to look for things on localhost as well, although I can't be certain of that. Long story short, out of the box Luminis binds to some ports on all interfaces, not just the logical interface. Therefore, it doesn't necessarily "play well" with other applications you might have running in your cluster. For example, if Luminis is bound to port 389 on all interfaces, you cannot run any other application on that cluster node that wants to bind to port 389 on its own logical interface. It is easy to reconfigure the Luminis directory server to listen only on its logical interface, but last time I tried this it seemed to break things in the application. In revisiting the documentation recently, I noticed some installation parameters that may or may not have some impact on this:

config.server: Specifies the location of the LDAP server used by the system to store configuration properties. Default localhost.

ds.host: The fully qualified hostname of the server upon which the Directory Server software is installed, which is always the core Luminis server. If you supply the appropriate value for the host.name and host.domain properties, this value will be created and you should not need to set the ds.host property when installing on the Luminis server. You may have to specify the ds.host if you are installing software on another server. Default: ${host.fullname}

cs.host: The fully qualified hostname of the server upon which the Calendar Server software is installed. If you supply the appropriate value for the host.name and host.domain properties, this value will be created and you should not need to set the cs.host property. Default: ${host.fullname}

luminis.host: The fully qualified hostname of the machine upon which you are installing the primary components of the Luminis system (Directory Server, Web server, etc.). This value will be generated from the host.fullname property. If you’ve defined the properties that make up host.fullname (host.domain and host.name) you do not need to set luminis.host. Default: ${host.fullname}

They say they default to ${host.fullname}, which is ${host.name}.${host.domain}. In theory, setting these values to the logical hostname should propagate through the system. Last time I tested it, it did not. So if I were to test again, I would look at the above properties at install time. Also, my testing was all before their "Failover and Scalability" (FOS) parameters came about. None of my comments on Step 5 matter unless you want to run multiple applications on your cluster nodes. If you just want to have a cold failover node, no problem. Alternatively, you can just keep a close eye on your ports and reconfigure other applications around the Luminis requirements.
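One way to "keep a close eye on your ports" is to list which TCP ports are listening on all interfaces rather than on one address. The filter below is a small sketch that assumes Solaris-style `netstat -an` output, where a wildcard listener shows up with a local address like `*.389`; the function name `wildcard_listeners` is made up, and other operating systems format this output differently.

```shell
# wildcard_listeners: reads `netstat -an` output on stdin and prints the
# TCP ports that are in LISTEN state on all interfaces ("*.PORT").
wildcard_listeners() {
    awk '$NF == "LISTEN" && $1 ~ /^\*\.[0-9]+$/ {
        sub(/^\*\./, "", $1)     # strip the "*." prefix, keep the port
        print $1
    }' | sort -n -u
}

# Usage on a cluster node (left commented out on purpose):
# netstat -an | wildcard_listeners
```

Any port in that list is taken on every address of the node, including the logical hostnames of other resource groups that might fail over to it.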


In my opinion, if SGHE would just properly handle a logical hostname in the Sun JES configuration and not reference localhost in their startup scripts and Java code, then it would do as much or _more_ for enabling failover than their "FOS" or "PDS" or whatever the heck they are calling it these days.

Step 15: Ah, I see... I didn't understand why you were installing Luminis on both nodes. You mention here it was because of the message broker. You can try splitting the message broker, I have done this in my testing. The filesystem is in two places. If I recall correctly on Unix it's:

/etc/imq/license
/var/imq/instances/imqbroker

Interesting to note that, on Windows, the MQ software is all self-contained in c:\luminis\products\mq:

c:\luminis\products\mq\etc\lic
c:\luminis\products\mq\var\instances\imqbroker

Much nicer if you're trying to contain Luminis to a single shared storage volume! I can't believe I just said something nice about Windows ;-) Seriously, if I recall correctly, I just sym-linked both Unix imq directories into /opt/pipeline and it worked fine. The same comments about pkgmgr apply here as in Step 3 above.

Step 16: I found the same thing when I tested this, it took some debugging though!

Step 18: I found the same thing when I tested this, what a pain! Think of an environment with 8 or more nodes... Ugh! I wonder if this requirement would go away if we could properly configure the web server instances (cp, mq, cpip, admin) to only bind to HTTP and HTTPS ports on the logical interface? I bet this whole step would go away if we could do that.........

Step 24: A related note about upgrading in HA environments. I had an install once that used iPlanet's directory replication. I left it active in a test environment during a III.2 -> III.3 upgrade, and it wrecked my replication agreements to the point where I couldn't start the primary Luminis directory without breaking them all. Your post just made me think of that....

Hope this helps.....

--
Best Regards,

Scott Spyrison

I tried to cheat and soft

I tried to cheat and soft-link MQ to global storage when I was doing that and it didn't work for some reason. Maybe if I find some spare time (yeah right) I'll see about doing it again.

Something I found humorous when I rebooted my entire test cluster: because I installed Luminis on both nodes, it installed startup scripts on both nodes. Needless to say, you should remove these and let the cluster bring up Luminis with the resource, as opposed to the standalone startup scripts trying to start it on both nodes. LOL!

We'd like to leverage this too. Got guidance on Sun Cluster?

Having read your original post (and treating it like a holy grail), we feel we would really benefit from duplicating this for our implementation.

The one thing you did not go into (and our sys admin has no experience with) is the Sun Cluster setup.

Can you give us a good primer on setting this up properly, including possible pitfalls and so on?

This would be most appreciated.

Experience

Hi Charles,

If you and your team are not familiar with Sun Cluster, it can be extremely tricky. In our environment, we have extensive knowledge of and experience with Sun Cluster (and hiring a Sun guy away from Sun doesn't hurt either), and we were able to obtain checklists that I believe are not available to the public.

If you are looking at setting all this up properly with Sun Cluster and are not familiar with how it works, I would STRONGLY recommend having Sun come out and set your cluster up. Each hardware environment is different, as is each version of Sun Cluster. They don't have to do the application install, but they can make sure that everything is set up and functioning correctly, cluster-wise. This may be pricey, but it is well worth it in my opinion.

If it's not an option to have Sun take care of your cluster needs, I can see if I can dig up any Sun Cluster documentation for you. I'm not sure how much of the install guides are available to the public.

In any case, whether you install cluster yourself or have Sun do it, you should still have Sun come give their blessing after it's installed. We still do that step with all our production clusters.

Hope that helps!

--Mike

luminis IV cluster?

Hi

We have been running a similar setup for Luminis III.3.3 for a few years, and we are planning to migrate to Luminis IV in a clustered environment. Have you tried this yet? If so, have you come across any gotchas?

One thing I have noticed is that the jdk now gets installed in /usr/jdk rather than $CP_ROOT/products/jdk. There is a configuration parameter in the install script that is supposed to let you change the jdk install location, but when we tried it, it just introduced an additional layer of sym links back to /usr/jdk. I suspect this will make patching even more tricky, although we haven't tried that bit yet.

As one of the other posters mentioned, you can copy the message broker files manually rather than installing the whole thing on both nodes.

David

Not yet, but soon

Hey David,

I'm in the process of migrating to IV right now. I'm cleaning up a migration of a regular test instance to make sure I get everything I need to know, and then I'll work on "clusterizing" it. I hope to have my test cluster upgraded to IV sometime in April, so I might have more info then. We're supposed to do this in production in June, so hopefully there's not that many changes that need to happen!

Clustering Luminis

Which cluster agents are needed to run Luminis IV in a Sun Cluster?

Just GDS

For the Luminis portal itself, if you leverage it the way I've done it with III, it would just be the GDS. I'm not sure about IV yet...as...well...we're having some...difficulties...yeah, difficulties....getting just a standalone instance of IV to behave the way we want it to. I don't imagine it's much different.