This week we finally got to the bottom of a problem that's been a bit of a head scratcher for the last 4-6 weeks.
Our enterprise portal (WebSphere Portal 6.0.1.3) had been standing up just fine with a user base of 30,000 for over 12 months, but about 4-6 weeks ago it started exhibiting a behaviour I dubbed the "Portal Wibble Dance".
What is the Portal Wibble Dance?
The Portal Wibble Dance is where individual portal nodes started dropping in and out of service with alarming regularity and my inbox resembled a train wreck with thousands of the following alerts:
"node 1 is down"
"node 2 is down"
"node 1 is up"
"node 1 is down"
"node 3 is down"
"node 2 is up"
"node 3 is up"
Luckily we have a 4-node cluster and we weren't getting any service outages, as at least one node remained up at any one time while the others were "wibbling".
So Nodes were Crashing?
Didn't look like it.
Notifications that nodes were back up were too quick to suggest that nodes were actually crashing - anyone who's worked with WebSphere Portal knows there's no way a node can restart in 30 seconds ;) - so it looked like the nodes were just slowing down enough for our Netscaler load balancers to mark them as down.
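That reasoning can be sketched in a few lines of Python: pair each "down" alert with the next "up" alert for the same node and check whether the gap is shorter than a restart could plausibly take. Everything specific here is an assumption for illustration: the `(seconds, node, state)` tuple layout, the sample timestamps, and the 300-second `MIN_RESTART_SECONDS` threshold. The real alerts were just emails.

```python
# Hypothetical sketch: the alert tuples and the restart threshold are assumptions.
MIN_RESTART_SECONDS = 300  # assumed lower bound for a full portal node restart


def recovery_times(alerts):
    """Pair each 'down' alert with the next 'up' alert for the same node.

    alerts: iterable of (timestamp_seconds, node, state), sorted by time.
    Returns a list of (node, seconds_down) tuples.
    """
    went_down_at = {}
    recoveries = []
    for timestamp, node, state in alerts:
        if state == "down":
            # Keep the first 'down' if several arrive before an 'up'.
            went_down_at.setdefault(node, timestamp)
        elif state == "up" and node in went_down_at:
            recoveries.append((node, timestamp - went_down_at.pop(node)))
    return recoveries


def looks_like_slowdown(seconds_down, min_restart=MIN_RESTART_SECONDS):
    """True if the node came back too quickly to have restarted."""
    return seconds_down < min_restart


# Made-up sample data shaped like the alert sequence quoted earlier.
sample_alerts = [
    (0, "node 1", "down"),
    (10, "node 2", "down"),
    (35, "node 1", "up"),
    (50, "node 1", "down"),
    (60, "node 3", "down"),
    (70, "node 2", "up"),
    (95, "node 3", "up"),
]

for node, seconds_down in recovery_times(sample_alerts):
    verdict = "slowdown" if looks_like_slowdown(seconds_down) else "possible restart"
    print(f"{node}: back after {seconds_down}s -> {verdict}")
```

With the sample data every recovery is well under the assumed threshold, which is the pattern that pointed away from crashes.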
What the Hell is a Netscaler?
The fact that we use Citrix Netscalers as opposed to IBM Edge Servers to load balance has confused people who've worked on our infrastructure and are not used to heterogeneous environments. So an explanation.
Citrix Netscalers are hardware load balancers that balance the traffic to the 4 IBM HTTP Server (IHS) instances. The Netscaler sends a GET request to /wps/portal on each server, and if it doesn't get a response within 10 seconds it flags a potential problem with that server. If it gets 3 failed attempts in a row for a server, it flags that server as down and stops sending traffic to it. The Netscaler then waits 30 seconds and tries the flagged server again. If it gets a response within 10 seconds, it flags the server as up and starts sending traffic to it again.
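The monitoring rules just described boil down to a small state machine. Here is a minimal Python sketch of it. This is not how the Netscaler is implemented or configured, only the rules from this post written out. The `probe` helper, the `ServerMonitor` class, and the `normal_interval` parameter are hypothetical names of mine, and counting HTTP error statuses as failures is an assumption.

```python
import urllib.request

PROBE_TIMEOUT_SECONDS = 10  # from the text: no response within 10 seconds is a failure
FAILURES_BEFORE_DOWN = 3    # from the text: 3 failed attempts in a row
RETRY_DELAY_SECONDS = 30    # from the text: a down server is retried after 30 seconds


def probe(url):
    """Return True if a GET to url answers within the timeout (hypothetical helper).

    urlopen raises on 4xx/5xx responses, so those count as failures here;
    that is an assumption, not something stated about the real monitor.
    """
    try:
        with urllib.request.urlopen(url, timeout=PROBE_TIMEOUT_SECONDS):
            return True
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return False


class ServerMonitor:
    """Tracks one server's up/down flag from a stream of probe results."""

    def __init__(self):
        self.is_up = True
        self.consecutive_failures = 0

    def record(self, probe_succeeded):
        """Feed in one probe result; return 'down' or 'up' on a change, else None."""
        if probe_succeeded:
            self.consecutive_failures = 0
            if not self.is_up:
                self.is_up = True
                return "up"
            return None
        self.consecutive_failures += 1
        if self.is_up and self.consecutive_failures >= FAILURES_BEFORE_DOWN:
            self.is_up = False
            return "down"
        return None

    def next_probe_delay(self, normal_interval):
        """Seconds to wait before the next probe; normal_interval is an assumption."""
        return normal_interval if self.is_up else RETRY_DELAY_SECONDS


# Simulated probe results: two failures, a recovery, three failures, a recovery.
monitor = ServerMonitor()
results = [False, False, True, False, False, False, True]
events = [monitor.record(result) for result in results]
print(events)
```

Note how a single good response resets the failure count, and a single good response after the 30-second wait is enough to flag the server as up again. A node that is merely slow can therefore bounce between "down" and "up" far faster than a restart would allow.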
I get the above email alerts for each "down" and "up" flag and I was getting alerts too frequently for it to be portal restarts.
So Was There Anything in the Logs?
Boy, was there. When portal was "wibbling" it was ripping through 15MB of log files in about 45 seconds - guess that might have had something to do with the slowdowns ;)
There were thousands of the following errors repeating in the log files:
"IWKPC1007X: Could not find an identity for name (resolved reference key): null"
"IWKPY1015X: Unauthorised access by .... "
"Exception is: com.ibm.db2.jcc.c.SqlException: Application must execute a rollback. The unit of work has already been rolled back in the database..."
So What Was the Fix?
I managed to get a fix for IWKPC1007X from IBM Support. It was a known issue which was documented at http://www-01.ibm.com/support/docview.wss?rs=688&uid=swg1PK60885 , but the TechNote seems to have been withdrawn. I believe fix PK60885 is rolled into 6.0.1.4 and above, but if you get the above error then request the fix directly from IBM Support.
So That Fixed Everything?
Wish it was that simple. Portal was still doing the "wibble dance"; I just had one fewer error in the logs, but that did make the next one easier to spot:
"SESN0016E: DatabaseSessionContext:performInvalidation detected an error. The database invalidation of timed out sessions has encountered an error."
This pointed to a problem with the Session Database, but from the database side of things the DB looked fine, no errors, nothing.
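Peeling back errors one at a time is easier with a tally of message IDs. Here is a minimal Python sketch. The regex is a guess at the ID shape based only on the IDs quoted in this post, the function name is hypothetical, and the sample lines are made up from those same quotes rather than copied from a real log.

```python
import re
from collections import Counter

# Matches IDs shaped like the ones quoted in this post, e.g. IWKPC1007X or SESN0016E.
# The pattern is an assumption; other components may use different ID shapes.
MESSAGE_ID = re.compile(r"\b[A-Z]{4,5}\d{4}[A-Z]\b")


def tally_message_ids(lines):
    """Count how often each message ID appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(MESSAGE_ID.findall(line))
    return counts


# Made-up sample lines; in practice you would pass an open log file instead.
sample_lines = [
    "IWKPC1007X: Could not find an identity for name (resolved reference key): null",
    "IWKPY1015X: Unauthorised access by ....",
    "IWKPC1007X: Could not find an identity for name (resolved reference key): null",
    "SESN0016E: DatabaseSessionContext:performInvalidation detected an error.",
]

# Print the most frequent IDs first.
for message_id, count in tally_message_ids(sample_lines).most_common():
    print(f"{count:6d}  {message_id}")
```

Running something like this before and after each fix shows which error has become the new front-runner.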
So What Was the Real Fix?
Reducing the individual errors eventually led us to a repeating error with the IBM Web Clipping portlet, which we use to surface our password management application in portal using an IFRAME - yes I know IFRAMES are bad and I hang my head in shame.
It looks like there is a session management problem with IFRAMES. What I hadn't realised was that there had been a policy decision to push everyone to portal for password management. That increased the load on the Web Clipping portlet, aggravated the session management problem, and caused enough errors to slow portal down to the point where the Netscaler kept dropping nodes out of service.
Our fix for the time being is to disable persistent sessions completely, and we're slowly going to be removing all Web Clipping/IFRAME portlets.
Parting Shot
Don't use IFRAMES, kids. They're not big and they're not clever. They're a cheap integration point for a proof of concept, but they will bite you in the ass if you roll them into a production environment.