Sunday, 25 October 2009

Retiring the Portal Wibble Dance

This week we finally got to the bottom of a problem that's been a bit of a head scratcher for the last 4-6 weeks.

Our enterprise portal (WebSphere Portal) had been standing up just fine with a user base of 30,000 for over 12 months, but about 4-6 weeks ago it started exhibiting a behaviour I dubbed the "Portal Wibble Dance".

What is the Portal Wibble Dance?

The Portal Wibble Dance is where individual portal nodes start dropping in and out of service with alarming regularity.  My inbox resembled a train wreck, with thousands of alerts like the following:

"node 1 is down"
"node 2 is down"
"node 1 is up"
"node 1 is down"
"node 3 is down"
"node 2 is up"
"node 3 is up"

Luckily we have a 4-node cluster and we weren't getting any service outages, as at least one node remained up at any one time while the others were "wibbling".

So Nodes Were Crashing?

Didn't look like it.

Notifications that nodes were back up were too quick to suggest that nodes were actually crashing - anyone who's worked with WebSphere Portal knows there's no way a node can restart in 30 seconds ;) - so it looked like the nodes were just slowing down enough for our Netscaler load balancers to mark them as down.

What the Hell is a Netscaler?

The fact that we use Citrix Netscalers as opposed to IBM Edge Servers to load balance has confused people who've worked on our infrastructure and are not used to heterogeneous environments.  So an explanation.

Citrix Netscalers are hardware load balancers that balance the traffic to the 4 IBM HTTPD (IHS) Servers.  A Netscaler sends a GET request to /wps/portal on each server and if it doesn't get a response within 10 seconds it flags a potential problem with that server.  If it gets 3 failed attempts in a row for a server then it flags that server as down and stops sending traffic to it.  The Netscaler then waits 30 seconds and tries the flagged server again.  If it gets a response within 10 seconds then it flags the server as up and starts sending traffic to it again.
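
The health check described above boils down to a small state machine.  Here's a rough Python sketch of it, purely to illustrate the logic - it is not Netscaler configuration or code, and the NodeMonitor class and its method names are made up for this example.  Only the three numbers (10 seconds, 3 failures, 30 seconds) come from our setup.

```python
from dataclasses import dataclass
from typing import Optional

TIMEOUT_SECONDS = 10.0       # a response must arrive within 10 seconds
FAILURES_TO_DOWN = 3         # 3 failed attempts in a row flags the server as down
RETRY_DELAY_SECONDS = 30.0   # a flagged server is tried again after 30 seconds


@dataclass
class NodeMonitor:
    """Hypothetical model of one monitored server (illustration only)."""
    name: str
    up: bool = True
    consecutive_failures: int = 0

    def record_probe(self, response_seconds: Optional[float]) -> Optional[str]:
        """Record one probe result; None means no response at all.

        Returns an alert string when the node changes state, otherwise None.
        """
        ok = response_seconds is not None and response_seconds <= TIMEOUT_SECONDS
        if ok:
            self.consecutive_failures = 0
            if not self.up:
                self.up = True
                return f"{self.name} is up"
            return None
        self.consecutive_failures += 1
        if self.up and self.consecutive_failures >= FAILURES_TO_DOWN:
            self.up = False
            return f"{self.name} is down"
        return None


# Made-up response times in seconds: one good probe, four slow or missing, one good.
monitor = NodeMonitor("node 1")
probes = [2.0, 12.0, None, 15.0, 11.0, 3.0]
alerts = [a for a in (monitor.record_probe(p) for p in probes) if a]
print(alerts)  # ['node 1 is down', 'node 1 is up']
```

Note that the node never has to crash to be flagged: three slow responses in a row are enough, and a single quick response brings it straight back.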

I get the above email alerts for each "down" and "up" flag and I was getting alerts too frequently for it to be portal restarts.
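
If you want to check the "too quick to be a restart" reasoning against a pile of alerts, pairing each "down" with the next "up" for the same node gives you the outage lengths.  This is a minimal sketch that assumes you can get a timestamp alongside each alert; the sample times and the 5 minute restart threshold are invented for illustration.

```python
from datetime import datetime, timedelta


def outage_durations(alerts):
    """Pair each 'down' alert with the next 'up' alert for the same node.

    alerts: (datetime, text) tuples in time order, text like 'node 1 is down'.
    Returns a list of (node, timedelta) tuples, one per completed outage.
    """
    went_down = {}
    outages = []
    for when, text in alerts:
        node, _, state = text.rpartition(" is ")
        if state == "down":
            went_down.setdefault(node, when)
        elif state == "up" and node in went_down:
            outages.append((node, when - went_down.pop(node)))
    return outages


MIN_RESTART = timedelta(minutes=5)  # assumption: a real restart takes at least this long

t0 = datetime(2009, 10, 20, 9, 0, 0)  # made-up sample data
sample = [
    (t0, "node 1 is down"),
    (t0 + timedelta(seconds=20), "node 2 is down"),
    (t0 + timedelta(seconds=40), "node 1 is up"),
    (t0 + timedelta(seconds=75), "node 2 is up"),
]
for node, duration in outage_durations(sample):
    kind = "possible restart" if duration >= MIN_RESTART else "flap"
    print(node, duration.total_seconds(), kind)
```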

So Was There Anything in the Logs?

Boy, was there.  When portal was "wibbling" it was ripping through 15MB of log files in about 45 seconds - guess that might have had something to do with the slowdowns ;)
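
To put that figure in perspective, here's the back-of-an-envelope arithmetic as a few lines of Python.  It assumes the 15MB in 45 seconds rate is sustained, which in practice only held while a node was wibbling.

```python
def log_rate(megabytes: float, seconds: float) -> float:
    """Average log growth in MB per second over the observed window."""
    return megabytes / seconds


rate = log_rate(15, 45)                     # the figures quoted above
print(f"{rate:.2f} MB/s")                   # 0.33 MB/s
print(f"{rate * 60:.0f} MB/minute")         # 20 MB/minute
print(f"{rate * 3600 / 1024:.1f} GB/hour")  # 1.2 GB/hour
```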

There were thousands of the following errors repeating in the log files:

"IWKPC1007X: Could not find an identity for name (resolved reference key): null"

"IWKPY1015X: Unauthorised access by .... "

"Exception is: Application must execute a rollback. The unit of work has already been rolled back in the database..."

So What Was the Fix?

I managed to get a fix for IWKPC1007X from IBM Support.  It was a known issue which was documented at , but the TechNote seems to have been withdrawn.  I believe fix PK60885 is rolled in to and above, but if you get the above error then request the fix direct from IBM Support.

So That Fixed Everything?

Wish it was that simple.  Portal was still doing the "wibble dance"; I just had one less error in the logs, which did make the next one easier to spot:

SESN0016E: DatabaseSessionContext:performInvalidation detected an error. The database invalidation of timed out sessions has encountered an error.

This pointed to a problem with the Session Database, but from the database side of things the DB looked fine: no errors, nothing.
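
Knocking errors off one at a time is easier if you can see which message IDs dominate the logs.  Here's a minimal Python sketch that tallies them.  The regular expression is an assumption based only on the shape of the IDs quoted in this post (4 or 5 capital letters, 4 digits, 1 letter), and the sample lines are just those quoted messages, not a real log extract.

```python
import re
from collections import Counter

# Assumed ID shape, inferred from IWKPC1007X, IWKPY1015X and SESN0016E.
MESSAGE_ID = re.compile(r"\b[A-Z]{4,5}\d{4}[A-Z]\b")


def tally_message_ids(lines):
    """Count how often each message ID appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(MESSAGE_ID.findall(line))
    return counts


sample_lines = [
    "IWKPC1007X: Could not find an identity for name (resolved reference key): null",
    "IWKPC1007X: Could not find an identity for name (resolved reference key): null",
    "IWKPY1015X: Unauthorised access by .... ",
    "SESN0016E: DatabaseSessionContext:performInvalidation detected an error.",
]
for message_id, count in tally_message_ids(sample_lines).most_common():
    print(message_id, count)
```

For a real log you could pass an open file object straight to tally_message_ids, since it only needs something it can loop over line by line.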

So What Was the Real Fix?

Reducing the individual errors eventually led us to a repeating error with the IBM Web Clipping portlet, which we use to surface our password management application in portal using an IFRAME - yes I know IFRAMES are bad and I hang my head in shame.

It looks like there is a session management problem with IFRAMES.  What I hadn't realised was that there had been a policy decision to push everyone to portal for password management.  That increased the load on the Web Clipping portlet, made the session management problem worse, and started causing enough errors to slow portal down to the point where the Netscaler kept dropping nodes out of service.

Our fix for the time being is to disable persistent sessions completely, and we're slowly going to be removing all Web Clipping/IFRAME portlets.

Parting Shot

Don't use IFRAMES kids, they're not big and they're not clever.  They're a cheap integration point for a proof of concept, but they will bite you in the ass if you roll them into a production environment.

Saturday, 10 October 2009

How To Achieve Real Transparency

I keep getting told in work that we have this thing called "transparency" and everyone is open and honest and there is clear communication from the top down.  Well, sorry to burst your bubble, but we don't.  There are silos, there are Ivory Towers and there are a confusing number of senior management groups with confusing acronyms - SME, ITPB, ITPO, MWE2 ITPB - and people don't really know these groups' remits or which group a particular issue should be raised in.

So how do we fix things?

SME (which stands for Senior Management Executive) have taken steps to address some of the issues by sending out regular bulletins of things that have been discussed and actioned at SME meetings.  This is a big step forward in addressing some of the confusion around SME, but that's just one of the groups.  What about the others?

So how do we REALLY fix things?

This is where I see E2.0 / Social Business Networking / Call it What You Want making a real difference and adding real value.

We happen to have chosen Lotus Connections as our platform, but you can pretty much substitute any E2.0 platform you like - it's not about the technology, it's about the mindshift.

So in our organisation what if:
  • We had a Community visible to all of Information Services?
  • We had a Wiki defining the remit of each group?
  • We recorded meeting minutes on the Wiki?
  • We had Blog post updates from all of the above-mentioned groups?
  • We added project proposals to Files?
  • We had a Forum for seeding new ideas for projects?
  • We templated Activities for every task required to move a project through the system?
  • We used Activities as a light-weight management tool for managing a project that had been approved?
  • We gave it a shot?

Use Case 1

Everyone in Information Services can see projects moving through the system and know what stage each project is at.  It shows that the system works, even if people don't have projects in the system at the time.

Use Case 2

People understand the remit of each group, what is being discussed where and where they need to raise issues.

Use Case 3

Frank sees that Joe's project (which Frank didn't know about) has been rejected.  The project is important to Frank, so Frank contacts Joe to help with getting the project re-submitted.

Use Case 4

Joe sees that Frank's project (which Joe didn't know about) has been accepted.  The project impacts on Joe, so Joe contacts Frank to ensure the project proposal is adjusted to reflect Joe's team effort.

Use Case 5

Andrew has had a lot of projects rejected, but sees that Mike has had a lot of projects approved.  Andrew contacts Mike for advice on how to submit projects.

Use Case 6

Mike has had a lot of projects accepted, but sees that Andrew has had a lot of projects rejected.  Some of the projects are of interest to Mike so Mike contacts Andrew to offer advice.

Use Case 7

Bob starts a thread on the Forum to ask if anyone is interested in helping him with a project proposal to reduce storage costs.  Andrew responds offering help and invites Paul and Rhys to contribute to the discussion.

Use Case 8

Rebecca is a new employee within Information Services and has never submitted a project proposal before.  Rebecca uses the Activity Template for project submissions and has a clear set of steps, tasks, milestones and documents to help her through the process.

Use Case 9

Andrew and Simon have their project approved and use Activities as a light-weight project management tool to share and complete tasks, track milestones and share relevant documents and emails.

The Reality

I know getting the mindshift to do this is not going to be easy, but it has to be worth a shot, doesn't it?  We keep talking about business change, so let's change, let's do things differently.

The Expansion

This post is specific to Information Services, but I guarantee that the same structures and problems exist with every School or Directorate in Cardiff University - and I bet this solution fits for all of them.

The Conclusion

I honestly think this would work and if you want to see this happen then let me know.

I also accept that I am "politically naive" so if there are stumbling blocks to what I'm proposing then shout - but if you are a CU senior peep then if you shout against this you don't really believe in transparency ;)