Monday, 21 December 2009

Apache Shindig - Let's Get Social

As I've been harping on at IBM to support OpenSocial in products such as Lotus Connections, I decided to have a play with Apache Shindig, and the first step is obviously to get it installed.

I'm using Ubuntu 9.10 and nabbed the following files: jre-6u17-linux-i586.bin, apache-tomcat-6.0.20.tar.gz and shindig-server-1.1-BETA5-incubating.war.

Install Java

user@my-netbook: cp ~user/Downloads/jre-6u17-linux-i586.bin /usr/local/
user@my-netbook: cd /usr/local
user@my-netbook: ./jre-6u17-linux-i586.bin
user@my-netbook: ln -s jre1.6.0_17 java

I want the JAVA_HOME and JRE_HOME environment variables available to all users so I create the following file:

user@my-netbook: sudo vi /etc/profile.d/java.sh

and populate it with the following contents:

JAVA_HOME=/usr/local/java
export JAVA_HOME

JRE_HOME=/usr/local/java
export JRE_HOME

PATH=$PATH:$JAVA_HOME/bin
export PATH
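
A quick way to sanity-check the file before logging out and back in is to source it in a throwaway subshell and print what it sets. This is only a sketch: check_java_env is a made-up helper name, not part of Java or Ubuntu, and it assumes the file was saved as shown above.

```shell
# check_java_env FILE: source FILE in a subshell and report the Java
# variables it sets. Hypothetical helper, named here for illustration only.
# Prints one NAME=value line per variable and returns non-zero if either
# variable ends up empty.
check_java_env() {
    (
        # Start from a clean slate so stale values don't hide a broken file.
        unset JAVA_HOME JRE_HOME
        . "$1" || exit 1
        echo "JAVA_HOME=$JAVA_HOME"
        echo "JRE_HOME=$JRE_HOME"
        [ -n "$JAVA_HOME" ] && [ -n "$JRE_HOME" ]
    )
}
```

Running check_java_env /etc/profile.d/java.sh should print both variables pointing at /usr/local/java; after that, java -version in a fresh login shell confirms the PATH change took effect.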

Install Tomcat

user@my-netbook: cp ~user/Downloads/apache-tomcat-6.0.20.tar.gz /usr/local/
user@my-netbook: cd /usr/local
user@my-netbook: tar -xzvf apache-tomcat-6.0.20.tar.gz
user@my-netbook: ln -s apache-tomcat-6.0.20 tomcat

Then check that Tomcat starts ok:

user@my-netbook: cd /usr/local/tomcat/bin
user@my-netbook: ./startup.sh

and check you can access http://localhost:8080, where you should see the default Tomcat welcome page.

Now shut down Tomcat:

user@my-netbook: cd /usr/local/tomcat/bin
user@my-netbook: ./shutdown.sh

Install Shindig

user@my-netbook: cd /usr/local/tomcat/webapps
user@my-netbook: mv ROOT/ ROOT.BCK
user@my-netbook: cp ~user/Downloads/shindig-server-1.1-BETA5-incubating.war /usr/local/tomcat/webapps/ROOT.WAR
user@my-netbook: /usr/local/tomcat/bin/startup.sh

Check that Shindig is working by visiting the following URL in your browser:

http://localhost:8080/gadgets/ifr?url=http://www.labpixies.com/campaigns/todo/todo.xml

and you should see the to-do gadget rendered in the browser.

Hopefully that should be it.
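
If you want to try other gadgets, the render URL is just the Shindig base URL plus /gadgets/ifr with the gadget spec passed in the url parameter, as in the test URL above. The sketch below builds that URL and percent-encodes the gadget spec so that specs with their own query strings survive the trip. Both function names (urlencode and shindig_render_url) are my own, not anything shipped with Shindig, and the encoder only handles plain ASCII input.

```shell
# urlencode STRING: percent-encode everything except RFC 3986 unreserved
# characters. ASCII-only sketch; uses global scratch variables because
# POSIX sh has no "local".
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}        # everything after the first character
        c=${s%"$rest"}     # the first character itself
        case $c in
            [A-Za-z0-9._~-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
}

# shindig_render_url BASE GADGET_SPEC_URL: print the gadget render URL.
shindig_render_url() {
    printf '%s/gadgets/ifr?url=%s\n' "$1" "$(urlencode "$2")"
}

# Example (needs Tomcat running and curl installed):
#   curl -s -o /dev/null -w '%{http_code}\n' \
#     "$(shindig_render_url http://localhost:8080 http://www.labpixies.com/campaigns/todo/todo.xml)"
```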

Sunday, 25 October 2009

Retiring the Portal Wibble Dance

This week we finally got to the bottom of a problem that's been a bit of a head scratcher for the last 4-6 weeks.

Our enterprise portal (WebSphere Portal 6.0.1.3) had been standing up just fine with a user base of 30,000 users for over 12 months, but about 4-6 weeks ago it started exhibiting a behaviour I dubbed the "Portal Wibble Dance".

What is the Portal Wibble Dance?

The Portal Wibble Dance is where individual portal nodes started dropping in and out of service with alarming regularity and my inbox resembled a train wreck with thousands of the following alerts:

"node 1 is down"
"node 2 is down"
"node 1 is up"
"node 1 is down"
"node 3 is down"
"node 2 is up"
"node 3 is up"

Luckily we have a 4-node cluster and we weren't getting any service outages as at least one node remained up at any one time when the others were "wibbling".

So Nodes were Crashing?

Didn't look like it.

Notifications that nodes were back up were too quick to suggest that nodes were actually crashing - anyone who's worked with WebSphere Portal knows there's no way a node can restart in 30 seconds ;) - so it looked like the nodes were just slowing down enough for our Netscaler load balancers to mark them as down.

What the Hell is a Netscaler?

The fact that we use Citrix Netscalers as opposed to IBM Edge Servers to load balance has confused people who've worked on our infrastructure and are not used to heterogeneous environments.  So an explanation.

Citrix Netscalers are hardware load balancers; ours balance the traffic to the 4 IBM HTTP Server (IHS) instances.  The Netscaler sends a GET request to /wps/portal on each server and if it doesn't get a response within 10 seconds it flags a potential problem with that server.  If it gets 3 failed attempts in a row for a server then it flags that server as down and stops sending traffic to it.  The Netscaler then waits 30 seconds and tries the flagged server again.  If it gets a response within 10 seconds then it flags the server as up and starts sending traffic to it again.

I get the above email alerts for each "down" and "up" flag and I was getting alerts too frequently for it to be portal restarts.
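
To make the flapping easier to picture, here's a minimal sketch of the monitor logic as described above. It is not Netscaler configuration or Citrix code, just a hypothetical shell function (monitor_transitions is a made-up name) that replays a list of probe results and prints a line whenever the server would be flagged down or up. The 10 second timeout and the 30 second retry wait aren't modelled; each argument simply stands for one probe that either answered in time (ok) or didn't (fail).

```shell
# monitor_transitions RESULT...: replay probe results ("ok" or "fail") and
# print "down" or "up" each time the monitored server changes state.
# Assumed rules, taken from the description above: three failures in a row
# mark the server down; the first successful probe afterwards marks it up.
monitor_transitions() {
    state=up
    fails=0
    for result in "$@"; do
        if [ "$result" = ok ]; then
            fails=0
            if [ "$state" = down ]; then
                state=up
                echo up
            fi
        else
            fails=$((fails + 1))
            if [ "$state" = up ] && [ "$fails" -ge 3 ]; then
                state=down
                echo down
            fi
        fi
    done
}
```

For example, monitor_transitions fail fail ok fail fail fail ok prints down and then up: two slow responses alone do nothing, three in a row take the node out, and a single good response brings it straight back. A node that is merely slow, rather than crashed, can therefore generate a steady stream of down/up alert pairs.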

So Was There Anything in the Logs?

Boy, was there.  When portal was "wibbling" it was ripping through 15MB of log files in about 45 seconds - guess that might have had something to do with the slowdowns ;)

There were thousands of the following errors repeating in the log files:
"IWKPC1007X: Could not find an identity for name (resolved reference key): null"

"IWKPY1015X: Unauthorised access by .... "

"Exception is: com.ibm.db2.jcc.c.SqlException: Application must execute a rollback. The unit of work has already been rolled back in the database..."
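
With logs growing that fast, it helps to see which message IDs dominate before reading individual entries. The sketch below tallies anything that looks like a WebSphere message ID (4 or 5 capital letters, 4 digits and a severity letter, as in IWKPC1007X or SESN0016E). The pattern is my assumption based on the IDs quoted in this post, count_error_codes is a made-up helper name, and the log path you pass in will depend on your install.

```shell
# count_error_codes FILE: count message IDs in a log file, most frequent first.
# Prints "COUNT ID" per line. Relies on grep -o, which GNU and BSD grep
# support but POSIX does not require.
count_error_codes() {
    LC_ALL=C grep -oE '[A-Z]{4,5}[0-9]{4}[A-Z]' "$1" \
        | sort | uniq -c | sort -rn | awk '{print $1, $2}'
}
```

Pointing it at a portal log, for instance count_error_codes SystemOut.log, gives a ranked list, so you can knock off the noisiest error first and re-run to see what is left.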

So What Was the Fix?

I managed to get a fix for IWKPC1007X from IBM Support.  It was a known issue which was documented at http://www-01.ibm.com/support/docview.wss?rs=688&uid=swg1PK60885, but the TechNote seems to have been withdrawn.  I believe fix PK60885 is rolled into 6.0.1.4 and above, but if you get the above error then request the fix directly from IBM Support.

So That Fixed Everything?

Wish it was that simple.  Portal was still doing the "wibble dance"; I just had one less error, but that did make it easier to spot the next one:

"SESN0016E: DatabaseSessionContext:performInvalidation detected an error. The database invalidation of timed out sessions has encountered an error.."

This pointed to a problem with the Session Database, but from the database side of things the DB looked fine: no errors, nothing.

So What Was the Real Fix?

Reducing the individual errors eventually led us to a repeating error with the IBM Web Clipping portlet, which we use to surface our password management application in portal using an IFRAME - yes I know IFRAMES are bad and I hang my head in shame.

It looks like there is a session management problem with IFRAMES.  What I hadn't realised was that there had been a policy decision to push everyone to portal for password management.  That increased the load on the Web Clipping portlet, made the session management problem worse, and started causing enough errors to slow portal down to the point where the Netscaler kept dropping nodes out of service.

Our fix for the time being is to disable persistent sessions completely, and we're slowly going to be removing all Web Clipping/IFRAME portlets.

Parting Shot

Don't use IFRAMES kids, they're not big and they're not clever.  They're a cheap integration point for a proof of concept, but they will bite you in the ass if you roll them into a production environment.


Saturday, 10 October 2009

How To Achieve Real Transparency

I keep getting told in work that we have this thing called "transparency" and everyone is open and honest and there is clear communication from the top down.  Well, sorry to burst your bubble, but we don't.  There are silos, there are Ivory Towers and there are a confusing number of senior management groups with confusing acronyms - SME, ITPB, ITPO, MWE2 ITPB - and people don't really know these groups' remits or which group a particular issue should be raised in.

So how do we fix things?

SME (which stands for Senior Management Executive) have taken steps to address some of the issues by sending out regular bulletins of things that have been discussed and actioned at SME meetings.  This is a big step forward in addressing some of the confusion around SME, but that's just one of the groups. What about the others?

So how do we REALLY fix things?

This is where I see E2.0 / Social Business Networking / Call it What You Want making a real difference and adding real value.

We happen to have chosen Lotus Connections as our platform, but you can pretty much substitute any E2.0 platform you like - it's not about the technology, it's about the mindshift.

So in our organisation what if:
  • We had a Community visible to all of Information Services?
  • We had a Wiki defining the remit of each group?
  • We recorded meeting minutes on the Wiki?
  • We had Blog post updates from all of the above mentioned groups?
  • We added project proposals to Files?
  • We had a Forum for seeding new ideas for projects?
  • We templated Activities for every task required to move a project through the system?
  • We used Activities as a light-weight management tool for managing a project that had been approved?
  • We gave it a shot?

Use Case 1

Everyone in Information Services can see projects moving through the system and know what stage each project is at.  It shows that the system works, even if people don't have projects in the system at the time.

Use Case 2

People understand the remit of each group, what is being discussed where and where they need to raise issues.

Use Case 3

Frank sees that Joe's project (which Frank didn't know about) has been rejected, but is important to Frank so Frank contacts Joe to help with getting the project re-submitted.

Use Case 4

Joe sees that Frank's project (which Joe didn't know about) has been accepted, but the project impacts on Joe so Joe contacts Frank to ensure the project proposal is adjusted to reflect the effort needed from Joe's team.

Use Case 5

Andrew has had a lot of projects rejected, but sees that Mike has had a lot of projects approved.  Andrew contacts Mike for advice on how to submit projects.

Use Case 6

Mike has had a lot of projects accepted, but sees that Andrew has had a lot of projects rejected.  Some of the projects are of interest to Mike so Mike contacts Andrew to offer advice.

Use Case 7

Bob starts a thread on the Forum to ask if anyone is interested in helping him with a project proposal to reduce storage costs.  Andrew responds offering help and invites Paul and Rhys to contribute to the discussion.

Use Case 8

Rebecca is a new employee within Information Services and has never submitted a project proposal before.  Rebecca uses the Activity Template for project submissions and has a clear set of steps, tasks, milestones and documents to help her through the process.

Use Case 9

Andrew and Simon have their project approved and use Activities as a light-weight project management tool to share and complete tasks, track milestones and share relevant documents and emails.

The Reality

I know getting the mindshift to do this is not going to be easy, but it has to be worth a shot doesn't it?  We keep talking about business change, so let's change, let's do things differently.

The Expansion

This post is specific to Information Services, but I guarantee that the same structures and problems exist with every School or Directorate in Cardiff University - and I bet this solution fits for all of them.

The Conclusion

I honestly think this would work and if you want to see this happen then let me know.

I also accept that I am "politically naive" so if there are stumbling blocks to what I'm proposing then shout - but if you are a CU senior peep then if you shout against this you don't really believe in transparency ;)





Friday, 14 August 2009

What Am I Not Getting?

First blog post in a very long time as I'm normally more of a blog reader than a blog writer, but something has been bugging me lately that I just don't get - and when I get bugged I normally turn to Twitter, but this needs more than 140 chars :-)

We're currently in the middle of a programme of work called the Modern Working Environment (MWE) at Cardiff University which is introducing a suite of social networking, collaboration, messaging, business process and integration tools to try and improve the way we work, make people more productive and improve communication and collaboration.

One of the problems we're facing is that we've been getting a lot of feedback recently that people are confused by the number of tools that we've released to date as part of MWE and people don't understand what tool to use for what task. The feedback has even gone as far as suggesting that we should be providing a prescribed matrix of tools versus tasks and that you should always use "tool X" for "task Y" - and this is the bit that's bugging me as the prescribed approach doesn't map to any other work or non-work task.

In my mind the chosen tool for any task is based (mainly) on:

* Appropriateness
* Context
* Availability
* Personal Preference
* Group Preference

e.g.

Task - "I need to ask a colleague a question"
Now for this you could visit them at their desk, phone them, email them, send them an instant message or even schedule a meeting, but people instinctively know which is the right tool to use and are easily able to realise when they need to switch tools if they made the wrong choice. E.g. starting what you thought would be a quick instant message exchange that results in one party sending "i'll phone you".

Task - "I need to travel somewhere"
Where are you travelling? You could walk, take the bus, take the train, drive, liftshare, cycle, run (if you're late for a meeting), fly, take a boat, or any combination of these to get to where you need to be. Again people know which tool they want to use and will easily switch tools if appropriate, e.g. I liftshare or drive to work in the main, but will switch to buses or trains depending on appropriateness and availability of these tools.

Task - "I need to track/report project progress"
What are you reporting? Who are you reporting to? Within INSRV we're currently tracking and reporting project progress using MS Project, MS Word, MS PowerPoint, MS Excel, Jira, Infra, Confluence and even JPG images. The point is that people know what they want to use, how they want to use it and, again, will switch tools when necessary, e.g. switching from an MS Excel tracked burndown chart to an MS PowerPoint slide when the information needs to be reported.

Task - "I need to do some DIY"
What are you doing? Knocking down a wall? Building a bookcase? Opening a tin of paint? If you're opening a tin of paint how many people would use a screwdriver? Is that the designed and advertised use of a screwdriver? No. But is it the most appropriate tool for the job? Probably.

Task - "I need to make a reminder note"
What do you use here? Post-it note? Scribble in a notebook? Write on the back of your hand? Add to Notes To Do list? Create a Connections Activity? Record an audio note on your mobile? Add to your Remember The Milk account? People know when and what to use and don't use the same one tool every time they carry out this task.

Actually I lied earlier: the real thing that is bugging me is why the above examples are so easy, but people are finding the suite of tools that we've released so hard in terms of what to use when.

What makes the above so easy? What makes these MWE tools so hard to understand? What are we doing wrong in not making things easy? What could we do better?

Is it knowledge of how to use the tools? Is it experience of using the tools? Is it that technology will always be horrible and scary to a lot of people?

If you've embarked on similar projects have you faced the same challenges? How did you overcome them? Did you overcome them or did your social networking projects die on the vine?

Thanks for reading - my blog posts may get more coherent with a bit more practice at writing :-)