January 10, 2014

Distance from Christmas

Back in January 2011, around this time of year, we had a pretty shitty set of weeks. Sometime around then, Tim told me, “You know this is normal; this time of year is considered the most depressing [for the western world]. Something to do with no daylight and too many days away from Christmas.”

The first few weeks of January 2012 sucked; 2013 was not much better. Sales slumping, employees not working out, system failure, and probably a whole slew of personal issues would always compound around then.

Well, I guess nature and science decided 2014 should be no different. This was the week we almost lost everything, again.

Monday – AACRAO server just dies. Fortunately we had backups off site and were able to recover; but the whole day gone.

Tuesday – AWMA launches. The usual post-launch QA ensues but we discover the iMIS portal is coded like a piece of shit. Tuesday and Wednesday gone.

This morning, at the gym with Zack, I was pretty upbeat. We had record traffic to our servers from the AAF competition, and they held strong on Wednesday, Thursday, and Friday. I told Zack in between sets, “You know, running a business is like running servers in the cloud: you can scale up, be a contractor and make money with less redundancy, or you can scale out, hire low-cost workers who all do the same thing with a marginal profit.”

I didn’t realize 4 hours later all hell would break loose.

It started around noon: with 185 people on the site, we began getting reports of intermittent time-outs. My first thought was: shit, Rackspace is having a connectivity issue, and this can take hours to fix. I immediately contacted Rackspace and learned “a node is failing; let’s take it out of rotation.”

An hour goes by; traffic shifts between 150 and 190 users per minute. At the high end things are not doing too well; at the low end things are fine. I started to suspect that traffic was indeed the culprit.

An hour goes by and we have to interview Tommy. I’m slightly panicked, but composed for the interview. If this works out to be a bad hire it’s because we were not in our right mind during the interview; he was not brain dead so we made him an offer; probably not the best reason to hire someone.

During the interview, traffic surges to 220+ people — site pretty much dead.

Help desk complaints coming through non-stop. It’s like the healthcare website fiasco, except it is on our shoulders. Marc is out sick; I ask Tim/Zack for help to respond to people as I go into crisis mode.

We knew cloudsites would die at some point, but thought that since it made it to 185 and held strong, we were in clear territory. Still, I had a backup server ready to go: 8 GB of RAM and 4 vCPUs, which I could get working within 15 minutes.

The thing about a server under load is that it responds in roughly constant time until it is overloaded. Then it responds to no one. It isn’t a gradual slowdown; it is a shutdown, all or nothing.
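A toy sketch of that cliff, for illustration only. It assumes a single FIFO queue, a made-up capacity of 200 requests per minute, and a made-up 30-second client timeout; none of these numbers were measured during the outage, and `simulate` is a hypothetical helper, not anything we ran that day.

```python
def simulate(arrivals_per_min, capacity_per_min, minutes, timeout_s=30.0):
    """Toy FIFO model of a server under steady load.

    Requests that cannot be handled in a given minute wait in a backlog.
    The server keeps working through the backlog even after clients have
    given up, so once the wait passes the timeout, nobody gets an answer.
    Returns a list of (minute, wait_seconds, served_in_time) tuples.
    """
    backlog = 0.0
    history = []
    for minute in range(1, minutes + 1):
        # Backlog only grows when arrivals exceed capacity (assumed numbers).
        backlog = max(0.0, backlog + arrivals_per_min - capacity_per_min)
        # Time a newly arrived request spends waiting behind the backlog.
        wait_s = 60.0 * backlog / capacity_per_min
        history.append((minute, wait_s, wait_s <= timeout_s))
    return history


if __name__ == "__main__":
    for rate in (180, 220):
        served = [ok for _, _, ok in simulate(rate, 200, 10)]
        print(rate, served)
```

Under these assumptions, anything below capacity never queues, so the wait stays flat; just above capacity, the wait climbs every minute until it crosses the timeout, and from then on every request fails.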

I “fail-over” to the backup server and within seconds that server is dead. From cloudsites handling some traffic to an immediate death. 220 people per minute now really pissed.

I fail back to cloudsites and try again. This time I provision a massive 24 GB server with 12 vCPUs. I am also on the phone with Rackspace as this is going on. They are treating this as a “high traffic incident.”

As they work on their side, I enable the 24 GB server. It lasted about 16 seconds, 4x longer than the less powerful server. Probably down to my lack of knowledge in the emergency, but for the second time, the site was totally down.

I fail back to cloudsites and just pray that traffic drops below the 180 mark. It doesn’t.

Fortunately, cloudsites finds a way to give me 2x the normal resources and the problem is solved.

I’m breathing again, but I just sat at my desk for 9 hours, eating just the fruit from our weekly fruitbox.

I called both Joanne and George; surprisingly, both were confident that we were still the right team for them. Of course, if this happens again, I don’t think they’ll be singing the same tune.

At 4:30 PM we had our monthly company strategy discussion; I was frazzled but made it through. I skipped the monthly happy hour to stay back and lick my wounds. Zack stayed back with me and we tackled the regular humdrum issues of the day.

At 7, I get a call from Joanne that “we are not going to tell Jim” (CEO of AAF). I was shocked. I told her that it was a 3-hour widespread outage with maybe 1000 people impacted, so it was probably not the best idea to keep it under wraps. I guess we’ll see how that blows over.

Shitty day for sure. Credibility hurt pretty badly, but not irreparably.

What’s next?

Improve infrastructure; test the crap out of the improvements; pray this doesn’t happen again before we are ready.
