Archive for December, 2009

Efficient communication protocols are critical

As we said before, one of the problems we had when migrating to AWS was that the backend system was putting a lot of stress on the servers. After a week of benchmarking we realized that the protocol we were using to communicate between our subsystems was the main cause of the load increase.

From day one we wanted to avoid complex and cryptic protocols, so we chose xmlrpc for our communication channel. It was easy to implement, had wide support in PHP and Python and was very easy to debug. We knew that at some point we would need to switch to a more efficient protocol, but we didn’t know it was going to be so soon.

After some extensive benchmarking we realized that the throughput of the protocol was very low. Not only that: if too many xmlrpc connections were spawned, they would eventually consume all the resources of the process (file descriptors, sockets and memory). This was a painful lesson, but we learned it. So we switched to the most efficient kind of protocol we could find, a binary one. To be more precise, we now use Python’s cPickle binary protocol. Saying that it is orders of magnitude more efficient doesn’t even come close :P
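To give a rough feel for the difference, here is a minimal sketch that encodes the same message as an XML-RPC call and as a binary pickle and compares the sizes. It is only an illustration, not the code running in the backend: it uses Python 3’s `xmlrpc.client` and `pickle` modules (where `pickle` plays the role cPickle had in Python 2), and the `store` method name and the sample payload are made up.

```python
import pickle
import xmlrpc.client


def encoded_sizes(payload):
    """Return (xmlrpc_bytes, pickle_bytes) for the same payload."""
    # XML-RPC wraps every value in verbose XML tags.
    # "store" is a hypothetical remote method name.
    xml_message = xmlrpc.client.dumps((payload,), methodname="store")
    xml_size = len(xml_message.encode("utf-8"))

    # The binary pickle protocols write compact opcodes instead of text.
    pickle_size = len(pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL))
    return xml_size, pickle_size


# Made-up payload: a small record with a list of 100 floats.
sample = {
    "feed_id": 42,
    "scores": [0.1 * i for i in range(100)],
    "tags": ["python", "aws"],
}

xml_size, pickle_size = encoded_sizes(sample)
print("xmlrpc bytes:", xml_size)
print("pickle bytes:", pickle_size)
```

Message size is only one part of the cost (parsing and connection handling matter too), so take this as a hint rather than a benchmark. Also keep in mind that unpickling can execute arbitrary code, so a pickle-based channel only makes sense between subsystems that trust each other.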

So after switching each subsystem to the new protocol we saw the load on the machines go down. As with all big changes in the backend of any system, it took a while to stabilize. To avoid any havoc we put it into production subsystem by subsystem, so for some time we had both protocols running side by side.

And so, always remember that the choices you make will come back to haunt you if you don’t make them carefully ;)


Merry Christmas 2009!

Hi all,

We just wanted to wish you all a Merry Christmas and a happy New Year!! We hope that the coming year is even better than the last one, and we wish you all the best of luck!

Turkey cheers from the Inkzee Team


Being a lean startup

After working as a freelancer on the biggest social network in Spain, I’ve come to value the principles of what Eric Ries calls the Lean Startup. Among all the things he talks about, one struck us as really powerful: the notion of Continuous Deployment. Having experienced the slowness of development in my previous company, I realized that what Eric proposes was key to the success of Inkzee.

We thought the decision over and concluded that it was worth the effort: we would not only learn a lot in the process, but also become more productive in the long run. So off we went to implement the 5 steps Eric proposes:

  1. Continuous integration server: We created our own home-brewed mini system that lets us add new tests and run them whenever we need to. For now it’s a rather rudimentary system, but it does the job handsomely.
  2. Source control commit check: We already had a source control system, so we just added the commit checks. At first we thought this was a waste of time, but after a while we realized how many bugs we had been introducing without them. We added 3 very basic checks: a Python syntax check (with pylint), a very basic PHP syntax check and the automatic triggering of all the unit tests we have. These checks also enforce common style rules for all the code under SVN, a somewhat gruelling task at first (all the old code was triggering the syntax and style checks), but very rewarding in the end.
  3. Simple deployment script: We had a very small deployment script, but we fine-tuned it so that it works flawlessly with the new AWS infrastructure. We now make all deployments in the same fashion, which reduces the number of mistakes you can introduce in this step. The script also creates a backup copy of the previously running code, in case you need to revert to the old version because of some critical error. To date we’ve never needed to revert anything ;)
  4. Real-time alerting: We introduced a lot of Munin plugins to monitor all kinds of parameters on our servers. We even developed some very simple Munin plugins for Tokyo Cabinet. The alerting part is still missing: we have all the graphs, but we still need to set up an alerting framework to detect weird situations.
  5. Root cause analysis (5 whys): This is probably the most critical part of the process. We’ve realized that this 5th step is what makes all the previous ones work. The process is iterative: you start with something small, but with time it grows into an amazing process that is able to detect the slightest problem well before it makes it to the live servers. We’ve become used to always asking the 5 whys, and it’s helping us improve the quality of the software we write.
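As an illustration of the commit checks in step 2, here is a minimal sketch of a Python syntax check that a commit hook could call. It is not the hook described above: it uses Python’s built-in `compile()` as a stand-in for pylint, the function names `syntax_error`, `check_files` and `main` are made up, and how the hook obtains the list of changed files is left as an assumption.

```python
import sys


def syntax_error(source, filename="<commit>"):
    """Return a short error message if the source does not parse, else None."""
    try:
        # compile() only parses the code; it does not execute it.
        compile(source, filename, "exec")
    except SyntaxError as exc:
        return "%s:%s: %s" % (filename, exc.lineno, exc.msg)
    return None


def check_files(paths):
    """Check each file and collect the error messages."""
    failures = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            error = syntax_error(handle.read(), path)
        if error is not None:
            failures.append(error)
    return failures


def main(paths):
    """Print any problems and return an exit code (0 means the commit may pass)."""
    problems = check_files(paths)
    for problem in problems:
        print(problem, file=sys.stderr)
    return 1 if problems else 0
```

A hook script would end with something like `sys.exit(main(sys.argv[1:]))`, assuming the changed Python files are passed as arguments; a non-zero exit code is what makes the commit get rejected.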

So all in all, it’s an incredible experience that we are still figuring out, but so far the results have been impressive. We’ve managed to do 14 code uploads in a single day with no bugs whatsoever, and we’ve stopped letting small but potentially critical bugs slip into the code base and all the way into production. If you want to give it a try, please do read the original posts by Eric Ries, you won’t regret it.


Christmas updates

Hi all,

Sorry we’ve been so silent lately. It’s been a mad fall. In our last post we wrote about a migration to Amazon Web Services (AWS). To be fair, the migration was a little painful. We had to set up new machines with new software versions, which made the system a little shaky for a while. We also ran into some really nasty performance issues with the new AWS instances.

First of all, if you have a Debian image as we did, be sure to use the correct C libraries, fine-tuned for Xen virtualization (that’s what AWS runs on). We started seeing a huge load on the machines, and after investigating we discovered we were using the wrong C library and it was really killing the server.

So after changing the libraries and setting up images that would autoconfigure themselves with the latest config files and code from the latest release, we set up the monitoring software. That’s when we discovered the magic of the AWS instances. When you’re the only one running stuff in your datacenter, everything runs smoothly, but when the US East Coast woke up, the performance of the machine would drop alarmingly. The reason is that when no one else is using the resources of the machine your VM is running on, you get extra resources, but when all the resources are in use, you get capped to what is configured for that AWS instance. In our case, that was killing us.

After figuring this out, we started benchmarking the backend and looking for ways to decrease the stress. When we finally did, everything went back to normal. We had to fine-tune some parameters, but the worst part was done. That was our great AWS adventure, and it didn’t end there. In the next posts we’ll write a little bit about the different issues we had and the things we’ve been doing lately.
