The importance of resting

Happy New Year all!

Over Christmas we gave ourselves a little break and I have to say, it felt really good! We entrepreneurs love to do things; we are generally overexcited by everything, and it’s rare for us not to be handling several projects at the same time. The problem is that we don’t know when to take a break. Not resting, not knowing when to stop and watch movies with popcorn, hurts the performance of whatever we do.

That was exactly what was happening to me. My performance in the last weeks of December was terrible and I couldn’t concentrate on my coding. Having been here before, I immediately recognized it for what it was: lack of rest. So I took 10 days of Christmas holidays.

Now I’m back, with a lot of energy for the new year and full of surprises for the coming months! So stay tuned for some news and remember: take a break from time to time, it’s good for your health ;)


No Comments

Efficient communication protocols are critical

As we said before, one of the problems we had when migrating to AWS was that the backend system was putting a lot of stress on the server. After a week of benchmarking we realized that the protocol we were using to communicate between our different subsystems was the main cause of the load increase.

From day one we didn’t want to use complex and cryptic protocols so we chose xmlrpc for our communication channel. It was easy to implement, had wide support in php and python and was very easy to debug. We knew that at some point we would need to switch to a more efficient protocol, but we didn’t know it was going to be so soon.

After some extensive benchmarking we realized that the throughput of the protocol was very low. Not only that: if too many xmlrpc connections were spawned, they would eventually consume all the resources of the process (file descriptors, sockets and memory). This was a painful lesson, but we learned it. So we switched to the most efficient protocol we could find, a binary one. To be more precise, we use Python’s cPickle binary protocol. Saying that this is orders of magnitude more efficient doesn’t even come close :P
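To give a feel for the difference, here is a minimal sketch that encodes the same message both ways and compares the payload sizes. The message contents and the `store_posts` method name are made up for illustration, and it uses Python 3’s `pickle` module (the successor of cPickle), so it is not our actual backend code.

```python
import pickle
import xmlrpc.client

# Hypothetical message one subsystem might send to another.
message = {
    "feed_id": 42,
    "posts": [{"id": i, "title": "Post %d" % i, "read": False} for i in range(100)],
}

# XML-RPC wraps every value in verbose XML tags.
xml_text = xmlrpc.client.dumps((message,), methodname="store_posts")
xml_payload = xml_text.encode("utf-8")

# pickle writes a compact binary stream.
binary_payload = pickle.dumps(message, protocol=pickle.HIGHEST_PROTOCOL)

# Both encodings decode back to the same data.
xml_params, xml_method = xmlrpc.client.loads(xml_text)
decoded_from_xml = xml_params[0]
decoded_from_pickle = pickle.loads(binary_payload)

print("xmlrpc bytes:", len(xml_payload))
print("pickle bytes:", len(binary_payload))
```

Keep in mind that unpickling data can execute arbitrary code, so a pickle-based protocol only makes sense between subsystems that trust each other.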

So after switching each subsystem to the new protocol we saw the load on the machines go down. As with all big changes in the backend of any system, it took a while to stabilize. To avoid havoc we put it into production subsystem by subsystem, so for some time we had both protocols running side by side.

And so, always remember that the choices you make will come back to haunt you if you don’t make them carefully ;)


No Comments

Merry Christmas 2009!

Hi all,

We just wanted to wish you all a Merry Christmas and a happy New Year!! We hope that the coming year is even better than the past one and we wish you all the best of luck!

Turkey cheers from the Inkzee Team


No Comments

Being a lean startup

After working as a freelancer on the biggest social network in Spain, I’ve come to value the principles of what Eric Ries calls the Lean Startup. Among all the things he talks about, one struck a real chord with us: the notion of Continuous Development. Having experienced the slowness of development in my previous company, I realized that what Eric proposes was key to the success of Inkzee.

We thought the decision over and concluded that it was worth the effort: not only would we learn a lot during the process, we would also be more productive in the long run. So off we went to implement the 5 steps Eric proposes:

  1. Continuous integration server: We created our own home-brewed mini system that lets us add new tests and run them whenever we need to. For now it’s a rather rudimentary system, but it does the job handsomely.
  2. Source control commit check: We already had a source control system, so we just added the commit checks. At first we thought this was a waste of time, but after a while we realized how many bugs we had been introducing without them. We added 3 very basic checks: a Python syntax check (with pylint), a very basic PHP syntax check and the automatic triggering of all the unit tests we have. These checks also enforce common style rules for all the code under SVN, a somewhat gruelling task at first (all the old code was triggering the syntax and style checks), but very rewarding in the end.
  3. Simple deployment script: We had a very small deployment script, but we fine-tuned it so that it works flawlessly with the new AWS infrastructure. We now make all deployments in the same fashion, which reduces the number of mistakes you can introduce in this step. The script also creates a backup copy of the previously running code, in case you need to revert to the old version because of some critical error. To date we’ve never needed to revert anything ;)
  4. Real-time alerting: We introduced a lot of Munin plugins to monitor all kinds of parameters on our servers. We even developed some very simple Munin plugins for Tokyo Cabinet. The alerting itself is still missing: we have all the graphs, but we still need to set up an alerting framework to detect weird situations.
  5. Root cause analysis (5 whys): This is probably the most critical part of the process. We’ve realized that this 5th step is what makes all the previous ones work. The process is iterative: you start with something small, but with time it grows into an amazing process that is able to detect the slightest problem well before it makes it to the live servers. We’ve become used to always asking the 5 whys and it’s helping to improve the quality of the software we write.
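To illustrate the commit check from step 2, here is a minimal sketch of a Python syntax gate. It uses the standard `ast` module as a simple stand-in for pylint, and the function names and example file names are hypothetical, so treat it as an outline of the idea rather than our actual hook.

```python
import ast


def python_syntax_errors(sources):
    """Return (filename, message) pairs for every file that fails to parse.

    `sources` maps a file name to the source code about to be committed.
    """
    errors = []
    for filename, code in sources.items():
        try:
            ast.parse(code, filename=filename)
        except SyntaxError as exc:
            errors.append((filename, "line %s: %s" % (exc.lineno, exc.msg)))
    return errors


def commit_allowed(sources):
    """A commit goes through only if every file parses cleanly."""
    return not python_syntax_errors(sources)


# Hypothetical commit containing one good file and one broken file.
pending = {
    "good.py": "x = 1\n",
    "bad.py": "def broken(:\n    pass\n",
}
problems = python_syntax_errors(pending)
for filename, message in problems:
    print("rejected %s (%s)" % (filename, message))
```

A real hook would read the changed files from the source control system and exit with a non-zero status to reject the commit.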

So all in all, it’s an incredible experience that we are still figuring out, but to date it has been impressive. We’ve managed to do 14 code uploads in a single day with no bugs whatsoever, and we’ve stopped introducing small, potentially critical bugs into the code base and on into production. If you want to give it a try, please do read the original posts by Eric Ries; you won’t regret it.


4 Comments

Christmas updates

Hi all,

Sorry we’ve been so silent lately. It’s been a mad fall. In our last post we wrote about a migration to Amazon Web Services (AWS). To be fair, the migration was a little painful. We had to set up new machines with new software versions, which made the system a little shaky for a while. We also ran into some really nasty performance issues with the new AWS instances.

First of all, if you have a Debian image as we had, be sure to use the correct fine tuned C libraries for Xen Virtualization (that’s what AWS runs on). We started having a huge load on the machines and after investigating we discovered we were using the wrong C library and it was really killing the server.

So after changing the libraries and setting up images that would autoconfigure themselves with the latest config files and code from the latest release, we set up the monitoring software. That’s when we discovered the magic of the AWS instances. When you’re the only one running stuff in the datacenter, everything runs smoothly, but when the US East Coast woke up, the performance of the machine would drop alarmingly. The reason is that when no one else is using the resources of the machine your VM is running on, you get extra resources, but when all the resources are in use, you get capped to what is configured for that AWS instance. In our case, that was killing us.

After figuring this out, we started benchmarking the backend and looking for ways to decrease the stress. When we finally did, everything went back to normal. We had to fine-tune some parameters but the worst part was done. That was our great AWS adventure, and it didn’t end there. In the next posts we’ll write a little about the different issues we had and the things we’ve been doing lately.


No Comments

Step 3: AWS migration

Hello all!

We are finally there! We’ve devoted this past month to migrating the whole Inkzee system to the new database we are deploying, Tokyo Cabinet. It’s taken us forever to finish this migration but it’s here to stay. We are still missing some key things, but the bulk of it is already there, so we’re starting the migration to Amazon Web Services (AWS).

Right now we’ve managed to build a custom image for our servers with all the tools we need to run the Inkzee backend. Tomorrow and next week we’ll start moving data from the old server to the one in Amazon. This won’t be painless and we know it. The architecture is still not 100% stable so we expect some minor glitches but nothing too serious.

So, here we go for step 3!!

Happy summer to all!

The Inkzee Team


No Comments

Tokyo Tyrant and some numbers

Tokyo Tyrant is the database server that uses Tokyo Cabinet as its backend. It allows you to access the database remotely. It supports 3 protocols: binary, memcache and http. This is great if you already have existing infrastructure.

We needed a php class that implemented the protocol so we took a look at two of them, Net_TokyoTyrant with Pete Warden’s patch and Tyrant by Bertrand Mansion. The first one supports http and binary protocols, while Tyrant only supports the raw binary protocol.

During the first tests, Net_TokyoTyrant went crazy when inserting over 28000 records over http, so I guess there’s something wrong with that. When we switched to the binary protocol it worked as expected.

Here are some quick numbers:

Net_TokyoTyrant (100000 keys)

Time inserted: 50.3662779331 secs
Time retrieved: 57.7555668354 secs
Time deleted: 34.1996 secs

Tyrant (100000 keys)

Time inserted: 39.330272913 secs
Time retrieved: 44.3433589935 secs
Time deleted: 26.9360201359 secs

The latter, Tyrant, is noticeably faster (roughly 1.3 times on every operation), so I guess we’ll go for it. It also helps that the author keeps it up to date, which is a plus!
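To compare the two clients more directly, the timings above can be converted into throughput and a per-operation ratio. This is just arithmetic on the numbers listed in this post; the variable and function names are only for illustration.

```python
KEYS = 100000  # number of keys used in each run

# Timings in seconds, copied from the results above.
timings = {
    "Net_TokyoTyrant": {"insert": 50.3662779331, "retrieve": 57.7555668354, "delete": 34.1996},
    "Tyrant": {"insert": 39.330272913, "retrieve": 44.3433589935, "delete": 26.9360201359},
}


def keys_per_second(seconds, keys=KEYS):
    """Throughput for one operation type."""
    return keys / seconds


for client, operations in timings.items():
    for operation, seconds in operations.items():
        print("%-16s %-8s %8.0f keys/s" % (client, operation, keys_per_second(seconds)))

# How many times longer Net_TokyoTyrant takes than Tyrant, per operation.
ratio = {
    operation: timings["Net_TokyoTyrant"][operation] / timings["Tyrant"][operation]
    for operation in timings["Tyrant"]
}
for operation, value in ratio.items():
    print("%-8s Net_TokyoTyrant takes %.2fx as long as Tyrant" % (operation, value))
```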

The Inkzee Team


No Comments

First stats with Tokyo Cabinet

Today we started testing Tokyo Cabinet as our DBM for the new design. We had some very good references about it, so we thought we should give it a try.

After setting up Tokyo Cabinet, its Python binding and Tokyo Tyrant (db server) with its Python bindings too, we did some quick tests. We drafted a new schema-less design for the new database and dumped part of our old data into Tokyo Cabinet.

For those not familiar with the term schema-less, it basically means a database that has no table structure; that is, everything is stored as a (key, value) tuple. On the one hand, a key-value database is much faster to read and write; on the other, it’s much harder to maintain and keep in sync.
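As a rough illustration of what that looks like, the sketch below stores one feed as a handful of (key, value) pairs. A plain Python dict stands in for the database, and the `feed:<id>:<field>` key naming and the field names are assumptions made up for this example, not our real design.

```python
import json

# A plain dict stands in for the key-value database in this sketch.
store = {}


def feed_key(feed_id, field):
    """Build the (hypothetical) key under which one field of a feed is stored."""
    return "feed:%d:%s" % (feed_id, field)


def put_feed(store, feed_id, fields):
    """Store each field as its own (key, value) pair; values are JSON strings."""
    for field, value in fields.items():
        store[feed_key(feed_id, field)] = json.dumps(value)


def get_field(store, feed_id, field):
    """Read a single field without touching the rest of the record."""
    return json.loads(store[feed_key(feed_id, field)])


def delete_feed(store, feed_id, fields):
    """With no schema, the application must know every key it has to remove."""
    for field in fields:
        store.pop(feed_key(feed_id, field), None)


put_feed(store, 42, {"title": "Inkzee blog", "url": "http://example.com/rss", "post_ids": [1, 2, 3]})
title = get_field(store, 42, "title")

# Forgetting a field on delete leaves an orphaned key behind.
delete_feed(store, 42, ["title", "url"])
leftover = sorted(store)
```

The last two lines show the maintenance cost: nothing in the database knows that `post_ids` belonged to the deleted feed, so keeping things in sync is entirely up to the application.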

So, we ran some queries (read-only operations) against both databases and this is what we saw:

Test 1:

  • All data from a feed (MySQL):  0.01699 s
  • Partial data from a feed (TC): 0.00174 s

This first test wasn’t really fair, as MySQL had to retrieve all fields per record, while TC just had to access a bunch of buckets with fewer fields. We ran this first test because it reflects the real scenario: currently we retrieve many more fields from a Feed than we should, so the new query under TC is faster not only because of the database but also because it’s much more lightweight.

Anyway, we modified the test so that both queries retrieved both fields per row:

Test 2:

  • Partial data from a feed (MySQL): 0.00346 s
  • Partial data from a feed (TC): 0.00151 s

Here the two are closer, although TC is still a little over twice as fast. Again, this isn’t really fair, as MySQL is executing just one query versus the several that we do with TC. So, we changed the TC query into a multiget request (requesting several keys at the same time):

Test 3:

  • Partial data from a feed (MySQL): 0.003533 s
  • Partial data from a feed (TC with Multiget): 0.000845 s

Under equal conditions it’s clear which one is faster. So, I think we’ll continue experimenting with Tokyo Cabinet and some more real data and see how it performs.
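The reason the multiget in Test 3 helps is that it replaces several requests with a single one. The sketch below makes that visible with a hypothetical in-memory `CountingStore` that simply counts round trips; it does not talk to Tokyo Tyrant, and the key names are invented for the example.

```python
class CountingStore:
    """In-memory stand-in for a remote key-value server that counts requests."""

    def __init__(self, data):
        self._data = dict(data)
        self.round_trips = 0

    def get(self, key):
        """Fetch one key: one request per call."""
        self.round_trips += 1
        return self._data.get(key)

    def multi_get(self, keys):
        """Fetch many keys in a single request; missing keys are skipped."""
        self.round_trips += 1
        return {key: self._data[key] for key in keys if key in self._data}


data = {"post:%d:title" % i: "Title %d" % i for i in range(5)}
keys = sorted(data)

# One request per key, as in Test 2.
one_by_one = CountingStore(data)
titles_single = [one_by_one.get(key) for key in keys]

# One request for all keys, as in Test 3.
batched = CountingStore(data)
result = batched.multi_get(keys)
titles_batched = [result[key] for key in keys]

print("single gets:", one_by_one.round_trips, "round trips")
print("multiget:   ", batched.round_trips, "round trip")
```

Against a real server each round trip carries network latency, which is why batching the keys pays off even though the data returned is identical.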


7 Comments

When partitioning isn’t enough

These past weeks we’ve been partitioning our database design. The goal was to achieve better scalability. Because Inkzee grows with the number of feeds it holds, not the number of users, we needed to partition the data tables so that we could process feed posts faster.
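One common way to partition by feed is to derive the table name from the feed id. The sketch below assumes a simple modulo scheme, 8 partitions and a `posts_NN` naming convention; all of these are hypothetical choices for illustration and not necessarily the scheme we use.

```python
NUM_PARTITIONS = 8  # hypothetical number of post tables


def posts_table_for(feed_id, partitions=NUM_PARTITIONS):
    """Map a feed id to the (hypothetical) table that holds its posts."""
    return "posts_%02d" % (feed_id % partitions)


def group_by_table(feed_ids, partitions=NUM_PARTITIONS):
    """Group feed ids by partition so each table can be processed on its own."""
    groups = {}
    for feed_id in feed_ids:
        groups.setdefault(posts_table_for(feed_id, partitions), []).append(feed_id)
    return groups


# 4000 feeds spread over the partitions.
layout = group_by_table(range(1, 4001))
for table in sorted(layout):
    print(table, len(layout[table]), "feeds")
```

Because all posts of a feed land in the same table, processing one feed only touches one partition, and each table stays a fraction of the size of the original.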

After altering a lot of our current code so that it works with the new database design, we’ve been experiencing problems with MySQL. It seems that, even though the solution makes the overall system much faster (around 3 to 4 times faster), some operations don’t play too well with MySQL and add unacceptable latency to the system.

We’ve been resisting the urge to migrate to a schema-less database but it seems we have no other option but to transition to it. So, even though we thought we could have the new design working by the end of the week, we are afraid we’ll have to postpone it until further notice. We’ll keep you guys updated though!

The Inkzee Team


No Comments

Step 2: Database redesign

As part of our milestones towards opening up Inkzee we have the database redesign. We currently manage more than 2 million posts and over 4000 blogs. And although that might not seem like a lot, our database is starting to complain. A lot of the queries we run against it are getting really sluggish.

That means that if we are to open up Inkzee we need to redesign the database so it can sustain a higher load of blogs and posts. We are currently working on it and we’ve made great progress. We have a prototype working with the new design but there are still some bugs and problems to resolve.

We hope to finish the new design sometime this week. We’ll then fire up our test cases to check nothing is broken, and once we’re sure the new design is as flawless as we can get it, we’ll release it to you guys! Hopefully you’ll experience a much faster site, not only on a subscription-by-subscription basis but especially when you request all posts from all blogs.

We’ll keep you posted!

The Inkzee Team


No Comments