First stats with Tokyo Cabinet


Today we started testing Tokyo Cabinet as our DBM for the new design. We had some very good references about it, so we thought we should give it a try.

After setting up Tokyo Cabinet with its Python bindings, and Tokyo Tyrant (the database server) with its Python bindings too, we ran some quick tests. We drafted a new schema-less design for the new database and dumped part of our old data into Tokyo Cabinet.

For those not familiar with the term schema-less, it basically means a database with no table structure: everything is stored as a (key, value) pair. On the one hand, a key-value database is much faster to read and write; on the other, it’s much harder to maintain and keep in sync.
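
To make the idea concrete, here is a minimal sketch that uses a plain Python dict as a stand-in for the store. The key naming scheme (`feed:<id>:<field>`), the helper names and the sample values are illustrative assumptions, not our actual layout.

```python
# A plain dict stands in for the key-value store (assumption: a real
# store would offer the same get/put-by-key behaviour).
store = {}

def put_feed(feed_id, fields):
    """Store each field of a feed under its own key."""
    for name, value in fields.items():
        store["feed:%s:%s" % (feed_id, name)] = value

def get_field(feed_id, name):
    """Fetch a single field; returns None if the key is missing."""
    return store.get("feed:%s:%s" % (feed_id, name))

put_feed(42, {"title": "Example feed", "url": "http://example.com/rss"})
title = get_field(42, "title")
```

There is no table definition anywhere: the structure lives entirely in how the application builds its keys, which is why keeping the data in sync becomes the application's job.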

So, we ran some queries (read-only operations) against both databases, and this is what we saw:

Test 1:

  • All data from a feed (MySQL):  0.01699 s
  • Partial data from a feed (TC): 0.00174 s

This first test wasn’t really fair, as MySQL had to retrieve all fields per record, while TC just had to access a handful of buckets with fewer fields. We ran it anyway because it reflects the real scenario: we currently retrieve many more fields from a feed than we should, so the new query under TC is faster not only because of the database, but also because it’s much more lightweight.

Anyway, we modified the test so that both queries retrieved the same fields per row:

Test 2:

  • Partial data from a feed (MySQL): 0.00346 s
  • Partial data from a feed (TC): 0.00151 s

Here the two are much closer, although TC is still a bit more than twice as fast. Again, this isn’t really fair, as MySQL executes just one query while we issue several with TC. So, we changed the TC query into a multiget request (one that requests several keys at the same time):

Test 3:

  • Partial data from a feed (MySQL): 0.003533 s
  • Partial data from a feed (TC with Multiget): 0.000845 s
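
The difference between Test 2 and Test 3 most likely comes down to round trips, assuming each request to the server carries a fixed cost. The sketch below uses a hypothetical in-memory `FakeServer` class (it is not the Tokyo Tyrant API) that simply counts requests, to show why one multiget beats several single gets.

```python
class FakeServer:
    """Hypothetical stand-in for a remote key-value server."""

    def __init__(self, data):
        self.data = dict(data)
        self.requests = 0  # number of simulated round trips

    def get(self, key):
        # One request per key.
        self.requests += 1
        return self.data.get(key)

    def multiget(self, keys):
        # One request returns every key that exists.
        self.requests += 1
        return {k: self.data[k] for k in keys if k in self.data}

server = FakeServer({"feed:1:title": "A", "feed:1:url": "B", "feed:1:lang": "en"})
keys = ["feed:1:title", "feed:1:url", "feed:1:lang"]

# Several single gets: one round trip per key.
one_by_one = {k: server.get(k) for k in keys}
single_gets = server.requests

# One multiget: a single round trip for all keys.
server.requests = 0
batched = server.multiget(keys)
batched_gets = server.requests
```

Both approaches return the same data; only the number of requests changes, and that is the cost a multiget saves.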

Under equal conditions it’s clear which one is faster: TC with multiget took roughly a quarter of MySQL’s time. So, I think we’ll continue experimenting with Tokyo Cabinet on more real data and see how it performs.


  1. #1 by Gabriel - June 25th, 2009 at 11:38

    uhmmm interesting

  2. #2 by abarrera - June 25th, 2009 at 11:47

    Thinking about using it at work? :P

  3. #3 by Ceporrock - June 25th, 2009 at 15:30

    Ok, I’m not really used to this advanced technology, I’m just the coffee-carrier-man at my office, but, can you explain it a little bit more? I mean, what is a tuple for, and how are relationships made in that schema-less designs :o Examples, please!

  4. #4 by abarrera - June 25th, 2009 at 15:52

    Basically, what you work with is a hash table, so you store pairs of key -> value, for example, idLog -> processPID. The key here is to duplicate content. In a relational DB you normally try to minimize the duplicated content; here it’s the opposite. You pay in CPU and space to gain read/write access speed.

    So for example, instead of joining 2 tables, you duplicate (a process called denormalization) the data under both keys:

    idLog:bla -> [field1, field2]
    idLog:bla2 -> [field1, field3, field4]

    You then do all the sorting, filtering, etc. on the business logic side of the app, instead of in the database.

    Hope it helps a little ;)

  5. #5 by Jorge - June 26th, 2009 at 13:42

    Alex,

    Glad that you integrated a container database system. It makes a lot of sense for anything that is not relational. The stats look OK, but I would recommend BerkeleyDB instead, because it is more than a DBM and gives you extra functionality, more power, etc…

    I developed a server implementation, an open source project:
    http://code.google.com/p/dbmd/

    to make Berkeley distributed, so you don’t have to open database files locally and can work on remote servers with bulk methods, etc…

    With BerkeleyDB you can have environments, transactions if you wish, replication, duplicate keys, BTREE, HASH, memory cache, etc…

    For example, you could have most popular feeds in an environment with some memory assigned (like 128MB), having a factor of 10 related to disk access, and the rest of feeds in a low memory environment. This would give you memory access for most popular feeds, disk speed access for less popular.

  6. #6 by abarrera - June 28th, 2009 at 17:52

    @Jorge, the problem with BDB is that it’s slow compared to Tokyo Cabinet. I really don’t need special extra functionality; I do few operations that are unusual from a database point of view. That said, TC also supports more abstract operations, à la BDB :)

    Check out the numbers: http://tokyocabinet.sourceforge.net/benchmark.pdf

    TC also has transactions, replication, logging, btree, hash, in-memory storage ;)

    I already use in-memory storage with memcache, which btw, is also supported by Tokyo Tyrant (access to the data through the memcache protocol).

    Thanks for the tips though!

  7. #7 by Pavel Guzhikov - September 29th, 2009 at 14:57

    Hi there.

    I’m doing a little research for my big project and looking for some good key-value db with success stories.

    Yeap, I know TC is really good. But what python bindings are you talking about? It’s really important for me, so I’m here and waiting for your answer :]

    Thanks.

  8. #8 by Bruce - May 19th, 2010 at 17:32

    uhmmm interesting
