• A touch from the distributed NoSql DB world

    Published by on January 28th, 2013 5:39 pm under Algorithms, Data structures, DB, Distributed databases, Distributed systems, MapReduce, NoSQL


Lately I’ve been reading the ‘Seven Databases in Seven Weeks’ book, and besides recommending it if you like databases in general, there are a lot of interesting concepts and ideas to learn from it. In this post I won’t be talking about the details or good uses of a particular DB, or which one to choose to solve a particular data problem you have, but about something I enjoyed significantly more while reading: the supporting algorithms, data structures and architectures applied to solve specific problems.

I enjoy watching how complex things can be solved with just a simple idea, or a couple of them, and this is the case in the DB world, where the integration of all the pieces is more than the sum of the individual parts. So, let’s start with the things that gladly caught my attention:

One of the things I liked most, and had never gone into detail on before (beyond having heard the name and vaguely knowing what it could be used for), is the Bloom filter. I gladly understood how some of the databases analyzed in the book take advantage of them to perform fast lookups when they need to get a specific value. It is a good structure for maintaining an inexact index using significantly less space than an accurate index would, and, with a certain probability and error rate, it can tell you that the thing you are looking for is maybe there, or answer that it is definitely not there (being 100% sure – i.e. it guarantees no false negatives). One interesting aspect of its theory is that you can calculate these probabilities from how many bits you are using to track things, how many hash functions are used to evaluate each piece of data, and how big the set you are indexing is getting (or will be). Using this, and having an approximate idea of how your data will grow, you can know the number of bits you need to maintain a filter with an error rate below a specific bound.
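To make the idea concrete, here is a minimal Python sketch (my own illustrative code, not taken from any of the databases in the book): an m-bit array, k salted hashes per item, and the standard sizing formula for choosing m given an expected item count and a target error rate.

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter: k hash positions per item over an m-bit array.

    It answers "definitely not present" or "maybe present" -- no false
    negatives, but false positives with a tunable probability.
    """

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # a Python int doubles as an arbitrary-size bit array

    def _positions(self, item):
        # Derive k positions by salting a single hash function k ways.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False => definitely absent; True => possibly present.
        return all(self.bits & (1 << pos) for pos in self._positions(item))

def bits_needed(n_items, error_rate):
    # Standard sizing formula: m = -n * ln(p) / (ln 2)^2
    return math.ceil(-n_items * math.log(error_rate) / math.log(2) ** 2)

bf = BloomFilter(m_bits=1024, k_hashes=5)
for key in ("riak", "hbase", "redis"):
    bf.add(key)

print(bf.might_contain("hbase"))     # True -- it was added
print(bf.might_contain("mongodb"))   # almost certainly False
print(bits_needed(1_000_000, 0.01))  # ~9.59 million bits for 1% error on 1M keys
```

Note that a false positive only costs you a wasted lookup; a false negative would make the index useless, and the structure guarantees those cannot happen.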

A good practical example is how this is used within HBase. As this database stores columns in chunks that are split among several servers according to a sharding scheme (decided by the servers), you can know which chunk you need to go to when looking for something, but you don’t know whether that something is actually there (here a Bloom filter can save you from searching at all, if it answers that what you are looking for is not there). You can find an interactive example and related material here.

Another interesting concept used in many NoSql databases to satisfy query and aggregation demands, among others, is the MapReduce algorithm. While the idea is very simple to understand, it is a powerful concept that frees you to execute whatever logic you want over your distributed (and probably huge) sets of data in several ways. Its foundation is the divide and conquer principle, where a task (or big problem) is split and treated as subtasks (or small problems) returning partial results, and at the end a final step is performed to merge them all (this approach sometimes helps to reduce computational complexity and the resources demanded, and both can be analyzed theoretically). Besides, while divide and conquer is a very old concept in computer science, the new MapReduce boom is related to how Google (who patented it) uses it to resolve its huge query demands. In that setting the split happens across nodes within a cluster, so it introduces parallelism as well: each participant contains part of the data you need to process, and waits for the function (or logic) it must execute over its own subset, in order to return its results to the caller. An overall view consists of a first step (map), where you evaluate and possibly convert a set of elements to another domain, and a merge step (reduce) that operates on the lists of results from the different nodes to generate the final result.
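As a toy illustration of that map/merge split (the shards here are just Python lists standing in for the data owned by different nodes), a distributed word count might be sketched like this:

```python
from collections import Counter
from functools import reduce

# Hypothetical "nodes": three shards of a distributed document set.
shards = [
    ["the quick brown fox", "the lazy dog"],
    ["the fox jumps"],
    ["lazy summer days", "the dog sleeps"],
]

def map_phase(documents):
    # Map step: each node converts its own documents into partial word counts.
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    return counts

def reduce_phase(partials):
    # Reduce step: merge the per-node partial results into the final answer.
    return reduce(lambda a, b: a + b, partials, Counter())

partial_results = [map_phase(shard) for shard in shards]  # runs on each node
totals = reduce_phase(partial_results)                    # runs on the caller
print(totals["the"])  # 4
print(totals["fox"])  # 2
```

In a real cluster each `map_phase` call would execute on the node that owns the shard, and only the small partial counters would travel over the network.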

A practical example of this can be found in Riak, where a map/reduce job is sent to the cluster and distributed over the participant nodes in the “Riak Ring”. The map function is run on each node in the cluster, close to where the data is (without the need to transfer ‘raw’ data through the network). So, besides avoiding network latency issues (from moving big sets of data), you also don’t need high centralized processing power, as this architecture evaluates each subset in parallel on each individual node. Finally, the reduce function is executed after all the partial results from the nodes have been collected. You can find more details about how this works here.

A (very) interesting theorem from theoretical computer science, related to distributed computing, is the CAP theorem. It basically asserts that a distributed system can simultaneously provide at most two of the following three properties: Consistency, Availability and Partition tolerance. Taking into consideration the environment and the many factors that act against a distributed system, it is not possible to fulfill all three without falling into the fallacies of distributed computing.

Beyond what the theorem says, it’s more interesting to see how some NoSql databases that are also distributed deal with it and allow ‘playing’ with this assertion. If we bring Riak into the scene again, we will find that this DB has an interesting way of parameterizing its inner workings when reading/writing data within the cluster. Riak’s ring architecture distributes, and replicates, data among the nodes in its cluster. This clearly makes the DB partition tolerant, but the other two properties are still available for you to play with: Consistency and Availability. Nevertheless, in the case of Riak you will never achieve full consistency if you use it in a real distributed environment (Riak offers what it calls ‘Eventual Consistency’).

This database allows you to freely select the values of: the total number of nodes (N) that will participate in replicating data, how many nodes must be written (W) before a write returns as successful, and how many nodes must be read (R) to get the latest value for a key. With these parameters you can, for example, achieve high availability by setting a high value for N; if what you want instead is consistency, you can try setting W=N or R=N (leaving the other value, R or W, at 1), where the first setting is called consistency by writes and the second consistency by reads. You might think that with a high N and one of those two choices you can achieve both consistency and availability, but this is not the case, as having a high number of replica nodes leads to the following scenario: if you configure consistency by writes, the number of endpoints an update can hit is bigger, leading to inconsistent versions of data for the same key (which the user must resolve), in addition to degraded performance from writing to multiple nodes. If you set consistency by reads you will eventually find a similar scenario, showing that you can’t achieve both. I encourage you to read more about how Riak lets you play with this, and about its internals.
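The interplay between N, W and R can be sketched with a brute-force check (illustrative code, not Riak’s actual quorum logic): a read is guaranteed to see the latest acknowledged write exactly when every possible write set overlaps every possible read set, which holds whenever W + R > N — with W=N or R=N as the two extreme settings the paragraph above describes.

```python
from itertools import combinations

def quorums_overlap(n, w, r):
    """True when every W-node write set intersects every R-node read set,
    i.e. any read is guaranteed to touch at least one up-to-date replica."""
    nodes = range(n)
    return all(
        set(ws) & set(rs)
        for ws in combinations(nodes, w)
        for rs in combinations(nodes, r)
    )

# Consistency by writes: W=N means every write reaches all replicas.
print(quorums_overlap(n=3, w=3, r=1))  # True
# Consistency by reads: R=N means every read consults all replicas.
print(quorums_overlap(n=3, w=1, r=3))  # True
# A strict majority on both sides also works, since W + R > N.
print(quorums_overlap(n=3, w=2, r=2))  # True
# Loose settings leave non-overlapping sets -> possibly stale reads.
print(quorums_overlap(n=3, w=1, r=1))  # False
```

Note this only captures the overlap condition; in practice, concurrent writes under a high N can still produce sibling versions of a key that the client must reconcile, which is exactly the availability/consistency tension described above.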

This is how Riak stands in front of the CAP theorem, and it’s nice to see how, by tweaking some parameters and playing a little with the cluster configuration, you can set Riak up as {c}AP, having consistency sometimes (or eventually). Other databases achieve consistency or availability exclusively, but not both.

There are also other interesting structures used in the distributed DB world, and if you like algorithms, data structures and architectures as I do, I recommend diving into them: inverted indexes, vector clocks, write-ahead logging, distributed file systems, DB sharding, and many more.

    Thanks for reading, happy diving!! :)
