Last week’s Virtual JUG session was about distributed databases in general and Apache Cassandra in particular. The session looked at how distributing your database might help your application run more smoothly, when your business needs to go distributed, and what you should know before you take it there.
Virtual JUG speaker Christopher Batey, a technical evangelist and software engineer at DataStax, presented a great session, not only talking about Cassandra’s technology and design choices, but also covering the general theory of distributed databases. The IRC chat was buzzing, and if you haven’t checked it out during a Virtual JUG session yet, we are always happy to see you at #VirtualJUG on Freenode. Then again, if you’re curious about Cassandra specifically, you can just find Christopher on Twitter: @chbatey and he’ll be happy to share his thoughts with you.
Anyway, let’s dig into what we have learnt from this session. If you want to watch it for yourself first, here it is embedded for your convenience.
First of all, to set the atmosphere for this session about distributed databases, Christopher made us think of the last outage we had, or the last time we were unable to scale a service as our customer base grew. Indeed, database issues are a common reason for service outages, and we’ve experienced a fair share of them ourselves. So learning how distributed databases can help with some of these problems would be lovely.
However, don’t dwell too much on those gloomy days; science has shown that a stressed or embarrassed mind barely learns anything. So take a deep breath and get ready to learn the answers to the following questions:
- What does NoSQL actually mean?
- Why would you use any distributed database?
- How does one run a database on multiple servers?
- How does Apache Cassandra work?
So, what does NoSQL mean?
There are several ways to understand the term NoSQL, but in general it refers to all databases and data stores that do not use the relational data model. Consequently, they do not use the SQL language and offer other means to query the data. However, this space is quite large, and it covers multiple very different technologies:
- Caches: like Redis and Hazelcast
- Search tools: Lucene, Apache Solr, ElasticSearch
- Databases: Cassandra, MongoDB, etc.
Christopher explained this characterization as follows: one of the main things a database cares about is data durability, because a database should be the source of truth in your application. This means that the database comes with some technology to guarantee durability, such as a commit log.
That is not all. To pick the right tool for the job of being the database in your system, you also need to understand whether the data will grow much larger than the memory capacity of the cluster you’ll be running, whether you can tolerate downtime, what availability you’re aiming for, and so on.
Next, once you have picked a technology, you need to figure out whether you need to distribute your data store. You don’t necessarily need to do that, but at some point of growth you probably won’t want to keep throwing money at the problem by scaling up your database server and the network storage behind it.
A distributed system makes the overall architecture of the project more complex, but it makes fine-tuning the system for throughput easier: you can tweak the throughput of read queries and write queries in isolation. On top of that, a distributed system makes it possible to handle hardware failures. Disasters happen (just watch the Jurassic Park movies), and losing a datacenter to an outage is normal these days. If your system can tolerate that, you can go with a simple option, but for a globally used web application that has to be online 24/7, going distributed is almost a must.
OK, hopefully you’re convinced that sometimes we need to run a database on multiple servers. The other way to scale is vertically, by purchasing more and more powerful hardware, which can be an effective plan for a short period of growth.
How to run a database on multiple servers?
Horizontal scaling is the ability to run a piece of software on many small machines rather than a huge, powerful machine. This is harder on the application level, since your data is split across multiple locations and your code has to be more or less stateless to make use of the load balancing and be served by any of those small machines.
Existing database technologies typically run in a master/slave setup. The master is the main database, and it is asynchronously replicated to the other servers. When the master is down, a slave instance temporarily steps in to serve the data. This is great for scaling read operations: any slave server can answer queries instead of the main instance answering everything.
As soon as we enter master/slave territory, we get limited consistency. We might hit a moment when the latest writes have not yet fully propagated and a query won’t return the latest written values. At the same time, we haven’t scaled write operations at all, since they go through the master only; and the data is duplicated rather than partitioned, so its size is not scaled either. This might be OK, but let’s see how we can really scale writes.
Sharding is the common way to scale writes. You run several copies of this master/slave architecture and distribute the data among them. The application decides which shard to write to, so each shard handles only its portion of the write throughput.
This, however, limits the ability to run ad-hoc queries. Since no shard holds your entire data set, you cannot easily fetch data that spans several of them. Think of how you would organize auto-incremented primary keys: to ensure that a key is globally unique, you would have to check every shard every time you create one. So instead you stick with a local invariant inside one shard and change the data model to avoid performing join queries across several shards.
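The shard-routing rule described above can be sketched in a few lines of Python. This is a toy illustration, not how any particular database routes writes; the modulo rule and the names are made up for the example:

```python
# Minimal sketch of application-level sharding: every writer agrees on a
# deterministic rule that maps a key to one of a fixed set of shards.
NUM_SHARDS = 4

def shard_for(user_id: int) -> int:
    """Pick the shard that owns this key."""
    return user_id % NUM_SHARDS

# Writes for different keys land on different shards, spreading the load:
assert shard_for(42) == 2
assert shard_for(7) == 3
# A cross-shard query (e.g. "all users created this week") has to ask
# every shard, which is why the data model is usually reshaped to avoid it.
```

Note that the rule must never change once data is written, or existing keys would suddenly route to the wrong shard; real systems use schemes like consistent hashing to allow adding shards later.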
Now that we’re in the distributed system universe, Christopher described the CAP theorem. In short, the CAP theorem says that a distributed system can guarantee at most two of the three following properties:
- Consistency
- Availability
- Partition tolerance
Consistency here means linearizable consistency, such that all events are ordered: for every two events you can say which happened before the other. Availability means that all of the servers that are currently running will give you an answer to a query. Partition tolerance means that the system keeps working even when servers cannot communicate with each other.
The typical master/slave system we mentioned earlier satisfies consistency, since all writes go through the master and can be ordered. However, when the master is down, the slaves won’t answer queries for some time while they elect a new master, so the system is not available.
Now with all that in mind, let’s talk about a different way of organizing the database. Welcome, Apache Cassandra.
Apache Cassandra is an open source project, so if you’re using it or planning to start, do yourself a favor and join the community as well. You don’t have to become a hardcore Cassandra contributor to make things better: join the IRC channels and mailing lists, figure out where the issue tracker is, vote on the features you want to see, and submit a bug or two. Even better, you can contribute not only to the backend server technology but also to the driver you use from your programming language of choice. There are plenty of opportunities to contribute to the ecosystem and make the world a better, more convenient place.
The main difference in the Cassandra architecture is that any server in the cluster can handle reads and writes for any value. To avoid chaos, the system has to be designed very carefully: we need a conflict resolution strategy so that write operations are not lost when two servers try to update the same value at the same time.
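Cassandra resolves such conflicts with a last-write-wins strategy: each write carries a timestamp, and the value with the highest timestamp wins when replicas disagree. A toy sketch of that rule (illustrative only, not Cassandra’s actual code):

```python
# Last-write-wins register: each value is stored with the timestamp of the
# write that produced it, and reconciliation keeps the newer of the two.
def resolve(a: tuple, b: tuple) -> tuple:
    """Each argument is a (timestamp, payload) pair; the newer one wins."""
    return a if a[0] >= b[0] else b

# Two replicas saw concurrent updates to the same cell:
replica_1 = (1001, "alice@example.com")
replica_2 = (1005, "alice@new-domain.com")
assert resolve(replica_1, replica_2) == (1005, "alice@new-domain.com")
```

The tradeoff is that the "losing" write silently disappears, which is exactly the kind of occasional mistake the Dynamo-style design accepts in exchange for availability.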
A good starting point for understanding the complexities and tradeoffs of such systems is the Dynamo paper by Amazon. It describes how to design a database around the idea that an occasional mistake is better than an unavailable system. Cassandra is heavily influenced by Dynamo, but it is not an identical implementation.
Full scans are bad on a single server; on a cluster of databases they would be a disaster. Consistent hashing is a way to determine where the data lives and where the system should query it from. The Cassandra data model resembles a map of maps: the first key in this map is hashed, and the hash determines which node the data goes to.
Since every node owns a range of the hash space, you can figure out which node has the value for a particular key. Take 42, for example: it falls into the range 0–249, so we’d look on node A. Because the data is consistently hashed into integers, every server knows where to look for it. To avoid losing data, it is also replicated across several nodes; after all, you don’t put your backup on the same machine. In the same fashion, Cassandra is aware of the network topology and intelligently replicates data to nodes that are not in the same rack or AWS region.
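The token-range lookup can be sketched like this. The ranges below are chosen to match the 0–249 → node A example; a real Cassandra ring uses a 128-bit hash space, virtual nodes, and replication:

```python
from bisect import bisect_left

# Each node owns a contiguous slice of the hash space, identified here by
# the upper bound of its range: A owns 0-249, B owns 250-499, and so on.
ring = [(249, "A"), (499, "B"), (749, "C"), (999, "D")]

def node_for(token: int) -> str:
    """Find the node whose range covers this token (the hashed key)."""
    bounds = [upper for upper, _ in ring]
    return ring[bisect_left(bounds, token)][1]

assert node_for(42) == "A"    # 42 falls in 0-249, as in the example above
assert node_for(500) == "C"   # 500 falls in 500-749
```

Because every server holds the same ring metadata, any of them can route a request for any key without consulting a central coordinator.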
The replication configuration is fully tunable, and there are plenty of details to get right. In the session, Christopher went on to explain how Cassandra write and read operations work.
Additionally, he talked about data modelling and how to organize your data to get the best out of your Cassandra deployment. We also looked at CQL, the Cassandra Query Language; how to use Cassandra collection types like lists and maps; the details of the Cassandra drivers; fault tolerance; and more.
However, for the sake of brevity, we won’t replicate the session in full here; we encourage you to go and watch it firsthand!
There are a number of online resources and books that Christopher recommended for enthusiasts who want to learn about Cassandra:
- Christopher’s blog, where he writes about distributed systems, fault tolerance, network issues and everything he finds interesting.
- Planet Cassandra has many resources, success stories, descriptions of the right use cases for Cassandra and much more.
- DataStax Academy offers online courses where you can learn about distributed databases, and specifically Cassandra.
- Cassandra High Availability is one of the good books on using Cassandra to achieve greater throughput and high availability.
Additionally, if you want to learn more about Cassandra or just hear Christopher talk more, you’re in luck! He recently spoke about Cassandra at GeekOut. You can watch the recording of his session and all other GeekOut videos right here: http://2015.geekout.ee/videos/. It was a very inspiring conference, give it a shot.
Interview with Christopher Batey and final words
After the session, we interviewed Christopher for some juicy answers on Java, microservices (how could we resist?), databases, and the best way to organize database access from code. It turns out Christopher knows quite a bit about networks, latencies, and testing large distributed systems. Hopefully our interview will give you some pointers on where to look next if these topics interest you. Also, if you’re curious about the removal of Unsafe in Java 9, which is a hot topic right now, how it will affect Cassandra, and what Java engineers should do to handle it in their projects, Christopher has advice for all of us.
If you leave your email in the form below, I’d be happy to send you an occasional email about what’s happening with RebelLabs, the best pieces of content we have and what’s happening in the world of Java in general.