Hans' Blog

Understanding Cassandra

April 25, 2019

Cassandra is an awesome database technology, but I had difficulty understanding it. That is, until I went through the DataStax course material and came across a very important line:

In using Cassandra, you design your database based on your query.

I am no database guy. I only know the basic SQL commands I've picked up from tutorials, but I never learned anything formally. I do know that when using a SQL database, you need a good, normalized data model. So that sentence sounds out of place for a database, but at the same time it makes sense!

If you have taken computer science courses on data structures, at some point you hear the suggestion that to improve read performance, you keep the data structure sorted on insertion. That way any lookup will be fast. That's the premise of Cassandra, applied in a distributed setting. That's why you need a good partition key and clustering columns: the partition key determines which node owns the data, and the clustering columns determine the sort order within a partition for that quick lookup! Of course, Cassandra being distributed means the quick lookup has to work across nodes in different network zones, hence the need for tokens that are sorted and whose ranges can be sharded across nodes.
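To make that concrete, here is a minimal sketch using the Python cassandra-driver. The keyspace, table, and column names are hypothetical, invented just for illustration: videos are partitioned by user_id (whose token decides the owning node) and clustered by added_date in descending order, so a "latest videos for this user" read is a fast, pre-sorted lookup.

```python
from cassandra.cluster import Cluster

# Connect to a local cluster (the contact point is an assumption).
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace('demo')

# The table is designed around one query: "latest videos for a user".
# user_id is the partition key: its token decides which node owns the row.
# added_date is a clustering column: rows are stored sorted by it, so the
# read below needs no sorting at query time.
session.execute("""
    CREATE TABLE IF NOT EXISTS videos_by_user (
        user_id text,
        added_date timestamp,
        video_id uuid,
        title text,
        PRIMARY KEY ((user_id), added_date)
    ) WITH CLUSTERING ORDER BY (added_date DESC)
""")

# The lookup hits exactly one partition and reads rows already in order.
rows = session.execute(
    "SELECT video_id, title FROM videos_by_user "
    "WHERE user_id = %s LIMIT 10",
    ('hans',)
)
for row in rows:
    print(row.video_id, row.title)
```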

The other thing in Cassandra that goes against conventional database wisdom (in the SQL world) is denormalization. In Cassandra, since you design by query, there will be cases where data is duplicated across tables (i.e., not normalized). This, again, comes back to Cassandra's focus on quick, scalable inserts and reads based on those sortable tokens. Rather than joining data at query time (which is hard when your token ranges are sharded across nodes and would be slow), you just duplicate the data! A fact of life is that storage is cheap, so the duplication should cost much less than slow queries due to joins. Nonetheless, this means application developers need to think about how to keep their duplicated data consistent.
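As a sketch of what that duplication looks like in practice (again with hypothetical table names, building on the keyspace above), the same video row is written to two tables, each keyed for a different query. A logged batch, which Cassandra guarantees will eventually apply in full, is one common way to keep the copies consistent:

```python
import uuid
from datetime import datetime, timezone

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(['127.0.0.1'])   # contact point is an assumption
session = cluster.connect('demo')  # keyspace from the previous sketch

# A second copy of the same data, partitioned for a different query:
# "who posted this video?" instead of "what did this user post?".
session.execute("""
    CREATE TABLE IF NOT EXISTS videos_by_id (
        video_id uuid,
        user_id text,
        added_date timestamp,
        title text,
        PRIMARY KEY (video_id)
    )
""")

insert_by_user = session.prepare(
    "INSERT INTO videos_by_user (user_id, added_date, video_id, title) "
    "VALUES (?, ?, ?, ?)"
)
insert_by_id = session.prepare(
    "INSERT INTO videos_by_id (video_id, user_id, added_date, title) "
    "VALUES (?, ?, ?, ?)"
)

# One logical event, two physical writes. A logged batch (the default
# BatchStatement type) ensures both inserts eventually apply, keeping
# the duplicated rows consistent with each other.
video_id = uuid.uuid4()
now = datetime.now(timezone.utc)
batch = BatchStatement()
batch.add(insert_by_user, ('hans', now, video_id, 'Understanding Cassandra'))
batch.add(insert_by_id, (video_id, 'hans', now, 'Understanding Cassandra'))
session.execute(batch)
```

The design choice here is the trade the paragraph above describes: two inserts per event instead of one, in exchange for every read being a single-partition lookup with no join.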