tech oriented notes to self and lessons learned
Practical NoSQL experiences with Apache Cassandra
2013-07-13Posted by on
Most of the backend systems I’ve worked with over the years have employed relational database storage in some role. Despite many application developers complaining about RDBMS performance, I’ve found that with good design and implementation a relational database can actually scale a lot further than developers think. Often software developers who don’t really understand relational databases tend to blame the database for being a performance bottleneck, even if the root cause could actually be traced to bad design and implementation.
That said, there are limits to RDBMS scalability and it can become a serious issue with massive transaction and data volumes. A common workaround is to partitioning application data based on a selected criteria (functional area and/or selected property of entities within functional area) and then distributing data across database server nodes. Such partitioning must usually be done at the expense of relaxing consistency. There are also plenty of other use cases for which relational databases in general, or the ones that are available to you, aren’t without problems.
Load-balancing and failover are sometimes difficult to achieve even on a smaller scale with relational databases, especially if you don’t have the option to license a commercial database clustering option. And even if you can, there are limits to scalability. People tend to workaround these problems with master-slave database configurations, but they can be difficult to set up and manage. This sort of configuration will also impact data consistency if master-slave replication is not synchronous, as is often the case.
When an application also requires a dynamic or open-ended data model, people usually start looking into NoSQL storage solutions.
This was the path of design reasoning for a project I’m currently working on. I’ve been using Apache Cassandra (v1.2) in a development project for a few months now. NoSQL databases come in very different forms and Cassandra is often characterized as a “column-oriented” or “wide-row” database. With the introduction of the Cassandra Query Language (CQL) Cassandra now supports declaring schema and typing for your data model. For the application developer, this feature brings the Cassandra data model somewhat closer to the relational (relations and tuples) model.
NoSQL design goals and the CAP theorem
NoSQL and relational databases have very different design goals. It’s important for application developers to understand these goals because in practice they guide and dictate the set of feasible product features.
ACID transaction guarantees provide a strong consistency model around which web applications have traditionally been designed. When building Internet-scale systems developers came to realize that strong consistency guarantees come at a cost. This was formulated in Brewer’s CAP theorem, which in its original form stated that a distributed system can only achieve two of the following properties:
- consistency (C) equivalent to having a single up-to-date copy of the data;
- high availability (A) of that data (for updates); and
- tolerance to network partitions (P).
The “2 of 3” formulation was later revised somewhat by Brewer, but this realization led developers to consider using alternative consistency models, such as “Basically Available, Soft state, Eventual consistency” or BASE, in order to trade off strong consistency guarantees for availability and partition tolerance, but also scalability. Promoting availability over consistency became a key design tenet for many NoSQL databases. Other common design goals for NoSQL databases include high performance, horizontal scalability, simplicity and schema flexibility. These design goals were also shared by Cassandra founders, but it was also designed to be CAP-aware, meaning the developer is allowed to tune the tradeoff between consistency and latency.
BASE is a consistency model for distributed systems that does not require a NoSQL database. NoSQL databases that promote the BASE model also encourage applications to be designed around BASE. Designing a system that uses BASE consistency model can be challenging from technical perspective, but also because relaxing consistency guarantees will be visible to the users and requires a new way of thinking from the product owner, who traditionally are accustomed to thinking in terms of a strong consistency model.
Data access APIs and client libraries
One of the first things needed when starting to develop a Java database application is a database client library. With most RDBMS products this is straightforward: JDBC is the defacto low-level database access API, so you just download a JDBC driver for that particular database and configure your higher level data access API (e.g. JPA) to use that driver. You get to choose which higher level API to use, but there’s usually only a single JDBC driver vendor for a particular database product. Cassandra on the other hand currently has 9 different clients for Java developers. These clients provide different ways of managing data: some offer an object-relational -mapping API, some support CQL and others provide a lower level (e.g. Thrift based) APIs.
Data in Cassandra can be accessed and managed using an RPC-style (Thrift based) API, but Cassandra also has a very basic query language called CQL that resembles SQL syntactically to some extent, but in many cases the developer is required to have a much deeper knowledge of how the storage engine works below the hood than with relational databases. The Cassandra community recommended API to use for new projects using Cassandra 1.2 is CQL 3.
Since Cassandra is being actively developed, it’s important to pick a client whose development pace matches that of the server. Otherwise you won’t be able to leverage all the new server features in your application. Because Cassandra user community is still growing, it’s good to choose a client with an active user community and existing documentation. Astyanax client, developed by Netflix, currently seems to be the most widely used, production-ready and feature complete Java client for Cassandra. This client supports both Thrift and CQL 3 based APIs for accessing the data. DataStax, a company that provides commercial Cassandra offering and support, is also developing their own CQL 3 based Java driver, which recently came out of beta phase.
Differences from relational databases
Cassandra storage engine design goals are radically different from those of relational databases. These goals are inevitably reflected in the product and APIs, and IMO neither can nor should be hidden from the application developer. The CQL query language sets expectations for many developers and may make them assume they’re working with a relational database. Some important differences to take note of that may feel surprising from an RDBMS background include:
- joins and subqueries are not supported. The database is optimized for key-oriented access and data access paths need to be designed in advance by denormalization. Apache Solr, Hive, Pig and similar solutions can be used to provide more advanced query and join functionality
- no integrity constraints. Referential and other types of integrity constraint enforcement must be built into the application
- no ACID transactions. Updates within a single row are atomic and isolated, but not across rows or across entities. Logical units of work may need be split and ordered differently than when using RDBMS. Applications should be designed should to be designed for eventual consistency
- only indexed predicates can be used in query statements. An index is automatically created for row keys, indexes in other column values can be created as needed (except currently for collection typed columns). Composite indexes are not supported. Solr, Hive etc. can be used to address these limitations.
- sort criteria needs to be designed ahead. Sort order selection is very limited. Rows are sorted by row key and columns by column name. These can’t be changed later. Solr, Hive etc. can be used to address these limitations.
Data model design is a process where developers will encounter other dissimilarities compared to the RDBMS world. For Cassandra, the recommended data modeling approach is the opposite of RDBMS: identify data access patterns, then the model data to support those access patterns. Data independence is not a primary goal and developers are expected to understand how the CQL data model maps to storage engine’s implementation data structures in order to make optimal use of Cassandra. (In practice, full data independence can be impossible to achieve with high data volume RDBMS applications as well). The database is optimized for key-oriented data access and data model must be denormalized. Some aspects of the application that can be easily modified or configured at runtime in relational databases are design time decisions with Cassandra, e.g. sorting.
A relational application data model typically stores entities of a single type per relation. The Cassandra storage engine does not require that rows in a column family contain the same set of columns. You can store data about entirely unrelated entities in a single column family (wide rows).
Row key, partition key and clustering column are data modeling concepts that are important to understand for the Cassandra application developer. The Cassandra storage engine uniquely identifies rows by row key and keys provide the primary row access path. A CQL single column primary key maps directly to a row key in the storage engine. In case of a composite primary key, the first part of the primary key is used as the row key and partition key. The remaining parts of a composite primary key are used as clustering columns.
Row key and column name, along with partitioner (and comparator) algorithm selection have important consequences for data partitioning, clustering, performance and consistency. Row key and partitioner control how data is distributed among nodes in the database cluster and ordered within a node. These parameters also determine whether range scanning and sorting is possible in the storage engine. Logical rows with the same partition key get stored as a single, physical wide row, on the same database node. Updates within a single storage engine row are atomic and isolated, but not across rows. This means that your data model design determines which updates can be performed atomically. Columns within a row are clustered and ordered by the clustering columns, which is particularly important when the data model includes wide rows (e.g. time-series data).
When troubleshooting issues with an application, it’s often very important to be able to study the data in the storage engine using ad-hoc queries. Though Cassandra does support ad-hoc CQL queries, the supported query types are more limited. Also, the database schema changes, data migration and data import typically require custom software development. On the other hand, schema evolution has traditionally been very challenging with RDBMS when large data volumes have been involved.
Cassandra supports secondary indexes, but applications are often designed to maintain separate column families that support looking up data based on a single or multiple secondary access criteria.
One of the interesting things I noticed about Cassandra was that it has really nice load-balance and failover clustering support that’s quite easy to setup. Failover works seamlessly and fast. Cassandra is also quite lightweight and effortless to set up. Data access and manipulation operation performance is extremely fast in Cassandra. The data model is schema-flexible and supports use cases for which RDMBS usually aren’t up to the task e.g. storing large amounts of time-series data with very high performance.
Cassandra is a highly available, Internet-scale NoSQL database with design goals that are very different from those of traditional relational databases. The differences between Cassandra and relational databases identified in this article should each be regarded as having pros and cons and be evaluated in the context of the your problem domain. Also, using NoSQL does not exclude the use of RDBMS – it’s quite common to have a hybrid architecture where each database type is used in different use cases according the their strengths.
When starting their first NoSQL project, developers are likely to enter new territory and have their first encounters with related concepts such as big data and eventual consistency. Relational databases are often associated with strong consistency, whereas NoSQL systems are associated with eventual consistency (even though the use of a certain type of database doesn’t imply a particular consistency model). When moving from the relational world and strong consistency to the NoSQL world the biggest mind shift may be in understanding and architecting an application for eventual consistency. Data modeling is another area where a new way of design thinking needs to be adopted.
Cassandra is a very interesting product with a wide range of use cases. I think it’s particularly well suited database option for the following use cases:
- very large data volumes
- very large user transaction volumes
- high reliability requirements for data storage
- dynamic data model. Data model may be semi structured and expected see significant changes over time
- cross datacenter distribution
It is, however, very different from relational databases. In order to be able to make an informed design decision on whether to use Cassandra or not, a good way to learn more is to study the documentation carefully. Cassandra development is very fast paced, so many of the documents you may find could be outdated. There’s no substitute for hands-on experience, though, so you should do some prototyping and benchmarking as well.