practicing techie

tech oriented notes to self and lessons learned

Tag Archives: rdbms

Practical NoSQL experiences with Apache Cassandra

Most of the backend systems I’ve worked with over the years have employed relational database storage in some role. Despite many application developers complaining about RDBMS performance, I’ve found that with good design and implementation a relational database can actually scale a lot further than developers think. Often software developers who don’t really understand relational databases tend to blame the database for being a performance bottleneck, even if the root cause could actually be traced to bad design and implementation.

That said, there are limits to RDBMS scalability and it can become a serious issue with massive transaction and data volumes. A common workaround is to partitioning application data based on a selected criteria (functional area and/or selected property of entities within functional area) and then distributing data across database server nodes. Such partitioning must usually be done at the expense of relaxing consistency. There are also plenty of other use cases for which relational databases in general, or the ones that are available to you, aren’t without problems.

Load-balancing and failover are sometimes difficult to achieve even on a smaller scale with relational databases, especially if you don’t have the option to license a commercial database clustering option. And even if you can, there are limits to scalability. People tend to workaround these problems with master-slave database configurations, but they can be difficult to set up and manage. This sort of configuration will also impact data consistency if master-slave replication is not synchronous, as is often the case.

When an application also requires a dynamic or open-ended data model, people usually start looking into NoSQL storage solutions.

This was the path of design reasoning for a project I’m currently working on. I’ve been using Apache Cassandra (v1.2) in a development project for a few months now. NoSQL databases come in very different forms and Cassandra is often characterized as a “column-oriented” or “wide-row” database. With the introduction of the Cassandra Query Language (CQL) Cassandra now supports declaring schema and typing for your data model. For the application developer, this feature brings the Cassandra data model somewhat closer to the relational (relations and tuples) model.

NoSQL design goals and the CAP theorem

NoSQL and relational databases have very different design goals. It’s important for application developers to understand these goals because in practice they guide and dictate the set of feasible product features.

ACID transaction guarantees provide a strong consistency model around which web applications have traditionally been designed. When building Internet-scale systems developers came to realize that strong consistency guarantees come at a cost. This was formulated in Brewer’s CAP theorem, which in its original form stated that a distributed system can only achieve two of the following properties:

  • consistency (C) equivalent to having a single up-to-date copy of the data;
  • high availability (A) of that data (for updates); and
  • tolerance to network partitions (P).

The “2 of 3” formulation was later revised somewhat by Brewer, but this realization led developers to consider using alternative consistency models, such as “Basically Available, Soft state, Eventual consistency” or BASE, in order to trade off strong consistency guarantees for availability and partition tolerance, but also scalability. Promoting availability over consistency became a key design tenet for many NoSQL databases. Other common design goals for NoSQL databases include high performance, horizontal scalability, simplicity and schema flexibility. These design goals were also shared by Cassandra founders, but it was also designed to be CAP-aware, meaning the developer is allowed to tune the tradeoff between consistency and latency.

BASE is a consistency model for distributed systems that does not require a NoSQL database. NoSQL databases that promote the BASE model also encourage applications to be designed around BASE. Designing a system that uses BASE consistency model can be challenging from technical perspective, but also because relaxing consistency guarantees will be visible to the users and requires a new way of thinking from the product owner, who traditionally are accustomed to thinking in terms of a strong consistency model.

Data access APIs and client libraries

One of the first things needed when starting to develop a Java database application is a database client library. With most RDBMS products this is straightforward: JDBC is the defacto low-level database access API, so you just download a JDBC driver for that particular database and configure your higher level data access API (e.g. JPA) to use that driver. You get to choose which higher level API to use, but there’s usually only a single JDBC driver vendor for a particular database product. Cassandra on the other hand currently has 9 different clients for Java developers. These clients provide different ways of managing data: some offer an object-relational -mapping API, some support CQL and others provide a lower level (e.g. Thrift based) APIs.

Data in Cassandra can be accessed and managed using an RPC-style (Thrift based) API, but Cassandra also has a very basic query language called CQL that resembles SQL syntactically to some extent, but in many cases the developer is required to have a much deeper knowledge of how the storage engine works below the hood than with relational databases. The Cassandra community recommended API to use for new projects using Cassandra 1.2 is CQL 3.

Since Cassandra is being actively developed, it’s important to pick a client whose development pace matches that of the server. Otherwise you won’t be able to leverage all the new server features in your application. Because Cassandra user community is still growing, it’s good to choose a client with an active user community and existing documentation. Astyanax client, developed by Netflix, currently seems to be the most widely used, production-ready and feature complete Java client for Cassandra. This client supports both Thrift and CQL 3 based APIs for accessing the data. DataStax, a company that provides commercial Cassandra offering and support, is also developing their own CQL 3 based Java driver, which recently came out of beta phase.

Differences from relational databases

Cassandra storage engine design goals are radically different from those of relational databases. These goals are inevitably reflected in the product and APIs, and IMO neither can nor should be hidden from the application developer. The CQL query language sets expectations for many developers and may make them assume they’re working with a relational database. Some important differences to take note of that may feel surprising from an RDBMS background include:

  • joins and subqueries are not supported. The database is optimized for key-oriented access and data access paths need to be designed in advance by denormalization. Apache Solr, Hive, Pig and similar solutions can be used to provide more advanced query and join functionality
  • no integrity constraints. Referential and other types of integrity constraint enforcement must be built into the application
  • no ACID transactions. Updates within a single row are atomic and isolated, but not across rows or across entities. Logical units of work may need be split and ordered differently than when using RDBMS. Applications should be designed should to be designed for eventual consistency
  • only indexed predicates can be used in query statements. An index is automatically created for row keys, indexes in other column values can be created as needed (except currently for collection typed columns). Composite indexes are not supported. Solr, Hive etc. can be used to address these limitations.
  • sort criteria needs to be designed ahead. Sort order selection is very limited. Rows are sorted by row key and columns by column name. These can’t be changed later. Solr, Hive etc. can be used to address these limitations.

Data model design is a process where developers will encounter other dissimilarities compared to the RDBMS world. For Cassandra, the recommended data modeling approach is the opposite of RDBMS: identify data access patterns, then the model data to support those access patterns. Data independence is not a primary goal and developers are expected to understand how the CQL data model maps to storage engine’s implementation data structures in order to make optimal use of Cassandra. (In practice, full data independence can be impossible to achieve with high data volume RDBMS applications as well). The database is optimized for key-oriented data access and data model must be denormalized. Some aspects of the application that can be easily modified or configured at runtime in relational databases are design time decisions with Cassandra, e.g. sorting.

A relational application data model typically stores entities of a single type per relation. The Cassandra storage engine does not require that rows in a column family contain the same set of columns. You can store data about entirely unrelated entities in a single column family (wide rows).

Row key, partition key and clustering column are data modeling concepts that are important to understand for the Cassandra application developer. The Cassandra storage engine uniquely identifies rows by row key and keys provide the primary row access path. A CQL single column primary key maps directly to a row key in the storage engine. In case of a composite primary key, the first part of the primary key is used as the row key and partition key. The remaining parts of a composite primary key are used as clustering columns.

Row key and column name, along with partitioner (and comparator) algorithm selection have important consequences for data partitioning, clustering, performance and consistency. Row key and partitioner control how data is distributed among nodes in the database cluster and ordered within a node. These parameters also determine whether range scanning and sorting is possible in the storage engine. Logical rows with the same partition key get stored as a single, physical wide row, on the same database node. Updates within a single storage engine row are atomic and isolated, but not across rows. This means that your data model design determines which updates can be performed atomically. Columns within a row are clustered and ordered by the clustering columns, which is particularly important when the data model includes wide rows (e.g. time-series data).

When troubleshooting issues with an application, it’s often very important to be able to study the data in the storage engine using ad-hoc queries. Though Cassandra does support ad-hoc CQL queries, the supported query types are more limited. Also, the database schema changes, data migration and data import typically require custom software development. On the other hand, schema evolution has traditionally been very challenging with RDBMS when large data volumes have been involved.

Cassandra supports secondary indexes, but applications are often designed to maintain separate column families that support looking up data based on a single or multiple secondary access criteria.

One of the interesting things I noticed about Cassandra was that it has really nice load-balance and failover clustering support that’s quite easy to setup. Failover works seamlessly and fast. Cassandra is also quite lightweight and effortless to set up. Data access and manipulation operation performance is extremely fast in Cassandra. The data model is schema-flexible and supports use cases for which RDMBS usually aren’t up to the task e.g. storing large amounts of time-series data with very high performance.

Conclusions

Cassandra is a highly available, Internet-scale NoSQL database with design goals that are very different from those of traditional relational databases. The differences between Cassandra and relational databases identified in this article should each be regarded as having pros and cons and be evaluated in the context of the your problem domain. Also, using NoSQL does not exclude the use of RDBMS – it’s quite common to have a hybrid architecture where each database type is used in different use cases according the their strengths.

When starting their first NoSQL project, developers are likely to enter new territory and have their first encounters with related concepts such as big data and eventual consistency. Relational databases are often associated with strong consistency, whereas NoSQL systems are associated with eventual consistency (even though the use of a certain type of database doesn’t imply a particular consistency model). When moving from the relational world and strong consistency to the NoSQL world the biggest mind shift may be in understanding and architecting an application for eventual consistency. Data modeling is another area where a new way of design thinking needs to be adopted.

Cassandra is a very interesting product with a wide range of use cases. I think it’s particularly well suited database option for the following use cases:

  • very large data volumes
  • very large user transaction volumes
  • high reliability requirements for data storage
  • dynamic data model. Data model may be semi structured and expected see significant changes over time
  • cross datacenter distribution

It is, however, very different from relational databases. In order to be able to make an informed design decision on whether to use Cassandra or not, a good way to learn more is to study the documentation carefully. Cassandra development is very fast paced, so many of the documents you may find could be outdated. There’s no substitute for hands-on experience, though, so you should do some prototyping and benchmarking as well.

Advanced PostgreSQL features

Nearly all the server-side development projects I’ve worked in over the years have stored at least part of their data in a relational database system. Even the systems using NoSQL storage have included a RDBMS in some form, whether local to a single subsystem or in a large role. In some cases the RDBMS systems have been proprietary, but increasingly they’ve been open source projects.

I’ve been using PostgreSQL in projects with RDBMS storage requirements on and off over the years. It has often impressed me with its depth of support for the SQL standard as well as wide range of non-standard extension features. With other widely used OSS RDBMS engines, I’ve often noticed that while the database claims to support feature X on paper, it only supports a subset. And then there’s a list of limitations you need to account for. Typically those limitations are something you wouldn’t expect, so they come as a surprise. Subqueries and joins are one such feature.

With PostgreSQL, I haven’t yet found a case where the database wouldn’t be able to handle a SQL standard subquery or join statement. Having an item such as comprehensive support for SQL subqueries and joins in a database product feature list may not look particularly appealing, it may even raise some suspicions. But from a developer point of view I find this “feature” a very important one, perhaps even one of PostgreSQL’s biggest selling points compared to some of it’s OSS competitors. PostgreSQL has many other advanced features that are interesting for application developers.

Common table expressions and hierarchic queries

Common table expressions or CTE is a handy standard SQL feature that allows you to split a query statement into distinct parts where results of each part will appear as a table, so you can reference the intermediate results in other parts of the statement, possibly several times. This can help make query statements more maintainable, but it also helps optimize queries in some cases, since CTE expressions are performed only once per statement execution.

In addition to allowing subquery factoring, CTE can process data as hierarchical. With small amounts of data and relatively shallow hierarchies you can implement hierarchical queries simply using joins, but this workaround may not be acceptable in all cases. A CTE hierarchical query makes it possible to process arbitrarily deep hierarchies with just one query.

Here’s an example how you can get a list of subtasks in an arbitrarily deep task tree along with path information for each task. Including path information will make it easier to build a graph representation on the receiving end.

mydb=> WITH RECURSIVE task_tree (id, name, parent_id, depth, path) AS (
mydb(>   SELECT id, name, parent_id, 1, ARRAY[t.id]
mydb(>     FROM task t WHERE t.id = 1
mydb(>   UNION ALL
mydb(>   SELECT s.id, s.name, s.parent_id, tt.depth + 1, path || s.id
mydb(>     FROM task s, task_tree tt WHERE s.parent_id = tt.id
mydb(> )
mydb-> SELECT * FROM task_tree
mydb-> ORDER BY depth ASC;

 id |    name     | parent_id | depth |    path
----+-------------+-----------+-------+-------------
  1 | task1       |           |     1 | {1}
  4 | task1-1     |         1 |     2 | {1,4}
  5 | task1-2     |         1 |     2 | {1,5}
  6 | task1-3     |         1 |     2 | {1,6}
 13 | task1-3-1   |         6 |     3 | {1,6,13}
 14 | task1-3-1-1 |        13 |     4 | {1,6,13,14}
(6 rows)

PostgreSQL also has an extension module that allows handling hierarchic data in a less verbose, but non-standard manner:

mydb=> SELECT * FROM connectby('task', 'id', 'parent_id', '1', 0, '/')
mydb->  AS t(id BIGINT, parent_id BIGINT, level INT, branch TEXT);

 id | parent_id | level |  branch
----+-----------+-------+-----------
  1 |           |     0 | 1
  4 |         1 |     1 | 1/4
  5 |         1 |     1 | 1/5
  6 |         1 |     1 | 1/6
 13 |         6 |     2 | 1/6/13
 14 |        13 |     3 | 1/6/13/14
(6 rows)

Aggregates and window functions

SQL GROUP BY lets you calculate aggregates of data over a single or multiple columns in a result set. The clause, however, can only aggregate over a single grouping, so it wouldn’t be possible e.g. to get calculate average salaries over departments and locations in a single query. Another limitation is that only aggregated data is returned and detail data is not preserved, so you can’t get both the detail records and aggregates using a single query.

Window functions make it possible to get both. Here’s how to calculate employee salary aggregates over several different groupings while preserving the detail records:

mydb=> SELECT depname, location, empno, salary,
mydb-> AVG(salary) OVER (PARTITION BY depname) avgdept,
mydb-> SUM(salary) OVER (PARTITION BY depname) sumdept,
mydb-> AVG(salary) OVER (PARTITION BY location) avgloc,
mydb-> RANK() OVER (PARTITION BY depname ORDER BY salary DESC, empno) AS pos
mydb-> FROM empsalary;

  depname  | location | empno | salary |        avgdept        | sumdept |        avgloc         | pos
-----------+----------+-------+--------+-----------------------+---------+-----------------------+-----
 develop   | fi       |     8 |   6000 | 5020.0000000000000000 |   25100 | 4550.0000000000000000 |   1
 develop   | se       |    10 |   5200 | 5020.0000000000000000 |   25100 | 4950.0000000000000000 |   2
 develop   | fi       |    11 |   5200 | 5020.0000000000000000 |   25100 | 4550.0000000000000000 |   3
 develop   | fi       |     9 |   4500 | 5020.0000000000000000 |   25100 | 4550.0000000000000000 |   4
 develop   | fi       |     7 |   4200 | 5020.0000000000000000 |   25100 | 4550.0000000000000000 |   5
 personnel | fi       |     2 |   3900 | 3700.0000000000000000 |    7400 | 4550.0000000000000000 |   1
 personnel | fi       |     5 |   3500 | 3700.0000000000000000 |    7400 | 4550.0000000000000000 |   2
 sales     | se       |     1 |   5000 | 4866.6666666666666667 |   14600 | 4950.0000000000000000 |   1
 sales     | se       |     3 |   4800 | 4866.6666666666666667 |   14600 | 4950.0000000000000000 |   2
 sales     | se       |     4 |   4800 | 4866.6666666666666667 |   14600 | 4950.0000000000000000 |   3
(10 rows)

Like CTE, window functions is a feature specified in the SQL standard, but it’s not supported by all OSS or proprietary RDBMS systems.

Pivoting data

Sometimes it’s nice to be able to pivot data in a properly normalized data model, so that repeating groups of related entities are folded into parent entity as columns. This can be useful e.g. for reporting purposes and ad-hoc queries. PostgreSQL can handle pivoting data using subqueries and arrays like this:

mydb=> SELECT e.*,
mydb-> (SELECT ARRAY_TO_STRING(ARRAY(SELECT emp_phone_num FROM emp_phone p WHERE e.employee_id = p.emp_id), ',')) AS phones
mydb->  FROM employees AS e;

 employee_id | last_name | manager_id |         phones
-------------+-----------+------------+-------------------------
         100 | King      |            |
         101 | Kochhar   |        100 | 555-123,555-234,555-345
         108 | Greenberg |        101 | 555-111
         205 | Higgins   |        101 | 555-914,555-222
         206 | Gietz     |        205 |
...
(13 rows)

Another way is to use the tablefunc extension module again:

mydb=> SELECT *
mydb-> FROM crosstab(
mydb(>   'SELECT emp_id, contact_type, contact FROM emp_sm_contact ORDER BY 1',
mydb(>   'SELECT DISTINCT contact_type FROM emp_sm_contact ORDER BY 1'
mydb(> )
mydb-> AS emp_sm_contact(emp_id BIGINT, "g+" TEXT, "linkedIn" TEXT, twitter TEXT);

 emp_id |    g+     | linkedIn  | twitter
--------+-----------+-----------+---------
    100 | bking     | b.king    | beking
    101 | kochhar.2 | kochhar.1 | kochhar
    200 |           |           | whalen
(3 rows)

Other advanced features

Other advanced PostgreSQL features that I find of interest to application developers include:

  • pattern matching. Regular expression matching is supported
  • geolocation queries. PostGIS extension adds comprehensive support for managing and querying geospatial data
  • partitioning
  • replication

Final thoughts

Problems faced in the transition phase of the software development process, when the software has been handed over from development to operations team, have prompted the need for closer collaboration between the teams in the form of devops culture. Similarly, application developers can’t remain ignorant of database design, implementation and optimization issues, and expect the DBAs to magically fix any data tier related design issues after the system has been implemented. Application developers need to learn how to leverage database systems effectively and take responsibility of database tier design to make transitions more seamless and production deployments succeed.

While standards based object relational mapping (ORM) technologies, such as Java persistence API, can be a great help to application developers, developers should be aware of what kind of queries the ORM implementation is generating, and in particular watch out for N+1 queries issues. With higher data, transaction volumes or data access patterns advanced database features will be a significant help in optimising the application.

To quote Oracle guru Tom Kyte: it’s a database, not a data dump. PostgreSQL is an advanced relational database engine and it has a lot of features that can help application developers implement new features faster and in a more efficient and scalable manner. As with all the tools: you should learn how to use it to get the most out of it.

More info