Tech-oriented notes to self and lessons learned
Selecting a platform for your next application development project can be a complex and burdensome undertaking. It can also be very intriguing and a lot of fun. There’s a wide range of approaches to take. At one end of the spectrum, The Architect will attend conferences, purchase and study analyst reports from established technology research companies such as Gartner, and base his evaluation on analyst views. Another approach is to set up a cross-disciplinary evaluation committee that collects a wishlist of platform requirements from around the organization and makes its decision based on a consensus vote. The first approach is very autocratic, while the second can sometimes lead to a lack of focus. A clear, coherent vision of requirements and prioritization is essential for the success of the evaluation. Due to these problems, a middle road, a more pragmatic approach, is becoming increasingly popular: a tightly-knit group of senior propellerheads uses a more empirical method of analysing requirements, studies and experiments with potential solution stack elements, and brainstorms to produce a short list of candidates to be validated using hands-on architecture exercises and smell tests. Though hands-on experimentation can lead to better results, the cost of this method can be prohibitive, so often only a handful of solutions that pass the first-phase screening can be evaluated this way.
Platform evaluation criteria depend on the project requirements and may include:
Performance and scalability are often high-priority concerns. They are also among the platform properties that can be formulated into quantifiable criteria, though the key challenge is how to model user behavior and implement performance tests that accurately reflect your expected workloads. Benchmarking several different platforms multiplies that cost.
A company called TechEmpower has started a project called TechEmpower Framework Benchmarks, or TFB for short, that aims to compare the performance of different web frameworks. The project publishes benchmark results that application developers can use to make more informed decisions when selecting frameworks. What’s particularly interesting about FrameworkBenchmarks is that it’s a collaborative effort conducted in the open. Development-related discussions take place in an online forum, and the source code repository is publicly available on GitHub. Doing test implementation development in the open is important for enabling peer review, and it allows implementations to evolve and improve over time. The project implements performance tests for a wide variety of frameworks, and chances are that the ones you’re planning to use are included. If not, you can create your own tests and submit them for inclusion in the project code base. You can also take the tests and run the benchmarks on your own hardware.
Openly published test implementations are not only useful for producing benchmark data; framework developers can also use them to communicate performance-related best practices to application developers. They also give framework developers reproducible performance benchmarking feedback and data for optimization purposes.
It’s interesting to note that the test implementations have been designed and built by different groups and individuals, and some may have been more rigorously optimized than others. The benchmarks measure the performance of the test implementation as much as that of the framework, and in some cases a suboptimal test implementation will result in poor overall performance. Framework torchbearers are expected to take their best shot at optimizing the test implementations, so given enough active framework pundits, the implementations should eventually converge to optimal solutions.
In the project’s parlance, the combination of programming language, framework and database used is termed a “framework permutation”, or just a permutation, and some test types have been implemented in 100+ different permutations. The different test types include:
Currently, the latest benchmark is Round 9, and the result data is published on the project web page. The data is not available in machine-readable form, and it can’t be sorted by column for analysing patterns. It can, however, be imported into a spreadsheet program fairly easily, so I took the data and analyzed it a bit. Some interesting observations can be made just by looking at the raw data. In addition to comparing throughput, it’s also interesting to compare how well frameworks scale. One way of quantifying scalability is to take each test implementation’s throughput figures at the lowest and highest concurrency level (for test types 1, 2, 4 and 6) and plot them on a 2-D plane. A line can then be drawn between these two points, with its slope characterizing scalability. Well-scaling test implementations would be expected to have a steep, positive slope for test types 1, 2, 4 and 6. For test types 3 and 5, where the varied parameter is the number of database queries per request rather than the concurrency level, the slope is expected to be negative, since each request does more work as the query count grows.
The model has a weakness: the scalability rating is not relative to throughput, so a poorly performing framework can still end up with a great scalability rating. The two figures should therefore be read together.
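The two-point scalability rating described above is simple arithmetic: the slope between the first and last measurement. A minimal sketch follows; the class name is hypothetical, and the throughput figures in `main` are made up for illustration, not taken from the Round 9 data.

```java
import java.util.Map;
import java.util.TreeMap;

// Two-point "scalability rating": slope of the line between the throughput
// measured at the lowest and the highest value of the varied parameter.
final class ScalabilityRating {

    private ScalabilityRating() {}

    /** Keys: concurrency level (or queries per request); values: requests per second. */
    static double slope(TreeMap<Integer, Double> throughputByLevel) {
        if (throughputByLevel.size() < 2) {
            throw new IllegalArgumentException("need at least two data points");
        }
        Map.Entry<Integer, Double> low = throughputByLevel.firstEntry();
        Map.Entry<Integer, Double> high = throughputByLevel.lastEntry();
        // Intermediate points are ignored: only the end points define the line.
        return (high.getValue() - low.getValue()) / (high.getKey() - low.getKey());
    }

    public static void main(String[] args) {
        // Hypothetical figures for illustration only; not actual benchmark results.
        TreeMap<Integer, Double> json = new TreeMap<>();
        json.put(8, 100_000.0);
        json.put(64, 240_000.0);
        json.put(256, 348_000.0);
        System.out.println(slope(json)); // 1000.0
    }
}
```

Because only the end points are used, a framework that peaks at a mid-range concurrency level and then degrades can still get a flattering rating, which is another reason to read the slope together with the absolute throughput.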
To better visualize throughput against concurrency level (“Peak Hosting” environment data), I created a small web app that’s available at http://tfb-kippo.rhcloud.com/ (the app is subject to removal without notice).
The JSON serialization test aims to measure framework overhead. One could argue that it’s a bit of a micro benchmark, but it should demonstrate how well the framework does with basic tasks like request routing, JSON serialization and response generation.
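To show how little application code the JSON test involves, here is a rough, dependency-free sketch using the JDK’s built-in `com.sun.net.httpserver` server. It assumes, as I understand the TFB requirements, that the expected body is `{"message":"Hello, World!"}`. The class name and the hand-written encoder are hypothetical; a real test implementation would run inside the framework under test and use a proper JSON library such as Jackson.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

final class JsonHelloServer {

    private JsonHelloServer() {}

    /** Serializes a one-field object; escaping quotes and backslashes suffices for this payload. */
    static String toJson(String key, String value) {
        return "{\"" + escape(key) + "\":\"" + escape(value) + "\"}";
    }

    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Starts a server with a single /json route; port 0 picks any free port. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/json", exchange -> {
            // Serialize per request, since measuring serialization is the point of the test.
            byte[] body = toJson("message", "Hello, World!").getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start(8080);
        System.out.println("Listening on port " + server.getAddress().getPort());
    }
}
```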
The top 10 frameworks were based on the following programming languages: C++, Java, Lua, Ur and Go. C++-based CPPSP was the clear winner, while the next six contestants were Java-based. No database is used in this test type.
The top 7 frameworks with the highest throughput also have the highest scalability rating. After that, both figures start declining fairly rapidly. This is a very simple test, and it’s a bit of a surprise to see such large variation in the results. In its commentary, TechEmpower attributes some of the differences to how well frameworks work on a NUMA-based system architecture.
Many frameworks are Java- or JVM-based, and rather large variations exist even within this group, so clearly neither the language nor the JVM is the limiting factor here.
I was surprised by the Node.js and HHVM rankings. Unfortunately, the Scala-based Spray test implementation, as well as the implementation for the JVM-based polyglot framework Vert.x, were removed for being outdated. I hope to see them included in a future benchmark round.
This test type measures database access throughput and parallelizability. Again, a surprisingly large spread in performance can be observed for a fairly trivial test case. This would seem to suggest that framework or database access method overhead contributes significantly to the results. Is the database access technology (DB driver or ORM) a bottleneck? Or is the backend system? It would be interesting to look at the system activity reports from the test runs to analyze potential bottlenecks in more detail.
Before seeing the results, I would’ve expected the DB backend to be the bottleneck, but this doesn’t appear to be clear-cut, since the top as well as many of the bottom performing test implementations use the same DB. Interestingly, the top six test implementations use a relational database, with the first NoSQL-based implementation taking 7th place. This test runs DB reads by ID, which NoSQL databases should be very good at.
The top 10 frameworks were based on Java, C++, Lua and PHP and used MySQL, PostgreSQL and MongoDB databases. Java-based Gemini leads, with CPPSP second. Both use MySQL. The Spring-based test implementation’s performance was a bit of a disappointment.
Where the previous test executed a single database query per request, this test does a variable number of database queries per request. Again, I would’ve assumed this test would measure backend database performance more than framework performance, but it seems that framework and database access method overhead can also contribute significantly.
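The per-request logic that both database query tests share is small. The sketch below assumes my reading of the TFB requirements: a World table of 10,000 rows, random IDs from 1 to 10,000, and a `queries` parameter that falls back to 1 when missing or invalid and is clamped to the range 1..500. The class name and method names are hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;

final class QueryParams {
    static final int MIN_QUERIES = 1;
    static final int MAX_QUERIES = 500;
    static final int WORLD_ROWS = 10_000; // assumed size of the World table

    private QueryParams() {}

    /** Missing or non-numeric values mean one query; numbers are clamped to 1..500. */
    static int parseQueries(String raw) {
        if (raw == null) {
            return MIN_QUERIES;
        }
        try {
            int n = Integer.parseInt(raw.trim());
            return Math.max(MIN_QUERIES, Math.min(MAX_QUERIES, n));
        } catch (NumberFormatException e) {
            return MIN_QUERIES;
        }
    }

    /** Random row IDs in 1..WORLD_ROWS, one per query to execute. */
    static int[] randomIds(int count) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        int[] ids = new int[count];
        for (int i = 0; i < count; i++) {
            ids[i] = rnd.nextInt(1, WORLD_ROWS + 1); // upper bound is exclusive
        }
        return ids;
    }
}
```

Each ID then becomes one database read, which is why per-query overhead in the driver or ORM is multiplied by the query count in this test.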
The top two performers in this test are Dart based implementations that use MongoDB.
The top 10 frameworks in this test are based on Dart, Java, Clojure, PHP and C#, and they use MongoDB and MySQL databases.
This is the most complex test that aims to exercise the full framework stack from request routing through business logic execution, database access, templating and response generation.
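The non-database part of this test is mostly list handling and templating. As I understand the TFB requirements, the implementation fetches all fortunes, adds one more at request time, sorts them by message and renders them into an HTML table with proper escaping. The sketch below shows that step with an in-memory list standing in for the query result; the class names and the markup are hypothetical, and a real implementation would use the framework’s own template engine.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class FortunesSketch {

    static final class Fortune {
        final int id;
        final String message;

        Fortune(int id, String message) {
            this.id = id;
            this.message = message;
        }
    }

    private FortunesSketch() {}

    /** Escapes the characters that are unsafe in HTML text content. */
    static String escapeHtml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '&': sb.append("&amp;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Adds the request-time fortune, sorts by message and renders an HTML table. */
    static String render(List<Fortune> fromDb) {
        List<Fortune> fortunes = new ArrayList<>(fromDb);
        fortunes.add(new Fortune(0, "Additional fortune added at request time."));
        fortunes.sort(Comparator.comparing((Fortune f) -> f.message));
        StringBuilder html = new StringBuilder(
                "<!DOCTYPE html><html><head><title>Fortunes</title></head><body>"
                + "<table><tr><th>id</th><th>message</th></tr>");
        for (Fortune f : fortunes) {
            html.append("<tr><td>").append(f.id).append("</td><td>")
                .append(escapeHtml(f.message)).append("</td></tr>");
        }
        return html.append("</table></body></html>").toString();
    }
}
```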
The top 10 frameworks are based on C++, Java, Ur, Scala and PHP, and use the full spectrum of databases (MySQL, PostgreSQL and MongoDB).
In addition to reads this test exercises database updates as well.
HHVM wins this test, with three Node.js-based frameworks coming next. As in the single database query test, the top 13 implementations all use the relational MySQL database before the first NoSQL implementation appears. This test exercises simple reads and writes by ID which, again, should be one of the strong points of NoSQL databases.
The aim of this test is to measure how well the framework performs under extreme load conditions and massive client parallelism. Since there are no backend system dependencies involved, this test measures platform and framework concurrency limits. For the top-performing frameworks, throughput plateaus or starts degrading before the client concurrency level reaches its maximum value, which suggests that a bottleneck is being hit somewhere in the test setup, presumably in the hardware, the OS and/or framework concurrency.
Many frameworks are at their best at a concurrency level of 256, except CPPSP, which peaks at 1,024. CPPSP is the only top-performing implementation that significantly improves its performance as the concurrency level increases beyond 256, but even CPPSP’s throughput starts dropping after the concurrency level hits the 4,096 mark. Only 12 test implementations are able to exceed 1 million requests per second. Some well-known platforms, e.g. Spring, did surprisingly poorly.
There seems to be something seriously wrong with the HHVM test run, as it generates only tens of responses per second at concurrency levels 256 and 1,024.
The top 10 frameworks are based on C++, Java, Scala and Lua. No database is used in this test.
In science, research must be repeatable in order to be credible. Similarly, the benchmark test methodology and relevant circumstances should be documented to make the results repeatable and credible. There are a few details that could be documented to improve repeatability.
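The updates test adds a write to every read. The sketch below models that read-modify-write cycle with a hypothetical in-memory map standing in for the World table; it assumes, as I read the TFB requirements, that each randomly chosen row is read, given a new random number in 1..10,000, and written back. In a real implementation every `get` and `put` would be a database round trip, which is where drivers and databases differ.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

final class UpdatesSketch {
    static final int WORLD_ROWS = 10_000;

    // Hypothetical in-memory stand-in for the World table: id -> randomNumber.
    final ConcurrentHashMap<Integer, Integer> world = new ConcurrentHashMap<>();

    UpdatesSketch() {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int id = 1; id <= WORLD_ROWS; id++) {
            world.put(id, rnd.nextInt(1, WORLD_ROWS + 1));
        }
    }

    /** Reads count random rows, assigns each a new randomNumber and writes it back. */
    int[][] update(int count) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        int[][] updated = new int[count][];
        for (int i = 0; i < count; i++) {
            int id = rnd.nextInt(1, WORLD_ROWS + 1);
            world.get(id); // the read step; its value is overwritten below
            int newValue = rnd.nextInt(1, WORLD_ROWS + 1);
            world.put(id, newValue); // the write step
            updated[i] = new int[] {id, newValue};
        }
        return updated;
    }
}
```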
The benchmarking project source code doesn’t seem to be tagged. Tagging would be essential for making benchmarks repeatable.
A short description of the hardware and some other test environment parameters is available on the benchmark project web site. However, the environment setup (hardware and software) is expected to change over time, so this information should be documented per round. Also, neither the Linux distribution minor release nor the exact Linux kernel version appears to be identified.
Detailed data about what goes on inside the servers could be published, so that outsiders could analyze the benchmark results in a more meaningful way. System activity reports, e.g. system resource usage (CPU, memory, I/O), can provide valuable clues to possible scalability issues. Application, framework, database and other logs can also be useful to test implementers.
Resin was chosen as the Java application server over Apache Tomcat and other servlet containers for performance reasons. I’m not contesting this claim, but no software versions were mentioned, and since performance characteristics tend to change between releases, the premise is not repeatable.
Neither the exact JVM version nor the JVM arguments are documented for JVM based test implementation execution. Default JVM arguments are used if test implementations don’t override the settings. Since the test implementations have very similar execution profiles by definition, it could be beneficial to explicitly configure and share some JVM flags that are commonly used with server-side applications. Also, due to JVM ergonomics different GC parameters can be automatically selected based on underlying server capacity and JVM version. Documenting these parameters per benchmark round would help with repeatability. Perhaps all the middleware software versions could be logged during test execution and the full test run logs could be made available.
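Logging this information at test start-up is cheap. The sketch below uses only the standard `java.lang.management` API to capture the JVM version, the JVM arguments actually in effect, the processor count and maximum heap that ergonomics works from, and the garbage collectors that were selected. The class name and output format are hypothetical; how TFB would collect such output is an open question.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;
import java.util.ArrayList;
import java.util.List;

final class JvmReport {

    private JvmReport() {}

    /** Collects the JVM facts that affect repeatability into key=value lines. */
    static List<String> describe() {
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        List<String> lines = new ArrayList<>();
        lines.add("java.version=" + System.getProperty("java.version"));
        lines.add("java.vm.name=" + System.getProperty("java.vm.name"));
        lines.add("jvm.args=" + runtime.getInputArguments()); // explicit flags only
        lines.add("available.processors=" + Runtime.getRuntime().availableProcessors());
        lines.add("max.heap.bytes=" + Runtime.getRuntime().maxMemory());
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add("gc=" + gc.getName()); // reveals the collector ergonomics picked
        }
        return lines;
    }

    public static void main(String[] args) {
        describe().forEach(System.out::println);
    }
}
```

`getInputArguments()` lists only the flags passed on the command line; on HotSpot, `java -XX:+PrintFlagsFinal -version` dumps the final values of all flags, including the ergonomically chosen ones.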
Since I’ve recently worked on implementing RESTful services based on the JAX-RS 2 API with asynchronous processing (using the Jersey 2 implementation) and the Apache Cassandra NoSQL database, I got curious about how this combination would perform against the competition, so I started coding my own test implementation. In this case, however, I decided to drop JAX-RS to eliminate any non-essential abstraction layers that might have a negative impact on performance.
One of the biggest hurdles in getting started with test development was that, at the time I started my project, there wasn’t a way to test-run platform installation scripts in smaller pieces; you had to run the full installation, which took a very long time. Fortunately, the framework installation procedure has since been compartmentalized, so it’s possible to install just the framework you’re developing tests for. The project has also recently added support for fully automated development environment setup with Vagrant, which is a great help. Another excellent addition is Travis CI integration, which gives test implementation developers additional assurance that their code also works as expected outside their sandbox. Unfortunately, Travis builds can take a very long time, so you might need to disable some of the tests that you’re not actively working on. The Travis CI environment is also a bit different from the development and actual benchmarking environments, so you could bump into issues with Travis builds that don’t occur in the development environment, and vice versa. Travis build failures can sometimes be very obscure and tricky to troubleshoot.
The actual test implementation code is easy enough to develop and test in isolation, outside of the real benchmark environment, but if you’re adding support for new platform components such as databases, or testing platform installation scripts, it’s easiest if you have an environment that closely replicates the actual benchmarking environment. In this case, adding support for a new database involved creating a new DB schema, generating test data and automating database installation and configuration.
Implementing the actual test permutation turned out to be interesting, but also surprisingly laborious. I started seeing strange error responses occasionally when benchmarking my test implementation with ab and wrk, especially under higher loads. TFB executes Java-based test implementations in the Resin web container, and after puzzling over the errors for a while, I decided to test the code in other web containers, namely Tomcat and Jetty. It turned out that I had bumped into one Resin bug (5776) and two Tomcat bugs (56736, 56739) related to servlet asynchronous processing support.
Architecturally, test types 1 and 6 have been implemented using the traditional synchronous Servlet API, while the rest of the test implementations leverage non-blocking request handling through Servlet 3 asynchronous processing support. The test implementations store their data in the Apache Cassandra 2 NoSQL database, which is accessed using the DataStax Java Driver. Asynchronous processing is also used in the data access tier in order to minimize resource consumption. JSON data is processed with the Jackson JSON library. In Java versions prior to 8, asynchronous processing requires passing around callbacks in the form of anonymous classes, which can at times be syntactically high-ceremony. Java 8 lambda expressions do away with some of that ceremony, but unfortunately TFB doesn’t yet fully support the latest Java version. I’ve previously used the JAX-RS 2 asynchronous processing API, but not the Servlet 3 async API. One thing I noticed while writing the test implementation was that the mechanism the Servlet 3 async API provides for generating an error response to the client is much lower level, less intuitive and more cumbersome than its JAX-RS async counterpart.
The test implementation code was merged into the FrameworkBenchmarks code base, so it should be benchmarked in the next round. The code can be found here:
TechEmpower’s Framework Benchmarks is a really valuable contribution to the web framework developer and user community. It holds great potential for enabling friendly competition between framework developers as well as between framework users, thus driving up the performance of popular frameworks and the adoption of framework performance best practices. As always, there’s room for improvement. From a framework user and test implementer point of view, some areas include: making the benchmark tests and results more repeatable, publishing raw benchmark data for analysis purposes, and making test development and the addition of new framework components even easier.
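The difference in ceremony is easiest to see side by side. The sketch below uses a hypothetical two-method `Callback` interface and a hypothetical `findWorldAsync` lookup; it is not the DataStax driver’s or the Servlet API’s actual interface, just something of similar shape. The first variant writes a new anonymous class at every call site; the second writes the adapter once and then passes method references.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Consumer;

final class CallbackStyles {

    /** Hypothetical two-method callback, similar in shape to those used by async drivers. */
    interface Callback<T> {
        void onSuccess(T result);
        void onFailure(Throwable error);
    }

    private CallbackStyles() {}

    /** Hypothetical non-blocking lookup that reports its outcome through a callback. */
    static void findWorldAsync(int id, Executor executor, Callback<String> callback) {
        executor.execute(() -> {
            if (id < 1) {
                callback.onFailure(new IllegalArgumentException("id must be positive: " + id));
            } else {
                callback.onSuccess("{\"id\":" + id + "}");
            }
        });
    }

    /** Pre-lambda style: a new anonymous inner class at every asynchronous call site. */
    static CompletableFuture<String> anonymousClassStyle(int id, Executor executor) {
        final CompletableFuture<String> result = new CompletableFuture<>();
        findWorldAsync(id, executor, new Callback<String>() {
            @Override
            public void onSuccess(String json) {
                result.complete(json);
            }

            @Override
            public void onFailure(Throwable error) {
                result.completeExceptionally(error);
            }
        });
        return result;
    }

    /** Adapter written once: turns two functions into the two-method callback. */
    static <T> Callback<T> callback(Consumer<? super T> ok, Consumer<? super Throwable> fail) {
        return new Callback<T>() {
            @Override
            public void onSuccess(T result) {
                ok.accept(result);
            }

            @Override
            public void onFailure(Throwable error) {
                fail.accept(error);
            }
        };
    }

    /** Java 8 style: each call site shrinks to a single line of method references. */
    static CompletableFuture<String> lambdaStyle(int id, Executor executor) {
        CompletableFuture<String> result = new CompletableFuture<>();
        findWorldAsync(id, executor, callback(result::complete, result::completeExceptionally));
        return result;
    }
}
```

A direct executor such as `Runnable::run` makes the example deterministic; in a server the callbacks would run on the driver’s I/O threads.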
Good job TFB team + contributors – can’t wait to see Round 10 benchmark data!
Most of the backend systems I’ve worked with over the years have employed relational database storage in some role. Despite many application developers complaining about RDBMS performance, I’ve found that with good design and implementation a relational database can actually scale a lot further than developers think. Often software developers who don’t really understand relational databases tend to blame the database for being a performance bottleneck, even if the root cause could actually be traced to bad design and implementation.
That said, there are limits to RDBMS scalability, and it can become a serious issue with massive transaction and data volumes. A common workaround is to partition application data based on selected criteria (functional area and/or a selected property of entities within a functional area) and then distribute the data across database server nodes. Such partitioning usually comes at the expense of relaxed consistency. There are also plenty of other use cases for which relational databases in general, or the ones that are available to you, aren’t without problems.
Load balancing and failover are sometimes difficult to achieve even on a smaller scale with relational databases, especially if you don’t have the option to license a commercial database clustering option. And even if you can, there are limits to scalability. People tend to work around these problems with master-slave database configurations, but they can be difficult to set up and manage. This sort of configuration will also impact data consistency if master-slave replication is not synchronous, as is often the case.
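The consistency impact of asynchronous replication can be shown with a toy model. In the single-threaded sketch below (hypothetical class, no real database involved), writes go to the master and are queued for the replica, so a read from the replica returns stale data until the queued changes have been applied.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

final class AsyncReplicationSketch {
    private final Map<String, String> master = new HashMap<>();
    private final Map<String, String> replica = new HashMap<>();
    private final Queue<String[]> replicationLog = new ArrayDeque<>();

    /** The write is acknowledged as soon as the master has it; the replica gets it later. */
    void write(String key, String value) {
        master.put(key, value);
        replicationLog.add(new String[] {key, value});
    }

    String readFromMaster(String key) {
        return master.get(key);
    }

    /** May return a stale (or missing) value while changes are still in the log. */
    String readFromReplica(String key) {
        return replica.get(key);
    }

    /** Applies all pending changes, bringing the replica up to date. */
    void replicate() {
        String[] change;
        while ((change = replicationLog.poll()) != null) {
            replica.put(change[0], change[1]);
        }
    }
}
```

Synchronous replication would apply the change to the replica before acknowledging the write, trading write latency for consistent reads.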
When an application also requires a dynamic or open-ended data model, people usually start looking into NoSQL storage solutions.
This was the path of design reasoning for a project I’m currently working on. I’ve been using Apache Cassandra (v1.2) in a development project for a few months now. NoSQL databases come in very different forms and Cassandra is often characterized as a “column-oriented” or “wide-row” database. With the introduction of the Cassandra Query Language (CQL) Cassandra now supports declaring schema and typing for your data model. For the application developer, this feature brings the Cassandra data model somewhat closer to the relational (relations and tuples) model.
NoSQL and relational databases have very different design goals. It’s important for application developers to understand these goals because in practice they guide and dictate the set of feasible product features.
ACID transaction guarantees provide a strong consistency model around which web applications have traditionally been designed. When building Internet-scale systems, developers came to realize that strong consistency guarantees come at a cost. This was formulated in Brewer’s CAP theorem, which in its original form stated that a distributed system can only achieve two of the following three properties: consistency, availability and partition tolerance.
The “2 of 3” formulation was later revised somewhat by Brewer, but this realization led developers to consider alternative consistency models, such as “Basically Available, Soft state, Eventual consistency” (BASE), in order to trade strong consistency guarantees for availability and partition tolerance, and also for scalability. Promoting availability over consistency became a key design tenet for many NoSQL databases. Other common design goals for NoSQL databases include high performance, horizontal scalability, simplicity and schema flexibility. Cassandra’s founders shared these design goals, but Cassandra was also designed to be CAP-aware, meaning the developer can tune the tradeoff between consistency and latency.
BASE is a consistency model for distributed systems and does not require a NoSQL database. NoSQL databases that promote the BASE model also encourage applications to be designed around BASE. Designing a system around the BASE consistency model can be challenging from a technical perspective, but also because relaxed consistency guarantees will be visible to users and require a new way of thinking from product owners, who are traditionally accustomed to thinking in terms of a strong consistency model.
One of the first things needed when starting to develop a Java database application is a database client library. With most RDBMS products this is straightforward: JDBC is the de facto low-level database access API, so you just download a JDBC driver for that particular database and configure your higher-level data access API (e.g. JPA) to use that driver. You get to choose which higher-level API to use, but there’s usually only a single JDBC driver vendor for a particular database product. Cassandra, on the other hand, currently has 9 different clients for Java developers. These clients provide different ways of managing data: some offer an object-relational mapping API, some support CQL and others provide lower-level (e.g. Thrift-based) APIs.
Data in Cassandra can be accessed and managed using an RPC-style (Thrift-based) API, but Cassandra also has a very basic query language called CQL that syntactically resembles SQL to some extent. In many cases, however, CQL requires the developer to have a much deeper knowledge of how the storage engine works under the hood than relational databases do. The API recommended by the Cassandra community for new projects on Cassandra 1.2 is CQL 3.
Since Cassandra is being actively developed, it’s important to pick a client whose development pace matches that of the server; otherwise you won’t be able to leverage all the new server features in your application. Because the Cassandra user community is still growing, it’s good to choose a client with an active user community and existing documentation. The Astyanax client, developed by Netflix, currently seems to be the most widely used, production-ready and feature-complete Java client for Cassandra. It supports both Thrift and CQL 3 based APIs for accessing the data. DataStax, a company that provides a commercial Cassandra offering and support, is also developing its own CQL 3 based Java driver, which recently came out of beta.
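Cassandra’s tunable consistency can be reasoned about with a common rule of thumb for replicated stores: with replication factor N, a read that waits for R replicas is guaranteed to overlap the latest acknowledged write that waited for W replicas when R + W > N. A quorum is a majority, floor(N/2) + 1. The helper below is a hypothetical sketch of that arithmetic only; it ignores failure scenarios and concurrent writes, and waiting for more replicas means higher latency.

```java
final class QuorumMath {

    private QuorumMath() {}

    /** Majority of replicas, e.g. 2 of 3 or 3 of 5. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True when every read replica set must overlap every write replica set. */
    static boolean readsSeeLatestWrite(int replicationFactor, int readReplicas, int writeReplicas) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int rf = 3;
        System.out.println(readsSeeLatestWrite(rf, quorum(rf), quorum(rf))); // true: 2 + 2 > 3
        System.out.println(readsSeeLatestWrite(rf, 1, 1));                  // false: 1 + 1 <= 3
    }
}
```

Reading and writing at one replica gives the lowest latency but only eventual consistency; quorum reads and writes give overlap at the cost of waiting for a majority.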
Cassandra storage engine design goals are radically different from those of relational databases. These goals are inevitably reflected in the product and APIs, and IMO neither can nor should be hidden from the application developer. The CQL query language sets expectations for many developers and may make them assume they’re working with a relational database. Some important differences to take note of that may feel surprising from an RDBMS background include:
Data model design is a process where developers will encounter other dissimilarities compared to the RDBMS world. For Cassandra, the recommended data modeling approach is the opposite of the RDBMS one: first identify the data access patterns, then model the data to support those access patterns. Data independence is not a primary goal, and developers are expected to understand how the CQL data model maps to the storage engine’s implementation data structures in order to make optimal use of Cassandra. (In practice, full data independence can be impossible to achieve with high-data-volume RDBMS applications as well.) The database is optimized for key-oriented data access, and the data model must be denormalized. Some aspects of the application that can easily be modified or configured at runtime in relational databases, e.g. sorting, are design-time decisions with Cassandra.
A relational application data model typically stores entities of a single type per relation. The Cassandra storage engine does not require that rows in a column family contain the same set of columns. You can store data about entirely unrelated entities in a single column family (wide rows).
Row key, partition key and clustering column are data modeling concepts that are important for the Cassandra application developer to understand. The Cassandra storage engine uniquely identifies rows by row key, and keys provide the primary row access path. A CQL single-column primary key maps directly to a row key in the storage engine. In the case of a composite primary key, the first part of the primary key is used as the row key and partition key, and the remaining parts are used as clustering columns.
Row key and column name, along with the choice of partitioner (and comparator) algorithm, have important consequences for data partitioning, clustering, performance and consistency. Row key and partitioner control how data is distributed among nodes in the database cluster and ordered within a node. These parameters also determine whether range scanning and sorting are possible in the storage engine. Logical rows with the same partition key get stored as a single, physical wide row on the same database node. Updates within a single storage engine row are atomic and isolated, but not across rows. This means that your data model design determines which updates can be performed atomically. Columns within a row are clustered and ordered by the clustering columns, which is particularly important when the data model includes wide rows (e.g. time-series data).
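A conceptual way to picture this mapping is a map of sorted maps. The sketch below models a hypothetical time-series table whose CQL primary key would be `(sensor_id, reading_time)`: `sensor_id` is the partition key, so all readings for one sensor form one wide row, and `reading_time` is the clustering column that keeps cells sorted within it. This is a mental model written in plain Java, not how Cassandra’s storage engine is implemented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/**
 * Conceptual model of a hypothetical CQL table:
 *   CREATE TABLE readings (sensor_id text, reading_time bigint, value text,
 *                          PRIMARY KEY (sensor_id, reading_time));
 */
final class WideRowModel {
    // partition key -> (clustering column -> value): one wide row per sensor
    private final Map<String, NavigableMap<Long, String>> partitions = new HashMap<>();

    void insert(String sensorId, long readingTime, String value) {
        partitions.computeIfAbsent(sensorId, k -> new TreeMap<>()).put(readingTime, value);
    }

    /** Range slice within one partition: cheap because cells are kept sorted by clustering key. */
    NavigableMap<Long, String> slice(String sensorId, long fromInclusive, long toExclusive) {
        NavigableMap<Long, String> row = partitions.get(sensorId);
        if (row == null) {
            return new TreeMap<>();
        }
        return row.subMap(fromInclusive, true, toExclusive, false);
    }

    int partitionCount() {
        return partitions.size();
    }
}
```

Note what the model makes awkward: finding all sensors with a reading at a given time would require scanning every partition, which is why such access patterns have to be anticipated in the data model.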
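How a partitioner spreads partition keys across nodes can be sketched as a token ring. Each node is assigned a token and owns the range from the previous node’s token (exclusive) up to its own, wrapping around at the end. Cassandra 1.2 defaults to the Murmur3Partitioner; to stay dependency-free, the hypothetical sketch below substitutes 64-bit FNV-1a as the hash, and it leaves out replication and virtual nodes entirely.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

final class TokenRing {
    private final NavigableMap<Long, String> nodesByToken = new TreeMap<>();

    void addNode(String node, long token) {
        nodesByToken.put(token, node);
    }

    /** Stand-in hash (64-bit FNV-1a), not the Murmur3 hash Cassandra actually uses. */
    static long token(String partitionKey) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : partitionKey.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    /** The owner is the first node whose token is >= the key's token, wrapping to the lowest. */
    String nodeForToken(long token) {
        if (nodesByToken.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Long, String> owner = nodesByToken.ceilingEntry(token);
        return owner != null ? owner.getValue() : nodesByToken.firstEntry().getValue();
    }

    String nodeFor(String partitionKey) {
        return nodeForToken(token(partitionKey));
    }
}
```

Because the hash scatters keys, adjacent partition keys land on unrelated nodes; that is why range scans over partition keys are not meaningful with a hashing partitioner, while ordering within a partition still comes from the clustering columns.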
When troubleshooting issues with an application, it’s often very important to be able to study the data in the storage engine using ad-hoc queries. Though Cassandra does support ad-hoc CQL queries, the supported query types are more limited. Also, database schema changes, data migration and data import typically require custom software development. On the other hand, schema evolution has traditionally been very challenging with RDBMS when large data volumes are involved.
Cassandra supports secondary indexes, but applications are often designed to maintain separate column families that support looking up data based on a single or multiple secondary access criteria.
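The application-maintained alternative to a secondary index can be sketched as two maps that are always written together. The names below (`usersById`, `usersByEmail`) are hypothetical stand-ins for two column families. In a real cluster the two writes touch different partitions, so, as noted above for updates across rows, they are not atomic with respect to each other.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class UserLookupTables {
    // Hypothetical "users_by_id" and "users_by_email" column families modeled as maps.
    private final Map<String, String> usersById = new HashMap<>();    // user id -> email
    private final Map<String, String> usersByEmail = new HashMap<>(); // email -> user id

    /** Writes the primary record and keeps the lookup table in step with it. */
    void saveUser(String userId, String email) {
        String previousEmail = usersById.put(userId, email);
        if (previousEmail != null && !previousEmail.equals(email)) {
            usersByEmail.remove(previousEmail); // drop the stale lookup entry
        }
        usersByEmail.put(email, userId);
    }

    /** Secondary access path served by a single key lookup instead of an index scan. */
    Optional<String> findIdByEmail(String email) {
        return Optional.ofNullable(usersByEmail.get(email));
    }
}
```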
One of the interesting things I noticed about Cassandra was its really nice load-balancing and failover clustering support, which is quite easy to set up. Failover works seamlessly and fast. Cassandra is also quite lightweight and effortless to set up. Data access and manipulation operations are extremely fast. The data model is schema-flexible and supports use cases for which RDBMSs usually aren’t up to the task, e.g. storing large amounts of time-series data with very high performance.
Cassandra is a highly available, Internet-scale NoSQL database with design goals that are very different from those of traditional relational databases. The differences between Cassandra and relational databases identified in this article should each be regarded as having pros and cons and be evaluated in the context of your problem domain. Also, using NoSQL does not exclude the use of RDBMS: it’s quite common to have a hybrid architecture where each database type is used in different use cases according to its strengths.
When starting their first NoSQL project, developers are likely to enter new territory and have their first encounters with related concepts such as big data and eventual consistency. Relational databases are often associated with strong consistency, whereas NoSQL systems are associated with eventual consistency (even though the use of a certain type of database doesn’t imply a particular consistency model). When moving from the relational world and strong consistency to the NoSQL world the biggest mind shift may be in understanding and architecting an application for eventual consistency. Data modeling is another area where a new way of design thinking needs to be adopted.
Cassandra is a very interesting product with a wide range of use cases. I think it’s a particularly well-suited database option for the following use cases:
It is, however, very different from relational databases. In order to be able to make an informed design decision on whether to use Cassandra or not, a good way to learn more is to study the documentation carefully. Cassandra development is very fast paced, so many of the documents you may find could be outdated. There’s no substitute for hands-on experience, though, so you should do some prototyping and benchmarking as well.