practicing techie

tech oriented notes to self and lessons learned

Monthly Archives: May 2013

Advanced PostgreSQL features

Nearly all the server-side development projects I’ve worked in over the years have stored at least part of their data in a relational database system. Even the systems using NoSQL storage have included a RDBMS in some form, whether local to a single subsystem or in a large role. In some cases the RDBMS systems have been proprietary, but increasingly they’ve been open source projects.

I’ve been using PostgreSQL in projects with RDBMS storage requirements on and off over the years. It has often impressed me with its depth of support for the SQL standard as well as wide range of non-standard extension features. With other widely used OSS RDBMS engines, I’ve often noticed that while the database claims to support feature X on paper, it only supports a subset. And then there’s a list of limitations you need to account for. Typically those limitations are something you wouldn’t expect, so they come as a surprise. Subqueries and joins are one such feature.

With PostgreSQL, I haven’t yet found a case where the database wouldn’t be able to handle a SQL standard subquery or join statement. Having an item such as comprehensive support for SQL subqueries and joins in a database product feature list may not look particularly appealing, it may even raise some suspicions. But from a developer point of view I find this “feature” a very important one, perhaps even one of PostgreSQL’s biggest selling points compared to some of it’s OSS competitors. PostgreSQL has many other advanced features that are interesting for application developers.

Common table expressions and hierarchic queries

Common table expressions or CTE is a handy standard SQL feature that allows you to split a query statement into distinct parts where results of each part will appear as a table, so you can reference the intermediate results in other parts of the statement, possibly several times. This can help make query statements more maintainable, but it also helps optimize queries in some cases, since CTE expressions are performed only once per statement execution.

In addition to allowing subquery factoring, CTE can process data as hierarchical. With small amounts of data and relatively shallow hierarchies you can implement hierarchical queries simply using joins, but this workaround may not be acceptable in all cases. A CTE hierarchical query makes it possible to process arbitrarily deep hierarchies with just one query.

Here’s an example how you can get a list of subtasks in an arbitrarily deep task tree along with path information for each task. Including path information will make it easier to build a graph representation on the receiving end.

mydb=> WITH RECURSIVE task_tree (id, name, parent_id, depth, path) AS (
mydb(>   SELECT id, name, parent_id, 1, ARRAY[t.id]
mydb(>     FROM task t WHERE t.id = 1
mydb(>   UNION ALL
mydb(>   SELECT s.id, s.name, s.parent_id, tt.depth + 1, path || s.id
mydb(>     FROM task s, task_tree tt WHERE s.parent_id = tt.id
mydb(> )
mydb-> SELECT * FROM task_tree
mydb-> ORDER BY depth ASC;

 id |    name     | parent_id | depth |    path
----+-------------+-----------+-------+-------------
  1 | task1       |           |     1 | {1}
  4 | task1-1     |         1 |     2 | {1,4}
  5 | task1-2     |         1 |     2 | {1,5}
  6 | task1-3     |         1 |     2 | {1,6}
 13 | task1-3-1   |         6 |     3 | {1,6,13}
 14 | task1-3-1-1 |        13 |     4 | {1,6,13,14}
(6 rows)

PostgreSQL also has an extension module that allows handling hierarchic data in a less verbose, but non-standard manner:

mydb=> SELECT * FROM connectby('task', 'id', 'parent_id', '1', 0, '/')
mydb->  AS t(id BIGINT, parent_id BIGINT, level INT, branch TEXT);

 id | parent_id | level |  branch
----+-----------+-------+-----------
  1 |           |     0 | 1
  4 |         1 |     1 | 1/4
  5 |         1 |     1 | 1/5
  6 |         1 |     1 | 1/6
 13 |         6 |     2 | 1/6/13
 14 |        13 |     3 | 1/6/13/14
(6 rows)

Aggregates and window functions

SQL GROUP BY lets you calculate aggregates of data over a single or multiple columns in a result set. The clause, however, can only aggregate over a single grouping, so it wouldn’t be possible e.g. to get calculate average salaries over departments and locations in a single query. Another limitation is that only aggregated data is returned and detail data is not preserved, so you can’t get both the detail records and aggregates using a single query.

Window functions make it possible to get both. Here’s how to calculate employee salary aggregates over several different groupings while preserving the detail records:

mydb=> SELECT depname, location, empno, salary,
mydb-> AVG(salary) OVER (PARTITION BY depname) avgdept,
mydb-> SUM(salary) OVER (PARTITION BY depname) sumdept,
mydb-> AVG(salary) OVER (PARTITION BY location) avgloc,
mydb-> RANK() OVER (PARTITION BY depname ORDER BY salary DESC, empno) AS pos
mydb-> FROM empsalary;

  depname  | location | empno | salary |        avgdept        | sumdept |        avgloc         | pos
-----------+----------+-------+--------+-----------------------+---------+-----------------------+-----
 develop   | fi       |     8 |   6000 | 5020.0000000000000000 |   25100 | 4550.0000000000000000 |   1
 develop   | se       |    10 |   5200 | 5020.0000000000000000 |   25100 | 4950.0000000000000000 |   2
 develop   | fi       |    11 |   5200 | 5020.0000000000000000 |   25100 | 4550.0000000000000000 |   3
 develop   | fi       |     9 |   4500 | 5020.0000000000000000 |   25100 | 4550.0000000000000000 |   4
 develop   | fi       |     7 |   4200 | 5020.0000000000000000 |   25100 | 4550.0000000000000000 |   5
 personnel | fi       |     2 |   3900 | 3700.0000000000000000 |    7400 | 4550.0000000000000000 |   1
 personnel | fi       |     5 |   3500 | 3700.0000000000000000 |    7400 | 4550.0000000000000000 |   2
 sales     | se       |     1 |   5000 | 4866.6666666666666667 |   14600 | 4950.0000000000000000 |   1
 sales     | se       |     3 |   4800 | 4866.6666666666666667 |   14600 | 4950.0000000000000000 |   2
 sales     | se       |     4 |   4800 | 4866.6666666666666667 |   14600 | 4950.0000000000000000 |   3
(10 rows)

Like CTE, window functions is a feature specified in the SQL standard, but it’s not supported by all OSS or proprietary RDBMS systems.

Pivoting data

Sometimes it’s nice to be able to pivot data in a properly normalized data model, so that repeating groups of related entities are folded into parent entity as columns. This can be useful e.g. for reporting purposes and ad-hoc queries. PostgreSQL can handle pivoting data using subqueries and arrays like this:

mydb=> SELECT e.*,
mydb-> (SELECT ARRAY_TO_STRING(ARRAY(SELECT emp_phone_num FROM emp_phone p WHERE e.employee_id = p.emp_id), ',')) AS phones
mydb->  FROM employees AS e;

 employee_id | last_name | manager_id |         phones
-------------+-----------+------------+-------------------------
         100 | King      |            |
         101 | Kochhar   |        100 | 555-123,555-234,555-345
         108 | Greenberg |        101 | 555-111
         205 | Higgins   |        101 | 555-914,555-222
         206 | Gietz     |        205 |
...
(13 rows)

Another way is to use the tablefunc extension module again:

mydb=> SELECT *
mydb-> FROM crosstab(
mydb(>   'SELECT emp_id, contact_type, contact FROM emp_sm_contact ORDER BY 1',
mydb(>   'SELECT DISTINCT contact_type FROM emp_sm_contact ORDER BY 1'
mydb(> )
mydb-> AS emp_sm_contact(emp_id BIGINT, "g+" TEXT, "linkedIn" TEXT, twitter TEXT);

 emp_id |    g+     | linkedIn  | twitter
--------+-----------+-----------+---------
    100 | bking     | b.king    | beking
    101 | kochhar.2 | kochhar.1 | kochhar
    200 |           |           | whalen
(3 rows)

Other advanced features

Other advanced PostgreSQL features that I find of interest to application developers include:

  • pattern matching. Regular expression matching is supported
  • geolocation queries. PostGIS extension adds comprehensive support for managing and querying geospatial data
  • partitioning
  • replication

Final thoughts

Problems faced in the transition phase of the software development process, when the software has been handed over from development to operations team, have prompted the need for closer collaboration between the teams in the form of devops culture. Similarly, application developers can’t remain ignorant of database design, implementation and optimization issues, and expect the DBAs to magically fix any data tier related design issues after the system has been implemented. Application developers need to learn how to leverage database systems effectively and take responsibility of database tier design to make transitions more seamless and production deployments succeed.

While standards based object relational mapping (ORM) technologies, such as Java persistence API, can be a great help to application developers, developers should be aware of what kind of queries the ORM implementation is generating, and in particular watch out for N+1 queries issues. With higher data, transaction volumes or data access patterns advanced database features will be a significant help in optimising the application.

To quote Oracle guru Tom Kyte: it’s a database, not a data dump. PostgreSQL is an advanced relational database engine and it has a lot of features that can help application developers implement new features faster and in a more efficient and scalable manner. As with all the tools: you should learn how to use it to get the most out of it.

More info