Database consulting and application development...

...because your data is your business

April 28, 2017, 9 a.m.

A PostgreSQL Response to Uber

The slides from my talk at Percona Live 2017, A PostgreSQL Response to Uber, which I also gave as “Why Uber Was (Mostly) Wrong” at PGDay Nordic 2017, are now available.

Posted by Christophe Pettus
March 31, 2017, 9:34 a.m.

“Corruption War Stories” from PGConf US 2017

The slides from my presentation, Corruption War Stories, are now available.

Posted by Christophe Pettus
March 31, 2017, 9:33 a.m.

“Django and PostgreSQL” from PGConf US 2017

The slides from my presentation, Django and PostgreSQL, are now available.

Posted by Christophe Pettus
March 29, 2017, 7:37 p.m.

Change autovacuum_freeze_max_age without a restart (sort of…)

This blog post is kind of involved, so I'm giving a short version at the top, with some background for beginners at the bottom. The middle section explains the motivation for using this hack in the first place.

Short version

I came up with a useful and/or terrible hack the other day: setting autovacuum_freeze_max_age as a storage parameter. I definitely don't recommend doing this routinely, but it unblocked us during a critical maintenance window.

    ALTER TABLE my_table SET (autovacuum_freeze_max_age = 300000000);

Don't forget to set it back when you're done! Otherwise you will incur an even longer autovacuum freeze, probably when you least expect it.

Medium-length version

My colleague Kacey Holston was in the midst of upgrading a client from PostgreSQL 9.4 to 9.6, using Slony for minimal downtime. As planned, the client took a few minutes of downtime so Kacey could switch them over to the new server. She was ready to reverse the direction of replication (so the 9.6 server would replicate to the 9.4 server, in case our client needed to fall back to it). But there was an autovacuum freeze (a.k.a. "autovacuum (to prevent wraparound)") that was keeping Slony from getting the brief ExclusiveLock it needed.

She knew from experience that this table takes three hours to freeze. But the client had only minutes of downtime scheduled – that was the whole point of using Slony!

If only it were possible to change autovacuum_freeze_max_age on the fly; then we could bump it up to stop that autovacuum. Unfortunately, you have to restart the database in order to change it. Except…

You can set it on a per-table basis, as follows. This took effect immediately:

    ALTER TABLE my_table SET (autovacuum_freeze_max_age = 300000000);

If you do this, don't forget to set it back to the normal value (by default, 200000000) once you're done! Otherwise autovacuum freezes on this table will come around less often and take even longer.
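Setting it back can be done either by re-running the ALTER TABLE with the normal value, or with RESET, which removes the per-table override entirely so the server-wide setting applies again:

```sql
-- Remove the per-table override; the global autovacuum_freeze_max_age
-- (200000000 by default) takes effect again for this table.
ALTER TABLE my_table RESET (autovacuum_freeze_max_age);
```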

Background for beginners:

When the oldest transaction ID on any row in a table is more than autovacuum_freeze_max_age old (200 million transactions, by default), then an "autovacuum (to prevent wraparound)" process runs on the table to reclaim old transaction IDs. For large tables, this can be a problem, because it can generate a lot of CPU and I/O activity during busy hours. Also, as we saw here, it locks the table (in SHARE UPDATE EXCLUSIVE mode); this blocks DDL changes (a.k.a. migrations).
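To see which tables are approaching a wraparound-prevention vacuum before it catches you mid-maintenance, you can compare each table's relfrozenxid age against the limit with a standard catalog query:

```sql
-- Tables closest to triggering an "autovacuum (to prevent wraparound)".
SELECT relname,
       age(relfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::bigint AS freeze_max_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 10;
```

When xid_age approaches freeze_max_age for a big table, schedule a manual VACUUM FREEZE during a quiet period rather than letting autovacuum pick the time for you.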

For more-technical background, see the official PostgreSQL docs on how transaction IDs work, and for a friendlier intro, see this series of blog posts.

Posted by Quinn Weaver
March 29, 2017, 7:08 a.m.

PostgreSQL When It’s Not Your Job

My slides from my PGConf US 2017 tutorial, PostgreSQL When It’s Not Your Job, are available now.

Posted by Christophe Pettus
March 3, 2017, 5:20 p.m.

Slides from my talk at SCALE15x

The slides from my SCALE 15x talk, Minimizing PostgreSQL Major Version Upgrade Downtime Using Slony, are now available.

You'll want to download the PowerPoint version, as it includes some embedded screencasts that don't play on SlideShare.

Posted by Jeff Frost
Feb. 7, 2017, 2:14 a.m.

Corruption War Stories at PGDay FOSDEM 2017

The slides for my talk Corruption War Stories are now available.

Posted by Christophe Pettus
Jan. 11, 2017, 4:26 p.m.

Retiring from the Core Team

Those of you in the PostgreSQL community will have noticed that I haven't been very active for the past year.  My new work on Linux containers and Kubernetes has been even more absorbing than I anticipated, and I just haven't had a lot of time for PostgreSQL work.

For that reason, as of today, I am stepping down from the PostgreSQL Core Team.

I joined the PostgreSQL Core Team in 2003.  I decided to take on project advocacy, with the goal of making PostgreSQL one of the top three databases in the world.  Thanks to the many contributions by both advocacy volunteers and developers -- as well as the efforts by companies like EnterpriseDB and Heroku -- we've achieved that goal.  Along the way, we proved that community ownership of an OSS project can compete with, and ultimately outlast, venture-funded startups.

Now we need new leadership who can take PostgreSQL to the next phase of world domination.  So I am joining Vadim, Jan, Thomas, and Marc in clearing the way for others.

I'll still be around and still contributing to PostgreSQL in various ways, mostly around running the database in container clouds.  It'll take a while for me to hand off all of my PR responsibilities for the project (assuming that I ever hand all of them off).

It's been a long, fun ride, and I'm proud of the PostgreSQL we have today: both the database, and the community.  Thank you for sharing it with me.

Posted by Josh Berkus
Jan. 4, 2017, 11 a.m.

Django: dictionary update sequence element #0 has length 1; 2 is required

The error "dictionary update sequence element #0 has length 1; 2 is required" usually means a missing name='' in a urlconf.
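For context, here is the underlying Python behavior that produces the message (a minimal reproduction, not Django-specific): dict() accepts a sequence of (key, value) pairs, and a stray bare string in that sequence triggers exactly this error.

```python
# dict() can be built from a sequence of 2-item pairs:
ok = dict([("a", 1), ("b", 2)])

# A bare one-character string where a pair was expected reproduces
# the error Django surfaces when a urlconf argument is malformed.
try:
    dict(["x"])  # element #0 is "x", which has length 1, not 2
except ValueError as e:
    print(e)  # dictionary update sequence element #0 has length 1; 2 is required
```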

Posted by Christophe Pettus
Jan. 4, 2017, 9 a.m.

Estimates “stuck” at 200 rows?

If you are seeing the planner estimating 200 rows exactly from an aggregate operation, that might mean that a table is missing planner statistics.

Posted by Christophe Pettus
Jan. 1, 2017, 3:53 p.m.

Django: Site matching query does not exist

Site matching query does not exist means you forgot the record in the django_site table, or you forgot SITE_ID.

Posted by Christophe Pettus
Dec. 30, 2016, 12:50 p.m.

The Multi-Column Index of the Mysteries

Multi-column indexes are more subtle than you might think.

Posted by Christophe Pettus
Dec. 28, 2016, 7:24 p.m.

A Cheap and Cheerful Replication Check

Don't calculate log positions unless you have to.

Posted by Christophe Pettus
Nov. 3, 2016, 6:51 a.m.

Securing PostgreSQL at PGConf EU

The slides for my talk, Securing PostgreSQL at PGConf EU 2016 are now available.

Posted by Christophe Pettus
Nov. 2, 2016, 3:03 a.m.

Unclogging the VACUUM at PGConf EU

The slides for my presentation Unclogging the VACUUM at PGConf EU in Tallinn, Estonia are now available.

Posted by Christophe Pettus
Aug. 8, 2016, 6:55 a.m.

Using EXPLAIN json format output in plpgsql

It's possible to use EXPLAIN output in plpgsql. In the default text format, the result comes back as a set of text values, so you can process them in a loop like this:

   exp text;
   for exp in explain myquery loop
      raise notice '%', exp;
   end loop;

If you use json format output, however, the result comes back as a single json document rather than a set of lines. You still might need a loop - I haven't found another way yet of getting the output from EXPLAIN into a variable - but the loop will only have one iteration. Here is an example taken from a function I wrote the other day that lets you get the estimated number of rows from the plan:

   exp json;
   for exp in explain (format json) myquery loop
      raise notice 'rows: %', exp#>>'{0,Plan,Plan Rows}';
   end loop;

Posted by Andrew Dunstan
June 9, 2016, 6:03 a.m.

Removing a Buildfarm client/server perl dependency

With one very small exception, there is nothing in the buildfarm server that requires the client to be running perl, even though both the server and the standard client are written in perl.

Andreas Scherbaum has written a new specialized client in python, and he asked me about removing that dependency. Essentially what this requires is that an alternative mechanism be allowed for transmitting the serialized configuration in the client's web transaction. Currently the client uses perl's Storable module to serialize the data, and the server uses the same module to de-serialize it. The obvious alternative candidate is to serialize it as JSON.

The first problem was to determine if we can sanely distinguish between data serialized by the two mechanisms. And we can. JSON is not allowed to contain any control characters, and a structure serialized using Storable's nfreeze() method is pretty much guaranteed to contain such characters. So I've added a test to the receiving program that looks for such characters and if it doesn't find them assumes that the data is JSON and decodes it accordingly. This has been tested using the client nightjar.
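The detection idea translates directly: JSON text may contain whitespace control characters (tab, LF, CR) between tokens, but no other raw control bytes, while Storable's nfreeze() output almost always contains some. Here is a minimal Python sketch of that heuristic (my own illustration, not the buildfarm server's actual code):

```python
def looks_like_json(payload: bytes) -> bool:
    """Guess whether a payload is JSON text rather than a binary
    serialization, by looking for non-whitespace control bytes."""
    allowed = {0x09, 0x0A, 0x0D}  # tab, LF, CR are legal JSON whitespace
    return not any(b < 0x20 for b in payload if b not in allowed)

print(looks_like_json(b'{"animal": "nightjar"}'))   # True
print(looks_like_json(b'\x05\x0b\x02binary blob'))  # False
```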

There are a few wrinkles, though. The main one is that we can't assume that every client has a module available to encode the data as JSON. In fact, nightjar didn't until I installed one. So in order to build this into the client I'll need to find a way to determine at runtime if it's available and fall back to using Storable otherwise.

We should also look at actually storing the data as JSON rather than as a serialized blob. When the buildfarm was created we didn't have any database support for JSON, but now the sensible thing to do would be to store the data as jsonb, and make it searchable.

But that's a project for another day.

Posted by Andrew Dunstan
June 4, 2016, 9:34 p.m.

Recent Slides

Slides from my recent presentations at PGCon and PyCon.

Posted by Christophe Pettus
June 4, 2016, 8:20 a.m.

Using wal-e with Postgres on a non-standard port

I just discovered after a great deal of puzzlement and far too much time that if you run multiple instances of Postgres on a machine, and you're using wal-e for continuous backup, you'll need an entry for PGPORT or PGSERVICE in the environment directory for any instance that's not running on port 5432. Otherwise, wal-e will happily start and stop a backup on your port 5432 server and your non-5432 backups will be completely useless since they won't have a backup label.
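Concretely, the fix is one file per variable in the envdir that wal-e is driven from (the /etc/wal-e.d/env path is the conventional location; adjust for your installation). A sketch, using a placeholder directory:

```shell
# Using a temporary placeholder directory here; in production this is
# typically /etc/wal-e.d/env for the instance in question.
ENVDIR=$(mktemp -d)
echo 5433 > "$ENVDIR/PGPORT"   # the non-default instance's port

# wal-e is then invoked via envdir, which exports each file's contents
# as an environment variable named after the file, e.g.:
#   envdir /etc/wal-e.d/env wal-e backup-push "$PGDATA"
cat "$ENVDIR/PGPORT"
```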

Yet another example of the fact that any untested backup is not worth anything. If I hadn't tested this and found the problem we could have been left without a backup when we needed it.

Posted by Andrew Dunstan
May 31, 2016, 5:47 a.m.

Indiscriminate use of CTEs considered harmful

Common Table Expressions are a wonderful thing. Not only are they indispensable for creating recursive queries, but they can be a powerful tool in creating complex queries that are comprehensible. It's very easy to get lost in a fog of sub-sub-sub-queries, so using CTEs as a building block can make things a lot nicer.

However, there is one aspect of the current implementation of CTEs that should make you pause. Currently CTEs are in effect materialized before they can be used. That is, Postgres runs the query and stashes the data in a temporary store before it can be used in the larger query. There are a number of consequences of this.

First, this can be a good thing. I have on a number of occasions used this fact to good effect to get around problems with poorly performing query plans. It's more or less the same effect as putting "offset 0" on a subquery.

However, it can also result in some very inefficient plans. In particular, if CTEs return a lot of rows they can result in some spectacularly poorly performing plans at the point where you come to use them. Note also that any indexes on the underlying tables will be of no help to you at all at this point, since you are no longer querying against those tables but against the intermediate result mentioned above, which has no indexes at all.
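As a sketch of the difference (the table and column names here are made up for illustration): the CTE form materializes every matching row before the outer filter is applied, while the inlined form lets the planner push the filter down and use an index directly.

```sql
-- CTE form: "orders" is materialized in full first; the customer_id
-- filter is applied afterwards, against an unindexed intermediate result.
WITH orders AS (
    SELECT * FROM big_table WHERE status = 'open'
)
SELECT * FROM orders WHERE customer_id = 42;

-- Inlined form: the planner can push customer_id = 42 into the scan
-- and use an index on big_table(customer_id).
SELECT *
FROM (SELECT * FROM big_table WHERE status = 'open') o
WHERE o.customer_id = 42;
```

Comparing EXPLAIN (ANALYZE) output for the two forms is the quickest way to see whether a given CTE is hurting you.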

This was brought home to me forcefully on Friday and Saturday when I was looking at a very poorly performing query. After some analysis and testing, the simple act of inlining two CTEs in the query in question resulted in the query running in 4% of the time it had previously taken. Indiscriminate use of CTEs had made the performance of this query 25 times worse.

So the moral is: be careful in using CTEs. They are not just a convenient tool for abstracting away subqueries.

There has been some discussion about removing this aspect of the implementation of CTEs. It's not something that is inherent in CTEs, it's simply a part of the way we have implemented them in PostgreSQL. However, for now, you need to be familiar with the optimization effects when using them, or you might face the same problem I was dealing with above.

Posted by Andrew Dunstan
May 18, 2016, 9:02 a.m.

Changing PostgreSQL Version Numbering

Per yesterday's developer meeting, the PostgreSQL Project is contemplating a change to how we do version numbers.  First, let me explain how we do version numbers now.  Our current version number composition is:

9 . 5 . 3 
Major1 . Major2 . Minor

That is, the second number is the "major version" number, reflecting our annual release.  The third number is the update release number, reflecting cumulative patch releases.  Therefore "9.5.3" is the third update to version 9.5.

The problem is the first number, in that we have no clear criteria when to advance it.  Historically, we've advanced it because of major milestones in feature development: crash-proofing for 7.0, Windows port for 8.0, and in-core replication for 9.0.  However, as PostgreSQL's feature set matures, it has become less and less clear on what milestones would be considered "first digit" releases.  The result is arguments about version numbering on the mailing lists every year which waste time and irritate developers.

As a result, the PostgreSQL Project is proposing a version numbering change, to the following:

10 . 2
Major . Minor

Thus "10.2" would be the second update release for major version 10.   The version we release in 2017 would be "10" (instead of 10.0), and the version we release in 2018 will be "11".

The "sortable" version number available from the server, libpq, and elsewhere would remain the same six digits, zero-filled in the middle.  So 10.2 would be 100002.

The idea is that this will both put an end to the annual arguments, as well as ending the need to explain to users that 9.5 to 9.6 is really a major version upgrade requiring downtime.

Obviously, there is potential for breakage of a lot of tools, scripts, automation, packaging and more in this.  That's one reason we're discussing this now, almost a year before 10 beta is due to come out.

The reason for this blog post is that I'm looking for feedback on what this version number change will break for you.  Particularly, I want to hear from driver authors, automation engineers, cloud owners, application stack owners, and other folks who are "downstream" of PostgreSQL.  Please let us know what technical problems this will cause for you, and how difficult it will be to resolve them in the next nine months.

We are not, at this point, interested in comments on how you feel about the version change or alternate version naming schemes.  That discussion has already happened, at length.  You can read it here, here, and here, as well as at the developer meeting.

Places to provide feedback:

Thanks for any feedback you can provide.

Note that the next release of PostgreSQL, due later this year, will be "9.6" regardless.  We're deciding what we do after that.

Posted by Josh Berkus
April 29, 2016, 5:34 a.m.

Windows XP - PostgreSQL's long goodbye.

I just committed a patch to allow building with Visual Studio 2015. Due to a change in the way we need to detect locales with this compiler, it will not be possible to use binaries built with it on Windows XP. Despite the fact that Microsoft declared Windows XP to be completely out of support well over a year ago, it still lives on in a huge number of installations. My own instance still gets occasional updates from Microsoft. And you can still build and run the very latest PostgreSQL on Windows XP. But you can't use it on Windows XP if it's built with Visual Studio 2015 or later.

I will keep my instance (whose only job is to run several buildfarm members) running as long as it doesn't require any significant tweaks. But a day will eventually come when it does require such tweaks, or just won't build and run successfully any more,  and at that stage I'll shut it down.

Posted by Andrew Dunstan
April 28, 2016, 11:17 a.m.

Don't delete pg_xlog

This StackOverflow question reminded me of this old blog post, which is still relevant today:

pg_log, pg_xlog and pg_clog

There are three directories in a default $PGDATA directory when you create it which are named "pg_*log".


$PGDATA/pg_log is the default location for the database activity logs, which include error messages, query logging, and startup/shutdown messages.  This is where you should first look for information when PostgreSQL won't start.  Many Linux distributions and other packaging systems relocate this log directory to somewhere like /var/log/postgresql.

You can freely delete, rename, compress, and move files in pg_log without penalty, as long as the postgres user still has rights to write to the directory. If pg_log becomes bloated with many large files, you probably need to decrease the number of things you're logging by changing the settings in postgresql.conf.

Do note that if you "delete" the current log file on a Linux or Unix system, it may remain open but inaccessible, with subsequent log messages going to the deleted (still-open) file until the log rotates.


$PGDATA/pg_xlog is the PostgreSQL transaction log.  This set of binary log files, with names like '00000001000000000000008E', contain images of the data from recent transactions.  These logs are also used for binary replication.

If replication, archiving, or PITR is failing, this directory can become bloated with gigabytes of logs the database server is saving for when archiving resumes. This can cause you to run out of disk space.

Unlike pg_log, you may not freely delete, move, or compress files in this directory.  You may not even move the directory without symlinking it back to its original location.  Deleting pg_xlog files may result in unrecoverable database corruption.

If you find yourself in a situation where you've got 100GB of files in pg_xlog and the database won't start, and you've already disabled archiving/replication and tried clearing disk space every other way, then please take two steps:
  1. Move files from pg_xlog to a backup disk or shared network drive, don't delete them, and
  2. Move only a few of the oldest files, enough to allow PostgreSQL to start again.


$PGDATA/pg_clog contains a log of transaction metadata.   This log tells PostgreSQL which transactions completed and which did not.  The clog is small and never has any reason to become bloated, so you should never have any reason to touch it.

Should you ever delete files from pg_clog, you might as well delete the entire database directory. There is no recovery from a missing clog.

Note that this means, if you back up the files in a $PGDATA directory, you should make sure to include the pg_clog and pg_xlog as well, or you may find that your backup is not usable.

Posted by Josh Berkus
April 26, 2016, 4:44 p.m.

Join us for the 3rd pgCon User Unconference

This year, we're continuing to experiment with new formats for the pgCon unconference.  In 2013 and 2014 we had an Unconference on the Saturday of pgCon.  In 2015 we had a limited Developer Unconference on Wednesday.

This year, we will have a Developer Unconference on Wednesday, and a User Unconference on Saturday.  We're doing this because people were disappointed that we didn't do the User Unconference last year, and asked us to bring it back.  So, hopefully you planned to stay over Saturday!

The User Unconference has several purposes:

  • to give various teams and special interest groups an opportunity to schedule something
  • to let folks whose technology was released too late for the CfP another chance to present something
  • to continue discussions started around talks in the main program
So, please join us!  And if you have ideas for User Unconference sessions which you want to make sure get on the program, please list them on the wiki page using the template provided.  Note that final sessions will be chosen at 10am Saturday morning, though.

Posted by Josh Berkus
April 17, 2016, 8:39 a.m.

Building an extension with VS2015

I needed to get a DLL prepared of my tiny closed format ranges extension, and so I tried following the excellent instructions Craig Ringer gave a couple of years ago. I was using Visual Studio 2015, on a Windows 10 box, targeting PostgreSQL 9.4, 64bit.

But I ran into some problems. I could compile the thing quite cleanly, but Postgres couldn't find the functions when I tried to create the extension. I had a short and very helpful email exchange with Craig, and came up with this solution: the exported functions needed to be declared with a prototype before the call to PG_FUNCTION_INFO_V1, and both the prototype and the function definition needed to be decorated with PGDLLEXPORT. Once I had done that, the DLL loaded when I ran "create extension", and the types worked as expected.

I don't know if this is required with any earlier versions of Visual Studio, or even if it works with them.  When I get a chance I will see if I can find out.

Based on this, I think it's probably worth trying at some stage to put together a script to build the project and solution files for extensions. I don't have time to do that right now, however.

Posted by Andrew Dunstan
March 31, 2016, 8:09 a.m.

9.5.2 update release and corrupt indexes

We've released an off-schedule update release today, because of a bug in one of 9.5's features which has forced us to partially disable the feature.  This is, obviously, not the sort of thing we do lightly.

One of the performance features in 9.5 was an optimization of sorting for text and numeric values, contributed by Peter Geoghegan.  This was an awesome feature which sped up sorts across the board by 50% to 2000%, and since databases do a lot of sorting, it was an overall speed increase for PostgreSQL.  It was especially effective in speeding up index builds.

That feature depends on a built-in function in glibc, strxfrm(), which could be used to create a sortable hash of strings.  Now, the POSIX standard says that strxfrm() + strcmp() should produce sorting results identical to the strcoll() function.  And in our general tests, it did.
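Python's locale module wraps the same C library functions, so the POSIX contract in question can be sketched directly: comparing strxfrm() results must agree in sign with strcoll(). (This uses the "C" locale, which is unaffected by the glibc bug; under a buggy glibc with certain real collations, this equivalence is exactly what failed.)

```python
import locale

# The "C" collation is the one PostgreSQL left the optimization enabled for.
locale.setlocale(locale.LC_COLLATE, "C")

def consistent(a: str, b: str) -> bool:
    """True if comparing strxfrm() keys agrees with strcoll()."""
    ka, kb = locale.strxfrm(a), locale.strxfrm(b)
    via_strxfrm = (ka > kb) - (ka < kb)   # -1, 0, or 1
    via_strcoll = locale.strcoll(a, b)    # negative, zero, or positive
    return (via_strxfrm > 0) == (via_strcoll > 0) and \
           (via_strxfrm < 0) == (via_strcoll < 0)

print(consistent("Müller", "Mueller"))  # True
```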

However, there are dozens of versions of glibc in the field, and hundreds of collations, and it's computationally unreasonable to test all combinations.  Which is how we missed the problem until a user reported it.  It turns out that for certain releases of glibc (particularly anything before 2.22 on Linux or BSD), with certain collations, strxfrm() and strcoll() return different results due to bugs.  Which can result in an index lookup failing to find rows which are actually there.  In the bug report, for example, an index on a German text column on RedHat Enterprise Linux 6.5 would fail to find many rows in a "between" search.

As a result, we've disabled the feature in 9.5.2 for all indexes which are on collations other than the simplified "C" collation.  This sucks.

Also, if you're on 9.5.0 or 9.5.1 and you have indexes on columns with real collations (i.e. not "C" collation), then you should REINDEX (or CREATE INDEX CONCURRENTLY + DROP INDEX CONCURRENTLY) each of those indexes.  Which really sucks.

Of course we're discussing ways to bring back the feature, but nobody has a solution yet. In the meantime, you can read more about the problem on the wiki page.

Posted by Josh Berkus
March 24, 2016, 12:08 p.m.

Weird stuff happens

Five days ago, my buildfarm animal jacana suddenly started getting an error while trying to extract typedefs. It had been happily doing this for ages, and suddenly the quite reproducible error started. For now I have disabled its typedef analysis, but I will need to get to the bottom of it. It's bad enough to crash the buildfarm client leaving the build directory dirty and not able to process further builds until I clean it up. I'm assuming it's probably something that changed in the source code, as nothing else has changed at all. These are the commits that took place between a good run and the first appearance of the error.

  • 9a83564 Allow SSL server key file to have group read access if owned by root
  • 6eb2be1 Fix stupid omission in c4901a1e.
  • 07aed46 Fix missed update in _readForeignScan().
  • ff0a7e6 Use yylex_init not yylex_init_extra().
  • a3e39f8 Suppress FLEX_NO_BACKUP check for psqlscanslash.l.
  • 0ea9efb Split psql's lexer into two separate .l files for SQL and backslash cases.
I don't know for dead certain that any of these caused the issue, but finding out what the problem is will be just one more way for me to spend my copious free time.

Posted by Andrew Dunstan
March 19, 2016, 2:54 p.m.

GIN indexing an array of enums

GIN indexes of arrays can be pretty useful. And arrays of enums can also be pretty useful. Unfortunately, there is no built in GIN indexing of arrays of enums, and in fact there is no pseudo-type for arrays of enums. So, while we can declare an operator class for supporting a type of "anyenum", which I did in my recent work on btree_gin and btree_gist support for enums, we can't declare one for support of "anyenum[]".

However, we can declare one for support of a concrete enum type's array type. And all the pieces are already there.

After a lot of reading and experimentation, here's what I found that seems to work just like the built-in GIN array operator classes.

Given an enum type of animal, you can declare an operator class for its array type like this (note how almost completely generic this is):

create operator class _animal_ops
default for type public.animal[]
using gin
family array_ops as
    function 1 enum_cmp(anyenum, anyenum),
    function 2 pg_catalog.ginarrayextract(anyarray, internal),
    function 3 ginqueryarrayextract(anyarray, internal, smallint, internal,
               internal, internal, internal),
    function 4 ginarrayconsistent(internal, smallint, anyarray, integer,
               internal, internal, internal, internal),
    function 6 ginarraytriconsistent(internal, smallint, anyarray, integer,
               internal, internal, internal),
    storage oid;

and here it is working:
andrew=# \d animal_classes
    Table "public.animal_classes"
 Column  |   Type   | Modifiers
---------+----------+-----------
 class   | integer  |
 animals | animal[] |

andrew=# select * from animal_classes where '{lion}' <@ (animals);
 class |   animals
-------+--------------
     1 | {lion,zebra}
(1 row)

andrew=# set enable_seqscan = off;
andrew=# create index animal_classes_animals_idx on animal_classes using gin(animals);
andrew=# explain (costs off) select * from animal_classes where '{lion}' <@ (animals);
                      QUERY PLAN
------------------------------------------------------
 Bitmap Heap Scan on animal_classes
   Recheck Cond: ('{lion}'::animal[] <@ animals)
   ->  Bitmap Index Scan on animal_classes_animals_idx
         Index Cond: ('{lion}'::animal[] <@ animals)
(4 rows)

andrew=# select * from animal_classes where '{lion}' <@ (animals);
 class |   animals
-------+--------------
     1 | {lion,zebra}
(1 row)


Posted by Andrew Dunstan
March 19, 2016, 7:15 a.m.

Gist and Gin support for enums

Recently I submitted a patch (too late for release 9.6, unfortunately) to add support for enum data types to the btree_gist and btree_gin additional modules. This came out of a desire to be able to use an enum field in an exclusion constraint. That really only requires GiST support, but since I was adding enum support to btree_gist it seemed a good idea to add it to btree_gin as well, so that, for example, enum fields can be used in multi-column GIN indexes.

Here's an exclusion constraint example using enums, adapted from the Postgres docs.

andrew=# create extension btree_gist;
andrew=# \dx
                          List of installed extensions
    Name    | Version |   Schema   |                  Description
------------+---------+------------+-----------------------------------------------
 btree_gist | 1.2     | public     | support for indexing common datatypes in GiST
 plpgsql    | 1.0     | pg_catalog | PL/pgSQL procedural language
(2 rows)

andrew=# create type animal as enum ('zebra','lion');
andrew=# create table zoo (cage int, exhibit animal, exclude using gist(cage with =, exhibit with <>));
andrew=# INSERT INTO zoo VALUES(123, 'zebra');
andrew=# INSERT INTO zoo VALUES(123, 'zebra');
andrew=# INSERT INTO zoo VALUES(123, 'lion');
ERROR:  conflicting key value violates exclusion constraint "zoo_cage_exhibit_excl"
DETAIL:  Key (cage, exhibit)=(123, lion) conflicts with existing key (cage, exhibit)=(123, zebra).

Posted by Andrew Dunstan
March 8, 2016, 9:29 a.m.

Json dates, times, and binary data

If I had to pick one thing I would change about the JSON spec, it's this: they should allow timestamps and dates as primitive values. It's a very common complaint. Another thing they should have provided for is binary data. This gets a bit annoying if you convert a record containing a bytea to json and hand it to the client. They see a string like '\x71951ce4c84c0b4de861377f27a4938a'. If you can, it's probably nicer to encode your bytea before converting it to json, using encode(mybytea,'base64') or encode(mybytea,'hex'). Then your client might see a string for which they have a standard routine to convert to binary, rather than having to strip off the initial '\x' before converting it.
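From a client's point of view, the difference looks something like this Python sketch (the hex value is the example string above):

```python
import base64

# What a client sees for a bytea rendered into JSON as-is: PostgreSQL's
# hex output format, a '\x' prefix followed by hex digits.
raw = r"\x71951ce4c84c0b4de861377f27a4938a"

# Without server-side encode(), the client must strip the prefix first:
data = bytes.fromhex(raw[2:])
print(len(data))  # 16 bytes

# If the server had used encode(mybytea, 'base64') instead, the string
# could go straight into a standard base64 decoder:
assert base64.b64decode(base64.b64encode(data)) == data
```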

Posted by Andrew Dunstan
Feb. 24, 2016, 6:06 p.m.

Stupid Docker Tricks: running Firefox in a container

Once you start messing around with desktop containers, you end up doing a lot of things "because you can" and not because they're a particularly good idea.  For example, I need to run multiple Firefox instances under different profiles in order to maintain various social media accounts, such as the @projectatomic twitter feed.  Now, I could do that by launching Firefox with various -P flags, but that would be no fun at all.  Instead, I'm going to launch Firefox in a container.

Mind you, if you're a webdev or a desktop hacker, this would be a good way to launch various hacked versions of Firefox without messing with the browser you need every day.  There are four basic steps here:

  1. build a Firefox image
  2. authorize X11 connections from containers
  3. enable Firefox connections in SELinux
  4. run the container
First, build the Firefox image.  This is fairly standard except that you need to tailor it to the UID and GID of your desktop user so that you don't have to jump through a lot of hoops to get SELinux to authorize connecting from the container to your desktop X server.  I use a Dockerfile like this one:

    FROM fedora
    # install firefox
    RUN dnf install -y firefox
    # install dependencies
    RUN dnf install -y libcanberra-gtk3 PackageKit-gtk3-module \
        dbus dbus-devel dbus-x11
    RUN dbus-uuidgen --ensure

    # make uid and gid match inside and outside the container
    # replace 1000 with your gid/uid, find them by running
    # the id command
    RUN export uid=1000 gid=1000 && \
        mkdir -p /home/firefox && \
        echo "firefox:x:${uid}:${gid}:Developer,,,:/home/firefox:/bin/bash" >> /etc/passwd  && \
        echo "firefox:x:${uid}:" >> /etc/group  && \
        echo "firefox ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers  && \
        chmod 0440 /etc/sudoers  && \
        chown ${uid}:${gid} -R /home/firefox

    #remove cache from the image to shrink it a bit
    RUN dnf clean all

    # set up and run firefox
    USER firefox
    ENV HOME /home/firefox
    CMD /usr/bin/firefox -no-remote

Then you can build your image by running:

    docker build -t username/firefox .

Next we need to make sure that the docker container is allowed to run X11 apps on your desktop machine, so that Firefox can run inside the container but be displayed on your desktop.  This is a simple but permissive command, disabling X access control entirely; you can instead use "xhost +local:" to limit access to local connections:

    xhost +

Thirdly we'll need to also make that work with SELinux.  The simplest way to do this is to try it, have SELinux block, and then enable it.  So try launching the Firefox container with the command in the step below.  It should fail with some kind of "could not connect to display" error.  Then run these commands, as root:

    grep firefox /var/log/audit/audit.log | audit2allow -M mypol
    semodule -i mypol.pp    

Finally, your Firefox container should be ready to go.  Except you need to add some flags, due to the need to share the X11 socket between the container and the desktop.  Here's what I use:

    docker run -it -e DISPLAY --net=host jberkus/firefox

This should bring up Firefox in a window on your desktop, under a profile and cache which exist only in the container.   If you want to always dispose of this Firefox without saving anything, add a --rm flag to the above.

If you don't want to paste all of the above from a blog (and really, who does?) I've put up some scripts on Github.

Now, if only I could figure out why fonts aren't rendering correctly in Firefox run this way.  Ideas?

Posted by Josh Berkus
Feb. 20, 2016, 4:17 p.m.

Release 4.17 of the PostgreSQL Buildfarm client

I have just cut release 4.17 of the PostgreSQL Buildfarm client. It is available at

Changes of note:

  • use PGCTLTIMEOUT instead of hardcoded timeout settings
  • shipped config file is renamed to build-farm.conf.sample to avoid overwriting a default config file
  • use configured make command in make_bin_installcheck
  • restrict restart sleeps to Windows platforms
  • fix trigger_exclude pattern in build-farm.conf
  • protect git checkout with a lock if using git_use_workdirs

For the most part, the thing of significance is the PGCTLTIMEOUT change, which follows up on some changes in core code. The new code defaults this setting to 120, but it can be set in the config file. If you don't need any of the other changes (most people probably won't) then just adding something like


to the build_env stanza of your config file should do the trick, and you can happily wait for the next release.
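For reference, the buildfarm config file is a Perl hash, so an entry of the kind meant above would look something like this (a sketch only; the value is illustrative, and you should check build-farm.conf.sample for the exact surrounding syntax):

```perl
build_env => {
    # give pg_ctl more time (in seconds) to start and stop the server
    PGCTLTIMEOUT => '240',
},
```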

Posted by Andrew Dunstan
Feb. 19, 2016, 2:38 p.m.

JSONB, PostgreSQL and Go

Just ran across this excellent pair of blog posts on using JSONB with Go application design to simplify your data model.  Since that blog is not syndicated on Planet Postgres, reblogging it so that more people see it.

If you find other great PostgreSQL content on the web, make sure folks know about it.

Posted by Josh Berkus
Feb. 16, 2016, 10:51 a.m.

Always Do This #5: The Elements of postgresql.conf Style

I won't say that I know the one true way of editing postgresql.conf, but I do.

Posted by Christophe Pettus
Feb. 11, 2016, 7:53 p.m.

Indexes Are Not Cheap

Don't go crazy with indexes.

Posted by Christophe Pettus
Feb. 9, 2016, 10:46 a.m.

Django 1.8/1.9 and PostgreSQL: An Ever-Closer Union

I’ll be speaking about Django and PostgreSQL at PyCon US 2016.

Posted by Christophe Pettus
Feb. 8, 2016, 7:46 p.m.

PostgreSQL High Availability, 2016 Edition

The slides from my talk at PG Day at FOSDEM 2016 are now available.

Posted by Christophe Pettus
Feb. 2, 2016, 6:59 p.m.

Always Do This #4: Put stats_temp_directory on a memory file system

Always set stats_temp_directory to point to a RAM disk.

Posted by Christophe Pettus
Jan. 29, 2016, 2:46 p.m.

Language thoughts

All the languages I've used heavily have one primitive data structure you fall into using as a golden hammer:

  • C: arrays
  • Lisp: lists
  • Perl: hashes
  • JS: "objects"
Et cetera. Python is borderline; it's tempting to abuse dicts, but there's a bit less friction to using proper objects than with Perl.

Posted by Quinn Weaver
Jan. 25, 2016, 9:07 p.m.

PostgreSQL High Availability, 2016 Edition

I’ll be speaking about PostgreSQL High Availability at PGDay 2016 at FOSDEM.

Posted by Christophe Pettus
Jan. 25, 2016, 8:33 a.m.

Fixed length record files

I've had a few requests recently for support for the Fixed Length File FDW that I created as an experiment about 5 years ago. Apparently people are still getting data from COBOL programs or some such source. The code has bitrotted some in the meanwhile, so I'll try to grab a little time to update it, add some test cases and docco, etc.

Posted by Andrew Dunstan
Jan. 21, 2016, 5:40 p.m.

JSON Home Improvement at SCALE 14x

The slides from my presentation, JSON Home Improvement at SCALE 14x, are now available.

Posted by Christophe Pettus
Jan. 21, 2016, 12:53 p.m.

A simple JSON difference function

Building a JSONB difference function with things you already have around the house.

Posted by Christophe Pettus
Jan. 19, 2016, 6:02 a.m.

Buildfarm server moving

Today we're moving the buildfarm server to a new machine and more modern Postgres. Here is yesterday's (corrected) announcement:

Apologies for the late notice.

Tomorrow, January 19th, at 4.00 pm US East Coast time (UTC-5) we will be moving the buildfarm server from its current home at CommandPrompt, where we have been ever since we started, to a machine that is part of the standard core infrastructure. In doing so we will be moving to a) a more modern and supported PostgreSQL version, and b) a machine with more disk space so that our current severe space shortage will be alleviated. In addition, the community would be much better placed to maintain the buildfarm if either JD or I were to fall under a bus.

The outage is expected to last about 4 hours or less, and we will send out notifications when this is complete.

Buildfarm owners who want to avoid getting reporting failures should disable their animals during that time. We don't have an avalanche of commits right now either, but it might also be nice if committers were to refrain from adding changes in the hours leading up to this and until we announce that we're back online, for the benefit of those owners who don't see this message in time.

Thanks in advance for your help and understanding.

And many thanks to CommandPrompt for their constant support over the many years we've been in operation.

In a few hours I will start disabling my 12 buildfarm members.

Posted by Andrew Dunstan
Jan. 16, 2016, 11:06 p.m.

Blocked by rdsadmin

One of our clients on RDS had a VACUUM FREEZE that was hanging for a long time. "I bet it's waiting on a lock," I thought. Yup, but the curious part is what it was waiting on: a mysterious process run by the rdsadmin role (see RECORD 2, below):

SELECT pg_stat_activity.*, pg_locks.mode
FROM pg_stat_activity
JOIN pg_locks USING (pid)
JOIN pg_class ON pg_locks.relation = pg_class.oid
WHERE pg_class.relname = 'users'
AND pg_locks.mode IN ('ShareUpdateExclusiveLock', 'ShareLock', 'ShareRowExclusiveLock', 'ExclusiveLock', 'AccessExclusiveLock');

-[ RECORD 1 ]----+------------------------------
datid | 1234
datname | my_database
pid | 14641
usesysid | 16396
usename | postgres
application_name | psql
client_addr |
client_hostname |
client_port | 5430
backend_start | 2016-01-15 22:05:06.161987+00
xact_start | 2016-01-15 22:14:39.301425+00
query_start | 2016-01-15 22:14:39.301425+00
state_change | 2016-01-15 22:14:39.301429+00
waiting | t
state | active
query | VACUUM FREEZE verbose users;
mode | ShareUpdateExclusiveLock
-[ RECORD 2 ]----+------------------------------
datid | 1234
datname | my_database
pid | 22328
usesysid | 10
usename | rdsadmin
application_name |
client_addr |
client_hostname |
client_port |
backend_start |
xact_start |
query_start |
state_change |
waiting |
state |
query |
mode | ShareUpdateExclusiveLock
Further examination showed that this process was locking only the users table (and its indexes):

SELECT locktype, relation::regclass AS tablename
FROM pg_locks
JOIN pg_stat_activity USING (pid)
WHERE pid = 22328;

locktype | tablename
relation | user_index_a
relation | user_index_b
relation | user_index_c
relation | users_pkey
virtualxid |
relation | users
(13 rows)

Has anyone else seen such a process? I'm curious as to what it is. My current best guess is a vacuum. We opened a ticket with Amazon to ask them, so I'll update this post when Amazon replies.

EDIT 2016-01-17: the above is a braino for "an autovacuum." An Amazon rep wrote back the next day to say that it was, indeed, an autovacuum process, and included some vacuum-tuning tips. So good on them, although it's unfortunate that RDS's privilege system doesn't allow you to see the pg_stat_activity.query field when the rdsadmin role is working on a table you own.

Posted by Quinn Weaver
Jan. 15, 2016, 5:52 a.m.


This is a pretty nice feature in 9.5. The other day I needed to look at some data from a WordPress site that's stored in MySQL. I had a dump of the data, but I'm not terribly proficient at using MySQL tools or their SQL dialect, so I thought I would try out just importing the schema to PostgreSQL using the new IMPORT FOREIGN SCHEMA command. It worked pretty well. First I installed the Postgres rpms from the community repository. Unfortunately, the repository doesn't have the latest mysql foreign data wrapper for 9.5, so I cloned it from github and built and installed it without difficulties. Then I restored the backup into the test database on the local mysql instance. Then there was a small glitch. When I tried to import the schema it complained that it didn't know about the type "tinytext". So I created a domain for this that just mapped it to text, and then re-ran the import, and it worked just fine. Here is the whole import session:

[andrew@dino mysql_fdw]$ psql wp
psql (9.5.0)
Type "help" for help.

wp=# create extension mysql_fdw;
wp=# CREATE SERVER mysql_server
OPTIONS (host '', port '3306');
wp=# create user mapping for andrew server mysql_server;
wp=# import FOREIGN SCHEMA test from server mysql_server into public;
ERROR: type "tinytext" does not exist
LINE 4: comment_author tinytext NOT NULL,
"comment_ID" bigint NOT NULL,
"comment_post_ID" bigint NOT NULL,
comment_author tinytext NOT NULL,
comment_author_email varchar NOT NULL,
comment_author_url varchar NOT NULL,
"comment_author_IP" varchar NOT NULL,
comment_date timestamp NOT NULL,
comment_date_gmt timestamp NOT NULL,
comment_content text NOT NULL,
comment_karma int NOT NULL,
comment_approved varchar NOT NULL,
comment_agent varchar NOT NULL,
comment_type varchar NOT NULL,
comment_parent bigint NOT NULL,
user_id bigint NOT NULL
) SERVER mysql_server OPTIONS (dbname 'test', table_name 'wp_comments');

CONTEXT: importing foreign table "wp_comments"
wp=# create domain tinytext as text;
wp=# import FOREIGN SCHEMA test from server mysql_server into public;

Posted by Andrew Dunstan
Jan. 14, 2016, 2:19 p.m.

Doing a bulk merge in PostgreSQL 9.5

You can do a merge in PostgreSQL 9.5 using things you already have around the house.

Posted by Christophe Pettus
Jan. 8, 2016, 10:24 a.m.

Configuration changes in 9.5: transaction log size

If you downloaded 9.5 after yesterday's release, you might have noticed some changes to postgresql.conf, especially if you copied over your favorite 9.4 config and it refused to start up.  Particularly, the parameter checkpoint_segments is gone, and has been replaced by min_wal_size and max_wal_size.  What does this mean?

Well, checkpoint_segments was the old way we had for users to determine how "big" the transaction log should be.  Users with a large server or a lot of transactions per second would set it high, and users with small VMs and a slow database could set it low.  This had some problems, though:

  1. The default setting worked for pretty much nobody, so you always had to adjust it.
  2. The WAL*  always used the maximum space available, even if it wasn't needed.
  3. Figuring out the largest size your WAL could be required some math and a knowledge of version-specific PostgreSQL internals.
The last was the most confusing part for users; the calculation for maximum WAL size was:

   ( ( checkpoint_segments * 2 ) + 1 ) * 16MB  +  ( wal_keep_segments * 16MB )

... which meant that people generally sized it by picking an arbitrary number and then adjusting up or down based on feedback.
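To see just how opaque that arbitrary-number approach was, here is the old calculation as a quick sketch (plain arithmetic only; the 9.4 defaults of checkpoint_segments = 3 and wal_keep_segments = 0 are used for illustration):

```python
def max_wal_size_pre95(checkpoint_segments, wal_keep_segments):
    """Maximum WAL size in MB under the pre-9.5 formula."""
    segment_mb = 16  # each WAL segment file is 16MB
    return ((checkpoint_segments * 2) + 1) * segment_mb + wal_keep_segments * segment_mb

# Old 9.4 defaults: checkpoint_segments = 3, wal_keep_segments = 0
print(max_wal_size_pre95(3, 0))  # 112 (MB)
```

So the default ceiling was a mere 112MB, which is part of why the default setting worked for pretty much nobody.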

The new parameters are way simpler:
  • min_wal_size: the minimum size the transaction log will be;
  • max_wal_size: the maximum size the transaction log will be (but see below)
This means that your transaction log on disk shouldn't ever be larger than ( max_wal_size + wal_keep_segments * 16MB ).  It is a "soft" limit though; if PostgreSQL gets really behind, or if archiving is failing, it will get higher than max_wal_size.

However, that isn't the really cool part.  Heikki did an amazing thing, in that the WAL is sized dynamically based on how much was used during the previous cycles.  So you can set max_wal_size to some high value (default is 1GB),  and not worry about PostgreSQL using a bunch of extra disk space if it's not needed.  This means that we can set a default which will be "good enough" for 80% of our users, and we have.  This makes me do a little happy dance.

The other 20% may want to tune, still though, so here's some guidelines:
  • if you know your database write traffic is "bursty", with long periods of inactivity followed by furious writes, increase min_wal_size;
  • if you do bulk loads larger than 1GB, increase max_wal_size to the size of a bulk load;
  • if you write more than 1GB of data every 10 minutes, increase max_wal_size;
  • if you get "checkpoints occurring too frequently" messages in the log, try increasing both parameters incrementally.
Most users won't need to touch those parameters at all, though.  Which is as it should be.
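Putting those guidelines together, a tuned entry might look like this (the values here are purely illustrative, not recommendations for any particular workload):

```
# postgresql.conf -- illustrative values only
min_wal_size = 1GB     # raise for bursty write traffic
max_wal_size = 4GB     # raise for bulk loads or sustained heavy writes
```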

* WAL == Write Ahead Log == Transaction Log == XLog

Posted by Josh Berkus
Jan. 7, 2016, 9 a.m.

PostgreSQL 9.5 fills in lots of gaps

The two big features from my point of view in the long-awaited PostgreSQL 9.5, officially released today, are the INSERT ... ON CONFLICT DO NOTHING / UPDATE feature, and addition of CUBE, ROLLUP and more generalized GROUPING SETS. These are two very important developments that go a long way to filling in the gaps in our feature set.

My contribution has been a bit more modest than previously for this release. Here are the things I was involved with, along with others:

  • Add jsonb functions jsonb_set() and jsonb_pretty()
  • Allow text, text array, and integer values to be subtracted from jsonb documents
  • Add jsonb || operator
  • Add json_strip_nulls() and jsonb_strip_nulls() functions to remove JSON null values from documents
  • Add jsonb generator functions to_jsonb(), jsonb_object(), jsonb_build_object(), jsonb_build_array(), jsonb_agg(), and jsonb_object_agg()
  • Add \pset option pager_min_lines to control pager invocation in psql
  • Improve psql's line counting used when deciding to invoke the pager
  • Make psql's \sf and \ef commands honor ECHO_HIDDEN
  • Add statistics for minimum, maximum, mean, and standard deviation times to pg_stat_statements
Original work for the first three of these was done by Dmitry Dolgov, and for the last by Mitsumasa Kondo.

This is the fourth straight release where I've done a substantial amount of work on JSON features, and it will be the last for a while at least.  We've come a long way since Robert Haas and I snuck some JSON stuff into release 9.2 at the very last minute. Despite a somewhat rocky road, I venture to say that if we hadn't done that we would not have made as much progress as we have since then.

Posted by Andrew Dunstan
Jan. 5, 2016, 8:55 a.m.

New year, new job

If you've been following my writing online for the last year, you'll know that I've been pretty excited about Docker and other emerging Linux container technologies.  That's why I'm excited to announce my new position as Community Lead for Project Atomic at Red Hat. I'm really enthused about this, and I plan to make you as excited about Project Atomic and container tech as I am.

Does this mean I'm not working on PostgreSQL anymore?  Of course not. In fact, I'm already working on a PostgreSQL HA solution which relies on Atomic; you'll find out more about it at SCALE14 if you attend. 

It does mean I've stepped down as CEO of PostgreSQL Experts, which position is now occupied by my extremely capable former coworker and Django contributor, Christophe Pettus.  Since Christophe and Stacey Haysler have been effectively (and profitably!) running PGX for the last year, most of our customers won't even notice the change.

It also means that there will be lots more stuff about Linux containers on this blog, most of which will not be tagged with the PostgreSQL tag and thus not appear on Planet Postgres.  So if you're interested in that, you'll want to follow my blog directly.

I'm also in the middle of moving to Portland, Oregon, so expect me to be rather unresponsive for the next two weeks.  After that ... on to new adventures!

Posted by Josh Berkus
Jan. 4, 2016, 8:53 p.m.

Always Do This #3: Log Lock Waits and Temp Files

log_lock_waits = on and log_temp_files = 0

Posted by Christophe Pettus
Jan. 4, 2016, 6:39 p.m.

January SFPUG: What’s New in PostgreSQL 9.5?

I’ll be presenting at the January San Francisco PostgreSQL Users’ Group meeting on what’s new in PostgreSQL 9.5. I hope you can join us!

Posted by Christophe Pettus
Dec. 31, 2015, 8:54 a.m.

Speaker Practice Survey

VM Brasseur and I are doing a speaker training tutorial at SCALE14, entitled "10 Step Program for Great Tech Talks." As experienced speakers who go to a lot of conferences, we have way too much material for a three hour tutorial, and need to know what to cut.  So she had the idea of doing a survey of conference attendees, which we conducted over a couple weeks.  The survey was meant to find out what frequent conference attendees liked about good talks, and didn't like about not-so-good talks.

You can check out the original survey here, although we're no longer paying attention to responses.


First let's find out who our surveyees are. 108 people responded. Given that VM and I broadcasted the survey over Twitter, as did Josh Simmons and Gareth Greenaway, I expect that our sample will be fairly skewed; let's see.  First we asked people whether they just attend talks, or whether they both give and attend them.

As you can see, our sample skews heavily towards presenters, with a minority of pure audience members.  A few wiseacres said they only give talks, and don't attend them, which wasn't supported by their other answers. We also asked how many conferences and talks folks went to in the last year:

So, most of our respondees are frequent conference and/or talk attendees.  This colors the rest of the survey; what we're looking at is what a group of experienced people who go to a lot of talks, and present more than a few, think of other speakers' performance.  I suspect that if we did a survey of folks at their very first tech conference, we would see somewhat different data.

The Good

We asked "Thinking of the presentations you attended with topics and/or speakers you've most enjoyed, what speaker behaviors below do you feel substantially added to the experience?".  The idea of this question was to find out what skills or actions by the speaker were the most important for making a talk valuable.  Each item was rated on a 1-5 scale.

The above graph requires some explanation.  The length of the bars is the average score.  The bars are ranked from the highest to lowest rated practice.  The color of the bars indicates the median score, for cases where the average is deceptively skewed: Red 5, Blue 4, Green 3, and Black 2.

We can see a few things from this.  First, no practice was considered completely unimportant; nothing averaged below 2.1.  Good narrative structure was clearly the best liked practice, with deep subject knowledge and being an energetic speaker as our second and third ranked items.

If you look at the overall rankings, you can see that "content" items are the best loved, whereas talk "extras" and technical improvements are the least important.  Slide visual design, which many speakers spend a lot of time sweating over, just averages 3.0, and the three items involving technical interaction with the audience (demos, participation, and exercises), come in last.  I do wonder whether those three are last because they are so seldom done well, even by experienced speakers, or because they really are unimportant.  Let me know what you think in the comments.

The Bad

To balance this out, we asked surveyees, "Thinking of the presentations you attended with topics and/or speakers you liked, what speaker behaviors below do you feel substantially detracted from, or even ruined, an otherwise good presentation?" The idea was not to find out why sucky presentations sucked, but what things prevented an acceptable talk from being good.  Here's the ratings, which are graphed the same way as the "good" ratings:

The top two speaking problems were not being able to hear or understand the speaker, and the presenter being bored and/or distracted.  The first shows how important diction, projection, and AV systems are to having a good talk -- especially if you're speaking in a language which isn't your primary one.  The second is consistent with the good speaker ratings: people love energetic speakers, and if speakers are listless, so is the audience.  So make sure to shoot that triple espresso before you go on stage.

There were a few surprises in here for me.  Unreadable code on slides, one of my personal pet peeves, was a middle-ranker.  Running over or under time, which I expected to be a major talk-ruiner, was not considered to be.  Offensive content, on the other hand, is a much more widespread problem than I would have assumed.

"Ums" and body language problems, however, ranked at the bottom.  From my own observation, I know that these are common issues, but apparently they don't bother people that much.  Or do they?

The Ugly

Finally, we asked, "If you could magically eliminate just one bad speaker behavior from all presenters everywhere, which one would it be?"  The idea was to find out what one thing was really driving audiences bananas, it was so bad.  Here's the responses:

So, first, there are a few things which are inconsistent with the ratings.  Unpracticed presentations were the #1 issue, and some other items which looked unimportant by rating, like saying "um" a lot, show up as a top pet peeve.  So apparently people hate "um" and interruptions a lot, but they don't see them ruining that many otherwise good talks.  The other top issues match the ratings we already saw.

About a seventh of the responses were ones which received only one or two votes, including an assortment of write-in responses.  Here's a few of the more interesting ones:
  • "Reading slide content to audience" (two votes, including a page-long tirade)
  • "Giving the talk using slides that are really optimized to be downloaded by someone who isn't able to attend the talk.  Presentation slides and downloadable slides are different beasts."
  • "Long personal company or employment history introductions"

On to the Tutorial

So this survey will give us enough information to know what things we can cut back on in order to make our timeslot ... and even a few things to spend more time on!  If you want to learn more about how to be a great speaker, join us at SCALE14 on the morning of Thursday, January 21.

If you just like the graphs, both the graphs and the IPython notebook used to produce them are on our Github repo.

Posted by Josh Berkus
Dec. 23, 2015, 4:20 p.m.

A Christmas present: PostgreSQL 9.5 RC1

It's been much-delayed, but PostgreSQL 9.5 RC1 is finally available.  This release contains not only fixes for all known issues in 9.5 features, it also contains an "ultimate solution" to multixact truncation handling.  Multixacts were the cause of multiple bugfix releases in versions 9.3 and 9.4, and several hackers have taken the opportunity in 9.5 to implement a new design for this.  This has been a large part of the delay in releasing 9.5.

So: if you plan to upgrade to 9.5, download and test now.  Final will be released in early January, barring discovery of new ghastly bugs in the RC.

To make it easier for you to test, I've updated the postgres95-test Docker image, ready for you to download and try out whatever you want.

9.5 brings us a lot of cool features which are likely to justify upgrading for you:

  • Row Level Security
  • Grouping Sets/CUBE/ROLLUP
  • Faster sorts for text
  • FDW push-down and partitioning support
  • BRIN indexes
If you want to read up on the features so you know what to test:

Posted by Josh Berkus
Dec. 23, 2015, 9:16 a.m.

json functions for dynamic field modification in plpgsql, including in triggers.

One of the occasionally annoying things about Plpgsql is that it's very static, and any sort of dynamic access to records, say by using a trigger argument, is difficult. I just conducted a little experiment that seems to make life a bit easier. The json_populate_record() function returns a record of the same type as its first argument. Mostly we just use NULL::record_type here, but it turns out that in a trigger function it works perfectly happily with NEW and OLD, and they can be modified using any sort of dynamic values you like, including values from TG_ARGV.

Here's a simple example:

mydb=# create type rainbow as enum('r','o','y','g','b','i','v');
mydb=# create table foo (a text, b int, c boolean, d rainbow);
mydb=# create or replace function foo_tr()
returns trigger
language plpgsql as $$
begin
if tg_op = 'INSERT' or tg_op = 'UPDATE' then
return json_populate_record(NEW, json_object(TG_ARGV));
end if;
return old;
end;
$$;
mydb=# create trigger foo_tri before insert on foo for each row execute procedure foo_tr(d,i);
mydb=# create trigger foo_tru before update on foo for each row execute procedure foo_tr(c,true);
mydb=# insert into foo values ('x',1,null,'r') returning *;
a | b | c | d
---+---+---+---
x | 1 |   | i
(1 row)

mydb=# update foo set c = null, d = 'o' returning *;
a | b | c | d
---+---+---+---
x | 1 | t | o
(1 row)


Posted by Andrew Dunstan
Dec. 22, 2015, 5:19 a.m.

Numeric scale and precision

From the quick tip file:

I have lost track of the number of times I have seen people get this wrong. The precision of numeric doesn't specify the number of digits allowed before the decimal point; it specifies the total number of digits before and after the decimal point. So the number of digits allowed before the decimal point is the precision minus the scale. And that means that if you want to increase the scale for finer grained numbers without reducing the range of numbers allowed, you need to increase both the precision and the scale.

Let's say you want to have money amounts of US dollars with no fractional cents and up to single trillions of dollars. You would use numeric(15,2). That allows thirteen digits before the decimal and two after. Now if you want to change that to allow fractional cents in hundredths, you will need to use numeric(17,4), which still allows thirteen digits before the decimal but now allows four after.
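The arithmetic, as a tiny sketch (this is just an illustration of the precision/scale relationship, not PostgreSQL code):

```python
def digits_before_point(precision, scale):
    """For numeric(precision, scale), the digits allowed before the decimal point."""
    return precision - scale

# numeric(15,2): thirteen digits before the decimal, two after
print(digits_before_point(15, 2))  # 13
# numeric(17,4): still thirteen before, now four after
print(digits_before_point(17, 4))  # 13
```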

Posted by Andrew Dunstan
Dec. 21, 2015, 6:45 p.m.

Always Do This #2: ssl_renegotiation_limit = 0

Set ssl_renegotiation_limit = 0.

Posted by Christophe Pettus
Dec. 12, 2015, 12:59 p.m.

psql -c will no longer imply --no-psqlrc

Recently some very good changes were made that allow you to have multiple -c and -f arguments to psql. This is going to be quite useful. One of the things that changed as part of this is that -c no longer implies --no-psqlrc. I found this out the hard way when it proved to be the cause of unpleasant failures by one of my buildfarm animals. This animal runs on my normal Unix workstation, where I have a .psqlrc doing various things, and it also runs a non-standard module that calls "psql -c". Until now there was no issue, because -c meant that my .psqlrc was not processed. The solution was to add "--no-psqlrc" to those psql calls. But I suspect this might catch quite a few people once 9.6 is released.

Posted by Andrew Dunstan
Dec. 8, 2015, 2:20 p.m.

Meet the newest member of the PostgreSQL community: MongoDB, Inc.

Well, this story is better told by the guy who broke it, John De Goes.

The TL;DR is that MongoDB will be shipping PostgreSQL as its "legacy BI connector" in version 3.2, using PostgreSQL+Multicorn FDW as a wrapper to connect SQL-based BI systems to Mongo.

I have two thoughts about this:

  1. Welcome to the PostgreSQL community, MongoDB staff!
  2. Multicorn?  I mean, not that Multicorn isn't great, but don't you think you should be using a C-based FDW driver for performance?
Anyway, hopefully the folks at MongoDB will put in a talk at pgNYC about this.  We're always keen to hear about creative uses of FDWs, and if this isn't one, I don't know what is.

PostgreSQL, The Center Of Your Dataverse™

(even if you're using Mongo)

Posted by Josh Berkus
Nov. 21, 2015, 9:53 a.m.

Announcing Release 4.16 of the PostgreSQL Buildfarm client

I have just released version 4.16 of the PostgreSQL Buildfarm client

It can be downloaded from

Several bugs have been fixed, and there are these non-trivial changes:

  • capturing of TAP test logs
  • bin check enabled on Windows
  • rm_worktrees feature
  • an experimental module to run Sepgsql tests
  • try database shutdown following a start failure
  • report all PGBuild module versions to server

rm_worktrees is enabled by putting this in your config file:

rm_worktrees => 1,

The effect is that at the end of a run the checked out work tree is removed, leaving just the git repository, which in turn might be mostly linked to your HEAD branch if you use the git_use_workdirs option. Before a run, the work tree will be checked out again. The net effect is a significant saving in space used. With these two options, the additional space for each branch except when it's actually building is reduced to less than 30MB. On crake, the git mirror, branch trees and cache now come in at about 1.5GB. That's a lot less than it used to be. The additional cost of checking out the worktree each time is very modest.

Shutdown after a start failure tries to remedy a situation where we try to start the server and fail to detect that it has started, even though it has in fact started. So now if we get a failure we try to shut down any server that might have started, just in case. This change is possibly redundant given the recent change where postgres detects that its data directory has disappeared and shuts down when it has, but it's probably worth having anyway.

Posted by Andrew Dunstan
Nov. 14, 2015, 7:20 a.m.

A record with all defaults

Fresh from yesterday's help file:

Say you want to insert a record into your table with all the default values set. Later on you'll update some of the values. You might want to do this so you can get the row's ID field. So you try this:

insert into mytable values () returning my_id;
but you find it doesn't work. VALUES isn't allowed to have an empty list. Here's what you do:
insert into mytable values (default) returning my_id;
This will insert the default value explicitly into the first column in the table, and implicitly into all the other columns. (The standard SQL spelling, insert into mytable default values returning my_id, does the same thing.) It doesn't matter if any of the columns doesn't have an explicit default value set - in that case the default is NULL. Of course, that means this won't work if you have NOT NULL columns without a default set.

Posted by Andrew Dunstan
Nov. 12, 2015, 8:45 p.m.

PostgreSQL link round-up

First, in case you somehow missed it, PostgreSQL 9.5 Beta 2 is now out.  Time for another round of testing!  There are fewer and fewer bugs being found, so we're not far from a final release.  I don't know about anyone else, but we're going into production on Beta 2 in a couple places.  Can't wait any longer for Upsert and RLS.

Second, pgConfSV has announced its keynotes, from CitusData, Pivotal, and -- as a bit of a surprise -- from streaming data thing Kafka (before you ask, I didn't pick the keynotes). I believe registration is still open, so you can still go to the conference next week if you act quickly.

Thirdly, I have a roundup of some cool blogs covering PostgreSQL which aren't on Planet and which you may have missed: one did a terrific two-part article on why PostgreSQL is the best open source database.  Read it here: Part I  Part II

They also covered using JSON in Postgres with Python.

In stranger news, there's an amazingly strange story about Soylent, PostgreSQL, and auto-responding to poisoning accusations.  Yes, Soylent Inc. uses PostgreSQL, why not?  Read the whole weird tale here on Zapier's blog.

That's it for my round up ... if you have some good links, post them in the comments.

Posted by Josh Berkus
Nov. 10, 2015, 10:31 p.m.

wal-e - avoiding the cron emails

If you're like me, you use wal-e for PITR based backups on many of the PostgreSQL servers you manage.  While setting it up, you likely discovered the --terse option greatly reduces the log output.  However, if you run the backup-push job in cron and have email aliases set up properly, you probably receive emails like this every time the backup-push runs:

NOTICE:  pg_stop_backup complete, all required WAL segments have been archived

Why is that?  Well, the default value for client_min_messages in PostgreSQL is NOTICE, which means the client (wal-e in this case) will be sent messages of NOTICE and higher priority.

So, how to change that?  You may have gotten this far and realized there isn't a handy command line switch that wal-e exposes.  That's ok though, wal-e uses libpq, so we just need to use the PGOPTIONS environment variable.

If you're using wal-e with envdir like in the documentation examples, then you can just plunk down a PGOPTIONS file in /etc/wal-e.d/env that looks like this:

-c client_min_messages=WARNING

and that's it. If you're using a wrapper script for wal-e instead of envdir, you just need to export that environment variable before calling wal-e like this:

export PGOPTIONS="-c client_min_messages=WARNING"

Now sit back and enjoy a few less cron emails in the morning.

EDIT: The envdir method only works if you have bash as your cron shell (SHELL=/bin/bash in the crontab) or if you call the command like this:

bash -c "envdir /etc/wal-e.d/env /usr/local/bin/wal-e --terse backup-push /var/lib/postgresql/9.4/main"
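Putting it all together, a crontab entry using the envdir method might look like this (the schedule is illustrative; the paths follow the examples above):

```
SHELL=/bin/bash
# nightly base backup at 02:00
0 2 * * * envdir /etc/wal-e.d/env /usr/local/bin/wal-e --terse backup-push /var/lib/postgresql/9.4/main
```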

Posted by Jeff Frost
Nov. 10, 2015, 4:01 a.m.

Ready to be an Accidental DBA at pgConfSV?

My AccidentalDBA tutorial materials have been updated for pgConfSV next week.  If you are planning to take this tutorial, please read and follow the setup instructions before you get to the conference.

I created this tutorial in 2009 because of the number of people these days who find themselves in charge of a PostgreSQL server ... or many servers ... without any DBA background, training, or desire to be ops.  The tutorial is intended to teach you the minimum you need to keep your PostgreSQL server from falling over so you can get back to whatever your main job is.

You can take it self-paced if you can't make it to pgConfSV.  The tutorial includes all materials and notes.  In this version, I've added the option of a Docker container (preferred), pg_stat_statements, and a restore-to-point-in-time exercise.

If you are going to attend this tutorial, note that it requires advance setup which will be harder to perform once you get to your hotel, and near-impossible if you wait until 10 minutes before it starts.  So read up and install now.

And ... I believe there's still a few registrations left for the tutorial day, so consider signing up.  You have nothing to lose but your fear of database admin!

Posted by Josh Berkus
Nov. 8, 2015, 2:25 p.m.

Latest adventures with a buildfarm animal on Raspberry Pi 2

My buildfarm member axolotl, that runs (or ran) on a Raspberry Pi 2 running a Fedora 21 remix, died the other day, apparently with a corrupted file system.

While I'm trying to revive it, I thought I would try running the stock Fedora spin built for ARM, instead of a remix. This is pretty easy to do using the script found here. It looks like this is a bad idea, though. It's running so slowly just doing a "dnf update" that I wouldn't think of running a buildfarm member on it.

So, if I can't resuscitate the file system on the other SD card, I'll try a respin with a custom kernel again. There is someone maintaining a set here, which you can read about on the Raspberry Pi forums.


fsck on the root partition managed to resuscitate it without apparent problems, so in the end I haven't had to jump through more hoops.

Posted by Andrew Dunstan
Nov. 3, 2015, 8:39 a.m.

Keeping config files under version control - reprise

I recently blogged about the virtue of putting PostgreSQL config files under version control. My attention was drawn to etckeeper, and a day or so ago I started playing with it, and it seems pretty nice. If you keep your configs in /etc like Debian/Ubuntu, then just deploying etckeeper is probably all you need to do. But if you are using Fedora, Redhat, Centos, or anything where the config files aren't in /etc, it's a little bit harder. But not too much. I found a nice little script at serverfault that you can put in etckeeper's commit.d directory that seems to do the trick and happily pulls files outside /etc into the repository. This seems like quite a good solution.
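As a sketch of what such a commit.d hook can do (this is my own illustrative version, not the serverfault script; the paths are assumptions — adjust them for your distro's layout):

```shell
#!/bin/sh
# Illustrative etckeeper commit.d hook: copy PostgreSQL config files
# that live outside /etc into a tracked directory before each commit.
sync_pg_configs() {
    pgdata="$1"    # e.g. /var/lib/pgsql/data (assumption)
    dest="$2"      # e.g. /etc/postgresql-configs (assumption)
    mkdir -p "$dest"
    for f in postgresql.conf pg_hba.conf pg_ident.conf; do
        if [ -f "$pgdata/$f" ]; then
            cp -p "$pgdata/$f" "$dest/"
        fi
    done
}
```

Dropped into etckeeper's commit.d directory with a call like sync_pg_configs /var/lib/pgsql/data /etc/postgresql-configs, etckeeper will then version the copies along with the rest of /etc.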

Posted by Andrew Dunstan
Oct. 23, 2015, 11:10 a.m.

A little permissions problem

So you're on Fedora 22 and you want to run the latest stable PostgreSQL release. You install the PGDG repository package and then install postgresql94-server. You decide to test it out first, so as your normal login user you run initdb and then try to start the server using pg_ctl, something you've done a hundred times before. It should "just work," right? Not this time. You examine the log file and you see it complaining like this:

could not create lock file "/var/run/postgresql/.s.PGSQL.5432.lock": Permission denied
Ouch! I've never needed that before. These files have always gone in /tmp. OK, so why is it different now that we are putting them elsewhere? It turns out that the directory has been created by the RPM with permissions different from /tmp's, and only the postgres user can write there.

My solution (which I hope our RPM maintainers adopt) was to change the permissions on the directory like this, making them the same as those of /tmp:
sudo chmod 1777 /var/run/postgresql
After that when I try to start the database everything is as expected.

Posted by Andrew Dunstan
Oct. 16, 2015, 8:14 a.m.

How to find what's causing that error

Tom Lane just posted a terrific piece of advice about how to find what's causing an error you don't know the source of, which is worth quoting:

What I would recommend is that you get the data onto a non-production
machine where you can play around a bit more. One thing you could do
then is run a build with debug symbols, attach to the backend process
with gdb, and set a breakpoint at "errfinish". Then provoke the error,
and capture a backtrace from the call to errfinish. That would greatly
narrow things down, though it might not be enough to isolate the bug.

I normally like to have debuginfo packages installed even on production machines so I can do this sort of thing quickly.
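In concrete terms, the gdb steps Tom describes look something like this (a sketch of an interactive session; the PID here is hypothetical — use whatever SELECT pg_backend_pid() reports for the session you want to watch):

```
$ gdb -p 54321
(gdb) break errfinish
(gdb) continue
  -- now provoke the error from the client session --
(gdb) bt
```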

Another thing you can try in a non-production environment is to run your test case against a build with assertions enabled and see if it trips an assertion failure. That's often a quick way of finding where problems are occurring.

Posted by Andrew Dunstan
Oct. 8, 2015, 10:27 a.m.

Testing on Fedora

One of the things that is nice about Fedora is how easy it is to test something very quickly. They provide a considerable number of cloud images suitable for spinning up instances fast. In my case I have a machine with Vagrant and VirtualBox installed, and when we got a complaint recently about PostgreSQL misbehaving with some glibc locales in fedora 23, I wanted to test it. After downloading the image I wanted I was able to get running with this:

vagrant box add --name=f23 <downloaded-image-url>
mkdir f23
cd f23
vagrant init f23
vagrant up
vagrant ssh

It took me about 5 minutes from the time I discovered the image I wanted to test to starting on setting up to build PostgreSQL on it, and most of that was download time.

Posted by Andrew Dunstan
Oct. 6, 2015, 8:19 p.m.

Reflections on (Humans) Trusting (Humans') Trust

One thing I've learned from consulting is that you should trust a competent person's feels about a technology, even if they can't immediately come up with arguments to support them. Seemingly vague statements like "Oh, I hate the Foo webserver, it's so flaky" or "Bar is a really annoying language to work in" or "the Baz API just doesn't feel very user-friendly" are actually the distilled product of many hard-won experiences that the speaker may not be able to call to mind individually. I observed this last night at Larry Wall's Perl 6 talk, specifically the Q&A: most of the time he came up with great examples, but sometimes he just had to talk about his own language in broad, esthetic terms. And I'm no Larry Wall, but today I found myself in a similar position when advising a client about which of two virtualization platforms to use. Of course, this applies to people speaking from experience and not from prejudice; you have to know how good someone's judgment is in order to transitively-trust them to have good opinions. That's one soft skill that gives you a major edge in technology. But, contrary to stereotype, most technical people are good at making these kinds of meta-judgments.

Posted by Quinn Weaver
Aug. 31, 2015, 10:07 p.m.

Lock-polling script for ALTER TABLE

One of PostgreSQL's advantages over many other SQL databases is letting users modify database objects with a minimum of locking and fuss, and do it in a transaction so it can be rolled back.  However, a minimum of locking isn't no locking, so one still has to plan for getting a lock to push through the database migration.

For example, say I wanted to add two new columns to a table.  I already know that by making the column nullable, and not setting a default, I can add them with a metadata-only change which requires only an update to the system catalogs.  However, this table change does require an ACCESS EXCLUSIVE lock for a few milliseconds to go through, which can be tricky.  See, an access exclusive lock blocks everything, including reads and autovacuum, which can be a bit messy if your timing is bad.  You can end up waiting for that lock behind a long-running query, and then all the other database traffic can pile up behind you.

Now, if we had ALTER TABLE CONCURRENTLY it would take care of this for you.  But we don't (yet), so you need to fake it with a DO script.  Here's an example DO script which polls for a lock every 2 seconds until it can get one, then pushes through the table change:

    DO $f$
    DECLARE ntries INT := 10;
        sleepytime INT := 2;
    BEGIN
    FOR get_lock IN 1 .. ntries LOOP
            BEGIN
                    LOCK TABLE mytable IN ACCESS EXCLUSIVE MODE NOWAIT;
                    ALTER TABLE mytable ADD COLUMN new_col1 INT,
                            ADD COLUMN new_col2 VARCHAR;
                    RAISE INFO 'table updated';
                    RETURN;
            EXCEPTION
                    WHEN lock_not_available THEN
                            PERFORM pg_sleep(sleepytime);
            END;
    END LOOP;
    RAISE INFO 'unable to obtain lock after % tries', ntries;
    END;
    $f$;

The idea here is that you keep trying to LOCK ... NOWAIT, which will throw an error if it can't get the lock immediately.  Then it sleeps and tries again 2 seconds later.  If you're using 9.3 or later, you can take an even better approach, using lock_timeout:

    DO $f$
    DECLARE ntries INT := 10;
        sleepytime INT := 2;
    BEGIN
    SET lock_timeout = '100ms';
    FOR get_lock IN 1 .. ntries LOOP
            BEGIN
                    LOCK TABLE mytable IN ACCESS EXCLUSIVE MODE;
                    ALTER TABLE mytable ADD COLUMN a INT,
                            ADD COLUMN b INT,
                            ADD COLUMN c INT;
                    RAISE INFO 'table updated';
                    RETURN;
            EXCEPTION
                    WHEN lock_not_available THEN
                            PERFORM pg_sleep(sleepytime);
            END;
    END LOOP;
    RAISE INFO 'unable to obtain lock after % tries', ntries;
    END;
    $f$;



The advantage of this is that it doesn't have to get the lock immediately; it waits 100ms, then gives up on getting the lock.  That helps in situations where the table is never completely free of read queries, without risking a serious pile-up of requests.

You should be able to adapt this approach to the migrations you actually need to push through.

Posted by Josh Berkus
Aug. 28, 2015, 3:16 p.m.

Stupid Hacks: Dictionary replace function

I write a lot more PL/pgSQL than I'd like to.  Not that I don't like SQL, but as a language PL/pgSQL really shows its thrown-together origin; it's like 80's primitive.  One thing that PL/pgSQL lacks is good string manipulation tools, which is particularly tragic given that the #1 thing to do in PL/pgSQL is to generate queries from parameters and run EXECUTE.

Postgres has two built-in ways to do string substitution: concatenation and format().  Both have drawbacks.  Let me give you an example:

EXECUTE 'SELECT ' || col1 || ', ' || col2 || ' FROM ' || userschema ||
  '.accounts WHERE ' || filterclause || ' ORDER BY ' || col1 || ',' || col2;

EXECUTE format('SELECT %s, %s FROM %s.accounts WHERE %s ORDER BY %s, %s', col1, col2, userschema, filterclause, col1, col2);

You can see the problem here.  Both formats are hard to read and hard to maintain.  Python and Perl have a good fix for this: dictionary/hash-based string substitution, where you can swap in the dictionary keys for the values.  So I wrote up a quick hack to do this in PL/pgSQL.

Here's the dict-replace function.
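The real function is linked above; as a rough sketch of what such a function can look like (this illustrative version is not the author's exact implementation, and assumes the vardict is a flat JSON object of name/value pairs):

```sql
CREATE OR REPLACE FUNCTION replace_vars(tpl text, vardict json)
RETURNS text LANGUAGE plpgsql IMMUTABLE AS $f$
DECLARE
    result text := tpl;
    k text;
BEGIN
    -- swap each ${key} in the template for its value from the dict
    FOR k IN SELECT json_object_keys(vardict) LOOP
        result := replace(result, '${' || k || '}', vardict ->> k);
    END LOOP;
    RETURN result;
END;
$f$;
```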

Using it, you'd replace the above with:

EXECUTE replace_vars('SELECT ${col1}, ${col2} FROM ${userschema}.accounts
  WHERE ${filterclause} ORDER BY ${col1}, ${col2}', vardict);

Of course, you need to first set up the vardict as a JSON value, in the form:

vardict := '{ "col1" : "username", "col2" : "branchname", ... }';

Still, much more readable, eh?  No refcounting, no repeating variables, no string breaks.  I used Bash's variable substitution syntax of ${var} because it seemed like the thing least likely to conflict with user text, unlike anything involving %.

There's some caveats, though: it's not performant, and you could probably find strings which will break it, particularly if you're swapping in JSON values.  It's not SQL-injection safe, so don't use it for stuff handling user input.  You still have to do your own quoting of strings.  And if you have access to PL/Perl or PL/Python you don't need this nonsense.

But for the minimal case, it should help.

Posted by Josh Berkus
Aug. 14, 2015, 4:14 p.m.

Most of the talks chosen for pgConfSV

Since we're still working on the schedule, I wanted to give folks a preview of the talks we've picked for pgConfSV.  We still have a few more talks to pick and the tutorials to finalize.  Regardless, it's looking like a really exciting lineup! 
We'll have folks from Heap, Wanelo, TreasureData, Rackspace, EMC, Joyent, Square, and more.   The sad part will be missing three talks in each timeslot.

Note that this set of talks is not final; we're still swapping a couple of things, and some speakers have not confirmed.

pgConfSV is November 17th and 18th in South San Francisco.  If you're going to AnsibleCon, that's the day after, so you can do both!

Posted by Josh Berkus
Aug. 12, 2015, 10:03 a.m.

Two Great Tastes That Taste Great Together: cstore + Pipeline

cstore_fdw, the column-store extension for PostgreSQL by CitusData, is a really good way to add compressed storage for archival data, and analytic data intended to be aggregated, to your application.  Because it's a column store, though, cstore wants new data added in batches, the bigger the better.  This means that you need to find some way to batch-up incoming data, preferably one able to accept a lot of new data rapidly. 

This is where PipelineDB comes in.  Since PipelineDB is open source now, and based on 9.4.4, I can add extensions to it, including cstore_fdw.  I've done so with the PipelineDB 0.7.7a Docker container, so if you use Docker it's simply available.

As a demonstration of this, I'll set up some fake clickstream data, archived to a cstore table hourly.  First, I wrote a quick python script to generate it continuously and push it to a Pipeline stream.

Then I created the stream and continuous view in PipelineDB:

    CREATE STREAM seenstream ( user_id int, page_id int, ts timestamptz );

    CREATE CONTINUOUS VIEW seenfeed as select user_id, page_id, ts
    FROM seenstream
    WHERE arrival_timestamp > ( clock_timestamp() - interval '90 minutes' );

Next, I created the cstore table:

    CREATE EXTENSION cstore_fdw;

    CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

    CREATE FOREIGN TABLE seen (
        user_id int,
        page_id int,
        ts timestamptz
    )
    SERVER cstore_server
    OPTIONS (compression 'pglz');

Finally, I added a simple script which ran the following query once per hour:

    INSERT INTO seen
    SELECT user_id, page_id, ts
    FROM seenfeed
    WHERE ts >= ( now() - interval '1 hour' )
    ORDER BY user_id, page_id, ts;

... and then I started everything in motion.

Now, for cstore the ORDER BY is vitally important; it determines how the blocks you create for the column store are organized.  In this particular case, I knew that I would be doing more analysis by user.  But most users would do ORDER BY ts instead.

After a few hours, I checked back, and now I can run some analytical queries on the cstore table.  For example, user activity:

    select user_id, count(distinct page_id), max(ts)
    from seen group by user_id;

     user_id | count |              max
    ---------+-------+-------------------------------
           1 |    92 | 2015-08-11 22:59:51.504777+00
           2 |    86 | 2015-08-11 22:54:09.77318+00
           3 |    89 | 2015-08-11 22:59:14.574697+00

page activity:

    select page_id, count(distinct user_id) as duv, max(ts) as last
    from seen group by page_id order by duv desc;

    page_id | duv |             last
    --------+-----+-------------------------------
         71 |  96 | 2015-08-11 22:59:38.690743+00
         99 |  96 | 2015-08-11 22:58:43.004618+00
          4 |  96 | 2015-08-11 22:57:45.95007+00

... and more. 

Now, in a production environment, you'll want to do more than that.  Depending on traffic, you might batch inserts daily instead of hourly.  You might want to use several continuous views to store different summaries of the data instead of raw data.  But the above should be enough to show you how well these two tools go together.

Posted by Josh Berkus
Aug. 11, 2015, 4:09 p.m.

Please, security test our code!

Since Oracle was so nice as to remind everyone what software security is really like with closed-source software, I wanted to remind people how finding and reporting security issues works in PostgreSQL:

  1. Feel free to "reverse engineer" the code.  In fact, here it is on github if you want to scrutinize it.
  2. We generally credit security researchers who find real security holes (with limitations for duplicates, etc.).  Over the last few years, some of the most critical bugs in PostgreSQL were found by professional security researchers doing things like fuzz testing.
  3. If you think you've found a security issue, please report it to the PostgreSQL security team. If it turns out to be a non-issue, we'll tell you, and you can report it as a regular bug.
  4. Be prepared to answer questions about your report.  We showed you our code, you can show us yours.
Our open approach to security is the reason why PostgreSQL was rated by the Database Hacker's Handbook as "the most secure by default": more secure than Oracle.  And why for five Defcons in a row, security hackers have been unable to crack PostgreSQL in the annual Schemaverse competition.

And please ... update your servers! We send out those incremental updates for a reason, and often as not, there are security patches.   More information on our Security Page.

Note: the above is my personal opinion and is not the opinion of The PostgreSQL Project or any other organization.

Posted by Josh Berkus
Aug. 7, 2015, 5:54 p.m.

Understanding Unintuitive TABLESAMPLE Results

In 9.5 Alpha 2:

create table mil ( id int, val text );
insert into mil select i, i::text || '-val'  from
generate_series(1,1000000) as gs(i);

postgres=# select * from mil tablesample system ( 0.04 );
 id | val
----+-----
(0 rows)

Huh, what?

So, what I didn't understand here is the way rows are selected for TABLESAMPLE SYSTEM.  Since SYSTEM is page-based, I thought that we computed the requested % of pages, and then picked that many pages at random.  Since this table has exactly 185 rows per page, it should then return 370 rows (2 pages) every time.  But that's not what happened. In fact, running the following query repeatedly, I got a variety of counts:

SELECT count(*) FROM (select * from mil tablesample system ( 0.04 ) ) as a;

925?  0?  What the hey?

What's really happening is that pages for SYSTEM are selected a different way.  Each page is checked against the probability once.  This means that, while on average you'll get the number of pages you're expecting, the numbers will vary from request to request quite a bit.

This also means that SYSTEM is a bad choice for really small sample sizes, like 0.01%.  BERNOULLI is better, because it'll be checking by row, and therefore the size of the return sample will be much more predictable.  It will still have a bit of variation, though; in my testing, +/- 10% on a few hundred rows.
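For comparison, here's the same 0.04% sample taken with BERNOULLI; on this million-row table the expected return is 400 rows, and repeated runs stay fairly close to that:

```sql
SELECT count(*) FROM (
    SELECT * FROM mil TABLESAMPLE BERNOULLI ( 0.04 )
) AS a;
```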

Gulcin Yildirim has a great explanation of this on the 2nd Quadrant blog.

So, what if you need TABLESAMPLE to get a very specific number of rows for you?  Well, that's why Petr Jelinek wrote the optional (loadable) SYSTEM_ROWS sampling method.  This can be loaded as the tsm_system_rows extension in 9.5.
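Assuming you've installed the extension, using it looks like this — note that SYSTEM_ROWS takes an absolute row count rather than a percentage:

```sql
CREATE EXTENSION tsm_system_rows;

SELECT * FROM mil TABLESAMPLE SYSTEM_ROWS(400);
```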

Hopefully that helps folks be less confused than me about how TABLESAMPLE works.

Posted by Josh Berkus
Aug. 6, 2015, 10:31 a.m.

Time to test 9.5 Alpha 2

PostgreSQL 9.5 Alpha 2 is out.  We're still calling this one an Alpha instead of a Beta, because major replumbing is still underway.  However, most of the issues reported with Alpha 1 have been fixed.  A bunch of really useful bug reports came from regular users who tested out the alpha with their own workloads or tests, so please keep up the good work!

And, if you didn't test Alpha 1, well, we still need your help testing Alpha 2.  Here's some ideas for what to test.

Again, to make testing easier for folks, I've created a docker image of Alpha 2.  If you already run docker, simply run jberkus/postgres95-test and you can test our features without needing to worry about how to install Postgres.  Specific notes on this test image are on the wiki.

As part of updating to alpha2, I managed to shrink the image by about 200MB by getting rid of a lot of build code.  However, I did not remove 150MB of build and compile tools because I'm expecting testers to want to install additional extensions or tools in the image.

Posted by Josh Berkus
July 30, 2015, 11:36 a.m.

Texas Trip: DjangoCon and Postgres Open

This September, I will be having a great Texas adventure, and you're invited along. 

September 8-10, Austin: DjangoCon.US.   I will be presenting "PostgreSQL Performance in 15 Minutes" on the 9th. 

September 9th or 10th, Austin: I will speak at AustinPUG.  Date, location and exact topic still TBA.  Follow the AustinPUG Meetup, or check back here in a week for an update.

September 16-18, Dallas: Postgres Open:  I will be doing my Replication 101 tutorial, and then Explaining Explain.

September is a great time to be in Texas: temperatures are reasonable, Texans are friendly, and there's lots of great Mexican food and barbecue.  So, register for DjangoCon and/or Postgres Open today and join me!

Oh, and the photo? That's one of the hand-thrown DjangoCon pint cups I'm making as speaker gifts for DjangoCon.  Don't you wish you'd submitted a talk, now?

Posted by Josh Berkus
July 28, 2015, 9:52 a.m.

PipelineDB: streaming Postgres

If you've been following the tech news, you might have noticed that we have a new open source PostgreSQL fork called "PipelineDB".  Since I've joined the advisory board of PipelineDB, I thought I'd go over what it is, what it does, and what you'd use it for.  If you're not interested in Pipeline, or Postgres forks, you can stop reading now.

PipelineDB is a streaming database version of PostgreSQL.  The idea of a streaming database, first introduced in the PostgreSQL fork TelegraphCQ back in 2003, is that queries are run against incoming data before it is stored, as a kind of stream processing with full query support.  If the idea of a standard database is "durable data, ephemeral queries" the idea of a streaming database is "durable queries, ephemeral data".  This was previously implemented in StreamBase, StreamSQL, and the PostgreSQL fork Truviso. In the Hadoop world, the concept is implemented in Apache SparkSQL.

On a practical level, what streaming queries do is allow you to eliminate a lot of ETL and redundant or temporary storage.

PipelineDB 0.7.7 contains 100% of PostgreSQL 9.4, plus the ability to create Continuous Views: standing queries which produce different data each time you query them, depending on the incoming stream.  The idea is that you create the queries which filter and/or summarize the data you're looking for in the stream, and store only the data you want to keep, which can go in regular PostgreSQL tables.

As an example of this, we're going to use PipelineDB to do tag popularity counts on Twitter.  Twitter has a nice streaming API, which gives us some interesting stream data to work with.  First I spun up a PipelineDB Docker container.  Connecting to it, I created the "twitter" database and a static stream called "tweets":

Creating a static stream isn't, strictly speaking, necessary; you can create a Continuous View without one.  As a career DBA, though, implied object names give me the heebie-jeebies.  Also, in some future release of PipelineDB, static streams will have performance optimizations, so it's a good idea to get used to creating them now.

    docker run pipelinedb/pipelinedb
    josh@Radegast:~$ psql -h -p 6543 -U pipeline
    Password for user pipeline:
    psql (9.4.1, server 9.4.4)
    Type "help" for help.
    pipeline=# create database twitter;
    pipeline=# \c twitter
    twitter=# create stream tweets ( content json );

Then I created a Continuous View which pulls out all identified hashtags from each tweet.  To do this, I have to reach deep inside the JSON of the tweet structure and use json_array_elements to expand that into a column.  Continuous Views also automatically add a timestamp column called "arrival_timestamp" which is the server timestamp when that particular streaming row showed up.  We can use this to create a 1-hour sliding window over the stream, by comparing it to clock_timestamp().  Unlike regular views, volatile expressions are allowed in Continuous Views.

    CREATE CONTINUOUS VIEW tagstream AS
    SELECT json_array_elements(content #>
      ARRAY['entities','hashtags']) ->> 'text' AS tag
    FROM tweets
    WHERE arrival_timestamp >
          ( clock_timestamp() - interval '1 hour' );

This pulls a continuously updated column of tags which appear in the San Francisco stream.

Then I created a linked Docker container with all of the Python tools I need to use TwitterAPI, and then wrote this little script based on a TwitterAPI example.  This pulls a stream of tweets with geo turned on and identified as being in the area of San Francisco.  Yes, INSERTing into the stream is all that's required for a client to deliver stream data.  If you have a high volume of data, it's better to use the COPY interface if your language supports it.
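For example, a client could deliver a single event with nothing more than this (the JSON here is a simplified stand-in for Twitter's real tweet structure):

```sql
INSERT INTO tweets (content)
VALUES ('{"entities": {"hashtags": [{"text": "PostgreSQL"}]}}');
```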

Then I started it up, and it started pushing tweets into my stream in PipelineDB.  After that, I waited an hour for the stream to populate with an hour's worth of data.

Now, let's do some querying:

    twitter=# select * from tagstream limit 5;

How about the 10 most popular tags in the last hour?

    twitter=# select tag, count(*) as tag_count from tagstream group
              by tag order by tag_count desc limit 10;
         tag       | tag_count
    ---------------+-----------
    Hiring         |       211
    Job            |       165
    CareerArc      |       163
    Jobs           |       154
    job            |       138
    SanFrancisco   |        69
    hiring         |        60
    FaceTimeMeNash |        42
    CiscoRocks     |        35
    IT             |        35

I detect a theme here.  Namely, one of sponsored tweets by recruiters.

Now, obviously you could do this by storing all of the tweets in an unlogged table, then summarizing them into another table, etc.  However, using continuous views avoids a bunch of disk-and-time-wasting store, transform, store again if you're uninterested in the bulk of the data (and tweet metadata is bulky indeed). Further, I can create more continuous views based on the same stream, and pull different summary information on it in parallel. 

So, there you have it: PipelineDB, the latest addition to the PostgreSQL family.  Download it or install it using the Docker container I built for it.

Posted by Josh Berkus
July 24, 2015, 12:15 p.m.

Using Docker to isolate VPN clients

As a consultant, I need to use VPNs to log into many of our client environments.  As one of our data security policies, our staff are not allowed to connect to multiple customer VPNs from the same machine at the same time; plus many VPNs do not permit this.  However, this causes some work efficiency issues whenever we need to run a job for a customer which involves a lot of wait time, such as building an index.

I've used VirtualBox VMs as isolation environments, but VBox consumes a lot of memory and system resources.  So I started thinking: could I use Docker containers to supply some network/file isolation while connected to multiple VPNs?  The answer is yes.

Let's take the difficult case first: a Cisco IPSec VPN.  First, I exported the VPN configuration from Gnome's network manager, which creates a PCF file.  Next, I start a Docker container based on the official Ubuntu Trusty image, giving it the net_admin Linux capability to allow the VPN to work.  I also need to mount the new PCF file into a directory in the container so I can access it.

     docker run -it --cap-add net_admin \
        --volume /home/josh/vpns:/mnt/vpn ubuntu:trusty

After this I need to install the connection software:

     apt-get update
     apt-get install vpnc ssh

Once vpnc is installed, I need to convert that PCF file to a vpnc file and install it:

     cd /mnt/vpn
     pcf2vpnc vpn.pcf /etc/vpnc/vpnc.conf
     chmod 600 /etc/vpnc/vpnc.conf

Once that's all set up, I can start the VPN:

     root@c5b7802ca120:/# vpnc vpnc
     Enter password for jberkus@
     VPNC started in background (pid: 78)...

Success!  Sometimes, though, you'll see this error message:

         Cannot open "/proc/sys/net/ipv4/route/flush"

I haven't pinned down what causes that yet, but it seems to have no effect on actual ability to use the vpnc VPN.  Probably I need one more --cap-add, but I don't yet know what it is.

Anyway, once you have that running, docker commit the image and you have a saved VPN image for that particular VPN (just remember not to upload it to Docker Hub).  Then you can run the image whenever you need that VPN.

OpenVPN is actually more complicated, because device access as well as net access is required.  On the up side, you only need the config file to actually connect.  See this project on how to set up your container, and then run:

       openvpn --config /mnt/vpn/config.ovpn

Now, the caveat:  Docker containers do not supply nearly the levels of isolation which a full VM would, especially since we have to grant the "net_admin" cap to get the VPN to work.  While containers prevent accidental network cross-connection, they would not block targeted malware which attacks the network stack. In our case, that means I can use this technique for our regular clients, but not for PCI-compliant customers or other clients with strong regulated data isolation requirements.

But for the others, I can now easily run 2 VPNs at a time, while protecting them from cross-connection, and me from accidentally targeting the wrong customer's server.

Thanks to Rikaard Hosein, programmerq, and other helpful folks on IRC channel #docker for advice on this.

Posted by Josh Berkus
July 23, 2015, 1:46 p.m.

unsafe UPSERT using WITH clauses

By now you've read about PostgreSQL 9.5 and our shiny new UPSERT feature.  And you're thinking, "hey, 9.5 is still in alpha, how do I get me some UPSERT action now?"  While there are some suggestions for workarounds on the PostgreSQL wiki, I'm going to show you one more method for approximating UPSERT while you help us test 9.5 so that it will come out sooner.

The general suggested way for handling UPSERT situations is one of the following:

  1. Try to do an INSERT.  On error, have the application retry with an UPDATE.
  2. Write a PL/pgSQL procedure which does insert-then-update or update-then-insert.

Both of these approaches have the drawback of very high overhead: the first involves multiple round-trips to the database and lots of errors in the log, and the second involves major subtransaction overhead.  Neither is concurrency-safe, but then neither is the method I'm about to show you.  At least this method avoids a lot of the overhead, though.
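Approach 1 looks roughly like this. It's sketched here with Python's sqlite3 module purely for portability (the table and names are made up); the insert-then-retry shape is the same with any driver against PostgreSQL:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")

def upsert(con, id_, val):
    # Approach 1: try the INSERT first; on a key violation, retry as an UPDATE.
    try:
        con.execute("INSERT INTO t VALUES (?, ?)", (id_, val))
    except sqlite3.IntegrityError:
        con.execute("UPDATE t SET val = ? WHERE id = ?", (val, id_))

upsert(con, 50, 'aaa')   # no such row yet: the INSERT succeeds
upsert(con, 50, 'bbb')   # key violation: falls back to the UPDATE
```

Note the extra round-trip on the retry path; that, plus the error noise in the server log, is the overhead described above.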

What's the method?  Using writeable WITH clauses.  This feature, introduced in 9.1, allows you to do a multi-step write transaction as a single query.  For an example, let's construct a dummy table with a unique key on ID and a value column, then populate it:

     create table test_upsert ( id int not null primary key, 
        val text );
     insert into test_upsert select i, 'aaa'
       from generate_series (1, 100) as gs(i);

Now, let's say we wanted to update ID 50, or insert it if it doesn't exist.  We can do that like so:

    WITH newrow ( id, val ) AS (
        VALUES ( 50::INT, 'bbb'::TEXT ) ),
    tryupdate AS (
        UPDATE test_upsert SET val = newrow.val
        FROM newrow
        WHERE test_upsert.id = newrow.id
        RETURNING test_upsert.id )
    INSERT INTO test_upsert
    SELECT id, val
        FROM newrow
    WHERE id NOT IN ( SELECT id FROM tryupdate );

The above tries to update ID=50.  If no rows are updated, it inserts the row.  This also works for multiple rows:

    WITH newrow ( id, val ) AS (
        VALUES ( 75::INT, 'ccc'::TEXT ),
               ( 103::INT, 'ccc'::TEXT ) ),
    tryupdate AS (
        UPDATE test_upsert SET val = newrow.val
        FROM newrow
        WHERE test_upsert.id = newrow.id
        RETURNING test_upsert.id )
    INSERT INTO test_upsert
    SELECT id, val
        FROM newrow
    WHERE id NOT IN ( SELECT id FROM tryupdate );

... and will update or insert each row as called for.

Given that we can do the above, why do we need real UPSERT?  Well, there's some problems with this approximate method:
  • It's not concurrency-safe, and can produce unexpected results given really bad timing of multiple connections wanting to update the same rows.
  • It will still produce key violation errors given bad concurrency timing, just fewer of them than method 1 above.
  • It's still higher overhead than 9.5's UPSERT feature, which is optimized.
  • It will return INSERT 0 0 from calls that do only updates, possibly making the app think the upsert failed.
  • It's not safe to use with INSERT/UPDATE triggers on the same table.

But ... if you can't wait for 9.5, then at least it's a temporary workaround for you.  In the meantime, download 9.5 alpha and get testing the new UPSERT.

Posted by Josh Berkus
June 9, 2015, 1:08 p.m.

Baidu Bot Blues

A client complained of mysterious connection spikes. "We think we're getting hit by bots, but we're not sure," they said; "Can you look into it?"

So I edited postgresql.conf, turned on log_connections and log_disconnections, did a 'pg_ctl -D MY_DATA_DIR reload', and waited. (In my case I also set log_min_duration_statement = 0, but I don't recommend that unless you can watch to make sure the logs aren't filling the log partition, and you're sure you won't hammer a slow log device.)
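For reference, the relevant postgresql.conf lines look like this; a reload is enough to apply them, no restart needed:

```
log_connections = on
log_disconnections = on
# Optional and very noisy; see the caveat above about log volume:
# log_min_duration_statement = 0
```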

A while later, I did

ls -t data/pg_log/*.csv | head -5 | xargs grep -c connection

(That says "List log files in order from most recent to least recent, take the five most recent, and count the lines in each file containing 'connection'.")

I got lucky. A spike was obviously in progress; the current log file showed a count two orders of magnitude higher than the other log files'. Since the problem was in progress, no post mortem was required; I just mailed the client and asked "Do your web logs show bot traffic right now?"

The client mailed right back saying, "Yes, we see a bunch of connections from Baidu's web crawler; it's going nuts hitting our site."

Baidu is a Chinese search engine. It turns out its crawling bot is well-known for being aggressive. As that link explains, the bot doesn't respect Crawl-delay in robots.txt (that's annoying but unsurprising, since Crawl-delay is a non-standard extension).

So if this happens to you, you have these options:

  1. Register at Baidu's site (Chinese-only) to request less-aggressive crawling.
  2. Try to tune PostgreSQL connection pooling to handle the onslaught.
  3. Block Baidu entirely. This is an extreme option — not advised if you do business in China, or if you have Chinese-speaking customers elsewhere who are reaching you through Baidu. The latter requires analytics to figure out.
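For option 2, connection pooling usually means putting a pooler such as PgBouncer in front of PostgreSQL. A minimal sketch of the idea; the database name and the numbers here are illustrative, not from the client's actual setup:

```ini
; pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_port = 6432
pool_mode = transaction
max_client_conn = 1000   ; absorb connection spikes from crawlers
default_pool_size = 20   ; while capping the number of real PostgreSQL backends
```

The point of the split is that a bot storm consumes cheap pooler slots instead of expensive PostgreSQL backends.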

Posted by Quinn Weaver
May 8, 2015, 2:37 p.m.

syslog Considered Dangerous

Edit 2015-10-01: there's a similar problem on RDS, where logs are written to the same partition as data, which can cause problems if you don't have enough provisioned IOPS (or disk space).

A client of ours was suffering mysterious, intermittent lock pileups on a large table. To diagnose the problem, I ran our lock-logging scripts.

Querying the resulting log_transaction_locks table, I saw lots of extend locks and RowExclusiveLocks piling up behind normal UPDATEs (and INSERTs) — there were even extend locks waiting on other extend locks. This is pretty unusual; you'd expect that when one UPDATE completed, the extend lock would succeed, PostgreSQL would quickly extend the relation (in this case, the big table's TOAST relation), and queries would keep running smoothly. These were all single-row writes; they shouldn't have taken that long. However, a complicating factor was that a couple of columns in that big table were giant TEXT columns storing JSON strings. Was that the issue? The problem was unclear, and we needed more info.

At this point my colleague Christophe Pettus made the excellent suggestion to do 'strace -p' on one of the locked processes. The dump revealed lots of sendto() calls with giant strings, which he immediately recognized as syslog messages based on their formatting.

Now, this was a problem. syslog writes to /var, and /var was hanging off the root partition, which was an EBS volume (without provisioned IOPS). So basically the entire SQL of each query, including those giant JSON strings, was being written across the network, and the query itself wouldn't terminate till the write was fsync()'ed. In the meantime, extend locks and other UPDATEs piled up behind it. The PostgreSQL data directory was on a big, fast software RAID across SSD instance storage, but performance was being bottlenecked by that single EBS volume.

The client elected to turn off log_lock_waits, and the problem went away. In addition, Christophe suggested prepending a minus sign to the filename in /etc/rsyslog.d/postgres.conf, like so, which tells syslog not to flush between each log statement:

local0.* -/var/log/postgres.log

I thought that was neat: a potential fix to a huge performance problem, in one character(!)

Lessons learned:

1. If your log_destination contains syslog, be careful about filesystem layout. (If you stick to csvlog, this normally isn't a problem, because csvlogs, by default, go in pg_log/ under the PostgreSQL data dir.)

2. Logging info about a performance problem can sometimes make it worse. In this case, we decided to diagnose the pre-existing lock pileups by setting log_lock_waits = on and deadlock_timeout = 200. But, because of the syslog issue, those very settings exacerbated the lock problems. This was necessary to catch the problem, but it's something to keep in mind if you decide to use our lock-logging scripts.

Posted by Quinn Weaver
Sept. 18, 2014, 12:19 p.m.

WAL-E with Rackspace CloudFiles over Servicenet

Found a great walkthrough on setting up WAL-E to use python-swiftclient for storage in Rackspace Cloud Files:

Unfortunately, by default your backups use the public URL for Cloud Files and eat into metered public bandwidth.

The way to work around this is to set the endpoint_type to internalURL instead of the default publicURL.

You do that by setting the following environment variable:


That allows WAL-E to use Servicenet for base backups and WAL archiving which will be much faster and not eat into your metered public bandwidth.

Posted by Jeff Frost
Sept. 17, 2014, 1:18 p.m.

RDS for Postgres: List of Supported Extensions

Today I learned that Amazon doesn't publish any list of the extensions supported in RDS PostgreSQL. Instead, their documentation tells you to start a psql session and run 'SHOW rds.extensions'. But that creates a chicken-and-egg situation if you have an app that needs extensions, and you're trying to decide whether to migrate.

So here's a list of extensions supported as of today, 2014-09-17 (RDS PostgreSQL 9.3.3). I'll try to keep this current.



Posted by Quinn Weaver
April 29, 2014, 6:04 p.m.

How to start PostgreSQL the Debian way with pg_ctl

Sometimes I find myself wanting to start a PostgreSQL cluster with pg_ctl on a Debian or Ubuntu system.  If you've ever tried this, you know that something like this doesn't work:

vagrant@vagrant-ubuntu-trusty-64:/vagrant$ sudo -u postgres /usr/lib/postgresql/9.3/bin/pg_ctl -D /var/lib/postgresql/9.3/main/ start
server starting
vagrant@vagrant-ubuntu-trusty-64:/vagrant$ postgres cannot access the server configuration file "/var/lib/postgresql/9.3/main/postgresql.conf": No such file or directory

because the Debian packages place the config files in /etc/postgresql/<version>/<cluster name>.

So, I used to think that this was the way to start it:

vagrant@vagrant-ubuntu-trusty-64:/vagrant$ sudo -u postgres /usr/lib/postgresql/9.3/bin/pg_ctl -D /etc/postgresql/9.3/main/ start
server starting
vagrant@vagrant-ubuntu-trusty-64:/vagrant$ 2014-04-30 00:59:32 UTC LOG:  database system was shut down at 2014-04-30 00:57:49 UTC
2014-04-30 00:59:32 UTC LOG:  database system is ready to accept connections
2014-04-30 00:59:32 UTC LOG:  autovacuum launcher started

And that does generally work, but the proper way to do it is this:

vagrant@vagrant-ubuntu-trusty-64:/vagrant$ sudo -u postgres /usr/lib/postgresql/9.3/bin/pg_ctl -D /var/lib/postgresql/9.3/main/ -o "-c config_file=/etc/postgresql/9.3/main/postgresql.conf" start
server starting
vagrant@vagrant-ubuntu-trusty-64:/vagrant$ 2014-04-30 01:00:22 UTC LOG:  database system was shut down at 2014-04-30 01:00:16 UTC
2014-04-30 01:00:22 UTC LOG:  database system is ready to accept connections
2014-04-30 01:00:22 UTC LOG:  autovacuum launcher started

The real Debian way is to use pg_ctlcluster like so:

vagrant@vagrant-ubuntu-trusty-64:~$ sudo -u postgres pg_ctlcluster 9.3 main start

Posted by Jeff Frost
April 18, 2014, 3:54 p.m.

Grabbing the statement durations out of a log file using gnu grep

Sometimes I find myself wanting to grep all the query durations out of a file for sorting.  I recently discovered that I can use gnu grep's lookbehind syntax like so:

grep -oP '(?<=duration: )[0-9\.]+' postgresql-Mon.csv

and then sort them numerically like so:

grep -oP '(?<=duration: )[0-9\.]+' postgresql-Mon.csv | sort -n | tail
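The same extraction feeds nicely into awk for quick summary statistics. A sketch with sample log lines fed in via printf, so the numbers are made up:

```shell
# Extract durations (ms) from sample log lines, then count and average them.
printf 'duration: 12.500 ms\nduration: 3.100 ms\n' |
  grep -oP '(?<=duration: )[0-9.]+' |
  awk '{ sum += $1; n++ } END { printf "count=%d avg=%.3f\n", n, sum/n }'
```

On a real log file, replace the printf with the filename argument to grep, as in the examples above.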

Posted by Jeff Frost
April 2, 2014, 4:02 p.m.

Putting stats_temp_directory on a ramdisk

This hack is an old chestnut among PostgreSQL performance tuners, but it doesn't seem to be widely known elsewhere. That's a shame, because it's pure win, and it's ridiculously easy to set up. You don't even need to restart PostgreSQL.

Here's the situation: PostgreSQL continually writes certain statistics to temporary files. These go in the directory given by the stats_temp_directory setting. By default, that's pg_stat_tmp in the data dir. The temp files get written a lot, but there's no need for them to persist.

That makes them perfect candidates for a ramdisk (a.k.a. RAM drive). A ramdisk is a chunk of memory treated as a block device by the OS. Because it's RAM, it's super-fast. As far as the app is concerned, the ramdisk just holds a filesystem that it can read and write like any other. Moreover, PostgreSQL generally only needs a few hundred kilobytes for stats_temp_directory; any modern server can fit that in RAM.

In Linux, you set up a ramdisk like this:

As root:

'mkdir /var/lib/pgsql_stats_tmp' [1]

'chmod 777 /var/lib/pgsql_stats_tmp'

'chmod +t /var/lib/pgsql_stats_tmp'

Add this line to /etc/fstab. That 2G is an upper limit; the system will use only as much as it needs.

tmpfs /var/lib/pgsql_stats_tmp tmpfs size=2G,uid=postgres,gid=postgres 0 0

'mount /var/lib/pgsql_stats_tmp'

Then, as postgres:

Change the stats_temp_directory setting in postgresql.conf:

stats_temp_directory = '/var/lib/pgsql_stats_tmp'

Tell PostgreSQL to re-read its configuration:

'pg_ctl -D YOUR_DATA_DIR reload'

And that's it!

Other operating systems have different ways to set up ramdisks. Perhaps I'll cover them in a later post.

[1] The directory /var/lib/pgsql_stats_tmp is an arbitrary choice, but it works well for Debian's filesystem layout.

Posted by Quinn Weaver
Nov. 21, 2013, 7:42 p.m.

Ubuntu 3.11.0 kernel not as stable as hoped

As I mentioned in my previous post, Ubuntu recently released a 3.11.0 kernel for Precise.  Unfortunately, while it appeared stable during a long burn-in process with various pgbench workloads and scale sizes, it crashed shortly after putting the machine into service, so make sure you test thoroughly on your hardware. 

This particular hardware was the Dell PowerEdge 720 that Rackspace uses.

Guess we're going back to custom compiling kernels for a little while longer.

Anyone else have experiences good or bad with it?

I believe this is the same kernel that's shipping with 13.10 Saucy Salamander.

Posted by Jeff Frost
Nov. 15, 2013, 1:34 p.m.

Ubuntu has released a 3.11.0 kernel for 12.04 LTS!

This one is for those of you compiling a custom kernel or staying on 10.04 LTS because of this issue on database servers in the 3.x kernels:

Ubuntu has released linux-image-3.11.0-13-generic for 12.04 in which that fix is included.  It's actually been included since 3.9rc2, but Ubuntu skipped straight to 3.11.

And there was much rejoicing in the PostgreSQL community.  At least from those of us who manage 12.04 LTS PostgreSQL servers!

For those of you who don't recall, here's a link to the big discussion thread:

One item of note on 3.11 is that the kernel.sched_migration_cost setting is now called kernel.sched_migration_cost_ns, so if you are setting that via sysctl, you will need to update your sysctl.conf or sysctl.d file after upgrade.
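If you pin that setting via a sysctl.d file, the rename looks like this; the value shown is purely illustrative, not a recommendation:

```
# /etc/sysctl.d/postgres.conf
# Pre-3.11 kernels:
# kernel.sched_migration_cost = 5000000
# 3.11 and later (same setting, new name):
kernel.sched_migration_cost_ns = 5000000
```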

Now go upgrade your test servers and let us know how it performs!  I've been burning in a new server for about 24 hours with pgbench and the 3.11.0-13-generic kernel and so far it's been stable and a good performer.

Posted by Jeff Frost
Aug. 22, 2013, 3:57 p.m.


I prefer PostgreSQL, which is the core of our business at pgExperts. But I still have some legacy clients on MySQL. For one of them, I'm writing new features in Perl/MySQL. The mysql_auto_reconnect option just saved my day. I'd been getting errors like "DBD::mysql::st execute failed: MySQL server has gone away." To use it in DBI, do

my $dbh = DBI->connect(
    $dsn, $user, $password,              # your connection details here
    { mysql_auto_reconnect => 1 },
);

To use it in Rose::DB::Object, do

package MyApp::DB;
our @ISA = qw( Rose::DB );

MyApp::DB->register_db(
    # ... driver, database, username, etc. here ...
    connect_options => { mysql_auto_reconnect => 1 },
);

Posted by Quinn Weaver
Aug. 10, 2013, 6:41 p.m.

Get your PgBouncer pools stats into Ganglia

Just whipped up a quick Ruby script to put PgBouncer "show pools" data into Ganglia using gmetric.  I based it heavily on the postgres_gmetric script.

You can find the pgbouncer gmetric script here:

Posted by Jeff Frost
Feb. 27, 2013, 5:38 p.m.

Script to add the repository on Debian/Ubuntu

You've probably heard the announcement: Martin Pitt's PPA is being deprecated.  This is an effort to push folks towards the new official PostgreSQL apt repository.

See the announcement here:

Also, see the deprecation notice on Martin's PPA:

Since we have quite a few customers using Martin's PPA for PostgreSQL packages under Ubuntu, I wrote a little shell script that switches repositories if you're using Martin's PPA, and simply installs the new repository if you're not.

You can find it here:

Posted by Jeff Frost
Feb. 18, 2013, 4:30 p.m.

Python Type Gripe

I've been digging into Python lately. On Friday, I ran into an unintuitive behavior where two sets containing identical objects (of a custom class) were comparing unequal. I bet you can guess the reason: I hadn't defined __hash__ and __eq__ for my class. Take a look at this documentation for __repr__:

For many types, this function makes an attempt to return a string that would yield an object with the same value when passed to eval(), otherwise the representation is a string enclosed in angle brackets that contains the name of the type of the object together with additional information often including the name and address of the object. A class can control what this function returns for its instances by defining a __repr__() method.

"For many?" "makes an attempt?!" "often including?!!" To my way of thinking, this is a little crazy. There should be a single root class in the inheritance hierarchy, and that root should define __repr__, __hash__, and __eq__ methods that operate on/check for equality of each attribute in the class (by value, not by memory address!). Then they would behave consistently for all classes. Then two sets, each containing three objects of the same class, each containing the same attributes with the same values, would compare equal when you used == on them, following the principle of least surprise.

Of course, I can't rewrite Python to make this happen. I'm tempted just to make my own root class that implements these behaviors via metaprogramming, and make every class I ever define a subclass of it. Is there a better way?
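The set surprise that started all this is easy to reproduce. Two toy classes (names are mine) show identity-based versus value-based behavior:

```python
class Plain:
    # No __eq__/__hash__: inherits identity-based versions from object.
    def __init__(self, x):
        self.x = x

class Valued:
    # Hand-written value-based equality and hashing.
    def __init__(self, x):
        self.x = x
    def __eq__(self, other):
        return isinstance(other, Valued) and self.x == other.x
    def __hash__(self):
        return hash(self.x)

assert {Plain(1)} != {Plain(1)}    # distinct objects, so the sets differ
assert {Valued(1)} == {Valued(1)}  # equal by value, so the sets compare equal
```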

PS: A colleague pointed out that this behavior reflects a deliberate decision in Python design philosophy: "Python […] favors informal protocols over inheritance trees […] You can think of __eq__ and __hash__ as kind of being parts of an informal 'collectable' protocol." That sort of makes sense. Of course, the thing about informal protocols is that there's nothing to enforce consistency.

Posted by Quinn Weaver
Feb. 14, 2013, 4:24 p.m.

Finding unreplicated tables and sequences in slony with check_postgres and nagios

Since slony doesn't automatically add tables to replication for you, sometimes it's handy to double check that everything is being replicated.  If your slony cluster is named "sample" then you would have a slony schema named "_sample". As such, you can query against the sl_table and sl_sequence tables to find unreplicated tables like so:

SELECT pg_class.oid::regclass AS table_name
FROM pg_class
JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.oid
WHERE relkind = 'r'
  AND pg_class.oid NOT IN (SELECT tab_reloid FROM _sample.sl_table)
  AND pg_namespace.nspname NOT IN ('information_schema', 'pg_catalog', '_sample');

SELECT pg_class.oid::regclass AS sequence_name
FROM pg_class
JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.oid
WHERE relkind = 'S'
  AND pg_class.oid NOT IN (SELECT seq_reloid FROM _sample.sl_sequence)
  AND pg_namespace.nspname NOT IN ('information_schema', 'pg_catalog', '_sample');

 Those queries are a bit unpleasant to remember, so I like to make them into views like so:

CREATE OR REPLACE VIEW public.unreplicated_tables AS
SELECT pg_class.oid::regclass AS table_name
FROM pg_class
JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.oid
WHERE relkind = 'r'
  AND pg_class.oid NOT IN (SELECT tab_reloid FROM _sample.sl_table)
  AND pg_namespace.nspname NOT IN ('information_schema', 'pg_catalog', '_sample')
  AND pg_class.relname <> 'fwd_epc_map';

CREATE OR REPLACE VIEW public.unreplicated_sequences AS
SELECT pg_class.oid::regclass AS sequence_name
FROM pg_class
JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.oid
WHERE relkind = 'S'
  AND pg_class.oid NOT IN (SELECT seq_reloid FROM _sample.sl_sequence)
  AND pg_namespace.nspname NOT IN ('information_schema', 'pg_catalog', '_sample');

Now, that's a little handier because we can just select straight out of the view.  And now that we have these easy-to-use views, we can use check_postgres's custom_query check to set up a Nagios alert whenever there are unreplicated tables or sequences.

First we need to create the check commands like so:

define command{
        command_name    check_pgsql_slony_unreplicated_tables
        command_line    /usr/lib64/nagios/plugins/ --host $HOSTADDRESS$ --port=5432 --dbuser nagios --dbname mydb --action=custom_query --query="SELECT count(table_name) AS result FROM public.unreplicated_tables" --valtype=integer --warn=1 --critical=5
        }

define command{
        command_name    check_pgsql_slony_unreplicated_sequences
        command_line    /usr/lib64/nagios/plugins/ --host $HOSTADDRESS$ --port=5432 --dbuser nagios --dbname mydb --action=custom_query --query="SELECT count(sequence_name) AS result FROM public.unreplicated_sequences" --valtype=integer --warn=1 --critical=5
        }

I apologize for all the wrapping there.

Now that we have the commands defined, we use them in a service definition:

define service {
    use                          postgres-service
    notifications_enabled        1
    hostgroup_name               slony_primary
    service_description          Slony Unreplicated Tables
    check_command                check_pgsql_slony_unreplicated_tables
    contact_groups               pgexperts
}

define service {
    use                          postgres-service
    notifications_enabled        1
    hostgroup_name               slony_primary
    service_description          Slony Unreplicated Sequences
    check_command                check_pgsql_slony_unreplicated_sequences
    contact_groups               pgexperts
}

And that's it! Now we'll receive an alert whenever an unreplicated table or sequence exists on the servers in hostgroup slony_primary.

Posted by Jeff Frost
Feb. 12, 2013, 2:48 p.m.

Russian Doll Syndrome / Please Monitor Your RAID Volumes

As part of setting up nagios monitoring for customers, I always make sure that the RAID volumes are being monitored on the DB servers.  More often than not, I have to tell the customer there are bad disks in one or more of the RAID volumes.  Don't be one of those customers, make sure you monitor the RAID volumes on your PostgreSQL (and other) servers!

So, today I was setting up some of this same monitoring for LSI RAID controllers in a Dell server using the check_sasraid_megaraid nagios plugin.   This plugin requires the LSI SNMP installer as well as net-snmp and a few other miscellaneous packages.

This brings us to the Russian Doll Syndrome.  I download the zip file from the LSI support site and unzip it to find a zip full of zips:

$ unzip

Ok, fine, they have a zip for each architecture, so let's unzip the Linux_x64 one and get on with this:

$ unzip
  inflating: MD5CheckSum.txt        

Oh, sweet, another zip file!

$ unzip
  inflating: readme.txt             
 extracting: sas_snmp_64bit.tar.gz  

And inside that, we find a tarball:

$ tar xvfz sas_snmp_64bit.tar.gz

And finally an RPM.  Yeesh LSI, why couldn't we just have the RPM right inside the file or even just in the top level zip file?

So, all kidding aside, please make sure you're monitoring the RAID volumes on your PostgreSQL servers!

Posted by Jeff Frost
Feb. 1, 2013, 3:12 p.m.

Why PostgreSQL Doesn't Allow Unsafe Implicit Type Coercions

I got bitten by a MySQL bug today, and it reminded me why we don't allow¹ unsafe type coercions in PostgreSQL. It's easy to talk about them in the abstract, but harder to explain why they're a problem in practice. Here's a really clear example.

I do most of my consulting through pgExperts these days, but I still have a legacy client on MySQL. Today I was writing a new feature for them. Using test-driven development, I was debugging a particular invocation of a five-way join. I found it was returning many rows from a certain right-hand table, when it should have been returning none.

I took a closer look at my SQL and found this code²:

WHERE order.status_id NOT IN ('completed', 'cancelled')

But that's not right! status_id isn't a varchar. It's an int FK to a lookup table:

CREATE TABLE statuses (
  id int(10) unsigned NOT NULL AUTO_INCREMENT,
  name varchar(16) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY name (name)
);

MySQL was silently coercing my integer order.status_id to a varchar, comparing it to 'completed' and 'cancelled' (which of course it could never match), and returning a bunch of right-hand table rows as a result. The filter was broken.
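MySQL isn't the only engine with loose comparisons. The same shape of bug can be sketched with Python's sqlite3 (table and column names are mine): SQLite compares an integer column against a text literal by storage class, so the match can never succeed, and the NOT IN filter silently passes every row:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (status_id INTEGER)")
con.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,)])

# An integer can never equal 'completed' or 'cancelled',
# so NOT IN is true for every row: the filter is a no-op.
surviving = con.execute(
    "SELECT count(*) FROM orders "
    "WHERE status_id NOT IN ('completed', 'cancelled')"
).fetchone()[0]
```

Here `surviving` equals the full row count, exactly the broken-filter behavior described above; PostgreSQL instead rejects the query outright.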

PostgreSQL would never have allowed this. It would have complained about the type mismatch between status_id and 'completed' or 'cancelled'. Rather than a confusing resultset, I'd get a clear error message at SQL compile time:

ERROR: invalid input syntax for integer: "cancelled"
LINE 8: select * from frobs where order.status_id NOT IN ('cancelled', 'c...

Fortunately, I caught and fixed this quickly (hurrah for test-driven development!) The fix was simple:

JOIN statuses stat ON stat.id = order.status_id
WHERE stat.name NOT IN ('completed', 'cancelled')

But under different circumstances this could have seriously burned me.

And that's why we disallow certain implicit type coercions. This post is not intended as a rant against MySQL. Rather, it's an illustration of how a seemingly abstract issue can seriously hose your code. Caveat hacker!

1. Since PostgreSQL 8.3 (2008), that is. Before that, we did allow unsafe coercions.

2. I've changed the code and identifiers to protect client confidentiality.

Posted by Quinn Weaver
Aug. 23, 2012, 5:57 p.m.

PostgreSQL, NUMA and zone reclaim mode on linux

Lately we've been seeing issues with zone reclaim mode on large-memory, multi-processor NUMA Linux systems.

What's NUMA?  It's just an acronym for Non-Uniform Memory Access.  This means that some memory in your system is more expensive for a particular CPU to access than its "local" memory.

You can see how much more distant the kernel considers the different zones by using the numactl command like so:

numactl --hardware

If you've got a modern multiprocessor system, you'll probably see something like this:

available: 2 nodes (0-1)
node 0 size: 48417 MB
node 0 free: 219 MB
node 1 size: 48480 MB
node 1 free: 135 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10 

Here we see a distance of 10 for node 0 to access node 0 memory and 21 for node 0 to access node 1 memory.   What does distance really mean?  It's a cost parameter based on the number of "hops" or buses that separate the node from the distant memory.

Now, what's zone reclaim mode?

From the linux kernel docs:


Zone_reclaim_mode allows someone to set more or less aggressive approaches to
reclaim memory when a zone runs out of memory. If it is set to zero then no
zone reclaim occurs. Allocations will be satisfied from other zones / nodes
in the system.

This is value ORed together of

1       = Zone reclaim on
2       = Zone reclaim writes dirty pages out
4       = Zone reclaim swaps pages

zone_reclaim_mode is set during bootup to 1 if it is determined that pages
from remote zones will cause a measurable performance reduction. The
page allocator will then reclaim easily reusable pages (those page
cache pages that are currently not used) before allocating off node pages.

It may be beneficial to switch off zone reclaim if the system is
used for a file server and all of memory should be used for caching files
from disk. In that case the caching effect is more important than
data locality.

Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
throttle the process. This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore but it preserve the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.

Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset

The file-server paragraph above is the key one: PostgreSQL depends heavily on the filesystem cache, so disabling zone reclaim mode is desirable for PostgreSQL servers too.

There's been a bit of discussion about this on the pgsql-performance mailing list here:

If you've got a modern multi-socket system, odds are good that zone reclaim mode is enabled automatically on boot.  You can check this by looking at /proc/sys/vm/zone_reclaim_mode.

The biggest issue we've seen with zone reclaim mode enabled on customer multi-socket systems is the filesystem cache never filling up even when the database is much larger than RAM.  That's because the system is trying to keep some "local" memory available. After disabling zone_reclaim_mode, the filesystem cache fills up and performance improves.

So, how to disable zone_reclaim_mode?  The best way to do this is via sysctl.  Just add:

vm.zone_reclaim_mode = 0 

to /etc/sysctl.conf, save it and execute sysctl -p to load the new settings into the kernel.

Other interesting non PostgreSQL pages on NUMA/zone_reclaim_mode:

Posted by Jeff Frost
Aug. 2, 2012, 10:16 a.m.

Finding non-UTF8 data in a dump

Recently I was tasked with identifying all rows in a PostgreSQL database that contained non-UTF8 data, in preparation for migrating from SQL_ASCII encoding to UTF8.  Because of the database's size, just using iconv and diff on a dump file wouldn't work.

I started with this perl script from Jon Jensen:

It got me a little way down the road, but what I really wanted was to know which table had the bad row and, better yet, to be able to easily import the bad rows with a COPY command for cleanup and eventual use as temporary tables that can be used with UPDATE statements to fix the production DB.

So, I turned to Python.  I found an isUTF8(text) definition here:

I took that and wrote a script which outputs COPY commands containing only the non-UTF8 rows to STDOUT, and also outputs to STDERR a schema-only dump command covering just the tables with bad data, for ease of creating a temporary DB.
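The heart of such a check is tiny in Python; this is a sketch of the idea (not the actual script), where validity is just whether a UTF-8 decode succeeds:

```python
def is_utf8(raw: bytes) -> bool:
    # A byte string is valid UTF-8 exactly when decoding it succeeds.
    try:
        raw.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

assert is_utf8('héllo'.encode('utf-8'))
assert not is_utf8(b'h\xe9llo')  # latin-1 encoded é is not valid UTF-8
```

Applied line by line to a COPY-format dump, this flags exactly the rows that would fail the SQL_ASCII-to-UTF8 migration.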

You can find the script on github here:

What other methods do you folks use?

Posted by Jeff Frost
May 23, 2012, 7:05 p.m.

A Python script to dump all PostgreSQL DBs

For a long time, I've written all my automation scripts in bourne shell and bash with occasional forays into perl.  I really like bash because you're always guaranteed to have bash installed on a modern linux system and so don't need to install perl and 15 modules to get the job done.

That said, I've recently begun using python for more automation.  As an exercise, I've rewritten my script for python 2.7 trying to use only the standard library.

The script pg_dumps all the databases in a PostgreSQL cluster in the custom format allowing you to specify the compression level.  (See this blog post regarding pg_dump compression level performance.)  It also dumps the globals to a bzip2'd SQL file and then removes files in the dump destination directory that are older than the --keep value (default 7 days).  Finally, if it had any errors during the run, it sends the error log file to the specified email address.  It also picks up libpq environment variables, but will override those with anything specified on the command line.

You can find it here:

Posted by Jeff Frost
May 15, 2012, 2:47 p.m.

Slony Upstart Script for Ubuntu 10.04 LTS

Since Ubuntu is moving towards using upstart instead of SysV init scripts, I thought I'd write a quick upstart job for starting Slony compiled and installed in /usr/local/slony.  Upstart 1.4+ supports setuid, but the version of upstart included with Ubuntu 10.04 LTS is only 0.6.5, and even on Ubuntu 11.10 it's only 1.3, so instead I use start-stop-daemon's --chuid switch.  The job file should be installed as /etc/init/slony.conf. Here's the script:

---cut here---
# slony - slon daemon
# slon is the slony replication daemon

description    "slony slon daemon"

start on net-device-up
stop on runlevel [06]

# can't use setuid until upstart 1.4
#setuid slony
# which is why we use start-stop-daemon with the --chuid switch


script
    exec start-stop-daemon --start --chuid slony --exec /usr/local/slony/bin/slon -- -f /etc/slon.conf >> /var/log/slony/slony.log 2>&1
end script
---cut here---

This will have the slon daemon run as the slony user and log to /var/log/slony/slony.log.  You'll also need a valid slony config file in /etc/slon.conf.

You can also find the script in the pgexperts upstart-scripts git repo:

Posted by Jeff Frost
April 2, 2012, 5:58 p.m.

Using the row_number() window function to reorder images

One of my customers uses the rails acts_as_list module for maintaining order of images on forms.

Through some application issues, they ended up with images that had identical position values for the same provider and image category. So, it ended up looking like this:

select provider_id, image_category_id, position, id from images where image_category_id=1234 order by position;


 provider_id | image_category_id | position |   id
-------------+-------------------+----------+--------
       9999 |              1234 |        3 | 484900
       9999 |              1234 |        4 | 484899
       9999 |              1234 |        4 | 484897
       9999 |              1234 |        4 | 484896
       9999 |              1234 |        4 | 484894
       9999 |              1234 |        4 | 484893
       9999 |              1234 |        4 | 484892
       9999 |              1234 |        4 | 484890
       9999 |              1234 |        5 | 484889
       9999 |              1234 |        6 | 484887
       9999 |              1234 |        7 | 484886
       9999 |              1234 |        8 | 479818
       9999 |              1234 |        9 | 479817
       9999 |              1234 |       10 | 479815

Note that there are several rows with position=4 above. This was a prevalent issue throughout the images table and not just with this example provider.

So, what they wanted was to reorder the images such that rows with equal positions were kept together and the overall order stayed the same.  For example, it should look like this (sorry for the small font):

 provider_id | image_category_id | position |   id
-------------+-------------------+----------+--------
        9999 |             1234 |        3 | 484900  ==> Position 1
        9999 |             1234 |        4 | 484899  ==> Position 2
        9999 |             1234 |        4 | 484897  ==> Position 3
        9999 |             1234 |        4 | 484896  ==> Position 4
        9999 |             1234 |        4 | 484894  ==> Position 5
        9999 |             1234 |        4 | 484893  ==> Position 6
        9999 |             1234 |        4 | 484892  ==> Position 7
        9999 |             1234 |        4 | 484890  ==> Position 8
        9999 |             1234 |        5 | 484889  ==> Position 9
        9999 |             1234 |        6 | 484887  ==> Position 10
        9999 |             1234 |        7 | 484886  ==> Position 11
        9999 |             1234 |        8 | 479818  ==> Position 12
        9999 |             1234 |        9 | 479817  ==> Position 13
        9999 |             1234 |       10 | 479815  ==> Position 14

So, I thought about it for a bit, and then it occurred to me: why not use the row_number() window function?  So I wrote the following migration SQL script:

CREATE TEMP TABLE new_image_positions AS
SELECT id
, row_number AS position
FROM (
    SELECT provider_type
    , provider_id
    , image_category_id
    , position
    , id
    , row_number() OVER (
          PARTITION BY provider_id, image_category_id
          ORDER BY position)
    FROM images
    WHERE provider_type = 'Project'
    AND position IS NOT NULL
) x ;

UPDATE images
SET position = n.position
FROM new_image_positions n
WHERE images.id = n.id;


And, voila, all the images are now properly ordered:

 provider_id | image_category_id | position |   id
-------------+-------------------+----------+--------
        9999 |              1234 |        1 | 484900
        9999 |              1234 |        2 | 484899
        9999 |              1234 |        3 | 484897
        9999 |              1234 |        4 | 484896
        9999 |              1234 |        5 | 484894
        9999 |              1234 |        6 | 484893
        9999 |              1234 |        7 | 484892
        9999 |              1234 |        8 | 484890
        9999 |              1234 |        9 | 484889
        9999 |              1234 |       10 | 484887
        9999 |              1234 |       11 | 484886
        9999 |              1234 |       12 | 479818
        9999 |              1234 |       13 | 479817
        9999 |              1234 |       14 | 479815
        9999 |              1234 |       46 | 403052
        9999 |              1234 |       47 | 403051
        9999 |              1234 |       48 | 403050
        9999 |              1234 |       49 | 403049
        9999 |              1235 |        1 | 484916
        9999 |              1235 |        2 | 484915
        9999 |              1235 |        3 | 484912
        9999 |              1235 |        4 | 484910
        9999 |              1236 |        1 | 493000
        9999 |              1236 |        2 | 492999
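For readers who want to poke at the renumbering logic outside the database, here's a small Python sketch of what row_number() OVER (PARTITION BY provider_id, image_category_id ORDER BY position) computes (illustrative only; the real fix was the SQL above):

```python
from itertools import groupby
from operator import itemgetter

def renumber(rows):
    """Map image id -> new 1-based position within each
    (provider_id, image_category_id) partition, ordered by the old
    position.  Python's sort is stable, so ties keep their input order,
    much as the ties landed in some arbitrary-but-fixed order in SQL."""
    part = itemgetter('provider_id', 'image_category_id')
    ordered = sorted(rows, key=lambda r: (part(r), r['position']))
    new_pos = {}
    for _, group in groupby(ordered, key=part):
        for n, row in enumerate(group, start=1):
            new_pos[row['id']] = n
    return new_pos

rows = [
    {'provider_id': 9999, 'image_category_id': 1234, 'position': 3, 'id': 484900},
    {'provider_id': 9999, 'image_category_id': 1234, 'position': 4, 'id': 484899},
    {'provider_id': 9999, 'image_category_id': 1234, 'position': 4, 'id': 484897},
    {'provider_id': 9999, 'image_category_id': 1235, 'position': 1, 'id': 484916},
]
print(renumber(rows))  # {484900: 1, 484899: 2, 484897: 3, 484916: 1}
```

Note how the duplicate position 4 rows get distinct new positions, and the second category restarts at 1, exactly as in the output above.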

Posted by Jeff Frost
Dec. 23, 2011, 10:22 a.m.

pg_dump compression settings

After doing the base backup benchmarks, I thought it would be interesting to benchmark pg_dump run locally and remotely using all the different compression settings.  This lets me compare the dump times and the space savings, as well as see how much dumping remotely slows the process.  Nobody is storing their dumps on the same server, right?

For the record, all these dumps used pg_dump's custom dump format (-Fc), the PGDATA directory is 93GB on disk and resides on a 4 disk RAID10 of 7200 RPM SATA drives.

CPU on the test server: Intel(R) Xeon(R) CPU X3330  @ 2.66GHz.
CPU on the test remote client: Intel(R) Core(TM) i7 CPU 960  @ 3.20GHz.
Network is gigabit.
The total size reported is from du -sh output.
Both client and test server are running postgresql-9.1.2.

For comparison, our best base backup time from the base backup blog post was: 15m52.221s

The results:

Local pg_dump:

compression | time        | total size
          0 | 19m7.455s   | 77G
          1 | 21m53.128s  | 11G
          2 | 22m27.507s  | 11G
          3 | 24m18.966s  | 9.8G
          4 | 30m10.815s  | 9.2G
          5 | 34m26.119s  | 8.3G
          6 | 41m35.340s  | 8.0G
          7 | 49m4.484s   | 7.9G
          8 | 91m28.689s  | 7.8G
          9 | 103m24.883s | 7.8G

Remote pg_dump:

compression | time        | total size
          0 | 20m1.363s   | 77G
          1 | 22m9.205s   | 11G
          2 | 22m19.158s  | 11G
          3 | 23m7.426s   | 9.8G
          4 | 26m10.383s  | 9.2G
          5 | 28m57.431s  | 8.3G
          6 | 33m23.939s  | 8.0G
          7 | 38m11.321s  | 7.9G
          8 | 62m27.385s  | 7.8G
          9 | 72m5.123s   | 7.8G

So, a few interesting observations:

  • Base backups are indeed faster, but that's no surprise since they have less overhead.
  • Taking the backup locally is only slightly faster than taking it remotely at the smaller compression levels.
  • At compression level 3 and above, the faster CPU on the test client becomes a huge benefit.
  • The default compression appears to be level 6, which might not be what you want if you're concerned about minimizing backup time and have the disk space to spare.
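To make the trade-off concrete, here's a rough marginal-payoff calculation over the local numbers above (times rounded to tenths of a minute by me; a back-of-the-envelope sketch, not part of the benchmark itself):

```python
# Local pg_dump results from the table above: level -> (minutes, size in GB).
local = {
    0: (19.1, 77),  1: (21.9, 11),  2: (22.5, 11),  3: (24.3, 9.8),
    4: (30.2, 9.2), 5: (34.4, 8.3), 6: (41.6, 8.0), 7: (49.1, 7.9),
    8: (91.5, 7.8), 9: (103.4, 7.8),
}

def gb_saved_per_minute(level):
    """GB of disk saved per extra minute of dump time, vs. no compression."""
    base_time, base_size = local[0]
    t, s = local[level]
    return (base_size - s) / (t - base_time)

for level in (1, 3, 6, 9):
    print(level, round(gb_saved_per_minute(level), 1))
```

The payoff drops off a cliff: level 1 buys tens of GB per extra minute, while levels 8 and 9 buy well under one, which is why the low levels look so attractive here.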

Posted by Jeff Frost
Dec. 6, 2011, 3:45 p.m.

PostgreSQL Base Backup Benchmark Results

For a while now we've sort of anecdotally thought that compression should help with base backups, and also that an rsync server is faster than rsync over ssh, so while nothing was happening on our test server, I thought I'd benchmark a few different methods.

Looks like rsync against an rsync server with no compression was fastest, but not by much. pg_basebackup is right there, along with rsync over ssh using any of the three RC4 (arcfour) ciphers, all within a few seconds of one another. blowfish-cbc comes in next, followed by aes128-cbc.

Most interesting is that adding compression quadruples the sync time! That's probably not the case over a slow link, but over gigabit it's definitely the case.

Tests were done with a 93GB PGDATA, gigabit network, linux software 4 disk RAID10 data dir on the first test server copied to linux software 2 disk RAID0 on the second server.

Following are the results:

Without compression:

time rsync test0::data /data/tmp/test-data/ -avP
real 15m52.221s
user 3m10.257s
sys 4m11.460s

time /usr/pgsql-9.1/bin/pg_basebackup -D /data/tmp/test-data -c fast -h test0 -U postgres
real 16m13.440s
user 0m17.314s
sys 2m18.217s

time rsync -e 'ssh -c arcfour' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 16m13.242s
user 8m32.674s
sys 5m28.312s

time rsync -e 'ssh -c arcfour256' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 16m13.656s
user 8m48.356s
sys 6m7.343s

time rsync -e 'ssh -c arcfour128' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 16m3.947s
user 8m44.627s
sys 6m7.499s

time rsync -e 'ssh -c blowfish-cbc' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 26m22.516s
user 19m16.038s
sys 8m45.046s

time rsync -e 'ssh -c aes128-cbc' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 29m30.785s
user 27m31.919s
sys 6m5.154s

time rsync -e 'ssh' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 30m50.301s
user 24m10.951s
sys 9m13.090s

time rsync -e 'ssh -c aes256-ctr' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 38m23.789s
user 31m36.011s
sys 8m39.210s

Our Best Performers With Compression:

time rsync -e 'ssh -c arcfour' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avPz
real 62m31.933s
user 14m48.752s
sys 2m42.412s

time rsync test0::data /data/tmp/test-data/ -avPz
real 62m53.699s
user 6m42.846s
sys 1m44.357s

Since Rod, Joshua, and Rodrigo suggested pigz and netcat in the comments below, I thought I'd give them both a quick try to see if they produced a winner.  Conveniently, GNU tar has a --use-compress-program option, so you can easily use pigz in this manner.

So, tar over ssh with the arcfour256 cipher using pigz for compression:

time ssh -c arcfour256 test0 "cd /data/9.1/data ; tar --use-compress-program=pigz -cf - ." | (cd /data/tmp/test-data ; tar --use-compress-program=pigz -xvf -)

real    18m11.901s
user    7m3.587s
sys     3m53.487s

On test0: tar --use-compress-program=pigz -cf - . | nc -l 7000
On test1: time nc test0 7000 | tar --use-compress-program=pigz -xvf -

real    17m46.805s
user    6m17.562s
sys     4m4.563s

So, while using tar with pigz over ssh or over netcat is a much faster way to compress, both are still a little slower than no compression at all.  This got me wondering whether perhaps the 93GB of PGDATA doesn't compress all that well:

14G     /data/tmp/data-test.tgz

So, that's about 6.6:1 compression. Not bad!  Interestingly, it took GNU tar + pigz about 17m to create the test tgz file, which is almost identical to the time it took to ship it over the network and uncompress/untar it on the other end.
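A quick sanity check on that ratio, from the sizes quoted above (93 GB of PGDATA, 14 GB compressed):

```python
pgdata_gb = 93.0   # size of PGDATA on disk, from du -sh
tarball_gb = 14.0  # size of the pigz-compressed tarball
ratio = pgdata_gb / tarball_gb
print(round(ratio, 1))  # 6.6
```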

That made me try tar with no compression piped over netcat:

On test0: tar -cf - . | nc -l 7000

On test1: time nc test0 7000 | tar -xvf -

real    15m52.313s
user    0m10.033s
sys     3m17.630s

So, tar with no compression over netcat clocked in almost identical to using rsync's native server.

Anyone have any other tools for me to test?

Posted by Jeff Frost