Database consulting and application development...

...because your data is your business

Jan. 2, 2018, 10 p.m.

A Replication Cheat-Sheet

So many types of replication, so little time! What kind of PostgreSQL replication should you use?

Posted by Christophe Pettus
Oct. 30, 2017, 2 p.m.

Commitment Issues

One of the essentials of any database system is that a transaction is either in progress, committed, or rolled back. But consider what happens if… BEGIN; UPDATE table SET money=money+100000; COMMIT; -- And you get an error that the server has disconnected Did that transaction commit, or didn’t it? You don’t really know, because you […]

Posted by Christophe Pettus
Oct. 27, 2017, 1:29 a.m.

Streaming replication stopped? One more thing to check.

We recently were asked by a client to look at a strange replication situation. Replication had simply stopped from a primary to a streaming replication secondary. Everything looked fine: The secondary was connected to the primary, as we could see in pg_stat_replication. The secondary was not receiving any queries, so delays in applying incoming page […]

Posted by Christophe Pettus
Oct. 27, 2017, 1:06 a.m.

“Humans do not have a primary key” at PGConf.EU

The slides for “Humans do not have a primary key” as presented at PGConf EU are available here. Three links from the presentation are: Personal names around the world Kudzu and the California Marriage Amendment General Data Protection Regulations in the EU

Posted by Christophe Pettus
Sept. 7, 2017, 10:19 a.m.

“Humans do not have a primary key” at PGOpen SV

I’ll be presenting “Humans do not have a primary key” at PGOpen SV today. Two links from the presentation are: Personal names around the world Kudzu and the California Marriage Amendment

Posted by Christophe Pettus
Sept. 5, 2017, 8:51 a.m.

A brief reminder about casting TIMESTAMPTZ

I spent a few minutes worrying I had lost my mind: Results coming back to my application were different from those obtained by running psql directly on the database server, for identical queries. As is often the case, the problem was time zones. In particular, when casting a TIMESTAMPTZ to a DATE, the current client […]

Posted by Christophe Pettus
Sept. 4, 2017, 1:18 a.m.

Blog Moved.

This is probably no big deal to anyone, but the blog has moved to a new server. Whee!

Posted by Christophe Pettus
July 9, 2017, 5:19 p.m.

A word of praise for TextBelt

Quite often I need to kick off a long-running process for a client, then resume work immediately once it's done. pg_restore, pg_basebackup, pg_upgrade, and vacuumdb --analyze-only all come to mind as examples. The ideal thing is to get a text message upon completion. When I'm working on my own boxes, I can just mail(1) my mobile carrier's text gateway, but I can't count on clients' servers having sendmail or the like set up (they usually don't).

Enter TextBelt. This service is a dead-simple HTTP-to-SMS gateway. Adapted from their docs:

curl -X POST \
--data-urlencode phone="$MY_MOBILE_NUMBER" \
--data-urlencode message='The process completed.' \

Cleartext HTTP is also supported, in case the client box has broken SSL libraries. My texts are always nondescript, so I don't mind sending them in the clear.

The whole thing is open-source, so you can set up your own TextBelt server. Or you can be lazy and throw a few dollars their way for a certain number of texts. I rather like this business model, actually, as a way to support open-source work.

Posted by Quinn Weaver
July 7, 2017, 3:47 p.m.

The PATH of cron

Short version

Postgres Plus keeps psql out /usr/bin, so you need to set PATH in your cron jobs (including for WAL-E).

Longer version

Like all good people, I set up a cron jobs to run nightly WAL-E base backups. With one client, this failed the first time:

wal_e.main ERROR MSG: could not run one or more external programs WAL-E depends upon
DETAIL: Could not run the following programs, are they installed? psql
STRUCTURED: time=2017-07-07T03:00:01.957961-00 pid=25833
Turns out they were using Postgres Plus, and it puts psql in /opt/PostgresPlus/9.3AS/bin. That directory is in the enterprisedb user's PATH, of course, but not in the minimal PATH that cron jobs get by default. So I had to log in as enterprisedb, echo $PATH, and then paste PATH = [what echo said] at the top of my cron job. Moral of the story: take care when setting up cron jobs for PostgreSQL forks, or for non-standard community PostgreSQL installations.

Posted by Quinn Weaver
June 9, 2017, 11:27 a.m.

shared_buffers is not a sensitive setting

There’s a lot of mystery-cult information floating around out there about what to set shared_buffers to in PostgreSQL. Most of it is, at best, overthinking a fairly simple setting. You can get 90% of the way there, on 95%+ of PostgreSQL systems, just by setting it to 25% of total system memory, to a maximum […]

Posted by Christophe Pettus
June 7, 2017, 3:03 p.m.

Resetting the postgres user’s password

For some reason, this is something that even experienced PostgreSQL people don’t know about, which is: What do you do if you’ve forgotten the postgres user’s password, and you have no other superuser available? The answer is: Bring up PostgreSQL in single-user mode, reset the password, bring it back up in standard mode. Of course, […]

Posted by Christophe Pettus
April 28, 2017, 9 a.m.

A PostgreSQL Response to Uber

The slides from my talk at Percona Live 2017, A PostgreSQL Response to Uber, which I also gave as “Why Uber Was (Mostly) Wrong” at PGDay Nordic 2017, are now available.

Posted by Christophe Pettus
March 31, 2017, 9:34 a.m.

“Corruption War Stories” from PGConf US 2017

The slides from my presentation, Corruption War Stories, are now available.

Posted by Christophe Pettus
March 31, 2017, 9:33 a.m.

“Django and PostgreSQL” from PGConf US 2017

The slides from my presentation, Django and PostgreSQL, are now available.

Posted by Christophe Pettus
March 29, 2017, 7:37 p.m.

Change autovacuum_freeze_max_age without a restart (sort of…)

This blog post is kind of involved, so I'm giving a short version at the top, with some background for beginners at the bottom. The middle section explains the motivation for using this hack in the first place.

Short version

I came up with a useful and/or terrible hack the other day: setting autovacuum_freeze_max_age as a storage parameter. I definitely don't recommend doing this routinely, but it unblocked us during a critical maintenance window.

    ALTER TABLE my_table SET (autovacuum_freeze_max_age = 300000000);

Don't forget to set it back when you're done! Otherwise you will incur an even longer autovacuum freeze, probably when you least expect it.

Medium-length version

My colleague Kacey Holston was in the midst of upgrading a client from PostgreSQL 9.4 to 9.6, using Slony for minimal downtime. As planned, the client took a few minutes of downtime so Kacey could do. She was ready to reverse the direction of replication (so the 9.6 server was replicating to the 9.4 server, in case our client to fall back to it). But there was an autovacuum freeze (a.k.a. "autovacuum (to prevent wraparound)" that was keeping Slony from getting the brief ExclusiveLock it needed.

She knew from experience that this table takes three hours to freeze. But the client had only minutes of downtime scheduled – that was the whole point of using Slony!

If only it were possible to change autovacuum_freeze_max_age on the fly; then we could bump it up to stop that autovacuum. Unfortunately, you have to restart the database in order to change it. Except…

You can set it on a per-table basis, as follows. This took effect immediately:

    ALTER TABLE my_table SET (autovacuum_freeze_max_age = 300000000);

If you do this, don't forget to set it back to the normal value (by default, 200000000) once you're done! Otherwise autovacuum freezes on this table will come around less often and take even longer.

Background for beginners:

When the oldest transaction ID on any row in a table is more than autovacuum_freeze_max_age old (200 million transaction old, by default), then an "autovacuum (to prevent wraparound)" process runs on the table to reclaim old transaction IDs. For large tables, this can be a problem, because it can generate a lot of CPU and I/O activity during busy hours. Also, as we saw here, it locks the table (in a SHARE UPDATE EXCLUSIVE mode); this blocks DDL changes (a.k.a. migrations).

For more-technical background, see the official PostgreSQL docs on how transaction IDs work, and for a friendlier intro, see this series of blog posts.

Posted by Quinn Weaver
March 29, 2017, 7:08 a.m.

PostgreSQL When It’s Not Your Job

My slides from my PGConf US 2017 tutorial, PostgreSQL When It’s Not Your Job, are available now.

Posted by Christophe Pettus
March 3, 2017, 5:20 p.m.

Slides from my talk at SCALE15x

The slides from my SCALE 15x talk: Minimizing PostgreSQL Major Version Upgrade Downtime Using Slony are available here:

You'll want to download the powerpoint version as it includes some embedded screencasts that don't play on slideshare.

Posted by Jeff Frost
Feb. 7, 2017, 2:14 a.m.

Corruption War Stories at PGDay FOSDEM 2017

The slides for my talk Corruption War Stories are now available.

Posted by Christophe Pettus
Jan. 4, 2017, 11 a.m.

Django: dictionary update sequence element #0 has length 1; 2 is required

dictionary update sequence element #0 has length 1; 2 is required usually means a missing name='' in a urlconf.

Posted by Christophe Pettus
Jan. 4, 2017, 9 a.m.

Estimates “stuck” at 200 rows?

If you are seeing the planner estimating 200 rows exactly from an aggregate operation, that might mean that a table is missing planner statistics.

Posted by Christophe Pettus
Jan. 1, 2017, 3:53 p.m.

Django: Site matching query does not exist

Site matching query does not exist means you forgot the record in djangosite, or your forgot SITEID.

Posted by Christophe Pettus
Dec. 30, 2016, 12:50 p.m.

The Multi-Column Index of the Mysteries

Multi-column indexes are more subtle than you might think.

Posted by Christophe Pettus
Dec. 28, 2016, 7:24 p.m.

A Cheap and Cheerful Replication Check

Don't calculate log positions unless you have to.

Posted by Christophe Pettus
Nov. 3, 2016, 6:51 a.m.

Securing PostgreSQL at PGConf EU

The slides for my talk, Securing PostgreSQL at PGConf EU 2016 are now available.

Posted by Christophe Pettus
Nov. 2, 2016, 3:03 a.m.

Unclogging the VACUUM at PGConf EU

The slides for my presentation Unclogging the VACUUM at PGConf EU in Tallinn, Estonia are now available.

Posted by Christophe Pettus
June 4, 2016, 9:34 p.m.

Recent Slides

Slides from my recent presentations at PGCon and PyCon.

Posted by Christophe Pettus
Feb. 16, 2016, 10:51 a.m.

Always Do This #5: The Elements of postgresql.conf Style

I won't say that I know the one true way of editing postgresql.conf, but I do.

Posted by Christophe Pettus
Feb. 11, 2016, 7:53 p.m.

Indexes Are Not Cheap

Don't go crazy with indexes.

Posted by Christophe Pettus
Feb. 9, 2016, 10:46 a.m.

Django 1.8/1.9 and PostgreSQL: An Ever-Closer Union

I’ll be speaking about Django and PostgreSQL at PyCon US 2016.

Posted by Christophe Pettus
Jan. 29, 2016, 2:46 p.m.

Language thoughts

All the languages I've used heavily have one primitive data structure you fall into using as a golden hammer:

  • C: arrays
  • Lisp: lists
  • Perl: hashes
  • JS: "objects"
Et cetera. Python is borderline; it's tempting to abuse dicts, but there's a bit less friction to using proper objects than with Perl.

Posted by Quinn Weaver
Jan. 16, 2016, 11:06 p.m.

Blocked by rdsadmin

One of our clients on RDS had a VACUUM FREEZE that was hanging for a long time. "I bet it's waiting on a lock," I thought. Yup, but the curious part is what it was waiting on: a mysterious process run by the rdsadmin role (see RECORD 2, below):

SELECT pg_stat_activity, pg_locks.mode
FROM pg_stat_activity
JOIN pg_locks USING (pid)
JOIN pg_class ON pg_locks.relation = pg_class.oid
WHERE pg_class.relname = 'users'
AND pg_locks.mode IN ('ShareUpdateExclusiveLock', 'ShareLock', 'ShareRowExclusiveLock', 'ExclusiveLock', 'AccessExclusiveLock');

-[ RECORD 1 ]----+------------------------------
datid | 1234
datname | my_database
pid | 14641
usesysid | 16396
usename | postgres
application_name | psql
client_addr |
client_hostname |
client_port | 5430
backend_start | 2016-01-15 22:05:06.161987+00
xact_start | 2016-01-15 22:14:39.301425+00
query_start | 2016-01-15 22:14:39.301425+00
state_change | 2016-01-15 22:14:39.301429+00
waiting | t
state | active
query | VACUUM FREEZE verbose users;
mode | ShareUpdateExclusiveLock
-[ RECORD 2 ]----+------------------------------
datid | 1234
datname | my_database
pid | 22328
usesysid | 10
usename | rdsadmin
application_name |
client_addr |
client_hostname |
client_port |
backend_start |
xact_start |
query_start |
state_change |
waiting |
state |
query |
mode | ShareUpdateExclusiveLock
Further examination showed that this process was locking only the users table (and its indexes):

SELECT locktype, relation::regclass AS tablename
FROM pg_locks
JOIN pg_stat_activity USING (pid)
WHERE pid = 22328;

locktype | tablename
relation | user_index_a
relation | user_index_b
relation | user_index_c
relation | users_pkey
virtualxid |
relation | users
(13 rows)

Has anyone else seen such a process? I'm curious as to what it is. My current best guess is a vacuum We opened a ticket with Amazon to ask them, so I'll update this post when Amazon replies.

EDIT 2016-01-17: the above is a braino for "an autovacuum." An Amazon rep wrote back the next day to say that it was, indeed, an autovacuum process, and included some vacuum-tuning tips. So good on them, although it's unfortunate that RDS's privilege system doesn't allow you to see the pg_stat_activity.query field when the rdsadmin role is working on a table you own.

Posted by Quinn Weaver
Nov. 10, 2015, 10:31 p.m.

wal-e - avoiding the cron emails

If you're like me, you use wal-e for PITR based backups on many of the PostgreSQL servers you manage.  While setting it up, you likely discovered the --terse option greatly reduces the log output.  However, if you run the backup-push job in cron and have email aliases set up properly, you probably receive emails like this every time the backup-push runs:

NOTICE:  pg_stop_backup complete, all required WAL segments have been archived

Why is that?  Well, the default value for client_min_messages in PostgreSQL is NOTICE, which means the client (wal-e in this case) will be sent messsages of NOTICE and higher priority.

So, how to change that?  You may have gotten this far and realized there isn't a handy command line switch that wal-e exposes.  That's ok though, wal-e uses libpq, so we just need to use the PGOPTIONS environment variable.

If you're using wal-e with envdir like in the documentation examples, then you can just plunk down a PGOPTIONS file in /etc/wal-e.d/env that looks like this:

-c client_min_messages=WARNING

and that's it. If you're using a wrapper script for wal-e instead of envdir, you just need to export that environment variable before calling wal-e like this:

export PGOPTIONS="-c client_min_messages=WARNING"

Now sit back and enjoy a few less cron emails in the morning.

EDIT: The envdir method only works if you have bash as your cron shell SHELL=/bin/bash in the crontab or you call the command like this:

bash -c "envdir /etc/wal-e.d/env /usr/local/bin/wal-e --terse backup-push /var/lib/postgresql/9.4/main"

Posted by Jeff Frost
Oct. 6, 2015, 8:19 p.m.

Reflections on (Humans) Trusting (Humans') Trust

One thing I've learned from consulting is that you should trust a competent person's feels about a technology, even if they can't immediately come up with arguments to support them. Seemingly vague statements like "Oh, I hate the Foo webserver, it's so flaky" or "Bar is a really annoying language to work in" or "the Baz API just doesn't feel very user-friendly" are actually the distilled product of many hard-won experiences that the speaker may not be able to call to mind individually. I observed this last night at Larry Wall's Perl 6 talk, specifically the Q&A: most of the time he came up with great examples, but sometimes he just had to talk about his own language in broad, esthetic terms. And I'm no Larry Wall, but today I found myself in a similar position when advising a client about which of two virtualization platforms to use. Of course, this applies to people speaking from experience and not from prejudice; you have to know how good someone's judgment is in order to transitively-trust them to have good opinions. That's one soft skill that gives you a major edge in technology. But, contrary to stereotype, most technical people are good at making these kinds of meta-judgments.

Posted by Quinn Weaver
June 9, 2015, 1:08 p.m.

Baidu Bot Blues

A client complained of mysterious connection spikes. "We think we're getting hit by bots, but we're not sure," they said; "Can you look into it?"

So I edited postgresql.conf, turned on log_connections and log_disconnections, did a 'pg_ctl -D MY_DATA_DIR reload', and waited (in my case I also set log_min_duration_statment = 0, but I don't recommend that unless you can watch to make sure the logs aren't filling the log partition — and make sure you're not going to hammer a slow log device).

A while later, I did

ls -t data/pg_log/*.csv | head -5 | xargs grep -c connection

(That says "List log files in order from most recent to least recent, take the five most recent, and count the lines in each file containing 'connection'.")

I got lucky. A spike was obviously in progress; the current log file showed a count two orders of magnitude higher than the other log files'. Since the problem was in progress, no post mortem was required; I just mailed the client and asked "Do your web logs show bot traffic right now?"

The client mailed right back saying, "Yes, we see a bunch of connections from Baidu's web crawler; it's going nuts hitting our site."

Baidu is a Chinese search engine. It turns out its crawling bot is well-known for being aggressive. As that link explains, the bot doesn't respect Crawl-delay in robots.txt (that's annoying but unsurprising, since Crawl-delay is a non-standard extension).

So if this happens to you, you have these options:

  1. Register at Baidu's site (Chinese-only) to request less-aggressive crawling.
  2. Try to tune PostgreSQL connection pooling to handle the onslaught.
  3. Block Baidu entirely. This is an extreme option — not advised if you do business in China, or if you have Chinese-speaking customers elsewhere who are reaching you through Baidu. The latter requires analytics to figure out.

Posted by Quinn Weaver
May 8, 2015, 2:37 p.m.

syslog Considered Dangerous

Edit 2015-10-01: there's a similar problem on RDS, where logs are written to the same partition as data, which can cause problems if you don't have enough provisioned IOPS (or disk space).

A client of ours was suffering mysterious, intermittent lock pileups on a large table. To diagnose the problem, I ran our lock-logging scripts.

Querying the resulting log_transaction_locks table, I saw lots of extend locks and RowExclusiveLocks piling up behind normal UPDATEs (and INSERTs) — there were even extend locks waiting on other extend locks. This is pretty unusual; you'd expect that when one UPDATE completed, the extend lock would succeed, PostgreSQL would quickly extend the relation (in this case, the big table's TOAST relation), and queries would keep running smoothly. These were all single-row writes; they should have taken that long. However, a complicating factor was that a couple of columns in that big table were giant TEXT columns storing JSON strings. Was that the issue? The problem was unclear, and we needed more info.

At this point my colleague Christophe Pettus made the excellent suggestion to do 'strace -p' on one of the locked processes. The dump revealed lots of sendto() calls with giant strings, which he immediately recognized as syslog messages based on their formatting.

Now, this was a problem. syslog writes to /var, and /var was hanging off the root partition, which was an EBS volume (without provisioned IOPS). So each basically the entire SQL of each query was being written across the network, including those giant JSON strings, and the query itself wouldn't terminate till the write was fsync()'ed. In the meantime, extend locks and other UPDATEs piled up behind it. The PostgreSQL data directory was on a big, fast software RAID across SSD instance storage, but performance was being bottlenecked by that single EBS volume.

The client elected to turn off log_lock_waits, and the problem went away. In addition, Christophe suggested prepending a minus sign to the filename in /etc/rsyslog.d/postgres.conf, like so, which tells syslog not to flush between each log statement:

local0.* -/var/log/postgres.log

I thought that was neat: a potential fix to a huge performance problem, in one character(!)

Lessons learned:

1. If your log_destination contains syslog, be careful about filesystem layout. (If you stick to csvlog, this normally isn't a problem, because csvlogs, by default, go in pg_log/ under the PostgreSQL data dir.)

2. Logging info about a performance problem can sometimes make it worse. In this case, we decided to diagnose the pre-existing lock pileups by setting log_lock_waits = on and deadlock_timeout = 200. But, because of the syslog issue, those very settings exacerbated the lock problems. This was necessary to catch the problem, but it's something to keep in mind if you decide to use our lock-logging scripts.

Posted by Quinn Weaver
Sept. 18, 2014, 12:19 p.m.

WAL-E with Rackspace CloudFiles over Servicenet

Found a great walkthrough on setting up WAL-E to use python-swiftclient for storage in Rackspace Cloud Files:

Unfortunately by default, your backups use the public URL for Cloud Files and eat into metered public bandwidth.

The way to work around this is to set the endpoint_type to internalURL instead of the default publicURL.

You do that by setting the following environment variable:


That allows WAL-E to use Servicenet for base backups and WAL archiving which will be much faster and not eat into your metered public bandwidth.

Posted by Jeff Frost
Sept. 17, 2014, 1:18 p.m.

RDS for Postgres: List of Supported Extensions

Today I learned that Amazon doesn't keep any list of extensions supported in PostgreSQL. Instead, their documentation tells you to start a psql session and run 'SHOW rds.extensions'. But that creates a chicken-and-egg situation if you have an app that needs extensions, and you're trying to decide whether to migrate.

So here's a list of extensions supported as of today, 2014-09-17 (RDS PostgreSQL 9.3.3). I'll try to keep this current.



Posted by Quinn Weaver
April 29, 2014, 6:04 p.m.

How to start PostgreSQL the Debian way with pg_ctl

Sometimes I find myself wanting to start a PostgreSQL cluster with pg_ctl on a Debian or Ubuntu system.  If you've ever tried this, you know that something like this doesn't work:

vagrant@vagrant-ubuntu-trusty-64:/vagrant$ sudo -u postgres /usr/lib/postgresql/9.3/bin/pg_ctl -D /var/lib/postgresql/9.3/main/ start
server starting
vagrant@vagrant-ubuntu-trusty-64:/vagrant$ postgres cannot access the server configuration file "/var/lib/postgresql/9.3/main/postgresql.conf": No such file or directory

because the Debian packages place the config files in /etc/postgresql/<version>/<cluster name>

So, I used to think that this was the way to start it:

vagrant@vagrant-ubuntu-trusty-64:/vagrant$ sudo -u postgres /usr/lib/postgresql/9.3/bin/pg_ctl -D /etc/postgresql/9.3/main/ start
server starting
vagrant@vagrant-ubuntu-trusty-64:/vagrant$ 2014-04-30 00:59:32 UTC LOG:  database system was shut down at 2014-04-30 00:57:49 UTC
2014-04-30 00:59:32 UTC LOG:  database system is ready to accept connections
2014-04-30 00:59:32 UTC LOG:  autovacuum launcher started

And that does generally work, but the proper way to do it is this:

vagrant@vagrant-ubuntu-trusty-64:/vagrant$ sudo -u postgres /usr/lib/postgresql/9.3/bin/pg_ctl -D /var/lib/postgresql/9.3/main/ -o "-c config_file=/etc/postgresql/9.3/main/postgresql.conf" start
server starting
vagrant@vagrant-ubuntu-trusty-64:/vagrant$ 2014-04-30 01:00:22 UTC LOG:  database system was shut down at 2014-04-30 01:00:16 UTC
2014-04-30 01:00:22 UTC LOG:  database system is ready to accept connections
2014-04-30 01:00:22 UTC LOG:  autovacuum launcher started

The real Debian way is to use pg_ctlcluster like so:

vagrant@vagrant-ubuntu-trusty-64:~$ sudo -u postgres pg_ctlcluster 9.3 main start

Posted by Jeff Frost
April 18, 2014, 3:54 p.m.

Grabbing the statement durations out of a log file using gnu grep

Sometimes I find myself wanting to grep all the query durations out of a file for sorting.  I recently discovered that I can use gnu grep's lookbehind syntax like so:

grep -oP '(?<=duration: )[0-9\.]+' postgresql-Mon.csv

and then sort them numerically like so:

grep -oP '(?<=duration: )[0-9\.]+' postgresql-Mon.csv | sort -n | tail

Posted by Jeff Frost
April 2, 2014, 4:02 p.m.

Putting stats_temp_directory on a ramdisk

This hack is an old chestnut among PostgreSQL performance tuners, but it doesn't seem to be widely known elsewhere. That's a shame, because it's pure win, and it's ridiculously easy to set up. You don't even need to restart PostgreSQL.

Here's the situation: PostgreSQL writes certain temporary statistics. These go in the dir given by the stats_temp_directory setting. By default, that's pg_stat_tmp in the data dir. Temp files get written a lot, but there's no need for them to persist.

That makes them perfect candidates for a ramdisk (a.k.a. RAM drive). A ramdisk is a chunk of memory treated as a block device by the OS. Because it's RAM, it's super-fast. As far as the app is concerned, the ramdisk just holds a filesystem that it can read and write like any other. Moreover, PostgreSQL generally only needs a few hundred kilobytes for stats_temp_directory; any modern server can fit that in RAM.

In Linux, you set up a ramdisk like this:

As root:

'mkdir /var/lib/pgsql_stats_tmp' [1]

'chmod 777 /var/lib/pgsql_stats_tmp'

'chmod +t /var/lib/pgsql_stats_tmp'

Add this line to /etc/fstab. That 2G is an upper limit; the system will use only as much as it needs.

tmpfs /var/lib/pgsql_stats_tmp tmpfs size=2G,uid=postgres,gid=postgres 0 0

'mount /var/lib/pgsql_stats_tmp'

Then, as postgres:

Change the stats_temp_directory setting in postgresql.conf:

stats_temp_directory = '/var/lib/pgsql_stats_tmp'

Tell PostgreSQL to re-read its configuration:

'pg_ctl -D YOUR_DATA_DIR reload'

And that's it!

Other operating systems have different ways to set up ramdisks. Perhaps I'll cover them in a later post.

[1] The directory /var/lib/pgsql_stats_tmp is an arbitrary choice, but it works well for Debian's filesystem layout.

Posted by Quinn Weaver
Nov. 21, 2013, 7:42 p.m.

Ubuntu 3.11.0 kernel not as stable as hoped

As I mentioned in my previous post, Ubuntu recently released a 3.11.0 kernel for Precise.  Unfortunately, while it appeared stable during a long burn-in process with various pgbench workloads and scale sizes, it crashed shortly after putting the machine into service, so make sure you test thoroughly on your hardware. 

This particular hardware was the Dell Poweredge 720 that rackspace uses.

Guess we're going back to custom compiling kernels for a little while longer.

Anyone else have experiences good or bad with it?

I believe this is the same kernel that's shipping with 13.10 Saucy Salamander.

Posted by Jeff Frost
Nov. 15, 2013, 1:34 p.m.

Ubuntu has released a 3.11.0 kernel for 12.04 LTS!

This one is for those of you compiling a custom kernel or staying on 10.04 LTS because of this issue on database servers in the 3.x kernels:

Ubuntu has released linux-image-3.11.0-13-generic for 12.04 in which that fix is included.  It's actually been included since 3.9rc2, but Ubuntu skipped straight to 3.11.

And there was much rejoicing in the PostgreSQL community.  At least from those of us who manage 12.04 LTS PostgreSQL servers!

For those of you who don't recall, here's a link to the big discussion thread:

One item of note on 3.11 is that the kernel.sched_migration_cost setting is now called kernel.sched_migration_cost_ns, so if you are setting that via sysctl, you will need to update your sysctl.conf or sysctl.d file after upgrade.

Now go upgrade your test servers and let us know how it performs!  I've been burning in a new server for about 24 hours with pgbench and the 3.11.0-13-generic kernel and so far it's been stable and a good performer.

Posted by Jeff Frost
Aug. 22, 2013, 3:57 p.m.


I prefer PostgreSQL, which is the core of our business at pgExperts. But I still have some legacy clients on MySQL. For one of them, I'm writing new features in Perl/MySQL. The mysql_auto_reconnect option just saved my day. I'd been getting errors like "DBD::mysql::st execute failed: MySQL server has gone away." To use it in DBI, do

my $dbh = DBI->connect(
{ mysql_auto_reconnect => 1 },
To use it in Rose::DB::Object, do

File MyApp::DB;
our @ISA = qw( Rose::DB );

mysql_auto_reconnect => 1,

Posted by Quinn Weaver
Aug. 10, 2013, 6:41 p.m.

Get your PgBouncer pools stats into Ganglia

Just whipped up a quick ruby script to put PgBouncer "show pools" data into ganglia using gmetric.  Based it heavily off the postgres_gmetric script.

You can find the pgbouncer gmetric script here:

Posted by Jeff Frost
Feb. 27, 2013, 5:38 p.m.

Script to add the respository on Debian/Ubuntu

You've probably heard the announcement: Martin Pitt's PPA is being deprecated.  This is in an effort to push folks towards the new official

See the announcement here:

Also, see the deprecation notice on Martin's PPA:

Since we have quite a few customers using Martin's PPA for PostgreSQL packages under Ubuntu, I wrote a little shell script to switch repositories if you're using Martin's PPA and just install the repository if you are not.

You can find it here:

Posted by Jeff Frost
Feb. 18, 2013, 4:30 p.m.

Python Type Gripe

I've been digging into Python lately. On Friday, I ran into an unintuitive behavior where two sets containing identical objects (of a custom class) were comparing unequal. I bet you can guess the reason: I hadn't defined __hash__ and __eq__ for my class. Take a look at this documentation for __repr__:

For many types, this function makes an attempt to return a string that would yield an object with the same value when passed to eval(), otherwise the representation is a string enclosed in angle brackets that contains the name of the type of the object together with additional information often including the name and address of the object. A class can control what this function returns for its instances by defining a __repr__() method.
"For many?" "makes an attempt?!" "often including?!!" To my way of thinking, this is a little crazy. There should be a single root class in the inheritance hierarchy, and that root should define __repr__, __hash__, and __eq__ methods that operate on/check for equality of each attribute in the class (by value, not by memory address!) Then they would behave consistently for all classes. Then two sets, each containing three objects of the same class, each containing the same attributes with the same values, would compare equal when you used == on them, following the principle of least surprise.

Of course, I can't rewrite Python to make this happen. I'm tempted just to make my own root class that implements these behaviors via metaprogramming, and make every class I ever define a subclass of it. Is there a better way?

PS: A colleague pointed out that this behavior reflects a deliberate decision in Python design philosophy: "Python […] favors informal protocols over inheritance trees […] You can think of __eq__ and __hash__ as kind of being parts of an informal 'collectable' protocol." That sort of makes sense. Of course, the thing about informal protocols is that there's nothing to enforce consistency.

Posted by Quinn Weaver
Feb. 14, 2013, 4:24 p.m.

Finding unreplicated tables and sequences in slony with check_postgres and nagios

Since slony doesn't automatically add tables to replication for you, sometimes it's handy to double check that everything is being replicated.  If your slony cluster is named "sample" then you would have a slony schema named "_sample". As such, you can query against the sl_table and sl_sequence tables to find unreplicated tables like so:

SELECT pg_class.OID::regclass as table_name FROM pg_class JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.OID WHERE relkind = 'r' AND pg_class.OID NOT IN (SELECT tab_reloid FROM _sample.sl_table) AND pg_namespace.nspname NOT IN ('information_schema', 'pg_catalog', '_sample')

SELECT pg_class.OID::regclass as sequence_name FROM pg_class JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.OID WHERE relkind = 'S' AND pg_class.OID NOT IN (SELECT seq_reloid FROM _sample.sl_sequence) AND pg_namespace.nspname NOT IN ('information_schema', 'pg_catalog', '_sample');

 Those queries are a bit unpleasant to remember, so I like to make them into views like so:

CREATE OR REPLACE VIEW public.unreplicated_tables AS SELECT pg_class.OID::regclass as table_name FROM pg_class JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.OID WHERE relkind = 'r' AND pg_class.OID NOT IN (SELECT tab_reloid FROM _sample.sl_table) AND pg_namespace.nspname NOT IN ('information_schema', 'pg_catalog', '_sample') AND pg_class.relname <> 'fwd_epc_map';

CREATE OR REPLACE VIEW public.unreplicated_sequences AS SELECT pg_class.OID::regclass as sequence_name FROM pg_class JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.OID WHERE relkind = 'S' AND pg_class.OID NOT IN (SELECT seq_reloid FROM _sample.sl_sequence) AND pg_namespace.nspname NOT IN ('information_schema', 'pg_catalog', '_sample');

Now, that's a little handier because we can just select straight out of the view.  Now that we have these easy to use views, we can use's custom_query check to setup a nagios alert whenever there are unreplicated tables or sequences.

First we need to create the check commands like so:

define command{
        command_name    check_pgsql_slony_unreplicated_tables
        command_line    /usr/lib64/nagios/plugins/ --host $HOSTADDRESS$ --port=5432 --dbuser nagios --dbname mydb --action=custom_query --query="SELECT count(table_name) AS result FROM public.unreplicated_tables" --valtype=integer --warn=1 --critical=5

define command{
        command_name    check_pgsql_slony_unreplicated_sequences
        command_line    /usr/lib64/nagios/plugins/ --host $HOSTADDRESS$ --port=5432 --dbuser nagios --dbname mydb --action=custom_query --query="SELECT count(sequence_name) AS result FROM public.unreplicated_sequences" --valtype=integer --warn=1 --critical=5
I apologize for all the wrapping there.

Now that we have the commands defined, we use them in a service definition:

define service {
    use                          postgres-service
    notifications_enabled        1
    hostgroup_name               slony_primary
    service_description          Slony Unreplicated Tables
    check_command                check_pgsql_slony_unreplicated_tables
    contact_groups               pgexperts

define service {
    use                          postgres-service
    notifications_enabled        1
    hostgroup_name               slony_primary
    service_description          Slony Unreplicated Sequences
    check_command                check_pgsql_slony_unreplicated_sequences
    contact_groups               pgexperts
And that's it! Now we'll receive an alert whenever an unreplicated table or sequence exists on the servers in hostgroup slony_primary.

Posted by Jeff Frost
Feb. 12, 2013, 2:48 p.m.

Russian Doll Syndrome / Please Monitor Your RAID Volumes

As part of setting up nagios monitoring for customers, I always make sure that the RAID volumes are being monitored on the DB servers.  More often than not, I have to tell the customer there are bad disks in one or more of the RAID volumes.  Don't be one of those customers, make sure you monitor the RAID volumes on your PostgreSQL (and other) servers!

So, today I was setting up some of this same monitoring for LSI RAID controllers in a Dell server using the check_sasraid_megaraid nagios plugin.   This plugin requires the LSI SNMP installer as well as net-snmp and a few other miscellaneous packages.

This brings us to the Russian Doll Syndrome.  I download the zip file from the LSI support site and unzip it to find a zip full of zips:

$ unzip

Ok, fine, they have a zip for each architecture, so let's unzip the Linux_x64 one and get on with this:

$ unzip
  inflating: MD5CheckSum.txt        

Oh, sweet, another zip file!

$ unzip
  inflating: readme.txt             
 extracting: sas_snmp_64bit.tar.gz  

And inside that, we find a tarball:

$ tar xvfz sas_snmp_64bit.tar.gz

And finally an RPM.  Yeesh LSI, why couldn't we just have the RPM right inside the file or even just in the top level zip file?

So, all kidding aside, please make sure you're monitoring the RAID volumes on your PostgreSQL servers!

Posted by Jeff Frost
Feb. 1, 2013, 3:12 p.m.

Why PostgreSQL Doesn't Allow Unsafe Implicit Type Coercions

I got bitten by a MySQL bug today, and it reminded me why we don't allow¹ unsafe type coercions in PostgreSQL. It's easy to talk about them in the abstract, but harder to explain why they're a problem in practice. Here's a really clear example.

I do most of my consulting through pgExperts these days, but I still have a legacy client on MySQL. Today I was writing a new feature for them. Using test-driven development, I was debugging a particular invocation of a five-way join. I found it was returning many rows from a certain right-hand table, when it should have been returning none.

I took a closer look at my SQL and found this code²:

WHERE order.status_id NOT IN ('completed', 'cancelled')

But that's not right! status_id isn't a varchar. It's an int FK to a lookup table:

CREATE TABLE statuses (
id int(10) unsigned NOT NULL AUTO_INCREMENT,
name varchar(16) NOT NULL,
UNIQUE KEY name (name)

MySQL was silently coercing my integer order.status_id to a varchar, comparing it to 'completed' and 'cancelled' (which of course it could never match), and returning a bunch of right-hand table rows as a result. The filter was broken.

PostgreSQL would never haved allowed this. It would have complained about the type mismatch between status_is and 'completed' or 'cancelled'. Rather than a confusing resultset, I'd get a clear error message at SQL compile time:

ERROR: invalid input syntax for integer: "cancelled"
LINE 8: select * from frobs where order.status_id NOT IN ('cancelled', 'c...

Fortunately, I caught and fixed this quickly (hurrah for test-driven development!) The fix was simple:

JOIN statuses stat on = order.status_id
WHERE NOT IN ('completed', 'cancelled')

But under different circumstances this could have seriously burned me.

And that's why we disallow certain implicit type coercions. This post is not intended as a rant against MySQL. Rather, it's an illustration of how a seemingly abstract issue can seriously hose your code. Caveat hacker!

1. Since PostgreSQL 8.3 (2008), that is. Before that, we did allow unsafe coercions.

2. I've changed the code and identifiers to protect client confidentiality.

Posted by Quinn Weaver
Aug. 23, 2012, 5:57 p.m.

PostgreSQL, NUMA and zone reclaim mode on linux

Lately we've been seeing issues with zone reclaim mode on large memory multi processor NUMA linux systems.

What's NUMA?  It's just an acronym for Non-Uniform Memory Access.  This means that some memory in your system is more expensive for a particular CPU to access than its "local" memory.

You can see how much more distant the kernel considers the different zones by using the numactl command like so:

numactl --hardware
If you've got a modern multiprocessor system, you'll probably see something like this:

available: 2 nodes (0-1)
node 0 size: 48417 MB
node 0 free: 219 MB
node 1 size: 48480 MB
node 1 free: 135 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10 

Here we see a distance of 10 for node 0 to access node 0 memory and 21 for node 0 to access node 1 memory.   What does distance really mean?  It's a cost parameter based on number of "hops" or buses that separate the node from the distant memory.

Now, what's zone reclaim mode?

From the linux kernel docs:


Zone_reclaim_mode allows someone to set more or less aggressive approaches to
reclaim memory when a zone runs out of memory. If it is set to zero then no
zone reclaim occurs. Allocations will be satisfied from other zones / nodes
in the system.

This is value ORed together of

1       = Zone reclaim on
2       = Zone reclaim writes dirty pages out
4       = Zone reclaim swaps pages

zone_reclaim_mode is set during bootup to 1 if it is determined that pages
from remote zones will cause a measurable performance reduction. The
page allocator will then reclaim easily reusable pages (those page
cache pages that are currently not used) before allocating off node pages.

It may be beneficial to switch off zone reclaim if the system is
used for a file server and all of memory should be used for caching files
from disk. In that case the caching effect is more important than
data locality.

Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
throttle the process. This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore but it preserve the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.

Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset

I highlighted the text above because PostgreSQL depends heavily on the filesystem cache and disabling zone reclaim mode is desirable in this situation.

There's been a bit of discussion about this on the pgsql-performance mailing list here:

If you've got a modern multi-socket system, odds are good that zone reclaim mode is enabled automatically on boot.  You can check this by looking at /proc/sys/vm/zone_reclaim_mode.

The biggest issue we've seen with zone reclaim mode enabled on customer multi-socket systems is the filesystem cache never filling up even when the database is much larger than RAM.  That's because the system is trying to keep some "local" memory available. After disabling zone_reclaim_mode, the filesystem cache fills up and performance improves.

So, how to disable zone_reclaim_mode?  The best way to do this is via sysctl.  Just add:

vm.zone_reclaim_mode = 0 

to /etc/sysctl.conf, save it and execute sysctl -p to load the new settings into the kernel.

Other interesting non PostgreSQL pages on NUMA/zone_reclaim_mode:

Posted by Jeff Frost
Aug. 2, 2012, 10:16 a.m.

Finding non-UTF8 data in a dump

Recently I was tasked with identifying all rows in a PostgreSQL database that contained non UTF8 data for migrating from SQL_ASCII encoding to UTF8.  Because of the database's size, just using iconv and diff on a dump file wouldn't work.

I started with this perl script from Jon Jensen:

It got me a little way down the road, but what I really wanted was to know which table had the bad row and better yet, be able to easily import the bad rows with a COPY command for cleanup and eventual use as temporary tables that we can be used with UPDATE statements to fix the production DB.

So, I turned to Python.  I found a isUTF8(text) definition here:

I took that and wrote which outputs COPY commands with only the non-UTF8 rows to STDOUT and also outputs a schema-only dump command to STDERR which only dumps the tables with bad data for ease of creating a temporary DB.

You can find the script on github here:

What other methods do you folks use?

Posted by Jeff Frost
May 23, 2012, 7:05 p.m. script to dump all PostgreSQL DBs

For a long time, I've written all my automation scripts in bourne shell and bash with occasional forays into perl.  I really like bash because you're always guaranteed to have bash installed on a modern linux system and so don't need to install perl and 15 modules to get the job done.

That said, I've recently begun using python for more automation.  As an exercise, I've rewritten my script for python 2.7 trying to use only the standard library.

The script pg_dumps all the databases in a PostgreSQL cluster in the custom format allowing you to specify the compression level.  (See this blog post regarding pg_dump compression level performance.)  It also dumps the globals to a bzip2'd SQL file and then removes files in the dump destination directory that are older than the --keep value (default 7 days).  Finally, if it had any errors during the run, it sends the error log file to the specified email address.  It also picks up libpq environment variables, but will override those with anything specified on the command line.

You can find it here:

Posted by Jeff Frost
May 15, 2012, 2:47 p.m.

Slony Upstart Script for Ubuntu 10.04 LTS

 Since Ubuntu is moving towards using upstart instead of SysV init scripts, I thought I'd write a quick upstart script for starting slony compiled and installed in /usr/local/slony.  Upstart 1.4+ supports setuid, but the version of upstart included with Ubuntu 10.04LTS is only 0.6.5 and even on Ubuntu 11.10, it's only 1.3, so instead I use start-stop-daemon's --chuid switch.  The script should be installed as /etc/init/slony.conf. Here's the script:

---cut here---
# slony - slon daemon
# slon is the slony replication daemon

description    "slony slon daemon"

start on net-device-up
stop on runlevel [06]

# can't use setuid until upstart 1.4
#setuid slony
# which is why we use start-stop-daemon with the --chuid switch


    exec start-stop-daemon --start --chuid slony --exec /usr/local/slony/bin/slon -- -f /etc/slon.conf >> /var/log/slony/slony.log 2>&1
end script
---cut here---

This will have the slon daemon run as the slony user and log to /var/log/slony/slony.log.  You'll also need a valid slony config file in /etc/slon.conf.

You can also find the script in the pgexperts upstart-scripts git repo:

Posted by Jeff Frost
April 2, 2012, 5:58 p.m.

Using the row_number() window function to reorder images

One of my customers uses the rails acts_as_list module for maintaining order of images on forms.

Through some application issues they ended up with images that had identical order for the same provider and image_category. So, it ended up looking like this:

select provider_id, image_category_id, position, id from images where image_category_id=1234 order by position;


provider_id | image_category_id | position | id

       9999 |              1234 |        3 | 484900
       9999 |              1234 |        4 | 484899
       9999 |              1234 |        4 | 484897
       9999 |              1234 |        4 | 484896
       9999 |              1234 |        4 | 484894
       9999 |              1234 |        4 | 484893
       9999 |              1234 |        4 | 484892
       9999 |              1234 |        4 | 484890
       9999 |              1234 |        5 | 484889
       9999 |              1234 |        6 | 484887
       9999 |              1234 |        7 | 484886
       9999 |              1234 |        8 | 479818
       9999 |              1234 |        9 | 479817
       9999 |              1234 |       10 | 479815

Note that there are several rows with position=4 above. This was a prevalent issue throughout the images table and not just with this example provider.

So, what they wanted was to reorder the images such that the equal positions were kept together and the overall order stayed the same.  So, for example, it should look like this (sorry for the small font):

 provider_id |image_category_id | position |   id  

        9999 |             1234 |        3 | 484900  ==> Position 1
        9999 |             1234 |        4 | 484899  ==> Position 2
        9999 |             1234 |        4 | 484897  ==> Position 3
        9999 |             1234 |        4 | 484896  ==> Position 4
        9999 |             1234 |        4 | 484894  ==> Position 5
        9999 |             1234 |        4 | 484893  ==> Position 6
        9999 |             1234 |        4 | 484892  ==> Position 7
        9999 |             1234 |        4 | 484890  ==> Position 8
        9999 |             1234 |        5 | 484889  ==> Position 9
        9999 |             1234 |        6 | 484887  ==> Position 10
        9999 |             1234 |        7 | 484886  ==> Position 11
        9999 |             1234 |        8 | 479818  ==> Position 12
        9999 |             1234 |        9 | 479817  ==> Position 13
        9999 |             1234 |       10 | 479815  ==> Position 14

So, I thought about it for a bit, then it occurred to me: why not use the row_number() window function.  So I wrote the following migration SQL script:

CREATE temp TABLE new_image_positions 
, row_number AS position
SELECT provider_type
, provider_id
, image_category_id
, position
, id
, row_number()
partition BY provider_id
, image_category_id
ORDER BY position
FROM images
WHERE provider_type = 'Project'
AND position IS NOT NULL
) x ;

UPDATE images
SET position = n.position
FROM new_image_positions n


And, voila, all the images are now properly ordered:

 provider_id | image_category_id | position |   id   
        9999 |              1234 |        1 | 484900
        9999 |              1234 |        2 | 484899
        9999 |              1234 |        3 | 484897
        9999 |              1234 |        4 | 484896
        9999 |              1234 |        5 | 484894
        9999 |              1234 |        6 | 484893
        9999 |              1234 |        7 | 484892
        9999 |              1234 |        8 | 484890
        9999 |              1234 |        9 | 484889
        9999 |              1234 |       10 | 484887
        9999 |              1234 |       11 | 484886
        9999 |              1234 |       12 | 479818
        9999 |              1234 |       13 | 479817
        9999 |              1234 |       14 | 479815
        9999 |              1234 |       46 | 403052
        9999 |              1234 |       47 | 403051
        9999 |              1234 |       48 | 403050
        9999 |              1234 |       49 | 403049
        9999 |              1235 |        1 | 484916
        9999 |              1235 |        2 | 484915
        9999 |              1235 |        3 | 484912
        9999 |              1235 |        4 | 484910
        9999 |              1236 |        1 | 493000
        9999 |              1236 |        2 | 492999

Posted by Jeff Frost
Dec. 23, 2011, 10:22 a.m.

pg_dump compression settings

After doing the base backup benchmarks, I thought it would be interesting to benchmark pg_dump run locally and remotely using all the different compression settings.  This will allow me to compare the dump times and the space savings as well as see how much dumping remotely slows the process.  Nobody is storing their dumps on the same server, right?

For the record, all these dumps used pg_dump's custom dump format (-Fc), the PGDATA directory is 93GB on disk and resides on a 4 disk RAID10 of 7200 RPM SATA drives.

CPU on the test server: Intel(R) Xeon(R) CPU X3330  @ 2.66GHz.
CPU on the test remote client: Intel(R) Core(TM) i7 CPU 960  @ 3.20GHz.
Network is gigabit.
The total size reported is from du -sh output.
Both client and test server are running postgresql-9.1.2.

For comparison, our best base backup time from the base backup blog post was: 15m52.221s

The results:

Local pg_dump:
compression: 0
time: 19m7.455s
total size: 77G
compression: 1
time: 21m53.128s
total size: 11G
compression: 2
time: 22m27.507s
total size: 11G
compression: 3
time: 24m18.966s
total size: 9.8G
compression: 4
time: 30m10.815s
total size: 9.2G
compression: 5
time: 34m26.119s
total size: 8.3G
compression: 6
time: 41m35.340s
total size: 8.0G
compression: 7
time: 49m4.484s
total size: 7.9G
compression: 8
time: 91m28.689s
total size: 7.8G
compression: 9
time: 103m24.883s
total size: 7.8G

Remote pg_dump:
compression: 0
time: 20m1.363s
total size: 77G
compression: 1
time: 22m9.205s
total size: 11G
compression: 2
time: 22m19.158s
total size: 11G
compression: 3
time: 23m7.426s
total size: 9.8G
compression: 4
time: 26m10.383s
total size: 9.2G
compression: 5
time: 28m57.431s
total size: 8.3G
compression: 6
time: 33m23.939s
total size: 8.0G
compression: 7
time: 38m11.321s
total size: 7.9G
compression: 8
time: 62m27.385s
total size: 7.8G
compression: 9
time: 72m5.123s
total size: 7.8G

So, a few interesting observations:

  • Base backups are indeed faster, but that's no surprise since they have less overhead. 
  • Taking the backup locally is only slightly faster than remotely for the smaller compression levels
  • At compression level 3 and above the faster CPU on the test client becomes a huge benefit
  • Looks like the default compression is level 6. This might not be what you want if you're concerned about minimizing the backup time and have the disk space to spare.

Posted by Jeff Frost
Dec. 6, 2011, 3:45 p.m.

PostgreSQL Base Backup Benchmark Results

For a while now we've sort of anecdotally thought that compression should help with base backups and also that the rsync server is faster than rsync over ssh, so while nothing was happening on our test server, I thought I'd benchmark a few different methods.

Looks like rsync using an rsync server with no compression was fastest, but not by much. Pg_basebackup is right there with rsync using any of the three RC4 ciphers all within a few seconds of one another. Looks like blowfish-cbc comes in next followed by aes128-cbc.

Most interesting is that adding compression quadruples the sync time! That's probably not the case over a slow link, but over gigabit it's definitely the case.

Tests were done with a 93GB PGDATA, gigabit network, linux software 4 disk RAID10 data dir on the first test server copied to linux software 2 disk RAID0 on the second server.

Following are the results:

Without compression:

time rsync test0::data /data/tmp/test-data/ -avP
real 15m52.221s
user 3m10.257s
sys 4m11.460s

time /usr/pgsql-9.1/bin/pg_basebackup -D /data/tmp/test-data -c fast -h test0 -U postgres
real 16m13.440s
user 0m17.314s
sys 2m18.217s

time rsync -e 'ssh -c arcfour' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 16m13.242s
user 8m32.674s
sys 5m28.312s

time rsync -e 'ssh -c arcfour256' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 16m13.656s
user 8m48.356s
sys 6m7.343s

time rsync -e 'ssh -c arcfour128' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 16m3.947s
user 8m44.627s
sys 6m7.499s

time rsync -e 'ssh -c blowfish-cbc' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 26m22.516s
user 19m16.038s
sys 8m45.046s

time rsync -e 'ssh -c aes128-cbc' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 29m30.785s
user 27m31.919s
sys 6m5.154s

time rsync -e 'ssh' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 30m50.301s
user 24m10.951s
sys 9m13.090s

time rsync -e 'ssh -c aes256-ctr' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 38m23.789s
user 31m36.011s
sys 8m39.210s

Our Best Performers With Compression:

time rsync -e 'ssh -c arcfour' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avPz
real 62m31.933s
user 14m48.752s
sys 2m42.412s

time rsync test0::data /data/tmp/test-data/ -avPz
real 62m53.699s
user 6m42.846s
sys 1m44.357s

Since Rod, Joshua and Rodrigo's comments suggested pigz and netcat below, I thought I'd give them both a quick try to see if they produced a winner.  Conveniently, gnu-tar has a --use-compress-program option, so you can easily use pigz in this manner.

So, tar over ssh with the arcfour256 cipher using pigz for compression:

time ssh -c arcfour256 test0 "cd /data/9.1/data ; tar --use-compress-program=pigz -cf - ." | (cd /data/tmp/test-data ; tar --use-compress-program=pigz -xvf -)

 real    18m11.901s
user    7m3.587s
sys     3m53.487s

On test0: tar --use-compress-program=pigz -cf - . | nc -l 7000
On test1: time nc test0 7000 | tar --use-compress-program=pigz -xvf -

real    17m46.805s
user    6m17.562s
sys     4m4.563s

So, using tar with pigz over ssh and over netcat are much faster compression methods, they still are a little slower than no compression.  This got me wondering if perhaps the 93GB of PGDATA doesn't compress all that well:

14G     /data/tmp/data-test.tgz

So, that's 6:1 compression. Not bad!  Interestingly, it took gnu-tar + pigz about 17m to create the test tgz file. That's about identical to the time it took to ship it over the network and uncompress/untar it on the other end.

That made me try tar with no compression piped over netcat:

On test0: tar -cf - . | nc -l 7000

On test1: time nc test0 7000 | tar -xvf -

real    15m52.313s
user    0m10.033s
sys     3m17.630s

So, tar with no compression over netcat clocked in almost identical to using rsync's native server.

Anyone have any other tools for me to test?

Posted by Jeff Frost