Database consulting and application development...

...because your data is your business

March 20, 2019, 6:34 a.m.

“Look It Up: Practical PostgreSQL Indexing” at Nordic PGDay 2019

The slides from my presentation at Nordic PGDay 2019 are now available.

Posted by Christophe Pettus
Feb. 11, 2019, 2:44 p.m.


A feature of PostgreSQL that most people don’t even know exists is the ability to export and import transaction snapshots. The documentation is accurate, but it doesn’t really describe why one might want to do such a thing. First, what is a “snapshot”? You can think of a snapshot as the current set of committed […]
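
A minimal sketch of the mechanics (not from the excerpt; the snapshot ID shown is illustrative, and the exporting transaction must stay open while other sessions import its snapshot):

-- session 1: open a transaction and export its snapshot
begin transaction isolation level repeatable read;
select pg_export_snapshot();   -- returns an ID such as '00000003-0000001B-1'

-- session 2: adopt that snapshot before running any query, so both sessions see identical data
begin transaction isolation level repeatable read;
set transaction snapshot '00000003-0000001B-1';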

Posted by Christophe Pettus
Feb. 8, 2019, 5:52 p.m.

Do not change autovacuum age settings

PostgreSQL has two autovacuum-age related settings, autovacuum_freeze_max_age, and vacuum_freeze_table_age. Both of them are in terms of the transaction “age” of a table: That is, how long it has been since the table has been scanned completely for “old” tuples that can be marked as “frozen” (a “frozen” tuple is one that no open transaction can […]
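
The excerpt is about leaving the settings alone, not about monitoring, but a quick way to see how close each table is to autovacuum_freeze_max_age (a sketch, not from the original post):

select relname, age(relfrozenxid) as xid_age
from pg_class
where relkind = 'r'
order by age(relfrozenxid) desc
limit 10;   -- compare xid_age to autovacuum_freeze_max_age (default 200 million)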

Posted by Christophe Pettus
Feb. 4, 2019, 1 p.m.

“Breaking PostgreSQL at Scale” at FOSDEM 2019

"Breaking PostgreSQL at Scale" at FOSDEM 2019.

Posted by Christophe Pettus
Jan. 11, 2019, 2:28 p.m.

In the Bay Area? This Tuesday, 2019-01-15, SFPUG features David Fetter's talk on ASSERTIONs. RSVPs close Monday at noon, so don't hesitate! Thanks to Heap for hosting at their FiDi office.

Posted by Quinn Weaver
Dec. 6, 2018, 12:37 p.m.

BDR talk by Mark Wong of 2nd Quadrant

In the Bay Area? This Wednesday, 2018-12-12, Mark Wong from 2nd Quadrant will be talking about BDR (Bi-Directional Replication, a form of multi-master) for PostgreSQL. This is a great chance to get inside, real-time info on BDR.

Multi-master is one of those features where when you need it, you really, really need it, and if you're in that category, this talk is for you. It's also of interest to anyone trying to figure out the best solution for scaling and redundancy beyond one machine and one data center.

To attend, you must RSVP at Meetup with your full name (for building security's guest list).

Posted by Quinn Weaver
Oct. 30, 2018, 10:56 p.m.

Remember your history

PostgreSQL keeps track of which WAL files go with which timelines in small history files. Each time you make a base backup, a history file is born. The file is written once and never updated. It's a simple system, and it works well and silently.

In fact, sometimes it works a little too silently.

At PostgreSQL Experts we've run into the problem where a client's history files disappear because they are stored in S3, and there's a lifecycle configuration in place that says to move everything over a certain age to Glacier. That's a good policy for WAL files!

Unfortunately, it's not a good policy for history files: without the latest history file you can't restore the latest backup, and without past history files, you are unable to do PITR to certain points in time.

The solution we used was to move the whole WAL archive to S3 Standard-Infrequent Access storage, dissolving the problem with lifecycle configurations while controlling costs. But you could also fix this by editing the lifecycle configuration.

The important thing is this: hold on to all history files. They're tiny text files, and when you need them, you really need them. This is also a good reason to test restores, not just of the latest backup, but of database states at arbitrary points in time.

*    *    *

Addendum: another very common problem we see is WAL archives that become corrupted because a client accidentally pointed two primaries at the same WAL archive (for instance, they might have copied a postgresql.conf file by hand, or via a DevOps tool like Puppet). In this case, the whole archive is corrupted, and you're best off starting with a fresh S3 bucket or an empty directory and doing a new base backup immediately.

One of the many nice features of pgBackRest is that it will notice this and prevent you from doing it. Fewer footguns → better backups.

Posted by Quinn Weaver
Oct. 29, 2018, 11:23 p.m.

It's just an expression

PostgreSQL has lots of great features for text search.

In particular, there are a bunch of ways to do case-insensitive searches for text:

  • There's standard SQL ILIKE… but that can be expensive — especially if you put %'s at both start and end — and it's overkill if you want an exact string match.
  • The same goes for case-insensitive regexp matching: overkill for simple case-insensitive matches. It does work with indexes, though!
  • Then there's the citext extension, which is pretty much the perfect answer. It lets you use indexes and still get case-insensitive matching, which is really cool. It Just Works.
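
For instance, a minimal citext sketch (not from the original post; the table and column come from the example further down):

    create extension citext;
    alter table email_addresses alter column addy type citext;
    select addy from email_addresses where addy = 'SomeBody@Example.COM';  -- matches regardless of case
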
OK, but what if you didn't have the foresight to use citext? And what if you don't want to go through the pain of changing the data type of that column? In a case like this, an expression index can be really handy.

Without an index, a case-insensitive match like this…

sandbox# select addy from email_addresses where lower(addy) = '';

… is forced to use a sequential scan, lowercasing the addy column of each row before comparing it to the desired address:

                                               QUERY PLAN
 Seq Scan on email_addresses  (cost=0.00..1.04 rows=1 width=32) (actual time=0.031..0.032 rows=1 loops=1)
   Filter: (lower(addy) = ''::text)
   Rows Removed by Filter: 3
 Planning time: 0.087 ms
 Execution time: 0.051 ms
(5 rows)

For my toy example, a four-row table, that's not too expensive, but for a real table with thousands or millions of rows, it can become quite a pain point. And a regular index on email_addresses(addy) won't help; the lower() operation forces a sequential scan, regardless of whether an index is present.

But an expression index will do the trick. An astute commenter noted that, in my toy example, walking the index is actually slower than a sequential scan. But for a table with many rows (most of which are not the one email address I'm looking for!) an index scan will be dramatically faster.

sandbox# create index email_addresses__lower__addy on email_addresses (lower(addy));
sandbox# explain analyze select addy from email_addresses where lower(addy) = '';
                                                                  QUERY PLAN
 Index Scan using email_addresses__lower__addy on email_addresses  (cost=0.13..8.15 rows=1 width=32) (actual time=0.093..0.095 rows=1 loops=1)
   Index Cond: (lower(addy) = ''::text)
 Planning time: 0.105 ms
 Execution time: 0.115 ms
(4 rows)

In short, expression indexes are one of those neat tricks that can save you a lot of pain.

Posted by Quinn Weaver
Oct. 23, 2018, 11:09 p.m.

How to set up scram-sha-256 authentication in PostgreSQL

md5: everyone uses it. But it's insecure — unsuitable for authentication or password storage. PostgreSQL 10 adds support for SCRAM SHA-256 authentication, which is far better¹… but how do you use it? Documentation is scarce.

Below I give the complete steps, for two different cases:

  • The case where you migrate every user.
  • The case where you migrate only a subset of users. Step 1 explains why you might need to do this. In this case, you can save time and complexity by creating a containing role, adding all the users you want to migrate to that role, and then configuring that role to work with scram-sha-256. That way you don't have to do it once for each user.
Enough preamble; let's begin!
  1. First, decide which users to migrate to scram-sha-256. The ideal is to migrate everyone, all at once. In practice you may be unable to do that, for a few reasons:
    • Older client tools don't support scram-sha-256. You need at least psql 10, or pgAdmin 4.
    • Your app's client libraries may not support scram-sha-256. At a minimum, you will probably have to upgrade your client libraries².
    • And it's not enough just to upgrade your libraries; you also have to change the password for each user who's migrating, because PostgreSQL needs to hash this password in scram-sha-256 format. There's no automatic rehash (there can't be, since PostgreSQL doesn't store the cleartext of the old password).
    • Finally, besides your own code, you may have third-party apps connecting to your database, whose library versioning policy and password reset policy you don't control.
  2. Having chosen the users to migrate, upgrade your relevant client libraries, psql/postgresql-client packages, and/or pgAdmin versions.
  3. If you're migrating all users, skip to the next step. If, instead, you're migrating a subset of users, then create a new role, and put all those users in that role. This step saves a bit of repetitive config work.
    $ sudo su - postgres
    postgres=# create role scram_sha_256_users nologin ROLE alice, bob, eve;
    … where alice, bob, and eve are the users you're moving to scram-sha-256. We'll set their passwords in a minute.
  4. Edit pg_hba.conf (make a backup copy first, and don't do a reload yet! We'll do that in a future step). If you're migrating all users, you just change 'md5' to 'scram-sha-256' wherever it appears in the file. In that case, also edit your postgresql.conf to set
    password_encryption = 'scram-sha-256'
    If, instead, you're migrating a subset of users, the pg_hba.conf edits depend on which users you're adding. As appropriate for your setup, alter or remove any lines that tell PostgreSQL to use md5 authentication for your affected users; then add equivalent lines for scram-sha-256 authentication.

    For instance, if those users were formerly allowed to log in to all DBs with an md5 password, either locally or from any IPv4 host, then you would do the following. Note that this is only an example, and the particulars depend on your existing setup!
    # Local (Unix domain socket) connections:
    local all +scram_sha_256_users scram-sha-256
    # IPv4 connections, from any IP address:
    host all +scram_sha_256_users 0.0.0.0/0 scram-sha-256
  5. Coordinate with stakeholders to schedule a downtime. During the downtime, do the following:
  6. Stop your app, cron jobs, and any other code that will touch the database (or just be prepared for these to fail when they try to connect).
  7. Do a pg_ctl reload to load the new pg_hba.conf. Note that this will break logins for all the users you're migrating!
  8. Now it's time to change those passwords. Notably, you must pass the cleartext password to your alter user command, so that PostgreSQL itself can salt and (iteratively) hash it; unlike with md5, you can't do the hashing yourself. This is required by the SCRAM protocol.
  9. So first alter the postgres user's ~/.psqlrc, to make sure it won't save cleartext of password-related commands in its history file. Add this line at the bottom:
    \set HISTFILE /dev/null
  10. Likewise, make sure the cleartext won't go to the log. Do psql -U postgres, and set these settings (just for this one session):
    set log_min_duration_statement = -1;
    set log_statement = 'none';
  11. In the same psql session, change passwords for the relevant users. If your users are going to connect using psql, keep in mind that it won't work with passwords longer than 99 characters. Longer passwords won't crash psql, but they will cause logins to fail.
    set password_encryption = 'scram-sha-256';
    alter role scram_sha_256_users set password_encryption = 'scram-sha-256';
    alter role bob with password 'bobsecret';
    alter role alice with password 'alicesecret';
  12. Securely communicate the new passwords to human users.
  13. Update ~/.pgpass entries with the new passwords, for both human users and non-human runners of cron jobs.
  14. Change other application password settings (config files, Vault, et cetera).
  15. Test, test, test! Make sure your apps, human users, and cron jobs can all connect.
  16. End the downtime.
  17. Remove this line from your ~/.psqlrc:
    \set HISTFILE /dev/null
With all that done, you are now on much more secure footing. Enjoy!
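
One way to double-check the migration (a sketch, not part of the original checklist; reading pg_authid requires superuser): SCRAM verifiers in pg_authid start with 'SCRAM-SHA-256$', while old-style hashes start with 'md5'.

select rolname, left(rolpassword, 14) as hash_prefix
from pg_authid
where rolname in ('alice', 'bob');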

1. Because SCRAM passwords are stored as a salted, iterated hash, even if they're leaked, they're harder to crack. Also, the SCRAM authentication protocol is immune to MITM attacks using certs from compromised CAs.
2. Support varies among libraries, though most wrap libpq, and hence should Just Work if compiled with libpq 10 (JDBC is a notable exception; it doesn't use libpq, so you need to get the right JDBC version).

Posted by Quinn Weaver
Oct. 11, 2018, 7:57 p.m.

HyperLogLog at SFPUG

HyperLogLog is one of those simple, useful, and ingenious algorithms that everyone should know. In SQL terms, it's a way to do COUNT(DISTINCT …) estimates with guaranteed accuracy bounds.
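
For a taste, with the postgresql-hll extension installed, an approximate distinct count looks roughly like this (a sketch; the table and column names are made up):

create extension hll;
select hll_cardinality(hll_add_agg(hll_hash_integer(user_id))) as approx_distinct_users
from page_views;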

At this month's San Francisco PostgreSQL Users' Group, Sai Srirampur will explain HLL's workings, the postgresql-hll extension, and its applications in distributed PostgreSQL à la Citus.

Check it out!

Posted by Quinn Weaver
Sept. 26, 2018, 11:06 a.m.

Find the value of all database sequences

Upon occasion, you want to get the current value of all the sequences in the database. For example, you may have fields that are integer rather than bigint, and you’re concerned how close you are to overflowing one of them (since sequences are bigint and will happily crash through the size of a 32-bit integer). […]
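
The post's own query is elided above; on PostgreSQL 10 and later, one way to get the same information is the pg_sequences view (a sketch, not necessarily the post's method):

select schemaname, sequencename, last_value
from pg_sequences
order by last_value desc nulls last;   -- last_value is null for sequences that have never been used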

Posted by Christophe Pettus
Sept. 12, 2018, 8:22 p.m.

“Securing PostgreSQL” at PDXPUG PostgreSQL Day 2018

The slides from my presentation, Securing PostgreSQL at PDXPUG PostgreSQL Day 2018 are now available.

Posted by Christophe Pettus
Sept. 5, 2018, 5:05 p.m.

Locks talk this Friday, at PostgresOpen! (2018-09-07)

UPDATE: my slides are available; the video has yet to be published.

Attending PostgresOpen?

Come join me Friday for a gentle introduction to locks in PostgreSQL. My example-driven talk covers basic lock theory, tools for lock debugging, and common pitfalls and solutions. I hope to see you there!

Time and place info is on the PostgresOpen SV website.

Posted by Quinn Weaver
Sept. 4, 2018, noon

CHAR: What is it good for?

In addition to the familiar text types VARCHAR and TEXT, PostgreSQL has a type CHAR. It’s little used… and that’s for a reason. It has some very unusual behaviors, which can be quite a surprise if you are not expecting them. First, CHAR is a fixed-width type. When character data is stored in it, it’s […]
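
Two of those surprises, sketched here rather than taken from the truncated excerpt: trailing pad spaces are ignored in comparisons, and they disappear when a CHAR value is converted to another string type.

select 'ab'::char(5) = 'ab   '::char(5);   -- true: trailing spaces are insignificant when comparing
select 'ab'::char(5) || '!';               -- 'ab!': the padding is dropped on conversion to text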

Posted by Christophe Pettus
Aug. 28, 2018, 5 p.m.

Don’t LOCK tables. Just don’t.

It’s not uncommon that an application needs to serialize access to one or more resources. The temptation is very high to use the LOCK TABLE SQL statement to do this. Resist the temptation. There are many issues with using LOCK: It blocks autovacuum, which can cause bloat or even transaction ID wraparound in extreme cases. […]
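
One common alternative (named here for illustration; the truncated excerpt doesn't say what the post recommends) is an advisory lock, which serializes application-level work without taking a table lock or blocking autovacuum:

begin;
select pg_advisory_xact_lock(42);   -- 42 is an arbitrary application-chosen key; waits until the lock is free
-- ... do the work that must be serialized ...
commit;                             -- the advisory lock is released automatically at commit or rollback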

Posted by Christophe Pettus
Aug. 9, 2018, 10:06 a.m.

Three Steps to pg_rewind Happiness

pg_rewind is a utility included with PostgreSQL since 9.x. It’s used to “rewind” a server so that it can be attached as a secondary to a primary. The server being rewound could be the former primary of the new primary, or a secondary that was a peer of the new primary. In pg_rewind terminology, and […]

Posted by Christophe Pettus
Aug. 7, 2018, 5:58 p.m.

Does anyone really know what time it is?

PostgreSQL has a variety of ways of telling time: now(), statement_timestamp(), and clock_timestamp(). Each has a different sense of when “now” is: now() is the time at the start of the transaction; it never changes while the current transaction is open. statement_timestamp() is the time that the current statement started running. It changes from statement […]
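
A quick way to see the difference (a sketch, not from the excerpt): inside one transaction, now() stays fixed while clock_timestamp() keeps moving.

begin;
select now(), statement_timestamp(), clock_timestamp();
select pg_sleep(2);
select now(), statement_timestamp(), clock_timestamp();   -- now() is unchanged; the other two have advanced
commit;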

Posted by Christophe Pettus
July 23, 2018, 9 a.m.

One of those things: pgbouncer and self-signed certificates

(This is another intermittent series of small things that are easy to forget, and cause irritation in the PostgreSQL world…) Since version 1.7, pgbouncer has had built-in TLS support. This is very handy, since the previous versions required stunnel or something similar to provide TLS. Frequently, either for test purposes or because you’re in an […]

Posted by Christophe Pettus
July 18, 2018, 9 a.m.

pg_rewind and checkpoints: caution!

pg_rewind, introduced in PostgreSQL 9.5, is a powerful utility solving a particular problem: If you have a promoted a streaming replication secondary into being a primary, how can you make sure that the former primary, and any other secondaries that used to be connected to it, are able to connect to the new primary? Previously, […]

Posted by Christophe Pettus
July 17, 2018, 10:38 a.m.

That Google Checksum Tool

Google recently released a tool to check on-disk checksums in PostgreSQL. PostgreSQL being hot, and Google being Google, this generated a lot of press, much of it containing an element of FUD about the integrity of PostgreSQL backups (note that Google’s own announcements were quite straight-forward about it). First, some background: Since PostgreSQL 9.3, it […]

Posted by Christophe Pettus
July 16, 2018, 8:33 p.m.

One of those things: /tmp/pgpool_status

(This is another intermittent series of small things that are easy to forget, and cause irritation in the PostgreSQL world…) When setting up pgpool2, it’s common to tweak the configuration file repeatedly… and often get a hostname wrong. One common occurrence is then: You fix the hostnames. You restart pgpool2. It refuses to connect to […]

Posted by Christophe Pettus
May 10, 2018, noon

PostgreSQL Replication at PerconaLive 2018

The slides from my talk on PostgreSQL Replication at PerconaLive 2018 are available.

Posted by Christophe Pettus
April 10, 2018, 6:36 p.m.

Change these: wal_level and wal_hint_bits

(An intermittent series on PostgreSQL parameters whose default settings you should change.) wal_level = logical wal_level controls how much information is written into the write-ahead log. Higher settings write more information, enabling more features, at the expense of (somewhat) larger write-ahead log volume. Just set it to logical, the highest setting. The incremental volume is […]

Posted by Christophe Pettus
April 6, 2018, 8 a.m.

Change these: Logging

(An intermittent series on PostgreSQL parameters whose default settings you should change.) Here’s my preferred logging configuration: log_destination = 'csvlog' logging_collector = on log_directory = '/var/log/postgresql' log_filename = 'postgresql-%Y%m%d-%H%M%S.log' log_rotation_size = 1GB log_rotation_age = 1d log_min_duration_statement = 250ms log_checkpoints = on log_connections = on log_disconnections = on log_lock_waits = on log_statement = 'ddl' log_temp_files = […]

Posted by Christophe Pettus
April 3, 2018, 1 p.m.

Change these: archive_mode and archive_command

(An intermittent series on PostgreSQL parameters whose default settings you should change.) By default archive_mode is set to off, and thus archive_command is ignored. Even if you are not going to be using WAL archiving, you should change these. You can use settings such as: archive_mode = on archive_command = '/bin/true' It requires a server […]

Posted by Christophe Pettus
March 30, 2018, 9:28 p.m.

Change This: wal_compression

(An intermittent series on PostgreSQL parameters whose default settings you should change.) Introduced in PostgreSQL 9.5, wal_compression is off by default, but you should probably turn it on. First, what does it do? The documentation helpfully explains: When this parameter is on, the PostgreSQL server compresses a full page image written to WAL when full_page_writes […]

Posted by Christophe Pettus
March 27, 2018, 6:12 p.m.

Mountpoints and the Single PostgreSQL Server

Ultimately, a PostgreSQL database is just files, and those files have to be stored somewhere. Absent tablespaces, they are all stored in a single directory structure, traditionally called PGDATA. While smaller instllations can just use the defaults associated with the packaging, larger databases are often installed on their own volume with its own mountpoint. Two […]

Posted by Christophe Pettus
March 20, 2018, 6:28 a.m.

Why PostgreSQL is Terrible? at Nordic PGDay

The slides from my talk at Nordic PGDay, Why PostgreSQL is Terrible, are now available.

Posted by Christophe Pettus
Jan. 2, 2018, 10 p.m.

A Replication Cheat-Sheet

So many types of replication, so little time! What kind of PostgreSQL replication should you use?

Posted by Christophe Pettus
Oct. 30, 2017, 2 p.m.

Commitment Issues

One of the essentials of any database system is that a transaction is either in progress, committed, or rolled back. But consider what happens if… BEGIN; UPDATE table SET money=money+100000; COMMIT; -- And you get an error that the server has disconnected Did that transaction commit, or didn’t it? You don’t really know, because you […]

Posted by Christophe Pettus
Oct. 27, 2017, 1:29 a.m.

Streaming replication stopped? One more thing to check.

We recently were asked by a client to look at a strange replication situation. Replication had simply stopped from a primary to a streaming replication secondary. Everything looked fine: The secondary was connected to the primary, as we could see in pg_stat_replication. The secondary was not receiving any queries, so delays in applying incoming page […]

Posted by Christophe Pettus
Oct. 27, 2017, 1:06 a.m.

“Humans do not have a primary key” at PGConf.EU

The slides for “Humans do not have a primary key” as presented at PGConf EU are available here. Three links from the presentation are: Personal names around the world Kudzu and the California Marriage Amendment General Data Protection Regulations in the EU

Posted by Christophe Pettus
July 9, 2017, 5:19 p.m.

A word of praise for TextBelt

Quite often I need to kick off a long-running process for a client, then resume work immediately once it's done. pg_restore, pg_basebackup, pg_upgrade, and vacuumdb --analyze-only all come to mind as examples. The ideal thing is to get a text message upon completion. When I'm working on my own boxes, I can just mail(1) my mobile carrier's text gateway, but I can't count on clients' servers having sendmail or the like set up (they usually don't).

Enter TextBelt. This service is a dead-simple HTTP-to-SMS gateway. Adapted from their docs:

curl -X POST \
--data-urlencode phone="$MY_MOBILE_NUMBER" \
--data-urlencode message='The process completed.' \

Cleartext HTTP is also supported, in case the client box has broken SSL libraries. My texts are always nondescript, so I don't mind sending them in the clear.

The whole thing is open-source, so you can set up your own TextBelt server. Or you can be lazy and throw a few dollars their way for a certain number of texts. I rather like this business model, actually, as a way to support open-source work.

Posted by Quinn Weaver
July 7, 2017, 3:47 p.m.

The PATH of cron

Short version

Postgres Plus keeps psql out of /usr/bin, so you need to set PATH in your cron jobs (including for WAL-E).

Longer version

Like all good people, I set up a cron job to run nightly WAL-E base backups. With one client, this failed the first time:

wal_e.main ERROR MSG: could not run one or more external programs WAL-E depends upon
DETAIL: Could not run the following programs, are they installed? psql
STRUCTURED: time=2017-07-07T03:00:01.957961-00 pid=25833
Turns out they were using Postgres Plus, and it puts psql in /opt/PostgresPlus/9.3AS/bin. That directory is in the enterprisedb user's PATH, of course, but not in the minimal PATH that cron jobs get by default. So I had to log in as enterprisedb, echo $PATH, and then paste PATH = [what echo said] at the top of my cron job. Moral of the story: take care when setting up cron jobs for PostgreSQL forks, or for non-standard community PostgreSQL installations.

Posted by Quinn Weaver
March 29, 2017, 7:37 p.m.

Change autovacuum_freeze_max_age without a restart (sort of…)

This blog post is kind of involved, so I'm giving a short version at the top, with some background for beginners at the bottom. The middle section explains the motivation for using this hack in the first place.

Short version

I came up with a useful and/or terrible hack the other day: setting autovacuum_freeze_max_age as a storage parameter. I definitely don't recommend doing this routinely, but it unblocked us during a critical maintenance window.

    ALTER TABLE my_table SET (autovacuum_freeze_max_age = 300000000);

Don't forget to set it back when you're done! Otherwise you will incur an even longer autovacuum freeze, probably when you least expect it.

Medium-length version

My colleague Kacey Holston was in the midst of upgrading a client from PostgreSQL 9.4 to 9.6, using Slony for minimal downtime. As planned, the client took a few minutes of downtime so Kacey could do the switchover. She was ready to reverse the direction of replication (so the 9.6 server was replicating to the 9.4 server, in case our client needed to fall back to it). But there was an autovacuum freeze (a.k.a. "autovacuum (to prevent wraparound)") that was keeping Slony from getting the brief ExclusiveLock it needed.

She knew from experience that this table takes three hours to freeze. But the client had only minutes of downtime scheduled – that was the whole point of using Slony!

If only it were possible to change autovacuum_freeze_max_age on the fly; then we could bump it up to stop that autovacuum. Unfortunately, you have to restart the database in order to change it. Except…

You can set it on a per-table basis, as follows. This took effect immediately:

    ALTER TABLE my_table SET (autovacuum_freeze_max_age = 300000000);

If you do this, don't forget to set it back to the normal value (by default, 200000000) once you're done! Otherwise autovacuum freezes on this table will come around less often and take even longer.
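
One way to set it back is simply to reset the storage parameter, which reverts the table to the server-wide setting (a sketch, not from the original post):

    ALTER TABLE my_table RESET (autovacuum_freeze_max_age);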

Background for beginners:

When the oldest transaction ID on any row in a table is more than autovacuum_freeze_max_age old (200 million transactions old, by default), then an "autovacuum (to prevent wraparound)" process runs on the table to reclaim old transaction IDs. For large tables, this can be a problem, because it can generate a lot of CPU and I/O activity during busy hours. Also, as we saw here, it locks the table (in SHARE UPDATE EXCLUSIVE mode); this blocks DDL changes (a.k.a. migrations).

For more-technical background, see the official PostgreSQL docs on how transaction IDs work, and for a friendlier intro, see this series of blog posts.

Posted by Quinn Weaver
March 3, 2017, 5:20 p.m.

Slides from my talk at SCALE15x

The slides from my SCALE 15x talk: Minimizing PostgreSQL Major Version Upgrade Downtime Using Slony are available here:

You'll want to download the powerpoint version as it includes some embedded screencasts that don't play on slideshare.

Posted by Jeff Frost
Jan. 29, 2016, 2:46 p.m.

Language thoughts

All the languages I've used heavily have one primitive data structure you fall into using as a golden hammer:

  • C: arrays
  • Lisp: lists
  • Perl: hashes
  • JS: "objects"
Et cetera. Python is borderline; it's tempting to abuse dicts, but there's a bit less friction to using proper objects than with Perl.

Posted by Quinn Weaver
Jan. 16, 2016, 11:06 p.m.

Blocked by rdsadmin

One of our clients on RDS had a VACUUM FREEZE that was hanging for a long time. "I bet it's waiting on a lock," I thought. Yup, but the curious part is what it was waiting on: a mysterious process run by the rdsadmin role (see RECORD 2, below):

SELECT pg_stat_activity.*, pg_locks.mode
FROM pg_stat_activity
JOIN pg_locks USING (pid)
JOIN pg_class ON pg_locks.relation = pg_class.oid
WHERE pg_class.relname = 'users'
AND pg_locks.mode IN ('ShareUpdateExclusiveLock', 'ShareLock', 'ShareRowExclusiveLock', 'ExclusiveLock', 'AccessExclusiveLock');

-[ RECORD 1 ]----+------------------------------
datid | 1234
datname | my_database
pid | 14641
usesysid | 16396
usename | postgres
application_name | psql
client_addr |
client_hostname |
client_port | 5430
backend_start | 2016-01-15 22:05:06.161987+00
xact_start | 2016-01-15 22:14:39.301425+00
query_start | 2016-01-15 22:14:39.301425+00
state_change | 2016-01-15 22:14:39.301429+00
waiting | t
state | active
query | VACUUM FREEZE verbose users;
mode | ShareUpdateExclusiveLock
-[ RECORD 2 ]----+------------------------------
datid | 1234
datname | my_database
pid | 22328
usesysid | 10
usename | rdsadmin
application_name |
client_addr |
client_hostname |
client_port |
backend_start |
xact_start |
query_start |
state_change |
waiting |
state |
query |
mode | ShareUpdateExclusiveLock
Further examination showed that this process was locking only the users table (and its indexes):

SELECT locktype, relation::regclass AS tablename
FROM pg_locks
JOIN pg_stat_activity USING (pid)
WHERE pid = 22328;

locktype | tablename
relation | user_index_a
relation | user_index_b
relation | user_index_c
relation | users_pkey
virtualxid |
relation | users
(13 rows)

Has anyone else seen such a process? I'm curious as to what it is. My current best guess is a vacuum. We opened a ticket with Amazon to ask them, so I'll update this post when Amazon replies.

EDIT 2016-01-17: the above is a braino for "an autovacuum." An Amazon rep wrote back the next day to say that it was, indeed, an autovacuum process, and included some vacuum-tuning tips. So good on them, although it's unfortunate that RDS's privilege system doesn't allow you to see the pg_stat_activity.query field when the rdsadmin role is working on a table you own.

Posted by Quinn Weaver
Nov. 10, 2015, 10:31 p.m.

wal-e - avoiding the cron emails

If you're like me, you use wal-e for PITR based backups on many of the PostgreSQL servers you manage.  While setting it up, you likely discovered the --terse option greatly reduces the log output.  However, if you run the backup-push job in cron and have email aliases set up properly, you probably receive emails like this every time the backup-push runs:

NOTICE:  pg_stop_backup complete, all required WAL segments have been archived

Why is that?  Well, the default value for client_min_messages in PostgreSQL is NOTICE, which means the client (wal-e in this case) will be sent messages of NOTICE and higher priority.

So, how to change that?  You may have gotten this far and realized there isn't a handy command line switch that wal-e exposes.  That's ok though, wal-e uses libpq, so we just need to use the PGOPTIONS environment variable.

If you're using wal-e with envdir like in the documentation examples, then you can just plunk down a PGOPTIONS file in /etc/wal-e.d/env that looks like this:

-c client_min_messages=WARNING

and that's it. If you're using a wrapper script for wal-e instead of envdir, you just need to export that environment variable before calling wal-e like this:

export PGOPTIONS="-c client_min_messages=WARNING"

Now sit back and enjoy a few less cron emails in the morning.

EDIT: The envdir method only works if you have bash as your cron shell (SHELL=/bin/bash in the crontab) or if you call the command like this:

bash -c "envdir /etc/wal-e.d/env /usr/local/bin/wal-e --terse backup-push /var/lib/postgresql/9.4/main"

Posted by Jeff Frost
Oct. 6, 2015, 8:19 p.m.

Reflections on (Humans) Trusting (Humans') Trust

One thing I've learned from consulting is that you should trust a competent person's feels about a technology, even if they can't immediately come up with arguments to support them. Seemingly vague statements like "Oh, I hate the Foo webserver, it's so flaky" or "Bar is a really annoying language to work in" or "the Baz API just doesn't feel very user-friendly" are actually the distilled product of many hard-won experiences that the speaker may not be able to call to mind individually. I observed this last night at Larry Wall's Perl 6 talk, specifically the Q&A: most of the time he came up with great examples, but sometimes he just had to talk about his own language in broad, esthetic terms. And I'm no Larry Wall, but today I found myself in a similar position when advising a client about which of two virtualization platforms to use. Of course, this applies to people speaking from experience and not from prejudice; you have to know how good someone's judgment is in order to transitively-trust them to have good opinions. That's one soft skill that gives you a major edge in technology. But, contrary to stereotype, most technical people are good at making these kinds of meta-judgments.

Posted by Quinn Weaver
June 9, 2015, 1:08 p.m.

Baidu Bot Blues

A client complained of mysterious connection spikes. "We think we're getting hit by bots, but we're not sure," they said; "Can you look into it?"

So I edited postgresql.conf, turned on log_connections and log_disconnections, did a 'pg_ctl -D MY_DATA_DIR reload', and waited (in my case I also set log_min_duration_statement = 0, but I don't recommend that unless you can watch to make sure the logs aren't filling the log partition — and make sure you're not going to hammer a slow log device).

A while later, I did

ls -t data/pg_log/*.csv | head -5 | xargs grep -c connection

(That says "List log files in order from most recent to least recent, take the five most recent, and count the lines in each file containing 'connection'.")

I got lucky. A spike was obviously in progress; the current log file showed a count two orders of magnitude higher than the other log files'. Since the problem was in progress, no post mortem was required; I just mailed the client and asked "Do your web logs show bot traffic right now?"

The client mailed right back saying, "Yes, we see a bunch of connections from Baidu's web crawler; it's going nuts hitting our site."

Baidu is a Chinese search engine. It turns out its crawling bot is well-known for being aggressive. As that link explains, the bot doesn't respect Crawl-delay in robots.txt (that's annoying but unsurprising, since Crawl-delay is a non-standard extension).

So if this happens to you, you have these options:

  1. Register at Baidu's site (Chinese-only) to request less-aggressive crawling.
  2. Try to tune PostgreSQL connection pooling to handle the onslaught.
  3. Block Baidu entirely. This is an extreme option — not advised if you do business in China, or if you have Chinese-speaking customers elsewhere who are reaching you through Baidu. The latter requires analytics to figure out.

Posted by Quinn Weaver
May 8, 2015, 2:37 p.m.

syslog Considered Dangerous

Edit 2015-10-01: there's a similar problem on RDS, where logs are written to the same partition as data, which can cause problems if you don't have enough provisioned IOPS (or disk space).

A client of ours was suffering mysterious, intermittent lock pileups on a large table. To diagnose the problem, I ran our lock-logging scripts.

Querying the resulting log_transaction_locks table, I saw lots of extend locks and RowExclusiveLocks piling up behind normal UPDATEs (and INSERTs) — there were even extend locks waiting on other extend locks. This is pretty unusual; you'd expect that when one UPDATE completed, the extend lock would succeed, PostgreSQL would quickly extend the relation (in this case, the big table's TOAST relation), and queries would keep running smoothly. These were all single-row writes; they should not have taken that long. However, a complicating factor was that a couple of columns in that big table were giant TEXT columns storing JSON strings. Was that the issue? The problem was unclear, and we needed more info.

At this point my colleague Christophe Pettus made the excellent suggestion to do 'strace -p' on one of the locked processes. The dump revealed lots of sendto() calls with giant strings, which he immediately recognized as syslog messages based on their formatting.

Now, this was a problem. syslog writes to /var, and /var was hanging off the root partition, which was an EBS volume (without provisioned IOPS). So basically the entire SQL of each query was being written across the network, including those giant JSON strings, and the query itself wouldn't terminate till the write was fsync()'ed. In the meantime, extend locks and other UPDATEs piled up behind it. The PostgreSQL data directory was on a big, fast software RAID across SSD instance storage, but performance was being bottlenecked by that single EBS volume.

The client elected to turn off log_lock_waits, and the problem went away. In addition, Christophe suggested prepending a minus sign to the filename in /etc/rsyslog.d/postgres.conf, like so, which tells syslog not to flush between each log statement:

local0.* -/var/log/postgres.log

I thought that was neat: a potential fix to a huge performance problem, in one character(!)

Lessons learned:

1. If your log_destination contains syslog, be careful about filesystem layout. (If you stick to csvlog, this normally isn't a problem, because csvlogs, by default, go in pg_log/ under the PostgreSQL data dir.)

2. Logging info about a performance problem can sometimes make it worse. In this case, we decided to diagnose the pre-existing lock pileups by setting log_lock_waits = on and deadlock_timeout = 200. But, because of the syslog issue, those very settings exacerbated the lock problems. This was necessary to catch the problem, but it's something to keep in mind if you decide to use our lock-logging scripts.

Posted by Quinn Weaver
Sept. 18, 2014, 12:19 p.m.

WAL-E with Rackspace CloudFiles over Servicenet

Found a great walkthrough on setting up WAL-E to use python-swiftclient for storage in Rackspace Cloud Files:

Unfortunately by default, your backups use the public URL for Cloud Files and eat into metered public bandwidth.

The way to work around this is to set the endpoint_type to internalURL instead of the default publicURL.

You do that by setting the following environment variable:


That allows WAL-E to use Servicenet for base backups and WAL archiving which will be much faster and not eat into your metered public bandwidth.

Posted by Jeff Frost
Sept. 17, 2014, 1:18 p.m.

RDS for Postgres: List of Supported Extensions

Today I learned that Amazon doesn't keep any published list of the extensions supported in RDS PostgreSQL. Instead, their documentation tells you to start a psql session and run 'SHOW rds.extensions'. But that creates a chicken-and-egg situation if you have an app that needs extensions, and you're trying to decide whether to migrate.

Update 2018-10-18: here's the list for RDS PostgreSQL 10.4 (in ASCIIbetical order). I'm pleased to see they added hll!


So here's a list of extensions supported as of today, 2014-09-17 (RDS PostgreSQL 9.3.3). I'll try to keep this current.



Posted by Quinn Weaver
April 29, 2014, 6:04 p.m.

How to start PostgreSQL the Debian way with pg_ctl

Sometimes I find myself wanting to start a PostgreSQL cluster with pg_ctl on a Debian or Ubuntu system.  If you've ever tried this, you know that something like this doesn't work:

vagrant@vagrant-ubuntu-trusty-64:/vagrant$ sudo -u postgres /usr/lib/postgresql/9.3/bin/pg_ctl -D /var/lib/postgresql/9.3/main/ start
server starting
vagrant@vagrant-ubuntu-trusty-64:/vagrant$ postgres cannot access the server configuration file "/var/lib/postgresql/9.3/main/postgresql.conf": No such file or directory

because the Debian packages place the config files in /etc/postgresql/<version>/<cluster name>

So, I used to think that this was the way to start it:

vagrant@vagrant-ubuntu-trusty-64:/vagrant$ sudo -u postgres /usr/lib/postgresql/9.3/bin/pg_ctl -D /etc/postgresql/9.3/main/ start
server starting
vagrant@vagrant-ubuntu-trusty-64:/vagrant$ 2014-04-30 00:59:32 UTC LOG:  database system was shut down at 2014-04-30 00:57:49 UTC
2014-04-30 00:59:32 UTC LOG:  database system is ready to accept connections
2014-04-30 00:59:32 UTC LOG:  autovacuum launcher started

And that does generally work, but the proper way to do it is this:

vagrant@vagrant-ubuntu-trusty-64:/vagrant$ sudo -u postgres /usr/lib/postgresql/9.3/bin/pg_ctl -D /var/lib/postgresql/9.3/main/ -o "-c config_file=/etc/postgresql/9.3/main/postgresql.conf" start
server starting
vagrant@vagrant-ubuntu-trusty-64:/vagrant$ 2014-04-30 01:00:22 UTC LOG:  database system was shut down at 2014-04-30 01:00:16 UTC
2014-04-30 01:00:22 UTC LOG:  database system is ready to accept connections
2014-04-30 01:00:22 UTC LOG:  autovacuum launcher started

The real Debian way is to use pg_ctlcluster like so:

vagrant@vagrant-ubuntu-trusty-64:~$ sudo -u postgres pg_ctlcluster 9.3 main start

Posted by Jeff Frost
April 18, 2014, 3:54 p.m.

Grabbing the statement durations out of a log file using gnu grep

Sometimes I find myself wanting to grep all the query durations out of a file for sorting.  I recently discovered that I can use gnu grep's lookbehind syntax like so:

grep -oP '(?<=duration: )[0-9\.]+' postgresql-Mon.csv

and then sort them numerically like so:

grep -oP '(?<=duration: )[0-9\.]+' postgresql-Mon.csv | sort -n | tail

Posted by Jeff Frost
April 2, 2014, 4:02 p.m.

Putting stats_temp_directory on a ramdisk

This hack is an old chestnut among PostgreSQL performance tuners, but it doesn't seem to be widely known elsewhere. That's a shame, because it's pure win, and it's ridiculously easy to set up. You don't even need to restart PostgreSQL.

Here's the situation: PostgreSQL writes certain temporary statistics. These go in the dir given by the stats_temp_directory setting. By default, that's pg_stat_tmp in the data dir. Temp files get written a lot, but there's no need for them to persist.

That makes them perfect candidates for a ramdisk (a.k.a. RAM drive). A ramdisk is a chunk of memory treated as a block device by the OS. Because it's RAM, it's super-fast. As far as the app is concerned, the ramdisk just holds a filesystem that it can read and write like any other. Moreover, PostgreSQL generally only needs a few hundred kilobytes for stats_temp_directory; any modern server can fit that in RAM.

In Linux, you set up a ramdisk like this:

As root:

'mkdir /var/lib/pgsql_stats_tmp' [1]

'chmod 777 /var/lib/pgsql_stats_tmp'

'chmod +t /var/lib/pgsql_stats_tmp'

Add this line to /etc/fstab. That 2G is an upper limit; the system will use only as much as it needs.

tmpfs /var/lib/pgsql_stats_tmp tmpfs size=2G,uid=postgres,gid=postgres 0 0

'mount /var/lib/pgsql_stats_tmp'

Then, as postgres:

Change the stats_temp_directory setting in postgresql.conf:

stats_temp_directory = '/var/lib/pgsql_stats_tmp'

Tell PostgreSQL to re-read its configuration:

'pg_ctl -D YOUR_DATA_DIR reload'

And that's it!

Other operating systems have different ways to set up ramdisks. Perhaps I'll cover them in a later post.

[1] The directory /var/lib/pgsql_stats_tmp is an arbitrary choice, but it works well for Debian's filesystem layout.

Posted by Quinn Weaver
Nov. 21, 2013, 7:42 p.m.

Ubuntu 3.11.0 kernel not as stable as hoped

As I mentioned in my previous post, Ubuntu recently released a 3.11.0 kernel for Precise.  Unfortunately, while it appeared stable during a long burn-in process with various pgbench workloads and scale sizes, it crashed shortly after putting the machine into service, so make sure you test thoroughly on your hardware. 

This particular hardware was the Dell PowerEdge 720 that Rackspace uses.

Guess we're going back to custom compiling kernels for a little while longer.

Anyone else have experiences good or bad with it?

I believe this is the same kernel that's shipping with 13.10 Saucy Salamander.

Posted by Jeff Frost
Nov. 15, 2013, 1:34 p.m.

Ubuntu has released a 3.11.0 kernel for 12.04 LTS!

This one is for those of you compiling a custom kernel or staying on 10.04 LTS because of this issue on database servers in the 3.x kernels:

Ubuntu has released linux-image-3.11.0-13-generic for 12.04 in which that fix is included.  It's actually been included since 3.9rc2, but Ubuntu skipped straight to 3.11.

And there was much rejoicing in the PostgreSQL community.  At least from those of us who manage 12.04 LTS PostgreSQL servers!

For those of you who don't recall, here's a link to the big discussion thread:

One item of note on 3.11 is that the kernel.sched_migration_cost setting is now called kernel.sched_migration_cost_ns, so if you are setting that via sysctl, you will need to update your sysctl.conf or sysctl.d file after upgrade.

Now go upgrade your test servers and let us know how it performs!  I've been burning in a new server for about 24 hours with pgbench and the 3.11.0-13-generic kernel and so far it's been stable and a good performer.

Posted by Jeff Frost
Aug. 22, 2013, 3:57 p.m.


I prefer PostgreSQL, which is the core of our business at pgExperts. But I still have some legacy clients on MySQL. For one of them, I'm writing new features in Perl/MySQL. The mysql_auto_reconnect option just saved my day. I'd been getting errors like "DBD::mysql::st execute failed: MySQL server has gone away." To use it in DBI, do

my $dbh = DBI->connect(
    "DBI:mysql:database=$dbname;host=$dbhost",   # DSN, user, and password here are placeholders
    $dbuser, $dbpass,
    { mysql_auto_reconnect => 1 },
);
To use it in Rose::DB::Object, do

package MyApp::DB;
our @ISA = qw( Rose::DB );

# connection details here are placeholders; the option goes in connect_options
MyApp::DB->register_db(
    driver          => 'mysql',
    database        => $dbname,
    connect_options => { mysql_auto_reconnect => 1 },
);

Posted by Quinn Weaver
Aug. 10, 2013, 6:41 p.m.

Get your PgBouncer pools stats into Ganglia

Just whipped up a quick ruby script to put PgBouncer "show pools" data into ganglia using gmetric.  Based it heavily off the postgres_gmetric script.

You can find the pgbouncer gmetric script here:

Posted by Jeff Frost
Feb. 27, 2013, 5:38 p.m.

Script to add the repository on Debian/Ubuntu

You've probably heard the announcement: Martin Pitt's PPA is being deprecated.  This is in an effort to push folks towards the new official PostgreSQL apt repository.

See the announcement here:

Also, see the deprecation notice on Martin's PPA:

Since we have quite a few customers using Martin's PPA for PostgreSQL packages under Ubuntu, I wrote a little shell script to switch repositories if you're using Martin's PPA and just install the repository if you are not.

You can find it here:

Posted by Jeff Frost
Feb. 18, 2013, 4:30 p.m.

Python Type Gripe

I've been digging into Python lately. On Friday, I ran into an unintuitive behavior where two sets containing identical objects (of a custom class) were comparing unequal. I bet you can guess the reason: I hadn't defined __hash__ and __eq__ for my class. Take a look at this documentation for __repr__:

For many types, this function makes an attempt to return a string that would yield an object with the same value when passed to eval(), otherwise the representation is a string enclosed in angle brackets that contains the name of the type of the object together with additional information often including the name and address of the object. A class can control what this function returns for its instances by defining a __repr__() method.
"For many?" "makes an attempt?!" "often including?!!" To my way of thinking, this is a little crazy. There should be a single root class in the inheritance hierarchy, and that root should define __repr__, __hash__, and __eq__ methods that operate on/check for equality of each attribute in the class (by value, not by memory address!) Then they would behave consistently for all classes. Then two sets, each containing three objects of the same class, each containing the same attributes with the same values, would compare equal when you used == on them, following the principle of least surprise.

Of course, I can't rewrite Python to make this happen. I'm tempted just to make my own root class that implements these behaviors via metaprogramming, and make every class I ever define a subclass of it. Is there a better way?

PS: A colleague pointed out that this behavior reflects a deliberate decision in Python design philosophy: "Python […] favors informal protocols over inheritance trees […] You can think of __eq__ and __hash__ as kind of being parts of an informal 'collectable' protocol." That sort of makes sense. Of course, the thing about informal protocols is that there's nothing to enforce consistency.

Posted by Quinn Weaver
Feb. 14, 2013, 4:24 p.m.

Finding unreplicated tables and sequences in slony with check_postgres and nagios

Since slony doesn't automatically add tables to replication for you, sometimes it's handy to double check that everything is being replicated.  If your slony cluster is named "sample" then you would have a slony schema named "_sample". As such, you can query against the sl_table and sl_sequence tables to find unreplicated tables like so:

SELECT pg_class.OID::regclass as table_name FROM pg_class JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.OID WHERE relkind = 'r' AND pg_class.OID NOT IN (SELECT tab_reloid FROM _sample.sl_table) AND pg_namespace.nspname NOT IN ('information_schema', 'pg_catalog', '_sample')

SELECT pg_class.OID::regclass as sequence_name FROM pg_class JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.OID WHERE relkind = 'S' AND pg_class.OID NOT IN (SELECT seq_reloid FROM _sample.sl_sequence) AND pg_namespace.nspname NOT IN ('information_schema', 'pg_catalog', '_sample');

 Those queries are a bit unpleasant to remember, so I like to make them into views like so:

CREATE OR REPLACE VIEW public.unreplicated_tables AS SELECT pg_class.OID::regclass as table_name FROM pg_class JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.OID WHERE relkind = 'r' AND pg_class.OID NOT IN (SELECT tab_reloid FROM _sample.sl_table) AND pg_namespace.nspname NOT IN ('information_schema', 'pg_catalog', '_sample') AND pg_class.relname <> 'fwd_epc_map';

CREATE OR REPLACE VIEW public.unreplicated_sequences AS SELECT pg_class.OID::regclass as sequence_name FROM pg_class JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.OID WHERE relkind = 'S' AND pg_class.OID NOT IN (SELECT seq_reloid FROM _sample.sl_sequence) AND pg_namespace.nspname NOT IN ('information_schema', 'pg_catalog', '_sample');

Now, that's a little handier because we can just select straight out of the view.  Now that we have these easy-to-use views, we can use check_postgres.pl's custom_query check to set up a Nagios alert whenever there are unreplicated tables or sequences.

First we need to create the check commands like so:

define command{
        command_name    check_pgsql_slony_unreplicated_tables
        command_line    /usr/lib64/nagios/plugins/check_postgres.pl --host $HOSTADDRESS$ --port=5432 --dbuser nagios --dbname mydb --action=custom_query --query="SELECT count(table_name) AS result FROM public.unreplicated_tables" --valtype=integer --warn=1 --critical=5
}

define command{
        command_name    check_pgsql_slony_unreplicated_sequences
        command_line    /usr/lib64/nagios/plugins/check_postgres.pl --host $HOSTADDRESS$ --port=5432 --dbuser nagios --dbname mydb --action=custom_query --query="SELECT count(sequence_name) AS result FROM public.unreplicated_sequences" --valtype=integer --warn=1 --critical=5
}

I apologize for all the wrapping there.

Now that we have the commands defined, we use them in a service definition:

define service {
    use                          postgres-service
    notifications_enabled        1
    hostgroup_name               slony_primary
    service_description          Slony Unreplicated Tables
    check_command                check_pgsql_slony_unreplicated_tables
    contact_groups               pgexperts
}

define service {
    use                          postgres-service
    notifications_enabled        1
    hostgroup_name               slony_primary
    service_description          Slony Unreplicated Sequences
    check_command                check_pgsql_slony_unreplicated_sequences
    contact_groups               pgexperts
}
And that's it! Now we'll receive an alert whenever an unreplicated table or sequence exists on the servers in hostgroup slony_primary.

Posted by Jeff Frost
Feb. 12, 2013, 2:48 p.m.

Russian Doll Syndrome / Please Monitor Your RAID Volumes

As part of setting up nagios monitoring for customers, I always make sure that the RAID volumes are being monitored on the DB servers.  More often than not, I have to tell the customer there are bad disks in one or more of the RAID volumes.  Don't be one of those customers, make sure you monitor the RAID volumes on your PostgreSQL (and other) servers!

So, today I was setting up some of this same monitoring for LSI RAID controllers in a Dell server using the check_sasraid_megaraid nagios plugin.   This plugin requires the LSI SNMP installer as well as net-snmp and a few other miscellaneous packages.

This brings us to the Russian Doll Syndrome.  I download the zip file from the LSI support site and unzip it to find a zip full of zips:

$ unzip

Ok, fine, they have a zip for each architecture, so let's unzip the Linux_x64 one and get on with this:

$ unzip
  inflating: MD5CheckSum.txt        

Oh, sweet, another zip file!

$ unzip
  inflating: readme.txt             
 extracting: sas_snmp_64bit.tar.gz  

And inside that, we find a tarball:

$ tar xvfz sas_snmp_64bit.tar.gz

And finally an RPM.  Yeesh LSI, why couldn't we just have the RPM right inside the file or even just in the top level zip file?

So, all kidding aside, please make sure you're monitoring the RAID volumes on your PostgreSQL servers!

Posted by Jeff Frost
Feb. 1, 2013, 3:12 p.m.

Why PostgreSQL Doesn't Allow Unsafe Implicit Type Coercions

I got bitten by a MySQL bug today, and it reminded me why we don't allow¹ unsafe type coercions in PostgreSQL. It's easy to talk about them in the abstract, but harder to explain why they're a problem in practice. Here's a really clear example.

I do most of my consulting through pgExperts these days, but I still have a legacy client on MySQL. Today I was writing a new feature for them. Using test-driven development, I was debugging a particular invocation of a five-way join. I found it was returning many rows from a certain right-hand table, when it should have been returning none.

I took a closer look at my SQL and found this code²:

WHERE order.status_id NOT IN ('completed', 'cancelled')

But that's not right! status_id isn't a varchar. It's an int FK to a lookup table:

CREATE TABLE statuses (
  id int(10) unsigned NOT NULL AUTO_INCREMENT,
  name varchar(16) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY name (name)
);

MySQL was silently coercing my integer order.status_id to a varchar, comparing it to 'completed' and 'cancelled' (which of course it could never match), and returning a bunch of right-hand table rows as a result. The filter was broken.

PostgreSQL would never have allowed this. It would have complained about the type mismatch between status_id and 'completed' or 'cancelled'. Rather than a confusing resultset, I'd get a clear error message at SQL compile time:

ERROR: invalid input syntax for integer: "cancelled"
LINE 8: select * from frobs where order.status_id NOT IN ('cancelled', 'c...

Fortunately, I caught and fixed this quickly (hurrah for test-driven development!) The fix was simple:

JOIN statuses stat ON stat.id = order.status_id
WHERE stat.name NOT IN ('completed', 'cancelled')

But under different circumstances this could have seriously burned me.

And that's why we disallow certain implicit type coercions. This post is not intended as a rant against MySQL. Rather, it's an illustration of how a seemingly abstract issue can seriously hose your code. Caveat hacker!

1. Since PostgreSQL 8.3 (2008), that is. Before that, we did allow unsafe coercions.

2. I've changed the code and identifiers to protect client confidentiality.

Posted by Quinn Weaver
Aug. 23, 2012, 5:57 p.m.

PostgreSQL, NUMA and zone reclaim mode on linux

Lately we've been seeing issues with zone reclaim mode on large memory multi processor NUMA linux systems.

What's NUMA?  It's just an acronym for Non-Uniform Memory Access.  This means that some memory in your system is more expensive for a particular CPU to access than its "local" memory.

You can see how much more distant the kernel considers the different zones by using the numactl command like so:

numactl --hardware
If you've got a modern multiprocessor system, you'll probably see something like this:

available: 2 nodes (0-1)
node 0 size: 48417 MB
node 0 free: 219 MB
node 1 size: 48480 MB
node 1 free: 135 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10 

Here we see a distance of 10 for node 0 to access node 0 memory and 21 for node 0 to access node 1 memory.   What does distance really mean?  It's a cost parameter based on number of "hops" or buses that separate the node from the distant memory.

Now, what's zone reclaim mode?

From the linux kernel docs:


Zone_reclaim_mode allows someone to set more or less aggressive approaches to
reclaim memory when a zone runs out of memory. If it is set to zero then no
zone reclaim occurs. Allocations will be satisfied from other zones / nodes
in the system.

This is value ORed together of

1       = Zone reclaim on
2       = Zone reclaim writes dirty pages out
4       = Zone reclaim swaps pages

zone_reclaim_mode is set during bootup to 1 if it is determined that pages
from remote zones will cause a measurable performance reduction. The
page allocator will then reclaim easily reusable pages (those page
cache pages that are currently not used) before allocating off node pages.

It may be beneficial to switch off zone reclaim if the system is
used for a file server and all of memory should be used for caching files
from disk. In that case the caching effect is more important than
data locality.

Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
throttle the process. This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore but it preserve the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.

Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset

The key passage is the one about file servers: PostgreSQL depends heavily on the filesystem cache, so disabling zone reclaim mode is desirable in this situation as well.

There's been a bit of discussion about this on the pgsql-performance mailing list here:

If you've got a modern multi-socket system, odds are good that zone reclaim mode is enabled automatically on boot.  You can check this by looking at /proc/sys/vm/zone_reclaim_mode.

The biggest issue we've seen with zone reclaim mode enabled on customers' multi-socket systems is that the filesystem cache never fills up, even when the database is much larger than RAM.  That's because the system is trying to keep some "local" memory available. After disabling zone_reclaim_mode, the filesystem cache fills up and performance improves.

So, how to disable zone_reclaim_mode?  The best way to do this is via sysctl.  Just add:

vm.zone_reclaim_mode = 0 

to /etc/sysctl.conf, save it and execute sysctl -p to load the new settings into the kernel.
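If you'd rather not wait for the next boot, you can also check and change the setting at runtime (a quick sketch; both commands need root):

# check the current setting (1 means zone reclaim is on)
cat /proc/sys/vm/zone_reclaim_mode
# turn it off immediately, without a reboot
sysctl -w vm.zone_reclaim_mode=0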

Other interesting non-PostgreSQL pages on NUMA/zone_reclaim_mode:

Posted by Jeff Frost
Aug. 2, 2012, 10:16 a.m.

Finding non-UTF8 data in a dump

Recently I was tasked with identifying all rows in a PostgreSQL database that contained non-UTF8 data, in preparation for migrating from SQL_ASCII encoding to UTF8.  Because of the database's size, just using iconv and diff on a dump file wouldn't work.
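For a small database, that quick-and-dirty check looks something like this (a sketch; the database name and file paths are placeholders):

pg_dump mydb > dump.sql
# -c silently drops byte sequences that aren't valid UTF-8
iconv -f UTF-8 -t UTF-8 -c dump.sql > dump.clean.sql
# any lines that differ contained non-UTF8 data
diff dump.sql dump.clean.sql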

I started with this perl script from Jon Jensen:

It got me a little way down the road, but what I really wanted was to know which table had the bad row and, better yet, to be able to easily import the bad rows with a COPY command for cleanup, then use them as temporary tables in UPDATE statements to fix the production DB.

So, I turned to Python.  I found an isUTF8(text) definition here:

I took that and wrote a script which outputs COPY commands containing only the non-UTF8 rows to STDOUT, and also writes a schema-only dump command to STDERR that dumps just the tables with bad data, for ease of creating a temporary DB.

You can find the script on github here:

What other methods do you folks use?

Posted by Jeff Frost
May 23, 2012, 7:05 p.m.

Python script to dump all PostgreSQL DBs

For a long time, I've written all my automation scripts in bourne shell and bash with occasional forays into perl.  I really like bash because you're always guaranteed to have bash installed on a modern linux system and so don't need to install perl and 15 modules to get the job done.

That said, I've recently begun using python for more automation.  As an exercise, I've rewritten my script for python 2.7 trying to use only the standard library.

The script pg_dumps all the databases in a PostgreSQL cluster in the custom format allowing you to specify the compression level.  (See this blog post regarding pg_dump compression level performance.)  It also dumps the globals to a bzip2'd SQL file and then removes files in the dump destination directory that are older than the --keep value (default 7 days).  Finally, if it had any errors during the run, it sends the error log file to the specified email address.  It also picks up libpq environment variables, but will override those with anything specified on the command line.
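In plain shell terms, the per-cluster routine the script automates is roughly this (a sketch; database names, paths, and the compression level are placeholders, and only --keep comes from the script's actual options):

# one custom-format dump per database, at the requested compression level
pg_dump -Fc -Z 5 -f /backups/mydb.dump mydb
# the cluster globals (roles, tablespaces) as a bzip2'd SQL file
pg_dumpall --globals-only | bzip2 > /backups/globals.sql.bz2
# prune anything in the dump directory older than the --keep window (default 7 days)
find /backups -type f -mtime +7 -delete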

You can find it here:

Posted by Jeff Frost
May 15, 2012, 2:47 p.m.

Slony Upstart Script for Ubuntu 10.04 LTS

Since Ubuntu is moving towards using upstart instead of SysV init scripts, I thought I'd write a quick upstart script for starting slony compiled and installed in /usr/local/slony.  Upstart 1.4+ supports setuid, but the version of upstart included with Ubuntu 10.04 LTS is only 0.6.5, and even on Ubuntu 11.10 it's only 1.3, so instead I use start-stop-daemon's --chuid switch.  The script should be installed as /etc/init/slony.conf. Here's the script:

---cut here---
# slony - slon daemon
# slon is the slony replication daemon

description    "slony slon daemon"

start on net-device-up
stop on runlevel [06]

# can't use setuid until upstart 1.4
#setuid slony
# which is why we use start-stop-daemon with the --chuid switch


script
    exec start-stop-daemon --start --chuid slony --exec /usr/local/slony/bin/slon -- -f /etc/slon.conf >> /var/log/slony/slony.log 2>&1
end script
---cut here---

This will have the slon daemon run as the slony user and log to /var/log/slony/slony.log.  You'll also need a valid slony config file in /etc/slon.conf.
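With the config file in place, upstart can pick it up and start the daemon (a quick sketch):

sudo initctl reload-configuration   # re-read /etc/init/*.conf
sudo start slony                    # shorthand for: initctl start slony
sudo status slony                   # confirm the slon process is running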

You can also find the script in the pgexperts upstart-scripts git repo:

Posted by Jeff Frost
April 2, 2012, 5:58 p.m.

Using the row_number() window function to reorder images

One of my customers uses the rails acts_as_list module for maintaining order of images on forms.

Through some application issues, they ended up with images that had identical position values for the same provider and image_category. It ended up looking like this:

select provider_id, image_category_id, position, id from images where image_category_id=1234 order by position;


 provider_id | image_category_id | position |   id
-------------+-------------------+----------+--------
       9999 |              1234 |        3 | 484900
       9999 |              1234 |        4 | 484899
       9999 |              1234 |        4 | 484897
       9999 |              1234 |        4 | 484896
       9999 |              1234 |        4 | 484894
       9999 |              1234 |        4 | 484893
       9999 |              1234 |        4 | 484892
       9999 |              1234 |        4 | 484890
       9999 |              1234 |        5 | 484889
       9999 |              1234 |        6 | 484887
       9999 |              1234 |        7 | 484886
       9999 |              1234 |        8 | 479818
       9999 |              1234 |        9 | 479817
       9999 |              1234 |       10 | 479815

Note that there are several rows with position=4 above. This was a prevalent issue throughout the images table and not just with this example provider.

So, what they wanted was to reorder the images such that rows with equal positions were kept together and the overall order stayed the same.  For example, it should end up looking like this:

 provider_id | image_category_id | position |   id
-------------+-------------------+----------+--------
        9999 |             1234 |        3 | 484900  ==> Position 1
        9999 |             1234 |        4 | 484899  ==> Position 2
        9999 |             1234 |        4 | 484897  ==> Position 3
        9999 |             1234 |        4 | 484896  ==> Position 4
        9999 |             1234 |        4 | 484894  ==> Position 5
        9999 |             1234 |        4 | 484893  ==> Position 6
        9999 |             1234 |        4 | 484892  ==> Position 7
        9999 |             1234 |        4 | 484890  ==> Position 8
        9999 |             1234 |        5 | 484889  ==> Position 9
        9999 |             1234 |        6 | 484887  ==> Position 10
        9999 |             1234 |        7 | 484886  ==> Position 11
        9999 |             1234 |        8 | 479818  ==> Position 12
        9999 |             1234 |        9 | 479817  ==> Position 13
        9999 |             1234 |       10 | 479815  ==> Position 14

So, I thought about it for a bit, then it occurred to me: why not use the row_number() window function?  So I wrote the following migration SQL script:

CREATE temp TABLE new_image_positions AS
SELECT id
, row_number AS position
FROM (
    SELECT provider_type
    , provider_id
    , image_category_id
    , position
    , id
    , row_number() OVER (
        PARTITION BY provider_id
        , image_category_id
        ORDER BY position
      )
    FROM images
    WHERE provider_type = 'Project'
    AND position IS NOT NULL
) x ;

UPDATE images
SET position = n.position
FROM new_image_positions n
WHERE images.id = n.id;

And, voila, all the images are now properly ordered:

 provider_id | image_category_id | position |   id
-------------+-------------------+----------+--------
        9999 |              1234 |        1 | 484900
        9999 |              1234 |        2 | 484899
        9999 |              1234 |        3 | 484897
        9999 |              1234 |        4 | 484896
        9999 |              1234 |        5 | 484894
        9999 |              1234 |        6 | 484893
        9999 |              1234 |        7 | 484892
        9999 |              1234 |        8 | 484890
        9999 |              1234 |        9 | 484889
        9999 |              1234 |       10 | 484887
        9999 |              1234 |       11 | 484886
        9999 |              1234 |       12 | 479818
        9999 |              1234 |       13 | 479817
        9999 |              1234 |       14 | 479815
        9999 |              1234 |       46 | 403052
        9999 |              1234 |       47 | 403051
        9999 |              1234 |       48 | 403050
        9999 |              1234 |       49 | 403049
        9999 |              1235 |        1 | 484916
        9999 |              1235 |        2 | 484915
        9999 |              1235 |        3 | 484912
        9999 |              1235 |        4 | 484910
        9999 |              1236 |        1 | 493000
        9999 |              1236 |        2 | 492999

Posted by Jeff Frost
Dec. 23, 2011, 10:22 a.m.

pg_dump compression settings

After doing the base backup benchmarks, I thought it would be interesting to benchmark pg_dump run locally and remotely using all the different compression settings.  This will allow me to compare the dump times and the space savings as well as see how much dumping remotely slows the process.  Nobody is storing their dumps on the same server, right?

For the record, all these dumps used pg_dump's custom dump format (-Fc), the PGDATA directory is 93GB on disk and resides on a 4 disk RAID10 of 7200 RPM SATA drives.
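Each run was along these lines (a sketch rather than the exact invocation; the database name and output path are placeholders, and the remote runs simply pointed -h at the test server from the client machine):

time pg_dump -Fc -Z 5 -h test0 -U postgres -f /data/tmp/test.dump testdb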

CPU on the test server: Intel(R) Xeon(R) CPU X3330  @ 2.66GHz.
CPU on the test remote client: Intel(R) Core(TM) i7 CPU 960  @ 3.20GHz.
Network is gigabit.
The total size reported is from du -sh output.
Both client and test server are running postgresql-9.1.2.

For comparison, our best base backup time from the base backup blog post was: 15m52.221s

The results:

Local pg_dump:

 compression | time        | total size
-------------+-------------+------------
           0 | 19m7.455s   | 77G
           1 | 21m53.128s  | 11G
           2 | 22m27.507s  | 11G
           3 | 24m18.966s  | 9.8G
           4 | 30m10.815s  | 9.2G
           5 | 34m26.119s  | 8.3G
           6 | 41m35.340s  | 8.0G
           7 | 49m4.484s   | 7.9G
           8 | 91m28.689s  | 7.8G
           9 | 103m24.883s | 7.8G

Remote pg_dump:

 compression | time        | total size
-------------+-------------+------------
           0 | 20m1.363s   | 77G
           1 | 22m9.205s   | 11G
           2 | 22m19.158s  | 11G
           3 | 23m7.426s   | 9.8G
           4 | 26m10.383s  | 9.2G
           5 | 28m57.431s  | 8.3G
           6 | 33m23.939s  | 8.0G
           7 | 38m11.321s  | 7.9G
           8 | 62m27.385s  | 7.8G
           9 | 72m5.123s   | 7.8G

So, a few interesting observations:

  • Base backups are indeed faster, but that's no surprise since they have less overhead. 
  • Taking the backup locally is only slightly faster than remotely at the smaller compression levels.
  • At compression level 3 and above, the faster CPU on the test client becomes a huge benefit.
  • Looks like the default compression is level 6. This might not be what you want if you're concerned about minimizing the backup time and have the disk space to spare.

Posted by Jeff Frost
Dec. 6, 2011, 3:45 p.m.

PostgreSQL Base Backup Benchmark Results

For a while now we've sort of anecdotally thought that compression should help with base backups and also that the rsync server is faster than rsync over ssh, so while nothing was happening on our test server, I thought I'd benchmark a few different methods.

Looks like rsync using an rsync server with no compression was fastest, but not by much. pg_basebackup is right there, with rsync over any of the three RC4 (arcfour) ciphers all within a few seconds of one another. Blowfish-cbc comes in next, followed by aes128-cbc.

Most interesting is that adding compression quadruples the sync time! That's probably not the case over a slow link, but over gigabit it's definitely the case.

Tests were done with a 93GB PGDATA and a gigabit network, copying a linux software 4-disk RAID10 data dir on the first test server to a linux software 2-disk RAID0 on the second server.
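The rsync-server numbers below assume an rsync daemon module on test0; a minimal sketch of such a setup follows (the module name "data" is inferred from the rsync invocations below, while the path and options are assumptions):

# run as root on test0
cat > /etc/rsyncd.conf <<'EOF'
[data]
    path = /data/9.1/data
    uid = postgres
    read only = yes
EOF
rsync --daemon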

Following are the results:

Without compression:

time rsync test0::data /data/tmp/test-data/ -avP
real 15m52.221s
user 3m10.257s
sys 4m11.460s

time /usr/pgsql-9.1/bin/pg_basebackup -D /data/tmp/test-data -c fast -h test0 -U postgres
real 16m13.440s
user 0m17.314s
sys 2m18.217s

time rsync -e 'ssh -c arcfour' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 16m13.242s
user 8m32.674s
sys 5m28.312s

time rsync -e 'ssh -c arcfour256' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 16m13.656s
user 8m48.356s
sys 6m7.343s

time rsync -e 'ssh -c arcfour128' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 16m3.947s
user 8m44.627s
sys 6m7.499s

time rsync -e 'ssh -c blowfish-cbc' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 26m22.516s
user 19m16.038s
sys 8m45.046s

time rsync -e 'ssh -c aes128-cbc' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 29m30.785s
user 27m31.919s
sys 6m5.154s

time rsync -e 'ssh' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 30m50.301s
user 24m10.951s
sys 9m13.090s

time rsync -e 'ssh -c aes256-ctr' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avP
real 38m23.789s
user 31m36.011s
sys 8m39.210s

Our Best Performers With Compression:

time rsync -e 'ssh -c arcfour' postgres@test0:/data/9.1/data/ /data/tmp/test-data/ -avPz
real 62m31.933s
user 14m48.752s
sys 2m42.412s

time rsync test0::data /data/tmp/test-data/ -avPz
real 62m53.699s
user 6m42.846s
sys 1m44.357s

Since Rod, Joshua, and Rodrigo suggested pigz and netcat in the comments, I thought I'd give them both a quick try to see if they produced a winner.  Conveniently, GNU tar has a --use-compress-program option, so you can easily use pigz in this manner.

So, tar over ssh with the arcfour256 cipher using pigz for compression:

time ssh -c arcfour256 test0 "cd /data/9.1/data ; tar --use-compress-program=pigz -cf - ." | (cd /data/tmp/test-data ; tar --use-compress-program=pigz -xvf -)

 real    18m11.901s
user    7m3.587s
sys     3m53.487s

On test0: tar --use-compress-program=pigz -cf - . | nc -l 7000
On test1: time nc test0 7000 | tar --use-compress-program=pigz -xvf -

real    17m46.805s
user    6m17.562s
sys     4m4.563s

So, while tar with pigz over ssh or over netcat is a much faster way to compress, it's still a little slower than no compression at all.  This got me wondering whether the 93GB of PGDATA just doesn't compress all that well:

14G     /data/tmp/data-test.tgz

So, that's 6:1 compression. Not bad!  Interestingly, it took gnu-tar + pigz about 17m to create the test tgz file. That's about identical to the time it took to ship it over the network and uncompress/untar it on the other end.

That made me try tar with no compression piped over netcat:

On test0: tar -cf - . | nc -l 7000

On test1: time nc test0 7000 | tar -xvf -

real    15m52.313s
user    0m10.033s
sys     3m17.630s

So, tar with no compression over netcat clocked in almost identical to using rsync's native server.

Anyone have any other tools for me to test?

Posted by Jeff Frost