Migrating MySQL to Postgres
Switching out your database layer is always a PITA, no matter what level you've abstracted away from it, but this week I ended up taking the leap instead of tackling a harder problem presented by MySQL, and it went pretty well.
Why is this a problem? They're both SQL-based relational databases, right?
You'd be forgiven for thinking the problem is as simple as a quick mysqldump --compatible=postgresql. Unfortunately, whilst the Postgres-compatible dumps that mysqldump produces do get you some of the way there, they fall quite far short of what Postgres needs and, most likely, what you want. A few things you'll probably want to think about include:
- Asserting your defaults, foreign keys and indexes carry over correctly
- Converting types, e.g. TINYINTs into BOOLEANs and needless VARCHAR(255)s to TEXT types (Postgres can index TEXT just fine) – see the sketch after this list
- Handling multibyte encoding on any string fields you do want to keep length restrictions on (you probably want to double the length – I opted for TEXT fields rather than lots of VARCHAR(512)s though)
- Switching to timezone-aware DATETIME fields (you'll probably care later, if not now)
- Switching to a document type for anything you had to serialize previously (e.g. JSON, powerful stuff)
- You'll also want to validate that the new structure is compatible with your application(s), that nothing was lost, and that performance hasn't been hit
- ...and probably much much more
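For the type conversions, here's a minimal sketch of the sort of clean-up DDL you might run on the Postgres side after importing. The users table and its active and bio columns are hypothetical, and the USING clause assumes the TINYINT came across as an integer type:
-- Hypothetical clean-up after the import: turn an imported TINYINT flag into a
-- real BOOLEAN, and relax a needless VARCHAR(255) to TEXT.
ALTER TABLE users
  ALTER COLUMN active TYPE boolean USING active <> 0,
  ALTER COLUMN bio TYPE text;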
Getting started
pgloader to the rescue <small>(edited: 2015-11)</small>
Since writing this article, I came across Dimitri's pgloader at All Your Base Conf, which looks to make this problem a whole lot easier to solve:
- Load data into PostgreSQL. Any Data.
- Read data from MySQL, SQLite or dBase databases
- Migrate from MySQL to PostgreSQL. In one command.
- Transform your data. While streaming.
I am yet to try it, but it looks pretty darn sweet.
Lanyrd's MySQL to PostgreSQL Converter
At this point you might be thinking this is sounding a lot more like a headache. You'll be pleased to hear the great folk at Lanyrd open-sourced the script they used, MySQL to PostgreSQL Converter, which is pretty easy to read through, being a simple Python script. Whilst it has a handful of shortcomings, and is seemingly unmaintained now they're done with it, it does a pretty good job, and with a few tweaks for my specific use case, it got me most of the way there.
A few of the changes I made included:
- Simplifying the VARCHAR handling to convert them all to TEXT types. If yours is a Rails app, this is what t.string (with no length) would have evaluated to in your migrations had you used Postgres in the first place anyway, and there's no advantage of VARCHAR over TEXT for indexing in Postgres – that's a MySQL thing
- A fix to BOOLEAN handling where default values were lost
- Trashing a fair bit of the index handling in favour of reapplying the indexes with Rails migrations – it was easier
Application changes and MySQL gotchas
In short, if your application's database workload is relatively typical, and you're using a framework with an ORM or database abstraction – your changes should be few and far between.
The only notable things I came across in the Ruby on Rails application I was migrating were a handful of places where raw queries had been written, which included usage of the following:
Seeded randomisation
MySQL's seeded RAND is different to Postgres': MySQL accepts an integer to seed RAND directly, whilst Postgres uses a separate SETSEED call you make before RANDOM, and wants a float between -1.0 and 1.0 rather than an integer.
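A rough sketch of the difference – the comments table is just an illustrative stand-in:
-- MySQL: the seed goes straight into RAND()
SELECT * FROM comments ORDER BY RAND(42);
-- Postgres: seed the generator first, then use random()
SELECT setseed(0.42);
SELECT * FROM comments ORDER BY random();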
MySQL's grouping used to be sloppy
A "feature" of MySQL up until recently allowed for the use of non-aggregated columns in SELECT
, HAVING
and ORDER BY
when using GROUP BY
– essentially a sloppy implementation of the SQL specification.
Postgres, of course, follows the SQL specification, so in short, you have to either aggregate the things you want to use in these clauses, e.g. with MAX (a common hack), or include them in the GROUP BY.
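For illustration, here's a hypothetical query that old MySQL would happily accept but Postgres rejects, along with the two usual fixes (the posts table and its columns are made up):
-- Rejected by Postgres: title is neither aggregated nor in the GROUP BY
SELECT user_id, title, COUNT(*) FROM posts GROUP BY user_id;
-- Fix 1: aggregate it
SELECT user_id, MAX(title), COUNT(*) FROM posts GROUP BY user_id;
-- Fix 2: group by it too
SELECT user_id, title, COUNT(*) FROM posts GROUP BY user_id, title;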
Note that as of MySQL 5.7.5, ONLY_FULL_GROUP_BY is enabled by default, so this sloppy behaviour is no longer the default – maybe you're good to go.
https://dev.mysql.com/doc/refman/5.7/en/sql-mode.html#sqlmode_only_full_group_by
Absence of order
In the SQL world, order is not an inherent property of a set of data. Thus, you get no guarantees from your RDBMS that your data will come back in a certain order – or even in a consistent order – unless you query your data with an ORDER BY clause.
– http://dba.stackexchange.com/a/6053
I found a number of cases where ORDER BY clauses were omitted from queries being run; the application relied, I believe, on MySQL returning rows in insertion order. Either way, this behaviour seemed far more chaotic in Postgres and, as above, isn't guaranteed anywhere – I stuck some ORDER BY id clauses in.
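In other words, make the ordering explicit rather than relying on whatever the database happens to do (comments is a hypothetical table again):
-- Before: ordering left to chance
SELECT * FROM comments WHERE post_id = 123;
-- After: ordering made explicit
SELECT * FROM comments WHERE post_id = 123 ORDER BY id;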
Unicode NULLs
A problem I didn't expect to run into was caused by unicode nulls in my MySQL database screwing with my INSERTs into Postgres, and even when I worked out what the problem was, it still took longer than I'd like to admit to work out how to nuke them.
Postgres does not allow null bytes ('\0') in string fields (CHAR, TEXT or VARCHAR), and if you try and store them, you'll receive the error:
ERROR: 22021: invalid byte sequence for encoding "UTF8": 0x00
It appeared I had unicode nulls in a lot of places in the dataset I was working on – who knows where they came from, or where they were – but I chose to simply replace them all, in every string-type column, as, as far as I could see, they were semantically meaningless in the cases I came across.
To do that, after a lot of trial and error (I came across a lot of examples that didn't work), I whipped up a series of queries that looked something like this:
UPDATE comments SET content = REPLACE(content, CHAR(0x00), '') WHERE content LIKE CONCAT('%', 0x00, '%');
If you're feeling super lazy (I was), you can write a query to generate those queries for you based on information_schema, which describes your other MySQL databases; mine looked something like this:
SELECT CONCAT(
"UPDATE `", TABLE_SCHEMA, "`.`", TABLE_NAME, "` ",
"SET `", COLUMN_NAME, "` = REPLACE(`", COLUMN_NAME, "`, CHAR(0x00), '') ",
"WHERE `", COLUMN_NAME, "` LIKE CONCAT('%', 0x00, '%');"
) AS q
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = '{{ your database name }}' AND DATA_TYPE IN ('char', 'text', 'varchar')
If you then copy the results of that query and run them, your MySQL database should be ready to go.
Data transformation
Transforming some serialized Ruby hashes into JSON so that it could be aggregated was the whole reason I wanted to migrate to Postgres.
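As an illustrative example of the payoff, once the serialized hashes are stored as JSON, Postgres can aggregate over their contents directly (the events table and its properties column are made up):
-- Aggregate over a key inside the JSON, something a serialized Ruby hash
-- stored as a plain string could never support.
SELECT properties->>'plan' AS plan, COUNT(*)
FROM events
GROUP BY 1;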
I achieved the transformation with a ~7 hour Rails migration that I ran overnight, which, obviously, would have been pretty costly to repeat in testing, so I turned the resulting data into batched raw SQL updates (with a WHERE on the updated_at column in case the row was updated since I took the dump) so that it could be loaded in minutes rather than hours.
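A sketch of what one of those generated updates might look like – the table, values and timestamp are all hypothetical – where the updated_at guard skips any row that has changed since the dump was taken:
-- Only apply the pre-computed JSON if the row hasn't been touched since the dump.
UPDATE events
SET properties = '{"plan": "pro", "seats": 3}'
WHERE id = 42
  AND updated_at <= '2015-10-01 12:34:56';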
However, if you need to do something similar, you may have luck with pgloader's support for transforming data, while streaming.
Validating the result
If your test suite is anything like the one I had, it certainly doesn't run against a real database, so how do you go about testing this?
I don't have a great answer to this. I started by browsing the structure and data and seeing if the areas I suspected might break looked okay – mostly this surrounded defaults, types, indexes and keys. Whilst this helped iron out some bugs with Lanyrd's script, my confidence was still low, and given I was running this whole process end-to-end a lot between changes, I wanted a faster way to validate the quality of the result.
Count comparison
Admittedly a pretty rudimentary check, but better than nothing: I used information_schema to generate a bunch of queries that count the rows of every table, ran them against both databases, and diff'd the results to check they matched.
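A sketch of the MySQL side of that generator (the Postgres side is the same idea, using || for concatenation and no backticks):
-- Generates one COUNT(*) query per base table; run the output against each database and diff.
SELECT CONCAT("SELECT '", TABLE_NAME, "' AS table_name, COUNT(*) FROM `", TABLE_NAME, "`;") AS q
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = '{{ your database name }}' AND TABLE_TYPE = 'BASE TABLE'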
Schema comparison
ActiveRecord's schema.rb was a big help – as a manifest describing ActiveRecord's understanding of what the database looks like, the one produced in development from the new Postgres database can be compared to the schema.rb produced by the production MySQL database. Any changes that hadn't been intentionally made by me were pretty obvious, and using this, I was able to identify and resolve a number of issues I may well have missed otherwise, including:
- BOOLEAN defaults being lost
- Indexes coming out differently
- Timestamps having nullable set differently
Summary
I hope some of these tips come in useful if you're thinking about making the jump – best of luck!