Following my learnings from part I, I stopped doing procedural migrations and used temp tables to accomplish the same result through normal set-based operations. This sped things up quite a bit but only by a factor of two. It still took nearly an hour to migrate everything.
We perform complex queries on millions of these rows during normal operation and they complete on the order of 10s of seconds, so I knew something was still amiss. I was confident that everything should migrate in minutes if done properly.
The next thing I tried was batching my inserts. Rather than loading millions of rows joined with millions of rows joined with millions of row into memory, I migrated batches of tens of thousands at time; I did a batch of ten thousand, then the next ten thousand and so on. While that may have sped things up, it wasn't significant enough for me to notice. I may have been swapping some millions of rows to disk but the migration wasn't thrashing, it wasn't the cause of the major slowness. So I went back to ChatGPT for pointers.
This time ChatGPT told me to:
- mark destination tables as
UNLOGGED
- avoiding preemptive
fsync
s to the append only log (on disk) while performing many inserts, and
- avoiding preemptive
- to not put indices or constraints on the destination tables until after the data was already migrated
- avoiding
O(log(n))
index inserts and constraint checks runningO(n)
times without cache locality
- avoiding
Luckily, both were straightforward relative to my other changes. The
UNLOGGED
changes look like this:CREATE UNLOGGED TABLE "NewSomething" (
id SERIAL NOT NULL,
something INTEGER NOT NULL
);
-------------------------------
-- DO LOTS OF INSERTS HERE
-------------------------------
ALTER TABLE "NewSomething" SET LOGGED;
And the deferred constraints and indices only required reordering statements I had already written. Before, when I created the "NewSomething" table, I added our constraints, foreign key or otherwise, and indices before inserting any rows, like this:
CREATE UNLOGGED TABLE "NewSomething" (
id SERIAL NOT NULL,
something INTEGER NOT NULL
);
CREATE UNIQUE INDEX "NewSomething_something_key" ON "NewSomething"("something");
ALTER TABLE "NewSomething" ADD CONSTRAINT "something_positive" CHECK ("somthing" >= 0);
-------------------------------
-- DO LOTS OF INSERTS HERE
-------------------------------
ALTER TABLE "NewSomething" SET LOGGED;
The final changes look more like this:
CREATE UNLOGGED TABLE "NewSomething" (
id SERIAL NOT NULL,
something INTEGER NOT NULL
);
-------------------------------
-- DO LOTS OF INSERTS HERE
-------------------------------
CREATE UNIQUE INDEX "NewSomething_something_key" ON "NewSomething"("something");
ALTER TABLE "NewSomething" ADD CONSTRAINT "something_positive" CHECK ("somthing" >= 0);
ALTER TABLE "NewSomething" SET LOGGED;
Either unlogging or deferring, or the combination of the two, seemed to do the trick. After everything - the temp tables for set-based operations, the batching, the
UNLOGGED
destinations, and deferring indices and constraints - the "massive" migration completes in a few minutes now. Huzzah!bonus TIL
If you're doing an online migration, ie you don't want downtime/contention, and are adding constraints to an existing table, apparently it's best to add constraints in two steps. First you add the constraint for any new rows in a single statement, then in another statement you validate all of the existing rows:
-- add the constraint to future rows
ALTER TABLE "AnotherSomething"
ADD CONSTRAINT "AnotherSomething_fkey"
FOREIGN KEY ("somethingId") REFERENCES "Something"(id)
NOT VALID;
-- then, validate
ALTER TABLE "AnotherSomething" VALIDATE CONSTRAINT "AnotherSomething_fkey";
Apparently, when these are done in a single statement the table is locked so that no rows violating the constraint are inserted and no rows can be updated to violate the constraint. Done this way, row locks, which are much lighter, are taken instead.
UPDATE ... ON CONFLICT
, this might be a problem. In that case I'd add that constraint early then defer the others retaining proportional performance benefit.