
Following my learnings from part I, I stopped doing procedural migrations and used temp tables to accomplish the same result through normal set-based operations. This sped things up quite a bit but only by a factor of two. It still took nearly an hour to migrate everything.
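Roughly, the set-based shape looks like this: build a temp table from the joins, then do a single insert from it. The table names here ("OldSomething", "OldRelated") are stand-ins for the real schema, not the actual migration:
CREATE TEMP TABLE "TempSomething" AS
SELECT o.id, o.something
FROM "OldSomething" o
JOIN "OldRelated" r ON r."somethingId" = o.id;

-- one set-based insert instead of a row-by-row loop
INSERT INTO "NewSomething" (id, something)
SELECT id, something
FROM "TempSomething";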
We perform complex queries on millions of these rows during normal operation and they complete on the order of 10s of seconds, so I knew something was still amiss. I was confident that everything should migrate in minutes if done properly.
The next thing I tried was batching my inserts. Rather than loading millions of rows joined with millions of rows joined with millions of rows into memory, I migrated batches of tens of thousands at a time; I did a batch of ten thousand, then the next ten thousand, and so on. While that may have sped things up, it wasn't significant enough for me to notice. I may have been swapping some millions of rows to disk, but the migration wasn't thrashing, so that wasn't the cause of the major slowness. So I went back to ChatGPT for pointers.
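To sketch what the batching looked like (again with stand-in table names and a plain integer id to range over):
-- batch 1
INSERT INTO "NewSomething" (id, something)
SELECT o.id, o.something
FROM "OldSomething" o
WHERE o.id > 0 AND o.id <= 10000;

-- batch 2
INSERT INTO "NewSomething" (id, something)
SELECT o.id, o.something
FROM "OldSomething" o
WHERE o.id > 10000 AND o.id <= 20000;

-- ... and so on, ten thousand ids at a time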
This time ChatGPT told me to:
  1. mark destination tables as UNLOGGED
    • avoiding writes and fsyncs to the write-ahead log (WAL) on disk while performing many inserts, and
  2. not add indices or constraints to the destination tables until after the data was migrated
    • avoiding O(log(n)) index inserts and constraint checks running O(n) times without cache locality
Luckily, both were straightforward relative to my other changes. The UNLOGGED changes look like this:
CREATE UNLOGGED TABLE "NewSomething" (
      id SERIAL NOT NULL,
      something INTEGER NOT NULL
);

-------------------------------
-- DO LOTS OF INSERTS HERE
-------------------------------

ALTER TABLE "NewSomething" SET LOGGED;
And the deferred constraints and indices only required reordering statements I had already written. Before, when I created the "NewSomething" table, I added our constraints, foreign key or otherwise, and indices before inserting any rows, like this:
CREATE UNLOGGED TABLE "NewSomething" (
      id SERIAL NOT NULL,
      something INTEGER NOT NULL
);

CREATE UNIQUE INDEX "NewSomething_something_key" ON "NewSomething"("something");
ALTER TABLE "NewSomething" ADD CONSTRAINT "something_positive" CHECK ("somthing" >= 0);

-------------------------------
-- DO LOTS OF INSERTS HERE
-------------------------------

ALTER TABLE "NewSomething" SET LOGGED;
The final changes look more like this:
CREATE UNLOGGED TABLE "NewSomething" (
      id SERIAL NOT NULL,
      something INTEGER NOT NULL
);

-------------------------------
-- DO LOTS OF INSERTS HERE
-------------------------------

CREATE UNIQUE INDEX "NewSomething_something_key" ON "NewSomething"("something");
ALTER TABLE "NewSomething" ADD CONSTRAINT "something_positive" CHECK ("somthing" >= 0);

ALTER TABLE "NewSomething" SET LOGGED;
Either unlogging or deferring, or the combination of the two, seemed to do the trick. After everything - the temp tables for set-based operations, the batching, the UNLOGGED destinations, and deferring indices and constraints - the "massive" migration completes in a few minutes now. Huzzah!

bonus TIL

If you're doing an online migration, ie you don't want downtime/contention, and are adding constraints to an existing table, apparently it's best to add constraints in two steps. First you add the constraint as NOT VALID so it's only enforced for new rows, then in a separate statement you validate all of the existing rows:
-- add the constraint to future rows
ALTER TABLE "AnotherSomething"
    ADD CONSTRAINT "AnotherSomething_fkey"
    FOREIGN KEY ("somethingId") REFERENCES "Something"(id)
    NOT VALID;

-- then, validate
ALTER TABLE "AnotherSomething" VALIDATE CONSTRAINT "AnotherSomething_fkey";
Apparently, when these are done in a single statement, the table is locked against writes for the entire scan of the existing rows, so that nothing can be inserted or updated to violate the constraint while it's being checked. Done in two steps, the validation takes a much weaker lock that doesn't block normal reads and writes.
50 sats \ 3 replies \ @optimism 11h
If you're doing an online migration, ie you don't want downtime/contention
It may be that I haven't drank enough coffee yet so apologies if I'm being dumb but I wonder: what magic do you possess to realize both online batch migration and zero downtime?
100 sats \ 2 replies \ @k00b OP 6h
The bonus TIL is independent of a batch migration. I should’ve made that clear. It’s something I came across while trying to figure out why everything was so slow.
100 sats \ 1 reply \ @optimism 5h
Ah! Okay, yeah. In the past (90s, lol) I've done, once:
  1. Batch without constraints & test
  2. Double inserts into old and new (slows live down a bit but doesn't have to be a big deal)
  3. Batch with constraints for whatever gap is left
  4. Switchover with grace
  5. Disable double inserts
But tbh the double inserts thing was a lot of effort for not having a 10 minute or so downtime gap - this was on a callcenter like 911, so we thought it was worth it - but if I'd ever be asked again I'd stress the question: "are you sure you can't have downtime"?
100 sats \ 0 replies \ @k00b OP 5h
100%. I was telling the guys the other day that we've gotten a little anal about avoiding minor hiccups. It can eat up a bunch of energy.
I found this very insightful, thank you.
When deferring the constraints to after the bulk inserts, don't you run the risk of having invalid data inserted and therefore applying the constraints has the potential to fail? I know in theory the migration would handle it, but in practice it would seem possible that there are some bad records that could cause an issue?
Also, how have you handled running migrations like this live in production while users are using the app, potentially creating data that needs to be included in the migration? Do you run it again after the app code is deployed to ensure you "got" all the records needing migration?
50 sats \ 1 reply \ @k00b OP 18h
When deferring the constraints to after the bulk inserts, don't you run the risk of having invalid data inserted and therefore applying the constraints has the potential to fail?
Yes, but the migration is faster to fail than the alternative, so you can iterate on fixing the cause of the failure faster. If you use constraints as part of your migration logic though, i.e. INSERT ... ON CONFLICT, this might be a problem. In that case I'd add that constraint early, then defer the others, retaining a proportional performance benefit.
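As a sketch of that ordering, reusing the hypothetical "NewSomething" table from the post (not the actual migration):
-- the unique index is the arbiter for ON CONFLICT, so create it before the bulk insert
CREATE UNIQUE INDEX "NewSomething_something_key" ON "NewSomething"("something");

INSERT INTO "NewSomething" (something)
SELECT o.something
FROM "OldSomething" o
ON CONFLICT ("something") DO NOTHING;

-- the remaining constraints can still wait until after the bulk insert
ALTER TABLE "NewSomething" ADD CONSTRAINT "something_positive" CHECK ("something" >= 0);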
Also, how have you handled running migrations like this live in production while users are using the app, potentially creating data that needs to be included in the migration? Do you run it again after the app code is deployed to ensure you "got" all the records needing migration?
I haven't done this kind of migration online before, but if I had to I'd either take explicit locks on tables while they're being migrated or run two migrations (once to get most of it, switch over any clients to use the new schema, then run another migration to get any lingering rows).
In this case, we'll have a maintenance window on some dead weekend day. Even then, having the migration take an hour when it might fail was way too slow.
As usual, thank you for the thoughtful response!
You'd need someone fluent in PostgreSQL or MariaDB to run these queries.