
I came across this 37signals podcast describing how they manage polymorphism across their projects and it made me want to share how we do it.

From what the docs describe, they have a generic base table that, alongside the shared columns, stores an inheritor's id and the table name of that inheritor. Something like

model base {
	id
	created_at
	updated_at
	...
	inheritor_id 
	inheritor_type
}

model inheritor_a {
	id
	something_only_a_has
	...
}

model inheritor_b {
	id
	something_only_b_has
	...
}

a = insert into inheritor_a (something_only_a_has) values (some_a_value) returning *
insert into base (inheritor_id, inheritor_type) values (a.id, 'inheritor_a')
b = insert into inheritor_b (something_only_b_has) values (some_b_value) returning *
insert into base (inheritor_id, inheritor_type) values (b.id, 'inheritor_b')
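For concreteness, each pair of inserts could be collapsed into one PostgreSQL statement with a CTE (a sketch of their pattern, not their actual code; 'some_a_value' is a placeholder and base's other columns are assumed to have defaults):

with a as (
	insert into inheritor_a (something_only_a_has)
	values ('some_a_value')
	returning id
)
insert into base (inheritor_id, inheritor_type)
select a.id, 'inheritor_a'
from a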

Now they can modify any inheritor_X independently of base, and because each inheritor_X is much smaller than base and the concerns are separated, it's easier to work with. I agree. It's relatively "yay."

We landed on a similar pattern for our polymorphism. Only I wasn't happy with the main drawback: when you want to query over all bases you need to join with and retrieve every inheritor table/row. If you have more than a few inheritor types, it's super unwieldy.

To get all base rows in a single query regardless of type, their pattern needs one left join per inheritor (plain inner joins would return nothing, since each base row matches only one inheritor type):

select *
from base
left join inheritor_a on base.inheritor_type = 'inheritor_a' and base.inheritor_id = inheritor_a.id
left join inheritor_b on base.inheritor_type = 'inheritor_b' and base.inheritor_id = inheritor_b.id
...
left join inheritor_z on base.inheritor_type = 'inheritor_z' and base.inheritor_id = inheritor_z.id

They trade read performance for DX. What we do to avoid that is establish triggers on inserts/updates to inheritor tables so that they denormalize themselves as JSON into a column of base.

model base {
	id
	created_at
	updated_at
	...
	inheritor_type
	inheritor_json
}

model inheritor_a {
	id
	base_id
	something_only_a_has
	...
}

create trigger inheritor_a_denorm
after insert or update on inheritor_a
for each row execute
	update base set
	inheritor_json = to_json(NEW.*),
	inheritor_type = 'inheritor_a'
	where base.id = NEW.base_id

model inheritor_b {
	id
	base_id
	something_only_b_has
	...
}

create trigger inheritor_b_denorm
after insert or update on inheritor_b
for each row execute
	update base set
	inheritor_json = to_json(NEW.*),
	inheritor_type = 'inheritor_b'
	where base.id = NEW.base_id
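Real PostgreSQL triggers can't inline a statement like that; they call a trigger function. A minimal runnable sketch (assuming PostgreSQL 11+; denorm_inheritor is a hypothetical name), where one generic plpgsql function covers every inheritor table, since TG_TABLE_NAME carries the name of the table that fired:

-- hypothetical generic trigger function; TG_TABLE_NAME is 'inheritor_a', 'inheritor_b', etc.
create function denorm_inheritor() returns trigger as $$
begin
	update base
	set inheritor_json = to_json(NEW),
		inheritor_type = TG_TABLE_NAME
	where base.id = NEW.base_id;
	return NEW;
end;
$$ language plpgsql;

create trigger inheritor_a_denorm
after insert or update on inheritor_a
for each row execute function denorm_inheritor();

create trigger inheritor_b_denorm
after insert or update on inheritor_b
for each row execute function denorm_inheritor();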

base_a = insert into base returning *
insert into inheritor_a (base_id, something_only_a_has) values (base_a.id, some_a_value)
base_b = insert into base returning *
insert into inheritor_b (base_id, something_only_b_has) values (base_b.id, some_b_value)
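As a single statement (same sketch caveats as above), noting the foreign key now points from inheritor to base, so base is inserted first:

with base_a as (
	-- assumes base's remaining columns have defaults or stay null until the trigger fills them
	insert into base default values
	returning id
)
insert into inheritor_a (base_id, something_only_a_has)
select base_a.id, 'some_a_value'
from base_a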

Now we can query base generically without joins.

select *
from base
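And type-specific fields are still reachable from base alone, e.g. (assuming inheritor_json is a json/jsonb column):

select id, inheritor_json->>'something_only_a_has' as something_only_a_has
from base
where inheritor_type = 'inheritor_a'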

AFAICT we get the best of single table inheritance (STI) and "delegated types":

  1. typed and constrained data at rest
  2. concerns separated
  3. join-free generic queries over base

For the tradeoff of:

  1. marginally slower write performance
  2. two copies of all inheritor rows
  3. hypothetical type incompatibilities between SQL and JSON (I haven't run into any yet)

We currently only use this pattern for wallets, because that's what we were building when we discovered it. Eventually all of our polymorphic tables (e.g. items) will follow it.

Only I wasn't happy with the main drawback: when you want to query over all bases you need to join with and retrieve every inheritor table/row

I've always wondered: is this a downside of the data model, or a downside of the available SQL syntax / subroutines?

As a semi-technical person who works with databases all the time, but has little idea about the underlying technicals of how they're implemented, I've always thought that philosophically what should be aimed for is the most coherent and easily expressible data model that minimizes redundancy, but with a syntax and subroutine library that allows for quick retrieval of even highly nested relational structures.

I may be talking out of my butt and simply unaware of the many alternative database solutions out there. But this is my perspective as someone who works frequently with table-based relational data structures.

67 sats \ 1 reply \ @k00b OP 22 Dec
I've always thought that philosophically what should be aimed for is the most coherent and easily expressible data model that minimizes redundancy, but with a syntax and subroutine library that allows for quick retrieval of even highly nested relational structures.

That is the aim. We call this database normalization. It's the right way to design these systems. In practice, you often denormalize data for performance reasons, because there are limits to how fast a computer can access and process data, especially when it's spread around.

When you're joining tables, which are each a collection of files on your hard drive, you are randomly accessing another collection of files for each join. Indices can help the database know exactly which file it needs to read to get a particular row, but any file read is slower than none.
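For example (hypothetical index, over the delegated-types schema above), an index covering the polymorphic reference lets the database find the base row for a given inheritor with one index probe instead of a table scan:

create index base_inheritor_idx on base (inheritor_type, inheritor_id)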

I try to keep our data "normal," so we always know what is "true," whether I end up also denormalizing the data or not.


I think the SQL standard should evolve to address polymorphism like this directly though. It's incredibly common and has no direct solution afaict.

I think the SQL standard should evolve to address polymorphism like this directly though. It's incredibly common and has no direct solution afaict.

Yeah, that's what I think too. This seems like a common enough issue that it should be built in. You should be able to express your inheritance schema without having to hack together triggers to keep a denormalized copy in sync. Any denormalization that needs to happen for performance purposes should be kept "under the hood" and used in the background when needed.

But I may be underestimating the technical challenge in accomplishing that, or maybe there isn't an easily generalizable solution that works for everyone's data model or use cases.
