I've been looking into something like that: I envision a work queue where:
I spin up AWS Inf2 instances on demand, packed with a large 100B+ instruct-tuned reasoning LLM, or maybe a LAM (but I'd have to learn more about those first). These handle decomposition, review, and maybe even prompt tuning?
Local M4 box(es) then run smaller models like Devstral or CodeLlama for the actual operations.
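
Something like this, as a rough sketch of the queue part (call_planner / call_worker are just hypothetical placeholders for the real Inf2 endpoint and the local inference calls, not actual APIs):

```python
import queue
import threading

task_queue: "queue.Queue[str]" = queue.Queue()

def call_planner(goal: str) -> list[str]:
    """Placeholder for the Inf2-hosted big model: decompose a goal
    into subtasks. Swap in the real API call here."""
    return [f"{goal} :: step {i}" for i in range(1, 4)]

def call_worker(subtask: str) -> str:
    """Placeholder for a local M4 box running Devstral/CodeLlama:
    execute one subtask. Swap in the real inference call here."""
    return f"done: {subtask}"

def planner(goal: str) -> None:
    # Big cloud model decomposes; subtasks land on the shared queue.
    for subtask in call_planner(goal):
        task_queue.put(subtask)

def worker() -> None:
    # Local model drains the queue until it's empty.
    while True:
        try:
            subtask = task_queue.get(timeout=1)
        except queue.Empty:
            return
        print(call_worker(subtask))
        task_queue.task_done()

planner("refactor the auth module")
threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In practice I'd probably put a real broker (Redis, SQS, whatever) between the two tiers instead of an in-process queue, so the cloud planner and the local workers can live on different machines.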