I've been looking into something like that: I envision a work queue where:
I spin up AWS Inf2 instances on demand, packed with a large 100B+ instruct-tuned reasoning LLM, or maybe a LAM (but I'd have to learn more about those first). These handle decomposition, review, and maybe even prompt tuning?
Local M4 box(es) then run smaller models like Devstral or CodeLlama for the actual operations.
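
Something like this, as a rough sketch of the queue part (call_planner / call_worker are just hypothetical placeholders for the real Inf2 endpoint and the local inference calls, not actual APIs):

```python
import queue
import threading

task_queue: "queue.Queue[str]" = queue.Queue()

def call_planner(goal: str) -> list[str]:
    """Placeholder for the Inf2-hosted big model: decompose a goal
    into subtasks. Swap in the real API call here."""
    return [f"{goal} :: step {i}" for i in range(1, 4)]

def call_worker(subtask: str) -> str:
    """Placeholder for a local M4 box running Devstral/CodeLlama:
    execute one subtask. Swap in the real inference call here."""
    return f"done: {subtask}"

def planner(goal: str) -> None:
    # Big cloud model decomposes; subtasks land on the shared queue.
    for subtask in call_planner(goal):
        task_queue.put(subtask)

def worker() -> None:
    # Local model drains the queue until it's empty.
    while True:
        try:
            subtask = task_queue.get(timeout=1)
        except queue.Empty:
            return
        print(call_worker(subtask))
        task_queue.task_done()

planner("refactor the auth module")
threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In practice I'd probably put a real broker (Redis, SQS, whatever) between the two tiers instead of an in-process queue, so the cloud planner and the local workers can live on different machines.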