
Block released this today:

As part of the goose project, we wanted to let people try more open models, but many didn't have the capacity to run them on their own. Open models continue to improve apace, so it makes sense to make them easy to host and share as they get more capable and larger. That is what this experiment is about.

⚡ Automatic distribution

Model fits on one machine? Solo mode, full speed. Too big? Dense models are pipeline-split by layers across nodes. MoE models (Qwen3, GLM, Mixtral, DeepSeek) are split by experts, auto-detected from GGUF metadata with zero config. Splits are latency-aware: low-RTT peers are preferred for tighter coordination.
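A minimal sketch of the placement decision described above. The function name, `Node` fields, and thresholds are all illustrative assumptions, not the project's actual API; it only shows the solo-vs-split logic and the low-RTT preference.

```python
# Hypothetical placement sketch: solo if any node fits the model,
# otherwise split across the lowest-RTT peers until capacity suffices.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    vram_gb: float
    rtt_ms: float  # round-trip time to this peer

def choose_placement(model_gb: float, is_moe: bool, nodes: list[Node]) -> dict:
    solo = [n for n in nodes if n.vram_gb >= model_gb]
    if solo:
        return {"mode": "solo", "nodes": [solo[0].name]}
    # Latency-aware: prefer low-RTT peers for tighter coordination.
    picked, capacity = [], 0.0
    for n in sorted(nodes, key=lambda n: n.rtt_ms):
        picked.append(n.name)
        capacity += n.vram_gb
        if capacity >= model_gb:
            break
    mode = "expert-split" if is_moe else "pipeline-split"
    return {"mode": mode, "nodes": picked}
```

With three 24 GB nodes at 5, 80, and 12 ms RTT, a 40 GB dense model lands on the two low-RTT peers; a 16 GB model runs solo.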

🧩 MoE expert sharding

Each node gets the full trunk plus an overlapping expert shard. Critical experts are replicated everywhere; the rest are distributed uniquely. Each node runs its own llama-server, so there is zero cross-node traffic during inference.
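The overlapping-shard idea can be sketched in a few lines. How the project decides which experts are "critical" isn't stated, so that set is an input here, and round-robin for the rest is an assumption.

```python
# Illustrative expert sharding: critical experts replicated on every
# node, remaining experts distributed uniquely (round-robin here).
def shard_experts(num_experts: int, critical: set[int],
                  nodes: list[str]) -> dict[str, set[int]]:
    shards = {n: set(critical) for n in nodes}  # replicate critical everywhere
    rest = [e for e in range(num_experts) if e not in critical]
    for i, e in enumerate(rest):                # unique assignment for the rest
        shards[nodes[i % len(nodes)]].add(e)
    return shards
```

Because every node holds the trunk plus its shard, each can run inference locally; the overlap on critical experts is what keeps cross-node traffic at zero.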

🔀 Multi-model routing

Different nodes serve different models. The API proxy routes by the `model` field. Nodes are auto-assigned based on what's needed and what's on disk.
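The routing step itself is simple to picture; this sketch assumes a serving map from model name to node list, which is invented for illustration.

```python
# Hypothetical proxy routing: dispatch on the request's "model" field.
def route(request: dict, serving: dict[str, list[str]]) -> str:
    """Return a node currently serving the requested model."""
    nodes = serving.get(request["model"])
    if not nodes:
        raise LookupError(f"no node is serving {request['model']!r}")
    return nodes[0]  # a real proxy would load-balance across the list
```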

📊 Demand-aware rebalancing

A unified demand map propagates across the mesh via gossip. Standby nodes promote themselves to serve unserved or hot models. Dead hosts are replaced within 60 seconds.

📑 Nostr discovery

Publish your mesh to Nostr relays. Others find it with --auto. Smart scoring: region match, VRAM, health probe before joining.
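The scoring criteria named above (region match, VRAM, health probe) could combine along these lines; the weights and field names are invented for illustration.

```python
# Hypothetical mesh scoring for --auto: health probe gates joining,
# then spare VRAM and region match rank the candidates.
def score_mesh(mesh: dict, my_region: str, healthy: bool) -> float:
    if not healthy:
        return float("-inf")          # never join a mesh that fails the probe
    score = mesh["free_vram_gb"]      # more spare VRAM is better
    if mesh["region"] == my_region:
        score += 50                   # strong preference for low-latency region
    return score
```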

🚀 Zero-transfer loading

Weights read from local GGUF files, not sent over the network. Model load: 111s → 5s. Per-token RPC round-trips: 558 → 8.

📈 Scales passively

GPU nodes gossip. Clients use lightweight routing tables with zero per-client server state. Event-driven: cost is proportional to topology changes, not node count.

🎯 Speculative decoding

The draft model runs locally, proposing tokens that are verified in one batched pass. +38% throughput on code. Auto-detected from the catalog.
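A toy illustration of the accept/reject loop at the heart of speculative decoding. Both "models" are stand-in callables, not the project's actual interfaces: the draft proposes k tokens cheaply, the target verifies them in one batched pass, and the longest agreeing prefix is kept (plus the target's first correction).

```python
# Toy speculative-decoding step with stand-in model functions.
def speculative_step(prefix: list[str], draft, target, k: int = 4) -> list[str]:
    proposed, ctx = [], list(prefix)
    for _ in range(k):                    # draft proposes k tokens cheaply
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    verified = target(prefix, proposed)   # one batched verification pass
    accepted = []
    for p, v in zip(proposed, verified):
        if p != v:
            accepted.append(v)            # keep the target's correction, stop
            break
        accepted.append(p)
    return prefix + accepted
```

When the draft agrees with the target on most tokens (common for code), each target pass yields several tokens instead of one, which is where the throughput gain comes from.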

💻 Web console

Live topology, VRAM bars, model picker, built-in chat. API-driven: everything the console shows comes from JSON endpoints.

🤖 Works with agents

OpenAI-compatible API on localhost:9337. Use with goose, pi, opencode, or any tool that supports custom OpenAI endpoints.
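"OpenAI-compatible" means standard tooling just works. This sketch builds (without sending) a request against the conventional chat-completions path; the path and model name are assumptions based on the OpenAI API shape, only the host and port come from the post.

```python
import json
import urllib.request

# Build a chat-completions request for the local endpoint; the
# /v1/chat/completions path follows the OpenAI convention and the
# model name is illustrative.
def chat_request(model: str, prompt: str) -> urllib.request.Request:
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        "http://localhost:9337/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
```

Any tool that lets you set a custom OpenAI base URL can be pointed at `http://localhost:9337` the same way.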

πŸ“ Agent gossipπŸ“ Agent gossip

Have your agents gossip over the mesh: share status, findings, and questions via CLI or MCP. Works standalone with any LLM setup, no GPU needed. Learn more →
15 sats \ 0 replies \ @adlai 4 Apr

honestly I'm disappointed you didn't link to it directly

15 sats \ 0 replies \ @adlai 3 Apr

thanks

magnet:???


before the amateur hecklers complain about the pitfalls of drinking both bittorrent and git ... go read about forward-compatibility and maybe troll someone who reads notifications more frequently than once per zap.
