Lack of high availability makes it difficult to run either of the major distributions like a real enterprise system, with zero downtime. I suspect even the largest nodes on the network today are not much more sophisticated than what’s available out of the box with LND.
Improvements to LND on the db side have reduced the downtime required between restarts (by enabling online compaction), and have also set up the architecture for multiple nodes to share a backend data store. Leader election can provide a primitive form of HA (failover but not load balancing). An external project, LNmux by Bottlepay, will allow load balancing HTLCs through multiple public gateways, which is the holy grail for enterprise LND.
For someone with some LN experience who wants to learn more about the above, I recommend setting up a regtest environment using containers or VMs, and configuring the following:
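To make "failover but not load balancing" concrete, here is a toy sketch of lease-based leader election. It is not LND's or etcd's actual implementation: `LeaseStore` is a hypothetical in-memory stand-in for an election key, and the TTL and node names are made up. The point is only that one node holds the lease at a time, and a standby takes over once the leader stops renewing.

```python
class LeaseStore:
    """Hypothetical in-memory stand-in for an election key (illustration only)."""

    def __init__(self, ttl: float):
        self.ttl = ttl          # seconds a lease stays valid without renewal
        self.holder = None      # node id currently holding the lease
        self.expires_at = 0.0   # time at which the current lease lapses

    def campaign(self, node_id: str, now: float) -> bool:
        """Acquire or renew the lease. Returns True if node_id is leader at `now`."""
        free = self.holder is None or now >= self.expires_at
        if free or self.holder == node_id:
            self.holder = node_id
            self.expires_at = now + self.ttl
            return True
        return False


store = LeaseStore(ttl=10.0)

# t=0: both nodes campaign; lnd-1 gets there first and becomes leader.
t0 = (store.campaign("lnd-1", 0.0), store.campaign("lnd-2", 0.0))

# t=5: lnd-1 renews its lease; lnd-2 stays on standby (it serves no traffic).
t5 = (store.campaign("lnd-1", 5.0), store.campaign("lnd-2", 5.0))

# t=20: lnd-1 has crashed and never renewed; its lease lapsed at t=15,
# so lnd-2 wins the next campaign and takes over.
t20 = store.campaign("lnd-2", 20.0)
```

Only the lease holder is ever active, which is why this gives you failover and nothing more. A real elector also has to handle clock skew, network partitions and fencing, which this sketch ignores.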
  • Create a basic postgres instance
  • Create an LND instance w/postgres backend
  • Upgrade your postgres to have high availability
  • Create an etcd cluster
  • Create 2 additional LNDs and configure leader election
  • Add centralized monitoring/logging (Grafana LGTM stack is phenomenal but other choices are available)
  • Add lndmon sidecars to lnd for monitoring LN metrics
If you can confidently configure all of that, you should have the basics to work on any major node on the network. Some will be drastically different in implementation but the concepts should be the same.
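As a starting point for the lab, here is a rough sketch that renders an lnd.conf fragment for each of the three nodes. The option names are what I recall from lnd's sample-lnd.conf, so check them against the version you run. The hostnames, credentials and node ids are placeholders I made up, and the bitcoind RPC/ZMQ settings are left out.

```python
# Sketch: render lnd.conf fragments for a 3-node regtest lab sharing one
# postgres backend and one etcd cluster. All values below are assumptions.

POSTGRES_DSN = "postgres://lnd:lnd@postgres:5432/lnd?sslmode=disable"  # placeholder
ETCD_HOST = "etcd:2379"                                                # placeholder
NODE_IDS = ("lnd-1", "lnd-2", "lnd-3")                                 # placeholder


def render_lnd_conf(node_id: str) -> str:
    """Return the config fragment for one node; only cluster.id differs per node."""
    lines = [
        "[Bitcoin]",
        "bitcoin.regtest=true",
        "bitcoin.node=bitcoind",
        "",
        "[db]",
        "db.backend=postgres",
        "",
        "[postgres]",
        f"db.postgres.dsn={POSTGRES_DSN}",
        "",
        "[etcd]",
        f"db.etcd.host={ETCD_HOST}",
        "",
        "[cluster]",
        "cluster.enable-leader-election=true",
        "cluster.leader-elector=etcd",
        "cluster.etcd-election-prefix=cluster-leader",
        f"cluster.id={node_id}",
    ]
    return "\n".join(lines) + "\n"


# One fragment per node, keyed by node id; write each to that node's lnd.conf.
CONFS = {node_id: render_lnd_conf(node_id) for node_id in NODE_IDS}
```

All three nodes point at the same postgres and etcd endpoints. That shared state is what lets a standby pick up where the previous leader left off.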
HTLC load balancing is still experimental as far as I know, but if you have some development experience you can add lnmux to your lab and experiment with it today.
An alternative node distribution that was created with more of an enterprise architecture from the start is ACINQ’s Eclair. They probably run the single largest node on the network; however, I have not seen Eclair elsewhere in the wild, which is why I recommend starting with LND, then branching out once the fundamental concepts are well understood.
For something easy to digest that will get you motivated to build an enterprise-grade LN node, here’s a video from base58, Enterprise Lightning Engineering at CashApp w/Ryan Loomba:
It’s also important to keep in mind that building and operating enterprise grade infrastructure is completely orthogonal to managing liquidity. Doing both of these well is likely to require at least two talented full-time individuals (more likely a team for each).
An astute reader may recognize I left out a small but critical piece: secrets management.
There are plenty of solutions there depending on your cloud provider. You definitely don’t need dedicated hardware; most of the top nodes are in AWS.
For the LND lab I describe above, I would also add lndinit to initialize each node. You could set up HashiCorp Vault to use as a backend.
My advice is don’t spend any money on hardware until you have been running a virtual environment suitable for production use. Then evaluate your threat model and determine if it’s worth the upfront cost.
reply
Great advice. I'll keep that in mind. I guess AWS is the way to go.
reply
From someone who has wasted thousands on unnecessary cloud bills, and has a mid-size homelab that’s 80% idle…
I would run everything locally (on a laptop/desktop) until that becomes a limitation, which is probably only the case when you’re ready to go live on mainnet.
Seriously, just use docker, or maybe a kubernetes dev cluster (kind or k3d) if you’re comfortable with it.
From someone who has accidentally lost too many testnet coins…
Use regtest as much as possible. When you’re ready to interface with the world, try one of the signets.
Seriously, be as stingy as possible with your sats. I used to think spending money on hardware and elaborate cloud infra would make me “invested” but the reality is all other resources required to learn this stuff are dwarfed by time, the scarcest resource of all.
reply
Thanks for the link to the Ryan Loomba video. It gave me a nice overview. I know I have my work cut out for me, but at least I have a better sense of knowing what I don't know.
I felt a little better realizing that cashapp struggles with routing and liquidity issues just like the rest of us.
reply
Thank you very much. This is exactly the information I was looking for. I guess you're experienced in this area (obviously)?
reply
You’re welcome, thanks for sharing your curiosity, the world needs more of that!
I have production experience using everything I described above, except Eclair.
reply
Also didn’t mean to imply I’ve used lnmux in prod. I experimented with early versions in a lab, and haven’t tried it in a while but there’s been progress based on git activity. They have a great introductory blog post.
It relies on HTLC interception, which has been in production use for a while now. That starts getting into what I would consider protocol territory (more so than infra) and requires deep integration with backend app logic. It’s worth learning about once you have a solid grasp of fundamentals in both distributed systems and LN.
reply