Postgres is a phenomenal database. We ran it for two years, scaled it to handle millions of requests per day, and it served us well. But as Halcyon grew from a handful of regions to a globally distributed platform, we hit scaling walls that no amount of read replicas or connection pooling could solve.
This isn't a "Postgres is bad" post. It's a post about finding the right tool for our specific access patterns, consistency requirements, and operational constraints at our current scale.
What Broke
Three things pushed us toward a change:
- Cross-region writes. Our deployment metadata needed to be writable from any region. With Postgres, we had a single primary in us-east-1, which meant 150-300ms write latency from Asia-Pacific and Europe. Logical replication helped reads but didn't solve writes.
- Connection pressure. At peak, we had 14,000 concurrent connections across all edge nodes. PgBouncer helped, but we were constantly tuning pool sizes, and connection storms during deploys caused cascading timeouts.
- Operational complexity. Vacuuming, bloat management, partition maintenance, and replica lag monitoring consumed a disproportionate amount of our infra team's time. We wanted a system with fewer operational knobs.
The Evaluation
We benchmarked four candidates against our production workload profile. Here's how the eventual winner, TiKV, compared with Postgres:
| Dimension | Postgres | TiKV |
|---|---|---|
| Write latency (cross-region) | 150-300ms | 15-40ms |
| Read latency (local) | 2-5ms | 3-8ms |
| Horizontal scaling | Manual sharding | Automatic |
| Connection model | Per-connection processes | gRPC multiplexed |
| Multi-region writes | Requires custom logic | Native Raft groups |
| Operational overhead | High (vacuum, bloat, etc.) | Moderate (PD tuning) |
TiKV's distributed transaction model using Raft consensus groups mapped perfectly to our requirement: strongly consistent, multi-region key-value storage with automatic sharding. The tradeoff was slightly higher local read latency, which we accepted because our hot path reads are served from a separate cache layer anyway.
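For a concrete feel, here is a minimal sketch of writing deployment metadata through TiKV's transactional API, assuming the tikv/client-go v2 txnkv client. The PD endpoint, key schema, and payload are illustrative, not our production code.

```go
package main

import (
	"context"
	"fmt"

	"github.com/tikv/client-go/v2/txnkv"
)

func main() {
	// Connect through PD (TiKV's placement driver); the endpoint is illustrative.
	client, err := txnkv.NewClient([]string{"127.0.0.1:2379"})
	if err != nil {
		panic(err)
	}
	defer client.Close()

	ctx := context.Background()

	// Hypothetical key schema for deployment metadata.
	key := []byte("deploy/api-gateway/meta")
	val := []byte(`{"version":"2025.1","regions":["us-east-1","ap-southeast-1"]}`)

	// Each transaction commits through Raft consensus, so the write can
	// originate from any region and remain strongly consistent.
	txn, err := client.Begin()
	if err != nil {
		panic(err)
	}
	if err := txn.Set(key, val); err != nil {
		panic(err)
	}
	if err := txn.Commit(ctx); err != nil {
		panic(err)
	}

	// Read it back in a fresh transaction.
	txn, err = client.Begin()
	if err != nil {
		panic(err)
	}
	got, err := txn.Get(ctx, key)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", got)
}
```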
Migration Approach
We ran both systems in parallel for six weeks using a dual-write pattern. Every write went to both Postgres and TiKV, and reads were served from Postgres with TiKV results used for validation. This let us catch consistency issues and performance regressions before cutting over.
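A minimal sketch of that dual-write wrapper, in Go; the Store interface, field names, and logging are illustrative rather than our actual implementation.

```go
package storage

import (
	"bytes"
	"context"
	"log"
)

// Store is the minimal key-value interface both backends implement.
// (Interface and method names are hypothetical.)
type Store interface {
	Put(ctx context.Context, key string, value []byte) error
	Get(ctx context.Context, key string) ([]byte, error)
}

// DualWriteStore sends every write to both Postgres and TiKV, serves
// reads from Postgres, and flags any divergence against TiKV.
type DualWriteStore struct {
	primary   Store // Postgres during the migration window
	candidate Store // TiKV
}

func (d *DualWriteStore) Put(ctx context.Context, key string, value []byte) error {
	// The primary write must succeed; a failed candidate write is logged
	// but does not fail the request during the parallel run.
	if err := d.primary.Put(ctx, key, value); err != nil {
		return err
	}
	if err := d.candidate.Put(ctx, key, value); err != nil {
		log.Printf("dual-write: candidate write failed for %q: %v", key, err)
	}
	return nil
}

func (d *DualWriteStore) Get(ctx context.Context, key string) ([]byte, error) {
	val, err := d.primary.Get(ctx, key)
	if err != nil {
		return nil, err
	}
	// Shadow read: compare the candidate's answer without affecting the
	// caller. In practice this would be sampled and done asynchronously.
	if shadow, serr := d.candidate.Get(ctx, key); serr == nil && !bytes.Equal(val, shadow) {
		log.Printf("dual-write: mismatch for %q", key)
	}
	return val, nil
}
```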
The migration itself went surprisingly smoothly. The hardest part wasn't moving the data; it was updating the 47 services that had Postgres-specific SQL hardcoded in them. We ended up building a thin abstraction layer that we should have had from the beginning.
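That layer was essentially the Store interface from the sketch above, with one implementation per backend. A hypothetical Postgres implementation might look like this (the table layout and queries are illustrative):

```go
package storage

import (
	"context"
	"database/sql"

	_ "github.com/lib/pq" // Postgres driver
)

// PostgresStore adapts a relational table to the Store interface so
// services never hold raw SQL directly.
type PostgresStore struct {
	db *sql.DB
}

func (p *PostgresStore) Put(ctx context.Context, key string, value []byte) error {
	_, err := p.db.ExecContext(ctx,
		`INSERT INTO kv (key, value) VALUES ($1, $2)
		 ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value`,
		key, value)
	return err
}

func (p *PostgresStore) Get(ctx context.Context, key string) ([]byte, error) {
	var value []byte
	err := p.db.QueryRowContext(ctx,
		`SELECT value FROM kv WHERE key = $1`, key).Scan(&value)
	return value, err
}
```

With both backends behind the same interface, swapping the primary at cutover was a configuration change rather than a code change in every service.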
Lessons Learned
- Don't fight your database's grain. We spent six months trying to make Postgres work as a distributed system. That effort would have been better spent evaluating alternatives earlier.
- Dual-write migrations are worth the complexity. Running both systems in parallel gave us confidence we wouldn't have had with a big-bang cutover.
- Abstract your storage layer early. If there's any chance you'll change databases, put a clean interface in front of it from day one. We learned this the hard way.
- Local read latency isn't everything. Our p50 reads got 2ms slower, but our p99 writes got 10x faster. For a globally distributed system, that's a clear win.