Multi-Carrier Routing Architecture
Achieving near-perfect SMS delivery across 190 countries requires a fundamentally different approach from simply partnering with a single aggregator. Our routing architecture maintains active connections to more than 40 carriers through direct partnerships, plus a curated set of tier-one aggregators, creating a mesh network capable of reaching virtually any mobile number on the planet.
Each message entering our system is evaluated against a real-time routing table that considers the destination country, carrier, number type, time of day, and current performance metrics for each available route. This evaluation happens in under 10 milliseconds, ensuring that routing intelligence never adds perceptible latency to the delivery flow.
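A toy version of that per-message evaluation might look like the following. The route names, metric fields, and scoring weights are invented for illustration; they are not our production schema:

```python
from dataclasses import dataclass

@dataclass
class Route:
    name: str
    delivery_rate: float   # recent success fraction, 0.0 to 1.0
    p50_latency_ms: float  # median carrier response time
    cost_per_msg: float    # USD

def score(route: Route) -> float:
    # Weighted blend: delivery rate dominates; latency and cost
    # act as tie-breakers. Weights are illustrative.
    return (route.delivery_rate * 100
            - route.p50_latency_ms * 0.01
            - route.cost_per_msg * 10)

def select_route(routes: list[Route]) -> Route:
    # Pick the highest-scoring available route for this
    # destination country / carrier / number-type combination.
    return max(routes, key=score)

routes = [
    Route("direct-carrier-A", 0.995, 180.0, 0.012),
    Route("aggregator-B",     0.990, 240.0, 0.009),
]
best = select_route(routes)
```

In practice the scoring function would also factor in time of day and number type, but the shape of the decision, a fast scored comparison over candidate routes, is the same.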
The routing table itself is continuously updated by a machine learning model trained on billions of historical delivery attempts. It learns which routes perform best for specific carrier and country combinations, adapting to the constantly shifting landscape of global telecommunications infrastructure.
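As a much simpler stand-in for that model, the adaptive behavior can be illustrated with an exponentially weighted moving average of delivery outcomes per route. The smoothing factor and optimistic prior are illustrative choices, and a real learned model would use far richer features:

```python
class RouteStats:
    """Smoothed delivery-rate estimate per route: adapts to shifting
    route performance, with none of the learned cross-features of a
    real ML model."""

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha           # smoothing factor
        self.rates = {}              # route name -> smoothed rate

    def record(self, route: str, delivered: bool) -> None:
        prev = self.rates.get(route, 1.0)   # optimistic prior
        outcome = 1.0 if delivered else 0.0
        self.rates[route] = (1 - self.alpha) * prev + self.alpha * outcome

    def rate(self, route: str) -> float:
        return self.rates.get(route, 1.0)

stats = RouteStats()
for _ in range(100):
    stats.record("route-X", True)
stats.record("route-X", False)  # one failure nudges the estimate down
```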
We also maintain what we call shadow routes: pre-warmed backup paths for every destination that are ready to accept traffic immediately if the primary route degrades. This eliminates the cold-start penalty that typically accompanies failover events.
Real-Time Delivery Monitoring
You cannot optimize what you cannot measure. Our delivery monitoring pipeline processes over 50 million events per day, tracking every message from the moment it enters our API to the final delivery receipt from the destination carrier. Each event is timestamped with microsecond precision and enriched with routing metadata.
We compute delivery rates on sliding windows of 1 minute, 5 minutes, 15 minutes, and 1 hour for every country-carrier combination. This multi-resolution approach lets us distinguish between momentary blips and genuine degradation trends. A 30-second outage at a single carrier looks very different from a gradual decline over an hour.
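The multi-resolution windows can be sketched like this. The in-memory event list is a simplification for illustration; a production pipeline at this volume would use pre-aggregated counters per window:

```python
import time
from typing import Optional

class DeliveryRateTracker:
    """Delivery rate over multiple sliding windows for one
    country-carrier combination. Window sizes mirror the text."""

    WINDOWS = (60, 300, 900, 3600)  # seconds: 1m, 5m, 15m, 1h

    def __init__(self):
        self.events = []  # list of (timestamp, delivered) tuples

    def record(self, delivered: bool, ts: Optional[float] = None) -> None:
        self.events.append((time.time() if ts is None else ts, delivered))

    def rates(self, now: Optional[float] = None) -> dict:
        now = time.time() if now is None else now
        out = {}
        for w in self.WINDOWS:
            recent = [d for t, d in self.events if t >= now - w]
            out[w] = sum(recent) / len(recent) if recent else 1.0
        return out

t = 10_000.0
tr = DeliveryRateTracker()
for _ in range(10):
    tr.record(True, ts=t - 30)   # inside every window
tr.record(False, ts=t - 600)     # outside 1m/5m, inside 15m/1h
r = tr.rates(now=t)
```

The same failure event leaves the 1-minute and 5-minute rates untouched while pulling the 15-minute and 1-hour rates down, which is exactly the signal that separates a blip from a trend.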
The monitoring system feeds into a set of anomaly detection models that are trained to recognize patterns preceding delivery failures. In many cases, we can detect and respond to carrier issues minutes before they would be visible in aggregate delivery rate metrics, often before the carrier itself has identified the problem.
Automatic Failover System
When our monitoring detects degradation on a route, the failover system activates within seconds. The process is entirely automated: the degraded route is deprioritized in the routing table, traffic is shifted to the next-best route, and an incident is created in our operations dashboard for human review.
Failover decisions are made at the granularity of individual carrier-country pairs, not entire providers. If a single carrier in a single country is experiencing issues, only that specific traffic is rerouted. Everything else continues on its optimal path unaffected. This surgical approach minimizes disruption and prevents cascading failures.
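A minimal sketch of failover at carrier-country granularity, with invented carrier names and an illustrative degradation threshold:

```python
DEGRADATION_THRESHOLD = 0.97  # illustrative, not our production value

class FailoverTable:
    def __init__(self):
        # (carrier, country) -> ordered candidate routes, best first
        self.routes = {}
        self.degraded = set()

    def report_rate(self, carrier: str, country: str, rate: float) -> None:
        key = (carrier, country)
        if rate < DEGRADATION_THRESHOLD:
            self.degraded.add(key)    # would also open an ops incident
        else:
            self.degraded.discard(key)

    def pick(self, carrier: str, country: str) -> str:
        key = (carrier, country)
        candidates = self.routes[key]
        # Skip the primary only for the degraded pair; every other
        # carrier-country pair keeps its optimal path.
        if key in self.degraded and len(candidates) > 1:
            return candidates[1]
        return candidates[0]

ft = FailoverTable()
ft.routes[("CarrierA", "DE")] = ["direct-de", "aggregator-eu"]
ft.routes[("CarrierB", "FR")] = ["direct-fr", "aggregator-eu"]
ft.report_rate("CarrierA", "DE", 0.91)   # degraded
ft.report_rate("CarrierB", "FR", 0.995)  # healthy
```

Note that degrading CarrierA in Germany has no effect on CarrierB in France: the keying by pair, not provider, is what makes the failover surgical.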
We test our failover system continuously through what we call chaos routing: intentionally degrading routes in a controlled manner to verify that the system responds correctly. This practice, inspired by chaos engineering principles, has helped us identify and fix edge cases that would only surface during real incidents.
Recovery is also automated. Once a degraded route shows sustained improvement, traffic is gradually shifted back to maintain optimal cost and performance characteristics. The entire cycle from detection to failover to recovery typically completes without any human intervention.
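The gradual shift of traffic back to a recovered route can be illustrated with a linear ramp; the 30-minute ramp length is an assumption made for this sketch:

```python
import random

def traffic_split(minutes_since_recovery: float,
                  ramp_minutes: float = 30.0) -> float:
    """Fraction of traffic returned to the recovered primary route,
    ramping linearly from 0 to 1 over ramp_minutes."""
    return max(0.0, min(1.0, minutes_since_recovery / ramp_minutes))

def choose(primary: str, backup: str,
           minutes_since_recovery: float) -> str:
    # Probabilistically shift traffic back as confidence grows;
    # any renewed degradation would reset the ramp.
    if random.random() < traffic_split(minutes_since_recovery):
        return primary
    return backup
```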
Regional Carrier Partnerships
Direct carrier partnerships are the foundation of high delivery rates. Aggregator chains add latency, reduce visibility, and introduce points of failure. We invest heavily in establishing direct connections with carriers in every major market, and we now have direct routes covering over 85 percent of global mobile subscribers.
Each partnership is negotiated to include delivery receipts, which many indirect routes do not provide. Delivery receipts are essential for our monitoring pipeline and allow us to provide customers with accurate, real-time delivery status for every message. Without them, delivery rate calculations rely on inference and estimation.
In markets where direct partnerships are not feasible, we maintain relationships with carefully vetted regional aggregators who specialize in specific geographies. These partners are selected based on delivery performance, receipt support, and regulatory compliance, and their performance is continuously monitored against the same standards we apply to direct routes.
Message Queue Optimization and Edge Routing
Our message queue is built on a custom distributed system designed specifically for the unique requirements of SMS delivery. Messages are partitioned by destination region and prioritized based on customer tier and message type, ensuring that time-sensitive verification codes are never delayed by bulk traffic.
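A simplified model of tier- and type-aware prioritization within one regional partition. The tier names and priority table are invented for illustration:

```python
import heapq
import itertools

# Lower number = dequeued first. Verification codes beat bulk
# traffic regardless of arrival order. Tiers are illustrative.
PRIORITY = {
    ("enterprise", "otp"):  0,
    ("standard",   "otp"):  1,
    ("enterprise", "bulk"): 2,
    ("standard",   "bulk"): 3,
}

class RegionQueue:
    """One queue per destination region; a heap orders messages by
    (priority, arrival sequence) so equal priorities stay FIFO."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def enqueue(self, msg: str, tier: str, msg_type: str) -> None:
        heapq.heappush(
            self._heap,
            (PRIORITY[(tier, msg_type)], next(self._seq), msg),
        )

    def dequeue(self) -> str:
        return heapq.heappop(self._heap)[2]

q = RegionQueue()
q.enqueue("bulk promo", "standard", "bulk")
q.enqueue("login code", "standard", "otp")
first = q.dequeue()
```

Even though the bulk message arrived first, the verification code is dequeued first, which is the property the paragraph above describes.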
We operate edge nodes in 12 regions globally, placing our routing infrastructure as close as possible to carrier ingestion points. A message destined for a number in Southeast Asia is routed through our Singapore edge node, cutting round-trip latency by hundreds of milliseconds compared to routing through a centralized data center.
Edge nodes maintain local copies of the routing table and can make independent routing decisions even if connectivity to the central system is temporarily interrupted. This distributed architecture ensures that regional network issues do not cascade into global delivery problems.
Queue depth is monitored in real time, and our auto-scaling infrastructure can spin up additional processing capacity within 30 seconds of detecting elevated queue depth. During peak traffic events like Black Friday or New Year, we pre-scale based on historical patterns to ensure zero queuing delays.
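The scale-up decision itself reduces to simple arithmetic over queue depth and drain rate; the numbers here are illustrative, not our production capacity figures:

```python
import math

def desired_workers(queue_depth: int, per_worker_rate: float,
                    target_drain_s: float = 30.0) -> int:
    """Workers needed to drain the current backlog within
    target_drain_s seconds, with a minimum floor of one kept warm."""
    if queue_depth == 0:
        return 1
    return max(1, math.ceil(queue_depth / (per_worker_rate * target_drain_s)))

# 90,000 queued messages, 100 msgs/s per worker, 30 s target
workers = desired_workers(90_000, 100.0)
```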
Monitoring and Alerting Pipeline
Our alerting pipeline is built on a tiered severity model. Level 1 alerts fire when delivery rates for any country drop below 99 percent on a 15-minute window and page the on-call engineer immediately. Level 2 alerts trigger at 98 percent and escalate to the engineering lead. Level 3 alerts at 95 percent activate our incident response protocol.
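The tiered model maps directly to code; the thresholds below are taken from the text, while the function shape is a sketch:

```python
from typing import Optional

def alert_level(delivery_rate: float) -> Optional[int]:
    """Map a country's 15-minute delivery rate to the tiered
    severity model. Returns None when no alert should fire."""
    if delivery_rate < 0.95:
        return 3   # activate incident response protocol
    if delivery_rate < 0.98:
        return 2   # escalate to the engineering lead
    if delivery_rate < 0.99:
        return 1   # page the on-call engineer
    return None
```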
Beyond delivery rates, we monitor latency percentiles, error code distributions, carrier response times, and queue depths. Each metric has its own set of thresholds calibrated through months of operational experience. We favor precision over recall in our alerting: every alert should be actionable, and alert fatigue is treated as a system bug.
Weekly operations reviews examine every alert that fired, every failover event, and every delivery rate dip, regardless of whether it triggered an alert. This retrospective process continuously refines our thresholds and detection models, driving the steady improvement that has taken us from 98.5 percent delivery to 99.9 percent over the past 18 months.
All operational data is exposed to customers through our dashboard, giving them the same visibility into delivery performance that our own engineering team has. Transparency is not just a value; it is a competitive advantage that builds trust and reduces support burden.