
Lessons from the AWS us-east-1 Outage: Why Local NVMe as Primary DB Storage Is Risky

EloqData
EloqData Core Team

On October 20, 2025, AWS experienced a major disruption across multiple services in the us-east-1 region. According to the AWS Health Dashboard, various compute, storage, and networking services were impacted simultaneously. For many teams running OLTP databases on instances backed by local NVMe, this was not just a downtime problem; it was a data durability nightmare.


Cloud databases must constantly balance durability, performance, and cost. In modern cloud environments, there are three main types of storage available:

| Storage Type | Durability | Latency | Cost | Persistence Across VM Crash |
| --- | --- | --- | --- | --- |
| Block Storage (EBS) | ✅ High | ⚠ Medium | 💰 High | ✅ Data persists |
| Local NVMe | ❌ None | ✅ Ultra-fast | ✅ Low per IOPS | ❌ Lost on stop, termination, or host failure |
| Object Storage (S3) | ✅ Very High | 🐢 Slow | ✅ Lowest | ✅ Persistent |

Let’s break down the trade-offs and why recent events place a spotlight on risky architectural choices.


📦 Option 1: Block-Level Storage (EBS) — Durable but Expensive and Slow

EBS is the default choice for reliability:

  • It survives instance failures.
  • Volumes are replicated within their Availability Zone; cross-AZ protection comes from database-level multi-replica setups.
  • It enables quick reattachment to replacement nodes.

But the downside?

  • gp2/gp3 volumes deliver modest IOPS and higher latency than local NVMe.
  • High-performance variants like io2 are extremely expensive when provisioned for hundreds of thousands of IOPS.
  • Scaling performance often means scaling cost linearly.

EBS gives you durability—but performance per dollar is disappointing.


⚡ Option 2: Local NVMe — Fast but Ephemeral (and Now Proven Risky)

Instance families like i4i provide 400K+ to 1M+ IOPS from local NVMe, making them a natural fit for databases chasing performance.

As a result, many database vendors recommend:

  • ✅ Use local NVMe for primary storage
  • ✅ Add cross-AZ replicas for durability

But here’s the problem: local NVMe is tied to the instance lifecycle. An OS-level reboot preserves instance store data, but if the instance is stopped, terminated, reclaimed by a Spot interruption, or loses its underlying host, including during a region-level failure such as the recent us-east-1 outage, you lose ALL the data on it.

During routine failures, cross-AZ replicas often protect you. But during region-wide degradation or cascading incidents, replicas can fail together, and with local NVMe there is then nothing left to recover: the storage is simply gone. All you can do is restore from the most recent backup, which often lags by days. Every write between that backup and the crash is lost.
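To make the loss window concrete, here is a minimal sketch of the arithmetic. The timestamps, the 5-second replica lag, and the function name `loss_window` are hypothetical values chosen for illustration, not measurements from the outage.

```python
from datetime import datetime, timedelta


def loss_window(failure_at: datetime, last_durable_at: datetime) -> timedelta:
    """Return the span of acknowledged writes that cannot be recovered."""
    if last_durable_at > failure_at:
        raise ValueError("last durable point cannot be after the failure")
    return failure_at - last_durable_at


# Hypothetical timeline (assumed values, for illustration only).
failure = datetime(2025, 1, 2, 8, 0, 0)           # all NVMe-backed nodes are lost
last_backup = datetime(2025, 1, 1, 2, 0, 0)       # most recent completed backup
last_replicated = failure - timedelta(seconds=5)  # async replica was 5 s behind

backup_loss = loss_window(failure, last_backup)
replica_loss = loss_window(failure, last_replicated)

print(f"Restoring from backup loses {backup_loss} of writes")
print(f"Failing over to a surviving async replica loses {replica_loss} of writes")
```

The second number only applies if a replica survives. When every replica sits on local NVMe and they all go down together, the backup figure is the one that counts.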

In contrast, an EBS volume outlives its instance and can be reattached to a replacement node in the same Availability Zone.

👉 The AWS us-east-1 outage just validated that “local NVMe + async replication” is a high-risk strategy for mission-critical databases.


☁️ Option 3: Object Storage (S3) — Durable & Cheap, But Latency Is a Challenge

Object storage is:

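  • ✅ Roughly 3x cheaper per GB than block storage
  • ✅ Durable across multiple Availability Zones, and across regions when cross-region replication is enabled
  • ✅ Able to survive region-level failures when replicated to a second region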
  • ✅ Practically infinite
  • ✅ A first-class citizen for modern cloud-native platforms

But the challenge remains: S3 latency is too high for OLTP if accessed synchronously.

This is why traditional OLTP engines avoid it.

So the question becomes: 🧠 How do we get the cost & durability benefits of S3 without paying the latency penalty?


✅ The Data Substrate Approach: Object Storage First, NVMe as Cache, EBS for Logs

EloqData treats object storage (e.g., S3) as the primary data store and architects the system to avoid the usual latency pitfalls:

| Layer | Role | Why |
| --- | --- | --- |
| S3 (Object Storage) | Primary data store | ✅ Ultra-durable, ✅ Cheap |
| EBS (Block Storage) | Durable log storage | ✅ Small volume, ✅ low latency writes |
| Local NVMe | High-performance cache | ✅ Accelerates reads & async flushes |

Through Data Substrate, we decouple storage from compute and split durability between:

  • ✅ Log: persists immediately to EBS
  • ✅ Data store: periodically checkpointed to S3 (async + batched)
  • ✅ NVMe: purely a cache layer, safe to lose at any time
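The split above can be sketched as a toy model. This is not EloqData's implementation: the class `TieredStore` and its methods are hypothetical, plain Python containers stand in for EBS, S3, and NVMe, and the checkpoint runs synchronously here, whereas the real one is described as asynchronous and batched.

```python
class TieredStore:
    """Toy model: durable log + object-store checkpoints + disposable cache."""

    def __init__(self, checkpoint_every: int = 3):
        self.log = []            # stands in for the log volume on EBS
        self.object_store = {}   # stands in for checkpointed data in S3
        self.cache = {}          # stands in for local NVMe
        self.checkpoint_every = checkpoint_every

    def put(self, key, value):
        self.log.append((key, value))   # 1. the write becomes durable in the log
        self.cache[key] = value         # 2. then it is served from the cache
        if len(self.log) >= self.checkpoint_every:
            self.checkpoint()           # synchronous here for simplicity

    def checkpoint(self):
        for key, value in self.log:     # apply writes in log order
            self.object_store[key] = value
        self.log.clear()                # truncate only after the data is stored

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        # Cache miss: the newest value may still live only in the log.
        for logged_key, logged_value in reversed(self.log):
            if logged_key == key:
                self.cache[key] = logged_value
                return logged_value
        value = self.object_store[key]  # raises KeyError if the key is unknown
        self.cache[key] = value
        return value


store = TieredStore(checkpoint_every=3)
store.put("a", 1)
store.put("b", 2)
store.cache.clear()              # simulate losing the NVMe cache
value_after_wipe = store.get("a")  # served from the log, not the cache
store.put("c", 3)                # third write triggers a checkpoint
```

The point of the ordering in `put` is that nothing is acknowledged on the strength of the cache alone, so clearing `cache` at any moment loses no committed write.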

This allows us to:

  • ✅ Withstand node crashes seamlessly
  • ✅ Recover fully even if local NVMe is wiped
  • ✅ Handle region-level disruption by replaying logs and checkpoints
  • ✅ Enjoy millions of IOPS from NVMe without durability risk
  • ✅ Cut storage cost by 3x+ compared to full EBS-based systems
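Recovery in a design like this comes down to loading the last checkpoint and replaying the log records written after it. Here is a minimal sketch of that idea, assuming a key-value state, a log of `(key, value)` pairs, and `None` as a delete marker; the function name `recover` and the record format are illustrative assumptions, not the Data Substrate API.

```python
def recover(checkpoint: dict, log: list) -> dict:
    """Rebuild state from the last checkpoint plus the log written after it."""
    state = dict(checkpoint)      # start from the checkpointed snapshot
    for key, value in log:        # replay newer writes in commit order
        if value is None:         # None marks a delete in this sketch
            state.pop(key, None)
        else:
            state[key] = value
    return state


# Hypothetical data: what survived in object storage and in the durable log.
checkpoint = {"balance:alice": 100, "balance:bob": 50}
log_tail = [
    ("balance:alice", 80),    # update committed after the checkpoint
    ("balance:carol", 20),    # insert committed after the checkpoint
    ("balance:bob", None),    # delete committed after the checkpoint
]

recovered = recover(checkpoint, log_tail)
print(recovered)
```

Nothing in `recover` reads from the cache, which is why a wiped NVMe device affects warm-up time rather than correctness.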

Check out more on our products powered by Data Substrate:


🌍 The Larger Industry Trend

We are not alone in this shift. The broader ecosystem is moving object-storage-first:

| System | Use of Object Storage |
| --- | --- |
| Snowflake | OLAP on S3 |
| StreamNative Ursa | Streaming data on S3 |
| Confluent Freight Clusters | Streaming data on S3 |
| Turbopuffer | Vector & full-text search on S3 |

EloqData brings this model to OLTP with a transactional, low-latency engine powered by Data Substrate.


📘 After the Outage: A Hard Question Every Architect Should Ask

If my database node died right now, would I lose all my data?

If you're running a primary database on local NVMe, and relying solely on async replicas, the answer might be yes.

It’s time to rethink durability assumptions in the cloud era.


✅ TL;DR

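| Strategy | Performance | Durability | Region Outage Risk | Cost |
| --- | --- | --- | --- | --- |
| EBS only | ❌ Limited | | | 💰💰💰 |
| Local NVMe only | ✅✅✅ | | | 💰💰 |
| NVMe + async replicas | ✅✅✅ | ⚠ Partial | ⚠ High | 💰💰 |
| Object Storage + Log + NVMe Cache (EloqData) | ✅✅✅ | ✅✅✅ | | 💰 |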

🔥 Final Thought

AWS us-east-1 just reminded the industry: Performance is replaceable. Lost data is not.

With the right architecture, you don’t have to choose.

👉 Build fast. 👉 Stay durable. 👉 Be outage-proof.

That’s the future we’re building at EloqData.

Check out more on our open source databases: