Deliverability Analytics When Platforms Go Dark: Detecting Outage‑Driven Drops Versus List Decay
Learn how to tell outage-driven deliverability drops from slow list decay using baselines, anomaly detectors, dashboards, and automated mitigations.
When Deliverability Drops Overnight: Is It an Outage or List Decay?
Your campaign metrics tank at 10:12 UTC: bounce rate spikes, accepted counts collapse, and stakeholders demand answers. Is a downstream platform outage to blame, or has your recipient list quietly eroded over months? For technology teams responsible for secure recipient workflows, distinguishing urgent outages from long-term deliverability erosion is mission-critical: the wrong response wastes time and increases risk.
Why this matters in 2026
Late 2025 and early 2026 saw a wave of high-profile CDN and cloud outages (notably Cloudflare and major edge services), and several widely-reported platform incidents that created global ripple effects for mail, webhooks, and API delivery. At the same time, privacy-preserving tracking, mailbox provider filtering, and more aggressive ISP heuristics have made deliverability signals noisier. Teams can no longer rely on single metrics or rigid thresholds; modern incident detection must combine baseline-aware analytics, cross-source correlation, and automated mitigations.
High-level approach: layered detection and response
Detecting outage-driven drops vs. list decay is a classification problem you solve with layered telemetry and playbooks. The recommended pipeline:
- Collect high-fidelity metrics across transport, engagement, and external status feeds.
- Build dynamic historical baselines with seasonality and campaign-aware segmentation.
- Apply statistical and change-point anomaly detection to identify sudden vs. gradual shifts.
- Correlate anomalies with external outage signals (CDN, cloud status, vendor incidents).
- Trigger tiered alerts and automated mitigations (pause, retry, reroute, quarantine).
- Run a structured postmortem and adjust runbooks and thresholds.
Which metrics matter — and why
Collect the following metrics at high cardinality (by sending IP, domain/recipient host, campaign, geo, and timestamp); a sketch of one possible event record follows the list:
- Sent / accepted / delivered counts — baseline for throughput.
- Bounce types — 4xx (temporary), 5xx (permanent), connection errors, DNS failures.
- SMTP response codes and subcodes — where available (e.g., 421 vs. 550).
- Transport timeouts and connection latency — network-level failures often indicate outages. See also best practices for latency budgeting when you instrument short-window detectors.
- Spam complaints and unsubscribes — signals of reputation decay.
- Engagement signals — opens/clicks (noting privacy changes), unique opens, clicks per recipient.
- Inbox placement probe results — synthetic recipients give independent placement tests.
- External status feeds — Cloudflare, AWS, Google Workspace, major ISPs, and third-party incident trackers (e.g., DownDetector, Statuspage APIs).
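For concreteness, here is a minimal sketch of the kind of event record these metrics can hang off. The field names (sending_ip, recipient_domain, smtp_code, and so on) are illustrative assumptions, not a required schema:

# Illustrative delivery-event record; field names are assumptions, not a required schema.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class DeliveryEvent:
    event_time: datetime                 # timestamp of the SMTP attempt
    sending_ip: str                      # sending IP, for per-pool baselines
    recipient_domain: str                # recipient host / ISP
    campaign_id: str
    geo: str                             # sending or recipient region
    status: str                          # "accepted", "bounced", "deferred", "timeout"
    smtp_code: Optional[int] = None      # e.g. 421, 451, 550
    smtp_subcode: Optional[str] = None   # e.g. "4.7.0"
    connect_latency_ms: Optional[float] = None
    complaint: bool = False              # feedback-loop spam complaint
    unsubscribe: bool = False

Keeping one row per attempt at this granularity is what lets the later queries slice by ISP, campaign, and time bucket without re-ingesting data.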
Distinguishing outage-driven drops vs. list decay — practical heuristics
Below are reliable patterns and how to detect them; a triage sketch that codifies both sets of heuristics follows the two lists.
Outage-driven drops (characteristics)
- Sudden, large amplitude change: Accepted/delivered counts drop by >30–50% within minutes to an hour.
- Transient 4xx/connection errors spike: Large jump in 421/451 or TCP timeouts across many domains and regions.
- Cross-provider impact: Multiple recipient domains and ISPs affected simultaneously.
- Network/transport anomalies: Increased SMTP connection latency and resets.
- External corroboration: Incident reports from CDN/cloud vendors or public outage trackers at the same time window.
List decay / long-term deliverability erosion (characteristics)
- Gradual trend: Bounce rates and complaint rates rise over weeks/months; engagement declines.
- Domain-specific patterns: Problems concentrated on specific ISPs or older segments, not global.
- Increase in hard bounces (5xx): Persistent 5xx and user unknown responses on the same recipients.
- Rising complaint and unsubscribe rates: Lower engagement and higher spam reports indicate reputation issues.
- Low inbox placement in probes: Synthetic tests show persistent spam folder placement.
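These heuristics can be codified into a first-pass triage function. The following is a hedged sketch: the thresholds and argument names are assumptions to tune against your own baselines, not fixed rules.

# First-pass triage of a deliverability drop; thresholds are illustrative assumptions.
def classify_drop(accept_rate_drop_pct: float,
                  transient_4xx_multiplier: float,
                  affected_domains_pct: float,
                  hard_bounce_trend_per_week: float,
                  complaint_rate_trend_per_week: float) -> str:
    """Return 'probable_outage', 'probable_list_decay', or 'needs_investigation'."""
    sudden_and_broad = (
        accept_rate_drop_pct >= 30          # large drop within minutes to an hour
        and transient_4xx_multiplier >= 3   # 421/451 or timeouts well above baseline
        and affected_domains_pct >= 50      # impact spans many recipient domains
    )
    slow_and_targeted = (
        accept_rate_drop_pct < 15                    # no abrupt collapse
        and (hard_bounce_trend_per_week > 0          # hard bounces creeping up
             or complaint_rate_trend_per_week > 0)   # complaints creeping up
    )
    if sudden_and_broad:
        return "probable_outage"
    if slow_and_targeted:
        return "probable_list_decay"
    return "needs_investigation"

Anything that lands in "needs_investigation" should go to a human with the evidence snapshot described later, not to an automated mitigation.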
Analytics techniques: baselines and anomaly detection
Fixed thresholds will fail in modern delivery contexts. Use dynamic baselines and layered detectors:
1. Time-series baselines with seasonality
Compute baselines with weekly and daily seasonality. For many senders, day-of-week and hour-of-day matter. Use at least 8–12 weeks of history for stable baselines but weigh recent data more heavily.
Recommended approach: an Exponentially Weighted Moving Average (EWMA) baseline with a seasonal decomposition (additive STL or similar). This captures normal cyclical patterns while responding reasonably fast to changes.
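A minimal sketch of that approach, assuming hourly accept-rate data in a pandas Series with a DatetimeIndex, 8–12 weeks of history, and statsmodels available as a dependency:

# Minimal sketch: seasonal-aware EWMA baseline for hourly accept-rate data.
import pandas as pd
from statsmodels.tsa.seasonal import STL

def seasonal_ewma_baseline(accept_rate: pd.Series, alpha: float = 0.05) -> pd.Series:
    # Decompose out the weekly cycle (168 hourly points per week).
    decomposition = STL(accept_rate, period=168, robust=True).fit()
    deseasonalized = accept_rate - decomposition.seasonal
    # EWMA over the deseasonalized series weighs recent behaviour more heavily.
    level = deseasonalized.ewm(alpha=alpha, adjust=False).mean()
    # Re-add the seasonal component so the baseline is directly comparable to raw data.
    return level + decomposition.seasonal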
2. Short-window detectors for sudden drops
For outage detection, use short-window change detection that triggers on an abrupt delta from the EWMA baseline. Two effective methods (a CUSUM sketch follows this list):
- Z-score on the short window (e.g., 5–15 minute aggregates) vs. baseline standard deviation. Trigger when z < -3 over successive windows.
- CUSUM / Page-Hinkley to detect negative shifts quickly with low false positives.
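A hedged sketch of the CUSUM variant, detecting only negative (downward) shifts; the drift k and threshold h are illustrative and need per-stream tuning:

# One-sided CUSUM detector for negative shifts in accept rate.
from typing import Iterable, List

def cusum_negative_shifts(values: Iterable[float], target: float,
                          k: float = 0.5, h: float = 5.0) -> List[int]:
    """Return indices where the cumulative downward deviation exceeds h."""
    alarms, s_neg = [], 0.0
    for i, x in enumerate(values):
        # Accumulate only deviations below target minus the allowed drift k.
        s_neg = min(0.0, s_neg + (x - target + k))
        if s_neg < -h:
            alarms.append(i)
            s_neg = 0.0   # reset after alarming so repeated shifts re-trigger
    return alarms

Run this on 5–15 minute aggregates against the seasonal baseline above and it reacts within a few buckets while tolerating single noisy points.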
3. Slow drift detectors for decay
For list decay, detect statistically significant trends over longer windows (7–30 days). Two options, one of which is sketched after this list:
- Rolling linear regression slope on bounce rate or accepted rate (is slope significantly negative?).
- Bootstrapped confidence intervals for engagement metrics; persistent deviation beyond CI indicates erosion.
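A sketch of the rolling-slope option, assuming a daily accept-rate Series and scipy; the 30-day window and 0.05 p-value cutoff are assumptions, not recommendations:

# Rolling linear-regression slope on a daily accept-rate series; flags erosion
# when the slope is significantly negative.
import numpy as np
import pandas as pd
from scipy.stats import linregress

def trend_is_eroding(daily_accept_rate: pd.Series, window_days: int = 30,
                     p_value_cutoff: float = 0.05) -> bool:
    recent = daily_accept_rate.dropna().iloc[-window_days:]
    if len(recent) < window_days:
        return False  # not enough history to call a trend
    x = np.arange(len(recent))
    result = linregress(x, recent.values)
    # A significantly negative slope on accept rate points to gradual erosion, not an outage.
    return result.slope < 0 and result.pvalue < p_value_cutoff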
4. Multi-dimensional anomaly rules
Combine metrics rather than alert on a single signal. Example composite rule for outage detection:
Trigger an outage alarm if: (accepted_rate drops >40% vs. the EWMA baseline) AND (connection_timeouts exceed 4x baseline) AND (external_status indicates an incident OR TCP resets spike above 3x baseline).
Sample detection pseudocode
// Pseudocode: short-window outage detector
baseline = EWMA(metric, alpha=0.05, seasonality=weekly)
std = rolling_std(metric, window=7d)
for each 5min bucket:
    z = (value - baseline[bucket.time_of_week]) / max(std[bucket.time_of_week], eps)
    if z < -3 and consecutive_buckets >= 3:
        check_correlation = (timeouts > timeout_baseline * 3) or (external_incident == true)
        if check_correlation:
            fire_outage_alert()
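A runnable Python version of the same logic, assuming per-hour-of-week baseline and standard-deviation Series (indexed 0–167) have already been computed, for example from the seasonal EWMA sketch earlier; thresholds mirror the pseudocode and remain tunable:

# Runnable sketch of the short-window outage detector above. Assumes all Series
# of 5-minute aggregates share a DatetimeIndex.
import pandas as pd

EPS = 1e-9

def detect_outage(accept_rate_5min: pd.Series, baseline: pd.Series, std: pd.Series,
                  timeouts_5min: pd.Series, timeout_baseline: float,
                  external_incident: bool, z_cutoff: float = -3.0,
                  consecutive_required: int = 3) -> bool:
    # Map each bucket to its hour-of-week so it compares against the right baseline slot.
    hour_of_week = accept_rate_5min.index.dayofweek * 24 + accept_rate_5min.index.hour
    z = (accept_rate_5min.values - baseline.loc[hour_of_week].values) \
        / std.loc[hour_of_week].values.clip(min=EPS)
    breaches = pd.Series(z, index=accept_rate_5min.index) < z_cutoff
    # Require N consecutive breached buckets to avoid flapping on single bad buckets.
    sustained = breaches.astype(int).rolling(consecutive_required).sum().iloc[-1] >= consecutive_required
    corroborated = (timeouts_5min.iloc[-1] > timeout_baseline * 3) or external_incident
    return bool(sustained and corroborated)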
Concrete SQL examples (ClickHouse / Postgres style)
These queries illustrate how to compute short-window z-scores and detect outages at the campaign level.
Short-window z-score (ClickHouse-like)
-- Assumes deliveries carries a precomputed hour_of_week column matching baseline_table
SELECT
    campaign_id,
    toStartOfFiveMinutes(event_time) AS bucket,
    sum(sent) AS sent_count,
    sum(accepted) AS accepted_count,
    accepted_count / sent_count AS accept_rate,
    (accept_rate - any(baseline_accept_rate)) / nullIf(any(baseline_std), 0) AS z_score
FROM deliveries
LEFT JOIN baseline_table USING (campaign_id, hour_of_week)
WHERE event_time > now() - INTERVAL 1 DAY
GROUP BY campaign_id, bucket
HAVING z_score < -3
ORDER BY bucket DESC
Longer-term trend slope (Postgres-like)
WITH daily AS (
    SELECT date_trunc('day', event_time) AS day,
           SUM(accepted)::float / NULLIF(SUM(sent), 0) AS accept_rate
    FROM deliveries
    WHERE event_time > now() - interval '60 days'
    GROUP BY day
)
SELECT regr_slope(accept_rate, extract(epoch FROM day)) AS slope
FROM daily;
-- A significantly negative slope indicates gradual erosion rather than an outage
Dashboard design: what panels to build
Your out-of-the-box deliverability dashboard should include both operational and analytical panels. Group them into three rows:
Row 1 — Real-time operational view
- Sent / Accepted / Rejected live rates (5-minute resolution).
- SMTP connection errors and timeouts per minute.
- Per-ISP accepted rate heatmap (rows = ISPs, columns = 5-min buckets); an aggregation sketch for this panel follows Row 3.
- External incidents widget (statuspage feed + DownDetector).
Row 2 — Anomaly detection & signals
- Z-score panel for accepted rate vs. baseline.
- CUSUM change point annotations over the last 72 hours.
- Composite trigger state (OK / WARNING / OUTAGE) with playbook link.
Row 3 — Historical trends for root cause analysis
- 30/60/90-day trend for hard bounces and complaint rates.
- Inbox placement probe results over time by provider.
- Segmented cohort engagement decay heatmap (by last_open_date).
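For the per-ISP heatmap in Row 1, a small pandas aggregation sketch; column names follow the illustrative event record from earlier and are assumptions:

# Build the per-ISP accept-rate matrix (rows = ISPs, columns = 5-minute buckets)
# from raw delivery events.
import pandas as pd

def isp_heatmap(events: pd.DataFrame) -> pd.DataFrame:
    df = events.copy()
    df["bucket"] = df["event_time"].dt.floor("5min")
    df["accepted_flag"] = (df["status"] == "accepted").astype(int)
    grouped = df.groupby(["recipient_domain", "bucket"])["accepted_flag"].mean()
    return grouped.unstack("bucket")   # feed this matrix to your heatmap panel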
Alerting and incident playbooks
Design alerts and playbooks for two classes: Operational outage and Deliverability erosion. Each alert should include automated context enrichment (one possible snapshot structure is sketched after this list):
- Evidence snapshot: last 60 minutes of metrics, top affected ISPs, sample SMTP responses.
- External correlation: vendor status links and public outage summaries.
- Suggested first steps tailored to the detected class.
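One possible shape for that enrichment, sketched with placeholder query interfaces; metrics_store and status_feed are stand-ins for whatever you use internally:

# Hedged sketch of the evidence snapshot attached to an alert; designed to feed
# the pause-webhook payload shown later in this post.
from datetime import datetime, timedelta, timezone

def build_evidence_snapshot(metrics_store, status_feed) -> dict:
    now = datetime.now(timezone.utc)
    window_start = now - timedelta(minutes=60)
    return {
        "window": [window_start.isoformat(), now.isoformat()],
        "accept_rate_series": metrics_store.accept_rate(window_start, now),
        "top_affected_isps": metrics_store.top_error_domains(window_start, now, limit=5),
        "sample_smtp_responses": metrics_store.sample_errors(window_start, now, limit=10),
        "external_incidents": status_feed.active_incidents(),
    }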
Outage playbook (fast and reversible)
- Confirm cross-system impact; check vendor status pages.
- Immediately switch to safe mode: pause non-critical campaigns and move remaining traffic to throttled retries with exponential backoff.
- Reroute to alternate sending IP pools or region if multi-region pools exist.
- Notify stakeholders and update status page with an estimated impact window.
- Keep synthetic probes running and capture full SMTP traces for timeline.
List-decay playbook (investigative and conservative)
- Isolate affected segments (age, source, channel) and stop sending to high-risk cohorts.
- Run targeted re-permissioning flows for stale recipients and re-engagement campaigns with strict throttles.
- Adjust suppression lists — quarantine persistent hard-bounces and re-evaluate suppression TTLs.
- Review authentication (SPF, DKIM, DMARC) and reputation signals. Consider warming new IPs/streams slowly.
- Schedule a deeper deliverability audit with inbox-probe providers and remediation steps.
Automated mitigations — examples and caution
Automation reduces time-to-mitigation but carries risk. Use tiered automation with human confirmation for high-impact changes.
Low-risk automations (auto-run)
- Auto-throttle (reduce send rate by X%) when accepted_rate drops below baseline by Y%.
- Automatic retry window with exponential backoff for 4xx errors (sketched after this list).
- Auto-add transient-failure recipients to a retry queue with TTL.
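A sketch of the backoff automation. Here send_fn, result.accepted, and result.is_transient_4xx are hypothetical stand-ins for your own send interface, and the delays and caps are assumptions:

# Exponential backoff with full jitter for recipients that hit transient 4xx errors.
import random
import time

def retry_with_backoff(send_fn, recipient: str, max_attempts: int = 5,
                       base_delay_s: float = 30.0, max_delay_s: float = 3600.0) -> bool:
    for attempt in range(max_attempts):
        result = send_fn(recipient)
        if result.accepted:                 # delivered; nothing more to do
            return True
        if not result.is_transient_4xx:     # 5xx or hard failure: stop retrying, suppress instead
            return False
        # Full jitter: sleep a random amount up to the exponentially growing cap.
        delay = min(max_delay_s, base_delay_s * (2 ** attempt))
        time.sleep(random.uniform(0, delay))
    return False   # retry window exhausted; hand off to the quarantine/TTL queue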
Higher-risk automations (require approval)
- Pausing all campaigns or changing sending IP pools (require on-call approval).
- Automated suppression of large cohorts without verification.
Sample webhook to pause a campaign (example payload your system can call):
POST /api/v1/campaigns/pause
Content-Type: application/json
Authorization: Bearer <token>
{
"campaign_id": "c_12345",
"reason": "outage-detected",
"evidence": {
"accept_rate_drop_pct": 62,
"z_score": -4.8,
"top_error": "421 connection timeout",
"external_incident": "https://status.cloudflare.com/"
}
}
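A small client-side sketch of calling that endpoint from an automation runner, assuming the requests library and a token held in the environment (CAMPAIGN_API_TOKEN is a placeholder name):

# Fire the pause webhook shown above from an automation runner.
import os
import requests

def pause_campaign(base_url: str, campaign_id: str, evidence: dict) -> None:
    response = requests.post(
        f"{base_url}/api/v1/campaigns/pause",
        headers={"Authorization": f"Bearer {os.environ['CAMPAIGN_API_TOKEN']}"},
        json={"campaign_id": campaign_id, "reason": "outage-detected", "evidence": evidence},
        timeout=10,
    )
    response.raise_for_status()   # surface failures to the on-call alert, don't swallow them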
Postmortem and continuous improvement
A consistent postmortem is the difference between repeating the same firefight and actually reducing future outages and decay. A lean postmortem template for deliverability incidents should include:
- Timeline: When did metrics deviate? When were mitigations applied?
- Scope: Which campaigns, IP pools, ISPs, and geos were affected?
- Root cause analysis: Why did this happen? Was it a vendor outage, misconfiguration, or list quality issue?
- Mitigation effectiveness: Which actions restored delivery and how fast?
- Action items: Concrete tasks, owners, deadlines, and verification criteria.
Good postmortems are short, evidence-based, and end with verifiable action items. In 2026, teams that tie action items to telemetry and automated tests recover faster and avoid repeat incidents.
Operationalizing learnings with tests and probes
Prevent future misclassifications by maintaining a suite of active tests:
- Synthetic inbox probes that run on a schedule across providers and geos.
- Transport probes validating SMTP handshake, TLS, and DNS resolution from multiple regions (a minimal probe sketch follows this list).
- End-to-end pipeline monitors that validate webhook delivery and webhook retries to downstream systems.
- Chaos testing for failover scenarios (simulate CDN or region failures) to validate runbooks and automated mitigations.
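A minimal transport-probe sketch using only the Python standard library; run it from each probe region against the MX hosts you care about and ship the resulting flags into the same time-series store as your delivery metrics:

# DNS resolution plus SMTP banner and STARTTLS check from the probe's own region.
import smtplib
import socket
import ssl

def probe_transport(mx_host: str, port: int = 25, timeout: float = 10.0) -> dict:
    result = {"dns_ok": False, "smtp_ok": False, "tls_ok": False}
    try:
        socket.getaddrinfo(mx_host, port)          # DNS resolution
        result["dns_ok"] = True
        with smtplib.SMTP(mx_host, port, timeout=timeout) as smtp:
            smtp.ehlo()                            # SMTP handshake / banner
            result["smtp_ok"] = True
            smtp.starttls(context=ssl.create_default_context())
            result["tls_ok"] = True                # TLS negotiation succeeded
    except (OSError, smtplib.SMTPException):
        pass                                       # leave the remaining flags False
    return result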
2026 trends and future-proofing
Key trends through early 2026 that should shape your observability and mitigation strategy:
- Higher volatility in edge and CDN services; build multi-vendor fallbacks and region-aware routing.
- Privacy changes reduce the reliability of opens; rely more on actionable metrics like bounces, complaints, and probe placement.
- AI-powered filtering at major ISPs will lead to new, subtle signals — maintain probe inboxes and ML-assisted anomaly detectors to catch these patterns.
- Regulatory and audit expectations emphasize documented runbooks and retention of delivery telemetry for postmortems and compliance.
Actionable checklist: 24-hour, 7-day, 90-day
24-hour
- Ensure real-time dashboard with EWMA baselines and z-score panels.
- Connect vendor status feeds and public outage trackers to the dashboard.
- Implement at least one auto-throttle and one human-approval mitigation path.
7-day
- Run a full chaos test simulating a CDN or regional cloud outage and validate runbooks.
- Start cohort-level decay detection for oldest segments and orphaned lists.
- Establish escalation policy and ticket templates for deliverability incidents.
90-day
- Deploy synthetic inbox probes across providers and geos and integrate results into your baseline model.
- Automate postmortem generation with telemetry snapshots and link to runbook improvements.
- Audit suppression and quarantine policies for safety and compliance; update documentation.
Real-world example (condensed case study)
In January 2026, several organizations saw a simultaneous drop in accepted rates. One mid-market SaaS provider observed a 55% drop in accepted mail within 20 minutes. Their detection pipeline found:
- A z-score < -5 on accepted rate across all sending IPs in three regions.
- Transport-level TCP resets and TLS handshake failures — not a domain-specific 5xx.
- Correlated Cloudflare incident reports and higher-than-normal DNS resolution failures to major mailbox providers.
They immediately throttled non-essential campaigns, switched to a secondary sending region, and temporarily enabled queued retries. Their postmortem identified a transient edge-network DNS issue at a CDN provider; mitigation removed the immediate impact and the team added active DNS probes and adjusted runbooks to speed future failover decisions.
Putting it together: recommended tech stack
An effective stack in 2026 combines time-series storage, streaming metrics, alerting, and automation:
- Time-series DB: ClickHouse or Prometheus + long-term store for high-cardinality delivery events.
- Dashboarding: Grafana with custom panels (z-score, CUSUM overlays).
- Alerting & incident management: PagerDuty / OpsGenie + webhook automation to change sending state.
- Probe providers: third-party inbox probes plus internal synthetic probes for transport checks.
- Automation: API-driven campaign control (pause/resume/throttle), and safe-mode runbooks in a CD pipeline.
Final takeaways
- Outages are sudden; decay is gradual. Use short-window change detectors for outages and long-window trend analysis for decay.
- Correlate widely. Combine internal metrics with external status feeds and probes before deciding on high-impact mitigations.
- Automate safely. Tier mitigations, require approvals for risky moves, and prefer reversible actions first.
- Invest in probes and postmortems. They are the core of continuous improvement and compliance in 2026.
Call to action
If your team is evaluating deliverability observability, start with a minimal reproducible pipeline: implement EWMA baselines, a z-score outage detector, and one safe automated mitigation (auto-throttle). For hands-on help, try recipient.cloud's deliverability analytics templates (pre-built dashboards, anomaly detectors, and webhook playbooks) to cut investigation time and reduce false-positive mitigations. Get a free checklist and a 14-day evaluation of our outage-detection pack to harden your delivery pipeline for 2026.
Related Reading
- Advanced strategies: latency budgeting for real-time scraping (useful for transport instrumentation)
- Edge sync & low-latency workflows (context on edge failures and probes)
- On-device AI for live moderation & ML-assisted detection
- How to audit your tool stack in one day (runbooks, chaos tests, and checklist)
- Cost-aware tiering & long-term storage strategies for high-cardinality telemetry