Retry Implementation Strategy: App vs MTA vs Hybrid

Analysis of where email retry logic should be implemented - at the application layer, MTA layer, or a hybrid approach.

Last Updated: February 2025
Decision: Hybrid Approach (Postfix for immediate, Application for long-term)

The Question

Should Postchi handle email delivery retries at:

Application layer (BullMQ/Worker)
MTA layer (Postfix)
Hybrid approach (Both)

Decision: Hybrid Approach ✅

Postfix handles immediate retries (0-1 hour)
Application handles long-term retries (1-72+ hours)

Why Hybrid is Best

1. Privacy & Compliance Requirements

From our message storage analysis, we identified:

Postfix queues messages in /var/spool/postfix/ with full email content
Default Postfix queue lifetime (maximal_queue_lifetime): 5 days (unacceptable for GDPR/HIPAA)
Our goal: Aggressive cleanup (1 hour max in Postfix queue)

Pure MTA retry would contradict our privacy requirements.

Hybrid solution:

Messages stay in Postfix queue max 1 hour
After 1 hour, content is only in S3 (encrypted, with lifecycle policies)
Application handles remaining retries (metadata only in database)

2. Competitor Analysis Insights

From our competitor research:

Provider	Implementation
SendGrid, Mailgun, SparkPost	Custom MTAs with app-level retry logic
AWS SES	Retries are not customer-configurable ("Cannot configure"); the service controls them internally
Mailgun (unique)	Offers configuration (5min-24h window)

Key insight: Successful ESPs control retry logic at the application layer, not relying on standard MTA behavior.

3. Feature Requirements

Features we need that pure MTA cannot provide:

Feature	Pure MTA	Hybrid
Per-organization retry policies	❌	✅
Track individual retry attempts	❌ (log parsing only)	✅ (database)
Fire webhooks on each retry	❌	✅
Tier-based retry duration	❌	✅
ISP-specific retry logic	❌	✅
Pause org if bounce rate > 10%	❌	✅
Analytics on retry success rates	❌	✅
Custom intervals per tier	❌	✅

These are competitive differentiators we'd lose with pure MTA approach.
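
As a rough illustration of the "pause org if bounce rate > 10%" row, the check could look like the sketch below. The `OrgSendStats` shape, the function names, and the minimum-volume guard are assumptions made for this example, not existing Postchi code.

```typescript
// Hypothetical per-organization counters over some rolling window.
interface OrgSendStats {
  sent: number;    // messages handed to the MTA in the window
  bounced: number; // messages that ended in a bounce in the window
}

const BOUNCE_RATE_THRESHOLD = 0.1; // 10%, taken from the feature table
const MIN_SAMPLE_SIZE = 50;        // assumption: do not pause on tiny volumes

function bounceRate(stats: OrgSendStats): number {
  return stats.sent === 0 ? 0 : stats.bounced / stats.sent;
}

// True when the organization should be paused pending review.
function shouldPauseOrg(stats: OrgSendStats): boolean {
  if (stats.sent < MIN_SAMPLE_SIZE) return false;
  return bounceRate(stats) > BOUNCE_RATE_THRESHOLD;
}
```

The minimum sample size is a design choice: without it, a new organization whose first message bounces would be paused at a 100% bounce rate.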

Implementation Architecture

Phase 1: Postfix Handles Immediate (0-1 hour)

Configuration

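# /etc/postfix/main.cf
# Note: main.cf does not support trailing comments, so each comment is on its own line.
# Return mail to the sender as undeliverable after 1 hour in the queue
maximal_queue_lifetime = 1h
# Give up on undeliverable bounce (MAILER-DAEMON) messages after 1 hour
bounce_queue_lifetime = 1h
# Shortest wait before retrying a deferred message
minimal_backoff_time = 5m
# Longest wait between retries of a deferred message
maximal_backoff_time = 15m
# How often the queue manager scans the deferred queue
queue_run_delay = 5m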
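
Because the 1-hour limit is a privacy requirement, it may be worth checking the deployed configuration automatically. The sketch below is one possible check, not an existing tool: `parseMainCf` and `queueLifetimeViolations` are hypothetical helpers, the parser ignores continuation lines, and it requires an explicit unit suffix (Postfix itself also accepts bare numbers with a per-parameter default unit).

```typescript
// Postfix time suffixes: seconds, minutes, hours, days, weeks.
const UNIT_SECONDS: Record<string, number> = { s: 1, m: 60, h: 3600, d: 86400, w: 604800 };

// Converts a value such as "15m" to seconds; rejects anything else.
function parsePostfixTime(value: string): number {
  const match = /^(\d+)([smhdw])$/.exec(value.trim());
  if (!match) throw new Error(`Unsupported time value: "${value}"`);
  return Number(match[1]) * UNIT_SECONDS[match[2]];
}

// Simplified "name = value" parser; skips blank lines and comment lines.
function parseMainCf(text: string): Record<string, string> {
  const settings: Record<string, string> = {};
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue;
    const eq = trimmed.indexOf("=");
    if (eq === -1) continue;
    settings[trimmed.slice(0, eq).trim()] = trimmed.slice(eq + 1).trim();
  }
  return settings;
}

// Lists the queue-lifetime settings that are missing or exceed the limit.
function queueLifetimeViolations(mainCf: string, maxSeconds = 3600): string[] {
  const settings = parseMainCf(mainCf);
  const problems: string[] = [];
  for (const name of ["maximal_queue_lifetime", "bounce_queue_lifetime"]) {
    const value = settings[name];
    if (value === undefined) {
      problems.push(`${name} is not set (Postfix default is 5d)`);
    } else if (parsePostfixTime(value) > maxSeconds) {
      problems.push(`${name} = ${value} exceeds ${maxSeconds}s`);
    }
  }
  return problems;
}
```

An empty result means both lifetimes are set and at or below one hour. A value followed by a trailing comment is rejected by `parsePostfixTime`, which mirrors the fact that Postfix would not treat that text as a comment either.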

What Postfix Handles

Connection timeouts
Network glitches
Temporary DNS issues
Recipient server temporarily unavailable
Quick ISP throttling

Typical Retry Pattern

0:00 - Initial attempt fails (4xx)
0:05 - Retry 1
0:10 - Retry 2
0:25 - Retry 3
0:40 - Retry 4
1:00 - Final attempt → If still failing, bounce to app

Result: Most temporary issues resolve within 1 hour.

Phase 2: Application Handles Long-Term (1-72+ hours)

BullMQ Configuration

interface RetryPolicy {
  maxDuration: number;    // milliseconds
  maxAttempts: number;
  intervals: number[];    // minutes
}

const retryPolicies: Record<string, RetryPolicy> = {
  FREE: {
    maxDuration: 24 * 60 * 60 * 1000,  // 24 hours
    maxAttempts: 5,
    intervals: [30, 60, 120, 240, 480]  // 30m, 1h, 2h, 4h, 8h
  },
  PRO: {
    maxDuration: 72 * 60 * 60 * 1000,  // 72 hours
    maxAttempts: 8,
    intervals: [30, 60, 120, 240, 480, 960, 1440, 2880]
  },
  ENTERPRISE: {
    maxDuration: 120 * 60 * 60 * 1000,  // 5 days
    maxAttempts: 12,
    intervals: [30, 60, 120, 240, 480, 960, 1440, 2880, 4320, 5760, 8640, 10080]
  }
};
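
The interval lists and `maxDuration` interact, so it can help to compute when each retry would fire. The sketch below does that arithmetic under two stated assumptions: each interval starts when the previous retry is handed back, and the time a retry spends inside Postfix is ignored. `PolicyLike`, `cumulativeOffsetsMinutes`, and `reachableAttempts` are names invented for this example.

```typescript
interface PolicyLike {
  maxDuration: number; // milliseconds
  maxAttempts: number;
  intervals: number[]; // minutes, one entry per retry
}

// Minutes after the application takes over at which each retry would fire.
function cumulativeOffsetsMinutes(policy: PolicyLike): number[] {
  let total = 0;
  return policy.intervals.map((interval) => (total += interval));
}

// Number of retries scheduled inside maxDuration, given the time already
// spent before the application took over (1 hour in Postfix by default).
function reachableAttempts(policy: PolicyLike, elapsedBeforeAppMinutes = 60): number {
  const budgetMinutes = policy.maxDuration / 60000 - elapsedBeforeAppMinutes;
  return cumulativeOffsetsMinutes(policy).filter((m) => m <= budgetMinutes).length;
}

// PRO values copied from the policy table above.
const proPolicy: PolicyLike = {
  maxDuration: 72 * 60 * 60 * 1000,
  maxAttempts: 8,
  intervals: [30, 60, 120, 240, 480, 960, 1440, 2880],
};

const proOffsets = cumulativeOffsetsMinutes(proPolicy); // last entry: 6210 minutes (103.5 h)
const proReachable = reachableAttempts(proPolicy);      // 7
```

Under these assumptions the PRO intervals sum to 103.5 hours, so only 7 of the 8 retries are scheduled within the 72-hour window. It is worth deciding whether `maxDuration` should bound the time a retry is scheduled or the time it actually runs, and trimming the interval lists to match.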

Handler Implementation

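// Called when Postfix returns the message as undeliverable after 1 hour.
// bounceMessage is the SMTP error text taken from the bounce.
async function handleSoftBounce(message: Message, bounceMessage: string) {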
  const org = await getOrganization(message.orgId);
  const policy = retryPolicies[org.tier];

  // Check if we should retry
  if (message.retryCount >= policy.maxAttempts) {
    return markAsFailedPermanently(message);
  }

  // Check if max duration exceeded
  const timeSinceCreation = Date.now() - message.createdAt.getTime();
  if (timeSinceCreation > policy.maxDuration) {
    return markAsFailedPermanently(message);
  }

  // Calculate next retry time
  const nextInterval = policy.intervals[message.retryCount];
  const delay = nextInterval * 60 * 1000; // Convert to ms

  // Add to BullMQ with delay
  await emailQueue.add('retry-email', {
    messageId: message.id,
    attempt: message.retryCount + 1
  }, {
    delay,
    jobId: `retry-${message.id}-${message.retryCount}`,
    removeOnComplete: true,
    removeOnFail: false
  });

  // Update status
  await prisma.message.update({
    where: { id: message.id },
    data: {
      status: 'RETRYING',
      retryCount: message.retryCount + 1,
      nextRetryAt: new Date(Date.now() + delay),
      lastRetryError: bounceMessage
    }
  });

  // Fire webhook
  await fireWebhook(org, {
    event: 'email.retry_scheduled',
    messageId: message.id,
    attempt: message.retryCount + 1,
    nextRetryAt: new Date(Date.now() + delay),
    reason: 'soft_bounce'
  });
}
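
The handler mixes the retry decision with queue, database, and webhook side effects. One way to make the decision testable on its own is to extract it into a pure function, as in this sketch. `decideRetry`, `RetryDecision`, and `RetryPolicyShape` are illustrative names, and the `unknown_tier` branch is an added assumption for tiers missing from `retryPolicies`.

```typescript
interface RetryPolicyShape {
  maxDuration: number; // milliseconds
  maxAttempts: number;
  intervals: number[]; // minutes
}

type RetryDecision =
  | { action: "retry"; attempt: number; delayMs: number }
  | { action: "fail"; reason: "max_attempts" | "max_duration" | "unknown_tier" };

// Mirrors the checks in handleSoftBounce without touching any I/O.
function decideRetry(
  policy: RetryPolicyShape | undefined,
  retryCount: number,
  ageMs: number
): RetryDecision {
  if (!policy) return { action: "fail", reason: "unknown_tier" };
  // Also guard against an interval list shorter than maxAttempts.
  if (retryCount >= policy.maxAttempts || retryCount >= policy.intervals.length) {
    return { action: "fail", reason: "max_attempts" };
  }
  if (ageMs > policy.maxDuration) {
    return { action: "fail", reason: "max_duration" };
  }
  return {
    action: "retry",
    attempt: retryCount + 1,
    delayMs: policy.intervals[retryCount] * 60 * 1000,
  };
}
```

With this split, the handler would call `decideRetry(retryPolicies[org.tier], message.retryCount, Date.now() - message.createdAt.getTime())` and then perform the BullMQ, Prisma, and webhook calls based on the result.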

Comparison: Hybrid vs Alternatives

vs Pure MTA Retry (Postfix only)

Requirement	Pure MTA	Hybrid
Messages in queue < 1h	❌ (up to 5 days)	✅ (max 1 hour)
Per-org retry policies	❌	✅
Track retry attempts	❌ (only via log parsing)	✅ (database)
Fire webhooks on retry	❌	✅
Configurable per tier	❌	✅
Privacy compliant	❌ (content in queue)	✅ (1h max)
ISP-specific logic	❌	✅

Verdict: Pure MTA fails on privacy, features, and differentiation.

vs Pure App Retry (No Postfix retry)

Requirement	Pure App	Hybrid
Handle connection issues	Complex	✅ (Postfix handles)
SMTP protocol edge cases	Need to implement	✅ (Postfix handles)
Immediate retries (5min)	Worker overhead	✅ (Postfix handles)
Connection pooling	Need to manage	✅ (Postfix handles)
Standard SMTP behavior	Reinvent wheel	✅ (Postfix provides)
Greylisting support	Need to implement	✅ (Postfix handles)

Verdict: Pure app requires reinventing SMTP retry logic that Postfix provides for free.

Real-World Example

Scenario: Email to Gmail bounces with 450 4.2.1 The user you are trying to contact is receiving mail too quickly

Timeline with Hybrid Approach

0:00 - Worker sends → Postfix gets 450 error
0:05 - Postfix retries → Still 450
0:10 - Postfix retries → Still 450
0:15 - Postfix retries → Still 450
1:00 - Postfix gives up, bounces to app

App receives bounce:

Checks org tier: PRO (72h retry)
Checks retry history: Attempt 0 of 8
Checks reason: 450 (temporary, Gmail throttling)
Parses error: "receiving mail too quickly"
Decision: Gmail throttling detected, schedule retry in 30 minutes
Stores: status: RETRYING, nextRetryAt, retryReason: 'gmail_throttling'
Fires webhook: email.retry_scheduled

1:30 - BullMQ job fires → Worker resends → Success! ✅
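
The "checks reason" and "parses error" steps above can be sketched as a small reply parser. SMTP reply codes in the 4xx range are transient and 5xx are permanent; everything else here is an assumption for illustration, including the `parseSmtpReply` and `looksLikeThrottling` names and the list of throttling phrases.

```typescript
interface SmtpReply {
  code: number;                  // three-digit reply code, e.g. 450
  enhancedStatus: string | null; // enhanced status code such as "4.2.1", if present
  text: string;                  // human-readable remainder of the line
  kind: "transient" | "permanent" | "other";
}

// Parses a single-line reply such as "450 4.2.1 The user ... too quickly".
function parseSmtpReply(line: string): SmtpReply | null {
  const match = /^(\d{3})[ -](?:([245]\.\d{1,3}\.\d{1,3}) )?(.*)$/.exec(line.trim());
  if (!match) return null;
  const code = Number(match[1]);
  const kind =
    code >= 400 && code < 500 ? "transient" :
    code >= 500 && code < 600 ? "permanent" : "other";
  return { code, enhancedStatus: match[2] ?? null, text: match[3], kind };
}

// Assumed phrase list; real providers word their throttling replies differently.
function looksLikeThrottling(reply: SmtpReply): boolean {
  return reply.kind === "transient" && /too quickly|rate limit|try again later/i.test(reply.text);
}
```

For the Gmail reply in this scenario, the parser yields code 450, enhanced status 4.2.1, and kind `transient`, and `looksLikeThrottling` returns true, which is the signal the app would store as `retryReason: 'gmail_throttling'`.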

Benefits:

Postfix caught immediate network issues (first 1 hour)
App made intelligent decision based on error code
Organization got real-time webhook notification
Full visibility in dashboard: "Retrying due to Gmail throttling (attempt 1 of 8)"
Message wasn't sitting in Postfix queue for 72 hours

Competitive Advantages

What the hybrid approach enables:

1. "Smart Retry Logic"

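// Illustrative heuristics; the values could later be tuned from historical data
function smartRetryDelayMs(
  bounceReason: string,
  recipientDomain: string,
  recipientISP: string,
  defaultDelayMs: number
): number {
  if (bounceReason.includes('too quickly') && recipientDomain === 'gmail.com') {
    // Gmail throttling - wait longer
    return 60 * 60 * 1000; // 1 hour
  } else if (bounceReason.includes('greylisted')) {
    // Standard greylisting - retry in 5 minutes
    return 5 * 60 * 1000;
  } else if (recipientISP === 'yahoo') {
    // Yahoo prefers longer intervals
    return 30 * 60 * 1000; // 30 minutes
  }
  // No special case matched: fall back to the tier policy interval
  return defaultDelayMs;
}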

2. "Configurable Retry Policies"

Better than Mailgun (only competitor offering configuration):

Free tier: 24h, 5 attempts
Pro tier: 72h, 8 attempts
Enterprise: 5 days, 12 attempts
Custom: Fully configurable

3. "Real-Time Visibility"

Dashboard shows:

"Retrying in 45 minutes (attempt 3 of 8)"
"Retry reason: Gmail throttling detected"
"Last attempt: 1 hour ago"

Webhooks fire on:

email.retry_scheduled
email.retry_succeeded
email.retry_failed_permanently

4. "Analytics & Insights"

Track and learn:

"72% of soft bounces succeed on retry 2"
"Gmail throttling usually resolves in 30 minutes"
"Yahoo bounces have 95% success rate with 1-hour interval"
"Enterprise tier customers have 85% retry success rate"

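Statistics like the ones above could be derived from stored retry attempts. The sketch below computes a per-attempt success rate; the `RetryOutcome` shape and the `successRateByAttempt` name are assumptions, and the input would in practice come from the retry tracking data described in the checklist.

```typescript
// One record per retry that was actually attempted.
interface RetryOutcome {
  attempt: number;    // 1-based retry number
  succeeded: boolean; // true if this retry delivered the message
}

// For each retry number, the fraction of retries at that number that succeeded.
function successRateByAttempt(outcomes: RetryOutcome[]): Record<number, number> {
  const totals: Record<number, { tried: number; ok: number }> = {};
  for (const outcome of outcomes) {
    const bucket = totals[outcome.attempt] ?? (totals[outcome.attempt] = { tried: 0, ok: 0 });
    bucket.tried += 1;
    if (outcome.succeeded) bucket.ok += 1;
  }
  const rates: Record<number, number> = {};
  for (const key of Object.keys(totals)) {
    const attempt = Number(key);
    rates[attempt] = totals[attempt].ok / totals[attempt].tried;
  }
  return rates;
}
```

Grouping the same records by recipient domain or by tier before calling the function would give the per-ISP and per-tier figures.
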
5. "Compliance-First Architecture"

Marketing message:

"Messages are never stored in our MTA queue for longer than 1 hour. After that, only metadata remains in our database, and email content is kept encrypted in S3 with automatic lifecycle policies per your tier."

Appeals to:

Healthcare (HIPAA)
Finance (SOC 2)
Legal (data retention policies)

Implementation Checklist

Postfix Configuration

Set maximal_queue_lifetime = 1h
Set bounce_queue_lifetime = 1h
Configure retry backoff bounds (minimal_backoff_time = 5m, maximal_backoff_time = 15m)
Set up bounce processing

Application Layer

Create RetryPolicy interface
Implement tier-based retry policies
Create handleSoftBounce() function
Set up BullMQ delayed jobs
Add retry tracking to database
Implement webhook firing on retry events
Add retry status to dashboard

Database Schema

Add retryCount to messages table
Add nextRetryAt timestamp
Add lastRetryError text field
Add retryHistory JSONB field for tracking attempts

Monitoring

Track retry success rates per tier
Monitor queue depth
Alert on high retry failure rates
Dashboard for retry analytics

Future Enhancements

Phase 1: Basic Hybrid (MVP)

Postfix 1-hour retry
App BullMQ retry with tier-based policies
Simple exponential backoff

Phase 2: Smart Retry

ISP-specific retry schedules
Learn from historical bounce patterns
Adaptive retry intervals based on success rates

Phase 3: Machine Learning

Predict optimal retry timing per recipient
Cluster similar bounce patterns
Auto-adjust retry schedules based on deliverability metrics

Phase 4: Advanced Features

Retry A/B testing
Customer-configurable retry policies (Enterprise tier)
Retry simulation mode (test before deploying)

Conclusion

Decision: Hybrid Approach

Postfix: Handle immediate retries (0-1 hour) for standard SMTP issues
Application: Handle long-term retries (1-72+ hours) with intelligence, configuration, and visibility

This approach provides:

✅ Privacy compliance (short Postfix queue time)
✅ Competitive features (configurable policies)
✅ Best of both worlds (SMTP reliability + app intelligence)
✅ Industry standard behavior (72h for Pro tier)
✅ Differentiation (smart retry logic, webhooks, analytics)
✅ Scalability (predictable resource usage)

Alternative approaches considered and rejected:

❌ Pure MTA: Fails privacy requirements, no competitive features
❌ Pure App: Reinvents SMTP wheel, complex to implement correctly
