Retry Implementation Strategy: App vs MTA vs Hybrid

Analysis of where email retry logic should be implemented - at the application layer, MTA layer, or a hybrid approach.

Last Updated: February 2025

Decision: Hybrid Approach (Postfix for immediate, Application for long-term)


The Question

Should Postchi handle email delivery retries at:

  • Application layer (BullMQ/Worker)
  • MTA layer (Postfix)
  • Hybrid approach (Both)

Decision: Hybrid Approach ✅

  • Postfix handles immediate retries (0-1 hour)
  • Application handles long-term retries (1-72+ hours)


Why Hybrid is Best

1. Privacy & Compliance Requirements

From our message storage analysis, we identified:

  • Postfix queues messages in /var/spool/postfix/ with full email content
  • Default Postfix retry: 5 days (unacceptable for GDPR/HIPAA)
  • Our goal: Aggressive cleanup (1 hour max in Postfix queue)

Pure MTA retry would contradict our privacy requirements.

Hybrid solution:

  • Messages stay in Postfix queue max 1 hour
  • After 1 hour, content is only in S3 (encrypted, with lifecycle policies)
  • Application handles remaining retries (metadata only in database)

2. Competitor Analysis Insights

From our competitor research:

| Provider | Implementation |
|---|---|
| SendGrid, Mailgun, SparkPost | Custom MTAs with app-level retry logic |
| AWS SES | "Cannot configure" = app controls everything |
| Mailgun (unique) | Offers configuration (5min-24h window) |

Key insight: Successful ESPs control retry logic at the application layer, not relying on standard MTA behavior.


3. Feature Requirements

Features we need that pure MTA cannot provide:

| Feature | Pure MTA | Hybrid |
|---|---|---|
| Per-organization retry policies | ❌ | ✅ |
| Track individual retry attempts | ❌ (log parsing only) | ✅ (database) |
| Fire webhooks on each retry | ❌ | ✅ |
| Tier-based retry duration | ❌ | ✅ |
| ISP-specific retry logic | ❌ | ✅ |
| Pause org if bounce rate > 10% | ❌ | ✅ |
| Analytics on retry success rates | ❌ | ✅ |
| Custom intervals per tier | ❌ | ✅ |

These are competitive differentiators we'd lose with a pure MTA approach.
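
The "pause org if bounce rate > 10%" rule is an example of logic that needs per-organization counters the MTA does not keep. A minimal sketch, assuming hypothetical `OrgSendStats` counters over some rolling window; the `minSample` guard and its default are illustrative assumptions, not part of the analysis above:

```typescript
// Hypothetical per-organization counters over a rolling window.
interface OrgSendStats {
  sent: number;    // messages attempted in the window
  bounced: number; // messages that bounced in the window
}

// Threshold from the feature table: pause when the bounce rate exceeds 10%.
const BOUNCE_RATE_THRESHOLD = 0.1;

// Returns true when the organization should be paused.
// minSample is an assumed guard so a few bounces on a tiny volume
// do not pause a brand-new organization.
function shouldPauseOrg(stats: OrgSendStats, minSample = 100): boolean {
  if (stats.sent < minSample) return false;
  return stats.bounced / stats.sent > BOUNCE_RATE_THRESHOLD;
}
```

Because the check is a pure function of two counters, it can run in the worker right after each bounce is recorded.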


Implementation Architecture


Phase 1: Postfix Handles Immediate (0-1 hour)

Configuration

# /etc/postfix/main.cf
# Note: main.cf has no trailing comments; text after a value becomes part of the value.
# Max 1 hour in queue
maximal_queue_lifetime = 1h
# Bounce after 1 hour
bounce_queue_lifetime = 1h
# Retry every 5 min
minimal_backoff_time = 5m
# Max retry interval
maximal_backoff_time = 15m
# Check queue every 5 min
queue_run_delay = 5m

What Postfix Handles

  • Connection timeouts
  • Network glitches
  • Temporary DNS issues
  • Recipient server temporarily unavailable
  • Quick ISP throttling

Typical Retry Pattern

0:00 - Initial attempt fails (4xx)
0:05 - Retry 1
0:10 - Retry 2
0:25 - Retry 3
0:40 - Retry 4
1:00 - Final attempt → If still failing, bounce to app

Result: Most temporary issues resolve within 1 hour.


Phase 2: Application Handles Long-Term (1-72+ hours)

BullMQ Configuration

interface RetryPolicy {
  maxDuration: number; // milliseconds
  maxAttempts: number;
  intervals: number[]; // minutes, one entry per attempt
}

// Note: the PRO and ENTERPRISE intervals add up to more than maxDuration
// (about 103.5h and 583.5h respectively), so the later entries are either
// never used or run past the cap, depending on how the cap is enforced.
const retryPolicies: Record<string, RetryPolicy> = {
  FREE: {
    maxDuration: 24 * 60 * 60 * 1000, // 24 hours
    maxAttempts: 5,
    intervals: [30, 60, 120, 240, 480] // 30m, 1h, 2h, 4h, 8h
  },
  PRO: {
    maxDuration: 72 * 60 * 60 * 1000, // 72 hours
    maxAttempts: 8,
    intervals: [30, 60, 120, 240, 480, 960, 1440, 2880] // 30m ... 16h, 24h, 48h
  },
  ENTERPRISE: {
    maxDuration: 120 * 60 * 60 * 1000, // 5 days
    maxAttempts: 12,
    intervals: [30, 60, 120, 240, 480, 960, 1440, 2880, 4320, 5760, 8640, 10080] // 30m ... 6d, 7d
  }
};
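
The interval lists interact with `maxDuration`, so it helps to see which retries actually fit inside the cap. A minimal sketch, assuming app-level retries start after Postfix's 1-hour window and that a retry landing past `maxDuration` is dropped (both are assumptions; `effectiveSchedule` and `POSTFIX_WINDOW_MS` are hypothetical names):

```typescript
// Same shape as the RetryPolicy interface in this section.
interface RetryPolicy {
  maxDuration: number; // milliseconds
  maxAttempts: number;
  intervals: number[]; // minutes
}

// Assumed: the application takes over once Postfix's 1-hour window has passed.
const POSTFIX_WINDOW_MS = 60 * 60 * 1000;

// Returns the message age, in hours, at which each app-level retry would run,
// keeping only the retries that fall within maxDuration.
function effectiveSchedule(policy: RetryPolicy, startMs = POSTFIX_WINDOW_MS): number[] {
  const hours: number[] = [];
  let ageMs = startMs;
  for (const minutes of policy.intervals.slice(0, policy.maxAttempts)) {
    ageMs += minutes * 60 * 1000;
    if (ageMs > policy.maxDuration) break; // this retry would land past the cap
    hours.push(ageMs / 3600000);
  }
  return hours;
}

const pro: RetryPolicy = {
  maxDuration: 72 * 60 * 60 * 1000,
  maxAttempts: 8,
  intervals: [30, 60, 120, 240, 480, 960, 1440, 2880],
};

const proSchedule = effectiveSchedule(pro);
// proSchedule is [1.5, 2.5, 4.5, 8.5, 16.5, 32.5, 56.5]
```

Under these assumptions only 7 of the 8 PRO retries fit inside 72 hours, because the eighth would run at 104.5 hours. Whether to shorten the intervals or let the final retry exceed the cap is a policy decision this sketch does not make.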

Handler Implementation

// When Postfix bounces back after 1 hour.
// bounceMessage is the SMTP error text taken from the bounce.
async function handleSoftBounce(message: Message, bounceMessage: string) {
  const org = await getOrganization(message.orgId);
  const policy = retryPolicies[org.tier];

  // Check if we should retry
  if (message.retryCount >= policy.maxAttempts) {
    return markAsFailedPermanently(message);
  }

  // Check if max duration exceeded.
  // Note: this checks the message age now, not the age at the scheduled retry.
  const timeSinceCreation = Date.now() - message.createdAt.getTime();
  if (timeSinceCreation > policy.maxDuration) {
    return markAsFailedPermanently(message);
  }

  // Calculate the next retry time once so the database and webhook agree
  const attempt = message.retryCount + 1;
  const nextInterval = policy.intervals[message.retryCount];
  const delay = nextInterval * 60 * 1000; // Convert minutes to ms
  const nextRetryAt = new Date(Date.now() + delay);

  // Add to BullMQ with delay
  await emailQueue.add('retry-email', {
    messageId: message.id,
    attempt
  }, {
    delay,
    jobId: `retry-${message.id}-${message.retryCount}`,
    removeOnComplete: true,
    removeOnFail: false
  });

  // Update status
  await prisma.message.update({
    where: { id: message.id },
    data: {
      status: 'RETRYING',
      retryCount: attempt,
      nextRetryAt,
      lastRetryError: bounceMessage
    }
  });

  // Fire webhook
  await fireWebhook(org, {
    event: 'email.retry_scheduled',
    messageId: message.id,
    attempt,
    nextRetryAt,
    reason: 'soft_bounce'
  });
}

Comparison: Hybrid vs Alternatives

vs Pure MTA Retry (Postfix only)

| Requirement | Pure MTA | Hybrid |
|---|---|---|
| Messages in queue < 1h | ❌ (up to 5 days) | ✅ (max 1 hour) |
| Per-org retry policies | ❌ | ✅ |
| Track retry attempts | ❌ (only via log parsing) | ✅ (database) |
| Fire webhooks on retry | ❌ | ✅ |
| Configurable per tier | ❌ | ✅ |
| Privacy compliant | ❌ (content in queue) | ✅ (1h max) |
| ISP-specific logic | ❌ | ✅ |
Verdict: Pure MTA fails on privacy, features, and differentiation.


vs Pure App Retry (No Postfix retry)

| Requirement | Pure App | Hybrid |
|---|---|---|
| Handle connection issues | Complex | ✅ (Postfix handles) |
| SMTP protocol edge cases | Need to implement | ✅ (Postfix handles) |
| Immediate retries (5min) | Worker overhead | ✅ (Postfix handles) |
| Connection pooling | Need to manage | ✅ (Postfix handles) |
| Standard SMTP behavior | Reinvent wheel | ✅ (Postfix provides) |
| Greylisting support | Need to implement | ✅ (Postfix handles) |

Verdict: Pure app requires reinventing SMTP retry logic that Postfix provides for free.


Real-World Example

Scenario: Email to Gmail bounces with 450 4.2.1 The user you are trying to contact is receiving mail too quickly

Timeline with Hybrid Approach

0:00 - Worker sends → Postfix gets 450 error
0:05 - Postfix retries → Still 450
0:10 - Postfix retries → Still 450
0:15 - Postfix retries → Still 450
1:00 - Postfix gives up, bounces to app

App receives bounce:

  • Checks org tier: PRO (72h retry)
  • Checks retry history: Attempt 0 of 8
  • Checks reason: 450 (temporary, Gmail throttling)
  • Parses error: "receiving mail too quickly"
  • Decision: Gmail throttling detected, schedule retry in 30 minutes
  • Stores: status: RETRYING, nextRetryAt, retryReason: 'gmail_throttling'
  • Fires webhook: email.retry_scheduled

1:30 - BullMQ job fires → Worker resends → Success! ✅
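
The "checks reason" and "parses error" steps in this scenario could be sketched as follows. This is a minimal illustration, not the actual Postchi parser: `parseSmtpReply` and `classifyReason` are hypothetical names, and the only wording it recognizes is the Gmail text quoted in the scenario.

```typescript
// Parsed pieces of an SMTP reply line such as
// "450 4.2.1 The user you are trying to contact is receiving mail too quickly".
interface SmtpReply {
  code: number;            // three-digit reply code, e.g. 450
  enhanced: string | null; // enhanced status code, e.g. "4.2.1", if present
  text: string;            // remaining human-readable text
  temporary: boolean;      // true for 4xx replies, which are retryable
}

// Returns null when the line does not start with a three-digit reply code.
function parseSmtpReply(line: string): SmtpReply | null {
  const match = /^(\d{3})[ -](?:(\d\.\d{1,3}\.\d{1,3}) )?(.*)$/.exec(line.trim());
  if (!match) return null;
  const code = Number(match[1]);
  return {
    code,
    enhanced: match[2] ?? null,
    text: match[3],
    temporary: code >= 400 && code < 500,
  };
}

// Hypothetical label for the retryReason field. Intended for failed
// deliveries only, so anything that is not 4xx is treated as a hard bounce.
function classifyReason(reply: SmtpReply, recipientDomain: string): string {
  const throttled = /receiving mail too quickly/i.test(reply.text);
  if (reply.temporary && recipientDomain === 'gmail.com' && throttled) {
    return 'gmail_throttling';
  }
  return reply.temporary ? 'soft_bounce' : 'hard_bounce';
}
```

Matching on provider wording is fragile, so a production version would likely key on the reply code and enhanced status first and use the text only as a hint.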

Benefits:

  • Postfix caught immediate network issues (first 1 hour)
  • App made intelligent decision based on error code
  • Organization got real-time webhook notification
  • Full visibility in dashboard: "Retrying due to Gmail throttling (attempt 1 of 8)"
  • Message wasn't sitting in Postfix queue for 72 hours

Competitive Advantages

What the hybrid approach enables:

1. "Smart Retry Logic"

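// Learn from historical data.
// Returns the delay in ms before the next retry; defaultDelay is the tier's scheduled interval.
function smartRetryDelay(bounceReason: string, recipientDomain: string, recipientISP: string, defaultDelay: number): number {
  if (bounceReason.includes('too quickly') && recipientDomain === 'gmail.com') {
    // Gmail throttling - wait longer
    return 60 * 60 * 1000; // 1 hour
  }
  if (bounceReason.includes('greylisted')) {
    // Standard greylisting - retry in 5 minutes
    return 5 * 60 * 1000;
  }
  if (recipientISP === 'yahoo') {
    // Yahoo prefers longer intervals
    return 30 * 60 * 1000; // 30 minutes
  }
  // No special case matched - keep the tier's scheduled interval
  return defaultDelay;
}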

2. "Configurable Retry Policies"

Better than Mailgun (only competitor offering configuration):

  • Free tier: 24h, 5 attempts
  • Pro tier: 72h, 8 attempts
  • Enterprise: 5 days, 12 attempts
  • Custom: Fully configurable

3. "Real-Time Visibility"

Dashboard shows:

  • "Retrying in 45 minutes (attempt 3 of 8)"
  • "Retry reason: Gmail throttling detected"
  • "Last attempt: 1 hour ago"
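
These dashboard strings can be derived from the retry fields already stored on the message. A minimal sketch, assuming a hypothetical `formatRetryStatus` helper that receives `nextRetryAt`, the current attempt, and the tier's `maxAttempts`:

```typescript
// Builds a status line such as "Retrying in 45 minutes (attempt 3 of 8)".
// `now` is injectable so the function is easy to test.
function formatRetryStatus(
  nextRetryAt: Date,
  attempt: number,
  maxAttempts: number,
  now: Date = new Date()
): string {
  const msLeft = nextRetryAt.getTime() - now.getTime();
  const minutes = Math.max(0, Math.round(msLeft / 60000)); // never show a negative wait
  const unit = minutes === 1 ? 'minute' : 'minutes';
  return `Retrying in ${minutes} ${unit} (attempt ${attempt} of ${maxAttempts})`;
}
```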

Webhooks fire on:

  • email.retry_scheduled
  • email.retry_succeeded
  • email.retry_failed_permanently
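
A typed payload makes these three events easier for customers to consume. The sketch below reuses the `email.retry_scheduled` fields fired in handleSoftBounce; the shapes of the other two events are assumptions, since only their names are given here.

```typescript
// Payload for email.retry_scheduled, using the fields fired in handleSoftBounce.
interface RetryScheduledEvent {
  event: 'email.retry_scheduled';
  messageId: string;
  attempt: number;
  nextRetryAt: string; // ISO 8601 timestamp
  reason: string;
}

// Assumed shape: only the event name comes from the list above.
interface RetrySucceededEvent {
  event: 'email.retry_succeeded';
  messageId: string;
  attempt: number;
}

// Assumed shape: only the event name comes from the list above.
interface RetryFailedPermanentlyEvent {
  event: 'email.retry_failed_permanently';
  messageId: string;
  attempts: number;
  lastError: string;
}

type RetryWebhookEvent =
  | RetryScheduledEvent
  | RetrySucceededEvent
  | RetryFailedPermanentlyEvent;

// Example consumer: turn an event into a one-line log message.
function describeEvent(e: RetryWebhookEvent): string {
  switch (e.event) {
    case 'email.retry_scheduled':
      return `${e.messageId}: attempt ${e.attempt} scheduled for ${e.nextRetryAt} (${e.reason})`;
    case 'email.retry_succeeded':
      return `${e.messageId}: delivered on attempt ${e.attempt}`;
    case 'email.retry_failed_permanently':
      return `${e.messageId}: gave up after ${e.attempts} attempts (${e.lastError})`;
  }
}
```

Using the `event` field as a discriminant lets the compiler check that every event type is handled.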

4. "Analytics & Insights"

Track and learn:

  • "72% of soft bounces succeed on retry 2"
  • "Gmail throttling usually resolves in 30 minutes"
  • "Yahoo bounces have 95% success rate with 1-hour interval"
  • "Enterprise tier customers have 85% retry success rate"
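
The figures above are examples of what such insights could look like. One way a per-attempt success rate might be computed from stored retry history is sketched below; the `RetryAttempt` shape and `successRateByAttempt` name are hypothetical.

```typescript
// Hypothetical record of one retry attempt for a message.
interface RetryAttempt {
  attempt: number;    // 1-based retry number
  succeeded: boolean; // true if this attempt delivered the message
}

// For each retry number N: of the messages that reached retry N,
// the fraction that was delivered on that retry.
function successRateByAttempt(histories: RetryAttempt[][]): Record<number, number> {
  const tried: Record<number, number> = {};
  const delivered: Record<number, number> = {};
  for (const history of histories) {
    for (const { attempt, succeeded } of history) {
      tried[attempt] = (tried[attempt] ?? 0) + 1;
      if (succeeded) delivered[attempt] = (delivered[attempt] ?? 0) + 1;
    }
  }
  const rates: Record<number, number> = {};
  for (const key of Object.keys(tried)) {
    const attempt = Number(key);
    rates[attempt] = (delivered[attempt] ?? 0) / tried[attempt];
  }
  return rates;
}
```

The same grouping could be keyed by recipient domain or tier instead of attempt number to produce the ISP and tier statistics.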

5. "Compliance-First Architecture"

Marketing message:

"Messages are never stored in our MTA queue for longer than 1 hour. After that, our database holds only metadata, and email content lives encrypted in S3 under automatic lifecycle policies for your tier."

Appeals to:

  • Healthcare (HIPAA)
  • Finance (SOC 2)
  • Legal (data retention policies)

Implementation Checklist

Postfix Configuration

  • Set maximal_queue_lifetime = 1h
  • Set bounce_queue_lifetime = 1h
  • Configure retry backoff (minimal_backoff_time = 5m, maximal_backoff_time = 15m)
  • Set up bounce processing

Application Layer

  • Create RetryPolicy interface
  • Implement tier-based retry policies
  • Create handleSoftBounce() function
  • Set up BullMQ delayed jobs
  • Add retry tracking to database
  • Implement webhook firing on retry events
  • Add retry status to dashboard

Database Schema

  • Add retryCount to messages table
  • Add nextRetryAt timestamp
  • Add lastRetryError text field
  • Add retryHistory JSONB field for tracking attempts
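
As a rough picture of how these four fields fit together in application code, here is a sketch. The entry shape inside `retryHistory` and the `appendRetryHistory` helper are assumptions; the checklist only names the fields.

```typescript
// One entry in the retryHistory JSONB array (shape is an assumption).
interface RetryHistoryEntry {
  attempt: number;      // 1-based retry number
  at: string;           // ISO 8601 time of the attempt
  error: string | null; // SMTP error text, or null on success
}

// The four checklist fields as they might appear on a message row.
interface MessageRetryState {
  retryCount: number;
  nextRetryAt: Date | null;
  lastRetryError: string | null;
  retryHistory: RetryHistoryEntry[];
}

// Returns a new state with one attempt appended to the history.
// The input is not mutated, and retryCount is left to the retry scheduler.
function appendRetryHistory(
  state: MessageRetryState,
  attempt: number,
  at: Date,
  error: string | null
): MessageRetryState {
  const entry: RetryHistoryEntry = { attempt, at: at.toISOString(), error };
  return {
    retryCount: state.retryCount,
    nextRetryAt: state.nextRetryAt,
    lastRetryError: error ?? state.lastRetryError,
    retryHistory: state.retryHistory.concat([entry]),
  };
}
```

Keeping the full history in one JSONB column avoids a separate attempts table for the MVP, at the cost of less convenient querying for analytics.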

Monitoring

  • Track retry success rates per tier
  • Monitor queue depth
  • Alert on high retry failure rates
  • Dashboard for retry analytics

Future Enhancements

Phase 1: Basic Hybrid (MVP)

  • Postfix 1-hour retry
  • App BullMQ retry with tier-based policies
  • Simple exponential backoff

Phase 2: Smart Retry

  • ISP-specific retry schedules
  • Learn from historical bounce patterns
  • Adaptive retry intervals based on success rates

Phase 3: Machine Learning

  • Predict optimal retry timing per recipient
  • Cluster similar bounce patterns
  • Auto-adjust retry schedules based on deliverability metrics

Phase 4: Advanced Features

  • Retry A/B testing
  • Customer-configurable retry policies (Enterprise tier)
  • Retry simulation mode (test before deploying)

Conclusion

Decision: Hybrid Approach

  • Postfix: Handle immediate retries (0-1 hour) for standard SMTP issues
  • Application: Handle long-term retries (1-72+ hours) with intelligence, configuration, and visibility

This approach provides:

  • ✅ Privacy compliance (short Postfix queue time)
  • ✅ Competitive features (configurable policies)
  • ✅ Best of both worlds (SMTP reliability + app intelligence)
  • ✅ Industry standard behavior (72h for Pro tier)
  • ✅ Differentiation (smart retry logic, webhooks, analytics)
  • ✅ Scalability (predictable resource usage)

Alternative approaches considered and rejected:

  • ❌ Pure MTA: Fails privacy requirements, no competitive features
  • ❌ Pure App: Reinvents SMTP wheel, complex to implement correctly