Retry Implementation Strategy: App vs MTA vs Hybrid
Analysis of where email retry logic should be implemented - at the application layer, MTA layer, or a hybrid approach.
Last Updated: February 2025
Decision: Hybrid Approach (Postfix for immediate, Application for long-term)
The Question
Should Postchi handle email delivery retries at:
- Application layer (BullMQ/Worker)
- MTA layer (Postfix)
- Hybrid approach (Both)
Decision: Hybrid Approach ✅
Postfix handles immediate retries (0-1 hour).
Application handles long-term retries (1-72+ hours).
Why Hybrid is Best
1. Privacy & Compliance Requirements
From our message storage analysis, we identified:
- Postfix queues messages in /var/spool/postfix/ with full email content
- Default Postfix retry: 5 days (unacceptable for GDPR/HIPAA)
- Our goal: Aggressive cleanup (1 hour max in Postfix queue)
Pure MTA retry would contradict our privacy requirements.
Hybrid solution:
- Messages stay in Postfix queue max 1 hour
- After 1 hour, content is only in S3 (encrypted, with lifecycle policies)
- Application handles remaining retries (metadata only in database)
2. Competitor Analysis Insights
From our competitor research:
| Provider | Implementation |
|---|---|
| SendGrid, Mailgun, SparkPost | Custom MTAs with app-level retry logic |
| AWS SES | "Cannot configure" = app controls everything |
| Mailgun (unique) | Offers configuration (5min-24h window) |
Key insight: Successful ESPs control retry logic at the application layer, not relying on standard MTA behavior.
3. Feature Requirements
Features we need that pure MTA cannot provide:
| Feature | Pure MTA | Hybrid |
|---|---|---|
| Per-organization retry policies | ❌ | ✅ |
| Track individual retry attempts | ❌ (log parsing only) | ✅ (database) |
| Fire webhooks on each retry | ❌ | ✅ |
| Tier-based retry duration | ❌ | ✅ |
| ISP-specific retry logic | ❌ | ✅ |
| Pause org if bounce rate > 10% | ❌ | ✅ |
| Analytics on retry success rates | ❌ | ✅ |
| Custom intervals per tier | ❌ | ✅ |
These are competitive differentiators we'd lose with a pure MTA approach.
Implementation Architecture
Phase 1: Postfix Handles Immediate (0-1 hour)
Configuration
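# /etc/postfix/main.cf
# main.cf does not support trailing comments: text after a value is read as
# part of the value, so every comment sits on its own line.
# Max 1 hour in queue
maximal_queue_lifetime = 1h
# Bounce after 1 hour
bounce_queue_lifetime = 1h
# Retry every 5 min
minimal_backoff_time = 5m
# Max retry interval
maximal_backoff_time = 15m
# Check queue every 5 min
queue_run_delay = 5m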
What Postfix Handles
- Connection timeouts
- Network glitches
- Temporary DNS issues
- Recipient server temporarily unavailable
- Quick ISP throttling
Typical Retry Pattern
0:00 - Initial attempt fails (4xx)
0:05 - Retry 1
0:10 - Retry 2
0:25 - Retry 3
0:40 - Retry 4
1:00 - Final attempt → If still failing, bounce to app
Result: Most temporary issues resolve within 1 hour.
Phase 2: Application Handles Long-Term (1-72+ hours)
BullMQ Configuration
interface RetryPolicy {
  maxDuration: number; // total retry window, in milliseconds
  maxAttempts: number; // application-level retries (after the Postfix phase)
  intervals: number[]; // delay before each retry, in minutes
}

const retryPolicies: Record<string, RetryPolicy> = {
  FREE: {
    maxDuration: 24 * 60 * 60 * 1000, // 24 hours
    maxAttempts: 5,
    intervals: [30, 60, 120, 240, 480] // 30m, 1h, 2h, 4h, 8h
  },
  PRO: {
    maxDuration: 72 * 60 * 60 * 1000, // 72 hours
    maxAttempts: 8,
    // 30m up to 48h; the intervals sum to 103.5h, so the last retry falls outside the 72h window
    intervals: [30, 60, 120, 240, 480, 960, 1440, 2880]
  },
  ENTERPRISE: {
    maxDuration: 120 * 60 * 60 * 1000, // 5 days
    maxAttempts: 12,
    // 30m up to 7d; the intervals sum to about 24 days, well past the 5-day window
    intervals: [30, 60, 120, 240, 480, 960, 1440, 2880, 4320, 5760, 8640, 10080]
  }
};
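One way to sanity-check these policies is to add up the intervals and see which retries would actually fire inside each tier's maxDuration. The sketch below assumes the clock starts at message creation, that the Postfix phase uses the first 60 minutes, and that a retry only counts if it fires before maxDuration elapses. `retryOffsets` and `attemptsWithinWindow` are hypothetical helper names, not part of the existing codebase.

```typescript
interface RetryPolicy {
  maxDuration: number; // milliseconds
  maxAttempts: number;
  intervals: number[]; // minutes
}

type Tier = 'FREE' | 'PRO' | 'ENTERPRISE';

// Same values as the policies in this document.
const retryPolicies: Record<Tier, RetryPolicy> = {
  FREE: {
    maxDuration: 24 * 60 * 60 * 1000,
    maxAttempts: 5,
    intervals: [30, 60, 120, 240, 480],
  },
  PRO: {
    maxDuration: 72 * 60 * 60 * 1000,
    maxAttempts: 8,
    intervals: [30, 60, 120, 240, 480, 960, 1440, 2880],
  },
  ENTERPRISE: {
    maxDuration: 120 * 60 * 60 * 1000,
    maxAttempts: 12,
    intervals: [30, 60, 120, 240, 480, 960, 1440, 2880, 4320, 5760, 8640, 10080],
  },
};

// Cumulative minutes after hand-off at which each app-level retry would fire.
function retryOffsets(policy: RetryPolicy): number[] {
  let elapsed = 0;
  return policy.intervals.slice(0, policy.maxAttempts).map((minutes) => {
    elapsed += minutes;
    return elapsed;
  });
}

// Number of retries that fire before maxDuration, counted from message creation.
// handoffMinutes is the assumed time already spent in the Postfix phase.
function attemptsWithinWindow(policy: RetryPolicy, handoffMinutes = 60): number {
  const windowMinutes = policy.maxDuration / 60000;
  return retryOffsets(policy).filter(
    (offset) => handoffMinutes + offset <= windowMinutes
  ).length;
}

for (const [tier, policy] of Object.entries(retryPolicies)) {
  console.log(`${tier}: ${attemptsWithinWindow(policy)} of ${policy.maxAttempts} retries fit`);
}
// FREE: 5 of 5 retries fit
// PRO: 7 of 8 retries fit
// ENTERPRISE: 8 of 12 retries fit
```

Under these assumptions the FREE schedule fits entirely, while PRO and ENTERPRISE list intervals whose cumulative delay runs past their windows, so either the later intervals or maxDuration may need adjusting.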
Handler Implementation
// When Postfix bounces back after 1 hour.
// bounceMessage is the SMTP error text reported in the bounce (e.g. "450 4.2.1 ...").
async function handleSoftBounce(message: Message, bounceMessage: string) {
  const org = await getOrganization(message.orgId);
  const policy = retryPolicies[org.tier];

  // Check if we should retry
  if (message.retryCount >= policy.maxAttempts) {
    return markAsFailedPermanently(message);
  }

  // Calculate next retry time once, so the queue, database, and webhook agree
  const attempt = message.retryCount + 1;
  const nextInterval = policy.intervals[message.retryCount];
  const delay = nextInterval * 60 * 1000; // Convert to ms
  const nextRetryAt = new Date(Date.now() + delay);

  // Check if max duration is exceeded, including a retry that would fire past it
  const deadline = message.createdAt.getTime() + policy.maxDuration;
  if (nextRetryAt.getTime() > deadline) {
    return markAsFailedPermanently(message);
  }

  // Add to BullMQ with delay
  await emailQueue.add('retry-email', {
    messageId: message.id,
    attempt
  }, {
    delay,
    jobId: `retry-${message.id}-${message.retryCount}`,
    removeOnComplete: true,
    removeOnFail: false
  });

  // Update status
  await prisma.message.update({
    where: { id: message.id },
    data: {
      status: 'RETRYING',
      retryCount: attempt,
      nextRetryAt,
      lastRetryError: bounceMessage
    }
  });

  // Fire webhook
  await fireWebhook(org, {
    event: 'email.retry_scheduled',
    messageId: message.id,
    attempt,
    nextRetryAt,
    reason: 'soft_bounce'
  });
}
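The retry-or-give-up decision in the handler can also be pulled out into a pure function, which makes it testable without BullMQ, Prisma, or a clock. This is a minimal sketch, not the project's implementation: `decideRetry`, `RetryState`, and `RetryDecision` are hypothetical names, and it assumes a retry that would fire after maxDuration counts as exhausted.

```typescript
interface RetryPolicy {
  maxDuration: number; // milliseconds
  maxAttempts: number;
  intervals: number[]; // minutes
}

// The subset of message fields the decision needs (hypothetical shape).
interface RetryState {
  retryCount: number;
  createdAt: Date;
}

type RetryDecision =
  | { action: 'retry'; attempt: number; delayMs: number; nextRetryAt: Date }
  | { action: 'fail'; reason: 'max_attempts' | 'max_duration' | 'no_interval' };

function decideRetry(
  state: RetryState,
  policy: RetryPolicy,
  now: Date = new Date()
): RetryDecision {
  if (state.retryCount >= policy.maxAttempts) {
    return { action: 'fail', reason: 'max_attempts' };
  }

  // Guard against an intervals list shorter than maxAttempts.
  const minutes = policy.intervals[state.retryCount];
  if (minutes === undefined) {
    return { action: 'fail', reason: 'no_interval' };
  }

  const delayMs = minutes * 60 * 1000;
  const nextRetryAt = new Date(now.getTime() + delayMs);

  // Assumption: a retry that would fire after the window closes is not scheduled.
  const deadline = state.createdAt.getTime() + policy.maxDuration;
  if (nextRetryAt.getTime() > deadline) {
    return { action: 'fail', reason: 'max_duration' };
  }

  return { action: 'retry', attempt: state.retryCount + 1, delayMs, nextRetryAt };
}

// Usage: a PRO message handed over by Postfix one hour after creation.
const proPolicy: RetryPolicy = {
  maxDuration: 72 * 60 * 60 * 1000,
  maxAttempts: 8,
  intervals: [30, 60, 120, 240, 480, 960, 1440, 2880],
};
const createdAt = new Date('2025-02-01T00:00:00Z');
const handoff = new Date('2025-02-01T01:00:00Z');
console.log(decideRetry({ retryCount: 0, createdAt }, proPolicy, handoff));
```

Passing `now` explicitly keeps the function deterministic; the handler would call it and then perform the queue, database, and webhook side effects based on the returned action.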
Comparison: Hybrid vs Alternatives
vs Pure MTA Retry (Postfix only)
| Requirement | Pure MTA | Hybrid |
|---|---|---|
| Messages in queue < 1h | ❌ (up to 5 days) | ✅ (max 1 hour) |
| Per-org retry policies | ❌ | ✅ |
| Track retry attempts | ❌ (only via log parsing) | ✅ (database) |
| Fire webhooks on retry | ❌ | ✅ |
| Configurable per tier | ❌ | ✅ |
| Privacy compliant | ❌ (content in queue) | ✅ (1h max) |
| ISP-specific logic | ❌ | ✅ |
Verdict: Pure MTA fails on privacy, features, and differentiation.
vs Pure App Retry (No Postfix retry)
| Requirement | Pure App | Hybrid |
|---|---|---|
| Handle connection issues | Complex | ✅ (Postfix handles) |
| SMTP protocol edge cases | Need to implement | ✅ (Postfix handles) |
| Immediate retries (5min) | Worker overhead | ✅ (Postfix handles) |
| Connection pooling | Need to manage | ✅ (Postfix handles) |
| Standard SMTP behavior | Reinvent wheel | ✅ (Postfix provides) |
| Greylisting support | Need to implement | ✅ (Postfix handles) |
Verdict: Pure app requires reinventing SMTP retry logic that Postfix provides for free.
Real-World Example
Scenario: Email to Gmail bounces with 450 4.2.1 The user you are trying to contact is receiving mail too quickly
Timeline with Hybrid Approach
0:00 - Worker sends → Postfix gets 450 error
0:05 - Postfix retries → Still 450
0:10 - Postfix retries → Still 450
0:15 - Postfix retries → Still 450
1:00 - Postfix gives up, bounces to app
App receives bounce:
- Checks org tier: PRO (72h retry)
- Checks retry history: Attempt 0 of 8
- Checks reason: 450 (temporary, Gmail throttling)
- Parses error: "receiving mail too quickly"
- Decision: Gmail throttling detected, schedule retry in 30 minutes
- Stores: status: RETRYING, nextRetryAt, retryReason: 'gmail_throttling'
- Fires webhook: email.retry_scheduled
1:30 - BullMQ job fires → Worker resends → Success! ✅
Benefits:
- Postfix caught immediate network issues (first 1 hour)
- App made intelligent decision based on error code
- Organization got real-time webhook notification
- Full visibility in dashboard: "Retrying due to Gmail throttling (attempt 1 of 8)"
- Message wasn't sitting in Postfix queue for 72 hours
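The "checks reason" and "parses error" steps in the scenario above could be sketched as follows. This is an illustration under assumptions: `parseSmtpReply` and `classifyRetryReason` are hypothetical helpers, the parser handles a single-line reply only, and the reason labels other than `gmail_throttling` and `soft_bounce` are invented for the example.

```typescript
interface SmtpReply {
  code: number;            // basic reply code, e.g. 450
  enhanced: string | null; // enhanced status code, e.g. "4.2.1"
  text: string;            // human-readable remainder
  kind: 'temporary' | 'permanent' | 'other';
}

// Parse a single-line SMTP reply such as "450 4.2.1 The user ...".
function parseSmtpReply(line: string): SmtpReply | null {
  const match = /^(\d{3})[ -](?:([245]\.\d{1,3}\.\d{1,3})\s+)?(.*)$/.exec(line.trim());
  if (!match) return null;

  const code = Number(match[1]);
  let kind: SmtpReply['kind'] = 'other';
  if (code >= 400 && code < 500) kind = 'temporary'; // 4xx: try again later
  else if (code >= 500 && code < 600) kind = 'permanent'; // 5xx: do not retry

  return { code, enhanced: match[2] ?? null, text: match[3] ?? '', kind };
}

// Map a parsed reply to a retry reason (labels are assumptions for this sketch).
function classifyRetryReason(reply: SmtpReply, recipientDomain: string): string {
  if (
    reply.kind === 'temporary' &&
    recipientDomain.toLowerCase() === 'gmail.com' &&
    /too quickly/i.test(reply.text)
  ) {
    return 'gmail_throttling';
  }
  if (reply.kind === 'temporary') return 'soft_bounce';
  if (reply.kind === 'permanent') return 'hard_bounce';
  return 'unknown';
}

const reply = parseSmtpReply(
  '450 4.2.1 The user you are trying to contact is receiving mail too quickly'
);
if (reply) {
  console.log(reply.code, reply.enhanced, classifyRetryReason(reply, 'gmail.com'));
  // 450 4.2.1 gmail_throttling
}
```

Matching on free-text phrases is fragile because providers change their wording, so the basic and enhanced status codes are the more dependable signal.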
Competitive Advantages
What the hybrid approach enables:
1. "Smart Retry Logic"
// Learn from historical data
// nextRetry is in milliseconds; undefined means "use the tier policy interval"
let nextRetry: number | undefined;
if (bounceReason.includes('too quickly') && recipientDomain === 'gmail.com') {
  // Gmail throttling - wait longer
  nextRetry = 60 * 60 * 1000; // 1 hour
} else if (bounceReason.includes('greylisted')) {
  // Standard greylisting - retry in 5 minutes
  nextRetry = 5 * 60 * 1000;
} else if (recipientISP === 'yahoo') {
  // Yahoo prefers longer intervals
  nextRetry = 30 * 60 * 1000; // 30 minutes
}
2. "Configurable Retry Policies"
Better than Mailgun (only competitor offering configuration):
- Free tier: 24h, 5 attempts
- Pro tier: 72h, 8 attempts
- Enterprise: 5 days, 12 attempts
- Custom: Fully configurable
3. "Real-Time Visibility"
Dashboard shows:
- "Retrying in 45 minutes (attempt 3 of 8)"
- "Retry reason: Gmail throttling detected"
- "Last attempt: 1 hour ago"
Webhooks fire on:
- email.retry_scheduled
- email.retry_succeeded
- email.retry_failed_permanently
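The dashboard strings listed above can be derived from the stored `nextRetryAt` and `retryCount` fields. A small sketch, assuming a hypothetical `formatRetryStatus` helper and minute-level rounding:

```typescript
// Build the dashboard line for a message that is waiting on its next retry.
// `now` is injectable so the output is deterministic in tests.
function formatRetryStatus(
  nextRetryAt: Date,
  attempt: number,
  maxAttempts: number,
  now: Date = new Date()
): string {
  const msLeft = nextRetryAt.getTime() - now.getTime();
  // Round up so "44m 10s left" reads as 45 minutes; never show a negative value.
  const minutes = Math.max(0, Math.ceil(msLeft / 60000));
  const unit = minutes === 1 ? 'minute' : 'minutes';
  return `Retrying in ${minutes} ${unit} (attempt ${attempt} of ${maxAttempts})`;
}

console.log(
  formatRetryStatus(
    new Date('2025-02-01T12:45:00Z'),
    3,
    8,
    new Date('2025-02-01T12:00:00Z')
  )
);
// Retrying in 45 minutes (attempt 3 of 8)
```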
4. "Analytics & Insights"
Track and learn:
- "72% of soft bounces succeed on retry 2"
- "Gmail throttling usually resolves in 30 minutes"
- "Yahoo bounces have 95% success rate with 1-hour interval"
- "Enterprise tier customers have 85% retry success rate"
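Figures like the ones above would come from aggregating recorded retry outcomes. The sketch below shows one possible aggregation, success rate per attempt number. `RetryOutcome` and `successRateByAttempt` are hypothetical names, and the sample data is made up purely to exercise the function.

```typescript
// One recorded retry: which attempt it was and whether delivery succeeded.
interface RetryOutcome {
  attempt: number;
  succeeded: boolean;
}

// Returns a map of attempt number -> fraction of retries at that attempt that succeeded.
function successRateByAttempt(outcomes: RetryOutcome[]): Map<number, number> {
  const totals = new Map<number, { ok: number; all: number }>();
  for (const { attempt, succeeded } of outcomes) {
    const entry = totals.get(attempt) ?? { ok: 0, all: 0 };
    entry.all += 1;
    if (succeeded) entry.ok += 1;
    totals.set(attempt, entry);
  }

  const rates = new Map<number, number>();
  totals.forEach(({ ok, all }, attempt) => {
    rates.set(attempt, ok / all);
  });
  return rates;
}

// Made-up sample: attempt 1 succeeds 1 of 4 times, attempt 2 succeeds 3 of 4 times.
const sample: RetryOutcome[] = [
  { attempt: 1, succeeded: false },
  { attempt: 1, succeeded: false },
  { attempt: 1, succeeded: true },
  { attempt: 1, succeeded: false },
  { attempt: 2, succeeded: true },
  { attempt: 2, succeeded: true },
  { attempt: 2, succeeded: false },
  { attempt: 2, succeeded: true },
];

successRateByAttempt(sample).forEach((rate, attempt) => {
  console.log(`attempt ${attempt}: ${(rate * 100).toFixed(0)}% success`);
});
// attempt 1: 25% success
// attempt 2: 75% success
```

The same grouping could be keyed by tier, recipient domain, or bounce reason to produce the per-ISP and per-tier statistics.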
5. "Compliance-First Architecture"
Marketing message:
"Messages are never stored in our MTA queue for longer than 1 hour. After that, email content lives only in encrypted S3 storage with automatic lifecycle policies per your tier, and our database keeps metadata only."
Appeals to:
- Healthcare (HIPAA)
- Finance (SOC 2)
- Legal (data retention policies)
Implementation Checklist
Postfix Configuration
- Set maximal_queue_lifetime = 1h
- Set bounce_queue_lifetime = 1h
- Configure exponential backoff (5m, 10m, 15m)
- Set up bounce processing
Application Layer
- Create RetryPolicy interface
- Implement tier-based retry policies
- Create handleSoftBounce() function
- Set up BullMQ delayed jobs
- Add retry tracking to database
- Implement webhook firing on retry events
- Add retry status to dashboard
Database Schema
- Add retryCount to messages table
- Add nextRetryAt timestamp
- Add lastRetryError text field
- Add retryHistory JSONB field for tracking attempts
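The shape of the retryHistory JSONB field is not specified here. One possible shape, shown only as a sketch, is an append-only array with one entry per attempt; every field name below is an assumption chosen for illustration.

```typescript
// One entry per delivery attempt, stored as JSON in the retryHistory column.
interface RetryHistoryEntry {
  attempt: number;
  attemptedAt: string;     // ISO 8601 string, since JSON has no date type
  smtpCode: number | null; // e.g. 450, or null if no SMTP reply was received
  error: string | null;    // bounce text, if any
  outcome: 'deferred' | 'delivered' | 'failed';
}

// Return a new array rather than mutating, so callers can write it back in one update.
function appendRetryHistory(
  history: readonly RetryHistoryEntry[],
  entry: RetryHistoryEntry
): RetryHistoryEntry[] {
  return [...history, entry];
}

const history = appendRetryHistory([], {
  attempt: 1,
  attemptedAt: new Date('2025-02-01T01:30:00Z').toISOString(),
  smtpCode: 450,
  error: '450 4.2.1 The user you are trying to contact is receiving mail too quickly',
  outcome: 'deferred',
});

// The value serializes cleanly for a JSONB column.
console.log(JSON.stringify(history));
```

Keeping bounce text in the history means the privacy review should confirm that error strings never include message content.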
Monitoring
- Track retry success rates per tier
- Monitor queue depth
- Alert on high retry failure rates
- Dashboard for retry analytics
Future Enhancements
Phase 1: Basic Hybrid (MVP)
- Postfix 1-hour retry
- App BullMQ retry with tier-based policies
- Simple exponential backoff
Phase 2: Smart Retry
- ISP-specific retry schedules
- Learn from historical bounce patterns
- Adaptive retry intervals based on success rates
Phase 3: Machine Learning
- Predict optimal retry timing per recipient
- Cluster similar bounce patterns
- Auto-adjust retry schedules based on deliverability metrics
Phase 4: Advanced Features
- Retry A/B testing
- Customer-configurable retry policies (Enterprise tier)
- Retry simulation mode (test before deploying)
Conclusion
Decision: Hybrid Approach
Postfix: Handle immediate retries (0-1 hour) for standard SMTP issues
Application: Handle long-term retries (1-72+ hours) with intelligence, configuration, and visibility
This approach provides:
- ✅ Privacy compliance (short Postfix queue time)
- ✅ Competitive features (configurable policies)
- ✅ Best of both worlds (SMTP reliability + app intelligence)
- ✅ Industry standard behavior (72h for Pro tier)
- ✅ Differentiation (smart retry logic, webhooks, analytics)
- ✅ Scalability (predictable resource usage)
Alternative approaches considered and rejected:
- ❌ Pure MTA: Fails privacy requirements, no competitive features
- ❌ Pure App: Reinvents SMTP wheel, complex to implement correctly