Back to Blog
System Design Interview: Complete Preparation Guide for Software Engineers

System Design Interview: Complete Preparation Guide for Software Engineers

December 17, 2024
9 min read
Tushar Agrawal

Master system design interviews with this comprehensive guide. Learn scalability patterns, load balancing, caching, database sharding, microservices, and how to approach any system design problem methodically.

Introduction

System design interviews test your ability to design large-scale distributed systems. After conducting and giving dozens of system design interviews, I've compiled this comprehensive guide covering everything you need to know.

The Framework: How to Approach Any Problem

Step 1: Clarify Requirements (5 minutes)

Questions to Ask:
├── Functional Requirements
│   ├── What are the core features?
│   ├── Who are the users?
│   └── What are the use cases?
├── Non-Functional Requirements
│   ├── Scale: How many users? How much data?
│   ├── Performance: Latency requirements?
│   ├── Availability: 99.9%? 99.99%?
│   └── Consistency: Strong or eventual?
└── Constraints
    ├── Budget/cost considerations
    ├── Timeline constraints
    └── Existing infrastructure

Step 2: Estimate Scale (5 minutes)

Back-of-envelope calculations example: Twitter-like service

Users

total_users = 500_000_000 # 500M users daily_active_users = 200_000_000 # 200M DAU

Tweets per day

tweets_per_user_per_day = 2 total_tweets_per_day = daily_active_users * tweets_per_user_per_day

= 400M tweets/day

Read:Write ratio

reads_per_user_per_day = 100 # timeline views total_reads_per_day = daily_active_users * reads_per_user_per_day

= 20B reads/day

QPS (Queries Per Second)

read_qps = 20_000_000_000 / 86400 # ≈ 230K QPS write_qps = 400_000_000 / 86400 # ≈ 4.6K QPS

Storage

avg_tweet_size = 500 # bytes (text + metadata) daily_storage = total_tweets_per_day * avg_tweet_size

= 200GB/day, ~73TB/year

Bandwidth

incoming = write_qps * avg_tweet_size # ≈ 2.3 MB/s outgoing = read_qps * avg_tweet_size # ≈ 115 MB/s

Step 3: High-Level Design (10 minutes)

Start with a simple design, then evolve:

[Users] → [Load Balancer] → [API Servers] → [Database]

Then add components as needed:

  • Caching layer
  • Message queues
  • Search service
  • CDN
  • etc.

Step 4: Deep Dive (15-20 minutes)

Focus on 2-3 components in detail based on interviewer interest.

Step 5: Address Bottlenecks (5 minutes)

Identify single points of failure and scalability limits.

Core Concepts

Scalability Patterns

Vertical Scaling (Scale Up)
├── Add more CPU, RAM, storage
├── Simpler but has limits
└── Single point of failure

Horizontal Scaling (Scale Out) ├── Add more machines ├── Requires distributed architecture └── Better for large scale

Load Balancing

                    ┌─────────────┐
                    │   Client    │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │Load Balancer│
                    └──────┬──────┘
           ┌───────────────┼───────────────┐
           │               │               │
    ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
    │  Server 1   │ │  Server 2   │ │  Server 3   │
    └─────────────┘ └─────────────┘ └─────────────┘

Load Balancing Algorithms: 1. Round Robin - Simple rotation 2. Weighted Round Robin - Based on server capacity 3. Least Connections - Route to least busy server 4. IP Hash - Consistent routing for same client 5. Least Response Time - Route to fastest server

Caching Strategies

Cache Locations:
┌─────────────────────────────────────────────────┐
│                   CDN Cache                      │
│            (Static assets, edge)                 │
└─────────────────────┬───────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────┐
│               Application Cache                  │
│           (Redis/Memcached cluster)             │
└─────────────────────┬───────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────┐
│               Database Cache                     │
│          (Query cache, buffer pool)             │
└─────────────────────────────────────────────────┘

Caching Patterns: 1. Cache-Aside: App manages cache 2. Write-Through: Write to cache and DB 3. Write-Behind: Async DB writes 4. Read-Through: Cache fetches from DB

Database Scaling

Read Replicas:
                    ┌──────────────┐
           Writes   │    Master    │
        ──────────► │   Database   │
                    └──────┬───────┘
                           │ Replication
              ┌────────────┼────────────┐
              │            │            │
       ┌──────▼─────┐ ┌────▼─────┐ ┌────▼─────┐
       │  Replica 1 │ │ Replica 2│ │ Replica 3│
       └──────┬─────┘ └────┬─────┘ └────┬─────┘
              │            │            │
       Reads ◄─────────────┴────────────┘

Sharding (Horizontal Partitioning): ┌─────────────────────────────────────────────────┐ │ Shard Router │ └───────────────────────┬─────────────────────────┘ │ ┌──────────────────┼──────────────────┐ │ │ │ ┌────▼────┐ ┌────▼────┐ ┌────▼────┐ │ Shard 0 │ │ Shard 1 │ │ Shard 2 │ │ A-H │ │ I-P │ │ Q-Z │ └─────────┘ └─────────┘ └─────────┘

Sharding Strategies: 1. Range-based: user_id 1-1M in shard 1 2. Hash-based: hash(user_id) % num_shards 3. Directory-based: Lookup table for shard mapping

Message Queues

Producer → [Queue] → Consumer

Benefits: 1. Decoupling: Services don't need to know each other 2. Async Processing: Handle spikes gracefully 3. Reliability: Messages persist until processed 4. Scalability: Add more consumers as needed

Use Cases:

  • Email sending
  • Image processing
  • Analytics events
  • Order processing

CAP Theorem

         Consistency
             /\
            /  \
           /    \
          / Pick \
         /  Two   \
        /──────────\
Availability ──── Partition Tolerance

In distributed systems, during network partition:

  • CP: Sacrifice availability (return error)
  • AP: Sacrifice consistency (return stale data)
Real-world choices:
  • Banking: CP (consistency critical)
  • Social media: AP (availability > perfect consistency)

Common System Design Problems

1. URL Shortener (Easy)

Requirements:
  • Shorten long URLs
  • Redirect short URLs to original
  • Analytics (optional)
Scale:
  • 100M URLs created/month
  • 10:1 read/write ratio
Design: ┌──────────┐ ┌─────────────┐ ┌──────────┐ │ Client │────►│ API │────►│ Cache │ └──────────┘ │ Server │ │ (Redis) │ └──────┬──────┘ └────┬─────┘ │ │ └────────┬────────┘ │ ┌──────▼──────┐ │ Database │ │ (NoSQL) │ └─────────────┘

Key Decisions:

  • Short URL generation: Base62 encoding or hash
  • Storage: NoSQL for simplicity
  • Caching: Heavy read traffic

2. Twitter/Social Feed (Medium)

Requirements:
  • Post tweets
  • Follow users
  • View home timeline
  • Search tweets
Scale:
  • 500M users, 200M DAU
  • 400M tweets/day
  • Timeline must load < 200ms
Design: ┌─────────────────────────────────────────────────────────────┐ │ │ │ ┌─────────┐ ┌────────────────┐ ┌─────────────────┐ │ │ │ Users │───►│ Load Balancer │───►│ API Gateway │ │ │ └─────────┘ └────────────────┘ └────────┬────────┘ │ │ │ │ │ ┌────────────────────────────────────────┼──────┐ │ │ │ │ │ │ │ │ ┌─────▼─────┐ ┌────────▼───────┐ ┌──────▼────┐ │ │ │ │ Tweet │ │ Timeline │ │ User │ │ │ │ │ Service │ │ Service │ │ Service │ │ │ │ └─────┬─────┘ └────────┬───────┘ └───────────┘ │ │ │ │ │ │ │ │ ┌─────▼─────┐ ┌────────▼───────┐ │ │ │ │ Tweet │ │ Timeline │ │ │ │ │ DB │ │ Cache (Fan- │ │ │ │ └───────────┘ │ out on write)│ │ │ │ └────────────────┘ │ │ └───────────────────────────────────────────────────────────┘

Fan-out Strategies:

  • Fan-out on Write: Pre-compute timelines (for normal users)
  • Fan-out on Read: Compute at read time (for celebrities)
  • Hybrid: Combination based on follower count

3. Chat System (Medium)

Requirements:
  • 1:1 messaging
  • Group chats
  • Online status
  • Message history
Key Challenges:
  • Real-time delivery
  • Message ordering
  • Offline message handling
Design: ┌──────────────────────────────────────────────────────────┐ │ │ │ ┌─────────┐ ┌────────────────┐ ┌───────────────┐ │ │ │ Client │◄──►│ WebSocket │◄──►│ Connection │ │ │ │ App │ │ Gateway │ │ Manager │ │ │ └─────────┘ └───────┬────────┘ └───────────────┘ │ │ │ │ │ ┌───────────────┼───────────────┐ │ │ │ │ │ │ │ ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐ │ │ │ Message │ │ Presence │ │ Group │ │ │ │ Service │ │ Service │ │ Service │ │ │ └──────┬──────┘ └──────┬──────┘ └─────────────┘ │ │ │ │ │ │ ┌──────▼──────┐ ┌──────▼──────┐ │ │ │ Message │ │ Redis │ │ │ │ Storage │ │(Online users│ │ │ │(Cassandra) │ │ status) │ │ │ └─────────────┘ └─────────────┘ │ └──────────────────────────────────────────────────────────┘

Message Flow: 1. Client connects via WebSocket 2. Message sent to Message Service 3. Service looks up recipient's connection 4. If online: Forward via WebSocket 5. If offline: Store for later delivery

4. Rate Limiter (Easy-Medium)

Algorithms:

1. Token Bucket: - Tokens added at fixed rate - Request consumes token - No token = rejected

2. Sliding Window: - Count requests in time window - Window slides continuously

3. Fixed Window Counter: - Simple counter per time window - Can have edge case issues

Implementation (Redis + Lua):

import redis
import time

class RateLimiter: def __init__(self, redis_client: redis.Redis): self.redis = redis_client

def is_allowed(self, user_id: str, limit: int, window: int) -> bool: """Sliding window rate limiter using Redis sorted sets""" key = f"rate_limit:{user_id}" now = time.time() window_start = now - window

pipe = self.redis.pipeline() # Remove old entries pipe.zremrangebyscore(key, 0, window_start) # Count current entries pipe.zcard(key) # Add current request pipe.zadd(key, {str(now): now}) # Set expiry pipe.expire(key, window)

results = pipe.execute() request_count = results[1]

return request_count < limit

Key Technologies to Know

Databases

| Type | Examples | Best For | |------|----------|----------| | Relational | PostgreSQL, MySQL | ACID, complex queries | | Document | MongoDB, CouchDB | Flexible schema, JSON | | Key-Value | Redis, DynamoDB | Caching, sessions | | Wide-Column | Cassandra, HBase | Time-series, high write | | Graph | Neo4j, Neptune | Relationships, social |

Message Queues

| System | Strengths | Use Cases | |--------|-----------|-----------| | Kafka | High throughput, streaming | Event streaming, logs | | RabbitMQ | Flexible routing, AMQP | Task queues | | SQS | Managed, simple | AWS workloads | | Redis Pub/Sub | Low latency | Real-time features |

Caching

| System | Type | Best For | |--------|------|----------| | Redis | In-memory | Sessions, caching, queues | | Memcached | In-memory | Simple caching | | CDN | Edge | Static assets, API responses |

Interview Tips

Do's

1. Think out loud - Explain your reasoning 2. Start simple - Then add complexity 3. Ask questions - Clarify requirements 4. Draw diagrams - Visual communication 5. Consider trade-offs - No perfect solution 6. Discuss alternatives - Show breadth

Don'ts

1. Don't dive into details too fast 2. Don't ignore scale requirements 3. Don't forget about failure cases 4. Don't be silent - Keep communicating 5. Don't be defensive - Accept feedback

Practice Problems

Easy:

  • URL Shortener
  • Pastebin
  • Rate Limiter
Medium:
  • Twitter/Social Feed
  • Instagram
  • Web Crawler
  • Notification System
Hard:
  • Uber/Lyft
  • YouTube
  • Google Docs
  • Distributed Cache

Key Takeaways

1. Follow a structured approach - Requirements → Scale → Design → Deep Dive 2. Know the building blocks - Load balancers, caches, queues, databases 3. Understand trade-offs - Every decision has pros and cons 4. Practice estimation - Back-of-envelope math is crucial 5. Think about failures - How does the system handle failures? 6. Consider operations - Monitoring, deployment, maintenance

Conclusion

System design interviews test your ability to think at scale and make sound architectural decisions. Practice with real-world problems, understand the core concepts, and always communicate your thought process. The goal isn't a perfect design—it's demonstrating systematic problem-solving.

---

Preparing for system design interviews? Connect on LinkedIn to discuss preparation strategies.

Related Articles

Share this article

Related Articles