System Design Fundamentals
System design isn't about memorizing solutions—it's about understanding trade-offs. Every decision has costs and benefits. The best engineers can articulate those trade-offs clearly and make informed decisions based on actual requirements.
The Foundation: Requirements First
Before drawing any boxes and arrows, understand what you're building:
Functional Requirements
What does the system need to do?
- Users can upload files
- Users can share files with others
- Files can be organized in folders
Non-Functional Requirements
How well does it need to work?
- Scale: How many users? How much data?
- Latency: How fast should responses be?
- Availability: What uptime is required?
- Consistency: Can we tolerate stale data?
There's a huge difference between designing for 1,000 users and 1,000,000 users. Always quantify requirements. "Fast" isn't a requirement—"95th percentile latency under 200ms" is.
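Those quantified numbers feed straight into capacity estimates. As a quick back-of-envelope sketch for a hypothetical file-sharing service (every input below is an illustrative assumption, not a real measurement):

```javascript
// Back-of-envelope capacity estimate. All inputs are illustrative assumptions.
function estimateCapacity({ dailyActiveUsers, uploadsPerUserPerDay, avgFileSizeMB }) {
  const uploadsPerDay = dailyActiveUsers * uploadsPerUserPerDay;
  const uploadsPerSecond = uploadsPerDay / 86400; // seconds per day
  const storagePerDayGB = (uploadsPerDay * avgFileSizeMB) / 1024;
  return { uploadsPerSecond, storagePerDayGB };
}

const est = estimateCapacity({
  dailyActiveUsers: 1_000_000,
  uploadsPerUserPerDay: 2,
  avgFileSizeMB: 5,
});
// ~23 uploads/second, ~9,766 GB of new storage per day
```

Even rough numbers like these tell you whether a single database can cope or whether you need to think about sharding from day one.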
1. Scalability: Vertical vs. Horizontal
Vertical Scaling (Scale Up)
Add more power to existing machines: more CPU, more RAM, bigger disks.
- Pros: Simple, no code changes needed
- Cons: Physical limits, single point of failure, expensive
Horizontal Scaling (Scale Out)
Add more machines to distribute the load.
- Pros: Virtually unlimited scaling, better fault tolerance
- Cons: More complex, requires distributed system thinking
// When you scale horizontally, you need a load balancer
Client → Load Balancer → [Server 1]
                       → [Server 2]
                       → [Server 3]
// Load balancing strategies:
// - Round Robin: Simple rotation
// - Least Connections: Route to server with fewest active connections
// - IP Hash: Same client always hits same server (useful for sessions)
// - Weighted: Route more traffic to more powerful servers
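As a sketch, the first two strategies above might look like this (the server lists and connection counts are hypothetical):

```javascript
// Round Robin: rotate through servers in order.
function makeRoundRobin(servers) {
  let next = 0;
  return () => servers[next++ % servers.length];
}

// Least Connections: pick the server with the fewest active connections.
// Each server object tracks its own activeConnections count.
function leastConnections(servers) {
  return servers.reduce((best, s) =>
    s.activeConnections < best.activeConnections ? s : best
  );
}

const pick = makeRoundRobin(['s1', 's2', 's3']);
pick(); // 's1'
pick(); // 's2'

const busy = [
  { name: 's1', activeConnections: 12 },
  { name: 's2', activeConnections: 3 },
  { name: 's3', activeConnections: 7 },
];
leastConnections(busy).name; // 's2'
```

Real balancers layer health checks on top of this: a server that fails its checks is removed from the rotation entirely.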
2. The CAP Theorem: Pick Two (Sort Of)
In a distributed system, you can't have all three:
- Consistency: Every read gets the most recent write
- Availability: Every request gets a response
- Partition Tolerance: System works despite network failures
In practice, network partitions happen, so you're choosing between consistency and availability during failures.
// CP System (Consistency + Partition Tolerance)
// Example: Banking system
// During a network partition, some nodes become unavailable
// but data is always consistent
// AP System (Availability + Partition Tolerance)
// Example: Social media feed
// System always responds, but you might see stale data
// Eventually consistent
Most systems don't need to make this choice most of the time. CAP only applies during network partitions. Design for the common case, have a strategy for the edge cases.
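One common way Dynamo-style replicated stores tune this trade-off is quorum arithmetic: with N replicas, requiring W write acknowledgements and R read acknowledgements such that R + W > N forces every read quorum to overlap every write quorum, so reads see the latest acknowledged write. A minimal sketch of the rule:

```javascript
// Quorum rule for a replicated store with N copies of each key:
// if R + W > N, any R replicas you read from must include at least
// one replica that acknowledged the latest write.
function isStronglyConsistent(N, W, R) {
  return R + W > N;
}

isStronglyConsistent(3, 2, 2); // true  - quorums overlap
isStronglyConsistent(3, 1, 1); // false - fast, but eventually consistent
```

Lowering W or R buys latency and availability at the cost of possibly stale reads, which is exactly the consistency/availability dial described above.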
3. Databases: SQL vs. NoSQL
SQL Databases (PostgreSQL, MySQL)
Best for: Structured data, complex queries, ACID transactions
-- Strong consistency, complex joins
SELECT u.name, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2024-01-01'
GROUP BY u.id
HAVING COUNT(o.id) > 5;
NoSQL Databases (MongoDB, Cassandra, DynamoDB)
Best for: Flexible schemas, high write throughput, horizontal scaling
// Document store - flexible schema
{
  "user_id": "123",
  "name": "John",
  "orders": [
    { "id": "o1", "total": 99.99, "items": [...] },
    { "id": "o2", "total": 149.99, "items": [...] }
  ],
  "preferences": {
    "theme": "dark",
    "notifications": true
  }
}
When to Use What
- SQL: Financial data, user accounts, anything requiring transactions
- Document DB: Content management, user profiles, catalogs
- Key-Value: Caching, session storage, real-time data
- Wide-Column: Time-series data, analytics, high-write workloads
- Graph: Social networks, recommendation engines, fraud detection
4. Caching: The Performance Multiplier
Caching is the #1 way to improve performance. But it introduces complexity.
// Cache-Aside Pattern (most common)
function getUser(userId) {
  // Try cache first
  let user = cache.get(`user:${userId}`);
  if (user) {
    return user; // Cache hit
  }

  // Cache miss - fetch from database
  user = database.query('SELECT * FROM users WHERE id = ?', userId);

  // Store in cache for next time
  cache.set(`user:${userId}`, user, { ttl: 3600 });
  return user;
}
Cache Invalidation Strategies
- TTL (Time To Live): Cache expires after a set time. Simple but can serve stale data.
- Write-Through: Update cache when updating database. Consistent but slower writes.
- Write-Behind: Update cache immediately, sync to database later. Fast but complex.
- Event-Based: Invalidate cache when data changes. Requires event infrastructure.
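As a sketch, write-through keeps the cache and database in lockstep (the in-memory `cache` and `db` maps below are stand-ins for a real cache and a real database):

```javascript
// Write-through sketch: every write hits the database first, then the
// cache, so the two never diverge - at the cost of slower writes.
const cache = new Map();
const db = new Map();

function writeThrough(key, value) {
  db.set(key, value);    // source of truth first
  cache.set(key, value); // then keep the cache in sync
}

function read(key) {
  if (cache.has(key)) return cache.get(key);   // cache hit
  const value = db.get(key);                   // miss: fall back to the database
  if (value !== undefined) cache.set(key, value);
  return value;
}
```

Compare this with the cache-aside pattern above: cache-aside tolerates brief staleness between the database write and the next TTL expiry, while write-through pays for consistency on every write.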
"There are only two hard things in Computer Science: cache invalidation and naming things." — Phil Karlton. Take cache invalidation seriously.
5. Message Queues: Decoupling Services
When Service A doesn't need an immediate response from Service B, use a queue.
// Without queue - tight coupling
async function processOrder(order) {
  await saveOrder(order);
  await sendEmail(order);       // If email fails, order fails
  await updateInventory(order); // If inventory fails, order fails
  await notifyShipping(order);  // If shipping fails, order fails
}

// With queue - loose coupling
async function processOrder(order) {
  await saveOrder(order);
  // These happen asynchronously, can retry independently
  queue.publish('order.created', order);
}

// Separate consumers handle each concern
queue.subscribe('order.created', async (order) => {
  await sendEmail(order);
});

queue.subscribe('order.created', async (order) => {
  await updateInventory(order);
});
6. Database Replication and Sharding
Replication: Copies of Your Data
// Master-Slave Replication
// Writes go to master, reads can go to slaves
[Write] → [Master DB] → [Slave 1] ← [Read]
                      → [Slave 2] ← [Read]
                      → [Slave 3] ← [Read]
// Benefits: Read scaling, fault tolerance
// Drawbacks: Replication lag, write bottleneck
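A sketch of how application code might route queries under this topology (the connection names are hypothetical; note that because of replication lag, a read issued right after a write may not see that write on a replica):

```javascript
// Route writes to the master, spread reads across replicas.
function makeRouter(connections) {
  let next = 0;
  return function route(query) {
    const isRead = /^\s*SELECT/i.test(query);
    if (!isRead) return connections.master;
    // Rotate reads across replicas to spread load.
    return connections.replicas[next++ % connections.replicas.length];
  };
}

const route = makeRouter({ master: 'master', replicas: ['replica1', 'replica2'] });
route('SELECT * FROM users');   // 'replica1'
route('INSERT INTO users ...'); // 'master'
```

Read-your-own-writes flows (e.g. showing a user the profile they just edited) often pin those reads to the master for exactly this reason.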
Sharding: Splitting Your Data
// Horizontal sharding by user_id
// user_id % 3 == 0 → Shard 0
// user_id % 3 == 1 → Shard 1
// user_id % 3 == 2 → Shard 2
function getShard(userId) {
  const shardCount = 3;
  return userId % shardCount; // Simple modulo sharding
}
// Drawback: adding a shard changes almost every key's mapping
// Better: Consistent hashing for easier scaling
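A minimal consistent-hashing sketch (the hash function is a toy FNV-1a, not a production choice, and real rings also place multiple virtual nodes per shard to smooth the distribution):

```javascript
// Toy 32-bit FNV-1a hash - illustrative only.
function hash(str) {
  let h = 0x811c9dc5;
  for (const ch of str) {
    h ^= ch.charCodeAt(0);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Keys and shards hash onto the same ring; a key belongs to the first
// shard at or after its position (wrapping around). Adding or removing
// a shard only remaps the keys between it and its neighbor, unlike
// modulo sharding, which reshuffles almost everything.
class HashRing {
  constructor(shards) {
    this.ring = shards
      .map((shard) => ({ shard, pos: hash(shard) }))
      .sort((a, b) => a.pos - b.pos);
  }

  getShard(key) {
    const pos = hash(String(key));
    const entry = this.ring.find((e) => e.pos >= pos) ?? this.ring[0];
    return entry.shard;
  }
}
```

With this scheme, bringing a fourth shard online only takes over the keys between its ring position and its predecessor's; every other key stays put.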
7. API Design for Scale
Rate Limiting
// Token bucket algorithm
class RateLimiter {
  constructor(tokensPerSecond, bucketSize) {
    this.tokens = bucketSize;
    this.lastRefill = Date.now();
    this.tokensPerSecond = tokensPerSecond;
    this.bucketSize = bucketSize;
  }

  tryConsume() {
    this.refill();
    // Require a whole token: refill() accrues fractional tokens,
    // and checking > 0 would let the count go negative.
    if (this.tokens >= 1) {
      this.tokens--;
      return true;
    }
    return false;
  }

  refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.bucketSize,
      this.tokens + elapsed * this.tokensPerSecond
    );
    this.lastRefill = now;
  }
}
Pagination
// Offset pagination - simple but slow at scale
GET /users?page=1000&limit=20
// Database: OFFSET 20000 LIMIT 20 (scans 20,000 rows!)
// Cursor pagination - efficient at any scale
GET /users?cursor=abc123&limit=20
// Database: WHERE id > cursor LIMIT 20 (uses index)
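A sketch of how a cursor might be produced and consumed, assuming `id` is an indexed, monotonically increasing column (the base64 wrapping just keeps the cursor opaque to clients; the query shape is illustrative):

```javascript
// The cursor is simply the last row's id, base64-encoded.
function encodeCursor(lastId) {
  return Buffer.from(String(lastId)).toString('base64');
}

function decodeCursor(cursor) {
  return Buffer.from(cursor, 'base64').toString('utf8');
}

// Build the keyset query for one page.
function pageQuery(cursor, limit) {
  const afterId = cursor ? decodeCursor(cursor) : '0';
  return {
    sql: 'SELECT * FROM users WHERE id > ? ORDER BY id LIMIT ?',
    params: [afterId, limit],
  };
}

const cursor = encodeCursor(123);
pageQuery(cursor, 20).params; // ['123', 20]
```

The server returns the last row's encoded id alongside each page; the client echoes it back to fetch the next page, and the database seeks directly via the index instead of scanning past skipped rows.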
8. Observability: Know Your System
The Three Pillars
- Logs: Detailed records of what happened
- Metrics: Numerical measurements over time
- Traces: Request flow across services
// Structured logging
logger.info('Order processed', {
  orderId: order.id,
  userId: order.userId,
  total: order.total,
  processingTimeMs: endTime - startTime,
  paymentMethod: order.paymentMethod
});
// Key metrics to track:
// - Request rate (requests/second)
// - Error rate (errors/requests)
// - Latency percentiles (p50, p95, p99)
// - Saturation (CPU, memory, connections)
Design Process: A Framework
1. Clarify requirements: Ask questions, quantify constraints
2. High-level design: Draw the major components
3. Deep dive: Detail critical paths and algorithms
4. Identify bottlenecks: Where will the system fail first?
5. Scale: How do we handle 10x, 100x traffic?
6. Trade-offs: What are we sacrificing for our choices?
Conclusion
System design is fundamentally about trade-offs. There's no perfect architecture—only the right architecture for your specific requirements, constraints, and team.
Start simple. Measure everything. Optimize the bottlenecks. Repeat.