Databases (SQL vs NoSQL, Indexes, ACID)
Foundations
Choosing the right database is one of the most critical decisions in system design. The database is often the hardest part of a system to scale and change, so getting it right early on is crucial. This chapter covers the fundamental concepts you need to make an informed decision: SQL vs. NoSQL, Indexes, and ACID.
SQL (Relational) vs. NoSQL (Non-Relational)
This is the most common database trade-off question in interviews. The key is to understand that one is not inherently better than the other; they are designed to solve different problems.
SQL Databases (e.g., PostgreSQL, MySQL, Oracle)
What they are: SQL databases store data in a structured way, using tables with rows and columns. They enforce a rigid schema, meaning the structure of the data must be defined before you can insert it. They use the Structured Query Language (SQL) for data manipulation.
Key Characteristics:
- Structured Data & Fixed Schema: Data must conform to a predefined structure. This ensures data integrity and consistency.
- ACID Compliance: They are designed to provide strong transactional guarantees (more on this below).
- Joins: They are excellent at handling complex queries that require joining data from multiple tables.
- Vertical Scaling: Traditionally, SQL databases are scaled by increasing the resources (CPU, RAM, SSD) of a single server. While horizontal scaling (sharding) is possible, it's often more complex to implement.
When to use SQL:
- When your data is highly structured and you need to maintain strict data integrity (e.g., financial systems, e-commerce orders).
- When you need to perform complex queries and joins.
- When you require strong transactional (ACID) guarantees.
NoSQL Databases (e.g., MongoDB, Cassandra, Redis, DynamoDB)
What they are: NoSQL is an umbrella term for a wide variety of databases that do not use the traditional relational model. They are designed for scale, flexibility, and performance.
Key Characteristics:
- Flexible Schema: They do not require a predefined schema. This allows for storing unstructured or semi-structured data and makes it easier to evolve the application over time.
- Horizontal Scaling: They are typically designed to be scaled out across many commodity servers, making them suitable for very large datasets and high throughput.
- BASE Guarantees: They often prioritize availability over strict consistency (more on this in the CAP Theorem chapter). The acronym stands for Basically Available, Soft state, Eventual consistency.
- Optimized for Specific Data Models: There are several types of NoSQL databases, each optimized for a different use case.
Types of NoSQL Databases:
- Document Stores (e.g., MongoDB, Couchbase): Store data in flexible, JSON-like documents. Great for content management, user profiles, and e-commerce catalogs.
- Key-Value Stores (e.g., Redis, DynamoDB): The simplest model. Data is stored as a key-value pair. Incredibly fast and scalable for use cases like caching, session management, and real-time leaderboards.
- Wide-Column Stores (e.g., Cassandra, HBase): Store data in tables with rows and a dynamic number of columns. Excellent for write-heavy applications with very large datasets, like IoT sensor data or analytics.
- Graph Databases (e.g., Neo4j, Amazon Neptune): Designed to store and navigate relationships between entities. Perfect for social networks, fraud detection, and recommendation engines.
When to use NoSQL:
- When you need to handle large volumes of unstructured or rapidly changing data.
- When you require massive scale and high write throughput.
- When your application needs to be highly available and can tolerate eventual consistency.
- When your data model fits one of the specific NoSQL types (e.g., a social graph).
Indexes: Making Your Queries Fast
What they are: An index is a special data structure that allows a database to find rows in a table much more quickly than it could by scanning the entire table. It's like the index at the back of a book.
How they work: An index is typically a B-Tree (or a variation) that stores the values of the indexed column(s) and a pointer to the location of the corresponding row on disk. When you query for a specific value, the database can search the much smaller and more efficient index to find the data's location directly.
Why they matter in System Design:
- Performance: Indexes are the single most important tool for optimizing read performance. A missing index can cause a query to be thousands of times slower.
- Trade-offs: Indexes are not free.
- Write Overhead: Every time you
INSERT
,UPDATE
, orDELETE
a row, the database must also update all the indexes on that table. This slows down write operations. - Storage Cost: Indexes take up extra space on disk.
- Write Overhead: Every time you
Rule of Thumb: Add indexes to columns that are frequently used in WHERE
clauses, JOIN
conditions, and ORDER BY
clauses. Be mindful of the write overhead.
ACID: The Transactional Guarantee
ACID is an acronym that describes the four properties of a reliable database transaction in a SQL database.
- Atomicity: A transaction is an "all or nothing" operation. It either completes entirely, or it fails and the database is left in its original state. There are no partial transactions.
- Example: When you transfer money, both the debit from your account and the credit to the other account must succeed. If either fails, the entire transaction is rolled back.
- Consistency: A transaction will always bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including constraints, cascades, and triggers.
- Isolation: The concurrent execution of transactions results in a system state that would be obtained if transactions were executed sequentially. One transaction should not be able to see the intermediate, uncommitted state of another transaction.
- Example: If two people try to book the last seat on a flight at the same time, the isolation property ensures that only one of them can succeed.
- Durability: Once a transaction has been committed, it will remain so, even in the event of a power loss, crash, or error. The changes are permanently stored.
Why it matters in System Design: ACID compliance is critical for systems where data integrity cannot be compromised (e.g., banking, e-commerce). However, enforcing these guarantees, especially in a distributed system, comes with a performance cost. Many NoSQL databases choose to relax some of these properties (typically consistency) in favor of higher availability and performance, leading to the concept of "eventual consistency."