
Blob Storage for Large Files

Data in a Distributed World

So far, we've mostly talked about storing structured or semi-structured data in databases. But what about large, unstructured data like images, videos, audio files, and user-generated documents? These are often referred to as Binary Large Objects (BLOBs).

Storing blobs directly in a traditional database is almost always a bad idea. Databases are optimized for small, structured records, and storing large files in them can lead to poor performance, bloated database sizes, and high costs.

The solution is to use a dedicated Blob Storage service, also known as Object Storage.

What is Blob/Object Storage?

Object storage is a data storage architecture that manages data as objects, as opposed to a file hierarchy (like a file system) or tables (like a relational database).

Each object consists of three things:

  1. The data itself: The actual file (e.g., a JPEG image).
  2. A unique identifier (ID) or key: A globally unique name that is used to retrieve the object.
  3. Metadata: A set of attributes that describe the object (e.g., content type, creation date, size, author).
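As a concrete illustration, the three parts of an object can be modeled as a simple record. The field names below are illustrative, not any particular service's API:

```python
from dataclasses import dataclass, field

@dataclass
class StorageObject:
    """Illustrative model of an object: data, a unique key, and metadata."""
    key: str       # globally unique identifier, e.g. "users/123/profile.jpg"
    data: bytes    # the raw file contents
    metadata: dict = field(default_factory=dict)  # content type, size, etc.

obj = StorageObject(
    key="users/123/profile.jpg",
    data=b"\xff\xd8\xff",  # (truncated) JPEG bytes
    metadata={"Content-Type": "image/jpeg", "size": 48213},
)
print(obj.key, obj.metadata["Content-Type"])
```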

Key Characteristics:

  • Flat Namespace: Unlike a file system with nested directories, object storage has a flat structure. You can simulate directories by using prefixes in your object keys (e.g., users/123/profile.jpg), but the service itself just sees a flat collection of objects.
  • HTTP-based API: Objects are created, read, and deleted via a simple, web-based API, typically using standard HTTP verbs (e.g., PUT, GET, DELETE). Note that objects are generally not updated in place: to change an object, you upload a new version that replaces the old one.
  • Massive Scalability and Durability: These services are designed to be extremely scalable, capable of storing trillions of objects and exabytes of data. They are also highly durable, typically by replicating your data across multiple devices and facilities. For example, Amazon S3 is designed for 99.999999999% (11 nines) of durability.
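Because the namespace is flat, "listing a directory" is really just filtering keys by prefix. A minimal sketch of the idea (the keys below are made up):

```python
# A flat collection of object keys -- there are no real directories here.
keys = [
    "users/123/profile.jpg",
    "users/123/banner.png",
    "users/456/profile.jpg",
    "logs/2024-01-01.txt",
]

def list_by_prefix(keys, prefix):
    """Simulate a 'directory listing' by filtering the flat key space."""
    return [k for k in keys if k.startswith(prefix)]

print(list_by_prefix(keys, "users/123/"))
# ['users/123/profile.jpg', 'users/123/banner.png']
```

Object storage APIs expose the same idea server-side: a list operation that takes a prefix and returns only the matching keys.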

Popular Examples:

  • Amazon S3 (Simple Storage Service)
  • Google Cloud Storage
  • Azure Blob Storage

The Common Workflow for Handling User Uploads

In a system design interview, if you need to handle user-uploaded files like profile pictures or videos, this is the standard and expected workflow:

  1. Client Initiates Upload: The user's client (e.g., a web browser or mobile app) tells the backend application that it wants to upload a file.
  2. Backend Generates a Pre-Signed URL: The backend application does not handle the file upload directly. Instead, it communicates with the object storage service (e.g., S3) and generates a special, short-lived, secure URL called a pre-signed URL. This URL grants the client temporary permission to upload a specific object directly to a specific location in the storage bucket.
  3. Backend Returns URL to Client: The backend sends this pre-signed URL back to the client.
  4. Client Uploads Directly to Blob Storage: The client then uses this URL to upload the file directly to the object storage service, bypassing the backend application entirely.
  5. Client Notifies Backend: Once the upload is complete, the client notifies the backend, typically by sending the unique ID or key of the newly created object.
  6. Backend Stores Metadata: The backend then stores this object ID in the main database, associating it with the relevant record (e.g., storing the profile_picture_key in the users table).
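To make the pre-signed URL idea concrete, here is a stdlib-only sketch of the signing scheme itself. Real services use a more elaborate protocol (e.g., S3's Signature Version 4); the URL format, parameter names, and HMAC construction below are simplified assumptions for illustration only:

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # known to backend and storage service, never the client

def presign_upload_url(bucket: str, key: str, expires_in: int = 300) -> str:
    """Generate a short-lived URL authorizing one upload to bucket/key."""
    expires_at = int(time.time()) + expires_in
    payload = f"PUT:{bucket}:{key}:{expires_at}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"https://{bucket}.example-storage.com/{key}?expires={expires_at}&sig={sig}"

def verify_upload_url(bucket: str, key: str, expires_at: int, sig: str) -> bool:
    """What the storage service would check before accepting the upload."""
    if time.time() > expires_at:
        return False  # the URL has expired
    payload = f"PUT:{bucket}:{key}:{expires_at}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

url = presign_upload_url("avatars", "users/123/profile.jpg")
print(url)
```

The key property: the signature covers the method, bucket, key, and expiry, so the URL works for exactly one location for a limited time, and the client never sees the secret.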

Why Use This Workflow?

  • Scalability and Performance: Your application servers are not tied up receiving large file uploads. They are free to handle other API requests. This offloads the heavy lifting of data transfer to the highly scalable object storage service.
  • Security: You don't have to expose your object storage credentials to the client. The pre-signed URL provides secure, temporary access.
  • Simplicity: The application logic is simplified. The backend only needs to handle lightweight metadata, not the files themselves.

Serving the Content (Downloads)

To serve the content, you have two main options:

  1. Public Objects: If the files are meant to be publicly accessible (like images in a news article), you can simply make the objects in your storage bucket public. You would then store the public URL in your database and use that in your application.
  2. Private Objects (via a CDN): For private or user-specific content, you should not make the objects public. Instead, you should put a Content Delivery Network (CDN) in front of your object storage. You can then configure the CDN to securely access the private objects and serve them to users. This has the added benefit of caching the content at the edge, which dramatically improves download speeds for users around the world.
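A small sketch of the two resulting download paths. The domains are placeholders; the point is that the database stores only the object key, and the serving URL is derived from it:

```python
PUBLIC_BUCKET_URL = "https://news-assets.example-storage.com"  # placeholder domain
CDN_URL = "https://cdn.example.com"                            # placeholder domain

def public_url(key: str) -> str:
    """Option 1: the object is public; link straight to the bucket."""
    return f"{PUBLIC_BUCKET_URL}/{key}"

def cdn_url(key: str) -> str:
    """Option 2: a private object served through a CDN that is authorized
    to read from the bucket (and caches the content at the edge)."""
    return f"{CDN_URL}/{key}"

print(public_url("articles/42/hero.jpg"))
print(cdn_url("users/123/profile.jpg"))
```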

In a system design interview, if the requirements involve storing and serving large files, you should immediately propose using a blob storage service like S3. Explaining the pre-signed URL workflow for uploads is a key way to demonstrate that you understand how to build a scalable and secure system for handling user-generated content.