Design PasteBin - System Design Interview Question

Asked In

Amazon, Ola Cabs

Key benefits after reading this Blog

Have you ever thought of any such service that could make our life easier by allowing us to store data over the internet and simultaneously allow others to access the same data? If it is so, then this is the perfect blog for you. Here, in this blog, we’ll be discussing the PasteBin System. Here, the aim is to design a highly scalable service that could allow users to paste various content (images, text, etc.) over the internet and allow others to access the pasted content. So let’s look at the formal definition of Pastebin.

What is Pastebin?

Pastebin is a type of online content hosting service where users can store and share content in the form of text or images over the internet by generating a unique URL so that anyone can access the contents via that URL. The user who created it can also update the content if he/she is logged in. So without much delay, let’s dive deep into designing such a scalable service from scratch.

Let’s Design Pastebin

Key Requirements

Requirements analysis is a critical process that enables the success of a system. Requirements are generally split into two types: Functional and Non-Functional Requirements.

Functional requirements are the essential requirements that the end-user demands as necessary facilities that the system should offer. For PasteBin, the functional requirements are:

Users should be able to generate a unique URL by pasting his/her content over there
The content and URLs should expire after a configurable time

Non-Functional Requirements are the quality constraints that the system must satisfy according to the project contract. For PasteBin, the non-functional requirements are:

The system should be reliable and highly available.
Service should be available in real-time with minimum latency.

As we know, what are our essential requirements, so it’s time to estimate the service’s capacity.

Capacity Estimation

It is necessary to know the approximate traffic to the service to be scaled and designed accordingly. Since this service would be read-heavy, it is better to consider read to write the ratio as 10:1. Assume we have 10M users, and each user contributes one write a request on average per month. So we have -

-> 10 Million write requests per month
-> 100 Million read requests per month (10X write request)

-> Hence the total number of writing request in 10 years
= 10M*12(months/year)*10(years)= 1.2 Billion write requests

Let the maximum size per paste be 10MB.
-> Amount of paste size per month 
= 10 M*10MB = 100 TB paste/month

-> Amount of paste content in one year 
= 100TB*12 = 1200TB or 1.2PB paste/year

Assume our service runs for 10 years, 
then total data accumulated = 1200TB * 10 years = 12 PB

High-Level Design

design pastebin image 1

We have two high-level working API for this service: one for writing the content and the other one for dealing with reading requests. We can use either SOAP or REST architecture to implement the system.

Write API requirements

write( key, content, expiry_date)

Parameters: key, content, expiry_data

key: A unique API key available with each user
content: The content that the user wants to paste.
expiry_date: Time after which the content should be expired.

Return Value: API should return a short URL to access the content or otherwise, in case of invalid input, it should return an error code

The user enters the content and gets the randomly generated URL. This works in the described way

-> The Client sends a create paste request to the Web Server, 
which in turn the request forwards to the Write API server. 

-> The Write API server does the following:
   -> Generates a unique URL 
   -> Saves to the SQL Database pastes table 
   -> Saves the content to the Object Store and returns the URL

Read API requirements

read (short_url)

Parameters: short_url

short_url: The URL provided by the Write method to access the content.

Return Value: API should return the original paste content or an invalid error code if the URL is incorrect.

The user enters a pasted URL and views the content. This works in the following way:

-> The Client sends a get paste request to the Web Server, 
which in turn forwards the request to the Read API server. 

-> The Read API server does the following:
    -> Checks for the generated URL in the SQL Database
    -> Fetch the paste contents from the Object Store if the URL is in the SQL Database
    -> Otherwise, return an error message for the user

Besides these essential services, we also have two more APIs responsible for Real-Time analytics(out of scope) and the deletion of expiration content.

Delete API requirements

delete (api_key, URL)

Parameters: api_key, URL

api_key: A unique API key provided to each user
URL: URL to access the particular content

Return Value: Boolean Value indicating success and failure.

Low-Level Design

The key component of this service relies on generating short URLs and storing the content. Let’s discuss these in details:

Generate Short URL

Every time whenever a user makes a paste, the service needs to generate a unique new short URL. We can use BASE64 for encoding the URL as all the characters of BASE64 are URL-safe characters(characters allowed [A-Z][a-z][0–9],’+’,’/’).

Since we have 12B write requests we can easily do with a URL of length 6 but by considering to make the system highly scalable we should choose a URL of length 7.

To generate the unique URL, we could take the MD5 hash of the user’s ipaddress + timestamps. We have various choices like ipaddress+timestamp or ip_address + content, but as content would be very large to handle hence we use the other. Alternatively, we could also take the MD5 hash of randomly generated data

Now when we encode the MD5 result to Base64 encoding the resultant string will be 22 characters long.

Base64 encoding will use 6 bits to represent each character, 
MD5 -> Base64 give (128/6) ~ 22 character output.

Hence, we could use:

For choosing the short URL, first, we can choose the first 7 characters of the resulting Base64 encoding.

The drawback of this algorithm

The primary issue with this algorithm is that the generated URL can be repeated. Hence to overcome this, KGS(Key Generating Service) comes to rescue us. KGS algorithm will ensure that all the keys inserted into key-DB are unique and hence we don’t have to worry about duplication or collisions.

Database Design

We need to store both the content and the short URL. The average size of each content is 10MB. As calculated above, for our service to be available for ten years, we have to accommodate a massive amount of data, i.e., 12 PB, which is too much to be stored in any database. Hence, we have to use Object Storage Service, such as Amazon S3, for storing the content and a normal relational database for storing URLs.

Database Structure

The structure of the database is described here:

User Data

1. User ID: A unique user id for each user
2. Name: User’s Name
3. Email: The email id of the user
4. Password: Password of the user
5. Creation Date: The date on which the content was shared

Content Data

1. User ID: A unique user id for each user
2. Short URL: A unique short URL consisting of 7 char
3. Paste Path: The URL path of the S3 Bucket
4. Expiration Date: Time after which content expires
5. Last Accessed: Last time when content was accessed

The User ID is used as a Primary Key in the above structure.

design pastebin image 2

For storing the paste content (URL), we have two choices to choose from. We can either go with Relational Database like MySQL, or we can go with a NoSQL database. As we want fast read and write speed, but we don’t need lots of relationships among data. Hence NoSQL Databases are the optimal choice for our system.

Relational Databases are very efficient when there are many dependencies, and when we have lots of complex queries, they are too slow. However, NoSQL databases are very poor at handling the relationship queries, but they are faster.

Scaling the Service

This is the most critical aspect of business requirements. The service should be highly scalable, and at the same time, it should be reliable, available, and fast enough for a better user experience. We need to scale our database, have more servers to handle more requests, and be fast enough.

Database Sharding

It will be expensive, and the service will be prolonged and become more prone to failure if we use a single database to store the data. Hence we need to partition it into several machines or nodes to store information about billions of URLs.

For database partitioning, we can use the Consistent Hashing technique. The hash function will be used to distribute the URLs into different partitions. We need to decide the number of shards to make, and then we can choose an appropriate hash function that random number represents the partition/shard number.

Caching

We have ten times more read operation as compared to write operation. Hence to speed up our service, we need to use caching. Cashing is a way to speed up the reading process. The URLs which are accessed quite frequently should be stored in the cache for faster access. For example, if any URL appears on any social networking website’s trending page, many people will likely visit the URL. Hence we can store its content in our cache to avoid any delay. We can use services like Memcached or Redis for this purpose.

Various possibilities need to be kept in mind while designing the cache. Some of them are:

Whenever the cache is full, we need to replace the URLs with the trending ones. We can LRU method to implement this.
The cache should remain in synchronisation with the original content. If any chances are done in the original content, then it must be reflected in the cache.

We can use Distributed Cache to incorporate all the above possibilities.

Load Balancing

The designed system might be prone to network bandwidth bottlenecks and a single point of failure. To overcome this problem, we use Load Balancing. There might be instances when many requests went to a single server, resulting in the service’s failure. There are many algorithms available to distribute the load among servers, but here in this system, we will use the Least Bandwidth Method. This algorithm will divert the incoming request to the server serving the least amount of traffic.

Along with balancing the load, we need to replicate all the databases and servers to avoid failure. In case of any discrepancy, the service should never go down and should remain highly consistent.

Conclusion

PasteBin System is a complex and highly scalable service to design. In this blog, we have covered the fundamental concepts necessary for building such a service. We hope you like the content. Please do share your views.