Design Tiny URL

Source: Internet
Author: User
Tags value store database sharding

Part 1:

Objective:

Several blogs have recently covered short URLs. Some of them make good points, but none is complete, so this post summarizes them.

Introduction:

A short address, as the name implies, turns a long URL into a short one. Many companies now provide this service; we take Google's URL shortener service, http://goo.gl/, as an example.

First we go to http://goo.gl/, then we enter the address of this blog, http://blog.csdn.net/beiyeqingteng, and finally it returns a shorter URL, http://goo.gl/jfs6q.


URL parsing:

When we enter http://goo.gl/jfs6q in the browser, DNS first resolves goo.gl to an IP address (for example, 74.125.225.72). The browser then sends an HTTP GET request for jfs6q to that address, and the goo.gl server answers with an HTTP 301 redirect to the corresponding long URL, http://blog.csdn.net/beiyeqingteng. From there on, the process is the same as for any ordinary URL.
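The server side of the redirect described above can be sketched as follows. This is only an illustration, not how goo.gl is implemented: the class name RedirectSketch, the method respond, the in-memory map standing in for the database, and the 404 response for unknown keys are all assumptions.

```java
import java.util.Map;

final class RedirectSketch {
    /**
     * Builds a raw HTTP/1.1 response for a request path such as "/jfs6q".
     * The map is a hypothetical stand-in for the short-to-long URL store.
     */
    static String respond(String path, Map<String, String> store) {
        String key = path.startsWith("/") ? path.substring(1) : path;
        String longUrl = store.get(key);
        if (longUrl == null) {
            // Unknown short key: nothing to redirect to.
            return "HTTP/1.1 404 Not Found\r\nContent-Length: 0\r\n\r\n";
        }
        // 301 tells the browser to repeat the request at the Location URL.
        return "HTTP/1.1 301 Moved Permanently\r\n"
                + "Location: " + longUrl + "\r\n"
                + "Content-Length: 0\r\n\r\n";
    }

    public static void main(String[] args) {
        Map<String, String> store = Map.of("jfs6q", "http://blog.csdn.net/beiyeqingteng");
        System.out.print(respond("/jfs6q", store));
    }
}
```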

The nature of a short address:

The short address essentially implements a mapping function f: X → Y. This mapping function must have two properties:

1. If x1 != x2, then f(x1) != f(x2);

2. For each y, a unique x can be found such that f(x) = y.

Any linear function with a nonzero slope, such as f(x) = 2x, satisfies these conditions.

Well, if you understand the nature of the short address, let's see how it's implemented.

Note: The Google URL Shortener service allows one long URL to correspond to multiple short URLs. This may be due to security considerations. In this article, we do not consider this situation.

Implementation:

The short address is generally 6 characters long, and each character is one of the 62 characters [a-z, A-Z, 0-9], so 6 characters give a total of 62^6 ~= 56.8 billion combinations, which is basically enough. In the Google URL Shortener service, the short address length is 5, which gives more than 900 million combinations.
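The combination counts quoted above can be checked with a few lines of Java. This is a small sketch; the class name KeySpace and the method combinations are made up for illustration.

```java
final class KeySpace {
    /** Number of distinct strings of the given length over a 62-character alphabet. */
    static long combinations(int length) {
        long total = 1;
        for (int i = 0; i < length; i++) {
            total = Math.multiplyExact(total, 62L); // throws ArithmeticException on overflow
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(combinations(5)); // 916132832, a little over 900 million
        System.out.println(combinations(6)); // 56800235584, about 56.8 billion
    }
}
```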

Suppose we use a database to store the mapping between long and short addresses. The table Longtoshorturl then has three columns:

1. id, int, auto-increment;

2. lurl, varchar, the long URL;

3. surl, varchar, the short URL.

Now let's consider how each long URL gets a unique short URL.

Before getting to the specific algorithm, here is a question: does conversion between decimal (base 10) and hexadecimal (base 16) numbers satisfy the two conditions of the mapping function f: X → Y just mentioned?

Answer: Yes.

This article also relies on conversion between number bases. Because we have 62 characters in total, we can define our own base, called base 62. The rules are as follows:

    0  → a
    1  → b
    ...
    25 → z
    ...
    52 → 0
    61 → 9

So, for each long address, we can derive a 6-digit base-62 number from its ID, and this 6-digit base-62 number is our short address. The implementation is as follows:

    import java.util.LinkedList;
    import java.util.List;

    // Converts a positive ID into its base-62 digits, most significant digit first.
    public List<Integer> base62(int id) {
        List<Integer> value = new LinkedList<Integer>();
        while (id > 0) {
            int remainder = id % 62;
            value.add(0, remainder); // prepend so the digits stay in order
            id = id / 62;
        }
        return value;
    }

Example:

For ID = 138, base62(138) gives value = [2, 14], since 138 = 2 * 62 + 14. According to the rule table above, 2 maps to c and 14 maps to o, so after padding to 6 characters the short address is aaaaco. (Turning value into the short address can be done with a switch statement; the code is long, so it is omitted here.)
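As an alternative to the switch statement mentioned above, the digit-to-character step can be done with an alphabet string. The sketch below is not the original author's code: the class name Base62Encoder is hypothetical, and the placement of A-Z at digits 26 to 51 is an assumption read off the rule table (a-z first, 0-9 last).

```java
import java.util.Arrays;

final class Base62Encoder {
    // Digit 0 maps to 'a', 25 to 'z', 52 to '0', 61 to '9' as in the rule table;
    // A-Z at 26..51 is an assumption filling the gap in that table.
    static final String ALPHABET =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    /** Encodes a non-negative ID as base 62, left-padded with 'a' (digit 0) to the given length. */
    static String encode(long id, int length) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        char[] out = new char[length];
        Arrays.fill(out, ALPHABET.charAt(0));
        // Fill from the least significant digit on the right.
        for (int i = length - 1; i >= 0 && id > 0; i--) {
            out[i] = ALPHABET.charAt((int) (id % 62));
            id /= 62;
        }
        if (id > 0) {
            throw new IllegalArgumentException("id does not fit in " + length + " characters");
        }
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(encode(138, 6)); // aaaaco (digits 2 and 14)
    }
}
```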

To find the long address from a short address, the method is very simple: convert the base-62 number to decimal, which gives the ID of the long address.

For example, the short address aaae9a has the base-62 digits [0, 0, 0, 4, 61, 0], so the ID of its long address is 0 * 62^5 + 0 * 62^4 + 0 * 62^3 + 4 * 62^2 + 61 * 62^1 + 0 * 62^0 = 19158 (in base 10). With the ID, we can naturally look up the long address.
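The reverse conversion can be sketched in the same way. The class name Base62Decoder is hypothetical, and the alphabet again assumes A-Z occupies digits 26 to 51.

```java
final class Base62Decoder {
    // Same assumed alphabet as the rule table: a-z, then A-Z, then 0-9.
    static final String ALPHABET =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    /** Converts a base-62 short address back to the numeric ID. */
    static long decode(String shortAddress) {
        long id = 0;
        for (char c : shortAddress.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) {
                throw new IllegalArgumentException("invalid character: " + c);
            }
            id = id * 62 + digit; // shift previous digits one place left, then add
        }
        return id;
    }

    public static void main(String[] args) {
        System.out.println(decode("aaae9a")); // 19158 = 4 * 62^2 + 61 * 62
    }
}
```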

Part 2: Question

How to create a TinyURL system?

If you aren't familiar with TinyURL, I'll briefly explain here. Basically, TinyURL is a URL shortening service, a web service that provides short aliases for redirection of long URLs. There are many other similar services, like Google URL Shortener, bitly etc.

For example, the URL http://blog.gainlo.co/index.php/2015/10/22/8-things-you-need-to-know-before-system-design-interviews/ is long and hard to remember; TinyURL can create an alias for it: http://tinyurl.com/j7ve58y. If you click the alias, it'll redirect to the original URL.

So if you were to design this system, which allows people to input URLs and get short alias URLs generated, how would you do it?

High-level idea

Let's get started with basic and high-level solutions and we can keep optimizing it later on.

At first glance, each long URL and the corresponding alias form a key-value pair. I would expect you to think of something related to hashing immediately.

Therefore, the question can be simplified like this: given a URL, how can we find a hash function F that maps the URL to a short alias:
F(URL) = alias
and satisfies the following conditions:

    1. Each URL can be mapped to a unique alias
    2. Each alias can be mapped back to a unique URL easily

The second condition is the core one, since at run time the system should look up by alias and redirect to the corresponding URL quickly.

Basic Solution

To make things easier, we can assume the alias is something like http://tinyurl.com/<alias_hash> and alias_hash is a fixed-length string.

If the length is 7 and the characters come from [a-z, A-Z, 0-9], we can serve 62^7 ~= 3,500 billion URLs. It's said that there are ~644 million URLs at the time of this writing.

To begin with, let's store all the mappings in a single database. A straightforward approach is using alias_hash as the ID of each mapping, which can be generated as a random string of length 7.

Therefore, we can first just store <ID, URL>. When a user inputs a long URL "http://www.gainlo.co", the system creates a random 7-character string like "abcd123" as the ID and inserts the entry <"abcd123", "http://www.gainlo.co"> into the database.

At run time, when someone visits http://tinyurl.com/abcd123, we look up by ID "abcd123" and redirect to the corresponding URL "http://www.gainlo.co".
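The basic solution can be sketched with an in-memory map in place of the database. Everything named here (RandomAliasStore, shorten, resolve) is hypothetical, and retrying on a collision is an assumption: the text does not say what happens when a random string is already taken.

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

final class RandomAliasStore {
    static final String ALPHABET =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    static final int ALIAS_LENGTH = 7;

    // Stand-in for the <ID, URL> table of a single database.
    private final Map<String, String> urlByAlias = new HashMap<>();
    private final SecureRandom random = new SecureRandom();

    /** Stores the URL under a fresh random 7-character alias and returns the alias. */
    String shorten(String longUrl) {
        while (true) {
            String alias = randomAlias();
            // putIfAbsent returns null only if the alias was unused; otherwise retry.
            if (urlByAlias.putIfAbsent(alias, longUrl) == null) {
                return alias;
            }
        }
    }

    /** Returns the long URL for an alias, or null if the alias is unknown. */
    String resolve(String alias) {
        return urlByAlias.get(alias);
    }

    private String randomAlias() {
        StringBuilder sb = new StringBuilder(ALIAS_LENGTH);
        for (int i = 0; i < ALIAS_LENGTH; i++) {
            sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        RandomAliasStore store = new RandomAliasStore();
        String alias = store.shorten("http://www.gainlo.co");
        System.out.println("http://tinyurl.com/" + alias + " -> " + store.resolve(alias));
    }
}
```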

Performance VS Flexibility

There are quite a few follow-up questions for this problem. One thing I'd like to discuss further here is this: if we use a GUID (Globally Unique Identifier) as the entry ID, what are the pros and cons versus an incremental ID in this problem?

If you dig into the insert/query process, you'll notice that using random strings as IDs may sacrifice performance a little bit. More specifically, when you already have millions of records, insertion can be costly. Since the IDs are not sequential, every time a new record is inserted, the database needs to go look for the correct page for this ID. However, when using incremental IDs, insertion can be much easier: just go to the last page.

So one way to optimize this is to use incremental IDs. Every time a new URL is inserted, we increment the ID by 1 for the new entry. We also need a hash function that maps each integer ID to a 7-character string. If we think of each string as a base-62 number, the mapping should be easy (of course, there are other ways).

On the flip side, using incremental IDs makes the mapping less flexible. For instance, if the system allows users to set a custom short URL, the GUID solution is apparently easier, because for whatever custom short URL, we can just calculate the corresponding hash as the entry ID.

Note: In this case, we are not using the randomly generated key but a better hash function that maps any short URL into an ID, e.g. some traditional hash functions like CRC32, SHA-1 etc.
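As a rough illustration of hashing a custom short URL into an entry ID, the sketch below uses CRC32 from the Java standard library. The class name AliasHash and the method idFor are hypothetical. CRC32 produces only 32 bits, so two different aliases can hash to the same ID; a real system would need to handle that, which is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class AliasHash {
    /** Hashes a custom alias to an unsigned 32-bit value (returned as a long) using CRC32. */
    static long idFor(String alias) {
        CRC32 crc = new CRC32();
        crc.update(alias.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public static void main(String[] args) {
        // The same alias always yields the same ID, so lookups need no stored counter.
        System.out.println(idFor("my-custom-alias"));
    }
}
```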

Cost

I can hardly not ask about how to evaluate the cost of the system. For insert/query, we've already discussed this above, so I'll focus on the storage cost.

Each entry is stored as <ID, URL>, where ID is a 7-character string. Assuming the max URL length is 2083 characters, each entry takes 7 * 4 bytes + 2083 * 4 bytes ~= 8.4 KB. If we store a million URL mappings, we need around 8.4 GB of storage.
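The storage estimate above works out as follows. This is a sketch of the arithmetic only; the class and constant names are made up, and 4 bytes per character is the worst-case figure the text assumes.

```java
final class StorageEstimate {
    static final int BYTES_PER_CHAR = 4;   // worst case assumed in the text
    static final int ID_LENGTH = 7;
    static final int MAX_URL_LENGTH = 2083;

    /** Bytes needed for one <ID, URL> entry at maximum URL length. */
    static long bytesPerEntry() {
        return (long) (ID_LENGTH + MAX_URL_LENGTH) * BYTES_PER_CHAR;
    }

    /** Bytes needed for the given number of entries, ignoring indexes and metadata. */
    static long totalBytes(long entries) {
        return Math.multiplyExact(bytesPerEntry(), entries);
    }

    public static void main(String[] args) {
        System.out.println(bytesPerEntry());       // 8360 bytes, about 8.4 KB
        System.out.println(totalBytes(1_000_000)); // 8360000000 bytes, about 8.4 GB
    }
}
```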

If we consider the size of the database index, and we also store other information like the user ID, the date each entry is inserted etc., it definitely requires much more storage.

Multiple machines

Apparently, when the system has grown to a certain scale, a single machine is not capable of storing all the mappings. How do we scale with multiple instances?

The more general problem is how to store a hash mapping across multiple machines. If you know distributed key-value stores, you should know that this can be a very complicated problem. I'll discuss only high-level ideas here, and if you're interested in all those details, I'd recommend you read papers like Dynamo: Amazon's Highly Available Key-value Store.

In a nutshell, if you want to store a huge amount of key-value pairs across multiple instances, you need to design a lookup algorithm that allows you to find the corresponding machine for a given lookup key.

For example, if the incoming short alias is http://tinyurl.com/abcd123, based on the key "abcd123" the system should know which machine stores the database that contains the entry for this key. This is exactly the same idea as database sharding.

A common approach is to have machines that act as a proxy, which are responsible for dispatching requests to the corresponding backend stores based on the lookup key. Backend stores are actually databases that store the mapping. They can be split in various ways, like using hash(key) % 1024 to divide the mappings among 1024 stores.
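The dispatch step a proxy performs can be sketched as below, assuming 1024 backend stores. The class name ShardRouter is hypothetical, and String.hashCode() is used only as a convenient stand-in for whatever hash function a real system would choose.

```java
final class ShardRouter {
    static final int SHARD_COUNT = 1024; // assumed number of backend stores

    /** Returns the index of the backend store responsible for the lookup key. */
    static int shardFor(String key) {
        // floorMod keeps the result in [0, SHARD_COUNT) even when hashCode() is negative,
        // which a plain % would not.
        return Math.floorMod(key.hashCode(), SHARD_COUNT);
    }

    public static void main(String[] args) {
        System.out.println("abcd123 is stored on shard " + shardFor("abcd123"));
    }
}
```

Note that with this scheme changing SHARD_COUNT moves most keys to a different store, which is one reason resharding is hard.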

There are tons of details that can make the system complicated. I'll just name a few here:

      • Replication. Data stores can crash for various random reasons, therefore a common solution is having multiple replicas for each database. There can be many problems here: how to replicate instances? How to recover fast? How to keep reads and writes consistent?
      • Resharding. When the system scales to another level, the original sharding algorithm might not work well. We may need to use a new hash algorithm to reshard the system. How to reshard the database while keeping the system running can be an extremely difficult problem.
      • Concurrency. There can be multiple users inserting the same URL or editing the same alias at the same time. With a single machine, you can control this with a lock. However, things become much more complicated when you scale to multiple instances.
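For the single-machine case of the concurrency point, an atomic put-if-absent gives the effect of a lock per alias. The sketch below is an assumption about how one might do it in Java (AliasRegistry and claim are hypothetical names). It does not address the multi-instance case, which needs coordination across machines.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class AliasRegistry {
    private final ConcurrentMap<String, String> urlByAlias = new ConcurrentHashMap<>();

    /**
     * Atomically claims the alias for the URL.
     * Returns false if another caller already holds the alias; the existing mapping is kept.
     */
    boolean claim(String alias, String longUrl) {
        return urlByAlias.putIfAbsent(alias, longUrl) == null;
    }

    /** Returns the long URL for an alias, or null if the alias is unclaimed. */
    String resolve(String alias) {
        return urlByAlias.get(alias);
    }

    public static void main(String[] args) {
        AliasRegistry registry = new AliasRegistry();
        System.out.println(registry.claim("abcd123", "http://www.gainlo.co")); // true
        System.out.println(registry.claim("abcd123", "http://example.com"));   // false
    }
}
```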

Reference:

http://stackoverflow.com/questions/742013/how-to-code-a-url-shortener (source of this algorithm)

http://blog.sina.com.cn/s/blog_65db99840100lg4n.html

http://blog.csdn.net/beiyeqingteng

http://blog.gainlo.co/index.php/2016/03/08/system-design-interview-question-create-tinyurl-system/
