C# short URL compression algorithm and how short URLs work, with an example. This article describes the short URL mapping algorithm in detail: the long URL is hashed with MD5 into a 32-character hexadecimal signature string, the string is split into 4 segments of 8 characters each, and a short URL is generated from each segment. See the example in the text.
Short URL mapping algorithm:
Hash the long URL with MD5 to get a 32-character hexadecimal signature string, and split it into 4 segments of 8 characters each;
Process the four segments in a loop: take each 8-character segment, treat it as a hexadecimal number, and AND it with 0x3FFFFFFF (30 one-bits), so that everything above the low 30 bits is discarded;
Consume the 30 bits in 6 steps of 5 bits each, using the value at each step as an index into the character table to get one character, which yields a 6-character string;
The full MD5 string therefore gives four 6-character strings, and any one of them can be used as the short URL for this long URL;
The resulting short URL is not guaranteed to be unique, but with four candidates to choose from, duplicates should be rare.
Full code:
using System;
using System.Security.Cryptography;
using System.Text;

namespace ShortUrlDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            Random rd = new Random();
            // The four candidate codes depend only on the URL, so compute them once.
            string[] shortUrls = ShortUrl("http://www.freemud.cn");

            for (int i = 0; i < 100; i++)
            {
                // Pick one of the four candidates at random.
                int index = rd.Next(0, 4);
                Console.WriteLine(string.Concat("http://www.freemud.cn/", shortUrls[index]));
            }
            Console.Read();
        }

        public static string[] ShortUrl(string url)
        {
            // Key mixed into the URL before hashing; can be customized.
            string key = "Freemud";
            // Characters used to build the short code.
            string[] chars = new string[]
            {
                "a", "b", "c", "d", "e", "f", "g", "h",
                "i", "j", "k", "l", "m", "n", "o", "p",
                "q", "r", "s", "t", "u", "v", "w", "x",
                "y", "z", "0", "1", "2", "3", "4", "5",
                "6", "7", "8", "9", "A", "B", "C", "D",
                "E", "F", "G", "H", "I", "J", "K", "L",
                "M", "N", "O", "P", "Q", "R", "S", "T",
                "U", "V", "W", "X", "Y", "Z"
            };

            // MD5 hash of key + URL (UTF-8 bytes) as a 32-character hex string.
            // MD5.Create is used here in place of the obsolete
            // FormsAuthentication.HashPasswordForStoringInConfigFile call.
            string hex;
            using (MD5 md5 = MD5.Create())
            {
                byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(key + url));
                hex = BitConverter.ToString(hash).Replace("-", "");
            }

            string[] resUrl = new string[4];
            for (int i = 0; i < 4; i++)
            {
                // Parse one 8-character hex segment and keep its low 30 bits.
                int hexint = 0x3FFFFFFF & Convert.ToInt32(hex.Substring(i * 8, 8), 16);
                string outChars = string.Empty;
                for (int j = 0; j < 6; j++)
                {
                    // AND with 0x3D (binary 111101) to get an index into chars.
                    int index = 0x0000003D & hexint;
                    // Append the selected character.
                    outChars += chars[index];
                    // Shift right by 5 bits on each pass.
                    hexint = hexint >> 5;
                }
                // Store the code at the matching index of the output array.
                resUrl[i] = outChars;
            }
            return resUrl;
        }
    }
}
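To make the inner loop concrete, here is a small sketch that runs it on a single 8-character segment. The class name SegmentDemo and the method EncodeSegment are made up for this illustration; the table and the masks are copied from the listing above. The two inputs are chosen so the result can be checked by hand.

```csharp
using System;

static class SegmentDemo
{
    // Same 62-character table as in ShortUrl, written as one string.
    const string Chars =
        "abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Encodes one 8-character hex segment into a 6-character code.
    // (Hypothetical helper, extracted from the loop in ShortUrl.)
    public static string EncodeSegment(string hex8)
    {
        // Keep the low 30 bits of the 32-bit segment.
        int value = 0x3FFFFFFF & Convert.ToInt32(hex8, 16);
        var code = new char[6];
        for (int j = 0; j < 6; j++)
        {
            code[j] = Chars[0x3D & value]; // mask, then look up the character
            value >>= 5;                   // move on by 5 bits
        }
        return new string(code);
    }

    static void Main()
    {
        Console.WriteLine(EncodeSegment("00000000")); // aaaaaa
        Console.WriteLine(EncodeSegment("FFFFFFFF")); // ZZZZZ3
    }
}
```

Because 0x3D is binary 111101, bit 1 of the index is always zero, so each position can select only 32 of the 62 table entries. For the all-ones segment the last character is "3" rather than "Z": only 5 bits remain by then, and 31 AND 0x3D is 29.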
TTServer (the Tokyo Tyrant server, a network interface to Tokyo Cabinet) is recommended for storing the data behind these URLs.
TTServer database:
Tokyo Cabinet is a DBM database developed by Mikio Hirabayashi of Japan, who also developed the well-known DBM database QDBM. It reads and writes very quickly.
insert: 0.4 sec per 1,000,000 records (2,500,000 QPS); writing 1 million records takes only 0.4 seconds.
search: 0.33 sec per 1,000,000 records (3,000,000 QPS); reading 1 million records takes only 0.33 seconds.
For dictionary-style key/value lookups, this is one of the most efficient databases I have seen, and it is also very small, which makes it a good fit for matching short URLs to long URLs.
The system uses a 6-character short code to represent a URL of any length.
Valid characters are ASCII 'a' to 'z' and '0' to '5', so each character has 2^5 (32) possible states. Six characters can therefore represent 32^6 (1,073,741,824) URLs.
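As a rough illustration of this 32-character variant, the sketch below maps a 30-bit value to six characters, 5 bits at a time. The class name Base32Code, the method Encode, and the order of the alphabet are assumptions made for the example; the text only fixes which characters are allowed.

```csharp
using System;

static class Base32Code
{
    // 26 lowercase letters plus the digits 0 to 5: 32 symbols, 5 bits each.
    // (The ordering is an assumption; the text does not specify it.)
    const string Alphabet = "abcdefghijklmnopqrstuvwxyz012345";

    // Maps a 30-bit value to a 6-character code, lowest 5 bits first.
    public static string Encode(int value30)
    {
        var code = new char[6];
        for (int j = 0; j < 6; j++)
        {
            code[j] = Alphabet[value30 & 0x1F]; // 0x1F keeps exactly 5 bits
            value30 >>= 5;
        }
        return new string(code);
    }

    static void Main()
    {
        Console.WriteLine(Encode(0));          // aaaaaa
        Console.WriteLine(Encode(0x3FFFFFFF)); // 555555
        Console.WriteLine(1L << 30);           // 1073741824, which equals 32^6
    }
}
```

With a 5-bit mask and a 32-symbol alphabet, every 30-bit value gets its own code, which is where the 32^6 figure comes from.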
First, you need a database table to store and retrieve your mapped URLs.
CREATE TABLE mappedURL (
    shortCode CHAR(6) NOT NULL,
    longURL TEXT NOT NULL,
    PRIMARY KEY (shortCode)
);
Second, you need to define an algorithm that maps long URLs to short URLs. That algorithm is described above.
Third, you need a web page that looks up the original URL for a short code in the database and redirects to it.
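The third step could look roughly like the sketch below. It is not tied to any web framework: a Dictionary stands in for the mappedURL table, and the hypothetical Resolve method only decides which HTTP status and Location value the page would send. The sample row and the choice of 302 over 301 are assumptions for the example.

```csharp
using System;
using System.Collections.Generic;

static class Redirector
{
    // Stand-in for the mappedURL table: shortCode -> long URL.
    static readonly Dictionary<string, string> Table =
        new Dictionary<string, string>
        {
            // Made-up row, for illustration only.
            { "abc123", "http://www.freemud.cn/some/long/path?id=42" }
        };

    // Returns the HTTP status and, on a hit, the Location header value.
    public static (int Status, string Location) Resolve(string shortCode)
    {
        if (shortCode != null && Table.TryGetValue(shortCode, out var longUrl))
            return (302, longUrl); // found: redirect to the original URL
        return (404, null);        // unknown code
    }

    static void Main()
    {
        var hit = Resolve("abc123");
        Console.WriteLine($"{hit.Status} {hit.Location}");
        Console.WriteLine(Resolve("zzzzzz").Status); // 404
    }
}
```

In a real page the lookup would be a query on the shortCode primary key, and the status and Location would be written to the HTTP response.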
MD5 has been broken, so an attacker could deliberately craft a different URL with the same MD5 value for malicious purposes. Absent such an attack, the chance of an accidental MD5 collision is low enough that you will probably never see one in your lifetime.
I also do not see what practical use the requirement "the same URL must always produce the same key" has. Even if the same URL maps to different keys, that would not waste much, would it? Even a 6-character alphanumeric code allows billions of combinations.
I asked this question because I was worried about MD5 collisions. The same URL has to map to the same key because each URL must correspond to exactly one row in the database, and querying directly by URL is slow because:
The number of URLs and related records to be stored is very large.
Some URLs can be very long, so they are stored in a text field.
If the hash is stored as the unique key in a varchar column, querying by key is convenient and fast.
As with object hashes in Git, there is no need to worry about conflicts.
How are URL shortening services such as bit.ly implemented?
Do you need to look the URL up again from its hash key? If so, the URL must be stored somewhere, so that it can be rehashed when a conflict occurs.
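One possible way to handle such a conflict, given that the algorithm produces four candidate codes per URL, is to try the candidates in order and compare the stored URL on each hit. The sketch below assumes this strategy; it is not something the discussion prescribes. CodeAllocator and Assign are hypothetical names, a Dictionary stands in for the database, and the candidate codes and example URLs are written by hand.

```csharp
using System;
using System.Collections.Generic;

static class CodeAllocator
{
    // Stand-in for the database: shortCode -> long URL.
    static readonly Dictionary<string, string> Table =
        new Dictionary<string, string>();

    // Tries each candidate in turn. Returns the code now mapped to url,
    // or null when every candidate is taken by a different URL.
    public static string Assign(string url, string[] candidates)
    {
        foreach (var code in candidates)
        {
            if (!Table.TryGetValue(code, out var existing))
            {
                Table[code] = url; // free code: claim it
                return code;
            }
            if (existing == url)
                return code;       // this URL was stored earlier
        }
        return null;               // all candidates collide
    }

    static void Main()
    {
        // Hand-written candidates; in practice they would come from ShortUrl(url).
        Console.WriteLine(Assign("http://a.example/1", new[] { "aaaaaa", "bbbbbb" })); // aaaaaa
        Console.WriteLine(Assign("http://a.example/2", new[] { "aaaaaa", "cccccc" })); // cccccc
        Console.WriteLine(Assign("http://a.example/1", new[] { "aaaaaa", "bbbbbb" })); // aaaaaa
    }
}
```

Storing the long URL next to the code is what makes the comparison possible; without it, two different URLs with the same code could not be told apart.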
MD5 is a 128-bit hash code (4 integers of 4 bytes each). An MD5 value therefore has 2^128 possible values. The probability that two randomly chosen URLs have the same MD5 value is 1 in 2^128, that is, r = 2^-128.
Suppose each URL is hashed with MD5 before it is inserted into the database. The first insert cannot be a duplicate. The second duplicates the first with probability r. The third duplicates an earlier one with probability 2 × r, and so on: the nth duplicates an earlier one with probability (n-1) × r. The probability that some two of the n MD5 values coincide is approximately the sum of these probabilities: (1 + 2 + 3 + ... + (n-1)) × r = (1/2) × n × (n-1) × r.
For a set of n MD5 values, the probability of a duplicate is therefore approximately (1/2) × (n / 2^64)^2.
So collisions need to be considered only when n becomes comparable to 2^64, and 2^64 is still a very large number.
Therefore, barring a malicious attack, a typical application is very unlikely to see a collision.
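To put numbers on this estimate, the sketch below evaluates the sum n × (n-1) / 2 × 2^-bits for two cases: the full 128-bit MD5 value discussed here, and the 30 bits that survive in a 6-character short code. The class name CollisionEstimate, the method Bound, and the sample sizes are chosen for illustration only.

```csharp
using System;

static class CollisionEstimate
{
    // Sum from the text: (1 + 2 + ... + (n-1)) * r, with r = 2^-bits.
    // This is an upper bound on the collision probability, and close to
    // the true value while the result is small.
    public static double Bound(double n, int bits)
    {
        return n * (n - 1) / 2 * Math.Pow(2, -bits);
    }

    static void Main()
    {
        Console.WriteLine(Bound(1e9, 128)); // full MD5, one billion URLs
        Console.WriteLine(Bound(1e4, 30));  // 30-bit code, ten thousand URLs
    }
}
```

The first value is on the order of 1.5e-21, which supports the conclusion for keys that store the whole hash. The second is about 0.047, so the argument does not carry over to the truncated 6-character codes: they still need an explicit duplicate check.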