MP3 search engine-algorithm [Baidu]

Source: Internet
Author: User

Assume that an MP3 search engine indexes 2 ^ 24 songs and records 2 ^ 30 URLs where they can be listened to, with no more than 2 ^ 10 URLs for each song. The system regularly checks these URLs. If a URL is unavailable, it does not appear in the search results. Each song name and each URL is uniquely identified by an integer, song_id and URL_ID respectively. This system has the following requirements:
1) Use song_id to search for the URL_ID of a song and provide the URL_ID count and list.
2) Given a song_id, add a new URL_ID to it.
3) Add a new song_id.
4) Set a URL_ID to unavailable.

Restrictions: the memory usage cannot exceed 1 GB, the size of a single file cannot exceed 2 GB, and the number of files in a directory cannot exceed 128.
For the best performance, describe the data structure, the search algorithm, and the resource consumption. If the system's data volume grows, how can the data be distributed across multiple machines?

Online solution:

The memory is insufficient to store these URLs, so data is written into several files.

The information format of each song stored in the file is (url1, url2 ......).

The total file size is about 2 ^ 24 * 2 ^ 10 * 4 = 2 ^ 36 bytes = 64 GB. The file that holds a song can be determined from its songid, so a random query costs roughly one file open plus one seek and read.

Store the information of each song as a fixed-size record in a file. Because each song has at most 2 ^ 10 URLs, the record for each song is 2 ^ 10 ints, and each int identifies a URL. -1 indicates that the slot holds no URL. During initialization, every int in the file is set to -1.
In this way, the information for each songid occupies 2 ^ 10 * 4 = 4 KB, and each file is 1 GB, so each file can store 2 ^ 30 / 2 ^ 12 = 2 ^ 18 = 256 K songs. A total of 2 ^ 24 / 2 ^ 18 = 64 files are required. These files are numbered from 0 to 63.
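The sizing arithmetic above can be checked with a few lines of Python. This is only a worked calculation; the constant names are chosen here for illustration and do not come from the original solution.

```python
# Sizing arithmetic for the on-disk layout (all quantities in bytes).
SONGS = 1 << 24              # 2^24 songs
MAX_URLS_PER_SONG = 1 << 10  # at most 2^10 URLs per song
INT_SIZE = 4                 # each URL_ID slot is a 4-byte int
FILE_SIZE = 1 << 30          # 1 GB per file, under the 2 GB limit

slot_bytes = MAX_URLS_PER_SONG * INT_SIZE  # bytes reserved per song
total_bytes = SONGS * slot_bytes           # bytes across all files
songs_per_file = FILE_SIZE // slot_bytes   # songs stored in one file
file_count = SONGS // songs_per_file       # files needed in total

print(slot_bytes, total_bytes >> 30, songs_per_file, file_count)
# prints: 4096 64 262144 64
```

With 64 files of 1 GB each, the layout stays inside both the 2 GB per-file limit and the 128 files-per-directory limit.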

For any songid, the ID of the file that holds its URL information is songid >> 18. The location in the file is (songid & 0x3ffff) << 12.
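As a minimal sketch of this address calculation, the hypothetical helper `locate` below splits a 24-bit song ID into a file number (high 6 bits) and a byte offset (low 18 bits times the 4 KB record size). The function name and the range check are assumptions added for illustration.

```python
ROW_BITS = 18      # low 18 bits select the record inside a file
RECORD_SHIFT = 12  # each record is 2^12 = 4096 bytes


def locate(song_id):
    """Return (file_id, byte_offset) for a 24-bit song_id."""
    if not 0 <= song_id < (1 << 24):
        raise ValueError("song_id must fit in 24 bits")
    file_id = song_id >> ROW_BITS                       # 0..63
    offset = (song_id & 0x3FFFF) << RECORD_SHIFT        # start of the record
    return file_id, offset


print(locate(1))              # (0, 4096)
print(locate(1 << 18))        # (1, 0)
print(locate((1 << 24) - 1))  # (63, 1073737728)
```

The last record of the last file starts at 2 ^ 30 - 4096, so every record fits inside its 1 GB file.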

In addition, an in-memory array of 2 ^ 24 short ints saves the number of URLs for each song. The array is named urlcount [] and every entry is initialized to -1, which means that the corresponding song_id does not exist. This array occupies 2 ^ 25 bytes = 32 MB.

URL availability is tracked with a bitmap, one bit per URL_ID. The bitmap is kept in memory and occupies 2 ^ 30 / 8 = 2 ^ 27 bytes = 128 MB.
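The two in-memory structures might look like the sketch below. The source does not say which bit value means "available"; this sketch assumes a set bit means unavailable, so a zero-filled bitmap treats every URL as available by default. `make_urlcount` and `UrlBitmap` are hypothetical names, and the demo uses small capacities to stay light.

```python
from array import array


def make_urlcount(song_capacity=1 << 24):
    """One signed short per song_id; -1 marks a song_id that does not exist."""
    return array('h', [-1]) * song_capacity


class UrlBitmap:
    """One bit per URL_ID. Assumption: a set bit means 'unavailable'."""

    def __init__(self, url_capacity=1 << 30):
        self.bits = bytearray(url_capacity >> 3)  # zero-filled: all available

    def set_unavailable(self, url_id):
        self.bits[url_id >> 3] |= 1 << (url_id & 7)

    def is_available(self, url_id):
        return not self.bits[url_id >> 3] & (1 << (url_id & 7))


# Small capacities for the demo; the defaults give the full-scale sizes.
urlcount = make_urlcount(1 << 10)
bitmap = UrlBitmap(1 << 16)
bitmap.set_unavailable(12345)
print(urlcount[7], bitmap.is_available(12345), bitmap.is_available(12346))

# Full-scale memory footprint in MB (counter array plus bitmap).
full_scale_mb = (urlcount.itemsize * (1 << 24) + ((1 << 30) >> 3)) >> 20
print(full_scale_mb)
```

Where a short is 2 bytes, the two structures together take about 32 MB + 128 MB = 160 MB, well under the 1 GB memory limit.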

Required operations:
1) Use song_id to search for the URL_ID of a song and provide the URL_ID count and list.
The file number and the location in the file are calculated from song_id. The number of URLs, n, is read from urlcount [], the first n URL_IDs are read from the file, and the bitmap is queried for each URL_ID to check whether that URL is available. If it is, the URL_ID is added to the return list.

2) Given a song_id, add a new URL_ID to it.
The file number and the location in the file are calculated from song_id; call the location start. The number of URLs is obtained from urlcount []. Assuming there are n URLs, write the new URL_ID at start + sizeof (int) * n in the file, then increment urlcount [song_id].

3) Add a new song_id.
Check urlcount [song_id]. If it is -1, change it to 0. If it is greater than or equal to 0, the song_id already exists and nothing is modified.

4) Set a URL_ID to unavailable.
Set the bit corresponding to URL_ID in the URL bitmap to mark it as unavailable.
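The four operations can be put together in one scaled-down sketch. Everything named here (`SongUrlStore`, its parameters, the `N.dat` file naming, the little-endian int encoding) is an assumption for illustration. Two simplifications are also assumed: a set bit in the bitmap means unavailable, and files are not pre-filled with -1 because urlcount [] already bounds every read.

```python
import os
import struct
import tempfile
from array import array

INT_SIZE = 4  # bytes per stored URL_ID


class SongUrlStore:
    def __init__(self, directory, song_bits=24, row_bits=18,
                 slot_bits=10, url_bits=30):
        self.directory = directory
        self.row_bits = row_bits        # low bits: record number inside a file
        self.max_urls = 1 << slot_bits  # URL slots per song
        self.row_shift = slot_bits + 2  # log2(bytes per record), 12 at full scale
        self.urlcount = array('h', [-1]) * (1 << song_bits)  # -1: song absent
        self.unavailable = bytearray((1 << url_bits) >> 3)   # set bit: URL down

    def _locate(self, song_id):
        file_id = song_id >> self.row_bits
        offset = (song_id & ((1 << self.row_bits) - 1)) << self.row_shift
        return os.path.join(self.directory, "%d.dat" % file_id), offset

    def add_song(self, song_id):         # operation 3
        if self.urlcount[song_id] >= 0:
            return False                 # song_id already exists
        self.urlcount[song_id] = 0
        return True

    def add_url(self, song_id, url_id):  # operation 2
        n = self.urlcount[song_id]
        if n < 0:
            raise KeyError("unknown song_id")
        if n >= self.max_urls:
            raise OverflowError("song already has the maximum number of URLs")
        path, start = self._locate(song_id)
        mode = "r+b" if os.path.exists(path) else "w+b"
        with open(path, mode) as f:
            f.seek(start + INT_SIZE * n)  # first free slot of this record
            f.write(struct.pack("<i", url_id))
        self.urlcount[song_id] = n + 1

    def set_unavailable(self, url_id):   # operation 4
        self.unavailable[url_id >> 3] |= 1 << (url_id & 7)

    def get_urls(self, song_id):         # operation 1
        n = self.urlcount[song_id]
        if n <= 0:
            return 0, []
        path, start = self._locate(song_id)
        with open(path, "rb") as f:
            f.seek(start)
            ids = struct.unpack("<%di" % n, f.read(INT_SIZE * n))
        live = [u for u in ids
                if not self.unavailable[u >> 3] & (1 << (u & 7))]
        return len(live), live


def demo():
    # Tiny sizes: 2^8 songs, 2^4 songs per file, 2^3 URLs per song, 2^10 URLs.
    with tempfile.TemporaryDirectory() as d:
        store = SongUrlStore(d, song_bits=8, row_bits=4, slot_bits=3, url_bits=10)
        store.add_song(37)
        store.add_url(37, 5)
        store.add_url(37, 9)
        store.set_unavailable(5)
        return store.get_urls(37)


print(demo())  # (1, [9])
```

Each query or insert touches one file with one seek, and marking a URL unavailable touches only memory, which matches the cost estimate given earlier.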

Core: 1. Use a bitmap to mark which URLs are available. The bitmap is kept in memory and occupies 2 ^ 30 / 8 = 2 ^ 27 bytes = 128 MB;

2. In memory, an array of 2 ^ 24 short ints saves the number of URLs for each song. The array is named urlcount [] and every entry is initialized to -1, which means that the corresponding song_id does not exist. This array occupies 2 ^ 25 bytes = 32 MB;

3. The total file size is about 2 ^ 24 * 2 ^ 10 * 4 = 2 ^ 36 bytes = 64 GB. (The file that holds a song can be calculated from its songid, so the time consumed by a random query is the time to open one file, seek, and read the data, which is at about the millisecond level.) A total of 64 files are required, numbered from 0 to 63. 6 bits are needed to number them, so the high 6 bits of the 24-bit song ID select the file (for any songid, stored as a 32-bit int, the ID of the file that holds its URL information is songid >> 18).

4. Save the information of each song as a fixed-size record in a file. Because each song has at most 2 ^ 10 URLs, the record is 2 ^ 10 ints, and each int identifies a URL. -1 indicates that the slot holds no URL. During initialization, every int in the file is set to -1. In this way, each songid occupies 2 ^ 10 * 4 = 2 ^ 12 bytes = 4 KB. For any songid, the location in the file is (songid & 0x3ffff) << 12 (the low 18 bits are the "row number" of the songid within its file).

5. To distribute the data across multiple machines, use the leading bits of the high 6 bits of the songid to determine which machine holds a song.

6. In that case, you can also consider making each file smaller, for example 512 MB, and using more files.
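One way to read point 5 is sketched below: the leading bits of the 6-bit file number pick the machine, and the remaining bits pick the file on that machine. The function `route` and the `machine_bits` parameter are hypothetical, and the source does not specify how many bits to use.

```python
FILE_BITS = 6   # high 6 bits of the 24-bit song_id number the 64 files
ROW_BITS = 18   # low 18 bits select the record inside a file


def route(song_id, machine_bits=2):
    """Return (machine, local_file_id) for a 24-bit song_id.

    machine_bits leading bits of the file number select one of
    2**machine_bits machines; the rest number the files on that machine.
    """
    if not 0 <= machine_bits <= FILE_BITS:
        raise ValueError("machine_bits must be between 0 and 6")
    file_id = song_id >> ROW_BITS
    local_bits = FILE_BITS - machine_bits
    machine = file_id >> local_bits
    local_file_id = file_id & ((1 << local_bits) - 1)
    return machine, local_file_id


print(route(1 << 22))        # (1, 0)
print(route((1 << 24) - 1))  # (3, 15)
```

With 2 machine bits, each of the 4 machines holds 16 of the 64 files, and the in-file offset calculation stays unchanged.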
