Big Data interview

Document directory
  • 1. From massive log data, extract the IP address that visited Baidu most often on a given day
  • 2. A search engine logs every query string a user searches for; each string is 1 to 255 bytes long. Suppose there are 10 million records. The query strings repeat heavily: although the total is 10 million, there are no more than 3 million distinct strings, and the more often a string repeats, the more users queried it, i.e. the more popular it is. Count the top 10 query strings using no more than 1 GB of memory.
  • 3. A 1 GB file contains one word per row, and no word exceeds 16 bytes; memory is limited to 1 MB. Return the top 100 most frequent words.
  • 4. There are 10 files of 1 GB each. Each row of each file stores a user query, and queries may repeat across files. Sort the queries by frequency.
  • 5. Given two files a and b that each store 5 billion URLs, with each URL occupying 64 bytes and a memory limit of 4 GB, find the URLs common to a and b.
  • 6. Find the non-repeated integers among 0.25 billion integers. Note that memory is insufficient to hold all 0.25 billion integers.
  • 7. Find the median of 0.5 billion int values
1. Extract the IP address with the most visits to Baidu on a certain day from massive log data

An IP address is a 32-bit binary number, so there are N = 2^32 ≈ 4 billion possible addresses. A hashmap over all of them would need roughly (4 + 4) bytes × 4G ≈ 32 GB, far more than the available memory. Solution: use the high 8 bits of each address to split the log into 2^8 = 256 files, run a hashmap over each file (each covers a disjoint address range) and record that file's most frequent IP, then take the maximum over the 256 per-file winners (see the sketch for question 1 below).

2. Top 10 query strings among 10 million records (at most 3 million distinct), within 1 GB of memory

(1) Count how many times every query appears. A hashmap does this in O(N), versus O(N log N) for merge sort; since the 3 million distinct queries fit in 1 GB, the hashmap is the better choice. (2) Scan the counts with a priority queue (min-heap) of size 10, which finds the top 10 in O(N log 10). This is the classic top-K algorithm (see the sketch for question 2 below).

3. Top 100 words in a 1 GB file with 1 MB of memory

Same idea as extracting the most frequently accessed IP from massive logs. Hash each word modulo 2000 to split the large file into 2000 small files; the distinct words in each small file then total about 500 KB, so a hashmap over one file fits in 1 MB. Because hashing sends every copy of a word to the same file, taking each file's top 100 with a priority queue is safe; a k-way merge of the 2000 candidate lists then gives the global top 100 (see the sketch for question 3 below).

4. Sort the queries in 10 files of 1 GB each by frequency

Same idea as question 1. Read the 10 files in sequence, hash each query modulo 10, and redistribute the queries into 10 new files so that all copies of a query land in the same file. Count each new file with a hashmap, sort its (query, count) pairs by count, and merge-sort the 10 sorted runs into the final frequency-ordered output (see the sketch for question 4 below).

5. Common URLs of files a and b (5 billion URLs each, 64 bytes per URL, 4 GB of memory)

Hash each of the 5 billion URLs modulo 128. File a is split into 128 small files and file b into another 128 with the same hash, so small files with different numbers cannot contain the same URL. For each pair (a_i, b_i), load a_i into a hash set (about 320 GB / 128 ≈ 2.5 GB, within the 4 GB limit) and stream b_i through it to collect the common URLs (see the sketch for question 5 below).

6. Find non-repeated integers among 0.25 billion integers that do not fit in memory

Method 1: divide and conquer, as in question 1. Method 2: a 2-bitmap, i.e. 2 bits per possible value: 00 means not seen, 01 means seen once, 10 means seen multiple times, and 11 is unused. The total memory is 2^32 × 2 bits = 1 GB, which is acceptable (see the sketch for question 6 below).

7. Find the median of 0.5 billion int values

Split the 0.5 billion values into 2^8 = 256 files keyed by the high 8 bits, then count how many numbers each file contains. Accumulating the counts tells you which file holds the median and the median's rank within that file; that single file can then be sorted (or split again on its next 8 bits) to read off the median (see the sketch for question 7 below).
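A minimal Python sketch of the answer to question 1. The two passes (partition by high byte, then count each bucket) follow the answer above; the log format (one dotted-quad IP per line) and the bucket file names are assumptions for illustration.

```python
from collections import Counter

# Pass 1: split the huge log into 256 bucket files keyed by the high 8 bits
# of the IP, so each bucket is small enough to count in memory.
def partition_by_high_byte(log_path):
    buckets = [open("ip_bucket_%03d.txt" % i, "w") for i in range(256)]
    with open(log_path) as log:
        for line in log:
            ip = line.strip()
            high = int(ip.split(".")[0])   # high 8 bits of the 32-bit address
            buckets[high].write(ip + "\n")
    for b in buckets:
        b.close()

# Pass 2: find the most frequent IP inside each bucket, then take the global max.
def most_frequent_ip(log_path):
    partition_by_high_byte(log_path)
    best_ip, best_count = None, 0
    for i in range(256):
        with open("ip_bucket_%03d.txt" % i) as f:
            counts = Counter(line.strip() for line in f)
        if counts:
            ip, cnt = counts.most_common(1)[0]
            if cnt > best_count:
                best_ip, best_count = ip, cnt
    return best_ip, best_count
```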
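For question 2, the full count table fits in memory, so a sketch needs only one counting pass plus a size-10 heap; here `queries` is assumed to be any iterable over the 10 million query strings (e.g. lines of the log file).

```python
import heapq
from collections import Counter

def top10_queries(queries):
    counts = Counter(queries)  # O(N): at most 3 million distinct keys fit in 1 GB
    # heapq.nlargest keeps a heap of size 10 internally: O(N log 10)
    return heapq.nlargest(10, counts.items(), key=lambda kv: kv[1])
```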
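A sketch of question 3, again with assumed partition-file names. Note that Python randomizes str hashing per process, so partitioning and counting must happen in the same run (or with PYTHONHASHSEED fixed); keeping 2000 files open at once may also require raising the OS open-file limit.

```python
import heapq
from collections import Counter

def top100_words(path, k_parts=2000):
    # Pass 1: hash-partition, so every copy of a word lands in the same small file.
    parts = [open("word_part_%04d.txt" % i, "w") for i in range(k_parts)]
    with open(path) as f:
        for line in f:
            word = line.strip()
            parts[hash(word) % k_parts].write(word + "\n")
    for p in parts:
        p.close()

    # Pass 2: per-partition top 100, then merge the 2000 candidate lists.
    candidates = []  # (count, word) pairs from every partition
    for i in range(k_parts):
        with open("word_part_%04d.txt" % i) as p:
            counts = Counter(line.strip() for line in p)
        candidates.extend((c, w) for w, c in counts.most_common(100))
    return heapq.nlargest(100, candidates)
```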
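Question 4 in the same style; the input paths and bucket names are placeholders. The final step is a k-way merge over the 10 descending (count, query) runs, mirroring the merge-sort step in the answer.

```python
import heapq
from collections import Counter

def sort_queries_by_frequency(input_paths):
    # Repartition: hash(query) % 10 puts all copies of a query in one bucket.
    buckets = [open("query_bucket_%d.txt" % i, "w") for i in range(10)]
    for path in input_paths:
        with open(path) as f:
            for line in f:
                q = line.strip()
                buckets[hash(q) % 10].write(q + "\n")
    for b in buckets:
        b.close()

    # Count and sort each bucket independently, then k-way merge the runs.
    runs = []
    for i in range(10):
        with open("query_bucket_%d.txt" % i) as b:
            counts = Counter(line.strip() for line in b)
        runs.append(sorted(((c, q) for q, c in counts.items()), reverse=True))
    return heapq.merge(*runs, reverse=True)  # iterator of (count, query), descending
```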
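Question 5: both files are partitioned with the same hash, so only same-numbered pairs need to be intersected. File names are illustrative; per the answer above, each a_i (about 2.5 GB) is assumed to fit in the 4 GB budget as a hash set.

```python
def split(path, prefix, k=128):
    outs = [open("%s_%03d.txt" % (prefix, i), "w") for i in range(k)]
    with open(path) as f:
        for line in f:
            url = line.strip()
            outs[hash(url) % k].write(url + "\n")
    for o in outs:
        o.close()

def common_urls(path_a, path_b, k=128):
    split(path_a, "a", k)
    split(path_b, "b", k)
    common = []
    for i in range(k):  # a URL present in both files must share a bucket number
        with open("a_%03d.txt" % i) as fa:
            seen = set(line.strip() for line in fa)
        with open("b_%03d.txt" % i) as fb:
            common.extend(u for u in (line.strip() for line in fb) if u in seen)
    return common
```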
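The 2-bitmap of question 6 (method 2) as a sketch. bytearray(2**32 // 4) really allocates the 1 GB table from the answer, so this needs a machine with that much free memory; the final linear scan is written for clarity, not speed.

```python
def unique_integers(numbers):
    bitmap = bytearray(2 ** 32 // 4)  # 2 bits per value -> 4 values per byte = 1 GB
    for n in numbers:
        n &= 0xFFFFFFFF               # view the int as an unsigned 32-bit value
        byte, slot = divmod(n, 4)
        state = (bitmap[byte] >> (slot * 2)) & 0b11
        if state < 2:                 # 00 -> 01 (seen once), 01 -> 10 (many), 10 stays
            bitmap[byte] = (bitmap[byte] & ~(0b11 << (slot * 2))) | ((state + 1) << (slot * 2))
    for n in range(2 ** 32):          # report every value whose state is exactly 01
        byte, slot = divmod(n, 4)
        if (bitmap[byte] >> (slot * 2)) & 0b11 == 0b01:
            yield n
```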
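Question 7's first pass as a counting-only sketch. It assumes the values are non-negative 32-bit ints (for signed ints the high byte no longer orders the buckets, so the sign would have to be folded in first); `total` is the overall count, 0.5 billion. The second pass, reading back just the located bucket, is described in the answer above.

```python
def locate_median_bucket(numbers, total):
    counts = [0] * 256
    for n in numbers:
        counts[n >> 24] += 1     # high 8 bits select one of the 256 buckets
    target = (total - 1) // 2    # 0-based rank of the (lower) median
    running = 0
    for bucket, c in enumerate(counts):
        if running + c > target:
            return bucket, target - running  # bucket id, median's rank inside it
        running += c
```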