Several big data interview questions


Before tackling big data interview questions, a few basic concepts should be clear:

(1) 1 GB ≈ 10^9 bytes (about 1 billion bytes); in binary units, 1 GB = 1024 MB, 1 MB = 1024 KB, 1 KB = 1024 bytes;

(2) The basic approach is to decompose a big problem into small problems, solve the small problems, and then combine the local results into the global answer. (Of course, if the data fits in memory, solve it directly without any decomposition.)

(3) A common way to decompose is hash(x) % m, where x is the string / url / ip and m is the number of small problems; for example, to split a large file into 1000 pieces, take m = 1000 (see the sketch after this list);

(4) Auxiliary data structures for the sub-problems: hash_map, Trie tree, bitmap, balanced binary search trees (AVL, SBT, red-black tree);

(5) Top-K problems: for the largest K elements use a min-heap, for the smallest K elements use a max-heap. (Why? Work through a small example on paper and you will see: the heap root is the weakest member of the current best K, so it is the only element that ever needs to be compared and replaced.)

(6) Common sorting for processing big data: quick sort / heap sort / merge sort / bucket sort
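As a concrete illustration of the hash(x) % m decomposition in (3), here is a minimal C++ sketch that splits a large line-oriented file (urls, words, IPs, ...) into m bucket files. The bucket file names and the choice m = 1000 are illustrative assumptions, not taken from the original questions.

    // Sketch: split a large line-oriented file into m bucket files by hash(line) % m.
    // File names and m are illustrative assumptions.
    #include <fstream>
    #include <functional>
    #include <string>
    #include <vector>

    void partition_by_hash(const std::string& input_path, int m) {
        // One output stream per bucket; with m = 1000 this may require raising
        // the OS open-file limit, or processing the buckets in groups.
        std::vector<std::ofstream> buckets(m);
        for (int i = 0; i < m; ++i)
            buckets[i].open("bucket_" + std::to_string(i) + ".txt");

        std::ifstream in(input_path);
        std::hash<std::string> h;
        std::string line;
        while (std::getline(in, line)) {
            // Equal lines always hash to the same value, so they land in the same bucket.
            buckets[h(line) % m] << line << '\n';
        }
    }

Because identical items always fall into the same bucket, each bucket can later be processed independently in memory, which is exactly what the divide-and-conquer solutions below rely on.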


Here are a few examples (the solution to each question is not unique; only one of many possible solutions is listed below):

1. Given two files a and b, each storing 5 billion urls, with each url taking 64 bytes and a memory limit of 4 GB, how do you find the urls common to a and b?

Each file is about 5 × 10^9 urls × 64 bytes ≈ 320 GB, far larger than the 4 GB memory limit, so neither file can be loaded into memory completely. Divide and conquer can be used instead.

Step1: Traverse file a; for each url compute hash(url) % 1000 and append the url to one of 1000 small files (denoted a0, a1, ..., a999) according to that value. Each small file is then about 320 MB;

Step2: Traverse file b and distribute its urls into 1000 small files in the same way (denoted b0, b1, ..., b999);

Key observation: after this processing, any url that appears in both files must end up in a corresponding pair of small files (a0 vs b0, a1 vs b1, ..., a999 vs b999), because identical urls produce identical hash values; non-corresponding pairs cannot share any url. So we only need to look for common urls within each of the 1000 pairs.

Step3: To find the common urls of a pair ai and bi, load the urls of ai into a hash_set (or hash_map). Then iterate over each url of bi and check whether it is in the hash_set just built; if it is, it is a common url and can be written to an output file.
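A minimal sketch of Step3 for one pair of small files, assuming hash_set means std::unordered_set; the file paths and output stream are illustrative:

    // Sketch: find urls that appear in both small files a_i and b_i.
    #include <fstream>
    #include <string>
    #include <unordered_set>

    void common_urls(const std::string& a_path, const std::string& b_path,
                     std::ofstream& out) {
        std::unordered_set<std::string> seen;        // all urls of a_i (~320 MB, fits in 4 GB)
        std::ifstream a(a_path);
        std::string url;
        while (std::getline(a, url)) seen.insert(url);

        std::ifstream b(b_path);
        while (std::getline(b, url)) {
            if (seen.count(url)) out << url << '\n'; // url occurs in both files
        }
    }

Running this over the 1000 pairs and concatenating the outputs gives all common urls.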


2. There is a 1 GB file in which each line is a word of at most 16 bytes; with a memory limit of 1 MB, return the 100 words with the highest frequency.

Step1: Read the file sequentially; for each word x compute hash(x) % 5000 and append the word to one of 5000 small files (denoted f0, f1, ..., f4999), so each file is about 200 KB. If any file still exceeds the 1 MB limit, keep splitting it in the same way until every small file is under 1 MB;

Step2: For each small file, count the words it contains and their frequencies (a trie tree or hash_map can be used), take out the 100 most frequent words (a min-heap with 100 nodes works), and write these 100 words with their frequencies to a new file; this yields 5000 result files;

Step3: Merge these 5000 result files into the global top 100 (similar to the merge step of merge sort; the same 100-node min-heap can be run over all the partial results). Note that because the words were partitioned by hash, each word's full count lives in exactly one small file, so the per-file counts are exact.
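A minimal sketch of the counting and min-heap part of Step2 for one small file, assuming hash_map means std::unordered_map and that the file holds one word per line; the merge in Step3 runs the same size-100 heap over the 5000 partial (frequency, word) results:

    // Sketch: top-100 most frequent words of one small file, using a size-100 min-heap.
    #include <fstream>
    #include <functional>
    #include <queue>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    using Entry = std::pair<long long, std::string>;   // (frequency, word)

    std::vector<Entry> top100(const std::string& path) {
        std::unordered_map<std::string, long long> freq;
        std::ifstream in(path);
        std::string word;
        while (std::getline(in, word)) ++freq[word];    // count every word

        // Min-heap keyed on frequency: the root is the weakest of the current top 100,
        // so any word whose count beats it replaces it.
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
        for (const auto& kv : freq) {
            heap.emplace(kv.second, kv.first);
            if (heap.size() > 100) heap.pop();
        }

        std::vector<Entry> result;
        while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
        return result;                                  // ascending by frequency
    }

This is also why item (5) above pairs the largest K with the smallest heap: each candidate costs only O(log K), and only the heap root ever needs to be evicted.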


3. A massive amount of log data is stored in one very large file that cannot be read into memory; extract the IP that accessed Baidu most often on a certain day.

Step1: From that day's log data, extract the IPs that accessed Baidu and write them one by one into a large file;

Step2: Note that an IP is 32 bits, so there are at most 2^32 distinct IPs. Use the same mapping method, e.g. hash(ip) % 1000, to split the large file into 1000 small files;

Step3: For each small file, find the IP with the highest frequency (use a hash_map for frequency counting, then take the most frequent entry) together with that frequency;

Step4: Among these 1000 candidate IPs, pick the one with the highest frequency; that is the answer.
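A minimal sketch of Step3 and Step4, assuming the bucket files from Step2 are named ip_bucket_0.txt ... ip_bucket_999.txt (an illustrative naming, not from the original text). Because the split is hash(ip) % 1000, every occurrence of a given IP falls into the same bucket, so comparing the per-bucket winners gives the global winner:

    // Sketch: most frequent IP. Count each bucket file with a hash map,
    // then keep the best (ip, count) seen across all 1000 buckets.
    #include <fstream>
    #include <string>
    #include <unordered_map>

    int main() {
        std::string best_ip;
        long long best_count = 0;

        for (int i = 0; i < 1000; ++i) {                  // one pass per small file
            std::unordered_map<std::string, long long> freq;
            std::ifstream in("ip_bucket_" + std::to_string(i) + ".txt");
            std::string ip;
            while (std::getline(in, ip)) ++freq[ip];      // frequency of each IP in this bucket

            for (const auto& kv : freq)                   // this bucket's most frequent IP
                if (kv.second > best_count) { best_count = kv.second; best_ip = kv.first; }
        }
        // best_ip now holds the IP with the highest overall frequency.
        return 0;
    }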
