Summary of Approaches to Big Data Problems

Source: Internet
Author: User

1) Given a log file larger than 100 GB that stores an IP address in each record, design an algorithm to find the most frequently occurring IP address. (Follow-ups: how do you find the top K IPs, and how would you implement it with Linux commands?)

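Hash bucketing:

Split the 100 GB file into 1000 smaller files, sending every IP address to the file given by file_id = hash(IP) % 1000. The same IP always lands in the same file, so each file holds the complete count for every IP it contains.

Count the IPs in each file separately and record that file's most frequent IP, then compare the per-file winners to get the overall answer.

For the top K, distribute the data to different files with the same hash bucketing;

compute the top K of each file, then merge the per-file results and keep the K largest.

With Linux commands, assuming the IPs have already been extracted one per line into a file (ips.txt is a placeholder name), sort ips.txt | uniq -c | sort -nr | head -n K prints the K most frequent IPs. sort spills to temporary files, so the input does not have to fit in memory.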
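The bucketing and merge steps above can be sketched in Python as follows. This is a minimal illustration rather than a tuned implementation: it assumes the log has already been reduced to one IP per line, and the function name top_k_ips, the CRC32 hash, and the bucket file names are choices made for this example only.

```python
import heapq
import os
import tempfile
import zlib
from collections import Counter


def top_k_ips(log_path, k, num_buckets=1000, workdir=None):
    """Return the k most frequent IPs as (count, ip) pairs, largest first.

    Assumes one IP per line. All buckets are held open at once, so
    num_buckets must stay below the process's open-file limit.
    """
    workdir = workdir or tempfile.mkdtemp()
    paths = [os.path.join(workdir, f"bucket_{i}.txt") for i in range(num_buckets)]
    buckets = [open(p, "w") for p in paths]
    try:
        with open(log_path) as log:
            for line in log:
                ip = line.strip()
                if ip:
                    # The same IP always goes to the same bucket file.
                    file_id = zlib.crc32(ip.encode()) % num_buckets
                    buckets[file_id].write(ip + "\n")
    finally:
        for f in buckets:
            f.close()

    candidates = []
    for p in paths:
        # Only one bucket's counts are in memory at a time.
        with open(p) as f:
            counts = Counter(line.strip() for line in f)
        candidates.extend((c, ip) for ip, c in counts.most_common(k))
    # Each IP lives in exactly one bucket, so its bucket count is its total.
    return heapq.nlargest(k, candidates)
```

Calling top_k_ips(path, 1) answers the original question, and a larger k answers the top-K follow-up. The sketch relies on each bucket's distinct IPs fitting in memory; a skewed bucket would need to be split again with a different hash.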

2) Given billions of integers, design an algorithm to find the integers that appear exactly once.

Hash bucketing: map the billions of integers into different value intervals so that each interval can be handled in memory, then find the integers that occur exactly once within each interval.
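A small sketch of the interval idea follows. It swaps the temporary bucket files for one pass over the input per interval, which keeps the example short; the function name appears_once and the read_ints callable (which must return a fresh iterator on every call) are assumptions of this example.

```python
from collections import Counter


def appears_once(read_ints, lo, hi, num_intervals):
    """Yield the integers in [lo, hi) that occur exactly once.

    read_ints() must return a fresh iterator over the data each time,
    for example by reopening the input file.
    """
    width = -(-(hi - lo) // num_intervals)  # ceiling division
    for i in range(num_intervals):
        start = lo + i * width
        end = min(start + width, hi)
        # Only the counts for the current interval are held in memory.
        counts = Counter(x for x in read_ints() if start <= x < end)
        for value, count in counts.items():
            if count == 1:
                yield value
```

Writing each interval to its own file in a single pass, as the text describes, trades the repeated scans for extra disk space.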

3) Given two files, each containing billions of integers, and only 1 GB of memory, how do you find the intersection of the two files?

Scan every integer and record whether it has been seen; to save memory, use a bitmap, or bucketing plus a bitmap. If the integers are 32-bit, a bitmap can be used directly. There are 2^32 possible values and each is given two bits: 00 means the value appears in neither file, 10 means it appears in file 1, 01 means it appears in file 2, and 11 means it appears in both files. This needs 2^32 * 2 / 8 bytes = 1 GB of memory. Traverse all the integers in the two files, setting the matching bit for each one, then scan the bitmap for entries equal to 11; those values are the intersection, so the whole job runs in linear time. Because the bitmap alone uses up the full 1 GB, in practice the value range is bucketed first (for example into two halves) and each bucket gets a smaller bitmap.
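The two-bit encoding can be sketched as below. The name intersect and the universe parameter are assumptions of this example; universe defaults to 2^32 as in the text, which allocates the full 1 GB, so a small value is better for experimenting. Values are assumed to be non-negative.

```python
def intersect(ints_a, ints_b, universe=2**32):
    """Yield, in increasing order, the values present in both inputs.

    Each value in [0, universe) owns two bits: 0b10 is set when it is
    seen in the first input and 0b01 when it is seen in the second.
    """
    bitmap = bytearray((universe * 2 + 7) // 8)

    def mark(value, flag):
        byte, slot = divmod(value, 4)  # four 2-bit slots per byte
        bitmap[byte] |= flag << (slot * 2)

    for v in ints_a:
        mark(v, 0b10)
    for v in ints_b:
        mark(v, 0b01)

    for value in range(universe):
        byte, slot = divmod(value, 4)
        if (bitmap[byte] >> (slot * 2)) & 0b11 == 0b11:
            yield value
```

For instance, list(intersect([1, 5, 9], [9, 2, 1], universe=16)) walks a 4-byte bitmap instead of a 1 GB one.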

4) Given one file with billions of ints and 1 GB of memory, design an algorithm to find all integers that occur 2 or more times.

Bitmap extension: use 2 bits per value to represent a state: 0 means not seen, 1 means seen once, 2 means seen 2 or more times. If the requirement is strictly more than 2 occurrences, the spare state 3 can stand for 3 or more.
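A sketch of the saturating two-bit state follows, under the same assumptions as before: non-negative values below universe, and names (seen_at_least_twice, universe) invented for this example.

```python
def seen_at_least_twice(ints, universe=2**32):
    """Return, in increasing order, the values that occur 2 or more times.

    Each value owns a 2-bit state: 0 = not seen, 1 = seen once,
    2 = seen twice or more (the state stops growing at 2).
    """
    bitmap = bytearray((universe * 2 + 7) // 8)
    for value in ints:
        byte, slot = divmod(value, 4)
        shift = slot * 2
        state = (bitmap[byte] >> shift) & 0b11
        if state < 2:
            cleared = bitmap[byte] & ~(0b11 << shift)
            bitmap[byte] = cleared | ((state + 1) << shift)

    result = []
    for value in range(universe):
        byte, slot = divmod(value, 4)
        if (bitmap[byte] >> (slot * 2)) & 0b11 == 2:
            result.append(value)
    return result
```

With the default universe the result list itself could be large; a real run would stream the matches to a file instead of collecting them.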

5) Given two files, each containing millions of queries, and only 1 GB of memory, how do you find the intersection of the two files? Give both an exact algorithm and an approximate algorithm.

Exact algorithm: hash bucketing

Hash the queries of both files into N small files, tagging each query with the file it came from;

find the queries that occur in both sources within each small file;

combine the matches found in all the small files.
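The three steps of the exact algorithm might look like this in Python. It is a sketch under assumptions: one query per line, CRC32 as the hash, and an "A"/"B" tag written in front of each query to mark its source. The names exact_intersection and part_*.tsv are made up for the example.

```python
import os
import tempfile
import zlib


def exact_intersection(path_a, path_b, num_parts=64, workdir=None):
    """Return the set of queries (one per line) present in both files."""
    workdir = workdir or tempfile.mkdtemp()
    part_paths = [os.path.join(workdir, f"part_{i}.tsv") for i in range(num_parts)]
    parts = [open(p, "w") for p in part_paths]
    try:
        for tag, path in (("A", path_a), ("B", path_b)):
            with open(path) as f:
                for line in f:
                    query = line.rstrip("\n")
                    if query:
                        i = zlib.crc32(query.encode()) % num_parts
                        # Tag each query with the file it came from.
                        parts[i].write(f"{tag}\t{query}\n")
    finally:
        for p in parts:
            p.close()

    common = set()
    for p in part_paths:
        seen = {"A": set(), "B": set()}
        with open(p) as f:
            for line in f:
                tag, query = line.rstrip("\n").split("\t", 1)
                seen[tag].add(query)
        # Equal queries share a hash, so matches never span two parts.
        common |= seen["A"] & seen["B"]
    return common
```

The result is collected in memory here for brevity; at real scale each part's matches would be appended to an output file.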

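Approximate algorithm: Bloom filter. Insert every query of one file into a Bloom filter, then test each query of the other file against it. A query that really is in both files is never missed, but a small fraction of false positives may be reported.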
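A minimal Bloom filter sketch for the approximate approach is shown below. The class, the double-hashing scheme built on SHA-256, and the default sizes are illustrative assumptions, not something prescribed by the original text.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all positions from two halves of one digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def approx_intersection(queries_a, queries_b, num_bits=8 * 2**20, num_hashes=5):
    """Return the queries from queries_b that are probably in queries_a."""
    bloom = BloomFilter(num_bits, num_hashes)
    for q in queries_a:
        bloom.add(q)
    return [q for q in queries_b if q in bloom]
```

Raising num_bits lowers the false-positive rate at the cost of memory; the filter never produces false negatives.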

6) How do you extend a Bloom filter so that it supports deleting elements?

Extend each bit of the Bloom filter to a counter that records how many hash mappings currently land on that position. Deleting an element decrements its counters, and a position only really becomes 0 when its reference count drops to 0.
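The counter-per-position idea can be sketched as follows. The class name, the SHA-256 double hashing, and the choice to refuse removal of an item that does not appear to be present are assumptions of this example.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, num_counters, num_hashes):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        n = len(self.counters)
        return [(h1 + i * h2) % n for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))

    def remove(self, item):
        """Decrement the item's counters; return False if it looks absent."""
        if item not in self:
            return False
        for pos in self._positions(item):
            self.counters[pos] -= 1
        return True
```

Removing an item that was never added but happens to test positive would corrupt other entries, so deletion is only safe for items known to have been inserted.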

7) How do you extend a Bloom filter so that it supports counting?

Expand each bit of the Bloom filter to a counter; every inserted element adds 1 to each of its positions, which makes counting possible. The count of an element is the minimum value over all the positions it maps to. This estimate can be too high when other elements share those positions, but it is never too low.
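Here is a short sketch of the minimum-over-positions estimate. The class name CountingFilter, its default sizes, and the hashing scheme are assumptions made for illustration.

```python
import hashlib


class CountingFilter:
    def __init__(self, num_counters=1 << 16, num_hashes=4):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        n = len(self.counters)
        return [(h1 + i * h2) % n for i in range(self.num_hashes)]

    def add(self, item):
        # Every insertion adds 1 at each of the item's positions.
        for pos in self._positions(item):
            self.counters[pos] += 1

    def count(self, item):
        # The smallest counter is the one least inflated by collisions.
        return min(self.counters[pos] for pos in self._positions(item))
```

After adding the same item three times, count returns at least 3 for it; collisions with other items can only push the estimate upward.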

8) Given thousands of files, each between 1 KB and 100 MB in size, and n words, design an algorithm that finds, for each word, all the files that contain it. You only have 100 KB of memory.

0: Use a file, info, to save the n words and the files that contain each of them.
1: First divide the n words into x parts. For each part, build a Bloom filter (memory may be too small to hold a single Bloom filter covering all n words). Save all the generated filters to external storage in a file, filter.
2: Divide memory into two buffers, one holding the filter currently loaded and one holding file data being read (the read buffer can be synchronized with a producer-consumer model). Large files can be split into smaller files, but the mapping must be recorded (that is, which large file each small file came from).
3: For each word read from a file, check whether the filter in memory contains it. If not, load the next Bloom filter from the filter file into memory, and repeat until a filter contains the word or all filters have been tried. If a filter contains it, update the info file. Continue until all data is processed, then delete the filter file.

Remarks:
1: About the Bloom filter: it is essentially a bitmap that stores the hash values of strings.
2: Some details still need attention, such as repeated strings causing repeated work.
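The steps for question 8 can be illustrated with the simplified sketch below. It departs from the description in several labeled ways: the filters are rebuilt per chunk instead of being stored in a filter file, each filter is applied to all files before the next one is loaded, the results are kept in a dictionary rather than an info file, and the word list itself is held in memory. All function names and sizes are assumptions, and Bloom filter false positives mean a file word outside the list could occasionally be recorded.

```python
import hashlib
import re
from collections import defaultdict


def _positions(word, num_bits, num_hashes):
    digest = hashlib.sha256(word.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % num_bits for i in range(num_hashes)]


def make_filter(words, num_bits, num_hashes):
    """Build a Bloom filter (as a bytearray bitmap) for one chunk of words."""
    bits = bytearray((num_bits + 7) // 8)
    for word in words:
        for pos in _positions(word, num_bits, num_hashes):
            bits[pos // 8] |= 1 << (pos % 8)
    return bits


def maybe_contains(bits, word, num_bits, num_hashes):
    return all(
        bits[pos // 8] & (1 << (pos % 8))
        for pos in _positions(word, num_bits, num_hashes)
    )


def index_files(words, paths, chunk_size, num_bits=8192, num_hashes=3):
    """Map each word that passes a filter to the set of files containing it."""
    info = defaultdict(set)  # stands in for the on-disk info file
    for start in range(0, len(words), chunk_size):
        # Only one chunk's filter is in memory at a time.
        bits = make_filter(words[start:start + chunk_size], num_bits, num_hashes)
        for path in paths:
            with open(path) as f:
                for line in f:  # stream the file line by line
                    for token in re.findall(r"\w+", line):
                        if maybe_contains(bits, token, num_bits, num_hashes):
                            info[token].add(path)
    return info
```

Scanning every file once per filter costs x passes over the data; the per-word filter cycling in step 3 instead reloads filters repeatedly, so which is cheaper depends on the relative sizes of the filters and the files.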

9) Given a dictionary containing N English words and an arbitrary input string, design an algorithm to find all the English words that contain this string.

Build an inverted index keyed by letter; each entry stores the words containing that letter and the letter's position within the word. For a query string, look up the postings of each of its letters, intersect them by word, and keep only the words in which the positions are contiguous.
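The letter-level inverted index can be sketched as follows. The function names build_index and words_containing are invented for this example, and the postings are stored as in-memory sets of (word, position) pairs.

```python
from collections import defaultdict


def build_index(words):
    """Map each letter to the set of (word, position) pairs where it occurs."""
    index = defaultdict(set)
    for word in words:
        for pos, ch in enumerate(word):
            index[ch].add((word, pos))
    return index


def words_containing(index, query):
    """Return the words that contain query as a contiguous substring."""
    if not query:
        return set()
    # Start from every occurrence of the first letter.
    candidates = index.get(query[0], set())
    for offset, ch in enumerate(query[1:], start=1):
        postings = index.get(ch, set())
        # Keep a start position only if the next letter follows contiguously.
        candidates = {(w, s) for (w, s) in candidates if (w, s + offset) in postings}
    return {w for w, _ in candidates}
```

The index is built once for the dictionary and reused for every query, so each lookup only touches the postings of the letters in the query.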
