Big data processing algorithms, part three: divide and conquer / hash mapping + hash statistics + heap/quick/merge sort

Source: Internet
Author: User

Baidu interview question 1: given massive log data, extract the IP that visited Baidu the most times on a given day.

An IP address is 32 bits, so there are at most 2^32 distinct IPs. We can again use a mapping approach, for example taking a hash of each IP modulo k, to split the whole large file into k small files. Because identical IPs always land in the same small file, each file's counts are complete. Then find the most frequent IP in each small file, together with its frequency (you can use a hash_map for the frequency statistics and then pick the entry with the largest count). Finally, among these k per-file winners, find the IP with the highest frequency; that is the answer.
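The partition-then-count idea above can be sketched in Java. This is a minimal in-memory illustration, not the author's code: the class name TopIpSketch, the method mostFrequentIp, and the bucket count k are all hypothetical, and the lists stand in for the small files that a real job would write to disk. It uses java.util.HashMap for the per-bucket counts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopIpSketch {
    /** Returns the most frequent IP, partitioning the input into k buckets first. */
    static String mostFrequentIp(List<String> ips, int k) {
        // Step 1: partition by hash(ip) mod k. On disk, each bucket would be a small file.
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String ip : ips) {
            // floorMod keeps the index non-negative even for negative hash codes.
            buckets.get(Math.floorMod(ip.hashCode(), k)).add(ip);
        }

        String bestIp = null;
        int bestCount = 0;
        for (List<String> bucket : buckets) {
            // Step 2: count frequencies inside one bucket only.
            Map<String, Integer> counts = new HashMap<>();
            for (String ip : bucket) {
                counts.merge(ip, 1, Integer::sum);
            }
            // Step 3: compare this bucket's entries against the best seen so far.
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                if (e.getValue() > bestCount) {
                    bestCount = e.getValue();
                    bestIp = e.getKey();
                }
            }
        }
        return bestIp;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1", "10.0.0.2");
        System.out.println(mostFrequentIp(log, 4)); // prints 10.0.0.1
    }
}
```

Only one bucket's count table has to be held in memory at a time, which is the point of splitting the file first.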

Baidu interview question 2: a search engine records in its log files every query string that users search for; each query string is 1-255 bytes long.

Assume there are currently 10 million records. These query strings are highly repetitive: although the total is 10 million, there are no more than 3 million after removing duplicates. The more often a query string is repeated, the more users have searched for it and the more popular it is. Count the ten most popular query strings, using no more than 1 GB of memory.

Step one: use a hash table to preprocess this batch of massive data. Maintain a hash table whose key is the query string and whose value is the number of times that query occurs, i.e. hash_map(Query, Value). Each time a query is read, if the string is not in the table, add it with its value set to 1; if the string is already in the table, increment its count. In the end, the hash table completes the statistics in O(N) time (N is 10 million, because the whole data set must be traversed once to count the occurrences of each query). At most 3 million distinct strings of up to 255 bytes each is about 765 MB of key data, which is under the 1 GB limit if table overhead is ignored.
Step two: use a heap to find the ten most popular query strings, with time complexity N' * log K. Maintain a min-heap of size K (ten in this problem), then traverse the 3 million distinct queries, comparing each one's value with the root element; when the traversal ends, the heap holds the ten queries with the highest values.
The final time complexity is O(N) + N' * O(log K), where N is 10 million and N' is 3 million.
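The two steps can be sketched in Java as follows. This is a minimal sketch under stated assumptions: the class name TopKQueriesSketch and the method topK are hypothetical, the queries are assumed to fit in a List, and java.util.HashMap and java.util.PriorityQueue stand in for the hash table and the min-heap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopKQueriesSketch {
    /** Returns the k most frequent queries, most popular first. */
    static List<String> topK(List<String> queries, int k) {
        List<String> result = new ArrayList<>();
        if (k <= 0) {
            return result;
        }
        // Step one: a single O(N) pass that counts each query.
        Map<String, Integer> counts = new HashMap<>();
        for (String q : queries) {
            counts.merge(q, 1, Integer::sum);
        }
        // Step two: min-heap of size k ordered by count.
        // The root is always the least popular of the current top k.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (heap.size() < k) {
                heap.offer(e);
            } else if (e.getValue() > heap.peek().getValue()) {
                heap.poll();   // evict the current minimum
                heap.offer(e); // O(log k) insertion
            }
        }
        // The heap drains in ascending order, so reverse at the end.
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        List<String> queries = Arrays.asList("java", "heap", "java", "hash", "heap", "java");
        System.out.println(topK(queries, 2)); // prints [java, heap]
    }
}
```

A min-heap is used rather than a max-heap so that the cheapest element to inspect, the root, is exactly the one a new candidate has to beat.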

Alternatively, use a trie in which each node's key field stores the number of times the query string ending at that node occurs (0 if it does not occur). Finally, use a min-heap of ten elements to rank the queries by frequency.
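The trie variant might look like the sketch below. The names QueryTrieSketch, insert, count, and topK are hypothetical, and children are stored in a per-node map for brevity; a production trie over raw bytes would probably use a more compact layout.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class QueryTrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int count; // number of queries ending here; 0 means none
    }

    private final Node root = new Node();

    /** Adds one occurrence of the query. */
    void insert(String query) {
        Node node = root;
        for (int i = 0; i < query.length(); i++) {
            node = node.children.computeIfAbsent(query.charAt(i), c -> new Node());
        }
        node.count++;
    }

    /** Returns how many times the query was inserted (0 if never). */
    int count(String query) {
        Node node = root;
        for (int i = 0; i < query.length(); i++) {
            node = node.children.get(query.charAt(i));
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }

    /** Walks the trie and keeps the k most frequent queries in a min-heap. */
    List<String> topK(int k) {
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        if (k > 0) {
            collect(root, new StringBuilder(), k, heap);
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result); // most popular first
        return result;
    }

    private void collect(Node node, StringBuilder prefix, int k,
                         PriorityQueue<Map.Entry<String, Integer>> heap) {
        if (node.count > 0) {
            heap.offer(new AbstractMap.SimpleEntry<>(prefix.toString(), node.count));
            if (heap.size() > k) {
                heap.poll(); // drop the least frequent candidate
            }
        }
        for (Map.Entry<Character, Node> child : node.children.entrySet()) {
            prefix.append(child.getKey());
            collect(child.getValue(), prefix, k, heap);
            prefix.setLength(prefix.length() - 1); // backtrack
        }
    }
}
```

Compared with the hash table, the trie shares common prefixes between queries, which can save memory when many queries start the same way.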

Let's start by looking at how HashMap is implemented.

1. Data structure of HashMap

Arrays and linked lists can both be used to store data, but they are basically two extremes.

Array

An array is stored in one contiguous block of memory, so it needs a large contiguous allocation. In return, accessing an element by index takes O(1) time (binary search over a sorted array takes O(log N)). An array is characterized by easy addressing but difficult insertion and deletion.

Linked list

A linked list is stored in scattered memory locations, so it does not need a contiguous block, but finding an element takes up to O(N) time. A linked list is characterized by difficult addressing but easy insertion and deletion.

Hash table

Can we combine the characteristics of both to make a data structure that is easy to address and also easy to insert into and delete from? The answer is yes, and that is the hash table. A hash table makes lookups convenient without occupying too much memory, and it is very convenient to use.

There are a number of different implementations of the hash table. What I explain next is the most commonly used one, separate chaining (literally the "zipper method"), which can be understood as an "array of linked lists".

I implemented a HashMap in Java. It is fairly simple, but it explains the general principle, and the basic put, get, and remove operations are all there.

index = hashCode(key) % 16

There are many hash algorithms. Below I use the one that comes with Java, but of course you can use others.
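One detail of the index formula is worth a small sketch: Java hash codes can be negative, and the % operator then yields a negative remainder, which is not a valid array index. The class BucketIndexSketch and the method indexFor below are hypothetical names used only for illustration; Math.floorMod is the standard library call that keeps the result in the range 0 to 15.

```java
class BucketIndexSketch {
    static final int CAPACITY = 16;

    /** Maps any key's hash code to a bucket index in [0, CAPACITY). */
    static int indexFor(Object key) {
        return Math.floorMod(key.hashCode(), CAPACITY);
    }

    public static void main(String[] args) {
        System.out.println(indexFor("1")); // "1".hashCode() is 49, prints 1
        System.out.println(indexFor("3")); // "3".hashCode() is 51, prints 3
        System.out.println(-7 % CAPACITY); // plain remainder, prints -7
        System.out.println(indexFor(-7));  // floorMod, prints 9
    }
}
```

When the capacity is a power of two, as 16 is, masking with hash & (CAPACITY - 1) gives the same non-negative index.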

/**
 * Custom HashMap (separate chaining: an array of linked lists)
 * @author JYC506
 * @param <K> key type
 * @param <V> value type
 */
public class HashMap<K, V> {
    private static final int CAPACITY = 16;
    transient Entry<K, V>[] table = null;

    @SuppressWarnings("unchecked")
    public HashMap() {
        super();
        table = new Entry[CAPACITY];
    }

    /* Hash algorithm */
    private final int toHashCode(Object obj) {
        int h = 0;
        if (obj instanceof String) {
            return StringHash.toHashCode((String) obj);
        }
        h ^= obj.hashCode();
        // The two shift amounts were lost in the source; 20 and 12 are assumed
        // here, matching the supplemental hash used in older JDK versions.
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    /* Map a hash code to a bucket index in [0, CAPACITY).
       The mask clears the sign bit so the index is never negative. */
    private int indexFor(int hashcode) {
        return (hashcode & 0x7fffffff) % CAPACITY;
    }

    /* Put a key/value pair into the map */
    public void put(K key, V value) {
        int hashcode = this.toHashCode(key);
        int index = indexFor(hashcode);
        // If the key already exists in this bucket's chain, overwrite its value
        for (Entry<K, V> entry = table[index]; entry != null; entry = entry.nextEntry) {
            if (entry.hashcode == hashcode && (entry.key == key || key.equals(entry.key))) {
                entry.value = value;
                return;
            }
        }
        // Otherwise insert the new entry at the head of the chain
        Entry<K, V> newEntry = new Entry<K, V>(key, value, hashcode);
        newEntry.nextEntry = table[index];
        table[index] = newEntry;
    }

    /* Get value */
    public V get(K key) {
        int hashcode = this.toHashCode(key);
        int index = indexFor(hashcode);
        for (Entry<K, V> entry = table[index]; entry != null; entry = entry.nextEntry) {
            if (entry.hashcode == hashcode && (entry.key == key || key.equals(entry.key))) {
                return entry.value;
            }
        }
        return null;
    }

    /* Delete */
    public void remove(K key) {
        int hashcode = this.toHashCode(key);
        int index = indexFor(hashcode);
        Entry<K, V> parent = null;
        for (Entry<K, V> entry = table[index]; entry != null; entry = entry.nextEntry) {
            if (entry.hashcode == hashcode && (entry.key == key || key.equals(entry.key))) {
                if (parent != null) {
                    parent.nextEntry = entry.nextEntry;
                } else {
                    // The entry is the head of the chain
                    table[index] = entry.nextEntry;
                }
                return;
            }
            parent = entry;
        }
    }

    public static void main(String[] args) {
        HashMap<String, String> map = new HashMap<String, String>();
        map.put("1", "2");
        map.put("1", "3");
        map.put("3", "hahaha");
        System.out.println(map.get("1"));
        System.out.println(map.get("3"));
        map.remove("1");
        System.out.println(map.get("1"));
    }
}

class Entry<K, V> {
    K key;
    V value;
    int hashcode;
    Entry<K, V> nextEntry;

    public Entry(K key, V value, int hashcode) {
        super();
        this.key = key;
        this.value = value;
        this.hashcode = hashcode;
    }
}

/* String hash algorithm */
class StringHash {
    public static final int toHashCode(String str) {
        /* Uses the hash that comes with Java */
        return str.hashCode();
    }
}


