OpenRTMFP/Cumulus Development Notes (6): A Big Data Processing Example in Cumulus


I. Problem description: from a massive volume of log data, extract the IP address that visited YY (http://yy.com/) the most times on a given day.

On the YY page, you can enter a channel number, such as 2080, in the lower right corner to enter the corresponding channel.

II. Problem simulation:

1. Generate a large number of IP addresses and save them to a file, as follows:

// Requires <cstdlib>, <ctime>, <fstream>, <sstream> and <string>.
void constructbigdata::constructips(std::string filename) {
    std::ofstream outfile(filename.c_str(), std::ios::out);
    std::stringstream ip("");
    unsigned short num = 0;
    srand((unsigned)time(NULL));
    for (int i = 0; i < 9000000; ++i) {
        for (int j = 0; j < 4; ++j) {
            num = (rand() % 256); // one octet (0-255) of an IPv4 address
            ip << num;
            if (j < 3)
                ip << '.';
            else
                ip << '\n';
        }
        outfile << ip.str(); // write one address per line
        ip.str("");          // empty the buffer for the next address
        outfile.flush();
    }
    outfile.close();
}

Note: 9,000,000 IP addresses are generated here. That may fall short of a truly massive data set, but "massive" is relative and the processing method is the same, so adjust the count to your actual situation. Only the general case is considered here. filename is the name of the file in which the IP addresses are saved.
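As a side note, the same generation step can be sketched with the C++11 <random> facilities instead of rand(). This is only an illustration under my own assumptions: the helper names random_ipv4 and write_random_ips are hypothetical and do not come from the code above, and a fixed seed is used so that runs are repeatable.

```cpp
#include <cassert>
#include <random>
#include <sstream>
#include <string>

// Hypothetical helper (not part of the article's class): build one random
// dotted-quad IPv4 string from four octets drawn uniformly from 0..255.
std::string random_ipv4(std::mt19937& rng) {
    std::uniform_int_distribution<int> octet(0, 255);
    std::ostringstream ip;
    for (int j = 0; j < 4; ++j) {
        if (j > 0) ip << '.';
        ip << octet(rng);
    }
    return ip.str();
}

// Hypothetical helper: write `count` random addresses, one per line,
// to any output stream (a file stream or an in-memory stream).
void write_random_ips(std::ostream& out, int count, unsigned seed) {
    assert(count >= 0);
    std::mt19937 rng(seed);
    for (int i = 0; i < count; ++i) {
        out << random_ipv4(rng) << '\n';
    }
}
```

Passing a std::ofstream writes the addresses to a file, while a std::ostringstream keeps them in memory for a quick check. Unlike the listing above, the stream is not flushed after every line, which should reduce the I/O overhead for millions of lines.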

2. These 9,000,000 IP addresses could be loaded into memory in one pass for counting, but the point here is how to handle data that cannot be loaded all at once. In that case, read the IP addresses and split them into several smaller files according to a modulo value. The requirement is that identical IP addresses always end up in the same small file, as shown below:

void constructbigdata::filepartition(std::string filename) {
    std::ifstream infile(filename.c_str(), std::ios::in);
    std::ofstream outfile0("outfile0.txt", std::ios::out);
    std::ofstream outfile1("outfile1.txt", std::ios::out);
    std::ofstream outfile2("outfile2.txt", std::ios::out);
    std::ofstream outfile3("outfile3.txt", std::ios::out);
    std::ofstream outfile4("outfile4.txt", std::ios::out);
    if (!infile) {
        return;
    }
    unsigned short val1, val2, val3, val4; // the four octets
    unsigned char ch1, ch2, ch3;           // receive the '.' separators
    unsigned long marshal = 0;
    int modval = 0;
    std::string buffer;
    std::stringstream ssbuf("");
    while (std::getline(infile, buffer)) {
        if (buffer.empty()) {
            continue;
        }
        ssbuf.clear();
        ssbuf.str(buffer);
        ssbuf >> val1 >> ch1 >> val2 >> ch2 >> val3 >> ch3 >> val4;
        // Pack the four octets into a single 32-bit value.
        marshal = (((((static_cast<unsigned long>(val1) << 8) + val2) << 8) + val3) << 8) + val4;
        modval = marshal % 5;
        switch (modval) {
        case 0:
            outfile0 << buffer << '\n';
            break;
        case 1:
            outfile1 << buffer << '\n';
            break;
        case 2:
            outfile2 << buffer << '\n';
            break;
        case 3:
            outfile3 << buffer << '\n';
            break;
        case 4:
            outfile4 << buffer << '\n';
            break;
        default:
            std::cout << "unexpected modulo value" << std::endl;
            break;
        }
    }
    outfile0.flush();
    outfile1.flush();
    outfile2.flush();
    outfile3.flush();
    outfile4.flush();

    outfile0.close();
    outfile1.close();
    outfile2.close();
    outfile3.close();
    outfile4.close();
    infile.close();
}

Note: taking the value modulo 5 is only an example. Choose the number of partitions according to your actual situation; only the general method is described here.
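The key property of the partition step is that the bucket depends only on the address itself, so identical IP addresses always fall into the same file. A minimal sketch of that mapping follows. The names ip_to_u32 and bucket_of are my own and hypothetical, and the input validation is an addition that the original function does not perform.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: pack a dotted-quad string such as "1.2.3.4" into a
// 32-bit integer. Returns false when the text is not four octets in 0..255.
bool ip_to_u32(const std::string& text, std::uint32_t& value) {
    std::istringstream in(text);
    unsigned a, b, c, d;
    char dot1, dot2, dot3;
    if (!(in >> a >> dot1 >> b >> dot2 >> c >> dot3 >> d)) return false;
    if (dot1 != '.' || dot2 != '.' || dot3 != '.') return false;
    if (a > 255 || b > 255 || c > 255 || d > 255) return false;
    value = (a << 24) | (b << 16) | (c << 8) | d;
    return true;
}

// Hypothetical helper: bucket index for an address. The result depends only
// on the address, so identical addresses always share a bucket.
// Returns -1 for malformed input.
int bucket_of(const std::string& text, unsigned buckets) {
    assert(buckets > 0);
    std::uint32_t value = 0;
    if (!ip_to_u32(text, value)) return -1;
    return static_cast<int>(value % buckets);
}
```

For example, "1.2.3.4" packs to 0x01020304, and bucket_of("1.2.3.4", 5) gives the index of the small file that would receive every occurrence of that address.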

3. Read each file in turn, find the IP address with the most accesses in each file, and then take the maximum among those per-file results. Here we can use Blizzard's hash algorithm, which is commonly described as a very fast string hash, as follows:

void constructbigdata::findmax() {
    int filenum = 5;
    std::stringstream ss("");
    for (int i = 0; i < filenum; ++i) {
        std::string filename = "outfile"; // must match the names written by filepartition
        std::string suffix = ".txt";
        ss << filename << i << suffix;
        std::ifstream infile(ss.str().c_str(), std::ios::in);
        std::string buffer;
        while (std::getline(infile, buffer)) {
            if (!buffer.empty()) {
                mpq.sethashtable(buffer); // hash the IP; mpq is an instance of the Blizzard hash implementation
            }
        }
        infile.close();
        printf("from %s -> ", ss.str().c_str());
        max();
        ss.clear();
        ss.str("");
    }
}

void constructbigdata::max() { // finds the most accessed IP address in the file just loaded
    int index = 0;
    // The loop bound was lost in the source; it is assumed to be the table length hashmpqlen.
    for (int i = 0; i < hashmpqlen; ++i) {
        if (mpq.m_hashindextable[i].bexists) {
            if (mpq.m_hashindextable[i].count > mpq.m_hashindextable[index].count) {
                index = i;
            }
        }
    }
    if (mpq.m_hashindextable[index].bexists) {
        printf("%s, %d\n", mpq.m_hashindextable[index].test_filename, mpq.m_hashindextable[index].count);
    }
    mpq.reset(hashmpqlen); // reset the hash table before the next file
}
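
The per-file counting and the final comparison can also be sketched without the custom hash table. In the sketch below, std::unordered_map is a stand-in for the Blizzard-hash table and is not what the code above uses; the names IpCount, most_frequent and overall_max are hypothetical. It relies on the partition guarantee that all occurrences of one IP address are in the same file, so each per-file count is already complete.

```cpp
#include <cassert>
#include <sstream>  // std::istream, plus std::istringstream for in-memory input
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> IpCount;  // (address, number of hits)

// Count every line of one partition and return its most frequent address.
// std::unordered_map stands in for the article's Blizzard-hash table.
IpCount most_frequent(std::istream& partition) {
    std::unordered_map<std::string, int> counts;
    std::string line;
    IpCount best("", 0);
    while (std::getline(partition, line)) {
        if (line.empty()) continue;
        int n = ++counts[line];
        if (n > best.second) best = IpCount(line, n);
    }
    return best;
}

// Pick the overall winner from the per-partition winners.
IpCount overall_max(const std::vector<IpCount>& winners) {
    assert(!winners.empty());
    IpCount best = winners[0];
    for (std::size_t i = 1; i < winners.size(); ++i) {
        if (winners[i].second > best.second) best = winners[i];
    }
    return best;
}
```

Calling most_frequent once per partition file (for instance through a std::ifstream) and passing the collected results to overall_max yields the single most frequent address across all partitions, while only one partition's counts are held in memory at a time.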

III. An example implementation of the Blizzard hash algorithm:

1. The data structure of the hash table is defined as follows:

typedef struct
{
    long nhasha;  // first check hash, used to confirm a match
    long nhashb;  // second check hash, used to confirm a match
    bool bexists; // whether this slot is occupied
    char test_filename[maxfilename]; // stores the IP address
    int count;    // number of times this IP address was accessed
} Mpqhashtable;

2. The hash table is initialized and reset as follows:

void hashmpq::reset(const long ntablelength) { // ntablelength is the length of the hash table
    for (int i = 0; i < ntablelength; i++) {
        m_hashindextable[i].nhasha = -1;
        m_hashindextable[i].nhashb = -1;
        m_hashindextable[i].bexists = false;
        m_hashindextable[i].test_filename[0] = '\0';
        m_hashindextable[i].count = 0;
    }
}

3. Hash the IP address as follows:

bool hashmpq::sethashtable(std::string lpszstring)
{
    const unsigned long hash_offset = 0, hash_a = 1, hash_b = 2;
    unsigned long nhash = hashstring(lpszstring, hash_offset);
    unsigned long nhasha = hashstring(lpszstring, hash_a);
    unsigned long nhashb = hashstring(lpszstring, hash_b);
    unsigned long nhashstart = nhash % m_tablelength, nhashpos = nhashstart;
    while (m_hashindextable[nhashpos].bexists) {
        // If both check hashes match, this IP address is already stored, so
        // only its count needs to be increased by 1. (The source left this
        // step as a TODO; this is one way to fill it in.)
        if (m_hashindextable[nhashpos].nhasha == static_cast<long>(nhasha) &&
            m_hashindextable[nhashpos].nhashb == static_cast<long>(nhashb)) {
            ++m_hashindextable[nhashpos].count;
            return true;
        }
        nhashpos = (nhashpos + 1) % m_tablelength; // collision handling: move on to the next slot (linear probing)
        if (nhashpos == nhashstart) {
            return false; // wrapped around: the table is full
        }
    }

    // Otherwise the IP address is not in the hash table yet; store it in the free slot.

    m_hashindextable[nhashpos].bexists = true;
    m_hashindextable[nhashpos].nhasha = nhasha;
    m_hashindextable[nhashpos].nhashb = nhashb;
    strcpy(m_hashindextable[nhashpos].test_filename, lpszstring.c_str()); // store the IP address
    m_hashindextable[nhashpos].count = 1; // first access, so set count to 1
    return true;
}
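
The hashstring function called above is not listed in this note. For reference, here is a sketch of the Blizzard (MPQ) string hash as it is commonly published, adapted to the (string, hash type) call shape used above. Treat it as an assumption about what hashstring does rather than the actual code of this project; crypt_table, crypt_table_ready and prepare_crypt_table are names chosen for this sketch.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>

static std::uint32_t crypt_table[0x500];
static bool crypt_table_ready = false;

// Fill the 1280-entry lookup table (five banks of 256 entries) used by the hash.
void prepare_crypt_table() {
    std::uint32_t seed = 0x00100001;
    for (std::uint32_t index1 = 0; index1 < 0x100; ++index1) {
        std::uint32_t index2 = index1;
        for (int i = 0; i < 5; ++i, index2 += 0x100) {
            seed = (seed * 125 + 3) % 0x2AAAAB;
            std::uint32_t high = (seed & 0xFFFF) << 16;
            seed = (seed * 125 + 3) % 0x2AAAAB;
            std::uint32_t low = seed & 0xFFFF;
            crypt_table[index2] = high | low;
        }
    }
    crypt_table_ready = true;
}

// hash_type selects one 256-entry bank of the table. In the code above,
// 0 gives the slot offset and 1 and 2 give the two check hashes.
// Letters are upper-cased first, so the hash ignores ASCII case.
std::uint32_t hashstring(const std::string& text, std::uint32_t hash_type) {
    assert(hash_type < 5);
    if (!crypt_table_ready) prepare_crypt_table();
    std::uint32_t seed1 = 0x7FED7FED, seed2 = 0xEEEEEEEE;
    for (std::size_t i = 0; i < text.size(); ++i) {
        std::uint32_t ch = static_cast<std::uint32_t>(
            std::toupper(static_cast<unsigned char>(text[i])));
        seed1 = crypt_table[(hash_type << 8) + ch] ^ (seed1 + seed2);
        seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;
    }
    return seed1;
}
```

Computing three hashes per string is what lets the table confirm a match by comparing two integers instead of comparing the stored strings: two different IP addresses would have to collide on all three hash types at once to be mistaken for each other, which is very unlikely in practice.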

IV. The main function is as follows:

#include "constructbigdata.h"
#include <iostream>
#include <stdio.h>
int main(int argc, char** argv) {
    constructbigdata bd(10000000); // length of the hash table
    std::string filename = "bigdata.txt"; // name of the file that stores the IP addresses
    bd.constructips(filename); // generate the massive set of IP addresses
    bd.filepartition(filename); // partition them into smaller files
    bd.findmax(); // find the most accessed IP address in each partition
    return 0;
}

To be continued ~


PS: This is one of my first articles, so please bear with any shortcomings. If you have questions or would like to discuss, you can add me on YY: 301558660

If you repost this article, please credit the source: zhujian blog, http://blog.csdn.net/linyanwen99/article/details/8182814

