"Learning" large file statistics and sorting (reprint)


This post records the algorithmic thinking and code that my classmate Aboutspeaker used for the following problem.

The topic is this:

There are 10 files, each 1 GB in size. Every line of every file stores a user query (randomly generated), and the same query may appear repeatedly within and across files. Sort the queries by their frequency.

(Of course, the focus here is large files, so whether it is ten 1 GB files or one 10 GB file, the principle is the same.)

Aboutspeaker's code is here:

https://gist.github.com/4009225

This is very elegant code; both the approach and the implementation are well worth studying.

Solution

The basic approach: keep reading the input file and building partial counts in memory; when a memory limit is reached, flush the counts to disk, writing each query to one of 10 shard files chosen by the hash value of the query, so that the same query always ends up in the same file. Once the whole input has been processed, count the queries in each of the 10 files separately, and finally run a 10-way merge to produce the sorted result.
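As a quick illustration of that sharding rule (a minimal sketch, not Aboutspeaker's code; the query strings are made up), the same query always hashes to the same shard index, so all of its partial counts end up in the same file:

#include <cstdio>
#include <functional>
#include <string>

int main()
{
    const size_t kShards = 10;
    std::hash<std::string> hasher;
    // "query1" appears twice but is always routed to the same shard index.
    for (const std::string q : {"query1", "query2", "query1"})
        std::printf("%s -> shard %zu\n", q.c_str(), hasher(q) % kShards);
    return 0;
}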

Shuffle

The input file is passed in on the command line and read line by line. While reading, queries are counted in a hash map of <query, count>. When the map reaches a specified size (10*1000*1000 entries, chosen mainly with memory capacity in mind), its contents are written out: each entry goes to shard file number hash(query) % 10, which guarantees that the same query always lands in the same file. This repeats until the whole input has been read. So if the total input is 10 GB, each shard file is smaller than 1 GB (because duplicate queries within each flush are already merged), and a single shard can then be processed entirely in memory. Note that although a given query is confined to one shard file, it may still appear in several places within that file, for example (a minimal sketch of this phase follows the example):

Query1 10
Query2 5
Query3 3
Query1 3
Query4 3
Query2 7
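Here is that sketch: a minimal, assumption-level version of the shuffle phase, not Aboutspeaker's actual code. The shard file names follow the program output shown later; the flush threshold and the "count TAB query" line format are my assumptions.

#include <cstdint>
#include <cstdio>
#include <fstream>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Read the input line by line, count queries in an in-memory hash map, and
// flush the map to shard files (chosen by hash(query) % nshards) whenever it
// reaches the size limit. Identical queries always land in the same shard.
void shuffle(const char* input, size_t nshards, size_t max_map_size)
{
    std::vector<std::ofstream> shards(nshards);
    for (size_t i = 0; i < nshards; ++i) {
        char name[64];
        std::snprintf(name, sizeof name, "shard-%05zu-of-%05zu", i, nshards);
        shards[i].open(name);
    }

    std::unordered_map<std::string, int64_t> counts;
    std::hash<std::string> hasher;
    std::ifstream in(input);
    std::string query;
    while (std::getline(in, query)) {
        ++counts[query];                      // operator[] value-initializes to 0
        if (counts.size() >= max_map_size) {  // memory limit reached: flush
            for (const auto& kv : counts)
                shards[hasher(kv.first) % nshards]
                    << kv.second << '\t' << kv.first << '\n';   // count TAB query
            counts.clear();
        }
    }
    for (const auto& kv : counts)             // flush whatever is left
        shards[hasher(kv.first) % nshards]
            << kv.second << '\t' << kv.first << '\n';
}

int main(int argc, char* argv[])
{
    if (argc < 2) { std::fprintf(stderr, "usage: %s input\n", argv[0]); return 1; }
    shuffle(argv[1], 10, 10 * 1000 * 1000);   // 10 shards, ~10M entries per flush
    return 0;
}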

Reduce

Within each shard file, merge the counts belonging to the same query, then sort the queries by count and write each shard back out as an ordered count file.
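A minimal sketch of this reduce step, again an assumption rather than the gist's code; it consumes the "count TAB query" lines produced by the shuffle sketch above and assumes queries contain no tab characters:

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Merge the partial counts in one shard file (lines of "count TAB query"),
// then write the same format back out, ordered by count.
void reduce_shard(const char* shard_file, const char* count_file)
{
    std::unordered_map<std::string, int64_t> counts;
    std::ifstream in(shard_file);
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type tab = line.find('\t');
        if (tab == std::string::npos) continue;
        int64_t partial = std::strtoll(line.c_str(), nullptr, 10);
        counts[line.substr(tab + 1)] += partial;   // merge duplicates of a query
    }

    // Sort by count; std::pair's operator< compares the first element first.
    std::vector<std::pair<int64_t, std::string>> sorted;
    sorted.reserve(counts.size());
    for (const auto& kv : counts)
        sorted.push_back(std::make_pair(kv.second, kv.first));
    std::sort(sorted.begin(), sorted.end());

    std::ofstream out(count_file);
    for (const auto& cq : sorted)
        out << cq.first << '\t' << cq.second << '\n';   // ascending by count
}

int main()
{
    // Process the 10 shards produced by the shuffle phase.
    for (int i = 0; i < 10; ++i) {
        char shard[64], count[64];
        std::snprintf(shard, sizeof shard, "shard-%05d-of-%05d", i, 10);
        std::snprintf(count, sizeof count, "count-%05d-of-%05d", i, 10);
        reduce_shard(shard, count);
    }
    return 0;
}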

Merge

The 10 ordered count files are then merged to produce the final result. The merge is driven by a heap of 10 elements, so every file is read only once; compared with repeated pairwise (two-way) merge passes, this significantly reduces file reading time.
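A minimal sketch of the heap-based 10-way merge using std::priority_queue; this is an illustration under the same assumed "count TAB query" file format, not necessarily how the gist implements it, and the output file name is made up:

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <queue>
#include <string>
#include <vector>

// One heap entry: the current line's count, plus the index of the file it
// came from so we can advance that file after popping.
struct Entry {
    int64_t count;
    std::string line;
    size_t file;
    bool operator<(const Entry& rhs) const { return count > rhs.count; }  // min-heap
};

int main()
{
    const int kShards = 10;
    std::vector<std::ifstream> inputs(kShards);
    std::priority_queue<Entry> heap;

    // Prime the heap with the first line of every count file.
    for (int i = 0; i < kShards; ++i) {
        char name[64];
        std::snprintf(name, sizeof name, "count-%05d-of-%05d", i, kShards);
        inputs[i].open(name);
        std::string line;
        if (std::getline(inputs[i], line))
            heap.push(Entry{std::strtoll(line.c_str(), nullptr, 10), line, (size_t)i});
    }

    // Repeatedly emit the smallest count and refill from the same file, so each
    // file is read exactly once instead of being re-read in pairwise merge passes.
    std::ofstream out("sorted.txt");
    while (!heap.empty()) {
        Entry top = heap.top();
        heap.pop();
        out << top.line << '\n';
        std::string line;
        if (std::getline(inputs[top.file], line))
            heap.push(Entry{std::strtoll(line.c_str(), nullptr, 10), line, top.file});
    }
    return 0;
}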

Run

The program only runs on Linux and requires Boost. On Ubuntu, install Boost first:

apt-get install libboost-dev

Then compile. The program uses C++0x features, so -std=c++0x is needed:

g++ sort.cpp -o sort -std=c++0x

Before running, you need to prepare the input data, which is generated randomly with a Lua script: (https://gist.github.com/4045503)

-- Updated version: use a table so no GC is involved
local file = io.open("file.txt", "w")
local t = {}
for i = 1, 500000000 do
    local n = i % 10000
    local str = string.format("this is a number %d\n", n)
    table.insert(t, str)
    if i % 10000 == 0 then
        file:write(table.concat(t))
        t = {}
    end
end

OK, start running:

./sort file.txt

The results are as follows:

$ time ./sort file.txt
Processing file.txt
Shuffling done
Reading shard-00000-of-00010
Writing count-00000-of-00010
Reading shard-00001-of-00010
Writing count-00001-of-00010
Reading shard-00002-of-00010
Writing count-00002-of-00010
Reading shard-00003-of-00010
Writing count-00003-of-00010
Reading shard-00004-of-00010
Writing count-00004-of-00010
Reading shard-00005-of-00010
Writing count-00005-of-00010
Reading shard-00006-of-00010
Writing count-00006-of-00010
Reading shard-00007-of-00010
Writing count-00007-of-00010
Reading shard-00008-of-00010
Writing count-00008-of-00010
Reading shard-00009-of-00010
Writing count-00009-of-00010
Reducing done
Merging done

real    19m18.805s
user    14m20.726s
sys     1m37.758s

On my 32-bit Ubuntu 11.10 virtual machine, configured with 1 GB of memory and one 2.5 GHz CPU core, processing a 15 GB input file took about 19 minutes.

Learning

    • Assigning queries to different shard files by hash value guarantees that the same query always ends up in the same file. Elegant.
    • The 10-way merge sort is done with a max- (min-) heap, which reduces file reads and writes. Elegant.
    • Small classes such as LocalSink, Shuffler, and Source encapsulate and decouple the individual tasks; the structure is very clean.
    • Some things I was not familiar with (a small self-contained demo of a few of them follows this list):
      • __gnu_cxx::__sso_string: GNU's short-string-optimized string
      • boost::function, boost::bind
      • map's [] operator value-initializes a newly inserted entry with the default constructor, which is 0 for int
      • The C++0x range-based for: for (auto kv : queries)
      • boost::noncopyable: inherit from it to make a class non-copyable
      • std::hash<string>(): returns a hash functor for strings
      • boost::ptr_vector: Boost provides a pointer version of each container, which is more efficient than simply using vector<shared_ptr<T>>
      • unlink: deletes a file
      • std::unordered_map<string, int64_t> queries(read_shard(i, nbuckets)): uses move semantics, which would otherwise be inefficient
      • std::pair defines operator<, comparing the first element first
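Here is that small self-contained demo of a few of the items above (purely illustrative; it does not use the gist's code):

#include <cstdio>
#include <functional>
#include <map>
#include <string>
#include <utility>

int main()
{
    // map's operator[] value-initializes a missing entry, so int starts at 0
    // and ++m["query"] works directly as a counter.
    std::map<std::string, int> m;
    ++m["query1"];
    ++m["query1"];

    // C++0x range-based for over the map.
    for (const auto& kv : m)
        std::printf("%s -> %d\n", kv.first.c_str(), kv.second);

    // std::hash<string>() is a hash functor for strings.
    std::printf("hash %% 10 = %zu\n", std::hash<std::string>()("query1") % 10);

    // std::pair defines operator<, comparing the first element first.
    std::pair<int, std::string> a(3, "zzz"), b(5, "aaa");
    std::printf("a < b: %s\n", a < b ? "true" : "false");   // true: 3 < 5
    return 0;
}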

"Learning" large file statistics and sorting (reprint)
