"Learning" large file statistics and sorting (reprint)


This post records the algorithmic thinking and code that my classmate Aboutspeaker used for the following problem.

The topic is this:

There are 10 files, each 1 GB in size. Every line of every file stores a user query (randomly generated), and the same query may appear repeatedly within and across files. Sort the queries by their frequency.

(Of course, the focus here is large files, so whether it is ten 1 GB files or one 10 GB file, the principle is the same.)

Aboutspeaker's code is here:

https://gist.github.com/4009225

This is very elegant code; both the approach and the implementation are well worth studying.

Solution

The basic approach: keep reading the input file and building partial counts in memory; when a memory limit is reached, flush the counts to disk, writing each query to one of 10 shard files chosen by the hash value of the query, so that the same query always ends up in the same file. Once the whole input has been processed, count the queries in each of the 10 files separately, and finally run a 10-way merge to produce the sorted result.
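As a quick illustration of that sharding rule (a minimal sketch, not Aboutspeaker's code; the query strings are made up), the same query always hashes to the same shard index, so all of its partial counts end up in the same file:

#include <cstdio>
#include <functional>
#include <string>

int main()
{
    const size_t kShards = 10;
    std::hash<std::string> hasher;
    // "query1" appears twice but is always routed to the same shard index.
    for (const std::string q : {"query1", "query2", "query1"})
        std::printf("%s -> shard %zu\n", q.c_str(), hasher(q) % kShards);
    return 0;
}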

Shuffle

The input file is passed in on the command line and read line by line. While reading, queries are counted in a hash map of <query, count>. When the map reaches a specified size (10*1000*1000 entries, chosen mainly with memory capacity in mind), its contents are written out: each entry goes to shard file number hash(query) % 10, which guarantees that the same query always lands in the same file. This repeats until the whole input has been read. So if the total input is 10 GB, each shard file is smaller than 1 GB (because duplicate queries within each flush are already merged), and a single shard can then be processed entirely in memory. Note that although a given query is confined to one shard file, it may still appear in several places within that file, for example (a minimal sketch of this phase follows the example):

Query1 10
Query2 5
Query3 3
Query1 3
Query4 3
Query2 7
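Here is that sketch: a minimal, assumption-level version of the shuffle phase, not Aboutspeaker's actual code. The shard file names follow the program output shown later; the flush threshold and the "count TAB query" line format are my assumptions.

#include <cstdint>
#include <cstdio>
#include <fstream>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Read the input line by line, count queries in an in-memory hash map, and
// flush the map to shard files (chosen by hash(query) % nshards) whenever it
// reaches the size limit. Identical queries always land in the same shard.
void shuffle(const char* input, size_t nshards, size_t max_map_size)
{
    std::vector<std::ofstream> shards(nshards);
    for (size_t i = 0; i < nshards; ++i) {
        char name[64];
        std::snprintf(name, sizeof name, "shard-%05zu-of-%05zu", i, nshards);
        shards[i].open(name);
    }

    std::unordered_map<std::string, int64_t> counts;
    std::hash<std::string> hasher;
    std::ifstream in(input);
    std::string query;
    while (std::getline(in, query)) {
        ++counts[query];                      // operator[] value-initializes to 0
        if (counts.size() >= max_map_size) {  // memory limit reached: flush
            for (const auto& kv : counts)
                shards[hasher(kv.first) % nshards]
                    << kv.second << '\t' << kv.first << '\n';   // count TAB query
            counts.clear();
        }
    }
    for (const auto& kv : counts)             // flush whatever is left
        shards[hasher(kv.first) % nshards]
            << kv.second << '\t' << kv.first << '\n';
}

int main(int argc, char* argv[])
{
    if (argc < 2) { std::fprintf(stderr, "usage: %s input\n", argv[0]); return 1; }
    shuffle(argv[1], 10, 10 * 1000 * 1000);   // 10 shards, ~10M entries per flush
    return 0;
}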

Reduce

Within each shard file, merge the counts belonging to the same query, then sort the queries by count and write each shard back out as an ordered count file.
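A minimal sketch of this reduce step, again an assumption rather than the gist's code; it consumes the "count TAB query" lines produced by the shuffle sketch above and assumes queries contain no tab characters:

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Merge the partial counts in one shard file (lines of "count TAB query"),
// then write the same format back out, ordered by count.
void reduce_shard(const char* shard_file, const char* count_file)
{
    std::unordered_map<std::string, int64_t> counts;
    std::ifstream in(shard_file);
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type tab = line.find('\t');
        if (tab == std::string::npos) continue;
        int64_t partial = std::strtoll(line.c_str(), nullptr, 10);
        counts[line.substr(tab + 1)] += partial;   // merge duplicates of a query
    }

    // Sort by count; std::pair's operator< compares the first element first.
    std::vector<std::pair<int64_t, std::string>> sorted;
    sorted.reserve(counts.size());
    for (const auto& kv : counts)
        sorted.push_back(std::make_pair(kv.second, kv.first));
    std::sort(sorted.begin(), sorted.end());

    std::ofstream out(count_file);
    for (const auto& cq : sorted)
        out << cq.first << '\t' << cq.second << '\n';   // ascending by count
}

int main()
{
    // Process the 10 shards produced by the shuffle phase.
    for (int i = 0; i < 10; ++i) {
        char shard[64], count[64];
        std::snprintf(shard, sizeof shard, "shard-%05d-of-%05d", i, 10);
        std::snprintf(count, sizeof count, "count-%05d-of-%05d", i, 10);
        reduce_shard(shard, count);
    }
    return 0;
}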

Merge

The 10 ordered count files are then merged to produce the final result. The merge is driven by a heap of 10 elements, so every file is read only once; compared with repeated pairwise (two-way) merge passes, this significantly reduces file reading time.
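A minimal sketch of the heap-based 10-way merge using std::priority_queue; this is an illustration under the same assumed "count TAB query" file format, not necessarily how the gist implements it, and the output file name is made up:

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <queue>
#include <string>
#include <vector>

// One heap entry: the current line's count, plus the index of the file it
// came from so we can advance that file after popping.
struct Entry {
    int64_t count;
    std::string line;
    size_t file;
    bool operator<(const Entry& rhs) const { return count > rhs.count; }  // min-heap
};

int main()
{
    const int kShards = 10;
    std::vector<std::ifstream> inputs(kShards);
    std::priority_queue<Entry> heap;

    // Prime the heap with the first line of every count file.
    for (int i = 0; i < kShards; ++i) {
        char name[64];
        std::snprintf(name, sizeof name, "count-%05d-of-%05d", i, kShards);
        inputs[i].open(name);
        std::string line;
        if (std::getline(inputs[i], line))
            heap.push(Entry{std::strtoll(line.c_str(), nullptr, 10), line, (size_t)i});
    }

    // Repeatedly emit the smallest count and refill from the same file, so each
    // file is read exactly once instead of being re-read in pairwise merge passes.
    std::ofstream out("sorted.txt");
    while (!heap.empty()) {
        Entry top = heap.top();
        heap.pop();
        out << top.line << '\n';
        std::string line;
        if (std::getline(inputs[top.file], line))
            heap.push(Entry{std::strtoll(line.c_str(), nullptr, 10), line, top.file});
    }
    return 0;
}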

Run

The program only runs on Linux and requires Boost. On Ubuntu, install Boost first:

apt-get install libboost-dev

Then compile. The program uses C++0x features, so -std=c++0x is needed:

g++ sort.cpp -o sort -std=c++0x

Before running, you need to prepare the input data, which is generated randomly with a Lua script: (https://gist.github.com/4045503)

-- Updated version: use a table so no GC is involved
local file = io.open("file.txt", "w")
local t = {}
for i = 1, 500000000 do
    local n = i % 10000
    local str = string.format("this is a number %d\n", n)
    table.insert(t, str)
    if i % 10000 == 0 then
        file:write(table.concat(t))
        t = {}
    end
end

OK, start running:

./sort file.txt

The results are as follows:

$ time ./sort file.txt
Processing file.txt
Shuffling done
Reading shard-00000-of-00010
Writing count-00000-of-00010
Reading shard-00001-of-00010
Writing count-00001-of-00010
Reading shard-00002-of-00010
Writing count-00002-of-00010
Reading shard-00003-of-00010
Writing count-00003-of-00010
Reading shard-00004-of-00010
Writing count-00004-of-00010
Reading shard-00005-of-00010
Writing count-00005-of-00010
Reading shard-00006-of-00010
Writing count-00006-of-00010
Reading shard-00007-of-00010
Writing count-00007-of-00010
Reading shard-00008-of-00010
Writing count-00008-of-00010
Reading shard-00009-of-00010
Writing count-00009-of-00010
Reducing done
Merging done

real    19m18.805s
user    14m20.726s
sys     1m37.758s

On my 32-bit Ubuntu 11.10 virtual machine, configured with 1 GB of memory and one 2.5 GHz CPU core, processing a 15 GB input file took about 19 minutes.

Learning

    • Assigning queries to different shard files by hash value guarantees that the same query always ends up in the same file. Elegant.
    • The 10-way merge sort is done with a max- (min-) heap, which reduces file reads and writes. Elegant.
    • Small classes such as LocalSink, Shuffler, and Source encapsulate and decouple the individual tasks; the structure is very clean.
    • Some things I was not familiar with (a small self-contained demo of a few of them follows this list):
      • __gnu_cxx::__sso_string: GNU's short-string-optimized string
      • boost::function, boost::bind
      • map's [] operator value-initializes a newly inserted entry with the default constructor, which is 0 for int
      • The C++0x range-based for: for (auto kv : queries)
      • boost::noncopyable: inherit from it to make a class non-copyable
      • std::hash<string>(): returns a hash functor for strings
      • boost::ptr_vector: Boost provides a pointer version of each container, which is more efficient than simply using vector<shared_ptr<T>>
      • unlink: deletes a file
      • std::unordered_map<string, int64_t> queries(read_shard(i, nbuckets)): uses move semantics, which would otherwise be inefficient
      • std::pair defines operator<, comparing the first element first
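Here is that small self-contained demo of a few of the items above (purely illustrative; it does not use the gist's code):

#include <cstdio>
#include <functional>
#include <map>
#include <string>
#include <utility>

int main()
{
    // map's operator[] value-initializes a missing entry, so int starts at 0
    // and ++m["query"] works directly as a counter.
    std::map<std::string, int> m;
    ++m["query1"];
    ++m["query1"];

    // C++0x range-based for over the map.
    for (const auto& kv : m)
        std::printf("%s -> %d\n", kv.first.c_str(), kv.second);

    // std::hash<string>() is a hash functor for strings.
    std::printf("hash %% 10 = %zu\n", std::hash<std::string>()("query1") % 10);

    // std::pair defines operator<, comparing the first element first.
    std::pair<int, std::string> a(3, "zzz"), b(5, "aaa");
    std::printf("a < b: %s\n", a < b ? "true" : "false");   // true: 3 < 5
    return 0;
}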

"Learning" large file statistics and sorting (reprint)
