Programmers should know how to analyze massive amounts of data



Source: http://www.cnblogs.com/MicroTeam/archive/2010/12/03/1895071.html

In this era of cloud-computing hype, if you have never processed massive amounts of data, you can hardly call yourself a qualified coder. Hurry up and fill that gap.

A few years ago I analyzed a data set of nearly 1 TB (gz files with a compression ratio of about 10%). Because it was my first time analyzing data at that scale, I had no experience and wasted a lot of time. Here are some of the lessons I have collected, to help those who come after me.

All kinds of additions are welcome, and I will keep updating this article. If you find it useful, please share the link; if you have different opinions, feel free to throw bricks.

Downloading data
Q: How can I download multiple files automatically?

This was the first problem I ran into. When the amount of data is large, it is usually split across many files, which makes downloading troublesome.

A: Use the wget command. On Windows it takes a little time to download and install, but it saves far more time than downloading by hand.

I offer two ways to download files:

a) Use wget's recursive download option "-r". The general command is:

wget -r http://<download root directory>/ -o <download log file name> -np

Because you cannot control the progress of a recursive download, it is not recommended to download too many files recursively at once.

b) Use a .bat file to run wget multiple times. The general commands are:

wget -r http://<download root, branch 1>/ -o <download log file name> -np

wget -r http://<download root, branch 2>/ -o <download log file name> -np

wget -r http://<download root, branch 3>/ -o <download log file name> -np

...... ......

wget -r http://<download root, branch n>/ -o <download log file name> -np

Splitting the download with a batch file limits the impact of any single failure.

In addition, wget can set the directory files are saved into with the -P option, and restrict downloads to given file suffixes with the -A option. For more options, see wget -h.

Q: At this speed... when will it ever finish?

Speed is always the bottleneck.

A: If the download server is far away, consider a proxy. For wget, set the proxy like this:

set http_proxy=http://<proxy server>

And don't forget to start a few more wget processes in parallel; try 20?

Open File
Q: How do I open a huge text file?

This is not a silly question. Try opening a 1000 MB file with Notepad.

A: Large Text File Viewer (LTF Viewer); its opening speed will surprise you.

Q: How do I open a binary file?

A: Hex Editor Neo

You can choose the display base as follows:

Right-click the data area -> Display as -> Hex | Decimal | Octal | Binary | Float | Double

You can choose how many bytes are grouped together as follows:

Right-click the data area -> Group by -> Bytes | Words | Double Words | Quad Words

Programming languages

When the amount of data is large, choose your language carefully. Different languages have different strengths, so you have to weigh programming time against run time.

Model Testing

Initially you typically select a small subset of the data for testing, to get a first round of analysis results. At this stage you want to implement quickly, so a scripting language such as Python is a good choice.
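As a sketch of this prototyping step (the record format and field layout here are invented for illustration), a few lines of Python are enough to pull a summary out of a small sample before touching the full data set:

```python
import gzip
import io

def summarize(lines):
    """Count records and track the maximum value of a (hypothetical)
    integer field at the end of each tab-separated line."""
    count, max_val = 0, 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        count += 1
        max_val = max(max_val, int(fields[-1]))
    return count, max_val

# Quick check on a small in-memory sample; for a real .gz file,
# gzip.open(path, "rt") yields lines in exactly the same way.
sample = "a\t1\nb\t7\nc\t3\n"
print(summarize(io.StringIO(sample)))  # (3, 7)
```

Once the logic looks right on the sample, the same function can be pointed at a handful of real files to produce the first results.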

Bulk processing

When you start traversing all of the data, a scripting language is no longer appropriate: its run time is unacceptable, and you cannot control memory use or file I/O. Unfortunately, few languages will optimize the processing of a huge number of files for you.

C++ is the best option at this point.

Displaying results

The long wait is finally over and the results are in. If you still cling to the C++ you have been living in, you will be frustrated sooner or later.

After trying many approaches, I concluded that MATLAB should take over from C++ here. MATLAB can easily display large amounts of data, and, more importantly, it can read binary files directly:

filename = 'out.bin'; % binary file
fid = fopen(filename);
data = fread(fid, itemsNumber, '*uint32');
fclose(fid);

Algorithm
Read files all at once

I have tested this several times: reading a whole file in one go is about five times faster than reading it line by line.
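The idea can be sketched as follows (Python used for brevity; the same applies to C++ with fread versus fgets, and the exact speedup depends on your disk and file sizes):

```python
import os
import tempfile

def read_line_by_line(path):
    # Iterating a file object reads one line at a time: many small reads.
    with open(path, "rb") as f:
        return b"".join(f)

def read_at_once(path):
    # A single bulk read pulls the whole file into memory in one call.
    with open(path, "rb") as f:
        return f.read()

# Demo on a small temporary file; both methods must return the same bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"line1\nline2\nline3\n")
    path = tmp.name
try:
    assert read_at_once(path) == read_line_by_line(path)
finally:
    os.remove(path)
```

The bulk read trades memory for speed, so it only works when the file (or a chunk of it) fits in RAM.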

Remember O (N)

Then you have to think about the complexity of the algorithm. Any O(N^2) algorithm is unacceptable at this scale.

When necessary, trade space for time. A hash table can usually save a lot of it.
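As an illustration of that trade (sketched in Python; the task is invented), finding which items appear in both of two large lists is O(N*M) with nested scans but roughly O(N+M) with a hash-based set:

```python
def common_items_quadratic(a, b):
    # O(N*M): 'x in b' on a list scans b once for every element of a.
    return [x for x in a if x in b]

def common_items_hashed(a, b):
    # O(N+M): one pass to build the hash set, one pass to probe it.
    seen = set(b)
    return [x for x in a if x in seen]

a = [1, 2, 3, 4]
b = [3, 4, 5, 6]
assert common_items_quadratic(a, b) == common_items_hashed(a, b) == [3, 4]
```

The set costs extra memory proportional to len(b), which is exactly the space being traded for time.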

Parallel processing

Brush up on your parallel algorithms. It beats waiting on a single thread.

You might also consider running the program on the GPU. Of course, memory and file-read times are then even more likely to be the bottleneck.

Whether memory, CPU, or disk read speed is the bottleneck, Task Manager will tell you.
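A minimal sketch of the split-and-merge pattern (in Python; the per-chunk work is a made-up placeholder): divide the data among workers, process the chunks in parallel, then combine the partial results. Threads help for I/O-bound work such as reading files; for CPU-bound loops in Python you would swap in multiprocessing.Pool, which has the same map interface.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_chunk(chunk):
    """Hypothetical per-chunk work: count values above a threshold."""
    return sum(1 for x in chunk if x > 100)

def parallel_count(data, workers=4):
    if not data:
        return 0
    # Split the data into one chunk per worker and process in parallel.
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Merge step: the partial counts simply sum.
        return sum(pool.map(analyze_chunk, chunks))

print(parallel_count(list(range(200))))  # 99 values are > 100
```

This only pays off when the per-chunk work dominates the cost of splitting and merging, which is typically the case for large data.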

Optimize core code

Typically, 80% of the time is spent running 20% of the code. So when you have spare time, optimize the code that is executed most frequently.
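The practical corollary: measure before optimizing. A hedged sketch of finding that hot 20% with Python's standard profiler (the hot function here is a deliberately wasteful stand-in):

```python
import cProfile
import io
import pstats

def hot_function(n):
    # Deliberately wasteful inner loop, standing in for the real hot spot.
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
hot_function(10_000)
profiler.disable()

# Print the five most expensive calls; the hot spot shows up by name.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
assert "hot_function" in report
```

Once the profiler names the culprit, optimization effort goes there and nowhere else.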

Distributed storage

Putting all the analysis results in a single file is a bad decision. It causes a lot of trouble later, for example with parallel processing, and the file becomes too large.

Save intermediate data in binary form

Binary storage typically saves half the disk space, which also means less than half the time writing to and reading from disk. It saves the text-conversion time as well.

Another important detail: on Windows, open the files in modes "rb" and "wb". Otherwise an inexplicable bug will happen sooner or later, and it will not necessarily be easy to find.

Run
Debug vs Release

Don't forget to switch the compile mode to Release for the final run. But right after changing the program, run it once in Debug mode first, so you can locate any run-time exception.

Batch Processing

Batch processing is a good way to reduce the cost of run-time errors. You can never be sure the program will end normally, so executing it in stages is a good choice: if something goes wrong, you do not have to rerun the earlier stages.

Assertions

When the amount of data is large, it is hard to guarantee that every input is legal. In other cases the data is legal but outside what we considered. Assertions are very important here. They do increase run time, but that is always better than spending a long run producing a wrong result.
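A sketch of that defensive style (in Python; the two-field record layout is invented): validate each record as it is parsed, so a malformed line fails loudly instead of silently skewing the final result.

```python
def parse_record(line):
    """Parse one 'key<TAB>count' record, asserting it is well formed."""
    fields = line.rstrip("\n").split("\t")
    assert len(fields) == 2, f"expected 2 fields, got {len(fields)}: {line!r}"
    key, value = fields
    value = int(value)  # raises ValueError on non-numeric input
    assert value >= 0, f"negative count for {key!r}: {value}"
    return key, value

print(parse_record("clicks\t42"))  # ('clicks', 42)
```

In C++ the equivalent is assert() from <cassert>, with the same trade-off: a small per-record cost in exchange for catching bad input at the line where it first appears.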

Record the run status to a file

As mentioned earlier, when the amount of data is large it is hard to guarantee the program ends properly, and few people sit in front of the monitor waiting for the output. Logging the run status to a file at regular intervals is essential.

Also, don't forget fclose();
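A minimal sketch of this (in Python; the log file name and interval are placeholders): write a progress line periodically and flush it, so the log survives even if the run dies partway through.

```python
import os
import tempfile

def run_with_progress(items, log_path, every=2):
    """Process items, logging progress to log_path every 'every' items."""
    done = 0
    with open(log_path, "w") as log:
        for item in items:
            done += 1  # stand-in for the real per-item work
            if done % every == 0:
                log.write(f"processed {done} items\n")
                log.flush()              # push it out of the buffer
                os.fsync(log.fileno())   # and onto the disk
    # The 'with' block closes the file: the fclose() the text warns about.
    return done

log_path = os.path.join(tempfile.gettempdir(), "progress.log")
print(run_with_progress(range(6), log_path))  # 6
```

Without the flush (and close), buffered progress lines can be lost exactly when the crash makes you need them most.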

Appendix: 64-bit programming issues

When the amount of data is large, memory often runs short. One piece of common sense: the maximum address space of a 32-bit program is 2 GB. If you need to allocate close to or more than 2 GB of memory, try a 64-bit program. Of course, that requires two things: a 64-bit CPU and a 64-bit operating system.

Here are some lessons from writing 64-bit programs.

Compilation environment

For an interpreted language such as Python, you need to download a 64-bit interpreter.
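A quick way to check which build you actually have (standard-library only; sys.maxsize is the interpreter's largest native integer index):

```python
import sys

# sys.maxsize is 2**31 - 1 on a 32-bit build and 2**63 - 1 on a 64-bit build.
is_64bit = sys.maxsize > 2**32
print("64-bit interpreter:", is_64bit)
```

A 32-bit interpreter on a 64-bit OS still hits the 32-bit address-space limit, so this check is worth running before a large allocation fails mysteriously.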

For a compiled language such as C++, you need to choose the right target platform.

For example, in VS2008: Project Properties -> Configuration Manager -> Platform -> New -> x64

Memory

When allocating large arrays, use malloc (the heap) instead of defining the array directly (the stack).

sizeof(int) != sizeof(size_t)

In 64-bit programs, array subscripts should be size_t, and constants also need a cast, e.g. size_t fourGB = (size_t)4 * 1024 * 1024 * 1024;

File

fwrite seems to have problems writing more than 4 GB to a file in a single call.

Try splitting the write into multiple calls instead.

