Programmers should know how to analyze massive amounts of data



Source: http://www.cnblogs.com/MicroTeam/archive/2010/12/03/1895071.html

In this era of cloud-computing hype, if you have never processed massive amounts of data, you can hardly call yourself a qualified coder. Hurry up and fill that gap.

A few years ago I analyzed a data set of nearly 1 TB (gz files with a compression ratio of about 10%). Because it was my first time analyzing data at that scale, I had no experience and wasted a lot of time. Here are some of the lessons I have collected, to help those who come after me.

All kinds of additions are welcome, and I will keep updating this article. If you find it useful, please share the link; if you have different opinions, feel free to throw bricks.

Downloading data
Q: How can I download multiple files automatically?

This was the first problem I ran into. When the amount of data is large, it is usually split across many files, which makes downloading troublesome.

A: Use the wget command. On Windows it takes a little time to download and install, but it saves far more time than downloading by hand.

I offer two ways to download files:

a) Use wget's recursive download option "-r". The general command is:

wget -r http://<download root directory>/ -o <download log file name> -np

Because you cannot control the progress of a recursive download, it is not recommended to download too many files recursively at once.

b) Use a .bat file to run wget multiple times. The general commands are:

wget -r http://<download root, branch 1>/ -o <download log file name> -np

wget -r http://<download root, branch 2>/ -o <download log file name> -np

wget -r http://<download root, branch 3>/ -o <download log file name> -np

...... ......

wget -r http://<download root, branch n>/ -o <download log file name> -np

Splitting the download with a batch file limits the impact of any single failure.

In addition, wget can set the directory files are saved into with the -P option, and restrict downloads to given file suffixes with the -A option. For more options, see wget -h.

Q: At this speed... when will it ever finish?

Speed is always the bottleneck.

A: If the download server is far away, consider a proxy. For wget, set the proxy like this:

set http_proxy=http://<proxy server>

And don't forget to start a few more wget processes in parallel; try 20?

Open File
Q: How do I open a huge text file?

This is not a silly question. Try opening a 1000 MB file with Notepad.

A: Large Text File Viewer (LTF Viewer); its opening speed will surprise you.

Q: How do I open a binary file?

A: Hex Editor Neo

You can choose the display base as follows:

Right-click the data area -> Display as -> Hex | Decimal | Octal | Binary | Float | Double

You can choose how many bytes are grouped together as follows:

Right-click the data area -> Group by -> Bytes | Words | Double Words | Quad Words

Programming languages

When the amount of data is large, choose your language carefully. Different languages have different strengths, so you have to weigh programming time against run time.

Model Testing

Initially you typically select a small subset of the data for testing, to get a first round of analysis results. At this stage you want to implement quickly, so a scripting language such as Python is a good choice.
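As a sketch of this prototyping step (the record format and field layout here are invented for illustration), a few lines of Python are enough to pull a summary out of a small sample before touching the full data set:

```python
import gzip
import io

def summarize(lines):
    """Count records and track the maximum value of a (hypothetical)
    integer field at the end of each tab-separated line."""
    count, max_val = 0, 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        count += 1
        max_val = max(max_val, int(fields[-1]))
    return count, max_val

# Quick check on a small in-memory sample; for a real .gz file,
# gzip.open(path, "rt") yields lines in exactly the same way.
sample = "a\t1\nb\t7\nc\t3\n"
print(summarize(io.StringIO(sample)))  # (3, 7)
```

Once the logic looks right on the sample, the same function can be pointed at a handful of real files to produce the first results.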

Bulk processing

When you start traversing all of the data, a scripting language is no longer appropriate: its run time is unacceptable, and you cannot control memory use or file I/O. Unfortunately, few languages will optimize the processing of a huge number of files for you.

C++ is the best option at this point.

Displaying results

The long wait is finally over and the results are in. If you still cling to the C++ you have been living in, you will be frustrated sooner or later.

After trying many approaches, I concluded that MATLAB should take over from C++ here. MATLAB can easily display large amounts of data, and, more importantly, it can read binary files directly:

filename = 'out.bin'; % binary file
fid = fopen(filename);
data = fread(fid, itemsNumber, '*uint32');
fclose(fid);

Algorithm
Read files all at once

I have tested this several times: reading a whole file in one go is about five times faster than reading it line by line.
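The idea can be sketched as follows (Python used for brevity; the same applies to C++ with fread versus fgets, and the exact speedup depends on your disk and file sizes):

```python
import os
import tempfile

def read_line_by_line(path):
    # Iterating a file object reads one line at a time: many small reads.
    with open(path, "rb") as f:
        return b"".join(f)

def read_at_once(path):
    # A single bulk read pulls the whole file into memory in one call.
    with open(path, "rb") as f:
        return f.read()

# Demo on a small temporary file; both methods must return the same bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"line1\nline2\nline3\n")
    path = tmp.name
try:
    assert read_at_once(path) == read_line_by_line(path)
finally:
    os.remove(path)
```

The bulk read trades memory for speed, so it only works when the file (or a chunk of it) fits in RAM.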

Remember O (N)

Then you have to think about the complexity of the algorithm. Any O(N^2) algorithm is unacceptable at this scale.

When necessary, trade space for time. A hash table can usually save a lot of it.
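As an illustration of that trade (sketched in Python; the task is invented), finding which items appear in both of two large lists is O(N*M) with nested scans but roughly O(N+M) with a hash-based set:

```python
def common_items_quadratic(a, b):
    # O(N*M): 'x in b' on a list scans b once for every element of a.
    return [x for x in a if x in b]

def common_items_hashed(a, b):
    # O(N+M): one pass to build the hash set, one pass to probe it.
    seen = set(b)
    return [x for x in a if x in seen]

a = [1, 2, 3, 4]
b = [3, 4, 5, 6]
assert common_items_quadratic(a, b) == common_items_hashed(a, b) == [3, 4]
```

The set costs extra memory proportional to len(b), which is exactly the space being traded for time.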

Parallel processing

Brush up on your parallel algorithms. It beats waiting on a single thread.

You might also consider running the program on the GPU. Of course, memory and file-read times are then even more likely to be the bottleneck.

Whether memory, CPU, or disk read speed is the bottleneck, Task Manager will tell you.
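A minimal sketch of the split-and-merge pattern (in Python; the per-chunk work is a made-up placeholder): divide the data among workers, process the chunks in parallel, then combine the partial results. Threads help for I/O-bound work such as reading files; for CPU-bound loops in Python you would swap in multiprocessing.Pool, which has the same map interface.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_chunk(chunk):
    """Hypothetical per-chunk work: count values above a threshold."""
    return sum(1 for x in chunk if x > 100)

def parallel_count(data, workers=4):
    if not data:
        return 0
    # Split the data into one chunk per worker and process in parallel.
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Merge step: the partial counts simply sum.
        return sum(pool.map(analyze_chunk, chunks))

print(parallel_count(list(range(200))))  # 99 values are > 100
```

This only pays off when the per-chunk work dominates the cost of splitting and merging, which is typically the case for large data.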

Optimize core code

Typically, 80% of the time is spent running 20% of the code. So when you have spare time, optimize the code that is executed most frequently.
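The practical corollary: measure before optimizing. A hedged sketch of finding that hot 20% with Python's standard profiler (the hot function here is a deliberately wasteful stand-in):

```python
import cProfile
import io
import pstats

def hot_function(n):
    # Deliberately wasteful inner loop, standing in for the real hot spot.
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
hot_function(10_000)
profiler.disable()

# Print the five most expensive calls; the hot spot shows up by name.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
assert "hot_function" in report
```

Once the profiler names the culprit, optimization effort goes there and nowhere else.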

Distributed storage

Putting all the analysis results in a single file is a bad decision. It causes a lot of trouble later, for example with parallel processing, and the file becomes too large.

Save intermediate data in binary form

Binary storage typically saves half the disk space, which also means less than half the time writing to and reading from disk. It saves the text-conversion time as well.

Another important detail: on Windows, open the files in modes "rb" and "wb". Otherwise an inexplicable bug will happen sooner or later, and it will not necessarily be easy to find.

Run
Debug vs Release

Don't forget to switch the compile mode to Release for the final run. But right after changing the program, run it once in Debug mode first, so you can locate any run-time exception.

Batch Processing

Batch processing is a good way to reduce the cost of run-time errors. You can never be sure the program will end normally, so executing it in stages is a good choice: if something goes wrong, you do not have to rerun the earlier stages.

Assertions

When the amount of data is large, it is hard to guarantee that every input is legal. In other cases the data is legal but outside what we considered. Assertions are very important here. They do increase run time, but that is always better than spending a long run producing a wrong result.
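A sketch of that defensive style (in Python; the two-field record layout is invented): validate each record as it is parsed, so a malformed line fails loudly instead of silently skewing the final result.

```python
def parse_record(line):
    """Parse one 'key<TAB>count' record, asserting it is well formed."""
    fields = line.rstrip("\n").split("\t")
    assert len(fields) == 2, f"expected 2 fields, got {len(fields)}: {line!r}"
    key, value = fields
    value = int(value)  # raises ValueError on non-numeric input
    assert value >= 0, f"negative count for {key!r}: {value}"
    return key, value

print(parse_record("clicks\t42"))  # ('clicks', 42)
```

In C++ the equivalent is assert() from <cassert>, with the same trade-off: a small per-record cost in exchange for catching bad input at the line where it first appears.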

Record the run status to a file

As mentioned earlier, when the amount of data is large it is hard to guarantee the program ends properly, and few people sit in front of the monitor waiting for the output. Logging the run status to a file at regular intervals is essential.

Also, don't forget fclose();
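A minimal sketch of this (in Python; the log file name and interval are placeholders): write a progress line periodically and flush it, so the log survives even if the run dies partway through.

```python
import os
import tempfile

def run_with_progress(items, log_path, every=2):
    """Process items, logging progress to log_path every 'every' items."""
    done = 0
    with open(log_path, "w") as log:
        for item in items:
            done += 1  # stand-in for the real per-item work
            if done % every == 0:
                log.write(f"processed {done} items\n")
                log.flush()              # push it out of the buffer
                os.fsync(log.fileno())   # and onto the disk
    # The 'with' block closes the file: the fclose() the text warns about.
    return done

log_path = os.path.join(tempfile.gettempdir(), "progress.log")
print(run_with_progress(range(6), log_path))  # 6
```

Without the flush (and close), buffered progress lines can be lost exactly when the crash makes you need them most.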

Appendix: 64-bit programming issues

When the amount of data is large, memory often runs short. One piece of common sense: the maximum address space of a 32-bit program is 2 GB. If you need to allocate close to or more than 2 GB of memory, try a 64-bit program. Of course, that requires two things: a 64-bit CPU and a 64-bit operating system.

Here are some lessons from writing 64-bit programs.

Compilation environment

For an interpreted language such as Python, you need to download a 64-bit interpreter.
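A quick way to check which build you actually have (standard-library only; sys.maxsize is the interpreter's largest native integer index):

```python
import sys

# sys.maxsize is 2**31 - 1 on a 32-bit build and 2**63 - 1 on a 64-bit build.
is_64bit = sys.maxsize > 2**32
print("64-bit interpreter:", is_64bit)
```

A 32-bit interpreter on a 64-bit OS still hits the 32-bit address-space limit, so this check is worth running before a large allocation fails mysteriously.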

For a compiled language such as C++, you need to choose the right target platform.

For example, in VS2008: Project Properties -> Configuration Manager -> Platform -> New -> x64

Memory

When allocating large arrays, use malloc (the heap) instead of defining the array directly (the stack).

sizeof(int) != sizeof(size_t)

In 64-bit programs, array subscripts should be size_t, and constants also need a cast, e.g. size_t fourGB = (size_t)4 * 1024 * 1024 * 1024;

File

fwrite seems to have problems writing more than 4 GB to a file in a single call.

Try splitting the write into multiple calls instead.

