Programmers should know how to analyze massive data

Source: Internet
Author: User

In this era of cloud computing hype, a programmer who has never processed massive data can hardly be called qualified. Time to catch up ~

A while ago, I analyzed a data set of nearly 1 TB (GZ files, compression ratio 10%). Because it was my first time analyzing data this large and I had no experience, it took a lot of time. Below are some lessons I learned, for the benefit of those who come after.

Download data

Q: How to automatically download multiple files?

This was the first problem I ran into. When the data volume is large, the data is usually split across many files, and downloading them one by one is tedious.

A: Use the wget command. On Windows it takes some time to download and install, but it saves a lot of time compared with downloading manually.

Here are two ways to download the files:

A) Use wget's recursive download option "-r". The general command is as follows:

wget -r http://<root directory of the downloaded data>/ -o <download log file name> -np

Because a recursive download gives you little control over its progress, it is best not to download too many files in a single recursive run.

B) Use a BAT file to run wget several times, once per branch. The general commands are as follows:

wget -r http://<root directory branch of the downloaded data 1>/ -o <download log file name> -np

wget -r http://<root directory branch of the downloaded data 2>/ -o <download log file name> -np

wget -r http://<root directory branch of the downloaded data 3>/ -o <download log file name> -np

...... ......

wget -r http://<root directory branch of the downloaded data N>/ -o <download log file name> -np

Running each branch as a separate command in a BAT file reduces the impact of errors.

In addition, wget's -A option specifies the file suffixes to download, and its -P option specifies the directory in which downloaded files are stored. For more options, see wget -h.
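The batch-file approach in B) can also be generated by a short script. The following is a minimal sketch in Python that builds one wget command per branch, using only the options mentioned above (-r, -np, -o, -A, -P). The server URL, branch names, and log file naming scheme are hypothetical placeholders, and actually running the commands assumes wget is installed and on the PATH.

```python
import subprocess


def build_wget_commands(base_url, branches, suffix=None, dest_dir=None):
    """Return one wget argument list per branch (nothing is executed here)."""
    commands = []
    for i, branch in enumerate(branches, start=1):
        url = f"{base_url.rstrip('/')}/{branch.strip('/')}/"
        # Hypothetical log naming: one log file per branch.
        cmd = ["wget", "-r", "-np", "-o", f"download_{i}.log"]
        if suffix:
            cmd += ["-A", suffix]      # accept only files with this suffix
        if dest_dir:
            cmd += ["-P", dest_dir]    # directory to store downloads in
        cmd.append(url)
        commands.append(cmd)
    return commands


def run_all(commands):
    """Run each command in turn; a failure in one branch does not stop the rest."""
    failed = []
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            failed.append(cmd[-1])     # remember the URL that failed
    return failed


if __name__ == "__main__":
    # Hypothetical server and branch names, for illustration only.
    cmds = build_wget_commands("http://example.com/data", ["branch1", "branch2"],
                               suffix=".gz", dest_dir="downloads")
    for c in cmds:
        print(" ".join(c))
```

Because each branch has its own command and log file, a failed branch can be rerun on its own without repeating the others.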

Q: At this speed... when will it ever finish?

Network speed is always a bottleneck.

A: If the download server is far away, consider using a proxy. This is how to set a proxy for wget:

set http_proxy=http://<proxy server>

Also, don't forget that you can run several download processes at once. Why not try 20?
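One possible way to run many downloads at once is sketched below in Python. Each worker thread simply waits on its own wget process, so up to max_workers wget processes run concurrently. The function names are made up for this illustration, the worker is passed in as a parameter so it can be swapped out, and the sketch assumes wget is on the PATH and honors the http_proxy environment variable.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def download_one(url, proxy=None):
    """Download one URL tree with wget; return (url, exit code)."""
    env = dict(os.environ)
    if proxy:
        env["http_proxy"] = proxy      # same effect as "set http_proxy=..."
    result = subprocess.run(["wget", "-r", "-np", url], env=env)
    return url, result.returncode


def download_parallel(urls, worker=download_one, max_workers=20):
    """Run worker(url) for every URL, with at most max_workers running at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(worker, urls))
```

Whether 20 concurrent downloads actually helps depends on the server and the network, so the number is worth tuning.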

Open a file

Q: How to open a text file?

This is not a silly question. Just try opening a large, megabyte-scale file with Notepad.

A: LTF Viewer

Large Text File Viewer. Its opening speed will surprise you.

Q: How to open a binary file?

A: Hex Editor Neo

You can select the display format (number base or type) as follows:

Right-click data zone => display as => hex | decimal | octal | binary | float | double

You can select how many bytes are grouped into each displayed value as follows:

Right-click data zone => group by => bytes | words | double | quad
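To make the two settings concrete, here is a small Python sketch of what "group by" and "display as" mean for the same bytes. It is not how Hex Editor Neo works internally. The function names are made up for this illustration, groups are printed in file byte order, and little-endian byte order is an assumption.

```python
import struct


def hex_groups(data, group_size=1):
    """Format bytes as hex, group_size bytes per group (for example 1, 2, 4, or 8)."""
    groups = [data[i:i + group_size] for i in range(0, len(data), group_size)]
    return " ".join(g.hex() for g in groups)


def as_values(data, fmt):
    """Interpret data as consecutive little-endian values of one struct format.

    fmt is a struct code such as "I" (uint32), "f" (float), or "d" (double).
    """
    size = struct.calcsize("<" + fmt)
    usable = len(data) - len(data) % size      # ignore a trailing partial value
    return [v[0] for v in struct.iter_unpack("<" + fmt, data[:usable])]


if __name__ == "__main__":
    raw = struct.pack("<2I", 1, 2)     # eight bytes holding two uint32 values
    print(hex_groups(raw, 1))          # grouped by bytes
    print(hex_groups(raw, 4))          # grouped by double words
    print(as_values(raw, "I"))         # displayed as decimal uint32
```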

Programming Language

When the data volume is large, choose the language carefully. Different languages have different characteristics, so you need to weigh programming time against running time.

Model Test

At the beginning, you usually pick a few small pieces of data for testing in order to obtain a first analysis result. At this stage you want to get the program written quickly, so a scripting language such as Python is a good choice.
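As an illustration of this kind of quick test, the following Python sketch reads only the first lines of one GZ file without decompressing the whole file. It assumes the data is line-oriented text encoded as UTF-8, which the article does not state, and the function name is hypothetical.

```python
import gzip
from itertools import islice


def head_of_gzip(path, max_lines=1000, encoding="utf-8"):
    """Return up to max_lines decoded lines from the start of a gzip text file."""
    # gzip.open decompresses on the fly, so only the part that is read is processed.
    with gzip.open(path, "rt", encoding=encoding, errors="replace") as f:
        return [line.rstrip("\n") for line in islice(f, max_lines)]
```

A call such as head_of_gzip("part-0001.gz", max_lines=100), with a file name of your own, gives a small sample to prototype the analysis on.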

Massive processing

A scripting language is not a good fit for the full pass over all the data, because its running time is unacceptable. In addition, it gives you little control over memory usage and file read/write operations. Unfortunately, few languages are optimized for processing massive files.

C/C++ is the best choice.

Result Display

The long wait is over, and the results are nearly here. If you insist on staying with C/C++, which has already kept you waiting this long, you will be frustrated sooner or later.

After trying many approaches, I concluded that MATLAB should take over from C/C++ at this point. MATLAB can easily display a large amount of data. More importantly, MATLAB supports reading binary files.

filename = 'out.bin'; % binary file
fid = fopen(filename);
% itemsnumber: number of uint32 values to read, defined beforehand
data = fread(fid, itemsnumber, '*uint32');
fclose(fid);
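For reference, a file that the fread call above can read is just a sequence of raw 4-byte unsigned integers. The Python sketch below writes such a file and computes how many items it holds, which is one way to obtain a value for itemsnumber. The function names are hypothetical, and little-endian byte order is an assumption that matches typical x86 machines; the article does not say how the author's C/C++ program produced out.bin.

```python
import os
import struct


def write_uint32(path, values):
    """Write integers as consecutive little-endian unsigned 32-bit values."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}I", *values))


def count_uint32(path):
    """Number of whole uint32 items in the file (file size divided by 4)."""
    return os.path.getsize(path) // 4


if __name__ == "__main__":
    write_uint32("out.bin", [10, 20, 30])
    print(count_uint32("out.bin"))     # item count to use when reading the file
```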

------------------------------

Link: http://publish.itpub.net/a2010/1203/1133/000001133931.shtml
