In this era of cloud computing hype, if you have never processed massive data, you can hardly call yourself a qualified coder. Time to catch up ~
A while ago, I analyzed a data set of nearly 1 TB (gz files, compression ratio 10%). Because it was my first time analyzing data of this size and I had no experience, it took a lot of time. Below are some notes from that experience, to make things easier for those who come after.
Download data
Q: How to automatically download multiple files?
This was the first problem I ran into. When the data volume is large, it is generally split across many files for storage, and downloading them one by one is tedious.
A: Use the wget command. Downloading and installing the Windows version takes a little time, but it saves a lot of time compared with downloading manually.
There are two ways to download the files:
A) Use wget's recursive download option "-r". The general form of the command is:
wget -r http://<root directory of the downloaded data>/ -o <download log file name> -np
Because you cannot control the progress of a recursive download, I recommend not downloading too many files in a single recursive run.
B) Use a BAT file to run wget several times. The general form of the commands is:
wget -r http://<root directory branch of the downloaded data 1>/ -o <download log file name> -np
wget -r http://<root directory branch of the downloaded data 2>/ -o <download log file name> -np
wget -r http://<root directory branch of the downloaded data 3>/ -o <download log file name> -np
...
wget -r http://<root directory branch of the downloaded data N>/ -o <download log file name> -np
Using a BAT file reduces the impact of errors.
In addition, wget's -A option specifies the suffixes of the files to download, and its -P option specifies the directory in which the downloaded files are stored. For more options, see wget -h.
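The command lines for a BAT file like the one above can also be generated instead of typed by hand. The following is a minimal Python sketch, not the author's own script: the helper names, the example URLs, and the log file naming scheme are hypothetical, and only the wget options -r, -np, -o, -A and -P are taken from the text. It assumes that no argument contains spaces, so no quoting is done.

```python
def build_wget_command(url, log_file, accept=None, dest_dir=None):
    """Return the argument list for one recursive wget run."""
    cmd = ["wget", "-r", "-np", "-o", log_file]
    if accept:
        cmd += ["-A", accept]      # only download files with this suffix
    if dest_dir:
        cmd += ["-P", dest_dir]    # store downloaded files under this directory
    cmd.append(url)
    return cmd


def build_batch_lines(branch_urls, accept=None, dest_dir=None):
    """Return one command line per branch, each with its own log file."""
    lines = []
    for i, url in enumerate(branch_urls, start=1):
        # Hypothetical naming scheme: one log per branch.
        cmd = build_wget_command(url, f"download_{i}.log", accept, dest_dir)
        lines.append(" ".join(cmd))
    return lines
```

Writing the returned lines to a .bat file, one per line, gives a script of the same shape as method B. Giving each branch a separate log file makes it easier to see which branch failed.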
Q: At this speed... when will it ever finish?
Network speed is always a bottleneck.
A: If the download server is far away, consider using a proxy. This is how to set a proxy for wget:
set http_proxy=http://<proxy server>
Also, don't forget to run several download processes at once. Why not try 20?
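Running several downloads at once through a proxy can be sketched in Python as follows. This is an illustration under assumptions, not a tested download tool: the function names and the injectable `runner` parameter are hypothetical, the default of 20 workers simply echoes the suggestion above, and the proxy is handed to each wget process through the `http_proxy` environment variable.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def make_env(proxy_url=None, base_env=None):
    """Copy the environment and set http_proxy if a proxy is given."""
    env = dict(os.environ if base_env is None else base_env)
    if proxy_url:
        env["http_proxy"] = proxy_url
    return env


def run_one(cmd, env, runner=subprocess.run):
    """Run one command and return (command, return code)."""
    result = runner(cmd, env=env)
    return cmd, result.returncode


def run_parallel(commands, proxy_url=None, max_workers=20, runner=subprocess.run):
    """Run the commands concurrently; results keep the input order."""
    env = make_env(proxy_url)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_one, cmd, env, runner) for cmd in commands]
        return [f.result() for f in futures]
```

Each entry in `commands` is an argument list such as `["wget", "-r", "-np", "http://<branch>/"]`. A non-zero return code in the result marks a branch that should be retried.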
Open a file
Q: How to open a text file?
This is not a silly question. Just try opening an MB-sized file with Notepad.
A: LTF Viewer
Large Text File Viewer. Its opening speed will surprise you.
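When no viewer is at hand, the first lines of a large text file can also be inspected by reading it lazily instead of loading it whole. A minimal Python sketch, assuming UTF-8 text and treating any file whose name ends in ".gz" as gzip-compressed; the function name `head` and its parameters are illustrative only:

```python
import gzip
from itertools import islice


def head(path, n=10, encoding="utf-8"):
    """Return the first n lines of a (possibly gzip-compressed) text file."""
    opener = gzip.open if path.endswith(".gz") else open
    # Text mode iterates line by line, so only the lines actually
    # requested are read from disk (and decompressed).
    with opener(path, "rt", encoding=encoding, errors="replace") as f:
        return [line.rstrip("\n") for line in islice(f, n)]
```

Because the file object is consumed as an iterator, memory use stays small regardless of the file size.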
Q: How to open a binary file?
A: Hex Editor Neo
You can select the display format as follows:
Right-click the data area => display as => hex | decimal | octal | binary | float | double
You can select how many bytes are grouped together as follows:
Right-click the data area => group by => bytes | words | double | quad
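A rough equivalent of this view can be produced in a few lines of code when a hex editor is not available. The following Python sketch reads the first bytes of a file and formats them in hexadecimal, grouped by a chosen number of bytes. `hex_dump` and `read_prefix` are hypothetical names and have nothing to do with Hex Editor Neo itself.

```python
def hex_dump(data, group=4, per_line=16):
    """Format bytes as hex rows: `group` bytes per group, `per_line` bytes per row."""
    rows = []
    for offset in range(0, len(data), per_line):
        chunk = data[offset:offset + per_line]
        groups = [chunk[i:i + group].hex() for i in range(0, len(chunk), group)]
        # Each row starts with the offset of its first byte.
        rows.append(f"{offset:08x}  " + " ".join(groups))
    return rows


def read_prefix(path, size=64):
    """Read only the first `size` bytes of a file."""
    with open(path, "rb") as f:
        return f.read(size)
```

For example, `print("\n".join(hex_dump(read_prefix("out.bin"))))` shows the first 64 bytes in groups of four, which is a quick way to check whether a file really contains 32-bit values.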
Programming Language
When the data volume is large, choose the language carefully. Different languages have different characteristics, so you need to weigh programming time against running time.
Model Test
At the beginning, a few small pieces of data are usually selected for testing to obtain a first analysis result. At this stage you want to get the program written quickly, so a scripting language such as Python is a good choice.
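Choosing the small trial set can itself be scripted. A minimal Python sketch, assuming the downloaded files are available as a list of paths; `pick_sample`, the sample size, and the fixed seed are hypothetical choices, made so that a trial run can be repeated on the same files:

```python
import random


def pick_sample(paths, k=3, seed=0):
    """Choose up to k files reproducibly for a first trial run."""
    ordered = sorted(paths)        # make the result independent of input order
    rng = random.Random(seed)      # private generator, fixed seed
    return rng.sample(ordered, min(k, len(ordered)))
```

Running the prototype on the same sample every time makes it easier to tell whether a changed result comes from the code or from the data.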
Massive processing
A scripting language is not appropriate for the full pass over all of the data, because its running time is unacceptable. In addition, it gives you little control over memory usage and file read/write operations. Unfortunately, few languages are optimized for processing massive files.
C/C++ is the best choice.
Result Display
The long wait is over and the results are about to appear. If you still insist on the slow-going C/C++ at this stage, you will be frustrated sooner or later.
After trying many approaches, I concluded that MATLAB should take over from C/C++ here. MATLAB can easily display a large amount of data. More importantly, MATLAB supports reading binary files.
filename = 'out.bin'; % binary file
fid = fopen(filename);
data = fread(fid, itemsnumber, '*uint32'); % itemsnumber: number of items to read
fclose(fid);
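For readers who want to check such a file outside MATLAB, the same read can be sketched in Python. This is an assumption-laden equivalent rather than part of the original workflow: it assumes the file is a raw sequence of unsigned 32-bit integers in little-endian byte order, and `read_uint32` is a hypothetical helper name.

```python
import struct


def read_uint32(path, count=None, byte_order="<"):
    """Read up to `count` unsigned 32-bit integers from a raw binary file."""
    with open(path, "rb") as f:
        raw = f.read() if count is None else f.read(4 * count)
    n = len(raw) // 4              # ignore a trailing partial item
    # "<" selects little-endian with the standard 4-byte size for "I".
    return list(struct.unpack(f"{byte_order}{n}I", raw[:4 * n]))
```

Passing `byte_order=">"` reads big-endian data instead. As with the fread call above, fewer items are returned if the file is shorter than requested.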
------------------------------
Link: http://publish.itpub.net/a2010/1203/1133/000001133931.shtml