Java Big Data Processing

Source: Internet
Author: User
Tags: ftp, file

Task: fetch xx data files from an FTP host.

"Tens of millions" is only a rough figure here. It means the number of records is on the order of ten million or more.
This article does not cover distributed collection or storage; all data is processed on a single machine. If the data volume is larger still, consider distributed processing. If I gain experience in that area later, I will share it.

1. The FTP tool used by the program.
2. The key part of handling tens of millions of files over FTP is listing the directory into a file. Once this is done, performance is not much of a problem.
 
You can use the Apache FTP client to send the FTP command "NLST" and write the directory listing to a file.
 
# Command used to list the FTP directory. The environment variable takes precedence; if this parameter is not set, the default listing mode NLST is used.
# DS_LIST_CMD = NLST
[Java]
public File sendCommandAndListToFile(String command, String localPathName) throws IOException
{
    try {
        // createFile is presumably the author's own wrapper (not shown):
        // it runs the command and writes the listing to localPathName
        return client.createFile(command, localPathName);
    } catch (IOException e) {
        log.error(e);
        throw new IOException("the command " + command + " is incorrect", e);
    }
}

Other approaches are certainly possible; you can explore them yourself.

If the directory holds more than about 100,000 entries, do not use the following method, because it returns the whole listing as an in-memory array:

FTPFile[] dirList = client.listFiles();
 
3. Read the names of the files to download from the listing file in batches and load each batch into memory for processing, or read one file name and download one file at a time. Do not load all the names into memory at once; with this many entries that causes problems.

Why in batches?
Because the data volume is large: with this many records, the listing file itself can exceed 1 GB.
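
Batch reading of the listing file can be sketched as follows. This is a minimal sketch, not the author's code: the class name `ListingBatchReader`, the method `forEachBatch`, and the assumption that the listing holds one file name per line in UTF-8 are all hypothetical. It uses only the JDK.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: streams a listing file and hands out file names in fixed-size batches. */
class ListingBatchReader {

    /**
     * Reads one file name per line (assumed format) and passes each batch to the handler.
     * Only one batch is held in memory at a time. Returns the number of batches delivered.
     */
    static int forEachBatch(Path listingFile, int batchSize, Consumer<List<String>> handler)
            throws IOException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        List<String> batch = new ArrayList<>(batchSize);
        try (BufferedReader reader = Files.newBufferedReader(listingFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String name = line.trim();
                if (name.isEmpty()) {
                    continue; // skip blank lines
                }
                batch.add(name);
                if (batch.size() == batchSize) {
                    handler.accept(batch);
                    batches++;
                    batch = new ArrayList<>(batchSize); // start a fresh batch
                }
            }
        }
        if (!batch.isEmpty()) { // deliver the final, possibly smaller batch
            handler.accept(batch);
            batches++;
        }
        return batches;
    }
}
```

A caller would pass a handler that downloads every name in the batch, for example `ListingBatchReader.forEachBatch(path, 1000, names -> names.forEach(downloader))`, where `downloader` is whatever download routine the program uses.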


4. Core code for file download: decide whether to resume. Get the size of the FTP file and of the local file, then use the resume (restart offset) feature that FTP provides.
Files must be downloaded in binary mode.
client.enterLocalPassiveMode(); // switch to passive mode
ftpclient.binary(); // binary mode is required
[Java]
/**
 * Downloads the required file, resuming a partial download when possible.
 * (The original note says the FTP file is deleted after download to avoid
 * duplicates; that step is not shown here.)
 * @param pathName  remote file
 * @param localPath local file
 * @return true if the download succeeded
 * @throws IOException if the transfer fails
 */
public boolean downLoad(String pathName, String localPath) throws IOException {
    boolean flag = false;
    File file = new File(localPath + ".tmp"); // download to a temporary file first
    FileOutputStream out = null;
    try {
        client.enterLocalPassiveMode();           // switch to passive mode
        client.setFileType(FTP.BINARY_FILE_TYPE); // binary transfer
        // re-encode the remote path in the control-connection encoding
        String remoteName = new String(pathName.getBytes(), client.getControlEncoding());
        if (lff.getIsFileExists(file)) {
            // A partial local file exists: resume unless it is longer than the FTP file.
            long size = this.getSize(pathName);
            long localFileSize = lff.getSize(file);
            if (localFileSize > size) {
                return false;
            }
            out = new FileOutputStream(file, true); // append to the partial file
            client.setRestartOffset(localFileSize);
        } else {
            out = new FileOutputStream(file);
        }
        flag = client.retrieveFile(remoteName, out);
        out.flush();
    } catch (IOException e) {
        log.error(e);
        log.error("file download error!");
        throw e;
    } finally {
        if (null != out)
            out.close();
        if (flag)
            lff.rename(file, localPath); // drop the .tmp suffix once complete
    }
    return flag;
}
/**
 * Gets the size of a file on the FTP server.
 * @param fileNamepath path of the remote file
 * @return the file size in bytes, or 0 if the server lists no such file
 * @throws IOException if the listing fails
 */
public long getSize(String fileNamepath) throws IOException {
    FTPFile[] ftp = client.listFiles(new String(fileNamepath.getBytes(), client.getControlEncoding()));
    return ftp.length == 0 ? 0 : ftp[0].getSize();
}
 
Check whether the local file has already been downloaded and, if so, how large it is.
 
/**
 * Gets the size of a local file.
 * @param file the local file
 * @return the file length in bytes, or 0 if the file does not exist
 */
public long getSize(File file) {
    long size = 0;
    if (getIsFileExists(file)) {
        size = file.length();
    }
    return size;
}
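
The resume decision in step 4 comes down to comparing the local and remote sizes, so it can be isolated from the FTP calls and tested on its own. The following is a minimal sketch under stated assumptions: `ResumePlanner` and its `Action` values are hypothetical names, and treating equal sizes as "already complete" is an assumption of this sketch rather than something the article states.

```java
/** Hypothetical helper that isolates the resume decision from the FTP calls. */
class ResumePlanner {

    enum Action { START_FRESH, RESUME, ALREADY_COMPLETE, LOCAL_LARGER_THAN_REMOTE }

    /** Decides what to do from the local temp-file state and the remote size. */
    static Action plan(boolean localExists, long localSize, long remoteSize) {
        if (!localExists || localSize == 0) {
            return Action.START_FRESH;              // nothing usable on disk yet
        }
        if (localSize < remoteSize) {
            return Action.RESUME;                   // continue from localSize
        }
        if (localSize == remoteSize) {
            return Action.ALREADY_COMPLETE;         // assumption: equal size means done
        }
        return Action.LOCAL_LARGER_THAN_REMOTE;     // inconsistent state, do not append
    }

    /** Offset to use as the FTP restart offset: the local size when resuming, otherwise 0. */
    static long restartOffset(Action action, long localSize) {
        return action == Action.RESUME ? localSize : 0L;
    }
}
```

Keeping the decision in a pure function like this makes the edge cases (empty temp file, local file larger than the remote one) easy to check without an FTP server.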

5. Because the program may run more than 100 threads, thread monitoring is added to detect dead threads and restart them promptly.
t.setUncaughtExceptionHandler(new ThreadException(exList));
Principle: attach an UncaughtExceptionHandler to each thread. When a thread dies, the handler adds that thread's information to a list, and the main thread scans the list at intervals. If the list contains entries, the main thread creates a new thread to take over the work.
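
The monitoring principle can be sketched with the JDK alone. This is an illustration, not the author's `ThreadException` class: `DeadThreadRecorder`, `WorkerSupervisor`, and the idea of identifying workers by thread name are assumptions made for the example.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Function;

/** Hypothetical stand-in for the article's handler: records the name of each thread that dies. */
class DeadThreadRecorder implements Thread.UncaughtExceptionHandler {
    private final Queue<String> deadWorkers;

    DeadThreadRecorder(Queue<String> deadWorkers) {
        this.deadWorkers = deadWorkers;
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        deadWorkers.add(t.getName()); // runs on the dying thread, before it terminates
    }
}

/** Hypothetical supervisor: starts workers and restarts those recorded as dead. */
class WorkerSupervisor {
    // Thread-safe queue, because workers write to it and the main thread reads from it.
    private final Queue<String> deadWorkers = new ConcurrentLinkedQueue<>();

    Thread start(String name, Runnable task) {
        Thread t = new Thread(task, name);
        t.setUncaughtExceptionHandler(new DeadThreadRecorder(deadWorkers));
        t.start();
        return t;
    }

    /** Called at intervals by the main thread. Returns how many workers were restarted. */
    int restartDead(Function<String, Runnable> taskFor) {
        int restarted = 0;
        String name;
        while ((name = deadWorkers.poll()) != null) {
            start(name, taskFor.apply(name)); // new thread, same name, fresh task
            restarted++;
        }
        return restarted;
    }
}
```

The main thread would call `restartDead` from its periodic scan loop. Note that this only catches threads that die from an uncaught exception; a thread that hangs on a stalled socket is not detected this way and needs timeouts on the connection.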

6. If the program is a long-running (resident) process, do not forget to close unused FTP connections in a finally block.
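
A minimal sketch of the close-in-finally pattern follows. To keep it self-contained, it is written against a hypothetical `FtpSession` interface rather than a real client; with a real FTP client you would call its own logout and disconnect methods in the same places. `FtpSessions`, `Work`, and `runAndClose` are likewise illustrative names.

```java
import java.io.IOException;

/** Hypothetical minimal view of an FTP connection, used only for this sketch. */
interface FtpSession {
    boolean isConnected();
    void logout() throws IOException;
    void disconnect() throws IOException;
}

class FtpSessions {

    /** A unit of work that uses an open session and may fail with an IOException. */
    interface Work {
        void run(FtpSession session) throws IOException;
    }

    /** Runs the work and always releases the connection, even when the work throws. */
    static void runAndClose(FtpSession session, Work work) throws IOException {
        try {
            work.run(session);
        } finally {
            closeQuietly(session);
        }
    }

    /** Logs out and disconnects, swallowing errors so cleanup never masks the original failure. */
    static void closeQuietly(FtpSession session) {
        if (session == null || !session.isConnected()) {
            return;
        }
        try {
            session.logout();
        } catch (IOException ignored) {
            // the server may already have dropped the connection
        }
        try {
            session.disconnect();
        } catch (IOException ignored) {
            // nothing more can be done here
        }
    }
}
```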

7. One thing a large data collection program must take into account is the disk filling up.

When the disk is full, a Java program on a Linux or AIX machine in an English-language environment usually reports:
There is not enough space in the file system
In other language environments the message generally reads: "the disk space is full"
You can check for it with the following code:

[Java]
// Linux, AIX: There is not enough space in the file system
// Windows:    There is not enough space in the file system
if (e.toString().contains("enough space") || e.toString().contains("disk space is full"))
{
    log.error("channel " + channel_name + ": there is not enough space on the disk");
    Runtime.getRuntime().exit(0);
}
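
The same check can be wrapped in a small helper, together with a proactive free-space test before a download starts. This is a sketch under stated assumptions: `DiskSpaceGuard` and its method names are hypothetical, the two message fragments are the ones quoted in this article (exception messages vary by platform and locale), and `File.getUsableSpace()` is the standard JDK call for the space available to the JVM on a partition.

```java
import java.io.File;

/** Hypothetical helper for detecting and anticipating a full disk. */
class DiskSpaceGuard {

    /** Message-based check using the fragments quoted in the article; platform and locale dependent. */
    static boolean looksLikeDiskFull(Throwable e) {
        String text = String.valueOf(e);
        return text.contains("enough space") || text.contains("disk space is full");
    }

    /** Proactive check: true if the partition holding dir has at least requiredBytes usable. */
    static boolean hasRoomFor(File dir, long requiredBytes) {
        return dir.getUsableSpace() >= requiredBytes;
    }
}
```

Checking `hasRoomFor(targetDir, remoteSize)` before each download avoids relying only on message text, which is fragile across operating systems and locales.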
