Java Big Data Processing (tens of millions of FTP downloads)

Source: Internet
Author: User
Tags: ftp file

The task is to fetch xx data files from an FTP host. "Tens of millions" is only a rough figure: it means the number of files is ten million or more. This article does not cover distributed collection and storage; all data is processed on a single machine. If the data volume is larger still, consider distributed processing. If I gain experience in that area, I will share it later.

1. The FTP library used by the program is Apache commons-net-ftp-2.0.jar.

2. The key to handling tens of millions of files over FTP is listing the directory to a file. Once that is done, performance is largely not a problem. You can use the Apache client to send the FTP command "NLST" and write the directory listing to a file.

    # Command used to list the FTP directory. The environment variable
    # configuration is read first; if this parameter is not set, the
    # default listing command NLST is used.
    # DS_LIST_CMD = NLST

    public File sendCommandAndListToFile(String command, String localPathName) throws IOException {
        try {
            // createFile is not part of the stock commons-net FTPClient API;
            // client is presumably the author's own subclass or wrapper.
            return client.createFile(command, localPathName);
        } catch (IOException e) {
            log.error(e);
            throw new IOException("the command " + command + " is incorrect", e);
        }
    }

There are of course other ways to do this, which you can research yourself. For more than 100,000 records, do not use the following method, because it builds the whole listing as an array in memory:

    FTPFile[] dirList = client.listFiles();

3. Read the file names to be downloaded in batches. Either load one batch into memory and process it, or read one file name and download one file at a time. Do not load all the names into memory at once, or you will run into problems. Why split the work into batches? Because the data volume is large: at this scale, the directory listing file alone can be more than 1 GB.

4. Core download code: resumable transfer. Compare the size of the FTP file with the size of the local file, then use the restart (resume) feature provided by FTP. Downloads must use binary mode.

    client.enterLocalPassiveMode(); // set passive mode
    ftpclient.binary();             // binary mode is required

    /**
     * Downloads the required file, with support for resuming an interrupted
     * download. After a successful download the FTP file should be deleted
     * to avoid downloading it again (the deletion is not shown here).
     * @param pathName  remote file
     * @param localPath local file
     * @return true if the download succeeded
     * @throws IOException
     */
    public boolean downLoad(String pathName, String localPath) throws IOException {
        boolean flag = false;
        File file = new File(localPath + ".tmp"); // temporary file
        FileOutputStream out = null;
        try {
            client.enterLocalPassiveMode();           // passive mode
            client.setFileType(FTP.BINARY_FILE_TYPE); // binary mode
            if (lff.getIsFileExists(file)) {
                // The local file exists: resume if it is shorter than the FTP file.
                long size = this.getSize(pathName);
                long localFileSize = lff.getSize(file);
                if (localFileSize > size) {
                    return false;
                }
                out = new FileOutputStream(file, true); // append mode
                client.setRestartOffset(localFileSize);
                flag = client.retrieveFile(new String(pathName.getBytes(), client.getControlEncoding()), out);
                out.flush();
            } else {
                out = new FileOutputStream(file);
                flag = client.retrieveFile(new String(pathName.getBytes(), client.getControlEncoding()), out);
                out.flush();
            }
        } catch (IOException e) {
            log.error(e);
            log.error("file download error!");
            throw e;
        } finally {
            if (null != out) out.close();
            // Rename the temporary file only after a complete download.
            if (flag) lff.rename(file, localPath);
        }
        return flag;
    }

    /**
     * Gets the length of a file on the FTP server.
     * @param fileNamepath remote file path
     * @return file length in bytes, or 0 if the file is not listed
     * @throws IOException
     */
    public long getSize(String fileNamepath) throws IOException {
        FTPFile[] ftp = client.listFiles(new String(fileNamepath.getBytes(), client.getControlEncoding()));
        return ftp.length == 0 ? 0 : ftp[0].getSize();
    }

To check whether the local file has already been downloaded:

    /**
     * Gets the size of a local file.
     * @param file local file
     * @return size in bytes, or 0 if the file does not exist
     */
    public long getSize(File file) {
        long size = 0;
        if (getIsFileExists(file)) {
            size = file.length();
        }
        return size;
    }

5. Because the program runs up to 100 or more threads, it monitors them so that dead threads are detected and restarted promptly.

    t.setUncaughtExceptionHandler(new ThreadException(exList));

Principle: attach an UncaughtExceptionHandler to each thread. When a thread dies, the handler adds that thread's information to a list. The main thread scans the list at intervals and, if it finds an entry, creates a new thread to take over the work.

6. If the program is long-running (memory-resident), do not forget to close unused FTP connections in a finally block.

7. One thing a large data collection program must account for is the disk filling up. When the disk is full, the Java virtual machine reports it in the exception message. In an English environment on a Linux or AIX machine the message is "There is not enough space in the file system"; in a Chinese environment it is generally reported as "the disk space is full". You can check for it with the following code:

    // Linux/AIX: There is not enough space in the file system
    // Windows:   There is not enough space in the file system
    if (e.toString().contains("enough space") || e.toString().contains("disk space full")) {
        log.error("channel " + channel_name + ": There is not enough space on the disk");
        Runtime.getRuntime().exit(0);
    }
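
As an illustration of the thread monitoring described in step 5, here is a minimal sketch that uses only the JDK. The class WorkerSupervisor and its methods start and restartDead are made up for this example; they are not the ThreadException class or the exList from the original program, whose code is not shown. The sketch also simplifies one detail: instead of recording thread information in a list, the handler queues the failed task itself, so the main thread can simply start it again.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for illustration only; not the article's ThreadException.
class WorkerSupervisor {
    // Tasks whose threads died from an uncaught exception.
    private final Queue<Runnable> deadTasks = new ConcurrentLinkedQueue<>();

    /** Starts the task on a new thread with a handler that records a crash. */
    Thread start(Runnable task) {
        Thread t = new Thread(task);
        // The handler runs on the dying thread, just before it terminates.
        t.setUncaughtExceptionHandler((thread, error) -> deadTasks.add(task));
        t.start();
        return t;
    }

    /** Restarts every task recorded as dead; meant to be called periodically by the main thread. */
    List<Thread> restartDead() {
        List<Thread> restarted = new ArrayList<>();
        Runnable task;
        while ((task = deadTasks.poll()) != null) {
            restarted.add(start(task));
        }
        return restarted;
    }

    public static void main(String[] args) throws InterruptedException {
        WorkerSupervisor supervisor = new WorkerSupervisor();
        AtomicInteger attempts = new AtomicInteger();
        // Simulated worker: crashes on its first run, succeeds afterwards.
        Runnable worker = () -> {
            if (attempts.incrementAndGet() == 1) {
                throw new IllegalStateException("simulated crash");
            }
        };
        supervisor.start(worker).join();
        for (Thread t : supervisor.restartDead()) {
            t.join();
        }
        System.out.println("attempts: " + attempts.get());
    }
}
```

In a real collector the main thread would call restartDead() on a timer rather than once, and would probably log the error and cap the number of restarts per task so that a task that always fails does not loop forever.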
