Oracle-based big data import solution Exploration

Source: Internet
Author: User
The data import function is often visible in report projects because it is the basis for displaying report data, there are few solutions that take both performance and efficiency into account. Recently in China Southern Airlines Guangxi Data Server

The data import function is often visible in report projects because it is the basis for displaying report data, there are few solutions that take both performance and efficiency into account. Recently in China Southern Airlines Guangxi Data Server

The data import function is often visible in report projects because it is the basis for displaying report data, there are few solutions that take both performance and efficiency into account. Recently, in the development of China Southern Airlines data service platform in Guangxi, I need to design a solution that can quickly import more than 0.4 million pieces of data into the Oracle database. In order to achieve efficient import, by collecting data online and conducting hands-on tests, I drew some analysis summaries and shared them with you.

When talking about the implementation of the data import function, there should be no more than two processes: Data File Upload and data import.

File Upload is unrelated to the import, but it is in a stage of the data import function, and its efficiency is also important. For traditional projects, file upload is usually implemented using Struts and other frameworks, as well as some open-source File Upload components, such as SmartUpload, the data File is obtained through the Input tag of the File type in Html, sent to the server in the form of a stream, and finally the server obtains the stream and writes it to the File, this allows you to upload files from the client to the server. These methods can all be called Web file upload.

In addition to the Web method, you can use an Applet to embed it into a webpage as a client Applet, read local data files in IO mode, and then send the file stream to the server through Socket. This method is more efficient than the Web method. First, the efficiency of converting a file into a file stream through IO is improved. Second, the data is transmitted through the Socket, it is a TCP-based network transmission. In addition to the file transmission restrictions imposed by the Http protocol on the Web, data communication is directly performed from the transmission layer of the network through the TCP protocol, the transmission speed will inevitably be faster. In summary, the Network File Upload performance is better than the Web mode by using Applet and Socket.

However, the Network File Upload implemented by the Applet is not optimal because the Applet is limited by the sandbox during running. To protect the client and server, the applet program in the web can only run in the restricted sandbox, which is restricted by many security policies and cannot directly access the local file system of the client in the applet, unless authorized by the applet, the applet can confirm that the client system is trusted by digital signature. In this way, the client system to use this function requires the installation of security certificates, which is very cumbersome in project deployment.

All of the above methods have their own drawbacks. Finally, through negotiation with users, we decided to use an external FTP tool and an open-source file transfer tool to allow users to directly upload data files to the specified directory on the server, on the website system, only the data file list is loaded. In addition, there are also plug-ins that embed FTP functions on the web, which are embedded in the web in the form of activeObject to implement File Upload functions similar to ftp. We plan to continue the research.

Importing big data into Oracle databases is the focus of function implementation.

This solution is implemented by multithreading on the basis of batch insert.

The first step of this solution is to load the required data files to the memory and generate an SQL array. For imported data files, they are generally in EXCEL format. For such data files, we need to use POI to load EXCEL files, you can use POI to read data from the EXCEL worksheet to generate an SQL statement for data insertion. The file code is as follows:

// Create a reference to an Excel Workbook File

// Create a reference to the worksheet.

XSSFSheet sheet = workbook. getSheetAt (0 );

// In the Excel document, the default index of the first worksheet is 0 $1. 7.

// The statement is: HSSFSheet sheet = workbook. getSheetAt (0 );

}

}

}

......

& I <sheet. getLastRowNum ()-sheet. getFirstRowNum ()

+ 1; I ++ ){

HSSFRow row = sheet. getRow (I );

String [] valuesPerRow = getHSSFRowValues (row );

}

Rows. add (valuesPerRow );

V. importToDB (rows );

Rows. clear ();

}

}

V. importToDB (rows );

If EXCEL is not used as the data source file, you can use CSV files instead. CSV is a data file format that can be saved as an EXCEL file. It is essentially a text file separated by commas. Therefore, for reading such files, we can use traditional IO to read files, each cell data is obtained by string segmentation and spliced into SQL to form an SQL array.

Step 2: multi-threaded data import.

The data import implemented by this solution is more efficient than the batch data import performed by a single thread. From the test data import result of 0.4 million, it takes 19 minutes for a single thread to import data in batches, multithreading-based import takes about five minutes. However, in terms of performance consumption, the average number of concurrent threads in the multi-thread solution is about 15, the CPU usage is as high as 90%, and the memory consumption is about 500 mb, which puts a certain pressure on the server itself, although the speed is improved, the stability of the server will cause security risks.

In addition, I have some experience in exploring the efficiency of multithreading. The emergence of Multithreading is more to cater to the innovation of multi-core processing technology. On a single CPU-running host, multithreading seems to be executed concurrently by multiple threads, however, from the perspective of the operating system, it is still in the serial state, because at the same time, the processor only schedules one task, but it is not easy to find that the polling time gap is short. If multiple processors are used to process concurrent threads simultaneously on a multi-core host, the real concurrency scheduling can be realized. Therefore, multithreading depends on the hardware. To verify the efficiency, when we deploy the threads that execute the import to different virtual machines in the form of webService, the effect will be different and the efficiency will be significantly improved. This leads to a hot spot in today's IT industry. The Implementation and Application of virtualization technology is conducive to the optimization of resource configuration and greater utilization value on limited resources, this technology is also widely used in the cloud computing field.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.