Some Tips for Java in Big Data Processing

Source: Internet
Author: User

As we all know, when Java loads a large amount of data into memory at once, memory overflow is inevitable. When we have to process massive data, the usual techniques are decomposition, compression, parallelism, and temporary files.

 

For example, we want to export data from a database (no matter which database) to a file, generally Excel or a CSV/text file. With Excel, through the POI or JXL APIs, most of the time you have no way to control when in-memory data is written to disk, which is unpleasant, and the objects these APIs build in memory are much larger than the raw data, so you are forced to split the output into multiple Excel files. Fortunately, POI eventually recognized this problem: starting with version 3.8 it provides the SXSSFWorkbook interface, which lets you set how many rows are kept in memory. The catch is that once you exceed that window, each newly added row forces the oldest buffered row out to disk (with a window of 2,000 rows, writing row 2,001 flushes row 1). The flushed rows go to temporary files rather than memory, but the disk-flush frequency is very high. What we would really like is to flush in larger batches, say 1 MB at a time, but unfortunately the API does not offer that yet.

In my own tests, writing several small Excel files was more efficient than using the current auto-flushing API to write one large file; with a few more concurrent users, the disk I/O cannot keep up, and since I/O resources are very limited, it is best to split the files. When we write CSV, that is, plain text files, we can usually control the process ourselves, and we do not even need the APIs a CSV library provides (those are equally uncontrollable): CSV is just a text format, so as long as you write text in the CSV layout, any CSV reader can identify it. So how should we write the data? Read on.
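As a concrete illustration, here is a minimal sketch of that row-window approach, assuming a recent version of Apache POI on the classpath; the file name and row counts are arbitrary choices, not from the original article:

    import java.io.FileOutputStream;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.xssf.streaming.SXSSFWorkbook;

    public class StreamingExcelExport {
        public static void main(String[] args) throws Exception {
            // Keep at most 2000 rows in memory; older rows are flushed to temp files.
            SXSSFWorkbook wb = new SXSSFWorkbook(2000);
            try (FileOutputStream out = new FileOutputStream("export.xlsx")) {
                Sheet sheet = wb.createSheet("data");
                for (int r = 0; r < 1_000_000; r++) {
                    Row row = sheet.createRow(r);
                    row.createCell(0).setCellValue(r);
                    row.createCell(1).setCellValue("value-" + r);
                }
                wb.write(out);   // assembles the final .xlsx from the flushed parts
            } finally {
                wb.dispose();    // delete the temporary files backing the window
            }
        }
    }

Even with this, the flush granularity is the row window, not a byte size you choose, which is why splitting the output into several smaller files often wins.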

 

At the data-processing level, for example reading from the database and generating local files, we do not need to hand-chunk writes at, say, 1 MB for convenience of coding; splitting is handed to the underlying driver and OS, and as far as our program is concerned, the write is continuous. Suppose we want to export a database table with millions of rows to a file. One option is pagination: Oracle needs the classic three-layer ROWNUM wrapping, MySQL can use LIMIT. But every page issues a new query, and it gets slower and slower as you page deeper. What we really want is to get a handle (a cursor) and move it forward, accumulating a batch of rows (say 10,000) and writing them to the file in one go; that is the basic pattern, leaving aside the finer details of file writing. Pay attention to the buffer: when writing with an OutputStream, it is best to flush after each batch to clear the buffer. Next question: if we execute an SQL statement without a WHERE condition, will memory blow up? Yes, and it is worth thinking about. The JDBC API offers something: PreparedStatement ps = connection.prepareStatement(sql) is precompiled by default, and you can also create it as PreparedStatement ps = connection.prepareStatement(sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);

This way the cursor is forward-only and read-only and, in principle, does not cache all the data into local memory; then call statement.setFetchSize(200) to set how many rows each cursor traversal fetches. OK, I actually tried this. With Oracle it made no difference whether I set it or not, because Oracle's JDBC driver does not cache the result set into Java memory by default anyway; and with MySQL the settings had no effect at all. I have rambled a bit, but the point is this: the standard APIs Java provides are not necessarily honored; much depends on the vendor's implementation, and many of the "effective settings" circulating on the Internet are simply copied from one another and never verified. As mentioned, with Oracle there is nothing to worry about: it does not cache the result set into memory, so the Java heap is fine. With MySQL, you must first use a driver version above 5, then add the useCursorFetch=true parameter to the connection parameters; as for the cursor size, add defaultFetchSize=1000 to the connection parameters. For example:

jdbc:mysql://xxx.xxx:3306/abc?zeroDateTimeBehavior=convertToNull&useCursorFetch=true&defaultFetchSize=1000
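Putting the pieces together, a hedged sketch of the whole export pattern might look like this; the host, credentials, table and column names, and batch sizes are placeholders, not taken from the original article:

    import java.io.BufferedWriter;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class StreamingCsvExport {
        public static void main(String[] args) throws Exception {
            // useCursorFetch=true makes the MySQL driver use a server-side cursor,
            // so rows arrive in batches instead of being buffered entirely in the heap.
            String url = "jdbc:mysql://localhost:3306/abc"
                       + "?zeroDateTimeBehavior=convertToNull"
                       + "&useCursorFetch=true&defaultFetchSize=1000";
            try (Connection conn = DriverManager.getConnection(url, "user", "pass");
                 PreparedStatement ps = conn.prepareStatement(
                         "SELECT id, name FROM big_table",   // deliberately no WHERE clause
                         ResultSet.TYPE_FORWARD_ONLY,
                         ResultSet.CONCUR_READ_ONLY);
                 BufferedWriter out = Files.newBufferedWriter(Paths.get("export.csv"))) {
                ps.setFetchSize(1000);                       // rows per round trip to the server
                try (ResultSet rs = ps.executeQuery()) {
                    long rows = 0;
                    while (rs.next()) {
                        out.write(rs.getLong("id") + "," + rs.getString("name"));
                        out.newLine();
                        if (++rows % 10_000 == 0) {
                            out.flush();                     // write the batch out, keep the heap flat
                        }
                    }
                }
                out.flush();
            }
        }
    }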

I was entangled with this problem for a long time (MySQL buffering too much data caused the program's memory to balloon, and two exports running in parallel would bring the system down). I went through a lot of source code before discovering that the trick was exactly these parameters; finally, after confirming it against the MySQL documentation, I tested it with multiple parallel exports over a large data volume: memory no longer ballooned and GC stayed normal. The problem was finally solved.

 

Now for something else: data splitting and merging. When there are many data files we want to merge them, and when a file is too large we want to split it; merging and splitting run into similar memory problems, but fortunately here everything is within our control. If the data in the file can ultimately be organized anyway, do not split or merge by the logical number of data rows: that would force you to parse every record just to count rows, which is unnecessary. What you need is binary processing, and in that binary processing, take care not to read the file the way we usually do, with a single read of the whole thing; for a large file, that would certainly hang the memory (you already know to skip that). Instead, read a controlled amount each time: the read method provides overloads taking an offset and a length, which you can compute yourself inside the loop. Writing a large file is the same; and if you do not want to hold it all in the program, flush it to disk through the output stream as you go.
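A minimal sketch of such offset-and-length processing, splitting one large file into bounded parts; the buffer and part sizes are arbitrary choices for illustration:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class FileSplitter {
        static final int BUF = 1024 * 1024;          // read/write 1 MB at a time
        static final long PART = 64L * 1024 * 1024;  // target size of each output part

        public static void split(Path src) throws IOException {
            byte[] buf = new byte[BUF];
            try (InputStream in = Files.newInputStream(src)) {
                int part = 0;
                long writtenInPart = PART;           // forces the first part to be opened
                OutputStream out = null;
                int n;
                try {
                    while ((n = in.read(buf, 0, BUF)) > 0) {
                        if (writtenInPart >= PART) { // current part is full, start the next
                            if (out != null) { out.flush(); out.close(); }
                            out = Files.newOutputStream(Paths.get(src + ".part" + part++));
                            writtenInPart = 0;
                        }
                        out.write(buf, 0, n);        // only ever one buffer in memory
                        writtenInPart += n;
                    }
                } finally {
                    if (out != null) { out.flush(); out.close(); }
                }
            }
        }
    }

Merging is the same loop in reverse: read each part in bounded chunks and append to a single output stream.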

 

Modern NIO technology is also useful here, and not only for small data volumes: consider multiple clients simultaneously requesting a large file for download, video for example. Handled the usual way with a Java container, two problems typically arise. The first is memory overflow: each request needs a file-sized amount of memory or even more, because wrapping the data in Java objects adds a lot of overhead (pure binary produces somewhat less), and the input and output streams still go through several memory copies. Of course, if you have middleware like nginx in front, you can serve the file in send_file mode; but if you must handle it in your own program, the larger the Java heap grows, the worse GC gets, and unless the heap is truly enormous, GC will grind you down. You can also consider allocating and releasing Direct Memory (off-heap) instead, but that requires enough spare physical memory. How much is enough? Hard to say; it depends on the file sizes and the access frequency.
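On the send_file point, Java's closest analogue is FileChannel.transferTo, which on most platforms lets the kernel move the bytes and spares the user-space copies an InputStream/OutputStream loop would make. A minimal sketch, where the path and client channel come from the caller:

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class ZeroCopySend {
        static void send(String path, SocketChannel client) throws IOException {
            try (FileChannel file = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
                long pos = 0, size = file.size();
                while (pos < size) {
                    // transferTo may move fewer bytes than requested; loop until done
                    pos += file.transferTo(pos, size - pos, client);
                }
            }
        }
    }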

 

The second problem: even if memory were large enough and unlimited, the limitation becomes threads. The traditional I/O model allocates one thread per request. Once the main thread hands the request a thread from the pool, that thread works through your Context wrapping, Filters, interceptors, layers of business code and business logic, accesses the database, accesses files, renders the result, and so on; throughout the entire process the thread is tied up. Such resources are very limited, and large-file operations are I/O-intensive, leaving a great deal of CPU time idle. The most direct remedy is to increase the thread count, and with enough memory you can enlarge the thread pool, but a process's thread pool is generally restricted, and too many threads are not recommended. So, to improve performance with limited system resources, we adopt the newer I/O technology, NIO, and newer JDK versions add AIO. NIO can only be regarded as pseudo-asynchronous I/O: the actual read and write still block (that is, during the real read/write, even though the thread no longer waits for readiness in between), so it is not true asynchronous I/O. When listening for connections it does not need many threads to participate; a single thread with a selector handles them, where the traditional socket becomes a selector key, and connections with no data ready are not allocated a thread at all. AIO is completed through so-called callback registration, which of course also requires OS support; a thread is allocated only when the I/O completes. AIO is currently not very mature, and its performance is at best equal to NIO's, but as the technology develops AIO will surely surpass NIO. Node.js, driven by Google's V8 virtual machine engine, works in a similar spirit, but that technology is not the focus of this article.
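A minimal single-threaded skeleton of the selector model just described; the port and buffer size are illustrative, and error handling is omitted for brevity:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    public class SelectorServer {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(8080));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            ByteBuffer buf = ByteBuffer.allocate(8 * 1024); // 8 K per read
            while (true) {
                selector.select(); // blocks until some channel is ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {
                        SocketChannel client = server.accept();
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        SocketChannel client = (SocketChannel) key.channel();
                        buf.clear();
                        if (client.read(buf) == -1) {
                            client.close();   // peer closed the connection
                        } else {
                            buf.flip();       // hand buf to the protocol handler here
                        }
                    }
                }
            }
        }
    }

Idle connections cost no thread at all; one thread serves every connection that actually has data ready.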

 

Combining the above two points is how to solve large files plus parallelism. The most basic method is to reduce each file request to a fixed size, for example 8 K (a size that tested well for network transmission; local file reads need not be that small). Going deeper, you can add a degree of caching: keep the chunks that multiple requests share in memory or in a distributed cache. You do not cache the whole file in memory, only the recently used chunks for a few seconds, or apply one of the popular eviction algorithms; a sketch follows below. This resembles the resumable downloading of Thunder (though Thunder's network protocol is different): the data a client downloads need not be contiguous, as long as the pieces can be merged, and the server side simply hands out whatever ranges each client asks for. Once NIO is used, many connections and much concurrency can be supported. In a local socket test, 100 clients simultaneously requested data from a single-threaded server. An ordinary web application would not have finished sending the first file before the second request either waits, times out, or is flatly rejected; switched to NIO, all 100 requests get connected, and the server needs only one thread to process the data, handing each connection a piece of data per pass. Over the whole long-lived transfer the overall efficiency is not improved, but the response and the memory overhead become quantified and controlled. That is the charm of the technique: it may not involve many algorithms, but you have to understand it.
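For the chunk cache just mentioned, one of those popular algorithms is LRU, which LinkedHashMap supports almost for free in access-order mode; a sketch, where the key format and capacity are illustrative assumptions:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class ChunkCache extends LinkedHashMap<String, byte[]> {
        private final int maxChunks;

        public ChunkCache(int maxChunks) {
            super(16, 0.75f, true); // accessOrder=true gives LRU iteration order
            this.maxChunks = maxChunks;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
            return size() > maxChunks; // evict the least recently used chunk
        }

        static String key(String path, long chunkIndex) {
            return path + '#' + chunkIndex;
        }
    }

With 8 K chunks, new ChunkCache(1024) would hold at most about 8 MB of hot data, whatever the sizes of the underlying files.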

 

There are many other similar data-processing problems, some of them about efficiency, for example how HBase splits and merges files without affecting online services. Many such problems are worth studying in their scenarios, because different scenarios have different solutions, yet the ideas are similar: understand the thinking and the methods, understand memory and the system architecture, understand the scenario you are actually facing, and changes in small details can bring amazing results.
