Some Tips for Handling Large Data Volumes in Java

Source: Internet
Author: User

As we all know, when Java processes a relatively large volume of data, loading it all into memory will inevitably cause a memory overflow, yet in some scenarios we have no choice but to process massive data. The common techniques for doing so are decomposition, compression, parallelism, and temporary files.

For example, suppose we want to export data from a database (any database) to a file, usually Excel or a CSV text file. With Excel, whether you use the POI or the JXL API, you often have no way to control when memory is flushed to disk, which is unpleasant, and the objects these APIs build in memory are many times larger than the original data, so you are forced to split the Excel output. Fortunately, POI has begun to address this problem: after version 3.8.4 it provides a cached row count through the SXSSFWorkbook interface, which lets you set the number of rows kept in memory. Unfortunately, once you exceed that row count, every additional row you add causes the row that falls out of the window to be written to a temporary file (if you set the window to 2000 rows, writing row 2001 flushes row 1 to disk). Because it uses temporary files, it does not consume memory, but you will find the flush frequency is very high. That is not what we want: we would rather accumulate a range of data and flush it in one go, say 1 MB at a time, but no such API exists at the moment, which is painful. In my tests, writing several smaller Excel files is more efficient than relying on the flushing the current API provides, and if a few more users export at once the disk IO may not be able to carry the load, since IO resources are very limited, so it is best to split the files. When we write CSV, which is a plain text file, we can usually control the process ourselves, and you do not even need a CSV library's own API, which is also less controllable; CSV itself is just a text file, and anything you write in the correct text format will be recognized as CSV.
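As an illustration of the row window described above, here is a minimal sketch, assuming POI 3.8+ is on the classpath; the window size, row count, and cell contents are arbitrary:

import java.io.FileOutputStream;

import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class SxssfExportSketch {
    public static void main(String[] args) throws Exception {
        // Keep at most 2000 rows in memory; older rows are flushed to a temp file.
        SXSSFWorkbook workbook = new SXSSFWorkbook(2000);
        Sheet sheet = workbook.createSheet("export");
        for (int i = 0; i < 1_000_000; i++) {
            Row row = sheet.createRow(i); // writing row 2001 pushes row 1 out to disk
            row.createCell(0).setCellValue(i);
            row.createCell(1).setCellValue("value-" + i);
        }
        try (FileOutputStream out = new FileOutputStream("export.xlsx")) {
            workbook.write(out);
        }
        workbook.dispose(); // remove the temporary files
    }
}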

At the data-reading level, for example when reading from a database to generate local files, we write code for convenience and do not process things 1 MB at a time; we leave that splitting to the underlying driver, and as far as our program is concerned the write is continuous. Suppose we want to export 10 million rows of a database table to a file. At this point you can paginate: in Oracle with the classic three-level nested ROWNUM query, in MySQL with LIMIT. But do not paginate excessively, because every page issues a new query, and as you page deeper it gets slower and slower. What we really want is to obtain one handle and move it downstream, writing to the file each time a batch of data (say 10,000 rows) has accumulated. Writing the file is the basic part; the thing to note is that each time the buffer accumulates data and you write with an OutputStream, it is best to flush so the buffer is emptied. Next question: if we execute a SQL statement with no WHERE condition, will memory burst? Yes, and the problem is worth thinking about. Digging through the API, you find you can apply some options to the SQL; for example, connection.prepareStatement(sql) gives you a precompiled statement by default, and you can also set:

PreparedStatement statement = connection.prepareStatement(sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);

to configure a forward-only, read-only cursor so that it does not cache the data directly into local memory, and then call statement.setFetchSize(200) to set how many rows each cursor traversal fetches. OK, I used it: with Oracle it makes no difference at all, because the Oracle JDBC driver by default does not cache the data into Java memory, while in MySQL the setting by itself is not effective. I have rambled a bit, but what I want to say is this: a standard API that Java provides may not actually be effective; much depends on the vendor's implementation. Many posts online claim this setting works, but they are purely copying one another. For Oracle, as said above, you need not care, since it does not cache into memory and Java memory will not have any problem. If it is MySQL, you must first use version 5 or later, and then add the parameter useCursorFetch=true to the connection string; the cursor size can be set by adding another connection parameter, defaultFetchSize=1000, for example:

jdbc:mysql://xxx.xxx.xxx.xxx:3306/abc?zeroDateTimeBehavior=convertToNull&useCursorFetch=true&defaultFetchSize=1000

This last problem had entangled me for a long time (MySQL exports always caused the program's memory to balloon, and running two in parallel brought the system straight down). I read a lot of source code before finding the cause here, finally confirmed it against the MySQL documentation, and then tested it: with multiple exports in parallel, and data volumes above 5 million rows, memory no longer swells and GC stays healthy. The problem was finally closed.
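Putting the pieces above together, here is a minimal sketch of a streaming CSV export, assuming a MySQL 5+ server; the URL, credentials, table, and the 10,000-row flush interval are placeholders:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class StreamingCsvExport {
    public static void main(String[] args) throws Exception {
        // useCursorFetch + defaultFetchSize tell the MySQL driver to stream rather than buffer everything.
        String url = "jdbc:mysql://xxx.xxx.xxx.xxx:3306/abc?useCursorFetch=true&defaultFetchSize=1000";
        try (Connection connection = DriverManager.getConnection(url, "user", "password");
             PreparedStatement statement = connection.prepareStatement(
                     "SELECT id, name FROM big_table", // no WHERE clause: rely on the cursor
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
             BufferedWriter writer = new BufferedWriter(new FileWriter("big_table.csv"))) {
            statement.setFetchSize(1000);
            try (ResultSet rs = statement.executeQuery()) {
                long rows = 0;
                while (rs.next()) {
                    writer.write(rs.getLong(1) + "," + rs.getString(2));
                    writer.newLine();
                    if (++rows % 10_000 == 0) {
                        writer.flush(); // empty the buffer in batches, as discussed above
                    }
                }
            }
            writer.flush();
        }
    }
}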

Now let us talk about something else: splitting and merging data. When we have many data files we want to merge them, and when a file is too large we want to split it; the merge and split process runs into similar problems, but fortunately this part is within our control. If the data in the file can ultimately be organized, then when splitting and merging, do not do it by the logical line count of the data, because determining line boundaries ultimately requires interpreting the data itself, and for a plain split that is unnecessary. What you need is binary processing, and in this binary processing you must not read the file the way you usually do. Usually you read a file in a single pass; with a large file, memory is certain to blow up, needless to say. Instead, read a controllable range at a time: the read method provides an overload taking an offset and a length, which you can compute inside your loop. Writing large files is the same as above: do not wait; once you have read a certain amount, flush it to disk through the write stream. In fact, modern NIO techniques are also useful for this kind of processing, for example when multiple clients simultaneously request a download of one large file, say a video download. Normally, if a Java container handles this, there are generally two situations:
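Here is a minimal sketch of the chunked binary reading and batched flushing just described; the 8K buffer, 1 MB flush threshold, and file names are arbitrary:

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class ChunkedCopy {
    public static void main(String[] args) throws Exception {
        byte[] buffer = new byte[8192]; // read a controllable range at a time
        try (InputStream in = new FileInputStream("big.dat");
             OutputStream out = new BufferedOutputStream(new FileOutputStream("copy.dat"))) {
            int n;
            long sinceFlush = 0;
            while ((n = in.read(buffer, 0, buffer.length)) != -1) {
                out.write(buffer, 0, n);
                sinceFlush += n;
                if (sinceFlush >= (1 << 20)) { // flush roughly every 1 MB, not on every read
                    out.flush();
                    sinceFlush = 0;
                }
            }
            out.flush();
        }
    }
}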

One is memory overflow, because each request loads a file's worth of memory, or even more, since Java's wrapping produces a lot of additional overhead (raw binary produces less), and the data is copied in memory several times on its way through the input and output streams. Of course, if you have middleware like Nginx in front, you can serve the file through its send_file mode; but if you want to handle it in your own program, then unless your memory is big enough you will run out, and even if your memory really is big, Java still has GC, and with a huge heap the GC can kill you. You can also consider allocating and releasing direct (off-heap) memory yourself at this point, but that requires enough free physical memory. How much is enough? That is hard to say; it depends on the size of the files themselves and the access frequency.
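Within the JVM, the closest analogue to Nginx's send_file is FileChannel.transferTo, which asks the OS to move bytes from the file to the socket without copying them through the Java heap. This is a sketch under that assumption, not something the original text names; the method and variable names are illustrative:

import java.io.FileInputStream;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySend {
    // Send a whole file to an already-connected socket without heap copies.
    static void sendFile(String path, SocketChannel socket) throws Exception {
        try (FileInputStream in = new FileInputStream(path);
             FileChannel file = in.getChannel()) {
            long position = 0;
            long size = file.size();
            while (position < size) {
                // transferTo may move fewer bytes than requested, so loop until done.
                position += file.transferTo(position, size - position, socket);
            }
        }
    }
}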

The second is this: if memory is big enough, unlimited even, then the limit becomes threads. The traditional IO model is one thread per request: the thread is allocated from the pool by the main thread and goes to work, passing through your context wrapping, filters, interceptors, the layers of business code and business logic, database access, file access, result rendering, and so on. The thread is tied up for this entire process, so this resource is very limited; and since large-file work is an IO-intensive operation, a large amount of CPU time sits idle. The most direct remedy is to increase the number of threads, and with enough memory there is room for a bigger thread pool, but generally speaking the thread pool of one process is limited and should not be too large. So, to improve performance within limited system resources, we got new IO technology: NIO, and in newer versions AIO as well. NIO is only partially asynchronous: the real read and write still happen synchronously (you no longer sit waiting for the response in between, but the read/write itself blocks), so it is not true asynchronous IO. When listening for connections it does not need many threads; a separate thread handles that, and the traditional socket becomes a channel registered with a selector, so connections with no data pending need no thread allocated to them. AIO, by contrast, works through a kind of callback registration and of course needs OS support; it allocates a thread only when the callback fires. It is not yet very mature, and at best its performance is on par with NIO, but as the technology develops AIO is bound to surpass NIO; node.js, driven by Google's V8 engine, follows a similar pattern. That technology is not the focus of this article.
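To make the selector model concrete, here is a minimal single-thread NIO sketch; the port and echo behavior are illustrative (a real file server would track a per-connection offset and push out one chunk per event):

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class SingleThreadNioServer {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);
        ByteBuffer buffer = ByteBuffer.allocate(8192);
        while (true) {
            selector.select(); // one thread waits on all connections at once
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ); // no thread per connection
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    buffer.clear();
                    if (client.read(buffer) == -1) {
                        client.close(); // peer closed the connection
                    } else {
                        buffer.flip();
                        client.write(buffer); // echo a chunk back
                    }
                }
            }
        }
    }
}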

Combining the two points above is how you handle large files with some degree of parallelism. The crudest method is to reduce the size of each request to a certain extent, such as 8K (a size that testing shows suits network transmission; local file reads need not be that small). Going deeper, you can add a degree of caching: when the same file is requested many times, cache it in memory or in a distributed cache. You do not have to cache the whole file in memory; caching the recently used pieces for a few seconds or so is enough, or you can apply some hot-spot algorithm. This is similar to Thunder's (Xunlei's) resumable download (though Thunder's network protocol is not quite the same): the data downloaded along the way need not be contiguous, as long as it can all be merged at the end, and on the server side it is just the reverse, whoever needs this piece of data, serve it to them. After adopting NIO, you can support a large number of connections and high concurrency. In a local socket test through NIO, 100 clients simultaneously requested a server that used a single thread: in a normal web application, before the first file has finished sending, the second request would either wait, time out, or be refused outright; after switching to NIO, all 100 requests connected to the server, and the service needed only one thread to process the data, handing data out to these connected requests and pushing out part of the data each time it read some. Admittedly, over the whole long-connection transfer the overall efficiency does not increase, but the responsiveness and the memory overhead become quantifiable and controllable. That is the charm of the technique: perhaps not much algorithm, but you have to understand it.
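The resumable, out-of-order chunk delivery just described comes down to serving arbitrary byte ranges. A minimal sketch using RandomAccessFile, where offset and length would come from the client's request in a real server:

import java.io.RandomAccessFile;

public class RangeReader {
    // Read up to `length` bytes starting at `offset`; the client merges chunks in any order.
    static byte[] readRange(String path, long offset, int length) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long available = Math.max(file.length() - offset, 0);
            byte[] chunk = new byte[(int) Math.min(length, available)];
            file.seek(offset); // jump straight to the requested piece
            file.readFully(chunk);
            return chunk;
        }
    }
}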

There is a lot of similar data processing out there, and sometimes efficiency is also at stake, for example in HBase's file split and merge process, where not affecting the online business is the hard part. Many problems are worth studying in their own scenario, because different scenarios have different solutions, but the approach is the same: understand the ideas and methods, understand memory and architecture, understand the scenario you are facing, and changes in the details can bring amazing results.
