Java is a little tricky when it comes to working with big data.
As we all know, when Java processes a large amount of data, loading it all into memory will inevitably cause an out-of-memory error, yet in some jobs we really do have to handle huge volumes of data. The usual techniques for this are decomposition, compression, parallelism, and temporary files.
For example, suppose we want to export database data (whatever the database) to a file, usually Excel or CSV text. With Excel, using the POI or JXL APIs, you often have no way to control when memory is written to disk, which is very unpleasant, and the objects these APIs build in memory are much larger than the original data, so you are forced to split the Excel output. Fortunately POI has started to address this: after version 3.8.4 it provides a row-window cache through the SXSSFWorkbook interface, letting you set how many rows are kept in memory. Unfortunately, once you exceed that number, every added row causes the row that many positions back to be written to disk (with a window of 2,000 rows, writing row 2,001 flushes row 1), via temporary files, so memory is not consumed; but you will find the flush frequency is very high. That is not what we want; we want it to accumulate a range of data and flush it in one go, say 1 MB at a time, but there is no such API right now, which is painful. In my tests, writing many small Excel files is much more efficient than using the row-flushing API it currently provides; and if a few more people hit this at once, the disk IO may not keep up, since IO resources are very limited, so splitting the file is the best policy. When we write CSV, which is just a text file, we can often control things ourselves; don't rely on the CSV APIs either, they are not very controllable. CSV is plain text: whatever you write in text form, a CSV reader can recognize. How do we write it? Let's say ...
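As a rough sketch of the row-window approach described above (assuming POI 3.8+ is on the classpath; the row count, window size, and file name here are made up for illustration):

import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class ExcelExportSketch {
    public static void main(String[] args) throws Exception {
        // Keep at most 2000 rows in memory; older rows are flushed to temporary files.
        SXSSFWorkbook wb = new SXSSFWorkbook(2000);
        Sheet sheet = wb.createSheet("data");
        for (int i = 0; i < 100_000; i++) {           // stand-in for rows read from the database
            Row row = sheet.createRow(i);
            row.createCell(0).setCellValue("value-" + i);
        }
        try (FileOutputStream out = new FileOutputStream("export.xlsx")) {
            wb.write(out);                             // assemble the final workbook from the temp files
        }
        wb.dispose();                                  // delete the temporary files SXSSF created
    }
}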
On the data-handling side, for example reading from the database and generating a local file, we write the code for convenience; we don't need to handle things 1 MB at a time, that splitting is left to the underlying driver, and as far as our program is concerned the write is continuous. Suppose we want to export a database table with 10 million rows to a file. You could paginate: Oracle with the usual three-layer nested query, MySQL with LIMIT; but every page is a fresh query, and the deeper you page the slower it gets. What we really want is a cursor-like handle that we move forward, writing a batch of rows (say 10,000) to the file each time (I won't go into the file-writing details, that part is basic); just note that for each buffer of data written with an OutputStream it is best to flush, so the buffer is emptied. Next question: if we execute a SQL statement with no WHERE condition, will memory blow up? Yes, and that is worth thinking about. Digging through the API you find there is something you can do about it; for example PreparedStatement statement = connection.prepareStatement(sql) is precompiled by default, and it can also be set as:
PreparedStatement statement = connection.prepareStatement(sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
to set a forward-only cursor so that the data is not all cached directly into local memory, and then call statement.setFetchSize(200) to set how many rows the cursor fetches per trip. OK, I tried it; with Oracle it made no difference whether I set it or not, because the Oracle JDBC driver by default does not cache the result set into Java memory, and with MySQL the setting simply has no effect. I have said a pile of nonsense, heh; what I really want to say is that the standard Java API is not necessarily effective, and much of the time you have to look at the vendor's implementation. This setting is widely claimed online to work, but that is pure copy-paste: for Oracle it doesn't matter, the driver doesn't cache into memory anyway, so Java memory is never the problem; for MySQL, you must first be on version 5 or later, then add useCursorFetch=true to the connection parameters, and the cursor size can be set by also adding defaultFetchSize=1000 to the connection parameters, for example:
jdbc:mysql://xxx.xxx.xxx.xxx:3306/abc?zeroDateTimeBehavior=convertToNull&useCursorFetch=true&defaultFetchSize=1000
Last time this problem had me tangled up for a long time (MySQL data kept bloating the program's memory; run two exports in parallel and the whole system went down). I read a lot of source code before discovering that the trick was here, finally confirmed it in the MySQL documentation, and then tested it: with several exports in parallel and more than 5 million rows each, memory does not bloat and GC is fine. The problem was finally over.
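Putting the pieces above together, a minimal sketch of this kind of streaming export might look like the following (the table, columns, credentials, batch sizes and file name are made up for illustration; a MySQL 5+ driver with the cursor parameters discussed above is assumed):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class StreamingExportSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/abc?useCursorFetch=true&defaultFetchSize=1000";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT id, name FROM big_table",                  // no WHERE clause on purpose
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
             BufferedWriter out = new BufferedWriter(new FileWriter("big_table.csv"))) {
            stmt.setFetchSize(1000);                                     // rows fetched per round trip
            try (ResultSet rs = stmt.executeQuery()) {
                long count = 0;
                while (rs.next()) {
                    out.write(rs.getLong(1) + "," + rs.getString(2));
                    out.newLine();
                    if (++count % 10_000 == 0) {
                        out.flush();                                     // push each batch to disk
                    }
                }
            }
            out.flush();
        }
    }
}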
Let's talk about something else: splitting and merging data. When there are many data files we want to merge them, and when a file is too large we want to split it, and the merge and split processes run into similar problems; fortunately this part is within our control. If the data in the file can ultimately be organized, then when splitting and merging, don't go by logical line count, because counting lines ultimately requires interpreting the data itself; for a plain split that is unnecessary, and what you need is binary processing. In this binary processing, be careful not to read the file the way you usually do: normally a text file is read in one single read operation, and for a large file memory will simply blow up, needless to say. Here, because each read covers a controllable range of data, use the overload of read that takes an offset and a length, which you can compute inside the loop; writing the large file is as above, flushing to disk through the output stream after reading a certain amount. A minimal sketch of this chunked pattern follows.
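This is only a sketch of the chunked read-and-flush pattern (file names, buffer size and flush interval are made up; a real split would also decide where to cut based on the data itself):

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class ChunkedCopySketch {
    public static void main(String[] args) throws Exception {
        byte[] buffer = new byte[1024 * 1024];                        // a controllable 1M range per read
        try (InputStream in = new FileInputStream("big.dat");
             OutputStream out = new BufferedOutputStream(new FileOutputStream("big.part1"))) {
            int read;
            int chunks = 0;
            // read(byte[], offset, length) bounds how much lands in memory at once
            while ((read = in.read(buffer, 0, buffer.length)) != -1) {
                out.write(buffer, 0, read);                           // write only what was actually read
                if (++chunks % 8 == 0) {
                    out.flush();                                      // flush every few MB, not every row
                }
            }
            out.flush();
        }
    }
}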
In fact, modern NIO technology is also useful here, for example when many clients simultaneously request a large file for download, say video downloads. In general, if the Java container handles it directly, there are two situations. One is a memory overflow, because each request loads at least a file's worth of data into memory, and usually more, since Java's object wrapping adds a lot of extra overhead (raw binary produces less), and the data goes through several memory copies on its way from the input stream to the output stream. Of course, if you have something like nginx in front, you can ship the file out with send_file; but if you want to handle it in your own program, memory has to be big enough, and a big Java heap also means GC, and if your heap really is huge the GC pauses will kill you. You can also consider allocating and releasing direct memory yourself, but that requires enough spare physical memory as well. So how big is big enough? There is no single answer; it depends on the size of the files and the frequency of access.
The second is that even if memory is big enough and unrestricted, the limit becomes threads. In the traditional IO model, one request gets one thread; the thread is handed out from the thread pool by the main thread and starts working through your context wrappers, filters, interceptors, business code at every layer, database access, file access, result rendering and so on. For that entire process the thread is tied up, so this resource is very limited, and since large-file work is IO-intensive, a great deal of CPU time sits idle. The most direct remedy is to add threads, and with enough memory you can enlarge the thread pool, but a single process's thread pool is generally limited, and making it too large is not recommended anyway. To improve performance under limited system resources, we got new IO technology: NIO, and in newer versions AIO as well. NIO is only partly asynchronous: the read and write in the middle are still blocking (that is, during the real read and write, though it does not sit waiting for responses in between), so it is not yet true asynchronous IO; but when listening for connections it does not need many threads, a single thread can handle it, the connection is no longer a traditional socket but a channel registered with a selector, and connections with nothing to process are not allocated a thread at all. AIO goes further, completing the work through so-called callback registration, which of course also needs OS support, and allocates a thread only when the work actually lands; it is not yet very mature, and its performance is roughly on par with NIO, but as the technology develops AIO will surely surpass NIO. Node.js, driven by Google's V8 engine, follows a similar pattern. That technology is not the focus of this article; a minimal selector sketch follows.
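A very small sketch of the single-threaded selector model described above (the port, chunk size and placeholder payload are made up; a real file server would track each connection's position in the file and close it when the transfer is done):

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class NioServerSketch {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buffer = ByteBuffer.allocate(8 * 1024);            // 8K per write, as discussed
        while (true) {
            selector.select();                                        // one thread waits on all connections
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();           // no thread allocated per connection
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_WRITE);
                } else if (key.isWritable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    buffer.clear();
                    buffer.put("next chunk of the file...".getBytes()); // stand-in for real file data
                    buffer.flip();
                    client.write(buffer);
                    // a real server would advance per-connection state here and close when finished
                }
            }
        }
    }
}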
Combining the above is how we deal with large files while still getting some degree of parallelism. The crudest method is to limit each request to a certain size, say 8K (a size that testing shows is reasonable for network transfer; local file reads don't need to be that small). Going deeper, you can add a certain amount of caching, keeping a file that multiple requests want in memory or in a distributed cache; you don't cache the whole file, only the parts used in the last few seconds or so, or you can use some hot-spot algorithm to decide. Download protocols like Xunlei (Thunder) work along similar lines: the data isn't necessarily downloaded contiguously, as long as it can be merged at the end, and the server side can do the reverse, handing each piece of data to whoever needs it at that moment. Only with NIO can you support that many connections and that much concurrency. I tested NIO socket connections locally, with 100 clients simultaneously requesting a single-threaded server: in a normal web application, before the first file has finished sending, the second request either waits, times out, or is refused outright; switch to NIO, and all 100 requests can connect to the server, and the server needs only one thread to handle the data, feeding these connected requests a little of the data each time it reads some and passes it on. You can work out that the overall efficiency of the whole long-connection transfer does not increase, but relatively speaking the memory cost becomes quantifiable and controllable. That is the charm of the technique: perhaps not much algorithm, but you have to understand it.
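Incidentally, Java's closest equivalent to the send_file mechanism mentioned earlier is FileChannel.transferTo, which lets the OS move the bytes without pulling them through the Java heap; a sketch, assuming an already connected SocketChannel (the method and path here are only illustrative):

import java.io.FileInputStream;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class SendFileSketch {
    // Sends a file to an already-connected client without copying it through the Java heap.
    static void sendFile(String path, SocketChannel client) throws Exception {
        try (FileChannel file = new FileInputStream(path).getChannel()) {
            long position = 0;
            long size = file.size();
            while (position < size) {
                // transferTo may send less than requested, so loop until the whole file is gone
                position += file.transferTo(position, size - position, client);
            }
        }
    }
}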
There is a lot of similar data processing, and sometimes efficiency becomes the issue too; for example, splitting and merging files in HBase without affecting online traffic is one of the harder problems. Many scenarios are worth studying, because different scenarios call for different solutions, but the approach is the same: understand the ideas and methods, understand memory and architecture, understand the scenario you are facing, and a change in the details can bring amazing results.