As we all know, when Java processes a large amount of data, loading it all into memory will inevitably lead to an OutOfMemoryError, and in some scenarios we have to deal with truly huge volumes of data. The common techniques for this kind of processing are decomposition, compression, parallelism, and temporary files.
For example, suppose we want to export database data (whatever the database) to a file, usually Excel or CSV text. With Excel and the POI or JXL APIs, you often have no way to control when memory is written to disk, which is very unpleasant, and the objects these APIs build in memory are much larger than the raw data, so you are forced to split the Excel output. Fortunately, POI recognized the problem: since version 3.8.4 it provides the SXSSFWorkbook interface, which lets you set how many rows are kept in memory. Unfortunately, once you exceed that row count, every new row you append causes the oldest row in the window to be written to disk (if you set the window to 2000 rows, writing row 2001 flushes row 1 to a temporary file). This keeps memory flat, but you will find the flush frequency is very high. That is not what we want; we would rather flush once the buffered data reaches some size, say 1M, but there is no such API, which is painful. In my own tests, writing several small Excel files is more efficient than writing one large file through the provided flush API, and if a few more users export at the same time, disk IO may simply not keep up, because IO is a very limited resource. So splitting the file is the best policy.
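Here is a minimal sketch of the SXSSFWorkbook row window just described, assuming POI 3.8+ on the classpath; the 2000-row window, row counts, and file name are illustrative:

```java
import java.io.FileOutputStream;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class ExcelExport {
    public static void main(String[] args) throws Exception {
        // Keep at most 2000 rows in memory; older rows are flushed to temp files.
        SXSSFWorkbook wb = new SXSSFWorkbook(2000);
        Sheet sheet = wb.createSheet("data");
        for (int r = 0; r < 100000; r++) {      // illustrative row count
            Row row = sheet.createRow(r);
            for (int c = 0; c < 10; c++) {
                Cell cell = row.createCell(c);
                cell.setCellValue("r" + r + "c" + c);
            }
        }
        try (FileOutputStream out = new FileOutputStream("export.xlsx")) {
            wb.write(out);
        }
        wb.dispose(); // remove the temporary files backing the flushed rows
    }
}
```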
When we write CSV, which is just a text file, we can usually control things ourselves. Don't bother with the APIs that CSV libraries provide; they are not that controllable. CSV is a plain text file, and anything you write in text in CSV format will be recognized. So how do we write it? On the data side, say we read from the database and generate the local file. To keep the code simple we don't manage 1M chunks ourselves; that buffering is handled by the driver underneath, so from the program's point of view the writes are continuous. Now suppose we want to export a table with 10 million rows to a file. You could paginate: Oracle with the classic three-layer ROWNUM wrapper, MySQL with LIMIT. But each page is a fresh query, and with deeper pages it gets slower and slower. What we really want is to get a cursor handle and move forward through the result set, writing to the file once we have accumulated a batch (say 10,000 rows). (I won't go over file-writing details, that is basic, but note that when writing each batch through an OutputStream it is best to flush() and empty the buffer.) Next question: if we execute a SQL statement with no WHERE condition, will memory blow up? Yes, this is worth thinking about. Looking at the API you discover you can do something about it: PreparedStatement statement = connection.prepareStatement(sql) is precompiled by default, and you can also use PreparedStatement statement = connection.prepareStatement(sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY); to set up a forward-only, read-only cursor, so that the driver does not pull the whole result set straight into local memory, and then call statement.setFetchSize(200) to set how many rows the cursor fetches per round trip.
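Put together, the export loop might look like the following sketch; the Oracle URL, credentials, big_table and its two columns, and the 10,000-row flush interval are all illustrative assumptions:

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CsvExport {
    public static void main(String[] args) throws Exception {
        // Illustrative connection details; replace with your own.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:oracle:thin:@host:1521:orcl", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, name FROM big_table",   // no WHERE clause; rely on the cursor
                     ResultSet.TYPE_FORWARD_ONLY,
                     ResultSet.CONCUR_READ_ONLY);
             BufferedWriter writer = new BufferedWriter(new FileWriter("export.csv"))) {
            ps.setFetchSize(200); // rows pulled from the server per round trip
            try (ResultSet rs = ps.executeQuery()) {
                long count = 0;
                while (rs.next()) {
                    writer.write(rs.getLong(1) + "," + rs.getString(2));
                    writer.newLine();
                    if (++count % 10000 == 0) {
                        writer.flush(); // empty the buffer every 10,000 rows
                    }
                }
            }
            writer.flush();
        }
    }
}
```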
OK, I used exactly this. On Oracle it made no difference whether setFetchSize was set or not, because the Oracle JDBC driver does not cache the whole result set into Java memory by default, while on MySQL the setting alone does nothing at all. I have said a pile of nonsense, hehe; my point is just that a standard Java API is not necessarily effective, and you often have to look at the vendor's implementation mechanism. Plenty of articles online claim this setting works, but that is pure copy-paste. For Oracle you don't need to care; it does not cache the rows into memory anyway, so Java memory will not be a problem. For MySQL, you must first be on version 5 or later, then add the useCursorFetch=true parameter to the connection string; the cursor size can be set by also adding defaultFetchSize=1000 to the connection parameters, for example:
jdbc:mysql://xxx.xxx.xxx.xxx:3306/abc?zeroDateTimeBehavior=convertToNull&useCursorFetch=true&defaultFetchSize=1000
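A sketch of the MySQL side with those two connection parameters; the host, credentials, and table name are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MySqlStreamRead {
    public static void main(String[] args) throws Exception {
        // useCursorFetch=true enables server-side cursors (MySQL 5+);
        // defaultFetchSize applies when setFetchSize() is not called per statement.
        String url = "jdbc:mysql://xxx.xxx.xxx.xxx:3306/abc"
                + "?zeroDateTimeBehavior=convertToNull"
                + "&useCursorFetch=true&defaultFetchSize=1000";
        try (Connection conn = DriverManager.getConnection(url, "user", "pass");
             PreparedStatement ps = conn.prepareStatement("SELECT id FROM big_table")) {
            ps.setFetchSize(1000); // optional: overrides defaultFetchSize for this statement
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process rs.getLong(1); rows arrive 1000 at a time, memory stays flat
                }
            }
        }
    }
}
```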
Last time I was tangled up by this problem for a long time (MySQL kept bloating the program's memory; two exports in parallel and the system went down outright). I dug through a lot of source code before discovering that the miracle was here, finally confirmed it in the MySQL documentation, and then tested it: with multiple exports in parallel, each over 5 million rows, memory no longer bloats and GC is fine. The problem was finally over.
Let's talk about something else: splitting and merging data. When there are many data files we want to merge them, and when a file is too large we want to split it, and the merge and split processes run into similar problems. Fortunately, this part is within our control. If the data in the file can ultimately be organized, then when splitting and merging, don't work by logical line count, because to count lines you would have to interpret the data itself, and for a plain split that is unnecessary. What you need is binary processing, and in binary processing you must not read the file the way you usually would. Most code reads an entire file with a single read operation; with a large file, memory is killed outright. Instead, read a controllable range each time; the read method provides overloads taking an offset and a length, which you can calculate yourself inside the loop. Writing a large file is the same as above: once you have accumulated a certain amount, flush it to disk through the write stream. A minimal split sketch follows.
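This is the bounded-read loop just described; the 1M read window, the 100M part size, and the file names are illustrative:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class FileSplitter {
    public static void main(String[] args) throws Exception {
        byte[] buf = new byte[1024 * 1024];      // 1M read window, never the whole file
        long partSize = 100L * 1024 * 1024;      // illustrative 100M per piece
        int part = 0;
        long written = 0;
        try (InputStream in = new BufferedInputStream(new FileInputStream("big.dat"))) {
            OutputStream out = new FileOutputStream("big.dat.part" + part);
            int n;
            while ((n = in.read(buf, 0, buf.length)) != -1) { // bounded read: offset + length
                out.write(buf, 0, n);
                written += n;
                if (written >= partSize) {       // roll over to the next piece
                    out.flush();
                    out.close();
                    out = new FileOutputStream("big.dat.part" + (++part));
                    written = 0;
                }
            }
            out.flush();
            out.close();
        }
    }
}
```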
This kind of chunked handling is also where modern NIO technology is useful, for example when multiple terminals simultaneously request a large file, say a video download. In general, if a Java container handles it, there are two situations. One is memory overflow: each request loads a file's worth of memory or even more, because Java object wrapping adds a lot of extra overhead (working with raw binary produces less), and the data goes through several memory copies on its way through the input and output streams. Of course, if you have middleware like nginx in front, you can ship the file out via its send_file mode, but if you want to handle it in your own program, you need a huge amount of memory; and a huge Java heap still has GC, so if your heap really is that big, GC pauses will kill you. You can also consider allocating and releasing direct (off-heap) memory yourself, but that requires enough remaining physical memory. How much is enough? There is no fixed answer; it depends on the size of the files and the access frequency.
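Java's rough analog of send_file, for what it's worth, is FileChannel.transferTo, which can let the OS move bytes from the file to the socket and skip the user-space buffer copies a stream-to-stream loop would make. A hedged sketch; the peer address and file name are placeholders:

```java
import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySend {
    public static void main(String[] args) throws Exception {
        // Illustrative peer address; in a server you would use the accepted channel.
        try (SocketChannel socket = SocketChannel.open(new InetSocketAddress("127.0.0.1", 9000));
             FileChannel file = new FileInputStream("big.dat").getChannel()) {
            long pos = 0;
            long size = file.size();
            while (pos < size) {
                // transferTo can delegate the copy to the OS (sendfile-style),
                // avoiding the intermediate copies of a read/write loop
                pos += file.transferTo(pos, size - pos, socket);
            }
        }
    }
}
```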
Second, suppose memory is large enough, effectively unlimited; then the limit becomes threads. The traditional IO model is one thread per request: the thread is allocated from the thread pool by the main dispatcher, starts working, passes through your context wrapper, Filters, Interceptors, business code at every layer, business logic, database access, file access, result rendering and so on, and throughout the whole process the thread is tied up, so this resource is very limited. And if the large-file operation is IO-intensive, a lot of CPU time sits idle. The most direct remedy is of course to increase the number of threads, and with enough memory you have room to grow the thread pool, but a process's thread pool is generally capped, and it is not recommended to make it too large. So under limited system resources, to improve performance, new IO technology appeared: NIO, and in newer versions AIO. NIO is only synchronous non-blocking IO: the actual read and write operations still block (it just does not sit blocked waiting for responses in between), so it is not yet true asynchronous IO. When listening for connections it does not need a crowd of threads to participate; a single thread handles it, the traditional socket becomes a Selector, and connections with no data to process are not assigned a thread at all. AIO goes further through a so-called callback registration, which of course also needs OS support; it allocates a thread only when work is actually due. It is not very mature yet, and its performance is about even with NIO, but as the technology develops AIO will inevitably surpass NIO. The node.js platform, driven by Google's V8 engine, follows a similar pattern. That technology is not the focus of this article.
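To make the Selector idea concrete, here is a minimal single-threaded NIO server sketch; the port is arbitrary, and fillNextChunk is a hypothetical helper standing in for the per-connection file reading a real server would do:

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class NioServer {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9000));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);
        ByteBuffer buf = ByteBuffer.allocate(8 * 1024); // one 8K slice per write
        while (true) {
            selector.select(); // one thread waits on all connections at once
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_WRITE);
                } else if (key.isWritable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    buf.clear();
                    // fillNextChunk is hypothetical: it would copy this
                    // connection's next file slice into the buffer
                    if (!fillNextChunk(client, buf)) {
                        key.cancel();
                        client.close();
                        continue;
                    }
                    buf.flip();
                    client.write(buf);
                }
            }
        }
    }

    // Hypothetical stub: returns false when this client's download is complete.
    static boolean fillNextChunk(SocketChannel client, ByteBuffer buf) {
        return false;
    }
}
```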
Combining the above solves large files and also gives a degree of parallelism. The crudest method is to cap the size of each request at something like 8K (a size that testing shows suits network transmission; local file reads need not be that small). Going deeper, you can add a degree of caching, keeping chunks requested by multiple clients for the same file in memory or in a distributed cache. You don't cache the entire file in memory; you cache the recently used pieces for a few seconds or so, or match them with some hotness algorithm. Download protocols like Xunlei's (Thunder) work differently again: the data is not necessarily downloaded in a continuous sequence, as long as the pieces can be merged at the end, so on the server side you can serve, in turn, whoever needs whichever piece (see the positioned-read sketch after this paragraph). Only with NIO can you support that many connections and that much concurrency. I did a local socket test with NIO: 100 terminals simultaneously requested a server running a single thread. With a normal web application, before the first file finished sending, the second request would either wait, time out, or be refused outright; switched to NIO, all 100 requests connected to the server, and the server side needed only one thread to process the data, feeding these connection requests in turn, reading a bit of data each time and sending it out. You can calculate that over the whole long-connection transfer the overall throughput does not increase, but the memory and thread cost becomes quantified and controlled. This is the charm of the technology: perhaps not many algorithms, but you have to understand it.
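As a sketch of serving out-of-order pieces, here is a positioned read of an arbitrary chunk; the 8K chunk size follows the text, while the file name and chunk index are illustrative:

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChunkReader {
    static final int CHUNK = 8 * 1024; // 8K per network write, as discussed above

    // Reads chunk number `index` of the file; chunks can be served in any order
    // and merged by the client, as with Xunlei-style download protocols.
    static ByteBuffer readChunk(FileChannel file, long index) throws Exception {
        ByteBuffer buf = ByteBuffer.allocate(CHUNK);
        file.read(buf, index * CHUNK); // positioned read: no shared cursor to move
        buf.flip();
        return buf;
    }

    public static void main(String[] args) throws Exception {
        try (FileChannel file = FileChannel.open(Paths.get("big.dat"),
                StandardOpenOption.READ)) {
            ByteBuffer third = readChunk(file, 2); // a client asking for the 3rd piece first
            System.out.println("read " + third.remaining() + " bytes");
        }
    }
}
```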
There is a lot of similar data processing, and sometimes it also comes down to efficiency; for example, splitting and merging files in HBase without affecting the online business is a rather difficult thing. Many problems are worth studying scenario by scenario, because different scenarios have different solutions, but the ideas and methods carry over: understand memory, understand the architecture, understand the scenario you are facing, and the right change in the details can bring amazing results.
Author: Wind and Fire Data
Link: https://juejin.im/post/5b556c846fb9a04f9963a8b5
Source: Juejin
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please indicate the source.