How to Read a Large File Efficiently in Java
1. Overview
This tutorial demonstrates how to read a large file efficiently in Java. This article is part of the "Java Back to Basics" series of tutorials on Baeldung (http://www.baeldung.com).
2. Reading in Memory
The standard way to read the lines of a file is to read them all into memory. Both Guava and Apache Commons IO provide quick methods to do just that:
Files.readLines(new File(path), Charsets.UTF_8);
FileUtils.readLines(new File(path));
The problem with this approach is that all the lines of the file are kept in memory; if the file is large enough, the program will quickly throw an OutOfMemoryError.
For example, reading a file of about 1 GB:
@Test
public void givenUsingGuava_whenIteratingAFile_thenWorks() throws IOException {
    String path = ...
    Files.readLines(new File(path), Charsets.UTF_8);
}
At the start, this approach occupies only a small amount of memory (about 0 MB consumed):
[main] INFO org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 128 Mb
[main] INFO org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 116 Mb
However, once the entire file has been read into memory, we can see (about 2 GB consumed):
[main] INFO org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 2666 Mb
[main] INFO org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 490 Mb
This means the process consumed roughly 2.1 GB of memory (2666 MB total minus 490 MB free). The reason is simple: all the lines of the file are now stored in memory.
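As an aside, the memory figures in these logs can be reproduced with the standard Runtime API. The sketch below is my own addition (the class name is a stand-in, not the article's test class); it prints the same two numbers and their difference:

```java
public class MemoryLog {

    // Approximate used heap in megabytes, computed the same way as
    // "Total Memory" minus "Free Memory" in the log output above.
    static long usedMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("Total Memory: " + rt.totalMemory() / (1024 * 1024) + " Mb");
        System.out.println("Free Memory: " + rt.freeMemory() / (1024 * 1024) + " Mb");
        System.out.println("Used (approx.): " + usedMb() + " Mb");
    }
}
```

For the log above, that works out to 2666 MB minus 490 MB, or about 2.1 GB in use.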
It should be clear that keeping the entire file content in memory will quickly exhaust the available memory, no matter how much memory is actually available.
Furthermore, we usually do not need all the lines of the file in memory at once; instead, we just need to iterate over each line, do some processing, and then discard it. So that is exactly what we will do: iterate through the lines rather than holding them all in memory.
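On Java 8 and later, this iterate-instead-of-load approach can also be expressed with java.nio.file.Files.lines, which returns a lazily populated Stream. This variant is my addition, not one of the article's original examples:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Stream;

public class StreamLines {

    // Files.lines reads lazily: lines are fetched as the stream is consumed,
    // so the whole file is never held in memory at once. The try-with-resources
    // block is required so the underlying file handle is closed.
    static long countLines(Path path) throws IOException {
        try (Stream<String> lines = Files.lines(path)) {
            return lines.count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("large", ".txt");
        Files.write(path, Arrays.asList("line 1", "line 2", "line 3"));
        System.out.println(countLines(path)); // prints 3
        Files.delete(path);
    }
}
```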
3. Streaming Through the File
Now let's look at a solution: we will use a java.util.Scanner to run through the contents of the file and retrieve lines serially, one by one:
FileInputStream inputStream = null;
Scanner sc = null;
try {
    inputStream = new FileInputStream(path);
    sc = new Scanner(inputStream, "UTF-8");
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        // System.out.println(line);
    }
    // note that Scanner suppresses exceptions
    if (sc.ioException() != null) {
        throw sc.ioException();
    }
} finally {
    if (inputStream != null) {
        inputStream.close();
    }
    if (sc != null) {
        sc.close();
    }
}
This solution iterates over all the lines in the file, allowing each line to be processed without keeping a reference to it. In short, the lines are never all in memory at once (about 150 MB consumed):
[main] INFO org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 763 Mb
[main] INFO org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 605 Mb
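An equivalent sketch using BufferedReader with try-with-resources (my addition, not shown in the original article) avoids Scanner's exception-suppression caveat, since read errors surface directly as IOException:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BufferedRead {

    // Read one buffered line at a time; try-with-resources guarantees the
    // reader is closed even if processing throws.
    static List<String> upperCaseLines(Path path) throws IOException {
        List<String> result = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(path)) {
            String line;
            while ((line = reader.readLine()) != null) {
                result.add(line.toUpperCase()); // process each line (here: upper-case it)
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("demo", ".txt");
        Files.write(path, Arrays.asList("alpha", "beta"));
        System.out.println(upperCaseLines(path)); // prints [ALPHA, BETA]
        Files.delete(path);
    }
}
```

Note that this demo collects the lines into a list only to show the result; for a truly large file you would process and discard each line inside the loop instead.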
4. Streaming with Apache Commons IO
The same result can be achieved with the Commons IO library, using the custom LineIterator it provides:
LineIterator it = FileUtils.lineIterator(theFile, "UTF-8");
try {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with line
    }
} finally {
    LineIterator.closeQuietly(it);
}
Since the entire file is never fully held in memory, this also results in very conservative memory consumption (approximately 150 MB consumed):
[main] INFO o.b.java.CoreJavaIoIntegrationTest - Total Memory: 752 Mb
[main] INFO o.b.java.CoreJavaIoIntegrationTest - Free Memory: 564 Mb
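FileUtils and LineIterator come from the commons-io artifact. A typical Maven coordinate looks like the following (the version shown is illustrative; check for the latest release):

```xml
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.11.0</version> <!-- illustrative; use the latest version -->
</dependency>
```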
5. Conclusion
This quick article shows how to process the lines of a large file without reading the whole file in at once and without exhausting the available memory, which proves quite useful when working with such large files.
The implementation of all these examples and code snippets can be found in my GitHub project. It is an Eclipse-based project, so it should be easy to import and run as-is.