univocity-parsers is an open-source Java project for parsing CSV, TSV, and fixed-width text files. It offers a rich and powerful feature set behind a simple API; the sections that follow introduce it in more detail.
Unlike other parsing libraries, univocity-parsers is built on its own architecture, designed for high performance and extensibility. Developers can also build new file parsers on top of this architecture.
1. Overview
As a Java developer, I am currently working on a web project that helps telecom operators evaluate their current networks and provides solutions. In this project, CSV files play a critical role: they are the carrier format for the operator's network data, which includes each broadband user's real-time status (online/offline) and real-time traffic. A single CSV file can typically exceed 1GB and contain millions of records. The CSV parsing library the project currently uses is javacsv.
As operator networks expand and the system's monitoring periods grow longer, the CSV files grow quickly. The project team had to address the performance of parsing large CSV data (aiming for parsing times measured in seconds), as well as the limits on extending functionality as business requirements changed.
After many tests and analyses, we finally decided to use univocity-parsers as our CSV parsing library, and found that it did solve our problems. In addition to better performance and extensibility, the library provides developers with easy-to-understand APIs, documentation, and tutorials. For requests to extend it with advanced features, the maintainers offer paid services.
The project is hosted on GitHub and, at the time of writing, has 69 stars and 10 forks. Development documentation and tutorials, as well as more examples and news, are available online.
It is worth noting that Apache Camel, a well-known open-source project, has also integrated univocity-parsers as a recommended library for parsing CSV/TSV/fixed-width text files.
2. Installation
Our project team currently uses version 1.5.1; we recommend downloading the latest version from the official univocity-parsers website.
The project is also published in the Maven Central repository, so you can add the following dependency directly to your pom.xml:
<dependency>
    <groupId>com.univocity</groupId>
    <artifactId>univocity-parsers</artifactId>
    <version>1.5.1</version>
</dependency>
3. Introduction to Features
univocity-parsers offers a range of powerful features that should cover most of your processing needs for tabular data. The table shows some of the key features:
4. Reading tabular data
Read all rows of a CSV file:
CsvParser parser = new CsvParser(new CsvParserSettings());
List<String[]> allRows = parser.parseAll(getReader("/examples/example.csv"));
To see all the features related to reading files, please visit: https://github.com/uniVocity/univocity-parsers#reading-csv
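To illustrate what a CSV parser has to handle beyond splitting on commas, here is a minimal sketch in plain Java. It does not use univocity-parsers and says nothing about how the library is implemented; the names CsvLineSketch and parseLine are hypothetical, and the sketch assumes a single line with comma delimiters and doubled-quote escaping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of univocity-parsers.
class CsvLineSketch {

    /** Splits one CSV line into fields, honoring double-quoted fields. */
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // "" inside quotes is a literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // delimiters are plain text inside quotes
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // The second field contains a comma, the third contains escaped quotes.
        for (String field : parseLine("user1,\"online, idle\",\"say \"\"hi\"\"\"")) {
            System.out.println(field);
        }
    }
}
```

Even this small case needs a state machine, and multi-line quoted values, other delimiters, and large inputs add more; that is the work a dedicated parsing library takes over.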
5. Writing tabular data
With just two lines of code, you can write data in CSV format:
List<String[]> rows = ...; // the rows to write
CsvWriter writer = new CsvWriter(outputWriter, new CsvWriterSettings()); // outputWriter: any java.io.Writer
writer.writeRowsAndClose(rows);
To see all the features related to writing files, please visit: https://github.com/uniVocity/univocity-parsers/blob/master/README.md#writing
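As a rough picture of what a CSV writer does with each row, the following plain-Java sketch quotes fields only when needed. It is not the univocity-parsers implementation; CsvWriteSketch, toCsvLine, and writeRows are hypothetical names, and the sketch assumes comma delimiters, double-quote escaping, and "\n" line endings.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of univocity-parsers.
class CsvWriteSketch {

    /** Formats one row as a CSV line, quoting fields that contain special characters. */
    static String toCsvLine(String[] row) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < row.length; i++) {
            if (i > 0) {
                out.append(',');
            }
            String value = row[i] == null ? "" : row[i]; // assumption: null becomes an empty field
            boolean needsQuotes = value.contains(",") || value.contains("\"")
                    || value.contains("\n") || value.contains("\r");
            if (needsQuotes) {
                // Escape embedded quotes by doubling them, then wrap the field in quotes.
                out.append('"').append(value.replace("\"", "\"\"")).append('"');
            } else {
                out.append(value);
            }
        }
        return out.toString();
    }

    /** Writes all rows to the given writer, one line per row. */
    static void writeRows(List<String[]> rows, Writer writer) throws IOException {
        for (String[] row : rows) {
            writer.write(toCsvLine(row));
            writer.write('\n');
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        List<String[]> rows = Arrays.asList(
                new String[] {"user", "status", "traffic"},
                new String[] {"user1", "online", "1,024"});
        StringWriter out = new StringWriter();
        writeRows(rows, out);
        System.out.print(out);
    }
}
```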
6. Performance and Extensibility
Here is a table comparing the parsing times of univocity-parsers and javacsv:
| File size | javacsv parsing time | univocity-parsers parsing time |
| 10MB, 145453 rows | 1138ms | |
| 100MB, 809008 rows | 23s | 6s |
| 434MB, 4499959 rows | 91s | |
| 1GB, 23803502 rows | 70s |
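Taking the 100MB row of the table as a worked example (23s for javacsv, 6s for univocity-parsers), the speedup and throughput can be computed as in the sketch below. The class and method names are hypothetical, and the only inputs are the figures from that row.

```java
import java.util.Locale;

// Hypothetical helper for reading the comparison table; not part of either library.
class SpeedupSketch {

    /** How many times faster the candidate is than the baseline. */
    static double speedup(double baselineSeconds, double candidateSeconds) {
        return baselineSeconds / candidateSeconds;
    }

    /** Throughput in megabytes per second. */
    static double throughputMBps(double sizeMB, double seconds) {
        return sizeMB / seconds;
    }

    public static void main(String[] args) {
        double sizeMB = 100.0;        // file size from the 100MB row
        double javacsvSeconds = 23.0; // javacsv parsing time from the table
        double univocitySeconds = 6.0; // univocity-parsers parsing time from the table

        System.out.println(String.format(Locale.ROOT, "speedup: %.2fx",
                speedup(javacsvSeconds, univocitySeconds)));
        System.out.println(String.format(Locale.ROOT, "javacsv: %.2f MB/s",
                throughputMBps(sizeMB, javacsvSeconds)));
        System.out.println(String.format(Locale.ROOT, "univocity-parsers: %.2f MB/s",
                throughputMBps(sizeMB, univocitySeconds)));
    }
}
```

For this row, the ratio is a little under 4x, or roughly 4.3 MB/s against 16.7 MB/s.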
Performance comparison tables covering almost all CSV parsing libraries are also available online; in them, univocity-parsers leads the other libraries by a clear margin.
The performance and flexibility of univocity-parsers come from the following design choices and mechanisms:
- Reading input on a separate thread (enabled by calling CsvParserSettings.setReadInputOnSeparateThread())
- Processing rows in parallel (see ConcurrentRowProcessor, an implementation of RowProcessor)
- Processing column data according to business requirements by extending the ColumnProcessor class
- Processing row data according to business requirements by implementing the RowProcessor interface
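The first mechanism in the list, reading input on a separate thread, can be pictured as a producer-consumer pipeline. The sketch below is a generic plain-Java illustration under that assumption, not the library's actual implementation; the class name, queue size, and the naive comma split are all hypothetical stand-ins.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration only; not univocity-parsers code.
class SeparateReaderThreadSketch {

    // Unique end-of-input marker, compared by identity so a real line "EOF" is not mistaken for it.
    private static final String END = new String("EOF");

    static List<String[]> parse(Reader input) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        AtomicReference<IOException> failure = new AtomicReference<>();

        // Producer: reads lines on its own thread so that I/O overlaps with parsing.
        Thread readerThread = new Thread(() -> {
            try {
                try (BufferedReader in = new BufferedReader(input)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        queue.put(line); // blocks when the consumer falls behind
                    }
                } catch (IOException e) {
                    failure.set(e);
                }
                queue.put(END); // signal completion, even after an I/O error
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // sketch: no recovery if interrupted
            }
        });
        readerThread.start();

        // Consumer: the calling thread turns lines into rows.
        List<String[]> rows = new ArrayList<>();
        try {
            for (String line = queue.take(); line != END; line = queue.take()) {
                rows.add(line.split(",", -1)); // naive split; stands in for real CSV parsing
            }
            readerThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while parsing", e);
        }
        if (failure.get() != null) {
            throw new UncheckedIOException(failure.get());
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String[]> rows = parse(new StringReader("user1,online,42\nuser2,offline,0\n"));
        for (String[] row : rows) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

The bounded queue is the design choice worth noting: it lets reading run ahead of parsing without letting a 1GB file pile up in memory.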
7. Design and implementation
univocity-parsers has a set of core data-processing modules that are responsible for reading and writing data by row, reading and writing data by column, and converting row and column data. Here is the diagram of these core modules:
You can develop your own data-processing module by implementing the RowProcessor interface or by extending one of its implementation classes. In the following code, I implement my own data-processing module with a simple anonymous inner class.
import java.io.FileReader;
import java.util.List;
import com.univocity.parsers.common.ParsingContext;
import com.univocity.parsers.common.processor.RowProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

CsvParserSettings settings = new CsvParserSettings();
settings.setRowProcessor(new RowProcessor() {
    StringBuilder stringBuilder = new StringBuilder();

    /** Before the first row is processed, do any initialization your business logic needs. */
    @Override
    public void processStarted(ParsingContext context) {
        System.out.println("Started to process rows of data.");
    }

    /** Process each row according to your business logic. */
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        System.out.println("The row in line #" + context.currentLine() + ":");
        for (String col : row) {
            stringBuilder.append(col).append("\t");
        }
    }

    /** After all rows have been processed, do any cleanup work. */
    @Override
    public void processEnded(ParsingContext context) {
        System.out.println("Finished processing rows of data.");
        System.out.println(stringBuilder);
    }
});
CsvParser parser = new CsvParser(settings);
List<String[]> allRows = parser.parseAll(new FileReader("/myfile.csv"));
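The same three-callback shape (start, per-row, end) also supports column-oriented processing. The plain-Java sketch below collects values per column index; it only mirrors the idea and does not use the library's RowProcessor or ColumnProcessor classes. RowHandler, ColumnCollector, and drive are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of univocity-parsers.
class ColumnCollectorSketch {

    /** Hypothetical callback interface with a start, per-row, and end hook. */
    interface RowHandler {
        void started();
        void row(String[] row, long lineNumber);
        void ended();
    }

    /** Groups the values of every row by column index. */
    static class ColumnCollector implements RowHandler {
        final Map<Integer, List<String>> columns = new TreeMap<>();

        @Override
        public void started() {
            columns.clear(); // reset state so the handler can be reused
        }

        @Override
        public void row(String[] row, long lineNumber) {
            for (int i = 0; i < row.length; i++) {
                columns.computeIfAbsent(i, k -> new ArrayList<>()).add(row[i]);
            }
        }

        @Override
        public void ended() {
            // nothing to clean up in this sketch
        }
    }

    /** Stands in for a parser: feeds already-parsed rows to the handler. */
    static void drive(List<String[]> rows, RowHandler handler) {
        handler.started();
        long lineNumber = 0;
        for (String[] row : rows) {
            handler.row(row, ++lineNumber);
        }
        handler.ended();
    }

    public static void main(String[] args) {
        ColumnCollector collector = new ColumnCollector();
        drive(Arrays.asList(
                new String[] {"user1", "online", "42"},
                new String[] {"user2", "offline", "0"}), collector);
        System.out.println(collector.columns);
    }
}
```

Because the handler sees one row at a time, it can aggregate a single column of a very large file without keeping every row in memory.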
The univocity-parsers library offers much more than this. It plays a big role in our project, and I recommend that you learn more about it.
univocity-parsers: A powerful CSV/TSV/fixed-width text file parsing library (Java)