Thinking logic of computer programs (64): processing common file types
For file processing, we have introduced stream-based approaches: section 57 covered byte streams and section 58 covered character streams. We also introduced lower-level ways of working with files: section 60 covered random-access file reading and writing, and section 61 covered memory-mapped files. We then introduced object serialization and deserialization: section 62 covered Java's standard serialization, and section 63 covered how to use Jackson to handle other serialization formats such as XML, JSON, and MessagePack.
In daily programming, we often need to process specific types of files, such as CSV, Excel, and HTML. Handling them directly with the methods from the previous sections is inconvenient; third-party class libraries often provide more convenient, easier-to-use interfaces built on those techniques.
This section briefly describes how to use Java SDK and some third-party class libraries to process the following five types of files:
- Property files: property files are common configuration files, used to change the behavior of a program without changing its code.
- CSV: CSV stands for Comma-Separated Values. It is a very common file type: many log files are CSV, and CSV is also often used to exchange tabular data. As we will see later, CSV looks simple, but the complexity of processing it is often underestimated.
- Excel: in programming, we often need to export tabular data to Excel for users' convenience, and we often need to accept Excel files as input to import data in batches.
- HTML: All webpages are in HTML format. We often need to analyze HTML webpages to extract information of interest.
- Compressed file: there are multiple formats of compressed files and many compression tools. In most cases, we can use tools without writing programs to process compressed files. However, in some cases, you need to compress or decompress the file by programming.
Property File
Property files are generally simple. Each line represents a property, which is a key-value pair; the key and the value are separated by an equal sign (=) or a colon (:). They are generally used to configure program parameters. For example, a program that connects to a database often keeps the database settings in a configuration file. Suppose the file config.properties contains the following:
db.host = 192.168.10.100
db.port : 3306
db.username = zhangsan
db.password = mima1234
Such a file is easy to process with a character stream, but Java has a dedicated class, java.util.Properties, which is also very simple to use. Its main methods are:
public synchronized void load(InputStream inStream)
public String getProperty(String key)
public String getProperty(String key, String defaultValue)
load reads properties from a stream, and getProperty returns a property value. The second getProperty overload takes a default value, which is returned when the key is not configured. For the configuration file above, you can use code like the following:
Properties prop = new Properties();
prop.load(new FileInputStream("config.properties"));
String host = prop.getProperty("db.host");
int port = Integer.valueOf(prop.getProperty("db.port", "3306"));
The benefits of using the Properties class to process property files are:
- Spaces are handled automatically: the spaces before and after the separator are ignored.
- Blank lines are ignored automatically.
- Comments are supported: a line beginning with # or ! is treated as a comment and ignored.
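These behaviors can be checked with a short sketch. To stay self-contained it parses the property text from an in-memory string instead of config.properties, and it assumes Java 8 or later; the class name PropertiesDemo and the helper loadFromString are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical demo class: shows how Properties treats spaces, blank lines and comments.
final class PropertiesDemo {

    // Parses property text held in a string (it stands in for a file such as config.properties).
    static Properties loadFromString(String text) {
        Properties prop = new Properties();
        try {
            prop.load(new StringReader(text));
        } catch (IOException e) {
            // Reading from a string should not fail; rethrow unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return prop;
    }

    public static void main(String[] args) {
        String text = "# database settings\n"
                + "db.host = 192.168.10.100\n"
                + "\n"
                + "! another comment\n"
                + "db.port : 3306\n";
        Properties prop = loadFromString(text);
        System.out.println(prop.getProperty("db.host"));          // value without the spaces around '='
        System.out.println(prop.getProperty("db.port"));          // ':' works as a separator too
        System.out.println(prop.getProperty("db.timeout", "30")); // key is missing, so the default is returned
        System.out.println(prop.size());                          // comments and blank lines are not entries
    }
}
```

Only the two db.* lines end up as entries; the comment lines and the blank line are skipped.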
However, Properties also has a limitation: load(InputStream) reads the file as ISO 8859-1, so Chinese characters cannot be written directly. In the configuration file, all non-ASCII characters must be written as Unicode escapes. For example, you cannot write this directly in the configuration file:
name=老马
"老马" must be replaced with Unicode escapes, as shown below:
name=\u8001\u9A6C
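To make the escaping concrete, here is a minimal sketch of a helper that rewrites non-ASCII characters as Unicode escapes and then lets Properties decode them again. The class UnicodeEscapes and its methods are hypothetical; the helper only approximates what an escaping tool does and is not meant as a complete implementation. It assumes Java 8 or later.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: rewrites non-ASCII characters as Unicode escapes for property files.
final class UnicodeEscapes {

    // Replaces every char above 0x7F with a backslash, the letter u, and four hex digits.
    static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0x7F) {
                sb.append(String.format("\\u%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Escapes the value, loads it as the property "name", and returns what Properties decoded.
    static String roundTrip(String value) {
        Properties prop = new Properties();
        try {
            prop.load(new StringReader("name=" + escapeNonAscii(value)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return prop.getProperty("name");
    }

    public static void main(String[] args) {
        String original = "\u8001\u9A6C"; // the two Chinese characters, written as Java escapes
        System.out.println("name=" + escapeNonAscii(original));
        System.out.println(original.equals(roundTrip(original))); // Properties turns the escapes back into characters
    }
}
```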
In a Java IDE such as Eclipse, the property file editor automatically replaces Chinese characters with Unicode escapes. With other editors, you can write the Chinese text first and then use the native2ascii command provided by the JDK to convert it. The usage is as follows:
native2ascii -encoding UTF-8 native.properties ascii.properties
native.properties is the input file, which contains Chinese characters; ascii.properties is the output file, in which they are replaced with Unicode escapes; -encoding specifies the encoding of the input file, here UTF-8.
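As an alternative to escaping, Properties also has a load(Reader) overload (available since Java 6), so a file can be decoded with an explicit charset before it is parsed. The sketch below is only an illustration under that assumption: it uses in-memory bytes in place of a UTF-8 encoded file, and the class name Utf8PropertiesDemo and method loadUtf8 are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class: loads UTF-8 property text through a Reader.
final class Utf8PropertiesDemo {

    // Decodes the bytes as UTF-8 and lets Properties parse the resulting characters.
    static Properties loadUtf8(byte[] utf8Bytes) {
        Properties prop = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8)) {
            prop.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return prop;
    }

    public static void main(String[] args) {
        // Simulates a UTF-8 file whose single line contains the raw Chinese characters.
        byte[] fileContent = "name=\u8001\u9A6C".getBytes(StandardCharsets.UTF_8);
        Properties prop = loadUtf8(fileContent);
        System.out.println("\u8001\u9A6C".equals(prop.getProperty("name")));
    }
}
```

For a real file, the ByteArrayInputStream would be replaced by a FileInputStream on the property file.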
CSV file
CSV stands for Comma-Separated Values. Generally, one line is one record, a record contains multiple fields, and fields are separated by commas. In practice, however, the separator is not necessarily a comma; it may be another character, such as a tab ('\t'), a colon (':'), or a semicolon (';'). Many program log files are CSV files, and CSV is also frequently used to import and export tabular data.
The CSV format looks simple. For example, when we saved the student list in section 58, we used the CSV format, as shown below:
Zhang San,18,80.9
Li Si,17,67.5
With the character streams introduced earlier, processing CSV files seems easy: read line by line and split each line with String.split. In fact, CSV has some complications. The most important are:
- What should I do if the field content contains delimiters?
- What should I do if the field content contains a line break?
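The first question is easy to reproduce. The sketch below (the class name NaiveCsvSplit is hypothetical) applies String.split to a plain record and to a record whose first field is quoted and contains a comma.

```java
import java.util.Arrays;

// Hypothetical demo class: shows where splitting on commas goes wrong.
final class NaiveCsvSplit {

    // Splits on every comma, with no knowledge of quoting.
    static String[] naiveSplit(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        String plain = "Zhang San,18,80.9";
        String quoted = "\"Hello, world\",18,80.9";
        System.out.println(naiveSplit(plain).length);            // 3 fields, as intended
        System.out.println(naiveSplit(quoted).length);           // 4 pieces: the quoted field was cut in two
        System.out.println(Arrays.toString(naiveSplit(quoted)));
    }
}
```

A field that contains a line break fails even earlier, because reading line by line has already cut the record apart before split is called.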
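For these issues, CSV has a reference standard, RFC 4180, but in practice different programs handle them in their own ways. Broadly, there are two approaches:
- Use a quote character such as ": wrap the field content in quotes, and if the content itself contains a quote, write it as two quotes.
- Use an escape character such as \: put it before special characters, and if the content itself contains the escape character, write it twice.
For example, suppose the field content spans two lines:
Hello, world \ abc
"lauma"
With the first approach, the content becomes:
"Hello, world \ abc
""lauma"""
With the second approach, the content becomes:
Hello\, world \\ abc\n"lauma"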
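The two encodings shown above can be sketched as two small functions. This is an illustration only: the class CsvFieldEncoding and its methods are hypothetical, the quoting variant always adds quotes (real programs usually quote only when needed), and the escaping variant assumes a comma delimiter and a backslash escape character.

```java
// Hypothetical helper: encodes one field with either the quoting or the escaping approach.
final class CsvFieldEncoding {

    // Quoting approach: wrap the field in quotes and double every embedded quote.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Escaping approach: put a backslash before the delimiter and the backslash itself,
    // and write line breaks as backslash sequences so the record stays on one line.
    static String escape(String field) {
        StringBuilder sb = new StringBuilder();
        for (char c : field.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case ',':  sb.append("\\,");  break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String field = "Hello, world \\ abc\n\"lauma\""; // the two-line field from the example
        System.out.println(quote(field));
        System.out.println(escape(field));
    }
}
```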
CSV also has other details that different programs handle differently, such as:
- How is a null value represented?
- How are blank lines and spaces around fields handled?
- How are comments represented?
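To give a feel for how much logic even the basic rules require, here is a minimal sketch of a parser for a single record. It handles quoted fields, doubled quotes, and a configurable null string, and nothing else (no escape characters, comments, or space trimming). The class TinyCsvParser and its method are hypothetical and are not a substitute for a real CSV library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: parses one CSV record that is already held in a string.
final class TinyCsvParser {

    // Fields may be quoted; inside quotes, a doubled quote stands for one literal quote.
    // An unquoted field equal to nullString is returned as null.
    static List<String> parseRecord(String record, char delimiter, String nullString) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;  // currently inside a quoted field
        boolean wasQuoted = false; // the current field started with a quote
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < record.length() && record.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote becomes one literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // delimiters and line breaks are plain text here
                }
            } else if (c == '"' && current.length() == 0 && !wasQuoted) {
                inQuotes = true;
                wasQuoted = true;
            } else if (c == delimiter) {
                fields.add(finish(current, wasQuoted, nullString));
                current.setLength(0);
                wasQuoted = false;
            } else {
                current.append(c);
            }
        }
        fields.add(finish(current, wasQuoted, nullString));
        return fields;
    }

    // Converts the collected text to a field value, applying the null-string rule.
    private static String finish(StringBuilder sb, boolean wasQuoted, String nullString) {
        String value = sb.toString();
        return (!wasQuoted && value.equals(nullString)) ? null : value;
    }

    public static void main(String[] args) {
        String record = "\"Hello, world\",N/A,\"say \"\"hi\"\"\"";
        System.out.println(parseRecord(record, ',', "N/A"));
    }
}
```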
Because of these complications, CSV is hard to handle with simple stream operations. A third-party class library, Apache Commons CSV, provides good support for processing CSV. Its official website is: http://commons.apache.org/proper/commons-csv/index.html
This section uses version 1.4 to briefly introduce its usage. If you use Maven to manage your project, you can add the dependencies listed in this file: https://github.com/swiftma/program-logic/blob/master/csv_lib/dependencies.xml. If you do not use Maven, you can download the dependency libraries from: https://github.com/swiftma/program-logic/tree/master/csv_lib
Apache Commons CSV has an important class, CSVFormat, which represents a CSV format. It has many methods for defining a specific format, such as:
// Define the delimiter
public CSVFormat withDelimiter(final char delimiter)
// Define the quote character
public CSVFormat withQuote(final char quoteChar)
// Define the escape character
public CSVFormat withEscape(final char escape)
// Define the string that represents a null value
public CSVFormat withNullString(final String nullString)
// Define the separator between records
public CSVFormat withRecordSeparator(final char recordSeparator)
// Define whether to ignore spaces around fields
public CSVFormat withIgnoreSurroundingSpaces(final boolean ignoreSurroundingSpaces)
For example, suppose the CSV format is defined as follows: a semicolon (;) is the delimiter, " is the quote character, N/A represents a null value, and spaces around fields are ignored. The CSVFormat can then be created like this:
CSVFormat format = CSVFormat.newFormat(';')
        .withQuote('"').withNullString("N/A")
        .withIgnoreSurroundingSpaces(true);
Besides custom formats, the CSVFormat class also provides some predefined formats, such as CSVFormat.DEFAULT and CSVFormat.RFC4180.
CSVFormat has a method for parsing a character stream:
public CSVParser parse(final Reader in) throws IOException
The return value type is CSVParser. It has the following methods to obtain record information:
public Iterator<CSVRecord> iterator()
public List<CSVRecord> getRecords() throws IOException
public long getRecordNumber()
CSVRecord indicates a record. It has the following methods to obtain information about each field:
// Get the value by column index, starting from 0
public String get(final int i)
// Get the value by column name
public String get(final String name)
// Number of fields
public int size()
// Field iterator
public Iterator<String> iterator()
The basic code for analyzing CSV files is as follows:
CSVFormat format = CSVFormat.newFormat(';')
        .withQuote('"').withNullString("N/A")
        .withIgnoreSurroundingSpaces(true);
Reader reader = new FileReader("student.csv");
try {
    for (CSVRecord record : format.parse(reader)) {
        int fieldNum = record.size();
        for (int i = 0; i < fieldNum; i++) {
            System.out.print(record.get(i) + " ");
        }
        System.out.println();
    }
} finally {
    reader.close();
}
Besides parsing CSV files, Apache Commons CSV can also write them. The CSVPrinter class has many print methods, such as:
// Output one record from a variable number of arguments; each argument is a field value
public void printRecord(final Object... values) throws IOException
// Output one record from an Iterable of field values
public void printRecord(final Iterable<?> values) throws IOException
Let's look at the sample code:
CSVPrinter out = new CSVPrinter(new FileWriter("student.csv"), CSVFormat.DEFAULT);
out.printRecord("lauma", 18, "watching movies, reading books, and listening to music");
out.printRecord("Pony", 16, "lego; racing car;");
out.close();
The output file student.csv contains the following content:
lauma,18,"watching movies, reading books, and listening to music"
Pony,16,lego; racing car;
Excel
Excel has two main formats, with the extensions .xls and .xlsx; .xlsx has been the default since Office 2007. In Java, the POI class library is widely used to process Excel files and other Microsoft documents. Its official website is http://poi.apache.org/.
This section uses version 3.15 to briefly introduce its usage. If you use Maven to manage your project, you can add the dependencies listed in this file: https://github.com/swiftma/program-logic/blob/master/excel_lib/dependencies.xml. If you do not use Maven, you can download the dependency libraries from: https://github.com/swiftma/program-logic/tree/master/excel_lib
Using POI to process Excel files mainly involves the following classes:
- Workbook: represents an Excel file. It is an interface with two main implementations: HSSFWorkbook for the .xls format and XSSFWorkbook for the .xlsx format.
- Sheet: represents a worksheet.
- Row: represents a row.
- Cell: represents a cell.
For example, if students are saved to student.xls, the code can be:
public static void saveAsExcel(List<Student> list) throws IOException {
    Workbook wb = new HSSFWorkbook();
    Sheet sheet = wb.createSheet();
    for (int i = 0; i < list.size(); i++) {
        Student student = list.get(i);
        Row row = sheet.createRow(i);
        row.createCell(0).setCellValue(student.getName());
        row.createCell(1).setCellValue(student.getAge());
        row.createCell(2).setCellValue(student.getScore());
    }
    OutputStream out = new FileOutputStream("student.xls");
    wb.write(out);
    out.close();
    wb.close();
}
To save in the .xlsx format instead, you only need to replace the first line with the following (and use a file name ending in .xlsx):
Workbook wb = new XSSFWorkbook();
POI also makes it easy to parse an Excel file, using the create method of WorkbookFactory, as follows:
public static List<Student> readAsExcel() throws Exception {
    Workbook wb = WorkbookFactory.create(new File("student.xls"));
    List<Student> list = new ArrayList<Student>();
    for (Sheet sheet : wb) {
        for (Row row : sheet) {
            String name = row.getCell(0).getStringCellValue();
            int age = (int) row.getCell(1).getNumericCellValue();
            double score = row.getCell(2).getNumericCellValue();
            list.add(new Student(name, age, score));
        }
    }
    wb.close();
    return list;
}
For more information, such as cell format, color, and font, see http://poi.apache.org/spreadsheet/quick-guide.html.
HTML
HTML is the format of web pages. If you are not familiar with it, see http://www.w3school.com.cn/html/html_intro.asp. In daily work, you may need to analyze HTML pages and extract the information you are interested in. There are many HTML parsers; we will briefly introduce one, jsoup, whose official website is https://jsoup.org/.
This section uses version 1.10.2. If you use Maven to manage your project, you can add the dependencies listed in this file: https://github.com/swiftma/program-logic/blob/master/html_lib/dependencies.xml. If you do not use Maven, you can download the dependency library from: https://github.com/swiftma/program-logic/tree/master/html_lib.
Let's look at jsoup through a simple example. The web page we want to analyze is: http://www.cnblogs.com/swiftma/p/5631311.html
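In a browser, the page shows a list of article titles, each linking to the corresponding article. In the saved HTML, these links are a elements inside p elements, under the element whose id is cnblogs_post_body.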
Suppose we want to extract the title and link of each article in the main content of the page. How can we do this? jsoup supports finding elements with CSS selector syntax. For details about CSS selectors, see http://www.w3school.com.cn/cssref/css_selectors.asp.
The CSS selector used to locate the article list can be:
#cnblogs_post_body p a
Let's look at the code (assuming the page is saved as articles.html):
Document doc = Jsoup.parse(new File("articles.html"), "UTF-8");
Elements elements = doc.select("#cnblogs_post_body p a");
for (Element e : elements) {
    String title = e.text();
    String href = e.attr("href");
    System.out.println(title + ", " + href);
}
The output is (partial):
Thinking logic of computer programs (1) - data and variables, http://www.cnblogs.com/swiftma/p/5396551.html
Thinking logic of computer programs (2) - assignment, http://www.cnblogs.com/swiftma/p/5399315.html
jsoup can also connect directly to a URL for analysis. For example, the first line of the code above can be replaced with:
String url = "http://www.cnblogs.com/swiftma/p/5631311.html";
Document doc = Jsoup.connect(url).get();
For more instructions on jsoup, refer to its official website.
Compressed file
Compressed files come in multiple formats. The Java SDK supports two of them: gzip and zip. gzip can compress only a single file, while a zip archive can contain multiple files. Below we introduce the basic usage of the Java SDK; if you need more formats, consider Apache Commons Compress: http://commons.apache.org/proper/commons-compress/
Let's look at gzip first. There are two main classes:
java.util.zip.GZIPOutputStream
java.util.zip.GZIPInputStream
They are subclasses of OutputStream and InputStream respectively, and both are decorator classes: GZIPOutputStream wraps an existing output stream to compress the data written to it, and GZIPInputStream wraps an existing input stream to decompress the data read from it. For example, the code for compressing a file can be:
public static void gzip(String fileName) throws IOException {
    InputStream in = null;
    String gzipFileName = fileName + ".gz";
    OutputStream out = null;
    try {
        in = new BufferedInputStream(new FileInputStream(fileName));
        out = new GZIPOutputStream(new BufferedOutputStream(
                new FileOutputStream(gzipFileName)));
        copy(in, out);
    } finally {
        if (out != null) {
            out.close();
        }
        if (in != null) {
            in.close();
        }
    }
}
The called copy method is described in Section 57. The code for decompressing the file can be:
public static void gunzip(String gzipFileName, String unzipFileName)
        throws IOException {
    InputStream in = null;
    OutputStream out = null;
    try {
        in = new GZIPInputStream(new BufferedInputStream(
                new FileInputStream(gzipFileName)));
        out = new BufferedOutputStream(new FileOutputStream(
                unzipFileName));
        copy(in, out);
    } finally {
        if (out != null) {
            out.close();
        }
        if (in != null) {
            in.close();
        }
    }
}
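The same two decorators can be tried without touching the file system. The sketch below compresses and decompresses a byte array in memory. The class GzipRoundTrip is hypothetical, the copy method is only an assumed shape of the helper from section 57 (a plain buffered read/write loop), and the code assumes Java 8 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class: gzip round trip on byte arrays.
final class GzipRoundTrip {

    // Assumed shape of the copy helper: read into a buffer until the end of the stream.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
    }

    // Wraps the target stream in GZIPOutputStream, so everything written is compressed.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(bytes)) {
            copy(new ByteArrayInputStream(data), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray(); // read only after close, so the gzip trailer is included
    }

    // Wraps the source stream in GZIPInputStream, so everything read is decompressed.
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            copy(in, bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "hello gzip, hello gzip, hello gzip".getBytes(StandardCharsets.UTF_8);
        byte[] restored = gunzip(gzip(original));
        System.out.println(Arrays.equals(original, restored));
    }
}
```

The try-with-resources form used here closes the streams automatically; it is an alternative to the explicit finally blocks in the code above.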
zip supports multiple files in one archive. The main classes in the Java SDK are:
java.util.zip.ZipOutputStream
java.util.zip.ZipInputStream
They are also subclasses of OutputStream and InputStream, and they are also decorator classes, but they are not as simple to use as GZIPOutputStream/GZIPInputStream.
ZipOutputStream can write multiple files. It has an important method:
public void putNextEntry(ZipEntry e) throws IOException
Before writing each file, you must call this method to write a ZipEntry. Each entry has a name, which is the file's path relative to the root of the archive; a name ending with the character '/' denotes a directory. Its constructor is:
public ZipEntry(String name)
Let's look at a piece of code to compress a file or a directory:
public static void zip(File inFile, File zipFile) throws IOException {
    ZipOutputStream out = new ZipOutputStream(new BufferedOutputStream(
            new FileOutputStream(zipFile)));
    try {
        if (!inFile.exists()) {
            throw new FileNotFoundException(inFile.getAbsolutePath());
        }
        inFile = inFile.getCanonicalFile();
        String rootPath = inFile.getParent();
        if (!rootPath.endsWith(File.separator)) {
            rootPath += File.separator;
        }
        addFileToZipOut(inFile, out, rootPath);
    } finally {
        out.close();
    }
}
The inFile parameter is the input, which can be a regular file or a directory, and zipFile is the output. rootPath is the parent directory of the input and is used to compute each file's relative path. The method mainly calls addFileToZipOut to add files to the ZipOutputStream. The code is:
private static void addFileToZipOut(File file, ZipOutputStream out,
        String rootPath) throws IOException {
    // Entry names use '/' as the separator, whatever File.separator is on this platform
    String relativePath = file.getCanonicalPath()
            .substring(rootPath.length()).replace(File.separatorChar, '/');
    if (file.isFile()) {
        out.putNextEntry(new ZipEntry(relativePath));
        InputStream in = new BufferedInputStream(new FileInputStream(file));
        try {
            copy(in, out);
        } finally {
            in.close();
        }
    } else {
        // A name ending with '/' marks a directory entry
        out.putNextEntry(new ZipEntry(relativePath + "/"));
        for (File f : file.listFiles()) {
            addFileToZipOut(f, out, rootPath);
        }
    }
}
It also calls the copy method to write file content to the ZipOutputStream, and calls itself recursively for directories.
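The role of putNextEntry and of entry names is easier to see in a small sketch that builds an archive in memory. The class ZipWriteDemo, the entry names, and the contents are all hypothetical, and the code assumes Java 8 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class: writes one directory entry and two file entries to an in-memory zip.
final class ZipWriteDemo {

    static byte[] buildZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry("docs/"));          // trailing '/' marks a directory
            out.closeEntry();

            out.putNextEntry(new ZipEntry("docs/hello.txt")); // path relative to the archive root
            out.write("hello".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();

            out.putNextEntry(new ZipEntry("readme.txt"));     // bytes written now belong to this entry
            out.write("top-level file".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] zip = buildZip();
        System.out.println(zip.length + " bytes written");
    }
}
```

Everything written between one putNextEntry call and the next is stored as the content of that entry, which is why each file needs its own call.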
ZipInputStream is used to extract the zip file. It has a corresponding method to obtain the compression entries:
public ZipEntry getNextEntry() throws IOException
A null return value means there are no more entries. To extract a zip file with ZipInputStream, you can use code like the following:
public static void unzip(File zipFile, String destDir) throws IOException {
    ZipInputStream zin = new ZipInputStream(new BufferedInputStream(
            new FileInputStream(zipFile)));
    if (!destDir.endsWith(File.separator)) {
        destDir += File.separator;
    }
    try {
        ZipEntry entry = zin.getNextEntry();
        while (entry != null) {
            extractZipEntry(entry, zin, destDir);
            entry = zin.getNextEntry();
        }
    } finally {
        zin.close();
    }
}
Call extractZipEntry to process each compression entry. The Code is as follows:
private static void extractZipEntry(ZipEntry entry, ZipInputStream zin,
        String destDir) throws IOException {
    if (!entry.isDirectory()) {
        File parent = new File(destDir + entry.getName()).getParentFile();
        if (!parent.exists()) {
            parent.mkdirs();
        }
        OutputStream entryOut = new BufferedOutputStream(
                new FileOutputStream(destDir + entry.getName()));
        try {
            copy(zin, entryOut);
        } finally {
            entryOut.close();
        }
    } else {
        new File(destDir + entry.getName()).mkdirs();
    }
}
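One caveat when extracting archives from untrusted sources: extractZipEntry builds the target path directly from entry.getName(), so a name containing ".." could point outside destDir. A possible safeguard, sketched below, is to resolve the name first and reject anything that falls outside the target directory. The class SafeZipPaths and its methods are hypothetical additions, not part of the code above.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper: checks entry names before extraction.
final class SafeZipPaths {

    // Returns the file an entry would be written to, or throws if it falls outside destDir.
    static File resolveInside(File destDir, String entryName) throws IOException {
        File root = destDir.getCanonicalFile();
        File target = new File(root, entryName).getCanonicalFile();
        String rootPath = root.getPath();
        if (!rootPath.endsWith(File.separator)) {
            rootPath += File.separator;
        }
        if (!target.getPath().startsWith(rootPath)) {
            throw new IOException("Entry escapes the target directory: " + entryName);
        }
        return target;
    }

    // Convenience wrapper that reports the result as a boolean.
    static boolean isInside(File destDir, String entryName) {
        try {
            resolveInside(destDir, entryName);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        File dest = new File("unzipped");
        System.out.println(isInside(dest, "docs/hello.txt")); // stays under the target directory
        System.out.println(isInside(dest, "../outside.txt")); // would be written next to it instead
    }
}
```

In extractZipEntry, the file returned by resolveInside could then be used in place of destDir + entry.getName().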
Summary
This section briefly introduced how to process five common file types: property files, CSV, Excel, HTML, and compressed files, covering basic usage and links to further reference material.
This concludes our coverage of files.
Starting from the next section, let's explore the world of concurrency and threads!
(As in other sections, all the code in this section is available at https://github.com/swiftma/program-logic)