In day-to-day computer use, apart from going online, what we probably handle most is files of all kinds. Starting with this section, we discuss file processing. This section introduces some basic concepts and common knowledge about files, the basic ideas and class structure behind file handling in Java, and the plan for the following chapters.
Basic concepts and common knowledge
Binary thinking
To understand files thoroughly, we first need to think in binary. All files, whether executables, images, videos, Word documents, archives, or txt files, are nothing mysterious: they are all stored in binary form, as 0s and 1s. The images, videos, and text we see are the result of an application interpreting that binary data.
As programmers, we should have an editor that can show the binary form of a file, such as UltraEdit, which supports viewing and editing in hexadecimal. For example, take a text file whose content appears as:
hello, 123, 老马
Opened in a hexadecimal editor, the left part of the view shows the corresponding hexadecimal. "hello" corresponds to the hexadecimal "68 65 6C 6C 6F", that is, the ASCII codes "104 101 108 108 111"; "马" corresponds to the hexadecimal "E9 A9 AC", which is the UTF-8 encoding of "马".
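The mapping from characters to bytes can also be checked without a hex editor. The following is a minimal sketch, assuming a hypothetical helper class named HexDump; it prints the UTF-8 bytes of a string in the same two-digit hexadecimal form an editor would show.

```java
import java.nio.charset.StandardCharsets;

class HexDump {
    // Returns the bytes as space-separated, two-digit uppercase hex values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so that negative byte values print as 80..FF.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints 68 65 6C 6C 6F
        System.out.println(toHex("hello".getBytes(StandardCharsets.UTF_8)));
        // "\u9A6C" is the character 马; prints E9 A9 AC
        System.out.println(toHex("\u9A6C".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The character is written as a Unicode escape so the result does not depend on the encoding of the Java source file itself.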
File type
As we said in the first section, all data is stored in binary form, but high-level languages introduce data types to make data easier to process. Files are similar: every file is stored in binary form, but files have types to make them easier to understand and process.
A file type is usually indicated by a suffix (extension). For example, PDF files use the suffix .pdf, image files commonly use .jpg, and compressed files commonly use .zip. Each file type has a format, which defines the mapping between the file's meaning and its binary representation. For example, a Word file contains text, pictures, and tables, and the text may have color, font, font size, and so on; the doc file type defines how all of these map to binary. Some file formats are public and some are private, and we can also define our own proprietary formats.
For a given file type, there are often one or more applications that can interpret it for viewing and editing, and a single application can often interpret one or more file types.
In the operating system, a suffix is usually associated with an application; for example, .doc is associated with Word. When a user double-clicks a file with a given suffix, the operating system looks up the associated application, launches it, and passes it the file path, and the application then opens the file.
Giving a file the correct suffix is a convention, not a requirement. If the suffix does not match the actual file type, the application may fail when it tries to open the file. A file can also be opened with a different application of your choosing; in most operating systems you right-click the file and choose "Open with".
File types can be roughly divided into two categories: text files and binary files. Examples of text files are plain .txt files, .java source files, and .html files; examples of binary files are .zip archives, PDF files, mp3 files, and Excel files.
Basically, every byte in a text file is part of some printable character, so text files can be viewed and edited with the most basic text editor, such as Notepad on Windows or vi on Linux.
In a binary file, a byte does not necessarily represent a character: it may represent a color, a font, a sound volume, and so on. Opened in a basic text editor, a binary file usually shows a screen full of garbage, and it takes a specialized application to view and edit it.
Encoding of text files
For text files, we also need to pay attention to the encoding. A text file basically contains printable characters, but there are many ways to map characters to binary, that is, many encodings, such as GB18030 and UTF-8. We covered the various encodings in detail in the earlier section on recovering from garbled text, so we do not repeat that here.
Which encoding does a given text file use? In general, we do not know. Which encoding does an application use to interpret it? Usually some default, which may be the application's default or the operating system's default; some smarter applications can also infer the encoding automatically.
UTF-8 files deserve a special note. There is a way to mark a file as UTF-8 encoded: add three special bytes (0xEF 0xBB 0xBF) at the very beginning of the file. These three bytes are called the BOM header, where BOM stands for byte order mark. For example, for the earlier Hello.txt file, the UTF-8 version with the BOM header starts with EF BB BF in hexadecimal, followed by the same bytes as before.
Both files are UTF-8 encoded and display the same characters, but their binary content differs: one has the BOM header and one does not.
Note that not all applications support UTF-8 files with a BOM header. PHP, for example, does not support the BOM: if a PHP source file has a BOM header, PHP will fail when running it. When you run into this kind of problem, the binary thinking described earlier is especially important: do not just look at how the file is displayed, look at the binary behind it.
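Checking for the BOM is a matter of comparing the first three bytes. The sketch below assumes a hypothetical class named BomCheck and works on a byte array that has already been read from a file. It also skips the BOM when decoding, because Java's UTF-8 decoder keeps the BOM as a leading U+FEFF character rather than dropping it.

```java
import java.nio.charset.StandardCharsets;

class BomCheck {
    // The three bytes that mark a file as UTF-8 when placed at its very start.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    static boolean hasUtf8Bom(byte[] content) {
        return content.length >= 3
                && content[0] == UTF8_BOM[0]
                && content[1] == UTF8_BOM[1]
                && content[2] == UTF8_BOM[2];
    }

    // Decodes UTF-8 content, skipping the BOM header if it is present.
    static String decodeUtf8(byte[] content) {
        int offset = hasUtf8Bom(content) ? 3 : 0;
        return new String(content, offset, content.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] plain = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] withBom = new byte[plain.length + 3];
        System.arraycopy(UTF8_BOM, 0, withBom, 0, 3);
        System.arraycopy(plain, 0, withBom, 3, plain.length);

        System.out.println(hasUtf8Bom(plain));    // false
        System.out.println(hasUtf8Bom(withBom));  // true
        // Same characters, different bytes: prints true
        System.out.println(decodeUtf8(plain).equals(decodeUtf8(withBom)));
    }
}
```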
We also need to mention line breaks in text files. On Windows, a line break is generally two characters, "\r\n", that is, ASCII codes 13 ('\r') and 10 ('\n'); on Linux, a line break is generally a single character, "\n".
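Which convention a file uses can be found by looking at its bytes. The following is a small sketch, assuming a hypothetical class named LineEndings, that counts both kinds of line break in raw content; in Java, System.lineSeparator() returns the convention of the platform the program is running on.

```java
import java.nio.charset.StandardCharsets;

class LineEndings {
    // Returns {number of "\r\n" breaks, number of lone "\n" breaks}.
    static int[] count(byte[] content) {
        int crlf = 0;
        int lf = 0;
        for (int i = 0; i < content.length; i++) {
            if (content[i] == '\n') {
                if (i > 0 && content[i - 1] == '\r') {
                    crlf++;
                } else {
                    lf++;
                }
            }
        }
        return new int[] {crlf, lf};
    }

    public static void main(String[] args) {
        byte[] mixed = "a\r\nb\nc\r\n".getBytes(StandardCharsets.US_ASCII);
        int[] counts = count(mixed);
        // prints CRLF=2 LF=1
        System.out.println("CRLF=" + counts[0] + " LF=" + counts[1]);
        // Length is 2 on Windows ("\r\n") and 1 on Linux ("\n").
        System.out.println(System.lineSeparator().length());
    }
}
```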
File system
Files are generally stored on hard disks, and a machine may have several hard disks, but operating systems hide the notion of physical disks and present a unified logical structure. In Windows, there can be multiple logical drives (C, D, E, and so on), and each drive can be formatted with a different file system; common ones are FAT32 and NTFS. In Linux, there is a single logical root directory, written as a slash (/), and Linux supports many file systems, such as Ext2, Ext3, and Ext4. Different file systems organize files differently and have different structures and features, but in everyday programming the language and its class libraries give us a unified API, so we do not need to care about those details.
Logically, Windows has multiple root directories and Linux has one; under each root is a tree of subdirectories and files. Every file has a path, and a path takes one of two forms: absolute or relative.
An absolute path is the full path from the root directory to the file. In Windows, directories are separated by backslashes, as in "C:\code\hello.java"; in Linux, they are separated by slashes, as in "/users/laoma/desktop/code/hello.java". In Java, the java.io.File class defines a static variable, File.separator, that holds the path separator; use it in programs to avoid hard-coding.
A relative path is relative to the current directory. In a command-line terminal, the directory entered with the cd command is the current directory; in Java, System.getProperty("user.dir") returns the current directory of the running Java program. A relative path does not start at the root directory. For example, on Windows, if the current directory is "D:\laoma" and the relative path is "code\hello.java", the full path is "D:\laoma\code\hello.java".
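As a rough illustration of these two points, the sketch below builds a relative path with File.separator and asks java.io.File to resolve it against the current directory. The class name PathDemo and the helper names are made up for this example, and the file does not need to exist for the path to be resolved.

```java
import java.io.File;

class PathDemo {
    // Joins path components with the platform separator instead of a hard-coded "/" or "\".
    static String join(String... parts) {
        return String.join(File.separator, parts);
    }

    // Resolves a relative path against the current directory of the running program.
    static String toAbsolute(String relativePath) {
        return new File(relativePath).getAbsolutePath();
    }

    public static void main(String[] args) {
        String relative = join("code", "hello.java");
        System.out.println("separator:   " + File.separator);
        System.out.println("current dir: " + System.getProperty("user.dir"));
        System.out.println("relative:    " + relative);
        System.out.println("absolute:    " + toAbsolute(relative));
    }
}
```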
Besides its content, each file has metadata, such as the file name, creation time, modification time, and file size. Files can also be hidden: on Linux, a file whose name starts with a dot (.) is hidden; on Windows, hidden is a file attribute that can be set.
In most file systems, each file and directory also has access permissions. The owner and the user group can have different permissions, and the permissions themselves include read, write, and execute.
File names may or may not be case sensitive. Windows is generally case insensitive and Linux is generally case sensitive: in the same directory, "Abc.txt" and "ABC.txt" are treated as the same file on Windows and as different files on Linux.
Operating systems have the concept of temporary files, which live in a specific directory. On Windows 7 this is generally "C:\Users\<user name>\AppData\Local\Temp"; on Linux it is "/tmp". The operating system applies some policy to clean up unused temporary files automatically. Temporary files are usually not created by users by hand; applications generate them for temporary purposes.
File read/write
Files live on the hard disk. To process a file, a program must read it into memory, and after modifying it, write it back to disk. The operating system provides a basic API for reading and writing files. Interfaces and implementations differ between operating systems, but they share some common concepts, and Java wraps the operating system's functionality behind a unified API.
One basic fact is that hard disk access latency is very high compared with memory. To amortize the latency, the operating system and the disk generally transfer data in blocks rather than byte by byte. The block size is generally at least 512 bytes, so even if the application asks for a single byte of a file, the operating system reads at least one block. In general, minimize the number of trips to the disk and do as much as possible on each trip. The same principle applies to network requests and other kinds of input and output.
Another basic fact is that reading or writing a file generally involves copying the data twice. To read a file, for example, the data is copied from the disk into the operating system kernel and then from the kernel into memory allocated by the application. The operating system and the application run in different environments: the operating system runs in kernel mode and the application in user mode. When an application calls an operating system function, two mode switches are needed, first from user mode to kernel mode and then back from kernel mode to user mode. These switches have a cost and should be minimized.
To make file operations more efficient, applications commonly use buffers. When reading, even if only a little content is requested right now, the application predicts that more reads will follow, reads a larger chunk at once into a read buffer, and serves later reads directly from the buffer when the data is already there, which reduces trips to the operating system and the disk. When writing, data first goes into a write buffer, and when the buffer is full, it is written to disk with a single call to the operating system. Note, however, that when writing is finished you must remember to flush whatever remains in the buffer to disk. The operating system also uses buffers of its own, but the application knows its own read and write patterns better, so proper use of application-level buffers can often be more efficient.
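The effect of a write buffer can be made visible without touching a disk. The sketch below is an illustration under stated assumptions: CountingSink is a made-up stand-in for an expensive destination such as a file, and it only counts how many write calls reach it. Writing the same bytes through a BufferedOutputStream should reach the sink far fewer times than writing them directly.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

class BufferDemo {
    // Hypothetical stand-in for a slow destination; counts calls and bytes received.
    static class CountingSink extends OutputStream {
        int calls = 0;
        long bytes = 0;

        @Override
        public void write(int b) {
            calls++;
            bytes++;
        }

        @Override
        public void write(byte[] buf, int off, int len) {
            calls++;
            bytes += len;
        }
    }

    // Writes `total` bytes one at a time, optionally through a 1024-byte buffer.
    static CountingSink writeBytes(int total, boolean buffered) throws IOException {
        CountingSink sink = new CountingSink();
        try (OutputStream out = buffered ? new BufferedOutputStream(sink, 1024) : sink) {
            for (int i = 0; i < total; i++) {
                out.write(i);
            }
        } // close() flushes whatever is still sitting in the buffer
        return sink;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("unbuffered calls: " + writeBytes(10_000, false).calls);
        System.out.println("buffered calls:   " + writeBytes(10_000, true).calls);
    }
}
```

If the stream were not closed or flushed at the end, the last partial buffer would never reach the sink, which is the situation the text warns about.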
Operating systems generally have the concepts of opening and closing a file. Opening a file creates an in-memory structure for it in the kernel. This structure is generally referred to by an integer index, usually called a file descriptor. The structure consumes memory, and the number of files the operating system can have open at once is generally limited, so remember to close a file when you are done with it. Closing a file generally flushes buffered content to disk and releases the memory structure.
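In Java, one common way to make sure a file is always closed is the try-with-resources statement. The following is a minimal sketch with a hypothetical class named OpenCloseDemo; readAllBytes requires Java 9 or later.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class OpenCloseDemo {
    static void write(File file, byte[] data) throws IOException {
        // The stream is closed when the block exits, even if write() throws,
        // so the underlying file descriptor is always released.
        try (OutputStream out = new FileOutputStream(file)) {
            out.write(data);
        }
    }

    static byte[] read(File file) throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            return in.readAllBytes(); // Java 9+
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("openclose", ".bin");
        file.deleteOnExit();
        write(file, new byte[] {1, 2, 3});
        System.out.println(read(file).length); // prints 3
    }
}
```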
Operating systems generally support an efficient way to read and write large files randomly, called memory-mapped files. The file is mapped directly into memory, so operating on that memory is operating on the file. With a memory-mapped file, only the data that is actually accessed is copied into memory, and it is copied only once and shared between the operating system and multiple applications. Later chapters cover this further.
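As a small preview, Java exposes memory mapping through FileChannel.map. The sketch below assumes a hypothetical class named MappedFileDemo and changes a single byte of a file by writing to the mapped memory instead of going through a stream.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedFileDemo {
    // Overwrites the byte at `position` by writing to mapped memory.
    static void putByte(Path file, int position, byte value) throws IOException {
        try (FileChannel channel = FileChannel.open(
                file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer map =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, channel.size());
            map.put(position, value); // operating on memory is operating on the file
            map.force();              // ask the OS to write the change back to storage
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mapped", ".txt");
        file.toFile().deleteOnExit();
        Files.write(file, "hello".getBytes(StandardCharsets.US_ASCII));
        putByte(file, 0, (byte) 'j');
        // prints jello
        System.out.println(new String(Files.readAllBytes(file), StandardCharsets.US_ASCII));
    }
}
```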
Java File overview
Streams
In Java (and many other languages), files are generally not handled as a special case but are treated as one kind of input/output (IO) device. Java uses one basic, unified concept to handle all IO, including the keyboard, the display terminal, the network, and so on.
This unified concept is the stream. There are input streams and output streams. An input stream is something you can read data from; the actual source may be the keyboard, a file, the network, and so on. An output stream is something you can write data to; the actual destination may be the display terminal, a file, the network, and so on.
The basic Java IO classes are mostly in the package java.io. The class InputStream represents an input stream and OutputStream an output stream; FileInputStream represents a file input stream and FileOutputStream a file output stream.
With the stream concept in place, a lot of code can be written against streams, for example to encrypt, compress, compute a message digest, or compute a checksum over a stream. Such code accepts abstract streams as parameters and returns them as results, and together these pieces form a cooperating system, much like the interface concept, interface-oriented programming, and the container class system introduced earlier. Some data sources and destinations that are not really IO are also wrapped as streams so they can take part in this cooperation; byte arrays, for example, are wrapped as ByteArrayInputStream and ByteArrayOutputStream.
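To illustrate stream-oriented code, here is a minimal sketch of a checksum method, in a hypothetical class named StreamChecksum, that accepts any InputStream. It does not care whether the bytes come from a file, the network, or, as in this example, a byte array wrapped in a ByteArrayInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class StreamChecksum {
    // Computes a CRC-32 checksum over everything the stream delivers.
    static long crc32(InputStream in) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            crc.update(buffer, 0, n);
        }
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        // The byte array is not real IO, but wrapping it lets it reuse stream code.
        long value = crc32(new ByteArrayInputStream(data));
        System.out.println(Long.toHexString(value)); // prints cbf43926
    }
}
```

Passing a FileInputStream instead would compute the checksum of a file with no change to the method.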
Decorator design pattern
The basic streams read and write byte by byte and have no buffering, which makes them inconvenient to use. Java's solution is the decorator design pattern: it introduces many decorator classes that add functionality to a basic stream. Each decorator generally focuses on one aspect, and in practice several decorators are often combined.
Java has many decorator classes. There are two base classes, the filter input stream FilterInputStream and the filter output stream FilterOutputStream. A filter is like a section of water pipe: water flows in and water flows out, and the pipe either leaves the function unchanged or adds something to it. These base classes have many subclasses, for example:
- The subclasses that add buffering to a stream are BufferedInputStream and BufferedOutputStream.
- The subclasses that read and write the eight primitive types and strings are DataInputStream and DataOutputStream.
- The subclasses that compress and decompress a stream are GZIPInputStream, ZipInputStream, GZIPOutputStream, and ZipOutputStream.
- The subclass that outputs primitive types and objects as their string representations is PrintStream.
Having so many decorator classes makes the overall class structure more complex, and even basic operations take more code, but the benefit is great flexibility, and some problems can be solved very elegantly this way.
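The way decorators stack can be sketched as follows. This is only an illustration with a hypothetical class named DecoratorDemo: a DataOutputStream adds typed writes, a BufferedOutputStream adds buffering, and a GZIPOutputStream adds compression, all layered over an in-memory byte array stream. Reading applies the matching input decorators in the same order.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class DecoratorDemo {
    // Writes an int and a string as typed, buffered, gzip-compressed data.
    static byte[] write(int number, String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new GZIPOutputStream(bytes)))) {
            out.writeInt(number);
            out.writeUTF(text);
        } // closing the outermost stream flushes and closes the whole chain
        return bytes.toByteArray();
    }

    // Reads the values back in the order they were written.
    static String read(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new GZIPInputStream(new ByteArrayInputStream(data))))) {
            return in.readInt() + ":" + in.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(read(write(42, "hello"))); // prints 42:hello
    }
}
```

Each class in the chain handles one concern, which is the flexibility the decorator pattern is meant to provide, at the cost of the nested constructors.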
Reader/Writer
Streams based on InputStream/OutputStream handle data in binary form. They have no notion of encoding and are inconvenient for text files. The base classes for conveniently handling text data character by character are Reader and Writer, which also have many subclasses:
- The subclasses that read and write files are FileReader and FileWriter.
- The subclasses that add buffering are BufferedReader and BufferedWriter.
- The subclasses that wrap a character array as a Reader/Writer are CharArrayReader and CharArrayWriter.
- The subclasses that wrap a string as a Reader/Writer are StringReader and StringWriter.
- The subclasses that convert an InputStream/OutputStream to a Reader/Writer are InputStreamReader and OutputStreamWriter.
- The subclass that outputs primitive types and objects as their string representations is PrintWriter.
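The role of the encoding becomes clear when the same bytes are decoded with two different charsets. The sketch below uses a hypothetical class named ReaderDemo and in-memory streams; an OutputStreamWriter encodes characters to bytes, and an InputStreamReader wrapped in a BufferedReader decodes them again line by line.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReaderDemo {
    // Encodes text into bytes using the given charset.
    static byte[] encode(String text, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        }
        return bytes.toByteArray();
    }

    // Decodes bytes with the given charset and returns the first line.
    static String firstLine(byte[] data, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // "\u8001\u9A6C" is the two-character string 老马.
        byte[] data = encode("\u8001\u9A6C\nsecond line", StandardCharsets.UTF_8);
        // Decoding with the right charset gives back 2 characters.
        System.out.println(firstLine(data, StandardCharsets.UTF_8).length());      // prints 2
        // Decoding with the wrong charset turns each of the 6 bytes into a character.
        System.out.println(firstLine(data, StandardCharsets.ISO_8859_1).length()); // prints 6
    }
}
```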
Random-access file reads and writes
In most cases, a stream or a Reader/Writer is used to read and write file content, but Java also provides a separate class, RandomAccessFile, for reading and writing a file at arbitrary positions. It suits files with a specific, known structure. We use it less in everyday application development, but it is more common in some system-level programs.
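A rough sketch of what random access looks like follows. It assumes a made-up file format in which every record is a single 4-byte int, so record i starts at byte offset i * 4; the class name RecordFile and this format are illustrative only.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class RecordFile {
    // Hypothetical fixed-size record format: one 4-byte int per record.
    static final int RECORD_SIZE = Integer.BYTES;

    static void writeRecord(File file, int index, int value) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.seek((long) index * RECORD_SIZE); // jump straight to the record
            raf.writeInt(value);
        }
    }

    static int readRecord(File file, int index) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek((long) index * RECORD_SIZE);
            return raf.readInt();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("records", ".dat");
        file.deleteOnExit();
        // Records can be written in any order.
        writeRecord(file, 2, 30);
        writeRecord(file, 0, 10);
        writeRecord(file, 1, 20);
        System.out.println(readRecord(file, 1)); // prints 20
    }
}
```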
File
Everything above operates on the data itself. For file paths, file metadata, directories, temporary files, and access permission management, Java uses the File class.
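A few of these capabilities are sketched below with a hypothetical class named FileInfoDemo. It creates a temporary file in the operating system's temporary directory and prints some of its metadata and permissions using methods of java.io.File.

```java
import java.io.File;
import java.io.IOException;

class FileInfoDemo {
    // Collects some metadata and permission flags for a file or directory.
    static String describe(File file) {
        return "name=" + file.getName()
                + " size=" + file.length()
                + " directory=" + file.isDirectory()
                + " hidden=" + file.isHidden()
                + " readable=" + file.canRead()
                + " writable=" + file.canWrite()
                + " lastModified=" + file.lastModified();
    }

    public static void main(String[] args) throws IOException {
        // Created in the directory named by the java.io.tmpdir system property.
        File temp = File.createTempFile("demo", ".txt");
        temp.deleteOnExit();
        System.out.println(temp.getAbsolutePath());
        System.out.println(describe(temp));
        System.out.println(describe(temp.getParentFile()));
    }
}
```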
Java NIO
The classes described above are basically all in the package java.io. Java also has a package for IO operations named java.nio, where NIO stands for New IO, and it too contains a large number of classes.
NIO represents a different way of looking at IO. It has the concepts of buffers and channels. Buffers and channels can be used to achieve much the same things as streams, but they are closer to operating system concepts, and some operations perform better with them. For example, when copying a file to the network, a channel can take advantage of the DMA (direct memory access) mechanism provided by the operating system and hardware to copy data directly from the hard disk to the network card, without involving the CPU or the application.
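The channel API that allows this kind of direct transfer is FileChannel.transferTo. The sketch below, using a hypothetical class named ChannelCopy, copies between two files rather than to a network socket so that it stays self-contained; whether the operating system actually avoids copying through the application depends on the platform.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ChannelCopy {
    // Copies source to target by handing the transfer to the channel.
    static long copy(Path source, Path target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            long size = in.size();
            long position = 0;
            // transferTo may move fewer bytes than requested, so loop until done.
            while (position < size) {
                position += in.transferTo(position, size - position, out);
            }
            return position;
        }
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("source", ".txt");
        Path target = Files.createTempFile("target", ".txt");
        source.toFile().deleteOnExit();
        target.toFile().deleteOnExit();
        Files.write(source, "hello channel".getBytes(StandardCharsets.US_ASCII));
        System.out.println(copy(source, target)); // prints 13
    }
}
```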
Besides offering a different perspective, NIO also supports some lower-level features, such as memory-mapped files, file locking, custom file systems, non-blocking IO, and asynchronous IO.
However, these features are either fairly low-level and rarely used by ordinary applications, or mainly intended for network IO, so we will not cover most of them; we will introduce only memory-mapped files.
Serialization and deserialization
Simply put, serialization saves a Java object in memory to a stream, and deserialization restores a Java object in memory from a stream. Serialization and deserialization have two main uses: persisting object state, and passing and returning objects in remote calls over the network.
Java supports serialization mainly through the interface Serializable and the classes ObjectInputStream/ObjectOutputStream. Basic use is fairly simple, but there are some complicated corners.
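The basic use can be sketched as a round trip through a byte array. The Student class below is a made-up example type, and SerializationDemo is a hypothetical class name; the only requirement on Student is that it implements Serializable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializationDemo {
    // Hypothetical example type; Serializable is a marker interface with no methods.
    static class Student implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final int age;

        Student(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // Serialization: object in memory -> bytes in a stream.
    static byte[] serialize(Object object) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);
        }
        return bytes.toByteArray();
    }

    // Deserialization: bytes in a stream -> object in memory.
    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = serialize(new Student("laoma", 18));
        Student copy = (Student) deserialize(data);
        System.out.println(copy.name + " " + copy.age); // prints laoma 18
        System.out.println(data.length + " bytes for one small object");
    }
}
```

Printing the byte count gives a feel for how much space the default format takes for even a very small object.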
However, Java's default serialization has some drawbacks: the serialized form is large and wastes space, serialization and deserialization are relatively slow, and, more importantly, it is a Java-specific technology that cannot interoperate with other languages.
XML was the most popular language and format for describing structured data in earlier years. Java objects can also be serialized to XML, which is easy to read and edit and easy to exchange with other languages.
XML is rigorously structured but verbose. JSON is a lightweight data interchange format that has become popular in recent years and has replaced XML in many scenarios. It is also easy to read and edit, and Java objects can be serialized to JSON and exchanged with other languages.
XML and JSON are both text formats: easy to read, but relatively large. When the data is used only for remote calls over the network, there are many popular, cross-language, compact, and efficient object serialization mechanisms, such as ProtoBuf, Thrift, and MessagePack. MessagePack is like a binary form of JSON that is smaller and faster.
Chapter Arrangement
Files look like a very simple topic, but in reality they are not. Java's design here is not perfect either: it contains a great many classes, which makes file handling harder to understand.
To make things easier to follow, the next chapters take the following approach.
First, we describe how to handle binary files, or rather, how to treat every file as binary, and we wrap common operations in some easy-to-use methods.
Next, we describe how to handle text files, which involves encodings, line-by-line processing, and so on; again we wrap common operations in easy-to-use methods.
Then we introduce the File class, which handles files themselves and directory operations, and we wrap its common operations as well.
We also introduce the lower-level file operation class RandomAccessFile and memory-mapped files, and describe how they are used and where they apply.
Real-world file processing often targets specific file types, so we introduce how to handle some common ones, such as CSV files, Excel files, images, HTML files, and compressed files.
Finally, for serialization, in addition to Java's default serialization mechanism, we also introduce XML, JSON, and MessagePack.
Summary
This section introduced some basic concepts and common knowledge about files and the basic ideas and class structure behind file handling in Java, and finally outlined how the following chapters are arranged.
Files may look simple, but there is actually a lot to cover. Let us be patient; in the next section, we start with binary files.
----------------
To be continued. For the latest articles, follow the public account "Lao Ma Programming", where Lao Ma explores the nature of Java programming and computer technology with you, from introductory to advanced topics and in plain language. Original work, all rights reserved.
Thinking Logic of computer programs (56)-File overview