I have recently been writing a personal project named SmallVM. SmallVM is designed to deepen my knowledge and understanding of the Java virtual machine by implementing a lightweight one. When the Java virtual machine loads a class, it needs to parse the Class file. I previously implemented a Java version of a Class file parser, ClassAnalyzer; compared with that Java version, the new (Golang) version is more robust and more clearly structured. This article describes my approach to implementing a Class file parser.
Class file
As the carrier of class or interface information, each Class file completely defines one class or interface. So that Java programs can be "written once, run anywhere", the Java Virtual Machine specification imposes strict rules on the Class file. The basic unit of data in a Class file is the byte, and there are no delimiters between bytes, so almost everything stored in a Class file is data the program needs in order to run. Data that cannot be represented by a single byte is represented by multiple contiguous bytes.
According to the Java Virtual Machine specification, the Class file stores data in a pseudo-structure similar to a C struct, with only two data types: unsigned numbers and tables. The specification uses u1, u2, u4, and u8 to represent unsigned numbers of 1, 2, 4, and 8 bytes respectively. An unsigned number can describe a number, an index reference, a quantity value, or a string. A table is a composite data type made up of multiple unsigned numbers or other tables as data items. Tables describe data with a hierarchical composite structure, so the entire Class file is essentially one table. In ClassAnalyzer, u1, u2, u4, and u8 correspond to byte, short, int, and long, and the Class file is described as the following Java class.
public class ClassFile {
    public U4 magic;                        // magic
    public U2 minorVersion;                 // minor_version
    public U2 majorVersion;                 // major_version
    public U2 constantPoolCount;            // constant_pool_count
    public ConstantPoolInfo[] cpInfo;       // cp_info
    public U2 accessFlags;                  // access_flags
    public U2 thisClass;                    // this_class
    public U2 superClass;                   // super_class
    public U2 interfacesCount;              // interfaces_count
    public U2[] interfaces;                 // interfaces
    public U2 fieldsCount;                  // fields_count
    public FieldInfo[] fields;              // fields
    public U2 methodsCount;                 // methods_count
    public MethodInfo[] methods;            // methods
    public U2 attributesCount;              // attributes_count
    public BasicAttributeInfo[] attributes; // attributes
}
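As a rough illustration of how unsigned numbers can be read from a delimiter-free byte stream, here is a minimal Go sketch. The `classReader` type and its method names are hypothetical and are not taken from SmallVM. The only format fact it relies on is that the specification stores multi-byte items in big-endian order.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// classReader is a hypothetical helper (not SmallVM's actual type).
// It walks the bytes of a Class file from front to back.
type classReader struct {
	data []byte
	pos  int
}

var errShort = errors.New("unexpected end of class data")

// take returns the next n bytes and advances the cursor.
func (r *classReader) take(n int) ([]byte, error) {
	if n < 0 || len(r.data)-r.pos < n {
		return nil, errShort
	}
	b := r.data[r.pos : r.pos+n]
	r.pos += n
	return b, nil
}

// u1 reads a 1-byte unsigned number.
func (r *classReader) u1() (uint8, error) {
	b, err := r.take(1)
	if err != nil {
		return 0, err
	}
	return b[0], nil
}

// u2 reads a 2-byte big-endian unsigned number.
func (r *classReader) u2() (uint16, error) {
	b, err := r.take(2)
	if err != nil {
		return 0, err
	}
	return binary.BigEndian.Uint16(b), nil
}

// u4 reads a 4-byte big-endian unsigned number.
func (r *classReader) u4() (uint32, error) {
	b, err := r.take(4)
	if err != nil {
		return 0, err
	}
	return binary.BigEndian.Uint32(b), nil
}

func main() {
	// First 8 bytes of a Class file: magic, minor_version, major_version.
	r := &classReader{data: []byte{0xCA, 0xFE, 0xBA, 0xBE, 0x00, 0x00, 0x00, 0x34}}
	magic, _ := r.u4()
	minor, _ := r.u2()
	major, _ := r.u2()
	fmt.Printf("magic=%#x minor=%d major=%d\n", magic, minor, major)
}
```

Because the format has no delimiters, the reader only ever moves forward: every item is located purely by how many bytes the items before it consumed.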
How to Parse
Data items such as the magic number, the Class file version, the access flags, the class index, and the parent class index occupy a fixed number of bytes in every Class file, so parsing them only requires reading the corresponding number of bytes. Beyond these, four main parts need to be handled flexibly: the constant pool, the field table collection, the method table collection, and the attribute table collection. Fields and methods can have their own attributes, and the Class file itself also has attributes, so parsing the field table collection and the method table collection also involves parsing attribute tables.
The constant pool takes up a large portion of a Class file's data and stores all constant information, including numeric and string constants, class names, interface names, field names, and method names. The Java Virtual Machine specification defines a number of constant types, each with its own structure. The constant pool itself is a table, and there are a few things to be aware of when parsing it.
Each constant type is identified by a u1 tag.
The constant pool size given in the header, constantPoolCount, is 1 larger than the actual number of entries. For example, if constantPoolCount equals 47, the constant pool holds 46 entries.
The index range of the constant pool starts from 1. For example, if constantPoolCount equals 47, the valid indexes are 1~46. The designers left item 0 vacant so that it can express "no reference to any constant pool item".
A CONSTANT_Long_info or CONSTANT_Double_info structure takes up two entries: if its index in the constant pool is n, the next valid item is at index n+2. The item at index n+1 is valid but must be considered unusable.
The structure of a CONSTANT_Utf8_info constant consists of a u1 tag, a u2 length, and length bytes of type u1. These length contiguous bytes are a string encoded in MUTF-8 (Modified UTF-8). MUTF-8 is not compatible with UTF-8, and there are two main differences: first, the null character is encoded as 2 bytes (0xC0 and 0x80); second, supplementary characters are split into UTF-16 surrogate pairs and each surrogate is encoded separately. For details, see the description of Modified UTF-8.
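The indexing rules above can be sketched as a loop. This is a simplified illustration under stated assumptions, not SmallVM's implementation: the `entry` type, the `fixedSize` table, and `parsePool` are hypothetical names, each constant is kept as raw bytes instead of being decoded, and only the tags listed in the table are recognized.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// entry is a hypothetical, simplified constant: its tag plus raw payload.
type entry struct {
	tag  uint8
	info []byte
}

// fixedSize maps a tag to its payload size in bytes (tag byte excluded).
// CONSTANT_Utf8 (tag 1) is variable-length and handled separately.
var fixedSize = map[uint8]int{
	3: 4, 4: 4, // Integer, Float
	5: 8, 6: 8, // Long, Double
	7: 2, 8: 2, // Class, String
	9: 4, 10: 4, 11: 4, 12: 4, // Fieldref, Methodref, InterfaceMethodref, NameAndType
	15: 3, 16: 2, 18: 4, // MethodHandle, MethodType, InvokeDynamic
}

// parsePool reads the constant pool that follows constant_pool_count.
// It returns the entries (slot 0 unused) and the number of bytes consumed.
func parsePool(data []byte, count uint16) ([]entry, int, error) {
	pool := make([]entry, count) // slots 0..count-1; slot 0 stays empty
	pos := 0
	for i := 1; i < int(count); i++ {
		if pos >= len(data) {
			return nil, 0, errors.New("truncated constant pool")
		}
		tag := data[pos]
		pos++
		size, ok := fixedSize[tag]
		if tag == 1 { // CONSTANT_Utf8: u2 length followed by length bytes
			if len(data)-pos < 2 {
				return nil, 0, errors.New("truncated Utf8 length")
			}
			size, ok = 2+int(binary.BigEndian.Uint16(data[pos:])), true
		}
		if !ok {
			return nil, 0, fmt.Errorf("unknown constant tag %d at index %d", tag, i)
		}
		if len(data)-pos < size {
			return nil, 0, errors.New("truncated constant")
		}
		pool[i] = entry{tag: tag, info: data[pos : pos+size]}
		pos += size
		if tag == 5 || tag == 6 {
			i++ // Long and Double occupy two slots; index n+1 is unusable
		}
	}
	return pool, pos, nil
}

func main() {
	// count = 5: a Utf8 "Hi" at #1, a Long at #2 (and #3), an Integer at #4.
	data := []byte{
		1, 0, 2, 'H', 'i',
		5, 0, 0, 0, 0, 0, 0, 0, 42,
		3, 0, 0, 0, 7,
	}
	pool, n, err := parsePool(data, 5)
	fmt.Println(len(pool), n, err)
	for i, e := range pool {
		fmt.Printf("#%d tag=%d\n", i, e.tag)
	}
}
```

Allocating `count` slots and leaving slot 0 empty keeps the slice index equal to the constant pool index, so later lookups need no offset arithmetic.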
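To make the two MUTF-8 differences concrete, here is a minimal decoding sketch. `decodeMUTF8` is a hypothetical helper, not a SmallVM function. It decodes the bytes into UTF-16 code units first and then lets the standard `unicode/utf16` package combine surrogate pairs.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf16"
)

var errBadMUTF8 = errors.New("malformed modified UTF-8")

// decodeMUTF8 is a hypothetical helper that turns the bytes of a
// CONSTANT_Utf8_info into a Go string.
func decodeMUTF8(b []byte) (string, error) {
	units := make([]uint16, 0, len(b)) // UTF-16 code units
	for i := 0; i < len(b); {
		c := b[i]
		switch {
		case c != 0 && c < 0x80: // 1 byte: U+0001..U+007F
			units = append(units, uint16(c))
			i++
		case c&0xE0 == 0xC0: // 2 bytes: U+0000 and U+0080..U+07FF
			if i+1 >= len(b) || b[i+1]&0xC0 != 0x80 {
				return "", errBadMUTF8
			}
			units = append(units, uint16(c&0x1F)<<6|uint16(b[i+1]&0x3F))
			i += 2
		case c&0xF0 == 0xE0: // 3 bytes: U+0800..U+FFFF, including surrogates
			if i+2 >= len(b) || b[i+1]&0xC0 != 0x80 || b[i+2]&0xC0 != 0x80 {
				return "", errBadMUTF8
			}
			units = append(units, uint16(c&0x0F)<<12|uint16(b[i+1]&0x3F)<<6|uint16(b[i+2]&0x3F))
			i += 3
		default: // a raw 0x00 byte or a 4-byte UTF-8 lead byte is not allowed
			return "", errBadMUTF8
		}
	}
	// utf16.Decode pairs up surrogates into supplementary characters.
	return string(utf16.Decode(units)), nil
}

func main() {
	s, err := decodeMUTF8([]byte{'H', 'i', 0xC0, 0x80})
	fmt.Printf("%q %v\n", s, err)
}
```

A standard UTF-8 decoder would reject both the 0xC0 0x80 form of the null character and the 3-byte encoded surrogates, which is why the bytes cannot simply be converted with `string(b)`.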
Attribute tables describe information that is specific to certain scenarios; the Class file, the field table, and the method table each have their own attribute table collection. The Java Virtual Machine specification defines many attributes, and SmallVM currently implements parsing for the common ones. Unlike constant-type data items, attributes have no tag to identify their type. Instead, each attribute contains a u2 attribute_name_index, which points to a CONSTANT_Utf8_info constant in the constant pool that holds the attribute's name. When parsing an attribute, SmallVM determines its type from the attribute name stored in the constant that attribute_name_index points to.
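The name-based lookup could look roughly like the sketch below. It is an assumption-laden illustration rather than SmallVM's code: `rawAttribute`, `readAttribute`, and `describe` are hypothetical names, and the `utf8` map stands in for a lookup table that would be built while parsing the constant pool. It relies on the attribute_info layout from the specification: a u2 attribute_name_index, a u4 attribute_length, then attribute_length bytes.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// rawAttribute is a hypothetical generic form of attribute_info.
type rawAttribute struct {
	name string
	info []byte
}

// readAttribute parses one attribute_info from data. utf8 maps a constant
// pool index to the string of its CONSTANT_Utf8_info (an assumed lookup
// table). It returns the attribute and the number of bytes consumed.
func readAttribute(data []byte, utf8 map[uint16]string) (rawAttribute, int, error) {
	if len(data) < 6 {
		return rawAttribute{}, 0, errors.New("truncated attribute header")
	}
	nameIndex := binary.BigEndian.Uint16(data[0:2])
	length := binary.BigEndian.Uint32(data[2:6])
	name, ok := utf8[nameIndex]
	if !ok {
		return rawAttribute{}, 0, fmt.Errorf("attribute_name_index %d is not a Utf8 constant", nameIndex)
	}
	if uint64(len(data)-6) < uint64(length) {
		return rawAttribute{}, 0, errors.New("truncated attribute body")
	}
	end := 6 + int(length)
	return rawAttribute{name: name, info: data[6:end]}, end, nil
}

// describe dispatches on the attribute name, since attributes have no tag.
func describe(a rawAttribute) string {
	switch a.name {
	case "Code":
		return "Code attribute (bytecode of a method)"
	case "ConstantValue":
		return "ConstantValue attribute (constant field initializer)"
	case "SourceFile":
		return "SourceFile attribute"
	default:
		// attribute_length lets a parser skip attributes it does not know.
		return fmt.Sprintf("unknown attribute %q, %d bytes skipped", a.name, len(a.info))
	}
}

func main() {
	utf8 := map[uint16]string{7: "SourceFile"}
	a, n, err := readAttribute([]byte{0, 7, 0, 0, 0, 2, 0, 8}, utf8)
	fmt.Println(describe(a), n, err)
}
```

Because every attribute carries its own length, a parser that implements only the common attributes can still step over the rest without losing its position in the file.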
A field table describes the variables declared in a class or interface, including class-level variables and instance-level variables. The structure of a field table contains a u2 access_flags, a u2 name_index, a u2 descriptor_index, a u2 attributes_count, and attributes_count items of type attribute_info in attributes. Attribute table parsing was introduced above, and attributes is parsed in the same way.
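A field table entry can be read with a few fixed-size reads followed by a loop over its attributes. The following is a minimal sketch with hypothetical names (`fieldInfo`, `parseField`). To stay short it keeps each attribute_info as raw bytes and assumes the specification's attribute header of a u2 name index followed by a u4 length.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// fieldInfo is a hypothetical in-memory form of one field table entry.
type fieldInfo struct {
	accessFlags     uint16
	nameIndex       uint16
	descriptorIndex uint16
	attributes      [][]byte // raw attribute_info structures, left undecoded
}

// parseField reads one field table entry and returns it together with
// the number of bytes consumed.
func parseField(data []byte) (fieldInfo, int, error) {
	if len(data) < 8 {
		return fieldInfo{}, 0, errors.New("truncated field_info")
	}
	f := fieldInfo{
		accessFlags:     binary.BigEndian.Uint16(data[0:]),
		nameIndex:       binary.BigEndian.Uint16(data[2:]),
		descriptorIndex: binary.BigEndian.Uint16(data[4:]),
	}
	count := int(binary.BigEndian.Uint16(data[6:]))
	pos := 8
	for i := 0; i < count; i++ {
		// Each attribute_info: u2 name index, u4 length, then length bytes.
		if len(data)-pos < 6 {
			return fieldInfo{}, 0, errors.New("truncated attribute header")
		}
		length := uint64(binary.BigEndian.Uint32(data[pos+2:]))
		if uint64(len(data)-pos-6) < length {
			return fieldInfo{}, 0, errors.New("truncated attribute body")
		}
		end := pos + 6 + int(length)
		f.attributes = append(f.attributes, data[pos:end])
		pos = end
	}
	return f, pos, nil
}

func main() {
	// access_flags 0x001A (private static final), name #3, descriptor #4,
	// one attribute: name #5, length 2, body 0x0006.
	data := []byte{0x00, 0x1A, 0x00, 0x03, 0x00, 0x04, 0x00, 0x01,
		0x00, 0x05, 0x00, 0x00, 0x00, 0x02, 0x00, 0x06}
	f, n, err := parseField(data)
	fmt.Printf("%+v consumed=%d err=%v\n", f, n, err)
}
```

Returning the number of bytes consumed lets the caller parse the next entry of the collection from `data[n:]`.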
The Class file's method table uses the same storage format as the field table, but the meanings of its access_flags are different. The method table contains one important attribute: the Code attribute. The Code attribute stores the bytecode instructions compiled from Java code. The structure corresponding to the Code attribute is shown below (only the fields are listed).
type Code struct {
    pool                 []constantpool.ConstantInfo
    attributeNameIndex   uint16
    attributeLength      uint32
    maxStack             uint16
    maxLocals            uint16
    codeLength           uint32
    code                 []byte
    exceptionTableLength uint16
    exceptionTable       []ExceptionInfo
    attributesCount      uint16
    attributes           []AttributeInfo
}

type ExceptionInfo struct {
    startPc   uint16
    endPc     uint16
    handlerPc uint16
    catchType uint16
}
In the Code attribute, codeLength and code store the bytecode length and the bytecode instructions respectively, and each instruction opcode is one byte (u1 type). When the virtual machine executes, it reads the bytes in code and translates them into the corresponding instructions. In addition, although codeLength is a u4 value, a method's bytecode is in fact not allowed to exceed 65535 bytes.
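As an illustration of the layout just described, the sketch below decodes the body of a Code attribute, that is, the bytes that follow attribute_length. The names `codeBody`, `exceptionInfo`, and `parseCodeBody` are hypothetical and the nested attributes are not decoded; the range check reflects the specification's requirement that code_length be greater than 0 and less than 65536.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// exceptionInfo mirrors one 8-byte exception table entry.
type exceptionInfo struct {
	startPc, endPc, handlerPc, catchType uint16
}

// codeBody is a hypothetical, reduced form of the Code attribute.
type codeBody struct {
	maxStack, maxLocals uint16
	code                []byte
	exceptionTable      []exceptionInfo
}

// parseCodeBody decodes the bytes that follow attribute_length in a Code
// attribute. Nested attributes after the exception table are not decoded.
func parseCodeBody(info []byte) (codeBody, error) {
	var c codeBody
	if len(info) < 8 {
		return c, errors.New("truncated Code attribute")
	}
	c.maxStack = binary.BigEndian.Uint16(info[0:])
	c.maxLocals = binary.BigEndian.Uint16(info[2:])
	codeLength := binary.BigEndian.Uint32(info[4:])
	// The specification requires 0 < code_length < 65536.
	if codeLength == 0 || codeLength >= 65536 {
		return c, fmt.Errorf("invalid code_length %d", codeLength)
	}
	pos := 8
	// The code array must be followed by a u2 exception_table_length.
	if len(info)-pos < int(codeLength)+2 {
		return c, errors.New("truncated code array")
	}
	c.code = info[pos : pos+int(codeLength)]
	pos += int(codeLength)
	n := int(binary.BigEndian.Uint16(info[pos:]))
	pos += 2
	if len(info)-pos < n*8 {
		return c, errors.New("truncated exception table")
	}
	for i := 0; i < n; i++ {
		e := info[pos+i*8:]
		c.exceptionTable = append(c.exceptionTable, exceptionInfo{
			startPc:   binary.BigEndian.Uint16(e[0:]),
			endPc:     binary.BigEndian.Uint16(e[2:]),
			handlerPc: binary.BigEndian.Uint16(e[4:]),
			catchType: binary.BigEndian.Uint16(e[6:]),
		})
	}
	return c, nil
}

func main() {
	// max_stack=1, max_locals=1, code_length=5,
	// code: aload_0 (0x2A), invokespecial #1 (0xB7 0x00 0x01), return (0xB1),
	// then an empty exception table and zero nested attributes.
	info := []byte{0, 1, 0, 1, 0, 0, 0, 5, 0x2A, 0xB7, 0, 1, 0xB1, 0, 0, 0, 0}
	c, err := parseCodeBody(info)
	fmt.Printf("%+v err=%v\n", c, err)
}
```

The example bytes also show that an instruction can be longer than its one-byte opcode: invokespecial is followed by a two-byte constant pool index as its operand.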
Code implementation
The source code of the whole Class file parser is on GitHub. The parser is just one small module of SmallVM, in the directory src/classfile. You can also refer to ClassAnalyzer's README, where I take the Class file of one class as an example and analyze every byte of it. I hope it helps everyone's understanding.