A while ago I came across go-simplejson, a Golang JSON library for wrapping and parsing anonymous JSON, which in plain terms means parsing JSON into a map or slice. It looked fun, and I happened to have a project that needed to parse JSON, so I tried it. I then glanced at the source code and found that it uses Golang's standard encoding/json library to do the parsing and only wraps that library in a nicer-looking layer. So I wondered whether I could write a parser by hand. After all, in my years of writing code I have called JSON.parse and JSON.stringify countless times. After two days of tinkering it finally worked. A quick test showed that it performs considerably better than the standard library, roughly 1.6 to 7 times faster (depending on the size and structure of the JSON string), so I decided to write this article to share the ideas.
First, an aside. I have been writing code for nearly three years, and in past interviews more than one interviewer asked me: how do you deep copy an object (in JS)? I would laugh and say that you just write a recursive walk function, and if you are worried about stack overflow, you use a stack plus iteration. Then they would always ask whether there is a better way, and then answer themselves: you can serialize the object to JSON and deserialize it again ...
The project is named cheapjson. The name means it is cheap: you do not need to define a struct for every payload, and it is faster than the standard library. The address is https://github.com/acrazing/c ..., have a look if you are interested.
JSON value
Since it is meant to be cheap, it must not depend on reflection, so something like void * is required, which in Golang is of course interface{}. A structure is then needed to hold the required information and to perform type checks and bounds checks. In C, tracking the array size, the string length, and the object key/value mapping would all be necessary work. In Golang none of that is needed, because the language already does this work for you.
A complete JSON document contains exactly one value. The value may be null, true, false, a number, a string, an array, or an object: seven kinds in total, or six types if true and false are counted together as boolean. An array or object may in turn contain child values. Mapped to Golang, these become nil, bool, bool, int64/float64, string, []interface{}, and map[string]interface{}, which could be handled with a union-like structure.

Note that a number can be converted to either an integer or a floating-point number. In JavaScript all numbers are stored as 64-bit double-precision floating-point values, so the largest exactly representable integer is limited by the mantissa to 2^53-1. That is already much larger than int32, so integers are mapped to int64 instead of int, because int can overflow on some machines. Strictly distinguishing integers from floating-point numbers in IEEE-754 format is not easy, so the rule is simplified: if the number has neither a fractional part nor an exponent part, it is treated as an integer. To simplify things further, any string that is not legal UTF-16 is treated as a malformed document, and parsing terminates.

For convenience, define a structure to hold a JSON value:
type Value struct {
	value interface{}
}
The value field in the structure holds the actual value of the JSON Value, and its type is determined with a type assertion. This leads to many predicate, assignment, and accessor functions. For a string, for example, IsString() tests whether a Value is a string, AsString() assigns a string to it, and String() returns the real value:
// IsString reports whether the Value holds a string.
func (v *Value) IsString() bool {
	_, ok := v.value.(string)
	return ok
}

// AsString sets the Value to the given string.
func (v *Value) AsString(value string) {
	v.value = value
}

// String returns the string held by a string Value.
func (v *Value) String() string {
	if value, ok := v.value.(string); ok {
		return value
	}
	// Not a string: panic, so callers should check IsString first.
	panic("not a string value")
}
There are many more operations like these; see cheapjson/value.go for the details.
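The integer-versus-float rule described above can be sketched roughly as follows. This is only an illustration, not cheapjson's actual code: the helper name `parseNumber` is hypothetical, it relies on the standard `strconv` package rather than hand-written digit scanning, and it assumes the input has already been recognized as a valid JSON number literal.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNumber is a hypothetical helper. It treats a JSON number literal as
// an integer when it has no fraction part and no exponent, and as a float
// otherwise. It does not validate JSON number grammar by itself.
func parseNumber(lit string) (interface{}, error) {
	if !strings.ContainsAny(lit, ".eE") {
		// No '.', 'e' or 'E': try int64 first.
		if i, err := strconv.ParseInt(lit, 10, 64); err == nil {
			return i, nil
		}
		// Out of int64 range: fall through and store it as float64.
	}
	f, err := strconv.ParseFloat(lit, 64)
	if err != nil {
		return nil, err
	}
	return f, nil
}

func main() {
	for _, lit := range []string{"42", "-7", "3.14", "1e3"} {
		v, err := parseNumber(lit)
		fmt.Printf("%s -> %T %v (err=%v)\n", lit, v, v, err)
	}
}
```

The result can be stored directly in the `value` field of a Value, and a later type assertion tells the two cases apart.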
JSON Parser
Values such as string, true, false, null, and number are literals: they have no nested structure and cannot be interrupted by whitespace in the middle, so they can be read directly. An array or an object, on the other hand, is a multi-level tree structure. The most straightforward idea is recursion, but that is not feasible here, because parsing a large, deeply nested JSON document is likely to overflow the stack. The only option is a stack plus iteration.
People who have studied compiler principles know that when building an AST you first produce tokens and then build the AST from them. Parsing JSON should work the same way, although there are few tokens: just a handful of literals and the delimiters {, [, :, ], }, and ,. Unfortunately I never studied compilers, so I went straight to iterating with a state machine. Because JSON is a tree, parsing is a traversal from the root down to each leaf node and back to the root, which naturally involves pushing and popping a stack. Specifically, a child node is pushed when an array or object is entered, and the node is popped when the terminator of a value is encountered. The stack node also stores the corresponding Value and its state information. So I defined a stack node structure:
type state struct {
	state  int
	value  *Value
	parent *state
}
Here state is the state of the current stack node, value is the Value it represents, and parent is its parent node; the parent of the root node is nil. To push, create a new node and set its parent to the current node. To pop, set the current node to its own parent. If the current node becomes nil, the traversal is over and the JSON itself should end: apart from whitespace, no further characters are allowed.
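The push and pop rules can be sketched in a few lines. The helper names `push` and `pop` are hypothetical and are not taken from cheapjson; they only illustrate how the parent pointer doubles as the stack.

```go
package main

import "fmt"

// Value wraps a dynamically typed JSON value, as in the article.
type Value struct {
	value interface{}
}

// state is one stack node; nodes are linked through parent.
type state struct {
	state  int
	value  *Value
	parent *state
}

// push creates a child node whose parent is the current node.
func push(curr *state, st int) *state {
	return &state{state: st, value: &Value{}, parent: curr}
}

// pop returns to the parent node; popping the root yields nil.
func pop(curr *state) *state {
	return curr.parent
}

func main() {
	root := &state{value: &Value{}}
	child := push(root, 0)
	curr := pop(child) // back at the root
	fmt.Println(curr == root)
	curr = pop(curr) // the root is popped: traversal is over
	fmt.Println(curr == nil)
}
```

Because each node only points at its parent, no separate stack container is needed; the cost is one allocation per pushed node.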
The possible states of a node are:
const (
	// start of a value
	stateNone = iota
	stateString
	// after [ must be a value or ]
	stateArrayValueOrEnd
	// after a value, must be a , or ]
	stateArrayEndOrComma
	// after a {, must be a key string or }
	stateObjectKeyOrEnd
	// after a key string must be a :
	stateObjectColon
	// after a : must be a value
	// after a value, must be , or }
	stateObjectEndOrComma
	// after a , must be key string
	stateObjectKey
)
The meaning of each state matches its name. For example, stateArrayValueOrEnd means the current stack node has seen the array start marker [ and is waiting for either a child Value or the array terminator ]. stateArrayEndOrComma means the array has just received a child Value and is waiting for the terminator ] or the Value separator ,.

So when parsing an array, the full sequence of stack operations is as follows. On [, set the current node's state to stateArrayValueOrEnd, then skip whitespace and check whether the next character is ] or something else. If it is ], the array ends and the node is popped. If not, change the node's state to stateArrayEndOrComma and push a new stack node with state stateNone to start parsing a value. When that node is finished and popped, check whether the next character is , or ]. If it is ], the array ends and is popped. If it is ,, leave the state unchanged and push another new stack node to start a new cycle.

The complete state machine works as follows:
First, initialize an empty node with state stateNone, then look at the first non-whitespace character. If it is one of t/f/n/[-0-9], parse the literal directly and pop. If it is [, set the state to stateArrayValueOrEnd and look at the next character: if it is ], the array ends and the node is popped; otherwise set the node's own state to stateArrayEndOrComma, push a new stack node, and start a new loop. If it is {, set the state to stateObjectKeyOrEnd: if the next non-whitespace character is }, the object ends and the node is popped; otherwise parse the key, and once that is done, push a new stack node and set the node's own state to stateObjectEndOrComma.
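To make the array part of the state machine concrete, here is a deliberately reduced sketch that handles only nested arrays of true, false, and null. It is not cheapjson's code: the `node` type, the `pop` and `parseLiteral` helpers, and the `errSyntax` error are assumptions made for the example, and values are stored as plain interface{} instead of Value to keep it short.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

const (
	stateNone            = iota // start of a value
	stateArrayValueOrEnd        // after '[': expect a value or ']'
	stateArrayEndOrComma        // after a value: expect ',' or ']'
)

// node is a simplified stack node holding a plain interface{} value.
type node struct {
	state  int
	value  interface{}
	parent *node
}

var errSyntax = errors.New("syntax error")

// pop appends a finished value to its parent array and returns the parent.
func pop(n *node) *node {
	if n.parent != nil {
		n.parent.value = append(n.parent.value.([]interface{}), n.value)
	}
	return n.parent
}

// parseLiteral matches true, false or null at the start of s.
func parseLiteral(s string) (interface{}, int, bool) {
	for text, val := range map[string]interface{}{"true": true, "false": false, "null": nil} {
		if strings.HasPrefix(s, text) {
			return val, len(text), true
		}
	}
	return nil, 0, false
}

// parse handles only nested arrays of true, false and null.
func parse(data string) (interface{}, error) {
	root := &node{state: stateNone}
	curr, offset := root, 0
	for {
		// Skip whitespace at the start of every iteration.
		for offset < len(data) && strings.IndexByte(" \t\r\n", data[offset]) >= 0 {
			offset++
		}
		if curr == nil { // the root was popped: only EOF may follow
			if offset == len(data) {
				return root.value, nil
			}
			return nil, errSyntax
		}
		if offset == len(data) {
			return nil, errSyntax // unexpected EOF
		}
		switch curr.state {
		case stateNone:
			if data[offset] == '[' {
				curr.value, curr.state = []interface{}{}, stateArrayValueOrEnd
				offset++
			} else if val, size, ok := parseLiteral(data[offset:]); ok {
				curr.value = val
				offset += size
				curr = pop(curr)
			} else {
				return nil, errSyntax
			}
		case stateArrayValueOrEnd:
			if data[offset] == ']' {
				offset++
				curr = pop(curr)
			} else {
				curr.state = stateArrayEndOrComma
				curr = &node{state: stateNone, parent: curr}
			}
		case stateArrayEndOrComma:
			switch data[offset] {
			case ']':
				offset++
				curr = pop(curr)
			case ',':
				offset++
				curr = &node{state: stateNone, parent: curr}
			default:
				return nil, errSyntax
			}
		}
	}
}

func main() {
	fmt.Println(parse("[true, [null], []]"))
}
```

Objects would follow the same pattern with the stateObject* states, plus string parsing for the keys.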
stateString is a special case: a string is also a literal, so strictly speaking it does not need a new loop iteration to be parsed. But because an object key is also a string, string values and object keys are handled by the same code, in order to reuse that code while avoiding the overhead of a function call, as follows:
root := &state{state: stateNone, value: &Value{nil}}
curr := root
for {
	// ignore whitespace
	// check whether curr is nil
	switch curr.state {
	case stateNone:
		switch data[offset] {
		case '"':
			// a string starts here: handle it in the next iteration
			curr.state = stateString
			continue
		}
	case stateObjectKey, stateString:
		// parse string
		if curr.state == stateObjectKey {
			// create new stack node
		} else {
			// pop stack
		}
	}
}
Another special case: right after an object key is parsed, a new stack node is pushed with its state set to stateObjectColon, and at the same time the current node's own state is set to stateObjectEndOrComma. Once the colon has been parsed, the new node's state is set to stateNone and a new cycle starts. Specifically:
if curr.state == stateObjectKey {
	curr.state = stateObjectEndOrComma
	// push a node that first expects ':' and then a value
	curr = &state{state: stateObjectColon, value: &Value{nil}, parent: curr}
	continue
}
This is because there may be whitespace both before and after the :, and doing it this way reuses existing logic: all whitespace is filtered out at the beginning of each iteration.
for {
LOOP_WS:
	for ; offset < len(data); offset++ {
		switch data[offset] {
		case '\t', '\r', '\n', ' ':
			continue
		default:
			break LOOP_WS
		}
	}
	// do stuff
}
After the whitespace has been filtered out, if the current stack node is nil, no characters may remain and the whole parse is finished. Otherwise there must be a character left, and it needs to be parsed:
for {
	// ignore whitespace
	if curr == nil {
		if offset == len(data) {
			return
		} else {
			// unexpected char data[offset] at offset
		}
	} else if offset == len(data) {
		// unexpected EOF at offset
	}
	// do stuff
}
The corresponding parsing is then carried out according to the current state.
Postscript
Judging from current open source projects, there should still be room to optimize performance; after all, someone has achieved a so-called 2-4x speedup. There are also many Golang projects that pre-compile the struct definitions and then call generated functions to parse each specific structure, which is faster. But if you are pre-compiling anyway, why still use JSON? Just use pb/msgpack. As for djson in particular, when parsing small JSON it is 3-4 times as fast as the standard library, but only about twice as fast on large JSON, whereas cheapjson is almost 7 times as fast as the standard library on large JSON, which is quite funny. From the test results, overall performance and memory use are acceptable, but parsing arrays is slower than the standard library. So it is worth improving, especially the frequent creation and destruction of state nodes and the dynamic growth of arrays.
I will work on that slowly later; I do not want any more white hair.