Writing a JSON parser by hand (Golang)


A while ago I came across go-simplejson, a Golang JSON library for wrapping and parsing anonymous JSON. Put plainly, it parses JSON into maps, slices, and so on instead of predefined structs. I thought it was fun, and I happened to have a project that needed to parse JSON, so I tried it. Then I happened to look at the source code and found that it uses Golang's own encoding/json library to do the parsing and only adds a layer of wrapping on top to make it nicer to use. So I wondered whether I could write a parser with my bare hands. After all, in all these years of writing code I have called JSON.parse and JSON.stringify countless times. After two days of tinkering it finally worked. I ran some tests, and it performs much better than the standard library: roughly 1.6 to 7 times as fast, depending on the size and structure of the JSON string. So I decided to write this article and share the ideas.

First, an aside. As someone who has been writing code for nearly three years, I was asked more than once in interviews: how do you deep copy an object (JS)? I would laugh and say: just write a recursive walk function, and if you are worried about stack overflow, use a stack plus iteration. The interviewers would always ask whether there is a better way, and then answer it themselves: you can serialize to JSON and deserialize it again ...

The project is called cheapjson. The name means cheap: you do not need to define a struct for each document, and it is faster than the standard library, so it is very cheap. The address is https://github.com/acrazing/c ..., have a look if you are interested.

JSON value

Since it is meant to be cheap, it stays away from reflection. That means something like void * is needed, which in Golang is of course interface{}, plus a structure that holds the required information and performs type checks and bounds checks. In C, tracking array sizes, string lengths, and object key/value mappings would all be necessary work. In Golang none of that is needed, because the language already takes care of it.

In JSON, a complete document contains one value, and the value may be null, true, false, a number, a string, an array, or an object: seven kinds in total. An array or an object may in turn contain child values. These map to Golang as nil, bool, bool, int64/float64, string, []interface{}, and map[string]interface{}, all of which can be held in a union-like structure.

Note that a number may be converted to either an integer or a floating-point number. In JavaScript every number is stored as a 64-bit double-precision floating-point number, so the largest exact integer is limited by the mantissa to 2^53-1. That is already much larger than int32, so integers are mapped to int64 rather than int, which could overflow on some machines. Strictly distinguishing integers from floating-point numbers in IEEE-754 format is not easy, so the rule is simplified: if neither a fraction part nor an exponent part is present, the number is treated as an integer. To simplify things further, any invalid UTF-16 in a string is treated as a structural error and parsing terminates.

For convenience, define a structure to hold a JSON value:

type Value struct {
	value interface{}
}

The value field of the structure holds the actual value of the JSON Value, and its type is determined with a type assertion. This leads to a lot of test, assignment, and accessor functions. For string, for example, there is IsString() to test whether a Value is a string, AsString() to assign a string to a Value, and String() to get the actual value:

// IsString reports whether the Value holds a string.
func (v *Value) IsString() bool {
	_, ok := v.value.(string)
	return ok
}

// AsString assigns a string to the Value.
func (v *Value) AsString(value string) {
	v.value = value
}

// String returns the string held by a string-typed Value.
func (v *Value) String() string {
	if value, ok := v.value.(string); ok {
		return value
	}
	// Not a string: panic, so check with IsString first.
	panic("not a string value")
}

There are many more functions like these; see cheapjson/value.go.
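The simplified number rule from this section can be sketched as follows. This is only an illustration under stated assumptions: parseNumber is a hypothetical helper, not a function of cheapjson, and it leans on strconv, which is a little more lenient than the JSON grammar (it accepts a leading + or leading zeros, for instance).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNumber is a hypothetical helper, not part of cheapjson.
// It applies the simplified rule: a literal with no fraction part and
// no exponent part becomes an int64, anything else becomes a float64.
func parseNumber(lit string) (interface{}, error) {
	if strings.ContainsAny(lit, ".eE") {
		f, err := strconv.ParseFloat(lit, 64)
		if err != nil {
			return nil, err
		}
		return f, nil
	}
	n, err := strconv.ParseInt(lit, 10, 64)
	if err != nil {
		return nil, err
	}
	return n, nil
}

func main() {
	// Print the Go type chosen for a few sample literals.
	for _, lit := range []string{"42", "-7", "3.14", "1e3", "9007199254740993"} {
		v, err := parseNumber(lit)
		fmt.Printf("%s -> %T %v (err: %v)\n", lit, v, v, err)
	}
}
```

Keeping integers as int64 means a literal such as 9007199254740993 (2^53+1) survives exactly, which it would not do as a float64.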

JSON Parser

Values such as string, true, false, null, and number are literals. They have no nested structure and cannot be interrupted by whitespace, so they can be read directly. An array or an object, on the other hand, is a multi-level tree. The most straightforward idea is recursion, but as we all know that is not viable, because parsing a large JSON document may well overflow the stack. So the only option is a stack plus iteration.

Anyone who has studied compiler principles knows that when building an AST you first split the input into tokens and then analyze the tokens. Parsing JSON should work the same way, even though there are few tokens: only a handful of literals and the delimiters { [ : , ] }. Unfortunately I never studied compiler principles, so I went straight to iterating with a state machine. Because JSON is a tree, parsing is a traversal from the root down to each leaf and back up to the root, which naturally involves pushing to and popping from a stack. Specifically, a child node is pushed when an array or object is encountered, and the node is popped when the terminator of a value is reached. Each stack node also stores its corresponding Value and its state information. So I defined a stack node structure:

type state struct {
	state  int
	value  *Value
	parent *state
}

Here state is the state of the current stack node, value is the value the node represents, and parent is its parent node. The parent of the root node is nil. To push, create a new node, set its parent to the current node, and make it the current node. To pop, set the current node to its parent. If the current node becomes nil, the traversal is finished and the JSON itself should end: apart from whitespace, no characters may remain.
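The push and pop operations through the parent pointer can be sketched in isolation. The example below is a minimal illustration, not cheapjson code: node and maxDepth are hypothetical names, the sketch starts with an empty stack instead of a root node, and it ignores string contents, so it is not a JSON validator. It only measures bracket nesting without recursion.

```go
package main

import "fmt"

// node is an illustrative stack node: the parent pointers form a
// linked list that serves as the stack.
type node struct {
	open   byte  // the bracket that opened this level
	parent *node // nil for the outermost level
}

// maxDepth walks the brackets in data iteratively and returns the
// deepest nesting level. Brackets inside strings are NOT skipped.
func maxDepth(data string) (int, error) {
	var curr *node
	depth, deepest := 0, 0
	for i := 0; i < len(data); i++ {
		switch c := data[i]; c {
		case '[', '{':
			curr = &node{open: c, parent: curr} // push
			depth++
			if depth > deepest {
				deepest = depth
			}
		case ']', '}':
			if curr == nil {
				return 0, fmt.Errorf("unexpected %q at offset %d", c, i)
			}
			if (c == ']') != (curr.open == '[') {
				return 0, fmt.Errorf("mismatched %q at offset %d", c, i)
			}
			curr = curr.parent // pop
			depth--
		}
	}
	if curr != nil {
		return 0, fmt.Errorf("unexpected EOF: %d level(s) still open", depth)
	}
	return deepest, nil
}

func main() {
	d, err := maxDepth(`[[1,2],{"a":[3]}]`)
	fmt.Println(d, err)
}
```

Because every level lives on the heap, the nesting depth is limited by memory rather than by the call stack.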

The possible states of a node are:

const (
	// start of a value
	stateNone = iota
	stateString
	// after [ must be a value or ]
	stateArrayValueOrEnd
	// after a value, must be a , or ]
	stateArrayEndOrComma
	// after a {, must be a key string or }
	stateObjectKeyOrEnd
	// after a key string must be a :
	// (after the : must come a value, so the node then returns to stateNone)
	stateObjectColon
	// after a value, must be , or }
	stateObjectEndOrComma
	// after a , must be key string
	stateObjectKey
)

The states mean what their names say. For example, stateArrayValueOrEnd means the current stack node has seen the start of an array, [, and is waiting for a child Value or the array terminator ]. stateArrayEndOrComma means the array has already seen a child Value and is waiting for the terminator ] or the Value delimiter , (comma).

So the complete sequence of stack operations when parsing an array is as follows. On [, the current node's state is set to stateArrayValueOrEnd. Whitespace is then skipped and the next character is checked. If it is ], the array ends and the node is popped. Otherwise the node's state changes to stateArrayEndOrComma and a new stack node with state stateNone is pushed, and parsing starts over for that node. When the child node has been parsed it is popped, and the next character must be , or ]. If it is ], the array ends and is popped. If it is , the state stays the same, a new stack node is pushed, and a new loop starts.

The complete state machine works as follows. First initialize an empty node with state stateNone, then look at the first non-whitespace character. If it matches t, f, n, or [-0-9], parse the literal directly and pop. If it is [, set the state to stateArrayValueOrEnd and look at the next character: if it is ], the array ends and the node is popped, otherwise set the node's own state to stateArrayEndOrComma, push a new node, and start a new loop. If it is {, set the state to stateObjectKeyOrEnd: if the next non-whitespace character is }, the object ends and the node is popped, otherwise parse the key and, once that is done, push a new node and set the object node's own state to stateObjectEndOrComma.
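The array half of this state machine is small enough to sketch end to end. The code below is an illustration under assumptions, not the cheapjson implementation: frame, pop, and parse are hypothetical names, only arrays and unsigned integer literals are supported, and leading zeros are accepted for brevity. It follows the same pattern of skipping whitespace at the top of each iteration and stopping when the current node becomes nil.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const (
	stateNone            = iota // start of a value
	stateArrayValueOrEnd        // after [ : a value or ]
	stateArrayEndOrComma        // after a value: , or ]
)

// frame is an illustrative stack node.
type frame struct {
	state  int
	items  []interface{} // children collected so far (arrays only)
	value  interface{}   // the finished value
	parent *frame
}

// pop stores v as f's value, appends it to the parent array if there
// is one, and returns the parent as the new current node.
func pop(f *frame, v interface{}) *frame {
	f.value = v
	if f.parent != nil {
		f.parent.items = append(f.parent.items, v)
	}
	return f.parent
}

func parse(data string) (interface{}, error) {
	root := &frame{state: stateNone}
	curr, offset := root, 0
	for {
		// Skip whitespace at the top of every iteration.
		for offset < len(data) && strings.IndexByte(" \t\r\n", data[offset]) >= 0 {
			offset++
		}
		if curr == nil { // root was popped: only EOF may follow
			if offset == len(data) {
				return root.value, nil
			}
			return nil, fmt.Errorf("unexpected char %q at offset %d", data[offset], offset)
		}
		if offset == len(data) {
			return nil, fmt.Errorf("unexpected EOF at offset %d", offset)
		}
		c := data[offset]
		switch {
		case curr.state == stateNone && c == '[':
			offset++
			curr.state, curr.items = stateArrayValueOrEnd, []interface{}{}
		case curr.state == stateNone && c >= '0' && c <= '9':
			start := offset
			for offset < len(data) && data[offset] >= '0' && data[offset] <= '9' {
				offset++
			}
			n, err := strconv.ParseInt(data[start:offset], 10, 64)
			if err != nil {
				return nil, err
			}
			curr = pop(curr, n)
		case curr.state != stateNone && c == ']':
			offset++
			curr = pop(curr, curr.items)
		case curr.state == stateArrayValueOrEnd:
			// First element: do not consume c, let the child parse it.
			curr.state = stateArrayEndOrComma
			curr = &frame{state: stateNone, parent: curr}
		case curr.state == stateArrayEndOrComma && c == ',':
			offset++
			curr = &frame{state: stateNone, parent: curr}
		default:
			return nil, fmt.Errorf("unexpected char %q at offset %d", c, offset)
		}
	}
}

func main() {
	v, err := parse("[1, [2, 3], []]")
	fmt.Println(v, err)
}
```

A trailing comma such as [1,] is rejected here because the node pushed after the comma is in stateNone, where ] is not a valid start of a value.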

stateString is a special case. A string is also a literal and would not normally need a new loop iteration. But because an object key is also a string, the string value and the object key are handled in the same case, in order to reuse the code and avoid the overhead of a function call, as follows:

root := &state{state: stateNone, value: &Value{}, parent: nil}
curr := root
for {
	// ignore whitespace
	// check curr is nil or not
	switch curr.state {
	case stateNone:
		switch data[offset] {
		case '"':
			// go to new loop
			curr.state = stateString
			continue
		}
	case stateObjectKey, stateString:
		// parse string
		if curr.state == stateObjectKey {
			// create new stack node
		} else {
			// pop stack
		}
	}
}
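The shared string step could look roughly like the helper below, which serves keys and string values alike. It is a sketch under assumptions: scanString is a hypothetical name, it is written as a function only for readability (the article inlines this step to avoid a call), and it returns the raw bytes between the quotes without decoding escapes or checking UTF-16.

```go
package main

import (
	"errors"
	"fmt"
)

// scanString is a hypothetical helper. data[offset] must be the
// opening quote. It returns the raw text between the quotes (escapes
// are left undecoded) and the offset just past the closing quote.
func scanString(data string, offset int) (raw string, next int, err error) {
	if offset >= len(data) || data[offset] != '"' {
		return "", offset, errors.New("expected opening quote")
	}
	for i := offset + 1; i < len(data); i++ {
		switch data[i] {
		case '\\':
			i++ // skip the escaped character so that \" does not end the string
		case '"':
			return data[offset+1 : i], i + 1, nil
		}
	}
	return "", len(data), errors.New("unexpected EOF inside string")
}

func main() {
	// The same call works for an object key and for a string value.
	key, next, err := scanString(`"name": "cheapjson"`, 0)
	fmt.Println(key, next, err)
	val, next, err := scanString(`"name": "cheapjson"`, 8)
	fmt.Println(val, next, err)
}
```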

Another special case: right after an object key is parsed, a new stack node is pushed with its state set to stateObjectColon, and at the same time the object node's own state is set to stateObjectEndOrComma. After the colon is parsed, the new node's state is set to stateNone and a new loop starts. Specifically:

if curr.state == stateObjectKey {
	curr.state = stateObjectEndOrComma
	// push: the new node's parent is the object node
	curr = &state{state: stateObjectColon, value: &Value{}, parent: curr}
	continue
}

This is because there may be whitespace both before and after the : character, and handling it this way reuses existing logic: all whitespace is skipped at the start of each iteration.

for {
LOOP_WS:
	for ; offset < len(data); offset++ {
		switch data[offset] {
		case '\t', '\r', '\n', ' ':
			continue
		default:
			break LOOP_WS
		}
	}
	// do stuff
}

After the whitespace has been skipped, if the current stack node is nil, no characters should remain and the whole parse ends. Otherwise there must be more characters, and they need to be parsed:

for {
	// ignore whitespace
	if curr == nil {
		if offset == len(data) {
			return
		} else {
			// unexpected char data[offset] at offset
		}
	} else if offset == len(data) {
		// unexpected EOF at offset
	}
	// do stuff
}

The input is then parsed according to the current state.

Postscript

Judging from existing open source projects, there should still be room to optimize performance. After all, some of them claim speedups of 2-4x. Many Golang projects now precompile for a given struct and then call the generated function to parse that specific structure, which is faster. But if you are precompiling anyway, why still use JSON? Just use pb/msgpack. Take djson in particular: when parsing small JSON it is 3-4 times as fast as the standard library, but only about twice as fast on large input, whereas cheapjson is almost 7 times as fast as the standard library on large JSON, which is quite funny. From the test results, overall performance and memory use are acceptable, but array parsing is worse than the standard library. So there is still room for improvement, especially the frequent creation and destruction of state nodes and the dynamic growth of arrays.

That can be done slowly later. I do not want any more white hair.
