This is an SGF parser for Go game records, implemented independently by liigo; this article describes how it works. Complete open-source SGF parsers are certainly available online. I did not use them directly or consult their code, and instead wrote my own from scratch, for one reason: I wanted to reinvent the wheel, because I believe doing so improves my coding skills. (I plan to write a separate article on my still-unpolished argument that "we must learn to reinvent the wheel".)
The parser exposes a simple event-based API, similar to SAX (Simple API for XML) among XML parsers. The core idea is that the user registers a set of callback functions in advance; during parsing, the parser calls the relevant callbacks in order, passing the corresponding arguments, and the user program does its processing inside them. A parser of this kind is lightweight: it parses quickly, uses little memory, has a clear structure, and is easy to implement. The trade-off is that it is less convenient to use than a DOM-style parser.
SGF (Smart Game Format) was designed to record games for a variety of common board games. It has caught on most strongly in Go, where it is the most important and most widely used format for game records. It is a plain-text, tree-structured format that is easy to read, store, and transmit. The format is simple and practical, and very easy to parse programmatically. Official SGF website: http://www.red-bean.com/sgf/. (Speaking of Go game records, one cannot help but marvel: a single diagram is enough to reproduce the entire course of a game. By comparison, a chess diagram can only capture the position at one moment.)
SGF is built from four structures: the tree (GameTree), the node sequence (Sequence), the node (Node), and the property (Property). The property is the most important basic unit; it consists of a property identifier (PropIdent) followed by one or more property values (PropValue), each enclosed in "[" and "]". A node is a semicolon (";") followed by zero or more properties. One or more nodes in order form a node sequence. A node sequence enclosed in parentheses "(" and ")" is a tree, which may in turn contain subtrees. The EBNF definition of SGF is as follows (see http://www.red-bean.com/sgf/sgf4.html#ebnf-def):
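To make the event-based idea concrete, here is a minimal sketch that is separate from the parser described in this article. The names on_value_fn, scan_values, and sum_lengths are hypothetical, and the scanner is deliberately naive: it reports every "[...]" span to a user callback and ignores escapes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: receives one value (not NUL-terminated) and its length. */
typedef void (*on_value_fn)(void *userdata, const char *value, size_t len);

/* Scans text and fires the callback for every "[...]" span.
   Returns the number of events delivered. Escapes are not handled. */
int scan_values(const char *text, on_value_fn on_value, void *userdata)
{
    int events = 0;
    const char *p = text;
    assert(text != NULL);
    while ((p = strchr(p, '[')) != NULL) {
        const char *end = strchr(p + 1, ']');
        if (end == NULL)
            break;                      /* unterminated value: stop scanning */
        if (on_value)
            on_value(userdata, p + 1, (size_t)(end - p - 1));
        events++;
        p = end + 1;
    }
    return events;
}

/* Example handler: adds the length of each value to a running total. */
void sum_lengths(void *userdata, const char *value, size_t len)
{
    (void)value;
    *(size_t *)userdata += len;
}
```

The caller never sees a tree in memory; it only reacts to events as they arrive, which is why this style needs so little state.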
Collection = GameTree { GameTree }
GameTree   = "(" Sequence { GameTree } ")"
Sequence   = Node { Node }
Node       = ";" { Property }
Property   = PropIdent PropValue { PropValue }
PropIdent  = UcLetter { UcLetter }
PropValue  = "[" CValueType "]"
CValueType = (ValueType | Compose)
ValueType  = (None | Number | Real | Double | Color | SimpleText | Text | Point | Move | Stone)
Here is a short but representative piece of SGF text, to give a feel for the format:
(;FF[4]GM[1]SZ[19]FG[257:Figure 1]PM[1]
PB[Takemiya Masaki]BR[9 dan]PW[Cho Chikun]
WR[9 dan]RE[W+Resign]KM[5.5]TM[28800]DT[1996-10-]
EV[21st Meijin]RO[2 (final)]SO[Go World #78]US[Arno Hollosi]
;B[pd];W[dp];B[pp];W[dd];B[pj];W[nc];B[oe];W[qc];B[pc];W[qd]
(;B[qf];W[rf];B[rg];W[re];B[qg];W[pb];B[ob];W[qb]
(;B[mp];W[fq];B[ci];W[cg];B[dl];W[cn];B[qo];W[ec];B[jp];W[jd]
;B[ei];W[eg];B[kk]LB[qq:a][dj:b][ck:c][qp:d]N[Figure 1]

;W[me]FG[257:Figure 2];B[kf];W[ke];B[lf];W[jf];B[jg]
(;W[mf];B[if];W[je];B[ig];W[mg];B[mj];W[mq];B[SCSI];W[nq]
(;B[lr];W[qq];B[pq];W[pr];B[rq];W[rr];B[rp];W[oq];B[mr];W[oo];B[mn]
(;W[nr];B[qp]LB[kd:a][kh:b]N[Figure 2]

;W[pk]FG[257:Figure 3];B[pm];W[oj];B[ok];W[qr];B[os];W[ol];B[nk];W[qj]
;B[pi];W[pl];B[qm];W[ns];B[sr];W[om];B[op];W[qi];B[oi]
(;W[rl];B[qh];W[rm];B[rn];W[ri];B[ql];W[qk];B[sm];W[sk];B[sh];W[og]
;B[oh];W[np];B[no];W[mm];B[nn];W[lp];B[kp];W[lo];B[ln];W[ko];B[mo]
;W[jo];B[km]N[Figure 3])

(;W[ql]VW[ja:ss]FG[257:Dia. 6]MN[1];B[rm];W[ph];B[oh];W[pg];B[og];W[pf]
;B[qh];W[qe];B[sh];W[of];B[sj]TR[oe][pd][pc][ob]LB[pe:a][sg:b][si:c]
N[di1_6]))

(;W[no]VW[jj:ss]FG[257:Dia. 5]MN[1];B[pn]N[di1_5]))

(;B[pr]FG[257:Dia. 4]MN[1];W[kq];B[lp];W[lr];B[jq];W[jr];B[kp];W[kr];B[ir]
;W[hr]LB[is:a][js:b][or:c]N[di1_4]))

(;W[if]FG[257:Dia. 3]MN[1];B[mf];W[ig];B[weight]LB[ki:a]N[di1_3]))

(;W[oc]VW[aa:sk]FG[257:Dia. 2]MN[1];B[md];W[mc];B[ld]N[di1_2]))

(;B[qe]VW[aa:sj]FG[257:Dia. 1]MN[1];W[re];B[qf];W[rf];B[qg];W[pb];B[ob]
;W[qb]LB[rg:a]N[di1_1]))
Programmers who have written text parsers before will see that, given an EBNF definition, writing the corresponding parser is simple and intuitive, almost a matter of translation. Implementing this SGF parser confirmed that once again: for the most part, I simply translated the EBNF into C code step by step.
I first designed the sgfparsecontext structure, which holds the data the parser needs while it works:
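As an illustration of how directly a grammar maps to code, the following sketch counts the nodes in one game tree, with one small function per EBNF rule. It is an illustrative sketch that follows the strict FF[4] grammar (uppercase identifiers, every node starts with ";"), not the parser described in this article, and all function names in it are assumptions.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static const char *skip_ws(const char *p)
{
    while (*p != '\0' && isspace((unsigned char)*p))
        p++;
    return p;
}

/* PropValue = "[" ... "]"; returns a pointer past the closing ']' or NULL. */
static const char *skip_value(const char *p)
{
    assert(*p == '[');
    for (p++; *p != '\0'; p++) {
        if (*p == '\\' && p[1] != '\0')
            p++;                        /* skip the escaped character */
        else if (*p == ']')
            return p + 1;
    }
    return NULL;                        /* unterminated value */
}

/* Node = ";" { Property }, Property = PropIdent PropValue { PropValue } */
static const char *skip_node(const char *p, int *nodes)
{
    assert(*p == ';');
    (*nodes)++;
    p = skip_ws(p + 1);
    while (isupper((unsigned char)*p)) {
        while (isupper((unsigned char)*p))
            p++;                        /* PropIdent */
        p = skip_ws(p);
        if (*p != '[')
            return NULL;                /* at least one value is required */
        while (*p == '[') {
            p = skip_value(p);
            if (p == NULL)
                return NULL;
            p = skip_ws(p);
        }
    }
    return p;
}

/* GameTree = "(" Sequence { GameTree } ")" */
static const char *skip_gametree(const char *p, int *nodes)
{
    if (*p != '(')
        return NULL;
    p = skip_ws(p + 1);
    if (*p != ';')
        return NULL;                    /* a Sequence needs at least one Node */
    while (*p == ';') {
        p = skip_node(p, nodes);
        if (p == NULL)
            return NULL;
    }
    while (*p == '(') {
        p = skip_gametree(p, nodes);
        if (p == NULL)
            return NULL;
        p = skip_ws(p);
    }
    return (*p == ')') ? p + 1 : NULL;
}

/* Returns the number of nodes in the first game tree, or -1 if it is malformed. */
int sgf_count_nodes(const char *text)
{
    int nodes = 0;
    assert(text != NULL);
    return skip_gametree(skip_ws(text), &nodes) != NULL ? nodes : -1;
}
```

Each function consumes exactly what its grammar rule describes and hands back the position where the next rule should continue.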
typedef struct _tagsgfparsecontext
{
    void *puserdata;
    int treeindex;

    pfn_on_tree pfnontree;
    pfn_on_tree_end pfnontreeend;
    pfn_on_node pfnonnode;
    pfn_on_node_end pfnonnodeend;
    pfn_on_property pfnonproperty;

    char idbuffer[16];
    char *valuebuffer;
    int valuebuffersize;
}
sgfparsecontext;
There are also functions that initialize and clean up the sgfparsecontext structure, initsgfparsecontext and cleanupsgfparsecontext. They are not central to the parser, so they are omitted here.
Then I (liigo) designed the prototypes of the five callback functions:
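Since those helpers are not shown, here is one way the cached value buffer could be managed. This is a sketch under assumptions: valuebuf and its three functions are hypothetical stand-ins for the buffer fields of sgfparsecontext and for whatever the real init, cleanup, and buffer-growing functions do.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the buffer fields of sgfparsecontext. */
typedef struct {
    char *valuebuffer;
    int valuebuffersize;
} valuebuf;

void valuebuf_init(valuebuf *b)
{
    assert(b != NULL);
    b->valuebuffer = NULL;
    b->valuebuffersize = 0;
}

/* Makes sure at least `need` bytes are available, keeping existing content.
   Grows geometrically so repeated calls rarely reallocate.
   Returns 1 on success, 0 if the allocation failed. */
int valuebuf_reserve(valuebuf *b, int need)
{
    int newsize;
    char *p;
    assert(b != NULL && need >= 0);
    if (need <= b->valuebuffersize)
        return 1;                       /* cached buffer is already big enough */
    newsize = b->valuebuffersize > 0 ? b->valuebuffersize : 64;
    while (newsize < need)
        newsize *= 2;                   /* assumes need is far below INT_MAX */
    p = (char *)realloc(b->valuebuffer, (size_t)newsize);
    if (p == NULL)
        return 0;
    b->valuebuffer = p;
    b->valuebuffersize = newsize;
    return 1;
}

void valuebuf_cleanup(valuebuf *b)
{
    assert(b != NULL);
    free(b->valuebuffer);
    b->valuebuffer = NULL;
    b->valuebuffersize = 0;
}
```

Keeping the buffer alive between properties means a typical game record triggers only a handful of allocations in total.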
/* Note: sgfparsecontext must be forward-declared before these typedefs,
   because the struct itself has members of these callback types. */
typedef void (*pfn_on_tree)(sgfparsecontext *pcontext, const char *sztreeheader, int treeindex);
typedef void (*pfn_on_tree_end)(sgfparsecontext *pcontext, int treeindex);
typedef void (*pfn_on_node)(sgfparsecontext *pcontext, const char *sznodeheader);
typedef void (*pfn_on_node_end)(sgfparsecontext *pcontext);
typedef void (*pfn_on_property)(sgfparsecontext *pcontext, const char *szid, const char *szvalue);
The parser calls these five callbacks when it reaches "tree start", "tree end", "node start", "node end", and "property encountered", respectively. Each call passes in the arguments the callback needs, ready for immediate use.
Now the actual parsing begins. The parser is divided into five functions: parseproperty, parsenode, parsenodesequence, parsegametree, and parsesgf. They are built bottom-up, and each corresponds to one production in the SGF EBNF. Every parsing function takes the parameters const char *szcollection and int frompos and returns the position at which parsing should continue, so each call determines where the next one starts.
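As a sketch of what user-side callbacks might look like, the handlers below collect simple statistics. To keep the example self-contained, sketch_context is a reduced, hypothetical copy of the context with only the two fields the handlers touch; move_stats and the handler names are assumptions as well.

```c
#include <assert.h>
#include <string.h>

/* Reduced, hypothetical copy of the context: only what the handlers below need. */
typedef struct sketch_context {
    void *puserdata;
    int treeindex;
} sketch_context;

/* User-side state, reached through puserdata. */
typedef struct {
    int black_moves;
    int white_moves;
    int max_depth;
} move_stats;

/* "Tree start" handler: remember the deepest nesting level seen so far. */
void stats_on_tree(sketch_context *pcontext, const char *sztreeheader, int treeindex)
{
    move_stats *st = (move_stats *)pcontext->puserdata;
    (void)sztreeheader;
    if (treeindex > st->max_depth)
        st->max_depth = treeindex;
}

/* "Property" handler: count B and W move properties, ignore everything else. */
void stats_on_property(sketch_context *pcontext, const char *szid, const char *szvalue)
{
    move_stats *st = (move_stats *)pcontext->puserdata;
    (void)szvalue;
    assert(szid != NULL);
    if (strcmp(szid, "B") == 0)
        st->black_moves++;
    else if (strcmp(szid, "W") == 0)
        st->white_moves++;
}
```

Handlers the user does not care about can simply be left unset, since the parser checks each function pointer before calling it.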
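The parsing functions lean on two small helpers, findchar and skipspacechars, whose source is not shown in this article. The sketch below guesses their behavior from how they are called: findchar returns the index of a character (or -1), with a negative length meaning "scan to the end of the string", and skipspacechars returns the number of leading whitespace characters. The sketch_ prefix marks these as assumptions, and the second parameter of the real skipspacechars (always NULL in the calls shown) is left out because its meaning is not documented here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns the index of the first occurrence of c in s, or -1 if absent.
   A negative len means "scan up to the terminating NUL". */
int sketch_findchar(const char *s, int len, char c)
{
    int i;
    assert(s != NULL);
    for (i = 0; (len < 0 || i < len) && s[i] != '\0'; i++) {
        if (s[i] == c)
            return i;
    }
    return -1;
}

/* Returns the number of leading whitespace characters in s. */
int sketch_skipspacechars(const char *s)
{
    int n = 0;
    assert(s != NULL);
    while (s[n] != '\0' && isspace((unsigned char)s[n]))
        n++;
    return n;
}
```

With these semantics, an expression such as findchar(";)(", -1, ch) >= 0 reads as "ch is one of the three structural characters".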
Step 1: parse a property (parseproperty). The key is to locate the "[" and "]" that delimit the property value (szvalue): everything between them is the value, and everything before "[" is the property identifier (szid). Because the escape character "\" may appear between "[" and "]", simply searching for the next "]" is not enough. A fair share of the code deals with escapes (I use the local variable in_escape to track the escape state and handle it separately). Storage must also be provided for the extracted identifier and value so they can be passed to the user callback. The identifier is never very long, so it uses a statically sized buffer; the value can be long, so it is allocated dynamically (the buffer is pre-allocated and cached to avoid frequent allocations). The code is as follows:
// Property: Id[value]
int parseproperty(sgfparsecontext *pcontext, const char *szcollection, int frompos)
{
    const char *szfrompos;
    int lindex;
    int nidbuffersize = sizeof(pcontext->idbuffer) - 1;
    assert(szcollection && frompos >= 0);
    szfrompos = szcollection + frompos;

    lindex = findchar(szfrompos, -1, '[');
    assert(lindex > 0 && lindex < nidbuffersize);
    if (lindex > 0 && lindex < nidbuffersize)
    {
        memcpy(pcontext->idbuffer, szfrompos, lindex);
        pcontext->idbuffer[lindex] = '\0';

        if (istextpropertyid(pcontext->idbuffer))
        {
            // parse the text or simple-text value, taking the '\' escape character into account
            const char *s = szfrompos + lindex + 1;
            char c;
            int in_escape = 0;
            int valuelen = 0;
            getenoughbuffer(pcontext, 1024);
            pcontext->valuebuffer[0] = '\0';
            while (1)
            {
                c = *s;
                assert(c);
                if (!in_escape)
                {
                    if (c == '\\')
                    {
                        in_escape = 1;
                    }
                    else if (c == ']')
                    {
                        break;
                    }
                    else
                    {
                        getenoughbuffer(pcontext, valuelen + 1);
                        pcontext->valuebuffer[valuelen++] = c;
                    }
                }
                else
                {
                    // ignore a line break that follows '\'
                    if (c != '\r' && c != '\n')
                    {
                        getenoughbuffer(pcontext, valuelen + 1);
                        pcontext->valuebuffer[valuelen++] = c;
                    }
                    else
                    {
                        // a two-character line break (\r\n or \n\r) is skipped as a whole
                        char nc = *(s + 1);
                        if (nc)
                        {
                            if ((c == '\r' && nc == '\n') || (c == '\n' && nc == '\r'))
                                s++;
                        }
                    }
                    in_escape = 0;
                }
                s++;
            }
            getenoughbuffer(pcontext, valuelen + 1);
            pcontext->valuebuffer[valuelen] = '\0';

            if (pcontext->pfnonproperty)
                pcontext->pfnonproperty(pcontext, pcontext->idbuffer, pcontext->valuebuffer);

            return (int)(s - szcollection + 1);
        }
        else
        {
            int rindex = findchar(szfrompos, -1, ']');
            int nneedbuffersize = rindex - lindex - 1;
            assert(rindex >= 0);
            getenoughbuffer(pcontext, nneedbuffersize + 1); // +1 for the terminating '\0'
            memcpy(pcontext->valuebuffer, szfrompos + lindex + 1, nneedbuffersize);
            pcontext->valuebuffer[nneedbuffersize] = '\0';

            if (pcontext->pfnonproperty)
                pcontext->pfnonproperty(pcontext, pcontext->idbuffer, pcontext->valuebuffer);

            return (frompos + rindex + 1);
        }
    }
    return -1;
}
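For experimenting with the escape rule on its own, here is a condensed, standalone version of the same state machine. It is a sketch, not the parser's code: sketch_unescape_value is a hypothetical name, it writes into a caller-supplied fixed buffer instead of the cached dynamic one, and it reports failure with -1 instead of asserting.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies the value that starts right after '[' into out (capacity outsize),
   resolving '\' escapes and dropping escaped line breaks.
   Returns the number of input characters consumed, including the closing ']',
   or -1 if the value is unterminated or does not fit into out. */
int sketch_unescape_value(const char *s, char *out, size_t outsize)
{
    size_t n = 0;
    int i;
    int in_escape = 0;
    assert(s != NULL && out != NULL && outsize > 0);
    for (i = 0; ; i++) {
        char c = s[i];
        if (c == '\0')
            return -1;                  /* ran off the end before ']' */
        if (!in_escape) {
            if (c == '\\') {
                in_escape = 1;          /* next character is taken literally */
                continue;
            }
            if (c == ']')
                break;                  /* unescaped ']' ends the value */
        } else {
            in_escape = 0;
            if (c == '\r' || c == '\n') {
                char nc = s[i + 1];
                if ((nc == '\r' || nc == '\n') && nc != c)
                    i++;                /* swallow the second half of \r\n or \n\r */
                continue;               /* escaped line break produces no output */
            }
        }
        if (n + 1 >= outsize)
            return -1;                  /* keep one byte for the terminator */
        out[n++] = c;
    }
    out[n] = '\0';
    return i + 1;
}
```

For the input AB\]cd] the function stores AB]cd and reports 7 characters consumed, so the caller knows exactly where the next property starts.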
Step 2: parse a node (parsenode). A semicolon ";" is followed by zero or more properties, so a while loop calls parseproperty() to parse them one by one:
// Node: ;{property}
int parsenode(sgfparsecontext *pcontext, const char *szcollection, int frompos)
{
    const char *szfrompos = szcollection + frompos;
    assert(frompos >= 0);
    // assert(szfrompos[0] == ';');

    if (pcontext->pfnonnode)
        pcontext->pfnonnode(pcontext, szfrompos);

    if (szfrompos[0] == ';')
    {
        frompos++; szfrompos++;
    }

    while (1)
    {
        frompos += skipspacechars(szfrompos, NULL);
        // stop at the end of the text or at the next ';', ')' or '('
        if (szcollection[frompos] == '\0' || findchar(";)(", -1, szcollection[frompos]) >= 0)
            break;
        frompos = parseproperty(pcontext, szcollection, frompos);
        szfrompos = szcollection + frompos;
    }
    return frompos;
}
Step 3: parse a node sequence (parsenodesequence). Nodes are arranged one after another: there must be at least one node, optionally followed by more. Again a while loop:
// Nodesequence: node {node}
int parsenodesequence(sgfparsecontext *pcontext, const char *szcollection, int frompos)
{
    const char *szfrompos = szcollection + frompos;
    assert(frompos >= 0);
    // assert(szfrompos[0] == ';');
    while (1)
    {
        frompos = parsenode(pcontext, szcollection, frompos);
        // re-sync the pointer with frompos before skipping whitespace
        szfrompos = szcollection + frompos;
        frompos += skipspacechars(szfrompos, NULL);
        szfrompos = szcollection + frompos;
        if (szfrompos[0] != ';')
        {
            if (pcontext->pfnonnodeend)
                pcontext->pfnonnodeend(pcontext);
            break;
        }
    }
    return frompos;
}
Step 4: parse a tree (parsegametree). A tree is a nested structure: the outermost layer is a pair of parentheses "(" and ")", which contain any number of node sequences or nested subtrees. Again a while loop does the job: on "(", parsegametree() is called recursively to parse the subtree; otherwise parsenodesequence() is called to parse a node sequence. The code is as follows:
// Gametree: ( { [nodesequence] | [gametree] } )
// old gametree: ( nodesequence {gametree} )
int parsegametree(sgfparsecontext *pcontext, const char *szcollection, int frompos)
{
    char c;
    const char *szfrompos = szcollection + frompos;
    assert(frompos >= 0);
    assert(szfrompos[0] == '(');

    pcontext->treeindex++;
    if (pcontext->pfnontree)
        pcontext->pfnontree(pcontext, szfrompos, pcontext->treeindex);

    frompos++; szfrompos++;
    frompos += skipspacechars(szfrompos, NULL);

    c = szcollection[frompos];
    while (1)
    {
        if (c == '(')
            frompos = parsegametree(pcontext, szcollection, frompos);
        else
            frompos = parsenodesequence(pcontext, szcollection, frompos);

        szfrompos = szcollection + frompos;
        frompos += skipspacechars(szfrompos, NULL);
        c = szcollection[frompos];
        if (c == ')')
        {
            if (pcontext->pfnontreeend)
                pcontext->pfnontreeend(pcontext, pcontext->treeindex);
            pcontext->treeindex--;
            break;
        }
    }

    return (frompos + 1);
}
Step 5: finally, parse the entire SGF text (parsesgf). This is the main interface exposed to callers. It is simple: the text is just a series of trees in order, so calling parsegametree() in a loop to parse each tree is all it takes. The code is as follows:
// Sgfcollection: gametree {gametree}
int parsesgf(sgfparsecontext *pcontext, const char *szcollection, int frompos)
{
    const char *szfrompos = szcollection + frompos;
    assert(frompos >= 0);
    assert(szfrompos[0] == '(');
    pcontext->treeindex = -1;
    while (1)
    {
        frompos = parsegametree(pcontext, szcollection, frompos);
        // re-sync the pointer with frompos so whitespace between trees is really skipped
        szfrompos = szcollection + frompos;
        frompos += skipspacechars(szfrompos, NULL);
        szfrompos = szcollection + frompos;
        if (szfrompos[0] != '(')
            break;
    }
    return frompos;
}
Test code:
int main(int argc, char *argv[])
{
    char *s;
    int x;
    sgfparsecontext context;
    // initsgfparsecontext(&context, ontree, ontreeend, onnode, onnodeend, onproperty, NULL);
    initsgfparsecontext(&context, ontree2, ontreeend2, onnode2, onnodeend2, onproperty2, NULL);

    // test parse property:
    {
        s = "AB[cdef]X[xyz]";
        printf("\ntest parse property: -----\n");
        x = parseproperty(&context, s, 0);
        x = parseproperty(&context, s, 8);
        s = "C[AB\\]cd]"; // value contains an escaped ']'
        x = parseproperty(&context, s, 0);
    }
    // test parse node:
    {
        s = ";A[a]BB[BB]C[]";
        printf("\ntest parse node: -----\n");
        x = parsenode(&context, s, 0);
        s = ";A[a];BB[BB]C[]";
        x = parsenode(&context, s, 0);
        x = parsenodesequence(&context, s, 0);
    }
    // test parse tree:
    {
        printf("\ntest parse tree: -----\n");
        s = "(;A[a](;C[c](X[x])Z[Z]);D[d](;E[e](f[ff])))";
        x = parsegametree(&context, s, 0);
    }

#if 1
    // parse real sgf file:
    {
        int len = 0;
        char *data = NULL;
        FILE *pfile = fopen("D:\\x.txt", "r");
        printf("\n---------- test parse real sgf file: --------\n");
        if (pfile)
        {
            fseek(pfile, 0, SEEK_END);
            len = (int)ftell(pfile);
            assert(len > 0);
            fseek(pfile, 0, SEEK_SET);
            data = malloc(len + 1); // +1 for the terminating '\0'
            assert(data);
            len = (int)fread(data, 1, len, pfile);
            data[len] = '\0'; // the parser expects a NUL-terminated string

            parsesgf(&context, data, 0);

            free(data);
            fclose(pfile);
            pfile = NULL;
        }
    }
#endif

    {
        char c;
        printf("\n----- any key to exit: -----\n");
        fflush(stdout);
        scanf("%c", &c);
    }
    return 0;
}
Conclusion: the structure of this SGF parser is clear. It follows the EBNF definition step by step and is not particularly complicated. But because it involves text, pointers, and recursion, there are many details that need care. Dear reader, how long would it take you to write an SGF parser like this? If you have the time, try writing one yourself before judging how easy it is. "Reinventing the wheel" is not entirely pointless; at the very least it exercises one's hands-on skills.
There is also one design trade-off that I am not sure I got right. All callbacks currently take sgfparsecontext *pcontext as their first parameter; originally the parameter in that position was void *puserdata. Later, considering that a callback may need other data from sgfparsecontext (for example, reading treeindex inside pfn_on_node), I introduced pcontext for the user's convenience. (The user data is still reachable, through pcontext->puserdata.) The current approach exposes the parser's internal structure (sgfparsecontext), but it also makes the callback signatures more stable and extensible, since pcontext can carry additional data without changing the function prototypes.
This SGF parser is already used in the open-source software "M8 Go game records" (m8weiqipu, http://code.google.com/p/m8weiqipu/) and has initially served its practical purpose. That does not mean it has reached industrial strength: many cases remain untested, and mistakes and omissions are inevitable. Criticism and corrections are welcome.
In addition, for compatibility with existing SGF files, the parser slightly extends the EBNF in the SGF specification.
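The difference between the two callback styles can be sketched side by side. Everything here is illustrative: sketch_ctx is a reduced, hypothetical context with only the two fields under discussion, and tally and the two handler names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced, hypothetical context with only the two fields discussed above. */
typedef struct sketch_ctx {
    void *puserdata;
    int treeindex;
} sketch_ctx;

/* User-side state. */
typedef struct {
    int nodes;
    int deepest;
} tally;

/* Earlier style: the callback receives only the user's pointer. */
void on_node_userdata(void *puserdata, const char *sznodeheader)
{
    tally *t = (tally *)puserdata;
    (void)sznodeheader;
    assert(t != NULL);
    t->nodes++;                         /* treeindex is out of reach here */
}

/* Current style: the callback receives the whole context. */
void on_node_context(sketch_ctx *pcontext, const char *sznodeheader)
{
    tally *t;
    (void)sznodeheader;
    assert(pcontext != NULL && pcontext->puserdata != NULL);
    t = (tally *)pcontext->puserdata;
    t->nodes++;
    if (pcontext->treeindex > t->deepest)
        t->deepest = pcontext->treeindex;   /* parser state is available too */
}
```

In the first style, giving the handler access to treeindex would mean adding a parameter and breaking every existing callback; in the second, a new field in the context is enough.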
For complete source code, see:
http://code.google.com/p/m8weiqipu/source/browse/trunk/sgf.h
http://code.google.com/p/m8weiqipu/source/browse/trunk/sgf.c