Implementing a JSON Generator and Recursive Descent Parser in Python

Last updated: 2017-12-21  Source: Internet

GitHub repository: https://github.com/EStormLynn/Python-JSON-Parser

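Goals

Write a JSON parser from scratch with the following features:

A standards-compliant JSON parser and generator
A hand-written recursive descent parser
Written in Python (2.7)
Parser and generator together in fewer than 500 lines
Performance profiling and optimization with cProfile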

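What is implemented

[x] Parsing literals (true, false, null)
[x] Parsing numbers
[x] Parsing strings
[x] Parsing Unicode
[x] Parsing arrays
[x] Parsing objects
[x] Unit tests
[x] Generator
[x] cProfile performance optimization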

Detailed introduction: what is JSON?

JSON (JavaScript Object Notation) is a text format for data interchange; see the ECMA standard, JSON Data Interchange Format. Here is a sample of JSON data:

{
    "title": "Design Patterns",
    "subtitle": "Elements of Reusable Object-Oriented Software",
    "author": [
        "Erich Gamma",
        "Richard Helm",
        "Ralph Johnson",
        "John Vlissides"
    ],
    "year": 2009,
    "weight": 1.8,
    "hardcover": true,
    "publisher": {
        "Company": "Pearson Education",
        "Country": "India"
    },
    "website": null
}

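JSON is a tree structure in which every value is one of the following types:

null: written as null
boolean: written as true or false
number: ordinary floating-point notation, described in detail in the section on parsing numbers
string: written as "..."
array: written as [ ... ]
object: written as { ... }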

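Implementing the parser

es_parser is a hand-written recursive descent parser. Because the JSON grammar is particularly simple, the tokenizer can be omitted: inspecting the next character is enough to tell which type of value follows, and the matching parse function is then called. For the complete JSON grammar, after skipping whitespace, only the current character needs to be checked: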

n -> null
t -> true
f -> false
" -> string
0-9/- -> number
[ -> array
{ -> object
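The dispatch on the first character can be sketched as a small function. This is only an illustration and not code from es_parser: the name classify and the labels it returns are hypothetical, and it looks at the first non-whitespace character without parsing the value itself.

```python
def classify(text):
    """Return the kind of JSON value that starts `text`, judged by one character."""
    stripped = text.lstrip(" \t\n\r")  # JSON whitespace: space, tab, LF, CR
    if not stripped:
        raise ValueError("expected a value, found end of input")
    ch = stripped[0]
    if ch == "n":
        return "null"
    if ch == "t":
        return "true"
    if ch == "f":
        return "false"
    if ch == '"':
        return "string"
    if ch in "-0123456789":
        return "number"
    if ch == "[":
        return "array"
    if ch == "{":
        return "object"
    raise ValueError("unexpected character %r" % ch)
```

A real parser would call the matching parse function at each branch instead of returning a label, and that function would still have to verify the rest of the value (for example that "n" is really followed by "ull").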

Two classes were written, one for a JSON value together with its type and one for the JSON string being parsed:

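# The JTYPE_* constants are defined elsewhere in the project.
class EsValue(object):
    # A parsed JSON value; __slots__ keeps instances small.
    __slots__ = ('type', 'num', 'str', 'array', 'obj')

    def __init__(self):
        self.type = JTYPE_UNKNOW


class context(object):
    # Parsing state: the remaining input and the current position in it.
    def __init__(self, jstr):
        self.json = list(jstr)
        self.pos = 0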

Take skipping extra spaces, tabs and newlines as an example:

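import re

WHITESPACE = re.compile(r'\s')


def es_parse_whitespace(context):
    # Drop leading whitespace (spaces, tabs, newlines) from the remaining input.
    if not context.json:
        return
    pos = 0
    # The bounds check avoids an IndexError when the input is all whitespace.
    while pos < len(context.json) and WHITESPACE.match(context.json[pos]):
        pos += 1
    context.json = context.json[pos:]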

Parsing literals

There are three literals: false, true and null.

# MyException, PARSE_STATE_OK and the JTYPE_* constants are defined elsewhere in the project.
def es_parse_literal(context, literal, mytype):
    e_value = EsValue()
    # The input at the current position must spell out the literal exactly.
    if ''.join(context.json[context.pos:context.pos + len(literal)]) != literal:
        raise MyException("PARSE_STATE_INVALID_VALUE, literal error")
    e_value.type = mytype
    # Consume the literal from the remaining input.
    context.json = context.json[context.pos + len(literal):]
    return PARSE_STATE_OK, e_value


def es_parse_value(context, typevalue):
    # Dispatch on the first character of the value.
    if context.json[context.pos] == 't':
        return es_parse_literal(context, "true", JTYPE_TRUE)
    if context.json[context.pos] == 'f':
        return es_parse_literal(context, "false", JTYPE_FALSE)
    if context.json[context.pos] == 'n':
        return es_parse_literal(context, "null", JTYPE_NULL)

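Parsing numbers

A JSON number is written in decimal and consists of four parts in order: a minus sign, an integer part, a fraction and an exponent. Only the integer part is required.

JSON supports scientific notation. The exponent starts with an uppercase E or lowercase e, optionally followed by a plus or minus sign, and then one or more digits (0-9).

The JSON standard ECMA-404 presents the grammar as a diagram, which makes the paths a parser may take easier to see.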
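As a textual counterpart to the grammar diagram, the four parts (sign, integer, fraction, exponent) can be written as a regular expression. This is a sketch that assumes a regex is acceptable for recognizing numbers; JSON_NUMBER and match_number are hypothetical names and not part of es_parser.

```python
import re

# sign?  integer  fraction?  exponent?
# The integer part is either a single "0" or a non-zero digit followed by digits,
# so leading zeros such as "0123" are not a single number.
JSON_NUMBER = re.compile(r'-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?')


def match_number(text, pos=0):
    """Return the JSON number that starts at text[pos], or None if there is none."""
    m = JSON_NUMBER.match(text, pos)
    return m.group(0) if m else None
```

Because the match stops at the first character that cannot continue the number, a caller still has to check what follows: for "0123" only "0" is matched, and the leftover "123" should be reported as an error.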

Python is a dynamic language, so num in es_value can hold either an integer or a float:

class es_value():
    def __init__(self, type):
        self.type = type
        self.num = 0

Python can convert a string to float or int, but int(string) cannot handle scientific notation, so the string is always converted to float first and then to int:

typevalue.num = float(numstr)
if isint:
    typevalue.num = int(typevalue.num)
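The float-then-int conversion can be tried in isolation with the sketch below. The excerpt does not show how isint is computed, so the rule used here (no fraction part in the text, and a whole-number value) is an assumption, and to_number is a hypothetical name.

```python
def to_number(numstr):
    """Convert the text of a valid JSON number to an int or a float."""
    value = float(numstr)  # float() understands "1e4"; int("1e4") raises ValueError
    # Assumption: treat the number as an integer only when the text has no
    # fraction part and the value is whole, so that "1e-2" stays a float.
    if "." not in numstr and value.is_integer():
        return int(value)
    return value
```

One caveat of going through float is precision: an integer with more digits than a float can represent exactly will not survive the round trip unchanged.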

The unit tests include:

    def testnum(self):
        print("\n------------test number-----------")
        self.assertEqual(type(self.parse("24")), type(1))
        self.assertEqual(type(self.parse("1e4")), type(10000))
        self.assertEqual(type(self.parse("-1.5")), type(-1.5))
        self.assertEqual(type(self.parse("1.5e3")), type(1.500))

Parsing strings

A string may contain escape sequences, which have to be processed when loading. \u sequences are decoded into Unicode.

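def es_parse_string(context):
    # Two-character escape sequences and the characters they stand for.
    # \', \a and \v are not JSON escapes; the table accepts them as extensions.
    charlist = {
        '\\"': '\"',
        "\\'": "\'",
        "\\b": "\b",
        "\\f": "\f",
        "\\r": "\r",
        "\\n": "\n",
        "\\t": "\t",
        "\\u": "\\u",  # kept as-is so unicode_escape can decode \uXXXX below
        "\\\\": "\\",
        "\\/": "/",
        "\\a": "\a",
        "\\v": "\v"
    }
    # Initialization is elided in the original excerpt; assumed here.
    e_value = EsValue()
    e_value.str = ''
    pos = context.pos + 1  # skip the opening quote
    while context.json[pos] != '"':
        # handle escape characters
        if context.json[pos] == '\\':
            # join so the key is a string even when context.json is a list
            c = ''.join(context.json[pos:pos + 2])
            if c in charlist:
                e_value.str += charlist[c]
            else:
                e_value.str += context.json[pos]
                pos += 1
                continue
            pos += 2
        else:
            e_value.str += context.json[pos]
            pos += 1
    e_value.type = JTYPE_STRING
    context.json = context.json[pos + 1:]
    context.pos = 1
    if '\\u' in e_value.str:
        e_value.str = e_value.str.encode('latin-1').decode('unicode_escape')
    return PARSE_STATE_OK, e_value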

Unit tests:

    def teststring(self):
        print("\n------------test string----------")
        # the input \\ stands for a single backslash
        self.assertEqual(type(self.parse("\" \\\\line1\\nline2 \"")), type("string"))
        self.assertEqual(type(self.parse("\"  abc\\def\"")), type("string"))
        self.assertEqual(type(self.parse("\"      null\"")), type("string"))
        self.assertEqual(type(self.parse("\"hello world!\"")), type("string"))
        self.assertEqual(type(self.parse("\"   \u751F\u5316\u5371\u673A  \"")), type("string"))
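For comparison, escape handling restricted to what the JSON grammar allows can be sketched as follows. This is not the project's code: unescape and SIMPLE_ESCAPES are hypothetical names, the function takes only the text between the quotes, and it is written for Python 3 (on 2.7, chr would have to be unichr).

```python
import string

# The eight single-character escapes defined by JSON, keyed by the character
# that follows the backslash.
SIMPLE_ESCAPES = {
    '"': '"', '\\': '\\', '/': '/',
    'b': '\b', 'f': '\f', 'n': '\n', 'r': '\r', 't': '\t',
}


def unescape(body):
    """Decode the text between the quotes of a JSON string literal."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != '\\':
            out.append(ch)
            i += 1
            continue
        esc = body[i + 1:i + 2]  # empty if the backslash is the last character
        if esc == 'u':
            digits = body[i + 2:i + 6]
            if len(digits) != 4 or any(d not in string.hexdigits for d in digits):
                raise ValueError("invalid \\u escape at offset %d" % i)
            out.append(chr(int(digits, 16)))
            i += 6
        elif esc in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[esc])
            i += 2
        else:
            raise ValueError("invalid escape at offset %d" % i)
    return ''.join(out)
```

The sketch rejects unknown escapes instead of passing them through, and it does not combine surrogate pairs (two consecutive \uXXXX escapes that encode one character outside the Basic Multilingual Plane).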

The es_dumps function: the JSON generator

es_dumps serializes a Python dict structure into a JSON string:

def es_dumps(obj):
    obj_str = ""
    if isinstance(obj, bool):
        # JSON booleans are lowercase (the original emitted "True"/"False").
        obj_str += "true" if obj else "false"
    elif obj is None:
        obj_str += "null"
    elif isinstance(obj, basestring):
        for ch in obj.decode('utf-8'):
            if u'\u4e00' <= ch <= u'\u9fff':
                # Emit CJK text as \uXXXX escapes; repr() would also add a u'...' wrapper.
                obj_str += "\"" + obj.decode('utf-8').encode('unicode_escape') + "\""
                break
        else:
            obj_str += "\"" + obj + "\""
    elif isinstance(obj, list):
        obj_str += '[' + ", ".join(es_dumps(i) for i in obj) + ']'
    elif isinstance(obj, int) or isinstance(obj, float):  # number
        obj_str += str(obj)
    elif isinstance(obj, dict):
        pairs = [es_dumps(k) + ": " + es_dumps(v) for (k, v) in obj.items()]
        obj_str += '{' + ", ".join(pairs) + '}'
    return obj_str
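One way to sanity-check a generator such as es_dumps is to feed its output to the standard library's json.loads and compare the result with the input. The helper below is a sketch with a hypothetical name, round_trips; it accepts any dumps-like function, so it can be tried with json.dumps as a stand-in.

```python
import json


def round_trips(dumps, obj):
    """Return True if json.loads can read dumps(obj) back into an equal object."""
    try:
        return json.loads(dumps(obj)) == obj
    except ValueError:  # raised by json.loads for text that is not valid JSON
        return False
```

For example, round_trips(str, True) is False because str(True) produces "True", which is not valid JSON, while round_trips(json.dumps, True) is True.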

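Performance analysis with cProfile

The cProfile module is imported to profile loading a JSON file of population data for China's 34 provincial-level regions: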

import cProfile
from jsonparser import *
import json

cProfile.run("print(es_load(\"china.json\"))")

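Parts of the code were changed to use Python built-ins, and the context structure was optimized: copying a string is significantly faster than copying a list. The running time dropped from 20 s to 1 s.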
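To find out which functions dominate the running time, the profiler output can be sorted by cumulative time with the standard pstats module. The helper below is a sketch written for Python 3; profile_call is a hypothetical name, and json.loads stands in for es_load because neither the parser nor china.json is available here.

```python
import cProfile
import io
import json
import pstats


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the ten entries with the largest cumulative time first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


if __name__ == "__main__":
    value, report = profile_call(json.loads, '{"year": 2009, "weight": 1.8}')
    print(report)
```

Sorting by cumulative time puts the entry points at the top, which makes it easier to see whether time is spent in the parse functions themselves or in helpers such as repeated slicing of the input.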
