Python crawler Development "1th" "JSON and Jsonpath"

Source: Internet
Author: User
Tags xpath

JSON (JavaScript Object Notation) is a lightweight data interchange format that makes it easy for people to read and write. It also facilitates the analysis and generation of machines. Suitable for data interaction scenarios, such as data interaction between the foreground and background of a website.

Official Document: Http://docs.python.org/library/json.html

JSON online parsing website: http://www.json.cn/#

Json

JSON is simply called objects and arrays in JavaScript, so these two structures are objects and arrays of two structures that can represent a variety of complex structures

    1. Object: The object is represented in JS as the { } enclosed content, the { key:value, key:value, ... } structure of the key-value pairs, in the object-oriented language, key is the property of the object, value is the corresponding property value, so it is easy to understand, the value of the method is the object. Key Gets the property value, The type of this property value can be a number, a string, an array, an object.

    2. Array: The array in JS is enclosed in parentheses [ ] , the data structure is ["Python", "javascript", "C++", ...] , the value is the same as in all languages, using index get, the type of field value can be number, string, array, object several.

1. Import JSON

The JSON module provides four functions:,,, dumps dump loads load for converting between string and Python data types.

1.1, Json.loads ()

Convert the JSON format string decoding to a Python object from JSON to Python type conversions against the following:

# Json_loads.pyimport jsonstrlist = ' [1, 2, 3, 4] ' strdict = ' {"City": "Beijing", "name": "Big Cat"} ' Json.loads (strlist) # [1, 2, 3, 4]json.loads (strdict) # JSON data is automatically stored by Unicode # {u ' city ': U ' \u5317\u4eac ', U ' name ': U ' \u5927\u732b '}
1.2, Json.dumps ()

Implement the Python type into a JSON string and return a Str object to convert a Python object encoding into a JSON string

The conversion from the Python primitive type to the JSON type is compared to the following:

# json_dumps.pyimport Jsonimport chardetliststr = [1, 2, 3, 4]tuplestr = (1, 2, 3, 4) Dictstr = {"City": "Beijing", "name": "Big cat" }json.dumps (LISTSTR) # ' [1, 2, 3, 4] ' Json.dumps (tuplestr) # ' [1, 2, 3, 4] ' # Note: json.dumps () default ASCII encoding used when serializing # Add parameter ensure_as Cii=false disables ASCII encoding and returns the dictionary by Utf-8 Encoding # Chardet.detect (), where confidence is detection accuracy json.dumps (dictstr) # ' {"City": "\\u5317\\ U4eac "," name ":" \\u5927\\u5218 "} ' Chardet.detect (Json.dumps (DICTSTR)) # {' Confidence ': 1.0, ' Encoding ': ' ASCII '}print Json.dumps (Dictstr, Ensure_ascii=false) # {"City": "Beijing", "name": "Da Liu"}chardet.detect (Json.dumps (DICTSTR, Ensure_ascii =false) # {' Confidence ': 0.99, ' encoding ': ' Utf-8 '}
1.3, Json.dump ()

Serializing a python built-in type to a JSON object after writing to a file

# Json_dump.pyimport jsonliststr = [{"City": "Beijing"}, {"Name": "Big Liu"}]json.dump (Liststr, open ("Liststr.json", "W"), Ensure_ Ascii=false) Dictstr = {"City": "Beijing", "name": "Big Liu"}json.dump (Dictstr, open ("Dictstr.json", "W"), Ensure_ascii=false)
1.4, Json.load ()

Read a JSON-like string element in a file into a Python type

# Json_load.pyimport jsonstrlist = json.load (Open ("Liststr.json")) print strlist# [{u ' city ': U ' \u5317\u4eac '}, {u ' name ' : U ' \u5927\u5218 '}]strdict = json.load (Open ("Dictstr.json")) print strdict# {u ' city ': U ' \u5317\u4eac ', U ' name ': U ' \ u5927\u5218 '}
JsonPath

JsonPath is an information extraction class library that extracts specified information from JSON documents and provides multiple language implementations, including: Javascript, Python, PHP, and Java.

: Https://pypi.python.org/pypi/jsonpath

Installation method: Click on Download URL the link to download Jsonpath, after decompression to executepython setup.py install

Official Document: Http://goessner.net/articles/JsonPath

Jsonpath vs. XPath syntax:

The JSON structure is clear, readable, complex, and very easy to match, and the following table corresponds to the use of XPath.

XPath JSONPath Description
/ $ Root node
. @ Current node
/ .Or[] Take child nodes
.. N/A Take parent node, Jsonpath not supported
// .. That is, regardless of location, select all eligible conditions
* * Match all ELEMENT nodes
@ N/A JSON is not supported based on property access because JSON is a key-value recursive structure and is not required.
[] [] Iterator indicator (can be used to do simple iterative operations, such as array subscript, based on content selection, etc.)
| [,] Supports long selection in iterators.
[] ?() Supports filtering operations.
N/A () Supports expression evaluation
() N/A Grouping, Jsonpath not supported

Jsonpath Crawl Pull Hook net case:

Address: Http://www.lagou.com/lbs/getAllCitySearchLabels.json

Goal: Get all cities.

# jsonpath_lagou.pyimport Urllib2import jsonpathimport jsonimport chardeturl = ' http://www.lagou.com/lbs/ Getallcitysearchlabels.json ' request =urllib2. Request (URL) response = Urllib2.urlopen (request) HTML = Response.read () # Converts a JSON-formatted string into a Python object jsonobj = json.loads (html # starting from the root node, match the name node CityList = Jsonpath.jsonpath (jsonobj, ' $ '). Name ') Print Citylistprint type (citylist) fp = open (' City.json ', ' w ') content = Json.dumps (citylist, Ensure_ascii=false) Print Contentfp.write (Content.encode (' Utf-8 ')) Fp.close ()
Precautions:

Json.loads () is to convert the JSON format string decoding to a Python object, and if there is an error in json.loads, be aware of the encoding of the decoded JSON character.

If the encoding of the passed-in string is not UTF-8, you need to specify a character-encoded parameterencoding

dataDict = json.loads(jsonStrGBK);

Datajsonstr is a JSON string, assuming its encoding itself is non-UTF-8 words but GBK, then the above code will cause an error, instead of the corresponding:

    • dataDict = json.loads(jsonStrGBK, encoding="GBK");

If DATAJSONSTR specifies the appropriate encoding through encoding, but contains other encoded characters, you need to convert the DATAJSONSTR to Unicode before specifying the encoding format to call Json.loads ()

    • Datajsonstruni = Datajsonstr.decode ("GB2312"); Datadict = Json.loads (Datajsonstruni, encoding= "GB2312");
##字符串编码转换####任何平台的任何编码 都能和 Unicode 互相转换UTF-8 与 GBK 互相转换,那就先把UTF-8转换成Unicode,再从Unicode转换成GBK,反之同理。``` python # 这是一个 UTF-8 编码的字符串utf8Str = "你好地球"# 1. 将 UTF-8 编码的字符串 转换成 Unicode 编码unicodeStr = utf8Str.decode("UTF-8")# 2. 再将 Unicode 编码格式字符串 转换成 GBK 编码gbkData = unicodeStr.encode("GBK")# 1. 再将 GBK 编码格式字符串 转化成 UnicodeunicodeStr = gbkData.decode("gbk")# 2. 再将 Unicode 编码格式字符串转换成 UTF-8utf8Str = unicodeStr.encode("UTF-8")

decodeTo convert other encoded strings to Unicode encoding

encodeThe role of converting Unicode encoding to another encoded string

 

Python crawler Development "1th" "JSON and Jsonpath"

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.