Crawler: the json module and the jsonpath module
JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for people to read and write and easy for machines to parse and generate. It suits data-exchange scenarios such as communication between the front end and back end of a website.
JSON is comparable to XML.
Python 3.x ships with the json module in its standard library; it can be used directly with import json.
Official documentation: http://docs.python.org/library/json.html
Online JSON parser: http://www.json.cn/#
JSON
JSON is built from two JavaScript structures: objects and arrays. Nesting these two structures can represent arbitrarily complex data.
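As a small illustration of how objects and arrays nest, the sketch below parses a made-up document with the standard json module. The field names ("country", "cities", "name") are hypothetical and chosen only for this example.

```python
import json

# Hypothetical sample document: an object whose "cities" member is an
# array of objects. The field names are made up for illustration.
text = '{"country": "China", "cities": [{"name": "Beijing"}, {"name": "Shanghai"}]}'

data = json.loads(text)

# A JSON object becomes a dict; a JSON array becomes a list.
print(type(data).__name__)            # dict
print(type(data["cities"]).__name__)  # list

# Nested values are reached by chaining key and index lookups.
print(data["cities"][1]["name"])      # Shanghai
```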
Json Module
The json module provides four functions, dumps, dump, loads, and load, for converting between JSON strings and Python data types.
1. json.dumps()
Converts a Python object to a JSON string and returns a str object. Python types map to JSON types as follows:
Python         JSON
dict           object
list, tuple    array
str            string
int, float     number
True           true
False          false
None           null
    #!/usr/bin/python3
    # -*- coding: utf-8 -*-
    __author__ = 'mayi'

    import json

    listStr = [1, 2, 3, 4]
    tupleStr = (1, 2, 3, 4)
    dictStr = {"city": "Beijing", "name": "ant"}

    print(json.dumps(listStr))
    # [1, 2, 3, 4]
    print(type(json.dumps(listStr)))
    # <class 'str'>

    print(json.dumps(tupleStr))
    # [1, 2, 3, 4]
    print(type(json.dumps(tupleStr)))
    # <class 'str'>

    # Note: json.dumps() escapes non-ASCII characters by default when serializing.
    # Pass ensure_ascii=False to disable the escaping and keep the characters as they are.
    print(json.dumps(dictStr, ensure_ascii=False))
    # {"city": "Beijing", "name": "ant"}
    print(type(json.dumps(dictStr, ensure_ascii=False)))
    # <class 'str'>
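The effect of ensure_ascii only shows up when the data contains non-ASCII characters. The following sketch uses a sample dictionary of its own (not taken from the listing above) to compare the default output with ensure_ascii=False, and also shows the optional sort_keys and indent parameters of json.dumps().

```python
import json

data = {"name": "ant", "city": "北京"}

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(data)
print(escaped)   # {"name": "ant", "city": "\u5317\u4eac"}

# ensure_ascii=False keeps the characters as they are.
readable = json.dumps(data, ensure_ascii=False)
print(readable)  # {"name": "ant", "city": "北京"}

# sort_keys and indent are optional formatting parameters of json.dumps().
pretty = json.dumps(data, ensure_ascii=False, sort_keys=True, indent=2)
print(pretty)
```

Both strings decode back to the same Python value; the choice only affects how the text looks on disk or on the wire.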
2. json.dump()
Serializes a Python built-in type to JSON and writes the result to a file.
    #!/usr/bin/python3
    # -*- coding: utf-8 -*-
    __author__ = 'mayi'

    import json

    listStr = [{"city": "Beijing"}, {"name": "ant"}]
    # The with-block closes the file once the data has been written
    with open("listStr.json", "w", encoding="utf-8") as f:
        json.dump(listStr, f, ensure_ascii=False)

    dictStr = {"city": "Beijing", "name": "ant"}
    with open("dictStr.json", "w", encoding="utf-8") as f:
        json.dump(dictStr, f, ensure_ascii=False)
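To check what json.dump() actually puts in the file, one option is to write into a temporary directory and read the raw text back. This is a minimal sketch; the file name records.json and the variable names are assumptions made for the example.

```python
import json
import os
import tempfile

records = [{"city": "Beijing"}, {"name": "ant"}]

# Write into a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "records.json")

    # The with-block closes (and flushes) the file even if dump() raises.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)

    # Read the raw text back to see exactly what was written.
    with open(path, "r", encoding="utf-8") as f:
        written = f.read()

print(written)  # [{"city": "Beijing"}, {"name": "ant"}]
```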
3. json.loads()
Decodes a JSON string and converts it to a Python object. JSON types map to Python types as follows:
JSON             Python
object           dict
array            list
string           str
number (int)     int
number (real)    float
true             True
false            False
null             None
    #!/usr/bin/python3
    # -*- coding: utf-8 -*-
    __author__ = 'mayi'

    import json

    strList = '[1, 2, 3, 4]'
    strDict = '{"city": "Beijing", "name": "ant"}'

    print(json.loads(strList))
    # [1, 2, 3, 4]

    # JSON strings are decoded to Python str objects
    print(json.loads(strDict))
    # {'city': 'Beijing', 'name': 'ant'}
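The type mapping in the table can be checked directly, and it is also worth seeing what happens with invalid input. The sketch below uses a made-up JSON string; the key names are hypothetical.

```python
import json

text = '{"count": 3, "ratio": 0.5, "ok": true, "missing": null, "tags": ["a", "b"]}'
parsed = json.loads(text)

# Each JSON type maps to the Python type listed in the table.
kinds = {key: type(value).__name__ for key, value in parsed.items()}
print(kinds)
# {'count': 'int', 'ratio': 'float', 'ok': 'bool', 'missing': 'NoneType', 'tags': 'list'}

# Malformed input raises json.JSONDecodeError (a subclass of ValueError).
try:
    json.loads("{'city': 'Beijing'}")  # single quotes are not valid JSON
    error = None
except json.JSONDecodeError as exc:
    error = exc.msg

print(error)
```

A common source of this error in crawlers is passing the repr of a Python dict, which uses single quotes, instead of real JSON text.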
4. json.load()
Reads a JSON string from a file and converts it to a Python object.
    #!/usr/bin/python3
    # -*- coding: utf-8 -*-
    __author__ = 'mayi'

    import json

    # The with-block closes each file once it has been read
    with open("listStr.json", "r", encoding="utf-8") as f:
        strList = json.load(f)
    print(strList)
    # [{'city': 'Beijing'}, {'name': 'ant'}]

    with open("dictStr.json", "r", encoding="utf-8") as f:
        strDict = json.load(f)
    print(strDict)
    # {'city': 'Beijing', 'name': 'ant'}
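json.load() accepts any object with a read() method, not only files opened from disk. As a minimal sketch, io.StringIO stands in for an opened text file here, which also makes the relationship between load() and loads() visible.

```python
import io
import json

# io.StringIO stands in for an opened text file here.
fake_file = io.StringIO('[{"city": "Beijing"}, {"name": "ant"}]')

# json.load() calls .read() on the object and parses the result.
items = json.load(fake_file)
print(items)  # [{'city': 'Beijing'}, {'name': 'ant'}]

# Rewinding and parsing the text with json.loads() gives the same value.
fake_file.seek(0)
same_items = json.loads(fake_file.read())
print(items == same_items)  # True
```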
JsonPath
JsonPath is a library for extracting specified information from JSON documents. Implementations are available in several languages, including JavaScript, Python, PHP, and Java.
JsonPath is to JSON what XPath is to XML.
- Download: https://pypi.python.org/pypi/jsonpath
- Installation: decompress the package and run python setup.py install.
- Official documentation: http://goessner.net/articles/JsonPath
Comparison of JsonPath and XPath syntax:
JsonPath syntax is clearly structured, readable, low in complexity, and easy to match against. The following table maps each XPath construct to its JsonPath counterpart.
XPath    JSONPath    Description
/        $           Root node
.        @           Current node
/        . or []     Child node
..       N/A         Parent node; not supported by JsonPath
//       ..          Selects all matching nodes regardless of their location
*        *           Matches all element nodes
@        N/A         Attribute access; not supported by JsonPath
[]       []          Iterator (simple operations can be performed inside, such as array subscripts and selection by content)
|        [,]         Multiple selections inside the iterator
[]       ?()         Filter expression
N/A      ()          Script expression
()       N/A         Grouping; not supported by JsonPath
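To make the operators in the table more concrete, the sketch below spells out in plain Python what a few JsonPath expressions select. It does not use the jsonpath package and is not how that package is implemented; find_all is a hypothetical helper and the sample data is made up. The order of results returned by a real JsonPath implementation may differ.

```python
def find_all(node, key):
    """Collect every value stored under `key`, at any depth (like `$..key`)."""
    found = []
    if isinstance(node, dict):
        for k, value in node.items():
            if k == key:
                found.append(value)
            found.extend(find_all(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found


# Hypothetical sample data, made up for illustration.
doc = {
    "store": {
        "book": [
            {"title": "A", "price": 8},
            {"title": "B", "price": 12},
        ],
        "bicycle": {"price": 20},
    }
}

# $.store.book[0].title : child access from the root
first_title = doc["store"]["book"][0]["title"]
# $..price : recursive descent
all_prices = find_all(doc, "price")
# $.store.book[?(@.price < 10)] : filter on the current node
cheap_books = [b for b in doc["store"]["book"] if b["price"] < 10]
# $.store.book[0,1] : multiple subscripts
picked = [doc["store"]["book"][i] for i in (0, 1)]

print(first_title)  # A
print(all_prices)   # [8, 12, 20]
print(cheap_books)  # [{'title': 'A', 'price': 8}]
```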
Example:
Take Lagou's city JSON file, http://www.lagou.com/lbs/getAllCitySearchLabels.json, as an example and extract all the city names.
    #!/usr/bin/python3
    # -*- coding: utf-8 -*-
    __author__ = 'mayi'

    import urllib.request
    import json
    import jsonpath

    # URL of the city JSON file
    url = 'http://www.lagou.com/lbs/getAllCitySearchLabels.json'
    # User-Agent header
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}

    # Build a Request together with the headers; it carries Chrome's User-Agent
    request = urllib.request.Request(url, headers=header)
    # Send the request to the server
    response = urllib.request.urlopen(request)
    # Get the page content as bytes
    html = response.read()
    # Decode: bytes to str
    html = html.decode("utf-8")

    # Convert the JSON string to a Python object
    obj = json.loads(html)
    # Match every name node, starting from the root
    city_list = jsonpath.jsonpath(obj, '$..name')

    # Print the matched name nodes
    print(city_list)
    # Print their type
    print(type(city_list))

    # Write the result to a local file
    with open("city.json", "w", encoding="utf-8") as f:
        content = json.dumps(city_list, ensure_ascii=False)
        f.write(content)
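If the jsonpath package is not available, a similar result can be sketched with the standard library alone by using the object_hook parameter of json.loads(). This is an alternative technique, not what the example above does. The function collect_names and the sample response body are hypothetical; the real structure of the Lagou file is not shown in this article, and the order of results may differ from what jsonpath returns.

```python
import json


def collect_names(raw_bytes):
    """Decode a JSON response body and return every "name" value in it."""
    names = []

    def remember_name(obj):
        # Called once for each JSON object as it is decoded.
        if "name" in obj:
            names.append(obj["name"])
        return obj

    json.loads(raw_bytes.decode("utf-8"), object_hook=remember_name)
    return names


# Hypothetical response body, made up for illustration.
sample = b'{"content": {"rows": [{"name": "Beijing", "id": 1}, {"name": "Shanghai", "id": 2}]}}'
print(collect_names(sample))  # ['Beijing', 'Shanghai']
```

Because the hook runs while the document is being parsed, no second pass over the decoded data is needed.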