Through this API, you can directly obtain a tested extraction script, which is a standard XSLT program. You only need to run it on the DOM of the target web page to obtain the results in XML format, with all fields extracted at once. This document describes the API for downloading the GsExtractor content extraction tool.
1. Interface name
Download Content Extraction Tool
2. Interface Description
If you write a web crawler, you will find that most of your time is spent debugging content-extraction rules. Leaving aside how cryptic regular-expression syntax is, even with XPath you must write and debug the expressions one by one.
If you want to extract many fields from a web page, debugging the XPath expressions one by one takes a lot of time. Through this interface, you can directly obtain a tested extraction script, which is a standard XSLT program. You only need to run it on the DOM of the target web page to obtain the results in XML format, with all fields extracted at once.
This XSLT extractor can be generated by yourself using MS, or shared by others; you can download and use it as long as you have read permission.
In web crawlers used for data analysis and data mining, content extraction is the key obstacle to generality. If the extractor is obtained from this API, your crawler program can be written as a general-purpose framework.
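To make the idea concrete, here is a minimal sketch of how a downloaded XSLT extractor would be applied to a page DOM. The stylesheet below is a stand-in for illustration, not a real GsExtractor script, and lxml is used because Python's standard library has no XSLT engine.

```python
# Sketch: running an XSLT extractor on a page DOM with lxml.
# The stylesheet is a hypothetical stand-in, NOT an actual GsExtractor script.
from lxml import etree

# Hypothetical extractor: pulls the first headline into an XML result.
xslt_doc = etree.XML(b"""
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <item>
      <title><xsl:value-of select="//h1"/></title>
    </item>
  </xsl:template>
</xsl:stylesheet>
""")
transform = etree.XSLT(xslt_doc)

# Parse the target page into a DOM and run the extractor on it;
# the result is XML with all fields produced in one pass.
page = etree.HTML("<html><body><h1>Hello</h1><p>body text</p></body></html>")
result = transform(page)
print(str(result))
```

Because the extraction logic lives entirely in the downloaded stylesheet, the surrounding crawler code never changes from site to site.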
3. Interface Specifications
3.1 Interface address (URL)
http://www.gooseeker.com/api/getextractor
3.2 Request type (contentType)
Not restricted
3.3 Request method
HTTP GET
3.4 Request parameters
key: Required: Yes; Type: String; Description: the AppKey allocated when you applied for the API
theme: Required: Yes; Type: String; Description: the extractor name, i.e. the rule name defined in MS
Required: No; Type: String; Description: the rule number; if multiple rules are defined under the same rule name, enter the number to select one
bname: Required: No; Type: String; Description: the name of the sorting box; if the rule contains multiple sorting boxes, enter the box name to select one
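The parameters above can be assembled into a request URL with the standard library. The key and theme values below are placeholders, and the optional parameters would be added only when you need to disambiguate between multiple rules or sorting boxes.

```python
# Sketch: building the getextractor request URL from the documented parameters.
# "YOUR_APPKEY" and "your_rule_name" are placeholders, not real values.
from urllib.parse import urlencode

params = {
    "key": "YOUR_APPKEY",       # AppKey allocated when you applied for the API
    "theme": "your_rule_name",  # rule name defined in MS
}
# Optional parameters (rule number, sorting-box name) would be added here
# only when the rule is ambiguous, e.g. params["bname"] = "..."

url = "http://www.gooseeker.com/api/getextractor?" + urlencode(params)
print(url)
```

Using urlencode keeps the query string correctly escaped even when a rule name contains spaces or non-ASCII characters.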
3.5 Return type (contentType)
text/xml; charset=UTF-8
3.6 Return parameters
The following parameter is returned in the HTTP message header:
more-extractor: Type: String; Description: the number of extractors under the same rule name. You normally only need to pay attention to this parameter when the optional parameters were omitted: it prompts the client that multiple rules or sorting boxes exist, and the client decides whether to resend the request with explicit parameters.
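A client could use that header as follows. This is a sketch of one plausible client-side policy, not behavior specified by the API; only the header name comes from the table above.

```python
# Sketch: deciding whether to repeat the request with explicit optional
# parameters, based on the more-extractor response header. The threshold
# logic is an assumption about how a client would use the header.
def needs_explicit_params(headers):
    """Return True when the server reports more than one extractor under
    the requested rule name, i.e. the optional parameters are needed."""
    value = headers.get("more-extractor", "1")
    try:
        return int(value) > 1
    except ValueError:
        return False

print(needs_explicit_params({"more-extractor": "3"}))  # True
print(needs_explicit_params({}))                       # False
```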
3.7 Returned error messages
Message-layer errors are returned as HTTP 400, for example when the parameters in the URL do not comply with this specification.
Application-layer errors are returned with HTTP 200 OK; the specific error code is carried in the message body as an XML document, with the following structure:
Specific error codes
The code values are as follows:
keyError: permission verification failed
paramError: a parameter in the URL is incorrect, for example a wrong parameter name or value
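A client therefore has to handle two layers of errors. The sketch below assumes a hypothetical `<error><code>...</code></error>` body layout, since the original document does not show the actual XML structure.

```python
# Sketch: distinguishing message-layer (HTTP 400) from application-layer
# (HTTP 200 + XML body) errors. The <error><code> layout is an ASSUMPTION;
# the real structure is not shown in the documentation.
import xml.etree.ElementTree as ET

def extract_error_code(status, body):
    """Return an error code such as 'keyError' or 'paramError', or None
    when the body carries no recognizable application-layer error."""
    if status == 400:
        return "messageLayerError"  # placeholder label for HTTP 400
    root = ET.fromstring(body)
    return root.findtext("code")    # assumed location of the error code

print(extract_error_code(200, "<error><code>keyError</code></error>"))
```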
4. Usage example (Python)
Sample code:

# -*- coding: utf-8 -*-
from urllib import request

url = 'http://www.gooseeker.com/api/getextractor?key=your_key&theme=your_extractor_name'
resp = request.urlopen(url)
content = resp.read()
if content:
    print(content)
Next, I will test this API.