Python Instant Web Crawler: API Description

Through this API you can directly obtain a tested extraction script: a standard XSLT program that you simply run on the DOM of the target web page to get the results as XML, with every field extracted in one pass. The sections below describe the API for downloading the gsExtractor content extraction tool.

1. Interface Name

Download Content Extraction Tool

2. Interface Description

If you set out to write a web crawler, you will find that most of the time goes into debugging the rules for extracting web page content. Never mind how awkward regular expression syntax is; even if you use XPath, every expression still has to be written and debugged one by one.

If you want to extract many fields from a page, debugging the XPath expressions one by one takes a great deal of time. Through this interface you can directly obtain a tested extraction script, which is a standard XSLT program: run it on the DOM of the target web page and you get the results as XML, with all fields extracted in one pass.
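
A downloaded extractor can be applied with any XSLT 1.0 processor. The sketch below is only an illustration: it assumes the lxml library is installed and uses a hypothetical file name, extractor.xsl, for the saved script; neither is part of the API itself.

# A minimal sketch, assuming lxml is installed and the XSLT program
# returned by the API has been saved as "extractor.xsl" (hypothetical name).
from lxml import etree, html

# Build the DOM of the target web page.
doc = html.parse('http://example.com/target-page.html')

# Load the downloaded XSLT program and apply it to the DOM.
transform = etree.XSLT(etree.parse('extractor.xsl'))
result = transform(doc)

# The result is an XML document containing all extracted fields at once.
print(etree.tostring(result, pretty_print=True, encoding='unicode'))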

The XSLT extractor can be one that you generated yourself with MS (the rule-definition tool) or one that someone else has shared with you; as long as you have read permission, you can download and use it.

In web crawlers used for data analysis and data mining, content extraction is the key obstacle to generality. If the extractor is obtained from this API, your crawler program can be written as a general-purpose framework.

3. Interface Specifications

3.1 Interface Address (URL)

http://www.gooseeker.com/api/getextractor

3.2 Request Type (contentType)

Not restricted

3.3 Request Method

HTTP GET

3.4 Request Parameters

key: Required: Yes; Type: String; Description: the AppKey allocated when you applied for the API.

theme: Required: Yes; Type: String; Description: the extractor name, i.e. the rule name defined in MS.

Rule number: Required: No; Type: String; Description: if multiple rules are defined under the same rule name, enter the rule number.

bname: Required: No; Type: String; Description: the name of the sorting box; if the rule contains more than one sorting box, enter the sorting box name.
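
Putting the parameters together, the request is a plain GET with a query string. The sketch below is only an illustration: the key and theme values are placeholders, and only the parameters listed above are included.

# A minimal sketch of composing the request URL; 'your_appkey' and
# 'your_theme' are placeholders, not real credentials.
from urllib.parse import urlencode

params = {
    'key': 'your_appkey',    # AppKey allocated when you applied for the API
    'theme': 'your_theme',   # extractor name, i.e. the rule name defined in MS
    # the optional rule-number and bname parameters can be added here when
    # a rule name has more than one rule or sorting box
}
url = 'http://www.gooseeker.com/api/getextractor?' + urlencode(params)
print(url)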

3.5 Return Type (contentType)

text/xml; charset=UTF-8

3.6 Return Parameters

The following parameter is returned in the HTTP message header:

more-extractor: Type: String; Description: the number of extractors under the same rule name. You usually only need to pay attention to this parameter when the optional request parameters were left blank: it tells the client that the rule has multiple rule numbers or sorting boxes, so the client can decide whether to resend the request with explicit parameters.
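
A client can read that header before deciding whether a follow-up request is needed. The sketch below assumes the header name exactly as documented above (HTTP header names are case-insensitive) and uses placeholder query values.

# A minimal sketch of checking the more-extractor response header;
# the query values are placeholders.
from urllib import request

url = 'http://www.gooseeker.com/api/getextractor?key=your_appkey&theme=your_theme'
resp = request.urlopen(url)
more = resp.headers.get('more-extractor')   # header lookup is case-insensitive
if more is not None and int(more) > 1:
    # several extractors exist under this rule name; repeat the request with
    # the optional parameters filled in to select a specific one
    pass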

3.7 Error Messages

Message-layer errors are returned as HTTP 400, for example when the parameters in the URL do not comply with this specification.

Application-layer errors are returned with HTTP 200 OK; in that case the message body is a small XML document whose element text is the specific error code.

The specific code values are as follows:

keyError: permission verification failed.

paramError: a parameter in the URL is incorrect, for example a wrong parameter name or value.
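
In practice a client needs to handle both levels. The sketch below is an illustration only: the query values are placeholders, and because the exact XML element names of the error body are not reproduced above, it simply looks for the documented error codes in the response text.

# A minimal sketch of handling both error levels; query values are
# placeholders and the error-body check is deliberately loose because the
# exact XML element names are not shown in this document.
from urllib import request, error

url = 'http://www.gooseeker.com/api/getextractor?key=your_appkey&theme=your_theme'
try:
    resp = request.urlopen(url)
except error.HTTPError as e:
    # message-layer error, e.g. URL parameters that violate the specification
    print('HTTP error:', e.code)
else:
    body = resp.read().decode('utf-8')
    # application-layer errors come back as HTTP 200 OK with an XML error body
    if 'keyError' in body or 'paramError' in body:
        print('API error returned:', body)
    else:
        print('extractor downloaded, %d characters' % len(body))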

4. Usage Example (Python)

Sample code:

# -*- coding: utf-8 -*-
from urllib import request

url = 'http://www.gooseeker.com/api/getextractor?key=your_key&theme=your_extractor_name'
resp = request.urlopen(url)
content = resp.read()
if content:
    print(content)
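
If the call succeeds, content holds the XSLT extraction program itself; saving it to a file (for example the hypothetical extractor.xsl used in the sketch in section 2) lets you apply it to the DOM of any target page and collect all fields as XML in one pass.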

Next, I will test this API.
