Crawler Introduction
Crawler: You can think of the Internet as a large web, and a crawler as a spider moving across that web; whatever resource you want from the web, you can crawl it down.
In short, a crawler is an automated program that requests websites and extracts data from them.
The basic workflow of a crawler:
- Initiate a request: send a request to the target site through an HTTP library; the request can carry additional information such as headers. Then wait for the server to respond.
- Get the response content: if the server responds normally, you get a response whose content is the page you requested; it may be HTML, a JSON string, binary data, or another type.
- Parse the content: HTML can be parsed with regular expressions or a web page parsing library; JSON can be converted directly into a Python object; binary data can be saved or processed further.
- Save the data: the data can be saved in many ways, such as writing it out as text, storing it in a database, or saving it as a file in a specific format.
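The four steps above can be sketched end to end. In this sketch the network fetch of step 1 is stubbed with a hardcoded page so the example runs on its own; a real crawler would obtain the HTML from an HTTP library instead.

```python
import json
import re

# Step 1 (initiate a request) is stubbed with a hardcoded page so the
# sketch needs no network access; a real crawler would fetch this HTML.
html = '<html><body><a href="/news/1.html">First</a><a href="/news/2.html">Second</a></body></html>'

# Step 2: "receive" the response content (here, the stub above).
# Step 3: parse the content; a regular expression pulls out every link.
links = re.findall(r'href="([^"]+)"', html)

# Step 4: save the extracted data, e.g. as a JSON text file.
with open('links.json', 'w') as f:
    json.dump(links, f)

print(links)  # ['/news/1.html', '/news/2.html']
```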
The request and response process:
(1) The browser sends a message to the server that hosts the URL; this process is called the HTTP Request.
(2) After receiving the browser's message, the server processes it according to its content, then sends a message back to the browser; this process is called the HTTP Response.
(3) When the browser receives the response from the server, it processes and displays the information.
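To make the exchange concrete, here is a hand-built HTTP request and response as plain strings; real traffic on the wire has the same line-based shape, with a status line, headers, a blank line, and then the body.

```python
# A hand-built HTTP request message, as the browser would send it.
request_message = (
    'GET /index.html HTTP/1.1\r\n'
    'Host: www.example.com\r\n'
    'User-Agent: demo-crawler\r\n'
    '\r\n'
)

# A hand-built HTTP response message, as the server would return it.
response_message = (
    'HTTP/1.1 200 OK\r\n'
    'Content-Type: text/html\r\n'
    '\r\n'
    '<html>hello</html>'
)

# The status line carries the code the browser reacts to, and
# everything after the blank line is the body it renders.
status_line, rest = response_message.split('\r\n', 1)
headers_part, body = rest.split('\r\n\r\n', 1)
print(status_line)  # HTTP/1.1 200 OK
print(body)         # <html>hello</html>
```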
The Request:
- Request method: mainly GET and POST, plus HEAD, PUT, DELETE, OPTIONS, and so on.
- Request URL: URL stands for uniform resource locator; a web page, a picture, a video, and so on can each be uniquely identified by a URL.
- Request headers: the header information carried with the request, such as User-Agent, Host, cookies, etc.
- Request body: additional data carried with the request, such as the form data when a form is submitted.
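The parts of a request can be seen by building one without sending it. The standard-library `urllib.request.Request` is used here purely for illustration (the rest of this article uses the `requests` library); the URL and header values are made up.

```python
import urllib.parse
import urllib.request

# Encode some form data for the request body.
body = urllib.parse.urlencode({'username': 'user'}).encode()

# Build (but do not send) a request, to show where each part lives.
req = urllib.request.Request(
    url='http://www.example.com/login',       # the request URL
    data=body,                                # the request body (form data)
    headers={'User-Agent': 'demo-crawler'},   # the request headers
)

print(req.get_method())  # POST (implied because a body is present)
print(req.full_url)
print(req.data)
print(req.headers)
```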
The Response:
- Response status: there are many status codes, such as 200 for success, 301 for a redirect, 404 for page not found, 502 for a server error, etc.
- Response headers: such as the content type, content length, server information, cookie settings, and so on.
- Response body: the most important part; it contains the content of the requested resource, such as the page HTML, a picture, binary data, etc.
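The standard library already knows the status codes listed above, so their official phrases can be looked up directly:

```python
from http import HTTPStatus

# Look up the standard phrase for each status code mentioned above.
# Note that 301 is a permanent redirect and 502 is "Bad Gateway".
for code in (200, 301, 404, 502):
    status = HTTPStatus(code)
    print(code, status.phrase)
```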
Simple example:
```python
import requests

response = requests.get('http://www.baidu.com')
print(response.text)         # get the response body
print(response.headers)      # get the response headers
print(response.status_code)  # get the status code
```
What kind of data can be crawled?
- Web page text, such as HTML documents, JSON-formatted text, and so on.
- Pictures: what you get is a binary file, which you save in an image format.
- Video: likewise a binary file, saved in a video format.
- Anything else, as long as it can be obtained with a request.
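The distinction above matters when saving: text content is decoded and written in text mode, while pictures and video are raw bytes and must be written in binary mode. The stand-in values below play the role of a response's `.text` and `.content`.

```python
# Stand-ins for what response.text and response.content would hold.
page_text = '<html>hello</html>'    # decoded text (e.g. an HTML document)
image_bytes = b'\x89PNG\r\n\x1a\n'  # raw bytes (e.g. a PNG header)

# Text is written in text mode.
with open('page.html', 'w') as f:
    f.write(page_text)

# Binary data must be written in binary mode ('wb').
with open('image.png', 'wb') as f:
    f.write(image_bytes)

# The bytes round-trip unchanged.
print(open('image.png', 'rb').read()[:4])
```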
Data processing:
- Direct processing
- JSON parsing
- Regular expressions
- BeautifulSoup
- pyquery
- XPath
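Three of these approaches can be shown side by side on small sample inputs. JSON converts directly into Python objects, a regular expression works on the raw HTML string, and a structural parser walks the HTML as a tree (the standard-library `HTMLParser` stands in here for BeautifulSoup, which does the same job with far less code on your side).

```python
import json
import re
from html.parser import HTMLParser

sample_html = '<div id="i1"><a href="/item/1">First</a></div>'
sample_json = '{"title": "First", "id": 1}'

# JSON text converts directly into a Python object.
obj = json.loads(sample_json)
print(obj['title'])  # First

# A regular expression pulls the link out of the raw HTML string.
print(re.search(r'href="([^"]+)"', sample_html).group(1))  # /item/1

# A small HTMLParser subclass walks the same HTML structurally.
class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.append(dict(attrs).get('href'))

parser = LinkParser()
parser.feed(sample_html)
print(parser.links)  # ['/item/1']
```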
How to save the data:
- Text: plain text, JSON, XML, etc.
- Relational databases such as MySQL, Oracle, etc.
- Non-relational databases such as MongoDB and Redis, which store data as key-value pairs.
- Binary files: pictures, videos, audio, and so on, saved directly in the desired format.
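The first two options can be sketched with the standard library alone: a JSON text file for plain-text storage, and `sqlite3` standing in for a relational database like MySQL or Oracle, since it ships with Python and needs no server. The table and field names are made up for the example.

```python
import json
import sqlite3

# Some crawled records to save (made-up sample data).
rows = [{'title': 'First', 'href': '/item/1'},
        {'title': 'Second', 'href': '/item/2'}]

# Plain-text storage: one JSON file.
with open('items.json', 'w') as f:
    json.dump(rows, f)

# Relational storage: sqlite3 stands in for MySQL/Oracle here.
conn = sqlite3.connect('items.db')
conn.execute('CREATE TABLE IF NOT EXISTS items (title TEXT, href TEXT)')
conn.executemany('INSERT INTO items VALUES (?, ?)',
                 [(r['title'], r['href']) for r in rows])
conn.commit()
print(conn.execute('SELECT COUNT(*) FROM items').fetchone()[0])
conn.close()
```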
Small example:
Crawl the href of each a tag and its picture from the https://www.autohome.com.cn/news/ page and save the pictures locally.
```python
import uuid

import requests
from bs4 import BeautifulSoup

response = requests.get(url='https://www.autohome.com.cn/news/')
response.encoding = response.apparent_encoding  # fix garbled characters
soup = BeautifulSoup(response.text, features='html.parser')
target = soup.find(id='auto-channel-lazyload-article')
li_list = target.find_all('li')
for i in li_list:
    # every i is itself a soup object, so find() can be used on it again
    a = i.find('a')
    # if no a tag is found, a.attrs would raise an error, so check first
    if a:
        a_href = a.attrs.get('href')
        a_href = 'http:' + a_href
        print(a_href)
        txt = a.find('h3').text
        print(txt)
        img = a.find('img').attrs.get('src')
        img = 'http:' + img
        print(img)
        img_response = requests.get(url=img)
        filename = str(uuid.uuid4()) + '.jpg'
        with open(filename, 'wb') as f:
            f.write(img_response.content)
```
Simple summary:
```python
response = requests.get('url')
response.text
response.content
response.encoding
response.encoding = response.apparent_encoding
response.status_code
```

```python
soup = BeautifulSoup(response.text, features='html.parser')
v1 = soup.find('div')        # the first matching element
soup.find(id='i1')
soup.find('div', id='i1')
v2 = soup.find_all('div')    # all matching elements, as a list
obj = v1
obj = v2[0]                  # index into the list to get each object
obj.text
obj.attrs                    # the element's attributes
```
Requests Module Introduction
1. How the methods relate to each other
```python
requests.get()
requests.post()
requests.put()
requests.delete()
...
```

These methods all essentially call the `requests.request()` method; for example, `get()` is defined as:

```python
def get(url, params=None, **kwargs):
    r"""Sends a GET request.

    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query
        string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """
    kwargs.setdefault('allow_redirects', True)
    return request('get', url, params=params, **kwargs)
```
2. Common parameters:
```python
requests.request()
# - method: how to submit (the HTTP method)
# - url:    the address to submit to
# - params: parameters passed on the URL (GET style), e.g.
requests.request(
    method='GET',
    url='http://www.baidu.com',
    params={'username': 'user', 'password': 'pwd'},
)
# -> http://www.baidu.com?username=user&password=pwd

# - data: data passed in the request body
requests.request(
    method='POST',
    url='http://www.baidu.com',
    data={'username': 'user', 'password': 'pwd'},
)

# - json: data passed in the request body, serialized and sent as a
#   whole, i.e. json sends '{"username": "user", "password": "pwd"}'
requests.request(
    method='POST',
    url='http://www.baidu.com',
    json={'username': 'user', 'password': 'pwd'},
)

# - headers: the request headers
requests.request(
    method='POST',
    url='http://www.baidu.com',
    json={'username': 'user', 'password': 'pwd'},
    headers={
        'referer': 'https://dig.chouti.com/',
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/63.0.3239.132 Safari/537.36',
    },
)

# - cookies: cookies, which are usually sent via the headers
```
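The way `params` turns into a query string can be reproduced offline with the standard-library `urlencode`, which performs the same transformation `requests` applies before appending the parameters to the URL:

```python
from urllib.parse import urlencode

# The same params dict as above, encoded into a query string.
params = {'username': 'user', 'password': 'pwd'}
query = urlencode(params)
url = 'http://www.baidu.com?' + query
print(url)  # http://www.baidu.com?username=user&password=pwd
```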