The Crawler Foundation for Python Development


Crawler Introduction

Crawler: you can think of the Internet as a large web, and a crawler as the spider moving across that web; if you want a resource somewhere in this web, you crawl it down.

In short, a crawler is an automated program that requests websites and extracts data.

The basic flow of a crawler (a minimal end-to-end sketch follows this list):

    • Initiate a request: send a request to the target site through an HTTP library; the request can carry extra information such as headers; then wait for the server to respond.
    • Get the response content: if the server responds normally, you get a Response whose content is the page you asked for; it may be HTML, a JSON string, binary data, or another type.
    • Parse the content: HTML can be parsed with regular expressions or a web page parsing library; JSON can be converted directly into a JSON object; binary data can be saved or processed further.
    • Save the data: the result can be saved as plain text, stored in a database, or written to a file in a specific format.
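
A minimal sketch of these four steps, assuming the third-party requests and beautifulsoup4 packages are installed and using http://www.baidu.com purely as a placeholder target:

import json

import requests
from bs4 import BeautifulSoup

# 1. initiate a request (extra information such as headers can be attached)
response = requests.get('http://www.baidu.com',
                        headers={'User-Agent': 'my-crawler/0.1'})

# 2. get the response content (here HTML text)
response.encoding = response.apparent_encoding
html = response.text

# 3. parse the content with a web page parsing library
soup = BeautifulSoup(html, features='html.parser')
title_tag = soup.find('title')
title = title_tag.text if title_tag else ''

# 4. save the data (here as a small JSON text file)
with open('result.json', 'w', encoding='utf-8') as f:
    json.dump({'url': response.url, 'title': title}, f, ensure_ascii=False)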

Request and Response process:

(1) The browser sends a message to the server where the URL is hosted; this process is called the HTTP Request.

(2) After receiving the message, the server processes it according to what the browser sent, and then sends a message back to the browser; this process is called the HTTP Response.

(3) When the browser receives the response from the server, it processes and displays the information.
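
To make this cycle concrete, here is a rough sketch using only Python's standard library (www.baidu.com serves merely as an example host):

import http.client

# one request/response cycle, spelled out step by step
conn = http.client.HTTPConnection('www.baidu.com')
conn.request('GET', '/')            # (1) the client sends the HTTP Request
resp = conn.getresponse()           # (2) the server sends back the HTTP Response
print(resp.status, resp.reason)     # (3) the client processes the response
print(resp.read()[:100])            # first 100 bytes of the response body
conn.close()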

Request:

    • Request method: mainly GET and POST, plus HEAD, PUT, DELETE, OPTIONS, and so on.
    • Request URL: URL stands for Uniform Resource Locator; a web page, a picture, a video, and so on can each be uniquely identified by a URL.
    • Request headers: the header information sent with the request, such as User-Agent, Host, Cookie, etc.
    • Request body: additional data carried with the request, such as form data when a form is submitted (the sketch after this list shows how these parts map onto a requests call).
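
As a rough illustration, the request below is built but not sent, so each of the four parts stays visible; httpbin.org and the header/body values are placeholders:

import requests

req = requests.Request(
    method='POST',                             # request method
    url='http://httpbin.org/post',             # request URL (placeholder)
    headers={'User-Agent': 'my-crawler/0.1'},  # request headers
    data={'username': 'user'},                 # request body (form data)
)
prepared = req.prepare()
print(prepared.method, prepared.url)
print(prepared.headers)
print(prepared.body)   # username=user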

Response:

    • Response status: there are many status codes, for example 200 for success, 301 for a redirect, 404 for page not found, 502 for a server error, and so on.
    • Response headers: content type, content length, server information, Set-Cookie, and so on.
    • Response body: the most important part; it contains the content of the requested resource, such as the page HTML, an image, or other binary data.

Simple example:

import requests

response = requests.get('http://www.baidu.com')
print(response.text)         # get the response body
print(response.headers)      # get the response headers
print(response.status_code)  # get the status code
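
One caveat worth noting here: depending on the charset the server declares, response.text may come out garbled; a common fix, used again in the larger example later, is to let requests guess the encoding from the page content:

import requests

response = requests.get('http://www.baidu.com')
# if the server does not declare a charset, requests may fall back to ISO-8859-1,
# so Chinese pages can look garbled; apparent_encoding guesses from the body itself
response.encoding = response.apparent_encoding
print(response.text[:200])   # first 200 characters of the decoded page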

What kind of data can be crawled?

    • Web page text, such as HTML documents, JSON-formatted text, and so on.
    • Pictures: what you get is a binary file, which can be saved in an image format.
    • Video: likewise a binary file; save it in a video format.
    • Anything else, as long as it can be requested (see the short sketch after this list).
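
A small sketch of the distinction between text and binary content (the URLs are placeholders):

import requests

# web page text: response.text is a decoded str
page = requests.get('http://www.baidu.com')
print(type(page.text))       # <class 'str'>

# pictures, videos, etc.: response.content is the raw bytes
img = requests.get('http://www.baidu.com/img/bd_logo1.png')  # placeholder image URL
with open('logo.png', 'wb') as f:
    f.write(img.content)     # save the bytes in an image format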

Ways to process the data (two of these approaches are sketched after this list):

    • Direct processing
    • JSON parsing
    • Regular expressions
    • BeautifulSoup
    • PyQuery
    • XPath
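
Two of these approaches in miniature: JSON parsing with the standard library and HTML parsing with BeautifulSoup (the JSON string and HTML snippet are made up for illustration):

import json

from bs4 import BeautifulSoup

# JSON: convert the text directly into Python objects
data = json.loads('{"name": "python", "tags": ["crawler", "demo"]}')
print(data['name'], data['tags'][0])

# HTML: use a parsing library such as BeautifulSoup
html = '<div id="i1"><a href="/news/1.html"><h3>title</h3></a></div>'
soup = BeautifulSoup(html, features='html.parser')
a = soup.find('a')
print(a.attrs.get('href'), a.find('h3').text)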

How to save data (a small sketch follows this list):

    • Text, plain text, JSON, XML, etc.
    • Relational databases, such as MySQL, Oracle, etc.
    • Non-relational databases, such as MongoDB, Redis, and other key-value stores.
    • Binary files: pictures, videos, audio, and so on, saved directly in the corresponding format.
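
A rough sketch of two of these options, a JSON text file and a relational database; SQLite from the standard library is used here so nothing extra needs to be installed, and the data itself is made up:

import json
import sqlite3

items = [{'title': 'some article', 'href': 'http://example.com/1.html'}]  # made-up data

# save as text (JSON)
with open('items.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, ensure_ascii=False)

# save into a relational database
conn = sqlite3.connect('items.db')
conn.execute('CREATE TABLE IF NOT EXISTS items (title TEXT, href TEXT)')
conn.executemany('INSERT INTO items VALUES (?, ?)',
                 [(i['title'], i['href']) for i in items])
conn.commit()
conn.close()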

Small example:

Crawl the href and image of each <a> tag on https://www.autohome.com.cn/news/ and save the images locally.

import uuid

import requests
from bs4 import BeautifulSoup

response = requests.get(url='https://www.autohome.com.cn/news/')
response.encoding = response.apparent_encoding  # fix garbled characters

soup = BeautifulSoup(response.text, features='html.parser')
target = soup.find(id='auto-channel-lazyload-article')
li_list = target.find_all('li')

for i in li_list:
    # every i is itself a soup object, so find() can be used to keep searching
    a = i.find('a')
    # if no <a> is found, calling a.attrs would raise an error, so check first
    if a:
        a_href = a.attrs.get('href')
        a_href = 'http:' + a_href
        print(a_href)

        txt = a.find('h3').text
        print(txt)

        img = a.find('img').attrs.get('src')
        img = 'http:' + img
        print(img)

        img_response = requests.get(url=img)
        filename = str(uuid.uuid4()) + '.jpg'
        with open(filename, 'wb') as f:
            f.write(img_response.content)

Simple summary:

"' Response = request.get (' url ') response.textresopnse.contentresponse.encodingresponse.encoding = Response.apparent_encodingresponse.status_code "" Soup = BeautifulSoup (response.text,features= ' html.parser ') v1 = Soup.find (' div ') # found the first qualifying soup.find (id= ' I1 ') soup.find (' div ', id= ' i1 ') v2 = soup.find_all (' div ') obj = V1obj = v2[0] # from List by index to each object Obj.textobj.attrs # property ""
Requests Module Introduction

1. How the request methods relate to each other

"' Requests.get () requests.post () Requests.put () Requests.delete () ... These methods are essentially called the Requests.request () method, for example: def get (URL, Params=none, **kwargs):    R "" "sends a GET request.    :p Aram Url:url for the New:class: ' Request ' object.    :p Aram params: (optional) Dictionary or bytes to being sent in the query St Ring for The:class: ' request '.    :p Aram \*\*kwargs:optional arguments, "request" takes.    : return:: Class: ' Response <Response> ' object    : Rtype:requests. Response "" "    kwargs.setdefault (' allow_redirects ', True)    return request (' Get ', URL, params=params, * * Kwargs) ""

2. Common parameters:

requests.request()

- method:  how the request is submitted (GET, POST, ...)
- url:     the address to request
- params:  parameters passed in the URL query string (GET), for example:

      requests.request(
          method='GET',
          url='http://www.baidu.com',
          params={'username': 'user', 'password': 'pwd'},
      )
      # http://www.baidu.com?username=user&password=pwd

- data:    data passed in the request body:

      requests.request(
          method='POST',
          url='http://www.baidu.com',
          data={'username': 'user', 'password': 'pwd'},
      )

- json:    data passed in the request body as JSON:

      requests.request(
          method='POST',
          url='http://www.baidu.com',
          json={'username': 'user', 'password': 'pwd'},
      )
      # sent as a whole, roughly json="{'username': 'user', 'password': 'pwd'}"

- headers: the request headers:

      requests.request(
          method='POST',
          url='http://www.baidu.com',
          json={'username': 'user', 'password': 'pwd'},
          headers={
              'referer': 'https://dig.chouti.com/',
              'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                            '(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
          },
      )

- cookies: cookies; they are usually sent over in the headers.
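
A short sketch of the cookies parameter, which requests turns into a Cookie header for you; the cookie value and User-Agent below are made up:

import requests

response = requests.request(
    method='GET',
    url='http://www.baidu.com',
    cookies={'session_id': 'abc123'},   # made-up cookie, sent as a Cookie header
    headers={'User-Agent': 'my-crawler/0.1'},
)
print(response.request.headers.get('Cookie'))  # session_id=abc123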
