Getting started with web crawlers in Python: learn the basic workflow first. This walkthrough is aimed at beginners.

Source: Internet
Author: User

Basic Crawler Flow

    • Send a request

      Use an HTTP library to send a request to the target server. The request can carry additional information such as headers.

    • Get the response

      If the server responds normally, you get back a response whose body contains the content of the page.

    • Parse the data

      The content may be HTML, which you can parse with regular expressions or a web page parsing library.

      It may also be JSON, which can be converted directly into a JSON object (a dict or list in Python) and read from there.

    • Save the data

      The result can be stored as text, written to a database, or saved as some other specific file type.
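
The four steps above can be sketched with the standard library alone. This is a minimal illustration rather than a complete crawler: the function names (build_request, fetch, parse_title, save_text), the User-Agent string, and the choice to extract only the page title are all assumptions made for the example.

```python
import re
import urllib.request


def build_request(url):
    """Step 1: prepare a request and attach an extra header."""
    # The User-Agent value is a made-up example, not a required string.
    return urllib.request.Request(url, headers={"User-Agent": "demo-crawler/0.1"})


def fetch(url):
    """Steps 1-2: send the request and return the decoded response body."""
    with urllib.request.urlopen(build_request(url), timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def parse_title(html):
    """Step 3: pull the <title> text out of an HTML string with a regex."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def save_text(path, text):
    """Step 4: store the result as a plain text file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
```

A call such as save_text("title.txt", parse_title(fetch("https://example.com/")) or "") would run all four steps against a live site; the URL and file name are placeholders.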

What a response contains

    • Response status

      Status Code: 200

      The status code reports the outcome of the request; 200 typically means the request succeeded.

    • Response headers

      Response Headers

      Content type, content length, server information, Set-Cookie, and so on.

    • Response body

      The content of the requested resource, such as web page source code or binary data.

When you request a web page, first check whether the status code is 200, and then extract the response body for parsing.
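
That check might look like the following sketch. The helper names body_if_ok and fetch_body are hypothetical; the code relies only on the status attribute and the getheader() and read() methods of the response object that urllib.request.urlopen returns for HTTP URLs. Note that urlopen itself raises urllib.error.HTTPError for 4xx and 5xx responses, so the explicit comparison mainly filters out other success codes such as 204.

```python
import urllib.request


def body_if_ok(resp):
    """Return the response body when the status is 200, otherwise None."""
    if resp.status != 200:
        return None
    content_type = resp.getheader("Content-Type", "")
    raw = resp.read()
    # Decode textual content; hand binary data back unchanged.
    # Assumption for this sketch: text is UTF-8 encoded.
    if content_type.startswith("text/") or "json" in content_type:
        return raw.decode("utf-8", errors="replace")
    return raw


def fetch_body(url):
    """Request a page and return its body only on a 200 response."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return body_if_ok(resp)
```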

Parsing methods

    • Direct processing

    • JSON parsing

    • Regular expressions

    • BeautifulSoup

    • PyQuery

    • XPath

Choose the parsing method that fits the situation.
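
Two of these methods need nothing beyond the standard library, so they are easy to try. The sketch below uses invented sample data: the JSON shape (an "items" list whose entries have a "name" field) and the HTML snippet are assumptions for illustration, as are the function names. BeautifulSoup, PyQuery, and XPath libraries are third-party packages and are not shown here.

```python
import json
import re


def parse_json_items(text):
    """JSON parsing: turn a JSON string into Python objects and read fields."""
    data = json.loads(text)
    return [item["name"] for item in data["items"]]


def parse_links(html):
    """Regular expressions: collect href values from <a> tags."""
    return re.findall(r'<a\s[^>]*href="([^"]*)"', html, re.IGNORECASE)


# Invented sample inputs, standing in for real response bodies.
json_body = '{"items": [{"name": "alpha"}, {"name": "beta"}]}'
html_body = '<p><a href="/one">One</a> and <a class="x" href="/two">Two</a></p>'

print(parse_json_items(json_body))
print(parse_links(html_body))
```

Regular expressions work for simple, predictable markup like this; for irregular or deeply nested HTML, a dedicated parsing library is usually the more robust choice.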

Saving data

    • Text

      Plain text, JSON, XML, and so on.

    • Relational databases

      MySQL, Oracle, SQL Server, and others.

    • Non-relational databases

      MongoDB (a document store), Redis (a key-value store), and similar systems.

    • Binary files

      Files such as images, video, and audio.
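
Three of these options can be tried without installing anything. In the sketch below, SQLite (from the standard library) stands in for a relational database such as MySQL; it is a substitution made so the example runs locally, not one of the systems named above. The records, table name, and file names are invented for illustration. Non-relational stores such as MongoDB or Redis need a running server and a third-party client, so they are left out.

```python
import json
import os
import sqlite3
import tempfile

# Invented sample records, standing in for parsed crawl results.
records = [
    {"title": "Page one", "url": "https://example.com/1"},
    {"title": "Page two", "url": "https://example.com/2"},
]

out_dir = tempfile.mkdtemp()

# Text: write the records as a JSON file.
json_path = os.path.join(out_dir, "pages.json")
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Relational database: SQLite stands in for MySQL, Oracle, or SQL Server.
db_path = os.path.join(out_dir, "pages.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE pages (title TEXT, url TEXT)")
conn.executemany("INSERT INTO pages (title, url) VALUES (:title, :url)", records)
conn.commit()
row_count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
conn.close()

# Binary file: write raw bytes (for example a downloaded image) in "wb" mode.
bin_path = os.path.join(out_dir, "image.bin")
with open(bin_path, "wb") as f:
    f.write(b"\x89PNG\r\n\x1a\n")
```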

The urllib library

urllib is Python's built-in HTTP request library.

Module               Description
urllib.request       Request module
urllib.error         Exception handling module
urllib.parse         URL parsing module
urllib.robotparser   robots.txt parsing module
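
A short tour of the four modules listed above, as a sketch. The URLs, the robots.txt rules, and the helper name fetch_or_none are made up for the example; the robots.txt text is parsed from an in-memory list so that nothing here touches the network until fetch_or_none is actually called.

```python
import urllib.error
import urllib.parse
import urllib.request
import urllib.robotparser

# urllib.parse: split, build, and join URLs.
parts = urllib.parse.urlparse("https://example.com/search?q=python&page=2")
query = urllib.parse.parse_qs(parts.query)
encoded = urllib.parse.urlencode({"q": "web crawler", "page": 1})
joined = urllib.parse.urljoin("https://example.com/a/b.html", "../c.html")

# urllib.robotparser: check crawl rules (parsed from text, no download).
robots = urllib.robotparser.RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
may_fetch_public = robots.can_fetch("*", "https://example.com/public/page")
may_fetch_private = robots.can_fetch("*", "https://example.com/private/page")


def fetch_or_none(url):
    """urllib.request + urllib.error: return the body bytes, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx or 5xx status.
        print("HTTP error:", err.code)
    except urllib.error.URLError as err:
        # The request never got a valid answer (DNS failure, refused connection).
        print("URL error:", err.reason)
    return None
```

HTTPError is a subclass of URLError, which is why it is caught first.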

You are welcome to follow my blog: https://home.cnblogs.com/u/Python1234/

