Basic Crawler Flow
Initiate a request
Send a request to the target server through an HTTP library; the request can carry additional header information.
Get the response content
If the server responds normally, it returns a response whose body contains the content of the page.
Parse the data
The content may be HTML, which can be parsed with regular expressions or a web-page parsing library.
It may also be JSON, which can be converted directly into a JSON object and read.
Save the data
The data can be stored as text, saved to a database, or written to another specific type of file.
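The four steps above can be sketched in a single function. This is a minimal, standard-library-only illustration, not a full crawler; the URL, output path, and the regex that extracts the `<title>` tag are all example assumptions:

```python
import json
import re
import urllib.request

def crawl(url, out_path):
    """Minimal request -> response -> parse -> save sketch."""
    # 1. Initiate a request; extra headers (e.g. User-Agent) go here.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    # 2. Get the response content.
    with urllib.request.urlopen(req) as resp:
        body = resp.read().decode("utf-8")
    # 3. Parse the data -- here a regex pulls the page <title> out of the HTML.
    m = re.search(r"<title>(.*?)</title>", body)
    title = m.group(1) if m else ""
    # 4. Save the data as text (JSON in this sketch).
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump({"url": url, "title": title}, f)
    return title
```

Real pages need more care (encodings, errors, politeness), but every crawler repeats this same loop.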
What's included in the response
Response status
The status code, typically 200, indicates a successful response.
Response headers
Content type, content length, server information, cookie settings, and so on.
Response body
The content of the requested resource, such as web page source code or binary data.
When requesting a web page, you can first check whether the status code is 200, and then extract the response body for parsing.
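A small sketch of that check, using the standard-library urllib; the function name and the print-on-failure behavior are illustrative choices:

```python
import urllib.error
import urllib.request

def fetch_if_ok(url):
    """Return the response body only when the status code is 200."""
    try:
        with urllib.request.urlopen(url) as resp:
            if resp.status == 200:        # only parse successful responses
                return resp.read()
    except urllib.error.HTTPError as e:   # 4xx/5xx responses raise HTTPError
        print("request failed:", e.code)
    return None
```

Note that urllib raises `HTTPError` for error status codes rather than returning them, so both branches are needed.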
Parsing method
Direct processing
JSON parsing
Regular expressions
BeautifulSoup
PyQuery
XPath
Choose the appropriate parsing method depending on the situation.
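Two of the options above need nothing beyond the standard library, so here is a quick contrast: JSON content converts straight into Python objects, while HTML can be searched with a regular expression (BeautifulSoup, PyQuery, and XPath tools require third-party packages). The sample bodies are made up for illustration:

```python
import json
import re

# Direct JSON parsing: the text becomes a dict in one call.
json_body = '{"name": "spider", "pages": 3}'
data = json.loads(json_body)
print(data["pages"])                              # -> 3

# Regular-expression parsing: pull attribute values out of HTML.
html_body = '<a href="/page/1">next</a> <a href="/page/2">last</a>'
links = re.findall(r'href="(.*?)"', html_body)
print(links)                                      # -> ['/page/1', '/page/2']
```

Regexes are fine for small, predictable snippets; for real pages a proper HTML parser is far more robust.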
Save data
Text save
Plain text, JSON, XML, and so on.
Relational database save
MySQL, Oracle, SQL Server, and more.
Non-relational database save
MongoDB, Redis, and other key-value stores.
Binary files
Specific files such as images, videos, and audio.
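Two of these storage options work out of the box with the standard library: text/JSON saving and a relational save via SQLite (MySQL, MongoDB, and the rest need third-party drivers). The file names, table schema, and sample record below are example assumptions:

```python
import json
import sqlite3

record = {"title": "demo page", "status": 200}

# Text save: append one JSON object per line.
with open("pages.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")

# Relational database save: sqlite3 ships with Python.
conn = sqlite3.connect("pages.db")
conn.execute("CREATE TABLE IF NOT EXISTS pages (title TEXT, status INTEGER)")
conn.execute("INSERT INTO pages VALUES (?, ?)",
             (record["title"], record["status"]))
conn.commit()
conn.close()
```

Binary data (images, video) is saved the same way as the text file, just opened in `"wb"` mode with raw bytes written instead of strings.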
The urllib Library
Python's built-in HTTP request library.
| Module | Description |
| --- | --- |
| urllib.request | request module |
| urllib.error | exception handling module |
| urllib.parse | URL parsing module |
| urllib.robotparser | robots.txt parsing module |
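A short tour of the four modules in the table; everything here runs offline (the `Request` object is only constructed, not sent), and example.com is a placeholder URL:

```python
import urllib.error
import urllib.parse
import urllib.request
import urllib.robotparser

# urllib.parse: split a URL and build a query string.
parts = urllib.parse.urlparse("https://example.com/search?q=python")
print(parts.netloc, parts.query)          # example.com q=python
qs = urllib.parse.urlencode({"page": 2})  # -> "page=2"

# urllib.request: construct a request (urlopen(req) would send it).
req = urllib.request.Request("https://example.com/",
                             headers={"User-Agent": "demo"})

# urllib.robotparser: check rules parsed from robots.txt lines.
rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])
print(rp.can_fetch("*", "https://example.com/private/x"))  # False

# urllib.error: HTTPError is a subclass of URLError, so one
# `except urllib.error.URLError` clause catches both.
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
```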
You are welcome to follow my blog: https://home.cnblogs.com/u/Python1234/
Welcome to join the learning and exchange group (thousands of members): 125240963
To crawl the web with Python, you first need a clear process, and the workflow above is well suited for beginners!