Many people learn Python by writing all kinds of crawler scripts: scripts that scrape and verify proxies, scripts that fetch mail automatically, simple CAPTCHA-recognition scripts, and so on. Here we summarize some practical tips for web scraping with Python.
Static web pages
Crawling static web pages needs little explanation because it is very simple: fetch the HTML directly with requests, then pull out what you need with regular expressions.
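As a minimal sketch of this approach: the sample HTML below stands in for a fetched page (in a real crawler you would start with `html = requests.get(url).text`), and a regular expression extracts the links.

```python
import re

# Hypothetical sample HTML standing in for a fetched page
# (in practice: html = requests.get(url).text)
SAMPLE_HTML = """
<ul>
  <li><a href="/post/1">First post</a></li>
  <li><a href="/post/2">Second post</a></li>
</ul>
"""

def extract_links(html):
    """Pull (href, link text) pairs out of simple anchor tags with a regex."""
    return re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)

print(extract_links(SAMPLE_HTML))
# → [('/post/1', 'First post'), ('/post/2', 'Second post')]
```

Regexes work for quick one-off jobs like this; for messier real-world HTML a proper parser is usually more robust.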
Dynamic web pages
Compared with static pages, dynamic pages are more complex. Given how fast the web has developed, most sites today are dynamic and purely static pages are relatively rare. But as the saying goes, for every clever scheme there is a ladder over the wall: every obstacle a site puts up has a workaround.
HTTP requests for dynamic web pages come in two main forms: GET and POST.
- GET method: for example, when we type an address into the browser, the browser issues a GET request to that address. That address is the URL.
- POST method: less common in crawlers, so we won't cover it in detail here.
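To make the difference concrete, here is a small sketch using the standard library's urllib (example.com is a placeholder): GET parameters travel in the URL's query string, while POST parameters travel in the request body.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: parameters are appended to the URL as a query string
params = urlencode({"page": 1, "q": "python"})
get_req = Request("http://example.com/search?" + params)

# POST: parameters are encoded into the request body
form = urlencode({"user": "alice", "pwd": "secret"}).encode()
post_req = Request("http://example.com/login", data=form)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```

Note that urllib infers the method from whether a body is present; with requests you would simply call `requests.get(...)` or `requests.post(...)`.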
To figure out what kind of request a site makes, get comfortable with the F12 developer tools and inspect the Network tab.
Let's look at a case.
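For instance, suppose the Network tab shows the page filling itself in from an XHR endpoint that returns JSON. You can then call that endpoint directly and skip the HTML entirely. A minimal sketch, with a hypothetical payload standing in for the live response (in practice: `data = requests.get(api_url).json()`):

```python
import json

# Hypothetical JSON payload, as returned by an XHR endpoint
# found in the F12 Network tab
RAW = '{"items": [{"title": "Post A", "views": 120}, {"title": "Post B", "views": 87}]}'

data = json.loads(RAW)

# Structured data needs no regex: just index into it
titles = [item["title"] for item in data["items"]]
print(titles)  # ['Post A', 'Post B']
```

Hitting the JSON endpoint directly is usually both faster and more reliable than parsing the rendered page.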
Of course, not every dynamic page exposes its data through a request you can replay; some pages render their data in ways that can't be fetched this way.
For such sites we generally use Selenium to drive a real browser, which lets us grab the page exactly as the browser renders it. The downside is that Selenium is relatively slow.
A concrete case:
So whether a page is static or dynamic, there is a way to crawl it. Of course, many sites also require logins, CAPTCHA recognition, anti-crawling countermeasures, and so on; whatever measures a site takes, there is a way to deal with them. The key is whether you know how.