Common approaches for Python crawlers

requests + bs4 + lxml: fetch pages and parse the HTML directly.
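A minimal sketch of this approach, assuming a hypothetical static listing page; BeautifulSoup parses the HTML that requests fetched, with lxml as the parser backend:

```python
import requests
from bs4 import BeautifulSoup

# hypothetical target URL; any static HTML page works the same way
url = "https://example.com/list"
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
resp.raise_for_status()

# lxml is the fastest parser backend available to BeautifulSoup
soup = BeautifulSoup(resp.text, "lxml")
for link in soup.select("a[href]"):          # CSS selector for all links
    print(link.get_text(strip=True), link["href"])
```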
Capture the site's Ajax requests (via browser devtools or a packet-capture tool), then use requests to fetch and parse the JSON responses directly.
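A sketch of the Ajax approach: after spotting the JSON endpoint in the browser's Network panel, call it directly. The endpoint, parameters, and response fields below are hypothetical:

```python
import requests

# hypothetical Ajax endpoint discovered in the browser's Network tab
api = "https://example.com/api/items"
params = {"page": 1, "size": 20}            # pagination params seen in the capture
resp = requests.get(api, params=params, headers={
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",   # some endpoints check this header
}, timeout=10)

data = resp.json()                          # the endpoint returns JSON, not HTML
for item in data.get("items", []):          # hypothetical response structure
    print(item.get("title"))
```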
For sites with anti-crawling measures, crawl with Selenium driving a real browser.
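A minimal Selenium sketch, assuming a hypothetical JavaScript-rendered page and Chrome with chromedriver available; the explicit wait lets the JS finish before scraping:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()                 # requires Chrome + chromedriver
try:
    driver.get("https://example.com/js-rendered")   # hypothetical JS-heavy page
    # wait until the JS-rendered elements actually appear in the DOM
    items = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".item"))
    )
    for el in items:
        print(el.text)
finally:
    driver.quit()
```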
Set up proxies:
A. Configure a proxy for urllib / requests / Selenium+Chrome / Selenium+PhantomJS (see the sketches after this list).
B. Scrape free proxy IPs from free-proxy sites into Redis as a proxy pool, periodically re-test them against the target site, and expose the pool through a small Flask service that returns a random proxy IP from Redis (not reliable enough for commercial use; a pool sketch follows this list).
C. Install Tinyproxy on several ADSL dial-up hosts to act as proxies; each host redials on a schedule to obtain a fresh IP and registers it in a remote Redis proxy pool, and a Flask service returns a random proxy IP from Redis (used for crawling Tianyancha / ITJuzi / Sogou).
D. Buy paid proxy IPs (likewise for crawling Tianyancha / ITJuzi / Sogou).
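A minimal sketch of item A, assuming a hypothetical local proxy at 127.0.0.1:8888; the same address is wired into urllib, requests, and Selenium+Chrome:

```python
import urllib.request
import requests
from selenium import webdriver

PROXY = "http://127.0.0.1:8888"             # hypothetical proxy address

# urllib: install a ProxyHandler-based opener
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": PROXY, "https": PROXY}))
print(opener.open("https://httpbin.org/ip", timeout=10).read())

# requests: pass a proxies dict per request (or set it on a Session)
resp = requests.get("https://httpbin.org/ip",
                    proxies={"http": PROXY, "https": PROXY}, timeout=10)
print(resp.json())

# Selenium + Chrome: pass the proxy as a browser command-line switch
options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=" + PROXY)
driver = webdriver.Chrome(options=options)
```

And a sketch of the Flask front end shared by items B and C, assuming the pool lives in a hypothetical Redis set named "proxies" that the collector/dialer hosts keep fresh:

```python
import redis
from flask import Flask

app = Flask(__name__)
# hypothetical Redis set "proxies" populated by the collector/dialer hosts
r = redis.StrictRedis(host="localhost", port=6379, decode_responses=True)

@app.route("/random")
def random_proxy():
    # SRANDMEMBER returns a random element without removing it
    proxy = r.srandmember("proxies")
    return proxy or ("pool empty", 404)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5555)
```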
Cookie pool: keep a pool of logged-in account cookies and attach a random set to each request.
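A sketch of consuming such a pool, assuming a hypothetical Redis hash "cookies:accounts" where each field holds one account's cookies serialized as a JSON name-to-value dict:

```python
import json
import random

import redis
import requests

r = redis.StrictRedis(host="localhost", port=6379, decode_responses=True)

def random_cookies():
    # hypothetical hash: field = account name, value = JSON cookie dict
    accounts = r.hkeys("cookies:accounts")
    raw = r.hget("cookies:accounts", random.choice(accounts))
    return json.loads(raw)

# hypothetical page that requires a logged-in session
resp = requests.get("https://example.com/feed",
                    cookies=random_cookies(), timeout=10)
print(resp.status_code)
```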
Crawling apps:
A. Capture packets with Charles / Fiddler / Wireshark / mitmproxy / AnyProxy, and drive the app with Appium for automated crawling.
B. Connect mitmdump to a Python script to process captured traffic directly, with Appium driving the app (a script sketch follows this list).
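A sketch of item B: a script passed to mitmdump (run with `mitmdump -s dump_items.py`); while Appium drives the app, every proxied response flows through this hook. The endpoint filter and JSON fields are hypothetical:

```python
# dump_items.py -- run with: mitmdump -s dump_items.py
import json

def response(flow):
    # only handle the app's (hypothetical) product API
    if "api.example.com/products" not in flow.request.url:
        return
    data = json.loads(flow.response.get_text())
    for item in data.get("list", []):       # hypothetical payload shape
        print(item.get("name"), item.get("price"))
```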
Crawling with the pyspider framework.
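A minimal handler following pyspider's standard project template; the start URL and selectors are placeholders:

```python
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)                  # re-run the start URL daily
    def on_start(self):
        self.crawl("https://example.com/", callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)           # treat index results as fresh for 10 days
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {"url": response.url, "title": response.doc("title").text()}
```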
Distributed crawling with the Scrapy / scrapy-redis / scrapyd stack.
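A minimal Scrapy spider sketch; names and selectors are hypothetical. Distribution comes from swapping in scrapy-redis's scheduler/dupefilter in settings and deploying the project to workers via scrapyd:

```python
import scrapy

# For distributed runs, settings.py would point at a shared Redis, e.g.:
#   SCHEDULER = "scrapy_redis.scheduler.Scheduler"
#   DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

class ItemSpider(scrapy.Spider):
    name = "items"
    start_urls = ["https://example.com/list"]        # hypothetical listing page

    def parse(self, response):
        for row in response.css("div.item"):         # hypothetical item markup
            yield {
                "title": row.css("a::text").get(),
                "url": response.urljoin(row.css("a::attr(href)").get()),
            }
        # follow pagination if a next-page link exists
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```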
Captchas:
A. Slider captcha: use Selenium to screenshot the captcha image, compare pixel color differences with PIL to locate the gap, compute the target position, then have Selenium simulate a human drag with uniform acceleration followed by uniform deceleration and submit for verification (see the sketch after this list).
B. Weibo mobile-site captcha: use Selenium to screenshot the captcha image and build image templates in advance; at crawl time, screenshot the captcha again, match it against the templates by color difference with PIL, and on a match drag through the points in the numeric order encoded in the template's filename using Selenium, then verify.
C. Use a captcha-solving platform: screenshot the captcha with Selenium, send it to the platform, receive click coordinates back, then move to those coordinates with Selenium, click, and verify.
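A sketch of the core of item A: locate the gap by pixel color difference with PIL, build a drag track that accelerates then decelerates, and replay it with Selenium. The threshold, acceleration values, and image handling are hypothetical and need tuning per site:

```python
import time
from PIL import Image
from selenium.webdriver import ActionChains

def find_gap(full_img: Image.Image, bg_img: Image.Image, start_x=60):
    """Compare the intact and gapped captcha images (same size) pixel by
    pixel; the first column with a large color difference marks the gap."""
    for x in range(start_x, full_img.width):
        for y in range(full_img.height):
            p1, p2 = full_img.getpixel((x, y)), bg_img.getpixel((x, y))
            if sum(abs(a - b) for a, b in zip(p1[:3], p2[:3])) > 180:
                return x
    return None

def build_track(distance):
    """Uniform acceleration for the first 4/5 of the distance, then uniform
    deceleration, mimicking a human drag; returns per-step x offsets."""
    track, current, v, t = [], 0.0, 0.0, 0.2
    mid = distance * 4 / 5
    while current < distance:
        a = 2 if current < mid else -3       # accelerate, then brake
        step = v * t + 0.5 * a * t * t
        v += a * t
        current += step
        track.append(round(step))
    # correct rounding drift so the slider lands exactly on the gap
    track.append(round(distance - sum(track)))
    return track

def drag(driver, slider, track):
    """Press, replay the track step by step, pause briefly, release."""
    ActionChains(driver).click_and_hold(slider).perform()
    for step in track:
        ActionChains(driver).move_by_offset(step, 0).perform()
    time.sleep(0.5)
    ActionChains(driver).release().perform()
```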