After reading BeautifulSoup's official documentation, I tried crawling my own school's Moodle today and wrote a simple grade-checking script. It worked, and the code has been thrown up on GitHub; interested friends can go take a look.
https://github.com/zhang77595103/web-crawler
Today, following the example of the expert Xlzd, I'm going to write a crawler for the Douban Movie Top 250, mainly to get a look at anti-crawler mechanisms; after all, not every website is like our school's official site, where crawlers come and go as they please ...
Since I haven't read the official documentation for the requests library yet, I'll just jot down knowledge points as I run into them ...
import requests

homepage_response = requests.get("https://movie.douban.com/")
print(type(homepage_response.content))
print(type(homepage_response.text))

---------------------------------------------------------------
/Library/Frameworks/Python.framework/Versions/3.5/bin/python3.5 /Users/zhangzhimin/PycharmProjects/PythonAgain/test2.py
<class 'bytes'>
<class 'str'>

Process finished with exit code 0
That is the difference between .content and .text. One thing worth knowing here: in Python 3, decoding bytes (decode()) gives you a str, and encoding a str (encode()) gives you bytes ...
So on the whole, .content.decode() is roughly equivalent to .text ...
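To make that concrete, a quick sketch (my own illustration, not from the original post):

# In Python 3: bytes --decode()--> str, str --encode()--> bytes
s = "douban"
b = s.encode("utf-8")           # str -> bytes
print(type(b))                  # <class 'bytes'>
print(type(b.decode("utf-8")))  # <class 'str'>

(Strictly speaking, requests has to guess the response encoding for .text, while .content.decode() defaults to UTF-8, so the two can differ when the guess is wrong.)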
Following the expert's idea, the script should look like this:
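(The snippet itself was mangled in formatting; presumably it was just a bare fetch with no custom headers, something like this sketch, reconstructed from the final version further down:)

import requests

# Bare request, no custom headers -- per the original account,
# Douban answers this with HTTP 403.
response = requests.get("https://movie.douban.com/top250")
print(response.status_code)
print(response.text)  # the body of the 403 error page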
" white"><center>
A 403 here generally means one of two things: either the site requires a login and we are not logged in, or the server has decided we are a crawler and refused access. This is obviously the second case. When a browser sends a request to a server, the request carries a header called User-Agent that identifies the type of browser. When we send a request with requests, the default User-Agent is python-requests/2.8.1 (the number may differ; it is just the version). So, let's try disguising the User-Agent as a browser and see whether that solves the problem.
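Mechanically, the disguise just means passing a headers dict to requests.get(); a minimal sketch, using the same Chrome UA string as the final code below:

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/51.0.2704.84 Safari/537.36'
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
print(response.status_code)  # 200 once the UA looks like a browser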
Then, unfortunately, I never got the 403 the expert described, which left me rather embarrassed ... But let's patiently keep crawling anyway.
So let's pretend my request did get a 403 (in fact I found that even without adding any headers the crawl worked perfectly), and go find a User-Agent for our own crawler ...
I picked a random website, opened the browser's developer tools (Inspect), and refreshed the page ... You can see that the last line of the request headers is the UA we need ...
The final code is as follows ...
import requests
from bs4 import BeautifulSoup


def pagecheck(url, headers):
    # Fetch one page of the Top 250 list and print each movie's title and link.
    homepage_response = requests.get(url, headers=headers)
    homepage_soup = BeautifulSoup(homepage_response.content, 'lxml')
    for li in homepage_soup.find("ol", class_="grid_view").find_all("li"):
        link = li.find("div", class_="hd").a
        print(link.span.string + " : " + link['href'])


url = "https://movie.douban.com/top250"
start = 25  # Douban shows 25 movies per page
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/51.0.2704.84 Safari/537.36'
}
for cnt in range(0, 10):
    print("\n\nThis is page " + str(cnt + 1) + " ...\n")
    if cnt == 0:
        cururl = url
    else:
        cururl = url + "?start=" + str(start * cnt) + "&filter="
    pagecheck(cururl, headers)
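For reference, the pagination works because the Top 250 list addresses each page by offset: ?start=0 is page 1, ?start=25 is page 2, and so on, which is exactly what start * cnt produces for pages 2 through 10.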