Web Crawler in Practice (1)

Source: Internet
Author: User

After reading the official BeautifulSoup documentation, I tried crawling my own school's Moodle site today and wrote a simple grade checker. It worked, and the code is up on GitHub for anyone who is interested.

https://github.com/zhang77595103/web-crawler

Today, following the example of the expert xlzd, I am going to write a crawler for the Douban Movie Top 250. The main goal is to see how anti-crawler mechanisms work. After all, not every website is like our school's official site, which lets a crawler come and go as it pleases ...

Since I have not read the official documentation for the requests library yet, I will note down each new point as I run into it ...

    import requests

    homepageresponse = requests.get("https://movie.douban.com/")
    print(type(homepageresponse.content))
    print(type(homepageresponse.text))

    ---------------------------------------------------------------
    /Library/Frameworks/Python.framework/Versions/3.5/bin/python3.5 /Users/zhangzhimin/PycharmProjects/pythonagain/test2.py
    <class 'bytes'>
    <class 'str'>

    Process finished with exit code 0

That is the difference between .content and .text. The point to remember is that in Python 3, decoding bytes with decode() gives a str, and encoding a str with encode() gives bytes ...

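So overall, .content.decode() is roughly equivalent to .text, as long as the page is UTF-8 encoded: decode() defaults to UTF-8, while .text uses whatever encoding requests determined for the response ...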
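The bytes/str relationship can be checked without any network access. The following is only a minimal sketch: the sample string and the variable names are made up for illustration, and it shows why the encoding used to decode has to match the one used to encode.

```python
# A str is Unicode text; bytes are what actually travels over the network.
text = "豆瓣电影 Top 250"

raw_utf8 = text.encode("utf-8")   # str -> bytes
raw_gbk = text.encode("gbk")      # same text, different byte sequence

# Decoding with the matching codec restores the original text.
restored_utf8 = raw_utf8.decode("utf-8")   # bytes -> str
restored_gbk = raw_gbk.decode("gbk")

print(type(raw_utf8), type(restored_utf8))
print(len(raw_utf8), len(raw_gbk))   # the byte lengths differ between codecs
print(restored_utf8 == text, restored_gbk == text)
```

Because the same text produces different bytes under different codecs, calling decode() with the wrong codec either fails or yields garbage, which is why a bare .content.decode() is only safe when the page really is UTF-8.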

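Following xlzd's approach, the first version of the script simply requests the page without any custom headers. In his write-up, that request came back with a 403, which he explains as follows: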

A 403 usually means one of two things: the site requires a login and you are not logged in, or the server has identified the client as a crawler and refused access. Here it is clearly the second case. When a browser sends a request to a server, it includes request headers, and one of them, User-Agent, identifies the type of browser. When we send a request with requests, the default User-Agent is python-requests/2.8.1 (the number is the library version, so it may differ). So let's see: if the User-Agent is disguised as a browser, does that solve the problem?
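The effect of overriding the User-Agent can be inspected without sending anything. The sketch below deliberately swaps in the standard library's urllib.request instead of requests, so that it runs with no extra packages. urllib's default User-Agent plays the same role as python-requests/x.y.z, and build_request is a hypothetical helper name, not part of any library.

```python
import urllib.request

# Browser-like User-Agent string; in practice, copy one from your own browser.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/51.0.2704.84 Safari/537.36"
)


def build_request(url, user_agent=None):
    """Build (but do not send) a GET request, optionally overriding User-Agent."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


# The User-Agent urllib falls back to when none is set, e.g. "Python-urllib/3.x".
default_ua = dict(urllib.request.build_opener().addheaders)["User-agent"]

plain = build_request("https://movie.douban.com/top250")
disguised = build_request("https://movie.douban.com/top250", BROWSER_UA)

print(default_ua)
print(plain.get_header("User-agent"))      # None: the default is added at send time
print(disguised.get_header("User-agent"))  # the browser-like string
```

With requests, the equivalent step is passing the same dictionary through the headers argument of requests.get.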

Unfortunately, I never got the 403 that xlzd ran into, which is a little awkward ... But let's patiently keep crawling anyway.

I will pretend my request was rejected with a 403 (in reality, I found that the page crawls perfectly well without any extra headers), and then give our own crawler a User-Agent ...

Pick any website, open the browser's Inspect panel, and refresh the page ... In the request headers, the last line is the UA we need ...

The final code is as follows ...

    import requests
    from bs4 import BeautifulSoup


    def pagecheck(url, headers):
        # Fetch one page of the list and print each film's title and link.
        homepageresponse = requests.get(url, headers=headers)
        homepagesoup = BeautifulSoup(homepageresponse.content, 'lxml')
        for li in homepagesoup.find("ol", class_="grid_view").find_all("li"):
            link = li.find("div", class_="hd").a
            print(link.span.string + " : " + link['href'])


    url = "https://movie.douban.com/top250"
    start = 25  # number of films per page
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36'
    }

    # 10 pages of 25 films each; the first page needs no query string.
    for cnt in range(0, 10):
        print("\n\nthis is page " + str(cnt + 1) + " ...\n\n")
        if cnt == 0:
            cururl = url
        else:
            cururl = url + "?start=" + str(start * cnt) + "&filter"
        pagecheck(cururl, headers)
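The paging arithmetic in the script above can be pulled out and checked on its own, with no network access. This is only a sketch: page_urls, BASE_URL and PAGE_SIZE are hypothetical names introduced here, and the query string simply mirrors the one the script builds.

```python
BASE_URL = "https://movie.douban.com/top250"
PAGE_SIZE = 25  # films per page, matching the start step used in the script


def page_urls(base_url=BASE_URL, pages=10, page_size=PAGE_SIZE):
    """Yield one URL per result page; the first page has no query string."""
    for page in range(pages):
        if page == 0:
            yield base_url
        else:
            yield "{}?start={}&filter".format(base_url, page * page_size)


urls = list(page_urls())
for u in urls:
    print(u)
```

Keeping the URL generation separate from the fetching makes it easy to confirm that the offsets run from 25 to 225 before any request is sent.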
