Python 3: Writing a Crawler for the Pengfu Joke Site

Most of the Python tutorials online are written for the 2.x series. Python 2.x and 3.x differ considerably, and many libraries are used differently, so since I have Python 3.x installed, let's walk through a detailed example.

0x01

With some idle time over Spring Festival (busy as it was), I wrote a simple program to crawl a few jokes, and recorded the process of writing it along the way. My first contact with crawlers came from reading a post about scraping girl photos from the Jiandan site, which looked almost too convenient. So I gritted my teeth and grabbed a few pictures myself.

Technology enlightens the future, though. As a programmer, how could I keep doing that sort of thing? Crawling jokes is far more beneficial to one's physical and mental health.


0x02

Before we roll up our sleeves, let's go over a little theory.

Simply put, we want to pull down content from specific places on a web page, and to do that we first have to analyze the page and see which part holds what we need. The target this time is jokes from the Pengfu site: open the page and you can see plenty of jokes, and our goal is to grab that content. Now come back and calm down; code can't be written while you're still giggling. In Chrome, open the element inspector, expand the HTML tags level by level, or click the little mouse-pointer icon to locate the element we need.


Eventually we find that the text inside a <p> tag is the joke we want, and the same holds for the second joke and the rest. So the plan is to find all such <p> tags on this page, extract their contents, and we're done.
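For orientation, the relevant part of the page looks roughly like this (a simplified, hypothetical sketch on my part; the real markup has more attributes and nesting, but the class name is the one we will target below):

<div class="list-item">
  <h1><a href="/joke-url">Joke title</a></h1>
  <p class="content-img clearfix pt10 relative">
    The text of the joke sits here.
  </p>
</div>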

0x03

Well, now that we know our goal, we can roll up our sleeves and get to work. I am using Python 3 here; the choice between Python 2 and Python 3 is yours to make, and the same functionality can be achieved either way with only small differences, but Python 3 is recommended.
To pull out the content we need, we first have to pull down the page itself. How? Here we need a library called urllib, whose methods we use to fetch the whole page.
First, we import urllib.request:


Copy the code as follows:

import urllib.request as request

Then we can use request to fetch the page:


Copy the code as follows:

def get_html(url):
    return request.urlopen(url).read()

Life is short, I use Python: one line of code and the web page is downloaded. Now tell me there is no reason to use Python.
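One detail worth noting (my addition, not in the original article): urlopen(url).read() returns raw bytes, not a string. Beautiful Soup, used below, accepts bytes directly, but if you want to inspect the page yourself you need to decode it first. A minimal sketch, assuming the page is served as UTF-8:

html_bytes = get_html("http://www.pengfu.com/xiaohua_1.html")
html_text = html_bytes.decode("utf-8")  # assumption: the page declares UTF-8
print(html_text[:200])                  # peek at the start of the document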
After downloading the page, we have to parse it to get the elements we want. To parse the elements, we use another tool called Beautiful Soup, which can quickly parse HTML and XML and extract the elements we need.
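If Beautiful Soup is not installed yet, a typical setup (assuming pip is available) is pip install beautifulsoup4 lxml, which also pulls in the lxml parser we will specify in a moment.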


Copy the code as follows:

soup = BeautifulSoup(get_html("http://www.pengfu.com/xiaohua_1.html"))

Parsing the page with BeautifulSoup is a one-liner, but when you run the code you will see a warning asking you to specify a parser; otherwise the code may raise errors on other platforms or systems.


The warning reads:

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is in the file joke.py. To get rid of this warning, change code that looks like this:

BeautifulSoup([your markup])

to this:

BeautifulSoup([your markup], "lxml")

  markup_type=markup_type))

The official documentation describes the parser types and the differences between them in detail; for now, parsing with lxml is the more reliable choice.
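As a quick illustration (my own sketch, not from the original article): html.parser ships with Python and needs no extra installation, while lxml must be installed separately but is faster and more tolerant of broken markup. Both are selected by name:

from bs4 import BeautifulSoup

broken = "<p>hello"  # note the missing closing tag

# Built-in parser: no extra dependency required.
print(BeautifulSoup(broken, "html.parser").p.string)  # hello

# lxml: requires pip install lxml; generally faster and more lenient.
print(BeautifulSoup(broken, "lxml").p.string)         # hello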
After the modification:


Copy the code as follows:

soup = BeautifulSoup(get_html("http://www.pengfu.com/xiaohua_1.html"), 'lxml')

This way, the warning no longer appears.


Copy the code as follows:

p_array = soup.find_all('p', {'class': "content-img clearfix pt10 relative"})

Use the find_all function to find all <p> tags whose class is "content-img clearfix pt10 relative", then iterate over the array:


Copy the code as follows:

for x in p_array:
    content = x.string

This way we get the content of each target <p>. At this point, we have reached our goal and crawled our jokes.
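One caveat worth knowing (my addition, not in the original article): x.string returns None when the <p> has nested child tags, so a slightly more defensive loop would look like this:

for x in p_array:
    content = x.string
    if content is None:        # the tag has nested children; fall back to get_text()
        content = x.get_text()
    print(content.strip() + '\n')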
But crawling Qiushibaike the same way raises an error:


The error:

raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

It says the remote end was unresponsive and closed the connection. I checked the network and found nothing wrong, so what could be causing this? Is something configured wrong on my side?
I opened Charles to capture the traffic, and sure enough, no response. How strange: a perfectly normal site that the browser can access but Python cannot. Could it be a UA problem? Looking in Charles, I found that requests sent with urllib carry the default User-Agent of Python-urllib/3.5, while access from Chrome carries User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36. Could the server be rejecting the Python crawler based on its UA? Let's try it in disguise.


Copy the code as follows:

def get_html(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
    req = request.Request(url, headers=headers)
    return request.urlopen(req).read()

Disguised as Chrome this way, Python can fetch the Qiushibaike pages, and the data comes through smoothly.
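If you want to double-check what is actually being sent, urllib lets you read a header back from the Request object before opening it. A quick sketch (my addition, reusing the request import and headers dict from above; note that urllib normalizes header names, so the lookup key is 'User-agent'):

req = request.Request('http://www.qiushibaike.com/8hr/page/1/', headers=headers)
print(req.get_header('User-agent'))  # prints the disguised UA instead of Python-urllib/3.5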

That wraps up crawling jokes from Qiushibaike and Pengfu with Python. We only need to analyze the corresponding pages, find the elements we are interested in, and use Python's powerful features to reach our goal. Whether it is photo galleries or jokes, it can all be done at the push of a button. Enough said, I am off to find some girl photos.

The complete code:

# -*- coding: utf-8 -*-
import urllib.request as request
from bs4 import BeautifulSoup

def get_html(url):
    # Disguise as Chrome so the server does not reject the default Python UA.
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
    req = request.Request(url, headers=headers)
    return request.urlopen(req).read()

def get_pengfu_results(url):
    soup = BeautifulSoup(get_html(url), 'lxml')
    return soup.find_all('p', {'class': "content-img clearfix pt10 relative"})

def get_pengfu_joke():
    for page in range(1, 2):
        url = 'http://www.pengfu.com/xiaohua_%d.html' % page
        for x in get_pengfu_results(url):
            content = x.string
            try:
                print(content.lstrip() + '\n')
            except AttributeError:
                # content is None when the tag has nested children; skip it.
                continue

def get_qiubai_results(url):
    soup = BeautifulSoup(get_html(url), 'lxml')
    contents = soup.find_all('p', {'class': 'content'})
    results = []
    for x in contents:
        results.append(x.find('span').getText('\n', '<br/>'))
    return results

def get_qiubai_joke():
    for page in range(1, 2):
        url = 'http://www.qiushibaike.com/8hr/page/%d/?s=4952526' % page
        for x in get_qiubai_results(url):
            print(x + '\n')

if __name__ == '__main__':
    get_pengfu_joke()
    get_qiubai_joke()
