Making a Web Crawler with Python 3

Source: Internet
Author: User


0x01

With some free time over the Spring Festival (and there was plenty of it), I wrote a simple program to crawl some jokes, and this is a write-up of the process. My first contact with crawlers came from reading a post whose author found it inconvenient to browse the "sister" photos on Jandan online, so he grabbed the pictures himself with a crawler.

Technology inspires the future. As a programmer, how could I not try something like this myself? And crawling jokes is good for both body and mind.


0x02

Before we get started, let's go over a little background.

To put it simply, we want to pull down the content at specific locations on a webpage, and to do that we first have to analyze the page and find where the content we need lives. For example, this time we are crawling jokes from Pengfu (pengfu.com). When we open the page we see plenty of jokes, and our goal is to extract exactly that content. Once you have finished reading them, calm down; you cannot write code while you are still grinning. In Chrome, open the developer tools (Inspect Element), expand the HTML tags level by level, or click an element on the page to locate the node we need.


Eventually we find that the text inside the <div> with class "content-img clearfix pt10 relative" is the joke we want, and the second joke follows exactly the same pattern. So we can find all such <div> elements in the page and extract the content from each one.

0x03

Now that we know what we want, we can roll up our sleeves and get started. I use Python 3 here; whether to go with Python 2 or Python 3 is your call, and both can get the job done, though the code differs slightly. Python 3 is recommended.
To extract what we need, we first have to pull the webpage itself down. How? Here we use a library called urllib; the methods it provides let us fetch the entire page.
First, import urllib:

The code is as follows:

import urllib.request as request

Then we can use request to fetch the webpage:

The code is as follows:

def getHTML(url):
  return request.urlopen(url).read()

Life is short, I use Python: one line of code and the webpage is downloaded. Tell me, why would you not use Python?
After downloading the webpage, we have to parse it to get the elements we need. For parsing we use another tool, Beautiful Soup, which can quickly parse HTML and XML and pull out the elements we want. A minimal setup sketch follows.
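Beautiful Soup lives in the third-party bs4 package; the usual PyPI package names are beautifulsoup4 and lxml (an assumption about your environment, not something stated in the original post):

# Installed beforehand, e.g.: pip install beautifulsoup4 lxml
from bs4 import BeautifulSoup  # the same import used in the full listing at the end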

The code is as follows:

soup = BeautifulSoup(getHTML("http://www.pengfu.com/xiaohua_1.html"))

Parsing a webpage with BeautifulSoup is just one line, but when you run the code a warning appears, prompting you to specify a parser explicitly; otherwise errors may be reported on other platforms or systems.

The warning is as follows:

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 64 of the file joke.py. To get rid of this warning, change code that looks like this:

BeautifulSoup([your markup])

to this:

BeautifulSoup([your markup], "lxml")

  markup_type=markup_type))

The differences between the various parsers are described in detail in the official documentation; at present, lxml is the more reliable choice. A quick comparison sketch is below.
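For reference, the parser is just the second argument to BeautifulSoup. A small sketch of the common choices (html.parser ships with Python, lxml is the third-party package used in this article; both names are standard, not taken from the original post):

from bs4 import BeautifulSoup

markup = "<div class='content'>a joke</div>"
# Built-in parser: no extra install, but slower and less lenient than lxml.
print(BeautifulSoup(markup, "html.parser").div.string)
# lxml: fast and reliable, the parser chosen in this article (requires the lxml package).
print(BeautifulSoup(markup, "lxml").div.string)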
After the modification:

The code is as follows:

soup = BeautifulSoup(getHTML("http://www.pengfu.com/xiaohua_1.html"), 'lxml')

This way, the warning above no longer appears.

The code is as follows:

div_array = soup.find_all('div', {'class': "content-img clearfix pt10 relative"})

Use the find_all function to find all div tags whose class is content-img clearfix pt10 relative, and then traverse the resulting list.

The code is as follows:

for x in div_array:
  content = x.string

This gives us the content of the target div. At this point we have reached our goal: the jokes have been crawled. A consolidated sketch of everything so far is below.
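Putting the pieces together, a minimal end-to-end sketch (essentially the Pengfu path from the full listing at the end of this article, before the User-Agent fix introduced below) looks like this:

import urllib.request as request
from bs4 import BeautifulSoup

def getHTML(url):
  return request.urlopen(url).read()

soup = BeautifulSoup(getHTML("http://www.pengfu.com/xiaohua_1.html"), 'lxml')
for x in soup.find_all('div', {'class': "content-img clearfix pt10 relative"}):
  content = x.string
  if content:  # some divs contain only an image, so .string can be None
    print(content.lstrip() + '\n')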
However, if you crawl Qiushibaike (the second site in the full listing below) in the same way, this error is reported.

The error is as follows:

raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

It says the remote end closed the connection without responding, yet there is nothing wrong with the network. What is the cause? Am I doing something wrong?
Opening Charles to capture packets showed no response either. Strange: how can the same site work fine from a browser but not from Python? Could the User-Agent (UA) be the problem? Looking at Charles, we found that a request made with urllib uses Python-urllib/3.5 as its UA by default, while Chrome sends User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36. Could the server be identifying the Python crawler by its UA and rejecting it? Let's try going in disguise.
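Incidentally, you do not need Charles to see the default UA; a quick check from Python itself (not part of the original write-up, and the exact version string depends on your Python):

import urllib.request as request

# The default opener adds a 'User-agent' header of the form 'Python-urllib/X.Y'.
print(request.build_opener().addheaders)  # e.g. [('User-agent', 'Python-urllib/3.5')]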

The disguised version is as follows:

def getHTML(url):
  headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
  req = request.Request(url, headers=headers)
  return request.urlopen(req).read()

In this way Python disguises itself as Chrome when fetching Qiushibaike's pages, and the data comes back smoothly.

At this point, crawling internet jokes from Pengfu and Qiushibaike with Python is done. All we need to do is analyze the relevant webpages, find the elements we are interested in, and use Python's powerful features to reach our goal. Whether it's XXOO pictures or joke collections, we can get them with one click. Enough talk; I'm off to find some "sister" pictures.

# -*- coding: utf-8 -*-
import sys
import urllib.request as request
from bs4 import BeautifulSoup

def getHTML(url):
  # Disguise the request as Chrome so the server does not reject urllib's default UA.
  headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
  req = request.Request(url, headers=headers)
  return request.urlopen(req).read()

def get_pengfu_results(url):
  # On Pengfu, each joke sits in a div with this class.
  soup = BeautifulSoup(getHTML(url), 'lxml')
  return soup.find_all('div', {'class': "content-img clearfix pt10 relative"})

def get_pengfu_joke():
  for x in range(1, 2):
    url = 'http://www.pengfu.com/xiaohua_%d.html' % x
    for x in get_pengfu_results(url):
      content = x.string
      try:
        string = content.lstrip()
        print(string + '\n\n')
      except:
        # Some divs contain only an image, so content is None; skip them.
        continue
  return

def get_qiubai_results(url):
  # On Qiushibaike, the joke text is a span inside a div with class "content".
  soup = BeautifulSoup(getHTML(url), 'lxml')
  contents = soup.find_all('div', {'class': 'content'})
  restlus = []
  for x in contents:
    str = x.find('span').getText('\n', '<br/>')
    restlus.append(str)
  return restlus

def get_qiubai_joke():
  for x in range(1, 2):
    url = 'http://www.qiushibaike.com/8hr/page/%d/?s=4952526' % x
    for x in get_qiubai_results(url):
      print(x + '\n\n')
  return

if __name__ == '__main__':
  get_pengfu_joke()
  get_qiubai_joke()
