Use Python to implement web page crawling

Source: Internet
Author: User
Tags: virtual environment
Most Python tutorials on the Internet are written for version 2.x, but Python 2.x and Python 3.x differ significantly and many libraries are used differently. I have Python 3.x installed, so let's walk through a detailed example.
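The difference that matters most for this article is that the urllib module was reorganized between the two versions, so the same page fetch is written differently in each. A minimal illustration (the URL is simply the page crawled later in the article):

# Python 2.x
#   import urllib2
#   html = urllib2.urlopen("http://www.pengfu.com/xiaohua_1.html").read()

# Python 3.x (used in this article)
import urllib.request as request
html = request.urlopen("http://www.pengfu.com/xiaohua_1.html").read()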

0x01

During the idle days of the Spring Festival (and there were many of them), I wrote a simple program to crawl some jokes, and recorded the process of writing it here. My first contact with crawlers was reading a post of that kind: browsing the girl pictures on the Jiandan ("egg") site online was not very convenient, so the pictures were crawled and saved locally instead.

Technology inspires the future. As a programmer, how could I do that kind of thing? Crawling jokes is much better for both physical and mental health.

The code is as follows:

soup = BeautifulSoup(getHTML("http://www.pengfu.com/xiaohua_1.html"))
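Here getHTML is simply a helper that downloads the page source; it is defined in the complete listing at the end of the article, but as a minimal sketch it amounts to:

import urllib.request as request

def getHTML(url):
    # Fetch the raw HTML bytes of the page
    return request.urlopen(url).read()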

Parsing a webpage with BeautifulSoup takes just one line, but when you run the code a warning appears prompting you to specify a parser; otherwise the code may behave differently, or even report errors, on other platforms or systems.

The code is as follows:

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. The code that caused this warning is on line 64 of the file joke.py. To get rid of this warning, change code that looks like this: BeautifulSoup([your markup]) to this: BeautifulSoup([your markup], "lxml")  markup_type=markup_type))

The types of parsers and the differences between them are described in detail in the official documentation. At present, lxml parsing is the most reliable choice.
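As a brief illustration (html.parser and html5lib are alternatives listed in the BeautifulSoup documentation, not ones used in this article), the parser is simply passed as the second argument:

from bs4 import BeautifulSoup

html = "<p>hello</p>"
BeautifulSoup(html, "html.parser")  # parser built into the standard library, no extra dependency
BeautifulSoup(html, "lxml")         # faster and more tolerant, requires the lxml package to be installed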
After modification

The code is as follows:

soup = BeautifulSoup(getHTML("http://www.pengfu.com/xiaohua_1.html"), 'lxml')

With this change, the warning above no longer appears.

The code is as follows:

p_array = soup.find_all('p', {'class':"content-img clearfix pt10 relative"})

Use the find_all function to find all the p tags whose class is content-img clearfix pt10 relative, then traverse the array.

The code is as follows:

for x in p_array:
    content = x.string

In this way, we get the content of the target p tags. At this point we have achieved our goal and crawled our jokes.
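One detail worth noting (the complete listing at the end guards against it with a try/except): x.string returns None when the tag contains more than one child element. A sketch of a safer loop, using get_text() as a fallback rather than the try/except from the full listing:

for x in p_array:
    content = x.string
    if content is None:
        # the tag has nested children, so .string gives None; fall back to all inner text
        content = x.get_text()
    print(content.lstrip() + '\n\n')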
However, when crawling Qiushibaike (the second site in the complete listing below) in the same way, this error is reported.

The code is as follows:

    raise RemoteDisconnected("Remote end closed connection without"
                             " response")
http.client.RemoteDisconnected: Remote end closed connection without response

It says the remote end closed the connection without a response, yet there is nothing wrong with the network. What is the cause? Is my approach wrong?
Opening Charles to capture packets showed no response either. Strange: how can a website that opens fine in a browser be unreachable from Python? Could the User-Agent (UA) be the problem? Looking at Charles, I found that a request initiated with urllib carries the UA Python-urllib/3.5 by default, while the UA in Chrome is User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36. Could it be that the server rejects the Python crawler based on the UA? Let's disguise it and try.

The code is as follows:

def getHTML(url):
    # Disguise the request as Chrome so the server does not reject it by User-Agent
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
    req = request.Request(url, headers=headers)
    return request.urlopen(req).read()

In this way, Python disguises itself as Chrome when fetching the pages, and the data can be obtained smoothly.

At this point, crawling the jokes from Pengfu and Qiushibaike with Python is done. All we need to do is analyze the corresponding web pages, find the elements we are interested in, and use Python's powerful features to achieve our goal. Whether it is XXOO pictures or Neihan-style jokes, it can all be done with one click. Enough talk; I'm off to look for some girl pictures.

# -*- coding: utf-8 -*-
import sys
import urllib.request as request
from bs4 import BeautifulSoup

def getHTML(url):
    # Disguise the request as Chrome so the server does not reject it by User-Agent
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
    req = request.Request(url, headers=headers)
    return request.urlopen(req).read()

def get_pengfu_results(url):
    # Jokes on Pengfu sit in p tags with this class
    soup = BeautifulSoup(getHTML(url), 'lxml')
    return soup.find_all('p', {'class': "content-img clearfix pt10 relative"})

def get_pengfu_joke():
    for x in range(1, 2):
        url = 'http://www.pengfu.com/xiaohua_%d.html' % x
        for joke in get_pengfu_results(url):
            content = joke.string
            try:
                print(content.lstrip() + '\n\n')
            except:
                # content is None when the tag has nested children; skip it
                continue
    return

def get_qiubai_results(url):
    # Jokes on Qiushibaike are inside span tags within p.content elements
    soup = BeautifulSoup(getHTML(url), 'lxml')
    contents = soup.find_all('p', {'class': 'content'})
    results = []
    for x in contents:
        results.append(x.find('span').get_text('\n'))
    return results

def get_qiubai_joke():
    for x in range(1, 2):
        url = 'http://www.qiushibaike.com/8hr/page/%d/?s=4952526' % x
        for joke in get_qiubai_results(url):
            print(joke + '\n\n')
    return

if __name__ == '__main__':
    get_pengfu_joke()
    get_qiubai_joke()
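To run the complete listing, BeautifulSoup and lxml need to be installed first. Assuming the script is saved as joke.py (the file name mentioned in the parser warning above), something like the following should work:

pip install beautifulsoup4 lxml
python3 joke.py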
