Wrote a simple web crawler:

# coding=utf-8
from bs4 import BeautifulSoup
import requests

url = "http://www.weather.com.cn/textFC/hb.shtml"

def get_temperature(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
        'Upgrade-Insecure-Requests': '1',
        'Referer': 'http://www.weather.com.cn/weather1d/10129160502A.shtml',
    }
1. Python web crawler to get Taobao commodity prices:

# -*- coding: utf-8 -*-
'''
Created on March 17, 2017
@author: Lavi
'''
import requests
from bs4 import BeautifulSoup
import bs4
import re

def gethtmltext(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise an exception for a non-200 status
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
A BeautifulSoup object corresponds to the entire content of an HTML/XML document. The Beautiful Soup library supports several parsers:

soup = BeautifulSoup(data, 'html.parser')
Parser              How to use                         Conditions
bs4 HTML parser     BeautifulSoup(mk, 'html.parser')   install the bs4 library
lxml HTML parser    BeautifulSoup(mk, 'lxml')          pip install lxml
lxml XML parser     BeautifulSoup(mk, 'xml')           pip install lxml
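A quick check of the html.parser row in the table above; a minimal sketch assuming only the bs4 library is installed (the markup is a made-up example):

```python
from bs4 import BeautifulSoup

# Parse a tiny document with the pure-Python parser from the table above
soup = BeautifulSoup('<html><p class="title">data</p></html>', 'html.parser')
print(soup.p.string)          # the text inside the <p> tag
print(soup.p.attrs['class'])  # attribute values come back as a list
```

Swapping 'html.parser' for 'lxml' gives the same tree here, just parsed faster, once lxml is installed.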
1. The engine gets the first URL to crawl from the spider and schedules it as a request in the scheduler.
2. The engine asks the scheduler for the next page to crawl.
3. The scheduler returns the next URL to crawl to the engine, and the engine sends it to the downloader through the downloader middleware.
4. Once the downloader has fetched the page, it sends the response content back to the engine through the downloader middleware.
5. The engine receives the response and passes it to the spider for processing.
scrapy bench creates a local server and crawls it at maximum speed, in order to benchmark the local hardware. To keep other factors from skewing the result, it only follows links and does no content processing, so the number depends purely on hardware performance. It reports that roughly 2,400 pages per minute can be crawled; this is only a reference standard. In actual crawler projects, speed varies due to many factors; in general,
=============== Crawler principle ==================
Access the website via Python, get the HTML code of the site, and extract the src image addresses from the img tags with a regular expression. Then access each image address and save the picture locally via file I/O.
=============== Script code ==================
import urllib
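A minimal sketch of that principle, assuming Python 3's urllib; the regex and the img_N.jpg filename pattern are illustrative, not the article's original script:

```python
import re
import urllib.request

# A deliberately simple pattern; real pages may need an HTML parser instead
IMG_SRC_RE = re.compile(r'<img[^>]+src="(https?://[^"]+\.(?:jpg|jpeg|png|gif))"')

def extract_image_urls(html):
    """Pull the src attribute out of img tags with a regular expression."""
    return IMG_SRC_RE.findall(html)

def save_images(page_url, limit=5):
    """Download the page, then fetch each image and write it to disk via file I/O."""
    html = urllib.request.urlopen(page_url).read().decode('utf-8', errors='ignore')
    for i, src in enumerate(extract_image_urls(html)[:limit]):
        with urllib.request.urlopen(src) as resp, open('img_%d.jpg' % i, 'wb') as out:
            out.write(resp.read())
```

Keeping the extraction step separate from the download step makes the regex easy to test without any network access.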
For example, crawling baidu.com; this should be written for Python 3.4.
Error tip 1: print "Hello" raises SyntaxError: Missing parentheses in call to 'print'. The print syntax differs between versions 2 and 3: print("Hello") in Python 3, print "Hello" in Python 2.
Error tip 2: No module named 'urllib2'. From Python 3.3 on, replace urllib2 with urllib.request; refer to the official documentation.
)  # --------- wait 5 seconds before the next step
req3 = urllib.urlopen()
Always space out multiple simple page crawls with 5-second intervals in between.

import urllib, urllib2
url = 'https://api.douban.com/v2/book/user/ahbei/collections'
data = {'status': 'read', 'rating': 3, 'tag': 'novel'}
data = urllib.urlencode(data)
req = urllib2.Request(url, data)
res = urllib2.urlopen(req)
print res.read()

This is a standard POST request, but with many visits to the site, the IP is easily blocked.
import urllib,
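The snippet above is Python 2. A sketch of the same POST in Python 3, where urllib2 and urllib.urlencode became urllib.request and urllib.parse (the Douban endpoint is taken from the text and may no longer accept such requests):

```python
import time
import urllib.parse
import urllib.request

url = 'https://api.douban.com/v2/book/user/ahbei/collections'
payload = {'status': 'read', 'rating': 3, 'tag': 'novel'}
data = urllib.parse.urlencode(payload).encode('utf-8')  # a bytes body makes this a POST

req = urllib.request.Request(url, data)
# Space out repeated requests to lower the chance of an IP ban:
# time.sleep(5)
# with urllib.request.urlopen(req) as res:
#     print(res.read().decode('utf-8'))
```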
Last night, when I was bored, I practiced some Python and wrote a little crawler to fetch the update data of the Xiaodao entertainment network.
#!/usr/bin/python
# coding: utf-8
import urllib.request
import re

# Subroutine that fetches a page's source code
head = "www.xiaodao.la"
def get():
    data = urllib.request.urlopen('http://www.xiaodao.la').read()
First, install Scrapy.
Import the GPG key:
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7
Add the software source:
echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list
Update the package list and install Scrapy:
sudo apt-get update
sudo apt-get install scrapy-0.22
Second, the composition of Scrapy.
Third, a quick start with Scrapy: after running scrapy, you only need to rewrite a downloader. Below is someone else's example of crawling job site information.
Share: https://pan.baidu.com/s/1c3emfje  Password: eew4
Alternate address: https://pan.baidu.com/s/1htwp1ak  Password: u45n
Content introduction: this course is intended for students who have never touched Python, starting with the most basic grammar and gradually moving into popular applications. The whole course is divided into two units, fundamentals and practice. The basics part includes Python
[Python] Web crawler (4): Opener and Handler
Before proceeding, let's first explain two methods in urllib2: info and geturl. The response object (or HTTPError instance) returned by urlopen has two useful methods: info() and geturl().

1. geturl():
geturl() returns the real URL that was actually fetched. This is useful because urlopen (or the opener object used) may have followed a redirect, so the final URL can differ from the one requested.
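In Python 3 these methods live on the response returned by urllib.request.urlopen. A small offline illustration using a file:// URL, so a temp file stands in for a web page (the filename is made up):

```python
import os
import tempfile
import urllib.request

# Write a tiny "page" to disk so no network access is needed
path = os.path.join(tempfile.mkdtemp(), 'page.html')
with open(path, 'w') as f:
    f.write('<html>hello</html>')

resp = urllib.request.urlopen('file://' + path)
print(resp.geturl())   # the real URL that was fetched
print(resp.info())     # header-like metadata such as Content-Length
```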
For convenience, under Windows I used PyCharm; personally I feel this is an excellent piece of software for learning Python. A crawler, that is, a web crawler, can be understood as a spider crawling on the internet: the internet is likened to a large web, and the crawler is the spider moving across that web.
import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()  # added so the snippet is runnable
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

Urllib exception handling: if a running program hits an error midway and we wrote no exception handling, the data fetched so far is lost. When scraping the Douban movie top 250, some of the movie entries had incomplete parameters, causing the
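A minimal sketch of the missing exception handling that paragraph describes, using urllib.error so one bad page doesn't lose the whole run (the URL in the test is just a placeholder):

```python
import urllib.error
import urllib.request

def fetch(url):
    """Return the page text, or None on failure, instead of crashing mid-crawl."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode('utf-8', errors='ignore')
    except urllib.error.HTTPError as e:   # the server answered with a 4xx/5xx status
        print('HTTP error:', e.code)
    except urllib.error.URLError as e:    # DNS failure, refused connection, ...
        print('URL error:', e.reason)
    return None
```

Catching HTTPError before URLError matters: HTTPError is a subclass of URLError, so the more specific handler must come first.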
The requests library is an HTTP client written in Python. requests is more convenient than urlopen: it saves a lot of intermediate processing, so you can crawl web data directly. A concrete example:

def request_function_try():
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0'}
    r = requests.get(url="http://www.baidu.com", headers=headers)
    pri
In Python 3.x, we can get the content of a web page in two ways.
Crawl target: National Geographic Chinese Network
url = 'http://www.ngchina.com.cn/travel/'
The urllib library
1. Import it:
from urllib import request
2. Get the content of the web page:
with request.urlopen(url) as file:
    data = file.read()
    print(data)
Running this produces an error:
urllib.error.HTTPError: HTTP Error 403: Forbidden
Mainly because the site rejects requests that do not look like they come from a browser; urllib's default User-Agent gives the crawler away.
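A common fix, assuming the 403 comes from the site rejecting Python's default User-Agent, is to attach browser-like headers via request.Request (the UA string below is an example):

```python
from urllib import request

url = 'http://www.ngchina.com.cn/travel/'
req = request.Request(url, headers={
    # Many sites return 403 Forbidden to the default "Python-urllib" agent
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
})
# with request.urlopen(req) as f:
#     data = f.read()
```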
import re
import urllib.request

# ------ Method to get the web page source code ------
def gethtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    return html

# ------ Enter the URL of any post ------
html = gethtml("https://tieba.baidu.com/p/5352556650")
# ------ Modify the character encoding of the html object to UTF-8 ------
html = html.decode('UTF-8')
# ------ How
First, the definition of a web crawler. "Web crawler", or spider, is a very vivid name: the internet is likened to a spider's web, and the spider is a crawler moving around that web. Web spiders find pages by following their link addresses.
You can also search with multiple parameters at once, such as finding form tags:

for a in html.find_all('form', method="POST", target="_blank"):
    a.encode('gbk')

Of course, regular expressions can also be used in searches, with methods like re.compile("a.*").
You can also limit the number of results; the following expression returns only the first 5 matches:

for a in html.find_all('a', limit=5):
    a.attrs['class']

The find family also includes find_parents()/find_parent() to locate parent nodes, and find_next_siblings()/find_next_sibling()
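The lookups above can be exercised on a tiny made-up document (assumes bs4 is installed; the markup and class names are illustrative):

```python
import re
from bs4 import BeautifulSoup

doc = ('<form method="POST" target="_blank"></form>'
       '<a class="one" href="/1">1</a><a class="two" href="/2">2</a>')
soup = BeautifulSoup(doc, 'html.parser')

forms = soup.find_all('form', method="POST", target="_blank")  # attribute filters
first = soup.find_all('a', limit=1)                            # cap the result count
regex = soup.find_all(re.compile('a.*'))                       # regex matches tag names
print(len(forms), first[0].attrs['class'], len(regex))
```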