That is, the starting URL is stored and the crawler gradually spreads out from it, crawling and storing every eligible web page URL so that the crawl can continue.
Let's write the first crawler, named dmoz_spider.py, and save it in the tutorial\spiders directory. The dmoz_spider.py code is as follows:
from scrapy.spider import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/

    ...
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            items.append(item)
        return items

4. Storing the content (Pipeline)
The simplest way to save the scraped information is through Feed exports, which come in four main formats: JSON, JSON lines, CSV and XML. We export the results as JSON, the most commonly used format, with the following command:
The code is as follows:
scrapy crawl dmoz -o it
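The spider above populates a DmozItem that the excerpt never defines, and the export command is cut off mid-filename. As a sketch under assumptions not taken from the excerpt: a minimal items.py declaring the three fields used in parse(), with the export command usually taking the form scrapy crawl dmoz -o items.json (the output file name is assumed here).

from scrapy.item import Item, Field

class DmozItem(Item):
    # One field for each value the spider extracts in parse().
    title = Field()
    link = Field()
    desc = Field()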
        if hasattr(e, 'code') and 500 <= e.code < 600:
            # retry 5XX HTTP errors
            html = download4(url, user_agent, num_retries - 1)
    return html

5. Proxy support
Sometimes we need to use a proxy to access a website. For example, Netflix blocks most countries outside the United States. We use the requests module to implement network proxy support.

import urllib2
import urlparse

def download5(url, user_agent='wswp', proxy=None, num_retries=2):
    """Download function
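The excerpt ends at the docstring. Under the assumption that download5 follows the same pattern as download4 above (urllib2 plus a retry on 5XX errors), here is a sketch of how the proxy-aware version could be completed with urllib2's ProxyHandler; it is a reconstruction, not necessarily the author's exact code.

import urllib2
import urlparse

def download5(url, user_agent='wswp', proxy=None, num_retries=2):
    """Download a URL, optionally through a proxy, retrying 5XX errors."""
    print('Downloading: ' + url)
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        # Route requests for this URL's scheme (http/https) through the proxy.
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        html = opener.open(request).read()
    except urllib2.URLError as e:
        print('Download error: ' + str(e))
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # Retry 5XX HTTP errors, which are often transient server faults.
                html = download5(url, user_agent, proxy, num_retries - 1)
    return html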
When we visited the site, we found that some page IDs were numbered sequentially, so we could crawl the content by traversing the IDs. The limitation is that some sites use ID numbers around ten digits long, which makes crawling this way very inefficient.

import itertools
from common import download

def iteration():
    max_errors = 5  # maximum number of consecutive download errors allowed
    num_errors = 0  # current number of consecutive download errors
    for page in itertools.count(1):
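The excerpt stops at the start of the loop. Here is a sketch of how an ID-traversal loop of this kind typically continues; the URL pattern http://example.com/view/-<ID> and the stop-after-max_errors rule are assumptions for illustration, not taken from the excerpt.

import itertools
from common import download  # the download helper imported in the excerpt above

def iteration():
    max_errors = 5   # maximum number of consecutive download errors allowed
    num_errors = 0   # current number of consecutive download errors
    for page in itertools.count(1):
        url = 'http://example.com/view/-%d' % page   # assumed URL pattern
        html = download(url)
        if html is None:
            # The download failed, so record one more consecutive error.
            num_errors += 1
            if num_errors == max_errors:
                # Too many consecutive errors: assume we passed the last valid ID.
                break
        else:
            # Success: reset the consecutive-error counter and process the page.
            num_errors = 0
            # ... scrape and store html here ...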
The previous nine articles covered everything from the basics to actually writing a crawler in detail; this tenth one rounds the series off, so here we will record in detail, step by step, how a crawler is written. Readers, please follow along carefully.
First of all, the website of our school:
http://jwxt.sdu.edu.cn:7777/zhxt_bks/zhxt_bks.html
Querying your results requires logging in; the results for each subject are then displayed, but only shows the resu
Let's see where the POST data is sent:
Well, by inspection this is the address the POST data is submitted to.
In the address bar, the complete address should be as follows:
http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login
(The way to get it is simple: just click that link in Firefox and look at the link's address.)
5. A first attempt
The next task is to use Python to simulate sending the POST data and fetch the returned cookie value.
The operation
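As a rough illustration of that task — POSTing the login form to the address above and reading back the cookie the server sets — here is a minimal Python 2 sketch using urllib2 and cookielib. The form field names ('stuid', 'pwd') and their values are assumptions for illustration; check the real field names in the page source.

# -*- coding: utf-8 -*-
import urllib
import urllib2
import cookielib

# Build an opener that keeps cookies in a CookieJar so we can read them back.
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

# Encode the login form data. The field names are assumed for illustration.
post_data = urllib.urlencode({'stuid': 'your_student_id', 'pwd': 'your_password'})

# Send the POST request to the login address found above.
response = opener.open('http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login',
                       post_data)

# Print every cookie the server returned.
for cookie in cookie_jar:
    print(cookie.name + '=' + cookie.value)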
Most websites have a robots.txt file, which states which directories web crawlers are allowed to access and which directories are off limits to crawlers. The reason to pay attention to this file is that accessing the forbidden directories can get your IP address banned. The following defines a sitemap crawler:

def crawl_sitemap(url):
    # Download
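The excerpt breaks off at the first comment. A minimal sketch of how such a sitemap crawler is commonly written, assuming a download(url) helper like the one in the earlier excerpts and a regular-expression scan for <loc> entries; this is an illustration, not necessarily the author's exact code.

import re

def crawl_sitemap(url):
    # Download the sitemap XML (assumes a download(url) helper as defined
    # earlier in these excerpts).
    sitemap = download(url)
    if sitemap is None:
        return
    # Extract every <loc>...</loc> entry, i.e. the page URLs listed in the sitemap.
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    for link in links:
        # Download each listed page in turn.
        html = download(link)
        # ... scrape html here ...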
For more about cookies, take a look at this blog post:
http://www
to write to the file.

    """Defines the function that writes the data to a file"""
    for i in range(num):
        u = ulist[i]
        with open('D:/test.txt', 'a') as data:
            print(u, file=data)

if __name__ == '__main__':
    list = []  # I previously put list = [] inside the for loop of the get_data() function,
               # which emptied the list on every iteration before appending data, so in the
               # end only the last set of data was traversed...
    url = 'http://www.zuihaodaxue.com/shengyuanzhiliangpaiming2018.html'
    html = ge
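The excerpt ends at html = ge, presumably a call to a page-fetching helper. A hypothetical sketch of such a helper using requests; the name get_html and its behaviour are assumptions for illustration, not the author's actual function.

import requests

def get_html(url):
    # Hypothetical helper: fetch the page and return its text,
    # or an empty string if the request fails.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding  # use the detected encoding for Chinese pages
        return r.text
    except requests.RequestException:
        return ''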
The Baidu Tieba crawler is built on basically the same principle as the Qiushibaike crawler: locate the key data by viewing the page source, then store it in a local TXT file.
Project content:
A web crawler for Baidu Tieba, written in Python.
How to use:
Cre
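As a rough sketch of the idea described above (fetch the pages of a Tieba thread and save them to a local TXT file), in Python 2 with urllib2; the thread URL, the ?pn= page parameter and the helper names are assumptions for illustration, not the article's actual script.

# -*- coding: utf-8 -*-
import urllib2

def fetch_page(url):
    # Download one page of the thread and return its raw HTML.
    request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    return urllib2.urlopen(request).read()

def save_thread(base_url, num_pages, filename):
    # Walk the thread page by page (?pn=1, ?pn=2, ...) and append each
    # page's HTML to a local TXT file.
    with open(filename, 'a') as f:
        for pn in range(1, num_pages + 1):
            f.write(fetch_page(base_url + '?pn=%d' % pn))

# Example call (the thread URL and page count are made up for illustration):
# save_thread('http://tieba.baidu.com/p/1234567890', 3, 'tieba.txt')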
without a very systematic expert to guide you, you can only learn from scattered blog posts online; in that case it is still better to learn 2.7 first and then 3.x, since after mastering 2.7 you can pick up 3.x quickly.
Knowledge points involved in a multi-threaded crawler:
In fact, for any software project, if we want to know what knowledge is needed to write it, we can look at which packages are imported in its main entry file
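To illustrate the kind of knowledge points meant here (worker threads, a shared task queue, a lock around shared state), here is a minimal multi-threaded crawl skeleton in Python 3; it is a generic sketch rather than the project discussed in the article, and the URLs and thread count are placeholders.

import queue
import threading
import urllib.request

def worker(task_queue, results, lock):
    # Each thread repeatedly takes a URL off the shared queue, downloads it,
    # and stores the result in a dictionary protected by a lock.
    while True:
        try:
            url = task_queue.get_nowait()
        except queue.Empty:
            return
        try:
            html = urllib.request.urlopen(url, timeout=10).read()
        except Exception:
            html = None
        with lock:
            results[url] = html

def threaded_crawl(urls, num_threads=8):
    task_queue = queue.Queue()
    for url in urls:
        task_queue.put(url)
    results, lock = {}, threading.Lock()
    threads = [threading.Thread(target=worker, args=(task_queue, results, lock))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Example: threaded_crawl(['http://example.com/a', 'http://example.com/b'])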
Writing a Python Crawler from Scratch: a Guide to Using urllib2
urllib2 was briefly introduced earlier; the following covers some details of how urllib2 is used.
1. Proxy settings
By default, urllib2 uses the environment variable http_proxy to set the HTTP proxy.
You can set the proxy yourself if you want to control it explicitly in your program and not be affected by the environment variable.
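A sketch of the usual way to take explicit control of the proxy with urllib2's ProxyHandler (Python 2); the proxy address is a placeholder, and whether you install the opener globally is a design choice, as noted in the comments.

import urllib2

enable_proxy = True

# The proxy address below is a placeholder; replace it with a real host:port.
proxy_handler = urllib2.ProxyHandler({'http': 'http://some-proxy.example.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})

if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)

# install_opener makes this opener the global default used by urllib2.urlopen;
# to keep the setting local, call opener.open(url) instead of installing it.
urllib2.install_opener(opener)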
Recently I used Python to write a crawler. I started with 3.4 and put a lot of effort into writing it, but at runtime it frequently stopped working; after reducing the number of threads (16 -> 8) stability improved, but it still occasionally stopped working. I therefore switched to Python 3.5 and found that some packages do not support