Python Scrapy simple crawler notes (for simple crawls)

Source: Internet
Author: User
Tags: xpath, python, scrapy

My earlier Scrapy study notes only gave a brief introduction to Scrapy, with no actual example. This time I'll record a working example.

The environment used is Python 2.7 with Scrapy 1.2.0.

Create a project first.

In the directory where you want the project to live, run:

scrapy startproject tutorial

Scrapy builds the project skeleton for you. Then generate a spider:

scrapy genspider zhihuspider zhihu.com

The current file structure is:

tutorial/
    spiders/
        __init__.py
        zhihuspider.py
    __init__.py
    items.py
    pipelines.py
    settings.py

The three main files we'll use are zhihuspider.py, items.py, and settings.py.

zhihuspider.py holds the crawling spider, items.py defines containers for the scraped data, and settings.py carries the configuration.

zhihuspider.py, the main body of the spider:

# -*- coding: utf-8 -*-
from scrapy.spiders import Spider
from scrapy.selector import Selector
from tutorial.items import ZhihuItem


class ZhihuSpider(Spider):
    name = "zhihuspider"
    allowed_domains = ['zhihu.com']
    start_urls = []

    # Build the URLs to crawl; I put the search keywords in zhihu.txt
    def start_requests(self):
        url_head = 'https://www.zhihu.com/search?type=content&q='
        with open('zhihu.txt', 'r') as f:
            datas = f.readlines()
        for data in datas:
            url = url_head + data.strip()  # strip the trailing newline
            print url
            self.start_urls.append(url)
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    # Parse the response; use the browser's developer tools to work out the XPaths
    def parse(self, response):
        # Zhihu wraps the search keywords in <em> tags, which splits the text
        # nodes, so strip those tags out of the body first
        response = response.replace(body=response.body.replace('<em>', '').replace('</em>', ''))
        hxs = Selector(response)
        contents = hxs.xpath('//*[@class="zu-main-content"]//*[contains(@class, "list")]')
        for content in contents:
            # the leading './/' keeps each XPath relative to the current node
            item = ZhihuItem()
            item['search_title'] = content.xpath('.//*[@class="title"]/a/text()').extract()
            item['search_title_link'] = content.xpath('.//*[@class="title"]/a/@href').extract()
            item['search_answer'] = content.xpath('.//*[@class="content"]//*[contains(@class, "entry-content")]//*[contains(@class, "summary")]/text()').extract()
            item['search_answer_link'] = content.xpath('.//*[@class="content"]//*[contains(@class, "entry-content")]//*[contains(@class, "summary")]/a/@href').extract()
            item['search_answer_writer'] = content.xpath('.//*[@class="content"]//*[contains(@class, "entry-meta")]//a[contains(@class, "author")]/text()').extract()
            print item
            yield item
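For reference, zhihu.txt just holds one search keyword per line. The file below is only a made-up illustration; the keywords are not from the original project:

python
scrapy
machine learning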
items.py declares a Field for each piece of data the spider collects:

# -*- coding: utf-8 -*-
from scrapy import Item, Field


class ZhihuItem(Item):
    search_title = Field()
    search_title_link = Field()
    search_answer = Field()
    search_answer_link = Field()
    search_answer_writer = Field()

The items file holds the scraped data.

Because Zhihu places restrictions on crawlers, we need to add some anti-anti-crawler measures.

There are basically four: adding a User-Agent, using proxies, disabling cookies, and limiting the crawl rate. The first is implemented as a middleware below (with a proxy sketch right after it); the last two are plain settings.

Add a Python package named middlewares under the current project, in the same directory as settings.py; initially it contains only an __init__.py. Inside it, create a new file RandomUserAgent.py:

# -*- coding: utf-8 -*-
import random


class RandomUserAgent(object):
    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the USER_AGENTS list from settings.py
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # Attach a randomly chosen User-Agent to every outgoing request
        request.headers.setdefault('User-Agent', random.choice(self.agents))

This simply attaches a random User-Agent to each request.
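The proxy measure can be handled the same way. Below is a minimal sketch, not part of the original project, assuming a hypothetical PROXIES list defined in settings.py (e.g. PROXIES = ['http://1.2.3.4:8080']); save it as, say, middlewares/RandomProxy.py:

# -*- coding: utf-8 -*-
import random


class RandomProxy(object):
    # A sketch only: PROXIES is a hypothetical setting, not from the original project
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('PROXIES'))

    def process_request(self, request, spider):
        # Route each request through a randomly chosen proxy
        if self.proxies:
            request.meta['proxy'] = random.choice(self.proxies)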

Finally, wire it all up in settings.py:

DOWNLOADER_MIDDLEWARES = {
    # 'tutorial.middlewares.MyCustomDownloaderMiddleware': 543,
    'tutorial.middlewares.RandomUserAgent.RandomUserAgent': 1,
}

USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-CN) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
]

COOKIES_ENABLED = False  # disable cookies
DOWNLOAD_DELAY = 3       # wait 3 seconds between requests
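If you also use the RandomProxy sketch from earlier, it would be registered the same way; the second entry below is hypothetical and not part of the original project:

DOWNLOADER_MIDDLEWARES = {
    'tutorial.middlewares.RandomUserAgent.RandomUserAgent': 1,
    'tutorial.middlewares.RandomProxy.RandomProxy': 2,  # hypothetical
}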

That completes the whole project. Run

scrapy crawl zhihuspider -o zhihu.json

to export the data. It's that simple.
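To sanity-check the export, here is a quick read-back in Python; a minimal sketch that relies only on the fact that Scrapy's -o zhihu.json exporter writes a single JSON array:

# -*- coding: utf-8 -*-
import json

# Load the items that `scrapy crawl zhihuspider -o zhihu.json` wrote out
with open('zhihu.json') as f:
    items = json.load(f)

print len(items)       # number of scraped items
print items[0].keys()  # the five fields declared in ZhihuItem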

Finally, when crawling, mind the crawl rate and don't put an unnecessary burden on the server.
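Beyond the fixed DOWNLOAD_DELAY above, Scrapy also ships an AutoThrottle extension that adapts the delay to server load; it is not used in the original project, but enabling it is just a few more settings:

# settings.py -- optional: let Scrapy adapt the delay to server load
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3   # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60    # upper bound when the server responds slowly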

The project has been pushed to GitHub at https://github.com/lin344902118/doubanSpider.git

The repository also includes a couple of crawlers for Douban.

