Python Scrapy simple crawler notes (for simple crawls)

Source: Internet
Author: User
Tags: xpath, python, scrapy

My earlier Scrapy study notes only gave a brief introduction to Scrapy, with no actual example. This time I'll record a working example.

The environment used is Python 2.7 with Scrapy 1.2.0.

Create a project first.

In the directory where you want the project to live, run:

scrapy startproject tutorial

Scrapy builds the project skeleton for you. Then generate a spider:

scrapy genspider zhihuspider zhihu.com

The current file structure is:

tutorial/
    spiders/
        __init__.py
        zhihuspider.py
    __init__.py
    items.py
    pipelines.py
    settings.py

The three main files we'll use are zhihuspider.py, items.py, and settings.py.

zhihuspider.py holds the crawling spider, items.py defines containers for the scraped data, and settings.py carries the configuration.

zhihuspider.py, the main body of the spider:

# -*- coding: utf-8 -*-
from scrapy.spiders import Spider
from scrapy.selector import Selector
from tutorial.items import ZhihuItem


class ZhihuSpider(Spider):
    name = "zhihuspider"
    allowed_domains = ['zhihu.com']
    start_urls = []

    # Build the URLs to crawl; I put the search keywords in zhihu.txt
    def start_requests(self):
        url_head = 'https://www.zhihu.com/search?type=content&q='
        with open('zhihu.txt', 'r') as f:
            datas = f.readlines()
        for data in datas:
            url = url_head + data.strip()  # strip the trailing newline
            print url
            self.start_urls.append(url)
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    # Parse the response; use the browser's developer tools to work out the XPaths
    def parse(self, response):
        # Zhihu wraps the search keywords in <em> tags, which splits the text
        # nodes, so strip those tags out of the body first
        response = response.replace(body=response.body.replace('<em>', '').replace('</em>', ''))
        hxs = Selector(response)
        contents = hxs.xpath('//*[@class="zu-main-content"]//*[contains(@class, "list")]')
        for content in contents:
            # the leading './/' keeps each XPath relative to the current node
            item = ZhihuItem()
            item['search_title'] = content.xpath('.//*[@class="title"]/a/text()').extract()
            item['search_title_link'] = content.xpath('.//*[@class="title"]/a/@href').extract()
            item['search_answer'] = content.xpath('.//*[@class="content"]//*[contains(@class, "entry-content")]//*[contains(@class, "summary")]/text()').extract()
            item['search_answer_link'] = content.xpath('.//*[@class="content"]//*[contains(@class, "entry-content")]//*[contains(@class, "summary")]/a/@href').extract()
            item['search_answer_writer'] = content.xpath('.//*[@class="content"]//*[contains(@class, "entry-meta")]//a[contains(@class, "author")]/text()').extract()
            print item
            yield item
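For reference, zhihu.txt just holds one search keyword per line. The file below is only a made-up illustration; the keywords are not from the original project:

python
scrapy
machine learning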
items.py declares a Field for each piece of data the spider collects:

# -*- coding: utf-8 -*-
from scrapy import Item, Field


class ZhihuItem(Item):
    search_title = Field()
    search_title_link = Field()
    search_answer = Field()
    search_answer_link = Field()
    search_answer_writer = Field()

The items file holds the scraped data.

Because Zhihu places restrictions on crawlers, we need to add some anti-anti-crawler measures.

There are basically four: adding a User-Agent, using proxies, disabling cookies, and limiting the crawl rate. The first is implemented as a middleware below (with a proxy sketch right after it); the last two are plain settings.

Add a Python package named middlewares under the current project, in the same directory as settings.py; initially it contains only an __init__.py. Inside it, create a new file RandomUserAgent.py:

# -*- coding: utf-8 -*-
import random


class RandomUserAgent(object):
    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the USER_AGENTS list from settings.py
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # Attach a randomly chosen User-Agent to every outgoing request
        request.headers.setdefault('User-Agent', random.choice(self.agents))

This simply attaches a random User-Agent to each request.
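The proxy measure can be handled the same way. Below is a minimal sketch, not part of the original project, assuming a hypothetical PROXIES list defined in settings.py (e.g. PROXIES = ['http://1.2.3.4:8080']); save it as, say, middlewares/RandomProxy.py:

# -*- coding: utf-8 -*-
import random


class RandomProxy(object):
    # A sketch only: PROXIES is a hypothetical setting, not from the original project
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('PROXIES'))

    def process_request(self, request, spider):
        # Route each request through a randomly chosen proxy
        if self.proxies:
            request.meta['proxy'] = random.choice(self.proxies)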

Finally, wire it all up in settings.py:

DOWNLOADER_MIDDLEWARES = {
    # 'tutorial.middlewares.MyCustomDownloaderMiddleware': 543,
    'tutorial.middlewares.RandomUserAgent.RandomUserAgent': 1,
}

USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-CN) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
]

COOKIES_ENABLED = False  # disable cookies
DOWNLOAD_DELAY = 3       # wait 3 seconds between requests
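If you also use the RandomProxy sketch from earlier, it would be registered the same way; the second entry below is hypothetical and not part of the original project:

DOWNLOADER_MIDDLEWARES = {
    'tutorial.middlewares.RandomUserAgent.RandomUserAgent': 1,
    'tutorial.middlewares.RandomProxy.RandomProxy': 2,  # hypothetical
}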

That completes the whole project. Run

scrapy crawl zhihuspider -o zhihu.json

to export the data. It's that simple.
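To sanity-check the export, here is a quick read-back in Python; a minimal sketch that relies only on the fact that Scrapy's -o zhihu.json exporter writes a single JSON array:

# -*- coding: utf-8 -*-
import json

# Load the items that `scrapy crawl zhihuspider -o zhihu.json` wrote out
with open('zhihu.json') as f:
    items = json.load(f)

print len(items)       # number of scraped items
print items[0].keys()  # the five fields declared in ZhihuItem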

Finally, when crawling, mind the crawl rate and don't put an unnecessary burden on the server.
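Beyond the fixed DOWNLOAD_DELAY above, Scrapy also ships an AutoThrottle extension that adapts the delay to server load; it is not used in the original project, but enabling it is just a few more settings:

# settings.py -- optional: let Scrapy adapt the delay to server load
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3   # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60    # upper bound when the server responds slowly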

The project has been pushed to GitHub at https://github.com/lin344902118/doubanSpider.git

The repository also includes a couple of crawlers for Douban.

