Basic scrapy framework

Scrapy getting started tutorial:

1. Install: the Scrapy framework depends on Twisted, which needs to be downloaded as a wheel from the lfd.uci.edu/~gohlke/pythonlibs download page and placed under the Python Scripts directory, then installed before Scrapy itself:

pip install c:\Python\anaconda3\Twisted-18.7.0-cp36-cp36m-win_amd64.whl
pip install scrapy

2. Create a Scrapy project. Because PyCharm has no integrated command for this, run it in a terminal and then open the generated project in a new PyCharm window:

scrapy startproject projectname

Create a crawler file with one of the following commands (spider name followed by the domain to crawl):

scrapy genspider baidu baidu.com
scrapy genspider -t crawl baidu baidu.com

3. Modify the configuration file settings.py:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36 Maxthon/5.2.3.6000'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
    'xiaoshuo_pc.pipelines.XiaoshuoPcPipeline': 300,
}
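The ITEM_PIPELINES entry above points at a pipeline class in the project's pipelines.py that this tutorial never shows. A minimal sketch of what such a class could look like, appending every yielded chapter to a text file (the class name is taken from the setting; the output file name and format are assumptions, not the original author's code):

# pipelines.py -- minimal sketch, assumptions noted above
class XiaoshuoPcPipeline(object):
    def open_spider(self, spider):
        # Open the output file once when the spider starts (file name is an assumption).
        self.file = open('book.txt', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # Each item is the {"title": ..., "content": ...} dict yielded by the spider.
        self.file.write(item['title'] + '\n' + item['content'] + '\n\n')
        return item

    def close_spider(self, spider):
        self.file.close()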
4. Run the crawler:

scrapy crawl name (name is the spider's name attribute)
scrapy crawl name -o book.json (write the output to a file; json, xml and csv are supported)
scrapy crawl name -o book.json -t json (-t sets the output format explicitly and can usually be omitted)
** During the first run you may hit a "No module named win32api" error. This happens because Python ships without a library for the Windows system API, so the third-party pywin32 package is needed. Download the installer for your Python version from http://sourceforge.net/projects/pywin32/files/pywin32/ and double-click it in the Scripts directory, or simply run pip install pypiwin32.
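After installing pywin32 (or pypiwin32), a quick way to confirm the module is importable before re-running the crawler:

python -c "import win32api; print('pywin32 is available')"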
III. Sample code for novel acquisition. Create the entry execution file main.py:
from scrapy.cmdline import execute

execute("scrapy crawl shiqik".split())  # shiqik is the name defined in the spider file below
import scrapy


class ShiqikSpider(scrapy.Spider):
    name = 'shiqik'
    allowed_domains = ['17k.com']
    start_urls = ['https://www.81zw.us/book/1379/6970209.html']

    def parse(self, response):
        title = response.xpath('//div[@class="bookname"]/h1/text()').extract_first()
        content = ''.join(response.xpath('//div[@id="content"]/text()').extract()).replace('    ', '\n')
        yield {"title": title, "content": content}
        next_page = response.xpath('//div[@class="bottem2"]/a[3]/@href').extract_first()
        if next_page.find(".html") != -1:
            print("continue to next URL")
            new_url = response.urljoin(next_page)
            yield scrapy.Request(new_url, callback=self.parse, dont_filter=True)
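Besides running it through main.py, the spider can also be started from the command line with the feed export from step 4 (the output file name is just an example):

scrapy crawl shiqik -o book.json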
IV. Sample code for novel acquisition using CrawlSpider:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BayizhongwenSpider(CrawlSpider):
    name = 'bayizhongwen'
    allowed_domains = ['81zw.us']
    # start_urls = ['https://www.81zw.us/book/1215/863759.html']
    start_urls = ['https://www.81zw.us/book/1215']

    rules = (
        Rule(LinkExtractor(restrict_xpaths=r'//dl/dd[2]/a'), callback='parse_item', follow=True),
        Rule(LinkExtractor(restrict_xpaths=r'//div[@class="bottem1"]/a[3]'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        title = response.xpath('//div[@class="bookname"]/h1/text()').extract_first()
        content = ''.join(response.xpath('//div[@id="content"]/text()').extract()).replace('    ', '\n')
        print({"title": title, "content": content})
        yield {"title": title, "content": content}
The following example crawls wallpaper images (project tupian, spider zol).

1. Create a project:

(venv) c:\Users\NOC\pycharmprojects> scrapy startproject tupian

2. Create a spider:

(venv) c:\Users\NOC\pycharmprojects\tupian> scrapy genspider zol zol.com.cn

3. Modify the configuration in the settings.py file:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
 
# Configure item Pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 'tupian.pipelines.TupianPipeline': 300,
    'scrapy.contrib.pipeline.images.ImagesPipeline': 300,  # newer Scrapy versions use scrapy.pipelines.images.ImagesPipeline
}
# Add an image storage directory
IMAGES_STORE = 'e:/img'
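Note that Scrapy's built-in ImagesPipeline depends on the Pillow imaging library; if images are silently not saved, install it first:

pip install pillow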
4. Create the entry execution file start.py:

from scrapy.cmdline import execute

execute("scrapy crawl zol".split())  # zol is the name defined in the zol spider file
 
V. Main spider file code (zol.py):
 
import scrapy


class ZolSpider(scrapy.Spider):
    name = 'zol'
    allowed_domains = ['zol.com.cn']
    start_urls = ['http://developer.zol.com.cn/bizhi/7239_89590_2.html']  # address of the image page to crawl

    def parse(self, response):
        image_url = response.xpath('//img[@id="bigimg"]/@src').extract()  # address of the current image
        image_name = response.xpath('string(//h3)').extract_first()  # image name
        yield {"image_url": image_url, "image_name": image_name}  # push the item to the pipeline
        next_page = response.xpath('//a[@id="pagenext"]/@href').extract_first()  # href of the "next" button
        if next_page.find('.html') != -1:  # the last page's "next" link no longer contains .html
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
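The stock ImagesPipeline configured in settings.py only downloads URLs found in an image_urls field, while this spider yields image_url and image_name. A hedged sketch of a custom pipeline that bridges the two (the class name, file naming scheme and the modern import path are assumptions, not the original author's code; it would be registered in ITEM_PIPELINES as tupian.pipelines.ZolImagesPipeline):

# pipelines.py -- sketch of a custom images pipeline for the fields yielded above
import scrapy
from scrapy.pipelines.images import ImagesPipeline  # older releases used scrapy.contrib.pipeline.images


class ZolImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # image_url is extracted with .extract(), so it is a list of URLs.
        for url in item.get('image_url', []):
            yield scrapy.Request(url, meta={'image_name': item.get('image_name')})

    def file_path(self, request, response=None, info=None, *, item=None):
        # Save under the crawled image name instead of the default SHA1 hash.
        return '%s.jpg' % request.meta['image_name']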
 
 
VI. middlewares.py file:
 
from random import choice

from fake_useragent import UserAgent

from tupian.settings import USER_AGENT


# User-Agent settings
class UserAgentDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # if USER_AGENT:
        #     request.headers.setdefault(b'User-Agent', choice(USER_AGENT))
        request.headers.setdefault(b'User-Agent', UserAgent().random)


# Proxy settings
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # request.meta['proxy'] = 'http://ip:port'
        request.meta['proxy'] = 'http://124.235.145.79:80'
        # Proxy with authentication:
        # request.meta['proxy'] = 'http://user:password@ip:port'
        # request.meta['proxy'] = 'http://398707160:[email protected]:8080'
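For these two middlewares to actually run, they also have to be enabled in settings.py; a minimal sketch (the module path follows the tupian project above, the priority values are arbitrary):

# settings.py -- enable the downloader middlewares defined above
DOWNLOADER_MIDDLEWARES = {
    'tupian.middlewares.UserAgentDownloaderMiddleware': 543,
    'tupian.middlewares.ProxyMiddleware': 544,
}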