Basic scrapy framework

Scrapy getting started tutorial:

1. Install: the Scrapy framework depends on Twisted, which needs to be downloaded as a wheel from the lfd.uci.edu/~gohlke/pythonlibs download page and placed under the Python Scripts directory, then installed before Scrapy itself:

pip install c:\Python\anaconda3\Twisted-18.7.0-cp36-cp36m-win_amd64.whl
pip install scrapy

2. Create a Scrapy project. Because PyCharm has no integrated command for this, run it in a terminal and then open the generated project in a new PyCharm window:

scrapy startproject projectname

Create a crawler file with one of the following commands (spider name followed by the domain to crawl):

scrapy genspider baidu baidu.com
scrapy genspider -t crawl baidu baidu.com

3. Modify the configuration file settings.py:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36 Maxthon/5.2.3.6000'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
    'xiaoshuo_pc.pipelines.XiaoshuoPcPipeline': 300,
}
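The ITEM_PIPELINES entry above points at a pipeline class in the project's pipelines.py that this tutorial never shows. A minimal sketch of what such a class could look like, appending every yielded chapter to a text file (the class name is taken from the setting; the output file name and format are assumptions, not the original author's code):

# pipelines.py -- minimal sketch, assumptions noted above
class XiaoshuoPcPipeline(object):
    def open_spider(self, spider):
        # Open the output file once when the spider starts (file name is an assumption).
        self.file = open('book.txt', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # Each item is the {"title": ..., "content": ...} dict yielded by the spider.
        self.file.write(item['title'] + '\n' + item['content'] + '\n\n')
        return item

    def close_spider(self, spider):
        self.file.close()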
4. Run the crawler:

scrapy crawl name (name is the spider's name attribute)
scrapy crawl name -o book.json (write the output to a file; json, xml and csv are supported)
scrapy crawl name -o book.json -t json (-t sets the output format explicitly and can usually be omitted)
** During the first run you may hit a "No module named win32api" error. This happens because Python ships without a library for the Windows system API, so the third-party pywin32 package is needed. Download the installer for your Python version from http://sourceforge.net/projects/pywin32/files/pywin32/ and double-click it in the Scripts directory, or simply run pip install pypiwin32.
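After installing pywin32 (or pypiwin32), a quick way to confirm the module is importable before re-running the crawler:

python -c "import win32api; print('pywin32 is available')"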
III. Sample code for novel acquisition. Create the entry execution file main.py:
from scrapy.cmdline import execute

execute("scrapy crawl shiqik".split())  # shiqik is the name defined in the spider file below
import scrapy


class ShiqikSpider(scrapy.Spider):
    name = 'shiqik'
    allowed_domains = ['17k.com']
    start_urls = ['https://www.81zw.us/book/1379/6970209.html']

    def parse(self, response):
        title = response.xpath('//div[@class="bookname"]/h1/text()').extract_first()
        content = ''.join(response.xpath('//div[@id="content"]/text()').extract()).replace('    ', '\n')
        yield {"title": title, "content": content}
        next_page = response.xpath('//div[@class="bottem2"]/a[3]/@href').extract_first()
        if next_page.find(".html") != -1:
            print("continue to next URL")
            new_url = response.urljoin(next_page)
            yield scrapy.Request(new_url, callback=self.parse, dont_filter=True)
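Besides running it through main.py, the spider can also be started from the command line with the feed export from step 4 (the output file name is just an example):

scrapy crawl shiqik -o book.json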
IV. Sample code for novel acquisition using CrawlSpider:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BayizhongwenSpider(CrawlSpider):
    name = 'bayizhongwen'
    allowed_domains = ['81zw.us']
    # start_urls = ['https://www.81zw.us/book/1215/863759.html']
    start_urls = ['https://www.81zw.us/book/1215']

    rules = (
        Rule(LinkExtractor(restrict_xpaths=r'//dl/dd[2]/a'), callback='parse_item', follow=True),
        Rule(LinkExtractor(restrict_xpaths=r'//div[@class="bottem1"]/a[3]'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        title = response.xpath('//div[@class="bookname"]/h1/text()').extract_first()
        content = ''.join(response.xpath('//div[@id="content"]/text()').extract()).replace('    ', '\n')
        print({"title": title, "content": content})
        yield {"title": title, "content": content}
The following example crawls wallpaper images (project tupian, spider zol).

1. Create a project:

(venv) c:\Users\NOC\pycharmprojects> scrapy startproject tupian

2. Create a spider:

(venv) c:\Users\NOC\pycharmprojects\tupian> scrapy genspider zol zol.com.cn

3. Modify the configuration in the settings.py file:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
 
# Configure item Pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 'tupian.pipelines.TupianPipeline': 300,
    'scrapy.contrib.pipeline.images.ImagesPipeline': 300,  # newer Scrapy versions use scrapy.pipelines.images.ImagesPipeline
}
# Add an image storage directory
IMAGES_STORE = 'e:/img'
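Note that Scrapy's built-in ImagesPipeline depends on the Pillow imaging library; if images are silently not saved, install it first:

pip install pillow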
4. Create the entry execution file start.py:

from scrapy.cmdline import execute

execute("scrapy crawl zol".split())  # zol is the name defined in the zol spider file
 
V. Main spider file code (zol.py):
 
import scrapy


class ZolSpider(scrapy.Spider):
    name = 'zol'
    allowed_domains = ['zol.com.cn']
    start_urls = ['http://developer.zol.com.cn/bizhi/7239_89590_2.html']  # address of the image page to crawl

    def parse(self, response):
        image_url = response.xpath('//img[@id="bigimg"]/@src').extract()  # address of the current image
        image_name = response.xpath('string(//h3)').extract_first()  # image name
        yield {"image_url": image_url, "image_name": image_name}  # push the item to the pipeline
        next_page = response.xpath('//a[@id="pagenext"]/@href').extract_first()  # href of the "next" button
        if next_page.find('.html') != -1:  # the last page's "next" link no longer contains .html
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
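The stock ImagesPipeline configured in settings.py only downloads URLs found in an image_urls field, while this spider yields image_url and image_name. A hedged sketch of a custom pipeline that bridges the two (the class name, file naming scheme and the modern import path are assumptions, not the original author's code; it would be registered in ITEM_PIPELINES as tupian.pipelines.ZolImagesPipeline):

# pipelines.py -- sketch of a custom images pipeline for the fields yielded above
import scrapy
from scrapy.pipelines.images import ImagesPipeline  # older releases used scrapy.contrib.pipeline.images


class ZolImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # image_url is extracted with .extract(), so it is a list of URLs.
        for url in item.get('image_url', []):
            yield scrapy.Request(url, meta={'image_name': item.get('image_name')})

    def file_path(self, request, response=None, info=None, *, item=None):
        # Save under the crawled image name instead of the default SHA1 hash.
        return '%s.jpg' % request.meta['image_name']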
 
 
VI. middlewares.py file:
 
from random import choice

from fake_useragent import UserAgent

from tupian.settings import USER_AGENT


# User-Agent settings
class UserAgentDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # if USER_AGENT:
        #     request.headers.setdefault(b'User-Agent', choice(USER_AGENT))
        request.headers.setdefault(b'User-Agent', UserAgent().random)


# Proxy settings
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # request.meta['proxy'] = 'http://ip:port'
        request.meta['proxy'] = 'http://124.235.145.79:80'
        # Proxy with authentication:
        # request.meta['proxy'] = 'http://user:password@ip:port'
        # request.meta['proxy'] = 'http://398707160:[email protected]:8080'
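For these two middlewares to actually run, they also have to be enabled in settings.py; a minimal sketch (the module path follows the tupian project above, the priority values are arbitrary):

# settings.py -- enable the downloader middlewares defined above
DOWNLOADER_MIDDLEWARES = {
    'tupian.middlewares.UserAgentDownloaderMiddleware': 543,
    'tupian.middlewares.ProxyMiddleware': 544,
}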