Scrapy Installation Introduction
Scrapy framework official website: http://doc.scrapy.org/en/latest
Scrapy Chinese documentation site: http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html
Installation on Windows
- Python 2/3
- Upgrade the pip version:
pip install --upgrade pip
- Install the Scrapy framework via pip:
pip install Scrapy
For the detailed Scrapy installation process, see: http://doc.scrapy.org/en/latest/intro/install.html#intro-install-platform-notes
Goal
- Create a Scrapy Project
- Define the structured data to extract (Item)
- Write a Spider that crawls the site and extracts the structured data (Item)
- Write item pipelines to store the extracted Items (i.e., the structured data)
1. Create a new project (scrapy startproject)
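For example, running the following command creates the project skeleton used in the rest of this tutorial (the directory listing is a sketch for orientation; exact contents can vary slightly between Scrapy versions):

scrapy startproject mySpider

mySpider/
    scrapy.cfg            # project configuration file
    mySpider/
        __init__.py
        items.py          # Item definitions (step 2)
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders directory (step 3)
            __init__.py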
2. Define the goal (mySpider/items.py)
Goal: crawl the names, titles, and personal information of all lecturers on the http://www.itcast.cn/channel/teacher.shtml website.
Steps:
Open items.py in the mySpider directory.
An Item defines the structured data fields used to hold the crawled data; it behaves somewhat like a Python dict, but provides some additional protection against errors.
You define an Item by creating a subclass of scrapy.Item and declaring class attributes of type scrapy.Field (this can be understood as a mapping relationship similar to an ORM model).
Next, create an ItcastItem class and build the item model:
import scrapy

class ItcastItem(scrapy.Item):
    # fields for each lecturer: name, title, and personal info
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()
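As a quick illustration of the dict-like behavior and the extra error protection mentioned above (an aside, not part of the original tutorial): declared fields can be read and written with dict syntax, while assigning to an undeclared field raises a KeyError.

item = ItcastItem()
item['name'] = 'teacher A'   # OK: 'name' is a declared field
print(item['name'])          # 'teacher A'
item['age'] = 30             # KeyError: ItcastItem does not support this field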
3. Create the spider (spiders/itcastspider.py)
①. Crawling data
In the mySpider/spiders directory, run the following command to create a spider named itcast and specify the domain it is allowed to crawl:
scrapy genspider itcast "itcast.cn"
Open itcast.py in the mySpider/spiders directory; the following code is generated by default:
import scrapy

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = (
        'http://www.itcast.cn/',
    )

    def parse(self, response):
        pass
To build a spider, create a subclass of scrapy.Spider and define three mandatory attributes and one method.
A: name = ""
The name of the crawler. It must be unique: different crawlers must be given different names.
B: allowed_domains = []
The domain scope of the crawl, i.e., the spider's constrained area. The spider only crawls pages under these domains; URLs outside them are ignored.
C: start_urls = ()
A tuple/list of URLs to crawl. The spider starts crawling from here, so the first downloaded data will come from these URLs; further URLs are generated from these starting ones.
D: parse(self, response)
The parsing method. It is called once each initial URL has finished downloading, with the Response object returned from that URL passed in as its only argument. Its main responsibilities are:
- Parse the returned page data (response.body) and extract structured data (generate Items)
- Generate requests for the further URLs that need to be crawled (see the sketch below)
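A minimal sketch of that second responsibility (not from the original tutorial; the next-page XPath is a hypothetical placeholder to be adapted to the actual page structure):

def parse(self, response):
    # ... extract items from the current page first ...
    # hypothetical selector for a "next page" link
    next_url = response.xpath("//a[@class='next']/@href").extract_first()
    if next_url:
        # schedule the next page; Scrapy will call parse() again on its response
        yield scrapy.Request(response.urljoin(next_url), callback=self.parse)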
Modify the value of start_urls to the first URL that needs to be crawled:
start_urls = ("http://www.itcast.cn/channel/teacher.shtml",)
Modify the parse() method:

def parse(self, response):
    filename = "teacher.html"
    # response.body is bytes, so open the file in binary mode
    open(filename, 'wb').write(response.body)
Run the program to fetch and save the page, as shown below.
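The spider is started from the project root with the scrapy crawl command, using the name defined in the spider class:

scrapy crawl itcast

When it finishes, a teacher.html file containing the full source of the start page should appear in the current directory.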
②. Extracting data
The whole page has been crawled; the next step is to extract the data.
Inspect the page source, then choose an appropriate method to extract the data.
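A convenient way to experiment with selectors (an aside, not part of the original text) is the interactive scrapy shell, which downloads a page and lets you try XPath expressions against it:

scrapy shell "http://www.itcast.cn/channel/teacher.shtml"
# inside the shell, test the selector used in the code below:
response.xpath("//div[@class='li_txt']/h3/text()").extract()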
Import the ItcastItem class defined earlier in mySpider/items.py:
from mySpider.items import ItcastItem
Encapsulate the scraped data into ItcastItem objects, which can hold each teacher's attributes:

from mySpider.items import ItcastItem

def parse(self, response):
    # open("teacher.html", "wb").write(response.body)

    # collection for the teacher information
    items = []

    for each in response.xpath("//div[@class='li_txt']"):
        # wrap the extracted data in an 'ItcastItem' object
        item = ItcastItem()
        # the extract() method returns unicode strings
        name = each.xpath("h3/text()").extract()
        title = each.xpath("h4/text()").extract()
        info = each.xpath("p/text()").extract()

        # xpath returns a list containing one element
        item['name'] = name[0]
        item['title'] = title[0]
        item['info'] = info[0]

        items.append(item)

    # finally, return all the collected data
    return items
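A design note (not in the original): returning the whole list at the end works, but the more idiomatic Scrapy style is to yield each item as soon as it is built, so the engine can process and export items incrementally:

def parse(self, response):
    for each in response.xpath("//div[@class='li_txt']"):
        item = ItcastItem()
        item['name'] = each.xpath("h3/text()").extract()[0]
        item['title'] = each.xpath("h4/text()").extract()[0]
        item['info'] = each.xpath("p/text()").extract()[0]
        yield item  # hand each item to the engine as it is ready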
③. Saving data
Scrapy provides four simple ways to save the scraped information; the -o option writes the output file in the specified format. The commands are as follows:

# JSON format, Unicode-encoded by default
scrapy crawl itcast -o teachers.json

# JSON Lines format, Unicode-encoded by default
scrapy crawl itcast -o teachers.jsonl

# CSV, comma-separated values, can be opened in Excel
scrapy crawl itcast -o teachers.csv

# XML format
scrapy crawl itcast -o teachers.xml
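The goals at the top also mention item pipelines for storage. Below is a minimal sketch of a JSON-writing pipeline (an illustration, not part of the original tutorial: the class and file names are assumptions, Python 3 is assumed, and the pipeline must be enabled in settings.py under ITEM_PIPELINES):

# mySpider/pipelines.py (hypothetical example)
import json

class ItcastJsonPipeline(object):
    def open_spider(self, spider):
        # called when the spider starts: open the output file
        self.file = open('teachers_pipeline.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # called for every item the spider yields: write one JSON object per line
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # called when the spider finishes: close the file
        self.file.close()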
Python Crawler Development, Part 1: Scrapy Primer