Python Crawler Development, Part 1: Scrapy Primer


Introduction to Installing Scrapy

Scrapy framework official website: http://doc.scrapy.org/en/latest

Scrapy Chinese maintenance site: http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html

How to Install on Windows
    • Python 2/3
    • Upgrade pip: pip install --upgrade pip
    • Install the Scrapy framework via pip: pip install Scrapy

For the detailed Scrapy installation process, see: http://doc.scrapy.org/en/latest/intro/install.html#intro-install-platform-notes
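After installation, you can verify that Scrapy is available on the command line (a quick sanity check, not part of the original steps):

scrapy version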

Goal
    • Create a Scrapy project
    • Define the structured data to extract (Item)
    • Write a Spider to crawl the site and extract the structured data (Item)
    • Write item pipelines to store the extracted Items (i.e. the structured data)
1. New Project (scrapy startproject)
    • Before you begin crawling, you must create a new Scrapy project. Go to your chosen project directory and run the following command:
    • scrapy startproject mySpider
    • This creates the project mySpider; its directories and main files have the following roles (the full layout is sketched below):
      • scrapy.cfg: the project's configuration file
      • mySpider/: the project's Python module; your code will be imported from here
      • mySpider/items.py: the project's item definitions (the target file)
      • mySpider/pipelines.py: the project's pipelines file
      • mySpider/settings.py: the project's settings file
      • mySpider/spiders/: the directory where spider code is stored
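For orientation, the directory layout generated by the command looks roughly like this (a sketch; newer Scrapy versions may add a few extra files such as middlewares.py):

mySpider/
    scrapy.cfg
    mySpider/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py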

2. Clarify the Objective (mySpider/items.py)

Objective: crawl the names, titles, and personal information of all lecturers on the http://www.itcast.cn/channel/teacher.shtml page.

Steps:

    1. Open items.py in the mySpider directory.

    2. An Item defines structured data fields used to hold the crawled data; it works somewhat like a Python dict, but provides additional protection against errors (such as assigning undeclared fields).

    3. You define an Item by creating a class that inherits from scrapy.Item and declaring class attributes of type scrapy.Field (this can be understood as a mapping relationship, similar to an ORM).

    4. Next, create an ItcastItem class and build the item model:

import scrapy

class ItcastItem(scrapy.Item):
    # lecturer name
    name = scrapy.Field()
    # lecturer title
    title = scrapy.Field()
    # personal information
    info = scrapy.Field()
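To illustrate the dict-like behaviour mentioned in step 2 (a small sketch with made-up example values): an ItcastItem can be filled and read like a dict, but assigning a field that was not declared raises a KeyError.

item = ItcastItem(name="Example Teacher")
item['title'] = "Lecturer"          # works: 'title' is a declared field
print(item['name'], item['title'])
# item['age'] = 30                  # KeyError: 'age' is not a declared field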
3. Making the Crawler (spiders/itcast.py)

①. Crawling data

In the mySpider/spiders directory, enter the following command to create a spider named itcast and specify the domain the crawl is restricted to:

scrapy genspider itcast "itcast.cn"

Open itcast.py in the mySpider/spiders directory; the following default code has already been added:

import scrapy

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = (
        'http://www.itcast.cn/',
    )

    def parse(self, response):
        pass

To build a spider, create a subclass of scrapy.Spider and define three mandatory attributes and one method.

  A: name = "", the name of the crawler. It must be unique; different crawlers must define different names.

  B: allowed_domains = [], the list of domains the crawl is restricted to; URLs outside these domains will be ignored.

  C: start_urls = (), the tuple/list of URLs to crawl.

The crawler starts crawling from these URLs, so the first data downloaded comes from them; other sub-URLs are derived from these starting URLs.

  D: parse(self, response), the parsing method. It is called once the download of each initial URL completes, with the Response object returned from that URL passed in as its only argument. Its main roles are:

      • Parsing the returned page data (response.body) and extracting structured data (generating Items)
      • Generating Requests for further URLs to crawl, such as the next page; a minimal sketch follows this list.
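The item-extraction part is shown in section ② below; for the second point, here is a minimal sketch of following a next-page link (the link selector is purely hypothetical, and response.urljoin assumes a reasonably recent Scrapy version):

def parse(self, response):
    # ... extract items from the current page (see section ② below) ...

    # follow a hypothetical "next page" link, if one exists
    next_links = response.xpath("//a[@class='next']/@href").extract()
    if next_links:
        yield scrapy.Request(response.urljoin(next_links[0]), callback=self.parse)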
Modify the value of start_urls to the first URL that needs to be crawled:
start_urls = ("http://www.itcast.cn/channel/teacher.shtml",)
Modify the parse() method:
def parse(self, response):
    filename = "teacher.html"
    open(filename, 'wb').write(response.body)
Run the program to fetch the crawled page information; the command is shown below.
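The spider is run with the standard crawl command from the project's root directory:

scrapy crawl itcast

When it finishes, a teacher.html file containing the downloaded page should appear in the current directory.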

②. Extracting data

The entire page has been crawled; the next step is to extract the data from it.

Inspect the page's HTML source and choose an appropriate way to extract the data; here XPath selectors are used (see the Scrapy shell sketch below).
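One convenient way to experiment with selectors before writing them into the spider is the Scrapy shell (a quick sketch; the li_txt class is the page structure assumed throughout this tutorial):

scrapy shell "http://www.itcast.cn/channel/teacher.shtml"

Then, inside the shell, try an XPath expression:

>>> response.xpath("//div[@class='li_txt']/h3/text()").extract()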

Import the ItcastItem class defined earlier in mySpider/items.py:

from mySpider.items import ItcastItem

Encapsulate the resulting data into ItcastItem objects, which can hold each teacher's attributes:

from mySpider.items import ItcastItem

def parse(self, response):
    #open("teacher.html", "wb").write(response.body).close()

    # collection that stores the teacher information
    items = []

    for each in response.xpath("//div[@class='li_txt']"):
        # encapsulate the data we get into an 'ItcastItem' object
        item = ItcastItem()
        # the extract() method returns a list of unicode strings
        name = each.xpath("h3/text()").extract()
        title = each.xpath("h4/text()").extract()
        info = each.xpath("p/text()").extract()

        # the xpath result is a list containing one element
        item['name'] = name[0]
        item['title'] = title[0]
        item['info'] = info[0]

        items.append(item)

    # finally, return the collected data
    return items
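A slightly more idiomatic variant (an alternative sketch, not from the original text) yields each item as soon as it is built, so Scrapy can pass results to exporters and pipelines incrementally instead of building the whole list first:

def parse(self, response):
    for each in response.xpath("//div[@class='li_txt']"):
        item = ItcastItem()
        item['name'] = each.xpath("h3/text()").extract()[0]
        item['title'] = each.xpath("h4/text()").extract()[0]
        item['info'] = each.xpath("p/text()").extract()[0]
        yield item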

③. Saving data 

Scrapy provides four simple ways to save the scraped information; the -o option outputs a file in the specified format. The commands are as follows:
# JSON format, default Unicode encoding
scrapy crawl itcast -o teachers.json

# JSON Lines format, default Unicode encoding
scrapy crawl itcast -o teachers.jsonl

# CSV, comma-separated values, can be opened in Excel
scrapy crawl itcast -o teachers.csv

# XML format
scrapy crawl itcast -o teachers.xml
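The goal list at the top also mentions item pipelines. The -o option needs no extra code, but for custom storage a pipeline can be added to mySpider/pipelines.py; the class below is a minimal illustrative sketch (its name and output file are made up), and it only takes effect once it is registered in ITEM_PIPELINES in settings.py:

import json

class ItcastJsonPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.f = open("teachers_pipeline.json", "w")

    def process_item(self, item, spider):
        # called for every item the spider returns or yields
        self.f.write(json.dumps(dict(item)) + "\n")
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.f.close()

Register it in settings.py:

ITEM_PIPELINES = {
    'mySpider.pipelines.ItcastJsonPipeline': 300,
}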

  

 
