"Python" Scrapy Getting Started instance


Scrapy

Scrapy is a lightweight web crawling framework written in Python that is very convenient to use. Scrapy uses the Twisted asynchronous networking library to handle network communication. Broadly, its architecture consists of an Engine that coordinates a Scheduler, a Downloader, the Spiders, and the Item Pipelines.

Create a Scrapy Project

S-57 is an electronic navigational chart standard promulgated by the International Hydrographic Organization (IHO); charts in this format are vector charts. The standard's object and attribute catalogues are published at http://www.s-57.com/. Taking that page as the crawl target, enter the directory where the code is to be stored and run the following command to create the project crawls57:

scrapy startproject crawls57

The command will create a crawls57 directory with the following content:

crawls57/
    scrapy.cfg
    crawls57/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

These files are:

    • scrapy.cfg: the project's configuration file.
    • crawls57/: the project's Python module. You will add your code here.
    • crawls57/items.py: the project's item definitions.
    • crawls57/pipelines.py: the project's pipelines file.
    • crawls57/settings.py: the project's settings file.
    • crawls57/spiders/: the directory where the spider code is placed.
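
The pipelines and settings files are not used further in this example, but as a sketch of how they fit together: a pipeline receives every item the spider yields and can clean, validate, or drop it. The class below is illustrative only, not part of the generated project.

# crawls57/pipelines.py -- illustrative sketch, not part of this tutorial.
from scrapy.exceptions import DropItem

class Crawls57Pipeline(object):
    def process_item(self, item, spider):
        # Called once for every item the spider yields.
        if not item.get('acronym'):
            raise DropItem("item is missing its acronym")
        return item

It would be enabled by adding ITEM_PIPELINES = {'crawls57.pipelines.Crawls57Pipeline': 300} to settings.py, where the number sets the pipeline's run order.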

Define Item

An Item is the container that holds the crawled data. On the www.s-57.com page, the data is divided into two parts: the left side holds information about the beacon objects, and the right side holds information about the navigation mark attributes. There is more data on the right, so here we mainly crawl the left side, that is, the data of the beacon objects:

import scrapy

class Crawls57item(scrapy.Item):
    Object = scrapy.Field()
    acronym = scrapy.Field()
    Code = scrapy.Field()
    Primitives = scrapy.Field()
    attribute_a = scrapy.Field()
    attribute_b = scrapy.Field()
    attribute_c = scrapy.Field()
    Definition = scrapy.Field()
    references_int = scrapy.Field()
    references_s4 = scrapy.Field()
    Remarks = scrapy.Field()
    Distinction = scrapy.Field()
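
An Item behaves like a dict whose keys are restricted to the declared fields. A quick interactive illustration (the field values here are made up):

>>> from crawls57.items import Crawls57item
>>> item = Crawls57item(acronym=u'BCNCAR')
>>> item['Code'] = u'5'
>>> item['acronym']
u'BCNCAR'
>>> item['typo'] = u'x'   # assigning an undeclared field raises KeyError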

Writing crawlers

With the Item defined, you can write a spider to start crawling the data.

You can create a new spider directly from the command line:

scrapy genspider s-57 s-57.com

At this point, an s-57.py file is generated in the spiders folder. Next, we need to edit this file.
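
The generated file contains a skeleton spider roughly of the following shape (the exact template varies between Scrapy versions):

import scrapy

class S57Spider(scrapy.Spider):
    name = 's-57'
    allowed_domains = ['s-57.com']
    start_urls = ['http://s-57.com/']

    def parse(self, response):
        pass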

Based on the structure of the s-57 website, the crawl proceeds in two main steps:

    1. Fetch the name, abbreviation, code, and other information of each beacon object;
    2. Extract the web address of each beacon object and fetch its details.

A parse() method is generated by default in s-57.py; it is called with the response for each start URL when the spider runs, and it can return Requests or items. A returned Request accepts a callback argument that names the method to handle the fetched response, and that method can in turn yield further Requests, so recursive crawling of the site can be achieved by returning Requests. Following the two-step process above, we add a method parse_dir_contents(), which parse() registers as its callback to fetch each beacon object's details; a minimal sketch of the pattern is shown below, followed by the full code.
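
A minimal sketch of the Request/callback pattern (this spider is illustrative and not part of the project; it simply follows every link on the start page):

import scrapy

class SketchSpider(scrapy.Spider):
    name = 'sketch'
    start_urls = ['http://www.s-57.com/titleO.htm']

    def parse(self, response):
        # Yield one Request per link; Scrapy calls parse_detail
        # with the response of each fetched page.
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_detail)

    def parse_detail(self, response):
        yield {'url': response.url}

The real spider for crawls57, edited as described, looks like this: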

import scrapy

from crawls57.items import Crawls57item


class S57Spider(scrapy.Spider):
    name = "s-57"
    allowed_domains = ["s-57.com"]
    start_urls = [
        "http://www.s-57.com/titleO.htm",
    ]
    data_code = {}   # acronym -> code value
    data_obj = {}    # code value -> object name

    def parse(self, response):
        # Build the lookup table from the object <select> list on the page.
        for obj in response.xpath('//select[@name="Title"]/option'):
            obj_name = obj.xpath('text()').extract()[0]
            obj_value = obj.xpath('@value').extract()[0]
            self.data_obj[obj_value] = obj_name
        # For each acronym, request the object's detail page.
        for acr in response.xpath('//select[@name="acronym"]/option'):
            acr_name = acr.xpath('text()').extract()[0]
            acr_value = acr.xpath('@value').extract()[0]
            self.data_code[acr_name] = acr_value
            url = u'http://www.s-57.com/Object.asp?nameAcr=' + acr_name
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # Recover the acronym from the URL and look up the object's name.
        acr_name = response.url.split('=')[-1]
        acr_value = str(self.data_code[acr_name])
        obj_name = self.data_obj[acr_value]
        for sel in response.xpath('//html/body/dl'):
            item = Crawls57item()
            item['Object'] = obj_name
            item['acronym'] = acr_name
            item['Code'] = acr_value
            item['Primitives'] = sel.xpath('b/text()').extract()
            # Attributes A, B and C
            atainfo = u''
            apath = sel.xpath('.//tr[1]/td[2]/b')
            for ata in apath.xpath('.//span'):
                atainfo += ata.re(r'>(\w+)<')[0] + "; "
            item['attribute_a'] = atainfo
            atbinfo = u''
            bpath = sel.xpath('.//tr[2]/td[2]/b')
            for atb in bpath.xpath('.//span'):
                atbinfo += atb.re(r'>(\w+)<')[0] + "; "
            item['attribute_b'] = atbinfo
            atcinfo = u''
            cpath = sel.xpath('.//tr[3]/td[2]/b')
            for atc in cpath.xpath('.//span'):
                atcinfo += atc.re(r'>(\w+)<')[0] + "; "
            item['attribute_c'] = atcinfo
            # Description: walk the <dt>/<dd> pairs.
            i = 0
            for dec in sel.xpath('.//dl/dd'):
                i += 1
                dt = './/dl/dt[' + str(i) + ']/b/text()'
                dd = './/dl/dd[' + str(i) + ']/font/text()'
                if sel.xpath(dt).extract()[0] == 'References':
                    item['references_int'] = sel.xpath('.//tr[1]/td[2]/font/text()').extract()[0]
                    item['references_s4'] = sel.xpath('.//tr[2]/td[2]/font/text()').extract()[0]
                if len(sel.xpath(dd).extract()) == 0:
                    continue
                if sel.xpath(dt).extract()[0] == 'Definition':
                    ss = u''
                    for defi in sel.xpath(dd).extract():
                        ss += defi
                    item['Definition'] = ss
                if sel.xpath(dt).extract()[0] == 'Remarks:':
                    item['Remarks'] = sel.xpath(dd).extract()[0]
                if sel.xpath(dt).extract()[0] == 'Distinction:':
                    item['Distinction'] = sel.xpath(dd).extract()[0]
            yield item

Fetching data with XPath is much easier than with regular expressions, but using it inevitably involves repeated debugging. Scrapy provides a shell environment for debugging responses; the basic syntax is:

scrapy shell [url]

After that, you can enter response.xpath('...') directly to debug the crawled data.
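
For example, to inspect the object list page interactively (the selectors are the ones used by the spider above; the exact output depends on the live page):

scrapy shell "http://www.s-57.com/titleO.htm"
>>> response.xpath('//select[@name="Title"]/option/text()').extract()[:5]
>>> response.xpath('//select[@name="acronym"]/option/@value').extract()[:5]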
Crawl

Enter the project's root directory and execute the following command to start the spider:

scrapy crawl s-57 -o data.json

Finally, the crawled data is stored in data.json. Note that scrapy crawl takes the spider's name as defined above (s-57), not the project name.
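
As a quick sanity check, the exported file can be loaded back with a few lines of Python (assuming the crawl produced at least one item):

import json

with open('data.json') as f:
    items = json.load(f)

print(len(items))           # number of objects crawled
print(items[0]['acronym'])  # acronym of the first object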

"Python" Scrapy Getting Started instance

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.