Scrapy
Scrapy is a lightweight web crawling framework written in Python that is very convenient to use. It relies on the Twisted asynchronous networking library to handle network traffic, and its overall architecture consists of an engine, a scheduler, a downloader, spiders, and an item pipeline.
Create a Scrapy Project
S-57 is a vector electronic nautical chart transfer standard issued by the International Hydrographic Organization (IHO). Its object and attribute catalogues are published at http://www.s-57.com/. Taking this site as the crawl target, enter the directory where you want to store the code and run the following command to create the project crawls57:
scrapy startproject crawls57
The command will create a crawls57 directory with the following content:
crawls57/
    scrapy.cfg
    crawls57/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
These files are:
scrapy.cfg
: The configuration file for the project
crawls57/
: The project's Python module. You will add your code here.
crawls57/items.py
: The item definitions for the project.
crawls57/pipelines.py
: The pipelines file in the project.
crawls57/settings.py
: The settings file for the project (a brief illustrative example follows this list).
crawls57/spiders/
: The directory where the spider code is placed.
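For illustration, here is a minimal sketch of the kind of options often adjusted in settings.py for a small crawl like this one. The generated file already contains the project names; the delay and user-agent values below are assumptions rather than part of the tutorial.

# crawls57/settings.py -- illustrative additions; the values are assumptions.
BOT_NAME = 'crawls57'

SPIDER_MODULES = ['crawls57.spiders']
NEWSPIDER_MODULE = 'crawls57.spiders'

# Be polite to the target site: wait one second between requests.
DOWNLOAD_DELAY = 1

# Identify the crawler with a custom User-Agent string (placeholder value).
USER_AGENT = 'crawls57 (+http://www.example.com)'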
Define Item
Item is the container that holds the scraped data. On the www.s-57.com page, the data is divided into two parts: the left side holds the information about the navigation mark objects, and the right side holds the information about the navigation mark attributes. There is more data on the right, so we mainly crawl the left side, that is, the navigation mark object data:
import scrapy


class Crawls57Item(scrapy.Item):
    object = scrapy.Field()
    acronym = scrapy.Field()
    code = scrapy.Field()
    primitives = scrapy.Field()
    attribute_a = scrapy.Field()
    attribute_b = scrapy.Field()
    attribute_c = scrapy.Field()
    definition = scrapy.Field()
    references_int = scrapy.Field()
    references_s4 = scrapy.Field()
    remarks = scrapy.Field()
    distinction = scrapy.Field()
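As a quick illustration of how such an item behaves (this snippet is not part of the project files, and the field values are placeholders):

# Usage sketch: a Crawls57Item behaves like a dictionary with a fixed set of keys.
from crawls57.items import Crawls57Item

item = Crawls57Item()
item['object'] = 'Beacon, cardinal'   # placeholder object name
item['acronym'] = 'BCNCAR'            # placeholder acronym
item['code'] = '5'                    # placeholder code
print(dict(item))                     # prints the populated fields as a plain dict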
Writing crawlers
With the item defined, you can write a spider to start crawling the data.
You can create a new spider directly from the command line:
scrapy genspider s-57 s-57.com
This generates an s-57.py file in the spiders folder, which we edit next.
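The generated file contains roughly the following skeleton (reproduced from Scrapy's basic spider template, so the details may differ slightly between versions):

# Approximate skeleton created by scrapy genspider; exact output varies by Scrapy version.
import scrapy


class S57Spider(scrapy.Spider):
    name = "s-57"
    allowed_domains = ["s-57.com"]
    start_urls = ["http://s-57.com/"]

    def parse(self, response):
        pass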
Based on the structure of the s-57 website, the crawl proceeds in two main steps:
- Fetch the name, acronym, code, and other information of each navigation mark object;
- Extract the URL of each object's detail page and fetch the details from it.
The parse() method is the default callback defined in s-57.py; it can yield either Request objects or items, and it is called for the start URLs each time the spider runs. A yielded Request accepts a callback parameter naming another method to handle the resulting response, and that method can in turn yield further Requests, so recursive crawling of the site is achieved simply by returning Requests. Following the crawl process described above, we add a method parse_dir_contents(), which parse() registers as a callback to scrape each navigation mark object's detail page. Edit the code as follows.
import re
import scrapy
from crawls57.items import Crawls57Item


class S57Spider(scrapy.Spider):
    name = "s-57"
    allowed_domains = ["s-57.com"]
    start_urls = [
        "http://www.s-57.com/titleO.htm",
    ]
    data_code = {}
    data_obj = {}

    def parse(self, response):
        # Map each object code (option value) to the full object name.
        for obj in response.xpath('//select[@name="Title"]/option'):
            obj_name = obj.xpath('text()').extract()[0]
            obj_value = obj.xpath('@value').extract()[0]
            self.data_obj[obj_value] = obj_name
        # Map each acronym to its code, then request the object's detail page.
        for acr in response.xpath('//select[@name="acronym"]/option'):
            acr_name = acr.xpath('text()').extract()[0]
            acr_value = acr.xpath('@value').extract()[0]
            self.data_code[acr_name] = acr_value
            url = u'http://www.s-57.com/Object.asp?nameAcr=' + acr_name
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        acr_name = response.url.split('=')[-1]
        acr_value = str(self.data_code[acr_name])
        obj_name = self.data_obj[acr_value]
        for sel in response.xpath('//html/body/dl'):
            item = Crawls57Item()
            item['object'] = obj_name
            item['acronym'] = acr_name
            item['code'] = acr_value
            item['primitives'] = sel.xpath('b/text()').extract()
            # Attributes A, B and C
            atainfo = u''
            apath = sel.xpath('.//tr[1]/td[2]/b')
            for ata in apath.xpath('.//span'):
                atainfo += ata.re(r'>(\w+)<')[0] + "; "
            item['attribute_a'] = atainfo
            atbinfo = u''
            bpath = sel.xpath('.//tr[2]/td[2]/b')
            for atb in bpath.xpath('.//span'):
                atbinfo += atb.re(r'>(\w+)<')[0] + "; "
            item['attribute_b'] = atbinfo
            atcinfo = u''
            cpath = sel.xpath('.//tr[3]/td[2]/b')
            for atc in cpath.xpath('.//span'):
                atcinfo += atc.re(r'>(\w+)<')[0] + "; "
            item['attribute_c'] = atcinfo
            # Description: references, definition, remarks and distinction
            i = 0
            for dec in sel.xpath('.//dl/dd'):
                i += 1
                dt = './/dl/dt[' + str(i) + ']/b/text()'
                dd = './/dl/dd[' + str(i) + ']/font/text()'
                if sel.xpath(dt).extract()[0] == 'References':
                    item['references_int'] = sel.xpath('.//tr[1]/td[2]/font/text()').extract()[0]
                    item['references_s4'] = sel.xpath('.//tr[2]/td[2]/font/text()').extract()[0]
                if len(sel.xpath(dd).extract()) == 0:
                    continue
                if sel.xpath(dt).extract()[0] == 'Definition':
                    ss = ''
                    for defi in sel.xpath(dd).extract():
                        ss += defi
                    item['definition'] = ss
                if sel.xpath(dt).extract()[0] == 'Remarks:':
                    item['remarks'] = sel.xpath(dd).extract()[0]
                if sel.xpath(dt).extract()[0] == 'Distinction:':
                    item['distinction'] = sel.xpath(dd).extract()[0]
            yield item
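One fragile point in the spider above is the repeated use of .extract()[0], which raises an IndexError whenever an expected element is missing from a page. In recent Scrapy versions a more defensive variant (a sketch, not the original author's code) would use extract_first(), which returns a default value instead of raising:

# Sketch: extract_first() avoids IndexError when an element is missing.
obj_name = obj.xpath('text()').extract_first(default='')
obj_value = obj.xpath('@value').extract_first(default='')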
Fetching data with XPath is much easier than with regular expressions, but getting the expressions right inevitably takes repeated debugging. Scrapy provides a shell environment for debugging responses; the basic syntax is:
scrapy shell [url]
After that, you can enter response.xpath('...') directly to debug the data being crawled.
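For example, a debugging session against the start page might look like this; the XPath expressions simply mirror the ones used in the spider above:

scrapy shell "http://www.s-57.com/titleO.htm"
>>> # First few object names from the "Title" drop-down list
>>> response.xpath('//select[@name="Title"]/option/text()').extract()[:5]
>>> # The corresponding option values (object codes)
>>> response.xpath('//select[@name="Title"]/option/@value').extract()[:5]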
Crawl
Enter the project's root directory and execute the following command to start the spider:
scrapy crawl s-57 -o data.json
Finally, the scraped data is stored in data.json.
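As a quick sanity check (a small sketch, assuming the command above produced data.json in the current directory), the exported feed can be loaded back with the standard library:

# Load the exported JSON feed and inspect the first item.
import json

with open('data.json', encoding='utf-8') as f:
    items = json.load(f)   # Scrapy's JSON exporter writes a single JSON array

print(len(items), 'items scraped')
if items:
    print(items[0]['acronym'], '-', items[0]['object'])

Changing the output extension, for example -o data.csv, makes Scrapy export the same items in CSV format instead.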
"Python" Scrapy Getting Started instance