Summary: Run multiple crawlers based on Excel file configuration
Most of the time we need to write a separate crawler for each site, but sometimes the sites to be crawled differ only in their XPath expressions. Writing a dedicated spider for each of these sites would be wasted effort; in fact, you can crawl all of these similar sites with a single spider.
First, create a project named generic and a spider named fromcsv:

$ scrapy startproject generic
$ cd generic
$ scrapy genspider fromcsv example.com
Then create a CSV file named todo.csv and populate it with the information the spider needs: one row per site, with a url column for the page address plus one column per field holding that field's XPath expression.
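For illustration only, a hypothetical todo.csv might look like this (the URLs and XPath expressions below are made up; the real file should contain your own sites and selectors):

url,name,price
http://example.com/a.html,//h1/text(),//*[@id='price']/text()
http://example.com/b.html,//main/h2/text(),//span[@class='price']/text()
http://example.com/c.html,//header/h1/text(),//p[@class='price-tag']/text()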
Use Python's csv library to verify the file:
$ python
>>> import csv
>>> with open("todo.csv", "rU") as f:
...     reader = csv.DictReader(f)
...     for line in reader:
...         print line
The output is as follows:
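Assuming the hypothetical todo.csv sketched above, each row comes back as a dictionary, roughly (Python 2 dictionary key order is arbitrary):

{'url': 'http://example.com/a.html', 'name': '//h1/text()', 'price': "//*[@id='price']/text()"}
{'url': 'http://example.com/b.html', 'name': '//main/h2/text()', 'price': "//span[@class='price']/text()"}
{'url': 'http://example.com/c.html', 'name': '//header/h1/text()', 'price': "//p[@class='price-tag']/text()"}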
Note: the first row of the todo.csv file is automatically used as the dictionary keys.
Now make the spider read the URLs and XPath expressions from todo.csv at run time. Because we do not know the URLs in advance, remove the start_urls and allowed_domains parts from the spider and use the start_requests() method instead: it yields a Request object for each row in the CSV file, placing the field names and XPath expressions in request.meta so they are passed along to the parse() function, which then uses Item and ItemLoader to populate the item's fields.
import csv

import scrapy
from scrapy.http import Request
from scrapy.loader import ItemLoader
from scrapy.item import Item, Field


class FromcsvSpider(scrapy.Spider):
    name = "fromcsv"

    def start_requests(self):
        with open("todo.csv", "rU") as f:
            reader = csv.DictReader(f)
            for line in reader:
                # pop the url key from the dictionary to use as the request URL
                request = Request(line.pop('url'))
                request.meta['fields'] = line
                yield request

    def parse(self, response):
        item = Item()  # no Item class is defined in items.py in this project
        l = ItemLoader(item=item, response=response)
        for name, xpath in response.meta['fields'].iteritems():
            if xpath:
                item.fields[name] = Field()  # dynamically create a field on the item
                l.add_xpath(name, xpath)
        return l.load_item()
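Note the design choice here: because no Item subclass is declared in items.py, the spider attaches a Field to a bare Item at run time for every non-empty column in the row. This is what lets one spider serve any number of sites with different field sets.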
The fromcsv.py source file is available at:
https://github.com/Kylinlin/scrapybook/blob/master/ch05%2Fgeneric%2Fgeneric%2Fspiders%2Ffromcsv.py
Run the spider:

$ scrapy crawl fromcsv
The source code above hard-codes the todo.csv file name; the spider breaks as soon as the file is renamed, which is poor design. In fact, Scrapy offers a simple mechanism (the -a option) for passing parameters to a spider from the command line: with -a variable=value, the spider can read the value as self.variable in its source code. To check whether the variable was supplied and fall back to a default, use Python's getattr(self, 'variable', 'default'). The with statement above can therefore be modified to:
with open(getattr(self, "file", "todo.csv"), "rU") as f:
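Put in context, a minimal sketch of the modified start_requests() (identical to the version above except for the getattr() call) would be:

    def start_requests(self):
        # -a file=... overrides the file name; todo.csv is the default
        with open(getattr(self, "file", "todo.csv"), "rU") as f:
            reader = csv.DictReader(f)
            for line in reader:
                request = Request(line.pop('url'))
                request.meta['fields'] = line
                yield request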
Then specify the CSV file with the -a option when running the spider (todo.csv is used by default if -a is not given):
$ scrapy crawl fromcsv -a file=todo.csv