Learning Scrapy Notes (vii) - Running multiple crawlers with Scrapy based on an Excel file


Summary: Run multiple crawlers based on Excel file configuration

Most of the time we write one crawler per site, but sometimes you need to crawl several sites whose only difference is the XPath expressions. Writing a separate crawler for each of them is wasted effort: a single spider can crawl all of these similar sites.

First create a project named generic and a spider named fromcsv:

scrapy startproject generic
cd generic
scrapy genspider fromcsv example.com

Then create a CSV file named todo.csv and populate it with the URLs to crawl and the XPath expression for each field to extract:
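As an illustration (the column names, URLs and XPath expressions below are made up for this sketch, not taken from the original file), such a file could look like this:

url,name,price
http://site-a.example.com/item,//h1/text(),//*[@class="price"]/text()
http://site-b.example.com/product,//*[@id="title"]/text(),//span[@itemprop="price"]/text()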

Use Python's csv library to verify the file:

$ python
>>> import csv
>>> with open("todo.csv", "rU") as f:
...     reader = csv.DictReader(f)
...     for line in reader:
...         print line

The output is as follows:
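With the illustrative todo.csv above (again, assumed content rather than the original file), each data row would be printed as a dictionary, roughly:

{'url': 'http://site-a.example.com/item', 'name': '//h1/text()', 'price': '//*[@class="price"]/text()'}
{'url': 'http://site-b.example.com/product', 'name': '//*[@id="title"]/text()', 'price': '//span[@itemprop="price"]/text()'}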

Note: DictReader automatically uses the first row of todo.csv as the dictionary keys.

Now read the URLs and XPath expressions from todo.csv to drive the spider. Because the URLs are not known in advance, remove the start_urls and allowed_domains parts from the spider and use the start_requests() method instead: it yields a Request object for each row of the CSV file, places the field names and XPath expressions in request.meta so they are passed to the parse() callback, and then uses Item and ItemLoader to populate the item's fields.

import csv
import scrapy
from scrapy.http import Request
from scrapy.loader import ItemLoader
from scrapy.item import Item, Field


class FromcsvSpider(scrapy.Spider):
    name = "fromcsv"

    def start_requests(self):
        with open("todo.csv", "rU") as f:
            reader = csv.DictReader(f)
            for line in reader:
                # pop the 'url' key from the dictionary and use it as the request URL
                request = Request(line.pop('url'))
                # pass the remaining field name/XPath pairs on to parse()
                request.meta['fields'] = line
                yield request

    def parse(self, response):
        # items.py does not define any Item for this project
        item = Item()
        l = ItemLoader(item=item, response=response)
        for name, xpath in response.meta['fields'].iteritems():
            if xpath:
                # dynamically create a field on the item
                item.fields[name] = Field()
                l.add_xpath(name, xpath)
        return l.load_item()

The full fromcsv.py source file is available at:

https://github.com/Kylinlin/scrapybook/blob/master/ch05%2Fgeneric%2Fgeneric%2Fspiders%2Ffromcsv.py

Run the spider: scrapy crawl fromcsv

The source code above hard-codes the todo.csv file name, so the spider breaks as soon as the file name changes, which is poor design. Scrapy provides a simple way (the -a option) to pass parameters to a spider from the command line, for example -a variable=value; the spider can then read the value as self.variable in its code. To check for the variable and supply a default value, use Python's getattr(self, 'variable', 'default'), so the with statement above can be modified to:

with open(getattr(self, "file", "todo.csv"), "rU") as f:
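Put into context, the modified start_requests() would look like this (a minimal sketch; the body is the same as in the spider above, only the file name now comes from the -a argument):

    def start_requests(self):
        # file name comes from the command line (-a file=...), defaulting to todo.csv
        with open(getattr(self, "file", "todo.csv"), "rU") as f:
            reader = csv.DictReader(f)
            for line in reader:
                request = Request(line.pop('url'))
                request.meta['fields'] = line
                yield request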

Then specify the CSV file with the -a parameter when you run the spider (todo.csv is used by default if the -a parameter is omitted):

scrapy crawl fromcsv -a file=todo.csv
