Summary: Run multiple crawlers based on Excel file configuration
Most of the time we need to write a separate crawler for each site, but sometimes the sites to be crawled differ only in their XPath expressions. Writing a dedicated spider for each of these sites would be wasted effort; in fact, you can crawl all of these similar sites with a single spider.
First, create a project named generic and a spider named fromcsv:

$ scrapy startproject generic
$ cd generic
$ scrapy genspider fromcsv example.com
Then create a CSV file named todo.csv and populate it with the information the spider needs: one row per site, with a url column for the page address plus one column per field holding that field's XPath expression.
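For illustration only, a hypothetical todo.csv might look like this (the URLs and XPath expressions below are made up; the real file should contain your own sites and selectors):

url,name,price
http://example.com/a.html,//h1/text(),//*[@id='price']/text()
http://example.com/b.html,//main/h2/text(),//span[@class='price']/text()
http://example.com/c.html,//header/h1/text(),//p[@class='price-tag']/text()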
Use Python's csv library to verify the file:
$ python
>>> import csv
>>> with open("todo.csv", "rU") as f:
...     reader = csv.DictReader(f)
...     for line in reader:
...         print line
The output is as follows:
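Assuming the hypothetical todo.csv sketched above, each row comes back as a dictionary, roughly (Python 2 dictionary key order is arbitrary):

{'url': 'http://example.com/a.html', 'name': '//h1/text()', 'price': "//*[@id='price']/text()"}
{'url': 'http://example.com/b.html', 'name': '//main/h2/text()', 'price': "//span[@class='price']/text()"}
{'url': 'http://example.com/c.html', 'name': '//header/h1/text()', 'price': "//p[@class='price-tag']/text()"}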
Note: the first row of the todo.csv file is automatically used as the dictionary keys.
Now make the spider read the URLs and XPath expressions from todo.csv at run time. Because we do not know the URLs in advance, remove the start_urls and allowed_domains parts from the spider and use the start_requests() method instead: it yields a Request object for each row in the CSV file, placing the field names and XPath expressions in request.meta so they are passed along to the parse() function, which then uses Item and ItemLoader to populate the item's fields.
import csv

import scrapy
from scrapy.http import Request
from scrapy.loader import ItemLoader
from scrapy.item import Item, Field


class FromcsvSpider(scrapy.Spider):
    name = "fromcsv"

    def start_requests(self):
        with open("todo.csv", "rU") as f:
            reader = csv.DictReader(f)
            for line in reader:
                # pop the url key from the dictionary to use as the request URL
                request = Request(line.pop('url'))
                request.meta['fields'] = line
                yield request

    def parse(self, response):
        item = Item()  # no Item class is defined in items.py in this project
        l = ItemLoader(item=item, response=response)
        for name, xpath in response.meta['fields'].iteritems():
            if xpath:
                item.fields[name] = Field()  # dynamically create a field on the item
                l.add_xpath(name, xpath)
        return l.load_item()
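Note the design choice here: because no Item subclass is declared in items.py, the spider attaches a Field to a bare Item at run time for every non-empty column in the row. This is what lets one spider serve any number of sites with different field sets.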
The fromcsv.py source file is available at:
https://github.com/Kylinlin/scrapybook/blob/master/ch05%2Fgeneric%2Fgeneric%2Fspiders%2Ffromcsv.py
Run the spider:

$ scrapy crawl fromcsv
The source code above hard-codes the todo.csv file name; the spider breaks as soon as the file is renamed, which is poor design. In fact, Scrapy offers a simple mechanism (the -a option) for passing parameters to a spider from the command line: with -a variable=value, the spider can read the value as self.variable in its source code. To check whether the variable was supplied and fall back to a default, use Python's getattr(self, 'variable', 'default'). The with statement above can therefore be modified to:
with open(getattr(self, "file", "todo.csv"), "rU") as f:
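Put in context, a minimal sketch of the modified start_requests() (identical to the version above except for the getattr() call) would be:

    def start_requests(self):
        # -a file=... overrides the file name; todo.csv is the default
        with open(getattr(self, "file", "todo.csv"), "rU") as f:
            reader = csv.DictReader(f)
            for line in reader:
                request = Request(line.pop('url'))
                request.meta['fields'] = line
                yield request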
Then specify the CSV file with the -a option when running the spider (todo.csv is used by default if -a is not given):
$ scrapy crawl fromcsv -a file=todo.csv