Using Scrapy to crawl food information
This section uses Scrapy to crawl gourmet food listings on Taobao. It covers multi-level page crawling, data storage, and picture download. The programming environment is PyCharm + Python 3.4 (Windows) + Scrapy 1.4.0.
1. Create a project: open cmd, use the cd command to enter the target folder, type scrapy startproject topgoods and press Enter. The following output appears:
2. Open the project in PyCharm. The project structure is shown in the picture:
3. Create a new Python file named taobao.py under the spiders folder. For the detailed creation and implementation process, see this blog post: http://www.aobosir.com/blog/2016/12/26/python3-large-web-crawler-taobao-com-import-to-MySQL-database/
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request


class TaobaoSpider(scrapy.Spider):
    name = 'taobao'
    allowed_domains = ['taobao.com']
    start_urls = ['http://www.taobao.com']

    def parse(self, response):
        key = 'snacks'
        for i in range(0, 2):  # crawl only the first two pages
            url = 'https://s.taobao.com/search?q=' + str(key) + '&s=' + str(44 * i)
            # print("url:", url)
            # get_page is implemented as described in the blog post referenced above
            yield Request(url=url, callback=self.get_page)
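The spider builds each search URL by string concatenation, which works for a plain ASCII keyword. As a minimal sketch, the same URLs could be produced with the standard library so that keywords containing spaces or non-ASCII characters are percent-encoded. The helper name build_search_urls and its parameters are hypothetical and are not part of the original spider.

```python
from urllib.parse import urlencode

SEARCH_BASE = 'https://s.taobao.com/search'  # same endpoint the spider requests


def build_search_urls(key, pages=2, page_size=44):
    """Return one search URL per result page (hypothetical helper)."""
    urls = []
    for i in range(pages):
        # 's' is the result offset: 0 for the first page, 44 for the second, ...
        query = urlencode([('q', key), ('s', page_size * i)])
        urls.append(SEARCH_BASE + '?' + query)
    return urls


for url in build_search_urls('snacks'):
    print(url)
```

Inside parse(), each returned URL would be passed to Request(url=url, callback=self.get_page) exactly as before.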
4. Following the blog post above, the spider can crawl Taobao listings, but it does not fully download the 48*2=96 items of merchandise information on the two pages. I set a download delay of 3 s in the settings, and the database and the picture folder still received only 80 items. If I find a better approach later, I will update this article. The information stored in the database is shown in the figure:
Building on the crawler above, you can also download the corresponding pictures. First, add a new field to the item class in items.py to store the picture download URL:
In addition, make a small change to the settings.py file:
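The exact settings used are not shown, so this fragment is an assumption about what the change could look like: it enables Scrapy's built-in FilesPipeline, chooses a storage folder, and keeps the 3-second download delay mentioned earlier. The folder name downloads is hypothetical.

```python
# settings.py (fragment)

# Enable the built-in file download pipeline. If the project already
# registers a database pipeline here, add this entry next to it.
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}

# Folder where downloaded pictures are saved (hypothetical path).
FILES_STORE = 'downloads'

# Wait 3 seconds between requests to the same site.
DOWNLOAD_DELAY = 3
```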
Most importantly, add the following code to taobao.py (put it in the next() method):
file_url = response.xpath('//*[@id="J_ImgBooth"]/@src').extract()[0]
file_url = 'http:' + file_url
print(file_url)
item['file_urls'] = [file_url]  # Note: file_urls must be a list; otherwise an error is raised.
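The snippet above prepends 'http:' because the scraped src attribute is a protocol-relative URL (it starts with //), and extract()[0] raises an IndexError when the XPath matches nothing. As a hedged sketch, the same step could be written with urljoin, which also leaves already-absolute URLs untouched. The helper name to_file_urls and the example host img.example.com are hypothetical.

```python
from urllib.parse import urljoin


def to_file_urls(src, page_url='http://www.taobao.com'):
    """Turn a scraped src attribute into the list stored in item['file_urls']."""
    if not src:
        # No picture found on the page: return an empty list instead of failing.
        return []
    # urljoin fills in the scheme of page_url for URLs that start with '//'.
    return [urljoin(page_url, src)]


print(to_file_urls('//img.example.com/a.jpg'))
```

In the spider, src could come from response.xpath(...).extract_first(), which returns None instead of raising when there is no match.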
Finally, in PyCharm's terminal, run the spider from the topgoods project (with the spider name used above, the command is scrapy crawl taobao):
The downloaded pictures are shown in the figure: