The goal is to crawl all product data on the site http://www.muyingzhijia.com/, including each product's first-level category, second-level category, title, brand, and price.
After some searching, I found that Python's Scrapy is a good crawler framework, so I wrote a simple crawler based on it.
First, analyze the site's pages. The home page http://www.muyingzhijia.com/ contains two kinds of useful category links: http://www.muyingzhijia.com/Shopping/category.aspx?cateID=11 and http://www.muyingzhijia.com/Shopping/subcategory.aspx?cateid=185&small=1. The former is a first-level category page and the latter is a second-level category page. The second-level category pages contain some product information, but not all five fields listed at the beginning of this article. Product detail pages, such as http://www.muyingzhijia.com/shopping/productdetail.aspx?pdtid=33158&frompromtype=tttj, contain all five.
The plan is therefore to use three types of links as entry points: http://www.muyingzhijia.com/shopping/alllist.aspx, http://www.muyingzhijia.com/shopping/category.aspx?cateid and http://www.muyingzhijia.com/shopping/subcategory.aspx?cateid. Links of the form http://www.muyingzhijia.com/Shopping/category.aspx?cateID and http://www.muyingzhijia.com/shopping/subcategory.aspx?cateid are followed automatically. Whenever a link of the form http://www.muyingzhijia.com/shopping/productdetail.aspx? is encountered, the page is parsed to extract the five required fields.
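
The link plan above can be sketched as a small URL classifier. This is only an illustration using the standard library, not the code from the repository: the function name `classify_link` and the exact regular expressions are assumptions derived from the URLs mentioned above. In a Scrapy `CrawlSpider`, patterns like these would typically be passed as the `allow` argument of a `LinkExtractor` inside a `Rule`, with `follow=True` for category links and a `callback` for product links.

```python
import re

# Patterns derived from the URLs discussed above. Matching is case-insensitive
# because the site mixes "Shopping"/"shopping" and "cateID"/"cateid".
FOLLOW_PATTERNS = [
    re.compile(r"/shopping/alllist\.aspx", re.I),
    re.compile(r"/shopping/category\.aspx\?cateid=\d+", re.I),
    re.compile(r"/shopping/subcategory\.aspx\?cateid=\d+", re.I),
]
PRODUCT_PATTERN = re.compile(r"/shopping/productdetail\.aspx\?", re.I)


def classify_link(url):
    """Return 'product', 'follow', or 'ignore' for a discovered URL.

    'product' pages are parsed for the five fields, 'follow' pages are
    only crawled for more links, and everything else is skipped.
    """
    if PRODUCT_PATTERN.search(url):
        return "product"
    if any(pattern.search(url) for pattern in FOLLOW_PATTERNS):
        return "follow"
    return "ignore"


if __name__ == "__main__":
    for link in [
        "http://www.muyingzhijia.com/Shopping/category.aspx?cateID=11",
        "http://www.muyingzhijia.com/Shopping/subcategory.aspx?cateid=185&small=1",
        "http://www.muyingzhijia.com/shopping/productdetail.aspx?pdtid=33158",
        "http://www.muyingzhijia.com/",
    ]:
        print(classify_link(link), link)
```

Keeping the "follow" and "product" patterns separate mirrors the plan: category pages are only a source of further links, while detail pages are the only ones that get parsed.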
The crawler implements automatic crawling, item deduplication, link deduplication, and storage of the scraped data in a database.
For the code, see: https://github.com/darlwen/spider
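
As a rough sketch of how item deduplication and database storage might fit together, the class below follows the shape of a Scrapy item pipeline (`open_spider`, `close_spider`, `process_item`). It is not the repository's implementation: the class name `ProductPipeline`, the field names, the choice of SQLite, and the use of the product URL as the uniqueness key are all assumptions made for illustration. Link deduplication usually needs no extra code, since Scrapy's scheduler filters out already-seen requests by default.

```python
import sqlite3

try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch also runs without Scrapy installed
    class DropItem(Exception):
        pass

# Hypothetical field names for the five values plus the page URL.
FIELDS = ("url", "category1", "category2", "title", "brand", "price")


class ProductPipeline:
    """Hypothetical pipeline: drops repeated items, stores the rest in SQLite."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path  # assumed default; a real crawl would use a file
        self.conn = None

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.db_path)
        # The PRIMARY KEY on url lets the database itself reject duplicates.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS product ("
            "url TEXT PRIMARY KEY, category1 TEXT, category2 TEXT, "
            "title TEXT, brand TEXT, price TEXT)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        row = tuple(item.get(name) for name in FIELDS)
        try:
            self.conn.execute(
                "INSERT INTO product VALUES (?, ?, ?, ?, ?, ?)", row
            )
        except sqlite3.IntegrityError:
            # Same URL seen before: treat the item as a duplicate.
            raise DropItem("duplicate product: %s" % item.get("url"))
        return item
```

In a Scrapy project, a pipeline like this would be enabled through the `ITEM_PIPELINES` setting. Relying on a unique key in the database keeps deduplication correct across restarts, at the cost of one failed insert per duplicate.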
Use Scrapy to crawl product data for a site