A Preliminary Exploration of the Scrapy Crawler Framework: Crawling Mobile Phone Spec Data from Zhongguancun Online (ZOL)

Source: Internet
Author: User
Tags: xpath, functions

There are plenty of articles on how to install and deploy Scrapy, but far fewer practical, end-to-end examples. Having just picked up the framework, I wrote a spider demo to practice with it.
As a hardware and gadget enthusiast, I chose the Zhongguancun Online (ZOL) mobile phone listing page, which I visit frequently, as the crawl target. The general idea is shown in the code below.

# coding: utf-8
import scrapy
import re
import os
import sqlite3
from myspider.items import SpiderItem


class ZolSpider(scrapy.Spider):
    name = "zol"
    # allowed_domains = ["http://detail.zol.com.cn/"]  # used to restrict the crawled domains
    start_urls = [
        # The main crawl entry is the Zhongguancun Online mobile phone listing page. For demonstration
        # (and to save time) only the first page is crawled; crawling the remaining pages works on
        # exactly the same principle as the second-level crawl. Multiple entry URLs can be listed here.
        "http://detail.zol.com.cn/cell_phone_index/subcate57_list_1.html"
    ]
    item = SpiderItem()  # Item fields cannot be created dynamically, so the global item is of little use here; values are passed between callbacks via meta instead
    # SQLite is used just for testing -- it is lightweight
    # database = sqlite3.connect(":memory:")
    database_file = os.path.join(os.path.dirname(os.path.abspath(__file__)), "phonedata.db")
    if os.path.exists(database_file):
        os.remove(database_file)
    database = sqlite3.connect(database_file)
    # Create the table with a single starting column; parameter columns are added dynamically later
    database.execute(
        '''
        CREATE TABLE CELL_PHONES
        (
            mobile_phone_model TEXT
        );
        '''
    )
    # Used to check whether all inserts and updates went through, compared against total_changes
    counter = 0

    # Crawls the phone listing homepage
    def parse(self, response):
        # Collect the links to the phone detail pages and spawn the second-level requests
        hrefs = response.xpath("//h3/a")
        for href in hrefs:
            url = response.urljoin(href.xpath("@href")[0].extract())
            yield scrapy.Request(url, self.parse_detail_page)

    # Crawls a phone detail page
    def parse_detail_page(self, response):
        # Get the phone model via XPath
        model = response.xpath("//h1").xpath("text()")[0].extract()
        # Create a database record for this phone model
        sql = 'INSERT INTO CELL_PHONES (mobile_phone_model) VALUES ("' + model + '")'
        self.counter += 1
        self.database.execute(sql)
        self.database.commit()
        # Get the link to the parameter details page ('参数' is the "Parameters" tab on the page)
        url = response.urljoin(
            response.xpath("//div[@id='tagnav']//a[text()='参数']").xpath("@href")[0].extract())
        # Scrapy is asynchronously driven (callbacks are scheduled independently), so when variables
        # must be shared between parent and child callbacks, pass them through the meta dict.
        # Global item fields cannot be created dynamically, which makes them a poor fit for this
        # kind of flexible crawling.
        yield scrapy.Request(url, callback=self.parse_param_page, meta={'model': model})

    # Crawls a phone parameter details page
    def parse_param_page(self, response):
        # Collect the parameter name fields and iterate over them one by one
        params = response.xpath("//span[contains(@class, 'Param-name')]")
        for param in params:
            legal_param_name_field = param_name = param.xpath("text()")[0].extract()
            # Turn the parameter name into a valid database column name
            # (must not start with a digit; drop '/' to avoid breaking the SQL)
            if re.match(r'^\d', param_name):
                legal_param_name_field = re.sub(r'^\d', "f" + param_name[0], param_name)
            if '/' in param_name:
                legal_param_name_field = legal_param_name_field.replace('/', '')
            # Query sqlite_master to check whether the dynamically added column already exists; if not, add it
            sql = "SELECT * FROM sqlite_master WHERE name = 'CELL_PHONES' AND sql LIKE '%" + legal_param_name_field + "%'"
            if self.database.execute(sql).fetchone() is None:
                sql = "ALTER TABLE CELL_PHONES ADD " + legal_param_name_field + " TEXT"
                self.database.execute(sql)
                self.database.commit()
            # Locate the parameter value element relative to the parameter name via XPath
            xpath = "//span[contains(@class, 'Param-name') and text()='" + param_name + \
                    "']/following-sibling::span[contains(@id, 'newpmval')]//text()"
            vals = response.xpath(xpath)
            # Some parameters have multiple values, so concatenate them into one field for easy storage.
            # For finer-grained queries, use LIKE, a database with full-text indexing, or even NoSQL.
            pm_val = ""
            for val in vals:
                pm_val += val.extract()
            pm_val = re.sub(r'\r|\n', "", pm_val)
            sql = "UPDATE CELL_PHONES SET %s = '%s' WHERE mobile_phone_model = '%s'" \
                  % (legal_param_name_field, pm_val, response.meta['model'])
            self.database.execute(sql)
            self.counter += 1
        # Check whether the crawled data looks correct
        results = self.database.execute("SELECT * FROM CELL_PHONES").fetchall()
        # Be sure to commit, otherwise the persisted database may be incomplete
        self.database.commit()
        print(self.database.total_changes, self.counter)  # compare whether any inserts or updates were lost
        for row in results:
            print(row, end='\n')  # there is still a small encoding issue to sort out
        # Finally, start the crawler happily with: scrapy crawl zol
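The spider imports SpiderItem from myspider.items but never actually populates it, because Scrapy item fields must be declared up front; that is why values are handed between callbacks through meta instead. For completeness, below is a minimal sketch of what that items.py could look like. The field name is an assumption made here for illustration; the article does not show the real item definition.

# myspider/items.py -- hypothetical minimal sketch, not the author's actual file
import scrapy

class SpiderItem(scrapy.Item):
    # Item fields must be declared statically, which is exactly the limitation
    # the spider above works around by passing values via request.meta
    model = scrapy.Field()

With the spider placed in the project's spiders directory, the crawl is started from the project root with scrapy crawl zol, as the closing comment in the code notes.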
[Figure: partial crawled data in the database]


It is recommended to modify USER_AGENT in the project's settings script to simulate browser requests and avoid basic anti-crawling checks. For example:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
Of course, there are more advanced ways to cope with anti-crawling measures:
1. For anti-crawling based on user behavior analysis, add crawl logic that fetches proxy resources dynamically and rotates IP addresses and user agents frequently (a minimal sketch of a UA-rotation middleware follows after this list); session- and cookie-based anti-crawling can be handled the same way;
2. For pages built with AJAX or other asynchronous JavaScript interactions, you can reproduce the underlying requests directly; if the requests are encrypted or signed, use Selenium + WebDriver to drive a real browser and simulate user interaction;
3. The choice of matching method (regular expressions, XPath, CSS selectors, and so on) varies from person to person; CSS selectors are not recommended when the front end changes frequently.
Regular expressions generally execute faster than XPath, but XPath can locate (groups of) elements based on the document hierarchy and attribute values, and can even combine XPath functions (a short standalone comparison follows after this list);
In general, regular expressions and XPath should be basic skills for any crawler author, especially for targeted data extraction.
4. For routing and task scheduling, Scrapy provides a simple asynchronous I/O solution that makes it easy to crawl multi-level pages and to implement deep (selective) crawling with base URLs and flexible custom callbacks.
However, it is less flexible when crawling massive amounts of data, where explicit queue management (deduplication, interruption recovery, avoiding re-runs) and distributed crawlers may be a better fit.
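For point 1, rotating the user agent on every request can be done with a small downloader middleware. The following is only a sketch under assumed names: USER_AGENT_LIST is a custom setting invented here to hold several UA strings, and the middleware must be enabled through DOWNLOADER_MIDDLEWARES; a proxy pool for IP rotation would hook into the same place.

# myspider/middlewares.py -- sketch of a per-request user-agent rotation middleware (assumed names)
import random

class RandomUserAgentMiddleware(object):
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is an assumed custom setting listing several UA strings
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers['User-Agent'] = random.choice(self.user_agents)
        # IP rotation could be added here as well, e.g.:
        # request.meta['proxy'] = 'http://some-proxy-host:port'

It would then be enabled in settings.py with something like:

DOWNLOADER_MIDDLEWARES = {
    'myspider.middlewares.RandomUserAgentMiddleware': 400,
}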
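As a small illustration of point 3, the same value can often be pulled either with a regular expression over the raw HTML or with an XPath expression that follows the element hierarchy. The fragment below is made up purely for demonstration and runs standalone with Scrapy's Selector.

# Standalone comparison of regex vs. XPath extraction on an illustrative HTML fragment
import re
from scrapy.selector import Selector

html = '<div class="product"><h1>Example Phone 9</h1><span class="price">999</span></div>'

# Regular expression: usually faster, but tightly coupled to the exact markup
model_re = re.search(r'<h1>(.*?)</h1>', html).group(1)

# XPath: follows the document structure and can combine functions such as contains()
model_xpath = Selector(text=html).xpath('//div[contains(@class, "product")]/h1/text()').extract_first()

print(model_re, model_xpath)  # both yield: Example Phone 9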
Of course, when learning Python crawling, mastering modules such as urllib/urllib2/urllib3, requests, BeautifulSoup, and lxml will also serve you well.
P.S. More and more people are writing crawlers in Go, with much better performance than Python; it is worth a try. Java folks can also take a look at the Nutch engine. (It's a long road; let's keep learning together.)
