A First Look at the Scrapy Crawler Framework: Crawling Mobile Phone Parameters from Zhongguancun Online (ZOL)


There are already plenty of articles about how to install and deploy Scrapy, but not many hands-on examples online. Since I have recently been learning the crawler framework, I simply wrote a spider demo for practice.
As a hardware and digital gadget enthusiast, I chose the ZOL (Zhongguancun Online) mobile phone pages that I visit often as the crawl target; the overall idea is shown in the code below.

# coding: utf-8
import scrapy
import re
import os
import sqlite3
from myspider.items import SpiderItem


class ZolSpider(scrapy.Spider):
    name = "zol"
    # allowed_domains = ["http://detail.zol.com.cn/"]  # restricts which domains may be crawled
    start_urls = [
        # Crawl the ZOL mobile phone listing page. Since this is only a
        # demonstration, just the first page is crawled; paginated pages follow
        # the same two-level crawling principle, so they are skipped to save time.
        # You can put any entry URL here.
        "http://detail.zol.com.cn/cell_phone_index/subcate57_list_1.html"
    ]
    item = SpiderItem()  # item fields cannot be created dynamically, so this is
                         # effectively unused; the meta dict is used instead to
                         # pass values between spider callbacks
    # SQLite is used for the test because it is lightweight
    # database = sqlite3.connect(":memory:")
    database_file = os.path.dirname(os.path.abspath(__file__)) + "\\phonedata.db"
    if os.path.exists(database_file):
        os.remove(database_file)
    database = sqlite3.connect(database_file)
    # Create the first column up front; a Chinese column name (手机型号,
    # "phone model") keeps the meaning of the field easy to understand
    database.execute(
        """
        CREATE TABLE cell_phones
        (
            手机型号 TEXT
        );
        """
    )
    # Used to check whether every data modification went through,
    # by comparing against total_changes
    counter = 0

    # Callback for the phone listing (quote) page
    def parse(self, response):
        # Collect the detail page links and spawn the second-level requests
        hrefs = response.xpath("//h3/a")
        for href in hrefs:
            url = response.urljoin(href.xpath("@href")[0].extract())
            yield scrapy.Request(url, self.parse_detail_page)

    # Callback for the phone detail page
    def parse_detail_page(self, response):
        # Get the phone model via XPath
        model = response.xpath("//h1").xpath("text()")[0].extract()
        # Create a database record for this phone model
        sql = 'INSERT INTO cell_phones (手机型号) VALUES ("' + model + '")'
        self.counter += 1
        self.database.execute(sql)
        self.database.commit()
        # Get the link to the parameter detail page ('参数' is the Chinese
        # link text meaning "parameters")
        url = response.urljoin(
            response.xpath("//div[@id='tagNav']//a[text()='参数']")
            .xpath("@href")[0].extract())
        # Since Scrapy is asynchronously driven, variables that must be shared
        # between parent and child callbacks are passed through the meta dict;
        # the class-level item cannot do this and is unsuitable for more
        # flexible crawling scenarios
        yield scrapy.Request(url, callback=self.parse_param_page,
                             meta={'model': model})

    # Callback for the phone parameter detail page
    def parse_param_page(self, response):
        # Get the parameter name fields and iterate over them one by one
        params = response.xpath("//span[contains(@class, 'param-name')]")
        for param in params:
            legal_param_name_field = param_name = param.xpath("text()")[0].extract()
            # Convert the parameter name into a legal column name (it must not
            # start with a digit, and '/' is stripped so it cannot pollute the SQL)
            if re.match(r'^\d', param_name):
                legal_param_name_field = re.sub(r'^\d', "F" + param_name[0], param_name)
            if '/' in param_name:
                legal_param_name_field = legal_param_name_field.replace('/', '')
            # Check the master table to see whether the dynamically added column
            # already exists, and add it if it does not
            sql = ("SELECT * FROM sqlite_master WHERE name='cell_phones' "
                   "AND sql LIKE '%" + legal_param_name_field + "%'")
            if self.database.execute(sql).fetchone() is None:
                sql = "ALTER TABLE cell_phones ADD " + legal_param_name_field + " TEXT"
                self.database.execute(sql)
                self.database.commit()
            # Locate the parameter value elements relative to the parameter name
            xpath = ("//span[contains(@class, 'param-name') and text()='" + param_name +
                     "']/following-sibling::span[contains(@id, 'newPmVal')]//text()")
            vals = response.xpath(xpath)
            # Some parameters have multiple values, so join them into a single
            # field for easy storage; to split the data again later, a LIKE
            # clause or a database with full-text indexing works, and NoSQL
            # is even better
            pm_val = ""
            for val in vals:
                pm_val += val.extract()
            pm_val = re.sub(r'\r|\n', "", pm_val)
            sql = ("UPDATE cell_phones SET %s = '%s' WHERE 手机型号 = '%s'"
                   % (legal_param_name_field, pm_val, response.meta['model']))
            self.database.execute(sql)
            self.counter += 1
        # Inspect the crawled data
        results = self.database.execute("SELECT * FROM cell_phones").fetchall()
        # Do not forget the commit, or the persisted database may be incomplete
        self.database.commit()
        print(self.database.total_changes, self.counter)  # compare the two to spot lost writes
        for row in results:
            print(row, end='\n')  # there is still a small encoding issue to solve here
        # Finally, happily run `scrapy crawl zol` to start the spider!
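The spider imports SpiderItem from myspider.items, which the original post does not show. Since the demo stores everything in SQLite and passes values via meta, the item is effectively a stub; a minimal sketch, assuming nothing more is needed, would be:

# myspider/items.py (sketch): the item is unused by the demo, which writes to
# SQLite directly and passes values between callbacks via meta
import scrapy

class SpiderItem(scrapy.Item):
    model = scrapy.Field()  # hypothetical field; the demo never populates it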
(Screenshot: a portion of the crawled data in the database.)

Finally, we recommend modifying USER_AGENT in the settings script to simulate browser requests and avoid being blocked, for example:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
Of course, more advanced anti-crawling measures call for other countermeasures:
1. For sites that detect crawlers by user behavior, you can randomize the crawl routes, fetch resources the way a dynamic page would, and frequently switch IP addresses and user agents; session- and cookie-based anti-crawling can be countered by the same means, e.g. with the middleware sketched below.
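As a concrete illustration of UA switching, here is a minimal sketch of a Scrapy downloader middleware that picks a random user agent per request; the USER_AGENTS setting and the middleware name are assumptions of mine, not part of the original post:

# middlewares.py (sketch): rotate the User-Agent header on every request
import random

class RandomUserAgentMiddleware:
    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENTS is a custom list of UA strings defined in settings.py
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # Overwrite the User-Agent header for every outgoing request
        request.headers['User-Agent'] = random.choice(self.agents)

# settings.py additions (sketch):
# USER_AGENTS = ['Mozilla/5.0 ...', 'Mozilla/5.0 ...']
# DOWNLOADER_MIDDLEWARES = {'myspider.middlewares.RandomUserAgentMiddleware': 400}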
2. For AJAX and other asynchronous JS-driven pages, you can craft the JS-issued requests yourself; if those requests are encrypted, drive a real browser with Selenium + WebDriver to simulate user interaction instead, as in the sketch below.
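A minimal sketch of that browser-driven fallback might look like this (it assumes chromedriver is installed locally; the URL is the listing page from earlier):

# Render a JS-heavy page in a real browser, then hand the HTML to any parser
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("http://detail.zol.com.cn/cell_phone_index/subcate57_list_1.html")
html = driver.page_source  # the HTML after JavaScript has run
driver.quit()
# html can now be parsed with e.g. scrapy.Selector(text=html)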
3. As for matching techniques, the choice among regular expressions, XPath, and CSS selectors is personal preference; if the site's front end changes often, CSS selectors are not recommended.
Regular expressions are much faster than XPath, but XPath can flexibly locate one or many (groups of) elements by logical hierarchy and attribute-value conditions, and can even use XPath functions; see the short comparison below.
In general, regex and XPath should be basic skills for anyone writing crawlers, and they are especially important for data-oriented crawling.
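To make the comparison concrete, here is a small sketch extracting the same link three ways with Scrapy's Selector; the sample HTML mirrors the //h3/a structure used by the spider above:

import re
from scrapy.selector import Selector

html = '<h3><a href="/cell_phone/sample.html">Phone X</a></h3>'  # sample markup
sel = Selector(text=html)

print(sel.xpath("//h3/a/@href").extract_first())    # XPath: locate by hierarchy
print(sel.css("h3 a::attr(href)").extract_first())  # CSS selector equivalent
print(re.search(r'href="([^"]+)"', html).group(1))  # regex: fastest, most brittle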
4. Regarding routing and task scheduling: Scrapy's simple asynchronous IO model makes it easy to crawl multi-level pages and implement deep (selective) crawling with base URLs and flexible custom callbacks,
but for scenarios that crawl massive amounts of data its flexibility is limited, so proper queue management (deduplication, recovering from stops and restarts) and distributed crawlers are worth trying, e.g. with the configuration sketched below.
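One common way to get that, for example, is the third-party scrapy-redis package, which moves the request queue and the duplicate filter into Redis so multiple workers can share them; a minimal settings.py sketch, assuming scrapy-redis is installed and a local Redis is running, looks like:

# settings.py (sketch): share the queue and dupe filter across workers via Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True  # keep the queue so an interrupted crawl can resume
REDIS_URL = "redis://localhost:6379"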
Of course, when learning Python crawling, mastering modules such as urllib (urllib2/urllib3), requests, BeautifulSoup, and lxml will make you more capable still; pick whichever fits the situation. A requests + lxml version of the first-level fetch is sketched below.
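For instance, the first-level link extraction from the spider above could be done without Scrapy, as a sketch using requests and lxml:

# Fetch the listing page and pull the detail links, without Scrapy
import requests
from lxml import html

resp = requests.get(
    "http://detail.zol.com.cn/cell_phone_index/subcate57_list_1.html",
    headers={"User-Agent": "Mozilla/5.0"},  # minimal UA so the request looks browser-like
)
tree = html.fromstring(resp.content)
for href in tree.xpath("//h3/a/@href"):
    print(href)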
P.S. For those who prefer Golang, crawlers written in Go perform far better than Python ones, so they are worth a try; Java developers can also take a look at the Nutch engine. The road is long; let's keep learning together.

