A while ago a friend asked me to write a small script to grab competitors' sales and price data from a certain Taobao mall, so that he could adjust his marketing strategy in near real time. I had written crawlers to scrape Taobao data before, so the implementation looked straightforward and I agreed. The initial idea was to use urllib.request and re from the Python standard library (this example uses Python 3.4; 2.x needs minor changes), plus some statistical analysis functions, and to finish it in at most two evenings (I work during the day). In practice I ran into two major difficulties: (1) e-commerce sites protect their transaction data very well, so a small crawler quickly gets banned or otherwise blocked from collecting the data it needs, and extra code is required to handle all of these anti-scraping measures; (2) the regular expressions are hard to write, and once they grow complex they are hard to maintain. So I went looking for other solutions, and this post is a first introduction to one of them.

The first thing that came to mind was, of course, the famous third-party library BeautifulSoup (as a Cantonese speaker I like to call it "soup"). It is very powerful, but precisely because it is so powerful it takes some time to learn, and I needed to get started quickly, so I set it aside for later (I will write a separate BeautifulSoup study summary). After weighing the options, my attention finally settled on html.parser in the Python standard library.

html.parser is a simple and practical library whose core is the HTMLParser class. Looking at the source, it wraps a series of regular expressions. It works roughly like this: when you feed it a string of HTML, it calls the goahead method to iterate forward over the individual tags, calls the corresponding parse_xxx methods to extract the start tag, tag name, attributes, data, comments and end tag, and then calls the matching handler methods to process what was extracted. From the overall structure of HTMLParser you can see that the handlers for start tags (handle_starttag), end tags (handle_endtag) and data (handle_data) are not implemented in HTMLParser itself (their bodies are just pass), which means we have to subclass HTMLParser and override these methods. Refer to the Python documentation for the full details; the most commonly used methods are listed below, followed by a minimal subclass sketch:
- feed(data): the main entry point; it accepts a str containing HTML markup. When it is called with data, the instance starts processing it; call close() after the last chunk to flush anything still buffered.
- handle_starttag(tag, attrs): receives the tag name and attributes returned by parse_starttag and handles them; the default implementation is empty, so the user normally overrides it. For example, if the start tag encountered is <a>, the corresponding argument is tag='a' (tag names are converted to lowercase). attrs holds the attributes inside the start tag as a list of (name, value) tuples (attribute names are lowercased as well). For example, for <a href="http://www.baidu.com">, the internal call is handle_starttag('a', [('href', 'http://www.baidu.com')]).
- handle_endtag(tag): the same as above, but for end tags, i.e. tags that begin with </.
- handle_data(data): handles the page content, i.e. the text between a start tag and its end tag. For example, in <script>...</script> the data is the "..." part.
- reset(): resets the instance; any unprocessed data, including data passed in via feed(), is discarded.
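To make these callbacks concrete, here is a minimal sketch of my own (not from the original post) that subclasses HTMLParser and simply echoes what each callback receives:

```python
from html.parser import HTMLParser

class PrintingParser(HTMLParser):
    """Minimal subclass that echoes whatever the parser hands to each callback."""

    def handle_starttag(self, tag, attrs):
        print('start tag:', tag, 'attrs:', attrs)

    def handle_endtag(self, tag):
        print('end tag:  ', tag)

    def handle_data(self, data):
        if data.strip():                      # ignore whitespace-only text nodes
            print('data:     ', data.strip())

p = PrintingParser()
p.feed('<a href="http://www.baidu.com">Baidu</a>')
p.close()
# Expected output:
#   start tag: a attrs: [('href', 'http://www.baidu.com')]
#   data:      Baidu
#   end tag:   a
```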
Let me give a concrete example. Suppose we have the following pile of HTML-tagged data:
"Golden Crown Spot/full color/Top with edition" xiaomi/Millet Millet Note mobile Unicom 4G mobile phone
<p class= "Tb-subtitle" >
"Buy machine is sent pudding cover + HD film + line control headphones + Clip Card + film bracket and so on, package more preferential" "Purchase machine that is sent pudding set + HD Film + line control headphones + Clip Card + film bracket, etc., package more preferential" "Golden Crown Prestige + Shun Fung Pack Mail + National Guarantee---multiple protection"
</p>
<div id= "J_tedititem" class= "Tb-editor-menu" ></div> </div>
"Spot enhancement/standard" miui/millet red rice mobile phone 2 red Rice 2 mobile Unicom Telecom 4G Dual SIM
<p class= "Tb-subtitle" >
[Red Rice mobile phone 2-generation color version more, please read the purchase instructions to buy---Thank you visit] "Golden Crown reputation Xiaomi mobile market sales First" "Purchase package send HD tempered film + line control call headset + cutter (including the restoration of the card) + anti-radiation stickers + Dedicated HD Film + wiping machine cloth + headphone winder + Mobile movie Bracket + one-year extended warranty service + default to enjoy the Shun Fung Pack mail!
</p>
<div id= "J_tedititem" class= "Tb-editor-menu" ></div> </div> Obviously, it contains two mobile phones, and our goal is to extract the names of two phones. Since when we feed this HTML into Htmlparser, all of their tags are iterated, and if it is necessary to extract only the data we need, we need to set the call Handle_ when Handle_starttag encounters that tag and attribute. Data and print out our results, this time we can use a FLG as the decision, the code is as follows:
```python
from html.parser import HTMLParser

# Define a MyParser class that inherits from HTMLParser
class MyParser(HTMLParser):
    re = []   # holds the results
    flg = 0   # flag marking whether the next data chunk is the one we want

    def handle_starttag(self, tag, attrs):
        if tag == 'h3':                                                # target tag
            for attr in attrs:
                if attr[0] == 'class' and attr[1] == 'tb-main-title':  # target attribute
                    self.flg = 1                                       # on a match, set the flag to 1
                    break
        else:
            pass

    def handle_data(self, data):
        if self.flg == 1:
            self.re.append(data.strip())  # the flag says this is the data we want: keep it
            self.flg = 0                  # reset the flag for the next iteration
        else:
            pass

my = MyParser()
my.feed(html)   # html is the string holding the HTML fragment shown above
```
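Continuing from the snippet above (my own addition, assuming html holds the fragment shown earlier), the extracted titles end up in my.re and can be printed like this:

```python
for title in my.re:
    print(title)

# With the sample fragment above this prints the two phone titles, roughly:
#   [Gold Crown / in stock / all colors / top edition] Xiaomi / Mi Note mobile Unicom 4G phone
#   [In stock, enhanced/standard edition] MIUI / Redmi 2 mobile Unicom Telecom 4G dual-SIM phone
```

One caveat about the design: re and flg are defined as class attributes here, so they are shared by every MyParser instance; for anything beyond a quick script it would be cleaner to initialize them as instance attributes in __init__ (remembering to call super().__init__()).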
The result of the run is exactly the two phone names we were after. This is only a very simple application of HTMLParser, but it already shows some of the qualities of the class. With these basics in place, we can extend the relevant functions into a proper crawler. Next time I will use this knowledge to build a basic web crawler, so stay tuned.

--------------------------------------------------

This is an original article by the author; please credit the source when reposting: @Datazen