Recently I needed to scrape data from the China Weather website. The real-time weather on its pages is generated by JavaScript and cannot be extracted by simply parsing tags, because the tags are not in the page source at all.
So I googled how to parse dynamic web pages with Python, and the following article was very helpful to me.
Reference: Python Introduction to Web page crawl, JS parsing
Because I only want to do the parsing on a Mac, I did not look for a cross-platform library. I tried SpiderMonkey first, but it is still not comprehensive enough: for example, document.write cannot be executed (if I have misunderstood this, please point it out, thank you). I also looked at PyWebKitGtk, but the installation did not succeed, which forced me to give up on it (I considered PyV8 as well, but gave up on that too).
After these failures, I found hope in Homebrew. It can install PyQt for you. PyQt is a Python binding for a GUI library, but it also includes a WebKit module, so you can use it to parse web pages.
Below I walk through the process of parsing a dynamic web page. This is more about learning the principle than anything else:
Step one: parse tags in a static page
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
The above is the HTML used for testing. I will parse its title tag, which is very simple.
#!/usr/bin/env python

from htmlentitydefs import entitydefs
from HTMLParser import HTMLParser
import sys, urllib2

class DataParser(HTMLParser):
    def __init__(self):
        self.title = None
        self.isTag = 0
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.isTag = 1

    def handle_data(self, data):
        if self.isTag:
            self.title = data

    def handle_endtag(self, tag):
        if tag == 'title':
            self.isTag = 0

    def getTitle(self):
        return self.title

url = 'file:///Users/myName/Desktop/pyqt/2.html'
# Open the file in a browser and copy the contents of the address bar
req = urllib2.Request(url)
fd = urllib2.urlopen(req)
parser = DataParser()
parser.feed(fd.read())
print "Title is:", parser.getTitle()
The result:
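The listing above targets Python 2 (HTMLParser, urllib2). For readers on Python 3, here is a minimal sketch of the same title extraction, assuming the standard html.parser module. SAMPLE_HTML, TitleParser and get_title are hypothetical names introduced for illustration, and the sample page is a stand-in because the full source of 2.html is not reproduced here.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for 2.html; the real file's contents are not shown here.
SAMPLE_HTML = """<html><head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
<title>Test page</title>
</head><body></body></html>"""


class TitleParser(HTMLParser):
    """Remember the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>.
        if self.in_title:
            self.title = data

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False


def get_title(html_text):
    """Parse html_text and return the title text, or None if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title


print("Title is:", get_title(SAMPLE_HTML))
```

Feeding a string directly keeps the sketch independent of any local file path. To parse a real file, read it and pass its decoded text to get_title.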
Step two: install the library
1. I assume you have Python installed.
2. Before parsing the dynamic web page, install PyQt first. Let brew install it for you; it saves a lot of effort.
To learn more about Homebrew, please visit the official website: Homebrew website
3. Note: PyQt is primarily a GUI library, but it includes the WebKit module, which is what we will use to parse dynamic web pages.
Step three: parse tags generated by JavaScript
1. Many tags are added to an HTML page dynamically, so just executing a piece of JavaScript from Python may not be enough to find them. A more common approach is to obtain the DOM tree after the scripts have run. (I may not have understood this correctly; if not, please correct me.)
2. Write a JS file that is loaded from the HTML file above.
alert("This is the called statement.");
var o = document.body;
function createDiv(text)
{
    var div = document.createElement("div");
    div.innerHTML = text;
    o.appendChild(div);
}
createDiv("15");
3. At this point, double-click 2.html. The effect is:
Only a 15 is shown. This is the data we want to parse. Now look at the page source:
There is no div tag in it, so parsing the source as it stands cannot find the data. The div is added by 5757.js (the JS file name was chosen arbitrarily).
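To make the difference concrete, here is a small Python 3 sketch that counts div tags in a page source and in the DOM after the script has run. Both HTML strings are made-up stand-ins (the real 2.html source is not reproduced here), and TagCounter and count_tag are hypothetical helpers.

```python
from html.parser import HTMLParser

# Hypothetical page source: the body only references the script, it has no <div>.
STATIC_SOURCE = """<html><head><title>Test page</title></head>
<body><script type="text/javascript" src="5757.js"></script></body></html>"""

# Hypothetical DOM after the script has run and appended its <div>.
RENDERED_DOM = """<html><head><title>Test page</title></head>
<body><script type="text/javascript" src="5757.js"></script><div>15</div></body></html>"""


class TagCounter(HTMLParser):
    """Count how many start tags of each name appear in the markup."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tag(html_text, tag):
    """Return the number of <tag> start tags found in html_text."""
    counter = TagCounter()
    counter.feed(html_text)
    counter.close()
    return counter.counts.get(tag, 0)


print("div tags in page source:", count_tag(STATIC_SOURCE, "div"))
print("div tags after JavaScript ran:", count_tag(RENDERED_DOM, "div"))
```

A static parser only sees what is in the text it is given, so the div becomes visible only once something has actually executed the script.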
Now the parsing begins. My solution benefited from this article, which I hope you will read as well: Scraping JavaScript webpages with WebKit
We're going to use WebKit to get the executed DOM tree:
#!/usr/bin/env python

import sys, urllib2
from HTMLParser import HTMLParser
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

class Render(QWebPage):

    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url = './2.html'
r = Render(url)
html = r.frame.toHtml()
print html.toUtf8()
# Write the executed page to a file
f = open('./test.txt', 'w')
f.write(html.toUtf8())
f.close()
I print the result and then write it to the file test.txt. Now look at what is in test.txt (do not double-click it, otherwise you will only see a 15; use a text editor to view it, for example Sublime Text 2):
It looks like ordinary HTML, but it contains what I want: notice the eighth line, where the div tag appears.
The last step: get that 15.
Stop and think about what we got from this line:
html = r.frame.toHtml()
It returns a QString object, which is not part of the Python standard library. Until I know PyQt better, I feel more comfortable converting it into a Python object. We can then parse it just like the static page. The key is this statement:
parser.feed(fd.read())
Of course, since the result can be written to a local file, we could open the file -> parse the file -> get the data, but I think nobody wants to go to that much trouble.
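As a rough Python 3 sketch of "convert it to a Python object, then parse it like a static page", assume the rendered page is already available as UTF-8 encoded bytes. The helper names to_text, DivParser and div_texts are hypothetical, and the byte string below is a made-up sample rather than real PyQt output.

```python
from html.parser import HTMLParser


def to_text(data, encoding="utf-8"):
    """Return a Python str, decoding bytes-like input (hypothetical helper)."""
    if isinstance(data, (bytes, bytearray)):
        return bytes(data).decode(encoding)
    return str(data)


class DivParser(HTMLParser):
    """Collect the text found inside <div> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <div> elements are currently open
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts.append(data)


def div_texts(data):
    """Decode data if needed, parse it, and return the text inside divs."""
    parser = DivParser()
    parser.feed(to_text(data))
    parser.close()
    return parser.texts


# Stand-in for the UTF-8 bytes of a rendered page (made-up sample).
rendered_bytes = "<body><div>15</div></body>".encode("utf-8")
print(div_texts(rendered_bytes))
```

Doing the conversion in memory avoids the round trip through a file on disk.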
Check out the Python documentation:
HTMLParser.feed(data)

Feed some text to the parser. It is processed insofar as it consists of complete elements; incomplete data is buffered until more data is fed or close() is called. data can be either unicode or str, but passing unicode is advised.
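The buffering described in that documentation entry can be observed with a short Python 3 sketch (html.parser is the Python 3 name of the module). EventRecorder is a hypothetical class used only to record what the parser reports, and the two chunks are made-up input.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record start tags and text in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


parser = EventRecorder()
parser.feed("<di")          # incomplete tag: buffered, nothing reported yet
events_after_first_chunk = list(parser.events)
parser.feed("v>15</div>")   # the rest arrives, so the tag is now complete
parser.close()

print(events_after_first_chunk)
print(parser.events)
```

This is why a whole page can be passed in one call or in several pieces: the parser waits until an element is complete before calling the handlers.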
So the parser accepts either unicode or str, as long as we pass it in. With that in mind, we alter the code slightly:
#!/usr/bin/env python

import sys, urllib2
from HTMLParser import HTMLParser
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

class DataParser(HTMLParser):
    def __init__(self):
        self.div = None
        self.isTag = 0
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            self.isTag = 1

    def handle_data(self, data):
        # Store the text in self.div, the attribute initialised in __init__
        if self.isTag:
            self.div = data

    def handle_endtag(self, tag):
        if tag == 'div':
            self.isTag = 0

    def getDiv(self):
        return self.div

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url = './2.html'
r = Render(url)
html = r.frame.toHtml()
#print html.toUtf8()
parser = DataParser()
# Convert the QString to a Python str before feeding it to the parser
parser.feed(str(html.toUtf8()))
print "JavaScript is", parser.getDiv()
#f = open('./test.txt', 'w')
#f.write(html.toUtf8())
#f.close()
The code is a simple merge of the two earlier listings, and the data is parsed out. The output is as follows:
Although the output is only three words, the dynamic tag was parsed successfully.
Step four: closing remarks
This article is more about the principle than about a finished implementation. I hope it is of some help to those who read it. Please correct me if anything is wrong.
Of course, it is unrealistic to apply this article directly to a real project, but hopefully it is a good starting point.