Yesterday I tried using the HTMLParser class to parse web pages, and found the results unsatisfactory. In any case, I am writing down the process first, in the hope that others can build on it to solve the problems I ran into.
I wrote two solutions. Of course, both are only valid for specific websites; here I will mainly cover the BBC homepage, www.bbc.co.uk, and www.163.com.
For the BBC:
This case is simpler; the difference may come down to the page's character encoding.
import html.parser
import urllib.request

class parsehtml(html.parser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a {} start tag".format(tag))
    def handle_endtag(self, tag):
        print("Encountered a {} end tag".format(tag))
    def handle_charref(self, name):
        print("charref")
    def handle_entityref(self, name):
        print("entityref")
    def handle_data(self, data):
        print("data")
    def handle_comment(self, data):
        print("comment")
    def handle_decl(self, decl):
        print("decl")
    def handle_pi(self, data):
        print("pi")
# The inheritance above is very simple: all of the parent's handler methods are overridden.
# Storing the BBC page with binary write mode is covered in my previous article (http://blog.csdn.net/xiadasong007/archive/2009/09/03/4516683.aspx), so I won't repeat it.
file = open("bbc.html", 'wb')  # note: 'wb', not 'w'
url = urllib.request.urlopen("http://www.bbc.co.uk/")
while 1:
    line = url.readline()
    if len(line) == 0:
        break
    file.write(line)
file.close()
# Create a parser object
pht = parsehtml()
# For this site I open the file with encoding='utf-8'; otherwise an error occurs.
# Other sites may not need it. UTF-8 is a Unicode encoding.
file = open("bbc.html", encoding='utf-8', mode='r')
# Read the page line by line and feed it to the parser
while 1:
    line = file.readline()
    if len(line) == 0:
        break
    pht.feed(line)
file.close()
pht.close()
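To see which handler fires for which piece of markup without downloading anything, the same subclass idea can be tried offline on a small HTML string. This is a minimal sketch of my own; the `EventCollector` class and the sample markup are not from the listing above, and it records events in a list instead of printing them:

```python
import html.parser

class EventCollector(html.parser.HTMLParser):
    """Records each parser event as a (kind, value) tuple."""
    def __init__(self):
        # convert_charrefs=False so &amp; triggers handle_entityref
        super().__init__(convert_charrefs=False)
        self.events = []
    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))
    def handle_endtag(self, tag):
        self.events.append(("end", tag))
    def handle_data(self, data):
        self.events.append(("data", data))
    def handle_comment(self, data):
        self.events.append(("comment", data))
    def handle_entityref(self, name):
        self.events.append(("entityref", name))

p = EventCollector()
p.feed('<p>Fish &amp; chips</p><!-- done -->')
p.close()
print(p.events)
# → [('start', 'p'), ('data', 'Fish '), ('entityref', 'amp'),
#    ('data', ' chips'), ('end', 'p'), ('comment', ' done ')]
```

Collecting events rather than printing makes it easy to inspect exactly how the parser splits the input.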
For 163.com:
# If the method above is used to parse this page, exceptions can occur when the CSS and JavaScript sections are reached,
# so here I strip those two parts out first. Let's look at the code:
import html.parser
import urllib.request

class parsehtml(html.parser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a {} start tag".format(tag))
    def handle_endtag(self, tag):
        print("Encountered a {} end tag".format(tag))
    def handle_charref(self, name):
        print("charref")
    def handle_entityref(self, name):
        print("entityref")
    def handle_data(self, data):
        print("data")
    def handle_comment(self, data):
        print("comment")
    def handle_decl(self, decl):
        print("decl")
    def handle_pi(self, data):
        print("pi")
# From here on, four functions for handling the CSS and JavaScript sections.
def encountercss(line):
    if line.find('<style type="text/css">') == -1:
        return 0
    return 1

def passcss(file, line):
    # print(line)
    while 1:
        if line.find("</style>") != -1:
            break
        line = file.readline()

def encounterjavascript(line):
    if line.find('<script type="text/javascript">') == -1:
        return 0
    return 1

def passjavascript(file, line):
    print(line)
    while 1:
        if line.find("</script>") != -1:
            break
        line = file.readline()
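Instead of stripping the style and script blocks line by line before feeding, the parser itself can track whether it is currently inside one of those tags and simply ignore their contents. This is a sketch of that alternative under my own naming (the `SkippingParser` class and its ignored-tag set are my choices, not part of the original code):

```python
import html.parser

class SkippingParser(html.parser.HTMLParser):
    """Collects text data, ignoring anything inside <script> or <style>."""
    IGNORE = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting level of ignored tags
        self.texts = []  # visible text fragments

    def handle_starttag(self, tag, attrs):
        if tag in self.IGNORE:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.IGNORE and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # only keep data encountered outside script/style blocks
        if self.depth == 0 and data.strip():
            self.texts.append(data.strip())

p = SkippingParser()
p.feed('<p>hello</p><script type="text/javascript">var x = 1;'
       '</script><p>world</p>')
p.close()
print(p.texts)
# → ['hello', 'world']
```

This avoids the fragile string matching on exact `<style type=...>` attribute spellings, since the handlers receive the tag name alone.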
website = "http://www.163.com"
file = open("163.html", mode='wb')  # note: 'wb', not 'w'
url = urllib.request.urlopen(website)
while 1:
    line = url.readline()
    if len(line) == 0:
        break
    file.write(line)
file.close()

pht = parsehtml()
file = open("163.html", mode='r')
while 1:
    line = file.readline()
    if len(line) == 0:
        break
    # In this loop, skip the CSS and JavaScript sections first.
    if encountercss(line) == 1:
        passcss(file, line)
    elif encounterjavascript(line) == 1:
        passjavascript(file, line)
    else:
        pht.feed(line)
file.close()
pht.close()
Although both approaches work, they are not what I want; I hope there is a more general way to process web pages.
I wanted to use BeautifulSoup, hoping that library could solve this problem. Unfortunately, my Python version is too new for it, so I will try again later.
Of course, you do not have to use the HTMLParser class to process web pages at all. We can write the code we need ourselves; only that way do we have real control over the parsing, and it also improves our own skills. In other words, we only need Python to download the page (the page's raw content) for us, and we do the parsing part ourselves.
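As a rough illustration of that do-it-yourself route, here is a sketch that pulls href attribute values out of downloaded HTML with plain string scanning and no parser class at all. The function name and the quoting assumption are mine (it only handles double-quoted hrefs), so treat it as a starting point rather than a robust extractor:

```python
def extract_hrefs(page):
    """Return every double-quoted href value found in the page text."""
    links = []
    pos = 0
    while True:
        pos = page.find('href="', pos)
        if pos == -1:          # no more href attributes
            break
        start = pos + len('href="')
        end = page.find('"', start)
        if end == -1:          # unterminated attribute; stop scanning
            break
        links.append(page[start:end])
        pos = end + 1
    return links

html_text = ('<a href="http://www.163.com">163</a> '
             '<a href="http://www.bbc.co.uk/">BBC</a>')
print(extract_hrefs(html_text))
# → ['http://www.163.com', 'http://www.bbc.co.uk/']
```

A hand-rolled scanner like this breaks on single-quoted or unquoted attributes, but it shows how little code is needed once Python has handled the download.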