Python Network Library Explained (2): Parsing Web Pages

Source: Internet
Author: User

Yesterday I tried to use the HTMLParser class to parse web pages and found the results unsatisfactory. In any case, I am writing the process down first, in the hope that others can build on it and solve the problems I ran into.

 

I wrote two solutions. Of course, both are only valid for specific websites; here I mainly deal with the BBC homepage, www.bbc.co.uk, and www.163.com.

 

For bbc:

This one is much simpler to handle, perhaps because of the page's encoding.

import html.parser
import urllib.request

class parsehtml(html.parser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("encountered a {} start tag".format(tag))
    def handle_endtag(self, tag):
        print("encountered a {} end tag".format(tag))
    def handle_charref(self, name):
        print("charref")
    def handle_entityref(self, name):
        print("entityref")
    def handle_data(self, data):
        print("data")
    def handle_comment(self, data):
        print("comment")
    def handle_decl(self, decl):
        print("decl")
    def handle_pi(self, data):
        print("pi")

# The subclassing above is very simple: all of the parent class's handler methods are overridden.

# Store the BBC web page using binary write mode; this was covered in the previous article (http://blog.csdn.net/xiadasong007/archive/2009/09/03/4516683.aspx), so it is not repeated here.

File = open ("bbc.html", 'wb ') # It's 'wb', not 'W'
Url = urllib. Request. urlopen ("http://www.bbc.co.uk /")
While (1 ):
Line = URL. Readline ()
If Len (line) = 0:
Break
File. Write (line)

# Generate an object

pht = parsehtml()

# For this website I open the file with encoding='utf-8'; otherwise an error occurs. Other websites may not need it. UTF-8 is a Unicode encoding.
File = open ("bbc.html", encoding = 'utf-8', mode = 'R ')

# Read the saved page line by line and feed it to the parser
while True:
    line = file.readline()
    if len(line) == 0:
        break
    pht.feed(line)
file.close()
pht.close()
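
To see what these overridden handlers actually report, here is a minimal sketch that feeds a small, made-up HTML fragment directly to the same kind of subclass, with no download involved:

import html.parser

class parsehtml(html.parser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("encountered a {} start tag".format(tag))
    def handle_endtag(self, tag):
        print("encountered a {} end tag".format(tag))
    def handle_data(self, data):
        print("data: {!r}".format(data))

# The fragment below is invented purely for illustration.
pht = parsehtml()
pht.feed("<html><body><p>Hello</p></body></html>")
pht.close()
# This prints a start-tag line for html, body and p, then the data 'Hello',
# then an end-tag line for p, body and html.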

 

For 163:

# If the method above is used to parse this page, exceptions may occur when the CSS and JavaScript sections are encountered,

# so here I strip out those two parts first. Here is the code:

import html.parser
import urllib.request

class parsehtml(html.parser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("encountered a {} start tag".format(tag))
    def handle_endtag(self, tag):
        print("encountered a {} end tag".format(tag))
    def handle_charref(self, name):
        print("charref")
    def handle_entityref(self, name):
        print("entityref")
    def handle_data(self, data):
        print("data")
    def handle_comment(self, data):
        print("comment")
    def handle_decl(self, decl):
        print("decl")
    def handle_pi(self, data):
        print("pi")

# From here on, I define four helper functions to handle the CSS and JavaScript sections.

def encountercss(line):
    if line.find('<style type="text/css">') == -1:
        return 0
    return 1

def passcss(file, line):
    # print(line)
    while True:
        if line.find("</style>") != -1:
            break
        line = file.readline()

def encounterjavascript(line):
    if line.find('<script type="text/javascript">') == -1:
        return 0
    return 1

def passjavascript(file, line):
    print(line)
    while True:
        if line.find("</script>") != -1:
            break
        line = file.readline()

 

Website = "http://www.163.com"
File = open ("163.html", mode = 'wb ') # It's 'wb', not 'W'
Url = urllib. Request. urlopen (website)
While (1 ):
Line = URL. Readline ()
If Len (line) = 0:
Break
File. Write (line)

 

pht = parsehtml()
file = open("163.html", mode='r')

while True:
    line = file.readline()
    if len(line) == 0:
        break

    # Inside this loop, skip the CSS and JavaScript sections first.
    if encountercss(line) == 1:
        passcss(file, line)
    elif encounterjavascript(line) == 1:
        passjavascript(file, line)
    else:
        pht.feed(line)
file.close()
pht.close()

Although both scripts can run successfully, this is not really what I want. I hope there is a more general way to process web pages.
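
One more general direction, sketched here on top of the same HTMLParser class rather than taken from the scripts above, is to do the skipping inside the subclass itself by tracking whether the parser is currently inside a <style> or <script> element, and to feed the whole saved file in one call so that tags split across lines are no longer a problem (the name skippingparser is just illustrative):

import html.parser

class skippingparser(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.skipping = 0  # > 0 while inside a <style> or <script> element
    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self.skipping += 1
        else:
            print("encountered a {} start tag".format(tag))
    def handle_endtag(self, tag):
        if tag in ("style", "script"):
            self.skipping = max(0, self.skipping - 1)
        else:
            print("encountered a {} end tag".format(tag))
    def handle_data(self, data):
        if self.skipping == 0:
            print("data")

pht = skippingparser()
# errors='replace' only guards against encoding mismatches; ideally pass the page's real encoding.
with open("163.html", mode='r', errors='replace') as f:
    pht.feed(f.read())
pht.close()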

I wanted to use BeautifulSoup, hoping that it could solve this problem. Unfortunately, my Python version is too new for it, so it cannot be used yet. I will try again later.
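
For reference, once a BeautifulSoup release that supports this Python version is available (today that is the beautifulsoup4 package), the same kind of job would look roughly like this; the queries below are only examples:

import urllib.request
from bs4 import BeautifulSoup  # requires the beautifulsoup4 package

html_bytes = urllib.request.urlopen("http://www.bbc.co.uk/").read()
soup = BeautifulSoup(html_bytes, "html.parser")  # bytes are fine; the encoding is detected

# Example queries: the page title and every link target.
print(soup.title.string if soup.title else None)
for a in soup.find_all("a"):
    print(a.get("href"))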

Of course, you do not have to use the HTMLParser class to process web pages at all. We can write the parsing code we need ourselves; that gives us more control over the parsing and also improves our own skills. In other words, we only need Python to help us download the web page (the page content), and we do the parsing part ourselves.
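
As a small sketch of that do-it-yourself idea (my own illustration, not code from the scripts above): let Python download the page, then pull out one piece of information, here the <title>, with plain string searches:

import urllib.request

data = urllib.request.urlopen("http://www.bbc.co.uk/").read()
text = data.decode("utf-8", errors="replace")  # assuming UTF-8, as for the BBC page above

# Hand-rolled extraction of the <title> element.
start = text.find("<title>")
end = text.find("</title>", start)
if start != -1 and end != -1:
    print(text[start + len("<title>"):end].strip())
else:
    print("no <title> found")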
