Yesterday I tried using the HTMLParser class to parse web pages, and found the results unsatisfactory. In any case, I am writing down the process first, in the hope that others can build on it to solve the problems I ran into.
I wrote two solutions. Of course, both are only valid for specific websites; here I will mainly cover the BBC homepage, www.bbc.co.uk, and www.163.com.
For the BBC:
This case is simpler; the difference may come down to the page's character encoding.
import html.parser
import urllib.request

class parsehtml(html.parser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a {} start tag".format(tag))
    def handle_endtag(self, tag):
        print("Encountered a {} end tag".format(tag))
    def handle_charref(self, name):
        print("charref")
    def handle_entityref(self, name):
        print("entityref")
    def handle_data(self, data):
        print("data")
    def handle_comment(self, data):
        print("comment")
    def handle_decl(self, decl):
        print("decl")
    def handle_pi(self, data):
        print("pi")
# The inheritance above is very simple: all of the parent's handler methods are overridden.
# Storing the BBC page with binary write mode is covered in my previous article (http://blog.csdn.net/xiadasong007/archive/2009/09/03/4516683.aspx), so I won't repeat it.
file = open("bbc.html", 'wb')  # note: 'wb', not 'w'
url = urllib.request.urlopen("http://www.bbc.co.uk/")
while 1:
    line = url.readline()
    if len(line) == 0:
        break
    file.write(line)
file.close()
# Create a parser object
pht = parsehtml()
# For this site I open the file with encoding='utf-8'; otherwise an error occurs.
# Other sites may not need it. UTF-8 is a Unicode encoding.
file = open("bbc.html", encoding='utf-8', mode='r')
# Read the page line by line and feed it to the parser
while 1:
    line = file.readline()
    if len(line) == 0:
        break
    pht.feed(line)
file.close()
pht.close()
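To see which handler fires for which piece of markup without downloading anything, the same subclass idea can be tried offline on a small HTML string. This is a minimal sketch of my own; the `EventCollector` class and the sample markup are not from the listing above, and it records events in a list instead of printing them:

```python
import html.parser

class EventCollector(html.parser.HTMLParser):
    """Records each parser event as a (kind, value) tuple."""
    def __init__(self):
        # convert_charrefs=False so &amp; triggers handle_entityref
        super().__init__(convert_charrefs=False)
        self.events = []
    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))
    def handle_endtag(self, tag):
        self.events.append(("end", tag))
    def handle_data(self, data):
        self.events.append(("data", data))
    def handle_comment(self, data):
        self.events.append(("comment", data))
    def handle_entityref(self, name):
        self.events.append(("entityref", name))

p = EventCollector()
p.feed('<p>Fish &amp; chips</p><!-- done -->')
p.close()
print(p.events)
# → [('start', 'p'), ('data', 'Fish '), ('entityref', 'amp'),
#    ('data', ' chips'), ('end', 'p'), ('comment', ' done ')]
```

Collecting events rather than printing makes it easy to inspect exactly how the parser splits the input.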
For 163.com:
# If the method above is used to parse this page, exceptions can occur when the CSS and JavaScript sections are reached,
# so here I strip those two parts out first. Let's look at the code:
import html.parser
import urllib.request

class parsehtml(html.parser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a {} start tag".format(tag))
    def handle_endtag(self, tag):
        print("Encountered a {} end tag".format(tag))
    def handle_charref(self, name):
        print("charref")
    def handle_entityref(self, name):
        print("entityref")
    def handle_data(self, data):
        print("data")
    def handle_comment(self, data):
        print("comment")
    def handle_decl(self, decl):
        print("decl")
    def handle_pi(self, data):
        print("pi")
# From here on, four functions for handling the CSS and JavaScript sections.
def encountercss(line):
    if line.find('<style type="text/css">') == -1:
        return 0
    return 1

def passcss(file, line):
    # print(line)
    while 1:
        if line.find("</style>") != -1:
            break
        line = file.readline()

def encounterjavascript(line):
    if line.find('<script type="text/javascript">') == -1:
        return 0
    return 1

def passjavascript(file, line):
    print(line)
    while 1:
        if line.find("</script>") != -1:
            break
        line = file.readline()
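Instead of stripping the style and script blocks line by line before feeding, the parser itself can track whether it is currently inside one of those tags and simply ignore their contents. This is a sketch of that alternative under my own naming (the `SkippingParser` class and its ignored-tag set are my choices, not part of the original code):

```python
import html.parser

class SkippingParser(html.parser.HTMLParser):
    """Collects text data, ignoring anything inside <script> or <style>."""
    IGNORE = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting level of ignored tags
        self.texts = []  # visible text fragments

    def handle_starttag(self, tag, attrs):
        if tag in self.IGNORE:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.IGNORE and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # only keep data encountered outside script/style blocks
        if self.depth == 0 and data.strip():
            self.texts.append(data.strip())

p = SkippingParser()
p.feed('<p>hello</p><script type="text/javascript">var x = 1;'
       '</script><p>world</p>')
p.close()
print(p.texts)
# → ['hello', 'world']
```

This avoids the fragile string matching on exact `<style type=...>` attribute spellings, since the handlers receive the tag name alone.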
website = "http://www.163.com"
file = open("163.html", mode='wb')  # note: 'wb', not 'w'
url = urllib.request.urlopen(website)
while 1:
    line = url.readline()
    if len(line) == 0:
        break
    file.write(line)
file.close()

pht = parsehtml()
file = open("163.html", mode='r')
while 1:
    line = file.readline()
    if len(line) == 0:
        break
    # In this loop, skip the CSS and JavaScript sections first.
    if encountercss(line) == 1:
        passcss(file, line)
    elif encounterjavascript(line) == 1:
        passjavascript(file, line)
    else:
        pht.feed(line)
file.close()
pht.close()
Although both approaches work, they are not what I want; I hope there is a more general way to process web pages.
I wanted to use BeautifulSoup, hoping that library could solve this problem. Unfortunately, my Python version is too new for it, so I will try again later.
Of course, you do not have to use the HTMLParser class to process web pages at all. We can write the code we need ourselves; only that way do we have real control over the parsing, and it also improves our own skills. In other words, we only need Python to download the page (the page's raw content) for us, and we do the parsing part ourselves.
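As a rough illustration of that do-it-yourself route, here is a sketch that pulls href attribute values out of downloaded HTML with plain string scanning and no parser class at all. The function name and the quoting assumption are mine (it only handles double-quoted hrefs), so treat it as a starting point rather than a robust extractor:

```python
def extract_hrefs(page):
    """Return every double-quoted href value found in the page text."""
    links = []
    pos = 0
    while True:
        pos = page.find('href="', pos)
        if pos == -1:          # no more href attributes
            break
        start = pos + len('href="')
        end = page.find('"', start)
        if end == -1:          # unterminated attribute; stop scanning
            break
        links.append(page[start:end])
        pos = end + 1
    return links

html_text = ('<a href="http://www.163.com">163</a> '
             '<a href="http://www.bbc.co.uk/">BBC</a>')
print(extract_hrefs(html_text))
# → ['http://www.163.com', 'http://www.bbc.co.uk/']
```

A hand-rolled scanner like this breaks on single-quoted or unquoted attributes, but it shows how little code is needed once Python has handled the download.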