Extract the song list from Baidu MP3 search results
Source: Internet
Author: User
1. Use HTMLParser to extract the song list from Baidu MP3 search results
"""
Extract the song list from Baidu MP3 search results.
The method uses a stack data structure together with tag levels.
1. Push every tag that has not been closed yet onto the stack.
2. HTML syntax errors are handled through levels. The tag on top of the stack has the lowest level. When a higher-level tag has to be popped, the lower-level tags above it are popped and processed first, so that no tag is missed.
3. When an end tag arrives, first check whether that tag is on the stack. If it is not, discard it.
4. Each time a new tag to be processed is pushed onto the stack, the collected data is cleared.
5. Analyze the specific web page. The Baidu MP3 list is relatively simple: the song table can be identified by its ID.
6. Note: when writing attr=value, the value should be enclosed in double quotes; otherwise Python reports the error "junk characters in start tag".
"""
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2
import re
import sys


class SongsParser(HTMLParser):
    def __init__(self):
        # Stack of tags whose closing </tag> has not been seen yet
        self.taglevels = []
        # Tags to be processed (HTMLParser reports tag names in lowercase)
        self.handledtags = ["table", "th", "td", "tr"]
        # Tag currently being processed
        self.processing = None
        # Text collected for the tag currently being processed
        self.data = ""
        # Set to 1 while inside the song table
        self.songs = 0
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if len(self.taglevels) and self.taglevels[-1] == tag:
            # If the previous tag is the same as the current one, treat it
            # as an unbalanced tag and close it first
            self.handle_endtag(tag)
        # Push the current tag onto the stack (1)
        self.taglevels.append(tag)
        # Baidu displays the songs in a table whose ID is "TBS". Only
        # this table is processed; data in other tables is ignored.
        if tag == "table":
            for item in attrs:
                if item[0] == "id" and item[1] == "TBS":
                    self.songs = 1
        # If the tag needs processing, clear the data and make it the
        # tag currently being processed (4)
        if tag in self.handledtags:
            self.data = ""
            self.processing = tag

    def handle_data(self, data):
        # Append the text if a tag is currently being processed
        if self.processing:
            self.data += data

    def handle_endtag(self, tag):
        if tag not in self.taglevels:
            # A tag that is not on the stack is illegal and is ignored (3)
            return
        while len(self.taglevels):
            # Pop the tag on top of the stack
            starttag = self.taglevels.pop()
            # If it is a tag to be processed, call the processing function
            if starttag in self.handledtags:
                self.finishprocessing(starttag)
            # Stop once the tag matching the end tag has been popped;
            # otherwise keep closing the lower-level tags above it (2)
            if starttag == tag:
                break

    def cleanse(self):
        # Collapse runs of whitespace into a single space
        self.data = re.sub(r"\s+", " ", self.data)

    def finishprocessing(self, tag):
        self.cleanse()
        if tag == "title" and tag == self.processing:
            print("Title = " + self.data)
        elif tag == "tr" and self.songs == 1:
            print("")
        elif tag == "td" and tag == self.processing and self.songs == 1:
            sys.stdout.write("\t" + self.data)
        elif tag == "table":
            # The song table contains no nested table, so reaching the end
            # of a table means the song list is complete; reset songs to 0
            self.songs = 0
        self.processing = None


fd = open("jcparsehtmlsongs.htm")
tp = SongsParser()
tp.feed(fd.read())
fd.close()
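The unwinding rule described in points 2 and 3 of the docstring can be looked at separately from the Baidu-specific logic. The following is only a minimal sketch, assuming Python 3's html.parser module; the class name TagStack, the attributes open_tags and closed, and the sample HTML string are hypothetical and introduced purely for illustration.

```python
from html.parser import HTMLParser


class TagStack(HTMLParser):
    """Track open tags on a stack and close them in a tolerant way."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # tags seen so far without a matching end tag
        self.closed = []     # order in which tags were actually closed

    def handle_starttag(self, tag, attrs):
        # A repeated tag on top of the stack (e.g. <td>a<td>b) is treated
        # as an implicit close of the previous one.
        if self.open_tags and self.open_tags[-1] == tag:
            self.handle_endtag(tag)
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # An end tag that was never opened is discarded.
        if tag not in self.open_tags:
            return
        # Pop the inner tags first until the requested tag has been closed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.closed.append(top)
            if top == tag:
                break


parser = TagStack()
# The second <td> has no preceding </td>, and </span> was never opened.
parser.feed("<table><tr><td>a<td>b</tr></span></table>")
parser.close()
print(parser.closed)
```

In this sample the second td implicitly closes the first, the end tag for tr closes the still-open td before closing tr itself, and the stray span end tag is dropped.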
2. Output results
1 Rainbow second featured Jay Chou. I'm very busy trying to hear the lyrics 4.0 m WMA
2 rainbow OS version zoelee BBS jayfc com Jay Chou I am very busy to listen to the lyrics 6.1 m MP3
3 rainbow OS version zoelee BBS jayfc com Jay Chou I am very busy to listen to the lyrics 6.1 m MP3
4 rainbow Jay Chou I am very busy trying to listen to the lyrics 3.1 m MP3
5 Rainbow second featured Jay Chou. I'm very busy trying to hear the lyrics 3.5 m WMA
6 rainbow jaycn Wang Bei Jay Chou I am very busy trying to listen to the lyrics 6.1 m MP3
7 rainbow OS version zoelee BBS jayfc com Jay Chou I am very busy to listen to the lyrics 6.1 m MP3
8 rainbow Jay Chou I am very busy to listen to the lyrics 0.4 m MP3
9 rainbow QQ: 8058722 Jay Chou I am very busy trying to listen to the lyrics 1.3 m MP3
10 Rainbow second featured Jay Chou. I'm very busy trying to hear the lyrics 0.3 m WMA
11 love paintings Jay Chou Liu Xiaohong rainbow heaven audition lyrics 0.2 m MP3
12 love paintings Jay Chou Liu Xiaohong rainbow heaven audition lyrics 0.2 m MP3
13 love painting Jay Chou Liu Xiaohong rainbow heaven audition lyrics unknown WMA
14 love painting Jay Chou Liu Xiaohong rainbow heaven audition lyrics 0.2 m MP3
15 love pictures Jay Chou Liu Xiaohong rainbow heaven audition lyrics unknown WMA
16 love paintings Jay Chou Liu Xiaohong rainbow heaven audition lyrics 0.2 m MP3
17 love painting Jay Chou Liu Xiaohong rainbow heaven audition lyrics 0.2 m MP3
18 love painting Jay Chou Liu Xiaohong rainbow heaven audition lyrics unknown WMA
19 love paintings Jay Chou Liu Xiaohong rainbow heaven audition lyrics 0.2 m MP3
20 love paintings Jay Chou Liu Xiaohong rainbow heaven audition lyrics 0.2 m MP3
21 love painting Jay Chou Liu Xiaohong rainbow heaven audition lyrics 0.2 m MP3
22 love paintings Jay Chou Liu Xiaohong rainbow heaven audition lyrics 0.2 m MP3
23 rainbow tribe jayhome CN Jay Chou I'm busy trying to listen to the lyrics 1.8 m MP3
24 love pictures Jay Chou Liu Xiaohong rainbow heaven audition lyrics 0.2 m MP3
25 love painting Jay Chou Liu Xiaohong rainbow heaven audition lyrics unknown WMA
26 love painting Jay Chou Liu Xiaohong rainbow heaven audition lyrics unknown WMA
27 love painting Jay Chou Liu Xiaohong rainbow heaven audition lyrics 0.2 m MP3
28 love pictures Jay Chou Liu Xiaohong rainbow heaven audition lyrics 0.2 m MP3
29 love painting Jay Chou Liu Xiaohong rainbow heaven audition lyrics unknown WMA
30 love painting Jay Chou Liu Xiaohong rainbow heaven audition lyrics unknown WMA
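Each row above is printed cell by cell as soon as a cell closes, which makes the result awkward to reuse. One possible variant, sketched here under the assumption of Python 3 and well-formed markup, collects the cells of the table with a given ID into lists instead. The class TableRows, its attributes, and the sample HTML are hypothetical; the sample is made up and does not reproduce the real Baidu page.

```python
import re
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of one table, selected by its id attribute."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.in_target = False  # inside the table with the wanted id
        self.in_cell = False    # inside a <td> of that table
        self.cell = []          # text fragments of the current cell
        self.rows = []          # one list of cell strings per <tr>

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.target_id:
            self.in_target = True
        elif self.in_target and tag == "tr":
            self.rows.append([])
        elif self.in_target and tag == "td":
            if not self.rows:  # a cell that appears before any <tr>
                self.rows.append([])
            self.in_cell = True
            self.cell = []

    def handle_data(self, data):
        if self.in_cell:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if self.in_target and tag == "td" and self.in_cell:
            # Collapse runs of whitespace and trim both ends.
            text = re.sub(r"\s+", " ", "".join(self.cell)).strip()
            self.rows[-1].append(text)
            self.in_cell = False
        elif tag == "table":
            # Assumes the target table contains no nested table.
            self.in_target = False


sample = """
<table id="other"><tr><td>ignored</td></tr></table>
<table id="TBS">
  <tr><td>1</td><td>song   one</td><td>3.1 M</td></tr>
  <tr><td>2</td><td>song
two</td><td>6.1 M</td></tr>
</table>
"""
parser = TableRows("TBS")
parser.feed(sample)
parser.close()
for row in parser.rows:
    print("\t".join(row))
```

Keeping the rows in a list leaves the choice of output format (tab-separated text, CSV, and so on) to the caller.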