The idea behind the program is very simple: first fetch the web page, then extract the <a> tags, and finally filter out the href attributes. Three variations follow.
1.
import re
import urllib2
from BeautifulSoup import BeautifulSoup as BS

html = urllib2.urlopen(url).read()
# html = unicode(html, 'gb2312', 'ignore').encode('utf-8', 'ignore')
content = BS(html).findAll('a')
myfile = open(localfile, 'w')
pat = re.compile(r'href="([^"]*)"')
pat2 = re.compile(r'http')
for item in content:
    h = pat.search(str(item))
    if h is None:
        continue  # skip anchors that have no href attribute
    href = h.group(1)
    if pat2.search(href):  # already an absolute link
        ans = href
    else:                  # relative link: prepend the base url
        ans = url + href
    myfile.write(ans + '\n')
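The same regex-based approach can be sketched in Python 3 using only the standard library. This is an illustrative variant, not the original script: the function name `collect_hrefs` is made up here, and relative links are resolved with `urllib.parse.urljoin` instead of naive string concatenation, which also handles paths like `/b` correctly.

```python
import re
from urllib.parse import urljoin

def collect_hrefs(html, base_url):
    # Pull every href="..." out of the raw HTML with a regex,
    # then resolve relative links against the base URL.
    pat = re.compile(r'href="([^"]*)"')
    links = []
    for href in pat.findall(html):
        if href.startswith('http'):   # already absolute
            links.append(href)
        else:                         # relative: resolve against the base
            links.append(urljoin(base_url, href))
    return links

sample = '<a href="http://example.com/a">x</a> <a href="/b">y</a>'
print(collect_hrefs(sample, 'http://example.com'))
# ['http://example.com/a', 'http://example.com/b']
```

Note that a regex only sees the literal text `href="..."`, so it will also match hrefs inside comments or script blocks; the parser-based approaches below avoid that.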
2.
def extractlinks(html):
    soup = BS(html)
    anchors = soup.findAll('a')
    links = []
    for a in anchors:
        links.append(a['href'])
    return links
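If BeautifulSoup is not available, the same tag-based extraction can be done with the standard library's `html.parser`. This is a Python 3 sketch of the idea, not the original code; the class name `LinkExtractor` is invented here.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # Collects the href of every <a> tag, like extractlinks() above,
    # but using only the standard library.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<p><a href="/x">x</a><a name="anchor">y</a></p>')
print(parser.links)
# ['/x']
```

Unlike the regex version, the parser only reports real start tags, and an `<a>` tag without an href (like the named anchor above) is simply skipped.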
3.
base_url = "http://www.hao123.com"
html = urllib2.urlopen(base_url).read()
soup = BS(html)
urls = soup.findAll('a')
links = []
for url in urls:
    href = url.get("href")  # .get avoids a KeyError on anchors without href
    if href:
        links.append(href)
for i in range(len(links)):
    print links[i]
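A real page like the one above usually repeats the same link many times, and some anchors have no href at all, so the collected list is worth cleaning before printing. A minimal, illustrative helper (the name `unique_links` is not from the original) that drops `None` entries and deduplicates while keeping first-seen order:

```python
def unique_links(hrefs):
    # Deduplicate while preserving first-seen order; None entries
    # (anchors without an href) are skipped.
    seen = set()
    out = []
    for h in hrefs:
        if h is None or h in seen:
            continue
        seen.add(h)
        out.append(h)
    return out

print(unique_links(['/a', None, '/b', '/a']))
# ['/a', '/b']
```

A set alone would also deduplicate, but it loses the order in which links appear on the page, which is often worth keeping for a crawler.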