Objective: Extract the links from a web page automatically, so that the data can be collected without human intervention.
1 Jsoup, an HTML parsing library implemented in Java
Download: http://jsoup.org/
2 Working platform: Ubuntu
3 Extracting page link information by calling Jsoup from Jython
Code:
#coding=utf-8
# Jsoup API docs: http://jsoup.org/apidocs/
from org.python.core import codecs
codecs.setDefaultEncoding('utf-8')  # Jython-specific: use UTF-8 as the default encoding
import sys
#print(sys.getdefaultencoding())
sys.path.append("/home/xxx/software/htmlparse/jsoup-1.7.3.jar")  # make the Jsoup jar importable
from org.jsoup import Jsoup

doc = Jsoup.connect("http://www.baidu.com").get()  # fetch and parse the page
elms = doc.getAllElements()
head = elms.select("head")
page_title = head.text()
print(page_title)
# select every element whose href attribute starts with "http"
hrfs = elms.select("[href^=http]")
for h in hrfs:
    title = h.text()
    url = h.attr('href')
    print title + "," + url
The output is as follows (the page title first, then one "text, URL" pair per link):
Baidu a bit, you will know
Experience the best Chinese input method on iPhone!, http://srf.baidu.com/ios8/pc.html
Login, https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
News, http://news.baidu.com
Hao123, http://www.hao123.com
Map, http://map.baidu.com
Video, http://v.baidu.com
Paste, http://tieba.baidu.com
Login, https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
Settings, http://www.baidu.com/gaoji/preferences.html
More Products, http://www.baidu.com/more/
News, http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=
Paste, http://tieba.baidu.com/f?kw=&fr=wwwt
Yes, http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt
Music, http://music.baidu.com/search?fr=ps&key=
Picture, http://image.baidu.com/i?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&word=
Video, http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=
Map, http://map.baidu.com/m?word=&fr=ps01000
Library, http://wenku.baidu.com/search?word=&lm=0&od=0
Set Baidu as homepage, http://www.baidu.com/cache/sethelp/index.html
About Baidu, http://home.baidu.com
About Baidu, http://ir.baidu.com
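The selector [href^=http] in the script above keeps only the elements whose href attribute starts with "http", which is why relative links do not appear in the output. To try the same filtering idea without Jython or Jsoup, here is a minimal sketch in plain Python 3 using the standard-library html.parser module. It is not how Jsoup works internally. The names LinkCollector, extract_links and SAMPLE, and the sample HTML itself, are made up for illustration. Unlike the Jsoup selector, which matches any element with an href, this sketch only looks at <a> tags.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for <a> tags whose href starts with "http"."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the <a> currently open, if it qualifies
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Same idea as the selector [href^=http]: keep absolute links only.
        if href and href.startswith("http"):
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    """Return a list of (text, href) pairs found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page: two absolute links and one relative link.
SAMPLE = """
<html><head><title>Demo</title></head><body>
<a href="http://news.baidu.com">News</a>
<a href="/local/page.html">Relative</a>
<a href="https://passport.baidu.com/v2/">Login</a>
</body></html>
"""

for text, url in extract_links(SAMPLE):
    print(text + ", " + url)
```

On the sample page, the relative link is skipped and only the two absolute links are printed, in the same "text, URL" form as the output above.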
Jython uses Jsoup to get page title and link information