Using jsoup from Jython to extract a webpage's title and links
Objective: collect the links on a website automatically, without manual work.
1. jsoup, an HTML parsing library implemented in Java
Download: http://jsoup.org/
2. Working platform: Ubuntu
3. Use Jython to call jsoup and extract the page's link information
Code:
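# coding=utf-8
# jsoup API docs: http://jsoup.org/apidocs/
import sys

from org.python.core import codecs

# Make UTF-8 the default encoding so non-ASCII link text prints cleanly.
codecs.setDefaultEncoding('utf-8')

# Put the jsoup jar on the path before importing from it.
sys.path.append("/home/xxx/software/htmlparse/jsoup-1.7.3.jar")
from org.jsoup import Jsoup

# Fetch and parse the page.
doc = Jsoup.connect("http://www.baidu.com").get()

# Print the text inside <head>, which in practice is the page title.
page_title = doc.select("head").text()
print page_title

# Select every element whose href attribute starts with "http".
hrfs = doc.select("[href^=http]")
for h in hrfs:
    title = h.text()
    url = h.attr('href')
    print title + ", " + url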
The output looks like this:
Baidu, you will know
Try the best Chinese input method on the iPhone!, http://srf.baidu.com/ios8/pc.html
Login, https://passport.baidu.com/v2?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
News, http://news.baidu.com
Hao123, http://www.hao123.com
Map, http://map.baidu.com
Video, http://v.baidu.com
Post it, http://tieba.baidu.com
Login, https://passport.baidu.com/v2?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
Settings, http://www.baidu.com/gaoji/preferences.html
More Products, http://www.baidu.com/more/
News, http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=
http://tieba.baidu.com/f?kw=&fr=wwwt
http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt
Music, http://music.baidu.com/search?fr=ps&key=
http://image.baidu.com/i?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&word=
Video, http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=
Map, http://map.baidu.com/m?word=&fr=ps01000
Library, http://wenku.baidu.com/search?word=&lm=0&od=0
Set Baidu as home page, http://www.baidu.com/cache/sethelp/index.html
About Baidu, http://home.baidu.com
About Baidu, http://ir.baidu.com
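To make the `[href^=http]` selector in the script above more concrete, here is a minimal sketch of the same filtering idea. It deliberately swaps jsoup for Python's standard-library `html.parser`, so it runs without Jython or the jsoup jar; it is not how jsoup works internally. The names `HttpLinkCollector`, `extract_http_links`, and the `SAMPLE` markup are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample page; the example.com URLs are placeholders.
SAMPLE = """
<html><head><title>Sample page</title></head><body>
<a href="http://news.example.com">News</a>
<a href="/local/path">Local</a>
<a href="https://maps.example.com/m?word=x">Map</a>
</body></html>
"""


class HttpLinkCollector(HTMLParser):
    """Collects (text, href) pairs for elements whose href starts with "http"."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the <a> currently open, if it matched
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        # The prefix test that [href^=http] expresses: attribute present
        # and its value beginning with "http".
        if href and href.lower().startswith("http"):
            if tag == "a":
                self._href, self._text = href, []
            else:
                # Elements such as <link> carry an href but no text.
                self.links.append(("", href))

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_http_links(html):
    """Return the (text, href) pairs found in an HTML string."""
    parser = HttpLinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    for text, url in extract_http_links(SAMPLE):
        print(text + ", " + url)
```

The relative link `/local/path` is skipped because its value does not begin with "http", which is why the output above contains only absolute URLs.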
How does Jsoup extract all titles and links in the following document?
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {
    public static void main(String[] args) throws IOException {
        // Parse the saved page from disk as UTF-8.
        Document doc = Jsoup.parse(new File("E:/a.html"), "UTF-8");
        // This assumes the matched elements carry the href and title attributes.
        // If they sit on a nested <a>, use "div.articleh span.l3 a" instead.
        Elements els = doc.select("div.articleh span.l3");
        List<Article> articles = new ArrayList<Article>();
        for (Element el : els) {
            articles.add(new Article(el.attr("href"), el.attr("title")));
        }
        // Print each title followed by its link, separated by a blank line.
        for (Article article : articles) {
            System.out.println(article.getTitle());
            System.out.println(article.getHref());
            System.out.println();
        }
    }
}

// Simple holder for one extracted link.
class Article {
    private String href;
    private String title;

    public Article(String href, String title) {
        this.href = href;
        this.title = title;
    }

    public String getHref() {
        return href;
    }

    public void setHref(String href) {
        this.href = href;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }
}
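The document that the question refers to is not shown, so the following is only a sketch under stated assumptions: it supposes that each title and link sits on an `<a>` nested inside `span.l3` inside `div.articleh`. It uses Python's standard-library `html.parser` rather than jsoup so that it runs on its own, and `ArticleLinkParser`, `extract_articles`, and the `SAMPLE` markup are hypothetical names and data.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

# Hypothetical markup; the real document's structure is an assumption.
SAMPLE = """
<div class="articleh">
  <span class="l1">120</span>
  <span class="l3"><a href="/news/1.html" title="First post">First...</a></span>
</div>
<div class="articleh odd">
  <span class="l3"><a href="/news/2.html" title="Second post">Second...</a></span>
</div>
<div class="other">
  <span class="l3"><a href="/skip.html" title="Skipped">Skipped</a></span>
</div>
"""


class ArticleLinkParser(HTMLParser):
    """Collects (title, href) from <a> tags under div.articleh span.l3."""

    def __init__(self):
        super().__init__()
        self.articles = []   # (title, href) pairs in document order
        self._open = []      # stack of (tag, classes) for currently open elements

    def _find_open(self, tag, cls, start=0):
        """Index of the first open `tag` with class `cls` at or after `start`, else -1."""
        for i in range(start, len(self._open)):
            open_tag, classes = self._open[i]
            if open_tag == tag and cls in classes:
                return i
        return -1

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            # Descendant match: a div.articleh ancestor with a span.l3 below it.
            div_at = self._find_open("div", "articleh")
            if div_at != -1 and self._find_open("span", "l3", div_at + 1) != -1:
                self.articles.append((attrs.get("title") or "", attrs.get("href") or ""))
        if tag not in VOID_TAGS:
            self._open.append((tag, set((attrs.get("class") or "").split())))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy markup.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break


def extract_articles(html):
    """Return the (title, href) pairs found in an HTML string."""
    parser = ArticleLinkParser()
    parser.feed(html)
    parser.close()
    return parser.articles


if __name__ == "__main__":
    for title, href in extract_articles(SAMPLE):
        print(title)
        print(href)
        print()
```

The link inside `div.other` is ignored because it has no `div.articleh` ancestor, which mirrors how a descendant selector narrows the match.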