Introduction to the jsoup Web Crawler Framework
Preface: Before I knew about the jsoup framework, a project required capturing content from other websites on a regular schedule, so I used HttpClient to fetch the pages of a specified site. That approach is clumsy: you request a URL, the page comes back as plain text, and you have to parse that text yourself. In other words, HttpClient only plays the role of the browser, and the returned text is usually processed by hand with String.indexOf() or String.substring().
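To make the contrast concrete, the manual parsing looks roughly like this. This is only an illustration: the HTML string stands in for a page already fetched with HttpClient, and the class name is made up.

// Hypothetical illustration of the "indexOf/substring" style of parsing described above.
public class ManualTitleExtract {
    public static void main(String[] args) {
        // Stand-in for a page previously fetched with HttpClient.
        String html = "<html><head><title>Example Page</title></head><body></body></html>";
        int start = html.indexOf("<title>") + "<title>".length();
        int end = html.indexOf("</title>");
        System.out.println(html.substring(start, end)); // prints: Example Page
    }
}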
When I discovered the jsoup framework one day, I realized how clumsy the previous method was...
jsoup is a Java HTML parser that can parse a URL address or raw HTML text directly. It provides a very convenient API for extracting and manipulating data using DOM methods, CSS selectors, and jQuery-like operations.
Main functions of jsoup
1. Parse HTML from a URL, a file, or a string;
2. Find and extract data using DOM traversal or CSS selectors;
3. Manipulate HTML elements, attributes, and text.
jsoup is released under the MIT license and can safely be used in commercial projects.
Jsoup usage
File input = new File("D:\\test.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/"); // third argument is the base URI used to resolve relative links (placeholder value)
Elements links = doc.select("a[href]");           // links with an href attribute
Elements pngs = doc.select("img[src$=.png]");     // all elements that reference a PNG image
Element masthead = doc.select("div.masthead").first();
Does this look familiar? Yes, the usage is similar to JavaScript and jQuery selectors, so you can get started with the jsoup API right away.
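Fetching a live page over HTTP works the same way. Here is a minimal sketch; the URL and class name are placeholders, not part of the original article.

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FetchExample {
    public static void main(String[] args) throws IOException {
        // connect() builds the request; get() executes it and parses the response into a Document.
        Document doc = Jsoup.connect("https://example.com/").get();
        System.out.println(doc.title());
        // The same CSS selectors used above work on a downloaded page.
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href") + "  " + link.text());
        }
    }
}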
What can jsoup do?
1. It is often used in CMS systems to capture news (i.e., as a web crawler).
2. Sanitizing user-submitted HTML to prevent cross-site scripting attacks; note that cross-site scripting is abbreviated XSS rather than CSS, to avoid confusion with Cascading Style Sheets (see the sketch after this list).
3. Probing or attacking websites (this requires familiarity with the HTTP protocol).
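For the XSS case, jsoup provides a cleaning API. A minimal sketch follows; the sample HTML is made up, and in jsoup 1.14 and later the allow-list class is org.jsoup.safety.Safelist, while older releases call it Whitelist.

import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class XssCleanExample {
    public static void main(String[] args) {
        // Untrusted HTML as a user might submit it, including a script injection attempt.
        String untrusted = "<p>Hello <b>world</b><script>alert('xss')</script></p>";
        // Jsoup.clean() keeps only the tags and attributes allowed by the safelist.
        String safe = Jsoup.clean(untrusted, Safelist.basic());
        System.out.println(safe); // the script tag is stripped, basic formatting is kept
    }
}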
Recently I used jsoup to write a method for a Java web crawler. It works fine when run from main(), but when the same method is placed in a web action and called from there, it does not behave normally.
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupTest {

    // Search URL; abbreviated in the original post.
    static String url = "www.sogou.com/..y?java webpage crawler &page=1";

    public static void main(String[] args) {
        Document doc = readUrlFist(url);
        write(doc);
    }

    // Write the parsed document back out as an HTML file.
    public static void write(Document doc) {
        try {
            FileOutputStream fos = new FileOutputStream("C:\\Documents and Settings\\Administrator\\Desktop\\a.html");
            OutputStreamWriter osw = new OutputStreamWriter(fos);
            BufferedWriter bw = new BufferedWriter(osw);
            bw.write(doc.toString());
            bw.flush();
            // Close the outermost writer first; it also closes the wrapped streams.
            bw.close();
            osw.close();
            fos.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static Document readUrlFist(String url) {
        Document doc = null;
        Connection conn = Jsoup.connect(url);
        conn.header(
                "User-Agent",
... (the rest of the listing is truncated in the original post)
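Since the listing is cut off above, here is a minimal sketch of how such a fetch method is typically finished with jsoup. The User-Agent value, the timeout, and the exception handling are assumptions for illustration, not the original author's code.

    public static Document readUrlFist(String url) {
        Document doc = null;
        // Build the request; headers and timeout must be set before get() executes it.
        Connection conn = Jsoup.connect(url);
        conn.header("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"); // assumed browser-like User-Agent
        conn.timeout(5000); // assumed 5-second timeout
        try {
            doc = conn.get(); // send the request and parse the returned HTML
        } catch (SocketTimeoutException e) {
            System.out.println("Connection timed out: " + url);
        } catch (UnknownHostException e) {
            System.out.println("Unknown host: " + url);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return doc;
    }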
How does a web crawler extract webpage information?
You can use regular expressions or a third-party toolkit, for example HTML Parser or jsoup.
jsoup is recommended because its features are powerful. For more information, see
Zhidao.baidu.com/..273085
If you have any questions, please send me a private email.
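As a concrete illustration of the two options mentioned in the answer, here is a small sketch comparing regular-expression extraction with a jsoup selector; the sample HTML is made up for the example.

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class ExtractCompare {
    public static void main(String[] args) {
        String html = "<html><head><title>News</title></head>"
                + "<body><a href='/a.html'>First</a><a href='/b.html'>Second</a></body></html>";

        // Option 1: regular expression (fragile once the markup gets more complicated).
        Matcher m = Pattern.compile("<a href='([^']*)'>([^<]*)</a>").matcher(html);
        while (m.find()) {
            System.out.println(m.group(2) + " -> " + m.group(1));
        }

        // Option 2: jsoup selector (robust against attribute order, quoting, and whitespace).
        for (Element a : Jsoup.parse(html).select("a[href]")) {
            System.out.println(a.text() + " -> " + a.attr("href"));
        }
    }
}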