Introduction to the Web Crawler Framework jsoup

Source: Internet
Author: User


Preface: before I knew about the jsoup framework, a project required regularly capturing content from other websites, and my first thought was to use HttpClient to fetch the content of a specified site. That approach is clumsy: you send a URL request to the target site, then parse the returned text yourself based on that site's layout. To put it bluntly, HttpClient only plays the role of the browser; the returned text has to be processed by hand, usually with the String.indexOf or String.substring methods.

When I discovered the jsoup framework one day, I realized how clumsy the previous method was...

Jsoup is a Java HTML parser that can directly parse a URL address or HTML text content. It provides a very convenient API for retrieving and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

Main features of jsoup

1. Parse HTML from a URL, a file, or a string;
2. Find and extract data using DOM traversal or CSS selectors;
3. Manipulate HTML elements, attributes, and text.

Jsoup is released under the MIT license, so it can be safely used in commercial projects.

Jsoup usage

File input = new File("D:\\test.html");
Document doc = Jsoup.parse(input, "UTF-8", "url");
Elements links = doc.select("a[href]");          // links with an href attribute
Elements pngs = doc.select("img[src$=.png]");    // all elements referencing a PNG image
Element masthead = doc.select("div.masthead").first();

Does this look familiar? Yes, the usage is similar to JavaScript and jQuery, so if you know them you can pick up the jsoup API directly.
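A minimal, self-contained sketch of those jQuery-like selectors, parsing an HTML string so no file or network access is needed (the sample HTML fragment is invented for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorDemo {
    public static void main(String[] args) {
        // A small HTML fragment to parse; Jsoup.parse(String) needs no I/O.
        String html = "<div class=\"masthead\"><a href=\"/home\">Home</a>"
                + "<img src=\"logo.png\"></div>";
        Document doc = Jsoup.parse(html);

        Elements links = doc.select("a[href]");         // links with an href attribute
        Elements pngs = doc.select("img[src$=.png]");   // images whose src ends with .png
        Element masthead = doc.select("div.masthead").first();

        System.out.println(links.first().attr("href")); // /home
        System.out.println(pngs.size());                // 1
        System.out.println(masthead.tagName());         // div
    }
}
```

The selector strings are the same ones jQuery users already know, which is the main reason jsoup feels familiar.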


What can jsoup do?

1. CMS systems often use it to capture news (crawling);

2. Prevent XSS attacks. Cross-site scripting is abbreviated XSS rather than CSS to avoid confusion with Cascading Style Sheets;

3. Website attacks and damage (you must be familiar with the HTTP protocol).
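For the XSS-prevention use case, jsoup provides Jsoup.clean, which sanitizes untrusted HTML against a safelist of allowed tags. A minimal sketch (the sample input string is invented; note that in jsoup releases before 1.14 the Safelist class was named Whitelist):

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class SanitizeDemo {
    public static void main(String[] args) {
        // Untrusted user input containing a script-injection attempt.
        String dirty = "<p>Hello</p><script>alert('xss')</script>";
        // Safelist.basic() keeps simple text-formatting tags and strips everything else.
        String clean = Jsoup.clean(dirty, Safelist.basic());
        System.out.println(clean); // <p>Hello</p>
    }
}
```

Because the cleaner works from an explicit safelist rather than trying to blacklist dangerous tags, anything it does not recognize is removed, which is the safer default for user-submitted content.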


Recently, while writing a Java web crawler method with jsoup, I hit a problem: the method runs normally when executed from main, but calling the same method from an action does not.

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupTest {

    static String url = "www.sogou.com/..y?java webpage crawler &page=1";

    public static void main(String[] args) {
        Document doc = readUrlFist(url);
        write(doc);
    }

    public static void write(Document doc) {
        try {
            FileOutputStream fos = new FileOutputStream("C:\\Documents and Settings\\Administrator\\Desktop\\a.html");
            OutputStreamWriter osw = new OutputStreamWriter(fos);
            BufferedWriter bw = new BufferedWriter(osw);
            bw.write(doc.toString());
            bw.flush();
            // close the outermost writer last; it closes the wrapped streams
            bw.close();
            osw.close();
            fos.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static Document readUrlFist(String url) {
        Document doc = null;
        Connection conn = Jsoup.connect(url);
        conn.header(
            "User-Agent",
            ... // remaining full text truncated in the source
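One fragile spot in the write() method above is stream cleanup: closing the FileOutputStream before the writers that wrap it can cause a later close to fail on flush. A minimal sketch of a safer variant using try-with-resources — the helper name writeDocument, the temp-file path, and the sample HTML are invented for illustration, not from the original:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class WriteDemo {

    // Writes the document's HTML to a file; try-with-resources closes the
    // writer (and the stream underneath it) exactly once, in the right order.
    static void writeDocument(Document doc, Path out) throws IOException {
        try (BufferedWriter bw = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            bw.write(doc.outerHtml());
        }
    }

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.parse("<p>saved</p>");              // offline sample document
        Path out = Files.createTempFile("jsoup-demo", ".html");  // illustrative path
        writeDocument(doc, out);
        String html = new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        System.out.println(html.contains("saved")); // true
    }
}
```

With try-with-resources the writer is closed even if write() throws, which also removes the need for the manual flush/close sequence in the original.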

How does a web crawler extract webpage information?

You can use regular expressions or a third-party toolkit, for example HTML Parser or jsoup. Jsoup is recommended; it is powerful. For more information, see Zhidao.baidu.com/..273085
