Introduction to the Jsoup Web Crawler Framework

Source: Internet
Author: User

Preface: Before I came across the Jsoup framework, a project required me to crawl content from other sites on a regular schedule. My first idea was to use HttpClient to fetch the content of the target site. This approach is clumsy: you request a URL on the target site and then parse the text it returns. Put plainly, HttpClient only plays the role of a browser; the returned text has to be processed by hand, typically with String.indexOf or String.substring.

Then one day I discovered Jsoup, and immediately realized how clumsy the earlier approach had been.

Jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
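To make the contrast with the manual approach concrete, here is a minimal sketch that extracts a page title both ways. The class name, method names, and sample HTML are made up for illustration; only `Jsoup.parse(String)` and `Document.title()` come from the Jsoup library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TitleExtraction {

    // Manual approach: locate the tags with indexOf and cut with substring.
    // It only matches the exact lowercase text "<title>" and returns an
    // empty string for anything else (uppercase tags, attributes, ...).
    static String titleWithSubstring(String html) {
        String open = "<title>";
        int start = html.indexOf(open);
        int end = html.indexOf("</title>");
        if (start < 0 || end < start) {
            return "";
        }
        return html.substring(start + open.length(), end);
    }

    // Jsoup approach: parse the text into a document and ask for the title.
    static String titleWithJsoup(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    public static void main(String[] args) {
        // Hypothetical sample page, standing in for text returned by a site.
        String html = "<html><head><title>Sample page</title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(titleWithSubstring(html));
        System.out.println(titleWithJsoup(html));
    }
}
```

The string-based version breaks as soon as the markup deviates from the expected form, while the parser-based version tolerates such variations.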

Jsoup main functions

1. Parse the HTML from a URL, file, or string;
2. Use DOM traversal or CSS selectors to find and extract data;
3. Manipulate HTML elements, attributes, and text.

Jsoup is released under the MIT license and can be used with confidence in commercial projects.
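The three functions listed above can be sketched in a few lines. This is an illustrative example, not code from the original article: the class name, the helper methods, the sample HTML, and the example.com base URI are all assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class JsoupBasics {

    // Hypothetical sample markup used only for this sketch.
    static final String HTML =
            "<div class=\"masthead\"><h1>News</h1></div>"
          + "<a href=\"/a.html\">First</a>"
          + "<a href=\"http://other.example/b.html\">Second</a>"
          + "<a name=\"anchor\">No link</a>";

    // Functions 1 and 2: parse HTML from a string, then use a CSS selector
    // to extract every link. The base URI lets absUrl() resolve relative hrefs.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    // Function 3: manipulate attributes on the selected elements and
    // return the modified body markup.
    static String markLinksNofollow(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a[href]").attr("rel", "nofollow");
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(absoluteLinks(HTML, "http://example.com/"));
        System.out.println(markLinksNofollow(HTML));
    }
}
```

The `<a name="anchor">` element is skipped by the `a[href]` selector because it has no href attribute.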

Jsoup usage

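    import java.io.File;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // Fragment: place inside a method that handles or declares IOException.
    File input = new File("D:\\test.html");
    // "url" stands for the base URI used to resolve relative links.
    Document doc = Jsoup.parse(input, "UTF-8", "url");
    Elements links = doc.select("a[href]");                // links with an href attribute
    Elements pngs = doc.select("img[src$=.png]");          // all img elements referencing PNG pictures
    Element masthead = doc.select("div.masthead").first(); // first div with class "masthead"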

Does this feel familiar? It should: the usage is very close to JavaScript and jQuery, so a quick look at the Jsoup API is enough to start using it.


What can Jsoup do?

1. News crawling, which CMS systems often need (as a crawler)

2. Preventing XSS attacks. XSS stands for cross-site scripting; it is abbreviated XSS rather than CSS to avoid confusion with Cascading Style Sheets (CSS).

3. Attacking or disrupting websites (requires familiarity with the HTTP protocol)
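For the XSS-prevention use case in the list above, a minimal sketch might look like the following. It assumes jsoup 1.14.2 or later, where the allow-list class is named `Safelist` (older releases call it `Whitelist`); the class name, method name, and sample input are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class XssCleaning {

    // Keep only the simple text-formatting tags permitted by Safelist.basic()
    // and drop everything else, including script elements and event-handler
    // attributes such as onclick.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // Hypothetical user-submitted comment containing a script injection.
        String comment = "<p>Hello <b>world</b>"
                + "<script>alert('xss')</script>"
                + "<a href=\"http://example.com/\" onclick=\"steal()\">link</a></p>";
        System.out.println(sanitize(comment));
    }
}
```

Cleaning against an allow-list is generally safer than trying to strip known-bad tags, because anything not explicitly permitted is removed.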
