Using HttpClient and HtmlParser to Implement a Simple Crawler

Last Update:2017-02-27 Source: Internet

Author: User


This article introduces the open source HtmlParser and HttpClient packages and uses them to build a simple web crawler. It shows how to use HtmlParser to process web pages as needed, and how HttpClient simplifies GET and POST requests when building network applications.

Introduction to HttpClient and HtmlParser

This section briefly describes the two open source projects, HttpClient and HtmlParser, along with their websites and download locations.

HttpClient Introduction

HTTP is one of the most important protocols on the Internet today. Beyond web browsers, the growth of web services, web-based applications, and network computing keeps expanding its role, so more and more applications need HTTP support. Although the java.net package of the Java class library provides basic functionality for accessing network resources over HTTP, its flexibility and feature set fall short of what many applications need. The Jakarta Commons HttpClient component aims to provide more flexible and more efficient HTTP support, and so to simplify the development of HTTP-based applications.

HttpClient offers many features and supports the latest HTTP standards; more information is available on the HttpClient website. Many open source projects use the HTTP functionality that HttpClient provides, and a list of them can be found on the project site. This article uses the HttpClient class library to access and download web pages from the Internet, and the following sections describe in detail the two ways it can request network resources: GET requests and POST requests. Apache offers the HttpClient source code and JAR packages as free downloads, and the latest HttpClient components can be obtained from the project's download page. The author is using HttpClient 3.1.
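To make the comparison with the JDK concrete, the sketch below shows roughly what a GET and a form POST look like with only java.net.HttpURLConnection, which is the baseline that HttpClient is meant to simplify. It is an illustration, not code from the original article: the class name PlainHttpSketch, its method names, the 5-second timeouts, and the assumption that responses are UTF-8 text are all hypothetical choices. HttpClient 3.1 wraps the same steps in its GetMethod and PostMethod classes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Map;

// Hypothetical helper class: a JDK-only baseline, not the HttpClient API.
public class PlainHttpSketch {

    /** Encodes form fields as application/x-www-form-urlencoded text. */
    static String encodeForm(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : fields.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    /** Sends a GET request and returns the body (assumed to be UTF-8 text). */
    static String get(String url) throws IOException {
        HttpURLConnection conn = open(url);
        conn.setRequestMethod("GET");
        try {
            return readResponse(conn);
        } finally {
            conn.disconnect();
        }
    }

    /** Sends the fields as a form POST and returns the body (assumed UTF-8). */
    static String post(String url, Map<String, String> fields) throws IOException {
        byte[] body = encodeForm(fields).getBytes("UTF-8");
        HttpURLConnection conn = open(url);
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try {
            OutputStream out = conn.getOutputStream();
            try {
                out.write(body);
            } finally {
                out.close();
            }
            return readResponse(conn);
        } finally {
            conn.disconnect();
        }
    }

    private static HttpURLConnection open(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(5000); // assumed timeout values
        conn.setReadTimeout(5000);
        return conn;
    }

    private static String readResponse(HttpURLConnection conn) throws IOException {
        int status = conn.getResponseCode();
        if (status != HttpURLConnection.HTTP_OK) {
            throw new IOException("Unexpected HTTP status: " + status);
        }
        InputStream in = conn.getInputStream();
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return buf.toString("UTF-8");
        } finally {
            in.close();
        }
    }
}
```

Even this small sketch has to handle encoding, streams, status codes, and cleanup by hand, and it leaves out redirect control, cookies, and connection reuse. That manual work is the kind of detail a dedicated HTTP library takes over.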

HtmlParser Introduction

Today's Internet has hundreds of millions of pages, and more and more applications treat these pages as data to analyze and process. Most of them are semi-structured text with a large number of tags and nested structures. Developers who build applications that process web pages often consider writing their own web parser, which costs considerable effort and time. For Java developers, HtmlParser offers a powerful and flexible open source class library that greatly reduces the cost of writing one.

HtmlParser is an active open source project hosted on http://sourceforge.net. It provides two ways to parse web pages, linear and nested, and is used mainly for transforming HTML pages (transformation) and for extracting web content (extraction). Its easy-to-use features include filters, the visitor pattern (visitors), custom tag handling, and easy-to-use JavaBeans. As the HtmlParser home page says, it is a fast, robust, and rigorously tested component; its simplicity, its speed, and its ability to handle real web pages from the Internet are attracting more and more developers. This article uses HtmlParser to extract the links in a web page, which is the key part of the simple crawler. The latest version is HtmlParser 1.6; its source code, API reference documentation, and JAR packages can be downloaded from the project site.
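To give a feel for the link-extraction step that a crawler depends on, here is a deliberately naive JDK-only sketch that pulls href values out of anchor tags with a regular expression and resolves them against the page URL. It is a stand-in for illustration, not HtmlParser's API and not the method of the original article; the class name NaiveLinkExtractor and the method extractLinks are hypothetical. A regular expression like this misreads comments, scripts, unquoted attributes, and malformed markup, which is the sort of real-world HTML that a tested parser such as HtmlParser is built to handle.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class: a regex stand-in, not the HtmlParser library.
public class NaiveLinkExtractor {

    // Matches a quoted href attribute inside an <a ...> tag, ignoring case.
    // Group 1 is the quote character, group 2 is the attribute value.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /**
     * Returns the absolute URLs of the links found in the given HTML,
     * in document order. Relative links are resolved against baseUrl.
     */
    static List<String> extractLinks(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(2).trim();
            // Skip empty targets, in-page fragments, and script pseudo-links.
            if (href.isEmpty() || href.startsWith("#")
                    || href.toLowerCase().startsWith("javascript:")) {
                continue;
            }
            try {
                links.add(base.resolve(href).toString());
            } catch (IllegalArgumentException e) {
                // The href is not a syntactically valid URI; ignore it.
            }
        }
        return links;
    }
}
```

For example, calling extractLinks with the base URL http://example.com/docs/index.html on a page containing a link to page2.html yields http://example.com/docs/page2.html, which a crawler could then add to its queue of pages to visit.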

Setting Up the Development Environment

The development environment I use is Eclipse Europa, which can be downloaded free of charge from www.eclipse.org. The JDK is version 1.6, which can be downloaded from the www.java.sun.com site; after installing it, configure the environment variables in the operating system. Create a Java project in Eclipse and add the downloaded commons-httpclient-3.1.jar, htmllexer.jar, and htmlparser.jar files to the project's build path.

Figure 1. Setting up the development environment
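One quick way to confirm that the JAR files actually ended up on the build path is to try loading one well-known class from each of them. The sketch below is an optional sanity check under stated assumptions, not part of the original article: the class name BuildPathCheck and the method missing are hypothetical, and the class names suggested after the code are the usual entry points of the three JARs as the editor understands them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for checking the project's build path.
public class BuildPathCheck {

    /** Returns the names of the given classes that cannot be loaded. */
    static List<String> missing(String... classNames) {
        List<String> result = new ArrayList<String>();
        ClassLoader loader = BuildPathCheck.class.getClassLoader();
        for (String name : classNames) {
            try {
                // Load without initializing; only presence matters here.
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException e) {
                result.add(name);
            } catch (LinkageError e) {
                // The class was found but one of its dependencies was not.
                result.add(name);
            }
        }
        return result;
    }
}
```

Calling BuildPathCheck.missing("org.apache.commons.httpclient.HttpClient", "org.htmlparser.Parser", "org.htmlparser.lexer.Lexer") from a main method in the project should return an empty list once the three JARs are configured; any name it returns points to a JAR that is still missing.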

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
