Use HttpClient and HtmlParser for easy crawling

Source: Internet
Author: User

This article briefly introduces two open source projects, HttpClient and HtmlParser, and points to their websites and download addresses.

  HttpClient Introduction

HTTP is one of the most important protocols on the Internet today. Beyond Web browsers, Web services, network-based applications, and the growth of network computing keep expanding its role, so more and more applications need HTTP support. Although the java.net package of the Java class library provides basic functionality for accessing network resources over HTTP, its flexibility and capabilities fall short of what many applications need. The Jakarta Commons HttpClient component aims to provide more flexible and more efficient HTTP support and to simplify the development of HTTP-based applications. HttpClient offers many features and supports the latest HTTP standards; see the HttpClient website for details. Many open source projects use the HTTP functionality that HttpClient provides, and a list of them can be found on the project website.

This article uses the HttpClient class library to access and download Web pages from the Internet, and a later section details the two methods it provides for requesting network resources: GET requests and POST requests. Apache provides the HttpClient source code and JAR packages as free downloads, where you can get the latest HttpClient components. The author uses HttpClient 3.1.

  HtmlParser Introduction

Today's Internet holds hundreds of millions of pages, and more and more applications treat those pages as data to analyze and process. The pages are semi-structured text with many tags and nested structures. When developing an application that processes Web pages, one is tempted to write a dedicated Web page parser, which takes considerable time and effort. For Java developers, HtmlParser offers a powerful, flexible, and easy-to-use open source class library that saves the cost of writing such a parser. HtmlParser is an active open source project hosted on http://sourceforge.net. It can parse Web pages in two ways, linear and nested, and is used mainly for transforming HTML pages (transformation) and extracting Web content (extraction). Its easy-to-use features include filters, the visitor pattern, custom tag handling, and JavaBeans. As the HtmlParser home page says, it is a fast, robust, and rigorously tested component that attracts more and more developers with the simplicity of its design, its speed, and its ability to handle real Web pages found on the Internet.

This article uses HtmlParser to extract the links in Web pages, which is the key part of a simple crawler. The latest version at the time of writing is HtmlParser 1.6; its source code, API reference documentation, and JAR package can be downloaded from the project site.


  Simple and powerful StringBean

If you want the plain text of a page with all tags removed, use StringBean. The following simple code does this:

Listing 5

import org.htmlparser.beans.StringBean;

StringBean sb = new StringBean();
sb.setLinks(false); // Do not include link URLs in the extracted text
sb.setURL(url); // url is the address of the page whose tags should be stripped
System.out.println(sb.getStrings()); // Print the extracted text

HtmlParser provides a powerful class library for processing Web pages. Because this article is only a brief introduction, the examples cover just the key classes that the author's crawler relies on. Interested readers can explore the more powerful parts of the HtmlParser library on their own.

  Implementing a simple crawler

HttpClient provides convenient access over HTTP, so we can easily fetch the source of a Web page and save it locally. HtmlParser provides a simple and capable class library that makes it easy to extract the hyperlinks pointing from one Web page to others. The author combines these two open source packages to build a simple Web crawler.
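To make the link extraction step concrete, here is a rough sketch that pulls href values out of an HTML string. Note that it is a stand-in built on java.util.regex from the standard library, not HtmlParser's API; the class name LinkExtractSketch and the method extractLinks are hypothetical names chosen for this illustration. A regular expression cannot handle every real-world page the way a proper parser such as HtmlParser can, so treat this only as an illustration of the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; this is NOT HtmlParser's API.
class LinkExtractSketch {

    // Matches <a ... href="..."> or <a ... href='...'>, case-insensitively.
    // Group 1 is the quote character, group 2 is the link target.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the href values of all quoted anchor links, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com/a\">A</a> "
                + "<A class='x' HREF='/b'>B</A></p>";
        System.out.println(extractLinks(html));
    }
}
```

In a real crawler, the extracted links would also need to be resolved against the page's own URL and filtered before being queued for download.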

  Crawler principle

Readers who have studied data structures will be familiar with graphs. As shown in Figure 2, if you treat each Web page as a node and each link from one page to another as a directed edge between the corresponding nodes, the Web pages of the whole Internet can easily be modeled as a directed graph. In theory, traversing this graph with a graph traversal algorithm reaches almost every Web page on the Internet. The simplest traversals are breadth-first and depth-first. The simple crawler the author implements below uses a breadth-first crawl strategy.

Figure 2. A modeling diagram of Web page relationships
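The breadth-first strategy described above can be sketched on a small in-memory graph. This is a minimal illustration, not the author's crawler: the class name BreadthFirstSketch, the method crawlOrder, the maxPages limit, and the page names A to D are all assumptions made for the example, and the map of links stands in for the pages a real crawler would download and parse.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch: breadth-first traversal of a directed graph of pages.
class BreadthFirstSketch {

    // Returns pages in the order a breadth-first crawl would visit them.
    // linkGraph maps each page to the pages it links to (assumed input).
    static List<String> crawlOrder(String seed,
                                   Map<String, List<String>> linkGraph,
                                   int maxPages) {
        Set<String> seen = new HashSet<>();        // pages already queued
        Queue<String> frontier = new ArrayDeque<>(); // FIFO queue gives breadth-first order
        List<String> order = new ArrayList<>();

        seen.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty() && order.size() < maxPages) {
            String page = frontier.remove();
            order.add(page);
            List<String> outgoing =
                    linkGraph.getOrDefault(page, Collections.<String>emptyList());
            for (String next : outgoing) {
                // Set.add returns false for pages seen before, which avoids loops.
                if (seen.add(next)) {
                    frontier.add(next);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> linkGraph = new HashMap<>();
        linkGraph.put("A", Arrays.asList("B", "C"));
        linkGraph.put("B", Arrays.asList("D"));
        linkGraph.put("C", Arrays.asList("D", "A"));
        System.out.println(crawlOrder("A", linkGraph, 10)); // prints [A, B, C, D]
    }
}
```

The queue is what makes the traversal breadth-first: all pages one link away from the seed are visited before any page two links away. Replacing the queue with a stack would give a depth-first crawl instead.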

Simple Crawler Implementation Process

Before looking at the implementation code for the simple crawler, let us first walk through the flow of crawling Web pages.
