Introduction to Web Crawlers (I)

Source: Internet
Author: User

Over the winter vacation I started learning to write some simple crawlers, hoping to do something meaningful with them.

First, here is how a Baidu search defines a crawler:

Web crawler: a web crawler (also known as a web spider or web robot, and within the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Less common names include ant, automatic indexer, emulator, and worm.

As I understand it, the simplest crawler fetches the content of a web page and then does something with it. So the first step is to get a web page.

So how do we get a web page? First we need to know its URL. Taking Baidu as an example, the URL is http://www.baidu.com/. In Java, the URL class represents a URL.
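As a minimal sketch of what the URL class offers, the following splits an address into its parts without touching the network. The class name UrlParts and the helper describe are made up for this illustration; only java.net.URL and its getters come from the JDK.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlParts {
    // Hypothetical helper: summarizes the protocol, host, and path of an address.
    static String describe(String address) throws MalformedURLException {
        URL url = new URL(address); // throws MalformedURLException for an unknown or missing protocol
        return url.getProtocol() + " | " + url.getHost() + " | " + url.getPath();
    }

    public static void main(String[] args) throws MalformedURLException {
        // prints "http | www.baidu.com | /"
        System.out.println(describe("http://www.baidu.com/"));
    }
}
```

Constructing the URL object only parses the text; no connection is opened until the stream is requested.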

The URL can then be treated like a file: the resource it points to is read as an input stream.

After reading the resource behind this URL, we can display it.

First, look at the code:

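    import java.io.IOException;
    import java.io.InputStream;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.Scanner;

    public class TESTSP1 {
        public static void main(String[] args) {
            try {
                URL pageUrl = new URL("http://www.baidu.com");
                // Open the URL as an input stream, as if reading a file
                InputStream input = pageUrl.openStream();
                Scanner scanner = new Scanner(input, "UTF-8");
                // "\\A" matches only the start of the input, so next() returns the whole page
                String text = scanner.useDelimiter("\\A").next();
                System.out.println(text);
            } catch (MalformedURLException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }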
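The Scanner call in this program is meant to read the whole stream in one go: the pattern \A matches only the very beginning of the input, so when it is used as the delimiter the first token is everything in the stream. Here is a minimal sketch of that idea, with an in-memory stream standing in for the network one. The names StreamText and readAll are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class StreamText {
    // Hypothetical helper: reads an entire stream into one String.
    static String readAll(InputStream input) {
        try (Scanner scanner = new Scanner(input, "UTF-8")) {
            // "\\A" anchors at the start of the input, so no further delimiter
            // is found and next() returns the whole content as a single token.
            scanner.useDelimiter("\\A");
            // An empty stream has no token at all, so guard with hasNext().
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        byte[] page = "<html>\n<body>hello</body>\n</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(page)));
    }
}
```

The same helper would accept the stream returned by openStream(), since it only depends on InputStream.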

The program first visits the Baidu page, reads the content of the stream, and prints it to the console.

So what is this string of code? We can open the Baidu page in a browser and view its source. I used Google Chrome to visit Baidu and view the page source.

Comparing the two, we find they are the same. In other words, the page we read is simply an HTML file: the program reads the HTML file from the server and prints it to the console.

This is the first step of a crawler. In practice, however, real networks are reportedly quite complicated, and working directly with the java.net API takes a lot of effort, so actual development usually relies on ready-made open-source packages. One of them is the HTTP client jar from Apache, called HttpClient.jar.
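To give a sense of that effort, here is a rough sketch of preparing a request with java.net alone: the method, timeouts, and headers all have to be set by hand. The class name ManualRequest, the helper prepare, the timeout values, and the User-Agent string are assumptions made for this illustration. Nothing is sent over the network until the connection is actually used.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class ManualRequest {
    // Hypothetical helper: prepares (but does not send) a GET request.
    static HttpURLConnection prepare(String address) throws IOException {
        // openConnection() only creates the connection object; it does not connect yet
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5000); // milliseconds; assumed value
        conn.setReadTimeout(5000);    // milliseconds; assumed value
        conn.setRequestProperty("User-Agent", "simple-crawler-demo"); // made-up agent string
        conn.setInstanceFollowRedirects(true);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection conn = prepare("http://www.baidu.com/");
        System.out.println(conn.getRequestMethod() + " " + conn.getURL());
    }
}
```

Redirect handling, cookies, and connection reuse would all need further manual code, which is the kind of work a client library takes over.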

So how do we use HttpClient to fetch the Baidu page?

First, two jar packages are required.

The second one, the logging package, is mandatory: without it the program does not run properly. With both in place, the code changes to this:

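    import java.io.IOException;

    // These imports assume Apache Commons HttpClient 3.x
    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.methods.GetMethod;

    public class TESTSP2 {
        public static void main(String[] args) {
            HttpClient httpClient = new HttpClient();
            GetMethod get = new GetMethod("http://www.baidu.com/");
            try {
                // executeMethod sends the request and returns the HTTP status code
                int num = httpClient.executeMethod(get);
                System.out.println(num);
            } catch (IOException e) {
                e.printStackTrace();
            }

            try {
                // Print the response body, i.e. the HTML of the page
                System.out.println(get.getResponseBodyAsString());
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                // Return the connection once the body has been read
                get.releaseConnection();
            }
        }
    }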

The variable num holds the HTTP status code, which needs little explanation here: if the request succeeds, the program prints 200. After that it prints the content of the page.

As expected, the status code printed is 200, meaning the request succeeded, followed by the content of the HTML page.
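A crawler would normally check this status code before using the body. The following is a minimal sketch, assuming the standard HTTP status classes (2xx success, 3xx redirect, 4xx client error, 5xx server error). The class name StatusCheck and the helper describe are hypothetical and not part of HttpClient.

```java
import java.net.HttpURLConnection;

class StatusCheck {
    // Hypothetical helper: maps an HTTP status code to a coarse category.
    static String describe(int status) {
        if (status == HttpURLConnection.HTTP_OK) return "ok";       // exactly 200
        if (status >= 200 && status < 300) return "success";        // other 2xx codes
        if (status >= 300 && status < 400) return "redirect";
        if (status >= 400 && status < 500) return "client error";
        if (status >= 500 && status < 600) return "server error";
        return "other";
    }

    public static void main(String[] args) {
        // prints "ok", then "client error"
        System.out.println(describe(200));
        System.out.println(describe(404));
    }
}
```

In the program above, the value of num could be passed to such a helper to decide whether the response body is worth reading.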
