Java Simple Web Crawl

Source: Internet
Author: User

Background introduction

A brief introduction to TCP

1. TCP provides point-to-point transmission across the network.

2. Transmission happens through ports and sockets.

Ports distinguish different kinds of traffic (for example, HTTP uses port 80).

1) A socket can be bound to a specific port and provides the actual transfer capability.

2) A single port can be shared by multiple sockets.
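To make the port/socket relationship concrete, here is a minimal sketch that opens a plain TCP socket to port 80 of a host and sends a hand-written HTTP request. The host name example.com is only a placeholder for demonstration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class RawHttpSocket {
    public static void main(String[] args) throws Exception {
        // Connect a socket to port 80 (the conventional HTTP port) of the host.
        try (Socket socket = new Socket("example.com", 80);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            // Send a minimal HTTP/1.1 request by hand.
            out.print("GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n");
            out.flush();
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}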

An introduction to URLs

A URL is a concise representation of where a resource lives on the Internet and how it can be accessed; it is the standard way of addressing resources on the Internet.

Each file on the Internet has a unique URL that contains information that indicates the location of the file and how the browser should handle it.

In short, crawling a Web page essentially means fetching the content behind its URL.
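To see what a URL contains, here is a small sketch that uses java.net.URL to break the Oracle tutorial address (the same page crawled later in this article) into protocol, host, port, and path.

import java.net.MalformedURLException;
import java.net.URL;

public class URLParts {
    public static void main(String[] args) throws MalformedURLException {
        URL u = new URL("http://docs.oracle.com/javase/tutorial/networking/urls/");
        System.out.println("protocol: " + u.getProtocol()); // http
        System.out.println("host:     " + u.getHost());     // docs.oracle.com
        System.out.println("port:     " + u.getPort());     // -1, meaning the protocol default (80)
        System.out.println("path:     " + u.getPath());     // /javase/tutorial/networking/urls/
    }
}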

Java provides two ways to do this:

One is to read the Web page directly from the URL (a minimal sketch of this approach follows below).

The other is to read the Web page through a URLConnection.

The difference is that URLConnection is an HTTP-centric class: it provides many functions that are useful when working with HTTP connections.
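As a minimal sketch of the first approach, the snippet below reads a page directly from the URL with openStream(); the target address is the same Oracle tutorial page used in the full example later.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class ReadFromURL {
    public static void main(String[] args) throws Exception {
        URL u = new URL("http://docs.oracle.com/javase/tutorial/networking/urls/");
        // openStream() is shorthand for openConnection().getInputStream().
        try (BufferedReader in = new BufferedReader(new InputStreamReader(u.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}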

This article gives example code based on URLConnection.

Let's first look at the exceptions these classes can throw. If you are not familiar with Java's exception mechanism, please read a blog post on that topic first.

Constructing a URL can throw MalformedURLException: the URL string is null or specifies an unrecognized protocol.

Building a URLConnection can throw IOException if openConnection() fails. Note that openConnection() does not actually connect to the remote host; it only prepares the connection.
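To make that distinction concrete, here is a small sketch (again against the Oracle tutorial URL) showing that openConnection() only creates the local connection object, and the network is touched only when connect() is called or a response is requested.

import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

public class OpenVsConnect {
    public static void main(String[] args) throws IOException {
        URL u = new URL("http://docs.oracle.com/javase/tutorial/networking/urls/");
        // openConnection() only creates the local connection object; no packets are sent yet.
        URLConnection connection = u.openConnection();
        connection.setConnectTimeout(5000); // settings must be made before connecting
        // connect() (or the first read of headers/content) performs the actual network connection.
        connection.connect();
        System.out.println("Content type: " + connection.getContentType());
    }
}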

The final code

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

public class SimpleNetSpider {
    public static void main(String[] args) {
        try {
            URL u = new URL("http://docs.oracle.com/javase/tutorial/networking/urls/");
            // Prepare the connection and cast it so the HTTP response code can be checked.
            URLConnection connection = u.openConnection();
            HttpURLConnection htCon = (HttpURLConnection) connection;
            int code = htCon.getResponseCode();
            if (code == HttpURLConnection.HTTP_OK) {
                System.out.println("Find the website");
                // Read the page line by line and print it.
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(htCon.getInputStream()));
                String inputLine;
                while ((inputLine = in.readLine()) != null)
                    System.out.println(inputLine);
                in.close();
            } else {
                System.out.println("Can not access the website");
            }
        } catch (MalformedURLException e) {
            System.out.println("Wrong URL");
        } catch (IOException e) {
            System.out.println("Can Not Connect");
        }
    }
}

Reference documents:

http://docs.oracle.com/javase/tutorial/networking/urls/index.html

