How to implement a simple web page crawler in Java


This article describes how to implement a simple web page crawler in Java. It is shared here for your reference; the analysis is as follows:

Background information

1. A brief introduction to TCP

1) TCP provides point-to-point transmission across the network.

2) Transmission takes place through ports and sockets, as the sketch after this list illustrates:

Ports identify different types of service (for example, HTTP uses port 80).

A socket can be bound to a specific port and provides the actual transmission capability.

A single port can be connected to by multiple sockets.
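As a rough illustration of ports and sockets in Java (this sketch is not from the original article; the host name is just an example), a Socket connects to a specific port on a remote host, and HTTP servers conventionally listen on port 80:

import java.io.IOException;
import java.net.Socket;

public class PortAndSocket {
    public static void main(String[] args) {
        // Open a socket to port 80 (the conventional HTTP port) of an example host
        try (Socket socket = new Socket("docs.oracle.com", 80)) {
            System.out.println("Connected from local port " + socket.getLocalPort()
                    + " to remote port " + socket.getPort());
        } catch (IOException e) {
            System.out.println("Can not connect");
        }
    }
}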

2. An introduction to URLs

A URL is a concise representation of the location of a resource on the Internet and of how to access it; it is the standard form of an Internet address.

Each file on the Internet has a unique URL, which contains information about where the file is located and how a browser should handle it.

In short, crawling a web page essentially means fetching the page's content through its URL.
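As a small illustration (not part of the original article; the URL is just an example), the java.net.URL class exposes the pieces of location information a URL carries:

import java.net.MalformedURLException;
import java.net.URL;

public class UrlParts {
    public static void main(String[] args) throws MalformedURLException {
        URL u = new URL("http://docs.oracle.com/javase/tutorial/networking/urls/index.html");
        System.out.println("protocol: " + u.getProtocol()); // http
        System.out.println("host: " + u.getHost());         // docs.oracle.com
        System.out.println("port: " + u.getPort());         // -1 (no explicit port in this URL)
        System.out.println("path: " + u.getPath());         // /javase/tutorial/networking/urls/index.html
    }
}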

Java provides two ways to do this:

One is to read the page directly from a URL object (a short sketch of this approach follows below).

The other is to read the page through a URLConnection.

URLConnection (together with its HTTP subclass HttpURLConnection) is the core class here and provides many functions for working with HTTP connections.
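Before the URLConnection example below, here is a minimal sketch of the first approach, assuming the same Oracle tutorial URL as in the main example; URL.openStream() is simply shorthand for openConnection().getInputStream().

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class DirectUrlReader {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://docs.oracle.com/javase/tutorial/networking/urls/");
        // Read the page directly from the URL and print it line by line
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}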

This article gives example code based on URLConnection.

Next, let's look at the exceptions these classes can throw. If you are not familiar with Java's exception mechanism, see a separate blog post on that topic.

Constructing a URL throws a MalformedURLException when the URL string is null or specifies an unrecognized protocol.

Creating a URLConnection with openConnection() throws an IOException when the call fails. Note that openConnection() does not actually connect to the remote host; it only prepares the connection.
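A minimal sketch of that distinction (using the same example URL as the full code below): openConnection() only creates the connection object, while connect() is what actually reaches the remote host and can therefore fail with an IOException.

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

public class ConnectionTiming {
    public static void main(String[] args) {
        try {
            // Throws MalformedURLException if the string is null or the protocol is unrecognized
            URL u = new URL("http://docs.oracle.com/javase/tutorial/networking/urls/");
            // No network traffic yet: openConnection() only prepares the connection object
            URLConnection connection = u.openConnection();
            // The actual connection to the remote host happens here (or on the first read)
            connection.connect();
            System.out.println("Connected");
        } catch (MalformedURLException e) {
            System.out.println("Wrong URL");
        } catch (IOException e) {
            System.out.println("Can not connect");
        }
    }
}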

Putting it all together, the final code is as follows:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

public class SimpleNetSpider {

    public static void main(String[] args) {
        try {
            URL u = new URL("http://docs.oracle.com/javase/tutorial/networking/urls/");
            URLConnection connection = u.openConnection();
            HttpURLConnection htCon = (HttpURLConnection) connection;
            int code = htCon.getResponseCode();
            if (code == HttpURLConnection.HTTP_OK) {
                System.out.println("Find the website");
                // Read the page line by line and print it
                BufferedReader in = new BufferedReader(new InputStreamReader(htCon.getInputStream()));
                String inputLine;
                while ((inputLine = in.readLine()) != null) {
                    System.out.println(inputLine);
                }
                in.close();
            } else {
                System.out.println("Can not access the website");
            }
        } catch (MalformedURLException e) {
            System.out.println("Wrong URL");
        } catch (IOException e) {
            System.out.println("Can not connect");
        }
    }
}

I hope this article helps you with your Java programming.
