This article describes a simple way to crawl a web page in Java. It is shared here for your reference. The details are as follows:
Background information
A brief introduction to TCP
1 TCP provides point-to-point transmission over a network
2 Transmission takes place through ports and sockets
A port identifies a particular kind of service (for example, HTTP uses port 80 by default)
1 A socket can be bound to a specific port and provides the transport capability
2 Several sockets can be connected through the same port
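The relationship between ports and sockets can be checked with a small local experiment. The following is only a sketch: the class name PortSocketDemo and the method twoSocketsOnePort are made up for illustration, and it uses the loopback address so that no real network is needed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class, not part of the original article.
public class PortSocketDemo {

    // Opens one listening port, connects two clients to it, and returns true
    // if both accepted sockets use the same local port as the server.
    public static boolean twoSocketsOnePort() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the operating system for any free port; the backlog of 50 is arbitrary.
        try (ServerSocket server = new ServerSocket(0, 50, loopback)) {
            int port = server.getLocalPort();
            try (Socket clientA = new Socket(loopback, port);
                 Socket clientB = new Socket(loopback, port);
                 Socket acceptedA = server.accept();
                 Socket acceptedB = server.accept()) {
                // Both server-side sockets share the listening port,
                // while each client gets its own local port.
                return acceptedA.getLocalPort() == port
                        && acceptedB.getLocalPort() == port
                        && clientA.getLocalPort() != clientB.getLocalPort();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("two sockets on one port: " + twoSocketsOnePort());
    }
}
```

This is also how a web server can talk to many browsers at once through port 80.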
A brief introduction to URLs
A URL (Uniform Resource Locator) is a concise representation of where a resource on the Internet is located and how to access it. It is the standard address of a resource on the Internet.
Each file on the Internet has a URL that contains information about where the file is located and how the browser should handle it (for example, which protocol to use).
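To make the parts of a URL concrete, here is a small sketch that splits one with java.net.URL. The class name UrlPartsDemo and the method describe are made up for illustration. The URL is the one used in the final code of this article.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class, not part of the original article.
public class UrlPartsDemo {

    // Returns "protocol | host | default port | path" for the given URL string.
    public static String describe(String spec) {
        try {
            URL url = new URL(spec);
            // getDefaultPort() gives the protocol's standard port (80 for http);
            // getPort() would return -1 here because the URL names no port.
            return url.getProtocol() + " | " + url.getHost() + " | "
                    + url.getDefaultPort() + " | " + url.getPath();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // prints: http | docs.oracle.com | 80 | /javase/tutorial/networking/urls/
        System.out.println(describe("http://docs.oracle.com/javase/tutorial/networking/urls/"));
    }
}
```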
In short, crawling the content of a page essentially means fetching that content through the page's URL.
Java provides two ways to do this:
One is to read the page directly from the URL
The other is to read the page through a URLConnection
URLConnection represents a connection to a URL. Its subclass HttpURLConnection adds HTTP-specific functions, such as reading the response code
This article gives example code based on URLConnection.
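For comparison, the first method (reading directly from the URL) might look like the sketch below, which uses URL.openStream(). The class name DirectUrlRead and its methods are made up for illustration, the UTF-8 charset is an assumption, and the demo reads a temporary file through a file: URL so that it runs without a network. An http: URL is read the same way.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of the original article.
public class DirectUrlRead {

    // Reads everything behind the URL line by line via URL.openStream().
    // UTF-8 is assumed here; a real page may declare another charset.
    public static String readAll(URL url) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Offline demo: a temporary file stands in for a web page.
    public static String demo() {
        try {
            Path page = Files.createTempFile("page", ".html");
            try {
                Files.write(page, "<html>hello</html>\n".getBytes(StandardCharsets.UTF_8));
                return readAll(page.toUri().toURL());
            } finally {
                Files.deleteIfExists(page);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

This way is shorter, but it gives no access to the HTTP response code, which is one reason to use URLConnection instead.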
Let's first look at the exceptions involved in working with a URL. If you don't understand the Java exception mechanism, see a blog post on that topic first.
Constructing a URL throws MalformedURLException when the URL string is null, specifies no protocol (an empty string, for example), or uses an unrecognized protocol.
Creating the URLConnection throws IOException when openConnection() fails. Note that openConnection() does not actually connect to the remote host. It only prepares for the connection.
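The conditions for MalformedURLException can be tried out with a short sketch. The class name MalformedUrlDemo and the method isWellFormed are made up for illustration, and foo is simply a protocol name that Java does not know.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class, not part of the original article.
public class MalformedUrlDemo {

    // Returns true if the string can be turned into a URL object.
    // Only the form of the string is checked; nothing is fetched.
    public static boolean isWellFormed(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            System.out.println("rejected \"" + spec + "\": " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        isWellFormed("http://docs.oracle.com/javase/tutorial/networking/urls/"); // accepted
        isWellFormed("docs.oracle.com");        // no protocol
        isWellFormed("foo://docs.oracle.com");  // unrecognized protocol
        isWellFormed(null);                     // null string
    }
}
```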
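A small sketch can show that openConnection() only prepares the connection. The class name LazyConnectDemo, its methods, the host name host.invalid, and the 2000 ms timeout are all assumptions made for illustration. The .invalid top-level domain is reserved and is not expected to resolve, so a real connection attempt should fail while the preparation step still succeeds.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical demo class, not part of the original article.
public class LazyConnectDemo {

    // Returns true if openConnection() hands back a connection object.
    // No network traffic is needed for this step.
    public static boolean canPrepare(String spec) {
        try {
            URLConnection c = new URL(spec).openConnection();
            return c != null;
        } catch (IOException e) { // also covers MalformedURLException
            return false;
        }
    }

    // Returns true only if the remote host can actually be reached.
    public static boolean canConnect(String spec) {
        try {
            URLConnection c = new URL(spec).openConnection();
            c.setConnectTimeout(2000); // assumed value, in milliseconds
            c.connect();               // the real connection attempt happens here
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String spec = "http://host.invalid/"; // assumed unreachable host
        System.out.println("prepared:  " + canPrepare(spec));
        System.out.println("connected: " + canConnect(spec));
    }
}
```

In the final code of this article the real connection is made implicitly, when getResponseCode() is called.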
To sum up, the final code is as follows:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

public class SimpleNetSpider {
    public static void main(String[] args) {
        try {
            URL u = new URL("http://docs.oracle.com/javase/tutorial/networking/urls/");
            // openConnection() only prepares the connection
            URLConnection connection = u.openConnection();
            HttpURLConnection htCon = (HttpURLConnection) connection;
            // getResponseCode() connects to the server and sends the request
            int code = htCon.getResponseCode();
            if (code == HttpURLConnection.HTTP_OK)
            {
                System.out.println("Find the website");
                BufferedReader in = new BufferedReader(new InputStreamReader(htCon.getInputStream()));
                String inputLine;
                // print the page line by line
                while ((inputLine = in.readLine()) != null)
                    System.out.println(inputLine);
                in.close();
            }
            else
            {
                System.out.println("Can not access the website");
            }
        }
        catch (MalformedURLException e)
        {
            System.out.println("wrong URL");
        }
        catch (IOException e)
        {
            System.out.println("Can not Connect");
        }
    }
}
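The code above leaves the reader open if an exception is thrown while reading, decodes the page with the platform default charset, and sets no timeouts. A more defensive variant might look like the sketch below. The class name SaferFetch, the 5000 ms timeouts, and the UTF-8 charset are assumptions made for illustration. The demo serves one page from a throwaway local server (the JDK's com.sun.net.httpserver.HttpServer) so that it runs without Internet access.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original article.
public class SaferFetch {

    // Returns the response body, or null when the status is not 200 OK.
    public static String fetch(String spec) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(spec).openConnection();
        con.setConnectTimeout(5000); // assumed values, in milliseconds
        con.setReadTimeout(5000);
        try {
            if (con.getResponseCode() != HttpURLConnection.HTTP_OK) {
                return null;
            }
            StringBuilder body = new StringBuilder();
            // try-with-resources closes the reader even if reading fails;
            // UTF-8 is an assumption about the page's encoding.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
            }
            return body.toString();
        } finally {
            con.disconnect();
        }
    }

    // Offline demo: starts a local server on a free port and fetches one page from it.
    public static String demo() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                byte[] page = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, page.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(page);
                }
            });
            server.start();
            try {
                return fetch("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Returning the page as a String instead of printing it also makes it easier to pass the content on to later steps, such as extracting links.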
I hope this article will help you with your Java programming.