Background
A brief introduction to TCP
1 TCP implements point-to-point transmission over the network
2 Transmission goes through ports and sockets
A port identifies a particular type of service (for example, HTTP uses port 80 by default)
1) A socket can be bound to a specific port and provides the transfer capability
2) One port can serve multiple sockets
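The relationship between a port and its sockets can be sketched on the loopback interface. This is only an illustration under stated assumptions: the class name PortAndSocketDemo and the method connectTwoClients are hypothetical, and the port is whatever free port the operating system hands out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class; the names are chosen for illustration only.
class PortAndSocketDemo {

    /**
     * Binds a server socket to a free loopback port, connects two clients
     * to it, and returns how many distinct connections that one port served.
     */
    static int connectTwoClients() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the operating system for any free port.
        try (ServerSocket server = new ServerSocket(0, 50, loopback)) {
            int port = server.getLocalPort();
            try (Socket clientA = new Socket(loopback, port);
                 Socket clientB = new Socket(loopback, port);
                 Socket acceptedA = server.accept();
                 Socket acceptedB = server.accept()) {
                // Both accepted sockets sit on the same local port...
                boolean samePort = acceptedA.getLocalPort() == port
                        && acceptedB.getLocalPort() == port;
                // ...but each one talks to a different remote endpoint.
                boolean distinctPeers = acceptedA.getPort() != acceptedB.getPort();
                return (samePort && distinctPeers) ? 2 : 0;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("connections on one port: " + connectTwoClients());
    }
}
```

Each accepted socket is identified by the pair of local and remote endpoints, which is why a single listening port can carry many connections at once.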
A brief introduction to URLs
A URL is a concise representation of where a resource is located on the Internet and how it can be accessed; it is the standard address of a resource on the Internet.
Each file on the Internet has a unique URL, which indicates where the file is located and how the browser should handle it.
In short, crawling the content of a Web page essentially means fetching the content that its URL points to.
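To make the "where and how" parts of a URL concrete, here is a small sketch that splits a URL into its components with java.net.URL. The class name UrlPartsDemo and the method describe are hypothetical helpers, not part of the original code.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class; the names are chosen for illustration only.
class UrlPartsDemo {

    /** Returns "protocol | host | default port | path" for the given URL string. */
    static String describe(String spec) {
        try {
            URL url = new URL(spec);
            // getDefaultPort() reports the protocol's standard port (80 for http).
            return url.getProtocol() + " | " + url.getHost() + " | "
                    + url.getDefaultPort() + " | " + url.getPath();
        } catch (MalformedURLException e) {
            return "malformed: " + spec;
        }
    }

    public static void main(String[] args) {
        // prints: http | docs.oracle.com | 80 | /javase/tutorial/networking/urls/index.html
        System.out.println(describe(
                "http://docs.oracle.com/javase/tutorial/networking/urls/index.html"));
    }
}
```

The protocol says how to access the resource, while the host and path say where it is.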
Java provides two ways to do this:
One is to read the Web page directly from the URL
The other is to read the Web page through URLConnection
The difference is that URLConnection is an HTTP-centric class: many of its methods are useful only when working with HTTP URLs, and its subclass HttpURLConnection exposes HTTP details such as the response code
This article gives example code based on URLConnection.
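For comparison, the first way (reading directly from the URL) might look like the sketch below. It relies on URL.openStream(); to stay runnable without network access it reads a temporary file through a file: URL, which is an assumption made purely for the demo. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; the names are chosen for illustration only.
class DirectUrlReadDemo {

    /** Reads everything the URL points to, line by line, assuming UTF-8 text. */
    static String readAll(URL url) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    /** Writes a tiny page to a temporary file and reads it back via a file: URL. */
    static String readSamplePage() {
        try {
            Path page = Files.createTempFile("sample-page", ".html");
            try {
                Files.write(page, "<html>hello</html>".getBytes(StandardCharsets.UTF_8));
                return readAll(page.toUri().toURL());
            } finally {
                Files.deleteIfExists(page);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(readSamplePage());
    }
}
```

The same readAll call should work with an http: URL, but this route gives no access to the HTTP response code, which is one reason to prefer URLConnection for crawling.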
Let's first look at the exceptions related to URL. If you are not familiar with the Java exception mechanism, please read a blog post on it first.
Constructing a URL may throw MalformedURLException: the URL string is null or uses an unrecognized protocol
Creating a URLConnection may throw IOException: openConnection failed. Note that openConnection does not actually connect to the remote host; it only prepares the connection
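Both points can be checked with a short sketch. It assumes that a loopback port which was just released is still closed when the connection is attempted, so the failure shows up at connect() rather than at openConnection(). The class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.ServerSocket;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical demo class; the names are chosen for illustration only.
class UrlExceptionDemo {

    /** Returns true if constructing a URL from spec throws MalformedURLException. */
    static boolean isMalformed(String spec) {
        try {
            new URL(spec);
            return false;
        } catch (MalformedURLException e) {
            return true;
        }
    }

    /** Reports which step fails when nothing listens on the target port. */
    static String firstFailingStep() {
        String step = "setup";
        try {
            int closedPort;
            // Grab a free port, then release it so that nothing listens there.
            try (ServerSocket probe = new ServerSocket(0)) {
                closedPort = probe.getLocalPort();
            }
            step = "openConnection";
            // No network traffic happens here; the object is only prepared.
            URLConnection connection =
                    new URL("http://127.0.0.1:" + closedPort + "/").openConnection();
            step = "connect";
            connection.setConnectTimeout(2000);
            connection.connect(); // the real connection attempt
            return "none";
        } catch (IOException e) {
            return step;
        }
    }

    public static void main(String[] args) {
        System.out.println(isMalformed("foo://example.com/"));
        System.out.println(firstFailingStep());
    }
}
```

In the final code the connection is opened implicitly by getResponseCode(), so a network failure surfaces there as an IOException.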
The final code
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

public class SimpleNetSpider {
    public static void main(String[] args) {
        try {
            URL u = new URL("http://docs.oracle.com/javase/tutorial/networking/urls/");
            // openConnection only prepares the connection; nothing is sent yet
            URLConnection connection = u.openConnection();
            HttpURLConnection htCon = (HttpURLConnection) connection;
            // getResponseCode connects to the server and reads the status line
            int code = htCon.getResponseCode();
            if (code == HttpURLConnection.HTTP_OK) {
                System.out.println("Find the website");
                BufferedReader in = new BufferedReader(new InputStreamReader(htCon.getInputStream()));
                String inputLine;
                // print the page line by line
                while ((inputLine = in.readLine()) != null)
                    System.out.println(inputLine);
                in.close();
            } else {
                System.out.println("Can not access the website");
            }
        } catch (MalformedURLException e) {
            System.out.println("Wrong URL");
        } catch (IOException e) {
            System.out.println("Can Not Connect");
        }
    }
}
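To try the same steps without depending on an external site, one option is to crawl a throwaway server on the local machine. The sketch below is only an illustration: it uses the JDK's built-in com.sun.net.httpserver.HttpServer, and the class name LocalSpiderDemo, the paths /page and /missing, and the page body are all hypothetical.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are chosen for illustration only.
class LocalSpiderDemo {

    /** Returns the page body on HTTP 200, otherwise "HTTP <code>". */
    static String fetch(URL url) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setConnectTimeout(2000);
        connection.setReadTimeout(2000);
        try {
            int code = connection.getResponseCode();
            if (code != HttpURLConnection.HTTP_OK) {
                return "HTTP " + code;
            }
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
            }
            return body.toString();
        } finally {
            connection.disconnect();
        }
    }

    /** Starts a local server, fetches an existing and a missing page, then stops it. */
    static String[] crawlLocalServer() {
        try {
            // Port 0 lets the operating system choose a free port.
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/page", exchange -> {
                byte[] page = "hello spider\n".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, page.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(page);
                }
            });
            server.start();
            try {
                String base = "http://127.0.0.1:" + server.getAddress().getPort();
                return new String[] {
                    fetch(new URL(base + "/page")),
                    fetch(new URL(base + "/missing"))
                };
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String result : crawlLocalServer()) {
            System.out.println(result.trim());
        }
    }
}
```

The first request exercises the HTTP_OK branch and the second the failure branch, since the local server answers unknown paths with 404.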
Reference documents:
http://docs.oracle.com/javase/tutorial/networking/urls/index.html