Writing a Zhihu Crawler in Java with Zero Basics: Practice on the Baidu Homepage First


In the last installment we talked about why we want to use Java to build a Zhihu crawler, so this time let's look at how to fetch the content of a web page with code.

First of all, if you have no experience with HTML, CSS, JS, or Ajax, I suggest heading over to the W3C tutorials (click me, click me) to get a basic understanding.

When it comes to fetching HTML, there is the question of GET access versus POST access.

If you are hazy on that distinction, you can read the article "GET vs. POST".

Aha, I won't repeat it here; a minimal sketch in Java follows below.
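For reference only, here is a minimal sketch (my own illustration, not code from the article cited above) of how the two request types differ when made with java.net.HttpURLConnection; the host example.com and the name=value parameter are placeholders:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class GetVsPost {
    public static void main(String[] args) throws Exception {
        // GET: the parameters ride along in the URL itself
        URL getUrl = new URL("http://example.com/api?name=value");
        HttpURLConnection get = (HttpURLConnection) getUrl.openConnection();
        get.setRequestMethod("GET");
        System.out.println("GET status: " + get.getResponseCode());

        // POST: the parameters travel in the request body instead
        URL postUrl = new URL("http://example.com/api");
        HttpURLConnection post = (HttpURLConnection) postUrl.openConnection();
        post.setRequestMethod("POST");
        post.setDoOutput(true);                        // allow writing a request body
        try (OutputStream out = post.getOutputStream()) {
            out.write("name=value".getBytes("UTF-8")); // form-encoded placeholder body
        }
        System.out.println("POST status: " + post.getResponseCode());
    }
}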

Next, we need Java to fetch the content of a web page.

This is where our Baidu comes in handy.

Yes, it is no longer the anonymous page we use to test our network speed; it is about to become our crawler's guinea pig! ~

Let's take a look at the Baidu homepage first:

I believe we all know that a page like this is the result of HTML and CSS working together.

Right-click the page in the browser and select "View page source":

Yes, that wall of text right there is the source code of the Baidu page.

Our next task is to get exactly the same thing with our crawler.

First, look at a simple piece of source code:

import java.io.*;
import java.net.*;

public class Main {
    public static void main(String[] args) {
        // The link we are going to visit
        String url = "http://www.baidu.com";
        // String for accumulating the web page content
        String result = "";
        // Buffered character input stream
        BufferedReader in = null;
        try {
            // Convert the string into a URL object
            URL realUrl = new URL(url);
            // Open a connection to that URL
            URLConnection connection = realUrl.openConnection();
            // Establish the actual connection
            connection.connect();
            // Initialize a BufferedReader to read the response from the URL
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream()));
            // Temporarily holds each line we crawl
            String line;
            while ((line = in.readLine()) != null) {
                // Append each crawled line to result
                result += line;
            }
        } catch (Exception e) {
            System.out.println("Exception while sending GET request: " + e);
            e.printStackTrace();
        } finally {
            // Use finally to make sure the input stream is closed
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        System.out.println(result);
    }
}

The above is a main method that simulates GET access to Baidu in Java.

Run it and have a look at the result:


Aha, exactly the same as what we saw in the browser. With that, the simplest possible crawler is done.

But such a big lump of output is not necessarily what I want; how do I pull out only the parts I need?

Let's take Baidu's big-paw logo as an example.

Ad-hoc requirement:

Get the link to the big-paw image in the Baidu logo.

Let's talk about how to find it in the browser first.

Right-click the image and choose "Inspect Element" (Firefox, Chrome, and IE11 all have this feature, although the name differs):

Aha, you can see the poor img tag besieged by a pile of divs.

That src attribute is the link to the image.

So how do we do this in Java?

A note in advance: to keep the demo simple, the code is not properly encapsulated into classes; please bear with that.

Let's first wrap the previous code into a sendGet function:

import java.io.*;
import java.net.*;

public class Main {
    // Sends a GET request to the given url and returns the response body
    static String sendGet(String url) {
        // String for accumulating the web page content
        String result = "";
        // Buffered character input stream
        BufferedReader in = null;
        try {
            // Convert the string into a URL object
            URL realUrl = new URL(url);
            // Open a connection to that URL
            URLConnection connection = realUrl.openConnection();
            // Establish the actual connection
            connection.connect();
            // Initialize a BufferedReader to read the response from the URL
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream()));
            // Temporarily holds each line we crawl
            String line;
            while ((line = in.readLine()) != null) {
                // Append each crawled line to result
                result += line;
            }
        } catch (Exception e) {
            System.out.println("Exception while sending GET request: " + e);
            e.printStackTrace();
        } finally {
            // Use finally to make sure the input stream is closed
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The link we are going to visit
        String url = "http://www.baidu.com";
        // Visit the link and fetch the page content
        String result = sendGet(url);
        System.out.println(result);
    }
}

That looks a little tidier; please forgive my mild obsession with neatness.

The next task is to find the image link inside the big pile of stuff we just fetched.

The first approach that comes to mind is to search the page-source string result for substrings with indexOf.

Yes, this method can get the job done bit by bit: for example, call indexOf("src") to find the start index, then keep slicing forward until you reach the end index.
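To make that concrete, here is a minimal, self-contained sketch of the indexOf approach (my own illustration, not code from the original article); the tiny html string stands in for the real page source:

public class IndexOfDemo {
    // Returns the value of the first src="..." attribute in html, or "" if none
    static String firstSrc(String html) {
        int start = html.indexOf("src=\"");
        if (start == -1) {
            return "";                              // no src attribute found
        }
        start += "src=\"".length();                 // skip past src="
        int end = html.indexOf("\"", start);        // find the closing quote
        return (end == -1) ? "" : html.substring(start, end);
    }

    public static void main(String[] args) {
        // A tiny stand-in for the real page source
        String html = "<img src=\"logo.png\" alt=\"demo\">";
        System.out.println(firstSrc(html));         // prints logo.png
    }
}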

But we can't keep using that approach forever; after all, straw sandals are only good for a stroll out the door, and sooner or later you'll want proper footwear.

Forgive the rambling aside; let's continue.

So how do we find the src of this picture?

Yes, just as the audience has called out: a regular-expression match.

If you are not too clear on regular expressions, you can refer to this article: [Python] Web Crawler (7): A Regular Expression Tutorial in Python.

In simple terms, a regular expression is like a description used for matching.

For example, three fat guys are standing here, wearing red, blue, and green clothes respectively.

The regex says: grab the one in green!

And the fat guy in green gets grabbed.

It's that simple.

But regex syntax is still broad and deep, and when you first come into contact with it you will inevitably feel a little lost.

I recommend an online regex tester: Regular Expression Online Test.

With regex as a weapon in hand, how do we use regular expressions in Java?

Let's take a look at a simple little example.

// Define a pattern; the parentheses in the regex capture the content we want,
// like a trap buried at the spot where the match will fall in
Pattern pattern = Pattern.compile("href=\"(.+?)\"");
// Create a matcher to run the match against the target string
Matcher matcher = pattern.matcher("<a href=\"index.html\">my homepage</a>");
// If a match is found,
if (matcher.find()) {
    // print the captured group
    System.out.println(matcher.group(1));
}

Run result:

index.html

Yes, that's our first piece of regex code.

Applying the same idea to capture the image link should now be within easy reach.

Let's wrap the regex match into a function and then modify the code as follows:

import java.io.*;
import java.net.*;
import java.util.regex.*;

public class Main {
    // Sends a GET request to the given url and returns the response body
    static String sendGet(String url) {
        // String for accumulating the web page content
        String result = "";
        // Buffered character input stream
        BufferedReader in = null;
        try {
            // Convert the string into a URL object
            URL realUrl = new URL(url);
            // Open a connection to that URL
            URLConnection connection = realUrl.openConnection();
            // Establish the actual connection
            connection.connect();
            // Initialize a BufferedReader to read the response from the URL
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream()));
            // Temporarily holds each line we crawl
            String line;
            while ((line = in.readLine()) != null) {
                // Append each crawled line to result
                result += line;
            }
        } catch (Exception e) {
            System.out.println("Exception while sending GET request: " + e);
            e.printStackTrace();
        } finally {
            // Use finally to make sure the input stream is closed
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        return result;
    }

    // Returns the first capture group of patternStr found in targetStr
    static String regexString(String targetStr, String patternStr) {
        // Define a pattern; the parentheses in the regex capture the content we want,
        // like a trap buried at the spot where the match will fall in
        Pattern pattern = Pattern.compile(patternStr);
        // Create a matcher to run the match against the target string
        Matcher matcher = pattern.matcher(targetStr);
        // If a match is found, return the captured group
        if (matcher.find()) {
            return matcher.group(1);
        }
        return "";
    }

    public static void main(String[] args) {
        // The link we are going to visit
        String url = "http://www.baidu.com";
        // Visit the link and fetch the page content
        String result = sendGet(url);
        // Use a regex to match the src of the image (placeholder; the real regex comes below)
        String imgSrc = regexString(result, "the-regex-we-are-about-to-write");
        // Print the result
        System.out.println(imgSrc);
    }
}

OK, now everything is ready; the only thing missing is the regex itself!

So what regex should we use?

We can see that as long as we capture the src="xxxxxx" part, we get the whole image link,

so a simple regex will do: src=\"(.+?)\"

The complete code is as follows:

import java.io.*;
import java.net.*;
import java.util.regex.*;

public class Main {
    // Sends a GET request to the given url and returns the response body
    static String sendGet(String url) {
        // String for accumulating the web page content
        String result = "";
        // Buffered character input stream
        BufferedReader in = null;
        try {
            // Convert the string into a URL object
            URL realUrl = new URL(url);
            // Open a connection to that URL
            URLConnection connection = realUrl.openConnection();
            // Establish the actual connection
            connection.connect();
            // Initialize a BufferedReader to read the response from the URL
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream()));
            // Temporarily holds each line we crawl
            String line;
            while ((line = in.readLine()) != null) {
                // Append each crawled line to result
                result += line;
            }
        } catch (Exception e) {
            System.out.println("Exception while sending GET request: " + e);
            e.printStackTrace();
        } finally {
            // Use finally to make sure the input stream is closed
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        return result;
    }

    // Returns the first capture group of patternStr found in targetStr
    static String regexString(String targetStr, String patternStr) {
        // Define a pattern; the parentheses in the regex capture the content we want,
        // like a trap buried at the spot where the match will fall in
        Pattern pattern = Pattern.compile(patternStr);
        // Create a matcher to run the match against the target string
        Matcher matcher = pattern.matcher(targetStr);
        // If a match is found, return the captured group
        if (matcher.find()) {
            return matcher.group(1);
        }
        return "Nothing";
    }

    public static void main(String[] args) {
        // The link we are going to visit
        String url = "http://www.baidu.com";
        // Visit the link and fetch the page content
        String result = sendGet(url);
        // Use a regex to match the src of the image
        String imgSrc = regexString(result, "src=\"(.+?)\"");
        // Print the result
        System.out.println(imgSrc);
    }
}

And with that, we can use Java to grab the link to the Baidu logo.

Well, even though we spent a lot of time on Baidu, building a solid foundation matters; next time we'll officially start crawling Zhihu! ~
