IT Ninja Turtle: Java web crawler review

Source: Internet
Author: User

Java web crawler technology. While studying it, I found that a basic web crawler first breaks down into the following steps:

1. Open the web link (URL).

2. Read the page source through a BufferedReader.

When learning web crawling, first import two packages: htmllexer.jar and htmlparser.jar

Here is a code example that I wrote:

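import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Crawler { // class name is illustrative
  public static void main(String[] args) {
    try {
      URL url = new URL("http://www.baidu.com");
      HttpURLConnection httpUrl = (HttpURLConnection) url.openConnection();

      BufferedReader br = new BufferedReader(new InputStreamReader(httpUrl.getInputStream(), "UTF-8"));

      // Use a regular expression to match web content.
      // The second alternative matches one word character, "://", one or more "w", then ".baidu.com".
      Pattern p = Pattern.compile("(http://\\w+\\.baidu\\.com)|(\\w://w+\\.baidu\\.com)");
      Matcher m;
      String line;
      while ((line = br.readLine()) != null) {
        m = p.matcher(line);
        if (m.find()) {
          // Print a line of the page only if it contains a match
          System.out.println(line);
        }
      }
      br.close();
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}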
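The loop above prints every whole line that contains a match. If only the matched links are wanted, one possible sketch is to collect each match with group(). The helper name extractLinks, the simplified pattern, and the hard-coded sample line are assumptions made for illustration; they are not part of the original example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Simplified, assumed pattern: http or https links to a host under baidu.com.
    private static final Pattern LINK = Pattern.compile("https?://\\w+\\.baidu\\.com");

    // Hypothetical helper: returns every matching link found in one line of page source.
    static List<String> extractLinks(String line) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(line);
        while (m.find()) {         // each call advances to the next match
            links.add(m.group());  // group() is the text of the current match
        }
        return links;
    }

    public static void main(String[] args) {
        String sample = "<a href=\"http://www.baidu.com\">home</a> "
                + "<a href=\"https://news.baidu.com/x\">news</a>";
        System.out.println(extractLinks(sample));
    }
}
```

In the crawler, the same helper could be called on each line returned by readLine() instead of printing the line itself.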

While learning this, I ran into a very interesting question:

What is the difference between the find() and matches() methods of a regular expression Matcher? find() performs a partial match: it looks for a substring of the input string that matches the pattern, and if the pattern contains groups, the group() method can be used to retrieve them.

matches() performs a full match: the entire input string must match the pattern. To verify that an input value is numeric or of some other format, matches() is typically used.
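As a small illustration of this distinction, the sketch below applies the same pattern with both methods. The class and method names are made up for the example.

```java
import java.util.regex.Pattern;

class FindVsMatches {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Full match: true only when the whole input consists of digits.
    static boolean isAllDigits(String input) {
        return DIGITS.matcher(input).matches();
    }

    // Partial match: true when the input contains at least one run of digits.
    static boolean containsDigits(String input) {
        return DIGITS.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("12345"));     // true
        System.out.println(isAllDigits("abc123"));    // false
        System.out.println(containsDigits("abc123")); // true
    }
}
```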

Java regular expressions: the difference between Matcher.find() and Matcher.matches()


1. find() is a partial match: it looks for a substring of the input string that matches the pattern, and if the pattern contains groups, the group() method can also be used.

matches() is a full match: the entire input string must match the pattern. To verify that an input value is numeric or of some other format, matches() is typically used.

2. Example:

Pattern pattern = Pattern.compile(".*?,(.*)");

Matcher matcher = pattern.matcher(result);

if (matcher.find()) {
    return matcher.group(1);
}
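To see what the fragment in item 2 returns, here is a self-contained sketch that wraps it in a method. The method name afterFirstComma and the choice to return null when there is no comma are assumptions added for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AfterFirstComma {
    // ".*?" is reluctant, so it stops at the first comma;
    // group 1 then captures everything after that comma.
    private static final Pattern AFTER_COMMA = Pattern.compile(".*?,(.*)");

    static String afterFirstComma(String result) {
        Matcher matcher = AFTER_COMMA.matcher(result);
        if (matcher.find()) {
            return matcher.group(1);
        }
        return null; // assumption: no comma means there is nothing to return
    }

    public static void main(String[] args) {
        System.out.println(afterFirstComma("a,b,c")); // b,c
    }
}
```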

3. Details:

matches
public static boolean matches(String regex, CharSequence input)

Compiles the given regular expression and attempts to match the given input against it.
An invocation of this convenience method of the form
Pattern.matches(regex, input);
behaves in exactly the same way as the expression
Pattern.compile(regex).matcher(input).matches();
If a pattern is to be used more than once, compiling it once and reusing it is more efficient than invoking this method each time.
Parameters:
regex - The expression to be compiled
input - The character sequence to be matched
Throws:
PatternSyntaxException - If the expression's syntax is invalid
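The advice about reusing a compiled pattern can be sketched as follows. The class name, method name, and sample inputs are illustrative only.

```java
import java.util.regex.Pattern;

class ReusePattern {
    // Compiled once and reused for every input.
    private static final Pattern WORD = Pattern.compile("[a-z]+");

    // Counts the inputs that consist entirely of lowercase letters.
    static int countLowercaseWords(String[] inputs) {
        int count = 0;
        for (String s : inputs) {
            if (WORD.matcher(s).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Convenience form: compiles the expression on every invocation.
        System.out.println(Pattern.matches("[a-z]+", "hello")); // true

        // Reused form: the pattern above is compiled only once.
        System.out.println(countLowercaseWords(new String[] {"alpha", "Beta", "gamma", "x1"})); // 2
    }
}
```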

find
public boolean find()

Attempts to find the next subsequence of the input sequence that matches the pattern.
This method starts at the beginning of the matcher's region or, if a previous invocation of the method was successful and the matcher has not since been reset, at the first character not matched by the previous match.
If the match succeeds, more information can be obtained through the start, end, and group methods.

Matcher.start() returns the index in the input string where the matched substring begins.
Matcher.end() returns the index just after the last character of the matched substring.
Matcher.group() returns the matched substring.
Returns:
true if, and only if, a subsequence of the input sequence matches this matcher's pattern.
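A short sketch of how repeated find() calls walk through the input and how start(), end(), and group() report each match. The helper name describeMatches and the output format are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchPositions {
    // Describes every match as "text@start-end", where end is exclusive.
    static List<String> describeMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) { // resumes after the previous match
            out.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describeMatches("\\d+", "ab12cd345")); // [12@2-4, 345@6-9]
    }
}
```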


4. Some Java regular expression examples

① Character matching
Pattern p = Pattern.compile(expression); // the regular expression
Matcher m = p.matcher(str); // the string to operate on
boolean b = m.matches(); // whether the entire string matches
System.out.println(b);

Pattern p = Pattern.compile(expression); // the regular expression
Matcher m = p.matcher(str); // the string to operate on
boolean b = m.lookingAt(); // whether the beginning of the string matches
System.out.println(b);

Pattern p = Pattern.compile(expression); // the regular expression
Matcher m = p.matcher(str); // the string to operate on
boolean b = m.find(); // whether any substring matches
System.out.println(b);
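The three fragments above differ only in which Matcher method they call. As a rough side-by-side comparison, the sketch below runs all three against the same input. The class name, the check method, and its string result format are assumptions for illustration.

```java
import java.util.regex.Pattern;

class ThreeWays {
    // Returns the results of matches(), lookingAt(), and find(), in that order.
    static String check(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        boolean whole = p.matcher(input).matches();     // entire input must match
        boolean prefix = p.matcher(input).lookingAt();  // match must start at index 0
        boolean anywhere = p.matcher(input).find();     // match may start anywhere
        return whole + " " + prefix + " " + anywhere;
    }

    public static void main(String[] args) {
        System.out.println(check("\\d+", "123"));    // true true true
        System.out.println(check("\\d+", "123abc")); // false true true
        System.out.println(check("\\d+", "abc123")); // false false true
    }
}
```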


② Splitting a string
Pattern pattern = Pattern.compile(expression); // the regular expression
String[] strs = pattern.split(str); // split the string and get the resulting array of strings
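For instance, a minimal sketch that splits on commas with optional surrounding whitespace. The delimiter pattern and the names are chosen only for this example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo {
    // A comma, with any whitespace around it, acts as the delimiter.
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    static String[] splitOnCommas(String str) {
        return COMMA.split(str);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnCommas("a, b ,c"))); // [a, b, c]
    }
}
```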

③ Replacing in a string
Pattern p = Pattern.compile(expression); // the regular expression
Matcher m = p.matcher(text); // the string to operate on
String s = m.replaceAll(str); // the string after replacement
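A brief sketch of replaceAll() in use. One detail worth knowing is that the replacement string treats "$" and "\" specially, so a literal replacement can be passed through Matcher.quoteReplacement(). The method names and sample strings here are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceDemo {
    // Collapses every run of whitespace into a single space.
    static String collapseWhitespace(String text) {
        return Pattern.compile("\\s+").matcher(text).replaceAll(" ");
    }

    // Replaces each digit with a literal "$"; quoteReplacement keeps "$"
    // from being read as a group reference.
    static String maskDigits(String text) {
        return Pattern.compile("\\d").matcher(text).replaceAll(Matcher.quoteReplacement("$"));
    }

    public static void main(String[] args) {
        System.out.println(collapseWhitespace("a   b \t c")); // a b c
        System.out.println(maskDigits("a1b22"));              // a$b$$
    }
}
```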

④ Finding and replacing specified strings
Pattern p = Pattern.compile(expression); // the regular expression
Matcher m = p.matcher(text); // the string to operate on
StringBuffer sb = new StringBuffer();
int i = 0;
while (m.find()) {
    m.appendReplacement(sb, str);
    i++; // number of occurrences of the string
}
m.appendTail(sb); // append the text that follows the last match
String s = sb.toString();
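The appendReplacement()/appendTail() loop is most useful when the replacement depends on the match. The sketch below uses the running counter in the replacement text. The method name numberMatches and the "[n]" format are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CountingReplace {
    // Replaces the n-th match of regex in text with "[n]".
    static String numberMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        StringBuffer sb = new StringBuffer();
        int i = 0;
        while (m.find()) {
            i++; // number of matches seen so far
            // Copies the text between the previous match and this one,
            // then appends the replacement for this match.
            m.appendReplacement(sb, "[" + i + "]");
        }
        m.appendTail(sb); // copy whatever follows the last match
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(numberMatches("cat", "cat and cat!")); // [1] and [2]!
    }
}
```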
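
⑤ Finding and outputting matched strings
Pattern p = Pattern.compile(expression); // the regular expression
Matcher m = p.matcher(text); // the string to operate on
while (m.find()) {
    m.start(); // start index of the current match
    m.end(); // index just after the end of the current match
    m.group(1); // text captured by the first group
}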

