- Two methods are used.
- The first, getUrlContentString, takes a URL and returns the page source code; the second extracts the desired address fragment from that source code, using a regular expression to match it.
- The process of getting the web page source code:
- The address is a string; convert it into a URL object in Java.
- The URL's openConnection method returns a URLConnection.
- The URLConnection's connect method establishes the connection.
- Create a new InputStreamReader object. Its constructor needs an InputStream, and URLConnection's getInputStream method returns one, so the two can be chained together.
- Then build a BufferedReader object from the InputStreamReader.
- Read the page source from the BufferedReader line by line, appending each line to the result string; the result string is the page source code.
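The steps above can be sketched roughly as follows. This is a minimal illustration, not necessarily the original code: the class name PageFetcher, the UTF-8 charset, and the newline appended after each line are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical holder class; only the method name comes from the text above.
final class PageFetcher {

    // Returns the source of the page at the given address as one string.
    static String getUrlContentString(String address) throws IOException {
        // Turn the address string into a URL object.
        URL url = new URL(address);
        // openConnection returns a URLConnection; connect establishes it.
        URLConnection connection = url.openConnection();
        connection.connect();

        StringBuilder result = new StringBuilder();
        // Wrap the connection's input stream: InputStream -> InputStreamReader -> BufferedReader.
        // UTF-8 is an assumption; a real page may declare another charset.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            // Read line by line and append each line to the result.
            while ((line = reader.readLine()) != null) {
                result.append(line).append('\n');
            }
        }
        return result.toString();
    }
}
```

A call might look like `String contentString = PageFetcher.getUrlContentString("https://www.baidu.com");`. The try-with-resources block closes the reader, and with it the underlying stream, even if reading fails.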
- Matching the logo address:
- Pattern pattern = Pattern.compile(patternString);
- java.util.regex: a package in the Java class library for matching strings against patterns defined by regular expressions.
Its two main classes are Pattern and Matcher.
Pattern: the compiled form of a regular expression, that is, the matching pattern.
Matcher: matches the pattern against the input string.
- The compile method of Pattern: compiles the given regular expression string into a Pattern.
- Matcher matcher = pattern.matcher(contentString);
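The matching step might be put together as in the sketch below. The class name LogoMatcher, the method name findMatches, and the sample pattern in the usage note are hypothetical; the actual expression depends on how the logo's address appears in the page source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
final class LogoMatcher {

    // Returns every fragment of contentString that matches patternString.
    // If the pattern has a capturing group, the first group is returned;
    // otherwise the whole match is returned.
    static List<String> findMatches(String contentString, String patternString) {
        // Compile the regular expression string into a Pattern.
        Pattern pattern = Pattern.compile(patternString);
        // Create a Matcher that applies the pattern to the page source.
        Matcher matcher = pattern.matcher(contentString);

        List<String> matches = new ArrayList<>();
        // find() advances to the next match each time it returns true.
        while (matcher.find()) {
            matches.add(matcher.groupCount() > 0 ? matcher.group(1) : matcher.group());
        }
        return matches;
    }
}
```

For example, assuming the logo appears as the src attribute of an image tag ending in .png, `LogoMatcher.findMatches(contentString, "src=\"([^\"]+\\.png)\"")` would collect those addresses.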
Java Crawl Baidu Homepage logo