Simple Java Capture Program

Source: Internet
Author: User

"Target task": collect the country's mobile phone number segments from a website and store them in a database table.

"Complete Process"

1. Get started with regular expressions and learn to write simple ones.

2. Fetch the content of a single page and learn basic IO streams in Java.

3. Insert the data into a MySQL database table and master basic JDBC programming.

5. Get the full URL of each city by URL stitching.

6. Collect the number segments of the entire site and insert them into the database table using batching with precompiled statements.

7. Use StringBuilder to optimize speed.

Database table. Note: field names do not need to be quoted if you create the table from the cmd command line.

CREATE TABLE number_segment (
    id BIGINT NOT NULL AUTO_INCREMENT UNIQUE,
    segment CHAR(7) NOT NULL PRIMARY KEY,
    province VARCHAR(255) NOT NULL,
    city VARCHAR(255) NOT NULL
) DEFAULT CHARSET=utf8;

"Getting started with regular expressions"

1. Learn simple expressions: see the 30-minute introduction to regular expressions.

2. Test the expressions you write online: use an online regular expression tester.

3. Use Java's Pattern and Matcher classes.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test_zhengze {
    public static void main(String[] args) {
        // 13 followed by five digits; [^<] also consumes one trailing character that is not '<'
        Pattern p = Pattern.compile("(13\\d{5}[^<])");
        String s = "/mobile/guangzhou_1300040.>1300040</a></li><li><a href=\"../../mobile/guangzhou_1300041.html\">1300041</a></li><li><a";
        Matcher m = p.matcher(s);
        while (m.find()) {
            System.out.println("Matched number segment: " + m.group(0));
        }
        // groupCount() is the number of capture groups in the pattern, not the number of matches
        System.out.print("Capture groups: " + m.groupCount());
    }
}
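As a small illustration of Pattern and Matcher, the sketch below pulls number segments out of link text with a capture group and keeps its own count of matches. The class name SegmentRegexSketch and the method extractSegments are made up for this example, and the pattern assumes that a segment is seven digits starting with 13, 15, 18 or 147 and sits between '>' and '<'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the original program.
class SegmentRegexSketch {
    // Assumption: a segment is 7 digits starting with 13, 15, 18 or 147,
    // and it appears as link text between '>' and '<'.
    private static final Pattern SEGMENT =
            Pattern.compile(">(13\\d{5}|15\\d{5}|18\\d{5}|147\\d{4})<");

    // Returns every captured segment, in the order it appears in the HTML.
    static List<String> extractSegments(String html) {
        List<String> segments = new ArrayList<>();
        Matcher m = SEGMENT.matcher(html);
        while (m.find()) {
            segments.add(m.group(1)); // group(1) is the text inside the parentheses
        }
        return segments;
    }

    public static void main(String[] args) {
        String html = "<li><a href=\"a.html\">1300040</a></li>"
                + "<li><a href=\"b.html\">1300041</a></li>";
        List<String> segments = extractSegments(html);
        System.out.println(segments);        // prints [1300040, 1300041]
        System.out.println(segments.size()); // prints 2
    }
}
```

Using group(1) returns only the digits, without the surrounding '>' and '<'. The number of matches comes from the size of the list, because Matcher.groupCount() only reports how many capture groups the pattern contains.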

"Get Web Content"

Here we mainly use two IO stream classes, InputStream and BufferedReader. For more approaches, see "Summary of methods for fetching web page content in Java".

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class gethtml {
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        String str_url = "http://www.hiphop8.com/city/guangdong/guangzhou.php";
        // Match number segments
        Pattern p = Pattern.compile(">(13\\d{5}|15\\d{5}|18\\d{5}|147\\d{4})<");
        String html = get_html(str_url);
        Matcher m = p.matcher(html);
        int num = 0;
        while (m.find()) {
            System.out.println("Matched number segment: " + m.group(1) + " number " + (++num));
        }
        System.out.println(num);
        long end = System.currentTimeMillis();
        System.out.println("Time spent: " + (end - start) + " milliseconds");
    }

    // Reads the page line by line and returns it as a single string
    public static String get_html(String str_url) throws IOException {
        URL url = new URL(str_url);
        String content = "";
        StringBuffer page = new StringBuffer();
        // try-with-resources closes the reader when reading finishes
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            while ((content = in.readLine()) != null) {
                page.append(content);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return page.toString();
    }
}

"Inserting collected content into a database"

The general steps for connecting to a MySQL database from Java are:

Load the MySQL driver -> create a database connection -> create a Statement object to execute SQL -> define the SQL statement as a String and have the Statement execute it -> close the Statement object and the database connection.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class Database {
    public static String driver = "com.mysql.jdbc.Driver";
    public static String url = "jdbc:mysql://127.0.0.1:3306/tele_dat?autoReconnect=true&characterEncoding=utf-8";
    public static String user = "root";
    public static String password = "1234";
    public static Statement statement = null;
    public static Connection conn = null;
    public static int i = 0;

    // Insert method: executes one SQL statement
    public static void dataToMysql(String sql) throws SQLException {
        try {
            Class.forName(driver);
        } catch (ClassNotFoundException e) {
            System.out.println("Failed to load driver");
            e.printStackTrace();
        }
        conn = DriverManager.getConnection(url, user, password); // create a connection
        statement = conn.createStatement(); // create a Statement object to send the SQL
        statement.executeUpdate(sql);
    }

    public static void close() throws SQLException {
        // Either object may still be null if connecting failed
        if (statement != null) {
            statement.close(); // close the statement object
        }
        if (conn != null) {
            conn.close(); // close the database connection
        }
    }

    // Example: test the database connection
    public static void main(String[] args) {
        String sql = "insert into number_segment (segment,province,city)"
                + " VALUES (123458, 'Guangdong 1', 'Guangzhou')";
        try {
            dataToMysql(sql);
            System.out.println("Insert succeeded");
        } catch (SQLException e) {
            System.out.println("Insert failed");
            e.printStackTrace();
        }
        try {
            close();
            System.out.print("Database closed");
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}

I use the MySQL database bundled with WampServer and operate it from CMD. For common commands, see "Common MySQL commands". If you are not familiar with JDBC programming, you can refer to this blog post.

"Get the URLs of all cities on the site"

Looking at the source of the site's home page, we find that the URL of each province can be extracted from it. Each province page in turn contains the URL suffix of every city in that province, so joining the province URL and the city suffix gives the complete URL of a city.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Get_all_city_URL {
    public static void main(String[] args) throws Exception {
        String home_url = "http://www.hiphop8.com";
        String pattern_pro = "\\w{3}\\.\\w{7}\\.\\w{3}\\/\\w{4}\\/\\w+"; // matches province URLs
        String pattern_city_hz = "<li><a href=\"(.*?)\" target=_blank>"; // city suffix
        Matcher mat_home = get(home_url, pattern_pro);
        int i = 0;
        // All URLs could be stored in an ArrayList or appended to a StringBuilder,
        // but in testing the time taken was almost the same
        long start = System.currentTimeMillis();
        while (mat_home.find()) {
            String city_url_qz = "http://" + mat_home.group() + "/"; // province URL prefix
            Matcher mat_city_hz = get(city_url_qz, pattern_city_hz);
            while (mat_city_hz.find()) {
                i++;
                String city_url = city_url_qz + mat_city_hz.group(1);
                System.out.println(i + " " + city_url);
            }
        }
        long end = System.currentTimeMillis();
        long time = end - start;
        System.out.println("Total time " + time);
    }

    // Fetches a page and returns a Matcher for the given pattern over its source
    public static Matcher get(String str, String pa) throws Exception {
        String urlsource = get_html(str);
        Pattern p = Pattern.compile(pa);
        Matcher m = p.matcher(urlsource);
        return m;
    }

    public static String get_html(String str_url) throws IOException {
        URL url = new URL(str_url);
        String content = "";
        StringBuffer page = new StringBuffer();
        // try-with-resources closes the reader when reading finishes
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            while ((content = in.readLine()) != null) {
                page.append(content);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return page.toString();
    }
}
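The stitching step can also be sketched with java.net.URI.resolve instead of plain string concatenation. This is only an illustration: the class CityUrlSketch and the method cityUrl are hypothetical names, and the suffix format (a file name such as guangzhou.php under /city/guangdong/) is assumed from the city URL used earlier in this article.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of the original program.
class CityUrlSketch {
    // Resolves a city suffix against a province URL.
    // A trailing '/' is added to the province URL when missing, so that the
    // suffix is appended instead of replacing the last path segment.
    static String cityUrl(String provinceUrl, String citySuffix) {
        String base = provinceUrl.endsWith("/") ? provinceUrl : provinceUrl + "/";
        return URI.create(base).resolve(citySuffix).toString();
    }

    public static void main(String[] args) {
        // prints http://www.hiphop8.com/city/guangdong/guangzhou.php
        System.out.println(cityUrl("http://www.hiphop8.com/city/guangdong", "guangzhou.php"));
    }
}
```

Compared with concatenation, resolve also copes with suffixes that start with '/' or that are already absolute URLs. URI.create throws IllegalArgumentException if a string contains characters that are not legal in a URI, such as spaces, so suffixes scraped from real pages may need cleaning first.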

With the basics above in place, you can collect the number segments of the entire site. However, because more than 200,000 rows are inserted into the table, efficiency becomes the main consideration.

In addition, there are many Java scraping tutorials online, and some are very well written. I know too little about Java. I wrote this post first as a summary and a keepsake, and second in the hope of helping beginners like me and exchanging ideas with everyone. Of course, if anything in the article is wrong, I hope the experts will point it out.
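The article does not show the batch insert itself, so the following is only a rough sketch of what "batch + precompiled" inserts might look like with a JDBC PreparedStatement. The class BatchInsertSketch, the method insertAll and the batch size are assumptions made for illustration. The Connection is expected to be opened the same way as in the Database class, and each row is assumed to be a String array of segment, province and city.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical helper for illustration; not the author's original code.
class BatchInsertSketch {
    private static final String SQL =
            "INSERT INTO number_segment (segment, province, city) VALUES (?, ?, ?)";

    // Inserts rows of {segment, province, city} in batches of batchSize
    // and returns the number of rows queued for insertion.
    static int insertAll(Connection conn, List<String[]> rows, int batchSize) throws SQLException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // commit once at the end instead of after every row
        int queued = 0;
        // The statement is precompiled once and reused for every row
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.setString(3, row[2]);
                ps.addBatch();
                queued++;
                if (queued % batchSize == 0) {
                    ps.executeBatch(); // send a full batch to the server
                }
            }
            if (queued % batchSize != 0) {
                ps.executeBatch(); // send the final partial batch
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback(); // undo the uncommitted rows if any batch fails
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
        return queued;
    }
}
```

A call might look like insertAll(conn, rows, 1000), where 1000 is an arbitrary batch size to tune by measurement. With MySQL Connector/J, adding rewriteBatchedStatements=true to the JDBC URL may speed up batches further, but this was not measured here.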
