Java programs that use search engines to collect URLs

Source: Internet
Author: User
What I am talking about here is not how to use a search engine, but how to let a program use a search engine to collect URLs. What is that good for? A lot! On the Internet, people often sell databases of websites: lists of software-publishing sites, email addresses, forum sites, industry sites, and so on. Where do those lists come from? It is impossible to gather that much information from a search engine by hand; a program has to do it. If you need this kind of website data, it is very easy — study it with me.

This article uses Java and targets the Google and Baidu search engines.

We need two of the search rules that Google and Baidu support: keyword search and inurl search. What is inurl search? It searches for keywords contained in the URL itself. For example, the URL http://www.xxx.com/post.asp contains the keyword post.asp, so entering the rule inurl:post.asp in the search engine will find it. This is the key to collecting websites, because pages with a specific function often have telltale words in their URLs, such as publish, submit, and tuijian.

The first step is to retrieve the search results with a program. Take Baidu as an example. Suppose we want to search for software-release pages with the keywords "software release version inurl:publish.asp". First log on to Baidu, type the keywords, and submit; the address bar will show something like http://www.baidu.com/s?ie=gb2312&bs=%C8%ED%BC%FE%B7%A2%B2%BC+%C8%ED%BC%FE%B0%E6%B1%BE+inurl%3Apublish.asp&sr=&z=&cl=3&f=8&wd=%C8%ED%BC%FE%B7%A2%B2%BC+%B0%E6%B1%BE+inurl%3Apublish.asp&ct=0. The Chinese keywords are all URL-encoded; that does not matter, since we can also use Chinese keywords directly in the program. Multiple keywords are joined with plus signs. After removing the useless parameters, we can simplify the address to http://www.baidu.com/s?lm=0&si=&rn=20&ie=gb2312&ct=0&wd=software release+version+inurl%3Apublish%2Easp&pn=0&cl=0, where rn is the number of results per page, wd= is the keywords to search for, and pn is the offset of the first result to display. This pn is the variable our program will loop over to fetch the results, stepping by 20 each time. We write a Java program to simulate this search process; the key classes are java.net.HttpURLConnection and java.net.URL. First, write a class that submits the search. The key code is as follows:

class Search
{
    public URL url;
    public HttpURLConnection http;
    public java.io.InputStream urlstream;
    ......
    for (int i = 0; i < 100; i++)
    {
        ......
        try {
            url = new URL("http://www.baidu.com/s?lm=0&si=&rn=20&ie=gb2312&ct=0&wd=software release+version+inurl%3Apublish%2Easp&pn=" + beginrecord + "&cl=0");
        } catch (Exception ef) {}
        try {
            http = (HttpURLConnection) url.openConnection();
            http.connect();
            urlstream = http.getInputStream();
        } catch (Exception ef) {}
        java.io.BufferedReader l_reader = new java.io.BufferedReader(
                new java.io.InputStreamReader(urlstream));
        try {
            while ((currentLine = l_reader.readLine()) != null) {
                totalstring += currentLine;
            }
        } catch (IOException ex3) {}
        ....
        // The search result is now in totalstring as HTML code; it is analyzed in the next step.
    }
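
The class above elides its setup, so here is a rough, self-contained sketch of the same fetch loop. The class name BaiduFetch, the fixed 100-result bound, and the use of java.net.URLEncoder to encode the keywords are my assumptions for illustration, not part of the original code:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class BaiduFetch {
    public static void main(String[] args) throws Exception {
        // Encode the keywords the way the address bar shows them: URLEncoder
        // joins words with + and escapes the rest (gb2312, as Baidu used then).
        String wd = URLEncoder.encode("software release version inurl:publish.asp", "gb2312");
        for (int beginrecord = 0; beginrecord < 100; beginrecord += 20) {
            URL url = new URL("http://www.baidu.com/s?lm=0&si=&rn=20&ie=gb2312&ct=0&wd="
                    + wd + "&pn=" + beginrecord + "&cl=0");
            HttpURLConnection http = (HttpURLConnection) url.openConnection();
            http.connect();
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(http.getInputStream(), "gb2312"));
            StringBuilder totalstring = new StringBuilder();
            String currentLine;
            while ((currentLine = reader.readLine()) != null) {
                totalstring.append(currentLine);
            }
            reader.close();
            http.disconnect();
            // totalstring now holds one page of result HTML for step 2.
            System.out.println("pn=" + beginrecord + ": " + totalstring.length() + " chars");
        }
    }
}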


Google, taken as the second example, is a little different: Google does some browser detection, and its encoding differs too. The URL is http://www.google.com/search?q=software release+version+inurl:publish.asp&hl=zh-CN&lr=&newwindow=1&start=0&sa=N&ie=UTF-8, where the encoding must be ie=UTF-8 and start indicates the offset of the first record to display. Note that Google checks the browser: if the client does not look like a normal browser, it returns an error code. Therefore, when submitting the request we must simulate a browser by adding one line of code to the key part, setting the User-Agent property of the HTTP request to a common browser such as Mozilla/4.0. The code is as follows:

try {
    http = (HttpURLConnection) url.openConnection();
    http.setRequestProperty("User-Agent", "Mozilla/4.0");
    http.connect();
    urlstream = http.getInputStream();
} catch (Exception ef) {}
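
Since the Baidu and Google fetches differ only in the URL, the encoding, and this User-Agent line, the connection code can be pulled into one shared helper. The following is a minimal sketch; the fetchPage name and the charset parameter are assumptions of mine, not the article's:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class PageFetcher {
    // Fetch one results page while posing as a common browser,
    // so that google's browser check does not reject the request.
    public static String fetchPage(String address, String charset) throws Exception {
        URL url = new URL(address);
        HttpURLConnection http = (HttpURLConnection) url.openConnection();
        http.setRequestProperty("User-Agent", "Mozilla/4.0");
        http.connect();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(http.getInputStream(), charset));
        StringBuilder page = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            page.append(line);
        }
        reader.close();
        http.disconnect();
        return page.toString();
    }
}

For Google you would pass "UTF-8" as the charset; for Baidu, "gb2312".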


Step 2: analyze the retrieved HTML, extract the useful URL information, and write it to a file or database. The HTML returned by these search engines mixes the real URLs with extra links such as web-page snapshots and "similar pages", so we have to remove the useless URL information. The key is to find the pattern: in Baidu's results, the snapshot and other useless addresses contain the keyword baidu, while in Google's results the useless links contain the keywords google and cache. We remove useless URLs based on these keywords. To analyze the strings in Java we need java.util.StringTokenizer, which splits a string on specific delimiters, and java.util.regex.Pattern together with java.util.regex.Matcher, which match strings. The key code is as follows:

class CompareStr
{
    public boolean comparestring(String oristring, String tostring)
    {
        Pattern p = null; // the regular expression
        Matcher m = null; // the string being matched
        boolean b;
        p = Pattern.compile(oristring, Pattern.CASE_INSENSITIVE);
        m = p.matcher(tostring);
        b = m.find();
        return b;
    }
}

class AnalyUrl
{
    ......
    StringTokenizer token = new StringTokenizer(totalstring, "<>\"");
    String firstword;
    CompareStr compstr = new CompareStr();
    String dsturl = null;
    while (token.hasMoreTokens())
    {
        firstword = token.nextToken();
        if (!compstr.comparestring("google.com", firstword) && !compstr.comparestring("cache", firstword))
        {
            if (firstword.length() > 7)
            {
                dsturl = firstword.substring(6, firstword.length() - 1);
                writeUrl(dsturl); // the URL was obtained successfully; record it in the file
            }
        }
    }
}
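
Pulling the two classes together, here is one possible runnable reading of step 2. The stand-in HTML, the writeUrl implementation (appending to urls.txt), and the exact substring index are my assumptions for illustration, since the real token shape depends on the HTML the engine returned:

import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnalyUrlDemo {
    // Same idea as CompareStr above: case-insensitive substring match.
    static boolean comparestring(String oristring, String tostring) {
        Pattern p = Pattern.compile(oristring, Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(tostring);
        return m.find();
    }

    // Assumed implementation of writeUrl: append each hit to urls.txt.
    static void writeUrl(String dsturl) throws Exception {
        PrintWriter out = new PrintWriter(new FileWriter("urls.txt", true));
        out.println(dsturl);
        out.close();
    }

    public static void main(String[] args) throws Exception {
        // A tiny stand-in for the HTML a results page might contain.
        String totalstring = "<a href=http://www.example.com/publish.asp>"
                + "<a href=http://www.google.com/cache>";
        StringTokenizer token = new StringTokenizer(totalstring, "<>\"");
        while (token.hasMoreTokens()) {
            String firstword = token.nextToken();
            if (!comparestring("google.com", firstword) && !comparestring("cache", firstword)) {
                if (firstword.startsWith("a href=") && firstword.length() > 7) {
                    writeUrl(firstword.substring(7)); // keep only the URL part
                }
            }
        }
    }
}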


Through the above procedures we can collect the URL information we need, and you can write another application to analyze the collected URLs further and extract whatever data you want; the principle is the same, so I will not belabor it. Finally, note that the Google search engine returns at most 1000 results per query; beyond 1000 it simply prompts "Sorry, Google does not serve more than 1000 results for any query." The Baidu search engine returns at most about 700 results. Therefore, add as many keywords as possible to narrow down the result set.
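
Since each page carries rn=20 results, those caps translate directly into a bound on the pn loop. A minimal sketch, assuming the limits quoted above:

public class PageLimit {
    public static void main(String[] args) {
        int resultsPerPage = 20;  // the rn parameter
        int maxResults = 700;     // Baidu's cap; use 1000 for google
        for (int pn = 0; pn < maxResults; pn += resultsPerPage) {
            // fetch "...&pn=" + pn here and analyze it as in step 2
            System.out.println("would fetch the page starting at result " + pn);
        }
    }
}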
