Using search engines from a Java program to collect website URLs

This article is not about how to use a search engine by hand, but about how to have a program use search engines to collect URLs. What is that good for? Quite a lot. Some people on the Internet sell website databases: lists of software-release sites, e-mail addresses, forum sites, industry sites, and so on. Where do these lists come from? They cannot realistically be collected by hand; they are gathered by programs that query search engines. If you need this kind of website data, follow along. It is quite simple.

The examples in this article are written in Java and target the Google and Baidu search engines.

We rely on two search rules supported by both Google and Baidu: keyword search and inurl search. An inurl search matches a keyword against the URL itself. For example, the URL http://www.xxx.com/post.asp contains the keyword post.asp, and the corresponding search rule is inurl:post.asp. This is the key to collecting URLs, because many URLs carry telling information of their own. The URL of a software-submission page, for instance, often contains words such as publish, submit, or tuijian (pinyin for "recommend"), as in http://www.xxx.com/publish.asp, and such a URL usually points to a page for publishing information. Combining inurl with keywords that the page itself is likely to contain gives us targeted search results. The program then retrieves the result pages, analyzes the HTML, discards the useless information, and writes the useful URLs to a file or database for use by other programs or by people.
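
As a small illustration of the rule described above, the following sketch joins ordinary keywords with an inurl: term into one query string. The class name InurlQuery and its build method are hypothetical helpers introduced only for this example; they are not part of any search engine API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: combines page keywords with an inurl: rule.
class InurlQuery {

    // Returns e.g. "software release version inurl:publish.asp".
    // urlFragment may be null or empty, in which case only the keywords are used.
    static String build(List<String> keywords, String urlFragment) {
        List<String> parts = new ArrayList<>(keywords);
        if (urlFragment != null && !urlFragment.isEmpty()) {
            parts.add("inurl:" + urlFragment);
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        System.out.println(build(Arrays.asList("software release", "version"), "publish.asp"));
    }
}
```

The resulting string still has to be URL-encoded before it is placed in a search URL.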

The first step is to have the program retrieve the search results. Take Baidu as an example. Suppose we want to find software-release pages, using the keywords "软件发布 版本 inurl:publish.asp" (software release, version). Open Baidu in a browser, type in the keywords, and submit. The address bar then shows http://www.baidu.com/s?ie=gb2312&bs=%C8%ED%BC%FE%B7%A2%B2%BC+%C8%ED%BC%FE%B0%E6%B1%BE+inurl%3Apublish.asp&sr=&z=&cl=3&f=8&wd=%c8%ed%bc%fe%b7%a2%b2%bc+%b0%e6%b1%be+inurl%3apublish.asp&ct=0. The Chinese keywords have all been percent-encoded, but that does not matter: the program can use the Chinese text directly. Multiple keywords are joined with +.

After removing the parameters we do not need, the address can be simplified to http://www.baidu.com/s?lm=0&si=&rn=20&ie=gb2312&ct=0&wd=软件发布+版本+inurl%3Apublish%2Easp&pn=0&cl=0, where rn is the number of results shown per page, wd is the keyword string being searched, and pn is the index of the first result to show. pn is the variable our program loops over to walk through the results, increasing by 20 on each iteration. We simulate this search in Java using the classes java.net.HttpURLConnection and java.net.URL. First we write a class that submits the search; the key code is as follows:

class Search
{
  public URL url;
  public HttpURLConnection http;
  public java.io.InputStream urlstream;
  ......
  // Request up to 100 result pages, one page per iteration
  for (int i = 0; i < 100; i++)
  {
    ......
    try {
      // beginrecord is the index of the first result on this page (the pn parameter)
      url = new URL("http://www.baidu.com/s?lm=0&si=&rn=20&ie=gb2312&ct=0&wd=软件发布+版本+inurl%3Apublish%2Easp&pn=" + beginrecord + "&cl=0");
    } catch (Exception ef) {};
    try {
      http = (HttpURLConnection) url.openConnection();
      http.connect();
      urlstream = http.getInputStream();
    } catch (Exception ef) {};
    java.io.BufferedReader l_reader = new java.io.BufferedReader(new java.io.InputStreamReader(urlstream));
    try {
      // Read the response line by line and append it to totalstring
      while ((currentLine = l_reader.readLine()) != null) {
        totalstring += currentLine;
      }
    } catch (IOException ex3) {}
    ....
    // The results of this search are now in totalstring as HTML code, ready to be analyzed in the next step.
  }
}
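
The fragment above leaves out several declarations, so here is a self-contained sketch of the same idea under a few assumptions. The class name BaiduSearchPage and the methods buildUrl and fetch are hypothetical; the URL layout is copied from the article and may no longer match what Baidu accepts today. The sketch URL-encodes the query instead of embedding raw text, reads the response with an explicit charset, and closes the stream when done.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Locale;

// Hypothetical helper for building and fetching one page of results.
class BaiduSearchPage {

    // Builds the search URL from the article: rn = results per page, pn = offset of the first result.
    static String buildUrl(String query, int pageSize, int offset, String charset) {
        try {
            String wd = URLEncoder.encode(query, charset); // spaces become '+', ':' becomes %3A
            return "http://www.baidu.com/s?lm=0&si=&rn=" + pageSize
                    + "&ie=" + charset.toLowerCase(Locale.ROOT)
                    + "&ct=0&wd=" + wd + "&pn=" + offset + "&cl=0";
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charset, e);
        }
    }

    // Downloads the page and returns its HTML; the timeout values are arbitrary assumptions.
    static String fetch(String pageUrl, String charset) throws IOException {
        HttpURLConnection http = (HttpURLConnection) new URL(pageUrl).openConnection();
        http.setConnectTimeout(10_000);
        http.setReadTimeout(10_000);
        StringBuilder html = new StringBuilder();
        try (InputStream in = http.getInputStream();
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        } finally {
            http.disconnect();
        }
        return html.toString();
    }

    public static void main(String[] args) {
        // Only prints the URL; call fetch(...) to actually download the page.
        System.out.println(buildUrl("software release inurl:publish.asp", 20, 0, "UTF-8"));
    }
}
```

Using a StringBuilder rather than repeated string concatenation avoids copying the whole page on every line, which matters when many result pages are fetched in a loop.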

Google is slightly different: it performs some browser detection, and the encoding also differs. The URL is http://www.google.com/search?q=软件发布+版本+inurl:publish.asp&hl=zh-CN&lr=&newwindow=1&start=0&sa=N&ie=utf-8, where ie=utf-8 specifies the encoding and start is the index of the first record to show. Note that Google checks the browser: if the client does not meet its requirements, it returns an error code. So when simulating a browser request we need one extra line of code that sets the User-Agent HTTP request property to that of a common browser, such as Mozilla/4.0. The code is as follows:

try {
  http = (HttpURLConnection) url.openConnection();
  // Present the request as coming from a common browser
  http.setRequestProperty("User-Agent", "Mozilla/4.0");
  http.connect();
  urlstream = http.getInputStream();
} catch (Exception ef) {};
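
A slightly fuller sketch of the same step is shown below, assuming we also want timeouts and a way to check the HTTP status before reading the body. The class name BrowserLikeRequest and its methods are hypothetical, and the timeout values are arbitrary.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper: prepares a request that carries a browser-style User-Agent.
class BrowserLikeRequest {

    // Creates the connection object and sets headers; no network traffic happens yet.
    static HttpURLConnection open(String pageUrl, String userAgent) {
        try {
            HttpURLConnection http = (HttpURLConnection) new URL(pageUrl).openConnection();
            http.setRequestProperty("User-Agent", userAgent);
            http.setConnectTimeout(10_000);
            http.setReadTimeout(10_000);
            return http;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sends the request (if not already sent) and reports whether the server answered 200 OK.
    static boolean isSuccess(HttpURLConnection http) throws IOException {
        return http.getResponseCode() == HttpURLConnection.HTTP_OK;
    }

    public static void main(String[] args) {
        HttpURLConnection http = open("http://www.google.com/search?q=test&start=0", "Mozilla/4.0");
        System.out.println(http.getRequestProperty("User-Agent"));
    }
}
```

Checking the status code first makes it easier to tell a rejected request apart from an empty result page.
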

The second step is to analyze the retrieved HTML, extract the useful URLs, and write them to a file or database. The search engines mix their own page-snapshot links and "similar pages" links into the HTML, and we want to get rid of those. The key is to find the pattern: on Baidu, snapshot links and other unwanted addresses contain the keyword baidu, while on Google the unwanted URLs contain the keywords google and cache. We filter out the unwanted URLs based on these keywords. To parse the string in Java we use the java.util.StringTokenizer class, which splits a string on specified delimiters, together with java.util.regex.Pattern and java.util.regex.Matcher to match strings. The key code is as follows:

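class CompareStr
{
  // Returns true if the pattern oristring occurs anywhere in tostring (case-insensitive)
  public boolean comparestring(String oristring, String tostring)
  {
    Pattern p = null; // the regular expression
    Matcher m = null; // matcher over the string being examined
    boolean b;
    p = Pattern.compile(oristring, Pattern.CASE_INSENSITIVE);
    m = p.matcher(tostring);
    b = m.find();
    return b;
  }
}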

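class AnalyUrl
{
  ......
  // Split the HTML on spaces, angle brackets, and double quotes
  StringTokenizer token = new StringTokenizer(totalstring, " <> \"");
  String firstword;
  CompareStr compstr = new CompareStr();
  String dsturl = null;
  while (token.hasMoreTokens())
  {
    firstword = token.nextToken();
    // Skip tokens that point back to the search engine itself or to its cache
    if (!compstr.comparestring("google.com", firstword) && !compstr.comparestring("cache", firstword))
    {
      if (firstword.length() > 7)
      {
        // Drop the first 6 characters and the last character of the token
        dsturl = firstword.substring(6, firstword.length() - 1);
        WriteUrl(dsturl); // URL extracted successfully; record it to the file
      }
    }
  }
}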
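
As an alternative to the tokenizer approach above, the filtering step can also be sketched with a single regular expression over href attributes. This is a different technique from the article's code, not a reconstruction of it. The class name ResultLinkExtractor is hypothetical, the sample HTML in main is made up for illustration, and real result pages may need a more tolerant pattern or a proper HTML parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls absolute links out of HTML and drops search-engine links.
class ResultLinkExtractor {

    // Matches href="http://..." or href="https://...", case-insensitively.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"(https?://[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns the distinct URLs, in order of appearance, that contain none of the blocked words.
    static List<String> extract(String html, List<String> blockedWords) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String url = m.group(1);
            if (!containsAny(url, blockedWords)) {
                urls.add(url);
            }
        }
        return new ArrayList<>(urls);
    }

    private static boolean containsAny(String url, List<String> words) {
        String lower = url.toLowerCase(Locale.ROOT);
        for (String word : words) {
            if (lower.contains(word.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up sample input, not real search engine output.
        String html = "<a href=\"http://www.example.com/publish.asp\">site</a>"
                + "<a href=\"http://cache.baidu.com/c?u=1\">snapshot</a>";
        System.out.println(extract(html, Arrays.asList("baidu")));
    }
}
```

For Google results, the blocked words would be google and cache, as described earlier in the article.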

With the program above we can collect our own URL data. We can also write another program that analyzes the collected sites further and extracts whatever information is needed; the principle is the same, so it is not repeated here. Finally, note that Google returns at most 1000 results for a query. Past 1000 it shows a message to the effect of "Sorry, Google does not return more than 1000 results for any query." Baidu returns at most 700 results. It is therefore best to search with as many keywords as possible, so that the result range is narrowed.
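
Because of these caps, there is no point in requesting pages beyond the limit. The following sketch computes the paging offsets (the pn or start values) to request, given a page size and a result cap. The class name ResultPaging is hypothetical; the caps of 700 and 1000 are the figures quoted in this article and may have changed since.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the offsets worth requesting under a result cap.
class ResultPaging {

    // Returns 0, pageSize, 2*pageSize, ... for every offset strictly below maxResults.
    static List<Integer> offsets(int pageSize, int maxResults) {
        if (pageSize <= 0 || maxResults < 0) {
            throw new IllegalArgumentException("pageSize must be positive and maxResults non-negative");
        }
        List<Integer> result = new ArrayList<>();
        for (int offset = 0; offset < maxResults; offset += pageSize) {
            result.add(offset);
        }
        return result;
    }

    public static void main(String[] args) {
        // With 20 results per page and a cap of 700, this yields 35 offsets: 0, 20, ..., 680.
        List<Integer> baidu = offsets(20, 700);
        System.out.println(baidu.size() + " pages, last offset " + baidu.get(baidu.size() - 1));
    }
}
```

With 20 results per page, a loop of 100 iterations would ask for 2000 results, so bounding the loop this way saves requests that cannot return anything new.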

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
