This article is not about how to use a search engine yourself, but about how to make a program use search engines to collect URLs. What is that good for? Quite a lot. Some people on the internet sell databases of web addresses, such as software submission sites, e-mail addresses, forum sites, and industry sites. Where do these lists come from? Collecting them by hand is not feasible; they are gathered by programs that query search engines. If you need this kind of URL data, follow along; it is quite simple.
The examples in this article are written in Java and target the Google and Baidu search engines.
We will use two search rules supported by Google and Baidu: keyword search and inurl search. An inurl search matches pages whose URL itself contains the keyword. For example, http://www.xxx.com/post.asp contains the keyword post.asp, and the rule to enter in the search engine is inurl:post.asp. This is the key to collecting URLs, because many URLs carry specific information in themselves. The URLs of software submission pages, for example, tend to contain words such as publish, submit, or tuijian (pinyin for "recommend"), as in http://www.xxx.com/publish.asp; such a URL is most likely an information submission page. Combined with keywords that the page itself may contain, these rules let the search engine produce the result pages. Our program then retrieves those results, parses the HTML, discards the useless information, and writes the useful URLs to a file or database, where other applications or people can use them.
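To make the inurl rule concrete, here is a minimal sketch of how a program might compose such a query and encode it for use in a request URL. The class and method names (`QueryBuilder`, `buildQuery`, `encode`) are hypothetical and not from the original article; only `java.net.URLEncoder` is a standard library call.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryBuilder {
    // Joins free-text keywords with an inurl: restriction into one query string.
    // (Hypothetical helper, for illustration only.)
    static String buildQuery(String keywords, String urlFragment) {
        return keywords.trim() + " inurl:" + urlFragment;
    }

    // URL-encodes the query so it can be used as a request parameter value.
    // Spaces become '+', and reserved characters such as ':' are percent-encoded.
    static String encode(String query, String charset) {
        try {
            return URLEncoder.encode(query, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String query = buildQuery("software release", "publish.asp");
        System.out.println(query);                  // software release inurl:publish.asp
        System.out.println(encode(query, "UTF-8")); // software+release+inurl%3Apublish.asp
    }
}
```

The charset passed to `encode` should match whatever encoding the target search engine expects for its query parameter.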
The first step is to retrieve the search results with a program. Take Baidu as an example. Suppose we want to find software submission pages, using the keywords "软件发布 版本 inurl:publish.asp" (software release, version). First open Baidu in a browser, enter the keywords, and submit. The address bar then shows http://www.baidu.com/s?ie=gb2312&bs=%C8%ED%BC%FE%B7%A2%B2%BC+%C8%ED%BC%FE%B0%E6%B1%BE+inurl%3Apublish.asp&sr=&z=&cl=3&f=8&wd=%c8%ed%bc%fe%b7%a2%b2%bc+%b0%e6%b1%be+inurl%3apublish.asp&ct=0. The Chinese keywords have all been percent-encoded. That does not matter; we can also write the Chinese directly in the program, joining multiple keywords with +. After removing the parameters we do not need, the address can be simplified to http://www.baidu.com/s?lm=0&si=&rn=20&ie=gb2312&ct=0&wd=软件发布+版本+inurl%3Apublish%2Easp&pn=0&cl=0, where rn is the number of results shown per page, wd is the search keywords, and pn is the index of the first result to show. pn is the variable our program advances as it loops through the results, increasing by 20 on each iteration. We simulate this search in Java; the key classes are java.net.HttpURLConnection and java.net.URL. First we write a class that submits the search. The key code is as follows:
class Search {
    public URL url;
    public HttpURLConnection http;
    public java.io.InputStream urlstream;
    ......
    for (int i = 0; i < 100; i++) {
        ......
        try {
            url = new URL("http://www.baidu.com/s?lm=0&si=&rn=20&ie=gb2312&ct=0&wd=软件发布+版本+inurl%3Apublish%2Easp&pn=" + beginrecord + "&cl=0");
        } catch (Exception ef) {}
        try {
            http = (HttpURLConnection) url.openConnection();
            http.connect();
            urlstream = http.getInputStream();
        } catch (Exception ef) {}
        java.io.BufferedReader l_reader = new java.io.BufferedReader(new java.io.InputStreamReader(urlstream));
        try {
            while ((currentLine = l_reader.readLine()) != null) {
                totalstring += currentLine;
            }
        } catch (IOException ex3) {}
        ....
        // The results of this search are now in totalstring as HTML and need to be parsed in the next step.
    }
}
Google is slightly different: it performs some browser detection, and the encoding is also different. The URL is http://www.google.com/search?q=软件发布+版本+inurl:publish.asp&hl=zh-CN&lr=&newwindow=1&start=0&sa=N&ie=UTF-8, where ie=UTF-8 is the encoding and start is the index of the first record to show. Note that Google checks the browser; if the browser does not meet its requirements, it returns an error code. So when simulating a browser submission, we need to add one line of code that sets the User-Agent HTTP property to a common browser, such as Mozilla/4.0. The code is as follows:
try {
    http = (HttpURLConnection) url.openConnection();
    http.setRequestProperty("User-Agent", "Mozilla/4.0");
    http.connect();
    urlstream = http.getInputStream();
} catch (Exception ef) {}
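The fragments above can be pulled together into one self-contained sketch. It is an illustration under assumptions, not the author's full program: the names `PageFetcher`, `pageUrl`, and `fetch` are hypothetical, the timeouts are arbitrary, and whether the search engine still accepts these exact parameters today is not something this sketch can confirm. The Baidu parameters (rn, wd, pn) and the User-Agent value are the ones described in the article.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

class PageFetcher {
    // Builds the result-page URL for a given offset (pn), using the
    // parameter layout quoted in the article. encodedQuery must already
    // be URL-encoded.
    static String pageUrl(String encodedQuery, int pageSize, int offset) {
        return "http://www.baidu.com/s?lm=0&si=&rn=" + pageSize
                + "&ie=gb2312&ct=0&wd=" + encodedQuery
                + "&pn=" + offset + "&cl=0";
    }

    // Downloads one page and returns its HTML as a single string.
    // Errors are propagated instead of being swallowed.
    static String fetch(String address, String charset) throws IOException {
        HttpURLConnection http = (HttpURLConnection) new URL(address).openConnection();
        http.setRequestProperty("User-Agent", "Mozilla/4.0");
        http.setConnectTimeout(10000); // assumed value, in milliseconds
        http.setReadTimeout(10000);    // assumed value, in milliseconds
        StringBuilder html = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(http.getInputStream(), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        } finally {
            http.disconnect();
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        String query = "inurl%3Apublish%2Easp";
        // Walk the first three result pages, 20 results at a time.
        for (int offset = 0; offset < 60; offset += 20) {
            String html = fetch(pageUrl(query, 20, offset), "gb2312");
            System.out.println(offset + ": " + html.length() + " characters");
        }
    }
}
```

Two choices here differ from the original fragments: a `StringBuilder` replaces repeated string concatenation, and the response is decoded with an explicit charset rather than the platform default, which matters for a gb2312 page.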
The second step is to analyze the retrieved HTML, extract the useful URLs, and write them to a file or database. The search engines mix page snapshots (cached copies) and similar-page links into the HTML, and we want to discard those. The key is to find the pattern: on Baidu, the snapshot links and other unwanted addresses contain the keyword baidu, while on Google the useless URLs contain the keywords google and cache. We filter out the useless URLs based on these keywords. To parse the string in Java we use the java.util.StringTokenizer class, which splits a string on specified delimiters, together with java.util.regex.Pattern and java.util.regex.Matcher to match strings. The key code is as follows:
class CompareStr {
    public boolean comparestring(String oristring, String tostring) {
        Pattern p = null; // the regular expression
        Matcher m = null; // the string being matched
        boolean b;
        p = Pattern.compile(oristring, Pattern.CASE_INSENSITIVE);
        m = p.matcher(tostring);
        b = m.find();
        return b;
    }
}
class AnalyUrl {
    ......
    StringTokenizer token = new StringTokenizer(totalstring, " <> \"");
    String firstword;
    CompareStr compstr = new CompareStr();
    String dsturl = null;
    while (token.hasMoreTokens()) {
        firstword = token.nextToken();
        if (!compstr.comparestring("google.com", firstword) && !compstr.comparestring("cache", firstword)) {
            if (firstword.length() > 7) {
                dsturl = firstword.substring(6, firstword.length() - 1);
                WriteUrl(dsturl); // URL extracted successfully; record it to the file
            }
        }
    }
}
With the programs above we can collect URLs ourselves. We can also write another application that analyzes the collected pages further and extracts whatever information we need; that is not covered here, since the principle is the same. Finally, note that Google returns no more than 1000 results for a query; beyond 1000 it simply shows a notice saying so. Baidu returns no more than 700. We should therefore search with as many keywords as possible, so that each query covers a narrower range of results.
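The tokenizer code in step two depends on fixed substring offsets, which are easy to get wrong when the markup changes. As a variation, not the author's method, the same filtering idea can be sketched with a regular expression that captures the value of each href attribute. The class name `LinkExtractor`, the method `extract`, and the pattern itself are assumptions made for illustration; real result pages may need a different pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Matches href="http://..." or href='https://...' and captures the URL.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*[\"'](https?://[^\"'\\s>]+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns every absolute link in the HTML that contains none of the
    // excluded substrings (compared case-insensitively).
    static List<String> extract(String html, String... excluded) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String url = m.group(1);
            boolean keep = true;
            for (String word : excluded) {
                if (url.toLowerCase().contains(word.toLowerCase())) {
                    keep = false;
                    break;
                }
            }
            if (keep) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.xxx.com/publish.asp\">A</a>"
                + "<a href=\"http://www.google.com/search?q=cache:abc\">Cached</a>";
        // The second link is dropped because it contains "google.com".
        System.out.println(extract(html, "google.com", "cache")); // [http://www.xxx.com/publish.asp]
    }
}
```

For Baidu result pages, the excluded keyword would be baidu instead, following the filtering rule described in step two.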