What I want to cover here is not how to use a search engine yourself, but how to get a program to use search engines to collect URLs. It is very useful. Some people on the internet sell website databases: lists of software-release sites, e-mail addresses, forum sites, industry sites. Where do those lists come from? Nobody collects them by hand; a program pulls them out of search engines. If you need that kind of website data, follow along with me. It is quite simple.
This article uses the Java language and targets the Google and Baidu search engines.
We will use two search rules that both Google and Baidu support: keyword search and inurl search. An inurl search matches keywords in the URL itself. For example, http://www.xxx.com/post.asp contains the keyword post.asp, so the rule to type into the search engine is inurl:post.asp. This is the key to collecting URLs, because many URLs carry specific information themselves. The URLs of software submission pages, for instance, often contain words such as publish, submit, or tuijian (pinyin for "recommend"), as in http://www.xxx.com/publish.asp; a URL like that is most likely an information-publishing page. Combine the inurl rule with keywords that the page itself is likely to contain and let the search engine do the searching. Our program then retrieves the results, parses the HTML pages, strips out the useless information, and writes the useful URLs to a file or database, where other programs or people can use them.
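The combination of keywords and an inurl rule described above can be illustrated with a few lines of Java. This is only a minimal sketch: the class InurlQuery and its methods compose and urlContains are hypothetical names introduced for illustration, and urlContains is a local approximation of what the inurl rule asks the search engine to match, not the engine's actual matching logic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the article's Search class.
final class InurlQuery {

    // Joins the free-text keywords with spaces and appends the inurl rule.
    static String compose(List<String> keywords, String urlFragment) {
        return String.join(" ", keywords) + " inurl:" + urlFragment;
    }

    // Assumption: a case-insensitive substring test is close enough to
    // "the URL itself contains the keyword" for filtering results locally.
    static boolean urlContains(String url, String urlFragment) {
        return url.toLowerCase(Locale.ROOT)
                  .contains(urlFragment.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // prints: software release version inurl:publish.asp
        System.out.println(compose(Arrays.asList("software release", "version"), "publish.asp"));
        // prints: true
        System.out.println(urlContains("http://www.xxx.com/publish.asp", "publish.asp"));
    }
}
```

The same local check can be reused later to discard any result whose URL does not actually contain the fragment.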
The first step is to retrieve the search results with a program. Take Baidu first. Suppose we want to find software-release pages, using the keywords "软件发布 版本 inurl:publish.asp" (that is, "software release", "version", plus the inurl rule). Open Baidu, type in the keywords, and submit; the address bar will show http://www.baidu.com/s?ie=gb2312&bs=%C8%ED%BC%FE%B7%A2%B2%BC+%C8%ED%BC%FE%B0%E6%B1%BE+inurl%3Apublish.asp&sr=&z=&cl=3&f=8&wd=%c8%ed%bc%fe%b7%a2%b2%bc+%b0%e6%b1%be+inurl%3apublish.asp&ct=0. The Chinese keywords have all been percent-encoded. That does not matter; we can also use the Chinese directly in the program, joining multiple keywords with +. After removing the parameters we do not need, the address can be reduced to http://www.baidu.com/s?lm=0&si=&rn=20&ie=gb2312&ct=0&wd=软件发布+版本+inurl%3apublish%2easp&pn=0&cl=0, where rn is the number of results shown per page, wd is the search keywords, and pn is the index of the first result to show. pn is the variable our program loops over to walk through the results, increasing by 20 on each pass. We simulate this search in a Java program; the key classes are java.net.HttpURLConnection and java.net.URL. First write a class that submits the search. The key code is as follows:
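import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

class Search
{
    public URL url;
    public HttpURLConnection http;
    public InputStream urlStream;
    .....
    for (int i = 0; i < 100; i++)
    {
        .....
        // beginRecord (the pn offset, advancing by 20 per page), currentLine and
        // totalString are assumed to be declared in the elided code.
        try {
            // wd= carries the keywords percent-encoded in GB2312, exactly as in the address-bar URL.
            url = new URL("http://www.baidu.com/s?lm=0&si=&rn=20&ie=gb2312&ct=0&wd=%c8%ed%bc%fe%b7%a2%b2%bc+%b0%e6%b1%be+inurl%3apublish%2easp&pn=" + beginRecord + "&cl=0");
        } catch (Exception ef) {
            ef.printStackTrace();
            continue; // malformed URL: skip this page
        }
        try {
            http = (HttpURLConnection) url.openConnection();
            http.connect();
            urlStream = http.getInputStream();
        } catch (Exception ef) {
            ef.printStackTrace();
            continue; // request failed: there is no stream to read
        }
        BufferedReader l_reader = new BufferedReader(new InputStreamReader(urlStream));
        try {
            // Append every line of the response to totalString.
            while ((currentLine = l_reader.readLine()) != null) {
                totalString += currentLine;
            }
        } catch (IOException ex3) {
            ex3.printStackTrace();
        }
        ....
        // The search result page is now in totalString as raw HTML; it is analyzed in the next step.
    }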
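The pn paging described above can also be sketched as a small URL builder. This is a minimal sketch rather than the article's own class: BaiduSearchUrl, forPage and RESULTS_PER_PAGE are hypothetical names, and it uses java.net.URLEncoder to percent-encode the keywords instead of pasting them into the string. The example passes UTF-8 so that it runs on any JVM; passing "gb2312" would match the ie=gb2312 URL above, assuming that charset is available in your Java runtime.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that builds one result-page URL per value of pn.
final class BaiduSearchUrl {

    static final int RESULTS_PER_PAGE = 20; // sent as rn

    // page is zero-based; pn is the index of the first result on that page.
    static String forPage(String query, int page, String charset) {
        if (page < 0) {
            throw new IllegalArgumentException("page must not be negative");
        }
        String encoded;
        try {
            // Spaces become '+', and ':' becomes %3A.
            encoded = URLEncoder.encode(query, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
        int pn = page * RESULTS_PER_PAGE;
        return "http://www.baidu.com/s?lm=0&si=&rn=" + RESULTS_PER_PAGE
                + "&ie=" + charset
                + "&ct=0&wd=" + encoded
                + "&pn=" + pn + "&cl=0";
    }

    public static void main(String[] args) {
        // Prints the URLs for pn=0, pn=20 and pn=40.
        for (int page = 0; page < 3; page++) {
            System.out.println(forPage("software release version inurl:publish.asp", page, "UTF-8"));
        }
    }
}
```

Encoding through URLEncoder keeps the request valid whatever characters the keywords contain, which is why the sketch does not concatenate them raw.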
Google is slightly different. It performs some browser detection, and the encoding differs too. The URL is http://www.google.com/search?q=软件发布+版本+inurl:publish.asp&hl=zh-CN&lr=&newwindow=1&start=0&sa=N&ie=utf-8, where ie=utf-8 sets the encoding and start is the index of the first result to show. Note that Google checks the browser: if the browser does not meet its requirements, it returns an error code. So when simulating a browser submission we need to add one line of code that sets the User-Agent HTTP request property to a common browser value such as Mozilla/4.0. The code is as follows:
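try {
    http = (HttpURLConnection) url.openConnection();
    // Identify as a common browser; this must be set before connecting.
    http.setRequestProperty("User-Agent", "Mozilla/4.0");
    http.connect();
    urlStream = http.getInputStream();
} catch (Exception ef) {
    ef.printStackTrace(); // report the failure instead of silently ignoring it
}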
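Putting the Google URL and the User-Agent line together, a self-contained version might look like the sketch below. It is written under assumptions, not taken from the article: PageFetcher, googleUrl and fetch are hypothetical names, the query parameters are copied from the URL quoted above, the timeouts are arbitrary, and the response is assumed to be UTF-8 because the request asks for ie=utf-8. Whether a search engine still answers such a request today is not something this sketch can promise.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper combining URL construction and a browser-like request.
final class PageFetcher {

    // start is the index of the first result to show (0, 10, 20, ...).
    static String googleUrl(String query, int start) {
        try {
            return "http://www.google.com/search?q=" + URLEncoder.encode(query, "UTF-8")
                    + "&hl=zh-CN&lr=&newwindow=1&start=" + start + "&sa=N&ie=utf-8";
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Downloads a page as text, sending the given User-Agent header.
    static String fetch(String address, String userAgent) throws IOException {
        HttpURLConnection http = (HttpURLConnection) new URL(address).openConnection();
        http.setRequestProperty("User-Agent", userAgent); // must precede the connection
        http.setConnectTimeout(10000); // arbitrary 10 s limits
        http.setReadTimeout(10000);
        StringBuilder html = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(http.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        } finally {
            http.disconnect();
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        String url = googleUrl("software release version inurl:publish.asp", 0);
        System.out.println(url);
        // Only touch the network when asked to: pass any argument to fetch the page.
        if (args.length > 0) {
            System.out.println(fetch(url, "Mozilla/4.0").length() + " characters received");
        }
    }
}
```

The try-with-resources block closes the reader even when reading fails, and StringBuilder avoids the repeated string copying that += causes inside a loop.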