Using search engines to collect website URLs: a Java implementation

Source: Internet
Author: User

This article is not about how to use a search engine by hand, but about how to get a program to use search engines to collect URLs. Some people on the Internet sell databases of website addresses: software release sites, e-mail addresses, forum sites, industry sites. Where do these lists come from? They cannot realistically be collected by hand; they are gathered by programs that query search engines. If you need this kind of website data, follow along; it is quite simple.

The examples in this article are written in Java and target the Google and Baidu search engines.

We rely on two search rules that both Google and Baidu support: keyword search and inurl search. An inurl search matches a keyword against the URL itself. For example, the URL http://www.xxx.com/post.asp contains the keyword post.asp, and the corresponding search rule is inurl:post.asp. This is the key to collecting URLs, because the URL itself often carries specific information. The submission pages of software sites, for example, tend to have URLs containing words such as publish, submit, or tuijian (pinyin for "recommend"), as in http://www.xxx.com/publish.asp.

Combining an inurl rule with keywords that the page itself is likely to contain gives a focused set of search results. The program then retrieves those results, parses the HTML pages, discards the useless information, and writes the useful URLs to a file or database, where other applications or people can use them.
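The query described above can be assembled in a few lines. The following is only a minimal sketch: the class name `InurlQuery` and its methods `build` and `encode` are hypothetical names chosen for illustration, not part of the original program. It joins the page keywords with spaces, appends the inurl rule, and URL-encodes the result so it can be used as a parameter value.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper (not from the original article): builds a
// "keywords + inurl:" query and encodes it for use in a search URL.
class InurlQuery {

    // Joins the plain keywords with spaces and appends the inurl rule.
    static String build(String urlFragment, String... keywords) {
        StringBuilder sb = new StringBuilder();
        for (String keyword : keywords) {
            sb.append(keyword).append(' ');
        }
        return sb.append("inurl:").append(urlFragment).toString();
    }

    // URL-encodes the query in the given charset (spaces become '+').
    static String encode(String query, String charset) {
        try {
            return URLEncoder.encode(query, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String query = build("publish.asp", "software release", "version");
        System.out.println(query);                  // software release version inurl:publish.asp
        System.out.println(encode(query, "UTF-8")); // software+release+version+inurl%3Apublish.asp
    }
}
```

The charset passed to `encode` should match the encoding the search engine is told to expect in the request.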

The first step is to retrieve the search results with a program. Take Baidu as the first example. Suppose we want to find pages where software can be published, using the keywords "software release version inurl:publish.asp". Open Baidu, type in the keywords, and submit. The address bar then shows http://www.baidu.com/s?ie=gb2312&bs=%C8%ED%BC%FE%B7%A2%B2%BC+%C8%ED%BC%FE%B0%E6%B1%BE+inurl%3Apublish.asp&sr=&z=&cl=3&f=8&wd=%c8%ed%bc%fe%b7%a2%b2%bc+%b0%e6%b1%be+inurl%3apublish.asp&ct=0. The Chinese keywords have all been percent-encoded, which does not matter: the program can start from the plain keywords and encode them itself. Multiple keywords are joined with +.

After removing the parameters that are not needed, the address can be reduced to http://www.baidu.com/s?lm=0&si=&rn=20&ie=gb2312&ct=0&wd=software+release+version+inurl%3apublish%2easp&pn=0&cl=0. Here rn is the number of results shown per page, wd is the keyword string being searched for, and pn is the index of the first result to show. pn is the variable our program loops over to walk through the results, increasing by 20 on each iteration.

We simulate this search process in a Java program. The key classes are java.net.HttpURLConnection and java.net.URL. First we write a class that submits the search; the key code is as follows:

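import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

class Search
{
    public URL url;
    public HttpURLConnection http;
    public InputStream urlStream;
    .....
    // Fetch 100 result pages, 20 results per page (pn = 0, 20, 40, ...).
    for (int i = 0; i < 100; i++)
    {
        .....
        int beginRecord = i * 20;   // value of pn for this page
        StringBuilder totalString = new StringBuilder();
        try {
            // The URL needs the http:// scheme, and the keywords are encoded to match ie=gb2312.
            String wd = URLEncoder.encode("software release version inurl:publish.asp", "GB2312");
            url = new URL("http://www.baidu.com/s?lm=0&si=&rn=20&ie=gb2312&ct=0&wd=" + wd
                    + "&pn=" + beginRecord + "&cl=0");
            http = (HttpURLConnection) url.openConnection();
            http.connect();
            urlStream = http.getInputStream();
            // Read the response line by line and accumulate it.
            BufferedReader reader = new BufferedReader(new InputStreamReader(urlStream, "GB2312"));
            String currentLine;
            while ((currentLine = reader.readLine()) != null) {
                totalString.append(currentLine);
            }
            reader.close();
        } catch (IOException ex) {
            // Report the failure instead of silently swallowing it, then go on to the next page.
            ex.printStackTrace();
        }
        ....
        // totalString now holds the HTML of this result page; it is analyzed in the next step.
    }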
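The analysis step, extracting the useful URLs from the retrieved HTML and discarding the rest, could look roughly like the sketch below. It is an illustration under stated assumptions, not the author's parser: the class `LinkExtractor` and its `extract` method are hypothetical, and the sketch assumes that result links appear as plain absolute URLs in href attributes. Real result pages may wrap links in redirect URLs or other markup, in which case the pattern would need adjusting.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the original article): pulls absolute
// links out of an HTML string and keeps those whose URL contains a keyword.
class LinkExtractor {

    // Matches href="..." or href='...' holding an absolute http(s) URL.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*[\"'](https?://[^\"'\\s>]+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the distinct matching links in the order they first appear.
    static List<String> extract(String html, String urlKeyword) {
        String wanted = urlKeyword.toLowerCase(Locale.ROOT);
        Set<String> seen = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String link = m.group(1);
            if (link.toLowerCase(Locale.ROOT).contains(wanted)) {
                seen.add(link);
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.xxx.com/publish.asp\">A</a>"
                + "<a href='http://www.yyy.com/index.html'>B</a>";
        for (String link : extract(html, "publish.asp")) {
            System.out.println(link);
        }
    }
}
```

A regular expression keeps the sketch short; for messy real-world pages a proper HTML parser is usually the more robust choice. The returned list can then be written to a file or database.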

Google works slightly differently: it performs some browser detection, and the encoding is different too. The URL is http://www.google.com/search?q=software+release+version+inurl:publish.asp&hl=zh-CN&lr=&newwindow=1&start=0&sa=N&ie=utf-8, where ie=utf-8 sets the encoding and start is the index of the first record to show. Note that Google checks the browser; if the browser does not meet its requirements, it returns an error code. So when simulating a browser submission we need to add one line of code that sets the User-Agent HTTP request property to that of a common browser, such as Mozilla/4.0. The code is as follows:

try {
    http = (HttpURLConnection) url.openConnection();
    // Present a common browser User-Agent; this must be set before connecting.
    http.setRequestProperty("User-Agent", "Mozilla/4.0");
    http.connect();
    urlStream = http.getInputStream();
} catch (IOException ex) {
    ex.printStackTrace();
}
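Putting the Google-specific pieces together, the sketch below builds a search URL from the q, ie, and start parameters mentioned above and prepares a connection with the browser User-Agent set. The class `GoogleRequest` and its methods `searchUrl` and `prepare` are hypothetical names for illustration, and the other parameters from the example URL (hl, lr, newwindow, sa) are left out on the assumption that they are optional. `prepare` only configures the connection; no request is sent until `connect()` or `getInputStream()` is called.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Hypothetical helper (not from the original article): builds a Google
// search URL and prepares a connection that identifies as a browser.
class GoogleRequest {

    // Encodes the query as UTF-8 and sets the index of the first result.
    static String searchUrl(String query, int start) {
        try {
            return "http://www.google.com/search?q=" + URLEncoder.encode(query, "UTF-8")
                    + "&ie=utf-8&start=" + start;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Opens (but does not yet connect) a connection with a browser User-Agent.
    static HttpURLConnection prepare(String url) {
        try {
            HttpURLConnection http = (HttpURLConnection) new URL(url).openConnection();
            http.setRequestProperty("User-Agent", "Mozilla/4.0");
            return http;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String url = searchUrl("software release version inurl:publish.asp", 0);
        HttpURLConnection http = prepare(url);
        System.out.println(url);
        System.out.println(http.getRequestProperty("User-Agent"));
    }
}
```

The response can then be read from `http.getInputStream()` in the same way as in the Baidu example, using UTF-8 as the reader charset.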
