I. Preface
I am building a tool that checks Baidu indexing status: it bulk-queries whether a set of article links has been indexed by Baidu. Its main use is to look up how many pages are indexed for a given URL or site. The idea is borrowed from other people's work.
II. Problem description
The first thing to decide is which search engines to support: Baidu first, then Bing, Sogou, Soso, and 360. I originally wanted to support Google as well, but on reflection it is hard to access, so it is left out for now. What the tool actually has to do is, given a URL, retrieve the number of pages of that site indexed by each search engine, plus the URL's ranking under different keywords. The input is a single URL and a set of keywords; the output is the indexed count in each search engine and the ranking under each keyword.
There is one problem here, which is how deep to search for the ranking. If the URL appears within the first 100 results, that is fine. If it ranks much later, the user has to wait a long time to see anything, even though the user probably only cares about the exact position within the top 100. Anything beyond that can simply be reported as "after 100". These points need to be settled in advance so that the later code is easier to write.
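The top-100 cutoff described above can be sketched as follows. This is only an illustration in Python (not the .NET code the article has in mind); `find_rank`, `format_rank`, and the `example.com` URLs are hypothetical names made up for the example. The result list is consumed lazily, so no more than `limit` results are ever inspected.

```python
from itertools import islice
from typing import Iterable, Optional


def find_rank(result_urls: Iterable[str], target: str, limit: int = 100) -> Optional[int]:
    """Return the 1-based rank of the first result containing `target`.

    Only the first `limit` results are examined; None means "not in the top `limit`".
    """
    for rank, url in enumerate(islice(result_urls, limit), start=1):
        if target in url:
            return rank
    return None


def format_rank(rank: Optional[int], limit: int = 100) -> str:
    """Render an exact rank, or 'after N' when the URL was not found in time."""
    return str(rank) if rank is not None else f"after {limit}"


# Hypothetical result stream standing in for paged search results.
results = (f"http://example.com/{i}" for i in range(1, 1000))
print(format_rank(find_rank(results, "example.com/42")))
```

Because the input is an iterator, a real implementation could feed it from a generator that downloads one result page at a time, and the search would stop requesting pages once the limit is reached.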
III. Approach
Many people will already have thought of the approach: use WebClient to download the page, use regular expressions to pull out the parts we are interested in, and then process them in the program. The key difficulty lies in writing those regular expressions.
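A minimal sketch of this download-then-regex pipeline, written in Python with `urllib.request` playing the role of WebClient. The sample HTML and the pattern are invented for illustration and do not reflect any real search engine's markup; `download` and `extract_count` are hypothetical helper names.

```python
import re
import urllib.request
from typing import Optional


def download(url: str, timeout: float = 10.0) -> str:
    """Fetch a page and return its body as text."""
    # A browser-like User-Agent is an assumption; some sites reject the default one.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_count(html: str, pattern: str) -> Optional[int]:
    """Return the first captured group as an integer, ignoring thousands separators."""
    match = re.search(pattern, html)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


# Hypothetical markup and pattern, for illustration only.
SAMPLE_HTML = '<div class="count">About 1,234 results</div>'
SAMPLE_PATTERN = r"About ([\d,]+) results"

print(extract_count(SAMPLE_HTML, SAMPLE_PATTERN))
```

Keeping the download and the extraction in separate functions means the regular expressions can be tested against saved pages without touching the network.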
IV. Number of indexed pages
Start with the number of indexed pages for a site. Enter site:www.cnblogs.com/ in Baidu and you will see the following page:
The number we need to extract is 5,280,000. Next, look at the page elements:
The other search engines turn out to be similar, so the approach should be clear by now. What remains is how to build the query URL. Looking at the address bar, the fragment ?wd=site%3Awww.cnblogs.com%2F shows how it should be written.
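The encoded fragment in the address bar is just the query `site:www.cnblogs.com/` with `:` and `/` percent-encoded. A small Python sketch of that step, where `build_site_query` is a hypothetical helper name:

```python
from urllib.parse import quote


def build_site_query(site: str) -> str:
    """Build the '?wd=...' fragment for a site: query, percent-encoding ':' and '/'."""
    # safe="" forces '/' to be encoded too; by default quote() leaves it alone.
    return "?wd=" + quote(f"site:{site}", safe="")


print(build_site_query("www.cnblogs.com/"))
```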
Hold on, though. It is tempting to rush straight into one implementation, but then the calling code cannot be unified and adding engines later becomes harder. So we define an abstract class that specifies the functions every implementation must provide. That way all implementations can be used in the same manner, and new search engines are easy to add in the future. This is the Strategy pattern, and we will work through the contents of the abstract class step by step.
First, each implementation of the abstract class corresponds to one specific search engine, so it needs a base URL with a placeholder left in it. From the Baidu example above, we get this string:
http://www.baidu.com/s?wd=site%3A{0}
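One possible shape for such an abstract class, sketched in Python. The class and method names (`SearchEngine`, `site_query_url`, `parse_indexed_count`) and the `count_pattern` regular expressions are assumptions made for this example, not the author's actual design; the patterns are placeholders rather than real page markup, and `ExampleEngine` is a made-up second engine that only shows how another strategy plugs in.

```python
import re
from abc import ABC, abstractmethod
from typing import Optional
from urllib.parse import quote


class SearchEngine(ABC):
    """Strategy interface: one subclass per search engine."""

    name: str
    site_url_template: str  # "{0}" is replaced by the percent-encoded site

    def site_query_url(self, site: str) -> str:
        """Fill the placeholder in the base URL with the encoded site."""
        return self.site_url_template.format(quote(site, safe=""))

    @abstractmethod
    def parse_indexed_count(self, html: str) -> Optional[int]:
        """Extract the indexed-page count from a downloaded result page."""


class Baidu(SearchEngine):
    name = "Baidu"
    site_url_template = "http://www.baidu.com/s?wd=site%3A{0}"
    count_pattern = r"(\d[\d,]*)"  # placeholder: first number on the page

    def parse_indexed_count(self, html: str) -> Optional[int]:
        match = re.search(self.count_pattern, html)
        return int(match.group(1).replace(",", "")) if match else None


class ExampleEngine(SearchEngine):
    """Hypothetical second engine, present only to illustrate the pattern."""

    name = "Example"
    site_url_template = "http://search.example.com/?q=site%3A{0}"

    def parse_indexed_count(self, html: str) -> Optional[int]:
        match = re.search(r"total=(\d+)", html)
        return int(match.group(1)) if match else None


# Calling code treats every engine the same way.
for engine in (Baidu(), ExampleEngine()):
    print(engine.name, engine.site_query_url("www.cnblogs.com/"))
```

The caller only depends on `SearchEngine`, so supporting another engine means adding one subclass with its own template and parsing rule.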
Question: I am building a tool that checks Baidu indexing status by bulk-querying whether article links have been indexed by Baidu. As the title says, the page address I am crawling returns one result when entered directly in the browser, but HttpWebRequest retrieves something different, and I have no idea why. Please help.
The content HttpWebRequest fetches does not match what I get by entering the URL directly in the browser. Experts, please help!