Things About Search Engines (Web Traversal)

Last updated: 2018-12-04 Source: Internet

Uploaded by: User


[Statement: All rights reserved. Reprints are welcome; please do not use for commercial purposes. Contact email: feixiaoxing @163.com]

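Writing a search engine is a fun thing for me. I can't claim to do it especially well, but I can at least build one step by step. How well it turns out is for you to judge. Before getting to today's topic, let's digress a little onto some products derived from search engines, which are quite interesting.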

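Web pages today carry a great deal of information. Besides the character data we all know, there is plenty of other file content, such as PDF files, video files, image files, and audio files, so matching search products can be built for each kind: image search, video search, blog search, mp3 search, and so on. When I was in university, mp3 was just becoming popular, and nearly everyone was used to searching for mp3 files with Baidu, which was simple and convenient. Later, after Xunlei appeared, people got used to searching for video files with Gougou. Resources were so plentiful then that, as the saying went, there was only what you didn't want, never what you couldn't find. Later still, because of the introduction of keywords, IP protection, and some other reasons, Gougou's resources dwindled until it was on the verge of being abandoned. By then it was no longer a question of technology.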

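Anyone used to browsing the web knows that when a piece of text on a page has a more detailed explanation behind it, the mouse pointer turns into a hand as it moves over that text, hinting that it can be clicked. This is what we call a hyperlink. With hyperlinks we can go on to visit more pages and learn more. Because pages are connected to one another by hyperlinks, we can crawl along these links and traverse all pages. In theory this is entirely feasible; whether it is necessary in practice is another matter.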
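As a rough illustration of following links from page to page, the sketch below runs a breadth-first traversal over a tiny in-memory link graph. The graph data and the names links, PAGE_COUNT, NO_LINK, and traverse are made up for this example; a real crawler would download each page and extract its links instead of reading them from a table.

```c
#include <assert.h>
#include <string.h>

#define PAGE_COUNT 5    /* hypothetical number of pages in the toy graph */
#define NO_LINK    (-1) /* marks the end of a page's link list */

/* links[i] lists the pages that page i points to (toy data, not real sites) */
static const int links[PAGE_COUNT][PAGE_COUNT] = {
    {1, 2, NO_LINK},   /* page 0 links to pages 1 and 2 */
    {2, 3, NO_LINK},   /* page 1 links to pages 2 and 3 */
    {0, NO_LINK},      /* page 2 links back to page 0 */
    {4, NO_LINK},      /* page 3 links to page 4 */
    {NO_LINK},         /* page 4 has no outgoing links */
};

/* Visit every page reachable from `start`, breadth first.
 * Stores the visit order in `order` and returns the number of pages visited. */
static int traverse(int start, int order[PAGE_COUNT])
{
    int visited[PAGE_COUNT];
    int queue[PAGE_COUNT];   /* each page is enqueued at most once */
    int head = 0, tail = 0, count = 0;
    int i;

    assert(start >= 0 && start < PAGE_COUNT);
    memset(visited, 0, sizeof(visited));

    visited[start] = 1;
    queue[tail++] = start;

    while (head < tail) {
        int page = queue[head++];
        order[count++] = page;

        for (i = 0; i < PAGE_COUNT && links[page][i] != NO_LINK; i++) {
            int target = links[page][i];
            if (!visited[target]) {   /* skip pages already queued or visited */
                visited[target] = 1;
                queue[tail++] = target;
            }
        }
    }
    return count;
}
```

Calling traverse(0, order) on this toy graph reaches all five pages, each exactly once, even though page 2 links back to page 0. The visited array is what keeps the loop from going around in circles.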

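There are many ways to find hyperlinks in page content; you could certainly use regular expressions to find all the addresses. I don't know that area well, so my method is cruder: search for http:// and scan forward from there to find the corresponding URL. If the length exceeds a certain limit, I discard it. Many pages now have URLs that are generated by scripts, which makes the job harder. That's all right; we can improve on this and solve it step by step later.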
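The scanning idea can be tried in isolation, without any network code. The following is a simplified, portable sketch rather than the article's own routine: the names next_url, is_url_terminator, and MAX_URL_LENGTH, the choice of terminator characters, and the example.com addresses used to exercise it are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_URL_LENGTH 64  /* assumed cut-off: longer matches are dropped */

/* Returns nonzero if `c` ends a URL in this simplified scan. */
static int is_url_terminator(char c)
{
    return c == '\0' || c == '"' || c == '\'' || c == ')' || c == '>' ||
           c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/* Find the next "http://" URL at or after `text`.
 * On success, copies it into `out` (NUL-terminated) and returns a pointer
 * to the character that ended the match, so the caller can keep scanning.
 * Returns NULL when no further URL is found.
 * Matches of MAX_URL_LENGTH characters or more are skipped. */
static const char* next_url(const char* text, char out[MAX_URL_LENGTH])
{
    static const char prefix[] = "http://";
    const char* start;

    assert(text != NULL && out != NULL);

    while ((start = strstr(text, prefix)) != NULL) {
        const char* end = start + strlen(prefix);
        size_t length;

        while (!is_url_terminator(*end)) {
            end++;
        }
        length = (size_t)(end - start);
        if (length < MAX_URL_LENGTH) {
            memcpy(out, start, length);
            out[length] = '\0';
            return end;
        }
        text = end;  /* too long: discard it and keep looking */
    }
    return NULL;
}
```

Calling next_url repeatedly, each time passing the pointer returned by the previous call, walks through every match in a buffer. Like the approach described above, this only recognizes addresses that literally start with http://, so https links and script-generated URLs are missed.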

    #include <stdio.h>
    #include <stdlib.h>   /* malloc, free */
    #include <string.h>   /* strstr, strlen, memmove, memset */
    #include <windows.h>
    #include <wininet.h>

    /* wingdi.h already defines ERROR, so drop that definition first */
    #ifdef ERROR
    #undef ERROR
    #endif

    #define U8 unsigned char
    #define STATUS unsigned int
    #define OK 0
    #define ERROR (~0U)
    #define MAX_BLOCK_SIZE 1024
    #define MAX_DOMAIN_NAME_LENGTH 64
    #define SAVE_DIR "E:/download/"

    #pragma comment(lib, "wininet.lib")

    static STATUS download_web_page(const char* url, const char* path);

    /* number of pages saved so far, also used to name the next file */
    static int total_number = 0;

    /* seed pages */
    static const char* domain_name[] = {
        "http://www.baidu.com",
        "http://www.sogou.com",
        "http://www.163.com",
        "http://www.sina.com",
        "http://www.sohu.com",
        "http://www.qq.com",
        "http://www.ifeng.com",
        "http://www.z.cn",
        "http://www.360buy.com",
        "http://www.dangdang.com",
        "http://www.zaobao.com",
    };

    /* get length of html file */
    static int get_file_size(const char* path)
    {
        HANDLE hFile;
        int size = 0;

        /* the A variants are used because path is a plain char string */
        hFile = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, 0, NULL);
        if (hFile != INVALID_HANDLE_VALUE)
        {
            DWORD low = GetFileSize(hFile, NULL);
            if (low != INVALID_FILE_SIZE)
            {
                size = (int)low;
            }
            CloseHandle(hFile);
        }

        return size;
    }

    /* get all data from html file; the caller frees *pp_buffer */
    static STATUS get_file_content(const char* path, char** pp_buffer, int* size)
    {
        int length;
        char* buffer;
        FILE* hFile;

        if (NULL == path || NULL == pp_buffer || NULL == size)
        {
            return ERROR;
        }

        length = get_file_size(path);
        if (0 == length)
        {
            return ERROR;
        }

        buffer = (char*)malloc(length + 1);
        if (NULL == buffer)
        {
            return ERROR;
        }

        hFile = fopen(path, "rb");
        if (NULL == hFile)
        {
            free(buffer);
            return ERROR;
        }

        /* terminate after the bytes actually read so strstr can be used later */
        length = (int)fread(buffer, 1, length, hFile);
        buffer[length] = '\0';
        fclose(hFile);

        *pp_buffer = buffer;
        *size = length;
        return OK;
    }

    /* show all http name, sometimes just for debug use */
    static void print_http_name(const char* buffer, int size)
    {
        while (size--)
        {
            printf("%c", *buffer++);
        }
        printf("\n");
    }

    /* copy the url found in the page and download it */
    static void download_linked_page(const char* buffer, int size)
    {
        char* data;
        char name[64];

        print_http_name(buffer, size);

        data = (char*)malloc(size + 1);
        if (NULL == data)
        {
            return;
        }
        data[size] = '\0';
        memmove(data, buffer, size);

        memset(name, 0, 64);
        sprintf(name, SAVE_DIR "%d.html", total_number);
        if (OK == download_web_page(data, name))
        {
            total_number++;
        }

        /* free data memory, which contained http domain name */
        free(data);
    }

    /* get http from html file, then download it by its name */
    static void get_http_and_download(const char* buffer)
    {
        const char* prev;
        const char* next;
        char letter;
        int count;

        if (NULL == buffer)
        {
            return;
        }

        next = buffer;
        while (1)
        {
            next = strstr(next, "http://");
            if (NULL == next)
            {
                break;
            }

            count = MAX_DOMAIN_NAME_LENGTH;
            prev = next;
            next += strlen("http://");

            /* scan until a terminator is found or the length limit is used up */
            while (count)
            {
                count--;
                letter = *next;
                /* '\0' is checked too, so the scan never runs past the buffer */
                if ('\0' == letter || '"' == letter || '\'' == letter ||
                    ')' == letter || '>' == letter)
                {
                    break;
                }
                next++;
            }

            /* count == 0 means the url was too long, so it is dropped */
            if (count)
            {
                download_linked_page(prev, (int)(next - prev));
            }
        }
    }

    /* implement page download */
    static STATUS download_web_page(const char* url, const char* path)
    {
        U8 buffer[MAX_BLOCK_SIZE];
        DWORD iNumber;
        FILE* hFile;
        HINTERNET hSession;
        HINTERNET hUrl;
        STATUS result;

        hSession = InternetOpenA("RookIE/1.0", INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);
        if (NULL == hSession)
        {
            return ERROR;
        }

        hUrl = InternetOpenUrlA(hSession, url, NULL, 0, INTERNET_FLAG_DONT_CACHE, 0);
        if (NULL == hUrl)
        {
            result = ERROR;
            goto error1;
        }

        hFile = fopen(path, "wb");
        if (NULL == hFile)
        {
            result = ERROR;
            goto error2;
        }

        /* stop on a failed read or when no more data arrives */
        while (InternetReadFile(hUrl, buffer, MAX_BLOCK_SIZE, &iNumber) && iNumber > 0)
        {
            fwrite(buffer, sizeof(char), iNumber, hFile);
        }

        fclose(hFile);
        result = OK;

    error2:
        InternetCloseHandle(hUrl);
    error1:
        InternetCloseHandle(hSession);
        return result;
    }

    /* download page and its linked pages */
    void download_page_entry(const char* url)
    {
        char* buffer;
        int size;
        char name[64];

        memset(name, 0, 64);
        sprintf(name, SAVE_DIR "%d.html", total_number++);
        if (OK != download_web_page(url, name))
        {
            return;
        }

        if (OK == get_file_content(name, &buffer, &size))
        {
            get_http_and_download(buffer);
            free(buffer);
        }
    }

    int main(int argc, char* argv[])
    {
        size_t index;

        for (index = 0; index < (sizeof(domain_name) / sizeof(domain_name[0])); index++)
        {
            download_page_entry(domain_name[index]);
        }

        return 0;
    }

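The code above is a little involved; the key part is how get_http_and_download extracts URLs. To check the result, we downloaded the pages of 11 sites together with the pages they link to, a bit over 10,000 pages in total, which took more than two hours. Because no duplicate-URL check is done along the way, many pages were very likely downloaded more than once. That's fine; just look at the basic logic of the code and focus on how pages are traversed and found. Finally, if you want to run this code, first create a directory named download on drive E, which is where the pages are saved, and make sure drive E has at least 1 GB of free space.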
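One way to add the missing duplicate check is to remember every URL already seen and skip the download when a URL shows up again. The sketch below is only one possible approach, not part of the original program: the url_set type, the functions hash_url, url_set_add, and url_set_clear, and the SET_CAPACITY value are all hypothetical names and choices introduced here. It uses a fixed-size hash table with linear probing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SET_CAPACITY 4096  /* assumed table size; keep it well above the URL count */

typedef struct {
    char* slots[SET_CAPACITY];  /* NULL means the slot is empty */
    int   count;                /* number of URLs stored */
} url_set;

/* djb2 string hash */
static unsigned long hash_url(const char* url)
{
    unsigned long hash = 5381;
    while (*url) {
        hash = hash * 33 + (unsigned char)*url++;
    }
    return hash;
}

/* Returns 1 if `url` was newly added, 0 if it was already present,
 * and -1 if the table is full or memory ran out. */
static int url_set_add(url_set* set, const char* url)
{
    unsigned long index;
    size_t length;
    char* copy;
    int probes;

    assert(set != NULL && url != NULL);

    index = hash_url(url) % SET_CAPACITY;
    for (probes = 0; probes < SET_CAPACITY; probes++) {
        if (set->slots[index] == NULL) {
            break;                            /* empty slot: the url is new */
        }
        if (strcmp(set->slots[index], url) == 0) {
            return 0;                         /* seen before */
        }
        index = (index + 1) % SET_CAPACITY;   /* linear probing */
    }
    if (probes == SET_CAPACITY) {
        return -1;                            /* table is full */
    }

    length = strlen(url);
    copy = (char*)malloc(length + 1);
    if (copy == NULL) {
        return -1;
    }
    memcpy(copy, url, length + 1);
    set->slots[index] = copy;
    set->count++;
    return 1;
}

/* Release every stored URL and reset the set to empty. */
static void url_set_clear(url_set* set)
{
    int i;

    assert(set != NULL);
    for (i = 0; i < SET_CAPACITY; i++) {
        free(set->slots[i]);
        set->slots[i] = NULL;
    }
    set->count = 0;
}
```

In the crawler, a zero-initialized url_set (for example a static variable) could be consulted just before download_web_page is called, downloading only when url_set_add returns 1. With roughly ten thousand pages, as in the test run described above, SET_CAPACITY would need to be raised accordingly.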

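This article was originally written in Chinese and published on aliyun.com; an English version is also provided, for information purposes only. This website makes no representation or warranty, express or implied, as to the accuracy, completeness, or reliability of the article or of any translation of it. If you have any concerns or complaints about the article, please email info-contact@alibabacloud.com with a detailed description. Staff will contact you within 5 working days, and the infringing content will be removed once verified.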
