I wrote a web crawler that can crawl images from the web.
A starting URL must be given.
It compiles under VS2010.
It needs to be compiled with the multi-byte character set;
the VS2010 default is the Unicode character set.
After compiling, just run it; a pleasant surprise awaits...
How the crawler works
Start from the initial URL and find the hyperlinks to other pages.
Put them into a page queue, then find all the images on the current page and download them.
Check whether the page queue is empty; if not, take out the next page.
Extract that page's hyperlinks onto the back of the queue, and download all the images on the page.
Repeat this cycle.
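The cycle above is a standard breadth-first search over pages. As a rough sketch of just the queue-and-visited-set bookkeeping (using a toy in-memory link graph as a stand-in for real pages; the real crawler discovers links by downloading each page over HTTP):

```cpp
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>
using namespace std;

// Toy link graph standing in for real pages (hypothetical data,
// only to illustrate the queue/visited-set logic).
map<string, vector<string>> links = {
    {"start", {"a", "b"}},
    {"a",     {"b", "c"}},
    {"b",     {"c"}},
    {"c",     {}},
};

// Visit pages breadth-first, never processing the same URL twice.
vector<string> CrawlOrder(const string& startUrl) {
    vector<string> order;
    set<string> visited;    // URLs already seen
    queue<string> hrefUrl;  // pages waiting to be processed

    hrefUrl.push(startUrl);
    visited.insert(startUrl);

    while (!hrefUrl.empty()) {
        string url = hrefUrl.front();  // take a URL from the front
        hrefUrl.pop();
        order.push_back(url);          // "download images" would happen here

        for (const string& next : links[url]) {  // hyperlinks on this page
            if (visited.insert(next).second)     // not seen before
                hrefUrl.push(next);              // append to the back of the queue
        }
    }
    return order;
}
```

Because of the visited set, a page linked from several places is still fetched only once.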
Main framework:

void main()
{
    // Initialize Winsock for TCP network connections
    WSADATA wsaData;
    if (WSAStartup(MAKEWORD(2, 2), &wsaData) != 0)
    {
        return;
    }

    // Create folders to save the images and web page text files
    CreateDirectory("./img", 0);
    CreateDirectory("./html", 0);

    string urlStart = "http://hao.360.cn/meinvdaohang.html";
    // The starting address of the traversal
    urlStart = "http://www.wmpic.me/tupian";

    // Breadth-first traversal: extract the hyperlinks in the page into hrefUrl,
    // extract the image links, and download the images
    BFS(urlStart);

    // Record the visited URL
    visitedUrl.insert(urlStart);

    while (hrefUrl.size() != 0)
    {
        string url = hrefUrl.front();  // take a URL from the front of the queue
        cout << url << endl;
        BFS(url);       // traverse the extracted page: put the pages it links to
                        // into hrefUrl, and download its text and images
        hrefUrl.pop();  // after traversal, delete this URL
    }
    WSACleanup();
    return;
}
BFS() does the most important work:
it first gets the page response and saves it to a text file, then HtmlParse finds the image links in the page, and DownloadImg downloads all the images.
// Breadth-first traversal
void BFS(const string& url)
{
    char* response;
    int bytes;
    // Get the web page's response and put it in response
    if (!GetHttpResponse(url, response, bytes))
    {
        cout << "The URL is wrong! Ignore." << endl;
        return;
    }
    string httpResponse = response;
    free(response);
    string fileName = ToFileName(url);
    ofstream ofile("./html/" + fileName);
    if (ofile.is_open())
    {
        // Save the text content of the web page
        ofile << httpResponse << endl;
        ofile.close();

        vector<string> imgUrls;
        // Parse all the image links in the page into imgUrls
        HtmlParse(httpResponse, imgUrls, url);
        // Download all the image resources
        DownloadImg(imgUrls, url);
    }
}
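HtmlParse itself isn't shown above. As a minimal sketch of the idea (my own simplified version, not the author's exact code), it scans the HTML for src="..." attributes following <img tags; the author's version also takes the page URL, presumably to resolve relative links, which this sketch omits:

```cpp
#include <string>
#include <vector>
using namespace std;

// Collect the value of every src="..." attribute that follows an <img tag.
// Simplified: assumes double quotes and reasonably well-formed markup.
void HtmlParse(const string& html, vector<string>& imgUrls) {
    size_t pos = 0;
    while ((pos = html.find("<img", pos)) != string::npos) {
        size_t src = html.find("src=\"", pos);
        if (src == string::npos) break;
        src += 5;  // skip past  src="
        size_t end = html.find('"', src);
        if (end == string::npos) break;
        imgUrls.push_back(html.substr(src, end - src));
        pos = end;  // continue scanning after this attribute
    }
}
```

A real parser would also handle single quotes, unquoted attributes, and uppercase tags, but simple string searching like this is enough for many pages.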
The full code is attached below:
Code updated on 2014-10-15.
#include <Windows.h>
#include <string>
#include <iostream>
#include <fstream>
#include <vector>
#include "winsock2.h"
#include <time.h>
#include <queue>
#include