A C++ Web Crawler: Winsock Programming



I wrote a web crawler that can download images from the web.

It requires an initial URL to start from.

It compiles under VS2010.

It must be built with the multi-byte character set; the VS2010 default is the Unicode character set.

After compiling, just run it. There's a pleasant surprise waiting...


How the crawler works

Start at the initial URL, find the hyperlinks to other pages, and save them in a page queue. Then find all the images on the page and download them.

Next, check whether the page queue is empty. If it is not, remove the next page from it, extract that page's hyperlinks and append them to the back of the queue, and download all the images on that page.

Repeat this cycle.
The main frame:

```cpp
void main() {
    // Initialize Winsock for TCP network connections
    WSADATA wsaData;
    if (WSAStartup(MAKEWORD(2, 2), &wsaData) != 0) {
        return;
    }

    // Create folders to save the images and the page text files
    CreateDirectory("./img", 0);
    CreateDirectory("./html", 0);

    // The starting URL of the traversal
    // string urlStart = "http://hao.360.cn/meinvdaohang.html";
    string urlStart = "http://www.wmpic.me/tupian";

    // Breadth-first traversal: extract the hyperlinks on the page into
    // HrefUrl, extract the image links, and download the images.
    BFS(urlStart);

    // Record the URL as visited
    VisitedUrl.insert(urlStart);

    while (HrefUrl.size() != 0) {
        string url = HrefUrl.front();  // take a URL from the front of the queue
        cout << url << endl;
        BFS(url);       // traverse the fetched page: push its hyperlinks into
                        // HrefUrl, download its text and images
        HrefUrl.pop();  // after the traversal, remove this URL
    }
    WSACleanup();
    return;
}
```
BFS does the most important processing: it first gets the page's HTTP response and saves it to a text file, then HtmlParse finds the image links and DownloadImg downloads all the images.

```cpp
// Breadth-first traversal of one page
void BFS(const string& url) {
    char* response;
    int bytes;
    // Get the page's HTTP response and put it into response.
    if (!GetHttpResponse(url, response, bytes)) {
        cout << "The URL is wrong! Ignore it." << endl;
        return;
    }
    string httpResponse = response;
    free(response);

    string fileName = ToFileName(url);
    ofstream ofile("./html/" + fileName);
    if (ofile.is_open()) {
        // Save the text content of the page
        ofile << httpResponse << endl;
        ofile.close();

        vector<string> imgurls;
        // Parse all the image links on the page into imgurls
        HtmlParse(httpResponse, imgurls, url);

        // Download all the image resources
        DownloadImg(imgurls, url);
    }
}
```



Attach All code:

Code updated on 2014-10-15.


```cpp
#include <Windows.h>
#include <string>
#include <iostream>
#include <fstream>
#include <vector>
#include "winsock2.h"
#include <time.h>
#include <queue>
#include
```
