Implementing a web crawler in C++

Source: Internet
Author: User
Tags: socket, connect

Rant

The day before yesterday, I wrote down my interview experience.

I feel that my job search has been harder than most people's, and that I have also put in more sweat than the average person.

I thought it would read as an inspirational story.

The comments I received were:

"Language is just a tool." ||| I have been hearing this sentence for 4 years!

"I used to be xx, now I am XXX." ||| Until you took an arrow in the knee?

"I also use C ... I can only say that what matters is the ability to learn and to apply it." ||| Oh.

"You may be excellent compared with your classmates, but for the kind of company you want to work for, you probably know too little and are not worth training, so I suggest you make your fundamentals more solid, which will serve you better."

The last one is the one that bothers me the most.

People nowadays judge by what they see at first glance and generalize from it.

So the university was not a good one. But how many years pass between entering university and graduating? Is a person's level today really fixed by something decided that many years ago?

I sent out hundreds of resumes online and did not receive a single interview notice.

I went straight to companies in person: 2 interviews, 2 passes.

Isn't that a problem with judging by the resume?

It reminds me of the time when I was job hunting and posted an ad in a chat group.

Immediately someone popped up, acting as if he had sized up countless people.

"Frankly speaking, if you were really good, you would already have been snapped up. Either that, or you are from a training organization."

C++ programmers understand that C++ developers take a long time to mature, so companies generally do not hire newcomers, let alone junior-college graduates.

Those who are used to crash courses will not understand.

All right, rant over. Switching modes.

Implementing a web crawler in C++
// Reconstructed from a flattened listing; values marked "assumed" were lost in extraction.
#define _WINSOCK_DEPRECATED_NO_WARNINGS  // allow gethostbyname / inet_ntoa / inet_addr
#include <iostream>
#include <vector>
#include <list>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <regex>
#include <fstream>
#include <WinSock2.h>
#include <Windows.h>
#pragma comment(lib, "Ws2_32.lib")
using namespace std;

const int kBufferSize = 2048 * 100;  // multiplier assumed

void startupWSA()
{
    WSADATA wsadata;
    WSAStartup(MAKEWORD(2, 0), &wsadata);
}

inline void cleanupWSA()
{
    WSACleanup();
}

// Splits str at the first occurrence of dilme; second is empty if dilme is absent.
inline pair<string, string> binaryString(const string &str, const string &dilme)
{
    pair<string, string> result(str, "");
    auto pos = str.find(dilme);
    if (pos != string::npos)
    {
        result.first = str.substr(0, pos);
        result.second = str.substr(pos + dilme.size());
    }
    return result;
}

inline string getIpByHostName(const string &hostName)
{
    hostent *phost = gethostbyname(hostName.c_str());
    return phost ? inet_ntoa(*(in_addr *)phost->h_addr_list[0]) : "";
}

// Opens a TCP connection to port 80 of hostName; returns INVALID_SOCKET on failure.
inline SOCKET connect(const string &hostName)
{
    auto ip = getIpByHostName(hostName);
    if (ip.empty())
        return INVALID_SOCKET;
    SOCKET sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock == INVALID_SOCKET)
        return INVALID_SOCKET;
    sockaddr_in addr = {};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(80);
    addr.sin_addr.s_addr = inet_addr(ip.c_str());
    if (::connect(sock, (const sockaddr *)&addr, sizeof(addr)) == SOCKET_ERROR)
    {
        closesocket(sock);  // do not leak the socket when the connection fails
        return INVALID_SOCKET;
    }
    return sock;
}

inline bool sendRequest(SOCKET sock, const string &host, const string &get)
{
    string http = "GET " + get + " HTTP/1.1\r\n"
                + "Host: " + host + "\r\n"
                + "Connection: close\r\n\r\n";
    return send(sock, http.c_str(), (int)http.size(), 0) == (int)http.size();
}

// Reads the response until the peer closes or select() times out.
// Returns the raw response only for status 200, otherwise an empty string.
inline string recvRequest(SOCKET sock)
{
    static timeval wait = {2, 0};  // 2-second timeout
    static string buffer(kBufferSize, '\0');
    int len = 0, recLen = 0;
    do
    {
        fd_set fd;
        FD_ZERO(&fd);
        FD_SET(sock, &fd);
        recLen = 0;
        if (select(0, &fd, nullptr, nullptr, &wait) > 0)
        {
            recLen = recv(sock, &buffer[0] + len, kBufferSize - len, 0);
            if (recLen > 0)
                len += recLen;
        }
    } while (recLen > 0);

    // In "HTTP/1.1 200" the status code occupies bytes 9..11.
    if (len > 11 && buffer[9] == '2' && buffer[10] == '0' && buffer[11] == '0')
        return buffer.substr(0, len);
    return "";
}

inline void extUrl(const string &buffer, queue<string> &urlQueue)
{
    if (buffer.empty())
        return;
    // Matches href="http://...", href="https://..." or href="//...".
    // The "//" is restored as an assumption; group 2 is the host/path part.
    static const regex hrefPattern("href=\"(https?:)?//(\\S+)\"");
    smatch result;
    auto curIter = buffer.begin();
    const auto endIter = buffer.end();
    while (regex_search(curIter, endIter, result, hrefPattern))
    {
        urlQueue.push(result[2].str());
        curIter = result[0].second;  // continue after the previous match
    }
}

void Go(const string &rootUrl, int count)
{
    queue<string> urls;
    urls.push(rootUrl);
    for (int i = 0; i != count && !urls.empty(); ++i)
    {
        const string &url = urls.front();
        auto parts = binaryString(url, "/");  // host, path without leading '/'
        SOCKET sock = connect(parts.first);
        if (sock != INVALID_SOCKET)
        {
            if (sendRequest(sock, parts.first, "/" + parts.second))
                extUrl(recvRequest(sock), urls);
            closesocket(sock);
        }
        cout << url << ": count=> " << urls.size() << endl;
        urls.pop();
    }
}

int main()
{
    startupWSA();
    Go("www.hao123.com", 200);  // page count assumed
    cleanupWSA();
    return 0;
}

The crawler took only about an hour to write.

Honestly, the code is rough, so please go easy on me.

I won't go over the HTTP protocol, sockets, or regular expressions here.

Let me explain how it works.

All URLs are kept in the urls queue.

First, a root URL is pushed onto it.

Then the crawler gets going.

The process is roughly as follows:

Take a URL from urls --> read the web page it points to --> extract all URLs on that page --> push those URLs into urls --> pop the processed URL from urls.
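The loop described above can be sketched without any networking. This is only an illustration, not the crawler's actual code: FakeWeb, crawlOrder, and the page names used with them are hypothetical, and an in-memory map stands in for downloading a page and extracting its links.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <vector>

// Hypothetical stand-in for the network: page URL -> URLs found on that page.
using FakeWeb = std::map<std::string, std::vector<std::string>>;

// Visits at most maxPages URLs in queue order and returns them
// in the order they were taken from the queue.
std::vector<std::string> crawlOrder(const FakeWeb &web,
                                    const std::string &root,
                                    std::size_t maxPages)
{
    assert(!root.empty());                // a crawl needs a starting URL

    std::queue<std::string> urls;
    urls.push(root);                      // push the root URL

    std::vector<std::string> visited;
    while (!urls.empty() && visited.size() < maxPages)
    {
        std::string url = urls.front();   // take the next URL
        urls.pop();                       // and remove it from the queue
        visited.push_back(url);

        auto page = web.find(url);        // "read" the page
        if (page == web.end())
            continue;                     // unknown page: nothing to extract
        for (const auto &link : page->second)
            urls.push(link);              // queue every URL found on it
    }
    return visited;
}
```

Because a queue is first-in, first-out, pages are visited level by level from the root, which is a breadth-first traversal.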

A URL is a host followed by the path that is sent in the GET request.

So a binaryString function is needed to split it into those two parts.
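As a rough illustration of that split, the sketch below separates a scheme-less URL into host and path and builds the matching request text. The names splitHostPath and buildGetRequest are hypothetical; they only mirror the idea, and the example assumes the URL has no "http://" prefix.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Splits "host/path" at the first '/'. The path keeps its leading '/',
// and a bare host gets the root path "/".
std::pair<std::string, std::string> splitHostPath(const std::string &url)
{
    auto pos = url.find('/');
    if (pos == std::string::npos)
        return {url, "/"};
    return {url.substr(0, pos), url.substr(pos)};
}

// Builds a minimal HTTP/1.1 GET request for the given host and path.
std::string buildGetRequest(const std::string &host, const std::string &path)
{
    assert(!path.empty() && path[0] == '/');  // the request target starts with '/'
    return "GET " + path + " HTTP/1.1\r\n"
           "Host: " + host + "\r\n"
           "Connection: close\r\n\r\n";
}
```

For example, splitHostPath("www.example.com/a/b.html") yields the host "www.example.com" and the path "/a/b.html".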

It is not very efficient: about 40,000 URLs per minute, and even after removing duplicates there are still at least several thousand.
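One way to avoid queuing the same URL over and over, offered here as an assumption rather than something the crawler above does, is to remember every URL already seen in a std::unordered_set. The name pushIfNew is hypothetical.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <unordered_set>

// Pushes url onto the queue only the first time it is seen.
// Returns true if the URL was new and has been queued.
bool pushIfNew(const std::string &url,
               std::unordered_set<std::string> &seen,
               std::queue<std::string> &urls)
{
    // Holds as long as URLs are only ever queued through this function.
    assert(seen.size() >= urls.size());

    // insert() reports via .second whether the element was actually added.
    if (!seen.insert(url).second)
        return false;
    urls.push(url);
    return true;
}
```

The set keeps growing for the whole crawl, so this trades memory for fewer repeated downloads.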

There is one thing to note.

C++11 regular expressions are really a bit awkward to use.

I did not know how to get it to match more than once.

So I had to use a loop.

I found an answer online, but the way it was written left me a little puzzled.
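For matching more than once, the standard library also provides std::sregex_iterator, which walks every non-overlapping match in a string. The sketch below is an alternative to a hand-written search loop, not the author's code: extractLinks is a hypothetical name, and the pattern assumes links written as href="http://...", href="https://...", or href="//...".

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Returns the host/path part of every absolute href in the HTML text.
std::vector<std::string> extractLinks(const std::string &html)
{
    // Group 1: optional scheme. Group 2: everything after "//" up to the closing quote.
    static const std::regex hrefPattern("href=\"(https?:)?//([^\"\\s]+)\"");

    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), hrefPattern), end;
         it != end; ++it)
    {
        assert(it->size() == 3);          // whole match plus two groups
        links.push_back((*it)[2].str());
    }
    return links;
}
```

A default-constructed std::sregex_iterator acts as the end marker, so the loop stops once no further match is found. Relative links such as href="/about" are skipped by this pattern.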

Execution results

[Image: execution results of the C++ web crawler]
