I. A Long-overdue Follow-up
It has been about three months since the last article was published. I have been meaning to finish this introductory tutorial series all along, partly to support DotnetSpider and partly to make a modest contribution to the .NET community. The author of this open-source project has kept it well maintained: the previous tutorial was written against version 2.4.4, and when I browsed the project today, the most recent update was three days ago, bringing it to 2.5.0. The project has also passed 1,000 stars and remains quite popular, for which we have the author's continued effort to thank.
The reason I have gone so long without writing a proper article is that I was preparing for the March PMP exam; by the time that was nearly over it was April, and then an urgent project pushed things back again, all the way to now, when I finally have a little free time. I hope that, now this article is finished, it will be of some help to you.
II. Updating the DotnetSpider Library
As just mentioned, DotnetSpider has been upgraded to 2.5.0, so let's update the library and use the latest version to keep the technology current. Updating the two class libraries is all it takes.
III. Analyzing the Autohome Product Details Page
3.1 Analyzing the Product Details Page Data
① Last time we found that clicking "parameter configuration" makes the page send two AJAX requests, one fetching the basic parameters of the vehicle and one fetching its configuration parameters; both return data in JSON format.
② Capturing the traffic with Chrome's Network panel shows that the two requests have one thing in common: both submit the parameter data[specid]=28762. My guess is that this is the SKU id. You can try opening these two addresses directly in the browser; they return the vehicle's information straight away. So the crux of the problem is how to obtain the skuid; once we have it, getting the data for these models becomes trivial.
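To make the request structure concrete, here is a minimal Python sketch (the article's own code is C#) that builds the two AJAX URLs observed in the network capture from a given specid. The base address and query parameters are taken directly from the capture described above; the helper function names are my own.

```python
# Sketch: constructing the two AJAX endpoints seen in Chrome's Network panel.
# Both share the same front-end gateway and differ only in the data[_host] target.
BASE = "https://mall.autohome.com.cn/http/data.html"

def build_param_url(spec_id):
    """URL returning the basic vehicle parameters (v1 spec_paramsinglebyspecid)."""
    return (BASE + "?data[_host]=//car.api.autohome.com.cn/v1/carprice/"
            "spec_paramsinglebyspecid.ashx&data[_appid]=mall&data[specid]=" + str(spec_id))

def build_config_url(spec_id):
    """URL returning the configuration parameters (v2 config_getlistbyspecid)."""
    return (BASE + "?data[_host]=//car.api.autohome.com.cn/v2/carprice/"
            "config_getlistbyspecid.ashx&data[_appid]=mall&data[specid]=" + str(spec_id))

print(build_param_url(28762))
print(build_config_url(28762))
```

Once the skuid is known, the crawler only has to substitute it into these two templates.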
3.2 How to Obtain the SKU of a Product
This genuinely bothered me for a while, because the link we open does not contain the skuid (https://mall.autohome.com.cn/detail/284641-0-0.html), so extracting it from the URL is not realistic. I therefore used Chrome's Elements panel to search the page for the skuid value and found it in two places: one in an HTML element and one in a JS global variable. Of the two, I think the HTML element is easier to handle, since we only need to read the element's attribute.
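As an illustration of the attribute-extraction idea, here is a small Python sketch (the article's crawler does this in C# with an XPath selector). The HTML fragment below is hypothetical, modeled on the detail page: it assumes the compare link carries the specid in a "link" attribute, which matches the XPath used later in the article.

```python
import re

# Hypothetical fragment modeled on the Autohome detail page; the attribute
# name "link" and class "carbox-compare_detail" come from the article's XPath.
html = '<a class="carbox-compare_detail" link="28762">parameter configuration</a>'

def extract_sku_id(page_html):
    """Pull the numeric skuid out of the compare link's attribute, or None."""
    m = re.search(r'class="carbox-compare_detail"\s+link="(\d+)"', page_html)
    return m.group(1) if m else None

print(extract_sku_id(html))  # -> 28762
```

Reading an attribute like this is far more robust than trying to parse the skuid out of JS source text.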
IV. Development
4.1 Writing the Processors
Why write three separate processors this time? Because we have three kinds of data to handle: the first processor obtains the skuid, the second obtains the basic vehicle parameters, and the third obtains the vehicle configuration. Attentive readers will notice that each processor now has a constructor, inside which you can clearly see a familiar regular expression (PS: my regex skills are poor; if you have a better way to write these, please post it in the comments so I can admire it). You may well ask why they are written this way.
Compared with the last tutorial, the crawl process is more complex this time. Last time we only grabbed a list page, and a single interface was enough to do the whole job. Now the process becomes: step one, fetch the model details page and find the skuid in it; step two, stitch the skuid into new requests and add them to the crawler's queue, so the data is fetched through these newly constructed requests. How, then, do we know which processor should handle which request? DotnetSpider matches each URL against a pattern to decide which processor processes the data.
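The URL-based dispatch can be sketched in a few lines of Python (DotnetSpider does this in C# via each processor's TargetUrlsExtractor pattern, shown in the code below). The processor names here are shorthand labels of my own; the patterns are simplified versions of the regexes used in the article.

```python
import re

# Each processor declares a URL pattern; a downloaded page is routed to the
# first processor whose pattern matches its URL.
PATTERNS = {
    "sku":  r"^https://mall\.autohome\.com\.cn/detail/",          # detail page -> skuid
    "base": r"v1/carprice/spec_paramsinglebyspecid\.ashx",        # basic parameters
    "ext":  r"v2/carprice/config_getlistbyspecid\.ashx",          # configuration
}

def pick_processor(url):
    """Return the label of the first processor whose pattern matches the URL."""
    for name, pattern in PATTERNS.items():
        if re.search(pattern, url):
            return name
    return None

print(pick_processor("https://mall.autohome.com.cn/detail/284641-0-0.html"))  # -> sku
```

This is exactly why each processor's constructor below sets a regex: the regex is the routing rule.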
private class GetSkuProcessor : BasePageProcessor // Gets the skuid
{
    public GetSkuProcessor()
    {
        TargetUrlsExtractor = new RegionAndPatternTargetUrlsExtractor(".", @"^https://mall\.autohome\.com\.cn/detail/*");
    }

    protected override void Handle(Page page)
    {
        string skuId = string.Empty;
        skuId = page.Selectable.XPath(".//a[@class='carbox-compare_detail']/@link").GetValue();
        page.AddResultItem("SkuId", skuId);
        page.AddTargetRequest(@"https://mall.autohome.com.cn/http/data.html?data[_host]=//car.api.autohome.com.cn/v1/carprice/spec_paramsinglebyspecid.ashx&data[_appid]=mall&data[specid]=" + skuId);
        page.AddTargetRequest(@"https://mall.autohome.com.cn/http/data.html?data[_host]=//car.api.autohome.com.cn/v2/carprice/config_getlistbyspecid.ashx&data[_appid]=mall&data[specid]=" + skuId);
    }
}

private class GetBasicInfoProcessor : BasePageProcessor // Gets the basic vehicle parameters
{
    public GetBasicInfoProcessor()
    {
        TargetUrlsExtractor = new RegionAndPatternTargetUrlsExtractor(".", @"^https://mall\.autohome\.com\.cn/http/data\.html\?data\[_host\]=//car\.api\.autohome\.com\.cn/v1/carprice/spec_paramsinglebyspecid\.ashx*");
    }

    protected override void Handle(Page page)
    {
        page.AddResultItem("BaseInfo", page.Content);
    }
}

private class GetExtInfoProcessor : BasePageProcessor // Gets the vehicle configuration
{
    public GetExtInfoProcessor()
    {
        TargetUrlsExtractor = new RegionAndPatternTargetUrlsExtractor(".", @"^https://mall\.autohome\.com\.cn/http/data\.html\?data\[_host\]=//car\.api\.autohome\.com\.cn/v2/carprice/config_getlistbyspecid\.ashx*");
    }

    protected override void Handle(Page page)
    {
        page.AddResultItem("ExtInfo", page.Content);
    }
}
4.2 Creating the Pipeline
The pipeline has barely changed; only minor tweaks were needed, so this part is easy.
private class PrintSkuPipe : BasePipeline
{
    public override void Process(IEnumerable<ResultItems> resultItems, ISpider spider)
    {
        foreach (var resultItem in resultItems)
        {
            if (resultItem.GetResultItem("SkuId") != null)
            {
                Console.WriteLine(resultItem.Results["SkuId"] as string);
            }
            if (resultItem.GetResultItem("BaseInfo") != null)
            {
                var t = JsonConvert.DeserializeObject<AutoCarParam>(resultItem.Results["BaseInfo"]);
                //Console.WriteLine(resultItem.Results["BaseInfo"]);
            }
            if (resultItem.GetResultItem("ExtInfo") != null)
            {
                var t = JsonConvert.DeserializeObject<AutoCarConfig>(resultItem.Results["ExtInfo"]);
                //Console.WriteLine(resultItem.Results["ExtInfo"]);
            }
        }
    }
}
Two new entity classes were added, AutoCarParam and AutoCarConfig. They actually contain duplicated members, so you could abstract them further to reduce the code and save a little disk space.
public class AutoCarConfig
{
    public string Message { get; set; }
    public ConfigResult Result { get; set; }
    public string ReturnCode { get; set; }
}

public class ConfigResult
{
    public string SpecId { get; set; }
    public List<ConfigTypeItem> ConfigTypeItems { get; set; }
}

public class ConfigTypeItem
{
    public string Name { get; set; }
    public List<ConfigItem> ConfigItems { get; set; }
}

public class ConfigItem
{
    public string Name { get; set; }
    public string Value { get; set; }
}

public class AutoCarParam
{
    public string Message { get; set; }
    public ParamResult Result { get; set; }
    public string ReturnCode { get; set; }
}

public class ParamResult
{
    public string SpecId { get; set; }
    public List<ParamTypeItem> ParamTypeItems { get; set; }
}

public class ParamTypeItem
{
    public string Name { get; set; }
    public List<ParamItem> ParamItems { get; set; }
}

public class ParamItem
{
    public string Name { get; set; }
    public string Value { get; set; }
}
4.3 Constructing the Crawler
Not much changed here either; see my comments for the modified spots. Because we now have multiple processors, they all need to be added.
var site = new Site
{
    CycleRetryTimes = 1,
    SleepTime = 1000, // NOTE: the exact value was garbled in the source text; adjust as needed
    Headers = new Dictionary<string, string>()
    {
        { "Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" },
        { "Cache-Control", "no-cache" },
        { "Connection", "keep-alive" },
        { "Content-Type", "application/x-www-form-urlencoded; charset=utf-8" },
        { "User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36" }
    }
};

List<Request> resList = new List<Request>();
Request res = new Request();
res.Url = "https://mall.autohome.com.cn/detail/284641-0-0.html";
res.Method = System.Net.Http.HttpMethod.Get;
resList.Add(res);

var spider = Spider.Create(site,
        new QueueDuplicateRemovedScheduler(),
        new GetSkuProcessor(),
        new GetBasicInfoProcessor(),
        new GetExtInfoProcessor()) // we have multiple processors this time, so all of them must be added
    .AddStartRequests(resList.ToArray())
    .AddPipeline(new PrintSkuPipe());
spider.ThreadNum = 1;
spider.Run();
Console.Read();
V. Execution Results
VI. Summary
This second article was held back for far too long; I am ashamed to say it should have come out at the end of February but has been delayed until now. The previous article's readership was better than I expected, and some readers commented asking me to hurry up with this one, which is encouraging overall. My hope for this article is not just that you learn how to use DotnetSpider, but that you get a sense of how to approach crawling data and pick up a few ideas, which is why I wrote it around a real-world scenario; otherwise it would be too dry.
I welcome your criticism and suggestions.
VII. Preview of the Next Article
The next article will cover file downloading.
By the way, if anyone is preparing for the PMP or a similar certification exam, feel free to contact me; I have some study materials to share.
May 13, 2018
Crawling Autohome Mall Product Details Data with DotnetSpider in Action [Part II]