DotnetSpider in practice: crawling an Autohome store [1]


1. Background

I did not want to sit idle over the Spring Festival, and I had always wanted to learn how to write crawlers. I searched a lot online and found many articles, but most of them use Python, where the community is very active. Then I found DotnetSpider, an open-source library developed by an expert in the cnblogs community. Fortunately, the library also supports .NET Core, so I used some free time during the Spring Festival to study the whole project. The online automobile industry is very hot at the moment, so I chose the Autohome store and the Mango automobile store as targets for capturing data.

2. Development Environment

VS2017 + .NET Core 2.x + DotnetSpider + Win10

3. Development

3.1 Create a .NET Core project

Create a .NET Core console application.

3.2 Add the DotnetSpider class library through NuGet

Search for DotnetSpider in NuGet and add the two libraries.

 

3.3 Analyze the web address to be crawled

Open the page https://store.mall.autohome.com.cn/83425681.html.

We use the Network panel of Chrome's developer tools to capture the request that loads the list. The panel shows everything in the HTTP request, including the headers and the POST parameters. In other words, we only need to simulate that HTTP request and then parse the HTML it returns.

The page parameter is the page number. To fetch a given page, you only need to change the value of page.

The returned result is the HTML of the List page.
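To make "simulate an HTTP request" concrete, here is a minimal sketch that builds (but does not send) the POST request for one page using only the .NET base class library. The endpoint is the one captured in the Network panel; the class name `ListRequestSketch`, the method `Build`, and the `fixedFields` parameter are my own names for illustration, not part of DotnetSpider.

```csharp
using System;
using System.Net.Http;
using System.Text;

public static class ListRequestSketch
{
    // Endpoint of the captured list request.
    public const string Endpoint =
        "https://store.mall.autohome.com.cn/shop/ajaxsitemodlecontext.jtml";

    // Builds the POST request for one list page without sending it.
    // fixedFields: the already URL-encoded form fields that do not change
    // between pages (hypothetical parameter, e.g. "id=7&shopid=83106681").
    public static HttpRequestMessage Build(string fixedFields, int page)
    {
        if (fixedFields == null) throw new ArgumentNullException(nameof(fixedFields));
        if (page < 1) throw new ArgumentOutOfRangeException(nameof(page));

        // Only the page field varies from request to request.
        var body = $"{fixedFields}&page={page}";

        var request = new HttpRequestMessage(HttpMethod.Post, Endpoint)
        {
            // Sends the body as a URL-encoded form, like the browser does.
            Content = new StringContent(body, Encoding.UTF8, "application/x-www-form-urlencoded")
        };
        request.Headers.TryAddWithoutValidation("Accept", "text/html, */*; q=0.01");
        return request;
    }
}
```

Passing the result to `HttpClient.SendAsync` would perform the actual call. The rest of this article lets DotnetSpider do that part instead.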

3.4 Create a storage entity class AutoHomeShopListEntity

        class AutoHomeShopListEntity : SpiderEntity
        {
            public string DetailUrl { get; set; }
            public string CarImg { get; set; }
            public string Price { get; set; }
            public string DelPrice { get; set; }
            public string Title { get; set; }
            public string Tip { get; set; }
            public string BuyNum { get; set; }

            public override string ToString()
            {
                return $"{Title}|{Price}|{DelPrice}|{BuyNum}";
            }
        }

3.5 Create AutoHomeProcessor

The processor parses the returned HTML and saves the extracted entities.

        private class AutoHomeProcessor : BasePageProcessor
        {
            protected override void Handle(Page page)
            {
                List<AutoHomeShopListEntity> list = new List<AutoHomeShopListEntity>();
                // Each <li class="carbox"> is one car in the list.
                var modelHtmlList = page.Selectable.XPath(".//div[@class='list']/ul[@class='fn-clear']/li[@class='carbox']").Nodes();
                foreach (var modelHtml in modelHtmlList)
                {
                    AutoHomeShopListEntity entity = new AutoHomeShopListEntity();
                    entity.DetailUrl = modelHtml.XPath(".//a/@href").GetValue();
                    entity.CarImg = modelHtml.XPath(".//a/div[@class='carbox-carimg']/img/@src").GetValue();
                    // The info block holds one or two prices, each prefixed with '¥'.
                    // Strip whitespace, drop the leading '¥', then split on the remaining '¥'.
                    var price = modelHtml.XPath(".//a/div[@class='carbox-info']")
                        .GetValue(DotnetSpider.Core.Selector.ValueOption.InnerText)
                        .Trim()
                        .Replace(" ", string.Empty)
                        .Replace("\n", string.Empty)
                        .Replace("\t", string.Empty)
                        .TrimStart('¥')
                        .Split("¥");
                    if (price.Length > 1)
                    {
                        entity.Price = price[0];
                        entity.DelPrice = price[1];
                    }
                    else
                    {
                        // Only one price is shown, so use it for both fields.
                        entity.Price = price[0];
                        entity.DelPrice = price[0];
                    }
                    entity.Title = modelHtml.XPath(".//a/div[@class='carbox-title']").GetValue();
                    entity.Tip = modelHtml.XPath(".//a/div[@class='carbox-tip']").GetValue();
                    entity.BuyNum = modelHtml.XPath(".//a/div[@class='carbox-number']/span").GetValue();
                    list.Add(entity);
                }
                page.AddResultItem("CarList", list);
            }
        }

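The price handling is the least obvious part of the processor, so here is the same logic pulled out into a standalone helper that can be tried without DotnetSpider. This is only a sketch: the name `PriceTextSketch.SplitPrices` is my own, and the sample price strings in the usage note are made up rather than taken from the real page.

```csharp
using System;

public static class PriceTextSketch
{
    // Takes the inner text of the price block and returns (Price, DelPrice).
    // When only one price is present, it is returned for both fields,
    // which mirrors the if/else branch in the processor.
    public static (string Price, string DelPrice) SplitPrices(string innerText)
    {
        if (innerText == null) throw new ArgumentNullException(nameof(innerText));

        var parts = innerText.Trim()
            .Replace(" ", string.Empty)
            .Replace("\n", string.Empty)
            .Replace("\t", string.Empty)
            .TrimStart('¥')   // drop the leading currency sign
            .Split('¥');      // a second '¥' separates the two prices

        return parts.Length > 1
            ? (parts[0], parts[1])
            : (parts[0], parts[0]);
    }
}
```

For example, `SplitPrices(" ¥12.58\n\t¥13.98 ")` yields `("12.58", "13.98")`, while `SplitPrices("¥9.98")` yields `("9.98", "9.98")`.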
3.6 Create AutoHomePipe

The pipeline outputs the captured results.

        private class AutoHomePipe : BasePipeline
        {
            public override void Process(IEnumerable<ResultItems> resultItems, ISpider spider)
            {
                foreach (var resultItem in resultItems)
                {
                    // Cast once and skip the item if it is not the expected list type.
                    var cars = resultItem.Results["CarList"] as List<AutoHomeShopListEntity>;
                    if (cars == null)
                    {
                        continue;
                    }
                    Console.WriteLine(cars.Count);
                    foreach (var item in cars)
                    {
                        Console.WriteLine(item);
                    }
                }
            }
        }

 

3.7 Create a Site

The Site object mainly carries the HTTP header information.

            var site = new Site
            {
                CycleRetryTimes = 1,
                SleepTime = 200,
                Headers = new Dictionary<string, string>()
                {
                    { "Accept", "text/html, */*; q=0.01" },
                    { "Referer", "https://store.mall.autohome.com.cn/83106681.html" },
                    { "Cache-Control", "no-cache" },
                    { "Connection", "keep-alive" },
                    { "Content-Type", "application/x-www-form-urlencoded; charset=UTF-8" },
                    { "User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36" }
                }
            };

3.8 Construct the requests

The captured interface has to be called with POST, so the requests are built by hand and the parameters are placed in PostBody. For a GET request this step could be omitted.

            List<Request> resList = new List<Request>();
            // One request per list page; 33 is the hard-coded number of pages.
            for (int i = 1; i <= 33; i++)
            {
                Request res = new Request();
                // Only the page field changes between requests.
                res.PostBody = $"id=7&j=%7B%22createMan%22%3A%2218273159100%22%2C%22createTime%22%3A1518433690000%2C%22row%22%3A5%2C%22siteUserActivityListId%22%3A8553%2C%22siteUserPageRowModuleId%22%3A84959%2C%22topids%22%3A%22%22%2C%22wherePhase%22%3A%221%22%2C%22wherePreferential%22%3A%220%22%2C%22whereUsertype%22%3A%220%22%7D&page={i}&shopid=83106681";
                res.Url = "https://store.mall.autohome.com.cn/shop/ajaxsitemodlecontext.jtml";
                res.Method = System.Net.Http.HttpMethod.Post;
                resList.Add(res);
            }

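The PostBody string is hard to read because it is URL-encoded. As a small aid, the sketch below decodes a form body into its fields so you can see what is being sent: id, j (a JSON object), page, and shopid. The helper name `PostBodySketch.Parse` is my own and it uses only the base class library.

```csharp
using System;
using System.Collections.Generic;

public static class PostBodySketch
{
    // Splits an application/x-www-form-urlencoded body into decoded
    // key/value pairs. Later duplicates of a key overwrite earlier ones.
    public static Dictionary<string, string> Parse(string body)
    {
        if (body == null) throw new ArgumentNullException(nameof(body));

        var fields = new Dictionary<string, string>();
        foreach (var pair in body.Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries))
        {
            var idx = pair.IndexOf('=');
            var key = idx < 0 ? pair : pair.Substring(0, idx);
            var value = idx < 0 ? string.Empty : pair.Substring(idx + 1);

            // In form encoding '+' stands for a space; Uri.UnescapeDataString
            // does not convert it, so replace it before decoding.
            fields[Uri.UnescapeDataString(key.Replace('+', ' '))] =
                Uri.UnescapeDataString(value.Replace('+', ' '));
        }
        return fields;
    }
}
```

Running `Parse` on the PostBody for page 1 gives a `j` value that starts with `{"createMan":"18273159100","createTime":1518433690000,"row":5,` and a `page` value of `1`, which confirms that page is the only field the loop needs to vary.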
3.9 Construct and run the crawler

            var spider = Spider.Create(site, new QueueDuplicateRemovedScheduler(), new AutoHomeProcessor())
                .AddStartRequests(resList.ToArray())
                .AddPipeline(new AutoHomePipe());
            spider.ThreadNum = 1;
            spider.Run();

3.10 Execution results

4. Coming soon

Next, I will capture the data on the product details page, including the vehicle model parameters and configuration. I have already captured the interface, but I am still thinking about a more convenient way to obtain the product id: it is currently stored in a JavaScript global variable on the page, which makes it hard to extract.
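One possible approach, which I have not yet tried against the real page, is to pull the value out of the page source with a regular expression. The sketch below assumes the id is declared as a simple `var name = value;` statement. The variable name `productId` in the usage note is hypothetical, since I have not listed the real name here, and `JsGlobalSketch.ExtractGlobal` is my own helper.

```csharp
using System;
using System.Text.RegularExpressions;

public static class JsGlobalSketch
{
    // Returns the value assigned to a top-level JS variable in the given
    // HTML, or null if no such declaration is found. Handles numbers and
    // simple quoted strings only; nested objects are out of scope.
    public static string ExtractGlobal(string html, string variableName)
    {
        if (html == null) throw new ArgumentNullException(nameof(html));
        if (variableName == null) throw new ArgumentNullException(nameof(variableName));

        // var|let|const, the variable name, '=', an optional quote,
        // then the value up to the next quote, semicolon, or whitespace.
        var pattern = @"\b(?:var|let|const)\s+" + Regex.Escape(variableName)
                    + @"\s*=\s*[""']?([^""';\s]+)";

        var match = Regex.Match(html, pattern);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

For instance, `ExtractGlobal("<script>var productId = 123456;</script>", "productId")` returns `"123456"`. If the page builds the id dynamically instead of declaring it literally, this will not work and a different approach is needed.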

 

5. Summary

Compared with other languages, the .NET community is not as active. DotnetSpider has not been around for long, but I hope everyone in the community will use it, so that its author keeps developing it and .NET can grow further.

This is my first blog post and I wrote it in a hurry, so there are still many shortcomings. Criticism is welcome.

Happy New Year

February 19, 2018
