Use .NET Core to write a crawler for the Movie Heaven site (dy2018.com)

Source: Internet
Author: User


My previous project was recently migrated from .NET Framework to .NET Core, and it took about a month before the new version officially went live.

Then I recently dug myself a new hole: writing a crawler to scrape movie resources from dy2018.com (Movie Heaven). This post takes the opportunity to briefly introduce how to write a crawler based on .NET Core.

PS: if you spot any mistakes, please point them out...

PPS: if you can, go to the cinema and support official releases more often. After all, some things are priceless.

Preparation (.NET Core environment)

First, install .NET Core. Download and installation tutorials are here: http://www.bkjia.com/article/87907.htm and http://www.bkjia.com/article/88735.htm. You can play with it on Windows, Linux, or macOS.

My environment here is Windows 10 + VS2015 Community Update 3 + .NET Core 1.1.0 SDK + .NET Core 1.0.1 Tools Preview 2.

In theory, you only need to install the .NET Core 1.1.0 SDK to develop .NET Core programs; it does not matter which tool you use to write the code.

After the tools above are installed, you will see the .NET Core templates in the New Project dialog in VS2015. For example:

For simplicity, we directly use the .NET Core template that ships with the VS tools when creating the project.

A crawler's self-cultivation: analyzing the web page

Before writing a crawler, we first need to understand how the data on the target web page is laid out.

Concretely, that means analyzing which HTML tags and attributes hold the data we want to capture, and then using those tags to extract the data from the HTML. Here I mainly rely on the id and class attributes of the HTML tags.

Let's walk through this process using dy2018.com as an example. Here is the dy2018.com homepage:

In Chrome, press F12 to enter developer mode, then use the mouse to select the page data you care about and inspect its HTML structure.

Then we start to analyze the page data:

After a simple analysis of HTML, we get the following conclusions:

The movie data on the www.dy2018.com homepage is stored in div tags with class co_content222.

Each movie detail link is an a tag: the tag's text is the movie name, and its href is the detail-page URL.

In conclusion, our job is to find the div tags with class='co_content222' and extract all the a tag data from them.

Start writing code...

In a previous project I used the AngleSharp library, a .NET (C#) component designed specifically for parsing (x)HTML source code.

AngleSharp homepage: https://anglesharp.github.io/

Introduction: http://www.bkjia.com/article/99082.htm

NuGet installation command: Install-Package AngleSharp
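As a quick illustration of the selector syntax used in the crawler below, here is a minimal self-contained sketch. The HTML fragment and class names here are made up to mimic the structure described above, and it assumes an AngleSharp version where HtmlParser.Parse is available (the same API this article uses):

```csharp
using System;
using System.Linq;
using AngleSharp.Parser.Html;

class SelectorDemo
{
    static void Main()
    {
        var parser = new HtmlParser();
        // A made-up fragment mimicking the homepage structure described above
        var dom = parser.Parse(
            "<div class='co_content222'>" +
            "<a href='/i/97462.html'>Movie A</a>" +
            "<a href='/ad/banner.html'>Ad</a>" +
            "</div>");

        // Same selector style as the crawler: select the div by class,
        // then keep only <a> tags whose href contains "/i/"
        var titles = dom.QuerySelectorAll("div.co_content222 a")
                        .Where(a => a.GetAttribute("href").Contains("/i/"))
                        .Select(a => a.TextContent)
                        .ToList();

        Console.WriteLine(string.Join(",", titles)); // prints "Movie A"
    }
}
```

The ad link is filtered out because its href does not contain "/i/", which is the same trick the real crawler uses below.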

Get movie list data

private static HtmlParser htmlParser = new HtmlParser();
// Thread-safe dictionary that stores all current movie data
private ConcurrentDictionary<string, MovieInfo> _cdMovieInfo = new ConcurrentDictionary<string, MovieInfo>();

private void AddToHotMovieList()
{
    // Run in a Task so this operation does not block other current operations
    Task.Factory.StartNew(() =>
    {
        try
        {
            // Fetch the HTML for the URL
            var htmlDoc = HTTPHelper.GetHTMLByURL("http://www.dy2018.com/");
            // Parse the HTML into an IDocument
            var dom = htmlParser.Parse(htmlDoc);
            // Extract all div tags with class='co_content222' from the DOM
            // (the QuerySelectorAll method accepts CSS selector syntax)
            var lstDivInfo = dom.QuerySelectorAll("div.co_content222");
            if (lstDivInfo != null)
            {
                // The first three divs hold the latest movies
                foreach (var divInfo in lstDivInfo.Take(3))
                {
                    // Get all <a> tags in the div whose href contains "/i/";
                    // testing showed the other <a> tags in this div may be ad links
                    divInfo.QuerySelectorAll("a")
                        .Where(a => a.GetAttribute("href").Contains("/i/"))
                        .ToList()
                        .ForEach(a =>
                        {
                            // Build the complete link
                            var onlineURL = "http://www.dy2018.com" + a.GetAttribute("href");
                            // Check whether it already exists in the current data
                            if (!_cdMovieInfo.ContainsKey(onlineURL))
                            {
                                // Fetch the details of this movie
                                MovieInfo movieInfo = FillMovieInfoFormWeb(a, onlineURL);
                                // Only add movies that actually have download links
                                if (movieInfo.XunLeiDownLoadURLList != null && movieInfo.XunLeiDownLoadURLList.Count != 0)
                                {
                                    _cdMovieInfo.TryAdd(movieInfo.Dy2018OnlineUrl, movieInfo);
                                }
                            }
                        });
                }
            }
        }
        catch (Exception ex)
        {
        }
    });
}

Get movie details

private MovieInfo FillMovieInfoFormWeb(AngleSharp.Dom.IElement a, string onlineURL)
{
    var movieHTML = HTTPHelper.GetHTMLByURL(onlineURL);
    var movieDoc = htmlParser.Parse(movieHTML);
    // See http://www.dy2018.com/i/97462.html for an example of the page being analyzed.
    // The movie introduction is in the tag with id='Zoom'
    var zoom = movieDoc.GetElementById("Zoom");
    // The download links are in the tds with bgcolor='#fdfddf'; there may be several
    var lstDownLoadURL = movieDoc.QuerySelectorAll("[bgcolor='#fdfddf']");
    // The release date is in the span tag with class='updatetime'
    var updatetime = movieDoc.QuerySelector("span.updatetime");
    var pubDate = DateTime.Now;
    if (updatetime != null && !string.IsNullOrEmpty(updatetime.InnerHtml))
    {
        // The content contains the text "Release Date:", so strip it before
        // converting; a failed conversion does not affect the flow
        DateTime.TryParse(updatetime.InnerHtml.Replace("Release Date:", ""), out pubDate);
    }
    var movieInfo = new MovieInfo()
    {
        // InnerHtml may also contain font tags, so do several Replaces
        MovieName = a.InnerHtml.Replace("<font color=\"#0c9000\">", "")
                               .Replace("<font color=\"#0c9000\">", "")
                               .Replace("</font>", ""),
        Dy2018OnlineUrl = onlineURL,
        // There may be no introduction, although that seems unlikely
        MovieIntro = zoom != null ? WebUtility.HtmlEncode(zoom.InnerHtml) : "no introduction...",
        // The download links may not exist
        XunLeiDownLoadURLList = lstDownLoadURL != null
            ? lstDownLoadURL.Select(d => d.FirstElementChild.InnerHtml).ToList()
            : null,
        PubDate = pubDate,
    };
    return movieInfo;
}

HTTPHelper

There is a pitfall here. The encoding of dy2018 web pages is GB2312, and .NET Core does not support GB2312 by default: calling Encoding.GetEncoding("GB2312") throws an exception.

The solution is to manually install the System.Text.Encoding.CodePages package (Install-Package System.Text.Encoding.CodePages),

then add Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) to the Configure method of Startup.cs, after which Encoding.GetEncoding("GB2312") works normally.
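A minimal sketch of the fix, assuming the System.Text.Encoding.CodePages package has been installed as described above:

```csharp
using System;
using System.Text;

class Gb2312Demo
{
    static void Main()
    {
        // Without this registration, Encoding.GetEncoding("GB2312")
        // throws an ArgumentException on .NET Core
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var gb2312 = Encoding.GetEncoding("GB2312");
        Console.WriteLine(gb2312.WebName); // prints "gb2312"
    }
}
```

Registering the provider once at startup (as the article does in Configure) makes all code-page encodings available for the lifetime of the process.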

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

namespace Dy2018Crawler
{
    public class HTTPHelper
    {
        public static HttpClient Client { get; } = new HttpClient();

        public static string GetHTMLByURL(string url)
        {
            try
            {
                System.Net.WebRequest wRequest = System.Net.WebRequest.Create(url);
                wRequest.ContentType = "text/html; charset=gb2312";
                wRequest.Method = "get";
                wRequest.UseDefaultCredentials = true;
                // Get the response instance
                var task = wRequest.GetResponseAsync();
                System.Net.WebResponse wResp = task.Result;
                System.IO.Stream respStream = wResp.GetResponseStream();
                // dy2018 pages are encoded as GB2312
                using (System.IO.StreamReader reader = new System.IO.StreamReader(respStream, Encoding.GetEncoding("GB2312")))
                {
                    return reader.ReadToEnd();
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
                return string.Empty;
            }
        }
    }
}

Implementation of scheduled tasks

The scheduled task here uses Pomelo.AspNetCore.TimedJob.

Pomelo.AspNetCore.TimedJob is a .NET Core scheduled-job library that supports millisecond-level intervals, reading schedule configuration from a database, and both synchronous and asynchronous jobs.

It is developed and maintained by .NET Core community member and former Microsoft MVP AmamiyaYuuko (who stepped down as an MVP after joining Microsoft...), but it does not appear to be open source yet. We will have to wait and see whether it is opened up.

Various versions are available on NuGet as needed. Address: https://www.nuget.org/packages/Pomelo.AspNetCore.TimedJob/1.1.0-rtm-10026

Author's own article: Timed Job-Pomelo extension package Series

Startup.cs code

First, install the corresponding package: Install-Package Pomelo.AspNetCore.TimedJob -Pre.

Add the service in the ConfigureServices method of Startup.cs and use it in the Configure method.

// This method gets called by the runtime. Use this method to add services to the container.
public void ConfigureServices(IServiceCollection services)
{
    // Add framework services.
    services.AddMvc();
    // Add TimedJob services
    services.AddTimedJob();
}

public void Configure(IApplicationBuilder app, IHostingEnvironment env, ILoggerFactory loggerFactory)
{
    // Use TimedJob
    app.UseTimedJob();
    if (env.IsDevelopment())
    {
        app.UseDeveloperExceptionPage();
        app.UseBrowserLink();
    }
    else
    {
        app.UseExceptionHandler("/Home/Error");
    }
    app.UseStaticFiles();
    app.UseMvc(routes =>
    {
        routes.MapRoute(
            name: "default",
            template: "{controller=Home}/{action=Index}/{id?}");
    });
    Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
}

Job-related code

Create a new class named XXXJob.cs, reference the namespace with using Pomelo.AspNetCore.TimedJob, make XXXJob inherit from Job, and add the following code.

public class AutoGetMovieListJob : Job
{
    // Begin: start time; Interval: execution interval in milliseconds
    // (the format below, 1000 * 3600 * 3, i.e. 3 hours, is recommended);
    // SkipWhileExecuting: whether to wait for the previous run to finish; true means wait.
    [Invoke(Begin = "", Interval = 1000 * 3600 * 3, SkipWhileExecuting = true)]
    public void Run()
    {
        // Logic code to be executed by the job
        //LogHelper.Info("Start crawler");
        //AddToLatestMovieList(100);
        //AddToHotMovieList();
        //LogHelper.Info("Finish crawling");
    }
}

Add a runtimes node for project publishing

If you use a template project created in VS2015, project.json has no runtimes node by default.

When we want to publish to a non-Windows platform, we need to add this node manually so the build can target it.

"runtimes": {
  "win7-x64": {},
  "win7-x86": {},
  "osx.10.10-x64": {},
  "osx.10.11-x64": {},
  "ubuntu.14.04-x64": {}
}

Delete/comment out the scripts node

These scripts call Node.js tooling to build the front-end code, and we cannot guarantee that every environment has bower installed... so comment them out.

//"scripts": {
//  "prepublish": [ "bower install", "dotnet bundle" ],
//  "postpublish": [ "dotnet publish-iis --publish-folder %publish:OutputPath% --framework %publish:FullTargetFramework%" ]
//},

Delete/comment out the type field in the dependencies node

"dependencies": {
  "Microsoft.NETCore.App": {
    "version": "1.1.0"
    //"type": "platform"
  },

For configuration instructions on project.json, see the official document Project.json file,

or Zhang Shanyou's article in his .NET Core series, part 2, on what project.json is really up to.

Development, compilation, and publishing

// Restore the various package files
dotnet restore
// Publish to the C:\code\website\Dy2018Crawler folder
dotnet publish -r ubuntu.14.04-x64 -c Release -o "C:\code\website\Dy2018Crawler"

Finally, the code is open source after all... You can find the code above here:

GitHub address: https://github.com/liguobao/Dy2018Crawler

PS: writing a crawler just to watch movies... I really can't help myself.
