1. Introduction
I have been working with the Abot crawler for a few days, and in my spare time I decided to crawl some movie data from the IMDB website for fun. Captain America 3 happens to be showing in theaters, so I set out to crawl Marvel's films from recent years and use the vis JS library to present the related films of the Marvel Universe.
Abot is an open-source C# crawler, and its code is very lightweight. See the article "Using Abot to crawl Blog Park news data" to get started with Abot.
Vis is a JS visualization library similar to D3. It provides visualizations such as network diagrams and timelines. This article uses the network module: you only need to pass vis some simple node and edge information, and it builds the graph automatically.
2. Implementation
Starting with the data, we first need the names of all related films in the Marvel Universe; such lists are easy to find online.
Going from a movie name to its IMDB page actually requires a search step. Fortunately there are not many films, so I took a shortcut and used the IMDB movie links directly as seed URLs:
public static List<string> imdbfeedmovies = new List<string>()
{
    // Iron Man (2008)
    "http://www.imdb.com/title/tt1233205/",
    // Hulk
    "http://www.imdb.com/title/tt0800080/",
    // Iron Man 2
    "http://www.imdb.com/title/tt1228705/",
    // Thor
    "http://www.imdb.com/title/tt0800369/",
    // Captain America
    "http://www.imdb.com/title/tt0458339/",
    // Avengers
    "http://www.imdb.com/title/tt0848228/",
    // Iron Man 3
    "http://www.imdb.com/title/tt1300854/",
    // Thor 2
    "http://www.imdb.com/title/tt1981115/",
    // Captain America 2
    "http://www.imdb.com/title/tt1843866/",
    // Guardians of the Galaxy
    "http://www.imdb.com/title/tt2015381/",
    // Ultron
    "http://www.imdb.com/title/tt2395427/",
    // Ant-Man
    "http://www.imdb.com/title/tt0478970/",
    // Civil War
    "http://www.imdb.com/title/tt3498820/",
    // Doctor Strange
    "http://www.imdb.com/title/tt1211837/",
    // Guardians of the Galaxy 2
    "http://www.imdb.com/title/tt3896198/",
    // Thor 3
    "http://www.imdb.com/title/tt3501632/",
    // Black Panther
    "http://www.imdb.com/title/tt1825683/",
    // Avengers: Infinity War - Part I
    "http://www.imdb.com/title/tt4154756/"
};
With the seed URLs in place, Abot can crawl the movie data. Only the movie name, the movie poster, and the cast are extracted.
First, define the data structures that will be needed:
public class Marvellitem
{
    /// <summary>
    /// http://www.imdb.com/title/tt0800369/
    /// </summary>
    public string Imdburl { get; set; }
    public string Name { get; set; }
    public string Image { get; set; }
}

public class Imdbmovie
{
    public string Imdburl { get; set; }
    public string Name { get; set; }
    public string Image { get; set; }
    public DateTime Date { get; set; }
    public List<Marvellitem> Actors { get; set; }
}

public static readonly Regex Movieregex = new Regex("http://www.imdb.com/title/tt\\d+", RegexOptions.Compiled);
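As written, the pattern leaves the dots unescaped and is not anchored, so it can match slightly more than intended. A minimal sketch of a stricter check, written in JavaScript (the language of the front-end code later in this article); `isMoviePage` and `MOVIE_URL` are hypothetical names used only for illustration, not part of Abot:

```javascript
// Hypothetical helper: a stricter version of the movie-page pattern.
// Dots are escaped and the match is anchored at the start of the URL.
const MOVIE_URL = /^https?:\/\/www\.imdb\.com\/title\/tt\d+\/?/i;

function isMoviePage(url) {
  return MOVIE_URL.test(url);
}

console.log(isMoviePage("http://www.imdb.com/title/tt0800369/")); // true
console.log(isMoviePage("http://www.imdb.com/name/nm0000000/"));  // false
```

The same escaping and anchoring could be applied to the C# pattern if stricter matching is wanted.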
In Abot, the main place to process a page after it has been crawled is the PageCrawlCompletedAsync event handler, which serves as the completion callback for each crawled movie page.
private ConcurrentDictionary<string, Imdbmovie> movieResult;

// Called for every crawled page; extracts the movie data
public void Moviecrawler_processpagecrawlcompletedasync(object sender, PageCrawlCompletedArgs e)
{
    if (Movieregex.IsMatch(e.CrawledPage.Uri.AbsoluteUri))
    {
        // Movie title
        var csTitle = e.CrawledPage.CsQueryDocument.Select(".title_block > .title_bar_wrapper > .titlebar > .title_wrapper > h1");
        string title = HtmlData.HtmlDecode(csTitle.Text().Trim());

        // Release date
        var datetime = e.CrawledPage.CsQueryDocument.Select(".title_block > .title_bar_wrapper > .titlebar > .title_wrapper > .subtext > a:last > meta");
        var year = datetime.Attr("content").Trim();

        // Poster: download it and keep only the local file name
        var csImg = e.CrawledPage.CsQueryDocument.Select(".poster > a > img");
        string image = csImg.Attr("src").Trim();
        if (!string.IsNullOrEmpty(image))
        {
            HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(image);
            webRequest.Credentials = CredentialCache.DefaultCredentials;
            var stream = webRequest.GetResponse().GetResponseStream();
            if (stream != null)
            {
                Image bitmap = new Bitmap(stream);
                image = e.CrawledPage.Uri.AbsoluteUri.GetHashCode() + ".jpg";
                bitmap.Save(image);
            }
        }

        // Cast table
        var csTable = e.CrawledPage.CsQueryDocument.Select("#titleCast > table");
        var csTrs = csTable.Select("tr", csTable);
        List<Marvellitem> actors = new List<Marvellitem>();
        foreach (var tr in csTrs)
        {
            var csTr = new CsQuery.CQ(tr);
            var csLink = csTr.Select("td > a", csTr);
            if (csLink.Any())
            {
                string url = Normurl(csLink.Attr("href").Trim());
                string actorTitle = csLink.Select("img", csLink).Attr("title").Trim();
                string actorImage = csLink.Select("img", csLink).Attr("src").Trim();
                actors.Add(new Marvellitem()
                {
                    Name = actorTitle,
                    Imdburl = url,
                    Image = actorImage
                });
            }
        }

        this.movieResult.TryAdd(e.CrawledPage.Uri.AbsoluteUri, new Imdbmovie()
        {
            Name = title,
            Image = image,
            Date = DateTime.Parse(year),
            Imdburl = e.CrawledPage.Uri.AbsoluteUri,
            Actors = actors
        });
    }
}
This function parses the movie page and extracts the movie name, the movie poster, and the cast information. There is also a small trick: because of IMDB's restrictions, the crawled images have to be downloaded locally, otherwise they will not display in a production environment.
More details on this trick can be found in the article "Some thoughts about IMG 403 Forbidden".
The movie links can all be crawled in parallel using Task:
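The handler names the saved poster after the hash code of the page URL. .NET does not guarantee that string hash codes are stable across processes or runtime versions, so a name derived from the URL itself may be more predictable. A minimal sketch of that alternative in JavaScript; `posterFileName` is a hypothetical helper, and this is a swapped-in naming scheme rather than what the original code does:

```javascript
// Hypothetical helper: derive a stable local file name from the IMDB title id
// (the "tt" followed by digits) instead of from a hash code.
function posterFileName(movieUrl) {
  const match = /\/title\/(tt\d+)/.exec(movieUrl);
  if (!match) {
    throw new Error("Not an IMDB title URL: " + movieUrl);
  }
  return match[1] + ".jpg";
}

console.log(posterFileName("http://www.imdb.com/title/tt0848228/")); // tt0848228.jpg
```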
Task[] movieTasks = new Task[imdbfeedmovies.Count];
System.Console.WriteLine("Start crawl Movies");
for (var i = 0; i < imdbfeedmovies.Count; i++)
{
    // Declared inside the loop so each task captures its own URL
    var url = imdbfeedmovies[i];
    movieTasks[i] = new Task(() =>
    {
        System.Console.WriteLine("Start crawl:" + url);
        var crawler = Getmanuallyconfiguredwebcrawler();
        Configmoviecrawl(crawler);
        crawler.Crawl(new Uri(url));
        System.Console.WriteLine("End crawl:" + url);
    });
    movieTasks[i].Start();
}
// Block until every movie page has been crawled
Task.WaitAll(movieTasks);
System.Console.WriteLine("End crawl Movies");
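The same fan-out pattern (start one unit of work per seed URL, then wait for all of them) can be sketched in JavaScript with `Promise.all`. Here `crawlAll` and the `crawlOne` callback are hypothetical stand-ins for the Abot crawl, used only to show the shape of the code:

```javascript
// Hypothetical sketch: run one crawl per URL concurrently and wait for all.
// crawlOne is an assumed async function supplied by the caller.
async function crawlAll(urls, crawlOne) {
  console.log("Start crawl movies");
  const results = await Promise.all(
    urls.map(async (url) => {
      console.log("Start crawl: " + url);
      const result = await crawlOne(url);
      console.log("End crawl: " + url);
      return result;
    })
  );
  console.log("End crawl movies");
  return results; // same order as the input URLs
}

// Usage with a fake crawler that just echoes the URL back.
crawlAll(["http://www.imdb.com/title/tt0848228/"], async (url) => ({ url }))
  .then((movies) => console.log("Crawled " + movies.length + " page(s)"));
```

`Promise.all` preserves input order in its result array, which plays the role that indexing `movieTasks[i]` plays in the C# version.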
In the end we get a batch of JSON data.
Pass it to the front end:
@model List<Imdbmovie>
<div class="clearfix" style="position:relative">
    <div id="marvel-graph"></div>
</div>
@section postscripts{
<script type="text/javascript">
    $(function () {
        var nodes = [];
        var edges = [];
        @for (int i = 0; i < Model.Count; i++)
        {
            var film = Model[i];
            <text>
            // One image node per film
            nodes.push({
                id: '@film.Imdburl',
                title: '@film.Name',
                borderWidth: 4,
                shapeProperties: { useBorderWithImage: true },
                shape: 'image',
                image: '@(string.IsNullOrEmpty(film.Image) ? "" : (film.Image.StartsWith("http") ? film.Image : Href("../../images/marvel/" + film.Image)))',
                color: { border: '#4db6ac', background: '#009688' }
            });
            @if (i != Model.Count - 1)
            {
                <text>
                // Chain each film to the next one
                edges.push({
                    from: '@film.Imdburl',
                    to: '@Model[i + 1].Imdburl',
                    arrows: { to: true },
                    width: 4,
                    length: 360,
                    color: 'red'
                });
                </text>
            }
            @foreach (var actor in film.Actors)
            {
                <text>
                // One circular node per actor, linked to the film
                nodes.push({
                    id: '@film.Imdburl' + '@actor.Imdburl',
                    title: '@actor.Name',
                    borderWidth: 4,
                    shapeProperties: { useBorderWithImage: true },
                    shape: 'circularImage',
                    image: '@(string.IsNullOrEmpty(actor.Image) ? "" : (actor.Image.StartsWith("http") ? actor.Image : Href("../../images/marvel/" + actor.Image)))'
                });
                edges.push({
                    from: '@film.Imdburl',
                    to: '@film.Imdburl' + '@actor.Imdburl',
                    arrows: { to: true }
                });
                </text>
            }
            </text>
        }
        var container = document.getElementById("marvel-graph");
        var visnodes = new vis.DataSet(nodes);
        var data = { nodes: visnodes, edges: edges };
        var options = {
            layout: { improvedLayout: false },
            nodes: {
                borderWidth: 3,
                font: { color: '#000000', size: 12, face: 'Segoe UI' },
                color: { background: '#4db6ac', border: '#009688' }
            },
            edges: {
                color: '#c1c1c1',
                width: 2,
                font: { color: '#2d2d2d', size: 12 },
                smooth: { enabled: false, type: 'continuous' }
            }
        };
        var network = new vis.Network(container, data, options);
    });
</script>
}
The core of the vis network is the call new vis.Network(container, data, options), which takes the nodes and edges as input.
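To make the shape of that input easier to see, the node and edge construction can be sketched as a plain JavaScript function over the crawled JSON. `buildGraph` is a hypothetical helper, and the sample data in the usage lines is made up for illustration; the property names (`Imdburl`, `Name`, `Image`, `Actors`) are assumed to follow the C# model:

```javascript
// Hypothetical helper: turn a list of movies into vis-style nodes and edges.
function buildGraph(movies) {
  const nodes = [];
  const edges = [];
  movies.forEach((film, i) => {
    // One image node per film.
    nodes.push({ id: film.Imdburl, title: film.Name, shape: "image", image: film.Image });
    // Chain each film to the next one in the list.
    if (i !== movies.length - 1) {
      edges.push({ from: film.Imdburl, to: movies[i + 1].Imdburl, arrows: { to: true } });
    }
    for (const actor of film.Actors) {
      // Composite id: the same actor gets a separate node for each film.
      const actorId = film.Imdburl + actor.Imdburl;
      nodes.push({ id: actorId, title: actor.Name, shape: "circularImage", image: actor.Image });
      edges.push({ from: film.Imdburl, to: actorId, arrows: { to: true } });
    }
  });
  return { nodes, edges };
}

// Usage with made-up sample data.
const graph = buildGraph([
  { Imdburl: "film-a", Name: "Film A", Image: "a.jpg",
    Actors: [{ Imdburl: "actor-x", Name: "Actor X", Image: "x.jpg" }] },
  { Imdburl: "film-b", Name: "Film B", Image: "b.jpg", Actors: [] },
]);
console.log(graph.nodes.length, graph.edges.length); // 3 2
```

In the browser, the result would be passed on as `new vis.Network(container, { nodes: new vis.DataSet(graph.nodes), edges: graph.edges }, options)`. The composite actor id keeps node ids unique, at the cost of showing an actor once per film rather than as a single shared node.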
The final effect
Abot crawler and vis.js