Abot Crawler and vis.js

Source: Internet
Author: User

1. Introduction

I have been working with the Abot crawler for a few days now, and with some spare time I decided to crawl some movie data from the IMDb website for fun. Captain America 3 happens to be in theaters right now, so I planned to crawl Marvel's films from recent years and use the vis JS library to present the related films of the Marvel Universe.

Abot is an open-source C# crawler, and its code is very lightweight. See the article "Using Abot to crawl Blog Park news data" to get started with Abot.

Vis is a JS visualization library similar to D3. It provides visualizations such as network diagrams, timelines, and so on. This article uses Network: you only need to pass vis some simple node and edge information, and it builds the graph automatically.
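As a rough illustration of how little input Network needs, the sketch below builds two nodes and one edge by hand. The ids, labels, and the container id "mynetwork" are made up for this example. The vis calls are guarded so that the snippet also runs outside a browser.

```javascript
// Hypothetical sample data: two film nodes joined by one directed edge.
var nodes = [
  { id: 1, label: "Film A" },
  { id: 2, label: "Film B" }
];
var edges = [
  { from: 1, to: 2, arrows: { to: true } }
];

// Helper for this sketch only: list edges whose endpoints are not in the
// node list, which is worth catching before handing the data to vis.
function danglingEdges(nodes, edges) {
  var ids = {};
  nodes.forEach(function (n) { ids[n.id] = true; });
  return edges.filter(function (e) { return !ids[e.from] || !ids[e.to]; });
}

// Only draw when vis and a DOM are available, i.e. in a page that has
// loaded vis.js and contains <div id="mynetwork"> (an assumed id).
if (typeof vis !== "undefined" && typeof document !== "undefined") {
  var container = document.getElementById("mynetwork");
  var data = { nodes: new vis.DataSet(nodes), edges: new vis.DataSet(edges) };
  var network = new vis.Network(container, data, {});
}
```

Everything else (layout, colors, fonts) falls back to the library defaults when the options object is empty.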

2. Implementation

Starting with the data: first we need the names of all the related movies in the Marvel Universe, and there are plenty of such lists on the Web.

Going from a movie name to its IMDb page actually requires a search step. Fortunately the number of films is small, so I took a shortcut here and used the IMDb movie links directly as seed URLs:

        public static List<string> imdbFeedMovies = new List<string>()
        {
            // Iron Man 2008
            "http://www.imdb.com/title/tt1233205/",
            // Hulk
            "http://www.imdb.com/title/tt0800080/",
            // Iron Man 2
            "http://www.imdb.com/title/tt1228705/",
            // Thor
            "http://www.imdb.com/title/tt0800369/",
            // Captain America
            "http://www.imdb.com/title/tt0458339/",
            // Avengers
            "http://www.imdb.com/title/tt0848228/",
            // Iron Man 3
            "http://www.imdb.com/title/tt1300854/",
            // Thor 2
            "http://www.imdb.com/title/tt1981115/",
            // Captain America 2
            "http://www.imdb.com/title/tt1843866/",
            // Guardians of the Galaxy
            "http://www.imdb.com/title/tt2015381/",
            // Ultron
            "http://www.imdb.com/title/tt2395427/",
            // Ant-Man
            "http://www.imdb.com/title/tt0478970/",
            // Civil War
            "http://www.imdb.com/title/tt3498820/",
            // Doctor Strange
            "http://www.imdb.com/title/tt1211837/",
            // Guardians of the Galaxy 2
            "http://www.imdb.com/title/tt3896198/",
            // Thor 3
            "http://www.imdb.com/title/tt3501632/",
            // Black Panther
            "http://www.imdb.com/title/tt1825683/",
            // Avengers: Infinity War - Part I
            "http://www.imdb.com/title/tt4154756/"
        };

With the seed URLs in place, you can use Abot to crawl the movie data. Only movie names, movie pictures, and actors are crawled here.

Here are the data structures that will be needed:

    public class MarvellItem
    {
        /// <summary>
        /// http://www.imdb.com/title/tt0800369/
        /// </summary>
        public string ImdbUrl { get; set; }
        public string Name { get; set; }
        public string Image { get; set; }
    }

    public class ImdbMovie
    {
        public string ImdbUrl { get; set; }
        public string Name { get; set; }
        public string Image { get; set; }
        public DateTime Date { get; set; }
        public List<MarvellItem> Actors { get; set; }
    }

    public static readonly Regex MovieRegex = new Regex("http://www.imdb.com/title/tt\\d+", RegexOptions.Compiled);
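To see which pages the movie pattern above lets through, here is a quick check that can be pasted into a browser console. It is only a transcription of the same pattern into a JavaScript literal, not part of the crawler. The actor URL is a made-up placeholder.

```javascript
// The movie-page pattern, transcribed to a JavaScript regex literal.
// Dots are left unescaped, exactly as in the original pattern.
var movieRegex = /http:\/\/www.imdb.com\/title\/tt\d+/;

function isMoviePage(url) {
  return movieRegex.test(url);
}

// A seed URL from the list above is accepted.
console.log(isMoviePage("http://www.imdb.com/title/tt0800369/")); // true
// A hypothetical actor page is rejected, so it would not be parsed as a movie.
console.log(isMoviePage("http://www.imdb.com/name/nm0000000/"));  // false
```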

The main processing hook after Abot crawls a page is PageCrawlCompletedAsync. The handler below is the callback that runs once each movie page has been crawled.

        private ConcurrentDictionary<string, ImdbMovie> movieResult;

        // Called after each page is crawled; extracts the movie data.
        public void MovieCrawler_ProcessPageCrawlCompletedAsync(object sender, PageCrawlCompletedArgs e)
        {
            if (MovieRegex.IsMatch(e.CrawledPage.Uri.AbsoluteUri))
            {
                // Movie title
                var csTitle = e.CrawledPage.CsQueryDocument.Select(".title_block > .title_bar_wrapper > .titleBar > .title_wrapper > h1");
                string title = HtmlData.HtmlDecode(csTitle.Text().Trim());

                // Release date
                var datetime = e.CrawledPage.CsQueryDocument.Select(".title_block > .title_bar_wrapper > .titleBar > .title_wrapper > .subtext > a:last > meta");
                var year = datetime.Attr("content").Trim();

                // Poster: download it and store it under a local file name
                var csImg = e.CrawledPage.CsQueryDocument.Select(".poster > a > img");
                string image = csImg.Attr("src").Trim();
                if (!string.IsNullOrEmpty(image))
                {
                    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(image);
                    webRequest.Credentials = CredentialCache.DefaultCredentials;
                    var stream = webRequest.GetResponse().GetResponseStream();
                    if (stream != null)
                    {
                        Image bitmap = new Bitmap(stream);
                        image = e.CrawledPage.Uri.AbsoluteUri.GetHashCode() + ".jpg";
                        bitmap.Save(image);
                    }
                }

                // Cast table: one row per actor
                var csTable = e.CrawledPage.CsQueryDocument.Select("#titleCast > table");
                var csTrs = csTable.Select("tr", csTable);
                List<MarvellItem> actors = new List<MarvellItem>();
                foreach (var tr in csTrs)
                {
                    var csTr = new CsQuery.CQ(tr);
                    var csLink = csTr.Select("td > a", csTr);
                    if (csLink.Any())
                    {
                        string url = NormUrl(csLink.Attr("href").Trim());
                        string actorTitle = csLink.Select("img", csLink).Attr("title").Trim();
                        string actorImage = csLink.Select("img", csLink).Attr("src").Trim();
                        actors.Add(new MarvellItem()
                        {
                            Name = actorTitle,
                            ImdbUrl = url,
                            Image = actorImage
                        });
                    }
                }

                this.movieResult.TryAdd(e.CrawledPage.Uri.AbsoluteUri, new ImdbMovie()
                {
                    Name = title,
                    Image = image,
                    Date = DateTime.Parse(year),
                    ImdbUrl = e.CrawledPage.Uri.AbsoluteUri,
                    Actors = actors
                });
            }
        }

This function mainly parses the movie page to get the movie name, the movie picture, and the actor information. There is also a small trick: because of IMDb's restrictions, you need to download the crawled images yourself, otherwise the pictures will not display in a production environment.

More details on this trick can be found in "Some thoughts about IMG 403 Forbidden".
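One consequence of this trick is that the stored image value comes in two forms. The film poster becomes a local file name, while actor thumbnails, which the handler above does not download, remain remote URLs. The view later in this article checks for this case. The sketch below mirrors that check in plain JavaScript, with simple string concatenation standing in for the Href() helper (an assumption made to keep the example self-contained). The sample values are placeholders.

```javascript
// Empty values stay empty, absolute URLs pass through unchanged, and
// anything else is treated as a downloaded file under localBase.
function resolveImage(image, localBase) {
  if (!image) {
    return "";
  }
  if (image.indexOf("http") === 0) {
    return image;
  }
  return localBase + image;
}

var base = "../../images/marvel/";
console.log(resolveImage("http://example.com/actor.jpg", base)); // http://example.com/actor.jpg
console.log(resolveImage("12345.jpg", base));                    // ../../images/marvel/12345.jpg
```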

The movie links can all be crawled in parallel, one Task per link:

            Task[] movieTasks = new Task[imdbFeedMovies.Count];
            System.Console.WriteLine("Start crawl Movies");
            for (var i = 0; i < imdbFeedMovies.Count; i++)
            {
                var url = imdbFeedMovies[i];
                movieTasks[i] = new Task(() =>
                {
                    System.Console.WriteLine("Start crawl:" + url);
                    var crawler = GetManuallyConfiguredWebCrawler();
                    ConfigMovieCrawl(crawler);
                    crawler.Crawl(new Uri(url));
                    System.Console.WriteLine("End crawl:" + url);
                });
                movieTasks[i].Start();
            }
            Task.WaitAll(movieTasks);
            System.Console.WriteLine("End crawl Movies");

At the end we get a batch of JSON data.
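The article does not show the JSON itself, so the sample below is only a guess at its shape. The property names follow the two classes defined earlier, and every value is a placeholder. The date format assumes an ISO-style serializer.

```javascript
// Hypothetical crawl result for a single film (all values are placeholders).
var sampleJson = `[
  {
    "ImdbUrl": "http://www.imdb.com/title/tt0800369/",
    "Name": "Example Film",
    "Image": "12345.jpg",
    "Date": "2000-01-01T00:00:00",
    "Actors": [
      {
        "ImdbUrl": "http://www.imdb.com/name/nm0000000/",
        "Name": "Example Actor",
        "Image": "http://example.com/actor.jpg"
      }
    ]
  }
]`;

// Parse it and summarize each film with its actor count.
var movies = JSON.parse(sampleJson);
movies.forEach(function (m) {
  console.log(m.Name + ": " + m.Actors.length + " actor(s)"); // Example Film: 1 actor(s)
});
```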

Pass it to the front end:

@model List<ImdbMovie>
<div class="clearfix" style="position:relative">
    <div id="marvel-graph"></div>
</div>
@section postscripts{
    <script type="text/javascript">
        $(function () {
            var nodes = [];
            var edges = [];
            @for (int i = 0; i < Model.Count; i++)
            {
                var film = Model[i];
                <text>
                nodes.push({
                    id: '@film.ImdbUrl',
                    title: '@film.Name',
                    borderWidth: 4,
                    shapeProperties: { useBorderWithImage: true },
                    shape: "image",
                    image: '@(string.IsNullOrEmpty(film.Image) ? "" : (film.Image.StartsWith("http") ? film.Image : Href("../../images/marvel/" + film.Image)))',
                    color: { border: '#4db6ac', background: '#009688' }
                });
                @if (i != Model.Count - 1)
                {
                    <text>
                    edges.push({
                        from: '@film.ImdbUrl',
                        to: '@Model[i + 1].ImdbUrl',
                        arrows: { to: true },
                        width: 4,
                        length: 360,
                        color: "red"
                    });
                    </text>
                }
                @foreach (var actor in film.Actors)
                {
                    <text>
                    nodes.push({
                        id: '@film.ImdbUrl' + '@actor.ImdbUrl',
                        title: '@actor.Name',
                        borderWidth: 4,
                        shapeProperties: { useBorderWithImage: true },
                        shape: "circularImage",
                        image: '@(string.IsNullOrEmpty(actor.Image) ? "" : (actor.Image.StartsWith("http") ? actor.Image : Href("../../images/marvel/" + actor.Image)))',
                    });
                    edges.push({
                        from: '@film.ImdbUrl',
                        to: '@film.ImdbUrl' + '@actor.ImdbUrl',
                        arrows: { to: true }
                    });
                    </text>
                }
                </text>
            }
            var container = document.getElementById("marvel-graph");
            var visNodes = new vis.DataSet(nodes);
            var data = { nodes: visNodes, edges: edges };
            var options = {
                layout: { improvedLayout: false },
                nodes: {
                    borderWidth: 3,
                    font: { color: '#000000', size: 12, face: 'Segoe UI' },
                    color: { background: '#4db6ac', border: '#009688' }
                },
                edges: {
                    color: '#c1c1c1',
                    width: 2,
                    font: { color: '#2d2d2d', size: 12 },
                    smooth: { enabled: false, type: 'continuous' }
                }
            };
            var network = new vis.Network(container, data, options);
        });
    </script>
}

The core of the vis network is the call new vis.Network(container, data, options); the nodes and edges are passed in through data.
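The node and edge construction that the view performs on the server can also be sketched in plain JavaScript, which makes the structure easier to see. The function name buildGraph and the sample data are made up for this sketch. The field names follow the ImdbMovie class, and the styling options are reduced to the essentials.

```javascript
// Build vis-style node and edge arrays from a list of films.
function buildGraph(movies) {
  var nodes = [];
  var edges = [];
  movies.forEach(function (film, i) {
    // One image node per film, keyed by its IMDb URL.
    nodes.push({ id: film.ImdbUrl, title: film.Name, shape: "image", image: film.Image });

    // Chain consecutive films together with a long red arrow.
    if (i !== movies.length - 1) {
      edges.push({
        from: film.ImdbUrl, to: movies[i + 1].ImdbUrl,
        arrows: { to: true }, width: 4, length: 360, color: "red"
      });
    }

    // One node per (film, actor) pair: prefixing the film URL keeps the
    // ids distinct even when the same actor appears in several films.
    (film.Actors || []).forEach(function (actor) {
      var actorId = film.ImdbUrl + actor.ImdbUrl;
      nodes.push({ id: actorId, title: actor.Name, shape: "circularImage", image: actor.Image });
      edges.push({ from: film.ImdbUrl, to: actorId, arrows: { to: true } });
    });
  });
  return { nodes: nodes, edges: edges };
}

// Hypothetical sample: two films that share one actor.
var sample = [
  { ImdbUrl: "film-a", Name: "Film A", Image: "a.jpg",
    Actors: [{ ImdbUrl: "actor-x", Name: "Actor X", Image: "x.jpg" }] },
  { ImdbUrl: "film-b", Name: "Film B", Image: "b.jpg",
    Actors: [{ ImdbUrl: "actor-x", Name: "Actor X", Image: "x.jpg" }] }
];
var graph = buildGraph(sample);
console.log(graph.nodes.length, graph.edges.length); // 4 3
```

With this id scheme an actor who appears in two films is drawn twice, once next to each film, rather than as a single node shared between them.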

The final result:

[Image: Abot Crawler and vis.js]
