Abot Crawler and vis.js

Last Update:2016-12-22 Source: Internet

Author: User


1. Introduction

I have been working with the Abot crawler for a few days now, and with some free time on my hands I decided to crawl some movie data from the IMDb website for fun. Captain America 3 happens to be in theaters right now, so I plan to crawl the Marvel films of recent years and use the vis JS library to present the related films of the Marvel Universe.

Abot is an open-source C# crawler, and its code is very lightweight. To get started with Abot, see the article "Using Abot to crawl Blog Park news data".

Vis is a JS visualization library similar to D3. It provides visualizations such as network graphs and timelines. This article uses the Network module: you only pass vis simple node and edge information, and it builds the graph automatically.

2. Implementation

Starting with the data: first we need the names of all the related movies in the Marvel Universe, and there are plenty of such lists on the web:

Going from a movie name to its IMDb page really requires a search step. Fortunately there are not many films, so I take a shortcut here and use the IMDb movie links directly as seed URLs:

        public static List<string> imdbFeedMovies = new List<string>()
        {
            // Iron Man 2008
            "http://www.imdb.com/title/tt1233205/",
            // Hulk
            "http://www.imdb.com/title/tt0800080/",
            // Iron Man 2
            "http://www.imdb.com/title/tt1228705/",
            // Thor
            "http://www.imdb.com/title/tt0800369/",
            // Captain America
            "http://www.imdb.com/title/tt0458339/",
            // Avengers
            "http://www.imdb.com/title/tt0848228/",
            // Iron Man 3
            "http://www.imdb.com/title/tt1300854/",
            // Thor 2
            "http://www.imdb.com/title/tt1981115/",
            // Captain America 2
            "http://www.imdb.com/title/tt1843866/",
            // Guardians of the Galaxy
            "http://www.imdb.com/title/tt2015381/",
            // Ultron
            "http://www.imdb.com/title/tt2395427/",
            // Ant-Man
            "http://www.imdb.com/title/tt0478970/",
            // Civil War
            "http://www.imdb.com/title/tt3498820/",
            // Doctor Strange
            "http://www.imdb.com/title/tt1211837/",
            // Guardians of the Galaxy 2
            "http://www.imdb.com/title/tt3896198/",
            // Thor 3
            "http://www.imdb.com/title/tt3501632/",
            // Black Panther
            "http://www.imdb.com/title/tt1825683/",
            // Avengers: Infinity War - Part I
            "http://www.imdb.com/title/tt4154756/"
        };

With the seed URLs in hand, we can use Abot to crawl the movie data. We only crawl the movie name, the movie poster, and the actors.

First, define the data structures we need:

    public class MarvellItem
    {
        /// <summary>
        /// http://www.imdb.com/title/tt0800369/
        /// </summary>
        public string ImdbUrl { get; set; }
        public string Name { get; set; }
        public string Image { get; set; }
    }

    public class ImdbMovie
    {
        public string ImdbUrl { get; set; }
        public string Name { get; set; }
        public string Image { get; set; }
        public DateTime Date { get; set; }
        public List<MarvellItem> Actors { get; set; }
    }

    public static readonly Regex MovieRegex = new Regex("http://www.imdb.com/title/tt\\d+", RegexOptions.Compiled);
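The movie regex above decides which crawled pages are treated as movie pages. As a quick illustration of that check outside the crawler, here is a minimal JavaScript sketch. The function name `isMoviePage` is hypothetical, the second URL is a made-up placeholder, and the pattern is a slightly stricter variant of the original (anchored at the start, with the dots escaped), which is my assumption rather than the author's exact expression.

```javascript
// Hypothetical helper mirroring the movie-page check done in the crawler callback.
// Assumption: anchored and dot-escaped variant of the pattern used in the article.
const MOVIE_PAGE = /^http:\/\/www\.imdb\.com\/title\/tt\d+/;

function isMoviePage(url) {
  // test() on a non-global regex is stateless, so repeated calls are safe.
  return MOVIE_PAGE.test(url);
}

console.log(isMoviePage("http://www.imdb.com/title/tt0800369/")); // true
console.log(isMoviePage("http://www.imdb.com/name/nm0000000/"));  // false (placeholder actor URL)
```

Only URLs under `/title/tt<digits>` pass, so actor pages and other links reached during the crawl are skipped by the parsing logic.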

The main handler that runs after Abot crawls a page is PageCrawlCompletedAsync. Below is the complete callback that runs after each movie page has been crawled.

        private ConcurrentDictionary<string, ImdbMovie> movieResult;

        // Called after each page is crawled; extracts the movie data
        public void MovieCrawler_ProcessPageCrawlCompletedAsync(object sender, PageCrawlCompletedArgs e)
        {
            if (MovieRegex.IsMatch(e.CrawledPage.Uri.AbsoluteUri))
            {
                // Movie title
                var csTitle = e.CrawledPage.CsQueryDocument.Select(".title_block > .title_bar_wrapper > .titleBar > .title_wrapper > h1");
                string title = HtmlData.HtmlDecode(csTitle.Text().Trim());

                // Release date
                var datetime = e.CrawledPage.CsQueryDocument.Select(".title_block > .title_bar_wrapper > .titleBar > .title_wrapper > .subtext > a:last > meta");
                var year = datetime.Attr("content").Trim();

                // Poster: download it and keep only the local file name
                var csImg = e.CrawledPage.CsQueryDocument.Select(".poster > a > img");
                string image = csImg.Attr("src").Trim();
                if (!string.IsNullOrEmpty(image))
                {
                    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(image);
                    webRequest.Credentials = CredentialCache.DefaultCredentials;
                    var stream = webRequest.GetResponse().GetResponseStream();
                    if (stream != null)
                    {
                        Image bitmap = new Bitmap(stream);
                        image = e.CrawledPage.Uri.AbsoluteUri.GetHashCode() + ".jpg";
                        bitmap.Save(image);
                    }
                }

                // Cast table: one row per actor
                var csTable = e.CrawledPage.CsQueryDocument.Select("#titleCast > table");
                var csTrs = csTable.Select("tr", csTable);
                List<MarvellItem> actors = new List<MarvellItem>();
                foreach (var tr in csTrs)
                {
                    var csTr = new CsQuery.CQ(tr);
                    var csLink = csTr.Select("td > a", csTr);
                    if (csLink.Any())
                    {
                        // NormUrl is the author's own helper (not shown in the article)
                        string url = NormUrl(csLink.Attr("href").Trim());
                        string actorTitle = csLink.Select("img", csLink).Attr("title").Trim();
                        string actorImage = csLink.Select("img", csLink).Attr("src").Trim();
                        actors.Add(new MarvellItem()
                        {
                            Name = actorTitle,
                            ImdbUrl = url,
                            Image = actorImage
                        });
                    }
                }

                this.movieResult.TryAdd(e.CrawledPage.Uri.AbsoluteUri, new ImdbMovie()
                {
                    Name = title,
                    Image = image,
                    Date = DateTime.Parse(year),
                    ImdbUrl = e.CrawledPage.Uri.AbsoluteUri,
                    Actors = actors
                });
            }
        }

This function parses the movie page and extracts the movie name, the poster, and the actor information. There is also a small trick: because of IMDb's restrictions, you need to download the crawled images yourself; otherwise the pictures will not display in a production environment.

More details of this trick can be found in "Some thoughts about IMG 403 Forbidden".
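The callback names each downloaded poster after `GetHashCode()` of the page URL. .NET does not guarantee that string hash codes stay the same across processes or runtime versions, so a file name derived this way may not be reproducible between runs. One alternative is an explicit, deterministic hash. The sketch below uses 32-bit FNV-1a, which is a swapped-in technique for illustration and not what the original code does. The names `fnv1a32` and `localImageName` are hypothetical, and the code assumes plain ASCII URLs.

```javascript
// 32-bit FNV-1a over the characters of an ASCII string.
// Assumption: this replaces GetHashCode() purely for illustration.
function fnv1a32(text) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    // Multiply by the FNV prime with 32-bit wraparound, keep the result unsigned.
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Hypothetical helper: derive a stable local file name for a movie page's poster.
function localImageName(pageUrl) {
  return fnv1a32(pageUrl).toString(16).padStart(8, "0") + ".jpg";
}

console.log(localImageName("http://www.imdb.com/title/tt0800369/"));
```

The same URL always maps to the same file name, so a later run can detect that a poster was already downloaded.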

All the movie links can be crawled in parallel using Tasks:

            Task[] movieTasks = new Task[imdbFeedMovies.Count];
            System.Console.WriteLine("Start crawl movies");
            for (var i = 0; i < imdbFeedMovies.Count; i++)
            {
                var url = imdbFeedMovies[i];
                movieTasks[i] = new Task(() =>
                {
                    System.Console.WriteLine("Start crawl: " + url);
                    var crawler = GetManuallyConfiguredWebCrawler();
                    ConfigMovieCrawl(crawler);
                    crawler.Crawl(new Uri(url));
                    System.Console.WriteLine("End crawl: " + url);
                });
                movieTasks[i].Start();
            }
            Task.WaitAll(movieTasks);
            System.Console.WriteLine("End crawl movies");

In the end we get a batch of JSON data.

Pass it to the front end:

@model List<ImdbMovie>
<div class="clearfix" style="position:relative">
    <div id="marvel-graph"></div>
</div>
@section postScripts{
    <script type="text/javascript">
        $(function () {
            var nodes = [];
            var edges = [];
            @for (int i = 0; i < Model.Count; i++)
            {
                var film = Model[i];
                <text>
                nodes.push({
                    id: '@film.ImdbUrl',
                    title: '@film.Name',
                    borderWidth: 4,
                    shapeProperties: { useBorderWithImage: true },
                    shape: "image",
                    image: '@(string.IsNullOrEmpty(film.Image) ? "" : (film.Image.StartsWith("http") ? film.Image : Href("../../images/marvel/" + film.Image)))',
                    color: { border: '#4db6ac', background: '#009688' }
                });
                @if (i != Model.Count - 1)
                {
                    <text>
                    edges.push({
                        from: '@film.ImdbUrl',
                        to: '@Model[i + 1].ImdbUrl',
                        arrows: { to: true },
                        width: 4,
                        length: 360,
                        color: "red"
                    });
                    </text>
                }
                @foreach (var actor in film.Actors)
                {
                    <text>
                    nodes.push({
                        id: '@film.ImdbUrl' + '@actor.ImdbUrl',
                        title: '@actor.Name',
                        borderWidth: 4,
                        shapeProperties: { useBorderWithImage: true },
                        shape: "circularImage",
                        image: '@(string.IsNullOrEmpty(actor.Image) ? "" : (actor.Image.StartsWith("http") ? actor.Image : Href("../../images/marvel/" + actor.Image)))',
                    });
                    edges.push({
                        from: '@film.ImdbUrl',
                        to: '@film.ImdbUrl' + '@actor.ImdbUrl',
                        arrows: { to: true }
                    });
                    </text>
                }
                </text>
            }
            var container = document.getElementById("marvel-graph");
            var visNodes = new vis.DataSet(nodes);
            var data = { nodes: visNodes, edges: edges };
            var options = {
                layout: { improvedLayout: false },
                nodes: {
                    borderWidth: 3,
                    font: { color: '#000000', size: 12, face: 'Segoe UI' },
                    color: { background: '#4db6ac', border: '#009688' }
                },
                edges: {
                    color: '#c1c1c1',
                    width: 2,
                    font: { color: '#2d2d2d', size: 12 },
                    smooth: { enabled: false, type: 'continuous' }
                }
            };
            var network = new vis.Network(container, data, options);
        });
    </script>
}

The core of the vis Network is the call new vis.Network(container, data, options), which takes the nodes and edges as its data.
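The view above generates the `nodes` and `edges` arrays on the server. If the crawled JSON were sent to the browser instead, the same structure could be assembled in plain JavaScript. The sketch below is one way to do that under assumptions: the function name `buildGraph` is hypothetical, the sample data (including the actor URL, names, and image file names) is made up, the property names are assumed to follow the movie and actor classes defined earlier, and the styling options are omitted. It chains the films in order and attaches each film's actors to it, using the film URL plus the actor URL as the actor node id, as the view does.

```javascript
// Hypothetical helper: turn the crawled movie list into vis-style nodes and edges.
function buildGraph(movies) {
  const nodes = [];
  const edges = [];
  movies.forEach((film, i) => {
    nodes.push({ id: film.ImdbUrl, title: film.Name, shape: "image", image: film.Image });
    // Link each film to the next one in the list.
    if (i < movies.length - 1) {
      edges.push({ from: film.ImdbUrl, to: movies[i + 1].ImdbUrl, arrows: { to: true } });
    }
    for (const actor of film.Actors) {
      // The same actor can appear in several films, so the id is made unique per film.
      const actorId = film.ImdbUrl + actor.ImdbUrl;
      nodes.push({ id: actorId, title: actor.Name, shape: "circularImage", image: actor.Image });
      edges.push({ from: film.ImdbUrl, to: actorId, arrows: { to: true } });
    }
  });
  return { nodes, edges };
}

// Made-up sample data for illustration only.
const sample = [
  {
    ImdbUrl: "http://www.imdb.com/title/tt0800369/", Name: "Thor", Image: "thor.jpg",
    Actors: [{ ImdbUrl: "http://www.imdb.com/name/nm0000000/", Name: "Example Actor", Image: "actor.jpg" }]
  },
  { ImdbUrl: "http://www.imdb.com/title/tt1981115/", Name: "Thor 2", Image: "thor2.jpg", Actors: [] }
];

const graph = buildGraph(sample);
console.log(graph.nodes.length, graph.edges.length); // 3 2
```

The result can then be handed to vis in the same way as in the view: `new vis.Network(container, { nodes: new vis.DataSet(graph.nodes), edges: graph.edges }, options)`.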

The final effect


This article is an English version of an article originally published in Chinese on aliyun.com.
