The Abot Crawler and vis.js


1. Introduction

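I have been working with the Abot crawler for a few days now, and with some free time on my hands I decided to crawl some movie data from IMDB for fun. Captain America 3 happens to be in theaters, so the plan is to crawl the Marvel films of recent years and use the vis JS library to visualize the related films of the Marvel Cinematic Universe.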

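Abot is an open-source C# crawler with a very lightweight codebase. For an introduction to Abot, see the article "Using Abot to crawl cnblogs news data" (利用Abot 抓取部落格園新聞資料).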

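Vis is a JS visualization library similar to D3. It provides visualizations such as Network (node-link graphs) and Timeline. This article uses Network: you only need to pass vis some simple node and edge information, and it builds the network graph automatically.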
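To make "simple node and edge information" concrete, here is a minimal sketch of the data that vis Network accepts. The ids and labels are illustrative placeholders, `findDanglingEdges` is a hypothetical helper that is not part of vis, and the rendering step assumes vis.js is loaded on a page that contains the `marvel-graph` element used later in this article.

```javascript
// Nodes need a unique id; edges refer to node ids through "from" and "to".
// All ids and labels below are placeholders, not real crawl results.
const sketchNodes = [
  { id: "film-1", label: "Film 1" },
  { id: "film-2", label: "Film 2" },
  { id: "actor-1", label: "Actor 1" }
];
const sketchEdges = [
  { from: "film-1", to: "film-2", arrows: "to" },
  { from: "film-1", to: "actor-1" }
];

// Hypothetical helper: return the edges whose endpoints are not known node ids.
function findDanglingEdges(nodeList, edgeList) {
  const ids = new Set(nodeList.map((n) => n.id));
  return edgeList.filter((e) => !ids.has(e.from) || !ids.has(e.to));
}

// Render only in a browser where vis.js has been loaded.
if (typeof vis !== "undefined" && typeof document !== "undefined") {
  const container = document.getElementById("marvel-graph");
  const data = { nodes: new vis.DataSet(sketchNodes), edges: sketchEdges };
  new vis.Network(container, data, {});
}
```

Checking for dangling edges before rendering is optional, but it can catch id mismatches early when the ids are built from crawled URLs.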

 

2. Implementation

We start with the data: the titles of all films in the Marvel Cinematic Universe. Lists of these are easy to find online.

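Going from a film title to its IMDB page actually requires a search step. Fortunately there are not many films, so I take a shortcut here and use the IMDB film links directly as the seed URLs: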

        public static List<string> ImdbFeedMovies = new List<string>()
        {
            // Iron Man 2008
            "http://www.imdb.com/title/tt1233205/",
            // The Incredible Hulk 2008
            "http://www.imdb.com/title/tt0800080/",
            // Iron Man 2 2010
            "http://www.imdb.com/title/tt1228705/",
            // Thor 2011
            "http://www.imdb.com/title/tt0800369/",
            // Captain America
            "http://www.imdb.com/title/tt0458339/",
            // The Avengers
            "http://www.imdb.com/title/tt0848228/",
            // Iron Man 3
            "http://www.imdb.com/title/tt1300854/",
            // Thor 2
            "http://www.imdb.com/title/tt1981115/",
            // Captain America 2
            "http://www.imdb.com/title/tt1843866/",
            // Guardians of the Galaxy
            "http://www.imdb.com/title/tt2015381/",
            // Ultron
            "http://www.imdb.com/title/tt2395427/",
            // Ant-Man
            "http://www.imdb.com/title/tt0478970/",
            // Civil War
            "http://www.imdb.com/title/tt3498820/",
            // Doctor Strange
            "http://www.imdb.com/title/tt1211837/",
            // Guardians of the Galaxy 2
            "http://www.imdb.com/title/tt3896198/",
            // Thor 3
            "http://www.imdb.com/title/tt3501632/",
            // Black Panther
            "http://www.imdb.com/title/tt1825683/",
            // Avengers: Infinity War - Part I
            "http://www.imdb.com/title/tt4154756/"
        };

With the seed URLs in place, Abot can crawl the film data. Here we only extract the film title, the poster image, and the cast.

First, define the data structures we will need:

    public class MarvellItem
    {
        /// <summary>
        /// http://www.imdb.com/title/tt0800369/
        /// </summary>
        public string ImdbUrl { get; set; }

        public string Name { get; set; }

        public string Image { get; set; }
    }

    public class ImdbMovie
    {
        public string ImdbUrl { get; set; }

        public string Name { get; set; }

        public string Image { get; set; }

        public DateTime Date { get; set; }

        public List<MarvellItem> Actors { get; set; }
    }

    public static readonly Regex MovieRegex = new Regex("http://www.imdb.com/title/tt\\d+", RegexOptions.Compiled);

In Abot, the main hook for processing a page after it has been crawled is PageCrawlCompletedAsync. Below is the completion callback that runs after each film page is crawled:

        // Crawled film data, keyed by page URL.
        private ConcurrentDictionary<string, ImdbMovie> movieResult;

        public void Moviecrawler_ProcessPageCrawlCompletedAsync(object sender, PageCrawlCompletedArgs e)
        {
            if (MovieRegex.IsMatch(e.CrawledPage.Uri.AbsoluteUri))
            {
                // Film title.
                var csTitle = e.CrawledPage.CsQueryDocument.Select(".title_block > .title_bar_wrapper > .titleBar > .title_wrapper > h1");
                string title = HtmlData.HtmlDecode(csTitle.Text().Trim());

                // Release date, read from the <meta> tag inside the last link of the subtext line.
                var datetime =
                    e.CrawledPage.CsQueryDocument.Select(
                        ".title_block > .title_bar_wrapper > .titleBar > .title_wrapper > .subtext > a:last > meta");
                var year = datetime.Attr("content").Trim();

                // Poster image: download it and keep only the local file name.
                var csImg = e.CrawledPage.CsQueryDocument.Select(".poster > a > img");
                string image = csImg.Attr("src").Trim();
                if (!string.IsNullOrEmpty(image))
                {
                    HttpWebRequest webRequest = (HttpWebRequest) WebRequest.Create(image);
                    webRequest.Credentials = CredentialCache.DefaultCredentials;
                    // Dispose the response and stream once the image has been saved.
                    using (var response = webRequest.GetResponse())
                    using (var stream = response.GetResponseStream())
                    {
                        if (stream != null)
                        {
                            using (Image bitmap = new Bitmap(stream))
                            {
                                image = e.CrawledPage.Uri.AbsoluteUri.GetHashCode() + ".jpg";
                                bitmap.Save(image);
                            }
                        }
                    }
                }

                // Cast table: one row per actor.
                var csTable = e.CrawledPage.CsQueryDocument.Select("#titleCast > table");
                var csTrs = csTable.Select("tr", csTable);
                List<MarvellItem> actors = new List<MarvellItem>();
                foreach (var tr in csTrs)
                {
                    var csTr = new CsQuery.CQ(tr);
                    var cslink = csTr.Select("td > a", csTr);
                    if (cslink.Any())
                    {
                        // NormUrl is a helper from the author's project; its definition is not shown here.
                        string url = NormUrl(cslink.Attr("href").Trim());
                        string actorTitle = cslink.Select("img", cslink).Attr("title").Trim();
                        string actorImage = cslink.Select("img", cslink).Attr("src").Trim();
                        actors.Add(new MarvellItem()
                        {
                            Name = actorTitle,
                            ImdbUrl = url,
                            Image = actorImage
                        });
                    }
                }

                this.movieResult.TryAdd(e.CrawledPage.Uri.AbsoluteUri, new ImdbMovie()
                {
                    Name = title,
                    Image = image,
                    Date = DateTime.Parse(year),
                    ImdbUrl = e.CrawledPage.Uri.AbsoluteUri,
                    Actors = actors
                });
            }
        }

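The main job of this function is to parse the film page and extract the film title, the poster image, and the cast information. There is also a small trick: because of IMDB's restrictions, the crawled images have to be downloaded and served locally, otherwise an <img src=""/> that points at IMDB will fail to display in production.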

For more details on this trick, see "Some thoughts on img 403 forbidden" (關於img 403 forbidden的一些思考).

All of the film links can be crawled in parallel using Tasks:

           Task[] movieTasks = new Task[ImdbFeedMovies.Count];
           System.Console.WriteLine("Start crawl Movies");
           for (var i = 0; i < ImdbFeedMovies.Count; i++)
           {
               var url = ImdbFeedMovies[i];
               movieTasks[i] = new Task(() =>
               {
                   System.Console.WriteLine("Start crawl:" + url);
                   var crawler = GetManuallyConfiguredWebCrawler();
                   ConfigMovieCrawl(crawler);
                   crawler.Crawl(new Uri(url));
                   System.Console.WriteLine("End crawl:" + url);
               });
               movieTasks[i].Start();
           }
           Task.WaitAll(movieTasks);
           System.Console.WriteLine("End crawl Movies");

When the crawl finishes, we end up with a pile of JSON data.
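The article does not show the JSON itself, so the following is only a sketch of what one record might look like, assuming the ImdbMovie properties are serialized under their C# names and the date as an ISO string. Every value is a placeholder, and `parseMovies` is a hypothetical helper for checking the shape before the data is handed to the front end.

```javascript
// Hypothetical sample of the crawler output; all values are placeholders.
const sampleJson = `[
  {
    "ImdbUrl": "http://www.imdb.com/title/tt0000001/",
    "Name": "Example Film",
    "Image": "12345.jpg",
    "Date": "2000-01-01T00:00:00",
    "Actors": [
      {
        "ImdbUrl": "http://www.imdb.com/name/nm0000001/",
        "Name": "Example Actor",
        "Image": "http://example.com/actor.jpg"
      }
    ]
  }
]`;

// Parse the JSON and verify that each film has the fields the graph needs.
function parseMovies(json) {
  const movies = JSON.parse(json);
  if (!Array.isArray(movies)) {
    throw new TypeError("expected an array of films");
  }
  for (const movie of movies) {
    for (const key of ["ImdbUrl", "Name", "Image", "Date", "Actors"]) {
      if (!(key in movie)) {
        throw new Error(`film is missing "${key}"`);
      }
    }
    if (!Array.isArray(movie.Actors)) {
      throw new TypeError("Actors must be an array");
    }
  }
  return movies;
}

const parsedMovies = parseMovies(sampleJson);
console.log(parsedMovies.length, parsedMovies[0].Actors.length);
```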

Pass it to the front end:

@model List<ImdbMovie>
<div class="clearfix" style=" position: relative">
    <div id="marvel-graph">
    </div>
</div>
@section PostScripts{
    <script type="text/javascript">
        $(function () {
            var nodes = [];
            var edges = [];
            @for (int i = 0; i < Model.Count; i++)
            {
                var film = Model[i];
                <text>
                nodes.push({
                    id: '@film.ImdbUrl',
                    title: '@film.Name',
                    borderWidth: 4,
                    shapeProperties: {useBorderWithImage: true},
                    shape: "image",
                    image: '@(string.IsNullOrEmpty(film.Image) ? "" : (film.Image.StartsWith("http") ? film.Image : Href("../../Images/marvel/"+film.Image)))',
                    color: { border: '#4db6ac', background: '#009688' }
                });
                @if (i != Model.Count - 1)
                {
                    <text>
                    edges.push({
                        from: '@film.ImdbUrl',
                        to: '@Model[i+1].ImdbUrl',
                        arrows: { to: true },
                        width: 4,
                        length:360,
                        color: "red"
                    });
                    </text>
                }
                @foreach (var actor in film.Actors)
                {
                    <text>
                    nodes.push({
                        id: '@film.ImdbUrl' + '@actor.ImdbUrl',
                        title: '@actor.Name',
                        borderWidth: 4,
                        shapeProperties: { useBorderWithImage: true },
                        shape: "circularImage",
                        image: '@(string.IsNullOrEmpty(actor.Image) ? "" : (actor.Image.StartsWith("http") ? actor.Image : Href("../../Images/marvel/"+actor.Image)))',
                    });
                    edges.push({
                        from: '@film.ImdbUrl',
                        to: '@film.ImdbUrl' + '@actor.ImdbUrl',
                        arrows: { to: true }
                    });
                    </text>
                }
                </text>
            }
            var container = document.getElementById("marvel-graph");
            var visNodes = new vis.DataSet(nodes);
            var data = {
                nodes: visNodes,
                edges: edges
            };
            var options = {
                layout: { improvedLayout: false },
                nodes: {
                    borderWidth: 3,
                    font: {
                        color: '#000000',
                        size: 12,
                        face: 'Segoe UI'
                    },
                    color: { background: '#4db6ac', border: '#009688' }
                },
                edges: {
                    color: '#c1c1c1',
                    width: 2,
                    font: {
                        color: '#2d2d2d',
                        size: 12
                    },
                    smooth: {
                        enabled: false,
                        type: 'continuous'
                    }
                }
            };
            var network = new vis.Network(container, data, options);
        });
    </script>
}

Using vis Network mostly comes down to calling new Network(container, data, options) and passing in the nodes and edges.
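The node and edge construction done by the Razor template can also be sketched in plain JavaScript, which makes the graph structure easier to inspect. `buildGraph` and the sample input are hypothetical and assume film objects with the ImdbMovie property names. The sketch omits the styling options and the local image path resolution from the template.

```javascript
// Build vis-style nodes and edges from a list of films.
// Mirrors the template: films are chained in list order, and each actor
// node id is the film URL concatenated with the actor URL.
function buildGraph(movies) {
  const nodes = [];
  const edges = [];
  movies.forEach((film, i) => {
    nodes.push({ id: film.ImdbUrl, title: film.Name, shape: "image", image: film.Image });
    // Link each film to the next one in the list.
    if (i !== movies.length - 1) {
      edges.push({ from: film.ImdbUrl, to: movies[i + 1].ImdbUrl, arrows: { to: true } });
    }
    for (const actor of film.Actors) {
      const actorNodeId = film.ImdbUrl + actor.ImdbUrl;
      nodes.push({ id: actorNodeId, title: actor.Name, shape: "circularImage", image: actor.Image });
      edges.push({ from: film.ImdbUrl, to: actorNodeId, arrows: { to: true } });
    }
  });
  return { nodes, edges };
}

// Placeholder input: two films that share one actor.
const actorA = { ImdbUrl: "/name/a/", Name: "Actor A", Image: "" };
const actorB = { ImdbUrl: "/name/b/", Name: "Actor B", Image: "" };
const demoGraph = buildGraph([
  { ImdbUrl: "/title/1/", Name: "Film 1", Image: "", Actors: [actorA, actorB] },
  { ImdbUrl: "/title/2/", Name: "Film 2", Image: "", Actors: [actorA] }
]);
console.log(demoGraph.nodes.length, demoGraph.edges.length); // 5 4
```

Because the actor node id includes the film URL, an actor who appears in two films gets two separate nodes, one per film. Using the actor URL alone as the id would merge them into a single shared node, but vis DataSet rejects duplicate ids, so the code would then need to skip actors that were already added.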

The final result
