Parsing HTML Data on Windows Phone 7


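In my previous article I covered GB2312 decoding on Windows Phone 7:

http://www.cnblogs.com/qingci/archive/2011/11/25/2263124.html

That solved the garbled-text problem with downloaded HTML. In this article I'll show how to parse HTML on Windows Phone 7 so that we can extract the data we want.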

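First, let me introduce a library, HtmlAgilityPack (the previous article also used it for decoding). The library's DLL is included with the demo.

I'll use Sina News as the example for parsing.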

 

First, take a look at the web version of a Sina News article:

http://news.sina.com.cn/w/sd/2011-11-27/070023531646.shtml

Then look at its page source.

The news content is structured like this:

<div class="blkContainerSblk">
  <h1 id="artibodyTitle" pid="1" tid="1" did="23531646" fid="1666">title</h1>
  <div class="artInfo">
    <span id="art_source"><a href="http://www.sina.com.cn">http://www.sina.com.cn</a></span>
    <span id="pub_date">pub_date</span>
    <span id="media_name"><a href="">media_name</a> <a href=""></a> </span>
  </div>
  <!-- 本文內容 begin -->
  <!-- google_ad_section_start -->
  <div class="blkContainerSblkCon" id="artibody"></div>
</div>

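Most of these elements have an ID attribute, which makes them easy to locate when parsing.

Now let's start parsing.

Step 1: Add a reference to HtmlAgilityPack.dll.

Step 2: Use the WebClient or WebRequest class to download the HTML page and turn it into a string.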

    using System;
    using System.IO;
    using System.Net;
    using System.Threading;
    using System.Windows;

    // Members of the downloader class. DownloadEventArgs is defined elsewhere in the demo.
    public delegate void CallbackEvent(object sender, DownloadEventArgs e);
    public event CallbackEvent DownloadCallbackEvent;

    // Downloads the page at url on a background thread, decodes it as GB2312,
    // and raises DownloadCallbackEvent on the UI thread.
    public void HttpWebRequestDownloadGet(string url)
    {
        Thread _thread = new Thread(delegate()
        {
            Uri _uri = new Uri(url, UriKind.RelativeOrAbsolute);
            HttpWebRequest _httpWebRequest = (HttpWebRequest)WebRequest.Create(_uri);
            _httpWebRequest.Method = "GET";
            _httpWebRequest.BeginGetResponse(new AsyncCallback(delegate(IAsyncResult result)
            {
                HttpWebRequest _httpWebRequestCallback = (HttpWebRequest)result.AsyncState;
                HttpWebResponse _httpWebResponseCallback = (HttpWebResponse)_httpWebRequestCallback.EndGetResponse(result);
                Stream _streamCallback = _httpWebResponseCallback.GetResponseStream();
                // Read the response using the GB2312 encoding from the previous article.
                StreamReader _streamReader = new StreamReader(_streamCallback, new HtmlAgilityPack.Gb2312Encoding());
                string _stringCallback = _streamReader.ReadToEnd();

                // Switch back to the UI thread before raising the event.
                Deployment.Current.Dispatcher.BeginInvoke(new Action(() =>
                {
                    if (DownloadCallbackEvent != null)
                    {
                        DownloadEventArgs _downloadEventArgs = new DownloadEventArgs();
                        _downloadEventArgs._DownloadStream = _streamCallback;
                        _downloadEventArgs._DownloadString = _stringCallback;
                        DownloadCallbackEvent(this, _downloadEventArgs);
                    }
                }));
            }), _httpWebRequest);
        });
        _thread.Start();
    }

O(∩_∩)O! My version is a bit involved. The point is simply that we end up with the downloaded HTML.

Here is a simpler way to download:

    WebClient webClenet = new WebClient();
    webClenet.Encoding = new HtmlAgilityPack.Gb2312Encoding(); // Add this line to set the encoding
    // Attach the handler before starting the download so the completion event cannot be missed.
    webClenet.DownloadStringCompleted += new DownloadStringCompletedEventHandler(webClenet_DownloadStringCompleted);
    webClenet.DownloadStringAsync(new Uri("http://news.sina.com.cn/s/2011-11-25/120923524756.shtml", UriKind.RelativeOrAbsolute));

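Now process the downloaded string in the callback (e.Result with WebClient, or e._DownloadString with the custom event above):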

    string _result = e._DownloadString;
    HtmlDocument _doc = new HtmlDocument();   // Create an HtmlAgilityPack.HtmlDocument
    _doc.LoadHtml(_result);                   // Load the HTML

    HtmlNode _htmlNode01 = _doc.GetElementbyId("artibodyTitle");   // News title element (the h1)
    string _title = _htmlNode01.InnerText;

    HtmlNode _htmlNode02 = _doc.GetElementbyId("artibody");        // Article body div
    string _content = _htmlNode02.InnerText;
    // Drop everything from the " .blkComment" marker onward, if it is present
    // (IndexOf returns -1 otherwise, which would make Substring throw).
    int _divIndex = _content.IndexOf(" .blkComment");
    if (_divIndex >= 0)
    {
        _content = _content.Substring(0, _divIndex);
    }

    #region Sina link
    HtmlNode _htmlNodo03 = _doc.GetElementbyId("art_source");
    string _www = _htmlNodo03.FirstChild.InnerText;
    string _wwwInt = _htmlNodo03.FirstChild.Attributes[0].Value;   // First attribute of the link (href on this page)
    #endregion

    #region Publication date
    HtmlNode _htmlNodo04 = _doc.GetElementbyId("pub_date");
    string _pub_date = _htmlNodo04.InnerText;
    #endregion

    #region Source site information
    HtmlNode _htmlNodo05 = _doc.GetElementbyId("media_name");
    string _media_name = _htmlNodo05.FirstChild.InnerText;
    string _modia_source = _htmlNodo05.FirstChild.Attributes[0].Value;
    #endregion

    // Show the parsed values in the page's controls.
    Media_nameHyperlinkButton.Content = _pub_date + " " + _media_name;
    Media_nameHyperlinkButton.NavigateUri = new Uri(_modia_source, UriKind.RelativeOrAbsolute);
    TitleTextBlock.Text = _title;
    ContentTextBlock.Text = _content;
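One thing to watch with ID lookups: GetElementbyId returns null when the page has no element with that id, so reading InnerText directly can throw if Sina changes its markup. A minimal sketch of a guard follows. The class and method names (HtmlLookup, InnerTextOrDefault) are hypothetical helpers for illustration, not part of HtmlAgilityPack or the demo.

```csharp
using HtmlAgilityPack;

public static class HtmlLookup
{
    // Hypothetical helper: returns the trimmed inner text of the element with
    // the given id, or the fallback string when no such element exists.
    public static string InnerTextOrDefault(HtmlDocument doc, string id, string fallback)
    {
        HtmlNode node = doc.GetElementbyId(id);
        return node == null ? fallback : node.InnerText.Trim();
    }
}
```

With this in place, a call such as HtmlLookup.InnerTextOrDefault(_doc, "pub_date", "") could replace the direct InnerText access when a missing element should not crash the page.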

 

The result is shown below: [screenshot]

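Most tags on a typical web page have no ID attribute, but fortunately HtmlAgilityPack supports XPath.

In that case, use XPath expressions to find and match the nodes you need.

XPath tutorial: http://www.w3school.com.cn/xpath/index.asp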
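As a rough illustration of looking up nodes without IDs, the sketch below collects the text of every span inside the artInfo div from the structure shown earlier. It assumes the HtmlAgilityPack build you reference exposes SelectNodes (the desktop .NET build does; some phone builds may not), so a LINQ version using Descendants is included as an alternative. The class and method names are hypothetical.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class XPathSketch
{
    // XPath version: every <span> that is a direct child of <div class="artInfo">.
    public static string[] GetArtInfoSpans(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection spans = doc.DocumentNode.SelectNodes("//div[@class='artInfo']/span");
        if (spans == null)
        {
            return new string[0];
        }
        return spans.Select(s => s.InnerText.Trim()).ToArray();
    }

    // LINQ version for builds without XPath support: same result via Descendants.
    public static string[] GetArtInfoSpansLinq(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("div")
            .Where(d => d.GetAttributeValue("class", "") == "artInfo")
            .SelectMany(d => d.ChildNodes.Where(c => c.Name == "span"))
            .Select(s => s.InnerText.Trim())
            .ToArray();
    }
}
```

Both methods take the downloaded HTML string, so either can be called from the download callback in place of the GetElementbyId lookups when the target elements carry only a class.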

 

Sample download:

http://115.com/file/dn87dl2d#
MyFramework_Test.zip

 

 

 
