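In my previous article I covered GB2312 decoding on Windows Phone 7:
http://www.cnblogs.com/qingci/archive/2011/11/25/2263124.html
That fixed the garbled text in downloaded HTML. In this article I will cover parsing HTML on Windows Phone 7 so that we can extract the data we want.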
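First, let me introduce a library, HtmlAgilityPack (the previous article also used it for the decoding). I will provide the library's dll file together with the demo.
I will use Sina News as the example for parsing.
Let's start with the web version of a Sina News article: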
http://news.sina.com.cn/w/sd/2011-11-27/070023531646.shtml
Next, look at the page source.
The news content turns out to have this structure:
<div class="blkContainerSblk">
  <h1 id="artibodyTitle" pid="1" tid="1" did="23531646" fid="1666">title</h1>
  <div class="artInfo">
    <span id="art_source"><a href="http://www.sina.com.cn">http://www.sina.com.cn</a></span>
    <span id="pub_date">pub_date</span>
    <span id="media_name"><a href="">media_name</a> <a href=""></a> </span>
  </div>
  <!-- 本文內容 begin -->
  <!-- google_ad_section_start -->
  <div class="blkContainerSblkCon" id="artibody"></div>
</div>
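Most of the elements even have an ID attribute, which makes them much easier to parse.
Now let's start parsing.
First: add a reference to HtmlAgilityPack.dll.
Second: download the HTML page with the WebClient or WebRequest class and turn it into a string.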
public delegate void CallbackEvent(object sender, DownloadEventArgs e);
public event CallbackEvent DownloadCallbackEvent;

public void HttpWebRequestDownloadGet(string url)
{
    Thread _thread = new Thread(delegate()
    {
        Uri _uri = new Uri(url, UriKind.RelativeOrAbsolute);
        HttpWebRequest _httpWebRequest = (HttpWebRequest)WebRequest.Create(_uri);
        _httpWebRequest.Method = "GET";
        _httpWebRequest.BeginGetResponse(new AsyncCallback(delegate(IAsyncResult result)
        {
            HttpWebRequest _httpWebRequestCallback = (HttpWebRequest)result.AsyncState;
            HttpWebResponse _httpWebResponseCallback = (HttpWebResponse)_httpWebRequestCallback.EndGetResponse(result);
            Stream _streamCallback = _httpWebResponseCallback.GetResponseStream();
            // Decode the response body as GB2312
            StreamReader _streamReader = new StreamReader(_streamCallback, new HtmlAgilityPack.Gb2312Encoding());
            string _stringCallback = _streamReader.ReadToEnd();
            // Raise the event on the UI thread
            Deployment.Current.Dispatcher.BeginInvoke(new Action(() =>
            {
                if (DownloadCallbackEvent != null)
                {
                    DownloadEventArgs _downloadEventArgs = new DownloadEventArgs();
                    _downloadEventArgs._DownloadStream = _streamCallback;
                    _downloadEventArgs._DownloadString = _stringCallback;
                    DownloadCallbackEvent(this, _downloadEventArgs);
                }
            }));
        }), _httpWebRequest);
    });
    _thread.Start();
}
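O(∩_∩)O! My version is rather elaborate. All that matters is that we end up with the downloaded HTML as a string.
Here is a simpler way to download: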
WebClient webClenet = new WebClient();
webClenet.Encoding = new HtmlAgilityPack.Gb2312Encoding();   // add this line to set the encoding
// Subscribe before starting the download so the completion event cannot be missed
webClenet.DownloadStringCompleted += new DownloadStringCompletedEventHandler(webClenet_DownloadStringCompleted);
webClenet.DownloadStringAsync(new Uri("http://news.sina.com.cn/s/2011-11-25/120923524756.shtml", UriKind.RelativeOrAbsolute));
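The handler registered above is not shown in this article. A minimal sketch of what it might look like follows. OnDownloadFailed and OnHtmlReady are hypothetical hooks that do not exist in the demo, and the class name is made up for illustration.

```csharp
using System;
using System.Net;

public class DownloadHandlerSketch
{
    // Same name as the handler registered on DownloadStringCompleted above.
    public void webClenet_DownloadStringCompleted(object sender, DownloadStringCompletedEventArgs e)
    {
        // Reading e.Result after a failed or cancelled download throws,
        // so check these two properties first.
        if (e.Cancelled || e.Error != null)
        {
            OnDownloadFailed(e.Error);
            return;
        }

        // The string has already been decoded with the WebClient's Encoding.
        string html = e.Result;
        OnHtmlReady(html);
    }

    // Hypothetical hooks: replace with your own error display and parsing code.
    private void OnDownloadFailed(Exception error) { }
    private void OnHtmlReady(string html) { }
}
```

Checking e.Error and e.Cancelled before touching e.Result keeps a network failure from surfacing as an unhandled exception inside the callback.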
Now process e.Result in the callback:
string _result = e._DownloadString;   // with the WebClient version, use e.Result instead
HtmlDocument _doc = new HtmlDocument();   // instantiate HtmlAgilityPack.HtmlDocument
_doc.LoadHtml(_result);   // load the HTML

HtmlNode _htmlNode01 = _doc.GetElementbyId("artibodyTitle");   // the <h1> holding the news title
string _title = _htmlNode01.InnerText;

HtmlNode _htmlNode02 = _doc.GetElementbyId("artibody");   // the div holding the article body
string _content = _htmlNode02.InnerText;
// Cut the text off at the " .blkComment" marker, if it is present
int _divIndex = _content.IndexOf(" .blkComment");
if (_divIndex >= 0)
{
    _content = _content.Substring(0, _divIndex);
}

#region Sina tag
HtmlNode _htmlNodo03 = _doc.GetElementbyId("art_source");
string _www = _htmlNodo03.FirstChild.InnerText;
string _wwwInt = _htmlNodo03.FirstChild.Attributes[0].Value;
#endregion

#region Publication time
HtmlNode _htmlNodo04 = _doc.GetElementbyId("pub_date");
string _pub_date = _htmlNodo04.InnerText;
#endregion

#region Source site information
HtmlNode _htmlNodo05 = _doc.GetElementbyId("media_name");
string _media_name = _htmlNodo05.FirstChild.InnerText;
string _modia_source = _htmlNodo05.FirstChild.Attributes[0].Value;
#endregion

Media_nameHyperlinkButton.Content = _pub_date + " " + _media_name;
Media_nameHyperlinkButton.NavigateUri = new Uri(_modia_source, UriKind.RelativeOrAbsolute);
TitleTextBlock.Text = _title;
ContentTextBlock.Text = _content;
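One thing to watch in the code above: GetElementbyId returns null when the page has no element with that id, so reading InnerText directly can throw on a page with a different layout. Below is a rough sketch of a null-safe helper. TextById and the class name are hypothetical, not part of HtmlAgilityPack, and the sample markup is a trimmed placeholder.

```csharp
using System;
using HtmlAgilityPack;

static class NodeTextSketch
{
    // Hypothetical helper: returns the entity-decoded, trimmed text of the
    // element with the given id, or the fallback when no such element exists.
    public static string TextById(HtmlDocument doc, string id, string fallback)
    {
        HtmlNode node = doc.GetElementbyId(id);
        if (node == null)
        {
            return fallback;
        }
        // InnerText keeps entities such as &amp; so decode them here.
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    static void Main()
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml("<h1 id=\"artibodyTitle\"> title &amp; more </h1>");

        Console.WriteLine(TextById(doc, "artibodyTitle", ""));
        // The sample markup has no pub_date element, so the fallback is used.
        Console.WriteLine(TextById(doc, "pub_date", "(unknown date)"));
    }
}
```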
The result is as shown: [screenshot]
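Most tags on a web page have no ID attribute, but fortunately HtmlAgilityPack supports XPath.
In that case we need to use XPath expressions to find the nodes we want.
XPath tutorial: http://www.w3school.com.cn/xpath/index.asp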
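As a rough illustration of the XPath approach, the sketch below selects nodes by class and by position instead of by id. It assumes the HtmlAgilityPack build in use exposes SelectSingleNode, as the desktop build does. The sample markup is a trimmed placeholder modeled on the structure shown earlier, and the class name is made up.

```csharp
using System;
using HtmlAgilityPack;

class XPathSketch
{
    static void Main()
    {
        // Trimmed placeholder markup, not a real Sina page.
        string html =
            "<div class=\"blkContainerSblk\">" +
            "<h1 id=\"artibodyTitle\">title</h1>" +
            "<div class=\"artInfo\"><span id=\"pub_date\">pub_date</span></div>" +
            "<div class=\"blkContainerSblkCon\" id=\"artibody\"><p>body</p></div>" +
            "</div>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Match a div by its class attribute rather than by its id.
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//div[@class='blkContainerSblkCon']");
        // Match the h1 that is a direct child of the container div.
        HtmlNode title = doc.DocumentNode.SelectSingleNode("//div[@class='blkContainerSblk']/h1");

        // SelectSingleNode returns null when nothing matches.
        Console.WriteLine(title != null ? title.InnerText : "(no title)");
        Console.WriteLine(body != null ? body.InnerText : "(no body)");
    }
}
```

Note that @class='...' compares the whole attribute value, so an element carrying several classes needs something like contains(@class, '...') instead.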
Demo download:
http://115.com/file/dn87dl2d#
MyFramework_Test.zip