In my previous article I described GB2312 decoding on Windows Phone 7:
http://www.cnblogs.com/qingci/archive/2011/11/25/2263124.html
This article describes how to parse HTML on Windows Phone 7 to extract the data you want.
First, let me introduce the class library HtmlAgilityPack (the same library was used for decoding in the previous article). Its DLL file is included with the demo.
I use Sina news as the example to parse.
Let's first look at a Sina news article on the web:
http://news.sina.com.cn/w/sd/2011-11-27/070023531646.shtml
Then let's look at its source.
The news content turns out to be structured like this:
Most of the elements also have an ID attribute, which makes them easier for us to parse.
Now let's start parsing.
First: reference the HtmlAgilityPack.dll file.
Second: use the WebClient or WebRequest class to download the HTML page and read it into a string.
// Raised on the UI thread once the page has been downloaded.
public delegate void CallbackEvent(object sender, DownloadEventArgs e);
public event CallbackEvent DownloadCallbackEvent;

public void HttpWebRequestDownloadGet(string url)
{
    Thread _thread = new Thread(delegate()
    {
        Uri _uri = new Uri(url, UriKind.RelativeOrAbsolute);
        HttpWebRequest _httpWebRequest = (HttpWebRequest)WebRequest.Create(_uri);
        _httpWebRequest.Method = "GET";
        _httpWebRequest.BeginGetResponse(new AsyncCallback(delegate(IAsyncResult result)
        {
            HttpWebRequest _httpWebRequestCallback = (HttpWebRequest)result.AsyncState;
            HttpWebResponse _httpWebResponseCallback = (HttpWebResponse)_httpWebRequestCallback.EndGetResponse(result);
            Stream _streamCallback = _httpWebResponseCallback.GetResponseStream();
            // Decode the response as GB2312 with the encoding class from the previous article.
            StreamReader _streamReader = new StreamReader(_streamCallback, new HtmlAgilityPack.Gb2312Encoding());
            string _stringCallback = _streamReader.ReadToEnd();
            // Hand the result back on the UI thread.
            Deployment.Current.Dispatcher.BeginInvoke(new Action(() =>
            {
                if (DownloadCallbackEvent != null)
                {
                    DownloadEventArgs _downloadEventArgs = new DownloadEventArgs();
                    _downloadEventArgs._DownloadStream = _streamCallback; // already read to the end by ReadToEnd
                    _downloadEventArgs._DownloadString = _stringCallback;
                    DownloadCallbackEvent(this, _downloadEventArgs);
                }
            }));
        }), _httpWebRequest);
    });
    _thread.Start();
}
O(∩_∩)O! My version is rather complicated. In short, all it does is download the HTML data.
Here is a simpler download method:
WebClient webClenet = new WebClient();
webClenet.Encoding = new HtmlAgilityPack.Gb2312Encoding(); // add this line to set the encoding
// Attach the handler before starting the download.
webClenet.DownloadStringCompleted += new DownloadStringCompletedEventHandler(webClenet_DownloadStringCompleted);
webClenet.DownloadStringAsync(new Uri("http://news.sina.com.cn/s/2011-11-25/120923524756.shtml", UriKind.RelativeOrAbsolute));
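The handler registered on DownloadStringCompleted is not shown in this article. As a rough sketch only, it could look something like the class below. The names PageDownloader, HtmlReady and OnDownloadStringCompleted are made up for illustration and are not part of the demo; the encoding line is left out so the sketch has no dependency on HtmlAgilityPack.

```csharp
using System;
using System.Net;

// Hypothetical wrapper, not taken from the demo project.
public class PageDownloader
{
    // Raised with the downloaded HTML when the request succeeds.
    public event Action<string> HtmlReady;

    public void Download(string url)
    {
        WebClient client = new WebClient();
        // On the phone you would also set client.Encoding here, as shown above.
        client.DownloadStringCompleted += OnDownloadStringCompleted;
        client.DownloadStringAsync(new Uri(url, UriKind.Absolute));
    }

    private void OnDownloadStringCompleted(object sender, DownloadStringCompletedEventArgs e)
    {
        // Reading e.Result throws if the request failed or was cancelled,
        // so check those two cases first.
        if (e.Cancelled || e.Error != null)
        {
            return;
        }

        Action<string> handler = HtmlReady;
        if (handler != null)
        {
            handler(e.Result);
        }
    }
}
```

A page would subscribe to HtmlReady and pass the string on to the parsing step.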
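Now process the downloaded HTML in the callback function. The code below reads it from e._DownloadString (the custom DownloadEventArgs above); with WebClient the same string is in e.Result.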
string _result = e._DownloadString;
HtmlDocument _doc = new HtmlDocument(); // instantiate the HtmlAgilityPack.HtmlDocument object
_doc.LoadHtml(_result); // load the HTML

// GetElementbyId returns null when the id is not found, so add null checks if the page layout may change.
HtmlNode _htmlNode01 = _doc.GetElementbyId("artibodyTitle"); // element holding the news title
string _title = _htmlNode01.InnerText;

HtmlNode _htmlNode02 = _doc.GetElementbyId("artibody"); // div holding the news content
string _content = _htmlNode02.InnerText;
// Cut off everything from ".blkComment" onwards; skip the cut when the marker is absent (IndexOf returns -1).
int _divIndex = _content.IndexOf(".blkComment");
if (_divIndex >= 0)
{
    _content = _content.Substring(0, _divIndex);
}

#region Sina tags
HtmlNode _htmlNodo03 = _doc.GetElementbyId("art_source");
string _www = _htmlNodo03.FirstChild.InnerText;
string _wwwInt = _htmlNodo03.FirstChild.Attributes[0].Value; // first attribute of the first child
#endregion

#region release time
HtmlNode _htmlNodo04 = _doc.GetElementbyId("pub_date");
string _pub_date = _htmlNodo04.InnerText;
#endregion

#region Source site information
HtmlNode _htmlNodo05 = _doc.GetElementbyId("media_name");
string _media_name = _htmlNodo05.FirstChild.InnerText;
string _modia_source = _htmlNodo05.FirstChild.Attributes[0].Value; // first attribute of the first child
#endregion

// Show the parsed values in the page controls.
Media_nameHyperlinkButton.Content = _pub_date + " " + _media_name;
Media_nameHyperlinkButton.NavigateUri = new Uri(_modia_source, UriKind.RelativeOrAbsolute);
TitleTextBlock.Text = _title;
ContentTextBlock.Text = _content;
The result looks like this:
Most tags on web pages have no ID attribute, but fortunately HtmlAgilityPack supports XPath, so you can locate the matching nodes with an XPath expression instead.
XPath tutorial: http://www.w3school.com.cn/xpath/index.asp
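To give an idea of what an XPath lookup looks like, here is a minimal sketch. It assumes the HtmlAgilityPack build you reference exposes SelectNodes on HtmlNode; the class XPathSketch and the method LinksUnder are hypothetical names used only for this illustration. It collects the href of every link inside the element with a given id.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of the demo project.
public static class XPathSketch
{
    // Returns the href value of every <a href="..."> found under the
    // element whose id attribute equals containerId.
    public static List<string> LinksUnder(string html, string containerId)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath: any <a> that has an href and sits below any element with this id.
        string xpath = "//*[@id='" + containerId + "']//a[@href]";
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes(xpath);

        List<string> result = new List<string>();
        if (links == null)
        {
            return result; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode link in links)
        {
            result.Add(link.GetAttributeValue("href", string.Empty));
        }
        return result;
    }
}
```

For example, LinksUnder(_result, "media_name") would return the links inside the media_name element of the Sina page. For elements without an id, the same pattern works with other predicates such as [@class='...'].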
Demo download:
http://115.com/file/dn87dl2d#
MyFramework_Test.zip