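I recently had a small task that required scraping some data from other websites, so here is a short walkthrough of how to parse HTML and extract part of a page with the Winista.Text.HtmlParser library.
1. Fetch the page source of the target site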
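string url = "http://www.100njz.com/price/list/p--------1.html";
string html;
// WebClient is IDisposable, so wrap it in a using block
using (System.Net.WebClient aWebClient = new System.Net.WebClient())
{
    // Encoding.Default is the system default; change it if the page declares another charset
    aWebClient.Encoding = System.Text.Encoding.Default;
    html = aWebClient.DownloadString(url);
}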
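Because Encoding.Default depends on the machine the code runs on (the ANSI code page on .NET Framework, UTF-8 on .NET Core and later), the downloaded text can come out garbled when the site uses a different charset. The following is a minimal sketch of one alternative: download the raw bytes and decode them with an explicitly named charset. The class and method names (PageFetcher, Decode, Fetch) are hypothetical, and the charset to pass in is an assumption you would need to check against the page's own Content-Type header or meta tag.

```csharp
using System;
using System.Net;
using System.Text;

public static class PageFetcher
{
    // Decode raw response bytes using a charset named by the caller.
    // The charset name is an assumption: verify it against the target page.
    public static string Decode(byte[] raw, string charsetName)
    {
        return Encoding.GetEncoding(charsetName).GetString(raw);
    }

    // Download the page as bytes first, so no implicit encoding is applied,
    // then decode with the explicit charset.
    public static string Fetch(string url, string charsetName)
    {
        using (var client = new WebClient())
        {
            byte[] raw = client.DownloadData(url);
            return Decode(raw, charsetName);
        }
    }
}
```

A call would look like string html = PageFetcher.Fetch(url, "utf-8");. Note that on .NET Core and later, legacy code pages such as "gb2312" are only available after registering CodePagesEncodingProvider.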
2. Once you have the source, parse it and extract the node you want. This example extracts the div with id="articleList".
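// Namespaces used below (the using directives belong at the top of the file)
using Winista.Text.HtmlParser;
using Winista.Text.HtmlParser.Filters;
using Winista.Text.HtmlParser.Lex;
using Winista.Text.HtmlParser.Tags;
using Winista.Text.HtmlParser.Util;

// Parse the downloaded HTML and collect every <div> node
Lexer lexer = new Lexer(html);
Parser parser = new Parser(lexer);
NodeFilter filter = new NodeClassFilter(typeof(Div));
NodeList nodeList = parser.Parse(filter);

if (nodeList.Count == 0)
{
    Response.Write("No matching nodes found");
}
else
{
    for (int i = 0; i < nodeList.Count; i++)
    {
        ITag t = getTag(nodeList[i]);
        // Write out only the div whose id is "articleList"
        if (t != null && t.GetAttribute("id") == "articleList")
        {
            Response.Write(nodeList[i].ToHtml());
        }
    }
}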
// Returns the node as a Div tag, or null if the node is null or not a Div
private ITag getTag(INode node)
{
    return node as Div;
}
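The loop above checks the id attribute by hand. As a rough sketch of an alternative, the id test can be moved into the filter itself by combining the class filter with an attribute filter. This assumes the Winista port exposes AndFilter and HasAttributeFilter in Winista.Text.HtmlParser.Filters with the same constructor shapes as the Java HTMLParser library it derives from; check this against the version of the library you have. ArticleListExtractor and ExtractById are hypothetical names.

```csharp
using System;
using Winista.Text.HtmlParser;
using Winista.Text.HtmlParser.Filters;
using Winista.Text.HtmlParser.Lex;
using Winista.Text.HtmlParser.Tags;
using Winista.Text.HtmlParser.Util;

public static class ArticleListExtractor
{
    // Returns the HTML of the first <div> whose id equals the given value,
    // or null when no such div exists.
    public static string ExtractById(string html, string id)
    {
        Parser parser = new Parser(new Lexer(html));

        // Assumption: AndFilter(NodeFilter, NodeFilter) and
        // HasAttributeFilter(string, string) exist in this port.
        NodeFilter filter = new AndFilter(
            new NodeClassFilter(typeof(Div)),
            new HasAttributeFilter("id", id));

        NodeList matches = parser.Parse(filter);
        return matches.Count == 0 ? null : matches[0].ToHtml();
    }
}
```

With this helper the page code reduces to a single call such as ArticleListExtractor.ExtractById(html, "articleList"), and a null result covers both the case of no divs at all and the case of no div with that id.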
Parsing HTML source with C#