Scraping Information from Web Pages


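I wrote a class for scraping information from web pages (for example the latest headlines, or a news item's source, title, and content). This article shows how to use that class to extract the information you need from a page, using the post titles and links on the cnblogs (博客園) home page as the example.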

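In the DOM tree of the cnblogs home page, each post sits in a div whose class is post_item, and the post's title link is an a tag whose class is titlelnk. So it is enough to extract every div with class post_item and then, inside each one, extract the a tag with class titlelnk. The following functions implement this: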

/// <summary>
/// Searches the HTML text for every tag named tagName whose attribute
/// attrName has the value attrValue.
/// Example: FindTagByAttr(html, "div", "class", "demo")
/// returns all div tags whose class is demo.
/// </summary>
public static List<HtmlTag> FindTagByAttr(String html, String tagName, String attrName, String attrValue)
{
    // Regex.Escape keeps metacharacters in attrValue (such as '.') literal.
    String format = String.Format(@"<{0}\s[^<>]*{1}\s*=\s*(\x27|\x22){2}(\x27|\x22)[^<>]*>", tagName, attrName, Regex.Escape(attrValue));
    return FindTag(html, tagName, format);
}

public static List<HtmlTag> FindTag(String html, String name, String format)
{
    // reg matches the opening tag we are looking for;
    // tagReg matches any opening or closing tag with the same name.
    Regex reg = new Regex(format, RegexOptions.IgnoreCase);
    Regex tagReg = new Regex(String.Format(@"<(\/|)({0})(\s[^<>]*|)>", name), RegexOptions.IgnoreCase);
    List<HtmlTag> tags = new List<HtmlTag>();
    int start = 0;
    while (true)
    {
        Match match = reg.Match(html, start);
        if (!match.Success)
            break;

        start = match.Index + match.Length;

        // Walk forward over same-named tags, counting nesting depth,
        // until the closing tag that balances the opening tag is found.
        Match tagMatch = null;
        int beginTagCount = 1;
        while (true)
        {
            tagMatch = tagReg.Match(html, start);
            if (!tagMatch.Success)
            {
                tagMatch = null;
                break;
            }
            start = tagMatch.Index + tagMatch.Length;
            if (tagMatch.Groups[1].Value == "/")
                beginTagCount--;
            else
                beginTagCount++;
            if (beginTagCount == 0)
                break;
        }

        // No balancing closing tag: stop searching.
        if (tagMatch == null)
            break;

        // Everything between the opening tag and its closing tag is the inner HTML.
        int innerStart = match.Index + match.Length;
        HtmlTag tag = new HtmlTag(name, match.Value, html.Substring(innerStart, tagMatch.Index - innerStart));
        tags.Add(tag);
    }
    return tags;
}
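To make the opening-tag pattern easier to inspect, here is a minimal sketch that builds the same regular expression and lists only the opening tags it matches, without the nesting logic. The names OpeningTagPatternDemo, BuildPattern, and FindOpeningTags are hypothetical and not part of the article's class; the arguments are passed through Regex.Escape so that any metacharacters in them are matched literally.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class OpeningTagPatternDemo
{
    // Builds the opening-tag pattern: <tagName ... attrName = 'attrValue' ...>
    // \x27 is a single quote and \x22 is a double quote.
    public static string BuildPattern(string tagName, string attrName, string attrValue)
    {
        return String.Format(
            @"<{0}\s[^<>]*{1}\s*=\s*(\x27|\x22){2}(\x27|\x22)[^<>]*>",
            Regex.Escape(tagName), Regex.Escape(attrName), Regex.Escape(attrValue));
    }

    // Returns the text of every opening tag in html that matches the pattern.
    public static List<string> FindOpeningTags(string html, string tagName, string attrName, string attrValue)
    {
        Regex regex = new Regex(BuildPattern(tagName, attrName, attrValue), RegexOptions.IgnoreCase);
        List<string> result = new List<string>();
        foreach (Match m in regex.Matches(html))
        {
            result.Add(m.Value);
        }
        return result;
    }
}
```

Note that the pattern requires the quoted value to equal attrValue exactly, so a tag written as class="post_item first" would not match a search for post_item. A regular expression of this kind is a lightweight approximation and not a full HTML parser.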

With the functions above we can extract the HTML tags we need. To actually scrape a page, we also need a function that downloads it:

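// Requires: using System; using System.IO; using System.Net; using System.Text;
public static String GetHtml(string url)
{
    try
    {
        HttpWebRequest req = HttpWebRequest.Create(url) as HttpWebRequest;
        req.Timeout = 30 * 1000; // 30 seconds
        using (HttpWebResponse response = req.GetResponse() as HttpWebResponse)
        using (Stream stream = response.GetResponseStream())
        using (MemoryStream buffer = new MemoryStream())
        {
            // Copy the response body into memory in 4 KB chunks.
            Byte[] temp = new Byte[4096];
            int count = 0;
            while ((count = stream.Read(temp, 0, temp.Length)) > 0)
            {
                buffer.Write(temp, 0, count);
            }
            // ToArray returns only the bytes written; GetBuffer would also
            // return the unused capacity of the underlying array.
            return Encoding.GetEncoding(response.CharacterSet).GetString(buffer.ToArray());
        }
    }
    catch
    {
        // Any failure (network error, unknown charset) yields an empty string.
        return String.Empty;
    }
}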
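One detail is worth isolating when turning the downloaded bytes into text. MemoryStream.GetBuffer() returns the whole underlying array, including capacity that was never written, whereas MemoryStream.ToArray() returns only the bytes written. The charset reported by the server may also be missing or unrecognized. The sketch below handles both cases; the names ResponseDecoding and Decode are hypothetical, and the fallback to UTF-8 is an assumption of this sketch, not something the original class does.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration only.
public static class ResponseDecoding
{
    // Decodes the bytes written to buffer using the named charset,
    // falling back to UTF-8 (an assumption) when the name is missing or unknown.
    public static string Decode(MemoryStream buffer, string charset)
    {
        Encoding encoding;
        try
        {
            encoding = String.IsNullOrWhiteSpace(charset)
                ? Encoding.UTF8
                : Encoding.GetEncoding(charset.Trim());
        }
        catch (ArgumentException)
        {
            // GetEncoding throws ArgumentException for names it does not recognize.
            encoding = Encoding.UTF8;
        }

        // ToArray copies only the bytes actually written to the stream.
        return encoding.GetString(buffer.ToArray());
    }
}
```

A call such as ResponseDecoding.Decode(buffer, response.CharacterSet) could stand in for the final decoding step of a download function like GetHtml.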

The following example uses the HtmlTag class to scrape the article titles and links from the cnblogs home page:

class Program
{
    static void Main(string[] args)
    {
        // Download the home page and locate the container that holds the post list.
        String html = HtmlTag.GetHtml("http://www.cnblogs.com");
        List<HtmlTag> tags = HtmlTag.FindTagByAttr(html, "div", "id", "post_list");
        if (tags.Count > 0)
        {
            // Each post is a div with class post_item inside the container.
            List<HtmlTag> item_tags = tags[0].FindTagByAttr("div", "class", "post_item");
            foreach (HtmlTag item_tag in item_tags)
            {
                // The title link is the a tag with class titlelnk.
                List<HtmlTag> a_tags = item_tag.FindTagByAttr("a", "class", "titlelnk");
                if (a_tags.Count > 0)
                {
                    Console.WriteLine("Title: {0}", a_tags[0].InnerHTML);
                    Console.WriteLine("Link: {0}", a_tags[0].GetAttribute("href"));
                    Console.WriteLine("");
                }
            }
        }
    }
}
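The example relies on members of HtmlTag that the article does not list, namely InnerHTML and GetAttribute. As a rough idea of what such a class might look like, here is a minimal sketch under the name HtmlTagSketch (hypothetical, as are the Name and BeginTag properties). The constructor takes the tag name, the opening tag text, and the inner HTML, matching the order in which FindTag passes them. The attribute lookup by regular expression is an assumption, not the author's implementation.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the article's HtmlTag class, for illustration only.
public class HtmlTagSketch
{
    public string Name { get; private set; }      // tag name, e.g. "a"
    public string BeginTag { get; private set; }  // full opening tag text
    public string InnerHTML { get; private set; } // text between opening and closing tag

    public HtmlTagSketch(string name, string beginTag, string innerHtml)
    {
        Name = name;
        BeginTag = beginTag;
        InnerHTML = innerHtml;
    }

    // Returns the value of a quoted attribute in the opening tag,
    // or an empty string when the attribute is not present.
    public string GetAttribute(string attrName)
    {
        // Group 1 captures the opening quote; \1 requires the same quote to close the value.
        string pattern = String.Format(@"\s{0}\s*=\s*(\x27|\x22)(.*?)\1", Regex.Escape(attrName));
        Match m = Regex.Match(BeginTag, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[2].Value : String.Empty;
    }
}
```

An instance method such as FindTagByAttr(tagName, attrName, attrValue) could then simply run the static search over InnerHTML, which is how the nested lookups in the example would narrow the search to one post at a time.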

  

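Running the program prints the title and link of each post on the home page, one post per block, separated by blank lines.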

 
