Using Readability to extract the main text from web pages

Source: Internet
Author: User
Tags json static class

Friends who do data scraping and analysis, have you ever run into the following problem?

- How do you extract the main text from a wide variety of web pages?

Although you can use SS to write a script for each kind of website, there are far more than tens of thousands of different sites on the Internet, and we could never finish even if we worked ourselves to exhaustion. Here I warmly recommend using Readability to solve this problem once and for all (hehe, this is not an ad, I really love this great little tool).

The Readability website (www.readability.com) is proudest of its powerful parsing engine, billed as the world's most powerful text-extraction tool. Safari's "Reader" feature is implemented with it! They also provide an API for calling the parser, and I wrote a C# proxy class to make it easier for everyone to use.

Before you start, please register with Readability and apply for an app key, which is free of charge.

Proxy class code:

using System;
using System.Net;
using System.Text;
using System.Web;                       // HttpUtility
using System.Web.Script.Serialization;  // JavaScriptSerializer

public static class ReadabilityProxy
{
    // token is your own app key.
    public static Article Parse(string url, string token)
    {
        using (WebClient wc = new WebClient())
        {
            wc.Encoding = Encoding.UTF8;
            // The page address travels as a query parameter, so it must be URL-encoded.
            var encUrl = HttpUtility.UrlEncode(url);
            Uri u = new Uri(string.Format("https://readability.com/api/content/v1/parser?url={0}&token={1}", encUrl, token));
            var json = wc.DownloadString(u);
            JavaScriptSerializer se = new JavaScriptSerializer();
            return se.Deserialize(json, typeof(Article)) as Article;
        }
    }
}

public class Article
{
    public string Domain;
    public string next_page_id;
    public string Url;
    public string Content;
    public string Short_url;
    public string excerpt;
    public string Direction;
    public int word_count;
    public int total_pages;
    public string date_published;
    public string Dek;
    public string Lead_image_url;
    public string Title;
    public int rendered_pages;

    // Content and excerpt arrive HTML-encoded; decode them in place.
    public virtual void Decode()
    {
        this.excerpt = HttpUtility.HtmlDecode(this.excerpt);
        this.Content = HttpUtility.HtmlDecode(this.Content);
    }
}
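
To make the request format concrete, here is a minimal sketch of just the URL-building step of Parse. The class name RequestUrlSketch, the example.com address, and the "YOUR_APP_KEY" string are placeholders invented for this illustration. It also swaps in Uri.EscapeDataString for HttpUtility.UrlEncode so that the sketch does not need a System.Web reference; both percent-encode the reserved characters, though they differ in details such as hex casing and how spaces are encoded.

```csharp
using System;

public static class RequestUrlSketch
{
    // Builds the parser request URL in the same shape that Parse uses.
    // Uri.EscapeDataString stands in for HttpUtility.UrlEncode here
    // (an assumption of this sketch, to avoid the System.Web dependency).
    public static string Build(string pageUrl, string token)
    {
        string encoded = Uri.EscapeDataString(pageUrl);
        return string.Format(
            "https://readability.com/api/content/v1/parser?url={0}&token={1}",
            encoded, token);
    }

    public static void Main()
    {
        // Both arguments are placeholders, not real values.
        Console.WriteLine(Build("http://example.com/page.html?id=1", "YOUR_APP_KEY"));
    }
}
```

Without the encoding step, the "?" and "=" inside the page address would be read as part of the API's own query string, which is why the address is encoded before it is placed in the url parameter.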

Because the content and excerpt that Readability returns are HTML-encoded, I provide an Article.Decode method to decode them.

Testing it in a console app:
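
As a rough illustration of what that decoding step does, the sketch below decodes a small HTML-encoded string. The sample string and the DecodeSketch class are made up for this example and are not real API output. It uses System.Net.WebUtility.HtmlDecode, which performs the same kind of entity decoding as HttpUtility.HtmlDecode without requiring a System.Web reference.

```csharp
using System;
using System.Net;

public static class DecodeSketch
{
    // Turns HTML entities such as &lt; and &amp; back into plain characters.
    public static string Decode(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }

    public static void Main()
    {
        // Hypothetical encoded fragment, similar in spirit to an encoded excerpt.
        string encoded = "&lt;p&gt;Hello &amp; welcome&lt;/p&gt;";
        Console.WriteLine(Decode(encoded)); // prints: <p>Hello & welcome</p>
    }
}
```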

class Program
{
    static void Main(string[] args)
    {
        // The second argument is the app key, omitted here.
        var article = ReadabilityProxy.Parse(
            "http://www.mot.gov.cn/st2010/shanghai/sh_zhaobiaoxx/201203/t20120330_1219097.html",
            "*** here omitted n words ***");
        article.Decode();
        Console.WriteLine(article.Title);
        Console.WriteLine(article.excerpt);
        Console.WriteLine(article.Content);
        Console.ReadLine();
    }
}

What do you think? The results are good, so give it a try!
