Learn Python by example: capture the webpage body using Python

Source: Internet
Author: User

This method is based on text density. The original idea comes from Harbin Institute of Technology's "General Webpage Text Extraction Algorithm Based on the Row Block Distribution Function". This article makes some minor modifications to it.

 

Conventions:

This article computes its statistics line by line over the web page, so it assumes the page has not been compressed and still has normal line breaks.

Some news pages have relatively little text but embed a video file, so video is given a high weight. The same applies to pictures. There is a shortcoming here: the weight of an image should depend on its size, but the method in this article does not do that.

Non-body content, such as advertisements, is usually displayed as hyperlinks, so this article gives hyperlink text a weight of zero.

The body content is assumed to be contiguous, with no non-body content inside it. Extracting the body therefore amounts to finding its start and end positions.
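To illustrate the zero weight for hyperlink text, here is a minimal sketch that separates link text from the remaining text on one line. The helper name `text_lengths` and the two sample lines are assumptions made for illustration, not part of the original code.

```python
import re

TAG = re.compile(r'<[^>]+>')
LINK = re.compile(r'<a\b[^>]*>(.*?)</a>', re.I | re.S)


def text_lengths(line):
    """Return (link_text_length, non_link_text_length) for one line of HTML.

    Hypothetical helper for illustration only.
    """
    # Visible text inside <a>...</a>, with any nested tags stripped.
    link_text = ''.join(TAG.sub('', m) for m in LINK.findall(line))
    # All visible text on the line, links included.
    all_text = TAG.sub('', line).strip()
    return len(link_text), len(all_text) - len(link_text)


nav = '<li><a href="/">Home</a> <a href="/news">News</a></li>'
body = '<p>The city council approved the new budget on Monday.</p>'
print(text_lengths(nav))   # almost all of the text is link text
print(text_lengths(body))  # no link text at all
```

A navigation line like `nav` keeps almost no text once link text is discounted, while a body line like `body` keeps all of it.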

Steps:

First, remove the content of the CSS, JavaScript, comments, and the meta and ins tags from the page, and remove blank lines.

Calculate a value for each row of the processed page (1).

Find the start and end positions of the contiguous run of rows whose values sum to the maximum.
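The last step is a maximum-sum contiguous run search over the per-row values. A minimal sketch of that search follows; the function name `best_run` and the sample scores are assumptions for illustration, and this is a standard single-pass formulation rather than the article's own `sum_max`.

```python
def best_run(scores):
    """Return (start, end) so that scores[start:end] has the largest sum.

    Hypothetical helper; scores must be a non-empty list of numbers.
    """
    best_sum = scores[0]
    best_start, best_end = 0, 1
    cur_sum, cur_start = 0, 0
    for i, score in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive running sum cannot help; start a new run here.
            cur_sum, cur_start = score, i
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_sum = cur_sum
            best_start, best_end = cur_start, i + 1
    return best_start, best_end


# Made-up row values: short rows score negative, body rows score positive.
scores = [-8, -3, 40, 112, -8, 65, -8, -8]
start, end = best_run(scores)
print(start, end, scores[start:end])
```

Note that a single weak row inside the body (the `-8` between `112` and `65`) does not split the run, because the rows around it more than make up for it.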

(1) The second step in more detail:

For each row we calculate a value from the following quantities:

x1: the number of image tags (img) in the row. Each one is treated as text with a length of 50 characters (an assigned weight).

x2: the number of video tags (embed) in the row. Each one is treated as text with a length of 1000 characters.

x3: the total text length inside the link tags (a) in the row. It carries a weight of zero.

x4: the text length of everything else in the row.

Value of each row = 50 * x1 + 1000 * x2 + x4 - 8

// Note: we subtract 8 because the next step looks for the contiguous run of rows with the maximum sum, which only works if rows with little or no text have negative values. A positive number therefore has to be subtracted from every row. How large it should be is a matter of experience.
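The row value can be sketched as a direct reading of the formula. The function name `line_value`, its parameter names, and the sample rows are assumptions for illustration. Unlike the complete code, which substitutes placeholder characters for media tags, this sketch counts the tags explicitly, so the two can differ when an image sits inside a link.

```python
import re

IMG = re.compile(r'<img\b[^>]*>', re.I)
EMBED = re.compile(r'<embed\b[^>]*>', re.I)
LINK = re.compile(r'<a\b[^>]*>(.*?)</a>', re.I | re.S)
TAG = re.compile(r'<[^>]+>')


def line_value(line, img_weight=50, video_weight=1000, offset=8):
    """Hypothetical helper: 50 * x1 + 1000 * x2 + x4 - 8 for one row."""
    x1 = len(IMG.findall(line))      # number of image tags
    x2 = len(EMBED.findall(line))    # number of video tags
    # x3: visible text inside links, which carries zero weight.
    x3 = len(''.join(TAG.sub('', m) for m in LINK.findall(line)))
    # x4: all visible text on the row minus the link text.
    x4 = len(TAG.sub('', line).strip()) - x3
    return img_weight * x1 + video_weight * x2 + x4 - offset


print(line_value('<a href="/x">Ad link</a>'))     # link-only row
print(line_value('<img src="a.png">'))            # one image
print(line_value('<embed src="v.swf">'))          # one video
print(line_value('<p>Hello world, this is body text.</p>'))
```

A link-only row ends up negative, so it pulls down any run that includes it, while rows with text, images, or video stay positive.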

Complete code

# coding: utf-8
import re


def remove_js_css(content):
    """Remove JavaScript, stylesheets, comments, and <meta> and <ins> content
    (<script>...</script>, <style>...</style>, <!-- xxx -->)."""
    r = re.compile(r'<script.*?</script>', re.I | re.M | re.S)
    s = r.sub('', content)
    r = re.compile(r'<style.*?</style>', re.I | re.M | re.S)
    s = r.sub('', s)
    r = re.compile(r'<!--.*?-->', re.I | re.M | re.S)
    s = r.sub('', s)
    r = re.compile(r'<meta.*?>', re.I | re.M | re.S)
    s = r.sub('', s)
    r = re.compile(r'<ins.*?</ins>', re.I | re.M | re.S)
    s = r.sub('', s)
    return s


def remove_empty_line(content):
    """Remove whitespace-only lines and collapse repeated newlines."""
    r = re.compile(r'^\s+$', re.M | re.S)
    s = r.sub('', content)
    r = re.compile(r'\n+', re.M | re.S)
    s = r.sub('\n', s)
    return s


def remove_any_tag(s):
    """Strip every tag and return the remaining text."""
    s = re.sub(r'<[^>]+>', '', s)
    return s.strip()


def remove_any_tag_but_a(s):
    """Return (length of the link text, length of all text) for s."""
    text = re.findall(r'<a[^r][^>]*>(.*?)</a>', s, re.I | re.S)
    text_b = remove_any_tag(s)
    return len(''.join(text)), len(text_b)


def remove_image(s, n=50):
    """Replace each image tag with n placeholder characters."""
    image = 'a' * n
    # The pattern was lost in the source text; '<img.*?>' is reconstructed
    # by analogy with remove_video below.
    r = re.compile(r'<img.*?>', re.I | re.M | re.S)
    s = r.sub(image, s)
    return s


def remove_video(s, n=1000):
    """Replace each embed tag with n placeholder characters."""
    video = 'a' * n
    r = re.compile(r'<embed.*?>', re.I | re.M | re.S)
    s = r.sub(video, s)
    return s


def sum_max(values):
    """Return (left, right) so that values[left:right] is the contiguous
    run with the maximum sum."""
    cur_max = 0  # start from 0 so that values[0] is not counted twice
    glo_max = -999999
    left, right = 0, 0
    for index, value in enumerate(values):
        cur_max += value
        if cur_max > glo_max:
            glo_max = cur_max
            right = index
        if cur_max < 0:
            cur_max = 0
    # Walk back from right until the maximum sum is used up.
    for i in range(right, -1, -1):
        glo_max -= values[i]
        if abs(glo_max) < 0.00001:
            left = i
            break
    return left, right + 1


def method_1(content, k=1):
    """Score the page in groups of k lines and locate the body."""
    if not content:
        return None, None, None, None
    tmp = content.split('\n')
    group_value = []
    for i in range(0, len(tmp), k):
        group = '\n'.join(tmp[i:i + k])
        group = remove_image(group)
        group = remove_video(group)
        text_a, text_b = remove_any_tag_but_a(group)
        # Non-link text length (media placeholders included) minus 8.
        temp = (text_b - text_a) - 8
        group_value.append(temp)
    left, right = sum_max(group_value)
    return left, right, len('\n'.join(tmp[:left])), len('\n'.join(tmp[:right]))


def extract(content):
    """Return the body text lines of an HTML page."""
    content = remove_empty_line(remove_js_css(content))
    left, right, x, y = method_1(content)
    return '\n'.join(content.split('\n')[left:right])

To run the code, call the last function, extract.
