Learn Python by example: extracting the body text of a web page with Python
This method is based on text density. The idea comes from a general-purpose web page text extraction algorithm based on the line-block distribution function, developed at the Harbin Institute of Technology. This article makes a few minor modifications to it.
Assumptions:
The statistics in this article are computed per line of the web page. The page source is therefore assumed to be uncompressed (not minified), with normal line breaks.
Some news pages have relatively little text but embed a video, so a video is given a high weight. The same applies to images. One shortcoming: the weight of an image should depend on its size, but the method in this article does not do that.
Non-body content such as advertisements and navigation usually appears as hyperlinks, so the text inside hyperlinks is given a weight of zero.
The body content is assumed to be contiguous and free of non-body content. Extracting the body therefore means finding the positions where it starts and ends.
Steps:
First, remove CSS, JavaScript, comments, and the meta and ins tags from the page, then remove blank lines.
Second, compute a score for each line (described in detail below).
Third, find the start and end of the contiguous run of lines whose scores have the maximum sum.
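The third step is a maximum-sum contiguous run search. Here is a minimal sketch of it, assuming the per-line scores have already been computed. The function name max_sum_run and the sample scores are made up for illustration and are not part of the original method.

```python
def max_sum_run(scores):
    """Return (start, end) so that scores[start:end] has the largest sum.

    Standard maximum-subarray scan (Kadane's algorithm) that also
    tracks where the best run begins and ends.
    """
    best_sum = scores[0]
    best_start, best_end = 0, 1
    run_sum, run_start = 0, 0
    for i, score in enumerate(scores):
        if run_sum <= 0:
            # A non-positive running sum cannot help; start a new run here.
            run_sum, run_start = score, i
        else:
            run_sum += score
        if run_sum > best_sum:
            best_sum = run_sum
            best_start, best_end = run_start, i + 1
    return best_start, best_end


# Hypothetical scores: two navigation lines, four body lines, two footer lines.
scores = [-8, -3, 40, 112, -8, 95, -8, -8]
start, end = max_sum_run(scores)
print(start, end)  # 2 6
```

The short negative line inside the body (index 4) is kept because the lines around it outweigh it, while the negative lines at both ends are dropped.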
The second step in more detail:
For each line we compute a score from the following quantities:
x1: the number of image tags (img) in the line; each one counts as 50 characters of text (an assigned weight)
x2: the number of video tags (embed) in the line; each one counts as 1000 characters of text
x3: the total text length inside all link tags (a) in the line; its weight is zero
x4: the text length of everything else in the line
Score of a line = 50 * x1 + 1000 * x2 + x4 - 8
Note: 8 is subtracted because the last step looks for the contiguous run with the maximum sum. If every score were non-negative, that run would always be the whole page, so a positive constant has to be subtracted to make short, low-content lines negative. How large the constant should be is a matter of experience; 8 is an empirical choice.
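To make the formula concrete, here is a small sketch that scores single lines of HTML. It is an illustration under assumptions, not the article's own listing: the name line_score, the constants, and the sample lines are invented here, and the regular expressions are a simple approximation of tag matching.

```python
import re

IMG_WEIGHT = 50      # each img tag counts as 50 characters
VIDEO_WEIGHT = 1000  # each embed tag counts as 1000 characters
OFFSET = 8           # empirical constant subtracted from every line


def line_score(line):
    """Score one line of HTML as 50*x1 + 1000*x2 + x4 - 8."""
    x1 = len(re.findall(r'<img\b[^>]*>', line, re.I))
    x2 = len(re.findall(r'<embed\b[^>]*>', line, re.I))
    # x3: text inside <a>...</a>, with any nested tags stripped.
    link_texts = re.findall(r'<a\b[^>]*>(.*?)</a>', line, re.I | re.S)
    x3 = sum(len(re.sub(r'<[^>]+>', '', t)) for t in link_texts)
    # x4: all remaining text once tags and link text are discounted.
    all_text = re.sub(r'<[^>]+>', '', line).strip()
    x4 = len(all_text) - x3
    return IMG_WEIGHT * x1 + VIDEO_WEIGHT * x2 + x4 - OFFSET


# Hypothetical sample lines.
samples = [
    '<li><a href="/news">News</a></li>',       # link only
    '<p>Hello world, this is body text.</p>',  # plain text
    '<p><img src="a.png"></p>',                # one image
    '<embed src="v.swf">',                     # one video
]
for line in samples:
    print(line_score(line), line)
```

The link-only line comes out at -8, the 31-character paragraph at 23, the image at 42, and the video at 992, so navigation lines pull a run down while text, images, and video push it up.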
Complete code:
#coding:utf-8
import re

FLAGS = re.I | re.M | re.S


def remove_js_css(content):
    """Remove JavaScript, stylesheets, comments, meta and ins tags
    (<script>...</script>, <style>...</style>, <!-- ... -->, <meta ...>, <ins>...</ins>)."""
    s = re.sub(r'<script.*?</script>', '', content, flags=FLAGS)
    s = re.sub(r'<style.*?</style>', '', s, flags=FLAGS)
    s = re.sub(r'<!--.*?-->', '', s, flags=FLAGS)
    s = re.sub(r'<meta.*?>', '', s, flags=FLAGS)
    s = re.sub(r'<ins.*?</ins>', '', s, flags=FLAGS)
    return s


def remove_empty_line(content):
    """Remove whitespace-only lines and collapse repeated newlines."""
    s = re.sub(r'^\s+$', '', content, flags=re.M | re.S)
    s = re.sub(r'\n+', '\n', s)
    return s


def remove_any_tag(s):
    """Strip every tag and return the remaining text."""
    s = re.sub(r'<[^>]+>', '', s)
    return s.strip()


def remove_any_tag_but_a(s):
    """Return (length of link text, length of all text) for s."""
    # \b keeps <a ...> and <a> but skips tags such as <abbr> or <area>
    text = re.findall(r'<a\b[^>]*>(.*?)</a>', s, re.I | re.S)
    text_a = ''.join(re.sub(r'<[^>]+>', '', t) for t in text)
    text_b = remove_any_tag(s)
    return len(text_a), len(text_b)


def remove_image(s, n=50):
    """Replace each <img> tag with n characters of placeholder text."""
    image = 'a' * n
    return re.sub(r'<img.*?>', image, s, flags=FLAGS)


def remove_video(s, n=1000):
    """Replace each <embed> tag with n characters of placeholder text."""
    video = 'a' * n
    return re.sub(r'<embed.*?>', video, s, flags=FLAGS)


def sum_max(values):
    """Return (left, right) such that values[left:right] is the
    contiguous run with the largest sum (maximum-subarray scan)."""
    glo_max = values[0]
    left, right = 0, 0
    cur_max, cur_left = 0, 0
    for index, value in enumerate(values):
        if cur_max <= 0:
            # a non-positive running sum cannot help: restart the run here
            cur_max, cur_left = value, index
        else:
            cur_max += value
        if cur_max > glo_max:
            glo_max = cur_max
            left, right = cur_left, index
    return left, right + 1


def method_1(content, k=1):
    """Score the page in groups of k lines and locate the body.

    Returns (first line, line after last, start offset, end offset)."""
    if not content:
        return None, None, None, None
    tmp = content.split('\n')
    group_value = []
    for i in range(0, len(tmp), k):
        group = '\n'.join(tmp[i:i + k])
        group = remove_image(group)
        group = remove_video(group)
        text_a, text_b = remove_any_tag_but_a(group)
        # link text has weight zero; 8 is the empirical offset
        temp = (text_b - text_a) - 8
        group_value.append(temp)
    left, right = sum_max(group_value)
    # convert group indices to line indices (identical when k == 1)
    left, right = left * k, min(right * k, len(tmp))
    return left, right, len('\n'.join(tmp[:left])), len('\n'.join(tmp[:right]))


def extract(content):
    """Return the body portion of an HTML page given as a string."""
    content = remove_empty_line(remove_js_css(content))
    left, right, x, y = method_1(content)
    return '\n'.join(content.split('\n')[left:right])
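The entry point is the last function, extract: pass it the HTML source of a page as a string and it returns the lines it identifies as the body.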