Use of Python readability:
from readability.readability import Document
import urllib
# url is the address of the page to fetch (Python 2 urllib API)
html = urllib.urlopen(url).read()
readable_article = Document(html).summary()
readable_title = Document(html).short_title()
The readable_article extracted in the last step is text that still carries HTML tags. In many cases, however, the tagged text that readability returns is not what we want, that is, readability has picked the wrong content. When this happens, we can preprocess the HTML before passing it in.
For example, if the body to be extracted lies between <div class="arti-con rel"> and <div class="clearfix page-n-p-con">, we can do the following.
from readability.readability import Document
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse
import urllib
html = urllib.urlopen(url).read()
# Keep only the part between the opening body div and the pagination div
content_t = html.split('<div class="arti-con rel">')[-1].strip().split('<div class="clearfix page-n-p-con"')[0].strip()
# Put back the opening tag that split() removed
content_t = '<div class="arti-con rel">' + content_t
readable_article = Document(content_t).summary()
# Wrap the result in a response so XPath can strip the remaining tags
response = HtmlResponse(url='', body=readable_article, encoding='utf8')
hxs = HtmlXPathSelector(response)
html_content = ''.join(hxs.select('//text()').extract()).strip()
The body obtained this way is relatively clean, and cases where nothing is extracted become less frequent. The drawback is that the approach does not suit a site with many different pages, because the split markers are hard-coded for one page layout.
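The trimming and tag-stripping steps above can also be sketched with only the Python 3 standard library. This is a minimal sketch, not part of readability or Scrapy: the names slice_between and text_only are hypothetical helpers introduced here for illustration, and the fallback for pages that lack the start marker is an assumption meant to soften the drawback just mentioned. The trimmed string would still be handed to Document(...).summary() as before.

```python
from html.parser import HTMLParser


def slice_between(html, start_marker, end_marker):
    """Return html from the last start_marker up to the next end_marker.

    Hypothetical helper. If start_marker is missing, the input is returned
    unchanged, so pages with another layout are passed on whole.
    """
    start = html.rfind(start_marker)  # last occurrence, like split(...)[-1]
    if start == -1:
        return html
    end = html.find(end_marker, start + len(start_marker))
    if end == -1:
        # No end marker: keep everything from the start marker onward
        return html[start:].strip()
    return html[start:end].strip()


class _TextCollector(HTMLParser):
    """Collects every text node, similar in spirit to the XPath //text()."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def text_only(html_fragment):
    """Hypothetical helper: drop the tags and return the joined text."""
    collector = _TextCollector()
    collector.feed(html_fragment)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts).strip()


if __name__ == "__main__":
    # Made-up sample page using the two div classes from the example above
    page = (
        '<html><body><div class="nav">Home</div>'
        '<div class="arti-con rel"><p>First paragraph.</p></div>'
        '<div class="clearfix page-n-p-con">Next page</div></body></html>'
    )
    trimmed = slice_between(
        page, '<div class="arti-con rel">', '<div class="clearfix page-n-p-con"'
    )
    print(text_only(trimmed))
```

Returning the page unchanged when the marker is absent lets readability fall back to its own guess on that page, at the cost of the cleaner result that the trimming gives.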
Optimizing page body extraction with Python readability