This example walks through a simple crawler built on Scrapy, shared here for your reference. The full code is as follows:
# Standard Python library imports

# 3rd party imports
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

# My imports
from poetry_analysis.items import PoetryAnalysisItem

HTML_FILE_NAME = r'.+\.html'


class PoetryParser(object):
    """Provides common parsing for poems formatted this one specific way."""

    date_pattern = r'(\d{2} \w{3,9} \d{4})'

    def parse_poem(self, response):
        hxs = HtmlXPathSelector(response)
        item = PoetryAnalysisItem()
        # All the poetry text is in pre tags
        text = hxs.select('//pre/text()').extract()
        item['text'] = ''.join(text)
        item['url'] = response.url
        # head/title contains "title - A poem by author"
        title_text = hxs.select('//head/title/text()').extract()[0]
        item['title'], item['author'] = title_text.split('-')
        item['author'] = item['author'].replace('A poem by', '')
        for key in ['title', 'author']:
            item[key] = item[key].strip()
        item['date'] = hxs.select("//p[@class='small']/text()").re(self.date_pattern)
        return item


class PoetrySpider(CrawlSpider, PoetryParser):
    name = 'example.com_poetry'
    allowed_domains = ['www.example.com']
    root_path = 'someuser/poetry/'
    start_urls = ['http://www.example.com/someuser/poetry/recent/',
                  'http://www.example.com/someuser/poetry/less_recent/']
    rules = [Rule(SgmlLinkExtractor(allow=[start_urls[0] + HTML_FILE_NAME]),
                  callback='parse_poem'),
             Rule(SgmlLinkExtractor(allow=[start_urls[1] + HTML_FILE_NAME]),
                  callback='parse_poem')]
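The string handling inside parse_poem can be tried on its own, outside of Scrapy, with nothing but the standard library. This is a minimal sketch of the same logic; the sample title and date strings below are illustrative, not taken from a real page:

```python
import re

# Same pattern as PoetryParser.date_pattern: 2-digit day, month name, 4-digit year
DATE_PATTERN = r'(\d{2} \w{3,9} \d{4})'

def parse_title(title_text):
    """Split '<title> - A poem by <author>' into a (title, author) pair."""
    title, author = title_text.split('-')
    author = author.replace('A poem by', '')
    return title.strip(), author.strip()

def parse_date(text):
    """Return every date string matching DATE_PATTERN, like the .re() call."""
    return re.findall(DATE_PATTERN, text)

# Illustrative sample values
title, author = parse_title('Ode to Autumn - A poem by John Keats')
dates = parse_date('Posted on 21 September 1819')
```

Here title becomes 'Ode to Autumn', author becomes 'John Keats', and dates becomes ['21 September 1819']. Note that the naive split('-') would break on titles that themselves contain a hyphen, which is why the original only claims to handle poems "formatted this one specific way".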
I hope this article will help you with your Python programming.