Python news crawler based on Scrapy framework

Source: Internet
Author: User
Tags: xpath

Overview: This project is a Python news crawler based on the Scrapy framework. It can crawl news from the NetEase, Sohu, Phoenix and The Paper (Pengpai) websites, organizing the title, body, comments, time and other content and saving it locally. Detailed code download: http://www.demodashi.com/demo/13933.html

First, Development background

Python, a strong performer in data processing, has kept growing in popularity in recent years. Web crawlers are arguably one of the most representative Python applications, so learning about Python, networking and data processing through a web crawler is quite appropriate.

Scrapy is a fast, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. Crawlers built on the Scrapy framework are better structured and more efficient than hand-written crawlers and can handle more complex crawl tasks.

Second, crawler results

1. Title

2. Content

3. Comments

4. Date, heat and ID

5. Program Operation diagram

Third, the specific development

1. Task requirements

1. Crawl the articles and comments of the NetEase, Sohu, Phoenix and The Paper (Pengpai) news websites

2. The number of crawled news pages should be no less than 100,000

3. Each news page and its comments can be updated within one day

2. Function design

1. Design a web crawler that can crawl all pages of a specified website and extract articles and comments from them

2. Run the web crawler on a schedule so that the data is updated daily (a minimal scheduling sketch follows below)
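How the daily run is triggered is not described in the article; the following is only a sketch of one possible approach, in which hypothetical per-site startup scripts (debug_xx.py) are launched once every 24 hours:

# Minimal scheduling sketch (assumption: one debug_xx.py startup script per site;
# the script names and the 24-hour interval are illustrative, not from the project).
import subprocess
import time

SPIDER_SCRIPTS = ["debug_163.py", "debug_sohu.py"]  # hypothetical names

def run_all_spiders():
    for script in SPIDER_SCRIPTS:
        # Start each site's crawler in its own process and wait for it to finish
        subprocess.run(["python", script], check=False)

if __name__ == "__main__":
    while True:
        run_all_spiders()
        time.sleep(24 * 60 * 60)  # wait one day before the next update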

3. System Architecture

Let's start with a brief introduction to the Scrapy framework, a web crawler framework.

In the Scrapy architecture diagram, the green lines are the data flow:

(1) Starting from the initial URLs, the Scheduler hands requests to the Downloader to download.

(2) After downloading, the response is handed to the Spider for analysis; the Spider is where the core crawler logic lives.

(3) The Spider's analysis produces two kinds of results: links that need to be crawled further, which are sent back to the Scheduler through the middleware, and data to be saved, which goes into the Item Pipeline for processing and storage.

(4) Finally, all data is output and saved to a file. A minimal sketch of this flow is shown below.
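The sketch below is illustrative only and is not the project's code; the spider and pipeline names (MiniNewsSpider, PrintPipeline) are made up to show how the pieces fit together:

# Minimal illustration of the Scrapy data flow (not the project's code):
# Scheduler -> Downloader -> Spider -> Item Pipeline.
import scrapy

class MiniNewsSpider(scrapy.Spider):
    name = "mini_news"
    start_urls = ["http://news.163.com/"]  # initial URL handed to the Scheduler

    def parse(self, response):
        # The Downloader fetches the page and hands the response to the Spider.
        for href in response.css("a::attr(href)").getall():
            # Result type 1: links to crawl further, sent back to the Scheduler
            # (in a real project, allowed_domains or rules would limit this).
            yield response.follow(href, callback=self.parse)
        # Result type 2: data to save, sent on to the Item Pipeline.
        yield {"url": response.url, "title": response.css("title::text").get()}

class PrintPipeline:
    def process_item(self, item, spider):
        # The pipeline is where items are processed and stored.
        print(item)
        return item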

4. Actual Project

(1) Project structure

As you can see, NewsSpider-master is the full project folder. It contains a crawler startup script debug_xx.py for each website, the scrapyspider folder holds the files required by the Scrapy framework, and the spiders folder holds the actual crawler code.
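The article does not show the contents of these startup scripts. As a guess at what a debug_xx.py script might look like, here is a sketch using scrapy.cmdline.execute, a common way to launch a spider from an IDE; the spider name "163news" is taken from the crawler code below:

# Hypothetical debug_163.py-style startup script (a sketch, not the project's file).
# scrapy.cmdline.execute runs the same command as "scrapy crawl 163news" would on
# the command line, which makes it easy to start and debug the spider from an IDE.
import os
import sys
from scrapy.cmdline import execute

# Make sure the project root (the folder containing scrapy.cfg) is on the path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# "163news" is the spider name defined in news_163.py
execute(["scrapy", "crawl", "163news"])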

(2) Crawler engine

Taking the NetEase News crawler news_163.py as an example, here is a brief explanation of some of the core code:

① Define a crawler:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class News163_spider(CrawlSpider):
    # NetEase News crawler name
    name = "163news"
    # Disguise as a browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    }
    # The whole NetEase site
    allowed_domains = [
        "163.com"
    ]
    # News section
    start_urls = [
        'http://news.163.com/'
    ]
    # Regular expressions describing which URLs may be followed further,
    # e.g. http://news.163.com/\d\d\d\d\d(([\w\._+-])*)*$
    rules = [
        Rule(
            LinkExtractor(
                allow=(r'http://news\.163\.com/.*$'),
                deny=(r'http://.*\.163\.com/photo.*$')
            ),
            callback="parse_item",
            follow=True
        )
    ]
② Web content Analysis Module

Content is extracted from the page with different XPath paths. Because the site has used different page structures at different times, the code is split into several if blocks according to the page layout:

def parse_item(self, response):
    # response is the response for the current URL
    article = Selector(response)
    article_url = response.url
    global count
    # Analyse the page type
    # (comparatively new NetEase News pages, e.g. http://news.163.com/05-17/)
    if get_category(article) == 1:
        articleXpath = '//*[@id="epContentLeft"]'
        if article.xpath(articleXpath):
            titleXpath = '//*[@id="epContentLeft"]/h1/text()'
            dateXpath = '//*[@id="epContentLeft"]/div[1]/text()'
            contentXpath = '//*[@id="endText"]'
            news_infoXpath = '//*[@id="post_comment_area"]/script[3]/text()'
            # Title
            if article.xpath(titleXpath):
                news_item = NewsItem()
                news_item['url'] = article_url
                get_title(article, titleXpath, news_item)
                # Date
                if article.xpath(dateXpath):
                    get_date(article, dateXpath, news_item)
                # Content
                if article.xpath(contentXpath):
                    get_content(article, contentXpath, news_item)
                    count = count + 1
                    news_item['id'] = count
                # Comments
                try:
                    comment_url = get_comment_url(article, news_infoXpath)
                    # Comment processing
                    comments = get_comment(comment_url, news_item)[1]
                    news_item['comments'] = comments
                except:
                    news_item['comments'] = ''
                    news_item['heat'] = 0
                yield news_item
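The NewsItem class filled in above is not shown in the article. Judging from the fields used (url, title, date, content, comments, id, heat), its definition in items.py is presumably something like this sketch:

# Presumed items.py definition, inferred from the fields used in parse_item
# (the actual project file may differ).
import scrapy

class NewsItem(scrapy.Item):
    id = scrapy.Field()        # running article counter
    url = scrapy.Field()       # article URL
    title = scrapy.Field()     # article title
    date = scrapy.Field()      # publication time
    content = scrapy.Field()   # article body
    comments = scrapy.Field()  # extracted comments
    heat = scrapy.Field()      # comment heat / popularity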

Match the date information in the page content according to the regular expression:

'''Universal date handler'''
def get_date(article, dateXpath, news_item):
    # Time
    try:
        article_date = article.xpath(dateXpath).extract()[0]
        pattern = re.compile("(\d.*\d)")  # regex match for the news time
        article_datetime = pattern.findall(article_date)[0]
        # article_datetime = datetime.datetime.strptime(article_datetime, "%Y-%m-%d %H:%M:%S")
        news_item['date'] = article_datetime
    except:
        news_item['date'] = '2010-10-01 17:00:00'

Other functions:

"' Site classification function ' ' Def get_category (article): ' Character filter function ' Def str_replace (content): ' General body handler function ' Def get_content (article, Contentxpath, News_item): ' Comment information extraction function ' Def get_comment_url (article, News_infoxpath): ' Comment handler function ' Def get_comment ( Comment_url, News_item):

(3) Run the crawler and format the storage

① Configuration in settings.py

import sys
# Give the absolute path of the crawler project here, to prevent path lookup bugs
sys.path.append(r'E:\Python\previous Project\NewsSpider-master\scrapyspider')
# Crawler name
BOT_NAME = 'scrapyspider'
# Whether to obey the site's robots.txt rules
ROBOTSTXT_OBEY = True
# Concurrent requests: the larger the value, the faster the crawl but the higher the load
CONCURRENT_REQUESTS = 32
# Disable cookies to help avoid being banned
COOKIES_ENABLED = False
# Output encoding: Excel uses ANSI by default, so keep it consistent here;
# change to UTF-8 or another encoding if required
FEED_EXPORT_ENCODING = 'ANSI'
# Crawl delay: increase it to reduce the pressure on the crawled site's server
DOWNLOAD_DELAY = 0.01
# Maximum number of news items to crawl
CLOSESPIDER_ITEMCOUNT = 500
# Download timeout: give up the current URL if there is no response in time
DOWNLOAD_TIMEOUT = 100
ITEM_PIPELINES = {
    'scrapyspider.pipelines.ScrapyspiderPipeline': 300,  # pipeline class name and priority
}
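The registered pipeline class scrapyspider.pipelines.ScrapyspiderPipeline is not shown in the article. Since the output is a .csv file opened in Excel, it presumably writes one row per news item; the following is only a sketch under that assumption (the real project may instead rely on Scrapy's built-in CSV feed export):

# Sketch of a pipelines.py that writes items to CSV (an assumption, not the
# project's actual code). Field names are taken from the NewsItem sketch above.
import csv

class ScrapyspiderPipeline(object):
    FIELDS = ['id', 'url', 'title', 'date', 'content', 'comments', 'heat']

    def open_spider(self, spider):
        # On Windows the default locale encoding matches the ANSI setting above
        self.file = open('%s_news.csv' % spider.name, 'w', newline='')
        self.writer = csv.DictWriter(self.file, fieldnames=self.FIELDS)
        self.writer.writeheader()

    def process_item(self, item, spider):
        # Write one news item per row, leaving missing fields empty
        self.writer.writerow({k: item.get(k, '') for k in self.FIELDS})
        return item

    def close_spider(self, spider):
        self.file.close()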

② Run the crawlers and save the news content

The crawled news content and comments need to be stored in a consistent format. If you run the debug script in an IDE, the effect is as follows:

When the crawl finishes, the data is saved as a .csv file, which can be opened in Excel for viewing:

③ If you need to extract the comments separately, you can use csv_process.py, with the following effect:
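csv_process.py itself is not included in the article. A minimal sketch of what "extracting the comments separately" could look like, assuming the crawl output is a CSV file with id and comments columns and that multiple comments in one cell are separated by semicolons (all of these are assumptions):

# Sketch of a comment-extraction step in the spirit of csv_process.py
# (file names, column names and the ';' separator are illustrative assumptions).
import csv

def extract_comments(in_path='163news.csv', out_path='163news_comments.csv'):
    with open(in_path, newline='') as src, open(out_path, 'w', newline='') as dst:
        reader = csv.DictReader(src)
        writer = csv.writer(dst)
        writer.writerow(['id', 'comment'])
        for row in reader:
            # One cell may hold several comments; split and write one per line
            for comment in row.get('comments', '').split(';'):
                if comment.strip():
                    writer.writerow([row.get('id', ''), comment.strip()])

if __name__ == '__main__':
    extract_comments()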

Fourth, Other Additions

None at the moment.

Code download: http://www.demodashi.com/demo/13933.html. Note: the copyright belongs to the author; the article is published by Demo Dashi (demodashi.com), and reprinting requires the author's authorization.
