It has been a while since I last wrote a crawler, so here is a Scrapy crawler for NetEase news. The code is based on a crawler prototype I found on GitHub, and since I have also been looking at MongoDB recently, I used it for storage to get a feel for working with a NoSQL database. A Scrapy project has several files that need to be changed. This crawler requires MongoDB and PyMongo to be installed; once you connect to the database, the find command lets you view its contents, for example the document shown below:
{
"_id" : ObjectId("5577ae44745d785e65fa8686"),
"from_url" : "http://tech.163.com/",
"news_body" : [
"Technology News June 9th Morning News 2015",
"The Global Developers Conference (WWDC 2015) is in the old",
"Hang, Netease Technology carried out a full live video broadcast. The latest",
"9 operating system is in",
"The performance has been greatly improved, enabling split-screen display. It can also support the picture-in-picture function.",
"The new iOS 9 adds a QuickType keyboard to make typing and editing easier and faster.
When using the iPad with an external keyboard. Users can use shortcut keys to operate, such as switching between different apps. ",
"And. iOS 9 once again designed a switch between apps. The iPad's split-screen feature allows users to open a second app at the same time they don't leave the current app. This means that the two apps are on the same screen. Time to open, parallel operation.
The ratio of the two screens can be 5:5 or 7:3. ",
"In addition, the iPad also supports the "Picture in Picture" function, which can zoom the video being played to a corner and then use the other space on the screen to handle other work.
",
"It is revealed that the split screen function only supports iPad Air2. The picture-in-picture function will only support iPad Air, iPad Air2, iPad mini2, iPad mini3.",
"\r\n"
],
"news_from" : "Netease Technology Report",
"news_thread" : "ARKR2G22000915BD",
"news_time" : "2015-06-09 02:24:55",
"news_title" : "iOS 9 can implement split screen function on iPad",
"news_url" : "http://tech.163.com/15/0609/02/ARKR2G22000915BD.html"
}
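For reference, a stored document like the one above can be pulled back out of MongoDB with a find query. Below is a minimal PyMongo sketch, assuming the NewsDB database and new collection that store.py creates later in this post:
import pymongo

client = pymongo.MongoClient("127.0.0.1", 27017)
# database and collection names match those used in store.py / pipelines.py below
for doc in client.NewsDB.new.find({"news_thread": "ARKR2G22000915BD"}):
    print(doc["news_title"])
    print(doc["news_url"])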
Here are the files that need to be changed:
1. The spider file, which defines the crawl rules, mainly with XPath.
2. items.py, which specifies which fields to crawl.
3. pipelines.py, which receives the scraped data and stores it. Here we will also add a store.py file that creates the MongoDB database connection.
4. settings.py, the configuration file, mainly used to configure the proxy, user agent, crawl interval, download delay, and so on.
Those are the main files. I have added a couple of features to this Scrapy crawler: first, a database connection, so results are stored in MongoDB rather than as JSON or TXT files; second, follow = True is set in the spider's rules, which means the crawler keeps following links found in the pages it has crawled, effectively a depth-first search. Below we look at the source code.
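For orientation, the project layout looks roughly like this (the directory names come from the standard scrapy startproject tech163 template; the spider file name is arbitrary, as explained below):
tech163/
    scrapy.cfg
    tech163/
        __init__.py
        items.py
        pipelines.py
        settings.py
        store.py              # added by us, creates the MongoDB connection
        spiders/
            __init__.py
            news_spider.py    # the spider file, any name will do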
As usual, we start by writing the items.py file.
# -*- coding: utf-8 -*-
import scrapy

class Tech163Item(scrapy.Item):
    news_thread = scrapy.Field()
    news_title = scrapy.Field()
    news_url = scrapy.Field()
    news_time = scrapy.Field()
    news_from = scrapy.Field()
    from_url = scrapy.Field()
    news_body = scrapy.Field()
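A Tech163Item behaves like a dictionary, which is what lets the spider below fill it field by field and the pipeline convert it with dict(item) later. For example:
item = Tech163Item()
item['news_title'] = 'iOS 9 can implement split screen function on iPad'
print(dict(item))   # {'news_title': 'iOS 9 can implement split screen function on iPad'}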
Then we write the spider file. The file itself can be given any name; when running the crawler we only need to know the spider's name, i.e. the name = "news" attribute. Our spider is called news. If you want to use this crawler you will probably need to change the allow pattern in the rule below, because NetEase does not keep news older than about a year. For example, you could change the pattern to /15/08 for August 2015 to crawl more recent articles.
#encoding:utf-8
import scrapy
import re
from scrapy.selector import Selector
from tech163.items import Tech163Item
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class Spider(CrawlSpider):
    name = "news"
    allowed_domains = ["tech.163.com"]
    start_urls = ['http://tech.163.com/']
    rules = (
        Rule(
            # the regex /15/06\d+/\d+/* matches article URLs that start with
            # /15/06 (June 2015) followed by digits/digits/anything
            LinkExtractor(allow = r"/15/06\d+/\d+/*"),
            callback = "parse_news",
            # follow=True tells the crawler to keep following links found
            # on the pages it has already crawled
            follow = True
        ),
    )

    def parse_news(self, response):
        item = Tech163Item()
        item['news_thread'] = response.url.strip().split('/')[-1][:-5]
        self.get_title(response, item)
        self.get_source(response, item)
        self.get_url(response, item)
        self.get_news_from(response, item)
        self.get_from_url(response, item)
        self.get_text(response, item)
        return item

    def get_title(self, response, item):
        title = response.xpath("/html/head/title/text()").extract()
        if title:
            item['news_title'] = title[0][:-5]

    def get_source(self, response, item):
        source = response.xpath("//div[@class='ep-time-soure cDGray']/text()").extract()
        if source:
            item['news_time'] = source[0][9:-5]

    def get_news_from(self, response, item):
        news_from = response.xpath("//div[@class='ep-time-soure cDGray']/a/text()").extract()
        if news_from:
            item['news_from'] = news_from[0]

    def get_from_url(self, response, item):
        from_url = response.xpath("//div[@class='ep-time-soure cDGray']/a/@href").extract()
        if from_url:
            item['from_url'] = from_url[0]

    def get_text(self, response, item):
        news_body = response.xpath("//div[@id='endText']/p/text()").extract()
        if news_body:
            item['news_body'] = news_body

    def get_url(self, response, item):
        news_url = response.url
        if news_url:
            item['news_url'] = news_url
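To crawl a different month, only the allow pattern in the rule needs to change; for example, for August 2015 (as mentioned above) it would become the line below. The spider is then started by its name from the project directory.
# change the LinkExtractor pattern to match August 2015 article URLs
LinkExtractor(allow = r"/15/08\d+/\d+/*")
# then run the crawler from the project root with:  scrapy crawl news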
After that we create a store.py file. In this file we create the database connection, which the pipeline file will then import in order to store data in the database. Here is the source code.
import pymongo
import random
HOST = "127.0.0.1"
PORT = 27017
client = pymongo.MongoClient(HOST,PORT)
NewsDB = client.NewsDB
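A quick way to check that the connection works is to import the collection and query it, assuming a local mongod is running on the default port:
from store import NewsDB
# returns one stored document, or None if nothing has been crawled yet
print(NewsDB.new.find_one())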
In pipelines.py we import the NewsDB database and use an update with upsert to insert each piece of news into the database. There are two checks: one tests whether the spider's name is news, the other tests whether the thread number is empty. The key line is NewsDB.new.update(spec, {"$set": dict(item)}, upsert=True), which inserts the item's dictionary into the database, or updates the existing document with the same news_thread.
from store import NewsDB

class Tech163Pipeline(object):
    def process_item(self, item, spider):
        # only handle items produced by the "news" spider
        if spider.name != "news":
            return item
        # skip items without a thread id
        if item.get("news_thread", None) is None:
            return item
        spec = {"news_thread": item["news_thread"]}
        # upsert: insert the news item, or update it if it already exists
        NewsDB.new.update(spec, {"$set": dict(item)}, upsert = True)
        return None
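Note that Collection.update() is the PyMongo 2.x-era API; newer PyMongo versions have removed it in favour of update_one(). A sketch of the equivalent upsert with the newer API:
# equivalent upsert with the newer PyMongo API (3.x and later)
NewsDB.new.update_one(spec, {"$set": dict(item)}, upsert=True)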
Finally, we change the user_agent in the settings file, so that the crawler imitates the behavior of a browser as closely as possible. That way it can crawl the content we want without being blocked.
BOT_NAME = 'tech163'
SPIDER_MODULES = ['tech163.spiders']
NEWSPIDER_MODULE = 'tech163.spiders'
ITEM_PIPELINES = ['tech163.pipelines.Tech163Pipeline',]
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tech163 (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.7'
DOWNLOAD_TIMEOUT = 15
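The settings file is also where the crawl interval mentioned in point 4 above would go; if NetEase starts throttling requests, a download delay can be added. The values below are only an illustration, not part of the original configuration:
# optional politeness settings (values are only an example)
DOWNLOAD_DELAY = 2        # seconds to wait between requests
#ROBOTSTXT_OBEY = True    # obey robots.txt if desired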
And that's it: a Scrapy crawler that crawls NetEase news and stores it in MongoDB.