"Crawler" saves the captured data--crawlers with MongoDB.
Video Address
The previous lesson covered how to fetch data. Crawling the data down is only the first step; the second step is to save it. The most obvious option is to write it to a file, and an earlier Python lesson already covered writing files. Saving to a file works, but doesn't it feel a little clumsy to open the whole file and read it every time you want to use the data?
So we usually store the data in a database instead, which makes it easy both to write and to read back. For most crawls, Python's dict gives the captured data all the structure it needs, and the database that pairs best with dicts is MongoDB!
MongoDB
- Distributed
- Loose data structures (JSON-style documents)
- A powerful query language
Document
A document can be seen as a dict, and a dict can nest other dicts, for example:
{"name": "alan", score_list: {"chinese": 90, "english": 80}}
Collection
A collection is a set of documents, i.e., a bunch of dicts.
Database
Multiple collections make up a database.
To put it all together: think of MongoDB as a library. Every book in the library is a document, a bookshelf of books is a collection, and all the bookshelves in the building add up to a database.
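In pymongo the analogy maps straight onto subscripting: the client is the library building, indexing it gives you a database, indexing that gives you a collection, and the collection holds the documents. A minimal sketch, with made-up names:

```python
import pymongo

client = pymongo.MongoClient("127.0.0.1", 27017)  # the library building
db = client["library"]    # a database ("library" is a made-up name)
shelf = db["books"]       # a collection: one bookshelf
book = shelf.find_one()   # a document: one book, or None if the shelf is empty
```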
Installation
Official installation method
If you have followed my tutorials, you know I won't spell out the exact steps here; I encourage you to work through the official documentation yourself rather than waiting to be spoon-fed.
How to store the captured data in MongoDB
- Shape the captured data into the dict you want.
- Insert it into the designated bookshelf (the collection), as shown in the sketch after this list.
- That's it.
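Put concretely, a minimal sketch of those three steps, assuming a local mongod on the default port; the crawler/pages names and the item fields are made up:

```python
import pymongo

# Step 1: shape the captured data into the dict you want.
item = {"url": "http://example.com/page/1", "title": "an example title"}

# Step 2: insert it onto the designated bookshelf (collection).
client = pymongo.MongoClient("127.0.0.1", 27017)
client["crawler"]["pages"].insert_one(item)

# Step 3: that's it.
```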
Create/read/update/delete (CRUD) example, Python 2 version
You need to install pymongo first:
pip install pymongo
mongo_api.py
# -*- coding: utf-8 -*-
import pymongo
import sys
import unittest
reload(sys)
sys.setdefaultencoding('utf-8')  # Python 2 only: default to utf-8 for str/unicode conversions
class MongoAPI(object):
def __init__(self, db_ip, db_port, db_name, table_name):
self.db_ip = db_ip
self.db_port = db_port
self.db_name = db_name
self.table_name = table_name
        # Connect to the MongoDB server, then select the database and collection
        self.conn = pymongo.MongoClient(host=self.db_ip, port=self.db_port)
self.db = self.conn[self.db_name]
self.table = self.db[self.table_name]
def get_one(self, query):
return self.table.find_one(query, projection={"_id": False})
def get_all(self, query):
return self.table.find(query)
def add(self, kv_dict):
return self.table.insert(kv_dict)
def delete(self, query):
return self.table.delete_many(query)
    def check_exist(self, query):
        # find_one returns None when nothing matches
        return self.get_one(query) is not None
    # update documents matching query; if none match, insert kv_dict as a new document
def update(self, query, kv_dict):
ret = self.table.update_many(
query,
{
"$set": kv_dict,
}
)
        if ret.matched_count == 0:
            self.add(kv_dict)
        elif ret.matched_count > 1:
            # more than one match: collapse them into a single fresh document
            self.delete(query)
            self.add(kv_dict)
class DBAPITest(unittest.TestCase):
def setUp(self):
        self.db_api = MongoAPI("127.0.0.1",    # address of the library building (host)
                               27017,          # the library's street number (port)
                               "test",         # reading room no. 1 (database)
                               "test_table")   # the first bookshelf (collection)
def test(self):
db_api = self.db_api
db_api.add({"url": "test_url", "k": "v"})
self.assertEqual(db_api.get_one({"url": "test_url"})["k"], "v")
db_api.update({"url": "test_url"}, {"url_update": "url_update"})
ob = db_api.get_one({"url": "test_url"})
self.assertEqual(ob["url_update"], "url_update")
db_api.delete({"url": "test_url"})
self.assertEqual(db_api.get_one({"url": "test_url"}), None)
if __name__ == '__main__':
unittest.main()
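To try it, start a local mongod on the default port and run the file directly with python mongo_api.py; the unittest suite will exercise add, get_one, update, and delete against the test/test_table collection and clean up after itself.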