"Crawler" saves the captured data--crawlers with MongoDB.


How to fetch data was covered in the previous lesson. Scraping the data is only the first step; the second step is to store it. The most obvious option is to save it to a file, and an earlier Python lesson already covered writing files. Saving to a file works, but don't you find it a little clumsy to open and read through the whole file every time you need the data?
So we usually store the data in a database instead, which makes it easy to write and to read back. For most scraping jobs, Python's dict gives us all the structure we need for the data we collect, and the tool that lets a dict really shine is MongoDB!

MongoDB
    1. Distributed
    2. Loose data structure (JSON)
    3. Powerful query language
Document

A document can be seen as a dict, and a dict can nest other dicts. For example:

    {"name": "alan", "score_list": {"chinese": 90, "english": 80}}
Collection

A collection is a set of documents, that is, a bunch of dicts.

Database

Multiple collections make up a database.

To put it all together: think of MongoDB as a library. Every book in the library is a document, the books on one bookshelf form a collection, and all the bookshelves in one reading room add up to a database.
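
In PyMongo the analogy maps directly onto objects. A minimal sketch, assuming a server on localhost and using the same names as the test code further below:

    import pymongo

    client = pymongo.MongoClient("127.0.0.1", 27017)  # the library building
    db = client["test"]                               # one reading room: a database
    collection = db["test_table"]                     # one bookshelf: a collection
    book = {"title": "MongoDB basics"}                # one book: a document
    collection.insert_one(book)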

Installation

Official installation method
Anyone following my tutorials knows by now that I won't spell out the exact steps; I encourage you to work through the official documentation yourself rather than be spoon-fed.

How to store the scraped data in MongoDB
    1. Shape the scraped data into the dict form you want.
    2. Insert it onto the designated bookshelf (collection), as in the sketch below.
    3. That's all there is to it.
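
Those three steps in code, as a minimal sketch (a local MongoDB is assumed, and the item fields are hypothetical):

    import pymongo

    # step 1: shape the scraped data into the dict you want
    item = {"url": "http://example.com/1", "title": "some page"}

    # step 2: insert it onto the designated bookshelf (collection)
    table = pymongo.MongoClient("127.0.0.1", 27017)["test"]["test_table"]
    table.insert_one(item)

    # step 3: that's it -- the document is now queryable
    print(table.find_one({"url": "http://example.com/1"}))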
CRUD (add, delete, update, query) example, Python 2 version

You need to install PyMongo first:

    pip install pymongo
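
If you want to confirm the driver installed correctly, importing it and printing its version string is enough:

    import pymongo
    print(pymongo.version)  # any recent 3.x release works with the code below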

mongo_api.py

    # -*- coding: utf-8 -*-
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import pymongo
    import sys
    import unittest

    reload(sys)
    sys.setdefaultencoding('utf-8')  # Python 2 only: default str encoding to UTF-8


    class MongoAPI(object):
        def __init__(self, db_ip, db_port, db_name, table_name):
            self.db_ip = db_ip
            self.db_port = db_port
            self.db_name = db_name
            self.table_name = table_name
            self.conn = pymongo.MongoClient(host=self.db_ip, port=self.db_port)
            self.db = self.conn[self.db_name]
            self.table = self.db[self.table_name]

        def get_one(self, query):
            # drop the internal _id field from the result
            return self.table.find_one(query, projection={"_id": False})

        def get_all(self, query):
            return self.table.find(query)

        def add(self, kv_dict):
            # insert_one replaces the insert() call deprecated in PyMongo 3
            return self.table.insert_one(kv_dict)

        def delete(self, query):
            return self.table.delete_many(query)

        def check_exist(self, query):
            # find_one returns None when nothing matches
            return self.get_one(query) is not None

        # if nothing matches the query, a new document is created
        def update(self, query, kv_dict):
            ret = self.table.update_many(
                query,
                {
                    "$set": kv_dict,
                }
            )
            if ret.matched_count == 0:
                self.add(kv_dict)
            elif ret.matched_count > 1:
                # collapse duplicates into a single fresh document
                self.delete(query)
                self.add(kv_dict)


    class DBAPITest(unittest.TestCase):
        def setUp(self):
            self.db_api = MongoAPI("127.0.0.1",    # address of the library building
                                   27017,          # the building's door number
                                   "test",         # reading room No. 1
                                   "test_table")   # the first row of bookshelves

        def test(self):
            db_api = self.db_api
            db_api.add({"url": "test_url", "k": "v"})
            self.assertEqual(db_api.get_one({"url": "test_url"})["k"], "v")
            db_api.update({"url": "test_url"}, {"url_update": "url_update"})
            ob = db_api.get_one({"url": "test_url"})
            self.assertEqual(ob["url_update"], "url_update")
            db_api.delete({"url": "test_url"})
            self.assertEqual(db_api.get_one({"url": "test_url"}), None)


    if __name__ == '__main__':
        unittest.main()
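
The header comments suggest this class was lifted from a Scrapy project. Here is a minimal sketch of wiring MongoAPI into an item pipeline; the pipeline class and the module paths are hypothetical, so adapt them to your own project:

    # pipelines.py -- hypothetical Scrapy pipeline built on MongoAPI
    from mongo_api import MongoAPI

    class MongoPipeline(object):
        def __init__(self):
            # same "library" coordinates as in the test above
            self.db_api = MongoAPI("127.0.0.1", 27017, "test", "test_table")

        def process_item(self, item, spider):
            # shape the item into a plain dict, then insert it
            self.db_api.add(dict(item))
            return item

    # settings.py would then register the pipeline, e.g.:
    # ITEM_PIPELINES = {"myproject.pipelines.MongoPipeline": 300}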

