The Scrapy item pipeline is an important module: its main job is to take the items returned by a spider and write them to a database, a file, or some other persistent store. Below is a brief look at how pipelines are used.
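To make the interface concrete, here is a minimal sketch of what Scrapy expects from a pipeline class: it only needs a process_item method that receives each item and either returns it or raises DropItem to discard it. The class name and the 'name' field check below are purely illustrative assumptions, not part of the example project that follows.

# Minimal pipeline sketch; MinimalPipeline and the 'name' check are
# illustrative, not taken from the project shown below.
from scrapy.exceptions import DropItem


class MinimalPipeline(object):
    def process_item(self, item, spider):
        # Called once for every item the spider yields.
        if not item.get('name'):
            raise DropItem('missing name field')
        return item  # hand the item on to the next enabled pipeline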
Case one:
Items Pool
import scrapy


class ZhihuUserItem(scrapy.Item):
    # Define the fields for your item here like:
    # name = scrapy.Field()
    id = scrapy.Field()
    name = scrapy.Field()
    avatar_url = scrapy.Field()
    headline = scrapy.Field()
    description = scrapy.Field()
    url = scrapy.Field()
    url_token = scrapy.Field()
    gender = scrapy.Field()
    cover_url = scrapy.Field()
    type = scrapy.Field()
    badge = scrapy.Field()
    answer_count = scrapy.Field()
    articles_count = scrapy.Field()
    commercial_question = scrapy.Field()
    favorite_count = scrapy.Field()
    favorited_count = scrapy.Field()
    follower_count = scrapy.Field()
    following_columns_count = scrapy.Field()
    following_count = scrapy.Field()
    pins_count = scrapy.Field()
    question_count = scrapy.Field()
    thank_from_count = scrapy.Field()
    thank_to_count = scrapy.Field()
    thanked_count = scrapy.Field()
    vote_from_count = scrapy.Field()
    vote_to_count = scrapy.Field()
    voteup_count = scrapy.Field()
    following_favlists_count = scrapy.Field()
    following_question_count = scrapy.Field()
    following_topic_count = scrapy.Field()
    marked_answers_count = scrapy.Field()
    mutual_followees_count = scrapy.Field()
    participated_live_count = scrapy.Field()
    locations = scrapy.Field()
    educations = scrapy.Field()
    employments = scrapy.Field()
items.py
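For context, a spider has to fill this item before the pipeline ever sees it. Below is a hedged sketch of how that might look; the spider name, the module path, and the assumption that the response body is JSON with keys matching the item fields are all illustrative, not taken from the original project.

import json

import scrapy

from zhihuuser.items import ZhihuUserItem  # assumed module path


class UserSpider(scrapy.Spider):
    name = 'users'  # illustrative spider name

    def parse(self, response):
        data = json.loads(response.text)
        item = ZhihuUserItem()
        # Copy only the keys that are declared as fields on the item.
        for field in item.fields:
            if field in data:
                item[field] = data[field]
        yield item  # yielded items are passed to the enabled item pipelines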
Basic configuration for writing to the MongoDB database
# Configure the connection information for the MongoDB database
MONGO_URL = '172.16.5.239'
MONGO_PORT = 27017
MONGO_DB = 'zhihuuser'

# Setting this to False tells Scrapy not to read the "do not crawl" list
# in each site's root directory (for example: www.baidu.com/robots.txt)
ROBOTSTXT_OBEY = False

# Enable the pipeline
ITEM_PIPELINES = {
    'zhihuuser.pipelines.MongoDBPipeline': 300,
}
settings.py
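The integer assigned to the pipeline in ITEM_PIPELINES (conventionally 0-1000) only controls the order in which enabled pipelines run, with lower values running first. Before starting the spider, it can be worth confirming that the connection values in settings.py actually reach the server. The standalone check below is just a sketch using pymongo directly (the host, port, and database name are the ones configured above); it is not part of the Scrapy project.

import pymongo

# Standalone connectivity check, using the values from settings.py above.
client = pymongo.MongoClient('172.16.5.239', 27017)
print(client.server_info()['version'])  # forces a round trip to the server
db = client['zhihuuser']
print(db.name)
client.close()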
pipelines.py:
1. First, read the database address, port, and database name from the settings file.
2. Use that information to connect to the database.
3. Write the data to the database.
4. Close the database connection.
Note: the open and close steps run only once each (when the spider opens and closes), while the write step runs once for every item that is written.
import pymongo


class MongoDBPipeline(object):
    """Pipeline that connects to MongoDB and stores the crawled items."""

    def __init__(self, mongo_url, mongo_port, mongo_db):
        """
        Initialize with the MongoDB URL, port number and database name.
        """
        self.mongo_url = mongo_url
        self.mongo_port = mongo_port
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        """
        1. Read the MongoDB URL, port and database name from settings.
        """
        return cls(
            mongo_url=crawler.settings.get("MONGO_URL"),
            mongo_port=crawler.settings.get("MONGO_PORT"),
            mongo_db=crawler.settings.get("MONGO_DB"),
        )

    def open_spider(self, spider):
        """
        2. Connect to the MongoDB database when the spider is opened.
        """
        self.client = pymongo.MongoClient(self.mongo_url, self.mongo_port)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        """
        3. Write the data to the database (called once per item).
        """
        name = item.__class__.__name__
        # self.db[name].insert_one(dict(item))  # plain insert alternative
        # Upsert by url_token so re-crawled users are updated, not duplicated.
        self.db['user'].update_one(
            {'url_token': item['url_token']},
            {'$set': dict(item)},
            upsert=True,
        )
        return item

    def close_spider(self, spider):
        """
        4. Close the database connection when the spider is closed.
        """
        self.client.close()
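Once a crawl has run, a quick standalone query can confirm that the upsert logic stored the users as expected. This is only a sketch, reusing the connection values from settings.py and the 'user' collection written by the pipeline; it is not part of the project code.

import pymongo

client = pymongo.MongoClient('172.16.5.239', 27017)
db = client['zhihuuser']
# Count stored users and peek at one document written by the pipeline.
print(db['user'].count_documents({}))
print(db['user'].find_one({}, {'name': 1, 'url_token': 1}))
client.close()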