This project was my first look at Python crawlers and also my graduation design project. At the time, most people chose website-style projects, which are common but usually amount to simple CRUD work, and business-style systems felt like very ordinary system design. I had just read an answer on Zhihu about using computer technology to solve practical problems in everyday life (I won't post the link; those who are interested can search for it), and that inspired me to pick this topic.
Abstract: This Python-based distributed data crawling system collects data whose further application is to support a recommendation system. The project focuses on solving the bottleneck of a single-process, single-machine crawler by building a topic crawler around a Redis-based distributed shared request queue. The system is developed with the Python Scrapy framework, uses XPath to extract and parse the downloaded web pages, uses the Redis database to distribute requests among crawler nodes, uses the MongoDB database for data storage, uses the Django web framework together with the Semantic UI open-source front-end framework to visualize the data in a friendly way, and finally uses Docker to deploy the crawler. The distributed crawler system is designed and implemented for the rental listings of the 58.com (58同城) platform.
I. System Function Architecture
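The abstract above describes a crawler built around a Redis-backed shared request queue. As a rough sketch of how such a distributed setup is typically wired with the scrapy_redis extension (these setting values are illustrative; the project's actual configuration is not shown in this section), the key entries in settings.py would look something like this:

# settings.py -- illustrative scrapy-redis configuration, not taken from the project source

# Replace Scrapy's default scheduler and duplicate filter with the Redis-backed
# versions so that all crawler nodes share one request queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the shared queue in Redis between runs so a crawl can be paused and resumed.
SCHEDULER_PERSIST = True

# Location of the Redis server holding the shared queue (host and port are assumptions).
REDIS_URL = "redis://127.0.0.1:6379"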
Field selection is driven mainly by the intended application of the collected data. Because the machine used for development has a relatively low configuration, image files are not downloaded to the local machine, which reduces the load on a single machine.
(f) Data processing
1) Object Definition Program
Item is the container that holds the crawled data. It is declared by creating a class that inherits from scrapy.item.Item, with each attribute defined as a scrapy.item.Field object; the desired data is collected by instantiating the item and filling in its fields. This system defines eight crawled fields: post title, rent, leasing method, district, community, city, post detail page link, and publish time. The fields are defined according to the needs of the data-processing side. The key code is as follows:
from scrapy import Item, Field

class TcZufangItem(Item):
    # post title
    title = Field()
    # rent
    money = Field()
    # leasing method
    method = Field()
    # district
    area = Field()
    # community
    community = Field()
    # post detail page URL
    targeturl = Field()
    # post publish time
    pub_time = Field()
    # city
    city = Field()
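For context, these fields are filled in by the spider, which (as the abstract notes) parses the downloaded pages with XPath. The actual spider code is not reproduced in this section, so the sketch below is hypothetical: the RedisSpider base class, the redis_key name, and the XPath expressions are assumptions rather than the project's exact implementation.

from scrapy_redis.spiders import RedisSpider
from ..items import TcZufangItem  # the item class defined above

class TcZufangSpider(RedisSpider):
    """Hypothetical sketch of a 58.com rental spider; selectors are illustrative."""
    name = 'tczufang'
    redis_key = 'tczufang:start_urls'  # start-URL queue shared through Redis (assumed key name)

    def parse(self, response):
        item = TcZufangItem()
        # The XPath expressions below are placeholders, not the real 58.com page structure.
        item['title'] = response.xpath('//h1/text()').extract_first()
        item['money'] = response.xpath('//b[@class="house_price"]/text()').extract_first()
        item['targeturl'] = response.url
        yield item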
2) Data Processing Program
The pipeline class defines how the data is saved and output. Items returned from the spider's parse method are passed, in priority order, through the pipeline classes registered in the ITEM_PIPELINES setting, and the output is produced accordingly. In this system the data handed to the pipeline is stored in MongoDB. The key code is as follows:
def process_item(self, item, spider):
    # Drop items with missing required fields before writing to MongoDB.
    if item['pub_time'] == 0:
        raise DropItem("Duplicate Item found: %s" % item)
    if item['method'] == 0:
        raise DropItem("Duplicate Item found: %s" % item)
    if item['community'] == 0:
        raise DropItem("Duplicate Item found: %s" % item)
    if item['money'] == 0:
        raise DropItem("Duplicate Item found: %s" % item)
    if item['area'] == 0:
        raise DropItem("Duplicate Item found: %s" % item)
    if item['city'] == 0:
        raise DropItem("Duplicate Item found: %s" % item)
    zufang_detail = {
        'title': item.get('title'),
        'money': item.get('money'),
        'method': item.get('method'),
        'area': item.get('area', ''),
        'community': item.get('community', ''),
        'targeturl': item.get('targeturl'),
        'pub_time': item.get('pub_time', ''),
        'city': item.get('city', '')
    }
    result = self.db['zufang_detail'].insert(zufang_detail)
    print('[success] the ' + item['targeturl'] + ' wrote to MongoDB database')
    return item
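The method above belongs to a pipeline class. For completeness, a minimal sketch of how such a class typically obtains the self.db handle from MongoDB and how it is registered in ITEM_PIPELINES is shown below; the class name, database name, and connection details are illustrative, not necessarily those used in the project.

import pymongo
from scrapy.exceptions import DropItem  # used by process_item above

class MongoDBPipeline(object):
    """Sketch of the pipeline class around process_item; connection details are assumptions."""

    def __init__(self):
        # Connection parameters are placeholders, not the project's real configuration.
        client = pymongo.MongoClient('localhost', 27017)
        self.db = client['zufang']

# Registered in settings.py so Scrapy routes every item through this pipeline:
# ITEM_PIPELINES = {'tczufang.pipelines.MongoDBPipeline': 300}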
(g) Data visualization design
Data visualization is essentially the conversion of the database contents into a form that users can observe easily. The system stores its data in MongoDB, and the visualization is built on Django + Semantic UI; the effect is shown in the following illustration:
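The view code itself is not included in this section; a minimal sketch of how a Django view might read the stored posts from MongoDB and hand them to a Semantic UI template could look like this (all names, connection details, and the template are hypothetical):

import pymongo
from django.shortcuts import render

def zufang_list(request):
    """Hypothetical view: pull stored posts from MongoDB and render them in a template."""
    # Connection details and template name are placeholders.
    db = pymongo.MongoClient('localhost', 27017)['zufang']
    posts = list(db['zufang_detail'].find().limit(100))  # collection written by the pipeline above
    return render(request, 'zufang_list.html', {'posts': posts})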