Consider a collection system with the following target: collect the data matching 300,000 keywords, with every microblog post collected within one hour, covering four microblog platforms (Sina Weibo, Tencent Weibo, NetEase Weibo, Sohu Weibo). To save the customer cost, the hardware is an ordinary server: E5200 dual-core 2.5 GHz CPU, 4 GB DDR3-1333 RAM, 500 GB SATA hard disk at 7,200 rpm. The database is MySQL. Can the system goal be achieved under these conditions? Of course, if better hardware is available, that is a different story and not what this article describes. The following explains how the goal was achieved, looking at collection and then storage:
First, collection. The goal is to pull down, within one hour, the data matching 300,000 keywords from the four microblog platforms, using only ordinary servers of the configuration above. The collection servers place little demand on the hard disk; the work is CPU-intensive and consumes some memory. Our evaluation was that hardware resources are not the bottleneck, so the real question is: what are the ways to obtain the data, and what problems does each one have?
1. Through the search APIs of the major microblog platforms. For example, the Sina Weibo API generally limits one server IP to 1w (10,000) requests per hour, or 4w per hour with the highest-level cooperation authorization. When relying on an application's user-based quota, you need enough users: each user of an application may make 1,000 requests per hour, so reaching the 4w-per-hour ceiling requires 40 users of your application. For as many as 30w keywords, at least 8 applications are required, and if each keyword needs 3 result pages, a total of 24 cooperation permissions must be applied for. In practice it is not feasible to obtain 24 cooperation permissions for this one project, so this approach is not appropriate. See the Sina Weibo API documentation for the rate limits.
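The quota arithmetic above can be checked with a few lines (a hypothetical helper, not part of the system): the article's count rounds the keywords up to whole applications first, then multiplies by the pages fetched per keyword.

```python
import math

KEYWORDS = 300_000            # 30w keywords to search each hour
PAGES_PER_KEYWORD = 3         # result pages fetched per keyword
QUOTA_PER_APP_HOUR = 40_000   # highest cooperation quota: 4w requests/hour

def permissions_needed(keywords: int, pages: int, quota: int) -> int:
    """Applications needed to cover the keywords alone, times pages each."""
    apps = math.ceil(keywords / quota)   # 300000 / 40000 -> 8 applications
    return apps * pages                  # 8 * 3 -> 24 cooperation permissions

print(permissions_needed(KEYWORDS, PAGES_PER_KEYWORD, QUOTA_PER_APP_HOUR))  # 24
```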
2. Through collecting the latest public microblogs. When posts first go out, the major platforms show them in a public "microblog square" (public timeline); we could collect that full stream, run word segmentation over it, keep any post containing one of the 300,000 keywords, and discard the rest. But nowadays only Tencent Weibo and Sohu Weibo still have a microblog-square feature; Sina Weibo and NetEase Weibo no longer do. And according to figures Sina Weibo released earlier, it has more than 500 million registered users and more than 100 million posts per day, so storing the full stream is a big test and consumes a lot of system resources: of 100 million posts collected, perhaps 1000w (10 million) would be useful, wasting the resources spent on the other 9000w.
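A minimal sketch of the keep-or-discard idea, using a plain substring match for clarity. This is an assumption, not the article's code: with 300,000 keywords a real system would use word segmentation and an efficient multi-pattern matcher (e.g. Aho-Corasick) rather than a linear scan, and the keyword set and posts here are made up.

```python
# Stand-in for the 300,000 tracked keywords.
TRACKED = {"手机", "数码", "camera"}

def matched_keywords(text: str, keywords: set) -> set:
    """Return the tracked keywords that appear in one post's text."""
    return {kw for kw in keywords if kw in text}

# Keep a post only if at least one tracked keyword matches; discard the rest.
posts = ["新买的手机很好用", "今天天气不错"]
kept = [p for p in posts if matched_keywords(p, TRACKED)]
print(kept)  # only the post mentioning 手机 survives
```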
3. Through the web search pages of the major microblog platforms: whatever is visible can be crawled. Combined with an anti-monitoring module that simulates the behaviour of normal human operation, we can search the 300,000 keywords directly, so every fetched post is useful and resources are used to the maximum. To guarantee collection within one hour, the crawl must be distributed and multithreaded, collecting concurrently. Concurrent requests must not come from the same IP, or even from the same IP subnet, to ensure that the target sites do not detect our crawler.
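The concurrent search can be sketched as follows. This is an assumed skeleton, not the production crawler: `fetch_search_page` is a placeholder for the real HTTP fetch, and the proxy list stands in for egress IPs on different subnets; a real worker would also randomize timing and headers as the anti-monitoring module does.

```python
import itertools
import queue
import threading

# Hypothetical egress proxies on different subnets, rotated per request.
PROXIES = ["10.1.0.2:8080", "10.2.0.2:8080", "10.3.0.2:8080"]
proxy_cycle = itertools.cycle(PROXIES)
proxy_lock = threading.Lock()

def fetch_search_page(keyword: str, proxy: str) -> str:
    # Placeholder for the real HTTP search request routed through `proxy`.
    return f"results[{keyword}]via[{proxy}]"

def worker(tasks: queue.Queue, results: list) -> None:
    """Pull keywords from the shared queue until it is empty."""
    while True:
        try:
            kw = tasks.get_nowait()
        except queue.Empty:
            return
        with proxy_lock:
            proxy = next(proxy_cycle)   # never two concurrent hits from one IP
        results.append(fetch_search_page(kw, proxy))

tasks: queue.Queue = queue.Queue()
for kw in ["kw1", "kw2", "kw3", "kw4"]:
    tasks.put(kw)
results: list = []
threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 4
```

In production the same worker pool would run on several machines, which is what makes the crawl distributed rather than merely multithreaded.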
We finally adopted the third way. In its current operating state, searching the 30w keywords brings in more than 1000w (10 million) posts per day across all four platforms, mostly from Sina Weibo and Tencent Weibo, with Sina slightly ahead. Using 6 ordinary PC servers at roughly 7,000 yuan each, a little over 40,000 yuan of hardware solves the collection side. The overall deployment diagram is:
Second, storage. How do we deal with the collected data? First, storing the collected data is a write-intensive operation: can an ordinary hard disk support it, can MySQL support it, and how do we cope with a sudden surge in volume? Then there is the question of storage space: such a large daily increment consumes a lot of storage resources, so how do we store the data in a way that is easy to expand?
1. How to store. On a server configured as above, a MySQL table with the MyISAM engine handles up to about 20w rows, and with the InnoDB engine up to about 400w rows; beyond that, query and update speed becomes very slow. Here we take a somewhat tricky approach: use MySQL's InnoDB storage engine as a cache layer, with two cache tables, each holding no more than 300w rows. When one table exceeds 300w rows, inserts switch to the other table, until that one in turn exceeds 300w and we switch back. After a switch succeeds, the table that exceeded 300w rows is truncated; make absolutely sure no data is still being inserted into it when truncating, to prevent data loss. You must use truncate here, not delete: delete requires a query plus index reads and writes, it writes the database log, which consumes disk IO, and it does not release the storage space. Truncate and drop are the right operations for clearing out a table's data. Since two tables take turns as the insert target, the table's own auto-increment ID is not suitable; a high-speed server generating unique, increasing distributed IDs is required. The database's transaction logging can also be relaxed to improve performance (for InnoDB, for example, lowering `innodb_flush_log_at_trx_commit`), because any crawled data that is lost can simply be crawled again, so the database can stay in a relatively high-performance mode for the insert workload. The crawl cache tables are shown in the figure:
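The double-buffered cache tables can be sketched in memory as follows. This is an illustration of the switching rule only, under the assumption that the real code runs against MySQL InnoDB tables and first syncs rows out to the result library before clearing; the class and names are invented for the sketch.

```python
ROW_CAP = 3_000_000  # 300w rows per cache table

class CacheTables:
    """Two insert tables that alternate: fill one, switch, then truncate it."""

    def __init__(self, cap: int = ROW_CAP):
        self.cap = cap
        self.tables = {"cache_a": [], "cache_b": []}
        self.active = "cache_a"

    def insert(self, row) -> None:
        self.tables[self.active].append(row)
        if len(self.tables[self.active]) > self.cap:
            full = self.active
            # Switch the writer first, so no insert can land in `full`
            # afterwards -- the "no data being inserted" precondition...
            self.active = "cache_b" if full == "cache_a" else "cache_a"
            # ...then it is safe to TRUNCATE the full table (all rows gone
            # at once, no row-by-row delete, no query/index/log overhead).
            self.tables[full].clear()

# Tiny cap to demonstrate the switch: the third row trips the cap,
# the writer moves to cache_b, and cache_a is truncated.
ct = CacheTables(cap=2)
for i in range(5):
    ct.insert(i)
print(ct.active, ct.tables)
```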
2. Storage space. The inserted data must be preserved; it cannot simply be thrown away when a cache table reaches 300w rows. So before a cache table is truncated, a program synchronizes its data into another library, which we call the result library (the result library also uses the InnoDB engine). But with more than 10 million new posts per day, growing day by day, a single MySQL table would burst within a day. The result library is not write-intensive the way the cache is, so its tables can hold more data; we set the upper limit at 500w rows per table. Yet whether a table holds 5 million or 10 million rows, the MySQL final result storage still has to be split across tables. The data is divided across machines by time, and then within a machine by data source: for example, 201301's data is hashed onto one machine and 201302's onto another. Within a machine, tables are split by day or half-day, with names like weibo_2013020101 and weibo_2013020112: weibo_2013020101 is the table for the morning of February 1, and weibo_2013020112 the table for the afternoon. Even this is not enough: 1000w/2 = 500w rows per half-day table sits right at the limit and cannot stand further pressure. So we split each table again by source, e.g. weibo_2013020101 becomes weibo_2013020101_1 (Sina Weibo), weibo_2013020101_2 (Tencent Weibo), weibo_2013020101_3 (NetEase Weibo) and weibo_2013020101_4 (Sohu Weibo). Each such table then stores on average 500w/4 = 125w rows, far below the 500w limit, with headroom for future bursts of growth. Calculating from the storage-space side: even at 1 KB per post, one day is 1000w × 1 KB = 10 GB, so a 500 GB disk holds at most 50 days of data; we therefore plan for machines that can take additional hard disks, or for adding machines. The result library's table partitioning is shown in the figure:
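The naming scheme above can be captured in a small routing helper. This is hypothetical code reconstructing the convention described (half-day marker 01/12 plus source id 1=Sina, 2=Tencent, 3=NetEase, 4=Sohu); everything else, such as the function and dictionary names, is assumed.

```python
from datetime import datetime

# Source ids as described in the table-splitting scheme.
SOURCE_IDS = {"sina": 1, "tencent": 2, "netease": 3, "sohu": 4}

def result_table(ts: datetime, source: str) -> str:
    """Route a post to its result-library table: weibo_<YYYYMMDD><01|12>_<src>."""
    half_day = "01" if ts.hour < 12 else "12"   # morning vs afternoon table
    return f"weibo_{ts:%Y%m%d}{half_day}_{SOURCE_IDS[source]}"

print(result_table(datetime(2013, 2, 1, 9, 30), "sina"))     # weibo_2013020101_1
print(result_table(datetime(2013, 2, 1, 15, 0), "tencent"))  # weibo_2013020112_2
```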
With this architecture, using free open-source software and low-cost servers, we built a tens-of-millions-scale data collection system that runs well in production.