A project in our lab collects Amazon product data, covering both single products and product variants. It began as a demo: a stand-alone Java collection program. Later, at our supervisor's request, we turned it into a cloud-based distributed collector. The lab provides the collection machines, collection runs multi-threaded across multiple machines, and users only need to configure the URLs to collect in the front end. They do not have to keep a machine of their own running, so in effect we provide users with a cloud collection service.
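As a rough illustration of the multi-threaded part of this setup, the sketch below shows how one worker machine might process a batch of user-configured URLs on a thread pool. It is a minimal sketch, not the project's actual code: the class name `CollectorSketch`, the method `collectAll`, and the injected `fetcher` function are all hypothetical, and the distribution of URL batches across machines is left out entirely.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical sketch: one worker node collecting a batch of configured URLs in parallel. */
public class CollectorSketch {

    /**
     * Runs the fetcher for every URL on a fixed-size thread pool and returns
     * url -> result in the order the URLs were given. The fetcher is injected,
     * so this sketch assumes nothing about how pages are downloaded or parsed.
     */
    public static Map<String, String> collectAll(List<String> urls,
                                                 Function<String, String> fetcher,
                                                 int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per URL; the pool limits how many run at once.
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                futures.add(pool.submit(task));
            }
            // Gather results in submission order; a failed URL is recorded, not fatal.
            Map<String, String> results = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                try {
                    results.put(urls.get(i), futures.get(i).get());
                } catch (ExecutionException e) {
                    results.put(urls.get(i), "ERROR: " + e.getCause().getMessage());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

In a distributed deployment, each machine would presumably run something like `collectAll` on its own share of the URL list, with a central queue handing out the shares; that coordination layer is an assumption here and is not shown.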
The project team ran into many technical difficulties during implementation, including building the distributed architecture, designing the collection workflow, recognizing Amazon's verification codes, and optimizing the code. After repeated study and a few months of testing, the main functions are basically in place, collection efficiency is satisfactory, and the customer is very satisfied. At its peak, the number of acquisitions in a single day reached nearly 20 million, which we had not expected.
The work so far has mainly been about providing users with a cloud collection service. Now there is a new requirement: can we write a stand-alone version of the Amazon collector that is handed directly to users to run themselves, while still being restricted by permissions from our server? The initial plan is to build a Java desktop application with JavaFX, whose core collection process is exactly the same as in the earlier distributed collector.
Another idea is to follow the way the Crawl Alliance crowdsources the collection of Sina Weibo: we would assign collection tasks to users, and their machines would accept and run those tasks, so that the data is collected by crowdsourcing. However, this plan does not fit our present needs, so we are not considering it for now.
Amazon Cloud Platform acquisition and single-machine acquisition implementation