Amazon E-commerce Data Analysis: Data Acquisition

Recently, the main focus has been on Amazon e-commerce data analysis. This is a data analysis and visualization project: we first obtain Amazon product data, clean it and store it persistently, and then use it as our own data source. The analysis and visualization modules then perform a series of operations on that data.

Obviously, the most basic and important part of the entire project is acquiring this initial data. This article briefly introduces and summarizes the data acquisition and cleaning process.

Throughout the project we used Python as the development language, and the visualization module was built on Django. The data acquisition part, that is, the crawler module, is also written in Python.

For the crawler module, the requirement is fixed and so is the crawling target: amazon.com. The crawler module therefore mainly needs to address scheduling, page parsing, and process automation.

First, the overall architecture of the crawler. At the beginning we crawled from a single machine and started tasks in the simplest way, from the command line. The problems were significant: it was unstable, and every task had to be started manually. We then deployed the crawler as a service. An incoming task, whether submitted by a user or generated internally, is added to a service queue through a submission system; the crawler service detects it and starts the crawling task. This solved the instability to a certain extent. However, as the data volume grew and bandwidth became the bottleneck, we added machines and built a small crawler cluster for distributed crawling. Strictly speaking, though, it is not truly distributed, since there is no real node coordination.
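As a rough sketch of the submission path described above (the queue backend, key name, and task fields here are assumptions; the post does not name the actual components), a submission system might push a task onto a shared queue that the crawler service polls:

```python
import json
import redis  # assumed queue backend; the original post does not name one

# Hypothetical shared queue between the submission system and the crawler service.
QUEUE_KEY = "crawler:service_queue"
r = redis.Redis(host="localhost", port=6379, db=0)

def submit_task(name, keywords, priority=0):
    """Submission side: enqueue a crawl task for the crawler service."""
    task = {"name": name, "keywords": keywords, "priority": priority}
    r.rpush(QUEUE_KEY, json.dumps(task))

def crawler_service_loop():
    """Crawler-service side: block until a task arrives, then start the crawl."""
    while True:
        _, raw = r.blpop(QUEUE_KEY)   # wait for the next submitted task
        task = json.loads(raw)
        start_crawl(task)             # hypothetical hook that launches a crawl job

def start_crawl(task):
    print(f"starting crawl for task {task['name']}")
```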

 

In the concrete implementation we adopted Scrapy, an open-source Python crawler framework that provides the basic scheduling and page-fetching functionality. What we need to do on top of it is define the DOM parsing logic and the subsequent data processing and storage scheme, and build a small pseudo-distributed crawling system on this basis.
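A minimal sketch of what sits on top of Scrapy in this setup: the framework handles scheduling and fetching, while we supply the DOM parsing and the item to store. The spider name, start URL, and CSS selectors below are illustrative assumptions; Amazon's real page structure differs and changes over time.

```python
import scrapy

class AmazonProductSpider(scrapy.Spider):
    """Illustrative spider: Scrapy handles scheduling and downloading,
    the project code only supplies parsing and the item to persist."""
    name = "amazon_product"
    # Placeholder start URL; real tasks would fill in product pages or search keywords.
    start_urls = ["https://www.amazon.com/dp/EXAMPLEASIN"]

    def parse(self, response):
        # Selectors here are placeholders; the real ones must be derived from the
        # raw HTML the crawler receives, not from the browser-rendered DOM.
        yield {
            "title": response.css("#productTitle::text").get(default="").strip(),
            "price": response.css(".a-price .a-offscreen::text").get(),
            "url": response.url,
        }
```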

 

The following describes several issues that need to be considered in the crawler design process.

 

First, page parsing. Because of asynchronous JS loading, the DOM elements the crawler actually receives differ somewhat from the DOM you see when opening the page in a browser, so you cannot rely solely on the browser to locate specific DOM elements.

Second, blocking. When the access volume, and especially the number of concurrent requests, reaches a certain level, Amazon blocks the requests and returns a robot-check page or even a 500 server error. One mitigation is to reduce concurrency. In our tests, sending more than 50 requests per second triggers 500 server errors (Amazon may keep updating this policy), so we set the concurrency to 32, that is, one machine sends at most 32 requests per second. Handling the robot check is still at the exploration stage, because such pages appear infrequently and are concentrated on the product information page, which rarely needs updating since its content is fixed; currently a proxy is added as a downloader middleware. In addition, since frequent access from within China may contribute to these problems, we are migrating the crawlers to Amazon EC2, which is more stable and gives faster access than domestic machines.
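For the throttling and proxy measures mentioned above, the relevant knobs are ordinary Scrapy settings and a downloader middleware. The concurrency value comes from the text; the middleware module path and proxy address are placeholders, not the project's actual configuration.

```python
# settings.py (sketch): cap concurrency at the level mentioned above.
CONCURRENT_REQUESTS = 32          # at most 32 requests in flight per machine
AUTOTHROTTLE_ENABLED = True       # optional: back off automatically when responses slow down
RETRY_HTTP_CODES = [500, 503]     # retry the errors Amazon returns when it blocks requests

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ProxyMiddleware": 543,   # hypothetical module path
}

# middlewares.py (sketch): route requests through a proxy, as described above.
class ProxyMiddleware:
    PROXY = "http://127.0.0.1:8888"   # placeholder proxy address

    def process_request(self, request, spider):
        request.meta["proxy"] = self.PROXY
```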

 

The second issue is scheduling. It does not exist for a single-machine, single-task crawler, but it matters a great deal when there are multiple machines and multiple tasks. How should multiple submitted tasks be scheduled? If a task carries a priority, it is scheduled by priority; otherwise it is placed in the task queue in the default order. In our system, the crawler machines are controlled by a single scheduler. When a crawl task is submitted, the scheduler splits it and distributes the pieces to different crawler machines. Each machine has its own crawl queue, from which all tasks distributed to that machine are taken by default, and a single machine can run six crawl tasks in parallel.
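A rough sketch of the split-and-distribute step described above. The machine names and chunk size are assumptions; the six-task limit per machine comes from the text.

```python
from collections import defaultdict
from itertools import cycle

MACHINES = ["crawler-1", "crawler-2", "crawler-3"]   # hypothetical machine names
MAX_PARALLEL_PER_MACHINE = 6                         # one machine runs six crawls at once

def split_task(task_urls, chunk_size=100):
    """Split one submitted task into smaller chunks of URLs."""
    for i in range(0, len(task_urls), chunk_size):
        yield task_urls[i:i + chunk_size]

def distribute(task_urls):
    """Round-robin the chunks onto per-machine crawl queues."""
    queues = defaultdict(list)
    machines = cycle(MACHINES)
    for chunk in split_task(task_urls):
        queues[next(machines)].append(chunk)
    return queues

# On each machine, a worker would pop chunks from its own queue and keep at most
# MAX_PARALLEL_PER_MACHINE crawls running at the same time.
```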

 

Finally, process automation. In our system, once a task is submitted, the processed data is ultimately displayed on the website as visual charts, which requires the entire pipeline to be automated: after a task is submitted, crawling starts; after crawling finishes, the data is processed and merged to generate statistics, then normalized and displayed visually on the front end. The pipeline is divided into two phases, crawling and processing. In the crawling phase, after a task is submitted and distributed, the crawler starts the crawl; when the crawl completes, the Scrapy interface is used to update the task status. For example, when a task T starts, a record such as {'name': T, 'jobid': 'task_id', 'status': 'running'} is written to the database, and when the crawl finishes its status is changed to finished. A scheduled script polls the database to check whether tasks have completed; if so, it starts the data processing stage. After processing, the data is merged into the main project database, which completes the front-end visualization. Because the project is deployed on a Linux server, the polling and execution of these scripts are driven directly by cron: we write a few crontab entries and start the cron job, and these scripts connect the steps above into a complete pipeline.
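A sketch of the cron-driven polling described above, assuming the task records look like the {'name': ..., 'jobid': ..., 'status': ...} documents mentioned in the text and are stored in MongoDB; the database, collection name, schedule, and processing hook are all assumptions.

```python
#!/usr/bin/env python
"""Cron-driven poller (sketch): find finished crawl tasks and start processing."""
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
tasks = client["crawler"]["tasks"]   # assumed database and collection names

def process_task(task):
    # Placeholder for the real processing/merge step described in the text.
    print(f"processing data for task {task['name']} (job {task['jobid']})")

def poll_once():
    for task in tasks.find({"status": "finished"}):
        process_task(task)
        # Mark the task so the next cron run does not pick it up again.
        tasks.update_one({"_id": task["_id"]}, {"$set": {"status": "processed"}})

if __name__ == "__main__":
    poll_once()

# Example crontab entry (assumed schedule): run the poller every five minutes.
# */5 * * * * /usr/bin/python /opt/crawler/poll_tasks.py
```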

 

The above covers the major issues in the data crawling process. To be honest, even the current system has not solved them perfectly: page parsing still runs into Amazon's blocking, and the robustness of the automation is weak. These are problems to be considered in the next stage. In addition, the crawled and stored data still contains a lot of dirty data and requires further cleaning.
