Raymie Stata is co-founder and CEO of Altiscale, a Hadoop-as-a-service company, and a former CTO of Yahoo, where he helped drive the company's open source strategy and was involved in launching the Apache Hadoop project. Scaling and operating Hadoop are complex undertakings, and the implementation hides potential crises. Drawing on experience, Raymie lists seven crisis signals and corresponding remedies to help users head off disaster in advance.
The following is the translation:
Scaling Hadoop is a complex process. Here are seven common problems and their solutions.
Every Hadoop implementation carries potential crises, including some very tricky operational issues. Problems of this kind can cause Hadoop to be abandoned before it ever reaches production, or turn into a "successful disaster" (in practice, more likely a pure catastrophe) once it does.
Scaling and implementing Hadoop is complex, but if you know exactly where the root causes of problems lie, you can head off a "disaster." Here are some crisis signals drawn from experience.
Crisis signal 1: Unable to go into production
Moving from proof of concept to production use is a major step in big data workflows. Scaling Hadoop is full of challenges: larger workloads often cannot finish on time, and the test environment rarely covers the real operating environment. Test data is a common example: proofs of concept often use unrealistically small or homogeneous data sets.
Before going into production, run scale and stress tests. Applications that pass such tests are demonstrably scalable and fault tolerant, and the test results help you build your own capacity planning model.
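To make the idea of a capacity planning model concrete, here is a minimal sketch in Python. The figures (per-node disk, headroom, daily ingest) are hypothetical placeholders rather than numbers from the article; only the replication factor of 3 reflects the HDFS default.

```python
import math

# Minimal capacity-planning sketch; the figures below are assumptions for illustration.
REPLICATION_FACTOR = 3        # HDFS default block replication
DISK_PER_NODE_TB = 36.0       # usable disk per worker node (assumed)
HEADROOM = 0.75               # keep 25% of disk free for shuffle and temp space

def nodes_needed(daily_ingest_tb: float, retention_days: int) -> int:
    """Rough worker-node count to store retention_days of data at daily_ingest_tb per day."""
    raw_tb = daily_ingest_tb * retention_days * REPLICATION_FACTOR
    usable_per_node_tb = DISK_PER_NODE_TB * HEADROOM
    return math.ceil(raw_tb / usable_per_node_tb)

if __name__ == "__main__":
    # Example: 2 TB/day of new data kept for 13 months (~395 days)
    print(nodes_needed(daily_ingest_tb=2.0, retention_days=395))
```

A model like this is only a starting point; the scale and stress tests described above supply the real numbers to feed into it.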
Crisis signal 2: Jobs start running late
When the first application goes into production, meeting SLAs is easy, but as load on the Hadoop cluster grows, runtimes become unpredictable. The first delay is easy to overlook, yet it gets worse and worse over time and eventually turns into a crisis.
Do not wait for the crisis before acting. Before capacity becomes the constraint, expand the cluster or optimize workloads as appropriate. Revisit the capacity model you projected, paying particular attention to worst-case performance, so that it reflects reality more closely.
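One way to act before runtimes become a crisis is to watch for drift toward the SLA. The following is a hypothetical check, not something from the article: it flags a job whose recent runtimes are creeping toward an agreed deadline.

```python
import math

# Hypothetical SLA-drift check; runtimes and thresholds are illustrative only.

def p95(values):
    """95th percentile by the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def sla_at_risk(runtimes_minutes, sla_minutes, warn_fraction=0.8):
    """True when the p95 runtime exceeds warn_fraction of the SLA."""
    return p95(runtimes_minutes) >= warn_fraction * sla_minutes

if __name__ == "__main__":
    last_two_weeks = [41, 44, 43, 47, 52, 55, 58, 61, 60, 63, 66, 70, 68, 72]
    if sla_at_risk(last_two_weeks, sla_minutes=80):   # 80-minute deadline (assumed)
        print("Runtimes are drifting toward the SLA: expand capacity or optimize now.")
```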
Crisis signal 3: You start telling customers you cannot keep all the data
Another symptom of a crisis is shrinking data retention. At first you wanted to keep 13 months of data for year-over-year analysis, but space constraints push you to shorten the retention window, which amounts to giving up much of the benefit of big data analytics on Hadoop.
Reducing retention does not solve the underlying problem. To avoid it, act early: re-examine the capacity model, find where it went wrong, and adjust it so that it tracks reality better.
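Acting early is easier when you know roughly when space will run out. Here is a hypothetical sketch (the names and figures are illustrative, not from the article) that projects how many days remain before HDFS usage hits capacity at the current growth rate.

```python
# Hypothetical HDFS-exhaustion forecast; all figures are illustrative.

def days_until_full(used_tb: float, capacity_tb: float, daily_growth_tb: float,
                    safety_margin: float = 0.10) -> int:
    """Days left before usage reaches capacity minus a safety margin."""
    usable_tb = capacity_tb * (1.0 - safety_margin)
    remaining_tb = usable_tb - used_tb
    if remaining_tb <= 0:
        return 0
    return int(remaining_tb / daily_growth_tb)

if __name__ == "__main__":
    # Example: 1,600 TB used on a 2,000 TB cluster, growing 6 TB/day after replication
    days = days_until_full(used_tb=1600, capacity_tb=2000, daily_growth_tb=6)
    print(f"Roughly {days} days before the retention window would have to shrink.")
```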
Crisis signal 4: Data scientists are starved of resources
An over-subscribed Hadoop cluster stifles innovation, leaving data scientists without the resources to run large jobs and without the space to store sizable intermediate results.
Capacity planning frequently overlooks data scientists: it accounts only for production workloads, which leaves data scientists marginalized. Make sure your capacity requirements include their needs, and get ahead of the capacity issue early.
Crisis signal 5: Data scientists solve problems with Stack Overflow
In the early days of a Hadoop implementation, the operations team and the data scientists work closely together. As the implementation succeeds, maintenance pressure on the operations team grows, and data scientists are left to solve Hadoop problems on their own, often by searching Stack Overflow for answers.
As Hadoop expands and mission-critical workloads are added, the maintenance burden grows. If you want data scientists to stay focused on data research, you need to resize your operations team accordingly.
Crisis signal 6: Server temperatures rise
When provisioning power for servers, we often assume they will not all run at full load, but a large Hadoop job can keep every server pinned for hours, putting serious strain on your power infrastructure (cooling faces a similar issue). Make sure your Hadoop cluster can run at full power for extended periods.
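The gap between the assumed and worst-case draw is easy to put into numbers. The following back-of-the-envelope check uses assumed wattages and rack sizes purely for illustration.

```python
# Hypothetical rack power check; every figure here is an assumption.

NODES_PER_RACK = 20
PLANNED_WATTS_PER_NODE = 150     # typical "average load" planning assumption
FULL_LOAD_WATTS_PER_NODE = 450   # sustained draw during a large Hadoop job
RACK_CIRCUIT_WATTS = 8000        # what the rack's circuit was provisioned for

planned_draw = NODES_PER_RACK * PLANNED_WATTS_PER_NODE
worst_case_draw = NODES_PER_RACK * FULL_LOAD_WATTS_PER_NODE

print(f"Planned draw:    {planned_draw} W")
print(f"Worst-case draw: {worst_case_draw} W")
if worst_case_draw > RACK_CIRCUIT_WATTS:
    print("A long full-cluster job would exceed this rack's power budget.")
```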
Crisis signal 7: Expenses out of control
In a Hadoop environment deployed on IaaS, the number one "successful disaster" is runaway spending: you suddenly find the bill is three times last month's, blowing well past the budget.
Capacity planning is a significant part of an IaaS-based Hadoop implementation, not only for managing capacity but also for managing cost. Good capacity planning is only the beginning, though: if you want to scale an IaaS-based Hadoop deployment, it is worth investing heavily, as Netflix does, in systems that track and optimize costs.
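Cost tracking does not have to start with a Netflix-scale system; even a simple month-end projection catches a runaway bill early. The sketch below is hypothetical, with illustrative budget and spend figures.

```python
# Hypothetical month-end cost projection for an IaaS-hosted Hadoop cluster.
import calendar
from datetime import date

def projected_month_spend(spend_to_date: float, today: date) -> float:
    """Linear projection of the month's total spend from month-to-date spend."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month

if __name__ == "__main__":
    MONTHLY_BUDGET = 40_000                  # dollars (assumed)
    spend_so_far = 22_500                    # dollars billed so far this month (assumed)
    projection = projected_month_spend(spend_so_far, today=date(2024, 5, 12))
    print(f"Projected month-end spend: ${projection:,.0f}")
    if projection > MONTHLY_BUDGET:
        print("Projection exceeds budget: investigate cluster usage and job costs now.")
```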
Scaling Hadoop smoothly
Hadoop plans often underestimate the work required to keep a cluster running smoothly, and the miscalculation is understandable: for traditional enterprise applications, the cost of the initial implementation far outweighs subsequent maintenance and support, and people mistakenly assume Hadoop follows the same pattern. In fact, Hadoop is hard to maintain and demands a great deal of ongoing work.
Good capacity planning is essential, and a good capacity model must be kept up to date so it does not drift away from real-world conditions. Do not let innovation become an afterthought: give data scientists enough support. Expansion is not the only answer; managing usage matters just as much, and with a modest amount of job optimization, users (and business owners) can reduce existing costs.