Most enterprise big data application cases are still in the experimental and pilot phase. For the few users who were first to deploy Hadoop systems in production, the most common problem is scaling. Scaling problems often make a project cost more than it returns, and lead enterprises to terminate their big data application projects.
Deploying and scaling a Hadoop system is highly complex. If users understand early the problems and danger signals that scaling Hadoop may bring, they can avoid many "firefighting" scenes.
The following are seven danger signals for scaling a Hadoop big data system, as summarized by Raymie Stata of Altiscale:
Danger signal One: Never reaching the production stage
Moving a big data application from proof of concept to production is a huge leap, and the scalability of the Hadoop system faces great challenges along the way. Some problems only appear at production data scale and are hard to reproduce in an experimental environment. The data itself also differs: the test datasets used in the proof-of-concept phase are often unrealistic or of a single type.
Before entering production, big data teams need to test the Hadoop system at realistic data scale. This tests the scalability and fault tolerance of the big data applications, and also helps you build a more accurate performance (resource requirements) planning model.
Danger signal Two: Analysis jobs keep timing out
When a Hadoop cluster runs only one or a few big data applications, everything flows smoothly and on schedule. But as the Hadoop cluster and its workload grow, the running time of data analysis jobs becomes hard to predict. At first the timeouts are sporadic and easy to overlook, but over time they become more serious and eventually lead to a crisis.
Act before the crisis arrives: adjust your performance planning model to account for peak workloads.
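As a rough illustration of why planning for peak load matters, the sketch below compares a cluster sized for average demand with one sized for peak demand. The function name `nodes_needed`, the hourly demand figures, the 16 containers per node, and the 20% headroom are all hypothetical values chosen for the example, not numbers from a real cluster.

```python
import math

def nodes_needed(demand_container_hours, containers_per_node, headroom=0.2):
    """Nodes required to serve one hour of demand while keeping spare headroom.

    All parameters are illustrative assumptions, not Hadoop settings.
    """
    usable_per_node = containers_per_node * (1.0 - headroom)
    return math.ceil(demand_container_hours / usable_per_node)

# Hypothetical demand: container-hours requested in each hour of one day.
hourly_demand = [120, 110, 100, 100, 130, 200, 420, 640, 700, 650, 500, 400,
                 380, 390, 410, 450, 520, 600, 480, 300, 220, 180, 150, 130]

average_demand = sum(hourly_demand) / len(hourly_demand)
peak_demand = max(hourly_demand)

print("nodes sized for average:", nodes_needed(average_demand, containers_per_node=16))
print("nodes sized for peak:   ", nodes_needed(peak_demand, containers_per_node=16))
```

A cluster sized for the average looks adequate most of the day, but jobs submitted during the busiest hours queue up, which is one way sporadic timeouts begin.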
Danger signal Three: You start telling people not to keep all the data
Another symptom of the crisis is a shrinking data retention window. At first you want to keep 13 months of data for year-over-year analysis. But because of space constraints, you start reducing the number of months of data you keep. In the end, your Hadoop system is no longer a "big data" system, because it does not hold enough data.
The data retention window shrinks because of storage scalability problems, similar to the compute performance problems described earlier. When your capacity prediction model turns out to be wrong, adjust it as quickly as possible.
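The link between storage capacity and the retention window can be sketched with simple arithmetic. The example below assumes HDFS-style replication with a factor of 3 (the HDFS default) and a hypothetical 25% allowance for temporary and intermediate data; the function names and the ingest figures are illustrative assumptions.

```python
def raw_storage_tb(daily_ingest_tb, retention_days, replication=3, temp_overhead=0.25):
    """Raw disk (TB) needed to retain `retention_days` of data.

    `temp_overhead` is an assumed allowance for intermediate results.
    """
    logical_tb = daily_ingest_tb * retention_days
    return logical_tb * replication * (1 + temp_overhead)

def retention_days_supported(raw_capacity_tb, daily_ingest_tb, replication=3, temp_overhead=0.25):
    """Whole days of data that fit into a fixed amount of raw disk."""
    raw_per_day = daily_ingest_tb * replication * (1 + temp_overhead)
    return int(raw_capacity_tb // raw_per_day)

# Hypothetical cluster: sized for roughly 13 months (395 days) at 0.5 TB/day.
capacity = raw_storage_tb(daily_ingest_tb=0.5, retention_days=395)
print("raw capacity planned (TB):", capacity)

# If daily ingest later doubles, the same disks hold far fewer days.
print("days retained after ingest doubles:",
      retention_days_supported(capacity, daily_ingest_tb=1.0))
```

The second figure shows how ingest growth alone, with no change in hardware, forces the retention window to shrink.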
Danger signal Four: Data scientists are "starving"
Overloaded Hadoop clusters stifle innovation, because data scientists do not have enough computing resources to run large jobs or enough space to store intermediate results.
Performance and capacity planning often ignores or underestimates the needs of data scientists. Combined with the underestimation of production workloads described above, this severely limits their exploratory and innovative work.
Danger signal Five: Data scientists start looking at Stack Overflow
In the early days of a Hadoop deployment, your operations team works closely with the data scientists and supports them at any time. (Editor's note: a similar in-line collaboration model.) But once the Hadoop system goes live successfully, operating and scaling it keeps the operations team constantly busy. Data scientists then have to solve their Hadoop problems on their own, for example by frequently turning to the technical Q&A site Stack Overflow.
Danger signal Six: The data center is getting hotter
Power in a data center is usually not provisioned for the peak power draw of its servers. But when a Hadoop cluster runs jobs, it often runs at full load for hours, which can burn out power circuits that are not rated for that load. The same problem exists in the cooling system. When you deploy a Hadoop system, make sure the data center can support long periods of running at full load.
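A quick way to check this before deployment is to compare the sustained full-load draw of a rack against what its circuit can carry continuously. The sketch below is only an assumption-laden illustration: the function name, the wattage figures, and the 80% continuous-load derating factor are hypothetical and should be replaced with values from your own facility.

```python
def circuit_headroom_watts(servers, peak_watts_per_server, circuit_capacity_watts, derate=0.8):
    """Watts left on a circuit when every server runs at peak draw.

    `derate` is an assumed fraction of rated capacity allowed for
    continuous load. A negative result means the circuit is overcommitted.
    """
    sustained_limit = circuit_capacity_watts * derate
    full_load = servers * peak_watts_per_server
    return sustained_limit - full_load

# Hypothetical rack: 20 servers at 450 W peak on a circuit rated for 10 kW.
headroom = circuit_headroom_watts(20, 450, 10_000)
print("headroom at sustained peak (W):", headroom)
if headroom < 0:
    print("circuit cannot sustain the cluster at full load")
```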
Danger signal Seven: Cost overruns
Spending on Hadoop deployments that run on IaaS, such as AWS, can easily get out of control. One month's bill may be three times the previous month's, well above your budget.
Performance planning is just as important for Hadoop deployments on IaaS, but good performance planning is only the beginning. If you need to scale a Hadoop system on IaaS, you need to follow the example of Netflix and invest heavily in cost monitoring and optimization systems.
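At its simplest, cost monitoring means checking each bill against the budget and against the previous month. The sketch below is a minimal, hypothetical example of that check; the function name, the threshold, and the billing figures are assumptions, and a real system would pull its figures from the provider's billing data rather than a hard-coded list.

```python
def flag_cost_spikes(monthly_costs, budget, max_growth=1.5):
    """Return (month, reason) alerts for over-budget or fast-growing bills.

    `monthly_costs` is a list of (month, cost) pairs in chronological order.
    `max_growth` is an assumed month-over-month ratio that counts as a spike.
    """
    alerts = []
    for i, (month, cost) in enumerate(monthly_costs):
        if cost > budget:
            alerts.append((month, "over budget"))
        if i > 0:
            previous = monthly_costs[i - 1][1]
            if previous > 0 and cost / previous >= max_growth:
                alerts.append((month, "spike"))
    return alerts

# Hypothetical bills in dollars.
bills = [("Jan", 10_000.0), ("Feb", 11_000.0), ("Mar", 33_000.0)]
for month, reason in flag_cost_spikes(bills, budget=20_000.0):
    print(month, reason)
```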