After relentless coverage by authoritative media outlets, most of us have come to believe that data scientist is the most mysterious and "sexiest" career of the 21st century. Data scientists are cast as the bomb-disposal experts of the big data age and the engines of digital business; they are said to be worth as much as an NFL quarterback and to be rarer than the snow leopards of the Kunlun Mountains.
Data scientists may look like masters of every data-analysis martial art, but lately they have had worries of their own. Not long ago, Paradigm4, developer of the open-source database SciDB, surveyed 111 North American data scientists and found that 71% of them believe that the variety of data sources, rather than the volume of data, poses the greatest threat and challenge to their work. (An IT Manager Network reporter had previously discussed the biggest challenge of machine learning and big data analysis with Lei Ming, CEO of Kuwo Music and one of Baidu's founding "Seven Musketeers"; he likewise pointed to data dimensionality rather than data volume.)
Notably, only 48% of respondents said they had used Hadoop or Spark at work, and as many as 76% of data scientists complained about Hadoop's slow speed, slow programming, and other limitations.
Although Hadoop has a poor reputation, nearly half of the data scientists said that it is hard to fit their data into traditional relational database tables. Jean-Paul Smets, chief executive of Nexedi, has also said in an interview that the real problem with big data is not that it is "big" but that the industry lacks software able to process data with efficient distributed algorithms. Hadoop, he argued, depends too heavily on Java, and Java is tightly controlled by Oracle. The rise of the "de-IOE" movement in China (moving away from IBM, Oracle, and EMC) has in fact created a major opportunity for big data software solutions other than Hadoop.
Enterprise data enters the complex analysis stage
According to the report, 59% of data scientists said that their businesses have already begun to use more sophisticated analytical techniques, such as clustering, machine learning, principal component analysis, and graph analysis, instead of being limited to traditional BI reporting.
Another 15% said they plan to adopt complex analytical techniques within the next year, and a further 16% said they will do so within the next two years.
Hadoop is overhyped
Paradigm4's report argues that Hadoop has been oversold as an all-purpose, revolutionary big data solution, and that it is not really suited to big data scenarios that require complex analysis.
Hadoop's core technical approach is data parallelism, which Paradigm4 calls "embarrassingly parallel" processing: each chunk of data is handled independently of the others. Complex analytics, by contrast, often needs to access, process, and share data across those chunks and to exchange intermediate results during processing, and this is precisely the Achilles' heel of Hadoop MapReduce, the report says.
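The distinction can be illustrated with a minimal sketch in plain Python. It does not use the Hadoop API, and it is not taken from the Paradigm4 report; the partitions, the function names, and the choice of a one-dimensional k-means as the "complex" workload are all assumptions made for illustration. The first computation touches each partition once and combines the results at the end, while the second must merge intermediate results from every partition on each pass before the next pass can begin.

```python
# Hypothetical data set, split into three partitions as if stored on three nodes.
partitions = [[1.0, 2.0, 3.0], [10.0, 11.0], [12.0, 20.0, 21.0]]


def partition_sum_of_squares(part):
    """Embarrassingly parallel work: uses only the partition's own data."""
    return sum(x * x for x in part)


# Each partition is processed independently; results are combined once at the end.
total = sum(partition_sum_of_squares(p) for p in partitions)


def kmeans_1d(partitions, centroids, iterations):
    """Iterative analysis: every pass depends on centroids merged from ALL partitions."""
    for _ in range(iterations):
        partials = []
        for part in partitions:
            # Local step: per-partition sums and counts for each centroid.
            sums = [0.0] * len(centroids)
            counts = [0] * len(centroids)
            for x in part:
                nearest = min(range(len(centroids)),
                              key=lambda i: abs(x - centroids[i]))
                sums[nearest] += x
                counts[nearest] += 1
            partials.append((sums, counts))
        # Synchronization step: intermediate results from every partition
        # must be exchanged and merged before the next pass can start.
        merged = []
        for i in range(len(centroids)):
            s = sum(p[0][i] for p in partials)
            c = sum(p[1][i] for p in partials)
            merged.append(s / c if c else centroids[i])
        centroids = merged
    return centroids


print(total)
print(kmeans_1d(partitions, [0.0, 25.0], iterations=5))
```

In a MapReduce-style system, each pass of the second function would typically be a separate job, with the merged centroids written out and read back between jobs. That repeated exchange of intermediate results is the kind of overhead the report describes.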
22% of the data scientists surveyed said that Hadoop and Spark are not suited to their analytical tasks at all, and 35% said they stopped using the technologies after trying Hadoop or Spark.
Summary:
Fast data and machine learning are now the main trends in big data. Hadoop is an open-source system that enterprise users often install on their own, and tuning its performance has a considerable technical threshold. In fact, Hadoop is not as bad as the data scientists' feedback suggests, and it can be made to run fast. For example, the veteran vendor Cray offers a Hadoop solution with tuned software and hardware plus ongoing technical support. Its tested performance is reported to be many times higher, which largely addresses Hadoop's performance problem.
Some highlights of Paradigm4's data scientist survey report are condensed into the following infographic for interested readers to explore: