4 Pitfalls of Data Science for Big Data Practitioners

This article summarizes the pitfalls and defects that data practitioners may encounter in practice. As in other young industries, data science practitioners need to keep thinking about both the present and the future, and to keep questioning whether their working methods are reasonable and correct. Only by reflecting on these questions can the field move forward.

I recently read in the news that data science has become one of the most competitive majors in Peking University's college entrance admissions. In fact, the term "data science" has been fashionable for almost ten years, and in the Internet industry it seems to have been fashionable forever.

From "data speaking", "DT era", to "data center", "data drive (Data Drive/Data Driven)", the continuous evolution of the data system is continuously changing everyone's work and decision-making methods; continuous innovation Everyone’s way of thinking; at the same time, new business logic and new development opportunities have been created.

In 1976, Niklaus Wirth, the creator of Pascal, wrote: Algorithms + Data Structures = Programs.

Like "SOA" and "cloud computing" before it, the concept of data science is still constantly changing. Practitioners at various companies are exploring while profiting, and summarizing while preaching; plenty of people joining in the excitement have made the concept even vaguer. As a result, the competence boundaries, methodology, and best practices of data science have not been fully established, and many questions cannot be answered well. This breeds superstitions and misunderstandings: "forcing data into everything", "data for its own sake", "politically correct data", and so on are common. There are major misunderstandings at both the operational level and the methodological level. That is why I want to summarize the pitfalls and shortcomings in the practice of data science.

This article is based on my own work experience and conversations with senior colleagues in the field. Its correctness is limited by my personal knowledge and by the industry's current stage of development. It sorts out some problems that may exist today, but they may not hold true forever, so please read critically. If you have different views, feel free to discuss them with me; the conclusions can be updated at any time.

Pitfall 1: Data quality kills automatic/intelligent decision-making

For many of NetEase Yanxuan's businesses, such as risk control, the core driving force is data and algorithms. When we started the risk control business, we established a data- and algorithm-driven methodology, which allows a small team (three people) to support dozens of internal and external risk scenarios for Yanxuan and execute millions of risk decisions every day. This is the power of data-driven automatic/intelligent decision-making. The thrill of such success may make you eager to transform many other business operations, but unfortunately, without data quality assurance all of this is a castle in the air that can collapse at any moment. In fact, most organizations' grasp of data quality cannot support broader automatic and intelligent decision-making scenarios; forcing the transformation through and cutting headcount to "increase efficiency" will push their originally stable business toward collapse.

Yanxuan's risk control has had several major failures closely related to data quality. In August of this year, a weekly misjudgment inspection found that the overall suspected misjudgment rate had increased fourfold. The root cause turned out to be an anomaly in the log field that records the device number, which caused a considerable portion of users' behavior (check-in operations) to be wrongly intercepted.

This is a very interesting case. Some key decisions, such as "is this user a bad actor?" or "how much of a certain product should I purchase?", may depend on a small, neglected field in an online log. It is hard for our entire quality assurance system to keep an eye on one log field of one specific application and ask whether it will go wrong under high load. In the traditional concept of application service quality assurance, an occasional small error in a log field is not treated as a bug, and developers pay no attention to it. But once you use data as a means of production, if you do not update your quality assurance concepts and tools, your piles of data analysis reports, trained algorithm models, and the decisions built on them may be very unreliable, because the raw material itself is garbage. As the old saying goes: garbage in, garbage out.

Another surprising situation is that large amounts of complex SQL used to produce data have never been tested, and for many data systems there is not even a so-called test environment. It is difficult to test the correctness of the data production process the way we test online services (such as an order system). So can data produced by tens of thousands, or even (in Yanxuan's case) hundreds of thousands, of lines of SQL actually be used? This question is hard to answer.
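As a rough illustration of what even lightweight data quality assurance could look like in this setting, here is a minimal sketch in Python with pandas of assertion-style checks on a log batch. The column name (device_id), the ID format, and the thresholds are hypothetical, invented for illustration; they are not from the original article.

```python
import re
import pandas as pd

# Hypothetical schema and thresholds; adapt to the real log format.
DEVICE_ID_PATTERN = r"^[A-Za-z0-9\-]{8,64}$"
MAX_NULL_RATE = 0.001       # tolerate at most 0.1% missing device numbers
MAX_MALFORMED_RATE = 0.001  # tolerate at most 0.1% malformed device numbers


def check_device_log(df: pd.DataFrame) -> list:
    """Return a list of data-quality violations for one batch of log rows."""
    problems = []

    # Check how often the device number is missing entirely.
    null_rate = df["device_id"].isna().mean()
    if null_rate > MAX_NULL_RATE:
        problems.append(f"device_id null rate {null_rate:.4%} exceeds threshold")

    # Check how often the device number fails to match the expected format.
    non_null = df["device_id"].dropna().astype(str)
    malformed_rate = (~non_null.str.match(DEVICE_ID_PATTERN)).mean()
    if malformed_rate > MAX_MALFORMED_RATE:
        problems.append(f"device_id malformed rate {malformed_rate:.4%} exceeds threshold")

    return problems


# Usage sketch: run after each batch load and alert (or block downstream jobs)
# when violations are found, instead of letting a "harmless" log bug flow into
# risk-control decisions.
# problems = check_device_log(pd.read_parquet("logs/2020-08-01.parquet"))
```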

The reliability of data is a very big trap for organizations in the process of data-driven transformation.

Everyone talks about the importance of data quality, yet deep down most people feel it is low-level work. As a result, we rarely see a team put its smartest people on data quality assurance.

Beyond the lack of resource investment, many data teams also perceive data quality differently. I once had an in-depth conversation with a senior practitioner who has spent 15 years in the data industry and made major contributions to the data systems of a well-known company. When I asked, "What do you think data quality really is?", his answer was: "Data quality is essentially about metric consistency." You can see that even a very senior colleague's understanding is incomplete. Data quality as he describes it is good enough to support reports shown to people, but it is basically not sufficient for online automatic decision-making, because data quality failures are hard to repair as quickly as failures in online programs, and they keep polluting everything downstream while they last.

Moreover, as the input to intelligent decision-making, data changes dynamically. Its dependencies cannot be analyzed statically the way code dependencies can; the dependency graph itself is dynamic and unstable.

Pitfall 2: Where is the "science" in data science?
"Data science" is a term we use all the time to describe our daily work, but saying it often leaves us a little guilty inside: we only ever see the data. Where is the "science"? And if the "science" part is missing, will our conclusions be wrong?

This is one of the most common problems: many data science practitioners do not know what the "science" is, which is how the industry ends up full of "SQL boys" and "SQL girls".

A common question is whether a correlation between data metrics is real. When doing data analysis we often see many interesting correlations; for example, users who bought slippers in the past few months seem more likely to repurchase another product in the most recent month. But does this correlation really exist, or is it just a false positive? Our analysis reports can easily turn a blind eye to this question. Yet if the correlation itself cannot withstand scrutiny, how can it guide our work? Are data analysis reports driving the business on luck?

Even with a solid statistical foundation and a p-value attached to every hypothesis, it is still easy to confuse correlation with causation. That two things are correlated does not mean either causes the other. We need causal analysis to propose explanations for correlations in the data that are consistent with the business logic.
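To make the point about significance concrete, here is a minimal sketch (not from the original article) showing how a correlation claim can at least be accompanied by a p-value before anyone acts on it. It uses scipy's pearsonr; the variable names and the synthetic data are invented for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Hypothetical example: monthly slipper purchases vs. purchases of another product.
# The two series are generated independently here, so any observed correlation is noise.
slipper_purchases = rng.poisson(lam=2.0, size=200)
other_purchases = rng.poisson(lam=3.0, size=200)

r, p_value = pearsonr(slipper_purchases, other_purchases)
print(f"correlation r = {r:.3f}, p-value = {p_value:.3f}")

# A small p-value only says the correlation is unlikely to be pure chance;
# it says nothing about which variable (if either) causes the other.
```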

If causal analysis is omitted from data analysis, some strange conclusions will follow. For example, we may find that users with bigger feet usually buy larger shoes. Without a causal analysis grounded in business logic, we might guide operations like this: to make users' feet bigger, we should sell them more large shoes.

But sometimes it is hard to analyze causal relationships directly from the data or to draw conclusions intuitively. That is when we need scientific experiments to help us understand the business more deeply.

How do we do scientific experiments? Drawing on the views of Didi's Xie Liang (in his piece on the "science" in data science), the process can be summarized as follows:

1. Discover and define problems through data acumen and familiarity with the business.

2. Propose structured, quantifiable hypotheses.

3. Design verification experiments. Science and experimentation are closely linked. At Yanxuan, as at many companies, we often use experiments to judge which plan is better, but experiments are even more useful for verifying hypotheses and deepening our understanding of users (the CEO of Toutiao, a company famous for experimentation, has said that more often A/B tests help us understand users rather than make decisions for us). Designing a good experiment is not easy: the metrics to be verified, the sample sets, and the controllable factors (usually traffic) must all be worked out from the hypothesis, and doing so requires real professionalism. (A minimal analysis sketch follows this list.)

4. Collect and analyze data. Analyzing data is not just eyeballing trends. It first requires a clear picture of the business's main metrics and how they relate to each other, and those relationships need to be quantified and even computed. In other words, a structured, systematic, quantitative framework comes first, and data analysis follows. Fortunately, such a framework can be supported by systems and services; this year our team has been designing and building the DIS system (Yanxuan's data intelligence platform), one of whose main goals is to solve exactly this problem.

Throughout, analysts need professional quantitative analysis and statistical skills.
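As a concrete companion to the experiment-design step above, here is a minimal sketch of how the result of a simple two-variant A/B test might be analyzed with a two-proportion z-test, using statsmodels' proportions_ztest. The conversion counts and sample sizes are made up for illustration and do not come from the original article.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical A/B test result: conversions and sample sizes per variant.
conversions = np.array([310, 352])   # control, treatment
samples = np.array([10000, 10000])

z_stat, p_value = proportions_ztest(count=conversions, nobs=samples)
rates = conversions / samples
print(f"control rate = {rates[0]:.3%}, treatment rate = {rates[1]:.3%}")
print(f"z = {z_stat:.3f}, p-value = {p_value:.3f}")

# A significant difference supports (or refutes) the hypothesis the experiment
# was designed to test; interpreting *why* the difference exists still requires
# the causal, business-level reasoning discussed above.
```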


Pitfall 3: Manipulation, misleading conclusions, and insufficient data democratization
Data democratization is discussed a lot in data communities abroad but much less domestically. Data scientists use "black magic" to build models, then tell business colleagues how to make decisions and tell executives whether business metrics have been met. Data capability is confined to one professional team, yet its output is tightly coupled to the business. That opacity breeds fear and anxiety among business staff and management: could the conclusions the data team gives us be manipulated? Could they be misleading, intentionally or not? Such questions easily breed distrust between teams.

So an important problem caused by insufficient data democratization is trust. How do we solve it?

At a business-technology co-creation meeting, a Yanxuan colleague said that data people should "fall in love" with the business. Given today's realities, this is indeed a good way to solve the trust problem. Alibaba's former data leader, Pinjue, said something similar: data people must be able to "mingle, communicate, and share", staying close to the business and building trust, so that both sides can make each other successful.

But this is not, after all, a scalable and standardized solution. Last year, when we planned the development of Yanxuan's data platform for 2019-2020, we thought about this for a long time: how can we lower the threshold for using data and make everything more intuitive and easier to explain? Several of our projects, such as SQL on AI, the Data Intelligence System (DIS), and the algorithm platform, share a common goal: to lower the threshold for using data and to solidify, and even visualize, the data analysis process through products.

Pitfall 4: Predicting the future with data is not a given, and successful prediction takes more than an algorithm model
Bosses often oversimplify what algorithms can do: predictions are inaccurate? Just find a couple of brilliant algorithm experts to build a model and be done with it! Unfortunately, reality is not that simple; you could hire a hundred brilliant algorithm experts and it still might not help.