Today, 90% of data analysts are talking about big data. In the context of big data, where are the boundaries of data collection for analysts, and how should the data be used? The author attended the Strata 2012 conference in the United States and talked with many people who work with data. Of these, former LinkedIn Chief Scientist DJ Patil left the deepest impression on him.
Dialogue: Needs first, then data
Cheping: One question has long puzzled me. Companies can now obtain data very easily, and data is growing very fast. So what data should a company actually collect? How much should it collect? Where are the boundaries of data collection?
Patil: Collecting data used to be hard, but now it is much easier. If collection does not start from a problem to be solved, you end up collecting far too much.
Cheping: But many companies believe that collecting data is neither difficult nor expensive, so why not collect first and pull the data out later, when there is a problem to solve?
Patil: I don't think so. Designing a data product with that mindset is bound to fail. Data has no boundaries, and I agonized over this for days myself. Take a person's birthday: you can record it down to the second, but if you don't know how it will be used, the data is useless.
Cheping: In fact, data also has a life cycle. For example, gender can be inferred from a Chinese ID number, but if that rule changes in a few years, the basis of the data changes with it, and the assumptions and decisions we built on the data lose their meaning (the data is "broken"). It is also not easy to preserve data together with the context in which it was collected. So when we collect data we have to know what it is for, and we cannot think only about today.
Today, for example, many e-commerce owners ask what their repeat purchase rate is, so we collect data to calculate it, but we seldom ask which decision the repeat purchase rate is supposed to support. The story of "Kezhouqiujian" (marking the boat to find the sword dropped in the river) tells us that things change, and that we cannot apply methods or indicators mechanically. Repeat purchase rate itself has more than one definition, and different decisions call for different definitions. If company A wants to acquire company B, it may look at repeat purchases in fine detail over a 3-month window: what proportion of users bought once, what proportion bought 2~3 times, and what proportion bought 3~4 times. If a company is only measuring its own operations, it may care more about the trend of the repeat purchase rate at the day and week levels, or about how many of this month's new customers buy again, so that it can measure the eventual loyalty and quality of each month's new customers.
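The point that one indicator can carry several definitions can be made concrete with a minimal sketch. The order log, the user IDs and both function names below are hypothetical and exist only for illustration; they are not taken from Alipay or from any real system. The first function gives the acquirer's view (share of buyers by number of purchases in a window), the second gives a single repeat purchase rate for any window an operator chooses.

```python
from collections import Counter
from datetime import date

# Hypothetical order log: (user_id, order_date). Illustrative data only.
orders = [
    ("u1", date(2012, 1, 5)),
    ("u1", date(2012, 2, 10)),
    ("u1", date(2012, 3, 15)),
    ("u2", date(2012, 1, 20)),
    ("u3", date(2012, 2, 1)),
    ("u3", date(2012, 2, 25)),
    ("u4", date(2012, 3, 3)),
]


def orders_per_user(orders, start, end):
    """Count each user's orders placed between start and end (inclusive)."""
    return Counter(user for user, day in orders if start <= day <= end)


def purchase_count_distribution(orders, start, end):
    """Acquirer's view: share of buyers by number of orders in the window."""
    per_user = orders_per_user(orders, start, end)
    if not per_user:
        return {}
    buckets = Counter(per_user.values())
    return {n: buckets[n] / len(per_user) for n in sorted(buckets)}


def repeat_purchase_rate(orders, start, end):
    """Operator's view: share of buyers in the window who ordered more than once."""
    per_user = orders_per_user(orders, start, end)
    if not per_user:
        return 0.0
    repeaters = sum(1 for count in per_user.values() if count > 1)
    return repeaters / len(per_user)


quarter = purchase_count_distribution(orders, date(2012, 1, 1), date(2012, 3, 31))
january = repeat_purchase_rate(orders, date(2012, 1, 1), date(2012, 1, 31))
february = repeat_purchase_rate(orders, date(2012, 2, 1), date(2012, 2, 29))
print(quarter, january, february)
```

The same seven orders give a different answer for each window and each definition, which is why the decision has to be fixed before the indicator is.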
Data applications should be small and beautiful
While working on data applications during this period, I was particularly troubled by the question of what data to collect. At the time I wanted to build a very large data application that most people could use, but I later found this almost impossible at the initial stage. First, a data application that meets most people's needs does not exist. Second, Alipay's data is very rich: there are many factors to consider, and the relationships among them are very complex.
So my conclusion is this: for a data application, data is the raw material, and when the raw material keeps changing, the application runs into problems. Once I understood this relationship between data and application, I decided to build small applications.
"Small" here means that the application's goal is very specific. For example, if my goal for a data application is to tell which of two decisions is better and where the difference lies, that is a very specific problem. But if my goal is to know how to make a company profitable, that is a vague goal.
Also note that "small" does not refer to the amount of data. Many people are content in their ignorance when they have not obtained enough data and do not understand the data they have.
After some difficulties, I settled on designing data applications from a narrow angle. Starting from a narrow angle makes the design specific and fast, and it also avoids the problems caused by changes in the raw material.
This trip to the United States also left me with some impressions. Many U.S. data analysts are now talking about OODA (observe, orient, decide, act), a model the Air Force uses in its analysis. Because air combat demands rapid decisions, the model is also particularly well suited to the needs of today's Internet. Its core idea is to strike quickly: today's Internet companies develop at great speed, and data analysts have to find solutions quickly in that fast-moving environment.
This model fully reflects the Internet's need for rapid trial and error and rapid adjustment. Fast prototyping is more practical for Internet companies that have never used data to solve problems. In the context of big data there is not only a large volume of data but also many kinds of it. At the initial stage, if you do not start from a small angle, it is hard to produce a practical product and visible results.
Put the data in a framework
This leads to another topic: in the context of big data, we must consider the relationships among the data. A single piece of data is meaningless. In practice, choosing two kinds of data at opposite extremes often makes it easier to find the link between them, and putting them into one framework then helps locate the problem.
For example, I have studied which websites in the United States are worth learning from. Using data to find the dark horses among American Internet applications is an approach that starts from the problem. Out of all the available data, I chose two commonly used metrics, "traffic" and "stay time", as the framework to help me make decisions. Using this framework, I discovered Pinterest in 2010, well before imitators appeared in China.
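A two-metric framework of this kind can be sketched as a simple quadrant classification. Everything below is an assumption made for illustration: the site names and numbers are invented, the median is just one convenient way to split "high" from "low", and the article does not say which quadrant the author treated as a dark horse. One plausible reading is that a site with low traffic but long stay time deserves a closer look.

```python
from statistics import median

# Hypothetical sites: (name, monthly_visits, avg_minutes_per_visit).
sites = [
    ("site_a", 900_000, 2.0),
    ("site_b", 50_000, 14.0),
    ("site_c", 800_000, 12.0),
    ("site_d", 40_000, 1.5),
]


def classify(sites):
    """Place each site in a (traffic, stay time) quadrant, split at the medians."""
    traffic_cut = median(visits for _, visits, _ in sites)
    time_cut = median(minutes for _, _, minutes in sites)
    quadrants = {}
    for name, visits, minutes in sites:
        traffic = "high" if visits >= traffic_cut else "low"
        stay = "high" if minutes >= time_cut else "low"
        quadrants[name] = (traffic, stay)
    return quadrants


quadrants = classify(sites)
# Assumed rule, not stated in the article: low traffic plus long stay time.
candidates = [name for name, q in quadrants.items() if q == ("low", "high")]
print(quadrants, candidates)
```

Neither metric says much alone; it is the position of a site on both axes at once that turns the data into something a decision can rest on.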
So, on how to make decisions using data in the context of big data, I have summed up a four-step approach:
First, collect data from the perspective of solving a problem;
Second, organize the collected data into a framework and use the framework to help decision makers make decisions;
Third, evaluate the effectiveness of the decisions and actions, which tells us whether the framework is reasonable;
Fourth, when new data becomes available, examine whether it can be used to improve the previous three steps and whether more kinds of data now need to be collected.
The author, Che Pinjue, is chief business intelligence officer at Alipay. A native of Hong Kong, China, he received a Western education in Britain, Australia and other places, and holds master's degrees from Tsinghua University and INSEAD. Before joining Alipay he was chief product officer at Dunhuang Net.
(Responsible editor: Lu Guang)