[Reprint] A Data Analysis Methodology: The Greatest Truths Are the Simplest

Source: Internet
Author: User

http://www.36dsj.com/archives/40569

Yang Yonghong Technology Vice President

Introduction: Do you find studying data analysis methods painful? In this article, the author summarizes an easy-to-understand, easy-to-use data analysis methodology so that beginners can quickly grasp the core, most commonly used points of data analysis, enough to meet at least 90% of daily needs.

Learning is painful for most people, especially when it means reading thick professional books whose obscure, poorly explained terminology only makes things worse. Some books and articles, however, describe complex theories in plain, colloquial language that readers can follow without difficulty. Such content is a blessing for learners. After all, the internet industry has been talking about user-centered thinking for years; creators of education and training content should change too and speak from the reader's point of view.

This article is about data analysis methods. From the author's contact with many enterprises, most now pay more and more attention to data, yet a considerable number of enterprises and practitioners still have not found their way into data analysis. They do not know how to analyze their data and hope for help from professionals.

Data analysis methods are not mysterious.

I also found studying data analysis methods painful. I read many books, but the content was hard to remember as a whole and even harder to apply. Only after joining Yonghong Technology and building data analysis systems for many enterprises, through a large amount of project practice, could I slowly say that I had gotten started.

A good methodology should be easy to learn and easy to use. This article tries to use the simplest, most understandable writing to help newcomers to data analysis understand and master its core, most commonly used points, enough to meet at least 90% of daily needs. To do this, profound data analysis methods must be distilled into 3 points that people can remember rather than 30, and condensed into the length of an article rather than a book.

1. Data comes in two kinds, dimensions and measures; analysis is the combination of dimensions and measures

Here is an example using the simplest kind of consumer shopping data.

Regardless of whether the data table lives in Excel or in a database, focus only on the data itself. The data items (or fields) in the table are order ID, user ID, region, age, order amount, order item, and order time.

What distinguishes these data items? In general, data comes in two kinds: dimensions and measures (also called indicators or metrics). In the example above, "order amount" is a measure, and the remaining data items are dimensions.

As can be seen, a measure is a quantitative value that is calculated, and a dimension is attribute information that describes things. When we do data analysis, we are constantly combining dimensions and measures, such as the total order amount for the Beijing region or the average order amount for users aged 21-30. We also apply mathematical formulas to dimensions and measures, such as the sum of all order amounts or the user count (the distinct count of user IDs), and so on.
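The combinations described above can be sketched in a few lines of Python. This is only an illustration: the sample rows, the field names, and the helper `aggregate` are all hypothetical, since the original table is not reproduced in the text.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows; field names follow the data items listed in the text.
orders = [
    {"order_id": 1, "user_id": "u1", "region": "Beijing", "age": 25, "amount": 120.0},
    {"order_id": 2, "user_id": "u2", "region": "Beijing", "age": 34, "amount": 80.0},
    {"order_id": 3, "user_id": "u1", "region": "Tianjin", "age": 25, "amount": 300.0},
    {"order_id": 4, "user_id": "u3", "region": "Shandong", "age": 29, "amount": 50.0},
]

def aggregate(rows, dimension, measure, func):
    """Group rows by a dimension and apply func to each group's measure values."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row[measure])
    return {key: func(values) for key, values in groups.items()}

# Dimension + measure: total order amount per region.
amount_by_region = aggregate(orders, "region", "amount", sum)

# Dimension filter + measure: average order amount for users aged 21-30.
avg_amount_21_30 = mean(r["amount"] for r in orders if 21 <= r["age"] <= 30)

# Formula on a measure: sum of all order amounts.
total_amount = sum(r["amount"] for r in orders)

# Formula on a dimension: distinct count of user IDs.
user_count = len({r["user_id"] for r in orders})
```

Every analysis here has the same shape: pick a dimension (or a filter on one), pick a measure, and pick a calculation.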

In terms of data type, a measure is always numeric, but a numeric value is not necessarily a measure. An order ID, for example, is a number but is a dimension. Time and text data are dimensions as well.

Note that dimensions and measures can be converted into each other. When we look at the average "age", "age" is a measure; when we look at the orders of 19-year-old users, "age" is a dimension. Whether a data item is a dimension or a measure depends on the user's need, much like a quantum effect: its state is fixed only once the need is fixed.

In addition, dimensions can derive new dimensions and measures. The "region" dimension can derive a large-area dimension, in which "Beijing" and "Tianjin" both map to "North China". The "age" dimension can derive an age-range dimension: 20-29 years old = "young", 30-39 years old = "middle-aged", 40-49 years old = "senior middle-aged". The average age mentioned above is an example of using the "age" dimension to derive a measure.

Measures can likewise derive new dimensions and measures. The "order amount" measure can derive an amount-range dimension, where orders below 100 yuan are "small orders", orders above 500 yuan are "large orders", and so on. As another example, subtracting the "cost" measure from the "revenue" measure gives a "profit" measure.
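A small sketch may make the dual role of "age" concrete. The rows below are hypothetical and exist only to show the same field used both ways.

```python
from statistics import mean

# Hypothetical rows for illustration only.
orders = [
    {"user_id": "u1", "age": 19, "amount": 60.0},
    {"user_id": "u2", "age": 19, "amount": 40.0},
    {"user_id": "u3", "age": 31, "amount": 200.0},
]

# "age" as a measure: the values themselves are aggregated.
average_age = mean(row["age"] for row in orders)

# "age" as a dimension: the value selects a group, and another measure is aggregated.
amount_of_19_year_olds = sum(row["amount"] for row in orders if row["age"] == 19)
```

The data did not change between the two statements; only the question asked of it did.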
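The derivations above can be written as plain mapping functions. The sketch below follows the examples in the text; the fallback labels "other" and "medium order" and all function names are assumptions added so that every input gets a value.

```python
# Dimension -> new dimension: region to large area (mapping taken from the text).
REGION_TO_AREA = {"Beijing": "North China", "Tianjin": "North China"}

def large_area(region):
    """Derive the large-area dimension; 'other' is an assumed fallback."""
    return REGION_TO_AREA.get(region, "other")

def age_band(age):
    """Derive an age-range dimension from the age dimension."""
    if 20 <= age <= 29:
        return "young"
    if 30 <= age <= 39:
        return "middle-aged"
    if 40 <= age <= 49:
        return "senior middle-aged"
    return "other"  # assumed label for ages outside the ranges in the text

def order_size(amount):
    """Derive an amount-range dimension from the order amount measure."""
    if amount < 100:
        return "small order"
    if amount > 500:
        return "large order"
    return "medium order"  # assumed label; the text names only the two ends

def profit(revenue, cost):
    """Derive a new measure from two existing measures."""
    return revenue - cost
```

For example, `age_band(25)` gives "young" and `profit(80, 50)` gives 30.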

2. Use comparison to make judgments

Here is a question: enterprise A earned 80 million in revenue this year. Is that high or low? Faced with this question, you should feel unable to judge, because there is no reference point, that is, nothing to compare against. To judge whether a number is good or bad, high or low, you must compare it with something.

First, enterprise A can be compared with itself. If revenue was 20 million the year before last and 40 million last year, then 80 million this year is very good. If revenue was 100 million last year, 80 million this year is bad. This is called vertical comparison.

Second, enterprise A can be compared with others. If several competitors each earn hundreds of millions this year, enterprise A's 80 million is not ideal. This is called horizontal comparison.

Third, enterprise A can be compared across different dimensions and measures. For example, suppose competitors serve the national market while enterprise A serves only the Shandong market. If enterprise A's revenue in Shandong is higher than its competitors' revenue in Shandong, then enterprise A does better in that region, although viewed nationally it is limited. As another example, if competitors have been in business for more than 10 years and enterprise A for only four or five, enterprise A is doing quite well; but if a competitor founded at about the same time already earns over 100 million, enterprise A is not doing well enough. This is called comprehensive comparison.

A child scores 95 on an exam and the parents are happy, because they know the full score is 100: there is a reference. On the next exam the child scores 80 and the parents are angry, because the earlier 95 has become the new reference. Then they ask around and learn that the paper was very hard and the child already ranked first in the class, so anger turns to joy. This time the other children have become the reference (and the sacrifice).

Comparing against different references yields different conclusions. To avoid one-sided, non-objective conclusions, use comprehensive comparison whenever possible.
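The three kinds of comparison can be expressed as simple arithmetic. In this sketch, enterprise A's figures come from the text, while the competitor figures and all names are hypothetical.

```python
def growth_rate(current, previous):
    """Relative change from the previous value to the current one."""
    return (current - previous) / previous

# Enterprise A's own history (figures from the text).
revenue_a = {"two_years_ago": 20_000_000, "last_year": 40_000_000, "this_year": 80_000_000}

# Comparison with itself: this year against last year.
vertical = growth_rate(revenue_a["this_year"], revenue_a["last_year"])

# Comparison with others: A's revenue as a fraction of each rival's (hypothetical figures).
competitors_this_year = {"rival_1": 300_000_000, "rival_2": 200_000_000}
horizontal = {
    name: revenue_a["this_year"] / revenue
    for name, revenue in competitors_this_year.items()
}

# Comparison on another dimension: the Shandong market only (hypothetical rival figure).
shandong_revenue = {"enterprise_a": 80_000_000, "rival_1": 50_000_000}
leads_in_shandong = shandong_revenue["enterprise_a"] > shandong_revenue["rival_1"]
```

With these numbers, A doubled its own revenue, earns well under half of what either rival earns nationally, and still leads rival_1 inside Shandong. The same 80 million reads differently against each reference.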

3. Use subdivision to find the cause

Profit fell this year, and the angry boss orders the "culprit" to be found. How do you find the cause? Note that the task is to find the cause, not to find an excuse. Many people do not know how to find the cause and end up offering an excuse instead.

Let's start with an example of a cause: "Sales of washing machines in South China declined in the fourth quarter, which led to a decline in profit this year." Now look at the characteristics of this cause.

This cause is made up of three dimensions (time, region, and product) and one measure (sales). So locating the cause of a problem essentially means answering which measures fell or rose under which dimension values.

This is exactly what subdivision does.

We can subdivide by dimension: there are as many directions of subdivision as there are dimensions. For example, check whether every month of the year was down or only a few months. If only a few, the range of data to search narrows. After focusing on those months, look at which regions are down, and keep subdividing.

The order in which dimensions are examined has little impact, and the dimensions involved in the problem cannot be predicted in advance, so any dimension can serve as the entry point.

If the measure showing the problem depends on other measures and you want to dig further into the cause, subdivision must also look at different measures. In the example conclusion above, "sales of washing machines in South China declined in the fourth quarter, which led to a decline in profit this year", the problem appears in "profit" but the cause lies in "sales", because profit is derived from other measures.

Subdivision can go on without end. How fine is fine enough? The answer: fine enough to be actionable.

For example, stopping at "profit fell in the fourth quarter and did not fall in the other quarters" still does not solve the problem. You must narrow it down to which product line, in which region, in which period, until a specific person can be held responsible; only then is the result actionable. Note that in reality a problem often has not one cause but a combination of several.
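The subdivision procedure can be sketched as a loop that subdivides by one dimension, picks the value with the largest decline, narrows the data to it, and moves on to the next dimension. The rows, field names, and functions below are hypothetical, and following only the single largest decline at each step is a simplifying assumption.

```python
from collections import defaultdict

def row(period, quarter, region, product, sales):
    return {"period": period, "quarter": quarter, "region": region,
            "product": product, "sales": sales}

# Hypothetical sales for a previous and a current year.
rows = [
    row("previous", "Q3", "South China", "washing machine", 100.0),
    row("previous", "Q3", "North China", "refrigerator", 100.0),
    row("previous", "Q4", "South China", "washing machine", 100.0),
    row("previous", "Q4", "South China", "refrigerator", 100.0),
    row("previous", "Q4", "North China", "washing machine", 100.0),
    row("current", "Q3", "South China", "washing machine", 100.0),
    row("current", "Q3", "North China", "refrigerator", 105.0),
    row("current", "Q4", "South China", "washing machine", 40.0),
    row("current", "Q4", "South China", "refrigerator", 100.0),
    row("current", "Q4", "North China", "washing machine", 100.0),
]

def change_by(rows, dimension, measure="sales"):
    """Current-period total minus previous-period total, per value of a dimension."""
    delta = defaultdict(float)
    for r in rows:
        sign = 1 if r["period"] == "current" else -1
        delta[r[dimension]] += sign * r[measure]
    return dict(delta)

def drill_down(rows, dimensions):
    """Follow the largest decline one dimension at a time, narrowing the rows."""
    path = []
    for dimension in dimensions:
        delta = change_by(rows, dimension)
        worst = min(delta, key=delta.get)
        if delta[worst] >= 0:
            break  # nothing declined along this dimension
        path.append((dimension, worst, delta[worst]))
        rows = [r for r in rows if r[dimension] == worst]
    return path

path = drill_down(rows, ["quarter", "region", "product"])
```

On this sample the path narrows to Q4, then South China, then washing machine. Because a real problem may combine several causes, a fuller version would follow every value with a notable decline rather than only the worst one.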

My company, Yonghong Technology, promotes a one-stop big data analysis platform. It provides two kinds of interaction, "zoom" and "brush", precisely to serve the two scenarios of "comparison" and "subdivision".

For example, the chart on the left shows the income margin of each product, and the chart on the right shows the profit trend of each category. The user now wants to focus on three products in the "tea" category and see how their profit is doing.

Some may ask: this effect looks very much like a filter, so why not just place a few filters beside the charts? Filters can be provided, but in practice, when we spot a problem on a chart, the corresponding filter is not always easy to find, especially for a scatter plot. Selecting directly on the chart is therefore convenient and efficient.

Another example is product profit trend analysis. The user finds that, starting in July 2009, profit declined for 4 consecutive months (as shown in the red box) and wants to know why.

Unlike zoom, brush makes it easier to compare local data with overall data. In the example above, simply seeing which products had the lowest absolute sales income in those 4 months explains nothing, because some products always sell little. What matters is which products performed relatively poorly in those 4 months.

First judge whether the data is good or bad, then analyze the cause: with that, the chain of data analysis is basically complete.
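One way to express "relative performance" is to divide each product's average monthly income in the selected months by its average over all months. This is an illustrative calculation, not a description of how the platform's brush is implemented; the data, product names, and function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical monthly income per product over six months.
incomes = {
    "green tea": [100, 100, 100, 40, 40, 100],
    "herbal tea": [10, 10, 10, 10, 10, 10],
}
rows = [
    {"product": product, "month": month, "income": income}
    for product, series in incomes.items()
    for month, income in enumerate(series, start=1)
]

def relative_performance(rows, selected_months):
    """Average income in the selected months divided by the overall average, per product."""
    overall, local = defaultdict(list), defaultdict(list)
    for r in rows:
        overall[r["product"]].append(r["income"])
        if r["month"] in selected_months:
            local[r["product"]].append(r["income"])
    return {p: mean(local[p]) / mean(overall[p]) for p in local}

ratios = relative_performance(rows, selected_months={4, 5})
```

In the selected months "herbal tea" has the lowest absolute income, yet its ratio is 1.0 because it always sells little. "green tea" has a ratio of 0.5, which marks it as the product that actually underperformed.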

How to think about machine learning, data mining, and other fancy techniques

When should you touch fancy techniques like machine learning and data mining? In a word: push the data analysis methods above as far as they will go first, and only then move on to the fancy ones. Do not put blind faith in complex algorithms. Many in-house data analysis experts rely on a deep understanding of the business and ordinary calculation methods, and still carry out excellent, practical analyses.

When are machine learning and data mining used? In short, when there are too many data items for a person to look through. With only ten or so data items, plotting each one on its own is enough to reveal clues, and mining algorithms are unnecessary. With hundreds of data items, if you want to know which one most affects a given data item, no person can look through them all, and a mining algorithm is the better fit.
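As a very rough stand-in for "which data item most affects a given one", the sketch below ranks columns by the absolute Pearson correlation with a target column. This is a simple statistic swapped in for illustration, not a mining algorithm named in the text, and correlation alone does not establish cause. The column names and values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)

def rank_by_correlation(columns, target):
    """Sort all non-target columns by absolute correlation with the target, strongest first."""
    scores = {
        name: abs(pearson(values, columns[target]))
        for name, values in columns.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical data items; with hundreds of columns the same call still applies.
columns = {
    "profit": [1, 2, 3, 4, 5],
    "ad_spend": [2, 4, 6, 8, 10],
    "temperature": [3, 1, 4, 1, 5],
    "store_count": [7, 7, 7, 7, 7],
}
ranking = rank_by_correlation(columns, target="profit")
```

With four columns a person could simply plot each one; the value of automating the ranking only shows once the number of data items grows beyond what can be inspected by eye.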
