Three kinds of roles in big data mining

Source: Internet
Author: User

I'm a novice in data mining and machine learning. I started last July at Amazon, where my job pushed me into work I had never touched before: forecasting, which required machine learning. Later, at Taobao, I took the initiative to spend a few months on data mining related to user addresses, so I have some superficial experience. In any case, advice and discussion are welcome.

Also, note that the title of this article imitates the American TV drama "Game of Thrones" (based on the novels "A Song of Ice and Fire"). In the world of data we see many great, powerful, and interesting cases. But data is like a throne: it stands for power and conquest, yet the road up to it is a frightening one.

Three kinds of roles in big data mining

When I worked on machine learning at Amazon, I noticed three roles among the people who work with data there.

Data Analyzer: the data analyst. These people mainly analyze data, find patterns in it, and find training data for the models used in different scenarios. They are also the ones who clean up dirty data.

Research Scientist: the research scientist. This role builds data models for different requirements. They joke that they are an odd species detached from everyday life, like Sheldon in "The Big Bang Theory". These people basically work on the science of data.

Software Developer: the software development engineer. Their main job is to implement the models the Scientists build and hand them to the Data Analyzers to use. These people usually know a variety of machine learning algorithms.

I believe other companies doing data mining or machine learning have the same three kinds of jobs, or three kinds of people. As I see it:

The most technically demanding role is the Scientist, because data modeling, extracting the most meaningful vectors, and choosing among methods are all decided by these people. I doubt that such people can be found in China.

The hardest and most tiring role, but also the most important, is the Data Analyzer; their work is the most important of the three (note: I have used "most" three times). No matter how good your model and your algorithm are, on a heap of rotten data they can only produce a heap of rubbish. As the saying goes: garbage in, garbage out! But this is also the dirtiest and most exhausting work, and the work people are most likely to back away from.

The least technically demanding role is the Software Developer. Many people in China who work with data think the algorithm is what matters most, and many engineers are studying machine learning algorithms. Wrong: the most important are the first two kinds of people, the Data Analyzer who does the hard work of cleaning data and the Scientist who really understands data modeling. Things like k-means, k-nearest neighbors, Bayesian methods, regression, decision trees, and random forests are all very mature and are not artificial intelligence. Put plainly, these algorithms have about as much technical content in machine learning and data mining as algorithms like quicksort have in software design. Of course, I am not saying algorithms don't matter; I only mean that they are the least important part of the whole process.

The quality of the data

The currently popular buzzword "big data" is quite misleading. In my eyes, data is not big or small, only good or bad.

When processing data, the first and strongest impression I had was of data quality. Here are two cases to illustrate:

Case one: Data standards

At Amazon, every product has a unique ID called the ASIN (Amazon Standard Identification Number), which identifies a product uniquely (the idea comes from the barcode). That is, no matter how a product is described, if the ASIN is the same, it is exactly the same product.

This is unlike Taobao, where a search for an iPhone returns all sorts of listings: some titled "value of the iphone", some "Apple iphone", some "smartphone iphone", some "iphone white/Black" ... These different descriptions of the same product are written by merchants to attract users. But this causes two problems:

1 The user experience is poor. For consumers, a product-centric business model gives a noticeably better experience than a merchant-centric one.

2 If you cannot read (identify) the data correctly, then whatever algorithm and whatever model you put behind it is useless.

So anyone who works with data will find that without data standards, nothing else is of any use. Data standards are the first hurdle of data quality; without them, you cannot do anything. Unique identification of data is only the most basic step of a data standard. More importantly, a data standard has to abstract the data into mathematical vectors; without vectors, nothing can be mined downstream.
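To make the point about unique identification concrete, here is a minimal sketch. It assumes a handful of hypothetical listings: the titles are taken from the iPhone example above, while the `product_id` values and the helper `group_by` are invented purely for illustration and are not how Amazon or Taobao store their data.

```python
from collections import defaultdict

# Hypothetical listings. Titles come from the iPhone example in the text;
# the product_id values are made up for illustration.
listings = [
    {"product_id": "P001", "title": "Apple iphone"},
    {"product_id": "P001", "title": "smartphone iphone"},
    {"product_id": "P001", "title": "iphone white/Black"},
]


def group_by(items, key):
    """Group a list of dicts by the value stored under `key`."""
    groups = defaultdict(list)
    for item in items:
        groups[item[key]].append(item)
    return dict(groups)


by_title = group_by(listings, "title")
by_id = group_by(listings, "product_id")

print(len(by_title))  # 3: free-text titles look like three different products
print(len(by_id))     # 1: a shared identifier shows they are the same product
```

Without the shared identifier, any later counting or modeling step would treat one product as three.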

So, as you can see, much of the work of cleaning data is merging messy data together, which is exactly the work of establishing data standards. Manual labor is absolutely unavoidable here. The only difference is:

Smart people define standards before the data is produced and clean the data as it is generated.

Ordinary people do this only after the data has been generated and piled up.

Also, about Amazon's ASIN, which was introduced more than 10 years ago: the material I saw on Amazon's intranet did not say why the ID was created. I don't think it was created for data mining or recommendation; more likely it is because Amazon's business model was designed to be product-centric. Even today the ASIN still has many problems: the same ASIN does not fully guarantee that the goods are the same, and different ASINs do not guarantee that the goods are different, but it holds for more than 90% of products. Amazon has dedicated category teams, with many business people working hard every day to correct ASIN data.

Case two: Data accuracy

User addresses are the other thing I have done data analysis on. I remember the thrill of seeing data on hundreds of millions of user addresses. But the excitement did not last: because addresses are filled in by users, they are full of pitfalls and not easy to handle.

The first kind is false or wrong addresses, entered by merchants who cheat or by users who are running tests. These come in several varieties:

Some are entries like "this address does not exist" or "13243234ASDFASDI". This kind of address can be identified by my program.

Some are hard for my program to identify, for example "Cosmic Road Earth Community". But people can recognize this kind of address.

Some cannot be recognized even by people, for example "Beijing East Four Ring Road, No. 23, Southern Mansion, 5th floor, Room 540", an address that looks real but does not exist.
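The first variety, the one a program can catch, can be sketched with a couple of simple rules. The function `looks_fake` and its two heuristics below are assumptions made for illustration, not the rules my program actually used; the sketch also shows that such rules do not catch the second variety.

```python
import re

# Hypothetical marker phrase; a real list would be longer and language-specific.
SUSPICIOUS_PHRASES = ("does not exist",)


def looks_fake(address: str) -> bool:
    """Flag addresses that are obviously not real, using two crude heuristics."""
    text = address.strip().lower()
    if not text:
        return True
    # Heuristic 1: the user literally says the address is fake.
    if any(phrase in text for phrase in SUSPICIOUS_PHRASES):
        return True
    # Heuristic 2: one unbroken run of 8+ characters mixing digits and letters,
    # which is typical of keyboard mashing.
    if (re.fullmatch(r"[0-9a-z]{8,}", text)
            and re.search(r"[0-9]", text)
            and re.search(r"[a-z]", text)):
        return True
    return False


print(looks_fake("this address does not exist"))  # True
print(looks_fake("13243234ASDFASDI"))             # True
print(looks_fake("Cosmic Road Earth Community"))  # False: needs a human
```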

The second kind is real addresses that are hard to process because users write them in non-standard ways, such as:

Abbreviations: "Jian Guo men wai da Jie" and "Jian Wai Street", "Industrial and Commercial Bank of China" and "ICBC" ...

Typos: "Chaoyang door", "Tong Hui River" ...

Reversed order: "East four Ring road Chaoyang Park" and "Chaoyang Park (East four ring)" ...

Aliases: some people write the developer's name for the community, "East Heng International", while others write the administrative name, "eight Li Zhuang East" ...
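Abbreviations and aliases like these are usually handled by mapping every variant to one canonical form. The sketch below assumes a small hand-maintained table; `ALIASES` and `normalize` are hypothetical names, and the two entries are just the examples from the list above. In practice this table is exactly the manual work that data cleaning requires.

```python
# Hypothetical alias table: variant (lowercase) -> canonical form (lowercase).
ALIASES = {
    "jian wai street": "jian guo men wai da jie",
    "icbc": "industrial and commercial bank of china",
}


def normalize(address: str) -> str:
    """Lowercase, collapse whitespace, and expand known aliases."""
    text = " ".join(address.lower().split())
    for alias, canonical in ALIASES.items():
        text = text.replace(alias, canonical)
    return text


# Two spellings of the same street now compare equal.
print(normalize("Jian Wai Street") == normalize("Jian Guo men wai da Jie"))  # True
print(normalize("ICBC   Tower"))  # industrial and commercial bank of china tower
```

Typos and reversed word order are not covered by a lookup table like this and need fuzzier matching.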

There are many more examples than these. As you can see, inaccurate data makes processing much harder. There is a good analogy: working with data is like mining gold. If the ore is rich in gold, mining is easy and results come quickly; if the gold content is low, mining is hard and the results are poor.

Above, I gave two cases to illustrate:

1 Data is not a matter of big or small, only of valuable big data versus garbage big data.

2 Data cleaning is very important work, and it is manual work with a huge workload.

So this work is best done bit by bit as the data is generated.

There is a saying: if your data is 60% accurate, users will curse whatever you build on it. If it is about 80% accurate, users will say "not bad". Only when accuracy reaches 90% will users be truly impressed. But the cost of going from 80% to 90% is far greater than the cost of going from 60% to 80%. Most data mining teams stop at 70%, because everything beyond that is very tiring work.

Business scenarios for data

I wonder how many data mining teams are really aware of how closely business scenarios and data mining are related. We need to understand that no single data mining and analysis model can fit every business.

Recommending music and videos is a completely different scenario from recommending products in e-commerce. In e-commerce, if you buy something and do not return it, there is a high probability that you like it. For music and video, however, you cannot conclude that users like a song or a video just because they listened to it or watched it. So the difficulty of implementing a recommendation algorithm differs completely from one business scenario to another.

When it comes to recommendation algorithms, perhaps you, like me, sometimes feel that recommendation is just a sorting algorithm over different dimensions. Personally, I think recommendation can be tricky in some business scenarios. Take two kinds of recommendation (I am not distinguishing here between user-based and item-based):

One is popularity-based recommendation, which recommends what is popular. This may be fine, but it may also recommend what users already know. For example, in Beijing, if I want to find a restaurant, you always recommend roast duck; if I want somewhere to visit, you always recommend Tiananmen, the Forbidden City, and the Temple of Heaven. Most people who come to Beijing come to eat roast duck and see Tiananmen; I already know that, so why do I need you to recommend it? In addition, whatever is popular can usually be gamed by paid posters.

The other is personalized recommendation, which requires analyzing each user's preferences. The upside is that it always gives me what I like. The downside is that my taste may change with age and environment, and always catering to a user's existing taste does not help them discover anything new. For example, I like spicy food; if you always recommend Sichuan and Hunan cuisine, I will get tired of it after a while.

Sometimes a good recommendation is not a democratic vote but the advice of an expert user or a veteran; sometimes it is not the popular items that should be recommended but the fresh ones I don't know about. As you can see, the approach may differ across business scenarios and product forms.
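The two kinds of recommendation discussed above can be sketched as two different ways of sorting the same items. Everything in this example is hypothetical: the order histories, the dish names, the `ITEM_TAGS` labels, and both functions are invented to illustrate the contrast, not taken from any real system.

```python
from collections import Counter

# Hypothetical order histories and item tags.
history = {
    "me": ["mapo tofu", "hot pot"],
    "u1": ["roast duck", "dumplings"],
    "u2": ["roast duck", "hot pot"],
    "u3": ["roast duck", "dumplings", "boiled fish"],
}
ITEM_TAGS = {
    "mapo tofu": "spicy", "hot pot": "spicy", "boiled fish": "spicy",
    "roast duck": "mild", "dumplings": "mild",
}


def recommend_popular(user, n):
    """Rank items the user has not tried by how many orders they have overall."""
    counts = Counter(item for items in history.values() for item in items)
    seen = set(history[user])
    unseen = [item for item in counts if item not in seen]
    return sorted(unseen, key=lambda item: (-counts[item], item))[:n]


def recommend_personal(user, n):
    """Rank items the user has not tried by how often the user ordered that tag."""
    tag_counts = Counter(ITEM_TAGS[item] for item in history[user])
    seen = set(history[user])
    unseen = [item for item in ITEM_TAGS if item not in seen]
    return sorted(unseen, key=lambda item: (-tag_counts[ITEM_TAGS[item]], item))[:n]


print(recommend_popular("me", 2))   # ['roast duck', 'dumplings']
print(recommend_personal("me", 1))  # ['boiled fish']
```

The popular ranking returns what everyone already knows, and the personal ranking returns only more spicy food, which are exactly the two weaknesses described above.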

Moreover, even within the same e-commerce company, books, mobile phones, and clothing are completely different businesses. At Amazon I worked on demand forecasting: predicting users' future demand from historical data.
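As a point of reference for what "predicting future demand from historical data" means at its very simplest, here is a moving-average baseline. This is not Amazon's method, which I am not describing here; the function name, the window size, and the sales figures are all assumptions for illustration.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next period as the mean of the last `window` observations."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = history[-window:]
    if not recent:
        raise ValueError("history is empty")
    return sum(recent) / len(recent)


# Hypothetical weekly unit sales for one product.
weekly_sales = [120, 130, 125, 140, 135, 150]
forecast = moving_average_forecast(weekly_sales)
print(round(forecast, 1))  # mean of 140, 135, 150
```

A baseline like this only works when demand is stable from period to period.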

Books, mobile phones, and home appliances, which Amazon calls hard-line products, can be thought of as "standard" products (though not always). Forecasts for them are relatively accurate, and demand can even be predicted for specific product attributes.

But for the so-called soft-line products, such as clothing, Amazon has not managed to forecast well in more than 10 years, because these products are subject to too many interfering factors: the user's preference for color and style, how the item fits, whether friends like it ... These things change too easily, and an item that too many people buy may actually sell worse afterwards. So demand simply cannot be predicted well, let alone to the level that stock/vendor managers ask for: "predict the demand for a certain color of a certain brand of clothing or shoes".

For demand forecasting, I found that people who have worked in the industry for a long time make the most accurate predictions; next to them, machine learning counts for nothing. Machine learning only makes sense when you face thousands of different products and categories.

Data mining is not artificial intelligence; it is still far from it. Do not assume that data mining can do anything. Finding a suitable business scenario and product form matters more than anything else.

Data analysis results

Many of the people I see working with big data are basically doing statistics: presenting the data along a number of different dimensions. The simplest and most common example is website statistics: how many page views (PV), how many unique visitors (UV), where the referrers are, and the distribution of browsers, operating systems, regions, search engines, and so on.
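To show how little is involved in this kind of statistic, here is a minimal sketch that computes PV, UV, and a browser breakdown. The record layout and field names are assumptions; real access logs would first have to be parsed into a form like this.

```python
from collections import Counter

# Hypothetical, already-parsed access-log records.
records = [
    {"visitor": "v1", "page": "/", "browser": "Chrome", "referrer": "search"},
    {"visitor": "v1", "page": "/item/1", "browser": "Chrome", "referrer": "internal"},
    {"visitor": "v2", "page": "/", "browser": "Firefox", "referrer": "search"},
    {"visitor": "v3", "page": "/", "browser": "Chrome", "referrer": "direct"},
]

pv = len(records)                                        # every request counts
uv = len({r["visitor"] for r in records})                # distinct visitors
browser_share = Counter(r["browser"] for r in records)   # requests per browser

print(pv, uv)                         # 4 3
print(browser_share.most_common(1))   # [('Chrome', 3)]
```

All three figures are plain counts; nothing here is mined or learned.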

One more nagging remark: do not think that having ten-odd terabytes of logs means you have data, and do not think that analyzing logs with Hadoop/MapReduce is data mining. What you are doing is just statistics. Those terabytes of raw data are basically meaningless; they can only be called logs, not even data. Only the figures you compute from them carry a little meaning and can be called data.

When merchants look at their own shop's data, such as 5 orders per thousand visitors, 65% of visitors being men, 30% being 18 to 24 years old, and so on, or even that they outperform 40% of merchants of the same type, most of them have no idea what to do with it. Should they make the site more masculine, or more appealing to young people? They are completely at a loss.

If you look closely, you will find that some data analysis results look good, but in fact nobody knows what to do next.

So I think the results of data analysis should not only present the data but, more importantly, focus on what can be done with it. If people look at the results of a data analysis and still do not know what to do, that analysis has failed.

Summary

To sum up, here are the things I think matter most for data mining or machine learning:

1 Data quality. This covers data standards and data accuracy. Noise in the data should be removed as much as possible. A lot of manual work is indispensable for data quality.

2 Business scenarios. We cannot cover every scenario, so the business scenario and product form matter, and I personally feel that the narrower the business scenario, the better.

3 Data analysis results. People should be able to understand them and know what to do next, rather than getting data for data's sake.

Many people are doing data mining, but there are not many successful cases (relative to the large number of attempts). For now, I tend to think that current data mining technology is transitional and still at the groping stage. Also, some data mining teams end up being neither good at the business nor good at the technology, and I feel sorry for the technical staff on them ...

Sorry that I have only raised problems without offering suggestions; this also shows that there are many opportunities in data analysis ...

Finally, there is the problem of personal privacy in data, which seems like an unethical black magic: to succeed you have to turn yourself dark. Yes, data is like a throne: it stands for power and conquest, but the road up to it is a frightening one.
