Talking about Alibaba Big Data: Source of data

Source: Internet
Author: User
Tags: big data, data collection, data warehouse, data trading, data transactions

II) Source of data


Big data starts with having data; without it there is nothing to work with. As the saying goes, "data is the primary factor of production in the DT (data technology) era."

Where does the data come from, and where is it generated?

Data is everywhere. Ever since writing was invented, humans have recorded all kinds of data, but the storage medium was generally books, which are hard to analyze and process. With the rapid development of computing and storage technology and the digitization of everything (audio, images, and so on), the volume of data has exploded, and the growth of the Internet of Things will make that explosion even faster. At the same time, the demands on data storage and processing technology keep rising.

According to the Digital Universe report published by IDC, the amount of data created, copied, and consumed by humans reached 4.4 ZB in 2013 and will grow tenfold to 44 ZB by 2020. Big data has become the most valuable asset of humanity today. How to use this data sensibly and effectively, and put it to work, is the job of big data.
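To put the tenfold figure in perspective, the short calculation below derives the compound annual growth rate implied by the two numbers quoted above. It is only arithmetic on those two data points, assuming steady growth between 2013 and 2020; it is not a figure taken from the report itself.

```python
# Figures quoted in the text: 4.4 ZB in 2013, 44 ZB projected for 2020.
start_zb, end_zb = 4.4, 44.0
years = 2020 - 2013

# Compound annual growth rate: (end / start) ** (1 / years) - 1
cagr = (end_zb / start_zb) ** (1 / years) - 1
print(f"Implied annual growth: {cagr:.1%}")
```

Under that assumption, the data volume grows by a little under 40% per year.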

Early enterprises were comparatively simple: the data stored in relational databases was usually their only data source. The corresponding big data technology was the traditional OLAP data warehouse. Because relational databases held essentially all of their data, the technology stayed simple: statistics were pulled directly from the relational databases or, at most, a unified OLAP data warehouse was built.

In Taobao's history, the early warehouse data came almost entirely from the OLTP databases of the main business. It amounted to user information (from registration and verification), product information (uploaded by sellers), transaction data (from buying and selling), and favorites data (from users bookmarking items). At the business level, the company cared about statistics on this data: total users, active users, number and amount of transactions (drillable by category, province, and so on), number and amount of payments, and the like. Because there was no marketing system or advertising system yet, the company only paid attention to data about users, products, and transactions. Statistical processing of this data was the whole of Taobao's big data at the time.
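The kind of statistics described above can be sketched with an in-memory SQLite database standing in for the business database. The `orders` table, its columns, and the sample rows are all hypothetical and chosen only to illustrate a total plus a drill-down by category and province.

```python
import sqlite3

# Hypothetical transaction table; schema and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, buyer_id INTEGER, "
    "category TEXT, province TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, 101, "books", "Zhejiang", 30.0),
        (2, 102, "books", "Jiangsu", 45.0),
        (3, 101, "apparel", "Zhejiang", 120.0),
        (4, 103, "apparel", "Zhejiang", 80.0),
    ],
)

# Top-level figures: number of transactions and total amount.
total_count, total_amount = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM orders"
).fetchone()

# Drill down to category and province.
drill_down = conn.execute(
    "SELECT category, province, COUNT(*), SUM(amount) "
    "FROM orders GROUP BY category, province "
    "ORDER BY category, province"
).fetchall()

print(total_count, total_amount)
for row in drill_down:
    print(row)
```

A real warehouse would load such tables into a separate OLAP store instead of querying the transactional database directly, but the aggregation itself looks much the same.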

As the business grew, however, features such as personalized recommendation and the advertising system needed more data. In the databases, only favorites and shopping carts reflected user behavior; other behavior, such as browsing and searching, was completely unknown at this point.

This calls for another data source: log data, which records user behavior. With cookies, once a user has logged in even once, the logged behavior can be tied to a real user. For example, from a user's browsing and purchase behavior, the system can recommend products the user may be interested in. "Viewed this, also viewed" and "bought this, also bought" are recommendation features built on this basic behavior data. The same data can be used to analyze a user's browsing path and time spent, which is an important basis for improving Taobao's products.
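As a rough illustration of how browsing logs can feed such a recommendation, the sketch below counts how often two items are viewed by the same user and suggests the most frequent companions. This is a generic co-occurrence approach, not a description of Taobao's actual algorithm; the log entries and function names are made up for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical browsing log: (user_id, item_id) pairs.
view_log = [
    ("u1", "phone"), ("u1", "case"), ("u1", "charger"),
    ("u2", "phone"), ("u2", "case"),
    ("u3", "phone"), ("u3", "charger"),
    ("u4", "case"), ("u4", "phone"),
]

def build_co_views(log):
    """Count, for each item, how many users also viewed each other item."""
    items_by_user = defaultdict(set)
    for user, item in log:
        items_by_user[user].add(item)
    co_views = defaultdict(Counter)
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):
            co_views[a][b] += 1
            co_views[b][a] += 1
    return co_views

def also_viewed(co_views, item, top_n=3):
    """Return up to top_n items most often viewed alongside `item`."""
    counts = co_views.get(item, Counter())
    return [other for other, _ in counts.most_common(top_n)]

co_views = build_co_views(view_log)
print(also_viewed(co_views, "phone"))
```

The same counting works for "also bought" by swapping the browsing log for a purchase log.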

In 2009 the mobile Internet took off. With the large-scale emergence of native apps, mobile user behavior could no longer be captured with the traditional log method. A number of new mobile data collection and analysis tools appeared, such as Umeng (Youmeng), TalkingData, and Taobao's internal mobile data tool. Through an embedded SDK, they collect statistics on user behavior inside native apps.

The data was now being collected, but new problems arose. For example, how does a user's behavior on the PC map to the same user's behavior on mobile? The two are disconnected, because the PC side uses PC identifiers and the mobile side uses mobile identifiers. With a unified user library that ties together login name, email address, ID card number, mobile phone number, IMEI, MAC address, and so on to uniquely identify a user, data can be matched to that user wherever it is generated, as long as it has been linked once.

This touches on an important topic: data standards. Data standards do more than solve internal data association; a good user library can also solve many future problems in associating big data across sources. Suppose public security data is to be linked with hospital data to unlock greater value, but public security identifies a person by ID card number while the hospital identifies a patient by mobile phone number. With a unified user library, the two data sets can be readily correlated through ID mapping.
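One common way to implement this kind of ID mapping is a union-find (disjoint-set) structure: every time two identifiers are observed together, they are merged into the same group, and any two identifiers in the same group are treated as the same user. The sketch below assumes that approach; the `IdMapper` class and the identifier values are hypothetical, and the source does not say how Alibaba's user library is actually built.

```python
class IdMapper:
    """Toy unified user library built on union-find."""

    def __init__(self):
        self._parent = {}

    def _find(self, x):
        # Register unseen identifiers as their own group.
        self._parent.setdefault(x, x)
        while self._parent[x] != x:
            # Path halving keeps lookups fast as chains grow.
            self._parent[x] = self._parent[self._parent[x]]
            x = self._parent[x]
        return x

    def link(self, a, b):
        """Record that identifiers a and b belong to the same user."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_user(self, a, b):
        return self._find(a) == self._find(b)

# Identifiers are (type, value) pairs; all values are made up.
mapper = IdMapper()
mapper.link(("id_card", "ID-0001"), ("phone", "555-0100"))
mapper.link(("phone", "555-0100"), ("login", "alice"))

police_key = ("id_card", "ID-0001")    # how public security identifies the person
hospital_key = ("phone", "555-0100")   # how the hospital identifies the patient
matched = mapper.same_user(police_key, hospital_key)
print(matched)
```

Because links are transitive, the ID card number also resolves to the login name even though the two were never observed together.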

Data standards matter not only for data association within an enterprise but also across organizations and enterprises. Few companies in the industry are able to establish data standards such as a user library; Alibaba is one of them.

As big data matures, the more data the better, and an enterprise's internal data can no longer meet its needs. For example, Taobao wants a complete profile of each user, such as real-time status, hobbies, zodiac sign, spending level, and what car the user drives, for precision marketing. Taobao's own data is not enough. At this point many companies buy data (some also crawl information themselves, which is relatively simple). Alibaba, for example, acquired Gaode and Youmeng and purchased Weibo-related data, using it to tag users and build more accurate user profiles.


However, data trading is not that simple, because data transactions raise several major problems:

1) How to protect user privacy information

The EU has introduced stringent data protection regulations, and the United States imposes heavy penalties on operators who sell customer data. How can China's still-emerging big data industry ensure that users' private information is not leaked? Non-private information, such as geographic, meteorological, and map data, is very valuable to open up, trade, and analyze, but once users' private data is involved, especially an individual's, there are ethical and legal risks.

Desensitizing data before it is traded may be one solution, but it does not solve the problem completely. Alibaba therefore proposed another: platform-guaranteed "usable but not visible" technology. Alibaba Cloud, for example, acts as the trading platform, an intermediary guarantor much like Alipay. Both parties upload their data to the Alibaba Cloud data trading platform, and each can use the other's data to obtain specific results, for instance by uploading algorithms and models. Neither party can see any of the other's detailed data.
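The idea of a platform that lets data be used without being seen can be sketched as follows. This is a toy model, not Alibaba Cloud's design: the `DataTradingPlatform` class, its methods, and the sample data are hypothetical, and a single fixed aggregate query stands in for the uploaded algorithms and models. The minimum group size is an added assumption, included so that a result cannot expose a single user.

```python
class DataTradingPlatform:
    """Toy escrow: holds each party's rows and releases only aggregates."""

    def __init__(self, min_group_size=2):
        self._datasets = {}
        self._min_group_size = min_group_size

    def upload(self, party, rows):
        # rows maps a shared user key to that party's attribute value.
        self._datasets[party] = dict(rows)

    def average_by_group(self, group_party, value_party):
        """Average value_party's numbers per group defined by group_party."""
        groups = self._datasets[group_party]
        values = self._datasets[value_party]
        buckets = {}
        for user_key, group in groups.items():
            if user_key in values:
                buckets.setdefault(group, []).append(values[user_key])
        # Suppress small groups so no individual row can be inferred.
        return {
            group: sum(vals) / len(vals)
            for group, vals in buckets.items()
            if len(vals) >= self._min_group_size
        }

platform = DataTradingPlatform(min_group_size=2)
platform.upload("retailer", {"u1": 100.0, "u2": 300.0, "u3": 50.0})
platform.upload("map_provider", {"u1": "Hangzhou", "u2": "Hangzhou", "u3": "Beijing"})

# The map provider learns average spend per city, never per user.
result = platform.average_by_group("map_provider", "retailer")
print(result)
```

In Python the underscore-prefixed attributes are only a convention, so real isolation would have to come from the platform running in an environment neither party controls.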


2) Owner of the data

As a means of production, data differs from land in the agricultural era and capital in the industrial era: it does not disappear after use. Who owns data once it has been purchased? How can one ensure that the purchaser does not resell the data? And after the purchaser has processed the data, who owns the processed result?


3) Legality of the use of data

In big data marketing, the most common application is precision marketing, and in data transactions, the most valuable data is personal data. The customer profiles built in day-to-day analysis group and tag large numbers of customers so that marketing and services can be targeted. But if a user's personal information (such as age, gender, or occupation) is used for marketing, must the user's consent be obtained before advertising is sent, or can the information be used directly?

Data trading and the combined use of data therefore have to address data standards, legislation, and regulation. Dedicated laws, and even specialized regulators such as a "Data Regulatory Commission" overseeing data trading and use, cannot be ruled out in the future. If that day comes, it will be a good thing: data delivers greater value only when it circulates. If every company keeps its data to itself, then eliminating information silos inside the enterprise simply leaves information silos between enterprises.

When multi-party data is used reasonably and appropriately, the so-called "wool comes from the pig" effect appears. Alibaba's loan service, which draws on B2B and Taobao data, is an example. For the pig (B2B and Taobao), it is a spillover effect of massive data in a commercial scenario. For the sheep (Ant's micro-loan business), data from different dimensions is brought together at low cost and reacts like a chemical process to produce a leap in value. This is a typical feature of intelligent business in the big data era.

This is the value of big data, and it is why we are welcoming a new era named after "data".

