Big Data and our starting points
August 17, 2014
Several months have passed since the idea came to me while I waited outside the library for my child to borrow books. I have kept thinking about it and made a few attempts, but everything has remained "in progress". During this period I re-read the book "The Era of Big Data" and related online material, and beyond my daily work I keep trying to pull this line of thinking into a summary.
From the early pursuit of exact statistics, through random sampling, to today's big-data idea that "the sample is the whole", I feel there is something we can do. Although I have been busy with "enterprise applications", there has always been an "Internet gene" in me. From my first collection of 50-odd POIs to the recent collection of 300-odd mobile Internet access records, we have done some data analysis and made our own attempt in the era of big data.
The core driving force behind big data comes from the desire to measure, record, and analyze the world. Changes in information technology are visible everywhere, but today's changes center on the "T" (technology) rather than the "I" (information). Now it is time to put the "I" in the spotlight and focus on the information itself.
Whether on the mature Internet or the booming mobile Internet, a huge amount of data is generated every day. Unprocessed and unmined, that data will simply sleep on hard disks everywhere. Processing power and storage keep getting cheaper, so it is time to turn "data" into "information" and "information" into "knowledge", and to let knowledge guide an organization's operations and an individual's daily life.
"Data" may be nothing more than a load on a hard disk; "knowledge" is a source of human happiness.
This article summarizes my thinking on the road from "load" to "source". It is offered for reference, and I hope to exchange ideas with fellow practitioners.
I. Big Data
It is said that Microsoft's spell checker is built on tens of millions of data items, and Google's machine translation on hundreds of billions. I once heard professors at the University of Liverpool say that "big data" worthy of the name should reach at least 10 to the 10th power of records. By that order of magnitude, genuinely big data sources are rare around us.
However, the wrestling example analyzed in "The Era of Big Data" uses records of about 64,000 wrestling matches over 11 years, both to hunt for anomalies in that big data and to analyze the human and psychological factors behind wins and losses.
It seems that "big" also depends on the object being sampled, so "comparatively big data" may be the more accurate label here. The comparatively big data around me includes the location and behavior records I have collected about myself, plus some online process records of participants.
If you hold big data resources of your own, we can analyze them together.
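To make "finding anomalies" concrete, here is a toy sketch in the spirit of that wrestling example (the well-known sumo match-fixing study uses the same setup): do wrestlers who still need one win to keep rank beat already-safe opponents more often than chance would predict? The record layout and the tiny data set below are invented purely for illustration; they are not from the original study.

```python
# Each record describes a final-day bout: wrestler A sits at 7-7 and
# needs a win; wrestler B is already safe with 8 or more wins.
# (wins_a, wins_b, winner) -- all values invented for illustration.
bouts = [
    (7, 8, "a"), (7, 9, "a"), (7, 8, "a"),
    (7, 10, "b"), (7, 8, "a"), (7, 9, "a"),
]

on_bubble = [b for b in bouts if b[0] == 7 and b[1] >= 8]
bubble_wins = sum(1 for _, _, winner in on_bubble if winner == "a")

# If bouts were fair coin flips we would expect roughly 50%; a much
# higher rate over tens of thousands of matches is the anomaly.
print(f"7-7 wrestler won {bubble_wins}/{len(on_bubble)} decisive bouts")
```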
II. Expectations for Big Data
Facing a data set, one should bring certain expectations, at least a "preliminary understanding and expectation". Simply demanding "find the patterns inside" goes nowhere if you do not know what kind of pattern you are looking for.
"A preliminary understanding of the data's features" matters because no algorithm or model is created out of thin air; much as the theory of inventive problem solving holds, there are established routes to a solution.
I have heard of a dinosaur-scale company analyzing bank data. Its baseline expectations included the following (a small sketch of the first two items follows the list):
- Changes in customers' funds: those whose balances keep shrinking are "sunset" customers, those whose balances keep rising are "sunrise" customers.
- The regional distribution of sunrise customers, so that branches and account managers can be added where needed to improve service satisfaction.
- The behavioral characteristics of sunrise customers, in order to provide more considerate service and mine their potential value.
- The growth of depositors and deposits at each branch, with monthly and annual fluctuations, to guard against fraudulent deposit-gathering and inflated performance.
- Beyond the initial "expected items", intermediate results and statistical analysis can yield further outputs, such as:
  - the age distribution of customers
  - the time-of-day patterns of different customers' branch visits
  - queue lengths at different branches in different time periods
  - parking-space occupancy and a reservation plan for important customers
  - the correlation between customer traffic and weather
  - changes in the relationship between savings, investment, and national economic growth
  - forecasts of the bank's key business indicators and subsequent investment volume
  - the relationship between investment and risk-adjusted return
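To make the "sunrise/sunset" idea concrete, here is a minimal sketch, assuming monthly balances sit in a pandas DataFrame; the column names and toy figures are my own invention, not anything from the project described above.

```python
import pandas as pd

# Hypothetical monthly balances per customer (invented schema).
balances = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 2],
    "month": ["2014-01", "2014-02", "2014-03"] * 2,
    "balance": [900, 700, 500, 400, 600, 900],
})

def classify(group: pd.DataFrame) -> str:
    # Sum of month-over-month changes: falling balances mark a
    # "sunset" customer, rising balances a "sunrise" customer.
    change = group.sort_values("month")["balance"].diff().sum()
    return "sunrise" if change > 0 else "sunset"

labels = balances.groupby("customer_id").apply(classify)
print(labels)  # customer 1 -> sunset, customer 2 -> sunrise
```

The summed difference is the crudest possible trend test; a real analysis would fit a per-customer regression and set thresholds before deciding where to add branches or account managers.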
III. Data Processing and Results
First, hold a large data set (a bank's, say) or tap a continuously growing data source (an Internet application's, say). Then run statistical analysis guided by a preliminary understanding of the data's features, from which an initial data pattern can emerge.
To establish and refine a model, the data's basic features can be understood along the following dimensions and combined into a multi-dimensional model (a skeleton of how they fit together appears after the list):
- the expected-result model
- intermediate-result models produced during analysis
- the actual-result model
- manually identified data-feature models
- a multi-dimensional data statistics model
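How these dimensions might interlock is sketched below in plain Python. Every class and field name is a placeholder of mine; the point is only the feedback loop from actual results back into expectations.

```python
from dataclasses import dataclass, field

@dataclass
class DataModel:
    expected: dict = field(default_factory=dict)      # what we expect to find
    intermediate: dict = field(default_factory=dict)  # by-products of analysis
    actual: dict = field(default_factory=dict)        # what the data shows

    def refine(self) -> None:
        """Fold surprises in the actual results back into expectations."""
        for key, value in self.actual.items():
            if self.expected.get(key) != value:
                self.expected[key] = value  # revise the expectation

model = DataModel(expected={"peak_hour": 9})
model.actual = {"peak_hour": 11}   # the statistics contradicted the guess
model.refine()
print(model.expected)              # {'peak_hour': 11}
```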
On September 3, 2014, at the Baidu World conference, Baidu released its big-data prediction products: "city prediction", "scenic-spot prediction", "college entrance examination prediction", "World Cup prediction", "European league prediction", and "economic index prediction". Mysterious as they may sound, I admire the ability to build prediction models on such resources.
IV. My Data Processing Model
Modeling outline (a feature-extraction sketch follows the list):
- time features
- location and place features
- day-over-day growth characteristics
- human movement characteristics (trajectories)
- anomaly statistics and ranking
- correlation between the Internet-access process and weather changes
- a comprehensive analysis of individual characteristics
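Here is a minimal sketch of the time, location, and daily-growth features in this outline, assuming each record is a (timestamp, latitude, longitude) triple; that format and the toy records are my assumptions, not the actual layout of my data set.

```python
from collections import Counter
from datetime import datetime

# Hypothetical access records: (timestamp, latitude, longitude).
records = [
    ("2014-08-01 08:55", 39.91, 116.40),
    ("2014-08-01 12:10", 39.92, 116.41),
    ("2014-08-02 09:02", 39.91, 116.40),
]

# Time features: which hours of the day are busiest.
hours = Counter(datetime.strptime(ts, "%Y-%m-%d %H:%M").hour
                for ts, _, _ in records)
# Location features: round coordinates into coarse grid cells.
cells = Counter((round(lat, 2), round(lon, 2)) for _, lat, lon in records)
# Daily growth: record volume per day.
days = Counter(ts[:10] for ts, _, _ in records)

print("busiest hour:", hours.most_common(1))
print("top cell:", cells.most_common(1))
print("records per day:", dict(days))
```

Trajectories, anomaly ranking, and the weather correlation would build on the same per-record loop, joining in external data where needed.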
So far, a single-machine, "non-traversal" run of the model above over the existing "comparatively big" data set takes more than 45 minutes. Next, I plan to set up and try some dedicated processing environments.
More to follow. Thank you!
V. Appendix: Hadoop and Storm
Storm is a free, open-source, distributed, highly fault-tolerant real-time computation system. What sets it apart from other big-data solutions is its processing model.
Hadoop is essentially a batch system: data is imported into the Hadoop Distributed File System (HDFS), distributed to nodes for processing, and when processing finishes the results are written back to HDFS for the initiator to use. Hadoop's high throughput and massive-data capacity make it easy to process huge data sets, but its weaknesses are as pronounced as its strengths: high latency, slow response, and complex operations and maintenance. Storm was created to make up for Hadoop's missing real-time capability.
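The contrast can be caricatured without either framework. The sketch below is plain Python, not Hadoop or Storm code; it only illustrates why batch results arrive late while streaming results are usable at any moment.

```python
from collections import Counter

def batch_count(lines):
    # Batch style (Hadoop-like): the whole data set is available first,
    # and the result appears only after the full pass completes.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def stream_count(event_source):
    # Streaming style (Storm-like): each event is processed on arrival,
    # and a running result is available at every step.
    counts = Counter()
    for line in event_source:   # could be an unbounded queue or socket
        counts.update(line.split())
        yield counts            # low-latency partial result

data = ["big data", "big ideas", "data data"]
print(batch_count(data))                 # one answer, at the end
for partial in stream_count(data):
    print(dict(partial))                 # an updated answer per event
```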
The S and R Languages
The S language is an interpreted language developed at AT&T Bell Laboratories for data exploration, statistical analysis, and plotting. Its rich data types (vectors, arrays, lists, objects, and so on) are especially well suited to implementing new statistical algorithms, and its interactive mode and powerful static and interactive graphics make exploring data easy.
The main commercial implementation of S is S-PLUS, which builds on the S language with further development by the statistical sciences division of MathSoft. As a general methodological toolbox for statisticians and researchers, S-PLUS emphasizes graphical presentation, exploratory data analysis, statistical methods, the development of new statistical computing tools, and extensibility.
S-PLUS can be used directly for standard statistical analyses, but its main strength is interactive work: discovering information in data from many angles, with new statistical methods easy to implement.
The R language, widely used in statistics, is an open-source implementation of the S language; it appeared in the early 1990s as an offshoot of S.
R is a complete software environment for data handling, computation, and graphics. Its facilities include: a data storage and handling system; array-computing tools (particularly strong for vector and matrix operations); a complete, coherent set of statistical analysis tools; excellent statistical graphics; and a simple, powerful programming language that controls data input and output and supports branches, loops, and user-defined functions. R is less a statistics package than a mathematical computing environment: instead of offering only canned statistical routines where the user names a data set and a few parameters, R supplies integrated statistical tools alongside a broad range of mathematical and statistical functions, so users can analyze data flexibly and even build new statistical methods to suit their own needs.
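For readers without R at hand, the exploratory pattern described above, vectorized computation plus summary statistics, can be approximated in Python with NumPy. This analogy is mine; it is not R code.

```python
import numpy as np

# Simulate a measurement vector, then summarize it -- the vectorized,
# interactive style the text attributes to S and R.
x = np.random.default_rng(0).normal(loc=10, scale=2, size=1000)

print("mean:", x.mean())
print("std:", x.std(ddof=1))
print("quartiles:", np.percentile(x, [25, 50, 75]))

# Matrix operations are equally direct:
m = np.arange(6).reshape(2, 3)
print(m @ m.T)  # 2x2 matrix product
```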
Address: http://blog.xufun.cn/?p=655