The most important part of data analysis is probably not the programming tools, analysis software, or statistical techniques you are familiar with. What matters more is understanding the principles, assumptions, and limitations behind your statistical knowledge (statistics, econometrics, time series, non-parametric methods), and knowing what the various data analysis tools (such as data mining) can and cannot deliver: what you can and cannot say when you see the results of a set of statistical tests.
Behind all of this, you need a complete logical framework of "science" that lets you understand the nature of the tools in your hands, so that you can correctly extract valid information from the data. Otherwise you are just piling up analysis results with tools you do not really understand, which is not only useless but harmful.
After learning this, you need to train your abilities and insight to become a data analyst. Since the role is "data analyst", let us start with "data" and "analysis" separately.
"Data" includes data collection, processing, visualization, and other content. Each stage has a critical impact on the final result. The technical content is only part of it. More importantly, you need to understand data collection (is there any sampling deviation? How to correct or improve it ?) Handle (are there any vulnerabilities or exceptions not considered ?) The logic behind it.
For example, if you analyze stock data to design trading strategies, you need to understand not only the data processing itself but also the basics of the financial market. When using stock prices, should you use the raw closing price or the adjusted price? If the adjusted price, should it be forward-adjusted or backward-adjusted? These choices have little to do with data analysis technique; they are determined entirely by the purpose of your analysis. You therefore have to fully understand the logic and motivation behind these concepts in order to make the right choice for your own goals.
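To make the distinction concrete, here is a minimal sketch of how forward- and backward-adjusted prices could be derived from raw closes and a cumulative adjustment factor. The column names and the factor convention are assumptions; data vendors define adjustment factors differently, so check your source before using anything like this.

```python
import pandas as pd

def adjusted_prices(close: pd.Series, adj_factor: pd.Series) -> pd.DataFrame:
    """Compute forward- and backward-adjusted prices from raw closing prices.

    close      -- raw closing prices indexed by date
    adj_factor -- cumulative adjustment factor for splits and dividends
                  (assumed convention: backward-adjusted price = close * factor)
    """
    # Backward adjustment keeps early prices on today's scale of total return.
    backward = close * adj_factor
    # Forward adjustment rescales history so the latest price equals the raw close.
    forward = close * adj_factor / adj_factor.iloc[-1]
    return pd.DataFrame({"close": close,
                         "forward_adjusted": forward,
                         "backward_adjusted": backward})
```

Which series to feed into a backtest follows from the purpose of the analysis, exactly as the paragraph above argues, not from the code.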
Data visualization is more of an art: how to present information in the most appropriate way to the people who need it. First you must fully understand what the information is and what features it has, so that you can choose suitable visualization tools.
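As a trivial but hedged sketch of that idea (assuming matplotlib and pandas, with a hypothetical helper name): the chart type should follow from the feature of interest, for example a line chart for a trend over time versus a histogram for a distribution.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_by_feature(series: pd.Series, kind: str) -> None:
    """Choose the chart from the feature of interest, not from habit."""
    fig, ax = plt.subplots()
    if kind == "trend":           # time-indexed values: show how they evolve
        series.plot(ax=ax)
    elif kind == "distribution":  # unordered values: show how they are spread
        series.plot.hist(ax=ax, bins=30)
    ax.set_title(series.name or "")
    plt.show()
```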
The other part is "analysis", which of course means the various analysis models. Here, too, we need to understand the logic behind the models and place them in the context of the entire project, rather than looking at the models in isolation.
All in all, "understanding" the data and the information it carries is very important. This determines whether your analysis and presentation methods are appropriate and whether the final conclusion is reliable.
Now we can answer the original question: to become a data analyst, focus on "understanding" your knowledge and forming it into a system, instead of mechanically applying models like a robot. Train your programming skills with this mindset, understand the principles of the thing you are analyzing, and learn as many details as possible. Only on this basis can we talk about data analysis.
If you want to work on big data, you need to learn how to lay the foundation yourself. This article introduces several books for learning about big data and massive data processing.
Massive data processing can be divided into two parts: system construction technology and the application of massive data.
First, let's talk about system construction. The mainstream technology today is Hadoop, which is built around the MapReduce distributed framework, so you can start there. Before distributed systems emerged, however, the mainstream was centralized architectures such as DB2 and Oracle. Why are distributed architectures used now? Because centralized architectures are limited by I/O performance and process data too slowly. If some new hardware technology could process massive data quickly enough to meet performance requirements, the centralized architecture would actually be preferable, because it is more stable and carries less operations and maintenance pressure. At present, though, centralized systems either fail to meet the requirements or are too expensive. I look forward to the emergence of a technology that can transmit and process data very quickly; when it appears, centralized architectures will be attractive again.

Now let's talk about the application of massive data. Applications mainly involve data mining and machine learning algorithms, with different scenarios such as personalized search and recommendation, social network discovery, precision marketing, precise advertising, real-time optimal paths, and artificial intelligence. Which direction to study depends on whether you want to work on the supporting system technology or on the application technology integrated with the business.
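To make the MapReduce idea behind Hadoop concrete, here is a minimal sketch of the classic word-count job written in the map/reduce style. It is not Hadoop's own API: the script simulates the shuffle step locally so it can be run on its own, and the same mapper/reducer logic is what you would submit via Hadoop Streaming (file names and invocation details are illustrative assumptions).

```python
#!/usr/bin/env python3
"""Word count in the MapReduce style; runnable locally as
`cat input.txt | python3 wordcount.py`."""
import sys
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: pairs arrive sorted by key, so group them and sum the counts.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local simulation of the shuffle/sort step that Hadoop performs between phases.
    mapped = sorted(mapper(sys.stdin))
    for word, total in reducer(mapped):
        print(f"{word}\t{total}")
```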
If you want to learn about system construction technology, read the following books:
If you want to learn data mining and machine learning algorithms, we recommend first reading an introduction to data mining and the principles of statistical analysis, and then learning Mahout, R, and MATLAB.