What is the "milestone" in learning data analysis?

Author: Still

Data analysis is a comprehensive discipline: it combines hard programming skills with the softer knowledge of analytical reasoning.

I remember Wei Wei once wrote a famous answer ("Wei Wei: How to be proficient in Excel?") that outlines five levels of Excel mastery, from beginner to expert: shortcut keys, function formulas, charts, pivot tables, and VBA. I prefer to call these "milestones": on the one hand, everyone climbing the Excel skill ladder will face them sooner or later; on the other, each one you cross brings a substantial leap in your Excel ability.

For a data scientist, Excel is only a small part of the skill map. Data scientists come from many schools and follow different growth paths, so no single article can cover every milestone along the way. Still, some are common to all, and this article tries to summarize the data science "milestones" I have in mind.

What is a milestone?

A "milestone" is an essential part of a body of knowledge: no matter which tutorial you follow or how you start learning, it is a level you must eventually face. It may not be difficult, but if you want to go further, there is no way around it.

Crossing a milestone brings a qualitative leap in skill. For example, learning VLOOKUP is not especially hard, yet it greatly boosts your Excel productivity; and mastering VBA lets you do things with Excel that were previously impossible.

Milestone 1: Understanding what defines the big data era

Presumably every student of data science scoffs at the term "big data".

It's not just that the term has been overused; it's that it says almost nothing. What exactly is big data? There is still no clear definition today.

But the big data era is real. Data-related technological innovation and industry are now in full swing, and although they take many forms, the framing of a "big data era" is not wrong.

This raises a very big question: what is at the core of the big data era? Why could data-related industries suddenly erupt and flourish? The answer directly shapes a data scientist's career planning and worldview.

My personal understanding is: the big data era is the simultaneous eruption of massive data + algorithms + computing power.

Massive data -- information technology has vastly improved our ability to record raw data, from macro economic and financial data to micro industry data, and from traditional structured data to images, sound, and text. The enormous growth in raw data has opened far bigger windows for understanding and exploring the world.

Algorithms -- every algorithm used in data analysis is a crystallization of human ingenuity, and most have long histories. Take today's hottest field, deep learning: the DNN lineage dates back to Rosenblatt's perceptron in 1957, and backpropagation, the core of DNN training, to the mid-1970s. Until a suitable algorithm exists, a computer's raw computational power cannot be brought to bear on a specific data analysis problem. By now, algorithms designed for specific business needs have become extremely rich and have improved greatly in both performance and effectiveness.

Computing power -- the last link in the chain of the big data era, and arguably the decisive one. On the hardware side, supercomputers, CPUs and GPUs, and advances in storage and I/O performance have contributed greatly; on the software side, distributed computing and frameworks such as MapReduce have raised speed, high-level languages from R to Python to Golang keep being born, and an endless stream of packages makes the data analyst's "console" ever more user-friendly.

All three are indispensable, but computing power is undoubtedly the vanguard. The world has long had modest amounts of data and algorithms, enough for preliminary statistical analysis, but far from enough to create a new era. Only a huge burst of data, coupled with computing power breaking through its bottleneck, could make the whole industry start growing exponentially.

In my view, only by understanding the origins of the big data era can we find our place in the tide of the times.

Milestone 2: R/Python


Two years ago, we were still discussing which software to use for statistical analysis. There were many options: SPSS, SAS, R, Python, Excel, EViews, Stata, C++, Java ... too many to count.

A year ago, we were debating whether to learn R or Python. By then the two had already divided the world between them; the software listed above still had occasional supporters, but no longer stirred much controversy.

Now, everyone is asking "how do I get started with Python?"

Over the past two years, a large number of established tools have gradually faded from data scientists' view. The ways these tools died can be briefly summarized as follows.

1. The software's functional ceiling was too low. Typical examples are GUI-driven tools like EViews and SPSS. They once became famous for their easy-to-use interfaces, but an interface can only expose limited functionality; the ceiling was too low, and the new era abandoned them.

2. Not open source. The typical example is SAS, which was once the ultimate, and the only, solution for big data analytics. Its disk-based read/write model made it the only software of its time that could handle datasets too large for memory, and because SAS ships with a large library of statistical procedures, a small amount of code could complete a complex analysis and output a polished, professional report. But SAS was defeated on two fronts: its archaic syntax and its closed source. SAS syntax is a headache, neither object-oriented nor functional, and newcomers need a long time to adapt to it; more importantly, because SAS is not open source there are no external packages to call, so it simply cannot keep up with the rapid development of new algorithms. Data scientists have now largely forgotten SAS, though it retains an absolute advantage in biomedicine and banking (mostly due to regulatory barriers or industry inertia).

3. Too hard. This refers to C++ and Java, whose code is too low-level. The advantage is fast execution; the downside is slow development. To finish a data analysis it is usually worth sacrificing some execution speed in favor of development time; alternatively, after the initial analysis and algorithm design, the algorithm can be handed to the backend to be reimplemented in C++ or Java. In business settings, neither clients nor analysts have the energy to wait for wheels to be reinvented; what they want is an easy-to-learn high-level language, and obviously only R and Python are left.

4. Killed by stray AoE damage from deep learning. For a long time it was hard to say R would lose to Python. But ever since AlphaGo flooded everyone's screens, deep learning has become red-hot and R began to face a crisis: the mainstream deep learning stack (the TensorFlow framework, the Keras package, and so on) is built almost entirely on Python. Awkwardly, R simply missed the deep learning wave. Only recently have experts ported deep learning frameworks to R, but it seems too late: Python already sits atop the list of data analysis tools. Of course, R is not extinct. Because R is used so heavily in academia, almost every new algorithm from academic research is first prototyped and tested on the R platform, so R's stock of algorithm packages is something Python cannot replace.

So we can see that a data analysis tool's fate is almost sealed at birth. R and Python have come out on top thanks to the combination of command line + open source + high-level language. For data scientists these two languages are destined to be your best friends, and every data scientist should adopt at least one of them as a main language.

This is a major milestone for data scientists. Whatever your background, the moment you start learning R or Python, you are finally using the tools best suited to the big data era and stepping into a new world.

It is worth mentioning that Golang may join this list in the future; after all, this Google-developed and Google-backed language has grown very rapidly since its birth. But whether it joins the club depends both on Golang's own efforts and on the course of history.

Milestone 3: Spark


Over the past two years, big data engineers have agreed: of all the skills on their list, Spark is the most effective helper for a salary raise.

Spark has a distinct character. On the one hand, it is currently the fastest data analysis platform, fully inheriting and transcending Hadoop's MapReduce framework. On the other hand, it is highly abstract and requires a lot of functional programming with lambda functions, so it is cumbersome to use. Spark's community is also far less complete than R's or Python's; although Spark is built on Scala and can invoke Scala and Java packages, it remains unwieldy, to say nothing of the fact that deploying a distributed Spark computing platform is no simple matter either.

The paragraph above may contain many words you cannot parse. That's fine; to sum up, you really only need to know two things about Spark:

1. Spark is very fast! Spark is very fast! Spark is very fast!

2. Spark is particularly difficult! Spark is particularly difficult! Spark is particularly difficult!

There is no doubt that Spark is a particularly unpleasant milestone. But the reward matches the effort, and, at least in my experience, Spark is not that hard. Once you have adapted to writing map and reduce with lambda functions, you may even fall in love with the feeling. And if you have some Java background, Spark becomes much easier.
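As a taste of that map/reduce style, here is a toy word count written with plain Python lambdas; the data is made up for illustration, no Spark installation is needed, and real Spark code would use RDD methods such as map and reduceByKey instead.

```python
from functools import reduce

# Made-up input lines (in Spark these would live in an RDD).
lines = ["spark is fast", "spark is hard", "spark is fast"]

# "map" step: turn every word into a (word, 1) pair.
pairs = [(w, 1) for line in lines for w in line.split()]

# "reduce" step: fold the pairs into a word -> count dictionary,
# using a lambda just as Spark's functional style encourages.
counts = reduce(
    lambda acc, pair: {**acc, pair[0]: acc.get(pair[0], 0) + pair[1]},
    pairs,
    {},
)
print(counts)  # {'spark': 3, 'is': 3, 'fast': 2, 'hard': 1}
```

Once this pattern feels natural, Spark's API mostly differs in where the data lives, not in how you express the computation.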

Milestone 4: Start from the need when choosing a model, instead of applying models mechanically

Currently, data scientists are often divided into three factions.

The statistics faction: data scientists with a statistics background like to solve problems with mathematical methods, pay particular attention to the logic of every step of an analysis, and are very fond of hypothesis testing. After long training in parametric statistics, to them no model parameter is credible, and even no model itself is credible, until there is a sound mathematical derivation and every parameter has passed its test. When statisticians first encounter machine learning they tend to be deeply uncomfortable with its "black box" character, but in the end they are usually won over by its superior predictive power.

The computer science faction: CS-born data scientists have a strong engineering temperament; their habit of thought is modular, step-by-step engineering thinking. They care more about the steps and results of machine learning than about the logic of each step. The advantage is that they have no inertia to overcome when learning data analysis; they simply treat modeling as another engineering task. The downside is that they sometimes pay too much attention to the model itself and ignore its conditions of applicability.

The business faction: these data scientists come from all walks of life, but one way or another they have always worked close to data. Their thinking starts from business logic; they pay special attention to the early stages of model building, especially feature engineering. And they always expect the model's output to match their prior guesses, or they might be furious.

All three can become good data scientists, but before they mature they often run into the same trouble, which can be summed up as: modeling without looking at the need.

The statistician's go-to models may be multiple linear regression, time series analysis, and nonparametric statistics; the computer scientist's are the fashionable DNN and SVM; the business analyst's model is their business logic. All three tend to force their most familiar way of thinking onto whatever messy real-world problem appears. For example, when analyzing house price data, the statistician insists on running a regression or time series analysis, the computer scientist likes to define classification labels and then apply a classification algorithm, and the business analyst first walks through the pricing logic and raises a pile of hypotheses, but cannot find a good model to express them.

This is not the best way to work. Good data analysis should combine all three mindsets: first, like the business faction, analyze the raw data carefully, with solid exploratory analysis and feature engineering; then, like the statisticians, weigh each model's applicability and choose reasonable model assumptions; finally, like the computer scientists, model boldly, tune hard, and march ever onward toward overfitting.

To sum up, the most important thing is to abandon the thinking patterns ingrained in your own background and, starting from what the data itself demands, choose the most appropriate analysis methods, data-cleaning approach, feature engineering, and models.

Unfortunately, this is a metaphysical milestone. Most people may know it exists but cannot tell whether they have truly crossed it. Still, there is no doubt that merely knowing you should start from the need is already rare.

Milestone 5: Learn to improve your code

For the statistics and business factions, perhaps the biggest milestone of all is understanding code and the aesthetics of code.

Finance practitioners may know exactly how to polish a beautiful PowerPoint deck, and statisticians how to write concise, clear proofs. But when it comes to code, most are still content to treat the IDE as a scratchpad, satisfied as long as the thing runs. Never mind improving the code itself: many R newcomers stuff their scripts with for loops, and loops within loops, making run times grow exponentially.

To be fair, you cannot entirely blame them. Data analysis code is generally hard to write well: it requires repeated trial and error, during which it is all too easy to write haphazardly and messily. If, in the end, you are too lazy to consolidate it, it simply stays a mess.

But in real work, data analysis code still needs to be readable. Otherwise, how will the backend engineers read your algorithm when rewriting it? How will the colleague preparing the documentation and slides interpret your code? Will you even recognize your own code a month later?

When this practical need arises, data scientists begin to seek the beauty of code.

Making the code neat and tidy is only the first step: adjusting indentation, writing comments carefully, chunking the code sensibly, limiting the characters per line, and so on. After that, you need to learn how to make your code faster (for example, the first lesson of R: replace for loops with the apply family of functions) and how to make it more readable (for example, wrapping commonly used logic into functions). The aesthetics of code is endless and worth every data scientist's exploration.
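The same loop-versus-apply lesson translates directly to Python. Here is a sketch, on invented data, of refactoring the grow-a-list-in-a-loop habit into an idiomatic one-liner that gives the identical result:

```python
# Made-up data: square a column of numbers.
values = [1, 2, 3, 4, 5]

# The habit to outgrow: growing a result list inside an explicit loop.
squares_loop = []
for v in values:
    squares_loop.append(v * v)

# The idiomatic rewrite: one readable expression, same result
# (the Python analogue of replacing an R for loop with apply).
squares_comp = [v * v for v in values]

assert squares_loop == squares_comp
print(squares_comp)  # [1, 4, 9, 16, 25]
```

The comprehension is not just shorter; it states the intent (transform every element) in a single expression, which is exactly the readability this milestone is about.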

I have rambled on and written quite a pile, without knowing whether it will be useful to you, dear reader.

Milestones matter. We use them to mark past achievements, and newcomers use their predecessors' milestones as the direction of their own efforts. Everyone can list many "milestones" in their own field; writing them down for newcomers to read would be a very good thing.

Source: 36 Big Data

End.
