With the rise of big data, almost every field is flooded with information, and basic data processing falls far short when you face the browsing records and behavioral data of thousands of users. If you only know how to operate a few analysis tools but not how to analyze data logically, you are still doing simple data processing rather than getting to the core of planning strategy.
Of course, basic skills are the most important link. If you want to become a data scientist, you should have some understanding of the following languages and tools:
R
Of all the programming languages, you can forget the rest, but the one you must not forget is R. It has been quietly around since 1997, and its biggest advantage is that it is free: an alternative to expensive statistical software such as MATLAB or SAS.
Over the past few years, however, its fortunes have turned, and it has become a treasure in the eyes of the scientific community. It is no longer only nerdy statisticians who know it; Wall Street traders, biologists, and Silicon Valley developers are all quite familiar with R. Companies as diverse as Google, Facebook, and Bank of America use R, and its commercial utility continues to improve.
The advantage of R is that it is easy to get started with. With R, you can sift through complex datasets, manipulate data with sophisticated modeling functions, and create clean graphs to present the numbers, all in just a few lines of code. Think of it as a supercharged version of Excel.
R's greatest asset is its vibrant ecosystem: the R community keeps adding new packages and building up a rich set of features. More than 2 million people are currently estimated to be using R, and a recent survey shows that it is by far the most popular language in the data science community, used by 61% of respondents (followed by Python at 39%).
It has also attracted Wall Street's attention. Traditionally, securities analysts pored over Excel late into the night, but R is now increasingly used in financial modeling, especially as a visualization tool, said Niall O'Connor, a vice president at Bank of America. "R makes our mundane tables stand out," he said.
R is maturing into a professional language for data modeling, although it is still limited when a company needs to build large-scale products, and some say it is already being usurped by other languages.
"R is more useful for sketching than for building," said Michael Driscoll, CEO of Metamarkets and a top data analyst.
"You won't find R behind Google's page ranking or Facebook's friend-recommendation algorithms. Engineers will build a prototype in R and then rewrite the model in Java or Python."
To give a well-known example of R in use: in 2010, Paul Butler used R to build his map of Facebook's world, showing how rich and powerful the language's data visualization can be. He now uses R less than he used to.
"R is showing its age; it is slow and clunky with huge datasets," Butler said.
So what does he use instead?
Python
If R is a neurotic but likeable geek, Python is its easygoing counterpart.
Python is rapidly becoming mainstream as a hybrid of R's fast, sophisticated data mining capability and a more pragmatic language for building products. Python is easier and more intuitive to learn than R, and its ecosystem has grown remarkably fast in recent years, making it increasingly capable of the statistical analysis that used to be R's territory.
"In the past two years, there has been a dramatic shift from R to Python, like a giant push forward," Butler said.
In data processing, there is usually a trade-off between scale and sophistication, and Python has emerged as a compromise. IPython Notebook (a notebook-style tool) and NumPy can serve as a scratchpad for lighter workloads, while Python itself is a good tool for medium-scale data processing. Python also has a rich data community that provides a wealth of toolkits and statistical features.
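As a rough illustration of the "scratchpad" style of work described above, here is a minimal sketch that filters a small dataset and summarizes it in a few lines. It uses only Python's standard `statistics` module so that it runs anywhere; the records, field layout, and the 10-second threshold are all hypothetical and chosen only for the example. In practice, the same steps would more likely be done with NumPy or a similar library.

```python
import statistics

# Hypothetical browsing records: (user, page, seconds_on_page).
records = [
    ("ann", "home", 12.0),
    ("bob", "search", 48.5),
    ("ann", "cart", 30.0),
    ("cy", "home", 7.5),
    ("bob", "cart", 62.0),
]

# Filter: keep only visits longer than 10 seconds (arbitrary threshold).
long_visits = [record for record in records if record[2] > 10]

# Summarize the durations of the remaining visits.
durations = [seconds for _, _, seconds in long_visits]
mean_duration = statistics.mean(durations)
median_duration = statistics.median(durations)

print("visits kept:", len(long_visits))
print("mean seconds:", mean_duration)
print("median seconds:", median_duration)
```

The point is less the specific numbers than the workflow: load, filter, and summarize interactively, then move to heavier tooling only when the data outgrows this approach.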
Bank of America uses Python to build new products and infrastructure within the bank, as well as to process financial data. "Python is broad and flexible, so people flock to it," said O'Donnell.
However, although its strengths make up for R's shortcomings, Python is still not the highest-performance language, and only occasionally can it power large-scale core infrastructure, Driscoll says.
Julia
Most of today's data science is done in R, Python, Java, Matlab, and SAS, but there are still gaps to fill, and the newcomer Julia has spotted this pain point.
Julia is still too obscure to be widely used in industry, but data hackers get excited when they talk about its potential to take the throne from R and Python. The reason is that Julia is a high-level, extremely fast, and expressive language: it is much faster than R, potentially better than Python at handling larger-scale data, and easy to pick up.
"Julia will become more and more important, and in the end, anything you can do in R and Python you will be able to do in Julia," Butler said.
For now, what holds Julia back is probably its youth. Julia's data community is still in its infancy, and it needs more toolkits and packages before it can compete with R or Python.
Driscoll says that although it is young, it is promising and could become mainstream.
Java
Java and Java-based frameworks form the core of several of Silicon Valley's biggest technology companies, Driscoll says. If you look at Twitter, LinkedIn, or Facebook, you will find that Java is the foundational language for all of their data engineering infrastructure.
Java does not offer the same quality of visualization as R and Python, and it is not the best tool for statistical modeling. But if you are moving past prototyping and need to build large systems, Java is usually your most fundamental choice.
Hadoop and Hive
A whole group of Java-based tools has sprung up to meet the demand for large-scale data processing. Hadoop has become the key Java-based framework for batch processing. It is much slower than some other processing tools, but it is extremely accurate and widely used for backend analysis. It pairs well with Hive, a query-based framework that works nicely alongside it.
Scala
Scala is another language that runs on the Java platform. Like Java, it is increasingly the tool of choice for anyone who wants to do machine learning at large scale or build high-level algorithms. It is expressive and also capable of building reliable systems.
"Java is like building with steel; Scala is like clay that you can put in a kiln and bake into steel," Driscoll said.
Kafka and Storm
What do you reach for when you need fast, real-time analysis? Kafka will be your best friend. It has been around for about five years, but only recently, with the surge in stream processing, has it become widely popular.
Kafka was born at LinkedIn and is an extremely fast messaging system. Its weakness? It is too fast: operating in real time invites errors, and it occasionally misses things.
You can't have your cake and eat it too. "You have to make a trade-off between accuracy and speed," Driscoll said. So the big tech companies in Silicon Valley all run two pipelines: Kafka or Storm for real-time processing, and then Hadoop for batch processing. That sounds cumbersome and slow, but the upside is that it is very, very precise.
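The two-pipeline idea can be illustrated with a toy sketch in plain Python. This is not Kafka, Storm, or Hadoop code and uses none of their APIs: the function names, the sample events, and the rule that every fifth event is lost are all assumptions made up for the illustration. The fast path counts events as they arrive and occasionally misses one; the batch path recounts from the complete stored log and is exact.

```python
from collections import Counter


def realtime_counts(event_stream, miss_every=5):
    """Fast path: update counts as each event arrives.

    To mimic a pipeline that occasionally misses things, every
    `miss_every`-th event is skipped. This is a simulation only.
    """
    counts = Counter()
    for position, page in enumerate(event_stream, start=1):
        if position % miss_every == 0:
            continue  # simulated missed message
        counts[page] += 1
    return counts


def batch_counts(event_log):
    """Batch path: recount everything from the complete stored log."""
    return Counter(event_log)


# Hypothetical page-view events, in arrival order.
events = ["home", "search", "home", "cart", "home",
          "search", "home", "cart", "home", "home"]

fast = realtime_counts(events)
exact = batch_counts(events)

# Counter subtraction keeps only positive differences, so this shows
# which pages the fast path undercounted and by how much.
missed = exact - fast

print("real-time:", dict(fast))
print("batch:    ", dict(exact))
print("missed:   ", dict(missed))
```

In a setup like this, the fast counts would be shown immediately and later replaced by the batch counts once the slower job finishes, which is the accuracy-versus-speed trade-off Driscoll describes.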
Storm is another JVM-based framework that is steadily gaining popularity for stream processing in Silicon Valley. Unsurprisingly, it was taken in by Twitter, which has a strong interest in fast event handling.
Matlab
Matlab has real staying power. Despite its high price, it is widely used in a few very specific niches, including research-intensive machine learning, signal processing, and image recognition.
Octave
Octave is very similar to Matlab, except that it is free. However, it is mentioned almost exclusively in academic signal-processing circles.
Go
Go is another emerging newcomer. Developed by Google and loosely derived from C, it is becoming a contender against Java and Python for building robust infrastructure.
There are many tools to choose from, but you do not need to master every one of them. Know your goal and direction, and choose the tool that suits it best; that will help you work more efficiently and get accurate results.
(Responsible editor: Mengyishan)