With the surge of big data, information is flooding almost every field, and simple data processing is far from enough when faced with the browsing records and behavioral data of thousands of users. But if you only know how to operate some analysis software, without knowing how to analyze data logically, you are still just doing simple data processing rather than getting to the core of planning strategy.
Of course, basic skills are the most important link. If you want to become a data scientist, you should have some understanding of the following tools:
R
Among all the programming languages, you could forget the rest, but R is the one you must remember. It has been quietly around since 1997, and its biggest advantage is that it is free, an alternative to expensive statistical software such as Matlab or SAS.
But over the past few years it has become a treasure in the eyes of the scientific community. It is familiar not only to nerdy statisticians but also to Wall Street traders, biologists, and Silicon Valley developers. Companies as varied as Google, Facebook, and Bank of America all use R, and its commercial utility continues to grow.
The advantage of R is that it is easy to get started with. Through R, you can sift through complex datasets, manipulate data with sophisticated modeling functions, and create clean graphics to present the numbers, all with just a few lines of code. It has been likened to a hyperactive version of Excel.
R's greatest asset is the vibrant ecosystem around it: the R community continually adds new packages and features to an already rich built-in function set. More than 2 million people are estimated to use R, and a recent survey shows that it is by far the most popular language in the data science community, used by 61% of respondents (followed by Python at 39%).
It is also attracting attention on Wall Street. Traditionally, securities analysts pored over Excel late into the night, but R is now increasingly used for financial modeling, especially as a visualization tool, said Niall O'Connor, a vice president at Bank of America. "R makes our dull tables stand out," he said.
R is maturing into a professional language for data modeling, although its power is still limited when a company needs to build large-scale products, and some say it is already being usurped by other languages.
"R is more useful for sketching than for building," says Michael Driscoll, CEO of Metamarkets and a top data analyst. "You won't see R at the core of Google's page ranking or Facebook's friend recommendation algorithms. Engineers build a prototype in R and then hand the model off to be written in Java or Python."
For an example of R used well: in 2010, Paul Butler used R to build his world map of Facebook, showing how rich and powerful the language's data visualization can be, although he now uses R less than before.
"R is showing its age. It is slow and unwieldy with huge datasets," Butler said.
So what does he use instead?
Python
If R is a neurotic, lovable geek, Python is its easygoing cousin.
Python is quickly becoming mainstream by combining R's sophisticated data mining capabilities with a more pragmatic language. It is also more intuitive and easier to learn than R, and its ecosystem has grown incredibly fast in recent years, making it increasingly capable of the statistical analysis once reserved for R.
"Over the past two years, there has been a dramatic shift from R to Python, and the industry keeps moving in that direction," Butler said.
In data processing, there is usually a trade-off between scale and sophistication, and Python has emerged as a compromise. IPython Notebook and NumPy can serve as a scratchpad for lighter workloads, while Python itself is a good tool for medium-scale data processing. Python also has a rich data community that offers a wealth of toolkits and statistical features.
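As a rough illustration of the lighter, scratchpad-style work that NumPy suits, here is a minimal sketch. The page-view numbers and variable names are made up for this example, and it assumes NumPy is installed.

```python
import numpy as np

# Hypothetical data: page views per user session (invented for illustration).
page_views = np.array([3, 7, 1, 12, 5, 8, 2, 9])

# Boolean mask: keep only the sessions with more than 4 page views.
engaged = page_views[page_views > 4]

# Quick summary statistics, computed without writing explicit loops.
mean_views = page_views.mean()
engaged_share = engaged.size / page_views.size

print(engaged)
print(mean_views, engaged_share)
```

A few lines like these are usually enough to explore a small dataset interactively before deciding whether a heavier tool is needed.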
Bank of America uses Python to build new products and interfaces within the bank's infrastructure, as well as to process financial data. "Python is broad and flexible, so people flock to it," said O'Donnell.
However, although its strengths can compensate for R's shortcomings, Python is still not the highest-performance language, and only occasionally can it be used for large-scale core infrastructure, Driscoll says.
Julia
Most of today's data science is done in R, Python, Java, Matlab, and SAS, but there are still gaps to fill, and the newcomer Julia has spotted this pain point.
Julia is still too obscure to be widely adopted in industry, but data hackers get excited when they talk about its potential to take the throne from R and Python. The reason is that Julia is a high-level, incredibly fast, and expressive language: it is much faster than R, potentially more scalable than Python for larger data, and fairly easy to learn.
Julia is becoming more and more important, and eventually anything you can do in R and Python will be possible in Julia, Butler says.
For now, though, Julia's youth is holding it back. Its data community is still in its infancy and needs more toolkits and packages before it can compete with R or Python.
Driscoll says that although Julia is young, it is attracting attention and looks very promising.
Java
Driscoll says that Java and Java-based frameworks form the core of the largest technology companies in Silicon Valley. If you look inside Twitter, LinkedIn, or Facebook, you will find that Java is the foundational language for all of their data engineering infrastructure.
Java does not offer visualization as good as R's or Python's, and it is not the best tool for statistical modeling, but if you are moving past the prototype stage and need to build a large system, Java is usually your best choice.
Hadoop and Hive
A set of Java-based tools has emerged to meet the huge demand for data processing. Hadoop has become the key Java-based framework for batch processing. It is considerably slower than other processing tools, but it is extremely accurate and widely used for backend data analysis. It pairs well with Hive, a query-based framework that runs on top of it.
Scala
Scala is another JVM-based language and, much like Java, it is becoming the tool of choice for anyone who wants to do large-scale machine learning or build high-level algorithms. It is expressive and capable of building reliable systems.
"Java is like building in steel; Scala is like clay that you can put into a kiln and bake into steel," Driscoll said.
Kafka and Storm
What do you reach for when you need fast, real-time analysis? Kafka will be your best friend. It has been around for five years, but it has only recently become popular as stream processing has taken off.
Kafka was born inside LinkedIn and is an extremely fast messaging system. Kafka's weakness? It is too fast: working in real time invites errors, and it sometimes misses things.
"You can't have your cake and eat it too; you have to choose between accuracy and speed," Driscoll said. So all the big tech companies in Silicon Valley use two pipelines: Kafka or Storm for processing data in real time, and then Hadoop for batch processing, which sounds cumbersome and slow but has the advantage of being very, very accurate.
Storm is another JVM-based framework, written mainly in Clojure, and it is steadily gaining popularity for stream processing in Silicon Valley. It was acquired by Twitter, which is not surprising given Twitter's strong interest in fast event processing.
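The two-pipeline idea can be sketched in a few lines of plain Python. This is only a toy illustration and does not use Kafka, Storm, or Hadoop: the event log, the function names, and the simulated dropped event are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical event log: each entry is the page a user viewed.
event_log = ["home", "search", "home", "cart", "home", "search"]

def realtime_counts(stream, dropped_positions):
    """Fast path: count events as they arrive, tolerating lost events.

    `dropped_positions` simulates events the real-time system missed.
    """
    counts = Counter()
    for position, page in enumerate(stream):
        if position in dropped_positions:
            continue  # this event never reached the fast path
        counts[page] += 1
    return counts

def batch_counts(log):
    """Slow path: recount everything from the complete stored log."""
    return Counter(log)

# The fast path answers immediately but misses the event at position 2.
fast = realtime_counts(event_log, dropped_positions={2})

# The batch path runs later over the full log and is exact.
exact = batch_counts(event_log)

# Differences the batch run would correct in the real-time numbers.
corrections = {
    page: exact[page] - fast[page]
    for page in exact
    if exact[page] != fast[page]
}

print(fast)
print(exact)
print(corrections)
```

The fast counts are available right away, and the later batch run quietly repairs whatever the fast path got wrong, which is the trade-off between speed and accuracy that Driscoll describes.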
Matlab
Matlab has proven remarkably enduring. Despite its high price, it is still used quite extensively in very specific niches, including research-intensive machine learning, signal processing, and image recognition.
Octave
Octave is much like Matlab, except that it is free. However, it is mentioned almost exclusively in academic signal processing circles.
Go
Go is another emerging newcomer. Developed by Google and loosely derived from C, it is becoming a competitor to Java and Python for building robust infrastructure.
There are many tools to choose from, but I do not think you need to master every one of them. Know your goal and direction, then select the most appropriate tool and use it well. That will help you improve efficiency and achieve accurate results.