Objective
There is a big data project: you know the problem domain, you know what infrastructure to use, and you may even have decided which framework will process all of that data, but one decision keeps getting delayed: which language should I choose? (Or, more specifically: which language should I force all my developers and data scientists to use?) The question can only be postponed for so long; sooner or later it has to be decided.
Of course, nothing prevents you from using other mechanisms, such as XSLT transformations, to work with big data. But in general, three languages dominate big data today: R, Python, and Scala, plus Java, which has never left the enterprise world. So which language should you choose, and why, or when?
Here's a brief introduction to each language to help you make a reasonable decision.
R
R is often described as "a language developed by statisticians, for statisticians". If you need an esoteric statistical model for your calculations, you will probably find it on CRAN; it isn't called the Comprehensive R Archive Network for nothing. For analysis and plotting, nothing beats ggplot2. And if you need more power than your machine provides, you can use the SparkR bindings to run Spark from R.
However, if you are not a data scientist and have not used MATLAB, SAS, or Octave before, it may take some adjustment before you are productive in R. While R is great for analyzing data, it is less good as a general-purpose language. You can build a model in R, but you should consider translating it into Scala or Python for production, and you are unlikely to write a cluster control system in R (good luck debugging it if you do).
Python
If your data scientists don't use R, they probably know Python inside and out. Python has been popular in academia for more than a decade, especially in fields such as natural language processing (NLP). As a result, if you have a project that requires NLP, you face an embarrassment of choices, including the classic NLTK, topic modeling with Gensim, and the blazing-fast and accurate spaCy. Similarly, Python is well served for neural networks, with Theano and TensorFlow, followed by scikit-learn for machine learning and NumPy and pandas for data analysis.
And there is Jupyter/IPython, the web-based notebook server framework that lets you mix code, graphics, and almost anything else in a shareable log format. This has long been one of Python's killer features, but the concept has proved so useful that it now appears in almost every language that supports a read-eval-print loop (REPL), including Scala and R.
Python is often supported in big data processing frameworks, but it is frequently not a "first-class citizen". For example, new features in Spark almost always appear in the Scala/Java bindings first, and PySpark may have to wait a few minor versions to catch up with those updates (this is especially true for Spark Streaming/MLlib development).
In contrast to R, Python is a traditional object-oriented language, so most developers will feel fairly comfortable with it, whereas first exposure to R or Scala can be intimidating. A minor gripe is that your code needs correct whitespace. This splits people into two camps: those who say "this is great for readability" and those who argue that in 2016 we should not have to fight the interpreter over a single misplaced character in a line of code.
Scala
Now for Scala: of the four languages in this article, Scala is the easiest one to recommend, because everyone appreciates its type system. Running on the JVM, Scala has largely succeeded in marrying the functional and object-oriented paradigms, and it is currently making great strides in the financial world and in companies that need to operate on very large amounts of data, often in a massively distributed way (such as Twitter and LinkedIn). It is also the language that powers both Spark and Kafka.
Because Scala runs inside the JVM, it gets instant access to the Java ecosystem, yet it also has a wide range of "native" libraries for handling data at scale (in particular Twitter's Algebird and Summingbird). It also includes a very handy REPL for interactive development and analysis, just like Python and R.
I personally love Scala because it includes many useful programming features, such as pattern matching, and it is considerably less verbose than standard Java. However, there is often more than one way to do anything in Scala, and the language promotes this as a feature. That's great! But given its Turing-complete type system and its assortment of squiggly operators ("/:" for foldLeft, ":\" for foldRight), it is easy to open a Scala file and think you are looking at a particularly nasty piece of Perl. A set of good practices and guidelines is needed when writing Scala (the Databricks style guide is a reasonable one).
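As a minimal illustration of those operators, here is a sketch in plain Scala (no external libraries) contrasting the symbolic forms with the named methods most style guides prefer; note that the symbolic forms are deprecated in recent Scala releases.

```scala
object FoldOps {
  def main(args: Array[String]): Unit = {
    val xs = List(1, 2, 3, 4)

    // Symbolic forms: "/:" is foldLeft, ":\" is foldRight.
    val sumLeft  = (0 /: xs)(_ + _)   // ((((0 + 1) + 2) + 3) + 4)
    val sumRight = (xs :\ 0)(_ + _)   // (1 + (2 + (3 + (4 + 0))))

    // The same computations with the named methods.
    val sumLeftNamed  = xs.foldLeft(0)(_ + _)
    val sumRightNamed = xs.foldRight(0)(_ + _)

    println(Seq(sumLeft, sumRight, sumLeftNamed, sumRightNamed)) // all 10
  }
}
```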
Another drawback is that the Scala compiler is a touch slow, enough to bring back memories of the old "Compiling!" coffee-break days. However, it has a REPL, big data support, and web-based notebook frameworks in the form of Jupyter and Zeppelin, so I am willing to forgive many of its quirks.
Java
Finally, there is always Java: unloved, forsaken, owned by a company that seems to care about it only when it can make money by suing Google (note: Oracle), and thoroughly unfashionable. Only drones in the enterprise world use Java! Yet Java might be a great fit for your big data project. Consider Hadoop MapReduce, which is written in Java. HDFS? Also written in Java. Even Storm, Kafka, and Spark run on the JVM (written in Clojure and Scala, respectively), which means Java is a "first-class citizen" in these projects. And there are newer technologies such as Google Cloud Dataflow (now Apache Beam) that, until very recently, supported only Java.
Java may not be the rock star's language of choice. But while other developers are tearing their hair out trying to sort out callbacks in their Node.js application, using Java gives you access to a large ecosystem of profilers, debuggers, monitoring tools, and libraries for enterprise security and interoperability, and much more besides, most of which has been battle-tested over the past two decades. (Unfortunately, Java turns 21 this year; we are all old.)
One of the main complaints against Java is that it is verbose and tedious and lacks the REPL (which R, Python, and Scala all have) needed for interactive development. I have seen 10 lines of Scala-based Spark code balloon into a monstrous 200 lines in Java, complete with huge type declarations that take up most of the screen. However, the new lambda support in Java 8 goes a long way toward improving the situation. Java will never be as compact as Scala, but Java 8 really does make developing in Java less painful.
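To give a sense of the conciseness that the 10-line Scala figure alludes to, here is a minimal word-count sketch using Spark's RDD API from Scala; the application name, master setting, and input path are placeholders, not anything prescribed by this article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Placeholder configuration; in a real job the master and input path
    // would come from your cluster setup and arguments.
    val sc = new SparkContext(
      new SparkConf().setAppName("word-count").setMaster("local[*]"))

    val counts = sc.textFile("input.txt")   // read lines
      .flatMap(_.split("\\s+"))             // split into words
      .map(word => (word, 1))               // pair each word with a count of 1
      .reduceByKey(_ + _)                   // sum counts per word

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```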
As for a REPL? Not yet. Java 9, due next year, will include JShell, which should satisfy your REPL needs.
Which language wins?
Which language should you use for your big data project? I'm afraid the answer is still "it depends." If you are doing heavy data analysis that relies on obscure statistical calculations, it would be hard to argue against R. If you are doing NLP or training dense neural networks across GPUs, Python is a good choice. And if you want a hardened, production-oriented data pipeline solution with all the important operational tooling, Java or Scala is an excellent choice.
Of course, it doesn't have to be either/or. For example, with Spark you can use R or Python to train a model and machine learning pipeline on static data, then serialize that pipeline out to storage, where it can be picked up by your production Scala Spark Streaming application. While you should not over-invest in any one language (or your team will quickly tire of language fatigue), using a heterogeneous set of languages that play to different strengths can pay off in a big data project.
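As a rough sketch of the Scala half of that workflow, using Spark's Structured Streaming API as one possibility: the storage path, the socket source, and the assumption that the saved pipeline was trained on a text column named "value" are all illustrative, not taken from this article.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

object ScoringStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pipeline-scoring").getOrCreate()

    // Load a pipeline that was fitted offline (for example in PySpark or SparkR)
    // and saved to shared storage. The path is a placeholder.
    val model = PipelineModel.load("hdfs:///models/text-pipeline")

    // Illustrative streaming source: lines of text arriving on a socket,
    // exposed as a DataFrame with a single "value" column.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Apply the same feature transforms and model that were used in training.
    val scored = model.transform(lines)

    val query = scored.writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```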