Big Data Learning: What Spark Is and How to Perform Data Analysis with Spark




This article explains what Spark is and how to analyze data with it. Anyone interested in learning big data can use it as a starting point.






What is Apache Spark?



Apache Spark is a cluster computing platform designed for speed and general-purpose use.



On the speed side, Spark extends the popular MapReduce model to efficiently support more types of computation, such as interactive queries and stream processing. Speed matters when processing large data sets: it can mean the difference between exploring data interactively and waiting minutes or even hours for a result. One of Spark's key features is the ability to run computations in memory, but even for complex applications that must run on disk, Spark is still more efficient than MapReduce.



More generally, Spark can handle workloads that previously required several independent distributed systems, including batch applications, iterative algorithms, interactive queries, and stream processing. By supporting all of these workloads in the same engine, Spark makes it easy to combine different types of processing, which is often necessary in production data analysis pipelines. It also reduces the administrative burden of maintaining separate tools.



Spark is designed to be highly accessible: it offers simple APIs in Python, Java, Scala, and SQL, and ships with a rich set of built-in libraries. Spark also integrates with other big data tools. In particular, it can run on Hadoop clusters and access any Hadoop data source, including Cassandra.



Spark Core Components



Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interaction with storage systems, and more. Spark Core also provides the API that defines resilient distributed datasets (RDDs), which are Spark's main programming abstraction. An RDD represents a collection of items distributed across many compute nodes that can be processed in parallel, and Spark Core supplies many APIs for building and manipulating these collections.
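
To make the RDD abstraction concrete, here is a minimal sketch in Scala (one of Spark's supported languages); the application name, the local master setting, and the numbers themselves are illustrative assumptions, not anything prescribed by Spark:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddExample {
      def main(args: Array[String]): Unit = {
        // Run locally, using all available cores; the app name is arbitrary.
        val conf = new SparkConf().setAppName("RddExample").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Build an RDD from a local collection and transform it in parallel.
        val numbers = sc.parallelize(1 to 100)
        val evenSquares = numbers.filter(_ % 2 == 0).map(n => n * n)

        println(evenSquares.take(5).mkString(", "))  // 4, 16, 36, 64, 100
        sc.stop()
      }
    }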



Spark SQL

Spark SQL is Spark's package for working with structured data. It allows querying data via SQL as well as the Hive Query Language (HQL), and it supports many data sources, including Hive tables, Parquet, and JSON. Beyond providing a SQL interface to Spark, Spark SQL lets developers intermix SQL queries with the programmatic data manipulations supported by RDDs in Python, Java, and Scala, all within a single application, thus combining SQL with complex analytics. This tight integration with the rich computing environment provided by Spark sets Spark SQL apart from other open source data warehouse tools. Spark SQL was introduced in the Spark 1.0 release.
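
As a rough illustration of mixing SQL with programmatic operations, the sketch below uses the SparkSession entry point (which was introduced after the Spark 1.0 era this article describes); the file people.json and its name and age fields are hypothetical:

    import org.apache.spark.sql.SparkSession

    object SqlExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("SqlExample")
          .master("local[*]")
          .getOrCreate()

        // Load a JSON data source; "people.json" is a made-up file
        // assumed to contain records with name and age fields.
        val people = spark.read.json("people.json")
        people.createOrReplaceTempView("people")

        // Mix a SQL query with programmatic DataFrame operations.
        val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
        adults.show()

        spark.stop()
      }
    }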



Shark was an older SQL-on-Spark project, developed at the University of California, Berkeley, that ran on Spark by modifying Hive. It has since been superseded by Spark SQL, which provides better integration with the Spark engine and language APIs.



Spark Streaming

Spark Streaming is the Spark component that enables processing of live streams of data. Examples of streaming data include log files generated by production web servers and queues of messages containing status updates posted by users of a web service. Spark Streaming provides an API that closely matches Spark Core's RDD API, making it easy for programmers to learn the project and to move between applications that manipulate data stored in memory, on disk, or arriving in real time. Spark Streaming was designed to provide the same degree of fault tolerance, throughput, and scalability as Spark Core.
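
Here is a minimal sketch of the streaming API, assuming log lines arrive on a local TCP socket; the host, port, and the "ERROR" marker are invented for the example:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingExample {
      def main(args: Array[String]): Unit = {
        // Two local threads: one to receive data, one to process it.
        val conf = new SparkConf().setAppName("StreamingExample").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Treat lines arriving on a TCP socket as a live stream of log records.
        val lines = ssc.socketTextStream("localhost", 9999)
        val errors = lines.filter(_.contains("ERROR"))
        errors.count().print()  // number of error lines in each 10-second batch

        ssc.start()
        ssc.awaitTermination()
      }
    }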



MLlib

Spark comes with a machine learning library called MLlib. MLlib provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as functionality for model evaluation and data import. It also exposes some lower-level machine learning primitives, including a generic gradient descent optimization algorithm. All of these methods are designed to scale out across a cluster.
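
As one example of MLlib's clustering support, this sketch runs k-means on a tiny made-up dataset; a real workload would load an RDD of feature vectors from a distributed data source instead:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object MllibExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("MllibExample").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // A tiny invented dataset: points forming two obvious clusters.
        val points = sc.parallelize(Seq(
          Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
          Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
        ))

        // Cluster into k = 2 groups, with at most 20 iterations.
        val model = KMeans.train(points, 2, 20)
        model.clusterCenters.foreach(println)

        sc.stop()
      }
    }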



GraphX

GraphX is a library for manipulating graphs (for example, a social network's friend graph) and performing graph-parallel computations. Like Spark Streaming and Spark SQL, GraphX extends the Spark RDD API, allowing us to create a directed graph with arbitrary properties attached to each vertex and edge. GraphX also provides a variety of operators for manipulating graphs, along with a library of common graph algorithms.
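
A small sketch of GraphX's property-graph model (GraphX exposes a Scala API); the user names and "follows" relationships below are invented for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Edge, Graph}

    object GraphxExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("GraphxExample").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Vertices carry a user name; edges carry a relationship label.
        val users = sc.parallelize(Seq(
          (1L, "alice"), (2L, "bob"), (3L, "carol")
        ))
        val follows = sc.parallelize(Seq(
          Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")
        ))
        val graph = Graph(users, follows)

        // Run PageRank from GraphX's built-in algorithm library.
        val ranks = graph.pageRank(tol = 0.001).vertices
        ranks.collect().foreach(println)

        sc.stop()
      }
    }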



Cluster Managers

Under the hood, Spark is designed to scale efficiently from one compute node to hundreds of nodes. To achieve this while maximizing flexibility, Spark can run over several cluster managers, including Hadoop YARN, Apache Mesos, and a simple cluster manager included in Spark itself called the Standalone Scheduler. If you are installing Spark on a group of empty machines, the Standalone Scheduler provides an easy way to get started; if you already have a Hadoop YARN or Mesos cluster, Spark's support for these managers allows your applications to run on them. Chapter 7 explores the different options and how to choose the right cluster manager.
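
For illustration, here are typical spark-submit invocations for each cluster manager; the host names, application class, and jar file are placeholders, not real values:

    # Standalone Scheduler: master at an assumed host, default port 7077
    spark-submit --master spark://masterhost:7077 --class com.example.MyApp myapp.jar

    # Hadoop YARN: the cluster location is read from the Hadoop configuration
    spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp myapp.jar

    # Apache Mesos: assumed master address on Mesos' default port 5050
    spark-submit --master mesos://masterhost:5050 --class com.example.MyApp myapp.jar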



Who uses Spark, and what do they use it for?



Because Spark is a general-purpose framework for cluster computing, it is used for a diverse range of applications. There are two main groups of users: data scientists and data engineers. We examine each group and how its members use Spark. Unsurprisingly, their typical use cases differ, but we can roughly classify them into two categories: data science and data applications.



Data science tasks

Data science, a discipline that has emerged in recent years, centers on analyzing data. While there is no standard definition, we take a data scientist to be somebody whose main task is to analyze and model data. Data scientists may have experience with SQL, statistics, predictive modeling (machine learning), and programming in Python, MATLAB, or R. Data scientists also know how to format data for further analysis.



Data scientists use their skills to analyze data in order to answer a question or discover insights. Often, their workflow involves ad hoc analysis, so they use interactive shells that let them see the results of queries and snippets of code in the least possible time. Spark's speed and simple APIs shine for this purpose, and its built-in libraries mean that many algorithms are available out of the box.



Spark supports the different tasks of data science through a number of components. The Spark shell makes interactive data analysis easy in Python or Scala. Spark SQL also has a separate SQL shell that can be used for data exploration with SQL, and Spark SQL can likewise be used from within a regular Spark program or from the Spark shell. The MLlib library supports machine learning and data analysis, and Spark also supports calling out to external programs written in MATLAB or R. Spark enables data scientists to tackle problems with far larger data sizes than they could with tools such as R or Pandas.
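
For a flavor of this interactive workflow, here is a hypothetical session in the Scala spark-shell; the shell creates the SparkContext sc for you, and access.log is a made-up file name:

    $ spark-shell
    scala> val lines = sc.textFile("access.log")
    scala> val errors = lines.filter(_.contains("ERROR"))
    scala> errors.count()                   // the count is computed and shown immediately
    scala> errors.take(3).foreach(println)  // inspect a few matching lines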



Sometimes, after the initial exploration phase, a data scientist's work must be productized: extended, hardened (made fault-tolerant), and tuned to become a production data processing application that is itself a component of a business application. For example, a data scientist's initial investigation might lead to a product recommendation system that is integrated into a web application to generate recommendations for users. Often it is a different person, such as an engineer, who productionizes the data scientist's work.



Data processing applications

The other main use case for Spark is best described from an engineer's point of view. Here, engineers refers to the large class of software developers who use Spark to build production data processing applications. These developers understand the concepts and principles of software engineering, such as encapsulation, interface design, and object-oriented programming. They often have a degree in computer science, and they use their engineering skills to design and build software systems that implement a business use case.



For engineers, Spark provides a simple way to parallelize these applications across clusters while hiding the complexity of distributed systems programming, network communication, and fault tolerance. The system gives them enough control to monitor, inspect, and tune applications while still allowing common tasks to be implemented quickly. The modular nature of the API makes it easy to reuse existing work and to test locally.



Engineers choose Spark for their data processing applications because it provides rich functionality, is easy to learn and use, and is mature and reliable. If you're ready, start right away!


