A month of subway reading time went into the "Spark for Python Developers" ebook. I never retain what I read without putting pen to paper, so I casually made a translation in Evernote, partly to amuse myself with English I had not practiced in years. When I organized the notes over the weekend, I found they were a bit more substantial than basic jottings, so I began this series of subway translations.
In this chapter, we will build an isolated virtual environment for development and complete it with the PyData libraries provided by Anaconda alongside Spark. These libraries include Pandas, scikit-learn, Blaze, Matplotlib, Seaborn, and Bokeh. We will proceed as follows:
- Set up the development environment with the Anaconda Python distribution, including the IPython Notebook environment we will use for data exploration.
- Install Spark and the PyData libraries, such as Pandas, scikit-learn, Blaze, Matplotlib, and Bokeh, and make sure they work properly.
- Build a word count example program to make sure everything works (a sketch follows this list).
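As a preview of that last step, a minimal PySpark word count might look like the sketch below. This is only an illustration, assuming a local Spark installation and a hypothetical input file named words.txt; the rest of the chapter builds the environment that makes it runnable.

```python
# Minimal word count sketch with PySpark.
# Assumes a local Spark installation and a hypothetical words.txt file.
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")

counts = (sc.textFile("words.txt")
            .flatMap(lambda line: line.split())   # split lines into words
            .map(lambda word: (word, 1))          # pair each word with 1
            .reduceByKey(lambda a, b: a + b))     # sum counts per word

for word, count in counts.take(10):
    print(word, count)

sc.stop()
```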
Many big data-driven companies have emerged in recent years, such as Amazon, Google, Twitter, LinkedIn, and Facebook. By communicating and sharing their infrastructure concepts, software practices, and data processing frameworks, these companies have nurtured a vibrant open source community. It has evolved into enterprise technologies, systems and software architectures, as well as new infrastructure, DevOps, virtualization, cloud computing, and software-defined networking.
Inspired by the Google File System (GFS) and MapReduce, the open source distributed computing framework Hadoop was developed to process petabytes of data. Taming this complexity at scale while keeping costs low also led to new data stores, such as the recent NoSQL database technologies: the column-oriented database Cassandra, the document database MongoDB, and the graph database Neo4j.
Hadoop, thanks to its ability to handle large datasets, has nurtured a huge ecosystem for iterative and interactive querying of data through Pig, Hive, Impala, and Tez.
Operating Hadoop in MapReduce batch mode alone is cumbersome and tedious. Spark created a revolution in data analysis and processing by overcoming the disk I/O and bandwidth constraints of MapReduce tasks. Spark is implemented in Scala and integrates natively with the Java Virtual Machine (JVM) ecosystem. Spark provided a Python API, PySpark, early on. Built on the robust performance of JVM-based systems, the architecture and ecosystem of Spark are inherently multilingual.
This book focuses on PySpark and the PyData ecosystem. Python is a preferred programming language for data-intensive processing in the academic and scientific communities, and it has evolved into a rich ecosystem: Pandas and Blaze provide libraries for data processing, scikit-learn focuses on machine learning, and Matplotlib, Seaborn, and Bokeh cover data visualization. The purpose of this book is therefore to build an end-to-end architecture for data-intensive applications using Spark and Python. To put these concepts into practice, we will analyze social networks such as Twitter, GitHub, and Meetup, focusing on the social interactions of the Spark and open source communities on these sites.
Building data-intensive applications requires highly scalable infrastructure, polyglot storage, seamless data integration, multi-paradigm analytics processing, and effective visualization. The architecture blueprint for data-intensive applications described below runs throughout this book; it is the book's backbone.
We will discover the application scenarios of Spark within the vast PyData ecosystem.
Understanding the architecture of data-intensive applications
To understand the architecture of data-intensive applications, we use the following conceptual framework, which organizes the architecture into five layers:
- Infrastructure layer
- Persistence layer
- Integration layer
- Analytics layer
- Engagement layer
The following figure describes the five layers of the data-intensive application framework:
From the bottom up, we will go through each layer and its main uses.
Infrastructure layer
The infrastructure layer is concerned with virtualization, scalability, and continuous integration. In practical terms, virtualization here means the development environment in VirtualBox, a virtual machine hosting Spark and the Anaconda Python distribution. To scale further, we could create similar environments in the cloud. The practice of creating an isolated development environment, migrating it to a test environment, and moving it into production as part of continuous integration is supported by DevOps tools such as Vagrant, Chef, Puppet, and Docker. Docker is a very popular open source project that makes deploying and installing new environments easy. This book is limited to building virtual machines with VirtualBox; from the standpoint of the data-intensive application architecture, we describe only the essential steps of virtualization, keeping scalability and continuous integration in mind.
Persistence layer
The persistence layer manages the various data stores adapted to the needs and shapes of the data. It ensures the setup and management of polyglot data storage. This includes relational databases such as MySQL and PostgreSQL; key-value data stores such as Hadoop, Riak, and Redis; column-oriented databases such as HBase and Cassandra; document databases such as MongoDB and Couchbase; and graph databases such as Neo4j. The persistence layer also manages various file systems, such as Hadoop's HDFS. It interacts with a variety of storage systems, from raw hard disks to Amazon S3, and it manages various file storage formats such as CSV, JSON, and Parquet (a column-oriented format).
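As a small illustration of persisting data in a couple of the formats mentioned above, the sketch below uses the PySpark SQLContext available in Spark 1.x. The file names are hypothetical placeholders, not files from the book.

```python
# Sketch: reading JSON and writing/reading Parquet with PySpark (Spark 1.x style).
# The people.json and people.parquet paths are hypothetical examples.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[*]", "PersistenceDemo")
sqlContext = SQLContext(sc)

people = sqlContext.read.json("people.json")      # document-style JSON input
people.write.parquet("people.parquet")            # column-oriented Parquet output

back = sqlContext.read.parquet("people.parquet")  # read the Parquet data back
back.show()

sc.stop()
```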
Integration layer
The integration layer focuses on data acquisition, transformation, quality, persistence, consumption, and control. It is essentially driven by five Cs: connect, collect, correct, compose, and consume. These five steps describe the life cycle of data. They focus on acquiring interesting datasets, exploring the data, and iteratively refining and enriching the collected information in preparation for consumption. The steps perform the following operations (a skeletal Python sketch follows the list):
- Connect: targets the best way to acquire data from the various data sources, considering the APIs they offer (if any), input formats, data ingestion rates, and limitations imposed by the providers.
- Correct: focuses on transforming the data for further processing, while ensuring that the quality and consistency of the received data are maintained.
- Collect: considers which data to store where and in what format, to ease composition and consumption at later stages.
- Compose: concentrates on how to mash up the various datasets collected and enrich the information in order to build a compelling data-driven product.
- Consume: takes care of data provisioning and rendering, and of getting the right data to the right person at the right time.
- Control: a sixth, additional step that will sooner or later be required as the data, the organization, and the participants grow; it is about ensuring data governance.
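To make the life cycle concrete, here is a skeletal sketch of the five Cs as plain Python functions. The function names, the stubbed data source, and the record fields are hypothetical illustrations, not an API defined by the book.

```python
# Skeletal connect/collect/correct/compose/consume pipeline (illustrative only).
import json

def connect(source_url):
    """Connect: acquire raw records from a data source (stubbed here)."""
    return [{"user": "alice", "text": "loving #spark"},
            {"user": "bob", "text": None}]

def collect(records, path):
    """Collect: persist raw records in a chosen format (JSON here)."""
    with open(path, "w") as f:
        json.dump(records, f)

def correct(records):
    """Correct: enforce quality, e.g. drop records with missing fields."""
    return [r for r in records if r.get("text")]

def compose(records):
    """Compose: enrich records, e.g. flag those that mention Spark."""
    return [dict(r, about_spark="#spark" in r["text"]) for r in records]

def consume(records):
    """Consume: render the result for the end user."""
    for r in records:
        print(r["user"], "->", r["about_spark"])

raw = connect("https://example.org/feed")   # hypothetical source
collect(raw, "raw.json")
consume(compose(correct(raw)))
```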
The following figure describes the iterative process of data acquisition, refinement, and consumption:
Analytics layer
The analytics layer is where Spark processes data and extracts useful insights through various models, algorithms, and machine learning pipelines. In this book, the analytics layer uses Spark; we will explore its qualities in depth in the following chapters. In short, it is versatile enough to support multiple paradigms of analytics processing on a single unified platform: batch, streaming, and interactive analytics. Batch processing on large datasets has longer latency, but it lets us extract patterns and insights that can also feed the processing of real-time events in streaming mode. Interactive and iterative analytics are better suited to data exploration. Spark provides bindings and APIs for Python and R, and with its SparkSQL module and Spark DataFrames it offers a very familiar analytics interface.
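To give a flavor of the interactive analytics style mentioned above, here is a small sketch using the Spark 1.x SQLContext and DataFrame API. The tweets.json file and the query are hypothetical examples.

```python
# Interactive analytics sketch with SparkSQL and DataFrames (Spark 1.x style).
# The tweets.json input and the 'user' field are hypothetical examples.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[*]", "AnalyticsDemo")
sqlContext = SQLContext(sc)

tweets = sqlContext.read.json("tweets.json")
tweets.registerTempTable("tweets")

# Ask a question interactively: who are the most active users?
top_users = sqlContext.sql(
    "SELECT user, COUNT(*) AS n FROM tweets GROUP BY user ORDER BY n DESC")
top_users.show(10)

sc.stop()
```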
Engagement layer
The engagement layer handles interaction with the end user, providing dashboards, interactive visualization, and alerting. We will focus on the tools provided by the PyData ecosystem, such as Matplotlib, Seaborn, and Bokeh.
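As a minimal taste of what the engagement layer produces, the sketch below plots hypothetical word-count results with Matplotlib; the words and counts are made-up illustration data.

```python
# Tiny engagement-layer sketch: a bar chart of hypothetical word counts.
import matplotlib.pyplot as plt

words = ["spark", "python", "data", "rdd"]
counts = [12, 9, 7, 4]          # made-up illustration data

positions = range(len(words))
plt.bar(positions, counts)
plt.xticks(positions, words)
plt.title("Word counts")
plt.ylabel("occurrences")
plt.show()
```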
Spark for Python Developers: Building a Spark Virtual Environment (1)