We recently moved the platform from Hadoop 1.x to Hadoop 2.x, and along the way trimmed the codebase by rewriting some Java programs in Scala. During the rollout I noticed that many Spark-on-YARN deployment guides still follow the old Hadoop 1.x approach; on Hadoop 2.2+ most of those extra deployment steps are basically unnecessary, because Hadoop YARN now provides unified resource management.
From the Spark official website:
A Spark application runs on a cluster as an independent set of processes, coordinated by the SparkContext object in your main program (called the driver). Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (Spark's own standalone cluster manager, or Mesos/YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster; these are the worker processes that perform computation and store data for the application. Next, Spark sends your application code (defined in a JAR or in Python files passed to the SparkContext) to the executors. Finally, the SparkContext sends tasks to the executors to run. (In other words, everything is dispatched to the other nodes through the SparkContext, so holding a SparkContext is all you need; a minimal sketch of this flow appears at the end of this section.)

There are a few points worth noting about this architecture:

1. Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. The advantage is that applications are isolated from each other, both on the scheduling side (each driver schedules its own tasks) and on the execution side (tasks from different applications run in different JVMs). However, this also means that data cannot be shared between different Spark applications (SparkContext instances) without writing it to an external storage system.

2. Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes and those processes can communicate with each other, Spark is relatively easy to run even on a cluster manager (for example, Mesos/YARN) that also supports other applications.

3. Because the driver schedules tasks on the cluster, it should run close to the worker nodes, preferably on the same LAN. If you want to send requests to a remote cluster, a better choice is to open an RPC connection to the driver and have it submit operations from nearby, rather than running the driver far away from the worker nodes.

Cluster Manager Types

Spark currently supports three kinds of cluster managers:
(1) Standalone mode: a simple cluster manager included with Spark that makes it easy to set up a cluster.
(2) Apache Mesos mode: a general cluster manager that can also run Hadoop MapReduce and service applications.
(3) Hadoop YARN mode: the resource manager introduced in Hadoop 2.0.
In addition, Spark's EC2 launch scripts make it easy to start a standalone-mode cluster on Amazon EC2 (Amazon Elastic Compute Cloud).

Publishing Code to a Cluster

A recommended way to publish code to a cluster is through the SparkContext constructor, which can take a list of JAR files (Java/Scala) or .egg and .zip files (Python) to ship to the worker nodes. You can also dynamically add files to be sent with SparkContext.addJar and SparkContext.addFile (see the second sketch at the end of this section).

Monitoring

Each driver has a web UI, typically on port 4040, where you can see information about running tasks, executors, storage usage, and so on. Simply enter http://<driver node>:4040 in a browser. The monitoring guide also describes other monitoring options. (If you run Spark in YARN mode, the UI is only available while the application is running; once it stops, that data is gone, but you can persist the event logs, as sketched below.)

Job Scheduling

Spark allows resource allocation to be controlled both across applications (at the cluster manager level) and within an application (when multiple computations run on the same SparkContext); a sketch of the latter closes this section.
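As a minimal sketch of the driver/executor flow described above: the master URL, host name, and application name are placeholders, and a standalone cluster manager is assumed (with YARN the master would be set accordingly).

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ClusterModeSketch {
  def main(args: Array[String]): Unit = {
    // The driver: creating a SparkContext connects to the cluster manager,
    // which allocates executors for this application.
    val conf = new SparkConf()
      .setAppName("cluster-mode-sketch")
      .setMaster("spark://master-host:7077") // placeholder standalone master URL

    val sc = new SparkContext(conf)

    // Each task of this job runs in an executor JVM on a worker node;
    // only the final reduced value comes back to the driver.
    val sum = sc.parallelize(1 to 1000, numSlices = 8).reduce(_ + _)
    println(s"sum = $sum")

    sc.stop()
  }
}
```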
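The two code-publishing mechanisms above can be sketched as follows; the master URL and all jar and file paths are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object ShipCodeSketch {
  def main(args: Array[String]): Unit = {
    // Jars listed in the conf are shipped to worker nodes when executors start.
    val conf = new SparkConf()
      .setAppName("ship-code-sketch")
      .setMaster("spark://master-host:7077")
      .setJars(Seq("/path/to/my-app-deps.jar"))

    val sc = new SparkContext(conf)

    // Dynamically ship more dependencies after the context is up.
    sc.addJar("/path/to/extra-lib.jar")     // added to every executor's classpath
    sc.addFile("/path/to/lookup-table.txt") // executors read it via SparkFiles.get

    // Open the shipped file on the executor side.
    sc.parallelize(1 to 4).foreach { _ =>
      val localPath = SparkFiles.get("lookup-table.txt")
      println(s"file available at: $localPath")
    }

    sc.stop()
  }
}
```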
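For the monitoring note above, here is a sketch of persisting the UI's data via event logs; the HDFS directory is an assumption, and the Spark history server must be pointed at the same path to replay the logs after the YARN application finishes.

```scala
import org.apache.spark.SparkConf

// Enable event logging so that a finished application's UI data survives.
val conf = new SparkConf()
  .setAppName("event-log-sketch")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///user/spark/applicationHistory") // placeholder path
```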
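And for within-application scheduling, a sketch using the fair scheduler, assuming an existing SparkContext `sc` started with spark.scheduler.mode=FAIR; the pool name "etl" and the input path are hypothetical.

```scala
// Jobs submitted from this thread are assigned to the "etl" pool
// (the pool would be defined in fairscheduler.xml).
sc.setLocalProperty("spark.scheduler.pool", "etl")
val n = sc.textFile("hdfs:///data/events").count() // this job runs in the "etl" pool
sc.setLocalProperty("spark.scheduler.pool", null)  // revert to the default pool
```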
Spark on YARN