Saturday, 12 November 2016

Spark Core Concepts

In this post, our Spark program (the Spark application) is called the Spark driver application. The driver program contains the application's main function, defines distributed datasets on the cluster, and then applies operations to them. Where it runs depends on the deploy mode: in client mode it runs on the machine that submits the application, and in cluster mode it runs on a node inside the cluster.
The first step in a Spark driver application is to create a SparkContext. A Spark application runs as an independent set of processes on the cluster, coordinated by that SparkContext.

SparkContext lets your Spark driver application access the Spark cluster through a cluster manager (resource manager): YARN, Mesos, or Spark's own standalone cluster manager.
In order to create a SparkContext, we first need to create a SparkConf.

SparkConf holds the configuration parameters that the driver program passes to SparkContext.
import org.apache.spark.SparkConf
val conf = new SparkConf().setAppName("Wordcount").setMaster("local")
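setAppName() sets the application name shown in the Spark web UI. setMaster() sets the master URL to connect to: local runs Spark locally with one thread, local[4] runs locally with four worker threads (ideally one per core), spark://master:7077 connects to a Spark standalone cluster, and yarn-client or yarn-cluster runs on YARN (from Spark 2.0 onward these two values are deprecated in favor of yarn plus a deploy mode).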
import org.apache.spark.SparkContext
val sc = new SparkContext(conf)
Now the Spark driver application knows how to access the cluster, according to the SparkConf settings. The cluster manager allocates resources to the Spark application. If resources are available, Spark acquires executors on nodes in the cluster.
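
Putting the steps above together, a minimal sketch of a complete driver might look like the following. The object name WordcountDriver, the helper run(), the sample input lines, and the local[2] master are assumptions made for illustration; they are not part of the snippets above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordcountDriver {
  // Counts words in a small in-memory collection and returns word -> count.
  def run(): Map[String, Int] = {
    // Configuration the driver hands to SparkContext.
    val conf = new SparkConf()
      .setAppName("Wordcount")
      .setMaster("local[2]") // assumption: run locally with two threads

    // Creating the SparkContext connects the driver to the cluster manager.
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample input, distributed as an RDD.
      val lines = sc.parallelize(Seq("spark core concepts", "spark driver"))
      lines
        .flatMap(_.split(" "))  // split each line into words
        .map(word => (word, 1)) // pair each word with a count of 1
        .reduceByKey(_ + _)     // sum the counts per word
        .collect()              // action: bring the results back to the driver
        .toMap
    } finally {
      // Stopping the context ends the application and releases its executors.
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    run().foreach { case (word, count) => println(s"$word: $count") }
}
```

Wrapping the work in try/finally is one way to make sure sc.stop() is called even if the job fails, so the executors are not left holding cluster resources.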

An executor is a process launched for the application on a worker (slave) node. Each Spark driver program has its own executors in the cluster, and they keep running for as long as the driver application holds its SparkContext. SparkContext creates jobs, which are split into stages and then into tasks. These tasks are scheduled onto the executors by SparkContext.
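
The job, stage and task breakdown can be observed with a small sketch like the one below. It assumes a local master with two threads. The object name StagesSketch, the helper run(), the sample data and the partition counts (4 and 2) are all hypothetical choices made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StagesSketch {
  // Returns the partition count after the shuffle and the sorted word counts.
  def run(): (Int, Array[(String, Int)]) = {
    val conf = new SparkConf().setAppName("StagesSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // 4 partitions: the first stage runs one task per partition (4 tasks).
      val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 4)

      // reduceByKey needs a shuffle, which marks a stage boundary.
      // The second stage runs one task per output partition (2 tasks).
      val counts = words.map(w => (w, 1)).reduceByKey(_ + _, 2)

      // Prints the RDD lineage; the indentation shows the shuffle boundary.
      println(counts.toDebugString)

      // collect() is an action, so it triggers a single job with two stages.
      (counts.getNumPartitions, counts.collect().sortBy(_._1))
    } finally {
      sc.stop()
    }
  }
}
```

While the application is running, the same breakdown should also be visible in the Spark web UI under the Jobs and Stages tabs.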