Saturday, 12 November 2016

Spark Core Concepts

Here we refer to our Spark program (Spark application) as the Spark driver application. The driver program contains the application's main function, defines distributed datasets on the cluster, and then applies operations to them. Where it runs depends on the deploy mode: in client mode it runs on the machine that submits the application, and in cluster mode it runs on a node inside the cluster.
The first step in a Spark driver application is to create a SparkContext. A Spark application runs as an independent set of processes on the cluster, coordinated by the SparkContext in the driver program.

SparkContext allows your Spark driver application to access the Spark cluster through a cluster manager (resource manager): YARN, Mesos, or Spark's own standalone cluster manager.
In order to create a SparkContext, we first need to create a SparkConf.

SparkConf holds the configuration parameters that the driver program passes to SparkContext.
import org.apache.spark.SparkConf
val conf = new SparkConf().setAppName("Wordcount").setMaster("local")
setAppName() sets the application name shown in the Spark web UI. setMaster() sets the master URL to connect to: local runs Spark locally with one worker thread, local[4] runs locally with four worker threads, spark://master:7077 connects to a Spark standalone cluster, and yarn-client or yarn-cluster runs on YARN in client or cluster mode.
import org.apache.spark.SparkContext
val sc = new SparkContext(conf)
Now the Spark driver application knows how to access the cluster according to the SparkConf settings. The cluster manager allocates resources to the Spark application, and if resources are available, Spark acquires executors on nodes in the cluster.

An executor is a process launched for the application on a worker node; it runs tasks and keeps data in memory or on disk. Each Spark driver program has its own executors in the cluster, and they remain running as long as the driver application has its SparkContext. Each action in the program makes SparkContext create a job, which is split into stages, which are in turn split into tasks. These tasks are scheduled onto executors by SparkContext.
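To tie these pieces together, here is a minimal word-count sketch. It assumes Spark 1.6 (spark-core) is on the classpath; the object name WordCountSketch, the helper countWords and the sample sentences are made up for illustration. The collect() call is the action that triggers a job, and reduceByKey introduces a shuffle, so the job has more than one stage.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; not part of Spark itself.
object WordCountSketch {

  // Counts words in the given lines using an existing SparkContext.
  def countWords(sc: SparkContext, lines: Seq[String]): Map[String, Int] = {
    val counts = sc.parallelize(lines, 2)   // distributed dataset with 2 partitions
      .flatMap(_.split(" "))                // transformation: lines to words
      .map(word => (word, 1))               // transformation: word to (word, 1)
      .reduceByKey(_ + _)                   // shuffle boundary: starts a new stage
    counts.collect().toMap                  // action: this is what triggers the job
  }

  def main(args: Array[String]): Unit = {
    // local[2] runs Spark locally with two worker threads.
    val conf = new SparkConf().setAppName("Wordcount").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val result = countWords(sc, Seq("spark runs jobs", "jobs run stages", "stages run tasks"))
      result.toSeq.sortBy(_._1).foreach(println)
    } finally {
      sc.stop()                             // releases the executors
    }
  }
}
```

While the program is running, the job and its stages and tasks can be inspected in the Spark web UI under the application name set with setAppName().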

Wednesday, 21 September 2016

Scala List Methods

Define a list as
    > val l = List(9, 8, 89, 900)
To get the first element
    > l.head
To get all elements except the first
    > l.tail
To get the last element
    > l.last
To get the length
    > l.length
To get the element at index n (n: Int, counting from 0)
    > l(n)
To concatenate two lists 'l' and 'k'
    > l ++ k
To get the difference (the elements of 'l' that are not in 'k')
    > l.diff(k)
To drop the first n elements (n: Int)
    > l.drop(n)
To drop the last n elements
    > l.dropRight(n)
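The methods above can be tried together in one small program. This is only a sketch: the second list k and the object name ListMethodsDemo are made up for illustration, and the comments show the value each expression evaluates to for these two lists.

```scala
object ListMethodsDemo {
  val l = List(9, 8, 89, 900)
  val k = List(8, 900, 7)        // hypothetical second list for ++ and diff

  def main(args: Array[String]): Unit = {
    println(l.head)              // 9
    println(l.tail)              // List(8, 89, 900)
    println(l.last)              // 900
    println(l.length)            // 4
    println(l(2))                // 89 (indexing starts at 0)
    println(l ++ k)              // List(9, 8, 89, 900, 8, 900, 7)
    println(l.diff(k))           // List(9, 89)
    println(l.drop(2))           // List(89, 900)
    println(l.dropRight(2))      // List(9, 8)
  }
}
```

None of these calls changes l; a Scala List is immutable, so each method returns a new value.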

Saturday, 4 June 2016

Spark-1.6.0 installation

Apache Spark is a fast and general-purpose cluster computing system. Spark runs on both Windows and UNIX-like systems. Java must be installed before Spark. Spark provides high-level APIs in Java, Scala, Python and R, and runs on Java 7+, Python 2.6+, and R 3.1+. For the Scala API, Spark 1.6.0 uses Scala 2.10, so you will need a compatible Scala version (2.10.x).
Install Java, and whichever of Python, R, or Scala you plan to use, before installing Spark, and verify their versions. For a Scala 2.11.8 installation follow the link; note that 2.11.8 is not binary compatible with the Scala 2.10 that Spark 1.6.0 is built against by default, so use a 2.10.x release to compile applications for this build. I have installed Java, Python and Scala on my system.

After installing them, download spark-1.6.0.tgz from here. Extract the Spark tar file with the following command:
$ tar xvf spark-1.6.0.tgz
To build Spark and its example programs, run:
$ cd spark-1.6.0
$ build/mvn -DskipTests clean package

To confirm the Spark installation, run one of the sample Scala programs in the `examples` directory. Here we run a program that computes an approximate value of Pi:
$ ./bin/run-example SparkPi
It prints output like the following (the exact digits vary from run to run):
Pi is roughly 3.142324

If Python is already installed on your system, you can also run the sample Python programs in the 'examples' directory to confirm the Spark installation.
$ ./bin/spark-submit examples/src/main/python/pi.py
It prints output like the following:
Pi is roughly 3.130720
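The two runs print slightly different values because the Pi examples estimate Pi by random sampling rather than computing it exactly. The following single-machine sketch shows the same idea in plain Scala without Spark; it is not the code shipped in the examples directory, and the names PiSketch and estimatePi, the sample count and the seed are made up for illustration.

```scala
import scala.util.Random

object PiSketch {

  // Throws `samples` random points into the square [-1, 1] x [-1, 1] and
  // counts how many land inside the unit circle. The fraction inside
  // approximates (circle area) / (square area) = Pi / 4.
  def estimatePi(samples: Int, seed: Long): Double = {
    val rng = new Random(seed)
    val inside = (1 to samples).count { _ =>
      val x = rng.nextDouble() * 2 - 1
      val y = rng.nextDouble() * 2 - 1
      x * x + y * y <= 1
    }
    4.0 * inside / samples
  }

  def main(args: Array[String]): Unit = {
    println(s"Pi is roughly ${estimatePi(100000, 42L)}")
  }
}
```

More samples give a closer estimate. The Spark versions apply the same idea but spread the sampling across tasks on the cluster.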

Run Sample Python Program In Spark
Install Spark-1.6.0 by following my previous post.
Here I am going to show how you can run the sample Python programs in the 'spark-1.6.0/examples/src/main/python/ml' directory.
The Python program 'tokenizer_example.py' splits sentences into word tokens. It can be run with:
$ cd spark-1.6.0
$ ./bin/spark-submit examples/src/main/python/ml/tokenizer_example.py
Output:
Row(words=[u'hi', u'i', u'heard', u'about', u'spark'], label=0)
Row(words=[u'i', u'wish', u'java', u'could', u'use', u'case', u'classes'], label=1)
Row(words=[u'logistic,regression,models,are,neat'], label=2)
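Note that the third row comes back as a single token. Spark ML's Tokenizer lowercases the text and splits it on whitespace, and that sentence contains commas but no spaces. The plain Scala sketch below mimics that behaviour without Spark. The name TokenizerSketch is made up, and the input sentences are reconstructed from the output above, so their capitalization is an assumption.

```scala
import java.util.Locale

object TokenizerSketch {

  // Lowercase the sentence, then split it on single whitespace characters.
  def tokenize(sentence: String): Seq[String] =
    sentence.toLowerCase(Locale.ROOT).split("\\s").toSeq

  def main(args: Array[String]): Unit = {
    val sentences = Seq(
      "Hi I heard about Spark",
      "I wish Java could use case classes",
      "Logistic,regression,models,are,neat"   // no whitespace, so one token
    )
    sentences.foreach(s => println(tokenize(s).mkString("[", ", ", "]")))
  }
}
```

To split the third sentence as well, the text would have to be split on a pattern that also matches commas; in Spark ML that is the job of RegexTokenizer.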

Thursday, 2 June 2016

Scala Installation

Scala is an "object-functional" programming language that runs on the Java platform (the Java Virtual Machine). It interoperates with both Java and JavaScript. Scala is the implementation language of important frameworks, including Apache Spark, Kafka and Akka.

Install Scala on Ubuntu 15.04
Scala needs Java Runtime 1.6 or later; you can skip installing Java if you already meet this requirement.

Check java version by:
$ java -version
Output :
openjdk version "1.8.0_45-internal"
OpenJDK Runtime Environment (build 1.8.0_45-internal-b14)
OpenJDK Server VM (build 25.45-b02, mixed mode)

Download Scala from here.
$ sudo mkdir /usr/local/src/scala
$ sudo tar xvf scala-2.11.8.tgz -C /usr/local/src/scala/

For quick access, add scala and scalac to your path:
$ vi ~/.bashrc
Add the following lines at the end:
export SCALA_HOME=/usr/local/src/scala/scala-2.11.8
export PATH=$SCALA_HOME/bin:$PATH
Reload .bashrc:
$ . ~/.bashrc

Check scala version:
$ scala -version
Output will be like this:
Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL

Monday, 9 May 2016

Speaker Diarization

Speaker diarization is the task of finding who spoke when in a speech signal or audio recording. The process is carried out without any prior information, such as the number of speakers in the recording or their identities.
A basic speaker diarization system has 3 steps:
  1. Speech Activity Detection
  2. Speaker Segmentation
  3. Speaker Clustering
In speech activity detection, the system identifies speech and non-speech regions and discards the non-speech regions from further processing, so this step can be considered preprocessing. In speaker segmentation, the speech signal is split into segments at the points where the speaker changes, so that each segment contains a single speaker. In speaker clustering, the segments that belong to the same speaker are grouped into one cluster, so that there is one cluster for each speaker at the end.
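The three steps can be illustrated with a deliberately simplified toy sketch. It is not how LIUM_SpkDiarization or any real system works: each frame carries only a made-up energy value and a one-dimensional "voice feature", and the thresholds, the data and every name in the code are assumptions chosen for illustration.

```scala
object DiarizationSketch {
  // Hypothetical frame: position in the recording, loudness, and a 1-D voice feature.
  case class Frame(index: Int, energy: Double, feature: Double)
  case class Segment(start: Int, end: Int, meanFeature: Double)

  // Step 1, speech activity detection: keep only frames loud enough to be speech.
  def detectSpeech(frames: Seq[Frame], minEnergy: Double): Seq[Frame] =
    frames.filter(_.energy >= minEnergy)

  // Step 2, speaker segmentation: start a new segment whenever frames are not
  // adjacent or the feature jumps by more than maxJump.
  def segment(speech: Seq[Frame], maxJump: Double): Seq[Segment] = {
    val groups = speech.foldLeft(List.empty[List[Frame]]) {
      case (current :: done, f)
          if f.index == current.head.index + 1 &&
            math.abs(f.feature - current.head.feature) <= maxJump =>
        (f :: current) :: done
      case (acc, f) => List(f) :: acc
    }
    groups.reverse.map { g =>
      val fs = g.reverse
      Segment(fs.head.index, fs.last.index, fs.map(_.feature).sum / fs.size)
    }
  }

  // Step 3, speaker clustering: greedily put each segment into the first cluster
  // whose first segment has a similar mean feature, else open a new cluster.
  def cluster(segments: Seq[Segment], maxDistance: Double): Seq[Seq[Segment]] =
    segments.foldLeft(Vector.empty[Vector[Segment]]) { (clusters, s) =>
      val i = clusters.indexWhere(c => math.abs(c.head.meanFeature - s.meanFeature) <= maxDistance)
      if (i >= 0) clusters.updated(i, clusters(i) :+ s) else clusters :+ Vector(s)
    }

  // Made-up data: speaker A, silence, speaker B, silence, speaker A again.
  val demoFrames = Seq(
    Frame(0, 0.9, 100), Frame(1, 0.8, 102), Frame(2, 0.05, 0),
    Frame(3, 0.9, 200), Frame(4, 0.9, 203), Frame(5, 0.02, 0),
    Frame(6, 0.8, 101), Frame(7, 0.85, 99))

  def main(args: Array[String]): Unit = {
    val clusters = cluster(segment(detectSpeech(demoFrames, 0.1), 20.0), 20.0)
    clusters.zipWithIndex.foreach { case (c, i) =>
      println(s"speaker $i: " + c.map(s => s"${s.start}-${s.end}").mkString(", "))
    }
  }
}
```

With this made-up data the first and last segments end up in the same cluster, which is the "who spoke when" answer the pipeline is after. Real systems work on acoustic features and statistical models rather than a single number per frame.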

LIUM_SpkDiarization is software dedicated to speaker diarization. You can download the software from here and try it.