Netscientium

Hello World in Apache Spark

Let us learn a little bit of Spark basics.

Let us build Spark from source and start spark-shell.cmd (on Windows).

If you have not yet built it,

Here is our first word count in Spark. This assumes that you already know some Scala.

When you open the REPL, the Spark context is available there as sc.

scala> val file = sc.textFile("C:\\somefile.txt")

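This will create a text file RDD from the local file. You can also create the RDD from HDFS or any other Hadoop-supported filesystem by passing a URI with a scheme such as hdfs://, s3://, kfs://, or file://. HTTP, HTTPS, and FTP URIs may work too, depending on the Hadoop version in use.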

scala> val words = file.flatMap(_.split(" "))

This splits each line on spaces and flattens the results into a single RDD of words.
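To see why flatMap is used here rather than map, here is a minimal sketch on an ordinary Scala collection instead of an RDD. The object name FlatMapSketch and the sample lines are made up for illustration; the same idea applies to file.flatMap in the shell.

```scala
object FlatMapSketch {
  // Hypothetical sample input standing in for the lines of a text file
  val lines = Seq("hello spark", "hello world")

  // map keeps one output element per line, so we get a collection of arrays
  val nested: Seq[Array[String]] = lines.map(_.split(" "))

  // flatMap splits each line and concatenates the pieces into one flat collection
  val words: Seq[String] = lines.flatMap(_.split(" "))

  def main(args: Array[String]): Unit = {
    println(nested.map(_.toList)) // List(List(hello, spark), List(hello, world))
    println(words)                // List(hello, spark, hello, world)
  }
}
```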

scala> words.count()

This will count the number of words.

scala> words.distinct().count()

This will count the number of unique words.

scala> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

map will convert each word into a (word, 1) pair. The reduceByKey transformation, which triggers a shuffle stage, will then reduce these pairs to a dataset of (word, total count of this word) pairs. All transformations are lazy operations, so we need to perform an action to execute them and return the output to the driver program or write it to storage.
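The two ideas in the paragraph above can be sketched without a cluster, using plain Scala collections as a rough stand-in. This is only an analogy, not how Spark implements it: countWords mimics the result of map followed by reduceByKey on a local Seq, and lazyDemo uses a Scala Iterator (whose map is also lazy) to show that nothing runs until something consumes the data. The names WordCountSketch, countWords and lazyDemo are made up for illustration.

```scala
object WordCountSketch {
  // Local stand-in for words.map(x => (x, 1)).reduceByKey(_ + _)
  def countWords(words: Seq[String]): Map[String, Int] =
    words
      .map(x => (x, 1))                 // (word, 1) pairs
      .groupBy(_._1)                    // bring pairs with the same word together
      .map { case (word, pairs) => (word, pairs.map(_._2).reduce(_ + _)) }

  // Iterator.map is lazy, loosely like an RDD transformation:
  // the function only runs when the iterator is consumed.
  def lazyDemo(): (Int, Int) = {
    var calls = 0
    val pairs = Iterator("to", "be", "or").map { w => calls += 1; (w, 1) }
    val callsBefore = calls             // nothing has been evaluated yet
    pairs.foreach(_ => ())              // consuming plays the role of an action
    (callsBefore, calls)
  }

  def main(args: Array[String]): Unit = {
    println(countWords("to be or not to be".split(" ").toSeq))
    println(lazyDemo()) // (0,3)
  }
}
```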

scala> wordCounts.saveAsTextFile("sparkHelloWord")

We now want to sort the result from the word that appears the most times to the one that appears the least (decreasing order). Unfortunately, we do not have sortByValue, but we have sortByKey. So we have to swap the key and value, call sortByKey, and then swap them back again.

scala> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _).map{case(x,y) => (y,x)}.sortByKey(false).map{case(i,j) => (j, i)}
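As a possible alternative to the swap, sort, swap pattern, here is a hedged sketch that assumes your Spark release provides RDD.sortBy and RDD.top (to my knowledge sortBy arrived around Spark 1.1, so check your version). The object and method names are made up for illustration.

```scala
import org.apache.spark.rdd.RDD

object SortByCountSketch {
  // Sort (word, count) pairs by the count, largest first,
  // without swapping key and value by hand.
  def sortByCountDesc(wordCounts: RDD[(String, Int)]): RDD[(String, Int)] =
    wordCounts.sortBy(_._2, ascending = false)

  // If only the n most frequent words are needed, top avoids sorting
  // the whole dataset; the Ordering compares pairs by their count.
  def topByCount(wordCounts: RDD[(String, Int)], n: Int): Array[(String, Int)] =
    wordCounts.top(n)(Ordering.by[(String, Int), Int](_._2))
}
```

In the shell, SortByCountSketch.sortByCountDesc(wordCounts) would be used in place of the longer chain above, where wordCounts is the result of the reduceByKey step.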

The following action will return the 5 most frequent words, which the REPL prints in the console.

scala> wordCounts.take(5)
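For reference, the shell steps above can be put together as a small standalone program. This is a sketch under a few assumptions: Spark 1.3 or later is on the classpath (older releases also need import org.apache.spark.SparkContext._ for reduceByKey and sortByKey), the job runs in local mode, and the input path is passed as the first argument. The object name HelloSparkWordCount and the helper topWords are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HelloSparkWordCount {
  // Same pipeline as in the shell: split, count, sort by count, take n
  def topWords(sc: SparkContext, path: String, n: Int): Array[(String, Int)] =
    sc.textFile(path)
      .flatMap(_.split(" "))
      .map(x => (x, 1))
      .reduceByKey(_ + _)
      .map { case (word, count) => (count, word) } // swap so the count is the key
      .sortByKey(false)                            // descending by count
      .map { case (count, word) => (word, count) } // swap back
      .take(n)

  def main(args: Array[String]): Unit = {
    // Outside the shell we have to create the Spark context ourselves
    val conf = new SparkConf().setAppName("HelloSparkWordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      topWords(sc, args(0), 5).foreach { case (word, count) => println(s"$word\t$count") }
    } finally {
      sc.stop()
    }
  }
}
```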

August 4, 2015
