I came across Apache Spark while looking for tools for an ETL process. I am a big fan of Scala, and Spark with Scala was mentioned a few times in an ETL context, so I could not help but consider it as an option.
Spark can be hosted on a local machine using Spark Standalone, the cluster manager that ships with Spark. A pre-built package can be obtained from Spark’s download page. Once unpacked, it is ready to go! It contains a set of scripts for managing a local Spark cluster, which are well described in the documentation.
Spark can also be started in an embedded mode. This happens out of the box when running a main method that creates a SparkContext and connects to a “local” master. In this case the Spark binaries don’t have to be downloaded separately; Spark only needs to be on the application’s classpath as a library dependency.
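For the embedded mode described above, pulling Spark in as a dependency is enough. Here is a minimal build.sbt sketch, assuming sbt is the build tool. The project name and both version numbers are assumptions picked for illustration, so check Spark's download page for the release and Scala version that suit you.

```scala
// build.sbt (sketch): the project name and versions are assumptions, not recommendations.
name := "spark-etl-sketch"

// Must be a Scala version that the chosen Spark release is published for.
scalaVersion := "2.12.18"

// spark-sql brings in spark-core transitively, so this single line covers
// both SparkContext (RDDs) and SparkSession (Spark SQL).
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1"
```

With this in place, running a main method from sbt or an IDE starts Spark inside the same JVM.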
Spark offers an interactive shell for experimenting with its functionality. If you have downloaded Spark, running the spark-shell script from the bin directory of the unpacked package will start the console.
Structured & Unstructured Data
SparkContext is the main entry point for Spark functionality. It represents the connection to a Spark cluster, and can be used to create RDDs (Resilient Distributed Datasets), accumulators and broadcast variables on that cluster. RDDs are at the core of Spark and provide an efficient way to work with unstructured data.
To create a SparkContext in a Spark application, the following code can be used:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("appName")
  .setMaster("local[*]")
val sc = new SparkContext(conf)
The code above creates an application named appName that runs Spark in local mode, inside the application's own process, using as many worker threads as there are logical cores on the machine (check the master URL docs for more information).
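To make RDDs, broadcast variables and accumulators a bit more concrete, here is a small sketch that counts words in a few lines of free text. The input lines, the stop-word set and the names RddSketch and blankLines are made up for illustration; a real job would more likely read its input with sc.textFile.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rddSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical unstructured input: free-text lines held in memory.
      val lines = sc.parallelize(Seq("spark makes etl simple", "", "scala and spark"))

      // Broadcast variable: a read-only value shipped once to each executor.
      val stopWords = sc.broadcast(Set("and"))
      // Accumulator: tasks add to it, the driver reads the total.
      val blankLines = sc.longAccumulator("blankLines")

      val counts = lines
        .flatMap { line =>
          if (line.trim.isEmpty) { blankLines.add(1L); Seq.empty[String] }
          else line.split("\\s+").toSeq
        }
        .filter(word => !stopWords.value.contains(word))
        .map(word => (word, 1))
        .reduceByKey(_ + _)

      // collect() is the action that triggers the computation.
      counts.collect().foreach { case (word, n) => println(s"$word: $n") }
      // Read the accumulator only after an action has run.
      println(s"blank lines: ${blankLines.value}")
    } finally {
      sc.stop()
    }
  }
}
```

One caveat: accumulator updates made inside transformations such as flatMap can be applied more than once if a task is retried, so counters like this are best treated as diagnostics rather than exact results.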
If the data is in a structured or semi-structured format such as JSON, CSV or even a database table, Spark SQL is a good fit. It allows running SQL queries against the data and is optimised for storing and distributing it across the cluster. To use Spark SQL, a SparkSession has to be created. It can be done in the following way:
import org.apache.spark.sql.SparkSession
val session = SparkSession
  .builder()
  .master("local[*]")
  .appName("appName")
  .getOrCreate()
See the Spark SQL programming guide for details.
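As a rough illustration of what running SQL against structured data looks like, the sketch below registers a small in-memory dataset as a temporary view and aggregates it with a query. The Order case class, its fields and the sample rows are hypothetical and stand in for data that would normally be loaded from JSON, CSV or a database.

```scala
import org.apache.spark.sql.SparkSession

object SqlSketch {
  // Hypothetical record type, used only for illustration.
  final case class Order(id: Long, customer: String, amount: Double)

  def main(args: Array[String]): Unit = {
    val session = SparkSession
      .builder()
      .master("local[*]")
      .appName("sqlSketch")
      .getOrCreate()
    // Brings in the encoders needed to turn Scala collections into Datasets.
    import session.implicits._

    try {
      val orders = Seq(
        Order(1L, "alice", 20.0),
        Order(2L, "bob", 35.5),
        Order(3L, "alice", 4.5)
      ).toDS()

      // Expose the Dataset to SQL under the name "orders".
      orders.createOrReplaceTempView("orders")

      val totals = session.sql(
        "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer ORDER BY customer"
      )
      totals.show()
    } finally {
      session.stop()
    }
  }
}
```

For file-based sources, the in-memory Seq would be replaced by a reader call such as session.read.json(path) or session.read.option("header", "true").csv(path), where path is whatever location your data lives in.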
Since AWS is the environment that I’m using at the moment, I looked at this particular cloud provider for hosting options. If a managed Spark cluster is the preferred approach, AWS has two options:
- AWS Glue
  - Glue has a concept of Jobs that run code from a script. Currently Scala and Python are supported.
  - The AWS Glue API is a custom library that gives developers additional tools to manipulate data as well as obtain data sources and destinations from the AWS Glue Data Catalog.
  - Currently requires a development endpoint to properly test scripts, especially those written in Scala.
- Amazon EMR
  - Usage feels more like a local Spark cluster. Applications are submitted using the spark-submit script.
  - Since there is no custom library, you can use standard Spark code, test it locally and then run it on the cluster.
  - Uses YARN to manage the cluster.
On the other hand, if there are Spark gurus in your company, a self-managed Spark installation on EC2 may be the better choice!
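For the "test locally, run on the cluster" workflow mentioned for EMR, one possible approach is to keep the master URL out of the code so that spark-submit can supply it. The sketch below assumes a made-up convention where passing "local" as the first argument switches to an in-process run; the object name and that argument are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object ClusterReadyJob {
  def main(args: Array[String]): Unit = {
    // No master is hard-coded: on a cluster, spark-submit provides it.
    val builder = SparkSession.builder().appName("clusterReadyJob")

    // Hypothetical convention: a first argument of "local" means "run in-process".
    val session =
      if (args.headOption.contains("local")) builder.master("local[*]").getOrCreate()
      else builder.getOrCreate()

    try {
      // Stand-in workload: sum the numbers 1 to 100 as a distributed computation.
      val total = session.range(1, 101).agg(sum("id")).first().getLong(0)
      println(s"total: $total")
    } finally {
      session.stop()
    }
  }
}
```

Packaged as a jar, a job like this could then be launched with spark-submit using --class ClusterReadyJob and --master yarn, while the same code runs from an IDE or sbt with the "local" argument.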
Spark is getting more and more popular, and there are plenty of learning resources to choose from, including:
- Apache Spark documentation
- A GitBook by Jacek Laskowski: Mastering Apache Spark
- Online learning platforms
- Spark Summit conference
- Spark was easy to set up on a local machine, and the first sample application was easy to run
- Getting it to work locally for my use case on a subset of data took less than a day
- It is intuitive and offers a well-known functional and SQL-like style of querying data
- It is a powerful tool with a friendly API
- Spark is available as a service on AWS