Practical Spark – Calculate AWS Spot price

How much does your current EC2 fleet cost? How much would it cost if EC2 Spot Instances were used?
Use Apache Spark, Scala and Jupyter Notebook to find out!

Someone might say that the best way to calculate such a cost is to use the AWS Cost Explorer. And I agree, it is super cool and can do a lot, and maybe the use case described here could be achieved with it as well (I haven't actually tried). However, it wouldn't be that fun without a bit of Scala and Apache Spark code! I used a Jupyter Notebook for code development. Thanks Robert for advertising it over and over again!

The Use Case

On Demand EC2 instances are used to host Java applications (a.k.a. [micro]services). There are different types of EC2 instances used by different types of applications. I would like to find out how much would be spent if those instances were actually Spot Instances.

Step 1: Prepare Environment
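
I ran the code in a Jupyter Notebook with a Scala kernel. The exact setup depends on the kernel you use, but the snippet below is a minimal sketch of what the later examples assume: a SparkSession named spark, the implicits that enable the $"column" syntax, the SQL functions such as explode, and a filesIn value pointing at the folder with the downloaded files (the path is just a placeholder).

import org.apache.spark.sql.SparkSession

// Most Scala notebook kernels can create (or already provide) a SparkSession.
val spark = SparkSession.builder()
    .appName("ec2-spot-price")
    .master("local[*]")
    .getOrCreate()

// Enables the $"column" syntax and functions such as explode used below.
import spark.implicits._
import org.apache.spark.sql.functions._

// Folder holding the downloaded JSON/CSV files - adjust to your setup.
val filesIn = "/path/to/data"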

Step 2: Count existing EC2 Instances

In order to complete this step, I need to know which instances I have and how many. The AWS CLI describe-instances command will be helpful here. Note that only the London region (eu-west-2) is queried. Since Spark will be used, let's save the result in a JSON file:

aws --profile my-dev-profile \
ec2 describe-instances --region eu-west-2 \
> all-dev-ec2.json

Read the file to create a Spark SQL DataFrame. Assume filesIn holds the path to the folder where all-dev-ec2.json is located:

val dev = spark.read.option("multiLine", "true")
    .json(s"$filesIn/all-dev-ec2.json")

Since I have more than one non-prod environment and many types of applications, my target is to get the following information:

  • Instance Type
  • Application Role (a.k.a type)
  • Environment
  • Number of Instances

Since Role and Environment are tags, I use a self join so that both end up as columns in a single row:

val instances = dev
    .select(explode($"Reservations") as "Reservation")
    .select("Reservation.*")
    .select(explode($"Instances") as "Instance")
    .select("Instance.*")

val instanceTypesWithTags = instances
    .withColumn("Tags", explode($"Tags"))
    .select("InstanceId", "InstanceType", "Tags.*")

val instancesWithRoleAndEnv = instanceTypesWithTags.as("i1")
    .join(instanceTypesWithTags.as("i2"),
          $"i1.InstanceId" === $"i2.InstanceId")
    .select($"i1.InstanceId",
            $"i1.InstanceType",
            $"i1.Key",
            $"i1.Value" as "Role",
            $"i2.Key" as "Key2",
            $"i2.Value" as "Environment")

The final step is to group by Role, Environment and Instance Type and count the rows per group:

val instancesPerTypeAndEnv = instancesWithRoleAndEnv
    .where($"Key" === "Role" && $"Key2" === "Environment")
    .groupBy("InstanceType", "Role", "Environment")
    .count()

instancesPerTypeAndEnv.show()

Sample output could be the following:

+------------+-----------+-----------+-----+
|InstanceType|       Role|Environment|count|
+------------+-----------+-----------+-----+
|    t2.small|     consul|       env1|    3|
|   m4.xlarge|    jenkins|       env1|    1|
|    t2.small|   frontend|       env1|    1|
|    t3.large|   keycloak|       env2|    2|
|   t2.medium|         dc|       env2|    2|
|   t2.medium|    gateway|       env2|    1|
+------------+-----------+-----------+-----+

Step 3: Calculate current EC2 price

AWS exposes an API to download their offer files as either JSON or CSV. Details can be found in the AWS Price List API documentation. I would like to download the current EC2 prices for the London region:
https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/eu-west-2/index.csv.
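
One way to fetch it (assuming curl is available), saving it under the file name used in the next snippet:

curl -o eu-west-2-ec2.csv \
    https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/eu-west-2/index.csv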

Read the downloaded file in Spark:

val awsEc2Price = spark.read.option("header", "true")
    .csv(s"$filesIn/eu-west-2-ec2.csv")

Select the On Demand price for Linux VMs. There are different kinds of Linux offerings, however I'm only interested in those without pre-installed SQL Server. There are various ways to do it; one is to select only the rows where operation equals RunInstances:

val ec2OnDemandLondon = awsEc2Price
    .where($"TermType" === "OnDemand"
        && $"serviceCode" === "AmazonEC2"
        && $"Operating System" === "Linux"
        && $"Tenancy" === "Shared"
        && $"operation" === "RunInstances")
    .select($"Instance Type", $"PricePerUnit")

ec2OnDemandLondon.show()

Sample output of the show method is below. The price is in US dollars, per hour:

+-------------+-------------+
|Instance Type| PricePerUnit|
+-------------+-------------+
|    c4.xlarge| 0.2370000000|
|   d2.4xlarge| 3.0870000000|
|  c5d.2xlarge| 0.4600000000|
|     c4.large| 0.1190000000|
|   m5.4xlarge| 0.8880000000|
|    d2.xlarge| 0.7720000000|
+-------------+-------------+

Not all of my environments run 100% of the time. I will provide the Utilisation in another CSV file, where 1 = 100%, 0.5 = 50%, etc.:

Environment,Utilization
env1,0.42
env2,1
...

Read the CSV, then join the current usage instancesPerTypeAndEnv with the EC2 price ec2OnDemandLondon and the utilisation envUtilisation to calculate the cost:

val envUtilisation = spark.read.option("header", "true")
    .csv(s"$filesIn/utilization.csv")
val instancesWithCost = instancesPerTypeAndEnv.as("i")
    .join(ec2OnDemandLondon.as("c"),
          $"i.InstanceType" === $"c.Instance Type")

val instancesWithUtilisation = instancesWithCost.as("i")
    .join(envUtilisation.as("eu"),
          $"i.Environment" === $"eu.Environment")
    .withColumn("TotalCostPerMonth",
                $"count" * $"PricePerUnit" * 730 * $"Utilization")
    .select($"i.Environment",
            $"i.InstanceType",
            $"i.Role",
            $"i.count" as "Count",
            $"eu.Utilization",
            $"TotalCostPerMonth")
    .sort($"Environment")

instancesWithUtilisation.show()

That provides the total cost per application, environment and instance type, taking into consideration the utilisation of each environment. 730 is the average number of hours in a month, calculated as 365 * 24 / 12.

Step 4: Calculate Spot Cost

Download Spot price history and store it in JSON using AWS CLI:

aws --profile my-dev-profile ec2 \
describe-spot-price-history --region eu-west-2 \
> ec2-spot-price-history.json

Read the file in Spark and select the average Spot price per Instance Type:

val ec2SpotHistorySrc = spark.read
    .option("multiLine", "true")
    .json(s"$filesIn/ec2-spot-price-history.json")

import org.apache.spark.sql.types.DoubleType

val ec2SpotHistory = ec2SpotHistorySrc
    .select(explode($"SpotPriceHistory") as "SpotPriceHistory")
    .select($"SpotPriceHistory.*")
    .where($"ProductDescription" === "Linux/UNIX")
    .withColumn("SpotPrice", $"SpotPrice".cast(DoubleType))
    .groupBy("InstanceType")
    .avg("SpotPrice")
    .orderBy("avg(SpotPrice)")

Finally, join instancesWithUtilisation with the Spot price history ec2SpotHistory and calculate the Spot cost and the difference between On Demand and Spot:

val spotVsDemand = instancesWithUtilisation.as("i")
    .join(ec2SpotHistory.as("hi"),
          $"i.InstanceType" === $"hi.InstanceType")
    .withColumn("AvgSpotCost",
                $"Count" * $"avg(SpotPrice)" * 730 * $"Utilization")
    .select($"i.Environment",
            $"i.Role",
            $"TotalCostPerMonth",
            $"AvgSpotCost")
    .withColumn("Difference",
                $"TotalCostPerMonth" - $"AvgSpotCost")

spotVsDemand.show()

Sample result:

+-----------+----------------+------------------+------------------+------------------+
|Environment|            Role| TotalCostPerMonth|       AvgSpotCost|        Difference|
+-----------+----------------+------------------+------------------+------------------+
|       env2|         jenkins|             81.03|  23.1926282894736|  57.8373717105264|
|       env2|        jumphost|             18.98| 5.694000000000014|13.285999999999987|
|       env2|           nexus|             37.96| 11.38800000000003| 26.57199999999997|
|       env1|          consul|12.141359999999999| 3.679200000000002| 8.462159999999997|
+-----------+----------------+------------------+------------------+------------------+
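
As an optional last step (not part of the original notebook output), the per-row differences can be rolled up per environment to see the potential monthly saving of switching to Spot:

import org.apache.spark.sql.functions.sum

spotVsDemand
    .groupBy("Environment")
    .agg(sum("Difference") as "PotentialMonthlySaving")
    .show()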

Summary

What you have seen is that Spark, and Spark SQL in particular, can be used to quickly join and analyse data from various sources. Using a Jupyter Notebook can speed up code development and results analysis. The use case was calculating AWS EC2 Spot costs for an existing fleet of instances. Once I store the described notebook on GitHub I will update this post – stay tuned!

I literally love what Spark can do and will use it more! Just need to find a good excuse … 😉
