
Apache Spark is well known for distributing computation across multiple nodes by splitting data into partitions, with each CPU core processing a single partition at a time.
What’s less widely known is that Spark can also be accelerated with GPUs. Harnessing this power in the right situation brings immense advantages: it reduces infrastructure costs and the number of servers needed, speeds up queries by up to 7 times compared to CPU-only computing, and does it all without altering any existing Spark application code. We’re excited to share that our team at Canonical has enabled GPU support for Spark jobs using the NVIDIA RAPIDS Accelerator – a feature we’ve been developing to address real performance bottlenecks in large-scale data processing.
This blog explains what advantages Spark can deliver on GPUs, how it delivers them, and when GPUs might not be the right option, then guides you through launching Spark jobs with GPUs.
Why data scientists should care about Spark and GPUs
Running Apache Spark on GPUs is a notable opportunity to accelerate big data analytics and processing workloads by taking advantage of the specific strengths of GPUs.
Unlike traditional CPUs, which have a small number of cores designed for sequential processing, GPUs are made up of thousands of smaller, power-efficient cores designed to execute many threads concurrently. This architectural difference makes GPUs well suited to the highly parallel operations common in Spark workloads. By offloading such operations to GPUs, Spark can reduce query execution times significantly compared to CPU-only environments, typically accelerating data processing by 2x to 7x. This noticeably reduces time to insight for organizations.
In this regard, GPU acceleration in Apache Spark is a big advantage for data scientists as they transition from traditional analytics to AI applications. Standard Spark workloads are CPU-intensive: their distributed nature offers considerable computing power, but it may not be enough for AI-powered analytics workloads.
With GPUs, on the other hand, data scientists can work at higher speed, greater data scale, and improved efficiency. This means they can iterate faster, explore data more interactively, and deliver actionable insights in near real time – critical in today’s fast-paced decision-making environments.
Beyond raw speed, GPU acceleration also simplifies the data science workflow by combining data engineering and machine learning workloads on a single platform. With GPU-accelerated Spark, users can perform data preparation, feature engineering, model training, and inference in one environment, without separate infrastructure or complicated data movement between systems. Consolidating workflows reduces operational complexity and speeds up end-to-end data science projects.
A third major advantage of running Spark on GPUs is lower operational expense. Because GPUs offer much greater throughput per machine, companies can achieve equal – or better – results with fewer servers. This keeps costs down and reduces power consumption, making big data analytics more affordable and sustainable – increasingly important considerations for enterprises.
Finally, all of this is achievable without code rewriting or workflow modification, as technologies like NVIDIA RAPIDS smoothly integrate with Spark. Making adoption easier helps users to overcome a major barrier to unlocking the capabilities of GPUs, so they can prioritize rapid value delivery.
When should you rely on traditional CPUs?
It is important to note that not all workloads in Spark will benefit equally from GPU acceleration.
Firstly, GPUs aren’t efficient for small data sets, since the overhead of transferring data between CPU and GPU memory can outweigh the benefit of GPU acceleration – small workloads simply don’t generate enough parallel work to exploit the GPU’s strengths. Likewise, workloads that involve constant data shuffling within the cluster may not be well suited, because shuffling leads to costly data movement between CPU and GPU memory, effectively slowing down operations.
Another good reason to stick with CPUs is if your Spark jobs rely heavily on user-defined functions that are not supported or optimized for execution on GPUs.
Similarly, if your workloads entail operations that directly operate on Resilient Distributed Datasets (RDDs), GPUs might not be the best choice. This is because the RAPIDS Accelerator is currently not capable of handling these workloads and will run them on the CPU instead. Finally, you will also need to make sure that your environment meets the hardware and configuration requirements for GPU acceleration.
To find out whether GPU acceleration is useful in your chosen environment, it’s worth carefully profiling and benchmarking your workloads.
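One lightweight way to start profiling is the RAPIDS Accelerator’s explain mode, which logs which operators in your query plan cannot be placed on the GPU. The sketch below assumes the Charmed Apache Spark setup described later in this post; `my-job.py` is a placeholder for your own application.

```shell
# Ask the RAPIDS Accelerator to report operators that will NOT run on
# the GPU, so you can judge how much of the job would actually be
# accelerated before committing to GPU hardware.
spark-client.spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.explain=NOT_ON_GPU \
  my-job.py
```

The reasons printed in the driver log (for example, an unsupported expression or data type) tell you whether a CPU fallback is occasional or dominates the whole plan.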
How to launch Spark jobs with GPUs
Our charm for Apache Spark works with Kubernetes as a cluster manager, so to enable GPUs on Apache Spark we will need to work with pods and containers.
First, you will need to deploy Charmed Apache Spark’s OCI image that supports the Apache Spark Rapids plugin. Read our guide to find out how.
Once you’ve completed the deployment and you’re ready to launch your first job, you’ll need to create a pod template to limit the number of GPUs per container. To do so, edit the pod manifest file (gpu_executor_template.yaml) to add the following content:
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: executor
      resources:
        limits:
          nvidia.com/gpu: 1
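Before submitting a job against this template, it’s worth confirming that your Kubernetes nodes actually advertise GPU resources. This check assumes the NVIDIA device plugin is installed in the cluster; node names will differ in your environment.

```shell
# Show the allocatable nvidia.com/gpu count per node. An empty GPUS
# column means the NVIDIA device plugin has not registered any GPUs,
# and executor pods requesting nvidia.com/gpu would stay Pending.
kubectl get nodes \
  -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```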
With the spark-client snap, we can submit the desired Spark job, adding some configuration options for enabling GPU acceleration:
spark-client.spark-submit \
  ... \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --conf spark.rapids.memory.pinnedPool.size=1G \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.executor.resource.gpu.discoveryScript=/opt/getGpusResources.sh \
  --conf spark.executor.resource.gpu.vendor=nvidia.com \
  --conf spark.kubernetes.container.image=ghcr.io/canonical/charmed-spark-gpu:3.4-22.04_edge \
  --conf spark.kubernetes.executor.podTemplateFile=gpu_executor_template.yaml \
  ...
With the Spark Client snap, you can configure the Apache Spark settings at the service account level so they automatically apply to every job. Find out how to manage options at the service account level in our guide.
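As a sketch of what that looks like, the snap ships a service account registry that can store configuration once so every later spark-submit for that account inherits it. The account name, namespace, and exact subcommand below are assumptions – check the snap’s help output and our guide for the precise syntax in your version.

```shell
# Hypothetical sketch: persist the GPU options on the "spark" service
# account in the "spark" namespace, instead of repeating them per job.
spark-client.service-account-registry add-config \
  --username spark --namespace spark \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1
```

With the options stored at the account level, the spark-submit command above shrinks to just the application and any job-specific flags.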
Spark with GPUs: the takeaway
In short, NVIDIA RAPIDS GPU acceleration offers Apache Spark enormous performance boosts, enabling faster data processing and cost savings without code changes. This means data scientists can process bigger data sets and heavier models more efficiently, generating insights faster than before. Not all workloads benefit equally, however: small data sets, excessive data shuffling, or unsupported functions can limit GPU advantages, so careful profiling is needed to determine when GPUs are a cost-effective choice. Overall, Spark on GPUs offers a powerful way to accelerate data science and drive innovation.