Scaling Kubernetes Jobs for Unity Simulation

Unity Simulation enables product developers, researchers, and engineers to smoothly and efficiently run thousands of instances of parameterized Unity builds in batch in the cloud. Unity Simulation allows you to parameterize a Unity project in ways that will change from run to run. You can also specify simulation output data necessary for your end application, whether that be the generation of training data for machine learning, the testing and validation of AI algorithms, or the evaluation and optimization of modeled systems. With Unity Simulation, there is no need to install and manage batch computing software or server clusters that you use to run your jobs, allowing you to focus on analyzing results and solving problems. This blog post showcases how our engineers are continually innovating to ensure that our customers’ jobs run as fast and as cost-effectively as possible on Unity Simulation.

Unity Simulation leverages Kubernetes, an open source system, to containerize, schedule, and execute simulation jobs across the right number and type of compute instances. Kubernetes allows for easy download of simulation output data to a cloud storage location to connect to design, training, and testing workflows. By leveraging Kubernetes, you can run multiple simulations at a time without having to worry about compute resource allocation or capacity planning.

The following concepts are fundamental to understanding Unity Simulation and Kubernetes:

Run Definition: This specifies the name and description of the simulation, a set of application parameters for the simulation execution, system parameters specifying the compute resources to use, and the Unity build id, which references an uploaded Unity executable.
Kubernetes Job: When you deploy Kubernetes, you get a cluster. A Kubernetes cluster consists of a set of worker machines, called Nodes, that run containerized applications. The worker Node(s) host the Pods that are the components of the application workload. A Pod is the basic execution unit for a Kubernetes system and represents the processes being run on your cluster. A Kubernetes Job is managed by the system level Job Controller that supervises Pods participating in a batch process that runs for a certain amount of time and then completes. Where the term “Job” appears in this blog, it always refers to a Kubernetes Job.
Kubernetes Controller and Operator: A Kubernetes Controller is responsible for incrementally moving the current state of a resource toward the desired state. The Kubernetes Job Controller creates one or more Pods and ensures a specified number of them complete successfully. A Kubernetes Operator is a controller that follows this pattern but is extended to embody specific operational knowledge required to run a workload. Our Simulation Job Operator has an understanding of the Kubernetes Autoscaler and how its behavior can affect the current state versus the desired state, as we discuss in this article.
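
The controller pattern described above can be illustrated with a toy reconcile loop. This is a sketch only: the function names and the use of a single integer as the "state" are assumptions made for illustration and are not Kubernetes APIs.

```python
def reconcile_step(current, desired):
    """Move the current state one increment toward the desired state."""
    if current < desired:
        return current + 1
    if current > desired:
        return current - 1
    return current


def reconcile(current, desired):
    """Repeat reconcile steps until the current state matches the desired state."""
    history = [current]
    while current != desired:
        current = reconcile_step(current, desired)
        history.append(current)
    return history


# Hypothetical example: two Pods exist, five are desired.
history = reconcile(current=2, desired=5)
```

A real controller observes the cluster on every pass rather than trusting its own previous result, which is why stale or missing observations can mislead it.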

A Kubernetes Job creates one or more Pods and ensures that the correct number of Pods successfully complete. A work queue is typically used to distribute tasks to the Pods assigned to the Job. The application process running in the container can pick tasks from the queue in parallel or separately as needed. The Job’s parallelism parameter is used to determine the number of parallel Pods that the Job runs simultaneously (or in other words, the number of concurrent simulation instances for the run execution). The Job’s completions parameter determines the number of Pods that must successfully finish.
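
As a minimal sketch, the two parameters correspond to the `parallelism` and `completions` fields of a `batch/v1` Job spec. The manifest is written here as a Python dictionary; the helper name, Job name, container image, and the values 4 and 15 are hypothetical and chosen only for illustration.

```python
def build_job_manifest(name, image, parallelism, completions):
    """Return a batch/v1 Job manifest as a plain dictionary."""
    if parallelism < 1 or completions < 1:
        raise ValueError("parallelism and completions must be positive")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Number of Pods the Job runs at the same time.
            "parallelism": parallelism,
            # Number of Pods that must finish successfully.
            "completions": completions,
            "template": {
                "spec": {
                    "containers": [{"name": "simulation", "image": image}],
                    "restartPolicy": "Never",
                }
            },
        },
    }


# Hypothetical values: 4 concurrent Pods, 15 successful completions required.
manifest = build_job_manifest("example-run", "example.com/unity-build:latest", 4, 15)
```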

The Unity Simulation scheduler orchestrates run executions using the Kubernetes Job and the queue design pattern. The scheduler enqueues a message for each simulation instance to an independent run execution queue before submitting the Job to the Kubernetes cluster.
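
The queue pattern can be sketched with an in-process queue and threads standing in for Pods. This is an illustration under stated assumptions, not the Unity Simulation scheduler: `run_execution` and `pod_worker` are hypothetical names, and a production system would use an external message queue rather than `queue.Queue`.

```python
import queue
import threading


def run_execution(instance_ids, parallelism):
    """Enqueue one message per simulation instance, then drain the queue
    with `parallelism` workers, each standing in for a Pod."""
    tasks = queue.Queue()
    for instance_id in instance_ids:
        tasks.put(instance_id)  # enqueued before any worker starts

    processed = []
    lock = threading.Lock()

    def pod_worker():
        while True:
            try:
                instance_id = tasks.get_nowait()
            except queue.Empty:
                return  # nothing left to do, so the worker completes
            with lock:
                processed.append(instance_id)

    workers = [threading.Thread(target=pod_worker) for _ in range(parallelism)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return processed


# Hypothetical run: 10 simulation instances, 4 parallel workers.
result = run_execution(range(10), parallelism=4)
```

Note that a worker started against an empty queue returns immediately, which is the same behavior a Pod shows when it is created after all tasks have been consumed.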

The following diagram shows how queues are used to distribute the messages to the Pods for executing Jobs. This diagram shows a Job with a parallelism of four. In other words, there are four instances of a Unity project running in the simulation.

Our application of Kubernetes differs from most in that we use a combination of batch processing and autoscaling. We discovered that batch processing Jobs, combined with the Kubernetes Autoscaler behavior, leads to an unexpected interaction that results in a significant waste of compute resources and Job inefficiencies. The Kubernetes Autoscaler alternates between scaling up and scaling down the cluster, and the Job Controller reports incorrect state. This leads to overblown estimates of Job time, inaccurate reports after Job completion, and overall CPU inefficiency.

During the Job lifecycle, the completion count should either stay the same or increase, but our metrics showed Jobs whose count decreased. The incorrect completion counts caused the Job Controller to create more Pods to satisfy the Job’s completions requirement (the number of Pods that must successfully finish). The request for more Pods caused the Kubernetes Autoscaler to add nodes to the cluster. The newly created Pods completed immediately because there were no remaining tasks in the queue, and the added nodes became idle as soon as those Pods finished, because the completion count had been reached. The Autoscaler then removed the idle nodes from the cluster, which caused the Pod completion count to decrease again.

This behavior causes the following vicious cycle:

Scale up to run more Pods
Pods complete immediately
Scale down because nodes become idle

Scaling up left our cluster unable to execute other work because the problematic Job was consuming all of the cluster’s resources. At best, this wastes resources; at worst, it makes the service unavailable.

The problem is described in detail in the following steps.

Step 1:

Unity Simulation is a multicloud solution. For our product running on Google Cloud Platform, we use GKE (Google Kubernetes Engine), a managed Kubernetes service. Assume the GKE cluster has one running node, Node1, capable of hosting five Pods. A new Job requires 15 Pods, which causes the GKE cluster to add two nodes, Node2 and Node3, to increase capacity so that all 15 Pods can run.

Step 2:

All Pods are ‘active’ on the GKE cluster.

Step 3:

The five Pods on Node1 reach the ‘complete’ state (green) for the Job.

Step 4:

Because all of its Pods have completed, Node1 becomes idle and is scaled down. When Node1 is removed from the cluster, its completion count of five is lost, so the completion count for the Job drops to zero when it should be five. This triggers the error condition: only ten Pods are accounted for, while the Job expects 15 Pods to exist. The Job Controller requests five new Pods, which causes the Autoscaler to add a node to the cluster again.
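
The arithmetic in Steps 1 through 4 can be checked with a small model. This is a sketch using the numbers from the steps above; the dictionary layout and the `completion_count` helper are hypothetical and only mimic counting completions from the Pods that still exist.

```python
def completion_count(pods_by_node):
    """Count completed Pods among the Pods that still exist in the cluster."""
    return sum(
        1
        for pods in pods_by_node.values()
        for state in pods
        if state == "complete"
    )


EXPECTED_PODS = 15  # the Job in the example requires 15 Pods

# Step 3: the five Pods on Node1 are complete, the other ten are still active.
cluster = {
    "Node1": ["complete"] * 5,
    "Node2": ["active"] * 5,
    "Node3": ["active"] * 5,
}
count_before = completion_count(cluster)

# Step 4: the Autoscaler removes the idle Node1, and its Pods disappear with it.
del cluster["Node1"]
count_after = completion_count(cluster)

existing_pods = sum(len(pods) for pods in cluster.values())
replacement_pods = EXPECTED_PODS - existing_pods
```

In this model the count falls from five to zero once Node1 is removed, and the shortfall of five Pods is what prompts the request for replacements.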

To understand the problem better, we carefully studied the Job Controller source code in Kubernetes. The SyncJob function synchronizes the state of the Job based on the current state of the Pods it manages. SyncJob calls getStatus to get the number of successful and failed Pods for the Job. The Pods are retrieved in the getPodsForJob function, which uses a selector to query for the Pods that currently exist in the cluster.
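
The counting logic can be approximated in a few lines. The following is a loose Python analogue, not the Go source: `pods_for_job` and `get_status` are hypothetical stand-ins for the functions named above, Pods are modeled as plain dictionaries, and the label key and Pod names are assumptions made for illustration.

```python
def pods_for_job(pods, selector):
    """Return the Pods whose labels match every key/value pair in the selector."""
    return [
        pod
        for pod in pods
        if all(pod["labels"].get(key) == value for key, value in selector.items())
    ]


def get_status(pods):
    """Count succeeded and failed Pods among the Pods passed in."""
    succeeded = sum(1 for pod in pods if pod["phase"] == "Succeeded")
    failed = sum(1 for pod in pods if pod["phase"] == "Failed")
    return succeeded, failed


# Hypothetical Pods currently visible in the cluster.
pods = [
    {"name": "sim-a", "labels": {"job-name": "run-1"}, "phase": "Succeeded"},
    {"name": "sim-b", "labels": {"job-name": "run-1"}, "phase": "Failed"},
    {"name": "sim-c", "labels": {"job-name": "run-1"}, "phase": "Running"},
    {"name": "other", "labels": {"job-name": "run-2"}, "phase": "Succeeded"},
]

job_pods = pods_for_job(pods, {"job-name": "run-1"})
succeeded, failed = get_status(job_pods)
```

Because the counts are derived only from Pods that can still be queried, any Pod whose record has disappeared simply stops contributing to them.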

Unfortunately, removing a node from the Kubernetes cluster deletes the metadata for the Pods that ran on that node. When the Job Controller queries Kubernetes for a Job’s Pods after the Autoscaler has taken a node down, it receives incorrect completion counts. We easily reproduced this behavior by creating a simple Job that executes a single long-running sleep command and many shorter sleep commands in separate tasks.

After getting more familiar with the Job Controller source code, we realized we could fix the scaling problem by persisting the status of the Pods. This ensures that the Pod metadata is captured even when it is not available in the Kubernetes cluster. We found that developing an Operator for executing simulations is beneficial for other reasons too.

The custom resource definition and Operator we implemented are very similar to the current Kubernetes Job Controller, with a fix for the autoscaling issue. Our Simulation Job (SimJob) Operator updates a list of the unique successful and failed Pods each time its control loop runs. The current state of the SimJob is determined by combining the current state of the Pods in the Kubernetes cluster with the data store that contains the unique set of successful and failed Pods.
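
The idea of persisting Pod status can be sketched as follows. This is a minimal illustration, not the SimJob Operator itself: the `SimJobStatusStore` class, its in-memory sets, and the Pod names are assumptions, and a real Operator would keep this record in a durable data store.

```python
class SimJobStatusStore:
    """Remembers the unique names of Pods seen as succeeded or failed."""

    def __init__(self):
        self.succeeded = set()
        self.failed = set()

    def reconcile(self, live_pods):
        """Merge the Pods currently visible in the cluster into the stored sets
        and return the (succeeded, failed) counts."""
        for name, phase in live_pods.items():
            if phase == "Succeeded":
                self.succeeded.add(name)
            elif phase == "Failed":
                self.failed.add(name)
        return len(self.succeeded), len(self.failed)


store = SimJobStatusStore()

# First control-loop pass: five Pods have succeeded, ten are still running.
first_pass = {f"pod-{i}": "Succeeded" for i in range(5)}
first_pass.update({f"pod-{i}": "Running" for i in range(5, 15)})
counts_before = store.reconcile(first_pass)

# Second pass: the node hosting the five succeeded Pods has been removed,
# so only the ten running Pods are still visible.
second_pass = {f"pod-{i}": "Running" for i in range(5, 15)}
counts_after = store.reconcile(second_pass)
```

Using sets keyed by Pod name means a Pod observed on several passes is counted once, and a Pod that disappears from the cluster is still counted.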

The following diagram shows how our SimJob Operator maintains the correct ‘completion’ count even when the cluster is scaled down:

We have run the SimJob Operator in production for the past two months. It has successfully executed over 1,000 simulations with nearly 50,000 total execution instances (that is, Pods; a Pod can run one or more simulation instances). We can now safely autoscale our simulation run executions without risking the availability of the cluster and, in turn, the Unity Simulation service. We are very happy with these results and are excited to continue to improve and add new features to the SimJob Operator.

Unity Simulation is at the forefront of data-driven artificial intelligence, whether that be the generation of training data for machine learning, the testing and validation of AI algorithms, or the evaluation and optimization of modeled systems. Our teams continue to innovate daily to provide the best-managed simulation service ecosystem.

If you’d like to join us to work on exciting Unity Simulation and AI challenges, we are hiring for several positions; please apply!

Learn more about Unity Simulation.

Source: Unity Technologies Blog
