What is Inference
Machine learning (ML) inference is the process of feeding live data points into a trained machine-learning model to calculate an output.
With Inference, you take a trained Model and deploy it into a production environment. The deployment must meet the organization's production standards, such as average and 95th-percentile response times, as well as uptime.
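To make the response-time standards concrete, the following is a minimal sketch of how average and 95th-percentile latency could be checked against a target. The function names and the thresholds are illustrative assumptions, not part of Run:AI.

```python
import statistics


def latency_summary(samples_ms):
    """Return (mean, 95th percentile) of a list of response times in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two latency samples")
    mean = statistics.fmean(samples_ms)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(samples_ms, n=100, method="inclusive")[94]
    return mean, p95


def meets_standard(samples_ms, max_mean_ms, max_p95_ms):
    """Hypothetical check: both the mean and the p95 must stay under their limits."""
    mean, p95 = latency_summary(samples_ms)
    return mean <= max_mean_ms and p95 <= max_p95_ms
```

For example, `meets_standard(measured, max_mean_ms=60, max_p95_ms=100)` would pass only if both limits hold for the measured samples.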
Inference and GPUs
The Inference process runs a subset of the original Training algorithm on a single datum (e.g., one sentence or one image) or a small batch. As such, its GPU memory requirements are typically smaller than those of a full Training process.
Inference therefore lends itself well to Run:AI Fractions. You can, for example, run four instances of an Inference server on a single GPU, each using a quarter of the GPU memory.
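The arithmetic behind the four-instances example can be sketched as follows. The helper names and the 16 GiB GPU size are assumptions chosen for illustration; they do not describe how Run:AI allocates fractions internally.

```python
import math
from fractions import Fraction


def replicas_per_gpu(gpu_fraction):
    """Number of inference instances that fit on one GPU at the given memory fraction."""
    if not 0 < gpu_fraction <= 1:
        raise ValueError("gpu_fraction must be in the range (0, 1]")
    # Fraction(str(...)) avoids floating-point surprises such as 1 / 0.1.
    return math.floor(1 / Fraction(str(gpu_fraction)))


def memory_per_replica_gib(gpu_memory_gib, gpu_fraction):
    """GPU memory available to each instance, in GiB."""
    return gpu_memory_gib * gpu_fraction


# Assumed example: a 16 GiB GPU shared by instances that each request a quarter of it.
print(replicas_per_gpu(0.25), memory_per_replica_gib(16, 0.25))
```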
Run:AI treats Inference as a first-class Workload type, alongside the other two: Train and Build.
Inference is considered a high-priority workload as it is customer-facing. Running an Inference workload (within the Project's quota) will preempt any Run:AI Workload marked as Training.
Inference is implemented as a Kubernetes Deployment with a defined number of replicas. Kubernetes load-balances requests across the replicas, so adding replicas increases the overall throughput of the system.
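As a rough illustration of what a Deployment with a fixed replica count looks like, the sketch below builds a generic Kubernetes `apps/v1` Deployment manifest as a Python dictionary. It is not the object Run:AI generates; the workload name, image, and port are hypothetical placeholders.

```python
import json


def inference_deployment(name, image, replicas, port=8080):
    """Build a generic Kubernetes Deployment manifest for an inference server."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            # Number of identical inference server pods to keep running.
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


# Hypothetical name and image, used only for illustration.
manifest = inference_deployment(
    "inference-server", "example.com/inference-server:latest", replicas=4
)
print(json.dumps(manifest, indent=2))
```

Kubernetes accepts manifests in JSON as well as YAML, so the printed output could be saved to a file and applied with standard Kubernetes tooling.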
Multiple replicas will appear in Run:AI as a single Inference workload. The workload will appear in all Run:AI dashboards and views as well as the Command-line interface.
Inference workloads can be submitted via the Run:AI Command-line interface or via the Kubernetes API/YAML. Internally, spawning an Inference workload also creates a Kubernetes Service. The Service is the endpoint to which clients connect.
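To show how a Service gives clients a single endpoint in front of the replicas, here is a minimal sketch of a generic Kubernetes `v1` Service manifest built in Python. It assumes the inference pods carry an `app: <name>` label and that the cluster uses the default `cluster.local` DNS domain; the names, namespace, and ports are hypothetical and do not reflect what Run:AI creates.

```python
def inference_service(name, port=80, target_port=8080):
    """Build a generic Kubernetes Service that routes traffic to pods labeled app=<name>."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            # Traffic is spread across every ready pod matching this selector.
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }


def cluster_dns_name(name, namespace):
    """In-cluster DNS name of a Service, assuming the default cluster domain."""
    return f"{name}.{namespace}.svc.cluster.local"


# Hypothetical service name and namespace.
service = inference_service("inference-server")
print(cluster_dns_name("inference-server", "my-project"))
```

Clients inside the cluster would send requests to the printed host name, and Kubernetes would forward each request to one of the replicas.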