Intro to Inference: How to Run AI Models on a GPU
Learn how to set up and run AI inference on GPUs in Google Cloud. This pathway gets you started
with the inference pipeline, model formats, and performance metrics through hands-on examples.
Training vs. Inference: Components of the inference pipeline
Learn the role of AI inference in production applications and how each step of the inference pipeline comes together to enable fast, large-scale AI applications.
Foundations of Inference: Build an inference pipeline
This hands-on lab introduces the fundamental concepts of model inference—how a trained model takes an input and produces an output. We will build a step-by-step understanding of the inference pipeline:
- CPU vs GPU inference: Showing why hardware acceleration matters.
- Pipeline components: Preprocessing, batching & padding, forward pass, decoding, and postprocessing (sketched in code after this list).
- Performance fundamentals: Introducing key principles of inference efficiency, like why batching requests is important for throughput, and why decoding without a cache can cause bottlenecks.
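To make these stages concrete, here is a minimal sketch of the pipeline in Python. It assumes the Hugging Face `transformers` library and uses `gpt2` purely as a small stand-in model; the lab itself may use different models and tooling.

```python
# Minimal sketch of the inference pipeline stages (assumes `transformers` and `torch`;
# "gpt2" is just a small stand-in model, not necessarily the one used in the lab).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"  # GPU if available, else CPU
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # gpt2 has no pad token; reuse EOS for padding
tokenizer.padding_side = "left"             # left-pad so generation continues from real tokens
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()

prompts = ["The GPU accelerates", "Batching requests lets the model"]

# Preprocessing + batching & padding: tokenize both prompts into one padded batch.
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(device)

# Forward pass + decoding: run the model and greedily decode a few new tokens.
with torch.no_grad():
    output_ids = model.generate(**batch, max_new_tokens=20, do_sample=False,
                                pad_token_id=tokenizer.eos_token_id)

# Postprocessing: convert token IDs back into human-readable text.
for text in tokenizer.batch_decode(output_ids, skip_special_tokens=True):
    print(text)
```

Running both prompts as a single padded batch is what lets the GPU amortize the forward pass, which is the throughput effect explored in the performance fundamentals above.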
Choosing the right format for your AI model
This article explains the different file formats used to save, share, and deploy AI models. By reading this article, you will learn:
- When to use the most common modern model formats, including Safetensors, GGUF, TensorRT, and ONNX (see the short code sketch after this list).
- How to choose the right format for your specific needs, from sharing models openly to deploying them for high-performance production.
- How formats like GGUF make it possible to run massive language models on your personal laptop.
- The crucial role that model formats play in the entire lifecycle of a model, from initial training to final deployment.
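As a rough illustration of how one model can move between formats, the sketch below saves PyTorch weights as Safetensors and exports the same model to ONNX. It assumes `torch`, the `safetensors` package, and a toy linear model chosen only for brevity; GGUF and TensorRT conversions use their own dedicated tooling and are not shown here.

```python
# Rough sketch of one model moving between formats (assumes `torch` and `safetensors`;
# the tiny linear model is a placeholder chosen only to keep the example short).
import torch
from safetensors.torch import load_file, save_file

model = torch.nn.Linear(16, 4)

# Safetensors: a safe, weights-only format, widely used for sharing checkpoints.
save_file(model.state_dict(), "linear.safetensors")
model.load_state_dict(load_file("linear.safetensors"))

# ONNX: a portable graph format that runtimes such as ONNX Runtime (and, via
# conversion, TensorRT) can execute outside of PyTorch.
dummy_input = torch.randn(1, 16)
torch.onnx.export(model, dummy_input, "linear.onnx",
                  input_names=["x"], output_names=["y"])
```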
Introduction to inference performance metrics
Understand the core metrics that define AI performance—from Time to First Token (TTFT) to throughput—and see how using percentiles is the key to balancing system efficiency with a consistently great user experience.
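A quick sketch of the percentile idea, using made-up latency samples (the numbers below are illustrative only):

```python
# Why percentiles beat averages: one slow request skews the mean, while p50/p90
# describe what typical and tail users actually experience. Numbers are made up.
import numpy as np

latencies_ms = np.array([110, 115, 118, 119, 120, 121, 122, 125, 130, 900])

print(f"mean: {latencies_ms.mean():.0f} ms")               # pulled up by the single outlier
print(f"p50:  {np.percentile(latencies_ms, 50):.0f} ms")   # the typical request
print(f"p90:  {np.percentile(latencies_ms, 90):.0f} ms")   # the slow tail users still notice
```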
Measure and understand inference metrics: Latency, throughput, and UX
This hands-on lab introduces key performance aspects of running AI inference. You’ll learn how to:
- Define and measure the core metrics of latency (the time it takes to get a response) and throughput (how many requests a model can handle over time).
- Analyze the trade-offs between latency and throughput and why optimizing for one often impacts the other.
- Identify key parameters—like batch size, prompt length, and sampling strategy—and how they directly affect inference performance.
- Interpret LLM latency measurements, including how to measure and visualize p50 versus p90 latency and first-token versus total latency (see the sketch after this list).
- Find the "sweet spot" by balancing these technical metrics to deliver the best possible real-world user experience.
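As a rough preview of the kind of measurement the lab walks through, the sketch below streams one generation and records time to first token versus total latency, then derives a crude tokens-per-second figure. It assumes the Hugging Face `transformers` library with its `TextIteratorStreamer`, and again uses `gpt2` only as a small stand-in; the lab's actual models, serving stack, and measurement methodology may differ.

```python
# Sketch: measure first-token vs. total latency for a single streamed request.
# Assumes `transformers` and `torch`; "gpt2" is a stand-in model, and counting
# streamer chunks only approximates the true token count.
import time
from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()

inputs = tokenizer("Explain GPU inference in one sentence:", return_tensors="pt").to(device)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

start = time.perf_counter()
Thread(target=model.generate,
       kwargs=dict(**inputs, max_new_tokens=64, do_sample=False,
                   pad_token_id=tokenizer.eos_token_id, streamer=streamer)).start()

ttft = None
chunks = 0
for _ in streamer:                                   # text arrives as tokens are decoded
    if ttft is None:
        ttft = time.perf_counter() - start           # time to first token (TTFT)
    chunks += 1
total = time.perf_counter() - start                  # total latency for the request

print(f"TTFT: {ttft:.2f}s  total: {total:.2f}s  ~{chunks / total:.1f} tokens/s")
```

Repeating a measurement like this across different batch sizes, prompt lengths, and sampling strategies is one way to surface the p50 versus p90 behavior and the latency/throughput sweet spot described above.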