Integrated with LibTPU, the new monitoring library provides detailed telemetry, performance metrics, and debugging tools to help enterprises optimize AI workloads on Google Cloud TPUs.
Google has introduced a new monitoring library to enhance Tensor Processing Unit (TPU) resource efficiency as enterprises scale AI workloads to meet growing internal and customer demand while managing costs.
The TPU Monitoring Library is integrated within LibTPU, the foundational library that machine learning frameworks such as JAX, PyTorch, and TensorFlow use to run models on Google Cloud TPUs.
"The TPU Monitoring Library gives you (enterprise users) detailed information on how machine learning workloads are performing on TPU hardware. It's designed to help you understand your TPU utilization, identify bottlenecks, and debug performance issues," Google explained in its documentation.
The library uses a telemetry API and a metrics suite to deliver detailed insights into the operational performance and behavior of TPUs, and it also provides a software development kit (SDK) and a command-line interface (CLI) as a diagnostic toolkit, allowing enterprises to perform in-depth performance analysis of TPU resources and carry out debugging.
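As a rough illustration of the kind of diagnostic workflow Google describes, the sketch below polls one of the library's metrics from Python on a TPU VM. The module path, function names, and metric identifier are assumptions based on the documented design rather than a verified API; consult Google's documentation for the exact interface.

```python
# Minimal sketch of querying the TPU Monitoring Library from Python on a TPU VM.
# ASSUMPTION: the SDK module path, function names, and metric identifier below are
# illustrative guesses at the documented interface, not a verified API.
from libtpu.sdk import tpumonitoring  # assumed module path

# Discover which metrics this LibTPU build exposes.
print(tpumonitoring.list_supported_metrics())  # assumed helper

# Sample how busy each TPU chip has been (duty cycle, as a percentage).
duty_cycle = tpumonitoring.get_metric(metric_name="duty_cycle_pct")  # assumed name
print(duty_cycle.description())
print(duty_cycle.data())  # one reading per chip
```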
Observability and insights into AI infrastructure performance are critical areas for enterprises when it comes to scaling their AI workloads, Charlie Dai, vice president and principal analyst at Forrester, said.
"According to Forrester's Tech Pulse Survey in Q4 2024, 85% of IT decision-makers are focusing on observability and AIOps in general," Dai added.
Google's new TPU Monitoring Library offers at least seven indicators that enterprises can use to determine TPU utilization and efficiency.
These indicators include Tensor Core Utilization, which measures how effectively the TPU's specialized cores are being used during operations, and Duty Cycle Percentage, which reveals how busy each TPU chip is over time.
It also offers the HBM Capacity Total and HBM Capacity Usage indicators that track the total and active use of high-bandwidth memory, respectively.
For network performance, the Buffer Transfer Latency metric captures latency distributions for large-scale data transfers, helping identify communication bottlenecks, Google said in its documentation.
Additionally, the library comes with High Level Operation (HLO) Execution Time Distribution Metrics, offering detailed timing breakdowns of compiled operations, and HLO Queue Size, which monitors execution pipeline congestion.
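Taken together, these indicators give a simple mental model for triage: a busy chip with underused tensor cores and a long HLO queue points to the execution pipeline, while near-full HBM or long buffer transfers point to memory and communication. The hypothetical helper below, written against plain metric readings rather than any particular API, sketches how such thresholds might be applied; the field names and cut-off values are illustrative only.

```python
# Hypothetical triage helper: field names and thresholds are illustrative,
# not part of the TPU Monitoring Library.
from dataclasses import dataclass

@dataclass
class TpuSnapshot:
    tensorcore_util_pct: float   # Tensor Core Utilization
    duty_cycle_pct: float        # Duty Cycle Percentage
    hbm_usage_bytes: int         # HBM Capacity Usage
    hbm_total_bytes: int         # HBM Capacity Total
    hlo_queue_size: int          # HLO Queue Size

def triage(s: TpuSnapshot) -> list[str]:
    findings = []
    if s.duty_cycle_pct > 90 and s.tensorcore_util_pct < 40:
        findings.append("Chip is busy but cores are underfed: suspect the input pipeline or queue congestion.")
    if s.hlo_queue_size > 100:
        findings.append("Large HLO queue: the host is enqueueing work faster than the TPU can execute it.")
    if s.hbm_usage_bytes / s.hbm_total_bytes > 0.95:
        findings.append("HBM nearly full: risk of out-of-memory; consider a smaller batch size.")
    return findings or ["No obvious bottleneck from these counters alone."]

# Example reading (values are made up):
print(triage(TpuSnapshot(35.0, 97.0, 31_000_000_000, 32_000_000_000, 250)))
```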
AWS and Microsoft, too, have similar tools
However, Google isn't the only AI infrastructure provider releasing tools to optimize the performance and usage of compute resources such as accelerators and GPUs.
Rival hyperscaler AWS offers a host of ways for enterprises to optimize the cost of running AI workloads while ensuring maximum usage of their resources.
To begin with, it provides Amazon CloudWatch, a service that delivers end-to-end observability into training workloads running on Trainium and Inferentia, including metrics such as GPU/accelerator utilization, latency, throughput, and resource availability.
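For comparison, pulling an accelerator-utilization series out of CloudWatch is a standard boto3 query; in the sketch below the namespace and metric name are placeholders, since the actual strings depend on how Neuron or SageMaker monitoring is configured in a given account.

```python
# Sketch of reading an accelerator-utilization metric from Amazon CloudWatch with boto3.
# ASSUMPTION: the Namespace and MetricName strings are placeholders; the real values
# depend on how Neuron/SageMaker monitoring is configured in your account.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.datetime.utcnow()
resp = cloudwatch.get_metric_data(
    MetricDataQueries=[{
        "Id": "accel_util",
        "MetricStat": {
            "Metric": {
                "Namespace": "NeuronMonitor",            # placeholder namespace
                "MetricName": "neuroncore_utilization",  # placeholder metric name
            },
            "Period": 300,      # 5-minute granularity
            "Stat": "Average",
        },
    }],
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
)

# Print each returned series as (timestamp, value) pairs.
for result in resp["MetricDataResults"]:
    print(result["Label"], list(zip(result["Timestamps"], result["Values"])))
```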
AWS services such as SageMaker, via offerings like SageMaker HyperPod, also enable more efficient use of resources while reducing training time.
In contrast to the manual model training process, which is prone to delays, unnecessary expenditure, and other complications, HyperPod removes the heavy lifting involved in building and optimizing machine learning infrastructure for training models, reducing training time by up to 40%, according to AWS.
Similar to the TPU Monitoring Library, Microsoft offers Maia SDK as the core toolkit to optimize model execution for its Azure Maia chipsets, together with developer tools like Maia Debugger and Profiler for debugging and tracking, Dai said.
Although rivals are offering similar tools, Dai pointed out that the new monitoring library is expected to effectively help Google Cloud further expand its footprint in the AI-native infrastructure cloud market.


