Titan AI Training Pod

Cluster Overview

Basic Information

Cluster Type

AI Training

Status

Healthy

Total Racks

Location

Datacenter

ASIA-PAC-1

Halls

Hall D

Description

High-performance AI training cluster with NVIDIA H100 GPUs

Cluster Utilization

5.2%

86.9%

356 active jobs

Power Consumption

17.8 MW

PUE: 1 • 557.7 kW/node

GPU Health

88%

All GPUs operational

Compute Performance

356.4K TFLOPS

178 jobs queued

Cluster Specifications

Compute Resources

Total Nodes

CPU Cores

4,096

Memory

64 TB

Storage

2 PB

GPU Configuration

Total GPUs

256

GPU Models

H100 SXM

Topology

DGX_POD

Interconnect

INFINIBAND_NDR

GPU Utilization

87%

Network Configuration

Compute Fabric

INFINIBAND_NDR

Topology

DRAGONFLY

Bandwidth

36 Tbps

Latency

1813 ns

Management Subnet

10.211.30.0/24
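
A minimal sketch of how the management subnet above could be checked with Python's standard ipaddress module. The subnet value comes from this page; the helper name in_mgmt_subnet and the sample addresses are illustrative assumptions, not part of the cluster's tooling.

```python
import ipaddress

# Management subnet as listed under Network Configuration.
mgmt = ipaddress.ip_network("10.211.30.0/24")

# A /24 holds 256 addresses; the network and broadcast addresses
# are conventionally not assigned to hosts.
usable_hosts = mgmt.num_addresses - 2


def in_mgmt_subnet(addr: str) -> bool:
    """Return True if addr falls inside the management subnet (hypothetical helper)."""
    return ipaddress.ip_address(addr) in mgmt


print(f"{mgmt}: {usable_hosts} usable host addresses")
print(in_mgmt_subnet("10.211.30.17"))  # sample address, assumed for illustration
```

A check like this can catch a node whose management interface was configured outside the /24 before it is added to the scheduler.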

Rack Composition

Rack R1-1

COMPUTE

GPUs

88% healthy

Power

52.9 / 55 kW

Cooling

liquid

Temps

23°C → 27°C

Space

36/48U (12U free)

Rack R1-2

COMPUTE

GPUs

88% healthy

Power

60.9 / 55 kW

Cooling

liquid

Temps

19°C → 35°C

Space

36/48U (12U free)

Rack R1-3

NETWORK

Power

10.7 / 35 kW

Cooling

liquid

Temps

14°C → 42°C

Space

18/48U (30U free)

Rack R1-4

STORAGE

Power

11.5 / 35 kW

Cooling

liquid

Temps

10°C → 50°C

Space

16/48U (32U free)
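
The power readings in the rack cards above can be compared against each rack's rated capacity. The sketch below assumes that the "x / y kW" format means current draw over rated capacity; the function names are hypothetical and the values are copied from this page.

```python
# Power readings from the rack cards: (draw_kw, capacity_kw).
# Assumption: the first number is current draw, the second is rated capacity.
RACK_POWER_KW = {
    "R1-1": (52.9, 55.0),
    "R1-2": (60.9, 55.0),
    "R1-3": (10.7, 35.0),
    "R1-4": (11.5, 35.0),
}


def headroom_kw(draw_kw: float, capacity_kw: float) -> float:
    """Remaining capacity in kW; a negative value means the rack is over capacity."""
    return round(capacity_kw - draw_kw, 1)


def over_capacity(racks: dict) -> list:
    """Names of racks whose draw exceeds their rated capacity."""
    return [name for name, (draw, cap) in racks.items() if draw > cap]


for name, (draw, cap) in RACK_POWER_KW.items():
    print(f"{name}: {headroom_kw(draw, cap):+.1f} kW headroom")

print("Over capacity:", over_capacity(RACK_POWER_KW))
```

Under that reading, Rack R1-2 is the only rack drawing more than its listed capacity.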

Workload Scheduler

Type

PBS

Endpoint

https://closed-mom.name

Version

8.2.12

Jobs Running

356

Jobs Queued

178

Configuration

Auto Scaling

Enabled

Power Capping

Enabled

Power Limit

17.8 MW

Maintenance Window

Tue 8:00 (3h)

Connected Storage Systems

Lustre HPC Storage

HEALTHY

LUSTRE

Capacity

400 / 800 TB

Performance

IOPS

800.0K

Throughput

80 GB/s

1 degraded drive

WekaFS Storage

WARNING

WEKA

Capacity

800 / 1600 TB

Performance

IOPS

1.1M

Throughput

110 GB/s

1 degraded drive
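
As a rough illustration, the capacity figures for the two connected storage systems can be turned into utilization and free-space numbers. This sketch assumes the first capacity figure is space used and the second is total capacity; the function names are hypothetical.

```python
# Capacity figures from the storage cards: (used_tb, total_tb).
# Assumption: the first number is used space, the second is total capacity.
STORAGE_TB = {
    "Lustre HPC Storage": (400, 800),
    "WekaFS Storage": (800, 1600),
}


def used_percent(used_tb: float, total_tb: float) -> float:
    """Share of total capacity in use, as a percentage."""
    return 100.0 * used_tb / total_tb


def free_tb(used_tb: float, total_tb: float) -> float:
    """Remaining capacity in TB."""
    return total_tb - used_tb


for name, (used, total) in STORAGE_TB.items():
    print(f"{name}: {used_percent(used, total):.0f}% used, {free_tb(used, total)} TB free")
```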

Metadata

Created

7/27/2026, 6:16:51 AM

Last Updated

7/27/2026, 6:16:51 AM