Cluster Overview
Basic Information
Cluster Type
AI Training
Status
Healthy
Total Racks
4
Location
Datacenter
ASIA-PAC-1
Halls
Hall D
Description
High-performance AI training cluster with NVIDIA H100 GPUs
Cluster Utilization
86.9%
356 active jobs
Power Consumption
17.8 MW
PUE: 1 • 557.7 kW/node
GPU Health
88%
All GPUs operational
Compute Performance
356.4K TFLOPS
178 jobs queued
Cluster Specifications
Compute Resources
Total Nodes
32
CPU Cores
4,096
Memory
64 TB
Storage
2 PB
GPU Configuration
Total GPUs
256
GPU Models
H100 SXM
Topology
DGX_POD
Interconnect
INFINIBAND_NDR
GPU Utilization
87%
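The totals under Compute Resources and GPU Configuration can be turned into per-node figures with simple division. The sketch below assumes all 32 nodes are configured identically, which the listing does not state.

```python
# Cluster-wide totals copied from the specifications above.
total_nodes = 32
totals = {"gpus": 256, "cpu_cores": 4096, "memory_tb": 64}

# Assumption: resources are spread evenly across identical nodes.
per_node = {name: value / total_nodes for name, value in totals.items()}

print(per_node)
```

Under that assumption each node carries 8 GPUs, 128 CPU cores, and 2 TB of memory.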
Network Configuration
Compute Fabric
INFINIBAND_NDR
Topology
DRAGONFLY
Bandwidth
36 Tbps
Latency
1813 ns
Management Subnet
10.211.30.0/24
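As a quick sanity check, the size of the management subnet can be compared with the node count using Python's standard `ipaddress` module. This is a minimal sketch: it assumes one management address per node and ignores any switches, BMCs, or storage controllers that may share the subnet.

```python
import ipaddress

# Management subnet as listed above.
subnet = ipaddress.ip_network("10.211.30.0/24")

# Exclude the network and broadcast addresses.
usable_hosts = subnet.num_addresses - 2

# Assumption: one management address per compute node.
total_nodes = 32
fits = usable_hosts >= total_nodes

print(usable_hosts, fits)
```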
Rack Composition
Rack R1-1
COMPUTE
GPUs
64
88% healthy
Power
52.9 / 55 kW
Cooling
liquid
Temps
23°C → 27°C
Space
36/48U (12U free)
Rack R1-2
COMPUTE
GPUs
64
88% healthy
Power
60.9 / 55 kW
Cooling
liquid
Temps
19°C → 35°C
Space
36/48U (12U free)
Rack R1-3
NETWORK
Power
10.7 / 35 kW
Cooling
liquid
Temps
14°C → 42°C
Space
18/48U (30U free)
Rack R1-4
STORAGE
Power
11.5 / 35 kW
Cooling
liquid
Temps
10°C → 50°C
Space
16/48U (32U free)
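The per-rack power readings above can be checked against their limits programmatically. The sketch below copies the draw and limit values from the rack listing; the function names are hypothetical and not part of any dashboard API.

```python
# Rack power draw and limit in kW, copied from the Rack Composition listing.
racks = {
    "R1-1": (52.9, 55.0),
    "R1-2": (60.9, 55.0),
    "R1-3": (10.7, 35.0),
    "R1-4": (11.5, 35.0),
}

def over_budget(readings):
    """Return the names of racks whose draw exceeds their limit."""
    return [name for name, (draw, limit) in readings.items() if draw > limit]

def headroom_kw(readings):
    """Return remaining kW per rack (negative means over the limit)."""
    return {name: round(limit - draw, 1) for name, (draw, limit) in readings.items()}

print(over_budget(racks))
print(headroom_kw(racks))
```

With the values as listed, only Rack R1-2 is flagged: it reports 60.9 kW against a 55 kW limit.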
Workload Scheduler
Type
PBS
Endpoint
https://closed-mom.name
Version
8.2.12
Jobs Running
356
Jobs Queued
178
Configuration
Auto Scaling
Enabled
Power Capping
Enabled
Power Limit
17.8 MW
Maintenance Window
Tue 8:00 (3h)
Connected Storage Systems
Lustre HPC Storage
HEALTHY
LUSTRE • LUSTRE
Capacity
400 / 800 TB
Performance
IOPS
800.0K
Throughput
80 GB/s
1 degraded drive
WekaFS Storage
WARNING
WEKA • WEKA
Capacity
800 / 1600 TB
Performance
IOPS
1.1M
Throughput
110 GB/s
1 degraded drive
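The capacity figures for the two storage systems can be expressed as fill percentages. This is a minimal sketch that assumes the first number in each capacity pair is used space and the second is total space, both in TB; the listing does not label them explicitly.

```python
# (used_tb, total_tb) per storage system, copied from the listing above.
# Assumption: the first value is used capacity, the second is total capacity.
systems = {
    "Lustre HPC Storage": (400, 800),
    "WekaFS Storage": (800, 1600),
}

def fill_percent(used_tb, total_tb):
    """Return used capacity as a percentage of total capacity."""
    return 100.0 * used_tb / total_tb

fill = {name: fill_percent(used, total) for name, (used, total) in systems.items()}

print(fill)
```

Under that reading, both systems are half full.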
Metadata
Created
7/5/2025, 7:57:28 AM
Last Updated
7/5/2025, 7:57:28 AM
Tags