Cluster Overview
Basic Information
Cluster Type
AI Training
Status
Healthy
Total Racks
4
Location
Datacenter
US-EAST-1
Halls
Hall A
Description
AMD MI300X-based AI training cluster
Cluster Utilization
85.9%
238 active jobs
Power Consumption
11.9 MW
PUE: 1 • 1493.5 kW/node
GPU Health
100%
All GPUs operational
Compute Performance
238.3K TFLOPS
119 jobs queued
Cluster Specifications
Compute Resources
Total Nodes
8
CPU Cores
1,024
Memory
16 TB
Storage
0.5 PB
GPU Configuration
Total GPUs
64
GPU Models
MI300X
Topology
CUSTOM
Interconnect
INFINIBAND_NDR
GPU Utilization
86%
Network Configuration
Compute Fabric
INFINIBAND_HDR
Topology
FAT_TREE
Bandwidth
24 Tbps
Latency
1229 ns
Management Subnet
10.43.204.0/24
Rack Composition
Rack R1-1
COMPUTE
GPUs
32
100% healthy
Power
43.8 / 55 kW
Cooling
liquid
Temps
22°C → 30°C
Space
36/48U (12U free)
Rack R1-2
COMPUTE
GPUs
32
100% healthy
Power
56.6 / 55 kW
Cooling
liquid
Temps
18°C → 37°C
Space
36/48U (12U free)
Rack R1-3
NETWORK
Power
14.7 / 35 kW
Cooling
liquid
Temps
14°C → 45°C
Space
18/48U (30U free)
Rack R1-4
STORAGE
Power
15.5 / 35 kW
Cooling
liquid
Temps
30°C → 52°C
Space
16/48U (32U free)
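The per-rack power figures above can be checked with a short script. This is a minimal sketch, assuming each "x / y kW" reading means current draw over rack capacity; the names RACKS, power_headroom, and over_budget are hypothetical and not part of any cluster tooling.

```python
# Rack readings copied from the Rack Composition listing.
# Assumption: each pair is (current draw in kW, rack capacity in kW).
RACKS = {
    "R1-1": (43.8, 55.0),
    "R1-2": (56.6, 55.0),
    "R1-3": (14.7, 35.0),
    "R1-4": (15.5, 35.0),
}


def power_headroom(racks):
    """Return {rack: (headroom_kw, load_fraction)} for each rack."""
    report = {}
    for name, (draw_kw, capacity_kw) in racks.items():
        # Round to one decimal to match the precision of the readings.
        headroom_kw = round(capacity_kw - draw_kw, 1)
        load_fraction = draw_kw / capacity_kw
        report[name] = (headroom_kw, load_fraction)
    return report


def over_budget(racks):
    """Return the names of racks whose draw exceeds their capacity."""
    return [name for name, (draw_kw, capacity_kw) in racks.items()
            if draw_kw > capacity_kw]


if __name__ == "__main__":
    for name, (headroom_kw, load) in power_headroom(RACKS).items():
        flag = "OVER" if headroom_kw < 0 else "ok"
        print(f"{name}: {load:.0%} of capacity, "
              f"{headroom_kw:+.1f} kW headroom [{flag}]")
```

Under that reading, Rack R1-2 is the only rack drawing more than its listed capacity, which may be worth a second look given that Power Capping is reported as enabled.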
Workload Scheduler
Type
PBS
Endpoint
https://steel-waist.name/
Version
1.16.20
Jobs Running
238
Jobs Queued
119
Configuration
Auto Scaling
Enabled
Power Capping
Enabled
Power Limit
11.9 MW
Maintenance Window
Mon 5:00 (2h)
Metadata
Created
7/5/2025, 6:08:00 AM
Last Updated
7/5/2025, 6:08:00 AM
Tags