Radeon AI Accelerator


Cluster Overview

Basic Information

Cluster Type

AI Training

Status

Healthy

Total Racks

4

Location

Datacenter

US-EAST-1

Halls

Hall A

Description

AMD MI300X-based AI training cluster

Cluster Utilization

5.2%

85.9%

238 active jobs

Power Consumption

11.9 MW

PUE: 1 • 1493.5 kW/node

GPU Health

100%

All GPUs operational

Compute Performance

238.3K TFLOPS

119 jobs queued

Cluster Specifications

Compute Resources

Total Nodes

8

CPU Cores

1,024

Memory

16 TB

Storage

0.5 PB

GPU Configuration

Total GPUs

64

GPU Models

MI300X

Topology

CUSTOM

Interconnect

INFINIBAND_NDR

GPU Utilization

86%
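
The per-node figures implied by the totals above follow from simple division. This is a minimal sketch in Python, assuming resources are spread evenly across the 8 nodes (the page does not state this); the variable names are illustrative only.

```python
# Totals as displayed on this page.
TOTAL_NODES = 8
totals = {"cpu_cores": 1024, "memory_tb": 16, "gpus": 64}

# Assumption: resources are distributed evenly across all nodes.
per_node = {name: value / TOTAL_NODES for name, value in totals.items()}

for name, value in per_node.items():
    print(f"{name}: {value:g} per node")
```

Under that assumption, each node would carry 128 CPU cores, 2 TB of memory, and 8 GPUs.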

Network Configuration

Compute Fabric

INFINIBAND_HDR

Topology

FAT_TREE

Bandwidth

24 Tbps

Latency

1229 ns

Management Subnet

10.43.204.0/24

Rack Composition

Rack R1-1

COMPUTE

GPUs

32

100% healthy

Power

43.8 / 55 kW

Cooling

liquid

Temps

22°C → 30°C

Space

36/48U (12U free)

Rack R1-2

COMPUTE

GPUs

32

100% healthy

Power

56.6 / 55 kW

Cooling

liquid

Temps

18°C → 37°C

Space

36/48U (12U free)

Rack R1-3

NETWORK

Power

14.7 / 35 kW

Cooling

liquid

Temps

14°C → 45°C

Space

18/48U (30U free)

Rack R1-4

STORAGE

Power

15.5 / 35 kW

Cooling

liquid

Temps

30°C → 52°C

Space

16/48U (32U free)
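
The rack power readings above can be checked against their budgets with a few lines of Python. This is a sketch only: the `racks` mapping and the `over_budget` helper are hypothetical names, and the second number in each "Power" reading is assumed to be the rack's power budget in kW.

```python
# (draw_kw, budget_kw) per rack, copied from the readings on this page.
racks = {
    "R1-1": (43.8, 55.0),
    "R1-2": (56.6, 55.0),
    "R1-3": (14.7, 35.0),
    "R1-4": (15.5, 35.0),
}

def over_budget(racks):
    """Return {rack: excess_kw} for racks drawing more than their budget."""
    return {
        name: round(draw - budget, 1)
        for name, (draw, budget) in racks.items()
        if draw > budget
    }

excess = over_budget(racks)
total_draw_kw = round(sum(draw for draw, _ in racks.values()), 1)
print(excess, total_draw_kw)
```

With the values shown, only Rack R1-2 exceeds its budget, by 1.6 kW, and the four racks draw 130.6 kW in total.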

Workload Scheduler

Type

PBS

Endpoint

https://steel-waist.name/

Version

1.16.20

Jobs Running

238

Jobs Queued

119

Configuration

Auto Scaling

Enabled

Power Capping

Enabled

Power Limit

11.9 MW

Maintenance Window

Mon 5:00 (2h)

Metadata

Created

7/5/2025, 6:08:00 AM

Last Updated

7/5/2025, 6:08:00 AM

Tags

high-priority