Senior Datacenter Technical Program Manager, At-Scale AI Clusters

b960b2cd-63f

We are looking for a highly motivated Technical Program Manager (TPM) to join our Applied Systems Engineering Team to drive datacenter integration for the next generation of NVIDIA AI supercomputing systems.

This TPM will play a crucial role throughout the lifecycle of the latest AI systems at scale, from datacenter design and requirements definition, through systems integration of AI clusters into the datacenter environment, and support for these systems as they enter production.

The successful candidate will collaborate with outstanding engineers and architects to build and deploy large-scale GPU computing systems based on NVIDIA's reference supercomputing architectures.

Key responsibilities include:

Collaborating with engineering leaders across multiple hardware and software teams to build AI supercomputers for NVIDIA engineers and develop reference architectures to advise customers and partners.

Leading the integration of new AI clusters into datacenter facilities that have demanding requirements for power, cooling, and instrumentation.

Coordinating design and fit-out of new datacenter builds, working with both internal engineering teams and external contractors.

Owning and producing detailed documentation for the end-to-end process for datacenter fit-out and integration.

Communicating internally with engineering leadership to prioritize and address key issues essential to the success of our largest customers.

We are looking for a TPM with a strong background in high-performance computing systems and GPU clusters deployed in on-premises datacenters.

BS in Applied Science or Engineering (or equivalent experience)

8+ years of overall experience

Experience with high-performance computing systems and GPU clusters deployed in on-premises datacenters

A passion for understanding challenging technical problems and driving the process of finding a solution

Strong teamwork and interpersonal skills, to facilitate building a collaborative workflow for coordination between many teams

Understanding of datacenter design, including familiarity with power and cooling technologies

Expertise in system monitoring and instrumentation of large clusters, using technologies such as Prometheus, Grafana, Splunk, Modbus, and BACnet

Experience working with the engineering or academic research community supporting high-performance computing or deep learning

You will also be eligible for equity and benefits.

XML job scraping automation by YubHub

full-time senior onsite

high-performance computing systems, GPU clusters, datacenter design, power and cooling technologies, system monitoring and instrumentation, Prometheus, Grafana, Splunk, Modbus, BACnet

Engineering Technology

NVIDIA
https://logos.yubhub.co/nvidia.com.png
NVIDIA is a leading technology company that specializes in designing graphics processing units (GPUs) and high-performance computing hardware.
https://www.nvidia.com/
https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Datacenter-Technical-Program-Manager_JR2011480
Santa Clara
2026-04-25

Data Center Operations Technician

61e08612-10d

As a Data Center Operations Technician at xAI, you will be responsible for the health of our server and network infrastructure for Data Centers and Global Points of Presence. You will be responsible for our two most important data center operations metrics: mean time to detect (MTTD) and mean time to repair (MTTR).

Your primary responsibilities will include:

Reporting to the job site during initial construction and reporting back to the engineering team as required.
Performing troubleshooting and monitoring of the servers and network in our data centers and global points of presence.
Racking and stacking of data center network equipment.
Maintaining Warehouse inventory and asset management using our internal application.
Labelling and troubleshooting for fibre/optics cables.
Power supply cabling, installation, troubleshooting and repair.
Installation of racks, servers and switches; this includes staging racks in place, cabling, power up and handoff of hardware to the provisioning team for customer capacity allocation.
Managing, responding to, and resolving data center operations tickets used cross-functionally within xAI via Jira.
Creating and maintaining documentation of tasks and standard operating procedures.
Receipt and decommissioning of data center hardware.
Handling vendor returns for infrastructure under and out of warranty.
Managing spare parts inventory within the data center.
Defining, designing, and implementing network layouts and solutions within our data centers.

To be successful in this role, you will need:

A high school diploma or equivalency certificate.
2+ years of experience working with server, storage, compute and network hardware.
2+ years of experience troubleshooting and repairing servers and networking infrastructure.
2+ years of experience in Inventory Management, and ordering, receiving and shipping server and network equipment.
Strong Linux skills, including navigating the system's directories and file system, manipulating files in the Linux shell, user permission configuration, package installation and software management.
Ability to identify and apply different filesystem types, using Linux commands for process management, basic troubleshooting and debugging, and Bash or other scripting.
Experience being on-call and ability to respond to critical events as needed.
Experience leading Data Center Infrastructure projects.
Curiosity and eagerness to keep learning about the data center world.
Excellent prioritization and time management skills.
Able to work in a fast-paced environment.
Detail-oriented.
Oracle Experience.
Inventory Management.
4+ years of experience in Structured Cabling Copper/Fibre.
4+ years of experience in Power and Cooling concepts inside the data center.

full-time mid onsite

Linux, Server Hardware, Network Hardware, Inventory Management, Troubleshooting, Debugging, Scripting, Oracle Experience, Structured Cabling Copper/Fibre, Power and Cooling Concepts, Strong Linux skills, Experience leading Data Center Infrastructure projects, Curious to always learn new things within the Data Center World, Excellent prioritization and time management skills, Able to work in a fast-paced environment, Detail-oriented

Engineering Technology

xAI
https://logos.yubhub.co/xai.com.png
xAI creates AI systems to understand the universe and aid humanity in its pursuit of knowledge.
https://www.xai.com/
https://job-boards.greenhouse.io/xai/jobs/4741579007
Memphis, TN; Southaven, MS
2026-04-18

Data Center Operations Technician

7cd8e557-30a

As a Data Center Operations Technician at xAI, you will be responsible for the health of our server and network infrastructure for Data Centers. Your primary focus will be on maintaining our network infrastructure, performing troubleshooting and monitoring of servers, and managing data center operations tickets.

Responsibilities

Perform troubleshooting and monitoring of servers, diagnose and repair issues.
Maintain our network infrastructure, including network gear swaps, troubleshooting optics/fiber links, and repairing them.
Manage, respond to, and resolve data center operations tickets used cross-functionally via Jira.
Create and maintain documentation of tasks and standard operating procedures.
Receive and decommission data center hardware.
Install racks, servers, and switches, including staging racks in place, cabling, power up, and handoff of hardware to the provisioning team for customer capacity allocation.
Maintain warehouse inventory and asset management using our internal application.
Handle vendor returns for infrastructure under and out of warranty.
Manage spare parts inventory within the data center.
Define, design, and implement network layouts and solutions within our data centers.

Qualifications

2+ years of experience working with server, storage, compute, and network hardware.
2+ years of experience troubleshooting and repairing servers and networking infrastructure.
1+ year of experience in inventory management, and ordering, receiving, and shipping server and network equipment.
Strong Linux skills.
Ability to lift 75 lbs.
Ability to work in a 24/7, fast-paced environment with excellent prioritization and time management skills.

Preferred Skills and Experience

Experience being on-call and responding to critical events as needed.
Experience leading Data Center Infrastructure projects.
Curiosity and eagerness to keep learning about the data center world.
Excellent prioritization and time management skills.
Able to work in a fast-paced environment, with strong attention to detail.
1+ year of experience in structured cabling copper/fiber.
1+ year of experience in power and cooling concepts inside the data center.

full-time mid onsite

Linux, Server hardware, Network hardware, Troubleshooting, Inventory management, Lifting 75 lbs, On-call experience, Data Center Infrastructure project leadership, Structured cabling copper/fiber, Power and cooling concepts

Engineering Technology

xAI
https://logos.yubhub.co/xai.com.png
xAI creates AI systems to understand the universe and aid humanity's pursuit of knowledge.
https://www.xai.com/
https://job-boards.greenhouse.io/xai/jobs/5085890007
Hillsboro, OR
2026-04-18

Field Hardware Engineer, HPC

24be48df-238

We're hiring a Field HW Engineer to work on-site at our data centre in Bruyères-le-Châtel. As a Field HW Engineer, you will be responsible for understanding end-to-end systems, executing complex/vendor-level interventions, and guiding L1 engineers on site.

Your work will involve hands-on troubleshooting and repair of compute, storage, interconnect and cooling systems to keep our large GPU/CPU cluster healthy and scalable. You will also be responsible for leading complex interventions, advanced diagnostics, guiding and uplifting L1s, process and automation, safety and compliance, and parts and logistics.

To be successful in this role, you will need 5+ years of experience in data center/server hardware or L2/L3 hardware support, with proven complex hands-on work in production (HPC/AI/Cloud at scale). You should have end-to-end hardware expertise, including comfort with CPU/memory/PCIe cards, NICs, PSUs, drives, network, power and cooling. You should also be confident analyzing BMC/IPMI logs and Linux software logs and crashes and running simple CLI checks, and have methodical root cause analysis skills.

The ideal candidate will be willing to travel between sites (Paris area or nearby regions, occasionally in Europe or US) and have a strong understanding of safety and discipline, including impeccable ESD/LOTO/PPE habits, zero rough handling, and clean, labeled, auditable work.

full-time senior onsite

data center/server hardware, L2/L3 hardware support, complex hands-on work in production (HPC/AI/Cloud at scale), end-to-end hardware expertise, CPU/memory/PCIe cards, NICs, PSUs, drives, network, power and cooling, BMC/IPMI logs, Linux software logs, crashes, simple CLI checks, root cause analysis, vendor tools (iDRAC/iLO/IPMI), RAID/storage basics (NVMe/SAS/SATA), high-speed interconnect (Ethernet/InfiniBand), coding/automation (Python/Bash)

Engineering Technology

Mistral AI
Mistral AI designs and develops high-performance, optimized, open-source and cutting-edge AI models, products and solutions for enterprise use.
https://mistral.ai
https://jobs.lever.co/mistral/ea94b55b-58e1-437b-bf3d-07ed150308e3
Bruyères-le-Châtel
2026-03-10