Member of Technical Staff - Infrastructure Reliability

We are seeking a Member of Technical Staff - Infrastructure Reliability to join our team. As a key member of our infrastructure team, you will own the availability, performance, and evolution of our core compute, storage, and networking infrastructure. This is a joint xAI/X role: you will own 24×7 reliability for the world's largest GPU training superclusters and one of the highest-QPS production systems on the planet.

You will define and execute the technical strategy for infrastructure reliability and scalability; build and maintain the automation, observability, and control planes that keep multi-datacenter, hybrid cloud/on-prem environments healthy; lead incident response, deep-dive root cause analysis, and post-mortems that drive real fixes; identify, instrument, and eliminate systemic failure patterns; design and implement high-leverage systems software in Python and Rust; and push the state of the art in large-scale GPU cluster operations and AI workload reliability.

To succeed in this role, you will need 5+ years of experience shipping production software and/or operating distributed infrastructure at scale; expert-level knowledge of Linux systems, TCP/IP networking, and systems programming; strong coding skills with proven production experience in Rust (strongly preferred) and at least one of Python, Go, or C++; deep experience with large-scale distributed systems in on-prem and cloud environments; hands-on expertise with container orchestration, container runtimes, and infrastructure-as-code; an intimate understanding of common failure modes in distributed systems and how to mitigate them; and a track record of participating in (or building) effective on-call rotations in high-stakes environments.

In addition to a competitive base salary, you will receive equity; comprehensive medical, vision, and dental coverage; access to a 401(k) retirement plan; short- and long-term disability insurance; life insurance; and various other discounts and perks.


Employment type: full-time
Level: staff
Work arrangement: onsite
Salary: $180,000 - $400,000 USD
Skills: Linux systems, TCP/IP networking, systems programming, Rust, Python, Go, C++, container orchestration, container runtimes, infrastructure-as-code, high-performance networking, low level configuration, deployment, support, monitoring, administration, troubleshooting
Category: Engineering Technology
Company: xAI
About the company: xAI creates AI systems to understand the universe and aid humanity in its pursuit of knowledge.
Company website: https://www.xai.com/
Job listing: https://job-boards.greenhouse.io/xai/jobs/4801451007
Location: Palo Alto, CA
Date: 2026-04-18