<?xml version="1.0" encoding="UTF-8"?>
<source>
  <jobs>
    <job>
      <externalid>51758515-c12</externalid>
      <Title>Member of Technical Staff</Title>
      <Description><![CDATA[<p>We are seeking a highly skilled Member of Technical Staff to join our team in managing and enhancing reliability across a multi-data center environment.</p>
<p>This role focuses on automating processes, building and implementing robust observability solutions, and ensuring seamless operations for mission-critical AI infrastructure.</p>
<p>The ideal candidate will combine strong coding abilities with hands-on data center experience to build scalable reliability services, optimize system performance, and minimize downtime,including close partnership with facility operations to address physical infrastructure impacts.</p>
<p>In an era where AI workloads demand near-zero downtime, this position plays a pivotal role in bridging software engineering principles with physical data center realities.</p>
<p>By prioritizing automation and observability, team members in this role can reduce mean time to recovery (MTTR) by up to 50% through proactive monitoring and automated remediation, based on industry benchmarks from high-scale environments like those at hyperscale cloud providers.</p>
<p>Responsibilities:</p>
<ul>
<li>Design, develop, and deploy scalable code and services (primarily in Python and Rust, with flexibility for emerging languages) to automate reliability workflows, including monitoring, alerting, incident response, and infrastructure provisioning.</li>
</ul>
<ul>
<li>Implement and maintain observability tools and practices, such as metrics collection, logging, tracing, and dashboards, to provide real-time insights into system health across multiple data centers,open to innovative stacks beyond traditional ones like ELK.</li>
</ul>
<ul>
<li>Collaborate with cross-functional teams,including software development, network engineering, site operations, and facility operations (critical facilities, mechanical/electrical teams, and data center infrastructure management),to identify reliability bottlenecks, automate solutions for fault tolerance, disaster recovery, capacity planning, and physical/environmental risk mitigation (e.g., power redundancy, cooling efficiency, and environmental monitoring integration).</li>
</ul>
<ul>
<li>Troubleshoot and resolve complex issues in data center environments, including hardware failures, environmental anomalies, software bugs, and network-related problems, while adhering to reliability principles like error budgets and SLAs.</li>
</ul>
<ul>
<li>Optimize Linux-based systems for performance, security, and reliability, including kernel tuning, container orchestration (e.g., Kubernetes or emerging alternatives), and scripting for automation.</li>
</ul>
<ul>
<li>Understand network topologies and concepts in large-scale, multi-data center environments to effectively troubleshoot connectivity, routing, redundancy, and performance issues; integrate observability into data center interconnects and facility-level controls for rapid diagnosis and automation.</li>
</ul>
<ul>
<li>Participate in on-call rotations, post-incident reviews (blameless postmortems), and continuous improvement initiatives to enhance overall site reliability, including joint exercises with facility teams for physical failover and recovery scenarios.</li>
</ul>
<ul>
<li>Mentor junior team members and document processes to foster a culture of automation, knowledge sharing, and adaptability to new technologies.</li>
</ul>
<p>Basic Qualifications:</p>
<ul>
<li>Bachelor&#39;s degree in Computer Science, Computer Engineering, Electrical Engineering, or a closely related technical field (or equivalent professional experience).</li>
</ul>
<ul>
<li>5+ years of hands-on experience in site reliability engineering (SRE), infrastructure engineering, DevOps, or systems engineering, preferably supporting large-scale, distributed, or production environments.</li>
</ul>
<ul>
<li>Strong programming skills with proven production experience in Python (required for automation and tooling); experience with Rust or willingness to work in Rust is a plus, but strong coding fundamentals in at least one systems-level language (e.g., Python, Go, C++) are essential.</li>
</ul>
<ul>
<li>Solid experience with Linux systems administration, performance tuning, kernel-level understanding, and scripting/automation in production environments.</li>
</ul>
<ul>
<li>Practical knowledge of containerization and orchestration technologies, such as Docker and Kubernetes (or similar systems).</li>
</ul>
<ul>
<li>Experience implementing observability solutions, including metrics, logging, tracing, monitoring tools (e.g., Prometheus, Grafana, or alternatives), alerting, and dashboards.</li>
</ul>
<ul>
<li>Familiarity with troubleshooting complex issues in distributed systems, including software bugs, hardware failures, network problems, and environmental factors.</li>
</ul>
<ul>
<li>Understanding of networking fundamentals (TCP/IP, routing, redundancy, DNS) in large-scale or multi-site environments.</li>
</ul>
<ul>
<li>Experience participating in on-call rotations, incident response, post-incident reviews (blameless postmortems), and reliability practices such as error budgets or SLAs.</li>
</ul>
<ul>
<li>Ability to collaborate effectively with cross-functional teams (software engineers, network teams, site/facility operations, mechanical/electrical teams).</li>
</ul>
<p>Preferred Skills and Experience:</p>
<ul>
<li>7+ years of experience in SRE or infrastructure roles, ideally in hyperscale, cloud, or AI/ML training infrastructure environments with multi-data center setups.</li>
</ul>
<ul>
<li>Hands-on experience operating or scaling Kubernetes clusters (or equivalent orchestration) at large scale, including automation for provisioning, lifecycle management, and high-availability.</li>
</ul>
<ul>
<li>Proficiency in Rust for systems programming and performance-critical components.</li>
</ul>
<ul>
<li>Direct experience integrating software reliability tools with physical data center infrastructure.</li>
</ul>
<ul>
<li>Experience with observability tools and practices, such as metrics collection, logging, tracing, and dashboards.</li>
</ul>
<ul>
<li>Familiarity with containerization and orchestration technologies, such as Docker and Kubernetes (or similar systems).</li>
</ul>
<ul>
<li>Experience with Linux systems administration, performance tuning, kernel-level understanding, and scripting/automation in production environments.</li>
</ul>
<ul>
<li>Understanding of networking fundamentals (TCP/IP, routing, redundancy, DNS) in large-scale or multi-site environments.</li>
</ul>
<ul>
<li>Experience participating in on-call rotations, incident response, post-incident reviews (blameless postmortems), and reliability practices such as error budgets or SLAs.</li>
</ul>
<ul>
<li>Ability to collaborate effectively with cross-functional teams (software engineers, network teams, site/facility operations, mechanical/electrical teams).</li>
</ul>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>Python, Rust, Linux systems administration, performance tuning, kernel-level understanding, scripting/automation, containerization, orchestration, observability, metrics collection, logging, tracing, dashboards, networking fundamentals, TCP/IP, routing, redundancy, DNS, Kubernetes, Docker, Grafana, Prometheus, ELK, DevOps, SRE, infrastructure engineering, systems engineering</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>xAI</Employername>
      <Employerlogo>https://logos.yubhub.co/xai.com.png</Employerlogo>
      <Employerdescription>xAI creates AI systems to understand the universe and aid humanity in its pursuit of knowledge.</Employerdescription>
      <Employerwebsite>https://www.xai.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/xai/jobs/5044403007</Applyto>
      <Location>Memphis, TN</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
  </jobs>
</source>