{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/title/member-of-technical-staff"},"x-facet":{"type":"title","slug":"member-of-technical-staff","display":"Member Of Technical Staff","count":1},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_51758515-c12"},"title":"Member of Technical Staff","description":"<p>We are seeking a highly skilled Member of Technical Staff to join our team in managing and enhancing reliability across a multi-data center environment.</p>\n<p>This role focuses on automating processes, building and implementing robust observability solutions, and ensuring seamless operations for mission-critical AI infrastructure.</p>\n<p>The ideal candidate will combine strong coding abilities with hands-on data center experience to build scalable reliability services, optimize system performance, and minimize downtime,including close partnership with facility operations to address physical infrastructure impacts.</p>\n<p>In an era where AI workloads demand near-zero downtime, this position plays a pivotal role in bridging software engineering principles with physical data center realities.</p>\n<p>By prioritizing automation and observability, team members in this role can reduce mean time to recovery (MTTR) by up to 50% through proactive monitoring and automated remediation, based on industry benchmarks from high-scale environments like those at hyperscale cloud providers.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Design, develop, and deploy scalable code and services (primarily in Python and Rust, with flexibility for emerging languages) to automate reliability workflows, including monitoring, alerting, incident response, and infrastructure provisioning.</li>\n</ul>\n<ul>\n<li>Implement and maintain observability tools and practices, such as metrics collection, logging, tracing, and dashboards, to provide real-time insights into system health across multiple data centers,open to innovative stacks beyond traditional ones like ELK.</li>\n</ul>\n<ul>\n<li>Collaborate with cross-functional teams,including software development, network engineering, site operations, and facility operations (critical facilities, mechanical/electrical teams, and data center infrastructure management),to identify reliability bottlenecks, automate solutions for fault tolerance, disaster recovery, capacity planning, and physical/environmental risk mitigation (e.g., power redundancy, cooling efficiency, and environmental monitoring integration).</li>\n</ul>\n<ul>\n<li>Troubleshoot and resolve complex issues in data center environments, including hardware failures, environmental anomalies, software bugs, and network-related problems, while adhering to reliability principles like error budgets and SLAs.</li>\n</ul>\n<ul>\n<li>Optimize Linux-based systems for performance, security, and reliability, including kernel tuning, container orchestration (e.g., Kubernetes or emerging alternatives), and scripting for automation.</li>\n</ul>\n<ul>\n<li>Understand network topologies and concepts in large-scale, multi-data center environments to effectively troubleshoot connectivity, routing, redundancy, and performance issues; integrate observability into data center interconnects and facility-level controls for rapid diagnosis and automation.</li>\n</ul>\n<ul>\n<li>Participate in on-call rotations, post-incident reviews (blameless postmortems), and continuous improvement initiatives to enhance overall site reliability, including joint exercises with facility teams for physical failover and recovery scenarios.</li>\n</ul>\n<ul>\n<li>Mentor junior team members and document processes to foster a culture of automation, knowledge sharing, and adaptability to new technologies.</li>\n</ul>\n<p>Basic Qualifications:</p>\n<ul>\n<li>Bachelor&#39;s degree in Computer Science, Computer Engineering, Electrical Engineering, or a closely related technical field (or equivalent professional experience).</li>\n</ul>\n<ul>\n<li>5+ years of hands-on experience in site reliability engineering (SRE), infrastructure engineering, DevOps, or systems engineering, preferably supporting large-scale, distributed, or production environments.</li>\n</ul>\n<ul>\n<li>Strong programming skills with proven production experience in Python (required for automation and tooling); experience with Rust or willingness to work in Rust is a plus, but strong coding fundamentals in at least one systems-level language (e.g., Python, Go, C++) are essential.</li>\n</ul>\n<ul>\n<li>Solid experience with Linux systems administration, performance tuning, kernel-level understanding, and scripting/automation in production environments.</li>\n</ul>\n<ul>\n<li>Practical knowledge of containerization and orchestration technologies, such as Docker and Kubernetes (or similar systems).</li>\n</ul>\n<ul>\n<li>Experience implementing observability solutions, including metrics, logging, tracing, monitoring tools (e.g., Prometheus, Grafana, or alternatives), alerting, and dashboards.</li>\n</ul>\n<ul>\n<li>Familiarity with troubleshooting complex issues in distributed systems, including software bugs, hardware failures, network problems, and environmental factors.</li>\n</ul>\n<ul>\n<li>Understanding of networking fundamentals (TCP/IP, routing, redundancy, DNS) in large-scale or multi-site environments.</li>\n</ul>\n<ul>\n<li>Experience participating in on-call rotations, incident response, post-incident reviews (blameless postmortems), and reliability practices such as error budgets or SLAs.</li>\n</ul>\n<ul>\n<li>Ability to collaborate effectively with cross-functional teams (software engineers, network teams, site/facility operations, mechanical/electrical teams).</li>\n</ul>\n<p>Preferred Skills and Experience:</p>\n<ul>\n<li>7+ years of experience in SRE or infrastructure roles, ideally in hyperscale, cloud, or AI/ML training infrastructure environments with multi-data center setups.</li>\n</ul>\n<ul>\n<li>Hands-on experience operating or scaling Kubernetes clusters (or equivalent orchestration) at large scale, including automation for provisioning, lifecycle management, and high-availability.</li>\n</ul>\n<ul>\n<li>Proficiency in Rust for systems programming and performance-critical components.</li>\n</ul>\n<ul>\n<li>Direct experience integrating software reliability tools with physical data center infrastructure.</li>\n</ul>\n<ul>\n<li>Experience with observability tools and practices, such as metrics collection, logging, tracing, and dashboards.</li>\n</ul>\n<ul>\n<li>Familiarity with containerization and orchestration technologies, such as Docker and Kubernetes (or similar systems).</li>\n</ul>\n<ul>\n<li>Experience with Linux systems administration, performance tuning, kernel-level understanding, and scripting/automation in production environments.</li>\n</ul>\n<ul>\n<li>Understanding of networking fundamentals (TCP/IP, routing, redundancy, DNS) in large-scale or multi-site environments.</li>\n</ul>\n<ul>\n<li>Experience participating in on-call rotations, incident response, post-incident reviews (blameless postmortems), and reliability practices such as error budgets or SLAs.</li>\n</ul>\n<ul>\n<li>Ability to collaborate effectively with cross-functional teams (software engineers, network teams, site/facility operations, mechanical/electrical teams).</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_51758515-c12","directApply":true,"hiringOrganization":{"@type":"Organization","name":"xAI","sameAs":"https://www.xai.com/","logo":"https://logos.yubhub.co/xai.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/xai/jobs/5044403007","x-work-arrangement":"onsite","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Python","Rust","Linux systems administration","performance tuning","kernel-level understanding","scripting/automation","containerization","orchestration","observability","metrics collection","logging","tracing","dashboards","networking fundamentals","TCP/IP","routing","redundancy","DNS"],"x-skills-preferred":["Kubernetes","Docker","Grafana","Prometheus","ELK","DevOps","SRE","infrastructure engineering","systems engineering"],"datePosted":"2026-04-18T15:39:31.440Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Memphis, TN"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Python, Rust, Linux systems administration, performance tuning, kernel-level understanding, scripting/automation, containerization, orchestration, observability, metrics collection, logging, tracing, dashboards, networking fundamentals, TCP/IP, routing, redundancy, DNS, Kubernetes, Docker, Grafana, Prometheus, ELK, DevOps, SRE, infrastructure engineering, systems engineering"}]}