Member of Technical Staff - Infrastructure Reliability

7520a7f6-8b6 Member of Technical Staff - Infrastructure Reliability

We are seeking a Member of Technical Staff - Infrastructure Reliability to join our team. As a key member of our infrastructure team, you will own the availability, performance, and evolution of our core compute, storage, and networking infrastructure. This is a joint xAI/X role: you will own 24×7 reliability for the world's largest GPU training superclusters and one of the highest-QPS production systems on the planet.

You will define and execute the technical strategy for infrastructure reliability and scalability; build and maintain the automation, observability, and control planes that keep multi-datacenter, hybrid cloud/on-prem environments healthy; lead incident response, deep-dive root cause analysis, and post-mortems that drive real fixes; identify, instrument, and eliminate systemic failure patterns; design and implement high-leverage systems software in Python and Rust; and push the state of the art in large-scale GPU cluster operations and AI workload reliability.

To succeed in this role, you will need: 5+ years shipping production software and/or operating distributed infrastructure at scale; expert-level knowledge of Linux systems, TCP/IP networking, and systems programming; strong coding skills with proven production experience in Rust (strongly preferred) and at least one of Python, Go, or C++; deep experience with large-scale distributed systems in on-prem and cloud environments; hands-on expertise with container orchestration, container runtimes, and infrastructure-as-code; an intimate understanding of common failure modes in distributed systems and how to mitigate them; and a track record of participating in (or building) effective on-call rotations in high-stakes environments.

In addition to a competitive base salary, you will receive equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.

XML job scraping automation by YubHub

full-time staff onsite
$180,000 - $400,000 USD
Linux systems, TCP/IP networking, systems programming, Rust, Python, Go, C++, container orchestration, container runtimes, infrastructure-as-code, high-performance networking, low level configuration, deployment, support, monitoring, administration, troubleshooting
Engineering Technology
xAI
https://logos.yubhub.co/xai.com.png
xAI creates AI systems to understand the universe and aid humanity in its pursuit of knowledge.
https://www.xai.com/
https://job-boards.greenhouse.io/xai/jobs/4801451007
Palo Alto, CA
2026-04-18

a814df90-b97 Staff Software Engineer, Applied Training

We're building the Applied Training team to fix the problem of researchers spending their first month on cluster setup instead of research. You'll be an early member of a small team, responsible for our Kubernetes-native research cluster platform, or the sandbox client for agentic training and evaluation, or possibly a new project altogether.

Your responsibilities will include contributing to the roadmap for Applied Training, designing and building a complete research cluster experience, owning the Python SDK, and writing documentation for running popular OSS training frameworks on CoreWeave.

You'll work with infrastructure teams and customers directly, understanding how they structure their internal supercomputing stacks and bringing that knowledge back to what we build.

As a staff software engineer, you'll have 8-12+ years of experience building distributed systems, ML infrastructure, or developer platforms, with real Kubernetes experience and a passion for rigorous engineering enabled by AI-based workflows.

You'll be a good communicator, able to work with customers, translate researcher complaints into system designs, and contribute to the growth and success of our team.

If you're excited about this opportunity, please apply!

full-time staff hybrid
$165,000 to $242,000
Kubernetes, Distributed systems, ML infrastructure, Developer platforms, Python, SDK development, Documentation writing, Agentic AI, RL training, Sandbox isolation, Container runtimes, Isolation, Serverless platforms, OSS contributions
Engineering Technology
CoreWeave
https://logos.yubhub.co/coreweave.com.png
CoreWeave is a cloud computing company that provides a platform for artificial intelligence (AI) development and deployment.
https://www.coreweave.com
https://job-boards.greenhouse.io/coreweave/jobs/4647607006
New York, NY / Sunnyvale, CA / Bellevue, WA
2026-04-18

09c520cf-f62 Systems Engineer, Kernel

CoreWeave is seeking a highly skilled and motivated Systems Kernel Engineer to join our HAVOCK Team, reporting into the Manager of Systems Engineering. In this role, you will be a key contributor to the stability, performance, and evolution of CoreWeave's Linux-based infrastructure.

As a kernel generalist, you will be responsible for debugging kernel-level issues; analysing and fixing crashes, panics, and dumps; and upstreaming fixes and features that improve the performance and reliability of our stack.

This position is ideal for someone who thrives in low-level systems engineering, understands how modern workloads stress kernels, and is excited to work across a diverse hardware/software ecosystem including CPUs, GPUs, DPUs, networking, and storage.

Kernel Hardware - Acceleration - Virtualization - Operating Systems - Containerization - Kubelet

Our Team's Stack:

Python, Go, bash/sh, C

Prometheus, VictoriaMetrics, Grafana

Linux Kernel (custom build), Ubuntu

Intel/AMD/ARM CPUs, NVIDIA GPUs, DPUs, InfiniBand and Ethernet NICs

Docker, Kubernetes (k8s), KubeVirt, containerd, kubelet

Focus Areas:

Kernel Debugging – Analyse kernel crashes, oopses, panics, and dumps to identify root causes and propose fixes.

Upstream Contributions – Develop patches for the Linux kernel and upstream them where applicable (networking, storage, virtualization, GPU/DPU enablement).

Stack-Wide Support – Ensure kernel support and stability across:

Virtualization (KubeVirt, QEMU, VFIO)

Container runtimes (containerd, nydus, kubelet)

HPC/AI workloads (CUDA, GPUDirect, RoCE/InfiniBand)

Kernel-Hardware Enablement – Support new hardware bring-up across Intel, AMD, ARM CPUs, NVIDIA GPUs, DPUs, and NICs.

Performance & Stability – Tune kernel subsystems for latency, throughput, and scalability in distributed HPC/AI clusters.

About the role:

Triage and fix kernel crashes and performance regressions.

Develop, test, and upstream kernel patches relevant to CoreWeave’s hardware/software environment.

Collaborate with hardware vendors and the Linux community on feature enablement.

Implement diagnostics and tooling for kernel-level observability.

Work closely with HPC and Fleet teams to ensure kernel readiness for production workloads.

Provide kernel-level expertise during incident response and root-cause investigations.

Who You Are:

5+ years of professional experience in Linux kernel engineering or systems-level development.

Deep understanding of kernel internals (memory management, scheduling, networking, storage, drivers).

Experience debugging kernel crashes, dumps, and panics using tools like crash, gdb, kdump.

Strong C programming skills with the ability to write maintainable and upstream-quality code.

Experience working with kernel modules, drivers, and subsystems.

Strong problem-solving abilities with a “full-stack” systems perspective.

Preferred:

Contributions to the Linux kernel or related open-source projects.

Familiarity with virtualization (KVM, QEMU, VFIO) and container runtimes.

Networking stack expertise (InfiniBand, RoCE, TCP/IP performance tuning).

GPU/DPU bring-up and driver experience.

Experience in HPC or large-scale distributed systems.

Familiarity with QA/QE best practices.

Experience working in cloud environments.

Experience as a software engineer writing large-scale applications.

Experience with machine learning is a huge bonus.

The base salary range for this role is $165,000 to $242,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).

What We Offer

The range we’ve posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate, which can depend on a variety of factors, including qualifications, experience, interview performance, and location.

In addition to a competitive salary, we offer a variety of benefits to support your needs, including:

Medical, dental, and vision insurance - 100% paid for by CoreWeave

Company-paid Life Insurance

Voluntary supplemental life insurance

Short and long-term disability insurance

Flexible Spending Account

Health Savings Account

Tuition Reimbursement

Ability to Participate in Employee Stock Purchase Program (ESPP)

Mental Wellness Benefits through Spring Health

Family-Forming support provided by Carrot

Paid Parental Leave

Flexible, full-service childcare support with Kinside

401(k) with a generous employer match

Flexible PTO

Catered lunch each day in our office and data center locations

A casual work environment

A work culture focused on innovative disruption

Our Workplace

While we prioritize a hybrid work environment, remote work may be considered for candidates located more than 30 miles from an office, based on role requirements for specialized skill sets. New hires will be invited to attend onboarding at one of our hubs within their first month. Teams also gather quarterly to support collaboration.

California Consumer Privacy Act - California applicants only

full-time senior hybrid
$165,000 to $242,000
Linux kernel engineering, Systems-level development, C programming, Kernel modules, Drivers, Subsystems, Kernel debugging, Upstream contributions, Stack-wide support, Virtualization, Container runtimes, HPC/AI workloads, Kernel-hardware enablement, Performance & stability, Contributions to the Linux kernel, Networking stack expertise, GPU/DPU bring-up and driver experience, Experience in HPC or large-scale distributed systems, QA/QE best practices, Cloud environments, Software engineer writing large-scale applications, Machine learning
Engineering Technology
CoreWeave
https://logos.yubhub.co/coreweave.com.png
CoreWeave is a cloud computing company that provides a platform for building and scaling AI with confidence.
https://www.coreweave.com
https://job-boards.greenhouse.io/coreweave/jobs/4599319006
Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA
2026-04-18

022d9aef-8cd Member of Technical Staff - Infrastructure Reliability

About the Role

We are training some of the largest models in the world on the latest hardware across multiple environments. To do this reliably at xAI's pace, we need engineers who have battle-tested experience keeping massive distributed infrastructure up and running 24/7, including on-prem and cloud-based infrastructure.

You will own the availability, performance, and evolution of xAI's core compute, storage, and networking infrastructure. This is not an ops-only role; strong coding is a hard requirement. You will design, implement, and ship systems software, automation, and tooling in Python and/or Rust that directly impact training throughput and cluster utilization.

Responsibilities

Define and execute the technical strategy for infrastructure reliability and scalability
Build and maintain the automation, observability, and control planes that keep multi-datacenter, hybrid cloud/on-prem environments healthy
Lead incident response, deep-dive root cause analysis, and post-mortems that drive real fixes
Identify, instrument, and eliminate systemic failure patterns (capacity, network, hardware, storage, software)
Design and implement high-leverage systems software (daemons, controllers, schedulers, etc.) in Python and Rust.

Basic Qualifications

5+ years shipping production software and/or operating distributed infrastructure at scale
Expert-level knowledge of Linux systems, TCP/IP networking, and systems programming
Strong coding skills with proven production experience in Rust (strongly preferred) and at least one of Python, Go, or C++.

Preferred Skills and Experience

Significant contributions to large-scale GPU clusters or AI/ML infrastructure
Experience in on-call rotations and incident response in high-stakes environments.

Compensation and Benefits

$180,000 - $400,000 USD

Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.

full-time staff onsite
$180,000 - $400,000 USD
Linux systems, TCP/IP networking, systems programming, Rust, Python, Go, C++, container orchestration, container runtimes, infrastructure-as-code, large-scale GPU clusters, AI/ML infrastructure, on-call rotations, incident response
Engineering Technology
xAI
https://logos.yubhub.co/xai.com.png
xAI creates AI systems to understand the universe and aid humanity in its pursuit of knowledge.
https://www.xai.com/
https://job-boards.greenhouse.io/xai/jobs/4801451007
Palo Alto, CA
2026-04-18