<?xml version="1.0" encoding="UTF-8"?>
<source>
  <jobs>
    <job>
      <externalid>7520a7f6-8b6</externalid>
      <Title>Member of Technical Staff - Infrastructure Reliability</Title>
      <Description><![CDATA[<p>We are seeking a Member of Technical Staff - Infrastructure Reliability to join our team. As a key member of our infrastructure team, you will own the availability, performance, and evolution of our core compute, storage, and networking infrastructure. This is a joint xAI/X role: you will own 24×7 reliability for the world&#39;s largest GPU training superclusters and one of the highest-QPS production systems on the planet.</p>
<p>You will define and execute the technical strategy for infrastructure reliability and scalability, build and maintain the automation, observability, and control planes that keep multi-datacenter, hybrid cloud/on-prem environments healthy, lead incident response, deep-dive root cause analysis, and post-mortems that drive real fixes, identify, instrument, and eliminate systemic failure patterns, design and implement high-leverage systems software in Python and Rust, and push the state of the art in large-scale GPU cluster operations and AI workload reliability.</p>
<p>To succeed in this role, you will need 5+ years shipping production software and/or operating distributed infrastructure at scale, expert-level knowledge of Linux systems, TCP/IP networking, and systems programming, strong coding skills with proven production experience in Rust (strongly preferred) and at least one of Python, Go, or C++, deep experience with large-scale distributed systems in on-prem and cloud environments, hands-on expertise with container orchestration, container runtimes, and infrastructure-as-code, intimate understanding of common failure modes in distributed systems and how to mitigate them, and a track record of participating in (or building) effective on-call rotations in high-stakes environments.</p>
<p>In addition to a competitive base salary, you will receive equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short &amp; long-term disability insurance, life insurance, and various other discounts and perks.</p>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>$180,000 - $400,000 USD</Salaryrange>
      <Skills>Linux systems, TCP/IP networking, systems programming, Rust, Python, Go, C++, container orchestration, container runtimes, infrastructure-as-code, high-performance networking, low level configuration, deployment, support, monitoring, administration, troubleshooting</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>xAI</Employername>
      <Employerlogo>https://logos.yubhub.co/xai.com.png</Employerlogo>
      <Employerdescription>xAI creates AI systems to understand the universe and aid humanity in its pursuit of knowledge.</Employerdescription>
      <Employerwebsite>https://www.xai.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/xai/jobs/4801451007</Applyto>
      <Location>Palo Alto, CA</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>a814df90-b97</externalid>
      <Title>Staff Software Engineer, Applied Training</Title>
      <Description><![CDATA[<p>We&#39;re building the Applied Training team to fix the problem of researchers spending their first month on cluster setup instead of research. You&#39;ll be an early member of a small team, responsible for our Kubernetes-native research cluster platform, or the sandbox client for agentic training and evaluation, or possibly a new project altogether.</p>
<p>Your responsibilities will include contributing to the roadmap for Applied Training, designing and building a complete research cluster experience, owning the Python SDK, and writing documentation for running popular OSS training frameworks on CoreWeave.</p>
<p>You&#39;ll work with infrastructure teams and customers directly, understanding how they structure their internal supercomputing stacks and bringing that knowledge back to what we build.</p>
<p>As a staff software engineer, you&#39;ll have 8-12+ years of experience building distributed systems, ML infrastructure, or developer platforms, with real Kubernetes experience and a passion for rigorous engineering enabled by AI-based workflows.</p>
<p>You&#39;ll be a good communicator, able to work with customers, translate researcher complaints into system designs, and contribute to the growth and success of our team.</p>
<p>If you&#39;re excited about this opportunity, please apply!</p>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$165,000 to $242,000</Salaryrange>
      <Skills>Kubernetes, Distributed systems, ML infrastructure, Developer platforms, Python, SDK development, Documentation writing, Agentic AI, RL training, Sandbox isolation, Container runtimes, Isolation, Serverless platforms, OSS contributions</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>CoreWeave</Employername>
      <Employerlogo>https://logos.yubhub.co/coreweave.com.png</Employerlogo>
      <Employerdescription>CoreWeave is a cloud computing company that provides a platform for artificial intelligence (AI) development and deployment.</Employerdescription>
      <Employerwebsite>https://www.coreweave.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/coreweave/jobs/4647607006</Applyto>
      <Location>New York, NY / Sunnyvale, CA / Bellevue, WA</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>09c520cf-f62</externalid>
      <Title>Systems Engineer, Kernel</Title>
      <Description><![CDATA[<p>CoreWeave is seeking a highly skilled and motivated Systems Kernel Engineer to join our HAVOCK Team, reporting into the Manager of Systems Engineering. In this role, you will be a key contributor to the stability, performance, and evolution of CoreWeave&#39;s Linux-based infrastructure.</p>
<p>As a kernel generalist, you will be responsible for debugging kernel-level issues, analysing and fixing crashes, panics, dumps, and upstreaming fixes and features that improve the performance and reliability of our stack.</p>
<p>This position is ideal for someone who thrives in low-level systems engineering, and understands how modern workloads stress kernels, and is excited to work across a diverse hardware/software ecosystem including CPUs, GPUs, DPUs, networking, and storage.</p>
<p>Kernel Hardware - Acceleration - Virtualization - Operating Systems - Containerization - Kubelet</p>
<p>Our Team&#39;s Stack:</p>
<ul>
<li>Python, Go, bash/sh, C</li>
</ul>
<ul>
<li>Prometheus, Victoria Metrics, Grafana</li>
</ul>
<ul>
<li>Linux Kernel (custom build), Ubuntu</li>
</ul>
<ul>
<li>Intel/AMD/ARM CPUs, Nvidia GPUs, DPUs, Infiniband and Ethernet NICs</li>
</ul>
<ul>
<li>Docker, kubernetes (k8s), KubeVirt, containerd, kubelet</li>
</ul>
<p>Focus Areas:</p>
<ul>
<li>Kernel Debugging – Analyse kernel crashes, oopses, panics, and dumps to identify root causes and propose fixes.</li>
</ul>
<ul>
<li>Upstream Contributions – Develop patches for the Linux kernel and upstream them where applicable (networking, storage, virtualization, GPU/DPU enablement).</li>
</ul>
<ul>
<li>Stack-Wide Support – Ensure kernel support and stability across:</li>
</ul>
<ul>
<li>Virtualization (KubeVirt, QEMU, vFIO)</li>
</ul>
<ul>
<li>Container runtimes (containerd, nydus, kubelet)</li>
</ul>
<ul>
<li>HPC/AI workloads (CUDA, GPUDirect, RoCE/InfiniBand)</li>
</ul>
<ul>
<li>Kernel-Hardware Enablement – Support new hardware bring-up across Intel, AMD, ARM CPUs, NVIDIA GPUs, DPUs, and NICs.</li>
</ul>
<ul>
<li>Performance &amp; Stability – Tune kernel subsystems for latency, throughput, and scalability in distributed HPC/AI clusters.</li>
</ul>
<p>About the role:</p>
<ul>
<li>Triage and fix kernel crashes and performance regressions.</li>
</ul>
<ul>
<li>Develop, test, and upstream kernel patches relevant to CoreWeave’s hardware/software environment.</li>
</ul>
<ul>
<li>Collaborate with hardware vendors and the Linux community on feature enablement.</li>
</ul>
<ul>
<li>Implement diagnostics and tooling for kernel-level observability.</li>
</ul>
<ul>
<li>Work closely with HPC and Fleet teams to ensure kernel readiness for production workloads.</li>
</ul>
<ul>
<li>Provide kernel-level expertise during incident response and root-cause investigations.</li>
</ul>
<p>Who You Are:</p>
<ul>
<li>5+ years of professional experience in Linux kernel engineering or systems-level development.</li>
</ul>
<ul>
<li>Deep understanding of kernel internals (memory management, scheduling, networking, storage, drivers).</li>
</ul>
<ul>
<li>Experience debugging kernel crashes, dumps, and panics using tools like crash, gdb, kdump.</li>
</ul>
<ul>
<li>Strong C programming skills with the ability to write maintainable and upstream-quality code.</li>
</ul>
<ul>
<li>Experience working with kernel modules, drivers, and subsystems.</li>
</ul>
<ul>
<li>Strong problem-solving abilities with a “full-stack” systems perspective.</li>
</ul>
<p>Preferred:</p>
<ul>
<li>Contributions to the Linux kernel or related open-source projects.</li>
</ul>
<ul>
<li>Familiarity with virtualization (KVM, QEMU, VFIO) and container runtimes.</li>
</ul>
<ul>
<li>Networking stack expertise (InfiniBand, RoCE, TCP/IP performance tuning).</li>
</ul>
<ul>
<li>GPU/DPU bring-up and driver experience.</li>
</ul>
<ul>
<li>Experience in HPC or large-scale distributed systems.</li>
</ul>
<ul>
<li>Familiarity with QA/QE best practices</li>
</ul>
<ul>
<li>Experience working in Cloud environments</li>
</ul>
<ul>
<li>Experience as a software engineer writing large-scale applications</li>
</ul>
<ul>
<li>Experience with machine learning is a huge bonus</li>
</ul>
<p>The base salary range for this role is $165,000 to $242,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).</p>
<p>What We Offer</p>
<p>The range we’ve posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate which can include a variety of factors. These include qualifications, experience, interview performance, and location.</p>
<p>In addition to a competitive salary, we offer a variety of benefits to support your needs, including:</p>
<ul>
<li>Medical, dental, and vision insurance - 100% paid for by CoreWeave</li>
</ul>
<ul>
<li>Company-paid Life Insurance</li>
</ul>
<ul>
<li>Voluntary supplemental life insurance</li>
</ul>
<ul>
<li>Short and long-term disability insurance</li>
</ul>
<ul>
<li>Flexible Spending Account</li>
</ul>
<ul>
<li>Health Savings Account</li>
</ul>
<ul>
<li>Tuition Reimbursement</li>
</ul>
<ul>
<li>Ability to Participate in Employee Stock Purchase Program (ESPP)</li>
</ul>
<ul>
<li>Mental Wellness Benefits through Spring Health</li>
</ul>
<ul>
<li>Family-Forming support provided by Carrot</li>
</ul>
<ul>
<li>Paid Parental Leave</li>
</ul>
<ul>
<li>Flexible, full-service childcare support with Kinside</li>
</ul>
<ul>
<li>401(k) with a generous employer match</li>
</ul>
<ul>
<li>Flexible PTO</li>
</ul>
<ul>
<li>Catered lunch each day in our office and data center locations</li>
</ul>
<ul>
<li>A casual work environment</li>
</ul>
<ul>
<li>A work culture focused on innovative disruption</li>
</ul>
<p>Our Workplace</p>
<p>While we prioritize a hybrid work environment, remote work may be considered for candidates located more than 30 miles from an office, based on role requirements for specialized skill sets. New hires will be invited to attend onboarding at one of our hubs within their first month. Teams also gather quarterly to support collaboration.</p>
<p>California Consumer Privacy Act - California applicants only</p>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$165,000 to $242,000</Salaryrange>
      <Skills>Linux kernel engineering, Systems-level development, C programming, Kernel modules, Drivers, Subsystems, Kernel debugging, Upstream contributions, Stack-wide support, Virtualization, Container runtimes, HPC/AI workloads, Kernel-hardware enablement, Performance &amp; stability, Contributions to the Linux kernel, Networking stack expertise, GPU/DPU bring-up and driver experience, Experience in HPC or large-scale distributed systems, QA/QE best practices, Cloud environments, Software engineer writing large-scale applications, Machine learning</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>CoreWeave</Employername>
      <Employerlogo>https://logos.yubhub.co/coreweave.com.png</Employerlogo>
      <Employerdescription>CoreWeave is a cloud computing company that provides a platform for building and scaling AI with confidence.</Employerdescription>
      <Employerwebsite>https://www.coreweave.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/coreweave/jobs/4599319006</Applyto>
      <Location>Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>022d9aef-8cd</externalid>
      <Title>Member of Technical Staff - Infrastructure Reliability</Title>
      <Description><![CDATA[<p><strong>About the Role</strong></p>
<p>We are training some of the largest models in the world on the latest hardware across multiple environments. To do this reliably at xAI&#39;s pace, we need engineers who have battle-tested experience keeping massive distributed infrastructure up and running 24/7, including on-prem and cloud-based infrastructure.</p>
<p>You will own the availability, performance, and evolution of xAI&#39;s core compute, storage, and networking infrastructure. This is not an ops-only role , strong coding is a hard requirement. You will design, implement, and ship systems software, automation, and tooling in Python and/or Rust that directly impact training throughput and cluster utilization.</p>
<p><strong>Responsibilities</strong></p>
<ul>
<li>Define and execute the technical strategy for infrastructure reliability and scalability</li>
<li>Build and maintain the automation, observability, and control planes that keep multi-datacenter, hybrid cloud/on-prem environments healthy</li>
<li>Lead incident response, deep-dive root cause analysis, and post-mortems that drive real fixes</li>
<li>Identify, instrument, and eliminate systemic failure patterns (capacity, network, hardware, storage, software)</li>
<li>Design and implement high-leverage systems software (daemons, controllers, schedulers, etc.) in Python and Rust.</li>
</ul>
<p><strong>Basic Qualifications</strong></p>
<ul>
<li>5+ years shipping production software and/or operating distributed infrastructure at scale</li>
<li>Expert-level knowledge of Linux systems, TCP/IP networking, and systems programming</li>
<li>Strong coding skills with proven production experience in Rust (strongly preferred) and at least one of Python, Go, or C++.</li>
</ul>
<p><strong>Preferred Skills and Experience</strong></p>
<ul>
<li>Significant contributions to large-scale GPU clusters or AI/ML infrastructure</li>
<li>Experience in on-call rotations and incident response in high-stakes environments.</li>
</ul>
<p><strong>Compensation and Benefits</strong></p>
<p>$180,000 - $400,000 USD</p>
<p>Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short &amp; long-term disability insurance, life insurance, and various other discounts and perks.</p>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>$180,000 - $400,000 USD</Salaryrange>
      <Skills>Linux systems, TCP/IP networking, systems programming, Rust, Python, Go, C++, container orchestration, container runtimes, infrastructure-as-code, large-scale GPU clusters, AI/ML infrastructure, on-call rotations, incident response</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>xAI</Employername>
      <Employerlogo>https://logos.yubhub.co/xai.com.png</Employerlogo>
      <Employerdescription>xAI creates AI systems to understand the universe and aid humanity in its pursuit of knowledge.</Employerdescription>
      <Employerwebsite>https://www.xai.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/xai/jobs/4801451007</Applyto>
      <Location>Palo Alto, CA</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
  </jobs>
</source>