7dc0b69a-5b8 Senior Engineer, Storage Control Plane

We're looking for a Senior Storage Engineer to play a key role in designing, building, and operating the control plane for our high-performance AI storage platform. You'll help evolve CoreWeave's storage systems by building reliable, scalable, and high-throughput solutions that power some of the largest and most innovative AI workloads in the world.

This role involves close collaboration with teams across infrastructure, compute, and platform to ensure our storage services scale automatically and seamlessly while maximizing performance and reliability.

Key responsibilities include:

Design and implement a highly scalable multi-tenant control plane that supports CoreWeave's growing AI storage and cloud infrastructure needs.
Contribute to the development of exabyte-scale, S3-compatible object storage and distributed file systems, and integrate dedicated storage clusters into diverse customer environments.
Work with technologies such as RDMA, GPU Direct Storage, RoCE, InfiniBand, SPDK, and distributed filesystems to optimize storage performance and efficiency.
Participate in efforts to improve the reliability, durability, and observability of our storage stack.
Collaborate with operations teams to monitor, analyze, and optimize storage systems using telemetry, metrics, and dashboards to improve performance, latency, and resilience.
Work cross-functionally with platform, product, and infrastructure teams to deliver seamless storage capabilities across the stack.
Share your knowledge and mentor other engineers on best practices in building distributed, high-performance systems.

The ideal candidate will have:

A Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
6–10 years of experience working in storage systems engineering or infrastructure.
Strong hands-on experience with object storage or distributed filesystems in production environments.
Experience with one or more storage protocols (e.g. S3, NFS) and file systems such as Ceph, DAOS, or similar.
Proficiency in a systems programming language such as Go, C, or Rust.
Familiarity with storage observability tools and telemetry pipelines (e.g., ClickHouse, Prometheus, Grafana).
Solid understanding of cloud-native infrastructure, Kubernetes, and scalable system architecture.
Strong debugging and problem-solving skills in distributed, high-performance environments.
Clear communicator, able to work collaboratively across teams and share technical insights effectively.

XML job scraping automation by YubHub

full-time senior hybrid $139,000 to $204,000 object storage, distributed filesystems, RDMA, GPU Direct Storage, RoCE, InfiniBand, SPDK, cloud-native infrastructure, Kubernetes, scalable system architecture Engineering Technology CoreWeave https://logos.yubhub.co/coreweave.com.png CoreWeave is a cloud computing company that provides a platform for building and scaling AI applications. https://www.coreweave.com https://job-boards.greenhouse.io/coreweave/jobs/4611874006 Sunnyvale, CA / Bellevue, WA 2026-04-18

6f3a053e-c43 Staff Software Engineer, AI Reliability Engineering

We're seeking a Staff Software Engineer to join our AI Reliability Engineering team. As a key member of our team, you will develop Service Level Objectives for large language model serving systems, design and implement monitoring and observability systems, and lead incident response for critical AI services.

You will work closely with teams across Anthropic to improve reliability across our most critical serving paths. You will be responsible for making the systems that deliver Claude more robust and resilient, whether during an incident or collaborating on projects.

To be successful in this role, you should have a strong background in distributed systems, infrastructure, or reliability. You should be curious and brave, comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don't have deep expertise yet.

You will be working on high-availability serving infrastructure across multiple regions and cloud providers. You will support the reliability of safeguard model serving, which is critical for both site reliability and Anthropic's safety commitments.

If you're committed to creating reliable, interpretable, and steerable AI systems, and you're passionate about working on complex technical problems, we'd love to hear from you.

full-time staff hybrid €235.000-€295.000 EUR distributed systems, infrastructure, reliability, Service Level Objectives, monitoring, observability, incident response, high-availability serving infrastructure, cloud providers, SRE, Production Engineer, chaos engineering, systematic resilience testing, AI-specific observability tools and frameworks, ML hardware accelerators, RDMA, InfiniBand Engineering Technology Anthropic https://logos.yubhub.co/anthropic.com.png Anthropic is a public benefit corporation that creates reliable, interpretable, and steerable AI systems. https://www.anthropic.com/ https://job-boards.greenhouse.io/anthropic/jobs/5101169008 Dublin, IE 2026-04-18

33821044-320 Principal Engineer, Storage

We're looking for a Principal Engineer to play a key role in designing, building, and operating the data plane for our high-performance AI storage platform.

You'll develop CoreWeave's storage systems by building reliable, scalable, and high-throughput solutions that power some of the largest and most innovative AI workloads in the world.

About the role:

Design and implement a highly scalable multi-tenant control plane that supports CoreWeave's growing AI storage and cloud infrastructure needs.
Contribute to the development of exabyte-scale, S3-compatible object storage and distributed file systems, and integrate dedicated storage clusters into diverse customer environments.
Work with technologies such as RDMA, GPU Direct Storage, RoCE, InfiniBand, SPDK, and distributed filesystems to optimize storage performance and efficiency.
Participate in efforts to improve the reliability, durability, and observability of our storage stack.
Collaborate with operations teams to monitor, analyze, and optimize storage systems using telemetry, metrics, and dashboards to improve performance, latency, and resilience.
Work cross-functionally with platform, product, and infrastructure teams to deliver seamless storage capabilities across the stack.
Share your knowledge and mentor other engineers on best practices in building distributed, high-performance systems, with a particular focus on the low-level storage details that improve performance and durability.

Who You Are:

Bachelor’s, Master’s, or PhD degree in Computer Science, Engineering, or a related field.
8–10+ years of experience working in storage systems engineering.
Strong hands-on experience with object storage, block storage or distributed filesystems in production environments.
Proficiency in a systems programming language such as Go, C, or Rust.
Solid understanding of cloud-native infrastructure, Kubernetes, and scalable system architecture.
Strong debugging and problem-solving skills in distributed, high-performance environments.
Clear communicator, able to work collaboratively across teams and share technical insights effectively.
Familiarity with the trade-offs between HDD-based and SSD-based storage systems.

The base salary range for this role is $206,000 to $303,000.

In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).

What We Offer

The range we’ve posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate, which can reflect a variety of factors, including qualifications, experience, interview performance, and location.

In addition to a competitive salary, we offer a variety of benefits to support your needs, including:

Medical, dental, and vision insurance, 100% paid for by CoreWeave
Company-paid Life Insurance
Voluntary supplemental life insurance
Short and long-term disability insurance
Flexible Spending Account
Health Savings Account
Tuition Reimbursement
Ability to Participate in Employee Stock Purchase Program (ESPP)
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our office and data center locations
A casual work environment
A work culture focused on innovative disruption

full-time senior hybrid $206,000 to $303,000 object storage, block storage, distributed filesystems, RDMA, GPU Direct Storage, RoCE, InfiniBand, SPDK, cloud-native infrastructure, Kubernetes, scalable system architecture, systems programming language, Go, C, Rust Engineering Technology CoreWeave https://logos.yubhub.co/coreweave.com.png CoreWeave is a technology company that delivers a platform for building and scaling AI with confidence. https://www.coreweave.com https://job-boards.greenhouse.io/coreweave/jobs/4646276006 Bellevue, WA 2026-04-18

0ae48270-bef Senior Software Engineer, Storage Engineer

The Storage Engine Organisation at CoreWeave is responsible for the product capabilities and data plane function of CoreWeave's managed storage products.

We build reliable, scalable storage solutions with segment-leading performance. Storage Engine works with engineering teams across infrastructure, compute, and platform to ensure our storage services meet the needs of the world's most demanding AI workloads.

The role involves designing and implementing distributed storage solutions to support scaling data-intensive AI workloads, contributing to the development of exabyte-scale, S3-compatible object storage, and integrating dedicated storage clusters into diverse customer environments.

Key responsibilities include working with technologies such as RDMA, GPU Direct Storage, and distributed filesystem protocols such as NFS or FUSE to optimise storage performance and efficiency; participating in efforts to improve the reliability, durability, and observability of our storage stack; collaborating with operations teams to monitor, troubleshoot, and improve storage systems in production environments; and helping develop metrics and dashboards to provide visibility into storage performance and health.

The ideal candidate will have a strong background in storage systems engineering or infrastructure, with experience working with object storage or distributed filesystems in production environments, proficiency in a systems programming language like Go, C, or Rust, and familiarity with storage observability tools and telemetry pipelines.

As a senior software engineer, you will be responsible for designing, developing, and deploying scalable and efficient storage solutions, working closely with cross-functional teams to ensure seamless integration with other components of the platform, and mentoring junior engineers to help them grow in their roles.

If you are passionate about building high-performance storage solutions and have a strong background in software engineering, we encourage you to apply for this exciting opportunity.

full-time senior hybrid $139,000 to $204,000 Storage systems engineering, Infrastructure, Object storage, Distributed filesystems, RDMA, GPU Direct Storage, NFS, FUSE, Systems programming languages (Go, C, Rust), Storage observability tools, Telemetry pipelines Engineering Technology CoreWeave https://logos.yubhub.co/coreweave.com.png CoreWeave is a cloud computing company that provides a platform for building and scaling AI applications. https://www.coreweave.com https://job-boards.greenhouse.io/coreweave/jobs/4643524006 Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA 2026-04-18

5b6f9322-a9a Staff Engineer, Storage Engine

CoreWeave is seeking a Staff Engineer, Storage Engine to join their team. The successful candidate will design and implement distributed storage solutions to support scaling data-intensive AI workloads. They will contribute to the development of exabyte-scale, S3-compatible object storage and integrate dedicated storage clusters into diverse customer environments.

Key responsibilities include:

Designing and implementing distributed storage solutions to support scaling data-intensive AI workloads
Contributing to the development of exabyte-scale, S3-compatible object storage
Integrating dedicated storage clusters into diverse customer environments
Working with technologies such as RDMA, GPU Direct Storage, and distributed filesystem protocols such as NFS or FUSE to optimize storage performance and efficiency
Leading efforts to improve the reliability, durability, security, and observability of the storage stack
Collaborating with operations teams to monitor, troubleshoot, and improve storage systems in production environments
Setting the bar for developing metrics and dashboards to provide visibility into storage performance and health
Analyzing telemetry and system data to drive improvements in throughput, latency, and resilience
Working cross-functionally with platform, product, and infrastructure teams to deliver seamless storage capabilities across the stack
Sharing knowledge and mentoring other engineers on best practices in building distributed, high-performance systems

Requirements include:

Bachelor's, Master's, or PhD degree in Computer Science, Engineering, or a related field
8-10+ years of experience working in storage systems engineering or infrastructure
Strong hands-on experience with object storage or distributed filesystems in production environments
Experience with one or more storage protocols (e.g. S3, NFS) and file systems such as Ceph, DAOS, or similar
Proficiency in a systems programming language such as Go, C, or Rust
Proficiency leveraging AI tools to augment software development
Familiarity with storage observability tools and telemetry pipelines (e.g., ClickHouse, Prometheus, Grafana)
Experience working with cloud-native infrastructure, Kubernetes, and scalable system architectures

The base salary range for this role is $188,000 to $275,000.

full-time staff hybrid $188,000 to $275,000 distributed storage, object storage, S3-compatible object storage, RDMA, GPU Direct Storage, distributed filesystems protocols, NFS, FUSE, storage performance and efficiency, reliability, durability, security, observability, telemetry, system data, throughput, latency, resilience, cloud-native infrastructure, Kubernetes, scalable system architectures Engineering Technology CoreWeave https://logos.yubhub.co/coreweave.com.png CoreWeave is a cloud computing company that provides a platform for building and scaling AI applications. https://www.coreweave.com https://job-boards.greenhouse.io/coreweave/jobs/4612047006 Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA 2026-04-18

ec7cc743-ef4 Senior Software Engineer II, Inference

We're seeking a senior software engineer to join our team and lead the design and development of our Kubernetes-native inference platform. As a senior engineer, you will be responsible for leading design reviews, driving architecture, and ensuring the reliability and scalability of our platform.

Key responsibilities include:

Leading design reviews and driving architecture within the team
Defining and owning SLIs/SLOs and ensuring post-incident actions land and reliability improves release-over-release
Implementing advanced optimizations such as micro-batch schedulers, speculative decoding, and KV-cache reuse
Strengthening incident posture through capacity planning, autoscaling policy, and rollback/traffic-shift strategies
Mentoring IC1/IC2 engineers and reviewing cross-team designs to elevate coding/testing standards

We're looking for someone with strong coding skills in Python or Go, deep familiarity with networked systems and performance, and hands-on experience with Kubernetes at production scale. If you have experience with inference internals, batching, caching, mixed precision, and streaming token delivery, that's a plus.

In addition to a competitive salary, we offer a range of benefits including medical, dental, and vision insurance, company-paid life insurance, and flexible PTO. We're committed to creating a work environment that's inclusive, diverse, and supportive of our employees' well-being.

full-time senior hybrid $165,000 to $242,000 Python, Go, Kubernetes, Networked systems, Performance, Inference internals, Batching, Caching, Mixed precision, Streaming token delivery, CUDA kernels, NCCL/SHARP, RDMA/NUMA, GPU interconnect topologies, Contributions to inference frameworks, Experience with multi-team initiatives Engineering Technology CoreWeave https://logos.yubhub.co/coreweave.com.png CoreWeave is a cloud computing company that provides a platform for building and scaling AI applications. https://www.coreweave.com https://job-boards.greenhouse.io/coreweave/jobs/4604832006 Sunnyvale, CA / Bellevue, WA 2026-04-18

854e95b5-76b Sr. Director of Product, Research and Training Infrastructure

CoreWeave is seeking a visionary Sr. Director of Product, Research Training Infrastructure to lead the product strategy and engineering execution for the services that power the most ambitious AI research labs in the world.

This executive leader will own the product strategy and engineering execution for the Research Training Stack, focusing on the specialized orchestration, evaluation, and iteration tools required for massive-scale pre-training and post-training.

Key responsibilities include:

Frontier Orchestration: Oversee the evolution of SUNK (Slurm on Kubernetes) to provide researchers with deterministic, bare-metal performance through a cloud-native interface.

Holistic Training Services: Drive the development of next-generation orchestrators and automated training-based evaluation frameworks that ensure model quality throughout the lifecycle.

Post-Training Excellence: Build the infrastructure required for sophisticated Reinforcement Learning (RL) and RLHF pipelines, enabling labs to refine foundation models with maximum efficiency.

Customer Advocacy: Act as the primary technical partner for lead researchers at global AI labs, translating their 'future-state' requirements into actionable product roadmaps.

Requirements include:

Proven engineering leadership experience, with 5+ years managing large-scale infrastructure at a top-tier research lab or an AI-native cloud provider.

Deep, hands-on knowledge of Slurm, Kubernetes, and the specific networking requirements (InfiniBand/RDMA) for distributed training clusters.

Research mindset and understanding of the 'pain points' of a research scientist.

Scaling experience delivering mission-critical services on multi-thousand GPU clusters (H100/Blackwell/Rubin architectures).

Strategic vision to define 'what's next' in the AI stack, from automated RL loops to specialized sandbox environments.

Why CoreWeave?

In 2026, CoreWeave is the foundation of the largest infrastructure buildout in human history. We are building AI Factories, not just data centers.

Silicon-Up Innovation: Work directly with the latest NVIDIA architectures.

Impact: You will be the architect of the environment that enables the next new discovery.

Velocity: We move at the speed of the researchers we support, bypassing legacy cloud bottlenecks to deliver raw power.

full-time executive hybrid $233,000 to $341,000 Slurm, Kubernetes, InfiniBand/RDMA, Distributed training clusters, GPU clusters, H100/Blackwell/Rubin architectures, Reinforcement Learning (RL), RLHF pipelines Engineering Technology CoreWeave https://logos.yubhub.co/coreweave.com.png CoreWeave is a cloud computing company that provides infrastructure and tools for artificial intelligence research and development. https://www.coreweave.com https://job-boards.greenhouse.io/coreweave/jobs/4665964006 Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA 2026-04-18

2d198020-3d5 Sr. Engineer, Storage

The Storage Engine Team at CoreWeave is responsible for the product capabilities and data plane function of CoreWeave's managed storage products. We build reliable, scalable storage solutions with segment-leading performance. Storage Engine works with engineering teams across infrastructure, compute, and platform to ensure our storage services meet the needs of the world's most demanding AI workloads.

The primary responsibilities of this role include designing and implementing distributed storage solutions to support scaling data-intensive AI workloads, contributing to the development of exabyte-scale, S3-compatible object storage, and integrating dedicated storage clusters into diverse customer environments. Additionally, the successful candidate will work with technologies such as RDMA, GPU Direct Storage, and distributed filesystem protocols such as NFS or FUSE to optimize storage performance and efficiency.

Key responsibilities also include leading efforts to improve the reliability, durability, security, and observability of our storage stack, collaborating with operations teams to monitor, troubleshoot, and improve storage systems in production environments, setting the bar for developing metrics and dashboards to provide visibility into storage performance and health, analyzing telemetry and system data to drive improvements in throughput, latency, and resilience, and working cross-functionally with platform, product, and infrastructure teams to deliver seamless storage capabilities across the stack.

A key aspect of this role is sharing knowledge and mentoring other engineers on best practices in building distributed, high-performance systems.

To be successful in this role, the ideal candidate will have a strong background in storage systems engineering or infrastructure, with a minimum of 8-10 years of experience. They will also have hands-on experience with object storage or distributed filesystems in production environments, as well as proficiency in a systems programming language such as Go, C, or Rust. Additionally, they will have experience working with cloud-native infrastructure, Kubernetes, and scalable system architectures, and familiarity with storage observability tools and telemetry pipelines.

If you're a motivated and experienced engineer looking to join a dynamic team and contribute to the development of cutting-edge storage solutions, we encourage you to apply for this exciting opportunity.

full-time senior hybrid $143,000 to $210,000 storage systems engineering, infrastructure, object storage, distributed filesystems, RDMA, GPU Direct Storage, NFS, FUSE, cloud-native infrastructure, Kubernetes, scalable system architectures, storage observability tools, telemetry pipelines, Go, C, Rust, distributed systems, high-performance systems, storage performance and efficiency Engineering Technology CoreWeave https://logos.yubhub.co/coreweave.com.png CoreWeave is a cloud computing company that provides a platform for building and scaling AI applications. https://www.coreweave.com https://job-boards.greenhouse.io/coreweave/jobs/4664429006 Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA 2026-04-18

9701c504-1a6 Senior Software Engineer I, Inference

We're looking for a Senior Software Engineer I to join our team. As a senior engineer, you'll lead designs, raise engineering standards, and deliver measurable improvements to latency, throughput, and reliability across multiple services. You'll partner with product, orchestration, and hardware teams to evolve our Kubernetes-native inference platform and meet strict P99 SLAs at scale.

Key responsibilities include:

Lead design reviews and drive architecture within the team; decompose multi-service work into clear milestones.
Define and own SLIs/SLOs; ensure post-incident actions land and reliability improves release-over-release.
Implement advanced optimizations (e.g., micro-batch schedulers, speculative decoding, KV-cache reuse) and quantify impact.
Strengthen incident posture: capacity planning, autoscaling policy, graceful degradation, rollback/traffic-shift strategies.
Mentor IC1/IC2 engineers; review cross-team designs and elevate coding/testing standards.

Requirements include:

3-5 years of industry experience building distributed systems or cloud services.
Strong coding in Python or Go (C++ a plus) and deep familiarity with networked systems and performance.
Hands-on experience with Kubernetes at production scale, CI/CD, and observability stacks (Prometheus, Grafana, OpenTelemetry).
Practical knowledge of inference internals: batching, caching, mixed precision (BF16/FP8), streaming token delivery.
Proven track record improving tail latency (P95/P99) and service reliability through metrics-driven work.

Preferred qualifications include contributions to inference frameworks, experience with CUDA kernels, NCCL/SHARP, RDMA/NUMA, or GPU interconnect topologies, and leading multi-team initiatives or partnering with customers on mission-critical launches.

full-time senior hybrid $139,000 to $204,000 Python, Go, Kubernetes, CI/CD, Observability stacks, Inference internals, Batching, Caching, Mixed precision, Streaming token delivery, Contributions to inference frameworks, CUDA kernels, NCCL/SHARP, RDMA/NUMA, GPU interconnect topologies Engineering Technology CoreWeave https://logos.yubhub.co/coreweave.com.png CoreWeave is a cloud computing company that provides a platform for building and scaling AI. It was founded in 2017 and became a publicly traded company in March 2025. https://www.coreweave.com https://job-boards.greenhouse.io/coreweave/jobs/4647603006 Sunnyvale, CA / Bellevue, WA 2026-04-18

a51375e8-30e Member of Technical Staff, Software Co-Design AI HPC Systems

Our team's mission is to architect, co-design, and productionize next-generation AI systems at datacenter scale. We operate at the intersection of models, systems software, networking, storage, and AI hardware, optimizing end-to-end performance, efficiency, reliability, and cost. Our work spans today's frontier AI workloads and directly shapes the next generation of accelerators, system architectures, and large-scale AI platforms.

We pursue this mission through deep hardware–software co-design, combining rigorous systems thinking with hands-on engineering. The team invests heavily in understanding real production workloads (large-scale training, inference, and emerging multimodal models) and translating those insights into concrete improvements across the stack: from kernels, runtimes, and distributed systems, all the way down to silicon-level trade-offs and datacenter-scale architectures.

This role sits at the boundary between exploration and production. You will work closely with internal infrastructure, hardware, compiler, and product teams, as well as external partners across the hardware and systems ecosystem. Our operating model emphasizes rapid ideation and prototyping, followed by disciplined execution to drive high-leverage ideas into production systems that operate at massive scale.

In addition to delivering real-world impact on large-scale AI platforms, the team actively contributes to the broader research and engineering community. Our work aligns closely with leading communities in ML systems, distributed systems, computer architecture, and high-performance computing, and we regularly publish, prototype, and open-source impactful technologies where appropriate.

About the Team

We build foundational AI infrastructure that enables large-scale training and inference across diverse workloads and rapidly evolving hardware generations. Our work directly shapes how AI systems are designed, deployed, and scaled today and into the future. Engineers on this team operate with end-to-end ownership, deep technical rigor, and a strong bias toward real-world impact.

Microsoft Superintelligence Team

Microsoft Superintelligence team’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

This role is part of Microsoft AI’s Superintelligence Team. The MAIST is a startup-like team inside Microsoft AI, created to push the boundaries of AI toward Humanist Superintelligence: ultra-capable systems that remain controllable, safety-aligned, and anchored to human values. Our mission is to create AI that amplifies human potential while ensuring humanity remains firmly in control. We aim to deliver breakthroughs that benefit society by advancing science, education, and global well-being. We’re also fortunate to partner with incredible product teams, giving our models the chance to reach billions of users and create immense positive impact. If you’re a brilliant, highly ambitious, and low-ego individual, you’ll fit right in. Come and join us as we work on our next generation of models!

Responsibilities

Lead the co-design of AI systems across hardware and software boundaries, spanning accelerators, interconnects, memory systems, storage, runtimes, and distributed training/inference frameworks.
Drive architectural decisions by analyzing real workloads, identifying bottlenecks across compute, communication, and data movement, and translating findings into actionable system and hardware requirements.
Co-design and optimize parallelism strategies, execution models, and distributed algorithms to improve scalability, utilization, reliability, and cost efficiency of large-scale AI systems.
Develop and evaluate what-if performance models to project system behavior under future workloads, model architectures, and hardware generations, providing early guidance to hardware and platform roadmaps.
Partner with compiler, kernel, and runtime teams to unlock the full performance of current and next-generation accelerators, including custom kernels, scheduling strategies, and memory optimizations.
Influence and guide AI hardware design at system and silicon levels, including accelerator microarchitecture, interconnect topology, memory hierarchy, and system integration trade-offs.
Lead cross-functional efforts to prototype, validate, and productionize high-impact co-design ideas, working across infrastructure, hardware, and product teams.
Mentor senior engineers and researchers, set technical direction, and raise the overall bar for systems rigor, performance engineering, and co-design thinking across the organization.

full-time staff hybrid AI accelerator or GPU architectures, Distributed systems and large-scale AI training/inference, High-performance computing (HPC) and collective communications, ML systems, runtimes, or compilers, Performance modeling, benchmarking, and systems analysis, Hardware–software co-design for AI workloads, Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development, Experience designing or operating large-scale AI clusters for training or inference, Deep familiarity with LLMs, multimodal models, or recommendation systems, and their systems-level implications, Experience with accelerator interconnects and communication stacks (e.g., NCCL, MPI, RDMA, high-speed Ethernet or InfiniBand), Background in performance modeling and capacity planning for future hardware generations, Prior experience contributing to or leading hardware roadmaps, silicon bring-up, or platform architecture reviews, Publications, patents, or open-source contributions in systems, architecture, or ML systems Engineering Technology Microsoft AI https://logos.yubhub.co/microsoft.ai.png Microsoft AI is a technology company that develops and markets software products and services. It is one of the largest and most successful technology companies in the world. https://microsoft.ai https://microsoft.ai/job/member-of-technical-staff-software-co-design-ai-hpc-systems-mai-superintelligence-team-3/ London 2026-03-08

cd1a0d16-311 Member of Technical Staff, Software Co-Design AI HPC Systems

Our team's mission is to architect, co-design, and productionize next-generation AI systems at datacenter scale. We operate at the intersection of models, systems software, networking, storage, and AI hardware, optimizing end-to-end performance, efficiency, reliability, and cost.

We pursue this mission through deep hardware–software co-design, combining rigorous systems thinking with hands-on engineering. The team invests heavily in understanding real production workloads (large-scale training, inference, and emerging multimodal models) and translating those insights into concrete improvements across the stack: from kernels, runtimes, and distributed systems, all the way down to silicon-level trade-offs and datacenter-scale architectures.

This role sits at the boundary between exploration and production. You will work closely with internal infrastructure, hardware, compiler, and product teams, as well as external partners across the hardware and systems ecosystem. Our operating model emphasizes rapid ideation and prototyping, followed by disciplined execution to drive high-leverage ideas into production systems that operate at massive scale.

In addition to delivering real-world impact on large-scale AI platforms, the team actively contributes to the broader research and engineering community. Our work aligns closely with leading communities in ML systems, distributed systems, computer architecture, and high-performance computing, and we regularly publish, prototype, and open-source impactful technologies where appropriate.

Microsoft Superintelligence Team

The Microsoft Superintelligence team’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities

Lead the co-design of AI systems across hardware and software boundaries, spanning accelerators, interconnects, memory systems, storage, runtimes, and distributed training/inference frameworks.

Drive architectural decisions by analyzing real workloads, identifying bottlenecks across compute, communication, and data movement, and translating findings into actionable system and hardware requirements.

Co-design and optimize parallelism strategies, execution models, and distributed algorithms to improve scalability, utilization, reliability, and cost efficiency of large-scale AI systems.

Develop and evaluate what-if performance models to project system behavior under future workloads, model architectures, and hardware generations, providing early guidance to hardware and platform roadmaps.

Partner with compiler, kernel, and runtime teams to unlock the full performance of current and next-generation accelerators, including custom kernels, scheduling strategies, and memory optimizations.

Influence and guide AI hardware design at system and silicon levels, including accelerator microarchitecture, interconnect topology, memory hierarchy, and system integration trade-offs.

Lead cross-functional efforts to prototype, validate, and productionize high-impact co-design ideas, working across infrastructure, hardware, and product teams.

Mentor senior engineers and researchers, set technical direction, and raise the overall bar for systems rigor, performance engineering, and co-design thinking across the organization.

Qualifications

Bachelor’s Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.

Additional or Preferred Qualifications

Master’s Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor’s Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.

Strong background in one or more of the following areas:
AI accelerator or GPU architectures
Distributed systems and large-scale AI training/inference
High-performance computing (HPC) and collective communications
ML systems, runtimes, or compilers
Performance modeling, benchmarking, and systems analysis
Hardware–software co-design for AI workloads

Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development.

Proven ability to work across organizational boundaries and influence technical decisions involving multiple stakeholders.
Experience designing or operating large-scale AI clusters for training or inference.
Deep familiarity with LLMs, multimodal models, or recommendation systems, and their systems-level implications.
Experience with accelerator interconnects and communication stacks (e.g., NCCL, MPI, RDMA, high-speed Ethernet or InfiniBand).
Background in performance modeling and capacity planning for future hardware generations.
Prior experience contributing to or leading hardware roadmaps, silicon bring-up, or platform architecture reviews.
Publications, patents, or open-source contributions in systems, architecture, or ML systems are a plus.

]]> full-time staff hybrid $139,900 – $274,800 per year C, C++, C#, Java, JavaScript, Python, AI accelerator or GPU architectures, Distributed systems and large-scale AI training/inference, High-performance computing (HPC) and collective communications, ML systems, runtimes, or compilers, Performance modeling, benchmarking, and systems analysis, Hardware–software co-design for AI workloads, Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development, LLMs, multimodal models, or recommendation systems, and their systems-level implications, Accelerator interconnects and communication stacks (e.g., NCCL, MPI, RDMA, high-speed Ethernet or InfiniBand), Performance modeling and capacity planning for future hardware generations, Contributing to or leading hardware roadmaps, silicon bring-up, or platform architecture reviews, Publications, patents, or open-source contributions in systems, architecture, or ML systems Engineering Technology Microsoft AI https://logos.yubhub.co/microsoft.ai.png Microsoft AI is a technology company that develops and markets software products and services. It is one of the largest and most successful technology companies in the world. https://microsoft.ai https://microsoft.ai/job/member-of-technical-staff-software-co-design-ai-hpc-systems-mai-superintelligence-team-2/ Redmond 2026-03-08 73ff6f07-c0e Staff Software Engineer, AI Reliability Engineering About Anthropic

Anthropic's mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

About the Role

Claude has your back. AIRE has Claude's. Help us keep Claude reliable for everyone who depends on it.

AIRE (AI Reliability Engineering) partners with teams across Anthropic to improve reliability across our most critical serving paths -- every hop from the SDK through our network, API layers, serving infrastructure, and accelerators and back. We jump into the trenches alongside partner teams to make the systems that deliver Claude more robust and resilient, whether during an incident or while collaborating on projects.

Reliability here is an emergent phenomenon that transcends any single team's boundaries, so someone has to zoom out and look at the whole picture. That's us -- and it means few teams at Anthropic offer this kind of dynamic, cross-cutting exposure to the systems that matter most.

Responsibilities

Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity.
Design and implement monitoring and observability systems across the token path.
Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers.
Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.
Support the reliability of safeguard model serving -- critical for both site reliability and Anthropic's safety commitments.

You may be a good fit if you

Have strong distributed systems, infrastructure, or reliability backgrounds -- we're looking for reliability-minded software engineers and SREs.
Are curious and brave -- comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don't have deep expertise yet.
Think holistically about how systems compose and where the seams are.
Can build lasting relationships across teams -- our engagement model depends on being welcomed as teammates, not outsiders with opinions.
Care about users and feel ownership over outcomes, even for systems you don't own.
Have excellent communication and collaboration skills -- you'll be partnering across the entire company.
Bring diverse experience -- the team's strength comes from people who've built product stacks, scaled databases, run massive distributed systems, and everything in between.

Strong candidates may also

Have been an SRE, Production Engineer, or in a similar reliability-focused role on large-scale systems.
Have experience operating large-scale model serving or training infrastructure (>1000 GPUs).
Have experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium).
Understand ML-specific networking optimizations like RDMA and InfiniBand.
Have expertise in AI-specific observability tools and frameworks.
Have experience with chaos engineering and systematic resilience testing.
Have contributed to open-source infrastructure or ML tooling.

Logistics

Education requirements: We require at least a Bachelor's degree in a related field or equivalent experience. Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.

Visa sponsorship

We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.

We encourage you to apply even if you do not believe you meet every single qualification.

Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you're interested in this work.

Your safety matters to us.

To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses. In some cases, we may partner with vetted recruiting agencies who will identify themselves as working on behalf of Anthropic. Be cautious of emails from other domains. Legitimate Anthropic recruiters will never ask for money, fees, or banking information before your first day. If you're ever unsure about a communication, don't click any links—visit anthropic.com/careers directly for confirmed position openings.

How we're different

We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact — advancing our long-term goals of steerable, trustworthy AI — rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and engineering as it does with computer science.

]]> full-time staff hybrid £325,000 - £390,000GBP distributed systems, infrastructure, reliability, software engineering, SRE, large scale systems, model serving, training infrastructure, ML hardware accelerators, RDMA, InfiniBand, AI-specific observability tools, chaos engineering, resilience testing, open-source infrastructure, ML tooling, SRE, Production Engineer, reliability-focused roles, ML hardware accelerators, RDMA, InfiniBand, AI-specific observability tools, chaos engineering, resilience testing, open-source infrastructure, ML tooling Engineering Technology Anthropic https://logos.yubhub.co/anthropic.com.png Anthropic is a company that creates reliable, interpretable, and steerable AI systems. It has a quickly growing team of researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems. https://job-boards.greenhouse.io https://job-boards.greenhouse.io/anthropic/jobs/5101173008 London, UK 2026-03-08 10798a1e-9fa Staff Software Engineer, AI Reliability Engineering About Anthropic

About the Role

Claude has your back. AIRE has Claude's. Help us keep Claude reliable for everyone who depends on it.

Responsibilities

Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity.
Design and implement monitoring and observability systems across the token path.
Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers.
Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.
Support the reliability of safeguard model serving -- critical for both site reliability and Anthropic's safety commitments.

You may be a good fit if you

Have strong distributed systems, infrastructure, or reliability backgrounds -- we're looking for reliability-minded software engineers and SREs.
Are curious and brave -- comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don't have deep expertise yet.
Think holistically about how systems compose and where the seams are.
Can build lasting relationships across teams -- our engagement model depends on being welcomed as teammates, not outsiders with opinions.
Care about users and feel ownership over outcomes, even for systems you don't own.
Have excellent communication and collaboration skills -- you'll be partnering across the entire company.
Bring diverse experience -- the team's strength comes from people who've built product stacks, scaled databases, run massive distributed systems, and everything in between.

Strong candidates may also

Have been an SRE, Production Engineer, or in a similar reliability-focused role on large-scale systems.
Have experience operating large-scale model serving or training infrastructure (>1000 GPUs).
Have experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium).
Understand ML-specific networking optimizations like RDMA and InfiniBand.
Have expertise in AI-specific observability tools and frameworks.
Have experience with chaos engineering and systematic resilience testing.
Have contributed to open-source infrastructure or ML tooling.

Logistics

Salary

The annual compensation range for this role is €235,000 - €295,000 EUR.

How we're different

]]> full-time staff hybrid €235.000 - €295.000EUR distributed systems, infrastructure, reliability, software engineering, SRE, large scale systems, model serving, training infrastructure, ML hardware accelerators, RDMA, InfiniBand, AI-specific observability tools, chaos engineering, resilience testing, open-source infrastructure, ML tooling, communication, collaboration, diverse experience, product stacks, databases, distributed systems Engineering Technology Anthropic https://logos.yubhub.co/anthropic.com.png Anthropic is a company that creates reliable, interpretable, and steerable AI systems. It has a quickly growing team of researchers, engineers, policy experts, and business leaders. https://job-boards.greenhouse.io https://job-boards.greenhouse.io/anthropic/jobs/5101169008 Dublin 2026-03-08 cb9e2dd0-6da Linux Kernels Software Lead Job Posting

Linux Kernels Software Lead

Location

San Francisco

Employment Type

Full time

Department

Scaling

Compensation

$342K – $555K • Offers Equity

The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts

Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)

401(k) retirement plan with employer match

Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)

Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees

13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)

Mental health and wellness support

Employer-paid basic life and disability coverage

Annual learning and development stipend to fuel your professional growth

Daily meals in our offices, and meal delivery credits as eligible

Relocation support for eligible employees

Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.

More details about our benefits are available to candidates during the hiring process.

This role is at-will and OpenAI reserves the right to modify base pay and other compensation components at any time based on individual performance, team or company results, or market conditions.

About the Team

The Scaling team builds and optimizes large-scale infrastructure to enable next-generation AI workloads.

About the Role

We’re looking for a founding/lead Linux kernel developer to join our Scaling team. In this role, you’ll design and develop Linux kernel components, working at the intersection of hardware and software to unlock performance at scale.

Responsibilities

Lead and bootstrap the development of our Linux kernel stack to support high-performance systems.

Design and implement kernel drivers, including for functionality related to DMA, PCIe, NICs, and RDMA.

Drive end-to-end development of system-scale networking, including required kernel and other low-level software.

Collaborate with vendors to integrate their technologies within our systems.

Bring up and debug the kernel on new platforms.

Build userspace software to support integration, testing, diagnostics, and performance validation.

Qualifications

Proven experience leading development within the Linux kernel.

Deep knowledge of subsystems relevant to high-performance systems: PCIe, dma-buf, RDMA, P2P, SR-IOV, IOMMU, etc.

Knowledge of subsystems and frameworks related to scale-out networking: ibverbs, ECN/DCQCN, etc.

Strong programming skills in C, C++, Python, and Linux shell scripting; Rust experience is a strong plus.

Experience working directly with engineering teams to define interfaces and tooling.

Track record of managing vendor deliverables and technical relationships.

Background in embedded systems development (bootloaders, drivers, hardware/software integration).

Ability to thrive in ambiguity and build systems from scratch.

_To comply with U.S. export control laws and regulations, candidates for this role may need to meet certain legal status requirements as provided in those laws and regulations._

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

]]> full-time senior onsite $342K – $555K • Offers Equity Linux kernel development, C, C++, Python, Linux shell scripting, Rust, PCIe, dma-buf, RDMA, P2P, SR-IOV, IOMMU, ibverbs, ECN/DCQCN, Embedded systems development, Bootloaders, Drivers, Hardware/software integration Engineering Technology OpenAI https://logos.yubhub.co/openai.com.png OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. https://jobs.ashbyhq.com https://jobs.ashbyhq.com/openai/e5691162-4e45-4dc6-a6bf-64f60ebf1ac4 San Francisco 2026-03-06 2ad46876-f84 Software Engineer, Collective Communication Job Posting

Software Engineer, Collective Communication

Location

San Francisco

Employment Type

Full time

Department

Scaling

Compensation

$380K – $555K • Offers Equity

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts

Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)

401(k) retirement plan with employer match

Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)

Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees

13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)

Mental health and wellness support

Employer-paid basic life and disability coverage

Annual learning and development stipend to fuel your professional growth

Daily meals in our offices, and meal delivery credits as eligible

Relocation support for eligible employees

Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.

More details about our benefits are available to candidates during the hiring process.

This role is at-will and OpenAI reserves the right to modify base pay and other compensation components at any time based on individual performance, team or company results, or market conditions.

About the Team

The Workload Networking team is responsible for the collective communication stack used in our largest training jobs. Using a combination of C++ and CUDA, we work on novel collective communication techniques that enable efficient training of our flagship models on our largest custom-built supercomputers.

The models we train are key ingredients to the AI research progress at OpenAI and the field as a whole, and we continually incorporate learnings from our entire research org into our training platform.

About the Role

As a Software Engineer, Networking, you will design and implement custom networking collectives that are tightly integrated into our training stack.

We’re looking for people who have a background in low-level, performance-critical software. Experience with collective communication is a bonus.

This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.

In this role, you will:

Collaborate closely with ML researchers to design and implement efficient collective operations in C++ and CUDA.

Ensure that our largest training jobs take full advantage of the different network transports used in our supercomputers.

Work on simulations to inform our future supercomputer network designs.

You might thrive in this role if you:

Have written distributed algorithms using RDMA in the past.

Are comfortable writing low-level, performance-sensitive CPU and/or GPU code.

Are familiar with network simulation techniques.

About OpenAI

]]> full-time mid hybrid $380K – $555K • Offers Equity C++, CUDA, RDMA, network simulation techniques, low level performance sensitive CPU and/or GPU code, distributed algorithms, collective communication Engineering Technology OpenAI https://logos.yubhub.co/openai.com.png OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. https://jobs.ashbyhq.com https://jobs.ashbyhq.com/openai/340c0c22-8d8f-4232-b17e-f642b64c25c3 San Francisco 2026-03-06