<?xml version="1.0" encoding="UTF-8"?>
<source>
  <jobs>
    <job>
      <externalid>ac45e205-e7d</externalid>
      <Title>Engineering Manager, Inference Routing and Performance</Title>
<Description><![CDATA[<p><strong>About the role</strong></p>
<p>Every request that hits Claude, whether from claude.ai, the API, our cloud partners, or internal research, passes through a routing decision. Not a generic load-balancer round-robin, but a decision that accounts for what&#39;s already cached where, which accelerator the request runs best on, and what else is in flight across the fleet.</p>
<p>Get it right and you extract meaningfully more throughput from the same hardware. Get it wrong and you burn capacity, miss latency SLOs, or shed load that shouldn&#39;t have been shed.</p>
<p>The Inference Routing team owns this layer. We build the cluster-level routing and coordination plane for Anthropic&#39;s inference fleet: the system that sits between the API surface and the inference engines themselves, making fleet-wide efficiency decisions in real time. As Anthropic moves from &quot;many independent inference replicas&quot; toward &quot;a single warehouse-scale computer running a coordinated program,&quot; Dystro is the coordination layer.</p>
<p>This is a deeply technical team. The engineers here design custom load-balancing algorithms, build quantitative models of system performance, debug latency spikes that cross kernel, network, and framework boundaries, and reason carefully about cache placement across thousands of accelerators. They work shoulder-to-shoulder with teams that write kernels and ML framework internals.</p>
<p>The EM for this team doesn&#39;t need to write kernels, but they do need the systems depth to make architectural calls, evaluate deeply technical candidates, and spot when a proposed optimization will have second-order effects on the fleet.</p>
<p>You&#39;ll inherit a strong team of distributed-systems engineers, and you&#39;ll be accountable for two things that pull in different directions: shipping system-level performance improvements that measurably increase fleet throughput and efficiency, and running the team operationally so that deploys are safe, incidents are rare, and the teams who depend on Dystro can plan around you with confidence. The job is holding both.</p>
<p><strong>Representative work</strong></p>
<p>Things the Inference Routing EM actually spends time on:</p>
<ul>
<li>Deciding whether a proposed routing-algorithm change is worth the deploy risk, given the modeled throughput gain and the blast radius if it regresses</li>
<li>Sequencing a quarter where KV-cache offload, a new coordination protocol, and two model launches all compete for the same engineers</li>
<li>Working through a persistent tail-latency regression with the team, walking down from fleet-level metrics to per-replica behavior to a root cause in the networking stack</li>
<li>Building the case (with numbers) to peer teams for why a cross-team protocol change unlocks the next efficiency win</li>
<li>Running the post-incident review after a cache-eviction bug caused a capacity event, and turning it into process changes that stick</li>
<li>Interviewing a candidate who has built schedulers at supercomputing scale, and deciding whether they&#39;d be additive to a team that already goes deep</li>
</ul>
<p><strong>What you&#39;ll do</strong></p>
<p>Drive system-level performance</p>
<ul>
<li>Own the technical roadmap for cluster-level inference efficiency: routing decisions, cache placement and eviction, cross-replica coordination, and the protocols that keep routing and inference engines in sync</li>
<li>Partner with the inference engine, kernels, and performance teams to identify fleet-level throughput and latency wins, then turn those into shipped improvements with measurable results</li>
<li>Build the team&#39;s habit of quantitative performance modeling: claim a win only when you can measure it, and know before you ship what the expected effect is</li>
</ul>
<p>Deliver reliably and operate cleanly</p>
<ul>
<li>Set technical strategy for how routing evolves across heterogeneous hardware (GPUs, TPUs, Trainium) and across all our serving surfaces</li>
<li>Run the team&#39;s operational backbone (on-call rotation, incident response, postmortem review, deploy safety) so the team can ship aggressively without the system becoming fragile</li>
<li>Create clarity at a seam: Inference Routing sits between the API surface, the inference engines, and the cloud deployment teams. You&#39;ll make sure commitments are realistic, dependencies are understood, and nobody is surprised</li>
</ul>
<p>Build and grow the team</p>
<ul>
<li>Develop and retain a strong existing team, and hire against the bar described above: people who can go to the OS and framework level when the problem demands it, and who care about production reliability</li>
<li>Coach engineers through a roadmap where priorities shift with model launches, new hardware, and scaling demands. We pair a lot here; you&#39;ll help make that collaboration pattern productive</li>
<li>Pick up slack when it matters. This is a small team on a critical path; sometimes the EM is the one unblocking a stuck deploy or synthesizing a design debate</li>
</ul>
<p><strong>You may be a good fit if you:</strong></p>
<ul>
<li>Have 5+ years of engineering management experience, ideally with at least part of that leading teams on critical-path production infrastructure at scale</li>
<li>Have a deep systems background (load balancing, scheduling, cache-coherent distributed state, high-performance networking, or similar). You need enough depth to make architectural calls about routing and efficiency, and to evaluate candidates who go to the kernel and framework level</li>
<li>Have shipped performance improvements in large-scale systems and can explain, with numbers, what the impact was</li>
<li>Have run production infrastructure with real operational stakes: on-call, incident response, capacity events, deploy discipline</li>
<li>Are results-oriented with a bias toward impact, and comfortable working in a space where throughput, latency, stability, and feature velocity all pull in different directions</li>
<li>Build strong relationships across team boundaries; this is a seam role, and much of the job is making sure other teams can rely on yours</li>
<li>Are curious about machine learning systems. You don&#39;t need an ML research background, but you should want to learn how transformer inference actually works and how that shapes the systems problems</li>
</ul>
<p><strong>Strong candidates may also have:</strong></p>
<ul>
<li>Experience with LLM inference serving: KV caching, continuous batching, request scheduling, prefill/decode disaggregation</li>
<li>Background in cluster schedulers, load balancers, service meshes, or coordination planes at scale</li>
<li>Familiarity with heterogeneous accelerator fleets (GPU/TPU/Trainium) and how hardware differences affect workload placement</li>
<li>Experience with GPU/accelerator programming, ML framework internals, or OS-level performance debugging (enough to follow and evaluate the technical work, not necessarily to do it daily)</li>
<li>Led teams at supercomputing or hyperscaler infrastructure scale</li>
<li>Led teams through rapid-growth periods where hiring and onboarding competed with roadmap delivery</li>
</ul>
<p>The annual compensation range for this role is listed below. For sales roles, the range provided is the role’s On Target Earnings (&quot;OTE&quot;) range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role.</p>
<p>Annual Salary: $405,000-$485,000 USD</p>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$405,000-$485,000 USD</Salaryrange>
      <Skills>engineering management, distributed systems, load balancing, scheduling, cache-coherent distributed state, high-performance networking, machine learning systems, LLM inference serving, cluster schedulers, load balancers, service meshes, coordination planes, heterogeneous accelerator fleets, GPU/TPU/Trainium, GPU/accelerator programming, ML framework internals, OS-level performance debugging</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic creates reliable, interpretable, and steerable AI systems.</Employerdescription>
      <Employerwebsite>https://www.anthropic.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/5155391008</Applyto>
      <Location>San Francisco, CA | New York City, NY</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>a14533c3-732</externalid>
      <Title>Senior Engineer, Cilium CNI &amp; Cloud Networking</Title>
      <Description><![CDATA[<p>Network Services Team</p>
<p>The Network Services team builds and operates the foundational networking that powers CoreWeave&#39;s Kubernetes platforms at cloud scale. The team is responsible for container networking, connectivity, and network services that support large-scale, GPU-driven workloads across regions and environments. They focus on scalability, reliability, security, and performance while delivering intuitive platforms for internal teams and customers.</p>
<p>About the Role</p>
<p>As a Senior Engineer focused on our Cilium-based CNI, you will design, build, and operate the container networking layer that underpins CoreWeave&#39;s Kubernetes platforms. Day to day, you will work on evolving our CNI stack to support large, high-density GPU clusters with demanding throughput and latency requirements. You will partner closely with Kubernetes, Infrastructure, and Network Services engineers to ensure the platform is highly available, observable, and secure. This role spans architecture, implementation, and operations, with ownership from prototype through production. You will also help shape how our networking platform scales for future growth.</p>
<p>Who You Are</p>
<ul>
<li>5+ years of experience as a Software Engineer or Systems Engineer working on cloud infrastructure or large-scale distributed systems.</li>
<li>Hands-on production experience with Cilium CNI (or equivalent advanced CNIs), including cluster configuration and lifecycle management.</li>
<li>Strong understanding of Cilium&#39;s eBPF datapath, policy model, and load-balancing mechanisms.</li>
<li>Deep knowledge of cloud networking concepts, including VPCs, subnets, routing, security groups/ACLs, NAT, and ingress/egress architectures.</li>
<li>Experience designing multi-tenant network architectures with strong isolation and security.</li>
<li>Solid grounding in TCP/IP, dynamic routing (e.g., BGP), ECMP, MTU/fragmentation, and overlay/underlay networking (VXLAN, Geneve, encapsulation).</li>
<li>Experience with network observability and troubleshooting across L3–L7.</li>
<li>Proficiency in at least one systems language such as Golang or C/C++.</li>
<li>Experience working in modern CI/CD environments.</li>
<li>Experience operating Kubernetes at scale, including cluster lifecycle management and debugging networking issues across pods, nodes, and external services.</li>
<li>Demonstrated ownership of complex systems end-to-end.</li>
</ul>
<p>Preferred</p>
<ul>
<li>Experience operating cloud-scale network services across tens of thousands of nodes and multiple regions.</li>
<li>Contributions to Cilium, Kubernetes, or related open-source networking projects.</li>
<li>Experience with eBPF development and performance tuning.</li>
<li>Experience building Kubernetes operators or controllers.</li>
<li>Familiarity with service meshes, multi-cluster networking, or cluster mesh solutions.</li>
<li>Experience in GPU-heavy, HPC, or other performance-sensitive environments.</li>
</ul>
<p>Wondering if you’re a good fit?</p>
<p>We believe in investing in our people and value candidates who bring diverse experiences, even if you’re not a 100% match on paper. If some of this sounds like you, we’d love to talk.</p>
<ul>
<li>You love solving complex distributed systems and networking challenges at scale.</li>
<li>You’re curious about cloud-native networking, eBPF, and Kubernetes internals.</li>
<li>You’re an expert in building reliable, scalable infrastructure that runs in production.</li>
</ul>
<p>Why CoreWeave?</p>
<p>At CoreWeave, we work hard, have fun, and move fast! We’re in an exciting stage of hyper-growth that you will not want to miss out on. We’re not afraid of a little chaos, and we’re constantly learning. Our team cares deeply about how we build our product and how we work together, which is represented through our core values:</p>
<ul>
<li>Be Curious at Your Core</li>
<li>Act Like an Owner</li>
<li>Empower Employees</li>
<li>Deliver Best-in-Class Client Experiences</li>
<li>Achieve More Together</li>
</ul>
<p>The base salary range for this role is $165,000 to $242,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).</p>
<p>What We Offer</p>
<p>The range we’ve posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate which can include a variety of factors. These include qualifications, experience, interview performance, and location. In addition to a competitive salary, we offer a variety of benefits to support your needs, including:</p>
<ul>
<li>Medical, dental, and vision insurance, 100% paid for by CoreWeave</li>
<li>Company-paid Life Insurance</li>
<li>Voluntary supplemental life insurance</li>
<li>Short and long-term disability insurance</li>
<li>Flexible Spending Account</li>
<li>Health Savings Account</li>
<li>Tuition Reimbursement</li>
<li>Ability to Participate in Employee Stock Purchase Program (ESPP)</li>
<li>Mental Wellness Benefits through Spring Health</li>
<li>Family-Forming support provided by Carrot</li>
<li>Paid Parental Leave</li>
<li>Flexible, full-service childcare support with Kinside</li>
<li>401(k) with a generous employer match</li>
<li>Flexible PTO</li>
<li>Catered lunch each day in our office and data center locations</li>
<li>A casual work environment</li>
<li>A work culture focused on innovative disruption</li>
</ul>
<p>Our Workplace</p>
<p>While we prioritize a hybrid work environment, remote work may be considered for candidates located more than 30 miles from an office, based on role requirements for specialized skill sets. New hires will be invited to attend onboarding at one of our hubs within their first month. Teams also gather quarterly to support collaboration.</p>
<p>California Consumer Privacy Act - California applicants only</p>
<p>CoreWeave is an equal opportunity employer, committed to fostering an inclusive and supportive workplace. All qualified applicants and candidates will receive consideration for employment without regard to race, color, religion, sex, disability, age, sexual orientation, gender identity, national origin, veteran status, or genetic information. As part of this commitment and consistent with the Americans with Disabilities Act (ADA), CoreWeave will ensure that qualified applicants and candidates with disabilities are provided reasonable accommodations for the hiring process, unless such accommodation would cause an undue hardship. If reasonable accommodation is needed, please contact: careers@coreweave.com.</p>
<p>Export Control Compliance</p>
<p>This position requires access to export controlled information. To conform to U.S. Government export regulations applicable to that information, applicant must either be (A) a U.S. person, defined as a (i) U.S. citizen or national, (ii) U.S. lawful permanent resident (green card holder), (iii) refugee under 8 U.S.C. § 1157, or (iv) asylee under 8 U.S.C. § 1158, (B) eligible to access the export controlled information without a required export authorization, or (C) eligible and reasonably likely to obtain the required export authorization from the applicable U.S. government agency. CoreWeave may, for legitimate business reasons, decline to pursue any export licensing process.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$165,000 to $242,000</Salaryrange>
      <Skills>Cilium CNI, cloud infrastructure, large-scale distributed systems, container networking, connectivity, network services, Kubernetes, eBPF datapath, policy model, load-balancing mechanisms, cloud networking concepts, VPCs, subnets, routing, security groups/ACLs, NAT, ingress/egress architectures, TCP/IP, dynamic routing, ECMP, MTU/fragmentation, overlay/underlay networking, Golang, C/C++, CI/CD environments, Kubernetes at scale, cluster lifecycle management, debugging networking issues, cloud-scale network services, Cilium, eBPF development, performance tuning, Kubernetes operators, controllers, service meshes, multi-cluster networking, cluster mesh solutions, GPU-heavy, HPC, performance-sensitive environments</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>CoreWeave</Employername>
      <Employerlogo>https://logos.yubhub.co/coreweave.com.png</Employerlogo>
      <Employerdescription>CoreWeave is a cloud computing company that provides a platform for building and scaling AI applications.</Employerdescription>
      <Employerwebsite>https://www.coreweave.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/coreweave/jobs/4653971006</Applyto>
      <Location>Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>a560bd4c-a1a</externalid>
      <Title>Cloud Security Engineer</Title>
      <Description><![CDATA[<p>We&#39;re looking for a Cloud Security Engineer to join our team. As a Cloud Security Engineer at Starling, you&#39;ll be building and supporting tooling and infrastructure that spans across AWS and GCP supporting our internal operations and interfacing with other teams to deliver the services that support our business.</p>
<p>Key Responsibilities:</p>
<ul>
<li>Engineer Secure Foundations: You will lead the design and implementation of critical security services, with a heavy focus on building robust Identity and Access Management (IAM) systems and automated, API-driven certificate management workflows.</li>
<li>Security-as-Code &amp; Scalability: Leveraging a software-first philosophy, you will develop and maintain high-quality, scalable security tooling and middleware within ECS and Kubernetes environments, ensuring security logic is integrated directly into the deployment pipeline.</li>
<li>Collaborative Code Ownership: You will serve as a technical authority in cross-functional code reviews, acting as an engineering peer who helps teams bake security into their services from the first line of code to the final pull request.</li>
<li>Proactive System Hardening: You will stay ahead of the evolving threat landscape by treating security as a continuous engineering challenge, proactively identifying vulnerabilities and architecting technical solutions to fortify our global ecosystem.</li>
</ul>
<p>Professional Requirements:</p>
<ul>
<li>Demonstrated ability to architect secure, distributed systems with a focus on programmatic IAM and automated, API-driven PKI management.</li>
<li>Extensive experience with Infrastructure as Code (IaC) in Terraform and a deep commitment to writing clean, maintainable, and production-grade code, ideally in Golang.</li>
<li>A test-first mentality toward security, with experience building unit and integration tests into CI/CD pipelines to ensure that security guardrails are as reliable as the features they protect.</li>
<li>A strong conceptual grasp of cryptographic primitives and hands-on experience securing containerized workloads and service meshes within ECS and Kubernetes.</li>
<li>A track record of taking end-to-end ownership of complex technical projects, from initial design docs and RFCs through to deployment and observability.</li>
<li>A belief that if it isn&#39;t tested, it&#39;s broken, and a drive to proactively identify and fix vulnerabilities by treating security as a continuous engineering challenge.</li>
</ul>
<p>Our Team Philosophy</p>
<p>The Security Engineering team is a diverse and dynamic group passionate about building secure and resilient systems. We&#39;re enthusiastic about security, but we&#39;re not about rigid, one-size-fits-all controls. We believe in striking a balance between protecting our systems and empowering our developers to build and innovate.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>Cloud Security, AWS, GCP, Identity and Access Management, API-driven Certificate Management, Infrastructure as Code, Terraform, Golang, Cryptographic Primitives, Containerized Workloads, Service Meshes</Skills>
      <Category>Engineering</Category>
      <Industry>Finance</Industry>
      <Employername>Starling</Employername>
      <Employerlogo>https://logos.yubhub.co/starlingbank.com.png</Employerlogo>
      <Employerdescription>Starling is a fully licensed UK bank with over 3,000 employees across four offices.</Employerdescription>
      <Employerwebsite>https://www.starlingbank.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://apply.workable.com/j/3B7E26FC24</Applyto>
      <Location>London</Location>
      <Country></Country>
      <Postedate>2026-03-20</Postedate>
    </job>
    <job>
      <externalid>cbb7e2e4-4bc</externalid>
      <Title>Security Engineer, Infrastructure Security</Title>
      <Description><![CDATA[<p><strong>Security Engineer, Infrastructure Security</strong></p>
<p><strong>Location</strong></p>
<p>Remote - US; New York City; San Francisco; Seattle</p>
<p><strong>Employment Type</strong></p>
<p>Full time</p>
<p><strong>Location Type</strong></p>
<p>Remote</p>
<p><strong>Department</strong></p>
<p>Security</p>
<p><strong>Compensation</strong></p>
<ul>
<li>SF, Seattle or NYC $230K – $385K • Offers Equity</li>
<li>Zone A $207K – $346.5K • Offers Equity</li>
<li>Zone B $184K – $308K • Offers Equity</li>
</ul>
<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>
<p><strong>Benefits</strong></p>
<ul>
<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>
<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>
<li>401(k) retirement plan with employer match</li>
<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>
<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>
<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>
<li>Mental health and wellness support</li>
<li>Employer-paid basic life and disability coverage</li>
<li>Annual learning and development stipend to fuel your professional growth</li>
<li>Daily meals in our offices, and meal delivery credits as eligible</li>
<li>Relocation support for eligible employees</li>
<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>
</ul>
<p><strong>About the Team</strong></p>
<p>Security is at the foundation of OpenAI’s mission to ensure that artificial general intelligence benefits all of humanity.</p>
<p>The Security team protects OpenAI’s technology, people, and products. We are technical in what we build but operational in how we work, and we are committed to supporting all products and research at OpenAI. Our tenets include prioritizing for impact, enabling researchers, preparing for future transformative technologies, and fostering a robust security culture.</p>
<p><strong>About the Role</strong></p>
<p>OpenAI is seeking a Security Engineer to join our Infrastructure Security (InfraSec) team. InfraSec protects the foundations of OpenAI’s research and production environments, spanning GPU supercomputing clusters, multi-cloud infrastructure, datacenters, networking, storage, and the critical services that power our frontier AI models. Our charter includes securing everything from bare-metal hardware and firmware, to Kubernetes clusters and service meshes, to data storage and access pathways for highly sensitive model weights and user data.</p>
<p><strong>In this role, you will:</strong></p>
<ul>
<li>Design and build security controls across diverse layers (e.g., physical hardware, firmware/BMC, OS, Kubernetes, networks, and CI/CD) to defend against sophisticated adversaries and insider threats.</li>
<li>Collaborate with engineering and security teams to drive deployment of security enhancements and control changes across broad-scale infrastructure.</li>
<li>Tackle high-impact projects such as checkpoint encryption, network isolation, secret management, and machine identity, while continuously raising the security bar for emerging AI workloads.</li>
<li>Take a generalist approach to building security controls, balancing a mix of security expertise and broad technical skillsets to adapt to evolving challenges.</li>
</ul>
<p><strong>You will thrive in this role if you have:</strong></p>
<ul>
<li>Deep understanding of security principles, best practices, and common vulnerabilities.</li>
<li>A proactive mindset, with the ability to identify and address security gaps or inefficiencies through automation and tooling.</li>
<li>A track record of delivering scalable solutions and driving impactful changes across infrastructure in real-world projects.</li>
<li>Expertise in the security of cloud platforms (e.g., Amazon AWS, Microsoft Azure), especially securing multi-cloud networks and infrastructure, and designing cloud agnostic systems.</li>
<li>Experience securing on-prem deployments and datacenters from construction to multi-tenant use.</li>
<li>Familiarity with container security, orchestration security, and authentication/authorization.</li>
<li>Strong analytical and problem-solving skills, with an ability to think critically and objectively assess security risks.</li>
<li>Excellent communication skills, with the ability to convey complex security concepts to technical and non-technical stakeholders.</li>
<li>Excitement about collaborating with cross-functional teams to build secure, reliable systems that scale globally.</li>
</ul>
<p><strong>About OpenAI</strong></p>
<p>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>remote</Workarrangement>
      <Salaryrange>$230K – $385K</Salaryrange>
      <Skills>security principles, common vulnerabilities, cloud platforms, Amazon AWS, Microsoft Azure, multi-cloud networks, cloud-agnostic systems, container security, orchestration security, authentication/authorization, Kubernetes, service meshes, data storage, firmware, BMC, OS hardening, networking, CI/CD, on-prem deployments, datacenters, multi-tenant environments, automation, security tooling</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>OpenAI</Employername>
      <Employerlogo>https://logos.yubhub.co/openai.com.png</Employerlogo>
      <Employerdescription>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products.</Employerdescription>
      <Employerwebsite>https://openai.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.ashbyhq.com/openai/f51f750f-a737-4441-8f96-30133a2a8049</Applyto>
      <Location>Remote - US; New York City; San Francisco; Seattle</Location>
      <Country></Country>
      <Postedate>2026-03-06</Postedate>
    </job>
    <job>
      <externalid>14dd5de2-4dc</externalid>
      <Title>Software Engineer, Infrastructure Security</Title>
      <Description><![CDATA[<p><strong>Software Engineer, Infrastructure Security</strong></p>
<p><strong>Location</strong></p>
<p>Remote - US; New York City; San Francisco; Seattle</p>
<p><strong>Employment Type</strong></p>
<p>Full time</p>
<p><strong>Location Type</strong></p>
<p>Remote</p>
<p><strong>Department</strong></p>
<p>Security</p>
<p><strong>Compensation</strong></p>
<ul>
<li>SF, Seattle or NYC $230K – $385K • Offers Equity</li>
<li>Zone A $207K – $346.5K • Offers Equity</li>
<li>Zone B $184K – $308K • Offers Equity</li>
</ul>
<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>
<p><strong>Benefits</strong></p>
<ul>
<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>
<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>
<li>401(k) retirement plan with employer match</li>
<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>
<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>
<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>
<li>Mental health and wellness support</li>
<li>Employer-paid basic life and disability coverage</li>
<li>Annual learning and development stipend to fuel your professional growth</li>
<li>Daily meals in our offices, and meal delivery credits as eligible</li>
<li>Relocation support for eligible employees</li>
<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>
</ul>
<p><strong>About the Team</strong></p>
<p>Security is at the foundation of OpenAI’s mission to ensure that artificial general intelligence benefits all of humanity.</p>
<p>The Security team protects OpenAI’s technology, people, and products. We are technical in what we build but operational in how we execute, and we support every product and research effort at OpenAI. Our tenets include prioritizing for impact, enabling researchers and developers, preparing for future transformative technologies, and fostering a strong, collaborative security culture.</p>
<p><strong>About the Role</strong></p>
<p>OpenAI is seeking a Security Software Engineer to join the Infrastructure Security (InfraSec) team.</p>
<p>InfraSec safeguards the core of OpenAI’s research and production environments—GPU supercomputing clusters, multi-cloud infrastructure, datacenters, networking, storage, and the critical services that power our frontier AI models. Our charter spans everything from bare-metal hardware and firmware to Kubernetes clusters, service meshes, and the data pathways that carry highly sensitive model weights and user data.</p>
<p>As a Security Software Engineer, you will design and build critical foundational services, such as authentication systems, egress/ingress proxies, access brokers, and key management platforms, that demand high standards of reliability, scalability, and software craftsmanship. These systems form the security backbone of OpenAI’s supercomputing environment and must remain robust under intense scale and adversarial pressure.</p>
<p><strong>In this role, you will:</strong></p>
<ul>
<li>Architect and implement production-grade security services (e.g., auth services, access brokers, secure proxies, key-management infrastructure) that provide strong guarantees across hardware, operating systems, Kubernetes, networks, and CI/CD.</li>
<li>Partner with infrastructure and research engineers to embed security into high-performance compute clusters, enabling rapid model training and deployment without compromising protection.</li>
<li>Develop automation and detection tooling to continuously identify and mitigate risks in large-scale cloud and on-prem environments.</li>
<li>Drive high-impact initiatives such as line-speed encryption, machine identity, and network isolation, continuously raising the security bar for emerging AI workloads.</li>
<li>Lead or participate in design reviews and threat models to ensure new systems launch with strong security foundations and operational excellence.</li>
</ul>
<p><strong>You will thrive in this role if you have:</strong></p>
<ul>
<li>Strong software engineering skills in languages such as Python, Go, Rust, or C/C++, with a track record of shipping and operating high-reliability distributed services.</li>
<li>Experience building or operating critical security infrastructure (e.g., auth services, service-to-service proxies, certificate or key-management systems).</li>
<li>Deep understanding of security principles, best practices, and common vulnerabilities.</li>
<li>Expertise in securing large-scale cloud platforms (e.g., Azure, AWS, GCP), including multi-cloud networks and cloud-agnostic system design.</li>
<li>Familiarity with container and orchestration security (Kubernetes, service meshes) and modern authentication/authorization standards (OIDC, mTLS, SPIFFE/SPIRE).</li>
<li>A proactive mindset, with the ability to identify and address security gaps or inefficiencies through automation and tooling.</li>
<li>A track record of delivering scalable solutions and driving impactful changes across infrastructure in real-world projects.</li>
<li>Strong analytical and problem-solving skills, with an ability to think critically and objectively assess security risks.</li>
<li>Excellent communication skills, with the ability to convey complex security concepts to technical and non-technical stakeholders.</li>
<li>Excitement about collaborating with cross-functional teams to build secure, reliable systems that scale globally.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>remote</Workarrangement>
      <Salaryrange>$230K – $385K</Salaryrange>
      <Skills>Python, Go, Rust, C/C++, Kubernetes, Service meshes, OIDC, mTLS, SPIFFE/SPIRE, Cloud security, Container security, Orchestration security, Authentication, Authorization, Security principles, Best practices, Common vulnerabilities, Cloud platforms, Multi-cloud networks, Cloud-agnostic system design, Automation, Detection tooling, Line-speed encryption, Machine identity, Network isolation</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>OpenAI</Employername>
      <Employerlogo>https://logos.yubhub.co/openai.com.png</Employerlogo>
      <Employerdescription>OpenAI is a technology company that focuses on developing and applying artificial general intelligence. It was founded in 2015 and has since grown to become one of the leading AI research and development companies in the world.</Employerdescription>
      <Employerwebsite>https://openai.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.ashbyhq.com/openai/98ad9beb-4f91-496c-bd16-ac0b2a8d5bb2</Applyto>
      <Location>Remote - US; New York City; San Francisco; Seattle</Location>
      <Country></Country>
      <Postedate>2026-03-06</Postedate>
    </job>
    <job>
      <externalid>32d33889-c44</externalid>
      <Title>Software Engineer, Caching Infrastructure</Title>
      <Description><![CDATA[<p><strong>Software Engineer, Caching Infrastructure</strong></p>
<p><strong>Location</strong></p>
<p>San Francisco</p>
<p><strong>Employment Type</strong></p>
<p>Full time</p>
<p><strong>Department</strong></p>
<p>Applied AI</p>
<p><strong>Compensation</strong></p>
<ul>
<li>$230K – $385K • Offers Equity</li>
</ul>
<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>
<ul>
<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>
<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>
<li>401(k) retirement plan with employer match</li>
<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>
<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>
<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>
<li>Mental health and wellness support</li>
<li>Employer-paid basic life and disability coverage</li>
<li>Annual learning and development stipend to fuel your professional growth</li>
<li>Daily meals in our offices, and meal delivery credits as eligible</li>
<li>Relocation support for eligible employees</li>
<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>
</ul>
<p>More details about our benefits are available to candidates during the hiring process.</p>
<p>This role is at-will and OpenAI reserves the right to modify base pay and other compensation components at any time based on individual performance, team or company results, or market conditions.</p>
<p><strong>About the Team</strong></p>
<p>At OpenAI, we’re building safe and beneficial artificial general intelligence. We deploy our models through ChatGPT, our APIs, and other cutting-edge products. Behind the scenes, making these systems fast, reliable, and cost-efficient requires world-class infrastructure.</p>
<p>The Caching Infrastructure team is responsible for building a caching layer that powers many critical use cases at OpenAI. We aim to provide a high-availability, multi-tenant cache platform that scales automatically with workload, minimizes tail latency, and supports a diverse range of use cases.</p>
<p>We’re looking for an experienced engineer to help design and scale this critical infrastructure. The ideal candidate has deep experience in distributed caching systems (e.g., Redis, Memcached), networking fundamentals, and Kubernetes-based service orchestration.</p>
<p><strong>In This Role, You Will:</strong></p>
<ul>
<li>Design, build, and operate OpenAI’s multi-tenant caching platform used across inference, identity, quota, and product experiences.</li>
<li>Define the long-term vision and roadmap for caching as a core infra capability, balancing performance, durability, and cost.</li>
<li>Collaborate with other infra teams (e.g., networking, observability, databases) and product teams to ensure our caching platform meets their needs.</li>
</ul>
<p><strong>You Might Thrive In This Role If You:</strong></p>
<ul>
<li>Have 5+ years of experience building and scaling distributed systems, with a strong focus on caching, load balancing, or storage systems.</li>
<li>Have deep expertise with Redis, Memcached, or similar solutions, including clustering, durability configurations, client-side connection patterns, and performance tuning.</li>
<li>Have production experience with Kubernetes, service meshes (e.g., Envoy), and autoscaling systems.</li>
<li>Think rigorously about latency, reliability, throughput, and cost in designing platform capabilities.</li>
<li>Thrive in a fast-paced environment and enjoy balancing pragmatic engineering with long-term technical excellence.</li>
</ul>
<p><strong>About OpenAI</strong></p>
<p>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>$230K – $385K</Salaryrange>
      <Skills>distributed caching systems, Redis, Memcached, Kubernetes, service meshes, autoscaling systems, clustering, durability configurations, client-side connection patterns, performance tuning</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>OpenAI</Employername>
      <Employerlogo>https://logos.yubhub.co/openai.com.png</Employerlogo>
      <Employerdescription>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products.</Employerdescription>
      <Employerwebsite>https://openai.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.ashbyhq.com/openai/a20b7fc6-6f01-4618-ba35-37b40083f93e</Applyto>
      <Location>San Francisco</Location>
      <Country></Country>
      <Postedate>2026-03-06</Postedate>
    </job>
  </jobs>
</source>