<?xml version="1.0" encoding="UTF-8"?>
<source>
  <jobs>
    <job>
      <externalid>198d64d4-207</externalid>
      <Title>Senior/Staff Site Reliability Engineer</Title>
<Description><![CDATA[<p>You are a seasoned SRE who keeps production infrastructure running at scale. You own the reliability and availability of customer-facing systems, from Kubernetes clusters to deployment pipelines to the networking layer that connects it all. You think in SLOs, automate ruthlessly, and treat every incident as a chance to make the system better.</p>
<p><strong>Key Responsibilities</strong></p>
<ul>
<li>Own and operate our Kubernetes infrastructure: cluster lifecycle, upgrades, networking, and multi-tenant isolation for customer workloads</li>
<li>Build and maintain CI/CD pipelines and deployment infrastructure</li>
<li>Leverage AI extensively to automate the analysis and resolution of production issues and to improve software development speed, reliability, and maintainability</li>
<li>Build dashboards, alerting, and anomaly detection across our systems</li>
<li>Define and enforce SLOs and build out incident response processes</li>
<li>Manage and improve our networking, load balancing, and service mesh configurations</li>
<li>Drive reliability improvements across the stack through automation, runbooks, and chaos engineering</li>
</ul>
<p><strong>Requirements</strong></p>
<ul>
<li>5+ years of experience managing critical production systems and software development workflows</li>
<li>Strong production experience setting up and operating Kubernetes at scale, using infrastructure-as-code (Terraform, Ansible)</li>
<li>Deep knowledge of Linux networking, container networking (CNI plugins, VXLAN, BGP), and DNS</li>
<li>Experience building CI/CD systems and GitOps workflows (FluxCD, ArgoCD)</li>
<li>Proficiency in Python and either Go or Bash for tooling and automation</li>
<li>Strong experience with logging, monitoring, and alerting (Prometheus, Grafana, Loki, Thanos, VictoriaMetrics, Datadog)</li>
<li>Excellent communication skills and the ability to drive technical decisions across teams</li>
<li>Self-starter who executes quickly, takes ownership, and constantly seeks improvement</li>
</ul>
<p><strong>Nice to have</strong></p>
<ul>
<li>Experience managing GPU and AI/ML workloads</li>
<li>Experience with kernel-based monitoring and routing (eBPF, XDP)</li>
<li>Experience with security tooling (Falco, Coroot, SIEM)</li>
<li>Experience with bare-metal Kubernetes networking (Calico, Cilium, MetalLB)</li>
<li>Experience with distributed storage systems (Ceph, Longhorn, etc.)</li>
</ul>
<p><strong>Compensation</strong></p>
<ul>
<li>$180,000-$250,000 plus equity and benefits</li>
</ul>
<p><strong>Benefits</strong></p>
<ul>
<li>Interesting and challenging work</li>
<li>A lot of learning and growth opportunities</li>
<li>Regular team events and offsites</li>
<li>Health, dental, and vision insurance (US)</li>
<li>Visa sponsorship and relocation assistance</li>
</ul>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>$180,000-250,000</Salaryrange>
      <Skills>Kubernetes, Infrastructure-as-code, Linux networking, Container networking, CI/CD systems, GitOps workflows, Python, Go, Bash, Logging, Monitoring, Alerting, GPU and AI/ML workloads, Kernel-based monitoring and routing, Security tooling, Bare metal Kubernetes networking, Distributed storage systems</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Fal</Employername>
      <Employerlogo>https://logos.yubhub.co/fal.com.png</Employerlogo>
      <Employerdescription>Fal is a technology company that operates in the San Francisco area.</Employerdescription>
      <Employerwebsite>https://fal.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/fal/jobs/4146019009</Applyto>
      <Location>San Francisco</Location>
      <Country></Country>
      <Postedate>2026-04-24</Postedate>
    </job>
    <job>
      <externalid>38c10a5f-35e</externalid>
      <Title>CPU/Storage/PoP-WAN Program Manager</Title>
      <Description><![CDATA[<p>We are seeking a highly technical Program Manager to lead execution across CPU, Storage, PoP, and WAN infrastructure programs that directly unlock OpenAI&#39;s next generation compute capacity.</p>
<p>In this role, you will own complex cross-functional programs spanning compute cluster activation, storage deployment, PoP bring-up, and backbone expansion. You will coordinate hardware readiness, site readiness, network pathing, storage availability, vendor execution, and engineering dependencies required to turn contracted infrastructure into live training and inference capacity.</p>
<p>This role requires strong technical fluency across hardware systems, network infrastructure, storage architecture, and deployment execution. You should be comfortable operating from rack-level implementation details through executive-level capacity planning discussions.</p>
<p>Key Responsibilities:</p>
<ul>
<li>Lead end-to-end execution of CPU / GPU cluster activation programs across OpenAI&#39;s global infrastructure footprint</li>
<li>Drive readiness to convert contracted compute capacity into schedulable production clusters</li>
<li>Own deployment programs for new PoPs, backbone nodes, WAN expansion, and interconnection initiatives</li>
<li>Build integrated schedules spanning procurement, logistics, installation, storage readiness, network turn-up, testing, and production handoff</li>
<li>Coordinate BOM readiness, server delivery, racks, optics, cabling, storage hardware, and vendor milestones</li>
<li>Partner with engineering teams to align compute, storage, and networking dependencies before cluster activation</li>
<li>Manage deployment of storage systems supporting training and inference workloads, including readiness, validation, performance checks, and scaling plans</li>
<li>Coordinate backbone capacity expansion, cross-connects, inter-region pathing, and cloud interconnect readiness with Azure and third-party providers</li>
<li>Lead physical deployment execution including rack-and-stack, hardware bring-up, L1 validation, and site acceptance criteria</li>
<li>Build repeatable deployment playbooks, dashboards, governance cadences, and operating mechanisms for scale</li>
<li>Identify risks early across supply chain, site readiness, technical constraints, and vendor execution, then drive mitigation plans</li>
<li>Communicate milestones, escalations, and capacity forecasts to senior leadership</li>
</ul>
<p>Qualifications:</p>
<ul>
<li>8+ years of experience in technical program management, infrastructure deployment, network deployment, or data center operations</li>
<li>Strong experience delivering programs involving compute, storage, networking, or large-scale infrastructure systems</li>
<li>Working knowledge of servers, clusters, storage arrays, routers, switches, optics, and structured cabling</li>
<li>Experience owning cross-functional programs across engineering, operations, supply chain, and external vendors</li>
<li>Strong understanding of deployment lifecycles from planning and procurement through production handoff</li>
<li>Ability to reason across physical infrastructure execution and logical systems architecture dependencies</li>
<li>Proven ability to build integrated schedules and drive accountability across multiple stakeholders</li>
<li>Strong executive communication skills with experience managing critical escalations and leadership updates</li>
<li>Comfortable operating in fast-moving environments with aggressive timelines and evolving priorities</li>
<li>Highly analytical with strong problem-solving and execution instincts</li>
</ul>
<p>Preferred Skills:</p>
<ul>
<li>Experience at a hyperscaler, cloud provider, AI infrastructure company, or global network operator</li>
<li>Experience deploying GPU clusters, HPC systems, or large training environments</li>
<li>Familiarity with distributed storage systems and high-performance data infrastructure</li>
<li>Experience with PoP deployments, WAN backbone expansion, or global network buildouts</li>
<li>Experience working across first-party, colo, and cloud environments</li>
<li>Experience building repeatable infrastructure deployment systems in high-growth environments</li>
</ul>
<p>About OpenAI:</p>
<p>OpenAI is an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$342K – $555K</Salaryrange>
      <Skills>Technical program management, infrastructure deployment, network deployment, data center operations, compute, storage, networking, large-scale infrastructure systems, servers, clusters, storage arrays, routers, switches, optics, structured cabling, cross-functional program management, supply chain coordination, vendor management, deployment lifecycles, integrated scheduling, executive communication, GPU clusters, HPC systems, distributed storage systems, PoP deployments, WAN backbone expansion, global network buildouts</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>OpenAI</Employername>
      <Employerlogo>https://logos.yubhub.co/openai.com.png</Employerlogo>
      <Employerdescription>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity.</Employerdescription>
      <Employerwebsite>https://openai.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.ashbyhq.com/openai/667c09e2-6efc-45dc-9714-078bedf17343</Applyto>
      <Location>San Francisco; Seattle</Location>
      <Country></Country>
      <Postedate>2026-04-24</Postedate>
    </job>
    <job>
      <externalid>acd7d096-766</externalid>
      <Title>Staff Backend Engineer, Non Human Identities</Title>
      <Description><![CDATA[<p>Secure Every Identity, from AI to Human. Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organisations to safely embrace this new era.</p>
<p>This work requires a relentless drive to solve complex challenges with real-world stakes. We are looking for builders and owners who operate with speed and urgency and execute with excellence. This is an opportunity to do career-defining work. We&#39;re all in on this mission. If you are too, let&#39;s talk.</p>
<p><strong>The PAM Team</strong></p>
<p>Ever wonder how large organisations make sure the right people can access their most critical systems? That&#39;s the problem the Okta Privileged Access Management (PAM) team solves. Our solution controls who can reach sensitive servers, databases and cloud resources and grants access only when it&#39;s needed. It is the security layer between people (and non-human-identities) and the systems they need to do their jobs.</p>
<p><strong>The Staff Backend Engineer Opportunity</strong></p>
<p>We are seeking a world-class Staff Engineer to help us architect and build the high-performance core of our non-human identity platform. Your work, in close collaboration with our principal engineers and architects, will be the foundation of our strategy for managing privileged access in the modern enterprise. If you are a systems programmer who thrives on influencing the design of high-performance, concurrent, and resilient security software, this is the role for you.</p>
<p><strong>What you’ll be doing</strong></p>
<ul>
<li>Contribute to Core Architecture: Partner with principal engineers and architects to design and implement a low-latency, high-throughput secrets engine for non-human identities</li>
<li>Solve for Massive Scale: Write highly concurrent, performance-critical code capable of handling millions of machine-to-machine authentication and authorization requests</li>
<li>Shape Technical Strategy: Play a key role in defining the long-term technical roadmap for scalability and performance, ensuring our platform can meet the demands of the largest enterprises</li>
<li>Mentor and Elevate: As a senior engineer on the team, work with junior engineers to help them advance their SDLC expertise</li>
<li>On-Call: Participate in the rotational on-call activities with the SRE and product development teams</li>
</ul>
<p><strong>What you’ll bring to the role</strong></p>
<p><strong>Required Experience:</strong></p>
<ul>
<li>8+ years of professional software engineering experience, with a heavy focus on backend or systems-level development</li>
<li>Bachelor’s or Master’s degree in Computer Science, or equivalent practical experience</li>
</ul>
<p><strong>Core Technical Expertise:</strong></p>
<ul>
<li>Deep, hands-on expertise in multi-platform Go development and building high-performance, concurrent applications</li>
<li>Experience designing or operating distributed systems</li>
<li>Experience with secure systems (authn/authz, encryption, TLS, token handling, PKI, CAs, diagnosing TLS issues)</li>
<li>Deep expertise in distributed storage systems, with a focus on replication, backup and restore, and data management (Postgres, etc.)</li>
<li>Direct experience designing, building, or contributing to a secrets management, service mesh, or machine identity platform</li>
<li>Expert-level ergonomic API design (gRPC/OpenAPI) and building for reliability at scale</li>
<li>Deep knowledge of cloud-native infrastructure</li>
</ul>
<p><strong>And extra credit if you have experience in any of the following!</strong></p>
<ul>
<li>Experience at a leading cybersecurity or infrastructure-as-code company</li>
<li>Contributions to open-source projects in the identity, security, or infrastructure space</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$160,000-$220,000 CAD</Salaryrange>
      <Skills>Go, Distributed systems, Secure systems, Distributed storage systems, Secrets management, Service mesh, Machine identity platform, Ergonomic API design, Cloud-native infrastructure</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Okta</Employername>
      <Employerlogo>https://logos.yubhub.co/okta.com.png</Employerlogo>
      <Employerdescription>Okta secures AI by building the trusted, neutral infrastructure that enables organisations to safely embrace this new era.</Employerdescription>
      <Employerwebsite>https://www.okta.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/okta/jobs/7819476</Applyto>
      <Location>Toronto, Ontario, Canada</Location>
      <Country></Country>
      <Postedate>2026-04-24</Postedate>
    </job>
    <job>
      <externalid>21104c69-8cb</externalid>
      <Title>Staff Backend Engineer, Non Human Identities</Title>
      <Description><![CDATA[<p>Secure Every Identity, from AI to Human. Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organisations to safely embrace this new era.</p>
<p>We are looking for builders and owners who operate with speed and urgency and execute with excellence. This is an opportunity to do career-defining work. We&#39;re all in on this mission. If you are too, let&#39;s talk.</p>
<p><strong>The PAM Team</strong></p>
<p>Ever wonder how large organisations make sure the right people can access their most critical systems? That&#39;s the problem the Okta Privileged Access Management (PAM) team solves. Our solution controls who can reach sensitive servers, databases and cloud resources and grants access only when it&#39;s needed. It is the security layer between people (and non-human-identities) and the systems they need to do their jobs.</p>
<p><strong>The Staff Backend Engineer Opportunity</strong></p>
<p>We are seeking a world-class Staff Engineer to help us architect and build the high-performance core of our non-human identity platform. Your work, in close collaboration with our principal engineers and architects, will be the foundation of our strategy for managing privileged access in the modern enterprise. If you are a systems programmer who thrives on influencing the design of high-performance, concurrent, and resilient security software, this is the role for you.</p>
<p><strong>What you’ll be doing</strong></p>
<ul>
<li>Contribute to Core Architecture: Partner with principal engineers and architects to design and implement a low-latency, high-throughput secrets engine for non-human identities</li>
<li>Solve for Massive Scale: Write highly concurrent, performance-critical code capable of handling millions of machine-to-machine authentication and authorization requests</li>
<li>Shape Technical Strategy: Play a key role in defining the long-term technical roadmap for scalability and performance, ensuring our platform can meet the demands of the largest enterprises</li>
<li>Mentor and Elevate: As a senior engineer on the team, work with junior engineers to help them advance their SDLC expertise</li>
<li>On-Call: Participate in the rotational on-call activities with the SRE and product development teams</li>
</ul>
<p><strong>What you’ll bring to the role</strong></p>
<p><strong>Required Experience:</strong></p>
<ul>
<li>8+ years of professional software engineering experience, with a heavy focus on backend or systems-level development</li>
<li>Bachelor’s or Master’s degree in Computer Science, or equivalent practical experience</li>
</ul>
<p><strong>Core Technical Expertise:</strong></p>
<ul>
<li>Deep, hands-on expertise in multi-platform Go development and building high-performance, concurrent applications</li>
<li>Experience designing or operating distributed systems</li>
<li>Experience with secure systems (authn/authz, encryption, TLS, token handling, PKI, CAs, diagnosing TLS issues)</li>
<li>Deep expertise in distributed storage systems, with a focus on replication, backup and restore, and data management (Postgres, etc.)</li>
<li>Direct experience designing, building, or contributing to a secrets management, service mesh, or machine identity platform</li>
<li>Expert-level ergonomic API design (gRPC/OpenAPI) and building for reliability at scale</li>
<li>Deep knowledge of cloud-native infrastructure</li>
</ul>
<p><strong>And extra credit if you have experience in any of the following!</strong></p>
<ul>
<li>Experience at a leading cybersecurity or infrastructure-as-code company</li>
<li>Contributions to open-source projects in the identity, security, or infrastructure space</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$194,000-$267,300 USD</Salaryrange>
      <Skills>Go, Distributed systems, Secure systems, Distributed storage systems, Secrets management, Service mesh, Machine identity platform, Ergonomic API design, Cloud-native infrastructure</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Okta</Employername>
      <Employerlogo>https://logos.yubhub.co/okta.com.png</Employerlogo>
      <Employerdescription>Okta secures AI by building the trusted, neutral infrastructure that enables organisations to safely embrace this new era.</Employerdescription>
      <Employerwebsite>https://www.okta.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/okta/jobs/7842962</Applyto>
      <Location>San Francisco, California</Location>
      <Country></Country>
      <Postedate>2026-04-24</Postedate>
    </job>
    <job>
      <externalid>3a40dbfa-d00</externalid>
      <Title>Staff Software Engineer, Non-Human Identity</Title>
      <Description><![CDATA[<p>Secure Every Identity, from AI to Human. Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organisations to safely embrace this new era.</p>
<p>We are looking for builders and owners who operate with speed and urgency and execute with excellence. This is an opportunity to do career-defining work. We&#39;re all in on this mission. If you are too, let&#39;s talk.</p>
<p><strong>The Team</strong></p>
<p>The Okta Privileged Access Management (PAM) team is building the future of identity for machines, services, and applications. We are seeking a world-class Staff Engineer to help us architect and build the high-performance core of our non-human identity platform.</p>
<p>Your work, in close collaboration with our principal engineers and architects, will be the foundation of our strategy for managing privileged access in the modern enterprise. If you are a systems programmer who thrives on influencing the design of high-performance, concurrent, and resilient security software, this is the role for you.</p>
<p><strong>What you’ll be doing</strong></p>
<ul>
<li>Contribute to Core Architecture: Partner with principal engineers and architects to design and implement a low-latency, high-throughput secrets engine for non-human identities</li>
<li>Solve for Massive Scale: Write highly concurrent, performance-critical code capable of handling millions of machine-to-machine authentication and authorization requests</li>
<li>Shape Technical Strategy: Play a key role in defining the long-term technical roadmap for scalability and performance, ensuring our platform can meet the demands of the largest enterprises</li>
<li>Mentor and Elevate: As a senior engineer on the team, work with junior engineers to help them advance their SDLC expertise</li>
<li>On-Call: Participate in the rotational on-call activities with the SRE and product development teams</li>
</ul>
<p><strong>What you’ll bring to the role</strong></p>
<p><strong>Required Experience:</strong></p>
<ul>
<li>8+ years of professional software engineering experience, with a heavy focus on backend or systems-level development</li>
<li>Bachelor’s or Master’s degree in Computer Science, or equivalent practical experience</li>
</ul>
<p><strong>Core Technical Expertise:</strong></p>
<ul>
<li>Deep, hands-on expertise in multi-platform Go development and building high-performance, concurrent applications</li>
<li>Experience designing or operating distributed systems</li>
<li>Experience with secure systems (authn/authz, encryption, TLS, token handling, PKI, CAs, diagnosing TLS issues)</li>
<li>Deep expertise in distributed storage systems, with a focus on replication, backup and restore, and data management (Postgres, etc.)</li>
<li>Direct experience designing, building, or contributing to a secrets management, service mesh, or machine identity platform</li>
<li>Expert-level ergonomic API design (gRPC/OpenAPI) and building for reliability at scale</li>
<li>Deep knowledge of cloud-native infrastructure</li>
</ul>
<p><strong>Key Attributes:</strong></p>
<ul>
<li>You are driven by the challenge of optimizing systems for performance, latency, and throughput, with a proven ability to diagnose complex, multi-system issues</li>
<li>You have a proven track record of making significant contributions to the architecture of complex, mission-critical systems</li>
<li>You thrive in an environment where you can focus on deep technical problems</li>
</ul>
<p><strong>Bonus Points:</strong></p>
<ul>
<li>Experience at a leading cybersecurity or infrastructure-as-code company</li>
<li>Contributions to open-source projects in the identity, security, or infrastructure space</li>
</ul>
<p><strong>And extra credit if you have experience in any of the following!</strong></p>
<ul>
<li>Deep expertise in backend systems engineering</li>
<li>Experience building and scaling beyond standard three-tier monolithic architectures, with a focus on modern distributed systems</li>
<li>Have worked on projects with complex, established systems</li>
<li>Possess significant, hands-on experience in a Linux/Unix environment</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$194,000-$267,000 USD</Salaryrange>
      <Skills>Go development, Distributed systems, Secure systems, Distributed storage systems, Secrets management, Service mesh, Machine identity platform, Ergonomic API design, Cloud-native infrastructure</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Okta</Employername>
      <Employerlogo>https://logos.yubhub.co/okta.com.png</Employerlogo>
      <Employerdescription>Okta is a technology company that provides identity and access management solutions.</Employerdescription>
      <Employerwebsite>https://www.okta.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/okta/jobs/7674829</Applyto>
      <Location>San Francisco, California</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>18ae1499-b22</externalid>
      <Title>Research Engineer, Discovery</Title>
      <Description><![CDATA[<p>As a Research Engineer on our team, you will work end-to-end across the whole model stack, identifying and addressing key infra blockers on the path to scientific AGI. Strong candidates should be familiar with elements of language model training, evaluation, and inference, and eager to quickly get up to speed in areas where they are not yet experts.</p>
<p>Responsibilities:</p>
<ul>
<li>Design and implement large-scale infrastructure systems to support AI scientist training, evaluation, and deployment across distributed environments</li>
<li>Identify and resolve infrastructure bottlenecks impeding progress toward scientific capabilities</li>
<li>Develop robust and reliable evaluation frameworks for measuring progress towards scientific AGI</li>
<li>Build scalable and performant VM/sandboxing/container architectures to safely execute long-horizon AI tasks and scientific workflows</li>
<li>Collaborate to translate experimental requirements into production-ready infrastructure</li>
<li>Develop large scale data pipelines to handle advanced language model training requirements</li>
<li>Optimize large scale training and inference pipelines for stable and efficient reinforcement learning</li>
</ul>
<p>You may be a good fit if you:</p>
<ul>
<li>Have 6+ years of highly-relevant experience in infrastructure engineering with demonstrated expertise in large-scale distributed systems</li>
<li>Are a strong communicator and enjoy working collaboratively</li>
<li>Possess deep knowledge of performance optimization techniques and system architectures for high-throughput ML workloads</li>
<li>Have experience with containerization technologies (Docker, Kubernetes) and orchestration at scale</li>
<li>Have a proven track record of building large-scale data pipelines and distributed storage systems</li>
<li>Excel at diagnosing and resolving complex infrastructure challenges in production environments</li>
<li>Can work effectively across the full ML stack from data pipelines to performance optimization</li>
<li>Have experience collaborating with other researchers to scale experimental ideas</li>
<li>Thrive in fast-paced environments and can rapidly iterate from experimentation to production</li>
</ul>
<p>Strong candidates may also have:</p>
<ul>
<li>Experience with language model training infrastructure and distributed ML frameworks (PyTorch, JAX, etc.)</li>
<li>Background in building infrastructure for AI research labs or large-scale ML organizations</li>
<li>Knowledge of GPU/TPU architectures and language model inference optimization</li>
<li>Experience with cloud platforms (AWS, GCP) at enterprise scale</li>
<li>Familiarity with VM and container orchestration</li>
<li>Experience with workflow orchestration tools and experiment management systems</li>
<li>History of working with large-scale reinforcement learning</li>
<li>Comfort with large-scale data pipelines (Beam, Spark, Dask, …)</li>
</ul>
<p>The annual compensation range for this role is $350,000-$850,000 USD.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$350,000-$850,000 USD</Salaryrange>
      <Skills>large-scale distributed systems, containerization technologies (Docker, Kubernetes), performance optimization techniques, system architectures for high-throughput ML workloads, data pipelines, distributed storage systems, ML frameworks (PyTorch, JAX, etc.), GPU/TPU architectures, cloud platforms (AWS, GCP), VM and container orchestration, workflow orchestration tools, experiment management systems, reinforcement learning, large scale data pipelines (Beam, Spark, Dask, …)</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic is a public benefit corporation that creates reliable, interpretable, and steerable AI systems.</Employerdescription>
      <Employerwebsite>https://www.anthropic.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/4669581008</Applyto>
      <Location>San Francisco, CA</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>8aa2a018-294</externalid>
      <Title>Sr. Staff Software Engineer - Distributed System Development</Title>
      <Description><![CDATA[<p>As a Sr. Staff Software Engineer – Distributed Systems at Alluxio, you will lead the end-to-end architecture and technical evolution of our next-generation distributed data platform.</p>
<p>You will drive system-level design decisions that enable Alluxio to scale to thousands of nodes and exabytes of data, while maintaining performance, reliability, and simplicity for users.</p>
<p>In this role, you will operate as a technical architect and hands-on engineering leader, partnering closely with engineering teams and product management to translate complex requirements into scalable distributed system designs.</p>
<p><strong>Responsibilities</strong></p>
<ul>
<li>Lead the end-to-end architecture and design of large-scale distributed systems powering the Alluxio platform.</li>
<li>Drive technical strategy and architectural direction across multiple teams and components.</li>
<li>Design systems that support high scalability, fault tolerance, performance optimization, and data durability.</li>
<li>Provide hands-on development and deep technical guidance in critical areas of the system.</li>
<li>Lead complex system design reviews and mentor senior engineers on distributed systems design.</li>
<li>Identify and resolve system-level performance bottlenecks and reliability challenges.</li>
<li>Collaborate with product management and engineering leadership to translate product goals into technical solutions.</li>
<li>Influence the broader technical ecosystem through open-source contributions and architectural thought leadership.</li>
</ul>
<p><strong>Requirements</strong></p>
<ul>
<li>Master's or BS degree in Computer Science or a related technical field, or equivalent practical experience.</li>
<li>2+ years of proven experience in a technical leadership or architect role, driving system-level design and guiding engineering teams.</li>
<li>Strong hands-on software development experience in one or more general-purpose programming languages, including but not limited to Java, C/C++, or Go.</li>
<li>Deep architecting expertise in at least two of the following areas:
<ul>
<li>Distributed and parallel systems</li>
<li>Distributed storage systems</li>
<li>Architecting large-scale software systems</li>
</ul>
</li>
<li>Demonstrated ability to design and implement high-quality, stable, and scalable end-to-end system architectures in production environments.</li>
<li>Strong analytical thinking and complex problem-solving skills.</li>
<li>Excellent communication skills and ability to influence technical direction across teams.</li>
</ul>
<p><strong>Nice to Have</strong></p>
<ul>
<li>PhD in Computer Science, Distributed Systems, or related fields.</li>
<li>Deep understanding of consensus algorithms, storage engines, or large-scale data systems.</li>
<li>Experience building or operating cloud-native infrastructure platforms.</li>
<li>Experience contributing to or maintaining open-source distributed systems projects.</li>
<li>Track record of designing systems that operate at massive scale (thousands of nodes or higher).</li>
<li>Passion for building high-performance infrastructure software.</li>
<li>Contributions to Alluxio open-source community.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>Java, C/C++, Go, Distributed and parallel systems, Distributed storage systems, Architecting large-scale software systems</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Alluxio</Employername>
      <Employerlogo>https://logos.yubhub.co/alluxio.com.png</Employerlogo>
      <Employerdescription>Alluxio is a distributed data platform company.</Employerdescription>
      <Employerwebsite>https://www.alluxio.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.lever.co/alluxio/f997ed6c-941f-4873-b308-a1f33b6b78ef</Applyto>
      <Location>Beijing</Location>
      <Country></Country>
      <Postedate>2026-04-17</Postedate>
    </job>
    <job>
      <externalid>94b47f45-76d</externalid>
      <Title>Distributed Systems Engineer</Title>
      <Description><![CDATA[<p>Are you interested in joining a group of highly talented engineers working on an open source project that is solving challenging problems across big data analytics, machine learning and artificial intelligence?</p>
<p>As a distributed systems engineer at Alluxio, you will be responsible for evolving the state-of-the-art Alluxio project. The work involves solving challenging problems in distributed data services, memory and data-structure efficiency, thread concurrency and locking optimizations, process coordination, and the design and implementation of caching policies.</p>
<p>The role also includes developing innovative solutions for scaling systems to thousands of nodes while providing data durability and high availability.</p>
<p>You will be part of a team that includes leaders, innovators, explorers, and risk-takers with extensive industry experience from top tech companies including Google, Palantir, and VMware, and alumni from top computer science programs including CMU, Stanford, and UC Berkeley.</p>
<p>We are looking for someone with a BS degree in Computer Science, a similar technical field of study, or equivalent practical experience. You should have software development experience in one or more general-purpose programming languages, including but not limited to Java, C/C++, or Go.</p>
<p>Experience working with two or more of the following is a must: distributed and parallel systems, distributed storage systems, architecting large-scale software systems, and/or security software development.</p>
<p>Excellent analytical and problem-solving skills are required, along with working proficiency and communication skills in verbal and written English.</p>
<p>Preferred qualifications include a Master's or PhD degree, or equivalent practical experience, in engineering, computer science, or another technical field. Experience designing, developing, and deploying Kubernetes applications is also desirable.</p>
<p>If you are interested in contributing to an open source project and want to work in a fast-paced, collaborative and iterative programming environment, please apply.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>mid</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>Java, C/C++, Go, Distributed systems, Parallel systems, Distributed storage systems, Architecting large-scale software systems, Security software development, Kubernetes</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Alluxio</Employername>
      <Employerlogo>https://logos.yubhub.co/alluxio.com.png</Employerlogo>
      <Employerdescription>Alluxio is a project from AMPLab, backed by Andreessen Horowitz, and was named one of the top 10 hot startups of 2018.</Employerdescription>
      <Employerwebsite>https://www.alluxio.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.lever.co/alluxio/ad547017-b276-4c99-ae4e-4c5a073daf93</Applyto>
      <Location>San Francisco</Location>
      <Country></Country>
      <Postedate>2026-04-17</Postedate>
    </job>
    <job>
      <externalid>da726093-b19</externalid>
      <Title>Research Engineer, Discovery</Title>
      <Description><![CDATA[<p><strong>About the Role</strong></p>
<p>As a Research Engineer on our team, you will work end to end across the whole model stack, identifying and addressing key infrastructure blockers on the path to scientific AGI. Strong candidates should have familiarity with elements of language model training, evaluation, and inference, and an eagerness to quickly dive in and get up to speed in areas where they are not yet experts. This may include performance optimization, distributed systems, VM/sandboxing/container deployment, and large-scale data pipelines.</p>
<p><strong>Responsibilities:</strong></p>
<ul>
<li>Design and implement large-scale infrastructure systems to support AI scientist training, evaluation, and deployment across distributed environments</li>
<li>Identify and resolve infrastructure bottlenecks impeding progress toward scientific capabilities</li>
<li>Develop robust and reliable evaluation frameworks for measuring progress towards scientific AGI</li>
<li>Build scalable and performant VM/sandboxing/container architectures to safely execute long-horizon AI tasks and scientific workflows</li>
<li>Collaborate to translate experimental requirements into production-ready infrastructure</li>
<li>Develop large-scale data pipelines to handle advanced language model training requirements</li>
<li>Optimize large-scale training and inference pipelines for stable and efficient reinforcement learning</li>
</ul>
<p><strong>You may be a good fit if you:</strong></p>
<ul>
<li>Have 6+ years of highly-relevant experience in infrastructure engineering with demonstrated expertise in large-scale distributed systems</li>
<li>Are a strong communicator and enjoy working collaboratively</li>
<li>Possess deep knowledge of performance optimization techniques and system architectures for high-throughput ML workloads</li>
<li>Have experience with containerization technologies (Docker, Kubernetes) and orchestration at scale</li>
<li>Have proven track record of building large-scale data pipelines and distributed storage systems</li>
<li>Excel at diagnosing and resolving complex infrastructure challenges in production environments</li>
<li>Can work effectively across the full ML stack from data pipelines to performance optimization</li>
<li>Have experience collaborating with other researchers to scale experimental ideas</li>
<li>Thrive in fast-paced environments and can rapidly iterate from experimentation to production</li>
</ul>
<p><strong>Strong candidates may also have:</strong></p>
<ul>
<li>Experience with language model training infrastructure and distributed ML frameworks (PyTorch, JAX, etc.)</li>
<li>Background in building infrastructure for AI research labs or large-scale ML organizations</li>
<li>Knowledge of GPU/TPU architectures and language model inference optimization</li>
<li>Experience with cloud platforms (AWS, GCP) at enterprise scale</li>
<li>Familiarity with VM and container orchestration</li>
<li>Experience with workflow orchestration tools and experiment management systems</li>
<li>History of working with large-scale reinforcement learning</li>
<li>Comfort with large-scale data pipelines (Beam, Spark, Dask, …)</li>
</ul>
<p><strong>Logistics</strong></p>
<ul>
<li>Education requirements: We require at least a Bachelor&#39;s degree in a related field or equivalent experience.</li>
<li>Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</li>
<li>Visa sponsorship: We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</li>
</ul>
<p><strong>We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you&#39;re interested in this work.</strong></p>
<p><strong>Your safety matters to us. To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses. In some cases, we may partner with vetted recruiting agencies who will identify themselves as working on behalf of Anthropic. Be cautious of emails from other domains. Legitimate Anthropic recruiters will never ask for money, fees, or banking information before your first day. If you&#39;re ever unsure about a communication, don&#39;t click any links—visit anthropic.com/careers directly for confirmed position openings.</strong></p>
<p><strong>How we&#39;re different</strong></p>
<p>We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale projects, and we&#39;re committed to making a positive impact on the world.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$350,000 - $850,000 USD</Salaryrange>
      <Skills>infrastructure engineering, large-scale distributed systems, performance optimization, containerization technologies, orchestration at scale, data pipelines, distributed storage systems, complex infrastructure challenges, ML stack, workflow orchestration tools, experiment management systems, large-scale reinforcement learning, language model training infrastructure, distributed ML frameworks, GPU/TPU architectures, language model inference optimization, cloud platforms, VM and container orchestration</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic is a company that aims to create reliable, interpretable, and steerable AI systems. It has a team of researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</Employerdescription>
      <Employerwebsite>https://www.anthropic.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/4669581008</Applyto>
      <Location>San Francisco, CA</Location>
      <Country></Country>
      <Postedate>2026-03-08</Postedate>
    </job>
  </jobs>
</source>