<?xml version="1.0" encoding="UTF-8"?>
<source>
  <jobs>
    <job>
      <externalid>198d64d4-207</externalid>
      <Title>Senior/Staff Site Reliability Engineer</Title>
<Description><![CDATA[<p>You are a seasoned SRE who keeps production infrastructure running at scale. You own the reliability and availability of customer-facing systems, from Kubernetes clusters to deployment pipelines to the networking layer that connects it all. You think in SLOs, automate ruthlessly, and treat every incident as a chance to make the system better.</p>
<p><strong>Key Responsibilities</strong></p>
<ul>
<li>Own and operate our Kubernetes infrastructure: cluster lifecycle, upgrades, networking, and multi-tenant isolation for customer workloads</li>
<li>Build and maintain CI/CD pipelines and deployment infrastructure</li>
<li>Leverage AI extensively to automate the analysis and resolution of production issues and to improve software development speed, reliability, and maintainability</li>
<li>Build dashboards, alerting, and anomaly detection across our systems</li>
<li>Define and enforce SLOs and build out incident response processes</li>
<li>Manage and improve our networking, load balancing, and service mesh configurations</li>
<li>Drive reliability improvements across the stack through automation, runbooks, and chaos engineering</li>
</ul>
<p><strong>Requirements</strong></p>
<ul>
<li>5+ years of experience managing critical production systems and software development workflows</li>
<li>Strong production experience setting up and operating Kubernetes at scale, using infrastructure-as-code (Terraform, Ansible)</li>
<li>Deep knowledge of Linux networking, container networking (CNI plugins, VXLAN, BGP), and DNS</li>
<li>Experience building CI/CD systems and GitOps workflows (FluxCD, ArgoCD)</li>
<li>Proficiency in Python and either Go or Bash for tooling and automation</li>
<li>Strong experience with logging, monitoring, and alerting (Prometheus, Grafana, Loki, Thanos, VictoriaMetrics, Datadog)</li>
<li>Excellent communication skills and the ability to drive technical decisions across teams</li>
<li>Self-starter who executes quickly, takes ownership, and constantly seeks improvement</li>
</ul>
<p><strong>Nice to have</strong></p>
<ul>
<li>Experience managing GPU and AI/ML workloads</li>
<li>Experience with kernel-based monitoring and routing (eBPF, XDP)</li>
<li>Experience with security tooling (Falco, Coroot, SIEM)</li>
<li>Experience with bare-metal Kubernetes networking (Calico, Cilium, MetalLB)</li>
<li>Experience with distributed storage systems (Ceph, Longhorn, etc.)</li>
</ul>
<p><strong>Compensation</strong></p>
<ul>
<li>$180,000-$250,000 plus equity and benefits</li>
</ul>
<p><strong>Benefits</strong></p>
<ul>
<li>Interesting and challenging work</li>
<li>A lot of learning and growth opportunities</li>
<li>Regular team events and offsites</li>
<li>Health, dental, and vision insurance (US)</li>
<li>Visa sponsorship and relocation assistance</li>
</ul>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>$180,000-250,000</Salaryrange>
      <Skills>Kubernetes, Infrastructure-as-code, Linux networking, Container networking, CI/CD systems, GitOps workflows, Python, Go, Bash, Logging, Monitoring, Alerting, GPU and AI/ML workloads, Kernel-based monitoring and routing, Security tooling, Bare metal Kubernetes networking, Distributed storage systems</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Fal</Employername>
      <Employerlogo>https://logos.yubhub.co/fal.com.png</Employerlogo>
      <Employerdescription>Fal is a technology company that operates in the San Francisco area.</Employerdescription>
      <Employerwebsite>https://fal.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/fal/jobs/4146019009</Applyto>
      <Location>San Francisco</Location>
      <Country></Country>
      <Postedate>2026-04-24</Postedate>
    </job>
    <job>
      <externalid>38c10a5f-35e</externalid>
      <Title>CPU/Storage/PoP-WAN Program Manager</Title>
      <Description><![CDATA[<p>We are seeking a highly technical Program Manager to lead execution across CPU, Storage, PoP, and WAN infrastructure programs that directly unlock OpenAI&#39;s next generation compute capacity.</p>
<p>In this role, you will own complex cross-functional programs spanning compute cluster activation, storage deployment, PoP bring-up, and backbone expansion. You will coordinate hardware readiness, site readiness, network pathing, storage availability, vendor execution, and engineering dependencies required to turn contracted infrastructure into live training and inference capacity.</p>
<p>This role requires strong technical fluency across hardware systems, network infrastructure, storage architecture, and deployment execution. You should be comfortable operating from rack-level implementation details through executive-level capacity planning discussions.</p>
<p>Key Responsibilities:</p>
<ul>
<li>Lead end-to-end execution of CPU / GPU cluster activation programs across OpenAI&#39;s global infrastructure footprint</li>
<li>Drive readiness to convert contracted compute capacity into schedulable production clusters</li>
<li>Own deployment programs for new PoPs, backbone nodes, WAN expansion, and interconnection initiatives</li>
<li>Build integrated schedules spanning procurement, logistics, installation, storage readiness, network turn-up, testing, and production handoff</li>
<li>Coordinate BOM readiness, server delivery, racks, optics, cabling, storage hardware, and vendor milestones</li>
<li>Partner with engineering teams to align compute, storage, and networking dependencies before cluster activation</li>
<li>Manage deployment of storage systems supporting training and inference workloads, including readiness, validation, performance checks, and scaling plans</li>
<li>Coordinate backbone capacity expansion, cross-connects, inter-region pathing, and cloud interconnect readiness with Azure and third-party providers</li>
<li>Lead physical deployment execution including rack-and-stack, hardware bring-up, L1 validation, and site acceptance criteria</li>
<li>Build repeatable deployment playbooks, dashboards, governance cadences, and operating mechanisms for scale</li>
<li>Identify risks early across supply chain, site readiness, technical constraints, and vendor execution, then drive mitigation plans</li>
<li>Communicate milestones, escalations, and capacity forecasts to senior leadership</li>
</ul>
<p>Qualifications:</p>
<ul>
<li>8+ years of experience in technical program management, infrastructure deployment, network deployment, or data center operations</li>
<li>Strong experience delivering programs involving compute, storage, networking, or large-scale infrastructure systems</li>
<li>Working knowledge of servers, clusters, storage arrays, routers, switches, optics, and structured cabling</li>
<li>Experience owning cross-functional programs across engineering, operations, supply chain, and external vendors</li>
<li>Strong understanding of deployment lifecycles from planning and procurement through production handoff</li>
<li>Ability to reason across physical infrastructure execution and logical systems architecture dependencies</li>
<li>Proven ability to build integrated schedules and drive accountability across multiple stakeholders</li>
<li>Strong executive communication skills with experience managing critical escalations and leadership updates</li>
<li>Comfortable operating in fast-moving environments with aggressive timelines and evolving priorities</li>
<li>Highly analytical with strong problem-solving and execution instincts</li>
</ul>
<p>Preferred Skills:</p>
<ul>
<li>Experience at a hyperscaler, cloud provider, AI infrastructure company, or global network operator</li>
<li>Experience deploying GPU clusters, HPC systems, or large training environments</li>
<li>Familiarity with distributed storage systems and high-performance data infrastructure</li>
<li>Experience with PoP deployments, WAN backbone expansion, or global network buildouts</li>
<li>Experience working across first-party, colo, and cloud environments</li>
<li>Experience building repeatable infrastructure deployment systems in high-growth environments</li>
</ul>
<p>About OpenAI:</p>
<p>OpenAI is an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$342K – $555K</Salaryrange>
      <Skills>Technical program management, infrastructure deployment, network deployment, data center operations, compute, storage, networking, large-scale infrastructure systems, servers, clusters, storage arrays, routers, switches, optics, structured cabling, cross-functional program management, supply chain coordination, vendor management, deployment lifecycles, integrated scheduling, executive communication, GPU clusters, HPC systems, distributed storage systems, PoP deployments, WAN backbone expansion, global network buildouts</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>OpenAI</Employername>
      <Employerlogo>https://logos.yubhub.co/openai.com.png</Employerlogo>
      <Employerdescription>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity.</Employerdescription>
      <Employerwebsite>https://openai.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.ashbyhq.com/openai/667c09e2-6efc-45dc-9714-078bedf17343</Applyto>
      <Location>San Francisco; Seattle</Location>
      <Country></Country>
      <Postedate>2026-04-24</Postedate>
    </job>
    <job>
      <externalid>acd7d096-766</externalid>
      <Title>Staff Backend Engineer, Non Human Identities</Title>
      <Description><![CDATA[<p>Secure Every Identity, from AI to Human. Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organisations to safely embrace this new era.</p>
<p>This work requires a relentless drive to solve complex challenges with real-world stakes. We are looking for builders and owners who operate with speed and urgency and execute with excellence. This is an opportunity to do career-defining work. We&#39;re all in on this mission. If you are too, let&#39;s talk.</p>
<p><strong>The PAM Team</strong></p>
<p>Ever wonder how large organisations make sure the right people can access their most critical systems? That&#39;s the problem the Okta Privileged Access Management (PAM) team solves. Our solution controls who can reach sensitive servers, databases and cloud resources and grants access only when it&#39;s needed. It is the security layer between people (and non-human-identities) and the systems they need to do their jobs.</p>
<p><strong>The Staff Backend Engineer Opportunity</strong></p>
<p>We are seeking a world-class Staff Engineer to help us architect and build the high-performance core of our non-human identity platform. Your work, in close collaboration with our principal engineers and architects, will be the foundation of our strategy for managing privileged access in the modern enterprise. If you are a systems programmer who thrives on influencing the design of high-performance, concurrent, and resilient security software, this is the role for you.</p>
<p><strong>What you’ll be doing</strong></p>
<ul>
<li>Contribute to Core Architecture: Partner with principal engineers and architects to design and implement a low-latency, high-throughput secrets engine for non-human identities</li>
<li>Solve for Massive Scale: Write highly concurrent, performance-critical code capable of handling millions of machine-to-machine authentication and authorization requests</li>
<li>Shape Technical Strategy: Play a key role in defining the long-term technical roadmap for scalability and performance, ensuring our platform can meet the demands of the largest enterprises</li>
<li>Mentor and Elevate: As a senior engineer on the team, work with junior engineers to help them advance their SDLC expertise</li>
<li>On-Call: Participate in the rotational on-call activities with the SRE and product development teams</li>
</ul>
<p><strong>What you’ll bring to the role</strong></p>
<p><strong>Required Experience:</strong></p>
<ul>
<li>8+ years of professional software engineering experience, with a heavy focus on backend or systems-level development</li>
<li>Bachelor’s or Master’s degree in Computer Science, or equivalent practical experience</li>
</ul>
<p><strong>Core Technical Expertise:</strong></p>
<ul>
<li>Deep, hands-on expertise in multi-platform Go development and building high-performance, concurrent applications</li>
<li>Experience designing or operating distributed systems</li>
<li>Experience with secure systems (authn/authz, encryption, TLS, token handling, PKI, CAs, diagnosing TLS issues)</li>
<li>Deep expertise in distributed storage systems, with a focus on replication, backup and restore, and data management (Postgres, etc.)</li>
<li>Direct experience designing, building, or contributing to a secrets management, service mesh, or machine identity platform</li>
<li>Expert-level ergonomic API design (gRPC/OpenAPI) and building for reliability at scale</li>
<li>Deep knowledge of cloud-native infrastructure</li>
</ul>
<p><strong>And extra credit if you have experience in any of the following!</strong></p>
<ul>
<li>Experience at a leading cybersecurity or infrastructure-as-code company</li>
<li>Contributions to open-source projects in the identity, security, or infrastructure space</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$160,000-$220,000 CAD</Salaryrange>
      <Skills>Go, Distributed systems, Secure systems, Distributed storage systems, Secrets management, Service mesh, Machine identity platform, Ergonomic API design, Cloud-native infrastructure</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Okta</Employername>
      <Employerlogo>https://logos.yubhub.co/okta.com.png</Employerlogo>
      <Employerdescription>Okta secures AI by building the trusted, neutral infrastructure that enables organisations to safely embrace this new era.</Employerdescription>
      <Employerwebsite>https://www.okta.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/okta/jobs/7819476</Applyto>
      <Location>Toronto, Ontario, Canada</Location>
      <Country></Country>
      <Postedate>2026-04-24</Postedate>
    </job>
    <job>
      <externalid>21104c69-8cb</externalid>
      <Title>Staff Backend Engineer, Non Human Identities</Title>
      <Description><![CDATA[<p>Secure Every Identity, from AI to Human. Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organisations to safely embrace this new era.</p>
<p>We are looking for builders and owners who operate with speed and urgency and execute with excellence. This is an opportunity to do career-defining work. We&#39;re all in on this mission. If you are too, let&#39;s talk.</p>
<p><strong>The PAM Team</strong></p>
<p>Ever wonder how large organisations make sure the right people can access their most critical systems? That&#39;s the problem the Okta Privileged Access Management (PAM) team solves. Our solution controls who can reach sensitive servers, databases and cloud resources and grants access only when it&#39;s needed. It is the security layer between people (and non-human-identities) and the systems they need to do their jobs.</p>
<p><strong>The Staff Backend Engineer Opportunity</strong></p>
<p>We are seeking a world-class Staff Engineer to help us architect and build the high-performance core of our non-human identity platform. Your work, in close collaboration with our principal engineers and architects, will be the foundation of our strategy for managing privileged access in the modern enterprise. If you are a systems programmer who thrives on influencing the design of high-performance, concurrent, and resilient security software, this is the role for you.</p>
<p><strong>What you’ll be doing</strong></p>
<ul>
<li>Contribute to Core Architecture: Partner with principal engineers and architects to design and implement a low-latency, high-throughput secrets engine for non-human identities</li>
<li>Solve for Massive Scale: Write highly concurrent, performance-critical code capable of handling millions of machine-to-machine authentication and authorization requests</li>
<li>Shape Technical Strategy: Play a key role in defining the long-term technical roadmap for scalability and performance, ensuring our platform can meet the demands of the largest enterprises</li>
<li>Mentor and Elevate: As a senior engineer on the team, work with junior engineers to help them advance their SDLC expertise</li>
<li>On-Call: Participate in the rotational on-call activities with the SRE and product development teams</li>
</ul>
<p><strong>What you’ll bring to the role</strong></p>
<p><strong>Required Experience:</strong></p>
<ul>
<li>8+ years of professional software engineering experience, with a heavy focus on backend or systems-level development</li>
<li>Bachelor’s or Master’s degree in Computer Science, or equivalent practical experience</li>
</ul>
<p><strong>Core Technical Expertise:</strong></p>
<ul>
<li>Deep, hands-on expertise in multi-platform Go development and building high-performance, concurrent applications</li>
<li>Experience designing or operating distributed systems</li>
<li>Experience with secure systems (authn/authz, encryption, TLS, token handling, PKI, CAs, diagnosing TLS issues)</li>
<li>Deep expertise in distributed storage systems, with a focus on replication, backup and restore, and data management (Postgres, etc.)</li>
<li>Direct experience designing, building, or contributing to a secrets management, service mesh, or machine identity platform</li>
<li>Expert-level ergonomic API design (gRPC/OpenAPI) and building for reliability at scale</li>
<li>Deep knowledge of cloud-native infrastructure</li>
</ul>
<p><strong>And extra credit if you have experience in any of the following!</strong></p>
<ul>
<li>Experience at a leading cybersecurity or infrastructure-as-code company</li>
<li>Contributions to open-source projects in the identity, security, or infrastructure space</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$194,000-$267,300 USD</Salaryrange>
      <Skills>Go, Distributed systems, Secure systems, Distributed storage systems, Secrets management, Service mesh, Machine identity platform, Ergonomic API design, Cloud-native infrastructure</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Okta</Employername>
      <Employerlogo>https://logos.yubhub.co/okta.com.png</Employerlogo>
      <Employerdescription>Okta secures AI by building the trusted, neutral infrastructure that enables organisations to safely embrace this new era.</Employerdescription>
      <Employerwebsite>https://www.okta.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/okta/jobs/7842962</Applyto>
      <Location>San Francisco, California</Location>
      <Country></Country>
      <Postedate>2026-04-24</Postedate>
    </job>
    <job>
      <externalid>3a40dbfa-d00</externalid>
      <Title>Staff Software Engineer, Non-Human Identity</Title>
      <Description><![CDATA[<p>Secure Every Identity, from AI to Human. Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organisations to safely embrace this new era.</p>
<p>We are looking for builders and owners who operate with speed and urgency and execute with excellence. This is an opportunity to do career-defining work. We&#39;re all in on this mission. If you are too, let&#39;s talk.</p>
<p><strong>The Team</strong></p>
<p>The Okta Privileged Access Management (PAM) team is building the future of identity for machines, services, and applications. We are seeking a world-class Staff Engineer to help us architect and build the high-performance core of our non-human identity platform.</p>
<p>Your work, in close collaboration with our principal engineers and architects, will be the foundation of our strategy for managing privileged access in the modern enterprise. If you are a systems programmer who thrives on influencing the design of high-performance, concurrent, and resilient security software, this is the role for you.</p>
<p><strong>What you’ll be doing</strong></p>
<ul>
<li>Contribute to Core Architecture: Partner with principal engineers and architects to design and implement a low-latency, high-throughput secrets engine for non-human identities</li>
<li>Solve for Massive Scale: Write highly concurrent, performance-critical code capable of handling millions of machine-to-machine authentication and authorization requests</li>
<li>Shape Technical Strategy: Play a key role in defining the long-term technical roadmap for scalability and performance, ensuring our platform can meet the demands of the largest enterprises</li>
<li>Mentor and Elevate: As a senior engineer on the team, work with junior engineers to help them advance their SDLC expertise</li>
<li>On-Call: Participate in the rotational on-call activities with the SRE and product development teams</li>
</ul>
<p><strong>What you’ll bring to the role</strong></p>
<p><strong>Required Experience:</strong></p>
<ul>
<li>8+ years of professional software engineering experience, with a heavy focus on backend or systems-level development</li>
<li>Bachelor’s or Master’s degree in Computer Science, or equivalent practical experience</li>
</ul>
<p><strong>Core Technical Expertise:</strong></p>
<ul>
<li>Deep, hands-on expertise in multi-platform Go development and building high-performance, concurrent applications</li>
<li>Experience designing or operating distributed systems</li>
<li>Experience with secure systems (authn/authz, encryption, TLS, token handling, PKI, CAs, diagnosing TLS issues)</li>
<li>Deep expertise in distributed storage systems, with a focus on replication, backup and restore, and data management (Postgres, etc.)</li>
<li>Direct experience designing, building, or contributing to a secrets management, service mesh, or machine identity platform</li>
<li>Expert-level ergonomic API design (gRPC/OpenAPI) and building for reliability at scale</li>
<li>Deep knowledge of cloud-native infrastructure</li>
</ul>
<p><strong>Key Attributes:</strong></p>
<ul>
<li>You are driven by the challenge of optimizing systems for performance, latency, and throughput, with a proven ability to diagnose complex, multi-system issues</li>
<li>You have a proven track record of making significant contributions to the architecture of complex, mission-critical systems</li>
<li>You thrive in an environment where you can focus on deep technical problems</li>
</ul>
<p><strong>Bonus Points:</strong></p>
<ul>
<li>Experience at a leading cybersecurity or infrastructure-as-code company</li>
<li>Contributions to open-source projects in the identity, security, or infrastructure space</li>
</ul>
<p><strong>And extra credit if you have experience in any of the following!</strong></p>
<ul>
<li>Deep expertise in backend systems engineering</li>
<li>Experience building and scaling beyond standard three-tier monolithic architectures, with a focus on modern distributed systems</li>
<li>Have worked on projects with complex, established systems</li>
<li>Possess significant, hands-on experience in a Linux/Unix environment</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$194,000-$267,000 USD</Salaryrange>
      <Skills>Go development, Distributed systems, Secure systems, Distributed storage systems, Secrets management, Service mesh, Machine identity platform, Ergonomic API design, Cloud-native infrastructure</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Okta</Employername>
      <Employerlogo>https://logos.yubhub.co/okta.com.png</Employerlogo>
      <Employerdescription>Okta is a technology company that provides identity and access management solutions.</Employerdescription>
      <Employerwebsite>https://www.okta.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/okta/jobs/7674829</Applyto>
      <Location>San Francisco, California</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>18ae1499-b22</externalid>
      <Title>Research Engineer, Discovery</Title>
      <Description><![CDATA[<p>As a Research Engineer on our team, you will work end-to-end across the whole model stack, identifying and addressing key infra blockers on the path to scientific AGI. Strong candidates should be familiar with elements of language model training, evaluation, and inference, and eager to quickly get up to speed in areas where they are not yet experts.</p>
<p>Responsibilities:</p>
<ul>
<li>Design and implement large-scale infrastructure systems to support AI scientist training, evaluation, and deployment across distributed environments</li>
<li>Identify and resolve infrastructure bottlenecks impeding progress toward scientific capabilities</li>
<li>Develop robust and reliable evaluation frameworks for measuring progress towards scientific AGI</li>
<li>Build scalable and performant VM/sandboxing/container architectures to safely execute long-horizon AI tasks and scientific workflows</li>
<li>Collaborate to translate experimental requirements into production-ready infrastructure</li>
<li>Develop large scale data pipelines to handle advanced language model training requirements</li>
<li>Optimize large scale training and inference pipelines for stable and efficient reinforcement learning</li>
</ul>
<p>You may be a good fit if you:</p>
<ul>
<li>Have 6+ years of highly-relevant experience in infrastructure engineering with demonstrated expertise in large-scale distributed systems</li>
<li>Are a strong communicator and enjoy working collaboratively</li>
<li>Possess deep knowledge of performance optimization techniques and system architectures for high-throughput ML workloads</li>
<li>Have experience with containerization technologies (Docker, Kubernetes) and orchestration at scale</li>
<li>Have a proven track record of building large-scale data pipelines and distributed storage systems</li>
<li>Excel at diagnosing and resolving complex infrastructure challenges in production environments</li>
<li>Can work effectively across the full ML stack from data pipelines to performance optimization</li>
<li>Have experience collaborating with other researchers to scale experimental ideas</li>
<li>Thrive in fast-paced environments and can rapidly iterate from experimentation to production</li>
</ul>
<p>Strong candidates may also have:</p>
<ul>
<li>Experience with language model training infrastructure and distributed ML frameworks (PyTorch, JAX, etc.)</li>
<li>Background in building infrastructure for AI research labs or large-scale ML organizations</li>
<li>Knowledge of GPU/TPU architectures and language model inference optimization</li>
<li>Experience with cloud platforms (AWS, GCP) at enterprise scale</li>
<li>Familiarity with VM and container orchestration</li>
<li>Experience with workflow orchestration tools and experiment management systems</li>
<li>History of working with large-scale reinforcement learning</li>
<li>Comfort with large-scale data pipelines (Beam, Spark, Dask, …)</li>
</ul>
<p>The annual compensation range for this role is $350,000-$850,000 USD.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$350,000-$850,000 USD</Salaryrange>
      <Skills>large-scale distributed systems, containerization technologies (Docker, Kubernetes), performance optimization techniques, system architectures for high-throughput ML workloads, data pipelines, distributed storage systems, ML frameworks (PyTorch, JAX, etc.), GPU/TPU architectures, cloud platforms (AWS, GCP), VM and container orchestration, workflow orchestration tools, experiment management systems, reinforcement learning, large scale data pipelines (Beam, Spark, Dask, …)</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic is a public benefit corporation that creates reliable, interpretable, and steerable AI systems.</Employerdescription>
      <Employerwebsite>https://www.anthropic.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/4669581008</Applyto>
      <Location>San Francisco, CA</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>8aa2a018-294</externalid>
      <Title>Sr. Staff Software Engineer - Distributed System Development</Title>
      <Description><![CDATA[<p>As a Sr. Staff Software Engineer – Distributed Systems at Alluxio, you will lead the end-to-end architecture and technical evolution of our next-generation distributed data platform.</p>
<p>You will drive system-level design decisions that enable Alluxio to scale to thousands of nodes and exabytes of data, while maintaining performance, reliability, and simplicity for users.</p>
<p>In this role, you will operate as a technical architect and hands-on engineering leader, partnering closely with engineering teams and product management to translate complex requirements into scalable distributed system designs.</p>
<p><strong>Responsibilities</strong></p>
<ul>
<li>Lead the end-to-end architecture and design of large-scale distributed systems powering the Alluxio platform.</li>
<li>Drive technical strategy and architectural direction across multiple teams and components.</li>
<li>Design systems that support high scalability, fault tolerance, performance optimization, and data durability.</li>
<li>Provide hands-on development and deep technical guidance in critical areas of the system.</li>
<li>Lead complex system design reviews and mentor senior engineers on distributed systems design.</li>
<li>Identify and resolve system-level performance bottlenecks and reliability challenges.</li>
<li>Collaborate with product management and engineering leadership to translate product goals into technical solutions.</li>
<li>Influence the broader technical ecosystem through open-source contributions and architectural thought leadership.</li>
</ul>
<p><strong>Requirements</strong></p>
<ul>
<li>Master's or BS degree in Computer Science or a related technical field, or equivalent practical experience.</li>
<li>2+ years of proven experience in a technical leadership or architect role, driving system-level design and guiding engineering teams.</li>
<li>Strong hands-on software development experience in one or more general-purpose programming languages, including but not limited to Java, C/C++, or Go.</li>
<li>Deep architecting expertise in at least two of the following areas:
<ul>
<li>Distributed and parallel systems</li>
<li>Distributed storage systems</li>
<li>Architecting large-scale software systems</li>
</ul>
</li>
<li>Demonstrated ability to design and implement high-quality, stable, and scalable end-to-end system architectures in production environments.</li>
<li>Strong analytical thinking and complex problem-solving skills.</li>
<li>Excellent communication skills and ability to influence technical direction across teams.</li>
</ul>
<p><strong>Nice to Have</strong></p>
<ul>
<li>PhD in Computer Science, Distributed Systems, or related fields.</li>
<li>Deep understanding of consensus algorithms, storage engines, or large-scale data systems.</li>
<li>Experience building or operating cloud-native infrastructure platforms.</li>
<li>Experience contributing to or maintaining open-source distributed systems projects.</li>
<li>Track record of designing systems that operate at massive scale (thousands of nodes or higher).</li>
<li>Passion for building high-performance infrastructure software.</li>
<li>Contributions to Alluxio open-source community.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>Java, C/C++, Go, Distributed and parallel systems, Distributed storage systems, Architecting large-scale software systems</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Alluxio</Employername>
      <Employerlogo>https://logos.yubhub.co/alluxio.com.png</Employerlogo>
      <Employerdescription>Alluxio is a distributed data platform company.</Employerdescription>
      <Employerwebsite>https://www.alluxio.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.lever.co/alluxio/f997ed6c-941f-4873-b308-a1f33b6b78ef</Applyto>
      <Location>Beijing</Location>
      <Country></Country>
      <Postedate>2026-04-17</Postedate>
    </job>
    <job>
      <externalid>94b47f45-76d</externalid>
      <Title>Distributed Systems Engineer</Title>
      <Description><![CDATA[<p>Are you interested in joining a group of highly talented engineers working on an open source project that is solving challenging problems across big data analytics, machine learning and artificial intelligence?</p>
<p>As a distributed systems engineer at Alluxio, you will be responsible for evolving the state-of-the-art Alluxio project. The work involves solving challenging problems in distributed data services, memory and data-structure efficiency, thread concurrency and locking optimizations, process coordination, and the design and implementation of caching policies.</p>
<p>The role also includes developing innovative solutions for scaling systems to thousands of nodes while providing data durability and high availability.</p>
<p>You will be part of a team that includes leaders, innovators, explorers, and risk-takers with extensive industry experience from top tech companies including Google, Palantir, and VMware, and alumni from top computer science programs including CMU, Stanford, and UC Berkeley.</p>
<p>We are looking for someone with a BS degree in Computer Science, a similar technical field of study, or equivalent practical experience. You should have software development experience in one or more general-purpose programming languages, including but not limited to Java, C/C++, or Go.</p>
<p>Experience working with two or more of the following is a must: distributed and parallel systems, distributed storage systems, architecting large-scale software systems, and/or security software development.</p>
<p>Excellent analytical and problem-solving skills are required, along with working proficiency and communication skills in verbal and written English.</p>
<p>Preferred qualifications include a Master's or PhD degree, or equivalent practical experience, in engineering, computer science, or another technical field. Experience designing, developing, and deploying Kubernetes applications is also desirable.</p>
<p>If you are interested in contributing to an open source project and want to work in a fast-paced, collaborative and iterative programming environment, please apply.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>mid</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>Java, C/C++, Go, Distributed systems, Parallel systems, Distributed storage systems, Architecting large-scale software systems, Security software development, Kubernetes</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Alluxio</Employername>
      <Employerlogo>https://logos.yubhub.co/alluxio.com.png</Employerlogo>
      <Employerdescription>Alluxio is a project from AMPLab, backed by Andreessen Horowitz, and was named one of the top 10 hot startups of 2018.</Employerdescription>
      <Employerwebsite>https://www.alluxio.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.lever.co/alluxio/ad547017-b276-4c99-ae4e-4c5a073daf93</Applyto>
      <Location>San Francisco</Location>
      <Country></Country>
      <Postedate>2026-04-17</Postedate>
    </job>
    <job>
      <externalid>da726093-b19</externalid>
      <Title>Research Engineer, Discovery</Title>
      <Description><![CDATA[<p><strong>About the Role</strong></p>
<p>As a Research Engineer on our team, you will work end to end across the whole model stack, identifying and addressing key infrastructure blockers on the path to scientific AGI. Strong candidates should have familiarity with elements of language model training, evaluation, and inference, and an eagerness to quickly dive in and get up to speed in areas where they are not yet experts. This may include performance optimization, distributed systems, VM/sandboxing/container deployment, and large-scale data pipelines.</p>
<p><strong>Responsibilities:</strong></p>
<ul>
<li>Design and implement large-scale infrastructure systems to support AI scientist training, evaluation, and deployment across distributed environments</li>
<li>Identify and resolve infrastructure bottlenecks impeding progress toward scientific capabilities</li>
<li>Develop robust and reliable evaluation frameworks for measuring progress towards scientific AGI</li>
<li>Build scalable and performant VM/sandboxing/container architectures to safely execute long-horizon AI tasks and scientific workflows</li>
<li>Collaborate to translate experimental requirements into production-ready infrastructure</li>
<li>Develop large-scale data pipelines to handle advanced language model training requirements</li>
<li>Optimize large-scale training and inference pipelines for stable and efficient reinforcement learning</li>
</ul>
<p><strong>You may be a good fit if you:</strong></p>
<ul>
<li>Have 6+ years of highly-relevant experience in infrastructure engineering with demonstrated expertise in large-scale distributed systems</li>
<li>Are a strong communicator and enjoy working collaboratively</li>
<li>Possess deep knowledge of performance optimization techniques and system architectures for high-throughput ML workloads</li>
<li>Have experience with containerization technologies (Docker, Kubernetes) and orchestration at scale</li>
<li>Have proven track record of building large-scale data pipelines and distributed storage systems</li>
<li>Excel at diagnosing and resolving complex infrastructure challenges in production environments</li>
<li>Can work effectively across the full ML stack from data pipelines to performance optimization</li>
<li>Have experience collaborating with other researchers to scale experimental ideas</li>
<li>Thrive in fast-paced environments and can rapidly iterate from experimentation to production</li>
</ul>
<p><strong>Strong candidates may also have:</strong></p>
<ul>
<li>Experience with language model training infrastructure and distributed ML frameworks (PyTorch, JAX, etc.)</li>
<li>Background in building infrastructure for AI research labs or large-scale ML organizations</li>
<li>Knowledge of GPU/TPU architectures and language model inference optimization</li>
<li>Experience with cloud platforms (AWS, GCP) at enterprise scale</li>
<li>Familiarity with VM and container orchestration</li>
<li>Experience with workflow orchestration tools and experiment management systems</li>
<li>History of working with large-scale reinforcement learning</li>
<li>Comfort with large-scale data pipelines (Beam, Spark, Dask, …)</li>
</ul>
<p><strong>Logistics</strong></p>
<ul>
<li>Education requirements: We require at least a Bachelor&#39;s degree in a related field or equivalent experience.</li>
<li>Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</li>
<li>Visa sponsorship: We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</li>
</ul>
<p><strong>We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you&#39;re interested in this work.</strong></p>
<p><strong>Your safety matters to us. To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses. In some cases, we may partner with vetted recruiting agencies who will identify themselves as working on behalf of Anthropic. Be cautious of emails from other domains. Legitimate Anthropic recruiters will never ask for money, fees, or banking information before your first day. If you&#39;re ever unsure about a communication, don&#39;t click any links—visit anthropic.com/careers directly for confirmed position openings.</strong></p>
<p><strong>How we&#39;re different</strong></p>
<p>We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale projects, and we&#39;re committed to making a positive impact on the world.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$350,000 - $850,000 USD</Salaryrange>
      <Skills>infrastructure engineering, large-scale distributed systems, performance optimization, containerization technologies, orchestration at scale, data pipelines, distributed storage systems, complex infrastructure challenges, ML stack, workflow orchestration tools, experiment management systems, large-scale reinforcement learning, language model training infrastructure, distributed ML frameworks, GPU/TPU architectures, language model inference optimization, cloud platforms, VM and container orchestration</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic is a company that aims to create reliable, interpretable, and steerable AI systems. It has a team of researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</Employerdescription>
      <Employerwebsite>https://www.anthropic.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/4669581008</Applyto>
      <Location>San Francisco, CA</Location>
      <Country></Country>
      <Postedate>2026-03-08</Postedate>
    </job>
  </jobs>
</source>