Technical Program Manager, Safeguards (Infrastructure & Evals)

c4e35d55-5d1 Technical Program Manager, Safeguards (Infrastructure & Evals) Job Title: Technical Program Manager, Safeguards (Infrastructure & Evals)

About Anthropic

Anthropic's mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole.

About the Role

Safeguards Engineering builds and operates the infrastructure that keeps Anthropic's AI systems safe in production: the classifiers, detection pipelines, evaluation platforms, and monitoring systems that sit between our models and the real world. That infrastructure needs to be not just correct, but reliable: when a safety-critical pipeline goes down or degrades, the consequences can be serious, and they can be invisible until someone looks closely.

As a Technical Program Manager for Safeguards Infrastructure and Evals, you'll own the operational health and forward momentum of this stack. Your primary responsibility is driving reliability: owning the incident-response and post-mortem process, ensuring SLOs are defined and met in partnership with various teams, and making sure that when things go wrong, the right people know, the right actions get taken, and those actions actually get closed out.

Alongside that ongoing operational rhythm, you'll coordinate the larger platform investments: migrations, eval-platform improvements, and the cross-team dependencies that connect them. This role sits at the intersection of operations and program management. It requires genuine technical depth: you need to understand how these systems work well enough to triage effectively, judge what's actually safety-critical versus what can wait, and have informed conversations with the engineers building and maintaining them. But the core of the job is keeping the machine running well and the work moving.

What You'll Do:

Own the Safeguards Engineering ops review
Drive the recurring cadence that keeps the team informed and coordinated: surfacing recent incidents and failures, bringing visibility to reliability trends, and making sure the right people are in the room when decisions need to be made.
Drive incident tracking and post-mortem execution
Establish and maintain SLOs with partner teams
Maintain runbook quality and incident-ownership clarity
Drive platform migrations and infrastructure projects
Coordinate evals platform improvements

You might be a good fit if you:

Have solid technical program management experience, particularly in operational or infrastructure-heavy environments. You're comfortable owning a mix of ongoing operational cadences and discrete project work simultaneously.
Understand how production ML systems work well enough to triage incidents intelligently and have substantive conversations with engineers about what's going wrong and why. You don't need to write the code, but you need to follow the technical thread.
Are energized by closing loops. Post-mortem action items that never get done, SLOs that no one checks, runbooks that go stale: these things bother you, and you know how to build the processes and follow-ups that fix them.
Can work effectively across team boundaries: you're comfortable coordinating with partner teams (like Inference) where you don't have direct authority, and skilled at keeping shared work moving through influence and clear communication.
Thrive in environments where the work shifts between 'keep the lights on' and 'build something new', and can context-switch between incident follow-ups and longer-horizon platform projects without dropping either.
Have experience with or strong interest in AI safety: you understand why the reliability of a safety-critical pipeline is a different kind of problem than the reliability of a product feature, and that distinction motivates you.

Strong candidates may also:

Have experience with SRE practices, incident management frameworks, or on-call operations at scale.
Have worked on or with evaluation infrastructure for ML systems, including an understanding of how evals get designed, run, and interpreted.
Have experience driving infrastructure migrations in complex, multi-team environments, particularly where the migration touches operational systems that can't go offline.
Be familiar with monitoring and alerting tooling (PagerDuty, Datadog, or equivalents) and the operational culture around them.

Deadline to apply: None. Applications will be reviewed on a rolling basis.

The annual compensation range for this role is listed below. For sales roles, the range provided is the role's On Target Earnings ('OTE') range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role.

Annual Salary: $290,000-$365,000 USD

XML job scraping automation by YubHub

]]> full-time mid hybrid $290,000-$365,000 USD Technical Program Management, Operational or Infrastructure-heavy environments, Production ML systems, Incident management frameworks, On-call operations, Evaluation infrastructure for ML systems, Infrastructure migrations, Monitoring and alerting tooling Engineering Technology Anthropic https://logos.yubhub.co/anthropic.com.png Anthropic is a technology company focused on developing artificial intelligence systems. https://www.anthropic.com/ https://job-boards.greenhouse.io/anthropic/jobs/5108695008 San Francisco, CA | New York City, NY | Seattle, WA 2026-04-18 ca221b6f-dca Technical Program Manager, Safeguards (Infrastructure & Evals) About the Role

Safeguards Engineering builds and operates the infrastructure that keeps Anthropic's AI systems safe in production. As a Technical Program Manager for Safeguards Infrastructure and Evals, you'll own the operational health and forward momentum of this stack.

Your primary responsibility is driving reliability: owning the incident-response and post-mortem process, ensuring SLOs are defined and met in partnership with various teams, and making sure that when things go wrong, the right people know, the right actions get taken, and those actions actually get closed out.

Alongside that ongoing operational rhythm, you'll coordinate the larger platform investments: migrations, eval-platform improvements, and the cross-team dependencies that connect them.

This role sits at the intersection of operations and program management. It requires genuine technical depth: you need to understand how these systems work well enough to triage effectively, judge what's actually safety-critical versus what can wait, and have informed conversations with the engineers building and maintaining them.

But the core of the job is keeping the machine running well and the work moving.

Responsibilities

Own the Safeguards Engineering ops review
Drive the recurring cadence that keeps the team informed and coordinated: surfacing recent incidents and failures, bringing visibility to reliability trends, and making sure the right people are in the room when decisions need to be made.
Drive incident tracking and post-mortem execution
Establish and maintain SLOs with partner teams
Maintain runbook quality and incident-ownership clarity
Drive platform migrations and infrastructure projects
Coordinate evals platform improvements

Requirements

Solid technical program management experience, particularly in operational or infrastructure-heavy environments
Understanding of how production ML systems work well enough to triage incidents intelligently and have substantive conversations with engineers about what's going wrong and why
Ability to work effectively across team boundaries
Experience with or strong interest in AI safety

Nice to Have

Experience with SRE practices, incident management frameworks, or on-call operations at scale
Familiarity with monitoring and alerting tooling (PagerDuty, Datadog, or equivalents)
Experience driving infrastructure migrations in complex, multi-team environments

]]> full-time senior hybrid $290,000-$365,000 USD Technical Program Management, Operational or Infrastructure-heavy Environments, Production ML Systems, Incident Tracking and Post-Mortem Execution, Service-Level Objectives (SLOs), Runbook Quality and Incident-Ownership Clarity, Platform Migrations and Infrastructure Projects, Evals Platform Improvements, SRE Practices, Incident Management Frameworks, On-Call Operations at Scale, Monitoring and Alerting Tooling, Infrastructure Migrations in Complex, Multi-Team Environments Engineering Technology Anthropic https://logos.yubhub.co/anthropic.ai.png Anthropic develops artificial intelligence systems. It has a growing team of researchers, engineers, and business leaders. https://anthropic.ai/ https://job-boards.greenhouse.io/anthropic/jobs/5108695008 San Francisco, CA | New York City, NY | Seattle, WA 2026-04-18 bd9625d9-99b ML Infrastructure Engineer, Safeguards We are seeking a Machine Learning Infrastructure Engineer to join our Safeguards organization, where you'll build and scale the critical infrastructure that powers our AI safety systems.

As part of the Safeguards team, you'll design and implement ML infrastructure that powers Claude safety. Your work will directly contribute to making AI systems more trustworthy and aligned with human values, ensuring our models operate safely as they become more capable.

Responsibilities:

Design and build scalable ML infrastructure to support real-time and batch classifier and safety evaluations across our model ecosystem
Build monitoring and observability tools to track model performance, data quality, and system health for safety-critical applications
Collaborate with research teams to productionize safety research, translating experimental safety techniques into robust, scalable systems
Optimize inference latency and throughput for real-time safety evaluations while maintaining high reliability standards
Implement automated testing, deployment, and rollback systems for ML models in production safety applications
Partner with Safeguards, Security, and Alignment teams to understand requirements and deliver infrastructure that meets safety and production needs
Contribute to the development of internal tools and frameworks that accelerate safety research and deployment

You may be a good fit if you:

Have 5+ years of experience building production ML infrastructure, ideally in safety-critical domains like fraud detection, content moderation, or risk assessment
Are proficient in Python and have experience with ML frameworks like PyTorch, TensorFlow, or JAX
Have hands-on experience with cloud platforms (AWS, GCP) and container orchestration (Kubernetes)
Understand distributed systems principles and have built systems that handle high-throughput, low-latency workloads
Have experience with data engineering tools and building robust data pipelines (e.g., Spark, Airflow, streaming systems)
Are results-oriented, with a bias towards reliability and impact in safety-critical systems
Enjoy collaborating with researchers and translating cutting-edge research into production systems
Care deeply about AI safety and the societal impacts of your work

Strong candidates may have experience with:

Working with large language models and modern transformer architectures
Implementing A/B testing frameworks and experimentation infrastructure for ML systems
Developing monitoring and alerting systems for ML model performance and data drift
Building automated labeling systems and human-in-the-loop workflows
Experience in trust & safety, fraud prevention, or content moderation domains
Knowledge of privacy-preserving ML techniques and compliance requirements
Contributing to open-source ML infrastructure projects

We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed.

]]> full-time senior hybrid $320,000-$405,000 USD Python, PyTorch, TensorFlow, JAX, Cloud platforms (AWS, GCP), Container orchestration (Kubernetes), Distributed systems principles, Data engineering tools (Spark, Airflow, streaming systems), Large language models and modern transformer architectures, A/B testing frameworks and experimentation infrastructure for ML systems, Monitoring and alerting systems for ML model performance and data drift, Automated labeling systems and human-in-the-loop workflows, Trust & safety, fraud prevention, or content moderation domains, Privacy-preserving ML techniques and compliance requirements, Open-source ML infrastructure projects Engineering Technology Anthropic https://logos.yubhub.co/anthropic.com.png Anthropic is a public benefit corporation that focuses on creating reliable, interpretable, and steerable AI systems. https://www.anthropic.com/ https://job-boards.greenhouse.io/anthropic/jobs/4778843008 San Francisco, CA 2026-04-18 26bff84c-def Senior/Staff Platform Engineer/SRE About the Role We are seeking a Senior Platform Engineer who will design, develop, and deploy robust platform solutions to ensure the reliability, scalability, and security of our system.

Responsibilities

Identify and build AI-powered capabilities into Flow's platform, from intelligent automation in building operations to personalized resident experiences.
Use AI-assisted development tools (e.g., Cursor, Claude Code) as part of your daily workflow to accelerate development, improve code quality, and push the boundaries of what a small team can ship.
Collaborate with product and engineering teams to define clear requirements and translate them into software solutions.
Serve as a core contributor to implementing foundational infrastructure, tooling, and automation that is scalable, reliable, and secure.
Elevate site reliability engineering best practices while collaborating with back-end developers.
Develop service-level tooling to enhance productionization, data migrations, system hardening, and related initiatives.
Manage and optimize a multi-region environment.
Be available for on-call activities for infrastructure and services.

Ideal Background

A minimum of 10 years in software engineering, site reliability engineering, or platform engineering.
Fluency with AI-assisted development tools and a strong point of view on how AI changes the way software gets built.
Ability to design, implement and maintain the tools and systems that support service reliability, monitoring, and alerting.
Deep understanding of the principles of ensuring high availability, fault tolerance, and efficiency in distributed systems.
Experience with Infrastructure as Code (IaC): Proficiency with Terraform.
Experience with Kubernetes.
Experience administering cloud-based infrastructure (GCP preferred).
Experience troubleshooting production issues related to cloud infrastructure, configuration, monitoring, deployments, continuous integration and delivery.
A keen ability to balance elegant design with pragmatic tradeoffs, prioritizing continuous delivery of business value.
Ability to quickly learn and adapt to new skillsets.
Experience building software in fast-moving startup environments.
Participate in incident response and post-mortems to identify and address systemic issues.

Additional Information

Benefits

Comprehensive Benefits Package (Medical / Dental / Vision / Disability / Life)
Paid time off and 13 paid holidays
401(k) retirement plan
Healthcare and Dependent Care Flexible Spending Accounts (FSAs)
Access to HSA-compatible plans
Pre-tax commuter benefits
Employee Assistance Program (EAP), free therapy through SpringHealth, acupuncture, and other wellness offerings

]]> full-time senior hybrid $180,000-275,000 per year AI-assisted development tools, Terraform, Kubernetes, Cloud-based infrastructure administration, Site reliability engineering, Monitoring and alerting, Service-level tooling, Multi-region environment management Engineering Technology Flow https://logos.yubhub.co/flow.com.png Flow is a real estate company that operates a technology platform and operations ecosystem spanning condominiums, hotels, multifamily residences, and office spaces. https://flow.com https://jobs.lever.co/flowlife/3ae47b09-e4b4-41be-9312-fafb1d85cf4d Palo Alto 2026-04-17 f8883394-0fc Solutions Architect, AI and ML We are looking for an experienced Cloud Solution Architect to help customers adopt GPU hardware and software and build and deploy Machine Learning (ML), Deep Learning (DL), and data analytics solutions on various cloud computing platforms.

As a Solutions Architect, you will engage directly with developers, researchers, and data scientists at some of NVIDIA’s most strategic technology customers, and work directly with business and engineering teams on product strategy.

Key Responsibilities:

Help cloud customers craft, deploy, and maintain scalable, GPU-accelerated inference pipelines on cloud ML services and Kubernetes for large language models (LLMs) and generative AI workloads.
Enhance performance tuning using TensorRT/TensorRT-LLM, vLLM, Dynamo, and Triton Inference Server to improve GPU utilization and model efficiency.
Collaborate with multi-functional teams (engineering, product) and offer technical mentorship to cloud customers implementing AI inference at scale.
Build custom PoCs for solutions that address customers’ critical business needs using NVIDIA hardware and software technology
Partner with Sales Account Managers or Developer Relations Managers to identify and secure new business opportunities for NVIDIA products and solutions for ML/DL and other software solutions
Prepare and deliver technical content to customers including presentations about purpose-built solutions, workshops about NVIDIA products and solutions, etc.
Conduct regular technical customer meetings for project/product roadmap, feature discussions, and intro to new technologies. Establish close technical ties to the customer to facilitate rapid resolution of customer issues

Requirements:

BS/MS/PhD in Electrical/Computer Engineering, Computer Science, Statistics, Physics, or other Engineering fields or equivalent experience.
3+ years in Solutions Architecture with a proven track record of moving AI inference from POC to production in cloud computing environments including AWS, GCP, or Azure
3+ years of hands-on experience with Deep Learning frameworks such as PyTorch and TensorFlow
Excellent knowledge of the theory and practice of LLM and DL inference
Strong fundamentals in programming, optimizations, and software design, especially in Python
Experience with containerization and orchestration technologies like Docker and Kubernetes, monitoring, and observability solutions for AI deployments
Knowledge of inference technologies such as NVIDIA NIM, TensorRT-LLM, Dynamo, Triton Inference Server, and vLLM
Proficiency in problem-solving and debugging in GPU environments
Excellent presentation, communication and collaboration skills

Nice to Have:

AWS, GCP or Azure Professional Solution Architect Certification.
Experience optimizing and deploying large MoE LLMs at scale
Active contributions to open-source AI inference projects (e.g., vLLM, TensorRT-LLM, Dynamo, SGLang, Triton or similar)
Experience with Multi-GPU Multi-node Inference technologies like Tensor Parallelism/Expert Parallelism, Disaggregated Serving, LWS, MPI, EFA/Infiniband, NVLink/PCIe, etc
Experience developing and integrating monitoring and alerting solutions using Prometheus, Grafana, and NVIDIA DCGM, and with GPU performance analysis tools like NVIDIA Nsight Systems

]]> full-time senior onsite Cloud Solution Architecture, GPU hardware and Software, Machine Learning (ML), Deep Learning (DL), Data Analytics, Cloud Computing Platforms, Kubernetes, TensorRT, TensorRT-LLM, vLLM, Dynamo, Triton Inference Server, Python, Containerization, Orchestration, Monitoring, Observability, Inference technologies, NVIDIA NIM, Problem-solving, Debugging, GPU environments, AWS, GCP, Azure, Professional Solution Architect Certification, Large MoE LLMs, Open-source AI inference projects, Multi-GPU Multi-node Inference technologies, Monitoring and alerting solutions, Prometheus, Grafana, NVIDIA DCGM, GPU performance Analysis, NVIDIA Nsight Systems Engineering Technology NVIDIA https://logos.yubhub.co/nvidia.com.png NVIDIA is a leading technology company that specializes in designing and manufacturing graphics processing units (GPUs) and high-performance computing hardware. https://nvidia.wd5.myworkdayjobs.com https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-WA-Redmond/Solutions-Architect--AI-and-ML_JR2005988-1 Redmond, CA, Santa Clara, Seattle 2026-03-09 6cc383e0-ff6 ML Infrastructure Engineer, Safeguards About the role

We are seeking a Machine Learning Infrastructure Engineer to join our Safeguards organization, where you'll build and scale the critical infrastructure that powers our AI safety systems. You'll work at the intersection of machine learning, large-scale distributed systems, and AI safety, developing the platforms and tools that enable our safeguards to operate reliably at scale.

Responsibilities:

Design and build scalable ML infrastructure to support real-time and batch classifier and safety evaluations across our model ecosystem
Build monitoring and observability tools to track model performance, data quality, and system health for safety-critical applications
Collaborate with research teams to productionize safety research, translating experimental safety techniques into robust, scalable systems
Optimize inference latency and throughput for real-time safety evaluations while maintaining high reliability standards
Implement automated testing, deployment, and rollback systems for ML models in production safety applications
Partner with Safeguards, Security, and Alignment teams to understand requirements and deliver infrastructure that meets safety and production needs
Contribute to the development of internal tools and frameworks that accelerate safety research and deployment

You may be a good fit if you:

Have 5+ years of experience building production ML infrastructure, ideally in safety-critical domains like fraud detection, content moderation, or risk assessment
Are proficient in Python and have experience with ML frameworks like PyTorch, TensorFlow, or JAX
Have hands-on experience with cloud platforms (AWS, GCP) and container orchestration (Kubernetes)
Understand distributed systems principles and have built systems that handle high-throughput, low-latency workloads
Have experience with data engineering tools and building robust data pipelines (e.g., Spark, Airflow, streaming systems)
Are results-oriented, with a bias towards reliability and impact in safety-critical systems
Enjoy collaborating with researchers and translating cutting-edge research into production systems
Care deeply about AI safety and the societal impacts of your work

Strong candidates may have experience with:

Working with large language models and modern transformer architectures
Implementing A/B testing frameworks and experimentation infrastructure for ML systems
Developing monitoring and alerting systems for ML model performance and data drift
Building automated labeling systems and human-in-the-loop workflows
Experience in trust & safety, fraud prevention, or content moderation domains
Knowledge of privacy-preserving ML techniques and compliance requirements
Contributing to open-source ML infrastructure projects

Deadline to apply:

None. Applications will be reviewed on a rolling basis.

Logistics

Education requirements: We require at least a Bachelor's degree in a related field or equivalent experience.
Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.
Visa sponsorship: We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.

We encourage you to apply even if you do not believe you meet every single qualification.

Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you're interested in this work.

Your safety matters to us.

To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses. In some cases, we may partner with vetted recruiting agencies who will identify themselves as working on behalf of Anthropic. Be cautious of emails from other domains. Legitimate Anthropic recruiters will never ask for money, fees, or banking information before your first day. If you're ever unsure about a communication, don't click any links—visit anthropic.com/careers directly for confirmed position openings.

How we're different

We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact — advancing the state of the art in AI safety and making a meaningful difference in the world.

]]> full-time senior hybrid $320,000 - $405,000 USD Python, PyTorch, TensorFlow, JAX, AWS, GCP, Kubernetes, Spark, Airflow, streaming systems, large language models, modern transformer architectures, A/B testing frameworks, experimentation infrastructure, monitoring and alerting systems, automated labeling systems, human-in-the-loop workflows, trust & safety, fraud prevention, content moderation domains, privacy-preserving ML techniques, compliance requirements Engineering Technology Anthropic https://logos.yubhub.co/anthropic.com.png Anthropic is a company that creates reliable, interpretable, and steerable AI systems. It has a quickly growing team of researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems. https://job-boards.greenhouse.io https://job-boards.greenhouse.io/anthropic/jobs/4778843008 San Francisco, CA 2026-03-08 3514d749-08c Senior Support Engineer Senior Support Engineer - San Francisco

Location

San Francisco

Employment Type

Full time

Department

Compensation

$234K – $260K • Offers Equity

The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.

Benefits

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts

Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)

401(k) retirement plan with employer match

Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)

Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees

13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)

Mental health and wellness support

Employer-paid basic life and disability coverage

Annual learning and development stipend to fuel your professional growth

Daily meals in our offices, and meal delivery credits as eligible

Relocation support for eligible employees

Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.

About the Team

The Technical Support team is responsible for ensuring that developers and enterprises can reliably build mission critical solutions using OpenAI models. We provide technical guidance, resolve complex issues and support customers in maximizing value and adoption from deploying our highly-capable models. We work closely with Technical Success, Product, Engineering and others to deliver the best possible experience to our customers at scale. We think from an automation-first mindset and leverage the latest in AI to scale our support operations. Join the Senior Support Engineering (SSE) team at OpenAI and help shape the future of Technical Support in the age of AI.

About the Role

We are looking for a Senior Support Engineer to collaborate directly with our strategic enterprise accounts and product teams, helping solve some of the most difficult problems faced by our Customers. You will be part of the best technical troubleshooting team at OpenAI, and our Customers and Engineering teams will look to you for technical guidance in addressing the most technically difficult issues in our environment.

As a Senior Support Engineer, you will design and run operational processes to monitor our top strategic customers and a 24x7 response team. You’ll work closely with our Infrastructure and Engineering teams to deliver the best possible experience to customers at scale. Working directly with our most strategic customers, you will be crucial to the success of the most innovative, disruptive, and high-scale AI solutions being built with the OpenAI API platform.

The nature of this role will be low volume, high difficulty.

This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.

In this role, you will:

Be among the foremost technical and troubleshooting experts for our API platform at OpenAI. You are the last line of defense before the core Engineering team.

Proactively identify and implement opportunities to scale support operations by leveraging automation and advancements in AI technologies. Contribute to shaping the future of technical support in an AI-driven era.

Configure and use advanced monitoring and alerting workflows to proactively detect customer impacting issues in real time.

In partnership with engineering, contribute to reliability reviews and preparedness for new features, launches, or strategic customer requirement updates. Ensure that operational readiness (monitoring, alerting, and fallback plans) is in place for any such changes.

Design and refine incident response processes and documentation across strategic customers, engineering and support teams.

Analyze operational metrics and incident RCAs to identify areas for improvement. Proactively recommend and implement enhancements to monitoring dashboards, alert configurations, and support workflows.

Provide support coverage during holidays and weekends based on business needs.

You might thrive in this role if you:

Have a Bachelor’s degree in Computer Science or a related field. A strong software engineering foundation is important for this role’s success.

Have 8+ years of experience in technical operations roles such as SRE/NOC, designing monitoring systems and resolving production issues in fast-paced and mission-critical environments. A strong track record of troubleshooting complex technical problems at the systems level.

Have deep familiarity with modern monitoring, alerting, and observability practices. Hands‑on experience setting up or managing metrics, logging, and tracing for distributed systems (e.g., understanding of SLIs/SLOs, alert tuning, dashboard creation).

Have proven experience leading incident response for high‑severity outages or service disruptions. Able to perform real‑time incident coordination, root cause analysis, and communication with stakeholders.

Are able to work effectively in a fast-paced environment, prioritize tasks, and manage multiple projects simultaneously.

Are a strong communicator and team player, with excellent written and verbal communication skills.

Are able to adapt to changing priorities and requirements, and are flexible in your approach to problem-solving.

XML job scraping automation by YubHub

]]> full-time senior hybrid $234K – $260K Bachelor’s degree in Computer Science or a related field, 8+ years of experience in technical operations roles such as SRE/NOC, Designing monitoring systems and resolving production issues in fast-paced and mission-critical environments, Troubleshooting complex technical problems at the systems level, Modern monitoring, alerting, and observability practices, Metrics, logging, and tracing for distributed systems, SLIs/SLOs, alert tuning, dashboard creation, Incident response for high‑severity outages or service disruptions, Real-time incident coordination, root cause analysis, and communication with stakeholders, Automation and advancements in AI technologies, Automation-first mindset and leveraging the latest in AI to scale support operations, Technical and troubleshooting expertise for API platform at OpenAI, Proactive identification and implementation of opportunities to scale support operations, Advanced monitoring and alerting workflows to proactively detect customer impacting issues in real time, Reliability reviews and preparedness for new features, launches, or strategic customer requirement updates, Operational readiness (monitoring, alerting, and fallback plans), Incident response processes and documentation across strategic customers, engineering and support teams, Operational metrics and incident RCAs to identify areas for improvement, Enhancements to monitoring dashboards, alert configurations, and support workflows Engineering Technology OpenAI https://logos.yubhub.co/openai.com.png OpenAI is a technology company that develops and offers artificial intelligence (AI) models and tools. It was founded in 2015 and is headquartered in San Francisco, California. https://jobs.ashbyhq.com https://jobs.ashbyhq.com/openai/5431666c-530b-49c0-b67e-32477f9eaf5e San Francisco 2026-03-06

70806a42-556 Senior Support Engineer

Senior Support Engineer - Dublin

Location

Dublin, Ireland

Employment Type

Full time

Department

About the Team

About the Role

The nature of this role will be low volume, high difficulty.

This role is based in Dublin, Ireland. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.

In this role, you will:

Be among the foremost technical and troubleshooting experts for our API platform at OpenAI. You are the last line of defense before the core Engineering team.

Proactively identify and implement opportunities to scale support operations by leveraging automation and advancements in AI technologies. Contribute to shaping the future of technical support in an AI-driven era.

Configure and use advanced monitoring and alerting workflows to proactively detect customer-impacting issues in real time.

In partnership with engineering, contribute to reliability reviews and preparedness for new features, launches, or strategic customer requirement updates. Ensure that operational readiness (monitoring, alerting, and fallback plans) is in place for any such changes.

Design and refine incident response processes and documentation across strategic customers, engineering and support teams.

Analyze operational metrics and incident RCAs to identify areas for improvement. Proactively recommend and implement enhancements to monitoring dashboards, alert configurations, and support workflows.

Provide support coverage during holidays and weekends based on business needs.

You might thrive in this role if you:

Have a Bachelor’s degree in Computer Science or a related field. A strong software engineering foundation is important for this role’s success.

Have 5+ years of experience in technical operations roles such as SRE/NOC, designing monitoring systems and resolving production issues in fast-paced and mission-critical environments. A strong track record of troubleshooting complex technical problems at the systems level.

Have deep familiarity with modern monitoring, alerting, and observability practices. Hands‑on experience setting up or managing metrics, logging, and tracing for distributed systems (e.g., understanding of SLIs/SLOs, alert tuning, dashboard creation).

Have proven experience leading incident response for high‑severity outages or service disruptions. Able to perform real‑time incident coordination, root cause analysis, and drive follow‑ups (post‑mortems, action items) to prevent recurrence. Knowledge of industry best practices for incident management and fault diagnosis.

Have strong skills in scripting or software engineering (e.g., Python or similar) to automate repetitive tasks and integrate tools.

Have solid understanding of cloud infrastructure and distributed systems fundamentals. Comfortable working with cloud services, load balancers, databases, and containerized applications.

Are effective at working cross‑functionally in a high‑trust environment. Strong communication skills to explain technical issues and resolutions to both engineering and non‑technical stakeholders. You can coordinate efforts across teams and are comfortable providing updates in the midst of an ongoing incident.

Compensation, Benefits and Perks

This is a position with OpenAI Ireland Ltd., which controls the hiring and management of this position.

Total compensation includes an annual salary, generous equity, and benefits.

Medical, dental, and vision insurance for you and your family

Mental health and wellness support

PRSA plan with 8% employer matching

Unlimited time off

Annual learning & development stipend ($1,500 USD equivalent per year)

#LI-NM2

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

]]> full-time senior hybrid Python, Cloud infrastructure, Distributed systems, Monitoring and alerting, Observability, Scripting, Software engineering, Cloud services, Load balancers, Databases, Containerized applications, SLIs/SLOs, Alert tuning, Dashboard creation, Incident management, Fault diagnosis, Cross-functional collaboration, Communication Engineering Technology OpenAI https://logos.yubhub.co/openai.com.png OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. https://jobs.ashbyhq.com https://jobs.ashbyhq.com/openai/988016e1-de50-42be-925a-438b97291c5d Dublin 2026-03-06

e38e0353-95c Senior Support Engineer

Senior Support Engineer - Tokyo

Location

Tokyo, Japan

Employment Type

Full time

Department

About the Team

About the Role

The nature of this role will be low volume, high difficulty.

This role is based in Tokyo, Japan. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.

In this role, you will:

Be among the foremost technical and troubleshooting experts for our API platform at OpenAI. You are the last line of defense before the core Engineering team.

Proactively identify and implement opportunities to scale support operations by leveraging automation and advancements in AI technologies. Contribute to shaping the future of technical support in an AI-driven era.

Configure and use advanced monitoring and alerting workflows to proactively detect customer-impacting issues in real time.

In partnership with engineering, contribute to reliability reviews and preparedness for new features, launches, or strategic customer requirement updates. Ensure that operational readiness (monitoring, alerting, and fallback plans) is in place for any such changes.

Design and refine incident response processes and documentation across strategic customers, engineering and support teams.

Analyze operational metrics and incident RCAs to identify areas for improvement. Proactively recommend and implement enhancements to monitoring dashboards, alert configurations, and support workflows.

Provide support coverage during holidays and weekends based on business needs.

You might thrive in this role if you:

Have a Bachelor’s degree in Computer Science or a related field. A strong software engineering foundation is important for this role’s success.

Have 8+ years of experience in technical operations roles such as SRE/NOC, designing monitoring systems and resolving production issues in fast-paced and mission-critical environments. A strong track record of troubleshooting complex technical problems at the systems level.

Have deep familiarity with modern monitoring, alerting, and observability practices. Hands‑on experience setting up or managing metrics, logging, and tracing for distributed systems (e.g., understanding of SLIs/SLOs, alert tuning, dashboard creation).

Have proven experience leading incident response for high‑severity outages or service disruptions. Able to perform real‑time incident coordination, root cause analysis, and drive follow‑ups (post‑mortems, action items) to prevent recurrence. Knowledge of industry best practices for incident management and fault diagnosis.

Have strong skills in scripting or software engineering (e.g., Python or similar) to automate repetitive tasks and integrate tools.

Have solid understanding of cloud infrastructure and distributed systems fundamentals. Comfortable working with cloud services, load balancers, databases, and containerized applications.

Are effective at working cross‑functionally in a high‑trust environment. Strong communication skills to explain technical issues and resolutions to both engineering and non‑technical stakeholders. You can coordinate efforts across teams and are comfortable providing updates in the midst of an ongoing incident.

About OpenAI

]]> full-time senior hybrid Python, Cloud infrastructure, Distributed systems, Monitoring and alerting, Observability, Scripting, Software engineering, Cloud services, Load balancers, Databases, Containerized applications, Automation, AI technologies, Incident response, Reliability reviews, Post-mortems, Action items, Cross-functional collaboration, Communication, Technical writing Engineering Technology OpenAI https://logos.yubhub.co/openai.com.png OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. https://jobs.ashbyhq.com https://jobs.ashbyhq.com/openai/b2fd550d-3e04-434e-bb91-c5b7bc8ac8b7 Tokyo, Japan 2026-03-06

fb4acb2b-bab Security Reliability Engineering, Lead

Security Reliability Engineering, Lead

Location

San Francisco

Employment Type

Full time

Department

Security

Compensation

$293K – $385K

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts

Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)

401(k) retirement plan with employer match

Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)

Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees

13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)

Mental health and wellness support

Employer-paid basic life and disability coverage

Annual learning and development stipend to fuel your professional growth

Daily meals in our offices, and meal delivery credits as eligible

Relocation support for eligible employees

Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.

More details about our benefits are available to candidates during the hiring process.

This role is at-will and OpenAI reserves the right to modify base pay and other compensation components at any time based on individual performance, team or company results, or market conditions.

About the Team

The Infrastructure Engineering function sits within IT and is responsible for reliably building, deploying, and operating critical on-prem and hybrid environments that power internal services and critical R&D environments.

This is a new, bootstrap team focused on applying strong Site Reliability Engineering discipline to environments where uptime, safety, recoverability, and security are non-negotiable. The team replaces bespoke, one-off infrastructure with standardized infrastructure-as-code building blocks that compound reliability and operational leverage as OpenAI scales.

About the Role

We are looking for a Security Reliability Engineering Lead to design, build, and operate reliable, secure, and scalable infrastructure that underpins identity, access, endpoint, and shared platform services across the company.

In this role, you will own infrastructure and identity systems end to end, from foundational design and provisioning through policy enforcement, upgrades, recovery, and day-two operations. You will establish durable, production-grade platforms that remove operational friction, enforce security by default, and enable teams to move faster with confidence.

This role is well suited for a senior engineer who thrives in ambiguity, enjoys owning complex systems end to end, and raises the reliability and security bar by replacing fragile implementations with standardized, repeatable infrastructure.

This role is based in our San Francisco HQ and requires in-office presence.

In this role, you will:

Set direction and establish strong foundations

Define and evolve infrastructure patterns for on-prem and hybrid environments, including self-hosted platforms, vendor-supported systems, and lab environments.

Establish standardized, production-grade deployment and operational models that replace bespoke implementations.

Partner with IT, Security, Identity, and Network teams to ensure infrastructure meets reliability, security, and access requirements by design.

Design and mature the production architecture for IAM-adjacent platforms such as Microsoft Entra using SRE principles.

Establish common management rules and shared resources within Azure subscriptions to ensure consistent, policy-aligned operations.

Build, operate, and scale reliably

Own the full lifecycle of infrastructure systems, including deployment, upgrades, patching, recovery, and ongoing operations.

Operate and harden shared infrastructure provisioned through Infra Terraform, ensuring repeatability, auditability, and safe change management.

Design and implement infrastructure as code and configuration management to support shared services, identity-adjacent systems, and endpoint platforms using tools like Chef, Ansible, and Terraform.

Build and operate monitoring, alerting, and incident response mechanisms to meet high availability and recoverability targets.

Lead incident response and postmortems across infrastructure, identity-adjacent platforms, and fleet systems, driving durable fixes and shared learning.

Build and operate containerized and platform services, including Kubernetes and Docker-based workloads, using DevOps practices that emphasize reliability, repeatability, and safe change management.

Use Git-based workflows as the source of truth for infrastructure and policy changes, enabling review, auditability, and safe, reversible automation.

Automate for leverage and safety

Identify high-leverage automation opportunities that eliminate manual toil and reduce operational risk across infrastructure and access-related systems.

Implement guardrails, safety mechanisms, and progressive rollout patterns for infrastructure and policy enforcement changes.

Ensure automation is safe, observable, and resilient under failure conditions, particularly for shared services and high-blast-radius systems.

]]> full-time senior onsite $293K – $385K Security Reliability Engineering, Infrastructure as Code, Cloud Computing, Containerization, DevOps, Git, Terraform, Ansible, Chef, Kubernetes, Docker, Microsoft Entra, Azure, Identity and Access Management, Endpoint Security, Platform Services, Site Reliability Engineering, Cloud Security, Container Orchestration, Infrastructure Automation, Monitoring and Alerting, Incident Response, Postmortem Analysis, DevOps Practices, Cloud-Native Applications, Microservices Architecture Engineering Technology OpenAI https://logos.yubhub.co/openai.com.png OpenAI is a technology company that specializes in artificial intelligence. It was founded in 2015 and is headquartered in San Francisco. https://jobs.ashbyhq.com https://jobs.ashbyhq.com/openai/645ccd65-eb60-4eb7-b094-b01c2269638c San Francisco 2026-03-06

cb538332-6a9 Senior/Staff Web Platform Engineer

We are looking for a Senior/Staff Web Platform Engineer to join our team. The successful candidate will be responsible for building and maintaining the systems that enable our product teams to deliver high-quality, performant single-page web applications across desktop, mobile web, and the Comet browser.

What you'll do

Optimize performance for critical web flows (search, answer rendering, browsing), with a focus on improving application response speed and perceived latency.
Design and implement solutions for customizable user interfaces and experience components across Perplexity and Comet.

What you need

6+ years of practical experience as a software engineer with a strong focus on web technologies.
Deep expertise in modern JavaScript frameworks, particularly React.
Experience optimizing web application performance and working with metrics such as time-to-first-byte, time-to-interactive, and application response speed.

]]> full-time senior onsite $250K - $405K JavaScript, React, Web performance optimization, TypeScript, Frontend monitoring and alerting systems Engineering Technology Perplexity AI https://logos.yubhub.co/perplexity.com.png Perplexity AI's mission is to build the world's best answer engine and AI-native browser that make finding and understanding information effortless for everyone. https://jobs.ashbyhq.com https://jobs.ashbyhq.com/perplexity/cf179df1-3d69-4a9d-bda0-0c423efa9255 San Francisco, New York City, Seattle 2026-03-04

bed7736d-0a7 Browser Infrastructure Engineer

This role exists to build reliable, automated, and scalable infrastructure for Chromium-based browser teams. As a Browser Infrastructure Engineer, you will focus on CI/CD pipelines, monitoring, and development environments to support fast-paced browser innovation.

What you'll do

You will set up and maintain CI/CD pipelines for builds and testing, support and evolve Chromium browser development infrastructure, configure monitoring and alerting systems, manage cloud infrastructure, develop automation scripts, and ensure high availability, resilience, and security of development infrastructure.

What you need

You will need 3+ years of experience in software development infrastructure, preferably for Chromium browsers; hands-on DevOps and SRE experience, including monitoring and incident management; proficiency in k8s, Terraform, Datadog, Sentry, AWS, Unix, and TeamCity; strong CI/CD implementation skills; and the ability to thrive in Agile teams with excellent communication.

]]> full-time mid remote software development infrastructure, CI/CD pipelines, monitoring and alerting systems, cloud infrastructure, automation scripts, DevOps and SRE experience, k8s, Terraform, Datadog, Sentry, AWS, Unix, TeamCity Engineering Technology Perplexity https://logos.yubhub.co/perplexity.com.png Perplexity is a young, fast-growing Chromium-based browser. They are committed to building reliable, automated, and scalable infrastructure for their browser development teams. https://jobs.ashbyhq.com https://jobs.ashbyhq.com/perplexity/7bce0fcf-eef6-41aa-9243-896f07a0316e Belgrade 2026-03-04