Production Engineer

Production Engineer – Team Lead

As the Production Engineer – Team Lead, you will be at the heart of CoreWeave's cloud infrastructure stability and reliability. This senior generalist role is designed to provide strategic direction, operational continuity, and technical expertise across various facets of our platform.

You will act as a bridge between engineering reliability and broader technical and organizational goals, ensuring a seamless connection between incident response, platform reliability, and team development. You will be responsible for guiding the team's response to critical incidents, tracking performance against Service Level Objectives (SLOs), and driving improvements that enhance both operational readiness and reliability across the organization.

The Cloud Platform Production Engineer Team Lead will reduce ambiguity, provide clarity, and keep reliability at the forefront of decision-making.

Key Responsibilities:

Incident Management & Recovery:
Act as the Incident Commander during incidents, providing decisive leadership to ensure timely and effective resolution while minimizing impact.
Coordinate cross-functional teams, including engineering, operations, and customer-facing units, during incidents, ensuring clear communication at all stages.
Lead root cause analysis (RCA) efforts, working with engineering teams to implement long-term, sustainable solutions and prevent recurrence.
Own and refine the post-incident review (PIR) process, ensuring actionable outcomes and continuous learning across the team.
Oversee the creation and maintenance of incident response playbooks to ensure team readiness for diverse failure scenarios.
Drive the escalation process, acting as the primary point of contact for high-priority incidents.

Operational Excellence & Reliability:
Define and track Service Level Objectives (SLOs) and ensure alignment with business goals and team objectives.
Champion the use of SLOs to guide incident prioritization, drive improvements, and communicate reliability outcomes.
Identify and lead initiatives to improve system resilience, scalability, and disaster recovery capabilities across the platform.
Develop and optimize KPIs, SLAs, and performance metrics for incident management and operational efficiency.
Spearhead the implementation of automation strategies to reduce Mean Time to Detection (MTTD) and Mean Time to Recovery (MTTR), while increasing overall platform reliability.
Mentor and guide the cloud operations team, ensuring consistent growth in technical skills, incident response expertise, and leadership capabilities.

Team Development & Mentorship:
Lead the development of the team by training and mentoring Production Engineer I/II in incident management best practices, tools, and systems.
Foster a collaborative environment where knowledge sharing, continuous learning, and feedback are prioritized.
Support the creation and evolution of team processes, ensuring scalability and the ability to respond effectively to both current and future needs.
Encourage professional growth and up-leveling within the team, creating a strong foundation for the next generation of Cloud Platform SREs.

Required Qualifications:
4+ years of experience in production engineering, cloud operations, site reliability engineering (SRE), or incident response roles.
Deep knowledge of cloud platforms (e.g., Kubernetes-based infrastructure, AWS, GCP).
Strong familiarity with incident management frameworks such as ITIL and SRE best practices.
Proficiency with monitoring and alerting tools (e.g., Prometheus, Grafana) and strong understanding of observability principles.
Hands-on experience with automation, scripting, and configuration management tools (e.g., Python, Bash, Terraform).
Demonstrated ability to make critical decisions under pressure, guiding teams through high-stakes incident resolution.
Excellent communication skills, with the ability to translate complex technical issues for both technical and non-technical stakeholders.
Proven experience mentoring and coaching technical teams, driving a culture of growth and continuous improvement.

Preferred Qualifications:
Previous experience in an Incident Commander role, managing high-priority incidents and major service restorations.
Advanced knowledge of Kubernetes, containerization, and distributed systems.
Familiarity with change management processes, post-incident analysis techniques, and runbook automation.
Experience with developing and managing self-healing infrastructure.

Why CoreWeave?

At CoreWeave, we work hard, have fun, and move fast! We're in an exciting stage of hyper-growth that you will not want to miss out on. We're not afraid of a little chaos, and we're constantly learning. Our team cares deeply about how we build our product and how we work together, which is represented through our core values:

Be Curious at Your Core
Act Like an Owner
Empower Employees
Deliver Best-in-Class Client Experiences
Achieve More Together

We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and provides the opportunity to develop innovative solutions to complex problems. As we get set for takeoff, the growth opportunities within the organization are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too. Come join us!

The base salary range for this role is 196,000 to 262,000 SGD. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).

XML job scraping automation by YubHub

full-time, senior, onsite
196,000 to 262,000 SGD
cloud platforms, Kubernetes-based infrastructure, AWS, GCP, incident management frameworks, ITIL, SRE best practices, monitoring and alerting tools, Prometheus, Grafana, observability principles, automation, scripting, configuration management tools, Python, Bash, Terraform, Kubernetes, containerization, distributed systems, change management processes, post-incident analysis techniques, runbook automation, self-healing infrastructure
Engineering Technology
CoreWeave
https://logos.yubhub.co/coreweave.com.png
CoreWeave is a cloud platform provider specializing in AI infrastructure and services.
https://www.coreweave.com
https://job-boards.greenhouse.io/coreweave/jobs/4674395006
Singapore
2026-04-24

Technical Program Manager, Safeguards (Infrastructure & Evals)

About Anthropic

Anthropic's mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole.

About the Role

Safeguards Engineering builds and operates the infrastructure that keeps Anthropic's AI systems safe in production: the classifiers, detection pipelines, evaluation platforms, and monitoring systems that sit between our models and the real world. That infrastructure needs to be not just correct, but reliable: when a safety-critical pipeline goes down or degrades, the consequences can be serious, and they can be invisible until someone looks closely.

As a Technical Program Manager for Safeguards Infrastructure and Evals, you'll own the operational health and forward momentum of this stack. Your primary responsibility is driving reliability: owning the incident-response and post-mortem process, ensuring SLOs are defined and met in partnership with various teams, and making sure that when things go wrong, the right people know, the right actions get taken, and those actions actually get closed out.

Alongside that ongoing operational rhythm, you'll coordinate the larger platform investments: migrations, eval-platform improvements, and the cross-team dependencies that connect them. This role sits at the intersection of operations and program management. It requires genuine technical depth: you need to understand how these systems work well enough to triage effectively, judge what's actually safety-critical versus what can wait, and have informed conversations with the engineers building and maintaining them. But the core of the job is keeping the machine running well and the work moving.

What You'll Do:

Own the Safeguards Engineering ops review
Drive the recurring cadence that keeps the team informed and coordinated: surfacing recent incidents and failures, bringing visibility to reliability trends, and making sure the right people are in the room when decisions need to be made.
Drive incident tracking and post-mortem execution
Establish and maintain SLOs with partner teams
Maintain runbook quality and incident-ownership clarity
Drive platform migrations and infrastructure projects
Coordinate evals platform improvements

You might be a good fit if you:

Have solid technical program management experience, particularly in operational or infrastructure-heavy environments; you're comfortable owning a mix of ongoing operational cadences and discrete project work simultaneously.
Understand how production ML systems work well enough to triage incidents intelligently and have substantive conversations with engineers about what's going wrong and why; you don't need to write the code, but you need to follow the technical thread.
Are energized by closing loops. Post-mortem action items that never get done, SLOs that no one checks, runbooks that go stale: these things bother you, and you know how to build the processes and follow-ups that fix them.
Can work effectively across team boundaries: comfortable coordinating with partner teams (like Inference) where you don't have direct authority, and skilled at keeping shared work moving through influence and clear communication.
Thrive in environments where the work shifts between 'keep the lights on' and 'build something new', and can context-switch between incident follow-ups and longer-horizon platform projects without dropping either.
Have experience with or strong interest in AI safety; you understand why the reliability of a safety-critical pipeline is a different kind of problem than the reliability of a product feature, and that distinction motivates you.

Strong candidates may also:

Have experience with SRE practices, incident management frameworks, or on-call operations at scale.
Have worked on or with evaluation infrastructure for ML systems, understanding how evals get designed, run, and interpreted.
Have experience driving infrastructure migrations in complex, multi-team environments, particularly where the migration touches operational systems that can't go offline.
Be familiar with monitoring and alerting tooling (PagerDuty, Datadog, or equivalents) and the operational culture around them.

Deadline to apply: None. Applications will be received on a rolling basis.

The annual compensation range for this role is listed below. For sales roles, the range provided is the role's On Target Earnings ('OTE') range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role.

Annual Salary: $290,000-$365,000 USD

full-time, mid, hybrid
$290,000-$365,000 USD
Technical Program Management, Operational or Infrastructure-heavy environments, Production ML systems, Incident management frameworks, On-call operations, Evaluation infrastructure for ML systems, Infrastructure migrations, Monitoring and alerting tooling
Engineering Technology
Anthropic
https://logos.yubhub.co/anthropic.com.png
Anthropic is a technology company focused on developing artificial intelligence systems.
https://www.anthropic.com/
https://job-boards.greenhouse.io/anthropic/jobs/5108695008
San Francisco, CA | New York City, NY | Seattle, WA
2026-04-18

Technical Program Manager, Safeguards (Infrastructure & Evals)

About the Role

Safeguards Engineering builds and operates the infrastructure that keeps Anthropic's AI systems safe in production. As a Technical Program Manager for Safeguards Infrastructure and Evals, you'll own the operational health and forward momentum of this stack.

Your primary responsibility is driving reliability: owning the incident-response and post-mortem process, ensuring SLOs are defined and met in partnership with various teams, and making sure that when things go wrong, the right people know, the right actions get taken, and those actions actually get closed out.

Alongside that ongoing operational rhythm, you'll coordinate the larger platform investments: migrations, eval-platform improvements, and the cross-team dependencies that connect them.

This role sits at the intersection of operations and program management. It requires genuine technical depth: you need to understand how these systems work well enough to triage effectively, judge what's actually safety-critical versus what can wait, and have informed conversations with the engineers building and maintaining them.

But the core of the job is keeping the machine running well and the work moving.

Responsibilities

Own the Safeguards Engineering ops review
Drive the recurring cadence that keeps the team informed and coordinated: surfacing recent incidents and failures, bringing visibility to reliability trends, and making sure the right people are in the room when decisions need to be made.
Drive incident tracking and post-mortem execution
Establish and maintain SLOs with partner teams
Maintain runbook quality and incident-ownership clarity
Drive platform migrations and infrastructure projects
Coordinate evals platform improvements

Requirements

Solid technical program management experience, particularly in operational or infrastructure-heavy environments
Understanding of how production ML systems work well enough to triage incidents intelligently and have substantive conversations with engineers about what's going wrong and why
Ability to work effectively across team boundaries
Experience with or strong interest in AI safety

Nice to Have

Experience with SRE practices, incident management frameworks, or on-call operations at scale
Familiarity with monitoring and alerting tooling (PagerDuty, Datadog, or equivalents)
Experience driving infrastructure migrations in complex, multi-team environments

full-time, senior, hybrid
$290,000-$365,000 USD
Technical Program Management, Operational or Infrastructure-heavy Environments, Production ML Systems, Incident Tracking and Post-Mortem Execution, Service-Level Objectives (SLOs), Runbook Quality and Incident-Ownership Clarity, Platform Migrations and Infrastructure Projects, Evals Platform Improvements, SRE Practices, Incident Management Frameworks, On-Call Operations at Scale, Monitoring and Alerting Tooling, Infrastructure Migrations in Complex, Multi-Team Environments
Engineering Technology
Anthropic
https://logos.yubhub.co/anthropic.ai.png
Anthropic develops artificial intelligence systems. It has a growing team of researchers, engineers, and business leaders.
https://anthropic.ai/
https://job-boards.greenhouse.io/anthropic/jobs/5108695008
San Francisco, CA | New York City, NY | Seattle, WA
2026-04-18