Sr. Manager, Engineering

04ee7215-acf Sr. Manager, Engineering - Model Serving

At Databricks, we enable data teams to solve the world's toughest problems by building and running the world's best data and AI infrastructure platform. Our Model Serving product provides enterprises with a unified, scalable, and governed platform to deploy and manage AI/ML models.

As a Senior Engineering Manager, you will lead the team owning both the product experience and the foundational infrastructure of Model Serving, shaping customer-facing capabilities while designing for scalability, extensibility, and performance across both CPU and GPU inference. The impact you will have includes leading, mentoring, and growing a high-performing engineering team, defining and owning the product and technical roadmap for Model Serving, collaborating closely with product, research, platform, and infrastructure teams, and ensuring Model Serving meets stringent SLAs, SLOs, and performance and reliability goals.

Key responsibilities include:

Leading, mentoring, and growing a high-performing engineering team responsible for both the customer-facing Model Serving product and its foundational infrastructure.
Defining and owning the product and technical roadmap for Model Serving, balancing customer experience, functionality, and foundational investments across deployment, inference, monitoring, and scaling.
Collaborating closely with product, research, platform, and infrastructure teams to drive end-to-end delivery from ideation and prioritization to launch and operation.
Ensuring Model Serving meets stringent SLAs, SLOs, and performance and reliability goals, continuously improving operational efficiency and customer experience.
Driving architectural decisions and product design around latency, throughput, autoscaling, GPU/CPU placement, and cost optimization.
Advocating for customer needs through direct engagement, ensuring engineering decisions translate to clear product impact.
Promoting best practices in code quality, testing, observability, and operational readiness.
Fostering a culture of excellence, inclusion, and continuous improvement across the team.
Partnering with recruiting to attract, hire, and develop top-tier engineering talent.

XML job scraping automation by YubHub

full-time senior onsite $217,000-$312,200 USD
technical leadership, large-scale distributed systems, real-time serving systems, architectural design, operational excellence, production systems, SLAs, SLOs, GPU performance optimization, concurrency, caching, scalability concepts
Engineering Technology
Databricks
https://logos.yubhub.co/databricks.com.png
Databricks builds and runs the world's best data and AI infrastructure platform.
https://databricks.com
https://job-boards.greenhouse.io/databricks/jobs/8211957002
San Francisco, California
2026-04-18

ca221b6f-dca Technical Program Manager, Safeguards (Infrastructure & Evals)

About the Role

Safeguards Engineering builds and operates the infrastructure that keeps Anthropic's AI systems safe in production. As a Technical Program Manager for Safeguards Infrastructure and Evals, you'll own the operational health and forward momentum of this stack.

Your primary responsibility is driving reliability: owning the incident-response and post-mortem process, ensuring SLOs are defined and met in partnership with various teams, and making sure that when things go wrong, the right people know, the right actions get taken, and those actions actually get closed out.

Alongside that ongoing operational rhythm, you'll coordinate the larger platform investments: migrations, eval-platform improvements, and the cross-team dependencies that connect them.

This role sits at the intersection of operations and program management. It requires genuine technical depth: you need to understand how these systems work well enough to triage effectively, judge what's actually safety-critical versus what can wait, and have informed conversations with the engineers building and maintaining them.

But the core of the job is keeping the machine running well and the work moving.

Responsibilities

Own the Safeguards Engineering ops review
Drive the recurring cadence that keeps the team informed and coordinated: surfacing recent incidents and failures, bringing visibility to reliability trends, and making sure the right people are in the room when decisions need to be made.
Drive incident tracking and post-mortem execution
Establish and maintain SLOs with partner teams
Maintain runbook quality and incident-ownership clarity
Drive platform migrations and infrastructure projects
Coordinate evals platform improvements

Requirements

Solid technical program management experience, particularly in operational or infrastructure-heavy environments
Understanding of how production ML systems work well enough to triage incidents intelligently and have substantive conversations with engineers about what's going wrong and why
Ability to work effectively across team boundaries
Experience with or strong interest in AI safety

Nice to Have

Experience with SRE practices, incident management frameworks, or on-call operations at scale
Familiarity with monitoring and alerting tooling (PagerDuty, Datadog, or equivalents)
Experience driving infrastructure migrations in complex, multi-team environments

full-time senior hybrid $290,000-$365,000 USD
Technical Program Management, Operational or Infrastructure-heavy Environments, Production ML Systems, Incident Tracking and Post-Mortem Execution, Service-Level Objectives (SLOs), Runbook Quality and Incident-Ownership Clarity, Platform Migrations and Infrastructure Projects, Evals Platform Improvements, SRE Practices, Incident Management Frameworks, On-Call Operations at Scale, Monitoring and Alerting Tooling, Infrastructure Migrations in Complex, Multi-Team Environments
Engineering Technology
Anthropic
https://logos.yubhub.co/anthropic.ai.png
Anthropic develops artificial intelligence systems. It has a growing team of researchers, engineers, and business leaders.
https://anthropic.ai/
https://job-boards.greenhouse.io/anthropic/jobs/5108695008
San Francisco, CA | New York City, NY | Seattle, WA
2026-04-18

96d05ee1-799 Staff Software Engineer, Cluster Orchestration

Job Description

CoreWeave is The Essential Cloud for AI. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence.

Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability.

Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025.

About the Role

As part of the Cluster Orchestration team, you will play a key role in advancing CoreWeave's orchestration platform including SUNK (Slurm on Kubernetes) and beyond, our Kubernetes-native foundation that powers AI training and inference at scale.

This is an opportunity to help shape one of the most critical layers of the AI cloud: ensuring workloads run seamlessly, reliably, and efficiently across massive GPU clusters.

By building the systems that eliminate infrastructure bottlenecks and create new orchestration capabilities, you will directly empower customers to innovate faster and push the boundaries of what's possible with AI.

What You'll Do

As a Staff Engineer, you will be a technical leader shaping the long-term strategy for CoreWeave's orchestration platform.

You'll define architectural direction, own critical parts of the orchestration platform and other managed services, and drive cross-org initiatives in scheduling, quota enforcement, and scaling at hyperscale.

You'll mentor senior engineers, establish org-wide best practices in reliability and observability, and ensure CoreWeave's orchestration layer evolves to meet the demands of next-generation AI workloads.

Who You Are

8+ years of software engineering experience.

Proven track record designing and operating large-scale distributed systems in production.

Deep expertise in Slurm/Kubernetes internals and cloud-native development.

Advanced proficiency in Go, distributed systems design, and cloud-native development.

Experience setting technical direction and influencing cross-team architecture.

Bachelor's or Master's degree in CS, EE, or related field.

Preferred

Familiarity with orchestration and workflow technologies such as Ray, Kubeflow, Kueue, Istio, Knative, or Argo Workflows

Deep expertise in Slurm/Kubernetes internals.

Experience with distributed workloads, GPU-based applications, or ML pipelines.

Knowledge of scheduling concepts like quota enforcement, pre-emption, and scaling strategies.

Exposure to reliability practices including SLOs, alarms, and post-incident reviews.

Experience with AI infrastructure and workloads (ML training, inference, or HPC).

Ability to mentor senior engineers and elevate organizational standards.

Why CoreWeave?

At CoreWeave, we work hard, have fun, and move fast! We're in an exciting stage of hyper-growth that you will not want to miss out on.

We're not afraid of a little chaos, and we're constantly learning.

Our team cares deeply about how we build our product and how we work together, which is represented through our core values:

Be Curious at Your Core

Act Like an Owner

Empower Employees

Deliver Best-in-Class Client Experiences

Achieve More Together

We support and encourage an entrepreneurial outlook and independent thinking.

We foster an environment that encourages collaboration and provides the opportunity to develop innovative solutions to complex problems.

As we get set for takeoff, the growth opportunities within the organization are constantly expanding.

You will be surrounded by some of the best talent in the industry, who will want to learn from you, too.

Come join us!

Salary and Benefits

The base salary range for this role is $185,000 to $275,000.

The starting salary will be determined based on job-related knowledge, skills, experience, and market location.

We strive for both market alignment and internal equity when determining compensation.

In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).

What We Offer

The range we've posted represents the typical compensation range for this role.

To determine actual compensation, we review the market rate for each candidate, which can include a variety of factors.

These include qualifications, experience, interview performance, and location.

In addition to a competitive salary, we offer a variety of benefits to support your needs, including:

Medical, dental, and vision insurance - 100% paid for by CoreWeave

Company-paid Life Insurance

Voluntary supplemental life insurance

Short and long-term disability insurance

Flexible Spending Account

Health Savings Account

Tuition Reimbursement

Ability to Participate in Employee Stock Purchase Program (ESPP)

Mental Wellness Benefits through Spring Health

Family-Forming support provided by Carrot

Paid Parental Leave

Flexible, full-service childcare support with Kinside

401(k) with a generous employer match

Flexible PTO

Catered lunch each day in our office and data center locations

A casual work environment

A work culture focused on innovative disruption

full-time staff hybrid $185,000 to $275,000
software engineering, distributed systems, Slurm, Kubernetes, cloud-native development, Go, scheduling, quota enforcement, scaling strategies, reliability practices, SLOs, alarms, post-incident reviews, AI infrastructure, workloads, ML training, inference, HPC, orchestration and workflow technologies, Ray, Kubeflow, Kueue, Istio, Knative, Argo Workflows
Engineering Technology
CoreWeave
https://logos.yubhub.co/coreweave.com.png
CoreWeave is a cloud computing company that provides a platform for building and scaling AI with confidence.
https://www.coreweave.com
https://job-boards.greenhouse.io/coreweave/jobs/4658801006
Bellevue, WA / Sunnyvale, CA
2026-04-18

15a29cc3-0bf Senior Production Engineer

CORPORATION

CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025.

About the Role

Production Engineering ensures CoreWeave’s cloud delivers world-class reliability, performance, and operational excellence. We are hiring a Senior Production Engineer to take direct, hands-on ownership of critical tooling that drives reliability and delivery success.

In this role, you will work broadly across the cloud stack designing, implementing, deploying, and operating systems that improve delivery velocity, service availability, and operational safety. You’ll be responsible for leading end-to-end technical projects, maintaining long-lived systems the team owns, and strengthening our operational foundations through durable engineering investments.

This is a role for someone who enjoys building, debugging, and operating production systems. You will collaborate closely with service owners, but your primary impact comes from the reliability, quality, and maturity of the systems you deliver and maintain over time.

What You’ll Do

Take hands-on ownership of critical systems and frameworks, driving their architecture, implementation, and long-term evolution.

Lead end-to-end delivery of engineering projects that improve availability, scalability, operational automation, and failure recovery.

Build and maintain observability, alerting, automated remediation, and resilience testing for the systems you support.

Participate in incident response as a subject-matter expert; drive deep root-cause investigations and implement lasting fixes.

Improve runbooks, sources of truth, deployment workflows, and operational tooling to harden production readiness.

Eliminate single points of failure and reduce operational toil through automation, refactors, and system redesigns.

Ship production code regularly in Python, Go, or similar languages, and participate in on-call rotations.

Maintain and mature long-term projects and frameworks owned by the team, ensuring they remain reliable, well-instrumented, and easy to operate.

Collaborate with platform teams to ensure new features and services integrate cleanly with our reliability best-practices and tooling.

What You’ve Worked On (Minimum Qualifications)

7+ years of engineering experience building and operating distributed systems or cloud platforms.

Demonstrated ability to debug complex production issues end-to-end, across services, infrastructure layers, and automation.

Strong programming or scripting ability (Python, Go, or similar), with experience shipping and operating production services and tools.

Deep knowledge of cloud-native technologies and distributed system patterns, particularly Kubernetes.

Experience with modern observability stacks: metrics, tracing, structured logs, SLOs/SLIs, and incident lifecycle practices.

A track record of successfully delivering hands-on reliability improvements through engineering execution.

Preferred Qualifications

Experience building internal tooling, frameworks, or automation that supports high-availability cloud operations.

Familiarity with DR/BCP, service tiering, capacity planning, or chaos engineering.

Background operating or building large-scale AI or GPU-accelerated infrastructure.

Experience maintaining multi-year ownership of foundational production systems.

Why CoreWeave?

At CoreWeave, we work hard, have fun, and move fast! We’re in an exciting stage of hyper-growth that you will not want to miss out on. We’re not afraid of a little chaos, and we’re constantly learning. Our team cares deeply about how we build our product and how we work together, which is represented through our core values:

Be Curious at Your Core

Act Like an Owner

Empower Employees

Deliver Best-in-Class Client Experiences

Achieve More Together

We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and enables the development of innovative solutions to complex problems. As we get set for takeoff, the organization's growth opportunities are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too. Come join us!

The base salary range for this role is $139,000 to $204,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).

What We Offer

The range we’ve posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate, which can include a variety of factors. These include qualifications, experience, interview performance, and location.

In addition to a competitive salary, we offer a variety of benefits to support your needs, including:

Medical, dental, and vision insurance - 100% paid for by CoreWeave

Company-paid Life Insurance

Voluntary supplemental life insurance

Short and long-term disability insurance

Flexible Spending Account

Health Savings Account

Tuition Reimbursement

Ability to Participate in Employee Stock Purchase Program (ESPP)

Mental Wellness Benefits through Spring Health

Family-Forming support provided by Carrot

Paid Parental Leave

Flexible, full-service childcare support with Kinside

401(k) with a generous employer match

Flexible PTO

Catered lunch each day in our office and data center locations

A casual work environment

A work culture focused on innovative disruption

Our Workplace

While we prioritize a hybrid work environment, remote work may be considered for candidates located more than 30 miles from an office, based on role requirements for specialized skill sets. New hires will be invited to attend onboarding at one of our hubs within their first month. Teams also gather quarterly to support collaboration.

California Consumer Privacy Act - California applicants only

full-time senior hybrid $139,000 to $204,000
cloud computing, distributed systems, cloud platforms, Kubernetes, observability stacks, metrics, tracing, structured logs, SLOs/SLIs, incident lifecycle practices, Python, Go, programming, scripting, production services, tools, internal tooling, frameworks, automation, high-availability cloud operations, DR/BCP, service tiering, capacity planning, chaos engineering, large-scale AI, GPU-accelerated infrastructure
Engineering Technology
CoreWeave
https://logos.yubhub.co/coreweave.com.png
CoreWeave is a cloud computing company that provides a platform for building and scaling AI applications.
https://www.coreweave.com
https://job-boards.greenhouse.io/coreweave/jobs/4670172006
Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA
2026-04-18

6a24f057-4f1 Staff Production Engineer

The Production Engineering Tools team builds and operates foundational platforms that make CoreWeave's cloud reliable, observable, and scalable. We are hiring a Staff Production Engineer to design, build, and own the foundational platforms and frameworks that underpin operational excellence across CoreWeave.

In this role, you will combine deep technical leadership with hands-on engineering to create systems that improve availability, resiliency, and delivery velocity at scale. This is a high-impact role with broad organisational influence. You will develop a deep understanding of CoreWeave's infrastructure and services, shape architecture and tooling decisions, and partner closely with service owners to operationalise reliability through automation and paved paths rather than manual process or advocacy.

Success requires the ability to pivot quickly between hot incidents, multi-team programs, and initiatives at all levels of the organisation. You will design, build, and own foundational platforms and frameworks from architecture through adoption and operation. You will lead technical strategy and execution for internal tooling that reduces manual operations, improves delivery velocity, and supports CoreWeave's revenue growth through faster, more reliable datacentre delivery.

You will partner with service owners and platform teams to translate reliability and operational requirements into automation, self-service capabilities, and opinionated paved paths. You will build and evolve systems for observability, alerting, automated remediation, resiliency testing, and authoritative sources of truth, operationalising best practices through tooling rather than manual enforcement.

You will participate in incident response for critical outages with the explicit goal of improving systems, tooling, and defaults to reduce future operational load, not as a long-term escalation path. You will ship production code, participate in on-call rotations as needed, and mentor engineers on platform ownership, operational design, and sustainable production practices.

full-time staff hybrid $188,000 to $275,000
distributed systems, cloud platforms, Kubernetes, observability, incident practices, metrics, tracing, structured logs, SLIs/SLOs, PIRs, foundational internal platforms, service tiering, disaster recovery, chaos engineering, structured resilience programs
Engineering Technology
CoreWeave
https://logos.yubhub.co/coreweave.com.png
CoreWeave is a cloud computing company that provides a platform for building and scaling AI applications.
https://www.coreweave.com
https://job-boards.greenhouse.io/coreweave/jobs/4644302006
Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA
2026-04-18

372999e8-579 Senior Software Engineer II, AI Workload Orchestration

As a Senior Software Engineer II on the AI Workload Orchestration team, you will help build and operate CoreWeave's Kubernetes-native platform for admitting, scheduling, and operating AI workloads at scale.

This platform integrates multiple orchestration and scheduling frameworks such as Kueue, Volcano, and Ray to support modern AI training and inference workflows. It complements SUNK (Slurm on Kubernetes) by providing a Kubernetes-first, cloud-native orchestration layer with deep platform integration.

You will own meaningful components of the platform, drive reliability and performance improvements, and help scale the system as customer demand and workload complexity continue to grow.

Responsibilities:

Design, build, and operate Kubernetes-native services for AI workload orchestration and scheduling
Own one or more platform components end-to-end, including design, implementation, testing, and on-call support
Improve scheduling latency, cluster utilization, and workload reliability through metrics-driven engineering
Contribute to architectural discussions across services and influence design decisions within the platform
Work closely with adjacent teams (CKS, infrastructure, managed inference) to ensure clean interfaces and integrations
Mentor junior engineers and raise the quality bar for code, design, and operations

About the role:

5–8 years of professional software engineering experience in distributed systems, cloud infrastructure, or platform engineering
Strong experience building production systems in Go (Python or C++ a plus)
Solid understanding of Kubernetes fundamentals, APIs, controllers, and operating services in production
Experience working with scheduling, resource management, or quota-based systems
Proven ability to improve system reliability and performance using data and operational metrics
Comfortable owning services in production and participating in on-call rotations

Preferred:

Experience with Kubernetes-native orchestration frameworks such as Kueue, Volcano, Ray, Kubeflow, or Argo Workflows
Familiarity with GPU-based workloads, ML training, or inference pipelines
Knowledge of scheduling concepts such as quota enforcement, pre-emption, and backfilling
Experience with reliability practices including SLOs, alerting, and incident response
Exposure to AI infrastructure, HPC, or large-scale distributed compute environments

Why CoreWeave?

Be Curious at Your Core
Act Like an Owner
Empower Employees
Deliver Best-in-Class Client Experiences
Achieve More Together

The base salary range for this role is $165,000 to $242,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).

What We Offer

In addition to a competitive salary, we offer a variety of benefits to support your needs, including:

Medical, dental, and vision insurance - 100% paid for by CoreWeave
Company-paid Life Insurance
Voluntary supplemental life insurance
Short and long-term disability insurance
Flexible Spending Account
Health Savings Account
Tuition Reimbursement
Ability to Participate in Employee Stock Purchase Program (ESPP)
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our office and data center locations
A casual work environment
A work culture focused on innovative disruption

full-time senior hybrid $165,000 to $242,000
Kubernetes, Go, Distributed systems, Cloud infrastructure, Platform engineering, Scheduling, Resource management, Quota-based systems, Kueue, Volcano, Ray, Kubeflow, Argo Workflows, GPU-based workloads, ML training, Inference pipelines, SLOs, Alerting, Incident response, AI infrastructure, HPC, Large-scale distributed compute environments
Engineering Technology
CoreWeave
https://logos.yubhub.co/coreweave.com.png
CoreWeave is a technology company that delivers a platform for building and scaling AI with confidence.
https://www.coreweave.com
https://job-boards.greenhouse.io/coreweave/jobs/4647595006
Sunnyvale, CA / Bellevue, WA
2026-04-18

67b4ccd7-51d Senior Software Engineer, Observability Insights

Join CoreWeave's Observability team, where we are building the next-generation insights layer for AI systems.

Our team empowers internal and external users to understand, troubleshoot, and optimize complex AI workloads by transforming telemetry into actionable insights.

As a Senior Software Engineer on the Observability Insights team, you will lead the development of agentic interfaces and product experiences that sit atop CoreWeave's telemetry layer.

You'll design multi-tenant APIs, managed Grafana experiences, and MCP-based tool servers to help customers and internal teams interact with data in innovative ways.

Collaborating closely with PMs and engineering leadership, your work will shape the end-to-end observability experience and influence how people engage with cutting-edge AI infrastructure.

About the role

6+ years of experience in software or infrastructure engineering building production-grade backend systems and distributed APIs.

Strong focus on developer-facing infrastructure, with a customer-obsessed approach to SDKs, CLIs, and APIs.

Proficient in reliability engineering, including fault-tolerant design, SLOs, error budgets, and multi-tenant system resilience.

Familiar with observability systems such as ClickHouse, Loki, VictoriaMetrics, Prometheus, and Grafana.

Experienced in agentic applications or LLM-based features, including grounding, tool calling, and operational safety.

Comfortable writing production code primarily in Go, with the ability to integrate Python components when needed.

Collaborative experience in agile teams delivering end-to-end telemetry-to-insights pipelines.

Preferred

Experience operating Kubernetes clusters at scale, especially for AI workloads.

Hands-on experience with logging, tracing, and metrics platforms in production, with deep knowledge of cardinality, indexing, and query optimization.

Experienced in running distributed systems or API services at cloud scale, including event streaming and data pipeline management.

Familiarity with LLM frameworks, MCP, and agentic tooling (e.g., LangChain, AgentCore).

Why CoreWeave?

At CoreWeave, we work hard, have fun, and move fast!

We're in an exciting stage of hyper-growth that you will not want to miss out on.

We're not afraid of a little chaos, and we're constantly learning.

Our team cares deeply about how we build our product and how we work together, which is represented through our core values:

Be Curious at Your Core

Act Like an Owner

Empower Employees

Deliver Best-in-Class Client Experiences

Achieve More Together

We support and encourage an entrepreneurial outlook and independent thinking.

We foster an environment that encourages collaboration and enables the development of innovative solutions to complex problems.

As we get set for takeoff, the organization's growth opportunities are constantly expanding.

You will be surrounded by some of the best talent in the industry, who will want to learn from you, too.

Come join us!

full-time senior hybrid $165,000 to $242,000
software engineering, infrastructure engineering, backend systems, distributed APIs, reliability engineering, fault-tolerant design, SLOs, error budgets, multi-tenant system resilience, observability systems, ClickHouse, Loki, VictoriaMetrics, Prometheus, Grafana, agentic applications, LLM-based features, grounding, tool calling, operational safety, Go, Python, Kubernetes, logging, tracing, metrics platforms, cardinality, indexing, query optimization, event streaming, data pipeline management, LLM frameworks, MCP, agent tooling, operating Kubernetes clusters
Engineering Technology
CoreWeave
https://logos.yubhub.co/coreweave.com.png
CoreWeave is a cloud computing company that provides a platform for building and scaling AI.
https://www.coreweave.com
https://job-boards.greenhouse.io/coreweave/jobs/4650163006
New York, NY / Sunnyvale, CA
2026-04-18

bf25e8de-318 Director of Engineering (Data Infrastructure)

Job Title: Director of Engineering (Data Infrastructure)

Location: Bengaluru, India

We're looking for a seasoned Director of Engineering to lead our data infrastructure organization in Bengaluru. As a founding technical leader in our fastest-growing engineering hub, you will be responsible for building world-class teams and shaping architectural decisions that ripple across the company.

About the Role:

You will build the data infrastructure organization that makes Databricks' continued growth possible.
Establish foundational teams in Bengaluru owning the bedrock systems that guarantee billing correctness, operational resilience, and zero-downtime recovery across our entire monetization stack.
Define what world-class infrastructure looks like for the next decade of data platforms.

Responsibilities:

Deliver the infrastructure vision for systems processing billions in daily billing transactions with zero tolerance for error.
Build Bengaluru's data infrastructure organization by establishing it as the destination for India's top infrastructure talent.
Own business-critical systems operating 24/7/365 across 100+ regions where even 99.9% uptime means hours of customer pain.
Ship platforms that compound engineering leverage across Databricks.
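
To make the uptime figure in the responsibilities above concrete, here is a minimal sketch of the arithmetic. It assumes a 365-day year, and the function name `allowed_downtime_hours` is illustrative rather than taken from the posting.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year

def allowed_downtime_hours(availability_pct: float) -> float:
    """Hours of downtime per year permitted by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

if __name__ == "__main__":
    # Compare the targets mentioned in this posting.
    for target in (99.9, 99.999):
        print(f"{target}% -> {allowed_downtime_hours(target):.4f} hours/year")
```

At 99.9% this works out to about 8.76 hours of downtime per year, which is the "hours of customer pain" the posting refers to.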

Requirements:

14+ years in distributed systems engineering with 6+ years leading infrastructure organizations and 4+ years managing managers at companies where infrastructure failures meant immediate revenue impact, customer escalations, or regulatory consequences.
Technical depth across petabyte-scale data pipelines and distributed systems reliability.
Track record defining multi-year infrastructure vision and translating it into sequential deliverables that show value quarterly.
Experience building 99.999%+ reliable systems with established practices for SLOs/SLIs, chaos engineering, disaster recovery, and sophisticated observability.
Proven ability to scale infrastructure organizations in high-growth environments.
Communication skills to make complex infrastructure decisions legible to executives.

What You'll Need:

BS in Computer Science or Engineering; MS or Ph.D. preferred.
Experience with Apache Spark, Delta Lake, large-scale data infrastructure, fintech/billing systems, or leading infrastructure through hypergrowth strongly preferred.

Benefits:

At Databricks, we strive to provide comprehensive benefits and perks that meet the needs of all of our employees.

Our Commitment to Diversity and Inclusion:

At Databricks, we are committed to fostering a diverse and inclusive culture where everyone can excel.

Compliance:

If access to export-controlled technology or source code is required for performance of job duties, it is within Employer's discretion whether to grant such access.

]]> full-time executive onsite distributed systems engineering, infrastructure organizations, petabyte-scale data pipelines, distributed systems reliability, SLOs/SLIs, chaos engineering, disaster recovery, observability, Apache Spark, Delta Lake, large-scale data infrastructure, fintech/billing systems Engineering Technology Databricks https://logos.yubhub.co/databricks.com.png Databricks is a data and AI company that provides a unified and democratized data, analytics, and AI platform. https://databricks.com https://job-boards.greenhouse.io/databricks/jobs/8290810002 Bengaluru, India 2026-04-18 40d32156-365 Reliability Lead, Common Services As Reliability Lead, Common Services, you will establish and lead the Reliability Engineering and production operations practice for the Common Services organization. You'll partner closely with engineering leaders and teams across Common Services to define how we build, release, monitor, and operate critical services, raising the bar on reliability, availability, and operational excellence across the board.

In this role, you will:

Establish and lead the SRE / production engineering practice for the Common Services organization, including standards for reliability, incident management, and on-call, in partnership with the central Product Engineering organization.
Develop an Operational Excellence strategy that focuses not only on improving system performance but also on monitoring and reducing operational toil.
Partner with engineering and product teams to define SLOs, SLIs, and error budgets for critical Common Services, and ensure these become part of how teams plan and make tradeoffs.
Own and improve the incident management lifecycle for Common Services, including on-call rotations, escalation paths, incident tooling, post-incident reviews, and follow-through on corrective actions.
Drive the observability strategy (metrics, logs, traces, dashboards, alerts) for Common Services, ensuring we have actionable visibility into the health, performance, and capacity of key systems.
Collaborate with engineering leads to design and review architectures for reliability, scalability, resilience, and operability, including failure modes, redundancy, and graceful degradation.
Lead efforts to automate and harden operational workflows, including deployments, rollbacks, configuration management, change management, and routine maintenance tasks.
Build strong, trust-based relationships with partner teams and stakeholders, becoming a go-to leader for production readiness and operational risk within Common Services.
Hire, mentor, and develop SRE and production engineering talent, fostering a culture of continuous improvement, learning from incidents, and humane on-call.
Partner with other SRE and production engineering leaders across CoreWeave to align on global practices, tools, and reliability goals, representing the needs and constraints of Common Services.
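
The SLO and error-budget responsibility above can be illustrated with a small sketch. It assumes a request-based availability SLI (successful requests divided by total requests); the function name and the sample numbers are hypothetical and do not describe any CoreWeave system.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent for a request-based SLO.

    slo_target is a fraction such as 0.999. The result is 1.0 when nothing
    has been spent, 0.0 when the budget is exactly exhausted, and negative
    when the SLO has been violated.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        return 1.0  # no traffic yet, so no budget has been spent
    allowed_failures = (1 - slo_target) * total
    return 1 - failed / allowed_failures

if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
    print(error_budget_remaining(0.999, 1_000_000, 250))
```

In that hypothetical window the SLO allows roughly 1,000 failed requests, so 250 failures leave about three quarters of the budget for planned risk such as releases.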

You will be responsible for defining the reliability strategy, processes, and standards for the Common Services portfolio and driving consistent, high-quality operational practices across multiple teams.

The base salary range for this role is $206,000 to $303,000.

]]> full-time senior hybrid $206,000 to $303,000 Site Reliability Engineering, Production Engineering, Linux-based production environments, Containers, Orchestration technologies, Observability stacks, Alerting systems, SLIs/SLOs, Error budgets, Incident management, On-call rotations, Escalation paths, Post-incident reviews, Corrective actions, Automation tooling, Infrastructure-as-code, CI/CD pipelines, GPU workloads, High-performance computing, Latency/throughput-sensitive systems, Multi-tenant environments, Multi-region environments, Regulated environments, Service ownership models, Mentoring, Managing senior engineers Engineering Technology CoreWeave https://logos.yubhub.co/coreweave.com.png CoreWeave is a cloud computing company that provides a platform for AI development and deployment. https://www.coreweave.com https://job-boards.greenhouse.io/coreweave/jobs/4650165006 New York, NY / Sunnyvale, CA / Bellevue, WA 2026-04-18 f516f0ef-a2d Senior Site Reliability Engineer (Auth0) Secure Every Identity, from AI to Human Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organisations to safely embrace this new era.

This work requires a relentless drive to solve complex challenges with real-world stakes. We are looking for builders and owners who operate with speed and urgency and execute with excellence. This is an opportunity to do career-defining work. We're all in on this mission.

As a Senior Site Reliability Engineer, you'll join our SRE team based in Europe to ensure our production systems are not only operational but also resilient, scalable, and ready for exponential growth. This isn't just about keeping the lights on; it's about directly contributing to the platform's core resiliency and robustness.

You'll be a hands-on builder, crafting solutions that make our system more reliable by design.

Key Responsibilities:

Design and build custom software in Go to enhance the platform's reliability, resiliency, and redundancy.
Partner with engineering teams to embed reliability principles, improving the availability, performance, and observability of our services.
Use your deep understanding of infrastructure and observability principles to identify opportunities for improvement within the product and implement solutions.
Contribute to our on-call rotation, providing rapid, effective response to critical incidents and using your expertise to troubleshoot, mitigate or accurately escalate production issues.
Develop and refine our SRE tooling and processes, focusing on automation and operational efficiency.
Define, document, and champion reliability best practices across the organisation.

Requirements:

A proactive and systematic approach to problem-solving, with a high degree of ownership.
Proven experience in a production environment supporting large-scale, mission-critical applications with a high degree of autonomy.
Proficiency in at least one programming language, with a preference for Go. You should be comfortable writing custom applications, not just scripts.
Experience with infrastructure as code (Terraform), container orchestration (Kubernetes, Docker) and GitOps (ArgoCD).
Demonstrable expertise in a major cloud provider (Azure, AWS, or GCP).
A strong grasp of microservices architecture, databases (SQL, NoSQL), and networking fundamentals, so you can understand how custom code can solve platform-level issues.
An understanding of core SRE principles, including SLIs, SLOs, and error budgets.
Experience in an on-call rotation for a 24/7 cloud-based environment.
Exceptional communication and collaboration skills, with a proven ability to work effectively in a remote, distributed team, where tasks may be self-driven.

We're looking for someone who is not just looking for a job, but a career-defining opportunity to tackle complex challenges at a massive scale. If you're a curious and motivated engineer who's passionate about building reliability directly into the platform, we'd love to hear from you.

]]> full-time senior hybrid $136,000-$187,000 CAD Go, Terraform, Kubernetes, Docker, GitOps, Cloud provider (Azure, AWS, or GCP), Microservices architecture, Databases (SQL, NoSQL), Networking fundamentals, Core SRE principles (SLIs, SLOs, error budgets) Engineering Technology Okta https://logos.yubhub.co/okta.com.png Okta provides an unparalleled authentication experience for hundreds of millions of users worldwide. https://www.okta.com/ https://job-boards.greenhouse.io/okta/jobs/7791590 Toronto, Ontario, Canada 2026-04-18 81e928a2-c9f Senior Site Reliability Engineer (Auth0) Secure Every Identity

We are looking for a Senior Site Reliability Engineer to join our SRE team based in Europe. As a Senior Site Reliability Engineer, you'll ensure our production systems are not only operational but also resilient, scalable, and ready for exponential growth.

This isn't just about keeping the lights on; it's about directly contributing to the platform's core resiliency and robustness. You'll be a hands-on builder, crafting solutions that make our system more reliable by design.

Responsibilities

Design and build custom software in Go to enhance the platform's reliability, resiliency, and redundancy.
Partner with engineering teams to embed reliability principles, improving the availability, performance, and observability of our services.
Use your deep understanding of infrastructure and observability principles to identify opportunities for improvement within the product and implement solutions.
Contribute to our on-call rotation, providing rapid, effective response to critical incidents and using your expertise to troubleshoot, mitigate or accurately escalate production issues.
Develop and refine our SRE tooling and processes, focusing on automation and operational efficiency.
Define, document, and champion reliability best practices across the organisation.

What you'll need to be successful

This role requires a unique blend of a software engineer's mindset and operational expertise. You'll thrive in this role if you have:

A proactive and systematic approach to problem-solving, with a high degree of ownership.
Proven experience in a production environment supporting large-scale, mission-critical applications with a high degree of autonomy.
Proficiency in at least one programming language, with a preference for Go. You should be comfortable writing custom applications, not just scripts.
Experience with infrastructure as code (Terraform), container orchestration (Kubernetes, Docker) and GitOps (ArgoCD).
Demonstrable expertise in a major cloud provider (Azure, AWS, or GCP).
A strong grasp of microservices architecture, databases (SQL, NoSQL), and networking fundamentals, so you can understand how custom code can solve platform-level issues.
An understanding of core SRE principles, including SLIs, SLOs, and error budgets.
Experience in an on-call rotation for a 24/7 cloud-based environment.
Exceptional communication and collaboration skills, with a proven ability to work effectively in a remote, distributed team, where tasks may be self-driven.

The Okta Experience

Supporting Your Well-Being
Driving Social Impact
Developing Talent and Fostering Connection + Community

]]> full-time senior remote Go, Terraform, Kubernetes, Docker, GitOps, Cloud provider (Azure, AWS, or GCP), Microservices architecture, Databases (SQL, NoSQL), Networking fundamentals, Core SRE principles (SLIs, SLOs, error budgets) Engineering Technology Okta https://logos.yubhub.co/okta.com.png Okta provides an unparalleled authentication experience for hundreds of millions of users worldwide. It is a large technology company. https://www.okta.com/ https://job-boards.greenhouse.io/okta/jobs/7418982 Barcelona, Spain 2026-04-18 8a68e8bd-dd5 Consulting Architect - Observability As a Consulting Architect – Observability, you will play a pivotal role in helping our customers realise the value of Elastic’s Solutions. Acting as a trusted technical advisor, you will work with enterprises to design, deliver, and scale architectures that improve application performance, infrastructure visibility, and end-user experience.

You will translate business and technical requirements into scalable, outcome-driven solutions built on the Elastic Stack. You will lead end-to-end delivery of customer engagements, from discovery and design through implementation, enablement, and optimisation. You will partner with customers to architect, deploy, and operationalise Elastic solutions that drive measurable value and adoption.

You will provide technical oversight, guidance, and enablement to customers and teammates throughout project lifecycles. You will collaborate cross-functionally with Sales, Product, Engineering, and Support to ensure successful outcomes and continuous improvement. You will capture and share best practices, lessons learned, and solution patterns across the Elastic Services community.

You will guide customers in using Elastic Agents, Beats, Logstash, and related technologies for time-series data ingestion, stream processing, and normalisation. You will design and implement custom dashboards, visualisations, and alerting for critical observability use cases in Kibana. You will optimise ingestion pipelines for performance, scalability, and resiliency at enterprise scale.

You will have 5+ years as a consultant, architect, or engineer with expertise in observability, monitoring, or related domains. You will have strong experience with time-series data ingestion and processing, including pipelines with Elastic Agents, Beats, and Logstash. You will have knowledge of messaging queues (Kafka, Redis) and ingestion optimisation strategies.

You will have an understanding of observability concepts such as distributed tracing, metrics pipelines, log aggregation, anomaly detection, and SLOs/SLIs. You will have experience with one or more of: Kubernetes, cloud platforms (AWS, Azure, GCP), or infrastructure as code. You will have familiarity with Elastic Common Schema (ECS), data parsing, and normalisation.

You will have proven experience deploying Elastic Observability (APM, UEM, logs, metrics, infra, network monitoring) or similar solutions at enterprise scale. You will have hands-on expertise in distributed systems and large-scale infrastructure. You will have the ability to design and build dashboards, visualisations, and alerting thresholds that drive actionable insights.

You will have experience with Kubernetes, Linux, Java, databases, Docker, AWS/Azure/GCP, VMs, Lucene. You will have strong communication and presentation skills, with experience engaging directly with customers. You will have a Bachelor’s, Master’s, or PhD in Computer Science, Engineering, or related field, or equivalent experience.

You will be comfortable working in highly distributed teams, both remote and on-site when needed. The role may require significant travel to customer sites to support engagements and solution implementations; candidates should be comfortable with varying levels of travel based on business needs.

]]> full-time senior remote $133,100-$210,600 USD observability, monitoring, time-series data ingestion, processing, pipelines, Elastic Agents, Beats, Logstash, messaging queues, Kafka, Redis, ingestion optimisation strategies, distributed tracing, metrics pipelines, log aggregation, anomaly detection, SLOs/SLIs, Kubernetes, cloud platforms, infrastructure as code, Elastic Common Schema, data parsing, normalisation, databases, Docker, VMs, Lucene Engineering Technology Elastic https://logos.yubhub.co/elastic.co.png Elastic enables everyone to find the answers they need in real time, using all their data, at scale. The Elastic Search AI Platform, used by more than 50% of the Fortune 500, brings together the precision of search and the intelligence of AI to enable everyone to accelerate the results that matter. https://www.elastic.co/ https://job-boards.greenhouse.io/elastic/jobs/7763314 United States 2026-04-18 396fe53d-121 Consulting Architect - Observability As a Consulting Architect – Observability, you will play a pivotal role in helping our customers realise the value of Elastic’s Solutions. Acting as a trusted technical advisor, you will work with enterprises to design, deliver, and scale architectures that improve application performance, infrastructure visibility, and end-user experience.

You'll collaborate with Elastic’s Professional Services, Engineering, Product, and Sales teams to accelerate adoption of the Elastic Observability platform, ensuring customers maximise the value of their data while achieving business outcomes. This is a highly impactful role, with opportunities to guide strategy, lead complex implementations, and mentor both customers and teammates.

Key responsibilities include:

Translating business and technical requirements into scalable, outcome-driven solutions built on the Elastic Stack.
Leading end-to-end delivery of customer engagements, from discovery and design through implementation, enablement, and optimisation.
Partnering with customers to architect, deploy, and operationalise Elastic solutions that drive measurable value and adoption.
Providing technical oversight, guidance, and enablement to customers and teammates throughout project lifecycles.
Collaborating cross-functionally with Sales, Product, Engineering, and Support to ensure successful outcomes and continuous improvement.
Capturing and sharing best practices, lessons learned, and solution patterns across the Elastic Services community.
Contributing to internal enablement, mentoring, and a culture of continuous learning and collaboration.

Required skills include:

5+ years as a consultant, architect, or engineer with expertise in observability, monitoring, or related domains.
Expertise in the Telecommunications domain, especially with Mobile networks and devices.
Strong experience with time-series data ingestion and processing, including pipelines with Elastic Agents, Beats, and Logstash.
Knowledge of messaging queues (Kafka, Redis) and ingestion optimisation strategies.
Understanding of observability concepts like distributed tracing, metrics pipelines, log aggregation, anomaly detection, SLOs/SLIs.
Experience with one or more: Kubernetes, cloud platforms (AWS, Azure, GCP), or infrastructure as code.
Familiarity with Elastic Common Schema (ECS), data parsing, and normalisation.
Proven experience deploying Elastic Observability (APM, UEM, logs, metrics, infra, network monitoring) or similar solutions at enterprise scale.
Hands-on expertise in distributed systems and large-scale infrastructure.
Ability to design and build dashboards, visualisations, and alerting thresholds that drive actionable insights.
Experience with Kubernetes, Linux, Java, databases, Docker, AWS/Azure/GCP, VMs, Lucene.
Strong communication and presentation skills, with experience engaging directly with customers.
Bachelor’s, Master’s, or PhD in Computer Science, Engineering, or related field, or equivalent experience.
Comfortable working in highly distributed teams, both remote and on-site when needed.
May require significant travel to customer sites to support engagements and solution implementations; candidates should be comfortable with varying levels of travel based on business needs.

]]> full-time senior remote observability, monitoring, Elastic Stack, time-series data ingestion, Elastic Agents, Beats, Logstash, messaging queues, Kafka, Redis, distributed tracing, metrics pipelines, log aggregation, anomaly detection, SLOs/SLIs, Kubernetes, cloud platforms, infrastructure as code, Elastic Common Schema, data parsing, normalisation, databases, Docker, VMs, Lucene Engineering Technology Elastic https://logos.yubhub.co/elastic.co.png Elastic is a software company that enables everyone to find the answers they need in real time, using all their data, at scale. The company's products are used by more than 50% of the Fortune 500. https://www.elastic.co/ https://job-boards.greenhouse.io/elastic/jobs/7440232 Tokyo, Japan 2026-04-18 62efca6f-b6f Senior AI Engineer We're looking for a Senior AI Engineer who is obsessed with building AI systems that actually work in production: reliable, observable, cost-efficient, and genuinely useful. This is not a research role. You will ship AI-powered features that process real financial data for real businesses.

LLM & AI Pipeline Engineering - Design, build, and maintain production-grade LLM integration pipelines, including retrieval-augmented generation (RAG), prompt engineering, output parsing, and chain orchestration.

Develop and operate AI features within Jeeves's core financial products: spend categorization, document extraction, anomaly detection, financial Q&A, and automated reconciliation.

Implement structured output validation, fallback handling, and confidence scoring to ensure AI decisions meet reliability standards for financial use cases.

Evaluate and integrate AI frameworks and tools (LangChain, LlamaIndex, OpenAI API, Anthropic API, HuggingFace, vector databases) and advocate for the right tool for the job.

Establish prompt versioning and evaluation practices to ensure AI outputs remain accurate and consistent as models and data evolve.

Retrieval & Vector Search - Design and maintain vector search pipelines using databases such as Pinecone, Weaviate, or pgvector to power semantic search and RAG-based features.

Build document ingestion and chunking pipelines for Jeeves's financial data, processing invoices, receipts, policy documents, and transaction records.

Optimize retrieval quality through embedding model selection, chunk strategy, metadata filtering, and re-ranking techniques.

ML Model Serving & Operations - Collaborate with data scientists to take trained ML models from experimental notebooks to production serving infrastructure.

Build and maintain model serving endpoints with appropriate latency SLOs, input validation, and output monitoring.

Implement model performance monitoring and data drift detection to ensure production models remain accurate over time.

Support model retraining workflows by designing clean data pipelines and feature engineering that can be continuously updated.

Backend Integration & Reliability - Integrate AI services cleanly with Jeeves's backend microservices, designing clear API contracts, circuit breakers, and graceful degradation patterns.

Write high-quality, testable backend code in Python or Go/Node.js to power AI-integrated features.

Instrument AI components with structured logging, distributed tracing, latency dashboards, and alerting to ensure operational visibility.
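
As an illustration of the circuit breaker and graceful degradation patterns named above, here is a minimal sketch. The class, its parameters, and the fallback convention are assumptions made for the example; they do not describe Jeeves's actual services or any specific library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; allows a trial call after `cooldown` seconds."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable so the breaker can be tested without waiting
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, op: Callable[[], T], fallback: Callable[[], T]) -> T:
        # While open and still cooling down, skip the dependency and degrade gracefully.
        if self.opened_at is not None and self.clock() - self.opened_at < self.cooldown:
            return fallback()
        # Closed, or cooldown elapsed (half-open): let the call through.
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller might wrap a model endpoint in `breaker.call(lambda: classify(txn), lambda: "uncategorized")`, where `classify` is a hypothetical AI-backed function, so that the product keeps working with a degraded answer while the dependency recovers.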

Collaboration & Growth - Partner with Product, Backend Engineering, and Data Science to define the AI roadmap and translate requirements into reliable systems.

Contribute to a culture of quality by writing design docs, reviewing peers' AI system designs, and sharing learnings openly.

Help grow the AI engineering practice at Jeeves by establishing patterns, tooling, and best practices that the broader team can build on.

]]> full-time senior remote LLM, AI, Python, LangChain, LlamaIndex, OpenAI API, Anthropic API, HuggingFace, vector databases, Pinecone, Weaviate, pgvector, semantic search, RAG-based features, document ingestion, chunking pipelines, embedding model selection, chunk strategy, metadata filtering, re-ranking techniques, model serving infrastructure, latency SLOs, input validation, output monitoring, model performance monitoring, data drift detection, clean data pipelines, feature engineering, API contracts, circuit breakers, graceful degradation patterns, structured logging, distributed tracing, latency dashboards, alerting Engineering Finance Jeeves https://logos.yubhub.co/jeeves.com.png Jeeves is a financial operating system built for global businesses that provides corporate cards, cross-border payments, and spend management software within one unified platform. It operates across 20+ countries and serves over 5,000 clients. https://www.jeeves.com/ https://jobs.lever.co/tryjeeves/ded9e04e-f18e-4d4c-ae43-4b7882c6200b India 2026-04-17 e21cfc44-77a Staff Distributed Systems Engineer - Collaboration At Webflow, we're building the world's leading AI-native Digital Experience Platform, and we're doing it as a remote-first company built on trust, transparency, and a whole lot of creativity.

This work takes grit, because we move fast, without ever sacrificing craft or quality. Our mission is to bring development superpowers to everyone. From entrepreneurs launching their first idea to global enterprises scaling their digital presence, we empower teams to design, launch, and optimize for the web without barriers.

We're looking for a Staff Distributed Systems Engineer

About the role:

Location: Remote-first (United States; BC & ON, Canada)
Full-time
Permanent
Exempt
The cash compensation for this role is tailored to align with the cost of labor in different geographic markets. We've structured the base pay ranges for this role into zones for our geographic markets, and the specific base pay within the range will be determined by the candidate’s geographic location, job-related experience, knowledge, qualifications, and skills.
United States (all figures cited below are in USD and pertain to workers in the United States)

Zone A: $187,000 - $289,000
Zone B: $175,000 - $271,000
Zone C: $164,000 - $254,000

Canada (figures cited below are in CAD and pertain to workers in ON & BC, Canada)

$212,000 - $328,000

This role is also eligible to participate in Webflow's company-wide bonus program. Target amounts are a percentage of base salary and vary by career level. Payouts are based on company performance against established financial and operational goals.
Please visit our Careers page for more information on which locations are included in each of our geographic pay zones. However, please confirm the zone for your specific location with your recruiter.
Application Information:

Application deadline: applications accepted on an ongoing basis until position is closed and filled
This posting is for a new position
Reporting to the Senior Manager, Engineering

As a Staff Distributed Systems Engineer, you’ll:

Collaborate with exceptional engineers on building systems and services for the world's largest companies. This platform powers millions of production websites and supports massive scale, including over 1% of global Internet traffic and more than 10 billion monthly visits.
Lead architecture for distributed services at scale that synchronize shared state across clients, including clear correctness guarantees (e.g., ordering, idempotency, convergence). These services require low latency and high availability, with an SLO of 99.99% uptime.
Define concurrency and conflict-resolution semantics for concurrent changes, including trade-offs and constraints.
Design for failure: retries, partial outages, reconnection, and safe recovery paths, with explicit degradation behavior.
Own operational excellence: define SLIs/SLOs, instrument tracing/metrics/logging, and drive reliability improvements through incident learning.
Drive cross-team technical alignment via design docs and decision records; unblock execution across org boundaries.
Raise the bar through design and code reviews, mentoring, and pragmatic standardization that increases leverage.
Deliver maintainable, tested, performant systems and evolve them with a “crawl, walk, run” plan.
Use modern tooling (including agentic coding, debugging and code review) to improve developer velocity and reduce time-to-diagnosis in production.
Participate in engineering citizenship activities such as co-authoring engineering blogs, strengthening and improving our hiring processes, and leading internal hackathon teams.
In addition to the responsibilities outlined above, at Webflow we will support you in identifying where your interests and development opportunities lie and we'll help you incorporate them into your role.
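
The "design for failure" responsibility above mentions retries and idempotency. A minimal sketch of one common approach, retrying with capped exponential backoff and full jitter, is shown here. The function name, default values, and the injectable `sleep` parameter are assumptions for illustration and do not describe Webflow's systems.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_backoff(
    op: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call op until it succeeds or max_attempts is reached.

    Between attempts, waits a random time in [0, min(max_delay, base_delay * 2**attempt)].
    op should be idempotent, because it may run more than once.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

The jitter spreads retries from many clients over time, which helps avoid synchronized retry storms when a shared service recovers from a partial outage.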

About you:
Requirements:

BA/BS degree or equivalent experience

You’ll thrive as a Staff Distributed Systems Engineer if you have:

At least 7, preferably 10+ years of building and operating large-scale production distributed systems where latency, correctness, and reliability (99.99% uptime) are non-negotiable.
Deep backend systems experience in one or more modern server environments (Java, Go, Rust, Python, Node.js, etc.), with the ability to ramp and adapt quickly in new stacks.
Expertise with distributed systems, concurrency, scaling, and debugging multi-layer systems.
Strong operational judgment: you define SLIs/SLOs, build observability, and improve systems via incidents and feedback loops, not heroics.
Staff behaviors: you lead multi-team initiatives, write appropriate design docs, influence architecture beyond your immediate team, and communicate across the organization.
Ability to make decisions with incomplete information, understand and communicate one-way vs. two-way doors, and move with urgency while keeping critical code operational.
Stay curious and open to growth, actively building fluency in emerging technologies like AI to unlock creativity, accelerate progress, and amplify impact.
Our Core Behaviors:

Build lasting customer trust.
Win together.
Reinvent ourselves.
Deliver with speed, quality, and craft.

Benefits:

Ownership in what you help build.
Health coverage that actually covers you.
Support for every stage of family life.
Time off that’s actually off.
Wellness for the whole you.
Invest in your future.
Monthly stipends that flex with your life.
Bonus for building together.

Remote, together:

At Webflow, equality is a core tenet of our culture. We are an Equal Opportunity (EEO)/Veterans/Disability Employer.

]]> full-time staff remote $187,000 - $289,000 (USD) distributed systems, backend systems, server environments, concurrency, scaling, debugging, operational judgment, SLIs/SLOs, observability, incident learning, design docs, decision records, code reviews, mentoring, pragmatic standardization, modern tooling, agentic coding, code review Engineering Technology Webflow https://logos.yubhub.co/webflow.com.png Webflow is a digital experience platform that empowers teams to design, launch, and optimize websites without barriers. https://webflow.com/ https://job-boards.greenhouse.io/webflow/jobs/7630474 U.S. Remote 2026-03-31 981e6f7e-ede Production Readiness Lead - Game Developer Experience (GDX) Electronic Arts creates next-level entertainment experiences that inspire players and fans around the world. Here, everyone is part of the story. Part of a community that connects across the globe. A place where creativity thrives, new perspectives are invited, and ideas matter. A team where everyone makes play happen.

The Electronic Arts Information Technology (EAIT) organization works as a global team to empower EA's employees and business operations to be creative, collaborative, and productive. As a digital entertainment company, EA's enterprise technology needs are diverse and span across game development, workforce collaboration, marketing, publishing, player experience, security, and corporate activities. Our mission is to bring creative technology services to each of these areas, working across the company to ensure better play.

As part of the Game Developer Experience (GDX) organization, the Engineering and Operations team is building a structured, scalable operational lifecycle across GameKit. In this role, you will play a central part in shaping how operational excellence is embedded into product delivery from concept through launch and beyond.

As the Production Readiness Lead, you will integrate operational standards directly into the Product Development Lifecycle (PDLC), ensuring that reliability, scalability, and support readiness are designed in, not added later. You will collaborate closely with Engineering, Product Management, Site Reliability Engineering (SRE), Customer Support, and Operations partners to help teams meet clearly defined expectations for observability, automation, documentation, and launch readiness.

This is a hybrid role (3 days per week in the office) based in Vancouver, reporting to the Director of Operations and partnering broadly across the GameKit ecosystem to establish a repeatable, sustainable operational lifecycle model.

Responsibilities:

Enable a digital-first, automation-forward support strategy by ensuring products are designed with operational readiness from Day 0.
Partner with product and engineering teams to embed automation, AI-enabled support capabilities, and agentic workflows into product designs before launch.
Define and integrate standards for alerting, instrumentation, observability, runbooks, and workflow automation into the PDLC.
Establish lifecycle checkpoints and measurable readiness indicators (e.g., MTTR, signal coverage, operational maturity).
Lead structured operational readiness reviews and provide clear, actionable recommendations to support successful launches.
Be the connector across teams, aligning technical and operational partners around shared reliability and support outcomes.

Qualifications:

8+ years of experience in Operations, Site Reliability Engineering (SRE), Technical Program Management, Platform Operations, or a related discipline.
Demonstrated hands-on experience with Service Level Agreements (SLAs)/Service Level Objectives (SLOs), incident management, observability tooling, dashboards, and automation systems in large-scale, multi-product environments.
Strong collaboration and influence skills, with the ability to work effectively across engineering, product, and operational teams.
Experience driving operational consistency and continuous improvement in dynamic, technology-driven organizations.

Pay Transparency - North America

COMPENSATION AND BENEFITS

The ranges listed below are what EA in good faith expects to pay applicants for this role in these locations at the time of this posting. If you reside in a different location, a recruiter will advise on the applicable range and benefits. Pay offered will be determined based on a number of relevant business and candidate factors (e.g. education, qualifications, certifications, experience, skills, geographic location, or business needs).

PAY RANGES

• British Columbia (depending on location e.g. Vancouver vs. Victoria) $130,800 - $183,000 CAD

Pay is just one part of the overall compensation at EA.

For Canada, we offer a package of benefits including vacation (3 weeks per year to start), 10 days per year of sick time, paid top-up to EI/QPIP benefits up to 100% of base salary when you welcome a new child (12 weeks for maternity, and 4 weeks for parental/adoption leave), extended health/dental/vision coverage, life insurance, disability insurance, and a retirement plan for regular full-time employees. Certain roles may also be eligible for bonus and equity.

]]> full-time senior hybrid $130,800 - $183,000 CAD Service Level Agreements (SLAs), Service Level Objectives (SLOs), incident management, observability tooling, dashboards, automation systems Engineering Technology Electronic Arts https://logos.yubhub.co/jobs.ea.com.png Electronic Arts is a digital entertainment company that creates next-level entertainment experiences. https://jobs.ea.com https://jobs.ea.com/en_US/careers/JobDetail/Production-Readiness-Lead-Game-Developer-Experience-GDX/212677 Vancouver 2026-03-10 3514d749-08c Senior Support Engineer Senior Support Engineer - San Francisco

Location

San Francisco

Employment Type

Full time

Department

Compensation

$234K – $260K • Offers Equity

The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.

Benefits

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts

Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)

401(k) retirement plan with employer match

Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)

Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees

13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)

Mental health and wellness support

Employer-paid basic life and disability coverage

Annual learning and development stipend to fuel your professional growth

Daily meals in our offices, and meal delivery credits as eligible

Relocation support for eligible employees

Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.

About the Team

The Technical Support team is responsible for ensuring that developers and enterprises can reliably build mission critical solutions using OpenAI models. We provide technical guidance, resolve complex issues and support customers in maximizing value and adoption from deploying our highly-capable models. We work closely with Technical Success, Product, Engineering and others to deliver the best possible experience to our customers at scale. We think from an automation-first mindset and leverage the latest in AI to scale our support operations. Join the Senior Support Engineering (SSE) team at OpenAI and help shape the future of Technical Support in the age of AI.

About the Role

We are looking for a Senior Support Engineer to collaborate directly with our strategic enterprise accounts and product teams, helping solve some of the most difficult problems faced by our Customers. You will be part of the best technical troubleshooting team at OpenAI, and our Customers and Engineering teams will look to you for technical guidance in addressing the most technically difficult issues in our environment.

As a Senior Support Engineer, you will design and run operational processes to monitor our top strategic customers and a 24x7 response team. You’ll work closely with our Infrastructure and Engineering teams to deliver the best possible experience to customers at scale. Working directly with our most strategic Customers, you will be crucial to the success of the most innovative, disruptive, and high-scale AI solutions being built with the OpenAI API platform.

The nature of this role will be low volume, high difficulty.

This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.

In this role, you will:

Be among the foremost technical and troubleshooting experts for our API platform at OpenAI. You are the last line of defense before the core Engineering team.

Proactively identify and implement opportunities to scale support operations by leveraging automation and advancements in AI technologies. Contribute to shaping the future of technical support in an AI-driven era.

Configure and use advanced monitoring and alerting workflows to proactively detect customer impacting issues in real time.

In partnership with engineering, contribute to reliability reviews and preparedness for new features, launches, or strategic customer requirement updates. Ensure that operational readiness (monitoring, alerting, and fallback plans) is in place for any such changes.

Design and refine incident response processes and documentation across strategic customers, engineering and support teams.

Analyze operational metrics and incident RCAs to identify areas for improvement. Proactively recommend and implement enhancements to monitoring dashboards, alert configurations, and support workflows.

Provide support coverage during holidays and weekends based on business needs.

You might thrive in this role if you:

Have a Bachelor’s degree in Computer Science or a related field. A strong software engineering foundation is important for this role’s success.

Have 8+ years of experience in technical operations roles such as SRE/NOC, designing monitoring systems and resolving production issues in fast-paced and mission-critical environments. A strong track record of troubleshooting complex technical problems at the systems level.

Have deep familiarity with modern monitoring, alerting, and observability practices. Hands‑on experience setting up or managing metrics, logging, and tracing for distributed systems (e.g., understanding of SLIs/SLOs, alert tuning, dashboard creation).

Have proven experience leading incident response for high‑severity outages or service disruptions. Able to perform real‑time incident coordination, root cause analysis, and communication with stakeholders.

Are able to work effectively in a fast-paced environment, prioritize tasks, and manage multiple projects simultaneously.

Are a strong communicator and team player, with excellent written and verbal communication skills.

Are able to adapt to changing priorities and requirements, and are flexible in your approach to problem-solving.

]]> full-time senior hybrid $234K – $260K Bachelor’s degree in Computer Science or a related field, 8+ years of experience in technical operations roles such as SRE/NOC, Designing monitoring systems and resolving production issues in fast-paced and mission-critical environments, Troubleshooting complex technical problems at the systems level, Modern monitoring, alerting, and observability practices, Metrics, logging, and tracing for distributed systems, SLIs/SLOs, alert tuning, dashboard creation, Incident response for high‑severity outages or service disruptions, Real-time incident coordination, root cause analysis, and communication with stakeholders, Automation and advancements in AI technologies, Automation-first mindset and leveraging the latest in AI to scale support operations, Technical and troubleshooting expertise for API platform at OpenAI, Proactive identification and implementation of opportunities to scale support operations, Advanced monitoring and alerting workflows to proactively detect customer impacting issues in real time, Reliability reviews and preparedness for new features, launches, or strategic customer requirement updates, Operational readiness (monitoring, alerting, and fallback plans), Incident response processes and documentation across strategic customers, engineering and support teams, Operational metrics and incident RCAs to identify areas for improvement, Enhancements to monitoring dashboards, alert configurations, and support workflows Engineering Technology OpenAI https://logos.yubhub.co/openai.com.png OpenAI is a technology company that develops and offers artificial intelligence (AI) models and tools. It was founded in 2015 and is headquartered in San Francisco, California. https://jobs.ashbyhq.com https://jobs.ashbyhq.com/openai/5431666c-530b-49c0-b67e-32477f9eaf5e San Francisco 2026-03-06 70806a42-556 Senior Support Engineer Senior Support Engineer - Dublin

Location

Dublin, Ireland

Employment Type

Full time

Department

About the Team

About the Role

The nature of this role will be low volume, high difficulty.

This role is based in Dublin, Ireland. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.

In this role, you will:

Be among the foremost technical and troubleshooting experts for our API platform at OpenAI. You are the last line of defense before the core Engineering team.

Proactively identify and implement opportunities to scale support operations by leveraging automation and advancements in AI technologies. Contribute to shaping the future of technical support in an AI-driven era.

Configure and use advanced monitoring and alerting workflows to proactively detect customer impacting issues in real time.

In partnership with engineering, contribute to reliability reviews and preparedness for new features, launches, or strategic customer requirement updates. Ensure that operational readiness (monitoring, alerting, and fallback plans) is in place for any such changes.

Design and refine incident response processes and documentation across strategic customers, engineering and support teams.

Analyze operational metrics and incident RCAs to identify areas for improvement. Proactively recommend and implement enhancements to monitoring dashboards, alert configurations, and support workflows.

Provide support coverage during holidays and weekends based on business needs.

You might thrive in this role if you:

Have a Bachelor’s degree in Computer Science or a related field. A strong software engineering foundation is important for this role’s success.

Have 5+ years of experience in technical operations roles such as SRE/NOC, designing monitoring systems and resolving production issues in fast-paced and mission-critical environments. A strong track record of troubleshooting complex technical problems at the systems level.

Have deep familiarity with modern monitoring, alerting, and observability practices. Hands‑on experience setting up or managing metrics, logging, and tracing for distributed systems (e.g., understanding of SLIs/SLOs, alert tuning, dashboard creation).

Have proven experience leading incident response for high‑severity outages or service disruptions. Able to perform real‑time incident coordination, root cause analysis, and drive follow‑ups (post‑mortems, action items) to prevent recurrence. Knowledge of industry best practices for incident management and fault diagnosis.

Have strong skills in scripting or software engineering (e.g., Python or similar) to automate repetitive tasks and integrate tools.

Have solid understanding of cloud infrastructure and distributed systems fundamentals. Comfortable working with cloud services, load balancers, databases, and containerized applications.

Are effective at working cross‑functionally in a high‑trust environment. Strong communication skills to explain technical issues and resolutions to both engineering and non‑technical stakeholders. You can coordinate efforts across teams and are comfortable providing updates in the midst of an ongoing incident.

Compensation, Benefits and Perks

This is a position with OpenAI Ireland Ltd., which controls the hiring and management of this position.

Total compensation includes an annual salary, generous equity, and benefits.

Medical, dental, and vision insurance for you and your family

Mental health and wellness support

PRSA plan with 8% employer matching

Unlimited time off

Annual learning & development stipend ($1,500 USD equivalent per year)

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

]]> full-time senior hybrid Python, Cloud infrastructure, Distributed systems, Monitoring and alerting, Observability, Scripting, Software engineering, Cloud services, Load balancers, Databases, Containerized applications, SLIs/SLOs, Alert tuning, Dashboard creation, Incident management, Fault diagnosis, Cross-functional collaboration, Communication Engineering Technology OpenAI https://logos.yubhub.co/openai.com.png OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. https://jobs.ashbyhq.com https://jobs.ashbyhq.com/openai/988016e1-de50-42be-925a-438b97291c5d Dublin 2026-03-06