Staff Software Engineer, AI Reliability Engineering

6f3a053e-c43

We're seeking a Staff Software Engineer to join our AI Reliability Engineering team. As a key member of our team, you will develop Service Level Objectives for large language model serving systems, design and implement monitoring and observability systems, and lead incident response for critical AI services.

You will work closely with teams across Anthropic to improve reliability across our most critical serving paths. You will be responsible for making the systems that deliver Claude more robust and resilient, whether during an incident or while collaborating on projects.

To be successful in this role, you should have a strong background in distributed systems, infrastructure, or reliability. You should be curious and brave, comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don't have deep expertise yet.

You will be working on high-availability serving infrastructure across multiple regions and cloud providers. You will support the reliability of safeguard model serving, which is critical for both site reliability and Anthropic's safety commitments.

If you're committed to creating reliable, interpretable, and steerable AI systems, and you're passionate about working on complex technical problems, we'd love to hear from you.

XML job scraping automation by YubHub

full-time staff hybrid €235.000-€295.000 EUR
distributed systems, infrastructure, reliability, Service Level Objectives, monitoring, observability, incident response, high-availability serving infrastructure, cloud providers, SRE, Production Engineer, chaos engineering, systematic resilience testing, AI-specific observability tools and frameworks, ML hardware accelerators, RDMA, InfiniBand
Engineering Technology
Anthropic
https://logos.yubhub.co/anthropic.com.png
Anthropic is a public benefit corporation that creates reliable, interpretable, and steerable AI systems.
https://www.anthropic.com/
https://job-boards.greenhouse.io/anthropic/jobs/5101169008
Dublin, IE
2026-04-18

Staff / Senior Software Engineer, AI Reliability

709b405a-48b

We're seeking a Staff / Senior Software Engineer, AI Reliability to join our team. As a key member of our AIRE (AI Reliability Engineering) team, you will partner with teams across Anthropic to improve reliability across our most critical serving paths. You will develop Service Level Objectives for large language model serving systems, design and implement monitoring and observability systems, assist in the design and implementation of high-availability serving infrastructure, lead incident response for critical AI services, and support the reliability of safeguard model serving.

You may be a good fit for this role if you have a strong background in distributed systems, infrastructure, or reliability; are curious and brave; think holistically about how systems compose and where the seams are; can build lasting relationships across teams; care about users and feel ownership over outcomes; have excellent communication and collaboration skills; and bring diverse experience.

Strong candidates may also have experience operating large-scale model serving or training infrastructure, experience with one or more ML hardware accelerators, understanding of ML-specific networking optimizations, expertise in AI-specific observability tools and frameworks, experience with chaos engineering and systematic resilience testing, and contributions to open-source infrastructure or ML tooling.

We offer competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a lovely office space in which to collaborate with colleagues. We value impact and believe that the highest-impact AI research will be big science. We work as a single cohesive team on just a few large-scale research efforts and value communication skills.

If you're interested in this role, please submit an application even if you don't believe you meet every single qualification. We encourage diversity and strive to include a range of diverse perspectives on our team.

full-time staff hybrid $325,000-$485,000 USD
distributed systems, infrastructure, reliability, Service Level Objectives, monitoring and observability systems, high-availability serving infrastructure, incident response, safeguard model serving, large-scale model serving or training infrastructure, ML hardware accelerators, ML-specific networking optimizations, AI-specific observability tools and frameworks, chaos engineering and systematic resilience testing, open-source infrastructure or ML tooling
Engineering Technology
Anthropic
https://logos.yubhub.co/anthropic.com.png
Anthropic is a public benefit corporation that creates reliable, interpretable, and steerable AI systems.
https://www.anthropic.com/
https://job-boards.greenhouse.io/anthropic/jobs/5113224008
San Francisco, CA | New York City, NY | Seattle, WA
2026-04-18

Staff Backend Engineer, Knowledge Graph (Rust)

9537437b-e23

As a Staff Backend Engineer on the GitLab Knowledge Graph team, you'll help design, scale, and operate a high-impact graph data service that underpins agents, analytics, and architecture-level features across GitLab.com, Dedicated, and Self-Managed deployments.

You'll partner with a small, senior Rust-first team to ship reliable graph capabilities and make them easy for other teams and agents to use. The Knowledge Graph service is a distributed SDLC indexing system. It builds a property graph from GitLab SDLC (software development lifecycle) and code data using ClickHouse, NATS JetStream, and the Data Insights Platform. It also exposes secure graph queries and MCP tools for AI agents and product features.

In this role, you'll own core parts of the system end to end: shaping the architecture, hardening multi-tenant behavior and performance, and making it straightforward for other teams and agents to consume graph capabilities. In your first year, you'll take clear ownership of major areas of the service (for example, the graph query engine, SDLC indexing, or multi-tenant authorization), reduce single points of failure through better runbooks and shared context, and raise the bar on how we design, build, and operate analytical services across the stack.

Key responsibilities include:

Leading the design and evolution of core Knowledge Graph services in a production Rust codebase, including the graph query engine, SDLC and code indexing pipelines, and API/MCP surfaces that other GitLab teams and AI agents rely on.

Owning complex, cross-cutting initiatives that span GitLab Rails, the Data Insights Platform (Siphon, NATS, ClickHouse), and GitLab Duo Agent Platform, from technical direction and design docs through implementation, rollout, and iteration.

Driving system design decisions that improve reliability, scalability, and maintainability for analytical (OLAP-style) graph workloads, including multi-hop traversals, aggregations, and multi-tenant isolation, and documenting trade-offs so the broader team can move quickly and stay aligned.

Defining and improving operational maturity for the service, including service level objectives (SLOs), observability, runbooks, incident response, capacity planning, and production readiness (PREP) for GitLab.com, Dedicated, and Self-Managed deployments.

Collaborating asynchronously with product, data, infrastructure, security, and AI teams to sequence work, unblock platform-level dependencies, and land features in a way that is safe for customers and sustainable for the team.

Applying AI-assisted development workflows responsibly (for example, using MCP-aware tools, Knowledge Graph-backed agents, and internal Duo tooling) and helping establish practical norms for how the team uses AI while maintaining strong engineering judgment.

Mentoring and supporting other engineers through pairing, technical design reviews, and knowledge-sharing, reinforcing shared ownership of the system and its operational sustainability.

Contributing across the stack when needed, including occasional Ruby (Rails integration and authorization paths) or frontend work (for example, the Software Architecture Map UI) to close gaps and keep delivery moving.

This role requires significant experience building and operating production backend systems, with a track record of owning reliability, maintainability, and on-call readiness for services that support other product teams or platforms. Strong engineering skills in Rust, or clear evidence that you can ramp quickly and deliver in a Rust-first, performance-sensitive backend codebase, are essential. Strong system design skills are also necessary, including making and explaining clear architectural decisions, documenting constraints, and aligning trade-offs with product and platform needs.

full-time staff remote
Rust, ClickHouse, NATS JetStream, Data Insights Platform, graph data modeling, query patterns, property graphs, Cypher/GQL, n-hop traversals, aggregations, multi-tenant isolation, service level objectives, observability, runbooks, incident response, capacity planning, production readiness, AI-assisted development workflows, MCP-aware tools, Knowledge Graph-backed agents, internal Duo tooling
Engineering Technology
GitLab
https://logos.yubhub.co/about.gitlab.com.png
GitLab is an intelligent orchestration platform for DevSecOps, trusted by over 50 million registered users and more than 50% of the Fortune 100.
https://about.gitlab.com/
https://job-boards.greenhouse.io/gitlab/jobs/8481945002
Remote, India
2026-04-18

Production Readiness Lead - Game Developer Experience (GDX)

981e6f7e-ede

Electronic Arts creates next-level entertainment experiences that inspire players and fans around the world. Here, everyone is part of the story. Part of a community that connects across the globe. A place where creativity thrives, new perspectives are invited, and ideas matter. A team where everyone makes play happen.

The Electronic Arts Information Technology (EAIT) organization works as a global team to empower EA's employees and business operations to be creative, collaborative, and productive. As a digital entertainment company, EA's enterprise technology needs are diverse and span across game development, workforce collaboration, marketing, publishing, player experience, security, and corporate activities. Our mission is to bring creative technology services to each of these areas, working across the company to ensure better play.

As part of the Game Developer Experience (GDX) organization, the Engineering and Operations team is building a structured, scalable operational lifecycle across GameKit. In this role, you will play a central part in shaping how operational excellence is embedded into product delivery from concept through launch and beyond.

As the Production Readiness Lead, you will integrate operational standards directly into the Product Development Lifecycle (PDLC), ensuring that reliability, scalability, and support readiness are designed in, not added later. You will collaborate closely with Engineering, Product Management, Site Reliability Engineering (SRE), Customer Support, and Operations partners to help teams meet clearly defined expectations for observability, automation, documentation, and launch readiness.

This is a hybrid role (3 days per week in the office) based in Vancouver, reporting to the Director of Operations and partnering broadly across the GameKit ecosystem to establish a repeatable, sustainable operational lifecycle model.

Responsibilities:

Enable a digital-first, automation-forward support strategy by ensuring products are designed with operational readiness from Day 0.
Partner with product and engineering teams to embed automation, AI-enabled support capabilities, and agentic workflows into product designs before launch.
Define and integrate standards for alerting, instrumentation, observability, runbooks, and workflow automation into the PDLC.
Establish lifecycle checkpoints and measurable readiness indicators (e.g., MTTR, signal coverage, operational maturity).
Lead structured operational readiness reviews and provide clear, actionable recommendations to support successful launches.
Be the connector across teams, aligning technical and operational partners around shared reliability and support outcomes.

Qualifications:

8+ years of experience in Operations, Site Reliability Engineering (SRE), Technical Program Management, Platform Operations, or a related discipline.
Demonstrated hands-on experience with Service Level Agreements (SLAs)/Service Level Objectives (SLOs), incident management, observability tooling, dashboards, and automation systems in large-scale, multi-product environments.
Strong collaboration and influence skills, with the ability to work effectively across engineering, product, and operational teams.
Experience driving operational consistency and continuous improvement in dynamic, technology-driven organizations.

Pay Transparency - North America

COMPENSATION AND BENEFITS

The ranges listed below are what EA in good faith expects to pay applicants for this role in these locations at the time of this posting. If you reside in a different location, a recruiter will advise on the applicable range and benefits. Pay offered will be determined based on a number of relevant business and candidate factors (e.g. education, qualifications, certifications, experience, skills, geographic location, or business needs).

PAY RANGES

• British Columbia (depending on location e.g. Vancouver vs. Victoria) $130,800 - $183,000 CAD

Pay is just one part of the overall compensation at EA.

For Canada, we offer regular full-time employees a package of benefits including vacation (3 weeks per year to start), 10 days per year of sick time, paid top-up to EI/QPIP benefits up to 100% of base salary when you welcome a new child (12 weeks for maternity, and 4 weeks for parental/adoption leave), extended health/dental/vision coverage, life insurance, disability insurance, and a retirement plan. Certain roles may also be eligible for bonus and equity.

full-time senior hybrid $130,800 - $183,000 CAD
Service Level Agreements (SLAs), Service Level Objectives (SLOs), incident management, observability tooling, dashboards, automation systems
Engineering Technology
Electronic Arts
https://logos.yubhub.co/jobs.ea.com.png
Electronic Arts is a digital entertainment company that creates next-level entertainment experiences.
https://jobs.ea.com
https://jobs.ea.com/en_US/careers/JobDetail/Production-Readiness-Lead-Game-Developer-Experience-GDX/212677
Vancouver
2026-03-10