<?xml version="1.0" encoding="UTF-8"?>
<source>
  <jobs>
    <job>
      <externalid>ffccb977-f95</externalid>
      <Title>Senior Site Reliability Engineer</Title>
      <Description><![CDATA[<p>Are you excited by the idea of building fast, reliable, and intelligent infrastructure for a product used by engineering teams around the world? We&#39;re looking for a Senior Site Reliability Engineer to join the Backstage team at Spotify. We&#39;re building the next generation of our developer platform: one that doesn&#39;t just manage software but actively helps create and maintain it through AI-native workflows.</p>
<p>In 2026, SRE isn&#39;t just about uptime; it&#39;s about symbiosis. As part of our growing engineering team, you&#39;ll design, build, and operate the cloud infrastructure behind our external developer portal product and our internal fleet of background coding agents. You&#39;ll collaborate closely with experienced engineers (both human and AI-assisted) while operating at real-world scale, with deep observability, strong safety boundaries, and the unique reliability challenges of agentic production systems.</p>
<p>Backstage is more than just a platform; it&#39;s a foundational force in the developer community. Born out of Spotify&#39;s quest for better developer tooling, Backstage now powers developer portals across the globe. But we didn&#39;t stop at catalogs and templates. Today, Backstage is becoming the command center for AI-native engineering. From enterprises orchestrating large-scale migrations to fast-moving teams using AI to improve velocity and quality, our solutions are redefining what great developer experience looks like.</p>
<p>As part of the Backstage team, you&#39;ll shape developer experience for companies large and small, for our thriving open-source community, and for Spotify itself. You&#39;ll help define how reliable, secure infrastructure enables the next wave of agentic developer tooling.</p>
<p><strong>Responsibilities</strong></p>
<ul>
<li>Own fleet reliability. Lead the reliability, security, and scalability strategy for Portal&#39;s SaaS infrastructure, including the runtime environments that power our platform and LLM-driven agent workflows. Define SLOs, drive capacity planning, and ensure our systems meet the demands of a rapidly growing product.</li>
<li>Architect for the agentic era. Design and evolve infrastructure on GCP and AWS using Terraform and infrastructure-from-code patterns. Shape how we structure environments for non-deterministic AI workloads, including sandboxing, resource isolation, cost governance, and security boundaries.</li>
<li>Drive operational excellence. Evolve our incident management, on-call, and postmortem practices. Leverage AI assistants to accelerate root cause analysis and build increasingly self-healing capabilities into our production systems.</li>
<li>Lead full-stack reliability. Operate across a modern web stack (TypeScript, React, Python). While the role is not frontend-heavy, you&#39;ll diagnose and resolve issues across the stack and drive reliability improvements end-to-end.</li>
<li>Mentor and multiply. Raise the reliability IQ of the broader engineering team. Establish SRE best practices, conduct production-readiness reviews, and mentor engineers on operational thinking.</li>
<li>Shape the roadmap. Partner with engineering and product leadership to evolve our infrastructure in step with generative AI features. Translate operational insights into strategic input on the product roadmap.</li>
</ul>
<p><strong>Requirements</strong></p>
<ul>
<li>You have 5+ years of hands-on experience operating cloud infrastructure (GCP and/or AWS), using Terraform and Kubernetes to run production systems at scale.</li>
<li>You have practical experience, or a strong demonstrated interest, in operating LLM-based systems, RAG pipelines, or agentic workloads, and understand the reliability challenges of non-deterministic systems.</li>
<li>You think in distributed-systems first principles (consistency, availability, partition tolerance) and translate that thinking into pragmatic infrastructure decisions.</li>
<li>You are proficient in at least one modern language (TypeScript, Java, Go, or Python) and comfortable navigating large, heterogeneous codebases, including environments where AI-generated PRs are common.</li>
<li>You build automation and improve systems so that whole categories of operational issues disappear over time.</li>
<li>You communicate complex infrastructure trade-offs clearly to both technical and non-technical stakeholders, and you write postmortems that lead to meaningful change.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$164,448–$234,926 USD</Salaryrange>
      <Skills>cloud infrastructure, Terraform, Kubernetes, LLM-based systems, RAG pipelines, agentic workloads, distributed systems, TypeScript, Java, Go, Python</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Spotify</Employername>
      <Employerlogo>https://logos.yubhub.co/spotify.com.png</Employerlogo>
      <Employerdescription>Spotify is a music streaming service that provides access to millions of songs. It was founded in 2006 and has since become one of the largest music streaming services in the world.</Employerdescription>
      <Employerwebsite>https://www.spotify.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.lever.co/spotify/fdfe281d-889c-478a-8f27-c9bc36b2b0cf</Applyto>
      <Location>New York</Location>
      <Country></Country>
      <Postedate>2026-03-31</Postedate>
    </job>
  </jobs>
</source>