{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/title/site-reliability-engineer"},"x-facet":{"type":"title","slug":"site-reliability-engineer","display":"Site Reliability Engineer","count":7},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_24f0b17b-a0f"},"title":"Site Reliability Engineer","description":"<p>We are seeking highly experienced Site Reliability Engineers (SRE) to shape the reliability, scalability and performance of our platform and customer facing applications. 
You will work closely with our software engineers and research teams to ensure our systems meet and exceed our internal and external customers&#39; expectations.</p>\n<p>As a Site Reliability Engineer, you balance the day-to-day operations on production systems with long-term software engineering improvements to reduce operational toil and foster the reliability, availability, and performance of these systems.</p>\n<p>Key Responsibilities:</p>\n<ul>\n<li>Design, build, and maintain scalable, highly available and fault-tolerant infrastructures to support our web services and ML workloads</li>\n<li>Ensure our platform, inference and model training environments are always highly available and enable seamless replication of work environments across several HPC clusters</li>\n<li>Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, users admin, data extraction, infrastructure scaling, etc.)</li>\n<li>Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime</li>\n<li>Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging and alerting systems) for both our client-facing APIs and large training runs</li>\n</ul>\n<p>Development Responsibilities:</p>\n<ul>\n<li>Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes, Flux, Terraform</li>\n<li>Collaborate with AI/ML researchers to develop and implement solutions that enable safe and reproducible model-training experiments</li>\n<li>Build a cloud-agnostic platform offering an abstraction layer between science and infrastructure</li>\n<li>Design and develop new workflows and tooling to improve the reliability, availability and performance of our systems (automation scripts, refactoring, new API-based features, web apps, dashboards, etc.)</li>\n<li>Collaborate with the security team to ensure 
infrastructure adheres to best security practices and compliance requirements</li>\n<li>Document processes and procedures to ensure consistency and knowledge sharing across the team</li>\n<li>Contribute to open-source projects, research publications, blog articles and conferences</li>\n</ul>\n<p>Requirements:</p>\n<ul>\n<li>Master’s degree in Computer Science, Engineering or a related field</li>\n<li>7+ years of experience in a DevOps/SRE role</li>\n<li>Strong experience with cloud computing and highly available distributed systems</li>\n<li>Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations...)</li>\n<li>Experience working against reliability KPIs (observability, alerting, SLAs)</li>\n<li>Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes...)</li>\n<li>Knowledge of monitoring, logging, alerting and observability tools (Prometheus, Grafana, ELK Stack, Datadog...)</li>\n<li>Familiarity with infrastructure-as-code tools like Terraform or CloudFormation</li>\n<li>Proficiency in scripting languages (Python, Go, Bash...) 
and knowledge of software development best practices</li>\n<li>Strong understanding of networking, security, and system administration concepts</li>\n<li>Excellent problem-solving and communication skills</li>\n</ul>\n<p>Preferred Qualifications:</p>\n<ul>\n<li>Experience in an AI/ML environment</li>\n<li>Experience of high-performance computing (HPC) systems and workload managers (Slurm)</li>\n<li>Worked with modern AI-oriented solutions (Fluidstack, Coreweave, Vast...)</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_24f0b17b-a0f","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Mistral AI","sameAs":"https://mistral.ai","logo":"https://logos.yubhub.co/mistral.ai.png"},"x-apply-url":"https://jobs.lever.co/mistral/b320e972-3ed8-4d02-acb1-37950812cdbc","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["cloud computing","highly available distributed systems","DevOps","SRE","Kubernetes","Flux","Terraform","CI/CD","containerization","orchestration","monitoring","logging","alerting","observability","infrastructure-as-code","scripting languages","software development best practices","networking","security","system administration"],"x-skills-preferred":["AI/ML environment","high-performance computing","workload managers","modern AI-oriented solutions"],"datePosted":"2026-04-24T16:09:24.532Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"New York"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"cloud computing, highly available distributed systems, DevOps, SRE, Kubernetes, Flux, Terraform, CI/CD, containerization, orchestration, monitoring, logging, alerting, observability, infrastructure-as-code, scripting languages, 
software development best practices, networking, security, system administration, AI/ML environment, high-performance computing, workload managers, modern AI-oriented solutions"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_5fee1986-021"},"title":"Site Reliability Engineer","description":"<p>Electronic Arts is looking for a Site Reliability Engineer (SRE) to join our GameKit Operations team. As an SRE, you will be part of a newly formed SRE function and help shape the future of how EA builds and operates its development platforms and services.</p>\n<p>The work model for this role is a hybrid one, working 3 days per week from our office in Bucharest. In your first 60 days, you will gain an understanding of the GameKit environment and assess existing monitoring and observability systems. By 90 days, you will begin implementing the observability roadmap, contribute to incident response, and identify opportunities to improve automation and reliability.</p>\n<p>By 120 days, you will take ownership of main SRE plans, guide cross-team collaboration, and influence EA&#39;s approach to operational excellence. 
Beyond 180 days, you will lead long-term strategies to improve reliability, mentor engineers, and champion sustainable and scalable engineering practices.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Building scalable monitoring and observability systems using Prometheus/Grafana, Datadog, ELK, or similar</li>\n<li>Building infrastructure and tooling using technologies like Terraform, Ansible, AWS CloudFormation, and CI/CD pipelines (GitLab CI/CD)</li>\n<li>Automating operational processes using Python and Bash to reduce manual toil and improve deployment reliability</li>\n<li>Operating and improving containerized applications using Kubernetes platforms (EKS, AKS, GKE)</li>\n<li>Contributing to incident response processes and post-mortems, helping teams learn and improve from every incident</li>\n</ul>\n<p>Requirements include:</p>\n<ul>\n<li>Experience operating cloud platforms, especially AWS and Azure</li>\n<li>Expertise in monitoring, observability, and incident response at scale</li>\n<li>Hands-on experience with Infrastructure-as-Code and automation</li>\n<li>Desire to improve processes and team capabilities</li>\n<li>Comfortable working in dynamic environments and solving problems collaboratively</li>\n<li>5+ years of experience building SRE practices from the ground up</li>\n<li>Led on-call rotations or reliability-focused projects</li>\n<li>Mentored junior engineers and influenced engineering culture through documentation and collaboration</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_5fee1986-021","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Electronic 
Arts","sameAs":"https://jobs.ea.com","logo":"https://logos.yubhub.co/jobs.ea.com.png"},"x-apply-url":"https://jobs.ea.com/en_US/careers/JobDetail/Site-Reliability-Engineer/213684","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["cloud platforms","monitoring","observability","incident response","infrastructure-as-code","automation","containerized applications","kubernetes"],"x-skills-preferred":[],"datePosted":"2026-04-24T13:15:51.532Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Bucharest"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"cloud platforms, monitoring, observability, incident response, infrastructure-as-code, automation, containerized applications, kubernetes"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_2ac86458-692"},"title":"Site Reliability Engineer","description":"<p>We are seeking highly experienced Site Reliability Engineers (SRE) to shape the reliability, scalability and performance of our platform and customer facing applications. 
You will work closely with our software engineers and research teams to ensure our systems meet and exceed our internal and external customers&#39; expectations.</p>\n<p>As a Site Reliability Engineer, you balance the day-to-day operations on production systems with long-term software engineering improvements to reduce operational toil and foster the reliability, availability, and performance of these systems.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Design, build, and maintain scalable, highly available and fault-tolerant infrastructures to support our web services and ML workloads</li>\n<li>Make sure our platform, inference and model training environments are always highly available and enable seamless replication of work environments across several HPC clusters</li>\n<li>Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, users admin, data extraction, infrastructure scaling, etc.)</li>\n<li>Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime</li>\n<li>Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging and alerting systems) for both our client-facing APIs and large training runs</li>\n<li>Participate occasionally in on-call rotations to respond to incidents and perform root cause analysis to prevent future occurrences</li>\n</ul>\n<p><strong>Development</strong></p>\n<ul>\n<li>Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes, Flux, Terraform</li>\n<li>Collaborate with AI/ML researchers to develop and implement solutions that enable safe and reproducible model-training experiments</li>\n<li>Build a cloud-agnostic platform offering an abstraction layer between science and infrastructure</li>\n<li>Design and develop new workflows and tooling to improve the reliability, availability and performance of 
our systems (automation scripts, refactoring, new API-based features, web apps, dashboards, etc.)</li>\n<li>Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements</li>\n<li>Document processes and procedures to ensure consistency and knowledge sharing across the team</li>\n<li>Contribute to open-source projects, research publications, blog articles and conferences</li>\n</ul>\n<p><strong>Requirements</strong></p>\n<ul>\n<li>Master’s degree in Computer Science, Engineering or a related field</li>\n<li>7+ years of experience in a DevOps/SRE role</li>\n<li>Strong experience with cloud computing and highly available distributed systems</li>\n<li>Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations...)</li>\n<li>Experience working against reliability KPIs (observability, alerting, SLAs)</li>\n<li>Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes...)</li>\n<li>Knowledge of monitoring, logging, alerting and observability tools (Prometheus, Grafana, ELK Stack, Datadog...)</li>\n<li>Familiarity with infrastructure-as-code tools like Terraform or CloudFormation</li>\n<li>Proficiency in scripting languages (Python, Go, Bash...) 
and knowledge of software development best practices</li>\n<li>Strong understanding of networking, security, and system administration concepts</li>\n<li>Excellent problem-solving and communication skills</li>\n</ul>\n<p><strong>Nice to Have</strong></p>\n<ul>\n<li>Experience in an AI/ML environment</li>\n<li>Experience of high-performance computing (HPC) systems and workload managers (Slurm)</li>\n<li>Worked with modern AI-oriented solutions (Fluidstack, Coreweave, Vast...)</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_2ac86458-692","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Mistral AI","sameAs":"https://mistral.ai","logo":"https://logos.yubhub.co/mistral.ai.png"},"x-apply-url":"https://jobs.lever.co/mistral/b320e972-3ed8-4d02-acb1-37950812cdbc","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["cloud computing","highly available distributed systems","DevOps","SRE","CI/CD","containerization","orchestration","monitoring","logging","alerting","observability","infrastructure-as-code","Terraform","CloudFormation","scripting languages","Python","Go","Bash","software development best practices","networking","security","system administration"],"x-skills-preferred":[],"datePosted":"2026-04-24T13:10:16.828Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"New York"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"cloud computing, highly available distributed systems, DevOps, SRE, CI/CD, containerization, orchestration, monitoring, logging, alerting, observability, infrastructure-as-code, Terraform, CloudFormation, scripting languages, Python, Go, Bash, software development best practices, networking, security, system 
administration"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_9d0181ff-f94"},"title":"Site Reliability Engineer","description":"<p><strong>About the role</strong></p>\n<p>Gamma&#39;s infrastructure needs to be rock-solid for millions of daily users while enabling our engineering teams to ship fast. You&#39;ll own the operational health of our full backend platform, building automation and tooling that improves reliability and partnering with engineering to design systems that are observable, resilient, and easy to operate. Your work directly impacts every Gamma user&#39;s experience.</p>\n<p>This is a high-impact role where you&#39;ll balance reliability with velocity, knowing when to move fast and when to prioritize stability. You&#39;ll lead incident response, drive systemic improvements, and help shape how Gamma scales to serve its next 100 million users.</p>\n<p>Our team has a strong in-office culture and works in person 4–5 days per week in San Francisco. 
We love working together to stay creative and connected, with flexibility to work from home when focus matters most.</p>\n<p><strong>What you&#39;ll do</strong></p>\n<ul>\n<li>Own the reliability, availability, and performance of Gamma&#39;s production systems across our AWS infrastructure</li>\n<li>Build observability infrastructure from the ground up: metrics, logging, tracing, and alerting that give the team genuine visibility into system health before users feel the impact</li>\n<li>Design and ship automation that reduces toil, makes deployments safer, and gets us back on our feet faster when things go wrong</li>\n<li>Lead incident response and blameless post-mortems, then follow through on the systemic fixes that keep the same issues from coming back</li>\n<li>Partner with engineering teams on architecture reviews, SLO and SLI design, and reliability best practices that scale with the product</li>\n<li>Manage and optimize our compute, networking, databases, and managed services</li>\n</ul>\n<p><strong>What you&#39;ll bring</strong></p>\n<ul>\n<li>5+ years in site reliability engineering, DevOps, or systems engineering with deep, hands-on AWS expertise</li>\n<li>Strong programming skills in Python, Go, or TypeScript/Node.js, applied to building real tools and automation</li>\n<li>Solid experience with infrastructure-as-code (Terraform, CloudFormation) and end-to-end observability solutions</li>\n<li>Track record of making systems meaningfully more reliable through automation, smarter monitoring, and architectural improvements</li>\n<li>Deep understanding of networking, distributed systems, containerization (Docker, Kubernetes), and database performance at scale</li>\n<li>Sharp incident management instincts and the debugging skills to navigate complex production failures</li>\n<li>Experience scaling SaaS products to millions of users, or background with Kafka, chaos engineering, or service mesh technologies (Nice to have)</li>\n<li>AWS certifications, or 
experience with security and compliance frameworks like SOC 2 or ISO 27001 (Nice to have)</li>\n</ul>\n<p><strong>Compensation range:</strong></p>\n<p>The base salary for this full-time position, which spans multiple internal levels depending on qualifications, ranges between $230K - $310K plus benefits &amp; equity.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_9d0181ff-f94","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Gamma","sameAs":"https://gamma.com","logo":"https://logos.yubhub.co/gamma.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/gamma/365c8133-e9c1-4bcb-b8f1-975d96115503","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"Full time","x-salary-range":"$230K - $310K","x-skills-required":["AWS","Python","Go","TypeScript/Node.js","Terraform","CloudFormation","observability solutions","infrastructure-as-code","DevOps","site reliability engineering","systems engineering","networking","distributed systems","containerization","database performance"],"x-skills-preferred":["Kafka","chaos engineering","service mesh technologies","AWS certifications","security and compliance frameworks"],"datePosted":"2026-04-24T12:15:30.768Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"AWS, Python, Go, TypeScript/Node.js, Terraform, CloudFormation, observability solutions, infrastructure-as-code, DevOps, site reliability engineering, systems engineering, networking, distributed systems, containerization, database performance, Kafka, chaos engineering, service mesh technologies, AWS certifications, security and compliance 
frameworks","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":230000,"maxValue":310000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_a632e52b-c63"},"title":"Site Reliability Engineer","description":"<p>About Mistral AI</p>\n<p>At Mistral AI, we believe in the power of AI to simplify tasks, save time, and enhance learning and creativity. Our technology is designed to integrate seamlessly into daily working life.</p>\n<p>We are a dynamic team passionate about AI and its potential to transform society. Our diverse workforce thrives in competitive environments and is committed to driving innovation.</p>\n<p>Role Summary</p>\n<p>We are seeking highly experienced Site Reliability Engineers (SRE) to shape the reliability, scalability and performance of our platform and customer facing applications. You will work closely with our software engineers and research teams to ensure our systems meet and exceed our internal and external customers&#39; expectations.</p>\n<p>Responsibilities</p>\n<p>As a Site Reliability Engineer, you balance the day-to-day operations on production systems with long-term software engineering improvements to reduce operational toil and foster the reliability, availability, and performance of these systems.</p>\n<p>Operations</p>\n<p>• Design, build, and maintain scalable, highly available and fault-tolerant infrastructures to support our web services and ML workloads</p>\n<p>• Make sure our platform, inference and model training environments are always highly available and enable seamless replication of work environments across several HPC clusters</p>\n<p>• Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, users admin, data extraction, infrastructure scaling, etc.)</p>\n<p>• Implement and improve monitoring, alerting, and incident response systems to 
ensure optimal system performance and minimize downtime</p>\n<p>• Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging and alerting systems) for both our client-facing APIs and large training runs</p>\n<p>• Participate occasionally in on-call rotations to respond to incidents and perform root cause analysis to prevent future occurrences</p>\n<p>Development</p>\n<p>• Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes, Flux, Terraform</p>\n<p>• Collaborate with AI/ML researchers to develop and implement solutions that enable safe and reproducible model-training experiments</p>\n<p>• Build a cloud-agnostic platform offering an abstraction layer between science and infrastructure</p>\n<p>• Design and develop new workflows and tooling to improve the reliability, availability and performance of our systems (automation scripts, refactoring, new API-based features, web apps, dashboards, etc.)</p>\n<p>• Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements</p>\n<p>• Document processes and procedures to ensure consistency and knowledge sharing across the team</p>\n<p>• Contribute to open-source projects, research publications, blog articles and conferences</p>\n<p>About You</p>\n<p>• Master’s degree in Computer Science, Engineering or a related field</p>\n<p>• 7+ years of experience in a DevOps/SRE role</p>\n<p>• Strong experience with cloud computing and highly available distributed systems</p>\n<p>• Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations...)</p>\n<p>• Experience working against reliability KPIs (observability, alerting, SLAs)</p>\n<p>• Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes...)</p>\n<p>• Knowledge of monitoring, logging, alerting and 
observability tools (Prometheus, Grafana, ELK Stack, Datadog...)</p>\n<p>• Familiarity with infrastructure-as-code tools like Terraform or CloudFormation</p>\n<p>• Proficiency in scripting languages (Python, Go, Bash...) and knowledge of software development best practices</p>\n<p>• Strong understanding of networking, security, and system administration concepts</p>\n<p>• Excellent problem-solving and communication skills</p>\n<p>• Self-motivated and able to work well in a fast-paced startup environment</p>\n<p>Your Application Will Be All The More Interesting If You Also Have:</p>\n<p>• Experience in an AI/ML environment</p>\n<p>• Experience of high-performance computing (HPC) systems and workload managers (Slurm)</p>\n<p>• Worked with modern AI-oriented solutions (Fluidstack, Coreweave, Vast...)</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_a632e52b-c63","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Mistral AI","sameAs":"https://mistral.ai","logo":"https://logos.yubhub.co/mistral.ai.png"},"x-apply-url":"https://jobs.lever.co/mistral/6e16e4fa-a60b-4270-a815-06b0450fb597","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["cloud computing","highly available distributed systems","DevOps","SRE","Kubernetes","Flux","Terraform","CI/CD","containerization","orchestration","monitoring","logging","alerting","observability","infrastructure-as-code","scripting languages","software development best practices","networking","security","system administration"],"x-skills-preferred":["AI/ML environment","high-performance computing (HPC) systems","workload managers","modern AI-oriented 
solutions"],"datePosted":"2026-04-17T12:47:37.519Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Paris"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"cloud computing, highly available distributed systems, DevOps, SRE, Kubernetes, Flux, Terraform, CI/CD, containerization, orchestration, monitoring, logging, alerting, observability, infrastructure-as-code, scripting languages, software development best practices, networking, security, system administration, AI/ML environment, high-performance computing (HPC) systems, workload managers, modern AI-oriented solutions"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_419c1058-a0b"},"title":"Site Reliability Engineer","description":"<p>About Mistral AI</p>\n<p>At Mistral AI, we believe in the power of AI to simplify tasks, save time, and enhance learning and creativity. Our technology is designed to integrate seamlessly into daily working life. We democratize AI through high-performance, optimized, open-source and cutting-edge models, products and solutions. Our comprehensive AI platform is designed to meet enterprise needs, whether on-premises or in cloud environments.</p>\n<p>Role Summary</p>\n<p>We are seeking highly experienced Site Reliability Engineers (SRE) to shape the reliability, scalability and performance of our platform and customer facing applications. 
You will work closely with our software engineers and research teams to ensure our systems meet and exceed our internal and external customers&#39; expectations.</p>\n<p>Responsibilities</p>\n<p>As a Site Reliability Engineer, you balance the day-to-day operations on production systems with long-term software engineering improvements to reduce operational toil and foster the reliability, availability, and performance of these systems.</p>\n<p>Operations (50%)</p>\n<ul>\n<li>Design, build, and maintain scalable, highly available and fault-tolerant infrastructures to support our web services and ML workloads</li>\n<li>Make sure our platform, inference and model training environments are always highly available and enable seamless replication of work environments across several HPC clusters</li>\n<li>Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, users admin, data extraction, infrastructure scaling, etc.)</li>\n<li>Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime</li>\n<li>Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging and alerting systems) for both our client-facing APIs and large training runs</li>\n<li>Participate occasionally in on-call rotations to respond to incidents and perform root cause analysis to prevent future occurrences</li>\n</ul>\n<p>Development (50%)</p>\n<ul>\n<li>Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes, Flux, Terraform</li>\n<li>Collaborate with AI/ML researchers to develop and implement solutions that enable safe and reproducible model-training experiments</li>\n<li>Build a cloud-agnostic platform offering an abstraction layer between science and infrastructure</li>\n<li>Design and develop new workflows and tooling to improve the reliability, availability and performance of our 
systems (automation scripts, refactoring, new API-based features, web apps, dashboards, etc.)</li>\n<li>Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements</li>\n<li>Document processes and procedures to ensure consistency and knowledge sharing across the team</li>\n<li>Contribute to open-source projects, research publications, blog articles and conferences</li>\n</ul>\n<p>About You</p>\n<ul>\n<li>Master’s degree in Computer Science, Engineering or a related field</li>\n<li>7+ years of experience in a DevOps/SRE role</li>\n<li>Strong experience with cloud computing and highly available distributed systems</li>\n<li>Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations...) </li>\n<li>Experience working against reliability KPIs (observability, alerting, SLAs)</li>\n<li>Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes...)</li>\n<li>Knowledge of monitoring, logging, alerting and observability tools (Prometheus, Grafana, ELK Stack, Datadog...)</li>\n<li>Familiarity with infrastructure-as-code tools like Terraform or CloudFormation</li>\n<li>Proficiency in scripting languages (Python, Go, Bash...) 
and knowledge of software development best practices</li>\n<li>Strong understanding of networking, security, and system administration concepts</li>\n<li>Excellent problem-solving and communication skills</li>\n<li>Self-motivated and able to work well in a fast-paced startup environment</li>\n</ul>\n<p>Your Application Will Be All The More Interesting If You Also Have:</p>\n<ul>\n<li>Experience in an AI/ML environment</li>\n<li>Experience of high-performance computing (HPC) systems and workload managers (Slurm)</li>\n<li>Worked with modern AI-oriented solutions (Fluidstack, Coreweave, Vast...)</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_419c1058-a0b","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Mistral AI","sameAs":"https://mistral.ai/careers"},"x-apply-url":"https://jobs.lever.co/mistral/6e16e4fa-a60b-4270-a815-06b0450fb597","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["cloud computing","highly available distributed systems","DevOps","SRE","Kubernetes","Flux","Terraform","CI/CD","containerization","orchestration","monitoring","logging","alerting","observability","infrastructure-as-code","scripting languages","software development best practices","networking","security","system administration"],"x-skills-preferred":["AI/ML environment","high-performance computing","workload managers","modern AI-oriented solutions"],"datePosted":"2026-03-10T11:32:04.928Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Paris"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"cloud computing, highly available distributed systems, DevOps, SRE, Kubernetes, Flux, Terraform, CI/CD, containerization, orchestration, monitoring, 
logging, alerting, observability, infrastructure-as-code, scripting languages, software development best practices, networking, security, system administration, AI/ML environment, high-performance computing, workload managers, modern AI-oriented solutions"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_b7de618e-5e1"},"title":"Site Reliability Engineer","description":"<p>Join our Site Reliability Engineering team and help ensure the reliability, scalability, and performance of Replit&#39;s infrastructure that serves millions of developers worldwide. As a Site Reliability Engineer, you will bridge the gap between development and operations, implementing automation and establishing best practices that enable our platform to scale efficiently while maintaining high availability.</p>\n<p>We are seeking SREs who are passionate about building and maintaining resilient systems at scale. Your mission will be to design and implement robust monitoring solutions, automate operational tasks, and continuously improve our infrastructure&#39;s reliability and performance.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Design and Implement Observability Solutions: Develop comprehensive monitoring and alerting systems using modern observability tools. Create dashboards and metrics that provide real-time visibility into system health and performance. Implement logging strategies that enable quick problem identification and resolution.</li>\n</ul>\n<ul>\n<li>Drive Automation and Infrastructure as Code: Architect and implement infrastructure automation solutions using tools like Terraform, Ansible, or Pulumi. Design and maintain CI/CD pipelines that enable reliable and consistent deployments. 
Create self-healing systems that can automatically respond to common failure scenarios.</li>\n</ul>\n<ul>\n<li>Establish SLOs and SLIs: Work with product and engineering teams to define and implement Service Level Objectives (SLOs) and Service Level Indicators (SLIs). Build systems to track and report on these metrics, ensuring we maintain high reliability standards while balancing innovation speed.</li>\n</ul>\n<ul>\n<li>Incident Management and Response: Lead incident response efforts, conduct thorough post-mortems, and implement improvements to prevent future occurrences. Develop and maintain runbooks for critical services. Build tools and processes that reduce Mean Time To Recovery (MTTR).</li>\n</ul>\n<ul>\n<li>Performance Optimization: Identify and resolve performance bottlenecks across our infrastructure. Implement capacity planning strategies and optimize resource utilization. Work on reducing latency and improving system efficiency across global regions.</li>\n</ul>\n<p><strong>Requirements</strong></p>\n<ul>\n<li>4-8 years of experience in Site Reliability Engineering or similar roles (DevOps, Systems Engineering, Infrastructure Engineering)</li>\n</ul>\n<ul>\n<li>Strong programming skills in languages commonly used for automation (Python, Go, or similar)</li>\n</ul>\n<ul>\n<li>Deep understanding of distributed systems</li>\n</ul>\n<ul>\n<li>Experience with container orchestration platforms (Kubernetes) and cloud-native technologies</li>\n</ul>\n<ul>\n<li>Proven track record of implementing and maintaining monitoring/observability solutions</li>\n</ul>\n<ul>\n<li>Strong incident management skills with experience leading incident response</li>\n</ul>\n<ul>\n<li>Experience with infrastructure as code and configuration management tools</li>\n</ul>\n<p><strong>Bonus Points</strong></p>\n<ul>\n<li>Experience with Google Cloud Platform (GCP) services and tools</li>\n</ul>\n<ul>\n<li>Knowledge of modern observability platforms (Prometheus, Grafana, Datadog, 
etc.)</li>\n</ul>\n<p><strong>What We Value</strong></p>\n<ul>\n<li>Problem-solving mindset: Ability to approach complex operational challenges systematically and devise effective solutions</li>\n</ul>\n<ul>\n<li>Self-directed and autonomous: Capable of working independently while collaborating effectively with cross-functional teams</li>\n</ul>\n<ul>\n<li>Strong communication skills: Ability to explain complex technical concepts to both technical and non-technical audiences</li>\n</ul>\n<ul>\n<li>Continuous learning: Passion for staying current with industry best practices and new technologies</li>\n</ul>\n<ul>\n<li>Focus on automation: Strong belief in automating repetitive tasks and building self-healing systems</li>\n</ul>\n<p><strong>Full-Time Employee Benefits Include</strong></p>\n<ul>\n<li>Competitive Salary &amp; Equity</li>\n</ul>\n<ul>\n<li>401(k) Program with a 4% match</li>\n</ul>\n<ul>\n<li>Health, Dental, Vision and Life Insurance</li>\n</ul>\n<ul>\n<li>Short Term and Long Term Disability</li>\n</ul>\n<ul>\n<li>Paid Parental, Medical, Caregiver Leave</li>\n</ul>\n<ul>\n<li>Commuter Benefits</li>\n</ul>\n<ul>\n<li>Monthly Wellness Stipend</li>\n</ul>\n<ul>\n<li>Autonomous Work Environment</li>\n</ul>\n<ul>\n<li>In Office Set-Up Reimbursement</li>\n</ul>\n<ul>\n<li>Flexible Time Off (FTO) + Holidays</li>\n</ul>\n<ul>\n<li>Quarterly Team Gatherings</li>\n</ul>\n<ul>\n<li>In Office Amenities</li>\n</ul>\n<p><strong>Want to Learn More About What We Are Up To?</strong></p>\n<ul>\n<li>Meet the Replit Agent</li>\n</ul>\n<ul>\n<li>Replit: Make an app for that</li>\n</ul>\n<ul>\n<li>Replit Blog</li>\n</ul>\n<ul>\n<li>Amjad TED Talk</li>\n</ul>\n<p><strong>Interviewing + Culture at Replit</strong></p>\n<ul>\n<li>Operating Principles</li>\n</ul>\n<ul>\n<li>Reasons not to work at Replit</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a 
href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_b7de618e-5e1","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Replit","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/replit.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/replit/f6e6158e-eb89-4008-81ea-1b7512bc509d","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$160K - $250K","x-skills-required":["Site Reliability Engineering","DevOps","Systems Engineering","Infrastructure Engineering","Python","Go","Distributed systems","Container orchestration platforms","Cloud-native technologies","Monitoring/observability solutions","Incident management","Infrastructure as code","Configuration management tools"],"x-skills-preferred":["Google Cloud Platform","Prometheus","Grafana","Datadog"],"datePosted":"2026-03-07T15:20:24.140Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"United States"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Site Reliability Engineering, DevOps, Systems Engineering, Infrastructure Engineering, Python, Go, Distributed systems, Container orchestration platforms, Cloud-native technologies, Monitoring/observability solutions, Incident management, Infrastructure as code, Configuration management tools, Google Cloud Platform, Prometheus, Grafana, Datadog","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":160000,"maxValue":250000,"unitText":"YEAR"}}}]}