{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/monitoring-and-observability"},"x-facet":{"type":"skill","slug":"monitoring-and-observability","display":"Monitoring And Observability","count":13},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_34fa7d64-89a"},"title":"Technical Product Manager - Linux Developer Experience","description":"<p>We&#39;re seeking a Technical Product Manager to join our team responsible for shaping and evolving the developer experience on our firm&#39;s developer platform.</p>\n<p>In this pivotal role, you&#39;ll serve as the primary liaison between the platform engineering team and our developer community , including quantitative analysts, researchers, and front-office trading teams , ensuring the platform meets their complex development needs and continuously improves.</p>\n<p>The Developer Platform team architects, engineers, and enhances the firm&#39;s developer’s toolchain and workflow. 
We collaborate closely with developers, quants, researchers, and front-office trading teams to ensure our platform provides a best-in-class development experience with the feel of native Mac/UNIX-like development.</p>\n<p>This role sits at the intersection of product management and technical enablement, acting as the voice of the developer within the platform team.</p>\n<p>Key Responsibilities:</p>\n<ul>\n<li>Build and maintain relationships with technologists and developers across the firm to deeply understand their workflows, pain points, and emerging needs</li>\n</ul>\n<ul>\n<li>Discover novel use cases and translate them into actionable product requirements for the platform engineering team</li>\n</ul>\n<ul>\n<li>Serve as the first point of contact for developer questions about the platform&#39;s environment, tooling, and capabilities</li>\n</ul>\n<ul>\n<li>Triage and reproduce issues reported by developers, driving initial diagnosis , including leveraging AI-assisted sessions for problem analysis , and escalating to the deeper technical engineering team when necessary</li>\n</ul>\n<ul>\n<li>Drive the roadmap and prioritization of platform enhancements in collaboration with engineering leadership</li>\n</ul>\n<ul>\n<li>Promote and evangelize the Linux developer platform , driving adoption and ensuring developers are aware of available features and best practices</li>\n</ul>\n<ul>\n<li>Manage project timelines, stakeholder communication, and delivery milestones for platform initiatives</li>\n</ul>\n<p>Qualifications / Skills Required:</p>\n<ul>\n<li>Demonstrated experience in Technical Product Management, Technical Project Management, or Developer Relations/Developer Experience roles</li>\n</ul>\n<ul>\n<li>Strong communication and stakeholder management skills , ability to engage credibly with both highly technical developers and senior leadership</li>\n</ul>\n<ul>\n<li>Working familiarity with Linux desktop environments , comfortable navigating the platform, 
understanding developer workflows, and answering environment/tooling questions</li>\n</ul>\n<ul>\n<li>Conceptual understanding of containerization and orchestration (Docker, Podman, Kubernetes) and how developers leverage these tools in their workflows</li>\n</ul>\n<ul>\n<li>Familiarity with CI/CD concepts and tools (e.g., Jenkins, Git) , enough to understand developer pipelines and identify friction points</li>\n</ul>\n<ul>\n<li>Problem reproduction and triage skills , ability to recreate reported issues in the environment and clearly document/escalate to engineering with relevant context</li>\n</ul>\n<ul>\n<li>Experience leveraging AI tools (e.g., LLM-based assistants, copilots) to assist in problem diagnosis, research, and knowledge synthesis</li>\n</ul>\n<ul>\n<li>Basic scripting literacy (Bash, Python) , enough to read, understand, and run existing scripts; not necessarily write complex automation from scratch</li>\n</ul>\n<p>Qualifications / Skills Desired:</p>\n<ul>\n<li>Familiarity with serverless compute concepts and cloud-native development paradigms</li>\n</ul>\n<ul>\n<li>Exposure to configuration management tools (e.g., Ansible) and image lifecycle management (e.g., Hashicorp Packer) , understanding what they do and how they fit into the platform, rather than hands-on administration</li>\n</ul>\n<ul>\n<li>Awareness of monitoring and observability tools (Prometheus, Grafana, ELK stack) from a user/consumer perspective</li>\n</ul>\n<ul>\n<li>Understanding of authentication and identity management concepts (e.g., Active Directory integration) as they relate to developer access and workflows</li>\n</ul>\n<ul>\n<li>Experience with agile project management methodologies and tools (Jira, Confluence, or similar)</li>\n</ul>\n<ul>\n<li>Strong communication skills working with engineering leadership, developer community, and stakeholders</li>\n</ul>\n<ul>\n<li>Bachelor’s degree in Computer Science or a related field</li>\n</ul>\n<p>The estimated base salary range 
for this position is $175,000 to $250,000, which is specific to New York and may change in the future. Millennium pays a total compensation package which includes a base salary, discretionary performance bonus, and a comprehensive benefits package.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_34fa7d64-89a","directApply":true,"hiringOrganization":{"@type":"Organization","name":"IT Infrastructure","sameAs":"https://mlp.eightfold.ai","logo":"https://logos.yubhub.co/mlp.eightfold.ai.png"},"x-apply-url":"https://mlp.eightfold.ai/careers/job/755953932410","x-work-arrangement":null,"x-experience-level":null,"x-job-type":"full-time","x-salary-range":"$175,000 to $250,000","x-skills-required":["Technical Product Management","Technical Project Management","Developer Relations/Developer Experience","Linux desktop environments","Containerization and orchestration","CI/CD concepts and tools","Problem reproduction and triage skills","AI tools","Basic scripting literacy"],"x-skills-preferred":["Serverless compute concepts and cloud-native development paradigms","Configuration management tools","Image lifecycle management","Monitoring and observability tools","Authentication and identity management concepts","Agile project management methodologies and tools"],"datePosted":"2026-04-18T22:13:03.074Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"New York, New York, United States of America"}},"employmentType":"FULL_TIME","occupationalCategory":"IT","industry":"Technology","skills":"Technical Product Management, Technical Project Management, Developer Relations/Developer Experience, Linux desktop environments, Containerization and orchestration, CI/CD concepts and tools, Problem reproduction and triage skills, AI tools, Basic scripting literacy, Serverless compute concepts and cloud-native development paradigms, 
Configuration management tools, Image lifecycle management, Monitoring and observability tools, Authentication and identity management concepts, Agile project management methodologies and tools","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":175000,"maxValue":250000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_0a2ea62c-943"},"title":"Research Engineer, Infrastructure, RL Systems","description":"<p>We&#39;re looking for an infrastructure research engineer to design and build the core systems that enable scalable, efficient training of large models through reinforcement learning.</p>\n<p>This role sits at the intersection of research and large-scale systems engineering: a builder who understands both the algorithms behind RL and the realities of distributed training and inference at scale. You&#39;ll wear many hats, from optimising rollout and reward pipelines to enhancing reliability, observability, and orchestration, collaborating closely with researchers and infra teams to make reinforcement learning stable, fast, and production-ready.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Design, build, and optimise the infrastructure that powers large-scale reinforcement learning and post-training workloads.</li>\n</ul>\n<ul>\n<li>Improve the reliability and scalability of RL training pipeline, distributed RL workloads, and training throughput.</li>\n</ul>\n<ul>\n<li>Develop shared monitoring and observability tools to ensure high uptime, debuggability, and reproducibility for RL systems.</li>\n</ul>\n<ul>\n<li>Collaborate with researchers to translate algorithmic ideas into production-grade training pipelines.</li>\n</ul>\n<ul>\n<li>Build evaluation and benchmarking infrastructure that measures model progress on helpfulness, safety, and factuality.</li>\n</ul>\n<ul>\n<li>Publish and share learnings through internal 
documentation, open-source libraries, or technical reports that advance the field of scalable AI infrastructure.</li>\n</ul>\n<p>We&#39;re looking for someone with strong engineering skills, ability to contribute performant, maintainable code and debug in complex codebases. You should have a good understanding of deep learning frameworks (e.g., PyTorch, JAX) and their underlying system architectures.</p>\n<p>Experience training or supporting large-scale language models with tens of billions of parameters or more is a plus. Familiarity with monitoring and observability tools (Prometheus, Grafana, OpenTelemetry) is also a plus.</p>\n<p>Logistics:</p>\n<ul>\n<li>Location: This role is based in San Francisco, California.</li>\n</ul>\n<ul>\n<li>Compensation: Depending on background, skills and experience, the expected annual salary range for this position is $350,000 - $475,000 USD.</li>\n</ul>\n<ul>\n<li>Visa sponsorship: We sponsor visas. While we can&#39;t guarantee success for every candidate or role, if you&#39;re the right fit, we&#39;re committed to working through the visa process together.</li>\n</ul>\n<ul>\n<li>Benefits: Thinking Machines offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_0a2ea62c-943","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Thinking Machines Lab","sameAs":"https://thinkingmachineslab.com/","logo":"https://logos.yubhub.co/thinkingmachineslab.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/thinkingmachines/jobs/5013930008","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$350,000 - $475,000 USD","x-skills-required":["deep learning frameworks","PyTorch","JAX","complex codebases","scalable AI 
infrastructure","large-scale language models","monitoring and observability tools"],"x-skills-preferred":["experience training or supporting large-scale language models","familiarity with monitoring and observability tools"],"datePosted":"2026-04-18T15:56:59.642Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"deep learning frameworks, PyTorch, JAX, complex codebases, scalable AI infrastructure, large-scale language models, monitoring and observability tools, experience training or supporting large-scale language models, familiarity with monitoring and observability tools","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":350000,"maxValue":475000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_eff95313-cdc"},"title":"Senior Site Reliability Engineer","description":"<p>The Senior Site Reliability Engineer will play a key role in developing scalable, reliable, and efficient infrastructure that powers the entire company. This includes building and scaling internal platform offerings, designing and implementing monitoring, alerting, and incident response systems, and collaborating with application software engineers to guide their design and ensure it scales for what Carta needs in the long run.</p>\n<p>The ideal candidate will have extensive experience with cloud services such as AWS, Google Cloud Platform, or Azure, including services like EC2, S3, RDS, and Lambda. They will also be proficient in using tools such as Terraform, Ansible, or CloudFormation for managing and provisioning cloud infrastructure.</p>\n<p>The team is responsible for providing secure, reliable, scalable, and performant infrastructure to Carta&#39;s customers and developers. 
The successful candidate will be a strong communicator who enjoys collaborating to solve complex problems and has familiarity with infrastructure best practices on performance, reliability, and security and their associated tools.</p>\n<p>Our stack is Python, Java, Terraform, gRPC, Docker, Kubernetes, Postgres, running on AWS. Come join us!</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_eff95313-cdc","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Carta","sameAs":"https://carta.com/","logo":"https://logos.yubhub.co/carta.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/carta/jobs/7688689003","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$181,688 - $225,000","x-skills-required":["Cloud Platforms","Infrastructure as Code (IaC)","Networking","Monitoring and Observability","Software Development","API Services","AI Fluency"],"x-skills-preferred":["Experience operating CI/CD and its associated best practices"],"datePosted":"2026-04-18T15:55:48.770Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, California; Santa Clara, California; Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Cloud Platforms, Infrastructure as Code (IaC), Networking, Monitoring and Observability, Software Development, API Services, AI Fluency, Experience operating CI/CD and its associated best practices","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":181688,"maxValue":225000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_709b405a-48b"},"title":"Staff / Senior Software Engineer, AI Reliability","description":"<p>We&#39;re 
seeking a Staff / Senior Software Engineer, AI Reliability to join our team. As a key member of our AIRE (AI Reliability Engineering) team, you will partner with teams across Anthropic to improve reliability across our most critical serving paths. You will develop Service Level Objectives for large language model serving systems, design and implement monitoring and observability systems, assist in the design and implementation of high-availability serving infrastructure, lead incident response for critical AI services, and support the reliability of safeguard model serving.</p>\n<p>You may be a good fit for this role if you have strong distributed systems, infrastructure, or reliability backgrounds, are curious and brave, think holistically about how systems compose and where the seams are, can build lasting relationships across teams, care about users and feel ownership over outcomes, have excellent communication and collaboration skills, and bring diverse experience.</p>\n<p>Strong candidates may also have experience operating large-scale model serving or training infrastructure, experience with one or more ML hardware accelerators, understanding of ML-specific networking optimizations, expertise in AI-specific observability tools and frameworks, experience with chaos engineering and systematic resilience testing, and contributions to open-source infrastructure or ML tooling.</p>\n<p>We offer competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a lovely office space in which to collaborate with colleagues. We value impact and believe that the highest-impact AI research will be big science. We work as a single cohesive team on just a few large-scale research efforts and value communication skills.</p>\n<p>If you&#39;re interested in this role, please submit an application even if you don&#39;t believe you meet every single qualification. 
We encourage diversity and strive to include a range of diverse perspectives on our team.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_709b405a-48b","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5113224008","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$325,000-$485,000 USD","x-skills-required":["distributed systems","infrastructure","reliability","Service Level Objectives","monitoring and observability systems","high-availability serving infrastructure","incident response","safeguard model serving"],"x-skills-preferred":["large-scale model serving or training infrastructure","ML hardware accelerators","ML-specific networking optimizations","AI-specific observability tools and frameworks","chaos engineering and systematic resilience testing","open-source infrastructure or ML tooling"],"datePosted":"2026-04-18T15:52:16.313Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"distributed systems, infrastructure, reliability, Service Level Objectives, monitoring and observability systems, high-availability serving infrastructure, incident response, safeguard model serving, large-scale model serving or training infrastructure, ML hardware accelerators, ML-specific networking optimizations, AI-specific observability tools and frameworks, chaos engineering and systematic resilience testing, open-source infrastructure or ML 
tooling","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":325000,"maxValue":485000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_782a1c68-325"},"title":"Senior DevOps Engineer","description":"<p>At ZoomInfo, we&#39;re looking for a Senior DevOps Engineer to join our Infrastructure Engineering group. As a Senior DevOps Engineer, you will be responsible for innovation in infrastructure and automation for ZoomInfo Engineering. You will have a strong background in modern infrastructure, with a thorough understanding of industry best practices. You will have a high level of comfort participating in challenging technical discussions and advocating for best practices in a high-paced environment.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Thorough, clear, concise documentation of new and existing standards, procedures, and automated workflows</li>\n<li>Championing of best practices and standards around infrastructure configuration and management</li>\n<li>Experience in creating internal products and managing their software development lifecycle</li>\n<li>Deployment, configuration, and management of infrastructure via infrastructure as code</li>\n<li>Working hands on with cloud infrastructure (AWS, Azure, and GCP)</li>\n<li>Working hands on with container infrastructure (Docker, Kubernetes, ECS, EKS, GKE, GAE, etc.)</li>\n<li>Configuration and management of Linux based tools and third-party cloud services</li>\n<li>Continuous improvement of our infrastructure, ensuring that it is highly available and observable</li>\n</ul>\n<p>Minimum Requirements:</p>\n<ul>\n<li>Solid foundation of experience managing Linux systems in virtual environments (6+ years)</li>\n<li>Deploying and maintaining highly available infrastructure in one or more Cloud providers (5+ years, AWS or GCP preferred)</li>\n<li>Infrastructure as code using Terraform 
(4+ years)</li>\n<li>Creating, deploying, maintaining, and troubleshooting Docker images (4+ years)</li>\n<li>Scoping, deploying, maintaining and troubleshooting Kubernetes clusters (4+ years)</li>\n<li>Developing and maintaining an active codebase in Go, Python preferably (3+ years)</li>\n<li>Experience with PaaS technologies (5+ years, EKS and GKE preferred)</li>\n<li>Maintaining monitoring and observability tools (Datadog, Prometheus preferred)</li>\n<li>Thorough understanding of network infrastructure and concepts (VPNs, routers and routing protocols, TCP/IP, IPv4 and v6, UDP, OSI layers, etc.)</li>\n<li>Experience with load balancing and proxy technologies (Istio, Nginx, HAProxy, Apache, Cloud load balancers, etc.)</li>\n<li>Debugging and troubleshooting complex problems in cloud-native infrastructure.</li>\n<li>Slack native mentality.</li>\n<li>Bachelor’s Degree in Computer Science or a related technical discipline, or the equivalent combination of education, technical certifications, training, or work experience.</li>\n</ul>\n<p>Abilities Required:</p>\n<ul>\n<li>Demonstrated ability to learn new technologies quickly and independently</li>\n<li>Strong technical, organizational and interpersonal skills</li>\n<li>Strong written and verbal communication skills</li>\n<li>Must be able to read, understand, and communicate complex problems and solutions in English over a textual medium (such as Slack)</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a 
href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_782a1c68-325","directApply":true,"hiringOrganization":{"@type":"Organization","name":"ZoomInfo","sameAs":"https://www.zoominfo.com/","logo":"https://logos.yubhub.co/zoominfo.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/zoominfo/jobs/8287254002","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Linux","Cloud infrastructure (AWS, Azure, GCP)","Container infrastructure (Docker, Kubernetes, ECS, EKS, GKE, GAE)","Infrastructure as code (Terraform)","Go","Python","PaaS technologies (EKS, GKE)","Monitoring and observability tools (Datadog, Prometheus)"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:47:10.427Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Ra'anana, Israel"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Linux, Cloud infrastructure (AWS, Azure, GCP), Container infrastructure (Docker, Kubernetes, ECS, EKS, GKE, GAE), Infrastructure as code (Terraform), Go, Python, PaaS technologies (EKS, GKE), Monitoring and observability tools (Datadog, Prometheus)"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_da7679a6-e4f"},"title":"Senior Technical Operations Lead","description":"<p>Job Title: Senior Technical Operations Lead</p>\n<p>We are seeking an experienced Senior Technical Operations Lead to drive operational excellence across our Infrastructure Engineering organization.</p>\n<p>As a Senior Technical Operations Lead, you will design and implement world-class operational processes, establish SRE best practices, and mentor technical teams to achieve exceptional reliability and efficiency.</p>\n<p>Key Responsibilities:</p>\n<p>SRE Leadership &amp; Transformation</p>\n<ul>\n<li>Lead the design and implementation of SRE practices 
and tooling across Infrastructure Engineering</li>\n</ul>\n<ul>\n<li>Establish and cultivate an SRE-focused culture at Zoominfo</li>\n</ul>\n<p>Operational Process Design &amp; Governance</p>\n<ul>\n<li>Establish clear governance frameworks and procedural consistency</li>\n</ul>\n<ul>\n<li>Make decisions about process exceptions and/or changes to accommodate different team contexts</li>\n</ul>\n<ul>\n<li>Design and/or implement process automations using scripts and integrations</li>\n</ul>\n<ul>\n<li>Define functional requirements and goals for process automations</li>\n</ul>\n<ul>\n<li>Conduct hands-on and/or automated audits to ensure process adherence and identify improvement opportunities</li>\n</ul>\n<p>Incident Management &amp; Root Cause Analysis</p>\n<ul>\n<li>Design, implement, and continuously improve Incident Management and Change Management procedures that scale across the organization, using tools such as PagerDuty, Slack, Jira, ServiceNow, and custom integrations</li>\n</ul>\n<ul>\n<li>Lead and participate in root cause analysis sessions, driving teams toward systemic improvements rather than blame</li>\n</ul>\n<ul>\n<li>Design and execute incident dry runs and tabletop exercises to build organizational resilience</li>\n</ul>\n<ul>\n<li>Establish metrics and KPIs that measure incident response effectiveness and drive continuous improvement</li>\n</ul>\n<p>Enable Data-Driven Decision Making</p>\n<ul>\n<li>Identify, define, and automate the tracking of operational KPIs and departmental metrics that matter, enabling senior managers to make informed decisions on the basis of data</li>\n</ul>\n<ul>\n<li>Build and maintain metric dashboards and automated reporting systems that provide real-time visibility into operational health</li>\n</ul>\n<ul>\n<li>Analyze trends and surface opportunities for optimization</li>\n</ul>\n<p>Stakeholder Engagement, Training &amp; Mentorship</p>\n<ul>\n<li>Build and maintain strong relationships with Engineering managers, 
Product Managers, and cross-functional stakeholders across geographies</li>\n</ul>\n<ul>\n<li>Maintain a feedback loop. Meet with stakeholders to understand process pain points.</li>\n</ul>\n<ul>\n<li>Influence others by fostering trust, leading by example, and inspiring them with your expertise and passion for reliability practices.</li>\n</ul>\n<ul>\n<li>Enhance internal knowledge of third-party tools such as Pagerduty, Datadog, and more, by educating Zoominfo employees on these tools.</li>\n</ul>\n<p>Deliver training sessions that make Operational Excellence engaging and motivating for diverse audiences.</p>\n<p>Required Experience &amp; Qualifications:</p>\n<ul>\n<li>Bachelor’s degree in Software Engineering, Operations Management, or related field</li>\n</ul>\n<ul>\n<li>7+ years of hands-on experience in technical operations, Site Reliability Engineering (SRE), Incident Management, or IT Service Management roles within SaaS or technical organizations</li>\n</ul>\n<ul>\n<li>Fluent English proficiency (written and verbal)</li>\n</ul>\n<ul>\n<li>Proven track record designing and implementing operational processes at scale</li>\n</ul>\n<ul>\n<li>Demonstrated expertise in SRE principles, practices, and tooling</li>\n</ul>\n<ul>\n<li>Strong data analysis skills with ability to define metrics, build or design dashboards, and use data to drive strategic decisions</li>\n</ul>\n<ul>\n<li>Proven ability to work effectively in a matrix organizational structure</li>\n</ul>\n<ul>\n<li>Ability and experience working with senior management at global organizations</li>\n</ul>\n<ul>\n<li>Hands-on experience with monitoring and observability tools such as PagerDuty and/or Datadog</li>\n</ul>\n<ul>\n<li>Familiarity with Jira, Confluence, Google Data Studio, or Tableau</li>\n</ul>\n<ul>\n<li>Experience with scripting and integrations (Python, JavaScript, Google AppScript, or similar)</li>\n</ul>\n<ul>\n<li>Background in SRE transformation or organizational process improvement 
initiatives</li>\n</ul>\n<p>#LI-SS4 #LI-Hybrid</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_da7679a6-e4f","directApply":true,"hiringOrganization":{"@type":"Organization","name":"ZoomInfo","sameAs":"https://www.zoominfo.com/","logo":"https://logos.yubhub.co/zoominfo.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/zoominfo/jobs/8451386002","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Site Reliability Engineering (SRE)","Technical Operations","Incident Management","IT Service Management","Monitoring and Observability Tools","Jira","Confluence","Google Data Studio","Tableau","Scripting and Integrations","Python","JavaScript","Google AppScript"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:45:47.393Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Ra'anana, Israel"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Site Reliability Engineering (SRE), Technical Operations, Incident Management, IT Service Management, Monitoring and Observability Tools, Jira, Confluence, Google Data Studio, Tableau, Scripting and Integrations, Python, JavaScript, Google AppScript"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_bd4ea9f9-369"},"title":"Staff Software Engineer","description":"<p>Omada Health is on a mission to inspire and engage people in lifelong health, one step at a time.</p>\n<p>We&#39;re seeking a Staff Software Engineer to lead the modernization, optimization, and scalability of Omada&#39;s B2B platform. 
This role is ideal for someone who combines deep technical expertise with strong leadership, someone eager to design for scale, mentor others, and influence technical direction across teams.</p>\n<p>You&#39;ll play a central role in re-architecting complex legacy systems, designing high-performance data pipelines (batch and real-time), and ensuring our core B2B capabilities (file ingestion, marketing outreach, eligibility, and billing) are robust, performant, and ready for the next wave of growth.</p>\n<p><strong>About You:</strong></p>\n<p>You&#39;re a systems thinker who thrives on solving hard technical challenges at scale. You have a strong foundation in distributed systems, database performance, and architectural design patterns, and you naturally guide teams toward simpler, more scalable solutions.</p>\n<p>You&#39;re both a technical expert and a connector, equally comfortable deep in the code or collaborating across disciplines. You&#39;re passionate about leading by example, mentoring others, and helping engineers across Omada level up their craft. 
You&#39;re also motivated by impact, building systems that help improve health outcomes for millions.</p>\n<p><strong>What You&#39;ll Be Doing:</strong></p>\n<ul>\n<li>Lead architecture, system design and engineering efforts for high-scale, data-intensive B2B systems supporting eligibility, billing, marketing, and file ingestion.</li>\n<li>Design and implement batch and real-time processing architectures that are reliable, observable, and performant.</li>\n<li>Drive efforts in database performance optimization, schema design, and long-term scalability planning across multi-terabyte PostgreSQL and other persistent stores.</li>\n<li>Partner closely with product, infrastructure, and operations teams to deliver resilient, maintainable systems that balance business needs with technical excellence.</li>\n<li>Identify and lead engineering-wide initiatives that improve scalability, developer efficiency, or data quality.</li>\n<li>Mentor and coach engineers at all levels, and actively contribute to Omada’s engineering community through design reviews, technical talks, and shared best practices.</li>\n<li>Contribute to modern, cloud-forward architecture across multiple product domains, ensuring our systems are designed to evolve gracefully and scale efficiently.</li>\n<li>Use and advocate for AI-assisted development tools (e.g., Cursor, Claude) to enhance individual and team productivity.</li>\n<li>Champion a culture of quality, observability, and reliability through strong DevOps principles and continuous improvement.</li>\n</ul>\n<p>*</p>\n<ul>\n<li><strong>What You Need for This Role:</strong></li>\n</ul>\n<ul>\n<li>10+ years of software engineering experience, with a significant portion spent on scalable systems architecture and performance optimization.</li>\n<li>Proven success in re-architecting complex legacy platforms and implementing modern, maintainable solutions.</li>\n<li>Strong programming experience with Ruby and Python, and comfort working across a modern stack 
(Rails, GraphQL, Django, Sidekiq).</li>\n<li>Deep understanding of relational databases (PostgreSQL, MySQL), performance tuning, and data modeling.</li>\n<li>Hands-on experience with both batch and streaming data pipelines (e.g., SQS, Kafka, Kinesis, Airflow).</li>\n<li>Demonstrable mastery of API design, distributed systems, and cloud-native architecture (preferably AWS).</li>\n<li>Fluency in CI/CD, containerization, and infrastructure-as-code (Docker, Kubernetes, Terraform).</li>\n<li>Familiarity with monitoring and observability frameworks (Datadog, OpenTelemetry).</li>\n<li>Excellent communication and collaboration skills, with a proven ability to influence and deliver through others.</li>\n<li>Growth mindset and genuine curiosity about new technologies, tools, and team approaches.</li>\n</ul>\n<p>*</p>\n<ul>\n<li><strong>Technologies We Use:</strong></li>\n</ul>\n<ul>\n<li>Ruby on Rails</li>\n<li>Sidekiq</li>\n<li>AWS Managed Datastores (RDS with PostgreSQL, Elasticache, ElasticSearch SNS/SQS)</li>\n<li>GraphQL</li>\n<li>Docker</li>\n<li>Kubernetes</li>\n</ul>\n<p>*</p>\n<ul>\n<li><strong>Benefits:</strong></li>\n</ul>\n<ul>\n<li>Competitive salary with generous annual cash bonus</li>\n<li>Equity grants</li>\n<li>Remote first work from home culture</li>\n<li>Flexible Time Off to help you rest, recharge, and connect with loved ones</li>\n<li>Generous parental leave</li>\n<li>Health, dental, and vision insurance (and above market employer contributions)</li>\n<li>401k retirement savings plan</li>\n<li>Lifestyle Spending Account (LSA)</li>\n<li>Mental Health Support Solutions</li>\n<li>...and more!</li>\n</ul>\n<p>*</p>\n<ul>\n<li><strong>It Takes a Village to Change Healthcare:</strong></li>\n</ul>\n<p>At Omada, we strive to embody the following values in our day-to-day work. We hope these hold meaning for you as well as you consider Omada!</p>\n<ul>\n<li>Cultivate Trust. We listen closely and we operate with kindness. 
We provide respectful and candid feedback to each other.</li>\n<li>Seek Context. We ask to understand and we build connections. We do our research up front to move faster down the road.</li>\n<li>Act Boldly. We innovate daily to solve problems, improve processes, and find new opportunities for our members and customers.</li>\n<li>Deliver Results. We reward impact above output. We set a high bar, we’re not afraid to fail, and we take pride in our work.</li>\n<li>Succeed Together. We prioritize Omada’s progress above team or individual. We have fun as we get stuff done, and we celebrate together.</li>\n<li>Remember Why We’re Here. We push through the challenges of changing healthcare because we know the destination is worth it.</li>\n</ul>\n<p>*</p>\n<ul>\n<li><strong>About Omada Health:</strong></li>\n</ul>\n<p>Omada Health is a between-visit healthcare provider that addresses lifestyle and behavior change elements for individuals managing chronic conditions. Omada’s multi-condition platform treats diabetes, hypertension, prediabetes, musculoskeletal, and GLP-1 management. With insights from connected devices and AI-supported tools, Omada care teams deliver care that is rooted in evidence and unique to every member, unlocking results at scale. With more than a decade of experience and data, and 29 peer-reviewed publications showcasing clinical and economic proof points, Omada’s approach is designed to improve health outcomes and contain costs. Our customers include health plans, pharmacy benefit managers, health systems, and employers ranging from small businesses to Fortune 500s. At Omada, we aim to inspire and empower people to make lasting health changes on their own terms. 
For more information, visit: https://www.omadahealth.com/</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_bd4ea9f9-369","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Omada Health","sameAs":"https://www.omadahealth.com/","logo":"https://logos.yubhub.co/omadahealth.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/omadahealth/jobs/7611424","x-work-arrangement":"remote","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Ruby","Python","Ruby on Rails","GraphQL","Django","Sidekiq","PostgreSQL","MySQL","API design","distributed systems","cloud-native architecture","AWS","CI/CD","containerization","infrastructure-as-code","Docker","Kubernetes","monitoring and observability frameworks","Datadog","OpenTelemetry"],"x-skills-preferred":[],"datePosted":"2026-04-17T12:50:06.108Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Remote, USA"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Healthcare","skills":"Ruby, Python, Ruby on Rails, GraphQL, Django, Sidekiq, PostgreSQL, MySQL, API design, distributed systems, cloud-native architecture, AWS, CI/CD, containerization, infrastructure-as-code, Docker, Kubernetes, monitoring and observability frameworks, Datadog, OpenTelemetry"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_2299e559-5df"},"title":"Customer Support Engineer","description":"<p><strong>Job Summary</strong></p>\n<p>As a Customer Support Engineer at Electronic Arts, you will work directly with Game Developer teams to resolve technical challenges and improve product quality. 
You will have experience of minimum 3 years working with customers and empathy for the customer experience.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Work directly with Game Developer teams to help them solve technical challenges</li>\n<li>Resolve issues involving project implementation, code error diagnosis, debugging, validation, and root cause analysis for the products/platforms assigned</li>\n<li>Build internal relationships with our development and product management teams to help us communicate the priorities of our customers</li>\n<li>Improve product quality by injecting fresh ideas and bringing innovations to existing products or platforms, building automation</li>\n<li>Develop additional software components for tasks associated with the projects/platforms, designing and debugging software applications</li>\n<li>Participate in projects that improve overall product and documentation quality</li>\n<li>Participate in product/platform testing and updates</li>\n<li>Achieve knowledge transfer through the delivery of training, knowledge sessions, mentoring</li>\n<li>Help increase the team efficiency by sharing knowledge, providing feedback about best practices, writing tools/utilities using AI stack</li>\n<li>Develop your technology skills and become a versatile IT professional</li>\n<li>Participate in schedule rotations and working shifts</li>\n</ul>\n<p><strong>Qualifications</strong></p>\n<ul>\n<li>Experience of minimum 3 years working with customers</li>\n<li>Empathy for the customer experience</li>\n<li>Balance varying levels of priority and urgency</li>\n<li>Understand complex technical issues and manage to express them in simple and concise language</li>\n<li>Familiarity with database concepts (e.g. 
SQL Server, MongoDB)</li>\n<li>Experience around the following technology concepts - virtualization, operating systems and server administration (Linux, Windows), cloud infrastructure and services (AWS, Azure), networking (IP, routing, firewall, ACLs), coding/scripting (Python, Bash, PowerShell), application and API endpoints, monitoring and observability tools (Grafana, Kibana)</li>\n<li>Qualification in Computer Engineering or relevant experience needed</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_2299e559-5df","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Electronic Arts","sameAs":"https://jobs.ea.com","logo":"https://logos.yubhub.co/jobs.ea.com.png"},"x-apply-url":"https://jobs.ea.com/en_US/careers/JobDetail/GameKit-Operations-Engineer-12-months-contract/212687","x-work-arrangement":"hybrid","x-experience-level":"mid","x-job-type":"temporary","x-salary-range":null,"x-skills-required":["database concepts","virtualization","operating systems and server administration","cloud infrastructure and services","networking","coding/scripting","application and API endpoints","monitoring and observability tools"],"x-skills-preferred":[],"datePosted":"2026-03-10T12:17:50.433Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Hyderabad"}},"employmentType":"TEMPORARY","occupationalCategory":"Engineering","industry":"Technology","skills":"database concepts, virtualization, operating systems and server administration, cloud infrastructure and services, networking, coding/scripting, application and API endpoints, monitoring and observability tools"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_f9b846f5-a43"},"title":"FBS Observability Engineer","description":"<p>Our Client is one of the United States&#39; 
largest insurers, providing a wide range of insurance and financial services products with gross written premiums well over US$25 Billion (P&amp;C). They proudly serve more than 10 million U.S. households with more than 19 million individual policies across all 50 states through the efforts of over 48,000 exclusive and independent agents and nearly 18,500 employees.</p>\n<p>This role will be part of the Enterprise Observability team that provides observability solutions to teams across the company in the form of a shared service.</p>\n<p><strong>Key Responsibilities</strong></p>\n<ul>\n<li>Manage and evolve the enterprise observability platform, with a focus on operational excellence and best practices.</li>\n<li>Provide platform stewardship, optimization, and strategic enablement.</li>\n<li>Work closely with Site Reliability Engineers (SREs) and automation engineers to enhance issue detection, drive intelligent alerting, and enable automated remediation across a large and complex enterprise environment.</li>\n<li>Manage our unified observability platform, and work with individual application and infrastructure teams to ensure their observability needs are met. 
This includes instrumentation, configuration, alert creation, coaching, and ongoing improvements.</li>\n</ul>\n<p><strong>Requirements</strong></p>\n<ul>\n<li>3-5 years of experience as an Observability Engineer, Monitoring Engineer, Observability platform administrator, Dynatrace administrator or similar</li>\n<li>Full English Fluency</li>\n<li>BS in Computer Science or similar</li>\n<li>Experience working in a complex engineering or administration environment within a multinational organization, in any industry.</li>\n</ul>\n<p><strong>Technical &amp; Business Skills</strong></p>\n<ul>\n<li>Platform Administration – Advanced</li>\n<li>Monitoring and Observability – Intermediate</li>\n<li>Communication, stakeholder management – Advanced</li>\n<li>Technical Writing</li>\n<li>Implementation Experience Automation / Scripting - Intermediate</li>\n<li>Coding capacity - Desirable</li>\n<li>Process Improvement – Intermediate</li>\n<li>Dynatrace - Entry Level (1-3 Years) Desirable, or similar tools (AppDynamics, Splunk, New Relic, Solarwinds, Datadog)</li>\n<li>Office Suite - Intermediate (4-6 Years)</li>\n<li>Cloud Platforms (AWS/Azure/GCP) - Intermediate (4-6 Years) Working knowledge</li>\n</ul>\n<p>This position comes with a competitive compensation and benefits package:</p>\n<ol>\n<li>Competitive salary and performance-based bonuses</li>\n<li>Comprehensive benefits package</li>\n<li>Career development and training opportunities</li>\n<li>Flexible work arrangements (remote and/or office-based)</li>\n<li>Dynamic and inclusive work culture within a globally renowned group</li>\n<li>Private Health Insurance</li>\n<li>Pension Plan</li>\n<li>Paid Time Off</li>\n<li>Training &amp; Development</li>\n</ol>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a 
href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_f9b846f5-a43","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Capgemini","sameAs":"https://jobs.workable.com","logo":"https://logos.yubhub.co/view.com.png"},"x-apply-url":"https://jobs.workable.com/view/vYDCLc6QSq2nPAc7cw5PGE/remote-fbs-observability-engineer-in-mexico-at-capgemini","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Platform Administration","Monitoring and Observability","Communication, stakeholder management","Technical Writing","Implementation Experience Automation / Scripting","Coding capacity","Process Improvement","Dynatrace","Office Suite","Cloud Platforms (AWS/Azure/GCP)"],"x-skills-preferred":["AppDynamics","Splunk","New Relic","Solarwinds","Datadog"],"datePosted":"2026-03-09T17:00:37.374Z","jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Finance","skills":"Platform Administration, Monitoring and Observability, Communication, stakeholder management, Technical Writing, Implementation Experience Automation / Scripting, Coding capacity, Process Improvement, Dynatrace, Office Suite, Cloud Platforms (AWS/Azure/GCP), AppDynamics, Splunk, New Relic, Solarwinds, Datadog"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_eebf21c4-d1f"},"title":"Staff Site Reliability Engineer","description":"<p>Join our Site Reliability Engineering (SRE) team and help ensure the reliability, scalability, and performance of Replit&#39;s infrastructure that serves millions of developers worldwide.</p>\n<p>As a Staff Site Reliability Engineer, you will bridge the gap between development and operations, implementing automation and establishing best practices that enable our platform to scale efficiently while maintaining high availability.</p>\n<p>We are seeking 
Staff SREs who are passionate about building and maintaining resilient systems at scale. Your mission will be to proactively find and analyze reliability problems across our stack, then design and implement software and systems to create step-function improvements.</p>\n<p>You will design robust observability solutions, lead incident response, automate operational tasks, and continuously improve our infrastructure&#39;s reliability, all while mentoring and educating the broader engineering team to make reliability a core value at Replit.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Architect and Implement Observability: Design, build, and lead the implementation of comprehensive monitoring, logging, and tracing solutions. Create dashboards and metrics that provide real-time visibility into system health and performance, enabling proactive issue detection.</li>\n</ul>\n<ul>\n<li>Define and Drive Reliability Standards: Work with product and engineering teams to define, implement, and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs). Build systems to monitor and report on these metrics, holding teams accountable and ensuring we maintain high reliability standards while balancing innovation speed.</li>\n</ul>\n<ul>\n<li>Lead Incident Management and Response: Act as a senior leader during high-impact incidents, guiding the team to rapid resolution. Conduct thorough, blameless post-mortems and drive the implementation of preventative measures. Develop and refine runbooks and build automation to reduce Mean Time To Recovery (MTTR).</li>\n</ul>\n<ul>\n<li>Drive Automation and Infrastructure as Code: Architect, build, and improve automation to eliminate toil and operational work. Design and maintain CI/CD pipelines and infrastructure automation using tools like Terraform or Pulumi. 
Create self-healing systems that can automatically respond to common failure scenarios.</li>\n</ul>\n<ul>\n<li>Optimize Performance on Kubernetes: Collaborate with core infrastructure and product teams to performance-tune and optimize our large-scale cloud deployments, with a deep focus on Kubernetes, Docker, and GCP. Identify and resolve performance bottlenecks, implement capacity planning strategies, and reduce latency across global regions.</li>\n</ul>\n<ul>\n<li>Debug and Harden Distributed Systems: Dive deep into debugging extremely difficult technical problems across the stack. Use your findings to design and implement long-term fixes that make our systems and products more robust, operable, and easier to diagnose.</li>\n</ul>\n<ul>\n<li>Provide Staff-Level Guidance: Review feature and system designs from across the company, acting as a key owner for the reliability, scalability, security, and operational integrity of those designs.</li>\n</ul>\n<ul>\n<li>Educate and Mentor: Educate, mentor, and hold accountable the broader engineering team to improve the reliability of our systems, making reliability a core value of the Replit engineering culture.</li>\n</ul>\n<ul>\n<li>Build and Integrate: Write high-quality, well-tested code in Python or Go to meet the needs of your customers, whether it&#39;s building new internal tools or integrating with third-party vendors.</li>\n</ul>\n<p><strong>Required Skills and Experience</strong></p>\n<ul>\n<li>8-10 years of experience in Site Reliability Engineering or similar roles (e.g., DevOps, Systems Engineering, Infrastructure Engineering).</li>\n</ul>\n<ul>\n<li>Strong programming skills in languages like Python or Go. You write high-quality, well-tested code.</li>\n</ul>\n<ul>\n<li>Deep understanding of distributed systems. 
You’ve designed, built, scaled, and maintained production services and know how to compose a service-oriented architecture.</li>\n</ul>\n<ul>\n<li>Deep experience with container orchestration platforms, specifically Kubernetes, and cloud-native technologies.</li>\n</ul>\n<ul>\n<li>Proven track record of designing, implementing, and maintaining sophisticated monitoring and observability solutions (e.g., metrics, logging, tracing).</li>\n</ul>\n<ul>\n<li>Strong incident management skills with extensive experience leading incident response for complex systems and demonstrated critical thinking under pressure.</li>\n</ul>\n<ul>\n<li>Experience with infrastructure as code (e.g., Terraform, Pulumi) and configuration management tools.</li>\n</ul>\n<ul>\n<li>Excellent written and verbal communication skills, with an ability to explain complex technical concepts clearly and simply and a bias toward open, transparent cultural practices.</li>\n</ul>\n<ul>\n<li>Strong interpersonal skills, with experience working with and mentoring engineers from junior to principal levels.</li>\n</ul>\n<ul>\n<li>A willingness to dive into understanding, debugging, and improving any layer of the stack.</li>\n</ul>\n<ul>\n<li>You&#39;re passionate about making software creation accessible and empowering the next generation of builders.</li>\n</ul>\n<p><strong>Bonus Points</strong></p>\n<ul>\n<li>Deep experience with Google Cloud Platform (GCP) services and tools.</li>\n</ul>\n<ul>\n<li>Expert-level knowledge of modern observability platforms (e.g., Prometheus, Grafana, Datadog, OpenTelemetry).</li>\n</ul>\n<ul>\n<li>Experience designing and building reliable systems capable of handling high throughput and low latency.</li>\n</ul>\n<ul>\n<li>Significant experience with Go and Terraform.</li>\n</ul>\n<ul>\n<li>Familiarity with working in rapid-growth, startup environments.</li>\n</ul>\n<ul>\n<li>Experience writing company-facing blog posts and training materials.</li>\n</ul>\n<p 
style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_eebf21c4-d1f","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Replit","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/replit.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/replit/d50ad15b-82d4-452f-b4ea-2a7f5e796170","x-work-arrangement":"remote","x-experience-level":"staff","x-job-type":"Full time","x-salary-range":"$220K - $325K","x-skills-required":["Site Reliability Engineering","DevOps","Systems Engineering","Infrastructure Engineering","Python","Go","Distributed Systems","Container Orchestration","Kubernetes","Cloud-Native Technologies","Monitoring and Observability","Incident Management","Infrastructure as Code","Terraform","Pulumi","Configuration Management"],"x-skills-preferred":["Google Cloud Platform","Prometheus","Grafana","Datadog","OpenTelemetry","Go","Terraform"],"datePosted":"2026-03-08T22:20:23.639Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Remote (United States)"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Site Reliability Engineering, DevOps, Systems Engineering, Infrastructure Engineering, Python, Go, Distributed Systems, Container Orchestration, Kubernetes, Cloud-Native Technologies, Monitoring and Observability, Incident Management, Infrastructure as Code, Terraform, Pulumi, Configuration Management, Google Cloud Platform, Prometheus, Grafana, Datadog, OpenTelemetry, Go, Terraform","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":220000,"maxValue":325000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_afeda614-2d1"},"title":"Security Technology Deployment 
Specialist","description":"<p><strong>About the Role:</strong></p>\n<p>As part of Anthropic&#39;s Global Safety, Intelligence, and Security (GSIS) team, the Security Technology Deployment Specialist will own the validation, standardization, and deployment of physical security technology across Anthropic&#39;s rapidly expanding global office portfolio.</p>\n<p>You&#39;ll define the installation standards, configuration baselines, and deployment processes that the broader team executes against — from access control migrations and intercom replacements to AI analytics onboarding and new application integrations. You&#39;ll work across InfoSec, IT, Networking, and Identity Management to ensure every security application passes review, integrates with SSO, and is supported within Anthropic&#39;s infrastructure before going live.</p>\n<p><strong>Responsibilities:</strong></p>\n<ul>\n<li>Validate and deploy new and replacement security technology platforms including access control systems, intercom solutions, video management, visitor management, and AI/analytics tools across all Anthropic locations</li>\n<li>Build and maintain staging environments for pre-production testing and validation of all security applications, hardware, firmware, and system configurations</li>\n<li>Define installation standards, configuration baselines, licensing structures, update procedures, and maintenance requirements for every deployed security platform</li>\n<li>Deploy integrations between security applications, validating that platforms communicate and share data correctly before transitioning to production</li>\n<li>Support colleagues&#39; security applications through InfoSec review processes, ensuring new tools meet Anthropic&#39;s information security and compliance requirements</li>\n<li>Coordinate SSO integration for newly deployed security applications with Identity Management and IT teams</li>\n<li>Transition applications requiring custom integration or data pipeline development to 
the IT Engineering team with documented technical requirements for roadmap inclusion</li>\n<li>Initiate onboarding of deployed hardware and systems into Anthropic&#39;s health monitoring platform to ensure operational visibility from day one</li>\n<li>Develop standardized deployment playbooks, checklists, configuration templates, and handoff documentation that enable repeatable installations across all current and future sites</li>\n<li>Evaluate security platforms for scalability, identifying capacity constraints, single points of failure, and architectural limitations before they impact operations at scale</li>\n<li>Coordinate with Networking, IT Infrastructure, and Facilities teams to ensure all infrastructure prerequisites (network, power, rack space, cloud resources) are met prior to deployment</li>\n<li>Execute structured handoffs to Project Management (for site programming), Break-Fix Support (for maintenance), and Access Control Administration (for ongoing system management), ensuring each team has the standards and documentation to execute independently</li>\n</ul>\n<p><strong>You may be a good fit if you:</strong></p>\n<ul>\n<li>Have 5+ years of hands-on experience deploying, validating, and managing enterprise physical security technology across a large or rapidly growing organisation</li>\n<li>Have deployed security technology across 50 or more sites, or have demonstrated experience in a high-growth environment where deployment velocity and repeatability were essential</li>\n<li>Have built standardized deployment processes, playbooks, and configuration templates that enabled others to execute installations independently and consistently</li>\n<li>Have experience working across InfoSec, IT, Networking, and Identity Management teams to onboard and integrate security applications into enterprise environments</li>\n<li>Have supported SSO integration, InfoSec reviews, and enterprise application onboarding workflows for security tools</li>\n<li>Possess broad 
technology experience across access control, video management, intercoms, visitor management, AI/analytics, and alarm monitoring platforms</li>\n<li>Are a strong technical communicator who can define standards clearly enough that PMs, integrators, and service teams execute against them without ambiguity</li>\n<li>Have experience with IP networking, VLANs, PoE, and infrastructure requirements for security devices</li>\n<li>Are comfortable with 25% travel for site deployments, commissioning, and validation</li>\n</ul>\n<p><strong>Strong candidates may have:</strong></p>\n<ul>\n<li>Previous experience at a hyper-growth technology company or managing security technology programs for high-profile corporate environments</li>\n<li>Experience with Anthropic&#39;s specific technology stack: Genetec Security Center, Axis cameras, Wavelynx, Commend Symphony Cloud, Alcatraz.ai, Ambient.ai, SureView, Envoy</li>\n<li>Industry certifications: Genetec, Axis, CCNA, PSP, CPP, or PMP</li>\n<li>Experience with OSDP, modern credential technologies, and encryption protocols for physical security systems</li>\n<li>Familiarity with scripting or automation (Python, PowerShell) for configuration management and deployment automation</li>\n<li>Experience with health monitoring and observability platforms</li>\n<li>Experience with change management, configuration control, and version-controlled infrastructure documentation</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a 
href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_afeda614-2d1","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://job-boards.greenhouse.io","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5123587008","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["physical security technology","access control systems","intercom solutions","video management","visitor management","AI/analytics tools","security applications","InfoSec","IT","Networking","Identity Management","SSO integration","IP networking","VLANs","PoE","infrastructure requirements for security devices"],"x-skills-preferred":["Genetec Security Center","Axis cameras","Wavelynx","Commend Symphony Cloud","Alcatraz.ai","Ambient.ai","SureView","Envoy","OSDP","modern credential technologies","encryption protocols for physical security systems","scripting or automation (Python, PowerShell)","health monitoring and observability platforms","change management","configuration control","version-controlled infrastructure documentation"],"datePosted":"2026-03-08T13:56:18.481Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA; Seattle, WA; New York City, NY"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"physical security technology, access control systems, intercom solutions, video management, visitor management, AI/analytics tools, security applications, InfoSec, IT, Networking, Identity Management, SSO integration, IP networking, VLANs, PoE, infrastructure requirements for security devices, Genetec Security Center, Axis cameras, Wavelynx, Commend Symphony Cloud, Alcatraz.ai, Ambient.ai, SureView, Envoy, OSDP, modern credential technologies, encryption protocols for 
physical security systems, scripting or automation (Python, PowerShell), health monitoring and observability platforms, change management, configuration control, version-controlled infrastructure documentation"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_c930b80e-7a6"},"title":"Staff / Senior Software Engineer, AI Reliability","description":"<p><strong>About Anthropic</strong></p>\n<p>Anthropic&#39;s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</p>\n<p><strong>About the Role</strong></p>\n<p>AIRE (AI Reliability Engineering) partners with teams across Anthropic to improve reliability across our most critical serving paths -- every hop from the SDK through our network, API layers, serving infrastructure, and accelerators and back. We jump into the trenches alongside partner teams to make the systems that deliver Claude more robust and resilient, be it during an incident or collaborating on projects.</p>\n<p>Reliability here is an emergent phenomenon that transcends any single team&#39;s boundaries, so someone has to zoom out and look at the whole picture. That&#39;s us -- and it means few teams at Anthropic offer this kind of dynamic, cross-cutting exposure to the systems that matter most.</p>\n<p>Claude has your back. AIRE has Claude&#39;s. 
Help us keep Claude reliable for everyone who depends on it.</p>\n<p><strong>Responsibilities:</strong></p>\n<ul>\n<li>Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity.</li>\n</ul>\n<ul>\n<li>Design and implement monitoring and observability systems across the token path.</li>\n</ul>\n<ul>\n<li>Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers</li>\n</ul>\n<ul>\n<li>Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.</li>\n</ul>\n<ul>\n<li>Support the reliability of safeguard model serving -- critical for both site reliability and Anthropic&#39;s safety commitments.</li>\n</ul>\n<p><strong>You may be a good fit if you:</strong></p>\n<ul>\n<li>Have strong distributed systems, infrastructure, or reliability backgrounds -- we&#39;re looking for reliability-minded software engineers and SREs.</li>\n</ul>\n<ul>\n<li>Are curious and brave -- comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don&#39;t have deep expertise yet.</li>\n</ul>\n<ul>\n<li>Think holistically about how systems compose and where the seams are.</li>\n</ul>\n<ul>\n<li>Can build lasting relationships across teams -- our engagement model depends on being welcomed as teammates, not outsiders with opinions.</li>\n</ul>\n<ul>\n<li>Care about users and feel ownership over outcomes, even for systems you don&#39;t own.</li>\n</ul>\n<ul>\n<li>Have excellent communication and collaboration skills -- you&#39;ll be partnering across the entire company.</li>\n</ul>\n<ul>\n<li>Bring diverse experience -- the team&#39;s strength comes from people who&#39;ve built product stacks, scaled databases, run massive distributed systems, and everything in between.</li>\n</ul>\n<p><strong>Strong candidates may 
also:</strong></p>\n<ul>\n<li>Have worked as an SRE, Production Engineer, or in a similar reliability-focused role on large-scale systems.</li>\n<li>Have experience operating large-scale model serving or training infrastructure (&gt;1000 GPUs).</li>\n<li>Have experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium).</li>\n<li>Understand ML-specific networking optimizations like RDMA and InfiniBand.</li>\n<li>Have expertise in AI-specific observability tools and frameworks.</li>\n<li>Have experience with chaos engineering and systematic resilience testing.</li>\n<li>Have contributed to open-source infrastructure or ML tooling.</li>\n</ul>\n<p><strong>Logistics</strong></p>\n<p><strong>Education requirements:</strong> We require at least a Bachelor&#39;s degree in a related field or equivalent experience. <strong>Location-based hybrid policy:</strong> Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</p>\n<p><strong>Visa sponsorship:</strong> We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</p>\n<p><strong>We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you&#39;re interested in this work.</strong></p>\n<p><strong>Your safety matters to us. 
To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses. In some cases, we may partner with vetted recruiting agencies who will identify themselves as working on behalf of Anthropic. Be cautious of emails from other domains. Legitimate Anthropic recruiters will never ask for money, fees, or banking information before your first day. If you&#39;re ever unsure about a communication, don&#39;t click any links; visit anthropic.com/careers directly for confirmed position openings.</strong></p>\n<p><strong>How we&#39;re different</strong></p>\n<p>We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact (advancing our long-term goals of steerable, trustworthy AI) rather than work on smaller and more specific puzzles. We view AI research as a team sport, where everyone contributes to the overall success of the team.</p>","url":"https://yubhub.co/jobs/job_c930b80e-7a6","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://job-boards.greenhouse.io","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5113224008","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$325,000 - $485,000 USD","x-skills-required":["distributed systems","infrastructure","reliability","large language model serving systems","monitoring and observability systems","high-availability serving infrastructure","incident response","safeguard model serving"],"x-skills-preferred":["SRE","Production Engineer","ML hardware accelerators","ML-specific networking optimizations","AI-specific observability tools and frameworks","chaos 
engineering","systematic resilience testing","open-source infrastructure or ML tooling"],"datePosted":"2026-03-08T13:50:54.182Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"distributed systems, infrastructure, reliability, large language model serving systems, monitoring and observability systems, high-availability serving infrastructure, incident response, safeguard model serving, SRE, Production Engineer, ML hardware accelerators, ML-specific networking optimizations, AI-specific observability tools and frameworks, chaos engineering, systematic resilience testing, open-source infrastructure or ML tooling","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":325000,"maxValue":485000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_b050a65c-f0a"},"title":"Senior SRE 1","description":"<p>We are seeking an accomplished Senior Site Reliability Engineer (SRE) with 12–15 years of experience to lead the reliability, scalability, and performance engineering of our critical infrastructure and production systems. 
As a Senior SRE, you will play a strategic and technical leadership role, driving reliability practices, mentoring SRE teams, and influencing the adoption of automation, observability, and resilience engineering across the organization.</p>\n<p><strong>What you&#39;ll do</strong></p>\n<ul>\n<li>Architect, implement, and manage resilient, scalable, and highly available infrastructure systems.</li>\n<li>Lead initiatives to automate manual operations, deployment, and monitoring processes to improve reliability and reduce toil.</li>\n</ul>\n<p><strong>What you need</strong></p>\n<ul>\n<li>Strong proficiency in Linux/Unix system administration and internals.</li>\n<li>Proven experience in cloud platforms: AWS, Azure, or GCP.</li>\n</ul>","url":"https://yubhub.co/jobs/job_b050a65c-f0a","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Electronic Arts","sameAs":"https://jobs.ea.com","logo":"https://logos.yubhub.co/jobs.ea.com.png"},"x-apply-url":"https://jobs.ea.com/en_US/careers/JobDetail/Senior-SRE-I/211515","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Linux/Unix system administration","Cloud platforms","Automation"],"x-skills-preferred":["Containerization and orchestration","Monitoring and observability stacks","Configuration management and IaC tools"],"datePosted":"2026-01-05T21:08:11.258Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Hyderabad"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Linux/Unix system administration, Cloud platforms, Automation, Containerization and orchestration, Monitoring and observability stacks, Configuration management and IaC tools"}]}