{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/gpu-and-ai-ml-workloads"},"x-facet":{"type":"skill","slug":"gpu-and-ai-ml-workloads","display":"GPU and AI/ML workloads","count":1},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_198d64d4-207"},"title":"Senior/Staff Site Reliability Engineer","description":"<p>You are a seasoned SRE who keeps production infrastructure running at scale. You own the reliability and availability of customer-facing systems, from Kubernetes clusters to deployment pipelines to the networking layer that connects it all. 
You think in SLOs, automate ruthlessly, and treat every incident as a chance to make the system better.</p>\n<p><strong>Key Responsibilities</strong></p>\n<ul>\n<li>Own and operate our Kubernetes infrastructure: cluster lifecycle, upgrades, networking, and multi-tenant isolation for customer workloads</li>\n</ul>\n<ul>\n<li>Build and maintain CI/CD pipelines and deployment infrastructure</li>\n</ul>\n<ul>\n<li>Leverage AI to an extreme level to automate analysis and resolution of production issues, and improve software development speed, reliability and maintainability</li>\n</ul>\n<ul>\n<li>Build dashboards, alerting, and anomaly detection across our systems</li>\n</ul>\n<ul>\n<li>Define and enforce SLOs and build out incident response processes</li>\n</ul>\n<ul>\n<li>Manage and improve our networking, load balancing, and service mesh configurations</li>\n</ul>\n<ul>\n<li>Drive reliability improvements across the stack through automation, runbooks, and chaos engineering</li>\n</ul>\n<p><strong>Requirements</strong></p>\n<ul>\n<li>5+ years of experience managing critical production systems and software development workflows</li>\n</ul>\n<ul>\n<li>Strong production experience setting up and operating Kubernetes at scale, using infrastructure-as-code (Terraform, Ansible)</li>\n</ul>\n<ul>\n<li>Deep knowledge of Linux networking, container networking (CNI plugins, VXLAN, BGP), and DNS</li>\n</ul>\n<ul>\n<li>Experience building CI/CD systems and GitOps workflows (FluxCD, ArgoCD)</li>\n</ul>\n<ul>\n<li>Proficiency in Python and either Go or Bash for tooling and automation</li>\n</ul>\n<ul>\n<li>Strong experience with logging, monitoring and alerting (Prometheus, Grafana, Loki, Thanos, VictoriaMetrics, Datadog)</li>\n</ul>\n<ul>\n<li>Excellent communication and ability to drive technical decisions across teams</li>\n</ul>\n<ul>\n<li>Self-starter who executes quickly, takes ownership, and constantly seeks improvement</li>\n</ul>\n<p><strong>Nice to 
have</strong></p>\n<ul>\n<li>Experience with managing GPU and AI/ML workloads</li>\n</ul>\n<ul>\n<li>Experience with kernel-based monitoring and routing (eBPF, XDP)</li>\n</ul>\n<ul>\n<li>Experience with security tooling (Falco, Coroot, SIEM)</li>\n</ul>\n<ul>\n<li>Experience with bare metal Kubernetes networking (Calico, Cilium, MetalLB)</li>\n</ul>\n<ul>\n<li>Experience with distributed storage systems (Ceph, Longhorn, etc.)</li>\n</ul>\n<p><strong>Compensation</strong></p>\n<ul>\n<li>$180,000-250,000 plus equity + benefits</li>\n</ul>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Interesting and challenging work</li>\n</ul>\n<ul>\n<li>A lot of learning and growth opportunities</li>\n</ul>\n<ul>\n<li>Regular team events and offsites</li>\n</ul>\n<ul>\n<li>Health, dental, and vision insurance (US)</li>\n</ul>\n<ul>\n<li>Visa sponsorship and relocation assistance</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_198d64d4-207","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Fal","sameAs":"https://fal.com","logo":"https://logos.yubhub.co/fal.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/fal/jobs/4146019009","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$180,000-250,000","x-skills-required":["Kubernetes","Infrastructure-as-code","Linux networking","Container networking","CI/CD systems","GitOps workflows","Python","Go","Bash","Logging","Monitoring","Alerting"],"x-skills-preferred":["GPU and AI/ML workloads","Kernel-based monitoring and routing","Security tooling","Bare metal Kubernetes networking","Distributed storage systems"],"datePosted":"2026-04-24T15:18:14.287Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San 
Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Kubernetes, Infrastructure-as-code, Linux networking, Container networking, CI/CD systems, GitOps workflows, Python, Go, Bash, Logging, Monitoring, Alerting, GPU and AI/ML workloads, Kernel-based monitoring and routing, Security tooling, Bare metal Kubernetes networking, Distributed storage systems","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":180000,"maxValue":250000,"unitText":"YEAR"}}}]}