{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/accelerator-infrastructure"},"x-facet":{"type":"skill","slug":"accelerator-infrastructure","display":"Accelerator Infrastructure","count":3},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_baad2598-8bc"},"title":"Staff / Senior Software Engineer, Compute Capacity","description":"<p><strong>About the Role</strong></p>\n<p>Anthropic&#39;s Accelerator Capacity Engineering (ACE) team manages one of the largest and fastest-growing accelerator fleets in the industry. As an engineer on ACE, you will build the production systems that power this work: data pipelines that ingest and normalize telemetry from heterogeneous cloud environments, observability tooling that gives the org real-time visibility into fleet health, and performance instrumentation that measures how efficiently every major workload uses the hardware it’s running on.</p>\n<p><strong>What This Team Owns</strong></p>\n<p>The team’s work spans three functional areas: data infrastructure, fleet observability, and compute efficiency. Depending on your background and interests, you’ll focus primarily in one, but the boundaries are fluid and the problems overlap:</p>\n<p><strong>Data Infrastructure</strong></p>\n<p>Collecting, normalizing, and serving the fleet-wide data that powers everything else. 
This means building pipelines that ingest occupancy and utilization telemetry from Kubernetes clusters, normalizing billing and usage data across cloud providers, and maintaining the BigQuery layer that the rest of the org queries against.</p>\n<p><strong>Fleet Observability</strong></p>\n<p>Making the state of the accelerator fleet legible and actionable in real time. This means building cluster health tooling, capacity planning platforms, alerting on occupancy drops and allocation problems, and driving systemic improvements to scheduling and fragmentation.</p>\n<p><strong>Compute Efficiency</strong></p>\n<p>Measuring and improving how effectively every major workload uses the hardware it’s running on. This means instrumenting utilization metrics across training, inference, and eval systems, building benchmarking infrastructure, establishing per-config baselines, and collaborating directly with system-owning teams to close efficiency gaps.</p>\n<p><strong>What You’ll Do</strong></p>\n<ul>\n<li>Build and operate data pipelines that ingest accelerator occupancy, utilization, and cost data from multiple cloud providers into BigQuery.</li>\n<li>Develop and maintain observability infrastructure (Prometheus recording rules, Grafana dashboards, and alerting systems) that surfaces actionable signals about fleet health, occupancy, and efficiency.</li>\n<li>Instrument and analyze compute efficiency metrics across training, inference, and eval workloads.</li>\n<li>Build internal tooling and platforms that enable capacity planning, workload attribution, and cluster debugging.</li>\n<li>Operate Kubernetes-native systems at scale: deploying data collection agents, managing workload labeling infrastructure, and understanding how taints, reservations, and scheduling affect capacity.</li>\n<li>Normalize and reconcile data across heterogeneous sources, including AWS, GCP, and Azure billing exports, vendor-specific telemetry formats, and internal systems with different schemas 
and billing arrangements.</li>\n</ul>\n<p><strong>You May Be a Good Fit If You Have</strong></p>\n<ul>\n<li>5+ years of software engineering experience with a strong track record building and operating production systems.</li>\n<li>Kubernetes fluency at operational depth: you’ve operated production K8s at meaningful scale, not just written manifests.</li>\n<li>Data pipeline engineering experience: designing, building, and owning the full lifecycle of production data pipelines.</li>\n<li>Observability tooling experience: Prometheus, PromQL, and Grafana are in the critical path for this team.</li>\n<li>Python and SQL at production quality.</li>\n<li>Familiarity with at least one major cloud provider (AWS, GCP, or Azure) at the infrastructure level: compute, billing, usage APIs, cost management tooling.</li>\n</ul>\n<p><strong>Strong Candidates May Also Have</strong></p>\n<ul>\n<li>Multi-cloud data ingestion experience, especially working with AWS and GCP APIs, billing exports, or vendor-specific telemetry formats.</li>\n<li>Accelerator infrastructure familiarity: GPU metrics (DCGM), TPU utilization, Trainium power and utilization metrics, or experience working with ML training/inference systems at the hardware level.</li>\n<li>Performance engineering and benchmarking experience: building benchmark harnesses, establishing baselines, reasoning about compute efficiency (FLOPs utilization, memory bandwidth, interconnect throughput), and working with system teams to diagnose and improve performance.</li>\n<li>Data-as-product thinking: experience building internal data products with self-service access, schema contracts, API serving, and documentation.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a 
href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_baad2598-8bc","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5126702008","x-work-arrangement":"onsite","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Kubernetes","Python","SQL","Prometheus","Grafana","BigQuery","Cloud computing","Data pipeline engineering","Observability tooling"],"x-skills-preferred":["Multi-cloud data ingestion","Accelerator infrastructure","Performance engineering","Data-as-product thinking"],"datePosted":"2026-04-18T15:56:02.706Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Kubernetes, Python, SQL, Prometheus, Grafana, BigQuery, Cloud computing, Data pipeline engineering, Observability tooling, Multi-cloud data ingestion, Accelerator infrastructure, Performance engineering, Data-as-product thinking"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_8f6ef3b1-c9b"},"title":"Technical Program Manager, Compute","description":"<p>As a Technical Program Manager on the Compute team, you will help drive the planning, coordination, and execution of programs that keep Anthropic&#39;s compute infrastructure running efficiently at scale.</p>\n<p>Our compute fleet is the foundation on which every model training run, evaluation, and inference workload depends. 
You&#39;ll join a small, high-impact TPM team and take ownership of critical workstreams across the compute lifecycle, from how supply is procured and brought online, to how capacity is allocated and utilized across teams.</p>\n<p>You&#39;ll partner with Infrastructure, Systems, Research, Finance, and Capacity Engineering to shape the processes, tooling, and coordination mechanisms that allow Anthropic to move fast while managing an increasingly complex compute environment.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Own and drive critical programs across the compute lifecycle, coordinating execution across multiple engineering, research, and operations teams</li>\n<li>Build and maintain operational visibility into the compute fleet, ensuring the organization has a clear picture of supply, demand, utilization, and health</li>\n<li>Lead cross-functional coordination for compute transitions: bringing new capacity online, migrating workloads, and managing decommissions across cloud providers and hardware platforms</li>\n<li>Partner with engineering and research leadership to navigate competing priorities and drive alignment on how compute resources are planned, allocated, and used</li>\n<li>Identify and close operational gaps across the compute pipeline, whether through new tooling, improved processes, or better cross-team communication</li>\n<li>Own trade-off discussions between utilization, cost, latency, and reliability, synthesizing inputs from technical and business stakeholders and communicating decisions to leadership</li>\n<li>Develop and improve the processes and frameworks the team uses to plan, track, and execute compute programs at increasing scale and complexity</li>\n</ul>\n<p>You may be a good fit if you:</p>\n<ul>\n<li>Have 7+ years of technical program management experience in infrastructure, platform engineering, or compute-intensive environments</li>\n<li>Have led complex, cross-functional programs involving multiple engineering teams with competing 
priorities and ambiguous requirements</li>\n<li>Have experience working with research or ML teams and translating their needs into operational plans and technical requirements</li>\n<li>Are comfortable diving deep into technical details (cloud infrastructure, cluster management, job scheduling, resource orchestration) while maintaining program-level visibility</li>\n<li>Thrive in ambiguous, fast-moving environments where you need to define scope and build processes from the ground up</li>\n<li>Have strong communication skills and can engage credibly with engineers, researchers, finance, and executive leadership</li>\n<li>Have a track record of building trust with engineering teams and driving changes through influence rather than authority</li>\n</ul>\n<p>Strong candidates may also have:</p>\n<ul>\n<li>Experience managing compute capacity across multiple cloud providers (AWS, GCP, Azure) or hybrid cloud/on-premises environments</li>\n<li>Familiarity with job scheduling, resource orchestration, or workload management systems (Kubernetes, Slurm, Borg, YARN, or custom schedulers)</li>\n<li>Experience with GPU or accelerator infrastructure, including the unique challenges of large-scale ML training and inference workloads</li>\n<li>Built or improved observability for infrastructure systems: dashboards, alerting, efficiency metrics, or cost attribution</li>\n<li>Capacity planning experience including demand forecasting, cost modeling, or hardware lifecycle management</li>\n<li>Scaled through hypergrowth in AI/ML, HPC, or large-scale cloud environments</li>\n</ul>","url":"https://yubhub.co/jobs/job_8f6ef3b1-c9b","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5138044008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$290,000-$365,000 USD","x-skills-required":["Technical Program Management","Cloud Infrastructure","Cluster Management","Job Scheduling","Resource Orchestration","Compute Capacity Management","GPU or Accelerator Infrastructure","Observability for Infrastructure Systems","Capacity Planning"],"x-skills-preferred":["Kubernetes","Slurm","Borg","YARN","Custom Schedulers","Demand Forecasting","Cost Modeling","Hardware Lifecycle Management"],"datePosted":"2026-04-18T15:53:42.458Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Technical Program Management, Cloud Infrastructure, Cluster Management, Job Scheduling, Resource Orchestration, Compute Capacity Management, GPU or Accelerator Infrastructure, Observability for Infrastructure Systems, Capacity Planning, Kubernetes, Slurm, Borg, YARN, Custom Schedulers, Demand Forecasting, Cost Modeling, Hardware Lifecycle Management","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":290000,"maxValue":365000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_e8e9acc0-a63"},"title":"Technical Program Manager, Compute","description":"<p>As a Technical Program Manager on the Compute team, you will help drive the planning, coordination, and execution of programs that keep 
Anthropic&#39;s compute infrastructure running efficiently at scale.</p>\n<p>Our compute fleet is the foundation on which every model training run, evaluation, and inference workload depends. You&#39;ll join a small, high-impact TPM team and take ownership of critical workstreams across the compute lifecycle, from how supply is procured and brought online, to how capacity is allocated and utilized across teams.</p>\n<p>You&#39;ll partner with Infrastructure, Systems, Research, Finance, and Capacity Engineering to shape the processes, tooling, and coordination mechanisms that allow Anthropic to move fast while managing an increasingly complex compute environment.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Own and drive critical programs across the compute lifecycle, coordinating execution across multiple engineering, research, and operations teams</li>\n<li>Build and maintain operational visibility into the compute fleet, ensuring the organization has a clear picture of supply, demand, utilization, and health</li>\n<li>Lead cross-functional coordination for compute transitions: bringing new capacity online, migrating workloads, and managing decommissions across cloud providers and hardware platforms</li>\n<li>Partner with engineering and research leadership to navigate competing priorities and drive alignment on how compute resources are planned, allocated, and used</li>\n<li>Identify and close operational gaps across the compute pipeline, whether through new tooling, improved processes, or better cross-team communication</li>\n<li>Own trade-off discussions between utilization, cost, latency, and reliability, synthesizing inputs from technical and business stakeholders and communicating decisions to leadership</li>\n<li>Develop and improve the processes and frameworks the team uses to plan, track, and execute compute programs at increasing scale and complexity</li>\n</ul>\n<p>You may be a good fit if you:</p>\n<ul>\n<li>Have 7+ years of technical program management 
experience in infrastructure, platform engineering, or compute-intensive environments</li>\n<li>Have led complex, cross-functional programs involving multiple engineering teams with competing priorities and ambiguous requirements</li>\n<li>Have experience working with research or ML teams and translating their needs into operational plans and technical requirements</li>\n<li>Are comfortable diving deep into technical details (cloud infrastructure, cluster management, job scheduling, resource orchestration) while maintaining program-level visibility</li>\n<li>Thrive in ambiguous, fast-moving environments where you need to define scope and build processes from the ground up</li>\n<li>Have strong communication skills and can engage credibly with engineers, researchers, finance, and executive leadership</li>\n<li>Have a track record of building trust with engineering teams and driving changes through influence rather than authority</li>\n</ul>\n<p>Strong candidates may also have:</p>\n<ul>\n<li>Experience managing compute capacity across multiple cloud providers (AWS, GCP, Azure) or hybrid cloud/on-premises environments</li>\n<li>Familiarity with job scheduling, resource orchestration, or workload management systems (Kubernetes, Slurm, Borg, YARN, or custom schedulers)</li>\n<li>Experience with GPU or accelerator infrastructure, including the unique challenges of large-scale ML training and inference workloads</li>\n<li>Built or improved observability for infrastructure systems: dashboards, alerting, efficiency metrics, or cost attribution</li>\n<li>Capacity planning experience including demand forecasting, cost modeling, or hardware lifecycle management</li>\n<li>Scaled through hypergrowth in AI/ML, HPC, or large-scale cloud environments</li>\n</ul>","url":"https://yubhub.co/jobs/job_e8e9acc0-a63","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5138044008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$290,000-$365,000 USD","x-skills-required":["Technical Program Management","Compute Infrastructure","Cloud Providers","Job Scheduling","Resource Orchestration","Workload Management","GPU or Accelerator Infrastructure","Observability","Capacity Planning"],"x-skills-preferred":["Kubernetes","Slurm","Borg","YARN","Custom Schedulers","Demand Forecasting","Cost Modeling","Hardware Lifecycle Management"],"datePosted":"2026-04-18T15:52:47.770Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Technical Program Management, Compute Infrastructure, Cloud Providers, Job Scheduling, Resource Orchestration, Workload Management, GPU or Accelerator Infrastructure, Observability, Capacity Planning, Kubernetes, Slurm, Borg, YARN, Custom Schedulers, Demand Forecasting, Cost Modeling, Hardware Lifecycle Management","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":290000,"maxValue":365000,"unitText":"YEAR"}}}]}