{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/slurm"},"x-facet":{"type":"skill","slug":"slurm","display":"Slurm","count":20},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_588dfb0e-611"},"title":"Solutions Architect - Kubernetes","description":"<p>As a Solutions Architect at CoreWeave, you will play a vital role in helping customers succeed with our cloud infrastructure offerings, focusing on Kubernetes solutions within high-performance compute (HPC) environments.</p>\n<p>Your responsibilities will include serving as the primary technical point of contact for customers, establishing strong technical relationships and ensuring their success with CoreWeave&#39;s cloud infrastructure offerings.</p>\n<p>You will collaborate closely with customers to understand their unique business needs and create, prototype, and deploy tailored solutions that align with their requirements.</p>\n<p>You will lead proof of concept initiatives to showcase the value and viability of CoreWeave&#39;s solutions within specific environments.</p>\n<p>You will drive technical leadership and direction during customer meetings, presentations, and workshops, addressing any technical queries or concerns that arise.</p>\n<p>You will act as a virtual member of CoreWeave&#39;s Kubernetes product and engineering teams, identifying opportunities for product enhancement and collaborating with engineers to 
implement your suggestions.</p>\n<p>You will offer valuable insights on product features, functionality, and performance, contributing regularly to discussions about product strategy and architecture.</p>\n<p>You will conduct periodic technical reviews and assessments of customer workloads, pinpointing opportunities for workload optimization and suggesting suitable solutions.</p>\n<p>You will stay informed of the latest developments and trends in Kubernetes, cloud computing and infrastructure, sharing your thought leadership with customers and internal stakeholders.</p>\n<p>You will lead the prototyping and initiation of research and development efforts for emerging products and solutions, delivering prototypes and key insights for internal consumption.</p>\n<p>You will represent CoreWeave at conferences and industry events, with occasional travel as required.</p>\n<p>To be successful in this role, you will need to have a B.S. in Computer Science or a related technical discipline, or equivalent experience.</p>\n<p>You will also need to have 7+ years of proven experience as a Solutions Architect, engineer, researcher, or technical account manager in cloud infrastructure, focusing on building distributed systems or HPC/cloud services, with an expertise focused on scalable Kubernetes solutions.</p>\n<p>You will need to be fluent in cloud computing concepts, architecture, and technologies with hands-on experience in designing and implementing cloud solutions.</p>\n<p>You will need to have a proven track record with building customer relationships, communicating clearly and the ability to break down complex technical concepts to both technical and non-technical audiences.</p>\n<p>You will need to be familiar with NVIDIA GPUs typically used in AI/ML applications and associated technologies such as Infiniband and NVIDIA Collective Communications Library (NCCL).</p>\n<p>You will need to have experience with running large-scale Artificial Intelligence/Machine Learning 
(AI/ML) training and inference workloads on technologies such as Slurm and Kubernetes.</p>\n<p>Preferred qualifications include code contributions to open-source inference frameworks, experience with scripting and automation related to Kubernetes clusters and workloads, experience with building solutions across multi-cloud environments, and client or customer-facing publications/talks on latency, optimization, or advanced model-server architectures.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_588dfb0e-611","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4557835006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$165,000 to $220,000","x-skills-required":["Kubernetes","Cloud Computing","High-Performance Compute (HPC)","Distributed Systems","Cloud Infrastructure","Scalable Solutions","NVIDIA GPUs","Infiniband","NVIDIA Collective Communications Library (NCCL)","Slurm","Kubernetes Clusters"],"x-skills-preferred":["Code Contributions to Open-Source Inference Frameworks","Scripting and Automation Related to Kubernetes Clusters and Workloads","Building Solutions Across Multi-Cloud Environments","Client or Customer-Facing Publications/Talks on Latency, Optimization, or Advanced Model-Server Architectures"],"datePosted":"2026-04-18T15:57:29.779Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Kubernetes, Cloud Computing, High-Performance Compute (HPC), Distributed Systems, Cloud Infrastructure, Scalable 
Solutions, NVIDIA GPUs, Infiniband, NVIDIA Collective Communications Library (NCCL), Slurm, Kubernetes Clusters, Code Contributions to Open-Source Inference Frameworks, Scripting and Automation Related to Kubernetes Clusters and Workloads, Building Solutions Across Multi-Cloud Environments, Client or Customer-Facing Publications/Talks on Latency, Optimization, or Advanced Model-Server Architectures","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":165000,"maxValue":220000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_8f6ef3b1-c9b"},"title":"Technical Program Manager, Compute","description":"<p>As a Technical Program Manager on the Compute team, you will help drive the planning, coordination, and execution of programs that keep Anthropic&#39;s compute infrastructure running efficiently at scale.</p>\n<p>Our compute fleet is the foundation on which every model training run, evaluation, and inference workload depends. 
You&#39;ll join a small, high-impact TPM team and take ownership of critical workstreams across the compute lifecycle, from how supply is procured and brought online, to how capacity is allocated and utilized across teams.</p>\n<p>You&#39;ll partner with Infrastructure, Systems, Research, Finance, and Capacity Engineering to shape the processes, tooling, and coordination mechanisms that allow Anthropic to move fast while managing an increasingly complex compute environment.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Own and drive critical programs across the compute lifecycle, coordinating execution across multiple engineering, research, and operations teams</li>\n<li>Build and maintain operational visibility into the compute fleet, ensuring the organization has a clear picture of supply, demand, utilization, and health</li>\n<li>Lead cross-functional coordination for compute transitions: bringing new capacity online, migrating workloads, and managing decommissions across cloud providers and hardware platforms</li>\n<li>Partner with engineering and research leadership to navigate competing priorities and drive alignment on how compute resources are planned, allocated, and used</li>\n<li>Identify and close operational gaps across the compute pipeline, whether through new tooling, improved processes, or better cross-team communication</li>\n<li>Own trade-off discussions between utilization, cost, latency, and reliability, synthesizing inputs from technical and business stakeholders and communicating decisions to leadership</li>\n<li>Develop and improve the processes and frameworks the team uses to plan, track, and execute compute programs at increasing scale and complexity</li>\n</ul>\n<p>You may be a good fit if you:</p>\n<ul>\n<li>Have 7+ years of technical program management experience in infrastructure, platform engineering, or compute-intensive environments</li>\n<li>Have led complex, cross-functional programs involving multiple engineering teams with competing 
priorities and ambiguous requirements</li>\n<li>Have experience working with research or ML teams and translating their needs into operational plans and technical requirements</li>\n<li>Are comfortable diving deep into technical details (cloud infrastructure, cluster management, job scheduling, resource orchestration) while maintaining program-level visibility</li>\n<li>Thrive in ambiguous, fast-moving environments where you need to define scope and build processes from the ground up</li>\n<li>Have strong communication skills and can engage credibly with engineers, researchers, finance, and executive leadership</li>\n<li>Have a track record of building trust with engineering teams and driving changes through influence rather than authority</li>\n</ul>\n<p>Strong candidates may also have:</p>\n<ul>\n<li>Experience managing compute capacity across multiple cloud providers (AWS, GCP, Azure) or hybrid cloud/on-premises environments</li>\n<li>Familiarity with job scheduling, resource orchestration, or workload management systems (Kubernetes, Slurm, Borg, YARN, or custom schedulers)</li>\n<li>Experience with GPU or accelerator infrastructure, including the unique challenges of large-scale ML training and inference workloads</li>\n<li>Built or improved observability for infrastructure systems: dashboards, alerting, efficiency metrics, or cost attribution</li>\n<li>Capacity planning experience including demand forecasting, cost modeling, or hardware lifecycle management</li>\n<li>Scaled through hypergrowth in AI/ML, HPC, or large-scale cloud environments</li>\n</ul>","url":"https://yubhub.co/jobs/job_8f6ef3b1-c9b","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5138044008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$290,000-$365,000 USD","x-skills-required":["Technical Program Management","Cloud Infrastructure","Cluster Management","Job Scheduling","Resource Orchestration","Compute Capacity Management","GPU or Accelerator Infrastructure","Observability for Infrastructure Systems","Capacity Planning"],"x-skills-preferred":["Kubernetes","Slurm","Borg","YARN","Custom Schedulers","Demand Forecasting","Cost Modeling","Hardware Lifecycle Management"],"datePosted":"2026-04-18T15:53:42.458Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Technical Program Management, Cloud Infrastructure, Cluster Management, Job Scheduling, Resource Orchestration, Compute Capacity Management, GPU or Accelerator Infrastructure, Observability for Infrastructure Systems, Capacity Planning, Kubernetes, Slurm, Borg, YARN, Custom Schedulers, Demand Forecasting, Cost Modeling, Hardware Lifecycle Management","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":290000,"maxValue":365000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_96d05ee1-799"},"title":"Staff Software Engineer, Cluster Orchestration","description":"<p><strong>Job Description</strong></p>\n<p>CoreWeave is The Essential Cloud for AI. 
Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence.</p>\n<p>Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability.</p>\n<p>Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025.</p>\n<p><strong>About the Role</strong></p>\n<p>As part of the Cluster Orchestration team, you will play a key role in advancing CoreWeave&#39;s orchestration platform including SUNK (Slurm on Kubernetes) and beyond, our Kubernetes-native foundation that powers AI training and inference at scale.</p>\n<p>This is an opportunity to help shape one of the most critical layers of the AI cloud: ensuring workloads run seamlessly, reliably, and efficiently across massive GPU clusters.</p>\n<p>By building the systems that eliminate infrastructure bottlenecks and create new orchestration capabilities, you will directly empower customers to innovate faster and push the boundaries of what&#39;s possible with AI.</p>\n<p><strong>What You&#39;ll Do</strong></p>\n<p>As a Staff Engineer, you will be a technical leader shaping the long-term strategy for CoreWeave&#39;s orchestration platform.</p>\n<p>You&#39;ll define architectural direction, own critical parts of the orchestration platform and other managed services, and drive cross-org initiatives in scheduling, quota enforcement, and scaling at hyperscale.</p>\n<p>You&#39;ll mentor senior engineers, establish org-wide best practices in reliability and observability, and ensure CoreWeave&#39;s orchestration layer evolves to meet the demands of next-generation AI workloads.</p>\n<p><strong>Who You Are</strong></p>\n<ul>\n<li>8+ years of software engineering experience.</li>\n</ul>\n<ul>\n<li>Proven track record designing and operating large-scale distributed systems 
in production.</li>\n</ul>\n<ul>\n<li>Deep expertise in Slurm/Kubernetes internals and cloud-native development.</li>\n</ul>\n<ul>\n<li>Advanced proficiency in Go and distributed systems design and cloud-native development.</li>\n</ul>\n<ul>\n<li>Experience setting technical direction and influencing cross-team architecture.</li>\n</ul>\n<ul>\n<li>Bachelor&#39;s or Master&#39;s degree in CS, EE, or related field.</li>\n</ul>\n<p><strong>Preferred</strong></p>\n<ul>\n<li>Familiarity with orchestration and workflow technologies such as Ray, Kubeflow, Kueue, Istio, Knative, or Argo Workflows</li>\n</ul>\n<ul>\n<li>Deep expertise in Slurm/Kubernetes internals.</li>\n</ul>\n<ul>\n<li>Experience with distributed workloads, GPU-based applications, or ML pipelines.</li>\n</ul>\n<ul>\n<li>Knowledge of scheduling concepts like quota enforcement, pre-emption, and scaling strategies.</li>\n</ul>\n<ul>\n<li>Exposure to reliability practices including SLOs, alarms, and post-incident reviews.</li>\n</ul>\n<ul>\n<li>Experience with AI infrastructure and workloads (ML training, inference, or HPC).</li>\n</ul>\n<ul>\n<li>Ability to mentor senior engineers and elevate organizational standards.</li>\n</ul>\n<p><strong>Why CoreWeave?</strong></p>\n<p>At CoreWeave, we work hard, have fun, and move fast! 
We&#39;re in an exciting stage of hyper-growth that you will not want to miss out on.</p>\n<p>We&#39;re not afraid of a little chaos, and we&#39;re constantly learning.</p>\n<p>Our team cares deeply about how we build our product and how we work together, which is represented through our core values:</p>\n<ul>\n<li>Be Curious at Your Core</li>\n</ul>\n<ul>\n<li>Act Like an Owner</li>\n</ul>\n<ul>\n<li>Empower Employees</li>\n</ul>\n<ul>\n<li>Deliver Best-in-Class Client Experiences</li>\n</ul>\n<ul>\n<li>Achieve More Together</li>\n</ul>\n<p>We support and encourage an entrepreneurial outlook and independent thinking.</p>\n<p>We foster an environment that encourages collaboration and provides the opportunity to develop innovative solutions to complex problems.</p>\n<p>As we get set for take off, the growth opportunities within the organization are constantly expanding.</p>\n<p>You will be surrounded by some of the best talent in the industry, who will want to learn from you, too.</p>\n<p>Come join us!</p>\n<p><strong>Salary and Benefits</strong></p>\n<p>The base salary range for this role is $185,000 to $275,000.</p>\n<p>The starting salary will be determined based on job-related knowledge, skills, experience, and market location.</p>\n<p>We strive for both market alignment and internal equity when determining compensation.</p>\n<p>In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).</p>\n<p><strong>What We Offer</strong></p>\n<p>The range we&#39;ve posted represents the typical compensation range for this role.</p>\n<p>To determine actual compensation, we review the market rate for each candidate which can include a variety of factors.</p>\n<p>These include qualifications, experience, interview performance, and location.</p>\n<p>In addition to a competitive salary, we offer a variety of benefits to support your needs, including:</p>\n<ul>\n<li>Medical, 
dental, and vision insurance - 100% paid for by CoreWeave</li>\n</ul>\n<ul>\n<li>Company-paid Life Insurance</li>\n</ul>\n<ul>\n<li>Voluntary supplemental life insurance</li>\n</ul>\n<ul>\n<li>Short and long-term disability insurance</li>\n</ul>\n<ul>\n<li>Flexible Spending Account</li>\n</ul>\n<ul>\n<li>Health Savings Account</li>\n</ul>\n<ul>\n<li>Tuition Reimbursement</li>\n</ul>\n<ul>\n<li>Ability to Participate in Employee Stock Purchase Program (ESPP)</li>\n</ul>\n<ul>\n<li>Mental Wellness Benefits through Spring Health</li>\n</ul>\n<ul>\n<li>Family-Forming support provided by Carrot</li>\n</ul>\n<ul>\n<li>Paid Parental Leave</li>\n</ul>\n<ul>\n<li>Flexible, full-service childcare support with Kinside</li>\n</ul>\n<ul>\n<li>401(k) with a generous employer match</li>\n</ul>\n<ul>\n<li>Flexible PTO</li>\n</ul>\n<ul>\n<li>Catered lunch each day in our office and data center locations</li>\n</ul>\n<ul>\n<li>A casual work environment</li>\n</ul>\n<ul>\n<li>A work culture focused on innovative disruption</li>\n</ul>","url":"https://yubhub.co/jobs/job_96d05ee1-799","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4658801006","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$185,000 to $275,000","x-skills-required":["software engineering","distributed systems","Slurm","Kubernetes","cloud-native development","Go","scheduling","quota enforcement","scaling strategies","reliability practices","SLOs","alarms","post-incident reviews","AI infrastructure","workloads","ML training","inference","HPC"],"x-skills-preferred":["orchestration and workflow technologies","Ray","Kubeflow","Kueue","Istio","Knative","Argo 
Workflows"],"datePosted":"2026-04-18T15:53:28.322Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Bellevue, WA / Sunnyvale, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"software engineering, distributed systems, Slurm, Kubernetes, cloud-native development, Go, scheduling, quota enforcement, scaling strategies, reliability practices, SLOs, alarms, post-incident reviews, AI infrastructure, workloads, ML training, inference, HPC, orchestration and workflow technologies, Ray, Kubeflow, Kueue, Istio, Knative, Argo Workflows","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":185000,"maxValue":275000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_e8e9acc0-a63"},"title":"Technical Program Manager, Compute","description":"<p>As a Technical Program Manager on the Compute team, you will help drive the planning, coordination, and execution of programs that keep Anthropic&#39;s compute infrastructure running efficiently at scale.</p>\n<p>Our compute fleet is the foundation on which every model training run, evaluation, and inference workload depends. 
You&#39;ll join a small, high-impact TPM team and take ownership of critical workstreams across the compute lifecycle, from how supply is procured and brought online, to how capacity is allocated and utilized across teams.</p>\n<p>You&#39;ll partner with Infrastructure, Systems, Research, Finance, and Capacity Engineering to shape the processes, tooling, and coordination mechanisms that allow Anthropic to move fast while managing an increasingly complex compute environment.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Own and drive critical programs across the compute lifecycle, coordinating execution across multiple engineering, research, and operations teams</li>\n<li>Build and maintain operational visibility into the compute fleet, ensuring the organization has a clear picture of supply, demand, utilization, and health</li>\n<li>Lead cross-functional coordination for compute transitions: bringing new capacity online, migrating workloads, and managing decommissions across cloud providers and hardware platforms</li>\n<li>Partner with engineering and research leadership to navigate competing priorities and drive alignment on how compute resources are planned, allocated, and used</li>\n<li>Identify and close operational gaps across the compute pipeline, whether through new tooling, improved processes, or better cross-team communication</li>\n<li>Own trade-off discussions between utilization, cost, latency, and reliability, synthesizing inputs from technical and business stakeholders and communicating decisions to leadership</li>\n<li>Develop and improve the processes and frameworks the team uses to plan, track, and execute compute programs at increasing scale and complexity</li>\n</ul>\n<p>You may be a good fit if you:</p>\n<ul>\n<li>Have 7+ years of technical program management experience in infrastructure, platform engineering, or compute-intensive environments</li>\n<li>Have led complex, cross-functional programs involving multiple engineering teams with competing 
priorities and ambiguous requirements</li>\n<li>Have experience working with research or ML teams and translating their needs into operational plans and technical requirements</li>\n<li>Are comfortable diving deep into technical details (cloud infrastructure, cluster management, job scheduling, resource orchestration) while maintaining program-level visibility</li>\n<li>Thrive in ambiguous, fast-moving environments where you need to define scope and build processes from the ground up</li>\n<li>Have strong communication skills and can engage credibly with engineers, researchers, finance, and executive leadership</li>\n<li>Have a track record of building trust with engineering teams and driving changes through influence rather than authority</li>\n</ul>\n<p>Strong candidates may also have:</p>\n<ul>\n<li>Experience managing compute capacity across multiple cloud providers (AWS, GCP, Azure) or hybrid cloud/on-premises environments</li>\n<li>Familiarity with job scheduling, resource orchestration, or workload management systems (Kubernetes, Slurm, Borg, YARN, or custom schedulers)</li>\n<li>Experience with GPU or accelerator infrastructure, including the unique challenges of large-scale ML training and inference workloads</li>\n<li>Built or improved observability for infrastructure systems: dashboards, alerting, efficiency metrics, or cost attribution</li>\n<li>Capacity planning experience including demand forecasting, cost modeling, or hardware lifecycle management</li>\n<li>Scaled through hypergrowth in AI/ML, HPC, or large-scale cloud environments</li>\n</ul>","url":"https://yubhub.co/jobs/job_e8e9acc0-a63","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5138044008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$290,000-$365,000 USD","x-skills-required":["Technical Program Management","Compute Infrastructure","Cloud Providers","Job Scheduling","Resource Orchestration","Workload Management","GPU or Accelerator Infrastructure","Observability","Capacity Planning"],"x-skills-preferred":["Kubernetes","Slurm","Borg","YARN","Custom Schedulers","Demand Forecasting","Cost Modeling","Hardware Lifecycle Management"],"datePosted":"2026-04-18T15:52:47.770Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Technical Program Management, Compute Infrastructure, Cloud Providers, Job Scheduling, Resource Orchestration, Workload Management, GPU or Accelerator Infrastructure, Observability, Capacity Planning, Kubernetes, Slurm, Borg, YARN, Custom Schedulers, Demand Forecasting, Cost Modeling, Hardware Lifecycle Management","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":290000,"maxValue":365000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_9166d234-4c5"},"title":"Solutions Architect - HPC/AI/ML","description":"<p>As a Solutions Architect at CoreWeave, you will play a vital and dynamic role in helping customers establish their Kubernetes environment, develop proofs of concept, onboard, and optimise workloads. 
You will serve as the primary technical point of contact for customers, establishing strong technical relationships and ensuring their success with CoreWeave&#39;s cloud infrastructure offerings, focusing on AI/ML workloads within high-performance compute (HPC) environments.</p>\n<p>Collaborate closely with customers to understand their unique business needs and create, prototype, and deploy tailored solutions that align with their requirements. Lead proof of concept initiatives to showcase the value and viability of CoreWeave&#39;s solutions within specific environments.</p>\n<p>Drive technical leadership and direction during customer meetings, presentations, and workshops, addressing any technical queries or concerns that arise. Act as a virtual member of CoreWeave&#39;s Kubernetes product and engineering teams, identifying opportunities for product enhancement and collaborating with engineers to implement your suggestions.</p>\n<p>Offer valuable insights on product features, functionality, and performance, contributing regularly to discussions about product strategy and architecture. Conduct periodic technical reviews and assessments of customer workloads, pinpointing opportunities for workload optimisation and suggesting suitable solutions.</p>\n<p>Stay informed of the latest developments and trends in Kubernetes, cloud computing and infrastructure, sharing your thought leadership with customers and internal stakeholders. 
Lead the prototyping and initiation of research and development efforts for emerging products and solutions, delivering prototypes and key insights for internal consumption.</p>\n<p>Represent CoreWeave at conferences and industry events, with occasional travel as required.</p>","url":"https://yubhub.co/jobs/job_9166d234-4c5","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4649044006","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$165,000 to $225,000 SGD","x-skills-required":["cloud computing concepts","architecture","technologies","NVIDIA GPUs","Infiniband","NVIDIA Collective Communications Library (NCCL)","Slurm","Kubernetes"],"x-skills-preferred":["code contributions to open-source inference frameworks","scripting and automation related to AI/ML workloads","building solutions across multi-cloud environments","client or customer-facing publications/talks on latency, optimisation, or advanced model-server architectures"],"datePosted":"2026-04-18T15:51:30.371Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Singapore"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"cloud computing concepts, architecture, technologies, NVIDIA GPUs, Infiniband, NVIDIA Collective Communications Library (NCCL), Slurm, Kubernetes, code contributions to open-source inference frameworks, scripting and automation related to AI/ML workloads, building solutions across multi-cloud environments, client or customer-facing publications/talks on latency, optimisation, or advanced model-server 
architectures","baseSalary":{"@type":"MonetaryAmount","currency":"SGD","value":{"@type":"QuantitativeValue","minValue":165000,"maxValue":225000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_1868194d-726"},"title":"Operations Engineer, HPC Networking","description":"<p>In this role, you will support the deployment, monitoring, troubleshooting, and maintenance of large-scale InfiniBand fabrics, ensuring their stability and performance.</p>\n<p>The ideal candidate will have a strong operations mindset, effective collaboration skills, and the ability to solve complex issues in a dynamic environment.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Regularly monitoring the performance and health of InfiniBand fabrics, including switches, host adapters, and nodes.</li>\n<li>Investigating and resolving operational issues within InfiniBand fabrics, such as network connectivity problems and performance bottlenecks.</li>\n<li>Assisting with the installation and operational bring-up of large InfiniBand fabrics in collaboration with onsite personnel and customer teams.</li>\n<li>Performing routine maintenance and upgrades on InfiniBand switches and control plane components.</li>\n<li>Collaborating with HPC cluster operations teams to provide troubleshooting and operational expertise.</li>\n</ul>\n<p>Investing in our people is one of our top priorities, and we value candidates who can bring their diversified experiences to our teams.</p>\n<p>Minimum Qualifications:</p>\n<ul>\n<li>At least 1 year of experience with InfiniBand or similar networking technologies.</li>\n<li>Solid understanding of networking concepts, including architectures, topologies, operational best practices, and troubleshooting.</li>\n<li>Experience with Linux system administration and maintenance.</li>\n<li>Proficiency in at least one scripting language.</li>\n</ul>\n<p>Preferred 
Qualifications:</p>\n<ul>\n<li>Hands-on experience with Nvidia UFM or similar fabric management tools.</li>\n<li>Familiarity with the SLURM job scheduler and its role in HPC environments.</li>\n<li>Experience with monitoring and visualization platforms such as Grafana or Prometheus.</li>\n<li>Experience with operational tooling and automation frameworks like Ansible.</li>\n<li>Knowledge of data center operations, including server racks and cabling.</li>\n<li>Python or Bash scripting.</li>\n</ul>\n<p>Why CoreWeave? At CoreWeave, we work hard, have fun, and move fast! We’re in an exciting stage of hyper-growth that you will not want to miss out on. We’re not afraid of a little chaos, and we’re constantly learning. Our team cares deeply about how we build our product and how we work together, which is represented through our core values:</p>\n<ul>\n<li>Be Curious at Your Core</li>\n<li>Act Like an Owner</li>\n<li>Empower Employees</li>\n<li>Deliver Best-in-Class Client Experiences</li>\n<li>Achieve More Together</li>\n</ul>\n<p>We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and enables the development of innovative solutions to complex problems. As we get set for takeoff, the organization&#39;s growth opportunities are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too.</p>\n<p>Come join us!</p>\n<p>The base salary range for this role is $110,000 to $179,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. 
We strive for both market alignment and internal equity when determining compensation.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_1868194d-726","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4673462006","x-work-arrangement":"hybrid","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":"$110,000 to $179,000","x-skills-required":["InfiniBand","Linux system administration","Scripting language","Networking concepts","Architectures","Topologies","Operational best practices","Troubleshooting"],"x-skills-preferred":["Nvidia UFM","SLURM job scheduler","Grafana","Prometheus","Ansible","Data center operations","Server racks","Cabling","Python","Bash scripting"],"datePosted":"2026-04-18T15:50:12.336Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"InfiniBand, Linux system administration, Scripting language, Networking concepts, Architectures, Topologies, Operational best practices, Troubleshooting, Nvidia UFM, SLURM job scheduler, Grafana, Prometheus, Ansible, Data center operations, Server racks, Cabling, Python, Bash scripting","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":110000,"maxValue":179000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_854e95b5-76b"},"title":"Sr. Director of Product, Research and Training Infrastructure","description":"<p>CoreWeave is seeking a visionary Sr. 
Director of Product, Research and Training Infrastructure to lead the product strategy and engineering execution for the services that power the most ambitious AI research labs in the world.</p>\n<p>This executive leader will own the product strategy and engineering execution for the Research Training Stack, focusing on the specialized orchestration, evaluation, and iteration tools required for massive-scale pre-training and post-training.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Frontier Orchestration: Oversee the evolution of SUNK (Slurm on Kubernetes) to provide researchers with deterministic, bare-metal performance through a cloud-native interface.</li>\n<li>Holistic Training Services: Drive the development of next-generation orchestrators and automated training-based evaluation frameworks that ensure model quality throughout the lifecycle.</li>\n<li>Post-Training Excellence: Build the infrastructure required for sophisticated Reinforcement Learning (RL) and RLHF pipelines, enabling labs to refine foundation models with maximum efficiency.</li>\n<li>Customer Advocacy: Act as the primary technical partner for lead researchers at global AI labs, translating their &#39;future-state&#39; requirements into actionable product roadmaps.</li>\n</ul>\n<p>Requirements include:</p>\n<ul>\n<li>Proven engineering leadership experience, with 5+ years managing large-scale infrastructure at a top-tier research lab or an AI-native cloud provider.</li>\n<li>Deep, hands-on knowledge of Slurm, Kubernetes, and the specific networking requirements (InfiniBand/RDMA) for distributed training clusters.</li>\n<li>Research mindset and understanding of the &#39;pain points&#39; of a research scientist.</li>\n<li>Scaling experience delivering mission-critical services on multi-thousand GPU clusters (H100/Blackwell/Rubin architectures).</li>\n<li>Strategic vision to 
define &#39;what&#39;s next&#39; in the AI stack, from automated RL loops to specialized sandbox environments.</li>\n</ul>\n<p>Why CoreWeave?</p>\n<p>In 2026, CoreWeave is the foundation of the largest infrastructure buildout in human history. We are building AI Factories, not just data centers.</p>\n<ul>\n<li>Silicon-Up Innovation: Work directly with the latest NVIDIA architectures.</li>\n<li>Impact: You will be the architect of the environment that enables the next discovery.</li>\n<li>Velocity: We move at the speed of the researchers we support, bypassing legacy cloud bottlenecks to deliver raw power.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_854e95b5-76b","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4665964006","x-work-arrangement":"hybrid","x-experience-level":"executive","x-job-type":"full-time","x-salary-range":"$233,000 to $341,000","x-skills-required":["Slurm","Kubernetes","InfiniBand/RDMA","Distributed training clusters","GPU clusters","H100/Blackwell/Rubin architectures","Reinforcement Learning (RL)","RLHF pipelines"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:50:11.130Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Slurm, Kubernetes, InfiniBand/RDMA, Distributed training clusters, GPU clusters, H100/Blackwell/Rubin architectures, Reinforcement Learning (RL), RLHF 
pipelines","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":233000,"maxValue":341000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_0f249232-d14"},"title":"Principal Engineer, Cluster Orchestration","description":"<p>As a Principal Engineer in AI Infrastructure, you will lead the design and evolution of the cluster orchestration systems that make this possible. This includes Slurm, Kubernetes, SUNK, and the control planes that support AI training, inference, and model onboarding at scale.</p>\n<p>You will define long-term architecture, solve hard scaling problems, and set technical direction across teams. Your work will directly affect how quickly customers can run models, how efficiently we use GPUs, and how reliably the platform behaves at scale.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Defining the long-term architecture for CoreWeave&#39;s orchestration platforms across Kubernetes, Slurm, SUNK, Kueue, and related systems.</li>\n<li>Acting as a technical authority on scheduling, quota enforcement, fairness, pre-emption, and multi-tenant GPU isolation.</li>\n<li>Making design decisions that balance performance, reliability, cost, and operational complexity.</li>\n</ul>\n<p>In addition to these responsibilities, you will also lead the evolution of Kubernetes-native control planes, including SUNK and custom operators, and design systems that support workload admission, validation, and rollout, including model onboarding flows.</p>\n<p>You will work closely with cross-functional teams to ensure that the systems you design and implement meet the needs of our customers and are scalable, reliable, and efficient.</p>\n<p>If you have a passion for building large-scale distributed systems and are looking for a challenging and rewarding role, we encourage you to apply.</p>\n<p 
style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_0f249232-d14","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4658799006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$206,000 to $303,000","x-skills-required":["Kubernetes","Slurm","SUNK","Go","Cloud-native systems development","GPU-heavy platforms for AI training, inference, or HPC workloads"],"x-skills-preferred":["Kueue","Kubeflow","Argo Workflows","Ray","Istio","Knative"],"datePosted":"2026-04-18T15:48:07.140Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Bellevue, WA / Sunnyvale, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Kubernetes, Slurm, SUNK, Go, Cloud-native systems development, GPU-heavy platforms for AI training, inference, or HPC workloads, Kueue, Kubeflow, Argo Workflows, Ray, Istio, Knative","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":206000,"maxValue":303000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_b3289639-f91"},"title":"Machine Learning Engineer, Open-Source Software","description":"<p>About Mistral AI</p>\n<p>We believe in the power of AI to simplify tasks, save time, and enhance learning and creativity. Our technology is designed to integrate seamlessly into daily working life.</p>\n<p>Role Summary</p>\n<p>You will be in charge of open-sourcing state-of-the-art models, whilst maintaining and improving Mistral’s publicly available libraries. 
Your work is critical in helping turn research breakthroughs into tangible solutions and improve Mistral&#39;s open-source ecosystem.</p>\n<p>Responsibilities</p>\n<p>• Releasing our models to open-source platforms and libraries, e.g., vLLM, GitHub, Hugging Face\n• Maintaining Mistral’s open-source libraries (mistral-common, mistral-finetune, mistral-inference)\n• Creating and maintaining tooling and services: both internal facing (internal research) and external facing (open-source libraries)\n• Implementing and optimizing open-source and internal libraries for performance and accuracy, ensuring production readiness and employing cutting-edge technology and innovative approaches\n• Collaborating with the open-source community (PyTorch, vLLM, Hugging Face)</p>\n<p>About You</p>\n<p>• Master’s degree in Computer Science, Machine Learning, Data Science, or a related field\n• Experience contributing to popular open-source libraries such as PyTorch, TensorFlow, JAX, vLLM, Transformers, Llama.cpp, ...\n• Passion for contributing to the open-source software ecosystem\n• Expert programming skills in Python, PyTorch, MLOps\n• Adaptable, proactive, and autonomous\n• Attention to detail and a drive to go the last mile to build almost perfect tools\n• Deep understanding of machine learning approaches, especially LLMs and algorithms\n• Low-ego and collaborative, with a real team-player mindset</p>\n<p>Now, it would be ideal if you have:</p>\n<p>• Experience with training and fine-tuning large language models (e.g., distillation, supervised fine-tuning, policy optimization)\n• Experience working with Slurm\n• Worked with research teams before\n• Experience as a core-maintainer of a popular ML open-source library</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_b3289639-f91","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Mistral 
AI","sameAs":"https://mistral.ai","logo":"https://logos.yubhub.co/mistral.ai.png"},"x-apply-url":"https://jobs.lever.co/mistral/ef4c26fc-3fdb-4dd2-a64e-95264ee769dd","x-work-arrangement":"hybrid","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Python","PyTorch","MLOps","Open-source software development","Machine learning","Large language models","Slurm"],"x-skills-preferred":["Experience with training and fine-tuning large language models","Experience working with Slurm","Research team experience","Core-maintainer of a popular ML open-source library"],"datePosted":"2026-04-17T12:48:06.893Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Paris"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Python, PyTorch, MLOps, Open-source software development, Machine learning, Large language models, Slurm, Experience with training and fine-tuning large language models, Experience working with Slurm, Research team experience, Core-maintainer of a popular ML open-source library"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_5f40194b-3c0"},"title":"Product Manager, Forge","description":"<p>We are seeking a talented and experienced product manager to define and execute the strategy for Forge, our product that enables customers to build, fine-tune and deploy custom AI models at scale.</p>\n<p>Forge turns cutting-edge research into enterprise-ready capabilities by powering model fine-tuning, reinforcement learning and post-training workflows. 
By working at the intersection of research and product, it provides customers with the tools to train specialized models that deliver real-world business value.</p>\n<p>As the PM leading Forge, you will shape a 0-1 product with significant business impact and the potential to grow the offering while defining how organizations train and deploy the next generation of AI models.</p>\n<p>Key Responsibilities:</p>\n<p>Define the Future • Set the vision: Shape and evangelize a compelling product strategy for Forge, ensuring alignment with company goals and market opportunities.</p>\n<p>Spot the gaps: Lead market and UX research to uncover unmet needs, competitive whitespace, and emerging trends in SOTA AI post-training capabilities.</p>\n<p>Build &amp; Ship • Own the lifecycle: Drive end-to-end product development, from ideation to launch and iteration, balancing speed, quality, and user delight.</p>\n<p>Champion the user: Partner with design and research to craft intuitive, high-impact experiences, using data and feedback to refine continuously.</p>\n<p>Scale, Execute, &amp; Enable • Go-to-market: Collaborate with marketing and sales to launch products successfully, including pricing, positioning, and adoption strategies.</p>\n<p>Align stakeholders: Rally engineering, design, and business teams around priorities, trade-offs, and timelines.</p>\n<p>Prioritize ruthlessly: Maintain a dynamic roadmap that delivers quick wins while advancing long-term bets.</p>\n<p>Requirements:</p>\n<p>Product Management Experience - 5+ years of relevant experience in new, competitive, fast-paced and ambiguous environments with a track record of building and scaling complex AI/ML or infrastructure solutions.</p>\n<p>Technical skills - Very good understanding of training pipelines, RL loops, and model deployment architectures.</p>\n<p>Expertise in AI model lifecycle management, including fine-tuning, evaluation, and serving.</p>\n<p>Experience with Infrastructure as Code (IaC), containerization, and 
scalable deployment modes (e.g., on-prem, VPC, cloud).</p>\n<p>Familiarity with Kubernetes/Slurm is a strong plus.</p>\n<p>User obsession - Relentless focus on solving real user problems, backed by data and qualitative insights.</p>\n<p>Cross-functional influence - Proven ability to align and inspire engineering, design, and go-to-market teams without direct authority.</p>\n<p>Problem-solving - Balance big-picture thinking with hands-on problem-solving; you’re equally comfortable crafting a roadmap, diving into metrics, and running technical tests.</p>\n<p>Communication: Crisp, persuasive storytelling for executives, teams, and users; the ability to distill complex technical concepts (e.g., RL, LoRA, SFT) into clear narratives for docs, decks, and workshops.</p>\n<p>Adaptability: Thrive in high-velocity, dynamic settings where priorities shift quickly.</p>\n<p>Collaboration: Low ego + high EQ; you build trust and drive decisions through clarity, not hierarchy.</p>\n<p>Autonomy: Self-directed with a bias for action; you own outcomes end-to-end.</p>\n<p>Preferred Qualifications:</p>\n<p>Infrastructure knowledge - Strong knowledge of model training, model architectures, etc.</p>\n<p>Strong understanding of how complex architectures are designed and of the impact of deployment modes.</p>\n<p>Proficient coding skills are strongly recommended.</p>\n<p>Kubernetes know-how is strongly recommended.</p>\n<p>Growth mindset: Deep familiarity with product-led growth strategies (e.g., viral loops, onboarding optimization, monetization, etc.).</p>\n<p>Builder’s mindset: Founder or early-stage PM experience; you’ve turned 0 → 1 ideas into products users love.</p>\n<p>Technical depth: Ability to prototype, hack, or dive into code when needed.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_5f40194b-3c0","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Mistral 
AI","sameAs":"https://mistral.ai/careers","logo":"https://logos.yubhub.co/mistral.ai.png"},"x-apply-url":"https://jobs.lever.co/mistral/11087966-f183-44b1-adc9-3a400c1f52ad","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["training pipelines","RL loops","model deployment architectures","AI model lifecycle management","fine-tuning","evaluation","serving","Infrastructure as Code (IaC)","containerization","scalable deployment modes","Kubernetes/Slurm"],"x-skills-preferred":["model training","model architectures","complex architectures","deployment modes","proficient coding skills","Kubernetes know-how","product-led growth strategies","viral loops","onboarding optimization","monetization"],"datePosted":"2026-04-17T12:47:21.747Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Paris"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"training pipelines, RL loops, model deployment architectures, AI model lifecycle management, fine-tuning, evaluation, serving, Infrastructure as Code (IaC), containerization, scalable deployment modes, Kubernetes/Slurm, model training, model architectures, complex architectures, deployment modes, proficient coding skills, Kubernetes know-how, product-led growth strategies, viral loops, onboarding optimization, monetization"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_50cacac8-b47"},"title":"Research Engineer, Machine Learning","description":"<p><strong>About the Role</strong></p>\n<p>We are seeking a Research Engineer to join our Machine Learning team. 
As a Research Engineer, you will work on building and optimizing large-scale learning systems that power our open-weight models.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Accelerate researchers by taking on the heavy parts of large-scale ML pipelines and building robust tools.</li>\n<li>Interface cutting-edge research with production: integrate checkpoints, streamline evaluation, and expose APIs.</li>\n<li>Conduct experiments on the latest deep-learning techniques.</li>\n<li>Design, implement and benchmark ML algorithms; write clear, efficient code in Python.</li>\n<li>Deliver prototypes that become production-grade components for Le Chat and our enterprise API.</li>\n</ul>\n<p><strong>Requirements</strong></p>\n<ul>\n<li>Master&#39;s or PhD in Computer Science (or equivalent proven track record).</li>\n<li>4+ years working on large-scale ML codebases.</li>\n<li>Hands-on with PyTorch, JAX or TensorFlow; comfortable with distributed training (DeepSpeed / FSDP / SLURM / K8s).</li>\n<li>Experience in deep learning, NLP or LLMs; bonus for CUDA or data-pipeline chops.</li>\n<li>Strong software-design instincts: testing, code review, CI/CD.</li>\n<li>Self-starter, low-ego, collaborative.</li>\n</ul>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Competitive cash salary and equity.</li>\n<li>Food: Daily lunch vouchers.</li>\n<li>Sport: Monthly contribution to a Gympass subscription.</li>\n<li>Transportation: Monthly contribution to a mobility pass.</li>\n<li>Health: Full health insurance for you and your family.</li>\n<li>Parental: Generous parental leave policy.</li>\n</ul>\n<p>Note: Benefits may vary depending on location.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_50cacac8-b47","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Mistral 
AI","sameAs":"https://mistral.ai/careers","logo":"https://logos.yubhub.co/mistral.ai.png"},"x-apply-url":"https://jobs.lever.co/mistral/07447e1d-7900-46d4-b61b-186f2f76847f","x-work-arrangement":"hybrid","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["PyTorch","JAX","TensorFlow","DeepSpeed","FSDP","SLURM","K8s","Python","CUDA","data-pipeline"],"x-skills-preferred":[],"datePosted":"2026-04-17T12:47:05.094Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Paris"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"PyTorch, JAX, TensorFlow, DeepSpeed, FSDP, SLURM, K8s, Python, CUDA, data-pipeline"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_a3d60aab-0bb"},"title":"Research Platform Engineer","description":"<p><strong>About Mistral AI</strong></p>\n<p>Mistral AI is an AI technology company that develops high-performance, optimized, open-source and cutting-edge models, products and solutions.</p>\n<p><strong>Role Summary – Software Engineering track</strong></p>\n<p>As a Research Engineer on the software side, you will design and harden the codebase, tools and distributed services that let our scientists train and ship frontier-scale models. You do not need prior ML experience; what matters is writing clean, reliable code that scales. 
You will join our Platform REs team to build and maintain shared dev-tools, evaluation &amp; data pipelines, training framework, cluster tooling and CI/CD.</p>\n<p><strong>Responsibilities</strong></p>\n<p>• Accelerate researchers by owning the complex parts of large-scale pipelines and delivering robust internal tooling.\n• Interface research with product: expose clean APIs, automate model pushes, surface live metrics.\n• Write efficient, well-tested Python and systems code; enforce code review, CI, and observability.\n• Design and optimise distributed services (Kubernetes / SLURM, thousands-of-GPU jobs).\n• Prototype utilities (CLI, dashboards) and carry them through to stable, shared libraries.</p>\n<p><strong>About the Research Engineering team</strong></p>\n<p>Based in Paris and London, our REs move fluidly along the research ↔ production spectrum. Engineers can rotate between Platform and Embedded tracks as their interests evolve.</p>\n<p><strong>About you</strong></p>\n<p>• Master’s in Computer Science (or equivalent experience).\n• 4+ years building and operating large-scale or distributed systems.\n• Strong software-design instincts: modular code, tests, CI/CD, observability.\n• Fluency in Python plus one systems language (C++, Rust, Go or Java).\n• Hands-on with container orchestration and schedulers (Kubernetes / K8s, SLURM, or similar).\n• Comfortable profiling performance, optimising I/O, and automating workflows.\n• Self-starter, low-ego, collaborative, high-energy.</p>\n<p><strong>Benefits</strong></p>\n<p>France:\n• Competitive cash salary and equity\n• Food: Daily lunch vouchers\n• Sport: Monthly contribution to a Gympass subscription\n• Transportation: Monthly contribution to a mobility pass\n• Health: Full health insurance for you and your family\n• Parental: Generous parental leave policy</p>\n<p>UK:\n• Competitive cash salary and equity\n• Insurance\n• Transportation: Reimburse office parking charges, or £90 per month for public transport\n• 
Sport: £90 per month reimbursement for gym membership\n• Meal voucher: £200 monthly allowance for meals\n• Pension plan: SmartPension (percentages are 5% Employee &amp; 3% Employer)</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_a3d60aab-0bb","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Mistral AI","sameAs":"https://mistral.ai/careers","logo":"https://logos.yubhub.co/mistral.ai.png"},"x-apply-url":"https://jobs.lever.co/mistral/df0d75c1-97ef-4e50-85e6-0ffd8f5b7d7c","x-work-arrangement":"hybrid","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Python","C++","Rust","Go","Java","Kubernetes","SLURM","container orchestration","schedulers"],"x-skills-preferred":[],"datePosted":"2026-04-17T12:46:02.806Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Paris"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Python, C++, Rust, Go, Java, Kubernetes, SLURM, container orchestration, schedulers"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_4075c787-328"},"title":"Member of Technical Staff - Large Scale Data Infrastructure","description":"<p>We&#39;re looking for infrastructure engineers to work at peta-to-exabyte scale. 
You&#39;ll build data systems behind the largest training runs on thousands of GPUs, where fixing one bottleneck lets researchers train the next breakthrough model.</p>\n<p><strong>What You&#39;ll Work On:</strong></p>\n<ul>\n<li>Scalable data loaders for training runs across thousands of GPUs</li>\n<li>Efficient storage and retrieval systems for petabyte-scale datasets</li>\n<li>Multi-cloud object storage abstraction</li>\n<li>Execute large-scale data migrations across storage systems and providers</li>\n<li>Debug and resolve performance bottlenecks in distributed data loading</li>\n</ul>\n<p><strong>Technical Focus:</strong></p>\n<ul>\n<li>Python, PyTorch DataLoader internals</li>\n<li>Object storage (e.g. S3, Azure Blob, GCS)</li>\n<li>Parquet for metadata</li>\n<li>Video: ffmpeg, PyAV, codec fundamentals</li>\n</ul>\n<p><strong>What We&#39;re Looking For:</strong></p>\n<ul>\n<li>Built and operated data pipelines at petabyte scale</li>\n<li>Optimized data loading</li>\n<li>Worked with petabyte-scale video and image datasets</li>\n<li>Written processing jobs operating on millions of files</li>\n<li>Debugged distributed system bottlenecks across large fleets of machines</li>\n</ul>\n<p><strong>Nice to Have:</strong></p>\n<ul>\n<li>Experience streaming dataset formats (e.g. WebDataset)</li>\n<li>Video codec internals and frame-accurate seeking</li>\n<li>Distributed systems experience</li>\n<li>Slurm and Kubernetes for job orchestration</li>\n<li>Experience with object storage performance tuning across providers</li>\n</ul>\n<p><strong>How We Work Together:</strong></p>\n<ul>\n<li>We&#39;re a distributed team with real offices that people actually use. Depending on your role, you&#39;ll either join us in Freiburg or SF at least 2 days a week (or one full week every other week), or work remotely with a monthly in-person week to stay connected. We&#39;ll cover reasonable travel costs to make this possible. 
We think in-person time matters, and we&#39;ve structured things to make it accessible to all. We&#39;ll discuss what this will look like for the role during our interview process.</li>\n</ul>\n<p><strong>Everything we do is grounded in four values:</strong></p>\n<ul>\n<li>Obsessed. We are a frontier research lab. The science has to be right, the understanding deep, the product beautiful.</li>\n<li>Low Ego. The work speaks. The best idea wins, no matter who said it. Credit is shared. Nobody is above any task.</li>\n<li>Bold. We take the ambitious bet. We ship, we do not wait for conditions to be perfect.</li>\n<li>Kind. People over politics. We treat each other with genuine warmth. Agency without empathy creates chaos.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_4075c787-328","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Black Forest Labs","sameAs":"https://www.blackforestlabs.com/","logo":"https://logos.yubhub.co/blackforestlabs.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/blackforestlabs/jobs/5019171008","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$180,000–$300,000 USD + Equity","x-skills-required":["Python","PyTorch","Data Loader Internals","Object Storage","Parquet","Video","ffmpeg","PyAV","Codec Fundamentals"],"x-skills-preferred":["WebDataset","Distributed Systems","Slurm","Kubernetes","Object Storage Performance Tuning"],"datePosted":"2026-04-17T12:26:28.781Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Freiburg (Germany), San Francisco (USA)"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Python, PyTorch, Data Loader Internals, Object Storage, Parquet, Video, ffmpeg, PyAV, Codec Fundamentals, WebDataset, Distributed Systems, Slurm, 
Kubernetes, Object Storage Performance Tuning","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":180000,"maxValue":300000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_c1dcea75-d5a"},"title":"Member of Technical Staff - Infrastructure Engineer","description":"<p>We&#39;re looking for an experienced engineer to join our team in Freiburg, Germany or San Francisco, USA. As a Member of Technical Staff - Infrastructure Engineer, you will be responsible for maintaining and scaling our research infrastructure, ensuring health and optimizing components to extract peak performance from the system. You will also collaborate with research teams to deeply understand their infrastructure needs and design solutions that balance performance with cost efficiency.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Maintaining research infrastructure, ensuring health, and optimizing components to extract peak performance from the system (both on application and infrastructure side)</li>\n<li>Scaling infrastructure to meet growing research demands while maintaining reliability and performance</li>\n<li>Collaborating with research teams to deeply understand their infrastructure needs, and design solutions that balance performance with cost efficiency</li>\n<li>Identifying and resolving performance bottlenecks and capacity hotspots through deep analysis of distributed systems at scale</li>\n<li>Building and evolving telemetry and monitoring systems to provide deep visibility into infrastructure performance, utilization, and costs across our cloud and datacenter fleets</li>\n<li>Participating in on-call rotations and incident response to maintain system reliability</li>\n</ul>\n<p>Technical focus includes:</p>\n<ul>\n<li>Python, Bash, Go</li>\n<li>Kubernetes</li>\n<li>Nvidia GPU drivers and operators</li>\n<li>OTel, 
Prometheus</li>\n</ul>\n<p>Requirements include:</p>\n<ul>\n<li>Experience building or operating large-scale training platforms</li>\n<li>Worked with large-scale compute clusters (GPUs)</li>\n<li>Proven ability to debug performance and reliability issues across large distributed fleets</li>\n<li>Strong problem-solving skills and ability to work independently</li>\n<li>Strong communication skills and the ability to work effectively with both internal and external partners</li>\n<li>Deep knowledge of modern cloud infrastructure including Kubernetes, Infrastructure as Code, AWS, and GCP</li>\n<li>Experience with SLURM</li>\n</ul>\n<p>We offer a competitive base annual salary of $180,000-$300,000 USD and a hybrid work model with a meaningful in-person presence.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_c1dcea75-d5a","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Black Forest Labs","sameAs":"https://www.blackforestlabs.com/","logo":"https://logos.yubhub.co/blackforestlabs.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/blackforestlabs/jobs/4925659008","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$180,000-$300,000 USD","x-skills-required":["Python","Bash","Go","Kubernetes","Nvidia GPU drivers","Nvidia GPU operators","OTel","Prometheus","Experience building or operating large-scale training platforms","Worked with large-scale compute clusters (GPUs)","Proven ability to debug performance and reliability issues across large distributed fleets","Strong problem-solving skills and ability to work independently","Strong communication skills and the ability to work effectively with both internal and external partners","Deep knowledge of modern cloud infrastructure including Kubernetes, Infrastructure as Code, AWS, and GCP","Experience with 
SLURM"],"x-skills-preferred":[],"datePosted":"2026-04-17T12:25:55.745Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Freiburg (Germany), San Francisco (USA)"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Python, Bash, Go, Kubernetes, Nvidia GPU drivers, Nvidia GPU operators, OTel, Prometheus, Experience building or operating large-scale training platforms, Worked with large-scale compute clusters (GPUs), Proven ability to debug performance and reliability issues across large distributed fleets, Strong problem-solving skills and ability to work independently, Strong communication skills and the ability to work effectively with both internal and external partners, Deep knowledge of modern cloud infrastructure including Kubernetes, Infrastructure as Code, AWS, and GCP, Experience with SLURM","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":180000,"maxValue":300000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_89406e8e-f38"},"title":"Machine Learning Engineer, Open-Source Software","description":"<p>You will be in charge of open-sourcing state-of-the-art models, whilst maintaining and improving Mistral’s publicly available libraries. Your work is critical in helping turn research breakthroughs into tangible solutions and improve Mistral&#39;s open-source ecosystem.</p>\n<p>About the Open Source Software team\nOur OSS team is embedded in our Science team and works very closely with various engineering and marketing teams. 
All OSS team members can move fluidly along the production/research spectrum, depending on where the needs are or where their interests lie.</p>\n<p>Responsibilities\n• Releasing our models to open-source platforms and libraries, e.g., vLLM, GitHub, Hugging Face\n• Maintaining Mistral’s open-source libraries (mistral-common, mistral-finetune, mistral-inference)\n• Creating and maintaining tooling and services: both internal-facing (internal research) and external-facing (open-source libraries)\n• Implementing and optimizing open-source and internal libraries for performance and accuracy, ensuring production readiness and employing cutting-edge technology and innovative approaches\n• Collaborating with the open-source community (PyTorch, vLLM, Hugging Face)</p>\n<p>About you\n• Master’s degree in Computer Science, Machine Learning, Data Science, or a related field\n• Experience contributing to popular open-source libraries such as PyTorch, Tensorflow, JAX, vLLM, Transformers, Llama.cpp, ...\n• Passion for contributing to the open-source software ecosystem\n• Expert programming skills in Python, PyTorch, MLOps\n• Adaptable, proactive, and autonomous\n• Attention to detail and a drive to go the last mile to build almost perfect tools\n• Deep understanding of machine learning approaches, especially LLMs and algorithms\n• Low-ego, collaborative, with a real team-player mindset</p>\n<p>Now, it would be ideal if you have:\n• Experience with training and fine-tuning large language models (e.g., distillation, supervised fine-tuning, policy optimization)\n• Experience working with Slurm\n• Worked with research teams before\n• Experience as a core-maintainer of a popular ML open-source library</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_89406e8e-f38","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Mistral 
AI","sameAs":"https://mistral.ai"},"x-apply-url":"https://jobs.lever.co/mistral/ef4c26fc-3fdb-4dd2-a64e-95264ee769dd","x-work-arrangement":"hybrid","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Python","PyTorch","MLOps","Machine Learning","Large Language Models","Slurm","Open-source libraries"],"x-skills-preferred":["vLLM","GitHub","Hugging Face","PyTorch","Tensorflow","JAX","Transformers","Llama.cpp"],"datePosted":"2026-03-10T11:30:04.700Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Paris"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Python, PyTorch, MLOps, Machine Learning, Large Language Models, Slurm, Open-source libraries, vLLM, GitHub, Hugging Face, PyTorch, Tensorflow, JAX, Transformers, Llama.cpp"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_b151fcc2-2fb"},"title":"Member of Technical Staff, High Performance Computing Engineer","description":"<p>We are looking for experienced Member of Technical Staff, High Performance Computing Engineers to help build and scale the infrastructure that trains our frontier models and powers the next evolution of our personal AI, Copilot.</p>\n<p>This role offers the unique opportunity to work on some of the largest scale supercomputers in the world – a rare chance to operate at such a significant scale.</p>\n<p><strong>Responsibilities</strong></p>\n<p>Design, operate, and maintain large-scale HPC environments, drawing on hands-on engineering experience in production settings.</p>\n<p>Own the deployment, configuration, and day-to-day operation of HPC schedulers (e.g., SLURM, Kubernetes), ensuring reliable and efficient job scheduling at scale.</p>\n<p>Serve as a technical owner for at least one core HPC domain (GPU compute, high-performance storage, networking, or similar), including ongoing 
maintenance, performance tuning, and troubleshooting of massive clusters.</p>\n<p>Develop and maintain automation and tooling using Bash and/or Python to improve cluster reliability, observability, and operational efficiency.</p>\n<p>Partner closely with researchers and engineers to support their workloads, troubleshoot cluster usage issues, and triage failed or underperforming jobs to resolution.</p>\n<p>Drive work forward independently by navigating ambiguity and technical roadblocks, delivering incremental improvements that get capabilities into users’ hands quickly.</p>\n<p><strong>Qualifications</strong></p>\n<p>Do you have a Bachelor’s degree in computer science, or related technical field AND 4+ years technical engineering experience with deploying or operating on-premise or cloud high-performance clusters, AND 4+ years experience working with high-scale training clusters (ex. working with frameworks/tools such as nvidia InfiniBand clusters, SLURM, Kubernetes, Ray, etc.), AND 4+ years experience building scalable services on top of public cloud infrastructure like Azure, AWS, or GCP, OR equivalent experience?</p>\n<p><strong>Preferred Qualifications</strong></p>\n<p>Master’s Degree in Computer Science or related technical field AND 6+ years technical engineering experience with deploying or operating on-premise or cloud high-performance clusters, AND 6+ years experience working with high-scale training clusters (ex. 
working with frameworks/tools such as nvidia InfiniBand clusters, SLURM, Kubernetes, Ray, etc.), AND 6+ years experience building scalable services on top of public cloud infrastructure like Azure, AWS, or GCP, OR equivalent experience.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_b151fcc2-2fb","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft AI","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/member-of-technical-staff-high-performance-computing-engineer-mai-superintelligence-team-3/","x-work-arrangement":"onsite","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["HPC","SLURM","Kubernetes","GPU compute","high-performance storage","networking","Bash","Python","nvidia InfiniBand clusters","Ray"],"x-skills-preferred":["LLM training clusters","AI platforms","Machine Learning frameworks","large-scale HPC or GPU systems"],"datePosted":"2026-03-08T22:15:08.170Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Zürich"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"HPC, SLURM, Kubernetes, GPU compute, high-performance storage, networking, Bash, Python, nvidia InfiniBand clusters, Ray, LLM training clusters, AI platforms, Machine Learning frameworks, large-scale HPC or GPU systems"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_5d37a7c7-d2a"},"title":"ML Infrastructure Engineer","description":"<p><strong>About the role</strong></p>\n<p>The ML Infrastructure team at Cursor builds large-scale compute, storage, and software infrastructure to support the company&#39;s work building the world&#39;s best agentic coding model. 
We&#39;re looking for strong engineers who are interested in building high-performance infrastructure and the software to support it. This role works closely with ML researchers and engineers to enable their work through improvements to our training framework, systems reliability/performance, and developer experience.</p>\n<p><strong>What you&#39;ll do</strong></p>\n<ul>\n<li>Collaborate with ML researchers to improve the throughput and reliability of training</li>\n<li>Work with OEMs, cloud service providers, and others to plan and build cutting-edge GPU infrastructure</li>\n<li>Improve the density and scalability of compute environments to enable increasingly large RL workloads</li>\n<li>Create software and systems to automate building, monitoring, and running GPU clusters</li>\n<li>Build workload scheduling and data movement systems to support Cursor&#39;s growing training footprint</li>\n</ul>\n<p><strong>You may be a fit if you have</strong></p>\n<ul>\n<li>A strong background in systems and infrastructure-focused software engineering, particularly in Python, Typescript, Rust, and Golang</li>\n<li>Experience with distributed storage and networking infrastructure, particularly on Linux systems across cloud and bare metal environments</li>\n<li>Exposure to large-scale systems and their unique challenges, ideally across thousands of nodes with significant resource footprints</li>\n</ul>\n<p><strong>Nice to have</strong></p>\n<ul>\n<li>Operational exposure to Nvidia GPUs with Infiniband or RoCE, particularly with Blackwell and Hopper-class hardware</li>\n<li>Exposure to Ray, Slurm, or other common compute and runtime schedulers</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_5d37a7c7-d2a","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Cursor","sameAs":"https://cursor.com","logo":"https://logos.yubhub.co/cursor.com.png"},"x-apply-url":"https://cursor.com/careers/software-engineer-ml-infrastructure","x-work-arrangement":"remote","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Python","Typescript","Rust","Golang","Distributed storage","Networking infrastructure","Linux systems","Kubernetes"],"x-skills-preferred":["Nvidia GPUs","Infiniband","RoCE","Blackwell","Hopper-class hardware","Ray","Slurm"],"datePosted":"2026-03-08T00:17:18.553Z","jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Python, Typescript, Rust, Golang, Distributed storage, Networking infrastructure, Linux systems, Kubernetes, Nvidia GPUs, Infiniband, RoCE, Blackwell, Hopper-class hardware, Ray, Slurm"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_0e50f5ba-8b9"},"title":"Hardware Development Infrastructure Engineer","description":"<p><strong>Hardware Development Infrastructure Engineer</strong></p>\n<p><strong>About the Team:</strong></p>\n<p>OpenAI&#39;s Hardware organization develops silicon and system-level solutions designed for the unique demands of advanced AI workloads. The team is responsible for building the next generation of AI-native silicon while working closely with software and research partners to co-design hardware tightly integrated with AI models. 
In addition to delivering production-grade silicon for OpenAI&#39;s supercomputing infrastructure, the team also creates custom design tools and methodologies that accelerate innovation and enable hardware optimized specifically for AI.</p>\n<p><strong>About the Role</strong></p>\n<p>We&#39;re looking for a Hardware Development Infrastructure Engineer to build and run the infrastructure that powers OpenAI&#39;s hardware development lifecycle. You&#39;ll work closely with hardware teams to translate their workflows into scalable, observable, and automated systems, and then own the platforms that support them over time.</p>\n<p>This role sits at the intersection of hardware, cloud, HPC, DevOps, and data. You&#39;ll design regression systems, CI/CD pipelines, cloud and cluster platforms, and the data foundations that make development efficiency visible and measurable.</p>\n<p><strong>In this role, you will:</strong></p>\n<ul>\n<li>Partner with hardware teams on workflows and tooling: Embed with teams across DV, PD, emulation, formal, and software to understand development flows, identify failure modes, and deliver tooling (CLIs, services, APIs) that reduces manual work and accelerates iteration.</li>\n</ul>\n<ul>\n<li>Build and operate regression systems at scale: Own regressions end-to-end—from definition and scheduling to execution, results ingestion, triage, and reporting—while improving throughput, reproducibility, and flake reduction.</li>\n</ul>\n<ul>\n<li>Own CI/CD for infrastructure and tooling: Design and operate pipelines for infrastructure-as-code, services, images, and cluster configuration changes, including testing, gated deploys, staged rollouts, and safe rollback.</li>\n</ul>\n<ul>\n<li>Run cloud and HPC platforms: Design, provision, and operate cloud infrastructure (Azure preferred) and HPC/HTC clusters (e.g., Slurm), tuning scheduling policies, autoscaling, node lifecycles, and cost-performance tradeoffs.</li>\n</ul>\n<ul>\n<li>Build data foundations 
and visibility: Develop ETL pipelines to ingest metrics, logs, and results; operate databases for workflow metadata and outcomes; and build dashboards that surface efficiency, utilization, and reliability trends.</li>\n</ul>\n<ul>\n<li>Drive operational excellence: Establish monitoring and alerting, lead incident response and postmortems, maintain runbooks, and produce clear, durable documentation.</li>\n</ul>\n<p><strong>You might thrive in this role if you have:</strong></p>\n<ul>\n<li>Familiarity with chip development workflows and at least one deep EDA domain (e.g., DV, PD, emulation, or formal verification).</li>\n</ul>\n<ul>\n<li>Strong infrastructure fundamentals, including cloud platforms, networking, security, performance, and automation.</li>\n</ul>\n<ul>\n<li>Experience operating cloud environments (Azure preferred; AWS, GCP, or OCI acceptable) with strong infrastructure-as-code practices (e.g., Terraform, Bicep; configuration management tools a plus).</li>\n</ul>\n<ul>\n<li>Strong programming skills (Python preferred) and solid software engineering and scripting practices.</li>\n</ul>\n<ul>\n<li>Experience building and operating CI/CD systems (e.g., Jenkins, Buildkite, GitHub Actions), including testing and release workflows.</li>\n</ul>\n<ul>\n<li>Database experience (e.g., Postgres or MySQL), including schema design, migrations, indexing, and operational safety.</li>\n</ul>\n<ul>\n<li>Clear communicator with strong judgment—able to explain tradeoffs, propose pragmatic solutions, and articulate a realistic vision for scalable infrastructure</li>\n</ul>\n<p><strong>Preferred Qualifications</strong></p>\n<ul>\n<li>Experience operating Slurm or other large-scale cluster schedulers.</li>\n</ul>\n<ul>\n<li>Experience with enterprise authentication and directory services (e.g., Entra ID, LDAP, FreeIPA, SSSD).</li>\n</ul>\n<ul>\n<li>Experience building or operating backend and middleware systems</li>\n</ul>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Medical, dental, and vision insurance 
for you and your family, with employer contributions to Health Savings Accounts</li>\n</ul>\n<ul>\n<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>\n</ul>\n<ul>\n<li>401(k) retirement plan with employer match</li>\n</ul>\n<ul>\n<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>\n</ul>\n<ul>\n<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>\n</ul>\n<ul>\n<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>\n</ul>\n<ul>\n<li>Mental health and wellness support</li>\n</ul>\n<ul>\n<li>Employer-paid basic life and disability coverage</li>\n</ul>\n<ul>\n<li>Annual learning and development stipend to fuel your professional growth</li>\n</ul>\n<ul>\n<li>Daily meals in our offices, and meal delivery credits as eligible</li>\n</ul>\n<ul>\n<li>Relocation support for eligible employees</li>\n</ul>\n<ul>\n<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>\n</ul>\n<p><strong>Compensation</strong></p>\n<ul>\n<li>$260K – $335K • Offers Equity</li>\n</ul>\n<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. 
If the role is non-exempt, overtime pay will be provided consistent with applicable laws.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_0e50f5ba-8b9","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/f2908f94-93a9-476b-ac83-b03392ae827d","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$260K – $335K • Offers Equity","x-skills-required":["chip development workflows","EDA domain","cloud platforms","networking","security","performance","automation","cloud environments","infrastructure-as-code","configuration management tools","programming skills","software engineering","scripting practices","CI/CD systems","testing","release workflows","database experience","schema design","migrations","indexing","operational safety"],"x-skills-preferred":["Slurm","enterprise authentication","directory services","backend and middleware systems"],"datePosted":"2026-03-06T18:28:58.829Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"chip development workflows, EDA domain, cloud platforms, networking, security, performance, automation, cloud environments, infrastructure-as-code, configuration management tools, programming skills, software engineering, scripting practices, CI/CD systems, testing, release workflows, database experience, schema design, migrations, indexing, operational safety, Slurm, enterprise authentication, directory services, backend and middleware 
systems","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":260000,"maxValue":335000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_148ddf8d-fe9"},"title":"IT Director - Infrastructure Engineering","description":"<p>We are seeking an experienced IT Director to lead our Infrastructure Engineering team. As a seasoned IT leader, you will be responsible for developing and executing strategic IT infrastructure plans, policies, and procedures that align with Synopsys&#39; global and regional objectives.</p>\n<p><strong>What you&#39;ll do</strong></p>\n<ul>\n<li>Developing and executing strategic IT infrastructure plans, policies, and procedures that align with Synopsys&#39; global and regional objectives.</li>\n<li>Overseeing the design, implementation, and maintenance of HPC Engineering infrastructure, including Compute, Citrix, Storage, Networks, and Data Centers to ensure seamless operations.</li>\n</ul>\n<p><strong>What you need</strong></p>\n<ul>\n<li>Extensive experience (20+ years) in managing large-scale, 24x7 IT infrastructure delivery programs for global organizations.</li>\n<li>Deep technical expertise in HPC Engineering infrastructure—Compute, Citrix, Storage, Networks, and Data Centers.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_148ddf8d-fe9","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Synopsys","sameAs":"https://careers.synopsys.com","logo":"https://logos.yubhub.co/careers.synopsys.com.png"},"x-apply-url":"https://careers.synopsys.com/job/bengaluru/it-director-infrastructure-engineering/44408/92296852016","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"employee","x-salary-range":null,"x-skills-required":["IT infrastructure 
management","HPC Engineering infrastructure","IT operations","governance","compliance","service management"],"x-skills-preferred":["job scheduling and queuing systems","LSF","SLURM"],"datePosted":"2026-03-06T07:20:45.459Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Bengaluru"}},"occupationalCategory":"Information Technology","industry":"Technology","skills":"IT infrastructure management, HPC Engineering infrastructure, IT operations, governance, compliance, service management, job scheduling and queuing systems, LSF, SLURM"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_e8e799bf-c32"},"title":"AI Infra Engineer","description":"<p>We are looking for an AI Infra engineer to join our growing team. We work with Kubernetes, Slurm, Python, C++, PyTorch, and primarily on AWS. As an AI Infrastructure Engineer, you will be partnering closely with our Inference and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters</p>\n<p><strong>What you&#39;ll do</strong></p>\n<ul>\n<li>Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads</li>\n<li>Manage and optimize Slurm-based HPC environments for distributed training of large language models</li>\n</ul>\n<p><strong>What you need</strong></p>\n<ul>\n<li>Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management</li>\n<li>Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a 
href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_e8e799bf-c32","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Perplexity","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/perplexity.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/perplexity/598e1f7d-b802-4de2-99ac-90eb2bc33315","x-work-arrangement":"onsite","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":"$220K – $405K","x-skills-required":["Kubernetes administration","Slurm workload management"],"x-skills-preferred":["Kubernetes operators","Slurm administration"],"datePosted":"2026-03-04T12:24:29.750Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, Palo Alto"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Kubernetes administration, Slurm workload management, Kubernetes operators, Slurm administration","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":220000,"maxValue":405000,"unitText":"YEAR"}}}]}