{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/clusters"},"x-facet":{"type":"skill","slug":"clusters","display":"Clusters","count":45},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_b960b2cd-63f"},"title":"Senior Datacenter Technical Program Manager, At-Scale AI Clusters","description":"<p>We are looking for a highly-motivated Technical Program Manager (TPM) to join our Applied Systems Engineering Team to drive datacenter integration for the next generation of NVIDIA AI supercomputing systems.</p>\n<p>This TPM will play a crucial role throughout the lifecycle of the latest AI systems at scale, from datacenter design and requirements definition, through systems integration of AI clusters into the datacenter environment, and support for these systems as they enter production.</p>\n<p>The successful candidate will collaborate with outstanding engineers and architects to build and deploy large-scale GPU computing systems based on NVIDIA&#39;s reference supercomputing architectures.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Collaborating with engineering leaders across multiple hardware and software teams to build AI supercomputers for NVIDIA engineers and develop reference architectures to advise customers and partners.</li>\n</ul>\n<ul>\n<li>Leading the integration of new AI clusters with datacenter facilities with demanding requirements on power, cooling, and 
instrumentation.</li>\n</ul>\n<ul>\n<li>Coordinating design and fit-out of new datacenter builds, working with both internal engineering teams and external contractors.</li>\n</ul>\n<ul>\n<li>Owning and producing detailed documentation for the end-to-end process for datacenter fit-out and integration.</li>\n</ul>\n<ul>\n<li>Communicating internally with engineering leadership to prioritize and address key issues essential to the success of our largest customers.</li>\n</ul>\n<p>We are looking for a TPM with a strong background in high-performance computing systems and GPU clusters deployed in on-premises datacenters.</p>\n<ul>\n<li>BS in Applied Science or Engineering (or equivalent experience)</li>\n</ul>\n<ul>\n<li>8+ years of overall experience</li>\n</ul>\n<ul>\n<li>Experience with high-performance computing systems and GPU clusters deployed in on-premises datacenters</li>\n</ul>\n<ul>\n<li>A passion for understanding challenging technical problems and driving the process of finding a solution</li>\n</ul>\n<ul>\n<li>Strong teamwork and interpersonal skills, to facilitate building a collaborative workflow for coordination between many teams</li>\n</ul>\n<ul>\n<li>Understanding of datacenter design, including familiarity with power and cooling technologies</li>\n</ul>\n<ul>\n<li>Expertise in system monitoring and instrumentation of large clusters, using technologies such as Prometheus, Grafana, Splunk, Modbus, and BACNet</li>\n</ul>\n<ul>\n<li>Experience working with the engineering or academic research community supporting high-performance computing or deep learning</li>\n</ul>\n<p>You will also be eligible for equity and benefits.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a 
href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_b960b2cd-63f","directApply":true,"hiringOrganization":{"@type":"Organization","name":"NVIDIA","sameAs":"https://www.nvidia.com/","logo":"https://logos.yubhub.co/nvidia.com.png"},"x-apply-url":"https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Datacenter-Technical-Program-Manager_JR2011480","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["high-performance computing systems","GPU clusters","datacenter design","power and cooling technologies","system monitoring and instrumentation","Prometheus","Grafana","Splunk","Modbus","BACNet"],"x-skills-preferred":[],"datePosted":"2026-04-25T12:09:09.325Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Santa Clara"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"high-performance computing systems, GPU clusters, datacenter design, power and cooling technologies, system monitoring and instrumentation, Prometheus, Grafana, Splunk, Modbus, BACNet"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_9c3667a3-140"},"title":"Token-as-a-Service Technical Program Manager","description":"<p><strong>Compensation</strong></p>\n<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. 
In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>\n<ul>\n<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>\n</ul>\n<ul>\n<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>\n</ul>\n<ul>\n<li>401(k) retirement plan with employer match</li>\n</ul>\n<ul>\n<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>\n</ul>\n<ul>\n<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>\n</ul>\n<ul>\n<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>\n</ul>\n<ul>\n<li>Mental health and wellness support</li>\n</ul>\n<ul>\n<li>Employer-paid basic life and disability coverage</li>\n</ul>\n<ul>\n<li>Annual learning and development stipend to fuel your professional growth</li>\n</ul>\n<ul>\n<li>Daily meals in our offices, and meal delivery credits as eligible</li>\n</ul>\n<ul>\n<li>Relocation support for eligible employees</li>\n</ul>\n<ul>\n<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>\n</ul>\n<p><strong>About the Team</strong></p>\n<p>OpenAI&#39;s Stargate and 3P Engineering teams are responsible for building and scaling the external infrastructure ecosystem that powers advanced AI systems. 
We work across hyperscalers, colocation providers, cloud partners, and strategic third-party operators to turn contracted capacity into production-ready compute.</p>\n<p><strong>About the Role</strong></p>\n<p>We are seeking a Technical Program Manager, Token-as-a-Service (TaaS) to lead delivery of external compute capacity that directly serves OpenAI model workloads.</p>\n<p>In this role, you will own complex cross-functional programs that transform third-party infrastructure into usable tokens at scale. You will partner across engineering, capacity planning, networking, hardware, finance, product, and external providers to ensure that deployed capacity translates into real production throughput.</p>\n<p><strong>Key Responsibilities</strong></p>\n<ul>\n<li>Lead end-to-end delivery programs that convert external infrastructure capacity into production-ready token supply.</li>\n</ul>\n<ul>\n<li>Own readiness across compute, storage, networking, security, and operational dependencies for third-party environments.</li>\n</ul>\n<ul>\n<li>Build integrated plans across internal engineering teams and external partners with clear milestones, owners, risks, and critical paths.</li>\n</ul>\n<ul>\n<li>Drive launch execution for new partner regions, clusters, and capacity expansions.</li>\n</ul>\n<ul>\n<li>Create operating mechanisms that measure deployed capacity versus usable token output.</li>\n</ul>\n<ul>\n<li>Identify bottlenecks preventing token generation (network constraints, hardware readiness, software enablement, partner delays, etc.) 
and drive resolution.</li>\n</ul>\n<ul>\n<li>Coordinate with capacity planning and finance teams to prioritize the highest ROI capacity opportunities.</li>\n</ul>\n<ul>\n<li>Establish executive-level reporting on delivery status, risks, and token ramp forecasts.</li>\n</ul>\n<ul>\n<li>Improve repeatability of partner onboarding, technical integration, and scaling motions.</li>\n</ul>\n<ul>\n<li>Manage escalations across internal and external stakeholders during high-severity delivery issues.</li>\n</ul>\n<ul>\n<li>Translate ambiguous infrastructure constraints into clear execution plans.</li>\n</ul>\n<ul>\n<li>Help define the long-term operating model for Token-as-a-Service across Stargate and 3P ecosystems.</li>\n</ul>\n<p><strong>Qualifications</strong></p>\n<ul>\n<li>8+ years of Technical Program Management, Engineering Program Management, or Infrastructure Delivery experience.</li>\n</ul>\n<ul>\n<li>Experience leading large-scale technical programs involving cloud, data center, networking, hardware, or distributed systems.</li>\n</ul>\n<ul>\n<li>Strong understanding of compute infrastructure, clusters, networking, storage, and production systems.</li>\n</ul>\n<ul>\n<li>Proven ability to drive cross-functional execution across engineering, operations, finance, and external vendors.</li>\n</ul>\n<ul>\n<li>Experience managing executive stakeholders and communicating complex tradeoffs clearly.</li>\n</ul>\n<ul>\n<li>Strong analytical skills with ability to reason about utilization, throughput, capacity, and operational metrics.</li>\n</ul>\n<ul>\n<li>Comfortable operating in ambiguous, fast-scaling environments.</li>\n</ul>\n<ul>\n<li>Strong written and verbal communication skills.</li>\n</ul>\n<ul>\n<li>High ownership mentality with bias toward action.</li>\n</ul>\n<ul>\n<li>Experience working with external providers, strategic partners, or hyperscalers is highly preferred.</li>\n</ul>\n<p><strong>Preferred Skills</strong></p>\n<ul>\n<li>Experience with GPU 
clusters, AI infrastructure, or large-scale model serving environments.</li>\n</ul>\n<ul>\n<li>Familiarity with token economics, inference capacity planning, or workload scheduling.</li>\n</ul>\n<ul>\n<li>Experience scaling global infrastructure through third-party providers.</li>\n</ul>\n<ul>\n<li>Background in systems engineering, networking, or hardware deployment programs.</li>\n</ul>\n<ul>\n<li>Experience building new operational models in high-growth environments.</li>\n</ul>","url":"https://yubhub.co/jobs/job_9c3667a3-140","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://openai.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/e8558280-69dc-438a-b905-623f75ae6d62","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"Full time","x-salary-range":"$342K – $555K","x-skills-required":["Technical Program Management","Engineering Program Management","Infrastructure Delivery","Cloud","Data Center","Networking","Hardware","Distributed Systems","Compute Infrastructure","Clusters","Storage","Production Systems"],"x-skills-preferred":["GPU Clusters","AI Infrastructure","Large-Scale Model Serving Environments","Token Economics","Inference Capacity Planning","Workload Scheduling","Global Infrastructure","Systems Engineering","Hardware Deployment Programs"],"datePosted":"2026-04-24T12:23:54.161Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco; Seattle"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Technical Program Management, Engineering Program Management, Infrastructure Delivery, Cloud, Data Center, Networking, Hardware, Distributed Systems, Compute Infrastructure, Clusters, Storage, Production Systems, GPU 
Clusters, AI Infrastructure, Large-Scale Model Serving Environments, Token Economics, Inference Capacity Planning, Workload Scheduling, Global Infrastructure, Systems Engineering, Hardware Deployment Programs","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":342000,"maxValue":555000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_38c10a5f-35e"},"title":"CPU/Storage/PoP-WAN Program Manager","description":"<p>We are seeking a highly technical Program Manager to lead execution across CPU, Storage, PoP, and WAN infrastructure programs that directly unlock OpenAI&#39;s next generation compute capacity.</p>\n<p>In this role, you will own complex cross-functional programs spanning compute cluster activation, storage deployment, PoP bring-up, and backbone expansion. You will coordinate hardware readiness, site readiness, network pathing, storage availability, vendor execution, and engineering dependencies required to turn contracted infrastructure into live training and inference capacity.</p>\n<p>This role requires strong technical fluency across hardware systems, network infrastructure, storage architecture, and deployment execution. 
You should be comfortable operating from rack-level implementation details through executive-level capacity planning discussions.</p>\n<p>Key Responsibilities:</p>\n<ul>\n<li>Lead end-to-end execution of CPU / GPU cluster activation programs across OpenAI&#39;s global infrastructure footprint</li>\n<li>Drive readiness to convert contracted compute capacity into schedulable production clusters</li>\n<li>Own deployment programs for new PoPs, backbone nodes, WAN expansion, and interconnection initiatives</li>\n<li>Build integrated schedules spanning procurement, logistics, installation, storage readiness, network turn-up, testing, and production handoff</li>\n<li>Coordinate BOM readiness, server delivery, racks, optics, cabling, storage hardware, and vendor milestones</li>\n<li>Partner with engineering teams to align compute, storage, and networking dependencies before cluster activation</li>\n<li>Manage deployment of storage systems supporting training and inference workloads, including readiness, validation, performance checks, and scaling plans</li>\n<li>Coordinate backbone capacity expansion, cross-connects, inter-region pathing, and cloud interconnect readiness with Azure and third-party providers</li>\n<li>Lead physical deployment execution including rack-and-stack, hardware bring-up, L1 validation, and site acceptance criteria</li>\n<li>Build repeatable deployment playbooks, dashboards, governance cadences, and operating mechanisms for scale</li>\n<li>Identify risks early across supply chain, site readiness, technical constraints, and vendor execution, then drive mitigation plans</li>\n<li>Communicate milestones, escalations, and capacity forecasts to senior leadership</li>\n</ul>\n<p>Qualifications:</p>\n<ul>\n<li>8+ years of experience in technical program management, infrastructure deployment, network deployment, or data center operations</li>\n<li>Strong experience delivering programs involving compute, storage, networking, or large-scale infrastructure 
systems</li>\n<li>Working knowledge of servers, clusters, storage arrays, routers, switches, optics, and structured cabling</li>\n<li>Experience owning cross-functional programs across engineering, operations, supply chain, and external vendors</li>\n<li>Strong understanding of deployment lifecycles from planning and procurement through production handoff</li>\n<li>Ability to reason across physical infrastructure execution and logical systems architecture dependencies</li>\n<li>Proven ability to build integrated schedules and drive accountability across multiple stakeholders</li>\n<li>Strong executive communication skills with experience managing critical escalations and leadership updates</li>\n<li>Comfortable operating in fast-moving environments with aggressive timelines and evolving priorities</li>\n<li>Highly analytical with strong problem-solving and execution instincts</li>\n</ul>\n<p>Preferred Skills:</p>\n<ul>\n<li>Experience at a hyperscaler, cloud provider, AI infrastructure company, or global network operator</li>\n<li>Experience deploying GPU clusters, HPC systems, or large training environments</li>\n<li>Familiarity with distributed storage systems and high-performance data infrastructure</li>\n<li>Experience with PoP deployments, WAN backbone expansion, or global network buildouts</li>\n<li>Experience working across first-party, colo, and cloud environments</li>\n<li>Experience building repeatable infrastructure deployment systems in high-growth environments</li>\n</ul>\n<p>About OpenAI:</p>\n<p>OpenAI is an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.</p>
","url":"https://yubhub.co/jobs/job_38c10a5f-35e","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://openai.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/667c09e2-6efc-45dc-9714-078bedf17343","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"Full time","x-salary-range":"$342K – $555K","x-skills-required":["technical program management","infrastructure deployment","network deployment","data center operations","compute, storage, networking, or large-scale infrastructure systems","servers, clusters, storage arrays, routers, switches, optics, and structured cabling","cross-functional programs across engineering, operations, supply chain, and external vendors","deployment lifecycles from planning and procurement through production handoff","physical infrastructure execution and logical systems architecture dependencies","integrated schedules and drive accountability across multiple stakeholders","executive communication skills with experience managing critical escalations and leadership updates"],"x-skills-preferred":["hyperscaler, cloud provider, AI infrastructure company, or global network operator","deploying GPU clusters, HPC systems, or large training environments","distributed storage systems and high-performance data infrastructure","PoP deployments, WAN backbone expansion, or global network buildouts","first-party, colo, and cloud environments","repeatable infrastructure deployment systems in high-growth environments"],"datePosted":"2026-04-24T12:23:53.931Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco; Seattle"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"technical program management, infrastructure deployment, network deployment, data center operations, compute, storage, networking, or 
large-scale infrastructure systems, servers, clusters, storage arrays, routers, switches, optics, and structured cabling, cross-functional programs across engineering, operations, supply chain, and external vendors, deployment lifecycles from planning and procurement through production handoff, physical infrastructure execution and logical systems architecture dependencies, integrated schedules and drive accountability across multiple stakeholders, executive communication skills with experience managing critical escalations and leadership updates, hyperscaler, cloud provider, AI infrastructure company, or global network operator, deploying GPU clusters, HPC systems, or large training environments, distributed storage systems and high-performance data infrastructure, PoP deployments, WAN backbone expansion, or global network buildouts, first-party, colo, and cloud environments, repeatable infrastructure deployment systems in high-growth environments","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":342000,"maxValue":555000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_e179812d-e4c"},"title":"Technical Program Manager, Compute Infrastructure","description":"<p>We&#39;re seeking a Technical Program Manager for Compute Infrastructure to join our engineer-first TPM team. As a Technical Program Manager, you will own the end-to-end delivery of large-scale GPU clusters, partnering with engineers to bring clusters online across external providers and partners. 
You&#39;ll run a broad, parallel portfolio spanning hardware, networking, power, and cooling, driving execution, risk management, and crisp alignment from working teams through leadership to deliver production-ready capacity at scale.</p>\n<p>In this role, you will:</p>\n<ul>\n<li>Lead end-to-end delivery of both New Compute SKUs and large-scale GPU clusters across an external partner ecosystem while supporting capacity planning for training and inference.</li>\n<li>Drive multi-threaded bring-up programs spanning hardware, networking, power, and cooling, owning plans, dependencies, and critical paths.</li>\n<li>Interface with chip providers to derisk long-term onboarding to new hardware platforms by working across kernels, comms, hardware, and scheduling engineering teams.</li>\n<li>Build and operationalize program mechanisms (roadmaps, milestones, risk registers, runbooks) that make delivery predictable at massive scale.</li>\n<li>Partner with engineering to improve cluster turn-up reliability, repeatability, and automation, reducing time-to-serve for new capacity.</li>\n<li>Support network operations and end-to-end physical and logical bring-up of OpenAI network Points-of-Presence (PoPs), including on-site deployment, rack cabling, and close collaboration with engineering teams.</li>\n<li>Coordinate cross-functional readiness (security, finance, operations, product/research stakeholders) to ship production-ready compute.</li>\n<li>Manage integration and handoffs across teams and partners, ensuring consistent execution, clear communication, and fast issue resolution.</li>\n<li>Identify bottlenecks and systemic gaps, then drive durable fixes across tooling, process, and partner interfaces.</li>\n<li>Provide crisp executive visibility on progress, tradeoffs, and risks across a large portfolio of concurrent programs.</li>\n</ul>\n<p>You might thrive in this role if you:</p>\n<ul>\n<li>Possess a degree in a hard science, or have a demonstrated 
track record of engineering expertise.</li>\n<li>Have 5+ years of experience in program management for major projects including capital projects or hyperscaler infrastructure deployment.</li>\n<li>Demonstrate the ability to serve as the go-to person solely responsible for driving and delivering complex projects.</li>\n<li>Are comfortable managing cross-functional and cross-company teams; experience driving information and decision hygiene.</li>\n<li>Have an extensive track record of successfully delivering high-profile, technical projects against tight deadlines.</li>\n<li>Are technically adept and have effectively partnered with engineering or fundamental research teams of the highest caliber.</li>\n<li>Have experience interfacing with and leading external vendors including engineering firms, equipment suppliers, and/or construction firms.</li>\n<li>Have expertise in designing and implementing simple, scalable processes that solve complex problems.</li>\n<li>Have experience managing complicated dependencies such as logistics and/or supply chains.</li>\n<li>Are relentlessly resourceful and thrive in ambiguous, fast-paced environments.</li>\n<li>Are interested in and thoughtful about the impacts of AGI.</li>\n</ul>","url":"https://yubhub.co/jobs/job_e179812d-e4c","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://openai.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/8fb1615c-34bf-47c4-a1d1-b7b2f836bbd3","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"Full time","x-salary-range":"$257K – $335K","x-skills-required":["Program Management","Compute Infrastructure","GPU Clusters","Hardware","Networking","Power","Cooling","Risk Management","Cross-Functional Teams","Engineering","External Vendors","Supply Chain 
Management"],"x-skills-preferred":[],"datePosted":"2026-04-24T12:23:45.036Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Program Management, Compute Infrastructure, GPU Clusters, Hardware, Networking, Power, Cooling, Risk Management, Cross-Functional Teams, Engineering, External Vendors, Supply Chain Management","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":257000,"maxValue":335000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_736969e6-3f9"},"title":"CPU Storage Tech Lead","description":"<p><strong>Compensation</strong></p>\n<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. 
In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>\n<ul>\n<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>\n</ul>\n<ul>\n<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>\n</ul>\n<ul>\n<li>401(k) retirement plan with employer match</li>\n</ul>\n<ul>\n<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>\n</ul>\n<ul>\n<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>\n</ul>\n<ul>\n<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>\n</ul>\n<ul>\n<li>Mental health and wellness support</li>\n</ul>\n<ul>\n<li>Employer-paid basic life and disability coverage</li>\n</ul>\n<ul>\n<li>Annual learning and development stipend to fuel your professional growth</li>\n</ul>\n<ul>\n<li>Daily meals in our offices, and meal delivery credits as eligible</li>\n</ul>\n<ul>\n<li>Relocation support for eligible employees</li>\n</ul>\n<ul>\n<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>\n</ul>\n<p><strong>About the Team</strong></p>\n<p>The Stargate team is responsible for building the physical infrastructure that powers large-scale AI systems. 
We design and deliver next-generation data centers optimized for dense compute clusters, advanced networking, and rapidly evolving hardware platforms.</p>\n<p><strong>About the Role</strong></p>\n<p>We are seeking a CPU &amp; Storage Technical Lead to define and drive the server compute and storage architecture strategy for Stargate infrastructure.</p>\n<p>In this role, you will own technical direction across CPU platforms, memory configurations, local and disaggregated storage systems, and their integration into large-scale AI clusters. You will evaluate vendor roadmaps, lead platform tradeoff decisions, and ensure compute and storage systems are optimized for training, inference, and supporting services.</p>\n<p><strong>Key Responsibilities</strong></p>\n<ul>\n<li>Own CPU and storage technical strategy for Stargate compute infrastructure across current and future generations.</li>\n</ul>\n<ul>\n<li>Evaluate CPU platforms across performance, efficiency, memory bandwidth, PCIe topology, cost, and roadmap alignment.</li>\n</ul>\n<ul>\n<li>Define storage architectures for AI environments, including boot media, local NVMe, shared storage, caching tiers, metadata services, and high-performance data pipelines.</li>\n</ul>\n<ul>\n<li>Drive server platform decisions involving CPU, memory, NIC, GPU, and storage subsystem integration.</li>\n</ul>\n<ul>\n<li>Partner with performance modeling teams to quantify tradeoffs across compute, memory, I/O, and storage bottlenecks.</li>\n</ul>\n<ul>\n<li>Work with silicon and hardware vendors on roadmap influence, feature requests, qualification plans, and technical escalations.</li>\n</ul>\n<ul>\n<li>Lead bring-up and validation efforts for new CPU and storage platforms in lab and production environments.</li>\n</ul>\n<ul>\n<li>Partner with networking and cluster architecture teams to optimize end-to-end node design and data movement.</li>\n</ul>\n<ul>\n<li>Support supply chain and sourcing teams with technical vendor assessments and 
second-source strategies.</li>\n</ul>\n<ul>\n<li>Drive reliability, serviceability, and fleet lifecycle planning for compute and storage platforms.</li>\n</ul>\n<ul>\n<li>Translate future AI workload requirements into infrastructure platform specifications.</li>\n</ul>\n<ul>\n<li>Provide technical leadership across cross-functional stakeholders and executive reviews.</li>\n</ul>\n<p><strong>Qualifications</strong></p>\n<ul>\n<li>Bachelor’s degree in Computer Engineering, Electrical Engineering, Computer Science, or related technical field; advanced degree preferred.</li>\n</ul>\n<ul>\n<li>10+ years of experience in server hardware, systems architecture, data center infrastructure, or hyperscale compute platforms.</li>\n</ul>\n<ul>\n<li>Deep expertise in modern CPU architectures (x86, ARM, accelerator host systems) and server platform design.</li>\n</ul>\n<ul>\n<li>Strong understanding of memory systems, PCIe/CXL fabrics, NUMA behavior, and platform-level performance constraints.</li>\n</ul>\n<ul>\n<li>Experience with storage systems including NVMe, SSD qualification, RAID, distributed storage, object/file systems, or high-performance data pipelines.</li>\n</ul>\n<ul>\n<li>Experience evaluating hardware tradeoffs across performance, cost, power, thermals, and supply availability.</li>\n</ul>\n<ul>\n<li>Familiarity with GPU clusters and AI training/inference infrastructure strongly preferred.</li>\n</ul>\n<ul>\n<li>Experience working directly with OEMs, ODMs, silicon vendors, or storage vendors.</li>\n</ul>\n<ul>\n<li>Strong systems thinking with ability to connect component decisions to fleet-level outcomes.</li>\n</ul>\n<ul>\n<li>Excellent communication skills with the ability to influence engineering and executive stakeholders.</li>\n</ul>\n<ul>\n<li>Proven ability to operate in fast-moving, ambiguous environments with high ownership.</li>\n</ul>\n<p><strong>Preferred Skills</strong></p>\n<ul>\n<li>Experience designing infrastructure for large-scale AI or HPC 
environments.</li>\n</ul>\n<ul>\n<li>Familiarity with CPU vendor roadmaps across AMD, Intel, and ARM ecosystems.</li>\n</ul>\n<ul>\n<li>Experience with distributed storage architectures supporting GPU clusters.</li>\n</ul>\n<ul>\n<li>Knowledge of fleet operations, hardware lifecycle management, and production deployments at scale.</li>\n</ul>\n<ul>\n<li>Prior experience in hyperscale cloud, AI infrastructure, or advanced compute environments.</li>\n</ul>\n<p><strong>About OpenAI</strong></p>\n<p>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.</p>\n<p>We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.</p>\n<p>For additional information, please see <a href=\"https://cdn.openai.com/policies/eeo-policy-statement.pdf\">OpenAI’s Affirmative Action and Equal Employment Opportunity Policy Statement</a>.</p>\n<p>Background checks for applicants will be administered in accordance with applicable law, and qualified applicants with arrest or conviction records will be considered for employment consistent with those laws, including the San Francisco Fair Chance Ordinance, the Los Angeles County</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a 
href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_736969e6-3f9","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://openai.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/18a60850-cf8b-4374-a214-ef78b9712deb","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"Full time","x-salary-range":"$342K – $555K","x-skills-required":["server hardware","systems architecture","data center infrastructure","hyperscale compute platforms","modern CPU architectures","server platform design","memory systems","PCIe/CXL fabrics","NUMA behavior","platform-level performance constraints","storage systems","NVMe","SSD qualification","RAID","distributed storage","object/file systems","high-performance data pipelines","hardware tradeoffs","performance","cost","power","thermals","supply availability","GPU clusters","AI training/inference infrastructure","OEMs","ODMs","silicon vendors","storage vendors","strong systems thinking","component decisions","fleet-level outcomes","excellent communication skills","influence engineering and executive stakeholders","fast-moving","ambiguous environments","high ownership"],"x-skills-preferred":["infrastructure for large-scale AI or HPC environments","CPU vendor roadmaps across AMD, Intel, and ARM ecosystems","distributed storage architectures supporting GPU clusters","fleet operations","hardware lifecycle management","production deployments at scale","hyperscale cloud","AI infrastructure","advanced compute environments"],"datePosted":"2026-04-24T12:21:17.145Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco; Seattle"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"server hardware, systems architecture, data center infrastructure, hyperscale compute platforms, modern CPU architectures, server platform 
design, memory systems, PCIe/CXL fabrics, NUMA behavior, platform-level performance constraints, storage systems, NVMe, SSD qualification, RAID, distributed storage, object/file systems, high-performance data pipelines, hardware tradeoffs, performance, cost, power, thermals, supply availability, GPU clusters, AI training/inference infrastructure, OEMs, ODMs, silicon vendors, storage vendors, strong systems thinking, component decisions, fleet-level outcomes, excellent communication skills, influence engineering and executive stakeholders, fast-moving, ambiguous environments, high ownership, infrastructure for large-scale AI or HPC environments, CPU vendor roadmaps across AMD, Intel, and ARM ecosystems, distributed storage architectures supporting GPU clusters, fleet operations, hardware lifecycle management, production deployments at scale, hyperscale cloud, AI infrastructure, advanced compute environments","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":342000,"maxValue":555000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_a88f73e0-fbc"},"title":"Systems Engineer (Network / Storage / Systems)","description":"<p>We are seeking a Systems Engineer (Network / Storage / Systems) to help architect, validate, and operationalize the core infrastructure systems that enable Stargate deployments.</p>\n<p>In this role, you will work across networking, storage, system bring-up, hardware debugging, and cluster readiness. 
You will partner closely with hardware engineering, cluster software, infrastructure operations, and external vendors to ensure new systems are deployed efficiently and run reliably at scale.</p>\n<p>This role is ideal for engineers who can operate across hardware and software boundaries, solve ambiguous technical problems, and drive complex systems into production.</p>\n<p>Key Responsibilities:</p>\n<ul>\n<li>Own system engineering workstreams across one or more critical domains including networking, storage, system validation, or bring-up.</li>\n<li>Design and improve top-of-network architectures spanning frontend, WAN, OOB, firewall, and adjacent infrastructure layers.</li>\n<li>Drive logical network readiness including routing, configuration management, provisioning, and issue resolution.</li>\n<li>Define storage architectures across in-rack, in-pod, cluster, and cloud tiers with focus on performance, lifecycle, and cost efficiency.</li>\n<li>Evaluate vendor hardware and infrastructure proposals, providing technical feedback on architecture, reliability, and operational fit.</li>\n<li>Lead system bring-up for new hardware platforms including imaging, provisioning, validation, and readiness for production deployment.</li>\n<li>Debug complex system faults across firmware, NIC, GPU, server, and platform layers; drive root cause analysis with internal teams and external vendors.</li>\n<li>Build tools and automation that improve lab operations, SKU onboarding, fleet readiness, and deployment velocity.</li>\n<li>Partner with hardware, clusters, and operations teams to translate new compute platforms into stable production environments.</li>\n</ul>\n<p>Qualifications:</p>\n<ul>\n<li>7+ years of experience in systems engineering, infrastructure engineering, hardware platforms, or large-scale compute environments.</li>\n<li>Strong technical depth in one or more areas: networking, storage systems, server platforms, firmware, Linux systems, or distributed 
infrastructure.</li>\n<li>Experience bringing up new hardware systems or clusters in lab or production environments.</li>\n<li>Experience debugging low-level hardware/software issues and driving cross-functional RCA efforts.</li>\n<li>Familiarity with hyperscale infrastructure, AI clusters, HPC environments, or data center systems.</li>\n<li>Experience working with OEM, ODM, JDM, or hardware vendors.</li>\n<li>Strong scripting or software skills in Python, Go, Bash, or similar.</li>\n<li>Ability to operate effectively in fast-moving environments with high ownership and evolving technical requirements.</li>\n</ul>\n<p>Preferred Skills:</p>\n<ul>\n<li>Experience supporting GPU clusters or accelerator-based infrastructure at scale.</li>\n<li>Familiarity with cluster management, provisioning, or fleet lifecycle tooling.</li>\n<li>Experience with network automation, storage optimization, or systems observability.</li>\n<li>Background working across both hardware and software engineering organizations.</li>\n<li>Experience scaling greenfield infrastructure deployments or rapid expansion programs.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_a88f73e0-fbc","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://openai.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/695300fd-1332-4eff-ba1d-d87bb1691f73","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"Full time","x-salary-range":"$335K – $455K","x-skills-required":["networking","storage systems","server platforms","firmware","Linux systems","distributed infrastructure","OEM","ODM","JDM","hardware vendors","Python","Go","Bash"],"x-skills-preferred":["GPU clusters","cluster management","provisioning","fleet lifecycle tooling","network automation","storage 
optimization","systems observability"],"datePosted":"2026-04-24T12:20:32.700Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"networking, storage systems, server platforms, firmware, Linux systems, distributed infrastructure, OEM, ODM, JDM, hardware vendors, Python, Go, Bash, GPU clusters, cluster management, provisioning, fleet lifecycle tooling, network automation, storage optimization, systems observability","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":335000,"maxValue":455000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_6ca1bab3-645"},"title":"Research Engineer — Reinforcement Learning","description":"<p>You&#39;ll bring reinforcement learning to Firecrawl&#39;s core product: building the training infrastructure, reward pipelines, and fine-tuning systems that make our models meaningfully better at extracting, understanding, and structuring web data.</p>\n<p>This isn&#39;t theoretical RL research. You&#39;ll build your own training infra, run fast experiments, ship models to production, and bridge the gap between classical RL approaches and modern LLM agent systems. If you care as much about training throughput as you do about reward design, this is the role.</p>\n<p><strong>Salary Range:</strong> $180,000–$290,000/year (Range shown is for U.S.-based employees. Compensation outside the U.S. 
is adjusted fairly based on your country&#39;s cost of living.)</p>\n<p><strong>Equity Range:</strong> Up to 0.15%</p>\n<p><strong>Location:</strong> San Francisco, CA or Remote (Americas, UTC-3 to UTC-10)</p>\n<p><strong>Job Type:</strong> Full-Time</p>\n<p><strong>Experience:</strong> 3+ years in applied RL, ML engineering, or model training, with production systems</p>\n<p><strong>Visa:</strong> US Citizenship/Visa required for SF; N/A for Remote</p>\n<p><strong>Build training infrastructure and reward pipelines from scratch.</strong> Design and operate the systems that train and evaluate Firecrawl&#39;s models. You&#39;ll own the full loop: data collection, reward modeling, training runs, evaluation, and deployment. You build the infra yourself because you&#39;re the one who needs it to work.</p>\n<p><strong>Fine-tune models to achieve state-of-the-art results.</strong> Take foundation models and make them dramatically better at web data extraction, content understanding, and structured output generation. You know how to get from &#39;decent fine-tune&#39; to &#39;best-in-class&#39; and you have the patience and rigor to close that gap.</p>\n<p><strong>Bridge LLM agents and classical RL.</strong> The most interesting problems at Firecrawl sit at the intersection of modern LLM-based agents and classical RL techniques. You&#39;ll design reward signals for agent behaviors, apply RL methods to improve multi-step agent workflows, and figure out where traditional RL approaches outperform prompting, and vice versa.</p>\n<p><strong>Run fast experiments and iterate.</strong> You design experiments that test meaningful hypotheses, run them quickly, and make decisions based on results. You don&#39;t spend weeks on experiment infrastructure before getting a single result. Speed of iteration is a core part of how you work.</p>\n<p><strong>Communicate clearly to non-RL people.</strong> RL can be opaque. 
You translate your work into language that engineers, product people, and leadership can understand and act on. You know how to explain why a reward function matters without requiring everyone to read the paper.</p>\n<p><strong>Collaborate closely with the team.</strong> Work directly with the Search/IR-focused Research Engineer and the engineering team to connect RL improvements with search, ranking, and the broader product roadmap.</p>\n<p><strong>Builds their own training infra and reward pipelines.</strong> You don&#39;t wait for an ML platform team to set things up. You build the training loops, reward models, data pipelines, and evaluation frameworks yourself, because you understand that infra choices directly affect the quality of results. You&#39;ve operated GPU clusters, managed training runs, and debugged convergence issues in production.</p>\n<p><strong>Can fine-tune models to SOTA.</strong> You&#39;ve taken models from baseline to best-in-class on tasks that matter. You understand the full fine-tuning lifecycle (data curation, training dynamics, hyperparameter sensitivity, evaluation methodology), and you have the taste to know when a model is actually good versus when the eval is flattering.</p>\n<p><strong>Bridges LLM agents and classical RL.</strong> You&#39;re fluent in both worlds. You understand PPO, RLHF, reward modeling, and policy optimization, and you understand how modern LLM agents work, where they fail, and how RL techniques make them better. You see connections between these domains that most people miss.</p>\n<p><strong>Production-minded.</strong> You care about whether your models work in production, not just on benchmarks. You&#39;ve deployed models that serve real traffic and made hard tradeoffs between model quality, latency, and cost. 
Research that doesn&#39;t ship isn&#39;t research that matters here.</p>\n<p><strong>Runs fast experiments and communicates clearly.</strong> You&#39;d rather run three rough experiments this week than one polished one next month. When you have results, anyone on the team can understand what they mean; no decoder ring required.</p>\n<p><strong>Backgrounds that tend to do well:</strong> RL engineers at AI labs or applied ML teams who&#39;ve shipped models to production. Researchers who&#39;ve done RLHF or reward modeling for LLM systems. ML engineers who&#39;ve built training infrastructure at startups and cared as much about the pipeline as the model. People who&#39;ve worked at the intersection of RL and language models, whether in academic labs with a production bent or at companies building agent systems.</p>\n<p><strong>What We&#39;re NOT Looking For:</strong></p>\n<p><strong>Pure theorists.</strong> If your best RL work lives in a paper and you&#39;ve never trained a model on real data at real scale, this isn&#39;t the role. We need someone who builds and ships.</p>\n<p><strong>Researchers who need a platform team.</strong> If you expect training infrastructure, data pipelines, and evaluation frameworks to be set up before you can be productive, you&#39;ll be frustrated here. You build the tools you need.</p>\n<p><strong>People who only know one paradigm.</strong> Deep in classical RL but never worked with LLMs? LLM fine-tuner who&#39;s never touched RL? You&#39;ll be missing half the picture. This role requires fluency in both.</p>\n<p><strong>Slow iterators.</strong> If your standard experiment cycle is measured in weeks, not days, you&#39;ll struggle with the pace. We need someone who can run a meaningful experiment, interpret results, and decide next steps within a day or two.</p>\n<p><strong>Black-box communicators.</strong> If your typical update is a wall of metrics only another RL researcher can parse, this isn&#39;t the right fit. 
We need someone who can explain what&#39;s working, what&#39;s not, and why it matters, to people without RL PhDs.</p>\n<p><strong>A Note On Pace:</strong> We operate at an absurd level of urgency because the window for what we&#39;re building won&#39;t stay open forever. If that excites you, keep reading. If it doesn&#39;t, no hard feelings, but this role probably isn&#39;t for you.</p>\n<p><strong>Benefits &amp; Perks:</strong></p>\n<p><strong>Available to all employees</strong></p>\n<ul>\n<li><strong>Salary that makes sense</strong>: $180,000–$290,000/year, based on impact, not tenure</li>\n</ul>\n<ul>\n<li><strong>Own a piece</strong>: Up to 0.15% equity in what you&#39;re helping build</li>\n</ul>\n<ul>\n<li><strong>Generous PTO</strong>: 15 days mandatory, anything after 24 days, just ask (holidays excluded); take the time you need to recharge</li>\n</ul>\n<ul>\n<li><strong>Parental leave</strong>: 12 weeks fully paid, for moms and dads</li>\n</ul>\n<ul>\n<li><strong>Wellness stipend</strong>: $100/month for the gym, therapy, massages, or whatever keeps you human</li>\n</ul>\n<ul>\n<li><strong>Learning &amp; Development</strong>: Expense up to $1,000/year toward anything that helps you grow professionally</li>\n</ul>\n<ul>\n<li><strong>Team offsites</strong>: A change of scenery, minus the trust falls</li>\n</ul>\n<ul>\n<li><strong>Sabbatical</strong>: 3 paid months off after 4 years, do something fun and new</li>\n</ul>\n<p><strong>Available to US-based full-time employees</strong></p>\n<ul>\n<li><strong>Full coverage, no red tape</strong>: Medical, dental, and vision (100% for</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a 
href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_6ca1bab3-645","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Firecrawl","sameAs":"https://www.firecrawl.dev","logo":"https://logos.yubhub.co/firecrawl.dev.png"},"x-apply-url":"https://jobs.ashbyhq.com/firecrawl/26abaf11-ff85-4f8d-ba44-2b6d32aae2a1","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"Full time","x-salary-range":"$180,000–$290,000/year","x-skills-required":["Reinforcement Learning","Machine Learning","Deep Learning","Python","GPU Clusters","Training Runs","Evaluation Frameworks","Data Pipelines","Reward Modeling","Policy Optimization","LLM Agents","Classical RL Techniques"],"x-skills-preferred":[],"datePosted":"2026-04-24T12:17:17.208Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA (Hybrid) OR Remote (Americas, UTC-3 to UTC-10)"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Reinforcement Learning, Machine Learning, Deep Learning, Python, GPU Clusters, Training Runs, Evaluation Frameworks, Data Pipelines, Reward Modeling, Policy Optimization, LLM Agents, Classical RL Techniques","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":180000,"maxValue":290000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_bdf4e05a-b8c"},"title":"MTS - Site Reliability Engineer","description":"<p>As Microsoft continues to push the boundaries of AI, we are on the lookout for individuals to work with us on the most interesting and challenging AI questions of our time. Our vision is bold and broad: to build systems that have true artificial intelligence across agents, applications, services, and infrastructure. 
It’s also inclusive: we aim to make AI accessible to all (consumers, businesses, developers) so that everyone can realize its benefits.</p>\n<p>We’re looking for an experienced Site Reliability Engineer (SRE) to join our infrastructure team. In this role, you’ll blend software engineering and systems engineering to keep our large-scale distributed AI infrastructure reliable and efficient. You’ll work closely with ML researchers, data engineers, and product developers to design and operate the platforms that power training, fine-tuning, and serving generative AI models.</p>\n<p>Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.</p>\n<p>Responsibilities:</p>\n<p>Reliability &amp; Availability: Ensure uptime, resiliency, and fault tolerance of AI model training and inference systems.</p>\n<p>Observability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into model serving pipelines and infra.</p>\n<p>Performance Optimization: Analyze system performance and scalability, optimize resource utilization (compute, GPU clusters, storage, networking).</p>\n<p>Automation &amp; Tooling: Build automation for deployments, incident response, scaling, and failover in hybrid cloud/on-prem CPU+GPU environments.</p>\n<p>Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements.</p>\n<p>Security &amp; Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments.</p>\n<p>Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate 
research-to-production workflows.</p>\n<p>Qualifications:</p>\n<p>Required Qualifications:</p>\n<p>4+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.</p>\n<p>Preferred Qualifications:</p>\n<p>Strong proficiency in Kubernetes, Docker, and container orchestration.</p>\n<p>Knowledge of CI/CD pipelines for Inference and ML model deployment.</p>\n<p>Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code.</p>\n<p>Expertise in monitoring &amp; observability tools (Grafana, Datadog, OpenTelemetry, etc.).</p>\n<p>Strong programming/scripting skills in Python, Go, or Bash.</p>\n<p>Solid knowledge of distributed systems, networking, and storage.</p>\n<p>Experience running large-scale GPU clusters for ML/AI workloads.</p>\n<p>Familiarity with ML training/inference pipelines.</p>\n<p>Experience with high-performance computing (HPC) and workload schedulers (Kubernetes operators).</p>\n<p>Background in capacity planning &amp; cost optimization for GPU-heavy environments.</p>\n<p>Work on cutting-edge infrastructure that powers the future of Generative AI. Collaborate with world-class researchers and engineers. Impact millions of users through reliable and responsible AI deployments. 
Competitive compensation, equity options, and comprehensive benefits.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_bdf4e05a-b8c","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/mts-site-reliability-engineer/","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$119,800 - $234,700 per year","x-skills-required":["Site Reliability Engineering","DevOps","Infrastructure Engineering","Kubernetes","Docker","container orchestration","CI/CD pipelines","ML model deployment","public cloud platforms","Azure","AWS","GCP","infrastructure-as-code","monitoring & observability tools","Grafana","Datadog","OpenTelemetry","Python","Go","Bash","distributed systems","networking","storage","GPU clusters","ML training/inference pipelines","high-performance computing","workload schedulers","capacity planning","cost optimization"],"x-skills-preferred":["cloud architecture","containerization","microservices","API design","security","compliance","agile development","scrum","kanban"],"datePosted":"2026-04-24T12:12:26.597Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Redmond"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Site Reliability Engineering, DevOps, Infrastructure Engineering, Kubernetes, Docker, container orchestration, CI/CD pipelines, ML model deployment, public cloud platforms, Azure, AWS, GCP, infrastructure-as-code, monitoring & observability tools, Grafana, Datadog, OpenTelemetry, Python, Go, Bash, distributed systems, networking, storage, GPU clusters, ML training/inference pipelines, high-performance computing, workload schedulers, capacity planning, 
cost optimization, cloud architecture, containerization, microservices, API design, security, compliance, agile development, scrum, kanban","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":119800,"maxValue":234700,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_2291f859-746"},"title":"MTS - Site Reliability Engineer","description":"<p>As Microsoft continues to push the boundaries of AI, we are on the lookout for experienced Site Reliability Engineers to work with us on the most interesting and challenging AI questions of our time.</p>\n<p>Our vision is bold and broad: to build systems that have true artificial intelligence across agents, applications, services, and infrastructure. It’s also inclusive: we aim to make AI accessible to all (consumers, businesses, developers) so that everyone can realize its benefits.</p>\n<p>We’re looking for an experienced Site Reliability Engineer (SRE) to join our infrastructure team. In this role, you’ll blend software engineering and systems engineering to keep our large-scale distributed AI infrastructure reliable and efficient. 
You’ll work closely with ML researchers, data engineers, and product developers to design and operate the platforms that power training, fine-tuning, and serving generative AI models.</p>\n<p>Responsibilities:</p>\n<p>Reliability &amp; Availability: Ensure uptime, resiliency, and fault tolerance of AI model training and inference systems.</p>\n<p>Observability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into model serving pipelines and infra.</p>\n<p>Performance Optimization: Analyze system performance and scalability, optimize resource utilization (compute, GPU clusters, storage, networking).</p>\n<p>Automation &amp; Tooling: Build automation for deployments, incident response, scaling, and failover in hybrid cloud/on-prem CPU+GPU environments.</p>\n<p>Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements.</p>\n<p>Security &amp; Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments.</p>\n<p>Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows.</p>\n<p>Qualifications:</p>\n<p>Required Qualifications:</p>\n<p>4+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.</p>\n<p>Strong proficiency in Kubernetes, Docker, and container orchestration.</p>\n<p>Knowledge of CI/CD pipelines for Inference and ML model deployment.</p>\n<p>Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code.</p>\n<p>Expertise in monitoring &amp; observability tools (Grafana, Datadog, OpenTelemetry, etc.).</p>\n<p>Strong programming/scripting skills in Python, Go, or Bash.</p>\n<p>Solid knowledge of distributed systems, networking, and storage.</p>\n<p>Experience running large-scale GPU clusters for ML/AI workloads 
(preferred).</p>\n<p>Familiarity with ML training/inference pipelines.</p>\n<p>Experience with high-performance computing (HPC) and workload schedulers (Kubernetes operators).</p>\n<p>Background in capacity planning &amp; cost optimization for GPU-heavy environments.</p>\n<p>Work on cutting-edge infrastructure that powers the future of Generative AI.</p>\n<p>Collaborate with world-class researchers and engineers.</p>\n<p>Impact millions of users through reliable and responsible AI deployments.</p>\n<p>Competitive compensation, equity options, and comprehensive benefits.</p>\n<p>Software Engineering IC4 – The typical base pay range for this role across the U.S. is USD $119,800 – $234,700 per year.</p>\n<p>Software Engineering IC5 – The typical base pay range for this role across the U.S. is USD $139,900 – $274,800 per year.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_2291f859-746","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/mts-site-reliability-engineer-3/","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$119,800 - $234,700 per year","x-skills-required":["Kubernetes","Docker","container orchestration","CI/CD pipelines","public cloud platforms","infrastructure-as-code","monitoring & observability tools","Python","Go","Bash","distributed systems","networking","storage","GPU clusters","ML training/inference pipelines","high-performance computing","workload schedulers","capacity planning & cost optimization"],"x-skills-preferred":[],"datePosted":"2026-04-24T12:12:10.488Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Mountain 
View"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Kubernetes, Docker, container orchestration, CI/CD pipelines, public cloud platforms, infrastructure-as-code, monitoring & observability tools, Python, Go, Bash, distributed systems, networking, storage, GPU clusters, ML training/inference pipelines, high-performance computing, workload schedulers, capacity planning & cost optimization","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":119800,"maxValue":234700,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_70e2591f-d7d"},"title":"Technical Program Manager, Infrastructure","description":"<p>As a Technical Program Manager for Infrastructure, you&#39;ll work across multiple infrastructure domains to coordinate complex programs that have broad organisational impact. You&#39;ll be solving novel scaling challenges at the frontier of what&#39;s possible, all while maintaining the security and reliability our mission demands.</p>\n<p>Developer Productivity &amp; Tooling</p>\n<ul>\n<li>Drive cross-functional programs to improve developer environments, CI/CD infrastructure, and release processes that enable rapid innovation while maintaining high security standards</li>\n</ul>\n<ul>\n<li>Coordinate large-scale migrations and platform modernization efforts across engineering teams</li>\n</ul>\n<ul>\n<li>Partner with teams to measure and improve developer productivity metrics, identifying bottlenecks and driving systematic improvements</li>\n</ul>\n<ul>\n<li>Lead initiatives to integrate AI tools into development workflows, helping Anthropic be at the forefront of AI-assisted research and engineering</li>\n</ul>\n<p>Infrastructure Reliability &amp; Operations</p>\n<ul>\n<li>Drive programs to establish and achieve reliability targets across training infrastructure and 
production services</li>\n</ul>\n<ul>\n<li>Coordinate incident response improvements, post-mortem processes, and on-call rotations that help teams operate effectively</li>\n</ul>\n<ul>\n<li>Establish metrics and dashboards to track infrastructure health, capacity utilisation, and operational excellence</li>\n</ul>\n<p>Cross-functional Coordination</p>\n<ul>\n<li>Serve as the critical bridge between infrastructure teams, research, and product, translating technical complexities into clear updates for a variety of audiences</li>\n</ul>\n<ul>\n<li>Consult with stakeholders to deeply understand infrastructure, data, and compute needs, identifying solutions to support frontier research and product development</li>\n</ul>\n<ul>\n<li>Drive alignment on priorities and timelines across teams with competing constraints</li>\n</ul>\n<p>You&#39;ll be a good fit if you have 5+ years of technical program management experience, with a track record of successfully delivering complex infrastructure programs in ML/AI systems or large-scale distributed systems. 
You&#39;ll also need a deep technical understanding of infrastructure systems, strong stakeholder management skills, and the ability to navigate competing priorities while making data-driven technical decisions.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_70e2591f-d7d","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5111783008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$290,000-$365,000 USD","x-skills-required":["Kubernetes","Cloud platforms (AWS, GCP, Azure)","ML infrastructure (GPU/TPU/Trainium clusters)","Developer productivity initiatives","CI/CD systems","Infrastructure scaling"],"x-skills-preferred":["Observability tooling and practices","AI tools to improve engineering productivity","Research teams and translating their needs into concrete technical requirements"],"datePosted":"2026-04-18T15:57:52.097Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Kubernetes, Cloud platforms (AWS, GCP, Azure), ML infrastructure (GPU/TPU/Trainium clusters), Developer productivity initiatives, CI/CD systems, Infrastructure scaling, Observability tooling and practices, AI tools to improve engineering productivity, Research teams and translating their needs into concrete technical
requirements","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":290000,"maxValue":365000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_588dfb0e-611"},"title":"Solutions Architect - Kubernetes","description":"<p>As a Solutions Architect at CoreWeave, you will play a vital role in helping customers succeed with our cloud infrastructure offerings, focusing on Kubernetes solutions within high-performance compute (HPC) environments.</p>\n<p>Your responsibilities will include serving as the primary technical point of contact for customers, establishing strong technical relationships and ensuring their success with CoreWeave&#39;s cloud infrastructure offerings.</p>\n<p>You will collaborate closely with customers to understand their unique business needs and create, prototype, and deploy tailored solutions that align with their requirements.</p>\n<p>You will lead proof of concept initiatives to showcase the value and viability of CoreWeave&#39;s solutions within specific environments.</p>\n<p>You will drive technical leadership and direction during customer meetings, presentations, and workshops, addressing any technical queries or concerns that arise.</p>\n<p>You will act as a virtual member of CoreWeave&#39;s Kubernetes product and engineering teams, identifying opportunities for product enhancement and collaborating with engineers to implement your suggestions.</p>\n<p>You will offer valuable insights on product features, functionality, and performance, contributing regularly to discussions about product strategy and architecture.</p>\n<p>You will conduct periodic technical reviews and assessments of customer workloads, pinpointing opportunities for workload optimization and suggesting suitable solutions.</p>\n<p>You will stay informed of the latest developments and trends in Kubernetes, cloud computing and 
infrastructure, sharing your thought leadership with customers and internal stakeholders.</p>\n<p>You will lead the prototyping and initiation of research and development efforts for emerging products and solutions, delivering prototypes and key insights for internal consumption.</p>\n<p>You will represent CoreWeave at conferences and industry events, with occasional travel as required.</p>\n<p>To be successful in this role, you will need to have a B.S. in Computer Science or a related technical discipline, or equivalent experience.</p>\n<p>You will also need to have 7+ years of proven experience as a Solutions Architect, engineer, researcher, or technical account manager in cloud infrastructure, focusing on building distributed systems or HPC/cloud services, with an expertise focused on scalable Kubernetes solutions.</p>\n<p>You will need to be fluent in cloud computing concepts, architecture, and technologies with hands-on experience in designing and implementing cloud solutions.</p>\n<p>You will need to have a proven track record with building customer relationships, communicating clearly and the ability to break down complex technical concepts to both technical and non-technical audiences.</p>\n<p>You will need to be familiar with NVIDIA GPUs typically used in AI/ML applications and associated technologies such as Infiniband and NVIDIA Collective Communications Library (NCCL).</p>\n<p>You will need to have experience with running large-scale Artificial Intelligence/Machine Learning (AI/ML) training and inference workloads on technologies such as Slurm and Kubernetes.</p>\n<p>Preferred qualifications include code contributions to open-source inference frameworks, experience with scripting and automation related to Kubernetes clusters and workloads, experience with building solutions across multi-cloud environments, and client or customer-facing publications/talks on latency, optimization, or advanced model-server architectures.</p>\n<p 
style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_588dfb0e-611","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4557835006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$165,000 to $220,000","x-skills-required":["Kubernetes","Cloud Computing","High-Performance Compute (HPC)","Distributed Systems","Cloud Infrastructure","Scalable Solutions","NVIDIA GPUs","Infiniband","NVIDIA Collective Communications Library (NCCL)","Slurm","Kubernetes Clusters"],"x-skills-preferred":["Code Contributions to Open-Source Inference Frameworks","Scripting and Automation Related to Kubernetes Clusters and Workloads","Building Solutions Across Multi-Cloud Environments","Client or Customer-Facing Publications/Talks on Latency, Optimization, or Advanced Model-Server Architectures"],"datePosted":"2026-04-18T15:57:29.779Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Kubernetes, Cloud Computing, High-Performance Compute (HPC), Distributed Systems, Cloud Infrastructure, Scalable Solutions, NVIDIA GPUs, Infiniband, NVIDIA Collective Communications Library (NCCL), Slurm, Kubernetes Clusters, Code Contributions to Open-Source Inference Frameworks, Scripting and Automation Related to Kubernetes Clusters and Workloads, Building Solutions Across Multi-Cloud Environments, Client or Customer-Facing Publications/Talks on Latency, Optimization, or Advanced Model-Server 
Architectures","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":165000,"maxValue":220000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_c7de81b4-bec"},"title":"Security Engineer, Infrastructure","description":"<p>We are seeking a highly skilled Infrastructure Security Engineer to join our team. This role is integral to ensuring the security and integrity of our platform.</p>\n<p>You will be responsible for securing large cloud environments, orchestrating and securing various compute clusters, and reviewing infrastructure as code. Your expertise in cloud security, infrastructure automation, and advanced security practices will be essential in maintaining and enhancing our security posture.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Securing infrastructure across large cloud hosting providers (e.g., AWS, Azure, GCP).</li>\n<li>Implementing and maintaining robust security configurations and policies for cloud environments.</li>\n<li>Conducting regular security assessments and audits of infrastructure to identify vulnerabilities and areas for improvement.</li>\n<li>Developing and enforcing security best practices for infrastructure automation and orchestration.</li>\n<li>Collaborating with Developer Experience, IT, and product teams to integrate security into all stages of the infrastructure lifecycle.</li>\n<li>Reviewing and securing infrastructure as code (e.g., Terraform, CloudFormation).</li>\n<li>Educating and mentoring team members on infrastructure security best practices and emerging threats.</li>\n</ul>\n<p>Ideally, you&#39;d have:</p>\n<ul>\n<li>Proven experience as a Security Engineer with a focus on product security.</li>\n<li>Proficiency in NodeJS, TypeScript, and Kubernetes.</li>\n<li>Experience with orchestrating and securing GPU clusters.</li>\n<li>Proficiency in infrastructure as code tools such
as Terraform and CloudFormation.</li>\n<li>Excellent communication skills, with the ability to clearly explain technical concepts and their implications to both technical and non-technical stakeholders.</li>\n<li>Demonstrated ability to influence security strategies and drive improvements within an organisation.</li>\n<li>Relevant security certifications (e.g., AWS Certified Security Specialty, Certified Cloud Security Professional) are a plus.</li>\n<li>Experience in a senior or lead security role is preferred.</li>\n</ul>\n<p>Compensation packages at Scale for eligible roles include base salary, equity, and benefits. The range displayed on each job posting reflects the minimum and maximum target for new hire salaries for the position, determined by work location and additional factors, including job-related skills, experience, interview performance, and relevant education or training.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_c7de81b4-bec","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Scale","sameAs":"https://www.scale.com/","logo":"https://logos.yubhub.co/scale.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/scaleai/jobs/4646888005","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$237,600-$297,000 USD","x-skills-required":["cloud security","infrastructure automation","advanced security practices","NodeJS","TypeScript","Kubernetes","Terraform","CloudFormation"],"x-skills-preferred":["orchestrating and securing GPU clusters","relevant security certifications"],"datePosted":"2026-04-18T15:56:27.426Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"New York, NY; San Francisco, CA; Seattle, WA; Washington, 
DC"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"cloud security, infrastructure automation, advanced security practices, NodeJS, TypeScript, Kubernetes, Terraform, CloudFormation, orchestrating and securing GPU clusters, relevant security certifications","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":237600,"maxValue":297000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_16599c27-a87"},"title":"Senior Infrastructure Engineer/SRE","description":"<p>We&#39;re on a mission to revolutionize the workforce with AI. As a member of the infrastructure team, you&#39;ll design, build, and advance our core infrastructure that allows the engineering team to execute quickly, productively, and securely.</p>\n<p>You&#39;ll partner with engineers to build dev tools that empower developer workflows and deployment infrastructure. Ensure reliability of multi-cloud Kubernetes clusters and pipelines. Implement metrics, logging, analytics, and alerting for performance and security across all endpoints and applications. Automate operations and engineering, focusing on automation so we can spend energy where it matters.</p>\n<p>You&#39;ll also build machine learning infrastructure that enables AI teams to train, test, and deploy on large-scale datasets.</p>\n<p>We&#39;re looking for someone with 5+ years of experience in DevOps, Site Reliability Engineering, Production Engineering, or equivalent field. You should have deep proficiency with coding languages such as Golang or Python, and deep familiarity with container-related security best practices. Production experience working with Kubernetes, and a deep understanding of the Kubernetes ecosystem, including popular open-source tooling such as cert-manager or external-dns. 
Experience with GPU-enabled clusters is a bonus.</p>\n<p>Perks &amp; Benefits:</p>\n<ul>\n<li>Comprehensive medical, dental, and vision coverage with plans to fit you and your family</li>\n<li>Flexible PTO to take the time you need, when you need it</li>\n<li>Paid parental leave for all new parents welcoming a new child</li>\n<li>Retirement savings plan to help you plan for the future</li>\n<li>Remote work setup budget to help you create a productive home office</li>\n<li>Monthly wellness and communication stipend to keep you connected and balanced</li>\n<li>In-office meal program and commuter benefits provided for onsite employees</li>\n</ul>\n<p>Compensation at Cresta:</p>\n<p>Cresta&#39;s approach to compensation is simple: recognize impact, reward excellence, and invest in our people. We offer competitive, location-based pay that reflects the market and what each individual brings to the table. The posted base salary range represents what we expect to pay for this role in a given location. Final offers are shaped by factors like experience, skills, education, and geography. 
In addition to base pay, total compensation includes equity and a comprehensive benefits package for you and your family.</p>\n<p>OTE Range: $205,000–$270,000 + Offers Equity</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_16599c27-a87","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Cresta","sameAs":"https://www.cresta.ai/","logo":"https://logos.yubhub.co/cresta.ai.png"},"x-apply-url":"https://job-boards.greenhouse.io/cresta/jobs/5137153008","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$205,000–$270,000","x-skills-required":["Golang","Python","Kubernetes","cert-manager","external-dns","GPU-enabled clusters","Terraform","CloudFormation","AWS","IAM","S3","EC2","EKS","PostgreSQL","GitOps","Flux","Argo","CI/CD","GitHub Actions"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:55:52.459Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"United States (Remote)"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Golang, Python, Kubernetes, cert-manager, external-dns, GPU-enabled clusters, Terraform, CloudFormation, AWS, IAM, S3, EC2, EKS, PostgreSQL, GitOps, Flux, Argo, CI/CD, GitHub Actions","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":205000,"maxValue":270000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_24176cb8-311"},"title":"Member of Technical Staff - Compute Infrastructure","description":"<p>We&#39;re seeking a highly skilled Member of Technical Staff to join our Compute Infrastructure team. 
As a key member of this team, you will design, build, and operate massive-scale clusters and orchestration platforms that power frontier AI training, inference, and agent workloads at unprecedented scale.</p>\n<p>In this role, you will push the boundaries of container orchestration far beyond existing systems like Kubernetes, manage exascale compute resources, optimize for high-performance training runs and production serving, and collaborate closely with research and systems teams to deliver reliable, ultra-scalable infrastructure that enables xAI&#39;s next-generation models and applications.</p>\n<p>Responsibilities include building and managing massive-scale clusters, designing, developing, and extending an in-house container orchestration platform, collaborating with research teams to architect and optimize compute clusters, profiling, debugging, and resolving complex system-level performance bottlenecks, and owning end-to-end infrastructure initiatives.</p>\n<p>To succeed in this role, you will need deep expertise in virtualization technologies and advanced containerization/sandboxing, strong proficiency in systems programming languages such as C/C++ and Rust, and a proven track record of profiling, debugging, and optimizing complex system-level performance issues.</p>\n<p>Preferred skills and experience include experience in Linux kernel development, hypervisor extensions, or low-level system programming for compute-intensive workloads, operating or designing large-scale AI training/inference clusters, and familiarity with performance tools, tracing, and debugging in production distributed environments.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a 
href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_24176cb8-311","directApply":true,"hiringOrganization":{"@type":"Organization","name":"xAI","sameAs":"https://www.xai.com/","logo":"https://logos.yubhub.co/xai.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/xai/jobs/5052040007","x-work-arrangement":"onsite","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$180,000 - $440,000 USD","x-skills-required":["Deep expertise in virtualization technologies (KVM, Xen, QEMU) and advanced containerization/sandboxing (Kata, Firecracker, gVisor, Sysbox, or equivalent)","Strong proficiency in systems programming languages such as C/C++ and Rust","Proven track record profiling, debugging, and optimizing complex system-level performance issues, with deep knowledge of Linux kernel internals, resource management, scheduling, memory management, and low-level engineering","Hands-on experience building or significantly enhancing distributed compute platforms, orchestration systems, or high-performance infrastructure at scale"],"x-skills-preferred":["Experience in Linux kernel development, hypervisor extensions, or low-level system programming for compute-intensive workloads","Proven track record operating or designing large-scale AI training/inference clusters (GPU/TPU scale)","Experience with custom runtimes, isolation techniques, or bespoke platforms for specialized AI compute","Familiarity with performance tools, tracing, and debugging in production distributed environments"],"datePosted":"2026-04-18T15:55:50.213Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Palo Alto, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Deep expertise in virtualization technologies (KVM, Xen, QEMU) and advanced containerization/sandboxing (Kata, Firecracker, gVisor, Sysbox, or equivalent), Strong proficiency in systems programming languages such as 
C/C++ and Rust, Proven track record profiling, debugging, and optimizing complex system-level performance issues, with deep knowledge of Linux kernel internals, resource management, scheduling, memory management, and low-level engineering, Hands-on experience building or significantly enhancing distributed compute platforms, orchestration systems, or high-performance infrastructure at scale, Experience in Linux kernel development, hypervisor extensions, or low-level system programming for compute-intensive workloads, Proven track record operating or designing large-scale AI training/inference clusters (GPU/TPU scale), Experience with custom runtimes, isolation techniques, or bespoke platforms for specialized AI compute, Familiarity with performance tools, tracing, and debugging in production distributed environments","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":180000,"maxValue":440000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_6b0282a9-9ee"},"title":"Staff Software Engineer, Observability","description":"<p>We are seeking a highly experienced Staff Software Engineer to lead our efforts in building, maintaining, and optimizing highly scalable, reliable, and secure systems. 
The Observability team is responsible for deploying and maintaining critical infrastructure at CoreWeave including our logging, tracing, and metrics platforms as well as the pipelines that feed them.</p>\n<p>Key Responsibilities:</p>\n<ul>\n<li>Lead and mentor engineers, fostering a culture of collaboration and continuous improvement.</li>\n<li>Scale logging, tracing, and metrics platforms to support a global datacenter footprint.</li>\n<li>Develop and refine monitoring and alerting to enhance system reliability.</li>\n<li>Advise engineers across CoreWeave on optimal usage of Observability systems.</li>\n<li>Automate interactions with CoreWeave&#39;s Compute Infrastructure layer.</li>\n<li>Manage production clusters and ensure development teams follow best practices for deployments.</li>\n</ul>\n<p>Required Qualifications:</p>\n<ul>\n<li>7+ years of experience in Software Engineering, Site Reliability Engineering, DevOps, or a related field.</li>\n<li>Deep expertise across all observability pillars using tools like ClickHouse, Elastic, Loki, Victoria Metrics, Prometheus, Thanos and/or Grafana.</li>\n<li>Expertise in Kubernetes, containerization, and microservices architectures.</li>\n<li>Proven track record of leading incident management and post-mortem analysis.</li>\n<li>Excellent problem-solving, analytical, and communication skills.</li>\n</ul>\n<p>Preferred Qualifications:</p>\n<ul>\n<li>Experience running and scaling observability tools as a cloud provider.</li>\n<li>Experience administering large-scale kubernetes clusters.</li>\n<li>Deep understanding of data-streaming systems.</li>\n</ul>\n<p>The base salary range for this role is $188,000 to $250,000.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a 
href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_6b0282a9-9ee","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4577361006","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$188,000 to $250,000","x-skills-required":["ClickHouse","Elastic","Loki","Victoria Metrics","Prometheus","Thanos","Grafana","Kubernetes","containerization","microservices architectures"],"x-skills-preferred":["Experience running and scaling observability tools as a cloud provider","Experience administering large-scale kubernetes clusters","Deep understanding of data-streaming systems"],"datePosted":"2026-04-18T15:54:03.521Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"ClickHouse, Elastic, Loki, Victoria Metrics, Prometheus, Thanos, Grafana, Kubernetes, containerization, microservices architectures, Experience running and scaling observability tools as a cloud provider, Experience administering large-scale kubernetes clusters, Deep understanding of data-streaming systems","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":188000,"maxValue":250000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_c1903386-87b"},"title":"Staff Infrastructure Software Engineer (Kubernetes)","description":"<p>As a member of the infrastructure team, you will design, build, and advance our core infrastructure that allows the engineering team to execute quickly, productively, and securely.</p>\n<p>You will partner 
with engineers to build dev tools that empower developer workflows and deployment infrastructure.</p>\n<p>Ensure reliability of multi-cloud Kubernetes clusters and pipelines.</p>\n<p>Implement metrics, logging, analytics, and alerting for performance and security across all endpoints and applications.</p>\n<p>Build infrastructure-as-code deployment tooling and supporting services on multiple cloud providers.</p>\n<p>Automate operations and engineering.</p>\n<p>Focus on automation so we can spend energy where it matters.</p>\n<p>Build machine learning infrastructure that enables AI teams to train, test, and deploy on large-scale datasets.</p>\n<p>We are looking for a highly skilled engineer with 5+ years of experience in DevOps, Site Reliability Engineering, Production Engineering, or equivalent field.</p>\n<p>Deep proficiency with coding languages such as Golang or Python.</p>\n<p>Deep familiarity with container-related security best practices.</p>\n<p>Production experience working with Kubernetes, and a deep understanding of the Kubernetes ecosystem, including popular open-source tooling such as cert-manager or external-dns.</p>\n<p>Experience with GPU-enabled clusters is a bonus.</p>\n<p>Production experience with Kubernetes templating tools such as Helm or Kustomize.</p>\n<p>Production experience with IaC tools such as Terraform or CloudFormation.</p>\n<p>Production experience working with AWS and services such as IAM, S3, EC2, and EKS.</p>\n<p>Production experience with other cloud providers such as Google Cloud and Azure is a bonus.</p>\n<p>Production experience with database software such as PostgreSQL.</p>\n<p>Experience with GitOps tooling such as Flux or Argo.</p>\n<p>Experience with CI/CD such as GitHub Actions.</p>\n<p>Perks and benefits include paid parental leave, monthly health and wellness allowance, and PTO.</p>\n<p>Compensation includes a base salary, equity, and a variety of benefits.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job
scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_c1903386-87b","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Cresta","sameAs":"https://www.cresta.ai/","logo":"https://logos.yubhub.co/cresta.ai.png"},"x-apply-url":"https://job-boards.greenhouse.io/cresta/jobs/4535898008","x-work-arrangement":"remote","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Golang","Python","Kubernetes","cert-manager","external-dns","GPU-enabled clusters","Helm","Kustomize","Terraform","CloudFormation","AWS","IAM","S3","EC2","EKS","Google Cloud","Azure","PostgreSQL","GitOps","Flux","Argo","CI/CD","GitHub Actions"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:53:57.717Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Germany (Remote)"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Golang, Python, Kubernetes, cert-manager, external-dns, GPU-enabled clusters, Helm, Kustomize, Terraform, CloudFormation, AWS, IAM, S3, EC2, EKS, Google Cloud, Azure, PostgreSQL, GitOps, Flux, Argo, CI/CD, GitHub Actions"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_26212e9e-5a8"},"title":"Infrastructure Engineer/SRE","description":"<p>We&#39;re seeking an experienced Infrastructure Engineer/SRE to join our engineering team. 
As a key member of our infrastructure team, you will be responsible for designing, building, and advancing our core infrastructure that allows the engineering team to execute quickly, productively, and securely.</p>\n<p>Ours is a collaborative but highly autonomous working environment: each member has a defined role with clear expectations, as well as the freedom to pursue projects they find interesting.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Partner with engineers to build dev tools that empower developer workflows and deployment infrastructure.</li>\n<li>Ensure reliability of multi-cloud Kubernetes clusters and pipelines.</li>\n<li>Implement metrics, logging, analytics, and alerting for performance and security across all endpoints and applications.</li>\n<li>Build infrastructure-as-code deployment tooling and supporting services on multiple cloud providers.</li>\n<li>Automate operations and engineering. Focus on automation so we can spend energy where it matters.</li>\n<li>Build machine learning infrastructure that enables AI teams to train, test, and deploy on large-scale datasets.</li>\n</ul>\n<p>What we are looking for:</p>\n<ul>\n<li>5+ years of experience in DevOps, Site Reliability Engineering, Production Engineering, or equivalent field.</li>\n<li>Deep proficiency with coding languages such as Golang or Python.</li>\n<li>Deep familiarity with container-related security best practices.</li>\n<li>Production experience working with Kubernetes, and a deep understanding of the Kubernetes ecosystem, including popular open-source tooling such as cert-manager or external-dns.</li>\n<li>Experience with GPU-enabled clusters is a bonus.</li>\n<li>Production experience with Kubernetes templating tools such as Helm or Kustomize.</li>\n<li>Production experience with IaC tools such as Terraform or CloudFormation.</li>\n<li>Production experience working with AWS and services such as IAM, S3, EC2, and EKS.</li>\n<li>Production experience with other cloud providers such as Google Cloud and 
Azure is a bonus.</li>\n<li>Production experience with database software such as PostgreSQL.</li>\n<li>Experience with GitOps tooling such as Flux or Argo.</li>\n<li>Experience with CI/CD such as GitHub Actions.</li>\n</ul>\n<p>Perks &amp; Benefits:</p>\n<ul>\n<li>We offer Cresta employees a variety of medical benefits designed to fit your stage of life.</li>\n<li>Flexible vacation time to promote a healthy work-life blend.</li>\n<li>Paid parental leave to support you and your family.</li>\n</ul>\n<p>Compensation for this position includes a base salary, equity, and a variety of benefits. Actual base salaries will be based on candidate-specific factors, including experience, skillset, and location, and local minimum pay requirements as applicable.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_26212e9e-5a8","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Cresta","sameAs":"https://www.cresta.ai/","logo":"https://logos.yubhub.co/cresta.ai.png"},"x-apply-url":"https://job-boards.greenhouse.io/cresta/jobs/5113847008","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Golang","Python","Kubernetes","cert-manager","external-dns","GPU-enabled clusters","Helm","Kustomize","Terraform","CloudFormation","AWS","IAM","S3","EC2","EKS","Google Cloud","Azure","PostgreSQL","GitOps","Flux","Argo","CI/CD","GitHub Actions"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:53:55.875Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Australia (Remote)"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Golang, Python, Kubernetes, cert-manager, external-dns, GPU-enabled clusters, Helm, Kustomize, Terraform, CloudFormation, AWS, IAM, S3, EC2, 
EKS, Google Cloud, Azure, PostgreSQL, GitOps, Flux, Argo, CI/CD, GitHub Actions"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_6a7d182d-c49"},"title":"Solutions Architect - Kubernetes","description":"<p>As a Solutions Architect at CoreWeave, you will play a vital role in helping customers succeed with our cloud infrastructure offerings, focusing on Kubernetes solutions within high-performance compute (HPC) environments.</p>\n<p>Your primary responsibility will be to serve as the main technical point of contact for customers, establishing strong technical relationships and ensuring their success with CoreWeave&#39;s cloud infrastructure offerings.</p>\n<p>You will collaborate closely with customers to understand their unique business needs and create, prototype, and deploy tailored solutions that align with their requirements.</p>\n<p>You will lead proof of concept initiatives to showcase the value and viability of CoreWeave&#39;s solutions within specific environments.</p>\n<p>You will drive technical leadership and direction during customer meetings, presentations, and workshops, addressing any technical queries or concerns that arise.</p>\n<p>You will act as a virtual member of CoreWeave&#39;s Kubernetes product and engineering teams, identifying opportunities for product enhancement and collaborating with engineers to implement your suggestions.</p>\n<p>You will offer valuable insights on product features, functionality, and performance, contributing regularly to discussions about product strategy and architecture.</p>\n<p>You will conduct periodic technical reviews and assessments of customer workloads, pinpointing opportunities for workload optimization and suggesting suitable solutions.</p>\n<p>You will stay informed of the latest developments and trends in Kubernetes, cloud computing and infrastructure, sharing your thought leadership with customers and internal 
stakeholders.</p>\n<p>You will lead the prototyping and initiation of research and development efforts for emerging products and solutions, delivering prototypes and key insights for internal consumption.</p>\n<p>You will represent CoreWeave at conferences and industry events, with occasional travel as required.</p>\n<p>To be successful in this role, you will need to have a proven track record of working as a Solutions Architect, engineer, researcher, or technical account manager in cloud infrastructure, focusing on building distributed systems or HPC/cloud services, with expertise in scalable Kubernetes solutions.</p>\n<p>You will also need to have fluency in cloud computing concepts, architecture, and technologies, with hands-on experience in designing and implementing cloud solutions.</p>\n<p>In addition, you will need a proven track record of building customer relationships, communicating clearly, and breaking down complex technical concepts for both technical and non-technical audiences.</p>\n<p>Preferred qualifications include code contributions to open-source inference frameworks, experience with scripting and automation related to Kubernetes clusters and workloads, experience with building solutions across multi-cloud environments, and client or customer-facing publications/talks on latency, optimization, or advanced model-server architectures.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_6a7d182d-c49","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4649036006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$165,000 to $225,000 SGD","x-skills-required":["Cloud 
computing concepts","Kubernetes solutions","High-performance compute (HPC) environments","Distributed systems","Cloud infrastructure"],"x-skills-preferred":["Code contributions to open-source inference frameworks","Scripting and automation related to Kubernetes clusters and workloads","Building solutions across multi-cloud environments","Client or customer-facing publications/talks on latency, optimization, or advanced model-server architectures"],"datePosted":"2026-04-18T15:52:11.835Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Singapore"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Cloud computing concepts, Kubernetes solutions, High-performance compute (HPC) environments, Distributed systems, Cloud infrastructure, Code contributions to open-source inference frameworks, Scripting and automation related to Kubernetes clusters and workloads, Building solutions across multi-cloud environments, Client or customer-facing publications/talks on latency, optimization, or advanced model-server architectures","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":165000,"maxValue":225000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_18013f3c-904"},"title":"Cluster Deployment Engineer","description":"<p>As a Cluster Deployment Engineer at Anthropic, you will own how large-scale AI compute clusters physically come together inside our datacenter fleet.</p>\n<p>You will set the deployment-engineering strategy for cluster build-out: how racks are organized into pods, halls, and sites; how compute, network, power, and cooling systems interface at the rack boundary; and how deployment scope flows cleanly from hardware specification to facility delivery to a running cluster.</p>\n<p>This role is focused on deployment engineering, not on 
datacenter network or systems design: your scope is making sure clusters land cleanly and predictably, not designing the fabrics or facilities themselves.</p>\n<p>You will work across hardware, networking, facilities, supply chain, and construction to ensure that every generation of accelerator we deploy lands in a datacenter that is ready for it: on schedule, at full density, and with every piece of required infrastructure accounted for.</p>\n<p>You will be the person who sees around corners: anticipating how next-generation rack designs will stress our facilities, where our deployment model will break at scale, and what needs to change now so that the next cluster turn-up is faster and more predictable than the last.</p>\n<p>You will operate at the intersection of engineering strategy and execution discipline, partnering with internal research and systems teams, external developers, engineering firms, and OEM partners to deliver cluster capacity at the speed the frontier demands.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Own cluster-level deployment strategy: define how AI compute clusters are organized across the floor, how racks interconnect, and how cluster topology requirements translate into facility and deployment scope across a portfolio of sites.</li>\n</ul>\n<ul>\n<li>Set rack interface standards spanning power, network, mechanical, thermal, and spatial domains, and ensure that every deployment includes the complete set of infrastructure required to bring a cluster online.</li>\n</ul>\n<ul>\n<li>Drive multi-threaded cluster bring-up programs across hardware, networking, power, and cooling, owning plans, dependencies, and critical paths from hardware specification through energization and turn-up.</li>\n</ul>\n<ul>\n<li>Partner with internal engineering teams (research, systems, networking, and hardware) to translate cluster requirements into deployable facility scope, and to derisk onboarding of new hardware platforms well ahead of 
delivery.</li>\n</ul>\n<ul>\n<li>Lead external partner execution with developers, engineering firms, OEMs, and construction teams, driving technical reviews, deviation management, and handoffs that keep deployments on schedule and within specification.</li>\n</ul>\n<ul>\n<li>Improve cluster turn-up reliability and repeatability: identify systemic gaps in deployment scope, tooling, and partner interfaces, and drive durable fixes that reduce time-to-serve for new capacity.</li>\n</ul>\n<ul>\n<li>Define and track deployment KPIs (cluster readiness, schedule adherence, scope completeness, time-to-first-packet) and use historical trends to forecast risk and inform capacity planning.</li>\n</ul>\n<ul>\n<li>Coordinate cross-functional readiness across supply chain, security, operations, and construction to ship production-ready compute capacity.</li>\n</ul>\n<ul>\n<li>Provide crisp executive visibility on deployment progress, tradeoffs, and risks across a portfolio of concurrent cluster programs.</li>\n</ul>\n<ul>\n<li>Design cluster interfaces for durability: define rack and cluster-level interfaces that remain robust across hardware generations, so that facility scope and deployment models do not need to be reinvented every time the underlying hardware changes.</li>\n</ul>\n<ul>\n<li>Build cluster layout and BOM tooling: create and maintain the tools, templates, and data models that turn cluster topology and rack specifications into accurate floor layouts, deployment sequences, and complete bills of materials, replacing one-off spreadsheets with repeatable, auditable workflows.</li>\n</ul>\n<p>You may be a good fit if you:</p>\n<ul>\n<li>Have 10+ years of experience in hyperscale datacenter environments, with senior-level responsibility for cluster deployment, large-scale IT integration, or equivalent infrastructure programs.</li>\n</ul>\n<ul>\n<li>Have delivered AI, HPC, or high-density compute clusters at scale and developed a strong intuition for the constraints 
that govern cluster deployment: interconnect reach, adjacency, power density, and thermal limits.</li>\n</ul>\n<ul>\n<li>Can operate fluently across the boundary between IT hardware and facility infrastructure, and have set interface standards that held up across multiple hardware generations and sites.</li>\n</ul>\n<ul>\n<li>Have led cross-functional programs with both internal engineering teams and external developers, engineering firms, and OEM partners, and are effective at driving alignment across organizational levels.</li>\n</ul>\n<ul>\n<li>Combine strong systems thinking with execution discipline, comfortable zooming from cluster topology and portfolio strategy down to the specific interface detail that will otherwise become a field issue.</li>\n</ul>\n<ul>\n<li>Communicate clearly with technical and executive audiences, and can distill complex, multi-disciplinary programs into decisions and tradeoffs leadership can act on.</li>\n</ul>\n<ul>\n<li>Thrive in ambiguous, fast-moving environments where the hardware, the scale, and the requirements are all changing simultaneously.</li>\n</ul>\n<ul>\n<li>Hold a Bachelor&#39;s degree in Electrical Engineering, Mechanical Engineering, Computer Engineering, or equivalent practical experience.</li>\n</ul>\n<p>Strong candidates may also:</p>\n<ul>\n<li>Have direct experience deploying leading-edge AI accelerator clusters at hyperscale.</li>\n</ul>\n<ul>\n<li>Have shaped reference designs, deployment standards, or cluster-level playbooks that were adopted across a fleet.</li>\n</ul>\n<ul>\n<li>Have experience working across multiple geographies and understand how regional codes, climate, utility constraints, and supply chains shape cluster-level decisions.</li>\n</ul>\n<ul>\n<li>Have partnered closely with hardware and system providers on long-term platform onboarding and bring-up.</li>\n</ul>\n<ul>\n<li>Have experience building the program mechanisms (roadmaps, milestones, risk registers, runbooks) that make 
delivery predictable at massive scale.</li>\n</ul>\n<p>The annual compensation range for this role is listed below. For sales roles, the range provided is the role’s On Target Earnings (“OTE”) range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role.</p>\n<p>Annual Salary: $320,000-$405,000 USD</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_18013f3c-904","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5191638008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$320,000-$405,000 USD","x-skills-required":["Hyperscale datacenter environments","Cluster deployment","Large-scale IT integration","Infrastructure programs","AI","HPC","High-density compute clusters","Interconnect reach","Adjacency","Power density","Thermal limits","IT hardware","Facility infrastructure","Interface standards","Cluster topology","Portfolio strategy","Execution discipline","Systems thinking","Communication","Technical audiences","Executive audiences","Complex programs","Decisions","Tradeoffs","Leadership","Bachelor's degree","Electrical Engineering","Mechanical Engineering","Computer Engineering","Practical experience"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:51:42.505Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Remote-Friendly, United States"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Hyperscale datacenter environments, Cluster deployment, Large-scale IT integration, Infrastructure programs, AI, HPC, High-density compute 
clusters, Interconnect reach, Adjacency, Power density, Thermal limits, IT hardware, Facility infrastructure, Interface standards, Cluster topology, Portfolio strategy, Execution discipline, Systems thinking, Communication, Technical audiences, Executive audiences, Complex programs, Decisions, Tradeoffs, Leadership, Bachelor's degree, Electrical Engineering, Mechanical Engineering, Computer Engineering, Practical experience","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":320000,"maxValue":405000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_ff4d3a91-b20"},"title":"Principal Engineer - Perf and Benchmarking","description":"<p>We&#39;re looking for a Principal Engineer to be the technical lead of CoreWeave&#39;s Benchmarking &amp; Performance team. You will be responsible for our planet-scale performance data warehouse: ingesting, storing, transforming, and analyzing performance events in all the data centers across our global infrastructure.</p>\n<p>You will also be an integral part of achieving industry-leading end-to-end performance benchmarking publications: if MLPerf (Training &amp; Inference), working closely with NVIDIA (Megatron-LM, TensorRT-LLM &amp; DGX Cloud), and collaborating with the open-source community (llm-d, vLLM, and all popular ML frameworks) speak to you, come help us demonstrate CoreWeave&#39;s performance and reliability leadership in the field.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Strategy &amp; Leadership - Define the multi-year benchmarking strategy and roadmap; prioritize models/workloads (LLMs, diffusion, vision, speech) and hardware tiers. Build, lead, and mentor a high-performing team of performance engineers and data analysts. 
Establish governance for claims: documented methodologies, versioning, reproducibility, and audit trails.</li>\n</ul>\n<ul>\n<li>Perf Ownership - Lead end-to-end MLPerf Inference and Training submissions: workload selection, cluster planning, runbooks, audits, and result publication. Coordinate optimization tracks with NVIDIA (CUDA, cuDNN, TensorRT/TensorRT-LLM, Triton, NCCL) to hit competitive results; drive upstream fixes where needed.</li>\n</ul>\n<ul>\n<li>Internal Latency &amp; Throughput Benchmarks - Design a Kubernetes-native, repeatable benchmarking service that exercises CoreWeave stacks across SUNK (Slurm on Kubernetes), Kueue, and Kubeflow pipelines. Measure and report p50/p95/p99 latency, jitter, tokens/s, time-to-first-token, cold-start/warm-start, and cost-per-token/request across models, precisions (BF16/FP8/FP4), batch sizes, and GPU types. Maintain a corpus of representative scenarios (streaming, batch, multi-tenant) and data sets; automate comparisons across software releases and hardware generations.</li>\n</ul>\n<ul>\n<li>Tooling &amp; Automation - Build CI/CD pipelines and K8s controllers/operators to schedule benchmarks at scale; integrate with observability stacks (Prometheus, Grafana, OpenTelemetry) and results warehouses. Implement supply-chain integrity for benchmark artifacts (SBOMs, Cosign signatures).</li>\n</ul>\n<ul>\n<li>Cross-functional &amp; Community - Partner with NVIDIA, key ISVs, and OSS projects (vLLM, Triton, KServe, PyTorch/DeepSpeed, ONNX Runtime) to co-develop optimizations and upstream improvements. 
Support Sales/SEs with authoritative numbers for RFPs and competitive evaluations; brief analysts and press with rigorous, defensible data.</li>\n</ul>\n<p><strong>Requirements</strong></p>\n<ul>\n<li>10+ years building distributed systems or HPC/cloud services, with deep expertise on large-scale ML training or similar high-performance workloads.</li>\n</ul>\n<ul>\n<li>Proven track record of architecting or building planet-scale data systems (e.g., telemetry platforms, observability stacks, cloud data warehouses, large-scale OLAP engines).</li>\n</ul>\n<ul>\n<li>Deep understanding of GPU performance (CUDA, NCCL, RDMA, NVLink/PCIe, memory bandwidth), model-server stacks (Triton, vLLM, TensorRT-LLM, TorchServe), and distributed training frameworks (PyTorch FSDP/DeepSpeed/Megatron-LM).</li>\n</ul>\n<ul>\n<li>Proficient with Kubernetes and ML control planes; familiarity with SUNK, Kueue, and Kubeflow in production environments.</li>\n</ul>\n<ul>\n<li>Excellent communicator able to interface with executives, customers, auditors, and OSS communities.</li>\n</ul>\n<p><strong>Nice to have</strong></p>\n<ul>\n<li>Experience with time-series databases, log-structured merge trees (LSM), or custom storage engine development.</li>\n</ul>\n<ul>\n<li>Experience running MLPerf submissions (Inference and/or Training) or equivalent audited benchmarks at scale.</li>\n</ul>\n<ul>\n<li>Contributions to MLPerf, Triton, vLLM, PyTorch, KServe, or similar OSS projects.</li>\n</ul>\n<ul>\n<li>Experience benchmarking multi-region fleets and large clusters (thousands of GPUs).</li>\n</ul>\n<ul>\n<li>Publications/talks on ML performance, latency engineering, or large-scale benchmarking methodology.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a 
href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_ff4d3a91-b20","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4627302006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$206,000 to $333,000","x-skills-required":["Distributed systems","HPC/cloud services","Large-scale ML training","GPU performance","Model-server stacks","Distributed training frameworks","Kubernetes","ML control planes","Time-series databases","Log-structured merge trees","Custom storage engine development"],"x-skills-preferred":["MLPerf submissions","Audited benchmarks","Contributions to OSS projects","Benchmarking multi-region fleets","Large clusters","Publications/talks on ML performance"],"datePosted":"2026-04-18T15:51:17.448Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Sunnyvale, CA / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Distributed systems, HPC/cloud services, Large-scale ML training, GPU performance, Model-server stacks, Distributed training frameworks, Kubernetes, ML control planes, Time-series databases, Log-structured merge trees, Custom storage engine development, MLPerf submissions, Audited benchmarks, Contributions to OSS projects, Benchmarking multi-region fleets, Large clusters, Publications/talks on ML performance","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":206000,"maxValue":333000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_594b20c4-c28"},"title":"Infrastructure Engineer, Security","description":"<p>We&#39;re looking for an infrastructure 
engineer to own and evolve the security infrastructure that underpins our foundation models. In this role, you&#39;ll work across compute, storage, networking, and data platforms, making sure our systems are secure, reliable, and built to scale.</p>\n<p>You&#39;ll shape controls, architecture, and tooling so that security is part of how the platform works by default. You&#39;ll partner closely with research and product teams, enabling them to move quickly while keeping our models, data, and environments protected.</p>\n<p>Key responsibilities include:</p>\n<p>Architecting security patterns for platforms and services, including network segmentation, service-to-service authentication, RBAC, and policy enforcement in Kubernetes and cloud environments.</p>\n<p>Managing identity, access, and secrets for humans and services: workload and cross-cloud identity, least-privilege IAM, and secrets management.</p>\n<p>Building secure platforms for data ingestion, processing, and curation: classification, encryption, access controls, and safe sharing patterns across teams.</p>\n<p>Writing threat models and reviewing designs with researchers and engineers to help them ship features and experiments in a safe, scalable way.</p>\n<p>Automating security checks and building guardrails: policy-as-code, secure infrastructure baselines, validation in CI/CD, and tools that make the secure path the easiest one.</p>\n<p>Requirements include:</p>\n<p>Bachelor&#39;s degree or equivalent experience in engineering, or similar.</p>\n<p>Strong background with containers and orchestration (e.g., Kubernetes) and how to secure them (namespaces, network policies, pod security, admission controls, etc.).</p>\n<p>Practical experience with Infrastructure as Code (Terraform or similar), including secure patterns for provisioning networks, IAM, and shared services.</p>\n<p>Solid understanding of cloud networking and security: VPCs, load balancers, service discovery, mTLS, firewalls, and zero-trust-style 
architectures.</p>\n<p>Proficiency with a systems language such as Rust and scripting in Python for building platform components and internal tools.</p>\n<p>Evidence of owning complex, production-critical systems, including debugging issues that span infra, security, and application layers.</p>\n<p>Preferred qualifications include experience with ML infrastructure, GPU clusters, or large-scale training environments, as well as background in AI labs, HPC environments, or ML-heavy organizations.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_594b20c4-c28","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Thinking Machines Lab","sameAs":"https://thinkingmachineslab.com/","logo":"https://logos.yubhub.co/thinkingmachineslab.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/thinkingmachines/jobs/5015964008","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$200,000 - $475,000 USD","x-skills-required":["Kubernetes","Infrastructure as Code","Cloud Networking and Security","Systems Language (Rust)","Scripting (Python)"],"x-skills-preferred":["ML Infrastructure","GPU Clusters","Large-Scale Training Environments","AI Labs","HPC Environments"],"datePosted":"2026-04-18T15:50:20.174Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Kubernetes, Infrastructure as Code, Cloud Networking and Security, Systems Language (Rust), Scripting (Python), ML Infrastructure, GPU Clusters, Large-Scale Training Environments, AI Labs, HPC 
Environments","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":200000,"maxValue":475000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_854e95b5-76b"},"title":"Sr. Director of Product, Research and Training Infrastructure","description":"<p>CoreWeave is seeking a visionary Sr. Director of Product, Research and Training Infrastructure to lead the product strategy and engineering execution for the services that power the most ambitious AI research labs in the world.</p>\n<p>This executive leader will own the product strategy and engineering execution for the Research Training Stack, focusing on the specialized orchestration, evaluation, and iteration tools required for massive-scale pre-training and post-training.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Frontier Orchestration: Oversee the evolution of SUNK (Slurm on Kubernetes) to provide researchers with deterministic, bare-metal performance through a cloud-native interface.</li>\n</ul>\n<ul>\n<li>Holistic Training Services: Drive the development of next-generation orchestrators and automated training-based evaluation frameworks that ensure model quality throughout the lifecycle.</li>\n</ul>\n<ul>\n<li>Post-Training Excellence: Build the infrastructure required for sophisticated Reinforcement Learning (RL) and RLHF pipelines, enabling labs to refine foundation models with maximum efficiency.</li>\n</ul>\n<ul>\n<li>Customer Advocacy: Act as the primary technical partner for lead researchers at global AI labs, translating their &#39;future-state&#39; requirements into actionable product roadmaps.</li>\n</ul>\n<p>Requirements include:</p>\n<ul>\n<li>Proven engineering leadership experience, with 5+ years managing large-scale infrastructure at a top-tier research lab or an AI-native cloud provider.</li>\n</ul>\n<ul>\n<li>Deep, hands-on 
knowledge of Slurm, Kubernetes, and the specific networking requirements (InfiniBand/RDMA) for distributed training clusters.</li>\n</ul>\n<ul>\n<li>Research mindset and understanding of the &#39;pain points&#39; of a research scientist.</li>\n</ul>\n<ul>\n<li>Scaling experience delivering mission-critical services on multi-thousand GPU clusters (H100/Blackwell/Rubin architectures).</li>\n</ul>\n<ul>\n<li>Strategic vision to define &#39;what&#39;s next&#39; in the AI stack, from automated RL loops to specialized sandbox environments.</li>\n</ul>\n<p>Why CoreWeave?</p>\n<p>In 2026, CoreWeave is the foundation of the largest infrastructure buildout in human history. We are building AI Factories, not just data centers.</p>\n<ul>\n<li>Silicon-Up Innovation: Work directly with the latest NVIDIA architectures.</li>\n</ul>\n<ul>\n<li>Impact: You will be the architect of the environment that enables the next discovery.</li>\n</ul>\n<ul>\n<li>Velocity: We move at the speed of the researchers we support, bypassing legacy cloud bottlenecks to deliver raw power.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_854e95b5-76b","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4665964006","x-work-arrangement":"hybrid","x-experience-level":"executive","x-job-type":"full-time","x-salary-range":"$233,000 to $341,000","x-skills-required":["Slurm","Kubernetes","InfiniBand/RDMA","Distributed training clusters","GPU clusters","H100/Blackwell/Rubin architectures","Reinforcement Learning (RL)","RLHF pipelines"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:50:11.130Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ / New York, NY / 
Sunnyvale, CA / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Slurm, Kubernetes, InfiniBand/RDMA, Distributed training clusters, GPU clusters, H100/Blackwell/Rubin architectures, Reinforcement Learning (RL), RLHF pipelines","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":233000,"maxValue":341000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_a8092b6e-7f5"},"title":"Bare Metal Support Engineer","description":"<p>As a Bare Metal Support Engineer at CoreWeave, you will be responsible for supporting, operating, and maintaining CoreWeave&#39;s extensive GPU fleet across our growing data centers in the U.S., Europe, and beyond.</p>\n<p>You will work closely with customers, data center technicians, and engineering teams to ensure the reliability, performance, and scalability of our infrastructure.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Providing high-level support for customers utilizing bare-metal GPU fleets on CoreWeave Cloud.</li>\n<li>Diagnosing, triaging, and investigating reported customer issues and high-priority incidents, identifying root causes and escalating when necessary.</li>\n<li>Developing a deep understanding of customer workloads and use cases to provide tailored technical support.</li>\n<li>Coordinating remote troubleshooting and hardware interventions with Data Center Technicians.</li>\n<li>Creating and maintaining internal documentation, including troubleshooting guides, best practices, and knowledge base articles.</li>\n<li>Participating in an on-call rotation to support production clusters and ensure operational reliability.</li>\n<li>Collaborating with engineering teams to improve hardware reliability, software stability, and system performance.</li>\n<li>Implementing automation and scripting to streamline support 
workflows and reduce manual interventions.</li>\n<li>Performing in-depth log analysis and debugging across multiple layers of the stack (firmware, drivers, hardware).</li>\n<li>Providing feedback to internal teams on common support issues to drive continuous improvements.</li>\n<li>Working with networking teams to troubleshoot connectivity issues affecting customer workloads.</li>\n<li>Supporting supercomputing infrastructure running GPU workloads at scale.</li>\n<li>Driving operational excellence by refining internal processes and support methodologies.</li>\n</ul>\n<p>To succeed in this role, you will need:</p>\n<ul>\n<li>Experience in data centers, GPU clusters, server deployments, system administration, or hardware troubleshooting.</li>\n<li>Demonstrated experience driving resolutions and continuous improvements across cross-functional environments and teams within a data center environment.</li>\n<li>Intermediate knowledge of Linux (Ubuntu, CentOS, or similar), including command-line proficiency.</li>\n<li>Experience with NVIDIA GPUs, SuperMicro systems, Dell systems, high-performance computing (HPC), and large-scale data center environments.</li>\n<li>Experience in networking fundamentals (TCP/IP, VLANs, DNS, DHCP) and troubleshooting tools.</li>\n<li>Hands-on experience with firmware updates, BIOS configurations, and driver management.</li>\n<li>Experience analyzing system logs and debugging issues across firmware, drivers, and hardware layers.</li>\n<li>Experience working with Jira, Confluence, Notion, or other issue-tracking and documentation platforms.</li>\n<li>Experience in scripting and automation (Python, Bash, Ansible, or similar).</li>\n</ul>\n<p>If you&#39;re a curious and analytical individual with a passion for problem-solving and a desire to work in a fast-paced environment, we&#39;d love to hear from you!</p>
","url":"https://yubhub.co/jobs/job_a8092b6e-7f5","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4560350006","x-work-arrangement":"hybrid","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":"$83,000 to $132,000","x-skills-required":["Linux","GPU clusters","server deployments","system administration","hardware troubleshooting","NVIDIA GPUs","SuperMicro systems","Dell systems","high-performance computing","large-scale data center environments","networking fundamentals","troubleshooting tools","firmware updates","BIOS configurations","driver management","system logs","debugging issues","Jira","Confluence","Notion","issue-tracking","documentation platforms","scripting","automation"],"x-skills-preferred":["Kubernetes","Docker","containerized infrastructure"],"datePosted":"2026-04-18T15:49:58.535Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Linux, GPU clusters, server deployments, system administration, hardware troubleshooting, NVIDIA GPUs, SuperMicro systems, Dell systems, high-performance computing, large-scale data center environments, networking fundamentals, troubleshooting tools, firmware updates, BIOS configurations, driver management, system logs, debugging issues, Jira, Confluence, Notion, issue-tracking, documentation platforms, scripting, automation, Kubernetes, Docker, containerized 
infrastructure","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":83000,"maxValue":132000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_67b4ccd7-51d"},"title":"Senior Software Engineer, Observability Insights","description":"<p>Join CoreWeave&#39;s Observability team, where we are building the next-generation insights layer for AI systems.</p>\n<p>Our team empowers internal and external users to understand, troubleshoot, and optimize complex AI workloads by transforming telemetry into actionable insights.</p>\n<p>As a Senior Software Engineer on the Observability Insights team, you will lead the development of agentic interfaces and product experiences that sit atop CoreWeave&#39;s telemetry layer.</p>\n<p>You&#39;ll design multi-tenant APIs, managed Grafana experiences, and MCP-based tool servers to help customers and internal teams interact with data in innovative ways.</p>\n<p>Collaborating closely with PMs and engineering leadership, your work will shape the end-to-end observability experience and influence how people engage with cutting-edge AI infrastructure.</p>\n<p><strong>About the role</strong></p>\n<ul>\n<li>6+ years of experience in software or infrastructure engineering building production-grade backend systems and distributed APIs.</li>\n</ul>\n<ul>\n<li>Strong focus on developer-facing infrastructure, with a customer-obsessed approach to SDKs, CLIs, and APIs.</li>\n</ul>\n<ul>\n<li>Proficient in reliability engineering, including fault-tolerant design, SLOs, error budgets, and multi-tenant system resilience.</li>\n</ul>\n<ul>\n<li>Familiar with observability systems such as ClickHouse, Loki, VictoriaMetrics, Prometheus, and Grafana.</li>\n</ul>\n<ul>\n<li>Experienced in agentic applications or LLM-based features, including grounding, tool calling, and operational 
safety.</li>\n</ul>\n<ul>\n<li>Comfortable writing production code primarily in Go, with the ability to integrate Python components when needed.</li>\n</ul>\n<ul>\n<li>Collaborative experience in agile teams delivering end-to-end telemetry-to-insights pipelines.</li>\n</ul>\n<p><strong>Preferred</strong></p>\n<ul>\n<li>Experience operating Kubernetes clusters at scale, especially for AI workloads.</li>\n</ul>\n<ul>\n<li>Hands-on experience with logging, tracing, and metrics platforms in production, with deep knowledge of cardinality, indexing, and query optimization.</li>\n</ul>\n<ul>\n<li>Experienced in running distributed systems or API services at cloud scale, including event streaming and data pipeline management.</li>\n</ul>\n<ul>\n<li>Familiarity with LLM frameworks, MCP, and agentic tooling (e.g., Langchain, AgentCore).</li>\n</ul>\n<p><strong>Why CoreWeave?</strong></p>\n<p>At CoreWeave, we work hard, have fun, and move fast!</p>\n<p>We&#39;re in an exciting stage of hyper-growth that you will not want to miss out on.</p>\n<p>We&#39;re not afraid of a little chaos, and we&#39;re constantly learning.</p>\n<p>Our team cares deeply about how we build our product and how we work together, which is represented through our core values:</p>\n<ul>\n<li>Be Curious at Your Core</li>\n</ul>\n<ul>\n<li>Act Like an Owner</li>\n</ul>\n<ul>\n<li>Empower Employees</li>\n</ul>\n<ul>\n<li>Deliver Best-in-Class Client Experiences</li>\n</ul>\n<ul>\n<li>Achieve More Together</li>\n</ul>\n<p>We support and encourage an entrepreneurial outlook and independent thinking.</p>\n<p>We foster an environment that encourages collaboration and enables the development of innovative solutions to complex problems.</p>\n<p>As we get set for takeoff, the organization&#39;s growth opportunities are constantly expanding.</p>\n<p>You will be surrounded by some of the best talent in the industry, who will want to learn from you, too.</p>\n<p>Come join us!</p>
","url":"https://yubhub.co/jobs/job_67b4ccd7-51d","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4650163006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$165,000 to $242,000","x-skills-required":["software engineering","infrastructure engineering","backend systems","distributed APIs","reliability engineering","fault-tolerant design","SLOs","error budgets","multi-tenant system resilience","observability systems","ClickHouse","Loki","VictoriaMetrics","Prometheus","Grafana","agentic applications","LLM-based features","grounding","tool calling","operational safety","Go","Python","Kubernetes","logging","tracing","metrics platforms","cardinality","indexing","query optimization","event streaming","data pipeline management","LLM frameworks","MCP","agent tooling"],"x-skills-preferred":["operating Kubernetes clusters"],"datePosted":"2026-04-18T15:48:46.219Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"New York, NY / Sunnyvale, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"software engineering, infrastructure engineering, backend systems, distributed APIs, reliability engineering, fault-tolerant design, SLOs, error budgets, multi-tenant system resilience, observability systems, ClickHouse, Loki, VictoriaMetrics, Prometheus, Grafana, agentic applications, LLM-based features, grounding, tool calling, operational safety, Go, Python, Kubernetes, logging, tracing, metrics platforms, cardinality, indexing, query optimization, event streaming, data pipeline management, LLM frameworks, MCP, agent tooling, operating 
Kubernetes clusters","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":165000,"maxValue":242000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_60082588-bf0"},"title":"Cluster Deployment Engineer","description":"<p>As a Cluster Deployment Engineer at Anthropic, you will own how large-scale AI compute clusters physically come together inside our datacenter fleet.</p>\n<p>You will set the deployment-engineering strategy for cluster build-out: how racks are organized into pods, halls, and sites; how compute, network, power, and cooling systems interface at the rack boundary; and how deployment scope flows cleanly from hardware specification to facility delivery to a running cluster.</p>\n<p>This role is focused on deployment engineering, not on datacenter network or systems design; your scope is making sure clusters land cleanly and predictably, not designing the fabrics or facilities themselves.</p>\n<p>You will work across hardware, networking, facilities, supply chain, and construction to ensure that every generation of accelerator we deploy lands in a datacenter that is ready for it: on schedule, at full density, and with every piece of required infrastructure accounted for.</p>\n<p>You will be the person who sees around corners: anticipating how next-generation rack designs will stress our facilities, where our deployment model will break at scale, and what needs to change now so that the next cluster turn-up is faster and more predictable than the last.</p>\n<p>You will operate at the intersection of engineering strategy and execution discipline, partnering with internal research and systems teams, external developers, engineering firms, and OEM partners to deliver cluster capacity at the speed the frontier demands.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Own cluster-level deployment strategy: define how AI 
compute clusters are organized across the floor, how racks interconnect, and how cluster topology requirements translate into facility and deployment scope across a portfolio of sites.</li>\n</ul>\n<ul>\n<li>Set rack interface standards spanning power, network, mechanical, thermal, and spatial domains, and ensure that every deployment includes the complete set of infrastructure required to bring a cluster online.</li>\n</ul>\n<ul>\n<li>Drive multi-threaded cluster bring-up programs across hardware, networking, power, and cooling, owning plans, dependencies, and critical paths from hardware specification through energization and turn-up.</li>\n</ul>\n<ul>\n<li>Partner with internal engineering teams (research, systems, networking, and hardware) to translate cluster requirements into deployable facility scope, and to derisk onboarding of new hardware platforms well ahead of delivery.</li>\n</ul>\n<ul>\n<li>Lead external partner execution with developers, engineering firms, OEMs, and construction teams, driving technical reviews, deviation management, and handoffs that keep deployments on schedule and within specification.</li>\n</ul>\n<ul>\n<li>Improve cluster turn-up reliability and repeatability: identify systemic gaps in deployment scope, tooling, and partner interfaces, and drive durable fixes that reduce time-to-serve for new capacity.</li>\n</ul>\n<ul>\n<li>Define and track deployment KPIs (cluster readiness, schedule adherence, scope completeness, time-to-first-packet) and use historical trends to forecast risk and inform capacity planning.</li>\n</ul>\n<ul>\n<li>Coordinate cross-functional readiness across supply chain, security, operations, and construction to ship production-ready compute capacity.</li>\n</ul>\n<ul>\n<li>Provide crisp executive visibility on deployment progress, tradeoffs, and risks across a portfolio of concurrent cluster programs.</li>\n</ul>\n<ul>\n<li>Design cluster interfaces for durability: define rack and cluster-level 
interfaces that remain robust across hardware generations, so that facility scope and deployment models do not need to be reinvented every time the underlying hardware changes.</li>\n</ul>\n<ul>\n<li>Build cluster layout and BOM tooling: create and maintain the tools, templates, and data models that turn cluster topology and rack specifications into accurate floor layouts, deployment sequences, and complete bills of materials, replacing one-off spreadsheets with repeatable, auditable workflows.</li>\n</ul>\n<p>You may be a good fit if you:</p>\n<ul>\n<li>Have 10+ years of experience in hyperscale datacenter environments, with senior-level responsibility for cluster deployment, large-scale IT integration, or equivalent infrastructure programs.</li>\n</ul>\n<ul>\n<li>Have delivered AI, HPC, or high-density compute clusters at scale and developed a strong intuition for the constraints that govern cluster deployment: interconnect reach, adjacency, power density, and thermal limits.</li>\n</ul>\n<ul>\n<li>Can operate fluently across the boundary between IT hardware and facility infrastructure, and have set interface standards that held up across multiple hardware generations and sites.</li>\n</ul>\n<ul>\n<li>Have led cross-functional programs with both internal engineering teams and external developers, engineering firms, and OEM partners, and are effective at driving alignment across organizational levels.</li>\n</ul>\n<ul>\n<li>Combine strong systems thinking with execution discipline, comfortable zooming from cluster topology and portfolio strategy down to the specific interface detail that will otherwise become a field issue.</li>\n</ul>\n<ul>\n<li>Communicate clearly with technical and executive audiences, and can distill complex, multi-disciplinary programs into decisions and tradeoffs leadership can act on.</li>\n</ul>\n<ul>\n<li>Thrive in ambiguous, fast-moving environments where the hardware, the scale, and the requirements are all changing 
simultaneously.</li>\n</ul>\n<ul>\n<li>Hold a Bachelor&#39;s degree in Electrical Engineering, Mechanical Engineering, Computer Engineering, or equivalent practical experience.</li>\n</ul>\n<p>Strong candidates may also:</p>\n<ul>\n<li>Have direct experience deploying leading-edge AI accelerator clusters at hyperscale.</li>\n</ul>\n<ul>\n<li>Have shaped reference designs, deployment standards, or cluster-level playbooks that were adopted across a fleet.</li>\n</ul>\n<ul>\n<li>Have experience working across multiple geographies and understand how regional codes, climate, utility constraints, and supply chains shape cluster-level decisions.</li>\n</ul>\n<ul>\n<li>Have partnered closely with hardware and system providers on long-term platform onboarding and bring-up.</li>\n</ul>\n<ul>\n<li>Have experience building the program mechanisms (roadmaps, milestones, risk registers, runbooks) that make delivery predictable at massive scale.</li>\n</ul>\n<p>The annual compensation range for this role is listed below. 
For sales roles, the range provided is the role’s On Target Earnings (“OTE”) range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role.</p>\n<p>Annual Salary: $320,000-$405,000 USD</p>","url":"https://yubhub.co/jobs/job_60082588-bf0","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5191638008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$320,000-$405,000 USD","x-skills-required":["Hyperscale datacenter environments","Cluster deployment","Large-scale IT integration","Infrastructure programs","AI","HPC","High-density compute clusters","Interconnect reach","Adjacency","Power density","Thermal limits","IT hardware","Facility infrastructure","Interface standards","Cluster topology","Portfolio strategy","Execution discipline","Systems thinking","Communication","Technical audiences","Executive audiences","Decision-making","Trade-offs","Leadership","Bachelor's degree","Electrical Engineering","Mechanical Engineering","Computer Engineering","Practical experience"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:36:06.517Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Remote-Friendly, United States"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Hyperscale datacenter environments, Cluster deployment, Large-scale IT integration, Infrastructure programs, AI, HPC, High-density compute clusters, Interconnect reach, Adjacency, Power density, Thermal limits, IT hardware, Facility infrastructure, Interface standards, 
Cluster topology, Portfolio strategy, Execution discipline, Systems thinking, Communication, Technical audiences, Executive audiences, Decision-making, Trade-offs, Leadership, Bachelor's degree, Electrical Engineering, Mechanical Engineering, Computer Engineering, Practical experience","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":320000,"maxValue":405000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_a0051ff6-ddf"},"title":"Facilities Operations Manager","description":"<p>We&#39;re seeking a driven Facilities Operations Manager to join our team and ensure the relentless performance of our data center infrastructure. This role is critical to maintaining the uptime and efficiency of the systems powering our AI breakthroughs.</p>\n<p>As a Facilities Operations Manager, you&#39;ll lead teams, oversee cutting-edge facilities, and solve complex problems in real time to keep our mission on track. 
You&#39;ll own the operation of power, cooling, and monitoring systems at scale, bringing technical depth and a no-excuses mindset to our facility.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Manage all aspects of data center critical infrastructure (switchgear, generators, UPS systems, chillers, liquid cooling, and building monitoring), ensuring 99.999%+ uptime.</li>\n<li>Lead 24x7 teams of facility technicians and vendors, driving safety, execution, and a culture of accountability.</li>\n<li>Troubleshoot and resolve facility emergencies using root cause analysis, acting as the go-to escalation point.</li>\n<li>Spearhead optimization projects, collaborating with engineers to integrate next-gen tech and cut operational costs.</li>\n<li>Own the operations budget, balancing efficiency with performance under tight deadlines.</li>\n<li>Enforce compliance with safety and operational protocols, anticipating regulatory shifts.</li>\n<li>Coordinate with cross-functional teams to deliver high-quality outcomes and boost team morale.</li>\n<li>Support multi-site operations and new facility build-outs as xAI scales.</li>\n</ul>\n<p>Basic Qualifications:</p>\n<ul>\n<li>Minimum of 5 years in data center operations or facility management, ideally with hyperscaler or industrial systems.</li>\n<li>Strong grasp of critical infrastructure: power, cooling, and monitoring systems.</li>\n<li>Proven ability to lead teams and manage projects under pressure.</li>\n<li>Sharp analytical and communication skills.</li>\n</ul>\n<p>Preferred Skills and Experience:</p>\n<ul>\n<li>B.S. 
in Engineering, Facilities Management, or related field; advanced degree a plus.</li>\n<li>Experience with GPU clusters or AI-driven data center environments.</li>\n<li>Methodical troubleshooting and technical leadership chops.</li>\n<li>Familiarity with Southaven, MS area regulations and practices is a bonus.</li>\n<li>Comfort with Excel, Word, and operational tools; CAD or monitoring software knowledge is a plus.</li>\n</ul>","url":"https://yubhub.co/jobs/job_a0051ff6-ddf","directApply":true,"hiringOrganization":{"@type":"Organization","name":"xAI","sameAs":"https://www.xai.com/","logo":"https://logos.yubhub.co/xai.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/xai/jobs/4685202007","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["data center operations","facility management","critical infrastructure","team leadership","project management","analytical skills","communication skills"],"x-skills-preferred":["GPU clusters","AI-driven data center environments","methodical troubleshooting","technical leadership","CAD or monitoring software"],"datePosted":"2026-04-18T15:35:02.637Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Southaven, MS"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"data center operations, facility management, critical infrastructure, team leadership, project management, analytical skills, communication skills, GPU clusters, AI-driven data center environments, methodical troubleshooting, technical leadership, CAD or monitoring software"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_6d03dbf4-1e2"},"title":"Staff Engineer, DevOps","description":"<p>Shield AI is 
building autonomous aircraft that redefine aviation. The Software Integration &amp; Operations (SIO) team builds the pipelines, tools, and systems that make fast, reliable delivery of complex aircraft software possible.</p>\n<p>As a DevOps Build Engineer, you will own and maintain the build system for our autonomous aircraft (XBAT/VBAT). This role is primarily focused on Developer Experience (DevEx) and bringing DevOps best practices to our software engineers to accelerate their development cycles.</p>\n<p>Your responsibilities will include:</p>\n<ul>\n<li>Being the expert on our native (C++) software build process</li>\n<li>Improving developer velocity by reducing build/test cycle time and eliminating pipeline bottlenecks</li>\n<li>Collaborating with Software Engineers to understand their requirements, propose designs, and implement identified solutions</li>\n<li>Partnering with developers to resolve build/test failures quickly and set best practices for integration</li>\n<li>Integrating CI pipelines with HIL and simulation environments for automated validation</li>\n<li>Implementing monitoring, dashboards, and alerts related to build system performance for both local development and CI workflows</li>\n<li>Using agentic solutions to aid in quickly identifying root cause around build failures and getting the appropriate feedback to a Software Engineer</li>\n</ul>\n<p>The ideal candidate will have:</p>\n<ul>\n<li>A BS in CS, CE, EE, or related field with 7+ years industry experience (or MS/PhD with 5+ years)</li>\n<li>Deep CMake/C++ expertise or equivalent</li>\n<li>Strong understanding of dependency management and proficiency with either Conan or CPM</li>\n<li>Proficiency with containers and orchestration (Docker, Kubernetes, cloud build clusters)</li>\n<li>Strong scripting skills (Python, Bash)</li>\n<li>Hands-on experience integrating automated testing into CI pipelines</li>\n<li>Demonstrated success improving developer productivity at 
scale</li>\n<li>Demonstrated track record of excellence as a primary contributor on projects - showcasing the ability to navigate through development cycles, overcome obstacles, and deliver high-quality solutions in a fast-paced environment</li>\n</ul>\n<p>Additional desired skills include:</p>\n<ul>\n<li>Proficiency with CICD tools (Azure DevOps, Jenkins, Gitlab)</li>\n<li>Cloud Operations experience (Azure, AWS, GCP)</li>\n<li>Familiarity with infrastructure-as-code (Terraform, Ansible) and monitoring (Grafana, Prometheus)</li>\n<li>Background in robotics, autonomy, or aerospace systems</li>\n</ul>","url":"https://yubhub.co/jobs/job_6d03dbf4-1e2","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Shield AI","sameAs":"https://www.shield.ai","logo":"https://logos.yubhub.co/shield.ai.png"},"x-apply-url":"https://jobs.lever.co/shieldai/1d77e789-a2be-414c-9a1f-f4bc284343ea","x-work-arrangement":"onsite","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$150,000 - $220,000 a year","x-skills-required":["CMake","C++","Conan","CPM","Docker","Kubernetes","cloud build clusters","Python","Bash","automated testing","CI pipelines","HIL","simulation environments","monitoring","dashboards","alerts"],"x-skills-preferred":["CICD tools","Cloud Operations","infrastructure-as-code","robotics","autonomy","aerospace systems"],"datePosted":"2026-04-17T13:03:25.275Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Diego, California"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"CMake, C++, Conan, CPM, Docker, Kubernetes, cloud build clusters, Python, Bash, automated testing, CI pipelines, HIL, simulation environments, monitoring, dashboards, alerts, CICD tools, Cloud Operations, infrastructure-as-code, robotics, 
autonomy, aerospace systems","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":150000,"maxValue":220000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_d5b743bb-d8f"},"title":"Product Manager, AI Platforms","description":"<p>The AI Platform Product Manager will drive the strategy and execution of Shield AI&#39;s next-generation autonomy intelligence stack. This PM owns the product vision and roadmap for the Hivemind AI Platform, ensuring we can manufacture, govern, and field advanced world models, robotics foundation models, and vision-language-action systems safely and at scale.</p>\n<p>This role sits at the intersection of AI/ML, autonomy, model lifecycle, infrastructure, and product strategy. The PM partners closely with engineering, AI research, Hivemind Solutions, and field teams to deliver the tooling that enables sovereign autonomy, AI Factories at the edge, and continuous learning: capabilities that are central to Shield AI&#39;s strategic direction.</p>\n<p>This is a high-impact role for an experienced product leader excited to define how foundation models are trained, validated, governed, and deployed across thousands of autonomous systems in highly contested environments.</p>\n<p><strong>Responsibilities:</strong></p>\n<ul>\n<li>AI Model Development &amp; Training Platform</li>\n</ul>\n<p>Own the roadmap for foundation model training workflows, including dataset ingestion, curation, labeling, synthetic data generation, domain model training, and distillation pipelines. Define requirements for world models, robotics models, and VLA-based training, evaluation, and specialization. 
Lead the evolution of MLOps capabilities in Forge, including data lineage, experiment tracking, model versioning, and scalable evaluation suites.</p>\n<ul>\n<li>Data, Simulation &amp; Synthetic Data Factory</li>\n</ul>\n<p>Define product requirements for synthetic data generation, simulation-integrated data flywheels, and automated scenario generation. Partner with Digital Twin, Simulation, and autonomy teams to convert natural-language mission inputs into data needs, training procedures, and model variants.</p>\n<ul>\n<li>Safe Deployment &amp; Model Governance</li>\n</ul>\n<p>Lead the development of model governance and auditability tooling, including model cards, dataset rights, lineage tracking, safety gates, and compliance evidence. Build guardrails and workflows to safely deploy models onto edge hardware in disconnected, GPS- or comms-denied environments. Partner with Safety, Certification, Cyber, and Engineering teams to ensure traceability and evaluation pipelines meet operational and accreditation requirements.</p>\n<ul>\n<li>Edge Deployment &amp; AI Factory Integration</li>\n</ul>\n<p>Partner with Pilot, EdgeOS, and hardware teams to integrate foundation-model-based perception and reasoning into autonomy behaviors. Define requirements for distillation, quantization, and inference tooling as part of the “three-computer” development and deployment model. Ensure closed-loop workflows between cloud model training and edge-native execution.</p>\n<ul>\n<li>Cross-Functional Leadership</li>\n</ul>\n<p>Collaborate with Engineering, Research, Product, Customer Engagement, and Solutions teams to ensure model outputs meet mission and platform constraints. Translate advanced AI capabilities into intuitive workflows that platform OEMs and partner nations can use to build sovereign AI factories. 
Sequence foundational capabilities that unblock autonomy, simulation, and customer-facing product teams.</p>\n<ul>\n<li>User &amp; Customer Impact</li>\n</ul>\n<p>Develop deep empathy for ML engineers, autonomy developers, and Solutions engineers who rely on the platform. Capture operational data gaps, mission-driven model needs, and domain-specific specialization requirements. Lead demos and onboarding for model-development capabilities across internal and external teams.</p>","url":"https://yubhub.co/jobs/job_d5b743bb-d8f","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Shield AI","sameAs":"https://www.shield.ai","logo":"https://logos.yubhub.co/shield.ai.png"},"x-apply-url":"https://jobs.lever.co/shieldai/7886f437-2d5e-4616-8dcb-3dc488f1f585","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$190,000 - $290,000 a year","x-skills-required":["AI Model Development & Training Platform","Data, Simulation & Synthetic Data Factory","Safe Deployment & Model Governance","Edge Deployment & AI Factory Integration","Cross-Functional Leadership","User & Customer Impact","Strong engineering background","Deep understanding of foundation models, robotics models, multimodal models, MLOps, and training infrastructure","Experience managing complex products spanning data pipelines, cloud training clusters, model governance, and edge deployments","Proven success partnering with research teams to transition ML innovations into stable, production-grade workflows"],"x-skills-preferred":["Experience working on autonomy, robotics, embedded AI, or mission-critical systems","Hands-on familiarity with GPU infrastructure, distributed training, or data lakehouse architectures","Experience supporting defense, dual-use, or safety-critical AI systems","Background designing or operating AI 
Factory–style pipelines (data → training → evaluation → distillation → edge deployment)","Advanced degree in engineering, ML/AI, robotics, or a related field"],"datePosted":"2026-04-17T13:02:54.419Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Diego"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"AI Model Development & Training Platform, Data, Simulation & Synthetic Data Factory, Safe Deployment & Model Governance, Edge Deployment & AI Factory Integration, Cross-Functional Leadership, User & Customer Impact, Strong engineering background, Deep understanding of foundation models, robotics models, multimodal models, MLOps, and training infrastructure, Experience managing complex products spanning data pipelines, cloud training clusters, model governance, and edge deployments, Proven success partnering with research teams to transition ML innovations into stable, production-grade workflows, Experience working on autonomy, robotics, embedded AI, or mission-critical systems, Hands-on familiarity with GPU infrastructure, distributed training, or data lakehouse architectures, Experience supporting defense, dual-use, or safety-critical AI systems, Background designing or operating AI Factory–style pipelines (data → training → evaluation → distillation → edge deployment), Advanced degree in engineering, ML/AI, robotics, or a related field","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":190000,"maxValue":290000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_b7bde4cf-9c8"},"title":"Datacenter Hardware Engineer, HPC","description":"<p>About Mistral</p>\n<p>At Mistral AI, we believe in the power of AI to simplify tasks, save time, and enhance learning and creativity. 
Our technology is designed to integrate seamlessly into daily working life.</p>\n<p>Our compute footprint is growing fast to support our science and engineering teams. We’re hiring a Datacenter HW Engineer to maintain, troubleshoot, and scale our GPU/CPU clusters safely and reliably.</p>\n<p>You’ll execute hands-on hardware work in our Paris-area datacenter and partner with hardware owners, DC operations, and vendors to keep one of France’s largest GPU clusters healthy.</p>\n<p>Location: Bruyères-le-Châtel, on-site, field role</p>\n<p>Reporting line: Hardware Ops</p>\n<p>Impact</p>\n<p>• Compute is a key lever for Mistral’s success and our largest spend item.\n• Direct impact on scale: your work keeps one of France’s largest AI clusters healthy as we grow to unprecedented scale.\n• Enable breakthrough AI: you unlock our science &amp; engineering teams to deliver groundbreaking AI solutions.</p>\n<p>Responsibilities</p>\n<p>• Diagnose &amp; operate core server/cluster components - Investigate and handle compute/storage hardware issues (CPU, memory, drives, NICs, GPUs, PSUs) and interconnect problems (switches, cables, transceivers; Ethernet/InfiniBand).\n• Safety &amp; procedures - Apply lockout/tagout (LOTO) and ESD discipline; follow pre/post-work checklists; maintain tidy, safe work areas.\n• First-line diagnostics - Triage using LEDs, POST, beep codes and basic tests; capture evidence (photos, serials, results); open/update/close tickets with clear notes.\n• Preventive maintenance - Provide feedback and ideas to improve proactive activities, monitoring, and targeted follow-ups on recurring or specific anomalies; help turn ad-hoc checks into SOPs, alerts, and dashboards.\n• Parts &amp; logistics - Receive and track parts, keep labeled inventory accurate, manage simple RMAs, and coordinate with vendors.\n• Collaboration &amp; escalation - Partner with senior hardware/firmware owners on complex or multi-node issues; communicate status and next steps crisply.\n• 
Documentation &amp; quality - Keep SOPs/checklists current; ensure zero undocumented changes and consistent, audit-ready records.</p>\n<p>About you</p>\n<p>• Hands-on mindset in datacenters/server hardware: you can install/re-seat/swap GPU/PCIe cards, NICs, PSUs, drives, and work cleanly in racks (rails, cabling, labeling).\n• Disciplined and meticulous: follows checklists, ESD/LOTO; no rough handling; careful with all high-value server components.\n• Practical electrical basics: power-off, PPE, short-circuit risk awareness.\n• Comfortable in racks: cooling, network, storage, PDU, cable management; can lift/mount safely (within HSE limits).\n• Clear communicator: short factual updates; reliable teammate; punctual and process-minded.\n• Hardware-passionate, professionally grounded: strong curiosity and craft mindset.</p>\n<p>Nice to have</p>\n<p>• HPC/AI/Cloud at scale experience (production environments), large-fleet/server install &amp; maintenance in datacenters.\n• Basic networking (Ethernet/InfiniBand) and basic Linux (boot/check; no coding needed).\n• Coding/automation skills (Python/Bash): small tools/scripts to improve checklists, photo/serial capture, inventory sync, or simple monitoring/reporting.\n• Experience with inventory/RMA tools and vendor coordination.\n• Exposure to HPC/research/industrial environments.</p>\n<p>What we offer</p>\n<p>💰 Competitive salary and equity package</p>\n<p>🧑‍⚕️ Health insurance</p>\n<p>🚴 Transportation allowance</p>\n<p>🥎 Sport allowance</p>\n<p>🥕 Meal vouchers</p>\n<p>💰 Private pension plan</p>\n<p>🍼 Generous parental leave policy</p>","url":"https://yubhub.co/jobs/job_b7bde4cf-9c8","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Mistral 
AI","sameAs":"https://mistral.ai/careers","logo":"https://logos.yubhub.co/mistral.ai.png"},"x-apply-url":"https://jobs.lever.co/mistral/ddf7bcbb-e223-4768-a553-6e95df472cf7","x-work-arrangement":"onsite","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["GPU/CPU clusters","server hardware","Linux fundamentals","scripting","electrical basics","networking","inventory management"],"x-skills-preferred":["HPC/AI/Cloud at scale experience","basic Linux","coding/automation skills","inventory/RMA tools","vendor coordination"],"datePosted":"2026-04-17T12:47:08.660Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Paris"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"GPU/CPU clusters, server hardware, Linux fundamentals, scripting, electrical basics, networking, inventory management, HPC/AI/Cloud at scale experience, basic Linux, coding/automation skills, inventory/RMA tools, vendor coordination"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_c1dcea75-d5a"},"title":"Member of Technical Staff - Infrastructure Engineer","description":"<p>We&#39;re looking for an experienced engineer to join our team in Freiburg, Germany or San Francisco, USA. As a Member of Technical Staff - Infrastructure Engineer, you will be responsible for maintaining and scaling our research infrastructure, ensuring health and optimizing components to extract peak performance from the system. 
You will also collaborate with research teams to deeply understand their infrastructure needs and design solutions that balance performance with cost efficiency.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Maintaining research infrastructure, ensuring health, and optimizing components to extract peak performance from the system (both on application and infrastructure side)</li>\n<li>Scaling infrastructure to meet growing research demands while maintaining reliability and performance</li>\n<li>Collaborating with research teams to deeply understand their infrastructure needs, and design solutions that balance performance with cost efficiency</li>\n<li>Identifying and resolving performance bottlenecks and capacity hotspots through deep analysis of distributed systems at scale</li>\n<li>Building and evolving telemetry and monitoring systems to provide deep visibility into infrastructure performance, utilization, and costs across our cloud and datacenter fleets</li>\n<li>Participating in on-call rotations and incident response to maintain system reliability</li>\n</ul>\n<p>Technical focus includes:</p>\n<ul>\n<li>Python, Bash, Go</li>\n<li>Kubernetes</li>\n<li>Nvidia GPU drivers and operators</li>\n<li>OTel, Prometheus</li>\n</ul>\n<p>Requirements include:</p>\n<ul>\n<li>Experience building or operating large-scale training platforms</li>\n<li>Worked with large-scale compute clusters (GPUs)</li>\n<li>Proven ability to debug performance and reliability issues across large distributed fleets</li>\n<li>Strong problem-solving skills and ability to work independently</li>\n<li>Strong communication skills and the ability to work effectively with both internal and external partners</li>\n<li>Deep knowledge of modern cloud infrastructure including Kubernetes, Infrastructure as Code, AWS, and GCP</li>\n<li>Experience with SLURM</li>\n</ul>\n<p>We offer a competitive base annual salary of $180,000-$300,000 USD and a hybrid work model with a meaningful in-person 
presence.</p>","url":"https://yubhub.co/jobs/job_c1dcea75-d5a","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Black Forest Labs","sameAs":"https://www.blackforestlabs.com/","logo":"https://logos.yubhub.co/blackforestlabs.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/blackforestlabs/jobs/4925659008","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$180,000-$300,000 USD","x-skills-required":["Python","Bash","Go","Kubernetes","Nvidia GPU drivers","Nvidia GPU operators","OTel","Prometheus","Experience building or operating large-scale training platforms","Worked with large-scale compute clusters (GPUs)","Proven ability to debug performance and reliability issues across large distributed fleets","Strong problem-solving skills and ability to work independently","Strong communication skills and the ability to work effectively with both internal and external partners","Deep knowledge of modern cloud infrastructure including Kubernetes, Infrastructure as Code, AWS, and GCP","Experience with SLURM"],"x-skills-preferred":[],"datePosted":"2026-04-17T12:25:55.745Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Freiburg (Germany), San Francisco (USA)"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Python, Bash, Go, Kubernetes, Nvidia GPU drivers, Nvidia GPU operators, OTel, Prometheus, Experience building or operating large-scale training platforms, Worked with large-scale compute clusters (GPUs), Proven ability to debug performance and reliability issues across large distributed fleets, Strong problem-solving skills and ability to work independently, Strong communication skills and the ability to work effectively with both internal and external partners, Deep 
knowledge of modern cloud infrastructure including Kubernetes, Infrastructure as Code, AWS, and GCP, Experience with SLURM","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":180000,"maxValue":300000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_c8c20fa9-7f3"},"title":"Datacenter Hardware Engineer, HPC","description":"<p>About Mistral AI</p>\n<p>At Mistral AI, we believe in the power of AI to simplify tasks, save time, and enhance learning and creativity. Our technology is designed to integrate seamlessly into daily working life.</p>\n<p>We are a company that democratizes AI through high-performance, optimized, open-source and cutting-edge models, products and solutions. Our comprehensive AI platform is designed to meet enterprise needs, whether on-premises or in cloud environments.</p>\n<p>Our offerings include le Chat, the AI assistant for life and work. We are a team passionate about AI and its potential to transform society.</p>\n<p>Role Summary</p>\n<p>Our compute footprint is growing fast to support our science and engineering teams. 
We’re hiring a Datacenter HW Engineer to maintain, troubleshoot, and scale our GPU/CPU clusters safely and reliably.</p>\n<p>What you will do</p>\n<ul>\n<li>Diagnose &amp; operate core server/cluster components - Investigate and handle compute/storage hardware issues (CPU, memory, drives, NICs, GPUs, PSUs) and interconnect problems (switches, cables, transceivers; Ethernet/InfiniBand).</li>\n<li>Safety &amp; procedures - Apply lockout/tagout (LOTO) and ESD discipline; follow pre/post-work checklists; maintain tidy, safe work areas.</li>\n<li>First-line diagnostics - Triage using LEDs, POST, beep codes and basic tests; capture evidence (photos, serials, results); open/update/close tickets with clear notes.</li>\n<li>Preventive maintenance - Provide feedback and ideas to improve proactive activities, monitoring, and targeted follow-ups on recurring or specific anomalies; help turn ad-hoc checks into SOPs, alerts, and dashboards.</li>\n<li>Parts &amp; logistics - Receive and track parts, keep labeled inventory accurate, manage simple RMAs, and coordinate with vendors.</li>\n<li>Collaboration &amp; escalation - Partner with senior hardware/firmware owners on complex or multi-node issues; communicate status and next steps crisply.</li>\n<li>Documentation &amp; quality - Keep SOPs/checklists current; ensure zero undocumented changes and consistent, audit-ready records.</li>\n</ul>\n<p>About you</p>\n<ul>\n<li>Hands-on mindset in datacenters/server hardware: you can install/re-seat/swap GPU/PCIe cards, NICs, PSUs, drives, and work cleanly in racks (rails, cabling, labeling).</li>\n<li>Disciplined and meticulous: follows checklists, ESD/LOTO; no rough handling; careful with all high-value server components.</li>\n<li>Practical electrical basics: power-off, PPE, short-circuit risk awareness.</li>\n<li>Comfortable in racks: cooling, network, storage, PDU, cable management; can lift/mount safely (within HSE limits).</li>\n<li>Clear communicator: short factual updates; 
reliable teammate; punctual and process-minded.</li>\n<li>Hardware-passionate, professionally grounded: strong curiosity and craft mindset.</li>\n</ul>\n<p>Nice to have</p>\n<ul>\n<li>HPC/AI/Cloud at scale experience (production environments), large-fleet/server install &amp; maintenance in datacenters.</li>\n<li>Basic networking (Ethernet/InfiniBand) and basic Linux (boot/check; no coding needed).</li>\n<li>Coding/automation skills (Python/Bash): small tools/scripts to improve checklists, photo/serial capture, inventory sync, or simple monitoring/reporting.</li>\n<li>Experience with inventory/RMA tools and vendor coordination.</li>\n<li>Exposure to HPC/research/industrial environments.</li>\n</ul>\n<p>What we offer</p>\n<ul>\n<li>Competitive salary and equity package</li>\n<li>Health insurance</li>\n<li>Transportation allowance</li>\n<li>Sport allowance</li>\n<li>Meal vouchers</li>\n<li>Private pension plan</li>\n<li>Generous parental leave policy</li>\n</ul>","url":"https://yubhub.co/jobs/job_c8c20fa9-7f3","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Mistral AI","sameAs":"https://mistral.ai"},"x-apply-url":"https://jobs.lever.co/mistral/ddf7bcbb-e223-4768-a553-6e95df472cf7","x-work-arrangement":"onsite","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Datacenter hardware","Server hardware","GPU/CPU clusters","Networking","Linux","Scripting (Python/Bash)","Inventory/RMA tools","Vendor coordination"],"x-skills-preferred":["HPC/AI/Cloud at scale experience","Basic networking (Ethernet/InfiniBand)","Basic Linux (boot/check; no coding needed)","Coding/automation skills 
(Python/Bash)"],"datePosted":"2026-03-10T11:25:48.956Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Paris"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Datacenter hardware, Server hardware, GPU/CPU clusters, Networking, Linux, Scripting (Python/Bash), Inventory/RMA tools, Vendor coordination, HPC/AI/Cloud at scale experience, Basic networking (Ethernet/InfiniBand), Basic Linux (boot/check; no coding needed), Coding/automation skills (Python/Bash)"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_a51375e8-30e"},"title":"Member of Technical Staff, Software Co-Design AI HPC Systems","description":"<p>Our team&#39;s mission is to architect, co-design, and productionize next-generation AI systems at datacenter scale. We operate at the intersection of models, systems software, networking, storage, and AI hardware, optimizing end-to-end performance, efficiency, reliability, and cost. Our work spans today&#39;s frontier AI workloads and directly shapes the next generation of accelerators, system architectures, and large-scale AI platforms. We pursue this mission through deep hardware–software co-design, combining rigorous systems thinking with hands-on engineering. The team invests heavily in understanding real production workloads (large-scale training, inference, and emerging multimodal models) and translating those insights into concrete improvements across the stack: from kernels, runtimes, and distributed systems, all the way down to silicon-level trade-offs and datacenter-scale architectures. This role sits at the boundary between exploration and production. You will work closely with internal infrastructure, hardware, compiler, and product teams, as well as external partners across the hardware and systems ecosystem. 
Our operating model emphasizes rapid ideation and prototyping, followed by disciplined execution to drive high-leverage ideas into production systems that operate at massive scale. In addition to delivering real-world impact on large-scale AI platforms, the team actively contributes to the broader research and engineering community. Our work aligns closely with leading communities in ML systems, distributed systems, computer architecture, and high-performance computing, and we regularly publish, prototype, and open-source impactful technologies where appropriate.</p>\n<p>About the Team</p>\n<p>We build foundational AI infrastructure that enables large-scale training and inference across diverse workloads and rapidly evolving hardware generations. Our work directly shapes how AI systems are designed, deployed, and scaled today and into the future. Engineers on this team operate with end-to-end ownership, deep technical rigor, and a strong bias toward real-world impact.</p>\n<p>Microsoft Superintelligence Team</p>\n<p>Microsoft Superintelligence team’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.</p>\n<p>This role is part of Microsoft AI’s Superintelligence Team. The MAIST is a startup-like team inside Microsoft AI, created to push the boundaries of AI toward Humanist Superintelligence—ultra-capable systems that remain controllable, safety-aligned, and anchored to human values. Our mission is to create AI that amplifies human potential while ensuring humanity remains firmly in control. We aim to deliver breakthroughs that benefit society—advancing science, education, and global well-being. 
We’re also fortunate to partner with incredible product teams giving our models the chance to reach billions of users and create immense positive impact. If you’re a brilliant, highly-ambitious and low ego individual, you’ll fit right in—come and join us as we work on our next generation of models!</p>\n<p>Responsibilities</p>\n<p>Lead the co-design of AI systems across hardware and software boundaries, spanning accelerators, interconnects, memory systems, storage, runtimes, and distributed training/inference frameworks. Drive architectural decisions by analyzing real workloads, identifying bottlenecks across compute, communication, and data movement, and translating findings into actionable system and hardware requirements. Co-design and optimize parallelism strategies, execution models, and distributed algorithms to improve scalability, utilization, reliability, and cost efficiency of large-scale AI systems. Develop and evaluate what-if performance models to project system behavior under future workloads, model architectures, and hardware generations, providing early guidance to hardware and platform roadmaps. Partner with compiler, kernel, and runtime teams to unlock the full performance of current and next-generation accelerators, including custom kernels, scheduling strategies, and memory optimizations. Influence and guide AI hardware design at system and silicon levels, including accelerator microarchitecture, interconnect topology, memory hierarchy, and system integration trade-offs. Lead cross-functional efforts to prototype, validate, and productionize high-impact co-design ideas, working across infrastructure, hardware, and product teams. 
Mentor senior engineers and researchers, set technical direction, and raise the overall bar for systems rigor, performance engineering, and co-design thinking across the organization.</p>","url":"https://yubhub.co/jobs/job_a51375e8-30e","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft AI","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/member-of-technical-staff-software-co-design-ai-hpc-systems-mai-superintelligence-team-3/","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["AI accelerator or GPU architectures","Distributed systems and large-scale AI training/inference","High-performance computing (HPC) and collective communications","ML systems, runtimes, or compilers","Performance modeling, benchmarking, and systems analysis","Hardware–software co-design for AI workloads","Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development"],"x-skills-preferred":["Experience designing or operating large-scale AI clusters for training or inference","Deep familiarity with LLMs, multimodal models, or recommendation systems, and their systems-level implications","Experience with accelerator interconnects and communication stacks (e.g., NCCL, MPI, RDMA, high-speed Ethernet or InfiniBand)","Background in performance modeling and capacity planning for future hardware generations","Prior experience contributing to or leading hardware roadmaps, silicon bring-up, or platform architecture reviews","Publications, patents, or open-source contributions in systems, architecture, or ML 
systems"],"datePosted":"2026-03-08T22:18:41.443Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"London"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"AI accelerator or GPU architectures, Distributed systems and large-scale AI training/inference, High-performance computing (HPC) and collective communications, ML systems, runtimes, or compilers, Performance modeling, benchmarking, and systems analysis, Hardware–software co-design for AI workloads, Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development, Experience designing or operating large-scale AI clusters for training or inference, Deep familiarity with LLMs, multimodal models, or recommendation systems, and their systems-level implications, Experience with accelerator interconnects and communication stacks (e.g., NCCL, MPI, RDMA, high-speed Ethernet or InfiniBand), Background in performance modeling and capacity planning for future hardware generations, Prior experience contributing to or leading hardware roadmaps, silicon bring-up, or platform architecture reviews, Publications, patents, or open-source contributions in systems, architecture, or ML systems"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_b151fcc2-2fb"},"title":"Member of Technical Staff, High Performance Computing Engineer","description":"<p>We are looking for experienced Member of Technical Staff, High Performance Computing Engineers to help build and scale the infrastructure that trains our frontier models and powers the next evolution of our personal AI, Copilot.</p>\n<p>This role offers the unique opportunity to work on some of the largest scale supercomputers in the world – a rare chance to operate at such a significant scale.</p>\n<p><strong>Responsibilities</strong></p>\n<p>Design, operate, and maintain 
large-scale HPC environments, drawing on hands-on engineering experience in production settings.</p>\n<p>Own the deployment, configuration, and day-to-day operation of HPC schedulers (e.g., SLURM, Kubernetes), ensuring reliable and efficient job scheduling at scale.</p>\n<p>Serve as a technical owner for at least one core HPC domain (GPU compute, high-performance storage, networking, or similar), including ongoing maintenance, performance tuning, and troubleshooting of massive clusters.</p>\n<p>Develop and maintain automation and tooling using Bash and/or Python to improve cluster reliability, observability, and operational efficiency.</p>\n<p>Partner closely with researchers and engineers to support their workloads, troubleshoot cluster usage issues, and triage failed or underperforming jobs to resolution.</p>\n<p>Drive work forward independently by navigating ambiguity and technical roadblocks, delivering incremental improvements that get capabilities into users’ hands quickly.</p>\n<p><strong>Qualifications</strong></p>\n<p>Do you have a Bachelor’s degree in computer science, or related technical field AND 4+ years technical engineering experience with deploying or operating on-premise or cloud high-performance clusters, AND 4+ years experience working with high-scale training clusters (ex. working with frameworks/tools such as nvidia InfiniBand clusters, SLURM, Kubernetes, Ray, etc.), AND 4+ years experience building scalable services on top of public cloud infrastructure like Azure, AWS, or GCP, OR equivalent experience?</p>\n<p><strong>Preferred Qualifications</strong></p>\n<p>Master’s Degree in Computer Science or related technical field AND 6+ years technical engineering experience with deploying or operating on-premise or cloud high-performance clusters, AND 6+ years experience working with high-scale training clusters (ex. 
working with frameworks/tools such as nvidia InfiniBand clusters, SLURM, Kubernetes, Ray, etc.), AND 6+ years experience building scalable services on top of public cloud infrastructure like Azure, AWS, or GCP, OR equivalent experience.</p>","url":"https://yubhub.co/jobs/job_b151fcc2-2fb","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft AI","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/member-of-technical-staff-high-performance-computing-engineer-mai-superintelligence-team-3/","x-work-arrangement":"onsite","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["HPC","SLURM","Kubernetes","GPU compute","high-performance storage","networking","Bash","Python","nvidia InfiniBand clusters","Ray"],"x-skills-preferred":["LLM training clusters","AI platforms","Machine Learning frameworks","large-scale HPC or GPU systems"],"datePosted":"2026-03-08T22:15:08.170Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Zürich"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"HPC, SLURM, Kubernetes, GPU compute, high-performance storage, networking, Bash, Python, nvidia InfiniBand clusters, Ray, LLM training clusters, AI platforms, Machine Learning frameworks, large-scale HPC or GPU systems"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_93a4ece6-182"},"title":"Member of Technical Staff, Site Reliability Engineer (HPC)","description":"<p>As Microsoft continues to push the boundaries of AI, we are on the lookout for experienced individuals to work with us on the most interesting and challenging AI questions of our time. 
Our vision is to build systems that have true artificial intelligence across agents, applications, services, and infrastructure. We&#39;re looking for an experienced HPC Site Reliability Engineer (SRE) to join our High Performance Computing (HPC) infrastructure team. In this role, you&#39;ll blend software engineering and systems engineering to keep our large-scale distributed AI infrastructure reliable and efficient. You&#39;ll ensure that AI systems stay efficient and reliable with very high uptimes.</p>\n<p>Microsoft&#39;s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.</p>\n<p>This role is part of Microsoft AI&#39;s Superintelligence Team. The MAIST is a startup-like team inside Microsoft AI, created to push the boundaries of AI toward Humanist Superintelligence—ultra-capable systems that remain controllable, safety-aligned, and anchored to human values. Our mission is to create AI that amplifies human potential while ensuring humanity remains firmly in control. 
We aim to deliver breakthroughs that benefit society—advancing science, education, and global well-being.</p>\n<p>Responsibilities\nReliability &amp; Availability : Ensure uptime, resiliency, and fault tolerance of HPC clusters powering MAI model training and inference.\nObservability : Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into all aspects of HPC systems including GPU, clusters, storage and networking.\nAutomation &amp; Tooling : Build automation for deployments, incident response, scaling, and failover in CPU+GPU environments.\nIncident Management : Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements.\nSecurity &amp; Compliance : Ensure data privacy, compliance, and secure operations across model training and serving environments.\nCollaboration : Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows.</p>\n<p>Qualifications\nRequired Qualifications\nMaster’s Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering OR Bachelor’s Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering OR equivalent experience</p>\n<p>Preferred Qualifications\nStrong proficiency in Kubernetes, Docker, and container orchestration.\nKnowledge of CI/CD pipelines for Inference and ML model deployment.\nHands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code.\nExpertise in monitoring &amp; observability tools (Grafana, Datadog, OpenTelemetry, etc.).\nStrong programming/scripting skills in Python, Go, or Bash.\nSolid knowledge of distributed systems, networking, and storage.\nExperience running large-scale GPU clusters for ML/AI 
workloads (preferred).\nFamiliarity with ML training/inference pipelines.\nExperience with high-performance computing (HPC) and workload schedulers (Kubernetes operators).\nBackground in capacity planning &amp; cost optimization for GPU-heavy environments.</p>\n<p>Work on cutting-edge infrastructure that powers the future of Generative AI. Collaborate with world-class researchers and engineers. Impact millions of users through reliable and responsible AI deployments. Competitive compensation, equity options, and comprehensive benefits.</p>","url":"https://yubhub.co/jobs/job_93a4ece6-182","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/member-of-technical-staff-site-reliability-engineer-hpc-mai-superintelligence-team/","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$139,900 – $274,800 per year","x-skills-required":["Kubernetes","Docker","container orchestration","CI/CD pipelines","public cloud platforms","infrastructure-as-code","monitoring & observability tools","programming/scripting skills in Python, Go, or Bash","distributed systems","networking","storage","GPU clusters","ML training/inference pipelines","high-performance computing","workload schedulers"],"x-skills-preferred":["strong proficiency in Kubernetes","knowledge of CI/CD pipelines","hands-on experience with public cloud platforms","expertise in monitoring & observability tools","strong programming/scripting skills in Python, Go, or Bash","solid knowledge of distributed systems","experience running large-scale GPU clusters","familiarity with ML training/inference pipelines","experience with high-performance 
computing"],"datePosted":"2026-03-08T22:09:23.399Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Mountain View"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Kubernetes, Docker, container orchestration, CI/CD pipelines, public cloud platforms, infrastructure-as-code, monitoring & observability tools, programming/scripting skills in Python, Go, or Bash, distributed systems, networking, storage, GPU clusters, ML training/inference pipelines, high-performance computing, workload schedulers, strong proficiency in Kubernetes, knowledge of CI/CD pipelines, hands-on experience with public cloud platforms, expertise in monitoring & observability tools, strong programming/scripting skills in Python, Go, or Bash, solid knowledge of distributed systems, experience running large-scale GPU clusters, familiarity with ML training/inference pipelines, experience with high-performance computing","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":139900,"maxValue":274800,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_4e0b9271-cdd"},"title":"Research Engineer / Scientist, Alignment Science","description":"<p><strong>About the role:</strong></p>\n<p>You want to build and run elegant and thorough machine learning experiments to help us understand and steer the behavior of powerful AI systems. You care about making AI helpful, honest, and harmless, and are interested in the ways that this could be challenging in the context of human-level capabilities. You could describe yourself as both a scientist and an engineer. 
As a Research Engineer on Alignment Science, you&#39;ll contribute to exploratory experimental research on AI safety, with a focus on risks from powerful future systems (like those we would designate as ASL-3 or ASL-4 under our Responsible Scaling Policy), often in collaboration with other teams including Interpretability, Fine-Tuning, and the Frontier Red Team.</p>\n<p>Our blog provides an overview of topics that the Alignment Science team is either currently exploring or has previously explored. Our current topics of focus include...</p>\n<ul>\n<li><strong>Scalable Oversight:</strong> Developing techniques to keep highly capable models helpful and honest, even as they surpass human-level intelligence in various domains.</li>\n</ul>\n<ul>\n<li><strong>AI Control:</strong> Creating methods to ensure advanced AI systems remain safe and harmless in unfamiliar or adversarial scenarios.</li>\n</ul>\n<ul>\n<li><strong>Alignment Stress-testing:</strong> Creating model organisms of misalignment to improve our empirical understanding of how alignment failures might arise.</li>\n</ul>\n<ul>\n<li><strong>Automated Alignment Research:</strong> Building and aligning a system that can speed up &amp; improve alignment research.</li>\n</ul>\n<ul>\n<li><strong>Alignment Assessments:</strong> Understanding and documenting the highest-stakes and most concerning emerging properties of models through pre-deployment alignment and welfare assessments (see our Claude 4 System Card), misalignment-risk safety cases, and coordination with third-party evaluators.</li>\n</ul>\n<ul>\n<li><strong>Safeguards Research:</strong> Developing robust defenses against adversarial attacks, comprehensive evaluation frameworks for model safety, and automated systems to detect and mitigate potential risks before deployment.</li>\n</ul>\n<ul>\n<li><strong>Model Welfare:</strong> Investigating and addressing potential model welfare, moral status, and related questions. 
See our program announcement and welfare assessment in the Claude 4 system card for more.</li>\n</ul>\n<p><em>Note: For this role, we conduct all interviews in Python and prefer candidates to be based in the Bay Area.</em></p>\n<p><strong>Representative projects:</strong></p>\n<ul>\n<li>Test the robustness of our safety techniques by training language models to subvert them, and see how effective they are at subverting our interventions.</li>\n</ul>\n<ul>\n<li>Run multi-agent reinforcement learning experiments to test out techniques like AI Debate.</li>\n</ul>\n<ul>\n<li>Build tooling to efficiently evaluate the effectiveness of novel LLM-generated jailbreaks.</li>\n</ul>\n<ul>\n<li>Write scripts and prompts to efficiently produce evaluation questions to test models’ reasoning abilities in safety-relevant contexts.</li>\n</ul>\n<ul>\n<li>Contribute ideas, figures, and writing to research papers, blog posts, and talks.</li>\n</ul>\n<ul>\n<li>Run experiments that feed into key AI safety efforts at Anthropic, like the design and implementation of our Responsible Scaling Policy.</li>\n</ul>\n<p><strong>You may be a good fit if you:</strong></p>\n<ul>\n<li>Have significant software, ML, or research engineering experience</li>\n</ul>\n<ul>\n<li>Have some experience contributing to empirical AI research projects</li>\n</ul>\n<ul>\n<li>Have some familiarity with technical AI safety research</li>\n</ul>\n<ul>\n<li>Prefer fast-moving collaborative projects to extensive solo efforts</li>\n</ul>\n<ul>\n<li>Pick up slack, even if it goes outside your job description</li>\n</ul>\n<ul>\n<li>Care about the impacts of AI</li>\n</ul>\n<p><strong>Strong candidates may also:</strong></p>\n<ul>\n<li>Have experience authoring research papers in machine learning, NLP, or AI safety</li>\n</ul>\n<ul>\n<li>Have experience with LLMs</li>\n</ul>\n<ul>\n<li>Have experience with reinforcement learning</li>\n</ul>\n<ul>\n<li>Have experience with Kubernetes clusters and complex 
shared codebases</li>\n</ul>\n<p><strong>Candidates need not have:</strong></p>\n<ul>\n<li>100% of the skills needed to perform the job</li>\n</ul>\n<ul>\n<li>Formal certifications or education credentials</li>\n</ul>\n<p>The annual compensation range for this role is listed below.</p>\n<p>For sales roles, the range provided is the role’s On Target Earnings (&quot;OTE&quot;) range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role.</p>\n<p>Annual Salary:</p>\n<p>$350,000 - $500,000 USD</p>\n<p><strong>Logistics</strong></p>\n<p><strong>Education requirements:</strong> We require at least a Bachelor&#39;s degree in a related field or equivalent experience.</p>\n<p><strong>Location-based hybrid policy:</strong> Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</p>\n<p><strong>Visa sponsorship:</strong> We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</p>\n<p><strong>We encourage you to apply even if you do not believe you meet every single qualification.</strong> Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you&#39;re interested in this work. We think AI systems like the ones we&#39;re building have enormous social and ethical implications. 
We think this makes representation even more important, and we strive to include a range of diverse perspectives on our team.</p>\n<p><strong>Your safety matters to us.</strong> To protect yourself from potential scams, remember that Anthropic recruits through our website and other job boards, and we will never ask you to pay for any part of the recruitment process.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_4e0b9271-cdd","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://job-boards.greenhouse.io","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/4631822008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$350,000 - $500,000USD","x-skills-required":["Python","Machine Learning","Research Engineering","AI Safety","Scalable Oversight","AI Control","Alignment Stress-testing","Automated Alignment Research","Alignment Assessments","Safeguards Research","Model Welfare"],"x-skills-preferred":["Experience authoring research papers in machine learning, NLP, or AI safety","Experience with LLMs","Experience with reinforcement learning","Experience with Kubernetes clusters and complex shared codebases"],"datePosted":"2026-03-08T13:51:34.613Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Python, Machine Learning, Research Engineering, AI Safety, Scalable Oversight, AI Control, Alignment Stress-testing, Automated Alignment Research, Alignment Assessments, Safeguards Research, Model Welfare, Experience authoring research papers in machine learning, NLP, or AI safety, Experience with LLMs, Experience with reinforcement learning, 
Experience with Kubernetes clusters and complex shared codebases","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":350000,"maxValue":500000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_f5e7e195-679"},"title":"Datacenter Hardware Operations Technician, AI Compute Infrastructure - Stargate","description":"<p><strong>Job Posting</strong></p>\n<p><strong>Compensation</strong></p>\n<ul>\n<li>$86.4K – $228K</li>\n</ul>\n<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>\n</ul>\n<ul>\n<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>\n</ul>\n<ul>\n<li>401(k) retirement plan with employer match</li>\n</ul>\n<ul>\n<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>\n</ul>\n<ul>\n<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>\n</ul>\n<ul>\n<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>\n</ul>\n<ul>\n<li>Mental health and wellness support</li>\n</ul>\n<ul>\n<li>Employer-paid basic 
life and disability coverage</li>\n</ul>\n<ul>\n<li>Annual learning and development stipend to fuel your professional growth</li>\n</ul>\n<ul>\n<li>Daily meals in our offices, and meal delivery credits as eligible</li>\n</ul>\n<ul>\n<li>Relocation support for eligible employees</li>\n</ul>\n<ul>\n<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>\n</ul>\n<p><strong>About the Team</strong></p>\n<p>OpenAI, in close collaboration with our capital partners, is embarking on a journey to build the world’s most advanced AI infrastructure ecosystem. Our Stargate program develops and deploys massive, state-of-the-art data center campuses in partnership with industry leaders such as Oracle today—and through future OpenAI infrastructure projects tomorrow. We design for scale, speed, and reliability, and we need experienced hardware professionals who can help ensure our high-density compute environment operates at peak performance.</p>\n<p><strong>About the Role</strong></p>\n<p>We are seeking a senior datacenter hardware operations technician to coordinate physical hardware activities at a large partner-operated campus. In this role you will work side-by-side with Oracle and their delivery teams, helping align OpenAI’s compute requirements with day-to-day hardware work on the ground. Rather than directing partner personnel, you will focus on collaboration, technical alignment, and shared problem solving, ensuring that maintenance, repairs, and lifecycle activities support the performance and reliability goals of both organizations. 
As the campus matures, you will help capture lessons learned and develop standards and playbooks to guide hardware operations at future OpenAI infrastructure projects.</p>\n<p><em>Candidates must be able to sit onsite in Abilene, Texas 5 days per week.</em></p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Serve as OpenAI’s primary on-site hardware contact, collaborating with Oracle teams and vendors to plan and coordinate maintenance, repairs, and lifecycle activities.</li>\n</ul>\n<ul>\n<li>Share technical requirements and verify that work performed supports OpenAI’s compute needs and agreed quality targets.</li>\n</ul>\n<ul>\n<li>Coordinate schedules, spare-parts planning, and issue escalation with partner teams to minimize downtime and keep operations running smoothly.</li>\n</ul>\n<ul>\n<li>Work with OpenAI fleet-health engineers to translate software-detected issues into on-site hardware actions in partnership with Oracle.</li>\n</ul>\n<ul>\n<li>Track hardware trends and provide joint recommendations with partner teams for design or operational improvements.</li>\n</ul>\n<ul>\n<li>Prepare documentation and runbooks that capture joint best practices and can be applied at additional campuses.</li>\n</ul>\n<ul>\n<li>Offer technical guidance and context to partner personnel while respecting their operational ownership.</li>\n</ul>\n<ul>\n<li>Collaborate with supply-chain teams to plan spares and manage hardware lifecycle activities.</li>\n</ul>\n<p><strong>Requirements</strong></p>\n<ul>\n<li>Have 7+ years of experience in datacenter hardware operations, hardware engineering, or large-scale server maintenance, with at least 2 years in a senior or lead technician capacity.</li>\n</ul>\n<ul>\n<li>Bring deep knowledge of high-density server hardware, including x86 platforms, GPUs, storage devices, and power/cooling systems.</li>\n</ul>\n<ul>\n<li>Excel at diagnosing hardware issues, coordinating complex repairs, and maintaining strong working relationships across 
organizations.</li>\n</ul>\n<ul>\n<li>Are comfortable setting technical expectations and validating outcomes through collaboration, not direct management.</li>\n</ul>\n<ul>\n<li>Adapt quickly to changing operational conditions and enjoy solving problems at both the strategic and on-site levels.</li>\n</ul>\n<ul>\n<li>Communicate clearly and build trust across partner teams, vendors, and internal engineering stakeholders.</li>\n</ul>\n<ul>\n<li>Are willing to be based full-time at a partner-operated campus</li>\n</ul>\n<p><strong>Preferred Skills</strong></p>\n<ul>\n<li>Familiarity with large-scale cluster management or monitoring tools (IPMI, BMC, Prometheus, Nagios) to interpret alerts and coordinate partner responses.</li>\n</ul>\n<ul>\n<li>Experience with GPU-accelerated compute clusters or other high-performance computing hardware.</li>\n</ul>\n<ul>\n<li>Knowledge of Linux/Unix system administration and command-line diagnostic tools for hardware validation.</li>\n</ul>\n<ul>\n<li>Industry certifications such as CompTIA Server+, OEM hardware certifications, or equivalent.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_f5e7e195-679","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/b9a4a809-a965-4dbe-aeef-6ce1593903dd","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$86.4K – $228K","x-skills-required":["datacenter hardware operations","hardware engineering","large-scale server maintenance","high-density server hardware","x86 platforms","GPUs","storage devices","power/cooling systems"],"x-skills-preferred":["large-scale cluster management","monitoring tools","IPMI","BMC","Prometheus","Nagios","GPU-accelerated compute 
clusters","Linux/Unix system administration","command-line diagnostic tools","industry certifications"],"datePosted":"2026-03-06T18:43:34.654Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Abilene, Texas"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"datacenter hardware operations, hardware engineering, large-scale server maintenance, high-density server hardware, x86 platforms, GPUs, storage devices, power/cooling systems, large-scale cluster management, monitoring tools, IPMI, BMC, Prometheus, Nagios, GPU-accelerated compute clusters, Linux/Unix system administration, command-line diagnostic tools, industry certifications","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":86400,"maxValue":228000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_520ca95e-75f"},"title":"Software Engineer, Agent Infrastructure","description":"<p><strong>Software Engineer, Agent Infrastructure</strong></p>\n<p><strong>Location</strong></p>\n<p>San Francisco; New York City</p>\n<p><strong>Employment Type</strong></p>\n<p>Full time</p>\n<p><strong>Department</strong></p>\n<p>Scaling</p>\n<p><strong>Compensation</strong></p>\n<ul>\n<li>$230K – $385K • Offers Equity</li>\n</ul>\n<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. 
In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>\n</ul>\n<ul>\n<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>\n</ul>\n<ul>\n<li>401(k) retirement plan with employer match</li>\n</ul>\n<ul>\n<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>\n</ul>\n<ul>\n<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>\n</ul>\n<ul>\n<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>\n</ul>\n<ul>\n<li>Mental health and wellness support</li>\n</ul>\n<ul>\n<li>Employer-paid basic life and disability coverage</li>\n</ul>\n<ul>\n<li>Annual learning and development stipend to fuel your professional growth</li>\n</ul>\n<ul>\n<li>Daily meals in our offices, and meal delivery credits as eligible</li>\n</ul>\n<ul>\n<li>Relocation support for eligible employees</li>\n</ul>\n<ul>\n<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>\n</ul>\n<p><strong>About the Team</strong></p>\n<p>The Agent Infrastructure team at OpenAI is responsible for building systems that enable training and deployment of highly useful AI agents, both internally and for the world.</p>\n<p>We work hand-in-hand with researchers to design and scale the environment in which agentic models are trained – providing a workspace for AI models to 
execute code, debug issues, and develop software just as human SWEs do. Our training environment for agentic models operates at an extremely high scale and has the flexibility to emulate any environment in which an agent might work.</p>\n<p>At the same time, our team builds and maintains OpenAI’s core platform for the deployment and execution of agents in production. Our systems power products such as Codex, Operator, tool use in ChatGPT, and future agentic products.</p>\n<p><strong>About the Role</strong></p>\n<p>As a Software Engineer on the Agent Infrastructure team, you will have the opportunity to work closely with both research and product at OpenAI - building and scaling systems to train highly capable agentic models, and building the platform and integrations to launch new agents to hundreds of millions of users worldwide.</p>\n<p>Your work will consist of both building new capabilities - standing up the infrastructure and integrations needed to train more complex agentic models - and rapidly scaling these new capabilities to some of the largest compute clusters in the world. At the same time, you’ll be instrumental to the launch of agentic products at OpenAI - building, maintaining, and scaling the production platform on which all agents run.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Push massive compute clusters to their limits. 
You will be a core contributor to a novel container orchestration platform built in-house by our team to scale far beyond what’s possible with systems like Kubernetes.</li>\n</ul>\n<ul>\n<li>Develop and maintain FastAPI and gRPC APIs that serve as the interface for our agentic infrastructure used both in training and production.</li>\n</ul>\n<ul>\n<li>Use Terraform to stand up and evolve complex infrastructure for both research and production.</li>\n</ul>\n<ul>\n<li>Collaborate with research teams to stand up and optimize systems for novel AI training runs and experimental applications.</li>\n</ul>\n<p><strong>Requirements</strong></p>\n<ul>\n<li>Have deep experience working on large-scale machine learning infrastructure. You know how to reason about training at scale, identifying bottlenecks and engineering solutions to optimize system performance in training environments.</li>\n</ul>\n<ul>\n<li>Know how to build new things from 0-1 quickly, and then scale them 1,000,000x.</li>\n</ul>\n<ul>\n<li>Have a keen eye for performance and optimization. You know how to squeeze the most performance out of complex, globally-distributed systems.</li>\n</ul>\n<ul>\n<li>Know your way around cloud platforms and work with infrastructure-as-code tech like Terraform.</li>\n</ul>\n<ul>\n<li>Are driven by solving complex, ambiguous problems at the intersection of infrastructure scalability, virtualization efficiency, and agentic capabilities.</li>\n</ul>\n<ul>\n<li>Have deep technical expertise in virtualization and containerization technologies (e.g. 
Kata, Firecracker, gVisor, Sysbox) and are passionate about optimizing runtime performance.</li>\n</ul>\n<p><strong>What We Offer</strong></p>\n<ul>\n<li>Competitive salary and equity package</li>\n</ul>\n<ul>\n<li>Opportunity to work on cutting-edge AI infrastructure</li>\n</ul>\n<ul>\n<li>Collaborative and dynamic team environment</li>\n</ul>\n<ul>\n<li>Flexible work arrangements</li>\n</ul>\n<ul>\n<li>Professional development opportunities</li>\n</ul>\n<ul>\n<li>Access to the latest technology and tools</li>\n</ul>\n<p><strong>How to Apply</strong></p>\n<p>If you are a motivated and experienced software engineer looking to join a dynamic team and work on cutting-edge AI infrastructure, please submit your application. We look forward to hearing from you!</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_520ca95e-75f","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/c1316397-25bb-4add-9e9d-0e3ea8ba929a","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$230K – $385K","x-skills-required":["large-scale machine learning infrastructure","container orchestration","FastAPI","gRPC","Terraform","cloud platforms","infrastructure-as-code","virtualization","containerization","Kata","Firecracker","gVisor","Sysbox"],"x-skills-preferred":["AI infrastructure","agentic models","training environments","compute clusters","performance optimization","runtime performance"],"datePosted":"2026-03-06T18:41:05.385Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco; New York City"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"large-scale machine learning 
infrastructure, container orchestration, FastAPI, gRPC, Terraform, cloud platforms, infrastructure-as-code, virtualization, containerization, Kata, Firecracker, gVisor, Sysbox, AI infrastructure, agentic models, training environments, compute clusters, performance optimization, runtime performance","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":230000,"maxValue":385000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_e31a2c4e-190"},"title":"ASIC Firmware Engineer, Modeling","description":"<p><strong>Job Posting</strong></p>\n<p><strong>ASIC Firmware Engineer, Modeling</strong></p>\n<p><strong>Location</strong></p>\n<p>San Francisco</p>\n<p><strong>Employment Type</strong></p>\n<p>Full time</p>\n<p><strong>Department</strong></p>\n<p>Scaling</p>\n<p><strong>Compensation</strong></p>\n<ul>\n<li>$226K – $445K • Offers Equity</li>\n</ul>\n<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. 
In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>\n<ul>\n<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>\n</ul>\n<ul>\n<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>\n</ul>\n<ul>\n<li>401(k) retirement plan with employer match</li>\n</ul>\n<ul>\n<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>\n</ul>\n<ul>\n<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>\n</ul>\n<ul>\n<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>\n</ul>\n<ul>\n<li>Mental health and wellness support</li>\n</ul>\n<ul>\n<li>Employer-paid basic life and disability coverage</li>\n</ul>\n<ul>\n<li>Annual learning and development stipend to fuel your professional growth</li>\n</ul>\n<ul>\n<li>Daily meals in our offices, and meal delivery credits as eligible</li>\n</ul>\n<ul>\n<li>Relocation support for eligible employees</li>\n</ul>\n<ul>\n<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>\n</ul>\n<p>More details about our benefits are available to candidates during the hiring process.</p>\n<p>This role is at-will and OpenAI reserves the right to modify base pay and other compensation components at any time based on individual performance, team or company results, or market conditions.</p>\n<p><strong>About the Team</strong></p>\n<p>OpenAI’s Hardware organization develops silicon and system-level 
solutions designed for the unique demands of advanced AI workloads. The team is responsible for building the next generation of AI-native silicon while working closely with software and research partners to co-design hardware tightly integrated with AI models. In addition to delivering production-grade silicon for OpenAI’s supercomputing infrastructure, the team also creates custom design tools and methodologies that accelerate innovation and enable hardware optimized specifically for AI.</p>\n<p><strong>About the Role</strong></p>\n<p>We are looking for an embedded engineer to help build firmware and associated modeling software for OpenAI’s in-house AI accelerator. This role involves designing and developing drivers and functional models for a large array of HW components, writing high-throughput, low-latency firmware code, and investigating bring-up and production issues.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Design and implement drivers for hardware peripherals, including those related to AI chips.</li>\n</ul>\n<ul>\n<li>Design and implement functional software models to simulate SoC uncore logic and enable FW testing against the model.</li>\n</ul>\n<ul>\n<li>Design and implement low-latency and high-throughput embedded SW to manage HW resources.</li>\n</ul>\n<ul>\n<li>Work with adjacent software and hardware teams to implement requirements, debug issues, and shape future generations of the hardware.</li>\n</ul>\n<ul>\n<li>Collaborate with vendors to integrate their technologies within our systems.</li>\n</ul>\n<ul>\n<li>Bring up and debug firmware/driver on new platforms.</li>\n</ul>\n<ul>\n<li>Come up with processes and debug issues raised in the field.</li>\n</ul>\n<ul>\n<li>Set up monitoring, integration testing, and diagnostics tools.</li>\n</ul>\n<p><strong>Qualifications</strong></p>\n<ul>\n<li>5+ years of experience working in embedded SW space.</li>\n</ul>\n<ul>\n<li>Ability to thrive in ambiguity and learn new 
technologies.</li>\n</ul>\n<ul>\n<li>Strong programming skills in C/C++ and/or Rust.</li>\n</ul>\n<ul>\n<li>Experience developing high-throughput, low-latency, and multi-threaded code.</li>\n</ul>\n<ul>\n<li>Experience working with real-time operating systems (RTOS).</li>\n</ul>\n<ul>\n<li>Experience developing hardware drivers and working with hardware.</li>\n</ul>\n<ul>\n<li>Experience with HW/SW co-design.</li>\n</ul>\n<ul>\n<li>Knowledge of common embedded protocols, e.g. UART, I2C, SPI, etc.</li>\n</ul>\n<ul>\n<li>Knowledge of microprocessor and common ARM architectures (e.g. AMBA) is a plus.</li>\n</ul>\n<ul>\n<li>Knowledge of PCIe, Ethernet, and other high-BW communication protocols is a plus.</li>\n</ul>\n<ul>\n<li>Experience with GPUs or other compute hardware is a plus.</li>\n</ul>\n<ul>\n<li>Experience deploying large compute clusters is a plus.</li>\n</ul>\n<p><em>To comply with U.S. export control laws and regulations, candidates for this role may need to meet certain legal status requirements as provided in those laws and regulations.</em></p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_e31a2c4e-190","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/e4ef18a1-f2f7-4920-a53c-aeadd184d124","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$226K – $445K • Offers Equity","x-skills-required":["C/C++","Rust","Embedded SW","Real time operating systems (RTOS)","Hardware drivers","HW/SW co-design","Common embedded protocols (UART, I2C, SPI, etc.)","Microprocessor and common ARM architectures (e.g. 
AMBA)","PCIe, ethernet and other high BW communication protocols"],"x-skills-preferred":["GPU","Compute hardware","Large compute clusters"],"datePosted":"2026-03-06T18:40:36.430Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"C/C++, Rust, Embedded SW, Real time operating systems (RTOS), Hardware drivers, HW/SW co-design, Common embedded protocols (UART, I2C, SPI, etc.), Microprocessor and common ARM architectures (e.g. AMBA), PCIe, ethernet and other high BW communication protocols, GPU, Compute hardware, Large compute clusters","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":226000,"maxValue":445000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_3469e687-cba"},"title":"Offensive Security Engineer, Agent Security","description":"<p><strong>Job Posting</strong></p>\n<p><strong>Offensive Security Engineer, Agent Security</strong></p>\n<p><strong>Location</strong></p>\n<p>San Francisco; New York City; Remote - US; Seattle; Washington, DC</p>\n<p><strong>Employment Type</strong></p>\n<p>Full time</p>\n<p><strong>Department</strong></p>\n<p>Security</p>\n<p><strong>Compensation</strong></p>\n<ul>\n<li>San Francisco, Seattle, New York$347K – $490K • Offers Equity</li>\n<li>Zone A$312.3K – $490K • Offers Equity</li>\n<li>Zone B$277.6K – $490K • Offers Equity</li>\n</ul>\n<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. 
In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>\n<ul>\n<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>\n</ul>\n<ul>\n<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>\n</ul>\n<ul>\n<li>401(k) retirement plan with employer match</li>\n</ul>\n<ul>\n<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>\n</ul>\n<ul>\n<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>\n</ul>\n<ul>\n<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>\n</ul>\n<ul>\n<li>Mental health and wellness support</li>\n</ul>\n<ul>\n<li>Employer-paid basic life and disability coverage</li>\n</ul>\n<ul>\n<li>Annual learning and development stipend to fuel your professional growth</li>\n</ul>\n<ul>\n<li>Daily meals in our offices, and meal delivery credits as eligible</li>\n</ul>\n<ul>\n<li>Relocation support for eligible employees</li>\n</ul>\n<ul>\n<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>\n</ul>\n<p>More details about our benefits are available to candidates during the hiring process.</p>\n<p>This role is at-will and OpenAI reserves the right to modify base pay and other compensation components at any time based on individual performance, team or company results, or market conditions.</p>\n<p><strong>About the Team</strong></p>\n<p>Security is at the foundation of OpenAI’s mission to ensure that 
artificial general intelligence benefits all of humanity. The Security team protects OpenAI’s technology, people, and products. We are technical in what we build but are operational in how we do our work, and are committed to supporting all products and research at OpenAI. Our Security team tenets include: prioritizing for impact, enabling researchers, preparing for future transformative technologies, and engaging a robust security culture.</p>\n<p><strong>About the Role</strong></p>\n<p>We&#39;re seeking an exceptional Principal-level Offensive Security Engineer to challenge and strengthen OpenAI&#39;s security posture. This role isn&#39;t your typical red team job - it&#39;s an opportunity to engage broadly and deeply, craft innovative attack simulations, collaborate closely with defensive teams, and influence strategic security improvements across the organization.</p>\n<p>You&#39;ll have the chance to not only find vulnerabilities but actively drive their resolution, automate offensive techniques with cutting-edge technologies, and use your unique attacker perspective to shape our security strategy.</p>\n<p>This role will be primarily focused on continuously testing our agent powered products like codex and operator. These systems are uniquely valuable targets because they’re rapidly evolving, have access to perform sensitive actions on behalf of users, and have large, diverse attack surfaces. 
You will play a crucial role in securing our agents by hunting for realistic vulnerabilities that emerge from the interactions between the applications, infrastructure, and models that power them.</p>\n<p><strong>In this role you will:</strong></p>\n<ul>\n<li>Continuously hunt for vulnerabilities in the interactions between the applications, infrastructure, and models that power our agentic products.</li>\n</ul>\n<ul>\n<li>Conduct open-scope red and purple team operations, simulating realistic attack scenarios.</li>\n</ul>\n<ul>\n<li>Collaborate proactively with defensive security teams to enhance detection, response, and mitigation capabilities.</li>\n</ul>\n<ul>\n<li>Perform comprehensive penetration testing on our diverse suite of products.</li>\n</ul>\n<ul>\n<li>Leverage advanced automation and OpenAI technologies to optimize your offensive security work.</li>\n</ul>\n<ul>\n<li>Present insightful, actionable findings clearly and compellingly to inspire impactful change.</li>\n</ul>\n<ul>\n<li>Influence security strategy by providing attacker-driven insights into risk and threat modeling.</li>\n</ul>\n<p><strong>You might thrive in this role if you have:</strong></p>\n<ul>\n<li>7+ years of hands-on red team experience or exceptional accomplishments demonstrating equivalent expertise.</li>\n</ul>\n<ul>\n<li>Deep expertise conducting offensive security operations within modern technology companies.</li>\n</ul>\n<ul>\n<li>Experience designing, developing, or assessing the security of AI-powered systems.</li>\n</ul>\n<ul>\n<li>Experience finding, exploiting, and mitigating common vulnerabilities in AI systems like prompt injection, leaking sensitive data, confused deputies, and dynamically generated UI components.</li>\n</ul>\n<ul>\n<li>Exceptional skill in code review, identifying novel and subtle vulnerabilities.</li>\n</ul>\n<ul>\n<li>Proven experience performing offensive security assessments in at least one hyperscaler cloud environment (Azure 
preferred).</li>\n</ul>\n<ul>\n<li>Demonstrated mastery assessing complex technology stacks, including:</li>\n</ul>\n<ul>\n<li>Highly customized Kubernetes clusters</li>\n</ul>\n<ul>\n<li>Container environments</li>\n</ul>\n<ul>\n<li>CI/CD pipelines</li>\n</ul>\n<ul>\n<li>GitHub security</li>\n</ul>\n<ul>\n<li>macOS and Linux operating systems</li>\n</ul>\n<ul>\n<li>Data science tooling and environments</li>\n</ul>\n<ul>\n<li>Python-based web services</li>\n</ul>\n<ul>\n<li>React-based frontend applications</li>\n<li>Strong intuitive understanding of trust boundaries and risk assessment in dynamic contexts.</li>\n</ul>\n<ul>\n<li>Excellent coding skills, capable of writing robust tools and automation for offensive operations.</li>\n</ul>\n<ul>\n<li>Ability to communicate complex technical concepts to both technical and non-technical stakeholders.</li>\n</ul>\n<p><strong>Experience Level</strong></p>\n<p>Senior</p>\n<p><strong>Employment Type</strong></p>\n<p>Full-time</p>\n<p><strong>Workplace Type</strong></p>\n<p>Remote</p>\n<p><strong>Category</strong></p>\n<p>Engineering</p>\n<p><strong>Industry</strong></p>\n<p>Technology</p>\n<p><strong>Salary Range</strong></p>\n<p>$347K – $490K • Offers Equity</p>\n<p><strong>Required Skills</strong></p>\n<ul>\n<li>Red team experience</li>\n<li>Offensive security operations</li>\n<li>AI-powered systems security</li>\n<li>Vulnerability assessment</li>\n<li>Penetration testing</li>\n<li>Automation</li>\n<li>Code review</li>\n<li>Cloud security</li>\n<li>Kubernetes</li>\n<li>Container security</li>\n<li>CI/CD pipelines</li>\n<li>GitHub security</li>\n<li>macOS and Linux operating systems</li>\n<li>Data science tooling and environments</li>\n<li>Python-based web services</li>\n<li>React-based frontend applications</li>\n</ul>\n<p><strong>Preferred Skills</strong></p>\n<ul>\n<li>Azure cloud security</li>\n<li>Highly customized Kubernetes clusters</li>\n<li>Container environments</li>\n<li>CI/CD pipelines</li>\n<li>GitHub 
security</li>\n<li>macOS and Linux operating systems</li>\n<li>Data science tooling and environments</li>\n<li>Python-based web services</li>\n<li>React-based frontend applications</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_3469e687-cba","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/bb97fffc-cdda-43a3-a6bc-234f9c031720","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$347K – $490K • Offers Equity","x-skills-required":["red team experience","offensive security operations","AI-powered systems security","vulnerability assessment","penetration testing","automation","code review","cloud security","kubernetes","container security","ci/cd pipelines","github security","macos and linux operating systems","data science tooling and environments","python-based web services","react-based frontend applications"],"x-skills-preferred":["azure cloud security","highly customized kubernetes clusters","container environments","ci/cd pipelines","github security","macos and linux operating systems","data science tooling and environments","python-based web services","react-based frontend applications"],"datePosted":"2026-03-06T18:27:44.474Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco; New York City; Remote - US; Seattle; Washington, DC"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"engineering","industry":"technology","skills":"red team experience, offensive security operations, AI-powered systems security, vulnerability assessment, penetration testing, automation, code review, cloud security, kubernetes, container security, ci/cd pipelines, 
github security, macos and linux operating systems, data science tooling and environments, python-based web services, react-based frontend applications, azure cloud security, highly customized kubernetes clusters, container environments, ci/cd pipelines, github security, macos and linux operating systems, data science tooling and environments, python-based web services, react-based frontend applications","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":347000,"maxValue":490000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_f8953efe-b98"},"title":"Member of Technical Staff, Evaluations Engineering","description":"<p><strong>Summary</strong></p>\n<p>Microsoft AI are looking for a talented Member of Technical Staff, Evaluations Engineer to help build the next wave of capabilities of our personalized AI assistant, Copilot. We’re looking for someone who will bring an abundance of positive energy, empathy, and kindness to the team every day, in addition to being highly effective.</p>\n<p><strong>About the Role</strong></p>\n<p>We are looking for a highly skilled and experienced engineer to join our Evaluations Engineering team. As a Member of Technical Staff, Evaluations Engineer, you will be responsible for developing and tuning the pretraining scalable software for Nvidia GB200 72NVL CX8 and AMD MIxxx architectures. 
You will also be responsible for benchmarking GB200 and AMD MIxxx GPU clusters, gathering data and insights to develop the pretraining compute roadmap, and caring deeply about conversational AI and its deployment.</p>\n<p><strong>Accountabilities</strong></p>\n<ul>\n<li>Develop and tune the pretraining scalable software for Nvidia GB200 72NVL CX8 and AMD MIxxx architectures.</li>\n<li>Benchmark GB200 and AMD MIxxx GPU clusters.</li>\n<li>Gather data and insights to develop the pretraining compute roadmap.</li>\n</ul>\n<p><strong>The Candidate we&#39;re looking for</strong></p>\n<p><strong>Experience:</strong></p>\n<ul>\n<li>Bachelor’s Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.</li>\n</ul>\n<p><strong>Technical skills:</strong></p>\n<ul>\n<li>Experience with generative AI.</li>\n<li>Experience with distributed computing.</li>\n</ul>\n<p><strong>Personal attributes:</strong></p>\n<ul>\n<li>Enjoy working in a fast-paced, design-driven, product development cycle.</li>\n<li>Embody our Culture and Values.</li>\n</ul>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Software Engineering IC5 – The typical base pay range for this role across the U.S. is USD $139,900 – $274,800 per year.</li>\n<li>Software Engineering IC6 – The typical base pay range for this role across the U.S. 
is USD $163,000 – $296,400 per year.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_f8953efe-b98","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft AI","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/member-of-technical-staff-evaluations-engineering-mai-superintelligence-team-2/","x-work-arrangement":"onsite","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"USD $139,900 – $274,800 per year","x-skills-required":["C","C++","C#","Java","JavaScript","Python","Generative AI","Distributed Computing"],"x-skills-preferred":["Experience with Nvidia GB200 72NVL CX8 and AMD MIxxx architectures","Experience with benchmarking GPU clusters"],"datePosted":"2026-03-06T07:32:38.526Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Redmond"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"C, C++, C#, Java, JavaScript, Python, Generative AI, Distributed Computing, Experience with Nvidia GB200 72NVL CX8 and AMD MIxxx architectures, Experience with benchmarking GPU clusters","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":139900,"maxValue":274800,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_0bf1fd91-846"},"title":"Member of Technical Staff, Hardware Health","description":"<p><strong>Summary</strong></p>\n<p>Microsoft AI are looking for a talented Member of Technical Staff, Hardware Health, to ensure these systems deliver sustained reliability, performance, and availability across exascale-class deployments.</p>\n<p><strong>About the Role</strong></p>\n<p>We work closely with research, 
hardware, datacenter, and platform engineering teams to develop predictive health models, failure detection frameworks, and autonomous remediation systems that keep our AI clusters operating at frontier scale. Our team is responsible for Copilot, Bing, Edge, and generative AI research. Join us and help shape the future of personal computing.</p>\n<p><strong>Accountabilities</strong></p>\n<ul>\n<li>Design and develop next-generation hardware health monitoring and diagnostic frameworks for large GPU clusters (NVL16/NVL72/GB200+ scale).</li>\n<li>Build predictive analytics pipelines leveraging telemetry, power, and thermal data to anticipate hardware degradation and systemic issues.</li>\n</ul>\n<p><strong>The Candidate we&#39;re looking for</strong></p>\n<p><strong>Experience:</strong></p>\n<ul>\n<li>Bachelor&#39;s Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.</li>\n</ul>\n<p><strong>Technical skills:</strong></p>\n<ul>\n<li>Proficiency in hardware telemetry, diagnostics, or failure analysis tools.</li>\n<li>Experience with exascale-class systems or cloud-scale AI clusters.</li>\n</ul>\n<p><strong>Personal attributes:</strong></p>\n<ul>\n<li>Strong analytical and problem-solving skills.</li>\n<li>Excellent communication and collaboration skills.</li>\n</ul>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Competitive salary range: $139,900 - $274,800 per year.</li>\n<li>Comprehensive benefits package, including health insurance, retirement plan, and paid time off.</li>\n<li>Opportunities for professional growth and development.</li>\n<li>Collaborative and dynamic work environment.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a 
href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_0bf1fd91-846","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft AI","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/member-of-technical-staff-hardware-health-mai-superintelligence-team-3/","x-work-arrangement":"onsite","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$139,900 - $274,800 per year","x-skills-required":["C","C++","C#","Java","JavaScript","Python","hardware telemetry","diagnostics","failure analysis tools","exascale-class systems","cloud-scale AI clusters"],"x-skills-preferred":["machine learning","data analysis","problem-solving"],"datePosted":"2026-03-06T07:32:20.374Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Mountain View"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"C, C++, C#, Java, JavaScript, Python, hardware telemetry, diagnostics, failure analysis tools, exascale-class systems, cloud-scale AI clusters, machine learning, data analysis, problem-solving","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":139900,"maxValue":274800,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_c1d20281-7ee"},"title":"Member of Technical Staff, High Performance Computing Engineer","description":"<p><strong>Summary</strong></p>\n<p>Microsoft AI are looking for a talented Member of Technical Staff, High Performance Computing Engineer at their London office. This role sits at the heart of building and scaling the infrastructure that trains their frontier models and powers the next evolution of their personal AI, Copilot. 
You&#39;ll work directly with researchers and engineers to support their workloads, troubleshoot cluster usage issues, and triage failed or underperforming jobs to resolution.</p>\n<p><strong>About the Role</strong></p>\n<p>As a Member of Technical Staff, High Performance Computing Engineer, you will design, operate, and maintain large-scale HPC environments, drawing on hands-on engineering experience in production settings. You will own the deployment, configuration, and day-to-day operation of HPC schedulers (e.g., SLURM, Kubernetes), ensuring reliable and efficient job scheduling at scale. You will serve as a technical owner for at least one core HPC domain (GPU compute, high-performance storage, networking, or similar), including ongoing maintenance, performance tuning, and troubleshooting of massive clusters.</p>\n<p><strong>Accountabilities</strong></p>\n<ul>\n<li>Design, operate, and maintain large-scale HPC environments, drawing on hands-on engineering experience in production settings.</li>\n<li>Own the deployment, configuration, and day-to-day operation of HPC schedulers (e.g., SLURM, Kubernetes), ensuring reliable and efficient job scheduling at scale.</li>\n<li>Serve as a technical owner for at least one core HPC domain (GPU compute, high-performance storage, networking, or similar), including ongoing maintenance, performance tuning, and troubleshooting of massive clusters.</li>\n</ul>\n<p><strong>The Candidate we&#39;re looking for</strong></p>\n<p><strong>Experience:</strong></p>\n<ul>\n<li>4+ years technical engineering experience with deploying or operating on-premise or cloud high-performance clusters.</li>\n<li>4+ years experience working with high-scale training clusters (ex. 
working with frameworks/tools such as NVIDIA InfiniBand clusters, SLURM, Kubernetes, Ray, etc.).</li>\n<li>4+ years experience building scalable services on top of public cloud infrastructure like Azure, AWS, or GCP.</li>\n</ul>\n<p><strong>Technical skills:</strong></p>\n<ul>\n<li>Experience with LLM training clusters.</li>\n<li>Experience working with AI platforms, frameworks, and APIs.</li>\n<li>Experience using Machine Learning frameworks, including experience using, deploying, and scaling large language models, either personally or professionally.</li>\n</ul>\n<p><strong>Personal attributes:</strong></p>\n<ul>\n<li>Ability to identify, analyze, and resolve complex technical issues, ensuring optimal performance, scalability, and user experience.</li>\n<li>Dedication to writing clean, maintainable, and well-documented code with a focus on application quality, performance, and security.</li>\n</ul>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Competitive salary and benefits package.</li>\n<li>Opportunity to work with a leading technology company and contribute to its mission.</li>\n<li>Collaborative and dynamic work environment.</li>\n<li>Professional development opportunities.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_c1d20281-7ee","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft AI","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/member-of-technical-staff-high-performance-computing-engineer-mai-superintelligence-team-2/","x-work-arrangement":"onsite","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"Competitive salary and benefits package","x-skills-required":["High Performance Computing","Cloud Infrastructure","Machine Learning","AI Platforms","Frameworks and 
APIs"],"x-skills-preferred":["LLM Training Clusters","AI Platforms","Frameworks and APIs"],"datePosted":"2026-03-06T07:31:41.290Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"London"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"High Performance Computing, Cloud Infrastructure, Machine Learning, AI Platforms, Frameworks and APIs, LLM Training Clusters, AI Platforms, Frameworks and APIs"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_675d41e9-5f9"},"title":"Member of Technical Staff, Reinforcement Learning Systems","description":"<p><strong>Summary</strong></p>\n<p>Microsoft AI are looking for a talented Member of Technical Staff, Reinforcement Learning Systems to help build the world&#39;s most advanced reinforcement learning systems. This role sits at the heart of strategic decision-making, turning market data into actionable insights for a company that&#39;s revolutionising AI technology.</p>\n<p><strong>About the Role</strong></p>\n<p>We are responsible for designing, developing, and operating the large-scale reinforcement learning systems that power several use cases across the Superintelligence team. We are looking for individuals who can contribute to cutting-edge research and help bridge the gap between cutting-edge research and robust, production-grade distributed systems. 
The ideal candidate has both distributed systems expertise and a scientific mindset and will be able to build complex and scalable systems from the ground up, identify and resolve performance bottlenecks, debug complex, cross-system issues with extremely high attention to detail, and contribute to solving scientific and research challenges.</p>\n<p><strong>Accountabilities</strong></p>\n<ul>\n<li>Develop and tune scalable pretraining software for Nvidia GB200 72NVL CX8 and AMD MIxxx architectures.</li>\n<li>Benchmark GB200 and AMD MIxxx GPU clusters.</li>\n<li>Gather data and insights to develop the pretraining compute roadmap.</li>\n</ul>\n<p><strong>The Candidate we&#39;re looking for</strong></p>\n<p><strong>Experience:</strong></p>\n<ul>\n<li>Bachelor&#39;s Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.</li>\n</ul>\n<p><strong>Technical skills:</strong></p>\n<ul>\n<li>Experience with generative AI.</li>\n<li>Experience with distributed computing.</li>\n</ul>\n<p><strong>Personal attributes:</strong></p>\n<ul>\n<li>A high degree of craftsmanship and close attention to detail.</li>\n<li>Enjoy working in a fast-paced, design-driven, product development cycle.</li>\n</ul>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Software Engineering IC5 – The typical base pay range for this role across the U.S. 
is USD $139,900 – $274,800 per year.</li>\n<li>There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_675d41e9-5f9","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft AI","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/member-of-technical-staff-reinforcement-learning-systems-mai-superintelligence-team-3/","x-work-arrangement":"onsite","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"USD $139,900 – $274,800 per year","x-skills-required":["C","C++","C#","Java","JavaScript","Python","Generative AI","Distributed Computing"],"x-skills-preferred":["Experience with Nvidia GB200 72NVL CX8 and AMD MIxxx architectures","Experience with GPU clusters"],"datePosted":"2026-03-06T07:29:05.671Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"New York"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"C, C++, C#, Java, JavaScript, Python, Generative AI, Distributed Computing, Experience with Nvidia GB200 72NVL CX8 and AMD MIxxx architectures, Experience with GPU clusters","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":139900,"maxValue":274800,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_b0dff67a-5b5"},"title":"Member of Technical Staff, Reinforcement Learning Systems","description":"<p><strong>Summary</strong></p>\n<p>Microsoft AI are looking for a talented Member of Technical Staff, Reinforcement Learning Systems to help build the world&#39;s most advanced 
reinforcement learning systems. This role sits at the heart of strategic decision-making, turning market data into actionable insights for a company that&#39;s revolutionising AI technology.</p>\n<p><strong>About the Role</strong></p>\n<p>We are responsible for designing, developing, and operating the large-scale reinforcement learning systems that power several use cases across the Superintelligence team. We are looking for individuals who can contribute to cutting-edge research and help bridge the gap between cutting-edge research and robust, production-grade distributed systems. The ideal candidate has both distributed systems expertise and a scientific mindset and will be able to build complex and scalable systems from the ground up, identify and resolve performance bottlenecks, debug complex, cross-system issues with extremely high attention to detail, and contribute to solving scientific and research challenges.</p>\n<p><strong>Accountabilities</strong></p>\n<ul>\n<li>Develop and tune scalable pretraining software for Nvidia GB200 72NVL CX8 and AMD MIxxx architectures.</li>\n<li>Benchmark GB200 and AMD MIxxx GPU clusters.</li>\n<li>Gather data and insights to develop the pretraining compute roadmap.</li>\n</ul>\n<p><strong>The Candidate we&#39;re looking for</strong></p>\n<p><strong>Experience:</strong></p>\n<ul>\n<li>Bachelor&#39;s Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.</li>\n</ul>\n<p><strong>Technical skills:</strong></p>\n<ul>\n<li>Experience with generative AI.</li>\n<li>Experience with distributed computing.</li>\n</ul>\n<p><strong>Personal attributes:</strong></p>\n<ul>\n<li>A high degree of craftsmanship and close attention to detail.</li>\n<li>Enjoy working in a fast-paced, design-driven, product development 
cycle.</li>\n</ul>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Software Engineering IC5 – The typical base pay range for this role across the U.S. is USD $139,900 – $274,800 per year.</li>\n<li>There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_b0dff67a-5b5","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft AI","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/member-of-technical-staff-reinforcement-learning-systems-mai-superintelligence-team/","x-work-arrangement":"onsite","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"USD $139,900 – $274,800 per year","x-skills-required":["C","C++","C#","Java","JavaScript","Python","Generative AI","Distributed computing"],"x-skills-preferred":["Experience with Nvidia GB200 72NVL CX8 and AMD MIxxx architectures","Experience with GPU clusters"],"datePosted":"2026-03-06T07:28:16.942Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Mountain View"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"C, C++, C#, Java, JavaScript, Python, Generative AI, Distributed computing, Experience with Nvidia GB200 72NVL CX8 and AMD MIxxx architectures, Experience with GPU clusters","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":139900,"maxValue":274800,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_7abfb827-590"},"title":"Member of Technical Staff, High Performance Computing 
Engineer","description":"<p><strong>Summary</strong></p>\n<p>Microsoft AI is looking for experienced High Performance Computing Engineers (Members of Technical Staff) to help build and scale the infrastructure that trains its frontier models and powers the next evolution of its personal AI, Copilot.</p>\n<p><strong>About the Role</strong></p>\n<p>This role offers the unique opportunity to work on some of the largest-scale supercomputers in the world – a rare chance to operate at such a significant scale. As a Member of Technical Staff, High Performance Computing Engineer, you will take end-to-end ownership of large-scale HPC environments and the schedulers that keep them running, drawing on hands-on engineering experience in production settings.</p>\n<p><strong>Accountabilities</strong></p>\n<ul>\n<li>Design, operate, and maintain large-scale HPC environments, drawing on hands-on engineering experience in production settings.</li>\n<li>Own the deployment, configuration, and day-to-day operation of HPC schedulers (e.g., SLURM, Kubernetes), ensuring reliable and efficient job scheduling at scale.</li>\n</ul>\n<p><strong>The Candidate we&#39;re looking for</strong></p>\n<p><strong>Experience:</strong></p>\n<ul>\n<li>4+ years technical engineering experience with deploying or operating on-premises or cloud high-performance clusters.</li>\n<li>4+ years experience working with high-scale training clusters (e.g., 
working with frameworks/tools such as NVIDIA InfiniBand clusters, SLURM, Kubernetes, Ray, etc.).</li>\n<li>4+ years experience building scalable services on top of public cloud infrastructure like Azure, AWS, or GCP.</li>\n</ul>\n<p><strong>Technical skills:</strong></p>\n<ul>\n<li>Experience with LLM training clusters.</li>\n<li>Experience working with AI platforms, frameworks, and APIs.</li>\n<li>Experience using machine learning frameworks, including using, deploying, and scaling large language models, either personally or professionally.</li>\n</ul>\n<p><strong>Personal attributes:</strong></p>\n<ul>\n<li>Ability to identify, analyze, and resolve complex technical issues, ensuring optimal performance, scalability, and user experience.</li>\n<li>Dedication to writing clean, maintainable, and well-documented code with a focus on application quality, performance, and security.</li>\n</ul>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Competitive salary.</li>\n<li>Comprehensive benefits package.</li>\n<li>Opportunities for professional growth and development.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_7abfb827-590","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft AI","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/member-of-technical-staff-high-performance-computing-engineer-mai-superintelligence-team/","x-work-arrangement":"onsite","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"Competitive salary","x-skills-required":["High Performance Computing","Cloud Infrastructure","Machine Learning","AI Platforms","Frameworks and APIs"],"x-skills-preferred":["LLM Training Clusters","AI Platforms","Frameworks and 
APIs"],"datePosted":"2026-03-06T07:26:54.143Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Multiple Locations, United States"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"High Performance Computing, Cloud Infrastructure, Machine Learning, AI Platforms, Frameworks and APIs, LLM Training Clusters, AI Platforms, Frameworks and APIs"}]}