{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/gpu-clusters"},"x-facet":{"type":"skill","slug":"gpu-clusters","display":"GPU Clusters","count":12},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_9c3667a3-140"},"title":"Token-as-a-Service Technical Program Manager","description":"<p><strong>Compensation</strong></p>\n<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. 
In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>\n<ul>\n<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>\n</ul>\n<ul>\n<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>\n</ul>\n<ul>\n<li>401(k) retirement plan with employer match</li>\n</ul>\n<ul>\n<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>\n</ul>\n<ul>\n<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>\n</ul>\n<ul>\n<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>\n</ul>\n<ul>\n<li>Mental health and wellness support</li>\n</ul>\n<ul>\n<li>Employer-paid basic life and disability coverage</li>\n</ul>\n<ul>\n<li>Annual learning and development stipend to fuel your professional growth</li>\n</ul>\n<ul>\n<li>Daily meals in our offices, and meal delivery credits as eligible</li>\n</ul>\n<ul>\n<li>Relocation support for eligible employees</li>\n</ul>\n<ul>\n<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>\n</ul>\n<p><strong>About the Team</strong></p>\n<p>OpenAI&#39;s Stargate and 3P Engineering teams are responsible for building and scaling the external infrastructure ecosystem that powers advanced AI systems. 
We work across hyperscalers, colocation providers, cloud partners, and strategic third-party operators to turn contracted capacity into production-ready compute.</p>\n<p><strong>About the Role</strong></p>\n<p>We are seeking a Technical Program Manager, Token-as-a-Service (TaaS) to lead delivery of external compute capacity that directly serves OpenAI model workloads.</p>\n<p>In this role, you will own complex cross-functional programs that transform third-party infrastructure into usable tokens at scale. You will partner across engineering, capacity planning, networking, hardware, finance, product, and external providers to ensure that deployed capacity translates into real production throughput.</p>\n<p><strong>Key Responsibilities</strong></p>\n<ul>\n<li>Lead end-to-end delivery programs that convert external infrastructure capacity into production-ready token supply.</li>\n</ul>\n<ul>\n<li>Own readiness across compute, storage, networking, security, and operational dependencies for third-party environments.</li>\n</ul>\n<ul>\n<li>Build integrated plans across internal engineering teams and external partners with clear milestones, owners, risks, and critical paths.</li>\n</ul>\n<ul>\n<li>Drive launch execution for new partner regions, clusters, and capacity expansions.</li>\n</ul>\n<ul>\n<li>Create operating mechanisms that measure deployed capacity versus usable token output.</li>\n</ul>\n<ul>\n<li>Identify bottlenecks preventing token generation (network constraints, hardware readiness, software enablement, partner delays, etc.) 
and drive resolution.</li>\n</ul>\n<ul>\n<li>Coordinate with capacity planning and finance teams to prioritize the highest ROI capacity opportunities.</li>\n</ul>\n<ul>\n<li>Establish executive-level reporting on delivery status, risks, and token ramp forecasts.</li>\n</ul>\n<ul>\n<li>Improve repeatability of partner onboarding, technical integration, and scaling motions.</li>\n</ul>\n<ul>\n<li>Manage escalations across internal and external stakeholders during high-severity delivery issues.</li>\n</ul>\n<ul>\n<li>Translate ambiguous infrastructure constraints into clear execution plans.</li>\n</ul>\n<ul>\n<li>Help define the long-term operating model for Token-as-a-Service across Stargate and 3P ecosystems.</li>\n</ul>\n<p><strong>Qualifications</strong></p>\n<ul>\n<li>8+ years of Technical Program Management, Engineering Program Management, or Infrastructure Delivery experience.</li>\n</ul>\n<ul>\n<li>Experience leading large-scale technical programs involving cloud, data center, networking, hardware, or distributed systems.</li>\n</ul>\n<ul>\n<li>Strong understanding of compute infrastructure, clusters, networking, storage, and production systems.</li>\n</ul>\n<ul>\n<li>Proven ability to drive cross-functional execution across engineering, operations, finance, and external vendors.</li>\n</ul>\n<ul>\n<li>Experience managing executive stakeholders and communicating complex tradeoffs clearly.</li>\n</ul>\n<ul>\n<li>Strong analytical skills with ability to reason about utilization, throughput, capacity, and operational metrics.</li>\n</ul>\n<ul>\n<li>Comfortable operating in ambiguous, fast-scaling environments.</li>\n</ul>\n<ul>\n<li>Strong written and verbal communication skills.</li>\n</ul>\n<ul>\n<li>High ownership mentality with bias toward action.</li>\n</ul>\n<ul>\n<li>Experience working with external providers, strategic partners, or hyperscalers is highly preferred.</li>\n</ul>\n<p><strong>Preferred Skills</strong></p>\n<ul>\n<li>Experience with GPU 
clusters, AI infrastructure, or large-scale model serving environments.</li>\n</ul>\n<ul>\n<li>Familiarity with token economics, inference capacity planning, or workload scheduling.</li>\n</ul>\n<ul>\n<li>Experience scaling global infrastructure through third-party providers.</li>\n</ul>\n<ul>\n<li>Background in systems engineering, networking, or hardware deployment programs.</li>\n</ul>\n<ul>\n<li>Experience building new operational models in high-growth environments.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_9c3667a3-140","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://openai.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/e8558280-69dc-438a-b905-623f75ae6d62?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"Full time","x-salary-range":"$342K – $555K","x-skills-required":["Technical Program Management","Engineering Program Management","Infrastructure Delivery","Cloud","Data Center","Networking","Hardware","Distributed Systems","Compute Infrastructure","Clusters","Storage","Production Systems"],"x-skills-preferred":["GPU Clusters","AI Infrastructure","Large-Scale Model Serving Environments","Token Economics","Inference Capacity Planning","Workload Scheduling","Global Infrastructure","Systems Engineering","Hardware Deployment Programs"],"datePosted":"2026-04-24T12:23:54.161Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco; Seattle"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Technical Program Management, Engineering Program Management, Infrastructure Delivery, Cloud, Data Center, Networking, Hardware, Distributed Systems, Compute 
Infrastructure, Clusters, Storage, Production Systems, GPU Clusters, AI Infrastructure, Large-Scale Model Serving Environments, Token Economics, Inference Capacity Planning, Workload Scheduling, Global Infrastructure, Systems Engineering, Hardware Deployment Programs","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":342000,"maxValue":555000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_e179812d-e4c"},"title":"Technical Program Manager, Compute Infrastructure","description":"<p>We&#39;re seeking a Technical Program Manager for Compute Infrastructure to join our engineer-first TPM team. As a Technical Program Manager, you will own the end-to-end delivery of large-scale GPU clusters, partnering with engineers to bring clusters online across external providers and partners. You&#39;ll run a broad, parallel portfolio spanning hardware, networking, power, and cooling, driving execution, risk management, and crisp alignment from working teams through leadership to deliver production-ready capacity at scale.</p>\n<p>In this role, you will:</p>\n<ul>\n<li>Lead end-to-end delivery of both New Compute SKUs and large-scale GPU clusters across an external partner ecosystem while supporting capacity planning for training and inference.</li>\n<li>Drive multi-threaded bring-up programs spanning hardware, networking, power, and cooling, owning plans, dependencies, and critical paths.</li>\n<li>Interface with chip providers to de-risk long-term onboarding to new hardware platforms by working across kernels, comms, hardware, and scheduling engineering teams.</li>\n<li>Build and operationalize program mechanisms (roadmaps, milestones, risk registers, runbooks) that make delivery predictable at massive scale.</li>\n<li>Partner with engineering to improve cluster turn-up reliability, repeatability, and automation, 
reducing time-to-serve for new capacity.</li>\n<li>Support network operations and end-to-end physical and logical bring-up of OpenAI network Points-of-Presence (PoPs), including on-site deployment, rack cabling, and close collaboration with engineering teams.</li>\n<li>Coordinate cross-functional readiness (security, finance, operations, product/research stakeholders) to ship production-ready compute.</li>\n<li>Manage integration and handoffs across teams and partners, ensuring consistent execution, clear communication, and fast issue resolution.</li>\n<li>Identify bottlenecks and systemic gaps, then drive durable fixes across tooling, process, and partner interfaces.</li>\n<li>Provide crisp executive visibility on progress, tradeoffs, and risks across a large portfolio of concurrent programs.</li>\n</ul>\n<p>You might thrive in this role if you:</p>\n<ul>\n<li>Possess a degree in a hard science, or have a demonstrated track record of engineering expertise.</li>\n<li>Have 5+ years of experience in program management for major projects including capital projects or hyperscaler infrastructure deployment.</li>\n<li>Demonstrate the ability to serve as the go-to person solely responsible for driving and delivering complex projects.</li>\n<li>Are comfortable managing cross-functional and cross-company teams, and have experience driving information and decision hygiene.</li>\n<li>Have an extensive track record of successfully delivering high-profile, technical projects against tight deadlines.</li>\n<li>Are technically adept and have effectively partnered with engineering or fundamental research teams of the highest caliber.</li>\n<li>Have experience interfacing with and leading external vendors including engineering firms, equipment suppliers, and/or construction firms.</li>\n<li>Have expertise in designing and implementing simple, scalable processes that solve complex problems.</li>\n<li>Have experience managing complicated dependencies such as logistics and/or supply 
chains.</li>\n<li>Are relentlessly resourceful and thrive in ambiguous, fast-paced environments.</li>\n<li>Are interested in and thoughtful about the impacts of AGI.</li>\n</ul>","url":"https://yubhub.co/jobs/job_e179812d-e4c","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://openai.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/8fb1615c-34bf-47c4-a1d1-b7b2f836bbd3?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"Full time","x-salary-range":"$257K – $335K","x-skills-required":["Program Management","Compute Infrastructure","GPU Clusters","Hardware","Networking","Power","Cooling","Risk Management","Cross-Functional Teams","Engineering","External Vendors","Supply Chain Management"],"x-skills-preferred":[],"datePosted":"2026-04-24T12:23:45.036Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Program Management, Compute Infrastructure, GPU Clusters, Hardware, Networking, Power, Cooling, Risk Management, Cross-Functional Teams, Engineering, External Vendors, Supply Chain Management","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":257000,"maxValue":335000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_736969e6-3f9"},"title":"CPU Storage Tech Lead","description":"<p><strong>Compensation</strong></p>\n<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. 
If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>\n<ul>\n<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>\n</ul>\n<ul>\n<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>\n</ul>\n<ul>\n<li>401(k) retirement plan with employer match</li>\n</ul>\n<ul>\n<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>\n</ul>\n<ul>\n<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>\n</ul>\n<ul>\n<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>\n</ul>\n<ul>\n<li>Mental health and wellness support</li>\n</ul>\n<ul>\n<li>Employer-paid basic life and disability coverage</li>\n</ul>\n<ul>\n<li>Annual learning and development stipend to fuel your professional growth</li>\n</ul>\n<ul>\n<li>Daily meals in our offices, and meal delivery credits as eligible</li>\n</ul>\n<ul>\n<li>Relocation support for eligible employees</li>\n</ul>\n<ul>\n<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>\n</ul>\n<p><strong>About the Team</strong></p>\n<p>The Stargate team is responsible for building the physical infrastructure that powers large-scale AI systems. 
We design and deliver next-generation data centers optimized for dense compute clusters, advanced networking, and rapidly evolving hardware platforms.</p>\n<p><strong>About the Role</strong></p>\n<p>We are seeking a CPU &amp; Storage Technical Lead to define and drive the server compute and storage architecture strategy for Stargate infrastructure.</p>\n<p>In this role, you will own technical direction across CPU platforms, memory configurations, local and disaggregated storage systems, and their integration into large-scale AI clusters. You will evaluate vendor roadmaps, lead platform tradeoff decisions, and ensure compute and storage systems are optimized for training, inference, and supporting services.</p>\n<p><strong>Key Responsibilities</strong></p>\n<ul>\n<li>Own CPU and storage technical strategy for Stargate compute infrastructure across current and future generations.</li>\n</ul>\n<ul>\n<li>Evaluate CPU platforms across performance, efficiency, memory bandwidth, PCIe topology, cost, and roadmap alignment.</li>\n</ul>\n<ul>\n<li>Define storage architectures for AI environments, including boot media, local NVMe, shared storage, caching tiers, metadata services, and high-performance data pipelines.</li>\n</ul>\n<ul>\n<li>Drive server platform decisions involving CPU, memory, NIC, GPU, and storage subsystem integration.</li>\n</ul>\n<ul>\n<li>Partner with performance modeling teams to quantify tradeoffs across compute, memory, I/O, and storage bottlenecks.</li>\n</ul>\n<ul>\n<li>Work with silicon and hardware vendors on roadmap influence, feature requests, qualification plans, and technical escalations.</li>\n</ul>\n<ul>\n<li>Lead bring-up and validation efforts for new CPU and storage platforms in lab and production environments.</li>\n</ul>\n<ul>\n<li>Partner with networking and cluster architecture teams to optimize end-to-end node design and data movement.</li>\n</ul>\n<ul>\n<li>Support supply chain and sourcing teams with technical vendor assessments and 
second-source strategies.</li>\n</ul>\n<ul>\n<li>Drive reliability, serviceability, and fleet lifecycle planning for compute and storage platforms.</li>\n</ul>\n<ul>\n<li>Translate future AI workload requirements into infrastructure platform specifications.</li>\n</ul>\n<ul>\n<li>Provide technical leadership across cross-functional stakeholders and executive reviews.</li>\n</ul>\n<p><strong>Qualifications</strong></p>\n<ul>\n<li>Bachelor’s degree in Computer Engineering, Electrical Engineering, Computer Science, or related technical field; advanced degree preferred.</li>\n</ul>\n<ul>\n<li>10+ years of experience in server hardware, systems architecture, data center infrastructure, or hyperscale compute platforms.</li>\n</ul>\n<ul>\n<li>Deep expertise in modern CPU architectures (x86, ARM, accelerator host systems) and server platform design.</li>\n</ul>\n<ul>\n<li>Strong understanding of memory systems, PCIe/CXL fabrics, NUMA behavior, and platform-level performance constraints.</li>\n</ul>\n<ul>\n<li>Experience with storage systems including NVMe, SSD qualification, RAID, distributed storage, object/file systems, or high-performance data pipelines.</li>\n</ul>\n<ul>\n<li>Experience evaluating hardware tradeoffs across performance, cost, power, thermals, and supply availability.</li>\n</ul>\n<ul>\n<li>Familiarity with GPU clusters and AI training/inference infrastructure strongly preferred.</li>\n</ul>\n<ul>\n<li>Experience working directly with OEMs, ODMs, silicon vendors, or storage vendors.</li>\n</ul>\n<ul>\n<li>Strong systems thinking with ability to connect component decisions to fleet-level outcomes.</li>\n</ul>\n<ul>\n<li>Excellent communication skills with the ability to influence engineering and executive stakeholders.</li>\n</ul>\n<ul>\n<li>Proven ability to operate in fast-moving, ambiguous environments with high ownership.</li>\n</ul>\n<p><strong>Preferred Skills</strong></p>\n<ul>\n<li>Experience designing infrastructure for large-scale AI or HPC 
environments.</li>\n</ul>\n<ul>\n<li>Familiarity with CPU vendor roadmaps across AMD, Intel, and ARM ecosystems.</li>\n</ul>\n<ul>\n<li>Experience with distributed storage architectures supporting GPU clusters.</li>\n</ul>\n<ul>\n<li>Knowledge of fleet operations, hardware lifecycle management, and production deployments at scale.</li>\n</ul>\n<ul>\n<li>Prior experience in hyperscale cloud, AI infrastructure, or advanced compute environments.</li>\n</ul>\n<p><strong>About OpenAI</strong></p>\n<p>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.</p>\n<p>We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.</p>\n<p>For additional information, please see <a href=\"https://cdn.openai.com/policies/eeo-policy-statement.pdf\">OpenAI’s Affirmative Action and Equal Employment Opportunity Policy Statement</a>.</p>\n<p>Background checks for applicants will be administered in accordance with applicable law, and qualified applicants with arrest or conviction records will be considered for employment consistent with those laws, including the San Francisco Fair Chance Ordinance, the Los Angeles County</p>","url":"https://yubhub.co/jobs/job_736969e6-3f9","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://openai.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/18a60850-cf8b-4374-a214-ef78b9712deb?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"Full time","x-salary-range":"$342K – $555K","x-skills-required":["server hardware","systems architecture","data center infrastructure","hyperscale compute platforms","modern CPU architectures","server platform design","memory systems","PCIe/CXL fabrics","NUMA behavior","platform-level performance constraints","storage systems","NVMe","SSD qualification","RAID","distributed storage","object/file systems","high-performance data pipelines","hardware tradeoffs","performance","cost","power","thermals","supply availability","GPU clusters","AI training/inference infrastructure","OEMs","ODMs","silicon vendors","storage vendors","strong systems thinking","component decisions","fleet-level outcomes","excellent communication skills","influence engineering and executive stakeholders","fast-moving","ambiguous environments","high ownership"],"x-skills-preferred":["infrastructure for large-scale AI or HPC environments","CPU vendor roadmaps across AMD, Intel, and ARM ecosystems","distributed storage architectures supporting GPU clusters","fleet operations","hardware lifecycle management","production deployments at scale","hyperscale cloud","AI infrastructure","advanced compute environments"],"datePosted":"2026-04-24T12:21:17.145Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco; Seattle"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"server hardware, systems architecture, data center infrastructure, hyperscale 
compute platforms, modern CPU architectures, server platform design, memory systems, PCIe/CXL fabrics, NUMA behavior, platform-level performance constraints, storage systems, NVMe, SSD qualification, RAID, distributed storage, object/file systems, high-performance data pipelines, hardware tradeoffs, performance, cost, power, thermals, supply availability, GPU clusters, AI training/inference infrastructure, OEMs, ODMs, silicon vendors, storage vendors, strong systems thinking, component decisions, fleet-level outcomes, excellent communication skills, influence engineering and executive stakeholders, fast-moving, ambiguous environments, high ownership, infrastructure for large-scale AI or HPC environments, CPU vendor roadmaps across AMD, Intel, and ARM ecosystems, distributed storage architectures supporting GPU clusters, fleet operations, hardware lifecycle management, production deployments at scale, hyperscale cloud, AI infrastructure, advanced compute environments","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":342000,"maxValue":555000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_a88f73e0-fbc"},"title":"Systems Engineer (Network / Storage / Systems)","description":"<p>We are seeking a Systems Engineer (Network / Storage / Systems) to help architect, validate, and operationalize the core infrastructure systems that enable Stargate deployments.</p>\n<p>In this role, you will work across networking, storage, system bring-up, hardware debugging, and cluster readiness. 
You will partner closely with hardware engineering, cluster software, infrastructure operations, and external vendors to ensure new systems are deployed efficiently and run reliably at scale.</p>\n<p>This role is ideal for engineers who can operate across hardware and software boundaries, solve ambiguous technical problems, and drive complex systems into production.</p>\n<p>Key Responsibilities:</p>\n<ul>\n<li>Own system engineering workstreams across one or more critical domains including networking, storage, system validation, or bring-up.</li>\n<li>Design and improve top-of-network architectures spanning frontend, WAN, OOB, firewall, and adjacent infrastructure layers.</li>\n<li>Drive logical network readiness including routing, configuration management, provisioning, and issue resolution.</li>\n<li>Define storage architectures across in-rack, in-pod, cluster, and cloud tiers with focus on performance, lifecycle, and cost efficiency.</li>\n<li>Evaluate vendor hardware and infrastructure proposals, providing technical feedback on architecture, reliability, and operational fit.</li>\n<li>Lead system bring-up for new hardware platforms including imaging, provisioning, validation, and readiness for production deployment.</li>\n<li>Debug complex system faults across firmware, NIC, GPU, server, and platform layers; drive root cause analysis with internal teams and external vendors.</li>\n<li>Build tools and automation that improve lab operations, SKU onboarding, fleet readiness, and deployment velocity.</li>\n<li>Partner with hardware, clusters, and operations teams to translate new compute platforms into stable production environments.</li>\n</ul>\n<p>Qualifications:</p>\n<ul>\n<li>7+ years of experience in systems engineering, infrastructure engineering, hardware platforms, or large-scale compute environments.</li>\n<li>Strong technical depth in one or more areas: networking, storage systems, server platforms, firmware, Linux systems, or distributed 
infrastructure.</li>\n<li>Experience bringing up new hardware systems or clusters in lab or production environments.</li>\n<li>Experience debugging low-level hardware/software issues and driving cross-functional RCA efforts.</li>\n<li>Familiarity with hyperscale infrastructure, AI clusters, HPC environments, or data center systems.</li>\n<li>Experience working with OEM, ODM, JDM, or hardware vendors.</li>\n<li>Strong scripting or software skills in Python, Go, Bash, or similar.</li>\n<li>Ability to operate effectively in fast-moving environments with high ownership and evolving technical requirements.</li>\n</ul>\n<p>Preferred Skills:</p>\n<ul>\n<li>Experience supporting GPU clusters or accelerator-based infrastructure at scale.</li>\n<li>Familiarity with cluster management, provisioning, or fleet lifecycle tooling.</li>\n<li>Experience with network automation, storage optimization, or systems observability.</li>\n<li>Background working across both hardware and software engineering organizations.</li>\n<li>Experience scaling greenfield infrastructure deployments or rapid expansion programs.</li>\n</ul>","url":"https://yubhub.co/jobs/job_a88f73e0-fbc","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://openai.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/695300fd-1332-4eff-ba1d-d87bb1691f73?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"Full time","x-salary-range":"$335K – $455K","x-skills-required":["networking","storage systems","server platforms","firmware","Linux systems","distributed infrastructure","OEM","ODM","JDM","hardware vendors","Python","Go","Bash"],"x-skills-preferred":["GPU clusters","cluster management","provisioning","fleet 
lifecycle tooling","network automation","storage optimization","systems observability"],"datePosted":"2026-04-24T12:20:32.700Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"networking, storage systems, server platforms, firmware, Linux systems, distributed infrastructure, OEM, ODM, JDM, hardware vendors, Python, Go, Bash, GPU clusters, cluster management, provisioning, fleet lifecycle tooling, network automation, storage optimization, systems observability","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":335000,"maxValue":455000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_6ca1bab3-645"},"title":"Research Engineer — Reinforcement Learning","description":"<p>You&#39;ll bring reinforcement learning to Firecrawl&#39;s core product , building the training infrastructure, reward pipelines, and fine-tuning systems that make our models meaningfully better at extracting, understanding, and structuring web data.</p>\n<p>This isn&#39;t theoretical RL research. You&#39;ll build your own training infra, run fast experiments, ship models to production, and bridge the gap between classical RL approaches and modern LLM agent systems. If you care as much about training throughput as you do about reward design, this is the role.</p>\n<p><strong>Salary Range:</strong> $180,000–$290,000/year (Range shown is for U.S.-based employees. Compensation outside the U.S. 
is adjusted fairly based on your country&#39;s cost of living.)</p>\n<p><strong>Equity Range:</strong> Up to 0.15%</p>\n<p><strong>Location:</strong> San Francisco, CA or Remote (Americas, UTC-3 to UTC-10)</p>\n<p><strong>Job Type:</strong> Full-Time</p>\n<p><strong>Experience:</strong> 3+ years in applied RL, ML engineering, or model training, with production systems</p>\n<p><strong>Visa:</strong> US Citizenship/Visa required for SF; N/A for Remote</p>\n<p><strong>Build training infrastructure and reward pipelines from scratch.</strong> Design and operate the systems that train and evaluate Firecrawl&#39;s models. You&#39;ll own the full loop: data collection, reward modeling, training runs, evaluation, and deployment. You build the infra yourself because you&#39;re the one who needs it to work.</p>\n<p><strong>Fine-tune models to achieve state-of-the-art results.</strong> Take foundation models and make them dramatically better at web data extraction, content understanding, and structured output generation. You know how to get from &#39;decent fine-tune&#39; to &#39;best-in-class&#39; and you have the patience and rigor to close that gap.</p>\n<p><strong>Bridge LLM agents and classical RL.</strong> The most interesting problems at Firecrawl sit at the intersection of modern LLM-based agents and classical RL techniques. You&#39;ll design reward signals for agent behaviors, apply RL methods to improve multi-step agent workflows, and figure out where traditional RL approaches outperform prompting, and vice versa.</p>\n<p><strong>Run fast experiments and iterate.</strong> You design experiments that test meaningful hypotheses, run them quickly, and make decisions based on results. You don&#39;t spend weeks on experiment infrastructure before getting a single result. Speed of iteration is a core part of how you work.</p>\n<p><strong>Communicate clearly to non-RL people.</strong> RL can be opaque. 
You translate your work into language that engineers, product people, and leadership can understand and act on. You know how to explain why a reward function matters without requiring everyone to read the paper.</p>\n<p><strong>Collaborate closely with the team.</strong> Work directly with the Search/IR-focused Research Engineer and the engineering team to connect RL improvements with search, ranking, and the broader product roadmap.</p>\n<p><strong>Builds their own training infra and reward pipelines.</strong> You don&#39;t wait for an ML platform team to set things up. You build the training loops, reward models, data pipelines, and evaluation frameworks yourself, because you understand that infra choices directly affect the quality of results. You&#39;ve operated GPU clusters, managed training runs, and debugged convergence issues in production.</p>\n<p><strong>Can fine-tune models to SOTA.</strong> You&#39;ve taken models from baseline to best-in-class on tasks that matter. You understand the full fine-tuning lifecycle (data curation, training dynamics, hyperparameter sensitivity, evaluation methodology), and you have the taste to know when a model is actually good versus when the eval is flattering.</p>\n<p><strong>Bridges LLM agents and classical RL.</strong> You&#39;re fluent in both worlds. You understand PPO, RLHF, reward modeling, and policy optimization, and you understand how modern LLM agents work, where they fail, and how RL techniques make them better. You see connections between these domains that most people miss.</p>\n<p><strong>Production-minded.</strong> You care about whether your models work in production, not just on benchmarks. You&#39;ve deployed models that serve real traffic and made hard tradeoffs between model quality, latency, and cost. 
Research that doesn&#39;t ship isn&#39;t research that matters here.</p>\n<p><strong>Runs fast experiments and communicates clearly.</strong> You&#39;d rather run three rough experiments this week than one polished one next month. When you have results, anyone on the team can understand what they mean: no decoder ring required.</p>\n<p><strong>Backgrounds that tend to do well:</strong> RL engineers at AI labs or applied ML teams who&#39;ve shipped models to production. Researchers who&#39;ve done RLHF or reward modeling for LLM systems. ML engineers who&#39;ve built training infrastructure at startups and cared as much about the pipeline as the model. People who&#39;ve worked at the intersection of RL and language models, whether in academic labs with a production bent or at companies building agent systems.</p>\n<p><strong>What We&#39;re NOT Looking For:</strong></p>\n<p><strong>Pure theorists.</strong> If your best RL work lives in a paper and you&#39;ve never trained a model on real data at real scale, this isn&#39;t the role. We need someone who builds and ships.</p>\n<p><strong>Researchers who need a platform team.</strong> If you expect training infrastructure, data pipelines, and evaluation frameworks to be set up before you can be productive, you&#39;ll be frustrated here. You build the tools you need.</p>\n<p><strong>People who only know one paradigm.</strong> Deep in classical RL but never worked with LLMs? LLM fine-tuner who&#39;s never touched RL? You&#39;ll be missing half the picture. This role requires fluency in both.</p>\n<p><strong>Slow iterators.</strong> If your standard experiment cycle is measured in weeks, not days, you&#39;ll struggle with the pace. We need someone who can run a meaningful experiment, interpret results, and decide next steps within a day or two.</p>\n<p><strong>Black-box communicators.</strong> If your typical update is a wall of metrics only another RL researcher can parse, this isn&#39;t the right fit. 
We need someone who can explain what&#39;s working, what&#39;s not, and why it matters, to people without RL PhDs.</p>\n<p><strong>A Note On Pace:</strong> We operate at an absurd level of urgency because the window for what we&#39;re building won&#39;t stay open forever. If that excites you, keep reading. If it doesn&#39;t, no hard feelings, but this role probably isn&#39;t for you.</p>\n<p><strong>Benefits &amp; Perks:</strong></p>\n<p><strong>Available to all employees</strong></p>\n<ul>\n<li><strong>Salary that makes sense</strong>: $180,000–$290,000/year, based on impact, not tenure</li>\n</ul>\n<ul>\n<li><strong>Own a piece</strong>: Up to 0.15% equity in what you&#39;re helping build</li>\n</ul>\n<ul>\n<li><strong>Generous PTO</strong>: 15 days mandatory, anything after 24 days, just ask (holidays excluded); take the time you need to recharge</li>\n</ul>\n<ul>\n<li><strong>Parental leave</strong>: 12 weeks fully paid, for moms and dads</li>\n</ul>\n<ul>\n<li><strong>Wellness stipend</strong>: $100/month for the gym, therapy, massages, or whatever keeps you human</li>\n</ul>\n<ul>\n<li><strong>Learning &amp; Development</strong>: Expense up to $1,000/year toward anything that helps you grow professionally</li>\n</ul>\n<ul>\n<li><strong>Team offsites</strong>: A change of scenery, minus the trust falls</li>\n</ul>\n<ul>\n<li><strong>Sabbatical</strong>: 3 paid months off after 4 years, do something fun and new</li>\n</ul>\n<p><strong>Available to US-based full-time employees</strong></p>\n<ul>\n<li><strong>Full coverage, no red tape</strong>: Medical, dental, and vision (100% for</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a 
href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_6ca1bab3-645","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Firecrawl","sameAs":"https://www.firecrawl.dev","logo":"https://logos.yubhub.co/firecrawl.dev.png"},"x-apply-url":"https://jobs.ashbyhq.com/firecrawl/26abaf11-ff85-4f8d-ba44-2b6d32aae2a1?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"Full time","x-salary-range":"$180,000–$290,000/year","x-skills-required":["Reinforcement Learning","Machine Learning","Deep Learning","Python","GPU Clusters","Training Runs","Evaluation Frameworks","Data Pipelines","Reward Modeling","Policy Optimization","LLM Agents","Classical RL Techniques"],"x-skills-preferred":[],"datePosted":"2026-04-24T12:17:17.208Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA (Hybrid) OR Remote (Americas, UTC-3 to UTC-10)"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Reinforcement Learning, Machine Learning, Deep Learning, Python, GPU Clusters, Training Runs, Evaluation Frameworks, Data Pipelines, Reward Modeling, Policy Optimization, LLM Agents, Classical RL Techniques","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":180000,"maxValue":290000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_bdf4e05a-b8c"},"title":"MTS - Site Reliability Engineer","description":"<p>As Microsoft continues to push the boundaries of AI, we are on the lookout for individuals to work with us on the most interesting and challenging AI questions of our time. 
Our vision is bold and broad: to build systems that have true artificial intelligence across agents, applications, services, and infrastructure. It’s also inclusive: we aim to make AI accessible to all (consumers, businesses, developers) so that everyone can realize its benefits.</p>\n<p>We’re looking for an experienced Site Reliability Engineer (SRE) to join our infrastructure team. In this role, you’ll blend software engineering and systems engineering to keep our large-scale distributed AI infrastructure reliable and efficient. You’ll work closely with ML researchers, data engineers, and product developers to design and operate the platforms that power training, fine-tuning, and serving generative AI models.</p>\n<p>Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.</p>\n<p>Responsibilities:</p>\n<p>Reliability &amp; Availability: Ensure uptime, resiliency, and fault tolerance of AI model training and inference systems.</p>\n<p>Observability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into model serving pipelines and infra.</p>\n<p>Performance Optimization: Analyze system performance and scalability, optimize resource utilization (compute, GPU clusters, storage, networking).</p>\n<p>Automation &amp; Tooling: Build automation for deployments, incident response, scaling, and failover in hybrid cloud/on-prem CPU+GPU environments.</p>\n<p>Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements.</p>\n<p>Security &amp; Compliance: Ensure data privacy, compliance, and secure operations across model training 
and serving environments.</p>\n<p>Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows.</p>\n<p>Qualifications:</p>\n<p>Required Qualifications: 4+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.</p>\n<p>Preferred Qualifications: Strong proficiency in Kubernetes, Docker, and container orchestration. Knowledge of CI/CD pipelines for Inference and ML model deployment. Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code. Expertise in monitoring &amp; observability tools (Grafana, Datadog, OpenTelemetry, etc.). Strong programming/scripting skills in Python, Go, or Bash. Solid knowledge of distributed systems, networking, and storage. Experience running large-scale GPU clusters for ML/AI workloads (preferred). Familiarity with ML training/inference pipelines. Experience with high-performance computing (HPC) and workload schedulers (Kubernetes operators). Background in capacity planning &amp; cost optimization for GPU-heavy environments.</p>\n<p>Work on cutting-edge infrastructure that powers the future of Generative AI. Collaborate with world-class researchers and engineers. Impact millions of users through reliable and responsible AI deployments. 
Competitive compensation, equity options, and comprehensive benefits.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_bdf4e05a-b8c","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/mts-site-reliability-engineer/?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$119,800 - $234,700 per year","x-skills-required":["Site Reliability Engineering","DevOps","Infrastructure Engineering","Kubernetes","Docker","container orchestration","CI/CD pipelines","ML model deployment","public cloud platforms","Azure","AWS","GCP","infrastructure-as-code","monitoring & observability tools","Grafana","Datadog","OpenTelemetry","Python","Go","Bash","distributed systems","networking","storage","GPU clusters","ML training/inference pipelines","high-performance computing","workload schedulers","capacity planning","cost optimization"],"x-skills-preferred":["cloud architecture","containerization","microservices","API design","security","compliance","agile development","scrum","kanban"],"datePosted":"2026-04-24T12:12:26.597Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Redmond"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Site Reliability Engineering, DevOps, Infrastructure Engineering, Kubernetes, Docker, container orchestration, CI/CD pipelines, ML model deployment, public cloud platforms, Azure, AWS, GCP, infrastructure-as-code, monitoring & observability tools, Grafana, Datadog, OpenTelemetry, Python, Go, Bash, distributed systems, networking, storage, GPU clusters, ML training/inference pipelines, 
high-performance computing, workload schedulers, capacity planning, cost optimization, cloud architecture, containerization, microservices, API design, security, compliance, agile development, scrum, kanban","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":119800,"maxValue":234700,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_2291f859-746"},"title":"MTS - Site Reliability Engineer","description":"<p>As Microsoft continues to push the boundaries of AI, we are on the lookout for experienced Site Reliability Engineers to work with us on the most interesting and challenging AI questions of our time.</p>\n<p>Our vision is bold and broad: to build systems that have true artificial intelligence across agents, applications, services, and infrastructure. It’s also inclusive: we aim to make AI accessible to all (consumers, businesses, developers) so that everyone can realize its benefits.</p>\n<p>We’re looking for an experienced Site Reliability Engineer (SRE) to join our infrastructure team. In this role, you’ll blend software engineering and systems engineering to keep our large-scale distributed AI infrastructure reliable and efficient. 
You’ll work closely with ML researchers, data engineers, and product developers to design and operate the platforms that power training, fine-tuning, and serving generative AI models.</p>\n<p>Responsibilities:</p>\n<p>Reliability &amp; Availability: Ensure uptime, resiliency, and fault tolerance of AI model training and inference systems.</p>\n<p>Observability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into model serving pipelines and infra.</p>\n<p>Performance Optimization: Analyze system performance and scalability, optimize resource utilization (compute, GPU clusters, storage, networking).</p>\n<p>Automation &amp; Tooling: Build automation for deployments, incident response, scaling, and failover in hybrid cloud/on-prem CPU+GPU environments.</p>\n<p>Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements.</p>\n<p>Security &amp; Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments.</p>\n<p>Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows.</p>\n<p>Qualifications:</p>\n<p>Required Qualifications:</p>\n<p>4+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.</p>\n<p>Strong proficiency in Kubernetes, Docker, and container orchestration.</p>\n<p>Knowledge of CI/CD pipelines for Inference and ML model deployment.</p>\n<p>Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code.</p>\n<p>Expertise in monitoring &amp; observability tools (Grafana, Datadog, OpenTelemetry, etc.).</p>\n<p>Strong programming/scripting skills in Python, Go, or Bash.</p>\n<p>Solid knowledge of distributed systems, networking, and storage.</p>\n<p>Experience running large-scale GPU clusters for ML/AI workloads 
(preferred).</p>\n<p>Familiarity with ML training/inference pipelines.</p>\n<p>Experience with high-performance computing (HPC) and workload schedulers (Kubernetes operators).</p>\n<p>Background in capacity planning &amp; cost optimization for GPU-heavy environments.</p>\n<p>Work on cutting-edge infrastructure that powers the future of Generative AI.</p>\n<p>Collaborate with world-class researchers and engineers.</p>\n<p>Impact millions of users through reliable and responsible AI deployments.</p>\n<p>Competitive compensation, equity options, and comprehensive benefits.</p>\n<p>Software Engineering IC4 – The typical base pay range for this role across the U.S. is USD $119,800 – $234,700 per year.</p>\n<p>Software Engineering IC5 – The typical base pay range for this role across the U.S. is USD $139,900 – $274,800 per year.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_2291f859-746","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/mts-site-reliability-engineer-3/?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$119,800 - $234,700 per year","x-skills-required":["Kubernetes","Docker","container orchestration","CI/CD pipelines","public cloud platforms","infrastructure-as-code","monitoring & observability tools","Python","Go","Bash","distributed systems","networking","storage","GPU clusters","ML training/inference pipelines","high-performance computing","workload schedulers","capacity planning & cost optimization"],"x-skills-preferred":[],"datePosted":"2026-04-24T12:12:10.488Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Mountain 
View"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Kubernetes, Docker, container orchestration, CI/CD pipelines, public cloud platforms, infrastructure-as-code, monitoring & observability tools, Python, Go, Bash, distributed systems, networking, storage, GPU clusters, ML training/inference pipelines, high-performance computing, workload schedulers, capacity planning & cost optimization","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":119800,"maxValue":234700,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_594b20c4-c28"},"title":"Infrastructure Engineer, Security","description":"<p>We&#39;re looking for an infrastructure engineer to own and evolve the security infrastructure that underpins our foundation models. In this role, you&#39;ll work across compute, storage, networking, and data platforms, making sure our systems are secure, reliable, and built to scale.</p>\n<p>You&#39;ll shape controls, architecture, and tooling so that security is part of how the platform works by default. 
You&#39;ll partner closely with research and product teams, enabling them to move quickly while keeping our models, data, and environments protected.</p>\n<p>Key responsibilities include:</p>\n<p>Architecting security patterns for platforms and services, including network segmentation, service-to-service authentication, RBAC, and policy enforcement in Kubernetes and cloud environments.</p>\n<p>Managing identity, access, and secrets for humans and services: workload and cross-cloud identity, least-privilege IAM, and secrets management.</p>\n<p>Building secure platforms for data ingestion, processing, and curation: classification, encryption, access controls, and safe sharing patterns across teams.</p>\n<p>Writing threat models and reviewing designs with researchers and engineers to help them ship features and experiments in a safe, scalable way.</p>\n<p>Automating security checks and building guardrails: policy-as-code, secure infrastructure baselines, validation in CI/CD, and tools that make the secure path the easiest one.</p>\n<p>Requirements include:</p>\n<p>Bachelor&#39;s degree or equivalent experience in engineering, or similar.</p>\n<p>Strong background with containers and orchestration (e.g., Kubernetes) and how to secure them (namespaces, network policies, pod security, admission controls, etc.).</p>\n<p>Practical experience with Infrastructure as Code (Terraform or similar), including secure patterns for provisioning networks, IAM, and shared services.</p>\n<p>Solid understanding of cloud networking and security: VPCs, load balancers, service discovery, mTLS, firewalls, and zero-trust-style architectures.</p>\n<p>Proficiency with a systems language such as Rust and scripting in Python for building platform components and internal tools.</p>\n<p>Evidence of owning complex, production-critical systems, including debugging issues that span infra, security, and application layers.</p>\n<p>Preferred qualifications include experience with ML infrastructure, GPU 
clusters, or large-scale training environments, as well as background in AI labs, HPC environments, or ML-heavy organizations.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_594b20c4-c28","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Thinking Machines Lab","sameAs":"https://thinkingmachineslab.com/","logo":"https://logos.yubhub.co/thinkingmachineslab.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/thinkingmachines/jobs/5015964008?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$200,000 - $475,000 USD","x-skills-required":["Kubernetes","Infrastructure as Code","Cloud Networking and Security","Systems Language (Rust)","Scripting (Python)"],"x-skills-preferred":["ML Infrastructure","GPU Clusters","Large-Scale Training Environments","AI Labs","HPC Environments"],"datePosted":"2026-04-18T15:50:20.174Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Kubernetes, Infrastructure as Code, Cloud Networking and Security, Systems Language (Rust), Scripting (Python), ML Infrastructure, GPU Clusters, Large-Scale Training Environments, AI Labs, HPC Environments","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":200000,"maxValue":475000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_854e95b5-76b"},"title":"Sr. Director of Product, Research and Training Infrastructure","description":"<p>CoreWeave is seeking a visionary Sr. 
Director of Product, Research and Training Infrastructure to lead the product strategy and engineering execution for the services that power the most ambitious AI research labs in the world.</p>\n<p>This executive leader will own the product strategy and engineering execution for the Research Training Stack, focusing on the specialized orchestration, evaluation, and iteration tools required for massive-scale pre-training and post-training.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Frontier Orchestration: Oversee the evolution of SUNK (Slurm on Kubernetes) to provide researchers with deterministic, bare-metal performance through a cloud-native interface.</li>\n</ul>\n<ul>\n<li>Holistic Training Services: Drive the development of next-generation orchestrators and automated training-based evaluation frameworks that ensure model quality throughout the lifecycle.</li>\n</ul>\n<ul>\n<li>Post-Training Excellence: Build the infrastructure required for sophisticated Reinforcement Learning (RL) and RLHF pipelines, enabling labs to refine foundation models with maximum efficiency.</li>\n</ul>\n<ul>\n<li>Customer Advocacy: Act as the primary technical partner for lead researchers at global AI labs, translating their &#39;future-state&#39; requirements into actionable product roadmaps.</li>\n</ul>\n<p>Requirements include:</p>\n<ul>\n<li>Proven experience in engineering leadership, with 5+ years managing large-scale infrastructure at a top-tier research lab or an AI-native cloud provider.</li>\n</ul>\n<ul>\n<li>Deep, hands-on knowledge of Slurm, Kubernetes, and the specific networking requirements (InfiniBand/RDMA) for distributed training clusters.</li>\n</ul>\n<ul>\n<li>Research mindset and understanding of the &#39;pain points&#39; of a research scientist.</li>\n</ul>\n<ul>\n<li>Scaling experience delivering mission-critical services on multi-thousand GPU clusters (H100/Blackwell/Rubin architectures).</li>\n</ul>\n<ul>\n<li>Strategic vision to 
define &#39;what&#39;s next&#39; in the AI stack, from automated RL loops to specialized sandbox environments.</li>\n</ul>\n<p>Why CoreWeave?</p>\n<p>In 2026, CoreWeave is the foundation of the largest infrastructure buildout in human history. We are building AI Factories, not just data centers.</p>\n<ul>\n<li>Silicon-Up Innovation: Work directly with the latest NVIDIA architectures.</li>\n</ul>\n<ul>\n<li>Impact: You will be the architect of the environment that enables the next new discovery.</li>\n</ul>\n<p>Velocity: We move at the speed of the researchers we support, bypassing legacy cloud bottlenecks to deliver raw power.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_854e95b5-76b","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4665964006?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply","x-work-arrangement":"hybrid","x-experience-level":"executive","x-job-type":"full-time","x-salary-range":"$233,000 to $341,000","x-skills-required":["Slurm","Kubernetes","InfiniBand/RDMA","Distributed training clusters","GPU clusters","H100/Blackwell/Rubin architectures","Reinforcement Learning (RL)","RLHF pipelines"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:50:11.130Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Slurm, Kubernetes, InfiniBand/RDMA, Distributed training clusters, GPU clusters, H100/Blackwell/Rubin architectures, Reinforcement Learning (RL), RLHF 
pipelines","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":233000,"maxValue":341000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_a8092b6e-7f5"},"title":"Bare Metal Support Engineer","description":"<p>As a Bare Metal Support Engineer at CoreWeave, you will be responsible for supporting, operating, and maintaining CoreWeave&#39;s extensive GPU fleet across our growing data centers in the U.S., Europe, and beyond.</p>\n<p>You will work closely with customers, data center technicians, and engineering teams to ensure the reliability, performance, and scalability of our infrastructure.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Providing high-level support for customers utilizing bare-metal GPU fleets on CoreWeave Cloud.</li>\n<li>Diagnosing, triaging, and investigating reported customer issues and high-priority incidents, identifying root causes and escalating when necessary.</li>\n<li>Developing a deep understanding of customer workloads and use cases to provide tailored technical support.</li>\n<li>Coordinating remote troubleshooting and hardware interventions with Data Center Technicians.</li>\n<li>Creating and maintaining internal documentation, including troubleshooting guides, best practices, and knowledge base articles.</li>\n<li>Participating in an on-call rotation to support production clusters and ensure operational reliability.</li>\n<li>Collaborating with engineering teams to improve hardware reliability, software stability, and system performance.</li>\n<li>Implementing automation and scripting to streamline support workflows and reduce manual interventions.</li>\n<li>Performing in-depth log analysis and debugging across multiple layers of the stack (firmware, drivers, hardware).</li>\n<li>Providing feedback to internal teams on common support issues to drive continuous 
improvements.</li>\n<li>Working with networking teams to troubleshoot connectivity issues affecting customer workloads.</li>\n<li>Supporting supercomputing infrastructure running GPU workloads at scale.</li>\n<li>Driving operational excellence by refining internal processes and support methodologies.</li>\n</ul>\n<p>To succeed in this role, you will need:</p>\n<ul>\n<li>Experience in data centers, GPU clusters, server deployments, system administration, or hardware troubleshooting.</li>\n<li>Demonstrated experience driving resolutions and continuous improvements across cross-functional environments and teams within a data center environment.</li>\n<li>Intermediate knowledge of Linux (Ubuntu, CentOS, or similar), including command-line proficiency.</li>\n<li>Experience with NVIDIA GPUs, SuperMicro systems, Dell systems, high-performance computing (HPC), and large-scale data center environments.</li>\n<li>Experience in networking fundamentals (TCP/IP, VLANs, DNS, DHCP) and troubleshooting tools.</li>\n<li>Hands-on experience with firmware updates, BIOS configurations, and driver management.</li>\n<li>Experience analyzing system logs and debugging issues across firmware, drivers, and hardware layers.</li>\n<li>Experience working with Jira, Confluence, Notion, or other issue-tracking and documentation platforms.</li>\n<li>Experience in scripting and automation (Python, Bash, Ansible, or similar).</li>\n</ul>\n<p>If you&#39;re a curious and analytical individual with a passion for problem-solving and a desire to work in a fast-paced environment, we&#39;d love to hear from you!</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a 
href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_a8092b6e-7f5","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4560350006?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply","x-work-arrangement":"hybrid","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":"$83,000 to $132,000","x-skills-required":["Linux","GPU clusters","server deployments","system administration","hardware troubleshooting","NVIDIA GPUs","SuperMicro systems","Dell systems","high-performance computing","large-scale data center environments","networking fundamentals","troubleshooting tools","firmware updates","BIOS configurations","driver management","system logs","debugging issues","Jira","Confluence","Notion","issue-tracking","documentation platforms","scripting","automation"],"x-skills-preferred":["Kubernetes","Docker","containerized infrastructure"],"datePosted":"2026-04-18T15:49:58.535Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Linux, GPU clusters, server deployments, system administration, hardware troubleshooting, NVIDIA GPUs, SuperMicro systems, Dell systems, high-performance computing, large-scale data center environments, networking fundamentals, troubleshooting tools, firmware updates, BIOS configurations, driver management, system logs, debugging issues, Jira, Confluence, Notion, issue-tracking, documentation platforms, scripting, automation, Kubernetes, Docker, containerized 
infrastructure","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":83000,"maxValue":132000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_a0051ff6-ddf"},"title":"Facilities Operations Manager","description":"<p>We&#39;re seeking a driven Facilities Operations Manager to join our team and ensure the relentless performance of our data center infrastructure. This role is critical to maintaining the uptime and efficiency of the systems powering our AI breakthroughs.</p>\n<p>As a Facilities Operations Manager, you&#39;ll lead teams, oversee cutting-edge facilities, and solve complex problems in real time to keep our mission on track. You&#39;ll own the operation of power, cooling, and monitoring systems at scale, bringing technical depth and a no-excuses mindset to our facility.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Manage all aspects of data center critical infrastructure (switchgear, generators, UPS systems, chillers, liquid cooling, and building monitoring), ensuring 99.999%+ uptime.</li>\n<li>Lead 24x7 teams of facility technicians and vendors, driving safety, execution, and a culture of accountability.</li>\n<li>Troubleshoot and resolve facility emergencies using root cause analysis, acting as the go-to escalation point.</li>\n<li>Spearhead optimization projects, collaborating with engineers to integrate next-gen tech and cut operational costs.</li>\n<li>Own the operations budget, balancing efficiency with performance under tight deadlines.</li>\n<li>Enforce compliance with safety and operational protocols, anticipating regulatory shifts.</li>\n<li>Coordinate with cross-functional teams to deliver high-quality outcomes and boost team morale.</li>\n<li>Support multi-site operations and new facility build-outs as xAI scales.</li>\n</ul>\n<p>Basic Qualifications:</p>\n<ul>\n<li>Minimum of 5 years in data center operations
or facility management, ideally with hyperscaler or industrial systems.</li>\n<li>Strong grasp of critical infrastructure: power, cooling, and monitoring systems.</li>\n<li>Proven ability to lead teams and manage projects under pressure.</li>\n<li>Sharp analytical and communication skills.</li>\n</ul>\n<p>Preferred Skills and Experience:</p>\n<ul>\n<li>B.S. in Engineering, Facilities Management, or related field; advanced degree a plus.</li>\n<li>Experience with GPU clusters or AI-driven data center environments.</li>\n<li>Methodical troubleshooting and technical leadership chops.</li>\n<li>Familiarity with Southaven, MS area regulations and practices is a bonus.</li>\n<li>Comfort with Excel, Word, and operational tools; CAD or monitoring software knowledge is a plus.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_a0051ff6-ddf","directApply":true,"hiringOrganization":{"@type":"Organization","name":"xAI","sameAs":"https://www.xai.com/","logo":"https://logos.yubhub.co/xai.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/xai/jobs/4685202007?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["data center operations","facility management","critical infrastructure","team leadership","project management","analytical skills","communication skills"],"x-skills-preferred":["GPU clusters","AI-driven data center environments","methodical troubleshooting","technical leadership","CAD or monitoring software"],"datePosted":"2026-04-18T15:35:02.637Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Southaven, MS"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"data center operations, facility management, critical
infrastructure, team leadership, project management, analytical skills, communication skills, GPU clusters, AI-driven data center environments, methodical troubleshooting, technical leadership, CAD or monitoring software"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_93a4ece6-182"},"title":"Member of Technical Staff, Site Reliability Engineer (HPC)","description":"<p>As Microsoft continues to push the boundaries of AI, we are on the lookout for experienced individuals to work with us on the most interesting and challenging AI questions of our time. Our vision is to build systems that have true artificial intelligence across agents, applications, services, and infrastructure. We&#39;re looking for an experienced HPC Site Reliability Engineer (SRE) to join our High Performance Computing (HPC) infrastructure team. In this role, you&#39;ll blend software engineering and systems engineering to keep our large-scale distributed AI infrastructure reliable and efficient. You&#39;ll ensure that AI systems stay efficient and reliable with very high uptimes.</p>\n<p>Microsoft&#39;s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.</p>\n<p>This role is part of Microsoft AI&#39;s Superintelligence Team. The MAIST is a startup-like team inside Microsoft AI, created to push the boundaries of AI toward Humanist Superintelligence—ultra-capable systems that remain controllable, safety-aligned, and anchored to human values. Our mission is to create AI that amplifies human potential while ensuring humanity remains firmly in control. 
We aim to deliver breakthroughs that benefit society—advancing science, education, and global well-being.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Reliability &amp; Availability: Ensure uptime, resiliency, and fault tolerance of HPC clusters powering MAI model training and inference.</li>\n<li>Observability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into all aspects of HPC systems, including GPUs, clusters, storage, and networking.</li>\n<li>Automation &amp; Tooling: Build automation for deployments, incident response, scaling, and failover in CPU+GPU environments.</li>\n<li>Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements.</li>\n<li>Security &amp; Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments.</li>\n<li>Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows.</li>\n</ul>\n<p>Required Qualifications:</p>\n<ul>\n<li>Master&#39;s Degree in Computer Science, Information Technology, or related field AND 2+ years of technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering; OR Bachelor&#39;s Degree in Computer Science, Information Technology, or related field AND 4+ years of technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering; OR equivalent experience.</li>\n</ul>\n<p>Preferred Qualifications:</p>\n<ul>\n<li>Strong proficiency in Kubernetes, Docker, and container orchestration.</li>\n<li>Knowledge of CI/CD pipelines for inference and ML model deployment.</li>\n<li>Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code.</li>\n<li>Expertise in monitoring &amp; observability tools (Grafana, Datadog, OpenTelemetry, etc.).</li>\n<li>Strong programming/scripting skills in Python, Go, or Bash.</li>\n<li>Solid knowledge of distributed systems, networking, and storage.</li>\n<li>Experience running large-scale GPU clusters for ML/AI workloads.</li>\n<li>Familiarity with ML training/inference pipelines.</li>\n<li>Experience with high-performance computing (HPC) and workload schedulers (Kubernetes operators).</li>\n<li>Background in capacity planning &amp; cost optimization for GPU-heavy environments.</li>\n</ul>\n<p>Work on cutting-edge infrastructure that powers the future of Generative AI. Collaborate with world-class researchers and engineers. Impact millions of users through reliable and responsible AI deployments. Competitive compensation, equity options, and comprehensive benefits.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_93a4ece6-182","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/member-of-technical-staff-site-reliability-engineer-hpc-mai-superintelligence-team/?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$139,900 – $274,800 per year","x-skills-required":["Kubernetes","Docker","container orchestration","CI/CD pipelines","public cloud platforms","infrastructure-as-code","monitoring & observability tools","programming/scripting skills in Python, Go, or Bash","distributed systems","networking","storage","GPU clusters","ML training/inference pipelines","high-performance computing","workload schedulers"],"x-skills-preferred":["strong proficiency in Kubernetes","knowledge of CI/CD pipelines","hands-on experience with public cloud platforms","expertise in monitoring & observability tools","strong programming/scripting skills in Python, Go, or Bash","solid knowledge of distributed systems","experience running large-scale GPU clusters","familiarity with ML training/inference pipelines","experience with
high-performance computing"],"datePosted":"2026-03-08T22:09:23.399Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Mountain View"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Kubernetes, Docker, container orchestration, CI/CD pipelines, public cloud platforms, infrastructure-as-code, monitoring & observability tools, programming/scripting skills in Python, Go, or Bash, distributed systems, networking, storage, GPU clusters, ML training/inference pipelines, high-performance computing, workload schedulers, strong proficiency in Kubernetes, knowledge of CI/CD pipelines, hands-on experience with public cloud platforms, expertise in monitoring & observability tools, strong programming/scripting skills in Python, Go, or Bash, solid knowledge of distributed systems, experience running large-scale GPU clusters, familiarity with ML training/inference pipelines, experience with high-performance computing","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":139900,"maxValue":274800,"unitText":"YEAR"}}}]}