{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/infiniband"},"x-facet":{"type":"skill","slug":"infiniband","display":"Infiniband","count":24},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_92f42e08-ee9"},"title":"Systems Engineer, HPC","description":"<p>About Mistral</p>\n<p>At Mistral AI, we build high-performance, open, and efficient AI systems designed to power the next generation of applications. Our infrastructure combines large-scale distributed systems, cloud platforms, and HPC environments to support cutting-edge research and production workloads.</p>\n<p>We are a collaborative, low-ego, and highly technical team, operating across Europe, the US, and beyond. As we scale rapidly, we are building the foundational infrastructure to support thousands of nodes and petabyte-scale systems.</p>\n<p>Join us to be part of a pioneering company shaping the future of AI. 
Together, we can make a meaningful impact.</p>\n<p><strong>About the Role</strong></p>\n<p>We are looking for Systems Engineers / System Administrators to help design, operate, and scale the infrastructure behind Mistral’s AI platforms.</p>\n<p>This is a hands-on, hybrid role combining:</p>\n<p>Systems administration (operating and troubleshooting large-scale Linux environments)</p>\n<p>Systems engineering (automation, scalability, and performance improvements)</p>\n<p>You’ll work closely with infrastructure, HPC, and research teams to ensure our clusters and platforms run reliably at scale.</p>\n<p><strong>What You’ll Work On</strong></p>\n<p><strong>Core Systems Operations</strong></p>\n<p>Operate and maintain large-scale Linux environments (bare metal, clusters, cloud)</p>\n<p>Monitor system health, troubleshoot incidents, and ensure high availability</p>\n<p>Support production and research workloads across multiple environments</p>\n<p><strong>Scaling Infrastructure</strong></p>\n<p>Help scale clusters toward hundreds to thousands of nodes</p>\n<p>Work on systems handling petabyte-scale storage</p>\n<p>Improve performance, reliability, and resource utilisation</p>\n<p><strong>Automation &amp; Engineering</strong></p>\n<p>Automate operational tasks using tools like Python, Bash, Ansible, or Terraform</p>\n<p>Improve deployment, provisioning, and system lifecycle management</p>\n<p>Contribute to system design and architecture decisions</p>\n<p><strong>Cross-Functional Collaboration</strong></p>\n<p>Work closely with:</p>\n<p>HPC / infrastructure teams</p>\n<p>Platform / DevOps engineers</p>\n<p>Research teams</p>\n<p>Act as a bridge between users and infrastructure</p>\n<p><strong>What We’re Looking For</strong></p>\n<p><strong>Must-have</strong></p>\n<p>Strong Linux systems administration experience (core requirement)</p>\n<p>Experience working in large-scale environments:</p>\n<p>HPC clusters or cloud infrastructure</p>\n<p>Experience with Job schedulers (e.g. 
Slurm)</p>\n<p>Solid troubleshooting skills across systems, hardware, and networks</p>\n<p><strong>Nice-to-have (any of these)</strong></p>\n<p>Containers / orchestration (e.g. Kubernetes)</p>\n<p>Storage systems (e.g. Ceph, Lustre, NFS)</p>\n<p>Networking fundamentals (Ethernet; InfiniBand is a plus)</p>\n<p>Infrastructure as Code / automation tooling</p>\n<p>GPU or AI/ML experience</p>\n<p><strong>Profile We Value</strong></p>\n<p>Pragmatic problem solver who can operate in fast-scaling environments</p>\n<p>Comfortable working across multiple domains (“Swiss army knife” mindset)</p>\n<p>Able to go deep in one area while learning others</p>\n<p>Low-ego, collaborative, and hands-on</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_92f42e08-ee9","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Mistral AI","sameAs":"https://mistral.ai/careers","logo":"https://logos.yubhub.co/mistral.ai.png"},"x-apply-url":"https://jobs.lever.co/mistral/c2cf8b02-cb79-4e13-8717-25817813542d","x-work-arrangement":"remote","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Linux","Python","Bash","Ansible","Terraform","Slurm","Job scheduler"],"x-skills-preferred":["Kubernetes","Ceph","Lustre","NFS","InfiniBand","Infrastructure as Code","GPU","AI/ML"],"datePosted":"2026-04-24T16:10:07.840Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Paris"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Linux, Python, Bash, Ansible, Terraform, Slurm, Job scheduler, Kubernetes, Ceph, Lustre, NFS, InfiniBand, Infrastructure as Code, GPU, AI/ML"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_376da89d-421"},"title":"HPC 
Manager","description":"<p>We are currently looking for an experienced HPC Manager to be responsible for the management, performance, and continuous evolution of the High Performance Computing (HPC) environment supporting CFD workloads and all related services at our UK site in Milton Keynes.</p>\n<p>The role ensures maximum availability, performance, and scalability of the CFD compute cluster and its ecosystem, enabling engineering teams to run complex simulations efficiently in a highly competitive, performance-driven environment.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>HPC &amp; CFD Infrastructure Management: Own and manage the CFD HPC cluster, including compute, storage, and high-performance networking; ensure optimal performance and availability of CFD workloads; manage job scheduling, resource allocation, and workload prioritization; oversee performance tuning, benchmarking, and system optimization; maintain and evolve parallel file systems and data pipelines supporting CFD; drive capacity planning and future HPC architecture evolution; willingness to travel occasionally to our UK branch in Milton Keynes (DC site); availability to respond to critical issues affecting the computing cluster, including during weekends when necessary.</li>\n</ul>\n<ul>\n<li>Collaboration with Engineering: Work closely with CFD and engineering teams to optimize simulation workflows; support users in maximizing efficiency of HPC resources; act as primary point of contact for HPC-related topics in the UK site.</li>\n</ul>\n<ul>\n<li>Operations &amp; Reliability: Ensure 24/7 reliability of HPC services supporting CFD activities; implement monitoring, alerting, and automation; lead troubleshooting of complex system and performance issues; manage software stack, compilers, libraries, and tools used in CFD environments.</li>\n</ul>\n<ul>\n<li>Leadership &amp; Continuous Improvement: Lead and develop a team of HPC engineers/administrators; define best practices, documentation, and 
operational procedures; continuously evaluate new technologies (GPU, cloud, hybrid HPC); drive efficiency, scalability, and innovation across HPC services.</li>\n</ul>\n<p>What We Offer:</p>\n<ul>\n<li>Working in a young, collaborative and international environment.</li>\n<li>Tailored training.</li>\n<li>Company Events / Briefings.</li>\n<li>On site Gym.</li>\n<li>Bonus scheme.</li>\n<li>Annual salary review process.</li>\n<li>Meal Tickets.</li>\n<li>Free additional health insurance.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_376da89d-421","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Visa Cash App Racing Bulls Formula 1 Team","sameAs":"https://jobs.redbull.com","logo":"https://logos.yubhub.co/jobs.redbull.com.png"},"x-apply-url":"https://jobs.redbull.com/gb-en/milton-keynes-vcarb-f1-team-hpc-manager-prv-ref30239o","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Linux","cluster management","HPC schedulers","InfiniBand","low-latency networking","parallel file systems","CFD workloads","simulation environments","performance tuning","optimization","leadership","stakeholder management","English"],"x-skills-preferred":["GPU computing","container technologies","automation","scripting"],"datePosted":"2026-04-24T13:17:06.413Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Milton Keynes"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Automotive","skills":"Linux, cluster management, HPC schedulers, InfiniBand, low-latency networking, parallel file systems, CFD workloads, simulation environments, performance tuning, optimization, leadership, stakeholder management, English, GPU computing, container technologies, automation, 
scripting"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_af057d6a-95c"},"title":"HPC Network Engineer","description":"<p>As an HPC Network Engineer at Mistral AI, you will design, deploy, and optimize high-performance network infrastructures for our HPC clusters and AI workloads. You will collaborate with cross-functional teams to ensure seamless integration of networking solutions with our compute, storage, and cloud platforms.</p>\n<p>Key Responsibilities:</p>\n<ul>\n<li>Design, implement, and optimize high-performance, low-latency network architectures for HPC environments, including InfiniBand, RoCE, and high-speed Ethernet.</li>\n</ul>\n<ul>\n<li>Collaborate with HPC, DevOps, and AI research teams to integrate networking solutions with compute clusters, storage systems, and cloud platforms.</li>\n</ul>\n<ul>\n<li>Troubleshoot and resolve complex network issues to minimize downtime and maximize performance.</li>\n</ul>\n<ul>\n<li>Follow escalation procedures and ensure solutions are provided in a timely manner. 
Ensure escalation progresses in accordance with the given severity.</li>\n</ul>\n<ul>\n<li>Monitor network performance, capacity, and security, implementing improvements as needed.</li>\n</ul>\n<ul>\n<li>Stay updated with emerging HPC networking technologies and best practices, and drive their adoption within Mistral.</li>\n</ul>\n<ul>\n<li>Develop and maintain documentation for network architectures, configurations, and operational procedures.</li>\n</ul>\n<p>Qualifications &amp; Experience:</p>\n<p>Technical Skills:</p>\n<ul>\n<li>Proficiency in HPC networking protocols (InfiniBand, RoCE, TCP/IP, MPLS).</li>\n</ul>\n<ul>\n<li>Hands-on experience with network hardware (switches, routers, NICs) from vendors like Mellanox, Cisco, or Arista.</li>\n</ul>\n<ul>\n<li>Knowledge of network automation tools (Ansible, Python scripting).</li>\n</ul>\n<ul>\n<li>Familiarity with HPC environments, parallel computing, and distributed systems.</li>\n</ul>\n<ul>\n<li>Experience with network security best practices.</li>\n</ul>\n<p>Soft Skills:</p>\n<ul>\n<li>Strong problem-solving and analytical skills.</li>\n</ul>\n<ul>\n<li>Ability to thrive in a fast-paced, collaborative environment.</li>\n</ul>\n<ul>\n<li>Excellent communication skills (English required; French is a plus).</li>\n</ul>\n<ul>\n<li>Teaching and documentation skills to ensure knowledge is archived and distributed to team members.</li>\n</ul>\n<p>Why Join Mistral?</p>\n<ul>\n<li>Impact: Play a pivotal role in scaling Mistral&#39;s cutting-edge AI infrastructure.</li>\n</ul>\n<ul>\n<li>Growth: Opportunity to shape data centre operations from the ground up in a high-growth startup environment.</li>\n</ul>\n<ul>\n<li>Collaboration: Work with a talented, cross-functional team passionate about AI and technology.</li>\n</ul>\n<ul>\n<li>Flexibility: Competitive compensation, benefits, and the chance to contribute to revolutionary projects.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job 
scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_af057d6a-95c","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Mistral AI","sameAs":"https://mistral.ai","logo":"https://logos.yubhub.co/mistral.ai.png"},"x-apply-url":"https://jobs.lever.co/mistral/6857fa38-ce30-4513-9930-acf7d78d42ed","x-work-arrangement":"onsite","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["HPC networking protocols","InfiniBand","RoCE","TCP/IP","MPLS","network hardware","switches","routers","NICs","Mellanox","Cisco","Arista","network automation tools","Ansible","Python scripting","HPC environments","parallel computing","distributed systems","network security best practices"],"x-skills-preferred":[],"datePosted":"2026-04-24T13:11:00.043Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Paris"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"HPC networking protocols, InfiniBand, RoCE, TCP/IP, MPLS, network hardware, switches, routers, NICs, Mellanox, Cisco, Arista, network automation tools, Ansible, Python scripting, HPC environments, parallel computing, distributed systems, network security best practices"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_7dc0b69a-5b8"},"title":"Senior Engineer, Storage Control Plane","description":"<p>We&#39;re looking for a Senior Storage Engineer to play a key role in designing, building, and operating the control plane for our high-performance AI storage platform. 
You&#39;ll help evolve CoreWeave&#39;s storage systems by building reliable, scalable, and high-throughput solutions that power some of the largest and most innovative AI workloads in the world.</p>\n<p>This role involves close collaboration with teams across infrastructure, compute, and platform to ensure our storage services scale automatically and seamlessly while maximizing performance and reliability.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Design and implement a highly scalable multi-tenant control plane that supports CoreWeave&#39;s growing AI storage and cloud infrastructure needs.</li>\n<li>Contribute to the development of exabyte-scale, S3-compatible object storage and distributed file systems, and integrate dedicated storage clusters into diverse customer environments.</li>\n<li>Work with technologies such as RDMA, GPU Direct Storage, RoCE, InfiniBand, SPDK, and distributed filesystems to optimize storage performance and efficiency.</li>\n<li>Participate in efforts to improve the reliability, durability, and observability of our storage stack.</li>\n<li>Collaborate with operations teams to monitor, analyze, and optimize storage systems using telemetry, metrics, and dashboards to improve performance, latency, and resilience.</li>\n<li>Work cross-functionally with platform, product, and infrastructure teams to deliver seamless storage capabilities across the stack.</li>\n<li>Share your knowledge and mentor other engineers on best practices in building distributed, high-performance systems.</li>\n</ul>\n<p>The ideal candidate will have:</p>\n<ul>\n<li>A Bachelor&#39;s or Master&#39;s degree in Computer Science, Engineering, or a related field.</li>\n<li>6–10 years of experience working in storage systems engineering or infrastructure.</li>\n<li>Strong hands-on experience with object storage or distributed filesystems in production environments.</li>\n<li>Experience with one or more storage protocols (e.g. 
S3, NFS) and file systems such as Ceph, DAOS, or similar.</li>\n<li>Proficiency in a systems programming language such as Go, C, or Rust.</li>\n<li>Familiarity with storage observability tools and telemetry pipelines (e.g., ClickHouse, Prometheus, Grafana).</li>\n<li>Solid understanding of cloud-native infrastructure, Kubernetes, and scalable system architecture.</li>\n<li>Strong debugging and problem-solving skills in distributed, high-performance environments.</li>\n<li>Clear communicator, able to work collaboratively across teams and share technical insights effectively.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_7dc0b69a-5b8","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4611874006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$139,000 to $204,000","x-skills-required":["object storage","distributed filesystems","RDMA","GPU Direct Storage","RoCE","InfiniBand","SPDK","cloud-native infrastructure","Kubernetes","scalable system architecture"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:57:57.450Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Sunnyvale, CA / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"object storage, distributed filesystems, RDMA, GPU Direct Storage, RoCE, InfiniBand, SPDK, cloud-native infrastructure, Kubernetes, scalable system 
architecture","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":139000,"maxValue":204000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_588dfb0e-611"},"title":"Solutions Architect - Kubernetes","description":"<p>As a Solutions Architect at CoreWeave, you will play a vital role in helping customers succeed with our cloud infrastructure offerings, focusing on Kubernetes solutions within high-performance compute (HPC) environments.</p>\n<p>Your responsibilities will include serving as the primary technical point of contact for customers, establishing strong technical relationships and ensuring their success with CoreWeave&#39;s cloud infrastructure offerings.</p>\n<p>You will collaborate closely with customers to understand their unique business needs and create, prototype, and deploy tailored solutions that align with their requirements.</p>\n<p>You will lead proof of concept initiatives to showcase the value and viability of CoreWeave&#39;s solutions within specific environments.</p>\n<p>You will drive technical leadership and direction during customer meetings, presentations, and workshops, addressing any technical queries or concerns that arise.</p>\n<p>You will act as a virtual member of CoreWeave&#39;s Kubernetes product and engineering teams, identifying opportunities for product enhancement and collaborating with engineers to implement your suggestions.</p>\n<p>You will offer valuable insights on product features, functionality, and performance, contributing regularly to discussions about product strategy and architecture.</p>\n<p>You will conduct periodic technical reviews and assessments of customer workloads, pinpointing opportunities for workload optimization and suggesting suitable solutions.</p>\n<p>You will stay informed of the latest developments and trends in Kubernetes, cloud computing and 
infrastructure, sharing your thought leadership with customers and internal stakeholders.</p>\n<p>You will lead the prototyping and initiation of research and development efforts for emerging products and solutions, delivering prototypes and key insights for internal consumption.</p>\n<p>You will represent CoreWeave at conferences and industry events, with occasional travel as required.</p>\n<p>To be successful in this role, you will need to have a B.S. in Computer Science or a related technical discipline, or equivalent experience.</p>\n<p>You will also need to have 7+ years of proven experience as a Solutions Architect, engineer, researcher, or technical account manager in cloud infrastructure, focusing on building distributed systems or HPC/cloud services, with expertise in scalable Kubernetes solutions.</p>\n<p>You will need to be fluent in cloud computing concepts, architecture, and technologies, with hands-on experience in designing and implementing cloud solutions.</p>\n<p>You will need to have a proven track record of building customer relationships, communicating clearly, and breaking down complex technical concepts for both technical and non-technical audiences.</p>\n<p>You will need to be familiar with NVIDIA GPUs typically used in AI/ML applications and associated technologies such as InfiniBand and NVIDIA Collective Communications Library (NCCL).</p>\n<p>You will need to have experience with running large-scale Artificial Intelligence/Machine Learning (AI/ML) training and inference workloads on technologies such as Slurm and Kubernetes.</p>\n<p>Preferred qualifications include code contributions to open-source inference frameworks, experience with scripting and automation related to Kubernetes clusters and workloads, experience with building solutions across multi-cloud environments, and client or customer-facing publications/talks on latency, optimization, or advanced model-server architectures.</p>\n<p 
style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_588dfb0e-611","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4557835006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$165,000 to $220,000","x-skills-required":["Kubernetes","Cloud Computing","High-Performance Compute (HPC)","Distributed Systems","Cloud Infrastructure","Scalable Solutions","NVIDIA GPUs","Infiniband","NVIDIA Collective Communications Library (NCCL)","Slurm","Kubernetes Clusters"],"x-skills-preferred":["Code Contributions to Open-Source Inference Frameworks","Scripting and Automation Related to Kubernetes Clusters and Workloads","Building Solutions Across Multi-Cloud Environments","Client or Customer-Facing Publications/Talks on Latency, Optimization, or Advanced Model-Server Architectures"],"datePosted":"2026-04-18T15:57:29.779Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Kubernetes, Cloud Computing, High-Performance Compute (HPC), Distributed Systems, Cloud Infrastructure, Scalable Solutions, NVIDIA GPUs, Infiniband, NVIDIA Collective Communications Library (NCCL), Slurm, Kubernetes Clusters, Code Contributions to Open-Source Inference Frameworks, Scripting and Automation Related to Kubernetes Clusters and Workloads, Building Solutions Across Multi-Cloud Environments, Client or Customer-Facing Publications/Talks on Latency, Optimization, or Advanced Model-Server 
Architectures","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":165000,"maxValue":220000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_d799d883-0dd"},"title":"Solutions Architect - Networking","description":"<p>As a Solutions Architect at CoreWeave, you will play a vital role in leading innovation. You will have the opportunity to demonstrate thought leadership and engage hands-on throughout our customers&#39; entire lifecycle. From establishing their Kubernetes environment to developing proofs of concept, onboarding, and optimizing workloads, you will lead innovation at every turn.</p>\n<p>In this role, you will:</p>\n<p>Serve as the primary technical point of contact for customers, establishing strong technical relationships and ensuring their success with CoreWeave&#39;s cloud infrastructure offerings, focusing on networking technologies within high-performance compute (HPC) environments. Collaborate closely with customers to understand their unique business needs and create, prototype, and deploy tailored solutions that align with their requirements. Lead proof of concept initiatives to showcase the value and viability of CoreWeave&#39;s solutions within specific environments. Drive technical leadership and direction during customer meetings, presentations, and workshops, addressing any technical queries or concerns that arise. Act as a virtual member of CoreWeave&#39;s Networking product and engineering teams, identifying opportunities for product enhancement and collaborating with engineers to implement your suggestions. Offer valuable insights on product features, functionality, and performance, contributing regularly to discussions about product strategy and architecture. 
Conduct periodic technical reviews and assessments of customer workloads, pinpointing opportunities for workload optimization and suggesting suitable solutions. Stay informed of the latest developments and trends in Kubernetes, cloud computing and infrastructure, sharing your thought leadership with customers and internal stakeholders. Lead the prototyping and initiation of research and development efforts for emerging products and solutions, delivering prototypes and key insights for internal consumption. Represent CoreWeave at conferences and industry events, with occasional travel as required.</p>\n<p>Who You Are:</p>\n<p>B.S. in Computer Science or a related technical discipline, or equivalent experience. 7+ years of proven experience as a Solutions Architect, engineer, researcher, or technical account manager in cloud infrastructure focusing on building distributed systems or HPC/cloud services, with expertise in infrastructure networking. Fluency in cloud computing concepts, architecture, and technologies, with hands-on experience in designing and implementing cloud solutions. Proven track record of building customer relationships, communicating clearly, and breaking down complex technical concepts for both technical and non-technical audiences. Expertise with a broad range of networking technologies and topics, with the familiarity to understand needs and use cases as they relate to securing and enabling high performance networking environments. 
Experience with managing infrastructure networking, Kubernetes CSI management, and private networking concepts. Familiarity with NVIDIA GPUs typically used in AI/ML applications and associated technologies such as InfiniBand and NVIDIA Collective Communications Library (NCCL).</p>\n<p>Preferred:</p>\n<p>Code contributions to open-source inference frameworks. Experience with scripting and automation related to network technologies. Experience with building solutions across multi-cloud environments. Client or customer-facing publications/talks on latency, optimization, or advanced model-server architectures.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_d799d883-0dd","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4568528006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$165,000 to $220,000","x-skills-required":["cloud computing","Kubernetes","infrastructure networking","high-performance computing","networking technologies","NVIDIA GPUs","Infiniband","NVIDIA Collective Communications Library (NCCL)"],"x-skills-preferred":["open-source inference frameworks","scripting and automation","multi-cloud environments","latency, optimization, or advanced model-server architectures"],"datePosted":"2026-04-18T15:56:27.053Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ / New York, NY / Sunnyvale, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"cloud computing, Kubernetes, infrastructure networking, high-performance computing, networking technologies, NVIDIA GPUs, Infiniband, NVIDIA Collective Communications 
Library (NCCL), open-source inference frameworks, scripting and automation, multi-cloud environments, latency, optimization, or advanced model-server architectures","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":165000,"maxValue":220000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_6f3a053e-c43"},"title":"Staff Software Engineer, AI Reliability Engineering","description":"<p>We&#39;re seeking a Staff Software Engineer to join our AI Reliability Engineering team. As a key member of our team, you will develop Service Level Objectives for large language model serving systems, design and implement monitoring and observability systems, and lead incident response for critical AI services.</p>\n<p>You will work closely with teams across Anthropic to improve reliability across our most critical serving paths. You will be responsible for making the systems that deliver Claude more robust and resilient, whether during an incident or collaborating on projects.</p>\n<p>To be successful in this role, you should have strong distributed systems, infrastructure, or reliability backgrounds. You should be curious and brave, comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don&#39;t have deep expertise yet.</p>\n<p>You will be working on high-availability serving infrastructure across multiple regions and cloud providers. 
You will support the reliability of safeguard model serving, which is critical for both site reliability and Anthropic&#39;s safety commitments.</p>\n<p>If you&#39;re committed to creating reliable, interpretable, and steerable AI systems, and you&#39;re passionate about working on complex technical problems, we&#39;d love to hear from you.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_6f3a053e-c43","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5101169008","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"€235.000-€295.000 EUR","x-skills-required":["distributed systems","infrastructure","reliability","Service Level Objectives","monitoring","observability","incident response","high-availability serving infrastructure","cloud providers"],"x-skills-preferred":["SRE","Production Engineer","chaos engineering","systematic resilience testing","AI-specific observability tools and frameworks","ML hardware accelerators","RDMA","InfiniBand"],"datePosted":"2026-04-18T15:53:59.220Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Dublin, IE"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"distributed systems, infrastructure, reliability, Service Level Objectives, monitoring, observability, incident response, high-availability serving infrastructure, cloud providers, SRE, Production Engineer, chaos engineering, systematic resilience testing, AI-specific observability tools and frameworks, ML hardware accelerators, RDMA, 
InfiniBand"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_a1ab2590-2b4"},"title":"Staff Security Engineer, Network Security","description":"<p>We are seeking a Staff Network Security Engineer to architect the defense of our global backbone, edge, and massive-scale GPU clusters. You will move beyond configuring firewalls to engineering security into the network fabric itself, utilizing telemetry, automation, and deep protocol analysis.</p>\n<p>As a Staff Network Security Engineer, you will:</p>\n<p>Unravel and tackle network security challenges at an exhilarating global scale. Collaborate with exceptional network architects and engineers building the backbone infrastructure for the AI revolution. Enjoy the freedom and support to experiment, innovate, and significantly shape our approach to securing the underlay and overlay of our cloud.</p>\n<p>In this role, you will: Conduct architecture reviews, protocol analysis, and design assessments to proactively identify and fix vulnerabilities in our backbone and data center fabrics. Develop robust, repeatable frameworks for network security automation (CoPP, ACL generation, Route Filtering) that make it easy for teams to build securely from day one. Collaborate closely with Network Engineering teams to integrate security checks and validation seamlessly into their CI/CD and config-push pipelines. Craft clear, practical security guidance and documentation that empowers engineers to deploy secure routing policies and topologies. Actively participate in architectural discussions regarding peering, transit, and traffic engineering, providing insightful security recommendations. 
Occasionally, &#39;draw the owl&#39;: figure out innovative solutions for securing massive throughput environments while navigating ambiguous situations.</p>\n<p>You will be working with a talented team of network engineers, security experts, and AI researchers to build and deploy a highly scalable and secure cloud infrastructure.</p>\n<p>If you are passionate about network security, cloud computing, and AI, and enjoy working in a fast-paced, dynamic environment, we encourage you to apply for this exciting opportunity.</p>","url":"https://yubhub.co/jobs/job_a1ab2590-2b4","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4620164006","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$188,000 to $275,000","x-skills-required":["core network protocols (BGP, OSPF/IS-IS, TCP/IP)","deep knowledge of how they function at the packet level","network automation or security tooling in Go, Python, or similar modern languages","collaborating with network architects to implement secure designs in multi-vendor environments","Linux networking internals, control plane protection, and managing infrastructure as code"],"x-skills-preferred":["hyperscale network architectures (CLOS fabrics, MPLS/EVPN, VXLAN)","hardware-level networking security (SmartNICs/DPUs, connectX)","flow-based telemetry analysis","internet routing security standards (RPKI, MANRS)","advanced DDoS mitigation strategies at the network layer","Infiniband and RoCE"],"datePosted":"2026-04-18T15:52:43.431Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, 
WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"core network protocols (BGP, OSPF/IS-IS, TCP/IP), deep knowledge of how they function at the packet level, network automation or security tooling in Go, Python, or similar modern languages, collaborating with network architects to implement secure designs in multi-vendor environments, Linux networking internals, control plane protection, and managing infrastructure as code, hyperscale network architectures (CLOS fabrics, MPLS/EVPN, VXLAN), hardware-level networking security (SmartNICs/DPUs, connectX), flow-based telemetry analysis, internet routing security standards (RPKI, MANRS), advanced DDoS mitigation strategies at the network layer, Infiniband and RoCE","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":188000,"maxValue":275000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_33821044-320"},"title":"Principal Engineer, Storage","description":"<p>We&#39;re looking for a Principal Engineer to play a key role in designing, building, and operating the data plane for our high-performance AI storage platform.</p>\n<p>You&#39;ll develop CoreWeave&#39;s storage systems by building reliable, scalable, and high-throughput solutions that power some of the largest and most innovative AI workloads in the world.</p>\n<p>This role involves close collaboration with teams across infrastructure, compute, and platform to ensure our storage services scale automatically and seamlessly while maximizing performance and reliability.</p>\n<p>About the role:</p>\n<ul>\n<li>Design and implement a highly scalable multi-tenant control plane that supports CoreWeave&#39;s growing AI storage and cloud infrastructure needs.</li>\n<li>Contribute to the development of exabyte-scale, S3-compatible object storage, distributed file system and 
integrate dedicated storage clusters into diverse customer environments.</li>\n<li>Work with technologies such as RDMA, GPU Direct Storage, RoCE, InfiniBand, SPDK, and distributed filesystems to optimize storage performance and efficiency.</li>\n<li>Participate in efforts to improve the reliability, durability, and observability of our storage stack.</li>\n<li>Collaborate with operations teams to monitor, analyze, and optimize storage systems using telemetry, metrics, and dashboards to improve performance, latency, and resilience.</li>\n<li>Work cross-functionally with platform, product, and infrastructure teams to deliver seamless storage capabilities across the stack.</li>\n<li>Share your knowledge and mentor other engineers on best practices in building distributed, high-performance systems, especially focusing on low-level storage details that improve performance and durability.</li>\n</ul>\n<p>Who You Are:</p>\n<ul>\n<li>Bachelor’s, Master’s, or PhD in Computer Science, Engineering, or a related field.</li>\n<li>8–10+ years of experience working in storage systems engineering.</li>\n<li>Strong hands-on experience with object storage, block storage, or distributed filesystems in production environments.</li>\n<li>Proficiency in a systems programming language such as Go, C, or Rust.</li>\n<li>Solid understanding of cloud-native infrastructure, Kubernetes, and scalable system architecture.</li>\n<li>Strong debugging and problem-solving skills in distributed, high-performance environments.</li>\n<li>Clear communicator, able to work collaboratively across teams and share technical insights effectively.</li>\n<li>Familiarity with the trade-offs between HDD- and SSD-based storage systems.</li>\n</ul>\n<p>The base salary range for this role is $206,000 to $303,000.</p>\n<p>In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).</p>\n<p>What We 
Offer</p>\n<p>The range we’ve posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate which can include a variety of factors. These include qualifications, experience, interview performance, and location.</p>\n<p>In addition to a competitive salary, we offer a variety of benefits to support your needs, including:</p>\n<ul>\n<li>Medical, dental, and vision insurance</li>\n<li>100% paid for by CoreWeave</li>\n<li>Company-paid Life Insurance</li>\n<li>Voluntary supplemental life insurance</li>\n<li>Short and long-term disability insurance</li>\n<li>Flexible Spending Account</li>\n<li>Health Savings Account</li>\n<li>Tuition Reimbursement</li>\n<li>Ability to Participate in Employee Stock Purchase Program (ESPP)</li>\n<li>Mental Wellness Benefits through Spring Health</li>\n<li>Family-Forming support provided by Carrot</li>\n<li>Paid Parental Leave</li>\n<li>Flexible, full-service childcare support with Kinside</li>\n<li>401(k) with a generous employer match</li>\n<li>Flexible PTO</li>\n<li>Catered lunch each day in our office and data center locations</li>\n<li>A casual work environment</li>\n<li>A work culture focused on innovative disruption</li>\n</ul>","url":"https://yubhub.co/jobs/job_33821044-320","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4646276006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$206,000 to $303,000","x-skills-required":["object storage","block storage","distributed filesystems","RDMA","GPU Direct Storage","RoCE","InfiniBand","SPDK","cloud-native infrastructure","Kubernetes","scalable 
system architecture","systems programming language","Go","C","Rust"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:51:53.363Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"object storage, block storage, distributed filesystems, RDMA, GPU Direct Storage, RoCE, InfiniBand, SPDK, cloud-native infrastructure, Kubernetes, scalable system architecture, systems programming language, Go, C, Rust","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":206000,"maxValue":303000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_9166d234-4c5"},"title":"Solutions Architect - HPC/AI/ML","description":"<p>As a Solutions Architect at CoreWeave, you will play a vital and dynamic role in helping customers establish their Kubernetes environment, develop proofs of concept, onboard, and optimise workloads. You will serve as the primary technical point of contact for customers, establishing strong technical relationships and ensuring their success with CoreWeave&#39;s cloud infrastructure offerings, focusing on AI/ML workloads within high-performance compute (HPC) environments.</p>\n<p>Collaborate closely with customers to understand their unique business needs and create, prototype, and deploy tailored solutions that align with their requirements. Lead proof of concept initiatives to showcase the value and viability of CoreWeave&#39;s solutions within specific environments.</p>\n<p>Drive technical leadership and direction during customer meetings, presentations, and workshops, addressing any technical queries or concerns that arise. 
Act as a virtual member of CoreWeave&#39;s Kubernetes product and engineering teams, identifying opportunities for product enhancement and collaborating with engineers to implement your suggestions.</p>\n<p>Offer valuable insights on product features, functionality, and performance, contributing regularly to discussions about product strategy and architecture. Conduct periodic technical reviews and assessments of customer workloads, pinpointing opportunities for workload optimisation and suggesting suitable solutions.</p>\n<p>Stay informed of the latest developments and trends in Kubernetes, cloud computing and infrastructure, sharing your thought leadership with customers and internal stakeholders. Lead the prototyping and initiation of research and development efforts for emerging products and solutions, delivering prototypes and key insights for internal consumption.</p>\n<p>Represent CoreWeave at conferences and industry events, with occasional travel as required.</p>","url":"https://yubhub.co/jobs/job_9166d234-4c5","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4649044006","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$165,000 to $225,000 SGD","x-skills-required":["cloud computing concepts","architecture","technologies","NVIDIA GPUs","Infiniband","NVIDIA Collective Communications Library (NCCL)","Slurm","Kubernetes"],"x-skills-preferred":["code contributions to open-source inference frameworks","scripting and automation related to AI/ML workloads","building solutions across multi-cloud environments","client or customer-facing publications/talks on latency, optimisation, or advanced 
model-server architectures"],"datePosted":"2026-04-18T15:51:30.371Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Singapore"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"cloud computing concepts, architecture, technologies, NVIDIA GPUs, Infiniband, NVIDIA Collective Communications Library (NCCL), Slurm, Kubernetes, code contributions to open-source inference frameworks, scripting and automation related to AI/ML workloads, building solutions across multi-cloud environments, client or customer-facing publications/talks on latency, optimisation, or advanced model-server architectures","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":165000,"maxValue":225000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_fb9b187c-e32"},"title":"HPC Engineer","description":"<p>We are seeking a skilled and driven NVLink Engineer to support large-scale data center deployments. 
In this role, you&#39;ll be at the forefront of cutting-edge infrastructure technologies, ensuring the optimal performance and stability of NVLink systems.</p>\n<p>Key Responsibilities:</p>\n<ul>\n<li>Support the deployment of NVLink systems across large data center environments.</li>\n<li>Support the full lifecycle management of NVLink hardware and software components.</li>\n<li>Build and maintain tooling to automate and streamline the deployment, monitoring and troubleshooting workflows.</li>\n<li>Diagnose and resolve performance, connectivity and stability issues in complex environments.</li>\n<li>Collaborate with internal teams and external customers worldwide.</li>\n<li>Participate in a rotating on-call schedule to ensure 24/7 support coverage.</li>\n</ul>\n<p>Required Qualifications:</p>\n<ul>\n<li>Solid understanding of networking fundamentals</li>\n<li>Proven background in troubleshooting network and server hardware at the component level.</li>\n<li>Strong Linux system administration skills.</li>\n<li>Proficiency in at least one language (e.g., Python, Go).</li>\n<li>Proven ability to troubleshoot and debug complex application issues.</li>\n<li>Excellent communication and collaboration skills.</li>\n<li>Experience with Ansible.</li>\n</ul>\n<p>Preferred Qualifications:</p>\n<ul>\n<li>Experience with InfiniBand networking.</li>\n<li>Experience managing large-scale environments (1,000+ switches or nodes).</li>\n<li>Prior experience with NVLink technologies.</li>\n<li>Knowledge of Redfish API for system management.</li>\n<li>Experience with NVUE (NVIDIA User Experience).</li>\n<li>Background with SONiC.</li>\n<li>Experience with Grafana/PromQL</li>\n</ul>
","url":"https://yubhub.co/jobs/job_fb9b187c-e32","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4645664006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$109,000 to $204,000","x-skills-required":["Networking fundamentals","Linux system administration","Python","Go","Troubleshooting and debugging"],"x-skills-preferred":["InfiniBand networking","Ansible","Redfish API","NVUE","SONiC","Grafana/PromQL"],"datePosted":"2026-04-18T15:50:52.753Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"New York, NY/ Bellevue, WA/ Sunnyvale, CA / Livingston, NJ"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Networking fundamentals, Linux system administration, Python, Go, Troubleshooting and debugging, InfiniBand networking, Ansible, Redfish API, NVUE, SONiC, Grafana/PromQL","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":109000,"maxValue":204000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_1868194d-726"},"title":"Operations Engineer, HPC Networking","description":"<p>In this role, you will support the deployment, monitoring, troubleshooting, and maintenance of large-scale InfiniBand fabrics, ensuring their stability and performance.</p>\n<p>The ideal candidate will have a strong operations mindset, effective collaboration skills, and the ability to solve complex issues in a dynamic environment.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Regularly monitoring the performance and health of InfiniBand fabrics, including switches, host adapters, and 
nodes.</li>\n<li>Investigating and resolving operational issues within InfiniBand fabrics, such as network connectivity problems and performance bottlenecks.</li>\n<li>Assisting with the installation and operational bring-up of large InfiniBand fabrics in collaboration with onsite personnel and customer teams.</li>\n<li>Performing routine maintenance and upgrades on InfiniBand switches and control plane components.</li>\n<li>Collaborating with HPC cluster operations teams to provide troubleshooting and operational expertise.</li>\n</ul>\n<p>Investing in our people is one of our top priorities, and we value candidates who can bring their diversified experiences to our teams.</p>\n<p>Minimum Qualifications:</p>\n<ul>\n<li>At least 1 year of experience with InfiniBand or similar networking technologies.</li>\n<li>Solid understanding of networking concepts, including architectures, topologies, operational best practices, and troubleshooting.</li>\n<li>Experience with Linux system administration and maintenance.</li>\n<li>Proficiency in at least one scripting language.</li>\n</ul>\n<p>Preferred Qualifications:</p>\n<ul>\n<li>Hands-on experience with Nvidia UFM or similar fabric management tools.</li>\n<li>Familiarity with SLURM job scheduler and its role in HPC environments.</li>\n<li>Experience with monitoring and visualization platforms such as Grafana or Prometheus.</li>\n<li>Experience with operational tooling and automation frameworks like Ansible.</li>\n<li>Knowledge of data center operations, including server racks and cabling.</li>\n<li>Python or Bash scripting.</li>\n</ul>\n<p>Why CoreWeave? At CoreWeave, we work hard, have fun, and move fast! We’re in an exciting stage of hyper-growth that you will not want to miss out on. We’re not afraid of a little chaos, and we’re constantly learning. 
Our team cares deeply about how we build our product and how we work together, which is represented through our core values:</p>\n<ul>\n<li>Be Curious at Your Core</li>\n<li>Act Like an Owner</li>\n<li>Empower Employees</li>\n<li>Deliver Best-in-Class Client Experiences</li>\n<li>Achieve More Together</li>\n</ul>\n<p>We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and enables the development of innovative solutions to complex problems. As we get set for takeoff, the organization&#39;s growth opportunities are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too.</p>\n<p>Come join us!</p>\n<p>The base salary range for this role is $110,000 to $179,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation.</p>","url":"https://yubhub.co/jobs/job_1868194d-726","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4673462006","x-work-arrangement":"hybrid","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":"$110,000 to $179,000","x-skills-required":["InfiniBand","Linux system administration","Scripting language","Networking concepts","Architectures","Topologies","Operational best practices","Troubleshooting"],"x-skills-preferred":["Nvidia UFM","SLURM job scheduler","Grafana","Prometheus","Ansible","Data center operations","Server racks","Cabling","Python","Bash 
scripting"],"datePosted":"2026-04-18T15:50:12.336Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"InfiniBand, Linux system administration, Scripting language, Networking concepts, Architectures, Topologies, Operational best practices, Troubleshooting, Nvidia UFM, SLURM job scheduler, Grafana, Prometheus, Ansible, Data center operations, Server racks, Cabling, Python, Bash scripting","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":110000,"maxValue":179000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_854e95b5-76b"},"title":"Sr. Director of Product, Research and Training Infrastructure","description":"<p>CoreWeave is seeking a visionary Sr. 
Director of Product, Research and Training Infrastructure to lead the product strategy and engineering execution for the services that power the most ambitious AI research labs in the world.</p>\n<p>This executive leader will own the product strategy and engineering execution for the Research Training Stack, focusing on the specialized orchestration, evaluation, and iteration tools required for massive-scale pre-training and post-training.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Frontier Orchestration: Oversee the evolution of SUNK (Slurm on Kubernetes) to provide researchers with deterministic, bare-metal performance through a cloud-native interface.</li>\n</ul>\n<ul>\n<li>Holistic Training Services: Drive the development of next-generation orchestrators and automated training-based evaluation frameworks that ensure model quality throughout the lifecycle.</li>\n</ul>\n<ul>\n<li>Post-Training Excellence: Build the infrastructure required for sophisticated Reinforcement Learning (RL) and RLHF pipelines, enabling labs to refine foundation models with maximum efficiency.</li>\n</ul>\n<ul>\n<li>Customer Advocacy: Act as the primary technical partner for lead researchers at global AI labs, translating their &#39;future-state&#39; requirements into actionable product roadmaps.</li>\n</ul>\n<p>Requirements include:</p>\n<ul>\n<li>Proven engineering leadership experience, with at least 5 years managing large-scale infrastructure at a top-tier research lab or an AI-native cloud provider.</li>\n</ul>\n<ul>\n<li>Deep, hands-on knowledge of Slurm, Kubernetes, and the specific networking requirements (InfiniBand/RDMA) for distributed training clusters.</li>\n</ul>\n<ul>\n<li>Research mindset and understanding of the &#39;pain points&#39; of a research scientist.</li>\n</ul>\n<ul>\n<li>Scaling experience delivering mission-critical services on multi-thousand GPU clusters (H100/Blackwell/Rubin architectures).</li>\n</ul>\n<ul>\n<li>Strategic vision to 
define &#39;what&#39;s next&#39; in the AI stack, from automated RL loops to specialized sandbox environments.</li>\n</ul>\n<p>Why CoreWeave?</p>\n<p>In 2026, CoreWeave is the foundation of the largest infrastructure buildout in human history. We are building AI Factories, not just data centers.</p>\n<ul>\n<li>Silicon-Up Innovation: Work directly with the latest NVIDIA architectures.</li>\n</ul>\n<ul>\n<li>Impact: You will be the architect of the environment that enables the next new discovery.</li>\n</ul>\n<p>Velocity: We move at the speed of the researchers we support, bypassing legacy cloud bottlenecks to deliver raw power.</p>","url":"https://yubhub.co/jobs/job_854e95b5-76b","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4665964006","x-work-arrangement":"hybrid","x-experience-level":"executive","x-job-type":"full-time","x-salary-range":"$233,000 to $341,000","x-skills-required":["Slurm","Kubernetes","InfiniBand/RDMA","Distributed training clusters","GPU clusters","H100/Blackwell/Rubin architectures","Reinforcement Learning (RL)","RLHF pipelines"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:50:11.130Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Slurm, Kubernetes, InfiniBand/RDMA, Distributed training clusters, GPU clusters, H100/Blackwell/Rubin architectures, Reinforcement Learning (RL), RLHF 
pipelines","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":233000,"maxValue":341000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_99450ad6-e3b"},"title":"Network Engineer - AI/HPC","description":"<p><strong>About the Role</strong></p>\n<p>We are seeking a skilled Network Engineer to join our team at xAI. As a Network Engineer, you will play a critical role in designing and operating large-scale networks for our AI and HPC systems.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Design and operate large-scale networks with a deep understanding of congestion control on Ethernet and Infiniband</li>\n<li>Develop and optimize network configurations to ensure high performance and availability</li>\n<li>Collaborate with the team to design the next iteration of our backend and front-end networks</li>\n<li>Travel to Memphis to build capacity and participate in a team on-call rotation</li>\n</ul>\n<p><strong>Requirements</strong></p>\n<ul>\n<li>Minimum of 10 years designing and operating large-scale networks, with 5 years in the Ethernet AI/HPC space</li>\n<li>Deep understanding of congestion control on Ethernet, with Infiniband an added bonus</li>\n<li>Expertise in creating a portfolio of metrics for performance and operations to optimize the fleet for training and inference traffic</li>\n<li>Experience with Python to automate away repetitive tasks and to support daily work analyzing large data sets</li>\n</ul>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Opportunity to work with a highly motivated team focused on engineering excellence</li>\n<li>Collaborative and dynamic work environment</li>\n<li>Professional development opportunities</li>\n</ul>\n<p><strong>What We Offer</strong></p>\n<ul>\n<li>Competitive salary and benefits package</li>\n<li>Opportunity to work on cutting-edge AI and HPC 
projects</li>\n<li>Collaborative and dynamic work environment</li>\n</ul>\n<p><strong>How to Apply</strong></p>\n<p>If you are a motivated and experienced Network Engineer looking for a new challenge, please submit your application, including your resume and cover letter, to [insert contact information].</p>","url":"https://yubhub.co/jobs/job_99450ad6-e3b","directApply":true,"hiringOrganization":{"@type":"Organization","name":"xAI","sameAs":"https://www.xai.com/","logo":"https://logos.yubhub.co/xai.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/xai/jobs/4946691007","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["RoCEv2","NCCL","Python","Ethernet","Infiniband","AI training and inference workloads"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:45:15.340Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Memphis, TN"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"RoCEv2, NCCL, Python, Ethernet, Infiniband, AI training and inference workloads"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_b3746239-557"},"title":"HPC Network Engineer","description":"<p>As an HPC Network Engineer at Mistral AI, you will design, deploy, and optimize high-performance network infrastructures for our HPC clusters and AI workloads. 
You will collaborate with cross-functional teams to ensure seamless integration of networking solutions with our compute, storage, and cloud platforms.</p>\n<p>Key Responsibilities:</p>\n<ul>\n<li><p>Design, implement, and optimize high-performance, low-latency network architectures for HPC environments, including InfiniBand, RoCE, and high-speed Ethernet.</p>\n</li>\n<li><p>Collaborate with HPC, DevOps, and AI research teams to integrate networking solutions with compute clusters, storage systems, and cloud platforms.</p>\n</li>\n<li><p>Troubleshoot and resolve complex network issues to minimize downtime and maximize performance.</p>\n</li>\n<li><p>Follow escalation procedures and ensure solutions are provided in a timely manner. Ensure escalations progress in line with their assigned severity.</p>\n</li>\n<li><p>Monitor network performance, capacity, and security, implementing improvements as needed.</p>\n</li>\n<li><p>Stay updated with emerging HPC networking technologies and best practices, and drive their adoption within Mistral.</p>\n</li>\n<li><p>Develop and maintain documentation for network architectures, configurations, and operational procedures.</p>\n</li>\n</ul>\n<p>Qualifications &amp; Experience:</p>\n<p>Technical Skills:</p>\n<ul>\n<li><p>Proficiency in HPC networking protocols (InfiniBand, RoCE, TCP/IP, MPLS).</p>\n</li>\n<li><p>Hands-on experience with network hardware (switches, routers, NICs) from vendors like Mellanox, Cisco, or Arista.</p>\n</li>\n<li><p>Knowledge of network automation tools (Ansible, Python scripting).</p>\n</li>\n<li><p>Familiarity with HPC environments, parallel computing, and distributed systems.</p>\n</li>\n<li><p>Experience with network security best practices.</p>\n</li>\n</ul>\n<p>Soft Skills:</p>\n<ul>\n<li><p>Strong problem-solving and analytical skills.</p>\n</li>\n<li><p>Ability to thrive in a fast-paced, collaborative environment.</p>\n</li>\n<li><p>Excellent communication skills (English required; French is a 
plus).</p>\n</li>\n<li><p>Teaching and documentation skills to ensure knowledge is archived and distributed to team members.</p>\n</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_b3746239-557","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Mistral AI","sameAs":"https://mistral.ai","logo":"https://logos.yubhub.co/mistral.ai.png"},"x-apply-url":"https://jobs.lever.co/mistral/6857fa38-ce30-4513-9930-acf7d78d42ed","x-work-arrangement":"hybrid","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["HPC networking protocols","InfiniBand","RoCE","TCP/IP","MPLS","network hardware","switches","routers","NICs","Mellanox","Cisco","Arista","network automation tools","Ansible","Python scripting","HPC environments","parallel computing","distributed systems","network security best practices"],"x-skills-preferred":[],"datePosted":"2026-04-17T12:47:44.875Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"France, USA, UK, Germany, Singapore"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"HPC networking protocols, InfiniBand, RoCE, TCP/IP, MPLS, network hardware, switches, routers, NICs, Mellanox, Cisco, Arista, network automation tools, Ansible, Python scripting, HPC environments, parallel computing, distributed systems, network security best practices"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_24be48df-238"},"title":"Field Hardware Engineer, HPC","description":"<p>We&#39;re hiring a Field HW Engineer to work on-site at our data centre in Bruyères-le-Châtel. 
As a Field HW Engineer, you will be responsible for understanding end-to-end systems, executing complex/vendor-level interventions, and guiding L1 engineers on site.</p>\n<p>Your work will involve hands-on troubleshooting and repair of compute, storage, interconnect and cooling systems to keep our large GPU/CPU cluster healthy and scalable. You will also be responsible for leading complex interventions, advanced diagnostics, guiding and uplifting L1s, process and automation, safety and compliance, and parts and logistics.</p>\n<p>To be successful in this role, you will need 5+ years of experience in data center/server hardware or L2/L3 hardware support, with proven complex hands-on work in production (HPC/AI/Cloud at scale). You should have end-to-end hardware expertise, including comfort with CPU/memory/PCIe cards, NICs, PSUs, drives, network, power and cooling. You should also be confident in analyzing BMC/IPMI logs, Linux software logs, and crashes with simple CLI checks, and have methodical root cause analysis skills.</p>\n<p>The ideal candidate will be willing to travel between sites (Paris area or nearby regions, occasionally in Europe or US) and have a strong understanding of safety and discipline, including impeccable ESD/LOTO/PPE habits, zero rough handling, and clean, labeled, auditable work.</p>","url":"https://yubhub.co/jobs/job_24be48df-238","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Mistral AI","sameAs":"https://mistral.ai"},"x-apply-url":"https://jobs.lever.co/mistral/ea94b55b-58e1-437b-bf3d-07ed150308e3","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["data center/server hardware","L2/L3 hardware support","complex hands-on work in production (HPC/AI/Cloud at scale)","end-to-end hardware expertise","CPU/memory/PCIe 
cards","NICs","PSUs","drives","network","power and cooling","BMC/IPMI logs","linux software logs","crashes simple CLI checks","root cause analysis"],"x-skills-preferred":["vendor tools (iDRAC/iLO/IPMI)","RAID/storage basics (NVMe/SAS/SATA)","high-speed interconnect (Ethernet/InfiniBand)","coding/automation (Python/Bash)"],"datePosted":"2026-03-10T11:27:14.542Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Bruyères-le-Châtel"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"data center/server hardware, L2/L3 hardware support, complex hands-on work in production (HPC/AI/Cloud at scale), end-to-end hardware expertise, CPU/memory/PCIe cards, NICs, PSUs, drives, network, power and cooling, BMC/IPMI logs, linux software logs, crashes simple CLI checks, root cause analysis, vendor tools (iDRAC/iLO/IPMI), RAID/storage basics (NVMe/SAS/SATA), high-speed interconnect (Ethernet/InfiniBand), coding/automation (Python/Bash)"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_c8c20fa9-7f3"},"title":"Datacenter Hardware Engineer, HPC","description":"<p>About Mistral AI</p>\n<p>At Mistral AI, we believe in the power of AI to simplify tasks, save time, and enhance learning and creativity. Our technology is designed to integrate seamlessly into daily working life.</p>\n<p>We are a company that democratizes AI through high-performance, optimized, open-source and cutting-edge models, products and solutions. Our comprehensive AI platform is designed to meet enterprise needs, whether on-premises or in cloud environments.</p>\n<p>Our offerings include le Chat, the AI assistant for life and work. We are a team passionate about AI and its potential to transform society.</p>\n<p>Role Summary</p>\n<p>Our compute footprint is growing fast to support our science and engineering teams. 
We’re hiring a Datacenter HW Engineer to maintain, troubleshoot, and scale our GPU/CPU clusters safely and reliably.</p>\n<p>What you will do</p>\n<ul>\n<li>Diagnose &amp; operate core server/cluster components - Investigate and handle compute/storage hardware issues (CPU, memory, drives, NICs, GPUs, PSUs) and interconnect problems (switches, cables, transceivers; Ethernet/InfiniBand).</li>\n<li>Safety &amp; procedures - Apply lockout/tagout (LOTO) and ESD discipline; follow pre/post-work checklists; maintain tidy, safe work areas.</li>\n<li>First-line diagnostics - Triage using LEDs, POST, beep codes and basic tests; capture evidence (photos, serials, results); open/update/close tickets with clear notes.</li>\n<li>Preventive maintenance - Provide feedback and ideas to improve proactive activities, monitoring, and targeted follow-ups on recurring or specific anomalies; help turn ad-hoc checks into SOPs, alerts, and dashboards.</li>\n<li>Parts &amp; logistics - Receive and track parts, keep labeled inventory accurate, manage simple RMAs, and coordinate with vendors.</li>\n<li>Collaboration &amp; escalation - Partner with senior hardware/firmware owners on complex or multi-node issues; communicate status and next steps crisply.</li>\n<li>Documentation &amp; quality - Keep SOPs/checklists current; ensure zero undocumented changes and consistent, audit-ready records.</li>\n</ul>\n<p>About you</p>\n<ul>\n<li>Hands-on mindset in datacenters/server hardware: you can install/re-seat/swap GPU/PCIe cards, NICs, PSUs, drives, and work cleanly in racks (rails, cabling, labeling).</li>\n<li>Disciplined and meticulous: follows checklists, ESD/LOTO; no rough handling; careful with all high-value server components.</li>\n<li>Practical electrical basics: power-off, PPE, short-circuit risk awareness.</li>\n<li>Comfortable in racks: cooling, network, storage, PDU, cable management; can lift/mount safely (within HSE limits).</li>\n<li>Clear communicator: short factual updates; 
reliable teammate; punctual and process-minded.</li>\n<li>Hardware-passionate, professionally grounded: strong curiosity and craft mindset.</li>\n</ul>\n<p>Nice to have</p>\n<ul>\n<li>HPC/AI/Cloud at scale experience (production environments), large-fleet/server install &amp; maintenance in datacenters.</li>\n<li>Basic networking (Ethernet/InfiniBand) and basic Linux (boot/check; no coding needed).</li>\n<li>Coding/automation skills (Python/Bash): small tools/scripts to improve checklists, photo/serial capture, inventory sync, or simple monitoring/reporting.</li>\n<li>Experience with inventory/RMA tools and vendor coordination.</li>\n<li>Exposure to HPC/research/industrial environments.</li>\n</ul>\n<p>What we offer</p>\n<ul>\n<li>Competitive salary and equity package</li>\n<li>Health insurance</li>\n<li>Transportation allowance</li>\n<li>Sport allowance</li>\n<li>Meal vouchers</li>\n<li>Private pension plan</li>\n<li>Generous parental leave policy</li>\n</ul>","url":"https://yubhub.co/jobs/job_c8c20fa9-7f3","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Mistral AI","sameAs":"https://mistral.ai"},"x-apply-url":"https://jobs.lever.co/mistral/ddf7bcbb-e223-4768-a553-6e95df472cf7","x-work-arrangement":"onsite","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Datacenter hardware","Server hardware","GPU/CPU clusters","Networking","Linux","Scripting (Python/Bash)","Inventory/RMA tools","Vendor coordination"],"x-skills-preferred":["HPC/AI/Cloud at scale experience","Basic networking (Ethernet/InfiniBand)","Basic Linux (boot/check; no coding needed)","Coding/automation skills 
(Python/Bash)"],"datePosted":"2026-03-10T11:25:48.956Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Paris"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Datacenter hardware, Server hardware, GPU/CPU clusters, Networking, Linux, Scripting (Python/Bash), Inventory/RMA tools, Vendor coordination, HPC/AI/Cloud at scale experience, Basic networking (Ethernet/InfiniBand), Basic Linux (boot/check; no coding needed), Coding/automation skills (Python/Bash)"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_a51375e8-30e"},"title":"Member of Technical Staff, Software Co-Design AI HPC Systems","description":"<p>Our team&#39;s mission is to architect, co-design, and productionize next-generation AI systems at datacenter scale. We operate at the intersection of models, systems software, networking, storage, and AI hardware, optimizing end-to-end performance, efficiency, reliability, and cost. Our work spans today&#39;s frontier AI workloads and directly shapes the next generation of accelerators, system architectures, and large-scale AI platforms. We pursue this mission through deep hardware–software co-design, combining rigorous systems thinking with hands-on engineering. The team invests heavily in understanding real production workloads (large-scale training, inference, and emerging multimodal models) and translating those insights into concrete improvements across the stack: from kernels, runtimes, and distributed systems, all the way down to silicon-level trade-offs and datacenter-scale architectures. This role sits at the boundary between exploration and production. You will work closely with internal infrastructure, hardware, compiler, and product teams, as well as external partners across the hardware and systems ecosystem. 
Our operating model emphasizes rapid ideation and prototyping, followed by disciplined execution to drive high-leverage ideas into production systems that operate at massive scale. In addition to delivering real-world impact on large-scale AI platforms, the team actively contributes to the broader research and engineering community. Our work aligns closely with leading communities in ML systems, distributed systems, computer architecture, and high-performance computing, and we regularly publish, prototype, and open-source impactful technologies where appropriate.</p>\n<p>About the Team</p>\n<p>We build foundational AI infrastructure that enables large-scale training and inference across diverse workloads and rapidly evolving hardware generations. Our work directly shapes how AI systems are designed, deployed, and scaled today and into the future. Engineers on this team operate with end-to-end ownership, deep technical rigor, and a strong bias toward real-world impact.</p>\n<p>Microsoft Superintelligence Team</p>\n<p>Microsoft Superintelligence team’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.</p>\n<p>This role is part of Microsoft AI’s Superintelligence Team. The MAIST is a startup-like team inside Microsoft AI, created to push the boundaries of AI toward Humanist Superintelligence—ultra-capable systems that remain controllable, safety-aligned, and anchored to human values. Our mission is to create AI that amplifies human potential while ensuring humanity remains firmly in control. We aim to deliver breakthroughs that benefit society—advancing science, education, and global well-being. 
We’re also fortunate to partner with incredible product teams, giving our models the chance to reach billions of users and create immense positive impact. If you’re a brilliant, highly ambitious, and low-ego individual, you’ll fit right in—come and join us as we work on our next generation of models!</p>\n<p>Responsibilities</p>\n<p>Lead the co-design of AI systems across hardware and software boundaries, spanning accelerators, interconnects, memory systems, storage, runtimes, and distributed training/inference frameworks.</p>\n<p>Drive architectural decisions by analyzing real workloads, identifying bottlenecks across compute, communication, and data movement, and translating findings into actionable system and hardware requirements.</p>\n<p>Co-design and optimize parallelism strategies, execution models, and distributed algorithms to improve scalability, utilization, reliability, and cost efficiency of large-scale AI systems.</p>\n<p>Develop and evaluate what-if performance models to project system behavior under future workloads, model architectures, and hardware generations, providing early guidance to hardware and platform roadmaps.</p>\n<p>Partner with compiler, kernel, and runtime teams to unlock the full performance of current and next-generation accelerators, including custom kernels, scheduling strategies, and memory optimizations.</p>\n<p>Influence and guide AI hardware design at system and silicon levels, including accelerator microarchitecture, interconnect topology, memory hierarchy, and system integration trade-offs.</p>\n<p>Lead cross-functional efforts to prototype, validate, and productionize high-impact co-design ideas, working across infrastructure, hardware, and product teams. 
Mentor senior engineers and researchers, set technical direction, and raise the overall bar for systems rigor, performance engineering, and co-design thinking across the organization.</p>","url":"https://yubhub.co/jobs/job_a51375e8-30e","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft AI","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/member-of-technical-staff-software-co-design-ai-hpc-systems-mai-superintelligence-team-3/","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["AI accelerator or GPU architectures","Distributed systems and large-scale AI training/inference","High-performance computing (HPC) and collective communications","ML systems, runtimes, or compilers","Performance modeling, benchmarking, and systems analysis","Hardware–software co-design for AI workloads","Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development"],"x-skills-preferred":["Experience designing or operating large-scale AI clusters for training or inference","Deep familiarity with LLMs, multimodal models, or recommendation systems, and their systems-level implications","Experience with accelerator interconnects and communication stacks (e.g., NCCL, MPI, RDMA, high-speed Ethernet or InfiniBand)","Background in performance modeling and capacity planning for future hardware generations","Prior experience contributing to or leading hardware roadmaps, silicon bring-up, or platform architecture reviews","Publications, patents, or open-source contributions in systems, architecture, or ML 
systems"],"datePosted":"2026-03-08T22:18:41.443Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"London"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"AI accelerator or GPU architectures, Distributed systems and large-scale AI training/inference, High-performance computing (HPC) and collective communications, ML systems, runtimes, or compilers, Performance modeling, benchmarking, and systems analysis, Hardware–software co-design for AI workloads, Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development, Experience designing or operating large-scale AI clusters for training or inference, Deep familiarity with LLMs, multimodal models, or recommendation systems, and their systems-level implications, Experience with accelerator interconnects and communication stacks (e.g., NCCL, MPI, RDMA, high-speed Ethernet or InfiniBand), Background in performance modeling and capacity planning for future hardware generations, Prior experience contributing to or leading hardware roadmaps, silicon bring-up, or platform architecture reviews, Publications, patents, or open-source contributions in systems, architecture, or ML systems"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_b151fcc2-2fb"},"title":"Member of Technical Staff, High Performance Computing Engineer","description":"<p>We are looking for experienced Member of Technical Staff, High Performance Computing Engineers to help build and scale the infrastructure that trains our frontier models and powers the next evolution of our personal AI, Copilot.</p>\n<p>This role offers the unique opportunity to work on some of the largest scale supercomputers in the world – a rare chance to operate at such a significant scale.</p>\n<p><strong>Responsibilities</strong></p>\n<p>Design, operate, and maintain 
large-scale HPC environments, drawing on hands-on engineering experience in production settings.</p>\n<p>Own the deployment, configuration, and day-to-day operation of HPC schedulers (e.g., SLURM, Kubernetes), ensuring reliable and efficient job scheduling at scale.</p>\n<p>Serve as a technical owner for at least one core HPC domain (GPU compute, high-performance storage, networking, or similar), including ongoing maintenance, performance tuning, and troubleshooting of massive clusters.</p>\n<p>Develop and maintain automation and tooling using Bash and/or Python to improve cluster reliability, observability, and operational efficiency.</p>\n<p>Partner closely with researchers and engineers to support their workloads, troubleshoot cluster usage issues, and triage failed or underperforming jobs to resolution.</p>\n<p>Drive work forward independently by navigating ambiguity and technical roadblocks, delivering incremental improvements that get capabilities into users’ hands quickly.</p>\n<p><strong>Qualifications</strong></p>\n<p>Do you have a Bachelor’s degree in Computer Science or a related technical field AND 4+ years technical engineering experience with deploying or operating on-premise or cloud high-performance clusters, AND 4+ years experience working with high-scale training clusters (e.g., working with frameworks/tools such as NVIDIA InfiniBand clusters, SLURM, Kubernetes, Ray, etc.), AND 4+ years experience building scalable services on top of public cloud infrastructure like Azure, AWS, or GCP, OR equivalent experience?</p>\n<p><strong>Preferred Qualifications</strong></p>\n<p>Master’s Degree in Computer Science or a related technical field AND 6+ years technical engineering experience with deploying or operating on-premise or cloud high-performance clusters, AND 6+ years experience working with high-scale training clusters (e.g., 
working with frameworks/tools such as NVIDIA InfiniBand clusters, SLURM, Kubernetes, Ray, etc.), AND 6+ years experience building scalable services on top of public cloud infrastructure like Azure, AWS, or GCP, OR equivalent experience.</p>","url":"https://yubhub.co/jobs/job_b151fcc2-2fb","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft AI","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/member-of-technical-staff-high-performance-computing-engineer-mai-superintelligence-team-3/","x-work-arrangement":"onsite","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["HPC","SLURM","Kubernetes","GPU compute","high-performance storage","networking","Bash","Python","nvidia InfiniBand clusters","Ray"],"x-skills-preferred":["LLM training clusters","AI platforms","Machine Learning frameworks","large-scale HPC or GPU systems"],"datePosted":"2026-03-08T22:15:08.170Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Zürich"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"HPC, SLURM, Kubernetes, GPU compute, high-performance storage, networking, Bash, Python, nvidia InfiniBand clusters, Ray, LLM training clusters, AI platforms, Machine Learning frameworks, large-scale HPC or GPU systems"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_cd1a0d16-311"},"title":"Member of Technical Staff, Software Co-Design AI HPC Systems","description":"<p>Our team&#39;s mission is to architect, co-design, and productionize next-generation AI systems at datacenter scale. 
We operate at the intersection of models, systems software, networking, storage, and AI hardware, optimizing end-to-end performance, efficiency, reliability, and cost.</p>\n<p>We pursue this mission through deep hardware–software co-design, combining rigorous systems thinking with hands-on engineering. The team invests heavily in understanding real production workloads (large-scale training, inference, and emerging multimodal models) and translating those insights into concrete improvements across the stack: from kernels, runtimes, and distributed systems, all the way down to silicon-level trade-offs and datacenter-scale architectures.</p>\n<p>This role sits at the boundary between exploration and production. You will work closely with internal infrastructure, hardware, compiler, and product teams, as well as external partners across the hardware and systems ecosystem. Our operating model emphasizes rapid ideation and prototyping, followed by disciplined execution to drive high-leverage ideas into production systems that operate at massive scale.</p>\n<p>In addition to delivering real-world impact on large-scale AI platforms, the team actively contributes to the broader research and engineering community. Our work aligns closely with leading communities in ML systems, distributed systems, computer architecture, and high-performance computing, and we regularly publish, prototype, and open-source impactful technologies where appropriate.</p>\n<p>Microsoft Superintelligence Team\nMicrosoft Superintelligence team’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.</p>\n<p>This role is part of Microsoft AI’s Superintelligence Team. 
The MAIST is a startup-like team inside Microsoft AI, created to push the boundaries of AI toward Humanist Superintelligence—ultra-capable systems that remain controllable, safety-aligned, and anchored to human values. Our mission is to create AI that amplifies human potential while ensuring humanity remains firmly in control. We aim to deliver breakthroughs that benefit society—advancing science, education, and global well-being. We’re also fortunate to partner with incredible product teams giving our models the chance to reach billions of users and create immense positive impact.</p>\n<p>Responsibilities\nLead the co-design of AI systems across hardware and software boundaries, spanning accelerators, interconnects, memory systems, storage, runtimes, and distributed training/inference frameworks.</p>\n<p>Drive architectural decisions by analyzing real workloads, identifying bottlenecks across compute, communication, and data movement, and translating findings into actionable system and hardware requirements.</p>\n<p>Co-design and optimize parallelism strategies, execution models, and distributed algorithms to improve scalability, utilization, reliability, and cost efficiency of large-scale AI systems.</p>\n<p>Develop and evaluate what-if performance models to project system behavior under future workloads, model architectures, and hardware generations, providing early guidance to hardware and platform roadmaps.</p>\n<p>Partner with compiler, kernel, and runtime teams to unlock the full performance of current and next-generation accelerators, including custom kernels, scheduling strategies, and memory optimizations.</p>\n<p>Influence and guide AI hardware design at system and silicon levels, including accelerator microarchitecture, interconnect topology, memory hierarchy, and system integration trade-offs.</p>\n<p>Lead cross-functional efforts to prototype, validate, and productionize high-impact co-design ideas, working across infrastructure, hardware, and product 
teams.</p>\n<p>Mentor senior engineers and researchers, set technical direction, and raise the overall bar for systems rigor, performance engineering, and co-design thinking across the organization.</p>\n<p>Qualifications\nBachelor’s Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.</p>\n<p>Additional or Preferred Qualifications\nMaster’s Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor’s Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.</p>\n<p>Strong background in one or more of the following areas: AI accelerator or GPU architectures; Distributed systems and large-scale AI training/inference; High-performance computing (HPC) and collective communications; ML systems, runtimes, or compilers; Performance modeling, benchmarking, and systems analysis; Hardware–software co-design for AI workloads; Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development.</p>\n<p>Proven ability to work across organizational boundaries and influence technical decisions involving multiple stakeholders. Experience designing or operating large-scale AI clusters for training or inference. Deep familiarity with LLMs, multimodal models, or recommendation systems, and their systems-level implications. Experience with accelerator interconnects and communication stacks (e.g., NCCL, MPI, RDMA, high-speed Ethernet or InfiniBand). Background in performance modeling and capacity planning for future hardware generations. 
Prior experience contributing to or leading hardware roadmaps, silicon bring-up, or platform architecture reviews. Publications, patents, or open-source contributions in systems, architecture, or ML systems are a plus.</p>","url":"https://yubhub.co/jobs/job_cd1a0d16-311","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft AI","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/member-of-technical-staff-software-co-design-ai-hpc-systems-mai-superintelligence-team-2/","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$139,900 – $274,800 per year","x-skills-required":["C","C++","C#","Java","JavaScript","Python","AI accelerator or GPU architectures","Distributed systems and large-scale AI training/inference","High-performance computing (HPC) and collective communications","ML systems, runtimes, or compilers","Performance modeling, benchmarking, and systems analysis","Hardware–software co-design for AI workloads","Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development"],"x-skills-preferred":["LLMs, multimodal models, or recommendation systems, and their systems-level implications","Accelerator interconnects and communication stacks (e.g., NCCL, MPI, RDMA, high-speed Ethernet or InfiniBand)","Performance modeling and capacity planning for future hardware generations","Contributing to or leading hardware roadmaps, silicon bring-up, or platform architecture reviews","Publications, patents, or open-source contributions in systems, architecture, or ML 
systems"],"datePosted":"2026-03-08T22:13:30.666Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Redmond"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"C, C++, C#, Java, JavaScript, Python, AI accelerator or GPU architectures, Distributed systems and large-scale AI training/inference, High-performance computing (HPC) and collective communications, ML systems, runtimes, or compilers, Performance modeling, benchmarking, and systems analysis, Hardware–software co-design for AI workloads, Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development, LLMs, multimodal models, or recommendation systems, and their systems-level implications, Accelerator interconnects and communication stacks (e.g., NCCL, MPI, RDMA, high-speed Ethernet or InfiniBand), Performance modeling and capacity planning for future hardware generations, Contributing to or leading hardware roadmaps, silicon bring-up, or platform architecture reviews, Publications, patents, or open-source contributions in systems, architecture, or ML systems","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":139900,"maxValue":274800,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_73ff6f07-c0e"},"title":"Staff Software Engineer, AI Reliability Engineering","description":"<p><strong>About Anthropic</strong></p>\n<p>Anthropic&#39;s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</p>\n<p><strong>About the Role</strong></p>\n<p>Claude has your back. AIRE has Claude&#39;s. 
Help us keep Claude reliable for everyone who depends on it.</p>\n<p>AIRE (AI Reliability Engineering) partners with teams across Anthropic to improve reliability across our most critical serving paths -- every hop from the SDK through our network, API layers, serving infrastructure, and accelerators and back. We jump into the trenches alongside partner teams to make the systems that deliver Claude more robust and resilient, be it during an incident or collaborating on projects.</p>\n<p>Reliability here is an emergent phenomenon that transcends any single team&#39;s boundaries, so someone has to zoom out and look at the whole picture. That&#39;s us -- and it means few teams at Anthropic offer this kind of dynamic, cross-cutting exposure to the systems that matter most.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity.</li>\n<li>Design and implement monitoring and observability systems across the token path.</li>\n<li>Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers</li>\n<li>Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.</li>\n<li>Support the reliability of safeguard model serving -- critical for both site reliability and Anthropic&#39;s safety commitments.</li>\n</ul>\n<p><strong>You may be a good fit if you</strong></p>\n<ul>\n<li>Have strong distributed systems, infrastructure, or reliability backgrounds -- we&#39;re looking for reliability-minded software engineers and SREs.</li>\n<li>Are curious and brave -- comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don&#39;t have deep expertise yet.</li>\n<li>Think holistically about how systems compose and where the seams are.</li>\n<li>Can build 
lasting relationships across teams -- our engagement model depends on being welcomed as teammates, not outsiders with opinions.</li>\n<li>Care about users and feel ownership over outcomes, even for systems you don&#39;t own.</li>\n<li>Have excellent communication and collaboration skills -- you&#39;ll be partnering across the entire company.</li>\n<li>Bring diverse experience -- the team&#39;s strength comes from people who&#39;ve built product stacks, scaled databases, run massive distributed systems, and everything in between.</li>\n</ul>\n<p><strong>Strong candidates may also</strong></p>\n<ul>\n<li>Have been an SRE, Production Engineer, or in similar reliability-focused roles on large scale systems</li>\n<li>Have experience operating large-scale model serving or training infrastructure (&gt;1000 GPUs).</li>\n<li>Have experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium).</li>\n<li>Understand ML-specific networking optimizations like RDMA and InfiniBand.</li>\n<li>Have expertise in AI-specific observability tools and frameworks.</li>\n<li>Have experience with chaos engineering and systematic resilience testing.</li>\n<li>Have contributed to open-source infrastructure or ML tooling.</li>\n</ul>\n<p><strong>Logistics</strong></p>\n<p><strong>Education requirements:</strong> We require at least a Bachelor&#39;s degree in a related field or equivalent experience. <strong>Location-based hybrid policy:</strong> Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</p>\n<p><strong>Visa sponsorship</strong></p>\n<p>We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. 
But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</p>\n<p><strong>We encourage you to apply even if you do not believe you meet every single qualification.</strong></p>\n<p>Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you&#39;re interested in this work.</p>\n<p><strong>Your safety matters to us.</strong></p>\n<p>To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses. In some cases, we may partner with vetted recruiting agencies who will identify themselves as working on behalf of Anthropic. Be cautious of emails from other domains. Legitimate Anthropic recruiters will never ask for money, fees, or banking information before your first day. If you&#39;re ever unsure about a communication, don&#39;t click any links—visit anthropic.com/careers directly for confirmed position openings.</p>\n<p><strong>How we&#39;re different</strong></p>\n<p>We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact — advancing our long-term goals of steerable, trustworthy AI — rather than work on smaller and more specific puzzles. 
We view AI research as an empirical science, which has as much in common with physics and engineering as it does with computer science.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_73ff6f07-c0e","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://job-boards.greenhouse.io","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5101173008","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"£325,000 - £390,000GBP","x-skills-required":["distributed systems","infrastructure","reliability","software engineering","SRE","large scale systems","model serving","training infrastructure","ML hardware accelerators","RDMA","InfiniBand","AI-specific observability tools","chaos engineering","resilience testing","open-source infrastructure","ML tooling"],"x-skills-preferred":["SRE","Production Engineer","reliability-focused roles","ML hardware accelerators","RDMA","InfiniBand","AI-specific observability tools","chaos engineering","resilience testing","open-source infrastructure","ML tooling"],"datePosted":"2026-03-08T13:51:34.354Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"London, UK"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"distributed systems, infrastructure, reliability, software engineering, SRE, large scale systems, model serving, training infrastructure, ML hardware accelerators, RDMA, InfiniBand, AI-specific observability tools, chaos engineering, resilience testing, open-source infrastructure, ML tooling, SRE, Production Engineer, reliability-focused roles, ML hardware accelerators, RDMA, InfiniBand, AI-specific observability tools, chaos engineering, resilience testing, open-source 
infrastructure, ML tooling","baseSalary":{"@type":"MonetaryAmount","currency":"GBP","value":{"@type":"QuantitativeValue","minValue":325000,"maxValue":390000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_10798a1e-9fa"},"title":"Staff Software Engineer, AI Reliability Engineering","description":"<p><strong>About Anthropic</strong></p>\n<p>Anthropic&#39;s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</p>\n<p><strong>About the Role</strong></p>\n<p>Claude has your back. AIRE has Claude&#39;s. Help us keep Claude reliable for everyone who depends on it.</p>\n<p>AIRE (AI Reliability Engineering) partners with teams across Anthropic to improve reliability across our most critical serving paths -- every hop from the SDK through our network, API layers, serving infrastructure, and accelerators and back. We jump into the trenches alongside partner teams to make the systems that deliver Claude more robust and resilient, be it during an incident or collaborating on projects.</p>\n<p>Reliability here is an emergent phenomenon that transcends any single team&#39;s boundaries, so someone has to zoom out and look at the whole picture. 
That&#39;s us -- and it means few teams at Anthropic offer this kind of dynamic, cross-cutting exposure to the systems that matter most.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity.</li>\n<li>Design and implement monitoring and observability systems across the token path.</li>\n<li>Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers</li>\n<li>Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.</li>\n<li>Support the reliability of safeguard model serving -- critical for both site reliability and Anthropic&#39;s safety commitments.</li>\n</ul>\n<p><strong>You may be a good fit if you</strong></p>\n<ul>\n<li>Have strong distributed systems, infrastructure, or reliability backgrounds -- we&#39;re looking for reliability-minded software engineers and SREs.</li>\n<li>Are curious and brave -- comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don&#39;t have deep expertise yet.</li>\n<li>Think holistically about how systems compose and where the seams are.</li>\n<li>Can build lasting relationships across teams -- our engagement model depends on being welcomed as teammates, not outsiders with opinions.</li>\n<li>Care about users and feel ownership over outcomes, even for systems you don&#39;t own.</li>\n<li>Have excellent communication and collaboration skills -- you&#39;ll be partnering across the entire company.</li>\n<li>Bring diverse experience -- the team&#39;s strength comes from people who&#39;ve built product stacks, scaled databases, run massive distributed systems, and everything in between.</li>\n</ul>\n<p><strong>Strong candidates may also</strong></p>\n<ul>\n<li>Have been an SRE, Production 
Engineer, or in similar reliability-focused roles on large scale systems</li>\n<li>Have experience operating large-scale model serving or training infrastructure (&gt;1000 GPUs).</li>\n<li>Have experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium).</li>\n<li>Understand ML-specific networking optimizations like RDMA and InfiniBand.</li>\n<li>Have expertise in AI-specific observability tools and frameworks.</li>\n<li>Have experience with chaos engineering and systematic resilience testing.</li>\n<li>Have contributed to open-source infrastructure or ML tooling.</li>\n</ul>\n<p><strong>Logistics</strong></p>\n<p><strong>Education requirements:</strong> We require at least a Bachelor&#39;s degree in a related field or equivalent experience. <strong>Location-based hybrid policy:</strong> Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</p>\n<p><strong>Salary</strong></p>\n<p>The annual compensation range for this role is €235.000 - €295.000EUR.</p>\n<p><strong>How we&#39;re different</strong></p>\n<p>We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact — advancing our long-term goals of steerable, trustworthy AI — rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and engineering as it does with computer science. 
We strive to build a team that reflects this perspective, with people from a wide range of backgrounds and disciplines.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_10798a1e-9fa","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://job-boards.greenhouse.io","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5101169008","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"€235.000 - €295.000EUR","x-skills-required":["distributed systems","infrastructure","reliability","software engineering","SRE","large scale systems","model serving","training infrastructure","ML hardware accelerators","RDMA","InfiniBand","AI-specific observability tools","chaos engineering","resilience testing","open-source infrastructure","ML tooling"],"x-skills-preferred":["communication","collaboration","diverse experience","product stacks","databases","distributed systems"],"datePosted":"2026-03-08T13:48:18.742Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Dublin"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"distributed systems, infrastructure, reliability, software engineering, SRE, large scale systems, model serving, training infrastructure, ML hardware accelerators, RDMA, InfiniBand, AI-specific observability tools, chaos engineering, resilience testing, open-source infrastructure, ML tooling, communication, collaboration, diverse experience, product stacks, databases, distributed systems"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_5d37a7c7-d2a"},"title":"ML Infrastructure Engineer","description":"<p><strong>About the 
role</strong></p>\n<p>The ML Infrastructure team at Cursor builds large-scale compute, storage, and software infrastructure to support the company&#39;s work building the world&#39;s best agentic coding model. We&#39;re looking for strong engineers who are interested in building high-performance infrastructure and the software to support it. This role works closely with ML researchers and engineers to enable their work through improvements to our training framework, systems reliability/performance, and developer experience.</p>\n<p><strong>What you&#39;ll do</strong></p>\n<ul>\n<li>Collaborate with ML researchers to improve the throughput and reliability of training</li>\n<li>Work with OEMs, cloud service providers, and others to plan and build cutting-edge GPU infrastructure</li>\n<li>Improve the density and scalability of compute environments to enable increasingly large RL workloads</li>\n<li>Create software and systems to automate building, monitoring, and running GPU clusters</li>\n<li>Build workload scheduling and data movement systems to support Cursor&#39;s growing training footprint</li>\n</ul>\n<p><strong>You may be a fit if you have</strong></p>\n<ul>\n<li>A strong background in systems and infrastructure-focused software engineering, particularly in Python, Typescript, Rust, and Golang</li>\n<li>Experience with distributed storage and networking infrastructure, particularly on Linux systems across cloud and bare metal environments</li>\n<li>Exposure to large-scale systems and their unique challenges, ideally across thousands of nodes with significant resource footprints</li>\n</ul>\n<p><strong>Nice to have</strong></p>\n<ul>\n<li>Operational exposure to Nvidia GPUs with Infiniband or RoCE, particularly with Blackwell and Hopper-class hardware</li>\n<li>Exposure to Ray, Slurm, or other common compute and runtime schedulers</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_5d37a7c7-d2a","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Cursor","sameAs":"https://cursor.com","logo":"https://logos.yubhub.co/cursor.com.png"},"x-apply-url":"https://cursor.com/careers/software-engineer-ml-infrastructure","x-work-arrangement":"remote","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Python","Typescript","Rust","Golang","Distributed storage","Networking infrastructure","Linux systems","Kubernetes"],"x-skills-preferred":["Nvidia GPUs","Infiniband","RoCE","Blackwell","Hopper-class hardware","Ray","Slurm"],"datePosted":"2026-03-08T00:17:18.553Z","jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Python, Typescript, Rust, Golang, Distributed storage, Networking infrastructure, Linux systems, Kubernetes, Nvidia GPUs, Infiniband, RoCE, Blackwell, Hopper-class hardware, Ray, Slurm"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_d5390946-539"},"title":"Software Engineer, Model Inference","description":"<p><strong>Software Engineer, Model Inference</strong></p>\n<p><strong>Location</strong></p>\n<p>San Francisco</p>\n<p><strong>Employment Type</strong></p>\n<p>Full time</p>\n<p><strong>Department</strong></p>\n<p>Scaling</p>\n<p><strong>Compensation</strong></p>\n<ul>\n<li>$295K – $555K • Offers Equity</li>\n</ul>\n<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, 
skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>\n</ul>\n<ul>\n<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>\n</ul>\n<ul>\n<li>401(k) retirement plan with employer match</li>\n</ul>\n<ul>\n<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>\n</ul>\n<ul>\n<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>\n</ul>\n<ul>\n<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>\n</ul>\n<ul>\n<li>Mental health and wellness support</li>\n</ul>\n<ul>\n<li>Employer-paid basic life and disability coverage</li>\n</ul>\n<ul>\n<li>Annual learning and development stipend to fuel your professional growth</li>\n</ul>\n<ul>\n<li>Daily meals in our offices, and meal delivery credits as eligible</li>\n</ul>\n<ul>\n<li>Relocation support for eligible employees</li>\n</ul>\n<ul>\n<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>\n</ul>\n<p><strong>About the Team</strong></p>\n<p>Our Inference team brings OpenAI’s most capable research and technology to the world through our products. 
We empower consumers, enterprise and developers alike to use and access our state-of-the-art AI models, allowing them to do things that they’ve never been able to before. We focus on performant and efficient model inference, as well as accelerating research progression via model inference.</p>\n<p><strong>About the Role</strong></p>\n<p>We are looking for an engineer who wants to take the world&#39;s largest and most capable AI models and optimize them for use in a high-volume, low-latency, and high-availability production and research environment.</p>\n<p><strong>In this role, you will:</strong></p>\n<ul>\n<li>Work alongside machine learning researchers, engineers, and product managers to bring our latest technologies into production.</li>\n</ul>\n<ul>\n<li>Work alongside researchers to enable advanced research through awesome engineering.</li>\n</ul>\n<ul>\n<li>Introduce new techniques, tools, and architecture that improve the performance, latency, throughput, and efficiency of our model inference stack.</li>\n</ul>\n<ul>\n<li>Build tools to give us visibility into our bottlenecks and sources of instability and then design and implement solutions to address the highest priority issues.</li>\n</ul>\n<ul>\n<li>Optimize our code and fleet of Azure VMs to utilize every FLOP and every GB of GPU RAM of our hardware.</li>\n</ul>\n<p><strong>You might thrive in this role if you:</strong></p>\n<ul>\n<li>Have an understanding of modern ML architectures and an intuition for how to optimize their performance, particularly for inference.</li>\n</ul>\n<ul>\n<li>Own problems end-to-end, and are willing to pick up whatever knowledge you&#39;re missing to get the job done.</li>\n</ul>\n<ul>\n<li>Have at least 5 years of professional software engineering experience.</li>\n</ul>\n<ul>\n<li>Have or can quickly gain familiarity with PyTorch, NVidia GPUs and the software stacks that optimize them (e.g. 
NCCL, CUDA), as well as HPC technologies such as InfiniBand, MPI, NVLink, etc.</li>\n</ul>\n<ul>\n<li>Have experience architecting, building, observing, and debugging production distributed systems. Bonus point if worked on performance-critical distributed systems.</li>\n</ul>\n<ul>\n<li>Have needed to rebuild or substantially refactor production systems several times over due to rapidly increasing scale.</li>\n</ul>\n<ul>\n<li>Are self-directed and enjoy figuring out the most important problem to work on.</li>\n</ul>\n<ul>\n<li>Have a humble attitude, an eagerness to help your colleagues, and a desire to do whatever it takes to make the team succeed.</li>\n</ul>\n<p><strong>About OpenAI</strong></p>\n<p>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_d5390946-539","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/83b6755d-7785-4186-9050-5ef3ad127941","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$295K – $555K • Offers Equity","x-skills-required":["PyTorch","NVidia GPUs","NCCL","CUDA","HPC technologies","InfiniBand","MPI","NVLink","Azure VMs","GPU RAM","FLOP"],"x-skills-preferred":["modern ML architectures","intuition for 
optimizing performance","distributed systems","performance-critical distributed systems"],"datePosted":"2026-03-06T18:31:29.482Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"PyTorch, NVidia GPUs, NCCL, CUDA, HPC technologies, InfiniBand, MPI, NVLink, Azure VMs, GPU RAM, FLOP, modern ML architectures, intuition for optimizing performance, distributed systems, performance-critical distributed systems","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":295000,"maxValue":555000,"unitText":"YEAR"}}}]}