Member of Technical Staff, High Performance Computing Engineer

We are looking for experienced Member of Technical Staff, High Performance Computing Engineers to help build and scale the infrastructure that trains our frontier models and powers the next evolution of our personal AI, Copilot.

This role offers the opportunity to work on some of the largest supercomputers in the world, a rare chance to operate at this scale.

Responsibilities

Design, operate, and maintain large-scale HPC environments, drawing on hands-on engineering experience in production settings.

Own the deployment, configuration, and day-to-day operation of HPC schedulers (e.g., SLURM, Kubernetes), ensuring reliable and efficient job scheduling at scale.

Serve as a technical owner for at least one core HPC domain (GPU compute, high-performance storage, networking, or similar), including ongoing maintenance, performance tuning, and troubleshooting of massive clusters.

Develop and maintain automation and tooling using Bash and/or Python to improve cluster reliability, observability, and operational efficiency.

Partner closely with researchers and engineers to support their workloads, troubleshoot cluster usage issues, and triage failed or underperforming jobs to resolution.

Drive work forward independently by navigating ambiguity and technical roadblocks, delivering incremental improvements that get capabilities into users’ hands quickly.

Qualifications

Bachelor’s degree in Computer Science or a related technical field AND 4+ years of technical engineering experience deploying or operating on-premises or cloud high-performance clusters, AND 4+ years of experience working with high-scale training clusters (e.g., frameworks and tools such as NVIDIA InfiniBand clusters, SLURM, Kubernetes, Ray), AND 4+ years of experience building scalable services on top of public cloud infrastructure such as Azure, AWS, or GCP, OR equivalent experience.

Preferred Qualifications

Master’s degree in Computer Science or a related technical field AND 6+ years of technical engineering experience deploying or operating on-premises or cloud high-performance clusters, AND 6+ years of experience working with high-scale training clusters (e.g., frameworks and tools such as NVIDIA InfiniBand clusters, SLURM, Kubernetes, Ray), AND 6+ years of experience building scalable services on top of public cloud infrastructure such as Azure, AWS, or GCP, OR equivalent experience.

XML job scraping automation by YubHub

Employment: full-time, staff, onsite
Skills: HPC, SLURM, Kubernetes, GPU compute, high-performance storage, networking, Bash, Python, NVIDIA InfiniBand clusters, Ray, LLM training clusters, AI platforms, Machine Learning frameworks, large-scale HPC or GPU systems
Category: Engineering, Technology
Company: Microsoft AI (https://microsoft.ai)
Logo: https://logos.yubhub.co/microsoft.ai.png
About the company: Microsoft AI is a leading technology company that develops and markets software, services, and solutions for personal and business use. It is one of the largest and most influential technology companies in the world.
Job URL: https://microsoft.ai/job/member-of-technical-staff-high-performance-computing-engineer-mai-superintelligence-team-3/
Location: Zürich
Date: 2026-03-08

Member of Technical Staff, High Performance Computing Engineer

Summary

Microsoft AI are looking for a talented Member of Technical Staff, High Performance Computing Engineer at their London office. This role sits at the heart of building and scaling the infrastructure that trains their frontier models and powers the next evolution of their personal AI, Copilot. You'll work directly with researchers and engineers to support their workloads, troubleshoot cluster usage issues, and triage failed or underperforming jobs to resolution.

About the Role

As a Member of Technical Staff, High Performance Computing Engineer, you will design, operate, and maintain large-scale HPC environments, drawing on hands-on engineering experience in production settings. You will own the deployment, configuration, and day-to-day operation of HPC schedulers (e.g., SLURM, Kubernetes), ensuring reliable and efficient job scheduling at scale. You will serve as a technical owner for at least one core HPC domain (GPU compute, high-performance storage, networking, or similar), including ongoing maintenance, performance tuning, and troubleshooting of massive clusters.

Accountabilities

Design, operate, and maintain large-scale HPC environments, drawing on hands-on engineering experience in production settings.
Own the deployment, configuration, and day-to-day operation of HPC schedulers (e.g., SLURM, Kubernetes), ensuring reliable and efficient job scheduling at scale.
Serve as a technical owner for at least one core HPC domain (GPU compute, high-performance storage, networking, or similar), including ongoing maintenance, performance tuning, and troubleshooting of massive clusters.

The Candidate we're looking for

Experience:

4+ years of technical engineering experience deploying or operating on-premises or cloud high-performance clusters.
4+ years of experience working with high-scale training clusters (e.g., frameworks and tools such as NVIDIA InfiniBand clusters, SLURM, Kubernetes, Ray).
4+ years of experience building scalable services on top of public cloud infrastructure such as Azure, AWS, or GCP.

Technical skills:

Experience with LLM training clusters.
Experience working with AI platforms, frameworks, and APIs.
Experience using Machine Learning frameworks, including experience using, deploying, and scaling large language models, either personally or professionally.

Personal attributes:

Ability to identify, analyze, and resolve complex technical issues, ensuring optimal performance, scalability, and user experience.
Dedication to writing clean, maintainable, and well-documented code with a focus on application quality, performance, and security.

Benefits

Competitive salary and benefits package.
Opportunity to work with a leading technology company and contribute to Microsoft AI's mission.
Collaborative and dynamic work environment.
Professional development opportunities.


Employment: full-time, staff, onsite
Benefits: Competitive salary and benefits package
Skills: High Performance Computing, Cloud Infrastructure, Machine Learning, AI Platforms, Frameworks and APIs, LLM Training Clusters
Category: Engineering, Technology
Company: Microsoft AI (https://microsoft.ai)
Logo: https://logos.yubhub.co/microsoft.ai.png
About the company: Microsoft AI is a leading technology company that empowers every person and every organization on the planet to achieve more. They come together with a growth mindset, innovate to empower others, and collaborate to realize their shared goals.
Job URL: https://microsoft.ai/job/member-of-technical-staff-high-performance-computing-engineer-mai-superintelligence-team-2/
Location: London
Date: 2026-03-06

Member of Technical Staff, High Performance Computing Engineer

Summary

Microsoft AI are looking for experienced Member of Technical Staff, High Performance Computing Engineers to help build and scale the infrastructure that trains their frontier models and powers the next evolution of their personal AI, Copilot.

About the Role

This role offers the opportunity to work on some of the largest supercomputers in the world, a rare chance to operate at this scale. As a Member of Technical Staff, High Performance Computing Engineer, you will design, operate, and maintain large-scale HPC environments, drawing on hands-on engineering experience in production settings. You will own the deployment, configuration, and day-to-day operation of HPC schedulers (e.g., SLURM, Kubernetes), ensuring reliable and efficient job scheduling at scale.

Accountabilities

Design, operate, and maintain large-scale HPC environments, drawing on hands-on engineering experience in production settings.
Own the deployment, configuration, and day-to-day operation of HPC schedulers (e.g., SLURM, Kubernetes), ensuring reliable and efficient job scheduling at scale.

The Candidate we're looking for

Experience:

4+ years of technical engineering experience deploying or operating on-premises or cloud high-performance clusters.
4+ years of experience working with high-scale training clusters (e.g., frameworks and tools such as NVIDIA InfiniBand clusters, SLURM, Kubernetes, Ray).
4+ years of experience building scalable services on top of public cloud infrastructure such as Azure, AWS, or GCP.

Technical skills:

Experience with LLM training clusters.
Experience working with AI platforms, frameworks, and APIs.
Experience using Machine Learning frameworks, including experience using, deploying, and scaling large language models, either personally or professionally.

Personal attributes:

Ability to identify, analyze, and resolve complex technical issues, ensuring optimal performance, scalability, and user experience.
Dedication to writing clean, maintainable, and well-documented code with a focus on application quality, performance, and security.

Benefits

Competitive salary.
Comprehensive benefits package.
Opportunities for professional growth and development.


Employment: full-time, staff, onsite
Benefits: Competitive salary
Skills: High Performance Computing, Cloud Infrastructure, Machine Learning, AI Platforms, Frameworks and APIs, LLM Training Clusters
Category: Engineering, Technology
Company: Microsoft AI (https://microsoft.ai)
Logo: https://logos.yubhub.co/microsoft.ai.png
About the company: Microsoft AI is a leading technology company that empowers every person and every organization on the planet to achieve more. They come together with a growth mindset, innovate to empower others, and collaborate to realize their shared goals.
Job URL: https://microsoft.ai/job/member-of-technical-staff-high-performance-computing-engineer-mai-superintelligence-team/
Location: Multiple Locations, United States
Date: 2026-03-06