Security Engineer, Infrastructure

c7de81b4-bec Security Engineer, Infrastructure

We are seeking a highly skilled Infrastructure Security Engineer to join our team. This role is integral to ensuring the security and integrity of our platform.

You will be responsible for securing large cloud environments, orchestrating and securing various compute clusters, and reviewing infrastructure as code. Your expertise in cloud security, infrastructure automation, and advanced security practices will be essential in maintaining and enhancing our security posture.

Key responsibilities include:

Securing infrastructure across large cloud hosting providers (e.g., AWS, Azure, GCP).
Implementing and maintaining robust security configurations and policies for cloud environments.
Conducting regular security assessments and audits of infrastructure to identify vulnerabilities and areas for improvement.
Developing and enforcing security best practices for infrastructure automation and orchestration.
Collaborating with Developer Experience, IT, and product teams to integrate security into all stages of the infrastructure lifecycle.
Reviewing and securing infrastructure as code (e.g., Terraform, CloudFormation).
Educating and mentoring team members on infrastructure security best practices and emerging threats.

Ideally, you'd have:

Proven experience as a Security Engineer with a focus on product security.
Proficiency in NodeJS, TypeScript, and Kubernetes.
Experience with orchestrating and securing GPU clusters.
Proficiency in infrastructure as code tools such as Terraform and CloudFormation.
Excellent communication skills, with the ability to clearly explain technical concepts and their implications to both technical and non-technical stakeholders.
Demonstrated ability to influence security strategies and drive improvements within an organisation.
Relevant security certifications (e.g., AWS Certified Security Specialty, Certified Cloud Security Professional) are a plus.
Experience in a senior or lead security role is preferred.

Compensation packages at Scale for eligible roles include base salary, equity, and benefits. The range displayed on each job posting reflects the minimum and maximum target for new hire salaries for the position, determined by work location and additional factors, including job-related skills, experience, interview performance, and relevant education or training.

XML job scraping automation by YubHub

full-time senior hybrid
$237,600-$297,000 USD
cloud security, infrastructure automation, advanced security practices, NodeJS, TypeScript, Kubernetes, Terraform, CloudFormation, orchestrating and securing GPU clusters, relevant security certifications
Engineering Technology
Scale
https://logos.yubhub.co/scale.com.png
Scale develops reliable AI systems for the world's most important decisions.
https://www.scale.com/
https://job-boards.greenhouse.io/scaleai/jobs/4646888005
New York, NY; San Francisco, CA; Seattle, WA; Washington, DC
2026-04-18

594b20c4-c28 Infrastructure Engineer, Security

We're looking for an infrastructure engineer to own and evolve the security infrastructure that underpins our foundation models. In this role, you'll work across compute, storage, networking, and data platforms, making sure our systems are secure, reliable, and built to scale.

You'll shape controls, architecture, and tooling so that security is part of how the platform works by default. You'll partner closely with research and product teams, enabling them to move quickly while keeping our models, data, and environments protected.

Key responsibilities include:

Architecting security patterns for platforms and services, including network segmentation, service-to-service authentication, RBAC, and policy enforcement in Kubernetes and cloud environments.

Managing identity, access, and secrets for humans and services: workload and cross-cloud identity, least-privilege IAM, and secrets management.

Building secure platforms for data ingestion, processing, and curation: classification, encryption, access controls, and safe sharing patterns across teams.

Writing threat models and reviewing designs with researchers and engineers to help them ship features and experiments in a safe, scalable way.

Automating security checks and building guardrails: policy-as-code, secure infrastructure baselines, validation in CI/CD, and tools that make the secure path the easiest one.

Requirements include:

Bachelor's degree or equivalent experience in engineering, or similar.

Strong background with containers and orchestration (e.g., Kubernetes) and how to secure them (namespaces, network policies, pod security, admission controls, etc.).

Practical experience with Infrastructure as Code (Terraform or similar), including secure patterns for provisioning networks, IAM, and shared services.

Solid understanding of cloud networking and security: VPCs, load balancers, service discovery, mTLS, firewalls, and zero-trust-style architectures.

Proficiency with a systems language such as Rust and scripting in Python for building platform components and internal tools.

Evidence of owning complex, production-critical systems, including debugging issues that span infra, security, and application layers.

Preferred qualifications include experience with ML infrastructure, GPU clusters, or large-scale training environments, as well as background in AI labs, HPC environments, or ML-heavy organizations.

full-time senior onsite
$200,000 - $475,000 USD
Kubernetes, Infrastructure as Code, Cloud Networking and Security, Systems Language (Rust), Scripting (Python), ML Infrastructure, GPU Clusters, Large-Scale Training Environments, AI Labs, HPC Environments
Engineering Technology
Thinking Machines Lab
https://logos.yubhub.co/thinkingmachineslab.com.png
Thinking Machines Lab is building a future where everyone has access to the knowledge and tools to make AI work for their unique needs and goals.
https://thinkingmachineslab.com/
https://job-boards.greenhouse.io/thinkingmachines/jobs/5015964008
San Francisco
2026-04-18

854e95b5-76b Sr. Director of Product, Research and Training Infrastructure

CoreWeave is seeking a visionary Sr. Director of Product, Research Training Infrastructure to lead the product strategy and engineering execution for the services that power the most ambitious AI research labs in the world.

This executive leader will own the product strategy and engineering execution for the Research Training Stack, focusing on the specialized orchestration, evaluation, and iteration tools required for massive-scale pre-training and post-training.

Key responsibilities include:

Frontier Orchestration: Oversee the evolution of SUNK (Slurm on Kubernetes) to provide researchers with deterministic, bare-metal performance through a cloud-native interface.

Holistic Training Services: Drive the development of next-generation orchestrators and automated training-based evaluation frameworks that ensure model quality throughout the lifecycle.

Post-Training Excellence: Build the infrastructure required for sophisticated Reinforcement Learning (RL) and RLHF pipelines, enabling labs to refine foundation models with maximum efficiency.

Customer Advocacy: Act as the primary technical partner for lead researchers at global AI labs, translating their 'future-state' requirements into actionable product roadmaps.

Requirements include:

Proven experience in engineering leadership, with 5+ years managing large-scale infrastructure at a top-tier research lab or an AI-native cloud provider.

Deep, hands-on knowledge of Slurm, Kubernetes, and the specific networking requirements (InfiniBand/RDMA) for distributed training clusters.

Research mindset and understanding of the 'pain points' of a research scientist.

Scaling experience delivering mission-critical services on multi-thousand GPU clusters (H100/Blackwell/Rubin architectures).

Strategic vision to define 'what's next' in the AI stack, from automated RL loops to specialized sandbox environments.

Why CoreWeave?

In 2026, CoreWeave is the foundation of the largest infrastructure buildout in human history. We are building AI Factories, not just data centers.

Silicon-Up Innovation: Work directly with the latest NVIDIA architectures.

Impact: You will be the architect of the environment that enables the next new discovery.

Velocity: We move at the speed of the researchers we support, bypassing legacy cloud bottlenecks to deliver raw power.

full-time executive hybrid
$233,000 to $341,000
Slurm, Kubernetes, InfiniBand/RDMA, Distributed training clusters, GPU clusters, H100/Blackwell/Rubin architectures, Reinforcement Learning (RL), RLHF pipelines
Engineering Technology
CoreWeave
https://logos.yubhub.co/coreweave.com.png
CoreWeave is a cloud computing company that provides infrastructure and tools for artificial intelligence research and development.
https://www.coreweave.com
https://job-boards.greenhouse.io/coreweave/jobs/4665964006
Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA
2026-04-18

a8092b6e-7f5 Bare Metal Support Engineer

As a Bare Metal Support Engineer at CoreWeave, you will be responsible for supporting, operating, and maintaining CoreWeave's extensive GPU fleet across our growing data centers in the U.S., Europe, and beyond.

You will work closely with customers, data center technicians, and engineering teams to ensure the reliability, performance, and scalability of our infrastructure.

Key responsibilities include:

Providing high-level support for customers utilizing bare-metal GPU fleets on CoreWeave Cloud.
Diagnosing, triaging, and investigating reported customer issues and high-priority incidents, identifying root causes and escalating when necessary.
Developing a deep understanding of customer workloads and use cases to provide tailored technical support.
Coordinating remote troubleshooting and hardware interventions with Data Center Technicians.
Creating and maintaining internal documentation, including troubleshooting guides, best practices, and knowledge base articles.
Participating in an on-call rotation to support production clusters and ensure operational reliability.
Collaborating with engineering teams to improve hardware reliability, software stability, and system performance.
Implementing automation and scripting to streamline support workflows and reduce manual interventions.
Performing in-depth log analysis and debugging across multiple layers of the stack (firmware, drivers, hardware).
Providing feedback to internal teams on common support issues to drive continuous improvements.
Working with networking teams to troubleshoot connectivity issues affecting customer workloads.
Supporting supercomputing infrastructure running GPU workloads at scale.
Driving operational excellence by refining internal processes and support methodologies.

To succeed in this role, you will need:

Experience in data centers, GPU clusters, server deployments, system administration, or hardware troubleshooting.
Demonstrated experience driving resolutions and continuous improvements across cross-functional environments and teams within a data center environment.
Intermediate knowledge of Linux (Ubuntu, CentOS, or similar), including command-line proficiency.
Experience with NVIDIA GPUs, SuperMicro systems, Dell systems, high-performance computing (HPC), and large-scale data center environments.
Experience in networking fundamentals (TCP/IP, VLANs, DNS, DHCP) and troubleshooting tools.
Hands-on experience with firmware updates, BIOS configurations, and driver management.
Experience analyzing system logs and debugging issues across firmware, drivers, and hardware layers.
Experience working with Jira, Confluence, Notion, or other issue-tracking and documentation platforms.
Experience in scripting and automation (Python, Bash, Ansible, or similar).

If you're a curious and analytical individual with a passion for problem-solving and a desire to work in a fast-paced environment, we'd love to hear from you!

full-time mid hybrid
$83,000 to $132,000
Linux, GPU clusters, server deployments, system administration, hardware troubleshooting, NVIDIA GPUs, SuperMicro systems, Dell systems, high-performance computing, large-scale data center environments, networking fundamentals, troubleshooting tools, firmware updates, BIOS configurations, driver management, system logs, debugging issues, Jira, Confluence, Notion, issue-tracking, documentation platforms, scripting, automation, Kubernetes, Docker, containerized infrastructure
Engineering Technology
CoreWeave
https://logos.yubhub.co/coreweave.com.png
CoreWeave is a cloud computing company that delivers a platform of technology, tools, and teams to enable innovators to build and scale AI with confidence.
https://www.coreweave.com
https://job-boards.greenhouse.io/coreweave/jobs/4560350006
Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA
2026-04-18

022d9aef-8cd Member of Technical Staff - Infrastructure Reliability

About the Role

We are training some of the largest models in the world on the latest hardware across multiple environments. To do this reliably at xAI's pace, we need engineers who have battle-tested experience keeping massive distributed infrastructure up and running 24/7, including on-prem and cloud-based infrastructure.

You will own the availability, performance, and evolution of xAI's core compute, storage, and networking infrastructure. This is not an ops-only role; strong coding is a hard requirement. You will design, implement, and ship systems software, automation, and tooling in Python and/or Rust that directly impact training throughput and cluster utilization.

Responsibilities

Define and execute the technical strategy for infrastructure reliability and scalability
Build and maintain the automation, observability, and control planes that keep multi-datacenter, hybrid cloud/on-prem environments healthy
Lead incident response, deep-dive root cause analysis, and post-mortems that drive real fixes
Identify, instrument, and eliminate systemic failure patterns (capacity, network, hardware, storage, software)
Design and implement high-leverage systems software (daemons, controllers, schedulers, etc.) in Python and Rust.

Basic Qualifications

5+ years shipping production software and/or operating distributed infrastructure at scale
Expert-level knowledge of Linux systems, TCP/IP networking, and systems programming
Strong coding skills with proven production experience in Rust (strongly preferred) and at least one of Python, Go, or C++.

Preferred Skills and Experience

Significant contributions to large-scale GPU clusters or AI/ML infrastructure
Experience in on-call rotations and incident response in high-stakes environments.

Compensation and Benefits

$180,000 - $400,000 USD

Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.

full-time staff onsite
$180,000 - $400,000 USD
Linux systems, TCP/IP networking, systems programming, Rust, Python, Go, C++, container orchestration, container runtimes, infrastructure-as-code, large-scale GPU clusters, AI/ML infrastructure, on-call rotations, incident response
Engineering Technology
xAI
https://logos.yubhub.co/xai.com.png
xAI creates AI systems to understand the universe and aid humanity in its pursuit of knowledge.
https://www.xai.com/
https://job-boards.greenhouse.io/xai/jobs/4801451007
Palo Alto, CA
2026-04-18

a0051ff6-ddf Facilities Operations Manager

We're seeking a driven Facilities Operations Manager to join our team and ensure the relentless performance of our data center infrastructure. This role is critical to maintaining the uptime and efficiency of the systems powering our AI breakthroughs.

As a Facilities Operations Manager, you'll lead teams, oversee cutting-edge facilities, and solve complex problems in real time to keep our mission on track. You'll own the operation of power, cooling, and monitoring systems at scale, bringing technical depth and a no-excuses mindset to our facility.

Responsibilities:

Manage all aspects of data center critical infrastructure (switchgear, generators, UPS systems, chillers, liquid cooling, and building monitoring), ensuring 99.999%+ uptime.
Lead 24x7 teams of facility technicians and vendors, driving safety, execution, and a culture of accountability.
Troubleshoot and resolve facility emergencies using root cause analysis, acting as the go-to escalation point.
Spearhead optimization projects, collaborating with engineers to integrate next-gen tech and cut operational costs.
Own the operations budget, balancing efficiency with performance under tight deadlines.
Enforce compliance with safety and operational protocols, anticipating regulatory shifts.
Coordinate with cross-functional teams to deliver high-quality outcomes and boost team morale.
Support multi-site operations and new facility build-outs as xAI scales.

Basic Qualifications:

Minimum of 5 years in data center operations or facility management, ideally with hyperscaler or industrial systems.
Strong grasp of critical infrastructure: power, cooling, and monitoring systems.
Proven ability to lead teams and manage projects under pressure.
Sharp analytical and communication skills.

Preferred Skills and Experience:

B.S. in Engineering, Facilities Management, or related field; advanced degree a plus.
Experience with GPU clusters or AI-driven data center environments.
Methodical troubleshooting and technical leadership chops.
Familiarity with Southaven, MS area regulations and practices is a bonus.
Comfort with Excel, Word, and operational tools; CAD or monitoring software knowledge is a plus.

full-time senior onsite
data center operations, facility management, critical infrastructure, team leadership, project management, analytical skills, communication skills, GPU clusters, AI-driven data center environments, methodical troubleshooting, technical leadership, CAD or monitoring software
Engineering Technology
xAI
https://logos.yubhub.co/xai.com.png
xAI creates AI systems to understand the universe and aid humanity in its pursuit of knowledge.
https://www.xai.com/
https://job-boards.greenhouse.io/xai/jobs/4685202007
Southaven, MS
2026-04-18

290c3d28-4b2 Partner Solution Architect - ASEAN

About Mistral AI

At Mistral AI, we believe in the power of AI to simplify tasks, save time, and enhance learning and creativity. Our technology is designed to integrate seamlessly into daily working life.

We are a global company with teams distributed between France, USA, UK, Germany and Singapore. We are a diverse workforce that thrives in competitive environments and is committed to driving innovation.

Why This Role Matters

You will be the technical linchpin between Mistral and our strategic partners in ASEAN (Nvidia, Dell, Hyperscalers, Global System Integrators), translating our open-weight models and sovereign AI architecture into deployable, scalable solutions.

By designing joint architectures, influencing partner GTM motions, and earning a seat at the CIO/CTO table, you will accelerate Mistral’s technical credibility and deployment velocity across Asia Pacific.

This is a foundational role where you will define how open-weight AI is operationalized at scale in the region.

What You Will Do

Partner Technical Leadership & Architecture Design

Lead the technical design, deployment, and enablement of Mistral’s partner solutions, bridging our AI models with partner infrastructure (Nvidia, Dell, Hyperscalers, GSIs) to deliver scalable AI Labs, AI Factories, and sovereign AI architectures.

Serve as the trusted technical advisor to partner CTOs, CIOs, and engineering leaders—shaping joint architectures, guiding GPU/model deployment strategies, and accelerating GTM execution.

Design reference architectures and deployment patterns for partner-led implementations (e.g., multi-GPU inference clusters, AI Lab topologies, private AI clouds).

Innovate the Executive Briefing Center (EBC) function for technical leaders (CIOs, CTOs, CDOs), positioning Mistral as the default choice for enterprise AI.

Co-design sovereign AI reference architectures with Nvidia and Dell (H100, H200, GB200 platforms).

Co-Sell & Revenue Enablement

Collaborate with Mistral’s partner and sales teams to progress deals, providing technical expertise to penetrate accounts and influence GTM pipeline.

Support partners in qualifying/disqualifying opportunities, ensuring Mistral solutions unlock maximum value for customers.

Deploy Mistral’s enterprise AI suite (models, fine-tuning, use-case building) in partner-led environments, tailoring solutions to customer requirements.

Trusted Advisor & Lighthouse Implementations

Drive strategic partner-led opportunities through technical discovery, architecture design, and POC execution.

Lead lighthouse deployments that become referenceable case studies (e.g., Singtel AI Grid, Accenture AI Lab).

Establish a scalable partner enablement framework, training 100+ partner engineers across ASEAN.

Product Feedback & Internal Collaboration

Coordinate with Mistral’s product and engineering teams to relay partner-specific requirements and feedback.

Align joint GTM and technical execution between Mistral Science, Partner Engineering, and partner field teams.

About You

Must-Have

10–15 years’ experience in partner-facing technical sales or solution architecture (e.g., Partner SA, Alliance Architect, Partner Technology Strategist).

Proven ability to engage C-suite and senior technical stakeholders (CTO, CIO, Chief Architect) in strategic architecture discussions.

Deep GenAI/LLM expertise: RAG, fine-tuning, prompt engineering, model evaluation, and deployment patterns.

Technical mastery of AI/ML infrastructure (GPU clusters, cloud platforms, model deployment frameworks).

Track record of co-designing/deploying joint solutions with ecosystem partners (Nvidia, Dell, AWS, Accenture, etc.).

Executive communication: Ability to articulate science-driven value propositions to technical and business audiences.

Entrepreneurial mindset: Operates autonomously in high-growth environments; creates playbooks rather than following them.

Fluent in English; confident working across diverse, cross-cultural teams in Asia.

Nice-to-Have

Experience with open-weight LLMs or open-source AI stacks (Mistral, Hugging Face, LangChain, vLLM, RAG frameworks).

Prior involvement in AI Lab, AI Factory, or Sovereign Cloud deployments.

Familiarity with data governance, model evaluation, and GPU sizing for large-scale inference.

Network across GSIs and infrastructure partners in Asia.

Exposure to multi-region partner programs or joint GTM initiatives in APJ.

Bonus languages: Korean, Japanese, or Mandarin for regional partner engagement.

What we offer

💰 Competitive cash salary and equity

🚑 Health Insurance : Best in Class

🥎 Sport : $90 for gym membership allowance

🥕 Food : $200 monthly allowance for meals (solution might evolve as we grow bigger)

🚴 Transportation : $120/month for public transport or Parking charges reimbursed

🏝️ PTO: 18 per year

full-time senior onsite
GenAI/LLM expertise, RAG, fine-tuning, prompt engineering, model evaluation, deployment patterns, AI/ML infrastructure, GPU clusters, cloud platforms, model deployment frameworks, co-designing/deploying joint solutions, ecosystem partners, Nvidia, Dell, AWS, Accenture, open-weight LLMs, open-source AI stacks, Mistral, Hugging Face, LangChain, vLLM, RAG frameworks, data governance, model evaluation, GPU sizing, large-scale inference, GSIs, infrastructure partners, multi-region partner programs, joint GTM initiatives, APJ, Korean, Japanese, Mandarin
Engineering Technology
Mistral AI
Mistral AI is an AI technology company that provides high-performance, optimized, open-source and cutting-edge models, products and solutions.
https://mistral.ai/careers
https://jobs.lever.co/mistral/fe3542b5-4f99-4d62-af6a-fbdfd13bf0e4
Singapore
2026-03-10

93a4ece6-182 Member of Technical Staff, Site Reliability Engineer (HPC)

As Microsoft continues to push the boundaries of AI, we are on the lookout for experienced individuals to work with us on the most interesting and challenging AI questions of our time. Our vision is to build systems that have true artificial intelligence across agents, applications, services, and infrastructure. We're looking for an experienced HPC Site Reliability Engineer (SRE) to join our High Performance Computing (HPC) infrastructure team. In this role, you'll blend software engineering and systems engineering to keep our large-scale distributed AI infrastructure reliable and efficient. You'll ensure that AI systems stay efficient and reliable with very high uptimes.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

This role is part of Microsoft AI's Superintelligence Team. The MAIST is a startup-like team inside Microsoft AI, created to push the boundaries of AI toward Humanist Superintelligence—ultra-capable systems that remain controllable, safety-aligned, and anchored to human values. Our mission is to create AI that amplifies human potential while ensuring humanity remains firmly in control. We aim to deliver breakthroughs that benefit society—advancing science, education, and global well-being.

Responsibilities

Reliability & Availability: Ensure uptime, resiliency, and fault tolerance of HPC clusters powering MAI model training and inference.
Observability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into all aspects of HPC systems including GPU, clusters, storage and networking.
Automation & Tooling: Build automation for deployments, incident response, scaling, and failover in CPU+GPU environments.
Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements.
Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments.
Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows.

Qualifications

Required Qualifications

Master’s Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
OR Bachelor’s Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
OR equivalent experience.

Preferred Qualifications

Strong proficiency in Kubernetes, Docker, and container orchestration.
Knowledge of CI/CD pipelines for Inference and ML model deployment.
Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code.
Expertise in monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.).
Strong programming/scripting skills in Python, Go, or Bash.
Solid knowledge of distributed systems, networking, and storage.
Experience running large-scale GPU clusters for ML/AI workloads (preferred).
Familiarity with ML training/inference pipelines.
Experience with high-performance computing (HPC) and workload schedulers (Kubernetes operators).
Background in capacity planning & cost optimization for GPU-heavy environments.

Work on cutting-edge infrastructure that powers the future of Generative AI. Collaborate with world-class researchers and engineers. Impact millions of users through reliable and responsible AI deployments. Competitive compensation, equity options, and comprehensive benefits.

full-time staff hybrid
$139,900 – $274,800 per year
Kubernetes, Docker, container orchestration, CI/CD pipelines, public cloud platforms, infrastructure-as-code, monitoring & observability tools, programming/scripting skills in Python, Go, or Bash, distributed systems, networking, storage, GPU clusters, ML training/inference pipelines, high-performance computing, workload schedulers, strong proficiency in Kubernetes, knowledge of CI/CD pipelines, hands-on experience with public cloud platforms, expertise in monitoring & observability tools, strong programming/scripting skills in Python, Go, or Bash, solid knowledge of distributed systems, experience running large-scale GPU clusters, familiarity with ML training/inference pipelines, experience with high-performance computing
Engineering Technology
Microsoft
https://logos.yubhub.co/microsoft.ai.png
Microsoft is a multinational technology company that develops, manufactures, licenses, and supports a wide range of software products, services, and devices.
https://microsoft.ai
https://microsoft.ai/job/member-of-technical-staff-site-reliability-engineer-hpc-mai-superintelligence-team/
Mountain View
2026-03-08

f8953efe-b98 Member of Technical Staff, Evaluations Engineering

Summary

Microsoft AI are looking for a talented Member of Technical Staff, Evaluations Engineer to help build the next wave of capabilities of our personalized AI assistant, Copilot. We’re looking for someone who will bring an abundance of positive energy, empathy, and kindness to the team every day, in addition to being highly effective.

About the Role

We are looking for a highly skilled and experienced engineer to join our Evaluations Engineering team. As a Member of Technical Staff, Evaluations Engineer, you will be responsible for developing and tuning the pretraining scalable software for Nvidia GB200 72NVL CX8 and AMD MIxxx architectures. You will also be responsible for benchmarking GB200 and AMD MIxxx GPU clusters, gathering data and insights to develop the pretraining compute roadmap, and caring deeply about conversational AI and its deployment.

Accountabilities

Develop and tune the pretraining scalable software for Nvidia GB200 72NVL CX8 and AMD MIxxx architectures.
Benchmark GB200 and AMD MIxxx GPU clusters.
Gather data and insights to develop the pretraining compute roadmap.

The Candidate we're looking for

Experience:

Bachelor’s Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.

Technical skills:

Experience with generative AI.
Experience with distributed computing.

Personal attributes:

Enjoy working in a fast-paced, design-driven, product development cycle.
Embody our Culture and Values.

Benefits

Software Engineering IC5 – The typical base pay range for this role across the U.S. is USD $139,900 – $274,800 per year.
Software Engineering IC6 – The typical base pay range for this role across the U.S. is USD $163,000 – $296,400 per year.

full-time staff onsite
USD $139,900 – $274,800 per year
C, C++, C#, Java, JavaScript, Python, Generative AI, Distributed Computing, Experience with Nvidia GB200 72NVL CX8 and AMD MIxxx architectures, Experience with benchmarking GPU clusters
Engineering Technology
Microsoft AI
https://logos.yubhub.co/microsoft.ai.png
Microsoft AI is a leading technology company that specializes in artificial intelligence and machine learning. They are known for their innovative products and services that aim to make a positive impact on people's lives. Microsoft AI is committed to advancing the field of AI and making it more accessible to everyone.
https://microsoft.ai
https://microsoft.ai/job/member-of-technical-staff-evaluations-engineering-mai-superintelligence-team-2/
Redmond
2026-03-06

675d41e9-5f9 Member of Technical Staff, Reinforcement Learning Systems

Summary

Microsoft AI are looking for a talented Member of Technical Staff, Reinforcement Learning Systems to help build the world's most advanced reinforcement learning systems. This role sits at the heart of strategic decision-making, turning market data into actionable insights for a company that's revolutionising AI technology.

About the Role

We are responsible for designing, developing, and operating the large-scale reinforcement learning systems that power several use cases across the Superintelligence team. We are looking for individuals who can contribute to cutting-edge research and help bridge the gap between that research and robust, production-grade distributed systems. The ideal candidate has both distributed systems expertise and a scientific mindset, and will be able to build complex, scalable systems from the ground up, identify and resolve performance bottlenecks, debug complex cross-system issues with extremely high attention to detail, and contribute to solving scientific and research challenges.

Accountabilities

Develop and tune scalable pretraining software for Nvidia GB200 72NVL CX8 and AMD MIxxx architectures.
Benchmark GB200 and AMD MIxxx GPU clusters.
Gather data and insights to develop the pretraining compute roadmap.

The Candidate we're looking for

Experience:

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.

Technical skills:

Experience with generative AI.
Experience with distributed computing.

Personal attributes:

Have a high degree of craftsmanship and pay close attention to detail.
Enjoy working in a fast-paced, design-driven, product development cycle.

Benefits

Software Engineering IC5 – The typical base pay range for this role across the U.S. is USD $139,900 – $274,800 per year.
There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area.


]]> full-time staff onsite USD $139,900 – $274,800 per year C, C++, C#, Java, JavaScript, Python, Generative AI, Distributed Computing, Experience with Nvidia GB200 72NVL CX8 and AMD MIxxx architectures, Experience with GPU clusters Engineering Technology Microsoft AI https://logos.yubhub.co/microsoft.ai.png Microsoft AI is a leading technology company that is dedicated to advancing artificial intelligence and machine learning. They are responsible for developing and deploying AI models that power various products and services, including Copilot and Bing. Microsoft AI is committed to creating AI that amplifies human potential while ensuring humanity remains firmly in control. https://microsoft.ai https://microsoft.ai/job/member-of-technical-staff-reinforcement-learning-systems-mai-superintelligence-team-3/ New York 2026-03-06 b0dff67a-5b5 Member of Technical Staff, Reinforcement Learning Systems Summary

About the Role

Accountabilities

Develop and tune scalable pretraining software for Nvidia GB200 72NVL CX8 and AMD MIxxx architectures.
Benchmark GB200 and AMD MIxxx GPU clusters.
Gather data and insights to develop the pretraining compute roadmap.

The Candidate we're looking for

Experience:

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.

Technical skills:

Experience with generative AI.
Experience with distributed computing.

Personal attributes:

Have a high degree of craftsmanship and pay close attention to detail.
Enjoy working in a fast-paced, design-driven, product development cycle.

Benefits

Software Engineering IC5 – The typical base pay range for this role across the U.S. is USD $139,900 – $274,800 per year.
There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area.


]]> full-time staff onsite USD $139,900 – $274,800 per year C, C++, C#, Java, JavaScript, Python, Generative AI, Distributed computing, Experience with Nvidia GB200 72NVL CX8 and AMD MIxxx architectures, Experience with GPU clusters Engineering Technology Microsoft AI https://logos.yubhub.co/microsoft.ai.png Microsoft AI is a leading technology company that is dedicated to advancing artificial intelligence and machine learning. They are responsible for developing and deploying AI models that power various products and services, including Copilot and Bing. Microsoft AI is committed to creating AI that amplifies human potential while ensuring humanity remains firmly in control. https://microsoft.ai https://microsoft.ai/job/member-of-technical-staff-reinforcement-learning-systems-mai-superintelligence-team/ Mountain View 2026-03-06