<?xml version="1.0" encoding="UTF-8"?>
<source>
  <jobs>
    <job>
      <externalid>46fbbe1a-6b9</externalid>
      <Title>Software Engineer, Cloud Engineering</Title>
      <Description><![CDATA[<p>Join us in building the future of finance.</p>
<p>Our mission is to democratize finance for all. An estimated $124 trillion of assets will be inherited by younger generations in the next two decades, the largest transfer of wealth in human history. If you’re ready to be at the epicenter of this historic cultural and financial shift, keep reading.</p>
<p>We are building an elite team, applying frontier technologies to the world’s biggest financial problems. We’re looking for bold thinkers. Sharp problem-solvers. Builders who are wired to make an impact. Robinhood isn’t a place for complacency; it’s where ambitious people do the best work of their careers. We’re a high-performing, fast-moving team with ethics at the center of everything we do. Expectations are high, and so are the rewards.</p>
<p>The Bitstamp Cloud Engineering team is responsible for designing, maintaining, and scaling the AWS infrastructure that powers our global crypto exchange. As part of the Robinhood family, we are aligning our systems to support global expansion while maintaining the stability and reliability our institutional customers depend on. The team partners closely with Security, Data, Product Engineering, and Platform teams to ensure our infrastructure supports secure trading, high availability, and regulatory requirements. We value practical solutions, operational rigor, and steady progress toward automation and long-term scalability.</p>
<p>As a Senior Cloud Engineer, you will be an established individual contributor responsible for the reliability, scalability, and modernization of our AWS environment. You will manage core compute and database infrastructure while contributing to projects that align Bitstamp’s systems with Robinhood’s broader platform architecture. This role balances hands-on operational support with forward-looking automation work, reducing manual effort and improving system resilience. Your work will directly support high-volume trading systems and ensure our platform performs consistently as we scale globally.</p>
<p>This role is based in our Ljubljana office, with in-person attendance expected at least 3 days per week. At Robinhood, we believe in the power of in-person work to accelerate progress, spark innovation, and strengthen community. Our office experience is intentional, energizing, and designed to fully support high-performing teams.</p>
<p>Applications for this role will be accepted through April 27th, 2026. This role requires participation in an on-call rotation to support business needs.</p>
<p><strong>Responsibilities</strong></p>
<ul>
<li>Manage and troubleshoot AWS services including EC2, ECS, Aurora, and DynamoDB to ensure high availability and performance.</li>
<li>Contribute to infrastructure projects that align Bitstamp systems with Robinhood’s global platform architecture and shared services.</li>
<li>Identify manual operational processes and implement automation using infrastructure-as-code and workflow tooling.</li>
<li>Monitor database and compute capacity, adjusting configurations to support platform growth and transaction volume.</li>
<li>Participate in the on-call rotation, diagnosing and resolving production issues to maintain 24/7 system stability.</li>
</ul>
<p><strong>Requirements</strong></p>
<ul>
<li>2–3+ years of hands-on experience in Cloud Engineering, DevOps, or Infrastructure Engineering within AWS environments.</li>
<li>Practical experience with EC2, ECS, and database technologies such as Aurora and DynamoDB.</li>
<li>Able to use structured troubleshooting methods and data analysis to resolve moderately complex infrastructure issues.</li>
<li>Comfortable balancing operational responsibilities with incremental automation improvements.</li>
<li>Able to communicate technical concepts clearly and collaborate effectively with engineering and platform teams.</li>
</ul>
<p><strong>What we offer</strong></p>
<ul>
<li>Challenging, high-impact work to grow your career</li>
<li>Performance-driven compensation with multipliers for outsized impact and bonus programs</li>
<li>Top-tier benefits to fuel your work, including supplemental health insurance, ancillary insurance, and mental health support programs</li>
<li>Lifestyle wallet: a highly flexible employer-paid benefits spending account covering expenses beyond traditional benefits, such as wellness, childcare, learning, and more.</li>
<li>Time off to recharge including company holidays, paid time off, sick time, paid volunteer time off, parental leave, and more!</li>
<li>Exceptional office experience with catered meals, events, and comfortable workspaces.</li>
<li>Monthly commuter stipend to help offset in-office commuting costs</li>
</ul>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>AWS, EC2, ECS, Aurora, DynamoDB, Cloud Engineering, DevOps, Infrastructure Engineering</Skills>
      <Category>Engineering</Category>
      <Industry>Finance</Industry>
      <Employername>Bitstamp</Employername>
      <Employerlogo>https://logos.yubhub.co/bitstamp.net.png</Employerlogo>
      <Employerdescription>Bitstamp is a cryptocurrency exchange that operates globally. It was founded in 2011 and is headquartered in Luxembourg.</Employerdescription>
      <Employerwebsite>https://www.bitstamp.net/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/robinhood/jobs/7589432?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>Ljubljana, Slovenia</Location>
      <Country></Country>
      <Postedate>2026-04-25</Postedate>
    </job>
    <job>
      <externalid>5bc76aca-281</externalid>
      <Title>Research Engineer, Data Infrastructure</Title>
      <Description><![CDATA[<p>About Mistral AI</p>
<p>At Mistral AI, we believe in the power of AI to simplify tasks, save time, and enhance learning and creativity. Our technology is designed to integrate seamlessly into daily working life.</p>
<p>We are a dynamic team passionate about AI and its potential to transform society. Our diverse workforce thrives in competitive environments and is committed to driving innovation.</p>
<p>Our teams are distributed across France, the USA, the UK, Germany, and Singapore. We are creative, low-ego and team-spirited.</p>
<p>Join us to be part of a pioneering company shaping the future of AI.</p>
<p>Together, we can make a meaningful impact.</p>
<p>Role Summary</p>
<p>Research Engineer, Data Infrastructure</p>
<p>The Data Infrastructure team at Mistral AI is architecting the backbone of our frontier model training and fine-tuning ecosystem. We are building the specialized compute and data fabrics required to power the development of world-class AI.</p>
<p>Our vision is to operate some of the largest compute fleets in production and build data lakes and metadata systems with a roadmap toward exabyte-scale architecture.</p>
<p>We are currently in the process of building a high-performance training platform designed for massive scale across both on-premise and cloud-native Kubernetes environments.</p>
<p>We are leading a strategic transition from legacy scheduling to modern orchestration.</p>
<p>With numerous clusters distributed across various regions, we are focused on implementing sophisticated multi-cluster orchestration and cloud-bursting capabilities to better utilize our global resources and ensure our researchers have seamless access to compute wherever it resides.</p>
<p>Our mission is to evolve our current systems into a platform that is as durable as it is flexible.</p>
<p>Location: Paris / London (hybrid) or remote EU/UK with one hub day per month.</p>
<p>About the Role</p>
<p>This role focuses on building and operating the next generation of data infrastructure at Mistral AI.</p>
<p>You will be a core contributor to our evolution, helping us design and scale massive compute fleets and storage systems designed for high performance and scalability.</p>
<p>You will help us move toward a future of decoupled control and data planes, scaling big data compute and storage platforms while ensuring secure and governed data access for MLOps and research.</p>
<p>You will take full lifecycle ownership: from architecting the migration away from legacy orchestrators to implementing production-grade pipelines and participating in on-call rotations for critical training jobs.</p>
<p>In this role, you will:</p>
<ul>
<li>Build &amp; Scale: Help us reach our goal of operating massive distributed compute and storage systems.</li>
<li>Global Orchestration: Architect and maintain multi-cluster orchestration layers to optimize workload placement across diverse hardware and regions.</li>
<li>Design Future-Proof Storage: Architect our transition to modern storage formats to handle fine-tuning datasets at a scale that anticipates exabyte growth.</li>
<li>Platform Engineering: Contribute to the development of our internal training platform, ensuring seamless model training and fine-tuning capabilities across Kubernetes- and SLURM-based environments.</li>
<li>Metadata &amp; Lineage: Implement and manage systems to provide clear visibility and lineage as our data and model pipelines grow in complexity.</li>
<li>Operational Excellence: Use modern deployment workflows to manage cloud-native deployments, ensuring our data platform can scale by orders of magnitude while remaining reliable and efficient.</li>
</ul>
<p><strong>You might thrive in this role if you:</strong></p>
<ul>
<li>Have 4+ years of experience in Data Infrastructure, MLOps, or Infrastructure Engineering.</li>
<li>Have experience or a strong interest in supporting foundational compute and storage platforms.</li>
<li>Are proficient in Python and enjoy solving the &quot;brittle data lake&quot; problem with modern, columnar storage standards.</li>
<li>Are well-versed in Kubernetes-native tooling and excited to debug large-scale distributed systems across multi-cluster environments.</li>
<li>Take pride in building and operating scalable, reliable, and secure systems from the ground up.</li>
<li>Are comfortable with ambiguity and the challenges of building high-scale infrastructure in a rapid-growth AI environment.</li>
</ul>
<p>Benefits</p>
<p>France</p>
<ul>
<li>Competitive cash salary and equity</li>
<li>Food: Daily lunch vouchers</li>
<li>Sport: Monthly contribution to a Gympass subscription</li>
<li>Transportation: Monthly contribution to a mobility pass</li>
<li>Health: Full health insurance for you and your family</li>
<li>Parental: Generous parental leave policy</li>
<li>Visa sponsorship</li>
</ul>
<p>UK</p>
<ul>
<li>Competitive cash salary and equity</li>
<li>Insurance</li>
<li>Transportation: Reimburse office parking charges, or £90 per month for public transport</li>
<li>Sport: £90 per month reimbursement for gym membership</li>
<li>Meal voucher: £200 monthly allowance for meals</li>
<li>Pension plan: SmartPension (percentages are 5% Employee &amp; 3% Employer)</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>Python, Kubernetes, Data Infrastructure, MLOps, Infrastructure Engineering, Cloud-Native Deployments, Modern Deployment Workflows, Columnar Storage Standards, Distributed Systems, Multi-Cluster Environments</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Mistral AI</Employername>
      <Employerlogo>https://logos.yubhub.co/mistral.ai.png</Employerlogo>
      <Employerdescription>Mistral AI is a company that develops high-performance, optimized, open-source and cutting-edge AI models, products and solutions.</Employerdescription>
      <Employerwebsite>https://mistral.ai</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.lever.co/mistral/071a5491-ea01-413f-ad78-f85b5e4c2215?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>Paris</Location>
      <Country></Country>
      <Postedate>2026-04-24</Postedate>
    </job>
    <job>
      <externalid>eea394f9-dab</externalid>
      <Title>Research Engineer, Data Infrastructure</Title>
      <Description><![CDATA[<p>About Mistral AI</p>
<p>At Mistral AI, we believe in the power of AI to simplify tasks, save time, and enhance learning and creativity. Our technology is designed to integrate seamlessly into daily working life.</p>
<p>Role Summary</p>
<p>This role focuses on building and operating the next generation of data infrastructure at Mistral AI. You will be a core contributor to our evolution, helping us design and scale massive compute fleets and storage systems designed for high performance and scalability. You will help us move toward a future of decoupled control and data planes, scaling big data compute and storage platforms while ensuring secure and governed data access for MLOps and research. You will take full lifecycle ownership: from architecting the migration away from legacy orchestrators to implementing production-grade pipelines and participating in on-call rotations for critical training jobs.</p>
<p>Responsibilities</p>
<ul>
<li>Build &amp; Scale: Help us reach our goal of operating massive distributed compute and storage systems</li>
<li>Global Orchestration: Architect and maintain multi-cluster orchestration layers to optimize workload placement across diverse hardware and regions.</li>
<li>Design Future-Proof Storage: Architect our transition to modern storage formats to handle fine-tuning datasets at a scale that anticipates exabyte growth.</li>
<li>Platform Engineering: Contribute to the development of our internal training platform, ensuring seamless model training and fine-tuning capabilities across Kubernetes- and SLURM-based environments.</li>
<li>Metadata &amp; Lineage: Implement and manage systems to provide clear visibility and lineage as our data and model pipelines grow in complexity.</li>
<li>Operational Excellence: Use modern deployment workflows to manage cloud-native deployments, ensuring our data platform can scale by orders of magnitude while remaining reliable and efficient.</li>
</ul>
<p>About you</p>
<ul>
<li>Have 4+ years of experience in Data Infrastructure, MLOps, or Infrastructure Engineering.</li>
<li>Have experience or a strong interest in supporting foundational compute and storage platforms.</li>
<li>Are proficient in Python and enjoy solving the &quot;brittle data lake&quot; problem with modern, columnar storage standards.</li>
<li>Are well-versed in Kubernetes-native tooling and excited to debug large-scale distributed systems across multi-cluster environments.</li>
<li>Take pride in building and operating scalable, reliable, and secure systems from the ground up.</li>
<li>Are comfortable with ambiguity and the challenges of building high-scale infrastructure in a rapid-growth AI environment.</li>
</ul>
<p>What we offer</p>
<ul>
<li>Competitive salary and equity.</li>
<li>Healthcare: Medical/Dental/Vision covered for you and your family.</li>
<li>Pension: 401K (6% matching)</li>
<li>PTO: 18 days</li>
<li>Transportation: Reimburse office parking charges, or $120/month for public transport</li>
<li>Sport: $120/month reimbursement for gym membership</li>
<li>Meal stipend: $400 monthly allowance for meals (solution might evolve as we grow bigger)</li>
<li>Visa sponsorship</li>
<li>Coaching: we offer BetterUp coaching on a voluntary basis</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>mid</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>Python, Kubernetes, SLURM, Data Infrastructure, MLOps, Infrastructure Engineering</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Mistral AI</Employername>
      <Employerlogo>https://logos.yubhub.co/mistral.ai.png</Employerlogo>
      <Employerdescription>Mistral AI is a company that develops and provides high-performance, optimized, open-source and cutting-edge AI models, products and solutions.</Employerdescription>
      <Employerwebsite>https://mistral.ai/careers</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.lever.co/mistral/37f53ee5-dd88-43e3-be6a-70e3db159c8f?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>Palo Alto</Location>
      <Country></Country>
      <Postedate>2026-04-24</Postedate>
    </job>
    <job>
      <externalid>d0203959-b2e</externalid>
      <Title>Research Engineer, Data Infrastructure</Title>
      <Description><![CDATA[<p>About Mistral AI</p>
<p>Mistral AI is a pioneering company shaping the future of AI. We believe in the power of AI to simplify tasks, save time, and enhance learning and creativity.</p>
<p>Role Summary</p>
<p>This role focuses on building and operating the next generation of data infrastructure at Mistral AI. You will be a core contributor to our evolution, helping us design and scale massive compute fleets and storage systems designed for high performance and scalability. You will help us move toward a future of decoupled control and data planes, scaling big data compute and storage platforms while ensuring secure and governed data access for MLOps and research. You will take full lifecycle ownership: from architecting the migration away from legacy orchestrators to implementing production-grade pipelines and participating in on-call rotations for critical training jobs.</p>
<p>Responsibilities</p>
<ul>
<li>Build &amp; Scale: Help us reach our goal of operating massive distributed compute and storage systems</li>
<li>Global Orchestration: Architect and maintain multi-cluster orchestration layers to optimize workload placement across diverse hardware and regions.</li>
<li>Design Future-Proof Storage: Architect our transition to modern storage formats to handle fine-tuning datasets at a scale that anticipates exabyte growth.</li>
<li>Platform Engineering: Contribute to the development of our internal training platform, ensuring seamless model training and fine-tuning capabilities across Kubernetes- and SLURM-based environments.</li>
<li>Metadata &amp; Lineage: Implement and manage systems to provide clear visibility and lineage as our data and model pipelines grow in complexity.</li>
<li>Operational Excellence: Use modern deployment workflows to manage cloud-native deployments, ensuring our data platform can scale by orders of magnitude while remaining reliable and efficient.</li>
</ul>
<p>About You</p>
<ul>
<li>Have 4+ years of experience in Data Infrastructure, MLOps, or Infrastructure Engineering.</li>
<li>Have experience or a strong interest in supporting foundational compute and storage platforms.</li>
<li>Are proficient in Python and enjoy solving the &quot;brittle data lake&quot; problem with modern, columnar storage standards.</li>
<li>Are well-versed in Kubernetes-native tooling and excited to debug large-scale distributed systems across multi-cluster environments.</li>
<li>Take pride in building and operating scalable, reliable, and secure systems from the ground up.</li>
<li>Are comfortable with ambiguity and the challenges of building high-scale infrastructure in a rapid-growth AI environment.</li>
</ul>
<p>What We Offer</p>
<ul>
<li>Competitive salary and equity.</li>
<li>Healthcare: Medical/Dental/Vision covered for you and your family.</li>
<li>Pension: 401K (6% matching).</li>
<li>PTO: 18 days.</li>
<li>Transportation: Reimburse office parking charges, or $120/month for public transport.</li>
<li>Sport: $120/month reimbursement for gym membership.</li>
<li>Meal stipend: $400 monthly allowance for meals (solution might evolve as we grow bigger).</li>
<li>Visa sponsorship.</li>
<li>Coaching: we offer BetterUp coaching on a voluntary basis.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>mid</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>Python, Kubernetes, SLURM, Data Infrastructure, MLOps, Infrastructure Engineering</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Mistral AI</Employername>
      <Employerlogo>https://logos.yubhub.co/mistral.ai.png</Employerlogo>
      <Employerdescription>Mistral AI provides high-performance, optimized, open-source and cutting-edge AI models, products and solutions for enterprise and personal needs.</Employerdescription>
      <Employerwebsite>https://mistral.ai/careers</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.lever.co/mistral/37f53ee5-dd88-43e3-be6a-70e3db159c8f?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>Palo Alto</Location>
      <Country></Country>
      <Postedate>2026-04-24</Postedate>
    </job>
    <job>
      <externalid>dbfbd1d2-0a3</externalid>
      <Title>Research Engineer, Data Infrastructure</Title>
      <Description><![CDATA[<p>About Mistral AI</p>
<p>Mistral AI is a pioneering company shaping the future of AI. We believe in the power of AI to simplify tasks, save time, and enhance learning and creativity.</p>
<p>Role Summary</p>
<p>The Data Infrastructure team at Mistral AI is architecting the backbone of our frontier model training and fine-tuning ecosystem. We are building the specialized compute and data fabrics required to power the development of world-class AI.</p>
<p>In this role, you will be a core contributor to our evolution, helping us design and scale massive compute fleets and storage systems designed for high performance and scalability. You will help us move toward a future of decoupled control and data planes, scaling big data compute and storage platforms while ensuring secure and governed data access for MLOps and research.</p>
<p>Responsibilities</p>
<ul>
<li>Build &amp; Scale: Help us reach our goal of operating massive distributed compute and storage systems</li>
<li>Global Orchestration: Architect and maintain multi-cluster orchestration layers to optimize workload placement across diverse hardware and regions.</li>
<li>Design Future-Proof Storage: Architect our transition to modern storage formats to handle fine-tuning datasets at a scale that anticipates exabyte growth.</li>
<li>Platform Engineering: Contribute to the development of our internal training platform, ensuring seamless model training and fine-tuning capabilities across Kubernetes- and SLURM-based environments.</li>
<li>Metadata &amp; Lineage: Implement and manage systems to provide clear visibility and lineage as our data and model pipelines grow in complexity.</li>
<li>Operational Excellence: Use modern deployment workflows to manage cloud-native deployments, ensuring our data platform can scale by orders of magnitude while remaining reliable and efficient.</li>
</ul>
<p>You might thrive in this role if you:</p>
<ul>
<li>Have 4+ years of experience in Data Infrastructure, MLOps, or Infrastructure Engineering.</li>
<li>Have experience or a strong interest in supporting foundational compute and storage platforms.</li>
<li>Are proficient in Python and enjoy solving the &quot;brittle data lake&quot; problem with modern, columnar storage standards.</li>
<li>Are well-versed in Kubernetes-native tooling and excited to debug large-scale distributed systems across multi-cluster environments.</li>
<li>Take pride in building and operating scalable, reliable, and secure systems from the ground up.</li>
<li>Are comfortable with ambiguity and the challenges of building high-scale infrastructure in a rapid-growth AI environment.</li>
</ul>
<p>Benefits</p>
<p>France</p>
<ul>
<li>Competitive cash salary and equity</li>
<li>Food: Daily lunch vouchers</li>
<li>Sport: Monthly contribution to a Gympass subscription</li>
<li>Transportation: Monthly contribution to a mobility pass</li>
<li>Health: Full health insurance for you and your family</li>
<li>Parental: Generous parental leave policy</li>
</ul>
<p>UK</p>
<ul>
<li>Competitive cash salary and equity</li>
<li>Insurance</li>
<li>Transportation: Reimburse office parking charges, or £90 per month for public transport</li>
<li>Sport: £90 per month reimbursement for gym membership</li>
<li>Meal voucher: £200 monthly allowance for meals</li>
<li>Pension plan: SmartPension (percentages are 5% Employee &amp; 3% Employer)</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>mid</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>Python, Kubernetes, Data Infrastructure, MLOps, Infrastructure Engineering, Columnar Storage Standards</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Mistral AI</Employername>
      <Employerlogo>https://logos.yubhub.co/mistral.ai.png</Employerlogo>
      <Employerdescription>Mistral AI is a company that develops and provides artificial intelligence (AI) technology and solutions. It has a diverse workforce and operates globally.</Employerdescription>
      <Employerwebsite>https://mistral.ai/careers</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.lever.co/mistral/071a5491-ea01-413f-ad78-f85b5e4c2215?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>Paris</Location>
      <Country></Country>
      <Postedate>2026-04-24</Postedate>
    </job>
    <job>
      <externalid>26a321e1-136</externalid>
      <Title>Software Engineer, Codex Core Agents</Title>
      <Description><![CDATA[<p><strong>Compensation</strong></p>
<p>$230K – $385K • Offers Equity</p>
<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>
<ul>
<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>
<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>
<li>401(k) retirement plan with employer match</li>
<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>
<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>
<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>
<li>Mental health and wellness support</li>
<li>Employer-paid basic life and disability coverage</li>
<li>Annual learning and development stipend to fuel your professional growth</li>
<li>Daily meals in our offices, and meal delivery credits as eligible</li>
<li>Relocation support for eligible employees</li>
<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>
</ul>
<p><strong>About the Team</strong></p>
<p>The Codex Core Agent team builds the kernel of Codex. We own making the agent better, accelerating research, and shipping those improvements to production for our users.</p>
<p><strong>About the Role</strong></p>
<p>We’re looking for engineers to build the infrastructure that powers Codex agents in production. This role focuses on the systems that let models safely execute code, interact with tools, complete long-running tasks, and operate reliably and efficiently at scale.</p>
<p><strong>What You’ll Do</strong></p>
<ul>
<li>Design and build execution environments for AI agents, including sandboxing, isolation, and reproducibility.</li>
<li>Develop systems for agent orchestration across multi-step, tool-using workflows.</li>
<li>Build infrastructure for running, testing, and debugging code generated by models.</li>
<li>Create state and memory systems that allow agents to persist context across long-running tasks.</li>
<li>Optimize tokens, latency, reliability, and cost across Codex’s production fleet.</li>
<li>Support model rollouts, capacity planning, and the core tradeoffs between quality, speed, and economics to manage a fleet of frontier agents at scale.</li>
<li>Build shared platform capabilities that unblock product teams, partner teams, and open source Codex.</li>
</ul>
<p><strong>You Might Be a Good Fit If You</strong></p>
<ul>
<li>Have strong experience in distributed systems or infrastructure engineering.</li>
<li>Have built systems involving containers, sandboxing, or virtualization.</li>
<li>Are comfortable working across backend systems, APIs, and developer tooling.</li>
<li>Care deeply about system reliability, performance, and security.</li>
<li>Enjoy working on ambiguous, zero-to-one problems.</li>
<li>Want to help build the systems that turn model capability into a dependable software engineering agent.</li>
</ul>
<p><strong>Bonus Points</strong></p>
<ul>
<li>Experience with code execution platforms, CI/CD systems, or build systems.</li>
<li>Familiarity with LLMs, agents, or tool-use frameworks.</li>
<li>Background in security engineering or isolation systems.</li>
<li>Experience building developer platforms, IDE tooling, or open source infrastructure.</li>
</ul>
<p><strong>About OpenAI</strong></p>
<p>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.</p>
<p>We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.</p>
<p>For additional information, please see [OpenAI’s Affirmative Action and Equal Employment Opportunity Policy Statement](https://cdn.openai.com/policies/eeo-policy-statement.pdf).</p>
<p>Background checks for applicants will be administered in accordance with applicable law, and qualified applicants with arrest or conviction records will be considered for employment consistent with those laws, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act, for US-based candidates. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations.</p>
<p>To notify OpenAI that you believe this job posting is non-compliant, please submit a report through [this form](https://form.asana.com/?d=57018692298241&amp;k=5MqR40fZd7jlxVUh5J-UeA). No response will be provided to inquiries unrelated to job posting compliance.</p>
<p>We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this [link](https://form.asana.com/?k=bQ7w9h3iexRlicUdWRiwvg&amp;d=57018692298241).</p>
<p>[OpenAI Global Applicant Privacy Policy](https://cdn.openai.com/policies/global-employee-and-contractor-privacy-policy.pdf)</p>
<p>At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.</p>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>Full time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>$230K – $385K</Salaryrange>
      <Skills>distributed systems, infrastructure engineering, containers, sandboxing, virtualization, backend systems, APIs, developer tooling, security, code execution platforms, CI/CD systems, build systems, LLMs, agents, tool-use frameworks, security engineering, isolation systems, developer platforms, IDE tooling, open source infrastructure</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>OpenAI</Employername>
      <Employerlogo>https://logos.yubhub.co/openai.com.png</Employerlogo>
      <Employerdescription>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity.</Employerdescription>
      <Employerwebsite>https://openai.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.ashbyhq.com/openai/7ade7a12-845c-4e3a-af23-c028420bd181?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>San Francisco; London, UK; New York City; Seattle</Location>
      <Country></Country>
      <Postedate>2026-04-24</Postedate>
    </job>
    <job>
      <externalid>bdf4e05a-b8c</externalid>
      <Title>MTS - Site Reliability Engineer</Title>
<Description><![CDATA[<p>As Microsoft continues to push the boundaries of AI, we are on the lookout for individuals to work with us on the most interesting and challenging AI questions of our time. Our vision is bold and broad: to build systems that have true artificial intelligence across agents, applications, services, and infrastructure. It’s also inclusive: we aim to make AI accessible to all, from consumers to businesses to developers, so that everyone can realize its benefits.</p>
<p>We’re looking for an experienced Site Reliability Engineer (SRE) to join our infrastructure team. In this role, you’ll blend software engineering and systems engineering to keep our large-scale distributed AI infrastructure reliable and efficient. You’ll work closely with ML researchers, data engineers, and product developers to design and operate the platforms that power training, fine-tuning, and serving generative AI models.</p>
<p>Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.</p>
<p>Responsibilities:</p>
<ul>
<li>Reliability &amp; Availability: Ensure uptime, resiliency, and fault tolerance of AI model training and inference systems.</li>
<li>Observability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into model serving pipelines and infrastructure.</li>
<li>Performance Optimization: Analyze system performance and scalability; optimize resource utilization (compute, GPU clusters, storage, networking).</li>
<li>Automation &amp; Tooling: Build automation for deployments, incident response, scaling, and failover in hybrid cloud/on-prem CPU+GPU environments.</li>
<li>Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements.</li>
<li>Security &amp; Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments.</li>
<li>Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows.</li>
</ul>
<p>Qualifications:</p>
<p>Required Qualifications: 4+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.</p>
<p>Preferred Qualifications:</p>
<ul>
<li>Strong proficiency in Kubernetes, Docker, and container orchestration.</li>
<li>Knowledge of CI/CD pipelines for inference and ML model deployment.</li>
<li>Hands-on experience with public cloud platforms (Azure, AWS, GCP) and infrastructure-as-code.</li>
<li>Expertise in monitoring &amp; observability tools (Grafana, Datadog, OpenTelemetry, etc.).</li>
<li>Strong programming/scripting skills in Python, Go, or Bash.</li>
<li>Solid knowledge of distributed systems, networking, and storage.</li>
<li>Experience running large-scale GPU clusters for ML/AI workloads.</li>
<li>Familiarity with ML training/inference pipelines.</li>
<li>Experience with high-performance computing (HPC) and workload schedulers (Kubernetes operators).</li>
<li>Background in capacity planning &amp; cost optimization for GPU-heavy environments.</li>
</ul>
<p>Work on cutting-edge infrastructure that powers the future of Generative AI. Collaborate with world-class researchers and engineers. Impact millions of users through reliable and responsible AI deployments. Competitive compensation, equity options, and comprehensive benefits.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$119,800 - $234,700 per year</Salaryrange>
      <Skills>Site Reliability Engineering, DevOps, Infrastructure Engineering, Kubernetes, Docker, container orchestration, CI/CD pipelines, ML model deployment, public cloud platforms, Azure, AWS, GCP, infrastructure-as-code, monitoring &amp; observability tools, Grafana, Datadog, OpenTelemetry, Python, Go, Bash, distributed systems, networking, storage, GPU clusters, ML training/inference pipelines, high-performance computing, workload schedulers, capacity planning, cost optimization, cloud architecture, containerization, microservices, API design, security, compliance, agile development, scrum, kanban</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Microsoft</Employername>
      <Employerlogo>https://logos.yubhub.co/microsoft.ai.png</Employerlogo>
      <Employerdescription>Microsoft is a multinational technology company that develops, manufactures, licenses, and supports a wide range of software products, services, and devices.</Employerdescription>
      <Employerwebsite>https://microsoft.ai</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://microsoft.ai/job/mts-site-reliability-engineer/?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>Redmond</Location>
      <Country></Country>
      <Postedate>2026-04-24</Postedate>
    </job>
    <job>
      <externalid>5ff592ac-9d8</externalid>
      <Title>Sr. Software Engineer, Inference</Title>
      <Description><![CDATA[<p>We are seeking a Senior Software Engineer to join our Inference team, responsible for building and maintaining critical systems that serve Claude to millions of users worldwide. The team has a dual mandate: maximizing compute efficiency to serve our explosive customer growth, while enabling breakthrough research by giving our scientists the high-performance inference infrastructure they need to develop next-generation models.</p>
<p>As a Senior Software Engineer, you will be responsible for designing, implementing, and deploying large-scale distributed systems, including intelligent request routing, fleet-wide orchestration, and load balancing. You will work closely with our research team to develop new inference features and integrate new AI accelerator platforms.</p>
<p>To succeed in this role, you should have significant software engineering experience, particularly with distributed systems, and be results-oriented with a bias towards flexibility and impact. You should also be able to pick up slack, even if it goes outside your job description, and thrive in environments where technical excellence directly drives both business results and research breakthroughs.</p>
<p>Responsibilities:</p>
<ul>
<li>Design and implement large-scale distributed systems, including intelligent request routing, fleet-wide orchestration, and load balancing</li>
<li>Work closely with our research team to develop new inference features and integrate new AI accelerator platforms</li>
<li>Collaborate with cross-functional teams to ensure seamless deployment and operation of our systems</li>
<li>Analyze observability data to tune performance based on real-world production workloads</li>
<li>Manage multi-region deployments and geographic routing for global customers</li>
</ul>
<p>Requirements:</p>
<ul>
<li>Bachelor&#39;s degree or equivalent combination of education, training, and/or experience</li>
<li>Significant software engineering experience, particularly with distributed systems</li>
<li>Results-oriented with a bias towards flexibility and impact</li>
<li>Ability to pick up slack, even if it goes outside your job description</li>
<li>Thrives in environments where technical excellence directly drives both business results and research breakthroughs</li>
</ul>
<p>Preferred Qualifications:</p>
<ul>
<li>Experience with Kubernetes and cloud infrastructure (AWS, GCP)</li>
<li>Familiarity with machine learning systems and infrastructure</li>
<li>Strong communication and collaboration skills</li>
</ul>
<p>Benefits:</p>
<ul>
<li>Competitive compensation and benefits</li>
<li>Optional equity donation matching</li>
<li>Generous vacation and parental leave</li>
<li>Flexible working hours</li>
<li>Lovely office space in which to collaborate with colleagues</li>
</ul>
<p>Guidance on Candidates&#39; AI Usage: Learn about our policy for using AI in our application process</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>£225,000-£325,000 GBP</Salaryrange>
      <Skills>Distributed systems, Kubernetes, Cloud infrastructure, Machine learning systems, Infrastructure engineering, Python, Rust, Java, C++, Go</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic is a public benefit corporation that creates reliable, interpretable, and steerable AI systems.</Employerdescription>
      <Employerwebsite>https://www.anthropic.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/5152348008?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>London, UK</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>b0e99a49-d99</externalid>
      <Title>Senior Engineering Manager - Infrastructure</Title>
      <Description><![CDATA[<p>About Us</p>
<p>We&#39;re looking for an Infrastructure Senior Engineering Manager to help us build a seamless, reliable platform for the dbt platform across AWS, Azure, and GCP.</p>
<p>Our team&#39;s mission is to create a seamless developer experience by providing a stable, observable, and easy-to-use infrastructure platform. Over the past year, we&#39;ve designed and operationalized a next-gen cell-based architecture, scaling the dbt platform across all three cloud providers. Now, we&#39;re focused on automation, self-service, and improving developer velocity through better tooling, processes, and infrastructure design.</p>
<p>As a Senior Engineering Manager, you&#39;ll lead your team on infrastructure projects to refine our platform while ensuring performance, reliability, and an excellent developer experience. You&#39;ll collaborate across teams, tackle real infrastructure challenges, and help shape the future of the multi-cloud dbt platform.</p>
<p><strong>Responsibilities</strong></p>
<ul>
<li>Build, lead, and coach a team of 8-12 engineers to manage the infrastructure for the dbt platform and report to the Director of Infrastructure</li>
<li>Empower your team to achieve big goals by giving them product and business context and supporting team ownership of the roadmap, product development lifecycle, and technical excellence</li>
<li>Dive deep into our product to frame tradeoffs and make decisions about what, how, and when we build</li>
<li>Partner with Product Marketing, Solutions Architecture, and Customer Support to build delightful migration experiences, helping our customers seamlessly move off legacy deployments</li>
<li>Coach engineers in product thinking, quality, and software engineering. Build individualized growth plans and match interests and capabilities to team goals</li>
<li>Work with peer managers to evolve organizational processes like product training, technical decision making, project execution, and planning</li>
</ul>
<p><strong>Requirements</strong></p>
<ul>
<li>5+ years in people management with a software or infrastructure engineering team</li>
<li>Experience managing senior individual contributors (Staff+ level)</li>
<li>Experience supporting a cloud-based infrastructure with complex resource requirements and global deployment strategy</li>
<li>Deep understanding of Terraform and cloud infrastructure state management</li>
</ul>
<p><strong>Nice to Have</strong></p>
<ul>
<li>Experience leading teams through all parts of the product development lifecycle</li>
<li>Have successfully partnered across teams and departments to coordinate cross-cutting initiatives</li>
<li>You are interested in our mission and values. You are inspired to drive progress in the data and analytics ecosystem</li>
</ul>
<p><strong>Compensation &amp; Benefits</strong></p>
<p>Salary: We offer competitive compensation packages commensurate with experience, including salary, equity, and where applicable, performance-based pay. Our Talent Acquisition Team can answer questions around dbt Labs&#39; total rewards during your interview process.</p>
<p>In select locations (including Boston, Chicago, Denver, Los Angeles, Philadelphia, New York Metro, San Francisco, DC Metro, Seattle, Austin), an alternate range may apply, as specified below.</p>
<p>The typical starting salary range for this role is: $223,000 - $270,000 USD</p>
<p>The typical starting salary range for this role in the select locations listed is: $248,000 - $300,000 USD</p>
<p><strong>Equity Stake &amp; Benefits</strong></p>
<ul>
<li>dbt Labs offers: unlimited vacation, 401k w/3% guaranteed contribution, excellent healthcare, paid parental leave, wellness stipend, home office stipend, and more!</li>
</ul>
<p><strong>Our Hiring Process</strong></p>
<ul>
<li>Interview with a Talent Acquisition Partner (30 Mins)</li>
<li>Technical Interview with Hiring Manager (60 Mins)</li>
<li>Team Interviews (3 rounds, 45 Mins each)</li>
<li>Final Values Interview (30 Mins)</li>
</ul>
<p>If you’re passionate about building well-designed, high-impact software, we’d love to hear from you!</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>remote</Workarrangement>
      <Salaryrange>$223,000 - $270,000 USD</Salaryrange>
      <Skills>Terraform, Cloud infrastructure state management, People management, Software engineering, Infrastructure engineering, Product development lifecycle, Technical decision making, Project execution, Process improvement</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>dbt Labs</Employername>
      <Employerlogo>https://logos.yubhub.co/getdbt.com.png</Employerlogo>
      <Employerdescription>dbt Labs is a leading analytics engineering platform, used by over 90,000 teams every week, with over $100 million in annual recurring revenue.</Employerdescription>
      <Employerwebsite>https://www.getdbt.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/dbtlabsinc/jobs/4686309005?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>US - Remote</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>67b4ccd7-51d</externalid>
      <Title>Senior Software Engineer, Observability Insights</Title>
      <Description><![CDATA[<p>Join CoreWeave&#39;s Observability team, where we are building the next-generation insights layer for AI systems.</p>
<p>Our team empowers internal and external users to understand, troubleshoot, and optimize complex AI workloads by transforming telemetry into actionable insights.</p>
<p>As a Senior Software Engineer on the Observability Insights team, you will lead the development of agentic interfaces and product experiences that sit atop CoreWeave&#39;s telemetry layer.</p>
<p>You&#39;ll design multi-tenant APIs, managed Grafana experiences, and MCP-based tool servers to help customers and internal teams interact with data in innovative ways.</p>
<p>Collaborating closely with PMs and engineering leadership, your work will shape the end-to-end observability experience and influence how people engage with cutting-edge AI infrastructure.</p>
<p><strong>About the role</strong></p>
<ul>
<li>6+ years of experience in software or infrastructure engineering building production-grade backend systems and distributed APIs.</li>
<li>Strong focus on developer-facing infrastructure, with a customer-obsessed approach to SDKs, CLIs, and APIs.</li>
<li>Proficient in reliability engineering, including fault-tolerant design, SLOs, error budgets, and multi-tenant system resilience.</li>
<li>Familiar with observability systems such as ClickHouse, Loki, VictoriaMetrics, Prometheus, and Grafana.</li>
<li>Experienced in agentic applications or LLM-based features, including grounding, tool calling, and operational safety.</li>
<li>Comfortable writing production code primarily in Go, with the ability to integrate Python components when needed.</li>
<li>Collaborative experience in agile teams delivering end-to-end telemetry-to-insights pipelines.</li>
</ul>
<p><strong>Preferred</strong></p>
<ul>
<li>Experience operating Kubernetes clusters at scale, especially for AI workloads.</li>
<li>Hands-on experience with logging, tracing, and metrics platforms in production, with deep knowledge of cardinality, indexing, and query optimization.</li>
<li>Experienced in running distributed systems or API services at cloud scale, including event streaming and data pipeline management.</li>
<li>Familiarity with LLM frameworks, MCP, and agentic tooling (e.g., Langchain, AgentCore).</li>
</ul>
<p><strong>Why CoreWeave?</strong></p>
<p>At CoreWeave, we work hard, have fun, and move fast!</p>
<p>We&#39;re in an exciting stage of hyper-growth that you will not want to miss out on.</p>
<p>We&#39;re not afraid of a little chaos, and we&#39;re constantly learning.</p>
<p>Our team cares deeply about how we build our product and how we work together, which is represented through our core values:</p>
<ul>
<li>Be Curious at Your Core</li>
<li>Act Like an Owner</li>
<li>Empower Employees</li>
<li>Deliver Best-in-Class Client Experiences</li>
<li>Achieve More Together</li>
</ul>
<p>We support and encourage an entrepreneurial outlook and independent thinking.</p>
<p>We foster an environment that encourages collaboration and enables the development of innovative solutions to complex problems.</p>
<p>As we get set for takeoff, the organization&#39;s growth opportunities are constantly expanding.</p>
<p>You will be surrounded by some of the best talent in the industry, who will want to learn from you, too.</p>
<p>Come join us!</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$165,000 to $242,000</Salaryrange>
      <Skills>software engineering, infrastructure engineering, backend systems, distributed APIs, reliability engineering, fault-tolerant design, SLOs, error budgets, multi-tenant system resilience, observability systems, ClickHouse, Loki, VictoriaMetrics, Prometheus, Grafana, agentic applications, LLM-based features, grounding, tool calling, operational safety, Go, Python, Kubernetes, logging, tracing, metrics platforms, cardinality, indexing, query optimization, event streaming, data pipeline management, LLM frameworks, MCP, agent tooling, operating Kubernetes clusters</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>CoreWeave</Employername>
      <Employerlogo>https://logos.yubhub.co/coreweave.com.png</Employerlogo>
      <Employerdescription>CoreWeave is a cloud computing company that provides a platform for building and scaling AI.</Employerdescription>
      <Employerwebsite>https://www.coreweave.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/coreweave/jobs/4650163006?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>New York, NY / Sunnyvale, CA</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>bd3983d2-d82</externalid>
      <Title>Staff Technical Program Manager, Reliability &amp; Observability</Title>
      <Description><![CDATA[<p>We&#39;re looking for a technical, hands-on, and mission-driven Staff Technical Program Manager (TPM) to lead Reliability &amp; Observability initiatives. In this role, you will collaborate closely with Machine Learning (ML) engineers, infrastructure engineers, and product managers across Airbnb to develop holistic solutions that ensure the Airbnb platform is robust, highly available, and transparent in its operations.</p>
<p>The Reliability &amp; Observability team enables safe, resilient, and transparent operation of critical systems powering Airbnb. This team creates and maintains the frameworks and platforms for proactive monitoring, alerting, logging, tracing, and incident management, helping teams across Airbnb maintain service health and quickly remediate issues.</p>
<p>As a TPM, you will play a crucial role in tackling projects that power search and recommendations across the entire Airbnb platform, directly influencing how guests and hosts connect in meaningful ways. Success in this role requires a strong sense of ownership, coupled with the ability to think critically and strategically while managing relationships with cross-functional stakeholders.</p>
<p>Responsibilities:</p>
<ul>
<li>Shape and influence the technical direction of projects, ensuring we meet stakeholder needs while maintaining high-quality standards.</li>
<li>Rapidly prototype and validate project ideas through iterative development cycles, adapting as necessary based on insights and data.</li>
<li>Balance broad, outcome-driven thinking with attention to critical details, exercising sound judgment to prioritize where deep focus is necessary to ensure successful execution.</li>
<li>Define and secure stakeholder alignment on clear, measurable success criteria to accelerate AI initiatives.</li>
<li>Regularly assess risks and opportunities, and devise proactive mitigation strategies to maintain momentum and project success.</li>
<li>Maintain transparent and effective communication channels to keep stakeholders informed of progress, developments, and challenges.</li>
<li>Expertly present outcomes and updates to senior leadership, clearly and comprehensively articulating trade-offs, risks, and emerging opportunities.</li>
</ul>
<p>Your Expertise:</p>
<ul>
<li>At least 10 years of work experience, with at least 8 years as a TPM or in a relevant role.</li>
<li>Demonstrated ability to work through ambiguity to detailed solutions.</li>
<li>Self-motivated and proactive, with a proven ability to adapt well and work with teams having different operating cadences.</li>
<li>Sound business judgment, a proven ability to influence others, strong analytical skills, and a track record of taking ownership, leading data-driven analyses, and influencing results.</li>
<li>Experience with ML models, LLMs, LRMs, feature development, model testing, and resource management to support the development of AI-powered product experiences.</li>
<li>Familiarity with A/B testing, incremental delivery, and deployment.</li>
<li>Ability to ramp up quickly and learn new technologies with minimal lag time.</li>
<li>Excellent written and oral business communication and people skills, with the ability to influence stakeholders.</li>
</ul>
<p>Our Commitment To Inclusion &amp; Belonging:</p>
<p>Airbnb is committed to working with the broadest talent pool possible. We believe diverse ideas foster innovation and engagement, and allow us to attract creatively-led people, and to develop the best products, services and solutions.</p>
<p>All qualified individuals are encouraged to apply. We strive to also provide a disability inclusive application and interview process. If you are a candidate with a disability and require reasonable accommodation in order to submit an application, please contact us at: reasonableaccommodations@airbnb.com.</p>
<p>How We&#39;ll Take Care of You:</p>
<p>Our job titles may span more than one career level. The actual base pay is dependent upon many factors, such as: training, transferable skills, work experience, business needs and market demands. The base pay range is subject to change and may be modified in the future. This role may also be eligible for bonus, equity, benefits, and Employee Travel Credits.</p>
<p>Pay Range $194,000-$242,000 USD</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>remote</Workarrangement>
      <Salaryrange>$194,000-$242,000 USD</Salaryrange>
      <Skills>Technical Program Management, Reliability &amp; Observability, Machine Learning, Infrastructure Engineering, Product Management, A/B Testing, Incremental Delivery, Deployment</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Airbnb</Employername>
      <Employerlogo>https://logos.yubhub.co/airbnb.com.png</Employerlogo>
      <Employerdescription>Airbnb is a global online marketplace for short-term vacation rentals. It was founded in 2007 and has since grown to become one of the largest and most popular travel platforms in the world.</Employerdescription>
      <Employerwebsite>https://www.airbnb.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/airbnb/jobs/7558202?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>United States</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>51758515-c12</externalid>
      <Title>Member of Technical Staff</Title>
      <Description><![CDATA[<p>We are seeking a highly skilled Member of Technical Staff to join our team in managing and enhancing reliability across a multi-data center environment.</p>
<p>This role focuses on automating processes, building and implementing robust observability solutions, and ensuring seamless operations for mission-critical AI infrastructure.</p>
<p>The ideal candidate will combine strong coding abilities with hands-on data center experience to build scalable reliability services, optimize system performance, and minimize downtime, including close partnership with facility operations to address physical infrastructure impacts.</p>
<p>In an era where AI workloads demand near-zero downtime, this position plays a pivotal role in bridging software engineering principles with physical data center realities.</p>
<p>By prioritizing automation and observability, team members in this role can reduce mean time to recovery (MTTR) by up to 50% through proactive monitoring and automated remediation, based on industry benchmarks from high-scale environments like those at hyperscale cloud providers.</p>
<p>Responsibilities:</p>
<ul>
<li>Design, develop, and deploy scalable code and services (primarily in Python and Rust, with flexibility for emerging languages) to automate reliability workflows, including monitoring, alerting, incident response, and infrastructure provisioning.</li>
<li>Implement and maintain observability tools and practices, such as metrics collection, logging, tracing, and dashboards, to provide real-time insights into system health across multiple data centers; open to innovative stacks beyond traditional ones like ELK.</li>
<li>Collaborate with cross-functional teams, including software development, network engineering, site operations, and facility operations (critical facilities, mechanical/electrical teams, and data center infrastructure management), to identify reliability bottlenecks and automate solutions for fault tolerance, disaster recovery, capacity planning, and physical/environmental risk mitigation (e.g., power redundancy, cooling efficiency, and environmental monitoring integration).</li>
<li>Troubleshoot and resolve complex issues in data center environments, including hardware failures, environmental anomalies, software bugs, and network-related problems, while adhering to reliability principles like error budgets and SLAs.</li>
<li>Optimize Linux-based systems for performance, security, and reliability, including kernel tuning, container orchestration (e.g., Kubernetes or emerging alternatives), and scripting for automation.</li>
<li>Understand network topologies and concepts in large-scale, multi-data center environments to effectively troubleshoot connectivity, routing, redundancy, and performance issues; integrate observability into data center interconnects and facility-level controls for rapid diagnosis and automation.</li>
<li>Participate in on-call rotations, post-incident reviews (blameless postmortems), and continuous improvement initiatives to enhance overall site reliability, including joint exercises with facility teams for physical failover and recovery scenarios.</li>
<li>Mentor junior team members and document processes to foster a culture of automation, knowledge sharing, and adaptability to new technologies.</li>
</ul>
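<p>The automated-remediation workflow described above reduces to a probe-and-restart loop; a minimal sketch, where the in-memory service records and the check/restart functions are hypothetical stand-ins for real health endpoints and orchestrator API calls:</p>

```python
# Sketch of a probe-and-restart remediation loop. All names below are
# hypothetical stand-ins for real infrastructure hooks.
def check_health(service):
    return service["healthy"]  # in production: probe a health endpoint or metric

def restart(service):
    service["healthy"] = True  # in production: call the orchestrator
    return f"restarted {service['name']}"

def remediate(services):
    """Restart every service that fails its health check; report actions taken."""
    return [restart(svc) for svc in services if not check_health(svc)]

services = [{"name": "ingest", "healthy": True}, {"name": "scheduler", "healthy": False}]
print(remediate(services))  # ['restarted scheduler']
```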
<p>Basic Qualifications:</p>
<ul>
<li>Bachelor&#39;s degree in Computer Science, Computer Engineering, Electrical Engineering, or a closely related technical field (or equivalent professional experience).</li>
<li>5+ years of hands-on experience in site reliability engineering (SRE), infrastructure engineering, DevOps, or systems engineering, preferably supporting large-scale, distributed, or production environments.</li>
<li>Strong programming skills with proven production experience in Python (required for automation and tooling); experience with Rust or willingness to work in Rust is a plus, but strong coding fundamentals in at least one systems-level language (e.g., Python, Go, C++) are essential.</li>
<li>Solid experience with Linux systems administration, performance tuning, kernel-level understanding, and scripting/automation in production environments.</li>
<li>Practical knowledge of containerization and orchestration technologies, such as Docker and Kubernetes (or similar systems).</li>
<li>Experience implementing observability solutions, including metrics, logging, tracing, monitoring tools (e.g., Prometheus, Grafana, or alternatives), alerting, and dashboards.</li>
<li>Familiarity with troubleshooting complex issues in distributed systems, including software bugs, hardware failures, network problems, and environmental factors.</li>
<li>Understanding of networking fundamentals (TCP/IP, routing, redundancy, DNS) in large-scale or multi-site environments.</li>
<li>Experience participating in on-call rotations, incident response, post-incident reviews (blameless postmortems), and reliability practices such as error budgets or SLAs.</li>
<li>Ability to collaborate effectively with cross-functional teams (software engineers, network teams, site/facility operations, mechanical/electrical teams).</li>
</ul>
<p>Preferred Skills and Experience:</p>
<ul>
<li>7+ years of experience in SRE or infrastructure roles, ideally in hyperscale, cloud, or AI/ML training infrastructure environments with multi-data center setups.</li>
<li>Hands-on experience operating or scaling Kubernetes clusters (or equivalent orchestration) at large scale, including automation for provisioning, lifecycle management, and high availability.</li>
<li>Proficiency in Rust for systems programming and performance-critical components.</li>
<li>Direct experience integrating software reliability tools with physical data center infrastructure.</li>
</ul>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>Python, Rust, Linux systems administration, performance tuning, kernel-level understanding, scripting/automation, containerization, orchestration, observability, metrics collection, logging, tracing, dashboards, networking fundamentals, TCP/IP, routing, redundancy, DNS, Kubernetes, Docker, Grafana, Prometheus, ELK, DevOps, SRE, infrastructure engineering, systems engineering</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>xAI</Employername>
      <Employerlogo>https://logos.yubhub.co/xai.com.png</Employerlogo>
      <Employerdescription>xAI creates AI systems to understand the universe and aid humanity in its pursuit of knowledge.</Employerdescription>
      <Employerwebsite>https://x.ai/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/xai/jobs/5044403007?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>Memphis, TN</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>eebf21c4-d1f</externalid>
      <Title>Staff Site Reliability Engineer</Title>
      <Description><![CDATA[<p>Join our Site Reliability Engineering (SRE) team and help ensure the reliability, scalability, and performance of Replit&#39;s infrastructure that serves millions of developers worldwide.</p>
<p>As a Staff Site Reliability Engineer, you will bridge the gap between development and operations, implementing automation and establishing best practices that enable our platform to scale efficiently while maintaining high availability.</p>
<p>We are seeking Staff SREs who are passionate about building and maintaining resilient systems at scale. Your mission will be to proactively find and analyze reliability problems across our stack, then design and implement software and systems to create step-function improvements.</p>
<p>You will design robust observability solutions, lead incident response, automate operational tasks, and continuously improve our infrastructure&#39;s reliability, all while mentoring and educating the broader engineering team to make reliability a core value at Replit.</p>
<p><strong>Responsibilities</strong></p>
<ul>
<li>Architect and Implement Observability: Design, build, and lead the implementation of comprehensive monitoring, logging, and tracing solutions. Create dashboards and metrics that provide real-time visibility into system health and performance, enabling proactive issue detection.</li>
<li>Define and Drive Reliability Standards: Work with product and engineering teams to define, implement, and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs). Build systems to monitor and report on these metrics, holding teams accountable and ensuring we maintain high reliability standards while balancing innovation speed.</li>
<li>Lead Incident Management and Response: Act as a senior leader during high-impact incidents, guiding the team to rapid resolution. Conduct thorough, blameless post-mortems and drive the implementation of preventative measures. Develop and refine runbooks and build automation to reduce Mean Time To Recovery (MTTR).</li>
<li>Drive Automation and Infrastructure as Code: Architect, build, and improve automation to eliminate toil and operational work. Design and maintain CI/CD pipelines and infrastructure automation using tools like Terraform or Pulumi. Create self-healing systems that can automatically respond to common failure scenarios.</li>
<li>Optimize Performance on Kubernetes: Collaborate with core infrastructure and product teams to performance-tune and optimize our large-scale cloud deployments, with a deep focus on Kubernetes, Docker, and GCP. Identify and resolve performance bottlenecks, implement capacity planning strategies, and reduce latency across global regions.</li>
<li>Debug and Harden Distributed Systems: Dive deep into debugging extremely difficult technical problems across the stack. Use your findings to design and implement long-term fixes that make our systems and products more robust, operable, and easier to diagnose.</li>
<li>Provide Staff-Level Guidance: Review feature and system designs from across the company, acting as a key owner for the reliability, scalability, security, and operational integrity of those designs.</li>
<li>Educate and Mentor: Educate, mentor, and hold accountable the broader engineering team to improve the reliability of our systems, making reliability a core value of the Replit engineering culture.</li>
<li>Build and Integrate: Write high-quality, well-tested code in Python or Go to meet the needs of your customers, whether it&#39;s building new internal tools or integrating with third-party vendors.</li>
</ul>
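<p>The SLO/SLI tracking described in the responsibilities above boils down to error-budget arithmetic; a minimal sketch, assuming a simple availability SLO over a request count (the figures are illustrative):</p>

```python
def error_budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget left for an availability SLO."""
    allowed_failures = (1 - slo) * total_requests  # budgeted failures for the window
    return 1 - failed_requests / allowed_failures

# A 99.9% SLO over 1,000,000 requests budgets ~1,000 failures;
# 250 observed failures leaves ~75% of the budget.
print(round(error_budget_remaining(0.999, 1_000_000, 250), 4))  # 0.75
```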
<p><strong>Required Skills and Experience</strong></p>
<ul>
<li>8-10 years of experience in Site Reliability Engineering or similar roles (e.g., DevOps, Systems Engineering, Infrastructure Engineering).</li>
<li>Strong programming skills in languages like Python or Go. You write high-quality, well-tested code.</li>
<li>Deep understanding of distributed systems. You’ve designed, built, scaled, and maintained production services and know how to compose a service-oriented architecture.</li>
<li>Deep experience with container orchestration platforms, specifically Kubernetes, and cloud-native technologies.</li>
<li>Proven track record of designing, implementing, and maintaining sophisticated monitoring and observability solutions (e.g., metrics, logging, tracing).</li>
<li>Strong incident management skills with extensive experience leading incident response for complex systems and demonstrated critical thinking under pressure.</li>
<li>Experience with infrastructure as code (e.g., Terraform, Pulumi) and configuration management tools.</li>
<li>Excellent written and verbal communication skills, with an ability to explain complex technical concepts clearly and simply and a bias toward open, transparent cultural practices.</li>
<li>Strong interpersonal skills, with experience working with and mentoring engineers from junior to principal levels.</li>
<li>A willingness to dive into understanding, debugging, and improving any layer of the stack.</li>
<li>You&#39;re passionate about making software creation accessible and empowering the next generation of builders.</li>
</ul>
<p><strong>Bonus Points</strong></p>
<ul>
<li>Deep experience with Google Cloud Platform (GCP) services and tools.</li>
<li>Expert-level knowledge of modern observability platforms (e.g., Prometheus, Grafana, Datadog, OpenTelemetry).</li>
<li>Experience designing and building reliable systems capable of handling high throughput and low latency.</li>
<li>Significant experience with Go and Terraform.</li>
<li>Familiarity with working in rapid-growth, startup environments.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>remote</Workarrangement>
      <Salaryrange>$220K - $325K</Salaryrange>
      <Skills>Site Reliability Engineering, DevOps, Systems Engineering, Infrastructure Engineering, Python, Go, Distributed Systems, Container Orchestration, Kubernetes, Cloud-Native Technologies, Monitoring and Observability, Incident Management, Infrastructure as Code, Terraform, Pulumi, Configuration Management, Google Cloud Platform, Prometheus, Grafana, Datadog, OpenTelemetry</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Replit</Employername>
      <Employerlogo>https://logos.yubhub.co/replit.com.png</Employerlogo>
      <Employerdescription>Replit is an agentic software creation platform that enables anyone to build applications using natural language, with millions of users worldwide.</Employerdescription>
      <Employerwebsite>https://replit.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.ashbyhq.com/replit/d50ad15b-82d4-452f-b4ea-2a7f5e796170?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>Remote (United States)</Location>
      <Country></Country>
      <Postedate>2026-03-08</Postedate>
    </job>
    <job>
      <externalid>da726093-b19</externalid>
      <Title>Research Engineer, Discovery</Title>
      <Description><![CDATA[<p><strong>About the Role</strong></p>
<p>As a Research Engineer on our team, you will work end to end across the whole model stack, identifying and addressing key infra blockers on the path to scientific AGI. Strong candidates should have familiarity with elements of language model training, evaluation, and inference, and an eagerness to dive in quickly and get up to speed in areas where they are not yet experts. This may include performance optimization, distributed systems, VM/sandboxing/container deployment, and large scale data pipelines.</p>
<p><strong>Responsibilities:</strong></p>
<ul>
<li>Design and implement large-scale infrastructure systems to support AI scientist training, evaluation, and deployment across distributed environments</li>
<li>Identify and resolve infrastructure bottlenecks impeding progress toward scientific capabilities</li>
<li>Develop robust and reliable evaluation frameworks for measuring progress towards scientific AGI.</li>
<li>Build scalable and performant VM/sandboxing/container architectures to safely execute long-horizon AI tasks and scientific workflows</li>
<li>Collaborate to translate experimental requirements into production-ready infrastructure</li>
<li>Develop large scale data pipelines to handle advanced language model training requirements</li>
<li>Optimize large scale training and inference pipelines for stable and efficient reinforcement learning</li>
</ul>
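<p>The large-scale data pipeline work listed above typically follows a shard-and-process pattern; a minimal stdlib sketch of that pattern, where the per-shard worker is a hypothetical stand-in for real preprocessing (tokenization, filtering, and so on):</p>

```python
from concurrent.futures import ThreadPoolExecutor

def chunks(items, size):
    """Split a work list into fixed-size shards."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def process_shard(shard):
    # Stand-in for real per-shard work (tokenization, filtering, etc.).
    return sum(shard)

def run_pipeline(items, shard_size=3, workers=4):
    """Fan shards out to a worker pool and combine the per-shard results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_shard, chunks(items, shard_size)))

print(run_pipeline(list(range(10))))  # 45
```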
<p><strong>You may be a good fit if you:</strong></p>
<ul>
<li>Have 6+ years of highly-relevant experience in infrastructure engineering with demonstrated expertise in large-scale distributed systems</li>
<li>Are a strong communicator and enjoy working collaboratively</li>
<li>Possess deep knowledge of performance optimization techniques and system architectures for high-throughput ML workloads</li>
<li>Have experience with containerization technologies (Docker, Kubernetes) and orchestration at scale</li>
<li>Have proven track record of building large-scale data pipelines and distributed storage systems</li>
<li>Excel at diagnosing and resolving complex infrastructure challenges in production environments</li>
<li>Can work effectively across the full ML stack from data pipelines to performance optimization</li>
<li>Have experience collaborating with other researchers to scale experimental ideas</li>
<li>Thrive in fast-paced environments and can rapidly iterate from experimentation to production</li>
</ul>
<p><strong>Strong candidates may also have:</strong></p>
<ul>
<li>Experience with language model training infrastructure and distributed ML frameworks (PyTorch, JAX, etc.)</li>
<li>Background in building infrastructure for AI research labs or large-scale ML organizations</li>
<li>Knowledge of GPU/TPU architectures and language model inference optimization</li>
<li>Experience with cloud platforms (AWS, GCP) at enterprise scale</li>
<li>Familiarity with VM and container orchestration.</li>
<li>Experience with workflow orchestration tools and experiment management systems</li>
<li>History working with large scale reinforcement learning</li>
<li>Comfort with large scale data pipelines (Beam, Spark, Dask, …)</li>
</ul>
<p><strong>Logistics</strong></p>
<ul>
<li>Education requirements: We require at least a Bachelor&#39;s degree in a related field or equivalent experience.</li>
<li>Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</li>
<li>Visa sponsorship: We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</li>
</ul>
<p><strong>We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you&#39;re interested in this work.</strong></p>
<p><strong>Your safety matters to us. To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses. In some cases, we may partner with vetted recruiting agencies who will identify themselves as working on behalf of Anthropic. Be cautious of emails from other domains. Legitimate Anthropic recruiters will never ask for money, fees, or banking information before your first day. If you&#39;re ever unsure about a communication, don&#39;t click any links—visit anthropic.com/careers directly for confirmed position openings.</strong></p>
<p><strong>How we&#39;re different</strong></p>
<p>We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale projects, and we&#39;re committed to making a positive impact on the world.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$350,000 - $850,000 USD</Salaryrange>
      <Skills>infrastructure engineering, large-scale distributed systems, performance optimization, containerization technologies, orchestration at scale, data pipelines, distributed storage systems, complex infrastructure challenges, ML stack, workflow orchestration tools, experiment management systems, reinforcement learning, large scale data pipelines, language model training infrastructure, distributed ML frameworks, GPU/TPU architectures, language model inference optimization, cloud platforms, VM and container orchestration</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic is a company that aims to create reliable, interpretable, and steerable AI systems. It has a team of researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</Employerdescription>
      <Employerwebsite>https://www.anthropic.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/4669581008?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>San Francisco, CA</Location>
      <Country></Country>
      <Postedate>2026-03-08</Postedate>
    </job>
    <job>
      <externalid>8c164f95-f8d</externalid>
      <Title>Senior Infrastructure Engineer</Title>
      <Description><![CDATA[<p>Join our Infrastructure Engineering team and help ensure the reliability, scalability, and performance of Replit&#39;s infrastructure that serves millions of developers worldwide. As a Senior Infrastructure Engineer, you will bridge the gap between development and operations, implementing automation and establishing best practices that enable our platform to scale efficiently while maintaining high availability.</p>
<p>We are seeking Senior Infrastructure Engineers who are passionate about building and maintaining resilient systems at scale. Your mission will be to proactively find and analyse reliability problems across our stack, then design and implement software and systems to address them. You will build robust monitoring solutions, automate operational tasks, and continuously improve our infrastructure&#39;s reliability.</p>
<p><strong>You Will:</strong></p>
<ul>
<li>Drive Automation and Infrastructure as Code: Build and improve automation to eliminate toil and operational work. Maintain CI/CD pipelines and infrastructure automation using tools like Terraform or Pulumi. Create self-healing systems that can automatically respond to common failure scenarios.</li>
<li>Optimise Performance and Infrastructure: Collaborate with core infrastructure and product teams to performance tune and optimise our cloud deployments (Kubernetes, Docker, GCP). Identify and resolve performance bottlenecks and implement capacity planning strategies.</li>
<li>Elevate Developer Experience: Design and implement improvements to our build, test, and deployment systems to make software delivery faster, safer, and more reliable for all engineers.</li>
<li>Drive Cross-Team Improvements: Partner with service owners across Replit to understand their pain points, and collaborate on implementing build/test/deploy enhancements within their specific services.</li>
<li>Build Shared Tooling: Create and maintain centralized tooling and automation that improves the engineering lifecycle, from local development to production monitoring.</li>
<li>Debug and Harden Systems: Dive deep into debugging difficult technical problems, making our systems and products more robust, operable, and easier to diagnose.</li>
<li>Collaborate on Design Reviews: Participate in feature and system design reviews, contributing expertise on security, scale, and operational considerations.</li>
<li>Build and Integrate: Write high-quality, well-tested code to meet the needs of your customers, including building pipelines to integrate with 3rd party vendors.</li>
</ul>
<p><strong>Required Skills and Experience:</strong></p>
<ul>
<li>4+ years of experience in Site Reliability Engineering or similar roles (DevOps, Systems Engineering, Infrastructure Engineering).</li>
<li>Strong programming skills in languages like Python or Go.</li>
<li>You write high-quality, well-tested code.</li>
<li>Solid understanding of distributed systems. You&#39;ve built, scaled, and maintained production services and understand service-oriented architecture.</li>
<li>Experience with container orchestration platforms (Kubernetes) and cloud-native technologies.</li>
<li>Experience implementing and maintaining monitoring/observability solutions, with strong skills in debugging and performance tuning.</li>
<li>Strong incident management skills with experience participating in incident response and demonstrated critical thinking under pressure.</li>
<li>Experience with infrastructure as code (e.g., Terraform) and configuration management tools.</li>
<li>Excellent written and verbal communication skills, with an ability to explain technical concepts clearly.</li>
<li>A willingness to dive into understanding, debugging, and improving any layer of the stack.</li>
<li>You&#39;re passionate about making software creation accessible and empowering the next generation of builders.</li>
</ul>
<p><strong>Bonus Points:</strong></p>
<ul>
<li>Experience with Google Cloud Platform (GCP) services and tools.</li>
<li>Knowledge of modern observability platforms (Prometheus, Grafana, Datadog, etc.).</li>
<li>Experience building reliable systems capable of handling high throughput and low latency.</li>
<li>Experience with Go and Terraform.</li>
<li>Familiarity with working in rapid-growth environments.</li>
</ul>
<p><em>This is a full-time role that can be held from our Foster City, CA office. The role has an in-office requirement of Monday, Wednesday, and Friday.</em></p>
<p><strong>Full-Time Employee Benefits Include:</strong></p>
<ul>
<li>Competitive Salary &amp; Equity</li>
<li>401(k) Program with a 4% match</li>
<li>Health, Dental, Vision and Life Insurance</li>
<li>Short Term and Long Term Disability</li>
<li>Paid Parental, Medical, Caregiver Leave</li>
<li>Commuter Benefits</li>
<li>Monthly Wellness Stipend</li>
<li>Autonomous Work Environment</li>
<li>In Office Set-Up Reimbursement</li>
<li>Flexible Time Off (FTO) + Holidays</li>
<li>Quarterly Team Gatherings</li>
<li>In Office Amenities</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$190K - $240K</Salaryrange>
      <Skills>Site Reliability Engineering, DevOps, Systems Engineering, Infrastructure Engineering, Python, Go, Terraform, Kubernetes, Docker, GCP, Monitoring/observability solutions, Debugging and performance tuning, Incident management, Infrastructure as code, Configuration management tools, Prometheus, Grafana, Datadog, High-throughput and low-latency systems, Rapid-growth environments</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Replit</Employername>
      <Employerlogo>https://logos.yubhub.co/replit.com.png</Employerlogo>
      <Employerdescription>Replit is a software creation platform that enables anyone to build applications using natural language. With millions of users worldwide, Replit is a leading platform in the software development industry.</Employerdescription>
      <Employerwebsite>https://replit.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.ashbyhq.com/replit/16c85abc-763c-4f36-ab67-64f416343384?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>Foster City, CA</Location>
      <Country></Country>
      <Postedate>2026-03-07</Postedate>
    </job>
    <job>
      <externalid>b7de618e-5e1</externalid>
      <Title>Site Reliability Engineer</Title>
      <Description><![CDATA[<p>Join our Site Reliability Engineering team and help ensure the reliability, scalability, and performance of Replit&#39;s infrastructure that serves millions of developers worldwide. As a Site Reliability Engineer, you will bridge the gap between development and operations, implementing automation and establishing best practices that enable our platform to scale efficiently while maintaining high availability.</p>
<p>We are seeking SREs who are passionate about building and maintaining resilient systems at scale. Your mission will be to design and implement robust monitoring solutions, automate operational tasks, and continuously improve our infrastructure&#39;s reliability and performance.</p>
<p><strong>Responsibilities</strong></p>
<ul>
<li>Design and Implement Observability Solutions: Develop comprehensive monitoring and alerting systems using modern observability tools. Create dashboards and metrics that provide real-time visibility into system health and performance. Implement logging strategies that enable quick problem identification and resolution.</li>
<li>Drive Automation and Infrastructure as Code: Architect and implement infrastructure automation solutions using tools like Terraform, Ansible, or Pulumi. Design and maintain CI/CD pipelines that enable reliable and consistent deployments. Create self-healing systems that can automatically respond to common failure scenarios.</li>
<li>Establish SLOs and SLIs: Work with product and engineering teams to define and implement Service Level Objectives (SLOs) and Service Level Indicators (SLIs). Build systems to track and report on these metrics, ensuring we maintain high reliability standards while balancing innovation speed.</li>
<li>Incident Management and Response: Lead incident response efforts, conduct thorough post-mortems, and implement improvements to prevent future occurrences. Develop and maintain runbooks for critical services. Build tools and processes that reduce Mean Time To Recovery (MTTR).</li>
<li>Performance Optimization: Identify and resolve performance bottlenecks across our infrastructure. Implement capacity planning strategies and optimize resource utilization. Work on reducing latency and improving system efficiency across global regions.</li>
</ul>
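<p>Latency work like the performance optimization described above usually starts from percentile measurements; a minimal nearest-rank sketch over hypothetical latency samples (one of several common percentile definitions):</p>

```python
def percentile(samples, p):
    """Nearest-rank percentile, a common way to report tail latency (e.g. p99)."""
    ordered = sorted(samples)
    rank = max(0, round(p / 100 * len(ordered)) - 1)  # nearest-rank index
    return ordered[rank]

# Hypothetical request latencies in milliseconds; one slow outlier.
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 12, 18, 17]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 99))  # 14 240
```

The median stays small while the p99 surfaces the outlier, which is why tail percentiles rather than averages drive latency SLOs.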
<p><strong>Requirements</strong></p>
<ul>
<li>4-8 years of experience in Site Reliability Engineering or similar roles (DevOps, Systems Engineering, Infrastructure Engineering)</li>
<li>Strong programming skills in languages commonly used for automation (Python, Go, or similar)</li>
<li>Deep understanding of distributed systems</li>
<li>Experience with container orchestration platforms (Kubernetes) and cloud-native technologies</li>
<li>Proven track record of implementing and maintaining monitoring/observability solutions</li>
<li>Strong incident management skills with experience leading incident response</li>
<li>Experience with infrastructure as code and configuration management tools</li>
</ul>
<p><strong>Bonus Points</strong></p>
<ul>
<li>Experience with Google Cloud Platform (GCP) services and tools</li>
<li>Knowledge of modern observability platforms (Prometheus, Grafana, Datadog, etc.)</li>
</ul>
<p><strong>What We Value</strong></p>
<ul>
<li>Problem-solving mindset: Ability to approach complex operational challenges systematically and devise effective solutions</li>
<li>Self-directed and autonomous: Capable of working independently while collaborating effectively with cross-functional teams</li>
<li>Strong communication skills: Ability to explain complex technical concepts to both technical and non-technical audiences</li>
<li>Continuous learning: Passion for staying current with industry best practices and new technologies</li>
<li>Focus on automation: Strong belief in automating repetitive tasks and building self-healing systems</li>
</ul>
<p><strong>Full-Time Employee Benefits Include</strong></p>
<ul>
<li>Competitive Salary &amp; Equity</li>
<li>401(k) Program with a 4% match</li>
<li>Health, Dental, Vision and Life Insurance</li>
<li>Short Term and Long Term Disability</li>
<li>Paid Parental, Medical, Caregiver Leave</li>
<li>Commuter Benefits</li>
<li>Monthly Wellness Stipend</li>
<li>Autonomous Work Environment</li>
<li>In Office Set-Up Reimbursement</li>
<li>Flexible Time Off (FTO) + Holidays</li>
<li>Quarterly Team Gatherings</li>
<li>In Office Amenities</li>
</ul>
<p><strong>Want to Learn More About What We Are Up To?</strong></p>
<ul>
<li>Meet the Replit Agent</li>
<li>Replit: Make an app for that</li>
<li>Replit Blog</li>
<li>Amjad TED Talk</li>
</ul>
<p><strong>Interviewing + Culture at Replit</strong></p>
<ul>
<li>Operating Principles</li>
<li>Reasons not to work at Replit</li>
</ul>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>remote</Workarrangement>
<Salaryrange>$160K – $250K</Salaryrange>
      <Skills>Site Reliability Engineering, DevOps, Systems Engineering, Infrastructure Engineering, Python, Go, Distributed systems, Container orchestration platforms, Cloud-native technologies, Monitoring/observability solutions, Incident management, Infrastructure as code, Configuration management tools, Google Cloud Platform, Prometheus, Grafana, Datadog</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Replit</Employername>
      <Employerlogo>https://logos.yubhub.co/replit.com.png</Employerlogo>
      <Employerdescription>Replit is a software creation platform that enables anyone to build applications using natural language. With millions of users worldwide, Replit is a leading provider of software development tools.</Employerdescription>
      <Employerwebsite>https://jobs.ashbyhq.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.ashbyhq.com/replit/f6e6158e-eb89-4008-81ea-1b7512bc509d?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>United States</Location>
      <Country></Country>
      <Postedate>2026-03-07</Postedate>
    </job>
    <job>
      <externalid>323bc85d-b69</externalid>
      <Title>Staff Infrastructure Engineer</Title>
      <Description><![CDATA[<p><strong>About the Role:</strong></p>
<p>Join our Infrastructure Engineering team and help ensure the reliability, scalability, and performance of Replit&#39;s infrastructure that serves millions of developers worldwide. As a Staff Infrastructure Engineer, you will bridge the gap between development and operations, implementing automation and establishing best practices that enable our platform to scale efficiently while maintaining high availability.</p>
<p><strong>Responsibilities:</strong></p>
<ul>
<li>Drive Automation and Infrastructure as Code: Architect, build, and improve automation to eliminate toil and operational work. Design and maintain CI/CD pipelines and infrastructure automation using tools like Terraform or Pulumi. Create self-healing systems that can automatically respond to common failure scenarios.</li>
<li>Optimize Performance and Infrastructure: Collaborate with core infrastructure and product teams to performance-tune and optimize our cloud deployments (Kubernetes, Docker, GCP). Identify and resolve performance bottlenecks, implement capacity planning strategies, and reduce latency across global regions.</li>
<li>Elevate Developer Experience: Design and implement improvements to our build, test, and deployment systems to make software delivery faster, safer, and more reliable for all engineers.</li>
<li>Drive Cross-Company Improvements: Partner directly with service owners across Replit to understand their pain points, and collaborate on implementing build/test/deploy enhancements within their specific services.</li>
<li>Build Shared Tooling: Create and maintain centralized tooling and automation that improves the entire engineering lifecycle, from local development to production monitoring.</li>
<li>Debug and Harden Systems: Dive deep into debugging extremely difficult technical problems, making our systems and products more robust, operable, and easier to diagnose.</li>
<li>Provide Staff-Level Guidance: Review feature and system designs, acting as an owner for the security, scale, and operational integrity of those designs.</li>
<li>Educate and Mentor: Educate, mentor, and hold accountable the engineering team to improve the reliability of our systems, making reliability a core value of the Replit engineering culture.</li>
<li>Build and Integrate: Write high-quality, well-tested code to meet the needs of your customers, including building pipelines to integrate with 3rd party vendors.</li>
</ul>
<p><strong>Required Skills and Experience:</strong></p>
<ul>
<li>8-10 years of experience in Infrastructure Engineering or similar roles (DevOps, Systems Engineering, Site Reliability Engineering).</li>
<li>Strong programming skills in languages like Python or Go.</li>
<li>You write high-quality, well-tested code.</li>
<li>Deep understanding of distributed systems. You&#39;ve designed, built, scaled, and maintained production services and know how to compose a service-oriented architecture.</li>
<li>Experience with container orchestration platforms (Kubernetes) and cloud-native technologies.</li>
<li>Proven track record of implementing and maintaining monitoring/observability solutions, with strong skills in debugging and performance tuning.</li>
<li>Strong incident management skills with experience leading incident response and demonstrated critical thinking under pressure.</li>
<li>Experience with infrastructure as code (e.g., Terraform) and configuration management tools.</li>
<li>Excellent written and verbal communication skills, with an ability to explain technical concepts clearly and simply and a bias toward open, transparent cultural practices.</li>
<li>Strong interpersonal skills, with experience working with engineers from junior to principal levels.</li>
<li>A willingness to dive into understanding, debugging, and improving any layer of the stack.</li>
<li>You&#39;re passionate about making software creation accessible and empowering the next generation of builders.</li>
</ul>
<p><strong>Bonus Points:</strong></p>
<ul>
<li>Deep experience with Google Cloud Platform (GCP) services and tools.</li>
<li>Knowledge of modern observability platforms (Prometheus, Grafana, Datadog, etc.).</li>
<li>Experience designing and building reliable systems capable of handling high throughput and low latency.</li>
<li>Experience with Go and Terraform.</li>
<li>Familiarity with working in rapid-growth environments.</li>
<li>Experience writing company-facing blog posts and training materials.</li>
</ul>
<p><strong>Full-Time Employee Benefits Include:</strong></p>
<ul>
<li>Competitive Salary &amp; Equity</li>
<li>401(k) Program with a 4% match</li>
<li>Health, Dental, Vision and Life Insurance</li>
<li>Short Term and Long Term Disability</li>
<li>Paid Parental, Medical, Caregiver Leave</li>
<li>Commuter Benefits</li>
<li>Monthly Wellness Stipend</li>
<li>Autonomous Work Environment</li>
<li>In Office Set-Up Reimbursement</li>
<li>Flexible Time Off (FTO) + Holidays</li>
<li>Quarterly Team Gatherings</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$220K – $325K</Salaryrange>
<Skills>Infrastructure Engineering, DevOps, Systems Engineering, Site Reliability Engineering, Python, Go, Distributed systems, Container orchestration platforms, Cloud-native technologies, Monitoring/observability solutions, Infrastructure as code, Configuration management tools, Google Cloud Platform, Prometheus, Grafana, Datadog, Terraform, Rapid-growth environments, Company-facing blog posts, Training materials</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Replit</Employername>
      <Employerlogo>https://logos.yubhub.co/replit.com.png</Employerlogo>
      <Employerdescription>Replit is a software creation platform that enables anyone to build applications using natural language. With millions of users worldwide, Replit is democratizing software development by removing traditional barriers to application creation.</Employerdescription>
      <Employerwebsite>https://jobs.ashbyhq.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.ashbyhq.com/replit/6481ec1e-527c-4c1f-a041-2fb5021e7bd5?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>Foster City, CA</Location>
      <Country></Country>
      <Postedate>2026-03-07</Postedate>
    </job>
    <job>
      <externalid>4b563c21-dd0</externalid>
      <Title>Software Engineer, Data Infrastructure</Title>
      <Description><![CDATA[<p><strong>Software Engineer, Data Infrastructure</strong></p>
<p><strong>Location</strong></p>
<p>San Francisco</p>
<p><strong>Employment Type</strong></p>
<p>Full time</p>
<p><strong>Department</strong></p>
<p>Applied AI</p>
<p><strong>Compensation</strong></p>
<ul>
<li>$185K – $385K • Offers Equity</li>
</ul>
<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>
<p><strong>Benefits</strong></p>
<ul>
<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>
<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>
<li>401(k) retirement plan with employer match</li>
<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>
<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>
<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>
<li>Mental health and wellness support</li>
<li>Employer-paid basic life and disability coverage</li>
<li>Annual learning and development stipend to fuel your professional growth</li>
<li>Daily meals in our offices, and meal delivery credits as eligible</li>
<li>Relocation support for eligible employees</li>
<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>
</ul>
<p><strong>About the Team</strong></p>
<p>Data Platform at OpenAI owns the foundational data stack powering critical product, research, and analytics workflows. We operate some of the largest Spark compute fleets in production; design and build data lakes and metadata systems on Iceberg and Delta with a vision toward exabyte-scale architecture; run high-throughput streaming platforms on Kafka and Flink; provide orchestration with Airflow; and support ML feature engineering tooling such as Chronon. Our mission is to deliver reliable, secure, and efficient data access at scale and to accelerate intelligent, AI-assisted data workflows.</p>
<p><strong>About the Role</strong></p>
<p>This role focuses on building and operating data infrastructure that supports massive compute fleets and storage systems, designed for high performance and scalability. You’ll help design, build, and operate the next generation of data infrastructure at OpenAI. You will scale and harden big data compute and storage platforms, build and support high-throughput streaming systems, build and operate low-latency data ingestion, enable secure and governed data access for ML and analytics, and design for reliability and performance at extreme scale.</p>
<p>You will take full lifecycle ownership: architecture, implementation, production operations, and on-call participation.</p>
<p><strong>Responsibilities</strong></p>
<ul>
<li>Design, build, and maintain data infrastructure systems such as distributed compute, data orchestration, distributed storage, streaming infrastructure, and machine learning infrastructure, while ensuring scalability, reliability, and security</li>
<li>Ensure our data platform can scale by orders of magnitude while remaining reliable and efficient</li>
<li>Accelerate company productivity by empowering your fellow engineers &amp; teammates with excellent data tooling and systems</li>
<li>Collaborate with product, research, and analytics teams to build the technical foundations and capabilities that unlock new features and experiences</li>
<li>Own the reliability of the systems you build, including participation in an on-call rotation for critical incidents</li>
</ul>
<p><strong>Requirements</strong></p>
<ul>
<li>4+ years in data infrastructure engineering, OR</li>
<li>4+ years in infrastructure engineering with a strong interest in data</li>
<li>Take pride in building and operating scalable, reliable, secure systems</li>
<li>Are comfortable with ambiguity and rapid change</li>
<li>Have an intrinsic desire to learn and fill in missing skills, and an equally strong talent for sharing learnings clearly and concisely with others</li>
</ul>
<p><strong>About OpenAI</strong></p>
<p>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of human diversity.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>mid</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$185K – $385K • Offers Equity</Salaryrange>
      <Skills>data infrastructure engineering, infrastructure engineering, Spark, Kafka, Flink, Airflow, Chronon, Iceberg, Delta, Terraform, distributed systems, machine learning, data science, cloud computing, containerization, DevOps</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>OpenAI</Employername>
      <Employerlogo>https://logos.yubhub.co/openai.com.png</Employerlogo>
      <Employerdescription>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products.</Employerdescription>
      <Employerwebsite>https://jobs.ashbyhq.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.ashbyhq.com/openai/f763c6b3-5167-4a67-b691-4c3fa2c44156?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>San Francisco</Location>
      <Country></Country>
      <Postedate>2026-03-06</Postedate>
    </job>
    <job>
      <externalid>af70e58f-a16</externalid>
      <Title>Technical Program Manager - Compute</Title>
      <Description><![CDATA[<p><strong>Summary</strong></p>
<p>Microsoft AI is looking for a talented Technical Program Manager - Compute at its Redmond office. This role sits at the heart of strategic decision-making, turning compute and capacity data into actionable insights for a team pushing the frontier of AI technology. You&#39;ll work directly with leadership to shape how the company forecasts and allocates its compute resources.</p>
<p><strong>About the Role</strong></p>
<p>As a Technical Program Manager - Compute, you will drive projects and programs related to compute infrastructure, including forecasting and allocating resource needs such as compute, storage, and network. You will collaborate with product teams, engineers, researchers, and external partners to identify gaps and drive timelines toward resolution and mitigation. You will leverage data and analytics to define metrics, set baselines and targets for fleet efficiency, and drive optimization. You will advocate for the AI team&#39;s resource needs with executive and working-level partners across Microsoft, and foster a culture of collaboration, continuous improvement, and growth. You will own the status of key compute projects, proactively identifying risks and proposing solutions to ensure timely delivery, and communicate program strategies, progress, and results to executive leadership and key stakeholders, advocating for quality and efficiency within the team.</p>
<p><strong>Accountabilities</strong></p>
<ul>
<li>Drive projects and programs related to compute infrastructure, including forecasting and allocating resource needs such as compute, storage, and network.</li>
<li>Collaborate with product teams, engineers, researchers, and external partners to identify gaps and drive timelines toward resolution and mitigation.</li>
</ul>
<p><strong>The Candidate we&#39;re looking for</strong></p>
<p><strong>Experience:</strong></p>
<ul>
<li>6+ years&#39; experience in technical program management, infrastructure engineering, AI/ML, or product development OR equivalent experience.</li>
</ul>
<p><strong>Technical skills:</strong></p>
<ul>
<li>Strong technical curiosity and judgment.</li>
</ul>
<p><strong>Personal attributes:</strong></p>
<ul>
<li>Proactive attitude and enthusiasm for exploring new methods and technologies in compute and infrastructure.</li>
</ul>
<p><strong>Benefits</strong></p>
<ul>
<li>Competitive salary.</li>
<li>Comprehensive benefits package.</li>
<li>Opportunities for professional growth and development.</li>
<li>Collaborative and dynamic work environment.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>USD $139,900 - $274,800 per year</Salaryrange>
      <Skills>technical program management, infrastructure engineering, AI/ML, product development, data analytics, cloud computing, containerization</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Microsoft AI</Employername>
      <Employerlogo>https://logos.yubhub.co/microsoft.ai.png</Employerlogo>
      <Employerdescription>Microsoft AI is a leading technology company that is on a mission to train the world&apos;s most capable AI frontier models, pushing the boundaries of scale, performance, and product deployment.</Employerdescription>
      <Employerwebsite>https://microsoft.ai</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://microsoft.ai/job/technical-program-manager-compute-2/?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>Redmond</Location>
      <Country></Country>
      <Postedate>2026-03-06</Postedate>
    </job>
    <job>
      <externalid>d19375cb-532</externalid>
      <Title>Technical Program Manager - Compute</Title>
      <Description><![CDATA[<p><strong>Summary</strong></p>
<p>Microsoft AI is looking for a talented Technical Program Manager to join its team in Mountain View. This role will be responsible for driving projects and programs related to compute infrastructure, including forecasting and allocating resource needs such as compute, storage, and network. The ideal candidate will have experience collaborating with AI researchers, engineers, and infrastructure teams to deliver robust, scalable solutions.</p>
<p><strong>About the Role</strong></p>
<p>As a Technical Program Manager at Microsoft AI, you will be responsible for driving projects and programs related to compute infrastructure. This will include forecasting and allocating resource needs such as compute, storage, and network. You will collaborate with product teams, engineers, researchers, and external partners to identify gaps and drive timelines toward resolution and mitigation. You will leverage data and analytics to define metrics, set baselines and targets for fleet efficiency, and drive optimization. You will also advocate for the AI team&#39;s resource needs with executive and working-level partners across Microsoft.</p>
<p><strong>Accountabilities</strong></p>
<ul>
<li>Drive projects and programs related to compute infrastructure, including forecasting and allocating resource needs such as compute, storage, and network.</li>
<li>Collaborate with product teams, engineers, researchers, and external partners to identify gaps and drive timelines toward resolution and mitigation.</li>
</ul>
<p><strong>The Candidate we&#39;re looking for</strong></p>
<p><strong>Experience:</strong></p>
<ul>
<li>6+ years&#39; experience in technical program management, infrastructure engineering, AI/ML, or product development.</li>
</ul>
<p><strong>Technical skills:</strong></p>
<ul>
<li>Experience with cloud computing platforms, such as Azure or AWS.</li>
<li>Knowledge of containerization and orchestration tools, such as Docker and Kubernetes.</li>
</ul>
<p><strong>Personal attributes:</strong></p>
<ul>
<li>Strong technical curiosity and judgment.</li>
<li>Proactive attitude and enthusiasm for exploring new methods and technologies in compute and infrastructure.</li>
</ul>
<p><strong>Benefits</strong></p>
<ul>
<li>Competitive salary and benefits package.</li>
<li>Opportunities for professional growth and development.</li>
<li>Collaborative and dynamic work environment.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>USD $139,900 – $274,800 per year</Salaryrange>
<Skills>technical program management, infrastructure engineering, AI/ML, product development, cloud computing, containerization, orchestration</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Microsoft AI</Employername>
      <Employerlogo>https://logos.yubhub.co/microsoft.ai.png</Employerlogo>
      <Employerdescription>Microsoft AI is a leading technology company that specializes in artificial intelligence and machine learning. They are known for their innovative products and services that empower individuals and organizations to achieve more. Microsoft AI is committed to pushing the boundaries of AI and making it more accessible to everyone.</Employerdescription>
      <Employerwebsite>https://microsoft.ai</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://microsoft.ai/job/technical-program-manager-compute/?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>Mountain View</Location>
      <Country></Country>
      <Postedate>2026-03-06</Postedate>
    </job>
    <job>
      <externalid>50e40039-fc8</externalid>
      <Title>Technical Program Manager - Compute</Title>
      <Description><![CDATA[<p><strong>Summary</strong></p>
<p>Microsoft AI is looking for a talented Technical Program Manager - Compute at its New York office. This role sits at the heart of strategic decision-making, turning compute and capacity data into actionable insights for a team pushing the frontier of AI technology. You&#39;ll work directly with leadership to shape how the company forecasts and allocates its compute resources.</p>
<p><strong>About the Role</strong></p>
<p>As a Technical Program Manager - Compute, you will drive projects and programs related to compute infrastructure, including forecasting and allocating resource needs such as compute, storage, and network. You will collaborate with product teams, engineers, researchers, and external partners to identify gaps and drive timelines toward resolution and mitigation. You will leverage data and analytics to define metrics, set baselines and targets for fleet efficiency, and drive optimization. You will advocate for the AI team&#39;s resource needs with executive and working-level partners across Microsoft, and foster a culture of collaboration, continuous improvement, and growth. You will own the status of key compute projects, proactively identifying risks and proposing solutions to ensure timely delivery, and communicate program strategies, progress, and results to executive leadership and key stakeholders, advocating for quality and efficiency within the team.</p>
<p><strong>Accountabilities</strong></p>
<ul>
<li>Drive projects and programs related to compute infrastructure, including forecasting and allocating resource needs such as compute, storage, and network.</li>
<li>Collaborate with product teams, engineers, researchers, and external partners to identify gaps and drive timelines toward resolution and mitigation.</li>
</ul>
<p><strong>The Candidate we&#39;re looking for</strong></p>
<p><strong>Experience:</strong></p>
<ul>
<li>6+ years&#39; experience in technical program management, infrastructure engineering, AI/ML, or product development OR equivalent experience.</li>
</ul>
<p><strong>Technical skills:</strong></p>
<ul>
<li>Strong technical curiosity and judgment.</li>
</ul>
<p><strong>Personal attributes:</strong></p>
<ul>
<li>Proactive attitude and enthusiasm for exploring new methods and technologies in compute and infrastructure.</li>
</ul>
<p><strong>Benefits</strong></p>
<ul>
<li>Competitive salary.</li>
<li>Comprehensive benefits package.</li>
<li>Opportunities for professional growth and development.</li>
<li>Collaborative and dynamic work environment.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>USD $139,900 – $274,800 per year</Salaryrange>
      <Skills>technical program management, infrastructure engineering, AI/ML, product development, strong technical curiosity and judgment, proactive attitude and enthusiasm for exploring new methods and technologies in compute and infrastructure</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Microsoft AI</Employername>
      <Employerlogo>https://logos.yubhub.co/microsoft.ai.png</Employerlogo>
      <Employerdescription>Microsoft AI is a leading technology company that is on a mission to train the world&apos;s most capable AI frontier models, pushing the boundaries of scale, performance, and product deployment. They are a startup-like team inside Microsoft AI, created to push the boundaries of AI toward Humanist Superintelligence—ultra-capable systems that remain controllable, safety-aligned, and anchored to human values.</Employerdescription>
      <Employerwebsite>https://microsoft.ai</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://microsoft.ai/job/technical-program-manager-compute-3/?utm_source=yubhub.co&amp;utm_medium=jobs_feed&amp;utm_campaign=apply</Applyto>
      <Location>New York</Location>
      <Country></Country>
      <Postedate>2026-03-06</Postedate>
    </job>
  </jobs>
</source>