<?xml version="1.0" encoding="UTF-8"?>
<source>
  <jobs>
    <job>
      <externalid>9af8d812-df8</externalid>
      <Title>AI Infrastructure Engineer</Title>
      <Description><![CDATA[<p>We&#39;re looking for Senior+ AI Infrastructure Engineers to build the systems that train and serve Intercom&#39;s next generation of AI products.</p>
<p>As a Senior AI Infrastructure Engineer focused on model training and inference, you will:</p>
<ul>
<li>Implement and scale training pipelines for large transformer and LLM models, from data ingestion and preprocessing through distributed training and evaluation.</li>
<li>Build and optimize inference services that deliver low-latency, high-reliability experiences for our customers, including autoscaling, routing, and fallbacks.</li>
<li>Work on GPU-level performance: tuning kernels, improving utilization, and identifying bottlenecks across our training and inference stack.</li>
<li>Collaborate closely with ML scientists to implement cutting-edge training and inference methods and bring them to production.</li>
<li>Play an active role in hiring, mentoring, and developing other engineers on the team.</li>
<li>Raise the bar for technical standards, reliability, and operational excellence across Intercom’s AI platform.</li>
</ul>
<p>We’re looking to hire Senior+ AI Infrastructure Engineers. You’re likely a great fit if:</p>
<ul>
<li>You have 5+ years of experience in software engineering, with a strong track record of shipping high-quality products or platforms.</li>
<li>You hold a degree in Computer Science, Computer Engineering, or a related field (or you have equivalent experience with very strong fundamentals).</li>
<li>You have hands-on experience with one or more of the following:
<ul>
<li>Model training (especially transformers and LLMs).</li>
<li>Model inference at scale (again, especially transformers and LLMs).</li>
<li>Low-level GPU work, such as writing CUDA or Triton kernels.</li>
</ul>
</li>
<li>You are comfortable working in production environments at meaningful scale (traffic, data, or organizational).</li>
<li>You communicate clearly, can explain complex technical topics to different audiences, and enjoy close collaboration with both engineers and non-engineers.</li>
<li>You take pride in strong technical fundamentals, love learning, and are willing to invest in your own development.</li>
<li>You have deep knowledge of at least one programming language (for example Python, Ruby, Java, or Go). Specific language experience is less important than your ability to write clean, reliable code and learn new stacks quickly.</li>
</ul>
<p>We are a well-treated bunch, with awesome benefits! If there’s something important to you that’s not on this list, talk to us!</p>
<ul>
<li>Competitive salary, annual bonus, and equity.</li>
<li>Regular compensation reviews - we reward great work!</li>
<li>Unlimited access to Claude Code and best-in-class AI tools; experimentation &amp; building is encouraged &amp; celebrated.</li>
<li>Generous paid time off above the statutory minimum.</li>
<li>Hybrid working.</li>
<li>MacBooks are our standard, but we also offer Windows for certain roles when needed.</li>
<li>Fun events for employees, friends, and family!</li>
</ul>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>model training, model inference, low-level GPU work, CUDA, Triton, Python, Ruby, Java, Go, experience at AI native companies, running training or inference workloads on Kubernetes, AWS, cloud providers, production experience with Python in ML or infrastructure contexts</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Intercom</Employername>
      <Employerlogo>https://logos.yubhub.co/intercom.com.png</Employerlogo>
      <Employerdescription>Intercom is an AI company that builds customer service solutions. It was founded in 2011 and serves nearly 30,000 global businesses.</Employerdescription>
      <Employerwebsite>https://www.intercom.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/intercom/jobs/7824142</Applyto>
      <Location>Berlin, Germany</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>cba88898-896</externalid>
      <Title>Research Engineer, Infrastructure, Kernels</Title>
      <Description><![CDATA[<p>We&#39;re looking for an infrastructure research engineer to design, optimize, and maintain the compute foundations that power large-scale language model training. You will develop high-performance ML kernels (e.g., CUDA, CuTe, Triton), enable efficient low-precision arithmetic, and improve the distributed compute stack that makes training large models possible.</p>
<p>This role is perfect for an engineer who enjoys working close to the metal and across the research boundary. You&#39;ll collaborate with researchers and systems architects to bridge algorithmic design with hardware efficiency. You&#39;ll prototype new kernel implementations, profile performance across hardware generations, and help define the numerical and parallelism strategies that determine how we scale next-generation AI systems.</p>
<p><strong>Responsibilities</strong></p>
<ul>
<li>Design and implement custom ML kernels (e.g., CUDA, CuTe, Triton) for core LLM operations such as attention, matrix multiplication, gating, and normalization, optimized for modern GPU and accelerator architectures.</li>
<li>Design and think through compute primitives to reduce memory bandwidth bottlenecks and improve kernel compute efficiency.</li>
<li>Collaborate with research teams to align kernel-level optimizations with model architecture and algorithmic goals.</li>
<li>Develop and maintain a library of reusable kernels and performance benchmarks that serve as the foundation for internal model training.</li>
<li>Contribute to infrastructure stability and scalability, ensuring reproducibility, consistency across precision formats, and high utilization of compute resources.</li>
<li>Document and share insights through internal talks, technical papers, or open-source contributions to strengthen the broader ML systems community.</li>
</ul>
<p><strong>Skills and Qualifications</strong></p>
<p>Minimum qualifications:</p>
<ul>
<li>Bachelor’s degree or equivalent experience in computer science, electrical engineering, statistics, machine learning, physics, robotics, or similar.</li>
<li>Strong engineering skills, with the ability to contribute performant, maintainable code and debug in complex codebases.</li>
<li>Understanding of deep learning frameworks (e.g., PyTorch, JAX) and their underlying system architectures.</li>
<li>Thrive in a highly collaborative environment involving many different cross-functional partners and subject matter experts.</li>
<li>A bias for action: you take the initiative to work across different stacks and teams when you spot an opportunity to make sure something ships.</li>
<li>Proficiency in CUDA, CuTe, Triton, or other GPU programming frameworks.</li>
<li>Demonstrated ability to analyze, profile, and optimize compute-intensive workloads.</li>
</ul>
<p>Preferred qualifications:</p>
<ul>
<li>Experience training or supporting large-scale language models with tens of billions of parameters or more.</li>
<li>Track record of improving research productivity through infrastructure design or process improvements.</li>
<li>Experience developing or tuning kernels for deep learning frameworks such as PyTorch, JAX, or custom accelerators.</li>
<li>Familiarity with tensor parallelism, pipeline parallelism, or distributed data processing frameworks.</li>
<li>Experience implementing low-precision formats (FP8, INT8, block floating point) or contributing to related compiler stacks (e.g., XLA, TVM).</li>
<li>Contributions to open-source GPU, ML systems, or compiler optimization projects.</li>
<li>Prior research or engineering experience in numerical optimization, communication-efficient training, or scalable AI infrastructure.</li>
</ul>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>$350,000 - $475,000 USD</Salaryrange>
      <Skills>CUDA, CuTe, Triton, GPU programming frameworks, Deep learning frameworks (e.g., PyTorch, JAX), Computer science, Electrical engineering, Statistics, Machine learning, Physics, Robotics, Experience training or supporting large-scale language models with tens of billions of parameters or more, Track record of improving research productivity through infrastructure design or process improvements, Experience developing or tuning kernels for deep learning frameworks such as PyTorch, JAX, or custom accelerators, Familiarity with tensor parallelism, pipeline parallelism, or distributed data processing frameworks, Experience implementing low-precision formats (FP8, INT8, block floating point) or contributing to related compiler stacks (e.g., XLA, TVM), Contributions to open-source GPU, ML systems, or compiler optimization projects, Prior research or engineering experience in numerical optimization, communication-efficient training, or scalable AI infrastructure</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Thinking Machines Lab</Employername>
      <Employerlogo>https://logos.yubhub.co/thinkingmachines.ai.png</Employerlogo>
      <Employerdescription>Thinking Machines Lab is an AI research and product company whose team previously created widely used AI products, including ChatGPT and Character.ai, and open-source projects like PyTorch.</Employerdescription>
      <Employerwebsite>https://thinkingmachines.ai/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/thinkingmachines/jobs/5013934008</Applyto>
      <Location>San Francisco</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>c9ab5cbc-dd6</externalid>
      <Title>Research Engineer, Performance RL</Title>
      <Description><![CDATA[<p>We&#39;re hiring a Research Engineer to join our Code RL team within the RL organization. As a Research Engineer, you&#39;ll advance our models&#39; ability to safely write correct, fast code for accelerators.</p>
<p>You&#39;ll need to know accelerator performance well to turn it into tasks and signals models can learn from. Specifically, you will:</p>
<ul>
<li>Invent, design and implement RL environments and evaluations.</li>
<li>Conduct experiments and shape our research roadmap.</li>
<li>Deliver your work into training runs.</li>
<li>Collaborate with other researchers, engineers, and performance engineering specialists across and outside Anthropic.</li>
</ul>
<p>We&#39;re looking for someone with expertise in accelerators (CUDA, ROCm, Triton, Pallas) and ML framework programming (JAX or PyTorch), and experience balancing research exploration with engineering implementation.</p>
<p>Strong candidates may also have experience with reinforcement learning, experience porting ML workloads between different types of accelerators, and familiarity with LLM training methodologies.</p>
<p>The annual compensation range for this role is $350,000-$850,000 USD.</p>
<p>Please note that we&#39;re an extremely collaborative group, and we value communication skills. The easiest way to understand our research directions is to read our recent research.</p>
<p>We offer competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a lovely office space in which to collaborate with colleagues.</p>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$350,000-$850,000 USD</Salaryrange>
      <Skills>accelerator performance, ML framework programming, reinforcement learning, RL environments and evaluations, experiments and research roadmap, training runs, collaboration with researchers and engineers, CUDA, ROCm, Triton, Pallas, JAX, PyTorch, LLM training methodologies</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic is a public benefit corporation that focuses on creating reliable, interpretable, and steerable AI systems.</Employerdescription>
      <Employerwebsite>https://www.anthropic.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/5160330008</Applyto>
      <Location>San Francisco, CA</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>2bc6ae79-8ee</externalid>
      <Title>Staff Technical Lead for Inference &amp; ML Performance</Title>
      <Description><![CDATA[<p>We&#39;re looking for a Staff Technical Lead for Inference &amp; ML Performance to guide a team in building and optimizing state-of-the-art inference systems. This role is intense yet deeply impactful.</p>
<p>You&#39;ll shape the future of fal&#39;s inference engine and ensure our generative models achieve best-in-class performance. Your work directly impacts our ability to rapidly deliver cutting-edge creative solutions to users, from individual creators to global brands.</p>
<p>Day-to-day, you&#39;ll set technical direction, guide your team to build high-performance inference solutions, and personally contribute to critical inference performance enhancements and optimizations. You&#39;ll collaborate closely with research &amp; applied ML teams, influence model inference strategies and deployment techniques, and drive advanced performance optimizations.</p>
<p>As a leader, you&#39;ll mentor, coach, and scale your team of performance-focused engineers, and help them innovate, solve complex performance challenges, and level up their skills.</p>
<p>To succeed in this role, you&#39;ll need to be deeply experienced in ML performance optimization, understand the full ML performance stack, and know inference inside-out. You&#39;ll also need to thrive in cross-functional collaboration and have excellent leadership skills.</p>
<p>If you&#39;re ready to lead the future of inference performance at a fast-paced, high-growth frontier, apply now!</p>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>ML performance optimization, PyTorch, TensorRT, TransformerEngine, Triton, CUTLASS kernels, Quantization, Kernel authoring, Compilation, Model parallelism, Distributed serving, Profiling</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>fal</Employername>
      <Employerlogo>https://logos.yubhub.co/fal.com.png</Employerlogo>
      <Employerdescription>fal is a fast-growing company pioneering the next generation of generative-media infrastructure.</Employerdescription>
      <Employerwebsite>https://fal.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/fal/jobs/4012780009</Applyto>
      <Location>San Francisco</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>1507524b-770</externalid>
      <Title>Research Engineer, Performance RL</Title>
      <Description><![CDATA[<p>We&#39;re hiring a Research Engineer to join our Code RL team within the RL organization. As a Research Engineer, you&#39;ll advance our models&#39; ability to safely write correct, fast code for accelerators.</p>
<p>You&#39;ll need to know accelerator performance well to turn it into tasks and signals models can learn from. Specifically, you will:</p>
<ul>
<li>Invent, design and implement RL environments and evaluations.</li>
<li>Conduct experiments and shape our research roadmap.</li>
<li>Deliver your work into training runs.</li>
<li>Collaborate with other researchers, engineers, and performance engineering specialists across and outside Anthropic.</li>
</ul>
<p>You may be a good fit if you:</p>
<ul>
<li>Have expertise with accelerators (CUDA, ROCm, Triton, Pallas), ML framework programming (JAX or PyTorch).</li>
<li>Have worked across the stack – kernels, model code, distributed systems.</li>
<li>Know how to balance research exploration with engineering implementation.</li>
<li>Are passionate about AI&#39;s potential and committed to developing safe and beneficial systems.</li>
</ul>
<p>Strong candidates may also have:</p>
<ul>
<li>Experience with reinforcement learning.</li>
<li>Experience porting ML workloads between different types of accelerators.</li>
<li>Familiarity with LLM training methodologies.</li>
</ul>
<p>The annual compensation range for this role is $350,000-$850,000 USD.</p>
<p>We&#39;re an extremely collaborative group, and we host frequent research discussions to ensure that we are pursuing the highest-impact work at any given time. As such, we greatly value communication skills.</p>
<p>We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact, advancing our long-term goals of steerable, trustworthy AI, rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and biology as with traditional efforts in computer science.</p>
<p>Anthropic is a public benefit corporation headquartered in San Francisco. We offer competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a lovely office space in which to collaborate with colleagues.</p>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$350,000-$850,000 USD</Salaryrange>
      <Skills>accelerators, ML framework programming, distributed systems, reinforcement learning, LLM training methodologies, CUDA, ROCm, Triton, Pallas, JAX, PyTorch</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic is a public benefit corporation that creates reliable, interpretable, and steerable AI systems.</Employerdescription>
      <Employerwebsite>https://www.anthropic.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/5160330008</Applyto>
      <Location>San Francisco, CA</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>28107212-128</externalid>
      <Title>Performance Engineer, GPU</Title>
      <Description><![CDATA[<p>As a GPU Performance Engineer at Anthropic, you will be responsible for architecting and implementing the foundational systems that power Claude and push the frontiers of what&#39;s possible with large language models. You will maximize GPU utilization and performance at unprecedented scale, develop cutting-edge optimizations that directly enable new model capabilities, and dramatically improve inference efficiency.</p>
<p>Working at the intersection of hardware and software, you will implement state-of-the-art techniques from custom kernel development to distributed system architectures. Your work will span the entire stack, from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization.</p>
<p>Strong candidates will have a track record of delivering transformative GPU performance improvements in production ML systems and will be excited to shape the future of AI infrastructure alongside world-class researchers and engineers.</p>
<p>Responsibilities:</p>
<ul>
<li>Architect and implement foundational systems that power Claude</li>
<li>Maximize GPU utilization and performance at unprecedented scale</li>
<li>Develop cutting-edge optimizations that directly enable new model capabilities</li>
<li>Dramatically improve inference efficiency</li>
<li>Implement state-of-the-art techniques from custom kernel development to distributed system architectures</li>
<li>Work at the intersection of hardware and software</li>
<li>Span the entire stack, from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization</li>
</ul>
<p>Requirements:</p>
<ul>
<li>Deep experience with GPU programming and optimization at scale</li>
<li>Impact-driven, passionate about delivering measurable performance breakthroughs</li>
<li>Ability to navigate complex systems from hardware interfaces to high-level ML frameworks</li>
<li>Enjoy collaborative problem-solving and pair programming</li>
<li>Want to work on state-of-the-art language models with real-world impact</li>
<li>Care about the societal impacts of your work</li>
<li>Thrive in ambiguous environments where you define the path forward</li>
</ul>
<p>Nice to have:</p>
<ul>
<li>Experience with GPU Kernel Development: CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization</li>
<li>ML Compilers &amp; Frameworks: PyTorch/JAX internals, torch.compile, XLA, custom operators</li>
<li>Performance Engineering: Kernel fusion, memory bandwidth optimization, profiling with Nsight</li>
<li>Distributed Systems: NCCL, NVLink, collective communication, model parallelism</li>
<li>Low-Precision: INT8/FP8 quantization, mixed-precision techniques</li>
<li>Production Systems: Large-scale training infrastructure, fault tolerance, cluster orchestration</li>
</ul>
<p>Representative projects:</p>
<ul>
<li>Co-design attention mechanisms and algorithms for next-generation hardware architectures</li>
<li>Develop custom kernels for emerging quantization formats and mixed-precision techniques</li>
<li>Design distributed communication strategies for multi-node GPU clusters</li>
<li>Optimize end-to-end training and inference pipelines for frontier language models</li>
<li>Build performance modeling frameworks to predict and optimize GPU utilization</li>
<li>Implement kernel fusion strategies to minimize memory bandwidth bottlenecks</li>
<li>Create resilient systems for planet-scale distributed training infrastructure</li>
<li>Profile and eliminate performance bottlenecks in production serving infrastructure</li>
<li>Partner with hardware vendors to influence future accelerator capabilities and software stacks</li>
</ul>
<p>Note: The salary range for this position is $280,000-$850,000 USD per year.</p>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$280,000-$850,000 USD per year</Salaryrange>
      <Skills>GPU programming, optimization at scale, CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization, PyTorch/JAX internals, torch.compile, XLA, custom operators, kernel fusion, memory bandwidth optimization, profiling with Nsight, NCCL, NVLink, collective communication, model parallelism, INT8/FP8 quantization, mixed-precision techniques, large-scale training infrastructure, fault tolerance, cluster orchestration</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic is a public benefit corporation that creates reliable, interpretable, and steerable AI systems.</Employerdescription>
      <Employerwebsite>https://www.anthropic.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/4926227008</Applyto>
      <Location>San Francisco, CA | New York City, NY | Seattle, WA</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>586b9fef-509</externalid>
      <Title>Senior Software Engineer - Network Enablement (Applied ML)</Title>
      <Description><![CDATA[<p>We believe that the way people interact with their finances will drastically improve in the next few years. We&#39;re dedicated to empowering this transformation by building the tools and experiences that thousands of developers use to create their own products.</p>
<p>On this team, you will build and operate the ML infrastructure and product services that enable trust and intelligence across Plaid&#39;s network. You&#39;ll own feature engineering, offline training and batch scoring, online feature serving, and real-time inference so model outputs directly power partner-facing fraud &amp; trust products and bank intelligence features.</p>
<p><strong>Responsibilities</strong></p>
<ul>
<li>Embed model inference into Network Enablement product flows and decision logic (APIs, feature flags, backend flows).</li>
<li>Define and instrument product + ML success metrics (fraud reduction, retention lift, false positives, downstream impact).</li>
<li>Design and run experiments and rollout plans (backtesting, shadow scoring, A/B tests, feature-flagged releases) to validate product hypotheses.</li>
<li>Build and operate offline training pipelines and production batch scoring for bank intelligence products.</li>
<li>Ship and maintain online feature serving and low-latency model inference endpoints for real-time partner/bank scoring.</li>
<li>Implement model CI/CD, model/version registry, and safe rollout/rollback strategies.</li>
<li>Monitor model/data health: drift/regression detection, model-quality dashboards, alerts, and SLOs targeted to partner product needs.</li>
<li>Ensure offline and online parity, data lineage, and automated validation / data contracts to reduce regressions.</li>
<li>Optimize inference performance and cost for real-time scoring (batching, caching, runtime selection).</li>
<li>Ensure fairness, explainability and PII-aware handling for partner-facing ML features; maintain auditability for compliance.</li>
<li>Partner with platform and cross-functional teams to scale the ML/data foundation (graph features, sequence embeddings, unified pipelines).</li>
<li>Mentor engineers and document team standards for ML productization and operations.</li>
</ul>
<p><strong>Qualifications</strong></p>
<p>Must-haves:</p>
<ul>
<li>Strong software engineering skills including systems design, APIs, and building reliable backend services (Go or Python preferred).</li>
<li>Production experience with batch and streaming data pipelines and orchestration tools such as Airflow or Spark.</li>
<li>Experience building or operating real-time scoring and online feature-serving systems, including feature stores and low-latency model inference.</li>
<li>Experience integrating model outputs into product flows (APIs, feature flags) and measuring impact through experiments and product metrics.</li>
<li>Experience with model lifecycle and operations: model registries, CI/CD for models, reproducible training, offline &amp; online parity, monitoring and incident response.</li>
</ul>
<p>Nice to have:</p>
<ul>
<li>Experience in fraud, risk, or marketing intelligence domains.</li>
<li>Experience with feature-store products (Tecton / Chronon / Feast / internal) and unified pipelines.</li>
<li>Experience with graph frameworks, graph feature engineering, or sequence embeddings.</li>
<li>Experience optimizing inference at scale (Triton/ONNX/quantization, batching, caching).</li>
</ul>
<p><strong>Additional Information</strong></p>
<p>Our mission at Plaid is to unlock financial freedom for everyone. To support that mission, we seek to build a diverse team of driven individuals who care deeply about making the financial ecosystem more equitable.</p>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$190,800-$286,800 per year</Salaryrange>
      <Skills>software engineering, systems design, APIs, backend services, Go, Python, batch and streaming data pipelines, orchestration tools, Airflow, Spark, real-time scoring, online feature-serving systems, feature stores, low-latency model inference, model outputs, product flows, experiments, product metrics, model lifecycle, operations, model registries, CI/CD, reproducible training, offline &amp; online parity, monitoring, incident response, fraud, risk, marketing intelligence, feature-store products, unified pipelines, graph frameworks, graph feature engineering, sequence embeddings, inference at scale, Triton, ONNX, quantization, batching, caching</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Plaid</Employername>
      <Employerlogo>https://logos.yubhub.co/plaid.com.png</Employerlogo>
      <Employerdescription>Plaid is a technology company that powers the tools millions of people rely on to live a healthier financial life. The company has a presence in multiple countries and works with thousands of companies.</Employerdescription>
      <Employerwebsite>https://plaid.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.lever.co/plaid/43b1374d-5c5e-4b63-b710-a95e3cb76bbe</Applyto>
      <Location>San Francisco</Location>
      <Country></Country>
      <Postedate>2026-04-17</Postedate>
    </job>
    <job>
      <externalid>5c28c97d-fc5</externalid>
      <Title>Member of Technical Staff - Image / Video Generation</Title>
      <Description><![CDATA[<p><strong>Job Title</strong></p>
<p>Member of Technical Staff - Image / Video Generation</p>
<p><strong>Job Description</strong></p>
<p>We&#39;re the team behind Latent Diffusion, Stable Diffusion, and FLUX, foundational technologies that changed how the world creates images and video. We&#39;re creating the generative models that power how people make images and video, tools used by millions of creators, developers, and businesses worldwide. Our FLUX models are among the most advanced in the world, and we’re just getting started.</p>
<p><strong>Why This Role</strong></p>
<p>You&#39;ll train large-scale diffusion models for image and video generation, exploring new approaches while maintaining the rigor that helps us distinguish meaningful progress from incremental tweaks. This isn&#39;t about following established recipes; it&#39;s about running the experiments that clarify which architectural choices matter and which are less impactful.</p>
<p><strong>What You’ll Work On</strong></p>
<ul>
<li>Train large-scale diffusion transformer models for image and video data, working at the scale where intuitions break and empirical evidence matters</li>
<li>Rigorously ablate design choices, running experiments that isolate variables, control for confounds, and produce insights you can actually trust, then communicate those results to shape our research direction</li>
<li>Reason about the speed-quality tradeoffs of neural network architectures in production settings where both constraints matter simultaneously</li>
<li>Fine-tune diffusion models for specialized applications like image and video upscalers, inpainting/outpainting models, and other tasks where general-purpose models aren&#39;t enough</li>
</ul>
<p><strong>What We’re Looking For</strong></p>
<ul>
<li>You&#39;ve trained large-scale diffusion models and developed strong intuitions about what matters. You know that at research scale, every design choice has tradeoffs, and the only way to know which ones are worth making is through careful ablation. You&#39;re comfortable debugging distributed training issues and presenting research findings to the team.</li>
</ul>
<p><strong>Required Skills</strong></p>
<ul>
<li>Hands-on experience training large-scale diffusion models for image and video data, with practical knowledge of common failure modes and what matters most in training</li>
<li>Experience fine-tuning diffusion models for specialized applications such as upscalers, inpainting, outpainting, or other tasks where understanding the domain matters as much as understanding the architecture</li>
<li>Deep understanding of how to effectively evaluate image and video generative models, knowing which metrics correlate with quality and which are just convenient proxies</li>
<li>Strong proficiency in PyTorch, transformer architectures, and the full ecosystem of modern deep learning</li>
<li>Solid understanding of distributed training techniques (FSDP, low-precision training, model parallelism), because our models don&#39;t fit on one GPU and training decisions impact research outcomes</li>
</ul>
<p><strong>Preferred Skills</strong></p>
<ul>
<li>Experience writing forward and backward Triton kernels and ensuring their correctness while considering floating point errors</li>
<li>Proficiency with profiling, debugging, and optimizing single and multi-GPU operations using tools like Nsight or stack trace viewers</li>
<li>Know the performance characteristics of different architectural choices at scale</li>
<li>Have published research that contributed to how people think about generative models</li>
</ul>
<p><strong>How We Work Together</strong></p>
<p>We’re a distributed team with real offices that people actually use. Depending on your role, you’ll either join us in Freiburg or SF at least 2 days a week (or one full week every other week), or work remotely with a monthly in-person week to stay connected. We’ll cover reasonable travel costs to make this possible. We think in-person time matters, and we’ve structured things to make it accessible to all. We’ll discuss what this will look like for the role during our interview process.</p>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>large-scale diffusion models, image and video data, PyTorch, transformer architectures, distributed training techniques, writing forward and backward Triton kernels, profiling, debugging, and optimizing single and multi-GPU operations, published research on generative models</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Black Forest Labs</Employername>
      <Employerlogo>https://logos.yubhub.co/blackforestlabs.com.png</Employerlogo>
      <Employerdescription>Black Forest Labs is a research lab developing foundational technologies for image and video generation. It is headquartered in Freiburg, Germany, with a growing presence in San Francisco.</Employerdescription>
      <Employerwebsite>https://www.blackforestlabs.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/blackforestlabs/jobs/4132217008</Applyto>
      <Location>Freiburg (Germany)</Location>
      <Country></Country>
      <Postedate>2026-04-17</Postedate>
    </job>
    <job>
      <externalid>3c4831ed-fa8</externalid>
      <Title>Technical Product Manager</Title>
      <Description><![CDATA[<p>We&#39;re hiring a Technical Product Manager to define and execute Alluxio&#39;s AI systems strategy , spanning inference, training, and emerging agentic workloads. This role bridges the worlds of AI infrastructure and distributed data systems, guiding how Alluxio evolves to serve next-generation model architectures and large-scale data flows.</p>
<p><strong>Responsibilities</strong></p>
<ul>
<li>Define the long-term vision and roadmap for Alluxio&#39;s AI data platform, covering inference, training, and agentic workloads.</li>
<li>Collaborate with engineering to design features that deliver high-throughput, low-latency data access (e.g., GPU-aware caching, streaming reads, tiered prefetching).</li>
<li>Ensure seamless integration with frameworks like PyTorch, TensorFlow, Ray, and Triton; evolve Alluxio&#39;s APIs for AI-native workloads.</li>
<li>Engage directly with enterprise AI teams to understand workload patterns, validate impact, and prioritize roadmap direction.</li>
<li>Stay ahead of trends in multi-model serving, retrieval-augmented generation (RAG), and agentic orchestration; translate them into actionable product plans.</li>
</ul>
<p><strong>Requirements</strong></p>
<ul>
<li>5–9 years of experience in product management or technical leadership within AI infrastructure, ML platforms, or distributed systems.</li>
<li>Strong understanding of AI/ML workflows, from model training and deployment to inference and data-access pipelines.</li>
<li>Proven track record of delivering infrastructure features that improve latency, GPU utilization, or total cost of ownership.</li>
<li>Technical fluency with distributed systems, caching, and cloud orchestration (Kubernetes, AWS/GCP/Azure).</li>
<li>Familiarity with AI frameworks such as PyTorch, TensorFlow, Triton, Ray, or LangChain.</li>
<li>Exceptional communication and strategic thinking, with the ability to translate complex systems work into clear, prioritized roadmaps.</li>
</ul>
<p><strong>Benefits</strong></p>
<ul>
<li>Shape how the world&#39;s most advanced AI systems access and process data.</li>
<li>Work at the intersection of distributed systems, AI acceleration, and open source.</li>
<li>Collaborate with world-class engineers, researchers, and customers driving the AI frontier.</li>
<li>Competitive compensation and equity package with comprehensive benefits.</li>
<li>A culture built on curiosity, empathy, and deep technical rigor.</li>
</ul>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>remote</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>Distributed systems, Caching, Cloud orchestration, Kubernetes, AWS/GCP/Azure, PyTorch, TensorFlow, Triton, Ray, LangChain</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Alluxio</Employername>
      <Employerlogo>https://logos.yubhub.co/alluxio.io.png</Employerlogo>
      <Employerdescription>Alluxio powers the data layer for modern AI and analytics, with proven production at eight of the top ten internet companies and seven of the ten highest-valued enterprises globally.</Employerdescription>
      <Employerwebsite>https://alluxio.io</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.lever.co/alluxio/e7e0f8a4-83ed-416b-9f95-7f3ed8abfa52</Applyto>
      <Location>San Francisco</Location>
      <Country></Country>
      <Postedate>2026-04-17</Postedate>
    </job>
    <job>
      <externalid>ce88828f-470</externalid>
      <Title>Solutions Architect, AI and ML</Title>
      <Description><![CDATA[<p>We are building the world&#39;s leading AI company and are looking for an experienced Cloud Solution Architect to help assist customers with adoption of GPU hardware and Software, as well as building and deploying Machine Learning (ML), Deep Learning (DL), data analytics solutions on various Cloud Computing Platforms.</p>
<p>As part of the Solutions Architecture team, we work with some of the most exciting computing hardware and software technologies including the latest breakthroughs in machine learning and data science. A Solutions Architect is the first line of technical expertise between NVIDIA and our customers so you will engage directly with developers, researchers, and data scientists with some of NVIDIA&#39;s most strategic technology customers as well as work directly with business and engineering teams on product strategy.</p>
<p><strong>What you will be doing:</strong></p>
<ul>
<li>Work with Cloud Service Providers to develop and demonstrate solutions based on NVIDIA&#39;s ML/DL and data science software and hardware technologies</li>
<li>Build and deploy AI/ML solutions at scale using NVIDIA&#39;s AI software on cloud-based GPU platforms.</li>
<li>Build custom PoCs for solutions that address customers&#39; critical business needs, applying NVIDIA hardware and software technology</li>
<li>Partner with Sales Account Managers or Developer Relations Managers to identify and secure new business opportunities for NVIDIA products and solutions for ML/DL and other software solutions</li>
<li>Prepare and deliver technical content to customers including presentations about purpose-built solutions, workshops about NVIDIA products and solutions, etc.</li>
<li>Conduct regular technical customer meetings for project/product roadmap, feature discussions, and intro to new technologies. Establish close technical ties to the customer to facilitate rapid resolution of customer issues</li>
</ul>
<p><strong>What we need to see:</strong></p>
<ul>
<li>3+ years of Solutions Engineering (or similar Sales Engineering roles) or equivalent experience</li>
<li>3+ years of work-related experience in Deep Learning and Machine Learning, including deep learning frameworks such as TensorFlow or PyTorch; GPU and CUDA experience is extremely helpful.</li>
<li>BS/MS/PhD in Electrical/Computer Engineering, Computer Science, Statistics, Physics, or other Engineering fields or equivalent experience.</li>
<li>Established track record of deploying solutions in cloud computing environments including AWS, GCP, or Azure</li>
<li>Knowledge of DevOps/MLOps technologies such as Docker/containers, Kubernetes, data center deployments</li>
<li>Ability to use at least one scripting language (e.g., Python)</li>
<li>Good programming and debugging skills</li>
<li>Ability to communicate your ideas/code clearly through documents, presentations, etc.</li>
</ul>
<p><strong>Ways to stand out from the crowd:</strong></p>
<ul>
<li>AWS, GCP or Azure Professional Solution Architect Certification.</li>
<li>Hands-on experience with NVIDIA GPUs and SDKs (e.g., CUDA, RAPIDS, Triton, etc.)</li>
<li>System-level experience, specifically with GPU-based systems</li>
<li>Experience with Deep Learning at scale</li>
<li>Familiarity with parallel programming and distributed computing platforms</li>
</ul>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>Solutions Engineering, Deep Learning and Machine Learning, TensorFlow or PyTorch, GPU and CUDA experience, BS/MS/PhD in Electrical/Computer Engineering, Computer Science, Statistics, Physics, or other Engineering fields, DevOps/ML Ops technologies, Docker/containers, Kubernetes, data center deployments, Scripting language (i.e., Python), Good programming and debugging skills, Ability to communicate your ideas/code clearly through documents, presentation etc., AWS, GCP or Azure Professional Solution Architect Certification, Hands-on experience with NVIDIA GPUs and SDKs (e.g. CUDA, RAPIDS, Triton etc.), System-level experience specifically GPU-based systems, Experience with Deep Learning at scale, Familiarity with parallel programming and distributed computing platforms</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>NVIDIA</Employername>
      <Employerlogo>https://logos.yubhub.co/nvidia.com.png</Employerlogo>
      <Employerdescription>NVIDIA is a leading technology company that specialises in designing and manufacturing graphics processing units (GPUs) and high-performance computing hardware.</Employerdescription>
      <Employerwebsite>https://nvidia.wd5.myworkdayjobs.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-WA-Redmond/Solutions-Architect--AI-and-ML_JR2000691</Applyto>
      <Location>Redmond, Santa Clara, Seattle</Location>
      <Country></Country>
      <Postedate>2026-03-09</Postedate>
    </job>
    <job>
      <externalid>f8883394-0fc</externalid>
      <Title>Solutions Architect, AI and ML</Title>
      <Description><![CDATA[<p>We are looking for an experienced Cloud Solution Architect to help assist customers with adoption of GPU hardware and Software, as well as building and deploying Machine Learning (ML) , Deep Learning (DL), data analytics solutions on various Cloud Computing Platforms.</p>
<p>As a Solutions Architect, you will engage directly with developers, researchers, and data scientists with some of NVIDIA’s most strategic technology customers as well as work directly with business and engineering teams on product strategy.</p>
<p><strong>Key Responsibilities:</strong></p>
<ul>
<li>Help cloud customers craft, deploy, and maintain scalable, GPU-accelerated inference pipelines on cloud ML services and Kubernetes for large language models (LLMs) and generative AI workloads.</li>
<li>Enhance performance tuning using TensorRT/TensorRT-LLM, vLLM, Dynamo, and Triton Inference Server to improve GPU utilization and model efficiency.</li>
<li>Collaborate with multi-functional teams (engineering, product) and offer technical mentorship to cloud customers implementing AI inference at scale.</li>
<li>Build custom PoCs for solutions that address customers’ critical business needs, applying NVIDIA hardware and software technology</li>
<li>Partner with Sales Account Managers or Developer Relations Managers to identify and secure new business opportunities for NVIDIA products and solutions for ML/DL and other software solutions</li>
<li>Prepare and deliver technical content to customers including presentations about purpose-built solutions, workshops about NVIDIA products and solutions, etc.</li>
<li>Conduct regular technical customer meetings for project/product roadmap, feature discussions, and intro to new technologies. Establish close technical ties to the customer to facilitate rapid resolution of customer issues</li>
</ul>
<p><strong>Requirements:</strong></p>
<ul>
<li>BS/MS/PhD in Electrical/Computer Engineering, Computer Science, Statistics, Physics, or other Engineering fields or equivalent experience.</li>
<li>3+ Years in Solutions Architecture with a proven track record of moving AI inference from POC to production in cloud computing environments including AWS, GCP, or Azure</li>
<li>3+ years of hands-on experience with Deep Learning frameworks such as PyTorch and TensorFlow</li>
<li>Excellent knowledge of the theory and practice of LLM and DL inference</li>
<li>Strong fundamentals in programming, optimizations, and software design, especially in Python</li>
<li>Experience with containerization and orchestration technologies like Docker and Kubernetes, monitoring, and observability solutions for AI deployments</li>
<li>Knowledge of Inference technologies - NVIDIA NIM, TensorRT-LLM, Dynamo, Triton Inference Server, vLLM, etc</li>
<li>Proficiency in problem-solving and debugging skills in GPU environments</li>
<li>Excellent presentation, communication and collaboration skills</li>
</ul>
<p><strong>Nice to Have:</strong></p>
<ul>
<li>AWS, GCP or Azure Professional Solution Architect Certification.</li>
<li>Experience optimizing and deploying large MoE LLMs at scale</li>
<li>Active contributions to open-source AI inference projects (e.g., vLLM, TensorRT-LLM, Dynamo, SGLang, Triton, or similar)</li>
<li>Experience with Multi-GPU Multi-node Inference technologies like Tensor Parallelism/Expert Parallelism, Disaggregated Serving, LWS, MPI, EFA/Infiniband, NVLink/PCIe, etc</li>
<li>Experience in developing and integrating monitoring and alerting solutions using Prometheus, Grafana, and NVIDIA DCGM and GPU performance Analysis and tools like NVIDIA Nsight Systems</li>
</ul>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>Cloud Solution Architecture, GPU hardware and Software, Machine Learning (ML), Deep Learning (DL), Data Analytics, Cloud Computing Platforms, Kubernetes, TensorRT, TensorRT-LLM, vLLM, Dynamo, Triton Inference Server, Python, Containerization, Orchestration, Monitoring, Observability, Inference technologies, NVIDIA NIM, Problem-solving, Debugging, GPU environments, AWS, GCP, Azure, Professional Solution Architect Certification, Large MoE LLMs, Open-source AI inference projects, Multi-GPU Multi-node Inference technologies, Monitoring and alerting solutions, Prometheus, Grafana, NVIDIA DCGM, GPU performance Analysis, NVIDIA Nsight Systems</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>NVIDIA</Employername>
      <Employerlogo>https://logos.yubhub.co/nvidia.com.png</Employerlogo>
      <Employerdescription>NVIDIA is a leading technology company that specializes in designing and manufacturing graphics processing units (GPUs) and high-performance computing hardware.</Employerdescription>
      <Employerwebsite>https://nvidia.wd5.myworkdayjobs.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-WA-Redmond/Solutions-Architect--AI-and-ML_JR2005988-1</Applyto>
      <Location>Redmond, Santa Clara, Seattle</Location>
      <Country></Country>
      <Postedate>2026-03-09</Postedate>
    </job>
    <job>
      <externalid>11a60d5a-f54</externalid>
      <Title>Performance Engineer, GPU</Title>
      <Description><![CDATA[<p><strong>About the role:</strong></p>
<p>Pioneering the next generation of AI requires breakthrough innovations in GPU performance and systems engineering. As a GPU Performance Engineer, you&#39;ll architect and implement the foundational systems that power Claude and push the frontiers of what&#39;s possible with large language models. You&#39;ll be responsible for maximizing GPU utilization and performance at unprecedented scale, developing cutting-edge optimizations that directly enable new model capabilities and dramatically improve inference efficiency.</p>
<p>Working at the intersection of hardware and software, you&#39;ll implement state-of-the-art techniques from custom kernel development to distributed system architectures. Your work will span the entire stack—from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization.</p>
<p>Strong candidates will have a track record of delivering transformative GPU performance improvements in production ML systems and will be excited to shape the future of AI infrastructure alongside world-class researchers and engineers.</p>
<p><strong>You might be a good fit if you:</strong></p>
<ul>
<li>Have deep experience with GPU programming and optimization at scale</li>
<li>Are impact-driven, passionate about delivering measurable performance breakthroughs</li>
<li>Can navigate complex systems from hardware interfaces to high-level ML frameworks</li>
<li>Enjoy collaborative problem-solving and pair programming</li>
<li>Want to work on state-of-the-art language models with real-world impact</li>
<li>Care about the societal impacts of your work</li>
<li>Thrive in ambiguous environments where you define the path forward</li>
</ul>
<p><strong>Strong candidates may also have experience with:</strong></p>
<ul>
<li>GPU Kernel Development: CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization</li>
<li>ML Compilers &amp; Frameworks: PyTorch/JAX internals, torch.compile, XLA, custom operators</li>
<li>Performance Engineering: Kernel fusion, memory bandwidth optimization, profiling with Nsight</li>
<li>Distributed Systems: NCCL, NVLink, collective communication, model parallelism</li>
<li>Low-Precision: INT8/FP8 quantization, mixed-precision techniques</li>
<li>Production Systems: Large-scale training infrastructure, fault tolerance, cluster orchestration</li>
</ul>
<p><strong>Representative projects:</strong></p>
<ul>
<li>Co-design attention mechanisms and algorithms for next-generation hardware architectures</li>
<li>Develop custom kernels for emerging quantization formats and mixed-precision techniques</li>
<li>Design distributed communication strategies for multi-node GPU clusters</li>
<li>Optimize end-to-end training and inference pipelines for frontier language models</li>
<li>Build performance modeling frameworks to predict and optimize GPU utilization</li>
<li>Implement kernel fusion strategies to minimize memory bandwidth bottlenecks</li>
<li>Create resilient systems for planet-scale distributed training infrastructure</li>
<li>Profile and eliminate performance bottlenecks in production serving infrastructure</li>
<li>Partner with hardware vendors to influence future accelerator capabilities and software stacks</li>
</ul>
<p><strong>Deadline to apply:</strong> None. Applications will be reviewed on a rolling basis.</p>
<p>The expected salary range for this position is:</p>
<p>Annual Salary: $280,000 - $850,000 USD</p>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$280,000 - $850,000 USD</Salaryrange>
      <Skills>GPU programming, optimization at scale, custom kernel development, distributed system architectures, low-level tensor core optimizations, orchestrating thousands of GPUs, GPU kernel development, CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization, ML compilers &amp; frameworks, PyTorch/JAX internals, torch.compile, XLA, custom operators, performance engineering, kernel fusion, memory bandwidth optimization, profiling with Nsight, distributed systems, NCCL, NVLink, collective communication, model parallelism, low-precision, INT8/FP8 quantization, mixed-precision techniques, production systems, large-scale training infrastructure, fault tolerance, cluster orchestration</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic&apos;s mission is to create reliable, interpretable, and steerable AI systems. The company is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</Employerdescription>
      <Employerwebsite>https://job-boards.greenhouse.io</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/4926227008</Applyto>
      <Location>San Francisco, CA | New York City, NY | Seattle, WA</Location>
      <Country></Country>
      <Postedate>2026-03-08</Postedate>
    </job>
    <job>
      <externalid>7badeaf5-492</externalid>
      <Title>Hardware / Software CoDesign Engineer</Title>
      <Description><![CDATA[<p><strong>Hardware / Software CoDesign Engineer</strong></p>
<p><strong>Location</strong></p>
<p>San Francisco</p>
<p><strong>Employment Type</strong></p>
<p>Full time</p>
<p><strong>Location Type</strong></p>
<p>Hybrid</p>
<p><strong>Department</strong></p>
<p>Scaling</p>
<p><strong>Compensation</strong></p>
<ul>
<li>$342K – $555K • Offers Equity</li>
</ul>
<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>
<p><strong>Benefits</strong></p>
<ul>
<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>
</ul>
<ul>
<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>
</ul>
<ul>
<li>401(k) retirement plan with employer match</li>
</ul>
<ul>
<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>
</ul>
<ul>
<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>
</ul>
<ul>
<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>
</ul>
<ul>
<li>Mental health and wellness support</li>
</ul>
<ul>
<li>Employer-paid basic life and disability coverage</li>
</ul>
<ul>
<li>Annual learning and development stipend to fuel your professional growth</li>
</ul>
<ul>
<li>Daily meals in our offices, and meal delivery credits as eligible</li>
</ul>
<ul>
<li>Relocation support for eligible employees</li>
</ul>
<ul>
<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>
</ul>
<p><strong>About the Team</strong></p>
<p>OpenAI’s Hardware organization develops silicon and system-level solutions designed for the unique demands of advanced AI workloads. The team is responsible for building the next generation of AI-native silicon while working closely with software and research partners to co-design hardware tightly integrated with AI models. In addition to delivering production-grade silicon for OpenAI’s supercomputing infrastructure, the team also creates custom design tools and methodologies that accelerate innovation and enable hardware optimized specifically for AI.</p>
<p><strong>About the Role</strong></p>
<p>As an Engineer on our hardware optimization and co-design team, you will co-design future hardware from different vendors for programmability and performance. You will work with our kernel, compiler, and machine learning engineers to understand their unique needs around ML techniques, algorithms, numerical approximations, programming expressivity, and compiler optimizations. You will champion these constraints with various vendors to influence future hardware architectures toward efficient training and inference on our models. If you are excited about efficiently distributing large language models across devices, optimizing system-wide and rack-wide networking bottlenecks, tailoring the compute pipeline and memory hierarchy of a hardware platform, simulating workloads at different levels of abstraction, and working closely with our partners, this is the perfect opportunity!</p>
<p><strong>In this role, you will:</strong></p>
<ul>
<li>Co-design future hardware for programmability and performance with our hardware vendors</li>
</ul>
<ul>
<li>Assist hardware vendors in developing optimal kernels and add support for them in our compiler</li>
</ul>
<ul>
<li>Develop performance estimates for critical kernels for different hardware configurations and drive decisions on compute core and memory hierarchy features</li>
</ul>
<ul>
<li>Build system performance models at different abstraction levels and carry out analysis to drive decisions on scale-up, scale-out, and front-end networking (see the sketch after this list)</li>
</ul>
<ul>
<li>Work with machine learning engineers, kernel engineers and compiler developers to understand their vision and needs from high performance accelerators</li>
</ul>
<ul>
<li>Manage communication and coordination with internal and external partners</li>
</ul>
<ul>
<li>Influence the roadmaps of hardware partners to optimize their platforms for OpenAI’s workloads.</li>
</ul>
<ul>
<li>Evaluate potential partners’ accelerators and platforms.</li>
</ul>
<ul>
<li>As the scope of the role and team grows, understand and influence roadmaps for hardware partners for our datacenter networks, racks, and buildings.</li>
</ul>
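<p>To make the scale-up versus scale-out trade-off above concrete, here is a first-order model of ring all-reduce time; the link bandwidths, latencies, and gradient size are assumptions chosen only for illustration.</p>
<pre><code># First-order ring all-reduce model, the kind of estimate used to compare
# scale-up vs. scale-out interconnects. All bandwidths and latencies are assumptions.

def ring_allreduce_seconds(message_bytes, n_ranks, link_gb_per_s, step_latency_us):
    """Each rank moves 2*(n-1)/n of the buffer across its link, over 2*(n-1) steps."""
    transfer = 2.0 * (n_ranks - 1) / n_ranks * message_bytes / (link_gb_per_s * 1e9)
    latency = 2.0 * (n_ranks - 1) * step_latency_us * 1e-6
    return transfer + latency

grad_bytes = 2 * 8_000_000_000   # e.g. gradients for 8B parameters in bf16 (illustrative)
configs = [
    ("scale-up domain (assumed 400 GB/s links, 8 ranks)", 8, 400, 3),
    ("scale-out fabric (assumed 50 GB/s NICs, 64 ranks)", 64, 50, 10),
]
for name, ranks, gbps, lat_us in configs:
    t = ring_allreduce_seconds(grad_bytes, ranks, gbps, lat_us)
    print(f"{name}: ~{t * 1e3:.0f} ms per all-reduce")
</code></pre>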
<p><strong>You might thrive in this role if you have:</strong></p>
<ul>
<li>4+ years of industry experience, including experience harnessing compute at scale and optimizing ML platform code to run efficiently on target hardware.</li>
</ul>
<ul>
<li>Strong experience in software/hardware co-design</li>
</ul>
<ul>
<li>Deep understanding of GPU and/or other AI accelerators</li>
</ul>
<ul>
<li>Experience with CUDA, Triton or a related accelerator programming language</li>
</ul>
<ul>
<li>Experience driving Machine Learning accuracy with low precision formats</li>
</ul>
<ul>
<li>Experience with system performance modeling and analysis to optimize ML model deployment</li>
</ul>
<ul>
<li>Strong coding skills in C/C++ and Python</li>
</ul>
<ul>
<li>Familiarity with the fundamentals of deep learning computing and chip architecture/microarchitecture.</li>
</ul>
<p><strong>These attributes are nice to have:</strong></p>
<ul>
<li>PhD in Computer Science and Engineering with a specialization in Computer Architecture, Parallel Computing, Compilers, or other Systems areas</li>
</ul>
<ul>
<li>Strong understanding of LLMs and challenges related to their training and inference</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$342K – $555K • Offers Equity</Salaryrange>
      <Skills>software/hardware co-design, GPU and/or other AI accelerators, CUDA, Triton or a related accelerator programming language, Machine Learning accuracy with low precision formats, system performance modeling and analysis to optimize ML model deployment, C/C++ and Python, PhD in Computer Science and Engineering with a specialization in Computer Architecture, Parallel Computing, Compilers, or other Systems areas, strong understanding of LLMs and challenges related to their training and inference</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>OpenAI</Employername>
      <Employerlogo>https://logos.yubhub.co/openai.com.png</Employerlogo>
      <Employerdescription>OpenAI is a technology company that develops and commercializes advanced artificial intelligence (AI) systems. The company was founded in 2015 and is headquartered in San Francisco, California.</Employerdescription>
      <Employerwebsite>https://jobs.ashbyhq.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.ashbyhq.com/openai/bdbb2292-ecb3-42dc-ba89-65edf397d8f8</Applyto>
      <Location>San Francisco</Location>
      <Country></Country>
      <Postedate>2026-03-06</Postedate>
    </job>
    <job>
      <externalid>989f992b-6b2</externalid>
      <Title>Software Engineer, Inference – AMD GPU Enablement</Title>
      <Description><![CDATA[<p><strong>Software Engineer, Inference – AMD GPU Enablement</strong></p>
<p><strong>Location</strong></p>
<p>San Francisco</p>
<p><strong>Employment Type</strong></p>
<p>Full time</p>
<p><strong>Department</strong></p>
<p>Scaling</p>
<p><strong>Compensation</strong></p>
<ul>
<li>$295K – $555K • Offers Equity</li>
</ul>
<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>
<ul>
<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>
</ul>
<ul>
<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>
</ul>
<ul>
<li>401(k) retirement plan with employer match</li>
</ul>
<ul>
<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>
</ul>
<ul>
<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>
</ul>
<ul>
<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>
</ul>
<ul>
<li>Mental health and wellness support</li>
</ul>
<ul>
<li>Employer-paid basic life and disability coverage</li>
</ul>
<ul>
<li>Annual learning and development stipend to fuel your professional growth</li>
</ul>
<ul>
<li>Daily meals in our offices, and meal delivery credits as eligible</li>
</ul>
<ul>
<li>Relocation support for eligible employees</li>
</ul>
<ul>
<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>
</ul>
<p>More details about our benefits are available to candidates during the hiring process.</p>
<p>This role is at-will and OpenAI reserves the right to modify base pay and other compensation components at any time based on individual performance, team or company results, or market conditions.</p>
<p><strong>About the Team</strong></p>
<p>Our Inference team brings OpenAI’s most capable research and technology to the world through our products. We empower consumers, enterprises and developers alike to use and access our state-of-the-art AI models, allowing them to do things that they’ve never been able to before. We focus on performant and efficient model inference, as well as accelerating research progress via model inference.</p>
<p><strong>About the Role</strong></p>
<p>We’re hiring engineers to scale and optimize OpenAI’s inference infrastructure across emerging GPU platforms. You’ll work across the stack - from low-level kernel performance to high-level distributed execution - and collaborate closely with research, infra, and performance teams to ensure our largest models run smoothly on new hardware.</p>
<p>This is a high-impact opportunity to shape OpenAI’s multi-platform inference capabilities from the ground up with a particular focus on advancing inference performance on AMD accelerators.</p>
<p><strong>In this role, you will:</strong></p>
<ul>
<li>Own bring-up, correctness and performance of the OpenAI inference stack on AMD hardware.</li>
</ul>
<ul>
<li>Integrate internal model-serving infrastructure (e.g., vLLM, Triton) into a variety of GPU-backed systems.</li>
</ul>
<ul>
<li>Debug and optimize distributed inference workloads across memory, network, and compute layers.</li>
</ul>
<ul>
<li>Validate correctness, performance, and scalability of model execution on large GPU clusters.</li>
</ul>
<ul>
<li>Collaborate with partner teams to design and optimize high-performance GPU kernels for accelerators using HIP, Triton, or other performance-focused frameworks.</li>
</ul>
<ul>
<li>Collaborate with partner teams to build, integrate and tune collective communication libraries (e.g., RCCL) used to parallelize model execution across many GPUs.</li>
</ul>
<p><strong>You can thrive in this role if you:</strong></p>
<ul>
<li>Have experience writing or porting GPU kernels using HIP, CUDA, or Triton, and care deeply about low-level performance.</li>
</ul>
<ul>
<li>Are familiar with communication libraries like NCCL/RCCL and understand their role in high-throughput model serving.</li>
</ul>
<ul>
<li>Have worked on distributed inference systems and are comfortable scaling models across fleets of accelerators.</li>
</ul>
<ul>
<li>Enjoy solving end-to-end performance challenges across hardware, system libraries, and orchestration layers.</li>
</ul>
<ul>
<li>Are excited to be part of a small, fast-moving team building new infrastructure from first principles.</li>
</ul>
<p><strong>Nice to Have:</strong></p>
<ul>
<li>Contributions to open-source libraries like RCCL, Triton, or vLLM.</li>
</ul>
<ul>
<li>Experience with GPU performance tools (Nsight, rocprof, perf) and memory/comms profiling.</li>
</ul>
<ul>
<li>Prior experience deploying inference on other non-NVIDIA GPU environments.</li>
</ul>
<ul>
<li>Knowledge of model/tensor parallelism, mixed precision, and serving 10B+ parameter models.</li>
</ul>
<p><strong>About OpenAI</strong></p>
<p>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>mid</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>$295K – $555K • Offers Equity</Salaryrange>
      <Skills>GPU kernels, HIP, CUDA, Triton, NCCL/RCCL, distributed inference systems, GPU performance tools, memory/comms profiling, open-source libraries</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>OpenAI</Employername>
      <Employerlogo>https://logos.yubhub.co/openai.com.png</Employerlogo>
      <Employerdescription>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products.</Employerdescription>
      <Employerwebsite>https://jobs.ashbyhq.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.ashbyhq.com/openai/9b79406c-89a8-49bd-8a38-e72db80996e9</Applyto>
      <Location>San Francisco</Location>
      <Country></Country>
      <Postedate>2026-03-06</Postedate>
    </job>
    <job>
      <externalid>46bb9922-091</externalid>
      <Title>ML Research Engineer - Hardware Codesign</Title>
      <Description><![CDATA[<p><strong>Location</strong></p>
<p>San Francisco</p>
<p><strong>Employment Type</strong></p>
<p>Full time</p>
<p><strong>Department</strong></p>
<p>Scaling</p>
<p><strong>Compensation</strong></p>
<ul>
<li>$185K – $455K • Offers Equity</li>
</ul>
<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>
<ul>
<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>
</ul>
<ul>
<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>
</ul>
<ul>
<li>401(k) retirement plan with employer match</li>
</ul>
<ul>
<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>
</ul>
<ul>
<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>
</ul>
<ul>
<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>
</ul>
<ul>
<li>Mental health and wellness support</li>
</ul>
<ul>
<li>Employer-paid basic life and disability coverage</li>
</ul>
<ul>
<li>Annual learning and development stipend to fuel your professional growth</li>
</ul>
<ul>
<li>Daily meals in our offices, and meal delivery credits as eligible</li>
</ul>
<ul>
<li>Relocation support for eligible employees</li>
</ul>
<ul>
<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>
</ul>
<p>More details about our benefits are available to candidates during the hiring process.</p>
<p>This role is at-will and OpenAI reserves the right to modify base pay and other compensation components at any time based on individual performance, team or company results, or market conditions.</p>
<p><strong>About the Team</strong></p>
<p>OpenAI’s Hardware organization develops silicon and system-level solutions designed for the unique demands of advanced AI workloads. The team is responsible for building the next generation of AI silicon while working closely with software and research partners to co-design hardware tightly integrated with AI models. In addition to delivering production-grade silicon for OpenAI’s supercomputing infrastructure, the team also creates custom design tools and methodologies that accelerate innovation and enable hardware optimized specifically for AI.</p>
<p><strong>About the Role</strong></p>
<p>We’re seeking a Research-Hardware Codesign Engineer to operate at the boundary between model research and silicon/system architecture. You’ll help shape the numerics, architecture, and technology bets of future OpenAI silicon in collaboration with both Research and Hardware.</p>
<p>Your work will include debugging gaps between rooflines and reality, writing quantization kernels, derisking numerics via model evals, quantifying system architecture tradeoffs, and implementing novel numeric RTL. This is a hands-on role for people who go looking for hard problems, get to ground truth, and drive solutions to production. Strong prioritization and clear, honest communication are essential.</p>
<p>Location: San Francisco, CA (Hybrid: 3 days/week onsite)</p>
<p>Relocation assistance available.</p>
<p><strong>In this role you will:</strong></p>
<ul>
<li>Build on our roofline simulator to track evolving workloads, and deliver analyses that quantify the impact of system architecture decisions and support technology pathfinding.</li>
</ul>
<ul>
<li>Debug gaps between performance simulation and real measurements; clearly communicate root cause, bottlenecks, and invalid assumptions.</li>
</ul>
<ul>
<li>Write emulation kernels for low-precision numerics and lossy compression schemes, and give Research the information they need to trade efficiency against model quality (see the sketch after this list).</li>
</ul>
<ul>
<li>Prototype numerics modules by pushing RTL through synthesis; hand off novel numerics cleanly, or occasionally own an RTL module end-to-end.</li>
</ul>
<ul>
<li>Proactively pull in new ML workloads, prototype them with rooflines and/or functional simulation, and drive initial evaluation of new opportunities or risks.</li>
</ul>
<ul>
<li>Understand the whole picture from ML science to hardware optimization, and slice this end-to-end objective into near-term deliverables.</li>
</ul>
<ul>
<li>Build ad-hoc collaborations across teams with very different goals and areas of expertise, and keep progress unblocked.</li>
</ul>
<ul>
<li>Communicate design tradeoffs clearly with explicit assumptions and confidence levels; produce a trail of evidence that enables confident execution.</li>
</ul>
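<p>As a toy illustration of the low-precision emulation work above, the following sketch fake-quantizes a weight tensor to a symmetric 4-bit grid and reports the resulting error; the tensor shape, bit width, and scaling scheme are simplified assumptions, not the production emulation path.</p>
<pre><code>import numpy as np

# Toy low-precision emulation: symmetric per-tensor 4-bit fake quantization,
# the kind of numeric experiment used to study efficiency vs. model-quality trade-offs.

def fake_quant_int4(x):
    """Round to a symmetric 4-bit grid, then dequantize back to float."""
    qmax = 7                                    # use the symmetric part of the int4 range
    scale = max(float(np.max(np.abs(x))) / qmax, 1e-12)
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)
w_hat, scale = fake_quant_int4(w)
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"scale={scale:.4f}, relative error ~{rel_err:.1%}")
</code></pre>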
<p><strong>You will thrive in this role if you have:</strong></p>
<ul>
<li>An exceptional track record of high-quality technical output, and a bias for shipping a prototype now and iterating later in the absence of clear requirements.</li>
</ul>
<ul>
<li>Strong Python, and C++ or Rust, with a cautious attitude toward correctness and an intuition for clean extensibility.</li>
</ul>
<ul>
<li>Experience writing Triton, CUDA, or similar, and an understanding of the resulting mapping of tensor ops to functional units.</li>
</ul>
<ul>
<li>Working knowledge of PyTorch or JAX; experience in large ML codebases is a plus.</li>
</ul>
<ul>
<li>Practical understanding of floating point numerics, the ML tradeoffs of reduced precision, and the current state of the art in model quantization.</li>
</ul>
<ul>
<li>Deep understanding of transformer models, and strong intuition for transformer rooflines and the tradeoffs of sharded training and inference in large-scale ML systems.</li>
</ul>
<ul>
<li>Experience writing RTL (especially for floating point logic) and understanding of PPA tradeoffs is a plus.</li>
</ul>
<ul>
<li>Strong cross-functional communication (e.g. across ML researchers and hardware engineers); ability to slice ambiguous early-incubation ideas into concrete arenas in which progress can be made.</li>
</ul>
<p>To comply with U.S. export control laws and regulations, candidates for this role may need to meet certain legal status requirements.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$185K – $455K</Salaryrange>
      <Skills>Python, C++, Rust, Triton, CUDA, PyTorch, JAX, Floating point numerics, Model quantization, Transformer models, RTL, PPA tradeoffs, Experience in large ML codebases</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>OpenAI</Employername>
      <Employerlogo>https://logos.yubhub.co/openai.com.png</Employerlogo>
      <Employerdescription>OpenAI is a technology company that develops and commercializes advanced artificial intelligence (AI) systems. It was founded in 2015 and is headquartered in San Francisco, California.</Employerdescription>
      <Employerwebsite>https://jobs.ashbyhq.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.ashbyhq.com/openai/5931abef-191b-417e-89f1-1d06f00e908c</Applyto>
      <Location>San Francisco</Location>
      <Country></Country>
      <Postedate>2026-03-06</Postedate>
    </job>
    <job>
      <externalid>7f56054b-d77</externalid>
      <Title>Principal Software Engineer</Title>
      <Description><![CDATA[<p><strong>Summary</strong></p>
<p>Microsoft AI is looking for a talented Principal Software Engineer at their Mountain View office. This role sits at the heart of strategic decision-making, driving innovations in AI infrastructure. You&#39;ll work directly with key partners to understand, design, and implement complex inferencing capabilities for state-of-the-art deep learning models.</p>
<p><strong>About the Role</strong></p>
<p>As a Principal Software Engineer, you will be responsible for engaging directly with key partners to understand, design, and implement complex inferencing capabilities for state-of-the-art deep learning models. You will work with cutting-edge hardware and software stacks to deliver best-in-class inference performance while optimizing for cost, leveraging open-source projects to advance deep learning applications. You will collaborate with external and internal teams to identify new areas for improvement and contribute to innovations that enhance model performance and deployment.</p>
<p><strong>Accountabilities</strong></p>
<ul>
<li>Engage directly with key partners to understand, design, and implement complex inferencing capabilities for state-of-the-art deep learning models.</li>
<li>Work with cutting-edge hardware and software stacks to deliver best-in-class inference performance while optimizing for cost.</li>
</ul>
<p><strong>The Candidate we&#39;re looking for</strong></p>
<p><strong>Experience:</strong></p>
<ul>
<li>6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python.</li>
</ul>
<p><strong>Technical skills:</strong></p>
<ul>
<li>Experience with model compression (quantization, distillation, SVD, low-rank methods).</li>
<li>Experience in building high-throughput inference serving stacks (continuous batching, KV-cache optimizations, routing); see the sizing sketch after this list.</li>
</ul>
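<p>To give a sense of the KV-cache arithmetic behind continuous-batching capacity planning, here is a back-of-the-envelope sketch; the model shape, context length, and memory budget are assumed values for illustration only.</p>
<pre><code># Back-of-the-envelope KV-cache sizing for continuous batching.
# The model shape and the leftover HBM budget below are assumptions.

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Each layer caches K and V: 2 * n_kv_heads * head_dim elements per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

per_tok = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)  # GQA-style config
budget_gb = 40.0          # HBM assumed left over after weights and activations
ctx_len = 8192
concurrent = int(budget_gb * 1e9 // (per_tok * ctx_len))
print(f"{per_tok / 1024:.0f} KiB of KV cache per token; "
      f"roughly {concurrent} concurrent {ctx_len}-token sequences fit in {budget_gb:.0f} GB")
</code></pre>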
<p><strong>Personal attributes:</strong></p>
<ul>
<li>Solid experience in GPU inference optimization (CUDA, TensorRT, Triton, or custom GPU kernels).</li>
<li>Proficiency in profiling tools (Nsight, TensorBoard, PyTorch profiler) and ability to identify CPU/GPU bottlenecks.</li>
</ul>
<p><strong>Benefits</strong></p>
<ul>
<li>Competitive salary range of USD $139,900 – $274,800 per year.</li>
<li>Comprehensive benefits package, including health insurance, retirement plan, and paid time off.</li>
<li>Opportunities for professional growth and development.</li>
<li>Collaborative and dynamic work environment.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>USD $139,900 – $274,800 per year</Salaryrange>
      <Skills>C, C++, C#, Java, JavaScript, Python, model compression, GPU inference optimization, TensorRT, Triton, CUDA, Nsight, TensorBoard, PyTorch profiler</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Microsoft AI</Employername>
      <Employerlogo>https://logos.yubhub.co/microsoft.ai.png</Employerlogo>
      <Employerdescription>Microsoft AI is a leading technology company that specializes in artificial intelligence and machine learning. They are known for their innovative products and services that aim to make a positive impact on society. With a strong focus on research and development, Microsoft AI is constantly pushing the boundaries of what is possible with AI.</Employerdescription>
      <Employerwebsite>https://microsoft.ai</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://microsoft.ai/job/principal-software-engineer-24/</Applyto>
      <Location>Mountain View</Location>
      <Country></Country>
      <Postedate>2026-03-06</Postedate>
    </job>
    <job>
      <externalid>961a53f3-82e</externalid>
      <Title>Senior Software Engineer</Title>
      <Description><![CDATA[<p><strong>Summary</strong></p>
<p>Microsoft is looking for a talented Senior Software Engineer at their Suzhou office. This role sits at the heart of Bing&#39;s Search Ads engineering, building the GPU-accelerated systems that match ads to user intent at web scale. You&#39;ll work closely with researchers and platform teams to shape how Bing serves the search and advertising markets.</p>
<p><strong>About the Role</strong></p>
<p>Search Ads R&amp;D aims to build an online advertising ecosystem connecting users, advertisers, and the search engine. The Bing Search Ads Understanding team is chartered to deliver world-class algorithms using web-scale data. Our mission is to drive user satisfaction, advertiser ROI, and Bing revenue. A core challenge is to match advertisers’ ads to users’ queries by building an intelligent system that truly understands user intent. This is a very hard problem that demands the most advanced AI models and sophisticated engineering systems. Join us to work on projects highly strategic to Bing search in a fun and fast-paced environment!</p>
<p><strong>Accountabilities</strong></p>
<ul>
<li>Design, develop, and maintain high-performance software in C/C++ and Python, including GPU programming with CUDA, ROCm, or Triton.</li>
<li>Optimize model inference and training pipelines for speed, throughput, memory efficiency, and cost across GPU platforms.</li>
</ul>
<p><strong>The Candidate we&#39;re looking for</strong></p>
<p><strong>Experience:</strong></p>
<ul>
<li>Bachelor’s Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, Python, CUDA, or ROCm OR equivalent experience.</li>
</ul>
<p><strong>Technical skills:</strong></p>
<ul>
<li>Practical experience writing new GPU kernels, going beyond running GPU workloads with existing library kernels.</li>
</ul>
<p><strong>Personal attributes:</strong></p>
<ul>
<li>Cross-team collaboration skills and the desire to collaborate in a team of researchers and developers.</li>
</ul>
<p><strong>Benefits</strong></p>
<ul>
<li>Work on projects highly strategic to Bing search in a fun and fast-paced environment.</li>
<li>Collaborate with platform teams to integrate and tune solutions on emerging accelerator stacks and rapidly evolving toolchains.</li>
<li>Partner with internal and external stakeholders to translate requirements into scalable performance features and optimizations for state-of-the-art models.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>C/C++, Python, CUDA, ROCm, Triton, GPU programming, High-performance software development, Deep learning frameworks, Inference optimization, GPU profiling tools</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Microsoft</Employername>
      <Employerlogo>https://logos.yubhub.co/microsoft.ai.png</Employerlogo>
      <Employerdescription>Microsoft is a multinational technology company that develops, manufactures, licenses, and supports a wide range of software products, services, and devices. The company is known for its Windows operating system, Office software suite, and Xbox gaming console. Microsoft is headquartered in Redmond, Washington, and is one of the largest and most successful technology companies in the world.</Employerdescription>
      <Employerwebsite>https://microsoft.ai</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://microsoft.ai/job/senior-software-engineer-76/</Applyto>
      <Location>Suzhou</Location>
      <Country></Country>
      <Postedate>2026-03-06</Postedate>
    </job>
    <job>
      <externalid>a15b11dd-765</externalid>
      <Title>Principal Software Engineer</Title>
      <Description><![CDATA[<p><strong>Summary</strong></p>
<p>Microsoft AI is looking for a talented Principal Software Engineer at their Redmond office. This role sits at the heart of strategic decision-making, driving innovations in AI inference infrastructure. You&#39;ll work directly with key partners to deliver best-in-class inference performance for state-of-the-art deep learning models.</p>
<p><strong>About the Role</strong></p>
<p>As a Principal Software Engineer, you will be responsible for designing and implementing complex software systems that drive innovation in AI infrastructure. You will work with cutting-edge hardware and software stacks to deliver best-in-class inference performance while optimizing for cost, leveraging open-source projects to advance deep learning applications. You will collaborate with external and internal teams to identify new areas for improvement and contribute to innovations that enhance model performance and deployment.</p>
<p><strong>Accountabilities</strong></p>
<ul>
<li>Engage directly with key partners to understand, design, and implement complex inferencing capabilities for state-of-the-art deep learning models, driving innovations in AI infrastructure.</li>
<li>Work with cutting-edge hardware and software stacks to deliver best-in-class inference performance while optimizing for cost, leveraging open-source projects to advance deep learning applications.</li>
</ul>
<p><strong>The Candidate we&#39;re looking for</strong></p>
<p><strong>Experience:</strong></p>
<ul>
<li>Bachelor’s Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.</li>
</ul>
<p><strong>Technical skills:</strong></p>
<ul>
<li>Experience with model compression (quantization, distillation, SVD, low-rank methods).</li>
<li>Experience in building high-throughput inference serving stacks (continuous batching, KV-cache optimizations, routing).</li>
</ul>
<p><strong>Personal attributes:</strong></p>
<ul>
<li>Solid experience in GPU inference optimization (CUDA, TensorRT, Triton, or custom GPU kernels).</li>
<li>Proficiency in profiling tools (Nsight, TensorBoard, PyTorch profiler) and ability to identify CPU/GPU bottlenecks.</li>
</ul>
<p><strong>Benefits</strong></p>
<ul>
<li>Competitive salary</li>
<li>Comprehensive benefits package</li>
<li>Opportunities for professional growth and development</li>
<li>Collaborative and dynamic work environment</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>USD $139,900 – $274,800 per year</Salaryrange>
      <Skills>C, C++, C#, Java, JavaScript, Python, model compression, GPU inference optimization, profiling tools, TensorRT, Triton, CUDA, TensorBoard, PyTorch profiler</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Microsoft AI</Employername>
      <Employerlogo>https://logos.yubhub.co/microsoft.ai.png</Employerlogo>
      <Employerdescription>Microsoft AI is a leading technology company that specializes in artificial intelligence and machine learning. They are known for their innovative products and services that aim to make a positive impact on society. With a strong focus on research and development, Microsoft AI is constantly pushing the boundaries of what is possible with AI.</Employerdescription>
      <Employerwebsite>https://microsoft.ai</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://microsoft.ai/job/principal-software-engineer-23/</Applyto>
      <Location>Redmond</Location>
      <Country></Country>
      <Postedate>2026-03-06</Postedate>
    </job>
    <job>
      <externalid>426a1b6c-bb9</externalid>
      <Title>Senior Software Engineer</Title>
      <Description><![CDATA[<p><strong>Summary</strong></p>
<p>Microsoft is looking for a talented Senior Software Engineer at their Beijing office. This role sits at the heart of Bing&#39;s Search Ads engineering, building the GPU-accelerated systems that match ads to user intent at web scale. You&#39;ll work closely with researchers and platform teams to shape how Bing serves the search engine and online advertising markets.</p>
<p><strong>About the Role</strong></p>
<p>Search Ads R&amp;D aims to build an online advertising ecosystem connecting users, advertisers, and the search engine. The Bing Search Ads Understanding team is chartered to deliver world-class algorithms using web-scale data. Our mission is to drive user satisfaction, advertiser ROI, and Bing revenue. A core challenge is to match advertisers’ ads to users’ queries by building an intelligent system that truly understands user intent. This is a very hard problem that demands the most advanced AI models and sophisticated engineering systems. Join us to work on projects highly strategic to Bing search in a fun and fast-paced environment!</p>
<p><strong>Accountabilities</strong></p>
<ul>
<li>Design, develop, and maintain high-performance software in C/C++ and Python, including GPU programming with CUDA, ROCm, or Triton.</li>
<li>Optimize model inference and training pipelines for speed, throughput, memory efficiency, and cost across GPU platforms.</li>
</ul>
<p><strong>The Candidate we&#39;re looking for</strong></p>
<p><strong>Experience:</strong></p>
<ul>
<li>Bachelor’s Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, Python, CUDA, or ROCm OR equivalent experience.</li>
</ul>
<p><strong>Technical skills:</strong></p>
<ul>
<li>Practical experience writing new GPU kernels, going beyond running GPU workloads with existing library kernels.</li>
</ul>
<p><strong>Personal attributes:</strong></p>
<ul>
<li>Cross-team collaboration skills and the desire to collaborate in a team of researchers and developers.</li>
</ul>
<p><strong>Benefits</strong></p>
<ul>
<li>Work on projects highly strategic to Bing search in a fun and fast-paced environment.</li>
<li>Collaborate with platform teams to integrate and tune solutions on emerging accelerator stacks and rapidly evolving toolchains.</li>
<li>Partner with internal and external stakeholders to translate requirements into scalable performance features and optimizations for state-of-the-art models.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>C/C++, Python, CUDA, ROCm, Triton, GPU programming, High-performance software development, Deep learning frameworks, Inference optimization, Software engineering principles, Architecture design</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Microsoft</Employername>
      <Employerlogo>https://logos.yubhub.co/microsoft.ai.png</Employerlogo>
      <Employerdescription>Microsoft is a multinational technology company that develops, manufactures, licenses, and supports a wide range of software products, services, and devices. The company is known for its Windows operating system, Office software suite, and Xbox gaming console. Microsoft is a leader in the technology industry and is committed to innovation and customer satisfaction.</Employerdescription>
      <Employerwebsite>https://microsoft.ai</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://microsoft.ai/job/senior-software-engineer-75/</Applyto>
      <Location>Beijing</Location>
      <Country></Country>
      <Postedate>2026-03-06</Postedate>
    </job>
    <job>
      <externalid>7c0b682d-d0b</externalid>
      <Title>Senior Software Engineer</Title>
      <Description><![CDATA[<p><strong>Summary</strong></p>
<p>Microsoft AI is looking for a talented Senior Software Engineer at their Beijing office. This role sits at the heart of the AI Infrastructure team, optimizing the core inference engine that powers large-scale AI models. You&#39;ll work at the intersection of deep learning algorithms and low-level hardware to push latency and throughput to their limits.</p>
<p><strong>About the Role</strong></p>
<p>We are seeking an expert Senior GPU Engineer to join our AI Infrastructure team. In this role, you will architect and optimize the core inference engine that powers our large-scale AI models. You will be responsible for pushing the boundaries of hardware performance, reducing latency, and maximizing throughput for Generative AI and Deep Learning workloads. You will work at the intersection of Deep Learning algorithms and low-level hardware, designing custom operators and building a highly efficient training/inference execution engine from the ground up.</p>
<p><strong>Accountabilities</strong></p>
<ul>
<li>Custom Operator Development: Design and implement highly optimized GPU kernels (CUDA/Triton) for critical deep learning operations (e.g., FlashAttention, GEMM, LayerNorm) to outperform standard libraries (see the sketch after this list).</li>
<li>Inference Engine Architecture: Contribute to the development of our high-performance inference engine, focusing on graph optimizations, operator fusion, and dynamic memory management (e.g., KV Cache optimization).</li>
</ul>
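<p>For context on the FlashAttention-style kernels mentioned above, the sketch below shows the online-softmax rescaling those kernels rely on, written in plain NumPy for clarity; the shapes and block size are arbitrary, and a real kernel would implement this per tile in CUDA or Triton.</p>
<pre><code>import numpy as np

# Online-softmax accumulation, the core idea behind FlashAttention-style kernels:
# attention is computed block by block without materializing the full score matrix.

def attention_online(q, K, V, block=64):
    m, denom = -np.inf, 0.0            # running max and softmax denominator
    acc = np.zeros(V.shape[1])         # running weighted sum of values
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q          # scores for this block of keys
        m_new = max(m, float(s.max()))
        rescale = np.exp(m - m_new)             # correct previously accumulated partials
        p = np.exp(s - m_new)
        denom = denom * rescale + p.sum()
        acc = acc * rescale + p @ V[start:start + block]
        m = m_new
    return acc / denom

rng = np.random.default_rng(0)
q, K, V = rng.standard_normal(64), rng.standard_normal((512, 64)), rng.standard_normal((512, 64))
scores = K @ q
ref = (np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()) @ V
print("max abs diff vs. naive attention:", float(np.max(np.abs(attention_online(q, K, V) - ref))))
</code></pre>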
<p><strong>The Candidate we&#39;re looking for</strong></p>
<p><strong>Experience:</strong></p>
<ul>
<li>4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python.</li>
</ul>
<p><strong>Technical skills:</strong></p>
<ul>
<li>Expertise in the CUDA programming model and NVIDIA GPU architectures (specifically Ampere/Hopper).</li>
<li>Deep understanding of the memory hierarchy (Shared Memory, L2 cache, Registers), warp-level primitives, occupancy optimization, and bank conflict resolution.</li>
</ul>
<p><strong>Personal attributes:</strong></p>
<ul>
<li>Proven ability to navigate and modify complex, large-scale codebases (e.g., PyTorch internals, Linux kernel).</li>
</ul>
<p><strong>Benefits</strong></p>
<ul>
<li>Starting January 26, 2026, Microsoft AI employees who live within a 50-mile commute of a designated Microsoft office in the U.S. or 25-mile commute of a non-U.S., country-specific location are expected to work from the office at least four days per week.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>C, C++, CUDA, Triton, PyTorch, Linux, CMake, pybind11, CI/CD, GPU workloads</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Microsoft AI</Employername>
      <Employerlogo>https://logos.yubhub.co/microsoft.ai.png</Employerlogo>
      <Employerdescription>Microsoft&apos;s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals.</Employerdescription>
      <Employerwebsite>https://microsoft.ai</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://microsoft.ai/job/senior-software-engineer-17/</Applyto>
      <Location>Beijing</Location>
      <Country></Country>
      <Postedate>2026-03-06</Postedate>
    </job>
    <job>
      <externalid>c041d54a-929</externalid>
      <Title>Internship Program</Title>
      <Description><![CDATA[<p>Perplexity is excited to announce the Internship Program for exceptional Master’s or PhD students studying Computer Science or Engineering in the UK, enrolled in the 2025-2026 academic year. This is an intensive program in which you will work directly with our AI Inference team.</p>
<p><strong>What you&#39;ll do</strong></p>
<ul>
<li>Work with the inference team to improve serving latency and throughput</li>
<li>Bring up support for new models and state-of-the-art inference optimizations or quantization schemes</li>
<li>Optimize inference across the entire stack, from GPU kernels to serving endpoints</li>
</ul>
<p><strong>What you need</strong></p>
<ul>
<li>Strong engineering track record with proven knowledge of fundamentals and programming languages (multi-threaded programming, networking, compilation, systems programming, etc)</li>
<li>Pursuing a Master&#39;s or PhD in Computer Science with a focus on performance-related subjects (HPC, Compilers, Distributed Systems)</li>
</ul>
]]></Description>
      <Jobtype>internship</Jobtype>
      <Experiencelevel>entry</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>strong engineering track record, proven knowledge of fundamentals and programming languages, pursuing a Master&apos;s or PhD in Computer Science, experience with ML frameworks (Torch, JAX), experience with GPU programming (CUDA, Triton), experience with High-Performance Computing (OpenMPI)</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Perplexity</Employername>
      <Employerlogo>https://logos.yubhub.co/perplexity.com.png</Employerlogo>
      <Employerdescription>Perplexity is a rapidly growing AI startup that has experienced tremendous growth and adoption since publicly launching the world&apos;s first fully functional conversational answer engine in 2022.</Employerdescription>
      <Employerwebsite>https://jobs.ashbyhq.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.ashbyhq.com/perplexity/79a07e2d-6150-4929-80fe-bbe13a641763</Applyto>
      <Location>London</Location>
      <Country></Country>
      <Postedate>2026-03-04</Postedate>
    </job>
    <job>
      <externalid>7917d1eb-6e2</externalid>
      <Title>Engineering Manager - Inference</Title>
      <Description><![CDATA[<p>We are looking for an Inference Engineering Manager to lead our AI Inference team. This is a unique opportunity to build and scale the infrastructure that powers Perplexity&#39;s products and APIs, serving millions of users with state-of-the-art AI capabilities.</p>
<p><strong>What you&#39;ll do</strong></p>
<p>You will own the technical direction and execution of our inference systems while building and leading a world-class team of inference engineers. Our current stack includes Python, PyTorch, Rust, C++, and Kubernetes.</p>
<ul>
<li>Lead and grow a high-performing team of AI inference engineers</li>
<li>Develop APIs for AI inference used by both internal and external customers</li>
<li>Architect and scale our inference infrastructure for reliability and efficiency</li>
</ul>
<p><strong>What you need</strong></p>
<ul>
<li>5+ years of engineering experience with 2+ years in a technical leadership or management role</li>
<li>Deep experience with ML systems and inference frameworks (PyTorch, TensorFlow, ONNX, TensorRT, vLLM)</li>
<li>Strong understanding of LLM architecture: Multi-Head Attention, Multi/Grouped-Query Attention, and common layers (see the sketch after this list)</li>
</ul>
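<p>As a small illustration of the Multi-Head versus Grouped-Query Attention distinction above, the sketch below shows how query heads map onto shared KV heads and what that sharing does to per-token KV-cache size; the head counts and head dimension are illustrative assumptions.</p>
<pre><code># How grouped-query attention (GQA) shares KV heads among query heads,
# and the per-layer KV-cache reduction that follows. Head counts are illustrative.

n_q_heads, n_kv_heads, head_dim = 32, 8, 128

# Each group of n_q_heads // n_kv_heads query heads reads the same K/V head.
group = n_q_heads // n_kv_heads
q_to_kv = {q: q // group for q in range(n_q_heads)}
print("query head -> kv head:", q_to_kv)

mha_kv_per_token = 2 * n_q_heads * head_dim    # MHA caches K and V for every query head
gqa_kv_per_token = 2 * n_kv_heads * head_dim   # GQA caches them only per KV head
print(f"KV cache per layer shrinks by {mha_kv_per_token / gqa_kv_per_token:.0f}x under GQA")
</code></pre>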
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>$300K - $405K</Salaryrange>
      <Skills>ML systems, inference frameworks, LLM architecture, CUDA, Triton, custom kernel development</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Perplexity</Employername>
      <Employerlogo>https://logos.yubhub.co/perplexity.com.png</Employerlogo>
      <Employerdescription>Perplexity is a rapidly growing company that is building and scaling the infrastructure that powers its products and APIs, serving millions of users with state-of-the-art AI capabilities.</Employerdescription>
      <Employerwebsite>https://jobs.ashbyhq.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.ashbyhq.com/perplexity/2a87ccbf-82ef-4fc7-b1ed-4dd18b11baf9</Applyto>
      <Location>San Francisco</Location>
      <Country></Country>
      <Postedate>2026-03-04</Postedate>
    </job>
  </jobs>
</source>