Senior GenAI Research Engineer - Optimization and Kernels

f0f66ce3-d78 Senior GenAI Research Engineer - Optimization and Kernels

As a research engineer on the Scaling team at Databricks, you will be responsible for keeping up with the latest developments in deep learning and advancing the scientific frontier by creating new techniques that go beyond the state of the art.

You will work on a collaborative team of researchers and engineers with diverse backgrounds and technical training. Your goal will be to make our customers successful in applying state-of-the-art LLMs and AI systems; we encode our scientific expertise into our products to make that possible.

Your responsibilities will include:

Driving performance improvements through advanced optimization techniques including kernel fusion, mixed precision, memory layout optimization, tiling strategies, and tensorization for training-specific patterns

Designing, implementing, and optimizing high-performance GPU kernels for training workloads (e.g., attention mechanisms, custom layers, gradient computation, activation functions) targeting NVIDIA architectures

Designing and implementing distributed training frameworks for large language models, including parallelism strategies (data, tensor, pipeline, ZeRO-based) and optimized communication patterns for gradient synchronization and collective operations

Profiling, debugging, and optimizing end-to-end training workflows to identify and resolve performance bottlenecks, applying memory optimization techniques like activation checkpointing, gradient sharding, and mixed precision training

We look for candidates with a strong background in computer science or a related field, hands-on experience writing and tuning CUDA kernels for ML training applications, and a deep understanding of parallelism techniques and memory optimization strategies for large-scale model training.

XML job scraping automation by YubHub

full-time senior onsite $166,000-$225,000 USD CUDA, NVIDIA GPU architecture, PyTorch, distributed training frameworks, parallelism techniques, memory optimization strategies Engineering Technology Databricks https://logos.yubhub.co/databricks.com.png Databricks is a data and AI company that provides a unified platform for data, analytics, and AI. It was founded by the original creators of Lakehouse, Apache Spark, Delta Lake, and MLflow. https://databricks.com https://job-boards.greenhouse.io/databricks/jobs/8297797002 San Francisco, California 2026-04-18

cba88898-896 Research Engineer, Infrastructure, Kernels

We're looking for an infrastructure research engineer to design, optimize, and maintain the compute foundations that power large-scale language model training. You will develop high-performance ML kernels (e.g., CUDA, CuTe, Triton), enable efficient low-precision arithmetic, and improve the distributed compute stack that makes training large models possible.

This role is perfect for an engineer who enjoys working close to the metal and across the research boundary. You'll collaborate with researchers and systems architects to bridge algorithmic design with hardware efficiency. You'll prototype new kernel implementations, profile performance across hardware generations, and help define the numerical and parallelism strategies that determine how we scale next-generation AI systems.

Responsibilities

Design and implement custom ML kernels (e.g., CUDA, CuTe, Triton) for core LLM operations such as attention, matrix multiplication, gating, and normalization, optimized for modern GPU and accelerator architectures.
Design and think through compute primitives to reduce memory bandwidth bottlenecks and improve kernel compute efficiency.
Collaborate with research teams to align kernel-level optimizations with model architecture and algorithmic goals.
Develop and maintain a library of reusable kernels and performance benchmarks that serve as the foundation for internal model training.
Contribute to infrastructure stability and scalability, ensuring reproducibility, consistency across precision formats, and high utilization of compute resources.
Document and share insights through internal talks, technical papers, or open-source contributions to strengthen the broader ML systems community.

Skills and Qualifications

Minimum qualifications:

Bachelor’s degree or equivalent experience in computer science, electrical engineering, statistics, machine learning, physics, robotics, or similar.
Strong engineering skills, ability to contribute performant, maintainable code and debug in complex codebases
Understanding of deep learning frameworks (e.g., PyTorch, JAX) and their underlying system architectures.
Thrive in a highly collaborative environment involving many different cross-functional partners and subject matter experts.
A bias for action, with the initiative to work across different stacks and teams wherever you spot an opportunity to make sure something ships.
Proficiency in CUDA, CuTe, Triton, or other GPU programming frameworks.
Demonstrated ability to analyze, profile, and optimize compute-intensive workloads.

Preferred qualifications:

Experience training or supporting large-scale language models with tens of billions of parameters or more.
Track record of improving research productivity through infrastructure design or process improvements.
Experience developing or tuning kernels for deep learning frameworks such as PyTorch, JAX, or custom accelerators.
Familiarity with tensor parallelism, pipeline parallelism, or distributed data processing frameworks.
Experience implementing low-precision formats (FP8, INT8, block floating point) or contributing to related compiler stacks (e.g., XLA, TVM).
Contributions to open-source GPU, ML systems, or compiler optimization projects.
Prior research or engineering experience in numerical optimization, communication-efficient training, or scalable AI infrastructure.

full-time senior onsite $350,000 - $475,000 USD CUDA, CuTe, Triton, GPU programming frameworks, Deep learning frameworks (e.g., PyTorch, JAX), Computer science, Electrical engineering, Statistics, Machine learning, Physics, Robotics, Experience training or supporting large-scale language models with tens of billions of parameters or more, Track record of improving research productivity through infrastructure design or process improvements, Experience developing or tuning kernels for deep learning frameworks such as PyTorch, JAX, or custom accelerators, Familiarity with tensor parallelism, pipeline parallelism, or distributed data processing frameworks, Experience implementing low-precision formats (FP8, INT8, block floating point) or contributing to related compiler stacks (e.g., XLA, TVM), Contributions to open-source GPU, ML systems, or compiler optimization projects, Prior research or engineering experience in numerical optimization, communication-efficient training, or scalable AI infrastructure Engineering Technology Thinking Machines Lab https://logos.yubhub.co/thinkingmachines.ai.png Thinking Machines Lab is a technology company whose team includes people who helped create widely used AI products, including ChatGPT and Character.ai, and open-source projects like PyTorch. https://thinkingmachines.ai/ https://job-boards.greenhouse.io/thinkingmachines/jobs/5013934008 San Francisco 2026-04-18

dc17980d-461 Research Engineer, Interpretability

JOB TITLE: Research Engineer, Interpretability
LOCATION: San Francisco, CA
DEPARTMENT: AI Research & Engineering

JOB DESCRIPTION:

When you see what modern language models are capable of, do you wonder, "How do these things work? How can we trust them?"

The Interpretability team at Anthropic is working to reverse-engineer how trained models work because we believe that a mechanistic understanding is the most robust way to make advanced systems safe.
Think of us as doing "neuroscience" of neural networks using "microscopes" we build - or reverse-engineering neural networks like binary programs.

More resources to learn about our work:
- Our research blog - covering advances including Monosemantic Features and Circuits
- An Introduction to Interpretability from our research lead, Chris Olah
- The Urgency of Interpretability from CEO Dario Amodei
- Engineering Challenges Scaling Interpretability - directly relevant to this role
- 60 Minutes segment - Around 8:07, see a demo of tooling our team built
- New Yorker article - what it's like to work on one of AI's hardest open problems

Even if you haven't worked on interpretability before, the infrastructure expertise is similar to what's needed across the lifecycle of a production language model:
- Pretraining: Training dictionary learning models looks a lot like model pretraining - creating stable, performant training jobs for massively parameterized models across thousands of chips
- Inference: Interp runs a customized inference stack. Day-to-day analysis requires services that allow editing a model's internal activations mid-forward-pass - for example, adding a "steering vector"
- Performance: Like all LLM work, we push up against the limits of hardware and software. Rather than squeezing the last 0.1%, we are focused on finding bottlenecks, fixing them and moving ahead given rapidly evolving research and safety mission

The science keeps scaling - and it's now applied directly in safety audits on frontier models, with real deadlines. As our research has matured, engineering and infrastructure have become a bottleneck. Your work will have a direct impact on one of the most important open problems in AI.

RESPONSIBILITIES:
- Build and maintain the specialized inference and training infrastructure that powers interpretability research - including instrumented forward/backward passes, activation extraction, and steering vector application
- Resolve scaling and efficiency bottlenecks through profiling, optimization, and close collaboration with peer infrastructure teams
- Design tools, abstractions, and platforms that enable researchers to rapidly experiment without hitting engineering barriers
- Help bring interpretability research into production safety audits - with real deadlines and high reliability expectations
- Work across the stack - from model internals and accelerator-level optimization to user-facing research tooling

YOU MAY BE A GOOD FIT IF YOU:
- Have 5-10+ years of experience building software
- Are highly proficient in at least one programming language (e.g., Python, Rust, Go, Java) and productive with Python
- Are extremely curious about unfamiliar domains; can quickly learn and put that knowledge to work, e.g. diving into new layers of the stack to find bottlenecks
- Have a strong ability to prioritize the most impactful work and are comfortable operating with ambiguity and questioning assumptions
- Prefer fast-moving collaborative projects to extensive solo efforts
- Are curious about interpretability research and its role in AI safety (though no research experience is required!)
- Care about the societal impacts and ethics of your work
- Are comfortable working closely with researchers, translating research needs into engineering solutions.

STRONG CANDIDATES MAY ALSO HAVE EXPERIENCE WITH:
- Optimizing the performance of large-scale distributed systems
- Language modeling fundamentals with transformers
- High Performance LLM optimization: memory management, compute efficiency, parallelism strategies, inference throughput optimization
- Working hands-on in a mainstream ML stack - PyTorch/CUDA on GPUs or JAX/XLA on TPUs
- Collaborating closely with researchers and building tooling to support research teams; or directly performed research with complex engineering challenges

REPRESENTATIVE PROJECTS:
- Building Garcon, a tool that allows researchers to easily instrument LLMs to extract internal activations
- Designing and optimizing a pipeline to efficiently collect petabytes of transformer activations and shuffle them
- Profiling and optimizing ML training jobs, including multi-GPU parallelism and memory optimization
- Building a steered inference system that applies targeted interventions to model internals at scale (conceptually similar to Golden Gate Claude but for safety research)

ROLE SPECIFIC LOCATION POLICY:
- This role is based in the San Francisco office; however, we are open to considering exceptional candidates for remote work on a case-by-case basis.

The annual compensation range for this role is listed below.
For sales roles, the range provided is the role's On Target Earnings ("OTE") range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role.
Annual Salary: $315,000-$560,000 USD

full-time senior hybrid $315,000-$560,000 USD Python, Rust, Go, Java, PyTorch, CUDA, JAX, XLA, High Performance LLM optimization, memory management, compute efficiency, parallelism strategies, inference throughput optimization, large-scale distributed systems, language modeling fundamentals, transformers, collaborating closely with researchers, building tooling to support research teams Engineering Technology Anthropic https://logos.yubhub.co/anthropic.com.png Anthropic is a company that creates reliable, interpretable, and steerable AI systems. https://www.anthropic.com/ https://job-boards.greenhouse.io/anthropic/jobs/4980430008 San Francisco, CA 2026-04-18

19c6b9e4-ff6 Foundation and generative models for biomolecules

At Inceptive, you will drive forward development that could help billions of people. You will be part of a collaborative, interdisciplinary team building our biological software.

The design space of biomolecules is unimaginably vast, far beyond what can be explored experimentally. Yet within this space lie molecules with properties essential for new medicines. Our machine learning models learn to design therapeutic biomolecules with specific, desirable functions.

We advance the state of the art in molecular design by training large-scale foundation models and developing cutting-edge generative approaches. The models learn from diverse heterogeneous datasets and are refined through focused fine-tuning and feedback from experiments. Key to progress is a team that combines exceptional machine learning expertise with thorough domain understanding.

You will collaborate closely with other machine learning researchers and engineers, as well as computational and experimental biologists, to advance these models and translate their capabilities into real therapeutic designs.

Responsibilities

Embody our vision of an interdisciplinary environment and embrace learning about areas outside of your traditional area of expertise

Develop, implement, train, and iteratively improve state-of-the-art models for biomolecule design

Analyze, visualize, and communicate results to support team efforts in improving models and data

Create, deploy, and refine tools for efficient, reliable machine learning experimentation and production

Work with biologists to collect data for the training and evaluation of generative models of biomolecules

Provide mentorship and technical direction to team members as appropriate

Qualifications

3+ years of hands-on experience developing ML models

Demonstrated track record of implementing, training, and improving advanced machine learning models

Highly capable programmer fluent in the Python ecosystem and PyTorch or a similar deep learning framework

Availability to work with team members across US and Europe, with meetings starting at 8am PT and ending at 7pm CET

Readiness to travel several times a year for company retreats and business events

Compensation

$200K – $275K + Bonus + Equity

Benefits

A competitive compensation package

30 days paid vacation per year

Comprehensive health insurance for US-based employees

401(k) with company match for US-based employees and Direktversicherung for German employees

Quarterly company-wide retreats

Monthly wellness benefit

Budget for multiple visits per year to our offices in Berlin, Palo Alto or Switzerland

Learning & Development budget to attend conferences, take courses, or otherwise invest in your professional growth, as well as access to the Learning & Development platforms edX and Hone

A buddy to help you get settled

At Inceptive, we are creating tools to develop increasingly powerful biological software for the rational design of novel, broadly accessible medicines and biotechnologies previously out of reach. Our team brings together vast expertise in molecular biology, machine learning, and software engineering, and we are all working towards becoming interdisciplinary, meaning we deepen the knowledge we have in our area of expertise while also expanding our knowledge of completely new fields.

full-time entry|mid|senior|staff|executive onsite $200K – $275K + Bonus + Equity Python, PyTorch, Machine Learning, Deep Learning, Biological Software, Molecular Design, Generative Models, Domain Understanding, Interdisciplinary Teamwork, PhD in AI/ML, computer science, computational biology, physics, or a related field, Strong skills in designing, executing, and documenting machine learning experiments, Practical experience with modern generative models, Strong software engineering skills, in particular for data processing, evaluation of ML models, compute cluster orchestration, Experience with large-scale model training, foundation models, model parallelism, multi-node training, Experience with bio sequence data and datasets — various genomic and protein data, sequencing, functional assays, etc, Knowledge of biochemistry, molecular/cell biology, and drug development Engineering Technology Inceptive https://logos.yubhub.co/inceptive.com.png Inceptive is a company creating tools to develop increasingly powerful biological software for the rational design of novel, broadly accessible medicines and biotechnologies. https://inceptive.com https://job-boards.greenhouse.io/inceptive/jobs/4961579007 Berlin, Germany or Palo Alto, CA or Zurich, Switzerland 2026-04-18

2bc6ae79-8ee Staff Technical Lead for Inference & ML Performance

We're looking for a Staff Technical Lead for Inference & ML Performance to guide a team in building and optimizing state-of-the-art inference systems. This role is intense yet deeply impactful.

You'll shape the future of fal's inference engine and ensure our generative models achieve best-in-class performance. Your work directly impacts our ability to rapidly deliver cutting-edge creative solutions to users, from individual creators to global brands.

Day-to-day, you'll set technical direction, guide your team to build high-performance inference solutions, and personally contribute to critical inference performance enhancements and optimizations. You'll collaborate closely with research & applied ML teams, influence model inference strategies and deployment techniques, and drive advanced performance optimizations.

As a leader, you'll mentor, coach, and grow your team of performance-focused engineers, helping them innovate, solve complex performance challenges, and level up their skills.

To succeed in this role, you'll need to be deeply experienced in ML performance optimization, understand the full ML performance stack, and know inference inside-out. You'll also need to thrive in cross-functional collaboration and have excellent leadership skills.

If you're ready to lead the future of inference performance at a fast-paced, high-growth frontier, apply now!

full-time staff onsite ML performance optimization, PyTorch, TensorRT, TransformerEngine, Triton, CUTLASS kernels, Quantization, Kernel authoring, Compilation, Model parallelism, Distributed serving, Profiling Engineering Technology fal https://logos.yubhub.co/fal.com.png fal is a fast-growing company pioneering the next generation of generative-media infrastructure. https://fal.com https://job-boards.greenhouse.io/fal/jobs/4012780009 San Francisco 2026-04-18

c078633c-28c Senior Engineer, Core API - W&B

You will be responsible for building and evolving the core backend systems and shared infrastructure that power our platform.

A significant portion of backend logic is shared across services, and this role will help define, maintain, and scale that foundation.

You will own and improve internal schema and code generation tooling that ensures consistency and correctness across services.

You will work on and extend our custom job scheduler, improving reliability, observability, and execution guarantees for distributed workloads.

You will contribute to safely executing large-scale concurrent and distributed operations.

You will play a key role in defining and maintaining API standards across teams, ensuring performance, backward compatibility, and clear evolution strategies.

You will collaborate closely with Product and various Engineering teams to design systems that are reliable, scalable, and maintainable over time.

The Core Systems team is responsible for the foundational backend infrastructure that powers Weights & Biases within CoreWeave.

Much of the platform's critical logic is shared across services, and this role sits at the center of that foundation.

You will work on the systems that other engineers build upon, from execution frameworks and schedulers to schema tooling and API standards.

This is a high-leverage role focused on durability, scalability, and long-term maintainability.

The systems you design and evolve will directly impact reliability, developer velocity, and the ability of the platform to scale with growing workloads.

You'll collaborate across teams to ensure that shared backend abstractions remain clean, performant, and consistent as we continue to expand our adoption of technologies like GraphQL and gRPC.

If you enjoy owning deep technical infrastructure, shaping engineering standards, and building systems that other engineers depend on every day, this role offers meaningful scope and impact.

You will be surrounded by some of the best talent in the industry, who will want to learn from you, too.

Come join us!

full-time senior hybrid $165,000 to $242,000 backend engineering experience, designing and maintaining distributed systems, hands-on experience designing and evolving APIs, strong proficiency in Go, Python, or a comparable backend systems language, experience implementing concurrency and parallelism patterns in production systems, familiarity with schema management, code generation tools, or interface definition systems, experience building or operating custom job schedulers, workflow engines, or execution frameworks, experience defining cross-team API standards and governance models, background in high-scale data or ML infrastructure systems, experience improving reliability through observability, metrics, and SLO-driven development practices Engineering Technology CoreWeave https://logos.yubhub.co/coreweave.com.png CoreWeave is a cloud computing company that provides a platform for building and scaling AI applications. https://www.coreweave.com https://job-boards.greenhouse.io/coreweave/jobs/4658736006 Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA 2026-04-18

97212bdf-dd1 Research Engineer, Interpretability

Job Title: Research Engineer, Interpretability

About the Role:

When you see what modern language models are capable of, do you wonder, "How do these things work? How can we trust them?" The Interpretability team at Anthropic is working to reverse-engineer how trained models work because we believe that a mechanistic understanding is the most robust way to make advanced systems safe.

Think of us as doing "neuroscience" of neural networks using "microscopes" we build - or reverse-engineering neural networks like binary programs.

More resources to learn about our work:

Our research blog - covering advances including Monosemantic Features and Circuits

An Introduction to Interpretability from our research lead, Chris Olah

The Urgency of Interpretability from CEO Dario Amodei

Engineering Challenges Scaling Interpretability - directly relevant to this role

60 Minutes segment - Around 8:07, see a demo of tooling our team built

New Yorker article - what it's like to work on one of AI's hardest open problems

Even if you haven't worked on interpretability before, the infrastructure expertise is similar to what's needed across the lifecycle of a production language model:

Pretraining: Training dictionary learning models looks a lot like model pretraining - creating stable, performant training jobs for massively parameterized models across thousands of chips

Inference: Interp runs a customized inference stack. Day-to-day analysis requires services that allow editing a model's internal activations mid-forward-pass - for example, adding a "steering vector"

Performance: Like all LLM work, we push up against the limits of hardware and software. Rather than squeezing the last 0.1%, we are focused on finding bottlenecks, fixing them and moving ahead given rapidly evolving research and safety mission

The science keeps scaling - and it's now applied directly in safety audits on frontier models, with real deadlines. As our research has matured, engineering and infrastructure have become a bottleneck. Your work will have a direct impact on one of the most important open problems in AI.

Responsibilities:

Build and maintain the specialized inference and training infrastructure that powers interpretability research - including instrumented forward/backward passes, activation extraction, and steering vector application

Resolve scaling and efficiency bottlenecks through profiling, optimization, and close collaboration with peer infrastructure teams

Design tools, abstractions, and platforms that enable researchers to rapidly experiment without hitting engineering barriers

Help bring interpretability research into production safety audits - with real deadlines and high reliability expectations

Work across the stack - from model internals and accelerator-level optimization to user-facing research tooling

You may be a good fit if you:

Have 5-10+ years of experience building software

Are highly proficient in at least one programming language (e.g., Python, Rust, Go, Java) and productive with Python

Are extremely curious about unfamiliar domains; can quickly learn and put that knowledge to work, e.g. diving into new layers of the stack to find bottlenecks

Have a strong ability to prioritize the most impactful work and are comfortable operating with ambiguity and questioning assumptions

Prefer fast-moving collaborative projects to extensive solo efforts

Are curious about interpretability research and its role in AI safety (though no research experience is required!)

Care about the societal impacts and ethics of your work

Are comfortable working closely with researchers, translating research needs into engineering solutions.

Strong candidates may also have experience with:

Optimizing the performance of large-scale distributed systems

Language modeling fundamentals with transformers

High Performance LLM optimization: memory management, compute efficiency, parallelism strategies, inference throughput optimization

Working hands-on in a mainstream ML stack - PyTorch/CUDA on GPUs or JAX/XLA on TPUs

Collaborating closely with researchers and building tooling to support research teams; or directly performed research with complex engineering challenges

Representative Projects:

Building Garcon, a tool that allows researchers to easily instrument LLMs to extract internal activations

Designing and optimizing a pipeline to efficiently collect petabytes of transformer activations and shuffle them

Profiling and optimizing ML training jobs, including multi-GPU parallelism and memory optimization

Building a steered inference system that applies targeted interventions to model internals at scale (conceptually similar to Golden Gate Claude but for safety research)

Role Specific Location Policy:

This role is based in the San Francisco office; however, we are open to considering exceptional candidates for remote work on a case-by-case basis.

The annual compensation range for this role is listed below.

For sales roles, the range provided is the role's On Target Earnings ("OTE") range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role.

Annual Salary: $315,000-$560,000 USD

full-time senior hybrid $315,000-$560,000 USD Python, Rust, Go, Java, PyTorch, CUDA, JAX, XLA, Transformers, High Performance LLM optimization, Memory management, Compute efficiency, Parallelism strategies, Inference throughput optimization, Optimizing the performance of large-scale distributed systems, Language modeling fundamentals, Collaborating closely with researchers and building tooling to support research teams Engineering Technology Anthropic https://logos.yubhub.co/anthropic.com.png Anthropic is a company that creates reliable, interpretable, and steerable AI systems. https://www.anthropic.com/ https://job-boards.greenhouse.io/anthropic/jobs/4980430008 San Francisco, CA 2026-04-18

71554e46-b64 Senior Engineering Manager, AI Runtime

At Databricks, we are committed to enabling data teams to solve the world's toughest problems. As a Senior Engineering Manager, you will lead the team owning both the product experience and the foundational infrastructure of our AI Runtime (AIR) product.

You will be responsible for shaping customer-facing capabilities while designing for scalability, extensibility, and performance of GPU training and adjacent areas. This will involve collaborating closely across the platform, product, infrastructure, and research organisations.

Key responsibilities include:

Leading, mentoring, and growing a high-performing engineering team responsible for the Custom Training product and its foundational infrastructure
Defining and owning the product and technical roadmap for AIR, balancing customer experience, functionality, and foundational investments
Collaborating closely with product, research, platform, infrastructure teams, and customers to drive end-to-end delivery
Driving architectural decisions and product design for managed GPU training at scale
Advocating for customer needs through direct engagement, ensuring engineering decisions translate to clear product impact

We are looking for someone with 8+ years of software engineering experience, with 3+ years in engineering management. You should have a track record of building and operating managed GPU training infrastructure at scale, as well as deep familiarity with distributed training frameworks and parallelism strategies.

In addition, you should have experience with training resilience patterns, such as checkpointing, elastic training, and automated failure recovery for long-running jobs. You should also have a strong understanding of GPU performance fundamentals, including NCCL, interconnect topologies, and memory optimisation.

Experience building platform products with clear SLAs is also essential, as is strong cross-functional leadership across platform, product, and research teams. Excellent collaboration and communication skills are also required.

The pay range for this role is $228,600-$314,250 USD per year, depending on location. The total compensation package may also include eligibility for annual performance bonus, equity, and benefits.

full-time senior onsite $228,600-$314,250 USD per year software engineering, engineering management, distributed training frameworks, parallelism strategies, GPU training infrastructure, checkpointing, elastic training, automated failure recovery, GPU performance fundamentals, NCCL, interconnect topologies, memory optimisation Engineering Technology Databricks https://logos.yubhub.co/databricks.com.png Databricks is a data and AI company that provides a unified platform for data, analytics, and AI. It was founded by the original creators of Lakehouse, Apache Spark, Delta Lake, and MLflow. https://databricks.com https://job-boards.greenhouse.io/databricks/jobs/8490282002 Mountain View, California; San Francisco, California 2026-04-18

28107212-128 Performance Engineer, GPU

As a GPU Performance Engineer at Anthropic, you will be responsible for architecting and implementing the foundational systems that power Claude and push the frontiers of what's possible with large language models. You will maximize GPU utilization and performance at unprecedented scale, develop cutting-edge optimizations that directly enable new model capabilities, and dramatically improve inference efficiency.

Working at the intersection of hardware and software, you will implement state-of-the-art techniques from custom kernel development to distributed system architectures. Your work will span the entire stack, from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization.

Strong candidates will have a track record of delivering transformative GPU performance improvements in production ML systems and will be excited to shape the future of AI infrastructure alongside world-class researchers and engineers.

Responsibilities:

Architect and implement foundational systems that power Claude
Maximize GPU utilization and performance at unprecedented scale
Develop cutting-edge optimizations that directly enable new model capabilities
Dramatically improve inference efficiency
Implement state-of-the-art techniques from custom kernel development to distributed system architectures
Work at the intersection of hardware and software
Span the entire stack, from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization

Requirements:

Deep experience with GPU programming and optimization at scale
Impact-driven, passionate about delivering measurable performance breakthroughs
Ability to navigate complex systems from hardware interfaces to high-level ML frameworks
Enjoy collaborative problem-solving and pair programming
Want to work on state-of-the-art language models with real-world impact
Care about the societal impacts of your work
Thrive in ambiguous environments where you define the path forward

Nice to have:

Experience with GPU Kernel Development: CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization
ML Compilers & Frameworks: PyTorch/JAX internals, torch.compile, XLA, custom operators
Performance Engineering: Kernel fusion, memory bandwidth optimization, profiling with Nsight
Distributed Systems: NCCL, NVLink, collective communication, model parallelism
Low-Precision: INT8/FP8 quantization, mixed-precision techniques
Production Systems: Large-scale training infrastructure, fault tolerance, cluster orchestration

Representative projects:

Co-design attention mechanisms and algorithms for next-generation hardware architectures
Develop custom kernels for emerging quantization formats and mixed-precision techniques
Design distributed communication strategies for multi-node GPU clusters
Optimize end-to-end training and inference pipelines for frontier language models
Build performance modeling frameworks to predict and optimize GPU utilization
Implement kernel fusion strategies to minimize memory bandwidth bottlenecks
Create resilient systems for planet-scale distributed training infrastructure
Profile and eliminate performance bottlenecks in production serving infrastructure
Partner with hardware vendors to influence future accelerator capabilities and software stacks

Note: The salary range for this position is $280,000-$850,000 USD per year.

]]> full-time senior hybrid $280,000-$850,000 USD per year GPU programming, optimization at scale, CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization, PyTorch/JAX internals, torch.compile, XLA, custom operators, kernel fusion, memory bandwidth optimization, profiling with Nsight, NCCL, NVLink, collective communication, model parallelism, INT8/FP8 quantization, mixed-precision techniques, large-scale training infrastructure, fault tolerance, cluster orchestration Engineering Technology Anthropic https://logos.yubhub.co/anthropic.com.png Anthropic is a public benefit corporation that creates reliable, interpretable, and steerable AI systems. https://www.anthropic.com/ https://job-boards.greenhouse.io/anthropic/jobs/4926227008 San Francisco, CA | New York City, NY | Seattle, WA 2026-04-18 7b2b97d5-0a1 Software Engineer, Inference Deployment About Anthropic

Anthropic's mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

About the Role

Our mandate is to make inference deployment boring and unattended.

Anthropic serves Claude to millions of users across GPUs, TPUs, and Trainium — and every model update must reach production safely, quickly, and without disrupting service. We're building the systems that make inference deployment continuous and unattended.

As a Software Engineer on the Launch Engineering team, you'll design and build the deployment infrastructure that moves inference code from merge to production. This is a resource-constrained optimization problem at its core: validation and deployment consume the same accelerator chips that serve customer traffic — your deploys compete with live user requests for the same hardware. Every model brings different fleet sizes, startup times, and correctness requirements, so the system must adapt continuously. You'll build systems that navigate these constraints — orchestrating validation, scheduling deployments intelligently, and driving down cycle time from merge to production.

If you've built deployment systems at scale and gravitate toward the hardest problems at the intersection of automation and resource management, this team will give you an outsized scope to work on them.

Responsibilities

Own deployment orchestration that continuously moves validated inference builds into production across GPU, TPU, and Trainium fleets, unattended under normal conditions
Improve capacity-aware deployment scheduling to maximize deployment throughput against constrained accelerator budgets and variable fleet sizes
Extend deployment observability — dashboards and tooling that answer "what code is running in production," "where is my commit," and "what validation passed for this deploy"
Drive down cycle time from code merge to production with pipeline architectures that minimize serial dependencies and maximize parallelism
Optimize fleet rollout strategies for large-scale deployments across thousands of GPU, TPU, and Trainium chips, minimizing disruption to serving capacity
Evolve self-service model onboarding so that new models can be added to the continuous deployment pipeline without Launch Engineering involvement
Partner across the Inference organization with teams owning validation, autoscaling, and model routing to integrate deployment automation with their systems

You May Be a Good Fit If You Have

5+ years of experience building deployment, release, or delivery infrastructure at scale
Strong software engineering skills with experience designing systems that manage complex state machines and multi-stage pipelines
Experience with deployment systems where resource constraints shape the design — whether that's fleet capacity, network bandwidth, hardware availability, or coordinated rollout windows
A track record of building automation that measurably improves deployment velocity and reliability
Proficiency with Kubernetes-based deployments, rolling update mechanics, and container orchestration
Comfort working across the stack — from backend services and databases to CLI tools and web UIs
Strong communication skills and the ability to work closely with oncall engineers, model teams, and infrastructure partners

Strong Candidates May Also Have

Experience with ML inference or training infrastructure deployment, particularly across multiple accelerator types (GPU, TPU, Trainium)
Background in capacity planning or resource-constrained scheduling (e.g., bin-packing, fleet management, job scheduling with hardware affinity)
Experience with progressive delivery in systems with long validation cycles: canary/soak testing, blue-green deployments, traffic shifting, automated rollback
Experience at companies with large-scale release engineering challenges (mobile release trains, monorepo deployments, multi-datacenter rollouts)
Experience with Python and/or Rust in production systems

Logistics

Education requirements: We require at least a Bachelor's degree in a related field or equivalent experience.

Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.

Visa sponsorship: We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.

We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you're interested in this work.

]]> full-time senior hybrid $320,000 - $485,000 USD deployment, release, delivery, infrastructure, Kubernetes, container, orchestration, pipelines, state machines, multi-stage, pipelines, parallelism, optimization, resource management, automation, velocity, reliability, communication, collaboration, oncall, model teams, infrastructure partners, ML inference, training infrastructure, capacity planning, resource-constrained scheduling, bin-packing, fleet management, job scheduling, hardware affinity, progressive delivery, canary/soak testing, blue-green deployments, traffic shifting, automated rollback, mobile release trains, monorepo deployments, multi-datacenter rollouts, Python, Rust Engineering Technology Anthropic https://logos.yubhub.co/anthropic.com.png Anthropic's mission is to create reliable, interpretable, and steerable AI systems. The company is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems. https://job-boards.greenhouse.io https://job-boards.greenhouse.io/anthropic/jobs/5111745008 San Francisco, CA | New York City, NY | Seattle, WA 2026-03-08 11a60d5a-f54 Performance Engineer, GPU About the role:

Pioneering the next generation of AI requires breakthrough innovations in GPU performance and systems engineering. As a GPU Performance Engineer, you'll architect and implement the foundational systems that power Claude and push the frontiers of what's possible with large language models. You'll be responsible for maximizing GPU utilization and performance at unprecedented scale, developing cutting-edge optimizations that directly enable new model capabilities and dramatically improve inference efficiency.

Working at the intersection of hardware and software, you'll implement state-of-the-art techniques from custom kernel development to distributed system architectures. Your work will span the entire stack—from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization.

You might be a good fit if you:

Have deep experience with GPU programming and optimization at scale
Are impact-driven, passionate about delivering measurable performance breakthroughs
Can navigate complex systems from hardware interfaces to high-level ML frameworks
Enjoy collaborative problem-solving and pair programming
Want to work on state-of-the-art language models with real-world impact
Care about the societal impacts of your work
Thrive in ambiguous environments where you define the path forward

Strong candidates may also have experience with:

GPU Kernel Development: CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization
ML Compilers & Frameworks: PyTorch/JAX internals, torch.compile, XLA, custom operators
Performance Engineering: Kernel fusion, memory bandwidth optimization, profiling with Nsight
Distributed Systems: NCCL, NVLink, collective communication, model parallelism
Low-Precision: INT8/FP8 quantization, mixed-precision techniques
Production Systems: Large-scale training infrastructure, fault tolerance, cluster orchestration

Representative projects:

Co-design attention mechanisms and algorithms for next-generation hardware architectures
Develop custom kernels for emerging quantization formats and mixed-precision techniques
Design distributed communication strategies for multi-node GPU clusters
Optimize end-to-end training and inference pipelines for frontier language models
Build performance modeling frameworks to predict and optimize GPU utilization
Implement kernel fusion strategies to minimize memory bandwidth bottlenecks
Create resilient systems for planet-scale distributed training infrastructure
Profile and eliminate performance bottlenecks in production serving infrastructure
Partner with hardware vendors to influence future accelerator capabilities and software stacks

Deadline to apply: None. Applications will be reviewed on a rolling basis.

The expected salary range for this position is:

Annual Salary: $280,000 - $850,000 USD

]]> full-time senior hybrid $280,000 - $850,000 USD GPU programming, optimization at scale, custom kernel development, distributed system architectures, low-level tensor core optimizations, orchestrating thousands of GPUs, GPU kernel development, CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization, ML compilers & frameworks, PyTorch/JAX internals, torch.compile, XLA, custom operators, performance engineering, kernel fusion, memory bandwidth optimization, profiling with Nsight, distributed systems, NCCL, NVLink, collective communication, model parallelism, low-precision, INT8/FP8 quantization, mixed-precision techniques, production systems, large-scale training infrastructure, fault tolerance, cluster orchestration Engineering Technology Anthropic https://logos.yubhub.co/anthropic.com.png Anthropic's mission is to create reliable, interpretable, and steerable AI systems. The company is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems. https://job-boards.greenhouse.io https://job-boards.greenhouse.io/anthropic/jobs/4926227008 San Francisco, CA | New York City, NY | Seattle, WA 2026-03-08 d3a39f4c-d95 Software Engineer, Inference - Multi Modal Software Engineer, Inference - Multi Modal

Location

San Francisco

Employment Type

Full time

Department

Scaling

Compensation

$295K – $555K • Offers Equity

The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts

Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)

401(k) retirement plan with employer match

Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)

Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees

13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)

Mental health and wellness support

Employer-paid basic life and disability coverage

Annual learning and development stipend to fuel your professional growth

Daily meals in our offices, and meal delivery credits as eligible

Relocation support for eligible employees

Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.

More details about our benefits are available to candidates during the hiring process.

This role is at-will and OpenAI reserves the right to modify base pay and other compensation components at any time based on individual performance, team or company results, or market conditions.

About the Team

OpenAI’s Inference team powers the deployment of our most advanced models - including our GPT models, 4o Image Generation, and Whisper - across a variety of platforms. Our work ensures these models are available, performant, and scalable in production, and we partner closely with Research to bring the next generation of models into the world. We're a small, fast-moving team of engineers focused on delivering a world-class developer experience while pushing the boundaries of what AI can do.

We’re expanding into multimodal inference, building the infrastructure needed to serve models that handle image, audio, and other non-text modalities. These workloads are inherently more heterogeneous and experimental, involving diverse model sizes and interactions, more complex input/output formats, and tighter coordination with product and research.

About the Role

We’re looking for a software engineer to help us serve OpenAI’s multimodal models at scale. You’ll be part of a small team responsible for building reliable, high-performance infrastructure for serving real-time audio, image, and other multimodal workloads in production.

This work is inherently cross-functional: you’ll collaborate directly with researchers training these models and with product teams defining new modalities of interaction. You'll build and optimize the systems that let users generate speech, understand images, and interact with models in ways far beyond text.

In this role, you will:

Design and implement inference infrastructure for large-scale multimodal models.

Optimize systems for high-throughput, low-latency delivery of image and audio inputs and outputs.

Enable experimental research workflows to transition into reliable production services.

Collaborate closely with researchers, infra teams, and product engineers to deploy state-of-the-art capabilities.

Contribute to system-level improvements including GPU utilization, tensor parallelism, and hardware abstraction layers.

You might thrive in this role if you:

Have experience building and scaling inference systems for LLMs or multimodal models.

Have worked with GPU-based ML workloads and understand the performance dynamics of large models, especially with complex data like images or audio.

Enjoy experimental, fast-evolving work and collaborating closely with research.

Are comfortable dealing with systems that span networking, distributed compute, and high-throughput data handling.

Have familiarity with inference tooling like vLLM, TensorRT-LLM, or custom model parallel systems.

Own problems end-to-end and are excited to operate in ambiguous, fast-moving spaces.

Nice to Have:

Experience working with image generation or audio synthesis models in production.

Exposure to distributed ML training or system-efficient model design.

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

]]> full-time mid onsite $295K – $555K • Offers Equity Software Engineer, Inference Infrastructure, GPU-based ML Workloads, Tensor Parallelism, Hardware Abstraction Layers, vLLM, TensorRT-LLM, Custom Model Parallel Systems, Image Generation, Audio Synthesis, Distributed ML Training, System-Efficient Model Design Engineering Technology OpenAI https://logos.yubhub.co/openai.com.png OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. https://jobs.ashbyhq.com https://jobs.ashbyhq.com/openai/4d14449e-5e7f-45d4-b103-8776a6c87086 San Francisco 2026-03-06