<?xml version="1.0" encoding="UTF-8"?>
<source>
  <jobs>
    <job>
      <externalid>28107212-128</externalid>
      <Title>Performance Engineer, GPU</Title>
      <Description><![CDATA[<p>As a GPU Performance Engineer at Anthropic, you will be responsible for architecting and implementing the foundational systems that power Claude and push the frontiers of what&#39;s possible with large language models. You will maximize GPU utilization and performance at unprecedented scale, develop cutting-edge optimizations that directly enable new model capabilities, and dramatically improve inference efficiency.</p>
<p>Working at the intersection of hardware and software, you will implement state-of-the-art techniques from custom kernel development to distributed system architectures. Your work will span the entire stack, from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization.</p>
<p>Strong candidates will have a track record of delivering transformative GPU performance improvements in production ML systems and will be excited to shape the future of AI infrastructure alongside world-class researchers and engineers.</p>
<p>Responsibilities:</p>
<ul>
<li>Architect and implement foundational systems that power Claude</li>
<li>Maximize GPU utilization and performance at unprecedented scale</li>
<li>Develop cutting-edge optimizations that directly enable new model capabilities</li>
<li>Dramatically improve inference efficiency</li>
<li>Implement state-of-the-art techniques from custom kernel development to distributed system architectures</li>
<li>Work at the intersection of hardware and software</li>
<li>Span the entire stack, from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization</li>
</ul>
<p>Requirements:</p>
<ul>
<li>Deep experience with GPU programming and optimization at scale</li>
<li>An impact-driven mindset and a passion for delivering measurable performance breakthroughs</li>
<li>The ability to navigate complex systems from hardware interfaces to high-level ML frameworks</li>
<li>Enthusiasm for collaborative problem-solving and pair programming</li>
<li>A desire to work on state-of-the-art language models with real-world impact</li>
<li>Care for the societal impacts of your work</li>
<li>Comfort in ambiguous environments where you define the path forward</li>
</ul>
<p>Nice to have:</p>
<ul>
<li>Experience with GPU Kernel Development: CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization</li>
<li>ML Compilers &amp; Frameworks: PyTorch/JAX internals, torch.compile, XLA, custom operators</li>
<li>Performance Engineering: Kernel fusion, memory bandwidth optimization, profiling with Nsight</li>
<li>Distributed Systems: NCCL, NVLink, collective communication, model parallelism</li>
<li>Low-Precision: INT8/FP8 quantization, mixed-precision techniques</li>
<li>Production Systems: Large-scale training infrastructure, fault tolerance, cluster orchestration</li>
</ul>
<p>Representative projects:</p>
<ul>
<li>Co-design attention mechanisms and algorithms for next-generation hardware architectures</li>
<li>Develop custom kernels for emerging quantization formats and mixed-precision techniques</li>
<li>Design distributed communication strategies for multi-node GPU clusters</li>
<li>Optimize end-to-end training and inference pipelines for frontier language models</li>
<li>Build performance modeling frameworks to predict and optimize GPU utilization</li>
<li>Implement kernel fusion strategies to minimize memory bandwidth bottlenecks</li>
<li>Create resilient systems for planet-scale distributed training infrastructure</li>
<li>Profile and eliminate performance bottlenecks in production serving infrastructure</li>
<li>Partner with hardware vendors to influence future accelerator capabilities and software stacks</li>
</ul>
<p>Note: The salary range for this position is $280,000-$850,000 USD per year.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$280,000-$850,000 USD per year</Salaryrange>
      <Skills>GPU programming, optimization at scale, CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization, PyTorch/JAX internals, torch.compile, XLA, custom operators, kernel fusion, memory bandwidth optimization, profiling with Nsight, NCCL, NVLink, collective communication, model parallelism, INT8/FP8 quantization, mixed-precision techniques, large-scale training infrastructure, fault tolerance, cluster orchestration</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic is a public benefit corporation that creates reliable, interpretable, and steerable AI systems.</Employerdescription>
      <Employerwebsite>https://www.anthropic.com/</Employerwebsite>
      <Compensationcurrency>USD</Compensationcurrency>
      <Compensationmin>280000</Compensationmin>
      <Compensationmax>850000</Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/4926227008</Applyto>
      <Location>San Francisco, CA | New York City, NY | Seattle, WA</Location>
      <Country>United States</Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>11a60d5a-f54</externalid>
      <Title>Performance Engineer, GPU</Title>
      <Description><![CDATA[<p><strong>About the role:</strong></p>
<p>Pioneering the next generation of AI requires breakthrough innovations in GPU performance and systems engineering. As a GPU Performance Engineer, you&#39;ll architect and implement the foundational systems that power Claude and push the frontiers of what&#39;s possible with large language models. You&#39;ll be responsible for maximizing GPU utilization and performance at unprecedented scale, developing cutting-edge optimizations that directly enable new model capabilities and dramatically improve inference efficiency.</p>
<p>Working at the intersection of hardware and software, you&#39;ll implement state-of-the-art techniques from custom kernel development to distributed system architectures. Your work will span the entire stack—from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization.</p>
<p>Strong candidates will have a track record of delivering transformative GPU performance improvements in production ML systems and will be excited to shape the future of AI infrastructure alongside world-class researchers and engineers.</p>
<p><strong>You might be a good fit if you:</strong></p>
<ul>
<li>Have deep experience with GPU programming and optimization at scale</li>
<li>Are impact-driven, passionate about delivering measurable performance breakthroughs</li>
<li>Can navigate complex systems from hardware interfaces to high-level ML frameworks</li>
<li>Enjoy collaborative problem-solving and pair programming</li>
<li>Want to work on state-of-the-art language models with real-world impact</li>
<li>Care about the societal impacts of your work</li>
<li>Thrive in ambiguous environments where you define the path forward</li>
</ul>
<p><strong>Strong candidates may also have experience with:</strong></p>
<ul>
<li>GPU Kernel Development: CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization</li>
<li>ML Compilers &amp; Frameworks: PyTorch/JAX internals, torch.compile, XLA, custom operators</li>
<li>Performance Engineering: Kernel fusion, memory bandwidth optimization, profiling with Nsight</li>
<li>Distributed Systems: NCCL, NVLink, collective communication, model parallelism</li>
<li>Low-Precision: INT8/FP8 quantization, mixed-precision techniques</li>
<li>Production Systems: Large-scale training infrastructure, fault tolerance, cluster orchestration</li>
</ul>
<p><strong>Representative projects:</strong></p>
<ul>
<li>Co-design attention mechanisms and algorithms for next-generation hardware architectures</li>
<li>Develop custom kernels for emerging quantization formats and mixed-precision techniques</li>
<li>Design distributed communication strategies for multi-node GPU clusters</li>
<li>Optimize end-to-end training and inference pipelines for frontier language models</li>
<li>Build performance modeling frameworks to predict and optimize GPU utilization</li>
<li>Implement kernel fusion strategies to minimize memory bandwidth bottlenecks</li>
<li>Create resilient systems for planet-scale distributed training infrastructure</li>
<li>Profile and eliminate performance bottlenecks in production serving infrastructure</li>
<li>Partner with hardware vendors to influence future accelerator capabilities and software stacks</li>
</ul>
<p><strong>Deadline to apply:</strong> None. Applications will be reviewed on a rolling basis.</p>
<p>The expected salary range for this position is:</p>
<p>Annual Salary: $280,000 - $850,000 USD</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$280,000 - $850,000 USD per year</Salaryrange>
      <Skills>GPU programming, optimization at scale, custom kernel development, distributed system architectures, low-level tensor core optimizations, orchestrating thousands of GPUs, GPU kernel development, CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization, ML compilers &amp; frameworks, PyTorch/JAX internals, torch.compile, XLA, custom operators, performance engineering, kernel fusion, memory bandwidth optimization, profiling with Nsight, distributed systems, NCCL, NVLink, collective communication, model parallelism, low-precision, INT8/FP8 quantization, mixed-precision techniques, production systems, large-scale training infrastructure, fault tolerance, cluster orchestration</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic&apos;s mission is to create reliable, interpretable, and steerable AI systems. The company is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</Employerdescription>
      <Employerwebsite>https://www.anthropic.com/</Employerwebsite>
      <Compensationcurrency>USD</Compensationcurrency>
      <Compensationmin>280000</Compensationmin>
      <Compensationmax>850000</Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/4926227008</Applyto>
      <Location>San Francisco, CA | New York City, NY | Seattle, WA</Location>
      <Country>United States</Country>
      <Postedate>2026-03-08</Postedate>
    </job>
  </jobs>
</source>