{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/kernel-fusion"},"x-facet":{"type":"skill","slug":"kernel-fusion","display":"Kernel Fusion","count":2},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_28107212-128"},"title":"Performance Engineer, GPU","description":"<p>As a GPU Performance Engineer at Anthropic, you will be responsible for architecting and implementing the foundational systems that power Claude and push the frontiers of what&#39;s possible with large language models. You will maximize GPU utilization and performance at unprecedented scale, develop cutting-edge optimizations that directly enable new model capabilities, and dramatically improve inference efficiency.</p>\n<p>Working at the intersection of hardware and software, you will implement state-of-the-art techniques from custom kernel development to distributed system architectures. Your work will span the entire stack,from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization.</p>\n<p>Strong candidates will have a track record of delivering transformative GPU performance improvements in production ML systems and will be excited to shape the future of AI infrastructure alongside world-class researchers and engineers.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Architect and implement foundational systems that power Claude</li>\n<li>Maximize GPU utilization and performance at unprecedented scale</li>\n<li>Develop cutting-edge optimizations that directly enable new model capabilities</li>\n<li>Dramatically improve inference efficiency</li>\n<li>Implement state-of-the-art techniques from custom kernel development to distributed system architectures</li>\n<li>Work at the intersection of hardware and software</li>\n<li>Span the entire stack,from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization</li>\n</ul>\n<p>Requirements:</p>\n<ul>\n<li>Deep experience with GPU programming and optimization at scale</li>\n<li>Impact-driven, passionate about delivering measurable performance breakthroughs</li>\n<li>Ability to navigate complex systems from hardware interfaces to high-level ML frameworks</li>\n<li>Enjoy collaborative problem-solving and pair programming</li>\n<li>Want to work on state-of-the-art language models with real-world impact</li>\n<li>Care about the societal impacts of your work</li>\n<li>Thrive in ambiguous environments where you define the path forward</li>\n</ul>\n<p>Nice to have:</p>\n<ul>\n<li>Experience with GPU Kernel Development: CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization</li>\n<li>ML Compilers &amp; Frameworks: PyTorch/JAX internals, torch.compile, XLA, custom operators</li>\n<li>Performance Engineering: Kernel fusion, memory bandwidth optimization, profiling with Nsight</li>\n<li>Distributed Systems: NCCL, NVLink, collective communication, model parallelism</li>\n<li>Low-Precision: INT8/FP8 quantization, mixed-precision techniques</li>\n<li>Production Systems: Large-scale training infrastructure, fault tolerance, cluster orchestration</li>\n</ul>\n<p>Representative projects:</p>\n<ul>\n<li>Co-design attention mechanisms and algorithms for next-generation hardware architectures</li>\n<li>Develop custom kernels for emerging quantization formats and mixed-precision techniques</li>\n<li>Design distributed communication strategies for multi-node GPU clusters</li>\n<li>Optimize end-to-end training and inference pipelines for frontier language models</li>\n<li>Build performance modeling frameworks to predict and optimize GPU utilization</li>\n<li>Implement kernel fusion strategies to minimize memory bandwidth bottlenecks</li>\n<li>Create resilient systems for planet-scale distributed training infrastructure</li>\n<li>Profile and eliminate performance bottlenecks in production serving infrastructure</li>\n<li>Partner with hardware vendors to influence future accelerator capabilities and software stacks</li>\n</ul>\n<p>Note: The salary range for this position is $280,000-$850,000 USD per year.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_28107212-128","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/4926227008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$280,000-$850,000 USD per year","x-skills-required":["GPU programming","optimization at scale","CUDA","Triton","CUTLASS","Flash Attention","tensor core optimization","PyTorch/JAX internals","torch.compile","XLA","custom operators","kernel fusion","memory bandwidth optimization","profiling with Nsight","NCCL","NVLink","collective communication","model parallelism","INT8/FP8 quantization","mixed-precision techniques","large-scale training infrastructure","fault tolerance","cluster orchestration"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:40:11.758Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"GPU programming, optimization at scale, CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization, PyTorch/JAX internals, torch.compile, XLA, custom operators, kernel fusion, memory bandwidth optimization, profiling with Nsight, NCCL, NVLink, collective communication, model parallelism, INT8/FP8 quantization, mixed-precision techniques, large-scale training infrastructure, fault tolerance, cluster orchestration","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":280000,"maxValue":850000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_11a60d5a-f54"},"title":"Performance Engineer, GPU","description":"<p><strong>About the role:</strong></p>\n<p>Pioneering the next generation of AI requires breakthrough innovations in GPU performance and systems engineering. As a GPU Performance Engineer, you&#39;ll architect and implement the foundational systems that power Claude and push the frontiers of what&#39;s possible with large language models. You&#39;ll be responsible for maximizing GPU utilization and performance at unprecedented scale, developing cutting-edge optimizations that directly enable new model capabilities and dramatically improve inference efficiency.</p>\n<p>Working at the intersection of hardware and software, you&#39;ll implement state-of-the-art techniques from custom kernel development to distributed system architectures. Your work will span the entire stack—from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization.</p>\n<p>Strong candidates will have a track record of delivering transformative GPU performance improvements in production ML systems and will be excited to shape the future of AI infrastructure alongside world-class researchers and engineers.</p>\n<p><strong>You might be a good fit if you:</strong></p>\n<ul>\n<li>Have deep experience with GPU programming and optimization at scale</li>\n<li>Are impact-driven, passionate about delivering measurable performance breakthroughs</li>\n<li>Can navigate complex systems from hardware interfaces to high-level ML frameworks</li>\n<li>Enjoy collaborative problem-solving and pair programming</li>\n<li>Want to work on state-of-the-art language models with real-world impact</li>\n<li>Care about the societal impacts of your work</li>\n<li>Thrive in ambiguous environments where you define the path forward</li>\n</ul>\n<p><strong>Strong candidates may also have experience with:</strong></p>\n<ul>\n<li>GPU Kernel Development: CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization</li>\n<li>ML Compilers &amp; Frameworks: PyTorch/JAX internals, torch.compile, XLA, custom operators</li>\n<li>Performance Engineering: Kernel fusion, memory bandwidth optimization, profiling with Nsight</li>\n<li>Distributed Systems: NCCL, NVLink, collective communication, model parallelism</li>\n<li>Low-Precision: INT8/FP8 quantization, mixed-precision techniques</li>\n<li>Production Systems: Large-scale training infrastructure, fault tolerance, cluster orchestration</li>\n</ul>\n<p><strong>Representative projects:</strong></p>\n<ul>\n<li>Co-design attention mechanisms and algorithms for next-generation hardware architectures</li>\n<li>Develop custom kernels for emerging quantization formats and mixed-precision techniques</li>\n<li>Design distributed communication strategies for multi-node GPU clusters</li>\n<li>Optimize end-to-end training and inference pipelines for frontier language models</li>\n<li>Build performance modeling frameworks to predict and optimize GPU utilization</li>\n<li>Implement kernel fusion strategies to minimize memory bandwidth bottlenecks</li>\n<li>Create resilient systems for planet-scale distributed training infrastructure</li>\n<li>Profile and eliminate performance bottlenecks in production serving infrastructure</li>\n<li>Partner with hardware vendors to influence future accelerator capabilities and software stacks</li>\n</ul>\n<p><strong>Deadline to apply:</strong> None. Applications will be reviewed on a rolling basis.</p>\n<p>The expected salary range for this position is:</p>\n<p>Annual Salary: $280,000 - $850,000USD</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_11a60d5a-f54","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://job-boards.greenhouse.io","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/4926227008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$280,000 - $850,000USD","x-skills-required":["GPU programming","optimization at scale","custom kernel development","distributed system architectures","low-level tensor core optimizations","orchestrating thousands of GPUs","GPU kernel development","CUDA","Triton","CUTLASS","Flash Attention","tensor core optimization","ML compilers & frameworks","PyTorch/JAX internals","torch.compile","XLA","custom operators","performance engineering","kernel fusion","memory bandwidth optimization","profiling with Nsight","distributed systems","NCCL","NVLink","collective communication","model parallelism","low-precision","INT8/FP8 quantization","mixed-precision techniques","production systems","large-scale training infrastructure","fault tolerance","cluster orchestration"],"x-skills-preferred":["GPU programming","optimization at scale","custom kernel development","distributed system architectures","low-level tensor core optimizations","orchestrating thousands of GPUs","GPU kernel development","CUDA","Triton","CUTLASS","Flash Attention","tensor core optimization","ML compilers & frameworks","PyTorch/JAX internals","torch.compile","XLA","custom operators","performance engineering","kernel fusion","memory bandwidth optimization","profiling with Nsight","distributed systems","NCCL","NVLink","collective communication","model parallelism","low-precision","INT8/FP8 quantization","mixed-precision techniques","production systems","large-scale training infrastructure","fault tolerance","cluster orchestration"],"datePosted":"2026-03-08T13:45:05.412Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"GPU programming, optimization at scale, custom kernel development, distributed system architectures, low-level tensor core optimizations, orchestrating thousands of GPUs, GPU kernel development, CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization, ML compilers & frameworks, PyTorch/JAX internals, torch.compile, XLA, custom operators, performance engineering, kernel fusion, memory bandwidth optimization, profiling with Nsight, distributed systems, NCCL, NVLink, collective communication, model parallelism, low-precision, INT8/FP8 quantization, mixed-precision techniques, production systems, large-scale training infrastructure, fault tolerance, cluster orchestration, GPU programming, optimization at scale, custom kernel development, distributed system architectures, low-level tensor core optimizations, orchestrating thousands of GPUs, GPU kernel development, CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization, ML compilers & frameworks, PyTorch/JAX internals, torch.compile, XLA, custom operators, performance engineering, kernel fusion, memory bandwidth optimization, profiling with Nsight, distributed systems, NCCL, NVLink, collective communication, model parallelism, low-precision, INT8/FP8 quantization, mixed-precision techniques, production systems, large-scale training infrastructure, fault tolerance, cluster orchestration","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":280000,"maxValue":850000,"unitText":"YEAR"}}}]}