{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/custom-kernel-development"},"x-facet":{"type":"skill","slug":"custom-kernel-development","display":"Custom Kernel Development","count":2},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_11a60d5a-f54"},"title":"Performance Engineer, GPU","description":"<p><strong>About the role:</strong></p>\n<p>Pioneering the next generation of AI requires breakthrough innovations in GPU performance and systems engineering. As a GPU Performance Engineer, you&#39;ll architect and implement the foundational systems that power Claude and push the frontiers of what&#39;s possible with large language models. You&#39;ll be responsible for maximizing GPU utilization and performance at unprecedented scale, developing cutting-edge optimizations that directly enable new model capabilities and dramatically improve inference efficiency.</p>\n<p>Working at the intersection of hardware and software, you&#39;ll implement state-of-the-art techniques from custom kernel development to distributed system architectures. Your work will span the entire stack—from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization.</p>\n<p>Strong candidates will have a track record of delivering transformative GPU performance improvements in production ML systems and will be excited to shape the future of AI infrastructure alongside world-class researchers and engineers.</p>\n<p><strong>You might be a good fit if you:</strong></p>\n<ul>\n<li>Have deep experience with GPU programming and optimization at scale</li>\n<li>Are impact-driven, passionate about delivering measurable performance breakthroughs</li>\n<li>Can navigate complex systems from hardware interfaces to high-level ML frameworks</li>\n<li>Enjoy collaborative problem-solving and pair programming</li>\n<li>Want to work on state-of-the-art language models with real-world impact</li>\n<li>Care about the societal impacts of your work</li>\n<li>Thrive in ambiguous environments where you define the path forward</li>\n</ul>\n<p><strong>Strong candidates may also have experience with:</strong></p>\n<ul>\n<li>GPU Kernel Development: CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization</li>\n<li>ML Compilers &amp; Frameworks: PyTorch/JAX internals, torch.compile, XLA, custom operators</li>\n<li>Performance Engineering: Kernel fusion, memory bandwidth optimization, profiling with Nsight</li>\n<li>Distributed Systems: NCCL, NVLink, collective communication, model parallelism</li>\n<li>Low-Precision: INT8/FP8 quantization, mixed-precision techniques</li>\n<li>Production Systems: Large-scale training infrastructure, fault tolerance, cluster orchestration</li>\n</ul>\n<p><strong>Representative projects:</strong></p>\n<ul>\n<li>Co-design attention mechanisms and algorithms for next-generation hardware architectures</li>\n<li>Develop custom kernels for emerging quantization formats and mixed-precision techniques</li>\n<li>Design distributed communication strategies for multi-node GPU clusters</li>\n<li>Optimize end-to-end training and inference pipelines for frontier language models</li>\n<li>Build performance modeling frameworks to predict and optimize GPU utilization</li>\n<li>Implement kernel fusion strategies to minimize memory bandwidth bottlenecks</li>\n<li>Create resilient systems for planet-scale distributed training infrastructure</li>\n<li>Profile and eliminate performance bottlenecks in production serving infrastructure</li>\n<li>Partner with hardware vendors to influence future accelerator capabilities and software stacks</li>\n</ul>\n<p><strong>Deadline to apply:</strong> None. Applications will be reviewed on a rolling basis.</p>\n<p>The expected salary range for this position is:</p>\n<p>Annual Salary: $280,000 - $850,000USD</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_11a60d5a-f54","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://job-boards.greenhouse.io","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/4926227008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$280,000 - $850,000USD","x-skills-required":["GPU programming","optimization at scale","custom kernel development","distributed system architectures","low-level tensor core optimizations","orchestrating thousands of GPUs","GPU kernel development","CUDA","Triton","CUTLASS","Flash Attention","tensor core optimization","ML compilers & frameworks","PyTorch/JAX internals","torch.compile","XLA","custom operators","performance engineering","kernel fusion","memory bandwidth optimization","profiling with Nsight","distributed systems","NCCL","NVLink","collective communication","model parallelism","low-precision","INT8/FP8 quantization","mixed-precision techniques","production systems","large-scale training infrastructure","fault tolerance","cluster orchestration"],"x-skills-preferred":["GPU programming","optimization at scale","custom kernel development","distributed system architectures","low-level tensor core optimizations","orchestrating thousands of GPUs","GPU kernel development","CUDA","Triton","CUTLASS","Flash Attention","tensor core optimization","ML compilers & frameworks","PyTorch/JAX internals","torch.compile","XLA","custom operators","performance engineering","kernel fusion","memory bandwidth optimization","profiling with Nsight","distributed systems","NCCL","NVLink","collective communication","model parallelism","low-precision","INT8/FP8 quantization","mixed-precision techniques","production systems","large-scale training infrastructure","fault tolerance","cluster orchestration"],"datePosted":"2026-03-08T13:45:05.412Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"GPU programming, optimization at scale, custom kernel development, distributed system architectures, low-level tensor core optimizations, orchestrating thousands of GPUs, GPU kernel development, CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization, ML compilers & frameworks, PyTorch/JAX internals, torch.compile, XLA, custom operators, performance engineering, kernel fusion, memory bandwidth optimization, profiling with Nsight, distributed systems, NCCL, NVLink, collective communication, model parallelism, low-precision, INT8/FP8 quantization, mixed-precision techniques, production systems, large-scale training infrastructure, fault tolerance, cluster orchestration, GPU programming, optimization at scale, custom kernel development, distributed system architectures, low-level tensor core optimizations, orchestrating thousands of GPUs, GPU kernel development, CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization, ML compilers & frameworks, PyTorch/JAX internals, torch.compile, XLA, custom operators, performance engineering, kernel fusion, memory bandwidth optimization, profiling with Nsight, distributed systems, NCCL, NVLink, collective communication, model parallelism, low-precision, INT8/FP8 quantization, mixed-precision techniques, production systems, large-scale training infrastructure, fault tolerance, cluster orchestration","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":280000,"maxValue":850000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_7917d1eb-6e2"},"title":"Engineering Manager - Inference","description":"<p>We are looking for an Inference Engineering Manager to lead our AI Inference team. This is a unique opportunity to build and scale the infrastructure that powers Perplexity&#39;s products and APIs, serving millions of users with state-of-the-art AI capabilities.</p>\n<p><strong>What you&#39;ll do</strong></p>\n<p>You will own the technical direction and execution of our inference systems while building and leading a world-class team of inference engineers. Our current stack includes Python, PyTorch, Rust, C++, and Kubernetes.</p>\n<ul>\n<li>Lead and grow a high-performing team of AI inference engineers</li>\n<li>Develop APIs for AI inference used by both internal and external customers</li>\n<li>Architect and scale our inference infrastructure for reliability and efficiency</li>\n</ul>\n<p><strong>What you need</strong></p>\n<ul>\n<li>5+ years of engineering experience with 2+ years in a technical leadership or management role</li>\n<li>Deep experience with ML systems and inference frameworks (PyTorch, TensorFlow, ONNX, TensorRT, vLLM)</li>\n<li>Strong understanding of LLM architecture: Multi-Head Attention, Multi/Grouped-Query Attention, and common layers</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_7917d1eb-6e2","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Perplexity","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/perplexity.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/perplexity/2a87ccbf-82ef-4fc7-b1ed-4dd18b11baf9","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$300K - $405K","x-skills-required":["ML systems","inference frameworks","LLM architecture"],"x-skills-preferred":["CUDA","Triton","custom kernel development"],"datePosted":"2026-03-04T12:24:50.159Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"ML systems, inference frameworks, LLM architecture, CUDA, Triton, custom kernel development","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":300000,"maxValue":405000,"unitText":"YEAR"}}}]}