{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/numerical-stability"},"x-facet":{"type":"skill","slug":"numerical-stability","display":"Numerical Stability","count":1},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_faffae87-882"},"title":"Staff Software Engineer - GenAI Performance and Kernel","description":"<p>As a staff software engineer for GenAI Performance and Kernel, you will own the design, implementation, optimization, and correctness of the high-performance GPU kernels powering our GenAI inference stack. 
You will lead development of highly-tuned, low-level compute paths, manage trade-offs between hardware efficiency and generality, and mentor others in kernel-level performance engineering.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Leading the design, implementation, benchmarking, and maintenance of core compute kernels optimized for various hardware backends (GPU, accelerators)</li>\n<li>Driving the performance roadmap for kernel-level improvements: vectorization, tensorization, tiling, fusion, mixed precision, sparsity, quantization, memory reuse, scheduling, auto-tuning, etc.</li>\n<li>Integrating kernel optimizations with higher-level ML systems</li>\n<li>Building and maintaining profiling, instrumentation, and verification tooling to detect correctness, performance regressions, numerical issues, and hardware utilization gaps</li>\n<li>Leading performance investigations and root-cause analysis on inference bottlenecks, e.g. memory bandwidth, cache contention, kernel launch overhead, tensor fragmentation</li>\n<li>Establishing coding patterns, abstractions, and frameworks to modularize kernels for reuse, cross-backend portability, and maintainability</li>\n<li>Influencing system architecture decisions to make kernel improvements more effective (e.g. 
memory layout, dataflow scheduling, kernel fusion boundaries)</li>\n<li>Mentoring and guiding other engineers working on lower-level performance, providing code reviews, and helping set best practices</li>\n<li>Collaborating with infrastructure, tooling, and ML teams to roll out kernel-level optimizations into production, and monitoring their impact</li>\n</ul>\n<p>Requirements include:</p>\n<ul>\n<li>BS/MS/PhD in Computer Science or a related field</li>\n<li>Deep hands-on experience writing and tuning compute kernels (CUDA, Triton, OpenCL, LLVM IR, assembly, or similar) for ML workloads</li>\n<li>Strong knowledge of GPU/accelerator architecture: warp structure, memory hierarchy (global, shared, register, L1/L2 caches), tensor cores, scheduling, SM occupancy, etc.</li>\n<li>Experience with advanced optimization techniques: tiling, blocking, software pipelining, vectorization, fusion, loop transformations, auto-tuning</li>\n<li>Familiarity with ML-specific kernel libraries (cuBLAS, cuDNN, CUTLASS, oneDNN, etc.) or open kernels</li>\n<li>Strong debugging and profiling skills (Nsight, nvprof, perf, VTune, custom instrumentation)</li>\n<li>Experience reasoning about numerical stability, mixed precision, quantization, and error propagation</li>\n<li>Experience integrating optimized kernels into real-world ML inference systems; exposure to distributed inference pipelines, memory management, and runtime systems</li>\n<li>Experience building high-performance products leveraging GPU acceleration</li>\n<li>Excellent communication and leadership skills, able to drive design discussions, mentor colleagues, and make trade-offs visible</li>\n<li>A track record of shipping performance-critical, high-quality production software</li>\n<li>Bonus: publications in systems/ML performance venues (e.g. MLSys, ASPLOS, ISCA, PPoPP), experience with custom accelerators or FPGAs, experience with sparsity or model compression techniques</li>\n</ul>\n<p>The pay range for this role is $190,900-$232,800 USD per year, depending on location and experience.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_faffae87-882","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Databricks","sameAs":"https://databricks.com","logo":"https://logos.yubhub.co/databricks.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/databricks/jobs/8202700002","x-work-arrangement":"onsite","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$190,900-$232,800 USD per year","x-skills-required":["Compute kernels","GPU/accelerator architecture","Advanced optimization techniques","ML-specific kernel libraries","Debugging and profiling skills","Numerical stability","Mixed precision","Quantization","Error propagation","Distributed inference pipelines","Memory management","Runtime systems","High-performance products","GPU acceleration"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:46:07.442Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, California"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Compute kernels, GPU/accelerator architecture, Advanced optimization techniques, ML-specific kernel libraries, Debugging and profiling skills, Numerical stability, Mixed precision, Quantization, Error propagation, Distributed inference pipelines, Memory management, Runtime systems, High-performance products, GPU acceleration","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":190900,"maxValue":232800,"unitText":"YEAR"}}}]}