Staff Software Engineer - GenAI Performance and Kernel

As a staff software engineer for GenAI Performance and Kernel, you will own the design, implementation, optimization, and correctness of the high-performance GPU kernels powering our GenAI inference stack. You will lead development of highly-tuned, low-level compute paths, manage trade-offs between hardware efficiency and generality, and mentor others in kernel-level performance engineering.

Key responsibilities include:

Leading the design, implementation, benchmarking, and maintenance of core compute kernels optimized for various hardware backends (GPU, accelerators)
Driving the performance roadmap for kernel-level improvements: vectorization, tensorization, tiling, fusion, mixed precision, sparsity, quantization, memory reuse, scheduling, auto-tuning, etc.
Integrating kernel optimizations with higher-level ML systems
Building and maintaining profiling, instrumentation, and verification tooling to detect correctness bugs, performance regressions, numerical issues, and hardware utilization gaps
Leading performance investigations and root-cause analysis on inference bottlenecks, e.g. memory bandwidth, cache contention, kernel launch overhead, tensor fragmentation
Establishing coding patterns, abstractions, and frameworks to modularize kernels for reuse, cross-backend portability, and maintainability
Influencing system architecture decisions to make kernel improvements more effective (e.g. memory layout, dataflow scheduling, kernel fusion boundaries)
Mentoring and guiding other engineers working on lower-level performance, providing code reviews, and helping set best practices
Collaborating with infrastructure, tooling, and ML teams to roll out kernel-level optimizations into production, and monitoring their impact

Requirements include:

BS/MS/PhD in Computer Science or a related field
Deep hands-on experience writing and tuning compute kernels (CUDA, Triton, OpenCL, LLVM IR, assembly, or similar) for ML workloads
Strong knowledge of GPU/accelerator architecture: warp structure, memory hierarchy (global, shared, register, L1/L2 caches), tensor cores, scheduling, SM occupancy, etc.
Experience with advanced optimization techniques: tiling, blocking, software pipelining, vectorization, fusion, loop transformations, auto-tuning
Familiarity with ML-specific kernel libraries (cuBLAS, cuDNN, CUTLASS, oneDNN, etc.) or open kernels
Strong debugging and profiling skills (Nsight, nvprof, perf, VTune, custom instrumentation)
Experience reasoning about numerical stability, mixed precision, quantization, and error propagation
Experience in integrating optimized kernels into real-world ML inference systems; exposure to distributed inference pipelines, memory management, and runtime systems
Experience building high-performance products leveraging GPU acceleration
Excellent communication and leadership skills: able to drive design discussions, mentor colleagues, and make trade-offs visible
A track record of shipping performance-critical, high-quality production software
Bonus: publications in systems/ML performance venues (e.g. MLSys, ASPLOS, ISCA, PPoPP), experience with custom accelerators or FPGAs, experience with sparsity or model compression techniques

The pay range for this role is $190,900-$232,800 USD per year, depending on location and experience.

XML job scraping automation by YubHub

Employment type: full-time
Level: staff
Work arrangement: onsite
Pay: $190,900-$232,800 USD per year
Skills: Compute kernels, GPU/accelerator architecture, Advanced optimization techniques, ML-specific kernel libraries, Debugging and profiling skills, Numerical stability, Mixed precision, Quantization, Error propagation, Distributed inference pipelines, Memory management, Runtime systems, High-performance products, GPU acceleration
Category: Engineering Technology
Company: Databricks
Logo: https://logos.yubhub.co/databricks.com.png
About: Databricks is a data and AI company that provides a unified platform for data, analytics, and AI.
Website: https://databricks.com
Job posting: https://job-boards.greenhouse.io/databricks/jobs/8202700002
Location: San Francisco, California
Date: 2026-04-18

AI Tutor - Software Engineering Specialist

We're seeking an experienced software engineer to join our team as an AI tutor. As a tutor, you will contribute to AI model training initiatives by curating code examples, offering precise solutions, and providing meticulous corrections in specialized programming languages.

Your responsibilities will include evaluating and refining AI-generated code, ensuring it adheres to industry standards for efficiency, scalability, and reliability. You will also collaborate with cross-functional teams to enhance AI-driven coding solutions, ensuring they meet enterprise-level quality and performance benchmarks.

To succeed in this role, you will need professional software engineering experience building scalable, high-performance applications. You should have deep expertise in one or more programming languages, strong proficiency in relevant frameworks and libraries, and a solid understanding of software design principles, performance optimization, and best practices.

As a detail-oriented and adaptable individual, you will thrive in a fast-paced work environment and possess strong logical reasoning skills. Experience integrating analytics, monitoring, and security best practices relevant to your technical domain is a plus. Experience with containerization technologies, such as Docker, and knowledge of complementary technologies, such as backend systems, APIs, databases, and authentication, are also desirable.

This role may be offered as a full-time, part-time, or contractor position, depending on role needs and candidate fit. As a contractor, you will have the flexibility to set your own hours and determine the exact amount of time needed to complete deliverables. You will be working remotely from any location worldwide, subject to legal eligibility, time-zone compatibility, and role-specific needs.

US-based candidates will be compensated between $60/hour and $100/hour, depending on factors including relevant experience, skills, education, geographic location, and qualifications. International candidates will receive information during the recruitment process.


Employment type: full-time, part-time, or contract
Level: senior
Work arrangement: remote
Pay: $60/hour - $100/hour
Skills: proficient in one or more programming languages, strong proficiency in relevant frameworks and libraries, solid understanding of software design principles, performance optimization, and best practices, experience implementing quality standards, including accessibility, security, and reliability, strong debugging and profiling skills using development tools and performance monitoring, adaptable, detail-oriented, logical reasoning skills, containerization technologies (e.g., Docker), knowledge of complementary technologies (e.g., backend systems, APIs, databases, authentication)
Category: Engineering Technology
Company: xAI
Logo: https://logos.yubhub.co/xai.com.png
About: xAI creates AI systems to understand the universe and aid humanity in its pursuit of knowledge.
Website: https://www.xai.com/
Job posting: https://job-boards.greenhouse.io/xai/jobs/5063490007
Location: Remote
Date: 2026-04-18