{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/quantization"},"x-facet":{"type":"skill","slug":"quantization","display":"Quantization","count":8},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_2bc6ae79-8ee"},"title":"Staff Technical Lead for Inference & ML Performance","description":"<p>We&#39;re looking for a Staff Technical Lead for Inference &amp; ML Performance to guide a team in building and optimizing state-of-the-art inference systems. This role is intense yet deeply impactful.</p>\n<p>You&#39;ll shape the future of fal&#39;s inference engine and ensure our generative models achieve best-in-class performance. Your work directly impacts our ability to rapidly deliver cutting-edge creative solutions to users, from individual creators to global brands.</p>\n<p>Day-to-day, you&#39;ll set technical direction, guide your team to build high-performance inference solutions, and personally contribute to critical inference performance enhancements and optimizations. You&#39;ll collaborate closely with research &amp; applied ML teams, influence model inference strategies and deployment techniques, and drive advanced performance optimizations.</p>\n<p>As a leader, you&#39;ll mentor and scale your team, coach and expand your team of performance-focused engineers, and help them innovate, solve complex performance challenges, and level up their skills.</p>\n<p>To succeed in this role, you&#39;ll need to be deeply experienced in ML performance optimization, understand the full ML performance stack, and know inference inside-out. 
You&#39;ll also need to thrive in cross-functional collaboration and have excellent leadership skills.</p>\n<p>If you&#39;re ready to lead the future of inference performance at a fast-paced, high-growth frontier company, apply now!</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_2bc6ae79-8ee","directApply":true,"hiringOrganization":{"@type":"Organization","name":"fal","sameAs":"https://fal.com","logo":"https://logos.yubhub.co/fal.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/fal/jobs/4012780009","x-work-arrangement":"onsite","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["ML performance optimization","PyTorch","TensorRT","TransformerEngine","Triton","CUTLASS kernels","Quantization","Kernel authoring","Compilation","Model parallelism","Distributed serving","Profiling"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:50:42.839Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"ML performance optimization, PyTorch, TensorRT, TransformerEngine, Triton, CUTLASS kernels, Quantization, Kernel authoring, Compilation, Model parallelism, Distributed serving, Profiling"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_faffae87-882"},"title":"Staff Software Engineer - GenAI Performance and Kernel","description":"<p>As a staff software engineer for GenAI Performance and Kernel, you will own the design, implementation, optimization, and correctness of the high-performance GPU kernels powering our GenAI inference stack. You will lead development of highly tuned, low-level compute paths, manage trade-offs between hardware efficiency and generality, and mentor others in kernel-level performance engineering.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Leading the design, implementation, benchmarking, and maintenance of core compute kernels optimized for various hardware backends (GPU, accelerators)</li>\n<li>Driving the performance roadmap for kernel-level improvements: vectorization, tensorization, tiling, fusion, mixed precision, sparsity, quantization, memory reuse, scheduling, auto-tuning, etc.</li>\n<li>Integrating kernel optimizations with higher-level ML systems</li>\n<li>Building and maintaining profiling, instrumentation, and verification tooling to verify correctness and detect performance regressions, numerical issues, and hardware utilization gaps</li>\n<li>Leading performance investigations and root-cause analysis on inference bottlenecks, e.g. memory bandwidth, cache contention, kernel launch overhead, tensor fragmentation</li>\n<li>Establishing coding patterns, abstractions, and frameworks to modularize kernels for reuse, cross-backend portability, and maintainability</li>\n<li>Influencing system architecture decisions to make kernel improvements more effective (e.g. 
memory layout, dataflow scheduling, kernel fusion boundaries)</li>\n<li>Mentoring and guiding other engineers working on lower-level performance, providing code reviews, and helping set best practices</li>\n<li>Collaborating with infrastructure, tooling, and ML teams to roll out kernel-level optimizations into production, and monitoring their impact</li>\n</ul>\n<p>Requirements include:</p>\n<ul>\n<li>BS/MS/PhD in Computer Science or a related field</li>\n<li>Deep hands-on experience writing and tuning compute kernels (CUDA, Triton, OpenCL, LLVM IR, assembly, or similar) for ML workloads</li>\n<li>Strong knowledge of GPU/accelerator architecture: warp structure, memory hierarchy (global, shared, register, L1/L2 caches), tensor cores, scheduling, SM occupancy, etc.</li>\n<li>Experience with advanced optimization techniques: tiling, blocking, software pipelining, vectorization, fusion, loop transformations, auto-tuning</li>\n<li>Familiarity with ML-specific kernel libraries (cuBLAS, cuDNN, CUTLASS, oneDNN, etc.) or open-source kernels</li>\n<li>Strong debugging and profiling skills (Nsight, NVProf, perf, VTune, custom instrumentation)</li>\n<li>Experience reasoning about numerical stability, mixed precision, quantization, and error propagation</li>\n<li>Experience integrating optimized kernels into real-world ML inference systems; exposure to distributed inference pipelines, memory management, and runtime systems</li>\n<li>Experience building high-performance products leveraging GPU acceleration</li>\n<li>Excellent communication and leadership skills, able to drive design discussions, mentor colleagues, and make trade-offs visible</li>\n<li>A track record of shipping performance-critical, high-quality production software</li>\n<li>Bonus: published in systems/ML performance venues (e.g. 
MLSys, ASPLOS, ISCA, PPoPP), experience with custom accelerators or FPGAs, experience with sparsity or model compression techniques</li>\n</ul>\n<p>The pay range for this role is $190,900-$232,800 USD per year, depending on location and experience.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_faffae87-882","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Databricks","sameAs":"https://databricks.com","logo":"https://logos.yubhub.co/databricks.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/databricks/jobs/8202700002","x-work-arrangement":"onsite","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$190,900-$232,800 USD per year","x-skills-required":["Compute kernels","GPU/accelerator architecture","Advanced optimization techniques","ML-specific kernel libraries","Debugging and profiling skills","Numerical stability","Mixed precision","Quantization","Error propagation","Distributed inference pipelines","Memory management","Runtime systems","High-performance products","GPU acceleration"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:46:07.442Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, California"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Compute kernels, GPU/accelerator architecture, Advanced optimization techniques, ML-specific kernel libraries, Debugging and profiling skills, Numerical stability, Mixed precision, Quantization, Error propagation, Distributed inference pipelines, Memory management, Runtime systems, High-performance products, GPU acceleration","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":190900,"maxValue":232800,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_28107212-128"},"title":"Performance Engineer, GPU","description":"<p>As a GPU Performance Engineer at Anthropic, you will be responsible for architecting and implementing the foundational systems that power Claude and push the frontiers of what&#39;s possible with large language models. You will maximize GPU utilization and performance at unprecedented scale, develop cutting-edge optimizations that directly enable new model capabilities, and dramatically improve inference efficiency.</p>\n<p>Working at the intersection of hardware and software, you will implement state-of-the-art techniques from custom kernel development to distributed system architectures. 
Your work will span the entire stack, from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization.</p>\n<p>Strong candidates will have a track record of delivering transformative GPU performance improvements in production ML systems and will be excited to shape the future of AI infrastructure alongside world-class researchers and engineers.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Architect and implement foundational systems that power Claude</li>\n<li>Maximize GPU utilization and performance at unprecedented scale</li>\n<li>Develop cutting-edge optimizations that directly enable new model capabilities</li>\n<li>Dramatically improve inference efficiency</li>\n<li>Implement state-of-the-art techniques from custom kernel development to distributed system architectures</li>\n<li>Work at the intersection of hardware and software</li>\n<li>Span the entire stack, from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization</li>\n</ul>\n<p>Requirements:</p>\n<ul>\n<li>Deep experience with GPU programming and optimization at scale</li>\n<li>Impact-driven, passionate about delivering measurable performance breakthroughs</li>\n<li>Ability to navigate complex systems from hardware interfaces to high-level ML frameworks</li>\n<li>Enjoy collaborative problem-solving and pair programming</li>\n<li>Want to work on state-of-the-art language models with real-world impact</li>\n<li>Care about the societal impacts of your work</li>\n<li>Thrive in ambiguous environments where you define the path forward</li>\n</ul>\n<p>Nice to have:</p>\n<ul>\n<li>Experience with GPU Kernel Development: CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization</li>\n<li>ML Compilers &amp; Frameworks: PyTorch/JAX internals, torch.compile, XLA, custom operators</li>\n<li>Performance Engineering: Kernel fusion, memory bandwidth optimization, profiling with Nsight</li>\n<li>Distributed Systems: NCCL, NVLink, collective communication, model parallelism</li>\n<li>Low-Precision: INT8/FP8 quantization, mixed-precision techniques</li>\n<li>Production Systems: Large-scale training infrastructure, fault tolerance, cluster orchestration</li>\n</ul>\n<p>Representative projects:</p>\n<ul>\n<li>Co-design attention mechanisms and algorithms for next-generation hardware architectures</li>\n<li>Develop custom kernels for emerging quantization formats and mixed-precision techniques</li>\n<li>Design distributed communication strategies for multi-node GPU clusters</li>\n<li>Optimize end-to-end training and inference pipelines for frontier language models</li>\n<li>Build performance modeling frameworks to predict and optimize GPU utilization</li>\n<li>Implement kernel fusion strategies to minimize memory bandwidth bottlenecks</li>\n<li>Create resilient systems for planet-scale distributed training infrastructure</li>\n<li>Profile and eliminate performance bottlenecks in production serving infrastructure</li>\n<li>Partner with hardware vendors to influence future accelerator capabilities and software stacks</li>\n</ul>\n<p>Note: The salary range for this position is $280,000-$850,000 USD per year.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a 
href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_28107212-128","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/4926227008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$280,000-$850,000 USD per year","x-skills-required":["GPU programming","optimization at scale","CUDA","Triton","CUTLASS","Flash Attention","tensor core optimization","PyTorch/JAX internals","torch.compile","XLA","custom operators","kernel fusion","memory bandwidth optimization","profiling with Nsight","NCCL","NVLink","collective communication","model parallelism","INT8/FP8 quantization","mixed-precision techniques","large-scale training infrastructure","fault tolerance","cluster orchestration"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:40:11.758Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"GPU programming, optimization at scale, CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization, PyTorch/JAX internals, torch.compile, XLA, custom operators, kernel fusion, memory bandwidth optimization, profiling with Nsight, NCCL, NVLink, collective communication, model parallelism, INT8/FP8 quantization, mixed-precision techniques, large-scale training infrastructure, fault tolerance, cluster orchestration","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":280000,"maxValue":850000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_586b9fef-509"},"title":"Senior Software Engineer - Network Enablement (Applied ML)","description":"<p>We believe that the way people interact with their finances will drastically improve in the next few years. We&#39;re dedicated to empowering this transformation by building the tools and experiences that thousands of developers use to create their own products.</p>\n<p>On this team, you will build and operate the ML infrastructure and product services that enable trust and intelligence across Plaid&#39;s network. 
You&#39;ll own feature engineering, offline training and batch scoring, online feature serving, and real-time inference so model outputs directly power partner-facing fraud &amp; trust products and bank intelligence features.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Embed model inference into Network Enablement product flows and decision logic (APIs, feature flags, backend flows).</li>\n<li>Define and instrument product + ML success metrics (fraud reduction, retention lift, false positives, downstream impact).</li>\n<li>Design and run experiments and rollout plans (backtesting, shadow scoring, A/B tests, feature-flagged releases) to validate product hypotheses.</li>\n<li>Build and operate offline training pipelines and production batch scoring for bank intelligence products.</li>\n<li>Ship and maintain online feature serving and low-latency model inference endpoints for real-time partner/bank scoring.</li>\n<li>Implement model CI/CD, model/version registry, and safe rollout/rollback strategies.</li>\n<li>Monitor model/data health: drift/regression detection, model-quality dashboards, alerts, and SLOs targeted to partner product needs.</li>\n<li>Ensure offline and online parity, data lineage, and automated validation / data contracts to reduce regressions.</li>\n<li>Optimize inference performance and cost for real-time scoring (batching, caching, runtime selection).</li>\n<li>Ensure fairness, explainability, and PII-aware handling for partner-facing ML features; maintain auditability for compliance.</li>\n<li>Partner with platform and cross-functional teams to scale the ML/data foundation (graph features, sequence embeddings, unified pipelines).</li>\n<li>Mentor engineers and document team standards for ML productization and operations.</li>\n</ul>\n<p><strong>Qualifications</strong></p>\n<p>Must-haves:</p>\n<ul>\n<li>Strong software engineering skills including systems design, APIs, and building reliable backend services (Go or Python preferred).</li>\n<li>Production experience with batch and streaming data pipelines and orchestration tools such as Airflow or Spark.</li>\n<li>Experience building or operating real-time scoring and online feature-serving systems, including feature stores and low-latency model inference.</li>\n<li>Experience integrating model outputs into product flows (APIs, feature flags) and measuring impact through experiments and product metrics.</li>\n<li>Experience with model lifecycle and operations: model registries, CI/CD for models, reproducible training, offline &amp; online parity, monitoring and incident response.</li>\n</ul>\n<p>Nice to have:</p>\n<ul>\n<li>Experience in fraud, risk, or marketing intelligence domains.</li>\n<li>Experience with feature-store products (Tecton / Chronon / Feast / internal) and unified pipelines.</li>\n<li>Experience with graph frameworks, graph feature engineering, or sequence embeddings.</li>\n<li>Experience optimizing inference at scale (Triton/ONNX/quantization, batching, caching).</li>\n</ul>\n<p><strong>Additional Information</strong></p>\n<p>Our mission at Plaid is to unlock financial freedom for everyone. 
To support that mission, we seek to build a diverse team of driven individuals who care deeply about making the financial ecosystem more equitable.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_586b9fef-509","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Plaid","sameAs":"https://plaid.com/","logo":"https://logos.yubhub.co/plaid.com.png"},"x-apply-url":"https://jobs.lever.co/plaid/43b1374d-5c5e-4b63-b710-a95e3cb76bbe","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$190,800-$286,800 per year","x-skills-required":["software engineering","systems design","APIs","backend services","Go","Python","batch and streaming data pipelines","orchestration tools","Airflow","Spark","real-time scoring","online feature-serving systems","feature stores","low-latency model inference","model outputs","product flows","experiments","product metrics","model lifecycle","operations","model registries","CI/CD","reproducible training","offline & online parity","monitoring","incident response"],"x-skills-preferred":["fraud","risk","marketing intelligence","feature-store products","unified pipelines","graph frameworks","graph feature engineering","sequence embeddings","inference at scale","Triton","ONNX","quantization","batching","caching"],"datePosted":"2026-04-17T12:51:26.228Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"software engineering, systems design, APIs, backend services, Go, Python, batch and streaming data pipelines, orchestration tools, Airflow, Spark, real-time scoring, online feature-serving systems, feature stores, low-latency model inference, model outputs, product flows, experiments, product metrics, model lifecycle, operations, model registries, CI/CD, reproducible training, offline & online parity, monitoring, incident response, fraud, risk, marketing intelligence, feature-store products, unified pipelines, graph frameworks, graph feature engineering, sequence embeddings, inference at scale, Triton, ONNX, quantization, batching, caching","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":190800,"maxValue":286800,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_c8cd29e4-b57"},"title":"Software Engineer, Numerics","description":"<p>Job Title: Software Engineer, Numerics</p>\n<p>At DeepMind, we&#39;re a team of scientists, engineers, machine learning experts and more, working together to advance the state of the art in artificial intelligence. We use our technologies for widespread public benefit and scientific discovery, and collaborate with others on critical challenges, ensuring safety and ethics are the highest priority.</p>\n<p>This is a high-impact role that will improve the efficiency of serving through very low precision and sparse models. Additionally, this role will help drive the future Google HW roadmap to support forward-looking numerics. 
This role offers the unique opportunity to address a historically underserved but increasingly critical area in the advancement of AI.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Decide the precision, numerics, and sparsity formats used by key Google AI models and ensure these decisions are reflected in the roadmaps for corresponding hardware.</li>\n<li>Drive novel research advancements and bring the most impactful ideas into production.</li>\n</ul>\n<p>About You:</p>\n<p>To set you up for success as a Software Engineer at Google DeepMind, we look for the following skills and experience:</p>\n<ul>\n<li>PhD in Computer Science or a related field with 2+ years of relevant experience.</li>\n<li>Strong software-engineering skills in addition to a research background.</li>\n<li>Deep understanding of the numerics/quantization/sparsity literature.</li>\n<li>Practical experience driving low precision and sparse models through to production.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_c8cd29e4-b57","directApply":true,"hiringOrganization":{"@type":"Organization","name":"DeepMind","sameAs":"https://deepmind.com/","logo":"https://logos.yubhub.co/deepmind.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/deepmind/jobs/7389626","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["PhD in Computer Science or related field","Strong software-engineering skills","Deep understanding of the numerics/quantization/sparsity literature","Practical experience driving low precision and sparse models through to production"],"x-skills-preferred":[],"datePosted":"2026-03-16T14:45:24.312Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Mountain View, California, US"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"PhD in Computer Science or related field, Strong software-engineering skills, Deep understanding of the numerics/quantization/sparsity literature, Practical experience driving low precision and sparse models through to production"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_11a60d5a-f54"},"title":"Performance Engineer, GPU","description":"<p><strong>About the role:</strong></p>\n<p>Pioneering the next generation of AI requires breakthrough innovations in GPU performance and systems engineering. As a GPU Performance Engineer, you&#39;ll architect and implement the foundational systems that power Claude and push the frontiers of what&#39;s possible with large language models. You&#39;ll be responsible for maximizing GPU utilization and performance at unprecedented scale, developing cutting-edge optimizations that directly enable new model capabilities and dramatically improve inference efficiency.</p>\n<p>Working at the intersection of hardware and software, you&#39;ll implement state-of-the-art techniques from custom kernel development to distributed system architectures. 
Your work will span the entire stack—from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization.</p>\n<p>Strong candidates will have a track record of delivering transformative GPU performance improvements in production ML systems and will be excited to shape the future of AI infrastructure alongside world-class researchers and engineers.</p>\n<p><strong>You might be a good fit if you:</strong></p>\n<ul>\n<li>Have deep experience with GPU programming and optimization at scale</li>\n<li>Are impact-driven, passionate about delivering measurable performance breakthroughs</li>\n<li>Can navigate complex systems from hardware interfaces to high-level ML frameworks</li>\n<li>Enjoy collaborative problem-solving and pair programming</li>\n<li>Want to work on state-of-the-art language models with real-world impact</li>\n<li>Care about the societal impacts of your work</li>\n<li>Thrive in ambiguous environments where you define the path forward</li>\n</ul>\n<p><strong>Strong candidates may also have experience with:</strong></p>\n<ul>\n<li>GPU Kernel Development: CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization</li>\n<li>ML Compilers &amp; Frameworks: PyTorch/JAX internals, torch.compile, XLA, custom operators</li>\n<li>Performance Engineering: Kernel fusion, memory bandwidth optimization, profiling with Nsight</li>\n<li>Distributed Systems: NCCL, NVLink, collective communication, model parallelism</li>\n<li>Low-Precision: INT8/FP8 quantization, mixed-precision techniques</li>\n<li>Production Systems: Large-scale training infrastructure, fault tolerance, cluster orchestration</li>\n</ul>\n<p><strong>Representative projects:</strong></p>\n<ul>\n<li>Co-design attention mechanisms and algorithms for next-generation hardware architectures</li>\n<li>Develop custom kernels for emerging quantization formats and mixed-precision techniques</li>\n<li>Design distributed communication strategies for multi-node GPU clusters</li>\n<li>Optimize end-to-end training and inference pipelines for frontier language models</li>\n<li>Build performance modeling frameworks to predict and optimize GPU utilization</li>\n<li>Implement kernel fusion strategies to minimize memory bandwidth bottlenecks</li>\n<li>Create resilient systems for planet-scale distributed training infrastructure</li>\n<li>Profile and eliminate performance bottlenecks in production serving infrastructure</li>\n<li>Partner with hardware vendors to influence future accelerator capabilities and software stacks</li>\n</ul>\n<p><strong>Deadline to apply:</strong> None. 
Applications will be reviewed on a rolling basis.</p>\n<p>The expected salary range for this position is:</p>\n<p>Annual Salary: $280,000 - $850,000 USD</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_11a60d5a-f54","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/4926227008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$280,000 - $850,000 USD","x-skills-required":["GPU programming","optimization at scale","custom kernel development","distributed system architectures","low-level tensor core optimizations","orchestrating thousands of GPUs","GPU kernel development","CUDA","Triton","CUTLASS","Flash Attention","tensor core optimization","ML compilers & frameworks","PyTorch/JAX internals","torch.compile","XLA","custom operators","performance engineering","kernel fusion","memory bandwidth optimization","profiling with Nsight","distributed systems","NCCL","NVLink","collective communication","model parallelism","low-precision","INT8/FP8 quantization","mixed-precision techniques","production systems","large-scale training infrastructure","fault tolerance","cluster orchestration"],"x-skills-preferred":[],"datePosted":"2026-03-08T13:45:05.412Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"GPU programming, optimization at scale, custom kernel development, distributed system architectures, low-level tensor core optimizations, orchestrating thousands of GPUs, GPU kernel development, CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization, ML compilers & frameworks, PyTorch/JAX internals, torch.compile, XLA, custom operators, performance engineering, kernel fusion, memory bandwidth optimization, profiling with Nsight, distributed systems, NCCL, NVLink, collective communication, model parallelism, low-precision, INT8/FP8 quantization, mixed-precision techniques, production systems, large-scale training infrastructure, fault tolerance, cluster orchestration","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":280000,"maxValue":850000,"unitText":"YEAR"}}},
{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_46bb9922-091"},"title":"ML Research Engineer - Hardware Codesign","description":"<p><strong>Location</strong></p>\n<p>San Francisco</p>\n<p><strong>Employment Type</strong></p>\n<p>Full time</p>\n<p><strong>Department</strong></p>\n<p>Scaling</p>\n<p><strong>Compensation</strong></p>\n<ul>\n<li>$185K – $455K • Offers Equity</li>\n</ul>\n<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>\n<ul>\n<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>\n<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>\n<li>401(k) retirement plan with employer match</li>\n<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>\n<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>\n<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>\n<li>Mental health and wellness support</li>\n<li>Employer-paid basic life and disability coverage</li>\n<li>Annual learning and development stipend to fuel your professional growth</li>\n<li>Daily meals in our offices, and meal delivery credits as eligible</li>\n<li>Relocation support for eligible employees</li>\n<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>\n</ul>\n<p>More details about our benefits are available to candidates during the hiring process.</p>\n<p>This role is at-will and OpenAI reserves the right to modify base pay and other compensation components at any time based on individual performance, team or company results, or market conditions.</p>\n<p><strong>About the Team</strong></p>\n<p>OpenAI’s Hardware organization develops silicon and system-level solutions designed for the unique demands of advanced AI workloads. The team is responsible for building the next generation of AI silicon while working closely with software and research partners to co-design hardware tightly integrated with AI models. 
In addition to delivering production-grade silicon for OpenAI’s supercomputing infrastructure, the team also creates custom design tools and methodologies that accelerate innovation and enable hardware optimized specifically for AI.</p>\n<p><strong>About the Role</strong></p>\n<p>We’re seeking a Research-Hardware Codesign Engineer to operate at the boundary between model research and silicon/system architecture. You’ll help shape the numerics, architecture, and technology bets of future OpenAI silicon in collaboration with both Research and Hardware.</p>\n<p>Your work will include debugging gaps between rooflines and reality, writing quantization kernels, derisking numerics via model evals, quantifying system architecture tradeoffs, and implementing novel numeric RTL. This is a hands-on role for people who go looking for hard problems, get to ground truth, and drive it to production. Strong prioritization and clear, honest communication are essential.</p>\n<p>Location: San Francisco, CA (Hybrid: 3 days/week onsite)</p>\n<p>Relocation assistance available.</p>\n<p><strong>In this role you will:</strong></p>\n<ul>\n<li>Build on our roofline simulator to track evolving workloads, and deliver analyses that quantify the impact of system architecture decisions and support technology pathfinding.</li>\n<li>Debug gaps between performance simulation and real measurements; clearly communicate root cause, bottlenecks, and invalid assumptions.</li>\n<li>Write emulation kernels for low-precision numerics and lossy compression schemes, and get Research the information they need to trade efficiency with model quality.</li>\n<li>Prototype numerics modules by pushing RTL through synthesis; hand off novel numerics cleanly, or occasionally own an RTL module end-to-end.</li>\n<li>Proactively pull in new ML workloads, prototype them with rooflines and/or functional simulation, and drive initial evaluation of new opportunities or risks.</li>\n<li>Understand the whole picture from ML science to hardware optimization, and slice this end-to-end objective into near-term deliverables.</li>\n<li>Build ad-hoc collaborations across teams with very different goals and areas of expertise, and keep progress unblocked.</li>\n<li>Communicate design tradeoffs clearly with explicit assumptions and confidence levels; produce a trail of evidence that enables confident execution.</li>\n</ul>\n<p><strong>You will thrive in this role if you have:</strong></p>\n<ul>\n<li>An exceptional track record of high-quality technical output, and a bias for shipping a prototype now and iterating later in the absence of clear requirements.</li>\n<li>Strong Python, and C++ or Rust, with a cautious attitude toward correctness and an intuition for clean extensibility.</li>\n<li>Experience writing Triton, CUDA, or similar, and an understanding of the resulting mapping of tensor ops to functional units.</li>\n<li>Working knowledge of PyTorch or JAX; experience in large ML codebases is a plus.</li>\n<li>Practical understanding of floating point numerics, the ML tradeoffs of reduced precision, and the current state of the art in model quantization.</li>\n<li>Deep understanding of transformer models, and strong intuition for transformer rooflines and the tradeoffs of sharded training and inference in large-scale ML systems.</li>\n<li>Experience writing RTL (especially for floating point logic) and understanding of PPA tradeoffs is a plus.</li>\n<li>Strong cross-functional communication (e.g. across ML researchers and hardware engineers); ability to slice ambiguous early-incubation ideas into concrete arenas in which progress can be made.</li>\n</ul>\n<p>To comply with U.S. export control laws and regulations, candidates for this role may need to meet certain legal status requirements.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_46bb9922-091","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://openai.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/5931abef-191b-417e-89f1-1d06f00e908c","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$185K – $455K","x-skills-required":["Python","C++","Rust","Triton","CUDA","PyTorch","JAX","Floating point numerics","Model quantization","Transformer models","RTL","PPA tradeoffs"],"x-skills-preferred":["Strong Python","C++ or Rust","Experience writing Triton","CUDA or similar","Working knowledge of PyTorch or JAX","Experience in large ML codebases"],"datePosted":"2026-03-06T18:28:06.437Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Python, C++, Rust, Triton, CUDA, PyTorch, JAX, Floating point numerics, Model quantization, Transformer models, RTL, PPA tradeoffs, Strong Python, C++ or Rust, Experience writing Triton, CUDA or similar, Working knowledge of PyTorch or JAX, Experience in large ML codebases","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":185000,"maxValue":455000,"unitText":"YEAR"}}},
{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_4d9cbf90-719"},"title":"Principal Software Engineer","description":"<p><strong>Summary</strong></p>\n<p>Microsoft is looking for a highly experienced Principal Software Engineer to join its Ads Engineering Platform team.</p>\n<p><strong>About the Role</strong></p>\n<p>We are seeking a highly experienced software engineer to join our Ads Engineering Platform team to design and build next-generation Ads products that drive revenue growth and create innovative advertising experiences for users and advertisers. You will play a key role in evolving the core capabilities of our ad-serving infrastructure—the engine that powers advertising across Bing Search, MSN, Microsoft Start, and shopping experiences in Microsoft Edge. Our serving stack operates at massive global scale, delivering millions of ad requests per second through a geo-distributed, low-latency system that integrates real-time bidding, intelligent ranking, and ML-driven decisioning pipelines. We leverage a mix of CPU and GPU-based inference to balance latency, throughput, and cost efficiency. This role combines product innovation, distributed systems architecture, and performance engineering. 
You will help shape both new monetization capabilities and the next generation of model serving infrastructure that powers them.</p>\n<p><strong>Accountabilities</strong></p>\n<ul>\n<li>Design and build new Ads products and monetization capabilities that unlock incremental revenue and enhance advertiser and end-user experiences.</li>\n<li>Lead the development of large-scale, distributed online serving systems to process millions of ad requests per second with ultra-low latency and high reliability.</li>\n</ul>\n<p><strong>The Candidate we&#39;re looking for</strong></p>\n<p><strong>Experience:</strong></p>\n<ul>\n<li>10+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python.</li>\n</ul>\n<p><strong>Technical skills:</strong></p>\n<ul>\n<li>Proven experience designing and operating real-time online serving or ranking systems.</li>\n<li>Strong understanding of distributed systems fundamentals: concurrency, multi-threading, memory management, networking, and fault tolerance.</li>\n</ul>\n<p><strong>Personal attributes:</strong></p>\n<ul>\n<li>Demonstrated ability to diagnose performance bottlenecks and improve latency, throughput, and cost efficiency in high-traffic systems.</li>\n</ul>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Competitive salary range of $163,000 - $296,400 per year.</li>\n<li>Benefits and other compensation.</li>\n<li>Opportunities for professional growth and development.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_4d9cbf90-719","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/principal-software-engineer-32/","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$163,000 - $296,400 per year","x-skills-required":["C","C++","C#","Java","JavaScript","Python","distributed systems","real-time online serving","ranking systems"],"x-skills-preferred":["GPU performance optimization","model quantization","efficient resource scheduling"],"datePosted":"2026-03-06T07:34:43.749Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Redmond"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"C, C++, C#, Java, JavaScript, Python, distributed systems, real-time online serving, ranking systems, GPU performance optimization, model quantization, efficient resource scheduling","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":163000,"maxValue":296400,"unitText":"YEAR"}}}]}