{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/distributed-training-frameworks"},"x-facet":{"type":"skill","slug":"distributed-training-frameworks","display":"Distributed Training Frameworks","count":4},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_f0f66ce3-d78"},"title":"Senior GenAI Research Engineer - Optimization and Kernels","description":"<p>As a research engineer on the Scaling team at Databricks, you will be responsible for keeping up with the latest developments in deep learning and advancing the scientific frontier by creating new techniques that go beyond the state of the art.</p>\n<p>You will work together on a collaborative team of researchers and engineers with diverse backgrounds and technical training. 
Your goal will be to make our customers successful in applying state-of-the-art LLMs and AI systems, and we encode our scientific expertise into our products to make that possible.</p>\n<p>Your responsibilities will include:</p>\n<ul>\n<li>Driving performance improvements through advanced optimization techniques including kernel fusion, mixed precision, memory layout optimization, tiling strategies, and tensorization for training-specific patterns</li>\n</ul>\n<ul>\n<li>Designing, implementing, and optimizing high-performance GPU kernels for training workloads (e.g., attention mechanisms, custom layers, gradient computation, activation functions) targeting NVIDIA architectures</li>\n</ul>\n<ul>\n<li>Designing and implementing distributed training frameworks for large language models, including parallelism strategies (data, tensor, pipeline, ZeRO-based) and optimized communication patterns for gradient synchronization and collective operations</li>\n</ul>\n<ul>\n<li>Profiling, debugging, and optimizing end-to-end training workflows to identify and resolve performance bottlenecks, applying memory optimization techniques like activation checkpointing, gradient sharding, and mixed precision training</li>\n</ul>\n<p>We look for candidates with a strong background in computer science or a related field, hands-on experience writing and tuning CUDA kernels for ML training applications, and a deep understanding of parallelism techniques and memory optimization strategies for large-scale model training.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a 
href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_f0f66ce3-d78","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Databricks","sameAs":"https://databricks.com","logo":"https://logos.yubhub.co/databricks.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/databricks/jobs/8297797002","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$166,000-$225,000 USD","x-skills-required":["CUDA","NVIDIA GPU architecture","PyTorch","distributed training frameworks","parallelism techniques","memory optimization strategies"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:57:26.571Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, California"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"CUDA, NVIDIA GPU architecture, PyTorch, distributed training frameworks, parallelism techniques, memory optimization strategies","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":166000,"maxValue":225000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_ff4d3a91-b20"},"title":"Principal Engineer - Perf and Benchmarking","description":"<p>We&#39;re looking for a Principal Engineer to be the technical lead of CoreWeave&#39;s Benchmarking &amp; Performance team. 
You will be responsible for our planet-scale performance data warehouse: ingesting, storing, transforming and analyzing performance events in all the data centers across our global infrastructure.</p>\n<p>You will also be an integral part of achieving industry-leading end-to-end performance benchmarking publications: If MLPerf (Training &amp; Inference), working closely with NVIDIA (Megatron-LM, TensorRT-LLM &amp; DGX cloud), and the open-source community (llm-d, vLLM and all popular ML frameworks) speak to you, come help us demonstrate CoreWeave&#39;s performance and reliability leadership in the field.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Strategy &amp; Leadership - Define the multi-year benchmarking strategy and roadmap; prioritize models/workloads (LLMs, diffusion, vision, speech) and hardware tiers. Build, lead, and mentor a high-performing team of performance engineers and data analysts. Establish governance for claims: documented methodologies, versioning, reproducibility, and audit trails.</li>\n</ul>\n<ul>\n<li>Perf Ownership - Lead end-to-end MLPerf Inference and Training submissions: workload selection, cluster planning, runbooks, audits, and result publication. Coordinate optimization tracks with NVIDIA (CUDA, cuDNN, TensorRT/TensorRT-LLM, Triton, NCCL) to hit competitive results; drive upstream fixes where needed.</li>\n</ul>\n<ul>\n<li>Internal Latency &amp; Throughput Benchmarks - Design a Kubernetes-native, repeatable benchmarking service that exercises CoreWeave stacks across SUNK (Slurm on Kubernetes), Kueue, and Kubeflow pipelines. Measure and report p50/p95/p99 latency, jitter, tokens/s, time-to-first-token, cold-start/warm-start, and cost-per-token/request across models, precisions (BF16/FP8/FP4), batch sizes, and GPU types. 
Maintain a corpus of representative scenarios (streaming, batch, multi-tenant) and data sets; automate comparisons across software releases and hardware generations.</li>\n</ul>\n<ul>\n<li>Tooling &amp; Automation - Build CI/CD pipelines and K8s controllers/operators to schedule benchmarks at scale; integrate with observability stacks (Prometheus, Grafana, OpenTelemetry) and results warehouses. Implement supply-chain integrity for benchmark artifacts (SBOMs, Cosign signatures).</li>\n</ul>\n<ul>\n<li>Cross-functional &amp; Community - Partner with NVIDIA, key ISVs, and OSS projects (vLLM, Triton, KServe, PyTorch/DeepSpeed, ONNX Runtime) to co-develop optimizations and upstream improvements. Support Sales/SEs with authoritative numbers for RFPs and competitive evaluations; brief analysts and press with rigorous, defensible data.</li>\n</ul>\n<p><strong>Requirements</strong></p>\n<ul>\n<li>10+ years building distributed systems or HPC/cloud services, with deep expertise on large-scale ML training or similar high-performance workloads.</li>\n</ul>\n<ul>\n<li>Proven track record of architecting or building planet-scale data systems (e.g., telemetry platforms, observability stacks, cloud data warehouses, large-scale OLAP engines).</li>\n</ul>\n<ul>\n<li>Deep understanding of GPU performance (CUDA, NCCL, RDMA, NVLink/PCIe, memory bandwidth), model-server stacks (Triton, vLLM, TensorRT-LLM, TorchServe), and distributed training frameworks (PyTorch FSDP/DeepSpeed/Megatron-LM).</li>\n</ul>\n<ul>\n<li>Proficient with Kubernetes and ML control planes; familiarity with SUNK, Kueue, and Kubeflow in production environments.</li>\n</ul>\n<ul>\n<li>Excellent communicator able to interface with executives, customers, auditors, and OSS communities.</li>\n</ul>\n<p><strong>Nice to have</strong></p>\n<ul>\n<li>Experience with time-series databases, log-structured merge trees (LSM), or custom storage engine development.</li>\n</ul>\n<ul>\n<li>Experience running MLPerf submissions 
(Inference and/or Training) or equivalent audited benchmarks at scale.</li>\n</ul>\n<ul>\n<li>Contributions to MLPerf, Triton, vLLM, PyTorch, KServe, or similar OSS projects.</li>\n</ul>\n<ul>\n<li>Experience benchmarking multi-region fleets and large clusters (thousands of GPUs).</li>\n</ul>\n<ul>\n<li>Publications/talks on ML performance, latency engineering, or large-scale benchmarking methodology.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_ff4d3a91-b20","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4627302006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$206,000 to $333,000","x-skills-required":["Distributed systems","HPC/cloud services","Large-scale ML training","GPU performance","Model-server stacks","Distributed training frameworks","Kubernetes","ML control planes","Time-series databases","Log-structured merge trees","Custom storage engine development"],"x-skills-preferred":["MLPerf submissions","Audited benchmarks","Contributions to OSS projects","Benchmarking multi-region fleets","Large clusters","Publications/talks on ML performance"],"datePosted":"2026-04-18T15:51:17.448Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Sunnyvale, CA / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Distributed systems, HPC/cloud services, Large-scale ML training, GPU performance, Model-server stacks, Distributed training frameworks, Kubernetes, ML control planes, Time-series databases, Log-structured merge trees, Custom storage engine development, MLPerf submissions, Audited benchmarks, 
Contributions to OSS projects, Benchmarking multi-region fleets, Large clusters, Publications/talks on ML performance","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":206000,"maxValue":333000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_d6f9b362-dbe"},"title":"Senior Machine Learning Engineer, ML Training Platform","description":"<p>As a Senior Machine Learning Engineer on the Machine Learning Platform team at Reddit, you will be instrumental in architecting, implementing, and maintaining foundational Machine Learning (ML) infrastructure that powers Feeds Ranking, Content Understanding, Recommendations and more.</p>\n<p>You will deliver a self-service ML platform that enables the continuous iteration and improvement of systems that use ML techniques including Deep Learning, Natural Language Processing, Recommendation Systems, Representation Learning and Computer Vision.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Leading the building, testing, and maintenance of ML training infrastructure at Reddit</li>\n<li>Designing, building, and optimizing the infrastructure and tooling required to support large-scale machine learning workflows</li>\n<li>Evolving the MLE experience, from provisioning interactive GPU environments through large-scale training, supporting on-demand and self-service workflows</li>\n</ul>\n<p>You will work closely with the underlying compute team to ensure MLEs have efficient access to training hardware resources and handle resource contention gracefully.</p>\n<p>In addition to technical expertise, you will treat internal MLEs as your customers, conducting user research, reducing friction in the &#39;Idea-to-Prototype&#39; loop, and standardizing software environments (Docker images, Python dependency management).</p>\n<p>To be successful in this role, you will have 5+ years of software 
engineering experience, with a focus on Platform Engineering, ML Infrastructure, or Backend Systems. You will also have deep Kubernetes expertise, Jupyter Ecosystem knowledge, strong coding skills in Python and Go, and experience with GPU environments, cloud providers, and distributed training frameworks.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_d6f9b362-dbe","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Reddit","sameAs":"https://www.redditinc.com","logo":"https://logos.yubhub.co/redditinc.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/reddit/jobs/7074776","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$216,700-$303,400 USD","x-skills-required":["Kubernetes","Jupyter Ecosystem","Python","Go","GPU environments","Cloud providers","Distributed training frameworks"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:48:57.345Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Remote - United States"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Kubernetes, Jupyter Ecosystem, Python, Go, GPU environments, Cloud providers, Distributed training frameworks","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":216700,"maxValue":303400,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_71554e46-b64"},"title":"Senior Engineering Manager, AI Runtime","description":"<p>At Databricks, we are committed to enabling data teams to solve the world&#39;s toughest problems. 
As a Senior Engineering Manager, you will lead the team owning both the product experience and the foundational infrastructure of our AI Runtime (AIR) product.</p>\n<p>You will be responsible for shaping customer-facing capabilities while designing for scalability, extensibility, and performance of GPU training and adjacent areas. This will involve collaborating closely across the platform, product, infrastructure, and research organisations.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Leading, mentoring, and growing a high-performing engineering team responsible for the Custom Training product and its foundational infrastructure</li>\n<li>Defining and owning the product and technical roadmap for AIR, balancing customer experience, functionality, and foundational investments</li>\n<li>Collaborating closely with product, research, platform, infrastructure teams, and customers to drive end-to-end delivery</li>\n<li>Driving architectural decisions and product design for managed GPU training at scale</li>\n<li>Advocating for customer needs through direct engagement, ensuring engineering decisions translate to clear product impact</li>\n</ul>\n<p>We are looking for someone with 8+ years of software engineering experience, with 3+ years in engineering management. You should have a track record of building and operating managed GPU training infrastructure at scale, as well as deep familiarity with distributed training frameworks and parallelism strategies.</p>\n<p>In addition, you should have experience with training resilience patterns, such as checkpointing, elastic training, and automated failure recovery for long-running jobs. You should also have a strong understanding of GPU performance fundamentals, including NCCL, interconnect topologies, and memory optimisation.</p>\n<p>Experience building platform products with clear SLAs is also essential, as is strong cross-functional leadership across platform, product, and research teams. 
Excellent collaboration and communication skills are also required.</p>\n<p>The pay range for this role is $228,600-$314,250 USD per year, depending on location. The total compensation package may also include eligibility for annual performance bonus, equity, and benefits.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_71554e46-b64","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Databricks","sameAs":"https://databricks.com","logo":"https://logos.yubhub.co/databricks.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/databricks/jobs/8490282002","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$228,600-$314,250 USD per year","x-skills-required":["software engineering","engineering management","distributed training frameworks","parallelism strategies","GPU training infrastructure","checkpointing","elastic training","automated failure recovery","GPU performance fundamentals","NCCL","interconnect topologies","memory optimisation"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:45:28.312Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Mountain View, California; San Francisco, California"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"software engineering, engineering management, distributed training frameworks, parallelism strategies, GPU training infrastructure, checkpointing, elastic training, automated failure recovery, GPU performance fundamentals, NCCL, interconnect topologies, memory optimisation","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":228600,"maxValue":314250,"unitText":"YEAR"}}}]}