{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/past-experience-working-on-distributed-training-for-large-models"},"x-facet":{"type":"skill","slug":"past-experience-working-on-distributed-training-for-large-models","display":"past experience working on distributed training for large models","count":1},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_b79d9627-55a"},"title":"Research Engineer, Infrastructure, Training Systems","description":"<p>We&#39;re seeking an infrastructure research engineer to design and build scalable, efficient training systems for large models. As a key member of our team, you&#39;ll take ownership of the training stack end-to-end, ensuring every GPU cycle drives scientific progress. Your goal is to make experimentation and training at Thinking Machines fast and reliable, allowing our research teams to focus on science, not system bottlenecks.</p>\n<p>Key responsibilities include designing, implementing, and optimizing distributed training systems, developing high-performance optimizations, and establishing standards for reliability, maintainability, and security. You&#39;ll collaborate with researchers and engineers to build scalable infrastructure and publish learnings through internal documentation, open-source libraries, or technical reports.</p>\n<p>We&#39;re looking for someone who blends deep systems and performance expertise with a curiosity for machine learning at scale. A strong understanding of deep learning frameworks, such as PyTorch, and experience working on distributed training for large models are preferred. If you have a track record of improving research productivity through infrastructure design or process improvements, that&#39;s a plus.</p>\n<p>This role is based in San Francisco, California, and offers a competitive salary range of $350,000 - $475,000 USD per year, depending on background, skills, and experience. We sponsor visas and offer generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_b79d9627-55a","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Thinking Machines Lab","sameAs":"https://thinkingmachines.ai/","logo":"https://logos.yubhub.co/thinkingmachines.ai.png"},"x-apply-url":"https://job-boards.greenhouse.io/thinkingmachines/jobs/5013932008?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$350,000 - $475,000 USD per year","x-skills-required":["deep learning frameworks","distributed training","high-performance optimizations","reliability, maintainability, and security","scalable infrastructure"],"x-skills-preferred":["past experience working on distributed training for large models","track record of improving research productivity through infrastructure design or process improvements","contributions to open-source ML infrastructure"],"datePosted":"2026-04-18T15:57:59.640Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"deep learning frameworks, distributed training, high-performance optimizations, reliability, maintainability, and security, scalable infrastructure, past experience working on distributed training for large models, track record of improving research productivity through infrastructure design or process improvements, contributions to open-source ML infrastructure","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":350000,"maxValue":475000,"unitText":"YEAR"}}}]}