{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/familiarity-with-monitoring-and-observability-tools"},"x-facet":{"type":"skill","slug":"familiarity-with-monitoring-and-observability-tools","display":"familiarity with monitoring and observability tools","count":1},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_0a2ea62c-943"},"title":"Research Engineer, Infrastructure, RL Systems","description":"<p>We&#39;re looking for an infrastructure research engineer to design and build the core systems that enable scalable, efficient training of large models through reinforcement learning.</p>\n<p>This role sits at the intersection of research and large-scale systems engineering: a builder who understands both the algorithms behind RL and the realities of distributed training and inference at scale. 
You&#39;ll wear many hats, from optimising rollout and reward pipelines to enhancing reliability, observability, and orchestration, collaborating closely with researchers and infra teams to make reinforcement learning stable, fast, and production-ready.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Design, build, and optimise the infrastructure that powers large-scale reinforcement learning and post-training workloads.</li>\n</ul>\n<ul>\n<li>Improve the reliability and scalability of RL training pipelines, distributed RL workloads, and training throughput.</li>\n</ul>\n<ul>\n<li>Develop shared monitoring and observability tools to ensure high uptime, debuggability, and reproducibility for RL systems.</li>\n</ul>\n<ul>\n<li>Collaborate with researchers to translate algorithmic ideas into production-grade training pipelines.</li>\n</ul>\n<ul>\n<li>Build evaluation and benchmarking infrastructure that measures model progress on helpfulness, safety, and factuality.</li>\n</ul>\n<ul>\n<li>Publish and share learnings through internal documentation, open-source libraries, or technical reports that advance the field of scalable AI infrastructure.</li>\n</ul>\n<p>We&#39;re looking for someone with strong engineering skills and the ability to contribute performant, maintainable code and debug in complex codebases. You should have a good understanding of deep learning frameworks (e.g., PyTorch, JAX) and their underlying system architectures.</p>\n<p>Experience training or supporting large-scale language models with tens of billions of parameters or more is a plus. Familiarity with monitoring and observability tools (Prometheus, Grafana, OpenTelemetry) is also a plus.</p>\n<p>Logistics:</p>\n<ul>\n<li>Location: This role is based in San Francisco, California.</li>\n</ul>\n<ul>\n<li>Compensation: Depending on background, skills and experience, the expected annual salary range for this position is $350,000 - $475,000 USD.</li>\n</ul>\n<ul>\n<li>Visa sponsorship: We sponsor visas. 
While we can&#39;t guarantee success for every candidate or role, if you&#39;re the right fit, we&#39;re committed to working through the visa process together.</li>\n</ul>\n<ul>\n<li>Benefits: Thinking Machines offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_0a2ea62c-943","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Thinking Machines Lab","sameAs":"https://thinkingmachineslab.com/","logo":"https://logos.yubhub.co/thinkingmachineslab.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/thinkingmachines/jobs/5013930008?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$350,000 - $475,000 USD","x-skills-required":["deep learning frameworks","PyTorch","JAX","complex codebases","scalable AI infrastructure","large-scale language models","monitoring and observability tools"],"x-skills-preferred":["experience training or supporting large-scale language models","familiarity with monitoring and observability tools"],"datePosted":"2026-04-18T15:56:59.642Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"deep learning frameworks, PyTorch, JAX, complex codebases, scalable AI infrastructure, large-scale language models, monitoring and observability tools, experience training or supporting large-scale language models, familiarity with monitoring and observability tools","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":350000,"maxValue":475000,"unitText":"YEAR"}}}]}