Research Engineer, Infrastructure, RL Systems

We're looking for an infrastructure research engineer to design and build the core systems that enable scalable, efficient training of large models through reinforcement learning.

This role sits at the intersection of research and large-scale systems engineering: a builder who understands both the algorithms behind RL and the realities of distributed training and inference at scale. You'll wear many hats, from optimising rollout and reward pipelines to enhancing reliability, observability, and orchestration, collaborating closely with researchers and infra teams to make reinforcement learning stable, fast, and production-ready.

Responsibilities:

Design, build, and optimise the infrastructure that powers large-scale reinforcement learning and post-training workloads.

Improve the reliability and scalability of RL training pipelines and distributed RL workloads, and increase training throughput.

Develop shared monitoring and observability tools to ensure high uptime, debuggability, and reproducibility for RL systems.

Collaborate with researchers to translate algorithmic ideas into production-grade training pipelines.

Build evaluation and benchmarking infrastructure that measures model progress on helpfulness, safety, and factuality.

Publish and share learnings through internal documentation, open-source libraries, or technical reports that advance the field of scalable AI infrastructure.

We're looking for someone with strong engineering skills and the ability to write performant, maintainable code and debug complex codebases. You should have a good understanding of deep learning frameworks (e.g., PyTorch, JAX) and their underlying system architectures.

Experience training or supporting large-scale language models with tens of billions of parameters or more is a plus. Familiarity with monitoring and observability tools (Prometheus, Grafana, OpenTelemetry) is also a plus.

Logistics:

Location: This role is based in San Francisco, California.

Compensation: Depending on background, skills and experience, the expected annual salary range for this position is $350,000 - $475,000 USD.

Visa sponsorship: We sponsor visas. While we can't guarantee success for every candidate or role, if you're the right fit, we're committed to working through the visa process together.

Benefits: Thinking Machines offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.


Employment type: Full-time, senior, onsite.

Company: Thinking Machines Lab is a research organisation that focuses on developing collaborative general intelligence. https://thinkingmachineslab.com/

Apply: https://job-boards.greenhouse.io/thinkingmachines/jobs/5013930008?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply

Date: 2026-04-18