<?xml version="1.0" encoding="UTF-8"?>
<source>
  <jobs>
    <job>
      <externalid>465e2cfb-ddc</externalid>
      <Title>Staff Machine Learning Research Scientist, LLM Evals</Title>
      <Description><![CDATA[<p>As a Staff Machine Learning Research Scientist on the LLM Evals team, you will lead the development of novel evaluation methodologies, metrics, and benchmarks to measure the capabilities and limitations of frontier LLMs.</p>
<p>Your primary responsibilities will include:</p>
<ul>
<li>Driving research on the effectiveness and limitations of existing LLM evaluation techniques.</li>
<li>Designing and developing novel evaluation benchmarks for large language models, covering areas such as instruction following, factuality, robustness, and fairness.</li>
<li>Communicating, collaborating, and building relationships with clients and peer teams to facilitate cross-functional projects.</li>
<li>Collaborating with internal teams and external partners to refine metrics and create standardized evaluation protocols.</li>
<li>Implementing scalable and reproducible evaluation pipelines using modern ML frameworks.</li>
<li>Publishing research findings in top-tier AI conferences and contributing to open-source benchmarking initiatives.</li>
<li>Mentoring and guiding research scientists and engineers, providing technical leadership across cross-functional projects.</li>
<li>Staying deeply engaged with the ML research community, tracking emerging work and contributing to the advancement of LLM evaluation science.</li>
</ul>
<p>The ideal candidate will have 5+ years of hands-on experience with large language models, NLP, and Transformer modeling, in both research and engineering settings.</p>
<p>You will thrive in a high-energy, fast-paced startup environment and be ready to dedicate the time and effort needed to drive impactful results.</p>
<p>Compensation packages at Scale for eligible roles include base salary, equity, and benefits. The range displayed on each job posting reflects the minimum and maximum target for new hire salaries for the position, determined by work location and additional factors, including job-related skills, experience, interview performance, and relevant education or training.</p>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>$264,800-$331,000 USD</Salaryrange>
      <Skills>large language model, NLP, Transformer modeling, evaluation methodologies, metrics, benchmarks, instruction following, factuality, robustness, fairness</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Scale</Employername>
      <Employerlogo>https://logos.yubhub.co/scale.com.png</Employerlogo>
      <Employerdescription>Scale develops reliable AI systems for the world&apos;s most important decisions, providing high-quality data and full-stack technologies to power leading models.</Employerdescription>
      <Employerwebsite>https://scale.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/scaleai/jobs/4628044005</Applyto>
      <Location>San Francisco, CA; Seattle, WA; New York, NY</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>60a7e1e6-b51</externalid>
<Title>Tech Lead/Manager, Machine Learning Research Scientist - LLM Evals</Title>
      <Description><![CDATA[<p>As the leading data and evaluation partner for frontier AI companies, we&#39;re dedicated to advancing the evaluation and benchmarking of large language models (LLMs). Our Research teams work with the industry&#39;s leading AI labs to provide high-quality data and accelerate progress in GenAI research.</p>
<p>We&#39;re seeking a Tech Lead Manager to lead a talented team of research scientists and research engineers focused on developing and implementing novel evaluation methodologies, metrics, and benchmarks to assess the capabilities and limitations of our cutting-edge LLMs.</p>
<p>Key responsibilities:</p>
<ul>
<li>Lead a team of highly effective research scientists and research engineers on LLM evals.</li>
<li>Conduct research on the effectiveness and limitations of existing LLM evaluation techniques.</li>
<li>Design and develop novel evaluation benchmarks for large language models, covering areas such as instruction following, factuality, robustness, and fairness.</li>
<li>Communicate, collaborate, and build relationships with clients and peer teams to facilitate cross-functional projects.</li>
<li>Collaborate with internal teams and external partners to refine metrics and create standardized evaluation protocols.</li>
<li>Implement scalable and reproducible evaluation pipelines using modern ML frameworks.</li>
<li>Publish research findings in top-tier AI conferences and contribute to open-source benchmarking initiatives.</li>
</ul>
<p>The ideal candidate has 5+ years of hands-on experience with large language models, NLP, and Transformer modeling, in both research and engineering settings. Experience supporting and leading a team of research scientists and research engineers is also required.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>$264,800-$331,000 USD</Salaryrange>
      <Skills>large language model, NLP, Transformer modeling, research and engineering development, team leadership, cross-functional collaboration, evaluation methodologies, metrics and benchmarks, scalable and reproducible evaluation pipelines, modern ML frameworks, published research in top-tier AI conferences, open-source benchmarking initiatives, customer-facing role</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Scale</Employername>
      <Employerlogo>https://logos.yubhub.co/scale.com.png</Employerlogo>
      <Employerdescription>Scale develops reliable AI systems for the world&apos;s most important decisions, providing high-quality data and full-stack technologies.</Employerdescription>
      <Employerwebsite>https://scale.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/scaleai/jobs/4304790005</Applyto>
      <Location>San Francisco, CA; Seattle, WA; New York, NY</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>10bf8d86-b30</externalid>
      <Title>Research Engineer, Safeguards Labs</Title>
      <Description><![CDATA[<p><strong>About the Role</strong></p>
<p>We&#39;re hiring research engineers to define and execute the Labs research agenda. You&#39;ll scope your own projects, run experiments end-to-end, and decide when an idea is ready to hand off to a production team, or when to kill it and move on.</p>
<p><strong>Responsibilities:</strong></p>
<ul>
<li>Lead and contribute to research projects investigating new methods for detecting misuse of Claude, identifying malicious organisations and accounts, strengthening model safeguards, and other safety needs.</li>
<li>Design and run offline analyses over model usage data to surface abuse patterns, build classifiers and detection systems, and evaluate their effectiveness.</li>
<li>Develop and iterate on prototypes that could eventually feed signals into the real-time safeguards path, partnering with engineers on tech transfer.</li>
<li>Contribute to a broader research portfolio investigating methods for detecting abusive behaviour in chat-based or agentive workflows, and for training the model to robustly refrain from dangerous responses or behaviours without over-refusing.</li>
<li>Build evaluations and methodologies for measuring whether safeguards actually work, including in agentic settings.</li>
<li>Write up findings clearly so they inform decisions across Trust &amp; Safety, research, and product teams.</li>
</ul>
<p><strong>You may be a good fit if you:</strong></p>
<ul>
<li>Have a track record of independently driving research projects from ambiguous problem statements to concrete results, ideally in AI, ML, security, integrity, or a related technical field.</li>
<li>Are comfortable scoping your own work and switching between research, engineering, and analysis as a project demands.</li>
<li>Have working familiarity with how large language models operate (sampling, prompting, training), even if LLMs aren&#39;t your primary background.</li>
<li>Are proficient in Python and comfortable working with large datasets.</li>
<li>Care about the societal impacts of AI and want your work to directly reduce real-world harm.</li>
</ul>
<p><strong>Strong candidates may also have:</strong></p>
<ul>
<li>Experience building and training machine learning models, including classifiers for abuse, fraud, integrity, or security applications.</li>
<li>Knowledge of evaluation methodologies for language models and experience designing evals.</li>
<li>Experience with agentic environments and evaluating model behaviour in them.</li>
<li>Background in trust and safety, integrity, fraud detection, threat intelligence, or adversarial ML.</li>
<li>Experience with red teaming, jailbreak research, or interpretability methods like steering vectors.</li>
<li>A history of taking research prototypes and transferring them into production systems.</li>
</ul>
<p><strong>Logistics</strong></p>
<ul>
<li>Minimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience</li>
<li>Required field of study: A field relevant to the role as demonstrated through coursework, training, or professional experience</li>
<li>Minimum years of experience: Years of experience required will correlate with the internal job level requirements for the position</li>
</ul>
<p><strong>Benefits</strong></p>
<ul>
<li>Competitive compensation and benefits</li>
<li>Optional equity donation matching</li>
<li>Generous vacation and parental leave</li>
<li>Flexible working hours</li>
<li>Lovely office space in which to collaborate with colleagues</li>
</ul>
<p><strong>Visa Sponsorship</strong></p>
<ul>
<li>We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. If we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$350,000-$850,000 USD</Salaryrange>
      <Skills>Python, Machine learning, Large language models, Security, Integrity, Experience building and training machine learning models, Knowledge of evaluation methodologies for language models, Experience with agentic environments, Background in trust and safety, Experience with red teaming</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic is a public benefit corporation that creates reliable, interpretable, and steerable AI systems.</Employerdescription>
      <Employerwebsite>https://www.anthropic.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/5191785008</Applyto>
      <Location>San Francisco, CA | New York City, NY</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>a0355e9d-a71</externalid>
      <Title>Research Lead, Training Insights</Title>
      <Description><![CDATA[<p>As a Research Lead on the Training Insights team, you&#39;ll develop the strategy for, and lead execution on, how we measure and characterise model capabilities across training and deployment. This is a hands-on leadership role: you&#39;ll drive original research into new evaluation methodologies while leading a small team of researchers and research engineers doing the same.</p>
<p>Your work will span the full lifecycle of model development. You&#39;ll research and build new long-horizon evaluations that test the boundaries of what our models can achieve, develop novel approaches to measuring emerging capabilities, and deepen our understanding of how those capabilities develop, both during production RL training and after. You&#39;ll also take a cross-organisational view, working across Reinforcement Learning, Pretraining, Inference, Product, Alignment, Safeguards, and other teams to map the landscape of model evaluations at Anthropic and identify critical gaps in coverage.</p>
<p>This role carries significant visibility and impact. You&#39;ll help shape the evaluation narrative for model releases, contributing directly to how Anthropic communicates about its models to both internal and external audiences. Done well, this work will change how the industry measures and understands model capabilities, significantly furthering our safety mission.</p>
<p>Responsibilities:</p>
<ul>
<li>Build novel, long-horizon evaluations</li>
<li>Develop novel measurement approaches for understanding how model capabilities emerge and evolve during RL training</li>
<li>Lead strategic evaluation coverage across the company</li>
<li>Shape the evaluation narrative for model releases</li>
<li>Lead and mentor a small team of researchers and research engineers, setting research direction and fostering a culture of rigorous, creative research</li>
<li>Design evaluation frameworks that balance scientific rigor with the practical demands of production training schedules</li>
<li>Build and maintain relationships across Anthropic&#39;s research organisation to ensure evaluation insights inform training and deployment decisions</li>
<li>Contribute to the broader research community through publications, open-source contributions, or external engagement on evaluation best practices</li>
</ul>
<p>You may be a good fit if you:</p>
<ul>
<li>Have significant experience designing and running evaluations for large language models or similar complex ML systems</li>
<li>Have led technical projects or teams, either formally or through sustained ownership of critical research directions</li>
<li>Are equally comfortable designing experiments and writing code; you can move between research and implementation fluidly</li>
<li>Think strategically about what to measure and why, not just how to measure it</li>
<li>Can synthesise information across multiple teams and workstreams to form a coherent picture of model capabilities</li>
<li>Communicate complex technical findings clearly to both technical and non-technical audiences</li>
<li>Are results-oriented and thrive in fast-paced environments where priorities shift based on research findings</li>
<li>Care deeply about AI safety and want your work to directly influence how capable AI systems are developed and deployed</li>
</ul>
<p>Strong candidates may also have:</p>
<ul>
<li>Experience building evaluations for long-horizon or agentic tasks</li>
<li>Deep familiarity with Reinforcement Learning training dynamics and how model behaviour changes during training</li>
<li>Published research in machine learning evaluation, benchmarking, or related areas</li>
<li>Experience with safety evaluation frameworks and red teaming methodologies</li>
<li>Background in psychometrics, experimental psychology, or other measurement-focused disciplines</li>
<li>A track record of communicating evaluation results to inform high-stakes decisions about model development or deployment</li>
<li>Experience managing or mentoring researchers and engineers</li>
</ul>
<p>Representative projects:</p>
<ul>
<li>Designing and implementing a suite of long-horizon evaluations that test model capabilities on tasks requiring sustained reasoning, planning, and tool use over extended interactions</li>
<li>Building systems to track capability development across RL training checkpoints, surfacing insights about when and how specific capabilities emerge</li>
<li>Conducting a cross-org audit of evaluation coverage, identifying blind spots, and prioritising new evaluations to fill critical gaps across Pretraining, RL, Inference, and Product</li>
<li>Developing the evaluation methodology and narrative for a major model release, working with research leads and communications to clearly characterise model capabilities and limitations</li>
<li>Researching and prototyping novel evaluation approaches for capabilities that are difficult to measure with existing benchmarks</li>
<li>Leading a team effort to build reusable evaluation infrastructure that serves multiple teams across the research organisation</li>
</ul>
<p>The annual compensation for this role is $850,000.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
<Salaryrange>$850,000-$850,000 USD</Salaryrange>
      <Skills>AI, Machine Learning, Reinforcement Learning, Evaluation Methodologies, Research Leadership, Team Management, Communication, Results-Oriented, Fast-Paced Environments, Long-Horizon Evaluations, Agentic Tasks, Safety Evaluation Frameworks, Red Teaming Methodologies, Psychometrics, Experimental Psychology, Measurement-Focused Disciplines</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic is a company that creates reliable, interpretable, and steerable AI systems.</Employerdescription>
      <Employerwebsite>https://www.anthropic.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/5139654008</Applyto>
<Location>Remote-Friendly (Travel Required) | San Francisco, CA | New York City, NY</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>deb98db6-eba</externalid>
      <Title>Staff Software Engineer, Search Quality</Title>
<Description><![CDATA[<p>At Databricks, we are enabling data teams to solve the world&#39;s toughest problems by building and running the world&#39;s best data and AI infrastructure platform. Search plays a foundational role in this mission, powering everything from Retrieval Augmented Generation (RAG), AI assistants, and recommendation systems to enterprise knowledge management, in-product search, and data exploration.</p>
<p>As a Staff Software Engineer for Search Quality, you will drive the technical direction of ranking, relevance, evaluation, and quality initiatives across Databricks&#39; next-generation Search product. You&#39;ll design and build the systems, models, and evaluation frameworks that ensure our Search stack delivers accurate, high-quality results across diverse multimodal datasets and query patterns.</p>
<p>The impact you will have:</p>
<ul>
<li>Lead the technical vision for Search Quality, shaping the ranking architecture, relevance modeling stack, and evaluation systems that power Databricks&#39; next-generation retrieval experiences.</li>
<li>Identify and solve challenges in ranking, query understanding, and hybrid retrieval, advancing state-of-the-art techniques in vector, keyword, and multimodal search.</li>
<li>Design and train production-ready ranking and reranking models with strong guarantees around quality, latency, and resource efficiency.</li>
<li>Partner closely with research, product, and infra teams to define metrics, evaluation methodologies, and experimentation strategies for new retrieval features and model architectures.</li>
<li>Drive end-to-end engineering efforts, from early prototyping to production rollout, ensuring correctness, reliability, and measurable improvements to relevance.</li>
<li>Build and operate resilient, low-latency services for ranking, evaluation, and relevance signal processing.</li>
<li>Champion excellence in ML and search engineering, mentoring teammates and elevating design, code quality, and scientific rigor across the team.</li>
<li>Shape Databricks&#39; long-term roadmap for retrieval quality, ranking infrastructure, and the foundations for retrieval-driven AI products.</li>
</ul>
<p>What we look for:</p>
<ul>
<li>10+ years of experience building large-scale search, ranking, recommendation, or ML-driven relevance systems.</li>
<li>Deep expertise in Search Quality, including ranking models, signals, query understanding, and evaluation methodologies.</li>
<li>Strong understanding of relevance metrics and evaluation frameworks.</li>
<li>Familiarity with vector search, keyword search, hybrid retrieval, and embedding-based semantic retrieval.</li>
<li>Solid foundation in algorithms, data structures, and system design for performance-critical ranking and retrieval systems.</li>
<li>Proven ability to deliver high-impact technical initiatives with clear business or product outcomes.</li>
<li>Strong communication skills and ability to collaborate across teams in fast-moving environments.</li>
<li>Strategic and product-oriented mindset with the ability to align technical execution with long-term vision.</li>
<li>Passion for mentoring, growing engineers, and fostering technical excellence.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>$165,300-$219,675 USD</Salaryrange>
      <Skills>large-scale search, ranking, recommendation, ML-driven relevance systems, Search Quality, ranking models, signals, query understanding, evaluation methodologies, relevance metrics, evaluation frameworks, vector search, keyword search, hybrid retrieval, embedding-based semantic retrieval, algorithms, data structures, system design</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Databricks</Employername>
      <Employerlogo>https://logos.yubhub.co/databricks.com.png</Employerlogo>
      <Employerdescription>Databricks builds and runs the world&apos;s best data and AI infrastructure platform.</Employerdescription>
      <Employerwebsite>https://databricks.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/databricks/jobs/8295792002</Applyto>
      <Location>Mountain View, California</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>f5d92fd6-e21</externalid>
      <Title>Prompt Engineer, Agent Prompts &amp; Evals</Title>
      <Description><![CDATA[<p><strong>About the Role</strong></p>
<p>We&#39;re looking for prompt and context engineers to join our product engineering team to help build AI-first products, features, and evaluations. Your mission will be to bridge the gap between model capabilities and real product experience, working with product teams to build consistent, safe, and beneficial user experiences across all product surfaces.</p>
<p><strong>Key Responsibilities</strong></p>
<ul>
<li>Design, test, and optimize system prompts and feature-specific prompts that shape Claude&#39;s behavior across consumer and API products.</li>
<li>Build and maintain comprehensive evaluation suites that ensure model quality and consistency across product launches and updates.</li>
<li>Partner closely with product teams, research teams, and safeguards to ensure new features meet quality and safety standards.</li>
<li>Play a critical role in model releases, ensuring smooth rollouts and catching regressions before they impact users.</li>
<li>Help build and improve the frameworks and tools that allow teams to develop and test prompts and features with confidence.</li>
<li>Mentor product engineers on prompt engineering best practices and help teams build their first evaluations.</li>
<li>Work in a fast-paced environment where model capabilities advance daily, requiring quick adaptation and creative problem-solving.</li>
</ul>
<p><strong>What We&#39;re Looking For</strong></p>
<ul>
<li>5+ years of software engineering experience with Python or similar languages.</li>
<li>Demonstrated experience with LLMs and prompt engineering (through work, research, or significant personal projects).</li>
<li>Strong understanding of evaluation methodologies and metrics for AI systems.</li>
<li>Excellent written and verbal communication skills – you&#39;ll need to explain complex model behaviors to diverse stakeholders.</li>
<li>Ability to manage multiple concurrent projects and prioritize effectively.</li>
<li>Experience with version control, CI/CD, and modern software development practices.</li>
</ul>
<p><strong>You Might Thrive in This Role If You…</strong></p>
<ul>
<li>Get excited about the nuances of how language models behave and love finding creative ways to improve their outputs.</li>
<li>Enjoy being at the intersection of research and product, translating cutting-edge capabilities into user value.</li>
<li>Are comfortable with ambiguity and can define success metrics for novel AI features.</li>
<li>Have a strong sense of ownership and drive projects from conception to production.</li>
<li>Are passionate about building AI systems that are helpful, harmless, and honest.</li>
<li>Thrive in collaborative environments and enjoy teaching others.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>mid</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$320,000-$405,000 USD</Salaryrange>
      <Skills>Python, LLMs, Prompt engineering, Evaluation methodologies, Metrics for AI systems, Version control, CI/CD, Modern software development practices, Claude, A/B testing, Experimentation frameworks, AI safety, Alignment considerations, Building tools and infrastructure for ML/AI workflows</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic is a public benefit corporation that creates reliable, interpretable, and steerable AI systems.</Employerdescription>
      <Employerwebsite>https://www.anthropic.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/5107121008</Applyto>
      <Location>San Francisco, CA | New York City, NY</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>557894f1-074</externalid>
      <Title>Prompt Engineer, Agent Prompts &amp; Evals</Title>
      <Description><![CDATA[<p><strong>About Anthropic</strong></p>
<p>Anthropic&#39;s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</p>
<p><strong>About the Role</strong></p>
<p>We’re looking for prompt and context engineers to join our product engineering team to help build AI-first products, features, and evaluations. Your mission will be to bridge the gap between model capabilities and real product experience, working with product teams to build consistent, safe, and beneficial user experiences across all product surfaces.</p>
<p>You will be deeply involved in new product feature and model releases at Anthropic, combining engineering expertise with an understanding of frontier AI applications and model quality. You’ll become an expert on Claude’s behavioural quirks and capabilities and apply that knowledge to deliver the best possible user experience across models and domains. You’ll be the first resource for product teams working on Claude’s AI infrastructure: system prompts, tool prompts, skills, and evaluations.</p>
<p>This role requires someone who can effectively balance caring deeply about making Claude the best it can be while also supporting a wide variety of concurrent projects and efforts across many product teams.</p>
<p><strong>Key Responsibilities</strong></p>
<ul>
<li><strong>Prompt Engineering Excellence:</strong> Design, test, and optimise system prompts and feature-specific prompts that shape Claude’s behaviour across consumer and API products.</li>
<li><strong>Evaluation Development:</strong> Build and maintain comprehensive evaluation suites that ensure model quality and consistency across product launches and updates.</li>
<li><strong>Cross-functional Collaboration:</strong> Partner closely with product teams, research teams, and safeguards to ensure new features meet quality and safety standards.</li>
<li><strong>Model Launch Support:</strong> Play a critical role in model releases, ensuring smooth rollouts and catching regressions before they impact users.</li>
<li><strong>Infrastructure Contribution:</strong> Help build and improve the frameworks and tools that allow teams to develop and test prompts and features with confidence.</li>
<li><strong>Knowledge Transfer:</strong> Mentor product engineers on prompt engineering best practices and help teams build their first evaluations.</li>
<li><strong>Rapid Iteration:</strong> Work in a fast-paced environment where model capabilities advance daily, requiring quick adaptation and creative problem-solving.</li>
</ul>
<p><strong>What We’re Looking For</strong></p>
<p><strong>Required Qualifications</strong></p>
<ul>
<li>5+ years of software engineering experience with Python or similar languages.</li>
<li>Demonstrated experience with LLMs and prompt engineering (through work, research, or significant personal projects).</li>
<li>Strong understanding of evaluation methodologies and metrics for AI systems.</li>
<li>Excellent written and verbal communication skills – you’ll need to explain complex model behaviours to diverse stakeholders.</li>
<li>Ability to manage multiple concurrent projects and prioritise effectively.</li>
<li>Experience with version control, CI/CD, and modern software development practices.</li>
</ul>
<p><strong>Preferred Qualifications</strong></p>
<ul>
<li>Experience with Claude or other frontier AI models in production settings.</li>
<li>Background in machine learning, NLP, or related fields.</li>
<li>Experience with A/B testing and experimentation frameworks (e.g., Statsig).</li>
<li>Familiarity with AI safety and alignment considerations.</li>
<li>Experience building tools and infrastructure for ML/AI workflows.</li>
<li>Track record of improving AI system performance through systematic evaluation and iteration.</li>
</ul>
<p><strong>You Might Thrive in This Role If You…</strong></p>
<ul>
<li>Get excited about the nuances of how language models behave and love finding creative ways to improve their outputs.</li>
<li>Enjoy being at the intersection of research and product, translating cutting-edge capabilities into user value.</li>
<li>Are comfortable with ambiguity and can define success metrics for novel AI features.</li>
<li>Have a strong sense of ownership and drive projects from conception to production.</li>
<li>Are passionate about building AI systems that are helpful, harmless, and honest.</li>
<li>Thrive in collaborative environments and enjoy teaching others.</li>
</ul>
<p><strong>Logistics</strong></p>
<p><strong>Education requirements:</strong> We require at least a Bachelor&#39;s degree in a related field or equivalent experience.</p>
<p><strong>Location-based hybrid policy:</strong> Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</p>
<p><strong>Visa sponsorship:</strong> We do sponsor visas! However, we aren’t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</p>
<p><strong>We encourage you to apply even if you do not believe you meet every single qualification.</strong> Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you’re interested in this work.</p>
<p><strong>Your safety matters to us.</strong> To protect yourself from potential scams, we want to remind you that we will never ask you to pay any fees for the hiring process. If someone contacts you claiming to be from Anthropic and asks for money, please report it to us immediately.</p>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
<Salaryrange>$320,000 - $405,000 USD</Salaryrange>
      <Skills>Python, LLMs, Prompt engineering, Evaluation methodologies, Metrics for AI systems, Version control, CI/CD, Modern software development practices, Claude, Frontier AI models, Machine learning, NLP, A/B testing, Experimentation frameworks, AI safety, Alignment considerations, Tools and infrastructure for ML/AI workflows</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic is a quickly growing organisation with a mission to create reliable, interpretable, and steerable AI systems. Their team is a group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</Employerdescription>
      <Employerwebsite>https://job-boards.greenhouse.io</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/5107121008</Applyto>
      <Location>San Francisco, CA | New York City, NY</Location>
      <Country></Country>
      <Postedate>2026-03-08</Postedate>
    </job>
    <job>
      <externalid>447c26bd-a83</externalid>
      <Title>Research Engineer, Universes</Title>
      <Description><![CDATA[<p><strong>About the Role</strong></p>
<p>We&#39;re looking for Research Engineers to help us build the next generation of training environments for capable and safe agentic AI. This role blends research and engineering responsibilities, requiring you to both implement novel approaches and contribute to research direction.</p>
<p><strong>Responsibilities:</strong></p>
<ul>
<li>Build the next generation of agentic environments</li>
<li>Build rigorous evaluations that measure real capability</li>
<li>Collaborate across research and infrastructure teams to ship environments into production training</li>
<li>Debug and iterate rapidly across research and production ML stacks</li>
<li>Contribute to research culture through technical discussions and collaborative problem-solving</li>
</ul>
<p><strong>You may be a good fit if you:</strong></p>
<ul>
<li>Are highly impact-driven — you care about outcomes, not activity</li>
<li>Operate with high agency</li>
<li>Have good research taste or senior technical experience, demonstrating good judgment in identifying what actually matters in complex problem spaces</li>
<li>Can balance research exploration with engineering implementation</li>
<li>Are passionate about the potential impact of AI and are committed to developing safe and beneficial systems</li>
<li>Are comfortable with uncertainty and adapt quickly as the landscape shifts</li>
<li>Have strong software engineering skills and can build robust infrastructure</li>
<li>Enjoy pair programming (we love to pair!)</li>
</ul>
<p><strong>Strong candidates may also have one or more of the following:</strong></p>
<ul>
<li>Industry experience with large language model training, fine-tuning, or evaluation</li>
<li>Industry experience building RL environments, simulation systems, or large-scale ML infrastructure</li>
<li>Senior experience in a relevant technical field, even if transitioning domains</li>
<li>Deep expertise in sandboxing, containerization, VM infrastructure, or distributed systems</li>
<li>Published influential work in relevant ML areas</li>
</ul>
<p><strong>Logistics</strong></p>
<ul>
<li>Education requirements: We require at least a Bachelor&#39;s degree in a related field or equivalent experience.</li>
<li>Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</li>
<li>Visa sponsorship: We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</li>
</ul>
<p><strong>How we&#39;re different</strong></p>
<p>We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact — advancing our long-term goals of steerable, trustworthy AI — rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and biology as with traditional efforts in computer science. We&#39;re an extremely collaborative group, and we host frequent research discussions to ensure that we are pursuing the highest-impact work at any given time. As such, we greatly value communication skills.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$500,000 - $850,000 USD</Salaryrange>
      <Skills>reinforcement learning, training environments, evaluation methodologies, software engineering, pair programming, large language model training, RL environments, simulation systems, distributed systems, influential work in ML areas</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic&apos;s mission is to create reliable, interpretable, and steerable AI systems. The company is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</Employerdescription>
      <Employerwebsite>https://job-boards.greenhouse.io</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/5061517008</Applyto>
      <Location>San Francisco, CA, Seattle, WA, New York City, NY</Location>
      <Country></Country>
      <Postedate>2026-03-08</Postedate>
    </job>
    <job>
      <externalid>c33b2d78-cc9</externalid>
      <Title>Research Lead, Training Insights</Title>
      <Description><![CDATA[<p><strong>About the role</strong></p>
<p>As a Research Lead on the Training Insights team, you&#39;ll develop the strategy for, and lead execution on, how we measure and characterise model capabilities across training and deployment. This is a hands-on leadership role: you&#39;ll drive original research into new evaluation methodologies while leading a small team of researchers and research engineers doing the same.</p>
<p>Your work will span the full lifecycle of model development. You&#39;ll research and build new long-horizon evaluations that test the boundaries of what our models can achieve, develop novel approaches to measuring emerging capabilities, and deepen our understanding of how those capabilities develop — both during production RL training and after. You&#39;ll also take a cross-organisational view, working across Reinforcement Learning, Pretraining, Inference, Product, Alignment, Safeguards, and other teams to map the landscape of model evaluations at Anthropic and identify critical gaps in coverage.</p>
<p>This role carries significant visibility and impact. You&#39;ll help shape the evaluation narrative for model releases, contributing directly to how Anthropic communicates about its models to both internal and external audiences. Done well, you will change how the industry measures and understands model capabilities, significantly furthering our safety mission.</p>
<p><strong>Responsibilities:</strong></p>
<ul>
<li>Build novel, long-horizon evaluations</li>
<li>Develop novel measurement approaches for understanding how model capabilities emerge and evolve during RL training</li>
<li>Lead strategic evaluation coverage across the company</li>
<li>Shape the evaluation narrative for model releases</li>
<li>Lead and mentor a small team of researchers and research engineers, setting research direction and fostering a culture of rigorous, creative research</li>
<li>Design evaluation frameworks that balance scientific rigor with the practical demands of production training schedules</li>
<li>Build and maintain relationships across Anthropic&#39;s research organisation to ensure evaluation insights inform training and deployment decisions</li>
<li>Contribute to the broader research community through publications, open-source contributions, or external engagement on evaluation best practices</li>
</ul>
<p><strong>You may be a good fit if you:</strong></p>
<ul>
<li>Have significant experience designing and running evaluations for large language models or similar complex ML systems</li>
<li>Have led technical projects or teams, either formally or through sustained ownership of critical research directions</li>
<li>Are equally comfortable designing experiments and writing code — you can move between research and implementation fluidly</li>
<li>Think strategically about what to measure and why, not just how to measure it</li>
<li>Can synthesise information across multiple teams and workstreams to form a coherent picture of model capabilities</li>
<li>Communicate complex technical findings clearly to both technical and non-technical audiences</li>
<li>Are results-oriented and thrive in fast-paced environments where priorities shift based on research findings</li>
<li>Care deeply about AI safety and want your work to directly influence how capable AI systems are developed and deployed</li>
</ul>
<p><strong>Strong candidates may also have:</strong></p>
<ul>
<li>Experience building evaluations for long-horizon or agentic tasks</li>
<li>Deep familiarity with Reinforcement Learning training dynamics and how model behaviour changes during training</li>
<li>Published research in machine learning evaluation, benchmarking, or related areas</li>
<li>Experience with safety evaluation frameworks and red teaming methodologies</li>
<li>Background in psychometrics, experimental psychology, or other measurement-focused disciplines</li>
<li>A track record of communicating evaluation results to inform high-stakes decisions about model development or deployment</li>
<li>Experience managing or mentoring researchers and engineers</li>
</ul>
<p><strong>Representative projects:</strong></p>
<ul>
<li>Designing and implementing a suite of long-horizon evaluations that test model capabilities on tasks requiring sustained reasoning, planning, and tool use over extended interactions</li>
<li>Building systems to track capability development across RL training checkpoints, surfacing insights about when and how specific capabilities emerge</li>
<li>Conducting a cross-org audit of evaluation coverage, identifying blind spots, and prioritising new evaluations to fill critical gaps across Pretraining, RL, Inference, and Product</li>
<li>Developing the evaluation methodology and narrative for a major model release, working with research leads and communications to clearly characterise model capabilities and limitations</li>
<li>Researching and prototyping novel evaluation approaches for capabilities that are difficult to measure with existing benchmarks</li>
<li>Leading a team effort to build reusable evaluation infrastructure that serves multiple teams across the research organisation</li>
</ul>
<p><strong>Logistics</strong></p>
<p><strong>Education requirements:</strong> We require at least a Bachelor&#39;s degree in a related field or equivalent experience.</p>
<p><strong>Location-based hybrid policy:</strong> Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</p>
<p><strong>Visa sponsorship:</strong> We do sponsor visas!</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
<Salaryrange>$850,000 - $850,000 USD</Salaryrange>
      <Skills>machine learning, evaluation methodologies, Reinforcement Learning, Pretraining, Inference, Product, Alignment, Safeguards, psychometrics, experimental psychology, safety evaluation frameworks, red teaming methodologies</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic is a quickly growing organisation working to build beneficial AI systems. Their mission is to create reliable, interpretable, and steerable AI systems.</Employerdescription>
      <Employerwebsite>https://job-boards.greenhouse.io</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/5139654008</Applyto>
      <Location>San Francisco, CA</Location>
      <Country></Country>
      <Postedate>2026-03-08</Postedate>
    </job>
  </jobs>
</source>