<?xml version="1.0" encoding="UTF-8"?>
<source>
  <jobs>
    <job>
      <externalid>465e2cfb-ddc</externalid>
      <Title>Staff Machine Learning Research Scientist, LLM Evals</Title>
      <Description><![CDATA[<p>As a Staff Machine Learning Research Scientist on the LLM Evals team, you will lead the development of novel evaluation methodologies, metrics, and benchmarks to measure the capabilities and limitations of frontier LLMs.</p>
<p>Your primary responsibilities will include:</p>
<ul>
<li>Driving research on the effectiveness and limitations of existing LLM evaluation techniques.</li>
<li>Designing and developing novel evaluation benchmarks for large language models, covering areas such as instruction following, factuality, robustness, and fairness.</li>
<li>Communicating, collaborating, and building relationships with clients and peer teams to facilitate cross-functional projects.</li>
<li>Collaborating with internal teams and external partners to refine metrics and create standardized evaluation protocols.</li>
<li>Implementing scalable and reproducible evaluation pipelines using modern ML frameworks.</li>
<li>Publishing research findings in top-tier AI conferences and contributing to open-source benchmarking initiatives.</li>
<li>Mentoring and guiding research scientists and engineers, providing technical leadership across cross-functional projects.</li>
<li>Staying deeply engaged with the ML research community, tracking emerging work and contributing to the advancement of LLM evaluation science.</li>
</ul>
<p>The ideal candidate will have 5+ years of hands-on experience with large language models, NLP, and Transformer modeling, in both research and engineering settings.</p>
<p>You will thrive in a high-energy, fast-paced startup environment and be ready to dedicate the time and effort needed to drive impactful results.</p>
<p>Compensation packages at Scale for eligible roles include base salary, equity, and benefits. The range displayed on each job posting reflects the minimum and maximum target for new hire salaries for the position, determined by work location and additional factors, including job-related skills, experience, interview performance, and relevant education or training.</p>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>$264,800-$331,000 USD</Salaryrange>
      <Skills>large language model, NLP, Transformer modeling, evaluation methodologies, metrics, benchmarks, instruction following, factuality, robustness, fairness</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Scale</Employername>
      <Employerlogo>https://logos.yubhub.co/scale.com.png</Employerlogo>
      <Employerdescription>Scale develops reliable AI systems for the world&apos;s most important decisions, providing high-quality data and full-stack technologies to power leading models.</Employerdescription>
      <Employerwebsite>https://scale.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/scaleai/jobs/4628044005</Applyto>
      <Location>San Francisco, CA; Seattle, WA; New York, NY</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>60a7e1e6-b51</externalid>
<Title>Tech Lead/Manager, Machine Learning Research Scientist - LLM Evals</Title>
      <Description><![CDATA[<p>As the leading data and evaluation partner for frontier AI companies, we&#39;re dedicated to advancing the evaluation and benchmarking of large language models (LLMs). Our Research teams work with the industry&#39;s leading AI labs to provide high-quality data and accelerate progress in GenAI research.</p>
<p>We&#39;re seeking a Tech Lead Manager to lead a talented team of research scientists and research engineers focused on developing and implementing novel evaluation methodologies, metrics, and benchmarks to assess the capabilities and limitations of our cutting-edge LLMs.</p>
<p>Key responsibilities:</p>
<ul>
<li>Lead a team of highly effective research scientists and research engineers on LLM evals.</li>
<li>Conduct research on the effectiveness and limitations of existing LLM evaluation techniques.</li>
<li>Design and develop novel evaluation benchmarks for large language models, covering areas such as instruction following, factuality, robustness, and fairness.</li>
<li>Communicate, collaborate, and build relationships with clients and peer teams to facilitate cross-functional projects.</li>
<li>Collaborate with internal teams and external partners to refine metrics and create standardized evaluation protocols.</li>
<li>Implement scalable and reproducible evaluation pipelines using modern ML frameworks.</li>
<li>Publish research findings in top-tier AI conferences and contribute to open-source benchmarking initiatives.</li>
</ul>
<p>The ideal candidate has 5+ years of hands-on experience with large language models, NLP, and Transformer modeling, in both research and engineering settings. Experience supporting and leading a team of research scientists and research engineers is also required.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>$264,800-$331,000 USD</Salaryrange>
      <Skills>large language model, NLP, Transformer modeling, research and engineering development, team leadership, cross-functional collaboration, evaluation methodologies, metrics and benchmarks, scalable and reproducible evaluation pipelines, modern ML frameworks, published research in top-tier AI conferences, open-source benchmarking initiatives, customer-facing role</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Scale</Employername>
      <Employerlogo>https://logos.yubhub.co/scale.com.png</Employerlogo>
      <Employerdescription>Scale develops reliable AI systems for the world&apos;s most important decisions, providing high-quality data and full-stack technologies.</Employerdescription>
      <Employerwebsite>https://scale.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/scaleai/jobs/4304790005</Applyto>
      <Location>San Francisco, CA; Seattle, WA; New York, NY</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>10bf8d86-b30</externalid>
      <Title>Research Engineer, Safeguards Labs</Title>
      <Description><![CDATA[<p><strong>About the Role</strong></p>
<p>We&#39;re hiring research engineers to define and execute the Labs research agenda. You&#39;ll scope your own projects, run experiments end-to-end, and decide when an idea is ready to hand off to a production team, or when to kill it and move on.</p>
<p><strong>Responsibilities:</strong></p>
<ul>
<li>Lead and contribute to research projects investigating new methods for detecting misuse of Claude, identifying malicious organisations and accounts, strengthening model safeguards, and other safety needs.</li>
<li>Design and run offline analyses over model usage data to surface abuse patterns, build classifiers and detection systems, and evaluate their effectiveness.</li>
<li>Develop and iterate on prototypes that could eventually feed signals into the real-time safeguards path, partnering with engineers on tech transfer.</li>
<li>Contribute to a broader research portfolio investigating methods for detecting abusive behaviour in chat-based or agentive workflows, and for training the model to robustly refrain from dangerous responses or behaviours without over-refusing.</li>
<li>Build evaluations and methodologies for measuring whether safeguards actually work, including in agentic settings.</li>
<li>Write up findings clearly so they inform decisions across Trust &amp; Safety, research, and product teams.</li>
</ul>
<p><strong>You may be a good fit if you:</strong></p>
<ul>
<li>Have a track record of independently driving research projects from ambiguous problem statements to concrete results, ideally in AI, ML, security, integrity, or a related technical field.</li>
<li>Are comfortable scoping your own work and switching between research, engineering, and analysis as a project demands.</li>
<li>Have working familiarity with how large language models operate (sampling, prompting, training), even if LLMs aren&#39;t your primary background.</li>
<li>Are proficient in Python and comfortable working with large datasets.</li>
<li>Care about the societal impacts of AI and want your work to directly reduce real-world harm.</li>
</ul>
<p><strong>Strong candidates may also have:</strong></p>
<ul>
<li>Experience building and training machine learning models, including classifiers for abuse, fraud, integrity, or security applications.</li>
<li>Knowledge of evaluation methodologies for language models and experience designing evals.</li>
<li>Experience with agentic environments and evaluating model behaviour in them.</li>
<li>Background in trust and safety, integrity, fraud detection, threat intelligence, or adversarial ML.</li>
<li>Experience with red teaming, jailbreak research, or interpretability methods like steering vectors.</li>
<li>A history of taking research prototypes and transferring them into production systems.</li>
</ul>
<p><strong>Logistics</strong></p>
<ul>
<li>Minimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience</li>
<li>Required field of study: A field relevant to the role as demonstrated through coursework, training, or professional experience</li>
<li>Minimum years of experience: Years of experience required will correlate with the internal job level requirements for the position</li>
</ul>
<p><strong>Benefits</strong></p>
<ul>
<li>Competitive compensation and benefits</li>
<li>Optional equity donation matching</li>
<li>Generous vacation and parental leave</li>
<li>Flexible working hours</li>
<li>Lovely office space in which to collaborate with colleagues</li>
</ul>
<p><strong>Visa Sponsorship</strong></p>
<ul>
<li>We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. If we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$350,000-$850,000 USD</Salaryrange>
      <Skills>Python, Machine learning, Large language models, Security, Integrity, Experience building and training machine learning models, Knowledge of evaluation methodologies for language models, Experience with agentic environments, Background in trust and safety, Experience with red teaming</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic is a public benefit corporation that creates reliable, interpretable, and steerable AI systems.</Employerdescription>
      <Employerwebsite>https://www.anthropic.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/5191785008</Applyto>
      <Location>San Francisco, CA | New York City, NY</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>a0355e9d-a71</externalid>
      <Title>Research Lead, Training Insights</Title>
      <Description><![CDATA[<p>As a Research Lead on the Training Insights team, you&#39;ll develop the strategy for, and lead execution on, how we measure and characterise model capabilities across training and deployment. This is a hands-on leadership role: you&#39;ll drive original research into new evaluation methodologies while leading a small team of researchers and research engineers doing the same.</p>
<p>Your work will span the full lifecycle of model development. You&#39;ll research and build new long-horizon evaluations that test the boundaries of what our models can achieve, develop novel approaches to measuring emerging capabilities, and deepen our understanding of how those capabilities develop, both during production RL training and after. You&#39;ll also take a cross-organisational view, working across Reinforcement Learning, Pretraining, Inference, Product, Alignment, Safeguards, and other teams to map the landscape of model evaluations at Anthropic and identify critical gaps in coverage.</p>
<p>This role carries significant visibility and impact. You&#39;ll help shape the evaluation narrative for model releases, contributing directly to how Anthropic communicates about its models to both internal and external audiences. Done well, this work will change how the industry measures and understands model capabilities, significantly furthering our safety mission.</p>
<p>Responsibilities:</p>
<ul>
<li>Build novel, long-horizon evaluations</li>
<li>Develop novel measurement approaches for understanding how model capabilities emerge and evolve during RL training</li>
<li>Lead strategic evaluation coverage across the company</li>
<li>Shape the evaluation narrative for model releases</li>
<li>Lead and mentor a small team of researchers and research engineers, setting research direction and fostering a culture of rigorous, creative research</li>
<li>Design evaluation frameworks that balance scientific rigor with the practical demands of production training schedules</li>
<li>Build and maintain relationships across Anthropic&#39;s research organisation to ensure evaluation insights inform training and deployment decisions</li>
<li>Contribute to the broader research community through publications, open-source contributions, or external engagement on evaluation best practices</li>
</ul>
<p>You may be a good fit if you:</p>
<ul>
<li>Have significant experience designing and running evaluations for large language models or similar complex ML systems</li>
<li>Have led technical projects or teams, either formally or through sustained ownership of critical research directions</li>
<li>Are equally comfortable designing experiments and writing code; you can move between research and implementation fluidly</li>
<li>Think strategically about what to measure and why, not just how to measure it</li>
<li>Can synthesise information across multiple teams and workstreams to form a coherent picture of model capabilities</li>
<li>Communicate complex technical findings clearly to both technical and non-technical audiences</li>
<li>Are results-oriented and thrive in fast-paced environments where priorities shift based on research findings</li>
<li>Care deeply about AI safety and want your work to directly influence how capable AI systems are developed and deployed</li>
</ul>
<p>Strong candidates may also have:</p>
<ul>
<li>Experience building evaluations for long-horizon or agentic tasks</li>
<li>Deep familiarity with Reinforcement Learning training dynamics and how model behaviour changes during training</li>
<li>Published research in machine learning evaluation, benchmarking, or related areas</li>
<li>Experience with safety evaluation frameworks and red teaming methodologies</li>
<li>Background in psychometrics, experimental psychology, or other measurement-focused disciplines</li>
<li>A track record of communicating evaluation results to inform high-stakes decisions about model development or deployment</li>
<li>Experience managing or mentoring researchers and engineers</li>
</ul>
<p>Representative projects:</p>
<ul>
<li>Designing and implementing a suite of long-horizon evaluations that test model capabilities on tasks requiring sustained reasoning, planning, and tool use over extended interactions</li>
<li>Building systems to track capability development across RL training checkpoints, surfacing insights about when and how specific capabilities emerge</li>
<li>Conducting a cross-org audit of evaluation coverage, identifying blind spots, and prioritising new evaluations to fill critical gaps across Pretraining, RL, Inference, and Product</li>
<li>Developing the evaluation methodology and narrative for a major model release, working with research leads and communications to clearly characterise model capabilities and limitations</li>
<li>Researching and prototyping novel evaluation approaches for capabilities that are difficult to measure with existing benchmarks</li>
<li>Leading a team effort to build reusable evaluation infrastructure that serves multiple teams across the research organisation</li>
</ul>
<p>The annual compensation for this role is $850,000.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
<Salaryrange>$850,000-$850,000 USD</Salaryrange>
      <Skills>AI, Machine Learning, Reinforcement Learning, Evaluation Methodologies, Research Leadership, Team Management, Communication, Results-Oriented, Fast-Paced Environments, Long-Horizon Evaluations, Agentic Tasks, Safety Evaluation Frameworks, Red Teaming Methodologies, Psychometrics, Experimental Psychology, Measurement-Focused Disciplines</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic is a company that creates reliable, interpretable, and steerable AI systems.</Employerdescription>
      <Employerwebsite>https://www.anthropic.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/5139654008</Applyto>
<Location>Remote-Friendly (Travel Required) | San Francisco, CA | New York City, NY</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>deb98db6-eba</externalid>
      <Title>Staff Software Engineer, Search Quality</Title>
<Description><![CDATA[<p>At Databricks, we are enabling data teams to solve the world&#39;s toughest problems by building and running the world&#39;s best data and AI infrastructure platform. Search plays a foundational role in this mission, powering everything from Retrieval Augmented Generation (RAG), AI assistants, and recommendation systems to enterprise knowledge management, in-product search, and data exploration.</p>
<p>As a Staff Software Engineer for Search Quality, you will drive the technical direction of ranking, relevance, evaluation, and quality initiatives across Databricks&#39; next-generation Search product. You&#39;ll design and build the systems, models, and evaluation frameworks that ensure our Search stack delivers accurate, high-quality results across diverse multimodal datasets and query patterns.</p>
<p>The impact you will have:</p>
<ul>
<li>Lead the technical vision for Search Quality, shaping the ranking architecture, relevance modeling stack, and evaluation systems that power Databricks&#39; next-generation retrieval experiences.</li>
<li>Identify and solve challenges in ranking, query understanding, and hybrid retrieval, advancing state-of-the-art techniques in vector, keyword, and multimodal search.</li>
<li>Design and train production-ready ranking and reranking models with strong guarantees around quality, latency, and resource efficiency.</li>
<li>Partner closely with research, product, and infra teams to define metrics, evaluation methodologies, and experimentation strategies for new retrieval features and model architectures.</li>
<li>Drive end-to-end engineering efforts, from early prototyping to production rollout, ensuring correctness, reliability, and measurable improvements to relevance.</li>
<li>Build and operate resilient, low-latency services for ranking, evaluation, and relevance signal processing.</li>
<li>Champion excellence in ML and search engineering, mentoring teammates and elevating design, code quality, and scientific rigor across the team.</li>
<li>Shape Databricks&#39; long-term roadmap for retrieval quality, ranking infrastructure, and the foundations for retrieval-driven AI products.</li>
</ul>
<p>What we look for:</p>
<ul>
<li>10+ years of experience building large-scale search, ranking, recommendation, or ML-driven relevance systems.</li>
<li>Deep expertise in Search Quality, including ranking models, signals, query understanding, and evaluation methodologies.</li>
<li>Strong understanding of relevance metrics and evaluation frameworks.</li>
<li>Familiarity with vector search, keyword search, hybrid retrieval, and embedding-based semantic retrieval.</li>
<li>Solid foundation in algorithms, data structures, and system design for performance-critical ranking and retrieval systems.</li>
<li>Proven ability to deliver high-impact technical initiatives with clear business or product outcomes.</li>
<li>Strong communication skills and ability to collaborate across teams in fast-moving environments.</li>
<li>Strategic and product-oriented mindset with the ability to align technical execution with long-term vision.</li>
<li>Passion for mentoring, growing engineers, and fostering technical excellence.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>$165,300-$219,675 USD</Salaryrange>
      <Skills>large-scale search, ranking, recommendation, ML-driven relevance systems, Search Quality, ranking models, signals, query understanding, evaluation methodologies, relevance metrics, evaluation frameworks, vector search, keyword search, hybrid retrieval, embedding-based semantic retrieval, algorithms, data structures, system design</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Databricks</Employername>
      <Employerlogo>https://logos.yubhub.co/databricks.com.png</Employerlogo>
      <Employerdescription>Databricks builds and runs the world&apos;s best data and AI infrastructure platform.</Employerdescription>
      <Employerwebsite>https://databricks.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/databricks/jobs/8295792002</Applyto>
      <Location>Mountain View, California</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>f5d92fd6-e21</externalid>
      <Title>Prompt Engineer, Agent Prompts &amp; Evals</Title>
      <Description><![CDATA[<p><strong>About the Role</strong></p>
<p>We&#39;re looking for prompt and context engineers to join our product engineering team to help build AI-first products, features, and evaluations. Your mission will be to bridge the gap between model capabilities and real product experience, working with product teams to build consistent, safe, and beneficial user experiences across all product surfaces.</p>
<p><strong>Key Responsibilities</strong></p>
<ul>
<li>Design, test, and optimize system prompts and feature-specific prompts that shape Claude&#39;s behavior across consumer and API products.</li>
<li>Build and maintain comprehensive evaluation suites that ensure model quality and consistency across product launches and updates.</li>
<li>Partner closely with product teams, research teams, and safeguards to ensure new features meet quality and safety standards.</li>
<li>Play a critical role in model releases, ensuring smooth rollouts and catching regressions before they impact users.</li>
<li>Help build and improve the frameworks and tools that allow teams to develop and test prompts and features with confidence.</li>
<li>Mentor product engineers on prompt engineering best practices and help teams build their first evaluations.</li>
<li>Work in a fast-paced environment where model capabilities advance daily, requiring quick adaptation and creative problem-solving.</li>
</ul>
<p><strong>What We&#39;re Looking For</strong></p>
<ul>
<li>5+ years of software engineering experience with Python or similar languages.</li>
<li>Demonstrated experience with LLMs and prompt engineering (through work, research, or significant personal projects).</li>
<li>Strong understanding of evaluation methodologies and metrics for AI systems.</li>
<li>Excellent written and verbal communication skills – you&#39;ll need to explain complex model behaviors to diverse stakeholders.</li>
<li>Ability to manage multiple concurrent projects and prioritize effectively.</li>
<li>Experience with version control, CI/CD, and modern software development practices.</li>
</ul>
<p><strong>You Might Thrive in This Role If You…</strong></p>
<ul>
<li>Get excited about the nuances of how language models behave and love finding creative ways to improve their outputs.</li>
<li>Enjoy being at the intersection of research and product, translating cutting-edge capabilities into user value.</li>
<li>Are comfortable with ambiguity and can define success metrics for novel AI features.</li>
<li>Have a strong sense of ownership and drive projects from conception to production.</li>
<li>Are passionate about building AI systems that are helpful, harmless, and honest.</li>
<li>Thrive in collaborative environments and enjoy teaching others.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>mid</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$320,000-$405,000 USD</Salaryrange>
      <Skills>Python, LLMs, Prompt engineering, Evaluation methodologies, Metrics for AI systems, Version control, CI/CD, Modern software development practices, Claude, A/B testing, Experimentation frameworks, AI safety, Alignment considerations, Building tools and infrastructure for ML/AI workflows</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic is a public benefit corporation that creates reliable, interpretable, and steerable AI systems.</Employerdescription>
      <Employerwebsite>https://www.anthropic.com/</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/5107121008</Applyto>
      <Location>San Francisco, CA | New York City, NY</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>557894f1-074</externalid>
      <Title>Prompt Engineer, Agent Prompts &amp; Evals</Title>
      <Description><![CDATA[<p><strong>About Anthropic</strong></p>
<p>Anthropic&#39;s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</p>
<p><strong>About the Role</strong></p>
<p>We’re looking for prompt and context engineers to join our product engineering team to help build AI-first products, features, and evaluations. Your mission will be to bridge the gap between model capabilities and real product experience, working with product teams to build consistent, safe, and beneficial user experiences across all product surfaces.</p>
<p>You will be deeply involved in new product feature and model releases at Anthropic, combining engineering expertise with an understanding of frontier AI applications and model quality. You’ll become an expert on Claude’s behavioural quirks and capabilities and apply that knowledge to deliver the best possible user experience across models and domains. You’ll be the first resource for product teams working on Claude’s AI infrastructure: system prompts, tool prompts, skills, and evaluations.</p>
<p>This role requires someone who can effectively balance caring deeply about making Claude the best it can be while also supporting a wide variety of concurrent projects and efforts across many product teams.</p>
<p><strong>Key Responsibilities</strong></p>
<ul>
<li><strong>Prompt Engineering Excellence:</strong> Design, test, and optimise system prompts and feature-specific prompts that shape Claude’s behaviour across consumer and API products.</li>
<li><strong>Evaluation Development:</strong> Build and maintain comprehensive evaluation suites that ensure model quality and consistency across product launches and updates.</li>
<li><strong>Cross-functional Collaboration:</strong> Partner closely with product teams, research teams, and safeguards to ensure new features meet quality and safety standards.</li>
<li><strong>Model Launch Support:</strong> Play a critical role in model releases, ensuring smooth rollouts and catching regressions before they impact users.</li>
<li><strong>Infrastructure Contribution:</strong> Help build and improve the frameworks and tools that allow teams to develop and test prompts and features with confidence.</li>
<li><strong>Knowledge Transfer:</strong> Mentor product engineers on prompt engineering best practices and help teams build their first evaluations.</li>
<li><strong>Rapid Iteration:</strong> Work in a fast-paced environment where model capabilities advance daily, requiring quick adaptation and creative problem-solving.</li>
</ul>
<p><strong>What We’re Looking For</strong></p>
<p><strong>Required Qualifications</strong></p>
<ul>
<li>5+ years of software engineering experience with Python or similar languages.</li>
<li>Demonstrated experience with LLMs and prompt engineering (through work, research, or significant personal projects).</li>
<li>Strong understanding of evaluation methodologies and metrics for AI systems.</li>
<li>Excellent written and verbal communication skills – you’ll need to explain complex model behaviours to diverse stakeholders.</li>
<li>Ability to manage multiple concurrent projects and prioritise effectively.</li>
<li>Experience with version control, CI/CD, and modern software development practices.</li>
</ul>
<p><strong>Preferred Qualifications</strong></p>
<ul>
<li>Experience with Claude or other frontier AI models in production settings.</li>
<li>Background in machine learning, NLP, or related fields.</li>
<li>Experience with A/B testing and experimentation frameworks (e.g., Statsig).</li>
<li>Familiarity with AI safety and alignment considerations.</li>
<li>Experience building tools and infrastructure for ML/AI workflows.</li>
<li>Track record of improving AI system performance through systematic evaluation and iteration.</li>
</ul>
<p><strong>You Might Thrive in This Role If You…</strong></p>
<ul>
<li>Get excited about the nuances of how language models behave and love finding creative ways to improve their outputs.</li>
<li>Enjoy being at the intersection of research and product, translating cutting-edge capabilities into user value.</li>
<li>Are comfortable with ambiguity and can define success metrics for novel AI features.</li>
<li>Have a strong sense of ownership and drive projects from conception to production.</li>
<li>Are passionate about building AI systems that are helpful, harmless, and honest.</li>
<li>Thrive in collaborative environments and enjoy teaching others.</li>
</ul>
<p><strong>Logistics</strong></p>
<p><strong>Education requirements:</strong> We require at least a Bachelor&#39;s degree in a related field or equivalent experience.</p>
<p><strong>Location-based hybrid policy:</strong> Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</p>
<p><strong>Visa sponsorship:</strong> We do sponsor visas! However, we aren’t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</p>
<p><strong>We encourage you to apply even if you do not believe you meet every single qualification.</strong> Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you’re interested in this work.</p>
<p><strong>Your safety matters to us.</strong> To protect yourself from potential scams, we want to remind you that we will never ask you to pay any fees for the hiring process. If someone contacts you claiming to be from Anthropic and asks for money, please report it to us immediately.</p>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
<Salaryrange>$320,000 - $405,000 USD</Salaryrange>
      <Skills>Python, LLMs, Prompt engineering, Evaluation methodologies, Metrics for AI systems, Version control, CI/CD, Modern software development practices, Claude, Frontier AI models, Machine learning, NLP, A/B testing, Experimentation frameworks, AI safety, Alignment considerations, Tools and infrastructure for ML/AI workflows</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic is a quickly growing organisation with a mission to create reliable, interpretable, and steerable AI systems. Their team is a group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</Employerdescription>
      <Employerwebsite>https://job-boards.greenhouse.io</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/5107121008</Applyto>
      <Location>San Francisco, CA | New York City, NY</Location>
      <Country></Country>
      <Postedate>2026-03-08</Postedate>
    </job>
    <job>
      <externalid>447c26bd-a83</externalid>
      <Title>Research Engineer, Universes</Title>
      <Description><![CDATA[<p><strong>About the Role</strong></p>
<p>We&#39;re looking for Research Engineers to help us build the next generation of training environments for capable and safe agentic AI. This role blends research and engineering responsibilities, requiring you to both implement novel approaches and contribute to research direction.</p>
<p><strong>Responsibilities:</strong></p>
<ul>
<li>Build the next generation of agentic environments</li>
<li>Build rigorous evaluations that measure real capability</li>
<li>Collaborate across research and infrastructure teams to ship environments into production training</li>
<li>Debug and iterate rapidly across research and production ML stacks</li>
<li>Contribute to research culture through technical discussions and collaborative problem-solving</li>
</ul>
<p><strong>You may be a good fit if you:</strong></p>
<ul>
<li>Are highly impact-driven — you care about outcomes, not activity</li>
<li>Operate with high agency</li>
<li>Have good research taste or senior technical experience, demonstrating good judgment in identifying what actually matters in complex problem spaces</li>
<li>Can balance research exploration with engineering implementation</li>
<li>Are passionate about the potential impact of AI and are committed to developing safe and beneficial systems</li>
<li>Are comfortable with uncertainty and adapt quickly as the landscape shifts</li>
<li>Have strong software engineering skills and can build robust infrastructure</li>
<li>Enjoy pair programming (we love to pair!)</li>
</ul>
<p><strong>Strong candidates may also have one or more of the following:</strong></p>
<ul>
<li>Industry experience with large language model training, fine-tuning, or evaluation</li>
<li>Industry experience building RL environments, simulation systems, or large-scale ML infrastructure</li>
<li>Senior experience in a relevant technical field, even if transitioning domains</li>
<li>Deep expertise in sandboxing, containerization, VM infrastructure, or distributed systems</li>
<li>Published influential work in relevant ML areas</li>
</ul>
<p><strong>Logistics</strong></p>
<ul>
<li>Education requirements: We require at least a Bachelor&#39;s degree in a related field or equivalent experience.</li>
<li>Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</li>
<li>Visa sponsorship: We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</li>
</ul>
<p><strong>How we&#39;re different</strong></p>
<p>We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact — advancing our long-term goals of steerable, trustworthy AI — rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and biology as with traditional efforts in computer science. We&#39;re an extremely collaborative group, and we host frequent research discussions to ensure that we are pursuing the highest-impact work at any given time. As such, we greatly value communication skills.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$500,000 - $850,000 USD</Salaryrange>
      <Skills>reinforcement learning, training environments, evaluation methodologies, software engineering, pair programming, large language model training, RL environments, simulation systems, distributed systems, influential work in ML areas</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic&apos;s mission is to create reliable, interpretable, and steerable AI systems. The company is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</Employerdescription>
      <Employerwebsite>https://job-boards.greenhouse.io</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/5061517008</Applyto>
      <Location>San Francisco, CA, Seattle, WA, New York City, NY</Location>
      <Country></Country>
      <Postedate>2026-03-08</Postedate>
    </job>
    <job>
      <externalid>c33b2d78-cc9</externalid>
      <Title>Research Lead, Training Insights</Title>
      <Description><![CDATA[<p><strong>About the role</strong></p>
<p>As a Research Lead on the Training Insights team, you&#39;ll develop the strategy for, and lead execution on, how we measure and characterise model capabilities across training and deployment. This is a hands-on leadership role: you&#39;ll drive original research into new evaluation methodologies while leading a small team of researchers and research engineers doing the same.</p>
<p>Your work will span the full lifecycle of model development. You&#39;ll research and build new long-horizon evaluations that test the boundaries of what our models can achieve, develop novel approaches to measuring emerging capabilities, and deepen our understanding of how those capabilities develop — both during production RL training and after. You&#39;ll also take a cross-organisational view, working across Reinforcement Learning, Pretraining, Inference, Product, Alignment, Safeguards, and other teams to map the landscape of model evaluations at Anthropic and identify critical gaps in coverage.</p>
<p>This role carries significant visibility and impact. You&#39;ll help shape the evaluation narrative for model releases, contributing directly to how Anthropic communicates about its models to both internal and external audiences. Done well, you will change how the industry measures and understands model capabilities, significantly furthering our safety mission.</p>
<p><strong>Responsibilities:</strong></p>
<ul>
<li>Build novel, long-horizon evaluations</li>
<li>Develop novel measurement approaches for understanding how model capabilities emerge and evolve during RL training</li>
<li>Lead strategic evaluation coverage across the company</li>
<li>Shape the evaluation narrative for model releases</li>
<li>Lead and mentor a small team of researchers and research engineers, setting research direction and fostering a culture of rigorous, creative research</li>
<li>Design evaluation frameworks that balance scientific rigor with the practical demands of production training schedules</li>
<li>Build and maintain relationships across Anthropic&#39;s research organisation to ensure evaluation insights inform training and deployment decisions</li>
<li>Contribute to the broader research community through publications, open-source contributions, or external engagement on evaluation best practices</li>
</ul>
<p><strong>You may be a good fit if you:</strong></p>
<ul>
<li>Have significant experience designing and running evaluations for large language models or similar complex ML systems</li>
<li>Have led technical projects or teams, either formally or through sustained ownership of critical research directions</li>
<li>Are equally comfortable designing experiments and writing code — you can move between research and implementation fluidly</li>
<li>Think strategically about what to measure and why, not just how to measure it</li>
<li>Can synthesise information across multiple teams and workstreams to form a coherent picture of model capabilities</li>
<li>Communicate complex technical findings clearly to both technical and non-technical audiences</li>
<li>Are results-oriented and thrive in fast-paced environments where priorities shift based on research findings</li>
<li>Care deeply about AI safety and want your work to directly influence how capable AI systems are developed and deployed</li>
</ul>
<p><strong>Strong candidates may also have:</strong></p>
<ul>
<li>Experience building evaluations for long-horizon or agentic tasks</li>
<li>Deep familiarity with Reinforcement Learning training dynamics and how model behaviour changes during training</li>
<li>Published research in machine learning evaluation, benchmarking, or related areas</li>
<li>Experience with safety evaluation frameworks and red teaming methodologies</li>
<li>Background in psychometrics, experimental psychology, or other measurement-focused disciplines</li>
<li>A track record of communicating evaluation results to inform high-stakes decisions about model development or deployment</li>
<li>Experience managing or mentoring researchers and engineers</li>
</ul>
<p><strong>Representative projects:</strong></p>
<ul>
<li>Designing and implementing a suite of long-horizon evaluations that test model capabilities on tasks requiring sustained reasoning, planning, and tool use over extended interactions</li>
<li>Building systems to track capability development across RL training checkpoints, surfacing insights about when and how specific capabilities emerge</li>
<li>Conducting a cross-org audit of evaluation coverage, identifying blind spots, and prioritising new evaluations to fill critical gaps across Pretraining, RL, Inference, and Product</li>
<li>Developing the evaluation methodology and narrative for a major model release, working with research leads and communications to clearly characterise model capabilities and limitations</li>
<li>Researching and prototyping novel evaluation approaches for capabilities that are difficult to measure with existing benchmarks</li>
<li>Leading a team effort to build reusable evaluation infrastructure that serves multiple teams across the research organisation</li>
</ul>
<p><strong>Logistics</strong></p>
<p><strong>Education requirements:</strong> We require at least a Bachelor&#39;s degree in a related field or equivalent experience.</p>
<p><strong>Location-based hybrid policy:</strong> Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</p>
<p><strong>Visa sponsorship:</strong> We do sponsor visas!</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
<Salaryrange>$850,000 - $850,000 USD</Salaryrange>
      <Skills>machine learning, evaluation methodologies, Reinforcement Learning, Pretraining, Inference, Product, Alignment, Safeguards, psychometrics, experimental psychology, safety evaluation frameworks, red teaming methodologies</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Anthropic</Employername>
      <Employerlogo>https://logos.yubhub.co/anthropic.com.png</Employerlogo>
      <Employerdescription>Anthropic is a quickly growing organisation working to build beneficial AI systems. Their mission is to create reliable, interpretable, and steerable AI systems.</Employerdescription>
      <Employerwebsite>https://job-boards.greenhouse.io</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/anthropic/jobs/5139654008</Applyto>
      <Location>San Francisco, CA</Location>
      <Country></Country>
      <Postedate>2026-03-08</Postedate>
    </job>
  </jobs>
</source>