Evals Engineer, Applied AI

9a42f26c-511 Evals Engineer, Applied AI

We are seeking a technically rigorous and driven AI Research Engineer to join our Enterprise Evaluations team. This high-impact role is critical to our mission of delivering the industry's leading GenAI Evaluation Suite.

As a hands-on contributor to the core systems that ensure the safety, reliability, and continuous improvement of LLM-powered workflows and agents for the enterprise, you will partner with Scale's Operations team and enterprise customers to translate ambiguity into structured evaluation data. This involves guiding the creation and maintenance of gold-standard human-rated datasets and expert rubrics that anchor AI evaluation systems.

Your responsibilities will also include analysing feedback and collected data to identify patterns, refine evaluation frameworks, and establish iterative improvement loops that enhance the quality and relevance of human-curated assessments. You will design, research, and develop LLM-as-a-Judge autorater frameworks and AI-assisted evaluation systems, including creating models that critique, grade, and explain agent outputs.

To succeed in this role, you will need a strong foundational knowledge of large language models, a passion for tackling complex evaluation challenges, and the ability to thrive in a dynamic, fast-paced research environment. You should be able to think outside the box, stay current with the latest literature in AI evaluation, and be passionate about integrating novel research ideas into our workflows to build best-in-class evaluation systems.

In addition to your technical expertise, you will need excellent communication and collaboration skills, as you will work closely with cross-functional teams to drive project success.

If you are a motivated and detail-oriented individual with a passion for AI research and evaluation, we encourage you to apply for this exciting opportunity.

XML job scraping automation by YubHub

full-time mid hybrid $216,000-$270,000 USD
Python, PyTorch, TensorFlow, Large Language Models, Generative AI, Machine Learning, Applied Research, Evaluation Infrastructure, Advanced degree in Computer Science, Machine Learning, or a related quantitative field, Published research in leading ML or AI conferences, Experience designing, building, or deploying LLM-as-a-Judge frameworks or other automated evaluation systems, Experience collaborating with operations or external teams to define high-quality human annotator guidelines, Expertise in ML research engineering, stochastic systems, observability, or LLM-powered applications for model evaluation and analysis
Engineering Technology
Scale AI https://logos.yubhub.co/scale.com.png Scale AI develops reliable AI systems for the world's most important decisions. https://scale.com/ https://job-boards.greenhouse.io/scaleai/jobs/4629589005
San Francisco, CA; New York, NY 2026-04-18

30e11492-d62 Software Engineer, Safeguards Infrastructure

We are looking for software engineers to help build the foundational pieces for safety, oversight and intervention mechanisms of our AI systems. As a software engineer on the Safeguards team, you will work to monitor models, prevent misuse, and ensure user well-being. This role will focus on building systems to detect unwanted model behaviors and prevent disallowed use of models. You will apply your technical skills to uphold our principles of safety, transparency, and oversight.

Responsibilities:

Develop the foundational systems which power Safeguards, including infrastructure for data storage and management, metric and evaluation systems, and tooling for human and agentic review.
Ensure the day-to-day running of Safeguards systems and hold a high operational bar which serves both safety and customers while reducing the amount of human intervention and oversight required.
Build robust and reliable multi-layered defenses for real-time improvement of safety mechanisms that work at scale.

You may be a good fit if you have:

Bachelor’s degree in Computer Science, Software Engineering or comparable experience
4-10+ years of experience in a software engineering position
Proficiency in Python
Ability to work across the stack
Strong communication skills and ability to explain complex technical concepts to non-technical stakeholders

Strong candidates may also:

Have experience building trust and safety, anti-spam, fraud or abuse detection and mitigation mechanisms and interventions for AI/ML systems
Have experience building metrics and measurement systems or data and privacy management systems
Have worked closely with operational teams to build custom internal tooling
Be proficient in TypeScript or Rust
Have experience with Claude Code or similar agentic coding tools

The annual compensation range for this role is £255,000-£325,000 GBP.

full-time senior hybrid £255,000-£325,000 GBP
Python, Software Engineering, Data Storage and Management, Metric and Evaluation Systems, Tooling for Human and Agentic Review, TypeScript, Rust, Claude Code, Agentic Coding Tools
Engineering Technology
Anthropic https://logos.yubhub.co/anthropic.com.png Anthropic is a public benefit corporation that creates reliable, interpretable, and steerable AI systems. https://www.anthropic.com/ https://job-boards.greenhouse.io/anthropic/jobs/5074908008
London, UK 2026-04-18

1bb1aad7-0aa Model Quality Software Engineer, Claude Code

We're looking for a Staff Software Engineer to set technical direction at the intersection of engineering and research on the Claude Code team. In this role, you'll partner directly with Anthropic's researchers and engineering leadership to shape how we measure, understand, and improve Claude's coding capabilities.

As a senior individual contributor, you'll be accountable for the technical decisions that ripple across the team and beyond. You'll architect the systems, tooling, and evaluation infrastructure that determine how quickly our research can move.

Responsibilities:

Set technical direction for evaluation systems, research infrastructure, and internal tooling across the Claude Code team

Architect eval frameworks that measure model capabilities across diverse coding tasks and scale with our research roadmap

Lead the design of infrastructure that enables researchers to run experiments at scale, and make the foundational tradeoffs that shape how the team operates for years

Identify the highest-leverage engineering investments, often before anyone has asked for them, and drive them to completion

Serve as a senior technical bridge between product and research, using strong product intuition to influence which capabilities we prioritize and how we measure progress against them

Mentor and raise the bar for other engineers on the team; review designs, unblock peers, and model the engineering standards we want to scale

Partner with research leads to translate ambiguous research questions into durable engineering solutions

Own critical systems end-to-end, from architecture through production reliability, and take responsibility for their long-term health

If you have 10+ years of software engineering experience, with a track record of operating as a Staff or Principal engineer (or equivalent) at a high-caliber organization, you may be a good fit for this role.

Strong candidates may also have experience with: designing or scaling evaluation frameworks for ML systems; reinforcement learning infrastructure or training systems; leading technical initiatives in high-performance, demanding environments; and research computing, scientific infrastructure, or developer platforms at scale. They may also have a strong quantitative foundation (math, physics, or related fields) and expertise in Python and TypeScript.

The annual compensation range for this role is $405,000-$485,000 USD.

full-time staff hybrid $405,000-$485,000 USD
software engineering, evaluation systems, research infrastructure, internal tooling, eval frameworks, model capabilities, research roadmap, infrastructure design, experimentation, engineering investments, product research, mentoring, design review, engineering standards, critical systems, architecture, production reliability, Python, TypeScript, ML systems, reinforcement learning, research computing, scientific infrastructure, developer platforms
Engineering Technology
Anthropic https://logos.yubhub.co/anthropic.com.png Anthropic is a public benefit corporation that creates reliable, interpretable, and steerable AI systems. https://www.anthropic.com/ https://job-boards.greenhouse.io/anthropic/jobs/5098025008
San Francisco, CA | New York City, NY 2026-04-18

813dd0ec-e42 Software Engineer, Safeguards Infrastructure

About Anthropic

Anthropic's mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

About the role

We are looking for software engineers to help build the foundational pieces for safety, oversight and intervention mechanisms of our AI systems. As a software engineer on the Safeguards team, you will work to monitor models, prevent misuse, and ensure user well-being. This role will focus on building systems to detect unwanted model behaviors and prevent disallowed use of models. You will apply your technical skills to uphold our principles of safety, transparency, and oversight.

Responsibilities:

Develop the foundational systems which power Safeguards, including infrastructure for data storage and management, metric and evaluation systems, and tooling for human and agentic review.
Ensure the day-to-day running of Safeguards systems and hold a high operational bar which serves both safety and customers while reducing the amount of human intervention and oversight required.
Build robust and reliable multi-layered defenses for real-time improvement of safety mechanisms that work at scale.

You may be a good fit if you have:

Bachelor’s degree in Computer Science, Software Engineering or comparable experience
4-10+ years of experience in a software engineering position
Proficiency in Python
Ability to work across the stack
Strong communication skills and ability to explain complex technical concepts to non-technical stakeholders

Strong candidates may also:

Have experience building trust and safety, anti-spam, fraud or abuse detection and mitigation mechanisms and interventions for AI/ML systems
Have experience building metrics and measurement systems or data and privacy management systems
Have worked closely with operational teams to build custom internal tooling
Be proficient in TypeScript or Rust
Have experience with Claude Code or similar agentic coding tools

Deadline to apply:

None. Applications will be reviewed on a rolling basis.

Logistics

Education requirements: We require at least a Bachelor's degree in a related field or equivalent experience.

Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.

Visa sponsorship:

We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.

We encourage you to apply even if you do not believe you meet every single qualification.

Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you're interested in this work.

Your safety matters to us.

To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses. In some cases, we may partner with vetted recruiting agencies who will identify themselves as working on behalf of Anthropic. Be cautious of emails from other domains. Legitimate Anthropic recruiters will never ask for money, fees, or banking information before your first day. If you're ever unsure about a communication, don't click any links—visit anthropic.com/careers directly for confirmed position openings.

How we're different

We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact — advancing our long-term goals of steerable, trustworthy AI — rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and biology as with traditional efforts in computer science. We're an extremely collaborative group, and we host frequent research discussions to ensure that we are pursuing the highest-impact work at any given time. As such, we greatly value communication skills.

The easiest way to understand our research directions is to read our recent research. This research continues many of the directions our team worked on prior to Anthropic, including: GPT-3, Circuit-Based Interpretability, Multimodal Neurons, Scaling Laws, AI & Compute, Concrete Problems in AI Safety, and Learning from Human Preferences.

Come work with us!

Anthropic is a public benefit corporation headquartered in San Francisco. We offer competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a lovely office space in which to collaborate with colleagues.

Guidance on Candidates' AI Usage: Learn about our policy for using AI in our application process

full-time senior hybrid £255,000-£325,000 GBP
Python, Software Engineering, Computer Science, Data Storage and Management, Metric and Evaluation Systems, Tooling for Human and Agentic Review, TypeScript, Rust, Claude Code, Agentic Coding Tools, Trust and Safety, Anti-Spam, Fraud or Abuse Detection and Mitigation Mechanisms
Engineering Technology
Anthropic https://logos.yubhub.co/anthropic.com.png Anthropic is a public benefit corporation that aims to create reliable, interpretable, and steerable AI systems. The company is headquartered in San Francisco and has a team of researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems. https://job-boards.greenhouse.io https://job-boards.greenhouse.io/anthropic/jobs/5074908008
London, UK 2026-03-08

1f4fed68-606 Principal Product Manager, Foundational AI Research

Summary

Microsoft AI are looking for a talented Principal Product Manager, Foundational AI Research at their New York office. This role sits at the heart of strategic decision-making, turning market data into actionable insights for a company that's revolutionising AI technology. You'll work directly with leadership to shape the company's direction in the AI market.

About the Role

As a Principal Product Manager, Foundational AI Research, you will advance the next generation of LLMs, working on one of three areas: (1) pre-training and post-training data and evaluations, (2) training infrastructure, or (3) API and Platform. You will work backward from customer needs to prioritize new research, build evals and datasets, and work closely with AI researchers to build and execute project plans.

Accountabilities

Identify and prioritize language / coding / multimodal issues and work with researchers to find a path to resolution.
Create novel data collection tasks for taskers to evaluate models and to collect training data for fine-tuning.

The Candidate we're looking for

Experience:

8+ years of experience in product/service/program management or software development.

Technical skills:

Experience building and shipping data pipelines or evaluation systems personally.

Personal attributes:

Proactive attitude, enthusiasm for exploring new methods, and familiarity with research and technical advances in the area.

Benefits

Product Management IC5 – The typical base pay range for this role across the U.S. is USD $139,900 – $274,800 per year.
Certain roles may be eligible for benefits and other compensation.

full-time senior onsite USD $139,900 – $274,800 per year
product management, software development, data pipelines, evaluation systems, research, technical advancement, proactive attitude
Engineering Technology
Microsoft AI https://logos.yubhub.co/microsoft.ai.png Microsoft AI is a leading company in the field of artificial intelligence, with a mission to train the world's most capable AI frontier models and push the boundaries of scale, performance, and product deployment. https://microsoft.ai https://microsoft.ai/job/principal-product-manager-foundational-ai-research-3/
New York 2026-03-06

777fd980-5ee Principal Product Manager, Foundational AI Research

Summary

Microsoft AI are looking for a talented Principal Product Manager, Foundational AI Research at their Redmond office. This role sits at the heart of strategic decision-making, turning market data into actionable insights for a company that's revolutionising AI technology. You'll work directly with leadership to shape the company's direction in the AI market.

About the Role

Accountabilities

Identify and prioritize language / coding / multimodal issues and work with researchers to find a path to resolution.
Create novel data collection tasks for taskers to evaluate models and to collect training data for fine-tuning.

The Candidate we're looking for

Experience:

8+ years of experience in product/service/program management or software development.

Technical skills:

Experience building and shipping data pipelines or evaluation systems personally.

Personal attributes:

Proactive attitude, enthusiasm for exploring new methods, and familiarity with research and technical advances in the area.

Benefits

Product Management IC5 – The typical base pay range for this role across the U.S. is USD $139,900 – $274,800 per year.
Certain roles may be eligible for benefits and other compensation.

full-time senior onsite USD $139,900 – $274,800 per year
product management, software development, data pipelines, evaluation systems, research, technical advancement, proactive attitude
Engineering Technology
Microsoft AI https://logos.yubhub.co/microsoft.ai.png Microsoft AI is a leading provider of artificial intelligence solutions, working to train the world's most capable AI frontier models and push the boundaries of scale, performance, and product deployment. https://microsoft.ai https://microsoft.ai/job/principal-product-manager-foundational-ai-research-2/
Redmond 2026-03-06