{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/llm-evaluation-frameworks"},"x-facet":{"type":"skill","slug":"llm-evaluation-frameworks","display":"LLM Evaluation Frameworks","count":3},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_4d924e95-bdd"},"title":"Research Engineer, RL Infrastructure and Reliability (Knowledge Work)","description":"<p><strong>About the role</strong></p>\n<p>The Knowledge Work team builds the training environments and evaluations that make Claude effective at real-world professional workflows: searching, analysing, and creating across the tools and documents knowledge workers use every day.</p>\n<p>As that work scales, the systems behind it need to be as rigorous as the research itself. We are looking for a Research Engineer to own the reliability, observability, and infrastructure foundation that the team&#39;s research depends on.</p>\n<p>You will be responsible for ensuring our training and evaluation runs remain stable, well-instrumented, and high-quality as they grow in scale and complexity. 
A core part of this role is shifting reliability work from reactive to proactive: hardening systems, stress-testing at realistic scale, and building the observability and tooling that surface problems early, so researchers can stay focused on research rather than incident response.</p>\n<p>You will be the team&#39;s stable, context-rich owner for environment health and evaluation integrity, and the primary point of contact for partner teams when issues arise.</p>\n<p>While you&#39;ll work closely with researchers building new training environments, the priority for this role is the reliability those environments depend on. It&#39;s best suited to an engineer who finds real ownership and impact in making critical systems dependable, and in being the person behind trustworthy evaluation results the entire organisation relies on.</p>\n<p><strong>Key Responsibilities:</strong></p>\n<ul>\n<li>Serve as the dedicated reliability owner for the Knowledge Work training environments, providing continuity of context and reducing the operational overhead of rotating ownership</li>\n<li>Own a clean, canonical set of evaluation tools and processes for Knowledge Work capabilities, including the process used for model releases</li>\n<li>Build and automate observability, dashboards, and operational tooling for our training environments and evaluation systems, with an emphasis on high signal-to-noise: a small set of trusted metrics and alerts rather than sprawling instrumentation</li>\n<li>Proactively harden environments and evaluation systems through load testing, fault injection, and stress testing at realistic scale, so failures surface early rather than during critical training work</li>\n<li>Act as the primary point of contact for partner training and infrastructure teams when issues in our environments arise, and drive incidents to resolution</li>\n<li>Reduce the operational burden on researchers so they can stay focused on research</li>\n</ul>\n<p><strong>Minimum 
Qualifications:</strong></p>\n<ul>\n<li>Highly experienced Python engineer who ships reliable, well-instrumented code that teammates trust in production</li>\n<li>Demonstrated experience operating ML or distributed systems at scale, including significant on-call and incident-response experience</li>\n<li>Strong SRE or production-engineering mindset: reaching for SLOs, load tests, and failure injection before reaching for more dashboards</li>\n<li>Foundational ML knowledge sufficient to understand what a training environment or evaluation is actually measuring, and recognise when an evaluation has become stale or gameable</li>\n<li>Able to read research code and reason about evaluation integrity</li>\n</ul>\n<p><strong>Preferred Qualifications:</strong></p>\n<ul>\n<li>5+ years of experience operating ML or distributed systems at scale</li>\n<li>Experience building or operating RL environments, agent harnesses, or LLM evaluation frameworks</li>\n<li>Familiarity with reward modelling, evaluation design, or detecting and mitigating reward hacking</li>\n<li>Experience with observability stacks (metrics, tracing, structured logging) and operational dashboard tooling</li>\n<li>Background in chaos engineering, fault injection, or large-scale load testing</li>\n<li>Experience with data quality pipelines, drift detection, or evaluation-set curation and versioning</li>\n<li>Familiarity with large-scale training or inference infrastructure (schedulers, multi-agent orchestration, sandboxed execution)</li>\n<li>Prior experience as a dedicated reliability or operations owner embedded within a research team</li>\n</ul>\n<p><strong>Logistics</strong></p>\n<p>Minimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience. Required field of study: A field relevant to the role as demonstrated through coursework, training, or professional experience. Minimum years of experience: Years of experience required will correlate with the internal job 
level requirements for the position. Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices. Visa sponsorship: We do sponsor visas! However, we aren’t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</p>\n<p><strong>How we’re different</strong></p>\n<p>We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact: advancing our long-term goals of steerable, trustworthy AI, rather than work on smaller and more specific puzzles.</p>\n<p><strong>Come work with us!</strong></p>\n<p>Anthropic is a public benefit corporation headquartered in San Francisco. We offer competitive compensation and benefits, including a comprehensive health insurance package, 401(k) matching, and generous paid time off.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_4d924e95-bdd","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5197337008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$350,000-$850,000 USD","x-skills-required":["Python","ML","Distributed Systems","SRE","Production-Engineering","Observability","Dashboards","Operational Tooling","Load Testing","Fault Injection","Stress Testing","Reward Modelling","Evaluation Design","Data Quality Pipelines","Drift Detection","Evaluation-Set Curation","Versioning","Large-Scale 
Training","Inference Infrastructure","Schedulers","Multi-Agent Orchestration","Sandboxed Execution"],"x-skills-preferred":["RL Environments","Agent Harnesses","LLM Evaluation Frameworks","Chaos Engineering","Structured Logging","Dashboard Tooling"],"datePosted":"2026-04-24T13:11:33.535Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Python, ML, Distributed Systems, SRE, Production-Engineering, Observability, Dashboards, Operational Tooling, Load Testing, Fault Injection, Stress Testing, Reward Modelling, Evaluation Design, Data Quality Pipelines, Drift Detection, Evaluation-Set Curation, Versioning, Large-Scale Training, Inference Infrastructure, Schedulers, Multi-Agent Orchestration, Sandboxed Execution, RL Environments, Agent Harnesses, LLM Evaluation Frameworks, Chaos Engineering, Structured Logging, Dashboard Tooling","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":350000,"maxValue":850000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_c2419ec4-6fb"},"title":"Research Engineer, RL Infrastructure and Reliability (Knowledge Work)","description":"<p><strong>About the role</strong></p>\n<p>The Knowledge Work team builds the training environments and evaluations that make Claude effective at real-world professional workflows: searching, analysing, and creating across the tools and documents knowledge workers use every day.</p>\n<p>As that work scales, the systems behind it need to be as rigorous as the research itself. 
We are looking for a Research Engineer to own the reliability, observability, and infrastructure foundation that the team&#39;s research depends on.</p>\n<p>You will be responsible for ensuring our training and evaluation runs remain stable, well-instrumented, and high-quality as they grow in scale and complexity.</p>\n<p>A core part of this role is shifting reliability work from reactive to proactive: hardening systems, stress-testing at realistic scale, and building the observability and tooling that surface problems early, so researchers can stay focused on research rather than incident response.</p>\n<p>You will be the team&#39;s stable, context-rich owner for environment health and evaluation integrity, and the primary point of contact for partner teams when issues arise.</p>\n<p><strong>Key Responsibilities:</strong></p>\n<ul>\n<li>Serve as the dedicated reliability owner for the Knowledge Work training environments, providing continuity of context and reducing the operational overhead of rotating ownership</li>\n<li>Own a clean, canonical set of evaluation tools and processes for Knowledge Work capabilities, including the process used for model releases</li>\n<li>Build and automate observability, dashboards, and operational tooling for our training environments and evaluation systems, with an emphasis on high signal-to-noise: a small set of trusted metrics and alerts rather than sprawling instrumentation</li>\n<li>Proactively harden environments and evaluation systems through load testing, fault injection, and stress testing at realistic scale, so failures surface early rather than during critical training work</li>\n<li>Act as the primary point of contact for partner training and infrastructure teams when issues in our environments arise, and drive incidents to resolution</li>\n<li>Reduce the operational burden on researchers so they can stay focused on research</li>\n</ul>\n<p><strong>Minimum Qualifications:</strong></p>\n<ul>\n<li>Highly experienced 
Python engineer who ships reliable, well-instrumented code that teammates trust in production</li>\n<li>Demonstrated experience operating ML or distributed systems at scale, including significant on-call and incident-response experience</li>\n<li>Strong SRE or production-engineering mindset: reaching for SLOs, load tests, and failure injection before reaching for more dashboards</li>\n<li>Foundational ML knowledge sufficient to understand what a training environment or evaluation is actually measuring, and recognise when an evaluation has become stale or gameable</li>\n<li>Able to read research code and reason about evaluation integrity</li>\n</ul>\n<p><strong>Preferred Qualifications:</strong></p>\n<ul>\n<li>5+ years of experience operating ML or distributed systems at scale</li>\n<li>Experience building or operating RL environments, agent harnesses, or LLM evaluation frameworks</li>\n<li>Familiarity with reward modelling, evaluation design, or detecting and mitigating reward hacking</li>\n<li>Experience with observability stacks (metrics, tracing, structured logging) and operational dashboard tooling</li>\n<li>Background in chaos engineering, fault injection, or large-scale load testing</li>\n<li>Experience with data quality pipelines, drift detection, or evaluation-set curation and versioning</li>\n<li>Familiarity with large-scale training or inference infrastructure (schedulers, multi-agent orchestration, sandboxed execution)</li>\n<li>Prior experience as a dedicated reliability or operations owner embedded within a research team</li>\n</ul>\n<p><strong>Logistics</strong></p>\n<p>Minimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience. Required field of study: A field relevant to the role as demonstrated through coursework, training, or professional experience. Minimum years of experience: Years of experience required will correlate with the internal job level requirements for the position. Location-based hybrid 
policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices. Visa sponsorship: We do sponsor visas! However, we aren’t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</p>\n<p><strong>How we’re different</strong></p>\n<p>We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact: advancing our long-term goals of steerable, trustworthy AI, rather than work on smaller and more specific puzzles.</p>\n<p><strong>Come work with us!</strong></p>\n<p>Anthropic is a public benefit corporation headquartered in San Francisco. We offer competitive compensation and benefits.</p>","url":"https://yubhub.co/jobs/job_c2419ec4-6fb","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5197337008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$350,000-$850,000 USD","x-skills-required":["Python","ML","Distributed Systems","SRE","Production-Engineering","Observability","Dashboards","Operational Tooling","Load Testing","Fault Injection","Stress Testing","Reliability","Infrastructure Foundation","Evaluation Integrity"],"x-skills-preferred":["RL Environments","Agent Harnesses","LLM Evaluation Frameworks","Reward Modelling","Evaluation Design","Chaos Engineering","Data Quality Pipelines","Drift Detection","Evaluation-Set Curation","Versioning","Large-Scale 
Training","Inference Infrastructure","Schedulers","Multi-Agent Orchestration","Sandboxed Execution"],"datePosted":"2026-04-24T12:16:31.677Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Python, ML, Distributed Systems, SRE, Production-Engineering, Observability, Dashboards, Operational Tooling, Load Testing, Fault Injection, Stress Testing, Reliability, Infrastructure Foundation, Evaluation Integrity, RL Environments, Agent Harnesses, LLM Evaluation Frameworks, Reward Modelling, Evaluation Design, Chaos Engineering, Data Quality Pipelines, Drift Detection, Evaluation-Set Curation, Versioning, Large-Scale Training, Inference Infrastructure, Schedulers, Multi-Agent Orchestration, Sandboxed Execution","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":350000,"maxValue":850000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_bf1554e6-c64"},"title":"Software Engineer, AI Agents","description":"<p>About Us</p>\n<p>At Cloudflare, we are on a mission to help build a better Internet. Today the company runs one of the world&#39;s largest networks that powers millions of websites and other Internet properties for customers ranging from individual bloggers to SMBs to Fortune 500 companies.</p>\n<p>We protect and accelerate any Internet application online without adding hardware, installing software, or changing a line of code. Internet properties powered by Cloudflare all have web traffic routed through its intelligent global network, which gets smarter with every request. 
As a result, they see significant improvement in performance and a decrease in spam and other attacks.</p>\n<p>Why This Role Matters</p>\n<p>At Cloudflare, we&#39;re building industrial-scale AI agents that support customers directly. This isn&#39;t research theater. Your code will power real customer interactions from day one, at global scale. Cloudflare already has the parts. You will assemble Workers, Durable Objects, KV, R2, D1, Vectorize, Workers AI, AI Gateway, and the Agent SDK into real agents customers use every day.</p>\n<p>Role Intent</p>\n<p>Ship production agents on the Cloudflare stack. Build, deploy, learn, repeat. Your code is the front door for Cloudflare customers.</p>\n<p>What You Will Do</p>\n<ul>\n<li>Build agents on Workers with Durable Objects for state and short term memory</li>\n<li>Wire tools with the Agent SDK, MCP, and function calling</li>\n<li>Use Vectorize, KV, R2, and D1 for semantic memory, cache, files, and config</li>\n<li>Run models through Workers AI and AI Gateway; integrate third parties when needed</li>\n<li>Create evals, guardrails, and audits. Measure, tune, re-ship fast</li>\n<li>Build agents that summarize, propose fixes, and escalate cleanly to humans</li>\n<li>Expose agent health and metrics in transparent dashboards. No mystery boxes</li>\n<li>Integrate with queues and webhooks; publish events on Queues or Pub/Sub</li>\n<li>Cut cost per case and time to first response. 
Prove it with data.</li>\n<li>Take end to end ownership including on call for what you ship (with team support)</li>\n<li>Design and maintain robust observability for distributed AI workflows, implementing structured logging and end-to-end tracing across async service boundaries to ensure visibility into agent reasoning and execution.</li>\n<li>Architect security boundaries for agent-led operations, implementing secure credential handling, multi-layer approval gates, and fine-grained trust scoping for mutative actions.</li>\n</ul>\n<p>Must Have</p>\n<ul>\n<li>Demonstrated success shipping production systems. Repos and releases that show real work.</li>\n<li>Strong in TypeScript or Rust on Workers. HTTP, queues, async, performance</li>\n<li>Fluency with Durable Objects, KV or R2, and either D1 or Postgres</li>\n<li>Hands on with model tooling. Prompt I/O, tool calling, evals, safety checks</li>\n<li>Observability mindset. Logs, traces, metrics, redlines</li>\n<li>Experience with a2a/multi-agent frameworks</li>\n<li>Experience developing LLM evaluation frameworks: automated scoring systems, CI-integrated quality gates.</li>\n</ul>\n<p>Nice to Have</p>\n<ul>\n<li>Workers AI, AI Gateway, and Vectorize in production</li>\n<li>Salesforce or Service Cloud experience. Webhooks and case APIs</li>\n<li>Security depth. Prompt injection protection, secrets detection, PII handling</li>\n<li>OSS agent frameworks. Know what to borrow and what to throw away.</li>\n</ul>\n<p>How We Build</p>\n<p>Align fast on what matters.</p>\n<p>Divide and conquer. Own your piece.</p>\n<p>Ship. Watch customers use it.</p>\n<p>Learn and repeat.</p>\n<p>Why Join Cloudflare in India?</p>\n<p>Impact at global scale: Your code will serve Cloudflare&#39;s customers across every region. 
Tens of millions of Internet properties depend on us.</p>\n<p>Work on the edge: Few companies give engineers the chance to build AI directly into an edge platform that runs in 300+ cities worldwide.</p>\n<p>Career growth: As one of the early engineers in our India based AI team, you&#39;ll have visibility, leadership opportunities, and a direct hand in shaping Cloudflare&#39;s AI roadmap.</p>\n<p>Culture of ownership: We believe in autonomy, accountability, and trust. Engineers here own outcomes, not just tickets.</p>\n<p>Learn and grow fast: Collaborate with peers across Support, Product, Security, and AI Platform teams. We encourage knowledge sharing, mentorship, and continuous learning.</p>\n<p>Interview Signal</p>\n<p>Expect to demonstrate your ability to:</p>\n<ul>\n<li>Build a mini agent on Workers using the Agent SDK</li>\n<li>Store session memory in Durable Objects</li>\n<li>Add semantic recall with Vectorize</li>\n<li>Ship behind a KV flag with traces and observability</li>\n<li>Push to production fast and take ownership</li>\n</ul>\n<p>Team Mission</p>\n<p>The Agent Tech team owns the end to end stack for customer facing agents on Cloudflare. Everything runs at the edge.</p>\n<p>Core Stack: Workers, Durable Objects, KV, R2, D1, Queues, Pub/Sub, Vectorize, Workers AI, AI Gateway, Pages, Zero Trust.</p>\n<p>Principles: Ship fast. Measure truth. Simplify relentlessly. Own outcomes.</p>\n<p>Fraud Alert</p>\n<p>Do not fall victim to recruitment fraud. Cloudflare never charges application fees or requires candidates to purchase third-party certifications or training as a condition of employment. All official communication comes strictly from @cloudflare.com email addresses.</p>\n<p>What Makes Cloudflare Special?</p>\n<p>We’re not just a highly ambitious, large-scale technology company. 
We’re a highly ambitious, large-scale technology company with a soul.</p>\n<p>Fundamental to our mission to help build a better Internet is protecting the free and open Internet.</p>\n<p>Project Galileo: Since 2014, we&#39;ve equipped more than 2,400 journalism and civil society organizations in 111 countries with powerful tools to defend themselves against attacks that would otherwise censor their work, technology already used by Cloudflare’s enterprise customers, at no cost.</p>\n<p>Athenian Project: In 2017, we created the Athenian Project to ensure that state and local governments have the highest level of protection and reliability for free, so that their constituents have access to election information and voter registration. Since then, we&#39;ve provided services to more than 425 local government election websites in 33 states.</p>\n<p>1.1.1.1: We released 1.1.1.1 to help fix the foundation of the Internet by building a faster, more secure and privacy-centric public DNS resolver. This is available publicly for everyone to use: it is the first consumer-focused service Cloudflare has ever released.</p>\n<p>Here’s the deal: we never, ever store client IP addresses. We will continue to abide by our privacy commitment and ensure that no user data is sold to advertisers or used to target consumers.</p>\n<p>Sound like something you’d like to be a part of? We’d love to hear from you!</p>\n<p>This position may require access to information protected under U.S. export control laws, including the U.S. Export Administration Regulations. Please note that any offer of employment may be conditioned on your authorization to receive software or technology controlled under these U.S. export laws without sponsorship for an export license.</p>\n<p>Cloudflare is proud to be an equal opportunity employer. 
We are committed to providing equal employment opportunities to all employees and applicants for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, veteran status, genetic information, or any other characteristic protected by law.</p>","url":"https://yubhub.co/jobs/job_bf1554e6-c64","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Cloudflare","sameAs":"https://www.cloudflare.com/","logo":"https://logos.yubhub.co/cloudflare.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/cloudflare/jobs/7831810","x-work-arrangement":null,"x-experience-level":null,"x-job-type":null,"x-salary-range":null,"x-skills-required":["TypeScript","Rust","Workers","HTTP","queues","async","performance","Durable Objects","KV","R2","Postgres","model tooling","prompt I/O","tool calling","evals","safety checks","observability mindset","logs","traces","metrics","redlines","a2a/multi-agent frameworks","LLM evaluation frameworks","automated scoring systems","CI-integrated quality gates"],"x-skills-preferred":[],"datePosted":"2026-04-24T12:12:03.262Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"In-Office"}},"occupationalCategory":"Engineering","industry":"Technology","skills":"TypeScript, Rust, Workers, HTTP, queues, async, performance, Durable Objects, KV, R2, Postgres, model tooling, prompt I/O, tool calling, evals, safety checks, observability mindset, logs, traces, metrics, redlines, a2a/multi-agent frameworks, LLM evaluation frameworks, automated scoring systems, CI-integrated quality gates"}]}