{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/agent-evaluation"},"x-facet":{"type":"skill","slug":"agent-evaluation","display":"Agent Evaluation","count":3},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_4f808d6c-a4e"},"title":"Machine Learning Research Engineer, GenAI Applied ML","description":"<p><strong>About This Role</strong></p>\n<p>Lead applied ML engineering on Scale&#39;s Applied ML team, powering data infrastructure for leading agentic LLMs (ChatGPT, Gemini, Llama). 
You will build scalable multi-agent systems to validate agentic reasoning and behaviours, scale human expertise, and drive research into real-world agent reliability failures despite strong benchmarks, shipping production fixes.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Build and deploy multi-agent systems for agentic reasoning validation</li>\n<li>Develop pipelines to detect errors and scale human judgment</li>\n<li>Combine classical ML, LLMs, and multi-agent techniques for reliability</li>\n<li>Lead research into agent failure modes and ship fixes</li>\n<li>Use AI tools to speed prototyping and iteration</li>\n<li>Build data-driven evaluations and deploy rapid improvements</li>\n<li>Integrate systems into Scale&#39;s platform</li>\n</ul>\n<p><strong>Ideal Candidate</strong></p>\n<ul>\n<li>PhD or MSc in Computer Science, Mathematics, Statistics, or related field</li>\n<li>3+ years shipping scaled production ML systems</li>\n<li>Demonstrated real-world impact</li>\n<li>Mastery of PyTorch, TensorFlow, JAX, or scikit-learn</li>\n<li>Deep expertise in agentic LLMs and multi-agent systems</li>\n<li>Strong software engineering and microservices (AWS/GCP)</li>\n<li>Rapid, data-driven iteration</li>\n<li>Proficiency using AI tools to accelerate work</li>\n<li>Strong research depth with practical bias</li>\n<li>Excellent cross-functional communication</li>\n</ul>\n<p><strong>Nice to Have</strong></p>\n<ul>\n<li>Experience prototyping agent evaluation/reliability systems</li>\n<li>Human-in-the-loop or annotation pipeline work</li>\n<li>Open-source contributions in agents, evaluation, or alignment</li>\n<li>Publications on agent reliability (NeurIPS, ICML, ICLR)</li>\n</ul>\n<p><strong>Compensation</strong></p>\n<p>Compensation packages at Scale for eligible roles include base salary, equity, and benefits. 
The range displayed on each job posting reflects the minimum and maximum target for new hire salaries for the position, determined by work location and additional factors, including job-related skills, experience, interview performance, and relevant education or training. Scale employees in eligible roles are also granted equity-based compensation, subject to Board of Director approval. Your recruiter can share more about the specific salary range for your preferred location during the hiring process, and confirm whether the hired role will be eligible for equity grant. You&#39;ll also receive benefits including, but not limited to: Comprehensive health, dental and vision coverage, retirement benefits, a learning and development stipend, and generous PTO. Additionally, this role may be eligible for additional benefits such as a commuter stipend.</p>\n<p><strong>About Us</strong></p>\n<p>At Scale, our mission is to develop reliable AI systems for the world&#39;s most important decisions. Our products provide the high-quality data and full-stack technologies that power the world&#39;s leading models, and help enterprises and governments build, deploy, and oversee AI applications that deliver real impact. We work closely with industry leaders like Meta, Cisco, DLA Piper, Mayo Clinic, Time Inc., the Government of Qatar, and U.S. government agencies including the Army and Air Force. 
We are expanding our team to accelerate the development of AI applications.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_4f808d6c-a4e","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Scale","sameAs":"https://scale.com/","logo":"https://logos.yubhub.co/scale.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/scaleai/jobs/4490301005","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$189,600-$237,000 USD","x-skills-required":["PyTorch","TensorFlow","JAX","scikit-learn","Agentic LLMs","Multi-agent systems","Software engineering","Microservices","Data-driven iteration","AI tools"],"x-skills-preferred":["Experience prototyping agent evaluation/reliability systems","Human-in-the-loop or annotation pipeline work","Open-source contributions in agents, evaluation, or alignment","Publications on agent reliability"],"datePosted":"2026-04-18T15:58:33.354Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA; New York, NY"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"PyTorch, TensorFlow, JAX, scikit-learn, Agentic LLMs, Multi-agent systems, Software engineering, Microservices, Data-driven iteration, AI tools, Experience prototyping agent evaluation/reliability systems, Human-in-the-loop or annotation pipeline work, Open-source contributions in agents, evaluation, or alignment, Publications on agent reliability","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":189600,"maxValue":237000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_d5f768d1-df6"},"title":"Full-Stack Engineer, AI Data Platform","description":"<p>Shape 
the Future of AI</p>\n<p>At Labelbox, we&#39;re building the critical infrastructure that powers breakthrough AI models at leading research labs and enterprises. Since 2018, we&#39;ve been pioneering data-centric approaches that are fundamental to AI development, and our work becomes even more essential as AI capabilities expand exponentially.</p>\n<p>We&#39;re the only company offering three integrated solutions for frontier AI development:</p>\n<ul>\n<li>Enterprise Platform &amp; Tools: Advanced annotation tools, workflow automation, and quality control systems that enable teams to produce high-quality training data at scale</li>\n</ul>\n<ul>\n<li>Frontier Data Labeling Service: Specialized data labeling through Alignerr, leveraging subject matter experts for next-generation AI models</li>\n</ul>\n<ul>\n<li>Expert Marketplace: Connecting AI teams with highly skilled annotators and domain experts for flexible scaling</li>\n</ul>\n<p>Why Join Us</p>\n<ul>\n<li>High-Impact Environment: We operate like an early-stage startup, focusing on impact over process. You&#39;ll take on expanded responsibilities quickly, with career growth directly tied to your contributions.</li>\n</ul>\n<ul>\n<li>Technical Excellence: Work at the cutting edge of AI development, collaborating with industry leaders and shaping the future of artificial intelligence.</li>\n</ul>\n<ul>\n<li>Innovation at Speed: We celebrate those who take ownership, move fast, and deliver impact. Our environment rewards high agency and rapid execution.</li>\n</ul>\n<ul>\n<li>Continuous Growth: Every role requires continuous learning and evolution. You&#39;ll be surrounded by curious minds solving complex problems at the frontier of AI.</li>\n</ul>\n<ul>\n<li>Clear Ownership: You&#39;ll know exactly what you&#39;re responsible for and have the autonomy to execute. 
We empower people to drive results through clear ownership and metrics.</li>\n</ul>\n<p>Role Overview</p>\n<p>We’re looking for a Full-Stack AI Engineer to join our team, where you’ll build the next generation of tools for developing, evaluating, and training state-of-the-art AI systems. You will own features end to end, from user-facing experiences and APIs to backend services, data models, and infrastructure.</p>\n<p>You’ll be at the heart of our applied AI efforts, with a particular focus on human-in-the-loop systems used to generate high-quality training data for Large Language Models (LLMs) and AI agents. This includes building a platform that enables us and our customers to create and evaluate data, as well as systems that leverage LLMs to assist with reviewing, scoring, and improving human submissions.</p>\n<p>Your Impact</p>\n<ul>\n<li>Own End-to-End Product Features</li>\n</ul>\n<p>Design, build, and ship complete workflows spanning frontend UI, APIs, backend services, databases, and production infrastructure.</p>\n<ul>\n<li>Enable Human-in-the-Loop AI Training</li>\n</ul>\n<p>Build systems that allow humans to efficiently create, review, and curate high-quality training and evaluation data used in AI model development.</p>\n<ul>\n<li>Support RLHF and Preference Data Workflows</li>\n</ul>\n<p>Design and implement tooling that supports RLHF-style pipelines, including task generation, human review, scoring, aggregation, and dataset versioning.</p>\n<ul>\n<li>Leverage LLMs in the Review Loop</li>\n</ul>\n<p>Build systems that use LLMs to assist human reviewers, such as automated checks, critiques, ranking suggestions, or quality signals, while maintaining human oversight.</p>\n<ul>\n<li>Advance AI Evaluation</li>\n</ul>\n<p>Design and implement evaluation frameworks and interactive tools for LLMs and AI agents across multiple data modalities (text, images, audio, video).</p>\n<ul>\n<li>Create Intuitive, Reviewer-Focused Interfaces</li>\n</ul>\n<p>Build 
thoughtful, efficient user interfaces (e.g., in React) optimized for high-throughput human review, quality control, and operational workflows.</p>\n<ul>\n<li>Architect Scalable Data &amp; Service Layers</li>\n</ul>\n<p>Design APIs, backend services, and data schemas that support large-scale data creation, review, and iteration with strong guarantees around correctness and traceability.</p>\n<ul>\n<li>Solve Ambiguous, Real-World Problems</li>\n</ul>\n<p>Translate loosely defined operational and research needs into practical, scalable, end-to-end systems.</p>\n<ul>\n<li>Ensure System Reliability</li>\n</ul>\n<p>Participate in on-call rotations to monitor, troubleshoot, and resolve issues across the full stack.</p>\n<ul>\n<li>Elevate the Team</li>\n</ul>\n<p>Improve engineering practices, development processes, and documentation. Share knowledge through technical writing and design discussions.</p>\n<p>What You Bring</p>\n<ul>\n<li>Bachelor’s degree in Computer Science, Data Engineering, or a related field.</li>\n</ul>\n<ul>\n<li>2+ years of experience in a software or machine learning engineering role.</li>\n</ul>\n<ul>\n<li>A proactive, product-focused mindset and a high degree of ownership, with a passion for building solutions that empower users.</li>\n</ul>\n<ul>\n<li>Experience using frontend frameworks like React/Redux and backend systems and technologies like Python, Java, GraphQL; familiarity with NodeJS and NestJS is a plus.</li>\n</ul>\n<ul>\n<li>Knowledge of designing and managing scalable database systems, including relational databases (e.g., PostgreSQL, MySQL), NoSQL stores (e.g., MongoDB, Cassandra), and cloud-native solutions (e.g., Google Spanner, AWS DynamoDB).</li>\n</ul>\n<ul>\n<li>Familiarity with cloud infrastructure like GCP (GCS, PubSub) and containerization (Kubernetes) is a plus.</li>\n</ul>\n<ul>\n<li>Excellent communication and collaboration skills.</li>\n</ul>\n<ul>\n<li>High proficiency in leveraging AI tools for daily development (e.g., 
Cursor, GitHub Copilot).</li>\n</ul>\n<ul>\n<li>Comfort and enthusiasm for working in a fast-paced, agile environment where rapid problem-solving is key.</li>\n</ul>\n<p>Bonus Points</p>\n<ul>\n<li>Experience building tools for AI/ML applications, particularly for data annotation, monitoring, or agent evaluation.</li>\n</ul>\n<ul>\n<li>Familiarity with data infrastructure components such as data pipelines, streaming systems, and storage architectures (e.g., Cloud Buckets, Key-Value Stores).</li>\n</ul>\n<ul>\n<li>Previous experience with search engines (e.g., ElasticSearch).</li>\n</ul>\n<ul>\n<li>Experience in optimizing databases for performance (e.g., schema design, indexing, query tuning) and integrating them with broader data workflows.</li>\n</ul>\n<p>Engineering at Labelbox</p>\n<p>At Labelbox Engineering, we&#39;re building a comprehensive platform that powers the future of AI development. Our team combines deep technical expertise with a passion for innovation, working at the intersection of AI infrastructure, data systems, and user experience. We believe in pushing technical boundaries while maintaining high standards of code quality and system reliability. Our engineering culture emphasizes autonomous decision-making, rapid iteration, and collaborative problem-solving. 
We&#39;ve cultivated an environment where engineers can take ownership of significant challenges, experiment with cutting-edge technologies, and see their solutions directly impact how leading AI labs and enterprises build the next generation of AI systems.</p>\n<p>Our Technology Stack</p>\n<p>Our engineering team works with a modern tech stack designed for scalability, performance, and developer efficiency:</p>\n<ul>\n<li>Frontend: React.js with Redux, TypeScript</li>\n</ul>\n<ul>\n<li>Backend: Node.js, TypeScript, Python, some Java &amp; Kotlin</li>\n</ul>\n<ul>\n<li>APIs: GraphQL</li>\n</ul>\n<ul>\n<li>Cloud &amp; Infrastructure: Google Cloud Platform (GCP), Kubernetes</li>\n</ul>\n<ul>\n<li>Databases: MySQL, Spanner, PostgreSQL</li>\n</ul>\n<ul>\n<li>Queueing / Streaming: Kafka, PubSub</li>\n</ul>\n<p>Labelbox strives to ensure pay parity across the organization and discuss compensation transparently. The expected annual base salary range for United States-based candidates is below. This range is not inclusive of any potential equity packages or additional benefits. 
Exact compensation varies based on a variety of factors, including skills and competencies, experience, and geographical location.</p>\n<p>Annual base salary range $130,000-$200,000 USD</p>\n<p>Life at Labelbox</p>\n<ul>\n<li>Location: Join our dedicated tech hubs in San Francisco or Wrocław, Poland</li>\n</ul>\n<ul>\n<li>Work Style: Hybrid model with 2 days per week in office, combining collaboration and flexibility</li>\n</ul>\n<ul>\n<li>Environment: Fast-paced and high-intensity, perfect for ambitious individuals who thrive on ownership and quick decision-making</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_d5f768d1-df6","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Labelbox","sameAs":"https://www.labelbox.com/","logo":"https://logos.yubhub.co/labelbox.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/labelbox/jobs/5019254007","x-work-arrangement":"hybrid","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":"$130,000-$200,000 USD","x-skills-required":["React","Redux","Node.js","TypeScript","Python","Java","GraphQL","MySQL","PostgreSQL","Spanner","Kafka","PubSub","GCP","Kubernetes","Cloud computing","Containerization","Database management","Cloud infrastructure","API design","Backend services","Data models","Infrastructure"],"x-skills-preferred":["AI tools","Cursor","GitHub Copilot","Data annotation","Monitoring","Agent evaluation","Data infrastructure","Data pipelines","Streaming systems","Storage architectures","Search engines","ElasticSearch","Database optimization","Schema design","Indexing","Query tuning"],"datePosted":"2026-04-18T15:57:55.464Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco Bay Area"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"React, Redux, Node.js, 
TypeScript, Python, Java, GraphQL, MySQL, PostgreSQL, Spanner, Kafka, PubSub, GCP, Kubernetes, Cloud computing, Containerization, Database management, Cloud infrastructure, API design, Backend services, Data models, Infrastructure, AI tools, Cursor, GitHub Copilot, Data annotation, Monitoring, Agent evaluation, Data infrastructure, Data pipelines, Streaming systems, Storage architectures, Search engines, ElasticSearch, Database optimization, Schema design, Indexing, Query tuning","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":130000,"maxValue":200000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_769c0070-5b2"},"title":"Research Scientist, Agent Robustness","description":"<p>As a Research Scientist working on Agent Robustness, you will work on the fundamental challenges of building AI agents that are safe and aligned with humans.</p>\n<p>For example, you might:</p>\n<ul>\n<li>Research the science of AI agent capabilities with a focus on how they relate to safety, risk factors, and methodologies for benchmarking them;</li>\n<li>Design and build harnesses to test AI agents&#39; tendency to take harmful actions when pressured to do so by users or tricked into doing so by elements of their environment;</li>\n<li>Design and build exploits and mitigations for new and unique failure modes that arise as AI agents gain affordances like coding, web browsing, and computer use;</li>\n<li>Characterize and design mitigations for potential failure modes or broader risks of systems involving multiple interacting AI agents.</li>\n</ul>\n<p>Ideally you&#39;d have:</p>\n<ul>\n<li>Commitment to our mission of promoting safe, secure, and trustworthy AI deployments in the industry as frontier AI capabilities continue to advance;</li>\n<li>Practical experience conducting technical research collaboratively;</li>\n<li>Experience with 
post-training and RL techniques such as RLHF, DPO, GRPO, and similar approaches;</li>\n<li>A track record of published research in machine learning, particularly in generative AI;</li>\n<li>At least three years of experience addressing sophisticated ML problems, whether in a research setting or in product development;</li>\n<li>Strong written and verbal communication skills to operate in a cross-functional team.</li>\n</ul>\n<p>Nice to have:</p>\n<ul>\n<li>Hands-on experience with agent evaluation frameworks such as SWE-bench, WebArena, OSWorld, Inspect, or similar tools;</li>\n<li>Experience with red-teaming, prompt injection, or adversarial testing of AI systems.</li>\n</ul>\n<p>Our research interviews are crafted to assess candidates&#39; skills in practical ML prototyping and debugging, their grasp of research concepts, and their alignment with our organisational culture. We will not ask any LeetCode-style questions. If you&#39;re excited about advancing AI safety and contributing to our mission, we encourage you to apply, even if your experience doesn&#39;t perfectly align with every requirement.</p>\n<p>Compensation packages at Scale for eligible roles include base salary, equity, and benefits. The range displayed on each job posting reflects the minimum and maximum target for new hire salaries for the position, determined by work location and additional factors, including job-related skills, experience, interview performance, and relevant education or training. Scale employees in eligible roles are also granted equity-based compensation, subject to Board of Director approval. Your recruiter can share more about the specific salary range for your preferred location during the hiring process, and confirm whether the hired role will be eligible for equity grant. You&#39;ll also receive benefits including, but not limited to: Comprehensive health, dental and vision coverage, retirement benefits, a learning and development stipend, and generous PTO. 
Additionally, this role may be eligible for additional benefits such as a commuter stipend.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_769c0070-5b2","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Scale","sameAs":"https://scale.com/","logo":"https://logos.yubhub.co/scale.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/scaleai/jobs/4675684005","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$216,000-$270,000 USD","x-skills-required":["Commitment to our mission of promoting safe, secure, and trustworthy AI deployments in the industry as frontier AI capabilities continue to advance","Practical experience conducting technical research collaboratively","Experience with post-training and RL techniques such as RLHF, DPO, GRPO, and similar approaches","A track record of published research in machine learning, particularly in generative AI","At least three years of experience addressing sophisticated ML problems, whether in a research setting or in product development"],"x-skills-preferred":["Hands-on experience with agent evaluation frameworks such as SWE-bench, WebArena, OSWorld, Inspect, or similar tools","Experience with red-teaming, prompt injection, or adversarial testing of AI systems"],"datePosted":"2026-04-18T15:57:29.447Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA; New York, NY"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Commitment to our mission of promoting safe, secure, and trustworthy AI deployments in the industry as frontier AI capabilities continue to advance, Practical experience conducting technical research collaboratively, Experience with post-training and RL techniques such as RLHF, DPO, GRPO, and similar approaches, A 
track record of published research in machine learning, particularly in generative AI, At least three years of experience addressing sophisticated ML problems, whether in a research setting or in product development, Hands-on experience with agent evaluation frameworks such as SWE-bench, WebArena, OSWorld, Inspect, or similar tools, Experience with red-teaming, prompt injection, or adversarial testing of AI systems","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":216000,"maxValue":270000,"unitText":"YEAR"}}}]}