{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/infrastructure-expertise"},"x-facet":{"type":"skill","slug":"infrastructure-expertise","display":"Infrastructure Expertise","count":2},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_272bd1ad-99d"},"title":"Software Engineer, Sandboxing","description":"<p><strong>About the Role</strong></p>\n<p>Anthropic&#39;s sandboxing infrastructure enables Claude to safely execute code and interact with external systems. As we expand Claude&#39;s capabilities, the reliability, security, and developer experience of this infrastructure becomes increasingly critical. We&#39;re looking for an engineer to join the sandboxing team and help shape both the client-side library/API and the underlying infrastructure.</p>\n<p>In this role, you&#39;ll combine deep infrastructure expertise with an obsession for developer experience. You&#39;ll help maintain and evolve a system that must be correct, performant, and intuitive to use. You&#39;ll work closely with internal teams to understand their needs, burn down errors and edge cases, and build a roadmap that anticipates where the product needs to go. 
This is a role for someone who finds satisfaction in both the craft of building reliable systems and the empathy required to serve developers and researchers well.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Contribute to the client library, API surface, and underlying infrastructure for Anthropic&#39;s sandboxing system, ensuring it is reliable, well-documented, and intuitive to use</li>\n<li>Drive down error rates and improve correctness through systematic debugging, monitoring, and proactive fixes</li>\n<li>Help develop and maintain a product roadmap for sandboxing capabilities, balancing immediate needs with long-term architectural improvements</li>\n<li>Partner closely with internal teams using the sandboxing system to understand their requirements, debug issues, and build tooling that serves their use cases</li>\n<li>Respond to incidents and production issues with urgency, conducting thorough root cause analysis and implementing preventive measures</li>\n<li>Build comprehensive testing, observability, and documentation to ensure the system meets a high quality bar</li>\n<li>Collaborate across the sandboxing team, flexing between client-side and infrastructure work as needed</li>\n</ul>\n<p><strong>You May Be a Good Fit If You</strong></p>\n<ul>\n<li>Have 5+ years of software engineering experience, with meaningful time spent maintaining libraries, SDKs, or developer-facing APIs</li>\n<li>Obsess over developer experience: you&#39;ve thought deeply about API design, error propagation, documentation, and the small details that make a library feel well-crafted</li>\n<li>Have experience operating complex distributed systems</li>\n<li>Bring a track record of systematically improving reliability: you&#39;ve burned down error budgets, built monitoring, and driven issues to resolution</li>\n<li>Can develop and articulate a long-term vision for a product, translating user feedback and technical constraints into a coherent roadmap</li>\n<li>Are comfortable 
with ambiguity and can context-switch between reactive incident work and proactive product development</li>\n<li>Communicate clearly with both technical and non-technical stakeholders</li>\n</ul>\n<p><strong>Strong Candidates May Also Have</strong></p>\n<ul>\n<li>Experience as a founder or early engineer at an infrastructure-focused startup, where you owned a product end-to-end</li>\n<li>Background in security, sandboxing, or isolation technologies (containers, VMs, seccomp, namespaces, etc.)</li>\n<li>Open-source contributions in the Python ecosystem</li>\n<li>Experience building developer tools, CLIs, or platforms used by other engineers</li>\n<li>History of working on incident response and on-call rotations for production systems</li>\n<li>Exposure to reinforcement learning or model training infrastructure</li>\n</ul>\n<p><strong>Representative Projects</strong></p>\n<p>These are examples of past work that would indicate a good fit, not a description of the role itself:</p>\n<ul>\n<li>Maintaining an open source SDK through multiple major version upgrades while minimizing breaking changes for users</li>\n<li>Leading an initiative to reduce P0 incidents by XX% through improved error handling, retries, and observability</li>\n<li>Building a developer platform at a startup from zero to product-market fit, iterating based on user feedback</li>\n<li>Embedding with an internal team for a quarter to deeply understand their workflows and shipping targeted improvements to a piece of infrastructure they rely on</li>\n<li>Developing a multi-quarter roadmap for a developer tools product, balancing user requests with technical debt reduction</li>\n</ul>\n<p><strong>Logistics</strong></p>\n<ul>\n<li>Minimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience</li>\n<li>Required field of study: A field relevant to the role as demonstrated through coursework, training, or professional experience</li>\n<li>Minimum years of experience: 
Years of experience required will correlate with the internal job level requirements for the position</li>\n<li>Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</li>\n<li>Visa sponsorship: We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_272bd1ad-99d","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5083039008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$300,000-$405,000 USD","x-skills-required":["software engineering","infrastructure expertise","developer experience","API design","error propagation","documentation","distributed systems","complex systems","reliability","monitoring","root cause analysis","preventive measures","testing","observability","collaboration","communication"],"x-skills-preferred":["founder","early engineer","security","sandboxing","isolation technologies","open-source contributions","developer tools","incident response","on-call rotations","reinforcement learning","model training infrastructure"],"datePosted":"2026-04-18T15:51:53.000Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"software engineering, infrastructure 
expertise, developer experience, API design, error propagation, documentation, distributed systems, complex systems, reliability, monitoring, root cause analysis, preventive measures, testing, observability, collaboration, communication, founder, early engineer, security, sandboxing, isolation technologies, open-source contributions, developer tools, incident response, on-call rotations, reinforcement learning, model training infrastructure","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":300000,"maxValue":405000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_abff148c-cfd"},"title":"Staff Machine Learning Engineer, GenAI Platform","description":"<p>As a Staff Machine Learning Engineer on the Machine Learning Platform team, you will be a key technical leader architecting and scaling our Generative AI and LLM platform capabilities.</p>\n<p>Training and deploying foundation models places unprecedented demands on our systems. 
You will define the technical strategy and build the core infrastructure that enables machine learning engineers and researchers to seamlessly train, evaluate, and iterate on large language models at Reddit scale.</p>\n<ul>\n<li>Drive GenAI Infrastructure Strategy: Propose, design, and lead the architecture of our next-generation LLM platform, significantly advancing our capabilities to support large-scale foundation models that serve millions of redditors.</li>\n<li>Design Resilient, Large-Scale Distributed Systems: Architect highly fault-tolerant training infrastructure capable of supporting multi-week, distributed workloads across massive GPU clusters.</li>\n<li>Build Self-Serve LLM Workflows: Design and implement robust, production-grade pipelines for LLM fine-tuning (e.g., SFT, RLHF/DPO).</li>\n<li>Develop Comprehensive Evaluation &amp; Benchmarking Infrastructure: Treat model evaluation as a first-class platform capability.</li>\n<li>Architect Advanced Data Ingestion Pipelines: Extend our distributed data platforms to natively and efficiently handle the massive, multimodal datasets (text, image, video) required for modern GenAI workloads.</li>\n</ul>\n<p>You will have 10+ years of work experience in a production software development environment or building complex distributed data systems, plus a degree in ML, Engineering, Computer Science, or a related discipline.</p>\n<p>GenAI/LLM Infrastructure Expertise: Proven track record of designing and operating large-scale ML systems, specifically working with distributed training frameworks (e.g., FSDP, DeepSpeed, Megatron-LM) and LLM serving/inference optimization (e.g., vLLM, TensorRT-LLM).</p>\n<p>Distributed Systems Mastery: Hands-on experience managing fault-tolerant, petabyte-scale distributed systems and multi-node/multi-GPU training clusters.</p>\n<p>Advanced MLOps Knowledge: Deep understanding of modern ML orchestration, fine-tuning pipelines, and model evaluation methodologies.</p>\n<p>GPU Experience: 
Hands-on practice with CUDA environments, GPU virtualization/containerization, and doing it all within Kubernetes.</p>\n<p>Production Engineering Fundamentals: Hands-on experience with Kubernetes, Docker, and building production-quality, object-oriented code in Python and/or Go.</p>\n<p>Strong focus on scalability, reliability, performance, and ease of use.</p>\n<p>You are an undying advocate for platform users and have a deep intuition for the machine learning development lifecycle.</p>\n<p>Strong organizational &amp; communication skills.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_abff148c-cfd","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Reddit","sameAs":"https://www.redditinc.com","logo":"https://logos.yubhub.co/redditinc.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/reddit/jobs/7772523","x-work-arrangement":"remote","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$253,300-$354,600 USD","x-skills-required":["GenAI/LLM Infrastructure Expertise","Distributed Systems Mastery","Advanced MLOps Knowledge","GPU Experience","Production Engineering Fundamentals"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:47:35.489Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Remote - United States"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"GenAI/LLM Infrastructure Expertise, Distributed Systems Mastery, Advanced MLOps Knowledge, GPU Experience, Production Engineering Fundamentals","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":253300,"maxValue":354600,"unitText":"YEAR"}}}]}