{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/training-infrastructure"},"x-facet":{"type":"skill","slug":"training-infrastructure","display":"Training Infrastructure","count":18},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_74be15a1-bce"},"title":"Software Engineer, Inference Deployment","description":"<p>Our mandate is to make inference deployment boring and unattended. We serve Claude to millions of users across GPUs, TPUs, and Trainium , and every model update must reach production safely, quickly, and without disrupting service. As a Software Engineer on the Launch Engineering team, you&#39;ll design and build the deployment infrastructure that moves inference code from merge to production.</p>\n<p>This is a resource-constrained optimization problem at its core: validation and deployment consume the same accelerator chips that serve customer traffic , your deploys compete with live user requests for the same hardware. Every model brings different fleet sizes, startup times, and correctness requirements, so the system must adapt continuously. 
You&#39;ll build systems that navigate these constraints, orchestrating validation, scheduling deployments intelligently, and driving down cycle time from merge to production.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Own deployment orchestration that continuously moves validated inference builds into production across GPU, TPU, and Trainium fleets, unattended under normal conditions</li>\n<li>Improve capacity-aware deployment scheduling to maximize deployment throughput against constrained accelerator budgets and variable fleet sizes</li>\n<li>Extend deployment observability: dashboards and tooling that answer &quot;what code is running in production,&quot; &quot;where is my commit,&quot; and &quot;what validation passed for this deploy&quot;</li>\n<li>Drive down cycle time from code merge to production with pipeline architectures that minimize serial dependencies and maximize parallelism</li>\n<li>Optimize fleet rollout strategies for large-scale deployments across thousands of GPU, TPU, and Trainium chips, minimizing disruption to serving capacity</li>\n<li>Evolve self-service model onboarding so that new models can be added to the continuous deployment pipeline without Launch Engineering involvement</li>\n<li>Partner across the Inference organization with teams owning validation, autoscaling, and model routing to integrate deployment automation with their systems</li>\n</ul>\n<p>You May Be a Good Fit If You Have:</p>\n<ul>\n<li>5+ years of experience building deployment, release, or delivery infrastructure at scale</li>\n<li>Strong software engineering skills with experience designing systems that manage complex state machines and multi-stage pipelines</li>\n<li>Experience with deployment systems where resource constraints shape the design, whether that&#39;s fleet capacity, network bandwidth, hardware availability, or coordinated rollout 
windows</li>\n<li>A track record of building automation that measurably improves deployment velocity and reliability</li>\n<li>Proficiency with Kubernetes-based deployments, rolling update mechanics, and container orchestration</li>\n<li>Comfort working across the stack, from backend services and databases to CLI tools and web UIs</li>\n<li>Strong communication skills and the ability to work closely with oncall engineers, model teams, and infrastructure partners</li>\n</ul>\n<p>Strong Candidates May Also Have:</p>\n<ul>\n<li>Experience with ML inference or training infrastructure deployment, particularly across multiple accelerator types (GPU, TPU, Trainium)</li>\n<li>Background in capacity planning or resource-constrained scheduling (e.g., bin-packing, fleet management, job scheduling with hardware affinity)</li>\n<li>Experience with progressive delivery in systems with long validation cycles: canary/soak testing, blue-green deployments, traffic shifting, automated rollback</li>\n<li>Experience at companies with large-scale release engineering challenges (mobile release trains, monorepo deployments, multi-datacenter rollouts)</li>\n<li>Experience with Python and/or Rust in production systems</li>\n</ul>\n<p>The annual compensation range for this role is $320,000-$485,000 USD.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_74be15a1-bce","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5111745008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$320,000-$485,000 USD","x-skills-required":["deployment 
infrastructure","software engineering","complex state machines","multi-stage pipelines","Kubernetes-based deployments","container orchestration","backend services","databases","CLI tools","web UIs"],"x-skills-preferred":["ML inference","training infrastructure deployment","capacity planning","resource-constrained scheduling"," deployments","progressive delivery","Python","Rust"],"datePosted":"2026-04-18T15:53:04.252Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"deployment infrastructure, software engineering, complex state machines, multi-stage pipelines, Kubernetes-based deployments, container orchestration, backend services, databases, CLI tools, web UIs, ML inference, training infrastructure deployment, capacity planning, resource-constrained scheduling,  deployments, progressive delivery, Python, Rust","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":320000,"maxValue":485000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_709b405a-48b"},"title":"Staff / Senior Software Engineer, AI Reliability","description":"<p>We&#39;re seeking a Staff / Senior Software Engineer, AI Reliability to join our team. As a key member of our AIRE (AI Reliability Engineering) team, you will partner with teams across Anthropic to improve reliability across our most critical serving paths. 
You will develop Service Level Objectives for large language model serving systems, design and implement monitoring and observability systems, assist in the design and implementation of high-availability serving infrastructure, lead incident response for critical AI services, and support the reliability of safeguard model serving.</p>\n<p>You may be a good fit for this role if you have strong distributed systems, infrastructure, or reliability backgrounds, are curious and brave, think holistically about how systems compose and where the seams are, can build lasting relationships across teams, care about users and feel ownership over outcomes, have excellent communication and collaboration skills, and bring diverse experience.</p>\n<p>Strong candidates may also have experience operating large-scale model serving or training infrastructure, experience with one or more ML hardware accelerators, understanding of ML-specific networking optimizations, expertise in AI-specific observability tools and frameworks, experience with chaos engineering and systematic resilience testing, and contributions to open-source infrastructure or ML tooling.</p>\n<p>We offer competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a lovely office space in which to collaborate with colleagues. We value impact and believe that the highest-impact AI research will be big science. We work as a single cohesive team on just a few large-scale research efforts and value communication skills.</p>\n<p>If you&#39;re interested in this role, please submit an application even if you don&#39;t believe you meet every single qualification. 
We encourage diversity and strive to include a range of diverse perspectives on our team.</p>","url":"https://yubhub.co/jobs/job_709b405a-48b","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5113224008","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$325,000-$485,000 USD","x-skills-required":["distributed systems","infrastructure","reliability","Service Level Objectives","monitoring and observability systems","high-availability serving infrastructure","incident response","safeguard model serving"],"x-skills-preferred":["large-scale model serving or training infrastructure","ML hardware accelerators","ML-specific networking optimizations","AI-specific observability tools and frameworks","chaos engineering and systematic resilience testing","open-source infrastructure or ML tooling"],"datePosted":"2026-04-18T15:52:16.313Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"distributed systems, infrastructure, reliability, Service Level Objectives, monitoring and observability systems, high-availability serving infrastructure, incident response, safeguard model serving, large-scale model serving or training infrastructure, ML hardware accelerators, ML-specific networking optimizations, AI-specific observability tools and frameworks, chaos engineering and systematic resilience testing, open-source infrastructure or ML 
tooling","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":325000,"maxValue":485000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_272bd1ad-99d"},"title":"Software Engineer, Sandboxing","description":"<p><strong>About the Role</strong></p>\n<p>Anthropic&#39;s sandboxing infrastructure enables Claude to safely execute code and interact with external systems. As we expand Claude&#39;s capabilities, the reliability, security, and developer experience of this infrastructure becomes increasingly critical. We&#39;re looking for an engineer to join the sandboxing team and help shape both the client-side library/API and the underlying infrastructure.</p>\n<p>In this role, you&#39;ll combine deep infrastructure expertise with an obsession for developer experience. You&#39;ll help maintain and evolve a system that must be correct, performant, and intuitive to use. You&#39;ll work closely with internal teams to understand their needs, burn down errors and edge cases, and build a roadmap that anticipates where the product needs to go. 
This is a role for someone who finds satisfaction in both the craft of building reliable systems and the empathy required to serve developers and researchers well.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Contribute to the client library, API surface, and underlying infrastructure for Anthropic&#39;s sandboxing system, ensuring it is reliable, well-documented, and intuitive to use</li>\n<li>Drive down error rates and improve correctness through systematic debugging, monitoring, and proactive fixes</li>\n<li>Help develop and maintain a product roadmap for sandboxing capabilities, balancing immediate needs with long-term architectural improvements</li>\n<li>Partner closely with internal teams using the sandboxing system to understand their requirements, debug issues, and build tooling that serves their use cases</li>\n<li>Respond to incidents and production issues with urgency, conducting thorough root cause analysis and implementing preventive measures</li>\n<li>Build comprehensive testing, observability, and documentation to ensure the system meets a high quality bar</li>\n<li>Collaborate across the sandboxing team, flexing between client-side and infrastructure work as needed</li>\n</ul>\n<p><strong>You May Be a Good Fit If You</strong></p>\n<ul>\n<li>Have 5+ years of software engineering experience, with meaningful time spent maintaining libraries, SDKs, or developer-facing APIs</li>\n<li>Obsess over developer experience: you&#39;ve thought deeply about API design, error propagation, documentation, and the small details that make a library feel well-crafted</li>\n<li>Have experience operating complex distributed systems</li>\n<li>Bring a track record of systematically improving reliability: you&#39;ve burned down error budgets, built monitoring, and driven issues to resolution</li>\n<li>Can develop and articulate a long-term vision for a product, translating user feedback and technical constraints into a coherent roadmap</li>\n<li>Are comfortable 
with ambiguity and can context-switch between reactive incident work and proactive product development</li>\n<li>Communicate clearly with both technical and non-technical stakeholders</li>\n</ul>\n<p><strong>Strong Candidates May Also Have</strong></p>\n<ul>\n<li>Experience as a founder or early engineer at an infrastructure-focused startup, where you owned a product end-to-end</li>\n<li>Background in security, sandboxing, or isolation technologies (containers, VMs, seccomp, namespaces, etc.)</li>\n<li>Open-source contributions in the Python ecosystem</li>\n<li>Experience building developer tools, CLIs, or platforms used by other engineers</li>\n<li>History of working on incident response and on-call rotations for production systems</li>\n<li>Exposure to reinforcement learning or model training infrastructure</li>\n</ul>\n<p><strong>Representative Projects</strong></p>\n<p>These are examples of past work that would indicate a good fit, not a description of the role itself:</p>\n<ul>\n<li>Maintaining an open source SDK through multiple major version upgrades while minimizing breaking changes for users</li>\n<li>Leading an initiative to reduce P0 incidents by XX% through improved error handling, retries, and observability</li>\n<li>Building a developer platform at a startup from zero to product-market fit, iterating based on user feedback</li>\n<li>Embedding with an internal team for a quarter to deeply understand their workflows and shipping targeted improvements to a piece of infrastructure they rely on</li>\n<li>Developing a multi-quarter roadmap for a developer tools product, balancing user requests with technical debt reduction</li>\n</ul>\n<p><strong>Logistics</strong></p>\n<ul>\n<li>Minimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience</li>\n<li>Required field of study: A field relevant to the role as demonstrated through coursework, training, or professional experience</li>\n<li>Minimum years of experience: 
Years of experience required will correlate with the internal job level requirements for the position</li>\n<li>Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</li>\n<li>Visa sponsorship: We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</li>\n</ul>","url":"https://yubhub.co/jobs/job_272bd1ad-99d","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5083039008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$300,000-$405,000 USD","x-skills-required":["software engineering","infrastructure expertise","developer experience","API design","error propagation","documentation","distributed systems","complex systems","reliability","monitoring","root cause analysis","preventive measures","testing","observability","collaboration","communication"],"x-skills-preferred":["founder","early engineer","security","sandboxing","isolation technologies","open-source contributions","developer tools","incident response","on-call rotations","reinforcement learning","model training infrastructure"],"datePosted":"2026-04-18T15:51:53.000Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"software engineering, infrastructure 
expertise, developer experience, API design, error propagation, documentation, distributed systems, complex systems, reliability, monitoring, root cause analysis, preventive measures, testing, observability, collaboration, communication, founder, early engineer, security, sandboxing, isolation technologies, open-source contributions, developer tools, incident response, on-call rotations, reinforcement learning, model training infrastructure","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":300000,"maxValue":405000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_71554e46-b64"},"title":"Senior Engineering Manager, AI Runtime","description":"<p>At Databricks, we are committed to enabling data teams to solve the world&#39;s toughest problems. As a Senior Engineering Manager, you will lead the team owning both the product experience and the foundational infrastructure of our AI Runtime (AIR) product.</p>\n<p>You will be responsible for shaping customer-facing capabilities while designing for scalability, extensibility, and performance of GPU training and adjacent areas. 
This will involve collaborating closely across the platform, product, infrastructure, and research organisations.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Leading, mentoring, and growing a high-performing engineering team responsible for the Custom Training product and its foundational infrastructure</li>\n<li>Defining and owning the product and technical roadmap for AIR, balancing customer experience, functionality, and foundational investments</li>\n<li>Collaborating closely with product, research, platform, infrastructure teams, and customers to drive end-to-end delivery</li>\n<li>Driving architectural decisions and product design for managed GPU training at scale</li>\n<li>Advocating for customer needs through direct engagement, ensuring engineering decisions translate to clear product impact</li>\n</ul>\n<p>We are looking for someone with 8+ years of software engineering experience, with 3+ years in engineering management. You should have a track record of building and operating managed GPU training infrastructure at scale, as well as deep familiarity with distributed training frameworks and parallelism strategies.</p>\n<p>In addition, you should have experience with training resilience patterns, such as checkpointing, elastic training, and automated failure recovery for long-running jobs. You should also have a strong understanding of GPU performance fundamentals, including NCCL, interconnect topologies, and memory optimisation.</p>\n<p>Experience building platform products with clear SLAs is also essential, as is strong cross-functional leadership across platform, product, and research teams. Excellent collaboration and communication skills are also required.</p>\n<p>The pay range for this role is $228,600-$314,250 USD per year, depending on location. 
The total compensation package may also include eligibility for annual performance bonus, equity, and benefits.</p>","url":"https://yubhub.co/jobs/job_71554e46-b64","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Databricks","sameAs":"https://databricks.com","logo":"https://logos.yubhub.co/databricks.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/databricks/jobs/8490282002","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$228,600-$314,250 USD per year","x-skills-required":["software engineering","engineering management","distributed training frameworks","parallelism strategies","GPU training infrastructure","checkpointing","elastic training","automated failure recovery","GPU performance fundamentals","NCCL","interconnect topologies","memory optimisation"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:45:28.312Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Mountain View, California; San Francisco, California"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"software engineering, engineering management, distributed training frameworks, parallelism strategies, GPU training infrastructure, checkpointing, elastic training, automated failure recovery, GPU performance fundamentals, NCCL, interconnect topologies, memory optimisation","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":228600,"maxValue":314250,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_9ecceef8-349"},"title":"Research Engineer/Research Scientist, Audio","description":"<p>We are seeking a Research Engineer/Research Scientist to join our Audio 
team. As a member of this team, you will work across the full stack of audio ML, developing audio codecs and representations, sourcing and synthesizing high-quality audio data, training large-scale speech language models and large audio diffusion models, and developing novel architectures for incorporating continuous signals into LLMs.</p>\n<p>Our team focuses primarily but not exclusively on speech, building advanced steerable systems spanning end-to-end conversational systems, speech and audio understanding models, and speech synthesis capabilities. The team works closely with many collaborators across pretraining, finetuning, reinforcement learning, production inference, and product to get advanced audio technologies from early research to high-impact real-world deployments.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Develop and train audio models, including conversational speech-to-speech, speech translation, speech recognition, text-to-speech, diarization, codecs, and generative audio models</li>\n<li>Work across abstraction levels, from signal processing fundamentals to large-scale model training and inference optimization</li>\n<li>Collaborate with teams across the company to develop and deploy audio technologies</li>\n<li>Communicate clearly and effectively with colleagues and stakeholders</li>\n</ul>\n<p>Strong candidates may also have experience with:</p>\n<ul>\n<li>Large language model pretraining and finetuning</li>\n<li>Training diffusion models for image and audio generation</li>\n<li>Reinforcement learning for large language models and diffusion models</li>\n<li>End-to-end system optimization, from performance benchmarking to kernel optimization</li>\n<li>GPUs, Kubernetes, PyTorch, or distributed training infrastructure</li>\n</ul>\n<p>Representative projects:</p>\n<ul>\n<li>Training state-of-the-art neural audio codecs for 48 kHz stereo audio</li>\n<li>Developing novel algorithms for diffusion pretraining and reinforcement learning</li>\n<li>Scaling 
audio datasets to millions of hours of high-quality audio</li>\n<li>Creating robust evaluation methodologies for hard-to-measure qualities such as naturalness or expressiveness</li>\n<li>Studying training dynamics of mixed audio-text language models</li>\n<li>Optimizing latency and inference throughput for deployed streaming audio systems</li>\n</ul>","url":"https://yubhub.co/jobs/job_9ecceef8-349","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5074815008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$350,000-$500,000 USD","x-skills-required":["JAX","PyTorch","large-scale distributed training","signal processing fundamentals","speech language models","audio diffusion models","continuous signals","LLMs"],"x-skills-preferred":["large language model pretraining","diffusion models","reinforcement learning","end-to-end system optimization","GPUs","Kubernetes","distributed training infrastructure"],"datePosted":"2026-04-18T15:42:59.425Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"JAX, PyTorch, large-scale distributed training, signal processing fundamentals, speech language models, audio diffusion models, continuous signals, LLMs, large language model pretraining, diffusion models, reinforcement learning, end-to-end system optimization, GPUs, Kubernetes, distributed training 
infrastructure","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":350000,"maxValue":500000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_28107212-128"},"title":"Performance Engineer, GPU","description":"<p>As a GPU Performance Engineer at Anthropic, you will be responsible for architecting and implementing the foundational systems that power Claude and push the frontiers of what&#39;s possible with large language models. You will maximize GPU utilization and performance at unprecedented scale, develop cutting-edge optimizations that directly enable new model capabilities, and dramatically improve inference efficiency.</p>\n<p>Working at the intersection of hardware and software, you will implement state-of-the-art techniques from custom kernel development to distributed system architectures. Your work will span the entire stack,from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization.</p>\n<p>Strong candidates will have a track record of delivering transformative GPU performance improvements in production ML systems and will be excited to shape the future of AI infrastructure alongside world-class researchers and engineers.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Architect and implement foundational systems that power Claude</li>\n<li>Maximize GPU utilization and performance at unprecedented scale</li>\n<li>Develop cutting-edge optimizations that directly enable new model capabilities</li>\n<li>Dramatically improve inference efficiency</li>\n<li>Implement state-of-the-art techniques from custom kernel development to distributed system architectures</li>\n<li>Work at the intersection of hardware and software</li>\n<li>Span the entire stack,from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect 
synchronization</li>\n</ul>\n<p>Requirements:</p>\n<ul>\n<li>Deep experience with GPU programming and optimization at scale</li>\n<li>Impact-driven, passionate about delivering measurable performance breakthroughs</li>\n<li>Ability to navigate complex systems from hardware interfaces to high-level ML frameworks</li>\n<li>Enjoy collaborative problem-solving and pair programming</li>\n<li>Want to work on state-of-the-art language models with real-world impact</li>\n<li>Care about the societal impacts of your work</li>\n<li>Thrive in ambiguous environments where you define the path forward</li>\n</ul>\n<p>Nice to have:</p>\n<ul>\n<li>Experience with GPU Kernel Development: CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization</li>\n<li>ML Compilers &amp; Frameworks: PyTorch/JAX internals, torch.compile, XLA, custom operators</li>\n<li>Performance Engineering: Kernel fusion, memory bandwidth optimization, profiling with Nsight</li>\n<li>Distributed Systems: NCCL, NVLink, collective communication, model parallelism</li>\n<li>Low-Precision: INT8/FP8 quantization, mixed-precision techniques</li>\n<li>Production Systems: Large-scale training infrastructure, fault tolerance, cluster orchestration</li>\n</ul>\n<p>Representative projects:</p>\n<ul>\n<li>Co-design attention mechanisms and algorithms for next-generation hardware architectures</li>\n<li>Develop custom kernels for emerging quantization formats and mixed-precision techniques</li>\n<li>Design distributed communication strategies for multi-node GPU clusters</li>\n<li>Optimize end-to-end training and inference pipelines for frontier language models</li>\n<li>Build performance modeling frameworks to predict and optimize GPU utilization</li>\n<li>Implement kernel fusion strategies to minimize memory bandwidth bottlenecks</li>\n<li>Create resilient systems for planet-scale distributed training infrastructure</li>\n<li>Profile and eliminate performance bottlenecks in production serving 
infrastructure</li>\n<li>Partner with hardware vendors to influence future accelerator capabilities and software stacks</li>\n</ul>\n<p>Note: The salary range for this position is $280,000-$850,000 USD per year.</p>","url":"https://yubhub.co/jobs/job_28107212-128","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/4926227008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$280,000-$850,000 USD per year","x-skills-required":["GPU programming","optimization at scale","CUDA","Triton","CUTLASS","Flash Attention","tensor core optimization","PyTorch/JAX internals","torch.compile","XLA","custom operators","kernel fusion","memory bandwidth optimization","profiling with Nsight","NCCL","NVLink","collective communication","model parallelism","INT8/FP8 quantization","mixed-precision techniques","large-scale training infrastructure","fault tolerance","cluster orchestration"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:40:11.758Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"GPU programming, optimization at scale, CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization, PyTorch/JAX internals, torch.compile, XLA, custom operators, kernel fusion, memory bandwidth optimization, profiling with Nsight, NCCL, NVLink, collective communication, model parallelism, INT8/FP8 quantization, mixed-precision techniques, large-scale training infrastructure, fault tolerance, cluster 
orchestration","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":280000,"maxValue":850000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_2e513a92-ec5"},"title":"Research Scientist (Generative Modeling)","description":"<p>We are seeking a talented Research Scientist with a strong background in generative modeling, particularly diffusion models, to join our modeling team. This role is ideal for candidates with deep expertise in diffusion models applied to images, videos, or 3D assets and scenes.</p>\n<p>Experience in one or more of the following areas is a strong plus: large-scale model training and research in 3D computer vision.</p>\n<p>You will collaborate closely with researchers, engineers, and product teams to bring advanced 3D modeling and machine learning techniques into real-world applications, ensuring that our technology remains at the forefront of visual innovation. 
This role involves significant hands-on research and engineering work, driving projects from conceptualization through to production deployment.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Designing, implementing, and training large-scale diffusion models for generating 3D worlds</li>\n<li>Developing and experimenting with large-scale diffusion models to add novel control signals, adapt to target aesthetic preferences, or distill for efficient inference</li>\n<li>Collaborating closely with research and product teams to understand and translate product requirements into effective technical roadmaps</li>\n<li>Contributing hands-on to all stages of model development, including data curation, experimentation, evaluation, and deployment</li>\n<li>Continuously exploring and integrating cutting-edge research in diffusion and generative AI more broadly</li>\n<li>Acting as a key technical resource within the team, mentoring colleagues, and driving best practices in generative modeling and ML engineering</li>\n</ul>\n<p>Ideal candidate profile:</p>\n<ul>\n<li>3+ years of experience in generative modeling or applied ML roles</li>\n<li>Extensive experience with machine learning frameworks such as PyTorch or TensorFlow, especially in the context of diffusion models and other generative models</li>\n<li>Deep expertise in at least one area of generative modeling</li>\n<li>Strong history of publications or open-source contributions involving large-scale diffusion models</li>\n<li>Strong coding proficiency in Python and experience with GPU-accelerated computing</li>\n<li>Ability to engage effectively with researchers and cross-functional teams, clearly translating complex technical ideas into actionable tasks and outcomes</li>\n<li>Comfort operating within a dynamic startup environment with high levels of ambiguity, ownership, and innovation</li>\n</ul>\n<p>Nice to have:</p>\n<ul>\n<li>Contributions to open-source projects in the fields of computer vision, graphics, or ML</li>\n<li>Familiarity with large-scale training infrastructure</li>\n<li>Experience integrating machine learning models into production environments</li>\n<li>Having led or been involved with the development or training of large-scale, state-of-the-art generative models</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_2e513a92-ec5","directApply":true,"hiringOrganization":{"@type":"Organization","name":"World Labs","sameAs":"https://worldlabs.ai","logo":"https://logos.yubhub.co/worldlabs.ai.png"},"x-apply-url":"https://job-boards.greenhouse.io/worldlabs/jobs/4089324009","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$250,000 - $325,000 base salary (good-faith estimate for San Francisco Bay Area upon hire; actual offer based on experience, skills, and qualifications)","x-skills-required":["generative modeling","diffusion models","PyTorch","TensorFlow","machine learning frameworks","large-scale model training","research in 3D computer vision","data curation","experimentation","evaluation","deployment","GPU-accelerated computing","Python"],"x-skills-preferred":["open-source contributions","large-scale training infrastructure","integrating machine learning models into production environments","leading or being involved with the development or training of large-scale, state-of-the-art generative models"],"datePosted":"2026-04-17T13:09:56.134Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"generative modeling, diffusion models, PyTorch, TensorFlow, machine learning frameworks, large-scale model training, research in 3D computer vision, data curation, experimentation, evaluation, deployment, GPU-accelerated computing, Python, open-source contributions, large-scale training infrastructure, integrating machine learning models into production environments, leading or being involved with the development or 
training of large-scale, state-of-the-art generative models","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":250000,"maxValue":325000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_d5b743bb-d8f"},"title":"Product Manager, AI Platforms","description":"<p>The AI Platform Product Manager will drive the strategy and execution of Shield AI&#39;s next-generation autonomy intelligence stack. This PM owns the product vision and roadmap for the Hivemind AI Platform, ensuring we can manufacture, govern, and field advanced world models, robotics foundation models, and vision-language-action systems safely and at scale.</p>\n<p>This role sits at the intersection of AI/ML, autonomy, model lifecycle, infrastructure, and product strategy. The PM partners closely with engineering, AI research, Hivemind Solutions, and field teams to deliver the tooling that enables sovereign autonomy, AI Factories at the edge, and continuous learning: capabilities that are central to Shield AI&#39;s strategic direction.</p>\n<p>This is a high-impact role for an experienced product leader excited to define how foundation models are trained, validated, governed, and deployed across thousands of autonomous systems in highly contested environments.</p>\n<p><strong>Responsibilities:</strong></p>\n<ul>\n<li>AI Model Development &amp; Training Platform</li>\n</ul>\n<p>Own the roadmap for foundation model training workflows, including dataset ingestion, curation, labeling, synthetic data generation, domain model training, and distillation pipelines. Define requirements for world models, robotics models, and VLA-based training, evaluation, and specialization. 
Lead the evolution of MLOps capabilities in Forge, including data lineage, experiment tracking, model versioning, and scalable evaluation suites.</p>\n<ul>\n<li>Data, Simulation &amp; Synthetic Data Factory</li>\n</ul>\n<p>Define product requirements for synthetic data generation, simulation-integrated data flywheels, and automated scenario generation. Partner with Digital Twin, Simulation, and autonomy teams to convert natural-language mission inputs into data needs, training procedures, and model variants.</p>\n<ul>\n<li>Safe Deployment &amp; Model Governance</li>\n</ul>\n<p>Lead the development of model governance and auditability tooling, including model cards, dataset rights, lineage tracking, safety gates, and compliance evidence. Build guardrails and workflows to safely deploy models onto edge hardware in disconnected, GPS- or comms-denied environments. Partner with Safety, Certification, Cyber, and Engineering teams to ensure traceability and evaluation pipelines meet operational and accreditation requirements.</p>\n<ul>\n<li>Edge Deployment &amp; AI Factory Integration</li>\n</ul>\n<p>Partner with Pilot, EdgeOS, and hardware teams to integrate foundation-model-based perception and reasoning into autonomy behaviors. Define requirements for distillation, quantization, and inference tooling as part of the “three-computer” development and deployment model. Ensure closed-loop workflows between cloud model training and edge-native execution.</p>\n<ul>\n<li>Cross-Functional Leadership</li>\n</ul>\n<p>Collaborate with Engineering, Research, Product, Customer Engagement, and Solutions teams to ensure model outputs meet mission and platform constraints. Translate advanced AI capabilities into intuitive workflows that platform OEMs and partner nations can use to build sovereign AI factories. 
Sequence foundational capabilities that unblock autonomy, simulation, and customer-facing product teams.</p>\n<ul>\n<li>User &amp; Customer Impact</li>\n</ul>\n<p>Develop deep empathy for ML engineers, autonomy developers, and Solutions engineers who rely on the platform. Capture operational data gaps, mission-driven model needs, and domain-specific specialization requirements. Lead demos and onboarding for model-development capabilities across internal and external teams.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_d5b743bb-d8f","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Shield AI","sameAs":"https://www.shield.ai","logo":"https://logos.yubhub.co/shield.ai.png"},"x-apply-url":"https://jobs.lever.co/shieldai/7886f437-2d5e-4616-8dcb-3dc488f1f585","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$190,000 - $290,000 a year","x-skills-required":["AI Model Development & Training Platform","Data, Simulation & Synthetic Data Factory","Safe Deployment & Model Governance","Edge Deployment & AI Factory Integration","Cross-Functional Leadership","User & Customer Impact","Strong engineering background","Deep understanding of foundation models, robotics models, multimodal models, MLOps, and training infrastructure","Experience managing complex products spanning data pipelines, cloud training clusters, model governance, and edge deployments","Proven success partnering with research teams to transition ML innovations into stable, production-grade workflows"],"x-skills-preferred":["Experience working on autonomy, robotics, embedded AI, or mission-critical systems","Hands-on familiarity with GPU infrastructure, distributed training, or data lakehouse architectures","Experience supporting defense, dual-use, or safety-critical AI systems","Background designing or operating AI 
Factory–style pipelines (data → training → evaluation → distillation → edge deployment)","Advanced degree in engineering, ML/AI, robotics, or a related field"],"datePosted":"2026-04-17T13:02:54.419Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Diego"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"AI Model Development & Training Platform, Data, Simulation & Synthetic Data Factory, Safe Deployment & Model Governance, Edge Deployment & AI Factory Integration, Cross-Functional Leadership, User & Customer Impact, Strong engineering background, Deep understanding of foundation models, robotics models, multimodal models, MLOps, and training infrastructure, Experience managing complex products spanning data pipelines, cloud training clusters, model governance, and edge deployments, Proven success partnering with research teams to transition ML innovations into stable, production-grade workflows, Experience working on autonomy, robotics, embedded AI, or mission-critical systems, Hands-on familiarity with GPU infrastructure, distributed training, or data lakehouse architectures, Experience supporting defense, dual-use, or safety-critical AI systems, Background designing or operating AI Factory–style pipelines (data → training → evaluation → distillation → edge deployment), Advanced degree in engineering, ML/AI, robotics, or a related field","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":190000,"maxValue":290000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_a7dec4aa-ad1"},"title":"Technical Program Manager","description":"<p>Saronic Technologies is seeking a Technical Program Manager (TPM) to lead the technical execution, integration, and fleet readiness of Saronic&#39;s Autonomous Surface Vessel (ASV) programs.</p>\n<p>The TPM 
serves as the technical bridge between Saronic&#39;s internal engineering and product teams and its external customers, ensuring that what we build aligns with what&#39;s delivered and operated in the field. This role combines systems-level understanding with programmatic discipline, driving technical performance, risk management, and integration readiness across customer programs.</p>\n<p>Responsibilities:</p>\n<ul>\n<li><p>Technical Execution Leadership: Lead end-to-end technical execution for assigned ASV programs, from design through field deployment.</p>\n</li>\n<li><p>Translate customer and program requirements into actionable engineering deliverables and testable configurations.</p>\n</li>\n<li><p>Maintain technical baselines across multiple builds or variants, ensuring traceability from design to delivery.</p>\n</li>\n<li><p>Drive technical risk identification, mitigation, and resolution with cross-functional teams.</p>\n</li>\n<li><p>Contribute to Engineering Change Boards (ECB) and other technical working groups at Saronic</p>\n</li>\n<li><p>Integration &amp; Field Readiness: Plan and oversee integration, testing, and on-water operations in coordination with Product, Engineering, and Production.</p>\n</li>\n<li><p>Work closely with the Fleet Integration &amp; Services Specialists to ensure consistent, safe, and well-documented field execution.</p>\n</li>\n<li><p>Ensure technical configurations are validated and approved before production and deployment.</p>\n</li>\n<li><p>Capture and communicate lessons learned from test events and field operations into future iterations.</p>\n</li>\n<li><p>Cross Functional Coordination: Partner with Product teams to align technical execution with product roadmaps and capability development.</p>\n</li>\n<li><p>Coordinate with Engineering to ensure scope, design intent, and technical priorities are aligned with program goals.</p>\n</li>\n<li><p>Interface with Production and Quality to ensure build configurations and 
acceptance criteria meet customer expectations.</p>\n</li>\n<li><p>Support Program Managers with technical inputs for customer meetings, reporting, and deliverables.</p>\n</li>\n<li><p>Risk &amp; Configuration Management: Maintain visibility into technical, operational, and programmatic risks across assigned programs.</p>\n</li>\n<li><p>Lead risk review meetings and maintain mitigation plans in collaboration with Program, Product, and Engineering leadership.</p>\n</li>\n<li><p>Enforce disciplined configuration management for vessels, payloads, and software versions.</p>\n</li>\n<li><p>Oversee technical documentation, ensuring completeness, traceability, and readiness for audits or delivery.</p>\n</li>\n<li><p>Customer &amp; Stakeholder Engagement: Serve as the technical point of contact for customer and partner engagements, bridging operational needs with product capability.</p>\n</li>\n<li><p>Participate in design reviews, field demos, and technical exchanges with customer stakeholders.</p>\n</li>\n<li><p>Deliver concise technical updates and after-action reports to leadership and customer audiences.</p>\n</li>\n<li><p>Translate real-world customer feedback into actionable improvements for Product and Engineering.</p>\n</li>\n</ul>\n<p>Qualifications:</p>\n<ul>\n<li><p>Basic Qualifications:</p>\n<ul>\n<li>Bachelor&#39;s degree in Engineering, Systems, or a related technical field.</li>\n<li>8+ years of experience in technical program management, systems integration, or defense-related development programs.</li>\n<li>Demonstrated success leading cross-functional technical teams through the design, build, and integration lifecycle.</li>\n<li>Strong understanding of mechanical, electrical, software, and autonomy integration principles.</li>\n<li>Proven experience managing risk, configuration, and readiness for complex technical maritime systems.</li>\n<li>Excellent verbal and written communication skills, with the ability to interface confidently with technical and 
non-technical audiences.</li>\n</ul>\n</li>\n<li><p>Preferred Qualifications:</p>\n<ul>\n<li>Experience managing or integrating unmanned defense, maritime, or robotic systems.</li>\n<li>Familiarity with naval readiness systems and logistics management systems.</li>\n<li>Experience building or supporting sustainment and training infrastructures for new technologies.</li>\n<li>Ability to thrive in a fast-paced, mission-driven environment with a strong growth mandate.</li>\n</ul>\n</li>\n<li><p>Key Competencies:</p>\n<ul>\n<li>Can connect component-level design decisions to overall system and mission performance.</li>\n<li>Can translate complex technical data into clear, actionable information for stakeholders.</li>\n<li>Leadership of high-performing distributed technical teams.</li>\n<li>Ability to balance execution excellence with growth, team-building, and organizational scaling.</li>\n<li>Established track record of clear, consistent, and correct communications with peers, leaders, and external stakeholders.</li>\n<li>Comfort guiding customers through ambiguous or evolving requirements and shaping actionable solutions.</li>\n</ul>\n</li>\n</ul>\n<p>Benefits:</p>\n<ul>\n<li>Medical Insurance: Comprehensive health insurance plans covering a range of services</li>\n<li>Dental and Vision Insurance: Coverage for routine dental check-ups, orthodontics, and vision care</li>\n<li>Saronic pays 99% of the premium for employees and 80% for dependents</li>\n<li>Time Off: Generous PTO and Holidays</li>\n<li>Parental Leave: Paid maternity and paternity leave to support new parents</li>\n<li>Competitive Salary: Industry-standard salaries with opportunities for performance-based bonuses</li>\n<li>Retirement Plan: 401(k) plan</li>\n<li>Stock Options: Equity options to give employees a stake in the company&#39;s success</li>\n<li>Life and Disability Insurance: Basic life insurance and short- and long-term disability coverage</li>\n<li>Additional Perks: Free lunch benefit and 
unlimited free drinks and snacks in the office</li>\n</ul>\n<p>Physical Demands:</p>\n<ul>\n<li>Prolonged periods of sitting and computer work.</li>\n<li>Occasional standing and walking within the office.</li>\n<li>Manual dexterity to operate computers and office equipment.</li>\n<li>Visual acuity to read screens and documents.</li>\n<li>Occasional reaching or lifting up to 20 pounds (e.g., equipment or supplies).</li>\n</ul>\n<p>Additional Information:</p>\n<p>This role requires access to export-controlled information or items that require &#39;U.S. Person&#39; status. As defined by U.S. law, individuals who are any one of the following are considered to be a &#39;U.S. Person&#39;: (1) U.S. citizens, (2) legal permanent residents (a.k.a. green card holders), and (3) certain protected classes of asylees and refugees, as defined in 8 U.S.C. 1324b(a)(3).</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_a7dec4aa-ad1","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Saronic Technologies","sameAs":"https://www.saronictechnologies.com/","logo":"https://logos.yubhub.co/saronictechnologies.com.png"},"x-apply-url":"https://jobs.lever.co/saronic/5bd48d5a-f655-41ea-8ea9-f91d0c05159c","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Technical Program Management","Systems Integration","Defense-Related Development Programs","Mechanical, Electrical, Software, and Autonomy Integration Principles","Risk Management","Configuration Management","Technical Documentation","Customer and Stakeholder Engagement"],"x-skills-preferred":["Unmanned Defense, Maritime, or Robotic Systems","Naval Readiness Systems and Logistics Management Systems","Sustainment and Training Infrastructures for New 
Technologies"],"datePosted":"2026-04-17T12:56:56.786Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Austin"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Technical Program Management, Systems Integration, Defense-Related Development Programs, Mechanical, Electrical, Software, and Autonomy Integration Principles, Risk Management, Configuration Management, Technical Documentation, Customer and Stakeholder Engagement, Unmanned Defense, Maritime, or Robotic Systems, Naval Readiness Systems and Logistics Management Systems, Sustainment and Training Infrastructures for New Technologies"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_78a9b8f2-81c"},"title":"Senior Software Engineer - Data Infrastructure","description":"<p>We believe that the way people interact with their finances will drastically improve in the next few years. We&#39;re dedicated to empowering this transformation by building the tools and experiences that thousands of developers use to create their own products.</p>\n<p>Plaid powers the tools millions of people rely on to live a healthier financial life. We work with thousands of companies like Venmo, SoFi, several of the Fortune 500, and many of the largest banks to make it easy for people to connect their financial accounts to the apps and services they want to use.</p>\n<p>Making data driven decisions is key to Plaid&#39;s culture. To support that, we need to scale our data systems while maintaining correct and complete data. 
We provide tooling and guidance to teams across engineering, product, and business and help them explore our data quickly and safely to get the data insights they need, which ultimately helps Plaid serve our customers more effectively.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Contribute towards the long-term technical roadmap for data-driven and machine learning iteration at Plaid</li>\n<li>Leading key data infrastructure projects such as improving ML development golden paths, implementing offline streaming solutions for data freshness, building net new ETL pipeline infrastructure, and evolving data warehouse or data lakehouse capabilities.</li>\n<li>Working with stakeholders in other teams and functions to define technical roadmaps for key backend systems and abstractions across Plaid.</li>\n<li>Debugging, troubleshooting, and reducing operational burden for our Data Platform.</li>\n<li>Growing the team via mentorship and leadership, reviewing technical documents and code changes.</li>\n</ul>\n<p><strong>Qualifications</strong></p>\n<ul>\n<li>5+ years of software engineering experience</li>\n<li>Extensive hands-on software engineering experience, with a strong track record of delivering successful projects within the Data Infrastructure or Platform domain at similar or larger companies.</li>\n<li>Deep understanding of one of: ML Infrastructure systems, including Feature Stores, Training Infrastructure, Serving Infrastructure, and Model Monitoring OR Data Infrastructure systems, including Data Warehouses, Data Lakehouses, Apache Spark, Streaming Infrastructure, Workflow Orchestration.</li>\n<li>Strong cross-functional collaboration, communication, and project management skills, with proven ability to coordinate effectively.</li>\n<li>Proficiency in coding, testing, and system design, ensuring reliable and scalable solutions.</li>\n<li>Demonstrated leadership abilities, including experience mentoring and guiding junior 
engineers.</li>\n</ul>\n<p><strong>Additional Information</strong></p>\n<p>Our mission at Plaid is to unlock financial freedom for everyone. To support that mission, we seek to build a diverse team of driven individuals who care deeply about making the financial ecosystem more equitable.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_78a9b8f2-81c","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Plaid","sameAs":"https://plaid.com/","logo":"https://logos.yubhub.co/plaid.com.png"},"x-apply-url":"https://jobs.lever.co/plaid/05b0ae3f-ec60-48d6-ae27-1bd89d928c47","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$190,800-$286,800 per year","x-skills-required":["ML Infrastructure systems","Data Infrastructure systems","Apache Spark","Streaming Infrastructure","Workflow Orchestration","Feature Stores","Training Infrastructure","Serving Infrastructure","Model Monitoring","Data Warehouses","Data Lakehouses"],"x-skills-preferred":[],"datePosted":"2026-04-17T12:51:58.720Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Finance","skills":"ML Infrastructure systems, Data Infrastructure systems, Apache Spark, Streaming Infrastructure, Workflow Orchestration, Feature Stores, Training Infrastructure, Serving Infrastructure, Model Monitoring, Data Warehouses, Data Lakehouses","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":190800,"maxValue":286800,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_4396bfcf-940"},"title":"Software Engineer, Sandboxing","description":"<p><strong>About 
Anthropic</strong></p>\n<p>Anthropic&#39;s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</p>\n<p><strong>About the Role</strong></p>\n<p>Anthropic&#39;s sandboxing infrastructure enables Claude to safely execute code and interact with external systems. As we expand Claude&#39;s capabilities, the reliability, security, and developer experience of this infrastructure become increasingly critical. We&#39;re looking for an engineer to join the sandboxing team and help shape both the client-side library/API and the underlying infrastructure.</p>\n<p>In this role, you&#39;ll combine deep infrastructure expertise with an obsession for developer experience. You&#39;ll help maintain and evolve a system that must be correct, performant, and intuitive to use. You&#39;ll work closely with internal teams to understand their needs, burn down errors and edge cases, and build a roadmap that anticipates where the product needs to go. 
This is a role for someone who finds satisfaction in both the craft of building reliable systems and the empathy required to serve developers and researchers well.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Contribute to the client library, API surface, and underlying infrastructure for Anthropic&#39;s sandboxing system, ensuring it is reliable, well-documented, and intuitive to use</li>\n</ul>\n<ul>\n<li>Drive down error rates and improve correctness through systematic debugging, monitoring, and proactive fixes</li>\n</ul>\n<ul>\n<li>Help develop and maintain a product roadmap for sandboxing capabilities, balancing immediate needs with long-term architectural improvements</li>\n</ul>\n<ul>\n<li>Partner closely with internal teams using the sandboxing system to understand their requirements, debug issues, and build tooling that serves their use cases</li>\n</ul>\n<ul>\n<li>Respond to incidents and production issues with urgency, conducting thorough root cause analysis and implementing preventive measures</li>\n</ul>\n<ul>\n<li>Build comprehensive testing, observability, and documentation to ensure the system meets a high quality bar</li>\n</ul>\n<ul>\n<li>Collaborate across the sandboxing team, flexing between client-side and infrastructure work as needed</li>\n</ul>\n<p><strong>You May Be a Good Fit If You</strong></p>\n<ul>\n<li>Have 5+ years of software engineering experience, with meaningful time spent maintaining libraries, SDKs, or developer-facing APIs</li>\n</ul>\n<ul>\n<li>Obsess over developer experience—you&#39;ve thought deeply about API design, error propagation, documentation, and the small details that make a library feel well-crafted</li>\n</ul>\n<ul>\n<li>Have experience operating complex distributed systems</li>\n</ul>\n<ul>\n<li>Bring a track record of systematically improving reliability—you&#39;ve burned down error budgets, built monitoring, and driven issues to resolution</li>\n</ul>\n<ul>\n<li>Can develop and articulate a 
long-term vision for a product, translating user feedback and technical constraints into a coherent roadmap</li>\n</ul>\n<ul>\n<li>Are comfortable with ambiguity and can context-switch between reactive incident work and proactive product development</li>\n</ul>\n<ul>\n<li>Communicate clearly with both technical and non-technical stakeholders</li>\n</ul>\n<p><strong>Strong Candidates May Also Have</strong></p>\n<ul>\n<li>Experience as a founder or early engineer at an infrastructure-focused startup, where you owned a product end-to-end</li>\n</ul>\n<ul>\n<li>Background in security, sandboxing, or isolation technologies (containers, VMs, seccomp, namespaces, etc.)</li>\n</ul>\n<ul>\n<li>Open-source contributions in the Python ecosystem</li>\n</ul>\n<ul>\n<li>Experience building developer tools, CLIs, or platforms used by other engineers</li>\n</ul>\n<ul>\n<li>History of working on incident response and on-call rotations for production systems</li>\n</ul>\n<ul>\n<li>Exposure to reinforcement learning or model training infrastructure</li>\n</ul>\n<p><strong>Representative Projects</strong></p>\n<p>These are examples of past work that would indicate a good fit—not a description of the role itself:</p>\n<ul>\n<li>Maintaining an open source SDK through multiple major version upgrades while minimizing breaking changes for users</li>\n</ul>\n<ul>\n<li>Leading an initiative to reduce P0 incidents by XX% through improved error handling, retries, and observability</li>\n</ul>\n<ul>\n<li>Building a developer platform at a startup from zero to product-market fit, iterating based on user feedback</li>\n</ul>\n<ul>\n<li>Embedding with an internal team for a quarter to deeply understand their workflows and shipping targeted improvements to a piece of infrastructure they rely on</li>\n</ul>\n<ul>\n<li>Developing a multi-quarter roadmap for a developer tools product, balancing user requests with technical debt 
reduction</li>\n</ul>\n<p><strong>Logistics</strong></p>\n<p><strong>Education requirements:</strong> We require at least a Bachelor&#39;s degree in a related field or equivalent experience. <strong>Location-based hybrid policy:</strong> Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</p>\n<p><strong>Visa sponsorship:</strong> We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</p>\n<p><strong>We encourage you to apply even if you do not believe you meet every single qualification.</strong> Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you&#39;re interested in this work. 
We think AI systems like the ones we&#39;re building can have a huge impact on society, and we want to make sure that the people building them are representative of the people they&#39;ll be serving.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_4396bfcf-940","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://job-boards.greenhouse.io","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5083039008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$300,000 - $405,000USD","x-skills-required":["software engineering","API design","error propagation","documentation","complex distributed systems","reliability","observability","testing","security","sandboxing","isolation technologies","containers","VMs","seccomp","namespaces","Python ecosystem","developer tools","CLIs","platforms","incident response","on-call rotations","reinforcement learning","model training infrastructure"],"x-skills-preferred":["founder","early engineer","infrastructure-focused startup","open-source contributions","developer platform","product-market fit","user feedback","incident response","on-call rotations","reinforcement learning","model training infrastructure"],"datePosted":"2026-03-08T14:03:30.986Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"software engineering, API design, error propagation, documentation, complex distributed systems, reliability, observability, testing, security, sandboxing, isolation technologies, containers, VMs, seccomp, namespaces, Python ecosystem, developer tools, CLIs, platforms, incident response, 
on-call rotations, reinforcement learning, model training infrastructure, founder, early engineer, infrastructure-focused startup, open-source contributions, developer platform, product-market fit, user feedback, incident response, on-call rotations, reinforcement learning, model training infrastructure","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":300000,"maxValue":405000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_7b2b97d5-0a1"},"title":"Software Engineer, Inference Deployment","description":"<p><strong>About Anthropic</strong></p>\n<p>Anthropic&#39;s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</p>\n<p><strong>About the Role</strong></p>\n<p>Our mandate is to make inference deployment boring and unattended.</p>\n<p>Anthropic serves Claude to millions of users across GPUs, TPUs, and Trainium — and every model update must reach production safely, quickly, and without disrupting service. We&#39;re building the systems that make inference deployment continuous and unattended.</p>\n<p>As a Software Engineer on the Launch Engineering team, you&#39;ll design and build the deployment infrastructure that moves inference code from merge to production. This is a resource-constrained optimization problem at its core: validation and deployment consume the same accelerator chips that serve customer traffic — your deploys compete with live user requests for the same hardware. Every model brings different fleet sizes, startup times, and correctness requirements, so the system must adapt continuously. 
You&#39;ll build systems that navigate these constraints — orchestrating validation, scheduling deployments intelligently, and driving down cycle time from merge to production.</p>\n<p>If you&#39;ve built deployment systems at scale and gravitate toward the hardest problems at the intersection of automation and resource management, this team will give you an outsized scope to work on them.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li><strong>Own deployment orchestration</strong> that continuously moves validated inference builds into production across GPU, TPU, and Trainium fleets, unattended under normal conditions</li>\n<li><strong>Improve capacity-aware deployment scheduling</strong> to maximize deployment throughput against constrained accelerator budgets and variable fleet sizes</li>\n<li><strong>Extend deployment observability</strong> — dashboards and tooling that answer &quot;what code is running in production,&quot; &quot;where is my commit,&quot; and &quot;what validation passed for this deploy&quot;</li>\n<li><strong>Drive down cycle time</strong> from code merge to production with pipeline architectures that minimize serial dependencies and maximize parallelism</li>\n<li><strong>Optimize fleet rollout strategies</strong> for large-scale deployments across thousands of GPU, TPU, and Trainium chips, minimizing disruption to serving capacity</li>\n<li><strong>Evolve self-service model onboarding</strong> so that new models can be added to the continuous deployment pipeline without Launch Engineering involvement</li>\n<li><strong>Partner across the Inference organization</strong> with teams owning validation, autoscaling, and model routing to integrate deployment automation with their systems</li>\n</ul>\n<p><strong>You May Be a Good Fit If You Have</strong></p>\n<ul>\n<li>5+ years of experience building deployment, release, or delivery infrastructure at scale</li>\n<li>Strong software engineering skills with experience designing systems that 
manage complex state machines and multi-stage pipelines</li>\n<li>Experience with deployment systems where resource constraints shape the design — whether that&#39;s fleet capacity, network bandwidth, hardware availability, or coordinated rollout windows</li>\n<li>A track record of building automation that measurably improves deployment velocity and reliability</li>\n<li>Proficiency with Kubernetes-based deployments, rolling update mechanics, and container orchestration</li>\n<li>Comfort working across the stack — from backend services and databases to CLI tools and web UIs</li>\n<li>Strong communication skills and the ability to work closely with oncall engineers, model teams, and infrastructure partners</li>\n</ul>\n<p><strong>Strong Candidates May Also Have</strong></p>\n<ul>\n<li>Experience with ML inference or training infrastructure deployment, particularly across multiple accelerator types (GPU, TPU, Trainium)</li>\n<li>Background in capacity planning or resource-constrained scheduling (e.g., bin-packing, fleet management, job scheduling with hardware affinity)</li>\n<li>Experience with progressive delivery in systems with long validation cycles: canary/soak testing, blue-green deployments, traffic shifting, automated rollback</li>\n<li>Experience at companies with large-scale release engineering challenges (mobile release trains, monorepo deployments, multi-datacenter rollouts)</li>\n<li>Experience with Python and/or Rust in production systems</li>\n</ul>\n<p><strong>Logistics</strong></p>\n<p><strong>Education requirements:</strong> We require at least a Bachelor&#39;s degree in a related field or equivalent experience. <strong>Location-based hybrid policy:</strong> Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</p>\n<p><strong>Visa sponsorship:</strong> We do sponsor visas! 
However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</p>\n<p><strong>We encourage you to apply even if you do not believe you meet every single qualification.</strong> Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you&#39;re interested in this work.</p>","url":"https://yubhub.co/jobs/job_7b2b97d5-0a1","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://job-boards.greenhouse.io","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5111745008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$320,000 - $485,000USD","x-skills-required":["deployment","release","delivery","infrastructure","Kubernetes","container","orchestration","pipelines","state machines","multi-stage","pipelines","parallelism","optimization","resource management","automation","velocity","reliability","communication","collaboration","oncall","model teams","infrastructure partners"],"x-skills-preferred":["ML inference","training infrastructure","capacity planning","resource-constrained scheduling","bin-packing","fleet management","job scheduling","hardware affinity","progressive delivery","canary/soak testing","blue-green deployments","traffic shifting","automated rollback","mobile release trains","monorepo deployments","multi-datacenter 
rollouts","Python","Rust"],"datePosted":"2026-03-08T13:54:19.012Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"deployment, release, delivery, infrastructure, Kubernetes, container, orchestration, pipelines, state machines, multi-stage, pipelines, parallelism, optimization, resource management, automation, velocity, reliability, communication, collaboration, oncall, model teams, infrastructure partners, ML inference, training infrastructure, capacity planning, resource-constrained scheduling, bin-packing, fleet management, job scheduling, hardware affinity, progressive delivery, canary/soak testing, blue-green deployments, traffic shifting, automated rollback, mobile release trains, monorepo deployments, multi-datacenter rollouts, Python, Rust","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":320000,"maxValue":485000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_73ff6f07-c0e"},"title":"Staff Software Engineer, AI Reliability Engineering","description":"<p><strong>About Anthropic</strong></p>\n<p>Anthropic&#39;s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</p>\n<p><strong>About the Role</strong></p>\n<p>Claude has your back. AIRE has Claude&#39;s. 
Help us keep Claude reliable for everyone who depends on it.</p>\n<p>AIRE (AI Reliability Engineering) partners with teams across Anthropic to improve reliability across our most critical serving paths -- every hop from the SDK through our network, API layers, serving infrastructure, and accelerators and back. We jump into the trenches alongside partner teams to make the systems that deliver Claude more robust and resilient, be it during an incident or collaborating on projects.</p>\n<p>Reliability here is an emergent phenomenon that transcends any single team&#39;s boundaries, so someone has to zoom out and look at the whole picture. That&#39;s us -- and it means few teams at Anthropic offer this kind of dynamic, cross-cutting exposure to the systems that matter most.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity.</li>\n<li>Design and implement monitoring and observability systems across the token path.</li>\n<li>Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers</li>\n<li>Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.</li>\n<li>Support the reliability of safeguard model serving -- critical for both site reliability and Anthropic&#39;s safety commitments.</li>\n</ul>\n<p><strong>You may be a good fit if you</strong></p>\n<ul>\n<li>Have strong distributed systems, infrastructure, or reliability backgrounds -- we&#39;re looking for reliability-minded software engineers and SREs.</li>\n<li>Are curious and brave -- comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don&#39;t have deep expertise yet.</li>\n<li>Think holistically about how systems compose and where the seams are.</li>\n<li>Can build 
lasting relationships across teams -- our engagement model depends on being welcomed as teammates, not outsiders with opinions.</li>\n<li>Care about users and feel ownership over outcomes, even for systems you don&#39;t own.</li>\n<li>Have excellent communication and collaboration skills -- you&#39;ll be partnering across the entire company.</li>\n<li>Bring diverse experience -- the team&#39;s strength comes from people who&#39;ve built product stacks, scaled databases, run massive distributed systems, and everything in between.</li>\n</ul>\n<p><strong>Strong candidates may also</strong></p>\n<ul>\n<li>Have been an SRE, Production Engineer, or in similar reliability-focused roles on large scale systems</li>\n<li>Have experience operating large-scale model serving or training infrastructure (&gt;1000 GPUs).</li>\n<li>Have experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium).</li>\n<li>Understand ML-specific networking optimizations like RDMA and InfiniBand.</li>\n<li>Have expertise in AI-specific observability tools and frameworks.</li>\n<li>Have experience with chaos engineering and systematic resilience testing.</li>\n<li>Have contributed to open-source infrastructure or ML tooling.</li>\n</ul>\n<p><strong>Logistics</strong></p>\n<p><strong>Education requirements:</strong> We require at least a Bachelor&#39;s degree in a related field or equivalent experience. <strong>Location-based hybrid policy:</strong> Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</p>\n<p><strong>Visa sponsorship</strong></p>\n<p>We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. 
But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</p>\n<p><strong>We encourage you to apply even if you do not believe you meet every single qualification.</strong></p>\n<p>Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you&#39;re interested in this work.</p>\n<p><strong>Your safety matters to us.</strong></p>\n<p>To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses. In some cases, we may partner with vetted recruiting agencies who will identify themselves as working on behalf of Anthropic. Be cautious of emails from other domains. Legitimate Anthropic recruiters will never ask for money, fees, or banking information before your first day. If you&#39;re ever unsure about a communication, don&#39;t click any links—visit anthropic.com/careers directly for confirmed position openings.</p>\n<p><strong>How we&#39;re different</strong></p>\n<p>We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact — advancing our long-term goals of steerable, trustworthy AI — rather than work on smaller and more specific puzzles. 
We view AI research as an empirical science, which has as much in common with physics and engineering as it does with computer science.</p>","url":"https://yubhub.co/jobs/job_73ff6f07-c0e","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://job-boards.greenhouse.io","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5101173008","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"£325,000 - £390,000GBP","x-skills-required":["distributed systems","infrastructure","reliability","software engineering","SRE","large scale systems","model serving","training infrastructure","ML hardware accelerators","RDMA","InfiniBand","AI-specific observability tools","chaos engineering","resilience testing","open-source infrastructure","ML tooling"],"x-skills-preferred":["SRE","Production Engineer","reliability-focused roles","ML hardware accelerators","RDMA","InfiniBand","AI-specific observability tools","chaos engineering","resilience testing","open-source infrastructure","ML tooling"],"datePosted":"2026-03-08T13:51:34.354Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"London, UK"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"distributed systems, infrastructure, reliability, software engineering, SRE, large scale systems, model serving, training infrastructure, ML hardware accelerators, RDMA, InfiniBand, AI-specific observability tools, chaos engineering, resilience testing, open-source infrastructure, ML tooling, SRE, Production Engineer, reliability-focused roles, ML hardware accelerators, RDMA, InfiniBand, AI-specific observability tools, chaos engineering, resilience testing, open-source 
infrastructure, ML tooling","baseSalary":{"@type":"MonetaryAmount","currency":"GBP","value":{"@type":"QuantitativeValue","minValue":325000,"maxValue":390000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_10798a1e-9fa"},"title":"Staff Software Engineer, AI Reliability Engineering","description":"<p><strong>About Anthropic</strong></p>\n<p>Anthropic&#39;s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</p>\n<p><strong>About the Role</strong></p>\n<p>Claude has your back. AIRE has Claude&#39;s. Help us keep Claude reliable for everyone who depends on it.</p>\n<p>AIRE (AI Reliability Engineering) partners with teams across Anthropic to improve reliability across our most critical serving paths -- every hop from the SDK through our network, API layers, serving infrastructure, and accelerators and back. We jump into the trenches alongside partner teams to make the systems that deliver Claude more robust and resilient, be it during an incident or collaborating on projects.</p>\n<p>Reliability here is an emergent phenomenon that transcends any single team&#39;s boundaries, so someone has to zoom out and look at the whole picture. 
That&#39;s us -- and it means few teams at Anthropic offer this kind of dynamic, cross-cutting exposure to the systems that matter most.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity.</li>\n<li>Design and implement monitoring and observability systems across the token path.</li>\n<li>Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers</li>\n<li>Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.</li>\n<li>Support the reliability of safeguard model serving -- critical for both site reliability and Anthropic&#39;s safety commitments.</li>\n</ul>\n<p><strong>You may be a good fit if you</strong></p>\n<ul>\n<li>Have strong distributed systems, infrastructure, or reliability backgrounds -- we&#39;re looking for reliability-minded software engineers and SREs.</li>\n<li>Are curious and brave -- comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don&#39;t have deep expertise yet.</li>\n<li>Think holistically about how systems compose and where the seams are.</li>\n<li>Can build lasting relationships across teams -- our engagement model depends on being welcomed as teammates, not outsiders with opinions.</li>\n<li>Care about users and feel ownership over outcomes, even for systems you don&#39;t own.</li>\n<li>Have excellent communication and collaboration skills -- you&#39;ll be partnering across the entire company.</li>\n<li>Bring diverse experience -- the team&#39;s strength comes from people who&#39;ve built product stacks, scaled databases, run massive distributed systems, and everything in between.</li>\n</ul>\n<p><strong>Strong candidates may also</strong></p>\n<ul>\n<li>Have been an SRE, Production 
Engineer, or in similar reliability-focused roles on large scale systems</li>\n<li>Have experience operating large-scale model serving or training infrastructure (&gt;1000 GPUs).</li>\n<li>Have experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium).</li>\n<li>Understand ML-specific networking optimizations like RDMA and InfiniBand.</li>\n<li>Have expertise in AI-specific observability tools and frameworks.</li>\n<li>Have experience with chaos engineering and systematic resilience testing.</li>\n<li>Have contributed to open-source infrastructure or ML tooling.</li>\n</ul>\n<p><strong>Logistics</strong></p>\n<p><strong>Education requirements:</strong> We require at least a Bachelor&#39;s degree in a related field or equivalent experience. <strong>Location-based hybrid policy:</strong> Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</p>\n<p><strong>Salary</strong></p>\n<p>The annual compensation range for this role is €235.000 - €295.000EUR.</p>\n<p><strong>How we&#39;re different</strong></p>\n<p>We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact — advancing our long-term goals of steerable, trustworthy AI — rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and engineering as it does with computer science. 
We strive to build a team that reflects this perspective, with people from a wide range of backgrounds and disciplines.</p>","url":"https://yubhub.co/jobs/job_10798a1e-9fa","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://job-boards.greenhouse.io","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5101169008","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"€235.000 - €295.000EUR","x-skills-required":["distributed systems","infrastructure","reliability","software engineering","SRE","large scale systems","model serving","training infrastructure","ML hardware accelerators","RDMA","InfiniBand","AI-specific observability tools","chaos engineering","resilience testing","open-source infrastructure","ML tooling"],"x-skills-preferred":["communication","collaboration","diverse experience","product stacks","databases","distributed systems"],"datePosted":"2026-03-08T13:48:18.742Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Dublin"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"distributed systems, infrastructure, reliability, software engineering, SRE, large scale systems, model serving, training infrastructure, ML hardware accelerators, RDMA, InfiniBand, AI-specific observability tools, chaos engineering, resilience testing, open-source infrastructure, ML tooling, communication, collaboration, diverse experience, product stacks, databases, distributed systems"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_da726093-b19"},"title":"Research Engineer, Discovery","description":"<p><strong>About the 
Role</strong></p>\n<p>As a Research Engineer on our team, you will work end to end across the whole model stack, identifying and addressing key infra blockers on the path to scientific AGI. Strong candidates should have familiarity with elements of language model training, evaluation, and inference, and eagerness to dive in quickly and get up to speed in areas where they are not yet experts. This may include performance optimization, distributed systems, VM/sandboxing/container deployment, and large scale data pipelines.</p>\n<p><strong>Responsibilities:</strong></p>\n<ul>\n<li>Design and implement large-scale infrastructure systems to support AI scientist training, evaluation, and deployment across distributed environments</li>\n<li>Identify and resolve infrastructure bottlenecks impeding progress toward scientific capabilities</li>\n<li>Develop robust and reliable evaluation frameworks for measuring progress toward scientific AGI</li>\n<li>Build scalable and performant VM/sandboxing/container architectures to safely execute long-horizon AI tasks and scientific workflows</li>\n<li>Collaborate to translate experimental requirements into production-ready infrastructure</li>\n<li>Develop large scale data pipelines to handle advanced language model training requirements</li>\n<li>Optimize large scale training and inference pipelines for stable and efficient reinforcement learning</li>\n</ul>\n<p><strong>You may be a good fit if you:</strong></p>\n<ul>\n<li>Have 6+ years of highly relevant experience in infrastructure engineering with demonstrated expertise in large-scale distributed systems</li>\n<li>Are a strong communicator and enjoy working collaboratively</li>\n<li>Possess deep knowledge of performance optimization techniques and system architectures for high-throughput ML workloads</li>\n<li>Have experience with containerization technologies (Docker, Kubernetes) and orchestration at scale</li>\n<li>Have a proven track record of building large-scale data pipelines and 
distributed storage systems</li>\n<li>Excel at diagnosing and resolving complex infrastructure challenges in production environments</li>\n<li>Can work effectively across the full ML stack from data pipelines to performance optimization</li>\n<li>Have experience collaborating with other researchers to scale experimental ideas</li>\n<li>Thrive in fast-paced environments and can rapidly iterate from experimentation to production</li>\n</ul>\n<p><strong>Strong candidates may also have:</strong></p>\n<ul>\n<li>Experience with language model training infrastructure and distributed ML frameworks (PyTorch, JAX, etc.)</li>\n<li>Background in building infrastructure for AI research labs or large-scale ML organizations</li>\n<li>Knowledge of GPU/TPU architectures and language model inference optimization</li>\n<li>Experience with cloud platforms (AWS, GCP) at enterprise scale</li>\n<li>Familiarity with VM and container orchestration.</li>\n<li>Experience with workflow orchestration tools and experiment management systems</li>\n<li>History working with large scale reinforcement learning</li>\n<li>Comfort with large scale data pipelines (Beam, Spark, Dask, …)</li>\n</ul>\n<p><strong>Logistics</strong></p>\n<ul>\n<li>Education requirements: We require at least a Bachelor&#39;s degree in a related field or equivalent experience.</li>\n<li>Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</li>\n<li>Visa sponsorship: We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</li>\n</ul>\n<p><strong>We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. 
Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you&#39;re interested in this work.</strong></p>\n<p><strong>Your safety matters to us. To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses. In some cases, we may partner with vetted recruiting agencies who will identify themselves as working on behalf of Anthropic. Be cautious of emails from other domains. Legitimate Anthropic recruiters will never ask for money, fees, or banking information before your first day. If you&#39;re ever unsure about a communication, don&#39;t click any links—visit anthropic.com/careers directly for confirmed position openings.</strong></p>\n<p><strong>How we&#39;re different</strong></p>\n<p>We believe that the highest-impact AI research will be big science. 
At Anthropic we work as a single cohesive team on just a few large-scale projects, and we&#39;re committed to making a positive impact on the world.</p>","url":"https://yubhub.co/jobs/job_da726093-b19","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://job-boards.greenhouse.io","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/4669581008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$350,000 - $850,000 USD","x-skills-required":["infrastructure engineering","large-scale distributed systems","performance optimization","containerization technologies","orchestration at scale","data pipelines","distributed storage systems","complex infrastructure challenges","ML stack","workflow orchestration tools","experiment management systems","reinforcement learning","large scale data pipelines"],"x-skills-preferred":["language model training infrastructure","distributed ML frameworks","GPU/TPU architectures","language model inference optimization","cloud platforms","VM and container orchestration","workflow orchestration tools","experiment management systems","large scale reinforcement learning","large scale data pipelines"],"datePosted":"2026-03-08T13:46:32.661Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"infrastructure engineering, large-scale distributed systems, performance optimization, containerization technologies, orchestration at scale, data pipelines, distributed storage systems, complex infrastructure challenges, ML stack, workflow orchestration tools, experiment management systems, reinforcement learning, 
large scale data pipelines, language model training infrastructure, distributed ML frameworks, GPU/TPU architectures, language model inference optimization, cloud platforms, VM and container orchestration, large scale reinforcement learning","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":350000,"maxValue":850000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_58928a28-64d"},"title":"Research Engineer/Research Scientist, Audio","description":"<p><strong>About Anthropic</strong></p>\n<p>Anthropic&#39;s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</p>\n<p><strong>You may be a good fit if you:</strong></p>\n<ul>\n<li>Have hands-on experience with training audio models, whether that&#39;s conversational speech-to-speech, speech translation, speech recognition, text-to-speech, diarization, codecs, or generative audio models</li>\n<li>Genuinely enjoy both research and engineering work, and you&#39;d describe your ideal split as roughly 50/50 rather than heavily weighted toward one or the other</li>\n<li>Are comfortable working across abstraction levels, from signal processing fundamentals to large-scale model training and inference optimization</li>\n<li>Have deep expertise with JAX, PyTorch, or large-scale distributed training, and can debug performance issues across the full stack</li>\n<li>Thrive in fast-moving environments where the most important problem might shift as we learn more about what works</li>\n<li>Communicate clearly and collaborate effectively; audio touches 
many parts of our systems, so you&#39;ll work closely with teams across the company</li>\n<li>Are passionate about building conversational AI that feels natural, steerable, and safe</li>\n<li>Care about the societal impacts of voice AI and want to help shape how these systems are developed responsibly</li>\n</ul>\n<p><strong>Strong candidates may also have experience with:</strong></p>\n<ul>\n<li>Large language model pretraining and finetuning</li>\n<li>Training diffusion models for image and audio generation</li>\n<li>Reinforcement learning for large language models and diffusion models</li>\n<li>End-to-end system optimization, from performance benchmarking to kernel optimization</li>\n<li>GPUs, Kubernetes, PyTorch, or distributed training infrastructure</li>\n</ul>\n<p><strong>Representative projects:</strong></p>\n<ul>\n<li>Training state-of-the-art neural audio codecs for 48 kHz stereo audio</li>\n<li>Developing novel algorithms for diffusion pretraining and reinforcement learning</li>\n<li>Scaling audio datasets to millions of hours of high-quality audio</li>\n<li>Creating robust evaluation methodologies for hard-to-measure qualities such as naturalness or expressiveness</li>\n<li>Studying training dynamics of mixed audio-text language models</li>\n<li>Optimizing latency and inference throughput for deployed streaming audio systems</li>\n</ul>\n<p><strong>Logistics</strong></p>\n<p><strong>Education requirements:</strong> We require at least a Bachelor&#39;s degree in a related field or equivalent experience. <strong>Location-based hybrid policy:</strong> Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</p>\n<p><strong>Visa sponsorship:</strong> We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. 
But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</p>\n<p><strong>We encourage you to apply even if you do not believe you meet every single qualification.</strong> Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you&#39;re interested in this work.</p>\n<p><strong>Your safety matters to us.</strong> To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses. In some cases, we may partner with vetted recruiting agencies who will identify themselves as working on behalf of Anthropic. Be cautious of emails from other domains. Legitimate Anthropic recruiters will never ask for money, fees, or banking information before your first day. If you&#39;re ever unsure about a communication, don&#39;t click any links—visit anthropic.com/careers directly for confirmed position openings.</p>\n<p><strong>How we&#39;re different</strong></p>\n<p>We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. 
And we value impact — advancing our long-term goals of steerable, trustworthy AI systems that benefit society.</p>","url":"https://yubhub.co/jobs/job_58928a28-64d","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://job-boards.greenhouse.io","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5074815008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$350,000 - $500,000 USD","x-skills-required":["audio models","speech-to-speech","speech translation","speech recognition","text-to-speech","diarization","codecs","generative audio models","JAX","PyTorch","large-scale distributed training"],"x-skills-preferred":["large language model pretraining","training diffusion models","reinforcement learning","end-to-end system optimization","GPUs","Kubernetes","PyTorch","distributed training infrastructure"],"datePosted":"2026-03-08T13:46:24.550Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"audio models, speech-to-speech, speech translation, speech recognition, text-to-speech, diarization, codecs, generative audio models, JAX, PyTorch, large-scale distributed training, large language model pretraining, training diffusion models, reinforcement learning, end-to-end system optimization, GPUs, Kubernetes, PyTorch, distributed training 
infrastructure","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":350000,"maxValue":500000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_11a60d5a-f54"},"title":"Performance Engineer, GPU","description":"<p><strong>About the role:</strong></p>\n<p>Pioneering the next generation of AI requires breakthrough innovations in GPU performance and systems engineering. As a GPU Performance Engineer, you&#39;ll architect and implement the foundational systems that power Claude and push the frontiers of what&#39;s possible with large language models. You&#39;ll be responsible for maximizing GPU utilization and performance at unprecedented scale, developing cutting-edge optimizations that directly enable new model capabilities and dramatically improve inference efficiency.</p>\n<p>Working at the intersection of hardware and software, you&#39;ll implement state-of-the-art techniques from custom kernel development to distributed system architectures. 
Your work will span the entire stack—from low-level tensor core optimizations to orchestrating thousands of GPUs in perfect synchronization.</p>\n<p>Strong candidates will have a track record of delivering transformative GPU performance improvements in production ML systems and will be excited to shape the future of AI infrastructure alongside world-class researchers and engineers.</p>\n<p><strong>You might be a good fit if you:</strong></p>\n<ul>\n<li>Have deep experience with GPU programming and optimization at scale</li>\n<li>Are impact-driven, passionate about delivering measurable performance breakthroughs</li>\n<li>Can navigate complex systems from hardware interfaces to high-level ML frameworks</li>\n<li>Enjoy collaborative problem-solving and pair programming</li>\n<li>Want to work on state-of-the-art language models with real-world impact</li>\n<li>Care about the societal impacts of your work</li>\n<li>Thrive in ambiguous environments where you define the path forward</li>\n</ul>\n<p><strong>Strong candidates may also have experience with:</strong></p>\n<ul>\n<li>GPU Kernel Development: CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization</li>\n<li>ML Compilers &amp; Frameworks: PyTorch/JAX internals, torch.compile, XLA, custom operators</li>\n<li>Performance Engineering: Kernel fusion, memory bandwidth optimization, profiling with Nsight</li>\n<li>Distributed Systems: NCCL, NVLink, collective communication, model parallelism</li>\n<li>Low-Precision: INT8/FP8 quantization, mixed-precision techniques</li>\n<li>Production Systems: Large-scale training infrastructure, fault tolerance, cluster orchestration</li>\n</ul>\n<p><strong>Representative projects:</strong></p>\n<ul>\n<li>Co-design attention mechanisms and algorithms for next-generation hardware architectures</li>\n<li>Develop custom kernels for emerging quantization formats and mixed-precision techniques</li>\n<li>Design distributed communication strategies for multi-node GPU 
clusters</li>\n<li>Optimize end-to-end training and inference pipelines for frontier language models</li>\n<li>Build performance modeling frameworks to predict and optimize GPU utilization</li>\n<li>Implement kernel fusion strategies to minimize memory bandwidth bottlenecks</li>\n<li>Create resilient systems for planet-scale distributed training infrastructure</li>\n<li>Profile and eliminate performance bottlenecks in production serving infrastructure</li>\n<li>Partner with hardware vendors to influence future accelerator capabilities and software stacks</li>\n</ul>\n<p><strong>Deadline to apply:</strong> None. Applications will be reviewed on a rolling basis.</p>\n<p>The expected salary range for this position is:</p>\n<p>Annual Salary: $280,000 - $850,000 USD</p>","url":"https://yubhub.co/jobs/job_11a60d5a-f54","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://job-boards.greenhouse.io","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/4926227008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$280,000 - $850,000 USD","x-skills-required":["GPU programming","optimization at scale","custom kernel development","distributed system architectures","low-level tensor core optimizations","orchestrating thousands of GPUs","GPU kernel development","CUDA","Triton","CUTLASS","Flash Attention","tensor core optimization","ML compilers & frameworks","PyTorch/JAX internals","torch.compile","XLA","custom operators","performance engineering","kernel fusion","memory bandwidth optimization","profiling with Nsight","distributed systems","NCCL","NVLink","collective communication","model parallelism","low-precision","INT8/FP8 quantization","mixed-precision techniques","production 
systems","large-scale training infrastructure","fault tolerance","cluster orchestration"],"x-skills-preferred":["GPU programming","optimization at scale","custom kernel development","distributed system architectures","low-level tensor core optimizations","orchestrating thousands of GPUs","GPU kernel development","CUDA","Triton","CUTLASS","Flash Attention","tensor core optimization","ML compilers & frameworks","PyTorch/JAX internals","torch.compile","XLA","custom operators","performance engineering","kernel fusion","memory bandwidth optimization","profiling with Nsight","distributed systems","NCCL","NVLink","collective communication","model parallelism","low-precision","INT8/FP8 quantization","mixed-precision techniques","production systems","large-scale training infrastructure","fault tolerance","cluster orchestration"],"datePosted":"2026-03-08T13:45:05.412Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"GPU programming, optimization at scale, custom kernel development, distributed system architectures, low-level tensor core optimizations, orchestrating thousands of GPUs, GPU kernel development, CUDA, Triton, CUTLASS, Flash Attention, tensor core optimization, ML compilers & frameworks, PyTorch/JAX internals, torch.compile, XLA, custom operators, performance engineering, kernel fusion, memory bandwidth optimization, profiling with Nsight, distributed systems, NCCL, NVLink, collective communication, model parallelism, low-precision, INT8/FP8 quantization, mixed-precision techniques, production systems, large-scale training infrastructure, fault tolerance, cluster orchestration","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":280000,"maxValue":850000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_2bfc37e4-bc3"},"title":"Researcher, Pretraining Safety","description":"<p><strong>Job Posting</strong></p>\n<p><strong>Researcher, Pretraining Safety</strong></p>\n<p><strong>Location</strong></p>\n<p>San Francisco</p>\n<p><strong>Employment Type</strong></p>\n<p>Full time</p>\n<p><strong>Department</strong></p>\n<p>Safety Systems</p>\n<p><strong>Compensation</strong></p>\n<ul>\n<li>$295K – $445K • Offers Equity</li>\n</ul>\n<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. 
In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>\n<ul>\n<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>\n</ul>\n<ul>\n<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>\n</ul>\n<ul>\n<li>401(k) retirement plan with employer match</li>\n</ul>\n<ul>\n<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>\n</ul>\n<ul>\n<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>\n</ul>\n<ul>\n<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>\n</ul>\n<ul>\n<li>Mental health and wellness support</li>\n</ul>\n<ul>\n<li>Employer-paid basic life and disability coverage</li>\n</ul>\n<ul>\n<li>Annual learning and development stipend to fuel your professional growth</li>\n</ul>\n<ul>\n<li>Daily meals in our offices, and meal delivery credits as eligible</li>\n</ul>\n<ul>\n<li>Relocation support for eligible employees</li>\n</ul>\n<ul>\n<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>\n</ul>\n<p>More details about our benefits are available to candidates during the hiring process.</p>\n<p>This role is at-will and OpenAI reserves the right to modify base pay and other compensation components at any time based on individual performance, team or company results, or market conditions.</p>\n<p><strong>About the Team</strong></p>\n<p>The Safety Systems team is responsible for various 
safety work to ensure our best models can be safely deployed to the real world to benefit society, and is at the forefront of OpenAI&#39;s mission to build and deploy safe AGI, driving our commitment to AI safety and fostering a culture of trust and transparency.</p>\n<p>The Pretraining Safety team&#39;s goal is to build safer, more capable base models and enable earlier, more reliable safety evaluation during training. We aim to:</p>\n<ol>\n<li><strong>Develop upstream safety evaluations</strong> to monitor how and when unsafe behaviors and goals emerge;</li>\n<li><strong>Create safer priors</strong> through targeted pretraining and mid-training interventions that make downstream alignment more effective and efficient;</li>\n<li><strong>Design safe-by-design architectures</strong> that allow for greater controllability of model capabilities.</li>\n</ol>\n<p>In addition, we will conduct the foundational research necessary for understanding how behaviors emerge, generalize, and can be reliably measured throughout training.</p>\n<p><strong>About the Role</strong></p>\n<p>The Pretraining Safety team is pioneering how safety is built into models before they reach post-training and deployment. 
In this role, you will work throughout the full stack of model development with a focus on pre-training:</p>\n<ul>\n<li>Identify safety-relevant behaviors as they first emerge in base models</li>\n</ul>\n<ul>\n<li>Evaluate and reduce risk without waiting for full-scale training runs</li>\n</ul>\n<ul>\n<li>Design architectures and training setups that make safer behavior the default</li>\n</ul>\n<ul>\n<li>Strengthen models by incorporating richer, earlier safety signals</li>\n</ul>\n<p>We collaborate across OpenAI&#39;s safety ecosystem—from Safety Systems to Training—to ensure that safety foundations are robust, scalable, and grounded in real-world risks.</p>\n<p><strong>In this role, you will:</strong></p>\n<ul>\n<li>Develop new techniques to predict, measure, and evaluate unsafe behavior in early-stage models</li>\n</ul>\n<ul>\n<li>Design data curation strategies that improve pretraining priors and reduce downstream risk</li>\n</ul>\n<ul>\n<li>Explore safe-by-design architectures and training configurations that improve controllability</li>\n</ul>\n<ul>\n<li>Introduce novel safety-oriented loss functions, metrics, and evals into the pretraining stack</li>\n</ul>\n<ul>\n<li>Work closely with cross-functional safety teams to unify pre- and post-training risk reduction</li>\n</ul>\n<p><strong>You might thrive in this role if you:</strong></p>\n<ul>\n<li>Have experience developing or scaling pretraining architectures (LLMs, diffusion models, multimodal models, etc.)</li>\n</ul>\n<ul>\n<li>Are comfortable working with training infrastructure, data pipelines, and evaluation frameworks (e.g., Python, PyTorch/JAX, Apache Beam)</li>\n</ul>\n<ul>\n<li>Enjoy hands-on research — designing, implementing, and iterating on experiments</li>\n</ul>\n<ul>\n<li>Enjoy collaborating with diverse technical and cross-functional partners (e.g., policy, legal, training)</li>\n</ul>\n<ul>\n<li>Are data-driven with strong statistical reasoning and rigor in 
experimental design</li>\n</ul>\n<ul>\n<li>Value building clean, scalable research workflows and streamlining processes for yourself and others</li>\n</ul>\n<p><strong>About OpenAI</strong></p>\n<p>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.</p>","url":"https://yubhub.co/jobs/job_2bfc37e4-bc3","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/d829b701-5ee2-414f-8596-ef94911a168a","x-work-arrangement":"onsite","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":"$295K – $445K • Offers Equity","x-skills-required":["pretraining architectures","training infrastructure","data pipelines","evaluation frameworks","Python","PyTorch/JAX","Apache Beam","hands-on research","collaboration","data-driven","statistical reasoning"],"x-skills-preferred":["LLMs","diffusion models","multimodal models","safe-by-design architectures","training configurations","loss functions","metrics","evals"],"datePosted":"2026-03-06T18:36:25.493Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"pretraining architectures, training infrastructure, data pipelines, 
evaluation frameworks, Python, PyTorch/JAX, Apache Beam, hands-on research, collaboration, data-driven, statistical reasoning, LLMs, diffusion models, multimodal models, safe-by-design architectures, training configurations, loss functions, metrics, evals","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":295000,"maxValue":445000,"unitText":"YEAR"}}}]}