{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/automation-and-infrastructure-as-code"},"x-facet":{"type":"skill","slug":"automation-and-infrastructure-as-code","display":"Automation and infrastructure-as-code","count":1},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry an `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_8eda17fb-432"},"title":"Senior HPC Engineer","description":"<p>We are seeking a Senior HPC Engineer to join our team in a hands-on role building and evolving large-scale, high-throughput HPC and GPU platforms that underpin AI- and machine-learning-driven research.</p>\n<p>In this role, you will be part of a small, senior HPC team, taking end-to-end ownership of a significant area of the platform while collaborating closely with other subject-matter experts.</p>\n<p>You will be a systems-level engineer who is comfortable owning complex technical decisions and designing and building production infrastructure, rather than advising from the sidelines.</p>\n<p>We aim to build infrastructure that is reliable, understandable, and adaptable, and we value engineers who care about simplicity, clarity, and maintainability as much as raw performance.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Design, build, and operate large-scale, high-throughput HPC and GPU clusters (for example, tens of thousands of CPU cores and hundreds of GPUs) supporting AI and machine-learning workloads.</li>\n</ul>\n<ul>\n<li>Collaborate 
with other HPC engineers and subject-matter experts to co-design system architectures, review designs, and share knowledge.</li>\n</ul>\n<ul>\n<li>Partner with storage specialists to architect and maintain high-performance, low-latency storage solutions, including parallel or scale-out file systems.</li>\n</ul>\n<ul>\n<li>Work closely with researchers, data scientists, and engineers to understand computational needs and translate them into effective, scalable system designs.</li>\n</ul>\n<ul>\n<li>Monitor, analyze, and optimize performance across compute, scheduling, networking, and storage layers.</li>\n</ul>\n<ul>\n<li>Build and maintain automation and infrastructure-as-code for provisioning, configuration, monitoring, and lifecycle management, with an emphasis on repeatability and simplicity.</li>\n</ul>\n<ul>\n<li>Participate in design reviews, operational discussions, and post-incident reviews with a focus on learning, collaboration, and system improvement rather than blame.</li>\n</ul>\n<ul>\n<li>Explore alternative approaches to scheduling, data layout, cluster architectures, and GPU utilization through small experiments or prototypes, using data to guide decisions.</li>\n</ul>\n<ul>\n<li>Produce clear documentation, diagrams, and reusable tooling that enable others to operate, debug, and extend the platform.</li>\n</ul>\n<ul>\n<li>Stay current with advancements in HPC, GPU computing, networking, and storage, and help assess where new technologies can add real value.</li>\n</ul>\n<p>What you&#39;ll bring:</p>\n<ul>\n<li>Bachelor’s degree in Computer Science, Engineering, or a related technical field; a Master’s or PhD is a plus.</li>\n</ul>\n<ul>\n<li>Typically 7+ years of hands-on experience designing, building, and operating HPC or large-scale compute environments.</li>\n</ul>\n<ul>\n<li>Deep, practical experience with at least one major HPC scheduler (such as Slurm), including using it to operate large-scale or high-throughput clusters in 
production.</li>\n</ul>\n<ul>\n<li>Hands-on experience with GPU-accelerated computing, including NVIDIA GPUs and associated software ecosystems.</li>\n</ul>\n<ul>\n<li>Strong Linux systems engineering skills and comfort working close to the operating system, drivers, and hardware.</li>\n</ul>\n<ul>\n<li>Experience designing or operating high-performance storage systems, including parallel or scale-out file systems.</li>\n</ul>\n<ul>\n<li>Curious, evidence-driven problem solving, including experimenting with different approaches and using data to inform decisions.</li>\n</ul>\n<ul>\n<li>A collaborative working style that values listening, respectful discussion, and incorporating different perspectives, whether you are quieter and more reflective or more vocal in group settings.</li>\n</ul>\n<ul>\n<li>Clear written and verbal communication skills, and an ability to explain complex ideas in a way that works for different audiences.</li>\n</ul>\n<ul>\n<li>A strong sense of ownership of outcomes, paired with openness to feedback, learning, and evolving systems over time.</li>\n</ul>\n<p>Additional experience that may be helpful:</p>\n<ul>\n<li>Experience with Kubernetes, Run:ai, or other workload orchestration platforms alongside traditional HPC schedulers.</li>\n</ul>\n<ul>\n<li>Familiarity with Lustre, GPFS / Spectrum Scale, or similar high-performance storage technologies.</li>\n</ul>\n<ul>\n<li>Exposure to cloud-based HPC environments (e.g., GCP or other major cloud providers).</li>\n</ul>\n<ul>\n<li>Experience supporting quantitative research, finance, or other demanding compute-intensive workloads.</li>\n</ul>\n<ul>\n<li>Interest in applying AI or ML techniques to infrastructure (for example, optimization, anomaly detection, or predictive analysis).</li>\n</ul>\n<p>The estimated base salary range for this position is $175,000 to $250,000, which is specific to New York and may change in the future. 
Millennium pays a total compensation package which includes a base salary, discretionary performance bonus, and a comprehensive benefits package.</p>","url":"https://yubhub.co/jobs/job_8eda17fb-432","directApply":true,"hiringOrganization":{"@type":"Organization","name":"IT Infrastructure","sameAs":"https://mlp.eightfold.ai","logo":"https://logos.yubhub.co/mlp.eightfold.ai.png"},"x-apply-url":"https://mlp.eightfold.ai/careers/job/755955818333?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$175,000 to $250,000","x-skills-required":["HPC","GPU","Linux","Slurm","NVIDIA","GPU-accelerated computing","High-performance storage systems","Parallel or scale-out file systems","Automation and infrastructure-as-code","Provisioning","Configuration","Monitoring","Lifecycle management"],"x-skills-preferred":[],"datePosted":"2026-04-25T14:07:45.151Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"New York, New York, United States of America"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"HPC, GPU, Linux, Slurm, NVIDIA, GPU-accelerated computing, High-performance storage systems, Parallel or scale-out file systems, Automation and infrastructure-as-code, Provisioning, Configuration, Monitoring, Lifecycle management","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":175000,"maxValue":250000,"unitText":"YEAR"}}}]}