Senior HPC Engineer

We are seeking a Senior HPC Engineer for a hands-on role building and evolving large-scale, high-throughput HPC and GPU platforms that underpin AI- and machine-learning-driven research.

In this role, you will be part of a small, senior HPC team, taking end-to-end ownership of a significant area of the platform while collaborating closely with other subject-matter experts.

You will be a systems-level engineer who is comfortable owning complex technical decisions and designing and building production infrastructure, rather than advising from the sidelines.

We aim to build infrastructure that is reliable, understandable, and adaptable, and we value engineers who care about simplicity, clarity, and maintainability as much as raw performance.

Responsibilities:

Design, build, and operate large-scale, high-throughput HPC and GPU clusters (for example, tens of thousands of CPU cores and hundreds of GPUs) supporting AI and machine-learning workloads.

Collaborate with other HPC engineers and subject-matter experts to co-design system architectures, review designs, and share knowledge.

Partner with storage specialists to architect and maintain high-performance, low-latency storage solutions, including parallel or scale-out file systems.

Work closely with researchers, data scientists, and engineers to understand computational needs and translate them into effective, scalable system designs.

Monitor, analyze, and optimize performance across compute, scheduling, networking, and storage layers.

Build and maintain automation and infrastructure-as-code for provisioning, configuration, monitoring, and lifecycle management, with an emphasis on repeatability and simplicity.

Participate in design reviews, operational discussions, and post-incident reviews with a focus on learning, collaboration, and system improvement rather than blame.

Explore alternative approaches to scheduling, data layout, cluster architectures, and GPU utilization through small experiments or prototypes, using data to guide decisions.

Produce clear documentation, diagrams, and reusable tooling that enable others to operate, debug, and extend the platform.

Stay current with advancements in HPC, GPU computing, networking, and storage, and help assess where new technologies can add real value.

What you'll bring:

Bachelor’s degree in Computer Science, Engineering, or a related technical field; a Master’s or PhD is a plus.

Typically 7+ years of hands-on experience designing, building, and operating HPC or large-scale compute environments.

Deep, practical experience with at least one major HPC scheduler (such as Slurm), including using it to operate large-scale or high-throughput clusters in production.

Hands-on experience with GPU-accelerated computing, including NVIDIA GPUs and associated software ecosystems.

Strong Linux systems engineering skills and comfort working close to the operating system, drivers, and hardware.

Experience designing or operating high-performance storage systems, including parallel or scale-out file systems.

Curious, evidence-driven problem solving, including experimenting with different approaches and using data to inform decisions.

A collaborative working style that values listening, respectful discussion, and incorporating different perspectives, whether you are more quiet and reflective or more vocal in group settings.

Clear written and verbal communication skills, and an ability to explain complex ideas in a way that works for different audiences.

A strong sense of ownership for outcomes, paired with openness to feedback, learning, and evolving systems over time.

Additional experience that may be helpful:

Experience with Kubernetes, Run:ai, or other workload orchestration platforms alongside traditional HPC schedulers.

Familiarity with Lustre, GPFS / Spectrum Scale, or similar high-performance storage technologies.

Exposure to cloud-based HPC environments (e.g., GCP or other major cloud providers).

Experience supporting quantitative research, finance, or other demanding compute-intensive workloads.

Interest in applying AI or ML techniques to infrastructure (for example, optimization, anomaly detection, or predictive analysis).

The estimated base salary range for this position is $175,000 to $250,000, which is specific to New York and may change in the future. Millennium offers a total compensation package that includes a base salary, a discretionary performance bonus, and a comprehensive benefits package.

XML job scraping automation by YubHub

full-time, senior, onsite
$175,000 to $250,000
HPC, GPU, Linux, Slurm, NVIDIA, GPU-accelerated computing, High-performance storage systems, Parallel or scale-out file systems, Automation and infrastructure-as-code, Provisioning, Configuration, Monitoring, Lifecycle management
Engineering
Technology
IT Infrastructure
https://logos.yubhub.co/mlp.eightfold.ai.png
Millennium's Infrastructure organization designs, engineers, and operates a robust global computing platform supporting WorldQuant's quantitative research.
https://mlp.eightfold.ai
https://mlp.eightfold.ai/careers/job/755955818333
New York, New York, United States of America
2026-04-25

Research Scientist (Generative Modeling)

We are seeking a talented Research Scientist with a strong background in generative modeling, particularly diffusion models, to join our modeling team. This role is ideal for candidates with deep expertise in diffusion models applied to images, videos, or 3D assets and scenes.

Experience in one or more of the following areas is a strong plus: large-scale model training or research in 3D computer vision.

You will collaborate closely with researchers, engineers, and product teams to bring advanced 3D modeling and machine learning techniques into real-world applications, ensuring that our technology remains at the forefront of visual innovation. This role involves significant hands-on research and engineering work, driving projects from conceptualization through to production deployment.

Key responsibilities:

Design, implement, and train large-scale diffusion models for generating 3D worlds.

Develop and experiment with large-scale diffusion models to add novel control signals, adapt to target aesthetic preferences, or distill for efficient inference.

Collaborate closely with research and product teams to understand product requirements and translate them into effective technical roadmaps.

Contribute hands-on to all stages of model development, including data curation, experimentation, evaluation, and deployment.

Continuously explore and integrate cutting-edge research in diffusion and generative AI more broadly.

Act as a key technical resource within the team, mentoring colleagues and driving best practices in generative modeling and ML engineering.

Ideal candidate profile:

3+ years of experience in generative modeling or applied ML roles.

Extensive experience with machine learning frameworks such as PyTorch or TensorFlow, especially in the context of diffusion models and other generative models.

Deep expertise in at least one area of generative modeling.

Strong history of publications or open-source contributions involving large-scale diffusion models.

Strong coding proficiency in Python and experience with GPU-accelerated computing.

Ability to engage effectively with researchers and cross-functional teams, clearly translating complex technical ideas into actionable tasks and outcomes.

Comfortable operating within a dynamic startup environment with high levels of ambiguity, ownership, and innovation.

Nice to have:

Contributions to open-source projects in the fields of computer vision, graphics, or ML.

Familiarity with large-scale training infrastructure.

Experience integrating machine learning models into production environments.

Experience leading or being involved with the development or training of large-scale, state-of-the-art generative models.

full-time, senior, onsite
$250,000 - $325,000 base salary (good-faith estimate for San Francisco Bay Area upon hire; actual offer based on experience, skills, and qualifications)
generative modeling, diffusion models, PyTorch, TensorFlow, machine learning frameworks, large-scale model training, research in 3D computer vision, data curation, experimentation, evaluation, deployment, GPU-accelerated computing, Python, open-source contributions, large-scale training infrastructure, integrating machine learning models into production environments, leading or being involved with the development or training of large-scale, state-of-the-art generative models
Engineering
Technology
World Labs
https://logos.yubhub.co/worldlabs.ai.png
World Labs builds foundational world models that can perceive, generate, reason, and interact with the 3D world.
https://worldlabs.ai
https://job-boards.greenhouse.io/worldlabs/jobs/4089324009
San Francisco
2026-04-17