Staff Machine Learning Engineer, GenAI Platform

As a Staff Machine Learning Engineer on the Machine Learning Platform team, you will be a key technical leader architecting and scaling our Generative AI and LLM platform capabilities.

Training and deploying foundation models places unprecedented demands on our systems. You will define the technical strategy and build the core infrastructure that enables machine learning engineers and researchers to seamlessly train, evaluate, and iterate on large language models at Reddit scale.

Drive GenAI Infrastructure Strategy: Propose, design, and lead the architecture of our next-generation LLM platform, significantly advancing our capabilities to support large-scale foundation models that serve millions of redditors.
Design Resilient, Large-Scale Distributed Systems: Architect highly fault-tolerant training infrastructure capable of supporting multi-week, distributed workloads across massive GPU clusters.
Build Self-Serve LLM Workflows: Design and implement robust, production-grade pipelines for LLM fine-tuning (e.g., SFT, RLHF/DPO).
Develop Comprehensive Evaluation & Benchmarking Infrastructure: Treat model evaluation as a first-class platform capability.
Architect Advanced Data Ingestion Pipelines: Extend our distributed data platforms to natively and efficiently handle the massive, multimodal datasets (text, image, video) required for modern GenAI workloads.

You will have 10+ years of work experience in a production software development environment or building complex distributed data systems, plus a degree in ML, Engineering, Computer Science, or a related discipline.

GenAI/LLM Infrastructure Expertise: Proven track record of designing and operating large-scale ML systems, specifically working with distributed training frameworks (e.g., FSDP, DeepSpeed, Megatron-LM) and LLM serving/inference optimization (e.g., vLLM, TensorRT-LLM).

Distributed Systems Mastery: Hands-on experience managing fault-tolerant, petabyte-scale distributed systems and multi-node/multi-GPU training clusters.

Advanced MLOps Knowledge: Deep understanding of modern ML orchestration, fine-tuning pipelines, and model evaluation methodologies.

GPU Experience: Hands-on experience with CUDA environments and GPU virtualization/containerization, all running on Kubernetes.

Production Engineering Fundamentals: Hands-on experience with Kubernetes, Docker, and building production-quality, object-oriented code in Python and/or Go.

Strong focus on scalability, reliability, performance, and ease of use.

You are an undying advocate for platform users and have a deep intuition for the machine learning development lifecycle.

Strong organizational & communication skills.

XML job scraping automation by YubHub

full-time, staff, remote
$253,300-$354,600 USD
GenAI/LLM Infrastructure Expertise, Distributed Systems Mastery, Advanced MLOps Knowledge, GPU Experience, Production Engineering Fundamentals
Engineering Technology
Reddit
https://logos.yubhub.co/redditinc.com.png
Reddit is a social news and discussion website with over 121 million daily active unique visitors.
https://www.redditinc.com
https://job-boards.greenhouse.io/reddit/jobs/7772523
Remote - United States
2026-04-18

Staff Product Manager, AI Platform

At Databricks, we are building the world's best data and AI infrastructure platform. The AI Platform team builds the infrastructure that powers machine learning and AI at scale on Databricks. Our products span the full ML lifecycle, from feature engineering and model training to model serving and monitoring, enabling data and AI teams to build, deploy, and operate production ML systems with confidence.

You will join a team that ships products used by thousands of the world's most sophisticated data and AI organizations. You will drive the vision and roadmap for AI platform product areas and define how customers build, train, deploy, and monitor AI and ML systems on Databricks. You will collaborate across engineering teams to deliver an integrated and powerful path from experimentation to production.

The impact you will have:

Own the product roadmap for AI platform areas, defining what we build, why, and in what order, to accelerate customer adoption of AI and ML in production.
Drive strategy for key AI platform capabilities, shaping how enterprises operationalize AI at scale.
Partner closely with engineering teams to make deeply technical decisions about ML infrastructure, from distributed training architectures to real-time serving systems.
Represent the voice of the customer by engaging directly with enterprise ML teams, translating their pain points and workflows into platform capabilities that simplify the path to production AI.
Collaborate with GTM, Solutions Architecture, and Customer Success teams to drive enterprise adoption, shape field enablement, and inform competitive positioning.
Define pricing, packaging, and commercialization strategy for AI platform features, working with business teams to maximize value capture.
Grow end-user engagement with Databricks AI tools by identifying adoption bottlenecks and partnering cross-functionally to remove them.


full-time, staff, remote
$172,600-$237,325 USD
Product Management, AI Platform, Machine Learning, Data Science, Cloud Services, ML/AI Infrastructure, Distributed Training Architectures, Real-Time Serving Systems, Recommendation Systems, Feature Stores, Vector Search, LLM Infrastructure
Engineering Technology
Databricks
https://logos.yubhub.co/databricks.com.png
Databricks is a data and AI company that provides a unified platform for data and AI workloads. It was founded by the original creators of Apache Spark, Delta Lake, and MLflow.
https://databricks.com
https://job-boards.greenhouse.io/databricks/jobs/8427940002
Seattle, Washington
2026-04-18