<?xml version="1.0" encoding="UTF-8"?>
<source>
  <jobs>
    <job>
      <externalid>859cb1cf-b9c</externalid>
      <Title>Senior AI Infrastructure Engineer, Model Serving Platform</Title>
      <Description><![CDATA[<p>As a Senior AI Infrastructure Engineer on the Model Serving Platform team, you will design and build platforms for scalable, reliable, and efficient serving of Large Language Models (LLMs). Our platform powers cutting-edge research and production systems, supporting both internal and external use cases across various environments.</p>
<p>The ideal candidate combines strong ML fundamentals with deep expertise in backend system design. You’ll work in a highly collaborative environment, bridging research and engineering to deliver seamless experiences to our customers and accelerate innovation across the company.</p>
<p>Responsibilities:</p>
<ul>
<li>Build and maintain fault-tolerant, high-performance systems for serving LLM workloads at scale.</li>
<li>Build an internal platform to empower LLM capability discovery.</li>
<li>Collaborate with researchers and engineers to integrate and optimize models for production and research use cases.</li>
<li>Conduct architecture and design reviews to uphold best practices in system design and scalability.</li>
<li>Develop monitoring and observability solutions to ensure system health and performance.</li>
<li>Lead projects end-to-end, from requirements gathering to implementation, in a cross-functional environment.</li>
</ul>
<p>Ideally you’d have:</p>
<ul>
<li>5+ years of experience building large-scale, high-performance backend systems.</li>
<li>Strong programming skills in one or more languages (e.g., Python, Go, Rust, C++).</li>
<li>Experience with LLM serving and routing fundamentals (e.g., rate limiting, token streaming, load balancing, budgets).</li>
<li>Experience with LLM capabilities and concepts such as reasoning, tool calling, prompt templates, etc.</li>
<li>Experience with containers and orchestration tools (e.g., Docker, Kubernetes).</li>
<li>Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform).</li>
<li>Proven ability to solve complex problems and work independently in fast-moving environments.</li>
</ul>
<p>Nice to haves:</p>
<ul>
<li>Experience with modern LLM serving frameworks such as vLLM, SGLang, TensorRT-LLM, or text-generation-inference.</li>
</ul>
<p>Compensation packages at Scale for eligible roles include base salary, equity, and benefits. The range displayed on each job posting reflects the minimum and maximum target for new hire salaries for the position, determined by work location and additional factors, including job-related skills, experience, interview performance, and relevant education or training. Scale employees in eligible roles are also granted equity based compensation, subject to Board of Director approval. Your recruiter can share more about the specific salary range for your preferred location during the hiring process, and confirm whether the hired role will be eligible for equity grant. You’ll also receive benefits including, but not limited to: Comprehensive health, dental and vision coverage, retirement benefits, a learning and development stipend, and generous PTO. Additionally, this role may be eligible for additional benefits such as a commuter stipend.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$216,000-$270,000 USD</Salaryrange>
      <Skills>Python, Go, Rust, C++, Docker, Kubernetes, AWS, GCP, Terraform, vLLM, SGLang, TensorRT-LLM, text-generation-inference</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Scale</Employername>
      <Employerlogo>https://logos.yubhub.co/scale.com.png</Employerlogo>
      <Employerdescription>Scale develops reliable AI systems for the world&apos;s most important decisions, providing high-quality data and full-stack technologies to power leading models.</Employerdescription>
      <Employerwebsite>https://scale.com/</Employerwebsite>
      <Compensationcurrency>USD</Compensationcurrency>
      <Compensationmin>216000</Compensationmin>
      <Compensationmax>270000</Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/scaleai/jobs/4520320005</Applyto>
      <Location>San Francisco, CA; New York, NY</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>a45e2e8c-400</externalid>
      <Title>Staff Software Engineer, Foundational Model Serving</Title>
      <Description><![CDATA[<p>At Databricks, we are enabling data teams to solve the world&#39;s toughest problems by building and running the world&#39;s best data and AI infrastructure platform. Foundation Model Serving is the API Product for hosting and serving frontier AI model inference for open source models like Llama, Qwen, and GPT OSS as well as proprietary models like Claude and OpenAI GPT.</p>
<p>We&#39;re looking for engineers who have owned high-scale, operationally sensitive systems such as customer-facing APIs, edge gateways, ML inference, or similar services, and who are interested in going deep on building LLM APIs and runtimes at scale. As a Staff Engineer, you&#39;ll play a critical role in shaping both the product experience and core infrastructure.</p>
<p>The impact you will have:</p>
<ul>
<li>Design and implement core systems and APIs that power Databricks Foundation Model Serving, ensuring scalability, reliability, and operational excellence.</li>
<li>Partner with product and engineering leadership to define the technical roadmap and long-term architecture for serving workloads.</li>
<li>Drive architectural decisions and trade-offs to optimize performance, throughput, autoscaling, and operational efficiency for GPU serving workloads.</li>
<li>Contribute directly to key components across the serving infrastructure, from working in systems like vLLM and SGLang to creating token-based rate limiters and optimizers, ensuring smooth and efficient operations at scale.</li>
<li>Collaborate cross-functionally with product, platform, and research teams to translate customer needs into reliable and performant systems.</li>
<li>Establish best practices for code quality, testing, and operational readiness, and mentor other engineers through design reviews and technical guidance.</li>
<li>Represent the team in cross-organizational technical discussions and influence Databricks’ broader AI platform strategy.</li>
</ul>
<p>What we look for:</p>
<ul>
<li>10+ years of experience building and operating large-scale distributed systems.</li>
<li>Experience leading high-scale operationally sensitive backend systems.</li>
<li>A track record of up-leveling teams’ engineering excellence.</li>
<li>Strong foundation in algorithms, data structures, and system design as applied to large-scale, low-latency serving systems.</li>
<li>Proven ability to deliver technically complex, high-impact initiatives that create measurable customer or business value.</li>
<li>Strong communication skills and ability to collaborate across teams in fast-moving environments.</li>
<li>Strategic and product-oriented mindset with the ability to align technical execution with long-term vision.</li>
<li>Passion for mentoring, growing engineers, and fostering technical excellence.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>$192,000-$260,000 USD</Salaryrange>
      <Skills>large-scale distributed systems, high-scale operationally sensitive backend systems, algorithms, data structures, system design, low-latency serving systems, GPU serving workloads, vLLM, SGLang, token based rate limiters, optimizers</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Databricks</Employername>
      <Employerlogo>https://logos.yubhub.co/databricks.com.png</Employerlogo>
      <Employerdescription>Databricks is a data and AI company that provides a unified platform for data, analytics, and AI. It was founded by the original creators of Apache Spark, Delta Lake, and MLflow, and pioneered the lakehouse architecture.</Employerdescription>
      <Employerwebsite>https://databricks.com</Employerwebsite>
      <Compensationcurrency>USD</Compensationcurrency>
      <Compensationmin>192000</Compensationmin>
      <Compensationmax>260000</Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/databricks/jobs/8224683002</Applyto>
      <Location>San Francisco, California</Location>
      <Country></Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>89406e8e-f38</externalid>
      <Title>Machine Learning Engineer, Open-Source Software</Title>
      <Description><![CDATA[<p>You will be in charge of open-sourcing state-of-the-art models, whilst maintaining and improving Mistral’s publicly available libraries. Your work is critical in helping turn research breakthroughs into tangible solutions and improve Mistral&#39;s open-source ecosystem.</p>
<p>About the Open Source Software team</p>
<p>Our OSS team is embedded in our Science team and works very closely with various engineering and marketing teams. All OSS team members can fluidly move along the production / research spectrum depending on where the needs are or where their interests lie.</p>
<p>Responsibilities</p>
<ul>
<li>Release our models to open-source platforms and libraries, e.g., vLLM, GitHub, Hugging Face.</li>
<li>Maintain Mistral’s open-source libraries (mistral-common, mistral-finetune, mistral-inference).</li>
<li>Create and maintain tooling and services, both internal facing (internal research) and external facing (open-source libraries).</li>
<li>Implement and optimize open-source and internal libraries for performance and accuracy, ensuring production readiness and employing cutting-edge technology and innovative approaches.</li>
<li>Collaborate with the open-source community (PyTorch, vLLM, Hugging Face).</li>
</ul>
<p>About you</p>
<ul>
<li>Master’s degree in Computer Science, Machine Learning, Data Science, or a related field.</li>
<li>Experience contributing to popular open-source libraries such as PyTorch, TensorFlow, JAX, vLLM, Transformers, llama.cpp, ...</li>
<li>Passion for contributing to the open-source software ecosystem.</li>
<li>Expert programming skills in Python, PyTorch, and MLOps.</li>
<li>Adaptable, proactive, and autonomous.</li>
<li>Attention to detail and a drive to go the last mile to build almost perfect tools.</li>
<li>Deep understanding of machine learning approaches, especially LLMs and algorithms.</li>
<li>Low-ego, collaborative, and a real team player mindset.</li>
</ul>
<p>Now, it would be ideal if you have:</p>
<ul>
<li>Experience with training and fine-tuning large language models (e.g., distillation, supervised fine-tuning, policy optimization).</li>
<li>Experience working with Slurm.</li>
<li>Experience working with research teams.</li>
<li>Experience as a core maintainer of a popular ML open-source library.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>mid</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>Python, PyTorch, MLOps, Machine Learning, Large Language Models, Slurm, Open-source libraries, vLLM, GitHub, Hugging Face, TensorFlow, JAX, Transformers, llama.cpp</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Mistral AI</Employername>
      <Employerlogo></Employerlogo>
      <Employerdescription>Mistral AI develops high-performance, optimized, open-source and cutting-edge AI models, products and solutions for enterprise use, on-premises or in cloud environments.</Employerdescription>
      <Employerwebsite>https://mistral.ai</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.lever.co/mistral/ef4c26fc-3fdb-4dd2-a64e-95264ee769dd</Applyto>
      <Location>Paris</Location>
      <Country></Country>
      <Postedate>2026-03-10</Postedate>
    </job>
    <job>
      <externalid>290c3d28-4b2</externalid>
      <Title>Partner Solution Architect - ASEAN</Title>
      <Description><![CDATA[<p>About Mistral AI</p>
<p>At Mistral AI, we believe in the power of AI to simplify tasks, save time, and enhance learning and creativity. Our technology is designed to integrate seamlessly into daily working life.</p>
<p>We are a global company with teams distributed between France, USA, UK, Germany and Singapore. We are a diverse workforce that thrives in competitive environments and is committed to driving innovation.</p>
<p>Why This Role Matters</p>
<p>You will be the technical linchpin between Mistral and our strategic partners in ASEAN (Nvidia, Dell, Hyperscalers, Global System Integrators), translating our open-weight models and sovereign AI architecture into deployable, scalable solutions.</p>
<p>By designing joint architectures, influencing partner GTM motions, and earning a seat at the CIO/CTO table, you will accelerate Mistral’s technical credibility and deployment velocity across Asia Pacific.</p>
<p>This is a foundational role where you will define how open-weight AI is operationalized at scale in the region.</p>
<p>What You Will Do</p>
<p><strong>Partner Technical Leadership &amp; Architecture Design</strong></p>
<ul>
<li>Lead the technical design, deployment, and enablement of Mistral’s partner solutions, bridging our AI models with partner infrastructure (Nvidia, Dell, Hyperscalers, GSIs) to deliver scalable AI Labs, AI Factories, and sovereign AI architectures.</li>
</ul>
<ul>
<li>Serve as the trusted technical advisor to partner CTOs, CIOs, and engineering leaders—shaping joint architectures, guiding GPU/model deployment strategies, and accelerating GTM execution.</li>
</ul>
<ul>
<li>Design reference architectures and deployment patterns for partner-led implementations (e.g., multi-GPU inference clusters, AI Lab topologies, private AI clouds).</li>
</ul>
<ul>
<li>Innovate the Executive Briefing Center (EBC) function for technical leaders (CIOs, CTOs, CDOs), positioning Mistral as the default choice for enterprise AI.</li>
</ul>
<ul>
<li>Co-design sovereign AI reference architectures with Nvidia and Dell (H100, H200, GB200 platforms).</li>
</ul>
<p><strong>Co-Sell &amp; Revenue Enablement</strong></p>
<ul>
<li>Collaborate with Mistral’s partner and sales teams to progress deals, providing technical expertise to penetrate accounts and influence GTM pipeline.</li>
</ul>
<ul>
<li>Support partners in qualifying/disqualifying opportunities, ensuring Mistral solutions unlock maximum value for customers.</li>
</ul>
<ul>
<li>Deploy Mistral’s enterprise AI suite (models, fine-tuning, use-case building) in partner-led environments, tailoring solutions to customer requirements.</li>
</ul>
<p><strong>Trusted Advisor &amp; Lighthouse Implementations</strong></p>
<ul>
<li>Drive strategic partner-led opportunities through technical discovery, architecture design, and POC execution.</li>
</ul>
<ul>
<li>Lead lighthouse deployments that become referenceable case studies (e.g., Singtel AI Grid, Accenture AI Lab).</li>
</ul>
<ul>
<li>Establish a scalable partner enablement framework, training 100+ partner engineers across ASEAN.</li>
</ul>
<p><strong>Product Feedback &amp; Internal Collaboration</strong></p>
<ul>
<li>Coordinate with Mistral’s product and engineering teams to relay partner-specific requirements and feedback.</li>
</ul>
<ul>
<li>Align joint GTM and technical execution between Mistral Science, Partner Engineering, and partner field teams.</li>
</ul>
<p>About You</p>
<p><strong>Must-Have</strong></p>
<ul>
<li>10–15 years’ experience in partner-facing technical sales or solution architecture (e.g., Partner SA, Alliance Architect, Partner Technology Strategist).</li>
</ul>
<ul>
<li>Proven ability to engage C-suite and senior technical stakeholders (CTO, CIO, Chief Architect) in strategic architecture discussions.</li>
</ul>
<ul>
<li>Deep GenAI/LLM expertise: RAG, fine-tuning, prompt engineering, model evaluation, and deployment patterns.</li>
</ul>
<ul>
<li>Technical mastery of AI/ML infrastructure (GPU clusters, cloud platforms, model deployment frameworks).</li>
</ul>
<ul>
<li>Track record of co-designing/deploying joint solutions with ecosystem partners (Nvidia, Dell, AWS, Accenture, etc.).</li>
</ul>
<ul>
<li>Executive communication: Ability to articulate science-driven value propositions to technical and business audiences.</li>
</ul>
<ul>
<li>Entrepreneurial mindset: Operates autonomously in high-growth environments; creates playbooks, not follows them.</li>
</ul>
<ul>
<li>Fluent in English; confident working across diverse, cross-cultural teams in Asia.</li>
</ul>
<p><strong>Nice-to-Have</strong></p>
<ul>
<li>Experience with open-weight LLMs or open-source AI stacks (Mistral, Hugging Face, LangChain, vLLM, RAG frameworks).</li>
</ul>
<ul>
<li>Prior involvement in AI Lab, AI Factory, or Sovereign Cloud deployments.</li>
</ul>
<ul>
<li>Familiarity with data governance, model evaluation, and GPU sizing for large-scale inference.</li>
</ul>
<ul>
<li>Network across GSIs and infrastructure partners in Asia</li>
</ul>
<ul>
<li>Exposure to multi-region partner programs or joint GTM initiatives in APJ.</li>
</ul>
<ul>
<li>Bonus languages: Korean, Japanese, or Mandarin for regional partner engagement.</li>
</ul>
<p>What we offer</p>
<ul>
<li>💰 Competitive cash salary and equity</li>
</ul>
<ul>
<li>🚑 Health Insurance: Best in Class</li>
</ul>
<ul>
<li>🥎 Sport: $90 for gym membership allowance</li>
</ul>
<ul>
<li>🥕 Food: $200 monthly allowance for meals (solution might evolve as we grow bigger)</li>
</ul>
<ul>
<li>🚴 Transportation: $120/month for public transport or parking charges reimbursed</li>
</ul>
<ul>
<li>🏝️ PTO: 18 days per year</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>GenAI/LLM expertise, RAG, fine-tuning, prompt engineering, model evaluation, deployment patterns, AI/ML infrastructure, GPU clusters, cloud platforms, model deployment frameworks, co-designing/deploying joint solutions, ecosystem partners, Nvidia, Dell, AWS, Accenture, open-weight LLMs, open-source AI stacks, Mistral, Hugging Face, LangChain, vLLM, RAG frameworks, data governance, model evaluation, GPU sizing, large-scale inference, GSIs, infrastructure partners, multi-region partner programs, joint GTM initiatives, APJ, Korean, Japanese, Mandarin</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Mistral AI</Employername>
      <Employerlogo></Employerlogo>
      <Employerdescription>Mistral AI is an AI technology company that provides high-performance, optimized, open-source and cutting-edge models, products and solutions.</Employerdescription>
      <Employerwebsite>https://mistral.ai/careers</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://jobs.lever.co/mistral/fe3542b5-4f99-4d62-af6a-fbdfd13bf0e4</Applyto>
      <Location>Singapore</Location>
      <Country></Country>
      <Postedate>2026-03-10</Postedate>
    </job>
    <job>
      <externalid>f8883394-0fc</externalid>
      <Title>Solutions Architect, AI and ML</Title>
      <Description><![CDATA[<p>We are looking for an experienced Cloud Solution Architect to help customers adopt GPU hardware and software, and to build and deploy Machine Learning (ML), Deep Learning (DL), and data analytics solutions on various cloud computing platforms.</p>
<p>As a Solutions Architect, you will engage directly with developers, researchers, and data scientists at some of NVIDIA’s most strategic technology customers, as well as work directly with business and engineering teams on product strategy.</p>
<p><strong>Key Responsibilities:</strong></p>
<ul>
<li>Help cloud customers craft, deploy, and maintain scalable, GPU-accelerated inference pipelines on cloud ML services and Kubernetes for large language models (LLMs) and generative AI workloads.</li>
<li>Enhance performance tuning using TensorRT/TensorRT-LLM, vLLM, Dynamo, and Triton Inference Server to improve GPU utilization and model efficiency.</li>
<li>Collaborate with multi-functional teams (engineering, product) and offer technical mentorship to cloud customers implementing AI inference at scale.</li>
<li>Build custom PoCs for solutions that address customers’ critical business needs, applying NVIDIA hardware and software technology.</li>
<li>Partner with Sales Account Managers or Developer Relations Managers to identify and secure new business opportunities for NVIDIA products and solutions for ML/DL and other software solutions.</li>
<li>Prepare and deliver technical content to customers including presentations about purpose-built solutions, workshops about NVIDIA products and solutions, etc.</li>
<li>Conduct regular technical customer meetings for project/product roadmap, feature discussions, and intro to new technologies. Establish close technical ties to the customer to facilitate rapid resolution of customer issues</li>
</ul>
<p><strong>Requirements:</strong></p>
<ul>
<li>BS/MS/PhD in Electrical/Computer Engineering, Computer Science, Statistics, Physics, or other Engineering fields or equivalent experience.</li>
<li>3+ Years in Solutions Architecture with a proven track record of moving AI inference from POC to production in cloud computing environments including AWS, GCP, or Azure</li>
<li>3+ years of hands-on experience with Deep Learning frameworks such as PyTorch and TensorFlow</li>
<li>Excellent knowledge of the theory and practice of LLM and DL inference</li>
<li>Strong fundamentals in programming, optimizations, and software design, especially in Python</li>
<li>Experience with containerization and orchestration technologies like Docker and Kubernetes, monitoring, and observability solutions for AI deployments</li>
<li>Knowledge of Inference technologies - NVIDIA NIM, TensorRT-LLM, Dynamo, Triton Inference Server, vLLM, etc</li>
<li>Proficiency in problem-solving and debugging skills in GPU environments</li>
<li>Excellent presentation, communication and collaboration skills</li>
</ul>
<p><strong>Nice to Have:</strong></p>
<ul>
<li>AWS, GCP or Azure Professional Solution Architect Certification.</li>
<li>Experience optimizing and deploying large MoE LLMs at scale</li>
<li>Active contributions to open-source AI inference projects (e.g., vLLM, TensorRT-LLM, Dynamo, SGLang, Triton, or similar)</li>
<li>Experience with Multi-GPU Multi-node Inference technologies like Tensor Parallelism/Expert Parallelism, Disaggregated Serving, LWS, MPI, EFA/Infiniband, NVLink/PCIe, etc</li>
<li>Experience in developing and integrating monitoring and alerting solutions using Prometheus, Grafana, and NVIDIA DCGM and GPU performance Analysis and tools like NVIDIA Nsight Systems</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>Cloud Solution Architecture, GPU hardware and Software, Machine Learning (ML), Deep Learning (DL), Data Analytics, Cloud Computing Platforms, Kubernetes, TensorRT, TensorRT-LLM, vLLM, Dynamo, Triton Inference Server, Python, Containerization, Orchestration, Monitoring, Observability, Inference technologies, NVIDIA NIM, Problem-solving, Debugging, GPU environments, AWS, GCP, Azure, Professional Solution Architect Certification, Large MoE LLMs, Open-source AI inference projects, Multi-GPU Multi-node Inference technologies, Monitoring and alerting solutions, Prometheus, Grafana, NVIDIA DCGM, GPU performance Analysis, NVIDIA Nsight Systems</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>NVIDIA</Employername>
      <Employerlogo>https://logos.yubhub.co/nvidia.com.png</Employerlogo>
      <Employerdescription>NVIDIA is a leading technology company that specializes in designing and manufacturing graphics processing units (GPUs) and high-performance computing hardware.</Employerdescription>
      <Employerwebsite>https://www.nvidia.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-WA-Redmond/Solutions-Architect--AI-and-ML_JR2005988-1</Applyto>
      <Location>Redmond, WA; Santa Clara, CA; Seattle, WA</Location>
      <Country></Country>
      <Postedate>2026-03-09</Postedate>
    </job>
    <job>
      <externalid>db67438e-963</externalid>
      <Title>Director, System Software Engineering - Metropolis Accelerated and Inferencing Software</Title>
      <Description><![CDATA[<p><strong>Director, System Software Engineering - Metropolis Accelerated and Inferencing Software</strong></p>
<p>We are looking for an engineering leader who is hands-on with deep learning—comfortable reading/modeling code, not just running it. You will lead, encourage, and develop world-class engineering and data teams distributed across Europe, Asia and the United States.</p>
<p><strong>Key Responsibilities:</strong></p>
<ul>
<li>Architect and operationalize NVIDIA’s end-to-end data Inference Acceleration strategy, powering Inferencing and continuous performance improvements.</li>
<li>Drive strategic implementations of TensorRT, vLLM, and other accelerated frameworks for inference solutions on edge and enterprise devices: lead accelerated computing efforts and solutions for key Metropolis verticals. Set up Proofs of Readiness (PORs) and guide their implementations.</li>
<li>Leading customer solutions: Collaborate with major Metropolis OEMs and partners to architect highly accelerated and optimized custom deep learning models and inference pipelines for their specific requirements. Offer direct customer support, including debugging, technical education, and handling customer inquiries for our Metropolis partners and customers. Responsible for drafting and finalizing SOWs with internal customers and partners.</li>
<li>Performance Benchmarking: Orchestrate efforts to achieve leading performance results on industry benchmarks like MLPerf on various edge and Enterprise devices.</li>
<li>Technical Leadership &amp; Influence: Function as a technical leader for deep learning across multiple teams, giving oversight and build support. Apply customer insights to influence the composition and structure of upcoming SOC / GPU deep learning hardware.</li>
<li>Scaling the team: Strategically hiring to meet new demands while also mentoring and adjusting existing teams to new deep learning challenges.</li>
<li>Representing NVIDIA deep learning solutions in webinars, conferences, and partner events.</li>
</ul>
<p><strong>Requirements:</strong></p>
<ul>
<li>Master’s in Computer Science/Electrical Engineering or equivalent experience.</li>
<li>A minimum of 8 years of meaningful involvement in machine learning/deep learning research or practical experience, coupled with 7+ years of leadership background and overall 15+ years of industry experience.</li>
<li>Over 10 years of validated expertise in the embedded software sector, holding technical leadership positions accountable for delivering outstanding production software within a multifaceted setting.</li>
<li>Deep Knowledge of GPU, CPU and dedicated deep learning architecture fundamentals and low-level performance optimizations using heterogeneous computing.</li>
<li>Hands-on experience with VLMs, LLMs, or multimodal AI systems applied to perception, data triage, or automated labeling.</li>
<li>Strong expertise in large-scale data processing, systems build, or machine learning pipelines.</li>
<li>Strong communication, careful planning, and technical leadership capabilities.</li>
</ul>
<p><strong>Benefits:</strong></p>
<ul>
<li>Competitive salary package and benefits</li>
<li>Eligible for equity</li>
</ul>
<p><strong>How to Apply:</strong></p>
<p>Applications for this job will be accepted at least until March 13, 2026.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>executive</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>Machine Learning, Deep Learning, GPU, CPU, Heterogeneous Computing, TensorRT, vLLM, Proof of Readiness, Customer Support, Technical Education, Performance Benchmarking, Technical Leadership, Team Scaling, Webinars, Conferences, Partner Events, VLMs, LLMs, Multimodal AI Systems, Perception, Data Triage, Automated Labeling, Large-Scale Data Processing, Systems Build, Machine Learning Pipelines</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>NVIDIA</Employername>
      <Employerlogo>https://logos.yubhub.co/nvidia.com.png</Employerlogo>
      <Employerdescription>NVIDIA is a world leader in physical AI, powering self-driving cars, humanoid robots, intelligent environments, medical devices, and more.</Employerdescription>
      <Employerwebsite>https://www.nvidia.com</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://nvidia.wd5.myworkdayjobs.com/en-US/NVIDIAExternalCareerSite/job/US-CA-Santa-Clara/Director--Metropolis-Accelerated-and-Inferencing-Software_JR2011299</Applyto>
      <Location>Santa Clara</Location>
      <Country></Country>
      <Postedate>2026-03-09</Postedate>
    </job>
    <job>
      <externalid>6c07cdd9-125</externalid>
      <Title>Staff Full Stack Software Engineer</Title>
      <Description><![CDATA[<p>Perplexity is seeking an experienced Staff Full Stack Engineer to help revolutionize the way people search and interact online. In this role, you&#39;ll translate cutting-edge AI advances into products that are both useful and engaging.</p>
<p>Our tech stack includes Python, Go, Rust, TypeScript, React, FastAPI, PostgreSQL, Redis, Docker, vLLM, and AWS.</p>
<p>Roles and teams at Perplexity are fluid. By applying to this position, you will be eligible to join teams across Perplexity Engineering. During the interview process, we look forward to learning more about your unique talents and figuring out where in our organization you’ll grow and thrive the most.</p>
<p><strong>Responsibilities</strong></p>
<ul>
<li>Building new 0-1 products at high scale</li>
</ul>
<ul>
<li>Working closely with Product, Design, and Data to ship experiments and learn</li>
</ul>
<ul>
<li>Launching new features, experiments, campaigns, and partnerships in a fast-moving environment</li>
</ul>
<ul>
<li>Building core growth infrastructure such as notification platforms, ad attribution, and more</li>
</ul>
<ul>
<li>Analyzing performance metrics and user feedback to identify opportunities for improvement and optimization</li>
</ul>
<ul>
<li>Building delightful and data-proven user journeys</li>
</ul>
<p><strong>Qualifications</strong></p>
<ul>
<li>Strong programming skills with the ability to work across the full stack</li>
</ul>
<ul>
<li>Self-motivated with a willingness to take ownership of tasks</li>
</ul>
<ul>
<li>Good quantitative understanding of data and experimentation</li>
</ul>
<ul>
<li>Experience making data-driven decisions and measuring impact of those decisions (experimentation, feature flags, adhoc analysis)</li>
</ul>
<ul>
<li>A passion for shipping quality products</li>
</ul>
<ul>
<li>8+ years of industry experience</li>
</ul>
<p><strong>How we work with AI</strong></p>
<p>AI is at the heart of what we build, and using it effectively is an expectation for every role here. During interviews, we will be excited to see how you think and to understand how you make decisions, qualities that would directly influence our AI development. There may be an opportunity for you to showcase your AI skills; however, we ask that you kindly avoid using AI tools throughout the process unless we explicitly indicate otherwise.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement></Workarrangement>
      <Salaryrange>$220K – $405K • Offers Equity</Salaryrange>
      <Skills>Python, Go, Rust, TypeScript, React, FastAPI, PostgreSQL, Redis, Docker, vLLM, AWS</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Perplexity</Employername>
      <Employerlogo>https://logos.yubhub.co/perplexity.com.png</Employerlogo>
      <Employerdescription>Perplexity is a technology company that develops products for searching and interacting online.</Employerdescription>
      <Employerwebsite>https://www.perplexity.ai</Employerwebsite>
      <Compensationcurrency>USD</Compensationcurrency>
      <Compensationmin>220000</Compensationmin>
      <Compensationmax>405000</Compensationmax>
      <Applyto>https://jobs.ashbyhq.com/perplexity/8df04206-7b62-4163-9d94-a4f28680eeb1</Applyto>
      <Location>San Francisco; New York City</Location>
      <Country></Country>
      <Postedate>2026-03-09</Postedate>
    </job>
    <job>
      <externalid>d0214534-b6a</externalid>
      <Title>Senior Applied Scientist</Title>
      <Description><![CDATA[<p>We&#39;re building the next-generation Grounding Service that powers the latest AI applications—chat assistants, copilots, and autonomous agents—with factual, cited, and trustworthy responses. Our platform stitches together retrieval, reasoning, and real-time data so that large language models stay anchored to enterprise knowledge, the public web, and proprietary tools.</p>
<p>We&#39;re looking for a Senior Applied Scientist to lead end-to-end science for grounding: inventing retrieval and attribution methods, defining factuality/faithfulness metrics, and shipping production models and APIs that scale to billions of queries. You&#39;ll partner closely with engineering, product, research, and customers to deliver fast, reliable, and explainable answers with source citations across a diverse set of domains and modalities.</p>
<p>As a team, we value curiosity, pragmatic rigor, and inclusive collaboration. We believe great systems emerge when scientists and engineers co-design metrics, models, and infrastructure—and when we obsess over customer impact, privacy, and safety.</p>
<p>Microsoft&#39;s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.</p>
<p>Starting January 26, 2026, Microsoft AI (MAI) employees who live within a 50-mile commute of a designated Microsoft office in the U.S. or a 25-mile commute of a non-U.S., country-specific location are expected to work from the office at least four days per week. This expectation is subject to local law and may vary by jurisdiction.</p>
<p>Responsibilities:</p>
<ul>
<li>Owns the science roadmap for grounding—including retrieval, re-ranking, attribution, and reasoning—driving initiatives from problem framing to production impact.</li>
<li>Designs and evolves state-of-the-art retrieval and RAG orchestration across documents, tables, code, and images.</li>
<li>Builds citation and provenance systems (e.g., passage highlighting, quote-level alignment, confidence scoring) to reduce hallucinations and increase user trust.</li>
<li>Leads experimentation and evaluation using A/B testing, interleaving, NDCG, MRR, precision/recall, and calibration curves to guide measurable trade-offs.</li>
<li>Advances tool-augmented grounding through schema-aware retrieval, function calling, knowledge graph joins, and real-time connectors to databases, cloud object stores, search indexes, and the web.</li>
<li>Partners with platform engineering to productionize models with scalable inference, embedding services, feature stores, caching, and privacy-compliant multi-tenant systems.</li>
<li>Nurtures collaborative relationships with product and business leaders across Microsoft, influencing strategic decisions and driving business impact through technology.</li>
<li>Authors white papers, contributes to internal tools and services, and may publish research to generate intellectual property.</li>
<li>Bridges the gap between researchers (e.g., Microsoft Research) and development teams, applying long-term research to solve immediate product needs.</li>
<li>Leads high-stakes negotiations to ensure cutting-edge technologies are applied practically and effectively.</li>
<li>Identifies and solves significant business problems using novel, scalable, and data-driven solutions.</li>
<li>Shapes the direction of Microsoft and the broader industry through pioneering product and tooling work.</li>
<li>Mentors applied scientists and data scientists, establishing best practices in experimentation, error analysis, and incident review.</li>
<li>Collaborates cross-functionally with PMs, research, infrastructure, and security teams to align on milestones, SLAs, and safety protocols.</li>
<li>Communicates clearly through design documentation, progress updates, and presentations to executives and customers.</li>
<li>Contributes to ethics and privacy policies, identifies bias in product development, and proposes mitigation strategies.</li>
</ul>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange></Salaryrange>
      <Skills>Statistics, Econometrics, Computer Science, Electrical or Computer Engineering, Machine Learning, Information Retrieval, Large Language Model Development, Pretraining, Supervised Fine-Tuning, Reinforcement Learning, Optimizing LLM Inference, Master&apos;s Degree in Statistics, Econometrics, Computer Science, Electrical or Computer Engineering, or related field, 6+ years related experience (e.g., statistics, predictive analytics, research), Demonstrated expertise in information retrieval, with publications in top-tier conferences or journals such as NeurIPS, ICML, ICLR, SIGIR, or ACL, Hands-on experience in large language model (LLM) development, including pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL), Proven track record in optimizing LLM inference, or active contributions to open-source frameworks like vLLM, SGLang, or related projects</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Microsoft</Employername>
      <Employerlogo>https://logos.yubhub.co/microsoft.ai.png</Employerlogo>
      <Employerdescription>Microsoft is a multinational technology company that develops, manufactures, licenses, and supports a wide range of software products, services, and devices.</Employerdescription>
      <Employerwebsite>https://microsoft.ai</Employerwebsite>
      <Compensationcurrency></Compensationcurrency>
      <Compensationmin></Compensationmin>
      <Compensationmax></Compensationmax>
      <Applyto>https://microsoft.ai/job/senior-applied-scientist-37/</Applyto>
      <Location>Beijing</Location>
      <Country></Country>
      <Postedate>2026-03-08</Postedate>
    </job>
    <job>
      <externalid>d3a39f4c-d95</externalid>
      <Title>Software Engineer, Inference - Multi Modal</Title>
      <Description><![CDATA[<p><strong>Software Engineer, Inference - Multi Modal</strong></p>
<p><strong>Location</strong></p>
<p>San Francisco</p>
<p><strong>Employment Type</strong></p>
<p>Full time</p>
<p><strong>Department</strong></p>
<p>Scaling</p>
<p><strong>Compensation</strong></p>
<ul>
<li>$295K – $555K • Offers Equity</li>
</ul>
<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>
<ul>
<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>
</ul>
<ul>
<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>
</ul>
<ul>
<li>401(k) retirement plan with employer match</li>
</ul>
<ul>
<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>
</ul>
<ul>
<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>
</ul>
<ul>
<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>
</ul>
<ul>
<li>Mental health and wellness support</li>
</ul>
<ul>
<li>Employer-paid basic life and disability coverage</li>
</ul>
<ul>
<li>Annual learning and development stipend to fuel your professional growth</li>
</ul>
<ul>
<li>Daily meals in our offices, and meal delivery credits as eligible</li>
</ul>
<ul>
<li>Relocation support for eligible employees</li>
</ul>
<ul>
<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>
</ul>
<p>More details about our benefits are available to candidates during the hiring process.</p>
<p>This role is at-will and OpenAI reserves the right to modify base pay and other compensation components at any time based on individual performance, team or company results, or market conditions.</p>
<p><strong>About the Team</strong></p>
<p>OpenAI’s Inference team powers the deployment of our most advanced models - including our GPT models, 4o Image Generation, and Whisper - across a variety of platforms. Our work ensures these models are available, performant, and scalable in production, and we partner closely with Research to bring the next generation of models into the world. We&#39;re a small, fast-moving team of engineers focused on delivering a world-class developer experience while pushing the boundaries of what AI can do.</p>
<p>We’re expanding into multimodal inference, building the infrastructure needed to serve models that handle image, audio, and other non-text modalities. These workloads are inherently more heterogeneous and experimental, involving diverse model sizes and interactions, more complex input/output formats, and tighter coordination with product and research.</p>
<p><strong>About the Role</strong></p>
<p>We’re looking for a software engineer to help us serve OpenAI’s multimodal models at scale. You’ll be part of a small team responsible for building reliable, high-performance infrastructure for serving real-time audio, image, and other MM workloads in production.</p>
<p>This work is inherently cross-functional: you’ll collaborate directly with researchers training these models and with product teams defining new modalities of interaction. You&#39;ll build and optimize the systems that let users generate speech, understand images, and interact with models in ways far beyond text.</p>
<p><strong>In this role, you will:</strong></p>
<ul>
<li>Design and implement inference infrastructure for large-scale multimodal models.</li>
</ul>
<ul>
<li>Optimize systems for high-throughput, low-latency delivery of image and audio inputs and outputs.</li>
</ul>
<ul>
<li>Enable experimental research workflows to transition into reliable production services.</li>
</ul>
<ul>
<li>Collaborate closely with researchers, infra teams, and product engineers to deploy state-of-the-art capabilities.</li>
</ul>
<ul>
<li>Contribute to system-level improvements including GPU utilization, tensor parallelism, and hardware abstraction layers.</li>
</ul>
<p><strong>You might thrive in this role if you:</strong></p>
<ul>
<li>Have experience building and scaling inference systems for LLMs or multimodal models.</li>
</ul>
<ul>
<li>Have worked with GPU-based ML workloads and understand the performance dynamics of large models, especially with complex data like images or audio.</li>
</ul>
<ul>
<li>Enjoy experimental, fast-evolving work and collaborating closely with research.</li>
</ul>
<ul>
<li>Are comfortable dealing with systems that span networking, distributed compute, and high-throughput data handling.</li>
</ul>
<ul>
<li>Have familiarity with inference tooling like vLLM, TensorRT-LLM, or custom model parallel systems.</li>
</ul>
<ul>
<li>Own problems end-to-end and are excited to operate in ambiguous, fast-moving spaces.</li>
</ul>
<p><strong>Nice to Have:</strong></p>
<ul>
<li>Experience working with image generation or audio synthesis models in production.</li>
</ul>
<ul>
<li>Exposure to distributed ML training or system-efficient model design.</li>
</ul>
<p><strong>About OpenAI</strong></p>
<p>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>mid</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>$295K – $555K • Offers Equity</Salaryrange>
      <Skills>Software Engineer, Inference Infrastructure, GPU-based ML Workloads, Tensor Parallelism, Hardware Abstraction Layers, vLLM, TensorRT-LLM, Custom Model Parallel Systems, Image Generation, Audio Synthesis, Distributed ML Training, System-Efficient Model Design</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>OpenAI</Employername>
      <Employerlogo>https://logos.yubhub.co/openai.com.png</Employerlogo>
      <Employerdescription>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products.</Employerdescription>
      <Employerwebsite>https://openai.com</Employerwebsite>
      <Compensationcurrency>USD</Compensationcurrency>
      <Compensationmin>295000</Compensationmin>
      <Compensationmax>555000</Compensationmax>
      <Applyto>https://jobs.ashbyhq.com/openai/4d14449e-5e7f-45d4-b103-8776a6c87086</Applyto>
      <Location>San Francisco</Location>
      <Country></Country>
      <Postedate>2026-03-06</Postedate>
    </job>
  </jobs>
</source>