<?xml version="1.0" encoding="UTF-8"?>
<source>
  <jobs>
    <job>
      <externalid>ec7cc743-ef4</externalid>
      <Title>Senior Software Engineer II, Inference</Title>
      <Description><![CDATA[<p>We&#39;re seeking a senior software engineer to lead the design and development of our Kubernetes-native inference platform. In this role, you will drive architecture, lead design reviews, and ensure the platform&#39;s reliability and scalability.</p>
<p>Key responsibilities include:</p>
<ul>
<li>Leading design reviews and driving architecture within the team</li>
<li>Defining and owning SLIs/SLOs and ensuring post-incident actions land and reliability improves release-over-release</li>
<li>Implementing advanced optimizations such as micro-batch schedulers, speculative decoding, and KV-cache reuse</li>
<li>Strengthening incident posture through capacity planning, autoscaling policy, and rollback/traffic-shift strategies</li>
<li>Mentoring IC1/IC2 engineers and reviewing cross-team designs to elevate coding/testing standards</li>
</ul>
<p>We&#39;re looking for someone with strong coding skills in Python or Go, deep familiarity with networked systems and performance, and hands-on experience with Kubernetes at production scale. Experience with inference internals such as batching, caching, mixed precision, and streaming token delivery is a plus.</p>
<p>In addition to a competitive salary, we offer a range of benefits including medical, dental, and vision insurance, company-paid life insurance, and flexible PTO. We&#39;re committed to creating a work environment that&#39;s inclusive, diverse, and supportive of our employees&#39; well-being.</p>
<p style="margin-top:24px;font-size:13px;color:#666;">XML job scraping automation by <a href="https://yubhub.co">YubHub</a></p>]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$165,000 to $242,000</Salaryrange>
      <Skills>Python, Go, Kubernetes, Networked systems, Performance, Inference internals, Batching, Caching, Mixed precision, Streaming token delivery, CUDA kernels, NCCL/SHARP, RDMA/NUMA, GPU interconnect topologies, Contributions to inference frameworks, Experience with multi-team initiatives</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>CoreWeave</Employername>
      <Employerlogo>https://logos.yubhub.co/coreweave.com.png</Employerlogo>
      <Employerdescription>CoreWeave is a cloud computing company that provides a platform for building and scaling AI applications.</Employerdescription>
      <Employerwebsite>https://www.coreweave.com</Employerwebsite>
      <Compensationcurrency>USD</Compensationcurrency>
      <Compensationmin>165000</Compensationmin>
      <Compensationmax>242000</Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/coreweave/jobs/4604832006</Applyto>
      <Location>Sunnyvale, CA / Bellevue, WA</Location>
      <Country>United States</Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>9701c504-1a6</externalid>
      <Title>Senior Software Engineer I, Inference</Title>
      <Description><![CDATA[<p>We&#39;re looking for a Senior Software Engineer I to join our team. As a senior engineer, you&#39;ll lead designs, raise engineering standards, and deliver measurable improvements to latency, throughput, and reliability across multiple services. You&#39;ll partner with product, orchestration, and hardware teams to evolve our Kubernetes-native inference platform and meet strict P99 SLAs at scale.</p>
<p>Key responsibilities include:</p>
<ul>
<li>Lead design reviews and drive architecture within the team; decompose multi-service work into clear milestones.</li>
<li>Define and own SLIs/SLOs; ensure post-incident actions land and reliability improves release-over-release.</li>
<li>Implement advanced optimizations (e.g., micro-batch schedulers, speculative decoding, KV-cache reuse) and quantify impact.</li>
<li>Strengthen incident posture: capacity planning, autoscaling policy, graceful degradation, rollback/traffic-shift strategies.</li>
<li>Mentor IC1/IC2 engineers; review cross-team designs and elevate coding/testing standards.</li>
</ul>
<p>Requirements include:</p>
<ul>
<li>3-5 years of industry experience building distributed systems or cloud services.</li>
<li>Strong coding in Python or Go (C++ a plus) and deep familiarity with networked systems and performance.</li>
<li>Hands-on experience with Kubernetes at production scale, CI/CD, and observability stacks (Prometheus, Grafana, OpenTelemetry).</li>
<li>Practical knowledge of inference internals: batching, caching, mixed precision (BF16/FP8), streaming token delivery.</li>
<li>Proven track record improving tail latency (P95/P99) and service reliability through metrics-driven work.</li>
</ul>
<p>Preferred qualifications include contributions to inference frameworks, experience with CUDA kernels, NCCL/SHARP, RDMA/NUMA, or GPU interconnect topologies, and leading multi-team initiatives or partnering with customers on mission-critical launches.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>senior</Experiencelevel>
      <Workarrangement>hybrid</Workarrangement>
      <Salaryrange>$139,000 to $204,000</Salaryrange>
      <Skills>Python, Go, Kubernetes, CI/CD, Observability stacks, Inference internals, Batching, Caching, Mixed precision, Streaming token delivery, Contributions to inference frameworks, CUDA kernels, NCCL/SHARP, RDMA/NUMA, GPU interconnect topologies</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>CoreWeave</Employername>
      <Employerlogo>https://logos.yubhub.co/coreweave.com.png</Employerlogo>
      <Employerdescription>CoreWeave is a cloud computing company that provides a platform for building and scaling AI applications. It was founded in 2017 and became a publicly traded company in March 2025.</Employerdescription>
      <Employerwebsite>https://www.coreweave.com</Employerwebsite>
      <Compensationcurrency>USD</Compensationcurrency>
      <Compensationmin>139000</Compensationmin>
      <Compensationmax>204000</Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/coreweave/jobs/4647603006</Applyto>
      <Location>Sunnyvale, CA / Bellevue, WA</Location>
      <Country>United States</Country>
      <Postedate>2026-04-18</Postedate>
    </job>
    <job>
      <externalid>faffae87-882</externalid>
      <Title>Staff Software Engineer - GenAI Performance and Kernel</Title>
      <Description><![CDATA[<p>As a staff software engineer for GenAI Performance and Kernel, you will own the design, implementation, optimization, and correctness of the high-performance GPU kernels powering our GenAI inference stack. You will lead development of highly-tuned, low-level compute paths, manage trade-offs between hardware efficiency and generality, and mentor others in kernel-level performance engineering.</p>
<p>Key responsibilities include:</p>
<ul>
<li>Leading the design, implementation, benchmarking, and maintenance of core compute kernels optimized for various hardware backends (GPU, accelerators)</li>
<li>Driving the performance roadmap for kernel-level improvements: vectorization, tensorization, tiling, fusion, mixed precision, sparsity, quantization, memory reuse, scheduling, auto-tuning, etc.</li>
<li>Integrating kernel optimizations with higher-level ML systems</li>
<li>Building and maintaining profiling, instrumentation, and verification tooling to detect correctness, performance regressions, numerical issues, and hardware utilization gaps</li>
<li>Leading performance investigations and root-cause analysis on inference bottlenecks, e.g. memory bandwidth, cache contention, kernel launch overhead, tensor fragmentation</li>
<li>Establishing coding patterns, abstractions, and frameworks to modularize kernels for reuse, cross-backend portability, and maintainability</li>
<li>Influencing system architecture decisions to make kernel improvements more effective (e.g. memory layout, dataflow scheduling, kernel fusion boundaries)</li>
<li>Mentoring and guiding other engineers working on lower-level performance, providing code reviews, and helping set best practices</li>
<li>Collaborating with infrastructure, tooling, and ML teams to roll out kernel-level optimizations into production, and monitoring their impact</li>
</ul>
<p>Requirements include:</p>
<ul>
<li>BS/MS/PhD in Computer Science or a related field</li>
<li>Deep hands-on experience writing and tuning compute kernels (CUDA, Triton, OpenCL, LLVM IR, assembly, or similar) for ML workloads</li>
<li>Strong knowledge of GPU/accelerator architecture: warp structure, memory hierarchy (global, shared, register, L1/L2 caches), tensor cores, scheduling, SM occupancy, etc.</li>
<li>Experience with advanced optimization techniques: tiling, blocking, software pipelining, vectorization, fusion, loop transformations, auto-tuning</li>
<li>Familiarity with ML-specific kernel libraries (cuBLAS, cuDNN, CUTLASS, oneDNN, etc.) or open kernels</li>
<li>Strong debugging and profiling skills (Nsight, nvprof, perf, VTune, custom instrumentation)</li>
<li>Experience reasoning about numerical stability, mixed precision, quantization, and error propagation</li>
<li>Experience in integrating optimized kernels into real-world ML inference systems; exposure to distributed inference pipelines, memory management, and runtime systems</li>
<li>Experience building high-performance products leveraging GPU acceleration</li>
<li>Excellent communication and leadership skills, with the ability to drive design discussions, mentor colleagues, and make trade-offs visible</li>
<li>A track record of shipping performance-critical, high-quality production software</li>
<li>Bonus: publications in systems/ML performance venues (e.g., MLSys, ASPLOS, ISCA, PPoPP), experience with custom accelerators or FPGAs, or experience with sparsity and model compression techniques</li>
</ul>
<p>The pay range for this role is $190,900-$232,800 USD per year, depending on location and experience.</p>
]]></Description>
      <Jobtype>full-time</Jobtype>
      <Experiencelevel>staff</Experiencelevel>
      <Workarrangement>onsite</Workarrangement>
      <Salaryrange>$190,900-$232,800 USD per year</Salaryrange>
      <Skills>Compute kernels, GPU/accelerator architecture, Advanced optimization techniques, ML-specific kernel libraries, Debugging and profiling skills, Numerical stability, Mixed precision, Quantization, Error propagation, Distributed inference pipelines, Memory management, Runtime systems, High-performance products, GPU acceleration</Skills>
      <Category>Engineering</Category>
      <Industry>Technology</Industry>
      <Employername>Databricks</Employername>
      <Employerlogo>https://logos.yubhub.co/databricks.com.png</Employerlogo>
      <Employerdescription>Databricks is a data and AI company that provides a unified platform for data, analytics, and AI.</Employerdescription>
      <Employerwebsite>https://databricks.com</Employerwebsite>
      <Compensationcurrency>USD</Compensationcurrency>
      <Compensationmin>190900</Compensationmin>
      <Compensationmax>232800</Compensationmax>
      <Applyto>https://job-boards.greenhouse.io/databricks/jobs/8202700002</Applyto>
      <Location>San Francisco, California</Location>
      <Country>United States</Country>
      <Postedate>2026-04-18</Postedate>
    </job>
  </jobs>
</source>