{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/memory-partitioning"},"x-facet":{"type":"skill","slug":"memory-partitioning","display":"Memory Partitioning","count":1},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_f2196e99-854"},"title":"Software Engineer - GenAI inference","description":"<p>As a software engineer for GenAI inference, you will help design, develop, and optimize the inference engine that powers Databricks&#39; Foundation Model API. You&#39;ll work at the intersection of research and production, ensuring our large language model (LLM) serving systems are fast, scalable, and efficient.</p>\n<p>Your work will touch the full GenAI inference stack, from kernels and runtimes to orchestration and memory management. 
You will contribute to the design and implementation of the inference engine, and collaborate on a model-serving stack optimized for large-scale LLM inference.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Collaborating with researchers to bring new model architectures or features (sparsity, activation compression, mixture-of-experts) into the engine</li>\n<li>Optimizing for latency, throughput, memory efficiency, and hardware utilization across GPUs and accelerators</li>\n<li>Building and maintaining instrumentation, profiling, and tracing tooling to uncover bottlenecks and guide optimizations</li>\n<li>Developing and enhancing scalable routing, batching, scheduling, memory management, and dynamic loading mechanisms for inference workloads</li>\n<li>Supporting reliability, reproducibility, and fault tolerance in the inference pipelines, including A/B launches, rollback, and model versioning</li>\n<li>Integrating with federated, distributed inference infrastructure – orchestrating across nodes, balancing load, and handling communication overhead</li>\n<li>Collaborating cross-functionally: with platform engineers, cloud infrastructure, and security/compliance teams</li>\n<li>Documenting and sharing learnings, contributing to internal best practices and open-source efforts when possible</li>\n</ul>\n<p>Requirements include:</p>\n<ul>\n<li>BS/MS/PhD in Computer Science or a related field</li>\n<li>Strong software engineering background (3+ years or equivalent) in performance-critical systems</li>\n<li>Solid understanding of ML inference internals: attention, MLPs, recurrent modules, quantization, sparse operations, etc.</li>\n<li>Hands-on experience with CUDA, GPU programming, and key libraries (cuBLAS, cuDNN, NCCL, etc.)</li>\n<li>Comfortable designing and operating distributed systems, including RPC frameworks, queuing, RPC batching, sharding, memory partitioning</li>\n<li>Demonstrated ability to uncover and solve performance bottlenecks across layers (kernel, memory, 
networking, scheduler)</li>\n<li>Experience building instrumentation, tracing, and profiling tools for ML models</li>\n<li>Ability to work closely with ML researchers and translate novel model ideas into production systems</li>\n<li>Ownership mindset and eagerness to dive deep into complex system challenges</li>\n<li>Bonus: published research or open-source contributions in ML systems, inference optimization, or model serving</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_f2196e99-854","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Databricks","sameAs":"https://databricks.com","logo":"https://logos.yubhub.co/databricks.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/databricks/jobs/8202670002","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$142,200-$204,600 USD","x-skills-required":["software engineering","performance-critical systems","ML inference internals","CUDA","GPU programming","distributed systems","RPC frameworks","queuing","RPC batching","sharding","memory partitioning","instrumentation","tracing","profiling tools","ML researchers","complex system challenges"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:54:17.777Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, California"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"software engineering, performance-critical systems, ML inference internals, CUDA, GPU programming, distributed systems, RPC frameworks, queuing, RPC batching, sharding, memory partitioning, instrumentation, tracing, profiling tools, ML researchers, complex system 
challenges","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":142200,"maxValue":204600,"unitText":"YEAR"}}}]}