{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/video-understanding"},"x-facet":{"type":"skill","slug":"video-understanding","display":"Video Understanding","count":3},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_4ced2159-802"},"title":"Research, Vision Expertise","description":"<p>Thinking Machines Lab is seeking a researcher to join their team in San Francisco. The successful candidate will work on advancing the science of visual perception and multimodal learning. They will design architectures that fuse pixels and text, build datasets and evaluation methods that test real-world comprehension, and develop representations that let models ground abstract concepts in the physical world.</p>\n<p>The ideal candidate will have expertise in multimodality and experience running large-scale experiments. They will be comfortable contributing to complex engineering systems and have a strong grasp of probability, statistics, and machine learning fundamentals.</p>\n<p>This is an evergreen role, meaning that the position is open on an ongoing basis. The company receives many applications, and there may not always be an immediate role that aligns perfectly with the candidate&#39;s experience and skills. 
However, the company encourages candidates to apply and reviews applications on an ongoing basis.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Own research projects on training and performance analysis of multimodal AI models.</li>\n<li>Curate and build large-scale datasets and evaluation benchmarks to advance vision capabilities.</li>\n<li>Work with data infrastructure engineers, pretraining researchers and engineers, and product teams to create frontier multimodal models and the products that leverage them.</li>\n<li>Publish and present research that moves the entire community forward.</li>\n</ul>\n<p>Skills and Qualifications:</p>\n<ul>\n<li>Ability to design, run, and analyze experiments thoughtfully, with demonstrated research judgment and empirical rigor.</li>\n<li>Understanding of machine learning fundamentals, large-scale training, and distributed compute environments.</li>\n<li>Proficiency in Python and familiarity with at least one deep learning framework (e.g., PyTorch, TensorFlow, or JAX).</li>\n<li>Comfortable with debugging distributed training and writing code that scales.</li>\n<li>Bachelor&#39;s degree or equivalent experience in Computer Science, Machine Learning, Physics, Mathematics, or a related discipline with strong theoretical and empirical grounding.</li>\n</ul>\n<p>Preferred qualifications include research or engineering contributions in visual reasoning, spatial understanding, or multimodal architecture design; experience developing evaluation frameworks for multimodal tasks; publications or open-source contributions in vision-language modeling, video understanding, or multimodal AI; and a strong grasp of probability, statistics, and ML fundamentals.</p>\n<p>Logistics:</p>\n<ul>\n<li>Location: San Francisco, California.</li>\n<li>Compensation: $350,000 - $475,000 USD per year, depending on background, skills, and experience.</li>\n<li>Visa sponsorship: Yes.</li>\n<li>Benefits: Generous health, dental, and vision benefits, unlimited PTO, paid parental leave, 
and relocation support as needed.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_4ced2159-802","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Thinking Machines Lab","sameAs":"https://thinkingmachines.ai/","logo":"https://logos.yubhub.co/thinkingmachines.ai.png"},"x-apply-url":"https://job-boards.greenhouse.io/thinkingmachines/jobs/5002288008","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$350,000 - $475,000 USD per year","x-skills-required":["Python","Deep learning framework (e.g., PyTorch, TensorFlow, or JAX)","Machine learning fundamentals","Large-scale training","Distributed compute environments"],"x-skills-preferred":["Visual reasoning","Spatial understanding","Multimodal architecture design","Evaluation frameworks for multimodal tasks","Vision-language modeling","Video understanding","Multimodal AI"],"datePosted":"2026-04-18T15:52:43.848Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Python, Deep learning framework (e.g., PyTorch, TensorFlow, or JAX), Machine learning fundamentals, Large-scale training, Distributed compute environments, Visual reasoning, Spatial understanding, Multimodal architecture design, Evaluation frameworks for multimodal tasks, Vision-language modeling, Video understanding, Multimodal AI","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":350000,"maxValue":475000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_28b01ce3-8a3"},"title":"Member of Technical Staff - Imagine Model","description":"<p>As a Member of Technical Staff on 
the Imagine Model Team, you will develop cutting-edge AI experiences beyond text, with a strong focus on enabling high-fidelity understanding and generation across image and video modalities, while also incorporating audio where it enhances visual content.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Create and drive engineering agendas to advance multimodal capabilities, with emphasis on image and video generation, editing, understanding, controllable/long-horizon synthesis, agentic planning, RL training, and world simulation (including audio integration for richer video experiences).</li>\n<li>Improve data quality through annotation, filtering, augmentation, synthetic generation, captioning, and in-depth data studies, particularly for visual and audio data.</li>\n<li>Design evaluation frameworks, metrics, benchmarks, evals, and reward models tailored to image/video/audio quality and coherence.</li>\n<li>Implement efficient algorithms for state-of-the-art model performance, including real-time inference, distillation, and scalable serving for visual content.</li>\n<li>Develop scalable data collection and processing pipelines for multimodal (primarily image/video-focused) datasets.</li>\n<li>Collaborate cross-functionally to integrate AI solutions into production and rapidly iterate based on user feedback.</li>\n</ul>\n<p>Basic Qualifications:</p>\n<ul>\n<li>Track record in leading studies that significantly improve neural network capabilities and performance through better data or modeling.</li>\n<li>Experience in data-driven experiment designs, systematic analysis, and iterative model debugging.</li>\n<li>Experience developing or working with large-scale distributed machine learning systems.</li>\n<li>Ability to deliver optimal end-to-end user experiences.</li>\n<li>Hands-on contributor with initiative, excellence, strong work ethic, prioritization skills, and excellent communication.</li>\n</ul>\n<p>Preferred Skills and Experience:</p>\n<ul>\n<li>Experience in SFT, 
RL, evals, human/synthetic data collection, or agentic systems.</li>\n<li>Proficiency in Python, JAX/XLA, PyTorch, Rust/C++, Spark, Ray, and related large-scale frameworks.</li>\n<li>Domain expertise in multimodal applications such as graphics engines, rendering techniques, image/video understanding and generation, world models, real-time simulation, or controllable/long-horizon visual content creation (audio/speech processing or music/audio generation experience is a plus where it supports video).</li>\n<li>Experience with agentic RL training, controllable/long-horizon generation, or multimodal agents that reason and act across modalities (especially in visual domains).</li>\n</ul>","url":"https://yubhub.co/jobs/job_28b01ce3-8a3","directApply":true,"hiringOrganization":{"@type":"Organization","name":"xAI","sameAs":"https://www.xai.com/","logo":"https://logos.yubhub.co/xai.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/xai/jobs/5051985007","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$180,000 - $440,000 USD","x-skills-required":["Python","JAX/XLA","PyTorch","Rust/C++","Spark","Ray","multimodal applications","agentic systems","RL training","controllable/long-horizon generation"],"x-skills-preferred":["SFT","evals","human/synthetic data collection","graphics engines","rendering techniques","image/video understanding and generation","world models","real-time simulation"],"datePosted":"2026-04-18T15:24:12.847Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Palo Alto, CA; Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Python, JAX/XLA, PyTorch, Rust/C++, Spark, Ray, multimodal applications, agentic systems, RL training, controllable/long-horizon generation, SFT, evals, 
human/synthetic data collection, graphics engines, rendering techniques, image/video understanding and generation, world models, real-time simulation","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":180000,"maxValue":440000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_efd94a1f-55b"},"title":"Research Scientist, Audio","description":"<p>We&#39;re seeking a Research Scientist to join our team in developing novel algorithmic architectures with the end goal of building Artificial General Intelligence.</p>\n<p>In this role, you will make key contributions to the latest research developed in the Gemini audio pillar, including:</p>\n<ul>\n<li>Unlocking new audio capabilities within the model, both in pre-training and post-training.</li>\n<li>Improving the quality of models for understanding and generation, including research to improve our tokenizers, better techniques for generation quality, and joint audio and visual representations.</li>\n<li>Developing better evaluation methods (human raters, auto raters, and automated metrics) to measure quality on open-ended tasks.</li>\n</ul>\n<p>To succeed in this role, you should have a PhD in Computer Science, Computer Vision, Speech Processing, or a related Machine Learning field, experience working with LLMs, and audio or video understanding and/or generation experience. 
A proven track record of research and publications in areas such as audio generation, video generation, and LLMs, together with a real passion for AI, are also advantages.</p>","url":"https://yubhub.co/jobs/job_efd94a1f-55b","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Google DeepMind","sameAs":"https://deepmind.com/","logo":"https://logos.yubhub.co/deepmind.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/deepmind/jobs/7572463","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$147,000 - $211,000 + bonus + equity + benefits","x-skills-required":["PhD in Computer Science, Computer Vision, Speech Processing, or Machine Learning related field","Experience working with LLMs","Audio or video understanding and/or generation experience"],"x-skills-preferred":["Proven track record of research and publications in areas such as audio generation, video generation, LLMs","Real passion for AI"],"datePosted":"2026-03-16T14:38:44.527Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"New York City, New York, US"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"PhD in Computer Science, Computer Vision, Speech Processing, or Machine Learning related field, Experience working with LLMs, Audio or video understanding and/or generation experience, Proven track record of research and publications in areas such as audio generation, video generation, LLMs, Real passion for AI","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":147000,"maxValue":211000,"unitText":"YEAR"}}}]}