{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/data-lake-architectures"},"x-facet":{"type":"skill","slug":"data-lake-architectures","display":"Data Lake Architectures","count":1},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_9be280f4-cbc"},"title":"Software Engineer, Data Infrastructure","description":"<p>We&#39;re looking for an engineer to join our small, high-impact team responsible for architecting and scaling the core infrastructure behind distributed training pipelines, multimodal data catalogs, and intelligent processing systems that operate over petabytes of data.</p>\n<p>As a software engineer on our data infrastructure team, you&#39;ll design, build, and operate scalable, fault-tolerant infrastructure for LLM Research: distributed compute, data orchestration, and storage across modalities. You&#39;ll develop high-throughput systems for data ingestion, processing, and transformation , including training data catalogs, deduplication, quality checks, and search. You&#39;ll also build systems for traceability, reproducibility, and robust quality control at every stage of the data lifecycle.</p>\n<p>You&#39;ll collaborate with research teams to unlock new features, improve data quality, and accelerate training cycles. You&#39;ll implement and maintain monitoring and alerting to support platform reliability and performance.</p>\n<p>If you&#39;re excited by distributed systems, large-scale data mining, open-source tools like Spark, Kafka, Beam, Ray, and Delta Lake, and enjoy building from the ground up, we&#39;d love to hear from you.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_9be280f4-cbc","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Thinking Machines Lab","sameAs":"https://thinkingmachines.ai/","logo":"https://logos.yubhub.co/thinkingmachines.ai.png"},"x-apply-url":"https://job-boards.greenhouse.io/thinkingmachines/jobs/5013919008","x-work-arrangement":"onsite","x-experience-level":"entry|mid|senior","x-job-type":"full-time","x-salary-range":"$350,000 - $475,000 USD","x-skills-required":["backend language (Python or Rust)","distributed compute frameworks (Apache Spark or Ray)","cloud infrastructure","data lake architectures","batch and streaming pipelines"],"x-skills-preferred":["Kafka","dbt","Terraform","Airflow","web crawler","deduplication","data mining","search","file formats and storage systems"],"datePosted":"2026-04-18T15:54:00.309Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"backend language (Python or Rust), distributed compute frameworks (Apache Spark or Ray), cloud infrastructure, data lake architectures, batch and streaming pipelines, Kafka, dbt, Terraform, Airflow, web crawler, deduplication, data mining, search, file formats and storage systems","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":350000,"maxValue":475000,"unitText":"YEAR"}}}]}