{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/observability-tools"},"x-facet":{"type":"skill","slug":"observability-tools","display":"Observability Tools","count":42},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_34fa7d64-89a"},"title":"Technical Product Manager - Linux Developer Experience","description":"<p>We&#39;re seeking a Technical Product Manager to join our team responsible for shaping and evolving the developer experience on our firm&#39;s developer platform.</p>\n<p>In this pivotal role, you&#39;ll serve as the primary liaison between the platform engineering team and our developer community , including quantitative analysts, researchers, and front-office trading teams , ensuring the platform meets their complex development needs and continuously improves.</p>\n<p>The Developer Platform team architects, engineers, and enhances the firm&#39;s developer’s toolchain and workflow. 
We collaborate closely with developers, quants, researchers, and front-office trading teams to ensure our platform provides a best-in-class development experience with the feel of native Mac/UNIX-like development.</p>\n<p>This role sits at the intersection of product management and technical enablement, acting as the voice of the developer within the platform team.</p>\n<p>Key Responsibilities:</p>\n<ul>\n<li>Build and maintain relationships with technologists and developers across the firm to deeply understand their workflows, pain points, and emerging needs</li>\n</ul>\n<ul>\n<li>Discover novel use cases and translate them into actionable product requirements for the platform engineering team</li>\n</ul>\n<ul>\n<li>Serve as the first point of contact for developer questions about the platform&#39;s environment, tooling, and capabilities</li>\n</ul>\n<ul>\n<li>Triage and reproduce issues reported by developers, driving initial diagnosis, including leveraging AI-assisted sessions for problem analysis, and escalating to the deeper technical engineering team when necessary</li>\n</ul>\n<ul>\n<li>Drive the roadmap and prioritization of platform enhancements in collaboration with engineering leadership</li>\n</ul>\n<ul>\n<li>Promote and evangelize the Linux developer platform, driving adoption and ensuring developers are aware of available features and best practices</li>\n</ul>\n<ul>\n<li>Manage project timelines, stakeholder communication, and delivery milestones for platform initiatives</li>\n</ul>\n<p>Qualifications / Skills Required:</p>\n<ul>\n<li>Demonstrated experience in Technical Product Management, Technical Project Management, or Developer Relations/Developer Experience roles</li>\n</ul>\n<ul>\n<li>Strong communication and stakeholder management skills, with the ability to engage credibly with both highly technical developers and senior leadership</li>\n</ul>\n<ul>\n<li>Working familiarity with Linux desktop environments, comfortable navigating the platform, 
understanding developer workflows, and answering environment/tooling questions</li>\n</ul>\n<ul>\n<li>Conceptual understanding of containerization and orchestration (Docker, Podman, Kubernetes) and how developers leverage these tools in their workflows</li>\n</ul>\n<ul>\n<li>Familiarity with CI/CD concepts and tools (e.g., Jenkins, Git), enough to understand developer pipelines and identify friction points</li>\n</ul>\n<ul>\n<li>Problem reproduction and triage skills, with the ability to recreate reported issues in the environment and clearly document/escalate to engineering with relevant context</li>\n</ul>\n<ul>\n<li>Experience leveraging AI tools (e.g., LLM-based assistants, copilots) to assist in problem diagnosis, research, and knowledge synthesis</li>\n</ul>\n<ul>\n<li>Basic scripting literacy (Bash, Python), enough to read, understand, and run existing scripts; not necessarily to write complex automation from scratch</li>\n</ul>\n<p>Qualifications / Skills Desired:</p>\n<ul>\n<li>Familiarity with serverless compute concepts and cloud-native development paradigms</li>\n</ul>\n<ul>\n<li>Exposure to configuration management tools (e.g., Ansible) and image lifecycle management (e.g., HashiCorp Packer), understanding what they do and how they fit into the platform, rather than hands-on administration</li>\n</ul>\n<ul>\n<li>Awareness of monitoring and observability tools (Prometheus, Grafana, ELK stack) from a user/consumer perspective</li>\n</ul>\n<ul>\n<li>Understanding of authentication and identity management concepts (e.g., Active Directory integration) as they relate to developer access and workflows</li>\n</ul>\n<ul>\n<li>Experience with agile project management methodologies and tools (Jira, Confluence, or similar)</li>\n</ul>\n<ul>\n<li>Strong communication skills working with engineering leadership, the developer community, and stakeholders</li>\n</ul>\n<ul>\n<li>Bachelor’s degree in Computer Science or a related field</li>\n</ul>\n<p>The estimated base salary range 
for this position is $175,000 to $250,000, which is specific to New York and may change in the future. Millennium offers a total compensation package which includes a base salary, a discretionary performance bonus, and a comprehensive benefits package.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_34fa7d64-89a","directApply":true,"hiringOrganization":{"@type":"Organization","name":"IT Infrastructure","sameAs":"https://mlp.eightfold.ai","logo":"https://logos.yubhub.co/mlp.eightfold.ai.png"},"x-apply-url":"https://mlp.eightfold.ai/careers/job/755953932410","x-work-arrangement":null,"x-experience-level":null,"x-job-type":"full-time","x-salary-range":"$175,000 to $250,000","x-skills-required":["Technical Product Management","Technical Project Management","Developer Relations/Developer Experience","Linux desktop environments","Containerization and orchestration","CI/CD concepts and tools","Problem reproduction and triage skills","AI tools","Basic scripting literacy"],"x-skills-preferred":["Serverless compute concepts and cloud-native development paradigms","Configuration management tools","Image lifecycle management","Monitoring and observability tools","Authentication and identity management concepts","Agile project management methodologies and tools"],"datePosted":"2026-04-18T22:13:03.074Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"New York, New York, United States of America"}},"employmentType":"FULL_TIME","occupationalCategory":"IT","industry":"Technology","skills":"Technical Product Management, Technical Project Management, Developer Relations/Developer Experience, Linux desktop environments, Containerization and orchestration, CI/CD concepts and tools, Problem reproduction and triage skills, AI tools, Basic scripting literacy, Serverless compute concepts and cloud-native development paradigms, 
Configuration management tools, Image lifecycle management, Monitoring and observability tools, Authentication and identity management concepts, Agile project management methodologies and tools","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":175000,"maxValue":250000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_c6831d5f-7e9"},"title":"Principal AI Ops Architect, GPS","description":"<p><strong>Role Overview</strong></p>\n<p>Scale&#39;s rapidly growing Global Public Sector team is focused on using AI to address critical challenges facing the public sector around the world.</p>\n<p>Our core work consists of creating custom AI applications that will impact millions of citizens, generating high-quality training data for national LLMs, and providing upskilling and advisory services to spread the impact of AI.</p>\n<p>As a Principal AI Ops Architect, you will design and develop the production lifecycle of full-stack AI applications, while supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and the resilient cloud infrastructure required for our international government partners.</p>\n<p>At Scale, we&#39;re not just building AI solutions, we&#39;re enabling the public sector to transform their operations and better serve citizens through cutting-edge technology.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Own the production outcome: Take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies.</li>\n<li>Ensure Full-Stack integrity: Oversee the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components, from APIs to UI, to maintain a responsive and production-ready environment.</li>\n<li>Scale the 
feedback loop: Build automated systems to monitor model performance and data drift across geographically dispersed environments, ensuring the right levels of reliability.</li>\n<li>Navigate global compliance: Manage the technical lifecycle within diverse regulatory frameworks.</li>\n<li>Incident command: Lead the response for production issues in mission-critical environments, ensuring rapid resolution and building the guardrails to prevent them from happening again.</li>\n<li>Bridge the gap: Translate deep technical performance metrics into clear insights for senior international government officials.</li>\n<li>Drive product evolution: Partner with our Engineering and ML teams to ensure the lessons learned in the field directly influence the technical architecture and decisions of future use cases.</li>\n</ul>\n<p><strong>Ideal Candidate</strong></p>\n<ul>\n<li>Experience: 6+ years in a high-impact technical role (SRE, FDE, or MLOps) with experience in the public sector.</li>\n<li>Global perspective: Familiarity with international government security standards and the complexities of deploying sovereign AI.</li>\n<li>System architecture proficiency: Proven experience maintaining production-grade applications with a deep understanding of the full request lifecycle, connecting frontend/API layers to the backend and AI core.</li>\n<li>Modern AI Stack expertise: Proficiency in coding and modern AI infrastructure, including Kubernetes, vector databases, agentic development, and LLM observability tools.</li>\n<li>Ownership: You treat every production deployment as your own. 
You race toward solving hard problems before the customer even sees them.</li>\n<li>Reliability: You understand that in the public sector, a model failure may be a risk to public safety or privacy.</li>\n<li>Customer communication: The ability to explain to a high-ranking official why the performance of the system has degraded and how we are fixing it.</li>\n</ul>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Competitive salary and benefits package</li>\n<li>Opportunity to work with a leading AI company</li>\n<li>Collaborative and dynamic work environment</li>\n</ul>\n<p><strong>About Us</strong></p>\n<p>At Scale, our mission is to develop reliable AI systems for the world&#39;s most important decisions. Our products provide the high-quality data and full-stack technologies that power the world&#39;s leading models, and help enterprises and governments build, deploy, and oversee AI applications that deliver real impact.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_c6831d5f-7e9","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Scale","sameAs":"https://scale.com/","logo":"https://logos.yubhub.co/scale.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/scaleai/jobs/4671740005","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["AI","Machine Learning","Cloud Computing","Kubernetes","Vector Databases","Agentic Development","LLM Observability Tools","System Architecture","Global Government Security Standards"],"x-skills-preferred":[],"datePosted":"2026-04-18T16:02:05.605Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Doha, Qatar; London, UK"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"AI, Machine Learning, Cloud Computing, Kubernetes, Vector 
Databases, Agentic Development, LLM Observability Tools, System Architecture, Global Government Security Standards"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_ded9d7ff-8aa"},"title":"Senior Engineering Manager, Data Streaming Services (Auth0)","description":"<p><strong>Secure Every Identity, from AI to Human</strong></p>\n<p>Identity is the key to unlocking the potential of AI. As a Senior Engineering Manager, Data Streaming Services at Auth0, you will lead the evolution of our streaming data backbone across a multi-cloud footprint. You will oversee multiple engineering teams dedicated to making data streaming seamless, reliable, and high-performance.</p>\n<p>This is a &quot;manager of managers&quot; role requiring a blend of strategic foresight, execution rigor, and technical grit. You will set the vision for our streaming services, mentor high-performing teams, and take accountability for our service uptime guarantees.</p>\n<p><strong>Key Responsibilities:</strong></p>\n<ul>\n<li>Lead a world-class team of teams. Oversee data streaming infrastructure and services that power our global platform across AWS and Azure.</li>\n<li>Own roadmap and execution. Partner with product and stakeholder teams to define the team&#39;s strategy and prioritized roadmap.</li>\n<li>Drive engineering excellence. Set high standards of quality, reliability, and operational robustness, championing best practices in software development, from code reviews to observability and incident management.</li>\n<li>Lead an automation-first culture. Reduce operational friction and ensure infrastructure is self-healing and code-defined. Draw efficiency from AI-assisted development.</li>\n<li>Act as a technical leader. Lead response on incidents for services under ownership and help teams navigate complex distributed systems failures.</li>\n</ul>\n<p><strong>Requirements:</strong></p>\n<ul>\n<li>Proven engineering leadership, building and leading teams of teams. Experience coaching Staff+ engineers and engineering managers.</li>\n<li>Strong technical and architectural acumen. Background in building scalable, distributed systems. Comfortable participating in and guiding technical discussions.</li>\n<li>Strong project management skills. Expertise in creating technical roadmaps, prioritizing effectively in an agile environment, and managing complex project dependencies.</li>\n<li>Collaborative leadership style, adapted to remote ways of working. Excellent written and verbal communication skills to build strong relationships with stakeholders and inspire others.</li>\n</ul>\n<p><strong>Bonus Points:</strong></p>\n<ul>\n<li>Experience developing data-intensive applications in a modern programming language such as Go, Node.js, or Java.</li>\n<li>Experience with databases such as PostgreSQL and MongoDB.</li>\n<li>Experience with distributed streaming platforms like Kafka.</li>\n<li>Familiarity with concepts in the IAM (Identity and Access Management) domain.</li>\n<li>Experience with cloud providers (AWS, Azure), container technologies such as Kubernetes and Docker, and observability tools such as Datadog.</li>\n<li>Experience building reliable, high-availability platforms for enterprise SaaS applications.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_ded9d7ff-8aa","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Auth0","sameAs":"https://auth0.com/","logo":"https://logos.yubhub.co/auth0.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/okta/jobs/7719329","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$207,000-$284,000 USD","x-skills-required":["engineering leadership","technical and architectural acumen","project management skills","collaborative leadership style","data-intensive applications","databases","distributed streaming platforms","IAM domain","cloud 
providers","container technologies","observability tools"],"x-skills-preferred":["go","node.js","Java","PostgreSQL","MongoDB","Kafka","AWS","Azure","Kubernetes","Docker","Datadog"],"datePosted":"2026-04-18T15:58:08.018Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Chicago, Illinois; New York, New York; Washington, DC"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"engineering leadership, technical and architectural acumen, project management skills, collaborative leadership style, data-intensive applications, databases, distributed streaming platforms, IAM domain, cloud providers, container technologies, observability tools, go, node.js, Java, PostgreSQL, MongoDB, Kafka, AWS, Azure, Kubernetes, Docker, Datadog","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":207000,"maxValue":284000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_0a2ea62c-943"},"title":"Research Engineer, Infrastructure, RL Systems","description":"<p>We&#39;re looking for an infrastructure research engineer to design and build the core systems that enable scalable, efficient training of large models through reinforcement learning.</p>\n<p>This role sits at the intersection of research and large-scale systems engineering: a builder who understands both the algorithms behind RL and the realities of distributed training and inference at scale. 
You&#39;ll wear many hats, from optimising rollout and reward pipelines to enhancing reliability, observability, and orchestration, collaborating closely with researchers and infra teams to make reinforcement learning stable, fast, and production-ready.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Design, build, and optimise the infrastructure that powers large-scale reinforcement learning and post-training workloads.</li>\n</ul>\n<ul>\n<li>Improve the reliability and scalability of RL training pipelines, distributed RL workloads, and training throughput.</li>\n</ul>\n<ul>\n<li>Develop shared monitoring and observability tools to ensure high uptime, debuggability, and reproducibility for RL systems.</li>\n</ul>\n<ul>\n<li>Collaborate with researchers to translate algorithmic ideas into production-grade training pipelines.</li>\n</ul>\n<ul>\n<li>Build evaluation and benchmarking infrastructure that measures model progress on helpfulness, safety, and factuality.</li>\n</ul>\n<ul>\n<li>Publish and share learnings through internal documentation, open-source libraries, or technical reports that advance the field of scalable AI infrastructure.</li>\n</ul>\n<p>We&#39;re looking for someone with strong engineering skills: the ability to contribute performant, maintainable code and to debug complex codebases. You should have a good understanding of deep learning frameworks (e.g., PyTorch, JAX) and their underlying system architectures.</p>\n<p>Experience training or supporting large-scale language models with tens of billions of parameters or more is a plus. Familiarity with monitoring and observability tools (Prometheus, Grafana, OpenTelemetry) is also a plus.</p>\n<p>Logistics:</p>\n<ul>\n<li>Location: This role is based in San Francisco, California.</li>\n</ul>\n<ul>\n<li>Compensation: Depending on background, skills and experience, the expected annual salary range for this position is $350,000 - $475,000 USD.</li>\n</ul>\n<ul>\n<li>Visa sponsorship: We sponsor visas. 
While we can&#39;t guarantee success for every candidate or role, if you&#39;re the right fit, we&#39;re committed to working through the visa process together.</li>\n</ul>\n<ul>\n<li>Benefits: Thinking Machines offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_0a2ea62c-943","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Thinking Machines Lab","sameAs":"https://thinkingmachineslab.com/","logo":"https://logos.yubhub.co/thinkingmachineslab.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/thinkingmachines/jobs/5013930008","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$350,000 - $475,000 USD","x-skills-required":["deep learning frameworks","PyTorch","JAX","complex codebases","scalable AI infrastructure","large-scale language models","monitoring and observability tools"],"x-skills-preferred":["experience training or supporting large-scale language models","familiarity with monitoring and observability tools"],"datePosted":"2026-04-18T15:56:59.642Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"deep learning frameworks, PyTorch, JAX, complex codebases, scalable AI infrastructure, large-scale language models, monitoring and observability tools, experience training or supporting large-scale language models, familiarity with monitoring and observability 
tools","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":350000,"maxValue":475000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_53ee0ef3-c62"},"title":"Staff Data Engineer, Analytics Data Engineering","description":"<p>We are looking for a Staff Data Engineer to join our Analytics Data Engineering (ADE) team within Data Science &amp; AI Platform. As a Staff Data Engineer, you will be responsible for solving cross-cutting data challenges that span multiple lines of business while driving standardization in how we build, deploy, and govern analytics pipelines across Dropbox.</p>\n<p>This is not a maintenance role. We are modernizing our analytics platform, upgrading orchestration infrastructure, building shared and reusable data models with conformed dimensions, establishing a certified metrics framework, and laying the foundation for AI-native data development. You will partner closely with Data Science, Data Infrastructure, Product Engineering, and Business Intelligence teams to make this happen.</p>\n<p>You will play a crucial role in establishing analytics engineering standards, designing scalable data models, and driving cross-functional alignment on data governance. 
You will get substantial exposure to senior leadership, shape the technical direction of analytics infrastructure at Dropbox, and directly influence how data powers product and business decisions.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Lead the design and implementation of shared, reusable data models, defining shared fact tables, conformed dimensions, and a semantic/metrics layer that serves as the single source of truth across analytics functions</li>\n</ul>\n<ul>\n<li>Drive standardization of data engineering practices across ADE and functional analytics teams, including pipeline patterns, CI/CD workflows, naming conventions, and data modeling standards</li>\n</ul>\n<ul>\n<li>Partner with Data Infrastructure to modernize orchestration, improve pipeline decomposition, and establish secure dev/test environments with production data access</li>\n</ul>\n<ul>\n<li>Architect and implement a shift-left data governance strategy, working with upstream data producers to establish data contracts, SLOs, and code-enforced quality gates that catch issues before production</li>\n</ul>\n<ul>\n<li>Collaborate with Data Science leads and Product Management to translate metric definitions into reliable, certified data pipelines that power executive dashboards, WBR reporting, and growth measurement</li>\n</ul>\n<ul>\n<li>Reduce operational burden by improving pipeline granularity, observability, and failure recovery, establishing runbooks and alerting standards that make on-call sustainable</li>\n</ul>\n<ul>\n<li>Evaluate and integrate AI-native tooling into the data development lifecycle, enabling conversational data exploration with guardrails and AI-assisted pipeline development</li>\n</ul>\n<p>Requirements:</p>\n<ul>\n<li>BS degree in Computer Science or related technical field, or equivalent technical experience</li>\n</ul>\n<ul>\n<li>12+ years of experience in data engineering or analytics engineering with increasing scope and technical leadership</li>\n</ul>\n<ul>\n<li>12+ 
years of SQL experience, including complex analytical queries, window functions, and performance optimization at scale (Spark SQL)</li>\n</ul>\n<ul>\n<li>8+ years of Python development experience, including building and maintaining production data pipelines</li>\n</ul>\n<ul>\n<li>Deep expertise in dimensional data modeling, schema design, and scalable data architecture, with hands-on experience building shared data models across multiple business domains</li>\n</ul>\n<ul>\n<li>Strong experience with orchestration tools (Airflow strongly preferred) and dbt, including pipeline design, scheduling strategies, and failure recovery patterns</li>\n</ul>\n<p>Preferred Qualifications:</p>\n<ul>\n<li>Experience with Databricks (Unity Catalog, Delta Lake) and modern lakehouse architectures</li>\n</ul>\n<ul>\n<li>Experience leading orchestration or platform modernization efforts at scale</li>\n</ul>\n<ul>\n<li>Familiarity with data governance and observability tools such as Atlan, Monte Carlo, Great Expectations, or similar</li>\n</ul>\n<ul>\n<li>Experience building or contributing to a metrics/semantic layer (dbt MetricFlow, Databricks Metric Views, or equivalent)</li>\n</ul>\n<ul>\n<li>Track record of establishing data engineering standards and best practices in a federated analytics organization</li>\n</ul>\n<p>Compensation:</p>\n<p>US Zone 2 $198,900-$269,100 USD</p>\n<p>US Zone 3 $176,800-$239,200 USD</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_53ee0ef3-c62","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Dropbox","sameAs":"https://www.dropbox.com/","logo":"https://logos.yubhub.co/dropbox.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/dropbox/jobs/7595183","x-work-arrangement":"remote","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$198,900-$269,100 
USD","x-skills-required":["SQL","Python","Dimensional data modeling","Schema design","Scalable data architecture","Orchestration tools","dbt"],"x-skills-preferred":["Databricks","Modern lakehouse architectures","Data governance and observability tools","Metrics/semantic layer"],"datePosted":"2026-04-18T15:56:35.190Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Remote - US: Select locations"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"SQL, Python, Dimensional data modeling, Schema design, Scalable data architecture, Orchestration tools, dbt, Databricks, Modern lakehouse architectures, Data governance and observability tools, Metrics/semantic layer","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":198900,"maxValue":269100,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_7bc4518a-7e3"},"title":"AI Applications Ops Lead, GPS","description":"<p><strong>Role Overview</strong></p>\n<p>Scale&#39;s rapidly growing Global Public Sector team is focused on using AI to address critical challenges facing the public sector around the world.</p>\n<p>Our core work consists of creating custom AI applications that will impact millions of citizens, generating high-quality training data for national LLMs, and upskilling and advisory services to spread the impact of AI.</p>\n<p>As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications, while supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and the resilient cloud infrastructure required for our international government partners.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Own the production outcome: Take 
full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies.</li>\n</ul>\n<ul>\n<li>Ensure Full-Stack integrity: Oversee the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components, from APIs to UI, to maintain a responsive and production-ready environment.</li>\n</ul>\n<ul>\n<li>Scale the feedback loop: Build automated systems to monitor model performance and data drift across geographically dispersed environments, ensuring the right levels of reliability.</li>\n</ul>\n<ul>\n<li>Navigate global compliance: Manage the technical lifecycle within diverse regulatory frameworks.</li>\n</ul>\n<ul>\n<li>Incident command: Lead the response for production issues in mission-critical environments, ensuring rapid resolution and building the guardrails to prevent them from happening again.</li>\n</ul>\n<ul>\n<li>Bridge the gap: Translate deep technical performance metrics into clear insights for senior international government officials.</li>\n</ul>\n<ul>\n<li>Drive product evolution: Partner with our Engineering and ML teams to ensure the lessons learned in the field directly influence the technical architecture and decisions of future use cases.</li>\n</ul>\n<p><strong>Ideal Candidate</strong></p>\n<ul>\n<li>Experience: 6+ years in a high-impact technical role (SRE, FDE, or MLOps) with experience in the public sector.</li>\n</ul>\n<ul>\n<li>Global perspective: Familiarity with international government security standards and the complexities of deploying sovereign AI.</li>\n</ul>\n<ul>\n<li>System architecture proficiency: Proven experience maintaining production-grade applications with a deep understanding of the full request lifecycle, connecting frontend/API layers to the backend and AI core.</li>\n</ul>\n<ul>\n<li>Modern AI Stack expertise: Proficiency in coding and modern AI infrastructure, including Kubernetes, vector databases, agentic 
development, and LLM observability tools.</li>\n</ul>\n<ul>\n<li>Ownership: You treat every production deployment as your own. You race toward solving hard problems before the customer even sees them.</li>\n</ul>\n<ul>\n<li>Reliability: You understand that in the public sector, a model failure may be a risk to public safety or privacy.</li>\n</ul>\n<ul>\n<li>Customer communication: The ability to explain to a high-ranking official why the performance of the system has degraded and how we are fixing it.</li>\n</ul>\n<p><strong>About Us</strong></p>\n<p>At Scale, our mission is to develop reliable AI systems for the world&#39;s most important decisions. Our products provide the high-quality data and full-stack technologies that power the world&#39;s leading models, and help enterprises and governments build, deploy, and oversee AI applications that deliver real impact.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_7bc4518a-7e3","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Scale","sameAs":"https://scale.com/","logo":"https://logos.yubhub.co/scale.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/scaleai/jobs/4654510005","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Kubernetes","Vector databases","Agentic development","LLM observability tools","SRE","FDE","MLOps"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:56:02.011Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Doha, Qatar; London, UK"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Kubernetes, Vector databases, Agentic development, LLM observability tools, SRE, FDE, 
MLOps"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_6b0282a9-9ee"},"title":"Staff Software Engineer, Observability","description":"<p>We are seeking a highly experienced Staff Software Engineer to lead our efforts in building, maintaining, and optimizing highly scalable, reliable, and secure systems. The Observability team is responsible for deploying and maintaining critical infrastructure at CoreWeave including our logging, tracing, and metrics platforms as well as the pipelines that feed them.</p>\n<p>Key Responsibilities:</p>\n<ul>\n<li>Lead and mentor engineers, fostering a culture of collaboration and continuous improvement.</li>\n<li>Scale logging, tracing, and metrics platforms to support a global datacenter footprint.</li>\n<li>Develop and refine monitoring and alerting to enhance system reliability.</li>\n<li>Advise engineers across CoreWeave on optimal usage of Observability systems.</li>\n<li>Automate interactions with CoreWeave&#39;s Compute Infrastructure layer.</li>\n<li>Manage production clusters and ensure development teams follow best practices for deployments.</li>\n</ul>\n<p>Required Qualifications:</p>\n<ul>\n<li>7+ years of experience in Software Engineering, Site Reliability Engineering, DevOps, or a related field.</li>\n<li>Deep expertise across all observability pillars using tools like ClickHouse, Elastic, Loki, Victoria Metrics, Prometheus, Thanos and/or Grafana.</li>\n<li>Expertise in Kubernetes, containerization, and microservices architectures.</li>\n<li>Proven track record of leading incident management and post-mortem analysis.</li>\n<li>Excellent problem-solving, analytical, and communication skills.</li>\n</ul>\n<p>Preferred Qualifications:</p>\n<ul>\n<li>Experience running and scaling observability tools as a cloud provider.</li>\n<li>Experience administering large-scale kubernetes clusters.</li>\n<li>Deep understanding of data-streaming 
systems.</li>\n</ul>\n<p>The base salary range for this role is $188,000 to $250,000.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_6b0282a9-9ee","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4577361006","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$188,000 to $250,000","x-skills-required":["ClickHouse","Elastic","Loki","Victoria Metrics","Prometheus","Thanos","Grafana","Kubernetes","containerization","microservices architectures"],"x-skills-preferred":["Experience running and scaling observability tools as a cloud provider","Experience administering large-scale kubernetes clusters","Deep understanding of data-streaming systems"],"datePosted":"2026-04-18T15:54:03.521Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"ClickHouse, Elastic, Loki, Victoria Metrics, Prometheus, Thanos, Grafana, Kubernetes, containerization, microservices architectures, Experience running and scaling observability tools as a cloud provider, Experience administering large-scale kubernetes clusters, Deep understanding of data-streaming systems","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":188000,"maxValue":250000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_6f3a053e-c43"},"title":"Staff Software Engineer, AI Reliability Engineering","description":"<p>We&#39;re seeking a Staff 
Software Engineer to join our AI Reliability Engineering team. As a key member of our team, you will develop Service Level Objectives for large language model serving systems, design and implement monitoring and observability systems, and lead incident response for critical AI services.</p>\n<p>You will work closely with teams across Anthropic to improve reliability across our most critical serving paths. You will be responsible for making the systems that deliver Claude more robust and resilient, whether during an incident or collaborating on projects.</p>\n<p>To be successful in this role, you should have strong distributed systems, infrastructure, or reliability backgrounds. You should be curious and brave, comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don&#39;t have deep expertise yet.</p>\n<p>You will be working on high-availability serving infrastructure across multiple regions and cloud providers. You will support the reliability of safeguard model serving, which is critical for both site reliability and Anthropic&#39;s safety commitments.</p>\n<p>If you&#39;re committed to creating reliable, interpretable, and steerable AI systems, and you&#39;re passionate about working on complex technical problems, we&#39;d love to hear from you.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_6f3a053e-c43","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5101169008","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"€235.000-€295.000 EUR","x-skills-required":["distributed systems","infrastructure","reliability","Service Level 
Objectives","monitoring","observability","incident response","high-availability serving infrastructure","cloud providers"],"x-skills-preferred":["SRE","Production Engineer","chaos engineering","systematic resilience testing","AI-specific observability tools and frameworks","ML hardware accelerators","RDMA","InfiniBand"],"datePosted":"2026-04-18T15:53:59.220Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Dublin, IE"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"distributed systems, infrastructure, reliability, Service Level Objectives, monitoring, observability, incident response, high-availability serving infrastructure, cloud providers, SRE, Production Engineer, chaos engineering, systematic resilience testing, AI-specific observability tools and frameworks, ML hardware accelerators, RDMA, InfiniBand"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_09e766cb-2a4"},"title":"Software Engineer, Enterprise Integrations","description":"<p>About Us</p>\n<p>At Cloudflare, we are on a mission to help build a better Internet. We protect and accelerate any Internet application online without adding hardware, installing software, or changing a line of code. Internet properties powered by Cloudflare all have web traffic routed through its intelligent global network, which gets smarter with every request.</p>\n<p>Available Locations: Austin, Texas</p>\n<p>About the Department</p>\n<p>Cloudflare&#39;s Enterprise Integrations Engineering Team designs, builds, and maintains integrations across a wide range of SaaS applications used throughout the organization. Our mission is to create scalable, reliable, and maintainable systems that ensure data flows securely and efficiently between platforms.</p>\n<p>What You&#39;ll Do</p>\n<p>We&#39;re looking for a software engineer to join our Enterprise Integrations Team. 
You&#39;ll work on building and maintaining integration workflows between Cloudflare and a variety of SaaS applications. This includes taking work from concept through implementation, including gathering requirements, writing technical specifications, development, testing, and deployment. You&#39;ll collaborate closely with internal teams to ensure integrations meet business needs and are built following engineering best practices. As you grow in the role, you&#39;ll have the opportunity to lead larger initiatives and own projects from end to end.</p>\n<p>Qualifications &amp; Skills Required:</p>\n<ul>\n<li>Bachelor’s degree in Computer Science or a related field, or equivalent work experience</li>\n<li>Minimum of 5 years of professional experience as a software engineer</li>\n<li>Experience working with internal stakeholders to solve business problems through integration solutions</li>\n<li>Proficiency in Golang</li>\n<li>Experience building RESTful APIs with proper service security practices</li>\n<li>Experience working with observability tools such as Grafana, Prometheus, Sentry, or Kibana</li>\n<li>Experience with Kubernetes</li>\n<li>Experience with GitLab or other CI/CD tools</li>\n</ul>\n<p>Nice to Have:</p>\n<ul>\n<li>Experience working with ERP systems such as Oracle or NetSuite</li>\n<li>Experience working in an Agile Scrum environment</li>\n<li>Familiarity with tools like Jira and Confluence</li>\n<li>Familiarity with integration patterns such as pub/sub, CDM (Common Data Model), and batch processing</li>\n<li>Experience working with PostgreSQL</li>\n<li>Experience with Cloudflare Developer’s Platform</li>\n</ul>\n<p>What Makes Cloudflare Special?</p>\n<p>We’re not just a highly ambitious, large-scale technology company. We’re a highly ambitious, large-scale technology company with a soul. 
Fundamental to our mission to help build a better Internet is protecting the free and open Internet.</p>\n<p>Project Galileo: Since 2014, we&#39;ve equipped more than 2,400 journalism and civil society organizations in 111 countries with powerful tools to defend themselves against attacks that would otherwise censor their work - technology already used by Cloudflare’s enterprise customers - at no cost.</p>\n<p>Athenian Project: In 2017, we created the Athenian Project to ensure that state and local governments have the highest level of protection and reliability for free, so that their constituents have access to election information and voter registration. Since the project&#39;s launch, we&#39;ve provided services to more than 425 local government election websites in 33 states.</p>\n<p>1.1.1.1: We released 1.1.1.1 to help fix the foundation of the Internet by building a faster, more secure and privacy-centric public DNS resolver. This is available publicly for everyone to use - it is the first consumer-focused service Cloudflare has ever released.</p>\n<p>Here’s the deal - we never, ever store client IP addresses. We will continue to abide by our privacy commitment and ensure that no user data is sold to advertisers or used to target consumers.</p>\n<p>Sound like something you’d like to be a part of? We’d love to hear from you!</p>\n<p>This position may require access to information protected under U.S. export control laws, including the U.S. Export Administration Regulations. Please note that any offer of employment may be conditioned on your authorization to receive software or technology controlled under these U.S. export laws without sponsorship for an export license.</p>\n<p>Cloudflare is proud to be an equal opportunity employer. We are committed to providing equal employment opportunity for all people and place great value in both diversity and inclusiveness. 
All qualified applicants will be considered for employment without regard to their, or any other person&#39;s, perceived or actual race, color, religion, sex, gender, gender identity, gender expression, sexual orientation, national origin, ancestry, citizenship, age, physical or mental disability, medical condition, family care status, or any other basis protected by law.</p>\n<p>We are an AA/Veterans/Disabled Employer. Cloudflare provides reasonable accommodations to qualified individuals with disabilities. Please tell us if you require a reasonable accommodation to apply for a job. Examples of reasonable accommodations include, but are not limited to, changing the application process, providing documents in an alternate format, using a sign language interpreter, or using specialized equipment. If you require a reasonable accommodation to apply for a job, please contact us via e-mail at hr@cloudflare.com or via mail at 101 Townsend St. San Francisco, CA 94107.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_09e766cb-2a4","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Cloudflare","sameAs":"https://www.cloudflare.com/","logo":"https://logos.yubhub.co/cloudflare.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/cloudflare/jobs/7336735","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Golang","RESTful APIs","Observability tools","Kubernetes","GitLab"],"x-skills-preferred":["ERP systems","Agile Scrum","Jira","Confluence","Integration patterns","PostgreSQL","Cloudflare Developer’s Platform"],"datePosted":"2026-04-18T15:52:36.450Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Hybrid"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Golang, RESTful 
APIs, Observability tools, Kubernetes, GitLab, ERP systems, Agile Scrum, Jira, Confluence, Integration patterns, PostgreSQL, Cloudflare Developer’s Platform"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_709b405a-48b"},"title":"Staff / Senior Software Engineer, AI Reliability","description":"<p>We&#39;re seeking a Staff / Senior Software Engineer, AI Reliability to join our team. As a key member of our AIRE (AI Reliability Engineering) team, you will partner with teams across Anthropic to improve reliability across our most critical serving paths. You will develop Service Level Objectives for large language model serving systems, design and implement monitoring and observability systems, assist in the design and implementation of high-availability serving infrastructure, lead incident response for critical AI services, and support the reliability of safeguard model serving.</p>\n<p>You may be a good fit for this role if you have strong distributed systems, infrastructure, or reliability backgrounds, are curious and brave, think holistically about how systems compose and where the seams are, can build lasting relationships across teams, care about users and feel ownership over outcomes, have excellent communication and collaboration skills, and bring diverse experience.</p>\n<p>Strong candidates may also have experience operating large-scale model serving or training infrastructure, experience with one or more ML hardware accelerators, understanding of ML-specific networking optimizations, expertise in AI-specific observability tools and frameworks, experience with chaos engineering and systematic resilience testing, and contributions to open-source infrastructure or ML tooling.</p>\n<p>We offer competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a lovely office space in which to collaborate with 
colleagues. We value impact and believe that the highest-impact AI research will be big science. We work as a single cohesive team on just a few large-scale research efforts and value communication skills.</p>\n<p>If you&#39;re interested in this role, please submit an application even if you don&#39;t believe you meet every single qualification. We encourage diversity and strive to include a range of diverse perspectives on our team.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_709b405a-48b","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5113224008","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$325,000-$485,000 USD","x-skills-required":["distributed systems","infrastructure","reliability","Service Level Objectives","monitoring and observability systems","high-availability serving infrastructure","incident response","safeguard model serving"],"x-skills-preferred":["large-scale model serving or training infrastructure","ML hardware accelerators","ML-specific networking optimizations","AI-specific observability tools and frameworks","chaos engineering and systematic resilience testing","open-source infrastructure or ML tooling"],"datePosted":"2026-04-18T15:52:16.313Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"distributed systems, infrastructure, reliability, Service Level Objectives, monitoring and observability systems, high-availability serving infrastructure, incident response, safeguard model serving, 
large-scale model serving or training infrastructure, ML hardware accelerators, ML-specific networking optimizations, AI-specific observability tools and frameworks, chaos engineering and systematic resilience testing, open-source infrastructure or ML tooling","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":325000,"maxValue":485000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_0ae48270-bef"},"title":"Senior Software Engineer, Storage Engineer","description":"<p>The Storage Engine Organisation at CoreWeave is responsible for the product capabilities and data plane function of CoreWeave&#39;s managed storage products.</p>\n<p>We build reliable, scalable storage solutions with segment-leading performance. Storage Engine works with engineering teams across infrastructure, compute, and platform to ensure our storage services meet the needs of the world&#39;s most demanding AI workloads.</p>\n<p>The role involves designing and implementing distributed storage solutions to support scaling data-intensive AI workloads, contributing to the development of exabyte-scale, S3-compatible object storage, and integrating dedicated storage clusters into diverse customer environments.</p>\n<p>Key responsibilities include working with technologies such as RDMA, GPU Direct Storage, and distributed filesystem protocols like NFS or FUSE to optimise storage performance and efficiency, participating in efforts to improve the reliability, durability, and observability of our storage stack, collaborating with operations teams to monitor, troubleshoot, and improve storage systems in production environments, and helping develop metrics and dashboards to provide visibility into storage performance and health.</p>\n<p>The ideal candidate will have a strong background in storage systems engineering or infrastructure, with experience working with 
object storage or distributed filesystems in production environments, proficiency in a systems programming language like Go, C, or Rust, and familiarity with storage observability tools and telemetry pipelines.</p>\n<p>As a senior software engineer, you will be responsible for designing, developing, and deploying scalable and efficient storage solutions, working closely with cross-functional teams to ensure seamless integration with other components of the platform, and mentoring junior engineers to help them grow in their roles.</p>\n<p>If you are passionate about building high-performance storage solutions and have a strong background in software engineering, we encourage you to apply for this exciting opportunity.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_0ae48270-bef","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4643524006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$139,000 to $204,000","x-skills-required":["Storage systems engineering","Infrastructure","Object storage","Distributed filesystems","RDMA","GPU Direct Storage","NFS","FUSE","Systems programming languages (Go, C, Rust)","Storage observability tools","Telemetry pipelines"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:51:26.395Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ/ New York , NY / Sunnyvale, CA / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Storage systems engineering, Infrastructure, Object storage, Distributed filesystems, RDMA, GPU Direct Storage, NFS, FUSE, Systems programming 
languages (Go, C, Rust), Storage observability tools, Telemetry pipelines","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":139000,"maxValue":204000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_f2c6f765-eca"},"title":"Staff Engineer, Storage Control Plane","description":"<p>We&#39;re looking for a Staff Storage Engineer to play a key role in designing, building, and operating the control plane for our high-performance AI storage platform. You&#39;ll help evolve CoreWeave&#39;s storage systems by building reliable, scalable, and high-throughput solutions that power some of the largest and most innovative AI workloads in the world.</p>\n<p>This role involves close collaboration with teams across infrastructure, compute, and platform to ensure our storage services scale automatically and seamlessly while maximizing performance and reliability.</p>\n<p>About the role:</p>\n<ul>\n<li>Design and implement a highly scalable multi-tenant control plane that supports CoreWeave&#39;s growing AI storage and cloud infrastructure needs.</li>\n</ul>\n<ul>\n<li>Contribute to the development of exabyte-scale, S3-compatible object storage and distributed file systems, and integrate dedicated storage clusters into diverse customer environments.</li>\n</ul>\n<ul>\n<li>Work with technologies such as RDMA, GPU Direct Storage, RoCE, InfiniBand, SPDK, and distributed filesystems to optimize storage performance and efficiency.</li>\n</ul>\n<ul>\n<li>Participate in efforts to improve the reliability, durability, and observability of our storage stack.</li>\n</ul>\n<ul>\n<li>Collaborate with operations teams to monitor, analyze, and optimize storage systems using telemetry, metrics, and dashboards to improve performance, latency, and resilience.</li>\n</ul>\n<ul>\n<li>Work cross-functionally with platform, product, and infrastructure teams to deliver 
seamless storage capabilities across the stack.</li>\n</ul>\n<ul>\n<li>Share your knowledge and mentor other engineers on best practices in building distributed, high-performance systems.</li>\n</ul>\n<p>Who You Are:</p>\n<ul>\n<li>Bachelor&#39;s or Master&#39;s degree in Computer Science, Engineering, or a related field.</li>\n</ul>\n<ul>\n<li>10+ years of experience working in storage systems engineering or infrastructure.</li>\n</ul>\n<ul>\n<li>Strong hands-on experience with object storage or distributed filesystems in production environments.</li>\n</ul>\n<ul>\n<li>Experience with one or more storage protocols (e.g. S3, NFS) and file systems such as Ceph, DAOS, or similar.</li>\n</ul>\n<ul>\n<li>Proficiency in a systems programming language such as Go, C++, or Rust.</li>\n</ul>\n<ul>\n<li>Familiarity with storage observability tools and telemetry pipelines (e.g., ClickHouse, Prometheus, Grafana).</li>\n</ul>\n<ul>\n<li>Solid understanding of cloud-native infrastructure, Kubernetes, and scalable system architecture.</li>\n</ul>\n<ul>\n<li>Strong debugging and problem-solving skills in distributed, high-performance environments.</li>\n</ul>\n<ul>\n<li>Clear communicator, able to work collaboratively across teams and share technical insights effectively.</li>\n</ul>\n<p>Wondering if you&#39;re a good fit? We believe in investing in our people, and value candidates who can bring their own diversified experiences to our teams – even if you aren&#39;t a 100% skill or experience match. Here are a few qualities we&#39;ve found compatible with our team. If some of this describes you, we&#39;d love to talk.</p>\n<p>Why CoreWeave?</p>\n<p>At CoreWeave, we work hard, have fun, and move fast! We&#39;re in an exciting stage of hyper-growth that you will not want to miss out on. We&#39;re not afraid of a little chaos, and we&#39;re constantly learning. 
Our team cares deeply about how we build our product and how we work together, which is represented through our core values:</p>\n<ul>\n<li>Be Curious at Your Core</li>\n</ul>\n<ul>\n<li>Act Like an Owner</li>\n</ul>\n<ul>\n<li>Empower Employees</li>\n</ul>\n<ul>\n<li>Deliver Best-in-Class Client Experiences</li>\n</ul>\n<ul>\n<li>Achieve More Together</li>\n</ul>\n<p>We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and provides the opportunity to develop innovative solutions to complex problems. As we get set for take off, the growth opportunities within the organization are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too. Come join us!</p>\n<p>The base salary range for this role is $165,000 to $242,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).</p>\n<p>What We Offer</p>\n<p>The range we&#39;ve posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate which can include a variety of factors. These include qualifications, experience, interview performance, and location. 
In addition to a competitive salary, we offer a variety of benefits to support your needs, including:</p>\n<ul>\n<li>Medical, dental, and vision insurance</li>\n</ul>\n<ul>\n<li>100% paid for by CoreWeave</li>\n</ul>\n<ul>\n<li>Company-paid Life Insurance</li>\n</ul>\n<ul>\n<li>Voluntary supplemental life insurance</li>\n</ul>\n<ul>\n<li>Short and long-term disability insurance</li>\n</ul>\n<ul>\n<li>Flexible Spending Account</li>\n</ul>\n<ul>\n<li>Health Savings Account</li>\n</ul>\n<ul>\n<li>Tuition Reimbursement</li>\n</ul>\n<ul>\n<li>Ability to Participate in Employee Stock Purchase Program (ESPP)</li>\n</ul>\n<ul>\n<li>Mental Wellness Benefits through Spring Health</li>\n</ul>\n<ul>\n<li>Family-Forming support provided by Carrot</li>\n</ul>\n<ul>\n<li>Paid Parental Leave</li>\n</ul>\n<ul>\n<li>Flexible, full-service childcare support with Kinside</li>\n</ul>\n<ul>\n<li>401(k) with a generous employer match</li>\n</ul>\n<ul>\n<li>Flexible PTO</li>\n</ul>\n<ul>\n<li>Catered lunch each day in our office and data center locations</li>\n</ul>\n<ul>\n<li>A casual work environment</li>\n</ul>\n<ul>\n<li>A work culture focused on innovative disruption</li>\n</ul>\n<p>Our Workplace</p>\n<p>While we prioritize a hybrid work environment, remote work may be considered for candidates located more than 30 miles from an office, based on role requirements for specialized skill sets. New hires will be invited to attend onboarding at one of our hubs within their first month. Teams also gather quarterly to support collaboration.</p>\n<p>California Consumer Privacy Act - California applicants only</p>\n<p>CoreWeave is an equal opportunity employer, committed to fostering an inclusive and supportive workplace. All qualified applicants and candidates will receive consideration for employment without regard to race, color, religion, sex, disability, age, sexual orientation, gender identity, national origin, veteran status, or genetic information. 
As part of this commitment and consistent with the Americans with Disabilities Act (ADA), CoreWeave will ensure that qualified applicants and candidates with disabilities are provided reasonable accommodations for the hiring process, unless such accommodation would cause an undue hardship. If reasonable accommodation is needed, please contact: careers@coreweave.com.</p>\n<p>Export Control Compliance</p>\n<p>This position requires access to export controlled information. To conform to U.S. Government export regulations applicable to that information, applicant must either be (A) a U.S. person, defined as a (i) U.S. citizen or national, (ii) U.S. lawful permanent resident (green card holder), (iii) refugee under 8 U.S.C. § 1157, or (iv) asylee under 8 U.S.C. § 1158, (B) eligible to access the export controlled information without a required export authorization, or (C) eligible and reasonably likely to obtain the required export authorization from the applicable U.S. government agency. CoreWeave may, for legitimate business reasons, decline to pursue any export licensing process.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_f2c6f765-eca","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4669836006","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$165,000 to $242,000","x-skills-required":["object storage","distributed filesystems","storage protocols","file systems","cloud-native infrastructure","Kubernetes","scalable system architecture","systems programming language","Go","C++","Rust","storage observability tools","telemetry 
pipelines"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:51:06.353Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA / Dallas, TX"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"object storage, distributed filesystems, storage protocols, file systems, cloud-native infrastructure, Kubernetes, scalable system architecture, systems programming language, Go, C++, Rust, storage observability tools, telemetry pipelines","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":165000,"maxValue":242000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_fbd265ea-621"},"title":"Software Engineer, Workers Deploy & Config","description":"<p>Join the Workers Deploy &amp; Config team, the engine behind Cloudflare&#39;s unique serverless, edge-computing developer platform. This isn&#39;t just another backend role; you&#39;ll be building the critical, large-scale systems that empower developers worldwide to deploy everything - from a personal static site to full-stack applications serving millions of users.</p>\n<p>In fact, you&#39;ll be building the very foundation that the rest of our developer platform, from Pages to R2, is built upon. You will tackle the complex challenges of distributed systems and high-traffic APIs every single day. Your mission? To build and scale the platform that lets customers upload, configure, and manage their Workers, ensuring it&#39;s incredibly fast, extremely resilient, and scales effortlessly.</p>\n<p>You’ll drive projects from the initial idea to global release, delivering solutions at every layer of the stack. 
You’ll get to master a diverse and modern tech stack, writing high-performance Go, architecting APIs, optimizing storage interactions, building Workers with JavaScript/TypeScript, and managing it all on Kubernetes.</p>\n<p>We&#39;re looking for engineers who are obsessed with the developer experience and thrive on solving large-scale problems with a track record to prove it. If you care as much about the quality of the user&#39;s experience as you do about the quality of your code, and you want to join a high-impact, fast-growing team helping to build a better Internet, we want to talk to you.</p>\n<p>This role is about solving some of the most challenging problems in large scale, distributed systems. You&#39;ll be making a massive, direct impact on the broader developer community. Build &amp; Architect for Massive Scale - Own the core architecture of the Workers control plane, the system that deploys and configures millions of applications globally.</p>\n<p>Proactively identify and eliminate performance bottlenecks, re-architecting critical services to handle exponential growth. Design and implement resilient database schemas and read/write patterns built to support exponential platform growth and long-term usage.</p>\n<p>Evolve our services into a true developer platform, building the foundational capabilities that unlock future products.</p>\n<p>Drive for Extreme Performance &amp; Reliability - Obsess over the developer experience, with a relentless focus on reducing API latency and increasing API availability.</p>\n<p>Own the reliability of one of Cloudflare’s most critical, customer-facing systems. 
Take pride in production ownership by participating in an on-call rotation to ensure our platform is always on.</p>\n<p>Lead, Collaborate, &amp; Innovate - Partner directly with Product Managers and customers to translate complex problems into simple, elegant, and scalable solutions.</p>\n<p>Lead technical design from the ground up, collaborating with a brilliant, globally-distributed team of engineers.</p>\n<p>Act as a mentor and knowledge-sharer, leveling up the entire team.</p>\n<p>Constantly research, prototype, and introduce cutting-edge technologies to solve new classes of problems.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_fbd265ea-621","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Cloudflare","sameAs":"https://www.cloudflare.com/","logo":"https://logos.yubhub.co/cloudflare.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/cloudflare/jobs/7377424","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Strong experience using Go","Experience with Javascript and Typescript","Experience with metrics and observability tools such as Prometheus and Grafana","Experience with SQL and common relational database systems such as PostgreSQL","Experience with Kubernetes or similar deployment tools","Experience with distributed systems","Proven ability to drive projects independently, from concept to implementation – gathering requirements, writing technical specifications, implementing, testing, and releasing","Familiarity with implementing and consuming RESTful APIs"],"x-skills-preferred":["Experience with C++ or Rust","Experience scaling systems to meet increasing performance and usability demands","Experience working on a control and/or data plane","Experience using Cloudflare Workers or Pages","Experience working in frontend 
frameworks such as React","Experience managing interns or mentoring junior engineers","Product mindset and comfortable talking to customers and partners","Familiarity with GraphQL","Familiarity with RPC"],"datePosted":"2026-04-18T15:49:32.037Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Hybrid"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Strong experience using Go, Experience with Javascript and Typescript, Experience with metrics and observability tools such as Prometheus and Grafana, Experience with SQL and common relational database systems such as PostgreSQL, Experience with Kubernetes or similar deployment tools, Experience with distributed systems, Proven ability to drive projects independently, from concept to implementation – gathering requirements, writing technical specifications, implementing, testing, and releasing, Familiarity with implementing and consuming RESTful APIs, Experience with C++ or Rust, Experience scaling systems to meet increasing performance and usability demands, Experience working on a control and/or data plane, Experience using Cloudflare Workers or Pages, Experience working in frontend frameworks such as React, Experience managing interns or mentoring junior engineers, Product mindset and comfortable talking to customers and partners, Familiarity with GraphQL, Familiarity with RPC"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_18646b21-352"},"title":"Senior Enterprise Account Executive - W&B","description":"<p>At CoreWeave, we&#39;re looking for a Senior Enterprise Account Executive to join our team. 
As a quota-carrying, enterprise software sales position, you will be responsible for meeting and exceeding sales goals through generating and closing new opportunities while increasing awareness of Weights &amp; Biases in the marketplace.</p>\n<p>Your primary focus will be on driving new business and account expansion into the San Francisco/West Coast Enterprise territory. You will develop and implement a sales strategy aligned to regional and industry needs to help drive awareness, engagement, and growth. You will also collaborate with technology ecosystem and alliance partners to accelerate new opportunity discovery.</p>\n<p>As a Senior Enterprise Account Executive, you will manage opportunities through the sales cycle from initial inquiry/outbound interaction through to forecasted pipeline. You will meet quarterly and annual revenue objectives for the territory, while reporting on sales, activities, and progress on a regular basis through CRM and sales forecasting tools.</p>\n<p>We are looking for motivated, focused, and coachable sales professionals with experience across the full spectrum of the software sales cycle – prospecting, defining and articulating value proposition, pilot process management, business case development, negotiation, and closing.</p>\n<p>Requirements:</p>\n<ul>\n<li>5+ years of experience in B2B sales and/or account management roles</li>\n<li>Minimum of 7 years direct enterprise selling experience</li>\n<li>Track record of success in closing business</li>\n<li>Excellent negotiation, analytical, financial, and organizational capabilities</li>\n<li>Able to thrive in an evolving, entrepreneurial structure and environment</li>\n<li>Outstanding verbal and written communication skills</li>\n<li>Ability to work at both a tactical and strategic level</li>\n<li>Must possess a can-do, self-starter mentality in a highly collaborative atmosphere</li>\n</ul>\n<p>Preferred:</p>\n<ul>\n<li>Experience selling developer tools/technical 
platforms/observability tools to builders (developers/engineering/platform/DevOps/data/AI/ML)</li>\n<li>Experience selling to AI/ML leaders and builders</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_18646b21-352","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4650861006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$130,000 to $160,000","x-skills-required":["B2B sales","account management","software sales cycle","negotiation","analytical skills","financial skills","organizational skills","communication skills"],"x-skills-preferred":["developer tools","technical platforms","observability tools","AI/ML leadership","AI/ML sales"],"datePosted":"2026-04-18T15:49:10.999Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Sales","industry":"Technology","skills":"B2B sales, account management, software sales cycle, negotiation, analytical skills, financial skills, organizational skills, communication skills, developer tools, technical platforms, observability tools, AI/ML leadership, AI/ML sales","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":130000,"maxValue":160000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_2d198020-3d5"},"title":"Sr. Engineer, Storage","description":"<p>The Storage Engine Team at CoreWeave is responsible for the product capabilities and data plane function of CoreWeave&#39;s managed storage products. 
We build reliable, scalable storage solutions with segment leading performance. Storage engine works with engineering teams across infrastructure, compute, and platform to ensure our storage services meet the needs of the world&#39;s most demanding AI workloads.</p>\n<p>The primary responsibilities of this role include designing and implementing distributed storage solutions to support scaling data-intensive AI workloads, contributing to the development of exabyte-scale, S3-compatible object storage, and integrating dedicated storage clusters into diverse customer environments. Additionally, the successful candidate will work with technologies such as RDMA, GPU Direct Storage, and distributed filesystems protocols such as NFS or FUSE to optimize storage performance and efficiency.</p>\n<p>Key responsibilities also include leading efforts to improve the reliability, durability, security, and observability of our storage stack, collaborating with operations teams to monitor, troubleshoot, and improve storage systems in production environments, setting the bar for developing metrics and dashboards to provide visibility into storage performance and health, analyzing telemetry and system data to drive improvements in throughput, latency, and resilience, and working cross-functionally with platform, product, and infrastructure teams to deliver seamless storage capabilities across the stack.</p>\n<p>A key aspect of this role is sharing knowledge and mentoring other engineers on best practices in building distributed, high-performance systems.</p>\n<p>To be successful in this role, the ideal candidate will have a strong background in storage systems engineering or infrastructure, with a minimum of 8-10 years of experience. They will also have hands-on experience with object storage or distributed filesystems in production environments, as well as proficiency in a systems programming language such as Go, C, or Rust. 
Additionally, they will have experience working with cloud-native infrastructure, Kubernetes, and scalable system architectures, and familiarity with storage observability tools and telemetry pipelines.</p>\n<p>If you&#39;re a motivated and experienced engineer looking to join a dynamic team and contribute to the development of cutting-edge storage solutions, we encourage you to apply for this exciting opportunity.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_2d198020-3d5","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4664429006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$143,000 to $210,000","x-skills-required":["storage systems engineering","infrastructure","object storage","distributed filesystems","RDMA","GPU Direct Storage","NFS","FUSE","cloud-native infrastructure","Kubernetes","scalable system architectures","storage observability tools","telemetry pipelines"],"x-skills-preferred":["Go","C","Rust","distributed systems","high-performance systems","storage performance and efficiency"],"datePosted":"2026-04-18T15:49:07.662Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"storage systems engineering, infrastructure, object storage, distributed filesystems, RDMA, GPU Direct Storage, NFS, FUSE, cloud-native infrastructure, Kubernetes, scalable system architectures, storage observability tools, telemetry pipelines, Go, C, Rust, distributed systems, high-performance systems, storage performance and 
efficiency","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":143000,"maxValue":210000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_782a1c68-325"},"title":"Senior DevOps Engineer","description":"<p>At ZoomInfo, we&#39;re looking for a Senior DevOps Engineer to join our Infrastructure Engineering group. As a Senior DevOps Engineer, you will be responsible for innovation in infrastructure and automation for ZoomInfo Engineering. You will have a strong background in modern infrastructure, with a thorough understanding of industry best practices. You will have a high level of comfort participating in challenging technical discussions and advocating for best practices in a high-paced environment.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Thorough, clear, concise documentation of new and existing standards, procedures, and automated workflows</li>\n<li>Championing of best practices and standards around infrastructure configuration and management</li>\n<li>Experience in creating internal products and managing their software development lifecycle</li>\n<li>Deployment, configuration, and management of infrastructure via infrastructure as code</li>\n<li>Working hands on with cloud infrastructure (AWS, Azure, and GCP)</li>\n<li>Working hands on with container infrastructure (Docker, Kubernetes, ECS, EKS, GKE, GAE, etc.)</li>\n<li>Configuration and management of Linux based tools and third-party cloud services</li>\n<li>Continuous improvement of our infrastructure, ensuring that it is highly available and observable</li>\n</ul>\n<p>Minimum Requirements:</p>\n<ul>\n<li>Solid foundation of experience managing Linux systems in virtual environments (6+ years)</li>\n<li>Deploying and maintaining highly available infrastructure in one or more Cloud providers (5+ years, AWS or GCP preferred)</li>\n<li>Infrastructure as code using 
Terraform (4+ years)</li>\n<li>Creating, deploying, maintaining, and troubleshooting Docker images (4+ years)</li>\n<li>Scoping, deploying, maintaining and troubleshooting Kubernetes clusters (4+ years)</li>\n<li>Developing and maintaining an active codebase in Go, Python preferably (3+ years)</li>\n<li>Experience with PaaS technologies (5+ years, EKS and GKE preferred)</li>\n<li>Maintaining monitoring and observability tools (Datadog, Prometheus preferred)</li>\n<li>Thorough understanding of network infrastructure and concepts (VPNs, routers and routing protocols, TCP/IP, IPv4 and v6, UDP, OSI layers, etc.)</li>\n<li>Experience with load balancing and proxy technologies (Istio, Nginx, HAProxy, Apache, Cloud load balancers, etc.)</li>\n<li>Debugging and troubleshooting complex problems in cloud-native infrastructure.</li>\n<li>Slack native mentality.</li>\n<li>Bachelor’s Degree in Computer Science or a related technical discipline, or the equivalent combination of education, technical certifications, training, or work experience.</li>\n</ul>\n<p>Abilities Required:</p>\n<ul>\n<li>Demonstrated ability to learn new technologies quickly and independently</li>\n<li>Strong technical, organizational and interpersonal skills</li>\n<li>Strong written and verbal communication skills</li>\n<li>Must be able to read, understand, and communicate complex problems and solutions in English over a textual medium (such as Slack)</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a 
href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_782a1c68-325","directApply":true,"hiringOrganization":{"@type":"Organization","name":"ZoomInfo","sameAs":"https://www.zoominfo.com/","logo":"https://logos.yubhub.co/zoominfo.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/zoominfo/jobs/8287254002","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Linux","Cloud infrastructure (AWS, Azure, GCP)","Container infrastructure (Docker, Kubernetes, ECS, EKS, GKE, GAE)","Infrastructure as code (Terraform)","Go","Python","PaaS technologies (EKS, GKE)","Monitoring and observability tools (Datadog, Prometheus)"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:47:10.427Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Ra'anana, Israel"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Linux, Cloud infrastructure (AWS, Azure, GCP), Container infrastructure (Docker, Kubernetes, ECS, EKS, GKE, GAE), Infrastructure as code (Terraform), Go, Python, PaaS technologies (EKS, GKE), Monitoring and observability tools (Datadog, Prometheus)"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_6984004d-b3f"},"title":"Intermediate Backend Engineer, Gitlab Delivery: Upgrades","description":"<p>As a Backend Engineer on the GitLab Upgrades team, you&#39;ll help self-managed customers run GitLab with assurance by building and supporting the deployment tooling, infrastructure, and automation behind how GitLab is installed, upgraded, and operated.</p>\n<p>You&#39;ll work across Omnibus GitLab, GitLab Helm Charts, the GitLab Environment Toolkit (GET), and the GitLab Operator to improve reliability, security, and scalability in production-grade environments. 
This is a hands-on role where you&#39;ll partner with Distribution Engineers, Site Reliability Engineers, Release Managers, Security, and Development teams to make self-managed GitLab easier to use across a wide range of platforms.</p>\n<p>Some examples of our projects:</p>\n<ul>\n<li>Evolve Omnibus GitLab, Helm Charts, GET, and the GitLab Operator to support new GitLab features and architectures</li>\n</ul>\n<ul>\n<li>Improve installation, upgrade, and validation automation for large-scale self-managed GitLab deployments</li>\n</ul>\n<p>Maintain and improve the Omnibus GitLab package so GitLab components work reliably in self-managed deployments.</p>\n<p>Develop and support GitLab Helm Charts for scalable, production-ready Kubernetes deployments.</p>\n<p>Enhance the GitLab Environment Toolkit (GET) and validated reference architectures used by enterprise and internal users.</p>\n<p>Support and extend the GitLab Operator for Kubernetes-native lifecycle management of GitLab installations.</p>\n<p>Improve the installation, upgrade, and day-to-day operating experience across supported self-managed platforms.</p>\n<p>Collaborate with Security to address vulnerabilities and strengthen secure defaults and configurations across the deployment stack.</p>\n<p>Build and maintain automation and continuous integration and continuous deployment pipelines that validate deployment tooling across Omnibus, Charts, GET, and the Operator.</p>\n<p>Partner with Distribution Engineers, Site Reliability Engineers, Release Managers, and Development teams to integrate new features and keep user-facing documentation accurate and useful.</p>\n<p>Experience building and maintaining backend services in production environments, especially in deployment, infrastructure, or platform tooling.</p>\n<p>Practical knowledge of Kubernetes operations, including authoring and maintaining Helm charts.</p>\n<p>Proficiency with Ruby and Go, along with scripting skills to automate workflows and 
tooling.</p>\n<p>Familiarity with Terraform and infrastructure as code practices across cloud and on-premises environments.</p>\n<p>Hands-on experience with relational databases, especially PostgreSQL, including performance and reliability considerations.</p>\n<p>Understanding of secure, scalable, and supportable deployment practices, along with observability tools such as Prometheus and Grafana.</p>\n<p>Experience collaborating in large codebases and distributed teams, including writing clear user-facing documentation and implementation guides.</p>\n<p>Openness to learning new technologies and applying transferable skills across different parts of the GitLab deployment stack.</p>\n<p>The Upgrades team is part of GitLab Delivery and delivers GitLab to self-managed users through supported, validated deployment tooling. The team maintains Omnibus GitLab, Helm Charts, the GitLab Operator, and the GitLab Environment Toolkit (GET) to help self-managed users deploy GitLab securely and reliably across diverse environments. You&#39;ll join a distributed group of backend engineers that works asynchronously across time zones and collaborates closely with Site Reliability Engineering, Release, Security, and Development teams. 
The team is focused on improving installation and upgrade workflows, strengthening automation and security, and helping self-managed customers run GitLab successfully at any scale.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_6984004d-b3f","directApply":true,"hiringOrganization":{"@type":"Organization","name":"GitLab","sameAs":"https://about.gitlab.com/","logo":"https://logos.yubhub.co/about.gitlab.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/gitlab/jobs/8463951002","x-work-arrangement":"remote","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Ruby","Go","Kubernetes","Helm charts","Terraform","infrastructure as code","PostgreSQL","relational databases","observability tools","Prometheus","Grafana"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:46:16.737Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Remote, India"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Ruby, Go, Kubernetes, Helm charts, Terraform, infrastructure as code, PostgreSQL, relational databases, observability tools, Prometheus, Grafana"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_fd1da18e-84d"},"title":"Principal Software Engineer II - Observability","description":"<p>We&#39;re looking for a Principal Software Engineer to join the Observability Experience Team as one of the Tech Leads. 
As part of this team, you will work at the intersection of big data engineering, backend architecture, and experiences to help users obtain the best insights from their Observability signals, especially logs, metrics, and traces.</p>\n<p>Key responsibilities include collaborating with product management, product design, and multiple teams across Elastic to define and evolve the end-to-end experiences for Observability. You will also be a contact point for other teams within Elastic, providing hands-on support and guidance. Additionally, you will help the team define coding practices and standards, foster a culture of mutual respect, collaboration, and consensus-based decision-making, and stay true to the principles of software development as adopted by the team.</p>\n<p>The ideal candidate will have experience leading technical projects in the data and enterprise architecture areas, with a proven knowledge in building and running sophisticated technical infrastructures and engineering sound software systems. They should also have hands-on experience using and developing Observability tools, preferably in the Logs space, and experience mentoring expert engineers, providing technical and professional guidance. 
Furthermore, they should be able to define a long-term technical vision for an area of a data-intensive application, working across teams and organizations to collaboratively build the technical roadmap.</p>\n<p>Bonus points for experience as a user of the Elastic Stack and experience in SRE roles.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_fd1da18e-84d","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Elastic, the Search AI Company","sameAs":"https://www.elastic.co/","logo":"https://logos.yubhub.co/elastic.co.png"},"x-apply-url":"https://job-boards.greenhouse.io/elastic/jobs/7635297","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Observability tools","Logs space","Big data engineering","Backend architecture","Experiences"],"x-skills-preferred":["Elastic Stack","SRE roles"],"datePosted":"2026-04-18T15:45:58.943Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Greece"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Observability tools, Logs space, Big data engineering, Backend architecture, Experiences, Elastic Stack, SRE roles"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_da7679a6-e4f"},"title":"Senior Technical Operations Lead","description":"<p>Job Title: Senior Technical Operations Lead</p>\n<p>We are seeking an experienced Senior Technical Operations Lead to drive operational excellence across our Infrastructure Engineering organization.</p>\n<p>As a Senior Technical Operations Lead, you will design and implement world-class operational processes, establish SRE best practices, and mentor technical teams to achieve exceptional 
reliability and efficiency.</p>\n<p>Key Responsibilities:</p>\n<p>SRE Leadership &amp; Transformation</p>\n<ul>\n<li>Lead the design and implementation of SRE practices and tooling across Infrastructure Engineering</li>\n</ul>\n<ul>\n<li>Establish and cultivate an SRE-focused culture at Zoominfo</li>\n</ul>\n<p>Operational Process Design &amp; Governance</p>\n<ul>\n<li>Establish clear governance frameworks and procedural consistency</li>\n</ul>\n<ul>\n<li>Make decisions about process exceptions and/or changes to accommodate different team contexts</li>\n</ul>\n<ul>\n<li>Design and/or implement process automations using scripts and integrations</li>\n</ul>\n<ul>\n<li>Define functional requirements and goals for process automations</li>\n</ul>\n<ul>\n<li>Conduct hands-on and/or automated audits to ensure process adherence and identify improvement opportunities</li>\n</ul>\n<p>Incident Management &amp; Root Cause Analysis</p>\n<ul>\n<li>Design, implement, and continuously improve Incident Management and Change Management procedures that scale across the organization, using tools such as PagerDuty, Slack, Jira, ServiceNow, and custom integrations</li>\n</ul>\n<ul>\n<li>Lead and participate in root cause analysis sessions, driving teams toward systemic improvements rather than blame</li>\n</ul>\n<ul>\n<li>Design and execute incident dry runs and tabletop exercises to build organizational resilience</li>\n</ul>\n<ul>\n<li>Establish metrics and KPIs that measure incident response effectiveness and drive continuous improvement</li>\n</ul>\n<p>Enable Data-Driven Decision Making</p>\n<ul>\n<li>Identify, define, and automate the tracking of operational KPIs and departmental metrics that matter, enabling senior managers to make informed decisions on the basis of data</li>\n</ul>\n<ul>\n<li>Build and maintain metric dashboards and automated reporting systems that provide real-time visibility into operational health</li>\n</ul>\n<ul>\n<li>Analyze trends and surface opportunities 
for optimization</li>\n</ul>\n<p>Stakeholder Engagement, Training &amp; Mentorship</p>\n<ul>\n<li>Build and maintain strong relationships with Engineering managers, Product Managers, and cross-functional stakeholders across geographies</li>\n</ul>\n<ul>\n<li>Maintain a feedback loop. Meet with stakeholders to understand process pain points.</li>\n</ul>\n<ul>\n<li>Influence others by fostering trust, leading by example, and inspiring them with your expertise and passion for reliability practices.</li>\n</ul>\n<ul>\n<li>Enhance internal knowledge of third-party tools such as Pagerduty, Datadog, and more, by educating Zoominfo employees on these tools.</li>\n</ul>\n<p>Deliver training sessions that make Operational Excellence engaging and motivating for diverse audiences.</p>\n<p>Required Experience &amp; Qualifications:</p>\n<ul>\n<li>Bachelor’s degree in Software Engineering, Operations Management, or related field</li>\n</ul>\n<ul>\n<li>7+ years of hands-on experience in technical operations, Site Reliability Engineering (SRE), Incident Management, or IT Service Management roles within SaaS or technical organizations</li>\n</ul>\n<ul>\n<li>Fluent English proficiency (written and verbal)</li>\n</ul>\n<ul>\n<li>Proven track record designing and implementing operational processes at scale</li>\n</ul>\n<ul>\n<li>Demonstrated expertise in SRE principles, practices, and tooling</li>\n</ul>\n<ul>\n<li>Strong data analysis skills with ability to define metrics, build or design dashboards, and use data to drive strategic decisions</li>\n</ul>\n<ul>\n<li>Proven ability to work effectively in a matrix organizational structure</li>\n</ul>\n<ul>\n<li>Ability and experience working with senior management at global organizations</li>\n</ul>\n<ul>\n<li>Hands-on experience with monitoring and observability tools such as PagerDuty and/or Datadog</li>\n</ul>\n<ul>\n<li>Familiarity with Jira, Confluence, Google Data Studio, or Tableau</li>\n</ul>\n<ul>\n<li>Experience with scripting 
and integrations (Python, JavaScript, Google AppScript, or similar)</li>\n</ul>\n<ul>\n<li>Background in SRE transformation or organizational process improvement initiatives</li>\n</ul>\n<p>#LI-SS4 #LI-Hybrid</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_da7679a6-e4f","directApply":true,"hiringOrganization":{"@type":"Organization","name":"ZoomInfo","sameAs":"https://www.zoominfo.com/","logo":"https://logos.yubhub.co/zoominfo.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/zoominfo/jobs/8451386002","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Site Reliability Engineering (SRE)","Technical Operations","Incident Management","IT Service Management","Monitoring and Observability Tools","Jira","Confluence","Google Data Studio","Tableau","Scripting and Integrations","Python","JavaScript","Google AppScript"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:45:47.393Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Ra'anana, Israel"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Site Reliability Engineering (SRE), Technical Operations, Incident Management, IT Service Management, Monitoring and Observability Tools, Jira, Confluence, Google Data Studio, Tableau, Scripting and Integrations, Python, JavaScript, Google AppScript"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_1a4d732c-42c"},"title":"Principal Site Reliability Engineer - Observability","description":"<p>We&#39;re looking for a Principal Site Reliability Engineer to join the Observability Solution team. 
As a key member of the team, you will collaborate with product management, product design, customers, and multiple teams across Elastic to define and evolve end-to-end InfraObs experiences. You will deliver and continually evolve these experiences, leveraging the Elastic Platform capabilities and coding agents.</p>\n<p>Key responsibilities include being a contact point for other teams within Elastic, fostering a culture of mutual respect, collaboration, and consensus-based decision-making, and being an awesome person to work with.</p>\n<p>To be successful in this role, you will need to have an SRE background and experience operating large-scale production services with the help of Observability tools. You should be proficient in operating production infrastructure in K8s and at least one of the three major CSPs. You will also need to be able to use AI coding agents in the delivery workflow and have excellent verbal and written communication skills.</p>\n<p>Bonus points will be given to those with experience as a user of the Elastic Stack.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_1a4d732c-42c","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Elastic, the Search AI Company","sameAs":"https://www.elastic.co/","logo":"https://logos.yubhub.co/elastic.co.png"},"x-apply-url":"https://job-boards.greenhouse.io/elastic/jobs/7721575","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Site Reliability Engineering","Observability tools","Kubernetes","Cloud Service Providers","AI coding agents"],"x-skills-preferred":["Elastic Stack","Product management","Product 
design","Collaboration","Communication"],"datePosted":"2026-04-18T15:44:28.865Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Spain"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Site Reliability Engineering, Observability tools, Kubernetes, Cloud Service Providers, AI coding agents, Elastic Stack, Product management, Product design, Collaboration, Communication"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_99aa7ac0-2c6"},"title":"Senior Engineering Manager, Data Streaming Services (Auth0)","description":"<p>Secure Every Identity, from AI to Human</p>\n<p>Identity is the key to unlocking the potential of AI. As the Senior Manager of Data Streaming Services, you will lead the evolution of our streaming data backbone across a multi-cloud footprint. You will oversee multiple engineering teams dedicated to making data streaming seamless, reliable, and high-performance.</p>\n<p>This is a &quot;manager of managers&quot; role requiring a blend of strategic foresight, execution rigor, and technical grit. You will set the vision for our streaming services, mentor high-performing teams, and take accountability for our service uptime guarantees.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Lead a world-class team of teams. Oversee data streaming infrastructure and services that power our global platform across AWS and Azure.</li>\n<li>Own roadmap and execution. Partner with product and stakeholder teams to define the team&#39;s strategy and prioritized roadmap.</li>\n<li>Drive engineering excellence. Set high standards of quality, reliability, and operational robustness, championing best practices in software development, from code reviews to observability and incident management.</li>\n<li>Lead an automation-first culture. 
Reduce operational friction and ensure infrastructure is self-healing and code-defined. Draw efficiency from AI-assisted development.</li>\n<li>Act as a technical leader. Lead response on incidents for services under ownership and help teams navigate complex distributed systems failures.</li>\n</ul>\n<p>What you&#39;ll bring:</p>\n<ul>\n<li>Proven engineering leadership, building and leading teams of teams. Experience coaching Staff+ engineers and engineering managers.</li>\n<li>Strong technical and architectural acumen. Background in building scalable, distributed systems. Comfortable participating in and guiding technical discussions.</li>\n<li>Strong project management skills. Expertise in creating technical roadmaps, prioritizing effectively in an agile environment, and managing complex project dependencies.</li>\n<li>Collaborative leadership style, adapted to remote ways of working. Excellent written and verbal communication skills to build strong relationships with stakeholders and inspire others.</li>\n</ul>\n<p>Bonus Points:</p>\n<ul>\n<li>Experience developing data-intensive applications in a modern programming language such as Go, Node.js, or Java.</li>\n<li>Experience with databases such as PostgreSQL and MongoDB.</li>\n<li>Experience with distributed streaming platforms like Kafka.</li>\n<li>Familiarity with concepts in the IAM (Identity and Access Management) domain.</li>\n<li>Experience with cloud providers (AWS, Azure), container technologies such as Kubernetes and Docker, and observability tools such as Datadog.</li>\n<li>Experience building reliable, high-availability platforms for enterprise SaaS applications.</li>\n</ul>\n<p>To learn more about our Total Rewards program please visit: https://rewards.okta.com/us</p>\n<p>The annual base salary range for this position for candidates located in the San Francisco Bay area is between: $194,000-$266,000 CAD</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a 
href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_99aa7ac0-2c6","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Auth0","sameAs":"https://auth0.com/","logo":"https://logos.yubhub.co/auth0.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/okta/jobs/7735781","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$194,000-$266,000 CAD","x-skills-required":["engineering leadership","team management","technical architecture","distributed systems","project management","agile development","cloud providers","container technologies","observability tools"],"x-skills-preferred":["go","node.js","Java","PostgreSQL","MongoDB","Kafka","IAM","Kubernetes","Docker","Datadog"],"datePosted":"2026-04-18T15:43:55.807Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Toronto, Ontario, Canada"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"engineering leadership, team management, technical architecture, distributed systems, project management, agile development, cloud providers, container technologies, observability tools, go, node.js, Java, PostgreSQL, MongoDB, Kafka, IAM, Kubernetes, Docker, Datadog","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":194000,"maxValue":266000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_9238107d-204"},"title":"Software Architect, Reliability Engineering","description":"<p>Join the team as Twilio&#39;s next Reliability Architect.</p>\n<p>As an Architect in SRE, you will drive the technical strategy, vision and outcomes for Twilio&#39;s Reliability Engineering organisation. 
You will define and lead solutions and initiatives that ensure Twilio products are reliable worldwide, and you will define standards and guide engineering teams on best practices for designing, building, and operating resilient systems.</p>\n<p>This role is pivotal to Twilio&#39;s commitment to operational excellence, scalability, and pragmatic, large-scale systems design in the cloud.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Partner with senior technical leaders across Twilio to set and communicate the reliability strategy, translating business goals into measurable outcomes.</li>\n<li>Influence company-wide architectural decisions while balancing long-term vision with near-term needs and compliance requirements.</li>\n<li>Lead the design, implementation, and operation of scalable solutions and paved roads that enable reliable, high-traffic services.</li>\n<li>Influence company-wide architectural decisions to focus on availability, performance, resilience, and cost efficiency using Kubernetes, AWS, Terraform, and modern observability.</li>\n<li>Ensure integrity and quality across the service lifecycle; design fault-tolerant architectures, incident response, disaster recovery, and capacity/cost management.</li>\n<li>Collaborate with product and cross-functional teams to identify reliability risks and convert them into actionable designs, programs, and tooling.</li>\n<li>Establish and champion reliability practices and drive systemic improvements.</li>\n<li>Mentor and grow engineers and technical leaders.</li>\n<li>Track and apply emerging SRE, cloud, and large-scale systems best practices; introduce pragmatic innovations that improve reliability at scale.</li>\n</ul>\n<p>Qualifications:</p>\n<ul>\n<li>15+ years of experience in Reliability Engineering, Software Engineering, or DevOps roles with a focus on infrastructure, backend systems, and reliability, including as a principal/architect.</li>\n<li>Strong experience in driving strategic technical decisions and defining long-term 
technical vision.</li>\n<li>In-depth understanding of the role of Reliability Engineering in a large and diverse SaaS organisation.</li>\n<li>Experience driving cross-org technical architecture outcomes.</li>\n<li>Knowledge of cloud architecture, devops practices, and large-scale systems design with microservices.</li>\n<li>Bachelor&#39;s or Master&#39;s degree in Computer Science, Engineering, or a related field (or equivalent experience).</li>\n<li>Strong production experience, including operational management, scaling, partitioning strategies, and tuning for performance and reliability in high-scale environments.</li>\n<li>Hands-on experience with Kubernetes (e.g., EKS), deploying and managing stateful services, and cloud services like AWS.</li>\n<li>Proficiency in infrastructure-as-code tools such as Terraform or CloudFormation for automating infrastructure.</li>\n<li>Expertise in observability tools (e.g., Prometheus, Grafana, Datadog) for monitoring distributed systems and setting up alerting.</li>\n<li>Proficient in at least one programming language (e.g., Go, Python, Java) for building automation and tooling.</li>\n<li>Experience designing incident response processes, SLOs/SLIs, runbooks, and participating in on-call rotations.</li>\n<li>Experience running cross-functional post-incident reviews and driving improvements.</li>\n<li>Strong understanding of distributed systems principles, including consensus, durability, throughput, and availability tradeoffs.</li>\n<li>Proven track record of leading reliability improvements in data-intensive or mission-critical systems and collaborating with engineering teams.</li>\n<li>Excellent problem-solving, analytical, verbal, and written communication skills, with the ability to work in cross-functional and distributed environments.</li>\n<li>Demonstrated leadership in mentoring teams, influencing decisions, and balancing long-term objectives with short-term needs.</li>\n<li>Ability to influence and build effective 
working relationships with all levels of the organisation.</li>\n</ul>\n<p>Desired:</p>\n<ul>\n<li>Specific experience owning and operating large AWS footprints.</li>\n<li>Knowledge of Kubernetes architecture and concepts.</li>\n<li>Experience with data technologies like Apache Kafka, AWS MSK, or similar for reliable streaming.</li>\n<li>Passion for building reliable products, with prior projects in high-availability systems</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_9238107d-204","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Twilio","sameAs":"https://www.twilio.com/","logo":"https://logos.yubhub.co/twilio.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/twilio/jobs/7658259","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$227,840.00 - $284,800.00 per year","x-skills-required":["Reliability Engineering","Software Engineering","DevOps","Cloud Architecture","Microservices","Kubernetes","AWS","Terraform","Observability Tools","Programming Languages","Incident Response","Distributed Systems Principles"],"x-skills-preferred":["Apache Kafka","AWS MSK","Kubernetes Architecture","Data Technologies"],"datePosted":"2026-04-18T15:42:56.209Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Remote - US"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Reliability Engineering, Software Engineering, DevOps, Cloud Architecture, Microservices, Kubernetes, AWS, Terraform, Observability Tools, Programming Languages, Incident Response, Distributed Systems Principles, Apache Kafka, AWS MSK, Kubernetes Architecture, Data 
Technologies","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":227840,"maxValue":284800,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_3ac0b2f4-6c9"},"title":"Member of Technical Staff - Imagine Product","description":"<p><strong>About the Role</strong></p>\n<p>The Imagine Product team is redefining AI-driven media experiences for Grok users worldwide. You&#39;ll build and scale robust, high-performance systems that power immersive, multi-modal media interactions, leveraging cutting-edge AI to enable seamless generation, processing, and delivery of images, video, audio, and beyond.</p>\n<p>Your work will drive engaging, real-time user experiences that captivate and delight millions, turning advanced multimodal models into production-grade features. If you&#39;re a driven problem-solver passionate about AI, media technologies, and creating scalable solutions that shape the future of consumer AI, this is your opportunity to make a lasting impact.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Design and implement scalable systems to support Grok&#39;s AI-driven media experiences, ensuring high performance, reliability, and low latency at global scale.</li>\n<li>Architect robust infrastructure for real-time multi-modal interactions, including handling generation requests, media processing, and seamless integration with frontend and model serving layers.</li>\n<li>Build and optimise large-scale data pipelines to ingest, process, and analyse multi-modal data (images, video, audio), fueling continuous improvement and personalisation of Grok&#39;s media capabilities.</li>\n<li>Collaborate closely with frontend engineers, AI researchers, and product teams to deliver captivating, media-rich features and end-to-end user experiences.</li>\n<li>Own full-cycle development of solutions: from system design and prototyping 
to deployment, monitoring, observability, and iterative refinement.</li>\n<li>Deliver production-ready, maintainable code that powers features reaching hundreds of millions of users.</li>\n</ul>\n<p><strong>Basic Qualifications</strong></p>\n<ul>\n<li>Proficiency in Python or Rust, with a strong track record of writing clean, efficient, maintainable, and scalable code.</li>\n<li>Experience designing and building systems for consumer-facing products, with emphasis on performance, reliability, and handling high-throughput workloads.</li>\n<li>Hands-on expertise in large-scale data infrastructure and pipelines, particularly for multi-modal or media-heavy AI applications.</li>\n<li>Proven ability to deliver robust, production-grade solutions to millions of users while maintaining high standards of quality and uptime.</li>\n<li>Strong problem-solving skills and a passion for turning innovative ideas into high-impact, scalable realities.</li>\n<li>Deep enthusiasm for AI and media technologies, with a commitment to building user-focused products that inspire and engage.</li>\n</ul>\n<p><strong>Preferred Skills and Experience</strong></p>\n<ul>\n<li>Experience with real-time systems, inference serving, or multi-modal data processing at scale.</li>\n<li>Familiarity with distributed systems, containerisation (e.g., Kubernetes), observability tools, or performance tuning for AI workloads.</li>\n<li>Background in AI-driven consumer products or media generation technologies.</li>\n<li>Track record collaborating across engineering, research, and product teams to ship delightful features quickly.</li>\n</ul>\n<p><strong>Compensation and Benefits</strong></p>\n<p>$180,000 - $440,000 USD</p>\n<p>Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short &amp; long-term disability insurance, life insurance, and various other discounts and perks.</p>\n<p 
style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_3ac0b2f4-6c9","directApply":true,"hiringOrganization":{"@type":"Organization","name":"xAI","sameAs":"https://xAI.com","logo":"https://logos.yubhub.co/xai.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/xai/jobs/5052027007","x-work-arrangement":"onsite","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$180,000 - $440,000 USD","x-skills-required":["Python","Rust","clean, efficient, maintainable, and scalable code","large-scale data infrastructure and pipelines","multi-modal or media-heavy AI applications","production-grade solutions","quality and uptime"],"x-skills-preferred":["real-time systems","inference serving","multi-modal data processing at scale","distributed systems","containerisation","observability tools","performance tuning for AI workloads","AI-driven consumer products","media generation technologies"],"datePosted":"2026-04-18T15:41:51.975Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Palo Alto, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Python, Rust, clean, efficient, maintainable, and scalable code, large-scale data infrastructure and pipelines, multi-modal or media-heavy AI applications, production-grade solutions, quality and uptime, real-time systems, inference serving, multi-modal data processing at scale, distributed systems, containerisation, observability tools, performance tuning for AI workloads, AI-driven consumer products, media generation 
technologies","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":180000,"maxValue":440000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_a2e88648-d1d"},"title":"Mistral Cloud - Site Reliability Engineer","description":"<p>We are seeking highly experienced Site Reliability Engineers (SRE) to shape the reliability, scalability and performance of our Cloud platform and customer facing applications.</p>\n<p>You will work closely with our software engineers and product teams to ensure our systems meet and exceed our internal and external customers&#39; expectations.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Design, build, and maintain scalable, highly available and fault-tolerant infrastructures</li>\n<li>Operate systems and troubleshoot issues in production environments</li>\n<li>Implement and improve monitoring, alerting, and incident response systems</li>\n<li>Implement and maintain workflows and tools for both our customer-facing APIs and large training runs</li>\n</ul>\n<p>Development responsibilities include:</p>\n<ul>\n<li>Drive continuous improvement in infrastructure automation, deployment, and orchestration</li>\n<li>Collaborate with software engineers to develop and implement solutions that enable safe and reproducible model-training experiments</li>\n<li>Help build a cloud platform offering an abstraction layer between science, engineering and infrastructure</li>\n<li>Design and develop new workflows and tooling to improve the reliability, availability and performance of our systems</li>\n</ul>\n<p>Additional responsibilities include:</p>\n<ul>\n<li>Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements</li>\n<li>Document processes and procedures to ensure consistency and knowledge sharing across the team</li>\n<li>Contribute to 
open-source projects, research publications, blog articles and conferences</li>\n</ul>\n<p>About you:</p>\n<ul>\n<li>Master’s degree in Computer Science, Engineering or a related field</li>\n<li>5+ years of experience in a DevOps/SRE role</li>\n<li>Strong experience with bare metal infrastructure and highly available distributed systems</li>\n<li>Exposure to site reliability issues in critical environments</li>\n<li>Experience working against reliability KPIs</li>\n<li>Hands-on experience with CI/CD, containerization and orchestration tools</li>\n<li>Knowledge of monitoring, logging, alerting and observability tools</li>\n<li>Familiarity with infrastructure-as-code tools</li>\n<li>Proficiency in scripting languages and knowledge of software development best practices</li>\n<li>Strong understanding of networking, security, and system administration concepts</li>\n<li>Excellent problem-solving and communication skills</li>\n</ul>\n<p>Your application will be all the more interesting if you also have:</p>\n<ul>\n<li>Experience in an AI/ML environment</li>\n<li>Experience of high-performance computing (HPC) systems and workload managers</li>\n<li>Worked with modern AI-oriented solutions</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_a2e88648-d1d","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Mistral AI","sameAs":"https://mistral.ai","logo":"https://logos.yubhub.co/mistral.ai.png"},"x-apply-url":"https://jobs.lever.co/mistral/f76907fd-428a-4824-a1cf-8013974fde29","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["bare metal infrastructure","highly available distributed systems","CI/CD","containerization","orchestration tools","monitoring","logging","alerting","observability tools","infrastructure-as-code tools","scripting 
languages","software development best practices","networking","security","system administration"],"x-skills-preferred":["AI/ML environment","high-performance computing (HPC) systems","workload managers","modern AI-oriented solutions"],"datePosted":"2026-04-17T12:47:48.920Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Paris"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"bare metal infrastructure, highly available distributed systems, CI/CD, containerization, orchestration tools, monitoring, logging, alerting, observability tools, infrastructure-as-code tools, scripting languages, software development best practices, networking, security, system administration, AI/ML environment, high-performance computing (HPC) systems, workload managers, modern AI-oriented solutions"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_3ff27117-053"},"title":"Technical Support Engineer","description":"<p>Job Title: Technical Support Engineer</p>\n<p>We are seeking a highly skilled Technical Support Engineer to provide high-quality support and service to our Customer base and Internal teams.</p>\n<p>As a Technical Support Engineer, you will play a critical role in providing advanced support directly to our Customers, and collaborating with engineering and Sales teams to enhance our products and services.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Resolve technical issues and provide advanced support directly to customers, including support for fal&#39;s platform (APIs, UI issues, and troubleshooting errors).</li>\n<li>Support users across multiple products via email, chat, and Slack.</li>\n<li>Troubleshoot integration issues, including authentication problems (OAuth, API keys), HTTP errors, malformed requests, rate limits, and API misconfigurations.</li>\n<li>Analyze API logs, error 
messages, and request/response payloads to identify root causes.</li>\n<li>Manage support tickets by responding within SLA timeframes, escalating complex issues appropriately, and maintaining detailed case records.</li>\n<li>Reproduce, escalate, and document bugs or edge cases in collaboration with engineering.</li>\n<li>Provide structured feedback to engineering teams regarding platform reliability, performance bottlenecks, and customer-reported issues, serving as an internal advocate for customer pain points and product improvement.</li>\n<li>Assist with testing and validation of new features, releases, and infrastructure changes before production deployment.</li>\n<li>Write and maintain technical content, including use case guides, how-to examples, FAQs, solutions for common errors, and documentation of issues and resolutions for the knowledge base.</li>\n<li>Improve developer documentation to make integration as self-serve as possible.</li>\n</ul>\n<p>What You Bring:</p>\n<ul>\n<li>Strong analytical thinking, technical problem-solving skills, and a systematic approach to troubleshooting technical issues across web platforms, cloud environments, and enterprise software.</li>\n<li>Experience supporting and troubleshooting REST APIs and backend services, including working directly with REST APIs and authentication flows (OAuth2, API keys).</li>\n<li>Experience using monitoring, logging, and observability tools to support production systems.</li>\n<li>Familiarity with AI platforms, machine learning systems, or data-intensive applications.</li>\n<li>Excellent written and verbal communication and interpersonal skills, with the ability to clearly and empathetically explain complex technical concepts to both technical and non-technical stakeholders/users in English.</li>\n<li>Experience providing technical support with a customer-first mindset, demonstrating patience, empathy, and a focus on user success.</li>\n<li>Strong technical writing abilities with experience 
creating and maintaining user guides, FAQs, and troubleshooting documentation.</li>\n<li>Demonstrated ability to prioritize effectively, respond quickly to critical issues with a sense of urgency, and maintain composure under pressure.</li>\n<li>Ability to work independently and collaboratively, handling multiple concurrent support cases while maintaining quality and meeting response time commitments.</li>\n<li>Self-starter who can identify process improvements and proactively address recurring issues.</li>\n<li>Familiarity with tools such as Slack, Linear, Notion, and GitHub.</li>\n<li>Familiarity with REST APIs and authentication mechanisms like OAuth2, JWT, and API key auth.</li>\n</ul>\n<p>Why fal:</p>\n<p>At fal, you&#39;ll join a rapidly scaling company defining how AI moves from experimentation to production. This is an opportunity to shape the future of enterprise AI adoption while building deep relationships with customers who are transforming their industries through intelligent technology.</p>\n<p>What we offer at fal:</p>\n<ul>\n<li>Interesting and challenging work</li>\n<li>Competitive salary and equity</li>\n<li>A lot of learning and growth opportunities</li>\n<li>Regular team events and offsites</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_3ff27117-053","directApply":true,"hiringOrganization":{"@type":"Organization","name":"fal","sameAs":"https://www.fal.com/","logo":"https://logos.yubhub.co/fal.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/fal/jobs/4210654009","x-work-arrangement":"remote","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["REST APIs","backend services","monitoring, logging, and observability tools","AI platforms","machine learning systems","data-intensive applications","technical writing","customer 
support","problem-solving","analytical thinking","communication","interpersonal skills"],"x-skills-preferred":["Slack","Linear","Notion","GitHub","authentication protocols","OAuth2","JWT","API key auth"],"datePosted":"2026-04-17T12:33:16.745Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Remote (IST Hours)"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"REST APIs, backend services, monitoring, logging, and observability tools, AI platforms, machine learning systems, data-intensive applications, technical writing, customer support, problem-solving, analytical thinking, communication, interpersonal skills, Slack, Linear, Notion, GitHub, authentication protocols, OAuth2, JWT, API key auth"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_f0f321c2-15d"},"title":"Data Platform Engineer","description":"<p>At Anchorage Digital, we are building the world&#39;s most advanced digital asset platform for institutions to participate in crypto. Join the Data Platform team and build the Trusted Data Platform that powers Anchorage&#39;s transition to Data 3.0.</p>\n<p>You&#39;ll help shape the unified orchestration foundation, collaborate on governance-as-code patterns, and contribute to self-service frameworks that make quality and compliance automatic. 
We&#39;re moving from manual spreadsheets and theoretical architectures to automated control planes where every dataset is trusted, monitored, and traceable by default.</p>\n<p><strong>Technical Skills:</strong></p>\n<ul>\n<li>Collaborate on designing and implementing unified orchestration patterns (Dagster/Airflow) to replace legacy and fragmented scheduling</li>\n</ul>\n<ul>\n<li>Develop, in partnership with the team, governance-as-code systems that automatically apply policy tags, RLS, and access controls through an active control plane</li>\n</ul>\n<p><strong>Complexity and Impact of Work:</strong></p>\n<ul>\n<li>Help guide the technical design for platform capabilities like data contracts, automated quality gating, observability, and cost visibility</li>\n</ul>\n<ul>\n<li>Support the migration of workloads from legacy patterns to the modern platform, ensuring domain teams have clear paths and golden templates</li>\n</ul>\n<p><strong>Organizational Knowledge:</strong></p>\n<ul>\n<li>Partner with domain teams (Asset Data, Reporting &amp; Statements, Product teams) to understand their needs and design platform capabilities that enable their success</li>\n</ul>\n<ul>\n<li>Promote and support data mesh principles and dbt best practices, helping domain owners build and own their data products while the platform ensures quality</li>\n</ul>\n<p><strong>Communication and Influence:</strong></p>\n<ul>\n<li>Promote data platform engineering best practices, developer experience, and &#39;Data as a Product&#39; principles across the engineering organization</li>\n</ul>\n<ul>\n<li>Contribute to architectural decisions and help establish engineering culture around reliability, cost efficiency, and operational excellence</li>\n</ul>\n<p><strong>You may be a fit for this role if you:</strong></p>\n<ul>\n<li>5-7+ years building data platforms or infrastructure: You bring experience helping design and operate modern data platforms that handle enterprise-scale workloads with quality, governance, and cost controls</li>\n</ul>\n<ul>\n<li>Strong dbt 
and SQL expertise: You&#39;re proficient with dbt and SQL, understand dbt Mesh, and have strong opinions on data modeling, testing, and documentation best practices</li>\n<li>Orchestration experience: You&#39;ve implemented production data orchestration with Airflow, Dagster, Prefect, or similar tools, and understand the trade-offs between different orchestration patterns</li>\n<li>Cloud data warehouse proficiency: You have strong experience with BigQuery, Snowflake, or Redshift, including query optimization, cost management, and security configurations</li>\n<li>Platform mindset: You think in terms of golden paths, reusable abstractions, and developer experience - you build systems that let others move fast safely</li>\n</ul>\n<p><strong>Although not a requirement, bonus points if:</strong></p>\n<ul>\n<li>Metadata and catalog experience: You&#39;ve worked with Atlan, Collibra, DataHub, or similar metadata platforms and understand active governance patterns</li>\n<li>Data observability tools: You&#39;ve implemented data quality monitoring with Great Expectations, Monte Carlo, Soda, or similar tools</li>\n<li>Infrastructure as code: You have experience with Terraform, Kubernetes, and modern DevOps practices for data infrastructure</li>\n<li>You&#39;re the kind of person who gets excited about declarative config, immutable infrastructure, and metrics dashboards showing cost-per-query trending down</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_f0f321c2-15d","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anchorage 
Digital","sameAs":"https://www.anchorage.co/","logo":"https://logos.yubhub.co/anchorage.co.png"},"x-apply-url":"https://jobs.lever.co/anchorage/8a325cd5-ef99-4f1e-bba8-7bb1fca64f12","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["dbt","SQL","Airflow","Dagster","Prefect","BigQuery","Snowflake","Redshift"],"x-skills-preferred":["Metadata and catalog experience","Data observability tools","Infrastructure as code"],"datePosted":"2026-04-17T12:24:40.602Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"New York City"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"dbt, SQL, Airflow, Dagster, Prefect, BigQuery, Snowflake, Redshift, Metadata and catalog experience, Data observability tools, Infrastructure as code"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_2299e559-5df"},"title":"Customer Support Engineer","description":"<p><strong>Job Summary</strong></p>\n<p>As a Customer Support Engineer at Electronic Arts, you will work directly with Game Developer teams to resolve technical challenges and improve product quality. 
You will have a minimum of 3 years&#39; experience working with customers and empathy for the customer experience.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Work directly with Game Developer teams to help them solve technical challenges</li>\n<li>Resolve issues involving project implementation, code error diagnosis, debugging, validation, and root cause analysis for the products/platforms assigned</li>\n<li>Build internal relationships with our development and product management teams to help us communicate the priorities of our customers</li>\n<li>Improve product quality by injecting fresh ideas and bringing innovations to existing products or platforms, building automation</li>\n<li>Develop additional software components for tasks associated with the projects/platforms, designing and debugging software applications</li>\n<li>Participate in projects that improve overall product and documentation quality</li>\n<li>Participate in product/platform testing and updates</li>\n<li>Achieve knowledge transfer through the delivery of training, knowledge sessions, and mentoring</li>\n<li>Help increase team efficiency by sharing knowledge, providing feedback about best practices, and writing tools/utilities using an AI stack</li>\n<li>Develop your technology skills and become a versatile IT professional</li>\n<li>Participate in schedule rotations and working shifts</li>\n</ul>\n<p><strong>Qualifications</strong></p>\n<ul>\n<li>A minimum of 3 years&#39; experience working with customers</li>\n<li>Empathy for the customer experience</li>\n<li>Ability to balance varying levels of priority and urgency</li>\n<li>Ability to understand complex technical issues and express them in simple, concise language</li>\n<li>Familiarity with database concepts (e.g.
SQL Server, MongoDB)</li>\n<li>Experience with the following technology concepts: virtualization, operating systems and server administration (Linux, Windows), cloud infrastructure and services (AWS, Azure), networking (IP, routing, firewalls, ACLs), coding/scripting (Python, Bash, PowerShell), application and API endpoints, and monitoring and observability tools (Grafana, Kibana)</li>\n<li>A qualification in Computer Engineering or relevant experience</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_2299e559-5df","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Electronic Arts","sameAs":"https://jobs.ea.com","logo":"https://logos.yubhub.co/jobs.ea.com.png"},"x-apply-url":"https://jobs.ea.com/en_US/careers/JobDetail/GameKit-Operations-Engineer-12-months-contract/212687","x-work-arrangement":"hybrid","x-experience-level":"mid","x-job-type":"temporary","x-salary-range":null,"x-skills-required":["database concepts","virtualization","operating systems and server administration","cloud infrastructure and services","networking","coding/scripting","application and API endpoints","monitoring and observability tools"],"x-skills-preferred":[],"datePosted":"2026-03-10T12:17:50.433Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Hyderabad"}},"employmentType":"TEMPORARY","occupationalCategory":"Engineering","industry":"Technology","skills":"database concepts, virtualization, operating systems and server administration, cloud infrastructure and services, networking, coding/scripting, application and API endpoints, monitoring and observability tools"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_ec8eeead-726"},"title":"Java Engineer, Aladdin Engineering, Associate","description":"<p><strong>About this
role</strong></p>\n<p>At BlackRock, technology is the foundation of our business. As a Java Back-End Engineer, you&#39;ll lead by example — architecting, coding, and mentoring teams to build resilient systems that power our global post-trade operations. You&#39;ll design and deliver enterprise-scale software with a focus on reliability, performance, and clean engineering practices.</p>\n<p><strong>Key Responsibilities</strong></p>\n<ul>\n<li>Design and develop robust, high-performance back-end systems using Java 11+ and the Spring Boot ecosystem.</li>\n<li>Lead design discussions, code reviews, and architecture sessions with a hands-on approach.</li>\n<li>Build and maintain microservices and event-driven systems to process and distribute large-scale financial data.</li>\n<li>Develop data integration and pipeline components that connect systems across Snowflake, SQL Server, and real-time streaming platforms.</li>\n<li>Implement and optimize Redis-based caching and data stores for low-latency access patterns.</li>\n<li>Champion best practices for code quality, testing, automation, and performance tuning.</li>\n<li>Collaborate cross-functionally to ensure technical solutions align with product goals and business outcomes.</li>\n</ul>\n<p><strong>Qualifications / Competencies</strong></p>\n<ul>\n<li>B.S./M.S. in Computer Science, Engineering, or related discipline.</li>\n<li>3+ years of professional experience in Java and object-oriented design.</li>\n<li>Strong knowledge of Spring Boot, REST APIs, and enterprise integration patterns.</li>\n<li>Deep expertise in SQL Server, including stored procedures, performance tuning, and data modeling.</li>\n<li>Experience with Redis for caching or data persistence.</li>\n<li>Hands-on exposure to Kafka or similar publish-subscribe systems for real-time event processing.</li>\n<li>Familiarity with Snowflake and data pipeline concepts (ETL, batch vs. 
streaming).</li>\n<li>Experience with Agile coding and a general understanding of how LLMs work.</li>\n<li>Strong focus on clean architecture, maintainability, and production readiness.</li>\n</ul>\n<p><strong>Our benefits</strong></p>\n<p>To help you stay energized, engaged and inspired, we offer a wide range of employee benefits including: retirement investment and tools designed to help you in building a sound financial future; access to education reimbursement; comprehensive resources to support your physical health and emotional well-being; family support programs; and Flexible Time Off (FTO) so you can relax, recharge and be there for the people you care about.</p>\n<p><strong>Our hybrid work model</strong></p>\n<p>BlackRock’s hybrid work model is designed to enable a culture of collaboration and apprenticeship that enriches the experience of our employees, while supporting flexibility for all. Employees are currently required to work at least 4 days in the office per week, with the flexibility to work from home 1 day a week. Some business groups may require more time in the office due to their roles and responsibilities. We remain focused on increasing the impactful moments that arise when we work together in person – aligned with our commitment to performance and innovation.
As a new joiner, you can count on this hybrid model to accelerate your learning and onboarding experience here at BlackRock.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_ec8eeead-726","directApply":true,"hiringOrganization":{"@type":"Organization","name":"BlackRock","sameAs":"https://jobs.workable.com","logo":"https://logos.yubhub.co/view.com.png"},"x-apply-url":"https://jobs.workable.com/view/3vLTpfkn1mYzZFEn6qtubs/java-engineer%2C-aladdin-engineering%2C-associate-in-edinburgh-at-blackrock","x-work-arrangement":"hybrid","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Java","Spring Boot","SQL Server","Redis","Kafka","Snowflake","Agile coding"],"x-skills-preferred":["Kubernetes","Docker","cloud-native environments","observability tools","scripting experience in Python"],"datePosted":"2026-03-09T16:43:22.343Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Edinburgh, Scotland"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Finance","skills":"Java, Spring Boot, SQL Server, Redis, Kafka, Snowflake, Agile coding, Kubernetes, Docker, cloud-native environments, observability tools, scripting experience in Python"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_4f9c5908-033"},"title":"Java Lead Engineer, Vice President","description":"<p><strong>About this role</strong></p>\n<p>At BlackRock, technology is the foundation of our business. As a VP, Java Back-End Engineer, you&#39;ll lead by example — architecting, coding, and mentoring teams to build resilient systems that power our global post-trade operations. 
You&#39;ll design and deliver enterprise-scale software with a focus on reliability, performance, and clean engineering practices.</p>\n<p>This role is ideal for a technical leader who enjoys staying close to the code, guiding design decisions, and solving complex data challenges — all while fostering a culture of excellence and continuous improvement.</p>\n<p><strong>Key Responsibilities</strong></p>\n<ul>\n<li>Design and develop robust, high-performance back-end systems using Java 11+ and the Spring Boot ecosystem.</li>\n<li>Lead design discussions, code reviews, and architecture sessions with a hands-on approach.</li>\n<li>Build and maintain microservices and event-driven systems to process and distribute large-scale financial data.</li>\n<li>Develop data integration and pipeline components that connect systems across Snowflake, SQL Server, and real-time streaming platforms.</li>\n<li>Implement and optimize Redis-based caching and data stores for low-latency access patterns.</li>\n<li>Champion best practices for code quality, testing, automation, and performance tuning.</li>\n<li>Mentor engineers to elevate technical craftsmanship, problem-solving, and design thinking.</li>\n<li>Collaborate cross-functionally to ensure technical solutions align with product goals and business outcomes.</li>\n</ul>\n<p><strong>Qualifications / Competencies</strong></p>\n<ul>\n<li>B.S./M.S. in Computer Science, Engineering, or related discipline.</li>\n<li>6+ years of professional experience in Java and object-oriented design.</li>\n<li>Strong knowledge of Spring Boot, REST APIs, and enterprise integration patterns.</li>\n<li>Deep expertise in SQL Server, including stored procedures, performance tuning, and data modeling.</li>\n<li>Experience with Redis for caching or data persistence.</li>\n<li>Hands-on exposure to Kafka or similar publish-subscribe systems for real-time event processing.</li>\n<li>Familiarity with Snowflake and data pipeline concepts (ETL, batch vs. 
streaming).</li>\n<li>Experience with Agentic coding and a general understanding of how LLMs work.</li>\n<li>Strong focus on clean architecture, maintainability, and production readiness.</li>\n<li>Excellent communication and leadership skills — able to guide teams and influence design direction.</li>\n</ul>\n<p><strong>Our benefits</strong></p>\n<p>To help you stay energized, engaged and inspired, we offer a wide range of employee benefits including: retirement investment and tools designed to help you in building a sound financial future; access to education reimbursement; comprehensive resources to support your physical health and emotional well-being; family support programs; and Flexible Time Off (FTO) so you can relax, recharge and be there for the people you care about.</p>\n<p><strong>Our hybrid work model</strong></p>\n<p>BlackRock’s hybrid work model is designed to enable a culture of collaboration and apprenticeship that enriches the experience of our employees, while supporting flexibility for all. Employees are currently required to work at least 4 days in the office per week, with the flexibility to work from home 1 day a week. Some business groups may require more time in the office due to their roles and responsibilities. We remain focused on increasing the impactful moments that arise when we work together in person – aligned with our commitment to performance and innovation.
As a new joiner, you can count on this hybrid model to accelerate your learning and onboarding experience here at BlackRock.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_4f9c5908-033","directApply":true,"hiringOrganization":{"@type":"Organization","name":"BlackRock","sameAs":"https://jobs.workable.com","logo":"https://logos.yubhub.co/view.com.png"},"x-apply-url":"https://jobs.workable.com/view/d3qM6fNZpN7MyQafCxMbPN/java-lead-engineer%2C-vice-president-in-edinburgh-at-blackrock","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Java","Spring Boot","SQL Server","Redis","Kafka","Snowflake","Agentic coding","LLMs"],"x-skills-preferred":["Kubernetes","Docker","cloud-native environments","observability tools","scripting experience in Python","Prompt Engineering","Agentic AI"],"datePosted":"2026-03-09T16:40:00.694Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Edinburgh, Scotland"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Finance","skills":"Java, Spring Boot, SQL Server, Redis, Kafka, Snowflake, Agentic coding, LLMs, Kubernetes, Docker, cloud-native environments, observability tools, scripting experience in Python, Prompt Engineering, Agentic AI"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_5d911052-764"},"title":"Senior Data Engineer","description":"<p><strong>About the Role</strong></p>\n<p>We&#39;re hiring a Senior Data Engineer to work on our Data Lake Team. 
As a key member of the team, you will be responsible for building and operating various data platform components, including data quality, data pipelines, infrastructure, and monitoring.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Maintain the data pipeline job framework</li>\n<li>Develop the Data Quality framework (an internal set of tools for validating internal and external data sources)</li>\n<li>Maintain and develop a public-facing data ingestion service handling 17,000+ RPS.</li>\n<li>Maintain and develop core data pipelines in both batch and streaming modes.</li>\n<li>Be the last line of support for our internal platform users.</li>\n<li>Take part in the on-call rotation for data platform incidents (shared across the team).</li>\n</ul>\n<p><strong>Requirements</strong></p>\n<ul>\n<li>Fluent English</li>\n<li>4+ years building production services and data pipelines (batch and/or streaming)</li>\n<li>Strong experience with Python or the readiness to ramp up quickly.</li>\n<li>Hands-on experience with at least one MPP system (Spark, Trino, Redshift, etc.)</li>\n<li>Hands-on experience operating services in a cloud environment (AWS preferred)</li>\n</ul>\n<p><strong>Nice to Have</strong></p>\n<ul>\n<li>Terraform/CloudFormation or other IaC tools</li>\n<li>ClickHouse or similar analytical databases</li>\n<li>Experience with data quality/observability tools</li>\n</ul>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Unlimited vacation time - we strongly encourage all employees to take at least 3 weeks per year</li>\n<li>Fully remote team - choose where you live</li>\n<li>Work from home stipend - we want you to have the resources you need to set up your home office</li>\n<li>Apple laptops provided for new employees</li>\n<li>Training and development budget - refreshed each year for every employee</li>\n<li>Maternity &amp; Paternity leave for qualified employees</li>\n<li>Work with smart people who will help you grow and make a meaningful impact</li>\n<li>Base salary:
$80k–$120k USD, depending on knowledge, skills, experience, and interview results</li>\n<li>Stock options - offered in addition to the base salary</li>\n<li>Regular team offsites to connect and collaborate</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_5d911052-764","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Constructor","sameAs":"https://apply.workable.com","logo":"https://logos.yubhub.co/j.com.png"},"x-apply-url":"https://apply.workable.com/j/FF201D8AA3","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$80k–$120k USD","x-skills-required":["Python","MPP system","AWS"],"x-skills-preferred":["Terraform","ClickHouse","data quality/observability tools"],"datePosted":"2026-03-09T10:57:58.178Z","jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Python, MPP system, AWS, Terraform, ClickHouse, data quality/observability tools","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":80000,"maxValue":120000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_c043b353-08f"},"title":"Scaled Support Specialist","description":"<p>## <strong>About the Role</strong>  We&#39;re looking for a Scaled Support Specialist who lives at the intersection of deep technical troubleshooting and exceptional human communication. You&#39;ll be the front line for developers integrating with OpenRouter&#39;s API — diagnosing complex issues across dozens of model providers, untangling new edge cases, and making sure every developer who reaches out feels like they have a partner, not a ticket number.
This is not a scripted helpdesk role. Our users are highly capable engineers building the next generation of AI applications, which means the problems they bring to us are complex, nuanced, and frequently novel. You&#39;ll encounter issues daily where there is no runbook. You&#39;ll need to figure it out, often with incomplete information, and usually before anyone else on the team has seen it either.  If you&#39;re the kind of person who reads API changelogs for fun, has strong opinions about error message quality, and gets genuine satisfaction from turning a frustrated developer into a happy one — keep reading.
## <strong>Key Responsibilities</strong>  ### <strong>Troubleshooting &amp; Problem Solving</strong> (Core Focus)  - Diagnose and resolve complex technical issues across OpenRouter&#39;s API, spanning multiple LLM providers - Reproduce bugs in ambiguous environments — different SDKs, languages, frameworks, and auth configurations — using tools like `curl`, Postman, and small test apps - Read and interpret logs, headers, and request traces; identify whether the problem is client-side, OpenRouter-side, or an upstream provider issue vs. a user misconfiguration - Turn &quot;it doesn&#39;t work&quot; into actionable findings: exact steps to reproduce, clear hypotheses, and verified fixes or workarounds  ### <strong>Developer Communication &amp; Advocacy</strong>  - Respond to developer inquiries across support channels (email, Discord, GitHub) with clarity, empathy, and technical precision - Translate complex technical root causes into human-friendly explanations - Set expectations on timelines and next steps; provide proactive updates and close the loop - Identify patterns in support requests and advocate internally for documentation improvements, API design changes, or better messages  ### <strong>Self-Directed Research &amp; Learning</strong>  - Stay current with the rapidly evolving LLM ecosystem - Develop deep expertise in OpenRouter&#39;s routing logic, fallback behavior, rate limiting, streaming (SSE), and billing systems with minimal hand-holding  ### <strong>Bridge to Product &amp; Engineering</strong>  - Spot systemic issues underneath individual tickets and push for the fix that prevents 50 more - Identify trends in support volume to capture product feedback and inform roadmap priorities - Collaborate on improving the developer experience  ## <strong>About You</strong>  ### <strong>Required:</strong>  - 4+ years in a technical support, developer support, solutions engineering, or similar role — ideally supporting an API or developer tools product - 
Exceptional troubleshooting instincts - Strong API fluency - Proficiency in at least one scripting language (Python or TypeScript) - Excellent written communication - Comfort with ambiguity - Genuine passion for AI and LLMs  ### <strong>Nice-to-Haves:</strong>  - Familiarity with the OpenAI SDK / Chat Completions API format - Experience with AI/ML frameworks like LangChain, LlamaIndex, or Hugging Face - Experience with observability tools (logging, tracing, metrics) - Experience scaling support operations — e.g., implementing AI-assisted support bots, building internal support dashboards, or creating automated triage workflows - Contributions to open-source projects or developer communities - Background in or exposure to ML/AI concepts beyond just using APIs (benchmarking, evals, fine-tuning)  ## <strong>Why OpenRouter</strong>  - Work at the center of the AI infrastructure stack as enterprises define how they adopt LLMs. - High ownership and autonomy to define how developer education and community scale. - Opportunity to shape a foundational function at a fast-growing company. - Fully remote team with a culture of autonomy and trust. 
- Competitive compensation, including base salary and equity.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_c043b353-08f","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenRouter","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/openrouter.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openrouter/89ff6b47-ba08-4418-b24b-c136dbf2ef82","x-work-arrangement":"Remote","x-experience-level":"senior","x-job-type":"Full time","x-salary-range":null,"x-skills-required":["API fluency","scripting language (Python or TypeScript)","exceptional troubleshooting instincts","strong API fluency","excellent written communication"],"x-skills-preferred":["OpenAI SDK / Chat Completions API format","AI/ML frameworks like LangChain, LlamaIndex, or Hugging Face","observability tools (logging, tracing, metrics)","scaling support operations","contributions to open-source projects or developer communities"],"datePosted":"2026-03-09T09:48:23.067Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Remote (US)"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"API fluency, scripting language (Python or TypeScript), exceptional troubleshooting instincts, strong API fluency, excellent written communication, OpenAI SDK / Chat Completions API format, AI/ML frameworks like LangChain, LlamaIndex, or Hugging Face, observability tools (logging, tracing, metrics), scaling support operations, contributions to open-source projects or developer communities"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_93a4ece6-182"},"title":"Member of Technical Staff, Site Reliability Engineer (HPC)","description":"<p>As Microsoft continues to push the boundaries 
of AI, we are on the lookout for experienced individuals to work with us on the most interesting and challenging AI questions of our time. Our vision is to build systems that have true artificial intelligence across agents, applications, services, and infrastructure. We&#39;re looking for an experienced HPC Site Reliability Engineer (SRE) to join our High Performance Computing (HPC) infrastructure team. In this role, you&#39;ll blend software engineering and systems engineering to keep our large-scale distributed AI infrastructure reliable and efficient. You&#39;ll ensure that AI systems stay efficient and reliable with very high uptimes.</p>\n<p>Microsoft&#39;s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.</p>\n<p>This role is part of Microsoft AI&#39;s Superintelligence Team. The MAIST is a startup-like team inside Microsoft AI, created to push the boundaries of AI toward Humanist Superintelligence—ultra-capable systems that remain controllable, safety-aligned, and anchored to human values. Our mission is to create AI that amplifies human potential while ensuring humanity remains firmly in control. 
We aim to deliver breakthroughs that benefit society—advancing science, education, and global well-being.</p>\n<p>Responsibilities\nReliability &amp; Availability: Ensure uptime, resiliency, and fault tolerance of HPC clusters powering MAI model training and inference.\nObservability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into all aspects of HPC systems, including GPUs, clusters, storage, and networking.\nAutomation &amp; Tooling: Build automation for deployments, incident response, scaling, and failover in CPU+GPU environments.\nIncident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements.\nSecurity &amp; Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments.\nCollaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows.</p>\n<p>Qualifications\nRequired Qualifications\nMaster’s Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering OR Bachelor’s Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering OR equivalent experience</p>\n<p>Preferred Qualifications\nStrong proficiency in Kubernetes, Docker, and container orchestration.\nKnowledge of CI/CD pipelines for inference and ML model deployment.\nHands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code.\nExpertise in monitoring &amp; observability tools (Grafana, Datadog, OpenTelemetry, etc.).\nStrong programming/scripting skills in Python, Go, or Bash.\nSolid knowledge of distributed systems, networking, and storage.\nExperience running large-scale GPU clusters for ML/AI
workloads (preferred).\nFamiliarity with ML training/inference pipelines.\nExperience with high-performance computing (HPC) and workload schedulers (Kubernetes operators).\nBackground in capacity planning &amp; cost optimization for GPU-heavy environments.</p>\n<p>Work on cutting-edge infrastructure that powers the future of Generative AI. Collaborate with world-class researchers and engineers. Impact millions of users through reliable and responsible AI deployments. Competitive compensation, equity options, and comprehensive benefits.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_93a4ece6-182","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/member-of-technical-staff-site-reliability-engineer-hpc-mai-superintelligence-team/","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$139,900 – $274,800 per year","x-skills-required":["Kubernetes","Docker","container orchestration","CI/CD pipelines","public cloud platforms","infrastructure-as-code","monitoring & observability tools","programming/scripting skills in Python, Go, or Bash","distributed systems","networking","storage","GPU clusters","ML training/inference pipelines","high-performance computing","workload schedulers"],"x-skills-preferred":["strong proficiency in Kubernetes","knowledge of CI/CD pipelines","hands-on experience with public cloud platforms","expertise in monitoring & observability tools","strong programming/scripting skills in Python, Go, or Bash","solid knowledge of distributed systems","experience running large-scale GPU clusters","familiarity with ML training/inference pipelines","experience with high-performance 
computing"],"datePosted":"2026-03-08T22:09:23.399Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Mountain View"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Kubernetes, Docker, container orchestration, CI/CD pipelines, public cloud platforms, infrastructure-as-code, monitoring & observability tools, programming/scripting skills in Python, Go, or Bash, distributed systems, networking, storage, GPU clusters, ML training/inference pipelines, high-performance computing, workload schedulers, strong proficiency in Kubernetes, knowledge of CI/CD pipelines, hands-on experience with public cloud platforms, expertise in monitoring & observability tools, strong programming/scripting skills in Python, Go, or Bash, solid knowledge of distributed systems, experience running large-scale GPU clusters, familiarity with ML training/inference pipelines, experience with high-performance computing","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":139900,"maxValue":274800,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_73ff6f07-c0e"},"title":"Staff Software Engineer, AI Reliability Engineering","description":"<p><strong>About Anthropic</strong></p>\n<p>Anthropic&#39;s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</p>\n<p><strong>About the Role</strong></p>\n<p>Claude has your back. AIRE has Claude&#39;s. 
Help us keep Claude reliable for everyone who depends on it.</p>\n<p>AIRE (AI Reliability Engineering) partners with teams across Anthropic to improve reliability across our most critical serving paths -- every hop from the SDK through our network, API layers, serving infrastructure, and accelerators and back. We jump into the trenches alongside partner teams to make the systems that deliver Claude more robust and resilient, be it during an incident or collaborating on projects.</p>\n<p>Reliability here is an emergent phenomenon that transcends any single team&#39;s boundaries, so someone has to zoom out and look at the whole picture. That&#39;s us -- and it means few teams at Anthropic offer this kind of dynamic, cross-cutting exposure to the systems that matter most.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity.</li>\n<li>Design and implement monitoring and observability systems across the token path.</li>\n<li>Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers</li>\n<li>Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.</li>\n<li>Support the reliability of safeguard model serving -- critical for both site reliability and Anthropic&#39;s safety commitments.</li>\n</ul>\n<p><strong>You may be a good fit if you</strong></p>\n<ul>\n<li>Have strong distributed systems, infrastructure, or reliability backgrounds -- we&#39;re looking for reliability-minded software engineers and SREs.</li>\n<li>Are curious and brave -- comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don&#39;t have deep expertise yet.</li>\n<li>Think holistically about how systems compose and where the seams are.</li>\n<li>Can build 
lasting relationships across teams -- our engagement model depends on being welcomed as teammates, not outsiders with opinions.</li>\n<li>Care about users and feel ownership over outcomes, even for systems you don&#39;t own.</li>\n<li>Have excellent communication and collaboration skills -- you&#39;ll be partnering across the entire company.</li>\n<li>Bring diverse experience -- the team&#39;s strength comes from people who&#39;ve built product stacks, scaled databases, run massive distributed systems, and everything in between.</li>\n</ul>\n<p><strong>Strong candidates may also</strong></p>\n<ul>\n<li>Have been an SRE, Production Engineer, or in similar reliability-focused roles on large scale systems</li>\n<li>Have experience operating large-scale model serving or training infrastructure (&gt;1000 GPUs).</li>\n<li>Have experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium).</li>\n<li>Understand ML-specific networking optimizations like RDMA and InfiniBand.</li>\n<li>Have expertise in AI-specific observability tools and frameworks.</li>\n<li>Have experience with chaos engineering and systematic resilience testing.</li>\n<li>Have contributed to open-source infrastructure or ML tooling.</li>\n</ul>\n<p><strong>Logistics</strong></p>\n<p><strong>Education requirements:</strong> We require at least a Bachelor&#39;s degree in a related field or equivalent experience. <strong>Location-based hybrid policy:</strong> Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</p>\n<p><strong>Visa sponsorship</strong></p>\n<p>We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. 
But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</p>\n<p><strong>We encourage you to apply even if you do not believe you meet every single qualification.</strong></p>\n<p>Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you&#39;re interested in this work.</p>\n<p><strong>Your safety matters to us.</strong></p>\n<p>To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses. In some cases, we may partner with vetted recruiting agencies who will identify themselves as working on behalf of Anthropic. Be cautious of emails from other domains. Legitimate Anthropic recruiters will never ask for money, fees, or banking information before your first day. If you&#39;re ever unsure about a communication, don&#39;t click any links—visit anthropic.com/careers directly for confirmed position openings.</p>\n<p><strong>How we&#39;re different</strong></p>\n<p>We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact — advancing our long-term goals of steerable, trustworthy AI — rather than work on smaller and more specific puzzles. 
We view AI research as an empirical science, which has as much in common with physics and engineering as it does with computer science.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_73ff6f07-c0e","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://job-boards.greenhouse.io","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5101173008","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"£325,000 - £390,000GBP","x-skills-required":["distributed systems","infrastructure","reliability","software engineering","SRE","large scale systems","model serving","training infrastructure","ML hardware accelerators","RDMA","InfiniBand","AI-specific observability tools","chaos engineering","resilience testing","open-source infrastructure","ML tooling"],"x-skills-preferred":["SRE","Production Engineer","reliability-focused roles","ML hardware accelerators","RDMA","InfiniBand","AI-specific observability tools","chaos engineering","resilience testing","open-source infrastructure","ML tooling"],"datePosted":"2026-03-08T13:51:34.354Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"London, UK"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"distributed systems, infrastructure, reliability, software engineering, SRE, large scale systems, model serving, training infrastructure, ML hardware accelerators, RDMA, InfiniBand, AI-specific observability tools, chaos engineering, resilience testing, open-source infrastructure, ML tooling, SRE, Production Engineer, reliability-focused roles, ML hardware accelerators, RDMA, InfiniBand, AI-specific observability tools, chaos engineering, resilience testing, open-source 
infrastructure, ML tooling","baseSalary":{"@type":"MonetaryAmount","currency":"GBP","value":{"@type":"QuantitativeValue","minValue":325000,"maxValue":390000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_c930b80e-7a6"},"title":"Staff / Senior Software Engineer, AI Reliability","description":"<p><strong>About Anthropic</strong></p>\n<p>Anthropic&#39;s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</p>\n<p><strong>About the Role</strong></p>\n<p>AIRE (AI Reliability Engineering) partners with teams across Anthropic to improve reliability across our most critical serving paths -- every hop from the SDK through our network, API layers, serving infrastructure, and accelerators and back. We jump into the trenches alongside partner teams to make the systems that deliver Claude more robust and resilient, be it during an incident or collaborating on projects.</p>\n<p>Reliability here is an emergent phenomenon that transcends any single team&#39;s boundaries, so someone has to zoom out and look at the whole picture. That&#39;s us -- and it means few teams at Anthropic offer this kind of dynamic, cross-cutting exposure to the systems that matter most.</p>\n<p>Claude has your back. AIRE has Claude&#39;s. 
Help us keep Claude reliable for everyone who depends on it.</p>\n<p><strong>Responsibilities:</strong></p>\n<ul>\n<li>Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity.</li>\n</ul>\n<ul>\n<li>Design and implement monitoring and observability systems across the token path.</li>\n</ul>\n<ul>\n<li>Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers</li>\n</ul>\n<ul>\n<li>Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.</li>\n</ul>\n<ul>\n<li>Support the reliability of safeguard model serving -- critical for both site reliability and Anthropic&#39;s safety commitments.</li>\n</ul>\n<p><strong>You may be a good fit if you:</strong></p>\n<ul>\n<li>Have strong distributed systems, infrastructure, or reliability backgrounds -- we&#39;re looking for reliability-minded software engineers and SREs.</li>\n</ul>\n<ul>\n<li>Are curious and brave -- comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don&#39;t have deep expertise yet.</li>\n</ul>\n<ul>\n<li>Think holistically about how systems compose and where the seams are.</li>\n</ul>\n<ul>\n<li>Can build lasting relationships across teams -- our engagement model depends on being welcomed as teammates, not outsiders with opinions.</li>\n</ul>\n<ul>\n<li>Care about users and feel ownership over outcomes, even for systems you don&#39;t own.</li>\n</ul>\n<ul>\n<li>Have excellent communication and collaboration skills -- you&#39;ll be partnering across the entire company.</li>\n</ul>\n<ul>\n<li>Bring diverse experience -- the team&#39;s strength comes from people who&#39;ve built product stacks, scaled databases, run massive distributed systems, and everything in between.</li>\n</ul>\n<p><strong>Strong candidates may 
also:</strong></p>\n<ul>\n<li>Have been an SRE, Production Engineer, or in similar reliability-focused roles on large scale systems</li>\n</ul>\n<ul>\n<li>Have experience operating large-scale model serving or training infrastructure (&gt;1000 GPUs).</li>\n</ul>\n<ul>\n<li>Have experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium).</li>\n</ul>\n<ul>\n<li>Understand ML-specific networking optimizations like RDMA and InfiniBand.</li>\n</ul>\n<ul>\n<li>Have expertise in AI-specific observability tools and frameworks.</li>\n</ul>\n<ul>\n<li>Have experience with chaos engineering and systematic resilience testing.</li>\n</ul>\n<ul>\n<li>Have contributed to open-source infrastructure or ML tooling.</li>\n</ul>\n<p><strong>Logistics</strong></p>\n<p><strong>Education requirements:</strong> We require at least a Bachelor&#39;s degree in a related field or equivalent experience. <strong>Location-based hybrid policy:</strong> Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</p>\n<p><strong>Visa sponsorship:</strong> We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</p>\n<p><strong>We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you&#39;re interested in this work.</strong></p>\n<p><strong>Your safety matters to us. 
To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses. In some cases, we may partner with vetted recruiting agencies who will identify themselves as working on behalf of Anthropic. Be cautious of emails from other domains. Legitimate Anthropic recruiters will never ask for money, fees, or banking information before your first day. If you&#39;re ever unsure about a communication, don&#39;t click any links—visit anthropic.com/careers directly for confirmed position openings.</strong></p>\n<p><strong>How we&#39;re different</strong></p>\n<p>We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact — advancing our long-term goals of steerable, trustworthy AI — rather than work on smaller and more specific puzzles. We view AI research as a team sport, where everyone contributes to the overall success of the team.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_c930b80e-7a6","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://job-boards.greenhouse.io","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5113224008","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$325,000 - $485,000 USD","x-skills-required":["distributed systems","infrastructure","reliability","large language model serving systems","monitoring and observability systems","high-availability serving infrastructure","incident response","safeguard model serving"],"x-skills-preferred":["SRE","Production Engineer","ML hardware accelerators","ML-specific networking optimizations","AI-specific observability tools and frameworks","chaos 
engineering","systematic resilience testing","open-source infrastructure or ML tooling"],"datePosted":"2026-03-08T13:50:54.182Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"distributed systems, infrastructure, reliability, large language model serving systems, monitoring and observability systems, high-availability serving infrastructure, incident response, safeguard model serving, SRE, Production Engineer, ML hardware accelerators, ML-specific networking optimizations, AI-specific observability tools and frameworks, chaos engineering, systematic resilience testing, open-source infrastructure or ML tooling","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":325000,"maxValue":485000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_5d38ab71-400"},"title":"Research Engineer, Pretraining Scaling","description":"<p><strong>About Anthropic</strong></p>\n<p>Anthropic&#39;s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</p>\n<p><strong>About the Role:</strong></p>\n<p>Anthropic&#39;s ML Performance and Scaling team trains our production pretrained models, work that directly shapes the company&#39;s future and our mission to build safe, beneficial AI systems. As a Research Engineer on this team, you&#39;ll ensure our frontier models train reliably, efficiently, and at scale. 
This is demanding, high-impact work that requires both deep technical expertise and a genuine passion for the craft of large-scale ML systems.</p>\n<p>This role lives at the boundary between research and engineering. You&#39;ll work across our entire production training stack: performance optimisation, hardware debugging, experimental design, and launch coordination. During launches, the team works in tight lockstep, responding to production issues that can&#39;t wait for tomorrow.</p>\n<p><strong>Responsibilities:</strong></p>\n<ul>\n<li>Own critical aspects of our production pretraining pipeline, including model operations, performance optimisation, observability, and reliability</li>\n<li>Debug and resolve complex issues across the full stack—from hardware errors and networking to training dynamics and evaluation infrastructure</li>\n<li>Design and run experiments to improve training efficiency, reduce step time, increase uptime, and enhance model performance</li>\n<li>Respond to on-call incidents during model launches, diagnosing problems quickly and coordinating solutions across teams</li>\n<li>Build and maintain production logging, monitoring dashboards, and evaluation infrastructure</li>\n<li>Add new capabilities to the training codebase, such as long context support or novel architectures</li>\n<li>Collaborate closely with teammates across SF and London, as well as with Tokens, Architectures, and Systems teams</li>\n<li>Contribute to the team&#39;s institutional knowledge by documenting systems, debugging approaches, and lessons learned</li>\n</ul>\n<p><strong>You May Be a Good Fit If You:</strong></p>\n<ul>\n<li>Have hands-on experience training large language models, or deep expertise with JAX, TPU, PyTorch, or large-scale distributed systems</li>\n<li>Genuinely enjoy both research and engineering work—you&#39;d describe your ideal split as roughly 50/50 rather than heavily weighted toward one or the other</li>\n<li>Are excited about being on-call for 
production systems, working long days during launches, and solving hard problems under pressure</li>\n<li>Thrive when working on whatever is most impactful, even if that changes day-to-day based on what the production model needs</li>\n<li>Excel at debugging complex, ambiguous problems across multiple layers of the stack</li>\n<li>Communicate clearly and collaborate effectively, especially when coordinating across time zones or during high-stress incidents</li>\n<li>Are passionate about the work itself and want to refine your craft as a research engineer</li>\n<li>Care about the societal impacts of AI and responsible scaling</li>\n</ul>\n<p><strong>Strong Candidates May Also Have:</strong></p>\n<ul>\n<li>Previous experience training LLMs or working extensively with JAX/TPU, PyTorch, or other ML frameworks at scale</li>\n<li>Contributed to open-source LLM frameworks (e.g., open_lm, llm-foundry, mesh-transformer-jax)</li>\n<li>Published research on model training, scaling laws, or ML systems</li>\n<li>Experience with production ML systems, observability tools, or evaluation infrastructure</li>\n<li>Background as a systems engineer, quant, or in other roles requiring both technical depth and operational excellence</li>\n</ul>\n<p><strong>What Makes This Role Unique:</strong></p>\n<p>This is not a typical research engineering role. The work is highly operational—you&#39;ll be deeply involved in keeping our production models training smoothly, which means being responsive to incidents, flexible about priorities, and comfortable with uncertainty. During launches, the team often works extended hours and may need to respond to issues on evenings and weekends.</p>\n<p>However, this operational intensity comes with extraordinary learning opportunities. You&#39;ll gain hands-on experience with some of the largest, most sophisticated training runs in the industry. 
You&#39;ll work alongside world-class researchers and engineers, and the institutional knowledge you build will compound in ways that can&#39;t be easily transferred. For people who thrive on this type of work, it&#39;s uniquely rewarding.</p>\n<p>We&#39;re building a close-knit team of people who genuinely care about doing excellent work together. If you&#39;re someone who wants to be part of training the models that will define the future of AI—and you&#39;re excited about the full reality of what that entails—we&#39;d love to hear from you.</p>\n<p><strong>Logistics</strong></p>\n<p><strong>Education requirements:</strong> We require at least a Bachelor&#39;s degree in a related field or equivalent experience. <strong>Location-based hybrid policy:</strong> Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</p>\n<p><strong>Visa sponsorship:</strong> We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. 
But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</p>\n<p><strong>We encourage you to apply even if you do not believe you meet every single qualification.</strong></p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_5d38ab71-400","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://job-boards.greenhouse.io","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/4938432008","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$350,000 - $850,000USD","x-skills-required":["JAX","TPU","PyTorch","large-scale distributed systems","model operations","performance optimisation","observability","reliability","model training","scaling laws","ML systems"],"x-skills-preferred":["open-source LLM frameworks","production ML systems","observability tools","evaluation infrastructure","systems engineer","quant"],"datePosted":"2026-03-08T13:48:54.589Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"JAX, TPU, PyTorch, large-scale distributed systems, model operations, performance optimisation, observability, reliability, model training, scaling laws, ML systems, open-source LLM frameworks, production ML systems, observability tools, evaluation infrastructure, systems engineer, 
quant","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":350000,"maxValue":850000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_10798a1e-9fa"},"title":"Staff Software Engineer, AI Reliability Engineering","description":"<p><strong>About Anthropic</strong></p>\n<p>Anthropic&#39;s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</p>\n<p><strong>About the Role</strong></p>\n<p>Claude has your back. AIRE has Claude&#39;s. Help us keep Claude reliable for everyone who depends on it.</p>\n<p>AIRE (AI Reliability Engineering) partners with teams across Anthropic to improve reliability across our most critical serving paths -- every hop from the SDK through our network, API layers, serving infrastructure, and accelerators and back. We jump into the trenches alongside partner teams to make the systems that deliver Claude more robust and resilient, be it during an incident or collaborating on projects.</p>\n<p>Reliability here is an emergent phenomenon that transcends any single team&#39;s boundaries, so someone has to zoom out and look at the whole picture. 
That&#39;s us -- and it means few teams at Anthropic offer this kind of dynamic, cross-cutting exposure to the systems that matter most.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity.</li>\n<li>Design and implement monitoring and observability systems across the token path.</li>\n<li>Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers</li>\n<li>Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.</li>\n<li>Support the reliability of safeguard model serving -- critical for both site reliability and Anthropic&#39;s safety commitments.</li>\n</ul>\n<p><strong>You may be a good fit if you</strong></p>\n<ul>\n<li>Have strong distributed systems, infrastructure, or reliability backgrounds -- we&#39;re looking for reliability-minded software engineers and SREs.</li>\n<li>Are curious and brave -- comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don&#39;t have deep expertise yet.</li>\n<li>Think holistically about how systems compose and where the seams are.</li>\n<li>Can build lasting relationships across teams -- our engagement model depends on being welcomed as teammates, not outsiders with opinions.</li>\n<li>Care about users and feel ownership over outcomes, even for systems you don&#39;t own.</li>\n<li>Have excellent communication and collaboration skills -- you&#39;ll be partnering across the entire company.</li>\n<li>Bring diverse experience -- the team&#39;s strength comes from people who&#39;ve built product stacks, scaled databases, run massive distributed systems, and everything in between.</li>\n</ul>\n<p><strong>Strong candidates may also</strong></p>\n<ul>\n<li>Have been an SRE, Production 
Engineer, or in similar reliability-focused roles on large scale systems</li>\n<li>Have experience operating large-scale model serving or training infrastructure (&gt;1000 GPUs).</li>\n<li>Have experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium).</li>\n<li>Understand ML-specific networking optimizations like RDMA and InfiniBand.</li>\n<li>Have expertise in AI-specific observability tools and frameworks.</li>\n<li>Have experience with chaos engineering and systematic resilience testing.</li>\n<li>Have contributed to open-source infrastructure or ML tooling.</li>\n</ul>\n<p><strong>Logistics</strong></p>\n<p><strong>Education requirements:</strong> We require at least a Bachelor&#39;s degree in a related field or equivalent experience. <strong>Location-based hybrid policy:</strong> Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</p>\n<p><strong>Salary</strong></p>\n<p>The annual compensation range for this role is €235.000 - €295.000EUR.</p>\n<p><strong>How we&#39;re different</strong></p>\n<p>We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact — advancing our long-term goals of steerable, trustworthy AI — rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and engineering as it does with computer science. 
We strive to build a team that reflects this perspective, with people from a wide range of backgrounds and disciplines.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_10798a1e-9fa","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://job-boards.greenhouse.io","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5101169008","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"€235.000 - €295.000EUR","x-skills-required":["distributed systems","infrastructure","reliability","software engineering","SRE","large scale systems","model serving","training infrastructure","ML hardware accelerators","RDMA","InfiniBand","AI-specific observability tools","chaos engineering","resilience testing","open-source infrastructure","ML tooling"],"x-skills-preferred":["communication","collaboration","diverse experience","product stacks","databases","distributed systems"],"datePosted":"2026-03-08T13:48:18.742Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Dublin"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"distributed systems, infrastructure, reliability, software engineering, SRE, large scale systems, model serving, training infrastructure, ML hardware accelerators, RDMA, InfiniBand, AI-specific observability tools, chaos engineering, resilience testing, open-source infrastructure, ML tooling, communication, collaboration, diverse experience, product stacks, databases, distributed systems"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_a05bfa1a-d23"},"title":"Research Engineer, Pretraining Scaling","description":"<p><strong>About 
Anthropic</strong></p>\n<p>Anthropic&#39;s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</p>\n<p><strong>About the Role:</strong></p>\n<p>Anthropic&#39;s ML Performance and Scaling team trains our production pretrained models, work that directly shapes the company&#39;s future and our mission to build safe, beneficial AI systems. As a Research Engineer on this team, you&#39;ll ensure our frontier models train reliably, efficiently, and at scale. This is demanding, high-impact work that requires both deep technical expertise and a genuine passion for the craft of large-scale ML systems.</p>\n<p>This role lives at the boundary between research and engineering. You&#39;ll work across our entire production training stack: performance optimization, hardware debugging, experimental design, and launch coordination. 
During launches, the team works in tight lockstep, responding to production issues that can&#39;t wait for tomorrow.</p>\n<p><strong>Responsibilities:</strong></p>\n<ul>\n<li>Own critical aspects of our production pretraining pipeline, including model operations, performance optimization, observability, and reliability</li>\n<li>Debug and resolve complex issues across the full stack—from hardware errors and networking to training dynamics and evaluation infrastructure</li>\n<li>Design and run experiments to improve training efficiency, reduce step time, increase uptime, and enhance model performance</li>\n<li>Respond to on-call incidents during model launches, diagnosing problems quickly and coordinating solutions across teams</li>\n<li>Build and maintain production logging, monitoring dashboards, and evaluation infrastructure</li>\n<li>Add new capabilities to the training codebase, such as long context support or novel architectures</li>\n<li>Collaborate closely with teammates across SF and London, as well as with Tokens, Architectures, and Systems teams</li>\n<li>Contribute to the team&#39;s institutional knowledge by documenting systems, debugging approaches, and lessons learned</li>\n</ul>\n<p><strong>You May Be a Good Fit If You:</strong></p>\n<ul>\n<li>Have hands-on experience training large language models, or deep expertise with JAX, TPU, PyTorch, or large-scale distributed systems</li>\n<li>Genuinely enjoy both research and engineering work—you&#39;d describe your ideal split as roughly 50/50 rather than heavily weighted toward one or the other</li>\n<li>Are excited about being on-call for production systems, working long days during launches, and solving hard problems under pressure</li>\n<li>Thrive when working on whatever is most impactful, even if that changes day-to-day based on what the production model needs</li>\n<li>Excel at debugging complex, ambiguous problems across multiple layers of the stack</li>\n<li>Communicate clearly and collaborate 
effectively, especially when coordinating across time zones or during high-stress incidents</li>\n<li>Are passionate about the work itself and want to refine your craft as a research engineer</li>\n<li>Care about the societal impacts of AI and responsible scaling</li>\n</ul>\n<p><strong>Strong Candidates May Also Have:</strong></p>\n<ul>\n<li>Previous experience training LLMs or working extensively with JAX/TPU, PyTorch, or other ML frameworks at scale</li>\n<li>Contributed to open-source LLM frameworks (e.g., open_lm, llm-foundry, mesh-transformer-jax)</li>\n<li>Published research on model training, scaling laws, or ML systems</li>\n<li>Experience with production ML systems, observability tools, or evaluation infrastructure</li>\n<li>Background as a systems engineer, quant, or in other roles requiring both technical depth and operational excellence</li>\n</ul>\n<p><strong>What Makes This Role Unique:</strong></p>\n<p>This is not a typical research engineering role. The work is highly operational—you&#39;ll be deeply involved in keeping our production models training smoothly, which means being responsive to incidents, flexible about priorities, and comfortable with uncertainty. During launches, the team often works extended hours and may need to respond to issues on evenings and weekends.</p>\n<p>However, this operational intensity comes with extraordinary learning opportunities. You&#39;ll gain hands-on experience with some of the largest, most sophisticated training runs in the industry. You&#39;ll work alongside world-class researchers and engineers, and the institutional knowledge you build will compound in ways that can&#39;t be easily transferred. For people who thrive on this type of work, it&#39;s uniquely rewarding.</p>\n<p>We&#39;re building a close-knit team of people who genuinely care about doing excellent work together. 
If you&#39;re someone who wants to be part of training the models that will define the future of AI—and you&#39;re excited about the full reality of what that entails—we&#39;d love to hear from you.</p>\n<p><strong>Logistics</strong></p>\n<p><strong>Education requirements:</strong> We require at least a Bachelor&#39;s degree in a related field or equivalent experience. <strong>Location-based hybrid policy:</strong> Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</p>\n<p><strong>Visa sponsorship:</strong> We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</p>","url":"https://yubhub.co/jobs/job_a05bfa1a-d23","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/4938436008","x-work-arrangement":"onsite","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":"£260,000 - £630,000GBP","x-skills-required":["JAX","TPU","PyTorch","large-scale distributed systems","model operations","performance optimization","observability","reliability","debugging","experimental design","launch coordination","production logging","monitoring dashboards","evaluation infrastructure","collaboration","communication"],"x-skills-preferred":["open-source LLM frameworks","research on model training","scaling laws","ML systems","production ML systems","observability tools","evaluation infrastructure","systems engineering","quant","operational 
excellence"],"datePosted":"2026-03-08T13:44:15.893Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"London"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"JAX, TPU, PyTorch, large-scale distributed systems, model operations, performance optimization, observability, reliability, debugging, experimental design, launch coordination, production logging, monitoring dashboards, evaluation infrastructure, collaboration, communication, open-source LLM frameworks, research on model training, scaling laws, ML systems, production ML systems, observability tools, evaluation infrastructure, systems engineering, quant, operational excellence","baseSalary":{"@type":"MonetaryAmount","currency":"GBP","value":{"@type":"QuantitativeValue","minValue":260000,"maxValue":630000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_35fc7f23-917"},"title":"Software Engineer, Growth Infrastructure","description":"<p>We are looking for an experienced Growth Infrastructure Engineer to build and maintain the technical backbone that enables scalable growth experiments, high-performance data pipelines, and automated systems that drive user acquisition, engagement, and product iteration.</p>\n<p>This role sits at the intersection of growth, product, and infrastructure — combining deep technical engineering with experimentation and data-driven optimization. 
You will collaborate with product, data science, and backend teams to ensure that growth initiatives run smoothly and scale efficiently across systems.</p>\n<p><strong>Key Responsibilities</strong></p>\n<p><strong>Growth Infrastructure &amp; Systems</strong></p>\n<ul>\n<li>Design, implement, and maintain scalable infrastructure that supports growth and experimentation needs.</li>\n<li>Build and optimize analytics pipelines to capture key product and growth metrics (acquisition, activation, retention, etc.).</li>\n<li>Develop automated workflows for user onboarding, campaign delivery, and performance tracking.</li>\n</ul>\n<p><strong>Experimentation &amp; Optimization</strong></p>\n<ul>\n<li>Support A/B testing frameworks and integrate them into production systems.</li>\n<li>Enable reliable data collection and evaluation for growth experiments.</li>\n<li>Automate deployment and rollout of growth feature flags and tests.</li>\n</ul>\n<p><strong>Cross-Functional Collaboration</strong></p>\n<ul>\n<li>Partner with Growth Product Managers, Data Engineers, and Analysts to define technical requirements for growth initiatives.</li>\n<li>Translate business goals into technical specifications and system designs.</li>\n<li>Provide guidance on performance, reliability, and scalability trade-offs.</li>\n</ul>\n<p><strong>Monitoring &amp; Reliability</strong></p>\n<ul>\n<li>Implement monitoring and alerting for growth infrastructure services.</li>\n<li>Troubleshoot production issues and optimize for uptime and performance.</li>\n<li>Ensure data quality and consistency for reporting and decision-making.</li>\n</ul>\n<p><strong>Continuous Improvement</strong></p>\n<ul>\n<li>Evaluate new tools, frameworks, and platforms that accelerate growth engineering.</li>\n<li>Drive best practices in infrastructure as code, CI/CD, and automated testing.</li>\n<li>Train and mentor teammates on growth infrastructure principles.</li>\n</ul>\n<p><strong>Required 
Qualifications</strong></p>\n<ul>\n<li>Bachelor’s degree in Computer Science, Software Engineering, or related technical field.</li>\n<li>3+ years of experience building backend or infrastructure-focused services.</li>\n<li>Strong programming skills in languages such as Python, Go, or JavaScript.</li>\n<li>Experience with cloud platform infrastructure (e.g., AWS, GCP, Azure).</li>\n<li>Solid understanding of data pipelines, ETL processes, and databases.</li>\n<li>Experience with CI/CD systems and Infrastructure as Code (Terraform, CloudFormation, etc.).</li>\n<li>Experience with experimentation platforms (StatSig, Segment, LaunchDarkly, or in-house).</li>\n</ul>\n<p><strong>Nice to Have</strong></p>\n<ul>\n<li>Familiarity with growth metrics and product analytics tools (Amplitude, etc.).</li>\n<li>Experience designing and scaling A/B testing systems and feature flag orchestration.</li>\n<li>Proficient with containerization (Docker, Kubernetes) and distributed systems.</li>\n<li>Knowledge of observability tools (Datadog, etc.).</li>\n</ul>\n<p><strong>What Success Looks Like</strong></p>\n<ul>\n<li>Growth initiatives that deploy reliably and quickly with minimal manual intervention.</li>\n<li>High-fidelity data pipelines that enable real-time insight into key growth metrics.</li>\n<li>Growth teams are empowered to run experiments and launch campaigns without heavy infrastructure support.</li>\n</ul>\n<p><strong>Why Join Us?</strong></p>\n<ul>\n<li>This is a rare opportunity to be among the first engineers on a newly formed Growth team, with significant ownership and influence over both technical direction and product outcomes.</li>\n<li>You’ll be working on a product that scaled from ~2M to 250M users in under a year, operating in a massive and still largely untapped market.</li>\n<li>The impact of your work will be visible, measurable, and foundational to how the company grows next.</li>\n</ul>\n<p><strong>Full-Time Employee Benefits 
Include:</strong></p>\n<ul>\n<li>Competitive Salary &amp; Equity</li>\n<li>401(k) Program with a 4% match</li>\n<li>Health, Dental, Vision and Life Insurance</li>\n<li>Short Term and Long Term Disability</li>\n<li>Paid Parental, Medical, Caregiver Leave</li>\n<li>Commuter Benefits</li>\n<li>Monthly Wellness Stipend</li>\n<li>Autonomous Work Environment</li>\n<li>In Office Set-Up Reimbursement</li>\n<li>Flexible Time Off (FTO) + Holidays</li>\n<li>Quarterly Team Gatherings</li>\n<li>In Office Amenities</li>\n</ul>","url":"https://yubhub.co/jobs/job_35fc7f23-917","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Replit","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/replit.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/replit/37f81c18-c742-4f7d-bf81-34c3f5142973","x-work-arrangement":"hybrid","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":"$180K - $290K","x-skills-required":["Python","Go","JavaScript","AWS","GCP","Azure","CI/CD","Infrastructure as Code","Terraform","CloudFormation","Experimentation platforms"],"x-skills-preferred":["Growth metrics","Product analytics tools","Containerization","Distributed systems","Observability tools"],"datePosted":"2026-03-07T15:18:47.535Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Foster City, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Python, Go, JavaScript, AWS, GCP, Azure, CI/CD, Infrastructure as Code, Terraform, CloudFormation, Experimentation platforms, Growth metrics, Product analytics tools, Containerization, Distributed systems, Observability 
tools","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":180000,"maxValue":290000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_3f16d353-491"},"title":"Software Engineer, Infrastructure Reliability","description":"<p><strong>Software Engineer, Infrastructure Reliability</strong></p>\n<p><strong>Location</strong></p>\n<p>San Francisco</p>\n<p><strong>Employment Type</strong></p>\n<p>Full time</p>\n<p><strong>Department</strong></p>\n<p>Applied AI</p>\n<p><strong>Compensation</strong></p>\n<ul>\n<li>$255K – $385K</li>\n</ul>\n<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>\n</ul>\n<ul>\n<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>\n</ul>\n<ul>\n<li>401(k) retirement plan with employer match</li>\n</ul>\n<ul>\n<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>\n</ul>\n<ul>\n<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>\n</ul>\n<ul>\n<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or 
local law)</li>\n</ul>\n<ul>\n<li>Mental health and wellness support</li>\n</ul>\n<ul>\n<li>Employer-paid basic life and disability coverage</li>\n</ul>\n<ul>\n<li>Annual learning and development stipend to fuel your professional growth</li>\n</ul>\n<ul>\n<li>Daily meals in our offices, and meal delivery credits as eligible</li>\n</ul>\n<ul>\n<li>Relocation support for eligible employees</li>\n</ul>\n<ul>\n<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>\n</ul>\n<p><strong>About the Team</strong></p>\n<p>We’re hiring Software Engineers to join our Applied Infrastructure organization, and more specifically for our Database Systems and Online Storage teams. These teams operate with a high degree of autonomy and are deeply collaborative, with a shared mandate to raise the bar on safety, reliability, and velocity across OpenAI.</p>\n<p><strong>About the Role</strong></p>\n<p>You’ll be at the heart of scaling and hardening the infrastructure that powers some of the most widely used AI systems in the world. You’ll help ensure our systems are highly reliable, observable, performant, and secure—so researchers can iterate quickly, and products like ChatGPT and the OpenAI API can serve millions of users safely and effectively.</p>\n<p>This is a hands-on, high-leverage role for engineers who thrive on ownership, love solving deep technical problems across the stack, and want to work on systems that support cutting-edge research and deploy at global scale. 
You’ll play a key part in shaping technical direction, proactively improving system resilience, and collaborating closely with infra, product, and research teams to turn complex infrastructure into reliable platforms.</p>\n<p><strong>In this role you will:</strong></p>\n<ul>\n<li>Design, build, and operate reliable and performant systems used across engineering.</li>\n</ul>\n<ul>\n<li>Identify and fix performance bottlenecks and inefficiencies, ensuring our infrastructure can scale to the next order of magnitude.</li>\n</ul>\n<ul>\n<li>Dig deep to resolve complex issues.</li>\n</ul>\n<ul>\n<li>Continuously improve automation to reduce manual work. Improve internal tooling and our developer experience.</li>\n</ul>\n<ul>\n<li>Contribute to incident response, postmortems, and the development of best practices around system reliability and scalability.</li>\n</ul>\n<p><strong>You might thrive in this role if you:</strong></p>\n<ul>\n<li>Have a deep understanding of distributed systems principles and a proven track record in building and operating scalable and reliable systems.</li>\n</ul>\n<ul>\n<li>Have a keen eye for performance and optimization. 
You know how to squeeze the most performance out of complex, globally-distributed systems.</li>\n</ul>\n<ul>\n<li>Have experience operating orchestration systems such as Kubernetes at scale and building abstractions over cloud platforms.</li>\n</ul>\n<ul>\n<li>Are comfortable working in Linux environments, and with tools like Kubernetes, Terraform, CI/CD pipelines, and modern observability stacks.</li>\n</ul>\n<ul>\n<li>Are experienced in collaborating with cross-functional teams to ensure that reliability and scalability are considered in the design and development of new features and services.</li>\n</ul>\n<ul>\n<li>Have a humble attitude, an eagerness to help your colleagues, and a desire to do whatever it takes to make the team succeed.</li>\n</ul>\n<ul>\n<li>Own problems end-to-end, and are willing to pick up whatever knowledge you&#39;re missing to get the job done.</li>\n</ul>\n<ul>\n<li>Are comfortable with ambiguity and rapid change.</li>\n</ul>\n<p><strong>Qualifications:</strong></p>\n<ul>\n<li>4+ years of relevant industry experience, with 2+ years leading large scale, complex projects or teams as an engineer or tech lead</li>\n</ul>\n<ul>\n<li>A passion for distributed systems at scale with a focus on reliability, scalability, security, and continuous improvement.</li>\n</ul>\n<ul>\n<li>Proven experience as a reliability engineer, production engineer, or a similar role in a fast-paced, rapidly scaling company.</li>\n</ul>\n<ul>\n<li>Strong proficiency in cloud infrastructure (like AWS, GCP, Azure) and IaC tools such as Terraform. 
Proficiency in programming / scripting languages.</li>\n</ul>\n<ul>\n<li>Experience with containerization technologies and container orchestration platforms like Kubernetes.</li>\n</ul>\n<ul>\n<li>Experience with observability tools such as Datadog, Prometheus, Grafana, Splunk and ELK stack.</li>\n</ul>\n<ul>\n<li>Experience with microservices architecture and service mesh technologies.</li>\n</ul>\n<ul>\n<li>Knowledge of security best practices in cloud environments.</li>\n</ul>\n<ul>\n<li>Strong understanding of distributed systems, networking, and database technologies.</li>\n</ul>\n<ul>\n<li>Excellent problem-solving skills and ability to work in a fast-paced environment.</li>\n</ul>\n<p><strong>About OpenAI</strong></p>\n<p>OpenAI is an AI research and deployment company that aims to develop and apply general-purpose technologies to align with human values.</p>","url":"https://yubhub.co/jobs/job_3f16d353-491","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/779b340d-e645-4da1-a923-b3070a26d936","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$255K – $385K","x-skills-required":["cloud infrastructure","IaC tools","programming/scripting languages","containerization technologies","container orchestration platforms","observability tools","microservices architecture","service mesh technologies","security best practices","distributed systems","networking","database technologies"],"x-skills-preferred":["Kubernetes","Terraform","Datadog","Prometheus","Grafana","Splunk","ELK stack"],"datePosted":"2026-03-06T18:24:50.552Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San 
Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"cloud infrastructure, IaC tools, programming/scripting languages, containerization technologies, container orchestration platforms, observability tools, microservices architecture, service mesh technologies, security best practices, distributed systems, networking, database technologies, Kubernetes, Terraform, Datadog, Prometheus, Grafana, Splunk, ELK stack","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":255000,"maxValue":385000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_2b3a3ab9-2bc"},"title":"Member of Technical Staff, HPC Operations Engineering Manager","description":"<p><strong>Summary</strong></p>\n<p>Microsoft AI are looking for a talented Member of Technical Staff, HPC Operations Engineering Manager to join their MAI SuperIntelligence Team.</p>\n<p><strong>About the Role</strong></p>\n<p>In this role, you&#39;ll lead a team of Site Reliability Engineers who blend software engineering and systems engineering to keep our large-scale distributed AI infrastructure reliable and efficient. 
You&#39;ll work closely with ML researchers, data engineers, and product developers to design and operate the platforms that power training, fine-tuning, and serving generative AI models.</p>\n<p><strong>Accountabilities</strong></p>\n<ul>\n<li>Lead a team of experienced SREs to ensure uptime, resiliency and fault tolerance of AI model training and inference systems</li>\n</ul>\n<p><strong>The Candidate we&#39;re looking for</strong></p>\n<p><strong>Experience:</strong></p>\n<ul>\n<li>8+ years technical engineering experience with Site Reliability Engineering, DevOps, or Infrastructure Engineering Leadership roles</li>\n</ul>\n<p><strong>Technical skills:</strong></p>\n<ul>\n<li>Kubernetes, Docker, and container orchestration</li>\n<li>Public cloud platforms like Azure/AWS/GCP and infrastructure-as-code</li>\n</ul>\n<p><strong>Personal attributes:</strong></p>\n<ul>\n<li>Low ego individual</li>\n</ul>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Competitive salary</li>\n<li>Benefits and other compensation</li>\n</ul>","url":"https://yubhub.co/jobs/job_2b3a3ab9-2bc","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Microsoft AI","sameAs":"https://microsoft.ai","logo":"https://logos.yubhub.co/microsoft.ai.png"},"x-apply-url":"https://microsoft.ai/job/member-of-technical-staff-hpc-operations-engineering-manager-mai-superintelligence-team/","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"USD $139,900 – $274,800 per year","x-skills-required":["Kubernetes","Docker","container orchestration","public cloud 
platforms","infrastructure-as-code"],"x-skills-preferred":["monitoring & observability tools","Grafana","Datadog","OpenTelemetry"],"datePosted":"2026-03-06T07:26:34.569Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Mountain View"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Kubernetes, Docker, container orchestration, public cloud platforms, infrastructure-as-code, monitoring & observability tools, Grafana, Datadog, OpenTelemetry","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":139900,"maxValue":274800,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_93ab9223-11a"},"title":"Sr. Network Architect - Datacenter, Automation, Cloud","description":"<p><strong>What you&#39;ll do</strong></p>\n<p>As a Sr. Network Architect, you will lead the design and implementation of scalable, high-performance network architectures integrating both data center and cloud environments to support evolving business needs. 
You will develop and influence strategy and standards for data center engineering, setting technology direction across hybrid environments.</p>\n<p><strong>What you need</strong></p>\n<ul>\n<li>BS in Engineering or related field (MS preferred).</li>\n<li>10+ years of experience in network engineering/data center infrastructure with significant ownership of large-scale networks.</li>\n<li>Expert-level knowledge of DC and service provider protocols: BGP, MPLS, Segment Routing, EVPN, VXLAN, QoS, and traffic engineering.</li>\n<li>Advanced proficiency with Cisco ACI (Application Centric Infrastructure), including automation via APIC REST, SDK, and Ansible collections.</li>\n<li>Strong experience with network automation (Python, Ansible) and observability tools (gNMI, NetFlow/IPFIX/sFlow, Elastic, Grafana).</li>\n<li>Demonstrated leadership in mentoring teams and managing complex projects from inception to completion.</li>\n<li>Proven program management skills with the ability to align cross-functional teams and achieve results.</li>\n<li>Excellent analytical, organizational, and problem-solving abilities with a focus on KPIs and operational excellence.</li>\n<li>Strong written and verbal communication skills, with the ability to convey technical information to diverse audiences.</li>\n</ul>\n<p><strong>Why this matters</strong></p>\n<p>This role will shape the strategic direction of Synopsys&#39; network infrastructure, enabling rapid business growth and innovation. 
It will drive the adoption of cutting-edge technologies and best practices, enhancing Synopsys&#39; competitive edge in the industry.</p>","url":"https://yubhub.co/jobs/job_93ab9223-11a","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Synopsys","sameAs":"https://careers.synopsys.com","logo":"https://logos.yubhub.co/careers.synopsys.com.png"},"x-apply-url":"https://careers.synopsys.com/job/sunnyvale/sr-network-architect-datacenter-automation-cloud/44408/92101524752","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$144000-$216000","x-skills-required":["network engineering","data center infrastructure","network automation","observability tools","program management","leadership","communication skills"],"x-skills-preferred":["Cisco ACI","Python","Ansible","gNMI","NetFlow/IPFIX/sFlow","Elastic","Grafana"],"datePosted":"2026-03-04T17:10:09.777Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Sunnyvale, California, United States"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"network engineering, data center infrastructure, network automation, observability tools, program management, leadership, communication skills, Cisco ACI, Python, Ansible, gNMI, NetFlow/IPFIX/sFlow, Elastic, Grafana","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":144000,"maxValue":216000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_c72aeb6c-01c"},"title":"Software Engineer","description":"<p>We&#39;re looking for a talented Software Engineer to join our team. 
As a Software Engineer, you will work with our internal customers to design and implement new automated workflows, monitor our solutions to ensure they are running as expected, and debug and fix any issues found promptly, while communicating with our partners.</p>\n<p><strong>What you&#39;ll do</strong></p>\n<ul>\n<li>Work with our internal customers to design and implement new automated workflows</li>\n<li>Monitor our solutions to ensure they are running as expected, and debug and fix any issues found promptly, while communicating with our partners</li>\n</ul>\n<p><strong>What you need</strong></p>\n<ul>\n<li>3+ years of experience as a software engineer</li>\n<li>Object-oriented/scripting languages (e.g. Python, Groovy, C#, Java, or Ruby)</li>\n<li>CI/CD pipelines (e.g. Jenkins, GitLab CI)</li>\n<li>Source control management tools (e.g. Perforce, Git)</li>\n<li>Configuration management tools (e.g. Chef, Ansible, Terraform, Packer)</li>\n<li>Cloud platforms (e.g. AWS, GCP, Azure)</li>\n<li>Containerization technologies (e.g. Docker, Kubernetes)</li>\n<li>Secrets management tools (e.g. Vault)</li>\n<li>Artifact repositories (e.g. Artifactory, NPM, NuGet)</li>\n<li>Virtualization environments and tools (e.g. VMs, vSphere)</li>\n<li>Data and Observability tools (e.g. Splunk, Grafana, New Relic, OpenTelemetry)</li>\n</ul>\n<p><strong>Why this matters</strong></p>\n<p>As a Software Engineer at Electronic Arts, you will have the opportunity to work on a wide range of projects and technologies, and to contribute to the development of our next-generation entertainment experiences. 
You will be part of a collaborative and dynamic team, and will have the opportunity to learn and grow with the company.</p>","url":"https://yubhub.co/jobs/job_c72aeb6c-01c","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Electronic Arts","sameAs":"https://jobs.ea.com","logo":"https://logos.yubhub.co/jobs.ea.com.png"},"x-apply-url":"https://jobs.ea.com/en_US/careers/JobDetail/Build-Software-Engineer/209795","x-work-arrangement":"hybrid","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["object-oriented/scripting languages","CI/CD pipelines","source control management tools","configuration management tools","cloud platforms","containerization technologies","secrets management tools","artifact repositories","virtualization environments and tools","data and Observability tools"],"x-skills-preferred":["agile familiarity","growth-oriented mindset","collaboration skills"],"datePosted":"2026-01-01T16:48:49.347Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Guildford"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"object-oriented/scripting languages, CI/CD pipelines, source control management tools, configuration management tools, cloud platforms, containerization technologies, secrets management tools, artifact repositories, virtualization environments and tools, data and Observability tools, agile familiarity, growth-oriented mindset, collaboration skills"}]}