{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/failure-detection"},"x-facet":{"type":"skill","slug":"failure-detection","display":"Failure Detection","count":4},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_353075cd-1a3"},"title":"Software Engineer II - Backend","description":"<p>At Helpshift, we&#39;re looking for a skilled Software Engineer II to join our team. As a Software Engineer II, you will be responsible for designing and building scalable and resilient systems for our chatbot conversation engine and issue routing system. You will work closely with our clients as an extension of their team, using your expertise to bring their stories to life.</p>\n<p>We have 5 Leadership Principles that guide us in our goals:</p>\n<ul>\n<li>The Power of Partnerships: We collaborate with our clients to bring their stories to life.</li>\n<li>One Keywords: We combine the strength of a global platform with the agility of local studios.</li>\n<li>Raise the Game: We use technology and innovation to help our clients and the industry thrive.</li>\n<li>Embrace an Open World: We champion diversity of talent and ideas from every corner of our global community.</li>\n<li>Trust through Transparency: We pursue open and honest relationships with our people, clients, and communities.</li>\n</ul>\n<p>Responsibilities:</p>\n<ul>\n<li>Design and build scalable and resilient systems for our chatbot conversation engine and issue routing system.</li>\n<li>Work on a chatbot conversation engine scaling to millions of conversations per day.</li>\n<li>Design and build workflows for automatically routing issues based on events.</li>\n<li>Design and implement APIs for features to be consumed internally in the agent dashboard as well as external facing APIs to enable integrations.</li>\n<li>Collaborate with our clients as an extension of their team.</li>\n</ul>\n<p>Requirements:</p>\n<ul>\n<li>4+ years of medium/large scale server-side software development experience.</li>\n<li>Excellent verbal and written communication skills.</li>\n<li>Thorough knowledge of CS fundamentals: Data structures, time complexity of algorithms.</li>\n<li>Knowledge of Posix compliant Operating Systems (we develop on Mac OS X and deploy on GNU/Linux).</li>\n<li>Comfortable using CLI tools for achieving day-to-day tasks.</li>\n<li>Experience in writing Unit, Functional &amp; Regression tests.</li>\n<li>Knowledge of generative testing is preferred.</li>\n<li>Bachelor&#39;s Degree in Computer Science (or equivalent).</li>\n</ul>\n<p>Preferred skills:</p>\n<ul>\n<li>Experience in working with a distributed version control tool (we use Git).</li>\n<li>Knowledge of functional programming (we use Clojure).</li>\n<li>Knowledge of the JVM.</li>\n<li>Experience in working with any one of MongoDB, Redis, Elasticsearch, Kafka or Postgresql at scale.</li>\n<li>Experience with benchmarking systems for performance, failure detection.</li>\n</ul>\n<p>Benefits:</p>\n<ul>\n<li>Hybrid setup</li>\n<li>Worker&#39;s insurance</li>\n<li>Paid Time Offs</li>\n<li>Other employee benefits to be discussed by our Talent Acquisition team in India</li>\n</ul>\n<p>Helpshift embraces diversity. We are proud to be an equal opportunity workplace and do not discriminate on the basis of sex, race, color, age, sexual orientation, gender identity, religion, national origin, citizenship, marital status, veteran status, or disability status.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_353075cd-1a3","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Helpshift","sameAs":"https://apply.workable.com","logo":"https://logos.yubhub.co/j.com.png"},"x-apply-url":"https://apply.workable.com/j/6AE65D7D7E","x-work-arrangement":"hybrid","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["server-side software development","data structures","time complexity of algorithms","posix compliant operating systems","cli tools","unit testing","functional testing","regression testing","generative testing","distributed version control","functional programming","jvm","mongodb","redis","elasticsearch","kafka","postgresql"],"x-skills-preferred":["git","clojure","benchmarking systems","performance","failure detection"],"datePosted":"2026-03-09T10:57:00.354Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Pune, Maharashtra, India"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"server-side software development, data structures, time complexity of algorithms, posix compliant operating systems, cli tools, unit testing, functional testing, regression testing, generative testing, distributed version control, functional programming, jvm, mongodb, redis, elasticsearch, kafka, postgresql, git, clojure, benchmarking systems, performance, failure detection"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_5ea7b3e3-440"},"title":"Software Development Engineer","description":"<p><strong>About the Role</strong></p>\n<p>We are looking for an ambitious and experienced software development engineer to join our product and platform development team. As a key member of our team, you will collaborate with multiple teams to deliver high-quality and highly scalable products.</p>\n<p><strong>Responsibilities</strong></p>\n<p><strong>Core Platform Development</strong></p>\n<ul>\n<li>Work on the resiliency, availability, and latency of our core platform and services that are delivered to 820 million monthly active users and can scale to 100K+ RPS.</li>\n<li>Work with multiple databases and ensure scalability for the interacting platform components.</li>\n<li>Take ownership and publish (internal) reusable services and APIs.</li>\n<li>Enable feature teams to use the core platform.</li>\n</ul>\n<p><strong>Code Quality and Review</strong></p>\n<ul>\n<li>Write clean code with proper test coverage.</li>\n<li>Review others&#39; code and ensure that it is up to organization standards.</li>\n<li>Mentor junior members of the team.</li>\n</ul>\n<p><strong>Optimization and Performance</strong></p>\n<ul>\n<li>Optimize application for maximum speed and scalability.</li>\n<li>Participate in the hiring process.</li>\n<li>Keep calm and learn every day.</li>\n</ul>\n<p><strong>Nice to Have</strong></p>\n<ul>\n<li>Knowledge of frontend development and tools, especially JavaScript and React.</li>\n<li>Knowledge of functional programming is a plus (we use Clojure).</li>\n<li>Experience with benchmarking systems for performance, and failure detection.</li>\n<li>Experience in working with any of the above databases at scale is good to have.</li>\n</ul>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Hybrid setup</li>\n<li>Worker&#39;s insurance</li>\n<li>Paid Time Offs</li>\n<li>Other employee benefits to be discussed by our Talent Acquisition team in India.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_5ea7b3e3-440","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Helpshift","sameAs":"https://apply.workable.com","logo":"https://logos.yubhub.co/j.com.png"},"x-apply-url":"https://apply.workable.com/j/47EB4FCF3F","x-work-arrangement":"hybrid","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Clojure","Java","Data structures","Time complexity of algorithms","System design and architecture","YugabyteDB","Redis","Elasticsearch","Kafka","Postgresql","Posix compliant operating systems","CLI tools","Code editor","Unit and integration tests"],"x-skills-preferred":["JavaScript","React","Functional programming","Benchmarking systems","Failure detection"],"datePosted":"2026-03-09T10:55:29.632Z","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Clojure, Java, Data structures, Time complexity of algorithms, System design and architecture, YugabyteDB, Redis, Elasticsearch, Kafka, Postgresql, Posix compliant operating systems, CLI tools, Code editor, Unit and integration tests, JavaScript, React, Functional programming, Benchmarking systems, Failure detection"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_3a34dc62-295"},"title":"Software Engineer, Platform Systems","description":"<p><strong>Software Engineer, Platform Systems</strong></p>\n<p><strong>Location</strong></p>\n<p>San Francisco</p>\n<p><strong>Employment Type</strong></p>\n<p>Full time</p>\n<p><strong>Department</strong></p>\n<p>Scaling</p>\n<p><strong>Compensation</strong></p>\n<ul>\n<li>$310K – $460K • Offers Equity</li>\n</ul>\n<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>\n<ul>\n<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>\n</ul>\n<ul>\n<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>\n</ul>\n<ul>\n<li>401(k) retirement plan with employer match</li>\n</ul>\n<ul>\n<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>\n</ul>\n<ul>\n<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>\n</ul>\n<ul>\n<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>\n</ul>\n<ul>\n<li>Mental health and wellness support</li>\n</ul>\n<ul>\n<li>Employer-paid basic life and disability coverage</li>\n</ul>\n<ul>\n<li>Annual learning and development stipend to fuel your professional growth</li>\n</ul>\n<ul>\n<li>Daily meals in our offices, and meal delivery credits as eligible</li>\n</ul>\n<ul>\n<li>Relocation support for eligible employees</li>\n</ul>\n<ul>\n<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>\n</ul>\n<p>More details about our benefits are available to candidates during the hiring process.</p>\n<p>This role is at-will and OpenAI reserves the right to modify base pay and other compensation components at any time based on individual performance, team or company results, or market conditions.</p>\n<p><strong>About the Team</strong></p>\n<p>The Platform Systems team at OpenAI operates at the intersection of cutting-edge AI and large-scale distributed systems. We build the engineering and research infrastructure required to train OpenAI’s flagship models on some of the world’s largest, custom-built supercomputers.</p>\n<p>Our team develops core model training software and works deep in the stack - spanning collective communication, compute efficiency, parallelism strategies, fault tolerance, failure detection, and observability. The systems we build are foundational to OpenAI’s research velocity, enabling reliable, efficient training at frontier scale.</p>\n<p>We collaborate closely with researchers across the organization, continuously incorporating learnings from across OpenAI into the evolution of our training platform.</p>\n<p><strong>About the Role</strong></p>\n<p>As a Software Engineer, Platform Systems, you will design and build distributed systems that provide visibility into large-scale training workloads and help operate them reliably at scale.</p>\n<p>You’ll work on failure detection, tracing, and observability systems that identify slow or faulty nodes, surface performance bottlenecks, and help engineers understand and optimize massive distributed training jobs. This infrastructure is critical to operating OpenAI’s training stack and is actively evolving to support new use cases and increasingly complex workloads.</p>\n<p>This role sits at the core of our training infrastructure, blending systems engineering, performance analysis, and large-scale debugging.</p>\n<p><strong>In This Role, You Will</strong></p>\n<ul>\n<li>Design and build distributed failure detection, tracing, and profiling systems for large-scale AI training jobs</li>\n</ul>\n<ul>\n<li>Develop tooling to identify slow, faulty, or misbehaving nodes and provide actionable visibility into system behavior</li>\n</ul>\n<ul>\n<li>Improve observability, reliability, and performance across OpenAI’s training platform</li>\n</ul>\n<ul>\n<li>Debug and resolve issues in complex, high-throughput distributed systems</li>\n</ul>\n<ul>\n<li>Collaborate with systems, infrastructure, and research teams to evolve platform capabilities</li>\n</ul>\n<ul>\n<li>Extend and adapt failure detection systems or tracing systems to support new training paradigms and workloads</li>\n</ul>\n<p><strong>You Might Thrive in This Role If You</strong></p>\n<ul>\n<li>Care deeply about performance, stability, and observability in distributed systems</li>\n</ul>\n<ul>\n<li>Enjoy finding and fixing issues in large-scale systems and automating operational workflows</li>\n</ul>\n<ul>\n<li>Have experience writing low-level software where system details matter</li>\n</ul>\n<ul>\n<li>Understand hardware, operating systems, networking, concurrency, and distributed systems</li>\n</ul>\n<ul>\n<li>Have a background in high-performance computing or low-level systems engineering</li>\n</ul>\n<ul>\n<li>Are excited to work on critical infrastructure that powers frontier AI research</li>\n</ul>\n<p><strong>About OpenAI</strong></p>\n<p>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_3a34dc62-295","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/5e4ed6d1-2417-4bf5-bae0-905931c488e3","x-work-arrangement":"onsite","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":"$310K – $460K • Offers Equity","x-skills-required":["Distributed systems","Failure detection","Tracing","Observability","Performance analysis","Low-level software development","Hardware","Operating systems","Networking","Concurrency","Distributed systems engineering","High-performance computing"],"x-skills-preferred":["Cloud computing","Containerization","DevOps","Machine learning","Data science"],"datePosted":"2026-03-06T18:31:07.008Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Distributed systems, Failure detection, Tracing, Observability, Performance analysis, Low-level software development, Hardware, Operating systems, Networking, Concurrency, Distributed systems engineering, High-performance computing, Cloud computing, Containerization, DevOps, Machine learning, Data science","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":310000,"maxValue":460000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_49ecc85f-6cb"},"title":"Software Engineer, Platform Systems","description":"<p><strong>Software Engineer, Platform Systems</strong></p>\n<p><strong>Location</strong></p>\n<p>London, UK</p>\n<p><strong>Employment Type</strong></p>\n<p>Full time</p>\n<p><strong>Department</strong></p>\n<p>Scaling</p>\n<p><strong><strong>About the Team</strong></strong></p>\n<p>The Platform Systems team at OpenAI operates at the intersection of cutting-edge AI and large-scale distributed systems. We build the engineering and research infrastructure required to train OpenAI’s flagship models on some of the world’s largest, custom-built supercomputers.</p>\n<p>Our team develops core model training software and works deep in the stack - spanning collective communication, compute efficiency, parallelism strategies, fault tolerance, failure detection, and observability. The systems we build are foundational to OpenAI’s research velocity, enabling reliable, efficient training at frontier scale.</p>\n<p>We collaborate closely with researchers across the organisation, continuously incorporating learnings from across OpenAI into the evolution of our training platform.</p>\n<p><strong><strong>About the Role</strong></strong></p>\n<p>As a Software Engineer, Platform Systems, you will design and build distributed systems that provide visibility into large-scale training workloads and help operate them reliably at scale.</p>\n<p>You’ll work on failure detection, tracing, and observability systems that identify slow or faulty nodes, surface performance bottlenecks, and help engineers understand and optimise massive distributed training jobs. This infrastructure is critical to operating OpenAI’s training stack and is actively evolving to support new use cases and increasingly complex workloads.</p>\n<p>This role sits at the core of our training infrastructure, blending systems engineering, performance analysis, and large-scale debugging.</p>\n<p><strong><strong>In This Role, You Will</strong></strong></p>\n<ul>\n<li>Design and build distributed failure detection, tracing, and profiling systems for large-scale AI training jobs</li>\n</ul>\n<ul>\n<li>Develop tooling to identify slow, faulty, or misbehaving nodes and provide actionable visibility into system behaviour</li>\n</ul>\n<ul>\n<li>Improve observability, reliability, and performance across OpenAI’s training platform</li>\n</ul>\n<ul>\n<li>Debug and resolve issues in complex, high-throughput distributed systems</li>\n</ul>\n<ul>\n<li>Collaborate with systems, infrastructure, and research teams to evolve platform capabilities</li>\n</ul>\n<ul>\n<li>Extend and adapt failure detection systems or tracing systems to support new training paradigms and workloads</li>\n</ul>\n<p><strong><strong>You Might Thrive in This Role If You</strong></strong></p>\n<ul>\n<li>Care deeply about performance, stability, and observability in distributed systems</li>\n</ul>\n<ul>\n<li>Enjoy finding and fixing issues in large-scale systems and automating operational workflows</li>\n</ul>\n<ul>\n<li>Have experience writing low-level software where system details matter</li>\n</ul>\n<ul>\n<li>Understand hardware, operating systems, networking, concurrency, and distributed systems</li>\n</ul>\n<ul>\n<li>Have a background in high-performance computing or low-level systems engineering</li>\n</ul>\n<ul>\n<li>Are excited to work on critical infrastructure that powers frontier AI research</li>\n</ul>\n<p><strong><strong>About OpenAI</strong></strong></p>\n<p>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_49ecc85f-6cb","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/4349f80b-3518-4e4d-b9eb-3e5e9b490cc7","x-work-arrangement":"onsite","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Distributed systems","Failure detection","Tracing","Observability","Performance analysis","Low-level software development","Hardware","Operating systems","Networking","Concurrency","Distributed systems engineering","High-performance computing"],"x-skills-preferred":["Cloud computing","Containerization","DevOps","Machine learning","Data science"],"datePosted":"2026-03-06T18:27:53.954Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"London, UK"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Distributed systems, Failure detection, Tracing, Observability, Performance analysis, Low-level software development, Hardware, Operating systems, Networking, Concurrency, Distributed systems engineering, High-performance computing, Cloud computing, Containerization, DevOps, Machine learning, Data science"}]}