{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/monitoring-dashboards"},"x-facet":{"type":"skill","slug":"monitoring-dashboards","display":"Monitoring Dashboards","count":4},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_dc6154f8-cff"},"title":"Research Engineer, Pretraining Scaling - London","description":"<p>About Anthropic\\n\\nAnthropic&#39;s mission is to create reliable, interpretable, and steerable AI systems.\\n\\nAbout the Role:\\n\\nAs a Research Engineer on Anthropic&#39;s ML Performance and Scaling team, you&#39;ll ensure our frontier models train reliably, efficiently, and at scale. 
This is demanding, high-impact work that requires both deep technical expertise and a genuine passion for the craft of large-scale ML systems.\\n\\nResponsibilities:\\n\\n- Own critical aspects of our production pretraining pipeline, including model operations, performance optimization, observability, and reliability\\n- Debug and resolve complex issues across the full stack, from hardware errors and networking to training dynamics and evaluation infrastructure\\n- Design and run experiments to improve training efficiency, reduce step time, increase uptime, and enhance model performance\\n- Respond to on-call incidents during model launches, diagnosing problems quickly and coordinating solutions across teams\\n- Build and maintain production logging, monitoring dashboards, and evaluation infrastructure\\n- Add new capabilities to the training codebase, such as long context support or novel architectures\\n- Collaborate closely with teammates across SF and London, as well as with Tokens, Architectures, and Systems teams\\n- Contribute to the team&#39;s institutional knowledge by documenting systems, debugging approaches, and lessons learned\\n\\nYou May Be a Good Fit If You:\\n\\n- Have hands-on experience training large language models, or deep expertise with JAX, TPU, PyTorch, or large-scale distributed systems\\n- Genuinely enjoy both research and engineering work; you&#39;d describe your ideal split as roughly 50/50 rather than heavily weighted toward one or the other\\n- Are excited about being on-call for production systems, working long days during launches, and solving hard problems under pressure\\n- Thrive when working on whatever is most impactful, even if that changes day-to-day based on what the production model needs\\n- Excel at debugging complex, ambiguous problems across multiple layers of the stack\\n- Communicate clearly and collaborate effectively, especially when coordinating across time zones or during high-stress incidents\\n- Are passionate 
about the work itself and want to refine your craft as a research engineer\\n- Care about the societal impacts of AI and responsible scaling\\n\\nStrong Candidates May Also Have:\\n\\n- Previous experience training LLMs or working extensively with JAX/TPU, PyTorch, or other ML frameworks at scale\\n- Contributed to open-source LLM frameworks (e.g., open_lm, llm-foundry, mesh-transformer-jax)\\n- Published research on model training, scaling laws, or ML systems\\n- Experience with production ML systems, observability tools, or evaluation infrastructure\\n- Background as a systems engineer, quant, or in other roles requiring both technical depth and operational excellence\\n\\nWhat Makes This Role Unique:\\n\\nThis is not a typical research engineering role. The work is highly operational: you&#39;ll be deeply involved in keeping our production models training smoothly, which means being responsive to incidents, flexible about priorities, and comfortable with uncertainty. During launches, the team often works extended hours and may need to respond to issues on evenings and weekends.\\n\\nHowever, this operational intensity comes with extraordinary learning opportunities. You&#39;ll gain hands-on experience with some of the largest, most sophisticated training runs in the industry. You&#39;ll work alongside world-class researchers and engineers, and the institutional knowledge you build will compound in ways that can&#39;t be easily transferred. For people who thrive on this type of work, it&#39;s uniquely rewarding.\\n\\nWe&#39;re building a close-knit team of people who genuinely care about doing excellent work together. If you&#39;re someone who wants to be part of training the models that will define the future of AI, and you&#39;re excited about the full reality of what that entails, we&#39;d love to hear from you.\\n\\nLocation:\\n\\nThis role requires working in-office 5 days per week in London.\\n\\nDeadline to apply:\\n\\nNone. 
Applications will be reviewed on a rolling basis.\\n\\nThe annual compensation range for this role is listed below.\\n\\nFor sales roles, the range provided is the role’s On Target Earnings (&quot;OTE&quot;) range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role.\\n\\nAnnual Salary:\\n\\n£260,000-£630,000 GBP\\n\\nLogistics\\n\\nMinimum education:\\n\\nBachelor’s degree or an equivalent combination of education, training, and/or experience\\n\\nRequired field of study:\\n\\nA field relevant to the role as demonstrated through coursework, training, or professional experience\\n\\nMinimum years of experience:\\n\\nYears of experience required will correlate with the internal job level requirements for the position\\n\\nLocation-based hybrid policy:\\n\\nCurrently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.\\n\\nVisa sponsorship:\\n\\nWe do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.\\n\\nWe encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you&#39;re interested in this work. We think AI systems like the ones we&#39;re building have enormous social and ethical implications. We think this makes representation even more important, and we strive to include a range of diverse perspectives on our team.\\n\\nYour safety matters to us. 
To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses. In some cases, we may partner with vetted recruiting agencies who will identify themselves as working on behalf of Anthropic. Be cautious of emails from other domains. Legitimate Anthropic recruiters will never ask for money, fees, or banking information before your first day. If you&#39;re ever unsure about a communication, don&#39;t click any links; visit anthropic.com/careers directly for confirmed position openings.\\n\\nHow we&#39;re different\\n\\nWe believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact (advancing our long-term goals of steerable, trustworthy AI) rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and biology as with traditional efforts in computer science. 
We&#39;re an extremely collaborative group, and we host frequent research discussions to ensure that we are pursuing the h</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_dc6154f8-cff","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/4938436008","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"£260,000-£630,000 GBP","x-skills-required":["JAX","TPU","PyTorch","large-scale distributed systems","model operations","performance optimization","observability","reliability","debugging","complex issues","hardware errors","networking","training dynamics","evaluation infrastructure","experiments","training efficiency","step time","uptime","model performance","production logging","monitoring dashboards","codebase","long context support","novel architectures","collaboration","institutional knowledge","documentation","debugging approaches","lessons learned"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:42:55.023Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"London, UK"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"JAX, TPU, PyTorch, large-scale distributed systems, model operations, performance optimization, observability, reliability, debugging, complex issues, hardware errors, networking, training dynamics, evaluation infrastructure, experiments, training efficiency, step time, uptime, model performance, production logging, monitoring dashboards, codebase, long context support, novel architectures, collaboration, institutional knowledge, documentation, debugging approaches, lessons 
learned","baseSalary":{"@type":"MonetaryAmount","currency":"GBP","value":{"@type":"QuantitativeValue","minValue":260000,"maxValue":630000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_6960fd5f-0e8"},"title":"Research Engineer, Pretraining Scaling","description":"<p><strong>About the Role:\\n\\nAs a Research Engineer on Anthropic&#39;s ML Performance and Scaling team, you&#39;ll ensure our frontier models train reliably, efficiently, and at scale. This is demanding, high-impact work that requires both deep technical expertise and a genuine passion for the craft of large-scale ML systems.\\n\\n## Responsibilities:\\n\\n- Own critical aspects of our production pretraining pipeline, including model operations, performance optimization, observability, and reliability\\n- Debug and resolve complex issues across the full stack, from hardware errors and networking to training dynamics and evaluation infrastructure\\n- Design and run experiments to improve training efficiency, reduce step time, increase uptime, and enhance model performance\\n- Respond to on-call incidents during model launches, diagnosing problems quickly and coordinating solutions across teams\\n- Build and maintain production logging, monitoring dashboards, and evaluation infrastructure\\n- Add new capabilities to the training codebase, such as long context support or novel architectures\\n- Collaborate closely with teammates across SF and London, as well as with Tokens, Architectures, and Systems teams\\n- Contribute to the team&#39;s institutional knowledge by documenting systems, debugging approaches, and lessons learned\\n\\n## You May Be a Good Fit If You:\\n\\n- Have hands-on experience training large language models, or deep expertise with JAX, TPU, PyTorch, or large-scale distributed systems\\n- Genuinely enjoy both research and engineering work; you&#39;d describe your ideal split as roughly 50/50 
rather than heavily weighted toward one or the other\\n- Are excited about being on-call for production systems, working long days during launches, and solving hard problems under pressure\\n- Thrive when working on whatever is most impactful, even if that changes day-to-day based on what the production model needs\\n- Excel at debugging complex, ambiguous problems across multiple layers of the stack\\n- Communicate clearly and collaborate effectively, especially when coordinating across time zones or during high-stress incidents\\n- Are passionate about the work itself and want to refine your craft as a research engineer\\n- Care about the societal impacts of AI and responsible scaling\\n\\n## Strong Candidates May Also Have:\\n\\n- Previous experience training LLMs or working extensively with JAX/TPU, PyTorch, or other ML frameworks at scale\\n- Contributed to open-source LLM frameworks (e.g., open_lm, llm-foundry, mesh-transformer-jax)\\n- Published research on model training, scaling laws, or ML systems\\n- Experience with production ML systems, observability tools, or evaluation infrastructure\\n- Background as a systems engineer, quant, or in other roles requiring both technical depth and operational excellence\\n\\n## What Makes This Role Unique:\\n\\nThis is not a typical research engineering role. The work is highly operational: you&#39;ll be deeply involved in keeping our production models training smoothly, which means being responsive to incidents, flexible about priorities, and comfortable with uncertainty. During launches, the team often works extended hours and may need to respond to issues on evenings and weekends.\\n\\nHowever, this operational intensity comes with extraordinary learning opportunities. You&#39;ll gain hands-on experience with some of the largest, most sophisticated training runs in the industry. 
You&#39;ll work alongside world-class researchers and engineers, and the institutional knowledge you build will compound in ways that can&#39;t be easily transferred. For people who thrive on this type of work, it&#39;s uniquely rewarding.\\n\\nWe&#39;re building a close-knit team of people who genuinely care about doing excellent work together. If you&#39;re someone who wants to be part of training the models that will define the future of AI, and you&#39;re excited about the full reality of what that entails, we&#39;d love to hear from you.\\n\\nLocation: This role requires working in-office 5 days per week in San Francisco.\\n\\nDeadline to apply: None. Applications will be reviewed on a rolling basis.\\n\\nThe annual compensation range for this role is listed below.\\n\\nFor sales roles, the range provided is the role’s On Target Earnings (&quot;OTE&quot;) range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role.\\n\\nAnnual Salary: $350,000-$850,000 USD\\n\\n## Logistics\\n\\nMinimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience\\n\\nRequired field of study: A field relevant to the role as demonstrated through coursework, training, or professional experience\\n\\nMinimum years of experience: Years of experience required will correlate with the internal job level requirements for the position\\n\\nLocation-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.\\n\\nVisa sponsorship: We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.\\n\\nWe encourage you to apply even if you do not believe you meet every single qualification. 
Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you&#39;re interested in this work. We think AI systems like the ones we&#39;re building have enormous social and ethical implications. We think this makes representation even more important, and we strive to include a range of diverse perspectives on our team.\\n\\nYour safety matters to us. To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses. In some cases, we may partner with vetted recruiting agencies who will identify themselves as working on behalf of Anthropic. Be cautious of emails from other domains. Legitimate Anthropic recruiters will never ask for money, fees, or banking information before your first day. If you&#39;re ever unsure about a communication, don&#39;t click any links; visit anthropic.com/careers directly for confirmed position openings.\\n\\n## How we&#39;re different\\n\\nWe believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact (advancing our long-term goals of steerable, trustworthy AI) rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and biology as with traditional efforts in computer science. 
We&#39;re an extremely collaborative group, and we host frequent research discussions to ensure that we are pursuing</strong></p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_6960fd5f-0e8","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/4938432008","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$350,000-$850,000 USD","x-skills-required":["JAX","TPU","PyTorch","large-scale distributed systems","model operations","performance optimization","observability","reliability","debugging","complex issues","hardware errors","networking","training dynamics","evaluation infrastructure","experiments","training efficiency","step time","uptime","model performance","production logging","monitoring dashboards","new capabilities","long context support","novel architectures","collaboration","institutional knowledge","documentation","debugging approaches","lessons learned"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:42:31.268Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"JAX, TPU, PyTorch, large-scale distributed systems, model operations, performance optimization, observability, reliability, debugging, complex issues, hardware errors, networking, training dynamics, evaluation infrastructure, experiments, training efficiency, step time, uptime, model performance, production logging, monitoring dashboards, new capabilities, long context support, novel architectures, collaboration, institutional knowledge, documentation, debugging approaches, lessons 
learned","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":350000,"maxValue":850000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_a05bfa1a-d23"},"title":"Research Engineer, Pretraining Scaling","description":"<p><strong>About Anthropic</strong></p>\n<p>Anthropic&#39;s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.</p>\n<p><strong>About the Role:</strong></p>\n<p>Anthropic&#39;s ML Performance and Scaling team trains our production pretrained models, work that directly shapes the company&#39;s future and our mission to build safe, beneficial AI systems. As a Research Engineer on this team, you&#39;ll ensure our frontier models train reliably, efficiently, and at scale. This is demanding, high-impact work that requires both deep technical expertise and a genuine passion for the craft of large-scale ML systems.</p>\n<p>This role lives at the boundary between research and engineering. You&#39;ll work across our entire production training stack: performance optimization, hardware debugging, experimental design, and launch coordination. 
During launches, the team works in tight lockstep, responding to production issues that can&#39;t wait for tomorrow.</p>\n<p><strong>Responsibilities:</strong></p>\n<ul>\n<li>Own critical aspects of our production pretraining pipeline, including model operations, performance optimization, observability, and reliability</li>\n<li>Debug and resolve complex issues across the full stack—from hardware errors and networking to training dynamics and evaluation infrastructure</li>\n<li>Design and run experiments to improve training efficiency, reduce step time, increase uptime, and enhance model performance</li>\n<li>Respond to on-call incidents during model launches, diagnosing problems quickly and coordinating solutions across teams</li>\n<li>Build and maintain production logging, monitoring dashboards, and evaluation infrastructure</li>\n<li>Add new capabilities to the training codebase, such as long context support or novel architectures</li>\n<li>Collaborate closely with teammates across SF and London, as well as with Tokens, Architectures, and Systems teams</li>\n<li>Contribute to the team&#39;s institutional knowledge by documenting systems, debugging approaches, and lessons learned</li>\n</ul>\n<p><strong>You May Be a Good Fit If You:</strong></p>\n<ul>\n<li>Have hands-on experience training large language models, or deep expertise with JAX, TPU, PyTorch, or large-scale distributed systems</li>\n<li>Genuinely enjoy both research and engineering work—you&#39;d describe your ideal split as roughly 50/50 rather than heavily weighted toward one or the other</li>\n<li>Are excited about being on-call for production systems, working long days during launches, and solving hard problems under pressure</li>\n<li>Thrive when working on whatever is most impactful, even if that changes day-to-day based on what the production model needs</li>\n<li>Excel at debugging complex, ambiguous problems across multiple layers of the stack</li>\n<li>Communicate clearly and collaborate 
effectively, especially when coordinating across time zones or during high-stress incidents</li>\n<li>Are passionate about the work itself and want to refine your craft as a research engineer</li>\n<li>Care about the societal impacts of AI and responsible scaling</li>\n</ul>\n<p><strong>Strong Candidates May Also Have:</strong></p>\n<ul>\n<li>Previous experience training LLMs or working extensively with JAX/TPU, PyTorch, or other ML frameworks at scale</li>\n<li>Contributed to open-source LLM frameworks (e.g., open_lm, llm-foundry, mesh-transformer-jax)</li>\n<li>Published research on model training, scaling laws, or ML systems</li>\n<li>Experience with production ML systems, observability tools, or evaluation infrastructure</li>\n<li>Background as a systems engineer, quant, or in other roles requiring both technical depth and operational excellence</li>\n</ul>\n<p><strong>What Makes This Role Unique:</strong></p>\n<p>This is not a typical research engineering role. The work is highly operational: you&#39;ll be deeply involved in keeping our production models training smoothly, which means being responsive to incidents, flexible about priorities, and comfortable with uncertainty. During launches, the team often works extended hours and may need to respond to issues on evenings and weekends.</p>\n<p>However, this operational intensity comes with extraordinary learning opportunities. You&#39;ll gain hands-on experience with some of the largest, most sophisticated training runs in the industry. You&#39;ll work alongside world-class researchers and engineers, and the institutional knowledge you build will compound in ways that can&#39;t be easily transferred. For people who thrive on this type of work, it&#39;s uniquely rewarding.</p>\n<p>We&#39;re building a close-knit team of people who genuinely care about doing excellent work together. 
If you&#39;re someone who wants to be part of training the models that will define the future of AI—and you&#39;re excited about the full reality of what that entails—we&#39;d love to hear from you.</p>\n<p><strong>Logistics</strong></p>\n<p><strong>Education requirements:</strong> We require at least a Bachelor&#39;s degree in a related field or equivalent experience. <strong>Location-based hybrid policy:</strong> Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</p>\n<p><strong>Visa sponsorship:</strong> We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_a05bfa1a-d23","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/4938436008","x-work-arrangement":"onsite","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":"£260,000 - £630,000GBP","x-skills-required":["JAX","TPU","PyTorch","large-scale distributed systems","model operations","performance optimization","observability","reliability","debugging","experimental design","launch coordination","production logging","monitoring dashboards","evaluation infrastructure","collaboration","communication"],"x-skills-preferred":["open-source LLM frameworks","research on model training","scaling laws","ML systems","production ML systems","observability tools","evaluation infrastructure","systems engineering","quant","operational 
excellence"],"datePosted":"2026-03-08T13:44:15.893Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"London"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"JAX, TPU, PyTorch, large-scale distributed systems, model operations, performance optimization, observability, reliability, debugging, experimental design, launch coordination, production logging, monitoring dashboards, evaluation infrastructure, collaboration, communication, open-source LLM frameworks, research on model training, scaling laws, ML systems, production ML systems, observability tools, evaluation infrastructure, systems engineering, quant, operational excellence","baseSalary":{"@type":"MonetaryAmount","currency":"GBP","value":{"@type":"QuantitativeValue","minValue":260000,"maxValue":630000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_3514d749-08c"},"title":"Senior Support Engineer","description":"<p><strong>Senior Support Engineer - San Francisco</strong></p>\n<p><strong>Location</strong></p>\n<p>San Francisco</p>\n<p><strong>Employment Type</strong></p>\n<p>Full time</p>\n<p><strong>Department</strong></p>\n<p><strong>Compensation</strong></p>\n<ul>\n<li>$234K – $260K • Offers Equity</li>\n</ul>\n<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. 
In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>\n</ul>\n<ul>\n<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>\n</ul>\n<ul>\n<li>401(k) retirement plan with employer match</li>\n</ul>\n<ul>\n<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>\n</ul>\n<ul>\n<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>\n</ul>\n<ul>\n<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>\n</ul>\n<ul>\n<li>Mental health and wellness support</li>\n</ul>\n<ul>\n<li>Employer-paid basic life and disability coverage</li>\n</ul>\n<ul>\n<li>Annual learning and development stipend to fuel your professional growth</li>\n</ul>\n<ul>\n<li>Daily meals in our offices, and meal delivery credits as eligible</li>\n</ul>\n<ul>\n<li>Relocation support for eligible employees</li>\n</ul>\n<ul>\n<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>\n</ul>\n<p><strong>About the Team</strong></p>\n<p>The Technical Support team is responsible for ensuring that developers and enterprises can reliably build mission critical solutions using OpenAI models. We provide technical guidance, resolve complex issues and support customers in maximizing value and adoption from deploying our highly-capable models. 
We work closely with Technical Success, Product, Engineering and others to deliver the best possible experience to our customers at scale. We think from an automation-first mindset and leverage the latest in AI to scale our support operations. Join the Senior Support Engineering (SSE) team at OpenAI and help shape the future of Technical Support in the age of AI.</p>\n<p><strong>About the Role</strong></p>\n<p>We are looking for a Senior Support Engineer to collaborate directly with our strategic enterprise accounts and product teams, helping solve some of the most difficult problems faced by our Customers. You will be part of the best technical troubleshooting team at OpenAI, and our Customers and Engineering teams will look to you for technical guidance in addressing the most technically difficult issues in our environment.</p>\n<p>As a Senior Support Engineer, you will design and run operational processes to monitor our top strategic customers and a 24x7 response team. You’ll work closely with our Infrastructure and Engineering teams to deliver the best possible experience to customers at scale. Working directly with our most strategic Customers, you will be crucial to the success of the most innovative, disruptive, and high-scale AI solutions being built with the OpenAI API platform.</p>\n<p>The nature of this role will be low volume, high difficulty.</p>\n<p>This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.</p>\n<p><strong>In this role, you will:</strong></p>\n<ul>\n<li>Be among the foremost technical and troubleshooting experts for our API platform at OpenAI. You are the last line of defense before the core Engineering team.</li>\n</ul>\n<ul>\n<li>Proactively identify and implement opportunities to scale support operations by leveraging automation and advancements in AI technologies. 
Contribute to shaping the future of technical support in an AI-driven era.</li>\n</ul>\n<ul>\n<li>Configure and use advanced monitoring and alerting workflows to proactively detect customer impacting issues in real time.</li>\n</ul>\n<ul>\n<li>In partnership with engineering, contribute to reliability reviews and preparedness for new features, launches, or strategic customer requirement updates. Ensure that operational readiness (monitoring, alerting, and fallback plans) is in place for any such changes.</li>\n</ul>\n<ul>\n<li>Design and refine incident response processes and documentation across strategic customers, engineering and support teams.</li>\n</ul>\n<ul>\n<li>Analyze operational metrics and incident RCAs to identify areas for improvement. Proactively recommend and implement enhancements to monitoring dashboards, alert configurations, and support workflows.</li>\n</ul>\n<ul>\n<li>Provide support coverage during holidays and weekends based on business needs.</li>\n</ul>\n<p><strong>You might thrive in this role if you:</strong></p>\n<ul>\n<li>Have a Bachelor’s degree in Computer Science or a related field. A strong software engineering foundation is important for this role’s success.</li>\n</ul>\n<ul>\n<li>Have 8+ years of experience in technical operations roles such as SRE/NOC, designing monitoring systems and resolving production issues in fast-paced and mission-critical environments. A strong track record of troubleshooting complex technical problems at the systems level.</li>\n</ul>\n<ul>\n<li>Have deep familiarity with modern monitoring, alerting, and observability practices. Hands‑on experience setting up or managing metrics, logging, and tracing for distributed systems (e.g., understanding of SLIs/SLOs, alert tuning, dashboard creation).</li>\n</ul>\n<ul>\n<li>Have proven experience leading incident response for high‑severity outages or service disruptions. 
Able to perform real‑time incident coordination, root cause analysis, and communication with stakeholders.</li>\n</ul>\n<ul>\n<li>Are able to work effectively in a fast-paced environment, prioritize tasks, and manage multiple projects simultaneously.</li>\n</ul>\n<ul>\n<li>Are a strong communicator and team player, with excellent written and verbal communication skills.</li>\n</ul>\n<ul>\n<li>Are able to adapt to changing priorities and requirements, and are flexible in your approach to problem-solving.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_3514d749-08c","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/5431666c-530b-49c0-b67e-32477f9eaf5e","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$234K – $260K","x-skills-required":["Bachelor’s degree in Computer Science or a related field","8+ years of experience in technical operations roles such as SRE/NOC","Designing monitoring systems and resolving production issues in fast-paced and mission-critical environments","Troubleshooting complex technical problems at the systems level","Modern monitoring, alerting, and observability practices","Metrics, logging, and tracing for distributed systems","SLIs/SLOs, alert tuning, dashboard creation","Incident response for high‑severity outages or service disruptions","Real-time incident coordination, root cause analysis, and communication with stakeholders"],"x-skills-preferred":["Automation and advancements in AI technologies","Automation-first mindset and leveraging the latest in AI to scale support operations","Technical and troubleshooting expertise for API platform at OpenAI","Proactive identification and implementation of 
opportunities to scale support operations","Advanced monitoring and alerting workflows to proactively detect customer impacting issues in real time","Reliability reviews and preparedness for new features, launches, or strategic customer requirement updates","Operational readiness (monitoring, alerting, and fallback plans)","Incident response processes and documentation across strategic customers, engineering and support teams","Operational metrics and incident RCAs to identify areas for improvement","Enhancements to monitoring dashboards, alert configurations, and support workflows"],"datePosted":"2026-03-06T18:43:55.714Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Bachelor’s degree in Computer Science or a related field, 8+ years of experience in technical operations roles such as SRE/NOC, Designing monitoring systems and resolving production issues in fast-paced and mission-critical environments, Troubleshooting complex technical problems at the systems level, Modern monitoring, alerting, and observability practices, Metrics, logging, and tracing for distributed systems, SLIs/SLOs, alert tuning, dashboard creation, Incident response for high‑severity outages or service disruptions, Real-time incident coordination, root cause analysis, and communication with stakeholders, Automation and advancements in AI technologies, Automation-first mindset and leveraging the latest in AI to scale support operations, Technical and troubleshooting expertise for API platform at OpenAI, Proactive identification and implementation of opportunities to scale support operations, Advanced monitoring and alerting workflows to proactively detect customer impacting issues in real time, Reliability reviews and preparedness for new features, launches, or strategic customer requirement updates, Operational readiness (monitoring, 
alerting, and fallback plans), Incident response processes and documentation across strategic customers, engineering and support teams, Operational metrics and incident RCAs to identify areas for improvement, Enhancements to monitoring dashboards, alert configurations, and support workflows","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":234000,"maxValue":260000,"unitText":"YEAR"}}}]}