{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/tracing"},"x-facet":{"type":"skill","slug":"tracing","display":"Tracing","count":35},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_103d358c-a85"},"title":"Senior Software Engineer, Release Infra","description":"<p>Why join us</p>\n<p>Brex is the intelligent finance platform that enables companies to spend smarter and move faster in more than 200 markets. By combining global corporate cards and banking with intuitive spend management, bill pay, and travel software, Brex enables founders and finance teams to accelerate operations, gain real-time visibility, and control spend effortlessly.</p>\n<p>As a Senior Software Engineer, Infrastructure (Release Engineering) at Brex, you will design, build, and operate the core systems that power Brex’s release, observability, and incident management processes. 
You will partner closely with product, platform, and operations teams to ensure releases are safe, fast, and reliable, and that our infrastructure scales securely as Brex grows.</p>\n<p>Responsibilities</p>\n<ul>\n<li>Design, build, and maintain the release infrastructure that powers Brex’s deployment pipelines and incident workflows</li>\n<li>Drive technical strategy and architecture for release and observability systems, making them more scalable, reliable, and secure</li>\n<li>Collaborate with product, engineering, and operations partners to ensure Brex’s releases are safe, predictable, and low-friction</li>\n<li>Identify and deliver improvements to the end-to-end release process (from code merge to production) to reduce risk and cycle time</li>\n<li>Build and evolve tooling for observability and incident response, enabling fast detection, triage, and resolution</li>\n<li>Proactively identify and mitigate risks in our release and infrastructure stack, including performance, reliability, and security concerns</li>\n<li>Define, instrument, and monitor key metrics for release engineering (e.g., deployment frequency, change failure rate, MTTR) and use them to guide improvements</li>\n<li>Partner with other infrastructure and product teams to debug complex production issues and drive long-term fixes</li>\n<li>Contribute to and champion best practices in release engineering, reliability, and operational excellence across the organization</li>\n<li>Mentor other engineers on the team, providing technical guidance and code reviews to elevate the overall quality of our infrastructure</li>\n<li>Stay up-to-date on emerging tools and practices in release engineering, observability, and SRE, and bring relevant ideas into Brex’s stack</li>\n</ul>\n<p>Requirements</p>\n<ul>\n<li>7+ years of professional experience designing, building, and operating backend or infrastructure systems in production</li>\n<li>Strong proficiency in backend programming languages (e.g., Go, Java, 
Kotlin, or Python) with a focus on reliability and performance</li>\n<li>Hands-on experience with CI/CD and release pipelines (e.g., GitHub Actions, CircleCI, Buildkite, Argo, Spinnaker, Jenkins) including build, test, and deployment automation</li>\n<li>Experience architecting and operating scalable, high-availability distributed systems on cloud platforms (e.g., AWS, GCP, Azure)</li>\n<li>Deep familiarity with containerization and orchestration (e.g., Docker, Kubernetes) and infrastructure-as-code (e.g., Terraform, CloudFormation)</li>\n<li>Experience designing and maintaining observability tooling (metrics, logs, tracing) and integrating it into incident response workflows</li>\n<li>Strong understanding of reliability and SRE practices, including SLIs/SLOs, error budgets, and incident management best practices</li>\n<li>Experience designing and optimizing data storage systems (SQL and/or NoSQL) for operational and observability use cases</li>\n<li>Proven track record of improving release processes (e.g., reducing deployment risk, increasing deployment frequency, automating rollbacks)</li>\n<li>Comfort working cross-functionally with product and other engineering teams to debug complex production issues and ship changes safely</li>\n<li>Strong communication and collaboration skills, including writing clear design docs and driving technical decisions across teams</li>\n</ul>\n<ul>\n<li>Experience Level: senior</li>\n<li>Employment Type: full-time</li>\n<li>Workplace Type: hybrid</li>\n<li>Category: Engineering</li>\n<li>Industry: Technology</li>\n<li>Salary Range: Not stated</li>\n<li>Salary Min: Not stated</li>\n<li>Salary Max: Not stated</li>\n<li>Salary Currency: USD</li>\n<li>Salary Period: year</li>\n</ul>\n<p>Required Skills:</p>\n<ul>\n<li>Backend programming languages (e.g., Go, Java, Kotlin, or Python)</li>\n<li>CI/CD and release pipelines (e.g., GitHub Actions, CircleCI, Buildkite, Argo, Spinnaker, Jenkins)</li>\n<li>Containerization and orchestration (e.g., Docker, Kubernetes)</li>\n<li>Infrastructure-as-code (e.g., Terraform,
CloudFormation)</li>\n<li>Observability tooling (metrics, logs, tracing)</li>\n<li>Reliability and SRE practices (SLIs/SLOs, error budgets, incident management)</li>\n</ul>\n<p>Preferred Skills:</p>\n<ul>\n<li>Emerging tools and practices in release engineering, observability, and SRE</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_103d358c-a85","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Brex","sameAs":"https://brex.com/","logo":"https://logos.yubhub.co/brex.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/brex/jobs/8522011002","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Backend programming languages (e.g., Go, Java, Kotlin, or Python)","CI/CD and release pipelines (e.g., GitHub Actions, CircleCI, Buildkite, Argo, Spinnaker, Jenkins)","Containerization and orchestration (e.g., Docker, Kubernetes)","Infrastructure-as-code (e.g., Terraform, CloudFormation)","Observability tooling (metrics, logs, tracing)","Reliability and SRE practices (SLIs/SLOs, error budgets, incident management)"],"x-skills-preferred":["Emerging tools and practices in release engineering, observability, and SRE"],"datePosted":"2026-04-25T12:08:33.194Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"São Paulo, São Paulo, Brazil"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Backend programming languages (e.g., Go, Java, Kotlin, or Python), CI/CD and release pipelines (e.g., GitHub Actions, CircleCI, Buildkite, Argo, Spinnaker, Jenkins), Containerization and orchestration (e.g., Docker, Kubernetes), Infrastructure-as-code (e.g., Terraform, CloudFormation), Observability tooling (metrics, logs, tracing), Reliability and SRE practices (SLIs/SLOs, error 
budgets, incident management), Emerging tools and practices in release engineering, observability, and SRE"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_0a7e7df8-c61"},"title":"Site Reliability Engineer (Mid / Senior) - Platform Infrastructure","description":"<p>Elastic is seeking a Site Reliability Engineer (Mid/Senior) to join our Platform Infrastructure team. As a Site Reliability Engineer, you will be responsible for designing and developing tooling that facilitates building, testing, and shipping the Elastic Stack. You will also build and operate production services that power core aspects of the Elastic business, including downloads, Docker registry, maps service, and more.</p>\n<p>In this role, you will support internal adoption of the Elastic Stack for software development and analytics use cases. You will work closely with cross-functional teams to ensure the smooth operation of our platform and services.</p>\n<p>We are looking for a highly skilled engineer with a broad development background and experience in Site-Reliability Engineering. You should have multiple years of hands-on experience administering Linux systems, ideally at scale and in distributed environments. Experience helping operate a SaaS platform is a plus.</p>\n<p>You should be comfortable automating production systems collaboratively, treating configuration as code, managing it through version control, and working with tools such as Docker, Terraform, Puppet, Chef, Ansible, Salt, Packer, Kubernetes, or your own well-crafted shell scripts.</p>\n<p>A drive to automate and monitor everything is essential. If it can be automated, you&#39;ll find a way. Experience building reusable software components; open source contributions are a bonus. Comfort with a versioned, Git-based workflow driven by issues and pull requests is required.</p>\n<p>Strong Linux fundamentals are a must. 
You know your way around syscall tracing, TCP internals, init systems (sysvinit/runit/systemd), and aren&#39;t afraid to go deep when a problem demands it. A passion for open source, whether through code, mailing lists, documentation, or community participation, is a plus.</p>\n<p>Experience thriving in a distributed, asynchronous work environment with strong written communication habits is essential. A genuine appreciation for diverse, globally distributed teams and a collaborative, inclusive approach to getting work done is required.</p>\n<p>This role is eligible to participate in Elastic&#39;s stock program. Our total rewards package also includes a company-matched 401k with dollar-for-dollar matching up to 6% of eligible earnings, along with a range of other benefits offered with a holistic emphasis on employee well-being.</p>\n<p>The typical starting salary range for this role is $130,900-$192,600 USD. In select locations (including Seattle WA, Los Angeles CA, the San Francisco Bay Area CA, and the New York City Metro Area), an alternate range may apply as specified below.</p>\n<p>These ranges represent the lowest to highest salary we reasonably and in good faith believe we would pay for this role at the time of this posting. We may ultimately pay more or less than the posted range, and the ranges may be modified in the future.</p>\n<p>An employee&#39;s position within the salary range will be based on several factors including, but not limited to, relevant education, qualifications, certifications, experience, skills, geographic location, performance, and business or organizational needs.</p>\n<p>As a distributed company, diversity drives our identity. Whether you’re looking to launch a new career or grow an existing one, Elastic is the type of company where you can balance great work with great life. Your age is only a number. 
It doesn’t matter if you’re just out of college or your children are; we need you for what you can do.</p>\n<p>We strive to have parity of benefits across regions and while regulations differ from place to place, we believe taking care of our people is the right thing to do.</p>\n<p>Competitive pay based on the work you do here and not your previous salary</p>\n<p>Health coverage for you and your family in many locations</p>\n<p>Ability to craft your calendar with flexible locations and schedules for many roles</p>\n<p>Generous number of vacation days each year</p>\n<p>Increase your impact - We match up to $2000 (or local currency equivalent) for financial donations and service</p>\n<p>Up to 40 hours each year to use toward volunteer projects you love</p>\n<p>Embracing parenthood with minimum of 16 weeks of parental leave</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_0a7e7df8-c61","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Elastic","sameAs":"https://www.elastic.co/","logo":"https://logos.yubhub.co/elastic.co.png"},"x-apply-url":"https://job-boards.greenhouse.io/elastic/jobs/7852721","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$130,900-$192,600 USD","x-skills-required":["Python","JavaScript","Clojure","Haskell","Linux","Docker","Terraform","Puppet","Chef","Ansible","Salt","Packer","Kubernetes"],"x-skills-preferred":["Open source","Syscall tracing","TCP internals","Init systems","Version control","Git-based workflow"],"datePosted":"2026-04-25T12:08:29.195Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"United States"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Python, JavaScript, Clojure, Haskell, Linux, Docker, Terraform, Puppet, 
Chef, Ansible, Salt, Packer, Kubernetes, Open source, Syscall tracing, TCP internals, Init systems, Version control, Git-based workflow","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":130900,"maxValue":192600,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_11f43791-f75"},"title":"Senior Product Manager, APM - Observability","description":"<p>We&#39;re eager to welcome an outstanding Product Manager to join the Elastic Observability product team, with a focus on APM, Synthetics and RUM capabilities. With your knowledge and understanding of the Observability domain, you&#39;ll help ensure that we give our customers the best end to end full stack Observability experience for the applications we monitor.</p>\n<p>This role significantly impacts the success of our Observability product!</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Evolving our APM, Synthetics, and RUM experience, aligning at every step with organisational and company strategy.</li>\n<li>Working closely across teams, collaborating with Engineering, UX, Design and other teams to bring top-notch application monitoring experiences to our users.</li>\n<li>Working with our Sales &amp; Marketing teams on go-to-market, messaging, competitive analysis, enablement, and feature adoption.</li>\n<li>Tracking user engagement and feedback to inform product and feature evolution.</li>\n</ul>\n<p><strong>What You Bring</strong></p>\n<ul>\n<li>Ideally 3+ years in product management or closely related area</li>\n<li>Experience in APM and distributed tracing</li>\n<li>Synthetic transaction monitoring and RUM experience are optional but highly desired</li>\n<li>Comfort working with and converting abstract problems into concrete, prioritised solutions, working from the customer backwards</li>\n<li>Outstanding spoken and written communication skills</li>\n<li>Passion 
for industry trends, competitive landscape</li>\n<li>Experience working in remote, globally distributed work environment is optional but desired</li>\n<li>Knowledge of OpenTelemetry and of AI are optional but desired</li>\n</ul>\n<p><strong>Compensation</strong></p>\n<p>Compensation for this role is in the form of base salary. This role does not have a variable compensation component.</p>\n<p>The typical starting salary range for new hires in this role is listed below. In select locations (including Seattle WA, Los Angeles CA, the San Francisco Bay Area CA, and the New York City Metro Area), an alternate range may apply as specified below.</p>\n<p>These ranges represent the lowest to highest salary we reasonably and in good faith believe we would pay for this role at the time of this posting. We may ultimately pay more or less than the posted range, and the ranges may be modified in the future.</p>\n<p>An employee&#39;s position within the salary range will be based on several factors including, but not limited to, relevant education, qualifications, certifications, experience, skills, geographic location, performance, and business or organisational needs.</p>\n<p>Elastic believes that employees should have the opportunity to share in the value that we create together for our shareholders. Therefore, in addition to cash compensation, this role is currently eligible to participate in Elastic&#39;s stock program. 
Our total rewards package also includes a company-matched 401k with dollar-for-dollar matching up to 6% of eligible earnings, along with a range of other benefits offered with a holistic emphasis on employee well-being.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_11f43791-f75","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Elastic","sameAs":"https://www.elastic.co/","logo":"https://logos.yubhub.co/elastic.co.png"},"x-apply-url":"https://job-boards.greenhouse.io/elastic/jobs/7850986","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$133,100-$210,600 USD","x-skills-required":["APM","Distributed Tracing","Synthetic Transaction Monitoring","RUM","OpenTelemetry","AI"],"x-skills-preferred":[],"datePosted":"2026-04-25T12:07:30.934Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"United States"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"APM, Distributed Tracing, Synthetic Transaction Monitoring, RUM, OpenTelemetry, AI","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":133100,"maxValue":210600,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_fb470c7c-d85"},"title":"Senior Product Manager, APM - Observability","description":"<p>We&#39;re eager to welcome an outstanding Product Manager to join the Elastic Observability product team, with a focus on APM, Synthetics and RUM capabilities. 
With your knowledge and understanding of the Observability domain, you&#39;ll help ensure that we give our customers the best end to end full stack Observability experience for the applications we monitor.</p>\n<p>This role significantly impacts the success of our Observability product!</p>\n<p><strong>Key Responsibilities:</strong></p>\n<ul>\n<li>Evolving our APM, Synthetics, and RUM experience, aligning at every step with organisational and company strategy.</li>\n<li>Working closely across teams, collaborating with Engineering, UX, Design and other teams to bring top-notch application monitoring experiences to our users.</li>\n<li>Working with our Sales &amp; Marketing teams on go-to-market, messaging, competitive analysis, enablement, and feature adoption.</li>\n<li>Tracking user engagement and feedback to inform product and feature evolution.</li>\n</ul>\n<p><strong>Requirements:</strong></p>\n<ul>\n<li>Ideally 3+ years in product management or closely related area.</li>\n<li>Experience in APM and distributed tracing.</li>\n<li>Synthetic transaction monitoring and RUM experience are optional but highly desired.</li>\n<li>Comfort working with and converting abstract problems into concrete, prioritised solutions, working from the customer backwards.</li>\n<li>Outstanding spoken and written communication skills.</li>\n<li>Passion for industry trends, competitive landscape.</li>\n<li>Experience working in remote, globally distributed work environment is optional but desired.</li>\n<li>Knowledge of OpenTelemetry and of AI are optional but desired.</li>\n</ul>\n<p><strong>Additional Information:</strong></p>\n<ul>\n<li>Compensation for this role is in the form of base salary. This role does not have a variable compensation component.</li>\n<li>The typical starting salary range for new hires in this role is $128,300-$203,000 CAD.</li>\n<li>Elastic believes that employees should have the opportunity to share in the value that we create together for our shareholders. Therefore, in addition to cash compensation, this role is currently eligible to participate in Elastic&#39;s stock program.</li>\n<li>Our total rewards package also includes a company-matched Registered Retirement Savings Plan (RRSP) with dollar-for-dollar matching up to 6% of eligible earnings, along with a range of other benefits offered with a holistic emphasis on employee well-being.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_fb470c7c-d85","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Elastic","sameAs":"https://www.elastic.co/","logo":"https://logos.yubhub.co/elastic.co.png"},"x-apply-url":"https://job-boards.greenhouse.io/elastic/jobs/7850987","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$128,300-$203,000 CAD","x-skills-required":["APM","distributed tracing","Synthetic transaction monitoring","RUM","OpenTelemetry","AI"],"x-skills-preferred":["UX","Design","Sales & Marketing","competitive analysis","enablement","feature adoption"],"datePosted":"2026-04-25T12:07:06.596Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Canada"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"APM, distributed tracing, Synthetic transaction monitoring, RUM, OpenTelemetry, AI, UX, Design, Sales & Marketing, competitive analysis, enablement, feature
adoption","baseSalary":{"@type":"MonetaryAmount","currency":"CAD","value":{"@type":"QuantitativeValue","minValue":128300,"maxValue":203000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_8670699f-35b"},"title":"Senior Failure Analysis Engineer (Reliability Test Team)","description":"<p>As a Senior Failure Analysis (FA) Engineer, you will be embedded within our verification and validation development cycle, performing hands-on investigation of hardware failures coming directly out of test environments, qualification experiments, and design verification activities.</p>\n<p>You will work at the front of the product lifecycle, providing rapid, actionable root-cause findings that feed directly back into active design iterations. You will be the primary technical owner of understanding why prototype hardware fails, translating physical evidence into design, process, and supplier improvements that accelerate our path to reliable production hardware.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Performing failure analysis investigations on Printed Circuit Board Assemblies (PCBAs), components, modules, and subsystems returned from environmental, electrical, and mechanical cycling test campaigns</li>\n<li>Troubleshooting and isolating failures using drawings, schematics, circuit tracing, fault isolation techniques, and Design of Experiments (DOE) where appropriate</li>\n<li>Applying analytical techniques including optical and digital microscopy, X-ray/Computed Tomography (CT), mechanical cross-sectioning, curve tracing, and thermal imaging to identify failure mechanisms at the sub-system, board, and component level</li>\n<li>Coordinating with external analytical laboratories for advanced techniques such as Scanning Electron Microscopy / Energy Dispersive Spectroscopy (SEM/EDS), Fourier-Transform Infrared Spectroscopy (FTIR), and Focused Ion Beam (FIB) analysis when deeper characterization is required</li>\n<li>Authoring concise technical FA reports with documented findings, images, and conclusions that support design review discussions</li>\n<li>Translating physical failure evidence into clear, actionable root-cause conclusions and corrective action recommendations</li>\n<li>Collaborating with design, reliability, quality, manufacturing, and test engineering teams to drive corrective actions and prevent recurrence across prototype iterations</li>\n<li>Supporting Failure Modes and Effects Analysis, Design Verification, Validation, and Qualification planning with proactive FA insights and risk identification based on lessons learned from previous builds</li>\n<li>Identifying and flagging failure patterns related to PCB/PCBA assembly, Surface Mount Technology (SMT) soldering processes, and supplier quality issues to inform Design for Quality and Reliability</li>\n<li>Building and maintaining a failure database to capture institutional knowledge and enable trend analysis across prototype builds</li>\n<li>Establishing and continuously improving FA processes, workflows, and documentation standards appropriate for a fast-moving prototype environment</li>\n</ul>\n<p>Requirements include:</p>\n<ul>\n<li>Bachelor&#39;s degree in Electrical Engineering, Materials Science, Electronic Engineering Technology, Mechanical Engineering, or a related field</li>\n<li>5+ years of hands-on failure analysis experience on electronic hardware (PCBAs, motors, batteries, components, interconnects)</li>\n<li>Background in a prototype or New Product Introduction (NPI) environment with rapid iteration cycles</li>\n<li>Proficiency with standard FA tools: optical/digital microscopes, multimeters, oscilloscopes, and curve tracers</li>\n<li>Experience with thermal imaging and thermal characterization of electronic assemblies</li>\n<li>Experience with non-destructive and destructive analysis methods including X-ray/CT and mechanical cross-sectioning</li>\n<li>Demonstrated ability to read and interpret PCB schematics and layouts</li>\n<li>Experience coordinating with external analytical labs or vendors for advanced characterization</li>\n<li>Strong technical writing skills with the ability to produce clear, evidence-based FA reports</li>\n</ul>\n<p>Bonus qualifications include:</p>\n<ul>\n<li>Master&#39;s degree in Electrical Engineering, Materials Science, or a related field</li>\n<li>Experience with structured problem-solving methodologies such as 8D, Fishbone, or 5-Why analysis</li>\n<li>Familiarity with physics-of-failure approaches to root-cause analysis</li>\n<li>Familiarity with PCB/PCBA assembly and SMT processes, including the ability to identify assembly and soldering-related defects</li>\n</ul>\n<p>What we offer:</p>\n<p>All our positions offer a compensation package that includes equity and robust benefits. Base pay is just one component of Astranis&#39;s total rewards package. Your compensation also includes a significant equity package via incentive stock options, high-quality company-subsidized healthcare, disability and life insurance, 401(k) retirement planning, flexible PTO, and free on-site catered meals.</p>\n<p>Astranis pay ranges are informed and defined through professional-grade salary surveys and compensation data sources.
The actual base salary offered to a successful candidate will additionally be influenced by a variety of factors including experience, credentials &amp; certifications, educational attainment, skill level requirements, and the level and scope of the position.</p>\n<p>Base Salary: $145,000-$190,000 USD</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_8670699f-35b","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Astranis","sameAs":"https://astranis.com/","logo":"https://logos.yubhub.co/astranis.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/astranis/jobs/4661576006","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"Base Salary: $145,000-$190,000 USD","x-skills-required":["Failure Analysis","Printed Circuit Board Assemblies","Components","Modules","Subsystems","Environmental Testing","Electrical Testing","Mechanical Testing","Optical Microscopy","Digital Microscopy","X-ray Computed Tomography","Mechanical Cross-Sectioning","Curve Tracing","Thermal Imaging","Scanning Electron Microscopy","Energy Dispersive Spectroscopy","Fourier-Transform Infrared Spectroscopy","Focused Ion Beam Analysis","Design of Experiments","Fault Isolation Techniques","Root Cause Analysis","Corrective Action Recommendations","Design for Quality and Reliability","Failure Modes and Effects Analysis","Design Verification","Validation","Qualification Planning","Trend Analysis","Institutional Knowledge","Prototype Builds","New Product Introduction","Rapid Iteration Cycles","Structured Problem-Solving Methodologies","Physics-of-Failure Approaches","PCB/PCBA Assembly","Surface Mount Technology","Soldering Processes","Supplier Quality Issues"],"x-skills-preferred":[],"datePosted":"2026-04-24T15:20:34.136Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San 
Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Failure Analysis, Printed Circuit Board Assemblies, Components, Modules, Subsystems, Environmental Testing, Electrical Testing, Mechanical Testing, Optical Microscopy, Digital Microscopy, X-ray Computed Tomography, Mechanical Cross-Sectioning, Curve Tracing, Thermal Imaging, Scanning Electron Microscopy, Energy Dispersive Spectroscopy, Fourier-Transform Infrared Spectroscopy, Focused Ion Beam Analysis, Design of Experiments, Fault Isolation Techniques, Root Cause Analysis, Corrective Action Recommendations, Design for Quality and Reliability, Failure Modes and Effects Analysis, Design Verification, Validation, Qualification Planning, Trend Analysis, Institutional Knowledge, Prototype Builds, New Product Introduction, Rapid Iteration Cycles, Structured Problem-Solving Methodologies, Physics-of-Failure Approaches, PCB/PCBA Assembly, Surface Mount Technology, Soldering Processes, Supplier Quality Issues","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":145000,"maxValue":190000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_2c452765-84f"},"title":"Site Reliability Data Engineer","description":"<p>For over 31,000 growing businesses and HR teams seeking a comprehensive, all-in-one HR suite, Workable emerges as the premier solution. We uniquely combine the world&#39;s most widely adopted Applicant Tracking System (Workable Recruiting) with a full-spectrum employee management system (Workable HR).</p>\n<p>At Workable, we empower companies to focus on what truly matters: hiring the right people and fostering their growth. While we take HR seriously, we maintain a lighthearted and collaborative culture. 
At Workable, you&#39;ll find smart people who have fun, learn, innovate, and help others do the same.</p>\n<p>We respect everyone, we hire the best, and make sure every experience is special.</p>\n<p>As a Site Reliability Data Engineer based in Athens, you will play a critical role in ensuring the reliability, scalability, and performance of our data infrastructure and pipelines. You will collaborate closely with engineering teams to build and operate robust cloud-based systems, driving automation and observability across our platform.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Build, operate, and improve ETL/ELT pipelines, Spark workloads, and data warehouse components.</li>\n<li>Develop tools and automations to simplify and harden data pipeline workflows and general operations.</li>\n<li>Design, implement, and maintain scalable, highly available cloud infrastructure and services with a focus on automation and reliability.</li>\n<li>Develop and operate observability tooling for monitoring, logging, tracing, and data-pipeline metrics (freshness, completeness, latency, error rates).</li>\n<li>Collaborate with development teams to instrument, deploy, and troubleshoot production systems across microservices on Kubernetes.</li>\n<li>Operate, deploy, and monitor data infrastructure and cloud services from development to production.</li>\n<li>Own availability, scalability, and performance of systems, focusing on data pipelines and warehousing components.</li>\n<li>Partner with peer SREs to roll out production changes and mitigate data-related and infrastructure incidents.</li>\n<li>Troubleshoot issues across data pipelines and production systems; support capacity planning and analyze system and data workflow performance.</li>\n<li>Provide data engineering expertise to engineering teams and work cross-functionally with developers and analysts on designing, releasing, and troubleshooting production systems.</li>\n<li>Own team projects and ensure timely 
delivery.</li>\n</ul>\n<p>Requirements</p>\n<ul>\n<li>BS/MS degree in Computer Science, Engineering, or equivalent practical experience</li>\n<li>2+ years of experience in site reliability engineering, data engineering, or a closely related role, including programming</li>\n<li>Experience with a major cloud provider (AWS or GCP)</li>\n<li>Hands-on experience with infrastructure-as-code or configuration management tools (Terraform or Ansible)</li>\n<li>Experience with ETL/ELT concepts and tools (Airflow or dbt)</li>\n<li>Experience with Apache Spark or similar distributed data processing frameworks</li>\n<li>Experience with cloud data warehouses (BigQuery, Redshift, or Snowflake)</li>\n<li>Proficiency in at least one programming language (Python, Go, or Scala)</li>\n<li>Excellent written English proficiency</li>\n<li>Legally authorized to work in Greece</li>\n</ul>\n<p>Preferred Qualifications</p>\n<ul>\n<li>Production experience with Kubernetes</li>\n<li>Experience with centralized monitoring and logging systems</li>\n<li>Experience with streaming systems (Kafka or Spark Streaming)</li>\n</ul>\n<p>Benefits</p>\n<p>Our employees enjoy benefits that make them more productive and contribute directly to the development of their professional skills. We want to be able to attract the best of the best and make sure they keep getting better. 
On top of an exciting, vibrant and intellectually challenging environment, we are offering:</p>\n<ul>\n<li>Comprehensive Health Coverage: A robust health insurance plan that includes coverage for your dependents.</li>\n<li>Competitive Compensation: An attractive salary paired with a performance-based bonus plan.</li>\n<li>Flexible Work Model: Enjoy the best of both worlds with a hybrid setup: two days working from home and three in the office.</li>\n<li>Top-Tier Tools: Apple gear and access to the latest productivity tools to help you excel.</li>\n<li>Stay Connected: A mobile data plan to keep you online wherever you are.</li>\n<li>Delicious Perks: Fresh, tasty food at the office to fuel your productivity.</li>\n<li>Relocation Bonus: To help you settle in smoothly in Athens.</li>\n</ul>\n<p>Workable is most decidedly an equal opportunity employer. We want applicants of diverse backgrounds and hire without regard to colour, gender, religion, national origin, citizenship, disability, age, sexual orientation, or any other characteristic protected by law.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_2c452765-84f","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Workable"},"x-apply-url":"https://apply.workable.com/j/273C8E852D","x-work-arrangement":"hybrid","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Cloud computing","Data engineering","ETL/ELT","Apache Spark","Cloud data warehouses","Kubernetes","Infrastructure-as-code","Configuration management","Observability tooling","Monitoring","Logging","Tracing","Data-pipeline metrics"],"x-skills-preferred":["Production experience with Kubernetes","Centralized monitoring and logging systems","Streaming systems (Kafka or Spark 
Streaming)"],"datePosted":"2026-04-24T14:14:22.101Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Athens"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Cloud computing, Data engineering, ETL/ELT, Apache Spark, Cloud data warehouses, Kubernetes, Infrastructure-as-code, Configuration management, Observability tooling, Monitoring, Logging, Tracing, Data-pipeline metrics, Production experience with Kubernetes, Centralized monitoring and logging systems, Streaming systems (Kafka or Spark Streaming)"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_6d1365ca-bb1"},"title":"Lighting Artist","description":"<p>The Ubisoft Bucharest studio is looking for a Lighting Artist to join our ambitious Production team. The Lighting Artist we are looking for should have strong creative, technical, and visual skills and the ability to produce high quality lighting. The Lighting artist will work closely with the Production team and follow the Artistic Direction.</p>\n<p>Key responsibilities include: Lighting game environments, characters, and cinematics to direct the player’s attention, set the emotional tone, communicate atmosphere, and plot details, etc. Collaborate with the level team to develop the look and feel of the game. Find visual references based on the artistic and technical direction. Set up light sources to support narrative elements and gameplay progression. Balance the artistic aspects with technical constraints. Refine your integrated lighting effects based on the feedback received from the stakeholders in the project and debug where necessary.</p>\n<p>Requirements include: High level of attention to detail with strong problem-solving skills. Strong technical skills with procedural systems as well as an artistic eye for color, light, and FX. Creative with the ability to adapt to a realistic game style. 
Experience working with lighting software. Knowledge of Ray Tracing is a plus. Worked in at least one industry standard or proprietary engine.</p>\n<p>Personal qualities include: Easily adaptable, able to learn technical constraints of platforms, engines, and software fast. Effective communicator with both technical and non-technical parties. Strong interpersonal and communication skills, both written and spoken. Good knowledge of English.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_6d1365ca-bb1","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Ubisoft","sameAs":"https://www.ubisoft.com/","logo":"https://logos.yubhub.co/ubisoft.com.png"},"x-apply-url":"https://jobs.smartrecruiters.com/Ubisoft2/744000116214423-lighting-artist-the-division-2-","x-work-arrangement":"onsite","x-experience-level":null,"x-job-type":"full-time","x-salary-range":null,"x-skills-required":["lighting software","procedural systems","artistic eye for color, light, and FX","Ray Tracing","industry standard or proprietary engine"],"x-skills-preferred":[],"datePosted":"2026-04-24T12:16:55.563Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Bucharest"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"lighting software, procedural systems, artistic eye for color, light, and FX, Ray Tracing, industry standard or proprietary engine"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_3b419874-946"},"title":"Senior Production Engineer","description":"<p>Production Engineering ensures CoreWeave&#39;s cloud delivers world-class reliability, performance, and operational excellence. 
We are hiring a Senior Production Engineer to take direct, hands-on ownership of critical tooling that drives reliability and delivery success.</p>\n<p>In this role, you will work broadly across the cloud stack, designing, implementing, deploying, and operating systems that improve delivery velocity, service availability, and operational safety. You’ll be responsible for leading end-to-end technical projects, maintaining long-lived systems the team owns, and strengthening our operational foundations through durable engineering investments.</p>\n<p>This is a role for someone who enjoys building, debugging, and operating production systems. You will collaborate closely with service owners, but your primary impact comes from the reliability, quality, and maturity of the systems you deliver and maintain over time.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Take hands-on ownership of critical systems and frameworks, driving their architecture, implementation, and long-term evolution.</li>\n<li>Lead end-to-end delivery of engineering projects that improve availability, scalability, operational automation, and failure recovery.</li>\n<li>Build and maintain observability, alerting, automated remediation, and resilience testing for the systems you support.</li>\n<li>Participate in incident response as a subject-matter expert; drive deep root-cause investigations and implement lasting fixes.</li>\n<li>Improve runbooks, sources of truth, deployment workflows, and operational tooling to harden production readiness.</li>\n<li>Eliminate single points of failure and reduce operational toil through automation, refactors, and system redesigns.</li>\n<li>Ship production code regularly in Python, Go, or similar languages, and participate in on-call rotations.</li>\n<li>Maintain and mature long-term projects and frameworks owned by the team, ensuring they remain reliable, well-instrumented, and easy to operate.</li>\n<li>Collaborate with platform teams to ensure new 
features and services integrate cleanly with our reliability best-practices and tooling.</li>\n</ul>\n<p><strong>What You’ve Worked On (Minimum Qualifications)</strong></p>\n<ul>\n<li>7+ years of engineering experience building and operating distributed systems or cloud platforms.</li>\n<li>Demonstrated ability to debug complex production issues end-to-end, across services, infrastructure layers, and automation.</li>\n<li>Strong programming or scripting ability (Python, Go, or similar), with experience shipping and operating production services and tools.</li>\n<li>Deep knowledge of cloud-native technologies and distributed system patterns, particularly Kubernetes.</li>\n<li>Experience with modern observability stacks: metrics, tracing, structured logs, SLOs/SLIs, and incident lifecycle practices.</li>\n<li>A track record of successfully delivering hands-on reliability improvements through engineering execution.</li>\n</ul>\n<p><strong>Preferred Qualifications</strong></p>\n<ul>\n<li>Experience building internal tooling, frameworks, or automation that supports high-availability cloud operations.</li>\n<li>Familiarity with DR/BCP, service tiering, capacity planning, or chaos engineering.</li>\n<li>Background operating or building large-scale AI or GPU-accelerated infrastructure.</li>\n<li>Experience maintaining multi-year ownership of foundational production systems.</li>\n</ul>\n<p><strong>Why CoreWeave</strong></p>\n<p>At CoreWeave, we work hard, have fun, and move fast. You’ll join a team that values curiosity, ownership, and creative problem-solving. 
Production Engineering sits at the intersection of reliability and AI infrastructure, building systems that enable the world’s most powerful AI cloud.</p>\n<p><strong>Core Values</strong></p>\n<ul>\n<li>Be Curious at Your Core</li>\n<li>Act Like an Owner</li>\n<li>Empower Employees</li>\n<li>Deliver Best-in-Class Client Experiences</li>\n<li>Achieve More Together</li>\n</ul>\n<p>We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and enables the development of innovative solutions to complex problems. As we get set for takeoff, the organization&#39;s growth opportunities are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too. Come join us!</p>\n<p><strong>Compensation</strong></p>\n<p>The base salary range for this role is 160,000 to 214,000 SGD. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).</p>\n<p><strong>What We Offer</strong></p>\n<p>The range we’ve posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate which can include a variety of factors. 
These include qualifications, experience, interview performance, and location.</p>\n<p>In addition to a competitive salary, we offer a variety of benefits to support your needs, including:</p>\n<ul>\n<li>Medical, dental, and vision insurance - 100% paid for by CoreWeave</li>\n<li>Company-paid Life Insurance</li>\n<li>Voluntary supplemental life insurance</li>\n<li>Short and long-term disability insurance</li>\n<li>Flexible Spending Account</li>\n<li>Health Savings Account</li>\n<li>Tuition Reimbursement</li>\n<li>Ability to Participate in Employee Stock Purchase Program (ESPP)</li>\n<li>Mental Wellness Benefits through Spring Health</li>\n<li>Family-Forming support provided by Carrot</li>\n<li>Paid Parental Leave</li>\n<li>Flexible, full-service childcare support with Kinside</li>\n<li>401(k) with a generous employer match</li>\n<li>Flexible PTO</li>\n<li>Catered lunch each day in our office and data center locations</li>\n<li>A casual work environment</li>\n<li>A work culture focused on innovative disruption</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_3b419874-946","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4675297006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"160,000 to 214,000 SGD","x-skills-required":["cloud computing","distributed systems","Kubernetes","observability stacks","metrics","tracing","structured logs","SLOs/SLIs","incident lifecycle practices","Python","Go","engineering experience"],"x-skills-preferred":["internal tooling","frameworks","automation","DR/BCP","service tiering","capacity planning","chaos engineering","large-scale AI","GPU-accelerated 
infrastructure"],"datePosted":"2026-04-24T12:14:03.335Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Singapore"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"cloud computing, distributed systems, Kubernetes, observability stacks, metrics, tracing, structured logs, SLOs/SLIs, incident lifecycle practices, Python, Go, engineering experience, internal tooling, frameworks, automation, DR/BCP, service tiering, capacity planning, chaos engineering, large-scale AI, GPU-accelerated infrastructure","baseSalary":{"@type":"MonetaryAmount","currency":"SGD","value":{"@type":"QuantitativeValue","minValue":160000,"maxValue":214000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_95c49f85-a98"},"title":"Staff+ Software Engineer, Observability","description":"<p><strong>About the Role</strong></p>\n<p>Anthropic is seeking talented and experienced Software Engineers to join our Observability team within the Infrastructure organization. The Observability team owns the monitoring and telemetry infrastructure that every engineer and researcher at Anthropic depends on, from metrics and logging pipelines to distributed tracing, error analytics, alerting, and the dashboards and query interfaces that make it all actionable.</p>\n<p>As Anthropic scales its infrastructure across massive GPU, TPU, and Trainium clusters, the volume and complexity of operational data is growing by orders of magnitude. 
We’re building next-generation observability systems: high-throughput ingest pipelines, cost-efficient columnar storage, unified query layers across signals, and agentic diagnostic tools, to ensure that engineers can detect, diagnose, and resolve issues in minutes rather than hours, even as the systems they operate become exponentially more complex.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Design and build scalable telemetry ingest and storage pipelines for metrics, logs, traces, and error data across Anthropic’s multi-cluster infrastructure</li>\n</ul>\n<ul>\n<li>Own and evolve core observability platforms, driving migrations and architectural improvements that improve reliability, reduce cost, and scale with organisational growth</li>\n</ul>\n<ul>\n<li>Build instrumentation libraries, SDKs, and integrations that make it easy for engineering teams to emit high-quality telemetry from their services</li>\n</ul>\n<ul>\n<li>Drive alerting and SLO infrastructure that enables teams to define, monitor, and respond to reliability targets with minimal noise</li>\n</ul>\n<ul>\n<li>Reduce mean time to detection and resolution by building cross-signal correlation, unified query interfaces, and AI-assisted diagnostic tooling</li>\n</ul>\n<ul>\n<li>Partner with Research, Inference, Product, and Infrastructure teams to ensure observability solutions meet the unique needs of each organisation</li>\n</ul>\n<p><strong>You May Be a Good Fit If You</strong></p>\n<ul>\n<li>Have 10+ years of relevant industry experience building and operating large-scale observability or monitoring infrastructure</li>\n</ul>\n<ul>\n<li>Have deep experience with at least one observability signal area (metrics, logging, tracing, or error analytics) and familiarity with the others</li>\n</ul>\n<ul>\n<li>Understand high-throughput data pipelines, columnar storage engines, and the tradeoffs involved in ingesting and querying telemetry data at scale</li>\n</ul>\n<ul>\n<li>Have experience 
operating or building on top of observability platforms such as Prometheus, Grafana, ClickHouse, OpenTelemetry, or similar systems</li>\n</ul>\n<ul>\n<li>Have strong proficiency in at least one of Python, Rust, or Go</li>\n</ul>\n<ul>\n<li>Have excellent communication skills and enjoy partnering with internal teams to improve their operational visibility and incident response capabilities</li>\n</ul>\n<ul>\n<li>Are excited about building foundational infrastructure and are comfortable working independently on ambiguous, high-impact technical challenges</li>\n</ul>\n<p><strong>Strong Candidates May Also Have</strong></p>\n<ul>\n<li>Experience operating metrics systems at very high cardinality (hundreds of millions of active time series or more)</li>\n</ul>\n<ul>\n<li>Experience with log storage migrations or operating columnar databases (ClickHouse, BigQuery, or similar) for analytics workloads</li>\n</ul>\n<ul>\n<li>Experience with OpenTelemetry instrumentation, collector pipelines, and tail-based sampling strategies</li>\n</ul>\n<ul>\n<li>Experience building or operating alerting platforms, on-call tooling, or SLO frameworks at scale</li>\n</ul>\n<ul>\n<li>Experience with Kubernetes-native monitoring, eBPF-based observability, or continuous profiling</li>\n</ul>\n<ul>\n<li>Interest in applying AI/LLMs to operational workflows such as automated root cause analysis, anomaly detection, or intelligent alerting</li>\n</ul>\n<p><strong>Logistics</strong></p>\n<ul>\n<li>Minimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience</li>\n</ul>\n<ul>\n<li>Required field of study: A field relevant to the role as demonstrated through coursework, training, or professional experience</li>\n</ul>\n<ul>\n<li>Minimum years of experience: Years of experience required will correlate with the internal job level requirements for the position</li>\n</ul>\n<ul>\n<li>Location-based hybrid policy: Currently, we expect all staff to be in one 
of our offices at least 25% of the time. However, some roles may require more time in our offices.</li>\n</ul>\n<ul>\n<li>Visa sponsorship: We do sponsor visas! However, we aren’t able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</li>\n</ul>\n<p><strong>How we&#39;re different</strong></p>\n<p>We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact, advancing our long-term goals of steerable, trustworthy AI, rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and biology as with traditional efforts in computer science. We’re an extremely collaborative group, and we host frequent research discussions to ensure that we are pursuing the highest-impact work at any given time. As such, we greatly value communication skills.</p>\n<p><strong>Come work with us!</strong></p>\n<p>Anthropic is a public benefit corporation headquartered in San Francisco. 
We offer competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a lovely office space in which to collaborate with colleagues.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_95c49f85-a98","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5102440008","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"£325,000-£390,000 GBP","x-skills-required":["observability","telemetry","metrics","logging","tracing","error analytics","alerting","SLO infrastructure","cross-signal correlation","unified query interfaces","AI-assisted diagnostic tooling","Python","Rust","Go","Prometheus","Grafana","ClickHouse","OpenTelemetry"],"x-skills-preferred":["high-throughput data pipelines","columnar storage engines","Kubernetes-native monitoring","eBPF-based observability","continuous profiling","AI/LLMs","automated root cause analysis","anomaly detection","intelligent alerting"],"datePosted":"2026-04-18T15:57:27.177Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"London, UK"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"observability, telemetry, metrics, logging, tracing, error analytics, alerting, SLO infrastructure, cross-signal correlation, unified query interfaces, AI-assisted diagnostic tooling, Python, Rust, Go, Prometheus, Grafana, ClickHouse, OpenTelemetry, high-throughput data pipelines, columnar storage engines, Kubernetes-native monitoring, eBPF-based observability, continuous profiling, AI/LLMs, automated root cause analysis, anomaly detection, 
intelligent alerting","baseSalary":{"@type":"MonetaryAmount","currency":"GBP","value":{"@type":"QuantitativeValue","minValue":325000,"maxValue":390000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_0ed46937-df6"},"title":"Staff Developer Success Engineer - West","description":"<p>We&#39;re looking for a Staff Developer Success Engineer to join our team. As a frontline technical expert for our developer community, you will help users deploy and scale Temporal in cloud-native environments. You will also troubleshoot complex infrastructure issues, optimize performance, and develop automation solutions.</p>\n<p>At Temporal, you&#39;ll work with cloud-native, highly scalable infrastructure spanning AWS, GCP, Kubernetes, and microservices. You&#39;ll gain deep expertise in container orchestration, networking, and observability while learning from complex, real-world customer use cases.</p>\n<p>As a Staff Developer Success Engineer, you&#39;ll work directly with developers to debug complex infrastructure issues, optimize cloud performance, and enhance reliability for Temporal users. You&#39;ll develop observability solutions (Grafana, Prometheus), improve networking (load balancing, DNS, ingress/egress), and automate infrastructure operations (Terraform, IaC) to help customers run Temporal efficiently at scale.</p>\n<p>Once ramped up, we expect you to independently drive technical solutions, whether debugging complex production issues or designing infrastructure best practices. 
Don&#39;t worry, we have seasoned engineers and mentors to support you along the way!</p>\n<p>As a Staff Developer Success Engineer you will engage directly with developers, engineering teams, and product teams to understand infrastructure challenges and provide solutions that enhance scalability, performance, and reliability.</p>\n<p>Your insights will influence platform improvements, from enhancing observability tooling to developing self-service infrastructure solutions that simplify troubleshooting (e.g., building diagnostic tools similar to Twilio’s Network Test).</p>\n<p>You’ll serve as a bridge between developers and infrastructure, ensuring that reliability, performance, and developer experience remain top priorities as Temporal scales.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_0ed46937-df6","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Temporal","sameAs":"https://temporal.io/","logo":"https://logos.yubhub.co/temporal.io.png"},"x-apply-url":"https://job-boards.greenhouse.io/temporaltechnologies/jobs/5076742007","x-work-arrangement":"remote","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$170,000 - $215,000","x-skills-required":["cloud-native infrastructure","container orchestration","networking","observability","infrastructure automation","Terraform","IaC","Kubernetes","AWS","GCP","Python","Java","Go","Grafana","Prometheus"],"x-skills-preferred":["security certificate management","security implementation","use case analysis","Temporal design decisions","architecture best practices","EKS","GKE","OpenTracing","Ansible","CDK"],"datePosted":"2026-04-18T15:56:34.606Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"United States - Remote 
Opportunity"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"cloud-native infrastructure, container orchestration, networking, observability, infrastructure automation, Terraform, IaC, Kubernetes, AWS, GCP, Python, Java, Go, Grafana, Prometheus, security certificate management, security implementation, use case analysis, Temporal design decisions, architecture best practices, EKS, GKE, OpenTracing, Ansible, CDK","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":170000,"maxValue":215000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_24176cb8-311"},"title":"Member of Technical Staff - Compute Infrastructure","description":"<p>We&#39;re seeking a highly skilled Member of Technical Staff to join our Compute Infrastructure team. As a key member of this team, you will design, build, and operate massive-scale clusters and orchestration platforms that power frontier AI training, inference, and agent workloads at unprecedented scale.</p>\n<p>In this role, you will push the boundaries of container orchestration far beyond existing systems like Kubernetes, manage exascale compute resources, optimize for high-performance training runs and production serving, and collaborate closely with research and systems teams to deliver reliable, ultra-scalable infrastructure that enables xAI&#39;s next-generation models and applications.</p>\n<p>Responsibilities include building and managing massive-scale clusters, designing, developing, and extending an in-house container orchestration platform, collaborating with research teams to architect and optimize compute clusters, profiling, debugging, and resolving complex system-level performance bottlenecks, and owning end-to-end infrastructure initiatives.</p>\n<p>To succeed in this role, you will need deep 
expertise in virtualization technologies and advanced containerization/sandboxing, strong proficiency in systems programming languages such as C/C++ and Rust, and a proven track record of profiling, debugging, and optimizing complex system-level performance issues.</p>\n<p>Preferred skills include experience in Linux kernel development, hypervisor extensions, or low-level system programming for compute-intensive workloads; experience operating or designing large-scale AI training/inference clusters; and familiarity with performance tools, tracing, and debugging in production distributed environments.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_24176cb8-311","directApply":true,"hiringOrganization":{"@type":"Organization","name":"xAI","sameAs":"https://www.xai.com/","logo":"https://logos.yubhub.co/xai.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/xai/jobs/5052040007","x-work-arrangement":"onsite","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$180,000 - $440,000 USD","x-skills-required":["Deep expertise in virtualization technologies (KVM, Xen, QEMU) and advanced containerization/sandboxing (Kata, Firecracker, gVisor, Sysbox, or equivalent)","Strong proficiency in systems programming languages such as C/C++ and Rust","Proven track record profiling, debugging, and optimizing complex system-level performance issues, with deep knowledge of Linux kernel internals, resource management, scheduling, memory management, and low-level engineering","Hands-on experience building or significantly enhancing distributed compute platforms, orchestration systems, or high-performance infrastructure at scale"],"x-skills-preferred":["Experience in Linux kernel development, hypervisor extensions, or low-level system programming for compute-intensive workloads","Proven track record operating or designing large-scale AI 
training/inference clusters (GPU/TPU scale)","Experience with custom runtimes, isolation techniques, or bespoke platforms for specialized AI compute","Familiarity with performance tools, tracing, and debugging in production distributed environments"],"datePosted":"2026-04-18T15:55:50.213Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Palo Alto, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Deep expertise in virtualization technologies (KVM, Xen, QEMU) and advanced containerization/sandboxing (Kata, Firecracker, gVisor, Sysbox, or equivalent), Strong proficiency in systems programming languages such as C/C++ and Rust, Proven track record profiling, debugging, and optimizing complex system-level performance issues, with deep knowledge of Linux kernel internals, resource management, scheduling, memory management, and low-level engineering, Hands-on experience building or significantly enhancing distributed compute platforms, orchestration systems, or high-performance infrastructure at scale, Experience in Linux kernel development, hypervisor extensions, or low-level system programming for compute-intensive workloads, Proven track record operating or designing large-scale AI training/inference clusters (GPU/TPU scale), Experience with custom runtimes, isolation techniques, or bespoke platforms for specialized AI compute, Familiarity with performance tools, tracing, and debugging in production distributed environments","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":180000,"maxValue":440000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_6d4292d1-227"},"title":"Software Engineer, Sandboxing (Systems)","description":"<p>We are seeking a Linux OS and System Programming Subject Matter Expert to join our Infrastructure team. 
In this role, you&#39;ll work on accelerating and optimizing our virtualization and VM workloads that power our AI infrastructure.</p>\n<p>Your expertise in low-level system programming, kernel optimization, and virtualization technologies will be crucial in ensuring Anthropic can scale our compute infrastructure efficiently and reliably for training and serving frontier AI models.</p>\n<p>Responsibilities:</p>\n<p>Optimize our virtualization stack, improving performance, reliability, and efficiency of our VM environments</p>\n<p>Design and implement kernel modules, drivers, and system-level components to enhance our compute infrastructure</p>\n<p>Investigate and resolve performance bottlenecks in virtualized environments</p>\n<p>Collaborate with cloud engineering teams to optimize interactions between our workloads and underlying hardware</p>\n<p>Develop tooling for monitoring and improving virtualization performance</p>\n<p>Work with our ML engineers to understand their computational needs and optimize our systems accordingly</p>\n<p>Contribute to the design and implementation of our next-generation compute infrastructure</p>\n<p>Share knowledge with team members on low-level systems programming and Linux kernel internals</p>\n<p>Partner with cloud providers to influence hardware and platform features for AI workloads</p>\n<p>You may be a good fit if you:</p>\n<p>Have experience with Linux kernel development, system programming, or related low-level software engineering</p>\n<p>Understand virtualization technologies (KVM, Xen, QEMU, etc.) 
and their performance characteristics</p>\n<p>Have experience optimizing system performance for compute-intensive workloads</p>\n<p>Are familiar with modern CPU architectures and memory systems</p>\n<p>Have strong C/C++ programming skills and ideally experience with systems languages like Rust</p>\n<p>Understand Linux resource management, scheduling, and memory management</p>\n<p>Have experience profiling and debugging system-level performance issues</p>\n<p>Are comfortable diving into unfamiliar codebases and technical domains</p>\n<p>Are results-oriented, with a bias towards practical solutions and measurable impact</p>\n<p>Care about the societal impacts of AI and are passionate about building safe, reliable systems</p>\n<p>Strong candidates may also have experience with:</p>\n<p>GPU virtualization and acceleration technologies</p>\n<p>Cloud infrastructure at scale (AWS, GCP)</p>\n<p>Container technologies and their underlying implementation (Docker, containerd, runc, OCI)</p>\n<p>eBPF programming and kernel tracing tools</p>\n<p>OS-level security hardening and isolation techniques</p>\n<p>Developing custom scheduling algorithms for specialized workloads</p>\n<p>Performance optimization for ML/AI specific workloads</p>\n<p>Network stack optimization and high-performance networking</p>\n<p>Experience with TPUs, custom ASICs, or other ML accelerators</p>\n<p>Representative projects:</p>\n<p>Optimizing kernel parameters and VM configurations to reduce inference latency for large language models</p>\n<p>Implementing custom memory management schemes for large-scale distributed training</p>\n<p>Developing specialized I/O schedulers to prioritize ML workloads</p>\n<p>Creating lightweight virtualization solutions tailored for AI inference</p>\n<p>Building monitoring and instrumentation tools to identify system-level bottlenecks</p>\n<p>Enhancing communication between VMs for distributed training workloads</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML 
job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_6d4292d1-227","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5025591008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$300,000-$405,000 USD","x-skills-required":["Linux kernel development","System programming","Virtualization technologies","C/C++ programming","Rust programming","Linux resource management","Scheduling","Memory management"],"x-skills-preferred":["GPU virtualization","Cloud infrastructure","Container technologies","eBPF programming","Kernel tracing tools","OS-level security hardening","Custom scheduling algorithms","Performance optimization for ML/AI","Network stack optimization"],"datePosted":"2026-04-18T15:55:40.026Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Linux kernel development, System programming, Virtualization technologies, C/C++ programming, Rust programming, Linux resource management, Scheduling, Memory management, GPU virtualization, Cloud infrastructure, Container technologies, eBPF programming, Kernel tracing tools, OS-level security hardening, Custom scheduling algorithms, Performance optimization for ML/AI, Network stack optimization","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":300000,"maxValue":405000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_491db8e9-776"},"title":"Staff Site Reliability Engineer- Splunk Expert","description":"<p>We are seeking a 
highly technical Staff Site Reliability Engineer with deep expertise in Splunk and Grafana to own and evolve our observability ecosystem.</p>\n<p>As a Staff Site Reliability Engineer, you will move beyond simple monitoring to architect a comprehensive, scalable telemetry platform. You will be our subject-matter expert in Splunk optimisation, ensuring our logging architecture is performant, cost-effective, and deeply integrated with our automated workflows.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Splunk Architecture &amp; Optimisation: Lead the design and tuning of Splunk environments. Optimise indexer performance, search efficiency, and data models to ensure rapid troubleshooting and cost-efficiency.</li>\n</ul>\n<ul>\n<li>Advanced Visualisation: Architect and maintain sophisticated Grafana dashboards that correlate disparate data sources into a single pane of glass for real-time system health.</li>\n</ul>\n<ul>\n<li>Automated Infrastructure: Design, build, and maintain scalable observability infrastructure using tools like Terraform.</li>\n</ul>\n<ul>\n<li>Pipeline Engineering: Optimise the collection, processing, and storage of telemetry data (Metrics, Logs, Traces) to ensure high reliability and low latency.</li>\n</ul>\n<ul>\n<li>Workflow Automation: Develop custom Splunk workflows and integrations that trigger automated responses to system events, reducing Mean Time to Resolution (MTTR).</li>\n</ul>\n<ul>\n<li>Incident Response: Participate in on-call rotations and lead post-incident reviews to drive systemic improvements through &#39;observability-driven development.&#39;</li>\n</ul>\n<p>Required skills and experience include:</p>\n<ul>\n<li>Splunk Mastery: Deep, hands-on experience with Splunk administration, search optimisation (SPL), and architecting complex data pipelines.</li>\n</ul>\n<ul>\n<li>Grafana Expertise: Proven ability to build actionable, intuitive dashboards in Grafana that go beyond simple charts to provide deep operational 
insights.</li>\n</ul>\n<ul>\n<li>SRE Mindset: 8+ years of experience in an SRE, DevOps, or Systems Engineering role with a focus on high-availability systems.</li>\n</ul>\n<ul>\n<li>Programming Proficiency: Strong coding skills in Go, Python, or Ruby for building internal tools and automating observability workflows.</li>\n</ul>\n<ul>\n<li>Telemetry Standards: Hands-on experience with OpenTelemetry (OTel), Prometheus, or similar frameworks for instrumenting applications.</li>\n</ul>\n<ul>\n<li>Distributed Systems: Deep understanding of Linux internals, networking (TCP/IP, DNS, Load Balancing), and container orchestration (Kubernetes/EKS).</li>\n</ul>\n<p>Bonus skills include:</p>\n<ul>\n<li>Tracing: Implementation of distributed tracing (Jaeger, Tempo, or Honeycomb) to visualise request flow across microservices.</li>\n</ul>\n<ul>\n<li>Security Observability: Experience using Splunk for security orchestration (SOAR) or SIEM-related workflows.</li>\n</ul>\n<ul>\n<li>Cloud Platforms: Experience managing native observability tools within AWS, Azure, or GCP.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_491db8e9-776","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Okta","sameAs":"https://www.okta.com/","logo":"https://logos.yubhub.co/okta.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/okta/jobs/6874616","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Splunk","Grafana","SRE","Go","Python","Ruby","OpenTelemetry","Prometheus","Linux","Networking","Container Orchestration"],"x-skills-preferred":["Tracing","Security Observability","Cloud Platforms"],"datePosted":"2026-04-18T15:54:34.221Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Bengaluru, 
India"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Splunk, Grafana, SRE, Go, Python, Ruby, OpenTelemetry, Prometheus, Linux, Networking, Container Orchestration, Tracing, Security Observability, Cloud Platforms"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_f2196e99-854"},"title":"Software Engineer - GenAI inference","description":"<p>As a software engineer for GenAI inference, you will help design, develop, and optimize the inference engine that powers Databricks&#39; Foundation Model API. You&#39;ll work at the intersection of research and production, ensuring our large language model (LLM) serving systems are fast, scalable, and efficient.</p>\n<p>Your work will touch the full GenAI inference stack, from kernels and runtimes to orchestration and memory management. You will contribute to the design and implementation of the inference engine, and collaborate on a model-serving stack optimized for large-scale LLM inference.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Collaborating with researchers to bring new model architectures or features (sparsity, activation compression, mixture-of-experts) into the engine</li>\n<li>Optimizing for latency, throughput, memory efficiency, and hardware utilization across GPUs and accelerators</li>\n<li>Building and maintaining instrumentation, profiling, and tracing tooling to uncover bottlenecks and guide optimizations</li>\n<li>Developing and enhancing scalable routing, batching, scheduling, memory management, and dynamic loading mechanisms for inference workloads</li>\n<li>Supporting reliability, reproducibility, and fault tolerance in the inference pipelines, including A/B launches, rollback, and model versioning</li>\n<li>Integrating with federated, distributed inference infrastructure – orchestrate across nodes, balance load, handle communication overhead</li>\n<li>Collaborating 
cross-functionally: with platform engineers, cloud infrastructure, and security/compliance teams</li>\n<li>Documenting and sharing learnings, contributing to internal best practices and open-source efforts when possible</li>\n</ul>\n<p>Requirements include:</p>\n<ul>\n<li>BS/MS/PhD in Computer Science, or a related field</li>\n<li>Strong software engineering background (3+ years or equivalent) in performance-critical systems</li>\n<li>Solid understanding of ML inference internals: attention, MLPs, recurrent modules, quantization, sparse operations, etc.</li>\n<li>Hands-on experience with CUDA, GPU programming, and key libraries (cuBLAS, cuDNN, NCCL, etc.)</li>\n<li>Comfortable designing and operating distributed systems, including RPC frameworks, queuing, RPC batching, sharding, memory partitioning</li>\n<li>Demonstrated ability to uncover and solve performance bottlenecks across layers (kernel, memory, networking, scheduler)</li>\n<li>Experience building instrumentation, tracing, and profiling tools for ML models</li>\n<li>Ability to work closely with ML researchers, translate novel model ideas into production systems</li>\n<li>Ownership mindset and eagerness to dive deep into complex system challenges</li>\n<li>Bonus: published research or open-source contributions in ML systems, inference optimization, or model serving</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_f2196e99-854","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Databricks","sameAs":"https://databricks.com","logo":"https://logos.yubhub.co/databricks.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/databricks/jobs/8202670002","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$142,200-$204,600 USD","x-skills-required":["software engineering","performance-critical systems","ML inference 
internals","CUDA","GPU programming","distributed systems","RPC frameworks","queuing","RPC batching","sharding","memory partitioning","instrumentation","tracing","profiling tools","ML researchers","complex system challenges"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:54:17.777Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, California"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"software engineering, performance-critical systems, ML inference internals, CUDA, GPU programming, distributed systems, RPC frameworks, queuing, RPC batching, sharding, memory partitioning, instrumentation, tracing, profiling tools, ML researchers, complex system challenges","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":142200,"maxValue":204600,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_ebdf39f1-65a"},"title":"Senior Product Manager, APM - Observability","description":"<p>We&#39;re eager to welcome an outstanding Product Manager to join the Elastic Observability product team, with a focus on APM, Synthetics and RUM capabilities. 
With your knowledge and understanding of the Observability domain, you&#39;ll help ensure that we give our customers the best end-to-end, full-stack Observability experience for the applications we monitor.</p>\n<p>This role significantly impacts the success of our Observability product!</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Evolving our APM, Synthetics, and RUM experience, aligning at every step with organisational and company strategy.</li>\n</ul>\n<ul>\n<li>Working closely across teams, collaborating with Engineering, UX, Design and other teams to bring top-notch application monitoring experiences to our users.</li>\n</ul>\n<ul>\n<li>Working with our Sales &amp; Marketing teams on go-to-market, messaging, competitive analysis, enablement, and feature adoption.</li>\n</ul>\n<ul>\n<li>Tracking user engagement and feedback to inform product and feature evolution.</li>\n</ul>\n<p><strong>What You Bring</strong></p>\n<ul>\n<li>Ideally 3+ years in product management or a closely related area.</li>\n</ul>\n<ul>\n<li>Experience in APM and distributed tracing.</li>\n</ul>\n<ul>\n<li>Synthetic transaction monitoring and RUM experience are optional but highly desired.</li>\n</ul>\n<ul>\n<li>Comfort working with and converting abstract problems into concrete, prioritised solutions, working from the customer backwards.</li>\n</ul>\n<ul>\n<li>Outstanding spoken and written communication skills.</li>\n</ul>\n<ul>\n<li>Passion for industry trends and the competitive landscape.</li>\n</ul>\n<ul>\n<li>Experience working in a remote, globally distributed work environment is optional but desired.</li>\n</ul>\n<ul>\n<li>Knowledge of OpenTelemetry and AI is optional but desired.</li>\n</ul>\n<p><strong>Additional Information</strong></p>\n<p>As a distributed company, diversity drives our identity. Whether you’re looking to launch a new career or grow an existing one, Elastic is the type of company where you can balance great work with great life. Your age is only a number. 
It doesn’t matter if you’re just out of college or your children are; we need you for what you can do.</p>\n<p>We strive to have parity of benefits across regions and while regulations differ from place to place, we believe taking care of our people is the right thing to do.</p>\n<ul>\n<li>Competitive pay based on the work you do here and not your previous salary.</li>\n</ul>\n<ul>\n<li>Health coverage for you and your family in many locations.</li>\n</ul>\n<ul>\n<li>Ability to craft your calendar with flexible locations and schedules for many roles.</li>\n</ul>\n<ul>\n<li>Generous number of vacation days each year.</li>\n</ul>\n<ul>\n<li>Increase your impact - We match up to $2000 (or local currency equivalent) for financial donations and service.</li>\n</ul>\n<ul>\n<li>Up to 40 hours each year to use toward volunteer projects you love.</li>\n</ul>\n<ul>\n<li>Embracing parenthood with minimum of 16 weeks of parental leave.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_ebdf39f1-65a","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Elastic, the Search AI Company","sameAs":"https://www.elastic.co/","logo":"https://logos.yubhub.co/elastic.co.png"},"x-apply-url":"https://job-boards.greenhouse.io/elastic/jobs/7677216","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["APM","Distributed tracing","Synthetic transaction monitoring","RUM","OpenTelemetry","AI"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:52:11.433Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Poland"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"APM, Distributed tracing, Synthetic transaction monitoring, RUM, OpenTelemetry, 
AI"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_15a29cc3-0bf"},"title":"Senior Production Engineer","description":"<p>CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025.</p>\n<p><strong>About the Role</strong></p>\n<p>Production Engineering ensures CoreWeave’s cloud delivers world-class reliability, performance, and operational excellence. We are hiring a Senior Production Engineer to take direct, hands-on ownership of critical tooling that drives reliability and delivery success.</p>\n<p>In this role, you will work broadly across the cloud stack, designing, implementing, deploying, and operating systems that improve delivery velocity, service availability, and operational safety. You’ll be responsible for leading end-to-end technical projects, maintaining long-lived systems the team owns, and strengthening our operational foundations through durable engineering investments.</p>\n<p>This is a role for someone who enjoys building, debugging, and operating production systems. 
You will collaborate closely with service owners, but your primary impact comes from the reliability, quality, and maturity of the systems you deliver and maintain over time.</p>\n<p><strong>What You’ll Do</strong></p>\n<ul>\n<li>Take hands-on ownership of critical systems and frameworks, driving their architecture, implementation, and long-term evolution.</li>\n</ul>\n<ul>\n<li>Lead end-to-end delivery of engineering projects that improve availability, scalability, operational automation, and failure recovery.</li>\n</ul>\n<ul>\n<li>Build and maintain observability, alerting, automated remediation, and resilience testing for the systems you support.</li>\n</ul>\n<ul>\n<li>Participate in incident response as a subject-matter expert; drive deep root-cause investigations and implement lasting fixes.</li>\n</ul>\n<ul>\n<li>Improve runbooks, sources of truth, deployment workflows, and operational tooling to harden production readiness.</li>\n</ul>\n<ul>\n<li>Eliminate single points of failure and reduce operational toil through automation, refactors, and system redesigns.</li>\n</ul>\n<ul>\n<li>Ship production code regularly in Python, Go, or similar languages, and participate in on-call rotations.</li>\n</ul>\n<ul>\n<li>Maintain and mature long-term projects and frameworks owned by the team, ensuring they remain reliable, well-instrumented, and easy to operate.</li>\n</ul>\n<ul>\n<li>Collaborate with platform teams to ensure new features and services integrate cleanly with our reliability best-practices and tooling.</li>\n</ul>\n<p><strong>What You’ve Worked On (Minimum Qualifications)</strong></p>\n<ul>\n<li>7+ years of engineering experience building and operating distributed systems or cloud platforms.</li>\n</ul>\n<ul>\n<li>Demonstrated ability to debug complex production issues end-to-end, across services, infrastructure layers, and automation.</li>\n</ul>\n<ul>\n<li>Strong programming or scripting ability (Python, Go, or similar), with experience shipping and 
operating production services and tools.</li>\n</ul>\n<ul>\n<li>Deep knowledge of cloud-native technologies and distributed system patterns, particularly Kubernetes.</li>\n</ul>\n<ul>\n<li>Experience with modern observability stacks: metrics, tracing, structured logs, SLOs/SLIs, and incident lifecycle practices.</li>\n</ul>\n<ul>\n<li>A track record of successfully delivering hands-on reliability improvements through engineering execution.</li>\n</ul>\n<p><strong>Preferred Qualifications</strong></p>\n<ul>\n<li>Experience building internal tooling, frameworks, or automation that supports high-availability cloud operations.</li>\n</ul>\n<ul>\n<li>Familiarity with DR/BCP, service tiering, capacity planning, or chaos engineering.</li>\n</ul>\n<ul>\n<li>Background operating or building large-scale AI or GPU-accelerated infrastructure.</li>\n</ul>\n<ul>\n<li>Experience maintaining multi-year ownership of foundational production systems.</li>\n</ul>\n<p>Why CoreWeave?</p>\n<p>At CoreWeave, we work hard, have fun, and move fast! We’re in an exciting stage of hyper-growth that you will not want to miss out on. We’re not afraid of a little chaos, and we’re constantly learning. Our team cares deeply about how we build our product and how we work together, which is represented through our core values:</p>\n<ul>\n<li>Be Curious at Your Core</li>\n</ul>\n<ul>\n<li>Act Like an Owner</li>\n</ul>\n<ul>\n<li>Empower Employees</li>\n</ul>\n<ul>\n<li>Deliver Best-in-Class Client Experiences</li>\n</ul>\n<ul>\n<li>Achieve More Together</li>\n</ul>\n<p>We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and enables the development of innovative solutions to complex problems. As we get set for takeoff, the organization&#39;s growth opportunities are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too. 
Come join us!</p>\n<p>The base salary range for this role is $139,000 to $204,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).</p>\n<p>What We Offer</p>\n<p>The range we’ve posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate which can include a variety of factors. These include qualifications, experience, interview performance, and location.</p>\n<p>In addition to a competitive salary, we offer a variety of benefits to support your needs, including:</p>\n<ul>\n<li>Medical, dental, and vision insurance - 100% paid for by CoreWeave</li>\n</ul>\n<ul>\n<li>Company-paid Life Insurance</li>\n</ul>\n<ul>\n<li>Voluntary supplemental life insurance</li>\n</ul>\n<ul>\n<li>Short and long-term disability insurance</li>\n</ul>\n<ul>\n<li>Flexible Spending Account</li>\n</ul>\n<ul>\n<li>Health Savings Account</li>\n</ul>\n<ul>\n<li>Tuition Reimbursement</li>\n</ul>\n<ul>\n<li>Ability to Participate in Employee Stock Purchase Program (ESPP)</li>\n</ul>\n<ul>\n<li>Mental Wellness Benefits through Spring Health</li>\n</ul>\n<ul>\n<li>Family-Forming support provided by Carrot</li>\n</ul>\n<ul>\n<li>Paid Parental Leave</li>\n</ul>\n<ul>\n<li>Flexible, full-service childcare support with Kinside</li>\n</ul>\n<ul>\n<li>401(k) with a generous employer match</li>\n</ul>\n<ul>\n<li>Flexible PTO</li>\n</ul>\n<ul>\n<li>Catered lunch each day in our office and data center locations</li>\n</ul>\n<ul>\n<li>A casual work environment</li>\n</ul>\n<ul>\n<li>A work culture focused on innovative disruption</li>\n</ul>\n<p>Our Workplace</p>\n<p>While we prioritize a hybrid work environment, 
remote work may be considered for candidates located more than 30 miles from an office, based on role requirements for specialized skill sets. New hires will be invited to attend onboarding at one of our hubs within their first month. Teams also gather quarterly to support collaboration.</p>\n<p>California Consumer Privacy Act - California applicants only</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_15a29cc3-0bf","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4670172006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$139,000 to $204,000","x-skills-required":["cloud computing","distributed systems","cloud platforms","Kubernetes","observability stacks","metrics","tracing","structured logs","SLOs/SLIs","incident lifecycle practices","Python","Go","programming","scripting","production services","tools"],"x-skills-preferred":["internal tooling","frameworks","automation","high-availability cloud operations","DR/BCP","service tiering","capacity planning","chaos engineering","large-scale AI","GPU-accelerated infrastructure"],"datePosted":"2026-04-18T15:52:09.786Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"cloud computing, distributed systems, cloud platforms, Kubernetes, observability stacks, metrics, tracing, structured logs, SLOs/SLIs, incident lifecycle practices, Python, Go, programming, scripting, production services, tools, internal tooling, frameworks, automation, high-availability cloud 
operations, DR/BCP, service tiering, capacity planning, chaos engineering, large-scale AI, GPU-accelerated infrastructure","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":139000,"maxValue":204000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_fa9a54d7-549"},"title":"Senior Site Reliability Engineer, Data Infrastructure","description":"<p>As a Senior Site Reliability Engineer, you will own the reliability and performance of our Kubernetes-based data platform. You will design and operate highly available, multi-region systems, ensuring our services meet strict uptime and latency targets.</p>\n<p>Day-to-day, you’ll work on scaling infrastructure, improving deployment pipelines, and hardening our security posture. You’ll play a key role in evolving our DevSecOps practices while partnering closely with engineering teams to ensure services are built for reliability from day one.</p>\n<p>We operate with production-grade discipline, supporting mission-critical services with stringent uptime requirements and a focus on automation, observability, and resilience.</p>\n<p>The Platform &amp; Infrastructure Engineering team in the Data Infrastructure organization is responsible for the reliability, scalability, and security of the company’s data platform. 
The team builds and operates the foundational systems that power data ingestion, transformation, analytics, and internal AI workloads at scale.</p>\n<p>About the role:</p>\n<ul>\n<li>5+ years of experience in Site Reliability Engineering, Platform Engineering, or Infrastructure Engineering roles</li>\n<li>Deep expertise in Kubernetes and containerized software services, including cluster design, operations, and troubleshooting in production environments</li>\n<li>Strong experience building and operating CI/CD systems, including tools such as Argo CD and GitHub Actions</li>\n<li>Proven experience owning production systems with high availability requirements (≥99.99% uptime), including incident response, SLI/SLO/SLA definition, error budgets, and postmortems</li>\n<li>Hands-on experience designing and operating geo-replicated, multi-region, active-active systems, including traffic routing, failover strategies, and data consistency tradeoffs</li>\n<li>Strong experience building and owning observability components, including metrics, logging, and tracing (e.g., Prometheus, Grafana, OpenTelemetry).</li>\n<li>Experience with infrastructure as code (e.g., Helm, Terraform, Pulumi) and automated environment provisioning</li>\n<li>Strong understanding of system performance tuning, capacity planning, and resource optimization in distributed systems</li>\n<li>Experience implementing and operating security best practices in cloud-native environments (e.g., secrets management, network policies, vulnerability scanning)</li>\n</ul>\n<p>Preferred:</p>\n<ul>\n<li>Experience operating data platforms or data-intensive workloads (e.g., Spark, Airflow, Kafka, Flink)</li>\n<li>Familiarity with service mesh technologies (e.g., Istio, Linkerd)</li>\n<li>Experience working in regulated environments with compliance frameworks such as GDPR, SOC 2, HIPAA, or SOX</li>\n<li>Background in building internal developer platforms or self-service infrastructure</li>\n</ul>\n<p>Wondering if you’re a 
good fit?</p>\n<p>We believe in investing in our people, and value candidates who can bring their own diversified experiences to our teams – even if you aren’t a 100% skill or experience match.</p>\n<p>Here are a few qualities we’ve found compatible with our team. If some of this describes you, we’d love to talk.</p>\n<ul>\n<li>You love building highly reliable systems that operate at scale</li>\n<li>You’re curious about how to continuously improve system resilience, security, and operations</li>\n<li>You’re an expert in diagnosing and solving complex distributed systems problems</li>\n</ul>\n<p>Why CoreWeave?</p>\n<p>At CoreWeave, we work hard, have fun, and move fast! We’re in an exciting stage of hyper-growth that you will not want to miss out on. We’re not afraid of a little chaos, and we’re constantly learning.</p>\n<p>Our team cares deeply about how we build our product and how we work together, which is represented through our core values:</p>\n<ul>\n<li>Be Curious at Your Core</li>\n<li>Act Like an Owner</li>\n<li>Empower Employees</li>\n<li>Deliver Best-in-Class Client Experiences</li>\n<li>Achieve More Together</li>\n</ul>\n<p>We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and provides the opportunity to develop innovative solutions to complex problems.</p>\n<p>As we get set for takeoff, the growth opportunities within the organization are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too.</p>\n<p>Come join us!</p>\n<p>The base salary range for this role is $165,000 to $242,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. 
We strive for both market alignment and internal equity when determining compensation.</p>\n<p>In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).</p>\n<p>What We Offer</p>\n<p>The range we’ve posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate which can include a variety of factors. These include qualifications, experience, interview performance, and location.</p>\n<p>In addition to a competitive salary, we offer a variety of benefits to support your needs, including:</p>\n<ul>\n<li>Medical, dental, and vision insurance</li>\n<li>100% paid for by CoreWeave</li>\n<li>Company-paid Life Insurance</li>\n<li>Voluntary supplemental life insurance</li>\n<li>Short and long-term disability insurance</li>\n<li>Flexible Spending Account</li>\n<li>Health Savings Account</li>\n<li>Tuition Reimbursement</li>\n<li>Ability to Participate in Employee Stock Purchase Program (ESPP)</li>\n<li>Mental Wellness Benefits through Spring Health</li>\n<li>Family-Forming support provided by Carrot</li>\n<li>Paid Parental Leave</li>\n<li>Flexible, full-service childcare support with Kinside</li>\n<li>401(k) with a generous employer match</li>\n<li>Flexible PTO</li>\n<li>Catered lunch each day in our office and data center locations</li>\n<li>A casual work environment</li>\n<li>A work culture focused on innovative disruption</li>\n</ul>\n<p>Our Workplace</p>\n<p>While we prioritize a hybrid work environment, remote work may be considered for candidates located more than 30 miles from an office, based on role requirements for specialized skill sets.</p>\n<p>New hires will be invited to attend onboarding at one of our hubs within their first month.</p>\n<p>Teams also gather quarterly to support collaboration.</p>\n<p>California Consumer Privacy Act - California applicants only</p>\n<p>CoreWeave 
is an equal opportunity employer, committed to fostering an inclusive and supportive workplace.</p>\n<p>All qualified applicants and candidates will receive consideration for employment without regard to race, color, religion, sex, disability, age, sexual orientation, gender identity, national origin, veteran status, or genetic information.</p>\n<p>As part of this commitment and consistent with the Americans with Disabilities Act (ADA), CoreWeave will ensure that qualified applicants and candidates with disabilities are provided reasonable accommodations for the hiring process, unless such accommodation would cause an undue hardship.</p>\n<p>If reasonable accommodation is needed, please contact: careers@coreweave.com.</p>\n<p>Export Control Compliance</p>\n<p>This position requires access to export controlled information.</p>\n<p>To conform to U.S. Government export regulations applicable to that information, applicant must either be (A) a U.S. person, defined as a (i) U.S. citizen or national, (ii) U.S. lawful permanent resident (green card holder), (iii) refugee under 8 U.S.C. § 1157, or (iv) asylee under 8 U.S.C. § 1158, (B) eligible to access the export controlled information without restrictions, or (C) otherwise exempt from the export regulations.</p>\n<p>If you are not a U.S. person, you will be required to provide documentation of your eligibility to access the export controlled information before being considered for this position.</p>\n<p>Please note that CoreWeave is subject to the requirements of the U.S. Department of Commerce&#39;s Export Administration Regulations (EAR) and the U.S. 
Department of State&#39;s International Traffic in Arms Regulations (ITAR).</p>\n<p>By applying for this position, you acknowledge that you have read and understood the export control requirements and that you will comply with them.</p>\n<p>If you have any questions or concerns regarding the export control requirements, please contact: careers@coreweave.com.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_fa9a54d7-549","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4671535006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$165,000 to $242,000","x-skills-required":["Kubernetes","containerized software services","cluster design","operations","troubleshooting","CI/CD systems","Argo CD","GitHub Actions","production systems","high availability","incident response","SLI/SLO/SLA definition","error budgets","postmortems","geo-replicated","multi-region","active-active systems","traffic routing","failover strategies","data consistency tradeoffs","observability components","metrics","logging","tracing","Prometheus","Grafana","OpenTelemetry","infrastructure as code","Helm","Terraform","Pulumi","automated environment provisioning","system performance tuning","capacity planning","resource optimization","distributed systems","security best practices","cloud-native environments","secrets management","network policies","vulnerability scanning"],"x-skills-preferred":["Spark","Airflow","Kafka","Flink","service mesh technologies","Istio","Linkerd","regulated environments","compliance frameworks","GDPR","SOC 2","HIPAA","SOX","internal developer platforms","self-service 
infrastructure"],"datePosted":"2026-04-18T15:51:59.035Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"New York, NY / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Kubernetes, containerized software services, cluster design, operations, troubleshooting, CI/CD systems, Argo CD, GitHub Actions, production systems, high availability, incident response, SLI/SLO/SLA definition, error budgets, postmortems, geo-replicated, multi-region, active-active systems, traffic routing, failover strategies, data consistency tradeoffs, observability components, metrics, logging, tracing, Prometheus, Grafana, OpenTelemetry, infrastructure as code, Helm, Terraform, Pulumi, automated environment provisioning, system performance tuning, capacity planning, resource optimization, distributed systems, security best practices, cloud-native environments, secrets management, network policies, vulnerability scanning, Spark, Airflow, Kafka, Flink, service mesh technologies, Istio, Linkerd, regulated environments, compliance frameworks, GDPR, SOC 2, HIPAA, SOX, internal developer platforms, self-service infrastructure","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":165000,"maxValue":242000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_a1ba5c28-9ce"},"title":"Senior Software Engineer, Observability","description":"<p>Join CoreWeave&#39;s Observability team, responsible for building the systems that give our customers and internal teams unparalleled visibility into complex AI workloads.</p>\n<p>Our team empowers engineers to understand, troubleshoot, and optimize high-performance infrastructure at massive scale.</p>\n<p>As a Senior Software Engineer on the Observability team, you will design, build, and maintain core observability 
infrastructure spanning metrics, logging, tracing, and telemetry pipelines.</p>\n<p>Your day-to-day will involve developing highly reliable and scalable systems, collaborating with internal engineering teams to embed observability best practices, and tackling performance and reliability challenges across clusters of thousands of GPUs.</p>\n<p>You&#39;ll also contribute to platform strategy and participate in on-call rotations to ensure critical production systems remain robust and operational.</p>\n<p>The base salary range for this role is $139,000 to $220,000.</p>\n<p>In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).</p>\n<p>We offer a variety of benefits to support your needs, including medical, dental, and vision insurance, 100% paid for by CoreWeave, company-paid Life Insurance, voluntary supplemental life insurance, short and long-term disability insurance, flexible Spending Account, Health Savings Account, tuition reimbursement, ability to participate in Employee Stock Purchase Program (ESPP), mental wellness benefits through Spring Health, family-forming support provided by Carrot, paid parental leave, flexible, full-service childcare support with Kinside, 401(k) with a generous employer match, flexible PTO, catered lunch each day in our office and data center locations, a casual work environment, and a work culture focused on innovative disruption.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a 
href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_a1ba5c28-9ce","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4554201006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$139,000 to $220,000","x-skills-required":["Go","Python","Kubernetes","containerization","microservices architectures","Helm","YAML-based configurations","automated testing","progressive release strategies","on-call rotations"],"x-skills-preferred":["designing, operating, or scaling logging, metrics, or tracing platforms","data streaming systems for observability pipelines","automating infrastructure provisioning","OpenTelemetry for unified telemetry collection and instrumentation","exposure to modern AI workloads and GPU-based infrastructure"],"datePosted":"2026-04-18T15:51:55.238Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"New York, NY / Sunnyvale, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Go, Python, Kubernetes, containerization, microservices architectures, Helm, YAML-based configurations, automated testing, progressive release strategies, on-call rotations, designing, operating, or scaling logging, metrics, or tracing platforms, data streaming systems for observability pipelines, automating infrastructure provisioning, OpenTelemetry for unified telemetry collection and instrumentation, exposure to modern AI workloads and GPU-based 
infrastructure","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":139000,"maxValue":220000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_72ebb09d-b37"},"title":"Staff+ Software Engineer, Observability","description":"<p>We&#39;re seeking talented and experienced Software Engineers to join our Observability team within the Infrastructure organization. The Observability team owns the monitoring and telemetry infrastructure that every engineer and researcher at Anthropic depends on, from metrics and logging pipelines to distributed tracing, error analytics, alerting, and the dashboards and query interfaces that make it all actionable.</p>\n<p>As Anthropic scales its infrastructure across massive GPU, TPU, and Trainium clusters, the volume and complexity of operational data are growing by orders of magnitude. We&#39;re building next-generation observability systems (high-throughput ingest pipelines, cost-efficient columnar storage, unified query layers across signals, and agentic diagnostic tools) to ensure that engineers can detect, diagnose, and resolve issues in minutes rather than hours, even as the systems they operate become exponentially more complex.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Design and build scalable telemetry ingest and storage pipelines for metrics, logs, traces, and error data across Anthropic&#39;s multi-cluster infrastructure</li>\n<li>Own and evolve core observability platforms, driving migrations and architectural improvements that improve reliability, reduce cost, and scale with organisational growth</li>\n<li>Build instrumentation libraries, SDKs, and integrations that make it easy for engineering teams to emit high-quality telemetry from their services</li>\n<li>Drive alerting and SLO infrastructure that enables teams to define, monitor, and respond to reliability targets with minimal 
noise</li>\n<li>Reduce mean time to detection and resolution by building cross-signal correlation, unified query interfaces, and AI-assisted diagnostic tooling</li>\n<li>Partner with Research, Inference, Product, and Infrastructure teams to ensure observability solutions meet the unique needs of each organisation</li>\n</ul>\n<p>You May Be a Good Fit If You:</p>\n<ul>\n<li>Have 10+ years of relevant industry experience building and operating large-scale observability or monitoring infrastructure</li>\n<li>Have deep experience with at least one observability signal area (metrics, logging, tracing, or error analytics) and familiarity with the others</li>\n<li>Understand high-throughput data pipelines, columnar storage engines, and the tradeoffs involved in ingesting and querying telemetry data at scale</li>\n<li>Have experience operating or building on top of observability platforms such as Prometheus, Grafana, ClickHouse, OpenTelemetry, or similar systems</li>\n<li>Have strong proficiency in at least one of Python, Rust, or Go</li>\n<li>Have excellent communication skills and enjoy partnering with internal teams to improve their operational visibility and incident response capabilities</li>\n<li>Are excited about building foundational infrastructure and are comfortable working independently on ambiguous, high-impact technical challenges</li>\n</ul>\n<p>Strong Candidates May Also Have:</p>\n<ul>\n<li>Experience operating metrics systems at very high cardinality (hundreds of millions of active time series or more)</li>\n<li>Experience with log storage migrations or operating columnar databases (ClickHouse, BigQuery, or similar) for analytics workloads</li>\n<li>Experience with OpenTelemetry instrumentation, collector pipelines, and tail-based sampling strategies</li>\n<li>Experience building or operating alerting platforms, on-call tooling, or SLO frameworks at scale</li>\n<li>Experience with Kubernetes-native monitoring, eBPF-based observability, or continuous 
profiling</li>\n<li>Interest in applying AI/LLMs to operational workflows such as automated root cause analysis, anomaly detection, or intelligent alerting</li>\n</ul>\n<p>The annual compensation range for this role is $405,000-$485,000 USD.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_72ebb09d-b37","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5139910008","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$405,000-$485,000 USD","x-skills-required":["observability","monitoring","telemetry","metrics","logging","tracing","error analytics","alerting","SLO infrastructure","cross-signal correlation","unified query interfaces","AI-assisted diagnostic tooling","Python","Rust","Go","Prometheus","Grafana","ClickHouse","OpenTelemetry"],"x-skills-preferred":["high-throughput data pipelines","columnar storage engines","operating system administration","cloud computing","containerization","DevOps"],"datePosted":"2026-04-18T15:51:29.494Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"observability, monitoring, telemetry, metrics, logging, tracing, error analytics, alerting, SLO infrastructure, cross-signal correlation, unified query interfaces, AI-assisted diagnostic tooling, Python, Rust, Go, Prometheus, Grafana, ClickHouse, OpenTelemetry, high-throughput data pipelines, columnar storage engines, operating system administration, cloud computing, containerization, 
DevOps","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":405000,"maxValue":485000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_86696218-8f0"},"title":"Staff Backend Engineer (Ruby on Rails/AI), Verify","description":"<p>As a Staff Backend Engineer (AI) in the Verify stage at GitLab, you&#39;ll help shape and scale the core infrastructure behind GitLab CI. You&#39;ll play a central role in how we integrate AI into CI/CD workflows. Your work will impact performance, reliability, and usability for people running millions of CI jobs, from small teams to the largest enterprises.</p>\n<p>In this role, you&#39;ll go beyond using AI tools and help define how we design, build, and iterate on AI-assisted and agentic CI experiences. You&#39;ll set standards for what good looks like across our AI agent portfolio, including how we measure success, how we instrument behavior in production, and how we account for large language model limitations. You&#39;ll also help responsibly integrate GitLab&#39;s Duo Agent Platform into CI workflows at scale, on a foundation that&#39;s fast, reliable, secure, and observable.</p>\n<p>We have ambitious goals for Agentic CI in FY27. As a Staff Engineer, you will:</p>\n<ul>\n<li>Partner with Engineering, Product, and UX leadership to pressure-test our priorities: where we can move faster, where we&#39;re missing data, and where there&#39;s whitespace to innovate. 
Part of this includes learning and growing with the Engineering team you will collaborate closely with.</li>\n</ul>\n<ul>\n<li>Define what success looks like across our agent portfolio and make sure we&#39;re tracking against it, not just shipping, but learning.</li>\n</ul>\n<ul>\n<li>Bring a sharp eye to the competitive landscape, helping us understand what it takes to keep GitLab CI best-in-class in an increasingly agentic world.</li>\n</ul>\n<p>Examples of Agentic CI work we have planned for the upcoming year:</p>\n<ul>\n<li>AI Pipeline Builder, the foundational CI agent that auto-creates pipelines for new projects and serves as the launchpad for onboarding new CI users.</li>\n</ul>\n<ul>\n<li>Automate the Fix a Failing Pipeline flow at scale – from dogfooding on internal GitLab projects through to safe, controlled rollout for customers, solving real infrastructure and scalability challenges.</li>\n</ul>\n<ul>\n<li>Build the instrumentation and observability layer that makes agentic CI trustworthy (trigger volume dashboards, retry rates, cost safeguards) so we can measure what&#39;s working, catch what isn&#39;t, and iterate with confidence.</li>\n</ul>\n<ul>\n<li>Harden the CI pipeline execution infrastructure that these agents depend on: database access patterns, background processing, and job orchestration built to handle the additional load that AI-driven automation introduces at enterprise scale.</li>\n</ul>\n<p>You&#39;ll shape and scale GitLab CI backend infrastructure to improve performance, reliability, and usability for users running jobs at high volume. You&#39;ll design and implement AI-powered features for Agentic CI, including agents, agentic flows, and LLM-backed tooling that integrates with GitLab&#39;s Duo Agent Platform. You&#39;ll define what success looks like for AI in CI before you build, including baselines, measurable outcomes, and clear signals that help the team learn and iterate. 
You&#39;ll build the instrumentation and observability needed to make AI-assisted CI trustworthy in production, including feature behavior metrics, dashboards, and safeguards. You&#39;ll own and drive measurable performance improvements across CI systems (for example, database access patterns, background processing, and job orchestration) by forming hypotheses, running experiments, and validating results with data. You&#39;ll write secure, well-tested, maintainable Ruby on Rails code in a large monolith, improving existing features while reducing technical debt and operational risk. You&#39;ll lead cross-functional technical work with Product, UX, and Infrastructure, influencing architecture and execution across the Verify stage. You&#39;ll share standards, patterns, and learnings with other engineers, raising the bar for responsible AI integration and evidence-driven engineering across CI.</p>\n<p>This role requires advanced proficiency with Ruby and Ruby on Rails, with experience building and maintaining reliable backend services in a large codebase. You should have strong PostgreSQL skills, including data modeling, query tuning, and scaling large tables through proactive performance investigation and remediation. You should have hands-on experience building, running, and debugging high-traffic production systems, ideally in CI, workflow orchestration, or adjacent infrastructure-heavy domains. You should have practical experience designing and shipping AI-powered backend features and integrations, including sound judgment about large language model limitations and responsible use in production. You should have a data-driven approach to engineering: defining hypotheses, establishing baseline metrics, instrumenting changes, and measuring outcomes against clear success criteria. You should have familiarity with observability patterns and tools (metrics, logging, tracing) to diagnose issues, improve reliability, and guide iteration. 
You should have strong backend architecture and delivery practices, including secure design, well-tested code, and strategies for safe rollouts and zero-downtime changes. You should have clear written and verbal communication skills, including writing technical proposals and documentation, and collaborating effectively in a remote, asynchronous, cross-functional environment.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_86696218-8f0","directApply":true,"hiringOrganization":{"@type":"Organization","name":"GitLab","sameAs":"https://about.gitlab.com/","logo":"https://logos.yubhub.co/about.gitlab.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/gitlab/jobs/8448283002","x-work-arrangement":"remote","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Ruby","Ruby on Rails","PostgreSQL","Data modeling","Query tuning","Scaling large tables","High-traffic production systems","CI","Workflow orchestration","Infrastructure-heavy domains","AI-powered backend features","Large language model limitations","Responsible use in production","Data-driven approach to engineering","Observability patterns","Metrics","Logging","Tracing","Backend architecture","Delivery practices","Secure design","Well-tested code","Safe rollouts","Zero-downtime changes"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:50:58.310Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Remote, APAC; Remote, Canada; Remote, Ireland; Remote, Netherlands; Remote, United Kingdom; Remote, US; Remote, US-Southeast"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Ruby, Ruby on Rails, PostgreSQL, Data modeling, Query tuning, Scaling large tables, High-traffic production systems, CI, Workflow orchestration, 
Infrastructure-heavy domains, AI-powered backend features, Large language model limitations, Responsible use in production, Data-driven approach to engineering, Observability patterns, Metrics, Logging, Tracing, Backend architecture, Delivery practices, Secure design, Well-tested code, Safe rollouts, Zero-downtime changes"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_6a24f057-4f1"},"title":"Staff Production Engineer","description":"<p>The Production Engineering Tools team builds and operates foundational platforms that make CoreWeave&#39;s cloud reliable, observable, and scalable. We are hiring a Staff Production Engineer to design, build, and own the foundational platforms and frameworks that underpin operational excellence across CoreWeave.</p>\n<p>In this role, you will combine deep technical leadership with hands-on engineering to create systems that improve availability, resiliency, and delivery velocity at scale. This is a high-impact role with broad organisational influence. You will develop a deep understanding of CoreWeave&#39;s infrastructure and services, shape architecture and tooling decisions, and partner closely with service owners to operationalise reliability through automation and paved paths rather than manual process or advocacy.</p>\n<p>Success requires the ability to pivot quickly between hot incidents, multi-team programs, and initiatives at all levels of the organisation. You will design, build, and own foundational platforms and frameworks from architecture through adoption and operation. 
You will lead technical strategy and execution for internal tooling that reduces manual operations, improves delivery velocity, and supports CoreWeave&#39;s revenue growth through faster, more reliable datacentre delivery.</p>\n<p>You will partner with service owners and platform teams to translate reliability and operational requirements into automation, self-service capabilities, and opinionated paved paths. You will build and evolve systems for observability, alerting, automated remediation, resiliency testing, and authoritative sources of truth, operationalising best practices through tooling rather than manual enforcement.</p>\n<p>You will participate in incident response for critical outages with the explicit goal of improving systems, tooling, and defaults to reduce future operational load, not as a long-term escalation path. You will ship production code, participate in on-call rotations as needed, and mentor engineers on platform ownership, operational design, and sustainable production practices.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_6a24f057-4f1","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4644302006","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$188,000 to $275,000","x-skills-required":["distributed systems","cloud platforms","Kubernetes","observability","incident practices","metrics","tracing","structured logs","SLIs/SLOs","PIRs"],"x-skills-preferred":["foundational internal platforms","service tiering","disaster recovery","chaos engineering","structured resilience 
programs"],"datePosted":"2026-04-18T15:50:55.257Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"distributed systems, cloud platforms, Kubernetes, observability, incident practices, metrics, tracing, structured logs, SLIs/SLOs, PIRs, foundational internal platforms, service tiering, disaster recovery, chaos engineering, structured resilience programs","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":188000,"maxValue":275000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_67b4ccd7-51d"},"title":"Senior Software Engineer, Observability Insights","description":"<p>Join CoreWeave&#39;s Observability team, where we are building the next-generation insights layer for AI systems.</p>\n<p>Our team empowers internal and external users to understand, troubleshoot, and optimize complex AI workloads by transforming telemetry into actionable insights.</p>\n<p>As a Senior Software Engineer on the Observability Insights team, you will lead the development of agentic interfaces and product experiences that sit atop CoreWeave&#39;s telemetry layer.</p>\n<p>You&#39;ll design multi-tenant APIs, managed Grafana experiences, and MCP-based tool servers to help customers and internal teams interact with data in innovative ways.</p>\n<p>Collaborating closely with PMs and engineering leadership, your work will shape the end-to-end observability experience and influence how people engage with cutting-edge AI infrastructure.</p>\n<p><strong>About the role</strong></p>\n<ul>\n<li>6+ years of experience in software or infrastructure engineering building production-grade backend systems and distributed APIs.</li>\n</ul>\n<ul>\n<li>Strong focus on 
developer-facing infrastructure, with a customer-obsessed approach to SDKs, CLIs, and APIs.</li>\n</ul>\n<ul>\n<li>Proficient in reliability engineering, including fault-tolerant design, SLOs, error budgets, and multi-tenant system resilience.</li>\n</ul>\n<ul>\n<li>Familiar with observability systems such as ClickHouse, Loki, VictoriaMetrics, Prometheus, and Grafana.</li>\n</ul>\n<ul>\n<li>Experienced in agentic applications or LLM-based features, including grounding, tool calling, and operational safety.</li>\n</ul>\n<ul>\n<li>Comfortable writing production code primarily in Go, with the ability to integrate Python components when needed.</li>\n</ul>\n<ul>\n<li>Collaborative experience in agile teams delivering end-to-end telemetry-to-insights pipelines.</li>\n</ul>\n<p><strong>Preferred</strong></p>\n<ul>\n<li>Experience operating Kubernetes clusters at scale, especially for AI workloads.</li>\n</ul>\n<ul>\n<li>Hands-on experience with logging, tracing, and metrics platforms in production, with deep knowledge of cardinality, indexing, and query optimization.</li>\n</ul>\n<ul>\n<li>Experienced in running distributed systems or API services at cloud scale, including event streaming and data pipeline management.</li>\n</ul>\n<ul>\n<li>Familiarity with LLM frameworks, MCP, and agentic tooling (e.g., Langchain, AgentCore).</li>\n</ul>\n<p><strong>Why CoreWeave?</strong></p>\n<p>At CoreWeave, we work hard, have fun, and move fast!</p>\n<p>We&#39;re in an exciting stage of hyper-growth that you will not want to miss out on.</p>\n<p>We&#39;re not afraid of a little chaos, and we&#39;re constantly learning.</p>\n<p>Our team cares deeply about how we build our product and how we work together, which is represented through our core values:</p>\n<ul>\n<li>Be Curious at Your Core</li>\n</ul>\n<ul>\n<li>Act Like an Owner</li>\n</ul>\n<ul>\n<li>Empower Employees</li>\n</ul>\n<ul>\n<li>Deliver Best-in-Class Client Experiences</li>\n</ul>\n<ul>\n<li>Achieve More 
Together</li>\n</ul>\n<p>We support and encourage an entrepreneurial outlook and independent thinking.</p>\n<p>We foster an environment that encourages collaboration and enables the development of innovative solutions to complex problems.</p>\n<p>As we get set for takeoff, the organization&#39;s growth opportunities are constantly expanding.</p>\n<p>You will be surrounded by some of the best talent in the industry, who will want to learn from you, too.</p>\n<p>Come join us!</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_67b4ccd7-51d","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4650163006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$165,000 to $242,000","x-skills-required":["software engineering","infrastructure engineering","backend systems","distributed APIs","reliability engineering","fault-tolerant design","SLOs","error budgets","multi-tenant system resilience","observability systems","ClickHouse","Loki","VictoriaMetrics","Prometheus","Grafana","agentic applications","LLM-based features","grounding","tool calling","operational safety","Go","Python","Kubernetes","logging","tracing","metrics platforms","cardinality","indexing","query optimization","event streaming","data pipeline management","LLM frameworks","MCP","agent tooling"],"x-skills-preferred":["operating Kubernetes clusters"],"datePosted":"2026-04-18T15:48:46.219Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"New York, NY / Sunnyvale, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"software engineering, infrastructure engineering, backend 
systems, distributed APIs, reliability engineering, fault-tolerant design, SLOs, error budgets, multi-tenant system resilience, observability systems, ClickHouse, Loki, VictoriaMetrics, Prometheus, Grafana, agentic applications, LLM-based features, grounding, tool calling, operational safety, Go, Python, Kubernetes, logging, tracing, metrics platforms, cardinality, indexing, query optimization, event streaming, data pipeline management, LLM frameworks, MCP, agent tooling, operating Kubernetes clusters","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":165000,"maxValue":242000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_cbeabfab-916"},"title":"Software Engineer, Observability","description":"<p>As a Software Engineer on the Observability team, you will design, build, and maintain scalable systems that process and surface telemetry data across distributed environments.</p>\n<p>You&#39;ll contribute production-quality code in languages like Go and Python, while improving system reliability through enhanced monitoring, alerting, and incident response practices.</p>\n<p>Day to day, you&#39;ll collaborate with cross-functional engineering teams to implement observability best practices, support production systems, and help optimize performance across large-scale infrastructure.</p>\n<p>You will also participate in on-call rotations and contribute to continuous improvements based on real-world system behavior.</p>\n<p>CoreWeave is looking for a talented software engineer to join our Observability team. 
You will be responsible for designing, building, and maintaining scalable systems that process and surface telemetry data across distributed environments.</p>\n<p>The ideal candidate will have experience with Go and Python, as well as a strong understanding of system reliability and observability best practices.</p>\n<p>In addition to your technical skills, you should be able to collaborate effectively with cross-functional teams and communicate complex technical concepts to non-technical stakeholders.</p>\n<p>If you&#39;re passionate about building scalable systems and improving system reliability, we&#39;d love to hear from you!</p>","url":"https://yubhub.co/jobs/job_cbeabfab-916","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4587675006","x-work-arrangement":"hybrid","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":"$109,000 to $145,000","x-skills-required":["Go","Python","Kubernetes","containerization","microservices architectures","observability systems","metrics","logging","tracing"],"x-skills-preferred":["ClickHouse","Elastic","Loki","VictoriaMetrics","Prometheus","Thanos","OpenTelemetry","Grafana","Terraform","modern testing frameworks","deployment strategies","data streaming technologies","AI/ML infrastructure"],"datePosted":"2026-04-18T15:46:41.788Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"New York, NY / Sunnyvale, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Go, Python, Kubernetes, containerization, microservices architectures, observability systems, metrics, logging, tracing, ClickHouse, Elastic, Loki, 
VictoriaMetrics, Prometheus, Thanos, OpenTelemetry, Grafana, Terraform, modern testing frameworks, deployment strategies, data streaming technologies, AI/ML infrastructure","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":109000,"maxValue":145000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_61be0866-2b0"},"title":"Principal Software Engineer, Performance","description":"<p>We are seeking a highly experienced Principal Software Engineer to join our Infrastructure Performance team. As a key member of this team, you will be responsible for defining and driving Airbnb&#39;s long-term performance strategy, spanning product performance, infrastructure efficiency, and business objectives for scale and growth.</p>\n<p>In this role, you will lead the architecture and development of performance profiling and instrumentation infrastructure, covering CPU, GPU, memory, request hot paths, utilization, and deployment events, making these capabilities available to all backend teams.</p>\n<p>You will partner with infrastructure teams across compute, reliability, backend frameworks, and AI Infra to ensure the fleet operates at optimal utilization.</p>\n<p>You will connect performance outcomes to business objectives and company-wide SLOs, and guide engineering teams in keeping the stack scalable and efficient.</p>\n<p>You will evaluate emerging hardware and software technologies, engage with the external solutions ecosystem, and advise on build vs. 
buy decisions in areas of strategic importance.</p>\n<p>As a mentor and technical leader, you will uplevel engineers across the organization through design reviews, architectural guidance, and performance best practices.</p>\n<p>To be successful in this role, you will need to have 12+ years of performance engineering experience in high-scale, high-growth production environments.</p>\n<p>You will need to have a deep understanding of how software and hardware systems interact at scale, including architectural patterns for performance-critical stacks.</p>\n<p>You will need to have strong familiarity with public cloud infrastructure (AWS, GCP, or Azure) and container orchestration (Docker, Kubernetes).</p>\n<p>You will need to have experience with profiling and instrumentation tooling across CPU, GPU, memory, and distributed request tracing.</p>\n<p>You will need to have demonstrated ability to define performance objectives and drive delivery against company-wide SLOs across multiple organizations.</p>\n<p>You will need to have strong communication and influence skills; comfortable driving technical direction with senior engineering and product leadership.</p>","url":"https://yubhub.co/jobs/job_61be0866-2b0","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Airbnb","sameAs":"https://www.airbnb.com/","logo":"https://logos.yubhub.co/airbnb.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/airbnb/jobs/7826679","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$292,000-$365,000 USD","x-skills-required":["performance engineering","software engineering","infrastructure performance","public cloud infrastructure","container orchestration","profiling and instrumentation tooling","distributed request tracing","cloud 
computing","containerization"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:41:00.673Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Remote-US"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"performance engineering, software engineering, infrastructure performance, public cloud infrastructure, container orchestration, profiling and instrumentation tooling, distributed request tracing, cloud computing, containerization","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":292000,"maxValue":365000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_51758515-c12"},"title":"Member of Technical Staff","description":"<p>We are seeking a highly skilled Member of Technical Staff to join our team in managing and enhancing reliability across a multi-data center environment.</p>\n<p>This role focuses on automating processes, building and implementing robust observability solutions, and ensuring seamless operations for mission-critical AI infrastructure.</p>\n<p>The ideal candidate will combine strong coding abilities with hands-on data center experience to build scalable reliability services, optimize system performance, and minimize downtime, including close partnership with facility operations to address physical infrastructure impacts.</p>\n<p>In an era where AI workloads demand near-zero downtime, this position plays a pivotal role in bridging software engineering principles with physical data center realities.</p>\n<p>By prioritizing automation and observability, team members in this role can reduce mean time to recovery (MTTR) by up to 50% through proactive monitoring and automated remediation, based on industry benchmarks from high-scale environments like those at hyperscale cloud 
providers.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Design, develop, and deploy scalable code and services (primarily in Python and Rust, with flexibility for emerging languages) to automate reliability workflows, including monitoring, alerting, incident response, and infrastructure provisioning.</li>\n</ul>\n<ul>\n<li>Implement and maintain observability tools and practices, such as metrics collection, logging, tracing, and dashboards, to provide real-time insights into system health across multiple data centers; open to innovative stacks beyond traditional ones like ELK.</li>\n</ul>\n<ul>\n<li>Collaborate with cross-functional teams, including software development, network engineering, site operations, and facility operations (critical facilities, mechanical/electrical teams, and data center infrastructure management), to identify reliability bottlenecks, automate solutions for fault tolerance, disaster recovery, capacity planning, and physical/environmental risk mitigation (e.g., power redundancy, cooling efficiency, and environmental monitoring integration).</li>\n</ul>\n<ul>\n<li>Troubleshoot and resolve complex issues in data center environments, including hardware failures, environmental anomalies, software bugs, and network-related problems, while adhering to reliability principles like error budgets and SLAs.</li>\n</ul>\n<ul>\n<li>Optimize Linux-based systems for performance, security, and reliability, including kernel tuning, container orchestration (e.g., Kubernetes or emerging alternatives), and scripting for automation.</li>\n</ul>\n<ul>\n<li>Understand network topologies and concepts in large-scale, multi-data center environments to effectively troubleshoot connectivity, routing, redundancy, and performance issues; integrate observability into data center interconnects and facility-level controls for rapid diagnosis and automation.</li>\n</ul>\n<ul>\n<li>Participate in on-call rotations, post-incident reviews (blameless postmortems), and continuous 
improvement initiatives to enhance overall site reliability, including joint exercises with facility teams for physical failover and recovery scenarios.</li>\n</ul>\n<ul>\n<li>Mentor junior team members and document processes to foster a culture of automation, knowledge sharing, and adaptability to new technologies.</li>\n</ul>\n<p>Basic Qualifications:</p>\n<ul>\n<li>Bachelor&#39;s degree in Computer Science, Computer Engineering, Electrical Engineering, or a closely related technical field (or equivalent professional experience).</li>\n</ul>\n<ul>\n<li>5+ years of hands-on experience in site reliability engineering (SRE), infrastructure engineering, DevOps, or systems engineering, preferably supporting large-scale, distributed, or production environments.</li>\n</ul>\n<ul>\n<li>Strong programming skills with proven production experience in Python (required for automation and tooling); experience with Rust or willingness to work in Rust is a plus, but strong coding fundamentals in at least one systems-level language (e.g., Python, Go, C++) are essential.</li>\n</ul>\n<ul>\n<li>Solid experience with Linux systems administration, performance tuning, kernel-level understanding, and scripting/automation in production environments.</li>\n</ul>\n<ul>\n<li>Practical knowledge of containerization and orchestration technologies, such as Docker and Kubernetes (or similar systems).</li>\n</ul>\n<ul>\n<li>Experience implementing observability solutions, including metrics, logging, tracing, monitoring tools (e.g., Prometheus, Grafana, or alternatives), alerting, and dashboards.</li>\n</ul>\n<ul>\n<li>Familiarity with troubleshooting complex issues in distributed systems, including software bugs, hardware failures, network problems, and environmental factors.</li>\n</ul>\n<ul>\n<li>Understanding of networking fundamentals (TCP/IP, routing, redundancy, DNS) in large-scale or multi-site environments.</li>\n</ul>\n<ul>\n<li>Experience participating in on-call rotations, incident 
response, post-incident reviews (blameless postmortems), and reliability practices such as error budgets or SLAs.</li>\n</ul>\n<ul>\n<li>Ability to collaborate effectively with cross-functional teams (software engineers, network teams, site/facility operations, mechanical/electrical teams).</li>\n</ul>\n<p>Preferred Skills and Experience:</p>\n<ul>\n<li>7+ years of experience in SRE or infrastructure roles, ideally in hyperscale, cloud, or AI/ML training infrastructure environments with multi-data center setups.</li>\n</ul>\n<ul>\n<li>Hands-on experience operating or scaling Kubernetes clusters (or equivalent orchestration) at large scale, including automation for provisioning, lifecycle management, and high-availability.</li>\n</ul>\n<ul>\n<li>Proficiency in Rust for systems programming and performance-critical components.</li>\n</ul>\n<ul>\n<li>Direct experience integrating software reliability tools with physical data center infrastructure.</li>\n</ul>","url":"https://yubhub.co/jobs/job_51758515-c12","directApply":true,"hiringOrganization":{"@type":"Organization","name":"xAI","sameAs":"https://www.xai.com/","logo":"https://logos.yubhub.co/xai.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/xai/jobs/5044403007","x-work-arrangement":"onsite","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Python","Rust","Linux systems administration","performance tuning","kernel-level understanding","scripting/automation","containerization","orchestration","observability","metrics collection","logging","tracing","dashboards","networking fundamentals","TCP/IP","routing","redundancy","DNS"],"x-skills-preferred":["Kubernetes","Docker","Grafana","Prometheus","ELK","DevOps","SRE","infrastructure engineering","systems engineering"],"datePosted":"2026-04-18T15:39:31.440Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Memphis, TN"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Python, Rust, Linux systems administration, performance tuning, kernel-level understanding, scripting/automation, containerization, orchestration, observability, metrics collection, logging, tracing, dashboards, networking fundamentals, TCP/IP, routing, redundancy, DNS, Kubernetes, Docker, Grafana, Prometheus, ELK, DevOps, SRE, infrastructure engineering, systems engineering"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_245477ba-29a"},"title":"Senior Software Engineer - Stability","description":"<p>The Stability team at Mercury champions and improves observability. We&#39;ve helped define incident response. We have introduced and support robust background work processing. 
We monitor and build tooling around platform and database health.</p>\n<p>As a Senior Software Engineer - Stability, you will lead technical projects end-to-end, driving them from concept to production. You will define solutions, analyze tradeoffs, make critical decisions, and deliver software that works today and is sustainable for tomorrow.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Championing reliability by making technical choices that improve the reliability of Mercury&#39;s systems and make reliability the default.</li>\n<li>Measuring outcomes by defining and collecting metrics that show how your work creates value for the business.</li>\n<li>Approaching code with craft by writing clear, testable, and maintainable code.</li>\n<li>Building for quality and sustainability by designing extensible systems, making balanced decisions on tech debt, planning careful rollouts, and owning the quality of your work through post-launch monitoring.</li>\n<li>Improving the developer experience by approaching problems with a product mindset, staying close to internal customers by supporting them and gathering their feedback.</li>\n</ul>\n<p>The ideal candidate for this role has expertise in PostgreSQL with query optimization, tuning, replication, pooling/proxying, or client-side libraries. They have worked with other data systems supporting a relational database: event streaming, OLAP, caches, etc. 
They have authored and operated Temporal workflows, are familiar with tracing and OpenTelemetry, and have learned by leading moderate-to-large technical projects, including planning, execution, and stakeholder management.</p>\n<p>The salary range for this role is $166,600 - 250,900 for US employees and CAD $157,400 - 237,100 for Canadian employees.</p>","url":"https://yubhub.co/jobs/job_245477ba-29a","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Mercury","sameAs":"https://www.mercury.com/","logo":"https://logos.yubhub.co/mercury.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/mercury/jobs/5969193004","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$166,600 - 250,900 (US) | CAD $157,400 - 237,100 (Canada)","x-skills-required":["PostgreSQL","query optimization","tuning","replication","pooling/proxying","client-side libraries","Temporal workflows","tracing","OpenTelemetry"],"x-skills-preferred":[],"datePosted":"2026-04-17T12:46:59.417Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA, New York, NY, Portland, OR, or Remote within Canada or United States"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Finance","skills":"PostgreSQL, query optimization, tuning, replication, pooling/proxying, client-side libraries, Temporal workflows, tracing, OpenTelemetry","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":166600,"maxValue":250900,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_62efca6f-b6f"},"title":"Senior AI Engineer","description":"<p>We&#39;re looking for a Senior AI Engineer who is 
obsessed with building AI systems that actually work in production: reliable, observable, cost-efficient, and genuinely useful. This is not a research role. You will ship AI-powered features that process real financial data for real businesses.</p>\n<p>LLM &amp; AI Pipeline Engineering - Design, build, and maintain production-grade LLM integration pipelines, including retrieval-augmented generation (RAG), prompt engineering, output parsing, and chain orchestration.</p>\n<p>Develop and operate AI features within Jeeves&#39;s core financial products: spend categorization, document extraction, anomaly detection, financial Q&amp;A, and automated reconciliation.</p>\n<p>Implement structured output validation, fallback handling, and confidence scoring to ensure AI decisions meet reliability standards for financial use cases.</p>\n<p>Evaluate and integrate AI frameworks and tools (LangChain, LlamaIndex, OpenAI API, Anthropic API, HuggingFace, vector databases) and advocate for the right tool for the job.</p>\n<p>Establish prompt versioning and evaluation practices to ensure AI outputs remain accurate and consistent as models and data evolve.</p>\n<p>Retrieval &amp; Vector Search - Design and maintain vector search pipelines using databases such as Pinecone, Weaviate, or pgvector to power semantic search and RAG-based features.</p>\n<p>Build document ingestion and chunking pipelines for Jeeves&#39;s financial data, processing invoices, receipts, policy documents, and transaction records.</p>\n<p>Optimize retrieval quality through embedding model selection, chunk strategy, metadata filtering, and re-ranking techniques.</p>\n<p>ML Model Serving &amp; Operations - Collaborate with data scientists to take trained ML models from experimental notebooks to production serving infrastructure.</p>\n<p>Build and maintain model serving endpoints with appropriate latency SLOs, input validation, and output monitoring.</p>\n<p>Implement model performance monitoring and data drift 
detection to ensure production models remain accurate over time.</p>\n<p>Support model retraining workflows by designing clean data pipelines and feature engineering that can be continuously updated.</p>\n<p>Backend Integration &amp; Reliability - Integrate AI services cleanly with Jeeves&#39;s backend microservices, designing clear API contracts, circuit breakers, and graceful degradation patterns.</p>\n<p>Write high-quality, testable backend code in Python or Go/Node.js to power AI-integrated features.</p>\n<p>Instrument AI components with structured logging, distributed tracing, latency dashboards, and alerting to ensure operational visibility.</p>\n<p>Collaboration &amp; Growth - Partner with Product, Backend Engineering, and Data Science to define the AI roadmap and translate requirements into reliable systems.</p>\n<p>Contribute to a culture of quality by writing design docs, reviewing peers&#39; AI system designs, and sharing learnings openly.</p>\n<p>Help grow the AI engineering practice at Jeeves by establishing patterns, tooling, and best practices that the broader team can build on.</p>","url":"https://yubhub.co/jobs/job_62efca6f-b6f","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Jeeves","sameAs":"https://www.jeeves.com/","logo":"https://logos.yubhub.co/jeeves.com.png"},"x-apply-url":"https://jobs.lever.co/tryjeeves/ded9e04e-f18e-4d4c-ae43-4b7882c6200b","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["LLM","AI","Python","LangChain","LlamaIndex","OpenAI API","Anthropic API","HuggingFace","vector databases","Pinecone","Weaviate","pgvector","semantic search","RAG-based features","document ingestion","chunking pipelines","embedding model selection","chunk strategy","metadata filtering","re-ranking techniques","model 
serving infrastructure","latency SLOs","input validation","output monitoring","model performance monitoring","data drift detection","clean data pipelines","feature engineering","API contracts","circuit breakers","graceful degradation patterns","structured logging","distributed tracing","latency dashboards","alerting"],"x-skills-preferred":[],"datePosted":"2026-04-17T12:39:23.341Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"India"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Finance","skills":"LLM, AI, Python, LangChain, LlamaIndex, OpenAI API, Anthropic API, HuggingFace, vector databases, Pinecone, Weaviate, pgvector, semantic search, RAG-based features, document ingestion, chunking pipelines, embedding model selection, chunk strategy, metadata filtering, re-ranking techniques, model serving infrastructure, latency SLOs, input validation, output monitoring, model performance monitoring, data drift detection, clean data pipelines, feature engineering, API contracts, circuit breakers, graceful degradation patterns, structured logging, distributed tracing, latency dashboards, alerting"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_e2350d04-53f"},"title":"Senior AI Engineer","description":"<p>We&#39;re looking for a Senior AI Engineer who is obsessed with building AI systems that actually work in production: reliable, observable, cost-efficient, and genuinely useful. This is not a research role. 
You will ship AI-powered features that process real financial data for real businesses.</p>\n<p>LLM &amp; AI Pipeline Engineering - Design, build, and maintain production-grade LLM integration pipelines, including retrieval-augmented generation (RAG), prompt engineering, output parsing, and chain orchestration.</p>\n<p>Develop and operate AI features within Jeeves&#39;s core financial products: spend categorization, document extraction, anomaly detection, financial Q&amp;A, and automated reconciliation.</p>\n<p>Implement structured output validation, fallback handling, and confidence scoring to ensure AI decisions meet reliability standards for financial use cases.</p>\n<p>Evaluate and integrate AI frameworks and tools (LangChain, LlamaIndex, OpenAI API, Anthropic API, HuggingFace, vector databases) and advocate for the right tool for the job.</p>\n<p>Establish prompt versioning and evaluation practices to ensure AI outputs remain accurate and consistent as models and data evolve.</p>\n<p>Retrieval &amp; Vector Search - Design and maintain vector search pipelines using databases such as Pinecone, Weaviate, or pgvector to power semantic search and RAG-based features.</p>\n<p>Build document ingestion and chunking pipelines for Jeeves&#39;s financial data, processing invoices, receipts, policy documents, and transaction records.</p>\n<p>Optimize retrieval quality through embedding model selection, chunk strategy, metadata filtering, and re-ranking techniques.</p>\n<p>ML Model Serving &amp; Operations - Collaborate with data scientists to take trained ML models from experimental notebooks to production serving infrastructure.</p>\n<p>Build and maintain model serving endpoints with appropriate latency SLOs, input validation, and output monitoring.</p>\n<p>Implement model performance monitoring and data drift detection to ensure production models remain accurate over time.</p>\n<p>Support model retraining workflows by designing clean data pipelines and feature 
engineering that can be continuously updated.</p>\n<p>Backend Integration &amp; Reliability - Integrate AI services cleanly with Jeeves&#39;s backend microservices, designing clear API contracts, circuit breakers, and graceful degradation patterns.</p>\n<p>Write high-quality, testable backend code in Python or Go/Node.js to power AI-integrated features.</p>\n<p>Instrument AI components with structured logging, distributed tracing, latency dashboards, and alerting to ensure operational visibility.</p>\n<p>Collaboration &amp; Growth - Partner with Product, Backend Engineering, and Data Science to define the AI roadmap and translate requirements into reliable systems.</p>\n<p>Contribute to a culture of quality by writing design docs, reviewing peers&#39; AI system designs, and sharing learnings openly.</p>\n<p>Help grow the AI engineering practice at Jeeves by establishing patterns, tooling, and best practices that the broader team can build on.</p>","url":"https://yubhub.co/jobs/job_e2350d04-53f","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Jeeves","sameAs":"https://www.jeeves.com/","logo":"https://logos.yubhub.co/jeeves.com.png"},"x-apply-url":"https://jobs.lever.co/tryjeeves/66241934-7138-4d7d-8b05-a211ec5d6e24","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["LLM","AI","Python","LangChain","LlamaIndex","OpenAI API","Anthropic API","HuggingFace","vector databases","Pinecone","Weaviate","pgvector","PostgreSQL","async patterns","cloud infrastructure","AWS","GCP","Azure","structured logging","distributed tracing","latency 
dashboards","alerting"],"x-skills-preferred":[],"datePosted":"2026-04-17T12:38:54.694Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Colombia"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Finance","skills":"LLM, AI, Python, LangChain, LlamaIndex, OpenAI API, Anthropic API, HuggingFace, vector databases, Pinecone, Weaviate, pgvector, PostgreSQL, async patterns, cloud infrastructure, AWS, GCP, Azure, structured logging, distributed tracing, latency dashboards, alerting"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_d477874c-cf5"},"title":"Senior AI Engineer","description":"<p>We&#39;re looking for a Senior AI Engineer who is obsessed with building AI systems that actually work in production: reliable, observable, cost-efficient, and genuinely useful. This is not a research role. You will ship AI-powered features that process real financial data for real businesses.</p>\n<p>LLM &amp; AI Pipeline Engineering - Design, build, and maintain production-grade LLM integration pipelines , including retrieval-augmented generation (RAG), prompt engineering, output parsing, and chain orchestration.</p>\n<p>Develop and operate AI features within Jeeves&#39;s core financial products: spend categorization, document extraction, anomaly detection, financial Q&amp;A, and automated reconciliation.</p>\n<p>Implement structured output validation, fallback handling, and confidence scoring to ensure AI decisions meet reliability standards for financial use cases.</p>\n<p>Evaluate and integrate AI frameworks and tools (LangChain, LlamaIndex, OpenAI API, Anthropic API, HuggingFace, vector databases) and advocate for the right tool for the job.</p>\n<p>Establish prompt versioning and evaluation practices to ensure AI outputs remain accurate and consistent as models and data evolve.</p>\n<p>Retrieval &amp; Vector 
Search - Design and maintain vector search pipelines using databases such as Pinecone, Weaviate, or pgvector to power semantic search and RAG-based features.</p>\n<p>Build document ingestion and chunking pipelines for Jeeves&#39;s financial data, processing invoices, receipts, policy documents, and transaction records.</p>\n<p>Optimize retrieval quality through embedding model selection, chunk strategy, metadata filtering, and re-ranking techniques.</p>\n<p>ML Model Serving &amp; Operations - Collaborate with data scientists to take trained ML models from experimental notebooks to production serving infrastructure.</p>\n<p>Build and maintain model serving endpoints with appropriate latency SLOs, input validation, and output monitoring.</p>\n<p>Implement model performance monitoring and data drift detection to ensure production models remain accurate over time.</p>\n<p>Support model retraining workflows by designing clean data pipelines and feature engineering that can be continuously updated.</p>\n<p>Backend Integration &amp; Reliability - Integrate AI services cleanly with Jeeves&#39;s backend microservices, designing clear API contracts, circuit breakers, and graceful degradation patterns.</p>\n<p>Write high-quality, testable backend code in Python or Go/Node.js to power AI-integrated features.</p>\n<p>Instrument AI components with structured logging, distributed tracing, latency dashboards, and alerting to ensure operational visibility.</p>\n<p>Collaboration &amp; Growth - Partner with Product, Backend Engineering, and Data Science to define the AI roadmap and translate requirements into reliable systems.</p>\n<p>Contribute to a culture of quality by writing design docs, reviewing peers&#39; AI system designs, and sharing learnings openly.</p>\n<p>Help grow the AI engineering practice at Jeeves by establishing patterns, tooling, and best practices that the broader team can build on.</p>","url":"https://yubhub.co/jobs/job_d477874c-cf5","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Jeeves","sameAs":"https://www.jeeves.com/","logo":"https://logos.yubhub.co/jeeves.com.png"},"x-apply-url":"https://jobs.lever.co/tryjeeves/639e39d0-b357-4bc2-aff2-968cdedb14b6","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["LLM","AI","Python","Go","Node.js","Pinecone","Weaviate","pgvector","LangChain","LlamaIndex","OpenAI API","Anthropic API","HuggingFace","vector databases","API contracts","circuit breakers","graceful degradation patterns","structured logging","distributed tracing","latency dashboards","alerting"],"x-skills-preferred":[],"datePosted":"2026-04-17T12:38:44.910Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Argentina"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Finance","skills":"LLM, AI, Python, Go, Node.js, Pinecone, Weaviate, pgvector, LangChain, LlamaIndex, OpenAI API, Anthropic API, HuggingFace, vector databases, API contracts, circuit breakers, graceful degradation patterns, structured logging, distributed tracing, latency dashboards, alerting"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_8af6c2b6-03c"},"title":"Member of Technical Staff, Domain (Backend Engineer)","description":"<p>At Anchorage Digital, we are building the world’s most advanced digital asset platform for institutions to participate in crypto. As a Member of Technical Staff on the Domain Engineering team, you are responsible for ensuring a robust technology stack, enabling our company to build scalable, efficient, and maintainable products. 
This allows our product teams to focus on developing customer-focused features.</p>\n<p>You are a strong individual contributor and have the ability to significantly contribute to and execute complex engineering projects, enabled with appropriate coding and testing. You can understand the “why” in order to connect dependencies to the “bigger picture” and the Anchorage mission and product roadmap.</p>\n<p><strong>Technical Skills</strong></p>\n<ul>\n<li>Collaborate with other engineering teams to identify areas for improvement across our engineering stack.</li>\n<li>Previous experience in establishing shared libraries across teams, with a focus on standardization, code quality, and reduced duplication.</li>\n<li>Proven experience with application observability projects that involved setting up performance metrics, log aggregation, tracing, and alerting systems.</li>\n</ul>\n<p><strong>Complexity and Impact of Work</strong></p>\n<ul>\n<li>Find the right balance between progress (i.e. shipping quickly) and perfection (i.e. 
measuring twice).</li>\n<li>Foster an efficient deterministic testing culture, with an emphasis on minimizing tech debt and bureaucracy.</li>\n<li>Ship code that will impact the whole organization.</li>\n</ul>\n<p><strong>Organizational Knowledge</strong></p>\n<ul>\n<li>Collaborate across multiple teams, especially on integration, standardization, and shared resources.</li>\n<li>Influence others by engaging in in-depth technical design discussions and demonstrating best practices through technical leadership by example.</li>\n<li>Make a meaningful impact across the entire engineering organization, extending influence beyond the immediate team.</li>\n</ul>\n<p><strong>Communication and Influence</strong></p>\n<ul>\n<li>Communicate technical concepts and solutions effectively to non-technical stakeholders.</li>\n<li>Build strong relationships with colleagues to drive collaboration and innovation.</li>\n</ul>\n<p><strong>You may be a fit for this role if you:</strong></p>\n<ul>\n<li>Are passionate about constantly seeking opportunities to refine and enhance existing systems and processes.</li>\n<li>Driven by a passion for being a force multiplier and influential technical leader in a dynamic, fast-paced startup environment.</li>\n<li>Have expert coding skills in Golang.</li>\n<li>Experienced in cross-functional projects, collaborating effectively with your team and adjacent teams to tackle complex challenges.</li>\n<li>Have excellent soft skills, including the ability to adapt communication for both internal and external stakeholders in an effective manner, bridging gaps with empathy and proactive communication.</li>\n</ul>\n<p><strong>Although not a requirement, bonus points if:</strong></p>\n<ul>\n<li>You have experience with infrastructure-as-code, Terraform, Gitops, Helm.</li>\n<li>You have experience with Google Cloud Platform &amp; Security.</li>\n</ul>","url":"https://yubhub.co/jobs/job_8af6c2b6-03c","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anchorage Digital","sameAs":"https://anchorage.com","logo":"https://logos.yubhub.co/anchorage.com.png"},"x-apply-url":"https://jobs.lever.co/anchorage/5898d01d-a4a5-44e5-8d20-2f6710dc2035","x-work-arrangement":"remote","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Golang","Application Observability","Performance Metrics","Log Aggregation","Tracing","Alerting Systems"],"x-skills-preferred":["Infrastructure-as-code","Terraform","Gitops","Helm","Google Cloud Platform & Security"],"datePosted":"2026-04-17T12:24:58.203Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"United States"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Finance","skills":"Golang, Application Observability, Performance Metrics, Log Aggregation, Tracing, Alerting Systems, Infrastructure-as-code, Terraform, Gitops, Helm, Google Cloud Platform & Security"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_c043b353-08f"},"title":"Scaled Support Specialist","description":"<p>## <strong>About the Role</strong>  We&#39;re looking for a Scaled Support Specialist who lives at the intersection of deep technical troubleshooting and exceptional human communication. You&#39;ll be the front line for developers integrating with OpenRouter&#39;s API — diagnosing complex issues across dozens of model providers, untangling new edge cases, and making sure every developer who reaches out feels like they have a partner, not a ticket number.  This is not a scripted helpdesk role. Our users are highly capable engineers building the next generation of AI applications, which means the problems they bring to us are complex, nuanced, and frequently novel. You&#39;ll encounter issues daily where there is no runbook. You&#39;ll need to figure it out, often with incomplete information, and usually before anyone else on the team has seen it either.  If you&#39;re the kind of person who reads API changelogs for fun, has strong opinions about error message quality, and gets genuine satisfaction from turning a frustrated developer into a happy one — keep reading.  
## <strong>Key Responsibilities</strong>  ### <strong>Troubleshooting &amp; Problem Solving</strong> (Core Focus)  - Diagnose and resolve complex technical issues across OpenRouter&#39;s API, spanning multiple LLM providers - Reproduce bugs in ambiguous environments — different SDKs, languages, frameworks, and auth configurations — using tools like `curl`, Postman, and small test apps - Read and interpret logs, headers, and request traces; identify whether the problem is client-side, OpenRouter-side, or an upstream provider issue vs. a user misconfiguration - Turn &quot;it doesn&#39;t work&quot; into actionable findings: exact steps to reproduce, clear hypotheses, and verified fixes or workarounds  ### <strong>Developer Communication &amp; Advocacy</strong>  - Respond to developer inquiries across support channels (email, Discord, GitHub) with clarity, empathy, and technical precision - Translate complex technical root causes into human-friendly explanations - Set expectations on timelines and next steps; provide proactive updates and close the loop - Identify patterns in support requests and advocate internally for documentation improvements, API design changes, or better messages  ### <strong>Self-Directed Research &amp; Learning</strong>  - Stay current with the rapidly evolving LLM ecosystem - Develop deep expertise in OpenRouter&#39;s routing logic, fallback behavior, rate limiting, streaming (SSE), and billing systems with minimal hand-holding  ### <strong>Bridge to Product &amp; Engineering</strong>  - Spot systemic issues underneath individual tickets and push for the fix that prevents 50 more - Identify trends in support volume to capture product feedback and inform roadmap priorities - Collaborate on improving the developer experience  ## <strong>About You</strong>  ### <strong>Required:</strong>  - 4+ years in a technical support, developer support, solutions engineering, or similar role — ideally supporting an API or developer tools product - 
Exceptional troubleshooting instincts - Strong API fluency - Proficiency in at least one scripting language (Python or TypeScript) - Excellent written communication - Comfort with ambiguity - Genuine passion for AI and LLMs  ### <strong>Nice-to-Haves:</strong>  - Familiarity with the OpenAI SDK / Chat Completions API format - Experience with AI/ML frameworks like LangChain, LlamaIndex, or Hugging Face - Experience with observability tools (logging, tracing, metrics) - Experience scaling support operations — e.g., implementing AI-assisted support bots, building internal support dashboards, or creating automated triage workflows - Contributions to open-source projects or developer communities - Background in or exposure to ML/AI concepts beyond just using APIs (benchmarking, evals, fine-tuning)  ## <strong>Why OpenRouter</strong>  - Work at the center of the AI infrastructure stack as enterprises define how they adopt LLMs. - High ownership and autonomy to define how developer education and community scale. - Opportunity to shape a foundational function at a fast-growing company. - Fully remote team with a culture of autonomy and trust. 
- Competitive compensation, including base salary and equity.</p>","url":"https://yubhub.co/jobs/job_c043b353-08f","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenRouter","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/openrouter.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openrouter/89ff6b47-ba08-4418-b24b-c136dbf2ef82","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"Full time","x-salary-range":null,"x-skills-required":["API fluency","scripting language (Python or TypeScript)","exceptional troubleshooting instincts","strong API fluency","excellent written communication"],"x-skills-preferred":["OpenAI SDK / Chat Completions API format","AI/ML frameworks like LangChain, LlamaIndex, or Hugging Face","observability tools (logging, tracing, metrics)","scaling support operations","contributions to open-source projects or developer communities"],"datePosted":"2026-03-09T09:48:23.067Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Remote (US)"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"API fluency, scripting language (Python or TypeScript), exceptional troubleshooting instincts, strong API fluency, excellent written communication, OpenAI SDK / Chat Completions API format, AI/ML frameworks like LangChain, LlamaIndex, or Hugging Face, observability tools (logging, tracing, metrics), scaling support operations, contributions to open-source projects or developer communities"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_f70dd4a2-526"},"title":"Staff+ Software Engineer, Observability","description":"<p><strong>About the Role</strong></p>\n<p>Anthropic is seeking 
talented and experienced Software Engineers to join our Observability team within the Infrastructure organisation. The Observability team owns the monitoring and telemetry infrastructure that every engineer and researcher at Anthropic depends on—from metrics and logging pipelines to distributed tracing, error analytics, alerting, and the dashboards and query interfaces that make it all actionable.</p>\n<p><strong>Responsibilities</strong></p>\n<ul>\n<li>Design and build scalable telemetry ingest and storage pipelines for metrics, logs, traces, and error data across Anthropic&#39;s multi-cluster infrastructure</li>\n<li>Own and evolve core observability platforms, driving migrations and architectural improvements that improve reliability, reduce cost, and scale with organisational growth</li>\n<li>Build instrumentation libraries, SDKs, and integrations that make it easy for engineering teams to emit high-quality telemetry from their services</li>\n<li>Drive alerting and SLO infrastructure that enables teams to define, monitor, and respond to reliability targets with minimal noise</li>\n<li>Reduce mean time to detection and resolution by building cross-signal correlation, unified query interfaces, and AI-assisted diagnostic tooling</li>\n<li>Partner with Research, Inference, Product, and Infrastructure teams to ensure observability solutions meet the unique needs of each organisation</li>\n</ul>\n<p><strong>You May Be a Good Fit If You:</strong></p>\n<ul>\n<li>Have 10+ years of relevant industry experience building and operating large-scale observability or monitoring infrastructure</li>\n<li>Have deep experience with at least one observability signal area (metrics, logging, tracing, or error analytics) and familiarity with the others</li>\n<li>Understand high-throughput data pipelines, columnar storage engines, and the tradeoffs involved in ingesting and querying telemetry data at scale</li>\n<li>Have experience operating or building on top of observability 
platforms such as Prometheus, Grafana, ClickHouse, OpenTelemetry, or similar systems</li>\n<li>Have strong proficiency in at least one of Python, Rust, or Go</li>\n<li>Have excellent communication skills and enjoy partnering with internal teams to improve their operational visibility and incident response capabilities</li>\n<li>Are excited about building foundational infrastructure and are comfortable working independently on ambiguous, high-impact technical challenges</li>\n</ul>\n<p><strong>Strong Candidates May Also Have:</strong></p>\n<ul>\n<li>Experience operating metrics systems at very high cardinality (hundreds of millions of active time series or more)</li>\n<li>Experience with log storage migrations or operating columnar databases (ClickHouse, BigQuery, or similar) for analytics workloads</li>\n<li>Experience with OpenTelemetry instrumentation, collector pipelines, and tail-based sampling strategies</li>\n<li>Experience building or operating alerting platforms, on-call tooling, or SLO frameworks at scale</li>\n<li>Experience with Kubernetes-native monitoring, eBPF-based observability, or continuous profiling</li>\n<li>Interest in applying AI/LLMs to operational workflows such as automated root cause analysis, anomaly detection, or intelligent alerting</li>\n</ul>\n<p><strong>Logistics</strong></p>\n<ul>\n<li>Education requirements: We require at least a Bachelor&#39;s degree in a related field or equivalent experience.</li>\n<li>Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.</li>\n<li>Visa sponsorship: We do sponsor visas! However, we aren&#39;t able to successfully sponsor visas for every role and every candidate. 
But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.</li>\n</ul>\n<p><strong>We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you&#39;re interested in this work.</strong></p>\n<p><strong>Your safety matters to us. To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses.</strong></p>","url":"https://yubhub.co/jobs/job_f70dd4a2-526","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://job-boards.greenhouse.io","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5139910008","x-work-arrangement":"hybrid","x-experience-level":"staff","x-job-type":"full-time","x-salary-range":"$405,000 - $485,000 USD","x-skills-required":["observability","metrics","logging","tracing","error analytics","alerting","SLO infrastructure","cross-signal correlation","unified query interfaces","AI-assisted diagnostic tooling","Python","Rust","Go","Prometheus","Grafana","ClickHouse","OpenTelemetry"],"x-skills-preferred":["OpenTelemetry instrumentation","collector pipelines","tail-based sampling strategies","Kubernetes-native monitoring","eBPF-based observability","continuous profiling","AI/LLMs","automated root cause analysis","anomaly detection","intelligent 
alerting"],"datePosted":"2026-03-08T13:52:33.217Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA | New York City, NY | Seattle, WA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"observability, metrics, logging, tracing, error analytics, alerting, SLO infrastructure, cross-signal correlation, unified query interfaces, AI-assisted diagnostic tooling, Python, Rust, Go, Prometheus, Grafana, ClickHouse, OpenTelemetry, OpenTelemetry instrumentation, collector pipelines, tail-based sampling strategies, Kubernetes-native monitoring, eBPF-based observability, continuous profiling, AI/LLMs, automated root cause analysis, anomaly detection, intelligent alerting","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":405000,"maxValue":485000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_3514d749-08c"},"title":"Senior Support Engineer","description":"<p><strong>Senior Support Engineer - San Francisco</strong></p>\n<p><strong>Location</strong></p>\n<p>San Francisco</p>\n<p><strong>Employment Type</strong></p>\n<p>Full time</p>\n<p><strong>Department</strong></p>\n<p><strong>Compensation</strong></p>\n<ul>\n<li>$234K – $260K • Offers Equity</li>\n</ul>\n<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. 
In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>\n<p><strong>Benefits</strong></p>\n<ul>\n<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>\n</ul>\n<ul>\n<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>\n</ul>\n<ul>\n<li>401(k) retirement plan with employer match</li>\n</ul>\n<ul>\n<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>\n</ul>\n<ul>\n<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>\n</ul>\n<ul>\n<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>\n</ul>\n<ul>\n<li>Mental health and wellness support</li>\n</ul>\n<ul>\n<li>Employer-paid basic life and disability coverage</li>\n</ul>\n<ul>\n<li>Annual learning and development stipend to fuel your professional growth</li>\n</ul>\n<ul>\n<li>Daily meals in our offices, and meal delivery credits as eligible</li>\n</ul>\n<ul>\n<li>Relocation support for eligible employees</li>\n</ul>\n<ul>\n<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>\n</ul>\n<p><strong>About the Team</strong></p>\n<p>The Technical Support team is responsible for ensuring that developers and enterprises can reliably build mission critical solutions using OpenAI models. We provide technical guidance, resolve complex issues and support customers in maximizing value and adoption from deploying our highly-capable models. 
We work closely with Technical Success, Product, Engineering and others to deliver the best possible experience to our customers at scale. We think from an automation-first mindset and leverage the latest in AI to scale our support operations. Join the Senior Support Engineering (SSE) team at OpenAI and help shape the future of Technical Support in the age of AI.</p>\n<p><strong>About the Role</strong></p>\n<p>We are looking for a Senior Support Engineer to collaborate directly with our strategic enterprise accounts and product teams, helping solve some of the most difficult problems faced by our Customers. You will be part of the best technical troubleshooting team at OpenAI, and our Customers and Engineering teams will look to you for technical guidance in addressing the most technically difficult issues in our environment.</p>\n<p>As a Senior Support Engineer, you will design and run operational processes to monitor our top strategic customers and a 24x7 response team. You’ll work closely with our Infrastructure and Engineering teams to deliver the best possible experience to customers at scale. Working directly with our most strategic Customers - You will be crucial to the success of the most innovative, disruptive, and high-scale AI solutions being built with the OpenAI API platform.</p>\n<p>The nature of this role will be low volume, high difficulty.</p>\n<p>This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.</p>\n<p><strong>In this role, you will:</strong></p>\n<ul>\n<li>Be among the foremost technical and troubleshooting experts for our API platform at OpenAI. You are the last line of defense before the core Engineering team.</li>\n</ul>\n<ul>\n<li>Proactively identify and implement opportunities to scale support operations by leveraging automation and advancements in AI technologies. 
Contribute to shaping the future of technical support in an AI-driven era.</li>\n</ul>\n<ul>\n<li>Configure and use advanced monitoring and alerting workflows to proactively detect customer impacting issues in real time.</li>\n</ul>\n<ul>\n<li>In partnership with engineering, contribute to reliability reviews and preparedness for new features, launches, or strategic customer requirement updates. Ensure that operational readiness (monitoring, alerting, and fallback plans) is in place for any such changes.</li>\n</ul>\n<ul>\n<li>Design and refine incident response processes and documentation across strategic customers, engineering and support teams.</li>\n</ul>\n<ul>\n<li>Analyze operational metrics and incident RCAs to identify areas for improvement. Proactively recommend and implement enhancements to monitoring dashboards, alert configurations, and support workflows.</li>\n</ul>\n<ul>\n<li>Provide support coverage during holidays and weekends based on business needs.</li>\n</ul>\n<p><strong>You might thrive in this role if you:</strong></p>\n<ul>\n<li>Have a Bachelor’s degree in Computer Science or a related field. A strong software engineering foundation is important for this role’s success.</li>\n</ul>\n<ul>\n<li>Have 8+ years of experience in technical operations roles such as SRE/NOC, designing monitoring systems and resolving production issues in fast-paced and mission-critical environments. A strong track record of troubleshooting complex technical problems at the systems level.</li>\n</ul>\n<ul>\n<li>Have deep familiarity with modern monitoring, alerting, and observability practices. Hands‑on experience setting up or managing metrics, logging, and tracing for distributed systems (e.g., understanding of SLIs/SLOs, alert tuning, dashboard creation).</li>\n</ul>\n<ul>\n<li>Have proven experience leading incident response for high‑severity outages or service disruptions. 
Able to perform real‑time incident coordination, root cause analysis, and communication with stakeholders.</li>\n</ul>\n<ul>\n<li>Are able to work effectively in a fast-paced environment, prioritize tasks, and manage multiple projects simultaneously.</li>\n</ul>\n<ul>\n<li>Are a strong communicator and team player, with excellent written and verbal communication skills.</li>\n</ul>\n<ul>\n<li>Are able to adapt to changing priorities and requirements, and are flexible in your approach to problem-solving.</li>\n</ul>","url":"https://yubhub.co/jobs/job_3514d749-08c","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/5431666c-530b-49c0-b67e-32477f9eaf5e","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$234K – $260K","x-skills-required":["Bachelor’s degree in Computer Science or a related field","8+ years of experience in technical operations roles such as SRE/NOC","Designing monitoring systems and resolving production issues in fast-paced and mission-critical environments","Troubleshooting complex technical problems at the systems level","Modern monitoring, alerting, and observability practices","Metrics, logging, and tracing for distributed systems","SLIs/SLOs, alert tuning, dashboard creation","Incident response for high‑severity outages or service disruptions","Real-time incident coordination, root cause analysis, and communication with stakeholders"],"x-skills-preferred":["Automation and advancements in AI technologies","Automation-first mindset and leveraging the latest in AI to scale support operations","Technical and troubleshooting expertise for API platform at OpenAI","Proactive identification and implementation of 
opportunities to scale support operations","Advanced monitoring and alerting workflows to proactively detect customer impacting issues in real time","Reliability reviews and preparedness for new features, launches, or strategic customer requirement updates","Operational readiness (monitoring, alerting, and fallback plans)","Incident response processes and documentation across strategic customers, engineering and support teams","Operational metrics and incident RCAs to identify areas for improvement","Enhancements to monitoring dashboards, alert configurations, and support workflows"],"datePosted":"2026-03-06T18:43:55.714Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Bachelor’s degree in Computer Science or a related field, 8+ years of experience in technical operations roles such as SRE/NOC, Designing monitoring systems and resolving production issues in fast-paced and mission-critical environments, Troubleshooting complex technical problems at the systems level, Modern monitoring, alerting, and observability practices, Metrics, logging, and tracing for distributed systems, SLIs/SLOs, alert tuning, dashboard creation, Incident response for high‑severity outages or service disruptions, Real-time incident coordination, root cause analysis, and communication with stakeholders, Automation and advancements in AI technologies, Automation-first mindset and leveraging the latest in AI to scale support operations, Technical and troubleshooting expertise for API platform at OpenAI, Proactive identification and implementation of opportunities to scale support operations, Advanced monitoring and alerting workflows to proactively detect customer impacting issues in real time, Reliability reviews and preparedness for new features, launches, or strategic customer requirement updates, Operational readiness (monitoring, 
alerting, and fallback plans), Incident response processes and documentation across strategic customers, engineering and support teams, Operational metrics and incident RCAs to identify areas for improvement, Enhancements to monitoring dashboards, alert configurations, and support workflows","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":234000,"maxValue":260000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_3a34dc62-295"},"title":"Software Engineer, Platform Systems","description":"<p><strong>Software Engineer, Platform Systems</strong></p>\n<p><strong>Location</strong></p>\n<p>San Francisco</p>\n<p><strong>Employment Type</strong></p>\n<p>Full time</p>\n<p><strong>Department</strong></p>\n<p>Scaling</p>\n<p><strong>Compensation</strong></p>\n<ul>\n<li>$310K – $460K • Offers Equity</li>\n</ul>\n<p>The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. 
In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.</p>\n<ul>\n<li>Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts</li>\n</ul>\n<ul>\n<li>Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)</li>\n</ul>\n<ul>\n<li>401(k) retirement plan with employer match</li>\n</ul>\n<ul>\n<li>Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)</li>\n</ul>\n<ul>\n<li>Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees</li>\n</ul>\n<ul>\n<li>13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)</li>\n</ul>\n<ul>\n<li>Mental health and wellness support</li>\n</ul>\n<ul>\n<li>Employer-paid basic life and disability coverage</li>\n</ul>\n<ul>\n<li>Annual learning and development stipend to fuel your professional growth</li>\n</ul>\n<ul>\n<li>Daily meals in our offices, and meal delivery credits as eligible</li>\n</ul>\n<ul>\n<li>Relocation support for eligible employees</li>\n</ul>\n<ul>\n<li>Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.</li>\n</ul>\n<p>More details about our benefits are available to candidates during the hiring process.</p>\n<p>This role is at-will and OpenAI reserves the right to modify base pay and other compensation components at any time based on individual performance, team or company results, or market conditions.</p>\n<p><strong>About the Team</strong></p>\n<p>The Platform Systems team at OpenAI operates at the intersection of 
cutting-edge AI and large-scale distributed systems. We build the engineering and research infrastructure required to train OpenAI’s flagship models on some of the world’s largest, custom-built supercomputers.</p>\n<p>Our team develops core model training software and works deep in the stack - spanning collective communication, compute efficiency, parallelism strategies, fault tolerance, failure detection, and observability. The systems we build are foundational to OpenAI’s research velocity, enabling reliable, efficient training at frontier scale.</p>\n<p>We collaborate closely with researchers across the organization, continuously incorporating learnings from across OpenAI into the evolution of our training platform.</p>\n<p><strong>About the Role</strong></p>\n<p>As a Software Engineer, Platform Systems, you will design and build distributed systems that provide visibility into large-scale training workloads and help operate them reliably at scale.</p>\n<p>You’ll work on failure detection, tracing, and observability systems that identify slow or faulty nodes, surface performance bottlenecks, and help engineers understand and optimize massive distributed training jobs. 
This infrastructure is critical to operating OpenAI’s training stack and is actively evolving to support new use cases and increasingly complex workloads.</p>\n<p>This role sits at the core of our training infrastructure, blending systems engineering, performance analysis, and large-scale debugging.</p>\n<p><strong>In This Role, You Will</strong></p>\n<ul>\n<li>Design and build distributed failure detection, tracing, and profiling systems for large-scale AI training jobs</li>\n</ul>\n<ul>\n<li>Develop tooling to identify slow, faulty, or misbehaving nodes and provide actionable visibility into system behavior</li>\n</ul>\n<ul>\n<li>Improve observability, reliability, and performance across OpenAI’s training platform</li>\n</ul>\n<ul>\n<li>Debug and resolve issues in complex, high-throughput distributed systems</li>\n</ul>\n<ul>\n<li>Collaborate with systems, infrastructure, and research teams to evolve platform capabilities</li>\n</ul>\n<ul>\n<li>Extend and adapt failure detection systems or tracing systems to support new training paradigms and workloads</li>\n</ul>\n<p><strong>You Might Thrive in This Role If You</strong></p>\n<ul>\n<li>Care deeply about performance, stability, and observability in distributed systems</li>\n</ul>\n<ul>\n<li>Enjoy finding and fixing issues in large-scale systems and automating operational workflows</li>\n</ul>\n<ul>\n<li>Have experience writing low-level software where system details matter</li>\n</ul>\n<ul>\n<li>Understand hardware, operating systems, networking, concurrency, and distributed systems</li>\n</ul>\n<ul>\n<li>Have a background in high-performance computing or low-level systems engineering</li>\n</ul>\n<ul>\n<li>Are excited to work on critical infrastructure that powers frontier AI research</li>\n</ul>\n<p><strong>About OpenAI</strong></p>\n<p>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. 
We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_3a34dc62-295","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/5e4ed6d1-2417-4bf5-bae0-905931c488e3","x-work-arrangement":"onsite","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":"$310K – $460K • Offers Equity","x-skills-required":["Distributed systems","Failure detection","Tracing","Observability","Performance analysis","Low-level software development","Hardware","Operating systems","Networking","Concurrency","Distributed systems engineering","High-performance computing"],"x-skills-preferred":["Cloud computing","Containerization","DevOps","Machine learning","Data science"],"datePosted":"2026-03-06T18:31:07.008Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Distributed systems, Failure detection, Tracing, Observability, Performance analysis, Low-level software development, Hardware, Operating systems, Networking, Concurrency, Distributed systems engineering, High-performance computing, Cloud computing, Containerization, DevOps, Machine learning, Data 
science","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":310000,"maxValue":460000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_49ecc85f-6cb"},"title":"Software Engineer, Platform Systems","description":"<p><strong>Software Engineer, Platform Systems</strong></p>\n<p><strong>Location</strong></p>\n<p>London, UK</p>\n<p><strong>Employment Type</strong></p>\n<p>Full time</p>\n<p><strong>Department</strong></p>\n<p>Scaling</p>\n<p><strong>About the Team</strong></p>\n<p>The Platform Systems team at OpenAI operates at the intersection of cutting-edge AI and large-scale distributed systems. We build the engineering and research infrastructure required to train OpenAI’s flagship models on some of the world’s largest, custom-built supercomputers.</p>\n<p>Our team develops core model training software and works deep in the stack - spanning collective communication, compute efficiency, parallelism strategies, fault tolerance, failure detection, and observability. The systems we build are foundational to OpenAI’s research velocity, enabling reliable, efficient training at frontier scale.</p>\n<p>We collaborate closely with researchers across the organisation, continuously incorporating learnings from across OpenAI into the evolution of our training platform.</p>\n<p><strong>About the Role</strong></p>\n<p>As a Software Engineer, Platform Systems, you will design and build distributed systems that provide visibility into large-scale training workloads and help operate them reliably at scale.</p>\n<p>You’ll work on failure detection, tracing, and observability systems that identify slow or faulty nodes, surface performance bottlenecks, and help engineers understand and optimise massive distributed training jobs. 
This infrastructure is critical to operating OpenAI’s training stack and is actively evolving to support new use cases and increasingly complex workloads.</p>\n<p>This role sits at the core of our training infrastructure, blending systems engineering, performance analysis, and large-scale debugging.</p>\n<p><strong>In This Role, You Will</strong></p>\n<ul>\n<li>Design and build distributed failure detection, tracing, and profiling systems for large-scale AI training jobs</li>\n</ul>\n<ul>\n<li>Develop tooling to identify slow, faulty, or misbehaving nodes and provide actionable visibility into system behaviour</li>\n</ul>\n<ul>\n<li>Improve observability, reliability, and performance across OpenAI’s training platform</li>\n</ul>\n<ul>\n<li>Debug and resolve issues in complex, high-throughput distributed systems</li>\n</ul>\n<ul>\n<li>Collaborate with systems, infrastructure, and research teams to evolve platform capabilities</li>\n</ul>\n<ul>\n<li>Extend and adapt failure detection systems or tracing systems to support new training paradigms and workloads</li>\n</ul>\n<p><strong>You Might Thrive in This Role If You</strong></p>\n<ul>\n<li>Care deeply about performance, stability, and observability in distributed systems</li>\n</ul>\n<ul>\n<li>Enjoy finding and fixing issues in large-scale systems and automating operational workflows</li>\n</ul>\n<ul>\n<li>Have experience writing low-level software where system details matter</li>\n</ul>\n<ul>\n<li>Understand hardware, operating systems, networking, concurrency, and distributed systems</li>\n</ul>\n<ul>\n<li>Have a background in high-performance computing or low-level systems engineering</li>\n</ul>\n<ul>\n<li>Are excited to work on critical infrastructure that powers frontier AI research</li>\n</ul>\n<p><strong>About OpenAI</strong></p>\n<p>OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits 
all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.</p>","url":"https://yubhub.co/jobs/job_49ecc85f-6cb","directApply":true,"hiringOrganization":{"@type":"Organization","name":"OpenAI","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/openai.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/openai/4349f80b-3518-4e4d-b9eb-3e5e9b490cc7","x-work-arrangement":"onsite","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["Distributed systems","Failure detection","Tracing","Observability","Performance analysis","Low-level software development","Hardware","Operating systems","Networking","Concurrency","Distributed systems engineering","High-performance computing"],"x-skills-preferred":["Cloud computing","Containerization","DevOps","Machine learning","Data science"],"datePosted":"2026-03-06T18:27:53.954Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"London, UK"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Distributed systems, Failure detection, Tracing, Observability, Performance analysis, Low-level software development, Hardware, Operating systems, Networking, Concurrency, Distributed systems engineering, High-performance computing, Cloud computing, Containerization, DevOps, Machine learning, Data science"}]}