DevOps Engineer (all genders)

bee517db-e9c

Join our DevOps team at Holidu, a central team across the entire tech organisation, responsible for creating and maintaining the infrastructure that powers all of our products and services.

In this role, you will contribute to the continuous improvement of our DevOps processes, collaborate with cross-functional teams, and apply best practices for scalable, reliable, and secure systems.

Our ideal candidate has a solid technical foundation, a strong hands-on approach, and the ability to deliver results with minimal supervision.

Our Tech Stack

Cloud: AWS (EC2, S3, RDS, EKS, Elasticache, Lambda)
Container Orchestration: Kubernetes with Helm
Infrastructure as Code: Terraform + Terragrunt, Pulumi/CDK
Monitoring & Observability: Prometheus, Grafana, Elastic Stack, OpenTelemetry
CI/CD: Jenkins, GitHub Actions, ArgoCD, ArgoRollouts
Scripting: Python, Go, Bash
Version Control: GitHub
Collaboration: Jira (Agile)
Automation: N8N, AI-assisted tooling (Agentic ADK)

Your role in this journey

As a DevOps Engineer, you will be responsible for:

Implementing and maintaining infrastructure definitions using Terraform, Pulumi, or similar tools
Ensuring IaC standards are followed and contributing improvements to existing modules and patterns
Managing and monitoring AWS services, ensuring system performance, availability, and adherence to best practices
Troubleshooting production issues and participating in capacity planning
Maintaining and troubleshooting Kubernetes clusters: deploying workloads, managing configurations, scaling services, and resolving incidents to support high-availability applications
Maintaining and improving CI/CD pipelines to ensure smooth, automated software delivery
Identifying bottlenecks and implementing enhancements across Jenkins, GitHub Actions, ArgoRollouts and ArgoCD
Maintaining and extending our monitoring stack (Prometheus, Grafana)
Building dashboards, configuring alerts, and improving observability to ensure comprehensive visibility into system health and performance

Your backpack is filled with

4+ years of experience in a DevOps, SRE, or cloud engineering role with hands-on production experience
Solid working experience with AWS services (EC2, EKS, S3, RDS, Lambda) and cloud infrastructure management
Hands-on experience with Docker and Kubernetes in production environments: deploying, scaling, and troubleshooting containerized workloads
Practical experience with at least one Infrastructure as Code tool (Terraform, Pulumi, or AWS CDK)
Experience maintaining and improving CI/CD pipelines using tools like Jenkins, GitHub Actions, or ArgoCD
Proficiency in scripting with Python, Bash, or Go for operational automation
Working knowledge of monitoring and observability tools such as Prometheus, Grafana, or similar platforms
Familiarity with logging and log aggregation systems (Elastic Stack, OpenTelemetry, or similar)
Solid understanding of Linux administration, networking fundamentals, and system security basics
Strong communication skills with the ability to collaborate across teams and explain technical decisions clearly

Nice to Have

Experience with Helm charts and Kubernetes package management
Familiarity with GitOps workflows (e.g., GitHub Actions, ArgoCD, Flux)
Experience designing architectures based on AWS services
Experience with AI automation or low-code/no-code platforms such as N8N
Familiarity with prompt engineering and using AI tools to augment DevOps workflows
Exposure to cost optimization strategies for cloud infrastructure
Experience with incident response, on-call rotations, or SRE practices (SLOs, error budgets)
Experience with DevSecOps practices, such as integrating security scanning and compliance into CI/CD pipelines

Our adventure includes

Impact: Shape the future of travel with products used by millions of guests and thousands of hosts
Learning: Grow professionally in a culture that thrives on curiosity and feedback
Great People: Join a team of smart, motivated, and international colleagues who challenge and support each other
Technology: Work in a modern tech environment
Flexibility: Work in a hybrid setup with 50% in-office time for collaboration, and spend up to 8 weeks a year working from other inspiring locations
Perks on Top: Of course, we also offer travel benefits, gym discounts, and other perks to keep you energized

XML job scraping automation by YubHub

Full-time mid hybrid
Cloud, Container Orchestration, Infrastructure as Code, Monitoring & Observability, CI/CD, Scripting, Version Control, Collaboration, Automation, Helm, GitOps, AI automation, Low-code/no-code platforms, Prompt engineering, Cost optimization strategies, Incident response, SRE practices, DevSecOps practices
Engineering
Technology
Holidu Hosts GmbH
https://logos.yubhub.co/holidu.jobs.personio.com.png
Holidu is a travel technology company that provides search engines for vacation rentals.
https://holidu.jobs.personio.com
https://holidu.jobs.personio.com/job/2595036
Munich, Germany
2026-04-18

790269e4-0f2
Associate Director, Software Engineering

Join HSBC and fulfil your potential in the role of Associate Director, Software Engineering.

We are currently seeking an experienced professional to lead our software engineering team and drive practical improvement initiatives to address SDLC bottlenecks, inefficiencies, and friction points across teams.

Key responsibilities include:

Partnering with Engineering, Platform, and Risk and Control stakeholders to improve delivery flow, change quality, stability, resiliency, and operational effectiveness.
Defining and driving the adoption of DORA, SPACE, and broader engineering metrics to create visibility, support prioritisation, and improve performance outcomes.
Establishing and maintaining automated reporting to provide clear views of current performance, root-cause analysis, trends, and recommended actions.
Leading engineering and operational automation initiatives across areas such as testing, deployment, patching, recovery, and health checks.
Creating and maintaining a central engineering knowledge space and operating cadence to support governance, transparency, and continuous improvement.

To be successful, you will have 12+ years of engineering experience across the full software delivery lifecycle, with strong engineering leadership capability and hands-on experience in coding.

You will also bring proven experience across engineering excellence, DevOps, platform engineering, SRE, or software delivery improvement roles, and demonstrate strong ability to identify SDLC bottlenecks, prioritise improvement opportunities, and convert insight into practical cross-team action.

Additional requirements include:

Strong understanding of DORA metrics and good knowledge of SPACE or broader engineering productivity and developer experience measures.
Solid knowledge of software development, testing, release management, incident management, service recovery, and operational resilience practices.
Experience leading automation initiatives across testing, deployment, patching, recovery, and operational health checks.
An AI-driven mindset, with the ability to identify practical opportunities to use AI to improve engineering efficiency, analysis, decision-making, and delivery effectiveness.
Excellent analytical, communication, problem-solving, and delivery leadership skills.

You’ll achieve more when you join HSBC.

full-time senior onsite
SDLC, DORA, SPACE, engineering metrics, automated reporting, engineering and operational automation, testing, deployment, patching, recovery, health checks, central engineering knowledge space, operating cadence, governance, transparency, continuous improvement, DevOps, platform engineering, SRE, software delivery improvement, AI-driven mindset, engineering efficiency, analysis, decision-making, delivery effectiveness
Engineering
Finance
HSBC
https://logos.yubhub.co/portal.careers.hsbc.com.png
HSBC is one of the largest banking and financial services organisations in the world, with operations in 64 countries and territories.
https://portal.careers.hsbc.com
https://portal.careers.hsbc.com/careers/job/563774610662004
Pune
2026-04-18

770c5fe8-cce
Staff Security Engineer, Vulnerability Management

We are seeking a Staff Security Engineer to lead the most complex technical work in CoreWeave's Vulnerability Management program.

As a Staff Security Engineer, you will design and implement scalable triage, prioritization, and remediation-tracking systems across application, infrastructure, and hardware domains. You will set technical standards, drive high-impact initiatives, and mentor engineers through technical leadership, while partnering with leadership on priorities and execution risks.

Key Responsibilities:

Lead high-complexity VM technical initiatives and deliver architecture decisions for assigned program areas
Design and build scalable triage automation, including integrations, decision logic, and production hardening
Implement end-to-end workflow components from assessment and detection to ticket routing and remediation tracking
Provide deep technical leadership on hardware-adjacent vulnerabilities (GPU firmware, DPU firmware/BlueField, and BMC surfaces)
Act as senior technical responder for embargoed disclosures and zero-day events, coordinating with owner teams that deploy fixes
Improve prioritization logic, severity models, and exception workflows through code, design reviews, and technical proposals
Produce actionable technical metrics and risk insights for leadership consumption
Lead root-cause analysis for high-impact vulnerability incidents and implement durable technical improvements
Mentor IC3/IC4/IC5 engineers through design guidance, code review, and incident coaching
Partner with security, engineering, and operational stakeholders to improve workflow reliability and accelerate remediation outcomes

Requirements:

9+ years of relevant experience with demonstrated strategic impact in vulnerability management, application security, platform security, or cloud security engineering
Proven track record building and scaling security automation (SOAR workflows, AI/ML systems, detection pipelines) in production environments
Deep subject matter expertise with vulnerability management best practices: CVSS, EPSS, CISA KEV, threat intelligence integration, and risk-based prioritization frameworks
Excellent development background with strong coding skills in Python, Go, or similar languages for building scalable, production-grade security systems
Significant experience with modern vulnerability management tooling (for example Wiz, Semgrep, Rapid7, Tenable, or equivalent)
Experience with specialized infrastructure: GPU/DPU environments, firmware security, hardware vulnerabilities, or high-performance computing
Demonstrated track record mentoring engineers across levels and driving cross-functional technical initiatives at organizational scale
Strong business acumen and understanding of how security decisions impact engineering velocity, customer trust, and business outcomes

Preferred Qualifications:

Practical experience building AI/ML-powered security systems (LLM integration, automated decision-making, human-in-the-loop validation) in production
Experience managing hardware vendor security partnerships (embargoed disclosures and pre-release collaboration)
Production experience with security automation platforms such as TINES and serverless frameworks (AWS Lambda, GCP Cloud Functions)
Strong DevOps, DevSecOps, or SRE background with deep experience in AWS/GCP/Azure cloud services and Infrastructure as Code (Terraform, CloudFormation)
Deep understanding of Kubernetes security (container scanning, admission controllers, supply chain security, runtime protection)
Experience leading security programs through rapid hypergrowth (10x+ infrastructure scaling) in startup or cloud-native environments
Practical experience managing vulnerabilities within a FedRAMP-certified environment or similar regulatory frameworks

Salary and Benefits: The base salary range for this role is $188,000 to $275,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).

Work Environment:

While we prioritize a hybrid work environment, remote work may be considered for candidates located more than 30 miles from an office, based on role requirements for specialized skill sets. New hires will be invited to attend onboarding at one of our hubs within their first month. Teams also gather quarterly to support collaboration.

full-time staff hybrid
$188,000 to $275,000
vulnerability management, application security, platform security, cloud security engineering, security automation, AI/ML systems, detection pipelines, Python, Go, modern vulnerability management tooling, GPU/DPU environments, firmware security, hardware vulnerabilities, high-performance computing, AI/ML-powered security systems, LLM integration, automated decision-making, human-in-the-loop validation, security automation platforms, TINES, serverless frameworks, AWS Lambda, GCP Cloud Functions, DevOps, DevSecOps, SRE, Kubernetes security, container scanning, admission controllers, supply chain security, runtime protection
Engineering
Technology
CoreWeave
https://logos.yubhub.co/coreweave.com.png
CoreWeave is a cloud computing company that provides a platform for building and scaling AI applications.
https://www.coreweave.com
https://job-boards.greenhouse.io/coreweave/jobs/4653130006
Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA
2026-04-18

7bc4518a-7e3
AI Applications Ops Lead, GPS

Role Overview

Scale's rapidly growing Global Public Sector team is focused on using AI to address critical challenges facing the public sector around the world.

Our core work consists of creating custom AI applications that will impact millions of citizens, generating high-quality training data for national LLMs, and upskilling and advisory services to spread the impact of AI.

As a Production AI Ops Lead, you will design and develop the production lifecycle of full-stack AI applications, while supporting end-to-end system reliability, real-time inference observability, sovereign data orchestration, high-security software integration, and the resilient cloud infrastructure required for our international government partners.

Responsibilities

Own the production outcome: Take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies.

Ensure Full-Stack integrity: Oversee the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components, from APIs to UI, to maintain a responsive and production-ready environment.

Scale the feedback loop: Build automated systems to monitor model performance and data drift across geographically dispersed environments, ensuring the right levels of reliability.

Navigate global compliance: Manage the technical lifecycle within diverse regulatory frameworks.

Incident command: Lead the response for production issues in mission-critical environments, ensuring rapid resolution and building the guardrails to prevent them from happening again.

Bridge the gap: Translate deep technical performance metrics into clear insights for senior international government officials.

Drive product evolution: Partner with our Engineering and ML teams to ensure the lessons learned in the field directly influence the technical architecture and decisions of future use cases.

Ideal Candidate

Experience: 6+ years in a high-impact technical role (SRE, FDE or MLOps) with experience in the public sector.

Global perspective: Familiarity with international government security standards and the complexities of deploying sovereign AI.

System architecture proficiency: Proven experience maintaining production-grade applications with a deep understanding of the full request lifecycle, connecting frontend/API layers to the backend and AI core.

Modern AI Stack expertise: Proficiency in coding and the modern AI infrastructure, including Kubernetes, vector databases, agentic development, and LLM observability tools.

Ownership: You treat every production deployment as your own. You race toward solving hard problems before the customer even sees them.

Reliability: You understand that in the public sector, a model failure may be a risk to public safety or privacy.

Customer communication: The ability to explain to a high-ranking official why the performance of the system has degraded and how we are fixing it.

About Us

At Scale, our mission is to develop reliable AI systems for the world's most important decisions. Our products provide the high-quality data and full-stack technologies that power the world's leading models, and help enterprises and governments build, deploy, and oversee AI applications that deliver real impact.

full-time senior hybrid
Kubernetes, Vector databases, Agentic development, LLM observability tools, SRE, FDE, MLOps
Engineering
Technology
Scale
https://logos.yubhub.co/scale.com.png
Scale develops reliable AI systems for the world's most important decisions.
https://scale.com/
https://job-boards.greenhouse.io/scaleai/jobs/4654510005
Doha, Qatar; London, UK
2026-04-18

ca221b6f-dca
Technical Program Manager, Safeguards (Infrastructure & Evals)

About the Role

Safeguards Engineering builds and operates the infrastructure that keeps Anthropic's AI systems safe in production. As a Technical Program Manager for Safeguards Infrastructure and Evals, you'll own the operational health and forward momentum of this stack.

Your primary responsibility is driving reliability: owning the incident-response and post-mortem process, ensuring SLOs are defined and met in partnership with various teams, and making sure that when things go wrong, the right people know, the right actions get taken, and those actions actually get closed out.

Alongside that ongoing operational rhythm, you'll coordinate the larger platform investments: migrations, eval-platform improvements, and the cross-team dependencies that connect them.

This role sits at the intersection of operations and program management. It requires genuine technical depth: you need to understand how these systems work well enough to triage effectively, judge what's actually safety-critical versus what can wait, and have informed conversations with the engineers building and maintaining them.

But the core of the job is keeping the machine running well and the work moving.

Responsibilities

Own the Safeguards Engineering ops review
Drive the recurring cadence that keeps the team informed and coordinated: surfacing recent incidents and failures, bringing visibility to reliability trends, and making sure the right people are in the room when decisions need to be made.
Drive incident tracking and post-mortem execution
Establish and maintain SLOs with partner teams
Maintain runbook quality and incident-ownership clarity
Drive platform migrations and infrastructure projects
Coordinate evals platform improvements

Requirements

Solid technical program management experience, particularly in operational or infrastructure-heavy environments
Understanding of how production ML systems work well enough to triage incidents intelligently and have substantive conversations with engineers about what's going wrong and why
Ability to work effectively across team boundaries
Experience with or strong interest in AI safety

Nice to Have

Experience with SRE practices, incident management frameworks, or on-call operations at scale
Familiarity with monitoring and alerting tooling (PagerDuty, Datadog, or equivalents)
Experience driving infrastructure migrations in complex, multi-team environments

full-time senior hybrid
$290,000-$365,000 USD
Technical Program Management, Operational or Infrastructure-heavy Environments, Production ML Systems, Incident Tracking and Post-Mortem Execution, Service-Level Objectives (SLOs), Runbook Quality and Incident-Ownership Clarity, Platform Migrations and Infrastructure Projects, Evals Platform Improvements, SRE Practices, Incident Management Frameworks, On-Call Operations at Scale, Monitoring and Alerting Tooling, Infrastructure Migrations in Complex, Multi-Team Environments
Engineering
Technology
Anthropic
https://logos.yubhub.co/anthropic.ai.png
Anthropic develops artificial intelligence systems. It has a growing team of researchers, engineers, and business leaders.
https://anthropic.ai/
https://job-boards.greenhouse.io/anthropic/jobs/5108695008
San Francisco, CA | New York City, NY | Seattle, WA
2026-04-18

491db8e9-776
Staff Site Reliability Engineer - Splunk Expert

We are seeking a highly technical Staff Site Reliability Engineer with deep expertise in Splunk and Grafana to own and evolve our observability ecosystem.

As a Staff Site Reliability Engineer, you will move beyond simple monitoring to architect a comprehensive, scalable telemetry platform. You will be our subject-matter expert in Splunk optimisation, ensuring our logging architecture is performant, cost-effective, and deeply integrated with our automated workflows.

Key responsibilities include:

Splunk Architecture & Optimisation: Lead the design and tuning of Splunk environments. Optimise indexer performance, search efficiency, and data models to ensure rapid troubleshooting and cost-efficiency.

Advanced Visualisation: Architect and maintain sophisticated Grafana dashboards that correlate disparate data sources into a single pane of glass for real-time system health.

Automated Infrastructure: Design, build, and maintain scalable observability infrastructure using tools like Terraform.

Pipeline Engineering: Optimise the collection, processing, and storage of telemetry data (Metrics, Logs, Traces) to ensure high reliability and low latency.

Workflow Automation: Develop custom Splunk workflows and integrations that trigger automated responses to system events, reducing Mean Time to Resolution (MTTR).

Incident Response: Participate in on-call rotations and lead post-incident reviews to drive systemic improvements through 'observability-driven development.'

Required skills and experience include:

Splunk Mastery: Deep, hands-on experience with Splunk administration, search optimisation (SPL), and architecting complex data pipelines.

Grafana Expertise: Proven ability to build actionable, intuitive dashboards in Grafana that go beyond simple charts to provide deep operational insights.

SRE Mindset: 8+ years of experience in an SRE, DevOps, or Systems Engineering role with a focus on high-availability systems.

Programming Proficiency: Strong coding skills in Go, Python, or Ruby for building internal tools and automating observability workflows.

Telemetry Standards: Hands-on experience with OpenTelemetry (OTel), Prometheus, or similar frameworks for instrumenting applications.

Distributed Systems: Deep understanding of Linux internals, networking (TCP/IP, DNS, Load Balancing), and container orchestration (Kubernetes/EKS).

Bonus skills include:

Tracing: Implementation of distributed tracing (Jaeger, Tempo, or Honeycomb) to visualise request flow across microservices.

Security Observability: Experience using Splunk for security orchestration (SOAR) or SIEM-related workflows.

Cloud Platforms: Experience managing native observability tools within AWS, Azure, or GCP.

full-time staff hybrid
Splunk, Grafana, SRE, Go, Python, Ruby, OpenTelemetry, Prometheus, Linux, Networking, Container Orchestration, Tracing, Security Observability, Cloud Platforms
Engineering
Technology
Okta
https://logos.yubhub.co/okta.com.png
Okta is a publicly traded software company that specialises in identity and access management.
https://www.okta.com/
https://job-boards.greenhouse.io/okta/jobs/6874616
Bengaluru, India
2026-04-18

6f3a053e-c43
Staff Software Engineer, AI Reliability Engineering

We're seeking a Staff Software Engineer to join our AI Reliability Engineering team. As a key member of our team, you will develop Service Level Objectives for large language model serving systems, design and implement monitoring and observability systems, and lead incident response for critical AI services.

You will work closely with teams across Anthropic to improve reliability across our most critical serving paths. You will be responsible for making the systems that deliver Claude more robust and resilient, whether during an incident or collaborating on projects.

To be successful in this role, you should have a strong background in distributed systems, infrastructure, or reliability. You should be curious and brave, comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don't have deep expertise yet.

You will be working on high-availability serving infrastructure across multiple regions and cloud providers. You will support the reliability of safeguard model serving, which is critical for both site reliability and Anthropic's safety commitments.

If you're committed to creating reliable, interpretable, and steerable AI systems, and you're passionate about working on complex technical problems, we'd love to hear from you.

full-time staff hybrid
€235.000-€295.000 EUR
distributed systems, infrastructure, reliability, Service Level Objectives, monitoring, observability, incident response, high-availability serving infrastructure, cloud providers, SRE, Production Engineer, chaos engineering, systematic resilience testing, AI-specific observability tools and frameworks, ML hardware accelerators, RDMA, InfiniBand
Engineering
Technology
Anthropic
https://logos.yubhub.co/anthropic.com.png
Anthropic is a public benefit corporation that creates reliable, interpretable, and steerable AI systems.
https://www.anthropic.com/
https://job-boards.greenhouse.io/anthropic/jobs/5101169008
Dublin, IE
2026-04-18

ac14f361-5b8
Network Engineer, Capacity and Efficiency

We're looking for a network engineer who thinks in metrics first. You will use deep networking knowledge and rigorous measurement to figure out where and how bandwidth, latency, and dollars are being used, find optimization opportunities and land them.

You will instrument spine-leaf fabrics, BGP, SDN overlays, and cloud interconnect products well enough to build them. You'll own the observability and efficiency surface for Anthropic's network: from per-flow telemetry on backbone routers, to QoS policy on cross-region links carrying inference traffic, to cost attribution that tells a research team exactly what their checkpoint sync is costing.

This is a hands-on IC role. You'll write code (Python, Go), build dashboards, model capacity, and ship config changes to production routers. You'll also influence architecture: when the data says a traffic pattern is pathological, you'll be in the room root causing it and fixing it.

You will be working across three areas: network telemetry and observability, traffic engineering, and cost modeling and attribution. We expect you to be strong in at least two and willing to grow into the third.

full-time senior hybrid
BGP, ECMP, VXLAN/EVPN, QoS, L1/optical basics, CSP networking model, network telemetry, flow export, eBPF-based host-side instrumentation, Python, Go, SRE experience for large-scale network infrastructure, cloud provider's networking team or a cloud networking product team, AI/ML infrastructure traffic patterns, HPC fabrics, traffic engineering for large backbones
Engineering
Technology
Anthropic
https://logos.yubhub.co/anthropic.com.png
Anthropic is a technology company that creates reliable, interpretable, and steerable AI systems.
https://anthropic.com
https://job-boards.greenhouse.io/anthropic/jobs/5177143008
San Francisco, CA | New York City, NY
2026-04-18

ebf95cea-76b
Technical Escalation Manager

As a Technical Escalation Manager at Databricks, you will be responsible for coordinating efforts to resolve critical customer issues, customer-impacting situations, and major incidents. You will work with multiple internal teams (engineering, product management, Customer Success Engineering, and Support) and external partners to effectively resolve these customer-impacting situations.

Your key responsibilities will include:

Managing support escalation in partnership with engineering, product management, Customer Success Engineering, Support, Customers, and Partners until resolution.
Achieving customer satisfaction by ensuring incidents or escalations (and related cases) are well and fully documented with the timely execution of action items.
Creating and executing a data-driven customer recovery plan for every escalation and incident that is addressed.
Utilizing business and technical skills to manage customer escalations, coordinate meetings and deliverables, and analyze trends and patterns for reporting purposes.
Using data, metrics, and feedback to inform operational and tactical decisions that improve incident and escalation management.
Coordinating all necessary resources to fast-track and resolve new incidents and escalations from customers with a clear and detailed plan.

We are looking for a candidate with 8+ years of experience in customer support, escalation, SRE, or incident management. You should have excellent contextual interpretation and writing skills, as well as the ability to effectively summarize and communicate to both technical and business audiences.

You will also need experience with distributed big data computing environments, SQL-based databases, and data warehousing and ETL technologies such as Informatica, DataStage, Oracle, Teradata, SQL Server, and MySQL. Linux/Unix administration skills, networking knowledge, and hands-on cloud experience with AWS, Azure, or GCP are required.

Experience working cross-functionally with support, engineering, product management, and customers, along with the ability to deeply understand product and customer personas, is also essential.

A Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related engineering field is preferred. Written and spoken proficiency in both Japanese and English is also required.

full-time senior remote
customer support, escalation, SRE, incident management, distributed big data computing, SQL-based databases, data warehousing, ETL technologies, Linux/Unix administration, networking, cloud experience
Engineering
Technology
Databricks
https://logos.yubhub.co/databricks.com.png
Databricks builds and operates the world's leading data and AI infrastructure platform, enabling customers to leverage deep data insights and enhance their business.
https://databricks.com/
https://job-boards.greenhouse.io/databricks/jobs/8407911002
Japan
2026-04-18

b5ce114e-dac
Cloud Engineer – Factory Systems and Operational Technology

Anduril Industries is a defence technology company with a mission to transform U.S. and allied military capabilities with advanced technology. By bringing the expertise, technology and business model of the 21st century's most innovative companies to the defence industry, Anduril is changing how military systems are designed, built and sold.

The company's family of systems is powered by Lattice OS, an AI-powered operating system that turns thousands of data streams into a real-time, 3D command and control centre.

As the world enters an era of strategic competition, Anduril is committed to bringing cutting-edge autonomy, AI, computer vision, sensor fusion and networking technology to the military in months, not years.

We are seeking a mission-driven Cloud Infrastructure Engineer to take a leading role in designing and implementing world-class defensive controls. This is a high-impact role with the autonomy to shape security architecture and protect the technology that is changing the future of defence.

Key Responsibilities:

Design and Own Security Architecture: Architect, build and deploy robust, scalable security controls for our corporate, development and production cloud environments (AWS, Azure, GCP).

Automate Everything: Develop and automate infrastructure-as-code (IaC) to manage and scale our cloud deployments securely and efficiently.

Proactively Defend: Continuously monitor, identify and remediate security weaknesses and configuration drift across our entire cloud footprint.

Be a Force Multiplier: Partner with infrastructure, application and product teams to embed security best practices into their workflows and secure environments holding mission-critical data.

Enable Scale and Reliability: Engineer systems and processes that ensure our platforms are highly available, resilient and prepared for rapid growth.

Serve as a Cloud Security Expert: Act as the go-to subject matter expert for teams across Anduril, providing guidance, mentorship and paved-road solutions for building securely in the cloud.

Requirements:

Proven experience building and securing complex cloud environments, typically gained through 3+ years in a Cloud Security, DevOps or SRE role.

Deep proficiency in at least one major cloud provider (AWS, Azure or GCP).

Strong hands-on experience with Infrastructure as Code (e.g., Terraform, CloudFormation, Bicep).

Solid programming/scripting ability in one or more languages (e.g., Python, Go, Rust).

Firm understanding of public cloud networking principles (e.g., VPCs, subnets, routing, security groups).

Must be a U.S. Person and eligible to obtain and maintain a U.S. Top Secret security clearance.

Preferred Qualifications:

Experience hardening and monitoring Kubernetes clusters (EKS, GKE, AKS).

Experience with cloud security posture management (CSPM) or threat detection tooling.

Familiarity with CI/CD pipelines and securing the software supply chain.

Knowledge of compliance frameworks such as FedRAMP, MRL, SOC 2 or CMMC.

On-premises network engineering experience.

full-time senior onsite $129,000-$193,000 USD Cloud Security, DevOps, SRE, Infrastructure as Code, Terraform, CloudFormation, Bicep, Python, Go, Rust, Public Cloud Networking, VPCs, Subnets, Routing, Security Groups, Kubernetes, Cloud Security Posture Management, Threat Detection Tooling, CI/CD Pipelines, Software Supply Chain Security, Compliance Frameworks, FedRAMP, MRL, SOC 2, CMMC, On-Premises Network Engineering Engineering Technology Anduril Industries https://logos.yubhub.co/anduril.com.png Anduril Industries is a defence technology company that designs, builds and sells advanced military systems. https://www.anduril.com/ https://job-boards.greenhouse.io/andurilindustries/jobs/5087348007 Costa Mesa, California, United States 2026-04-18

9e667b9c-eb8 Senior Security Engineer II, Vulnerability Management We are seeking a Senior Security Engineer to build the Vulnerability Management program protecting CoreWeave's AI infrastructure. You will architect intelligent automation systems that defend the GPU clusters powering breakthrough AI research and enterprise AI applications.

This role combines technical depth, strategic thinking, and the autonomy to design workflows that will protect infrastructure driving the future of AI.

Key Responsibilities:

Build and scale AI-powered triage workflows: evaluate tools (LLM integration, TINES orchestration), architect solutions, and deploy to production
Drive intelligent, risk-based vulnerability prioritization while simultaneously training AI models; your assessments become the foundation for automation
Influence automation priorities: recommend which areas of the vulnerability pipeline would most benefit from automation to improve team efficiency
Design and implement automated detection-to-ticket pipelines: build workflows that generate vulnerability detections, test them, scale across the environment, and auto-create Jira tickets
Execute remediation campaigns: build automated workflows for EOL product removal, vulnerable software upgrades, and OS migrations at scale
Manage embargoed vendor disclosures from hardware partners, including embargo verification and zero-day response coordination
Lead security incident investigations related to high-profile vulnerabilities, coordinating cross-functional response and impact assessment
Participate in on-call rotation for rapid-response vulnerability analysis during active zero-day events or critical security incidents
Partner with IT, Infrastructure, and Engineering teams to drive remediation efforts, enforce SLAs, and escalate blockers strategically
Write daily operations reports documenting vulnerability trends, remediation velocity, and emerging threats for security leadership
Drive process improvements and workflow automation to improve operational efficiency and reduce manual toil

Requirements:

7+ years of relevant experience with demonstrated impact in vulnerability management, application security, platform security, or cloud security engineering
Bachelor’s or Master’s degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent practical experience
Proven hands-on experience building security automation (SOAR workflows, detection pipelines, or vulnerability prioritization frameworks)
Deep subject matter expertise with vulnerability management best practices: CVSS, EPSS, CISA KEV, exploit intelligence, and compensating controls
Strong development background with proficiency in Python, Go, or similar languages for building production-grade security tools
Experience with modern vulnerability management tooling such as Wiz, Semgrep, Rapid7, or similar platforms
Demonstrated ability to partner with cross-functional teams (IT, SRE, Engineering) to drive remediation without formal authority
Strong familiarity with common security vulnerabilities and the ability to judge their severity and business impact

Preferred Qualifications:

Practical experience building AI/ML-powered security workflows (LLM integration, automated triage, human-in-the-loop validation)
Experience managing hardware security vulnerabilities (GPU/DPU firmware, BMC/IPMI, specialized compute environments)
Production experience with security automation platforms such as TINES, Splunk SOAR, or serverless frameworks (AWS Lambda)
Strong DevOps, DevSecOps, or SRE background with experience in AWS/GCP/Azure cloud services and Infrastructure as Code (Terraform, CloudFormation)
Deep understanding of container security and Kubernetes (image scanning, admission control, runtime protection, supply chain security)
Experience supporting customer audits (SOC 2, ISO 27001, FedRAMP) with vulnerability evidence and control validation
Experience integrating vulnerability management into modern CI/CD pipelines with a 'shift-left' mentality

What We Offer:

The base salary range for this role is $165,000 to $242,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).

The range we’ve posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate, which reflects a variety of factors including qualifications, experience, interview performance, and location.

full-time senior hybrid $165,000 to $242,000 vulnerability management, application security, platform security, cloud security engineering, Python, Go, security automation, SOAR workflows, detection pipelines, vulnerability prioritization frameworks, CVSS, EPSS, CISA KEV, exploit intelligence, compensating controls, Wiz, Semgrep, Rapid7, AI/ML-powered security workflows, hardware security vulnerabilities, security automation platforms, DevOps, DevSecOps, SRE, container security, Kubernetes, customer audits, CI/CD pipelines Engineering Technology CoreWeave https://logos.yubhub.co/coreweave.com.png CoreWeave is a cloud computing company that provides a platform for AI development and deployment. https://www.coreweave.com https://job-boards.greenhouse.io/coreweave/jobs/4650290006 Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA 2026-04-18

f516f0ef-a2d Senior Site Reliability Engineer (Auth0) Secure Every Identity, from AI to Human

Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organisations to safely embrace this new era.

This work requires a relentless drive to solve complex challenges with real-world stakes. We are looking for builders and owners who operate with speed and urgency and execute with excellence. This is an opportunity to do career-defining work. We're all in on this mission.

As a Senior Site Reliability Engineer, you'll join our SRE team based in Europe to ensure our production systems are not only operational but also resilient, scalable, and ready for exponential growth. This isn't just about keeping the lights on; it's about directly contributing to the platform's core resiliency and robustness.

You'll be a hands-on builder, crafting solutions that make our system more reliable by design.

Key Responsibilities:

Design and build custom software in Go to enhance the platform's reliability, resiliency, and redundancy.
Partner with engineering teams to embed reliability principles, improving the availability, performance, and observability of our services.
Use your deep understanding of infrastructure and observability principles to identify opportunities for improvement within the product and implement solutions.
Contribute to our on-call rotation, providing rapid, effective response to critical incidents and using your expertise to troubleshoot, mitigate or accurately escalate production issues.
Develop and refine our SRE tooling and processes, focusing on automation and operational efficiency.
Define, document, and champion reliability best practices across the organisation.

Requirements:

A proactive and systematic approach to problem-solving, with a high degree of ownership.
Proven experience in a production environment supporting large-scale, mission-critical applications with a high degree of autonomy.
Proficiency in at least one programming language, with a preference for Go. You should be comfortable writing custom applications, not just scripts.
Experience with infrastructure as code (Terraform), container orchestration (Kubernetes, Docker) and GitOps (ArgoCD).
Demonstrable expertise in a major cloud provider (Azure, AWS, or GCP).
A strong grasp of microservices architecture, databases (SQL, NoSQL), and networking fundamentals, so you can understand how custom code can solve platform-level issues.
An understanding of core SRE principles, including SLIs, SLOs, and error budgets.
Experience in an on-call rotation for a 24/7 cloud-based environment.
Exceptional communication and collaboration skills, with a proven ability to work effectively in a remote, distributed team, where tasks may be self-driven.

We're looking for someone who is not just looking for a job, but a career-defining opportunity to tackle complex challenges at a massive scale. If you're a curious and motivated engineer who's passionate about building reliability directly into the platform, we'd love to hear from you.

full-time senior hybrid $136,000-$187,000 CAD Go, Terraform, Kubernetes, Docker, GitOps, Cloud provider (Azure, AWS, or GCP), Microservices architecture, Databases (SQL, NoSQL), Networking fundamentals, Core SRE principles (SLIs, SLOs, error budgets) Engineering Technology Okta https://logos.yubhub.co/okta.com.png Okta provides an unparalleled authentication experience for hundreds of millions of users worldwide. https://www.okta.com/ https://job-boards.greenhouse.io/okta/jobs/7791590 Toronto, Ontario, Canada 2026-04-18

fd1da18e-84d Principal Software Engineer II - Observability We're looking for a Principal Software Engineer to join the Observability Experience Team as one of the Tech Leads. As part of this team, you will work at the intersection of big data engineering, backend architecture, and experiences to help users obtain the best insights from their Observability signals, especially logs, metrics, and traces.

Key responsibilities include collaborating with product management, product design, and multiple teams across Elastic to define and evolve the end-to-end experiences for Observability. You will also be a contact point for other teams within Elastic, providing hands-on support and guidance. Additionally, you will help the team define coding practices and standards, foster a culture of mutual respect, collaboration, and consensus-based decision-making, and stay true to the principles of software development as adopted by the team.

The ideal candidate will have experience leading technical projects in the data and enterprise architecture areas, with proven knowledge of building and running sophisticated technical infrastructures and engineering sound software systems. They should also have hands-on experience using and developing Observability tools, preferably in the Logs space, and experience mentoring expert engineers and providing technical and professional guidance. Furthermore, they should be able to define a long-term technical vision for an area of a data-intensive application, working across teams and organizations to collaboratively build the technical roadmap.

Bonus points for experience as a user of the Elastic Stack and experience in SRE roles.

full-time senior remote Observability tools, Logs space, Big data engineering, Backend architecture, Experiences, Elastic Stack, SRE roles Engineering Technology Elastic, the Search AI Company https://logos.yubhub.co/elastic.co.png Elastic enables everyone to find the answers they need in real time, using all their data, at scale. The Elastic Search AI Platform is used by more than 50% of the Fortune 500. https://www.elastic.co/ https://job-boards.greenhouse.io/elastic/jobs/7635297 Greece 2026-04-18

da7679a6-e4f Senior Technical Operations Lead

We are seeking an experienced Senior Technical Operations Lead to drive operational excellence across our Infrastructure Engineering organization.

As a Senior Technical Operations Lead, you will design and implement world-class operational processes, establish SRE best practices, and mentor technical teams to achieve exceptional reliability and efficiency.

Key Responsibilities:

SRE Leadership & Transformation

Lead the design and implementation of SRE practices and tooling across Infrastructure Engineering

Establish and cultivate an SRE-focused culture at ZoomInfo

Operational Process Design & Governance

Establish clear governance frameworks and procedural consistency

Make decisions about process exceptions and/or changes to accommodate different team contexts

Design and/or implement process automations using scripts and integrations

Define functional requirements and goals for process automations

Conduct hands-on and/or automated audits to ensure process adherence and identify improvement opportunities

Incident Management & Root Cause Analysis

Design, implement, and continuously improve Incident Management and Change Management procedures that scale across the organization, using tools such as PagerDuty, Slack, Jira, ServiceNow, and custom integrations

Lead and participate in root cause analysis sessions, driving teams toward systemic improvements rather than blame

Design and execute incident dry runs and tabletop exercises to build organizational resilience

Establish metrics and KPIs that measure incident response effectiveness and drive continuous improvement

Enable Data-Driven Decision Making

Identify, define, and automate the tracking of operational KPIs and departmental metrics that matter, enabling senior managers to make informed decisions on the basis of data

Build and maintain metric dashboards and automated reporting systems that provide real-time visibility into operational health

Analyze trends and surface opportunities for optimization

Stakeholder Engagement, Training & Mentorship

Build and maintain strong relationships with Engineering managers, Product Managers, and cross-functional stakeholders across geographies

Maintain a feedback loop. Meet with stakeholders to understand process pain points.

Influence others by fostering trust, leading by example, and inspiring them with your expertise and passion for reliability practices.

Enhance internal knowledge of third-party tools such as PagerDuty and Datadog by educating ZoomInfo employees on these tools.

Deliver training sessions that make Operational Excellence engaging and motivating for diverse audiences.

Required Experience & Qualifications:

Bachelor’s degree in Software Engineering, Operations Management, or related field

7+ years of hands-on experience in technical operations, Site Reliability Engineering (SRE), Incident Management, or IT Service Management roles within SaaS or technical organizations

Fluent English proficiency (written and verbal)

Proven track record designing and implementing operational processes at scale

Demonstrated expertise in SRE principles, practices, and tooling

Strong data analysis skills with ability to define metrics, build or design dashboards, and use data to drive strategic decisions

Proven ability to work effectively in a matrix organizational structure

Ability and experience working with senior management at global organizations

Hands-on experience with monitoring and observability tools such as PagerDuty and/or Datadog

Familiarity with Jira, Confluence, Google Data Studio, or Tableau

Experience with scripting and integrations (Python, JavaScript, Google Apps Script, or similar)

Background in SRE transformation or organizational process improvement initiatives

full-time senior hybrid Site Reliability Engineering (SRE), Technical Operations, Incident Management, IT Service Management, Monitoring and Observability Tools, Jira, Confluence, Google Data Studio, Tableau, Scripting and Integrations, Python, JavaScript, Google Apps Script Engineering Technology ZoomInfo https://logos.yubhub.co/zoominfo.com.png ZoomInfo is a technology company that provides a go-to-market intelligence platform. It has over 35,000 customers worldwide. https://www.zoominfo.com/ https://job-boards.greenhouse.io/zoominfo/jobs/8451386002 Ra'anana, Israel 2026-04-18

81e928a2-c9f Senior Site Reliability Engineer (Auth0) Secure Every Identity

We are looking for a Senior Site Reliability Engineer to join our SRE team based in Europe. As a Senior Site Reliability Engineer, you'll ensure our production systems are not only operational but also resilient, scalable, and ready for exponential growth.

This isn't just about keeping the lights on; it's about directly contributing to the platform's core resiliency and robustness. You'll be a hands-on builder, crafting solutions that make our system more reliable by design.

Responsibilities

Design and build custom software in Go to enhance the platform's reliability, resiliency, and redundancy.
Partner with engineering teams to embed reliability principles, improving the availability, performance, and observability of our services.
Use your deep understanding of infrastructure and observability principles to identify opportunities for improvement within the product and implement solutions.
Contribute to our on-call rotation, providing rapid, effective response to critical incidents and using your expertise to troubleshoot, mitigate or accurately escalate production issues.
Develop and refine our SRE tooling and processes, focusing on automation and operational efficiency.
Define, document, and champion reliability best practices across the organisation.

What you'll need to be successful

This role requires a unique blend of a software engineer's mindset and operational expertise. You'll thrive in this role if you have:

A proactive and systematic approach to problem-solving, with a high degree of ownership.
Proven experience in a production environment supporting large-scale, mission-critical applications with a high degree of autonomy.
Proficiency in at least one programming language, with a preference for Go. You should be comfortable writing custom applications, not just scripts.
Experience with infrastructure as code (Terraform), container orchestration (Kubernetes, Docker) and GitOps (ArgoCD).
Demonstrable expertise in a major cloud provider (Azure, AWS, or GCP).
A strong grasp of microservices architecture, databases (SQL, NoSQL), and networking fundamentals, so you can understand how custom code can solve platform-level issues.
An understanding of core SRE principles, including SLIs, SLOs, and error budgets.
Experience in an on-call rotation for a 24/7 cloud-based environment.
Exceptional communication and collaboration skills, with a proven ability to work effectively in a remote, distributed team, where tasks may be self-driven.

The Okta Experience

Supporting Your Well-Being
Driving Social Impact
Developing Talent and Fostering Connection + Community

full-time senior remote Go, Terraform, Kubernetes, Docker, GitOps, Cloud provider (Azure, AWS, or GCP), Microservices architecture, Databases (SQL, NoSQL), Networking fundamentals, Core SRE principles (SLIs, SLOs, error budgets) Engineering Technology Okta https://logos.yubhub.co/okta.com.png Okta provides an unparalleled authentication experience for hundreds of millions of users worldwide. It is a large technology company. https://www.okta.com/ https://job-boards.greenhouse.io/okta/jobs/7418982 Barcelona, Spain 2026-04-18

9cd0420a-99d Network Engineer, Capacity and Efficiency About the Role

We're looking for a network engineer who thinks in metrics first. You will use deep networking knowledge and rigorous measurement to figure out where and how bandwidth, latency, and dollars are being used, find optimization opportunities, and land them.

Responsibilities

Build the network observability stack. Design and deploy telemetry pipelines (sFlow/IPFIX, gNMI streaming, eBPF host probes) that turn packet counters into per-flow, per-tenant, per-workload cost and utilization data. Own the SLIs for backbone and DCN fabric health.
Hunt for efficiency. Analyze inter-region traffic patterns, identify hot links and stranded capacity, and quantify the dollar impact. Build the models that tell us whether we should buy more capacity, or move the workload.
Own QoS and traffic engineering. Design and operate traffic classification, marking, and shaping across the backbone. Make sure bulk checkpoint transfers don’t starve latency-sensitive inference, and that we’re not paying premium cross-region rates for traffic that could take the cheap path.
Drive cost attribution. Tie network spend (egress, interconnect ports, transit, optical leases) back to the teams and workloads that generate it. Make network cost a first-class input to capacity planning and workload placement decisions.
Influence decisions you don't own. A large fraction of this role is convincing other teams to act on what your data shows: making the case to research that a traffic pattern needs to change, to finance that an interconnect tranche is worth buying, to Systems Networking that a QoS policy needs rewriting.

Requirements

Have 5+ years operating large-scale production networks: data center fabrics (spine-leaf, Clos), backbone/WAN, or hyperscaler-adjacent environments.
Are genuinely fluent across the stack: BGP (including policy and communities), ECMP, VXLAN/EVPN or equivalent overlays, QoS (DSCP, queuing, shaping), and L1/optical basics (DWDM, coherent, LAGs).
Know at least one major CSP’s networking model deeply, whether AWS (VPC, TGW, Direct Connect, Gateway Load Balancer) or GCP (Shared VPC, Interconnect, Cloud Router, Network Connectivity Center), and understand how their overlays interact with physical underlays.
Have built or operated network telemetry at scale: streaming telemetry (gNMI/OpenConfig), flow export (sFlow, IPFIX, NetFlow), or eBPF-based host-side instrumentation. You can reason about sampling, cardinality, and storage tradeoffs.
Are comfortable writing Python or Go to build tooling, telemetry pipelines, infrastructure-as-code, config management for network devices, and automation that you’ll ship to production.
Think quantitatively by default. You reach for a notebook or a Grafana query before you reach for an opinion, and you can turn messy counter data into a defensible cost model.
Communicate crisply. You can explain to a finance partner why a 10% egress reduction matters, and to a network engineer why a specific ECMP imbalance is costing real money.

Nice to Have

SRE experience for large-scale network infrastructure: designing for reliability, defining SLOs/SLIs for network services, capacity planning with error budgets, and incident response for network-impacting outages at scale.
Background on a cloud provider's networking team or a cloud networking product team, building or operating the interconnect, backbone, or SDN control plane from the provider side, not just consuming it as a customer.
Familiarity with AI/ML infrastructure traffic patterns like collective communication (all-reduce, all-gather), checkpoint/weight transfer, and inference serving, and how these stress networks differently than traditional workloads in terms of burst behavior, flow synchronization, and bandwidth symmetry.
Experience with HPC fabrics like InfiniBand, RoCE v2, lossless Ethernet, or custom high-radix topologies, and an understanding of how job placement, congestion management, and adaptive routing interact at scale.
Background in traffic engineering for large backbones and the operational judgment to know when TE is worth the complexity.
Hands-on time with multi-cloud connectivity: cross-cloud peering, private interconnect products, and the billing models that come with them.
Experience building cost/chargeback systems for shared infrastructure, or FinOps exposure in a large cloud environment.

Representative Projects

Build a per-flow cost attribution pipeline that traces every byte of cross-region egress back to the team and workload that generated it
Design QoS policy for the private backbone that prevents bulk checkpoint transfers from starving inference traffic
Model whether it's cheaper to buy an additional 1.6Tb interconnect tranche or to re-route traffic through existing capacity
Instrument DCN fabric utilization with streaming telemetry and build the Grafana dashboards that become the team's source of truth for network observability

full-time senior onsite network engineering, network observability, telemetry pipelines, sFlow/IPFIX, gNMI streaming, eBPF host probes, BGP, ECMP, VXLAN/EVPN, QoS, DSCP, queuing, shaping, L1/optical basics, DWDM, coherent, LAGs, AWS, GCP, cloud networking, infrastructure-as-code, config management, automation, Python, Go, quantitative analysis, cost modeling, communication, SRE, cloud provider's networking team, cloud networking product team, AI/ML infrastructure traffic patterns, HPC fabrics, traffic engineering, multi-cloud connectivity, cost/chargeback systems, FinOps Engineering Technology Anthropic https://logos.yubhub.co/anthropic.com.png Anthropic is a technology company that creates reliable, interpretable, and steerable AI systems. https://anthropic.com https://job-boards.greenhouse.io/anthropic/jobs/5177143008 San Francisco, CA | New York City, NY 2026-04-18

9b8fb427-b59 Elastic AI Engineer - Canada (Remote) We are looking for an innovative Elastic AI Engineer to join our team to build autonomous, enterprise-grounded agents that don't just answer questions; they complete complex business tasks to accelerate productivity across the entire organization.

The ideal candidate is an Elastic product expert (including but not limited to Agent Builder and Workflows), using the full power of the Elastic Stack to provide the 'brain' and 'memory' for our agentic ecosystem.

As the company behind the popular open-source projects Elasticsearch, Kibana, Logstash, and Beats, we help people around the world do great things with their data.

The Elastic family unites employees across 40+ countries into one coherent team, while the broader community spans more than 100 countries.

Responsibilities

Agentic Strategy & Design: Invent and implement sophisticated agentic workflows that use reasoning and tools to complete end-to-end business processes.

Enterprise Grounding: Apply Retrieval Augmented Generation (RAG) and the Elasticsearch Relevance Engine (ESRE) to ensure agents are deeply grounded in enterprise knowledge for high-accuracy task completion.

AI Model & Tool Integration: Develop and fine-tune LLMs and integrate them with internal APIs and third-party SaaS tools to enable autonomous action.

Scalable Infrastructure: Apply a firm understanding of cloud-based environments (AWS, Azure, GCP) to support the high-concurrency demands of enterprise agents.

Lifecycle Management: Oversee the training, deployment, and performance optimization of agents, ensuring they remain secure, reliable, and compliant.

Technical Leadership: Act as a domain expert on the Elastic Stack, making technical recommendations that push the boundaries of AI-driven productivity.

Documentation: Maintain comprehensive documentation of AI workflows, cloud infrastructure, and deployment processes.

Security: Implement standards for security and data privacy to protect sensitive information and ensure compliance with relevant regulations.

Requirements

3-5 years of work experience in a relevant field.

Minimum 1 year experience building with the Elastic Stack.

Knowledge of Elasticsearch Relevance Engine (ESRE), Jina AI, and advanced RAG patterns is critical.

Proven success in delivering independent GenAI projects, specifically those involving autonomous task completion or complex workflow automation.

Agentic Frameworks: Familiarity with LangGraph, LangChain, and LangSmith for building and debugging multi-agent systems.

Expertise in Enterprise Agentic & Workflow Platforms: Deep familiarity with leading agentic AI and workflow automation platforms (such as Microsoft Copilot Studio, Salesforce Agentforce, ServiceNow AI Agents).

Market Trend Integration: Proven ability to apply emerging market trends, such as Multi-Agent Orchestration and Model Context Protocol (MCP), to build high-impact, cost-optimized solutions that scale across the enterprise.

Programming: Experience with Python or TypeScript for backend logic and agent orchestration.

Cloud & Orchestration: Familiarity with Kubernetes (Operators/Controllers), Docker, and Terraform for automated deployment.

Model Expertise: Hands-on experience with LLM providers.

Bonus Points

Bachelor’s or Master’s degree in Computer Science or a related engineering field.

Strong communication skills with the ability to translate business requirements into technical agent architectures.

A commitment to Ethical AI and responsible development practices.

Experience with containerization and orchestration (e.g., Docker, Kubernetes).

Knowledge of DevOps practices for model deployment and automation.

XML job scraping automation by YubHub

]]> full-time senior remote $101,900-$161,200 CAD Elasticsearch Relevance Engine (ESRE), Jina AI, advanced RAG patterns, LangGraph, LangChain, LangSmith, Microsoft Copilot Studio, Salesforce Agentforce, ServiceNow AI Agents, Multi-Agent Orchestration, Model Context Protocol (MCP), Python, TypeScript, Kubernetes, Docker, Terraform, LLM providers Engineering Technology Elastic https://logos.yubhub.co/elastic.co.png Elastic is a software company that provides a platform for search, security, and observability. It has a global presence with employees across 40+ countries. https://www.elastic.co/ https://job-boards.greenhouse.io/elastic/jobs/7792839 Canada 2026-04-18 6274ee2d-545 Elastic AI Engineer We are looking for an innovative Elastic AI Engineer to join our team to build autonomous, enterprise-grounded agents that don't just answer questions; they complete complex business tasks to accelerate productivity across the entire organization.

As the company behind the popular open-source projects Elasticsearch, Kibana, Logstash, and Beats, we help people around the world do great things with their data. From stock quotes to Twitter streams, Apache logs to WordPress blogs, our products are extending what's possible with data, delivering on the promise that good things come from connecting the dots.

Responsibilities

Agentic Strategy & Design: Invent and implement sophisticated agentic workflows that use reasoning and tools to complete end-to-end business processes.
Enterprise Grounding: Apply Retrieval Augmented Generation (RAG) and the Elasticsearch Relevance Engine (ESRE) to ensure agents are deeply grounded in enterprise knowledge for high-accuracy task completion.
AI Model & Tool Integration: Develop and fine-tune LLMs and integrate them with internal APIs and third-party SaaS tools to enable autonomous action.
Scalable Infrastructure: Firm understanding of cloud-based environments (AWS, Azure, GCP) in order to support the high-concurrency demands of enterprise agents.
Lifecycle Management: Oversee the training, deployment, and performance optimization of agents, ensuring they remain secure, reliable, and compliant.
Technical Leadership: Act as a domain expert on the Elastic Stack, making technical recommendations that push the boundaries of AI-driven productivity.
Documentation: Maintain comprehensive documentation of AI workflows, cloud infrastructure, and deployment processes.
Security: Implement standards for security and data privacy to protect sensitive information and ensure compliance with relevant regulations.

Requirements

3-5 years of work experience in a relevant field.
Minimum 1 year experience building with the Elastic Stack.
Knowledge of Elasticsearch Relevance Engine (ESRE), Jina AI, and advanced RAG patterns is critical.
Proven success in delivering independent GenAI projects, specifically those involving autonomous task completion or complex workflow automation.
Agentic Frameworks: Familiarity with LangGraph, LangChain, and LangSmith for building and debugging multi-agent systems.
Expertise in Enterprise Agentic & Workflow Platforms: Deep familiarity with leading agentic AI and workflow automation platforms (such as Microsoft Copilot Studio, Salesforce Agentforce, ServiceNow AI Agents).
Market Trend Integration: Proven ability to apply emerging market trends, such as Multi-Agent Orchestration and Model Context Protocol (MCP), to build high-impact, cost-optimized solutions that scale across the enterprise.
Programming: Experience with Python or TypeScript for backend logic and agent orchestration.
Cloud & Orchestration: Familiarity with Kubernetes (Operators/Controllers), Docker, and Terraform for automated deployment.
Model Expertise: Hands-on experience with LLM providers.

Bonus Points

Bachelor’s or Master’s degree in Computer Science or a related engineering field.
Strong communication skills with the ability to translate business requirements into technical agent architectures.
A commitment to Ethical AI and responsible development practices.
Experience with containerization and orchestration (e.g., Docker, Kubernetes).
Knowledge of DevOps practices for model deployment and automation.

]]> full-time mid remote $94,300-$149,200 USD Elasticsearch Relevance Engine (ESRE), Jina AI, Advanced RAG patterns, Python, TypeScript, LangGraph, LangChain, LangSmith, Microsoft Copilot Studio, Salesforce Agentforce, ServiceNow AI Agents Engineering Technology Elastic https://logos.yubhub.co/elastic.co.png Elastic is a software company that provides a platform for search, security, and observability. The company has a global presence with employees across 40+ countries. https://www.elastic.co/ https://job-boards.greenhouse.io/elastic/jobs/7607148 United States 2026-04-18 51758515-c12 Member of Technical Staff We are seeking a highly skilled Member of Technical Staff to join our team in managing and enhancing reliability across a multi-data center environment.

This role focuses on automating processes, building and implementing robust observability solutions, and ensuring seamless operations for mission-critical AI infrastructure.

The ideal candidate will combine strong coding abilities with hands-on data center experience to build scalable reliability services, optimize system performance, and minimize downtime, including close partnership with facility operations to address physical infrastructure impacts.

In an era where AI workloads demand near-zero downtime, this position plays a pivotal role in bridging software engineering principles with physical data center realities.

By prioritizing automation and observability, team members in this role can reduce mean time to recovery (MTTR) by up to 50% through proactive monitoring and automated remediation, based on industry benchmarks from high-scale environments like those at hyperscale cloud providers.

Responsibilities:

Design, develop, and deploy scalable code and services (primarily in Python and Rust, with flexibility for emerging languages) to automate reliability workflows, including monitoring, alerting, incident response, and infrastructure provisioning.

Implement and maintain observability tools and practices, such as metrics collection, logging, tracing, and dashboards, to provide real-time insights into system health across multiple data centers; we are open to innovative stacks beyond traditional ones like ELK.

Collaborate with cross-functional teams, including software development, network engineering, site operations, and facility operations (critical facilities, mechanical/electrical teams, and data center infrastructure management), to identify reliability bottlenecks and automate solutions for fault tolerance, disaster recovery, capacity planning, and physical/environmental risk mitigation (e.g., power redundancy, cooling efficiency, and environmental monitoring integration).

Troubleshoot and resolve complex issues in data center environments, including hardware failures, environmental anomalies, software bugs, and network-related problems, while adhering to reliability principles like error budgets and SLAs.

Optimize Linux-based systems for performance, security, and reliability, including kernel tuning, container orchestration (e.g., Kubernetes or emerging alternatives), and scripting for automation.

Understand network topologies and concepts in large-scale, multi-data center environments to effectively troubleshoot connectivity, routing, redundancy, and performance issues; integrate observability into data center interconnects and facility-level controls for rapid diagnosis and automation.

Participate in on-call rotations, post-incident reviews (blameless postmortems), and continuous improvement initiatives to enhance overall site reliability, including joint exercises with facility teams for physical failover and recovery scenarios.

Mentor junior team members and document processes to foster a culture of automation, knowledge sharing, and adaptability to new technologies.

Basic Qualifications:

Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or a closely related technical field (or equivalent professional experience).

5+ years of hands-on experience in site reliability engineering (SRE), infrastructure engineering, DevOps, or systems engineering, preferably supporting large-scale, distributed, or production environments.

Strong programming skills with proven production experience in Python (required for automation and tooling); experience with Rust or willingness to work in Rust is a plus, but strong coding fundamentals in at least one systems-level language (e.g., Python, Go, C++) are essential.

Solid experience with Linux systems administration, performance tuning, kernel-level understanding, and scripting/automation in production environments.

Practical knowledge of containerization and orchestration technologies, such as Docker and Kubernetes (or similar systems).

Experience implementing observability solutions, including metrics, logging, tracing, monitoring tools (e.g., Prometheus, Grafana, or alternatives), alerting, and dashboards.

Familiarity with troubleshooting complex issues in distributed systems, including software bugs, hardware failures, network problems, and environmental factors.

Understanding of networking fundamentals (TCP/IP, routing, redundancy, DNS) in large-scale or multi-site environments.

Experience participating in on-call rotations, incident response, post-incident reviews (blameless postmortems), and reliability practices such as error budgets or SLAs.

Ability to collaborate effectively with cross-functional teams (software engineers, network teams, site/facility operations, mechanical/electrical teams).

Preferred Skills and Experience:

7+ years of experience in SRE or infrastructure roles, ideally in hyperscale, cloud, or AI/ML training infrastructure environments with multi-data center setups.

Hands-on experience operating or scaling Kubernetes clusters (or equivalent orchestration) at large scale, including automation for provisioning, lifecycle management, and high-availability.

Proficiency in Rust for systems programming and performance-critical components.

Direct experience integrating software reliability tools with physical data center infrastructure.

Experience with observability tools and practices, such as metrics collection, logging, tracing, and dashboards.

Familiarity with containerization and orchestration technologies, such as Docker and Kubernetes (or similar systems).

Experience with Linux systems administration, performance tuning, kernel-level understanding, and scripting/automation in production environments.

]]> full-time staff onsite Python, Rust, Linux systems administration, performance tuning, kernel-level understanding, scripting/automation, containerization, orchestration, observability, metrics collection, logging, tracing, dashboards, networking fundamentals, TCP/IP, routing, redundancy, DNS, Kubernetes, Docker, Grafana, Prometheus, ELK, DevOps, SRE, infrastructure engineering, systems engineering Engineering Technology xAI https://logos.yubhub.co/xai.com.png xAI creates AI systems to understand the universe and aid humanity in its pursuit of knowledge. https://www.xai.com/ https://job-boards.greenhouse.io/xai/jobs/5044403007 Memphis, TN 2026-04-18 5dd5f58c-c07 Principal Engineer We're looking for a well-versed Principal Engineer to play a key role in architecting and building highly available, reliable, and scalable payments applications. Collaborate with Payments Engineering teams to design, develop, and champion best-practices, patterns, and standards for all payments applications. Work closely with our CTO and other architects to create holistic technology solutions for our customers.

As a Principal Engineer, you will:

Collaborate and communicate with Payments Engineering teams to design, develop, and champion best-practices, patterns, and standards for all payments applications.
Work closely with our CTO and other architects to create holistic technology solutions for our customers.
Be part of the Tech Leads group, driving measurable outcomes and iterative delivery strategy, removing roadblocks, empowering others, and mentoring high-potential engineers.
Produce clear, detailed, and actionable design documents, architecture blueprints, architectural decisions with context, decision, and tradeoffs.
Be involved in hands-on development of proof-of-concepts, prototypes, and real production-ready code.
Mentor engineers on architecture best practices and standards.
Engage in all phases of the software lifecycle - design, implement, test, deploy, and support services in production.
Maintain a culture of code quality through rigorous testing, automation, and code reviews.
Be proactive and innovative - we rely on your feedback to build a world-class product.

We're seeking individuals with an equal flair for creative problem-solving, enthusiasm for new technologies, and a desire to contribute to our product. You will likely be successful in this role if you identify with the following traits: attention to detail, problem solver, customer-oriented, versatile, resilient, and confident.

If all of this sounds interesting to you, we'd love to hear from you.

]]> full-time senior remote Cloud SaaS environment, Highly available, reliable, and scalable SaaS applications/platforms, Backend API specs, mocks, and service implementations, Cloud-native architecture, microservices, CI/CD (GitHub Actions, Argo), GitOps, Authentication and Authorization, APIs and API Gateway, Docker, Kubernetes (EKS), Kafka (MSK), Java, Spring Framework, Python, and AWS services, Observability solutions using Grafana and Open Telemetry, DevOps, SRE, Configuration Management, and Release Management, Payments technologies and ecosystem (card networks, PSP integration) Engineering Technology VGS https://logos.yubhub.co/vgs.com.png VGS is the world's leader in payment tokenization, providing processor-agnostic tokenization solutions to large banks, fintechs, and merchants. https://www.vgs.com https://jobs.lever.co/verygoodsecurity/33e033b6-ae9b-4d51-b190-262a2cb83d96 San Francisco 2026-04-17 a632e52b-c63 Site Reliability Engineer About Mistral AI

At Mistral AI, we believe in the power of AI to simplify tasks, save time, and enhance learning and creativity. Our technology is designed to integrate seamlessly into daily working life.

We are a dynamic team passionate about AI and its potential to transform society. Our diverse workforce thrives in competitive environments and is committed to driving innovation.

Role Summary

We are seeking highly experienced Site Reliability Engineers (SREs) to shape the reliability, scalability, and performance of our platform and customer-facing applications. You will work closely with our software engineers and research teams to ensure our systems meet and exceed our internal and external customers' expectations.

Responsibilities

As a Site Reliability Engineer, you balance the day-to-day operations on production systems with long-term software engineering improvements to reduce operational toil and foster the reliability, availability, and performance of these systems.

Operations

• Design, build, and maintain scalable, highly available and fault-tolerant infrastructures to support our web services and ML workloads

• Make sure our platform, inference and model training environments are always highly available and enable seamless replication of work environments across several HPC clusters

• Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, users admin, data extraction, infrastructure scaling, etc.)

• Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime

• Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging and alerting systems) for both our client-facing APIs and large training runs

• Participate occasionally in on-call rotations to respond to incidents and perform root cause analysis to prevent future occurrences

Development

• Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes, Flux, Terraform

• Collaborate with AI/ML researchers to develop and implement solutions that enable safe and reproducible model-training experiments

• Build a cloud-agnostic platform offering an abstraction layer between science and infrastructure

• Design and develop new workflows and tooling to improve the reliability, availability and performance of our systems (automation scripts, refactoring, new API-based features, web apps, dashboards, etc.)

• Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements

• Document processes and procedures to ensure consistency and knowledge sharing across the team

• Contribute to open-source projects, research publications, blog articles and conferences

About You

• Master’s degree in Computer Science, Engineering or a related field

• 7+ years of experience in a DevOps/SRE role

• Strong experience with cloud computing and highly available distributed systems

• Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations...)

• Experience working against reliability KPIs (observability, alerting, SLAs)

• Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes...)

• Knowledge of monitoring, logging, alerting and observability tools (Prometheus, Grafana, ELK Stack, Datadog...)

• Familiarity with infrastructure-as-code tools like Terraform or CloudFormation

• Proficiency in scripting languages (Python, Go, Bash...) and knowledge of software development best practices

• Strong understanding of networking, security, and system administration concepts

• Excellent problem-solving and communication skills

• Self-motivated and able to work well in a fast-paced startup environment

Your Application Will Be All The More Interesting If You Also Have:

• Experience in an AI/ML environment

• Experience of high-performance computing (HPC) systems and workload managers (Slurm)

• Worked with modern AI-oriented solutions (Fluidstack, Coreweave, Vast...)

At Mistral AI, we believe in the power of AI to simplify tasks, save time, and enhance learning and creativity. Our technology is designed to integrate seamlessly into daily working life. We democratize AI through high-performance, optimized, open-source and cutting-edge models, products and solutions. Our comprehensive AI platform is designed to meet enterprise needs, whether on-premises or in cloud environments.

Role Summary

Responsibilities

Operations (50%)

Design, build, and maintain scalable, highly available and fault-tolerant infrastructures to support our web services and ML workloads
Make sure our platform, inference and model training environments are always highly available and enable seamless replication of work environments across several HPC clusters
Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, users admin, data extraction, infrastructure scaling, etc.)
Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime
Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging and alerting systems) for both our client-facing APIs and large training runs
Participate occasionally in on-call rotations to respond to incidents and perform root cause analysis to prevent future occurrences

Development (50%)

Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes, Flux, Terraform
Collaborate with AI/ML researchers to develop and implement solutions that enable safe and reproducible model-training experiments
Build a cloud-agnostic platform offering an abstraction layer between science and infrastructure
Design and develop new workflows and tooling to improve the reliability, availability and performance of our systems (automation scripts, refactoring, new API-based features, web apps, dashboards, etc.)
Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements
Document processes and procedures to ensure consistency and knowledge sharing across the team
Contribute to open-source projects, research publications, blog articles and conferences

]]> full-time senior remote cloud computing, highly available distributed systems, DevOps, SRE, Kubernetes, Flux, Terraform, CI/CD, containerization, orchestration, monitoring, logging, alerting, observability, infrastructure-as-code, scripting languages, software development best practices, networking, security, system administration, AI/ML environment, high-performance computing, workload managers, modern AI-oriented solutions Engineering Technology Mistral AI Mistral AI is a company that develops and provides artificial intelligence (AI) technology to simplify tasks, save time, and enhance learning and creativity. It has a diverse workforce with teams distributed across multiple countries. https://mistral.ai/careers https://jobs.lever.co/mistral/6e16e4fa-a60b-4270-a815-06b0450fb597 Paris 2026-03-10 d2955c92-774 Network Security Engineering Enterprise Architect (GSR8) As a Network Security Engineering Enterprise Architect (GSR8), you will be a technical lead supporting Ford's complete Enterprise Network & Security architecture transformation. You will guide the Network Security Engineering Products team (Security Firewalls, Proxy, ISE, SDN Networks, Wireless) toward becoming a centre of technical excellence and customer advocacy.

You will identify, analyse, and resolve existing network security design weaknesses and vulnerabilities that could pose a risk to existing infrastructure. You should be an expert in closing zero-day security vulnerabilities that could impact Ford's reputation across the globe, bringing all infrastructure domain teams along.

As a Network Security Engineering enterprise architect, you would lead future network security product development by contributing to the network Design (architecture) and Automation used across multiple Engineering Branches, Data Centres, Manufacturing Plants and Remote users.

This role requires defining a roadmap for ZTNA/SASE deployment using Prisma Access/Cloud, setting up a support model, and building automation to accelerate the end-user experience. The Global Network Security Engineering enterprise architect is responsible for successful setup of the products by working closely with software developers from Ford and OEMs in consultation with Ford's Network and Security Operations Team.

This position will be part of Ford's Enterprise Tech department and will report to the Regional Network Delivery Manager, based in the same or another region. The lead needs to ensure 'Always On' (24 x 7) availability of Ford Global Network Product offerings, working with Network & Security peers from other regions.

Responsibilities

This role also drives full observability and monitoring, process response, and technical capability to ensure customer uptime of 99.999%+. This position requires a wide range of skills and experience.

This role involves collaborating closely with the network operations team to identify continuous improvement opportunities and working with the network engineering team and OEMs to devise and implement solutions. The implementation will be driven through automation in partnership with Ford's developers.
Design and implement robust security architectures and frameworks to protect against threats and vulnerabilities.
Ensure timely proactive identification and reporting of security gaps and vulnerabilities to the critical business information, systems and network infrastructure.
Plan for End-to-end Network & Security projects implementation.

Qualifications

Support major technical incident management calls and change controls through strong technical network knowledge, operational capability, and strong communication skills.
Perform configuration updates, such as modifying configurations, signature definitions or implementing new policies on various network security tools, as directed.
Demonstrate technical excellence through technical knowledge.
Collaborate with global leaders to support 24/7 network availability on a worldwide scale.
Advocate for and ensure that high-quality Follow the Sun (FTS) coverage is delivered to receiving teams, and support the on-call schedule so that shifts are covered.
Support continuous improvement in service management for Network Services leveraging enterprise tools and processes (Incident, Problem & Change) and focusing on customer value optimization.
Support the implementation of best practices and processes for Network & Security Operations services to maintain availability, reliability, scalability, and security.
Support for effective SRE Monitoring and FSO (Full Stack Observability) on system performance and overall health, troubleshoot issues, and implement corrective actions.
Collaborate with the Network LAN/WAN & security Engineering/development teams to optimize infrastructure for application performance and scalability.
Support team members in achieving technical network excellence through experience and network certifications, and support training requirements.
Support the team in developing continued improvements leading to an 'always on' network capability.
Be able to leverage other network management tools used by the NOC in the identification and response to security connectivity incidents and faults.
Develop security policies, standards, and procedures.
Assist with security compliance audits to verify completeness of required configurations and verify system hardening.
Participate in problem investigation for connectivity incidents related to security devices, and provide recommendations to improve reliability and availability or reduce recovery time.
Support assurance of up-to-date SW releases, targeted LDOS, and PSIRTS (security updates).

]]> full-time senior hybrid Network Security Engineering, Enterprise Architecture, Security Firewalls, Proxy, ISE, SDN Networks, Wireless, Prisma Access/Cloud, ZTNA/SASE, Automation, Network Design, Network Security, Security Operations, Incident Management, Change Controls, Technical Knowledge, Global Leadership, Follow the Sun, SRE Monitoring, FSO, Full Stack Observability, System Performance, Network Certifications, Training Requirements Engineering Automotive Ford Ford is a multinational automaker that designs, manufactures, and markets vehicles and automotive-related products. It is one of the largest automakers in the world. https://efds.fa.em5.oraclecloud.com https://efds.fa.em5.oraclecloud.com/hcmUI/CandidateExperience/en/sites/CX_1/job/56878 Chennai, Tamil Nadu, India 2026-03-09 c0069e5d-01b Forward Deployment Engineer (Developer Success) You'll work directly with customers to deploy Firecrawl in real-world environments, unblock integrations, and turn customer needs into repeatable solutions and product improvements. This is a highly hands-on, customer-facing engineering role, ideal for someone who likes solving complex problems live and shipping pragmatic solutions fast.

Salary Range: $150,000–$250,000/year (Range shown is for U.S.-based employees. Compensation outside the U.S. is adjusted fairly based on your country's cost of living.)
Equity Range: Up to 0.10%
Location: San Francisco, CA (Hybrid) OR Remote
Job Type: Full-Time (SF) OR Contract (Remote)
Experience: 3+ years or equivalent shipped systems
Visa: US Citizenship/Visa required for SF; N/A for Remote

You'll work directly with customers to deploy, customize, and troubleshoot Firecrawl in production environments.
You'll own technical delivery for priority accounts, from first integration through ongoing optimization.
You'll debug complex real-world issues involving APIs, crawling, data pipelines, and infra constraints.
You'll build reusable solutions, templates, and playbooks based on customer needs.
You'll translate customer feedback into clear product and engineering insights.
You'll collaborate closely with core engineers to improve reliability, performance, and usability.
You'll help define best practices for how Firecrawl is implemented at scale.

A strong engineer who likes being close to customers. You have solid fundamentals (APIs, systems, debugging) and you're comfortable explaining technical concepts clearly to people who aren't engineers. You'd rather hop on a call and unblock someone than file a ticket and wait.
Calm and effective in ambiguity. You don't need a runbook for every situation. You diagnose fast, communicate clearly, and make good decisions with incomplete information.
Biased toward action. You unblock first, optimize later. You ship pragmatic solutions over perfect abstractions and know when "good enough now" beats "ideal next quarter."
Comfortable in a small, high-trust team. You don't need layers of process. You work directly with founders and core engineers, own your domain, and move fast.

Backgrounds that often do well: Solutions engineers, SREs, devrel engineers, or customer-facing infra roles. Engineers at startups who wore multiple hats. Ex-founders who've debugged customer problems at 2am because the customer mattered.
Benefits & Perks

Salary that makes sense: $170,000–215,000/year (SF, U.S.-based), based on impact, not tenure
Own a piece: Up to 0.20% equity in what you're helping build
Generous PTO: 15 days mandatory, anything after 24 days, just ask (holidays excluded); take the time you need to recharge
Parental leave: 12 weeks fully paid, for moms and dads
Wellness stipend: $100/month for the gym, therapy, massages, or whatever keeps you human
Learning & Development: Expense up to $1000/year toward anything that helps you grow professionally
Team offsites: A change of scenery, minus the trust falls
Sabbatical: 3 paid months off after 4 years, do something fun and new

Interview Process

1. Application Review – Send us your stuff + a quick note on why this excites you (plus links to things you've built or deployed).
2. Technical + Customer Scenario Interview (~45 min) – Real-world problem solving: we'll walk through a customer deployment scenario and see how you debug, communicate, and prioritize live. We're looking for engineering depth and customer instincts, not trivia.
3. Founder Chat (~30 min) – Culture, pace, ownership, and how you like to work. Time for your questions too.
4. Paid Work Trial (1–2 weeks) – Test drive the real thing: work on a real customer deployment or integration with measurable impact.
5. Decision – We move fast after the trial.

XML job scraping automation by YubHub

]]> Full time mid Remote $150K - $250K APIs, systems, debugging, customer-facing engineering, solutions engineering, SREs, devrel engineers, customer-facing infra roles, engineering depth, customer instincts, problem-solving, communication, prioritization Engineering Technology Firecrawl https://logos.yubhub.co/firecrawl.com.png Firecrawl is a small, fast-moving, technical team building essential infrastructure super-intelligence will use to gather data on the web. https://jobs.ashbyhq.com https://jobs.ashbyhq.com/firecrawl/bda40f47-a69b-44d4-ac1a-3f86f20d802d San Francisco, CA (Hybrid) OR Remote-Global 2026-03-08 84b511f8-598 Field Engineer

Compensation

- Compensation is determined based on career level, with the OTE for this role being between $150K – $250K • Offers Equity

Replit is the agentic software creation platform that enables anyone to build applications using natural language. With millions of users worldwide, Replit is democratizing software development by removing traditional barriers to application creation.

## About the Role

As an Enterprise/Strategic Field Engineer (L5), you'll be the technical cornerstone for Replit's largest and most strategic accounts. This is a hybrid role: high-impact pre-sales (closing complex technical evaluations) and post-sales (driving adoption, expansion, and retention). You'll own the end-to-end technical relationship, from pre-sales architecture discussions through multi-year expansion, ensuring our enterprise customers don't just use Replit, but become Replit-powered companies.

You'll partner with Account Executives and Account Managers in a high-accountability Pod structure. This is not a reactive support role; this is a proactive, strategic technical leader who identifies blockers before they become problems, champions new use cases, and directly influences $5M+ in annual recurring revenue.

## In this role you will:

Pre-Sales

- Strategic Technical Discovery: When you are pulled into complex deals, you join as the expert closer. You run deep discovery on their stack and constraints, then design the winning technical strategy.

- Proof of Value (POV) & Live Building: You build live, functional applications on the fly during executive meetings to prove immediate value and technical feasibility to VPs and C-suite stakeholders.

- Context & Connectivity (MCP): You write and deploy Model Context Protocol (MCP) servers to securely connect Replit Agents to customer-specific data, making Replit the central hub for their internal development.

- Enterprise Governance: You own the "Guardrails" mission. You configure workspace policies and AI governance templates that solve for data safety, compliance, and CISO approval.

- Infrastructure Strategy: You lead deep-dive reviews for Single-Tenant/VPC deployments, ensuring Replit fits the customer’s security posture.

- Hackathons & Trials: You lead high-energy Hackathons alongside Account Executives to drive hands-on experience and excitement about Replit.

Post-Sales

- Production Foundation: You lead the technical kickoff, ensuring production-ready SSO/SCIM provisioning, guardrails, and security setup.

- Technical Onboarding & Enablement: You ensure customers learn how to build in Replit to its maximum capabilities. You enable technical and non-technical teams by running training sessions and workshops that establish enterprise workflows with Replit.

- Design Systems: You build Design Systems and starter templates tailored to the customer’s stack to accelerate their internal development time.

- Drive Viral Growth: You build trust with key technical stakeholders, proactively running enablement sessions to keep teams building. You act as the spark for viral growth within the account.

- Source & Qualify Expansion: Work with Account Managers to proactively identify new teams, new use cases, and new projects. You provide the technical validation to support the commercial close.

- Run Value-Based QBRs: Co-lead quarterly business reviews, shifting the conversation to business value delivered and strategic roadmap alignment.

## Required skills and experience:

- 5-7+ years in technical customer-facing roles such as Solutions Engineer, Sales Engineer, Implementation Engineer, Forward Deployed Engineer, Technical Account Manager, or Customer Success Engineer at a high-growth B2B SaaS or dev tools company

- Replit Power User: You understand Replit better than 99% of our users. You have likely built Replit apps before. To facilitate this, we provide interviewees with credits/access to the platform.

- Enterprise Depth: You have worked with enterprise prospects to drive adoption, expansion, and renewal. You can explain complex technical concepts to non-technical executives and translate business requirements into technical architecture.

- Live Builder: You've run POCs, onboarding sessions, and workshops. You can listen to vague requirements and translate them on the fly into technical concepts, creating live apps in real-time.

- Production Engineering: You can read and write code (JavaScript, Python, or similar). You understand APIs, databases, CI/CD pipelines, and modern cloud architecture.

- Pod Mentality: You thrive in a high-accountability POD structure.

- Military Experience: Relevant military experience with technology is counted as background and experience.

- Comfort with up to 25% travel (expect 30%+).

## Nice to have:

- Experience with AI-powered dev tools (Cursor, Windsurf, Lovable, Claude Code, Zapier etc.)

- Understanding of AI Evaluation patterns (Evals) and Context Management (RAG, System Prompts).

- Background in DevOps, cloud infrastructure (AWS/GCP/Azure), or SRE.

## Tools + Tech Stack for this role:

- Replit

- HubSpot CRM

- SSO/SAML/SCIM identity systems

- Cloud platforms (AWS, GCP, Azure)

- Slack

- Claude

- ChatGPT

- Gemini

- Notion

- Superhuman

- ZoomInfo

- Hex

## This role may _not_ be a fit if:

- You haven't had your "Replit Moment". You didn't explore the product on your own, get mind-blown by the speed of creation, and immediately start showing your apps to friends and family.

- You are uncomfortable building functional apps on the fly in front of executives or need a script to demonstrate value.

- You prefer reactive troubleshooting over proactively identifying architectural blockers and owning the technical strategy.

- You struggle to debug code, configure SSO, or understand cloud infrastructure without significant hand-holding.

- You aren't obsessed with how Agents are rewriting the SDLC, and you don't use AI tools to build in your own spare time.

- You struggle in ambiguous, high-growth environments where you often have to build the tool or process required to solve the problem.

- You're not comfortable with significant travel (up to 30%+) for customer meetings and on-site engagements

- You prefer transactional interactions over building deep, multi-year customer relationships

_This is a full-time role based in San Francisco Bay Area, New York City, or Remote (US-based). Travel up to 25% required (expect 30%+). 2-week onsite onboarding in Foster City, CA required._

## Full-Time Employee Benefits Include:

💰 Competitive Salary & Equity

💹 401(k) Program with a 4% match

⚕️ Health, Dental, Vision and Life Insurance

🩼 Short Term and Long Term Disability

🚼 Paid Parental, Medical, Caregiver Leave

🚗 Commuter Benefits

📱 Monthly Wellness Stipend

🧑‍💻 Autonomous Work Environment

🖥 In Office Set-Up Reimbursement

🏝 Flexible Time Off (FTO) + Holidays

🚀 Quarterly Team Gatherings

☕ In Office Amenities

## Want to learn more about what we are up to?

- [Meet the Replit Agent](https://www.youtube.com/watch?v=IYiVPrxY8-Y)

- [Replit: Make an app for that](https://www.youtube.com/watch?v=4zd9hzngFwY)

- [Replit Blog](https://blog.replit.com/)

- [Amjad TED Talk](https://youtu.be/kCudFI4tcpg?si=l4ViCejV_f2RZkDi)

## Interviewing + Culture at Replit

- [Operating Principles](https://blog.replit.com/operating-principles)

- [Reasons not to work at Replit](https://blog.replit.com/reasons-not-to-join-replit)

To achieve our mission of making programming more accessible around the world, we need our team to be representative of the world. We welcome your unique perspective and experiences in shaping this product. We encourage people from all kinds of backgrounds and experiences to apply.

]]> Full time senior Hybrid Competitive Salary & Equity 5-7+ years in technical customer-facing roles, Replit Power User, Enterprise Depth, Live Builder, Production Engineering, Pod Mentality, Military Experience, Experience with AI-powered dev tools, Understanding of AI Evaluation patterns, Background in DevOps, cloud infrastructure (AWS/GCP/Azure), or SRE Engineering Technology Replit https://logos.yubhub.co/replit.com.png Replit is a software creation platform that enables anyone to build applications using natural language. With millions of users worldwide, Replit is a leading provider of dev tools. https://jobs.ashbyhq.com https://jobs.ashbyhq.com/replit/df87458e-b5e9-4dbc-85a4-54c508d040b7 NYC (SoHo) Hybrid 2026-03-08 055260e3-5e7 Field Engineer

Compensation

- Compensation is determined based on career level, with the OTE for this role being between $150K – $250K • Offers Equity

Replit is the agentic software creation platform that enables anyone to build applications using natural language. With millions of users worldwide, Replit is democratizing software development by removing traditional barriers to application creation.

## About the Role

As an Enterprise/Strategic Field Engineer (L5), you'll be the technical cornerstone for Replit's largest and most strategic accounts. This is a hybrid role: high-impact pre-sales (closing complex technical evaluations) and post-sales (driving adoption, expansion, and retention). You'll own the end-to-end technical relationship, from pre-sales architecture discussions through multi-year expansion, ensuring our enterprise customers don't just use Replit, but become Replit-powered companies.

You'll partner with Account Executives and Account Managers in a high-accountability Pod structure. This is not a reactive support role; this is a proactive, strategic technical leader who identifies blockers before they become problems, champions new use cases, and directly influences $5M+ in annual recurring revenue.

## In this role you will:

Pre-Sales

- Strategic Technical Discovery: When you are pulled into complex deals, you join as the expert closer. You run deep discovery on their stack and constraints, then design the winning technical strategy.

- Proof of Value (POV) & Live Building: You build live, functional applications on the fly during executive meetings to prove immediate value and technical feasibility to VPs and C-suite stakeholders.

- Context & Connectivity (MCP): You write and deploy Model Context Protocol (MCP) servers to securely connect Replit Agents to customer-specific data, making Replit the central hub for their internal development.

- Enterprise Governance: You own the "Guardrails" mission. You configure workspace policies and AI governance templates that solve for data safety, compliance, and CISO approval.

- Infrastructure Strategy: You lead deep-dive reviews for Single-Tenant/VPC deployments, ensuring Replit fits the customer’s security posture.

- Hackathons & Trials: You lead high-energy Hackathons alongside Account Executives to drive hands-on experience and excitement about Replit.

Post-Sales

- Production Foundation: You lead the technical kickoff, ensuring production-ready SSO/SCIM provisioning, guardrails, and security setup.

- Technical Onboarding & Enablement: You ensure customers learn how to build in Replit to its maximum capabilities. You enable technical and non-technical teams by running training sessions and workshops that establish enterprise workflows with Replit.

- Design Systems: You build Design Systems and starter templates tailored to the customer’s stack to accelerate their internal development time.

- Drive Viral Growth: You build trust with key technical stakeholders, proactively running enablement sessions to keep teams building. You act as the spark for viral growth within the account.

- Source & Qualify Expansion: Work with Account Managers to proactively identify new teams, new use cases, and new projects. You provide the technical validation to support the commercial close.

- Run Value-Based QBRs: Co-lead quarterly business reviews, shifting the conversation to business value delivered and strategic roadmap alignment.

## Required skills and experience:

- 5-7+ years in technical customer-facing roles such as Solutions Engineer, Sales Engineer, Implementation Engineer, Forward Deployed Engineer, Technical Account Manager, or Customer Success Engineer at a high-growth B2B SaaS or dev tools company

- Replit Power User: You understand Replit better than 99% of our users. You have likely built Replit apps before. To facilitate this, we provide interviewees with credits/access to the platform.

- Enterprise Depth: You have worked with enterprise prospects to drive adoption, expansion, and renewal. You can explain complex technical concepts to non-technical executives and translate business requirements into technical architecture.

- Live Builder: You've run POCs, onboarding sessions, and workshops. You can listen to vague requirements and translate them on the fly into technical concepts, creating live apps in real-time.

- Production Engineering: You can read and write code (JavaScript, Python, or similar). You understand APIs, databases, CI/CD pipelines, and modern cloud architecture.

- Pod Mentality: You thrive in a high-accountability POD structure.

- Military Experience: Relevant military experience with technology is counted as background and experience.

- Comfort with up to 25% travel (expect 30%+).

## Nice to have:

- Experience with AI-powered dev tools (Cursor, Windsurf, Lovable, Claude Code, Zapier etc.)

- Understanding of AI Evaluation patterns (Evals) and Context Management (RAG, System Prompts).

- Background in DevOps, cloud infrastructure (AWS/GCP/Azure), or SRE.

## Tools + Tech Stack for this role:

- Replit

- HubSpot CRM

- SSO/SAML/SCIM identity systems

- Cloud platforms (AWS, GCP, Azure)

- Slack

- Claude

- ChatGPT

- Gemini

- Notion

- Superhuman

- ZoomInfo

- Hex

## This role may _not_ be a fit if:

- You haven't had your "Replit Moment". You didn't explore the product on your own, get mind-blown by the speed of creation, and immediately start showing your apps to friends and family.

- You are uncomfortable building functional apps on the fly in front of executives or need a script to demonstrate value.

- You prefer reactive troubleshooting over proactively identifying architectural blockers and owning the technical strategy.

- You struggle to debug code, configure SSO, or understand cloud infrastructure without significant hand-holding.

- You aren't obsessed with how Agents are rewriting the SDLC, and you don't use AI tools to build in your own spare time.

- You struggle in ambiguous, high-growth environments where you often have to build the tool or process required to solve the problem.

- You're not comfortable with significant travel (up to 30%+) for customer meetings and on-site engagements

- You prefer transactional interactions over building deep, multi-year customer relationships

_This is a full-time role based in San Francisco Bay Area, New York City, or Remote (US-based). Travel up to 25% required (expect 30%+). 2-week onsite onboarding in Foster City, CA required._

## Full-Time Employee Benefits Include:

💰 Competitive Salary & Equity

💹 401(k) Program with a 4% match

⚕️ Health, Dental, Vision and Life Insurance

🩼 Short Term and Long Term Disability

🚼 Paid Parental, Medical, Caregiver Leave

🚗 Commuter Benefits

📱 Monthly Wellness Stipend

🧑‍💻 Autonomous Work Environment

🖥 In Office Set-Up Reimbursement

🏝 Flexible Time Off (FTO) + Holidays

🚀 Quarterly Team Gatherings

☕ In Office Amenities

Want to learn more about what we are up to?

- [Meet the Replit Agent](https://www.youtube.com/watch?v=IYiVPrxY8-Y)

- [Replit: Make an app for that](https://www.youtube.com/watch?v=4zd9hzngFwY)

- [Replit Blog](https://blog.replit.com/)

- [Amjad TED Talk](https://youtu.be/kCudFI4tcpg?si=l4ViCejV_f2RZkDi)

Interviewing + Culture at Replit

- [Operating Principles](https://blog.replit.com/operating-principles)

- [Reasons not to work at Replit](https://blog.replit.com/reasons-not-to-join-replit)

To achieve our mission of making programming more accessible around the world, we need our team to be representative of the world. We welcome your unique perspective and experiences in shaping this product. We encourage people

Anthropic's mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

About the Role

Claude has your back. AIRE has Claude's. Help us keep Claude reliable for everyone who depends on it.

AIRE (AI Reliability Engineering) partners with teams across Anthropic to improve reliability across our most critical serving paths -- every hop from the SDK through our network, API layers, serving infrastructure, and accelerators and back. We jump into the trenches alongside partner teams to make the systems that deliver Claude more robust and resilient, be it during an incident or collaborating on projects.

Reliability here is an emergent phenomenon that transcends any single team's boundaries, so someone has to zoom out and look at the whole picture. That's us -- and it means few teams at Anthropic offer this kind of dynamic, cross-cutting exposure to the systems that matter most.

Responsibilities

Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity.
Design and implement monitoring and observability systems across the token path.
Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers.
Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.
Support the reliability of safeguard model serving -- critical for both site reliability and Anthropic's safety commitments.

You may be a good fit if you

Have strong distributed systems, infrastructure, or reliability backgrounds -- we're looking for reliability-minded software engineers and SREs.
Are curious and brave -- comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don't have deep expertise yet.
Think holistically about how systems compose and where the seams are.
Can build lasting relationships across teams -- our engagement model depends on being welcomed as teammates, not outsiders with opinions.
Care about users and feel ownership over outcomes, even for systems you don't own.
Have excellent communication and collaboration skills -- you'll be partnering across the entire company.
Bring diverse experience -- the team's strength comes from people who've built product stacks, scaled databases, run massive distributed systems, and everything in between.

Strong candidates may also

Have worked as an SRE, Production Engineer, or in a similar reliability-focused role on large-scale systems.
Have experience operating large-scale model serving or training infrastructure (>1000 GPUs).
Have experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium).
Understand ML-specific networking optimizations like RDMA and InfiniBand.
Have expertise in AI-specific observability tools and frameworks.
Have experience with chaos engineering and systematic resilience testing.
Have contributed to open-source infrastructure or ML tooling.

Logistics

Education requirements: We require at least a Bachelor's degree in a related field or equivalent experience.

Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.

Visa sponsorship

We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.

We encourage you to apply even if you do not believe you meet every single qualification.

Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you're interested in this work.

Your safety matters to us.

To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses. In some cases, we may partner with vetted recruiting agencies who will identify themselves as working on behalf of Anthropic. Be cautious of emails from other domains. Legitimate Anthropic recruiters will never ask for money, fees, or banking information before your first day. If you're ever unsure about a communication, don't click any links—visit anthropic.com/careers directly for confirmed position openings.

How we're different

We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact — advancing our long-term goals of steerable, trustworthy AI — rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and engineering as it does with computer science.

]]> full-time staff hybrid £325,000 - £390,000 GBP distributed systems, infrastructure, reliability, software engineering, SRE, large scale systems, model serving, training infrastructure, ML hardware accelerators, RDMA, InfiniBand, AI-specific observability tools, chaos engineering, resilience testing, open-source infrastructure, ML tooling, Production Engineer, reliability-focused roles Engineering Technology Anthropic https://logos.yubhub.co/anthropic.com.png Anthropic is a company that creates reliable, interpretable, and steerable AI systems. It has a quickly growing team of researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems. https://job-boards.greenhouse.io https://job-boards.greenhouse.io/anthropic/jobs/5101173008 London, UK 2026-03-08 c930b80e-7a6 Staff / Senior Software Engineer, AI Reliability About Anthropic

About the Role

Claude has your back. AIRE has Claude's. Help us keep Claude reliable for everyone who depends on it.

Responsibilities:

Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity.

Design and implement monitoring and observability systems across the token path.

Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers.

Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.

Support the reliability of safeguard model serving -- critical for both site reliability and Anthropic's safety commitments.

You may be a good fit if you:

Have strong distributed systems, infrastructure, or reliability backgrounds -- we're looking for reliability-minded software engineers and SREs.

Are curious and brave -- comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don't have deep expertise yet.

Think holistically about how systems compose and where the seams are.

Can build lasting relationships across teams -- our engagement model depends on being welcomed as teammates, not outsiders with opinions.

Care about users and feel ownership over outcomes, even for systems you don't own.

Have excellent communication and collaboration skills -- you'll be partnering across the entire company.

Bring diverse experience -- the team's strength comes from people who've built product stacks, scaled databases, run massive distributed systems, and everything in between.

Strong candidates may also:

Have worked as an SRE, Production Engineer, or in a similar reliability-focused role on large-scale systems.

Have experience operating large-scale model serving or training infrastructure (>1000 GPUs).

Have experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium).

Understand ML-specific networking optimizations like RDMA and InfiniBand.

Have expertise in AI-specific observability tools and frameworks.

Have experience with chaos engineering and systematic resilience testing.

Have contributed to open-source infrastructure or ML tooling.

Logistics

Visa sponsorship: We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.

We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you're interested in this work.

Your safety matters to us. To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses. In some cases, we may partner with vetted recruiting agencies who will identify themselves as working on behalf of Anthropic. Be cautious of emails from other domains. Legitimate Anthropic recruiters will never ask for money, fees, or banking information before your first day. If you're ever unsure about a communication, don't click any links—visit anthropic.com/careers directly for confirmed position openings.

How we're different

We believe that the highest-impact AI research will be big science. At Anthropic we work as a single cohesive team on just a few large-scale research efforts. And we value impact — advancing our long-term goals of steerable, trustworthy AI — rather than work on smaller and more specific puzzles. We view AI research as a team sport, where everyone contributes to the overall success of the team.

]]> full-time staff hybrid $325,000 - $485,000 USD distributed systems, infrastructure, reliability, large language model serving systems, monitoring and observability systems, high-availability serving infrastructure, incident response, safeguard model serving, SRE, Production Engineer, ML hardware accelerators, ML-specific networking optimizations, AI-specific observability tools and frameworks, chaos engineering, systematic resilience testing, open-source infrastructure or ML tooling Engineering Technology Anthropic https://logos.yubhub.co/anthropic.com.png Anthropic is a company that creates reliable, interpretable, and steerable AI systems. It has a quickly growing team of researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems. https://job-boards.greenhouse.io https://job-boards.greenhouse.io/anthropic/jobs/5113224008 San Francisco, CA | New York City, NY | Seattle, WA 2026-03-08 10798a1e-9fa Staff Software Engineer, AI Reliability Engineering About Anthropic

About the Role

Claude has your back. AIRE has Claude's. Help us keep Claude reliable for everyone who depends on it.

Responsibilities

Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity.
Design and implement monitoring and observability systems across the token path.
Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers.
Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.
Support the reliability of safeguard model serving -- critical for both site reliability and Anthropic's safety commitments.

You may be a good fit if you

Have strong distributed systems, infrastructure, or reliability backgrounds -- we're looking for reliability-minded software engineers and SREs.
Are curious and brave -- comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don't have deep expertise yet.
Think holistically about how systems compose and where the seams are.
Can build lasting relationships across teams -- our engagement model depends on being welcomed as teammates, not outsiders with opinions.
Care about users and feel ownership over outcomes, even for systems you don't own.
Have excellent communication and collaboration skills -- you'll be partnering across the entire company.
Bring diverse experience -- the team's strength comes from people who've built product stacks, scaled databases, run massive distributed systems, and everything in between.

Strong candidates may also

Have worked as an SRE, Production Engineer, or in a similar reliability-focused role on large-scale systems.
Have experience operating large-scale model serving or training infrastructure (>1000 GPUs).
Have experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium).
Understand ML-specific networking optimizations like RDMA and InfiniBand.
Have expertise in AI-specific observability tools and frameworks.
Have experience with chaos engineering and systematic resilience testing.
Have contributed to open-source infrastructure or ML tooling.

Logistics

Salary

The annual compensation range for this role is €235,000 - €295,000 EUR.

How we're different

]]> full-time staff hybrid €235.000 - €295.000EUR distributed systems, infrastructure, reliability, software engineering, SRE, large scale systems, model serving, training infrastructure, ML hardware accelerators, RDMA, InfiniBand, AI-specific observability tools, chaos engineering, resilience testing, open-source infrastructure, ML tooling, communication, collaboration, diverse experience, product stacks, databases, distributed systems Engineering Technology Anthropic https://logos.yubhub.co/anthropic.com.png Anthropic is a company that creates reliable, interpretable, and steerable AI systems. It has a quickly growing team of researchers, engineers, policy experts, and business leaders. https://job-boards.greenhouse.io https://job-boards.greenhouse.io/anthropic/jobs/5101169008 Dublin 2026-03-08 3514d749-08c Senior Support Engineer Senior Support Engineer - San Francisco

Location

San Francisco

Employment Type

Full time

Department

Compensation

$234K – $260K • Offers Equity

The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.

Benefits

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts

Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)

401(k) retirement plan with employer match

Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)

Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees

13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)

Mental health and wellness support

Employer-paid basic life and disability coverage

Annual learning and development stipend to fuel your professional growth

Daily meals in our offices, and meal delivery credits as eligible

Relocation support for eligible employees

Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.

About the Team

The Technical Support team is responsible for ensuring that developers and enterprises can reliably build mission critical solutions using OpenAI models. We provide technical guidance, resolve complex issues and support customers in maximizing value and adoption from deploying our highly-capable models. We work closely with Technical Success, Product, Engineering and others to deliver the best possible experience to our customers at scale. We think from an automation-first mindset and leverage the latest in AI to scale our support operations. Join the Senior Support Engineering (SSE) team at OpenAI and help shape the future of Technical Support in the age of AI.

About the Role

We are looking for a Senior Support Engineer to collaborate directly with our strategic enterprise accounts and product teams, helping solve some of the most difficult problems faced by our Customers. You will be part of the best technical troubleshooting team at OpenAI, and our Customers and Engineering teams will look to you for technical guidance in addressing the most technically difficult issues in our environment.

As a Senior Support Engineer, you will design and run the operational processes for monitoring our top strategic customers and operating a 24x7 response team. You’ll work closely with our Infrastructure and Engineering teams to deliver the best possible experience to customers at scale. Working directly with our most strategic customers, you will be crucial to the success of the most innovative, disruptive, and high-scale AI solutions being built with the OpenAI API platform.

The nature of this role will be low volume, high difficulty.

This role is based in San Francisco, CA. We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.

In this role, you will:

Be among the foremost technical and troubleshooting experts for our API platform at OpenAI. You are the last line of defense before the core Engineering team.

Proactively identify and implement opportunities to scale support operations by leveraging automation and advancements in AI technologies. Contribute to shaping the future of technical support in an AI-driven era.

Configure and use advanced monitoring and alerting workflows to proactively detect customer impacting issues in real time.

In partnership with engineering, contribute to reliability reviews and preparedness for new features, launches, or strategic customer requirement updates. Ensure that operational readiness (monitoring, alerting, and fallback plans) is in place for any such changes.

Design and refine incident response processes and documentation across strategic customers, engineering and support teams.

Analyze operational metrics and incident RCAs to identify areas for improvement. Proactively recommend and implement enhancements to monitoring dashboards, alert configurations, and support workflows.

Provide support coverage during holidays and weekends based on business needs.

You might thrive in this role if you:

Have a Bachelor’s degree in Computer Science or a related field. A strong software engineering foundation is important for this role’s success.

Have 8+ years of experience in technical operations roles such as SRE/NOC, designing monitoring systems and resolving production issues in fast-paced and mission-critical environments. A strong track record of troubleshooting complex technical problems at the systems level.

Have deep familiarity with modern monitoring, alerting, and observability practices. Hands‑on experience setting up or managing metrics, logging, and tracing for distributed systems (e.g., understanding of SLIs/SLOs, alert tuning, dashboard creation).

Have proven experience leading incident response for high‑severity outages or service disruptions. Able to perform real‑time incident coordination, root cause analysis, and communication with stakeholders.

Are able to work effectively in a fast-paced environment, prioritize tasks, and manage multiple projects simultaneously.

Are a strong communicator and team player, with excellent written and verbal communication skills.

Are able to adapt to changing priorities and requirements, and are flexible in your approach to problem-solving.

]]> full-time senior hybrid $234K – $260K Bachelor’s degree in Computer Science or a related field, 8+ years of experience in technical operations roles such as SRE/NOC, Designing monitoring systems and resolving production issues in fast-paced and mission-critical environments, Troubleshooting complex technical problems at the systems level, Modern monitoring, alerting, and observability practices, Metrics, logging, and tracing for distributed systems, SLIs/SLOs, alert tuning, dashboard creation, Incident response for high‑severity outages or service disruptions, Real-time incident coordination, root cause analysis, and communication with stakeholders, Automation and advancements in AI technologies, Automation-first mindset and leveraging the latest in AI to scale support operations, Technical and troubleshooting expertise for API platform at OpenAI, Proactive identification and implementation of opportunities to scale support operations, Advanced monitoring and alerting workflows to proactively detect customer impacting issues in real time, Reliability reviews and preparedness for new features, launches, or strategic customer requirement updates, Operational readiness (monitoring, alerting, and fallback plans), Incident response processes and documentation across strategic customers, engineering and support teams, Operational metrics and incident RCAs to identify areas for improvement, Enhancements to monitoring dashboards, alert configurations, and support workflows Engineering Technology OpenAI https://logos.yubhub.co/openai.com.png OpenAI is a technology company that develops and offers artificial intelligence (AI) models and tools. It was founded in 2015 and is headquartered in San Francisco, California. https://jobs.ashbyhq.com https://jobs.ashbyhq.com/openai/5431666c-530b-49c0-b67e-32477f9eaf5e San Francisco 2026-03-06 237ffb32-054 Software Engineer, Security Observability Software Engineer, Security Observability

Location

Remote - US

Employment Type

Full time

Location Type

Remote

Department

Security

Compensation

$234.4K – $385K • Offers Equity

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts

Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)

401(k) retirement plan with employer match

Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)

Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees

13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)

Mental health and wellness support

Employer-paid basic life and disability coverage

Annual learning and development stipend to fuel your professional growth

Daily meals in our offices, and meal delivery credits as eligible

Relocation support for eligible employees

Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.

More details about our benefits are available to candidates during the hiring process.

This role is at-will and OpenAI reserves the right to modify base pay and other compensation components at any time based on individual performance, team or company results, or market conditions.

About the Team

Security is at the foundation of OpenAI’s mission to ensure that artificial general intelligence benefits all of humanity.

The Security team protects OpenAI’s technology, people, and products. We are technical in what we build but are operational in how we do our work, and are committed to supporting all products and research at OpenAI. Our Security team tenets include: prioritizing for impact, enabling researchers, preparing for future transformative technologies, and engaging a robust security culture.

About the Role

We are seeking a Software Engineer, Security Observability to join our Security team. In this role, you will be responsible for building secure, scalable systems that enhance our security observability infrastructure. Leveraging your strong engineering skills, you will collaborate with cross-functional teams to develop, deploy, and maintain robust software solutions that support our security and detection capabilities.

This role is open to remote employees, or relocation assistance is available to one of our OpenAI offices in San Francisco, Seattle, or New York City.

In this role, you will:

Design and develop scalable software systems that facilitate security observability across our infrastructure.

Build and maintain data pipelines that centralize and store security-relevant data from diverse sources.

Proactively improve the resilience and reliability of data systems to ensure high platform availability.

Collaborate closely with Detection & Response (D&R) and other security teams to reduce the company’s security risk.

Contribute to data engineering in support of forensic investigations and compliance efforts.

You might thrive in this role if you have:

Strong software engineering experience, with proficiency in programming languages such as Python, Golang, or similar.

A background in infrastructure as code, with experience using tools like Terraform and working with cloud platforms such as Azure.

Experience with building and maintaining data pipelines, particularly for security-related use cases.

A generalist engineering mindset, with the flexibility to pivot between various technical domains such as databases, site reliability engineering (SRE), or security.

The ability to collaborate effectively with security and engineering teams to understand evolving data needs and implement scalable solutions.

A proactive and detail-oriented approach to problem-solving, with a focus on improving security data visibility and forensic capabilities.

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

]]> full-time mid remote $234.4K – $385K Python, Golang, Terraform, Azure, data pipelines, security-related use cases, databases, site reliability engineering (SRE), security, infrastructure as code, cloud platforms, forensic investigations, compliance efforts Engineering Technology OpenAI https://logos.yubhub.co/openai.com.png OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. https://jobs.ashbyhq.com https://jobs.ashbyhq.com/openai/92bf4ff3-7acf-4e49-8e09-47e4e8bd1f83 Remote - US 2026-03-06 edcdad0c-360 Software Engineer, Security Observability Software Engineer, Security Observability

Location

San Francisco

Employment Type

Full time

Location Type

Hybrid

Department

Security

Compensation

$234.4K – $385K • Offers Equity

Benefits

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts

Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)

401(k) retirement plan with employer match

Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)

Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees

13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)

Mental health and wellness support

Employer-paid basic life and disability coverage

Annual learning and development stipend to fuel your professional growth

Daily meals in our offices, and meal delivery credits as eligible

Relocation support for eligible employees

Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.

About the Team

Security is at the foundation of OpenAI’s mission to ensure that artificial general intelligence benefits all of humanity.

About the Role

This role is open to remote employees, or relocation assistance is available to one of our OpenAI offices in San Francisco, Seattle, or New York City.

In this role, you will:

Design and develop scalable software systems that facilitate security observability across our infrastructure.

Build and maintain data pipelines that centralize and store security-relevant data from diverse sources.

Proactively improve the resilience and reliability of data systems to ensure high platform availability.

Collaborate closely with Detection & Response (D&R) and other security teams to reduce the company’s security risk.

Contribute to data engineering in support of forensic investigations and compliance efforts.

You might thrive in this role if you have:

Strong software engineering experience, with proficiency in programming languages such as Python, Golang, or similar.

A background in infrastructure as code, with experience using tools like Terraform and working with cloud platforms such as Azure.

Experience with building and maintaining data pipelines, particularly for security-related use cases.

A generalist engineering mindset, with the flexibility to pivot between various technical domains such as databases, site reliability engineering (SRE), or security.

The ability to collaborate effectively with security and engineering teams to understand evolving data needs and implement scalable solutions.

A proactive and detail-oriented approach to problem-solving, with a focus on improving security data visibility and forensic capabilities.

About OpenAI

]]> full-time mid hybrid $234.4K – $385K • Offers Equity Python, Golang, Terraform, Azure, data pipelines, security-related use cases, databases, site reliability engineering (SRE), security, infrastructure as code, cloud platforms, data engineering, forensic investigations, compliance efforts Engineering Technology OpenAI https://logos.yubhub.co/openai.com.png OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. The company pushes the boundaries of the capabilities of AI systems and seeks to safely deploy them to the world through their products. https://jobs.ashbyhq.com https://jobs.ashbyhq.com/openai/3e254907-5101-438d-8708-f6f34e5c75ea San Francisco 2026-03-06 88643d65-f58 Software Engineer, Security Observability Software Engineer, Security Observability

Location

Seattle

Employment Type

Full time

Department

Security

Compensation

$234.4K – $385K • Offers Equity

Benefits

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts

Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)

401(k) retirement plan with employer match

Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)

Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees

13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)

Mental health and wellness support

Employer-paid basic life and disability coverage

Annual learning and development stipend to fuel your professional growth

Daily meals in our offices, and meal delivery credits as eligible

Relocation support for eligible employees

Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.

About the Team

Security is at the foundation of OpenAI’s mission to ensure that artificial general intelligence benefits all of humanity.

About the Role

This role is open to remote employees, or relocation assistance is available to one of our OpenAI offices in San Francisco, Seattle, or New York City.

In this role, you will:

Design and develop scalable software systems that facilitate security observability across our infrastructure.

Build and maintain data pipelines that centralize and store security-relevant data from diverse sources.

Proactively improve the resilience and reliability of data systems to ensure high platform availability.

Collaborate closely with Detection & Response (D&R) and other security teams to reduce the company’s security risk.

Contribute to data engineering in support of forensic investigations and compliance efforts.

You might thrive in this role if you have:

Strong software engineering experience, with proficiency in programming languages such as Python, Golang, or similar.

A background in infrastructure as code, with experience using tools like Terraform and working with cloud platforms such as Azure.

Experience with building and maintaining data pipelines, particularly for security-related use cases.

A generalist engineering mindset, with the flexibility to pivot between various technical domains such as databases, site reliability engineering (SRE), or security.

The ability to collaborate effectively with security and engineering teams to understand evolving data needs and implement scalable solutions.

A proactive and detail-oriented approach to problem-solving, with a focus on improving security data visibility and forensic capabilities.

About OpenAI

]]> full-time mid remote $234.4K – $385K Python, Golang, Terraform, Azure, data pipelines, security-related use cases, databases, site reliability engineering (SRE), security, infrastructure as code, cloud platforms, data engineering, forensic investigations, compliance efforts Engineering Technology OpenAI https://logos.yubhub.co/openai.com.png OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. The company was founded in 2015 and has since grown to become a leading player in the field of artificial intelligence. https://jobs.ashbyhq.com https://jobs.ashbyhq.com/openai/747bb870-4ef1-4bfd-b2c0-d48042a85080 Seattle 2026-03-06 7f4e2dd8-338 Software Engineer, Security Observability Software Engineer, Security Observability

Location

New York City

Employment Type

Full time

Location Type

Hybrid

Department

Security

Compensation

$325K – $405K • Offers Equity

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts

Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)

401(k) retirement plan with employer match

Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)

Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees

13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)

Mental health and wellness support

Employer-paid basic life and disability coverage

Annual learning and development stipend to fuel your professional growth

Daily meals in our offices, and meal delivery credits as eligible

Relocation support for eligible employees

Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.

More details about our benefits are available to candidates during the hiring process.

This role is at-will and OpenAI reserves the right to modify base pay and other compensation components at any time based on individual performance, team or company results, or market conditions.

About the Team

Security is at the foundation of OpenAI’s mission to ensure that artificial general intelligence benefits all of humanity.

About the Role

This role is open to remote employees, or relocation assistance is available to one of our OpenAI offices in San Francisco, Seattle, or New York City.

In this role, you will:

Design and develop scalable software systems that facilitate security observability across our infrastructure.

Build and maintain data pipelines that centralize and store security-relevant data from diverse sources.

Proactively improve the resilience and reliability of data systems to ensure high platform availability.

Collaborate closely with Detection & Response (D&R) and other security teams to reduce the company’s security risk.

Contribute to data engineering in support of forensic investigations and compliance efforts.

You might thrive in this role if you have:

Strong software engineering experience, with proficiency in programming languages such as Python, Golang, or similar.

A background in infrastructure as code, with experience using tools like Terraform and working with cloud platforms such as Azure.

Experience with building and maintaining data pipelines, particularly for security-related use cases.

A generalist engineering mindset, with the flexibility to pivot between various technical domains such as databases, site reliability engineering (SRE), or security.

The ability to collaborate effectively with security and engineering teams to understand evolving data needs and implement scalable solutions.

A proactive and detail-oriented approach to problem-solving, with a focus on improving security data visibility and forensic capabilities.

About OpenAI

]]> full-time mid hybrid $325K – $405K • Offers Equity Python, Golang, Terraform, Azure, data pipelines, security-related use cases, databases, site reliability engineering (SRE), security, infrastructure as code, cloud platforms, forensic investigations, compliance efforts Engineering Technology OpenAI https://logos.yubhub.co/openai.com.png OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. https://jobs.ashbyhq.com https://jobs.ashbyhq.com/openai/1e4e9985-babf-4bd9-8fe8-a2016250780d New York City 2026-03-06 6308fa9f-2f4 Member of Technical Staff - Principal Data Infrastructure Engineer Summary

Microsoft AI is looking for a Member of Technical Staff - Principal Data Infrastructure Engineer at its Redmond office. The role centers on architecting and operating the Big Data infrastructure behind mission-critical AI applications, with a strong emphasis on DevOps and SRE practices.

About the Role

As a Member of Technical Staff - Principal Data Infrastructure Engineer, you will be responsible for architecting and maintaining scalable, reliable, and observable Big Data Infrastructure for mission-critical AI applications. You will champion DevOps and SRE best practices—automated deployments, service monitoring, and incident response. You will build a self-service big data platform that empowers data and platform engineers and researchers. You will develop robust CI/CD pipelines and automate infrastructure provisioning using Infrastructure as Code tools (Bicep, Terraform, ARM).

Accountabilities

Architect and maintain scalable, reliable, and observable Big Data Infrastructure for mission-critical AI applications.
Champion DevOps and SRE best practices—automated deployments, service monitoring, and incident response.

The Candidate we're looking for

Experience:

Master’s Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND 4+ years experience in business analytics, data science, software development, data modeling, or data engineering OR Bachelor’s Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND 6+ years experience in business analytics, data science, software development, data modeling, or data engineering OR equivalent experience.

Technical skills:

4+ years in Big Data Infrastructure, DevOps, SRE, or Platform Engineering.
3+ years of hands-on experience managing and scaling distributed systems—from bare-metal to cloud-native environments.
2+ years deploying containerized applications using Kubernetes and Helm/Kustomize.
Solid scripting and automation skills using Python, Bash, or PowerShell.

Personal attributes:

Excellent interpersonal and communication skills, with a solid passion for mentorship and continuous learning.

Benefits

Starting January 26, 2026, Microsoft AI employees who live within a 50-mile commute of a designated Microsoft office in the U.S. or 25-mile commute of a non-U.S., country-specific location are expected to work from the office at least four days per week.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more.
Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

]]> full-time staff hybrid Big Data Infrastructure, DevOps, SRE, Platform Engineering, Python, Bash, PowerShell, Kubernetes, Helm/Kustomize, Databricks, IAM, OAuth, Kerberos, Azure, AWS, GCP Engineering Technology Microsoft AI https://logos.yubhub.co/microsoft.ai.png Microsoft continues to push the boundaries of AI, aiming to build systems with true artificial intelligence across agents, applications, services, and infrastructure, making AI accessible to all. https://microsoft.ai https://microsoft.ai/job/member-of-technical-staff-principal-data-infrastructure-engineer/ Redmond 2026-03-06 bed7736d-0a7 Browser Infrastructure Engineer This role exists to build reliable, automated, and scalable infrastructure for Chromium-based browser teams. As a Browser Infrastructure Engineer, you will focus on CI/CD pipelines, monitoring, and development environments to support fast-paced browser innovation.

What you'll do

You will set up and maintain CI/CD pipelines for builds and testing, support and evolve Chromium browser development infrastructure, configure monitoring and alerting systems, manage cloud infrastructure, develop automation scripts, and ensure high availability, resilience, and security of development infrastructure.

What you need

You will need 3+ years in software development infrastructure, preferably for Chromium browsers; hands-on DevOps and SRE experience, including monitoring and incident management; proficiency in k8s, Terraform, Datadog, Sentry, AWS, Unix, and TeamCity; strong CI/CD implementation skills; and the ability to thrive in Agile teams with excellent communication.

]]> full-time mid remote software development infrastructure, CI/CD pipelines, monitoring and alerting systems, cloud infrastructure, automation scripts, DevOps and SRE experience, k8s, Terraform, Datadog, Sentry, AWS, Unix, TeamCity Engineering Technology Perplexity https://logos.yubhub.co/perplexity.com.png Perplexity is a young, fast-growing company building a Chromium-based browser. It is committed to building reliable, automated, and scalable infrastructure for its browser development teams. https://jobs.ashbyhq.com https://jobs.ashbyhq.com/perplexity/7bce0fcf-eef6-41aa-9243-896f07a0316e Belgrade 2026-03-04 c4e68b15-a2a Produktionsplaner / Produktionssteurer (m/w/d) Summary

FUCHS LUBRICANTS GERMANY GmbH is looking for a Produktionsplaner / Produktionssteurer (m/w/d), i.e. a production planner / production controller, at its Wedel site. The role plans and coordinates production and filling orders and ensures they are completed on schedule, in close coordination with neighbouring departments.

About the Role

We are looking for a Produktionsplaner / Produktionssteurer (m/w/d) for our manufacturing site in Wedel. In this role, you are responsible for planning and coordinating production and filling orders. You track all production orders and ensure they are completed on schedule, working closely with Purchasing, Logistics, Production, Quality, Customer Service, and Sales.

Accountabilities

Planning and coordinating production and filling orders to make optimal use of capacity
Tracking all production orders and ensuring on-time completion

The Candidate we're looking for

Experience:

Completed commercial or technical vocational training, or a degree (e.g. business administration with a focus on supply chain, logistics, or production)

Technical skills:

Experience in planning and scheduling functions, e.g. production planning, production control, operational purchasing, or sales

Personal attributes:

Analytical thinking, a structured way of working, high resilience under stress, and assertiveness

Benefits

Work-life balance (including flexible working-time models, flextime, 30 days of vacation, and leave options)
A secure future in a dynamic, globally operating company

]]> full-time mid onsite Production planning, production control, purchasing, logistics, production, quality, customer service, sales, analytical thinking, structured working style, high stress resilience, assertiveness Operations Manufacturing FUCHS LUBRICANTS GERMANY GmbH https://logos.yubhub.co/jobs.fuchs.com.png FUCHS LUBRICANTS GERMANY GmbH is the largest operating company of the global FUCHS Group with its headquarters in Mannheim and develops, produces and distributes high-quality lubricants and adjacent chemical specialties for the German and international market. https://jobs.fuchs.com https://jobs.fuchs.com/job/Wedel-Produktionsplaner-Produktionssteurer-%28mwd%29-SH-22880/1365989033/ Wedel 2026-02-19 7634df8f-923 HR Manager (m/w/d) Summary

FUCHS LUBRICANTS GERMANY GmbH is looking for an HR Manager (m/w/d) at its Mannheim office. The role is the first point of contact for employees and management in day-to-day HR matters and supports the assigned department across the entire employee lifecycle, working closely with the HR team at all German locations and with the works council.

About the Role

As HR Manager (m/w/d), you will be the first point of contact for employees and management in all basic HR matters of day-to-day business (e.g. labour and collective bargaining law, remuneration issues, performance and potential management). You will guide and support employees in your assigned department along the entire employee lifecycle, from planning and recruitment through development to exit. You will plan and implement personnel measures, taking into account legal, contractual, collective bargaining and staffing requirements. You will work closely with the HR team at all German locations and maintain a trusting and constructive relationship with the works council. You will participate in HR projects (e.g. process optimisation, digitalisation, employer branding) with a focus on operational implementation, and support cross-location HR initiatives.

Accountabilities

Act as first point of contact for employees and management in day-to-day HR matters
Plan and implement personnel measures in line with legal, contractual and collective bargaining requirements

The Candidate we're looking for

Experience:

Completed degree in business or economics, psychology, social sciences, or a comparable qualification with initial professional experience in human resources

Technical skills:

Basic knowledge of labour law and a basic understanding of works constitution procedures

Personal attributes:

Strong service orientation, strong communication skills and enjoyment of working with employees and managers in day-to-day operations

Benefits

Work-life balance (including flexible working time models, flexitime, 30 days of holiday, options for leave of absence)
A secure future in a dynamic, globally active company

]]> full-time mid onsite Labour law, works constitution law, human resources, digitalisation, employer branding HR Manufacturing FUCHS LUBRICANTS GERMANY GmbH https://logos.yubhub.co/jobs.fuchs.com.png FUCHS LUBRICANTS GERMANY GmbH is the largest operating company of the global FUCHS Group with its headquarters in Mannheim and develops, produces and distributes high-quality lubricants and adjacent chemical specialties for the German and international market. https://jobs.fuchs.com https://jobs.fuchs.com/job/Mannheim-HR-Manager-%28mwd%29-BW-68169/1291919601/ Mannheim 2026-02-12 938f7e18-10b Praktikum HR Business Partner - Produktion & Logistik What you'll do

You'll manage daily operations of the facility, ensuring equipment runs smoothly and maintenance schedules stay on track.

Coordinate maintenance schedules and ensure equipment operates efficiently throughout the day
Respond to urgent repair requests within 2-hour SLA windows using our ticketing system
Manage relationships with external contractors and vendors, getting quotes and overseeing work quality
Track facility costs and identify opportunities to reduce waste whilst maintaining standards
Lead weekly safety inspections and ensure compliance with health and safety regulations

What you need

To succeed in this role, you'll need hands-on facilities experience and strong problem-solving skills.

3+ years facilities maintenance experience in a commercial or industrial environment
Proven ability to manage contractors and vendors effectively whilst staying within budget
Strong electrical and mechanical troubleshooting skills - you can diagnose issues quickly
Comfortable using CMMS software (Maximo, SAP, or similar) to log jobs and track work
Understanding of health and safety regulations and how they apply to facilities work

Why this matters

This role keeps a world-championship-winning F1 team running. When equipment fails, races can be lost, so your work directly impacts performance. You'll develop deep expertise in high-spec facilities and have clear progression into senior facilities management roles. The F1 environment means you'll work with cutting-edge building systems and learn from the best in the industry.

]]> full-time entry onsite Holistic personnel management, HR process management and optimisation, advising managers, transformation processes, risk identification and analysis in HR, personnel and labour law, HR administration, SAP knowledge HR Automotive Dr. Ing. h.c. F. Porsche AG https://logos.yubhub.co/jobs.porsche.com.png Porsche is a valuable brand with worldwide appeal and a loyal customer base around the globe. The way we work together and hold together as a team is unique. Our Miteinander is shaped by our strong Porsche culture: Heartblood | Sportiness | Pioneer spirit | A family https://jobs.porsche.com https://jobs.porsche.com/index.php?ac=jobad&id=18021 Sachsenheim bei Stuttgart 2025-12-08 c900cb93-d8d Empowering Climate-Positive Generations What you'll do

Create, check, and evaluate protection, measurement, and control concepts as well as circuit diagrams for electrical systems
Plan and carry out tenders, including drafting technical specifications, service specifications, and job descriptions
Lead bidder discussions, evaluate offers, and prepare well-founded decision documents for project management
Manage and coordinate electrical projects in the medium- and high-voltage range in compliance with schedule, cost, and quality requirements
Manage interfaces and coordinate with customers, network operators, certifiers, and contractors to ensure smooth project progress
Accompany conformity tests and commissioning of switchgear and generators, taking into account relevant standards and market requirements

What you need

Ability to read, create, and evaluate protection, measurement, and control concepts (e.g. E-Plan, SLD)
Knowledge of current communication protocols (IEC 61850, IEC 60870-5-101/104, Modbus, etc.)
Familiarity with energy market interfaces and their integration into network and control technology
In-depth knowledge of technical connection rules in medium-voltage networks (VDE-AR-N 4105/4110)
Experience in project management of electrical projects and switchgear at medium and high voltage (schedule, cost, resource management)
A degree in electrical engineering, energy technology, or similar, or a comparable qualification
Several years of professional experience in consulting, at an engineering service provider, and/or in the energy sector
Practical experience in project management, e.g. in the energy sector
Entrepreneurial thinking and sales affinity, ideally with an existing network in the relevant field
Strong interest in current trends in the energy industry and new technologies
A structured and independent working style
Strong communication and coordination skills
Ability to address complex technical issues in a target-oriented manner
Fluent spoken and written German and English
General willingness to travel and flexibility

Why this matters

We work together on complex technical, energy-economic, and conceptual challenges in shaping a sustainable future. Our actions are guided by our IE2S purpose: EMPOWERING CLIMATE-POSITIVE GENERATIONS!

]]> full-time mid onsite Protection, measurement and control concepts, current communication protocols, energy market interfaces, technical connection rules, project management, electrical engineering, energy sector, entrepreneurial thinking, sales affinity, communication and coordination skills Engineering Technology MHP - A Porsche Company https://logos.yubhub.co/jobs.porsche.com.png As Intelligent Energy System Services GmbH, we know that the mobility and energy transition can only be achieved with concentrated power - with the bundling of the right competences and skills. We work together on complex technical, energy-economic, and conceptual challenges in shaping a sustainable future. https://jobs.porsche.com https://jobs.porsche.com/index.php?ac=jobad&id=17773 Stuttgart 2025-12-08