Senior Infrastructure Engineer

32334977-1bd Senior Infrastructure Engineer

About Us

Descript is on a mission to make audio and video content creation and editing fast, easy, and accessible to all. We are building a cutting-edge media editor incorporating real-time collaboration, ground-breaking UX, and cutting-edge AI.

Job Description

As a Senior Infrastructure Engineer, you will drive projects that let engineers better understand and improve the performance, availability, and quality of what they ship. You will own and improve the core production infrastructure and building blocks upon which other engineers depend.

Responsibilities

Develop technical and business solutions that enable engineers to improve the quality and reliability of product features and systems that they build.
Drive improvements to the reliability of our core infrastructure, such as production clusters, networking, databases, and observability systems.
Champion best practices during reviews of code, technical designs, and launch plans.
Own our incident management and fire drill processes.
Work with engineering leadership to set goals and prioritize production reliability.

Requirements

5+ years of experience in production/site-reliability engineering OR 5+ years of server-side software engineering with an interest in working on core infrastructure.
A solid understanding of at least two of: public cloud infrastructure, Linux systems administration, and DevOps tooling.
Basic coding skills to work on automation and technical guardrails.
Strong written and verbal communication skills, and the ability to collaborate with other functions.
Experience mentoring engineers, including code reviews, architecture discussions, and leadership skills.

Nice to Haves

Experience with:

TypeScript
Kubernetes
Google Cloud Platform
Terraform

The base salary range for this role is $191K-$250K.

XML job scraping automation by YubHub

Employment Type: full-time
Experience Level: senior
Workplace Type: remote
Salary Range: $191K-$250K
Skills: public cloud infrastructure, Linux systems administration, DevOps tooling, basic coding skills, strong written and verbal communication skills, TypeScript, Kubernetes, Google Cloud Platform, Terraform
Category: Engineering
Industry: Technology
Descript
https://logos.yubhub.co/descript.com.png
Descript is building a simple, intuitive, fully-powered editing tool for video and audio. It has around 150 employees.
https://descript.com/
https://job-boards.greenhouse.io/descript/jobs/7500000003
Remote, San Francisco, California, United States
2026-04-18

682f5f72-49b Senior Site Reliability Engineer, Edge - TS/SCI

Secure Every Identity, from AI to Human

Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organisations to safely embrace this new era. This work requires a relentless drive to solve complex challenges with real-world stakes. We are looking for builders and owners who operate with speed and urgency and execute with excellence.

This is an opportunity to do career-defining work. We're all in on this mission. If you are too, let's talk.

About the Team

At Okta, our motto is "Always On." Within the Technical Operations (TechOps) team, we live this mission by building the most reliable and performant systems on the planet. We empower organisations to do their most significant work by securely connecting any person, on any device, to the technologies they need.

The Role

We are seeking a Senior Site Reliability Engineer (SRE) to lead the evolution of our large-scale production systems. This role is designed for a technical expert who thrives on solving complex problems at scale and lives by the ethic: "If you have to do it twice, automate it." Based in the Washington, D.C. area, you will ensure our infrastructure maintains uncompromising reliability and performance while supporting critical national security missions in secure, restricted environments.

Security Requirement: Must be able to obtain and maintain a U.S. security clearance (Secret or Top Secret) to the extent required by U.S. Government contracts.

The selected candidate may be subject to drug testing to the extent required by U.S. Government contracts.

What You’ll Do

Infrastructure Leadership: Design, build, and oversee Okta’s production infrastructure, ensuring architectural integrity and peak performance.

Incident Engineering: Act as a senior escalation point for production incidents, conducting deep-dive root cause analysis and implementing permanent, automated preventive solutions.

Strategic Automation: Eliminate manual toil by developing sophisticated automation frameworks, evolving monitoring tools, and establishing rigorous technical documentation.

System Resilience: Optimize a highly available, large-scale environment, ensuring "Always On" service delivery across complex network topologies.

Mentorship: Provide technical guidance to the engineering organisation, championing SRE best practices and a culture of self-education.

What You’ll Bring

Core Requirements

Clearance: Active TS/SCI with Polygraph.

Compliance Expertise: Deep professional experience with FedRAMP and DoD IL6 frameworks.

Education: B.S. in Computer Science or equivalent technical experience.

Technical Expertise

Networking & Cloud Architecture: Mastery of AWS networking and security, including Transit Gateways, VPCs, Route Tables, ELBs, and NACLs.

Infrastructure as Code (IaC): Advanced experience automating enterprise-scale infrastructure via Terraform or CloudFormation.

Systems & Scripting: Expert-level Linux systems administration with proficiency in Go, Python, Bash, or Ruby.

Production Support: Proven success managing Docker containers and Java-based stacks (Apache/Tomcat) in high-security production environments.

Protocol Knowledge: Solid understanding of networking concepts, IP protocols, and multi-cloud infrastructure.

P24505


Employment Type: full-time
Experience Level: senior
Workplace Type: onsite
Salary Range: $159,000-$218,900 USD
Skills: AWS networking and security, Terraform or CloudFormation, Linux systems administration, Go, Python, Bash, or Ruby, Docker containers and Java-based stacks
Category: Engineering
Industry: Technology
Okta
https://logos.yubhub.co/okta.com.png
Okta is a software company that provides identity and access management solutions.
https://www.okta.com/
https://job-boards.greenhouse.io/okta/jobs/7562925
Washington, DC
2026-04-18

64fb6c63-a4b Senior Product Security Engineer, Red Team

Secure Every Identity, from AI to Human

Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organisations to safely embrace this new era.

This work requires a relentless drive to solve complex challenges with real-world stakes. We are looking for builders and owners who operate with speed and urgency and execute with excellence. This is an opportunity to do career-defining work. We're all in on this mission. If you are too, let's talk.

Within the Product Security team, our Red Team delivers robust security assurance for Okta's products, services, and infrastructure. You will be the team's dedicated infrastructure and tooling engineer, the first person in this role for a small team of operators. You will work alongside operators but not report through an operator chain; you'll collaborate as a peer focused on a different discipline.

We seek a Staff Security Infrastructure Engineer to own the engineering backbone that enables our operations. This is not a traditional operator role but a dedicated infrastructure, tooling, and automation engineering position embedded within the Red Team.

You will design, build, maintain, and continuously improve the platforms, infrastructure, and custom tooling that our operators depend on to execute engagements. Your work directly enables the team to operate at a higher maturity level: faster infrastructure deployment, more resilient and OPSEC-aware architecture, automated workflows, and reliable custom tooling, freeing operators to focus on the mission.

Your role will also extend to cultivating stakeholder collaboration and elevating our company’s security posture through strategic engagement and proactive measures. As the team matures, this role can evolve toward platform leadership, custom capability development, or a hybrid operator/engineer path.

Responsibilities

Infrastructure Engineering & Automation:

Own the full lifecycle of red team infrastructure: design, provisioning, configuration, maintenance, and teardown.
Build and maintain Infrastructure-as-Code (IaC) using Terraform (or equivalent) to automate deployment of C2 servers, redirectors, phishing infrastructure, payload-delivery systems, and supporting services.
Manage the resource and asset lifecycle by tracking domains, certificates, cloud accounts, recurring expenses, and infrastructure resources, including their acquisition, rotation, and retirement.

Tooling Development & Maintenance:

Develop, maintain, and improve custom tools, scripts, and automation to support red team operations (e.g., payload generation pipelines, log aggregation, C2 profile management, infrastructure health checks), providing on-demand infrastructure/tooling support when issues or gaps arise.
Collaborate closely with operators during engagement planning to understand infrastructure requirements, OPSEC constraints, and operational timelines.
Build and maintain a representative test environment for pre-operation validation of tools and tradecraft against a security stack similar to the target.
Maintain the team's source code repository with merge/pull request processes, documentation, and code quality standards.
Ensure engagement evidence, infrastructure logs, and operational data are centrally collected and accessible for reporting and after-action reviews.
Contribute to and maintain metrics that demonstrate infrastructure maturity, operational efficiency, and readiness (e.g., deployment time, rebuild time, infrastructure availability during engagements).

Security & OPSEC:

Design infrastructure with OPSEC as a first-class requirement: network segmentation, traffic separation between operations, credential management, and access controls.
Implement and manage secure access to red team infrastructure.
Create and update operational runbooks, infrastructure documentation, and SOPs for the team.
Maintain clear records of infrastructure ownership and attribution to support deconfliction processes.

Requirements

5+ years of professional experience in infrastructure engineering, DevOps, platform engineering, or a similar role with significant automation responsibilities
Strong familiarity with Terraform (or equivalent IaC tooling) for multi-cloud infrastructure provisioning and management
Experience operating in cloud-native, SaaS, or identity-focused environments
Strong proficiency with configuration management tools (Ansible, or equivalent)
Proficiency in at least one systems programming or scripting language (Python, Go, Bash) with disciplined development practices (version control, code review, testing, documentation)
Solid understanding of Linux systems administration, networking fundamentals (DNS, HTTP/S, TCP/IP, proxying, TLS), and cloud platforms (AWS, GCP, or Azure)
Understanding of OPSEC principles as they apply to offensive infrastructure: you know why redirector chains, domain categorization, traffic separation, and certificate management matter.

Desired Qualifications

Experience building and maintaining CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, or similar)
Familiarity with containerization and orchestration (Docker, Kubernetes) as applicable to tooling and lab environments
Familiarity with C2 frameworks (Cobalt Strike, Mythic, Sliver, or similar) from an infrastructure and deployment perspective; you don't need to operate them, but you need to understand what operators need from the infrastructure
Familiarity with detection evasion concepts as they relate to infrastructure (e.g., traffic shaping, hosting provider reputation, certificate transparency)

Nice to Have

Working knowledge of Blue Team operations and related technologies
Experience with security tool development (implant development, payload engineering, evasion tooling); this role can grow in that direction
Familiarity with Red Team maturity models and how infrastructure/tooling capabilities map to organisational maturity

Note: This is not an operator role. You will not be the person running hands-on-keyboard engagements as your primary function. While you may participate in operations to understand requirements or provide support, your core mission is ensuring the team's infrastructure, workflows, tooling, and automation are reliable, repeatable, and mature. You are the engineering foundation the operators build on.

(P22302_3403905)


Employment Type: full-time
Experience Level: senior
Workplace Type: hybrid
Salary Range: $114,000-$157,300 USD
Skills: Terraform, Infrastructure-as-Code, Linux systems administration, Networking fundamentals, Cloud platforms, Configuration management tools, Systems programming or scripting language, OPSEC principles, CI/CD pipelines, Containerization and orchestration, C2 frameworks, Detection evasion concepts
Category: Engineering
Industry: Technology
Okta
https://logos.yubhub.co/okta.com.png
Okta secures AI by building the trusted, neutral infrastructure that enables organisations to safely embrace this new era.
https://www.okta.com/
https://job-boards.greenhouse.io/okta/jobs/7773769
Toronto, Ontario, Canada
2026-04-18

1b773e5c-b51 IT Systems Engineer, Corporate Systems & Infrastructure

About the role
----------------

The Corporate Infrastructure team builds and operates the platform layer the rest of IT Engineering runs on: cloud infrastructure hosting our internal services, the CI/CD that ships IT's own code, the observability stack across the corporate environment, and the cross-system automation that wires together tools never designed to talk to each other.

You'll build deployment pipelines and internal tooling that let IT Engineering ship like a product team. You'll define SLOs for corporate services, build the monitoring to know when we're missing them, and run on-call for the things you deploy. You'll partner with our network and AV engineers as their infrastructure counterpart, automating physical-world systems and building the telemetry that tells us an office is degraded before someone files a ticket. The scope is broad and the team is deliberately small, which means you'll need depth across cloud, CI, and observability, strong judgment about where to invest, and a bias toward infrastructure-as-code over heroic manual fixes.

Responsibilities
---------------

Build and operate the cloud infrastructure that hosts IT's internal services
Design CI/CD pipelines that let IT Engineering ship through code review and automated testing
Own observability for corporate infrastructure: monitoring, alerting, dashboards, and SLOs
Write cross-system automation to integrate third-party systems and internal services
Partner with network, audiovisual, and physical security to deliver robust infrastructure solutions
Build internal tools (CLIs, bots, dashboards) that make other IT engineers faster
Run on-call for corporate infrastructure with post-incident reviews that drive durable fixes
Deploy infrastructure as code

Requirements
------------

8+ years building secure IT systems in complex environments
Excel at solving ambiguous problems with multiple stakeholders
Communicate technical concepts clearly to any audience
View IT Engineering as requiring product engineering rigor
Successfully deliver complex projects from conception to production
Write clear documentation as a natural part of your workflow
Have shipped Infrastructure as Code in production (Terraform or similar), with modules and state you maintained
Have run services with SLOs, on-call rotations, and post-incident reviews
Have built internal platforms or tooling that other engineers depend on

Strong candidates may also
-------------------------------

Have transformed traditional IT operations into engineering-driven organizations
Have built strong partnerships with Security and Engineering teams
Practice modern development methods (code reviews, testing, CI/CD)
Work effectively in distributed teams
Have experience with ECS, Kubernetes or other container orchestration for internal services
Have automated physical-world infrastructure deployment (e.g., network configuration, office technology, physical security systems)
Have worked with enterprise integration or workflow automation platforms (e.g., Workato, n8n, Tines, or equivalents)

Technical Skills
----------------

Python, Go, etc.
Terraform and Infrastructure as Code
Cloud platforms (AWS, GCP, Azure)
CI/CD pipeline design
Observability tooling (e.g., Prometheus, Grafana, Datadog, Honeycomb, or equivalent)
Linux systems administration
Strong networking skills
Configuration management

Experience Level: senior
Employment Type: full-time
Workplace Type: remote
Category: Engineering
Industry: Technology
Salary Range: $275,000-$325,000 USD

Required Skills:

Python
Terraform
Cloud platforms
CI/CD pipeline design
Observability tooling
Linux systems administration
Strong networking skills
Configuration management

Preferred Skills:

golang
ECS
Kubernetes
Enterprise integration or workflow automation platforms


Employment Type: full-time
Experience Level: senior
Workplace Type: remote
Salary Range: $275,000-$325,000 USD
Skills: Python, Terraform, Cloud platforms, CI/CD pipeline design, Observability tooling, Linux systems administration, Strong networking skills, Configuration management, golang, ECS, Kubernetes, Enterprise integration or workflow automation platforms
Category: Engineering
Industry: Technology
Anthropic
https://logos.yubhub.co/anthropic.com.png
Anthropic is a public benefit corporation that aims to create reliable, interpretable, and steerable AI systems.
https://www.anthropic.com/
https://job-boards.greenhouse.io/anthropic/jobs/4887952008
Remote-Friendly (Travel-Required) | San Francisco, CA | Seattle, WA | New York City, NY
2026-04-18

51758515-c12 Member of Technical Staff

We are seeking a highly skilled Member of Technical Staff to join our team in managing and enhancing reliability across a multi-data center environment.

This role focuses on automating processes, building and implementing robust observability solutions, and ensuring seamless operations for mission-critical AI infrastructure.

The ideal candidate will combine strong coding abilities with hands-on data center experience to build scalable reliability services, optimize system performance, and minimize downtime, including close partnership with facility operations to address physical infrastructure impacts.

In an era where AI workloads demand near-zero downtime, this position plays a pivotal role in bridging software engineering principles with physical data center realities.

By prioritizing automation and observability, team members in this role can reduce mean time to recovery (MTTR) by up to 50% through proactive monitoring and automated remediation, based on industry benchmarks from high-scale environments like those at hyperscale cloud providers.

Responsibilities:

Design, develop, and deploy scalable code and services (primarily in Python and Rust, with flexibility for emerging languages) to automate reliability workflows, including monitoring, alerting, incident response, and infrastructure provisioning.

Implement and maintain observability tools and practices, such as metrics collection, logging, tracing, and dashboards, to provide real-time insights into system health across multiple data centers; open to innovative stacks beyond traditional ones like ELK.

Collaborate with cross-functional teams, including software development, network engineering, site operations, and facility operations (critical facilities, mechanical/electrical teams, and data center infrastructure management), to identify reliability bottlenecks, automate solutions for fault tolerance, disaster recovery, capacity planning, and physical/environmental risk mitigation (e.g., power redundancy, cooling efficiency, and environmental monitoring integration).

Troubleshoot and resolve complex issues in data center environments, including hardware failures, environmental anomalies, software bugs, and network-related problems, while adhering to reliability principles like error budgets and SLAs.

Optimize Linux-based systems for performance, security, and reliability, including kernel tuning, container orchestration (e.g., Kubernetes or emerging alternatives), and scripting for automation.

Understand network topologies and concepts in large-scale, multi-data center environments to effectively troubleshoot connectivity, routing, redundancy, and performance issues; integrate observability into data center interconnects and facility-level controls for rapid diagnosis and automation.

Participate in on-call rotations, post-incident reviews (blameless postmortems), and continuous improvement initiatives to enhance overall site reliability, including joint exercises with facility teams for physical failover and recovery scenarios.

Mentor junior team members and document processes to foster a culture of automation, knowledge sharing, and adaptability to new technologies.

Basic Qualifications:

Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or a closely related technical field (or equivalent professional experience).

5+ years of hands-on experience in site reliability engineering (SRE), infrastructure engineering, DevOps, or systems engineering, preferably supporting large-scale, distributed, or production environments.

Strong programming skills with proven production experience in Python (required for automation and tooling); experience with Rust or willingness to work in Rust is a plus, but strong coding fundamentals in at least one systems-level language (e.g., Python, Go, C++) are essential.

Solid experience with Linux systems administration, performance tuning, kernel-level understanding, and scripting/automation in production environments.

Practical knowledge of containerization and orchestration technologies, such as Docker and Kubernetes (or similar systems).

Experience implementing observability solutions, including metrics, logging, tracing, monitoring tools (e.g., Prometheus, Grafana, or alternatives), alerting, and dashboards.

Familiarity with troubleshooting complex issues in distributed systems, including software bugs, hardware failures, network problems, and environmental factors.

Understanding of networking fundamentals (TCP/IP, routing, redundancy, DNS) in large-scale or multi-site environments.

Experience participating in on-call rotations, incident response, post-incident reviews (blameless postmortems), and reliability practices such as error budgets or SLAs.

Ability to collaborate effectively with cross-functional teams (software engineers, network teams, site/facility operations, mechanical/electrical teams).

Preferred Skills and Experience:

7+ years of experience in SRE or infrastructure roles, ideally in hyperscale, cloud, or AI/ML training infrastructure environments with multi-data center setups.

Hands-on experience operating or scaling Kubernetes clusters (or equivalent orchestration) at large scale, including automation for provisioning, lifecycle management, and high-availability.

Proficiency in Rust for systems programming and performance-critical components.

Direct experience integrating software reliability tools with physical data center infrastructure.

Experience with observability tools and practices, such as metrics collection, logging, tracing, and dashboards.

Familiarity with containerization and orchestration technologies, such as Docker and Kubernetes (or similar systems).

Experience with Linux systems administration, performance tuning, kernel-level understanding, and scripting/automation in production environments.


Employment Type: full-time
Experience Level: staff
Workplace Type: onsite
Skills: Python, Rust, Linux systems administration, performance tuning, kernel-level understanding, scripting/automation, containerization, orchestration, observability, metrics collection, logging, tracing, dashboards, networking fundamentals, TCP/IP, routing, redundancy, DNS, Kubernetes, Docker, Grafana, Prometheus, ELK, DevOps, SRE, infrastructure engineering, systems engineering
Category: Engineering
Industry: Technology
xAI
https://logos.yubhub.co/xai.com.png
xAI creates AI systems to understand the universe and aid humanity in its pursuit of knowledge.
https://www.xai.com/
https://job-boards.greenhouse.io/xai/jobs/5044403007
Memphis, TN
2026-04-18