Senior Infrastructure Engineer

About Us

Descript is on a mission to make audio and video content creation and editing fast, easy, and accessible to all. We are building a media editor that combines real-time collaboration, groundbreaking UX, and cutting-edge AI.

Job Description

As a Senior Infrastructure Engineer, you will drive projects that let engineers better understand and improve the performance, availability, and quality of what they ship. You will own and improve the core production infrastructure and building blocks that other engineers depend on.

Responsibilities

Develop technical and business solutions that enable engineers to improve the quality and reliability of the product features and systems they build.
Drive improvements to the reliability of our core infrastructure, such as production clusters, networking, databases, and observability systems.
Champion best practices during reviews of code, technical designs, and launch plans.
Own our incident management and fire drill processes.
Work with engineering leadership to set goals and prioritize production reliability.

Requirements

5+ years of experience in production/site-reliability engineering, OR 5+ years of server-side software engineering with an interest in working on core infrastructure.
A solid understanding of at least two of: public cloud infrastructure, Linux systems administration, and DevOps tooling.
Basic coding skills to work on automation and technical guardrails.
Strong written and verbal communication skills, and the ability to collaborate with other functions.
Experience mentoring engineers through code reviews and architecture discussions, plus leadership skills.

Nice to Haves

Experience with:

TypeScript
Kubernetes
Google Cloud Platform
Terraform

The base salary range for this role is $191K-$250K.

XML job scraping automation by YubHub

Employment type: full-time
Seniority: senior
Work arrangement: remote
Salary: $191K-$250K
Skills: public cloud infrastructure, Linux systems administration, DevOps tooling, basic coding skills, strong written and verbal communication skills, TypeScript, Kubernetes, Google Cloud Platform, Terraform
Category: Engineering, Technology
Company: Descript
Logo: https://logos.yubhub.co/descript.com.png
About the company: Descript is building a simple, intuitive, fully-powered editing tool for video and audio. It has around 150 employees.
Website: https://descript.com/
Job listing: https://job-boards.greenhouse.io/descript/jobs/7500000003
Location: Remote, San Francisco, California, United States
Date: 2026-04-18

Senior Infrastructure Engineer

Imagine being a pioneer, venturing through the uncharted territories of the cloud. You're not just navigating; you're shaping the landscape, constructing robust architectures that withstand the tests of time and scale.

At Mercury, your mission, should you choose to accept it, is to help steer our cloud infrastructure into the future. With projects as dynamic as migrating our entire fleet to ECS and building out our golden paths for service deployment, your role is pivotal. This isn't just a job; it's an epic tale of transformation and triumph.

As a senior member of our infrastructure team, you will be equipped with essential tools and technologies designed for scaling and enhancing Mercury's infrastructure:

AWS Services: Use EC2, RDS, IAM, Networking, OpenSearch, and ECS to build and manage robust cloud environments.
Terraform: Leverage Terraform for infrastructure as code to efficiently manage and provision our cloud resources.
Agentic Infrastructure: Build the frameworks around using AI safely in our infrastructure, both for the agents and the users that kick off those agents.
Monitoring and Observability Tools: Employ Prometheus, Grafana, OpenSearch, and OpenTelemetry to maintain high availability and monitor system health.
Version Control and CI/CD: Manage code and automate deployments using GitHub & GitHub Actions.

As we gear up for the next stages of Mercury's growth, you will:

Build our “Infrastructure Platform” to support the growing needs of the Engineering Organization.
Focus on building a platform that is AI-friendly while still usable for engineers. We want our users to be both humans and agents.
Lead key infrastructure projects, break down complex initiatives, and define our infrastructure strategy through detailed RFCs and technical specifications.

Must haves:

You have 5+ years of experience with AWS.
You have extensive experience, ideally 3 years or more, with observability and monitoring tools like Prometheus, Grafana, and OpenTelemetry, optimizing system performance and reliability.
You have demonstrated ability in technical writing, with at least 3 years of experience creating detailed technical documentation, RFCs, and tech specs that clearly communicate complex ideas.

The ideal candidate:

Brings at least 2 years of experience leading infrastructure projects in environments regulated under frameworks such as HITRUST or SOC 2, ensuring compliance and security.
Has 3+ years of experience managing large-scale Terraform implementations, including the setup and maintenance of Terraform CI/CD pipelines.
Has 2+ years of experience writing code. We are building an Infrastructure Platform from scratch, and there is plenty of code to write to support it.
Has experience mentoring and elevating those around them; we are force multipliers for the engineering org.

If this role interests you, we invite you to explore our public demo at demo.mercury.com.

The total rewards package at Mercury includes base salary, equity, and benefits. Our salary and equity ranges are highly competitive within the SaaS and fintech industries and are updated regularly using the most reliable compensation survey data for our industry. New hire offers are based on a candidate’s experience, expertise, geographic location, and internal pay equity relative to peers.

Our target new hire base salary ranges for this role are the following:

US employees: $200,700 - $250,900
Canadian employees: CAD $189,700 - $237,100


Employment type: full-time
Seniority: senior
Work arrangement: remote
Salary: $200,700 - $250,900 (US employees), CAD $189,700 - $237,100 (Canadian employees)
Skills: AWS, EC2, RDS, IAM, Networking, OpenSearch, ECS, Terraform, Prometheus, Grafana, OpenTelemetry, GitHub, GitHub Actions
Category: Engineering, Technology
Company: Mercury
Logo: https://logos.yubhub.co/demo.mercury.com.png
About the company: Mercury is a software company that provides cloud infrastructure services.
Website: https://demo.mercury.com
Job listing: https://job-boards.greenhouse.io/mercury/jobs/5832466004
Location: San Francisco, CA; New York, NY; Portland, OR; or remote within Canada or the United States
Date: 2026-04-17

Senior Infrastructure Engineer

We are open to hiring at multiple levels for this role, depending on experience, impact, and demonstrated ownership. While this role is level-agnostic, it is best suited for engineers with experience owning and working in highly ambiguous problem spaces.

About the company: The mining industry has steadily become worse at finding new ore deposits, requiring more than 10 times as much capital to make discoveries as it did 30 years ago. KoBold Metals builds AI models for mineral exploration and deploys those models, alongside our novel sensors, to guide decisions on KoBold-owned-and-operated exploration programs.

About The Role: In this role, you will partner with exploration and engineering teams to build reliable, scalable infrastructure that makes it easier to turn data and models into real-world exploration insights. You will improve observability, streamline MLOps workflows, and maintain shared tools like JupyterHub that enable faster experimentation and collaboration. Your work will help create a solid foundation for scientists and engineers to focus on discovery instead of infrastructure.

Responsibilities

Design, build, and operate compute infrastructure that is both scalable and reliable to support critical services.
Work closely with engineering teams to embed observability, reliability, and security throughout the software development process.
Create and maintain automation for monitoring, deployments, and incident response to keep operations efficient and predictable.
Lead or support capacity planning, performance reviews, and system tuning to ensure stable and efficient systems.
Join the on-call rotation and take part in incident response, troubleshooting, and resolution.
Develop and refine monitoring and alerting to catch issues early and reduce downtime.
Establish and maintain disaster recovery and business continuity practices that protect the organization against failures.
Regularly review and improve our tools and processes to strengthen system visibility and reliability.
Investigate points of fragility in distributed systems and understand how complex systems behave under stress in order to improve resilience.
Continually learn about mineral exploration through reading, discussions with exploration team members, periodic rotations on an exploration team, and time in the field with geologists.

Qualifications

5+ years of experience as an Infrastructure Engineer, Site Reliability Engineer or in a similar role
Strong scripting and programming skills (Python, Go, Java, or JavaScript/Node.js)
Experience with IaC tools like Terraform and container orchestration tools like Kubernetes and Docker
Experience with cloud platforms such as AWS
Experience operating or administering JupyterHub in a multi-user environment
Understanding of MLOps workflows, including model training, deployment, and related tooling
Excellent communication & collaboration skills and a continuous improvement mindset
Proven ability to troubleshoot complex issues and implement effective solutions
Proven ability to thrive in dynamic and evolving environments, effectively navigating uncertainty and incomplete information.
Proven ability to grow expertise, influence & educate others
Comfortable making informed decisions with limited data, adapting quickly to new circumstances, and maintaining focus on strategic objectives while driving clarity for the team.
Intellectual curiosity and eagerness to learn about all aspects of mineral exploration, particularly geology. You enjoy learning continuously, drive insights by using our tools in exploration, and are willing to work directly with geologists in the field.
Ability to explain technical problems to and collaborate on solutions with domain experts who are not infrastructure engineers. A strong communicator who enjoys working with colleagues across the company.
Excitement about joining a fast-growing early-stage company, comfort with a dynamic work environment, and eagerness to take on an evolving range of responsibilities.
Keen not just to build cool technology, but to figure out what technical product to build to best achieve the business objectives of the company.


Employment type: full-time
Seniority: senior
Work arrangement: remote
Salary: $170,000 - $230,000
Skills: scripting, programming, IaC, container orchestration, cloud platforms, MLOps workflows, observability, reliability, security, automation, monitoring, deployments, incident response, capacity planning, performance reviews, system tuning, disaster recovery, business continuity, tools, processes, distributed systems, complex systems, resilience, mineral exploration, geology
Category: Engineering, Technology
Company: KoBold Metals
Logo: https://logos.yubhub.co/koboldmetals.com.png
About the company: KoBold Metals is a privately held mineral exploration company and technology developer, with a portfolio of over 60 projects.
Website: https://koboldmetals.com/
Job listing: https://job-boards.greenhouse.io/koboldmetals/jobs/4002126005
Location: Remote
Date: 2026-04-17

Senior Infrastructure Engineer

Join our Infrastructure Engineering team and help ensure the reliability, scalability, and performance of Replit's infrastructure that serves millions of developers worldwide. As a Senior Infrastructure Engineer, you will bridge the gap between development and operations, implementing automation and establishing best practices that enable our platform to scale efficiently while maintaining high availability.

We are seeking Senior Infrastructure Engineers who are passionate about building and maintaining resilient systems at scale. Your mission will be to proactively find and analyse reliability problems across our stack, then design and implement software and systems to address them. You will build robust monitoring solutions, automate operational tasks, and continuously improve our infrastructure's reliability.

You Will:

Drive Automation and Infrastructure as Code: Build and improve automation to eliminate toil and operational work. Maintain CI/CD pipelines and infrastructure automation using tools like Terraform or Pulumi. Create self-healing systems that can automatically respond to common failure scenarios.
Optimise Performance and Infrastructure: Collaborate with core infrastructure and product teams to performance tune and optimise our cloud deployments (Kubernetes, Docker, GCP). Identify and resolve performance bottlenecks and implement capacity planning strategies.
Elevate Developer Experience: Design and implement improvements to our build, test, and deployment systems to make software delivery faster, safer, and more reliable for all engineers.
Drive Cross-Team Improvements: Partner with service owners across Replit to understand their pain points, and collaborate on implementing build/test/deploy enhancements within their specific services.
Build Shared Tooling: Create and maintain centralized tooling and automation that improves the engineering lifecycle, from local development to production monitoring.
Debug and Harden Systems: Dive deep into debugging difficult technical problems, making our systems and products more robust, operable, and easier to diagnose.
Collaborate on Design Reviews: Participate in feature and system design reviews, contributing expertise on security, scale, and operational considerations.
Build and Integrate: Write high-quality, well-tested code to meet the needs of your customers, including building pipelines to integrate with 3rd party vendors.

Required Skills and Experience:

4+ years of experience in Site Reliability Engineering or similar roles (DevOps, Systems Engineering, Infrastructure Engineering).
Strong programming skills in languages like Python or Go.
You write high-quality, well-tested code.
Solid understanding of distributed systems. You've built, scaled, and maintained production services and understand service-oriented architecture.
Experience with container orchestration platforms (Kubernetes) and cloud-native technologies.
Experience implementing and maintaining monitoring/observability solutions, with strong skills in debugging and performance tuning.
Strong incident management skills with experience participating in incident response and demonstrated critical thinking under pressure.
Experience with infrastructure as code (e.g., Terraform) and configuration management tools.
Excellent written and verbal communication skills, with an ability to explain technical concepts clearly.
A willingness to dive into understanding, debugging, and improving any layer of the stack.
You're passionate about making software creation accessible and empowering the next generation of builders.

Bonus Points:

Experience with Google Cloud Platform (GCP) services and tools.
Knowledge of modern observability platforms (Prometheus, Grafana, Datadog, etc.).
Experience building reliable systems capable of handling high throughput and low latency.
Experience with Go and Terraform.
Familiarity with working in rapid-growth environments.

_This is a full-time role that can be held from our Foster City, CA office. The role has an in-office requirement of Monday, Wednesday, and Friday._

Full-Time Employee Benefits Include:

Competitive Salary & Equity
401(k) Program with a 4% match
Health, Dental, Vision and Life Insurance
Short Term and Long Term Disability
Paid Parental, Medical, Caregiver Leave
Commuter Benefits
Monthly Wellness Stipend
Autonomous Work Environment
In Office Set-Up Reimbursement
Flexible Time Off (FTO) + Holidays
Quarterly Team Gatherings
In Office Amenities


Employment type: full-time
Seniority: senior
Work arrangement: hybrid
Salary: $190K - $240K
Skills: Site Reliability Engineering, DevOps, Systems Engineering, Infrastructure Engineering, Python, Go, Terraform, Kubernetes, Docker, GCP, Monitoring/observability solutions, Debugging and performance tuning, Incident management, Infrastructure as code, Configuration management tools, Google Cloud Platform (GCP) services and tools, Modern observability platforms (Prometheus, Grafana, Datadog, etc.), Building reliable systems capable of handling high throughput and low latency, Go and Terraform, Familiarity with working in rapid-growth environments
Category: Engineering, Technology
Company: Replit
Logo: https://logos.yubhub.co/replit.com.png
About the company: Replit is a software creation platform that enables anyone to build applications using natural language. With millions of users worldwide, Replit is a leading platform in the software development industry.
Website: https://jobs.ashbyhq.com
Job listing: https://jobs.ashbyhq.com/replit/16c85abc-763c-4f36-ab67-64f416343384
Location: Foster City, CA
Date: 2026-03-07