{"version":"0.1","company":{"name":"YubHub","url":"https://yubhub.co","jobsUrl":"https://yubhub.co/jobs/skill/observability-platforms"},"x-facet":{"type":"skill","slug":"observability-platforms","display":"Observability Platforms","count":7},"x-feed-size-limit":100,"x-feed-sort":"enriched_at desc","x-feed-notice":"This feed contains at most 100 jobs (the most recently enriched). For the full corpus, use the paginated /stats/by-facet endpoint or /search.","x-generator":"yubhub-xml-generator","x-rights":"Free to redistribute with attribution: \"Data by YubHub (https://yubhub.co)\"","x-schema":"Each entry in `jobs` follows https://schema.org/JobPosting. YubHub-native raw fields carry `x-` prefix.","jobs":[{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_5717691a-508"},"title":"Staff Infrastructure Software Engineer, Enterprise AI","description":"<p>We are looking for a Staff Infrastructure Software Engineer to act as a primary technical lead, engineering the &#39;paved road&#39; for our knowledge retrieval and inference engines. 
You will define the deployment standards for Agentic workflows at scale, bridging the gap between complex AI orchestration and world-class infrastructure.</p>\n<p>The ideal candidate thrives in a fast-paced environment, has a passion for both deep technical work and mentoring, and is capable of setting a long-term technical strategy for a critical domain while maintaining a strong, hands-on delivery focus.</p>\n<p>You will architect and implement solutions across multiple cloud providers (GCP, Azure, AWS) for customers in diverse, highly-regulated industries like healthcare, telecom, finance, and retail.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Architecting multi-cloud systems and abstractions to allow the SGP platform to run on top of existing Cloud providers.</li>\n<li>Using our own data and AI platform to analyse build and test logs and metrics to identify areas for improvement.</li>\n<li>Defining the architectural patterns for our multi-cloud infrastructure to support secure, reliable, and scalable Agentic workflows for enterprise customers.</li>\n<li>Enhancing engineering and infrastructure efficiency, reliability, accuracy, and response times, including CI/CD processes, test frameworks, data quality assurance, end-to-end reconciliation, and anomaly detection.</li>\n<li>Collaborating with platform and product teams to develop and implement innovative infrastructure that scales to meet evolving needs.</li>\n<li>Designing and championing highly scalable, reliable, and low-latency infrastructure and frameworks for building, orchestrating, and evaluating multi-agent systems at enterprise scale.</li>\n<li>Leading the infrastructure roadmap with a strong focus on compliance, privacy, and security standards, including designing change management and data isolation strategies.</li>\n<li>Owning the development and maintenance of our best-in-class Agentic observability platform (logging, metrics, tracing, and analytics) to proactively ensure system health 
and enable rapid incident response.</li>\n<li>Driving developer efficiency by building automated tooling and championing Infrastructure-as-Code (IaC) paradigms throughout the engineering organization to improve workflows and operational efficiency.</li>\n</ul>\n<p>The ideal candidate has proven experience in a senior role, with 5+ years of full-time software engineering experience, and a deep understanding of modern infrastructure practices, including CI/CD, IaC (e.g., Terraform, Helm Charts), container orchestration (e.g., Kubernetes) and observability platforms (e.g., Datadog, Prometheus, Grafana).</p>\n<p>Extensive experience with at least one major cloud provider (AWS, Azure, or GCP) and strong knowledge of security and compliance in enterprise environments, with a focus on access management, data isolation, and customer-specific VPC setups is required.</p>\n<p>Proficiency in Python or JavaScript/TypeScript, and SQL is also necessary.</p>\n<p>Bonus points for hands-on experience and a passion for working with Agents, LLMs, vector databases, and other emerging AI technologies.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_5717691a-508","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Scale","sameAs":"https://scale.com/","logo":"https://logos.yubhub.co/scale.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/scaleai/jobs/4599700005","x-work-arrangement":"onsite","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$216,200-$310,500 USD","x-skills-required":["Cloud computing","Infrastructure as Code","Container orchestration","Observability platforms","Security and compliance","Access management","Data isolation","Customer-specific VPC setups","Python","JavaScript/TypeScript","SQL"],"x-skills-preferred":["Agents","LLMs","Vector databases","Emerging AI 
technologies"],"datePosted":"2026-04-18T15:58:05.354Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"New York, NY; San Francisco, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Cloud computing, Infrastructure as Code, Container orchestration, Observability platforms, Security and compliance, Access management, Data isolation, Customer-specific VPC setups, Python, JavaScript/TypeScript, SQL, Agents, LLMs, Vector databases, Emerging AI technologies","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":216200,"maxValue":310500,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_2ab9c635-07a"},"title":"Operations Engineer, Fleet Reliability","description":"<p>The Fleet Reliability Operations team is responsible for the day-to-day provisioning, management, and uptime of CoreWeave&#39;s ever-expanding fleet of server nodes. 
This team plays a central role in CoreWeave&#39;s growth strategy, configuring, updating, and remotely troubleshooting our highest-tier supercomputing clusters and their networking, delivery platforms, and tools dependencies.</p>\n<p>We are seeking curious, creative, and persistent problem solvers to join our Fleet Reliability Operations team to help drive batches of server nodes through our provisioning and validation processes while efficiently and effectively troubleshooting node or cluster problems as they arise.</p>\n<p>Key responsibilities include:</p>\n<ul>\n<li>Configuring and maintaining large-scale high-performance supercomputing clusters running state-of-the-art GPUs</li>\n<li>Troubleshooting hardware and software issues; escalating and coordinating as needed with data center, network, hardware, and platform teams to drive resolution</li>\n<li>Monitoring and analyzing system performance and taking appropriate remediation actions for cloud health</li>\n<li>Approaching work with flexibility and optimism, anticipating shifting business and technical priorities</li>\n<li>Creating and maintaining documentation of team processes, knowledge, and best practices for system management</li>\n<li>Thinking critically about day-to-day work and working collaboratively to improve team processes and efficiency</li>\n</ul>\n<p>As a member of our team, you will be part of a dynamic and fast-paced environment where you will have the opportunity to grow and develop your skills. 
We offer a competitive salary range of $83,000 to $110,000, as well as a comprehensive benefits package, including medical, dental, and vision insurance, company-paid life insurance, and flexible PTO.</p>\n<p>If you are a motivated and detail-oriented individual who is passionate about working with cutting-edge technology, we encourage you to apply for this exciting opportunity.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_2ab9c635-07a","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4617382006","x-work-arrangement":"hybrid","x-experience-level":"mid","x-job-type":"full-time","x-salary-range":"$83,000 to $110,000","x-skills-required":["Linux system administration","Troubleshooting hardware and software issues","System maintenance tasks","Scripting languages (bash, python, powershell, etc)","Grafana, Prometheus, PromQL queries or similar observability platforms"],"x-skills-preferred":["Kubernetes administration","HPC - administering GPU-related workloads","Data center environments including server racks, HVAC systems, fiber trays"],"datePosted":"2026-04-18T15:51:55.238Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"New York, NY / Plano, TX / Bellevue, WA / Sunnyvale, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Linux system administration, Troubleshooting hardware and software issues, System maintenance tasks, Scripting languages (bash, python, powershell, etc), Grafana, Prometheus, PromQL queries or similar observability platforms, Kubernetes administration, HPC - administering GPU-related workloads, Data center environments including server racks, HVAC 
systems, fiber trays","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":83000,"maxValue":110000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_8f4ab428-1e7"},"title":"Security Technology Deployment Specialist","description":"<p>As a Security Technology Deployment Specialist at Anthropic, you will own the validation, standardization, and deployment of physical security technology across our rapidly expanding global office portfolio. This role bridges the gap between technology selection and production-ready operation, ensuring that every security platform deployed is rigorously tested, properly integrated with enterprise infrastructure, fully documented, and built for scale.</p>\n<p>You&#39;ll define the installation standards, configuration baselines, and deployment processes that the broader team executes against, from access control migrations and intercom replacements to AI analytics onboarding and new application integrations. You&#39;ll work across InfoSec, IT, Networking, and Identity Management to ensure every security application passes review, integrates with SSO, and is supported within Anthropic&#39;s infrastructure before going live. 
Your work will directly determine whether Anthropic&#39;s security technology stack scales reliably as the company grows from dozens of locations to a global enterprise footprint.</p>\n<p>Responsibilities:</p>\n<ul>\n<li>Validate and deploy new and replacement security technology platforms including access control systems, intercom solutions, video management, visitor management, and AI/analytics tools across all Anthropic locations</li>\n</ul>\n<ul>\n<li>Build and maintain staging environments for pre-production testing and validation of all security applications, hardware, firmware, and system configurations</li>\n</ul>\n<ul>\n<li>Define installation standards, configuration baselines, licensing structures, update procedures, and maintenance requirements for every deployed security platform</li>\n</ul>\n<ul>\n<li>Deploy integrations between security applications, validating that platforms communicate and share data correctly before transitioning to production</li>\n</ul>\n<ul>\n<li>Support colleagues&#39; security applications through InfoSec review processes, ensuring new tools meet Anthropic&#39;s information security and compliance requirements</li>\n</ul>\n<ul>\n<li>Coordinate SSO integration for newly deployed security applications with Identity Management and IT teams</li>\n</ul>\n<ul>\n<li>Transition applications requiring custom integration or data pipeline development to the IT Engineering team with documented technical requirements for roadmap inclusion</li>\n</ul>\n<ul>\n<li>Initiate onboarding of deployed hardware and systems into Anthropic&#39;s health monitoring platform to ensure operational visibility from day one</li>\n</ul>\n<ul>\n<li>Develop standardized deployment playbooks, checklists, configuration templates, and handoff documentation that enable repeatable installations across all current and future sites</li>\n</ul>\n<ul>\n<li>Evaluate security platforms for scalability, identifying capacity constraints, single points of failure, and 
architectural limitations before they impact operations at scale</li>\n</ul>\n<ul>\n<li>Coordinate with Networking, IT Infrastructure, and Facilities teams to ensure all infrastructure prerequisites (network, power, rack space, cloud resources) are met prior to deployment</li>\n</ul>\n<ul>\n<li>Execute structured handoffs to Project Management (for site programming), Break-Fix Support (for maintenance), and Access Control Administration (for ongoing system management), ensuring each team has the standards and documentation to execute independently</li>\n</ul>\n<p>Requirements:</p>\n<ul>\n<li>5+ years of hands-on experience deploying, validating, and managing enterprise physical security technology across a large or rapidly growing organization</li>\n</ul>\n<ul>\n<li>Experience working across InfoSec, IT, Networking, and Identity Management teams to onboard and integrate security applications into enterprise environments</li>\n</ul>\n<ul>\n<li>Strong technical communication skills, with the ability to define standards clearly enough that PMs, integrators, and service teams execute against them without ambiguity</li>\n</ul>\n<ul>\n<li>Experience with IP networking, VLANs, PoE, and infrastructure requirements for security devices</li>\n</ul>\n<ul>\n<li>Comfortable with 25% travel for site deployments, commissioning, and validation</li>\n</ul>\n<p>Preferred Qualifications:</p>\n<ul>\n<li>Previous experience at a hyper-growth technology company or managing security technology programs for high-profile corporate environments</li>\n</ul>\n<ul>\n<li>Experience with Anthropic&#39;s specific technology stack: Genetec Security Center, Axis cameras, Wavelynx, Commend Symphony Cloud, Alcatraz.ai, Ambient.ai, SureView, Envoy</li>\n</ul>\n<ul>\n<li>Industry certifications: Genetec, Axis, CCNA, PSP, CPP, or PMP</li>\n</ul>\n<ul>\n<li>Experience with OSDP, modern credential technologies, and encryption protocols for physical security systems</li>\n</ul>\n<ul>\n<li>Familiarity with 
scripting or automation (Python, PowerShell) for configuration management and deployment automation</li>\n</ul>\n<ul>\n<li>Experience with health monitoring and observability platforms</li>\n</ul>\n<ul>\n<li>Experience with change management, configuration control, and version-controlled infrastructure documentation</li>\n</ul>\n<p>Salary Range: $175,000-$220,000 USD</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_8f4ab428-1e7","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://www.anthropic.com/","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5123587008","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$175,000-$220,000 USD","x-skills-required":["security technology deployment","physical security technology","access control systems","intercom solutions","video management","visitor management","AI/analytics tools","InfoSec","IT","Networking","Identity Management","SSO integration","custom integration","data pipeline development","health monitoring platform","deployment playbooks","checklists","configuration templates","handoff documentation","scalability analysis","infrastructure prerequisites","structured handoffs"],"x-skills-preferred":["Genetec Security Center","Axis cameras","Wavelynx","Commend Symphony Cloud","Alcatraz.ai","Ambient.ai","SureView","Envoy","OSDP","modern credential technologies","encryption protocols","scripting","automation","Python","PowerShell","health monitoring","observability platforms","change management","configuration control","version-controlled infrastructure documentation"],"datePosted":"2026-04-18T15:48:43.816Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Remote-Friendly (Travel-Required) | San 
Francisco, CA | Seattle, WA | New York City, NY"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"security technology deployment, physical security technology, access control systems, intercom solutions, video management, visitor management, AI/analytics tools, InfoSec, IT, Networking, Identity Management, SSO integration, custom integration, data pipeline development, health monitoring platform, deployment playbooks, checklists, configuration templates, handoff documentation, scalability analysis, infrastructure prerequisites, structured handoffs, Genetec Security Center, Axis cameras, Wavelynx, Commend Symphony Cloud, Alcatraz.ai, Ambient.ai, SureView, Envoy, OSDP, modern credential technologies, encryption protocols, scripting, automation, Python, PowerShell, health monitoring, observability platforms, change management, configuration control, version-controlled infrastructure documentation","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":175000,"maxValue":220000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_4c401f90-9e1"},"title":"Senior Security Production Engineer","description":"<p>As a Senior Security Production Engineer at CoreWeave, you will design, build, and operate the systems that keep our platform secure, reliable, and highly performant.</p>\n<p>You&#39;ll work closely with infrastructure and engineering teams to improve system resilience, automate operational processes, and proactively mitigate risks. 
Your day-to-day will include developing scalable security infrastructure, enhancing observability, and responding to production incidents while continuously improving system reliability and performance.</p>\n<p>In this role, you will:</p>\n<ul>\n<li>Design, implement, and maintain scalable, highly available security infrastructure using Kubernetes and cloud native technologies</li>\n<li>Build automation and monitoring solutions to proactively identify and mitigate reliability risks</li>\n<li>Collaborate with engineering teams to optimize system performance, reduce latency, and improve service uptime</li>\n<li>Participate in incident response, conduct root cause analysis, and implement preventative solutions</li>\n<li>Mentor team members and promote best practices in reliability, security engineering, and infrastructure management</li>\n</ul>\n<p>Who You Are:</p>\n<ul>\n<li>5+ years of experience in site reliability engineering, DevOps, security engineering, security operations, or related roles</li>\n<li>Strong proficiency with Kubernetes, container orchestration, and cloud native technologies</li>\n<li>Experience managing and operating Teleport for infrastructure access control</li>\n<li>Proficiency in automation and scripting languages such as Python, Bash, or Go</li>\n<li>Experience operating and maintaining large scale distributed systems with a focus on reliability</li>\n</ul>\n<p>Preferred:</p>\n<ul>\n<li>Familiarity with observability platforms such as Prometheus, Grafana, or Datadog</li>\n<li>Experience working with cloud providers such as AWS, Azure, or GCP</li>\n</ul>\n<p>Wondering if you&#39;re a good fit? We believe in investing in our people and value candidates who bring diverse experiences, even if they don&#39;t meet every requirement. 
If some of the below resonates with you, we&#39;d love to connect.</p>\n<ul>\n<li>You enjoy solving complex infrastructure and security challenges at scale</li>\n<li>You&#39;re curious about improving system reliability, automation, and observability</li>\n<li>You have a strong ownership mindset and take pride in building resilient systems</li>\n</ul>\n<p>Why CoreWeave?</p>\n<p>At CoreWeave, we work hard, have fun, and move fast. We are in an exciting stage of hyper growth and building the infrastructure powering the next wave of AI. Our team embraces continuous learning, collaboration, and innovation to solve complex challenges at scale. Our core values guide how we work together:</p>\n<ul>\n<li>Be Curious at Your Core</li>\n<li>Act Like an Owner</li>\n<li>Empower Employees</li>\n<li>Deliver Best in Class Client Experiences</li>\n<li>Achieve More Together</li>\n</ul>\n<p>We foster an environment that encourages independent thinking, collaboration, and the development of innovative solutions. You will work alongside some of the best talent in the industry and have opportunities to grow as we continue to scale. We support and encourage an entrepreneurial outlook and independent thinking.</p>\n<p>The base salary range for this role is $190,000 to $282,000. The starting salary will be determined by job-related knowledge, skills, experience, and the market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).</p>\n<p>What We Offer</p>\n<p>The range we&#39;ve posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate which can include a variety of factors. These include qualifications, experience, interview performance, and location. 
In addition to a competitive salary, we offer a variety of benefits to support your needs, including:</p>\n<ul>\n<li>Medical, dental, and vision insurance - 100% paid for by CoreWeave</li>\n<li>Company-paid Life Insurance</li>\n<li>Voluntary supplemental life insurance</li>\n<li>Short and long-term disability insurance</li>\n<li>Flexible Spending Account</li>\n<li>Health Savings Account</li>\n<li>Tuition Reimbursement</li>\n<li>Ability to Participate in Employee Stock Purchase Program (ESPP)</li>\n<li>Mental Wellness Benefits through Spring Health</li>\n<li>Family-Forming support provided by Carrot</li>\n<li>Paid Parental Leave</li>\n<li>Flexible, full-service childcare support with Kinside</li>\n<li>401(k) with a generous employer match</li>\n<li>Flexible PTO</li>\n<li>Catered lunch each day in our office and data center locations</li>\n<li>A casual work environment</li>\n<li>A work culture focused on innovative disruption</li>\n</ul>\n<p>Our Workplace</p>\n<p>While we prioritize a hybrid work environment, remote work may be considered for candidates located more than 30 miles from an office, based on role requirements for specialized skill sets. New hires will be invited to attend onboarding at one of our hubs within their first month. Teams also gather quarterly to support collaboration.</p>\n<p>California Consumer Privacy Act - California applicants only</p>\n<p>CoreWeave is an equal opportunity employer, committed to fostering an inclusive and supportive workplace. All qualified applicants and candidates will receive consideration for employment without regard to race, color, religion, sex, disability, age, sexual orientation, gender identity, national origin, veteran status, or genetic information. 
As part of this commitment and consistent with the Americans with Disabilities Act (ADA), CoreWeave will ensure that qualified applicants and candidates with disabilities are provided reasonable accommodations for the hiring process, unless such accommodation would cause an undue hardship. If reasonable accommodation is needed, please contact: careers@coreweave.com</p>\n<p>Export Control Compliance</p>\n<p>This position requires access to export controlled information. To conform to U.S. Government export regulations applicable to that information, applicant must either be (A) a U.S. person, defined as a (i) U.S. citizen or national, (ii) U.S. lawful permanent resident (green card holder), (iii) refugee under 8 U.S.C. § 1157, or (iv) asylee under 8 U.S.C. § 1158, (B) eligible to access the export controlled information without a required export authorization, or (C) eligible and reasonably likely to obtain the required export authorization from the applicable U.S. government agency. CoreWeave may, for legitimate business reasons, decline to pursue any export licensing process.</p>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_4c401f90-9e1","directApply":true,"hiringOrganization":{"@type":"Organization","name":"CoreWeave","sameAs":"https://www.coreweave.com","logo":"https://logos.yubhub.co/coreweave.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/coreweave/jobs/4569069006","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$190,000 to $282,000","x-skills-required":["Kubernetes","cloud native technologies","Teleport","Python","Bash","Go","observability platforms","Prometheus","Grafana","Datadog","cloud 
providers","AWS","Azure","GCP"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:48:28.443Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA / San Francisco, CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Kubernetes, cloud native technologies, Teleport, Python, Bash, Go, observability platforms, Prometheus, Grafana, Datadog, cloud providers, AWS, Azure, GCP","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":190000,"maxValue":282000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_a514157f-198"},"title":"Senior Manager, Site Reliability Engineering - Infrastructure Platform","description":"<p>Secure Every Identity, from AI to Human</p>\n<p>Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organisations to safely embrace this new era.</p>\n<p>This work requires a relentless drive to solve complex challenges with real-world stakes. We are looking for builders and owners who operate with speed and urgency and execute with excellence. This is an opportunity to do career-defining work. We&#39;re all in on this mission. If you are too, let&#39;s talk.</p>\n<p>The Infrastructure Platform and Shared Services Team</p>\n<p>Okta authenticates, authorises and provisions millions of users a day. The service is hosted on Amazon Web Services (AWS) across multiple availability zones and geographically separated regions. The service is designed for high throughput and 99.999% availability.</p>\n<p>We&#39;re looking for a technical leader to help us continue to scale the service with great people and reliable, cost-effective, and efficient infrastructure, processes, and tooling.</p>\n<p>As the Sr. 
Manager of Infrastructure Platform and Shared Services, you will oversee multiple teams focused on Edge networking, K8s platform, CI/CD, Observability, automation platform &amp; tooling.</p>\n<p>Responsibilities</p>\n<ul>\n<li>Lead the Infra platform and shared services org and various initiatives across the SRE &amp; Infrastructure organisation.</li>\n</ul>\n<ul>\n<li>Lead the DevOps transformation, microservice journey, and next generation Infra platform capabilities in partnership with architects and product engineering.</li>\n</ul>\n<ul>\n<li>Build a world-class observability platform and monitoring capabilities enabled with self-service.</li>\n</ul>\n<ul>\n<li>Accelerate the velocity of SRE and product engineering by developing robust platforms, powerful tooling, and intuitive self-service capabilities.</li>\n</ul>\n<ul>\n<li>Own the design and operation of scalable, self-service Cloud infrastructure platforms (e.g., Kubernetes, service mesh, CI/CD pipelines, IaC &amp; Edge Infrastructure).</li>\n</ul>\n<ul>\n<li>Lead, mentor, and grow a high-performing team of engineers and managers across platform, infrastructure, and shared services domains.</li>\n</ul>\n<ul>\n<li>Perform engineering design evaluations and ensure the completion of projects within resource, budget, and scheduling constraints.</li>\n</ul>\n<ul>\n<li>Improve SDLC processes for Cloud infrastructure as code, including the maturity of CI/CD pipelines, change and release management.</li>\n</ul>\n<ul>\n<li>Manage service and business expectations and prioritise resource allocation.</li>\n</ul>\n<ul>\n<li>Maintain a deep knowledge of industry best practices, evolving trends, and technologies.</li>\n</ul>\n<p>Requirements</p>\n<ul>\n<li>6+ years of experience in technical leadership &amp; people management.</li>\n</ul>\n<ul>\n<li>Extensive experience using Agile and DevOps methodologies to build product infrastructure and shared service at scale.</li>\n</ul>\n<ul>\n<li>3+ years of experience running 
large-scale infrastructure platforms supporting a SaaS/Cloud service in a public Cloud, preferably AWS. Experience supporting a multi-Cloud environment will be a plus.</li>\n</ul>\n<ul>\n<li>Strong expertise in cloud-native architectures, containerisation (Kubernetes), IaC (Terraform), and CI/CD pipelines.</li>\n</ul>\n<ul>\n<li>Strong background and hands-on experience in SW development, PaaS and automation.</li>\n</ul>\n<ul>\n<li>Deep experience with building and operating observability platforms and monitoring tools (Grafana, Splunk, APM etc.) in a large scale environment.</li>\n</ul>\n<ul>\n<li>Demonstrated ability to lead cross-functional teams and manage large-scale programs.</li>\n</ul>\n<ul>\n<li>Effective verbal, written communication and interpersonal skills.</li>\n</ul>\n<ul>\n<li>Computer Science Degree or related degree or equivalent experience.</li>\n</ul>\n<p>Additional requirements:</p>\n<ul>\n<li>This position requires the ability to access federal environments and/or have access to protected federal data. As a condition of employment for this position, the successful candidate must be able to submit documentation establishing U.S. Person status (e.g. a U.S. Citizen, National, Lawful Permanent Resident, Refugee, or Asylee. 
22 CFR 120.15) upon hire.</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_a514157f-198","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Okta","sameAs":"https://www.okta.com/","logo":"https://logos.yubhub.co/okta.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/okta/jobs/7317857","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$176,000-$264,000 USD","x-skills-required":["cloud-native architectures","containerisation (Kubernetes)","IaC (Terraform)","CI/CD pipelines","SW development","PaaS and automation","observability platforms and monitoring tools (Grafana, Splunk, APM etc.)"],"x-skills-preferred":[],"datePosted":"2026-04-18T15:45:57.955Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Bellevue, Washington; San Francisco, California"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"cloud-native architectures, containerisation (Kubernetes), IaC (Terraform), CI/CD pipelines, SW development, PaaS and automation, observability platforms and monitoring tools (Grafana, Splunk, APM etc.)","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":176000,"maxValue":264000,"unitText":"YEAR"}}},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_afeda614-2d1"},"title":"Security Technology Deployment Specialist","description":"<p><strong>About the Role:</strong></p>\n<p>As part of Anthropic&#39;s Global Safety, Intelligence, and Security (GSIS) team, the Security Technology Deployment Specialist will own the validation, standardization, and deployment of physical security technology across Anthropic&#39;s rapidly expanding global office 
portfolio.</p>\n<p>You&#39;ll define the installation standards, configuration baselines, and deployment processes that the broader team executes against — from access control migrations and intercom replacements to AI analytics onboarding and new application integrations. You&#39;ll work across InfoSec, IT, Networking, and Identity Management to ensure every security application passes review, integrates with SSO, and is supported within Anthropic&#39;s infrastructure before going live.</p>\n<p><strong>Responsibilities:</strong></p>\n<ul>\n<li>Validate and deploy new and replacement security technology platforms including access control systems, intercom solutions, video management, visitor management, and AI/analytics tools across all Anthropic locations</li>\n<li>Build and maintain staging environments for pre-production testing and validation of all security applications, hardware, firmware, and system configurations</li>\n<li>Define installation standards, configuration baselines, licensing structures, update procedures, and maintenance requirements for every deployed security platform</li>\n<li>Deploy integrations between security applications, validating that platforms communicate and share data correctly before transitioning to production</li>\n<li>Support colleagues&#39; security applications through InfoSec review processes, ensuring new tools meet Anthropic&#39;s information security and compliance requirements</li>\n<li>Coordinate SSO integration for newly deployed security applications with Identity Management and IT teams</li>\n<li>Transition applications requiring custom integration or data pipeline development to the IT Engineering team with documented technical requirements for roadmap inclusion</li>\n<li>Initiate onboarding of deployed hardware and systems into Anthropic&#39;s health monitoring platform to ensure operational visibility from day one</li>\n<li>Develop standardized deployment playbooks, checklists, configuration templates, and 
handoff documentation that enable repeatable installations across all current and future sites</li>\n<li>Evaluate security platforms for scalability, identifying capacity constraints, single points of failure, and architectural limitations before they impact operations at scale</li>\n<li>Coordinate with Networking, IT Infrastructure, and Facilities teams to ensure all infrastructure prerequisites (network, power, rack space, cloud resources) are met prior to deployment</li>\n<li>Execute structured handoffs to Project Management (for site programming), Break-Fix Support (for maintenance), and Access Control Administration (for ongoing system management), ensuring each team has the standards and documentation to execute independently</li>\n</ul>\n<p><strong>You may be a good fit if you:</strong></p>\n<ul>\n<li>Have 5+ years of hands-on experience deploying, validating, and managing enterprise physical security technology across a large or rapidly growing organisation</li>\n<li>Have deployed security technology across 50 or more sites, or have demonstrated experience in a high-growth environment where deployment velocity and repeatability were essential</li>\n<li>Have built standardized deployment processes, playbooks, and configuration templates that enabled others to execute installations independently and consistently</li>\n<li>Have experience working across InfoSec, IT, Networking, and Identity Management teams to onboard and integrate security applications into enterprise environments</li>\n<li>Have supported SSO integration, InfoSec reviews, and enterprise application onboarding workflows for security tools</li>\n<li>Possess broad technology experience across access control, video management, intercoms, visitor management, AI/analytics, and alarm monitoring platforms</li>\n<li>Are a strong technical communicator who can define standards clearly enough that PMs, integrators, and service teams execute against them without ambiguity</li>\n<li>Have experience with 
IP networking, VLANs, PoE, and infrastructure requirements for security devices</li>\n<li>Are comfortable with 25% travel for site deployments, commissioning, and validation</li>\n</ul>\n<p><strong>Strong candidates may have:</strong></p>\n<ul>\n<li>Previous experience at a hyper-growth technology company or managing security technology programs for high-profile corporate environments</li>\n<li>Experience with Anthropic&#39;s specific technology stack: Genetec Security Center, Axis cameras, Wavelynx, Commend Symphony Cloud, Alcatraz.ai, Ambient.ai, SureView, Envoy</li>\n<li>Industry certifications: Genetec, Axis, CCNA, PSP, CPP, or PMP</li>\n<li>Experience with OSDP, modern credential technologies, and encryption protocols for physical security systems</li>\n<li>Familiarity with scripting or automation (Python, PowerShell) for configuration management and deployment automation</li>\n<li>Experience with health monitoring and observability platforms</li>\n<li>Experience with change management, configuration control, and version-controlled infrastructure documentation</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_afeda614-2d1","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Anthropic","sameAs":"https://job-boards.greenhouse.io","logo":"https://logos.yubhub.co/anthropic.com.png"},"x-apply-url":"https://job-boards.greenhouse.io/anthropic/jobs/5123587008","x-work-arrangement":"remote","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":null,"x-skills-required":["physical security technology","access control systems","intercom solutions","video management","visitor management","AI/analytics tools","security applications","InfoSec","IT","Networking","Identity Management","SSO integration","IP networking","VLANs","PoE","infrastructure requirements for security 
devices"],"x-skills-preferred":["Genetec Security Center","Axis cameras","Wavelynx","Commend Symphony Cloud","Alcatraz.ai","Ambient.ai","SureView","Envoy","OSDP","modern credential technologies","encryption protocols for physical security systems","scripting or automation (Python, PowerShell)","health monitoring and observability platforms","change management","configuration control","version-controlled infrastructure documentation"],"datePosted":"2026-03-08T13:56:18.481Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"San Francisco, CA; Seattle, WA; New York City, NY"}},"jobLocationType":"TELECOMMUTE","employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"physical security technology, access control systems, intercom solutions, video management, visitor management, AI/analytics tools, security applications, InfoSec, IT, Networking, Identity Management, SSO integration, IP networking, VLANs, PoE, infrastructure requirements for security devices, Genetec Security Center, Axis cameras, Wavelynx, Commend Symphony Cloud, Alcatraz.ai, Ambient.ai, SureView, Envoy, OSDP, modern credential technologies, encryption protocols for physical security systems, scripting or automation (Python, PowerShell), health monitoring and observability platforms, change management, configuration control, version-controlled infrastructure documentation"},{"@context":"https://schema.org","@type":"JobPosting","identifier":{"@type":"PropertyValue","name":"YubHub","value":"job_8c164f95-f8d"},"title":"Senior Infrastructure Engineer","description":"<p>Join our Infrastructure Engineering team and help ensure the reliability, scalability, and performance of Replit&#39;s infrastructure that serves millions of developers worldwide. 
As a Senior Infrastructure Engineer, you will bridge the gap between development and operations, implementing automation and establishing best practices that enable our platform to scale efficiently while maintaining high availability.</p>\n<p>We are seeking Senior Infrastructure Engineers who are passionate about building and maintaining resilient systems at scale. Your mission will be to proactively find and analyse reliability problems across our stack, then design and implement software and systems to address them. You will build robust monitoring solutions, automate operational tasks, and continuously improve our infrastructure&#39;s reliability.</p>\n<p><strong>You Will:</strong></p>\n<ul>\n<li>Drive Automation and Infrastructure as Code: Build and improve automation to eliminate toil and operational work. Maintain CI/CD pipelines and infrastructure automation using tools like Terraform or Pulumi. Create self-healing systems that can automatically respond to common failure scenarios.</li>\n<li>Optimise Performance and Infrastructure: Collaborate with core infrastructure and product teams to performance tune and optimise our cloud deployments (Kubernetes, Docker, GCP). 
Identify and resolve performance bottlenecks and implement capacity planning strategies.</li>\n<li>Elevate Developer Experience: Design and implement improvements to our build, test, and deployment systems to make software delivery faster, safer, and more reliable for all engineers.</li>\n<li>Drive Cross-Team Improvements: Partner with service owners across Replit to understand their pain points, and collaborate on implementing build/test/deploy enhancements within their specific services.</li>\n<li>Build Shared Tooling: Create and maintain centralized tooling and automation that improves the engineering lifecycle, from local development to production monitoring.</li>\n<li>Debug and Harden Systems: Dive deep into debugging difficult technical problems, making our systems and products more robust, operable, and easier to diagnose.</li>\n<li>Collaborate on Design Reviews: Participate in feature and system design reviews, contributing expertise on security, scale, and operational considerations.</li>\n<li>Build and Integrate: Write high-quality, well-tested code to meet the needs of your customers, including building pipelines to integrate with 3rd party vendors.</li>\n</ul>\n<p><strong>Required Skills and Experience:</strong></p>\n<ul>\n<li>4+ years of experience in Site Reliability Engineering or similar roles (DevOps, Systems Engineering, Infrastructure Engineering).</li>\n<li>Strong programming skills in languages like Python or Go.</li>\n<li>You write high-quality, well-tested code.</li>\n<li>Solid understanding of distributed systems. 
You&#39;ve built, scaled, and maintained production services and understand service-oriented architecture.</li>\n<li>Experience with container orchestration platforms (Kubernetes) and cloud-native technologies.</li>\n<li>Experience implementing and maintaining monitoring/observability solutions, with strong skills in debugging and performance tuning.</li>\n<li>Strong incident management skills with experience participating in incident response and demonstrated critical thinking under pressure.</li>\n<li>Experience with infrastructure as code (e.g., Terraform) and configuration management tools.</li>\n<li>Excellent written and verbal communication skills, with an ability to explain technical concepts clearly.</li>\n<li>A willingness to dive into understanding, debugging, and improving any layer of the stack.</li>\n<li>You&#39;re passionate about making software creation accessible and empowering the next generation of builders.</li>\n</ul>\n<p><strong>Bonus Points:</strong></p>\n<ul>\n<li>Experience with Google Cloud Platform (GCP) services and tools.</li>\n<li>Knowledge of modern observability platforms (Prometheus, Grafana, Datadog, etc.).</li>\n<li>Experience building reliable systems capable of handling high throughput and low latency.</li>\n<li>Experience with Go and Terraform.</li>\n<li>Familiarity with working in rapid-growth environments.</li>\n</ul>\n<p><em>This is a full-time role that can be held from our Foster City, CA office. 
The role has an in-office requirement of Monday, Wednesday, and Friday.</em></p>\n<p><strong>Full-Time Employee Benefits Include:</strong></p>\n<ul>\n<li>Competitive Salary &amp; Equity</li>\n<li>401(k) Program with a 4% match</li>\n<li>Health, Dental, Vision and Life Insurance</li>\n<li>Short Term and Long Term Disability</li>\n<li>Paid Parental, Medical, Caregiver Leave</li>\n<li>Commuter Benefits</li>\n<li>Monthly Wellness Stipend</li>\n<li>Autonomous Work Environment</li>\n<li>In Office Set-Up Reimbursement</li>\n<li>Flexible Time Off (FTO) + Holidays</li>\n<li>Quarterly Team Gatherings</li>\n<li>In Office Amenities</li>\n</ul>\n<p style=\"margin-top:24px;font-size:13px;color:#666;\">XML job scraping automation by <a href=\"https://yubhub.co\">YubHub</a></p>","url":"https://yubhub.co/jobs/job_8c164f95-f8d","directApply":true,"hiringOrganization":{"@type":"Organization","name":"Replit","sameAs":"https://jobs.ashbyhq.com","logo":"https://logos.yubhub.co/replit.com.png"},"x-apply-url":"https://jobs.ashbyhq.com/replit/16c85abc-763c-4f36-ab67-64f416343384","x-work-arrangement":"hybrid","x-experience-level":"senior","x-job-type":"full-time","x-salary-range":"$190K - $240K","x-skills-required":["Site Reliability Engineering","DevOps","Systems Engineering","Infrastructure Engineering","Python","Go","Terraform","Kubernetes","Docker","GCP","Monitoring/observability solutions","Debugging and performance tuning","Incident management","Infrastructure as code","Configuration management tools"],"x-skills-preferred":["Google Cloud Platform (GCP) services and tools","Modern observability platforms (Prometheus, Grafana, Datadog, etc.)","Building reliable systems capable of handling high throughput and low latency","Go and Terraform","Familiarity with working in rapid-growth environments"],"datePosted":"2026-03-07T15:20:28.138Z","jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Foster City, 
CA"}},"employmentType":"FULL_TIME","occupationalCategory":"Engineering","industry":"Technology","skills":"Site Reliability Engineering, DevOps, Systems Engineering, Infrastructure Engineering, Python, Go, Terraform, Kubernetes, Docker, GCP, Monitoring/observability solutions, Debugging and performance tuning, Incident management, Infrastructure as code, Configuration management tools, Google Cloud Platform (GCP) services and tools, Modern observability platforms (Prometheus, Grafana, Datadog, etc.), Building reliable systems capable of handling high throughput and low latency, Go and Terraform, Familiarity with working in rapid-growth environments","baseSalary":{"@type":"MonetaryAmount","currency":"USD","value":{"@type":"QuantitativeValue","minValue":190000,"maxValue":240000,"unitText":"YEAR"}}}]}