Senior Software Engineer - Robinhood Command Center

Join us in building the future of finance.

Our mission is to democratize finance for all. An estimated $124 trillion of assets will be inherited by younger generations in the next two decades. The largest transfer of wealth in human history. If you’re ready to be at the epicenter of this historic cultural and financial shift, keep reading.

We are building an elite team, applying frontier technologies to the world’s biggest financial problems. We’re looking for bold thinkers. Sharp problem-solvers. Builders who are wired to make an impact. Robinhood isn’t a place for complacency; it’s where ambitious people do the best work of their careers. We’re a high-performing, fast-moving team with ethics at the center of everything we do. Expectations are high, and so are the rewards.

The Robinhood Command Center (RCC) is a newly formed reliability team that serves as the front line for detecting, coordinating, and mitigating production incidents across Robinhood.

As part of Robinhood’s broader reliability initiative, RCC works closely with product engineering, reliability, observability, infrastructure, and business teams to reduce customer impact and shorten incident duration.

As a Senior Engineer, you will be part of the founding RCC team, helping define how Robinhood responds to and learns from incidents at scale. This is a highly visible role focused on incident leadership, operational excellence, and reliability tooling. You will not own product services or core infrastructure, but you will own the processes and tools that enable fast, high-quality incident response.

This role is based in our Menlo Park, California office, with in-person attendance expected at least 3 days per week.

Responsibilities:

Serve as a senior technical leader driving the long-term reliability and observability strategy across Robinhood’s infrastructure

Partner closely across many different types of engineers to raise the bar for operational excellence and incident response

Lead incident mitigation efforts by coordinating service owners, facilitating time-sensitive decisions such as rollbacks and traffic shifts, and maintaining a clear source of truth during active incidents

Develop and maintain incident management processes and procedures to ensure timely resolution and minimize customer impact

Own incident discovery at the company level by defining and maintaining global dashboards and alerts tied to critical user journeys (CUJs), availability, and business-impact metrics

Own and evolve incident response tooling and processes, including education, adoption, and measurement of MTTD/MTTR improvements

Drive post-incident governance and learning, defining standards for postmortems, SEV reviews, and follow-up tracking to ensure durable reliability improvements

Design and implement next-generation failure mitigation strategies that avoid full-region or full-datacenter failovers

Define and build frameworks to improve monitoring, alerting, and observability across hundreds of services and systems

Define and own the roadmap of bringing observability to critical user journeys for Robinhood’s products

Deliver key insights and executive-level reporting to enable better business decisions around service quality and reliability

Act as a force multiplier through mentoring, technical influence, and contributions to hiring and engineering culture

Requirements:

5+ years of software engineering experience, including significant experience operating production systems

2+ years focused on reliability engineering, infrastructure, distributed systems, or production operations

Hands-on experience serving in incident leadership roles (e.g., IMOC, incident commander, primary on-call)

Strong communication and cross-functional collaboration skills, especially during high-severity incidents

Deep knowledge of systems reliability, observability frameworks, and fault-tolerant architecture design

Experience with multi-region or multi-cluster architectures, capacity planning, and failover strategies

Familiarity with modern observability stacks (e.g., OpenTelemetry, Prometheus, Grafana)

Demonstrated ability to drive measurable improvements in MTTD, MTTR, availability, or customer impact

What we offer:

Challenging, high-impact work to grow your career

Performance-driven compensation with multipliers for outsized impact, bonus programs, equity ownership, and 401(k) matching

Best-in-class benefits to fuel your work, including 100% paid health insurance for employees with 90% coverage for dependents

Lifestyle wallet - a highly flexible benefits spending account for wellness, learning, and more

Employer-paid life & disability insurance, fertility benefits, and mental health benefits

Time off to recharge including company holidays, paid time off, sick time, parental leave, and more!

Exceptional office experience with catered meals, events, and comfortable workspaces

In addition to the base pay range listed below, this role is also eligible for bonus opportunities + equity + benefits.

Base pay for the successful applicant will depend on a variety of job-related factors, which may include education, training, experience, location, business needs, or market demands. The expected base pay range for this role is based on the location where the work will be performed and is aligned to one of 3 compensation zones. For other locations not listed, compensation can be discussed with your recruiter during the interview process.

Base Pay Range:

Zone 1 (Menlo Park, CA; New York, NY; Bellevue, WA; Washington, DC): $196,000-$230,000 USD

Zone 2 (Denver, CO; Westlake, TX; Chicago, IL): $172,000-$202,000 USD

Zone 3 (Lake Mary, FL; Clearwater, FL; Gainesville, FL): $153,000-$179,000 USD


Type: full-time, senior, onsite
Compensation: Based on performance
Skills: software engineering, reliability engineering, infrastructure, distributed systems, production operations, incident leadership, operational excellence, reliability tooling, processes and tools, fast and high-quality incident response, long-term reliability and observability strategy, multi-region or multi-cluster architectures, capacity planning, failover strategies, modern observability stacks, OpenTelemetry, Prometheus, Grafana
Category: Engineering, Finance
Company: Robinhood
Logo: https://logos.yubhub.co/robinhood.com.png
About: Robinhood is a financial services company that provides a mobile app for buying and selling stocks, options, ETFs, and cryptocurrencies.
Website: https://www.robinhood.com/
Apply: https://job-boards.greenhouse.io/robinhood/jobs/7838644?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
Location: Menlo Park, CA
Date: 2026-04-25

Director, Site Reliability Engineer | Senior Engineering Team Director

We're seeking a Site Reliability Engineering (SRE) Lead to design, build, and maintain resilient, high-scale systems supporting BlackRock's Private Markets platform. In this hands-on leadership role, you'll apply deep engineering expertise to solve complex challenges, guide a global team, shape technical direction, and communicate effectively with senior stakeholders, ensuring the reliability of mission-critical systems that power private market investment workflows and decision-making. You will drive the adoption of AI-driven solutions to accelerate incident detection and triage, reduce toil, improve forecasting and capacity planning, and strengthen end-to-end observability and resilience.

Key Responsibilities:

Take ownership of project priorities, deadlines and deliverables using Agile methodologies, with clear outcomes around reliability automation and AI-enabled operations
Understand and refine business and functional requirements, translating them into SLOs/SLIs and AI-assisted observability and support capabilities
Take a hands-on approach to getting work done: this role requires a “roll your sleeves up” mentality, including building and operationalizing reliability tooling and automation that measurably reduces toil and improves stability
Be a leader with vision and a partner in brainstorming solutions for team productivity and efficiency to improve engineering effectiveness
Drive priority setting of the engineering teams, balancing foundational reliability work with delivery of new product features
Improve engineering culture by encouraging continuous focus on reliability across the entire application lifecycle, and by adopting AI-enabled SRE practices (e.g., intelligent alerting, automated diagnosis, and self-healing where appropriate)
Proactive participant in architectural and design decisions, including AI-ready telemetry, data quality, and model integration patterns for operational analytics
Design and implement end-to-end monitoring solutions for application and infrastructure components, leveraging modern observability platforms plus AI/ML techniques for anomaly detection, correlation, and alert noise reduction
Drive the engineering of capacity management and demand forecasting solutions, including predictive analytics/ML approaches where they add measurable value
Act as a culture carrier and leader, passing on SRE knowledge and best practices to the engineering team
Drive detailed root cause investigations for production incidents with rigorous focus on issue avoidance, using AI-assisted correlation/analysis to accelerate time-to-insight
Create/coordinate retros for significant incidents, ensuring learnings are captured in automated/AI-assisted runbooks and embedded into prevention mechanisms
Additional core engineering functions, such as adding custom telemetry metrics/logs/traces to the code base of in-scope applications to enable AI/ML-driven operational insights
Anticipate new opportunities to continuously evolve the resiliency profile of scoped applications and infrastructure

Requirements:

B.S. / M.S. degree in Computer Science, Engineering or a related discipline with 10+ years of experience
Experience leading high performing engineering/SRE teams, with a track record of driving continuous improvement through automation and AI-enabled operations
Demonstrated ability to represent engineering/SRE priorities, status, and risk to senior leadership stakeholders with clear, executive-ready communication
Hands-on experience building or operating AI-assisted capabilities (AIOps, ML-based anomaly detection, or GenAI workflows) in an engineering/production environment
A passion for providing engineering support for highly available, performant full stack applications with a “Student of Technology” attitude
Experience with relational and NoSQL databases (e.g., Redis, Apache Cassandra)

Benefits:

Retirement investment and tools designed to help you in building a sound financial future
Access to education reimbursement
Comprehensive resources to support your physical health and emotional well-being
Family support programs
Flexible Time Off (FTO) so you can relax, recharge and be there for the people you care about

Hybrid Work Model:

BlackRock’s hybrid work model is designed to enable a culture of collaboration and apprenticeship that enriches the experience of our employees, while supporting flexibility for all
Employees are currently required to work at least 4 days in the office per week, with the flexibility to work from home 1 day a week
Some business groups may require more time in the office due to their roles and responsibilities
We remain focused on increasing the impactful moments that arise when we work together in person – aligned with our commitment to performance and innovation

About BlackRock:

At BlackRock, we are all connected by one mission: to help more and more people experience financial well-being
Our clients, and the people they serve, are saving for retirement, paying for their children’s education, buying homes and starting businesses
Their investments also help to strengthen the global economy: support businesses small and large; finance infrastructure projects that connect and power cities; and facilitate innovations that drive progress


Type: full-time, senior, hybrid
Skills: Site Reliability Engineering, Agile Methodologies, Reliability Automation, AI-Enabled Operations, Business Requirements, Functional Requirements, SLOs/SLIs, Observability, Support Capabilities, Reliability Tooling, Automation, Stability, Leadership, Vision, Team Productivity, Efficiency, Engineering Effectiveness, Priority Setting, Foundational Reliability, New Product Features, Engineering Culture, Reliability Across Application Lifecycle, AI-Enabled SRE Practices, Intelligent Alerting, Automated Diagnosis, Self-Healing, Architectural Decisions, AI-Ready Telemetry, Data Quality, Model Integration Patterns, Operational Analytics, Monitoring Solutions, Application Components, Infrastructure Components, Anomaly Detection, Correlation, Alert Noise Reduction, Capacity Management, Demand Forecasting, Predictive Analytics, ML Approaches, Root Cause Investigations, Production Incidents, Issue Avoidance, AI-Assisted Correlation, Time-To-Insight, Retros, Significant Incidents, Learnings, Runbooks, Prevention Mechanisms, Custom Telemetry Metrics, Logs, Traces, AI/ML-Driven Operational Insights, Resiliency Profile, Scoped Applications, Infrastructure, Relational Database, NoSQL Database, Redis, Apache Cassandra
Category: Engineering, Finance
Company: BlackRock
Logo: https://logos.yubhub.co/blackrock.com.png
About: BlackRock is a global investment management corporation that provides investments in equity, fixed income, alternatives, and money market instruments. It has over $9 trillion in assets under management.
Website: https://www.blackrock.com/
Apply: https://jobs.workable.com/view/cLBuSgz7avHiG3cKzS91ZB/director%2C-site-reliability-engineer-%7C-senior-engineering-team-director-in-england-at-blackrock?utm_source=yubhub.co&utm_medium=jobs_feed&utm_campaign=apply
Location: England
Date: 2026-04-24