Network Engineer

a5430d30-778 Network Engineer

About Us: At Cloudflare, we are on a mission to help build a better Internet. Today the company runs one of the world’s largest networks that powers millions of websites and other Internet properties for customers ranging from individual bloggers to SMBs to Fortune 500 companies.

Responsibilities: We are looking for a Network Engineer with 5+ years of experience to join our team. Cloudflare is building one of the largest, most resilient networks, spanning over 335 cities across all regions, and we plan to continue expanding at a rapid pace. You will have the opportunity to (literally) build a faster, safer Internet for our millions of users and the billions of web surfers that visit their sites each month. This position will be responsible for:

Technical operation and engineering of the Cloudflare network, including the provisioning and management of the network hardware and software,
Day-to-day network operations and monitoring, working closely with internal teams such as System Reliability Engineering, Infrastructure Engineering, and Customer Support,
Creating and maintaining documentation, SOPs, and the knowledge base,
Interacting with our network peers to assist with their inquiries and providing meaningful data on performance degradation.

Requirements:

Capable of learning new technologies / systems / features under guidance of mentors,
Proficient in multiple network vendor operating systems,
Associate-level network certification(s) (JNCIA, CCNA, etc.) or higher,
Understanding of BGP,
Knowledge of the OSI model and experience isolating network, hardware, and software issues,
Experience writing scripts in Bash, Python, or another scripting language,
Experience in working as part of a team in a customer-facing role,
Ability to prioritise when faced with high pressure scenarios.

Bonus Points but not required:

Understanding of anycast routing,
Good working knowledge of Junos, IOS-XR, NX-OS, EOS, and SONiC,
Experience writing network configuration and design documentation,
Experience solving problems through automation,
Experience with optical transport technologies such as CWDM/DWDM,
Linux system administration,
Multilingual.

XML job scraping automation by YubHub

full-time senior onsite network vendor operating systems, Associate level network certification(s), BGP, OSI-model, scripting languages (Bash, Python, etc.), team collaboration, prioritization, anycast routing, Junos, IOS-XR, NX-OS, EOS, SONIC, network configuration and design documentation, problem-solving through automation, optical transport technologies (CWDM/DWDM), Linux system administration, multilingual Engineering Technology Cloudflare https://logos.yubhub.co/cloudflare.com.png Cloudflare is a technology company that builds a network that powers millions of websites and other Internet properties. https://www.cloudflare.com/ https://job-boards.greenhouse.io/cloudflare/jobs/7628395 In-Office 2026-04-18

a438f945-411 Senior Site Reliability Engineer (Resilience) - Platform Resilience

We're seeking a Senior Site Reliability Engineer (SRE) to join our Platform Engineering department. As an SRE, you will lead technical initiatives to automate system engineering efforts, ensuring the reliability of our global infrastructure. You will grow our global Platform infrastructure to meet increasing scaling demands by developing and maintaining software, tooling, and automations.

Responsibilities:

Develop and maintain software, tooling, and automations to ensure the reliability and scalability of our global infrastructure.

Lead technical initiatives to automate system engineering efforts, ensuring the reliability of our global infrastructure.

Collaborate with engineers to identify, implement, and deliver solutions that meet the needs of our customers.

Champion an environment focused on collaboration, operational excellence, and uplifting others.

Respond to and prevent repeated customer impact in response to major incidents and prioritized problem management.

Requirements:

Successes and lessons learned from striving for 'progress not perfection' in the name of Platform reliability.

Background in software engineering to collaborate with engineers to expertly identify, implement, and deliver solutions.

Experience in public cloud and managed Kubernetes services is advantageous.

Passion for developing solutions that involve inclusive communication methods to grow and strengthen partner and team relationships.

Preferred Qualifications:

Operated a SaaS product in a public cloud ideally built using Infrastructure-as-Code tooling such as Crossplane or Terraform.

Built or operated a Kubernetes-at-scale infrastructure, ideally across multiple cloud providers, and the vital automation to support it.

Written non-trivial programs in Golang or other programming languages.

Worked with containerized services (such as Docker).

Proven experience in leading and improving alerting and major incident management processes, and in using metrics systems (e.g., Elastic Stack, Graphite, Prometheus, Influx) to diagnose issues and quantify impact for audiences at varying levels of the organization.

Experienced in system administration with professional skills in Linux on distributed systems at scale.

Diagnosed or designed, implemented, and created solutions with the Elastic Stack.

Thrived in a self-organizing, knowledge-sharing, globally distributed team environment.

Strengthened team members in bringing out the best of each other by uplifting others with coaching and mentoring.

Compensation:

This role is eligible to participate in Elastic's stock program.

Total rewards package includes a company-matched 401k with dollar-for-dollar matching up to 6% of eligible earnings, along with a range of other benefits offered with a holistic emphasis on employee well-being.

Typical starting salary range for this role is $154,800-$195,600 USD.

full-time senior remote $154,800-$195,600 USD Software engineering, Public cloud, Managed Kubernetes services, Infrastructure-as-Code tooling, Containerized services, System administration, Linux on distributed systems, Golang, Crossplane, Terraform, Docker, Elastic Stack, Graphite, Prometheus, Influx Engineering Technology Elastic https://logos.yubhub.co/elastic.co.png Elastic develops a search and analytics platform used by over 50% of the Fortune 500 companies. https://www.elastic.co/ https://job-boards.greenhouse.io/elastic/jobs/7794016 United States 2026-04-18

2ab9c635-07a Operations Engineer, Fleet Reliability

The Fleet Reliability Operations team is responsible for the day-to-day provisioning, management, and uptime of CoreWeave's ever-expanding fleet of server nodes. This team plays a central role in CoreWeave's growth strategy, configuring, updating, and remotely troubleshooting our highest-tier supercomputing clusters and their networking, delivery platforms, and tools dependencies.

We are seeking curious, creative, and persistent problem solvers to join our Fleet Reliability Operations team to help drive batches of server nodes through our provisioning and validation processes while efficiently and effectively troubleshooting node or cluster problems as they arise.

Key responsibilities include:

Configuring and maintaining large-scale high-performance supercomputing clusters running state-of-the-art GPUs
Troubleshooting hardware and software issues; escalating and coordinating as needed with data center, network, hardware, and platform teams to drive resolution
Monitoring and analyzing system performance and taking appropriate remediation actions for cloud health
Approaching work with flexibility and optimism, anticipating shifting business and technical priorities
Creating and maintaining documentation of team processes, knowledge, and best practices for system management
Thinking critically about day-to-day work and working collaboratively to improve team processes and efficiency

As a member of our team, you will be part of a dynamic and fast-paced environment where you will have the opportunity to grow and develop your skills. We offer a competitive salary range of $83,000 to $110,000, as well as a comprehensive benefits package, including medical, dental, and vision insurance, company-paid life insurance, and flexible PTO.

If you are a motivated and detail-oriented individual who is passionate about working with cutting-edge technology, we encourage you to apply for this exciting opportunity.

full-time mid hybrid $83,000 to $110,000 Linux system administration, Troubleshooting hardware and software issues, System maintenance tasks, Scripting languages (bash, python, powershell, etc), Grafana, Prometheus, promsql queries or similar observability platforms, Kubernetes administration, HPC - administering GPU-related workloads, Data center environments including server racks, HVAC systems, fiber trays Engineering Technology CoreWeave https://logos.yubhub.co/coreweave.com.png CoreWeave is a cloud computing company that provides a platform for building and scaling AI applications. https://www.coreweave.com https://job-boards.greenhouse.io/coreweave/jobs/4617382006 New York, NY /Plano, TX / Bellevue, WA / Sunnyvale, CA 2026-04-18

72ebb09d-b37 Staff+ Software Engineer, Observability

We're seeking talented and experienced Software Engineers to join our Observability team within the Infrastructure organization. The Observability team owns the monitoring and telemetry infrastructure that every engineer and researcher at Anthropic depends on, from metrics and logging pipelines to distributed tracing, error analytics, alerting, and the dashboards and query interfaces that make it all actionable.

As Anthropic scales its infrastructure across massive GPU, TPU, and Trainium clusters, the volume and complexity of operational data is growing by orders of magnitude. We're building next-generation observability systems (high-throughput ingest pipelines, cost-efficient columnar storage, unified query layers across signals, and agentic diagnostic tools) to ensure that engineers can detect, diagnose, and resolve issues in minutes rather than hours, even as the systems they operate become exponentially more complex.

Responsibilities:

Design and build scalable telemetry ingest and storage pipelines for metrics, logs, traces, and error data across Anthropic's multi-cluster infrastructure
Own and evolve core observability platforms, driving migrations and architectural improvements that improve reliability, reduce cost, and scale with organisational growth
Build instrumentation libraries, SDKs, and integrations that make it easy for engineering teams to emit high-quality telemetry from their services
Drive alerting and SLO infrastructure that enables teams to define, monitor, and respond to reliability targets with minimal noise
Reduce mean time to detection and resolution by building cross-signal correlation, unified query interfaces, and AI-assisted diagnostic tooling
Partner with Research, Inference, Product, and Infrastructure teams to ensure observability solutions meet the unique needs of each organisation

You May Be a Good Fit If You:

Have 10+ years of relevant industry experience building and operating large-scale observability or monitoring infrastructure
Have deep experience with at least one observability signal area (metrics, logging, tracing, or error analytics) and familiarity with the others
Understand high-throughput data pipelines, columnar storage engines, and the tradeoffs involved in ingesting and querying telemetry data at scale
Have experience operating or building on top of observability platforms such as Prometheus, Grafana, ClickHouse, OpenTelemetry, or similar systems
Have strong proficiency in at least one of Python, Rust, or Go
Have excellent communication skills and enjoy partnering with internal teams to improve their operational visibility and incident response capabilities
Are excited about building foundational infrastructure and are comfortable working independently on ambiguous, high-impact technical challenges

Strong Candidates May Also Have:

Experience operating metrics systems at very high cardinality (hundreds of millions of active time series or more)
Experience with log storage migrations or operating columnar databases (ClickHouse, BigQuery, or similar) for analytics workloads
Experience with OpenTelemetry instrumentation, collector pipelines, and tail-based sampling strategies
Experience building or operating alerting platforms, on-call tooling, or SLO frameworks at scale
Experience with Kubernetes-native monitoring, eBPF-based observability, or continuous profiling
Interest in applying AI/LLMs to operational workflows such as automated root cause analysis, anomaly detection, or intelligent alerting

The annual compensation range for this role is $405,000-$485,000 USD.

full-time staff hybrid $405,000-$485,000 USD observability, monitoring, telemetry, metrics, logging, tracing, error analytics, alerting, SLO infrastructure, cross-signal correlation, unified query interfaces, AI-assisted diagnostic tooling, Python, Rust, Go, Prometheus, Grafana, ClickHouse, OpenTelemetry, high-throughput data pipelines, columnar storage engines, operating system administration, cloud computing, containerization, DevOps Engineering Technology Anthropic https://logos.yubhub.co/anthropic.com.png Anthropic is a public benefit corporation that creates reliable, interpretable, and steerable AI systems. https://www.anthropic.com/ https://job-boards.greenhouse.io/anthropic/jobs/5139910008 San Francisco, CA | New York City, NY | Seattle, WA 2026-04-18

fb9b187c-e32 HPC Engineer

We are seeking a skilled and driven NVLink Engineer to support large-scale data center deployments. In this role, you'll be at the forefront of cutting-edge infrastructure technologies, ensuring the optimal performance and stability of NVLink systems.

Key Responsibilities:

Support the deployment of NVLink systems across large data center environments.
Support the full lifecycle management of NVLink hardware and software components.
Build and maintain tooling to automate and streamline the deployment, monitoring and troubleshooting workflows.
Diagnose and resolve performance, connectivity and stability issues in complex environments.
Collaborate with internal teams and external customers worldwide.
Participate in a rotating on-call schedule to ensure 24/7 support coverage.

Required Qualifications:

Solid understanding of networking fundamentals.
Proven background in troubleshooting network and server hardware at the component level.
Strong Linux system administration skills.
Proficiency in at least one language (e.g., Python, Go).
Proven ability to troubleshoot and debug complex application issues.
Excellent communication and collaboration skills.
Experience with Ansible.

Preferred Qualifications:

Experience with InfiniBand networking.
Experience managing large-scale environments (1,000+ switches or nodes).
Prior experience with NVLink technologies.
Knowledge of Redfish API for system management.
Experience with NVUE (NVIDIA User Experience).
Background with SONiC.
Experience with Grafana/PromQL.

full-time senior hybrid $109,000 to $204,000 Networking fundamentals, Linux system administration, Python, Go, Troubleshooting and debugging, InfiniBand networking, Ansible, Redfish API, NVUE, SONiC, Grafana/PromQL Engineering Technology CoreWeave https://logos.yubhub.co/coreweave.com.png CoreWeave is a cloud computing company that provides a platform for building and scaling AI applications. https://www.coreweave.com https://job-boards.greenhouse.io/coreweave/jobs/4645664006 New York, NY/ Bellevue, WA/ Sunnyvale, CA / Livingston, NJ 2026-04-18

c916726e-d71 Principal Software Engineer (Networking) - Platform

As a Principal Software Engineer (Networking) - Platform, you will lead technical initiatives for automating network engineering efforts to guarantee the reliability of the global Elastic infrastructure. You will grow our global Platform infrastructure to meet the increasing scaling demands by developing and maintaining software, codebases, tooling and automations.

You will participate in coding, innovating technical designs, crafting solutions, improving resilience, and prioritizing security, bug fixes, and features. For example, debugging Azure Networking for Elastic Cloud Serverless is part of our efforts, and we want your experience to contribute to a truly exceptional customer experience!

You will take an engineering approach in leading technical initiatives for automating network engineering efforts to guarantee the reliability of the global Elastic infrastructure. You will grow our global Platform infrastructure to meet the increasing scaling demands by developing and maintaining software, codebases, tooling and automations.

You will collaborate in an environment with an inclusive approach, and focus on operational perfection which uplifts others. Prevent repeated customer impact in response to major incidents and prioritized problem management. Our on-call rotation is spread well, and we address complex customer concerns too.

You bring successes and lessons learned from striving for 'progress not perfection' in the name of Platform reliability. We want to hear about your customer-first approach in solving operational problems for both today and the future.

You have a passion for developing solutions that involve inclusive communication methods to grow and strengthen partner and team relationships. Examples of working in distributed teams or working remotely are desirable.

You have designed and built a SaaS product in a public cloud, ideally using Infrastructure-as-Code tooling such as Crossplane or Terraform.

You have built Kubernetes-at-scale infrastructure, ideally across multiple cloud providers, and the vital automation to support it.

You have written product features or functions in Golang or other programming languages.

You have worked with containerized services (such as Docker).

You have proven results in leading and improving cross-team engineering initiatives.

You have experience in system administration with professional skills in Linux on distributed systems at scale.

You have diagnosed or designed, implemented and created solutions with the Elastic Stack.

You are experienced in working in a self-organizing, knowledge-sharing, globally distributed team environment.

You strengthen team members in bringing out the best of each other by uplifting others with coaching and mentoring.

Compensation for this role is in the form of base salary. This role does not have a variable compensation component. The typical starting salary range for new hires in this role is $189,800-$232,900 USD. In select locations (including Seattle WA, Los Angeles CA, the San Francisco Bay Area CA, and the New York City Metro Area), an alternate range may apply as specified below.

Elastic believes that employees should have the opportunity to share in the value that we create together for our shareholders. Therefore, in addition to cash compensation, this role is currently eligible to participate in Elastic's stock program. Our total rewards package also includes a company-matched 401k with dollar-for-dollar matching up to 6% of eligible earnings, along with a range of other benefits offered with a holistic emphasis on employee well-being.

full-time senior remote $189,800-$232,900 USD Software Engineering, Cloud Network Solutions, Public Cloud, Go, Managed Kubernetes Services, Linux, Distributed Systems, Elastic Stack, Infrastructure-as-Code, Crossplane, Terraform, Kubernetes, Containerized Services, Docker, System Administration, Golang, Programming Languages, SaaS Product Development, Kubernetes-at-Scale Infrastructure, Automation, Self-Organizing Team Environment, Coaching and Mentoring Engineering Technology Elastic https://logos.yubhub.co/elastic.co.png Elastic is a search AI company that enables everyone to find the answers they need in real time, using all their data, at scale. The Elastic Search AI Platform is used by more than 50% of the Fortune 500. https://www.elastic.co/ https://job-boards.greenhouse.io/elastic/jobs/7565185 United States 2026-04-18

1868194d-726 Operations Engineer, HPC Networking

In this role, you will support the deployment, monitoring, troubleshooting, and maintenance of large-scale InfiniBand fabrics, ensuring their stability and performance.

The ideal candidate will have a strong operations mindset, effective collaboration skills, and the ability to solve complex issues in a dynamic environment.

Key responsibilities include:

Regularly monitoring the performance and health of InfiniBand fabrics, including switches, host adapters, and nodes.
Investigating and resolving operational issues within InfiniBand fabrics, such as network connectivity problems and performance bottlenecks.
Assisting with the installation and operational bring-up of large InfiniBand fabrics in collaboration with onsite personnel and customer teams.
Performing routine maintenance and upgrades on InfiniBand switches and control plane components.
Collaborating with HPC cluster operations teams to provide troubleshooting and operational expertise.

Investing in our people is one of our top priorities, and we value candidates who can bring their diversified experiences to our teams.

Minimum Qualifications:

At least 1 year of experience with InfiniBand or similar networking technologies.
Solid understanding of networking concepts, including architectures, topologies, operational best practices, and troubleshooting.
Experience with Linux system administration and maintenance.
Proficiency in at least one scripting language.

Preferred Qualifications:

Hands-on experience with Nvidia UFM or similar fabric management tools.
Familiarity with SLURM job scheduler and its role in HPC environments.
Experience with monitoring and visualization platforms such as Grafana or Prometheus.
Experience with operational tooling and automation frameworks like Ansible.
Knowledge of data center operations, including server racks and cabling.
Python or Bash scripting.

Why CoreWeave? At CoreWeave, we work hard, have fun, and move fast! We’re in an exciting stage of hyper-growth that you will not want to miss out on. We’re not afraid of a little chaos, and we’re constantly learning. Our team cares deeply about how we build our product and how we work together, which is represented through our core values:

Be Curious at Your Core
Act Like an Owner
Empower Employees
Deliver Best-in-Class Client Experiences
Achieve More Together

We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and enables the development of innovative solutions to complex problems. As we get set for takeoff, the organization's growth opportunities are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too.

Come join us!

The base salary range for this role is $110,000 to $179,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation.

full-time mid hybrid $110,000 to $179,000 InfiniBand, Linux system administration, Scripting language, Networking concepts, Architectures, Topologies, Operational best practices, Troubleshooting, Nvidia UFM, SLURM job scheduler, Grafana, Prometheus, Ansible, Data center operations, Server racks, Cabling, Python, Bash scripting Engineering Technology CoreWeave https://logos.yubhub.co/coreweave.com.png CoreWeave is a cloud computing company that provides a platform for building and scaling AI applications. https://www.coreweave.com https://job-boards.greenhouse.io/coreweave/jobs/4673462006 Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA 2026-04-18

468f1224-f5c AV Systems Administrator

We are seeking an AV Systems Administrator to maintain and support our AV infrastructure, ensuring all conferencing systems are operational and users are well-trained. The position is based at our Washington DC location but may involve travel to other sites.

The successful candidate will oversee AV infrastructure across the organization, including hardware in conference rooms, All-Hands spaces, and training rooms. They will keep systems updated with the latest operating systems, software, and applications to ensure availability during business hours.

Key responsibilities include providing specialized support for a variety of AV events, from podcasts to large-scale conferences, designing and installing AV systems utilizing a standard design concept across a variety of work spaces, and assisting in the coordination and implementation of new office, production floors, warehouses and other spaces requiring AV installations.

The ideal candidate will have a bachelor's degree in computer science or a related field, or equivalent experience, and demonstrated expertise in AV system administration. They will also have familiarity with AV software and conferencing tools, knowledge of AV equipment, experience with virtualization technologies, strong analytical and problem-solving abilities, and outstanding communication skills.

In addition to a competitive salary, we offer a range of benefits, including comprehensive medical, dental, and vision plans, income protection, generous time off, family planning and parenting support, mental health resources, professional development opportunities, and commuter benefits.

full-time mid onsite $33-$43 USD per hour AV system administration, AV software and conferencing tools, AV equipment, Virtualization technologies, Analytical and problem-solving skills, Communication skills, Experience with ticketing systems, Effective organizational skills, Enterprise/Corporate AV Management Systems, Live Audio and Video Production IT Technology Anduril https://logos.yubhub.co/anduril.com.png Anduril is a technology company that develops and manufactures advanced sensors and software for the military and defense industries. https://www.anduril.com/ https://job-boards.greenhouse.io/andurilindustries/jobs/5084907007 Washington, District of Columbia, United States 2026-04-18

a8092b6e-7f5 Bare Metal Support Engineer

As a Bare Metal Support Engineer at CoreWeave, you will be responsible for supporting, operating, and maintaining CoreWeave's extensive GPU fleet across our growing data centers in the U.S., Europe, and beyond.

You will work closely with customers, data center technicians, and engineering teams to ensure the reliability, performance, and scalability of our infrastructure.

Key responsibilities include:

Providing high-level support for customers utilizing bare-metal GPU fleets on CoreWeave Cloud.
Diagnosing, triaging, and investigating reported customer issues and high-priority incidents, identifying root causes and escalating when necessary.
Developing a deep understanding of customer workloads and use cases to provide tailored technical support.
Coordinating remote troubleshooting and hardware interventions with Data Center Technicians.
Creating and maintaining internal documentation, including troubleshooting guides, best practices, and knowledge base articles.
Participating in an on-call rotation to support production clusters and ensure operational reliability.
Collaborating with engineering teams to improve hardware reliability, software stability, and system performance.
Implementing automation and scripting to streamline support workflows and reduce manual interventions.
Performing in-depth log analysis and debugging across multiple layers of the stack (firmware, drivers, hardware).
Providing feedback to internal teams on common support issues to drive continuous improvements.
Working with networking teams to troubleshoot connectivity issues affecting customer workloads.
Supporting supercomputing infrastructure running GPU workloads at scale.
Driving operational excellence by refining internal processes and support methodologies.

To succeed in this role, you will need:

Experience in data centers, GPU clusters, server deployments, system administration, or hardware troubleshooting.
Demonstrated experience driving resolutions and continuous improvements across cross-functional environments and teams within a data center environment.
Intermediate knowledge of Linux (Ubuntu, CentOS, or similar), including command-line proficiency.
Experience with NVIDIA GPUs, SuperMicro systems, Dell systems, high-performance computing (HPC), and large-scale data center environments.
Experience in networking fundamentals (TCP/IP, VLANs, DNS, DHCP) and troubleshooting tools.
Hands-on experience with firmware updates, BIOS configurations, and driver management.
Experience analyzing system logs and debugging issues across firmware, drivers, and hardware layers.
Experience working with Jira, Confluence, Notion, or other issue-tracking and documentation platforms.
Experience in scripting and automation (Python, Bash, Ansible, or similar).

If you're a curious and analytical individual with a passion for problem-solving and a desire to work in a fast-paced environment, we'd love to hear from you!

full-time mid hybrid $83,000 to $132,000 Linux, GPU clusters, server deployments, system administration, hardware troubleshooting, NVIDIA GPUs, SuperMicro systems, Dell systems, high-performance computing, large-scale data center environments, networking fundamentals, troubleshooting tools, firmware updates, BIOS configurations, driver management, system logs, debugging issues, Jira, Confluence, Notion, issue-tracking, documentation platforms, scripting, automation, Kubernetes, Docker, containerized infrastructure Engineering Technology CoreWeave https://logos.yubhub.co/coreweave.com.png CoreWeave is a cloud computing company that delivers a platform of technology, tools, and teams to enable innovators to build and scale AI with confidence. https://www.coreweave.com https://job-boards.greenhouse.io/coreweave/jobs/4560350006 Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA 2026-04-18

d34bbf18-2b2 Senior Site Reliability Engineer (FinOps) - Platform

As a Senior Site Reliability Engineer (FinOps) - Platform, you will be part of the Platform Engineering department, responsible for designing, building, scaling, and maturing the multi-cloud platform for hosting internal and external services. You will lead technical initiatives for automating system engineering efforts to guarantee the reliability of the global Elastic infrastructure. You will also grow our global Platform infrastructure to meet the increasing scaling demands by developing and maintaining software, tooling, and automations.

Key responsibilities include:

Taking an engineering approach in leading technical initiatives for automating system engineering efforts to guarantee the reliability of the global Elastic infrastructure.
Growing our global Platform infrastructure to meet the increasing scaling demands by developing and maintaining software, tooling, and automations.
Using an inclusive approach to champion an environment focused on collaboration, operational excellence, and uplifting others.
Responding to and preventing repeated customer impact in response to major incidents and prioritized problem management.

The ideal candidate will bring successes and lessons learned from striving for 'progress not perfection' in the name of Platform reliability. They will have a background in software engineering, enabling them to collaborate with engineers to identify, implement, and deliver solutions. Experience with public cloud and managed Kubernetes services is advantageous.

The role requires passion for developing solutions that involve inclusive communication methods to grow and strengthen partner and team relationships. Experience working in distributed teams or working remotely is desirable.

Bonus points for experience in operating a SaaS product in a public cloud, building or operating a Kubernetes-at-scale infrastructure, writing non-trivial programs in Golang or other programming languages, working with containerized services, leading and improving alerting, major incident management standard processes, and metrics systems, and system administration with professional skills in Linux on distributed systems at scale.

XML job scraping automation by YubHub

]]> full-time senior remote Cloud computing, Kubernetes, Golang, Containerization, Linux, System administration, Alerting and incident management, Infrastructure-as-Code, Terraform, Crossplane, Distributed systems, Self-organizing teams Engineering Technology Elastic https://logos.yubhub.co/elastic.co.png Elastic develops a search engine and analytics platform used by over 50% of the Fortune 500 companies. https://www.elastic.co/ https://job-boards.greenhouse.io/elastic/jobs/7565188 Spain 2026-04-18 db7b0f51-7df Senior Cloud Support Engineer As a Senior Cloud Support Engineer at CoreWeave, you'll be on the front lines of a technological revolution, empowering our customers to harness the full potential of our advanced Kubernetes-powered HPC cloud infrastructure.

You'll be hands-on, collaborating with engineers and researchers to resolve issues that impact high-profile, mission-critical applications and cutting-edge AI training workloads. Your contributions will be pivotal in ensuring seamless performance, reliability, and success for our customers, positioning you at the very core of transformative technologies reshaping industries worldwide at a company that is truly one of a kind.

In this role, you will:

Guide and mentor team members in developing their technical skills and troubleshooting capabilities across all disciplines supported by CoreWeave.
Provide real-time feedback and coaching, reviewing tickets to identify opportunities for improvement and ensure quality assurance (QA).
Develop and deliver training sessions to improve the team's proficiency and efficiency in resolving customer issues.
Use technical expertise to investigate, debug, and resolve customer-impacting issues with the curiosity required to uncover and understand root causes.
Maintain high customer satisfaction through swift, accurate, and empathetic high-touch support communications, as well as established best practices.
Help design and implement troubleshooting best practices to ensure fast, accurate client resolutions.
Contribute to refining processes, workflows, and playbooks for handling complex customer challenges.
Serve as a technical escalation point for high-priority escalations or complex cases, modeling effective problem-solving approaches.
Lead the creation of knowledge-sharing resources, including documentation, tutorials, and how-to guides.
Enhance the support team's knowledge of CoreWeave's products and services through continuous learning initiatives.

Who You Are:

Have a Bachelor's degree in Information Science / Information Technology, Data Science, Computer Science, Engineering, Mathematics, Physics, or a related field, OR equivalent experience in a technical position
5+ years of experience in cloud support, systems administration, or related technical support-focused roles
Proven hands-on work experience with Kubernetes
Experience with networking, load balancing, storage volumes, observability, node management, High-Performance Computing (HPC), and Linux system administration
Proven ability to mentor team members, foster technical growth, and improve team-wide capabilities through guidance and feedback
Experience with observability tools such as Grafana
Strong troubleshooting skills, with experience resolving complex customer issues and driving quality assurance through ticket reviews or similar processes
Demonstrated success collaborating with cross-functional teams to refine workflows, implement best practices, and advocate for necessary tools or process changes
Excellent written and verbal communication skills, with a track record of simplifying complex concepts for diverse audiences
Strong technical presentation skills, with experience delivering precise, engaging, and informative presentations to technical and non-technical audiences, effectively showcasing complex concepts and solutions

Preferred:

CKA Certified
Demonstrated experience with training, coaching, and creating onboarding materials.
Operates in a fast-paced, global, 24/7 support team environment
Ability to collaborate across different time zones
On-site office environment, hybrid, or remote options depending on location
Flexible to travel up to 10% (~25 days/year)

Why CoreWeave?

At CoreWeave, we work hard, have fun, and move fast! We're in an exciting stage of hyper-growth that you will not want to miss out on. We're not afraid of a little chaos, and we're constantly learning. Our team cares deeply about how we build our product and how we work together, which is represented through our core values:

Be Curious at Your Core
Act Like an Owner
Empower Employees
Deliver Best-in-Class Client Experiences
Achieve More Together

We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and provides the opportunity to develop innovative solutions to complex problems. As we get set for takeoff, the growth opportunities within the organization are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too.

Come join us!


]]> full-time senior hybrid $122,000 to $163,000 cloud support, systems administration, Kubernetes, networking, load balancing, storage volumes, observability, node management, High-Performance Computing (HPC), Linux system administration, CKA Certified, training, coaching, onboarding materials, fast-paced global support team environment, collaboration across different time zones Engineering Technology CoreWeave https://logos.yubhub.co/coreweave.com.png CoreWeave is a cloud computing company that provides a platform for artificial intelligence (AI) workloads. https://www.coreweave.com https://job-boards.greenhouse.io/coreweave/jobs/4568136006 Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA 2026-04-18 1a7635f5-a02 Principal Software Engineer (Networking) - Platform As a Principal Software Engineer (Networking) - Platform, you will be part of the Platform Engineering department, responsible for crafting, building, and improving the multi-cloud platform at scale for Elastic Cloud Hosted and Serverless. You will participate in coding, innovating technical designs, crafting solutions, improving resilience, and prioritizing security, bug fixes, and features.

Key responsibilities include:

Taking an engineering approach in leading technical initiatives for automating network engineering efforts to guarantee the reliability of the global Elastic infrastructure.
Growing our global Platform infrastructure to meet the increasing scaling demands by developing and maintaining software, codebases, tooling, and automations.
Collaborating in an environment with an inclusive approach, and focusing on operational perfection which uplifts others.
Preventing repeated customer impact in response to major incidents and prioritized problem management.

Requirements include:

10+ years in Software Engineering with product success in delivering Cloud network solutions.
Experience in public cloud, Go, and managed Kubernetes services is advantageous.
Successes and lessons learned from striving for 'progress not perfection' in the name of Platform reliability.
Passion for developing solutions that involve inclusive communication methods to grow and strengthen partner and team relationships.

Bonus points include:

Designing and building a SaaS product in a public cloud ideally built using Infrastructure-as-Code tooling such as Crossplane or Terraform.
Building Kubernetes-at-scale infrastructure, ideally across multiple cloud providers, and the vital automation to support it.
Writing product features or functions in Golang or other programming languages.
Working with containerized services (such as Docker).
Proven results in leading and improving cross-team engineering initiatives.
Experience in system administration with professional skills in Linux on distributed systems at scale.
Diagnosing or designing, implementing, and creating solutions with the Elastic Stack.
Experience working in a self-organizing, knowledge-sharing, globally distributed team environment.
Strengthening team members in bringing out the best of each other by uplifting others with coaching and mentoring.


]]> full-time senior remote Software Engineering, Cloud Network Solutions, Public Cloud, Go, Managed Kubernetes Services, Infrastructure-as-Code, Crossplane, Terraform, Golang, Containerized Services, Docker, System Administration, Linux, Distributed Systems, Kubernetes, Automation, Inclusive Communication, Coaching and Mentoring Engineering Technology Elastic https://logos.yubhub.co/elastic.co.png Elastic is a search AI company that enables everyone to find the answers they need in real time, using all their data, at scale. Its search AI platform is used by over 50% of the Fortune 500. https://www.elastic.co/ https://job-boards.greenhouse.io/elastic/jobs/7713597 Spain 2026-04-18 35ca76f9-e25 Product Technical Support Associate, Edge Compute Systems As a Product Technical Support Associate for Edge Compute Systems, you will play a critical role in ensuring the reliability and readiness of Anduril's fixed-site and expeditionary asset control solutions. GCS is designed to deliver real-time planning and control of autonomous systems at the tactical edge through several form-factor solutions to support system employment in any situation.

In this role, you will support end users by improving field failure discovery, mitigation, and resolution processes, conducting root cause analysis, deploying fixes, and managing incidents across the GCS fleet. This position requires a strong problem-solving mindset and hands-on expertise in debugging and resolving complex compute hardware and software issues.

Key responsibilities include:

Sustain Anduril's GCS deployments by combining an understanding of our customers' missions with familiarity of our products and delivered capabilities
Triage, diagnose and root cause product incidents, driving postmortem actions including providing status visibility through resolution
Consistently assess and seek to improve the quality of the fleet's observability and health telemetry in partnership with multiple functions across the GCS team
Collect, organize, and analyze system failure data to define trends, drive proactive sustainment processes, and support resource allocation
Support Anduril's global customers through proactive communications and detail-oriented execution
Support the evaluation and improvement of product capabilities, analyzing customer communication and feedback for capability requirements, product performance indicators, and desired functionality

Required qualifications include:

4+ years of technical support experience with a focus on final-tier customer concern support
Experience supporting and/or performing incident driven workflows requiring analysis, triage, and prioritization
Experience in on-call support operations and working in limited risk tolerance environments
Ability to work non-standard hours and weekends as needed
Ability to obtain and maintain a U.S. Secret security clearance

Preferred qualifications include:

BA or BS degree from accredited institution, STEM degree, preferably in computer science, software engineering, electrical engineering, information technology, or similar
Experience supporting and/or operating compute-enabled communications systems, including electronic warfare domain experience, as a DOD employee, contractor, or end-user
Experience with observability tooling such as DataDog, Grafana, and Victor Ops; exposure to software development tooling such as Git and Jira
Applicable industry certifications (e.g. CompTIA Network+, CCNA, Linux+)
Familiarity with and/or experience administering NixOS systems
Experience working as a system administrator
Experience executing sustainment and reliability workflows for a defense-focused service or product
DOD, Law Enforcement, or other Government agency experience preferred
Demonstrated experience as a self-starter, able to find and resolve issues on your own
Experience performing trend analysis to inform business decisions
Strong aptitude for problem solving in unstructured situations at the interface of hardware, software, and networking
Ability to drive challenging and vague technical problems to clarity and resolution
Proven ability to master a technical system and support it in operational environments
Must demonstrate an innate drive to be self-sufficient across the depth and breadth of a technical system
Daily practice of excellence and rigor - you execute the 100th rep of a process with the same focus and care as the first five reps
Confident with navigating ambiguity and crafting new ways of doing things
Excellent written, visual, and verbal communication skills
Active SECRET (or higher level) security clearance

US Salary Range: $113,000-$149,000 USD


]]> full-time mid onsite $113,000-$149,000 USD Technical support, Problem-solving, Debugging, Root cause analysis, Incident management, Observability, Health telemetry, System failure analysis, Proactive sustainment, Resource allocation, Customer communication, Detail-oriented execution, Product evaluation, Capability requirements, Product performance indicators, Desired functionality, Computer science, Software engineering, Electrical engineering, Information technology, NixOS systems administration, System administration, Sustainment and reliability workflows, Defense-focused services, Government agency experience, Self-starting, Trend analysis, Ambiguity navigation, Communication skills Engineering Technology Anduril Industries https://logos.yubhub.co/andurilindustries.com.png Anduril Industries is a defense technology company that designs, builds, and sells military systems. It has a family of systems powered by Lattice OS, an AI-powered operating system. https://www.andurilindustries.com/ https://job-boards.greenhouse.io/andurilindustries/jobs/5083881007 Costa Mesa, California, United States 2026-04-18 a561c761-1f3 Manager, Bare Metal Support Engineering The Customer Experience (CX) Organisation at CoreWeave is dedicated to ensuring every client running AI workloads at scale has a seamless, reliable, and high-performance experience.

As a Manager of Bare Metal Support Engineering, you'll be at the centre of ensuring our dedicated infrastructure remains stable, reliable, and performant. You'll lead daily support operations, triage incidents, drive escalations, and ensure that hardware is monitored, maintained, and delivered effectively for our clients.

Key responsibilities include:

Leading a skilled team responsible for maintaining and optimising physical infrastructure across multiple client environments.
Building, developing, and leading a dedicated Infrastructure Support team focused on supporting key infrastructure, handling escalations, and ensuring smooth hardware operations.
Overseeing the resolution of infrastructure-related incidents, escalation management, and collaborating with internal teams to deliver effective solutions.
Improving support processes to enhance efficiency and reduce downtime, ensuring the infrastructure meets client expectations.

The ideal candidate will have 5+ years of experience leading teams responsible for infrastructure support, data centre operations, or physical compute environments. They should be hands-on with Linux system administration and command-line tools, familiar with hardware-level diagnostics, troubleshooting, and replacement, and have experience working with high-performance rack-scale hardware.

In addition to the required skills, preferred skills include experience managing infrastructure support teams in high-growth or rapidly evolving environments, proven ability to develop and implement operational processes that scale with business needs, and strong familiarity with server and GPU hardware lifecycle management.


]]> full-time senior onsite $170,000 to $240,000 SGD Linux system administration, Command-line tools, Hardware-level diagnostics, Troubleshooting and replacement, High-performance rack-scale hardware, Managing infrastructure support teams, Developing and implementing operational processes, Server and GPU hardware lifecycle management Engineering Technology CoreWeave https://logos.yubhub.co/coreweave.com.png CoreWeave is a cloud computing company that provides a platform for building and scaling AI workloads. https://www.coreweave.com https://job-boards.greenhouse.io/coreweave/jobs/4649055006 Singapore 2026-04-18 a22de8a6-69b Senior Professional Services Consultant About Us

At Cloudflare, we are on a mission to help build a better Internet. Today the company runs one of the world’s largest networks that powers millions of websites and other Internet properties for customers ranging from individual bloggers to SMBs to Fortune 500 companies.

As a Senior Professional Services Consultant for Zero Trust Services/SASE and Network Services, you are an individual contributor working in the post-sales landscape, responsible for the technical execution of solutions and guidance to our customers to get the most value possible from their Cloudflare investment.

Responsibilities:

Plan and deliver timely and organized services for customers, ensure customers see the full value in Cloudflare’s products, and advise on product best practices.

Gather business and technical requirements, use cases and any other information required to build, migrate and deliver a solution on behalf of the customer and transition the Cloudflare working environment to the customer.

Produce a Solution Design, HLD, LLD, databuilds, procedures, scripts, test plans, drawings, deployment plan, migration plan, as-builts, and any other artifacts necessary to deliver the solution and transition smoothly into the customer’s technical teams.

Implement changes on behalf of the customer in the Cloudflare environment following the customer’s change management process.

Provide guidance to the customer to configure their CPEs and integration points.

Troubleshoot implementation issues and collaborate with Customer Support, Engineering and other teams to assist technical escalations.

Contribute towards the success of the organization through knowledge sharing activities such as contributing to internal and external documentation, answering technical Q&A, and helping to iterate on best practices.

Support building operational assets like templates, automation scripts, procedures, workflows, etc.

Experience might include a combination of the skills below:

5+ years of experience working in a customer-facing technical implementation or onboarding team, security engineering team, security operations center or other highly technical team

Layers and protocols of the OSI model, such as TCP/IP, TLS, DNS, HTTP

Deep understanding of Internet and Security technologies such as SWG, Sandboxing, Firewalls, DLP, and VPNs

Deep understanding of Zero Trust technologies such as Zero Trust Network Access (ZTNA), Secure Web Gateway (SWG), Remote Browser Isolation (RBI), Data Loss Prevention (DLP), CASB and Cloud Email Security

Deep understanding of SSO, ADFS, LDAP and SAML and the role of the IdP for SSO.

Security aspects of an internet property, such as Firewalls, WAFs, Bot Management, Rate Limiting, (M)TLS

Performance aspects of an internet property, such as Speed, Latency, Caching, HTTP/2, TLSv1.3

Network transformation technologies such as MPLS, SD-WAN or WAN Optimization

Ability to manage a project, work to deadlines, and prioritize between competing demands

Demonstrated technical experience (hands-on-keyboard) with:

Experience working with Reverse & Forward proxies with Content Filtering & DLP technologies

Software distribution & deployment in Microsoft, Mac and Linux environments

Detailed working knowledge of web based security, network segmentation, security proxies, Firewalls and NGFWs, SSL/IPSec and SSL/TLS VPN appliances & services, etc

Windows & Linux Server system administration

Windows, MacOS and Linux client endpoints troubleshooting

Experience debugging Network connectivity issues using Network Troubleshooting Tools

Preferred but not mandatory:

Software distribution via Mobile Device Management platforms

Experience with Microsoft Active Directory, GPO, Azure AD and Intune

Experience with Zero Trust implementation and a deep understanding of Zero Trust frameworks such as NIST will be a huge plus

Experience implementing, or an understanding of, regulatory requirements such as PCI DSS, HIPAA, and SOC-2

Security skills and certifications such as CISSP, GCIA, GCIH, GCFA, GCFE, CCIE, CCNP, JNCIE or MCSE will be a huge plus


]]> full-time senior hybrid TCP/IP, TLS, DNS, HTTP, SWG, Sandboxing, Firewalls, DLP, VPNs, Zero Trust Network Access, Secure Web Gateway, Remote Browser Isolation, Data Loss Prevention, CASB, Cloud Email Security, SSO, ADFS, LDAP, SAML, WAFs, Bot Management, Rate Limiting, MPLS, SD-WAN, WAN Optimization, Windows Server system administration, Linux Server system administration, Windows client endpoints troubleshooting, Linux client endpoints troubleshooting, Network Troubleshooting Tools Engineering Technology Cloudflare https://logos.yubhub.co/cloudflare.com.png Cloudflare is a technology company that provides internet infrastructure and security services. It operates one of the world's largest networks, powering millions of websites and other internet properties. https://www.cloudflare.com/ https://job-boards.greenhouse.io/cloudflare/jobs/7714624 Hybrid 2026-04-18 9bf55fe3-b2b Detection & Response Engineer We are seeking a skilled and proactive Detection & Response Engineer to join our security team. In this critical role, you will be responsible for detecting, investigating, and responding to security incidents across our cloud-native and AI-focused infrastructure.

Responsibilities

Monitor and analyse security alerts and logs to identify potential threats and anomalies
Develop, implement, and maintain detection rules and correlation logic in our SIEM platform
Conduct thorough investigations of security incidents, performing root cause analysis and impact assessments
Lead incident response efforts, coordinating with relevant teams to contain and mitigate threats
Create and maintain incident response playbooks and runbooks
Perform regular threat hunting activities to proactively identify potential security risks
Develop and refine metrics and reporting to track the effectiveness of detection and response capabilities
Collaborate with other security teams to improve overall security posture and incident handling processes
Stay current with emerging threats, attack techniques, and defensive strategies in the cloud-native and AI domains

Basic Qualifications

Bachelor's degree in Computer Science, Cybersecurity, or a related field
3-5 years of experience in security operations, incident response, or a similar role
Strong understanding of cybersecurity principles, attack techniques, and defensive strategies
Proficiency in at least one scripting language (e.g., Python, Rust) for automation and tool development
Experience with SIEM platforms and log analysis tools
Familiarity with cloud environments (e.g., AWS, GCP, Azure) and their security features
Knowledge of network protocols, system administration, and common attack vectors
Strong analytical and problem-solving skills with attention to detail
Excellent communication skills and ability to work effectively under pressure

Preferred Skills and Experience

Relevant security certifications (e.g., GCIH, GCIA, SANS)
Experience with threat intelligence platforms and their integration into detection processes
Familiarity with AI/ML security implications, particularly those outlined in the OWASP LLM Top 10
Knowledge of software supply chain security and SBOM analysis
Experience with containerized environments and Kubernetes security
Experience in building custom security tools or integrations to enhance detection and response capabilities
Interest in leveraging AI to improve threat detection and automate response processes
Contributions to open-source security projects or threat research
Experience with digital forensics and malware analysis

Compensation and Benefits

$200,000 - $340,000 USD

Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.


]]> full-time mid onsite $200,000 - $340,000 USD cybersecurity principles, attack techniques, defensive strategies, scripting language, SIEM platforms, log analysis tools, cloud environments, network protocols, system administration, common attack vectors, relevant security certifications, threat intelligence platforms, AI/ML security implications, software supply chain security, containerized environments, Kubernetes security, custom security tools, digital forensics, malware analysis Engineering Technology xAI https://logos.yubhub.co/xai.com.png xAI’s mission is to create AI systems that aid humanity in its pursuit of knowledge. The organisation is small and highly motivated. https://www.xai.com/ https://job-boards.greenhouse.io/xai/jobs/4559148007 Palo Alto, CA 2026-04-18 5f5e14c0-796 System Administrator Meet Yubico: the creator of the most secure passkeys and leading provider of hardware authentication security keys.

Our company’s mission is to make secure login easy and available for everyone. Yubico was founded in 2007 by Stina and Jakob Ehrensvard, and is public on Nasdaq Stockholm Main Market: YUBICO.

We are a global company with a strong company culture and employees located in over 14 countries. Yubico’s headquarters are based in Stockholm, Sweden and Santa Clara, CA.

Aligned with our mission to make the internet more secure for everyone, Yubico donates YubiKeys to organisations helping at-risk individuals through our philanthropic initiative, Secure it Forward.

We are looking for a talented individual to join the IT team to help with the growing needs in our Singapore office. As our first on-site IT team member, you'll be responsible for providing comprehensive IT support, managing endpoints, and administering systems.

About you:

You are a systematic troubleshooter
You like to work collaboratively and have fun
You are customer service-oriented
You conduct yourself with the utmost integrity and are security-minded

Tasks & Responsibilities:

Build and configure endpoints and peripheral devices (such as printers, scanners, mobile devices) related to desktop infrastructure
Facilitate IT onboarding and offboarding tasks
Manage IT assets, vendors, and procurement
Install, configure and maintain devices, providing regular maintenance to ensure optimal functionality
Provide global remote support, supporting “follow the sun” Service Desk operations
Identify, log, troubleshoot, and resolve technical problems both locally and remotely
Create and provide end-user training and documentation
Create and update knowledge base articles and internal documentation.
Collaborate with teams across multiple time zones
Manage office networking
Some domestic and international travel may be required

Basic Qualifications:

5+ years experience providing end user support to in-office employees as well as remote workers around the world
Experience with Okta (or other identity provider) and Google Workspace administration
Highly skilled in Windows and macOS endpoint configuration and support
Endpoint hardening and configuration management experience
Conference and Audio Visual support experience
System administration of endpoint systems and cloud-based applications (SaaS)
Office networking
Organized in nature, with great documentation and communication skills

Bonus Qualifications:

Familiarity with Atlassian products Jira Software, Jira Service Desk, or Confluence administration
Familiarity with Physical Security & Office Access Management systems

Additional Information

We are an equal opportunity employer. We value diversity and uphold an inclusive environment where all people feel that they are equally respected and valued. All applicants will be considered for employment without attention to race, color, religion, sex, sexual orientation, gender identity or expression, age, marital status, national origin, disability, protected Veteran status or any other characteristic protected by law.


]]> full-time mid onsite Okta, Google Workspace administration, Windows and macOS endpoint configuration and support, Endpoint hardening and configuration management, Conference and Audio Visual support, System administration of endpoint systems and cloud-based applications (SaaS), Office networking IT Technology Yubico https://logos.yubhub.co/yubico.com.png Yubico is a Swedish company that specialises in creating secure passkeys and hardware authentication security keys. It has a global customer base and is publicly listed on the Nasdaq Stockholm Main Market. https://www.yubico.com/ https://jobs.lever.co/yubico/03892dff-f87e-40b5-81ba-ddcc856d2baf Singapore 2026-04-17 8c5fdc6a-a68 Senior Engineer, Build and CI As a Hivemind Build and CI engineer, you will design and implement engineering-centric automation across the organisation. You will work closely with product development teams, implementing policies and guidelines into the continuous integration and delivery systems.

This role requires you to be very hands-on and contribute to discussions with cross-functional teams across the organisation. We embrace an attitude that focuses on solving the root cause of problems efficiently.

A large part of your day-to-day will be spent in our build pipelines and build configuration management, focusing on changes that reduce developer iteration time.

Responsibilities:

Embed with a product engineering team as their primary Software Operations partner, working closely with engineers to improve how software is built, tested, and delivered.

Design, implement, and continuously improve build pipelines, CI workflows, and supporting tooling with a focus on scalability, reliability, and developer experience.

Apply strong C++ and/or Go software development experience to build and maintain robust build and CI solutions.

Reduce iteration time and friction by improving build performance, test reliability, and CI feedback loops.

Debug and resolve complex build, test, and CI failures using disciplined root-cause analysis.

Influence technical direction without formal authority by earning trust through collaboration, technical credibility, and a deep understanding of team and program constraints.

Promote best practices in build hygiene, CI/CD design, dependency management, and software development workflows that scale across teams and programs.

Apply knowledge of software design patterns and architectural principles to design maintainable CI systems and build abstractions.

Coach and mentor product engineers on build and CI topics, helping teams make better design decisions and understand trade-offs.

Represent the Software Operations organisation within the product team, acting as a bridge between platform capabilities and product needs.

Advocate for practical, production-ready solutions that improve developer productivity without sacrificing velocity or quality.

Requirements:

BS in computer science or related engineering field with 3+ years of professional experience.

Experience with configuration management tools (Makefile, CMake, Conan, Bazel, etc.).

Strong demonstrated proficiency in continuous integration/delivery (e.g. GitHub Actions, ADO, TeamCity).

Strong understanding of C++ (or other compiled language), Linux and CMake.

Strong knowledge of APIs, web services, and identity access management.

Strong knowledge of containers (e.g. Docker, Podman, etc.).

Strong knowledge of scripting languages (Bash, Python, PowerShell).

Strong knowledge of Git.

Strong system administration skills in Linux (Windows is a bonus).

Strong desire to learn and grow on the job.

Preferences:

Strong experience with the Conan package manager.

Experience with Rust in a production environment.

Experience with Hardware in the Loop build/deploy/test systems.

Experience owning build infrastructure.

Experience with NVIDIA Jetson products.

Salary and Benefits:

$120,000 - $180,000 a year

Pay within range listed + Bonus + Benefits + Equity

Temporary employee offer package:

Pay within range listed above + temporary benefits package (applicable after 60 days of employment)

Salary compensation is influenced by a wide array of factors including but not limited to skill set, level of experience, licenses and certifications, and specific work location. All offers are contingent on a cleared background and possible reference check. Military fellows and part-time employees are not eligible for benefits. Please speak to your talent acquisition representative for more information.

XML job scraping automation by YubHub

]]> full-time senior onsite $120,000 - $180,000 a year configuration management tools, continuous integration/delivery, C++, Linux, CMake, APIs, web services, identity access management, containers, scripting languages, Git, system administration in Linux, Conan Package Manager, Rust, Hardware in the Loop build/deploy/test systems, NVIDIA Jetson products Engineering Technology Shield AI https://logos.yubhub.co/shield.ai.png Shield AI is a venture-backed deep-tech company founded in 2015, developing intelligent systems for protecting service members and civilians. https://www.shield.ai https://jobs.lever.co/shieldai/6cdd98c9-6579-4609-8ac3-9fc0604f6160 San Diego 2026-04-17 a2e88648-d1d Mistral Cloud - Site Reliability Engineer We are seeking highly experienced Site Reliability Engineers (SRE) to shape the reliability, scalability and performance of our Cloud platform and customer facing applications.

You will work closely with our software engineers and product teams to ensure our systems meet and exceed our internal and external customers' expectations.

Key responsibilities include:

Design, build, and maintain scalable, highly available and fault-tolerant infrastructures
Operate systems and troubleshoot issues in production environments
Implement and improve monitoring, alerting, and incident response systems
Implement and maintain workflows and tools for both our customer-facing APIs and large training runs

Development responsibilities include:

Drive continuous improvement in infrastructure automation, deployment, and orchestration
Collaborate with software engineers to develop and implement solutions that enable safe and reproducible model-training experiments
Help build a cloud platform offering an abstraction layer between science, engineering and infrastructure
Design and develop new workflows and tooling to improve the reliability, availability and performance of our systems

Additional responsibilities include:

Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements
Document processes and procedures to ensure consistency and knowledge sharing across the team
Contribute to open-source projects, research publications, blog articles and conferences

About you:

Master’s degree in Computer Science, Engineering or a related field
5+ years of experience in a DevOps/SRE role
Strong experience with bare metal infrastructure and highly available distributed systems
Exposure to site reliability issues in critical environments
Experience working against reliability KPIs
Hands-on experience with CI/CD, containerization and orchestration tools
Knowledge of monitoring, logging, alerting and observability tools
Familiarity with infrastructure-as-code tools
Proficiency in scripting languages and knowledge of software development best practices
Strong understanding of networking, security, and system administration concepts
Excellent problem-solving and communication skills

Your application will be all the more interesting if you also have:

Experience in an AI/ML environment
Experience with high-performance computing (HPC) systems and workload managers
Experience working with modern AI-oriented solutions

]]> full-time senior remote bare metal infrastructure, highly available distributed systems, CI/CD, containerization, orchestration tools, monitoring, logging, alerting, observability tools, infrastructure-as-code tools, scripting languages, software development best practices, networking, security, system administration, AI/ML environment, high-performance computing (HPC) systems, workload managers, modern AI-oriented solutions Engineering Technology Mistral AI https://logos.yubhub.co/mistral.ai.png Mistral AI is a technology company that develops high-performance, optimized, open-source and cutting-edge AI models, products and solutions. https://mistral.ai https://jobs.lever.co/mistral/f76907fd-428a-4824-a1cf-8013974fde29 Paris 2026-04-17 a632e52b-c63 Site Reliability Engineer About Mistral AI

At Mistral AI, we believe in the power of AI to simplify tasks, save time, and enhance learning and creativity. Our technology is designed to integrate seamlessly into daily working life.

We are a dynamic team passionate about AI and its potential to transform society. Our diverse workforce thrives in competitive environments and is committed to driving innovation.

Role Summary

We are seeking highly experienced Site Reliability Engineers (SRE) to shape the reliability, scalability and performance of our platform and customer-facing applications. You will work closely with our software engineers and research teams to ensure our systems meet and exceed our internal and external customers' expectations.

Responsibilities

As a Site Reliability Engineer, you balance the day-to-day operations on production systems with long-term software engineering improvements to reduce operational toil and foster the reliability, availability, and performance of these systems.

Operations

• Design, build, and maintain scalable, highly available and fault-tolerant infrastructures to support our web services and ML workloads

• Make sure our platform, inference and model training environments are always highly available and enable seamless replication of work environments across several HPC clusters

• Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, users admin, data extraction, infrastructure scaling, etc.)

• Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime

• Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging and alerting systems) for both our client-facing APIs and large training runs

• Participate occasionally in on-call rotations to respond to incidents and perform root cause analysis to prevent future occurrences

Development

• Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes, Flux, Terraform

• Collaborate with AI/ML researchers to develop and implement solutions that enable safe and reproducible model-training experiments

• Build a cloud-agnostic platform offering an abstraction layer between science and infrastructure

• Design and develop new workflows and tooling to improve the reliability, availability and performance of our systems (automation scripts, refactoring, new API-based features, web apps, dashboards, etc.)

• Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements

• Document processes and procedures to ensure consistency and knowledge sharing across the team

• Contribute to open-source projects, research publications, blog articles and conferences

About You

• Master’s degree in Computer Science, Engineering or a related field

• 7+ years of experience in a DevOps/SRE role

• Strong experience with cloud computing and highly available distributed systems

• Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations...)

• Experience working against reliability KPIs (observability, alerting, SLAs)

• Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes...)

• Knowledge of monitoring, logging, alerting and observability tools (Prometheus, Grafana, ELK Stack, Datadog...)

• Familiarity with infrastructure-as-code tools like Terraform or CloudFormation

• Proficiency in scripting languages (Python, Go, Bash...) and knowledge of software development best practices

• Strong understanding of networking, security, and system administration concepts

• Excellent problem-solving and communication skills

• Self-motivated and able to work well in a fast-paced startup environment

Your Application Will Be All The More Interesting If You Also Have:

• Experience in an AI/ML environment

• Experience with high-performance computing (HPC) systems and workload managers (Slurm)

• Experience working with modern AI-oriented solutions (Fluidstack, Coreweave, Vast...)

]]> full-time senior remote cloud computing, highly available distributed systems, DevOps, SRE, Kubernetes, Flux, Terraform, CI/CD, containerization, orchestration, monitoring, logging, alerting, observability, infrastructure-as-code, scripting languages, software development best practices, networking, security, system administration, AI/ML environment, high-performance computing (HPC) systems, workload managers, modern AI-oriented solutions Engineering Technology Mistral AI https://logos.yubhub.co/mistral.ai.png Mistral AI is a company that develops and provides artificial intelligence (AI) technology to simplify tasks, save time, and enhance learning and creativity. https://mistral.ai https://jobs.lever.co/mistral/6e16e4fa-a60b-4270-a815-06b0450fb597 Paris 2026-04-17 b29decfc-c15 Site Reliability Staff Engineer - Administrator At Synopsys, we drive the innovations that shape the way we live and connect. Our technology is central to the Era of Pervasive Intelligence, from self-driving cars to learning machines. We lead in chip design, verification, and IP integration, empowering the creation of high-performance silicon chips and software content.

You are a highly motivated Site Reliability, Staff Engineer with a passion for Linux platforms and a commitment to operational excellence. You thrive in dynamic, multi-faceted environments and are energized by the challenge of deploying, maintaining, and optimizing complex systems. Your curiosity drives you to continually learn and adapt, while your technical expertise enables you to solve intricate problems efficiently.

Administering and managing Linux operating systems, including kernel components, memory management, process scheduling, and system performance optimization.
Performing routine and advanced system administration tasks such as monitoring, tuning, and troubleshooting across bare-metal and virtualized nodes.
Deploying, configuring, and managing Linux-based operating systems using Kickstart and Ansible for automation and environment standardization.
Implementing and managing MAAS (Metal as a Service) for large-scale bare-metal provisioning and lifecycle operations.
Operating and maintaining OpenStack environments for On Demand Computing and cloud infrastructure.
Providing support for virtualization technologies (VMware, KVM, etc.), including troubleshooting and maintenance.
Delivering basic Linux networking support, resolving connectivity, routing, firewall, NIC bonding, VLAN, and interface configuration issues.
Collaborating with cross-functional teams to enhance infrastructure reliability, scalability, and security.
Creating and maintaining detailed documentation, including configurations, SOPs, troubleshooting guides, and operational runbooks.

Ensuring the reliability and uptime of critical Linux environments that underpin Synopsys' engineering and development operations.
Enabling rapid deployment and scalability of infrastructure through automation and standardized processes.
Reducing downtime and improving system performance by proactively identifying and resolving technical issues.
Enhancing security and compliance across platforms through robust configuration and monitoring practices.
Accelerating innovation by providing stable, high-performance environments for development and testing teams.
Fostering a collaborative culture by sharing expertise, mentoring peers, and contributing to knowledge repositories.

]]> full-time staff onsite Linux system administration, Linux internals, Kickstart, Ansible, MAAS, OpenStack, Virtualization technologies, Linux networking, Scripting languages (Bash, Python), Problem-solving, Communication Engineering Technology Synopsys https://logos.yubhub.co/careers.synopsys.com.png Synopsys develops and maintains software used in chip design, verification, and manufacturing. https://careers.synopsys.com https://careers.synopsys.com/job/bengaluru/site-reliability-staff-engineer-administrator/44408/93181374944 Bengaluru 2026-04-05 419c1058-a0b Site Reliability Engineer About Mistral AI

At Mistral AI, we believe in the power of AI to simplify tasks, save time, and enhance learning and creativity. Our technology is designed to integrate seamlessly into daily working life. We democratize AI through high-performance, optimized, open-source and cutting-edge models, products and solutions. Our comprehensive AI platform is designed to meet enterprise needs, whether on-premises or in cloud environments.

Role Summary

Responsibilities

Operations (50%)

Design, build, and maintain scalable, highly available and fault-tolerant infrastructures to support our web services and ML workloads
Make sure our platform, inference and model training environments are always highly available and enable seamless replication of work environments across several HPC clusters
Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, users admin, data extraction, infrastructure scaling, etc.)
Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime
Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging and alerting systems) for both our client-facing APIs and large training runs
Participate occasionally in on-call rotations to respond to incidents and perform root cause analysis to prevent future occurrences

Development (50%)

Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes, Flux, Terraform
Collaborate with AI/ML researchers to develop and implement solutions that enable safe and reproducible model-training experiments
Build a cloud-agnostic platform offering an abstraction layer between science and infrastructure
Design and develop new workflows and tooling to improve the reliability, availability and performance of our systems (automation scripts, refactoring, new API-based features, web apps, dashboards, etc.)
Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements
Document processes and procedures to ensure consistency and knowledge sharing across the team
Contribute to open-source projects, research publications, blog articles and conferences

About You

Master’s degree in Computer Science, Engineering or a related field
7+ years of experience in a DevOps/SRE role
Strong experience with cloud computing and highly available distributed systems
Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations...)
Experience working against reliability KPIs (observability, alerting, SLAs)
Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes...)
Knowledge of monitoring, logging, alerting and observability tools (Prometheus, Grafana, ELK Stack, Datadog...)
Familiarity with infrastructure-as-code tools like Terraform or CloudFormation
Proficiency in scripting languages (Python, Go, Bash...) and knowledge of software development best practices
Strong understanding of networking, security, and system administration concepts
Excellent problem-solving and communication skills
Self-motivated and able to work well in a fast-paced startup environment

Your Application Will Be All The More Interesting If You Also Have:

Experience in an AI/ML environment
Experience with high-performance computing (HPC) systems and workload managers (Slurm)
Experience working with modern AI-oriented solutions (Fluidstack, Coreweave, Vast...)

]]> full-time senior remote cloud computing, highly available distributed systems, DevOps, SRE, Kubernetes, Flux, Terraform, CI/CD, containerization, orchestration, monitoring, logging, alerting, observability, infrastructure-as-code, scripting languages, software development best practices, networking, security, system administration, AI/ML environment, high-performance computing, workload managers, modern AI-oriented solutions Engineering Technology Mistral AI Mistral AI is a company that develops and provides artificial intelligence (AI) technology to simplify tasks, save time, and enhance learning and creativity. It has a diverse workforce with teams distributed across multiple countries. https://mistral.ai/careers https://jobs.lever.co/mistral/6e16e4fa-a60b-4270-a815-06b0450fb597 Paris 2026-03-10 4c45d017-749 FBS Mainframe System Administration- Application Subject Matter Expert II Job Summary

We are seeking a highly skilled FBS Mainframe System Administration- Application Subject Matter Expert II to join our team. As a key member of our Insurance Mainframe Job Subject Matter Expert (SME) team, you will be responsible for ensuring the stability, performance, and operational integrity of the mainframe environment supporting insurance applications.

Core Responsibilities

Incident Management

Analyze and resolve job failures, spool issues, and performance bottlenecks.
Provide root cause analysis (RCA) and implement preventive measures.

Environment Support

Maintain mainframe environments for development, testing, and production.
Coordinate with infrastructure teams for system health and resource optimization.

Performance Tuning

Monitor CPU, spool, and memory utilization.
Optimize job configurations to reduce resource consumption.

Compliance & Audit

Ensure jobs comply with regulatory and security standards.
Maintain documentation for audits and governance.

Collaboration

Work closely with application teams, operations, and business units.
Provide technical guidance and best practices for job design and execution.

Key Responsibilities

Environment Support & Stability

Manage and maintain mainframe environments for development, testing, and production.
Monitor system health, resource utilization, and job performance.

Batch Job Expertise

Oversee scheduling, execution, and troubleshooting of insurance-related batch jobs.
Analyze job failures, spool issues, and CPU spikes; implement preventive measures.

Incident & Problem Management

Provide root cause analysis (RCA) for outages and performance issues.
Collaborate with operations and application teams to resolve incidents promptly.

Performance Optimization

Tune jobs and system parameters to improve efficiency and reduce resource consumption.
Implement best practices for job design and output management.

Compliance & Documentation

Ensure adherence to regulatory, security, and audit requirements.
Maintain detailed documentation for processes and incident resolutions.

Requirements

Mainframe Technologies
Patches
System Administration
zOS
Mainframe technologies (JCL, COBOL, DB2, CICS) - Advanced
Batch job scheduling tools (e.g., Control-M, CA7) - Advanced
Knowledge of spool management, CPU optimization, and performance tuning. - Advanced
Excellent problem-solving and communication skills - Advanced
Insurance domain experience is a plus - Advanced

Benefits

Competitive compensation and benefits package:

Competitive salary and performance-based bonuses
Comprehensive benefits package
Career development and training opportunities
Flexible work arrangements (remote and/or office-based)
Dynamic and inclusive work culture within a globally renowned group
Private Health Insurance
Pension Plan
Paid Time Off
Training & Development

Note: Benefits differ based on employee level.

]]> full-time senior hybrid Mainframe Technologies, Patches, System Administration, zOS, Mainframe technologies (JCL, COBOL, DB2, CICS), Batch job scheduling tools (e.g., Control-M, CA7), Knowledge of spool management, CPU optimization, and performance tuning, Excellent problem-solving and communication skills, Insurance domain experience Engineering Technology Capgemini https://logos.yubhub.co/capgemini.com.png Capgemini is a global leader in partnering with companies to transform and manage their business by harnessing the power of technology. The company has a strong 55-year heritage and deep industry expertise. https://www.capgemini.com/us-en/about-us/who-we-are/ https://jobs.workable.com/view/1k3E5rgxsguPKBxRevC7y7/hybrid-fbs-mainframe-system-administration--application-subject-matter-expert-ii-in-pune-at-capgemini Pune, Maharashtra, India 2026-03-09 d9ec7bc5-2af Praktikum Risikomanagement IT-System Administration & Entwicklung Your job is to support the risk management team in the development and implementation of risk management strategies. As a risk management intern, you will work closely with the team to identify, assess, and mitigate risks associated with IT system administration and development.

Responsibilities

Support the development and implementation of risk management strategies
Analyze and assess risks associated with IT system administration and development
Identify and implement measures to mitigate risks
Collaborate with the team to develop and maintain risk management policies and procedures
Support the development and implementation of IT system administration and development projects

Requirements

Bachelor's degree in a relevant field (e.g. business administration, computer science, mathematics)
Strong analytical and problem-solving skills
Excellent communication and teamwork skills
Proficiency in MS Office and other relevant software tools
Experience with risk management and IT system administration and development is an asset

Benefits

Opportunity to work with a leading manufacturer of high-performance sports cars
Collaborative and dynamic work environment
Professional development and growth opportunities
Competitive salary and benefits package

Preferred Skills

Experience with risk management and IT system administration and development
Proficiency in programming languages (e.g. Python, Java)
Knowledge of IT system administration and development tools and technologies
Strong analytical and problem-solving skills
Excellent communication and teamwork skills

Duration

6 months

Start Date

Immediately

]]> full-time entry onsite risk management, IT system administration, development, MS Office, programming languages, risk management, IT system administration, development, Python, Java Engineering Automotive Dr. Ing. h.c. F. Porsche AG https://logos.yubhub.co/jobs.porsche.com.png Porsche is a leading manufacturer of high-performance sports cars. The company is headquartered in Stuttgart, Germany, and has a global presence with a workforce of over 30,000 employees. https://jobs.porsche.com https://jobs.porsche.com/index.php?ac=jobad&id=17973 Stuttgart-Zuffenhausen 2026-03-09 2ed6e0d4-88f Site Reliability Staff - EDA Engineering Compute and IT Infrastructure Automation At Synopsys, we drive the innovations that shape the way we live and connect. Our technology is central to the Era of Pervasive Intelligence, from self-driving cars to learning machines. We lead in chip design, verification, and IP integration, empowering the creation of high-performance silicon chips and software content.

You are a forward-thinking and highly motivated IT professional with a passion for reliability, automation, and infrastructure excellence. You thrive in dynamic, fast-paced environments where your expertise in data center operations, engineering compute, and automation can make a tangible impact. With a strong foundation in Linux-based environments, virtualization, and network architecture, you bring both depth and breadth to IT operations, ensuring optimal performance and security.

Responsibilities

Maintain and optimize the Taiwan data center at the Synopsys Hsinchu office, ensuring compliance with corporate IT data center and security standards.
Manage the full server hardware lifecycle, including rack installation, provisioning, maintenance, and decommissioning.
Support and manage engineering compute IT environments located at key customer data centers, such as TSMC and MediaTek.
Collaborate with corporate network and InfoSec teams to maintain data center network infrastructure, including core switches, TOR, firewalls, routers, and circuits.
Oversee the software architecture of EDA compute, focusing on Linux environments, and provide troubleshooting expertise for virtualization platforms, job schedulers, and remote access solutions.
Design and implement automation processes to streamline regular tasks, reduce manual effort, and adopt ML/AI technologies for efficient EDA compute orchestration and secure chamber domain management.
Proactively monitor, analyze, and optimize system performance to maximize uptime and reliability.

Impact

Ensure the seamless operation and scalability of Synopsys' Taiwan data center, directly supporting R&D and customer-facing teams.
Enable faster and more reliable silicon design and verification cycles for leading semiconductor companies.
Drive continuous improvement in data center operations through automation and adoption of cutting-edge technologies.
Safeguard mission-critical infrastructure by upholding best practices in security and compliance.
Facilitate collaboration between internal teams and major customers, strengthening Synopsys' reputation as a trusted technology partner.
Contribute to the global reliability and performance of Synopsys' EDA compute environment, empowering innovation at scale.

Requirements

Proven experience in data center operations, including server hardware lifecycle management and resource provisioning.
Strong knowledge of Linux system administration, virtualization platforms, and EDA compute architectures.
Solid understanding of data center networking concepts, including core switches, firewalls, routers, and security protocols.
Hands-on expertise in automation tools and scripting (such as Python, Bash, or Ansible), with a track record of process optimization.
Experience implementing or supporting ML/AI-based solutions for IT infrastructure orchestration is a plus.

Team

You'll join a dedicated IT engineering compute team responsible for maintaining and optimizing Synopsys' Taiwan data center and engineering compute operations. The team plays a critical role in supporting Synopsys' R&D software development and providing seamless support to key customers across Taiwan, including industry leaders like TSMC and MediaTek. You'll collaborate with global IT, InfoSec, and engineering teams, driving future EDA compute expansion and innovation.

Rewards and Benefits

We offer a comprehensive range of health, wellness, and financial benefits to cater to your needs. Our total rewards include both monetary and non-monetary offerings. Your recruiter will provide more details about the salary range and benefits during the hiring process.

]]> full-time staff onsite Linux system administration, Virtualization platforms, EDA compute architectures, Data center networking, Automation tools and scripting, ML/AI-based solutions, Python, Bash, Ansible Engineering Technology Synopsys https://logos.yubhub.co/careers.synopsys.com.png Synopsys is a leading provider of electronic design automation (EDA) software and services. The company has a global presence with a large team of engineers and researchers. https://careers.synopsys.com https://careers.synopsys.com/job/hsinchu/site-reliability-staff-eda-engineering-compute-and-it-infrastructure-automation/44408/91681543232 Hsinchu 2026-03-09 4f4bdbb3-f16 Training as a Specialist in System Integration (m/w/d) Join our team as a trainee in system integration and be part of shaping the future of mobility.

As a trainee, you will be responsible for installing, configuring, and maintaining hardware and software, as well as gaining insights into the administration of Windows and Linux systems. You will also participate in network and infrastructure projects, provide support in IT, analyze and resolve technical issues, and learn about topics such as Active Directory, Microsoft 365, virtualization, and IT security.

We offer a dynamic and innovative work environment, with opportunities for professional growth and development. Our team is passionate about creating a better world of mobility, and we are looking for like-minded individuals to join us.

If you are interested in a challenging and rewarding career in system integration, we encourage you to apply.

]]> university internship entry onsite Linux, Windows, System administration, Networking, IT support, Troubleshooting, Active Directory, Microsoft 365, Virtualization, IT security, Programming languages, Database management, Cloud computing Engineering Automotive AVL Software and Functions GmbH https://logos.yubhub.co/jobs.avl.com.png AVL is a leading technology company in the automotive industry, providing development, simulation, and testing solutions. https://jobs.avl.com https://jobs.avl.com/job/Regensburg-Ausbildung-zum-Fachinformatiker-f%C3%BCr-Systemintegration-%28mwd%29/755260801/ Regensburg, DE 2026-03-09 d52b3f92-dbf SW Developer & Data Analyst We are looking for an SW Developer & Data Analyst with strong proficiency in Python development, Linux environments, and database management. The role focuses on ensuring the stability, scalability, and continuous improvement of our production platforms through robust scripting, automation, and system administration.

Responsibilities:

Ensure support and maintenance of platforms in production.
Develop and maintain Python scripts and services for data processing and platform operations, including ETL workflows.
Manage and interact with MongoDB databases and Python Object DB.
Deploy, configure, and maintain application services running on Linux (e.g. System Control services).
Contribute to the setup and maintenance of CI/CD pipelines.
Participate in system administration activities, including monitoring, troubleshooting, and performance tuning.
Ensure platform stability, availability, and reliability.
Collaborate closely with technical teams.
Participate in the continuous improvement of the platform and existing processes.
Take part in technical reviews and knowledge sharing within the team.

Requirements:

Master’s degree in Computer Science, Software Engineering, Data Engineering, or a related field.
Minimum 2 years of experience in data engineering.
Minimum 2 years of professional experience in Python development.
Solid experience with MongoDB and Python-based database interaction.
Very good knowledge of Linux environments, with at least 1 year of hands-on experience working on Linux systems.
Good background in system administration (sysadmin), including service management, incident analysis and troubleshooting, and an understanding of server and infrastructure environments.
Ability to understand, maintain, and evolve a long-lifecycle platform.
Strong analytical and problem-solving skills.
Very good command of English.

Benefits:

Smooth Onboarding: Our technical and personal onboarding concept will help you transition easily.
Career Development: Opportunities for growth and advancement within the company, including mentorship and training programs tailored to your goals.
Flexible Working Arrangements: Options for mobile working and flexible hours to support a healthy work-life balance.
Collaborative Environment: A culture that encourages open communication, teamwork, and innovative thinking.
Community Connection: Regular employee events and activities that foster camaraderie and strengthen team bonds.
Recognition and Support: A commitment to recognizing your contributions and providing the support you need to thrive.

permanent mid onsite
Python development, Linux environments, database management, MongoDB, Python Object DB, Linux system administration, system administration, ETL workflows, CI/CD pipelines
Engineering Automotive AVL Maroc SARL AU
https://logos.yubhub.co/jobs.avl.com.png
AVL is a leading mobility technology company that provides concepts, solutions, and methodologies in fields like vehicle development and integration, e-mobility, automated and connected mobility, and software.
https://jobs.avl.com
https://jobs.avl.com/job/Sala-Al-Jadida-SW-Developer-&-Data-Analyst/1370565533/
Sala Al Jadida, MA 2026-03-09

101df34a-252 Site Reliability Manager

You will lead and be part of a Linux Engineering / Site Reliability Engineering organisation responsible for frontline (L1) production support. The team works closely with L2/L3 engineering, platform, network, security, and R&D teams to ensure reliable and scalable infrastructure operations across the business.

Job Description

We are a technology organisation operating high performance, large scale Linux production environments that support critical platforms and engineering teams. Our focus is on operational excellence, service reliability, automation, and continuous improvement. We run 24x7 operations and partner closely with platform, network, security, and engineering teams to deliver stable, secure, and scalable infrastructure.

Responsibilities

Leading and managing a 24x7 L1 Linux Engineering / SRE team operating in rotational shifts
Owning hiring, onboarding, performance management, coaching, and career development for L1 engineers
Owning L1 production support operations for Linux systems in a 24x7 environment
Acting as the first leadership escalation point during major production incidents
Ensuring adherence to SLAs, OLAs, and operational KPIs such as availability and MTTR
Providing technical oversight across Linux OS, bare metal and virtualized platforms, and monitoring/logging systems
Driving automation adoption using Ansible, Bash, and Python to reduce manual toil
Defining and maintaining SOPs, runbooks, escalation procedures, and documentation
Partnering with platform, network, security, and engineering teams to improve system reliability and resilience

Impact

Ensuring stable, reliable, and efficient 24x7 L1 Linux/SRE operations
Reducing incident recurrence and improving incident response and resolution times
Building a skilled, motivated, and well-governed L1 engineering team
Improving operational maturity through automation, standardization, and documentation
Enabling engineering and R&D teams through predictable and resilient platform operations

Requirements

10–14+ years of experience in IT Infrastructure, Linux Operations, or SRE
4–6+ years of people management experience, preferably managing 24x7 support teams
Strong hands-on background in Linux system administration and production support
Experience with incident management, on-call models, and rotational shifts
Advanced knowledge of Linux OS internals
Experience with virtualization platforms (VMware, KVM, OpenStack, oVirt)
Knowledge of monitoring and logging tools (e.g., Nagios, ELK)
Experience with automation and configuration management (Ansible)
Scripting skills in Bash and/or Python

Who You Are

A strong people leader with excellent coaching and decision-making skills
Calm and effective under high-pressure production scenarios
Highly structured and data-driven in driving operational excellence
An effective communicator and stakeholder partner
Passionate about reliability engineering, automation, and continuous improvement

Rewards and Benefits

Opportunity to lead mission-critical, large-scale Linux and SRE operations
High visibility role with exposure to senior leadership and engineering stakeholders
Ability to shape operational strategy, automation, and reliability practices
Strong focus on career growth, learning, and leadership development

full-time senior onsite
Linux system administration, Linux OS internals, Virtualization platforms, Monitoring and logging tools, Automation and configuration management, Scripting skills in Bash and/or Python, Ansible, Bash, Python
Engineering Technology Synopsys
https://logos.yubhub.co/careers.synopsys.com.png
Synopsys is a technology organisation that develops and maintains software used in chip design, verification and manufacturing. It has a large scale operation with high performance Linux production environments.
https://careers.synopsys.com
https://careers.synopsys.com/job/bengaluru/site-reliability-manager/44408/92446615696
Bengaluru 2026-03-08

4083fddb-152 Strategic Procurement Drive System Intern

You will be supporting the operational and strategic procurement process for drive system components. This includes creating and evaluating tenders, conducting price analyses, and preparing decision-making documents. You will also assist in preparing and following up on supplier meetings and internal discussions. Additionally, you will be responsible for maintaining databases, systems, and documentation. Your tasks will also include conducting research and supporting daily business operations.

To be successful in this role, you should have a strong understanding of procurement processes and a keen eye for detail. You should also be able to work independently and as part of a team, with excellent communication and organizational skills.

As a strategic procurement drive system intern, you will have the opportunity to work with a highly experienced team and contribute to the development of Porsche's procurement strategy. You will also have the chance to learn from experienced professionals and gain valuable insights into the automotive industry.

Responsibilities:

Support the operational and strategic procurement process for drive system components
Create and evaluate tenders, conduct price analyses, and prepare decision-making documents
Assist in preparing and following up on supplier meetings and internal discussions
Maintain databases, systems, and documentation
Conduct research and support daily business operations

Requirements:

Degree studies in industrial engineering, mechanical engineering, automotive engineering, business administration, or a comparable field
Initial practical experience in procurement, industry, or the automotive sector is desirable
Very good knowledge of MS Office (especially Excel and PowerPoint)
Fluent written and spoken German and English
Independent, structured, and reliable working style, along with teamwork and strong communication skills

part-time entry onsite
procurement, tendering, price analysis, supplier management, database management, system administration, documentation, research, communication, organization, MS Office, Excel, PowerPoint, German, English
Engineering Automotive Dr. Ing. h.c. F. Porsche AG
https://logos.yubhub.co/jobs.porsche.com.png
Porsche is a leading manufacturer of high-performance sports cars. The company is headquartered in Stuttgart, Germany.
https://jobs.porsche.com
https://jobs.porsche.com/index.php?ac=jobad&id=19977
Weissach 2026-03-08

f5e7e195-679 Datacenter Hardware Operations Technician, AI Compute Infrastructure - Stargate

Compensation

$86.4K – $228K

The base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If the role is non-exempt, overtime pay will be provided consistent with applicable laws. In addition to the salary range listed above, total compensation also includes generous equity, performance-related bonus(es) for eligible employees, and the following benefits.

Benefits

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts

Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)

401(k) retirement plan with employer match

Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)

Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees

13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)

Mental health and wellness support

Employer-paid basic life and disability coverage

Annual learning and development stipend to fuel your professional growth

Daily meals in our offices, and meal delivery credits as eligible

Relocation support for eligible employees

Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.

About the Team

OpenAI, in close collaboration with our capital partners, is embarking on a journey to build the world’s most advanced AI infrastructure ecosystem. Our Stargate program develops and deploys massive, state-of-the-art data center campuses in partnership with industry leaders such as Oracle today—and through future OpenAI infrastructure projects tomorrow. We design for scale, speed, and reliability, and we need experienced hardware professionals who can help ensure our high-density compute environment operates at peak performance.

About the Role

We are seeking a senior datacenter hardware operations technician to coordinate physical hardware activities at a large partner-operated campus. In this role you will work side-by-side with Oracle and their delivery teams, helping align OpenAI’s compute requirements with day-to-day hardware work on the ground. Rather than directing partner personnel, you will focus on collaboration, technical alignment, and shared problem solving, ensuring that maintenance, repairs, and lifecycle activities support the performance and reliability goals of both organizations. As the campus matures, you will help capture lessons learned and develop standards and playbooks to guide hardware operations at future OpenAI infrastructure projects.

Candidates must be able to work onsite in Abilene, Texas, 5 days per week.

Responsibilities

Serve as OpenAI’s primary on-site hardware contact, collaborating with Oracle teams and vendors to plan and coordinate maintenance, repairs, and lifecycle activities.

Share technical requirements and verify that work performed supports OpenAI’s compute needs and agreed quality targets.

Coordinate schedules, spare-parts planning, and issue escalation with partner teams to minimize downtime and keep operations running smoothly.

Work with OpenAI fleet-health engineers to translate software-detected issues into on-site hardware actions in partnership with Oracle.

Track hardware trends and provide joint recommendations with partner teams for design or operational improvements.

Prepare documentation and runbooks that capture joint best practices and can be applied at additional campuses.

Offer technical guidance and context to partner personnel while respecting their operational ownership.

Collaborate with supply-chain teams to plan spares and manage hardware lifecycle activities.

Requirements

Have 7+ years of experience in datacenter hardware operations, hardware engineering, or large-scale server maintenance, with at least 2 years in a senior or lead technician capacity.

Bring deep knowledge of high-density server hardware, including x86 platforms, GPUs, storage devices, and power/cooling systems.

Excel at diagnosing hardware issues, coordinating complex repairs, and maintaining strong working relationships across organizations.

Are comfortable setting technical expectations and validating outcomes through collaboration, not direct management.

Adapt quickly to changing operational conditions and enjoy solving problems at both the strategic and on-site levels.

Communicate clearly and build trust across partner teams, vendors, and internal engineering stakeholders.

Are willing to be based full-time at a partner-operated campus.

Preferred Skills

Familiarity with large-scale cluster management or monitoring tools (IPMI, BMC, Prometheus, Nagios) to interpret alerts and coordinate partner responses.

Experience with GPU-accelerated compute clusters or other high-performance computing hardware.

Knowledge of Linux/Unix system administration and command-line diagnostic tools for hardware validation.

Industry certifications such as CompTIA Server+, OEM hardware certifications, or equivalent.

full-time senior onsite $86.4K – $228K
datacenter hardware operations, hardware engineering, large-scale server maintenance, high-density server hardware, x86 platforms, GPUs, storage devices, power/cooling systems, large-scale cluster management, monitoring tools, IPMI, BMC, Prometheus, Nagios, GPU-accelerated compute clusters, Linux/Unix system administration, command-line diagnostic tools, industry certifications
Engineering Technology OpenAI
https://logos.yubhub.co/openai.com.png
OpenAI is a technology company that specializes in developing and commercializing advanced artificial intelligence (AI) systems. The company was founded in 2015 and has since grown to become one of the leading AI research and development organizations in the world.
https://jobs.ashbyhq.com
https://jobs.ashbyhq.com/openai/b9a4a809-a965-4dbe-aeef-6ce1593903dd
Remote - US 2026-03-06

119df59e-db7 Software Engineer, AI Safety

Location

San Francisco

Employment Type

Full time

Department

Safety Systems

Compensation

$185K – $325K • Offers Equity

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts

Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)

401(k) retirement plan with employer match

Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)

Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees

13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)

Mental health and wellness support

Employer-paid basic life and disability coverage

Annual learning and development stipend to fuel your professional growth

Daily meals in our offices, and meal delivery credits as eligible

Relocation support for eligible employees

Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided.

More details about our benefits are available to candidates during the hiring process.

This role is at-will and OpenAI reserves the right to modify base pay and other compensation components at any time based on individual performance, team or company results, or market conditions.

About the Team

The Safety Systems team is dedicated to ensuring the safety, robustness, and reliability of AI models and their deployment in the real world.

Building on the many years of our practical alignment work and applied safety efforts, Safety Systems addresses emerging safety issues and develops new fundamental solutions to enable the safe deployment of our most advanced models and future AGI, to make AI that is beneficial and trustworthy.

About the Role

At OpenAI, we're dedicated to advancing artificial intelligence, and we know that creating a secure and reliable platform is vital to our mission. That's why we're seeking a software engineer to help us build out our trust and safety capabilities.

In this role, you'll work with our entire engineering team to design and implement systems that detect and prevent abuse, promote user safety, and reduce risk across our platform. You'll be at the forefront of our efforts to ensure that the immense potential of AI is harnessed in a responsible and sustainable manner.

In this role, you will:

Architect, build, and maintain anti-abuse and content moderation infrastructure designed to protect us and end users from unwanted behavior.

Work closely with our other engineers and researchers to utilize both industry standard and novel AI techniques to measure, monitor and improve AI models’ alignment to human values.

Diagnose and remediate active incidents on the platform and build new tooling and infrastructure that address the root causes of system failure.

You might thrive in this role if:

You have built and run production services in a high growth, rapidly scaling environment.

You can debug live issues and restore systems quickly.

You have worked on content safety, fraud, or abuse, or are motivated and excited to work on present-day (“now-term”) AI safety.

You have experience with Python or with modern languages such as C++, Rust, or Go, and are able to quickly ramp up on Python.

You understand the trade-offs of capabilities and risks and navigate them to deploy novel products and features safely.

You can critically assess risks of a new product or feature and devise innovative solutions to mitigate these risks without harming the product experience.

You’re pragmatic. You know when to build a quick, good-enough fix, and when to invest in a robust, lasting solution.

You possess strong project management skills. You are self-directed and can remove roadblocks to drive projects to completion with minimal guidance.

You’ve deployed classifiers or machine learning models, or are excited to learn about modern ML infra.

Our tech stack

Our infrastructure is built on Terraform, Kubernetes, Azure, Python, Postgres, and Kafka. While we value experience with these technologies, we are primarily looking for engineers with strong technical skills who understand the fundamental problems these tools solve, and can quickly pick up new tools and frameworks.

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

full-time mid onsite $185K – $325K • Offers Equity
Python, Terraform, Kubernetes, Azure, Postgres, Kafka, C++, Rust, Go, Content safety, Fraud, Abuse, AI safety, Machine learning, Classifiers, ML infra, Project management, Debugging, System administration, Cloud computing, Containerization, DevOps, Agile development, Scrum, Kanban
Engineering Technology OpenAI
https://logos.yubhub.co/openai.com.png
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products.
https://jobs.ashbyhq.com
https://jobs.ashbyhq.com/openai/b9dee2a0-9bb3-447e-9bce-2b1bed784e5b
San Francisco 2026-03-06

6c9adcae-84f Sr. Project Engineer - Design CAD Specialist

AVL Mobility Technologies, Inc. (MTI) is looking for a motivated engineer to join our team. We are specifically looking for a Design CAD Engineer to provide technical, program, and account management expertise to cross-functional business development and engineering program team execution activities.

What you'll do

Responsible for local PDM system support in coordination with global key users.
Direct interface to global key users for PDM and CAD system questions, concerns, tools and techniques, upgrade testing, and local implementation of the systems.
Ownership of all local PDM activities for project creation and applying roles and responsibilities for new PDM system libraries.
Provides leadership for U.S. restricted PDM systems and/or CAD systems.

What you need

Bachelor’s degree in Mechanical Engineering or a related field required.
Master’s degree preferred.
5–7+ years’ experience in a relevant area of expertise (considerable experience in CAD and/or PDM system administration).
Excellent communication skills.
Analytical thinking.
Motivated.
Team player.
Excellent knowledge of PTC Creo and Windchill PDM system backend installation.
Good understanding of design and development processes.
Ability to create advanced user forms in multiple interfaces.
Understands and has used 3D CAD tools.
Organizational skills.
Communication and presentation skills.

full-time senior onsite
PDM system administration, CAD system administration, PTC Creo, Windchill PDM system backend installation, 3D CAD tools, Advanced user forms, Organizational skills, Communication and presentation skills
Engineering Automotive AVL Mobility Technologies, Inc. (MTI)
https://logos.yubhub.co/jobs.avl.com.png
AVL Mobility Technologies, Inc. (MTI) forges new ideas, creates exciting breakthroughs, and provides mobility solutions for combustion engines, transmissions, hybrid applications, fuel cell, battery, ADAS/AD, data intelligence, and embedded systems for all types of vehicles.
https://jobs.avl.com
https://jobs.avl.com/job/Offsite-in-Michigan-Sr_-Project-Engineer-Design-CAD-Specialist-MI/1289526601/
2026-03-06

b050a65c-f0a Senior SRE 1

We are seeking an accomplished Senior Site Reliability Engineer (SRE) with 12–15 years of experience to lead the reliability, scalability, and performance engineering of our critical infrastructure and production systems. As a Senior SRE, you will play a strategic and technical leadership role: driving reliability practices, mentoring SRE teams, and influencing the adoption of automation, observability, and resilience engineering across the organization.

What you'll do

Architect, implement, and manage resilient, scalable, and highly available infrastructure systems.
Lead initiatives to automate manual operations, deployment, and monitoring processes to improve reliability and reduce toil.

What you need

Strong proficiency in Linux/Unix system administration and internals.
Proven experience in cloud platforms — AWS, Azure, or GCP.

full-time senior hybrid
Linux/Unix system administration, Cloud platforms, Automation, Containerization and orchestration, Monitoring and observability stacks, Configuration management and IaC tools
Engineering Technology Electronic Arts
https://logos.yubhub.co/jobs.ea.com.png
Electronic Arts creates next-level entertainment experiences that inspire players and fans around the world. Here, everyone is part of the story. Part of a community that connects across the globe. A place where creativity thrives, new perspectives are invited, and ideas matter. A team where everyone makes play happen.
https://jobs.ea.com
https://jobs.ea.com/en_US/careers/JobDetail/Senior-SRE-I/211515
Hyderabad 2026-01-05