About the job

SRE and Incident Manager – Brazil / Chile (Latin America)

Pluxee is the leading global employee benefits and engagement partner that opens up a world of opportunities to help people enjoy more of what really matters.

We believe that living life to the full means making the most of every moment and sharing experiences with the people we care about. To make these experiences meaningful, fulfilling, and personalized, we combine our 45+ years of experience with the agility and energy of a new digital brand.

The SRE and Incident Manager – LATAM will be responsible for overseeing the entire SRE (Site Reliability Engineering) and incident management process, ensuring that the infrastructure and applications under his / her scope are designed, developed and implemented with reliability and resiliency, and increasing efficiency through automation, which ultimately improves customer satisfaction and retention. He / she should also proactively drive a culture of continuous improvement.

During the operational phase of the system under his / her scope, he / she is responsible for the timely resolution of incidents and for minimizing the impact on our systems and services offered to clients, consumers and merchants in several Latin American countries.

He / She will be a part of a distributed international Run & Operations team that collaborates with other cross-functional teams to drive continuous improvement of our incident response capabilities at all levels (local, regional and global).

Our systems and services are built on a cloud-native, distributed architecture and integrated with country systems. The current functional scope is composed of an identity management system, a store locator, several web portals, a couple of mobile applications, as well as several payment capabilities.

Technical expertise: solid experience with Azure services such as App Services, Functions, Application Gateways and APIM, an understanding of common application frameworks and runtimes such as React.js, Node.js, PHP Symfony and .NET, and experience with at least one observability platform, preferably Datadog.

Process-centric approach: strive to understand and optimize the end-to-end flow of work, rather than solely focusing on individual tasks or functional silos.

Assertive communication: clearly express his / her thoughts, needs, and boundaries in a direct and respectful manner while considering the need to engage multiple parties (internal and external) towards a common goal.

🚀 Your next challenge:

The SRE and Incident Manager – LATAM will have the following responsibilities to ensure reliability and resiliency of the system under his / her scope:

  • Ensuring reliability – getting systems back to steady-state as quickly as possible
  • Eliminating toil – automating wherever possible
  • Blameless postmortems – driving better cross-team collaboration
  • Observing what matters – gaining full visibility into system health
  • Being proactive – living and breathing SLOs to identify and remediate issues before SLAs are violated
  • Architecting for resiliency – informing architectural design decisions to build more reliable systems
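
As an illustration of what working with SLOs can look like in practice, here is a minimal sketch of an error budget calculation. The function name, the 99.9% target and the request counts are assumptions made for the example; they are not Pluxee's actual objectives or tooling.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget for a request-based SLO.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has already been missed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    # Number of failures the SLO tolerates over this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% target.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.1%}")
```

Tracking this value over a rolling window is one common way to spot a reliability problem while the contractual SLA still has headroom.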

The SRE and Incident Manager – LATAM will have the following responsibilities during the lifecycle of an incident or crisis:

  • Review and maintain the incident and crisis management processes, escalation paths, and communication plans.
  • Review and maintain an incident and crisis response plan, outlining steps, procedures, and roles during incidents.
  • Identify and log incidents promptly, using monitoring tools, alerts, and reports.
  • Ensure incidents are prioritized according to predefined criteria, allocating appropriate resources based on the incident’s impact.
  • Coordinate and communicate effectively with technical teams, support staff, and stakeholders throughout the incident lifecycle.
  • Initiate escalation processes when incidents cannot be resolved within predefined timeframes or by initial response teams.
  • Lead or participate in incident investigations and root cause analysis, identifying underlying causes and recommending preventive measures.
  • Maintain accurate records of incidents, including timelines, actions taken, and resolutions.
  • Lead or participate in the preparation of incident reports and root cause analysis reports.
  • Share relevant reports with stakeholders, highlighting trends, recurring issues, and improvement opportunities.
  • Drive continuous improvement of the incident management process by analysing incident data, identifying patterns, and suggesting enhancements.
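
To make the prioritization and escalation items above concrete, here is a minimal sketch of an impact and urgency matrix with time-based escalation. The priority levels, the scoring rule and the escalation timeframes are all hypothetical; the real criteria would come from the incident management process itself.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Priority(Enum):
    P1 = 1  # highest priority
    P2 = 2
    P3 = 3


# Hypothetical scoring of impact and urgency levels.
LEVELS = {"low": 0, "medium": 1, "high": 2}

# Hypothetical timeframes after which an unresolved incident is escalated.
ESCALATE_AFTER = {
    Priority.P1: timedelta(minutes=30),
    Priority.P2: timedelta(hours=2),
    Priority.P3: timedelta(hours=8),
}


def classify(impact: str, urgency: str) -> Priority:
    """Map impact and urgency ("low", "medium", "high") to a priority."""
    score = LEVELS[impact] + LEVELS[urgency]
    if score >= 3:
        return Priority.P1
    if score == 2:
        return Priority.P2
    return Priority.P3


def needs_escalation(priority: Priority, opened_at: datetime, now: datetime) -> bool:
    """Return True once the incident has been open for its escalation timeframe."""
    return now - opened_at >= ESCALATE_AFTER[priority]


if __name__ == "__main__":
    opened = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    priority = classify("high", "medium")
    print(priority.name, needs_escalation(priority, opened, opened + timedelta(minutes=45)))
```

Keeping the criteria in plain data like this makes them easy to review with stakeholders and to adjust per country or service.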

🌟 You’re a match :

The SRE and Incident Manager – LATAM will be responsible for the following technical activities:

  • Review and track patch management activities to ensure that applications and services are up to date with the latest security patches, bug fixes, and upgrades.
  • Coordinate and schedule regular maintenance windows for applying patches or implementing upgrades, minimizing disruption to users.
  • Track the support lifecycle of all configuration items supporting the systems and services under his / her responsibility.
  • Collaborate with other stakeholders to ensure timely upgrades of software and services so that they remain covered by the proper support and service levels.
  • Track capacity and utilization of resources and collaborate with other stakeholders to ensure services remain available, even under high loads.
  • Regularly review monitoring data with Tech Leads and DevOps to identify performance issues, capacity bottlenecks, or potential risks.
  • Proactively propose new monitors and adjustments to monitoring thresholds in order to detect events and incidents as early as possible.
  • Review and maintain process documentation, including support procedures, runbooks, knowledge articles, and FAQs (Frequently Asked Questions), for all systems and services under his / her responsibility.
  • Review contracts from vendors and service providers to ensure they fully support the SLAs (Service Level Agreements) agreed with internal clients.
  • Propose architectural changes to increase systems resiliency, based on insights and experience gained from incidents and maintenance activities.
  • Conduct training sessions, workshops, or awareness programs to educate staff and countries on the support model, incident response procedures and best practices.
  • Be the main point of contact from Run & Operations in matters related to the Release Management process that may affect service availability.


Soft Skills:

  • Fluency in English and Spanish (mandatory); Portuguese is a plus.
  • Assertive communication, Leadership, Problem-Solving, Stress Management, Collaboration and Teamwork, Conflict Resolution, Continuous Learning and Adaptability.

Knowledge / Experience:

Technical Knowledge, Incident Management Tools, Troubleshooting and Root Cause Analysis, Application Stacks Knowledge, Incident Response and Recovery, IT Infrastructure and Networks, Data Analysis and Reporting, IT Security and Compliance.

🔎 To get this challenge:

  • Video call discussion with TA Expert.
  • Video call discussion with Hiring Manager.
  • Video call discussion with Operation Manager.
  • Video call discussion with HRBP.

🏅 Your team


📍 Your Location:

Preferred: Brazil / Chile (Latin America)
