Tech Lead Site Reliability Engineer Job in Nice Interactive Solutions India Pvt Ltd

Tech Lead Site Reliability Engineer

Job Summary

Description

Being an effective SRE is as much about how you think as it is about your technical skills. The SRE role requires a mix of development and operations skills. Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems.

First and foremost, an SRE is a software developer who builds things. Much of our software development focuses on optimizing existing systems, building infrastructure, and eliminating manual work through automation. On the SRE team, you are expected to manage the complex challenges of scale that are unique to Nice InContact, while using your expertise in coding, systems, operating system internals, and large-scale system design. SRE's culture of diversity, intellectual curiosity, problem solving, and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences, and perspectives.

Generalists thrive in this role. As an SRE, you ensure that the Nice InContact services (both our internally critical and our externally visible systems) have the reliability and uptime appropriate to users' needs. Additionally, SREs keep an ever-watchful eye on our systems' capacity and performance.

The role comes down to answering these questions: How does something work? How can I make it run better? How do I know it's working? How do I measure the performance? Then, once you have answered those questions: How do I work across the organization's departments to do this?

Education/Experience:

  • Bachelor's degree in a related field, or equivalent relevant work experience, including 1 to 3 years of provisioning automation in AWS and Azure cloud environments. Demonstrable software development experience is also expected.
  • Experience managing services in the cloud (AWS/Azure) and their connections to enterprise infrastructure.
  • Prior experience with Microsoft and Linux troubleshooting, coding/scripting, and higher-level languages such as C# and Java.
  • Can demonstrate the SOLID principles while writing code, follow code flow, and use version control (Git/GitHub). Familiar with building CI/CD pipelines with Jenkins.
  • Understanding of common scripting tools: PowerShell, Bash, AWS CLI.
  • Experience managing a full application stack with high availability requirements is preferred
  • Experience managing both Microsoft and Linux servers and services.
  • Experience leveraging monitoring and alerting tools such as Grafana and Prometheus, InSpec testing for auditing, and Chef scripts for reliable builds.
  • Understanding of containerization and orchestration with Docker/Kubernetes.
  • Strong written and verbal communication skills.

Duties

  • Writing is our primary means of communication, from pull requests and team chat to knowledge sharing and communicating changes. Excellent writing skills are crucial to success.
  • What is toil? Understanding of toil and its characteristics, including a drive to measure and eliminate it.
  • What is an error budget?
  • What are SLIs and SLOs?
  • Continued improvement of tech skills is a requirement. You should be learning a new tech skill each quarter and seeking industry certifications to establish your level of knowledge.
  • Required: A self-motivated individual with a track record of having the internal drive and motivation to begin and continue tasks without external prodding or extra rewards.
  • Set and maintain attainable goals with your manager.
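
For candidates preparing for the error budget, SLI, and SLO questions above, the relationship between the three can be sketched in a few lines of Python. This is an illustrative sketch only, assuming a request-based availability SLI; the function name, parameters, and numbers are hypothetical and do not describe the team's actual tooling or targets.

```python
def error_budget_report(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize a request-based availability SLI against an SLO.

    slo_target is the objective as a fraction, e.g. 0.999 for "99.9% of
    requests succeed". All names and values here are illustrative.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # SLI: the measured fraction of successful requests.
    sli = (total_requests - failed_requests) / total_requests

    # Error budget: the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests

    # Remaining budget goes negative once the SLO has been violated.
    remaining_failures = allowed_failures - failed_requests

    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining_failures,
        "budget_consumed": failed_requests / allowed_failures,
    }


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 400 failures, 99.9% SLO.
    report = error_budget_report(0.999, 1_000_000, 400)
    for key, value in report.items():
        print(f"{key}: {value}")
```

In this framing, the SLI is what was measured, the SLO is the target it is compared against, and the error budget is the gap between a perfect score and the SLO, expressed here as a count of tolerated failures.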

Within the duties are three main areas of focus: Reliability, Monitoring/Alerting, and Service.

Reliability/Availability

  • Collaborate with and contribute to other enterprise teams.
  • Communicate availability to the team and manager.
  • Monitoring: track site reliability and the health of our systems. Learn to identify which areas are critical, major, or minor.
  • Alerting: raise alerts on critical problems and errors in systems and their processes.
  • Metrics: gather data for troubleshooting of all kinds, and expose application metrics to management and monitoring systems.
  • Building new features and services is a big part of this role. We are continually developing and implementing new ways to support our teams, understanding our customers' needs, and becoming experts in site reliability.

Monitoring and Alerting

  • Developing and deploying new tools to support our systems and services in an automated fashion.
  • Hardening our systems where applicable.
  • Supporting the deployment of new product services.
  • Use software development approaches to operations. You should have a breadth of experience in software development and operations, and be actively practicing site reliability principles.

Experience Required :

Fresher

Vacancy :

2 - 4 Hires
