Site Reliability Operations-iii Job in Exotel Techcom Pvt. Ltd

Job Summary


The SRO (Site Reliability Operations) team manages the setup and expansion of Exotel's production infrastructure in managed data centres (DCs) at multiple locations. The SRO team also makes sure that our DCs are up and running at all times.

The infrastructure includes Linux services, Linux cloud servers, Linux bare-metal servers, network devices, internet leased lines, telephone lines, telephony hardware, etc.

This team provides 24x7 coverage and support and is responsible for monitoring, reporting, troubleshooting, resolving, and escalating any production infrastructure issues. This includes incidents where network infrastructure or a carrier experiences issues. It also involves identifying, troubleshooting, and resolving issues with systems and applications reported through monitoring systems or trouble tickets.

As a team, we love increasing efficiency and speed of execution by constantly automating routine activities.

What you will do

  • Design and manage complex, large-scale data centre infrastructure (e.g. servers, network, security, vendors, software upgrades, patches, hot fixes) per business requirements
  • Drive automation strategies and deployment processes following SDLC processes
  • Automate systems administration-related solutions for various project and operational needs
  • Monitor and react to security-related incidents as necessary, and involve the required stakeholders for short-term and long-term solutions
  • Lead and drive root cause analysis efforts across multiple infrastructure layers (OS/Network/App)
  • Provide on-call and out-of-hours support for business-critical services
  • Troubleshoot issues in detail whenever any component fails (server, monitoring, or service-related issues), following a solid data-driven approach to arrive at a hypothesis. Drive and implement short-term and long-term solutions
  • Administer monitoring services such as Grafana, Nagios, and custom scripts
  • Explore and implement the latest technologies to improve the stability, security, efficiency, and scalability of the environment
  • Drive initiatives to reduce TAT and MTTR for existing processes and practices
  • Perform benchmarking exercises for different system components
  • Drive initiatives to improve the stability, security, efficiency, and scalability of the environment
  • Mentor junior team members

What we are looking for

Must-haves

  • [Must Have] 4-6 years of strong hands-on working knowledge of RHEL/CentOS 5/6/7 in an enterprise environment and a good understanding of the design and configuration of UNIX/Linux systems
  • [Must Have] Hands-on experience with orchestration/configuration management tools (e.g. Ansible, Chef, or Puppet)
  • [Must Have] 4-6 years of experience supporting and managing a large number of complex multi-server, multi-vendor, multi-technology infrastructures
  • [Must Have] 4-6 years of experience leading projects from technical design all the way through to delivery
  • [Must Have] Hands-on experience with one or more scripting languages (e.g. Bash, Python)
  • [Must Have] Strong computer science fundamentals and strong exploratory skills for evaluating new-age technologies
  • [Must Have] Exposure to a few of the following: logging (Rsyslog), monitoring frameworks (Prometheus, Nagios), Linux security, databases (MySQL/SQL)
  • [Must Have] An "SRE" mindset. You own what you set up and manage

Good-to-haves

  • 4+ years of hands-on experience setting up and managing physical data centre environments

Experience Required:

4 to 6 Years

Vacancy:

2 - 4 Hires
