Senior Consultant - System Management Job at LTIMindtree

Senior Consultant - System Management

Job Summary

Key Responsibilities:

Cluster Buildout and Management: Build and manage Azure High-Performance Computing (HPC) clusters, ensuring smooth cluster creation, maintenance, and automation of tasks. Develop and automate cluster buildout workflows and tasks to improve operational efficiency.

InfiniBand Networking: Configure, deploy, and troubleshoot InfiniBand networking layers. This includes cabling validation, network topology checks, and software/firmware upgrades during cluster buildout and production. Monitor the health of InfiniBand (IBUFM) nodes and resolve issues using existing tools and procedures defined by Microsoft.

Automation and PowerShell: Automate cluster buildout, maintenance processes, and reporting using PowerShell to streamline workflows. Work on automating incident queue mitigation and other manual processes to improve operational efficiency.

Troubleshooting and Issue Resolution: Create support cases and work with product support engineers to troubleshoot issues, track their resolution, and mitigate ongoing problems. Collaborate with teams to drive cluster deployment issues to resolution and escalate when necessary. Move faulty nodes or devices to the RMA queue, collaborate with vendors, and return them to production once resolved.

Cluster Maintenance: Conduct node diagnostics, routing, and recovery activities as part of regular cluster maintenance. Perform regular health checks and troubleshooting on target clusters in buildout progress, ensuring seamless operations.

Documentation: Create TSG (Technical Support Guides) and SOP (Standard Operating Procedure) documents for internal teams and external stakeholders.

Collaboration: Work closely with cross-functional teams, including vendors and component teams, to drive cluster buildout and troubleshooting tasks to completion. Participate in team collaboration to ensure efficient cluster operation and issue resolution.

Shift Work: This position requires the willingness to work on a 24/7 shift basis, ensuring continuous monitoring and troubleshooting support for cluster activities.

Qualifications:

Expertise in Bare Metal Servers and Datacenter Architecture, particularly with a focus on Azure concepts.
Strong knowledge and experience with Windows Server Operating Systems in physical server environments.
Basic understanding of Networking and troubleshooting, with experience in InfiniBand networking deployment and troubleshooting.
Linux (Ubuntu) knowledge and experience.
PowerShell experience for automation tasks.
Proven experience in IT Operations, including case creation, troubleshooting, and collaborating with product support engineers to resolve issues.
Power BI and Kusto experience at an intermediate level is preferred.
Excellent oral and written English communication skills to effectively collaborate with teams and document procedures.
Prior experience in a Fortune 500 company is a plus.

Additional Skills:

Strong troubleshooting skills and a deep understanding of cluster management and maintenance.
Ability to define and follow best practices and business processes.
Ability to manage and automate workflows, enhance operational efficiency, and ensure the timely resolution of issues.

Application Process:

If you meet the above qualifications and are eager to contribute to high-performance computing infrastructure in an Azure-based environment, we encourage you to apply. Please submit your resume and relevant experience.

Experience Required:

1 to 3 Years

Vacancy:

2 - 4 Hires
