About the job
Applied Digital (NASDAQ: APLD) operates next-generation data centers for high-performance compute (HPC) and machine learning applications. As a rapidly growing publicly traded company, we seek an HPC Systems Administrator to support the design, development, commissioning, and ongoing operations of state-of-the-art HPC data centers.
**Job Summary**
The HPC Systems Administrator has worked in state-of-the-art GPU-centric HPC data centers and has proven ownership and accountability over production-level environments. The candidate must be self-sufficient, be comfortable in a fast-paced greenfield environment, and have a growth mindset.
**Education and Experience**
- 5-7 years of experience, with at least 2 years of hands-on responsibilities in a GPU HPC environment
- Strong understanding of AI and ML workloads, including deep learning frameworks (TensorFlow, PyTorch, etc.)
- Strong knowledge of cluster management, scheduling, and reporting in enterprise and HPC data center operations, using tools such as Kubernetes, Singularity, Slurm, SlurmDB, and Grafana
- Demonstrated mastery of at least one major Linux server distribution, with comfort and proficiency across a variety of others
- Strong understanding of security trade-offs with various architectures and virtualization techniques
- Comfort with, and a track record of, scripting and automating systems and processes using tools such as Bash, Python, and Ansible
- Proven experience learning complex new systems and technologies
- Full-stack technical depth of HPC Linux clustered environments
- Demonstrated ability to contribute to and constantly improve cutting-edge HPC systems environments
- Knowledge and understanding of advanced HPC systems architectures, technologies, packages, and workloads, in line with industry standards
- Excellent verbal and written communication skills, with the ability to communicate complex concepts to non-technical internal and external stakeholders
- Proven track record of effectively prioritizing heavy workloads in a fast-moving environment
- Positive and constructive attitude with strong attention to detail and the ability to work productively with others
- Must be comfortable working in a rapidly growing startup environment while operating an enterprise-level production data center environment
- Bachelor's or Master's degree in Computer Science, Information Technology, or a related field is required
- Strong engineering background with good judgment, rationale, and technical aptitude
**Primary Job Duties**
- Accountable for the build-out, documentation, and administration of a new HPC data center environment, including monitoring, cluster management, and server systems
- Evaluate and implement cluster management and scheduling systems for ML, AI, and Deep Learning workloads
- Responsible for building, deploying, and managing HPC system images
- Ensure high availability and uptime to meet customer service level agreements and company excellence standards
- Challenge the status quo with a relentless focus on constant iteration and improvement, including but not limited to automation, monitoring, alerting, updates, and technical refreshes
- Take a hands-on approach, working cross-functionally with stakeholders and peers to accomplish individual, team, and company goals
- Ensure compliance with industry regulations and standards such as SOC 1 & 2 and ISO 27001/2
- Stay current with the latest trends and technologies, and ensure that the company's infrastructure is competitive with comparable state-of-the-art systems
- Develop key metrics and provide regular reports to senior management on the status of systems, deployments, and customer workloads
Note
This job description in no way states or implies that these are the only duties to be performed by the employee(s) incumbent in this position. Employees will be required to follow any other job-related instructions and to perform any other job-related duties requested by any person authorized to give instructions or assignments. All duties and responsibilities are essential functions and requirements and are subject to possible modification to reasonably accommodate individuals with disabilities. To perform this job successfully, the incumbents will possess the skills, aptitudes, and abilities to perform each duty proficiently. Some requirements may exclude individuals who pose a direct threat or significant risk to the health or safety of themselves or others. The requirements listed in this document are the minimum levels of knowledge, skills, or abilities. This document does not create an employment contract, implied or otherwise, other than an “at will” relationship.
The company is an Equal Opportunity Employer and a drug-free workplace, and complies with ADA regulations as applicable.