Role: IBM Workload Scheduler Administrator / Infrastructure Engineer
Location: Riverwoods, IL (3 days a week onsite); remote may be considered for an exceptional candidate
Job Type: Contract (W2)
Expected Working Hours: Monday to Friday, 9:00 am to 5:00 pm US/Central, full time, with flexible hours for occasional weekend change control and a rotating on-call shared with two other team members
Reports To: Senior Manager Software Engineering
Description
- We are seeking a highly skilled IBM Workload Scheduler (IWS) Administrator, with 3 to 5+ years of dedicated IWS administration experience, to manage, maintain, and optimize our enterprise batch scheduling infrastructure.
- The successful candidate will be responsible for the end-to-end administration of the IWS environment hosted primarily on Red Hat Enterprise Linux (RHEL).
- This role requires a strong blend of IWS expertise, Linux system administration, and scripting to ensure high availability and seamless execution of critical business workloads.
Responsibilities
- Administer the Production IBM Workload Scheduler (aka Tivoli Workload Scheduler) environment, which runs 28,000 unique daily jobs across ~350,000 daily job runs on 44 servers, along with three other change control environments.
- Administer, install, configure, and patch/upgrade IWS components (Master Domain Manager, Dynamic Agents, Dynamic Pool, Dynamic Workload Console).
- Work with the Product Owner to communicate work streams in Jira.
- Manage job promotions using Workload Application Template-based processes, ensuring platform stability checks for each promotion.
- Manage change control across four separate environments, enforcing standards and policies.
- Maintain and promote 99.17% Production platform uptime per calendar month (excluding planned outages and maintenance windows) using SOPs, DevOps tools, and disciplined change control.
- Communicate platform improvements to a user community of ~500 developers and data engineers.
- Production consists of 44 servers across MDM, DWC, and dynamic agents.
- Resolve complex job failures, performance bottlenecks, agent issues, and infrastructure issues.
- Advise on complex job scheduling design questions for the scheduling support team.
- Monitor scheduler health, manage database maintenance, perform backup/disaster recovery, and conduct monthly failovers.
- Define and maintain security policies, user authorizations, and authentication for the DWC.
- Respond to cybersecurity vulnerability assessments and regulatory audit inquiries (including PCI).
- Design and implement Ansible automation and self-healing mechanisms to reduce unplanned outages.
- Coordinate with offshore teams performing SOPs during non-working hours.
- Script in Python using the IWS REST API.
Required Technical Skills
- Strong experience with IBM Workload Scheduler V10.1+ architecture, especially Dynamic Workload Broker and high availability of MDMs managing Fault Tolerant Agent and Dynamic Agent architectures.
- Strong conceptual understanding of Master Domain Manager (MDM), Backup MDM (BMDM), Dynamic Workload Console (DWC), Fault Tolerant Agent (FTA), Dynamic Agent (DA).
- Strong grasp of conman CLI to monitor and control the production plan and check job/job stream/resource status.
- Strong grasp of composer CLI to define, modify and extract scheduling objects.
- Strong grasp of planman CLI to control the pre-production plan and GUI mirroring.
- Strong grasp of the lifecycle of the daily production planning process, including the phases of JNextPlan/FINAL.
- Proficiency in navigating the DWC web-based GUI to monitor workloads, manage user access security, and define scheduling objects.
- Experience installing IWS components and applying Fix Packs and Interim Fixes.
- Experience troubleshooting with logs under TWSDATA/stdlist and adjusting trace levels for netman, batchman, writer, mailman, etc.
- Strong experience with IBM WebSphere Liberty.
- Strong grasp of reading messages.log, traces.log, FFDC logs.
- Strong grasp of configuring JVM heap sizes.
- Strong grasp of configuring tracing scope, tracing levels, tracing retention, and trace strings.
- Strong experience with Red Hat Enterprise Linux 8+.
- Deep familiarity with bash/shell commands for text processing (grep, awk, sed), file manipulation, and system navigation.
- Ability to manage, start, stop, and troubleshoot systemd services using systemctl and journalctl for IWS agents and MDM.
- Experience managing user accounts, groups, and service accounts, with deep knowledge of Linux file permissions (chmod, chown, ACL on local filesystems and NFS).
- Ability to monitor system performance using top, htop, vmstat, iostat, sar to troubleshoot bottlenecks and platform unresponsiveness.
- Understanding of Logical Volume Manager (LVM) and filesystem usage.
- Experience checking TCP port availability, firewall rules (firewalld/iptables), and connectivity between MDM and Dynamic Agents using netstat, ss, ping, curl, etc.
- Experience managing SSL/TLS certificates, private keystores, and public truststores, and working with a Certificate Authority.
- Strong experience with scripting (Bash Shell, Python, etc.) for automation.
- Understanding of networking principles.
- Understanding of basic Oracle database administration, enough to troubleshoot alongside DBAs and determine when an issue lies in Oracle.
- Understanding of basic SQL to query job metadata.
- Understanding of how to check database connectivity.
- Understanding of AWS cloud infrastructure.
- Experience using a secrets manager (CyberArk PPM, HashiCorp Vault, or similar).