As a Staff Software Engineer - AI In-Market Engineering, you’ll be the final escalation point for the most complex and critical issues affecting enterprise and hyperscale environments. This hands-on role is ideal for a deep technical expert who thrives under pressure and has a passion for solving distributed system challenges at scale.
You’ll collaborate with Engineering, Product Management, and Field teams to drive root cause resolutions, define architectural best practices, and continuously improve product resiliency. Leveraging AI tools and automation, you’ll reduce time-to-resolution, streamline diagnostics, and elevate the support experience for strategic customers.
Key Responsibilities
Technical Expertise & Escalation Leadership
- Own critical customer case escalations end-to-end, including deep root cause analysis and mitigation strategies.
- Act as the highest technical escalation point for Infinia support incidents, especially in production-impacting scenarios.
- Lead war rooms, live incident bridges, and cross-functional response efforts with Engineering, QA, and Field teams.
- Utilize AI-powered debugging, log analysis, and system pattern recognition tools to accelerate resolution.
Product Knowledge & Value Creation
- Become a subject-matter expert on Infinia internals, including metadata handling, storage fabric interfaces, performance tuning, and AI integration.
- Reproduce complex customer issues and propose product improvements or workarounds.
- Author and maintain detailed runbooks, performance tuning guides, and RCA documentation.
- Feed real-world support insights back into the development cycle to improve reliability and diagnostics.
Customer Engagement & Business Enablement
- Partner with Field CTOs, Solutions Architects, and Sales Engineers to ensure customer success.
- Translate technical issues into executive-ready summaries and business impact statements.
- Participate in post-mortems and executive briefings for strategic accounts.
- Drive adoption of observability, automation, and self-healing support mechanisms using AI/ML tools.
Required Qualifications
- 8+ years in enterprise storage, distributed systems, or cloud infrastructure support/engineering.
- Deep understanding of file and object storage protocols (POSIX, NFS, S3), storage performance, and Linux kernel internals.
- Proven debugging skills at system/protocol/app levels (e.g., strace, tcpdump, perf).
- Hands-on experience with AI/ML data pipelines, container orchestration (Kubernetes), and GPU-based architectures.
- Exposure to RDMA, NVMe-oF, or high-performance networking stacks.
- Exceptional communication and executive reporting skills.
- Experience using AI tools (e.g., log pattern analysis, LLM-based summarization, automated RCA tooling) to accelerate diagnostics and reduce MTTR.
Preferred Qualifications
- Experience with DDN, VAST, Weka, or similar scale-out file systems.
- Strong scripting/coding ability in Python, Bash, or Go.
- Familiarity with observability platforms: Prometheus, Grafana, ELK, OpenTelemetry.
- Knowledge of replication, consistency models, and data integrity mechanisms.
- Exposure to Sovereign AI, LLM model training environments, or autonomous system data architectures.
This position requires participation in an on-call rotation to provide after-hours support as needed.