How do you spark joy in hundreds of millions of people? It starts with a vision: that technology can give voice to stories around the world. In delivering those much-loved stories, Netflix is responsible for a significant portion of global internet traffic.
To steward that responsibility, we work collaboratively with ISPs to deploy Open Connect, Netflix's Content Delivery Network (CDN): our in-house, custom-built network and server infrastructure responsible for delivering 100% of Netflix's video traffic.
We strive to deliver a great Netflix viewing experience in over 190 countries so our customers can watch whatever, whenever, interruption-free.
We are seeking a seasoned Reliability Engineer with extensive experience in *nix, networking, data analysis, and large-scale platform operations to design, scale, operate, automate, and analyze our globally distributed CDN. Come join us and play a meaningful role in our journey to entertain the world!
Responsibilities
- Drive continual improvement in resiliency, observability, monitoring, instrumentation, and automation with the primary goal of maintaining a highly scalable and reliable CDN platform worldwide.
- Aggregate, analyze, and correlate large amounts of server and application performance data. Use the Netflix Big Data platform as a flexible, specialized, and efficient toolset to identify opportunities for platform optimization and system reliability improvements, and to surface patterns and anomalies for further investigation.
- Provide technical design and engineering assistance to ISP partners to integrate our Open Connect Appliances.
- Handle Tier 3 escalations and participate in an on-call rotation for CDN platform production issues.
Qualifications
- 3+ years of Service Reliability/Operational experience running large-scale, high-performance systems & internet services with a focus on performance and reliability.
- Preferred: B.S. in Computer Science, Electrical Engineering, or Computer Engineering (or equivalent professional experience)
- Strong working knowledge of networking concepts and application protocols, especially TCP/IP, BGP, DNS, TLS, and HTTP/S, with focused experience in CDNs and HTTP cache/proxy technologies
- Skilled in designing, creating, and maintaining automation written in a programming language such as Python
- Expert-level knowledge of managing and debugging Unix/Linux systems (engineering fundamentals, networking, storage, operating systems) at scale.
- Experience with distributed analytic processing technologies (Hive, Presto/Trino, Spark SQL, etc.)
- Strong understanding of applied statistics and the ability to code systems that identify outlier behavior in large systems
- Some experience with container and container orchestration technologies (Docker, Kubernetes)
- Ability to work in a highly collaborative environment and to communicate cross-functionally with internal and external partners
Things that show how we think