Our client, Netskrt.io, is looking for a Director of Live Operations & Systems Reliability to oversee its managed service. Netskrt’s eCDN service comprises three major components: intelligent content collection, staging and distribution; adaptive networking, leveraging connectivity as and when available; and an edge cache that allows users to access the content they want locally, using the apps and subscriptions that they already have. Your prime responsibility and priority is to ensure customer excellence. You are passionate about system reliability and will influence and drive the strategic Systems Reliability Engineering mission. As the leader of Live Operations / Systems Reliability you are responsible for monitoring and maintaining the health of the system.
We are a highly motivated team, dedicated to delivering products and services that improve the customer experience when accessing internet video at the edges of the network. We are developing a set of inter-related technologies targeting businesses that offer WiFi to their customers but have limited bandwidth.
You are somebody who enjoys solving challenging problems and has a customer-centric mindset. You should be passionate not only about learning new technologies, but also about running systems and software in the real world. You must enjoy a close-knit team environment of shared responsibility, and be both a team player and a self-starter. You have exceptional technical skills, learn quickly, adapt easily, and have great interpersonal and communication skills.
Netskrt offers the opportunity to obtain hands-on experience with storage, networking, security, and cloud technologies. As part of the Netskrt team you will have the opportunity to design and implement solutions to challenging problems in a startup environment, working with accomplished engineers and a leadership team with a proven track record of success.
Key Responsibilities:
As Director, Live Operations / Systems Reliability you are responsible for system health, performance and reliability. Your mission is to ensure that our service is fast, highly available, scalable and able to withstand unprecedented load. Your team will be at the heart of solving deployment and production problems, building automation tools for system health and for production acceptance tests that validate production changes, as well as for ongoing live monitoring and reporting. You will collaborate closely with the Engineering and Networking & Infrastructure teams to ensure a holistic approach to troubleshooting and to implementing preventative measures, so that the system is well instrumented and highly fault tolerant. The successful candidate will possess an outstanding record of professional experience and will thrive in an environment that demands accountability. You will be a key driver in helping the team understand the big picture and in instilling a customer-first attitude.
Specific areas of responsibility include:
• Monitor, manage and maintain Netskrt’s managed service
• Manage availability, latency, scalability and efficiency by instilling engineering reliability into our deployed systems, with a focus on fault-tolerant approaches
• Drive quality accountability within the organization with well-defined processes, metrics, and goals for process quality. This includes leading effective postmortems and ensuring actions are followed up
• Drive capacity planning, performance analysis, instrumentation and other nonfunctional systems requirements
• Define and report progress on strategic initiatives and project-level tasks to all stakeholders, including senior executives and clients, using effective communication approaches with each constituency
• Implement metrics-driven processes to ensure service quality targets are met
• Engage, influence, and evangelize SRE practices with development, operational and product groups to align technology service/solution delivery.
Required Qualifications, Skills, Experience:
• Degree in Computer Science or related technical field
• Accomplished leader with 5+ years managing regional and global teams and systems
• Expert knowledge in all aspects of designing, developing, and managing large real-time systems
• Project and process management
• Prior successful experience as a systems performance or systems reliability engineer
• Mastery of Linux/Unix
• Mastery of coding / scripting languages (e.g., C++, PHP, Python, Perl)
• Mastery of fault-tolerant approaches in large-scale distributed environments and high-performance systems
• Demonstrated experience working in large, complex systems environments
• Deep understanding of internet and networking protocols
• Analytical mind with excellent problem-solving skills
• Excellent time management, communication, decision-making, presentation, leadership and organizational skills
• Ability to lead across functions and motivate a matrix staff
Desired Qualifications:
• Proven leader of technology solutions in a high-volume transaction environment
• Excellent written and verbal communication with clients, employees, and the management chain, including status reports, project plans, and presentations
• Familiarity with security frameworks and risk management methodologies
• Knowledge of patch management, intrusion detection/prevention systems
• Cloud computing and cloud technologies (AWS, OpenStack)
• Configuration/container management (Chef, Puppet, Mesos, Kubernetes)
• Experience with caching and CDN (content delivery network) technologies (Netflix, Amazon, Google, Limelight, Akamai, Fastly)
• Knowledge of data protection operations and legislation (e.g. GDPR)
• Experience with securing IoT and/or autonomous remote devices.
For questions about the company, or to apply, contact Raymond@netskrt.io or Raymond@gorecruitment.com