This article appeared in the March 2017 issue.
How to conquer the high-stakes pressure of managing a 24/7 multimillion-dollar e-commerce system.
The memories of a past system outage can be forever seared in your brain. Outages are terrible. They can even make you physically ill, because you know the gravity of the situation and its impact on revenue, your company’s reputation, and the never-ending residual cleanup efforts downstream.
Don’t get me wrong, I find managing the development of e-commerce systems thrilling. You’re constantly contributing to code that impacts millions of users, and developing highly visible, enterprise-wide projects that ensure that transactions keep happening. But with great reward comes great risk. An outage can destroy a company’s hard-won reputation in a matter of minutes. No one dies in a system outage, but it can sure feel like it. And as a good leader, you end up taking the brunt of that.
While this article will not delve into the technological nuts and bolts of system scalability, resiliency or uptime, it will show you how to maintain your composure under the high-stakes pressure of e-commerce.
Why Outages Happen: Technological Factors
Many e-commerce shops share the same company-wide conditions that make them ripe for an outage. The three most common conditions are:
- Change vs. stability
- Extreme web traffic spikes
- Distributed systems
Change vs. Stability
Successful software development is often measured by how well a team balances two diametrically opposed forces: making changes to the system as quickly as possible while maintaining stability, uptime and performance. We have 35 developers working on cross-dependent systems with colliding code and releases that deploy multiple times a day. At the same time, we have non-IT system configuration changes (including content, product, pricing, promotions, and terms setup) that jeopardize system stability. What could possibly go wrong?
Extreme Web Traffic Spikes
Extreme traffic variability is common in e-commerce shops. For example, within the direct sales channel, there are huge spikes in traffic at the end of each month as distributors place orders to advance commissions, or when products are about to be discontinued ahead of a new catalog. At my company, the largest spike of the year always occurs during our huge holiday promotions. This cyclical business magnifies our web traffic to 10 to 20 times its normal volume.
The impact of traffic on system performance is nonlinear. At certain inflection points while traffic is increasing, system performance can begin degrading rapidly. Though load testing is extremely important, it is a significant challenge to accurately simulate real-life traffic and identify its impact on your system. After all, e-commerce system traffic does not scale one-to-one with a company’s year-over-year growth; while a company might be growing by 10 percent annually, its systems may need to accommodate 100 percent growth during traffic spikes.
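To make the nonlinearity concrete, here is a minimal Python sketch. It assumes a textbook single-server queueing model (M/M/1), which is a simplification chosen for illustration and not a description of any real e-commerce system. The 10 ms service time and the baseline of 5 requests per second are hypothetical numbers, picked so that a 20-times spike reaches the server's full capacity.

```python
def mean_response_time(service_time_s: float, arrival_rate_per_s: float) -> float:
    """Mean response time of an M/M/1 queue: R = S / (1 - rho).

    rho (utilization) is arrival_rate * service_time. This is a textbook
    approximation used for illustration, not a model of a real system.
    """
    utilization = arrival_rate_per_s * service_time_s
    if utilization >= 1.0:
        # Requests arrive faster than they can be served: the queue
        # grows without bound, so the mean response time is infinite.
        return float("inf")
    return service_time_s / (1.0 - utilization)


# Hypothetical numbers, chosen only to illustrate the shape of the curve.
SERVICE_TIME_S = 0.010  # 10 ms of work per request
BASELINE_RPS = 5.0      # normal traffic, requests per second

if __name__ == "__main__":
    for multiplier in (1, 5, 10, 15, 19, 20):
        rps = BASELINE_RPS * multiplier
        response_s = mean_response_time(SERVICE_TIME_S, rps)
        print(f"{multiplier:>2}x traffic ({rps:5.1f} req/s): "
              f"{response_s * 1000:8.1f} ms")
```

Under these assumptions, response time only about doubles between normal traffic and a 10-times spike, but it climbs steeply as the multiplier approaches 20 and becomes unbounded once the server is saturated. Real systems have many interacting bottlenecks, so the actual inflection points can only be found by load testing.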
Distributed Systems
A recent technology trend is for IT organizations to migrate away from single monolithic applications to smaller, more distributed systems in order to provide improved scalability and elasticity. There is a possible tradeoff, though, as these systems can be more challenging to manage and troubleshoot during system outages. With a single monolithic system, outage issues are easily discernible by glancing at a health dashboard. However, with distributed systems, which are increasingly common, the root causes or even the symptoms of an outage are often difficult to determine.
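The difficulty of separating root causes from symptoms can be sketched in a few lines of Python. The service names and the dependency map below are hypothetical, and the rule used (an unhealthy service with no unhealthy dependencies is a likely starting point) is a simple illustrative heuristic, not a substitute for real monitoring and triage tools.

```python
from typing import Dict, List, Set


def root_cause_candidates(
    unhealthy: Set[str], depends_on: Dict[str, List[str]]
) -> List[str]:
    """Return unhealthy services whose direct dependencies are all healthy.

    A failing service that depends on another failing service may only be
    showing a symptom; one with no failing dependencies is a better place
    to start investigating. This is a heuristic, not a diagnosis.
    """
    return sorted(
        service
        for service in unhealthy
        if not any(dep in unhealthy for dep in depends_on.get(service, []))
    )


# Hypothetical dependency map for an illustrative storefront.
DEPENDS_ON: Dict[str, List[str]] = {
    "storefront": ["cart", "catalog"],
    "cart": ["pricing", "inventory"],
    "catalog": ["inventory"],
    "pricing": [],
    "inventory": [],
}

if __name__ == "__main__":
    # Three services are alerting at once during the hypothetical outage.
    alerting = {"storefront", "cart", "pricing"}
    print(root_cause_candidates(alerting, DEPENDS_ON))
```

In this example three services are alerting, but only the pricing service has no failing dependency of its own, so it is the first place to look. In a monolith the same failure would show up as a single red indicator.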
How to Persevere Successfully: Human Factors
- Realize it’s not just on you. You need a support team. Don’t get me wrong, I’d absolutely take a bullet for my team, and always promote servant leadership. But you also need to enable ownership and accountability at all organizational levels through leader-leader management. One way to accomplish this is to encourage leaders at every level to provide solutions and help make decisions.
- Build relationships and mutual trust with solid partners across the business. This long-term investment takes time and effort, but it’s crucial to have a trusted support system external to IT when an outage occurs.
- Communicate with your executive team. Transparency and ongoing communication about day-to-day activities, successes and opportunities in good times will carry over during an outage. Providing timely updates and context about how the outage is impacting each business unit helps show your business partners that you care and are doing everything possible.
- Never become complacent. Always be vigilant. Build a repeatable process framework that includes load and stress testing and automated functional tests, and continue to challenge performance goals and KPIs within your team. Practice and prepare for outages by defining your team’s responsibilities, creating standard operating procedures and prepping tools for monitoring and triaging. Make sure you have manpower devoted to ensuring stability and resiliency, such as dedicated Performance Tuning Engineers. And most important, never forget the lessons learned from the last outage.
- Keep calm. It’s easy to get rattled during outages, especially as you realize the gravity of the impacts. Remain analytical, logical, businesslike and objective—learn from the failure and move on.
- Most important, work with and for great people. This makes the stress of an outage more tolerable. My team and I are more motivated to succeed and persevere because our company is caring and compassionate. Don’t get me wrong—we are held accountable and are continuously challenged to innovate and deliver stellar business results. But our best work happens because we work in an environment of support, not fear.
I want to conclude by sharing a note that my company’s CEO and Co-Founder sent me on the night of a recent system outage. We were a couple of hours into the problem, a point at which he could easily have been livid.
The note read: “Make sure you and your teams feel loved and supported. We couldn’t be more proud of our IT team and today doesn’t diminish that. Remember, it’s a puzzle to solve rather than a failure to remedy. You are appreciated for the fix, not blamed for the error. Make sure the guys and gals on your team know how much we appreciate their diligence and hard work.”
My company’s entire culture originates from such incredible leadership, and that is the primary reason my team continues to thrive in this stressful e-commerce world.
Christopher E. Johnson is Vice President of Information Technology at Scentsy Inc.