SLA: The myth of simplicity

Service Level Agreements. SLAs.
Three of the most contentious words, and most contentious acronym, in the technology sector. Arguments are had, suits are filed, and relationships broken and strained as a result of this single concept.
How can something seemingly simple as setting an agreed upon level of service delivery be so problematic and misunderstood?
The word agreement is the key to the problem. SLAs assume that all parties understand and agree of the level of service. And how that information is to be reported. And who is responsible for reporting the data. And how long you have to file grievances. And who handles problems. And…well, lawyers are involved.
As Guy Kawasaki states regarding the lies of venture capitalists: there is no such thing as a vanilla term sheet.
There is also no such thing as a vanilla SLA. A company that tries to present you with a standardized SLA is trying to pull something over on you.
Some rules about SLAs.

  1. The vendor does not define the SLA. If the vendor selling the product tells you, the customer, what your expected level of service is, then they don’t care about you. Find another vendor.
  2. The customer does not define the SLA. If the customer tells you that they cannot sign an SLA unless you, the vendor, agree to their conditions, walk away from the deal.
  3. An SLA is not an SLO. Service Level Objectives are the targets of success defined by both parties within the SLA. These numbers, however, are not the alpha and the omega of an SLA
  4. A customer-initiated penalty condition is always in the vendors favor. If the vendor states that the client must initiate the SLA grievance conversation when SLOs are violated, then the vendor is assuming that you are not looking at the data.
  5. SLOs should never be based on single, aggregated metrics from the data. If some bozo tries to say that they provide 99% availability and 3 second average performance, walk away. That is not an SLO.
  6. SLAs are not set in stone. If something is not working, or if targets change, or anything changes, then the parties have to be willing to sit down on a schedule (defined in the SLA) and renegotiate their SLA.
  7. The vendor and the customer have transparent access to the data used for the SLO. If the ccustomer cannot see the data that the vendor is using in the SLO anytime it wants, there will always be a level of mistrust. If you like having all your customers mistrust you, this is a great strategy.
  8. The Problem and Issue Management processes are clearly defined. When something bad happens, or a change needs to be made, the customer and the vendor have to have very clearly defined roles in the process. Responsibility and trust. Do you have that in your current SLA?
  9. The customer and the vendor decide when a problem or issue is resolved. It is not up to one side in an SLA to decide when an issue or problem is resolved. As there are likely penalties involved the longer the abnormal state exists, the customer has a vested interest in quick resolution. As there is likely lost revenue on the table, the customer has the same interest. But the customer also has the seemingly unreasonable idea that this will never happen again, it will be clearly documented, and that getting the right solution is better than getting a solution.
  10. Communication is the key to a good SLA. In the 9 previous points, the emphasis is on communication, the sharing of information. Current SLAs seem to be designed to hide information from each side, and only release it under the most dire situation. People talk. The information will get out. You want your well-crafted brand to implode because you have a reputation as sneaky and untrustworthy?

I’ve likely missed many of the key points, but these are the ones that I see, from both sides of the field, on a pretty regular basis.
In the end, an SLA is not simple. It is not standardized. It is not defined by one side or the other. It is a negotiated treaty of behavior that, in the end, defines the daily operational relationship between two organizations. If you enter an SLA process with both sides trying to find the best way to work together in the long term, there is a good chance that the SLA will be easier than if you go in as stone-cold adversaries.

Why Web measurements? The Series.

In my life as a consultant, I often discuss What Web performance data means and how to interpret it to solve problems. Solving the problems is, however, inherently based on whether the data that is collected is meaningful. In trying to find data that is meaningful, we have found that there are four categories that Web performance measurements fall into: Customer Generation, Customer Retention, Business Operations, and Technical Operations.

Customer Generation

How can you use Web performance measurement data to outperform your competition and impress your prospects. Read it here!

Customer Retention

Impress your customers with your skill and responsiveness, and keep the competition from sneaking in the back door. Read it here!

Business Operations

Know how you are doing against your competition and prioritize what you need to do to stay ahead. Read it here!

Technical Operations

Know what to measure and how often to keep a detailed eye on your internal systems and external performance. Read it here!

Why Web Measurements? Part IV: Technical Operations

In the first three parts of this series, the focus has been on the business side of the business: Customer Generation, Customer Retention, and Business Operations. The final component of any discussion of why companies measure their Web performance falls down to Technical Operations.

Why is Technical Operations last?

This part of the conversation is the last, mainly because it is the most mature. A technical audience will understand the basics of a distributed Web performance measurement system, or a Web analytics system, or a QA testing tool without too much explanation. The problems that these tools solve are well-defined and have been around for many years.
Quickly thinking about these types of problems makes it clear, however, that the kind of data needed in a technical operations environment is substantially different than that which is needed at the Business Operations level. Here, the devil is in the details; at Business Operations, the devil is in the patterns and trends.

What are you trying to measure?

The short answer is that a Technical Operations team is trying to measure everything. More data is better data at this level. The key is the ability to correlate multiple sources of system inputs (Web performance data, systems data, network data, traffic data, database queries, etc.) to detect the patterns of behavior which could indicate impending crises or complete system outage, or simply a slower than expected response time during peak business hours.
And while Technical Operations teams thrive on data, they do not thrive on explaining this data very well to others. So the metrics which are important in one organization may not be the key ones in another. Or they may be called by a completely different name. Which is why Technical Operations sigh and throw up their hands in despair when talking to management who are working from Business Operations data.

How do you measure it?

Measure early. Measure often.
This sums up the philosophy of most Technical Operations teams. They want to gather as much data as possible. So much data that the gathering of this data is often one step away from affecting the performance of their own systems. This is how the scientific mind works. So, be prepared to control this urge to measure and instrument everything with a need to ensure that the system is operationally sound.


Even in the well-developed area of Technical Operations, there is still opportunity to ensure that you are measuring the right things the right way. Do an audit of your measurements. Ask the question “why do we measure this this way?”.
Measure meaningful things in a meaningful way.

Why Web Measurements? Part III: Business Operations

In the Customer Generation and Customer Retention articles of this series, the focus was on Web performance measurements designed to serve an audience outside of your organization. Starting with Business Operations, the focus shifts toward the use of Web performance measurements inside your organization.

Why Business Operations?

When I was initially developing these ideas with my colleague Jean Campbell, the idea was to call this section Reporting and Quality of Service. What we found was that this didn’t completely encompass all of the ideas that fall under these measurements. The question became: which part of the organization do reporting and QoS measurements serve?
What was clear was these were the metrics that reported on the health of the Web service to management and the company as a whole. This was the measurement data that the line of business tied to revenue and analytics data to get a true picture of the health of the online business.

What are you measuring?

Measurements for business operations need to capture the key metrics that are critical for making informed business decisions.

  • How do we compare to our competitors?
  • Are we close to breaching our SLAs?
  • Are the third-parties we use close to breaching their SLAs?
  • What parts of the site affect performance / user experience the most so we can set priorities?
  • How does Web performance correlate with all the other data we use in our online business?

Every company will use different measures to capture this information, and correlate the data in different ways. The key is that you do use it to understand how Web performance ties into the line of business.

How often do I look at it?

Well, honestly, most people who work in business operations only need to examine Web performance once a day in a summary business KPI report (your company has a useful daily KPI report that everyone understands and uses, right?), and in greater detail at weekly and monthly management meetings.
The goal of the people examining business operations data is not to solve the technical problems that are being encountered, but to understand how the performance of their site affects the general business health of the company, and how it plays in the competitive marketplace.

What metrics do I need?

Business operations teams need to understand

  • End-to-end response time for measured business processes
  • Page-level response times for measured business processes
  • Success rate of the transaction during the measurement period
  • How third-parties are affecting performance
  • How Web analytics and Web performance relate
  • How different regions are affected by performance
  • How does performance look from the customer ISPs and desktops

Detailed technical data is lost on these people. It is their role to take all of the data they have, and present a picture of the application as it affects the business, and discuss challenges that they face at a technical level in terms of how they affect the business.


For people who work at an extremely detailed level with Web measurement data (the topic for the next part of this series), Business Operations metrics seem light, fluffy, and often meaningless. But these metrics serve a distinct audience: the people who run the company. Frankly, if the senior business leaders at an organization are worried on a daily basis about the minute technical details taht go into troubleshooting and diagnosing performance issues, I would be concerned.
The objective of Business Operations measurements is to convey the health of the Web systems that support the business, and correlate that health with other KPIs used by the management team.

Why Web Measurements? Part II: Customer Retention

In the first part of this series, using Web performance measurements to generate new customers was the topic. This article focuses on using the same data to keep the customers you have, and make them believe in the value of your service.

Proving the Point

Getting a customer is the exciting and glamorous work. Resources are often drawn from far and wide in an organization to win over a prospect and make them a customer.
Once the deal is done, the day-to-day business of making the customer believe that they are getting what they paid for is the job of the ongoing benchmarking measurements. CDNs and third-party services need to prove that they are delivering the goods, and this can only be done by an agreed upon measurement metric.
Some people leap right into an SLA / SLO discussion. As a Web performance professional, I can tell you that there are few SLAs that are effective, and ever fewer that are enforceable.
Start with what you can prove. Was the performance that was shown me during the pre-sales process a fluke, or does it represent the true level of service that I am getting for my money?

Measure Often and Everywhere

The Web performance world has become addicted to the relatively clean and predictable measurements that originate from high-quality backbone measurement locations. This perspective can provide an slightly unrealistic view of the Web world.
How many times have you heard from the people around you about site X (maybe this is your site) behaving badly or unpredictably from home connections? Why, when you examine the Web performance data from the backbone, doesn’t this show up?
Web connections to the home are unpredicatble, unregulated, and have no QoS target. It is definitely best effort. This is especially true in the US, where there is no incentive (some would say that there is a barrier) to delivering the best quality performance to the home. But that is where the money is.
As a service provider, you better be willing to show that your service is able to surmount the obstacles and deliver Web performance advantages at the Last Mile and the Backbone.
Don’t ever base SLAs on Last Mile data – this is Web performance insanity. But be ready to prove that you can deliver high quality performance everywhere.

Show me the data

As a customer of your service, I expect you to show me the measurement that you’re are collecting. I expect you to be honest with me when you encounter a problem. I do not want to hear/see your finger-pointing, especially when you try and push the blame for any performance issues back to me.
As a service provider, you live and die by the Web performance data. And if you see something in the data, not related to your business, but that could make my site faster and better, tell me about it.
Remember that partnership you sold me on during the Customer Generation phase? Show it to me now. If you help me get better, this will be added to plus column on the decision chart at renewal time, when your competitor comes knocking on my door with a lower price and Web performance data that shows how much you suck.

Shit Happens. Fess up.

The beauty of Web performance measurement is that your customers can replicate exactly the same measurements that you run on their behalf. And, they may actually measure things that you hadn’t thought about.
And sure as shooting, they will show up at a meeting with your team one day with data that shows that your service FUBAR‘d on a massive scale.
It’s the Internet. Bad shit happens on the Internet. I’ve seen it.
If you can show them that you know about the problem, explain what caused it, how you resolved it, and how you are working to prevent it, good.
Better: Call them when the shit happens. Let them know that you know about the problem and that you have a crack team of Web performance commandos deployed worldwide to resolve the problem in non-relativistic time. Blog it. Tweet it. Put a big ‘ol email in their inbox. Call your pimary contact, and your secondary contact, and your tertiary contact.
Fess up. You can only hide so much before your customers start talking. And the last thing your want prospects seeing is your existing customers talking smack about your service.


Web performance measurement doesn’t go away the second you close the deal. In fact, the process has only just begun. It is a crazy, competitive world out there. Be prepared to show that you’re the best and that you aren’t perfect every single day.

GrabPERF: What and Why

Why GrabPERF?

About four years ago, I had a bright idea that I would like to learn more about how to build and scale a small Web performance measurement platform. I’ve worked in the Web performance industry for nearly a decade now, and this was an experimental platform for me to examine and encounter many of the challenges that I see on a daily basis.
The effort was so successful and garnered enough attention during the initial blogging boom that I was able to sell the whole platform for a tiny (that is not a typo) sum to Technorati.
The name is taken from another experimental tool I wrote called GrabIT2 which uses the PHP cURL libraries to capture timings and HTML data for HTTP requests. It is an extension of my articles and writings on Web performance that started at, and that have since moved to this blog.

What is GrabERF?

GrabPERF is a multi-location measurement platform, based on PERL, cURL, PHP, and MySQL that is designed to

  • Measure the base HTML or a single-object target using HTTP or HTTPS
  • Report the data to a central database (located in the San Francisco Area)
  • Report the data using a GUI or through text based download

Why not Full Pages with all Objects?

Reason 1: I work for a company that already does that. Lawyers and MBAs among you, do the math.
Reason 2: I am an analyst, not a programmer. The best I can say about my measurement script is hack job.

Why is the GrabPERF interface so clunky?

See reason 2 above.
If you want to write your own interface to the data, let me know.

Why has the interface not changed in nearly three years?

The current interface works. It’s simple, clean, and delivers the data that I and the regular users need to analyze performance issues. If there is something more that you would like to see, let me know!

I like what I see. How can I host a measurement location?

Just contact me, and I can provide you with a list of PERL modules you will need to install on your linux server. In return, I need a static IP address of the machine hosting the measurement agent.

How stable is GrabPERF?

Most of the time, I forget it’s even running. I have logged onto the servers and typed in uptime and discovered that it’s been 6 months or more since the servers have been re-booted.
It was designed to be simple, because that’s all I know how to do. The lack of complexity makes it effectively self-managing.
Shouldn’t all systems be that way?

What if my question isn’t asked / answered here?

Your should know the answer to this by now: contact me.

Why Web Measurements? Part I: Customer Generation

Introduction to the Series

This is the first of a four-part series focusing on the reasons why companies measure their Web performance. This perspective is substantially different than ones posited by others in the field as they focus on the meat and potatoes reasons, rather than the sometimes more difficult to imagine future effects that measurement will bring.

Reason One: Customer Generation

It is critical that a company be able to show that their Web services are superior to others, especially in the third-party services and delivery sectors of the Web. In this area, Web performance measurement is key to demonstrating the value and advantage of a service versus the option of self-delivering or using another competitor’s service.
Comparative benchmarking that clearly demonstrates the performance of each of the competitive services in the geographic regions that are of greatest interest to the prospect is key to these Web performance measurements. To achieve truly competitive benchmarks and prove the value of a service, measurements must be realistic and flexible.
In the CDN field, a one object fits all approach is no longer valid. CDNs are responsible for delivering not just images or static objects, but may also host an entire application on their edge servers, serving both HTTP and HTTPS content. In other cases, the application may not be hosted at the edge, but the edge server may act a s a proxy for the application, using advancing routing algorithms to deliver the visitor the requested dynamic content more quickly (in theory) than the origin server alone.
This complex range of services means that a CDN has to be willing to demonstrate effective and efficient service delivery before the sale is complete. A CDN has to be willing to expose their system not just to the backbone-based measurements offered in a traditional customer generation process, but to take measurements from the real-user perspective.
Ad-providers have to be willing to show that their service does not affect the overall performance of the site they are trying to place their content on. Web analytics firms face the same challenge. Web analytics firms have one advantage: if their object doesn’t load properly, it may not effect the visitor experience. However, neither ad-providers nor Web-analytics providers can hide frow Web measurement collection methods that show all of the bling and the blemishes.
Using Web performance measurements to generate customers is a way that a firm can clearly show that they have faith enough in their service to openly compare it to other providers and to the status quo.
But woe be the firm who uses Web performance metrics in a way that tries to show only their good side. Prospects become former prospects very quickly if a firm using Web performance data to generate new business is found to be gaming the system to their advantage. And it will happen.
Customer Generation is a key method that Web performance measurements are used by firms to clearly show how their service is superior to what a prospect currently has, or is also considering. However, this method does come with substantial caveats, including

  • The need to measure what is relevant
  • The need to measure from where the prospect has the greatest interest
  • The need to consider that gaming the system to show advantage will cost a firm in the end.