Category: Web Performance

SLA: The myth of simplicity

2008-12-17 / spierzchala / 1 Comment

Service Level Agreements. SLAs.

Three of the most contentious words, and most contentious acronym, in the technology sector. Arguments are had, suits are filed, and relationships broken and strained as a result of this single concept.

How can something as seemingly simple as setting an agreed-upon level of service delivery be so problematic and misunderstood?

The word agreement is the key to the problem. SLAs assume that all parties understand and agree on the level of service. And how that information is to be reported. And who is responsible for reporting the data. And how long you have to file grievances. And who handles problems. And…well, lawyers are involved.

As Guy Kawasaki states regarding the lies of venture capitalists: there is no such thing as a vanilla term sheet.

There is also no such thing as a vanilla SLA. A company that tries to present you with a standardized SLA is trying to pull something over on you.

Some rules about SLAs.

The vendor does not define the SLA. If the vendor selling the product tells you, the customer, what your expected level of service is, then they don’t care about you. Find another vendor.
The customer does not define the SLA. If the customer tells you that they cannot sign an SLA unless you, the vendor, agree to their conditions, walk away from the deal.
An SLA is not an SLO. Service Level Objectives are the targets of success defined by both parties within the SLA. These numbers, however, are not the alpha and the omega of an SLA.
A customer-initiated penalty condition is always in the vendor’s favor. If the vendor states that the client must initiate the SLA grievance conversation when SLOs are violated, then the vendor is assuming that you are not looking at the data.
SLOs should never be based on single, aggregated metrics from the data. If some bozo tries to say that they provide 99% availability and 3 second average performance, walk away. That is not an SLO.
SLAs are not set in stone. If something is not working, or if targets change, or anything changes, then the parties have to be willing to sit down on a schedule (defined in the SLA) and renegotiate their SLA.
The vendor and the customer have transparent access to the data used for the SLO. If the customer cannot see the data that the vendor is using in the SLO anytime it wants, there will always be a level of mistrust. If you like having all your customers mistrust you, this is a great strategy.
The Problem and Issue Management processes are clearly defined. When something bad happens, or a change needs to be made, the customer and the vendor have to have very clearly defined roles in the process. Responsibility and trust. Do you have that in your current SLA?
The customer and the vendor decide when a problem or issue is resolved. It is not up to one side in an SLA to decide when an issue or problem is resolved. As there are likely penalties involved the longer the abnormal state exists, the vendor has a vested interest in quick resolution. As there is likely lost revenue on the table, the customer has the same interest. But the customer also has the seemingly unreasonable idea that this will never happen again, it will be clearly documented, and that getting the right solution is better than getting a solution.
Communication is the key to a good SLA. In the 9 previous points, the emphasis is on communication, the sharing of information. Current SLAs seem to be designed to hide information from each side, and only release it under the most dire situation. People talk. The information will get out. You want your well-crafted brand to implode because you have a reputation as sneaky and untrustworthy?
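
The rule above about single, aggregated metrics can be made concrete with a few lines of Python. This is only a sketch: the sample response times are invented for illustration, and the 3 second target is borrowed from the example in that rule. It shows how an average can sit comfortably under the target while one measurement in ten is painfully slow.

```python
from statistics import mean

# Hypothetical response times in seconds (invented for illustration):
# 90 fast measurements and 10 very slow ones.
samples = [1.0] * 90 + [12.0] * 10


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Ceiling division without floats: smallest rank covering pct% of samples.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


average = mean(samples)
p95 = percentile(samples, 95)
# Share of individual measurements that miss the 3 second target.
slow_share = sum(1 for s in samples if s > 3.0) / len(samples)

print(f"average: {average:.1f}s, 95th percentile: {p95:.1f}s, "
      f"share over 3s: {slow_share:.0%}")
```

With this made-up data the average looks fine against a 3 second target, yet the 95th percentile and the share of slow measurements tell a very different story. Which distribution measures belong in an SLO is something for the two parties to negotiate.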

I’ve likely missed many of the key points, but these are the ones that I see, from both sides of the field, on a pretty regular basis.

In the end, an SLA is not simple. It is not standardized. It is not defined by one side or the other. It is a negotiated treaty of behavior that, in the end, defines the daily operational relationship between two organizations. If you enter an SLA process with both sides trying to find the best way to work together in the long term, there is a good chance that the SLA will be easier than if you go in as stone-cold adversaries.

Why Web Measurements? Part IV: Technical Operations

2008-12-08 / spierzchala / 2 Comments

In the first three parts of this series, the focus has been on the business side of the business: Customer Generation, Customer Retention, and Business Operations. The final component of any discussion of why companies measure their Web performance comes down to Technical Operations.

Why is Technical Operations last?

This part of the conversation is the last, mainly because it is the most mature. A technical audience will understand the basics of a distributed Web performance measurement system, or a Web analytics system, or a QA testing tool without too much explanation. The problems that these tools solve are well-defined and have been around for many years.

Quickly thinking about these types of problems makes it clear, however, that the kind of data needed in a technical operations environment is substantially different than that which is needed at the Business Operations level. Here, the devil is in the details; at Business Operations, the devil is in the patterns and trends.

What are you trying to measure?

The short answer is that a Technical Operations team is trying to measure everything. More data is better data at this level. The key is the ability to correlate multiple sources of system inputs (Web performance data, systems data, network data, traffic data, database queries, etc.) to detect the patterns of behavior which could indicate impending crises or complete system outage, or simply a slower than expected response time during peak business hours.
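
As a rough sketch of what correlating multiple inputs can look like, the Python below computes a Pearson correlation between a response time series and two system series. The series names and values are hypothetical, invented purely for illustration; a real team would pull these from its own monitoring systems.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical samples taken over the same six intervals (invented values).
response_time_s = [0.8, 0.9, 1.1, 1.6, 2.4, 3.9]
cpu_load_pct = [35, 40, 48, 63, 81, 96]
network_latency_ms = [21, 19, 22, 20, 21, 20]

print("response time vs CPU load:",
      round(pearson(response_time_s, cpu_load_pct), 2))
print("response time vs network latency:",
      round(pearson(response_time_s, network_latency_ms), 2))
```

A strong correlation only says where to look first. It does not prove that one series is causing the other.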

And while Technical Operations teams thrive on data, they do not thrive on explaining this data very well to others. So the metrics which are important in one organization may not be the key ones in another. Or they may be called by a completely different name. Which is why Technical Operations sigh and throw up their hands in despair when talking to management who are working from Business Operations data.

How do you measure it?

Measure early. Measure often.

This sums up the philosophy of most Technical Operations teams. They want to gather as much data as possible. So much data that the gathering of this data is often one step away from affecting the performance of their own systems. This is how the scientific mind works. So, be prepared to balance this urge to measure and instrument everything against the need to ensure that the system is operationally sound.

Summary

Even in the well-developed area of Technical Operations, there is still opportunity to ensure that you are measuring the right things the right way. Do an audit of your measurements. Ask the question “why do we measure this this way?”.
Measure meaningful things in a meaningful way.

Why Web Measurements? Part III: Business Operations

2008-12-05 / spierzchala / 1 Comment

In the Customer Generation and Customer Retention articles of this series, the focus was on Web performance measurements designed to serve an audience outside of your organization. Starting with Business Operations, the focus shifts toward the use of Web performance measurements inside your organization.

Why Business Operations?

When I was initially developing these ideas with my colleague Jean Campbell, the idea was to call this section Reporting and Quality of Service. What we found was that this didn’t completely encompass all of the ideas that fall under these measurements. The question became: which part of the organization do reporting and QoS measurements serve?

What was clear was that these were the metrics that reported on the health of the Web service to management and the company as a whole. This was the measurement data that the line of business tied to revenue and analytics data to get a true picture of the health of the online business.

What are you measuring?

Measurements for business operations need to capture the key metrics that are critical for making informed business decisions.

How do we compare to our competitors?
Are we close to breaching our SLAs?
Are the third-parties we use close to breaching their SLAs?
What parts of the site affect performance / user experience the most so we can set priorities?
How does Web performance correlate with all the other data we use in our online business?

Every company will use different measures to capture this information, and correlate the data in different ways. The key is that you do use it to understand how Web performance ties into the line of business.

How often do I look at it?

Well, honestly, most people who work in business operations only need to examine Web performance once a day in a summary business KPI report (your company has a useful daily KPI report that everyone understands and uses, right?), and in greater detail at weekly and monthly management meetings.

The goal of the people examining business operations data is not to solve the technical problems that are being encountered, but to understand how the performance of their site affects the general business health of the company, and how it plays in the competitive marketplace.

What metrics do I need?

Business operations teams need to understand

End-to-end response time for measured business processes
Page-level response times for measured business processes
Success rate of the transaction during the measurement period
How third-parties are affecting performance
How Web analytics and Web performance relate
How different regions are affected by performance
How performance looks from the customer ISPs and desktops
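
A couple of the items in this list, success rate and end-to-end response time by region, can be rolled up from raw measurements with very little code. The Python below is a minimal sketch under stated assumptions: the record layout, region names, and timings are all hypothetical, and taking the median of completed transactions is just one reasonable summary choice.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (region, transaction succeeded, page times in seconds).
# All values are invented for illustration.
measurements = [
    ("us-east", True, [0.9, 1.4, 2.1]),
    ("us-east", True, [1.0, 1.6, 2.0]),
    ("us-east", False, [1.1, 9.5]),
    ("europe", True, [1.3, 1.9, 2.6]),
    ("europe", True, [1.2, 2.0, 2.4]),
]


def summarize(records):
    """Per-region success rate and median end-to-end time of completed runs."""
    by_region = defaultdict(list)
    for region, succeeded, page_times in records:
        # End-to-end time is the sum of the page-level times in the process.
        by_region[region].append((succeeded, sum(page_times)))

    summary = {}
    for region, rows in by_region.items():
        completed = [total for succeeded, total in rows if succeeded]
        summary[region] = {
            "success_rate": len(completed) / len(rows),
            "median_end_to_end_s": median(completed) if completed else None,
        }
    return summary


for region, stats in summarize(measurements).items():
    print(region, stats)
```

The output is deliberately small: a handful of numbers per region that can sit next to revenue and analytics figures in a daily KPI report.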

Detailed technical data is lost on these people. It is their role to take all of the data they have, and present a picture of the application as it affects the business, and discuss challenges that they face at a technical level in terms of how they affect the business.

Summary

For people who work at an extremely detailed level with Web measurement data (the topic for the next part of this series), Business Operations metrics seem light, fluffy, and often meaningless. But these metrics serve a distinct audience: the people who run the company. Frankly, if the senior business leaders at an organization are worried on a daily basis about the minute technical details that go into troubleshooting and diagnosing performance issues, I would be concerned.
The objective of Business Operations measurements is to convey the health of the Web systems that support the business, and correlate that health with other KPIs used by the management team.

Why Web Measurements? Part II: Customer Retention

2008-12-02 / spierzchala / 1 Comment

In the first part of this series, using Web performance measurements to generate new customers was the topic. This article focuses on using the same data to keep the customers you have, and make them believe in the value of your service.

Proving the Point

Getting a customer is the exciting and glamorous work. Resources are often drawn from far and wide in an organization to win over a prospect and make them a customer.

Once the deal is done, the day-to-day business of making the customer believe that they are getting what they paid for is the job of the ongoing benchmarking measurements. CDNs and third-party services need to prove that they are delivering the goods, and this can only be done by an agreed-upon measurement metric.

Some people leap right into an SLA / SLO discussion. As a Web performance professional, I can tell you that there are few SLAs that are effective, and even fewer that are enforceable.

Start with what you can prove. Was the performance that was shown to me during the pre-sales process a fluke, or does it represent the true level of service that I am getting for my money?

Measure Often and Everywhere

The Web performance world has become addicted to the relatively clean and predictable measurements that originate from high-quality backbone measurement locations. This perspective can provide a slightly unrealistic view of the Web world.

How many times have you heard from the people around you about site X (maybe this is your site) behaving badly or unpredictably from home connections? Why, when you examine the Web performance data from the backbone, doesn’t this show up?

Web connections to the home are unpredictable, unregulated, and have no QoS target. It is definitely best effort. This is especially true in the US, where there is no incentive (some would say that there is a barrier) to delivering the best quality performance to the home. But that is where the money is.

As a service provider, you better be willing to show that your service is able to surmount the obstacles and deliver Web performance advantages at the Last Mile and the Backbone.

Don’t ever base SLAs on Last Mile data – this is Web performance insanity. But be ready to prove that you can deliver high quality performance everywhere.

Show me the data

As a customer of your service, I expect you to show me the measurements that you are collecting. I expect you to be honest with me when you encounter a problem. I do not want to hear/see your finger-pointing, especially when you try to push the blame for any performance issues back to me.

As a service provider, you live and die by the Web performance data. And if you see something in the data, not related to your business, but that could make my site faster and better, tell me about it.

Remember that partnership you sold me on during the Customer Generation phase? Show it to me now. If you help me get better, this will be added to the plus column on the decision chart at renewal time, when your competitor comes knocking on my door with a lower price and Web performance data that shows how much you suck.

Shit Happens. Fess up.

The beauty of Web performance measurement is that your customers can replicate exactly the same measurements that you run on their behalf. And, they may actually measure things that you hadn’t thought about.

And sure as shooting, they will show up at a meeting with your team one day with data that shows that your service FUBAR’d on a massive scale.

It’s the Internet. Bad shit happens on the Internet. I’ve seen it.
If you can show them that you know about the problem, explain what caused it, how you resolved it, and how you are working to prevent it, good.

Better: Call them when the shit happens. Let them know that you know about the problem and that you have a crack team of Web performance commandos deployed worldwide to resolve the problem in non-relativistic time. Blog it. Tweet it. Put a big ol’ email in their inbox. Call your primary contact, and your secondary contact, and your tertiary contact.

Fess up. You can only hide so much before your customers start talking. And the last thing you want prospects seeing is your existing customers talking smack about your service.

Summary

Web performance measurement doesn’t go away the second you close the deal. In fact, the process has only just begun. It is a crazy, competitive world out there. Be prepared to show that you’re the best and that you aren’t perfect every single day.

GrabPERF: What and Why

2008-12-01 / spierzchala / 0 Comments

Why GrabPERF?

About four years ago, I had a bright idea that I would like to learn more about how to build and scale a small Web performance measurement platform. I’ve worked in the Web performance industry for nearly a decade now, and this was an experimental platform for me to examine and encounter many of the challenges that I see on a daily basis.

The effort was so successful and garnered enough attention during the initial blogging boom that I was able to sell the whole platform for a tiny (that is not a typo) sum to Technorati.

The name is taken from another experimental tool I wrote called GrabIT2 which uses the PHP cURL libraries to capture timings and HTML data for HTTP requests. It is an extension of my articles and writings on Web performance that started at Webperformance.org, and that have since moved to this blog.

What is GrabPERF?

GrabPERF is a multi-location measurement platform, based on PERL, cURL, PHP, and MySQL that is designed to

Measure the base HTML or a single-object target using HTTP or HTTPS
Report the data to a central database (located in the San Francisco Area)
Report the data using a GUI or through text based download

Why not Full Pages with all Objects?

Reason 1: I work for a company that already does that. Lawyers and MBAs among you, do the math.

Reason 2: I am an analyst, not a programmer. The best I can say about my measurement script is that it is a hack job.

Why is the GrabPERF interface so clunky?

See reason 2 above.

If you want to write your own interface to the data, let me know.

Why has the interface not changed in nearly three years?

The current interface works. It’s simple, clean, and delivers the data that I and the regular users need to analyze performance issues. If there is something more that you would like to see, let me know!

I like what I see. How can I host a measurement location?

Just contact me, and I can provide you with a list of PERL modules you will need to install on your Linux server. In return, I need a static IP address of the machine hosting the measurement agent.

How stable is GrabPERF?

Most of the time, I forget it’s even running. I have logged onto the servers and typed in uptime and discovered that it’s been 6 months or more since the servers have been re-booted.

It was designed to be simple, because that’s all I know how to do. The lack of complexity makes it effectively self-managing.

Shouldn’t all systems be that way?

What if my question isn’t asked / answered here?

You should know the answer to this by now: contact me.

Why Web Measurements? Part I: Customer Generation

2008-12-01 / spierzchala / 5 Comments

Introduction to the Series

This is the first of a four-part series focusing on the reasons why companies measure their Web performance. This perspective is substantially different than ones posited by others in the field, as it focuses on the meat and potatoes reasons, rather than the sometimes more difficult to imagine future effects that measurement will bring.

Reason One: Customer Generation

It is critical that a company be able to show that their Web services are superior to others, especially in the third-party services and delivery sectors of the Web. In this area, Web performance measurement is key to demonstrating the value and advantage of a service versus the option of self-delivering or using another competitor’s service.

Comparative benchmarking that clearly demonstrates the performance of each of the competitive services in the geographic regions that are of greatest interest to the prospect is key to these Web performance measurements. To achieve truly competitive benchmarks and prove the value of a service, measurements must be realistic and flexible.

In the CDN field, a one-object-fits-all approach is no longer valid. CDNs are responsible for delivering not just images or static objects, but may also host an entire application on their edge servers, serving both HTTP and HTTPS content. In other cases, the application may not be hosted at the edge, but the edge server may act as a proxy for the application, using advanced routing algorithms to deliver the visitor the requested dynamic content more quickly (in theory) than the origin server alone.

This complex range of services means that a CDN has to be willing to demonstrate effective and efficient service delivery before the sale is complete. A CDN has to be willing to expose their system not just to the backbone-based measurements offered in a traditional customer generation process, but to take measurements from the real-user perspective.

Ad-providers have to be willing to show that their service does not affect the overall performance of the site they are trying to place their content on. Web analytics firms face the same challenge. Web analytics firms have one advantage: if their object doesn’t load properly, it may not affect the visitor experience. However, neither ad-providers nor Web-analytics providers can hide from Web measurement collection methods that show all of the bling and the blemishes.
Using Web performance measurements to generate customers is a way that a firm can clearly show that they have faith enough in their service to openly compare it to other providers and to the status quo.

But woe be the firm who uses Web performance metrics in a way that tries to show only their good side. Prospects become former prospects very quickly if a firm using Web performance data to generate new business is found to be gaming the system to their advantage. And it will happen.

Customer Generation is a key way in which firms use Web performance measurements to show clearly how their service is superior to what a prospect currently has, or is also considering. However, this method does come with substantial caveats, including:

The need to measure what is relevant
The need to measure from where the prospect has the greatest interest
The need to consider that gaming the system to show advantage will cost a firm in the end.

Black Friday 2008: The pain, the horror, the suffering

2008-11-29 / spierzchala / 2 Comments

The GrabPERF Black Friday Dashboard is done for another year, and there were two sites whose Web performance suffered the most at the hands of the onslaught of bargain-hunters.

Some caveats that I need to mention about the GrabPERF measurement methodology.

Only the base HTML file of each site is measured.
Only the homepage of each site is measured. This means that any issues that arose in the shopping process were not captured.

All of the sites in the GrabPERF Holiday Retail Measurement Index can be continually monitored on the GrabPERF Black Friday Dashboard. This page will be available until January 1, 2009.

That said, the two primary performance victims this year are HP Shopping and Sears. We focus here on those that did not do that well because sites that have met the Web performance challenge and survived to fight another year are not as interesting from a learning perspective.

HP Shopping

HP suffered the greatest response time problems, effectively becoming unresponsive as of 09:00 EST. The greatest effect on overall response time came from the First Byte time metric, which is a solid proxy for measuring the server or application load, as it is the time between the initial client HTTP request and the server’s HTTP response.
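
For readers who want to see what a First Byte style measurement looks like, here is a minimal Python sketch. It is not the GrabPERF agent (which is built on PERL and cURL); the function name and the returned field names are hypothetical. It opens the connection first, then times how long the server takes to start answering a single GET, and how long the whole object takes to download. The moment `getresponse()` returns, once the status line and headers have been read, is used here as an approximation of the first byte.

```python
import http.client
import time
from urllib.parse import urlsplit


def measure_base_html(url, timeout=10.0):
    """Time one GET of a single object; timings are returned in seconds."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port,
                                           timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port,
                                          timeout=timeout)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    try:
        # Connect before starting the clock so that connection setup
        # is not counted as part of the first byte time.
        conn.connect()
        start = time.perf_counter()
        conn.request("GET", path)
        # Returns once the status line and headers have arrived.
        response = conn.getresponse()
        first_byte_s = time.perf_counter() - start
        body = response.read()
        total_s = time.perf_counter() - start
    finally:
        conn.close()
    return {
        "status": response.status,
        "first_byte_s": first_byte_s,
        "total_s": total_s,
        "bytes": len(body),
    }
```

Calling `measure_base_html()` with a homepage URL on a schedule, and storing the results, gives the kind of base HTML time series discussed here. A first byte time that climbs while the download portion stays flat points at the server or application rather than the network.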

Factored into the poor performance analysis is the fact that GrabPERF only captures data for the base HTML object. If the performance seen here is carried over to the download of all of the graphical content on the page, I would be surprised if anyone was able to make any kind of purchases on the HP web site on Black Friday.

Today, response times have returned to substantially lower levels, indicating that this application was simply not ready for the amount of traffic it received, or ran into a completely unexpected issue when the load increased.

Recommendation for 2009: Load Test the application using this year’s traffic metrics as a baseline for validating the scalability of the application.

Sears

Sears is a returning visitor from last year’s Black Friday measurements. Unfortunately, they return for exactly the same reason that they were on last year – scaling/capacity issues that appear as errors.

And these are the worst kind of errors. As can be seen in the graphic below, the Sears Web site announced to the whole world that they had over-reached and that they could not handle the incoming volume of traffic.

What is interesting is that Sears owns properties that survived the day very well, namely Lands End. The question that must be posed is why does the parent site fail so badly when the child sites handle the traffic without difficulty?

Recommendation for 2009: Load testing for capacity, and meeting with the Lands End team to understand what they are doing to handle the load.

Web Performance: Nice Display. Now Show Me the Data.

2008-10-16 / spierzchala / 0 Comments

Today’s Web interfaces are all about the Flash (literally). Smooth charting, cool effects, callouts to references — ways to try and simplify complex data collections.
Problem-solving and diagnosis require a far deeper dive than the flashiest interface could ever provide, because it comes down to the numbers. The actual measurements that make up the flashy chart. If you look at the data used by a professional trader and someone at home looking at stock charts, there is a substantial difference.

When you get down to that level of analysis, the interface becomes irrelevant. Any analyst worth her or his salary (or salt – same thing) can tell you more from a spreadsheet full of relevant numbers than they can from any pretty graphic. This is true in any field.

When do traders or Web performance analysts use pretty charts? When they have to explain complex issues to non-technical or non-specialist audiences. When these analysts work on solving the sticky problems faced in the everyday world, they always fall back on the numbers.

Web performance data consists of the same few components, regardless of which company is providing the data. In effect, beyond a few key pieces of information about how the measurement data is captured, all Web performance data is the same.

Just because the components that make up the data are the same does not guarantee that the data from two different providers is of the same quality. In an imaginary system, Web performance data from all the major providers could flow into a centralized repository and be transformed using an XSLT or some other mangler so that it would be impossible in most cases to tell which firm was the source.

But a skilled analyst would quickly learn to recognize the data that can be trusted. That would be the data that quickly and accurately represented the issues he was trying to diagnose. The data that flowed with the known patterns of the Web site.

The data that helped him do his job more effectively.

In the end, a pretty interface can go a long way to hide the quality of the data that is being represented. A shiny gloss on poor data does not make it better data. It is critical that the data that underlies that pretty chart is able to live up to the quality demands of the people who use it every day.

Selling the interface is selling the brand. Trust in the data builds the reputation.
Which one sold you when you chose your Web performance measurement provider?

Web Performance: The Strength of Corporate Silos

2008-10-16 / spierzchala / 1 Comment

When I meet with clients, I am always astounded by the strength of the silos that exist inside companies. Business, Marketing, IT, Server ops, Development, Network ops, Finance. In the same house, sniping and plotting to ensure that their team has the most power.

Or so it seems to the outsider.

Organizations are all fighting over the same limited pool of resources. Also, the organization of the modern corporation is devised to create this division, with an emphasis on departments and divisions over teams with shared goals. But even the Utopian world of the cross-functional team is a false dream, as the teams begin to fight amongst themselves for the same meager resources at a project, rather than a department level.

I have no solution for this rather amusing situation. Why is it amusing? As an outsider (at my clients and in my own company) I look upon these running battles as a sign of an organization that has lost its way. Where the need to be managed and controlled has overcome the need to create and accept responsibility.

Start-ups are the villages of the corporate world. Cooperation is high, justice is swift, and creative local solutions abound. Large companies are the Rio de Janeiros of the economy. Communication is so broken that companies have to run private phone exchanges to other offices. Interesting things have to be accomplished in the back-channel.

This has a severe effect on Web performance initiatives. Each group is constantly battling to maintain control over its piece of the system, and ensure that its need for resources is fulfilled. That means one group wants to test K while another wants to measure Q and yet a third needs to capture data on E.

This leads to a substantial amount of duplication and waste when it comes to solving problems and moving the Web site forward. There is no easy answer for this. I have discussed the need for business and IT to find some level of understanding in previous posts, and have yet to find a company that is able to break down the silos without reducing the control that the organization imposes.

The Dog and The Toolbox: Using Web Performance Services Effectively

2008-09-29 / spierzchala / 0 Comments

The Dog and The Toolbox

One day, a dog stumbled upon a toolbox left on the floor. There was a note on it, left by his master, which he couldn’t read. He was only a dog, after all.

He sniffed it. It wasn’t food. It wasn’t a new chew toy. So, being a good dog, he walked off and lay on his mat, and had a nap.

When the master returned home that night, the dog was happy and excited to see him. He greeted his master with joy, and brought along his favorite toy to play with.
He was greeted with yelling and anger and “bad dog”. He was confused. What had he done to displease his master? Why did the master keep yelling at him, and pointing at the toolbox? He had been good and left it alone. He knew that it wasn’t his.

With his limited understanding of human language, he heard the words “fix”, “dishwasher”, and “bad dog”. He knew that the dishwasher was the yummy cupboard that all of the dinner plates went into, and came out less yummy and smelling funny.

He also knew that the cupboard had made a very loud sound that had scared the dog two nights ago, and then had spilled yucky water on the floor. He had barked to wake his master, who came down, yelling at the dog, then yelling at the machine.
But what did fix mean? And why was the master pointing at the toolbox?

The Toolbox and Web Performance

It is far too often that I encounter companies that have purchased a Web performance service that they believe will fix their problems. They then pass the day-to-day management of this information on to a team that is already overwhelmed with data.

What is this team supposed to do with this data? What does it mean? Who is going to use it? Does it make my life easier?

When it comes time to renew the Web performance services, the company feels cheated. And they end up yelling at the service company who sold them this useless thing, or their own internal staff for not using this tool.

To an overwhelmed IT team, Web performance tools are another toolbox on the floor. They know it’s there. It’s interesting. It might be useful. But it makes no sense to them, and is not part of what they do.

Giving your dog the toolbox does not fix your dishwasher. Giving an IT team yet another tool does not improve the performance of a Web site.

Only in the hands of a skilled and trained team does the Web performance of a site improve, or the dishwasher get fixed. As I have said before, a tool is just a tool. The question that all organizations must face is what they want from their Web performance services.

Has your organization set a Web performance goal? How do you plan to achieve your goals? How will you measure success? Does everyone understand what the goal is?

After you know the answers to those questions, you will know that, as amazing as he is, your dog will never be able to fix your dishwasher.

But now you know who can.