May 6, 2014

About Throttling

Use throttling in business services to protect the OSB from stuck threads.

Throttling allows us to limit the number of outgoing requests currently in progress.

At first, I thought it would only be useful for protecting shaky downstream services, which crash when hit with too many concurrent requests. Throttling is indeed useful for that. But not only for that.

It is also useful to protect the OSB itself from too many stuck threads.

A Slow Service DoS’es Its WM Group

Imagine a set of business services with the same work manager assigned (let’s call it WM1). The maximum capacity of the work manager is large enough that the different business services do not step on each other’s toes:

WM1 (maximum capacity=15): 
    Biz1: 3-4 threads reading responses
    Biz2: 3-4 threads reading responses
    Biz3: 3-4 threads reading responses
    3-6 threads are available

All’s good.


But suddenly one of the services (say Biz1) begins to experience slow responses, and not just slow, but slow and fragmented responses. Large responses arrive in segments, each with a noticeable delay after the previous one, but still well within the read timeout value.

As a result, the threads reading the responses from Biz1 become occupied for a long, long time. Much longer than the read timeout. (I have described why this happens here).

The requests continue to come into Biz1. After some time, all 15 threads in WM1 are busy reading responses from the slow service:

WM1 (maximum capacity=15): 
    Biz1: (15 threads are reading responses)
    Biz2: (no threads available)
    Biz3: (no threads available)

There are no threads left for Biz2 and Biz3, and they experience a DoS.
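The starvation above can be reproduced with a plain fixed-size thread pool standing in for WM1. This is an illustrative JDK sketch, not the OSB or WebLogic API; the class name, method name, and numbers are mine:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of the DoS scenario: a shared pool of 15 threads (standing in for
// WM1) is exhausted by Biz1's slow reads, leaving nothing for Biz2 and Biz3.
public class StarvationDemo {

    // Returns true if a healthy Biz2 request obtains a thread within
    // waitMillis, false if it is starved by Biz1's stuck reads.
    public static boolean biz2GetsAThread(long waitMillis) throws Exception {
        ExecutorService wm1 = Executors.newFixedThreadPool(15);
        CountDownLatch slowBackend = new CountDownLatch(1);
        try {
            // Biz1: 15 "reads" that block until the slow backend finally answers.
            for (int i = 0; i < 15; i++) {
                wm1.submit(() -> {
                    slowBackend.await();
                    return null;
                });
            }
            // Biz2: a perfectly healthy request that now cannot get a thread.
            Future<?> biz2 = wm1.submit(() -> { });
            try {
                biz2.get(waitMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException starved) {
                return false;
            }
        } finally {
            slowBackend.countDown(); // release the stuck "reads"
            wm1.shutdownNow();
        }
    }
}
```

With all 15 pool threads parked on the latch, the Biz2 task sits in the queue and times out, which is exactly the DoS the diagram shows.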

Separate WMs? Geez, No!

One obvious solution is to assign each service its own work manager. This is not a perfect solution though.

First, for ESBs with hundreds of services, we'd end up with hundreds of work managers, hardly a manageable setup.

Second, and IMHO more important, every work manager has a safety margin on top of its forecasted capacity (to handle short spikes in traffic). For example, if we expect 3 concurrent requests to go via our service on average, we'd probably allocate 4 or 5 threads to its work manager, because 3 is an average: sometimes it is 2, sometimes it is 4.

When one work manager is assigned to multiple services, highs in the traffic for one service are compensated by lows in the traffic for the other services, and so the margin is hedged. When, though, we assign a safety margin to each service separately (via a separate work manager), the margins only add up, and can potentially crash the domain if traffic spikes hit many services at the same time.
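The back-of-the-envelope arithmetic looks like this. All the traffic numbers below are hypothetical, chosen only to illustrate how per-service margins add up while a pooled margin does not:

```java
// Illustrative arithmetic only; the traffic figures are made up.
public class MarginMath {
    static final int SERVICES = 3;
    static final int AVG = 3;   // average concurrent requests per service
    static final int SPIKE = 3; // worst single-service spike above average

    // Separate work managers: every service must carry its own full
    // spike margin, because no one else can absorb its highs.
    static int separateTotal() {
        return SERVICES * (AVG + SPIKE); // 3 * (3 + 3) = 18 threads
    }

    // One shared work manager: spikes rarely coincide, so a margin
    // sized for, say, two simultaneous spikes covers the whole group.
    static int sharedTotal() {
        return SERVICES * AVG + 2 * SPIKE; // 9 + 6 = 15 threads
    }
}
```

Even in this tiny example the separate-WM setup needs 18 threads to give each service the spike tolerance the shared 15-thread pool already provides; with hundreds of services the gap grows much larger.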

Throttling as a Second Line of Defence

Let’s leave our services under the same work manager, but instead assign a throttling value to each of them:

WM1 (maximum capacity=15): 
    Biz1: throttling(5)
    Biz2: throttling(5)
    Biz3: throttling(5)

Just like the work manager’s max capacity, the throttling value is based on the calculated traffic plus some safety margin. Together with the work manager, the throttling makes our services much more resilient.

Let’s review the same scenario as before. A service becomes slow, and business threads are checked out from the work manager to do the reading. But as soon as the throttling value is reached, no new threads are allocated. Instead, when yet another request comes in, a fault is raised and the thread goes right back into the pool:

WM1 (maximum capacity=15): 
    Biz1: 5 threads reading responses, new requests are rejected immediately
    Biz2: 3-4 threads reading responses
    Biz3: 3-4 threads reading responses
    2-4 threads are available

While our slow service still experiences a DoS, the rest of the services in the group continue to perform as expected, which is a huge improvement over the previous situation.

Hardening the OSB

We have stress-tested a few interesting corner cases and found that throttling helps in quite a few of them.

These results made us update our design guidelines to require that throttling be configured for all business services as a hardening measure.

Vladimir Dyuzhev, author of GenericParallel

About Me

My name is Vladimir Dyuzhev, and I'm the author of GenericParallel, an OSB proxy service for making parallel calls effortlessly and MockMotor, a powerful mock server.

I've been building enterprise SOA systems for clients large and small for almost 20 years. Most of that time I've been working with the BEA (later Oracle) WebLogic platform, including OSB and other SOA products.

Feel free to contact me if you have a SOA project to design and implement. See my profile on LinkedIn.

I live in Toronto, Ontario, Canada. Email me at info@genericparallel.com