Tackling complexity in the heart of Spring Cloud Feign errors

Feign, Hystrix, Ribbon and Eureka are great tools, all nicely packed in Spring Cloud, allowing us to achieve great resilience in our massively distributed applications with great ease! This is true, at least for the easy part... To be honest, it is easier to get all the great resilience patterns working together with those tools than without, but making everything work as intended takes some studying, time and testing.

Unfortunately (or not), I'm not going to explain how to set all this up here; I'll just point out some tricks for error management with those tools. I chose this topic because I’ve struggled a lot with it (really)!

If you are looking for a getting started tutorial on those tools I recommend the following articles:

There will be code in this article, but not that much; you can find the missing parts in this repository

Dependencies

Let's say, after some trouble, you ended up with a dependency set looking like this one:

<dependency>
  <groupId>org.springframework.cloud</groupId>
  <artifactId>spring-cloud-starter-eureka</artifactId>
</dependency>

<dependency>
  <groupId>org.springframework.cloud</groupId>
  <artifactId>spring-cloud-starter-hystrix</artifactId>
</dependency>

<dependency>
  <groupId>org.springframework.cloud</groupId>
  <artifactId>spring-cloud-starter-feign</artifactId>
</dependency>

<dependency>
  <groupId>org.springframework.cloud</groupId>
  <artifactId>spring-cloud-starter-ribbon</artifactId>
</dependency>

<dependency>
  <groupId>org.springframework.retry</groupId>
  <artifactId>spring-retry</artifactId>
</dependency>

Ok, so you are aiming at the full set:

  • Of course, you are going to use Eureka client to get your service instances from your Eureka server
  • So Ribbon can provide a proper client-side load-balancer using service names and not URLs (and decorate RestTemplate to use names and load-balancing)
  • Then comes Hystrix with lots of built-in anti-fragile patterns, another awesome tool but you need to keep an eye on it (not part of this article...)
  • Finally, everything is packed up by Feign for really easy-to-write rest clients

This article uses the following versions of Spring Cloud:

<parent>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-parent</artifactId>
  <version>1.5.13.RELEASE</version>
</parent>

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.springframework.cloud</groupId>
      <artifactId>spring-cloud-dependencies</artifactId>
      <version>Edgware.SR3</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

Configuration

These tools need configuration. Let's assume you have configured something similar in your application.yml:

spring:
  application:
    name: my-awesome-app

eureka:
  client:
    serviceUrl:
      defaultZone: http://my-eureka-instance:port/eureka/

feign:
  hystrix:
    enabled: true

hystrix:
  threadpool:
    default:
      coreSize: 15
  command:
    default:
      execution:
        isolation:
          strategy: THREAD
          thread:
            timeoutInMilliseconds: 2000

ribbon:
  ReadTimeout: 400
  ConnectTimeout: 100
  OkToRetryOnAllOperations: true
  MaxAutoRetries: 1
  MaxAutoRetriesNextServer: 1

This configuration will work if your application can register to Eureka using its hostname and application port. For production / cloud / any environment with proxies you need to have additional properties:

  • eureka.instance.hostname with the real hostname to use to reach your service
  • eureka.instance.nonSecurePort with the non-secure-port to use or eureka.instance.securePort with eureka.instance.securePortEnabled=true

Also, this configuration isn't authenticated; it can be a good idea to add authentication to Eureka, depending on your network.

From the Ribbon configuration I see you have confidence in your Web Services: 400ms for a ReadTimeout is quite short, and the shorter the better!

We can also notice that all your services are idempotent, because you accept having 4 calls instead of 1 if your network or servers start to get messy. Yes, this Ribbon configuration can make up to 4 requests if the response times out, because it is actually doing (1 + MaxAutoRetries) x (1 + MaxAutoRetriesNextServer) = 4. So if you set 2 and 3 respectively, you can have up to 12 requests from Ribbon alone.

This brings us to the 2000ms Hystrix timeout. With a shorter value, Hystrix would give up while Ribbon is still sending requests, so calls would be made without the application waiting for their result. 2000ms matches the worst case of the Ribbon configuration: (400 + 100) * 4.
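The arithmetic above can be checked with a few lines of plain Java. This is only a sketch: the class and method names (RibbonRetryBudget, maxAttempts, worstCaseMillis) are made up for this article, and it assumes each attempt costs at most one connect timeout plus one read timeout.

```java
// Sketch only: hypothetical helper, not part of Ribbon or Hystrix.
final class RibbonRetryBudget {

  private RibbonRetryBudget() {
  }

  // (1 + MaxAutoRetries) x (1 + MaxAutoRetriesNextServer)
  static int maxAttempts(int maxAutoRetries, int maxAutoRetriesNextServer) {
    return (1 + maxAutoRetries) * (1 + maxAutoRetriesNextServer);
  }

  // Assumption: an attempt costs at most ConnectTimeout + ReadTimeout.
  static long worstCaseMillis(int connectTimeout, int readTimeout,
      int maxAutoRetries, int maxAutoRetriesNextServer) {
    return (long) (connectTimeout + readTimeout)
        * maxAttempts(maxAutoRetries, maxAutoRetriesNextServer);
  }

  // True when the Hystrix timeout leaves Ribbon enough time for all its attempts.
  static boolean hystrixCoversRibbon(long hystrixTimeoutMillis, int connectTimeout,
      int readTimeout, int maxAutoRetries, int maxAutoRetriesNextServer) {
    return hystrixTimeoutMillis
        >= worstCaseMillis(connectTimeout, readTimeout, maxAutoRetries, maxAutoRetriesNextServer);
  }
}
```

With the values from the configuration, maxAttempts(1, 1) gives 4 and worstCaseMillis(100, 400, 1, 1) gives 2000, so a 2000ms Hystrix timeout is the minimum that covers Ribbon under this assumption.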

Customization

Everything goes well, and you quickly understand that, for all FeignClients without a fallback, you only get a HystrixRuntimeException for any error. This exception mainly says that something went wrong and that you don't have a fallback, but its cause can tell you a little bit more. You quickly build an ExceptionHandler to display nicer messages to users (because you don't want to put fallbacks on every FeignClient).
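As a rough illustration of digging into that cause, here is a JDK-only sketch that walks the cause chain of any exception and maps the root cause to a message. The class name ErrorCauses, its methods and the messages are all hypothetical; in a real application something like this could be called from your ExceptionHandler with the HystrixRuntimeException, and the exception types worth checking depend on your HTTP client.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;

// Sketch only: hypothetical helper to get a friendlier message out of a wrapped error.
final class ErrorCauses {

  private ErrorCauses() {
  }

  // Follows getCause() until the deepest exception is reached.
  static Throwable rootCause(Throwable error) {
    Throwable current = error;
    // Stop on a self-referencing cause to avoid looping forever.
    while (current.getCause() != null && current.getCause() != current) {
      current = current.getCause();
    }
    return current;
  }

  // Assumption: these two exception types are the ones we want to explain to users.
  static String userMessage(Throwable error) {
    Throwable root = rootCause(error);
    if (root instanceof SocketTimeoutException) {
      return "The remote service took too long to answer";
    }
    if (root instanceof ConnectException) {
      return "The remote service is unreachable";
    }
    return "Unexpected error while calling the remote service";
  }
}
```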

One day you call a new external service and this service can have normal responses with HTTP 404 for some resources, so you add decode404 = true to your @FeignClient to get a response and avoid circuit breaking on those (if this option is not set, a 404 will be counted for circuit breaking). But you don't get responses, what you get is:

...
Caused by: feign.codec.DecodeException: Could not extract response: no suitable HttpMessageConverter found for response type [class ...
...

This is because the 404 from this service has a different form than "normal" responses (can be a simple String saying that the resource wasn't found). A cool idea here would be to allow Optional<?> and ResponseEntity<?> types in FeignClient to get an empty body for those 404s.

Auto-configured Spring Cloud Feign can map to ResponseEntity<?> but will fail to deserialize incompatible objects. It cannot, by default, put results in Optional<?>, so this is still a cool feature to implement.

One way to achieve this is to define a Decoder similar to this:

package fr.ippon.feign;

import java.io.IOException;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.Optional;

import org.springframework.http.ResponseEntity;
import org.springframework.util.Assert;

import feign.FeignException;
import feign.Response;
import feign.Util;
import feign.codec.DecodeException;
import feign.codec.Decoder;

public class NotFoundAwareDecoder implements Decoder {

  private final Decoder delegate;

  public NotFoundAwareDecoder(Decoder delegate) {
    Assert.notNull(delegate, "Can't build this decoder with a null delegated decoder");

    this.delegate = delegate;
  }

  @Override
  public Object decode(Response response, Type type) throws IOException, DecodeException, FeignException {
    if (!(type instanceof ParameterizedType)) {
      return delegate.decode(response, type);
    }

    if (isParameterizedTypeOf(type, Optional.class)) {
      return decodeOptional(response, type);
    }

    if (isParameterizedTypeOf(type, ResponseEntity.class)) {
      return decodeResponseEntity(response, type);
    }

    return delegate.decode(response, type);
  }

  private boolean isParameterizedTypeOf(Type type, Class<?> clazz) {
    ParameterizedType parameterizedType = (ParameterizedType) type;

    return parameterizedType.getRawType().equals(clazz);
  }

  private Object decodeOptional(Response response, Type type) throws IOException {
    if (response.status() == 404) {
      return Optional.empty();
    }

    Type enclosedType = Util.resolveLastTypeParameter(type, Optional.class);
    Object decodedValue = delegate.decode(response, enclosedType);

    if (decodedValue == null) {
      return Optional.empty();
    }

    return Optional.of(decodedValue);
  }

  private Object decodeResponseEntity(Response response, Type type) throws IOException {
    if (response.status() == 404) {
      return ResponseEntity.notFound().build();
    }

    return delegate.decode(response, type);
  }
}

Then, a @Configuration file:

package fr.ippon.feign;

import org.springframework.beans.factory.ObjectFactory;
import org.springframework.boot.autoconfigure.web.HttpMessageConverters;
import org.springframework.cloud.client.circuitbreaker.EnableCircuitBreaker;
import org.springframework.cloud.client.discovery.EnableDiscoveryClient;
import org.springframework.cloud.netflix.feign.EnableFeignClients;
import org.springframework.cloud.netflix.feign.support.ResponseEntityDecoder;
import org.springframework.cloud.netflix.feign.support.SpringDecoder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;

import feign.codec.Decoder;

@Configuration
@EnableCircuitBreaker
@EnableDiscoveryClient
public class FeignConfiguration {

  @Bean
  public Decoder notFoundAwareDecoder(ObjectFactory<HttpMessageConverters> messageConverters) {
    return new NotFoundAwareDecoder(new ResponseEntityDecoder(new SpringDecoder(messageConverters)));
  }

}

Of course it is up to you to fit it to your exact needs, but this way you will be able to get proper responses.
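To see what the decoder's type checks operate on, here is a small standalone sketch using only the JDK. It reads the generic return types of a made-up interface (SampleClient is hypothetical, as is the ReturnTypeProbe class) and applies the same ParameterizedType test as isParameterizedTypeOf. It does not involve Feign at all; it only shows why Optional<String> and String take different branches.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.Optional;

// Sketch only: shows how generic return types look through reflection.
final class ReturnTypeProbe {

  // Hypothetical client interface, used only to obtain generic return types.
  interface SampleClient {
    Optional<String> findName();

    String getName();
  }

  private ReturnTypeProbe() {
  }

  static Type returnTypeOf(String methodName) {
    try {
      return SampleClient.class.getMethod(methodName).getGenericReturnType();
    } catch (NoSuchMethodException e) {
      throw new IllegalArgumentException("Unknown method: " + methodName, e);
    }
  }

  // Same idea as the decoder: a raw String is a Class, Optional<String> is a ParameterizedType.
  static boolean isOptional(Type type) {
    return type instanceof ParameterizedType
        && ((ParameterizedType) type).getRawType().equals(Optional.class);
  }

  // The type inside Optional<...>, which is what the delegate has to decode.
  static Type enclosedType(Type type) {
    return ((ParameterizedType) type).getActualTypeArguments()[0];
  }
}
```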

Integration testing

All this really cool stuff can change from one Spring Cloud minor version to another (e.g. Hystrix enabled by default to Hystrix disabled by default), so unless you never miss any update (I don't think that is possible), I strongly recommend adding good integration tests for your usage of this stack (unit tests will not be of any help here).

But having integration testing for this stack can be quite complicated. If we want to be as close as possible to reality we need:

  • A running Eureka instance.
  • A running service registered on Eureka.
  • A running client using this service.

One way to do this is to set up a dynamic test environment with Eureka and some applications but, depending on your organization, this can be really hard to achieve. Another way is to start all this in a single JVM managed by JUnit, so integration with any build tool and CI platform will be really easy.

The drawback of this can be strange behaviors due to the Spring auto-configuration mechanism; it’s up to you to choose between containers and this approach, depending on what you can do.

To achieve this we will need to solve:

  • The fact that we cannot use the native SpringTest class because it can only manage one application by default. We can work around this by using SpringApplication.run(...) and playing with the resulting ConfigurableApplicationContext.
  • The need to start on available ports. Simply add --server.port in SpringApplication.run(...) with SocketUtils.findAvailableTcpPort(); not even a problem.
  • The impossibility of using any kind of default configuration path, unless we want all our apps to get this configuration. This one is also easy: just add --spring.config.location with a specific configuration in our SpringApplication.run(...) and we can have separate configurations.
  • The need for our applications to have configurations depending on the Eureka server port. For this one we will need to ensure that Eureka is the first one to start (not needed in production, where our client can handle this very well, but it would be annoying for tests) and then give the Eureka port, one way or another, to the other applications.
  • The fact that we can't, by default, start multiple Spring Boot applications in the same JVM instance because of JMX MBean names. Let’s disable JMX using --spring.jmx.enabled=false (or change the default domain using --spring.jmx.default-domain with a different name) and we are OK.
  • Finally, a strange one: Spring Cloud tools use Archaius to manage their configuration, not the default Spring configuration system. Archaius takes the Spring Boot configuration into account when the first application starts on the JVM; for the next ones it isn't taken into account at the moment I'm writing this (check ArchaiusAutoConfiguration.configureArchaius(...): there is a static AtomicBoolean used to ensure that the configuration isn't loaded twice, and in the "else" branch there is a TODO and a warn log). For our tests we will go for an ugly fix: reloading this configuration in an ApplicationListener<ApplicationReadyEvent> will do the trick.
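The command-line part of this list can be sketched with the JDK alone. The class below (TestAppArguments, a hypothetical name) builds the arguments one would pass to SpringApplication.run(...). Two things are assumptions on my side: it uses a ServerSocket bound to port 0 as a stand-in for SocketUtils.findAvailableTcpPort(), and it hands the Eureka port to the applications by overriding eureka.client.serviceUrl.defaultZone with a localhost URL, which is only one of the possible ways.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.util.ArrayList;
import java.util.List;

// Sketch only: builds the arguments for one application started in the test JVM.
final class TestAppArguments {

  private TestAppArguments() {
  }

  // Stand-in for SocketUtils.findAvailableTcpPort(): ask the OS for any free port.
  static int findFreePort() {
    try (ServerSocket socket = new ServerSocket(0)) {
      return socket.getLocalPort();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  static String[] forApplication(int port, String configLocation, int eurekaPort) {
    List<String> arguments = new ArrayList<>();
    arguments.add("--server.port=" + port);
    // A dedicated configuration file per application.
    arguments.add("--spring.config.location=" + configLocation);
    // Avoids the JMX MBean name clash between applications in the same JVM.
    arguments.add("--spring.jmx.enabled=false");
    // Assumption: Eureka runs on localhost in the same JVM.
    arguments.add("--eureka.client.serviceUrl.defaultZone=http://localhost:" + eurekaPort + "/eureka/");
    return arguments.toArray(new String[0]);
  }
}
```

The Eureka application would be started first on findFreePort(), and that same port would then be given as eurekaPort to every other application.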

I have done this here, mainly using JUnit rules to handle the application parts; feel free to take it if you like it and adapt those tests to your needs.

At the time of this writing, the project takes ~45sec to build, which is very slow considering that most of this time is spent on integration tests for already battle-tested code... but I really don’t want to miss a breaking change in my usage of this great stack, so I consider this time to be fair enough.

If you don’t need it, remove the part testing circuit breaking on all HTTP error codes, since those tests are very slow due to the sleeping phase…

Once again, really take the time to make strong integration tests on your usage of this stack to avoid really bad surprises after some months!!!

Going further

Depending on what you want to build, what we have here can be more than enough on the application side but if you are planning to use this in the real world, you really need some good metrics and alerts (at least to keep an eye on your fallbacks and circuit breaker openings).

For this you can check Hystrix Dashboard and Turbine, which provide lots of useful metrics and dashboards like this one:

[Image: Turbine dashboard]

You will then need to bind it to your alerting system. This will need some work, and you are going to handle LOTS of data since those tools are really verbose (if you want to persist that data, pay attention to your eviction strategy and choose a solid enough time series infrastructure). Depending on your needs and your organization's tools, a simple metrics Counter on your fallbacks can do a good job. Once set up in your applications, this will only need a @Counted(...) on your fallback methods.
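If you have no metrics library at hand, the idea of a counter per fallback can be sketched with the JDK only. FallbackCounters is a hypothetical class written for this article; it is not what @Counted(...) does internally, just a minimal thread-safe equivalent that your fallback methods could call.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

// Sketch only: one thread-safe counter per fallback name.
final class FallbackCounters {

  private final ConcurrentMap<String, LongAdder> counters = new ConcurrentHashMap<>();

  // To be called at the beginning of each fallback method.
  void increment(String fallbackName) {
    counters.computeIfAbsent(fallbackName, name -> new LongAdder()).increment();
  }

  // Number of times the given fallback was used, 0 if it never was.
  long count(String fallbackName) {
    LongAdder counter = counters.get(fallbackName);
    return counter == null ? 0 : counter.sum();
  }
}
```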

It is also possible that the few tools discussed here are not antifragile enough for your needs, in that case, you can start by checking:

  • Hystrix configurations: you will see that there are plenty of things you can do (playing with the circuit breaker configuration can really help in some cases). Don't forget to add integration tests to ensure that the configuration you are adding really behaves as expected.
  • Feign retries: I totally skipped this part, but there is a built-in retry mechanism in Spring Cloud Feign that comes on top of the Hystrix and Ribbon mechanisms. You can check Retryer.Default to see the default retry strategy, but this is kind of misleading in two ways:
    • First: if you have Hystrix Feign enabled, the default retryer is Retryer.NEVER_RETRY (check FeignClientsConfiguration.feignRetryer()).
    • Second: even if you define a Retryer Bean set to Retryer.Default, you won't get Feign-level retries by default. Check ErrorDecoder.Default to see that we get a RetryableException only when there is a well-formatted date in the Retry-After HTTP header.
      So if you want to play with this you will need to:
    • define an ErrorDecoder that ends up in a RetryableException in the cases you want (or add the Retry-After header in your services).
    • change the Retryer to one that actually retries.
    • probably redefine the Feign.Builder Bean (be careful to keep the @Scope("prototype")) to suit your needs.
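Before writing a custom ErrorDecoder, it helps to pin down "the cases you want". The sketch below is a JDK-only decision function with a hypothetical name (RetryPolicy.shouldRetry); the choice of retrying only idempotent HTTP methods on 502, 503 and 504 is my assumption, not a Feign default. A custom ErrorDecoder could consult something like this to decide whether to produce a RetryableException.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Sketch only: hypothetical retry policy, independent from Feign.
final class RetryPolicy {

  // Assumption: only idempotent methods are safe to replay.
  private static final Set<String> IDEMPOTENT_METHODS =
      new HashSet<>(Arrays.asList("GET", "HEAD", "OPTIONS", "PUT", "DELETE"));

  // Assumption: these statuses usually point to a temporary problem.
  private static final Set<Integer> RETRYABLE_STATUSES =
      new HashSet<>(Arrays.asList(502, 503, 504));

  private RetryPolicy() {
  }

  static boolean shouldRetry(String httpMethod, int status) {
    return IDEMPOTENT_METHODS.contains(httpMethod.toUpperCase(Locale.ROOT))
        && RETRYABLE_STATUSES.contains(status);
  }
}
```

With this policy a GET answered with a 503 would be retried, while a POST answered with the same status would not.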

So, do we go live?

This stack really is great and every developer using it daily will enjoy it, at least after someone on the team spends days setting all this up and checking some “vital” points:

  • avoid retries on POST, PATCH and any non-idempotent services (should follow this)
  • ensure that fallback calls and opened circuit breakers are tracked and explained
  • ensure that Eureka is secured and not a SPOF (even without Eureka up and running the apps can talk to each other, at least for a fair amount of time)
  • ensure that some minor version change will not silently break all this anti-fragile stuff with strong integration tests.

In my opinion, this is a really great stack that needs a lot of work and understanding. So, make sure to use it only if you need it and otherwise stick to RestTemplate until you have time to give it a good try!
