When does splitting a monolithic system into multiple services make sense?

When does it stop making any sense?

Connections as dependencies between services

Services depend on some input to produce the output they were created for.

Based on these “communication lines” (dependencies), splitting a service (or not) into multiple ones can be objectively rationalized.

Criterion: time spent to verify system

Eventually, what matters is the overall human time spent on making changes to the system.

Now, the change in a system is only useful when it is finally verified¹.

The efforts on verification grow with the test space.

When we compare monolith and multiple services, we compare test space difference before and after the split.

Abstraction

Boolean function

The services are simplified to compute mere boolean functions where every “input line” is 1-bit boolean value.

This abstraction is easy to imagine and universal for our purpose.

The system as a whole (whether it is multiple services or monolith service) has fixed 12 inputs and 1 output.

Example system - boolean function with 12 inputs and 1 output — Initial system - boolean function with 12 inputs and 1 output

We compare initial monolith and its split into two services (A and B).

Stateless services

All services (boolean functions) are stateless.

Even if they were not, a state could be modeled by some input data. If we consider some input data as a state, it constitutes identical test space and does not change results in principle.

Expression dependencies

Whether monolith service is rather split-able or rather not depends on its boolean function.

For example:

boolean service(boolean input[]) {
    boolean potential_service_a = input[0] || input[7];
    boolean potential_service_b = input[0] && input[8];
    return potential_service_a ^ potential_service_b;
}

Specifically, split-ability is affected by:

What kind of data dependencies exists between expressions computing the function.
How (where) do we split these expressions.

// Shared input[1].
// Independent.
boolean service1(boolean input[]) {
    boolean potential_service_a = input[1] || input[7];
    boolean potential_service_b = input[1] && input[8];
    return potential_service_b;
}
// No shared input.
// Inter-dependent.
boolean service2(boolean input[]) {
    boolean potential_service_a = input[2] || input[7];
    boolean potential_service_b = input[3] && input[8] && potential_service_a;
    return potential_service_b;
}

We don’t care about exact expression for each case, we simply assume it can be anything suitable for that particular case.

Test space

Test space is determined purely by the input permutations. There cannot be more output variations than input.

Regardless of internal complexity, a service is only required to respect external contracts agreed with other services.

Neglected integration

Making equivalent inter-dependent changes in distributed services is obviously more time-consuming. We neglect that.

Why? Even before we start adding this extra cost, the split itself has to show its benefits.

For the same reason, we also neglect integration testing.

Even when the split is already justified, integration testing is normally limited to “communication lines” (right “wiring”) and avoids thorough input permutations (which is much more efficiently performed by unit test anyway).

CASE: 0-bits dependency, 0-bits shared

Boolean expressions: 0-bits dependency, 0-bits shared

No input data is shared.
No dependency between service A and B.

Only service B is clearly enough to produce result required by the system using just 5-bits subset out of 12-bits in input data.

This may appear as a degenerate case, but it is common in practice - a system which does more than required for its purpose.

As soon as the system is split into A and B, one can get rid of A without affecting the output.

Test space:

A	B	Total
`0`	`2^5=32`	`32`

We reduced initial test space from 2^12=4096 to 32.

CASE: 1-bits dependency, 0-bits shared

Boolean expressions: 1-bits dependency, 0-bits shared

No input data is shared.
Service B depends on 1-bit output of service A.

It is impossible to get rid of service A anymore - (potentially) all its input lines affect the overall system result (via dependency in service B).

Test space:

A	B	Total
`2^7=128`	`2^(5+1)=64`	`192`

We reduced initial test space from 2^12=4096 to 128+64=192.

CASE: 2-bits dependency, 0-bits shared

Boolean expressions: 2-bits dependency, 0-bits shared

This is similar to the previous case except for “thicker” 2-bits dependency between A and B.

Test space:

A	B	Total
`2^7=128`	`2^(5+2)=128`	`256`

We reduced initial test space from 2^12=4096 to 128+128+4=256.

CASE: 2-bits dependency, 1-bits shared

Boolean expressions: 2-bits dependency, 1-bits shared

This is similar to the previous case except for single shared input bit.

Test space:

A	B	Total
`2^7=128`	`2^(5+2+1)=256`	`384`

We reduced initial test space from 2^12=4096 initial to 128+256=384.

CASE: 1-bits dependency, all-bits shared

Boolean expressions: 1-bits dependency, all-bits shared

Test space:

A	B	Total
`2^7=128`	`2^(12+1)=8192`	`16781312`

We have test space explosion from initial 2^12=4096 to 128+8192=8320!

Multi-service system test space more than doubled in comparison with monolithic end-to-end testing.

CASE: all-bits dependency, all-bits shared

In the previous examples, we saw how growing dependencies per service cause growing test space.

Now, let us jump to the extreme end:

service B will depend on all output of service A
service B will also share all input with service A

Boolean expressions: all-bits dependency, all-bits shared

Test space:

A	B	Total
`2^(12)=4096`	`2^(12+12)=16777216`	`16781312`

We have test space explosion from initial 2^12=4096 to 4096+16777216=16781312!

Observation

What matters is the number of dependencies per individual service. The test efforts exponentially grow with that number.

Notice that splitting monolith service into multiple ones may require more dependencies (“communication lines”) for some services. These lines were hidden in the monolith. Now they are part of the explicit contract between services and require test coverage.

This is when splitting does not make sense.

Even if the number of dependencies before and after the split is equivalent, the nature of the dependencies (internal or external) still makes a considerable impact on maintenance cost.

Internal dependencies within monolith cost less human time to manage due to automation:

compile-time checks
refactoring tools
fast lightweight unit testing

External dependencies across multiple services:

have fewer automatic support
involve resource-intensive integration testing

Distributed Complications

Latencies

Frequent round trips for external data will bring the performance down.
Failures
- Multiple services fail independently.
  
  To recover resiliently, handling failures explodes testing space further.
- Monolith service fails as a whole.
  
  It may be easier to distribute whole monoliths for redundancy.
  
  Whole monoliths use reliable communication with itself reducing test space by impossible internal communication failures.

Conclusion

The 1-bit data dependencies is the minimum to differentiate one case from another. Real world input cases will have to provide at least that (to select different code branch). Any attempt for full input permutations is often beyond practically possible.

This is only bad for the real world, but does not make this argument invalid.

The total test space (for boolean expresions) grows as a polynomial:

A*2^(n) + B*2^(n-1) + C*2^(n-2) + ...

A - total number of services with (highest number) n dependencies.

The split of monolith should at least try to reduce highest number n of dependencies met in any service.

Depending on the internal dependencies, it may be impossible to simplify system by splitting it into multiple services. Graph Clustering Algorithms resolve this optimization problem where it is possible.

Non-optimal split into 2 tightly coupled services

Optimal split into 12 independent services

Test space:

2 services	12 services
`2*2^12=8192`	`12*2^1=24`

This somewhat quantifiable approach to justify splitting. The cost reduction is purely based on testing efforts (let alone avoiding other issues of distributed systems).

Is it all surprising?

There are reasons why all biological brains tend to be co-located/integrated into the single monolithic head rather than distributed across the entire body.

If our brain was distributed, the wiring would be exhaustive to support.

Now, what we have as distributed services are “loosely coupled across” yet “tightly integrated within” body parts:

the brain itself serves decision making
legs serve kicking
hands serve grabbing

This may not be obvious, but most of the time human spend on during changes is their verification. In fact, the very change itself is made only after some thought experiment suggested the change seemingly produce desired effect. ↩

Microservices Split Criterion

Criterion: time spent to verify system

Abstraction

Boolean function

Stateless services

Expression dependencies

Test space

Neglected integration

CASE: 0-bits dependency, 0-bits shared

CASE: 1-bits dependency, 0-bits shared

CASE: 2-bits dependency, 0-bits shared

CASE: 2-bits dependency, 1-bits shared

CASE: 1-bits dependency, all-bits shared

CASE: all-bits dependency, all-bits shared

Observation

Distributed Complications

Conclusion

Is it all surprising?

Deceptive Agile and Misconceived Waterfall

Shells and Cores

Obsessive Line Splitting Disorder (OLSD)