Intro
What is SmartThings
Eras of Our Architecture
What Failed for Us
Patterns Working Across Eras
Will that Monolith Ever Really Die
To the people around you:
What company or organization are you from?
What are you all looking to take away from this case study?
Jeff Beck
Software at SmartThings
@Beckje01
A platform for IoT: connect and automate all your devices.
Multiple Mobile Clients
Many Connected Devices
Open API
>150 μ-services
Java, Groovy, Kotlin, Scala, Rust, JavaScript, Swift
Grails, Ratpack, SpringBoot, Micronaut, Dropwizard, …
The architecture changed over time; the lines between each era are blurry.
One service in the cloud for all things to connect to.
First iteration: make something work with the least amount of effort.
Quick/Simple Deployment
Debugging was simple: one place to look.
Hard to solve all connectivity with one codebase
Mobile Client
Website
Hubs (Devices)
When we had lots of different things connected in different ways.
Some extra services were stood up to deal with special connections, but they could not operate without the monolith.
Needed special connections to the phone and hub, but didn’t want to split up the core business logic.
Correct Technology for each use case
Simple Deployment
Debugging business logic was always in the core, connectivity in the other services.
Harder to test a limited set of features in an environment
The team was getting bigger and starting to feel crowded in the codebase
Coupled features
Starting to hit scaling issues for the core service.
How to support a global rollout of our platform.
A small set of services available globally, with our core deployed in different geos.
Support multiple geographies at once.
Minimal change to support business needs.
Global services were new to us, so we limited how many we needed
Quick escape valve for scale issues: just add another shard.
Minimal changes to mobile clients and hubs.
Pulled out Auth / User to a Microservice
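As a rough illustration of the geo-sharded era, the sketch below shows how a small, globally deployed lookup could map an account to the geo shard of the core that owns it. The class, shard names, and URLs are hypothetical assumptions, not SmartThings code.

```java
import java.util.Map;
import java.util.Optional;

/**
 * Hypothetical sketch of the "global service, sharded core" idea:
 * a tiny globally deployed lookup that tells a client or hub which
 * geo shard of the core platform owns its account. Shard names and
 * URLs are illustrative only.
 */
public class ShardDirectory {

    // In practice this mapping would live in a replicated datastore,
    // not an in-memory map.
    private final Map<String, String> accountToShard;

    private static final Map<String, String> SHARD_BASE_URLS = Map.of(
            "na01", "https://na01.api.example.com",
            "eu01", "https://eu01.api.example.com",
            "ap01", "https://ap01.api.example.com");

    public ShardDirectory(Map<String, String> accountToShard) {
        this.accountToShard = accountToShard;
    }

    /** Returns the base URL of the core shard that owns this account. */
    public Optional<String> baseUrlFor(String accountId) {
        return Optional.ofNullable(accountToShard.get(accountId))
                .map(SHARD_BASE_URLS::get);
    }
}
```

Under this assumption, the "escape valve" is just another core deployment plus another entry in the shard table.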
Couldn’t scale different operations in different ways
Too many conflicting changes when trying to experiment.
Couldn’t scale different workloads in different geographies
Team has grown too big to share a codebase
Team is now distributed around the globe
Breaking up the Monolith is now a goal.
We created these categories of microservices so that services would share common operational patterns.
Teams were just starting to take over on call
Wanted to allow teams in a timezone to build their own service
Easy to reason about different services operationally
Creation of paved path
Reasoning about the whole system's functionality is getting harder
Fitting new workloads into the existing buckets is awkward
When is it worth creating a new "bucket"?
We outgrew the need for making services operate the same way.
Refocus on letting teams deploy and design how they want.
Flexibility to teams
Unblock teams' thinking
React to evolving regulations by changing service / feature deployment architecture
Cost optimizations
Building common tools to support most of the use cases
Observability
What should be a paved path?
We will let you know.
You should try something other than these ideas.
Too rigid
Limited engineers' thinking about solutions
Too costly
Sometimes missed the mark on primitives
Once the system got advanced enough, the automated architecture view was no longer useful.
No matter what era of architecture, these work for us.
A drillable diagram: high level, moving down into the details of functional areas.
A service that maintains a connection ideally has zero logic outside of the connection.
Allows deployments to not disconnect hubs/devices
Small and easy to maintain, hopefully mostly off the shelf.
Allows moving URLs without rewriting all destinations.
Allows telling a coordinator where to route
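To make the connection-service idea concrete, here is a minimal sketch with made-up names rather than actual SmartThings APIs: the connection service only holds the hub/device socket and asks a coordinator where each message should go, so a destination can move without every connection service being rewritten.

```java
import java.net.URI;
import java.util.Map;

/**
 * Hypothetical sketch of a coordinator that connection services consult
 * for routing decisions. Deciding where a message goes, and where that
 * destination currently lives, stays in one place, so destinations can
 * move without touching the services that hold the connections.
 */
public class RoutingCoordinator {

    // Logical destination -> current location; updated when a service moves.
    private final Map<String, URI> routes;

    public RoutingCoordinator(Map<String, URI> routes) {
        this.routes = routes;
    }

    /** Where should a message of this type be forwarded right now? */
    public URI destinationFor(String messageType) {
        URI target = routes.get(messageType);
        if (target == null) {
            throw new IllegalArgumentException("No route for " + messageType);
        }
        return target;
    }
}

/** A connection service only forwards; it holds no business logic. */
class HubConnectionHandler {
    private final RoutingCoordinator coordinator;

    HubConnectionHandler(RoutingCoordinator coordinator) {
        this.coordinator = coordinator;
    }

    void onMessage(String messageType, byte[] payload) {
        URI destination = coordinator.destinationFor(messageType);
        // forward(destination, payload): transport details omitted.
    }
}
```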
Build tooling looking only one era ahead, and plan on deprecation.
Spinnaker with 5 microservices is too heavy.
Spinnaker with 100+ works great.
Creating a documented, tooled path that covers your 80% case.
Auth systems are so important to everything you do that you should always break them out first.
Allows adding new services without interacting with the old monolith.
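A small sketch of why this pays off: once auth stands alone, a brand-new service can validate a caller's token against the auth service directly and never touch the monolith. The introspection endpoint and response handling below are illustrative assumptions, not the real SmartThings auth API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/**
 * Hypothetical sketch: a new microservice checks a bearer token with the
 * standalone auth service instead of calling into the monolith. The URL
 * and body format are assumptions for illustration.
 */
public class TokenValidator {

    private static final URI INTROSPECT_URI =
            URI.create("https://auth.example.com/oauth/introspect");

    private final HttpClient client = HttpClient.newHttpClient();

    /** Returns true if the auth service reports the token as active. */
    public boolean isValid(String bearerToken) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(INTROSPECT_URI)
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString("token=" + bearerToken))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // A real check would parse the JSON body; this sketch only looks at
        // the status code and an "active" flag in the raw response.
        return response.statusCode() == 200
                && response.body().contains("\"active\":true");
    }
}
```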
You have tons of new services but that old big one is still there…
Does the rest of the system grow to eclipse it?
Can we still operate it?
Can we isolate it off behind some cleaner services?
If you can still operate this system and not everything depends on it, you can effectively keep the monolith in a corner.
When you start seeing outages that don't cascade.
When deployments of the service graph don’t matter to every team.