@beckje01
Intro
Lab00 - Setup
Lab01 - Logging
Lab02 - Tracing
Lab03 - Metrics
Lab04 - Observability Bugs
Jeff Beck
Software at SmartThings
Daniel Gerson — Detailed notes on how to improve the labs
Adrian Cole — Tons of shared knowledge
Attendees at Greach and Devoxx UK — Tons of feedback
Apache Zipkin — Tons of inspiration from the docker compose
The labs compile, but not all of them behave correctly yet. The answers are in the corresponding modules.
All the shared infrastructure for observability is in this directory. You can run it with docker-compose.
A handful of simple requests that can exercise the system easily are in this directory. Each service has a file.
Each service is named for its role, then a dash, then its framework. You only need to run one of each role.
Each lab directory has a high-level task list that gives a rough order for the things to explore, along with general pointers on where to get started.
You don’t need to modify the root gradle file. If you need to add dependencies, do that in the gradle file for the given project in each lab.
Any libraries needed to solve the lab should be added to the project gradle file.
After each lab we will have a wrap-up discussion about more complex, real-world applications of the topics.
The property of systems that allows operators to clearly understand the state of the system.
From infra dir:
docker-compose up
requests directory - Has calls to operate the system
links.html - Has the links to our infrastructure
Once you expand past a few μ-services, you quickly need a centralized logging solution.
We are using the ELK stack in the workshop because it’s open source and has a good docker setup. There are many great solutions, so use what’s best for you.
Elasticsearch — The Datastore
Logstash — Transformative Data Pipeline
Kibana — Front End
From each project directory
gradle run
Get Logs to One Place
Dynamic Log Filtering
Log Formatting
Correlation IDs
GOAL All Logs Available in Kibana
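One reading of the dynamic log filtering task is changing log verbosity at runtime without a restart. The sketch below uses only java.util.logging from the JDK; the logger name and the `setLevel` helper are hypothetical and are not taken from the lab modules, which may use a different logging framework.

```java
import java.util.Locale;
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative only: the class name, logger name, and helper are assumptions.
final class DynamicLogLevel {
    static final Logger LOG = Logger.getLogger("workshop.example");

    private DynamicLogLevel() {}

    // Change verbosity while the app is running, for example from an
    // admin endpoint or a config watcher.
    static void setLevel(String levelName) {
        LOG.setLevel(Level.parse(levelName.toUpperCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        setLevel("warning");
        LOG.info("dropped: below the current threshold");
        LOG.warning("kept: at the current threshold");

        setLevel("fine");
        // This passes the logger's level check. The JDK's default console
        // handler still filters anything below INFO unless reconfigured.
        LOG.fine("accepted by the logger now that FINE is enabled");
    }
}
```

Keeping the default level high and lowering it only while investigating is one way to limit how much log volume reaches the central store.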
Correlation IDs are lightweight, good for small retrofit
Dynamic Logging is for cost savings
Formatting Matters
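As a rough illustration of why correlation IDs suit a small retrofit, the sketch below reuses an incoming ID, creates one at the edge when none is present, copies it onto outgoing calls, and stamps it on each log line. The header name `X-Correlation-ID`, the class, and its methods are assumptions made for this example and do not come from the lab code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Illustrative helper; all names here are hypothetical.
final class CorrelationIds {
    // Assumed header name. Any name works if every service agrees on it.
    static final String HEADER = "X-Correlation-ID";

    private CorrelationIds() {}

    // Reuse the caller's ID when present, otherwise start a new one.
    static String resolve(Map<String, String> incomingHeaders) {
        String existing = incomingHeaders.get(HEADER);
        return (existing == null || existing.isEmpty())
                ? UUID.randomUUID().toString()
                : existing;
    }

    // Return a copy of the outgoing headers with the ID attached.
    static Map<String, String> propagate(String correlationId, Map<String, String> outgoingHeaders) {
        Map<String, String> copy = new HashMap<>(outgoingHeaders);
        copy.put(HEADER, correlationId);
        return copy;
    }

    // Put the ID in every log line so one search finds the whole request.
    static String logLine(String correlationId, String message) {
        return "correlationId=" + correlationId + " " + message;
    }

    public static void main(String[] args) {
        String id = resolve(new HashMap<>());
        Map<String, String> downstream = propagate(id, new HashMap<>());
        System.out.println(logLine(id, "calling downstream with headers " + downstream.keySet()));
    }
}
```

With the ID present as a field in each log line, a single Kibana query on that value should return the logs from every service that handled the request.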
Same apps, just add tracing.
Distributed tracing systems collect end-to-end latency graphs (traces) in near real-time.
Span - An operation that took place.
Event - Something that occurs in a span.
Tag - Key value pair on a span.
Trace - End-to-end latency graph, made up of spans.
Tracer - Library that records spans and passes context
Instrumentation - Use of a tracer to record tasks.
Sample % - How often to record a trace.
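The terms above can be pictured as a small data model. This is a teaching sketch only: `SpanSketch` and its fields are hypothetical, real tracers have their own types, and the hash-based sampling rule is just one simple way to make the same keep-or-drop decision for a given trace ID.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model of the vocabulary; not a real tracer API.
final class SpanSketch {
    final String traceId;   // shared by every span in one trace
    final String spanId;    // identifies this operation
    final String parentId;  // null for the root span
    final String name;
    final Map<String, String> tags = new LinkedHashMap<>(); // key-value pairs on the span
    final List<String> events = new ArrayList<>();          // things that occur in the span

    private SpanSketch(String traceId, String spanId, String parentId, String name) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentId = parentId;
        this.name = name;
    }

    static SpanSketch root(String traceId, String spanId, String name) {
        return new SpanSketch(traceId, spanId, null, name);
    }

    // A child keeps the trace ID and points back at its parent; this is the
    // context a tracer passes along between operations.
    SpanSketch child(String childSpanId, String childName) {
        return new SpanSketch(traceId, childSpanId, spanId, childName);
    }

    // Assumed sampling rule: bucket the trace ID into 10,000 slots and keep
    // the trace when its slot falls under the configured rate (0.0 to 1.0).
    static boolean sampled(String traceId, double rate) {
        return Math.floorMod(traceId.hashCode(), 10_000) < rate * 10_000;
    }

    public static void main(String[] args) {
        SpanSketch request = root("trace-1", "span-1", "GET /things");
        request.tags.put("http.method", "GET");
        SpanSketch query = request.child("span-2", "select things");
        query.events.add("cache miss");
        System.out.println(query.name + " trace=" + query.traceId + " parent=" + query.parentId);
    }
}
```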
Zipkin Support For Services
DataStore Tracing
Debug Issues
Debug Slow Transactions
Customization is Key
Service Mesh
When to use annotations?
When to use tags?
Prometheus - Monitoring and Alerting Tool
Grafana - Visualization Tool
Micrometer - Library to collect / report metrics in an app
Dropwizard Metrics - Library to collect / report metrics in an app
Expose Metrics for Prometheus
Scrape All the Metrics
Custom Metrics
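To make the expose and scrape steps more concrete, here is a hand-rolled sketch of a custom counter rendered in the Prometheus text exposition format. In the labs a library such as Micrometer or Dropwizard Metrics would normally do this for you; the class name, metric name, and help text below are hypothetical.

```java
import java.util.concurrent.atomic.LongAdder;

// Illustrative only: shows roughly what a scrape of one counter returns.
final class RequestCounterSketch {
    private final String name;
    private final String help;
    private final LongAdder count = new LongAdder();

    RequestCounterSketch(String name, String help) {
        this.name = name;
        this.help = help;
    }

    // Called from application code each time the event happens.
    void increment() {
        count.increment();
    }

    // Text a metrics endpoint could return when Prometheus scrapes it.
    String scrape() {
        return "# HELP " + name + " " + help + "\n"
             + "# TYPE " + name + " counter\n"
             + name + " " + count.sum() + "\n";
    }

    public static void main(String[] args) {
        RequestCounterSketch orders =
                new RequestCounterSketch("orders_created_total", "Orders created since startup.");
        orders.increment();
        orders.increment();
        System.out.print(orders.scrape());
    }
}
```

A counter only goes up, so rates and alerts are usually derived on the Prometheus side from how fast the value grows between scrapes.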
What metrics do you collect today?
How do metrics lie to you?
How do your metrics tie to users?
Odd Behaviors
Missing Logs
Traces with > 10k spans
Error rates thrown off by service reporting the wrong name.
Lost traces
Broken Traces
Fixed correlation IDs
Data is step 1
Actionable data is step 2
Pair all the tools for maximum effect.