In control theory, observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. (The observability and controllability of a system are mathematical duals.) A good litmus test for observability in software is: can you ask the right questions? Is your team spending too much time on call and struggling with technical debt? Investing in observability pays dividends in the form of less toil and more innovation.
Often, services outgrow their existing tools, and the more opaque your services are, the more challenging it is to manage them when things go wrong. Getting ahead of issues is the only viable approach, so start your observability journey now and instrument new services. To try to fix this problem of opacity, we created monitoring tools to help us figure out what was going on in the guts of our software.
We kept track of our application performance with monitoring data collection and time series analytics. This process was manageable for a while, but it quickly got out of hand. Modern systems — with everything turning into open-source cloud-native microservices running on Kubernetes clusters — are extraordinarily complex.
With these complex, distributed systems being developed at the speed of light, the possible failure modes are multiplying. We cannot keep up with this by simply building better applications.
In many cases, there are just too many failure modes to anticipate, and this is dangerous. Standard monitoring, the kind that is bolted on after release, cannot fix this problem. It can only track known unknowns; push beyond those, and you end up with monitors whose purpose no one understands.
For known problems, you have everything under control, in theory. In practice, customers still complain about issues in your system even when your monitors look good. Monitors on metrics alone are therefore not enough. You need context, and you can get it from your logs.
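As a minimal sketch of what that context looks like, here is a structured log entry carrying details that a bare error-count metric cannot. The service, field names, and values are all hypothetical:

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_checkout_failure(user_id: str, order_id: str, cause: str) -> str:
    """Emit a structured log line with enough context to diagnose the failure."""
    entry = {
        "event": "checkout_failed",   # what happened
        "user_id": user_id,           # who was affected
        "order_id": order_id,         # which request
        "cause": cause,               # why, as reported by the app itself
    }
    line = json.dumps(entry)
    logging.error(line)
    return line

log_checkout_failure("u-42", "o-1001", "payment gateway timeout")
```

A metric can tell you that checkout errors spiked; a log line like this tells you which users, which orders, and why.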
You can place monitors for CPU, memory, networking, and databases, and many of these monitors will help you understand where to apply known solutions to known problems. In other words, these types of applications are mature enough that their failure modes are well understood. But nowadays, the story is different. Almost everyone is working with distributed systems: microservices, containers, cloud, serverless, and countless combinations of these technologies. All of this increases the number of ways a system can fail, because there are so many parts interacting.
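For illustration, a classic resource monitor of this kind boils down to comparing a sampled value against a fixed limit. The threshold and values below are made up:

```python
def check_cpu(cpu_percent: float, threshold: float = 80.0) -> str:
    """Return an alert status for a single sampled CPU reading.

    This is the 'known problem, known question' pattern: the question
    (is CPU above the threshold?) is fixed in advance.
    """
    return "ALERT" if cpu_percent > threshold else "OK"

print(check_cpu(45.0))  # OK
print(check_cpu(93.5))  # ALERT
```

This works well when high CPU is a problem you already understand; it tells you nothing about a failure mode you never thought to write a check for.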
You should recognize what the SRE model emphasizes: systems will continue failing. As your system grows in usage and complexity, new and different problems will keep emerging. Engineers can make guesses based on the metrics they see.
If the CPU usage is high, it might be because traffic has increased. But these are just guesses. Systems need to emit telemetry that spells out what the problem is. Observability is what will help you troubleshoot better in production. So, at this point, you might be wondering how observability differs from monitoring. Well, as I said before, monitoring is not a good way to find out about unknown problems in the system.
Monitoring asks the same questions over and over again: for example, is the latency under a given number of milliseconds? This is valuable information, but it is only useful for known problems.
Observability, on the other hand, is about asking different questions almost all the time. You discover new things by using all the information your systems produce: logs, traces, and metrics.
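The difference can be sketched in a few lines: instead of a fixed check, you query raw telemetry events by whatever fields matter right now. The events, field names, and values here are hypothetical:

```python
# Hypothetical telemetry events, each a bag of arbitrary fields.
events = [
    {"service": "checkout", "latency_ms": 950, "region": "eu", "version": "v2"},
    {"service": "checkout", "latency_ms": 120, "region": "us", "version": "v1"},
    {"service": "cart",     "latency_ms": 80,  "region": "eu", "version": "v1"},
]

def ask(events, **filters):
    """Filter events by any combination of fields.

    The 'question' is not fixed in advance, which is the essence of
    asking new questions of your telemetry.
    """
    return [e for e in events if all(e.get(k) == v for k, v in filters.items())]

# A question you did not anticipate when instrumenting:
# is the new version slow only in one region?
slow_eu_v2 = ask(events, region="eu", version="v2")
```

Here `slow_eu_v2` narrows the problem to a single event, a question no pre-built dashboard needed to anticipate.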