A lot of excellent information in that blog post and linked from it... but if you're wondering where to start:
1. Write good logs... not too noisy when everything is running well, meaningful enough to let you know the key state or branch of code when things deviate from the good path. Don't worry too much about structured vs unstructured; just ensure you include a timestamp, file, log level, func name (or line number), and that the message will actually help you debug.
2. Instrument metrics using Prometheus; there are libraries that make this easy: https://prometheus.io/docs/instrumenting/clientlibs/ . Counts get you started, but you probably want to think in aggregations and ask about the rate of things and percentiles. Use histograms for this: https://prometheus.io/docs/practices/histograms/ . Use labels to create a more complex picture, e.g. a histogram of HTTP request times with a label for HTTP method means you can see all reqs, just the POSTs, or maybe the HEADs and GETs together, etc... and then create rates over time, percentiles, etc. Do think about the cardinality of label values: HTTP method is good, but request identifiers are bad in high-traffic environments... labels should group, not identify.
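To make the labelled-histogram idea in point 2 concrete, here's a hand-rolled sketch of what the Prometheus client libraries do for you (in practice you'd use prometheus_client's Histogram; the bucket bounds here are illustrative):

```python
from collections import defaultdict

# Sketch of a labelled histogram: cumulative buckets ("le" = less than
# or equal), one set of counters per label value (HTTP method here).
class Histogram:
    def __init__(self, buckets=(0.05, 0.1, 0.5, 1.0, float("inf"))):
        self.buckets = buckets
        self.counts = defaultdict(lambda: [0] * len(buckets))
        self.sums = defaultdict(float)

    def observe(self, method, seconds):
        self.sums[method] += seconds
        for i, le in enumerate(self.buckets):
            if seconds <= le:
                self.counts[method][i] += 1  # buckets are cumulative

h = Histogram()
h.observe("GET", 0.03)
h.observe("GET", 0.4)
h.observe("POST", 0.8)

# Because the label groups rather than identifies, aggregating across
# all methods is just summing bucket counts column-wise:
all_reqs = [sum(col) for col in zip(*h.counts.values())]
print(all_reqs)  # [1, 1, 2, 3, 3]
```

The cumulative bucket counts are what lets the backend derive rates and approximate percentiles later, for one method or for all of them combined.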
Start with those things, tracing follows good logging and metrics as it takes a little more effort to instrument an entire system whereas logging and metrics are valuable even when only small parts of a system are instrumented.
Once you've instrumented... Grafana Cloud offers a hosted Grafana, Prometheus metrics scraping and storage, and Log tailing and storage (via Loki) https://grafana.com/products/cloud/ so you can see the results of your work immediately.
If it's a big project you have a lot of options, and I assume you know them already; this is when you start looking at Cortex and Thanos, Datadog and Loki, and tracing with Jaeger.
> Grafana Cloud offers a hosted Grafana, Prometheus metrics scraping and storage, and Log tailing and storage (via Loki)
I haven't looked at their pricing before, but for small-ish environments, their standard plan looks really good and simple. None of the "per host, but also per function, and extra for each feature, and extra for usage" approach like other providers (datadog, I'm looking at you).
> None of the "per host, but also per function, and extra for each feature, and extra for usage" approach like other providers (datadog, I'm looking at you).
I was thinking "God, this is exactly why I hate Datadog" as I was reading your description and got a great laugh when I reached the end. Their billing is absolutely byzantine.
I don't know that I've ever seen a company that had such a stark difference between great engineering/product and awful business/sales practices. Their product is really the best turn-key option out there, but I'm always hesitant to use its features without double checking it's not going to add 50% to my bill. Their sales teams are some of the worst I've dealt with, and I deal with a lot of vendors. They're starting to get a really bad reputation as well.
I even posted about it in another thread a couple weeks ago about great software.
I'm a customer that uses most of their tools (no network performance monitoring since it's less useful than a service mesh and no logging because we need longer history than most and cost would be prohibitive).
Is it really that expensive compared to other vendors? I thought their newer logging tool was a lot cheaper than Splunk, and their APM tool for distributed tracing is also pretty cheap compared with something like New Relic. Sure, it's more expensive than free tools that you need to set up yourself, but the velocity it gives your teams is so much better than having to use something like Grafana with tools like Prometheus. Again, sure, it can be done for cheap, but the time it takes to manage those tools and the velocity you lose when doing that doesn't seem worth it for smaller companies, though I can see it making more sense as you scale a company.
It's not the cost per se, though I do think they're pretty high for some features. It's the pricing models and the associated patterns.
For instance, you have to pay Datadog per host you install the agent on. In addition to the per-host cost, you have to pay per container you run on that host (past a very small baseline per host), and the per-container cost turns out to be nearly as high as the per-host cost if you have reasonable density. Why am I paying Datadog per container I run? Aside from a not particularly useful dashboard, why does a process namespace and some cgroup metrics nearly double my bill? They are literally just processes on a server. Because Datadog wants you to run more hosts, so you install more agents.
Every feature they add also seems to be charged separately, but is not behind any sort of feature gate. This means new features just show up for my developers, and they have no clue if it costs money to use them. I can't just disable or cap, for example, their custom metrics per user, per project, or at all. So when my developers see a useful feature and start using it, all of a sudden I have an extra $10k on my monthly bill. Even more fun are features that show up and are initially free but then start charging.
This is such a pain that we've had to tell dev teams not to use Datadog features outside of a curated list. Every product has some rough edges, but with Datadog the patterns are all set up such that you end up paying them thousands of extra dollars. Again, great product, but not a business I would be interested in associating with again given the choice.
It's not so much the total cost, but the fact that there's so much nickel and diming. When Trace Analytics came out they tried getting us to turn it on, and it's like... we're already paying for APM and you want to charge us more; at least tell us how much more, and they couldn't. I think it probably ended up not being a ton of money, but just the question was enough for us not to do it. From working with other providers, it's also much easier working with our finance team if we can say 'it costs at most this' instead of 'it costs at least this'.
It depends where you are in the world. When I was working in Switzerland, most SaaS pricing was a no-brainer for us. But since I work in Latin America for small companies with local customers, all the different services and tools you might want to use, with prices targeted at "western" customers, add up much more quickly to the equivalent of having multiple people on staff full time.
Still, it is often not worth it to roll your own, so it is nice to have alternatives for different price points and company scales.
Exactly this. We operate in Eastern Europe with local clients, offering on-prem SaaS. If I added all my clients' servers on datadog it would very easily eat through our profit margins.
> Still it is often not worth to roll your own
I tried hard not to, but in the end, after spending a week trying to set up netdata and failing, I decided not to spend another week trying to set up grafana/influx/prometheus (lots of docs to go through), and just have some bash scripts send metrics to a service on a $10 Digital Ocean node that sends me emails/SMS when something "looks bad" (e.g. high CPU temperature, stopped docker containers, etc).
I gave up on aggregated logging for the time being, since I can just ssh into each server and check journal and docker logs if I need to (as long as the hard drives don't crash).
Yeah, having looked at what the script does I decided to 'containerize' the agent, and that led to other issues like configuring email alerts etc.
I was already a week deep into looking at various options and had to deliver on basic metrics and alerting, so I figured a couple of bash scripts that log into local files with log rotation, systemd, and a dumb, memory-only receiving end running on nodejs for the alerts would be much faster and easier to maintain.
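In the same spirit, a "looks bad" check can be a few lines of ordinary code. A minimal sketch (Python for illustration; the 90% threshold and the alert() stub are assumptions, not what the parent actually runs):

```python
import shutil

def alert(msg):
    # Stub: a real version would send email/SMS or POST to the
    # receiving end instead of printing.
    print(f"ALERT: {msg}")

def check_disk(path="/", max_used_pct=90):
    # One hard-coded check: fire an alert when disk usage crosses
    # a threshold, otherwise stay quiet.
    usage = shutil.disk_usage(path)
    used_pct = 100 * usage.used / usage.total
    if used_pct > max_used_pct:
        alert(f"disk {path} at {used_pct:.0f}% used")
    return used_pct

check_disk("/")
```

Run from cron or a systemd timer, that's the whole monitoring system: no agents, no storage backend, just a handful of checks you can read in one sitting.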
From the post "The key, he says, is using the right transaction identifiers so that calls can be traced across components, services, and queues".
I think this is a key feature not many people implement, especially in today's world of overblown microservices. Having a transaction id from the time the request hits the reverse proxy until the database write is so helpful in debugging; it saves a ton of time.
I agree with this wholeheartedly. You can even define a standard and let downstream services opt in over time. Simple wins like this should not be put off because "someday" we're going to implement a complex distributed tracing solution.
Often it's political, but political friction can feed into technical friction if there are also a half dozen different half wrappers around half-baked HTTP libraries. Also, if someone has included OpenTracing as a shadow library in a company-wide library (JVM territory), but you want it as a top-level dependency, you have to write translators.
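The transaction-id idea upthread is small enough to sketch. Names here are illustrative (X-Request-ID is a common convention, not a standard), and the helpers stand in for real middleware:

```python
import uuid

HEADER = "X-Request-ID"

def ensure_request_id(headers):
    # The edge (reverse proxy) assigns an id if one isn't present;
    # every downstream hop reuses the same header value.
    rid = headers.get(HEADER) or uuid.uuid4().hex
    headers[HEADER] = rid
    return rid

def log(rid, msg):
    # Including the id in every log line is what makes the trace
    # greppable end to end.
    print(f"request_id={rid} {msg}")

headers = {}
rid = ensure_request_id(headers)   # proxy assigns it...
log(rid, "proxy: forwarding to app")
rid2 = ensure_request_id(headers)  # ...services opt in and reuse it
log(rid2, "app: writing to database")
assert rid == rid2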
I agree with 2. I have a presentation at https://www.polibyte.com/2019/02/10/pytennessee-presentation... which goes into how to get started with Prometheus. Not as in "how to set it up", but more about what to instrument and why, how to name things, etc. Despite the title, there's very little in it that's specific to Python.
Why should you have to use Prometheus? There are plenty of options, and good reasons why you might want to push data rather than pull. Measurements should minimize perturbation of the system being measured, and the (computer) system generating data is likely best placed to determine when and how, when that matters -- e.g. in HPC, where jitter is important.
Gavin from Zebrium here. Completely concur with #1. We are big advocates of writing good logs and not having to worry about structured vs unstructured (and even if you structure your logs, you'll still probably have to deal with unstructured logs in third party components).
Logs are great, but only once you've identified the problem. If you are searching through logs to _find_ a problem, it's far too late.
Processing/streaming logs to get metrics is a terrible waste of time, energy and money. Spend that effort producing high-quality metrics directly from the apps you are looking after/writing/decomming (example: don't use access logs to collect 4xx/5xx and make a graph; collate and push the metrics directly).
Raw metrics are pretty useless. They need to be manipulated into business goals: "service x is producing 3% 5xx errors" vs "% of visitors unable to perform action x".
Alerts must be actionable.
Alert rules must be based on sensible, clear-cut rules: "service x's response time is breaching its SLA", not "service x's response time is double its average for this time in May".
> Processing/streaming logs to get metrics is a terrible waste of time, energy and money. Spend that effort producing high-quality metrics directly from the apps you are looking after/writing/decomming
Yeah nah, but, okay, nah yeah.
Generating metrics in the app is much more intrusive, and requires that you figure out the metrics you need ahead of time. It adds dependencies, sockets, and threads to your app.
Unless you're very careful, it's also easy to end up double-aggregating, computing medians of medians and other meaningless pseudo-statistics - if you're using the Dropwizard Metrics library, for example, you've already lost.
If you output structured log events, where everything is JSON or whatever and there are common schema elements, you can easily pull out the metrics you need, configure new ones on the fly, and retrospectively calculate them if you keep a window of log history.
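The post-calculation idea is easy to show. Assuming JSON-per-line events with a shared schema (field names here are made up), a metric nobody pre-declared falls straight out of the log stream:

```python
import json
import statistics

# A window of structured log events; in reality this would be a tail
# of recent log history.
events = [
    json.dumps({"evt": "http_request", "method": "GET", "ms": 12}),
    json.dumps({"evt": "http_request", "method": "POST", "ms": 480}),
    json.dumps({"evt": "http_request", "method": "GET", "ms": 30}),
]

# Define the metric after the fact: no app change, no redeploy, and it
# can be recomputed retrospectively over kept history.
durations = [e["ms"] for e in map(json.loads, events) if e["evt"] == "http_request"]
print(statistics.median(durations))  # 30
```

The same window supports any new definition you think of later (per-method percentiles, error ratios, etc.), which is exactly what pre-aggregated metrics can't give you.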
When I've worked on systems with both pre- and post-calculated metrics, the post-calculated metrics were vastly more useful.
The huge, virtually showstopping, caveat here is that there is lots of decent, easy-to-use tooling for pre-calculated metrics, and next to nothing for post-calculated metrics. You can drop in some libraries and stand up a couple of servers and have traditional metrics going in a day, with time for a few games of table tennis. You need to build and bodge a terrifying pile of stuff to get post-calculated metrics going.
Anyway if there's a VC reading this with twenty million quid burning a hole in their pocket who isn't fussy about investing in companies with absolutely no path to profitability, let me know, and i'll do a startup to fix all this. I'll even put the metrics on the blockchain for you, guaranteed street cred.
> Unless you're very careful, it's also easy to end up double-aggregating,
Oh no, never do anything fancy on the client end. Yeah, that's total trash. Any client that does any kind of aggregating is a massive pain in the arse.
Counters are good enough for 90% of everything you want. You can turn counters into hits per second easily. Plus they are more resistant to time-based averaging. If you do your stats correctly, you can even have resetting counters create nice smooth graphs (non-negative derivatives are a godsend).
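A sketch of what a non-negative derivative does with a counter that resets (e.g. the process restarts), assuming samples taken at a fixed interval:

```python
def rates(samples, interval=10):
    """Turn successive counter readings (taken `interval` seconds
    apart) into per-second rates. A decrease is treated as a counter
    reset, not a negative rate: the best guess for the delta is the
    new value itself, which keeps the graph smooth."""
    out = []
    for prev, cur in zip(samples, samples[1:]):
        delta = cur - prev
        if delta < 0:          # counter reset detected
            delta = cur
        out.append(delta / interval)
    return out

# Counter climbs, resets to near zero, climbs again:
print(rates([100, 150, 200, 5, 55]))  # [5.0, 5.0, 0.5, 5.0]
```

Without the reset handling, the third point would be -19.5/s and the graph would show a huge negative spike on every restart.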
Yes, this is a library that argues strongly against the use of metrics. From what I recall, one node of Cassandra will output close to 50,000 metrics by default. That is too much.
When a team I worked with was migrating away from Splunk to graphite/grafana, they shat out something close to a million metrics. 99.8% were totally useless.
> You need to build and bodge a terrifying pile of stuff to get post-calculated metrics going.
Yes! I think that's my main objection. It's so bloody expensive to do post-hoc metrics. You can buy Splunk, but that's horrifically expensive. Or you can use an open source version and lose 4 person-years before you even get a graph.
> if you're using the Dropwizard Metrics library, for example, you've already lost.
Can you go into a bit more detail here? Curious to know where Dropwizard goes wrong.
I prefer to use the Prometheus client libraries where possible. Prometheus' data model is "richer" -- metric families and labels, rather than just named metrics. Adapting from Dropwizard to Prometheus is a pain, and never results in the data feeling "native" to Prometheus.
I think they just mean the host is aggregating, so any further aggregation compounds the distortion of the data. Like StatsD's default is shipping metrics every 10s, so if you graph it and your graph rolls up those data points into 10-minute data points (cuz you're viewing a week at once), then you're averaging an average. Or averaging a p95. People often miss that this is happening, and it can drastically change the narrative.
Yes, exactly this. It's the fact that you're doing aggregation in two places. Since you're always going to be aggregating on the backend, aggregating in the app is bad news.
It may be interesting to think about the class of aggregate metrics that you can safely aggregate. Totals can be summed. Counts can be summed. Maxima can be maxed. Minima can be minned. Histograms can be summed (but histograms are lossy). A pair of aggregatable metrics can be aggregated pairwise; a pair of a total and a count lets you find an average.
Medians and quantiles, though, can't be combined, and those are what we want most of the time.
Someone who loves functional programming can tell us if metrics in this class are monoids or what.
There is an unjustly obscure beast called a t-digest which is a bit like an adaptive histogram; it provides a way to aggregate numbers such that you can extract medians and quantiles, and the aggregates can be combined.
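A tiny demonstration of which aggregates combine safely (values made up; two "hosts" each report their own summary):

```python
import statistics

host_a = [1, 2, 100]
host_b = [3, 4, 5]

# Safe: a (total, count) pair per host combines exactly, and the
# overall mean falls out of the combined pair.
total = sum(host_a) + sum(host_b)
count = len(host_a) + len(host_b)
mean = total / count

# Safe: maxima can be maxed.
overall_max = max(max(host_a), max(host_b))

# Unsafe: a median of per-host medians is not the overall median.
median_of_medians = statistics.median(
    [statistics.median(host_a), statistics.median(host_b)]
)
true_median = statistics.median(host_a + host_b)
print(median_of_medians, true_median)  # 3.0 3.5 -- they disagree
```

(For the functional programmers: the safe ones are exactly the ones with an associative combine and an identity, i.e. monoids; medians and quantiles have no such combine, which is the gap the t-digest fills approximately.)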
The problem is post calculating is so slow. At least from my naive viewpoint. I can load dozens of graphs in datadog in seconds, can change tags or time frame and takes literally a second to load. Our Splunk dashboards can take over a minute to load, and reload for any change is more waiting.
Splunk taking minutes to load a dashboard is not a problem imposed by post-calculation; it's more a problem of a lack of schema.
Most post-calculation works on free-text logs and thus has to regex its way to a solution.
But it doesn't have to be that way; that's why the original poster talked about a lack of tooling in the post-calculation world
You're optimizing for the wrong thing. The hard part about this space isn't extracting value from data, it's physically shipping the data through the infrastructure and into the relevant systems. Metrics are so great compared to logs precisely because they're precalculated (read: highly compressed) before leaving the originating service.
Not for many people running such environments. You either see very minimal setups without tooling to help (which is survivable at small scale, but inefficient and mind-numbing), or towers of complexity that followed widely shared advice from people who always assume massive scale.
You can go the https://www.honeycomb.io/ way and make structured logs your metrics. It will cost you a lot in storage, but simplifies a lot. Just throw properly structured logs into storage, as long as you can query them efficiently (which Honeycomb provides).
I think the only times it ever really makes sense to use logs to generate metrics are fairly limited:
1. You haven't instrumented the application with metrics yet.
2. The logs are from a third-party tool that doesn't emit metrics.
3. The log format is well defined and doesn't change (I'd still prefer native metrics).
Otherwise the issue is that logging messages can and do change over the lifetime of an application. Relying on the content of the log for metrics becomes an implicit API that's not obvious to developers working on the code. I've seen issues of broken monitoring and alerting because a refactor changed log formatting and content. Much better to be explicit about metrics and instrument them directly.
Almost never. Structured logs are expensive in terms of infra, management, and query time. Storing logs just in case is much more expensive at any kind of scale compared to metrics alone.
A lot of it depends on what the service/program is meant to be doing.
If we take a proxying web service router, for example, listening on example.com/*, we would want metrics to tell us how well it's doing its specific job, and about any upstream services.
So for each service URL we'd want at least a hit count for 2xx, 3xx, specific 4xx and 5xx return codes. We'd also want the time taken to process that request.
We'd also probably want to know the total number of active connections to the back end, and total clients connected. Memory and CPU usage would also be a given.
From that we could easily ascertain the health of upstream services, the performance, and the total load (which is useful for autoscaling either the service router or the upstream apps).
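The per-URL counters described above are simple to sketch (names and URLs here are illustrative, not a real router's API):

```python
from collections import Counter

hits = Counter()

def record(url, status):
    # Count by URL and status class (2xx/3xx/4xx/5xx), which is the
    # low-cardinality grouping we actually alert on.
    hits[(url, f"{status // 100}xx")] += 1

for status in [200, 200, 200, 502, 200, 503, 404]:
    record("/api/users", status)

# Upstream health falls straight out of the counters:
total = sum(v for (url, _), v in hits.items() if url == "/api/users")
errors = hits[("/api/users", "5xx")]
print(f"5xx rate: {errors / total:.1%}")  # 28.6%
```

In a real service these would be Prometheus counters with url and status-class labels, and the error rate would be a query over them rather than computed in the app.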
I think it requires sitting down with a piece of paper and imagining your service/app breaking, and then working back to see how that would look. Once you've done that, you can figure out some counters to keep track of those things.
> Raw metrics are pretty useless. They need to be manipulated into business goals: service x is producing 3% 5xx errors vs % of visitors unable to perform action x
I think in general the business-goal metrics are OK, but you still need to keep the lower-level metrics as well; otherwise it would be more difficult to pinpoint the exact failure, and you will just know that a % of visitors is unable to perform action X. In a moderately complex system, a user-level action X is probably composed of several low-level services.
I was trying to get across that just because you collect metrics doesn't make them useful. I encourage people to generate metrics for everything; we can always join them together later to make something useful.
I think what I should have said is: "Collect metrics for everything, but be sure to display them in a way that's relevant to the customer."
Gavin from Zebrium here. We've found that logs can be a great source for detecting (and then describing) the long tail of unknown/unknowns (failure modes with unknown symptoms and causes), even when you don't know in advance what to monitor for. Our approach is to find these patterns in near real-time using ML. This blog by our CTO explains the tech with some good examples: https://www.zebrium.com/blog/is-autonomous-monitoring-the-an....
Agreed with most of what was said there.
Still, I find that people mostly treat the SLA as the only important thing to track for alerting and raising incidents. A lot has been said about the importance of defining solid SLIs (Service Level Indicators) that are aligned to SLOs (Service Level Objectives).
SLAs are usually given to external users of a SaaS; they're not very useful for the SRE team.
The Art of Monitoring covers most of this stuff in a unified manner.
You are introduced to some basics (push vs. pull monitoring), then proceed to simple system metrics collection (CPU, memory) via collectd, then to log ingestion, and end up extracting application-specific metrics from JVM and Python applications.
I highly recommend it, even for seasoned professionals.
I never see an important system management principle brought up: If you get a user complaint (for some value of "user") and not an alert, you should fix the monitoring system so that you don't get another occurrence of it or related problems. Obviously that's within reason, depending on the circumstances; the effort might not be worth it.
We log extensively. Here are some of my thoughts on it:
- at least in C++, the requirement to be able to log from pretty much anywhere can lead to messy code that either passes a reference to your logger to all classes that might possibly need it, or you've got an extern global somewhere. Yuck.
- logging can enable laziness. Being able to log that something weird happened can be considered a sufficient substitute for proper testing.
- logs are only as useful as the info they contain. This can mean state needs to be passed around all over the place just so that it can all be eventually logged on one line (it saves your data team from having to do a 'join')
- if your logger doesn't support cycling log files it's useless. If something goes wrong you can easily fill a disk.
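On the log-cycling point: if the app owns its log files, size-capped rotation is table stakes. A minimal sketch using Python's stdlib as the illustration (tiny maxBytes just to force rotation in the demo; real values would be megabytes):

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

# Cap each file and keep a fixed number of backups, so a misbehaving
# or chatty service can never fill the disk.
logdir = tempfile.mkdtemp()
path = os.path.join(logdir, "app.log")
logger = logging.getLogger("demo")
logger.addHandler(RotatingFileHandler(path, maxBytes=200, backupCount=2))
logger.setLevel(logging.INFO)

for i in range(50):
    logger.info("message %d with some padding to force rotation", i)

# Total disk usage is bounded: app.log plus at most 2 rotated backups.
print(sorted(os.listdir(logdir)))
```

The equivalent in C++ land would be a sink with the same cap-and-rotate behavior (or deferring rotation to logrotate/journald, per the sibling comment about loggers knowing as little as possible).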
2. Given a large enough system you will encounter situations where the only action you can take is to log "this really shouldn't happen" and try to roll back as cleanly as possible. This may be due to either complexity or a bug manifesting in a layer completely different than where it occurred (I've seen a null reference crash on "if(foo) foo->bar();" in the past)
4. I believe loggers should ideally know as little as possible about your logs. Logs can be rotated externally, can be buffered and sent to other hosts without touching the disk, can be ignored. Ideally, the system should care, not the app.
It’s weird to see the stuff by Jay Kreps (of Kafka ~fame~) listed in the logs section. His writing is specifically _not_ about logs the observability tool, but logs the data structure such as you’d see at the heart of a database.
> There is a large amount of “log” data generated at any sizable internet company. This data typically includes (1) user activity events corresponding to logins, pageviews, clicks, “likes”, sharing, comments, and search queries; (2) operational metrics such as service call stack, call latency, errors, and system metrics such as CPU, memory, network, or disk utilization on each machine. Log data has long been a component of analytics used to track user engagement, system utilization, and other metrics.
> We have built a novel messaging system for log processing called Kafka  that combines the benefits of traditional log aggregators and messaging systems....Kafka provides an API similar to a messaging system and allows applications to consume log events in real time.
A quote from the LinkedIn blog post linked in the article:
“But before we get too far let me clarify something that is a bit confusing. Every programmer is familiar with another definition of logging—the unstructured error messages or trace info an application might write out to a local file using syslog or log4j. For clarity I will call this "application logging". The application log is a degenerative form of the log concept I am describing”
Fair enough. But I don't think quoting this for logging in the tracing sense is wrong here. He does acknowledge that trace logs are a degenerative form of logs from the perspective of log processing. The only difference being in the semantics of human readable text v/s binary logs.
Very true. Jay Kreps's log is completely unrelated to the topic of this article. This added to my feeling that this "guide" is rather a collection of fragments put together without a real understanding of the subject on the author's part.
Is there an open source solution for processing streams of structured and unstructured logs and routing them onward? I see solutions for moving logs to Elastic or Kafka, but nothing for evaluating the logs.
This is a problem that is both solved again and again, but also all the available solutions are bad.
In my experience what happens is:
1. you start with a "ship logs from X to Y" product
2. you add more sources and more destinations, making it more of a central router. you add config options for specifying your sources and dests.
3. since the way you checkpoint or consume or pull or push certain sources or dests doesn't generalize, you end up buffering internally to present a unified "I have received / sent this message successfully" concept to your inputs and outputs.
4. you want to do some basic transforms on the logs as you go. you implement "filters" or "transforms" or "steps" and make them configurable. your config now describes a graph of sources -> filters -> dests
5. your filters need to be more flexible. you add generic filters whose behaviour is mostly controlled by their config options. your configs grow more complicated as you use multiple layers of differently-configured filters
6. you have a bad Turing-complete programming language embedded in your config file. getting simple tasks done is possible, getting complex tasks done becomes an awful, inefficient and unreadable mess.
My solution to this cycle has been to just write simple hard-coded applications that can only do the job I need them to do. If they need a different configuration later I edit the source. I'm writing my transforms in a real programming language and I avoid the additional complexity of abstractions. Of course, that comes with its own costs but I consider it well worth it.
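A sketch of what such a hard-coded shipper looks like in practice (the field names and the "checkout" enrichment are made up for illustration):

```python
import json

def transform(line):
    # One hard-coded transform: drop debug noise at the edge, enrich
    # the rest. Changing requirements means changing this function,
    # not fighting a config language.
    evt = json.loads(line)
    if evt.get("level") == "debug":
        return None
    evt["service"] = "checkout"
    return json.dumps(evt)

for line in ['{"level": "debug", "msg": "x"}',
             '{"level": "error", "msg": "boom"}']:
    out = transform(line)
    if out:
        print(out)  # only the enriched error event survives
```

In a real deployment the loop would read from the actual source (stdin, a file tail, a socket) and write to the actual destination, but the shape stays this small: one source, one dest, transforms in a real language.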
There are many more variants depending on how much complexity you are trying to apply. If you need to apply machine learning models, for example, you're probably going to end up with something similar to Apache Storm, though I don't know if its operational story has improved enough to consider it over other alternatives; I lost track years ago between Apache Spark and the half dozen other stream processing projects.
Security Onion doesn't route them onward - it will collect, aggregate, and provide you the tools to correlate/analyze logs across your environment. Enable the built-in network monitoring tools too and you have not only a powerful tool to help you with application management, but security as well (hence its namesake).
Beware - in peeling back the layers of your environment you can really get sucked in. I never seem to have enough hardware to do what I want with SO, but it's pretty amazing what you can do with it.
EDIT - wow, I'm a little shocked that no one else has brought Security Onion up. I guess they need to up their advertising game!
> Logging is critical to detecting attacks and intrusions.
Yes, but not universally - and just collecting logs will not take you far. Logging everything and trying to approach security via 'collect all data' is both expensive and inaccurate, and one of the major inefficiencies in modern cybersecurity.
Security Onion does an amazing job at collecting and correlating, especially for an open source product. The traditional trade-off with open source is there - a bit of up-front effort for longer-term value.
Recently, I was searching for a service which offers those functionalities on a very basic level. I tried several options and was really disappointed with all of them. The only one that I found to be usable was https://logdna.com/. I've now been using it for a couple of weeks and it works OK. It offers logging, alerts, metrics/dashboards, and some other things. And all that for a reasonable pricing.
If you don't need all the fancy metrics, and just want something simple to keep an eye on your services, alert you if they fail, and automatically restart them, check out my stealthcheck service. It's all of 150 lines of free-range, 0-dependency Go: