Metrics, logs and traces
Generally full-stack developers. They mainly use logs to debug errors or verify new functionality.
They use metrics to understand service health and logs to triage operational issues. They sometimes use traces for performance analysis (generally at later-stage companies).
SREs are closely related to DevOps but they get paid more. They are generally more involved with incident response and managing issues, and rely on logs and metrics to root-cause issues.
Relatively easy and well understood
Challenges:
- explosion in logs and metrics
- Kubernetes
Challenges:
- explosion in logs and metrics
Vendors:
- sst.dev (easy to deploy and basic monitoring)
- AWS SAM (local debugging)
Challenges:
- logs need to be combined with application logs
- need to correlate infra metrics with application metrics
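There are three primary ways that companies manage observability: the cloud provider's built-in tooling, a third-party observability vendor, or an observability stack maintained in house.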
The common path for most companies is to start with the cloud provider and then transition to an observability vendor. If the company is an infrastructure company, or its observability bill gets high enough, it will move to maintaining an observability stack in house.
For open source solutions, common setups include the following:
- for logs:
  - OpenSearch/Elasticsearch (de facto standard for a long time, ELK stack)
  - Loki (an easier-to-maintain alternative to Elasticsearch, by Grafana)
  - ClickHouse (faster aggregation; originated at Yandex, used by Uber for log analytics)
- for metrics:
  - Prometheus
  - Cortex/Thanos (scale Prometheus horizontally)
  - InfluxDB (smaller player)
- for traces:
  - Jaeger (standard open source tracing)
  - Grafana Tempo (new)
Prometheus is by far the most popular solution for metrics. The main challenge is horizontal scaling, which is where Cortex and Thanos come in: they add object storage backends such as S3 to Prometheus.
There are also open source Datadog alternatives that attempt to be end-to-end observability platforms:
- Opstrace: metrics and logs. Built on top of Prometheus, Cortex, and Loki. Acquired by GitLab.
- SigNoz: metrics, logs, and tracing. Built on top of ClickHouse.
- Grafana: metrics frontend. Has now built out an end-to-end observability suite with a cloud offering.
Common features for an observability platform
lifecycle management of data (active, archive, cold storage)
advanced
To build an observability platform, you need the following elements:
tracing frontend
metric client/agent
A client is a language-specific SDK that requires a code change to instrument. An agent is a daemon that runs alongside the system and automatically collects metrics/logs/traces from it.
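To make the client side of that distinction concrete, here is a minimal Python sketch. `MetricsClient` and its `timed` decorator are hypothetical names invented for illustration, not a real vendor SDK. The point is only that a client requires importing something and touching application code, whereas an agent would collect data without either.

```python
import time
from collections import defaultdict
from functools import wraps


class MetricsClient:
    """Hypothetical in-process metrics client (not a real vendor SDK)."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def increment(self, name, value=1):
        self.counters[name] += value

    def timed(self, name):
        """Decorator that records call count and duration for a function."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                try:
                    return func(*args, **kwargs)
                finally:
                    # Record the call even if the function raised.
                    self.increment(f"{name}.calls")
                    self.timings[name].append(time.perf_counter() - start)
            return wrapper
        return decorator


metrics = MetricsClient()


# The code change: the application imports the client and wraps its handler.
@metrics.timed("checkout")
def handle_checkout(cart):
    return sum(cart)
```

A real SDK would also ship the recorded values to a backend; this sketch keeps them in memory so the instrumentation step itself stays visible.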
Special purpose monitoring - may or may not be included in general observability platforms
Not exactly monitoring but often grouped together with it. Setting breakpoints in remote services such as Lambdas and mobile devices.
Examples:
- appspector.com: mission control for remote iOS/Android/Flutter debugging
- sst.dev
Using a remote machine to impersonate a user. Meant to provide monitoring from an endpoint external to the service that you're running.
Examples:
- checklyhq.com: Open source E2E / synthetic monitoring and deep API monitoring for developers. Free plan with 5 users and 50k+ check runs.
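A synthetic check of the kind described above can be sketched in a few lines of Python. This is an illustrative sketch, not how any listed vendor implements it: `run_check`, `CheckResult`, and the default thresholds are assumptions. The fetcher is injectable so the check logic can be exercised without a network.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    url: str
    ok: bool
    status: Optional[int]
    latency_s: float
    error: Optional[str] = None


def http_fetch(url, timeout_s=5.0):
    """Default fetcher: issue a GET and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.status


def run_check(url, fetch=http_fetch, expected_status=200, max_latency_s=2.0):
    """Probe `url` from outside the service and judge status and latency."""
    start = time.monotonic()
    try:
        status = fetch(url)
    except Exception as exc:
        # Connection errors, timeouts, and HTTP error responses all count as failures.
        return CheckResult(url, False, None, time.monotonic() - start, str(exc))
    latency = time.monotonic() - start
    ok = status == expected_status and latency <= max_latency_s
    return CheckResult(url, ok, status, latency)
```

In practice a scheduler would call `run_check` from several regions on an interval and alert on consecutive failures.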
Using AI to automatically highlight issues. Usually built on top of existing metrics/logs rather than offered as a standalone service.
Examples:
- Datadog Watchdog
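As a rough illustration of automatically highlighting issues on top of existing metrics, the sketch below flags points that deviate sharply from a trailing window. A rolling z-score is a simple statistical stand-in chosen here for brevity; it is not how Datadog Watchdog or any other product works, and `find_anomalies`, the window size, and the threshold are all assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indexes whose value deviates from the preceding `window`
    points by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all is treated as anomalous.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical request latencies in milliseconds, with one spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 500, 101]
spikes = find_anomalies(latencies_ms)
```

Note that the spike inflates the window that follows it, so the points right after an anomaly are judged against a noisier baseline.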
There's usually been a split between application logs, infrastructure logs, and error logs. End-to-end observability platforms do all of it, but most providers only provide one of the three.
Error logs are generally associated with frontend applications. There's some sort of client-side error and the stack trace is captured and logged.
Examples:
- Sentry (incumbent)
- LogRocket
- Airbrake.io
Recently there have been new open source contenders like Highlight (YC).
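The capture step described above (an error occurs, the stack trace is captured and logged) can be sketched as follows. This is a generic illustration in Python rather than the API of Sentry or any other listed tool; `build_error_event`, `capture`, and the event fields are hypothetical.

```python
import json
import traceback
from datetime import datetime, timezone


def build_error_event(exc, context=None):
    """Turn an exception into a JSON-serializable error event."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "type": type(exc).__name__,
        "message": str(exc),
        # One string per formatted traceback line, innermost frame last.
        "stack": traceback.format_exception(type(exc), exc, exc.__traceback__),
        "context": context or {},
    }


def capture(func, *args, report=print, **kwargs):
    """Run `func`; on failure, report the serialized event and re-raise."""
    try:
        return func(*args, **kwargs)
    except Exception as exc:
        report(json.dumps(build_error_event(exc, {"function": func.__name__})))
        raise
```

Here `report` defaults to `print`; a real client would send the event to a collection endpoint instead.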
Error logging is also different from session replay/capture, where not just the logs but also a screen recording of the session is captured for later analysis. The incumbent in this space is FullStory.
There's an entire industry that does logs for compliance (SOC 2, FedRAMP, etc.). In AWS, the service that does this is CloudTrail, though CloudTrail by itself is not sufficient (you'll also need application-level logs).
Check if a system is up.
Use cases:
- cron jobs
- servers
- containers
- APIs
Examples:
- deadmanssnitch.com: Monitoring for cron jobs. 1 free snitch (monitor), more if you refer others to sign up.
- Pingmeter.com: 5 uptime monitors with a 10-minute interval. Monitor SSH, HTTP, HTTPS, and any custom TCP ports.
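For the cron job use case, uptime checking usually works in reverse: the job checks in after each run, and silence is the alert. The sketch below shows that idea in Python. `HeartbeatMonitor` and its methods are hypothetical and do not reflect the API of any service listed above; the clock is injectable so the logic can be tested without waiting.

```python
import time


class HeartbeatMonitor:
    """Hypothetical dead man's switch: a job is overdue if it has not
    checked in within its expected interval plus a grace period."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._jobs = {}  # name -> (interval_s, grace_s, last_ping_time)

    def register(self, name, interval_s, grace_s=0.0):
        # Registration counts as the first check-in.
        self._jobs[name] = (interval_s, grace_s, self._clock())

    def ping(self, name):
        """Called by the job at the end of each successful run."""
        interval_s, grace_s, _ = self._jobs[name]
        self._jobs[name] = (interval_s, grace_s, self._clock())

    def overdue(self):
        """Return the sorted names of jobs that have missed their window."""
        now = self._clock()
        return sorted(
            name
            for name, (interval_s, grace_s, last) in self._jobs.items()
            if now - last > interval_s + grace_s
        )
```

A cron job would call `ping` as its last step, and a separate loop would poll `overdue()` and alert on anything it returns.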
There's an entire industry around monitoring cloud costs. Observability providers like Grafana even dedicate specific dashboards to this.
cloud-native.slack.com
Created 2023-06-18T22:20:05.893000, updated 2023-06-27T14:49:00.204000