Anyone else getting some random errors and 500s on here recently?

  • Speex@sh.itjust.works
    link
    fedilink
    English
    arrow-up
    1
    ·
    1 year ago

    I can give a brief(ish) overview sure.

    Monitor everything :P

    But really monitor meaningfully. CPU usage matters but a high CPU usage doesn’t indicate an issue. High load doesn’t mean an issue.
    High CPU for a long period of time or outside normal time frames does mean something. High load outside normal usage times could indicate an issue. Or when the service isn’t running. Understand your key metrics and what they mean to failures, end user experience, and business expectation.

    Start all projects with monitoring in mind, the earlier to you begin monitoring the easier it is to implement. Re configuring code and infrastructure after the fact is a lot of technical debt. If you are willing and can guarantee that debt will be handled at a later time then good luck. But we know how projects go.

    Assign flags to calls. If your application runs results in a response that’s started from and ends up at an end user, Send an identifying flag. Let that flag travel the entire call and you are able to break down traces and find failures… Failures don’t have to be in error outs, time outs. A call that takes 10x longer than the rest of the calls can cascade and shows the inefficiency and realiability.

    Spend time on log and error handling. These are your gatekeepers to troubleshooting. The more time spent upfront making them valuable, the less time you have to look at them when shit hits the fan.

    Alerts and Monitors MUST mean something. Alert fatigue is real, you experience it everyday I’m sure. That email that comes in that has some kind of daily/weekly status information that gets right clicked and marked as read. That’s alert fatigue. Alerts should be made in a way that scales.

    • Take a Look as a time allows - logs with potential issues
    • Investigate as something could be wrong - warnings
    • Shits down fix it - Alert

    APM matters Collect that data, you want to see everything from processor to response times, latency, and performance. These metrics will help you identify not only alerting opportunities but also efficiency opportunities. We know users can be fickle. How long are people willing to sit and wait for a webpage to load…. Unlike the 1990’s 10-30 seconds is not groovy. Use the metrics and try to compare and marry them with business key performance indicators(KPI). What is the business side looking for to show things are successful. How can you use application metrics and server metrics to match their KPIs.

    Custom scripts are great. They are part of the cycle that companies go through.
    Custom scripts to monitor —> Too much not enough staff —> SAAS Solutions (Datadog, Solar Winds, Prometheus, Grafana, New Relic) —>. Company huge SAAS costs high and doesn’t accurately monitor our own custom applications —> and we’re back to custom scripts. Netflix, Google, Twitter all have custom monitoring tools.

    Many of the SAAS solutions are low cost and have options and even free tiers. The open source solutions also have excellent and industry level tools. All solutions require the team to actively work on them in a collaborative way. Buy in is required for successful monitoring, alerting, and incident response.

    Log everything, parse it all, win.