Observability and system monitoring: data, the new raw material

Today there is no doubt that data is the new raw material for large and medium-sized companies. Having information from our web systems and applications available at all times is essential to maximize application performance and availability, and to respond quickly to incidents.

In this article, we will talk about Observability and Monitoring tests, which are essential to guarantee that systems operate effectively, safely and reliably. They provide critical information that allows organizations to detect, diagnose and resolve problems more efficiently, optimize performance and plan for the future. They are also essential to meet regulatory requirements and ensure the security of systems and data.

At a company level, the tools we choose to observe and monitor our systems must meet these objectives. Do you want to know how to choose the right ones?

Keep reading!

Basic concepts about Observability

In a tennis match, we always assume that the ball will be hit cleanly, but sometimes, unexpectedly, it clips the top of the net. This metaphor reflects the sometimes unpredictable nature of software: just as the ball can fall forward or backward, applications can behave in unexpected ways, generating costs and even significant losses for companies.

The ability to understand and control what is happening in our systems becomes crucial to be able to take preventive measures.

To start, let's look at some basic concepts to know:

  • We can define Observability as the ability to understand what is happening in a complex system, in real time if possible.
  • Controllability is the property that indicates whether the behavior of a system can be controlled through its inputs.
  • Finally, Monitoring is the ability to interpret data and alert on deviations from the expected output.

Let's apply these definitions to a practical example: a car driver.

Observability is the driver's ability to see the state of the road and judge whether to go faster or slower. Controllability is the brake and accelerator pedals: when we press one of them, the car responds by slowing down or speeding up. Monitoring is the set of alerts that the car shows, for example because we are running out of fuel, or because a tire has a puncture and its pressure is dropping. These alerts warn us to act on the car and correct deviations.
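To make the Monitoring part of the car example concrete, here is a minimal sketch in Python that compares readings against an expected range and returns an alert for each deviation. The names (`Expectation`, `check`) and the limits for fuel and tire pressure are assumptions chosen for illustration; they do not come from any real vehicle or tool.

```python
from dataclasses import dataclass


@dataclass
class Expectation:
    """Expected range for one reading (hypothetical structure)."""
    name: str
    minimum: float
    maximum: float


def check(readings: dict, expectations: list) -> list:
    """Return one alert message per reading that is missing or out of range."""
    alerts = []
    for exp in expectations:
        value = readings.get(exp.name)
        if value is None:
            alerts.append(f"{exp.name}: no reading available")
        elif not exp.minimum <= value <= exp.maximum:
            alerts.append(
                f"{exp.name}: {value} outside [{exp.minimum}, {exp.maximum}]"
            )
    return alerts


# Illustrative limits only; real values depend on the car.
EXPECTATIONS = [
    Expectation("fuel_litres", 5.0, 50.0),
    Expectation("tyre_pressure_bar", 2.0, 2.6),
]

# Low fuel triggers an alert; the tire pressure is within range.
print(check({"fuel_litres": 3.2, "tyre_pressure_bar": 2.3}, EXPECTATIONS))
```

The same pattern applies to an application: the readings would be values such as response time or error count, and the alerts would be routed to whoever can act on them.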

In the world of technology, these capabilities translate into understanding what is happening in our applications, monitoring their behavior, and receiving alerts about potential problems. In this sense, traces, logs and metrics become fundamental elements for effective monitoring, and we need to know them:

  • Logs: Discrete events enriched with context, essential for detecting errors.
  • Metrics: Numeric measurements aggregated over time, usually displayed as graphs or counters, that help make performance-based decisions.
  • Traces: Record of the life cycle of a request as it passes through the system, usually shown visually, essential for understanding performance and debugging.
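The difference between the three signals is easier to see side by side. The following sketch uses only the Python standard library and is not the API of any observability product; the function names (`log_event`, `span`, `handle_request`) and the in-memory `metrics` and `spans` stores are assumptions made for the example. A real setup would send this data to a collector instead.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

metrics = Counter()  # metric store: aggregated numbers
spans = []           # trace store: timed steps that share a trace id


def log_event(level, message, **context):
    """Log: one discrete event enriched with context, printed as a JSON line."""
    record = {"ts": time.time(), "level": level, "message": message, **context}
    print(json.dumps(record))
    return record


@contextmanager
def span(trace_id, name):
    """Trace: record how long one step of a request took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_request(order_id):
    """Hypothetical request handler that emits all three signals."""
    trace_id = uuid.uuid4().hex
    metrics["requests_total"] += 1            # metric: count requests
    with span(trace_id, "handle_request"):    # trace: whole request
        with span(trace_id, "load_order"):    # trace: one inner step
            time.sleep(0.01)                  # stand-in for real work
        log_event("INFO", "order loaded", order_id=order_id, trace_id=trace_id)
    return trace_id
```

Calling `handle_request(42)` increments one counter, prints one log line, and stores two spans with the same `trace_id`, which is what lets a tracing tool rebuild the path of the request.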

 

Maximizing performance and availability

The acquisition of Observability tools should seek to maximize the performance and availability of the applications, understanding what the company's objectives are.

For example, it may be important for our organization to monitor aspects such as hosting, content delivery network, code optimization, or compatibility with mobile devices. At the application or web level, we can consider monitoring the physical and virtual infrastructure, bandwidth, changes made, logs, IP addresses, etc. Ultimately it will depend on the priorities and objectives of the organization.

But before choosing the tools we are going to rely on in our Observability and Monitoring practices, we must take into account three fundamental principles:

  • Curation and Partitioning: We should only measure what is useful. Just as we would not install surveillance cameras on a rarely traveled street, we should not measure data that adds nothing to our understanding of the system, since we would obtain useless information and consume resources unnecessarily.
  • Democratization: The information must be understandable to everyone. There is no point in having valuable data if only a few can understand it. The information must also be presented in an accessible manner.
  • Efficiency and Speed: The information should help us identify problems before they become errors. The ability to anticipate problems and act quickly is essential.
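Two of these principles can be sketched in a few lines of Python: keeping only the metrics that were agreed to be useful, and using a warning level below the critical level so that a problem is flagged before it becomes an error. The metric names, the allowlist, and the threshold values are all hypothetical and would have to be chosen by each organization.

```python
# Hypothetical curated allowlist: only these metrics are kept.
USEFUL_METRICS = {"latency_ms", "error_rate"}

# Hypothetical (warning, critical) thresholds per metric.
THRESHOLDS = {
    "latency_ms": (300.0, 1000.0),
    "error_rate": (0.01, 0.05),
}


def curate(samples):
    """Curation: drop every sample that is not on the allowlist."""
    return {k: v for k, v in samples.items() if k in USEFUL_METRICS}


def classify(name, value):
    """Early warning: report 'warning' before the value becomes 'critical'."""
    warning, critical = THRESHOLDS[name]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


def evaluate(samples):
    """Classify every curated sample."""
    return {name: classify(name, value) for name, value in curate(samples).items()}


# "mouse_moves" is ignored; latency is flagged before it reaches the critical level.
print(evaluate({"latency_ms": 450.0, "error_rate": 0.002, "mouse_moves": 9000}))
```

Returning plain words such as "ok", "warning" and "critical" instead of raw numbers is also a small step toward the Democratization principle, since the result can be read without knowing the thresholds.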

 

Which Observability Tools to choose

There are many tools on the market and as we said, choosing the right one will depend on the objectives and specific environment of each company.

In general terms, in addition to the three principles above, our choice should ensure that these tools offer fast response times and scalability. We should also check their own resource consumption, since it could harm the performance of the very applications being monitored.
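One common way to limit the cost of instrumentation is sampling: measuring only a fraction of the calls. The sketch below is an illustration in plain Python, not the mechanism of any particular tool; the decorator name `timed`, the `sample_every` parameter and the `checkout` function are assumptions made for the example.

```python
import functools
import time


def timed(sample_every=10):
    """Decorator that records the duration of 1 call in every `sample_every`."""
    def decorator(func):
        durations = []
        calls = 0

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal calls
            calls += 1
            if calls % sample_every != 0:
                # Unsampled call: no timing work is done.
                return func(*args, **kwargs)
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                durations.append(time.perf_counter() - start)

        wrapper.durations = durations  # expose the samples for inspection
        return wrapper
    return decorator


@timed(sample_every=10)
def checkout(total):
    """Hypothetical business function being monitored."""
    return total * 2
```

With `sample_every=10`, one hundred calls to `checkout` produce ten recorded durations. The trade-off is that rare slow calls may be missed, so the sampling rate should be chosen with the monitoring objectives in mind.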

There are tools that help collect data, others that trace the actions of the different services involved in the system, and finally those that provide system log information and make it easier to find the errors that occur in our developments. Choosing these and configuring them together will give our organization all the necessary information about how its website or application is behaving. In this way we can measure times, correct errors, manage containers and Kubernetes operations, automate operational tasks, facilitate deployments, scale applications, and more. With the help of artificial intelligence and machine learning we can also draw conclusions from A/B tests, improve performance, etc.

In the video linked at the end, a large number of tools are shown. Grafana, DataDog and Splunk allow you to retrieve logs and data and perform distributed tracing, among other things. NextData, Elasticsearch and Kamon help pinpoint where a failure occurs. Most of the others monitor either the performance of the system, where Dynatrace, Collectd and Zipkin stand out, or the applications running in containers, where Jaeger and Kardia stand out. Those shown have an open-source or free version that can be upgraded to a premium version depending on the volume of data collected, data retention time, number of licenses, etc.

Finally, keep in mind that these tools evolve and improve every day, so it is also important to keep them updated, watch for new technologies in case the technology landscape changes, and carry out periodic audits to check that infrastructure monitoring remains adequate at all times.

Final conclusions: Beyond Observation, Action

Observability and Monitoring of systems are essential in today's digital world. From maximizing performance to anticipating problems and improving efficiency, these practices are the key to ensuring our applications operate optimally.

However, Monitoring is not just about observing, but about taking action. By acting on what we observe, we will be able to review the scope of the project, reinforce what is done well and reschedule activities that are not being carried out correctly.

As developers and IT professionals, we must take into account the tools available to us to stay ahead. Despite the initial cost of configuring these tools and the pay-per-use fees, obtaining real-time information by monitoring our systems, acting early to solve systematic failures and even predicting them before they occur will enable a quick and effective response that will undoubtedly minimize losses for the company.

Remember that an application crash is downtime, and that can be more expensive than the price of the tool. 

The price of light is less than the cost of darkness.

 

If you want to go deeper into the topic, here is the video of my 101PanelTechDays talk. In addition to reviewing everything discussed in this article, at the end I show a practical implementation with an example of four services in different languages, and the results obtained using collectors such as Jaeger and Temp under different circumstances in a real environment.

 


At Panel we carry out Observability and System Monitoring tests in our security services, QX Experience and Cloud & Infrastructure Services. If you want to know how, don't hesitate to contact us.

Fernando Simon

Fernando is a QA Engineer in the QX Experience unit of Panel Sistemas. You can visit his profile on LinkedIn.
