Data Correlation – Growing Data Points & Volume

Digital transformation is designed to deepen the interaction between businesses, their customers, and prospects. While it expands workflow agility, it also introduces a significant level of complexity because it demands a more agile Information Technology (IT) architecture. This erodes network and application visibility and creates a substantial volume of data and data points that require monitoring.

The digital world swells the volume of data that IT departments must handle. Apart from straining network and server resources, it also taxes the staff who must manually analyze the data while hunting for the root cause of an application or network performance problem. Furthermore, IT teams operate in silos, making it difficult to process the data coming from all the IT domains – this severely limits business velocity.

Technology Transformation

Conventional systems, while easy to troubleshoot and manage, do not meet today’s requirements, which has led to the introduction of an array of new technologies. The technology transformation umbrella includes virtualization, hybrid cloud, hyper-convergence, and containers, to name a few.

These technologies, while technically remarkable, pose an array of operationally complex monitoring tasks and increase both the volume of data and the number of data points. Today’s infrastructures comprise complex technologies and architectures. They entail a variety of sophisticated control planes consisting of next-generation routing along with new principles such as software-defined networking (SDN), network function virtualization (NFV), service chaining, and virtualization solutions.

Virtualization and service chaining introduce new layers of complexity that don’t follow the traditional monitoring rules. Service chaining does not adhere to the standard packet-forwarding paradigms, while virtualization hides layers of valuable information. Micro-segmentation changes the security paradigm, and virtual machine (VM) mobility introduces north-south and east-west traffic trombones. The VM on which the application sits now has mobility requirements and may move in an instant, either to a different on-premises data center or out to the hybrid cloud.

The hybrid cloud dissolves the traditional network perimeter and scatters data points across multiple locations. Containers and microservices introduce an entirely new wave of application complexity and data volume. Individual microservices must communicate with one another, even when they sit in geographically dispersed data centers.

All these waves of new technologies increase the number of data points and the volume of data by an order of magnitude. As a result, an IT organization needs to process millions of data points to correlate business transactions, such as invoices and orders, with the infrastructure they run on.

Growing Data Points & Volumes

As part of the digital transformation, organizations are launching more applications. More applications require additional infrastructure. As a result, the infrastructure keeps snowballing, and the number of data points you need to monitor grows with it.

Breaking up a monolithic system into smaller, fine-grained microservices adds complexity when it comes to monitoring the system in production. With a monolithic application, we have well-known and obvious starting points for an investigation. The world of microservices, by contrast, introduces many more data points to monitor and makes it harder to pinpoint latency or other performance-related problems. Human capacity, however, hasn’t changed – a person can correlate at most 100 data points per hour. The real challenge is that these data points are monitored in silos.

Containers are deployed to run software reliably as it moves from one computing environment to another, and they are often used to increase business agility. However, the increase in agility comes at a high cost – a containerized environment will generate 18x more data than a traditional one. Conventional systems may have a manageable set of data points, while a full-fledged container architecture could have millions.

The amount of data that must be correlated to support digital transformation far exceeds human capabilities. It’s simply too much for the human brain to handle. Traditional monitoring methods are not prepared to meet the demands of what is known as “big data”. While data volumes grow to unprecedented levels, visibility is decreasing due to the new style of application and the complexity of the underlying infrastructure. All of this is compounded by ineffective troubleshooting and team collaboration.

Ineffective Troubleshooting and Team Collaboration

The application rides on a variety of complex infrastructures and at some stage requires troubleshooting. There should be a science to troubleshooting, yet the majority of departments still work manually. When a troubleshooting event spans multiple data center segments – network, storage, database, and application – cross-team collaboration becomes a challenge.

IT workflows are complex, and a single request/response transaction touches every element of the supporting infrastructure – routers, servers, storage, databases, and so on. For example, an application request may traverse the web front ends in one segment and be processed by database and storage modules in different segments. It may also require firewalling or load-balancing services potentially located in different on- and off-premises data centers, as sketched below.
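To make this concrete, here is a minimal, purely illustrative Python sketch of how such a request path might be modeled. The device names, segments, and team boundaries are invented for the example and are not taken from any real environment or product.

```python
from dataclasses import dataclass

@dataclass
class Hop:
    """One infrastructure element touched by a request/response transaction."""
    device: str      # e.g. "edge-router-1" (hypothetical name)
    segment: str     # network, security, compute, database, storage, ...
    location: str    # on-premises data center or cloud region
    owner_team: str  # the silo responsible for this element

# Hypothetical path for the request described above: a web front end in one
# segment, firewalling and load balancing in another, and database and
# storage modules in a different data center.
request_path = [
    Hop("edge-router-1",  "network",  "dc-east", "network"),
    Hop("fw-cluster-a",   "security", "dc-east", "security"),
    Hop("lb-pool-web",    "network",  "dc-east", "network"),
    Hop("web-frontend-3", "compute",  "dc-east", "application"),
    Hop("db-primary",     "database", "dc-west", "database"),
    Hop("san-array-7",    "storage",  "dc-west", "storage"),
]

# Even this one transaction crosses several teams in two locations.
teams = sorted({hop.owner_team for hop in request_path})
print(f"{len(request_path)} hops across {len(teams)} teams: {', '.join(teams)}")
```

Even in this toy example, a single transaction spans several teams and two locations, which is exactly where the collaboration problems described next begin.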

IT departments will never have a single team overseeing all areas of the network, server, storage, database, and other infrastructure modules. The technical skill sets required are far too broad for any individual to handle efficiently. More often than not, multiple technical teams are distributed across a variety of skill levels, locations, time zones, and cultures.

Troubleshooting workflows between teams should be automated, but they are not, because monitoring and troubleshooting are carried out in silos that completely lack any data point correlation. The natural reaction is to add more people, which is nothing less than adding fuel to the fire. An efficient monitoring solution is the winning formula.

There is a growing lack of collaboration caused by silo boundaries that don’t even allow teams to look at each other’s environments. By the very design of the silos, engineers blame each other, because collaboration is not built into the way different technical teams communicate. Engineers make blunt statements – “it’s not my problem, it’s not my environment” – when in reality no one knows how to drill down and pinpoint the root cause.

When the application is facing downtime, Mean Time to Innocence becomes the de facto working practice. It’s all about how you can save yourself. Application complexity, compounded by the lack of efficient collaboration and the lack of a troubleshooting science, creates a pretty bleak picture.

How to Win the Race with Growing Data Points and Data Volumes?

How do we resolve all this mess and make sure the application is meeting its service level agreement (SLA) and operating at peak performance? The first thing you need to do is collect the data – not just from one domain but from all domains, and at the same time. Data must be collected from a variety of data points across all infrastructure modules, no matter how complicated they are.
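As a rough illustration of what “collect from all domains at the same time” can mean, the sketch below polls several hypothetical domain collectors in parallel and stamps every sample with a shared collection time so that later correlation can line the domains up. The collector functions and metric names are placeholders for illustration only, not a real API.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Placeholder collectors: in practice these would query SNMP counters, flow
# exporters, hypervisor APIs, database monitors, and so on.
def collect_network():  return {"domain": "network",  "if_util_pct": 72}
def collect_compute():  return {"domain": "compute",  "cpu_pct": 81}
def collect_storage():  return {"domain": "storage",  "latency_ms": 9.4}
def collect_database(): return {"domain": "database", "slow_queries": 14}

COLLECTORS = [collect_network, collect_compute, collect_storage, collect_database]

def collect_all_domains():
    """Pull one sample from every domain in parallel, tagged with one timestamp."""
    ts = time.time()
    with ThreadPoolExecutor(max_workers=len(COLLECTORS)) as pool:
        samples = list(pool.map(lambda collect: collect(), COLLECTORS))
    for sample in samples:
        # A shared timestamp is what makes later cross-domain correlation possible.
        sample["collected_at"] = ts
    return samples

if __name__ == "__main__":
    for sample in collect_all_domains():
        print(sample)
```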

Once the data is collected, application flows are detected and the application path is computed in real time. The data is extracted from all data center data points and correlated to determine the exact path and timing. The path visually presents the route the application takes and the devices it traverses. For example, the application path can instantly show that application A is flowing over a particular switch, router, firewall, load balancer, web frontend, and database server.
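A minimal sketch of that correlation step, under the assumption that each device exports per-application flow observations with timestamps, might look like the following. The record format and device names are invented for illustration and do not represent any particular product.

```python
from collections import defaultdict

# Hypothetical flow observations exported by individual devices.
# Each record: (application, device, role, timestamp in seconds)
observations = [
    ("app-A", "db-server-2",    "database",      10.031),
    ("app-A", "core-switch-1",  "switch",        10.002),
    ("app-A", "edge-router-3",  "router",        10.005),
    ("app-A", "fw-cluster-a",   "firewall",      10.009),
    ("app-A", "lb-pool-web",    "load balancer", 10.013),
    ("app-A", "web-frontend-1", "web frontend",  10.018),
]

def compute_paths(records):
    """Group observations by application and order them by time to form the path."""
    by_app = defaultdict(list)
    for app, device, role, ts in records:
        by_app[app].append((ts, device, role))
    return {
        app: [f"{role} ({device})" for _, device, role in sorted(hops)]
        for app, hops in by_app.items()
    }

for app, path in compute_paths(observations).items():
    print(f"{app}: " + " -> ".join(path))
# app-A: switch (core-switch-1) -> router (edge-router-3) -> firewall (fw-cluster-a)
#        -> load balancer (lb-pool-web) -> web frontend (web-frontend-1) -> database (db-server-2)
```

Sorting per-application observations by time is the simplest possible form of correlation; a production system would also match on flow identifiers and account for clock skew between devices.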

It’s An Application World

The application path defines which components of the infrastructure are being used and, in today’s environment, it changes dynamically. The application that rides over the infrastructure uses every element in the data center, including interconnects to the cloud and other off-premises locations, whether physical or virtualized.

When an issue degrades application performance, it can originate in any compartment or domain the application depends on. In a world that monitors everything but monitors it in silos, it’s difficult to understand the cause of an application problem quickly. The majority of time is spent isolating and identifying the problem rather than fixing it.

Imagine a monitoring solution helping a customer select the best coffee shop in which to order a cup of coffee. The customer has a variety of coffee shops to choose from, and in each there are a number of lanes. One lane could be blocked due to a spillage while another could be slow because the cashier is still in training. Wouldn’t it be great to have all this information up front before you left your house?

Economic Value

Time is money in two ways: first there is the direct cost, and then there is the damage to the company brand caused by poor application performance. Each device requires a number of basic data points to monitor, and these data points contribute to determining the overall health of the infrastructure.

Fifteen data points aren’t too bad to monitor, but what about a million? At 100 data points per hour, a single person would need 10,000 hours to work through them. These points must be observed and correlated across teams to determine application performance. Unfortunately, the traditional approach of monitoring in silos carries a high time cost.

Using traditional monitoring methods in the face of application downtime, the process of elimination and the answers are not easily placed in front of the engineer. That time has a cost. Given the amount of data today, it takes 4 hours on average to repair an outage, and the cost of an outage averages $300K. If lost revenue is included, the cost to the enterprise averages $5.6M. How long will it take, and what will it cost, if the amount of data increases 18x? A recent report states that only 21% of organizations can successfully troubleshoot within the first hour. That’s an expensive hour that could have been saved with the right monitoring solution in place.
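Purely as a back-of-the-envelope illustration using only the figures quoted above, the arithmetic below works out the implied cost per hour of an outage; it is not a forecast of how costs will scale once data volumes grow 18x.

```python
# Back-of-the-envelope arithmetic using the figures quoted in the article.
avg_repair_hours = 4                      # average time to repair an outage
avg_outage_cost = 300_000                 # average cost of an outage ($)
avg_cost_with_lost_revenue = 5_600_000    # average cost including lost revenue ($)

cost_per_hour = avg_outage_cost / avg_repair_hours
cost_per_hour_with_revenue = avg_cost_with_lost_revenue / avg_repair_hours

print(f"Implied cost per hour of outage: ${cost_per_hour:,.0f}")               # $75,000
print(f"Including lost revenue:          ${cost_per_hour_with_revenue:,.0f}")  # $1,400,000

# Only 21% of organizations troubleshoot within the first hour, so for the
# remaining 79% every extra hour of manual correlation adds roughly this much.
```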

There is real economic value in applying the correct monitoring solution to the problem and properly correlating across silos. What if a solution did all the correlation for you? The time value shrinks, because the system algorithmically carries out the heavy manual work on your behalf.

 
