Traditional network visibility tools give you the foundational data to see what’s going on in your network. Network visibility solutions are not new, and tools such as NetFlow and IPFIX have been around for a while. However, they give you only part of the picture. A newer way of looking at the network comes from the practice of distributed systems observability. Observability software engineering brings a different context to the meaning of the data, allowing you to examine your infrastructure and its applications from different and more interesting angles. It combines traditional network visibility with a platform approach, enabling robust network analysis and visibility alongside full-stack microservices observability.
Network Visibility Solutions
A key point: Network visibility solutions: Observability vs. monitoring
In this video, we will start by discussing how our approach to monitoring needs to adapt to the current megatrends, such as the rise of microservices. Failures are now unknown and unpredictable, which calls for network visibility solutions that encompass the reliability practices of distributed systems. A pre-defined monitoring dashboard will struggle to keep up with the rate of change and unknown failure modes. For this, we should have the practice of observability for software and monitoring for infrastructure.
A key point: Security threats with network analysis and visibility
Remember that performance problems are often a direct result of a security breach. So distributed systems observability goes hand in hand with networking and security. It does this by gathering as much data as possible, commonly known as machine data, from multiple data points. It then ingests the data and applies normalization and correlation techniques with some algorithm or statistical model to derive meaning.
Starting Network Visibility
Network visibility solutions
Combating the constantly evolving threat actor requires good network analysis and visibility, along with analytics into all areas of the infrastructure, especially host and user behaviour aligned with the traffic flowing between hosts. This is where we see machine learning (ML) and multiple analytical engines detect and respond to suspicious and malicious activity in the network. All of this is done against machine data that multiple tools have traditionally gathered and stored in separate databases. Adding context to previously unstructured data will allow you to extract all sorts of insights, which can be useful for security, network performance, and user behaviour monitoring.
System observability and data-driven visibility
The big difference between traditional network visibility and distributed systems observability is the difference between seeing what’s happening in your network and, more importantly, understanding why it’s happening. This empowers you to get to the root cause more quickly, be it a network or security-related incident. For all of this, we need to turn to data to find meaning. This is often called data-driven visibility, and it is needed in real time to maximize positive outcomes while minimizing or eliminating issues before they happen.
A key point: Machine data and observability
Data-driven visibility is derived from machine data. So, what is machine data? Machine data is everywhere, flows from all the devices we interact with, and makes up around 90% of today’s data. Harnessing this data can give you powerful insights for networking and security. Machine data can come in many formats, both structured and unstructured. As a result, it can be difficult to predict and process. When you find issues in machine data, you need to be able to fix them quickly, so you need to be able to pinpoint, correlate, and alert on specific events to save time.
So we need a platform that can perform network analysis and visibility instead of only using multiple tools dispersed throughout the network. A platform can take data from any tool and create an intelligent, searchable index. For example, a SIEM solution can create a searchable index for you. Network visibility solutions come in several forms, such as cloud-based or on-premises solutions.
Network Visibility Tools
Traditional, or legacy, network visibility tools give us the data we collect with SNMP, network flows, and IPFIX, and even from routing tables and geo-locations. To recap, IPFIX is an accounting technology that monitors traffic flows. IPFIX identifies the client, server, protocol, and port used, counts the number of bytes and packets, and sends that data to an IPFIX collector. Network traffic is the amount of data transmitted across a network over a specific period, and a flow is a set of packets that share the same identifying fields. Flow identification is performed based on five fields in the packet header: source IP address, destination IP address, protocol identifier, source port number, and destination port number.
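As a rough illustration of five-tuple flow identification, the sketch below groups packets by the five header fields and counts packets and bytes per flow, much as a flow exporter would before sending records to a collector. The packet dictionaries, field names, and sample addresses are assumptions made for this example; they are not an IPFIX record format.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    """The five packet-header fields that identify a flow."""
    src_ip: str
    dst_ip: str
    protocol: int  # IP protocol number, e.g. 6 = TCP, 17 = UDP
    src_port: int
    dst_port: int


def aggregate_flows(packets):
    """Group packets by five-tuple and count packets and bytes per flow."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for pkt in packets:
        key = FlowKey(pkt["src_ip"], pkt["dst_ip"], pkt["protocol"],
                      pkt["src_port"], pkt["dst_port"])
        flows[key]["packets"] += 1
        flows[key]["bytes"] += pkt["length"]
    return dict(flows)


# Hypothetical packets; the addresses come from documentation ranges.
sample_packets = [
    {"src_ip": "192.0.2.10", "dst_ip": "198.51.100.5", "protocol": 6,
     "src_port": 40000, "dst_port": 443, "length": 1500},
    {"src_ip": "192.0.2.10", "dst_ip": "198.51.100.5", "protocol": 6,
     "src_port": 40000, "dst_port": 443, "length": 400},
    {"src_ip": "192.0.2.10", "dst_ip": "198.51.100.53", "protocol": 17,
     "src_port": 53000, "dst_port": 53, "length": 80},
]

flow_table = aggregate_flows(sample_packets)
for key, counters in flow_table.items():
    print(key, counters)
```

The two HTTPS packets collapse into one flow record while the DNS packet forms its own, which is why flow data is compact but carries less detail than the packets themselves.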
Then we have SNMP, a networking protocol used to manage and monitor network-connected devices. SNMP is embedded in many network devices. None of these technologies is going away, but the data they produce must be correlated and connected.
Traditional network visibility tools:
- Populate charts and create baselines
From this data, we can start to implement network security. First, we can create baselines, identify anomalies, and start to organize network activity. Alerts are triggered when thresholds are met, so we get an alert that a router is down or that an application is not performing as expected. This can be real-time or historical. This works well for the previous way of doing things. However, when an application is not performing well, a threshold tells you nothing about why; you need to be able to see the full path and how each part of the transaction is behaving.
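To make the baseline and threshold idea concrete, here is a minimal sketch that builds a baseline from historical samples and flags values that deviate from it. The three-standard-deviation rule and the sample utilization figures are assumptions chosen for illustration; real tools may use quite different models.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Summarize historical samples as (mean, sample standard deviation)."""
    return mean(history), stdev(history)


def is_anomalous(value, baseline, num_sigmas=3.0):
    """Flag a value that is more than num_sigmas deviations from the mean."""
    average, deviation = baseline
    return abs(value - average) > num_sigmas * deviation


# Hypothetical interface utilization samples, in percent.
history = [40, 42, 38, 41, 39, 43, 40, 37]
baseline = build_baseline(history)

for reading in (44, 80):
    status = "ALERT" if is_anomalous(reading, baseline) else "ok"
    print(f"utilization {reading}%: {status}")
```

Note what the alert does and does not say: it reports that utilization left its normal range, but nothing about which transaction or dependency caused it.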
All of this data was used to populate charts and graphs. These dashboards rely on known problems that we have seen in the past. However, today’s networks fail in creative ways, often referred to as unknown/unknown, calling for a new approach to distributed systems observability that Site Reliability Engineering (SRE) teams employ.
Observability Software Engineering
To start an observability project, we need diverse data and visibility to see the various things going on today. We don’t just have known problems anymore. We have a mix of problems that we have not seen before. Networks fail in creative ways, some of which have never happened before. So we need to look at the network differently, with new and old network visibility tools and the practices of observability software engineering. We need to diversify our data so we have multiple perspectives and can better understand what we are looking at. And this can only be done with a distributed systems observability platform. So what does this platform need?
Network analysis and visibility: Multiple data types and point solutions
So, we need to get as much data as possible from all network visibility tools, such as flows, SNMP, IPFIX, routing tables, packets, telemetry, metrics, logs, and traces. These are all things that we are familiar with and have used in the past, and each data type provides a different perspective of what is going on. However, the main drawback of not using a platform is that it lends itself to a series of point solutions, leaving gaps in network visibility. We end up with a database for each one. So, we could have a database of network traffic flow information for application visibility and another database for SNMP. The issue with point solutions is that you can’t see everything. Each point solution acts as its own island of visibility, and you will have difficulty understanding what is happening. At a bare minimum, you should have some automation between all these devices.
A key point: The use of automation as the starting point
Automation could be used to glue everything together. Ansible, for example, comes in two variants: a CLI version known as Ansible Core and a platform-based approach with Ansible Tower. Automation does not provide visibility, but it is a starting point to glue together the different point solutions and increase network visibility. For example, you could collect all logs from all firewall devices and send them to a backend for analysis. Ansible variables are recommended, and you can use Ansible inventory variables to fine-tune how you connect to your managed assets. In addition, variables bring a lot of benefits and modularity to Ansible playbooks.
Isolated monitoring for effective network analysis and visibility
I know what happens on my LAN, but what happens in my service provider networks? I can see VPC flows from a single cloud provider, but what is happening in my multi-cloud designs? I can see what is happening in my interface states, but what is happening in my overlay networks? For SD-WAN monitoring, if a performance problem with one of my applications or a bad user experience is reported from a remote office, how do we map this back to tunnels? So we have pieces of information, but we are missing the end-to-end picture. For additional information on monitoring and visibility in SD-WAN environments, check out this SD-WAN tutorial.
The issue without data correlation?
The issue is this: how do we find out if there is a problem when we have to search through multiple databases and dashboards? And when there is a problem, how do you correlate the data to determine the root cause? What if you have tons of logs and have to figure out that this interface utilization correlates with this slow DNS lookup time, which correlates with a change in BGP configuration? So you can see everything with traditional or legacy visibility, but how do you go beyond that? How do you know why something has happened? This is where distributed systems observability and the practices of observability software engineering come in, giving you full-stack observability, together with network visibility solutions, into all angles of the network.
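One simple way to test whether two time series move together, such as interface utilization and DNS lookup time, is the Pearson correlation coefficient. The sketch below computes it by hand over hypothetical, time-aligned samples. It assumes both series were sampled at the same moments, and a high coefficient only suggests a relationship worth investigating; it does not prove a cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with 2+ samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical samples taken at the same six points in time.
interface_utilization_pct = [30, 35, 50, 70, 85, 90]
dns_lookup_ms = [20, 22, 30, 55, 80, 95]

r = pearson(interface_utilization_pct, dns_lookup_ms)
print(f"correlation between utilization and DNS latency: {r:.2f}")
```

Doing this by hand across every pair of metrics in several separate databases does not scale, which is the gap a correlating platform is meant to close.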
Distributed Systems Observability: Seeing is believing
There is a difference between seeing and understanding. Traditional network visibility solutions let you see what’s happening on your networks. Observability, on the other hand, helps you understand why it is happening. So with observability, we are not replacing network visibility; we are augmenting it with a distributed systems observability platform that lets us connect all the dots to form a complete picture. With a distributed systems observability platform, we still collect the same information: for example, routing information, network traffic, VPC flow logs, and the results from synthetic tests, along with metrics, traces, and logs. But now we have several additional steps of normalization and correlation that the platform takes care of for you.
Distributed systems observability and normalization
Interface statistics could be in packets per second; flow data might be expressed as a percentage of traffic, such as 10% of traffic being DNS. We then have to normalize and correlate the data to make sense of what is happening for the entire business transaction. So the first step is to ingest as much data as possible, identify or tag the data, and then normalize it. Keep in mind this could be short-lived data, such as interface statistics.
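The sketch below shows one possible normalization step: converting an interface reading in packets per second and a flow share in percent into the same unit, bits per second, inside a common tagged record. The record layout, the tag names, and the assumed average packet size are all hypothetical choices for this example.

```python
def normalize_interface(name, packets_per_s, avg_packet_bytes):
    """Convert a packets-per-second reading into bits per second.

    avg_packet_bytes is an assumed average packet size; a real pipeline
    would take byte counters from the device instead of estimating.
    """
    return {
        "entity": name,
        "tags": {"source": "interface-stats"},
        "metric": "throughput_bps",
        "value": packets_per_s * avg_packet_bytes * 8,
    }


def normalize_flow_share(application, share_pct, total_bps):
    """Convert a percentage-of-traffic figure into bits per second."""
    return {
        "entity": application,
        "tags": {"source": "flow-data"},
        "metric": "throughput_bps",
        "value": total_bps * share_pct / 100,
    }


# Hypothetical readings: 1000 packets/s at ~500 bytes each, 10% of it DNS.
link = normalize_interface("uplink-1", packets_per_s=1000, avg_packet_bytes=500)
dns = normalize_flow_share("dns", share_pct=10, total_bps=link["value"])

for record in (link, dns):
    print(record["entity"], record["metric"], record["value"])
```

Once both records share a unit and a schema, they can be compared and correlated directly instead of living in separate tools with separate scales.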
Applying machine learning algorithms
All these different types of data are ingested, normalized, and correlated, and this cannot be done by a human. Distributed systems observability gives you useful, actionable intelligence that automates root cause analysis and measures network health by applying machine learning algorithms. We will discuss these machine learning algorithms and statistical models in just a moment. Supervised and unsupervised machine learning is used heavily in the security world. So, in summary, for effective network analysis and visibility, we need to do the following:
Summary Point 1
We must ingest a large amount of data from many sources and of many types
Summary Point 2
Automate baseline and anomaly detection and make this more accurate
Summary Point 3
Accurately group data and create structure out of large amounts of unstructured data
Summary Point 4
Then, correlate data to learn how everything is related to each other
- This will give you full stack observability for enhanced network visibility that traditional network visibility tools cannot give you.
Full Stack Observability
Let us briefly describe the transitions we have gone through and why we must address full stack observability. First, we had the monolithic application. The monolithic application is still very much alive today, and this is where the mission-critical systems live. We then moved to the cloud and started adopting containers and platforms. Then there was a drive to re-architect the code and start from scratch with cloud-native, and now with observability. With the move to containers and Kubernetes, monitoring becomes more important. Why? Because the environments are dynamic, and you need to embed security somehow.
The traditional world of normality
In the past, network analysis and visibility were simple. Applications ran in single private data centers, potentially two data centers for high availability. These data centers were on-premises, and all components were housed internally. In addition, the network and infrastructure were pretty static, and there weren’t many changes to the stack; certainly not daily. However, nowadays, we are in a different environment where we have complex and distributed applications. These have components and services located in many different places and types of places, on-premises and in the cloud, and they depend on both local and remote services.
A key point: The wave of containers and its effect on the network analysis and visibility
There has been a considerable rise in the use of containers. The container wave introduces dynamic environments with cloud-like behaviour where you can scale up and down very quickly and easily. We have ephemeral components: things inside containers and services that are constantly coming up and going down. So the paths and transactions are not only complex but also shifting. An application needs multiple steps or services to work, commonly known as a business transaction. You should strive for automatic discovery of business transactions and application topology maps that show how the traffic flows.
A key point: The wave of Microservices and its effect on the network analysis and visibility
So with the wave towards microservices, we get the benefits of scalability and business continuity, but managing them is very complex. In addition, what used to be method calls or interprocess calls within the monolith’s host now go over the network and are susceptible to deviations in latency.
The issue of silo-based monitoring
With all these new waves of microservices and containers, we have an issue with silo-based monitoring: poor network analysis and visibility in a very distributed environment. So let us look at an example of trying to isolate a problem with traditional network visibility and monitoring. For mobile or web, the checkout is slow; for the application, there could be JVM performance issues. On the database, we could have a slow SQL query; on the network side of things, we have an interface running at 80% utilization. Each silo in traditional network visibility and monitoring has its own tools, but nothing is connected. So how do you quickly get to the root cause of this problem?
Network visibility solutions
Monitoring is based on event alerts and dashboards, all of which are populated with passive (sampled) data. It is also per domain. However, we have very complex, distributed, and hybrid environments. So we have a hybrid notion, both from a code perspective and in terms of physical location, with cloud-native solutions. The way you consume APIs will be different in each location, too. How you consume and authenticate to an API for SaaS will differ from on-premises and from the cloud. Keep in mind that API security is a top concern.
So with our network visibility solutions, we must support all the journeys in a complex and distributed world. We need full-stack observability and observability software engineering to see what is happening in each domain and to know what is happening in real time. So instead of being passive with data, we are active with metrics, logs, traces, and events, along with any other types of data we can ingest. If there is a network problem, we ingest all network-related data. The same goes for security: if there is a security problem, we ingest all security-related information.
- Example: Getting hit by Malware
So if malware hits you, you need to be able to detect the affected container quickly. Then, you need to prevent remote code execution attempts from succeeding while putting the affected server in quarantine for patching. So there are several stages you need to perform, and the security breach affects you across different domains and teams. The topology changes too. The backend and front end will change, so we must re-route traffic while maintaining performance. To solve this, we need to analyze different types of data.
The different data types
So you need to ingest as much telemetry data as possible: application, security, VPC, VNet, and even Internet statistics. All of this data is collected via automation in the form of metrics, events, logs, and distributed tracing based on OpenTelemetry.
- Metrics: Metrics are aggregated measurements grouped or collected at regular intervals or over a given period. For example, there could be a 1-minute aggregate, so some details are lost. Aggregation helps you save on storage but requires proper pre-planning on which metrics to consider.
- Events: Events are discrete actions happening at a moment in time. The more metadata associated with an event, the better. Events are helpful to confirm that particular actions occurred at a particular time.
- Logs: Logs are detailed and have timestamps associated with them. These can either be structured or unstructured. As a result, logs are very versatile and empower many use cases.
- Traces: Traces follow events as they pass between different components in the application. For example: this item was purchased via credit card at this time, and it took 37 seconds to complete the transaction. All chain details and dependencies are part of the trace. Traces allow you to follow what is going on.
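The trade-off described for metrics, where aggregation saves storage but loses detail, can be seen in a small sketch. It rolls raw latency samples into 1-minute buckets; the sample values and the choice of statistics kept per bucket are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_per_minute(samples):
    """Roll (epoch_seconds, value) samples up into 1-minute buckets."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = timestamp - timestamp % 60
        buckets[bucket_start].append(value)
    return {
        start: {"avg": mean(values), "max": max(values), "count": len(values)}
        for start, values in sorted(buckets.items())
    }


# Hypothetical latency samples in milliseconds, one every 10 seconds.
raw_samples = [
    (0, 20), (10, 20), (20, 20), (30, 500), (40, 20), (50, 20),
    (60, 20), (70, 22),
]

per_minute = aggregate_per_minute(raw_samples)
for start, stats in per_minute.items():
    print(start, stats)
```

If only the average were stored, the first minute would read as 100 ms and the single 500 ms spike would be invisible. Keeping the maximum alongside the average is one example of the pre-planning the Metrics entry refers to.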
So in the case of malware detection, this is where a combination of metrics, traces, and logs would have helped you, and switching between views and having automated correlation will help you get to the root cause. But you also need to detect and respond appropriately, leading us to secure network analytics.
Secure Network Analytics
For this, we need good secure network analytics for visibility and detection, and then we need to respond in the best way. There are several different types of analytical engines that can be used to detect a threat. In the last few years, we have seen an increase in the talk and drive around analytics and how it can be used in networking and security. Many vendors claim they do both supervised and unsupervised machine learning, both of which are used in the detection phase.
Algorithms and statistical models
For analytics, we have algorithms and statistical models. The algorithms and statistical models aim to achieve some outcome and are extremely useful in understanding domains that are constantly evolving with many variations. This is exactly what the security domain is, by definition. The threat landscape is evolving daily, so if you want to find these threats, you need to sift through a lot of data, commonly known as machine data, which we discussed at the start of the post.
For supervised machine learning, we take a piece of malware and build up a threat profile that can be gleaned from massive amounts of data. Then, when you see behaviour matching that profile, you can raise an alarm. But you need a lot of data to start with.
This can capture very evasive threats such as crypto mining. A cryptocurrency miner is a piece of software that uses your computer resources to mine cryptocurrency. On the network, a crypto mining event from a current miner looks like just a long-lived flow, so you need additional ways to gather more metrics to determine that this long-lived flow is malicious and is a cryptocurrency miner.
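As a first filtering step only, long-lived flows can be pulled out of flow records for closer inspection. The sketch below is not a miner detector: it simply lists flows above a hypothetical duration threshold and adds one extra metric, average bytes per second, that later analysis could use. The field names and the one-hour threshold are assumptions.

```python
def long_lived_candidates(flows, min_duration_s=3600):
    """Return flows lasting at least min_duration_s, longest first."""
    candidates = []
    for flow in flows:
        duration = flow["end_s"] - flow["start_s"]
        if duration <= 0 or duration < min_duration_s:
            continue  # skip short or malformed records
        candidates.append({
            "flow_id": flow["flow_id"],
            "duration_s": duration,
            "avg_bytes_per_s": flow["bytes"] / duration,
        })
    return sorted(candidates, key=lambda c: c["duration_s"], reverse=True)


# Hypothetical flow records: a web request, a backup, and a day-long trickle.
flows = [
    {"flow_id": "a", "start_s": 0, "end_s": 30, "bytes": 2_000_000},
    {"flow_id": "b", "start_s": 0, "end_s": 7200, "bytes": 36_000_000_000},
    {"flow_id": "c", "start_s": 0, "end_s": 86400, "bytes": 43_200_000},
]

candidates = long_lived_candidates(flows)
for candidate in candidates:
    print(candidate)
```

Both the backup and the day-long trickle pass the duration filter, which shows why duration alone is not enough and further metrics or models are needed to separate benign from malicious long-lived flows.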
Multilayer Machine Learning Model
So by their very nature, crypto mining and even Tor will escape most security controls. To capture these, you need a multilayer machine learning model that combines supervised and unsupervised learning. So, if you are on a standard network that blocks Tor, it will block it only about 70% of the time; the other 30% of the entry and exit nodes are unknown.
Machine Learning (ML)
Supervised and unsupervised machine learning gives you the additional visibility to find those unknown/unknowns, the unique situations that are lurking on your networks. Here we are making an observation, and these models will help you understand whether what is observed is normal. There are different observation triggers. First, there is known bad behaviour, such as a security policy violation or communication with a known C&C. Then we have anomaly conditions, which are observed behaviour that is different from normal. And we need to make these alerts meaningful to the business.
So if IP address 192.168.1.1/24 uploads a large amount of data, the alert should not just report the address. It should say that the PCI server is uploading a large amount of data to a known malicious external network, and these are the remediation options. The statement or alert needs to mean something to the business, so we need to express the algorithms in the language of the business. For example, there could be a behaviour profile on this host that does not expect it to download or upload anything.
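A minimal sketch of turning a raw observation into a business-readable alert might look like the following. The asset inventory, the list of malicious networks, and the message format are all hypothetical; in practice this context would come from other systems such as an asset database or a threat intelligence feed.

```python
import ipaddress

# Hypothetical asset inventory mapping host addresses to business roles.
ASSET_ROLES = {"192.168.1.1": "PCI server"}

# Hypothetical threat list; a documentation range stands in for real data.
KNOWN_MALICIOUS_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def describe_upload(src_ip, dst_ip, bytes_sent):
    """Express a large upload in terms the business understands."""
    role = ASSET_ROLES.get(src_ip, f"unclassified host {src_ip}")
    destination = ipaddress.ip_address(dst_ip)
    if any(destination in network for network in KNOWN_MALICIOUS_NETWORKS):
        target = "a known malicious external network"
    else:
        target = f"external host {dst_ip}"
    megabytes = bytes_sent / 1_000_000
    return f"{role} uploaded {megabytes:.1f} MB to {target}"


alert_text = describe_upload("192.168.1.1", "203.0.113.7", 2_500_000_000)
print(alert_text)
```

The detection logic is unchanged; only the wording is enriched with who the host is and why the destination matters, which is what makes the alert actionable.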
When events leave the system, you can enrich them with data from other systems. Enhancing the data inputs with additional telemetry from other sources gives them more meaning, so to help with your alarm, you can add information to the entity. There’s a lot of telemetry in the network. Most devices support NetFlow and IPFIX, and you can also have Encrypted Traffic Analytics (ETA) and Deep Packet Inspection (DPI).
You can get loads of useful insights from these different technologies. Additional data sources can give you usernames, device identities, roles, behaviour patterns, and locations. ETA can get a lot of information just by looking at the header, without needing to perform decryption. So you can enhance your knowledge of the entity with additional telemetry data.
Network analysis and visibility with a tiered alarm system
Once an alert is received, you can create actions such as sending a Syslog message, an email, an SNMP trap, or a webhook. So you have a tiered alarm system with different priorities and severities on alarms. Then you can enrich or extend the detection with data from other products, querying them via their APIs; Cisco Talos is one example. Instead of presenting analysts with all the data, you present them with the data they care about. This will add context to the investigation and help the overworked security analyst who is spending 90 minutes on a single phishing email investigation.
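A tiered alarm system can be sketched as a mapping from severity to notification actions, with alarms triaged highest severity first. The severity levels, the action names, and which actions belong to which tier are assumptions for illustration; the sketch only plans the actions and does not send anything.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Hypothetical policy: higher tiers add more intrusive notifications.
ACTIONS_BY_SEVERITY = {
    Severity.LOW: ["syslog"],
    Severity.MEDIUM: ["syslog", "email"],
    Severity.HIGH: ["syslog", "email", "snmp_trap"],
    Severity.CRITICAL: ["syslog", "email", "snmp_trap", "webhook"],
}


def plan_actions(alarm):
    """Return the notification actions configured for an alarm's tier."""
    return ACTIONS_BY_SEVERITY[alarm["severity"]]


def triage(alarms):
    """Order alarms so the most severe are handled first."""
    return sorted(alarms, key=lambda alarm: alarm["severity"], reverse=True)


# Hypothetical alarms raised by the detection engines.
alarms = [
    {"name": "policy violation", "severity": Severity.MEDIUM},
    {"name": "C&C communication", "severity": Severity.CRITICAL},
    {"name": "port scan", "severity": Severity.LOW},
]

for alarm in triage(alarms):
    print(alarm["name"], "->", plan_actions(alarm))
```

Keeping the policy in a single table makes it easy to review which alarms can page a person and which only leave a log entry.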