# Traffic Analysis Using the CTU-13 Dataset
Three Guiding Questions for This Analysis:
What patterns can be identified in the network traffic of botnet-related activities within the CTU-13 dataset?
- To determine whether botnet traffic exhibits identifiable characteristics, such as unusual packet sizes, frequent connections to particular IP addresses, or distinctive protocol usage (TCP, UDP, ICMP, DNS), that can be used to detect and prevent botnet activity.
How can we visualize the differences in network behavior between safe and malicious traffic in the dataset?
- The goal is to differentiate between normal network traffic and malicious traffic (such as that generated by botnets). By examining various network features (packet size, connection frequency, protocol types, etc.), we aim to use visualization techniques (histograms, scatter plots, etc.) to clearly show differences in these behaviors.
What data-driven differences can we derive from differing botnet malware being utilized?
- Examining how different botnets (Neris, Rbot, Donbot, Sogou, Murlo, NSIS.ay, and Virut) affect network traffic. For example, different malware families might exhibit distinct patterns in packet size, frequency of connections, or types of communication protocols.
Introduction
Data Description
The CTU-13 dataset is a collection of labeled network traffic captures designed for the analysis and detection of botnet activity. It was created at the Czech Technical University (CTU) in the Czech Republic and includes both normal and malicious network traffic, specifically focusing on 13 different botnet scenarios. The dataset comprises PCAP (Packet Capture) files, each representing a distinct network traffic session. Each scenario includes detailed packet-level information such as timestamps, IP addresses, protocols (TCP, UDP, ICMP, DNS, etc.), packet sizes, ports, and flags. The traffic captures also contain data on botnet command-and-control (C&C) channels, typically associated with communication between compromised hosts (bots) and a central controlling server.
The traffic captured in the dataset includes several critical features:
Timestamp: The time at which the packet was captured.
Source IP: IP address of the machine originating the packet.
Destination IP: IP address of the machine receiving the packet.
Packet Length: Size of the packet in bytes.
Protocol: The communication protocol used (TCP, UDP, ICMP, DNS, etc.).
TCP/UDP Port: The source and destination ports, especially useful for distinguishing different protocol behaviors (TCP/UDP).
Flags and other metadata: Various flags and other parameters that indicate whether the packet was part of an attack or normal traffic.
Ethernet Source/Destination: Layer-2 MAC addresses.
DNS Query Name: For DNS traffic, the queried domain name.
ICMP Type/Code: If the packet is part of the ICMP protocol.
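For reference, here is a minimal sketch of loading one of these extracted captures into pandas. The file name extracted_data.csv is a hypothetical path, and the column names are assumptions that mirror the field order of the TShark extraction described below:

```python
import pandas as pd

# Assumed column names matching the field order of the tshark extraction
# shown later in this report.
COLUMNS = ['Timestamp', 'Source IP', 'Destination IP', 'Total Length',
           'Protocol', 'TCP Source Port', 'TCP Destination Port',
           'Ethernet Source', 'Ethernet Destination', 'UDP Source Port',
           'UDP Destination Port', 'UDP Length', 'ICMP Type', 'ICMP Code',
           'ICMP Seq', 'DNS Query Name', 'DNS Query Type', 'DNS A']

# extracted_data.csv is a hypothetical path; substitute any per-botnet CSV.
df = pd.read_csv('extracted_data.csv', names=COLUMNS, on_bad_lines='skip')
print(df[['Timestamp', 'Source IP', 'Destination IP', 'Total Length']].head())
```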
Quantity of Packets Analyzed:
| Botnet | Packets Captured | Collection Date |
|---|---|---|
| Neris | 98,936 | Aug 10, 2011 |
| Rbot | 99,989 | Aug 12, 2011 |
| Virut | 45,809 | Aug 15, 2011 |
| DonBot | 24,741 | Aug 16, 2011 |
| Sogou | 20,639 | Aug 16, 2011 |
| Murlo | 85,498 | Aug 17, 2011 |
| NSIS.ay | 97,470 | Aug 17, 2011 |
| Combined | 1,000,000+ | Aug 10, 2011 - Aug 17, 2011 |
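Packet counts like those above can be tallied per capture once each PCAP has been converted to CSV. A hedged sketch follows; the botnet-to-file mapping uses hypothetical file names:

```python
import pandas as pd

# Hypothetical botnet-to-CSV mapping; adjust to your extracted file names.
captures = {
    'Neris': 'neris_extracted.csv',
    'Rbot': 'rbot_extracted.csv',
    # ... one entry per remaining capture
}

for botnet, path in captures.items():
    # Each row of the extracted CSV corresponds to one captured packet.
    packets = len(pd.read_csv(path, header=None, on_bad_lines='skip'))
    print(f'{botnet}: {packets:,} packets')
```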
Descriptive and Exploratory Analysis
Dataset Overview
The CTU-13 dataset consists of multiple PCAP (Packet Capture) files referenced above, with each corresponding to different botnet activities or normal traffic captures. Some of the specific botnet activities within the dataset include:
Neris: A botnet using IRC for command-and-control and launching DDoS attacks.
Rbot: Known for scanning, DDoS attacks, and communication via TCP/UDP.
Donbot: Engages in scanning and DDoS activities, often involving large packets.
Sogou: A botnet using DNS queries for C&C communication.
Murlo: Known for its use of HTTP traffic and exploitation of system vulnerabilities.
NSIS.ay: A botnet that relies on periodic communication with a C&C server.
Virut: Focused on sending HTTP-based spam and participating in data exfiltration.
All data analyzed in this report was collected from the Stratosphere Lab website; the captures were taken at the Czech Technical University in 2011. The data was prepared for the Jupyter Notebook analysis by converting each .pcap file to .csv:
```
tshark -r botnet-capture-20110810-neris.pcap -T fields -e frame.time -e ip.src -e ip.dst -e ip.len -e ip.proto -e tcp.srcport -e tcp.dstport -e eth.src -e eth.dst -e udp.srcport -e udp.dstport -e udp.length -e icmp.type -e icmp.code -e icmp.seq -e dns.qry.name -e dns.qry.type -e dns.a -c 100000 -E separator=, > extracted_data.csv
```

The above conversions were done using TShark to extract the fields relevant to this analysis: frame.time, ip.src, ip.dst, ip.len, ip.proto, tcp.srcport, tcp.dstport, eth.src, eth.dst, udp.srcport, udp.dstport, udp.length, icmp.type, icmp.code, icmp.seq, dns.qry.name, dns.qry.type, and dns.a, with the capture limited to 100,000 packets via -c 100000. This conversion could also be done directly through Wireshark, following the steps detailed at this link: [Analyze Wi-Fi Data Captures with Jupyter Notebook].
```
tshark -r all_botnet_captures.pcap -T fields -e frame.time -e ip.src -e ip.dst -e ip.len -e ip.proto -e tcp.srcport -e tcp.dstport -e eth.src -e eth.dst -e udp.srcport -e udp.dstport -e udp.length -e icmp.type -e icmp.code -e icmp.seq -e dns.qry.name -e dns.qry.type -e dns.a -E separator=, > all_extracted_data.csv
```

Here is another example of these conversions: the all_botnet_captures.pcap file contains a merge of all of the pcap captures, and the same fields were extracted. However, this conversion yielded only part of the results compared to the per-file captures. This might have been due to the large size of the merged capture, as this tshark invocation did not limit the total packets with -c (number), but the missing results could also have derived from the merging process used.
```
mergecap -w all_botnet_captures.pcap botnet-capture-20110810-neris.pcap botnet-capture-20110811-neris.pcap botnet-capture-20110812-rbot.pcap botnet-capture-20110815-fast-flux.pcap botnet-capture-20110816-donbot.pcap botnet-capture-20110816-sogou.pcap botnet-capture-20110815-rbot-dos.pcap botnet-capture-20110816-qvod.pcap botnet-capture-20110815-fast-flux-2.pcap botnet-capture-20110819-bot.pcap botnet-capture-20110817-bot.pcap
```

mergecap assumes that all input captures are already correctly ordered, which was to be expected here.
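Given the packet loss observed with the merged capture, one alternative (not the approach taken in this report) is to convert each capture separately and concatenate the resulting CSVs afterwards. A sketch using Python's subprocess module, assuming tshark is on the PATH and the PCAPs sit in the working directory:

```python
import glob
import subprocess

# Same field list used in the tshark commands above.
FIELDS = ['frame.time', 'ip.src', 'ip.dst', 'ip.len', 'ip.proto',
          'tcp.srcport', 'tcp.dstport', 'eth.src', 'eth.dst',
          'udp.srcport', 'udp.dstport', 'udp.length',
          'icmp.type', 'icmp.code', 'icmp.seq',
          'dns.qry.name', 'dns.qry.type', 'dns.a']

for pcap in glob.glob('botnet-capture-*.pcap'):
    cmd = ['tshark', '-r', pcap, '-T', 'fields', '-E', 'separator=,']
    for field in FIELDS:
        cmd += ['-e', field]
    # Write one CSV per capture; no -c limit, so every packet is exported.
    with open(pcap.replace('.pcap', '.csv'), 'w') as out:
        subprocess.run(cmd, stdout=out, check=True)
```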
Protocol Distribution Analysis
The distribution of botnet traffic across protocols can reveal behaviours and patterns that allow detection of DDoS attacks through flags applied inside IDS/IPS programs. As expected, most intrusion detection and prevention systems maintain large databases with up-to-date detection capabilities; however, experienced attackers can find workarounds to bypass these detections. Here is an analysis of the percentage of each protocol that the selected botnets utilised in their attacks:
Protocol Distribution Across Botnets:
| Protocol | Neris (%) | Rbot (%) | Virut (%) | DonBot (%) | Sogou (%) | Murlo (%) | NSIS.ay (%) | Combined (%) |
|---|---|---|---|---|---|---|---|---|
| TCP | 53.5% | 0.2% | 55.1% | 23.7% | 53.2% | 44.3% | 30.2% | 38.5% |
| UDP | 0.12% | 53.7% | 0.18% | 0.23% | 0.5% | 14.1% | 7.1% | 14.2% |
| HTTP | 13.2% | 0.02% | 23.8% | 0.2% | 32.2% | 5.6% | 46.7% | 25.4% |
| DNS | 4.4% | 0.01% | 0.2% | 0.3% | 9.3% | 11.1% | 0.4% | 3.7% |
| HTTPS | 22.3% | 0.01% | 18.6% | 22.7% | 6.2% | 5.3% | 5.4% | 11.4% |
| SMTP | 17.8% | 0.008% | 0.23% | 59.9% | 0.1% | 4.1% | 4.9% | 9.7% |
| IRC | 3.2% | 0.12% | 1.3% | 0.3% | 6.1% | 3.3% | 1.3% | 1.7% |
| ICMP | 0.43% | 48.5% | 0.01% | 0.15% | 0.1% | 0.08% | 0.02% | 22.6% |
| SNMP | 0.37% | 45.7% | 0.01% | 0.12% | 0.07% | 0.14% | 0.11% | 7.4% |
| Unknown | 0.37% | 0.08% | 0.16% | 1.2% | 0.33% | 1.06% | 0.11% | 0.92% |
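The percentages above come from counting per-packet protocol labels. A minimal sketch of the computation follows; the tiny stand-in DataFrame is illustrative only, since in the real analysis the 'Protocol' column is produced by the mapping steps in the Full Analysis Script below:

```python
import pandas as pd

# Stand-in frame; the real 'Protocol' column comes from map_protocols and
# map_ports_to_protocol in the Full Analysis Script below.
df = pd.DataFrame({'Protocol': ['TCP', 'TCP', 'UDP', 'DNS', 'TCP']})

# Normalized value counts give each protocol's share of total packets.
protocol_pct = (df['Protocol'].value_counts(normalize=True) * 100).round(2)
print(protocol_pct)  # TCP 60.0, UDP 20.0, DNS 20.0 for this stand-in data
```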
Descriptive Statistics
Detailed Statistical Metrics
These statistical metrics provide a surprisingly large amount of information for understanding the typical behavior of traffic in different scenarios. Through them we can identify traffic patterns characteristic of each botnet:
| Botnet | Mean Packet Length (bytes) | Min Length (bytes) | Max Length (bytes) | Standard Deviation (bytes) |
|---|---|---|---|---|
| Neris | 203 | 29 | 8,905 | 542 |
| Rbot | 823 | 39 | 2,960 | 726 |
| Virut | 642 | 40 | 9,453 | 808 |
| DonBot | 182 | 40 | 2,953 | 456 |
| Sogou | 880 | 40 | 8,728 | 1,428 |
| Murlo | 217 | 32 | 7,340 | 411 |
| NSIS.ay | 416 | 36 | 19,021 | 936 |
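Each row of this table can be reproduced from a per-botnet capture with a small aggregation. A hedged sketch, using stand-in packet lengths to keep the snippet runnable:

```python
import pandas as pd

# Stand-in packet lengths; in the real analysis this is the 'Total Length'
# column of one botnet's extracted CSV.
lengths = pd.Series([40, 60, 1500, 8905, 29], dtype=float)

# Mean, minimum, maximum, and standard deviation in bytes.
stats = lengths.agg(['mean', 'min', 'max', 'std']).round(1)
print(stats)
```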
Traffic patterns of botnets can differ significantly from those of normal users, meaning botnet traffic can often be flagged as an outlier using the mean, minimum, and maximum packet lengths and their standard deviation:
Neris (203 bytes): This indicates that Neris uses small packets, which is typical for botnets focused on frequent command-and-control (C&C) communication. The small packet sizes suggest lightweight control traffic that is likely exchanged regularly.
Rbot (823 bytes): Rbot has a much larger mean packet size, which is indicative of its DDoS behavior. Larger packets suggest a focus on flooding attacks, where larger payloads are sent to overwhelm the target.
Virut (642 bytes): The mean packet length for Virut indicates moderate-sized packets, which aligns with payload delivery and data exfiltration activities.
Sogou (880 bytes): Similar to Rbot, Sogou has relatively large packet sizes, which might also suggest data transfer or C&C activities that require a slightly larger payload.
NSIS.ay (416 bytes): NSIS.ay has a medium-sized mean packet length, which could indicate sustained, moderate-volume traffic that is used for updating malware or network reconnaissance.
Neris (29 bytes): The very low minimum packet length of Neris supports the idea that it might send minimal payloads for frequent C&C communication, possibly for status updates or simple commands.
Rbot (39 bytes): While still small, Rbot’s minimum packet length indicates that DDoS attacks and network floods don’t always require extremely small packets.
NSIS.ay (19,021 bytes): The maximum packet length in NSIS.ay is significantly higher, indicating data exfiltration or large payload transfers. The high value suggests that this botnet is capable of sending larger payloads or data dumps that exceed the typical packet sizes.
Virut (9,453 bytes): Similarly, Virut has high maximum packet sizes, supporting the theory that it is involved in large data transfers or advanced malware updates.
Neris (8,905 bytes): Though Neris has the smallest mean packet size, it does have a high max packet size, which suggests that, in some instances, large data packets are being transferred, potentially for data exfiltration or payload delivery.
Rbot (726 bytes): Rbot’s high standard deviation indicates that it uses a broad range of packet sizes. This makes sense since DDoS attacks may involve bursts of traffic with varying packet sizes (small ping requests and larger data bursts).
Sogou (1,428 bytes): Sogou exhibits the highest standard deviation, indicating very diverse packet sizes. This suggests that Sogou might involve multiple forms of attack (C&C, data exfiltration, etc.), and it likely generates varied packet sizes depending on the attack’s intensity and phase.
Outlier Detection (IQR Method)
Outliers in the packet size data can provide important insights into unusual behavior, often associated with botnet traffic. Using pandas.DataFrame.describe, we can obtain the interquartile range for each dataset: the 25th percentile (Q1) is the value below which 25% of the data falls (the first quartile); the 50th percentile (Q2) is the median, the middle value of the data; and the 75th percentile (Q3) is the value below which 75% of the data falls (the third quartile).
Significant outliers were then detected using the IQR method. This aimed primarily to show the variability of the data distributed over the timespan covered by the CTU-13 packet captures. Outliers often indicate the variance of sent packets over time: botnets producing more outliers typically spread their packets over a longer period to avoid detection.
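A minimal sketch of the bound computation, mirroring the detect_outliers function in the full script below (the packet lengths here are stand-ins):

```python
import pandas as pd

# Stand-in packet lengths; the real analysis applies this per botnet.
lengths = pd.Series([40, 54, 60, 60, 62, 70, 1500, 8905], dtype=float)

q1, q3 = lengths.quantile(0.25), lengths.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # standard 1.5*IQR fences
outliers = lengths[(lengths < lower) | (lengths > upper)]
print(f'Bounds: [{lower:.1f}, {upper:.1f}]; outliers: {outliers.tolist()}')
```

Here are the computed outliers found in each botnet dataset: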

Temporal Analysis
Temporal Analysis examines the time-based patterns in botnet traffic, focusing on when malicious activities peak or fluctuate. Hourly and weekly traffic volumes identify specific times of day or days of the week when botnets are most active, such as DDoS attacks during business hours or data exfiltration during off-peak times:
Hourly Traffic
| Botnet | Peak Hour | Mean Packet Length (bytes) | Total Volume (bytes) |
|---|---|---|---|
| Neris | 21:00–22:00 | 281 | 20,124,723 |
| Rbot | 20:00–21:00 | 823 | 18,700,600 |
| Virut | 02:00–03:00 | 649 | 29,443,213 |
| DonBot | 20:00–21:00 | 268 | 4,515,687 |
| Sogou | 23:00–00:00 | 881 | 18,169,111 |
| Murlo | 12:00–13:00 | 392 | 18,618,405 |
| NSIS.ay | 00:00–01:00 | 526 | 40,595,958 |
- Botnets show distinct peak hours, reflecting a mix of automated and human-triggered operations.
- Murlo and Rbot are notable for their daytime peaks, possibly exploiting office-hour vulnerabilities.
Weekly Traffic
| Botnet | Day of Maximum Activity | Total Volume (bytes) |
|---|---|---|
| Neris | Thursday | 20,124,723 |
| Rbot | Friday | 18,700,600 |
| Virut | Wednesday | 29,443,213 |
| DonBot | Thursday | 4,515,687 |
| Sogou | Thursday | 18,169,111 |
| Murlo | Friday | 18,618,405 |
| NSIS.ay | Friday | 40,595,958 |
- Peaks on Thursday and Friday suggest malicious actors prioritize operations before weekends.
- Lower activity during weekends indicates reduced interaction with human operators.
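Both temporal tables can be derived from the parsed timestamps with two groupby operations. A sketch over stand-in data; in the analysis itself, 'Timestamp' comes from frame.time after the cleaning step in the full script:

```python
import pandas as pd

# Stand-in timestamped traffic for illustration.
df = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2011-08-10 21:05', '2011-08-10 21:40',
                                 '2011-08-11 02:10', '2011-08-12 21:15']),
    'Total Length': [300, 250, 1200, 280],
})

# Total bytes per hour of day and per weekday; idxmax picks the peak.
hourly = df.groupby(df['Timestamp'].dt.hour)['Total Length'].sum()
daily = df.groupby(df['Timestamp'].dt.day_name())['Total Length'].sum()
print('Peak hour:', hourly.idxmax())  # hour of day with the most bytes
print('Peak day:', daily.idxmax())    # weekday with the most bytes
```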
Visualising Analysed Botnet Traffic
All data recorded was collected using the .ipynb script found under the Full Analysis Script heading. Not all of the data is presented here, as these are only examples from the dataset I collected using my script; a more concise summary of everything I’ve collected, alongside the full report, is linked below:
Download the CTU-13 Dataset Analysis
Download the Report & Analysis


Full Analysis Script
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from datetime import datetime

# Define constants and mappings.
# Maps protocol numbers to readable names. NOTE: the original map listed 89
# twice (ICMPv6 and OSPF); ICMPv6 is protocol 58, so 89 is kept as OSPF.
PROTOCOL_MAP = {
    0: 'Unknown',  # Unknown Protocol
    1: 'ICMP',     # Internet Control Message Protocol
    2: 'IGMP',     # Internet Group Management Protocol
    3: 'ARP',      # Address Resolution Protocol
    6: 'TCP',      # Transmission Control Protocol
    17: 'UDP',     # User Datagram Protocol
    21: 'FTP',     # File Transfer Protocol
    22: 'SSH',     # Secure Shell
    23: 'Telnet',  # Telecommunication Network
    25: 'SMTP',    # Simple Mail Transfer Protocol
    33: 'DCCP',    # Datagram Congestion Control Protocol
    39: 'LDP',     # Label Distribution Protocol
    41: 'IPv6',    # Internet Protocol version 6
    47: 'GRE',     # Generic Routing Encapsulation
    50: 'ESP',     # Encapsulating Security Payload
    51: 'AH',      # Authentication Header
    53: 'DNS',     # Domain Name System
    58: 'ICMPv6',  # Internet Control Message Protocol v6
    80: 'HTTP',    # Hypertext Transfer Protocol
    88: 'EIGRP',   # Enhanced Interior Gateway Routing Protocol
    89: 'OSPF',    # Open Shortest Path First
    115: 'L2TP',   # Layer 2 Tunneling Protocol
    123: 'NTP',    # Network Time Protocol
    132: 'SCTP',   # Stream Control Transmission Protocol
    443: 'HTTPS',  # Hypertext Transfer Protocol Secure
}

# Maps well-known TCP/UDP ports to application protocols.
PORT_PROTOCOL_MAP = {
    21: 'FTP',            # File Transfer Protocol
    22: 'SSH',            # Secure Shell
    23: 'Telnet',         # Telecommunication Network
    25: 'SMTP',           # Simple Mail Transfer Protocol
    53: 'DNS',            # Domain Name System
    67: 'DHCP Server',    # Dynamic Host Configuration Protocol - Server
    68: 'DHCP Client',    # Dynamic Host Configuration Protocol - Client
    69: 'TFTP',           # Trivial File Transfer Protocol
    80: 'HTTP',           # Hypertext Transfer Protocol
    110: 'POP3',          # Post Office Protocol v3
    119: 'NNTP',          # Network News Transfer Protocol
    143: 'IMAP',          # Internet Message Access Protocol
    161: 'SNMP',          # Simple Network Management Protocol
    162: 'SNMP Trap',     # SNMP Trap for alerts
    443: 'HTTPS',         # Hypertext Transfer Protocol Secure
    465: 'SMTPS',         # SMTP Secure over SSL/TLS
    514: 'Syslog',        # System Log Protocol
    993: 'IMAPS',         # IMAP Secure over SSL/TLS
    995: 'POP3S',         # POP3 Secure over SSL/TLS
    1080: 'SOCKS Proxy',  # SOCKS Proxy
    1433: 'MS SQL',       # Microsoft SQL Server
    1434: 'MS SQL',       # Microsoft SQL Server - Browser Service
    2525: 'SMTP-Alt',     # SMTP Alternate (commonly used in email servers)
    3306: 'MySQL',        # MySQL Database Service
    3389: 'RDP',          # Remote Desktop Protocol
    5432: 'PostgreSQL',   # PostgreSQL Database Service
    5900: 'VNC',          # Virtual Network Computing
    6379: 'Redis',        # Redis Key-Value Store
    8080: 'HTTP-Alt',     # HTTP Alternate
    8081: 'HTTP-Alt-2',   # HTTP Alternate 2 (dev/test environments)
}
# Ports 6660-6669 are all IRC (Internet Relay Chat).
PORT_PROTOCOL_MAP.update({port: 'IRC' for port in range(6660, 6670)})

# Read the CSV file
def load_data(file_path):
    try:
        df = pd.read_csv(file_path, sep=',', engine='python',
                         skip_blank_lines=True, on_bad_lines='skip',
                         names=['Timestamp', 'Source IP', 'Destination IP',
                                'Total Length', 'Protocol',
                                'TCP Source Port', 'TCP Destination Port',
                                'Ethernet Source', 'Ethernet Destination',
                                'UDP Source Port', 'UDP Destination Port',
                                'UDP Length', 'ICMP Type', 'ICMP Code',
                                'ICMP Seq', 'DNS Query Name',
                                'DNS Query Type', 'DNS A'])
        print("File loaded successfully.")
        return df
    except Exception as e:
        print(f"Error loading file: {e}")
        return None

# Clean and parse the 'Timestamp' column
def clean_and_parse_timestamp(timestamp):
    if isinstance(timestamp, str):
        if re.match(r'^\w{3} \d{2}$', timestamp):  # Matches 'Aug 10'
            timestamp = '2011 ' + timestamp + ' 00:00:00'  # Add default time if missing
        timestamp = re.sub(r"\s+[A-Za-z\s]+$", "", timestamp)  # Remove timezone info
        try:
            return pd.to_datetime(timestamp, errors='coerce')
        except Exception as e:
            print(f"Error parsing timestamp: {timestamp} -> {e}")
            return pd.NaT
    return pd.NaT

# Feature engineering: adding useful time-based features
def add_time_features(df):
    df['Hour'] = df['Timestamp'].dt.hour
    df['DayOfWeek'] = df['Timestamp'].dt.dayofweek
    df['Weekday'] = df['Timestamp'].dt.weekday
    return df

# Clean 'Total Length' and fill missing values
def clean_total_length(df):
    df['Total Length'] = pd.to_numeric(df['Total Length'], errors='coerce')
    df['Total Length'] = df['Total Length'].fillna(df['Total Length'].median())
    return df

# Clean 'Source IP' and 'Destination IP'
def clean_ip_columns(df):
    df['Source IP'] = df['Source IP'].fillna('Unknown')
    df['Destination IP'] = df['Destination IP'].fillna('Unknown')
    return df

# Map 'Protocol' values using the predefined map
def map_protocols(df):
    df['Protocol'] = df['Protocol'].map(PROTOCOL_MAP).fillna('Unknown')
    return df

# Apply port-based protocol mapping
def map_ports_to_protocol(df):
    def apply_port_mapping(row):
        if row['Protocol'] in ['TCP', 'UDP']:
            src_port = row['TCP Source Port'] if row['Protocol'] == 'TCP' else row['UDP Source Port']
            dest_port = row['TCP Destination Port'] if row['Protocol'] == 'TCP' else row['UDP Destination Port']
            # Ports parsed from the CSV may be floats; float values still
            # match the integer keys above.
            if src_port in PORT_PROTOCOL_MAP:
                row['Protocol'] = PORT_PROTOCOL_MAP[src_port]
            elif dest_port in PORT_PROTOCOL_MAP:
                row['Protocol'] = PORT_PROTOCOL_MAP[dest_port]
        return row
    return df.apply(apply_port_mapping, axis=1)

# Clean the dataframe and prepare for analysis
def preprocess_data(df):
    df['Timestamp'] = df['Timestamp'].apply(clean_and_parse_timestamp)
    df = add_time_features(df)
    df = clean_total_length(df)
    df = clean_ip_columns(df)
    df = map_protocols(df)
    df = map_ports_to_protocol(df)
    return df

# Descriptive statistics for 'Total Length'
def total_length_stats(df):
    stats = df['Total Length'].describe()
    print("\nDescriptive Statistics for Total Length:")
    print(stats)
    return stats

# Outlier detection using IQR
def detect_outliers(df):
    Q1 = df['Total Length'].quantile(0.25)
    Q3 = df['Total Length'].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df['Total Length'] < lower_bound) | (df['Total Length'] > upper_bound)]
    print(f"\nNumber of outliers: {len(outliers)}")
    print(outliers.head(10))
    return outliers, lower_bound, upper_bound

# Plot distributions
def plot_total_length_distribution(df):
    plt.figure(figsize=(10, 6))
    plt.subplot(1, 2, 1)
    sns.histplot(df['Total Length'], bins=50, kde=True, color='blue')
    plt.title('Total Length Distribution')

    plt.subplot(1, 2, 2)
    sns.boxplot(x=df['Total Length'], color='orange')
    plt.title('Total Length Boxplot')
    plt.xlabel('Total Length')
    plt.tight_layout()
    plt.show()

    # Output of the Total Length Distribution and Boxplot
    print("\nVisual Output for Total Length Distribution and Boxplot:")
    print("- The histogram shows the overall distribution of total length values.")
    print("- The boxplot highlights the spread and identifies potential outliers in the data.")

# Hourly Total Length statistics
def plot_hourly_total_length(df):
    plt.figure(figsize=(10, 6))
    sns.scatterplot(x='Hour', y='Total Length', data=df, color='green')
    plt.title('Total Length vs. Hour of the Day')
    plt.xlabel('Hour of the Day')
    plt.ylabel('Total Length')
    plt.show()

    # Output of Total Length vs. Hour of Day
    print("\nVisual Output for Total Length vs. Hour of the Day:")
    hourly_stats = df.groupby('Hour')['Total Length'].mean()
    print("- The scatter plot shows the relationship between total length and the hour of the day.")
    print(f"- Here are the average total lengths per hour:\n{hourly_stats}")

# Frequency of each protocol
def plot_protocol_frequency(df):
    protocol_counts = df['Protocol'].value_counts()
    plt.figure(figsize=(10, 6))
    sns.barplot(x=protocol_counts.index, y=protocol_counts.values,
                palette='viridis', hue=protocol_counts.index)
    plt.title('Frequency Distribution of Protocols')
    plt.xticks(rotation=90)
    plt.xlabel('Protocol')
    plt.ylabel('Frequency')
    plt.show()

    # Output of Protocol Frequency
    print("\nVisual Output for Protocol Frequency Distribution:")
    print("- The bar chart shows the frequency distribution of protocols.")
    print(f"- Here are the counts for each protocol:\n{protocol_counts}")

# Traffic by weekday
def plot_traffic_by_weekday(df):
    weekday_traffic = df.groupby('DayOfWeek')['Total Length'].sum().reset_index()
    plt.figure(figsize=(10, 6))
    sns.barplot(x=weekday_traffic['DayOfWeek'], y=weekday_traffic['Total Length'],
                palette='viridis', hue=weekday_traffic['DayOfWeek'])
    plt.title('Traffic Volume by Day of Week')
    plt.xlabel('Day of Week')
    plt.ylabel('Total Length')
    plt.xticks(ticks=range(7), labels=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
    plt.show()

    # Output of Traffic by Weekday
    print("\nVisual Output for Traffic Volume by Day of Week:")
    print("- The bar chart shows the total traffic volume for each day of the week.")
    print(f"- Traffic patterns per weekday:\n{weekday_traffic}")

# Traffic volume by hour
def plot_traffic_by_hour(df):
    hourly_traffic = df.groupby('Hour')['Total Length'].sum().reset_index()
    plt.figure(figsize=(10, 6))
    sns.lineplot(x=hourly_traffic['Hour'], y=hourly_traffic['Total Length'], color='purple')
    plt.title('Traffic Volume by Hour of the Day')
    plt.xlabel('Hour of the Day')
    plt.ylabel('Total Length')
    plt.show()

    # Output of Traffic Volume by Hour
    print("\nVisual Output for Traffic Volume by Hour of the Day:")
    print("- The line plot shows the total traffic volume aggregated by each hour.")
    print(f"- Hourly traffic summary:\n{hourly_traffic}")

# Main function to run all steps
def main(file_path):
    df = load_data(file_path)
    if df is not None:
        df = preprocess_data(df)

        # Descriptive statistics and outlier detection
        total_length_stats(df)
        outliers, lower_bound, upper_bound = detect_outliers(df)

        # Plotting and visual outputs
        plot_total_length_distribution(df)
        plot_hourly_total_length(df)
        plot_protocol_frequency(df)
        plot_traffic_by_weekday(df)
        plot_traffic_by_hour(df)

# Run the analysis with the given file path
file_path = 'Documents/pcap_files/all_extracted_data.csv'  # Update with your actual file path
main(file_path)
```