Many pod scenarios now support the exclude_label parameter to protect critical pods while testing others. See individual scenario pages (Pod Failures, Pod Network Chaos) for details.
Deploys mutable grafana loaded with dashboards visualizing performance metrics pulled from in-cluster prometheus. The dashboard will be exposed as a route.
False
CAPTURE_METRICS
Captures metrics as specified in the profile from in-cluster prometheus. Default metrics captures are listed here
False
ENABLE_ALERTS
Evaluates expressions from in-cluster prometheus and exits 0 or 1 based on the severity set. Default profile.
False
ALERTS_PATH
Path to the alerts file to use when ENABLE_ALERTS is set
config/alerts
CHECK_CRITICAL_ALERTS
When enabled will check prometheus for critical alerts firing post chaos
OC CLI path, if not specified will be searched in $PATH
blank
Note
For setting the TELEMETRY_ARCHIVE_SIZE, the lower the value the higher the number of archive files produced and uploaded (processed by TELEMETRY_BACKUP_THREADS simultaneously). For unstable or slow connections, keep this value low and increase TELEMETRY_BACKUP_THREADS so that on upload failure only the failed chunk is retried.
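The trade-off in the note above can be made concrete with a little arithmetic. The sketch below is illustrative only: `archive_count` is a hypothetical helper, not part of Krkn, and it assumes the total payload and the archive size are expressed in the same unit.

```python
import math


def archive_count(total_size: int, archive_size: int) -> int:
    """Number of archive files needed to hold total_size (hypothetical helper).

    Both arguments must use the same unit; the unit itself is an assumption.
    """
    if archive_size <= 0:
        raise ValueError("archive_size must be positive")
    return math.ceil(total_size / archive_size)


# Halving the archive size doubles the number of files to upload,
# but a failed upload then only retries a smaller chunk.
print(archive_count(1000, 100))  # 10
print(archive_count(1000, 50))   # 20
```

With more, smaller archives, a higher TELEMETRY_BACKUP_THREADS value lets more of them upload in parallel.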
Health Checks
Application endpoint monitoring during chaos. See Health Checks config for full details.
Parameter
Description
Default
HEALTH_CHECK_URL
URL to continually check and detect downtimes
blank
HEALTH_CHECK_INTERVAL
Interval in seconds at which to run health checks
2
HEALTH_CHECK_BEARER_TOKEN
Bearer token used for authenticating into health check URL
blank
HEALTH_CHECK_AUTH
Tuple of (username, password) used for authenticating into health check URL
blank
HEALTH_CHECK_EXIT_ON_FAILURE
If True, exits when health check fails for application
blank
HEALTH_CHECK_VERIFY
Health check URL SSL validation
False
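To illustrate how periodic checks translate into detected downtime, here is a minimal sketch. It is not Krkn's implementation: `Sample` and `downtime_windows` are hypothetical names, and the samples are supplied by hand rather than fetched from HEALTH_CHECK_URL.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # seconds since the start of the run
    healthy: bool     # True if the endpoint answered successfully


def downtime_windows(samples: list[Sample]) -> list[tuple[float, float]]:
    """Collapse consecutive failed checks into (start, end) downtime windows."""
    windows = []
    start = None
    for s in samples:
        if not s.healthy and start is None:
            start = s.timestamp                   # outage begins
        elif s.healthy and start is not None:
            windows.append((start, s.timestamp))  # outage ends
            start = None
    if start is not None:                         # still down at the last sample
        windows.append((start, samples[-1].timestamp))
    return windows


# Checks every 2 seconds (the HEALTH_CHECK_INTERVAL default);
# the endpoint is down from t=4 until it answers again at t=8.
samples = [Sample(t, ok) for t, ok in
           [(0, True), (2, True), (4, False), (6, False), (8, True)]]
print(downtime_windows(samples))  # [(4, 8)]
```

A shorter interval narrows the uncertainty about when an outage actually started and ended, at the cost of more requests to the endpoint.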
Virt Checks
KubeVirt VMI SSH connection monitoring during chaos. See Virt Checks config for full details.
Parameter
Description
Default
KUBE_VIRT_CHECK_INTERVAL
Interval in seconds at which to test kubevirt connections
2
KUBE_VIRT_NAMESPACE
Namespace to find VMIs in and watch
blank
KUBE_VIRT_NAME
Regex style name to match VMIs to watch
blank
KUBE_VIRT_FAILURES
If True, will only report when ssh connections fail to VMI
blank
KUBE_VIRT_DISCONNECTED
Use disconnected check, bypassing the cluster API
False
KUBE_VIRT_NODE_NAME
If set, will filter VMs to only track ones running on the specified node
blank
KUBE_VIRT_EXIT_ON_FAIL
Fails run if VMs still have false status at end of run
False
KUBE_VIRT_SSH_NODE
If set, will be a backup way to SSH to a node. Should be a node not targeted in chaos
blank
2 - Krknctl All Scenarios Variables
These variables apply to the top-level configuration template shared by all scenarios in Krknctl.
Each section below corresponds to a section in the Krkn config reference. Pass flags when running a scenario:
krknctl run <scenario> --<parameter> <value>
Kraken
General run settings. See Kraken config for full details.
Parameter
Description
Type
Possible Values
Default
--krkn-kubeconfig
Sets the path where krkn will search for kubeconfig in container
string
-
/home/krkn/.kube/config
--uuid
Sets krkn run uuid instead of generating it
string
-
-
--krkn-debug
Enables debug mode for Krkn
enum
True/False
False
Cerberus
Cluster health monitoring integration. See Cerberus config for full details.
For --telemetry-archive-size, the lower the value the higher the number of archive files produced and uploaded (processed by --telemetry-backup-threads simultaneously). For unstable or slow connections, keep this value low and increase --telemetry-backup-threads so that on upload failure only the failed chunk is retried.
Health Checks
Application endpoint monitoring during chaos. See Health Checks config for full details.
Parameter
Description
Type
Possible Values
Default
--health-check-url
URL to check the health of
string
-
-
--health-check-interval
How often to check the health check urls (seconds)
number
-
2
--health-check-auth
Authentication tuple to authenticate into health check URL
string
-
-
--health-check-bearer-token
Bearer token to authenticate into health check URL
string
-
-
--health-check-exit
Exit on failure when health check URL is not able to connect
string
-
-
--health-check-verify
SSL verification for health check URL
string
-
false
Virt Checks
KubeVirt VMI SSH connection monitoring during chaos. See Virt Checks config for full details.
Parameter
Description
Type
Possible Values
Default
--kubevirt-check-interval
How often to check the KubeVirt VMs SSH status (seconds)
number
-
2
--kubevirt-namespace
KubeVirt namespace to check the health of
string
-
-
--kubevirt-name
KubeVirt regex names to watch
string
-
-
--kubevirt-only-failures
KubeVirt checks only report if failure occurs
enum
True/False
false
--kubevirt-disconnected
KubeVirt checks in disconnected mode, bypassing the cluster’s API
enum
True/False
false
--kubevirt-ssh-node
KubeVirt backup node to SSH into when checking VMI IP address status
NOTE: For clusters with AWS make sure AWS CLI is installed and properly configured using an AWS account. This should set a configuration file at $HOME/.aws/config for your AWS account. If you have multiple profiles configured on AWS, you can change the profile by setting export AWS_DEFAULT_PROFILE=<profile-name>
export AWS_DEFAULT_REGION=<aws-region>
This configuration will work for self-managed AWS, ROSA, and ROSA-HCP
GCP
NOTE: For clusters with GCP make sure GCP CLI is installed.
A google service account is required to give proper authentication to GCP for node actions. See here for how to create a service account.
NOTE: A user with ‘resourcemanager.projects.setIamPolicy’ permission is required to grant project-level permissions to the service account.
After creating the service account, enable it by exporting the credentials path or running gcloud init:
NOTE: For clusters with OpenStack Cloud, make sure to create and source the OpenStack RC file to set the OpenStack environment variables on the server where Kraken runs.
Azure
NOTE: You will need to create a service principal and give it the correct access, see here for creating the service principal and setting the proper permissions.
To run properly, the service principal requires the “Azure Active Directory Graph/Application.ReadWrite.OwnedBy” API permission and the “User Access Administrator” role.
Before running you will need to set the following:
This configuration will only work for self managed Azure, not ARO. ARO service puts a deny assignment in place over cluster managed resources, that only allows the ARO service itself to modify the VM resources. This is a capability unique to Azure and the structure of the service to prevent customers from hurting themselves. Refer to the links below for more documentation around this.
Scenario to block the traffic ( Ingress/Egress ) of an application matching the labels for the specified duration of time to understand the behavior of the service/other services which depend on it during downtime. This helps with planning the requirements accordingly, be it improving the timeouts or tweaking the alerts etc.
You can add your application's URL to the health checks section of the config to track the downtime of your application during this scenario.
Rollback Scenario Support
Krkn supports rollback for Application outages. For more details, please refer to the Rollback Scenarios documentation.
Debugging steps in case of failures
Kraken creates a network policy that blocks ingress/egress traffic to create the outage. If a failure occurs before the network policy is reverted, you can delete it manually to stop the outage by executing the following commands:
application_outage: # Scenario to create an outage of an application by blocking traffic
  duration: 600 # Duration in seconds after which the routes will be accessible. Default if omitted: 60
  namespace: <namespace-with-application> # Namespace to target - all application routes will go inaccessible if pod selector is empty
  pod_selector: {app: foo} # Pods to target
  exclude_label: "" # Optional label selector to exclude pods. Supports dict, string, or list format
  block: [Ingress, Egress] # It can be Ingress or Egress or Ingress, Egress
How to Use Plugin Name
Add the plugin name to the chaos_scenarios list in the config/config.yaml file
kraken:
    kubeconfig_path: ~/.kube/config # Path to kubeconfig
    ..
    chaos_scenarios:
        - application_outages_scenarios:
            - scenarios/<scenario_name>.yaml
Note
You can specify multiple scenario files of the same type by adding additional paths to the list:
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
    chaos_scenarios:
        - application_outages_scenarios:
            - scenarios/app-outage.yaml
        - pod_disruption_scenarios:
            - scenarios/pod-kill.yaml
        - container_scenarios:
            - scenarios/container-kill.yaml
        - application_outages_scenarios: # Same type can appear multiple times
            - scenarios/app-outage-2.yaml
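Because chaos_scenarios is a list of single-key mappings rather than one mapping, the same scenario type can appear more than once without overwriting an earlier entry. The sketch below mirrors the YAML above as Python literals and walks it in list order. `execution_plan` is a hypothetical helper for illustration, not Krkn's actual loader.

```python
# The same structure as the YAML above, written as Python literals.
config = {
    "kraken": {
        "chaos_scenarios": [
            {"application_outages_scenarios": ["scenarios/app-outage.yaml"]},
            {"pod_disruption_scenarios": ["scenarios/pod-kill.yaml"]},
            {"container_scenarios": ["scenarios/container-kill.yaml"]},
            {"application_outages_scenarios": ["scenarios/app-outage-2.yaml"]},
        ]
    }
}


def execution_plan(cfg: dict) -> list[tuple[str, str]]:
    """Flatten chaos_scenarios into ordered (scenario_type, file) pairs."""
    plan = []
    for entry in cfg["kraken"]["chaos_scenarios"]:
        for scenario_type, files in entry.items():
            plan.extend((scenario_type, path) for path in files)
    return plan


for scenario_type, path in execution_plan(config):
    print(scenario_type, path)
```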
Run
python run_kraken.py --config config/config.yaml
This scenario disrupts the traffic to the specified application to help you understand the impact of the outage on the dependent service/user experience. Refer to the docs for more details.
Run
If you enable Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option, you must set each environment variable on the podman command line, like -e <VARIABLE>=<value>
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
Or on the command line, for example:
-e <VARIABLE>=<value>
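When --env-host is unavailable, every variable needs its own -e flag. The following sketch assembles such a command from a dictionary. `podman_command` is a hypothetical helper, the values are placeholders, and the flags are the ones used in the krkn-hub examples in this documentation.

```python
import shlex


def podman_command(image: str, name: str, kubeconfig: str,
                   env: dict[str, str]) -> list[str]:
    """Build a `podman run` argument list with one -e flag per variable."""
    cmd = ["podman", "run", f"--name={name}", "--net=host", "--pull=always"]
    for key, value in env.items():
        cmd += ["-e", f"{key}={value}"]
    cmd += ["-v", f"{kubeconfig}:/home/krkn/.kube/config:Z", "-d", image]
    return cmd


cmd = podman_command(
    image="containers.krkn-chaos.dev/krkn-chaos/krkn-hub:<scenario>",
    name="<container_name>",
    kubeconfig="<path-to-kube-config>",
    env={"NAMESPACE": "<namespace-with-application>", "DURATION": "600"},
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` instead of printing it avoids shell quoting issues entirely.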
See the list of variables that apply to all scenarios here; they can be set in addition to these scenario-specific variables.
Parameter
Description
Type
Default
DURATION
Duration in seconds after which the routes will be accessible
number
600
NAMESPACE
Namespace to target - all application routes will go inaccessible if pod selector is empty ( Required )
string
No default
POD_SELECTOR
Pods to target. For example “{app: foo}”
string
No default
EXCLUDE_LABEL
Pods to exclude after getting list of pods from POD_SELECTOR to target. For example “{app: foo}”
string
No default
BLOCK_TRAFFIC_TYPE
It can be Ingress or Egress or Ingress, Egress ( needs to be a list )
string
[Ingress]
Note
The NAMESPACE parameter is required to run this scenario; pod_selector is optional. When using a pod selector to target a particular application, define it in the following format, with a space between key and value: “{key: value}”.
Note
When using a custom metrics profile or alerts profile with CAPTURE_METRICS or ENABLE_ALERTS enabled, mount the profile from the host on which the container runs (using podman/docker) under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts, respectively.
Namespace to target - all application routes will go inaccessible if pod selector is empty
string
True
--chaos-duration
Set chaos duration (in sec) as desired
number
False
600
--pod-selector
Pods to target. For example “{app: foo}”
string
True
--exclude-selector
Pods to exclude after using pod-selector to target. For example “{app: foo}”
string
False
--block-traffic-type
It can be [Ingress] or [Egress] or [Ingress, Egress]
string
False
“[Ingress, Egress]”
Behavior Notes
Empty --pod-selector: When left empty, krkn creates a NetworkPolicy that targets all pods in the namespace, causing a namespace-wide outage.
Automatic cleanup: After --chaos-duration expires, krkn automatically deletes the NetworkPolicy it created and traffic resumes. A rollback handler is also registered to ensure cleanup if the scenario fails unexpectedly.
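The behavior notes above follow from how Kubernetes NetworkPolicy objects work. The sketch below builds a deny-style policy as a plain dictionary. It is an illustration of the mechanism, not the manifest Krkn actually generates; `deny_policy` and the policy name are hypothetical.

```python
import json


def deny_policy(name: str, namespace: str,
                pod_selector: dict[str, str], block: list[str]) -> dict:
    """Build a NetworkPolicy that blocks the listed traffic directions.

    Listing a direction in policyTypes without any matching allow rules
    denies all traffic in that direction for the selected pods.
    """
    allowed = {"Ingress", "Egress"}
    if not block or not set(block) <= allowed:
        raise ValueError(f"block must be a non-empty subset of {sorted(allowed)}")
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # An empty matchLabels map selects every pod in the namespace.
            "podSelector": {"matchLabels": pod_selector},
            "policyTypes": list(block),
        },
    }


policy = deny_policy("example-outage", "<namespace-with-application>",
                     {"app": "foo"}, ["Ingress", "Egress"])
print(json.dumps(policy, indent=2))
```

Passing an empty dictionary as `pod_selector` reproduces the namespace-wide outage described for an empty --pod-selector.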
To see all available scenario options
krknctl run application-outages --help
Demo
See a demo of this scenario:
5 - Aurora Disruption Scenario
This scenario blocks a pod’s outgoing MySQL and PostgreSQL traffic, effectively preventing it from connecting to any AWS Aurora SQL engine. It works just as well for standard MySQL and PostgreSQL connections too.
This uses the pod network filter scenario, configured with specific parameters to disrupt Aurora.
How to Run Aurora Disruption Scenarios
Choose your preferred method to run aurora disruption scenarios:
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
    chaos_scenarios:
        - network_chaos_ng_scenarios:
            - scenarios/aurora-disruption.yaml
        - pod_disruption_scenarios:
            - scenarios/pod-kill.yaml
        - container_scenarios:
            - scenarios/container-kill.yaml
        - network_chaos_ng_scenarios: # Same type can appear multiple times
            - scenarios/aurora-disruption-2.yaml
Run
python run_kraken.py --config config/config.yaml
This scenario disrupts a targeted zone in the public cloud by blocking egress and ingress traffic to understand the impact on both Kubernetes/OpenShift platforms control plane as well as applications running on the worker nodes in that zone. More information is documented here
Kraken uses the `oc exec` command to `kill` specific containers in a pod.
This can be based on the pods namespace or labels. If you know the exact object you want to kill, you can also specify the specific container name or pod name in the scenario yaml file.
These scenarios are in a simple yaml format that you can manipulate to run your specific tests or use the pre-existing scenarios to see how it works.
Recovery Time Metrics in Krkn Telemetry
Krkn tracks three key recovery time metrics for each affected container:
pod_rescheduling_time - The time (in seconds) that the Kubernetes cluster took to reschedule the pod after it was killed. This measures the cluster’s scheduling efficiency and includes the time from pod deletion until the replacement pod is scheduled on a node. In some cases, when only the container is killed, the pod is not fully rescheduled, so the pod rescheduling time might be 0.0 seconds.
pod_readiness_time - The time (in seconds) the pod took to become ready after being scheduled. This measures application startup time, including container image pulls, initialization, and readiness probe success.
total_recovery_time - The total amount of time (in seconds) from pod deletion until the replacement pod became fully ready and available to serve traffic. This is the sum of rescheduling time and readiness time.
These metrics appear in the telemetry output under PodsStatus.recovered for successfully recovered pods. Pods that fail to recover within the timeout period appear under PodsStatus.unrecovered without timing data.
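The relationship between the three metrics can be written down directly. This is a minimal sketch using the field names from the list above; the `RecoveryTimes` class itself is hypothetical and is not the telemetry model Krkn uses.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTimes:
    pod_rescheduling_time: float  # seconds from pod deletion to scheduling
    pod_readiness_time: float     # seconds from scheduling to ready

    @property
    def total_recovery_time(self) -> float:
        """Sum of rescheduling time and readiness time, in seconds."""
        return self.pod_rescheduling_time + self.pod_readiness_time


# A container kill that did not trigger rescheduling: rescheduling time is 0.0.
times = RecoveryTimes(pod_rescheduling_time=0.0, pod_readiness_time=12.5)
print(times.total_recovery_time)  # 12.5
```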
The following are the components of Kubernetes for which a basic chaos scenario config exists today.
scenarios:
- name: "<name of scenario>"
  namespace: "<specific namespace>" # can specify "*" if you want to find in all namespaces
  label_selector: "<label of pod(s)>"
  container_name: "<specific container name>" # This is optional, can take out and will kill all containers in all pods found under namespace and label
  pod_names: # This is optional, can take out and will select all pods with given namespace and label
   - <pod_name>
  exclude_label: "<label to exclude pods from chaos>" # Optional: pods matching this label will be excluded from disruption
  count: <number of containers to disrupt, default=1>
  action: <kill signal to run. For example 1 ( hang up ) or 9. Default is set to 1>
  expected_recovery_time: <number of seconds to wait for container to be running again> (defaults to 60 seconds)
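To show how label_selector, exclude_label, and count interact, here is a small sketch of the selection step. It is an assumption about the general approach, not Krkn's code: `pick_targets` is a hypothetical helper, labels are simplified to single key/value pairs, and the pod data is made up.

```python
import random
from typing import Optional


def pick_targets(pods: dict[str, dict[str, str]], label: tuple[str, str],
                 exclude: Optional[tuple[str, str]], count: int,
                 rng: random.Random) -> list[str]:
    """Pick up to `count` pods that match `label` and do not match `exclude`."""
    key, value = label
    candidates = [
        name for name, labels in sorted(pods.items())
        if labels.get(key) == value
        and not (exclude and labels.get(exclude[0]) == exclude[1])
    ]
    return rng.sample(candidates, min(count, len(candidates)))


pods = {
    "etcd-0": {"k8s-app": "etcd"},
    "etcd-1": {"k8s-app": "etcd", "protected": "true"},  # excluded below
    "etcd-2": {"k8s-app": "etcd"},
    "web-0": {"app": "web"},                             # label does not match
}
picked = pick_targets(pods, ("k8s-app", "etcd"), ("protected", "true"),
                      count=1, rng=random.Random(0))
print(picked)
```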
How to Use Plugin Name
Add the plugin name to the chaos_scenarios list in the config/config.yaml file
kraken:
    kubeconfig_path: ~/.kube/config # Path to kubeconfig
    ..
    chaos_scenarios:
        - container_scenarios:
            - scenarios/<scenario_name>.yaml
Note
You can specify multiple scenario files of the same type by adding additional paths to the list:
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
    chaos_scenarios:
        - container_scenarios:
            - scenarios/container-kill.yaml
        - pod_disruption_scenarios:
            - scenarios/pod-kill.yaml
        - node_scenarios:
            - scenarios/node-reboot.yaml
        - container_scenarios: # Same type can appear multiple times
            - scenarios/container-kill-2.yaml
Run
python run_kraken.py --config config/config.yaml
This scenario disrupts the containers matching the label in the specified namespace on a Kubernetes/OpenShift cluster.
Run
If you enable Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run \
--name=<container_name> \
--net=host \
--pull=always \
--env-host=true \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:container-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option, you must set each environment variable on the podman command line, like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) \
--name=<container_name> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:container-scenarios
$ docker run \
-e <VARIABLE>=<value> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:container-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v ~/kubeconfig:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
Or on the command line, for example:
-e <VARIABLE>=<value>
See the list of variables that apply to all scenarios here; they can be set in addition to these scenario-specific variables.
Parameter
Description
Default
NAMESPACE
Targeted namespace in the cluster
openshift-etcd
LABEL_SELECTOR
Label of the container(s) to target
k8s-app=etcd
EXCLUDE_LABEL
Pods to exclude after getting list of pods from LABEL_SELECTOR to target. For example “app=foo”
No default
DISRUPTION_COUNT
Number of containers to disrupt
1
CONTAINER_NAME
Name of the container to disrupt
etcd
ACTION
kill signal to run. For example 1 ( hang up ) or 9
1
EXPECTED_RECOVERY_TIME
Time to wait before checking if all containers that were affected recover properly
60
Note
When using a custom metrics profile or alerts profile with CAPTURE_METRICS or ENABLE_ALERTS enabled, mount the profile from the host on which the container runs (using podman/docker) under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts, respectively.
Pods to exclude from targeting. For example “{app: foo}”
string
No
""
--disruption-count
Number of containers to disrupt
number
No
1
--container-name
Name of the container to disrupt
string
No
etcd
--action
kill signal to run. For example 1 ( hang up ) or 9
string
No
1
--expected-recovery-time
Time to wait before checking if all containers that were affected recover properly
number
No
60
Behavior Notes
Recovery monitoring: After disrupting containers, krkn monitors for recovery up to --expected-recovery-time seconds. If any containers remain unrecovered after the timeout, the scenario reports failure.
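The recovery check described above amounts to polling with a deadline. Here is a minimal sketch of that pattern. `wait_for_recovery` is a hypothetical helper, not Krkn's implementation; the clock and sleep functions are injectable so the example runs instantly with a simulated clock.

```python
import time
from typing import Callable


def wait_for_recovery(is_recovered: Callable[[], bool], timeout: float,
                      interval: float = 1.0,
                      clock: Callable[[], float] = time.monotonic,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll is_recovered until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if is_recovered():
            return True
        if clock() >= deadline:
            return False   # still unrecovered at the timeout: report failure
        sleep(interval)


# Simulated run: the container becomes ready on the third poll.
polls = iter([False, False, True])
now = [0.0]
ok = wait_for_recovery(lambda: next(polls), timeout=60, interval=5,
                       clock=lambda: now[0],
                       sleep=lambda s: now.__setitem__(0, now[0] + s))
print(ok)  # True
```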
To see all available scenario options
krknctl run container-scenarios --help
Demo
See a demo of this scenario:
7 - DNS Outage Scenarios
This scenario blocks all outgoing DNS traffic from a specific pod, effectively preventing it from resolving any hostnames or service names.
How to Run DNS Outage Scenarios
Choose your preferred method to run DNS outage scenarios:
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
    chaos_scenarios:
        - network_chaos_ng_scenarios:
            - scenarios/dns-outage.yaml
        - pod_disruption_scenarios:
            - scenarios/pod-kill.yaml
        - container_scenarios:
            - scenarios/container-kill.yaml
        - network_chaos_ng_scenarios: # Same type can appear multiple times
            - scenarios/dns-outage-2.yaml
This scenario creates an outgoing firewall rule on specific nodes in your cluster, chosen by node name or a selector. This rule blocks connections to AWS EFS, leading to a temporary failure of any EFS volumes mounted on those affected nodes.
How to Run EFS Disruption Scenarios
Choose your preferred method to run EFS disruption scenarios:
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
    chaos_scenarios:
        - network_chaos_ng_scenarios:
            - scenarios/efs-disruption.yaml
        - pod_disruption_scenarios:
            - scenarios/pod-kill.yaml
        - node_scenarios:
            - scenarios/node-reboot.yaml
        - network_chaos_ng_scenarios: # Same type can appear multiple times
            - scenarios/efs-disruption-2.yaml
Run
python run_kraken.py --config config/config.yaml
This scenario disrupts a targeted zone in the public cloud by blocking egress and ingress traffic to understand the impact on both Kubernetes/OpenShift platforms control plane as well as applications running on the worker nodes in that zone. More information is documented here
This scenario isolates an etcd node by blocking its network traffic. This action forces an etcd leader re-election. Once the scenario concludes, the cluster should temporarily exhibit a split-brain condition, with two etcd leaders active simultaneously. This is particularly useful for testing the etcd cluster’s resilience under such a challenging state.
DANGER
This scenario carries a significant risk: it might break the cluster API, making it impossible to automatically revert the applied network rules. The iptables rules will be printed to the console, allowing for manual reversal via a shell on the affected node. This scenario is best suited for disposable clusters and should be used at your own risk.
How to Run ETCD Split Brain Scenarios
Choose your preferred method to run ETCD split brain scenarios:
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
    chaos_scenarios:
        - network_chaos_ng_scenarios:
            - scenarios/etcd-split-brain.yaml
        - pod_disruption_scenarios:
            - scenarios/pod-kill.yaml
        - node_scenarios:
            - scenarios/node-reboot.yaml
        - network_chaos_ng_scenarios: # Same type can appear multiple times
            - scenarios/etcd-split-brain-2.yaml
Run
python run_kraken.py --config config/config.yaml
Hog Scenarios are designed to push the limits of memory, CPU, or I/O on one or more nodes in your cluster. They also serve to evaluate whether your cluster can withstand rogue pods that excessively consume resources without any limits.
These scenarios involve deploying one or more workloads in the cluster. Based on the specific configuration, these workloads will use a predetermined amount of resources for a specified duration.
Config Options
Common options
Option
Type
Description
duration
number
the duration of the stress test in seconds
workers
number (Optional)
the number of threads instantiated by stress-ng, if left empty the number of workers will match the number of available cores in the node.
hog-type
string (Enum)
can be cpu, memory or io.
image
string
the container image of the stress workload (quay.io/krkn-chaos/krkn-hog)
namespace
string
the namespace where the stress workload will be deployed
node-selector
string (Optional)
defines the node selector for choosing target nodes. If not specified, one schedulable node in the cluster will be chosen at random. If multiple nodes match the selector, all of them will be subjected to stress. If number-of-nodes is specified, that many nodes will be randomly selected from those identified by the selector.
taints
list (Optional) default []
list of taints for which tolerations need to be created. Example: [“node-role.kubernetes.io/master:NoSchedule”]
number-of-nodes
number (Optional)
restricts the number of selected nodes by the selector
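The node-selector and number-of-nodes rules in the table above can be summarized in a few lines. This is a sketch of the documented behavior, not Krkn's code: `select_nodes` is a hypothetical helper, and the node names are placeholders.

```python
import random
from typing import Optional


def select_nodes(schedulable: list[str], matching: Optional[list[str]],
                 number_of_nodes: Optional[int],
                 rng: random.Random) -> list[str]:
    """Apply the node-selector / number-of-nodes rules.

    `matching` is None when no node-selector was given; otherwise it is
    the list of nodes that the selector matched.
    """
    if matching is None:
        return [rng.choice(schedulable)]          # no selector: one random node
    if number_of_nodes is None or number_of_nodes >= len(matching):
        return list(matching)                     # selector only: every match
    return rng.sample(matching, number_of_nodes)  # random subset of the matches


nodes = ["worker-1", "worker-2", "worker-3"]
rng = random.Random(1)
print(select_nodes(nodes, None, None, rng))                  # one random node
print(select_nodes(nodes, ["worker-1", "worker-2"], None, rng))  # both matches
print(select_nodes(nodes, nodes, 2, rng))                    # two of the three
```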
Krkn supports rollback for all available Hog scenarios. For more details, please refer to the Rollback Scenarios documentation.
10.1 - CPU Hog Scenario
Overview
The CPU Hog scenario is designed to create CPU pressure on one or more nodes in your Kubernetes/OpenShift cluster for a specified duration. This scenario helps you test how your cluster and applications respond to high CPU utilization.
How It Works
The scenario deploys a stress workload pod on targeted nodes. These pods use stress-ng to consume CPU resources according to your configuration. The workload runs for a specified duration and then terminates, allowing you to observe your cluster’s behavior under CPU stress.
When to Use
Use the CPU Hog scenario to:
Test your cluster’s ability to handle CPU resource contention
Validate that CPU resource limits and quotas are properly configured
Evaluate the impact of CPU pressure on application performance
Test whether your monitoring and alerting systems properly detect CPU saturation
Verify that the Kubernetes scheduler correctly handles CPU-constrained nodes
Simulate scenarios where rogue pods consume excessive CPU without limits
In addition to the common hog scenario options, you can set the options below in your scenario configuration to specify the amount of CPU to hog on a given worker node
Option
Type
Description
cpu-load-percentage
number
the amount of cpu that will be consumed by the hog
cpu-method
string
reflects the cpu load strategy adopted by stress-ng, please refer to the stress-ng documentation for all the available options
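As a rough guide to how these options relate to stress-ng, the sketch below maps them onto stress-ng's command-line flags (`--cpu`, `--cpu-load`, `--cpu-method`, `--timeout`). How the krkn-hog image actually invokes stress-ng is an assumption here, and `stress_ng_cpu_args` is a hypothetical helper.

```python
def stress_ng_cpu_args(duration: int, load_percentage: int,
                       method: str = "all", workers: int = 0) -> list[str]:
    """Build a stress-ng command line for a CPU hog.

    workers=0 asks stress-ng to start one worker per online CPU.
    """
    if not 0 <= load_percentage <= 100:
        raise ValueError("load_percentage must be between 0 and 100")
    return [
        "stress-ng",
        "--cpu", str(workers),                # number of CPU stressor workers
        "--cpu-load", str(load_percentage),   # target load per worker, in percent
        "--cpu-method", method,               # stress-ng CPU load strategy
        "--timeout", f"{duration}s",          # stop after this many seconds
    ]


print(" ".join(stress_ng_cpu_args(duration=60, load_percentage=50)))
# stress-ng --cpu 0 --cpu-load 50 --cpu-method all --timeout 60s
```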
Usage
To enable hog scenarios, edit the kraken config file: go to the kraken -> chaos_scenarios section of the YAML structure, add a new element named hog_scenarios to the list, then add the desired scenario pointing to the hog.yaml file.
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
    chaos_scenarios:
        - hog_scenarios:
            - scenarios/kube/cpu-hog.yml
        - pod_disruption_scenarios:
            - scenarios/pod-kill.yaml
        - node_scenarios:
            - scenarios/node-reboot.yaml
        - hog_scenarios: # Same type can appear multiple times
            - scenarios/kube/cpu-hog-2.yml
Run
python run_kraken.py --config config/config.yaml
This scenario hogs the CPU on the specified node on a Kubernetes/OpenShift cluster for a specified duration. For more information, refer to the following documentation.
Run
If you enable Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run \
--name=<container_name> \
--net=host \
--pull=always \
--env-host=true \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:node-cpu-hog
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option, you must set each environment variable on the podman command line, like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) \
--name=<container_name> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:node-cpu-hog
$ docker run \
-e <VARIABLE>=<value> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:node-cpu-hog
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v ~/kubeconfig:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
Or on the command line, for example:
-e <VARIABLE>=<value>
See the list of variables that apply to all scenarios here; they can be set in addition to these scenario-specific variables.
| Parameter | Description | Default |
| --- | --- | --- |
| TOTAL_CHAOS_DURATION | Set chaos duration (in sec) as desired | 60 |
| NODE_CPU_CORE | Number of cores (workers) of node CPU to be consumed | 2 |
| NODE_CPU_PERCENTAGE | Percentage of total CPU to be consumed | 50 |
| NAMESPACE | Namespace where the scenario container will be deployed | default |
| NODE_SELECTOR | Defines the node selector for choosing target nodes. If not specified, one schedulable node in the cluster will be chosen at random. If multiple nodes match the selector, all of them will be subjected to stress. If number-of-nodes is specified, that many nodes will be randomly selected from those identified by the selector. | "" |
| TAINTS | List of taints for which tolerations need to be created. Example: ["node-role.kubernetes.io/master:NoSchedule"] | [] |
| NUMBER_OF_NODES | Restricts the number of nodes selected by the selector | "" |
| IMAGE | The container image of the stress workload | quay.io/krkn-chaos/krkn-hog |
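The interaction between NODE_SELECTOR and NUMBER_OF_NODES described above can be sketched as follows. This is an illustration of the documented rules, not krkn's code: the function `select_target_nodes` is hypothetical, the input is assumed to contain only schedulable nodes, and the selector is assumed to be a single key=value label.

```python
import random
from typing import Dict, List, Optional


def select_target_nodes(nodes: Dict[str, Dict[str, str]], selector: str = "",
                        number_of_nodes: Optional[int] = None,
                        rng: Optional[random.Random] = None) -> List[str]:
    """Pick target node names from a mapping of node name to labels."""
    rng = rng or random.Random()
    names = sorted(nodes)
    if not selector:
        # No selector: one node is chosen at random.
        return [rng.choice(names)] if names else []
    key, _, value = selector.partition("=")
    matching = [n for n in names if nodes[n].get(key) == value]
    if number_of_nodes is not None and number_of_nodes < len(matching):
        # Limit set: a random subset of the matching nodes.
        return sorted(rng.sample(matching, number_of_nodes))
    # Selector only: every matching node is stressed.
    return matching


if __name__ == "__main__":
    cluster = {"w1": {"role": "worker"}, "w2": {"role": "worker"}, "m1": {"role": "master"}}
    print(select_target_nodes(cluster, "role=worker"))
    print(select_target_nodes(cluster, "role=worker", number_of_nodes=1))
```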
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
Number of cores (workers) of node CPU to be consumed
number
No
--cpu-percentage
Percentage of total cpu to be consumed
number
No
50
--namespace
Namespace where the scenario container will be deployed
string
No
default
--node-selector
Node selector that determines where the scenario containers will be scheduled, in the format "<key>=<value>". NOTE: One container is instantiated per selected node, with the same scenario options. If left empty, a random node will be selected
string
No
--taints
List of taints for which tolerations need to be created. For example ["node-role.kubernetes.io/master:NoSchedule"]
string
No
[]
--number-of-nodes
Restricts the number of nodes selected by the selector
number
No
--image
The hog container image. Can be changed if the hog image is mirrored on a private repository
string
No
quay.io/krkn-chaos/krkn-hog
To see all available scenario options
krknctl run node-cpu-hog --help
10.2 - IO Hog Scenario
Overview
The IO Hog scenario is designed to create disk I/O pressure on one or more nodes in your Kubernetes/OpenShift cluster for a specified duration. This scenario helps you test how your cluster and applications respond to high disk I/O utilization and storage-related bottlenecks.
How It Works
The scenario deploys a stress workload pod on targeted nodes. These pods use stress-ng to perform intensive write operations to disk, consuming I/O resources according to your configuration. The scenario supports attaching node paths to the pod as a hostPath volume or using custom pod volume definitions, allowing you to test I/O pressure on specific storage targets.
When to Use
Use the IO Hog scenario to:
Test your cluster’s behavior under disk I/O pressure
Validate that I/O resource limits are properly configured
Evaluate the impact of disk I/O contention on application performance
Test whether your monitoring systems properly detect disk saturation
Verify that storage performance meets requirements under stress
Simulate scenarios where pods perform excessive disk writes
Test the resilience of persistent volume configurations
The size of each individual write operation performed by the stressor
io-write-bytes
string
The total amount of data that will be written by the stressor. Can be specified as a percentage (%) of free space on the filesystem or in absolute units (b, k, m, g for Bytes, KBytes, MBytes, GBytes)
io-target-pod-folder
string
The path within the pod where the volume will be mounted
io-target-pod-volume
dictionary
The pod volume definition that will be stressed by the scenario (typically a hostPath volume)
WARNING
Modifying the structure of io-target-pod-volume might alter how the hog operates, potentially rendering it ineffective.
Example Values
io-block-size: "1m" - Write in 1 megabyte blocks
io-block-size: "4k" - Write in 4 kilobyte blocks
io-write-bytes: "50%" - Write data equal to 50% of available free space
io-write-bytes: "10g" - Write 10 gigabytes of data
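To make the size notation concrete, the sketch below converts these values into byte counts. The helper `parse_size` is hypothetical, and it assumes binary multiples (1k = 1024 bytes); percentages are resolved against a free-space figure supplied by the caller.

```python
UNITS = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(value: str, free_bytes: int = 0) -> int:
    """Convert '4k', '1m', '10g' or '50%' into a number of bytes."""
    text = value.strip().lower()
    if text.endswith("%"):
        # Percentages are relative to the free space passed in by the caller.
        return int(free_bytes * float(text[:-1]) / 100)
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)  # a bare number is treated as bytes


if __name__ == "__main__":
    for example in ("1m", "4k", "10g"):
        print(example, "=", parse_size(example), "bytes")
    print("50% of 200 GiB free =", parse_size("50%", free_bytes=200 * 1024 ** 3), "bytes")
```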
How to Run IO Hog Scenarios
Choose your preferred method to run IO hog scenarios:
To enable this plugin, add a pointer to the scenario input file scenarios/kube/io-hog.yaml, as described in the Usage section.
In addition to the common hog scenario options, you can specify the options below in your scenario configuration to target specific pod I/O:
| Option | Type | Description |
| --- | --- | --- |
| io-block-size | string | The block size written by the stressor |
| io-write-bytes | string | The total amount of data that will be written by the stressor. The size can be specified as % of free space on the file system or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g |
| io-target-pod-folder | string | The folder where the volume will be mounted in the pod |
| io-target-pod-volume | dictionary | The pod volume definition that will be stressed by the scenario |
WARNING
Modifying the structure of io-target-pod-volume might alter how the hog operates, potentially rendering it ineffective.
Usage
To enable hog scenarios, edit the kraken config file: go to the kraken -> chaos_scenarios section of the YAML structure, add a new element named hog_scenarios to the list, then add the desired scenario pointing to the hog.yaml file.
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
    chaos_scenarios:
        - hog_scenarios:
            - scenarios/kube/io-hog.yml
        - pod_disruption_scenarios:
            - scenarios/pod-kill.yaml
        - node_scenarios:
            - scenarios/node-reboot.yaml
        - hog_scenarios:  # Same type can appear multiple times
            - scenarios/kube/io-hog-2.yml
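The layout above is a list of single-key mappings, which is why the same scenario type can appear more than once and why order is preserved. The sketch below models that structure with plain Python dictionaries; the helpers `add_scenario` and `list_scenarios` are hypothetical and only illustrate the shape of the data.

```python
from typing import Dict, List, Tuple


def add_scenario(config: Dict, scenario_type: str, path: str) -> Dict:
    """Append a new {type: [path]} element to kraken -> chaos_scenarios."""
    scenarios = config.setdefault("kraken", {}).setdefault("chaos_scenarios", [])
    scenarios.append({scenario_type: [path]})
    return config


def list_scenarios(config: Dict) -> List[Tuple[str, str]]:
    """Return (type, path) pairs in the order they appear in the config."""
    pairs = []
    for entry in config.get("kraken", {}).get("chaos_scenarios", []):
        for scenario_type, paths in entry.items():
            pairs.extend((scenario_type, p) for p in paths)
    return pairs


if __name__ == "__main__":
    cfg: Dict = {}
    add_scenario(cfg, "hog_scenarios", "scenarios/kube/io-hog.yml")
    add_scenario(cfg, "pod_disruption_scenarios", "scenarios/pod-kill.yaml")
    add_scenario(cfg, "hog_scenarios", "scenarios/kube/io-hog-2.yml")
    for scenario_type, path in list_scenarios(cfg):
        print(scenario_type, "->", path)
```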
Run
python run_kraken.py --config config/config.yaml
This scenario hogs the I/O on the specified node of a Kubernetes/OpenShift cluster for a specified duration. For more information, refer to the following documentation.
Run
If you enable Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable so the chaos injection container connects to it automatically.
$ podman run \
--name=<container_name> \
--net=host \
--pull=always \
--env-host=true \
-v <path-to-kube-config>:/root/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:node-io-hog
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option, you have to set each environment variable on the podman command line, like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) \
--name=<container_name> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/root/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:node-io-hog
$ docker run \
-e <VARIABLE>=<value> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/root/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:node-io-hog
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v ~/kubeconfig:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
Or on the command line, for example:
-e <VARIABLE>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| TOTAL_CHAOS_DURATION | Set chaos duration (in sec) as desired | number | 180 |
| IO_BLOCK_SIZE | Size of each write in bytes. Size can be from 1 byte to 4m | string | 1m |
| IO_WORKERS | Number of stressors | number | 5 |
| IO_WRITE_BYTES | Writes N bytes for each hdd process. The size can be expressed as % of free space on the file system or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g | string | 10m |
| NAMESPACE | Namespace where the scenario container will be deployed | string | default |
| NODE_SELECTOR | Defines the node selector for choosing target nodes. If not specified, one schedulable node in the cluster will be chosen at random. If multiple nodes match the selector, all of them will be subjected to stress. If number-of-nodes is specified, that many nodes will be randomly selected from those identified by the selector. | string | "" |
| TAINTS | List of taints for which tolerations need to be created. Example: ["node-role.kubernetes.io/master:NoSchedule"] | string | [] |
| NODE_MOUNT_PATH | The local path in the node that will be mounted in the pod and that will be filled by the scenario | string | /root |
| NUMBER_OF_NODES | Restricts the number of nodes selected by the selector | number | "" |
| IMAGE | The container image of the stress workload | string | quay.io/krkn-chaos/krkn-hog |
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
Size of each write in bytes. Size can be from 1 byte to 4 Megabytes (allowed suffixes are b, k, m)
string
No
1m
--io-workers
Number of stressor instances
number
No
5
--io-write-bytes
Writes N bytes for each hdd process. The size can be expressed as % of free space on the file system or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g
string
No
10m
--node-mount-path
The path in the node that will be mounted in the pod and where the IO hog will be executed. NOTE: Make sure that the kubelet has the rights to write to that node path
string
No
/root
--namespace
Namespace where the scenario container will be deployed
string
No
default
--node-selector
Node selector that determines where the scenario containers will be scheduled, in the format "<key>=<value>". NOTE: One container is instantiated per selected node, with the same scenario options. If left empty, a random node will be selected
string
No
--taints
List of taints for which tolerations need to be created. For example ["node-role.kubernetes.io/master:NoSchedule"]
string
No
[]
--number-of-nodes
Restricts the number of nodes selected by the selector
number
No
--image
The hog container image. Can be changed if the hog image is mirrored on a private repository
string
No
quay.io/krkn-chaos/krkn-hog
To see all available scenario options
krknctl run node-io-hog --help
10.3 - Memory Hog Scenario
Overview
The Memory Hog scenario is designed to create virtual memory pressure on one or more nodes in your Kubernetes/OpenShift cluster for a specified duration. This scenario helps you test how your cluster and applications respond to memory exhaustion and pressure conditions.
How It Works
The scenario deploys a stress workload pod on targeted nodes. These pods use stress-ng to allocate and consume memory resources according to your configuration. The workload runs for a specified duration, allowing you to observe how your cluster handles memory pressure, OOM (Out of Memory) conditions, and eviction scenarios.
When to Use
Use the Memory Hog scenario to:
Test your cluster’s behavior under memory pressure
Validate that memory resource limits and quotas are properly configured
Test pod eviction policies when nodes run out of memory
Verify that the kubelet correctly evicts pods based on memory pressure
Evaluate the impact of memory contention on application performance
Test whether your monitoring systems properly detect memory saturation
Simulate scenarios where rogue pods consume excessive memory without limits
Validate that memory-based horizontal pod autoscaling works correctly
The amount of memory that the scenario will attempt to allocate and consume. Can be specified as a percentage (%) of available memory or in absolute units (b, k, m, g for Bytes, KBytes, MBytes, GBytes)
Example Values
memory-vm-bytes: "80%" - Consume 80% of available memory
memory-vm-bytes: "2g" - Consume 2 gigabytes of memory
memory-vm-bytes: "512m" - Consume 512 megabytes of memory
How to Run Memory Hog Scenarios
Choose your preferred method to run memory hog scenarios:
To enable this plugin, add a pointer to the scenario input file scenarios/kube/memory-hog.yml, as described in the Usage section.
In addition to the common hog scenario options, you can specify the option below in your scenario configuration to set the amount of memory to hog on a given worker node:
| Option | Type | Description |
| --- | --- | --- |
| memory-vm-bytes | string | The amount of memory that the scenario will try to hog. The size can be specified as % of available memory or in units of Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g |
Usage
To enable hog scenarios, edit the kraken config file: go to the kraken -> chaos_scenarios section of the YAML structure, add a new element named hog_scenarios to the list, then add the desired scenario pointing to the hog.yaml file.
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
    chaos_scenarios:
        - hog_scenarios:
            - scenarios/kube/memory-hog.yml
        - pod_disruption_scenarios:
            - scenarios/pod-kill.yaml
        - node_scenarios:
            - scenarios/node-reboot.yaml
        - hog_scenarios:  # Same type can appear multiple times
            - scenarios/kube/memory-hog-2.yml
Run
python run_kraken.py --config config/config.yaml
This scenario hogs the memory on the specified node of a Kubernetes/OpenShift cluster for a specified duration. For more information, refer to the following documentation.
Run
If you enable Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable so the chaos injection container connects to it automatically.
$ podman run \
--name=<container_name> \
--net=host \
--pull=always \
--env-host=true \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:node-memory-hog
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option, you have to set each environment variable on the podman command line, like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) \
--name=<container_name> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:node-memory-hog
$ docker run \
-e <VARIABLE>=<value> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:node-memory-hog
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v ~/kubeconfig:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
Or on the command line, for example:
-e <VARIABLE>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
| Parameter | Description | Default |
| --- | --- | --- |
| TOTAL_CHAOS_DURATION | Set chaos duration (in sec) as desired | 60 |
| MEMORY_CONSUMPTION_PERCENTAGE | Percentage (expressed with the suffix %) or amount (expressed with the suffix b, k, m or g) of memory to be consumed by the scenario | 90% |
| NUMBER_OF_WORKERS | Total number of workers (stress-ng threads) | 1 |
| NAMESPACE | Namespace where the scenario container will be deployed | default |
| NODE_SELECTOR | Defines the node selector for choosing target nodes. If not specified, one schedulable node in the cluster will be chosen at random. If multiple nodes match the selector, all of them will be subjected to stress. If number-of-nodes is specified, that many nodes will be randomly selected from those identified by the selector. | "" |
| TAINTS | List of taints for which tolerations need to be created. Example: ["node-role.kubernetes.io/master:NoSchedule"] | [] |
| NUMBER_OF_NODES | Restricts the number of nodes selected by the selector | "" |
| IMAGE | The container image of the stress workload | quay.io/krkn-chaos/krkn-hog |
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
percentage (expressed with the suffix %) or amount (expressed with the suffix b, k, m or g) of memory to be consumed by the scenario
string
No
90%
--namespace
Namespace where the scenario container will be deployed
string
No
default
--node-selector
Node selector that determines where the scenario containers will be scheduled, in the format "<key>=<value>". NOTE: One container is instantiated per selected node, with the same scenario options. If left empty, a random node will be selected
string
No
--taints
List of taints for which tolerations need to be created. For example ["node-role.kubernetes.io/master:NoSchedule"]
string
No
[]
--number-of-nodes
Restricts the number of nodes selected by the selector
number
No
--image
The hog container image. Can be changed if the hog image is mirrored on a private repository
string
No
quay.io/krkn-chaos/krkn-hog
To see all available scenario options
krknctl run node-memory-hog --help
11 - HTTP Load Scenarios
This scenario generates distributed HTTP load against one or more target endpoints using
Vegeta load testing pods deployed inside the Kubernetes cluster.
It leverages the distributed nature of Kubernetes clusters to instantiate multiple load generator pods,
significantly increasing the effectiveness of the load test.
The scenario supports multiple concurrent pods, configurable request rates, multiple HTTP methods
(GET, POST, PUT, DELETE, PATCH, HEAD), custom headers, request bodies, and comprehensive results
collection with aggregated metrics across all pods.
The configuration allows for the specification of multiple node selectors, enabling Kubernetes to schedule
the attacker pods on a user-defined subset of nodes to make the test more realistic.
The attacker container source code is available here.
How to Run HTTP Load Scenarios
Choose your preferred method to run HTTP load scenarios:
- http_load_scenario:
    runs: 1                                            # number of times to execute the scenario
    number-of-pods: 2                                  # number of attacker pods instantiated
    namespace: default                                 # namespace to deploy load testing pods
    image: quay.io/krkn-chaos/krkn-http-load:latest    # http load attacker container image
    attacker-nodes:                                    # node affinity to schedule the attacker pods
      node-role.kubernetes.io/worker:                  # per each node label selector can be specified
        - ""                                           # multiple values so the kube scheduler will schedule
                                                       # the attacker pods in the best way possible
                                                       # set empty value `attacker-nodes: {}` to let kubernetes schedule the pods
    targets:                                           # Vegeta round-robins across all endpoints
      endpoints:                                       # supported methods: GET, POST, PUT, DELETE, PATCH, HEAD
        - url: "https://your-service.example.com/health"
          method: "GET"
        - url: "https://your-service.example.com/api/data"
          method: "POST"
          headers:
            Content-Type: "application/json"
            Authorization: "Bearer your-token"
          body: '{"key":"value"}'
    rate: "50/1s"                                      # request rate per pod: "50/1s", "1000/1m", "0" for max throughput
    duration: "30s"                                    # attack duration: "30s", "5m", "1h"
    workers: 10                                        # initial concurrent workers per pod
    max_workers: 100                                   # maximum workers per pod (auto-scales)
    connections: 100                                   # max idle connections per host
    timeout: "10s"                                     # per-request timeout
    keepalive: true                                    # use persistent HTTP connections
    http2: true                                        # enable HTTP/2
    insecure: false                                    # skip TLS verification (for self-signed certs)
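Because the rate is configured per pod, the load that reaches the targets scales with number-of-pods. The sketch below estimates the aggregate figures for a configuration such as the one above. The helper names are hypothetical, and the parsing assumes only the "N/duration" rate form and the ms, s, m and h duration units.

```python
import re
from typing import Optional

_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text: str) -> float:
    """Convert a duration such as '30s', '5m' or '1h' to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def requests_per_second(rate: str) -> Optional[float]:
    """Convert '50/1s' or '1000/1m' to requests per second; '0' (max throughput) gives None."""
    if rate.strip() == "0":
        return None
    count, _, per = rate.partition("/")
    return int(count) / parse_duration(per or "1s")


def expected_total_requests(rate: str, duration: str, pods: int) -> Optional[int]:
    """Estimate the number of requests sent by all pods over the attack duration."""
    rps = requests_per_second(rate)
    if rps is None:
        return None  # unbounded rate, so the total cannot be predicted
    return round(rps * parse_duration(duration) * pods)


if __name__ == "__main__":
    print("aggregate rps:", requests_per_second("50/1s") * 2)
    print("total requests:", expected_total_requests("50/1s", "30s", pods=2))
```

With the sample values (50 requests per second per pod, 2 pods, 30 seconds), the estimate is 100 requests per second in aggregate and 3000 requests overall.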
How to Use Plugin Name
Add the plugin name to the chaos_scenarios list in the config/config.yaml file
kraken:
    kubeconfig_path: ~/.kube/config  # Path to kubeconfig
    ..
    chaos_scenarios:
        - http_load_scenarios:
            - scenarios/<scenario_name>.yaml
Note
You can specify multiple scenario files of the same type by adding additional paths to the list:
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
    chaos_scenarios:
        - http_load_scenarios:
            - scenarios/http-load.yaml
        - pod_disruption_scenarios:
            - scenarios/pod-kill.yaml
        - syn_flood_scenarios:
            - scenarios/syn-flood.yaml
        - http_load_scenarios:  # Same type can appear multiple times
            - scenarios/http-load-2.yaml
Run
python run_kraken.py --config config/config.yaml
HTTP Load scenario
This scenario generates distributed HTTP load against one or more target endpoints using Vegeta load testing pods deployed inside the cluster.
Run
If you enable Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable so the chaos injection container connects to it automatically.
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option, you have to set each environment variable on the podman command line, like -e <VARIABLE>=<value>
TIP: Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
The node selectors are used to guide the cluster on where to deploy attacker pods. You can specify one or more labels in the format key=value;key=value2 (even using the same key) to choose one or more node categories. If left empty, the pods will be scheduled on any available node, depending on the cluster’s capacity.
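The key=value;key=value2 format can be turned into the label-to-values mapping used by the attacker-nodes option with a few lines of parsing. The function `parse_node_selectors` below is a hypothetical illustration, not the parser krkn uses.

```python
from typing import Dict, List


def parse_node_selectors(text: str) -> Dict[str, List[str]]:
    """Parse 'key=value;key=value2' into {'key': ['value', 'value2']}."""
    selectors: Dict[str, List[str]] = {}
    for item in (part.strip() for part in text.split(";")):
        if not item:
            continue  # tolerate empty input and trailing separators
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        # Repeating a key adds another accepted value for that label.
        selectors.setdefault(key, []).append(value)
    return selectors


if __name__ == "__main__":
    print(parse_node_selectors("node-role.kubernetes.io/worker=;zone=a;zone=b"))
```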
NOTE In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts. For example:
The node selectors are used to guide the cluster on where to deploy attacker pods. You can specify one or more labels in the format key=value;key=value2 (even using the same key) to choose one or more node categories. If left empty, the pods will be scheduled on any available node, depending on the cluster's capacity.
string
To see all available scenario options
krknctl run http-load --help
12 - KubeVirt VM Outage Scenario
Simulating VM-level disruptions in KubeVirt/OpenShift CNV environments
This scenario enables the simulation of VM-level disruptions in clusters where KubeVirt or OpenShift Container-native Virtualization (CNV) is installed. It allows users to delete a Virtual Machine Instance (VMI) to simulate a VM crash and test recovery capabilities.
The kubevirt_vm_outage scenario deletes a specific KubeVirt Virtual Machine Instance (VMI) to simulate a VM crash or outage. This helps users:
Test the resilience of applications running inside VMs
Verify that VM monitoring and recovery mechanisms work as expected
Validate high availability configurations for VM workloads
Understand the impact of sudden VM failures on workloads and the overall system
Prerequisites
Before using this scenario, ensure the following:
KubeVirt or OpenShift CNV is installed in your cluster
The target VMI exists and is running in the specified namespace
Your cluster credentials have sufficient permissions to delete and create VMIs
Parameters
The scenario supports the following parameters:
| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| vm_name | The name of the VMI to delete | Yes | N/A |
| namespace | The namespace where the VMI is located | No | "default" |
| timeout | How long to wait (in seconds) for the VMI to start running again before attempting recovery | No | 60 |
| kill_count | How many VMIs to kill serially | No | 1 |
Expected Behavior
When executed, the scenario will:
Validate that KubeVirt is installed and the target VMI exists
Save the initial state of the VMI
Delete the VMI
Wait for the VMI to become running or hit the timeout
Attempt to recover the VMI:
If the VMI is managed by a VirtualMachine resource with runStrategy: Always, it will automatically recover
If automatic recovery doesn’t occur, the plugin will manually recreate the VMI using the saved state
Validate that the VMI is running again
Note
If the VM is managed by a VirtualMachine resource with runStrategy: Always, KubeVirt will automatically try to recreate the VMI after deletion. In this case, the scenario will wait for this automatic recovery to complete.
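The wait step in the sequence above amounts to polling the VMI phase until it reports Running or the timeout expires. The sketch below shows one way to structure such a loop. The function `wait_until_running` and its `get_phase` callback are hypothetical; in practice the callback would query the cluster for the VMI status.

```python
import time
from typing import Callable


def wait_until_running(get_phase: Callable[[], str], timeout: float,
                       interval: float = 1.0,
                       clock: Callable[[], float] = time.monotonic,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll get_phase() until it returns 'Running' or the timeout is reached."""
    deadline = clock() + timeout
    while True:
        if get_phase() == "Running":
            return True
        if clock() >= deadline:
            return False  # the caller can fall back to recreating the VMI
        sleep(interval)


if __name__ == "__main__":
    phases = iter(["Pending", "Scheduling", "Running"])
    recovered = wait_until_running(lambda: next(phases), timeout=60, interval=0.01)
    print("recovered automatically:", recovered)
```

Injecting the clock and sleep functions keeps the loop testable without real waiting.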
Validating VMI SSH Connection
While the KubeVirt outage is running, you can enable KubeVirt checks to test the SSH connection to a list of VMIs and determine whether an outage of one VMI causes any others to become unready or unreachable.
See more details on how to enable these checks in kubevirt checks.
Advanced Use Cases
Testing High Availability VM Configurations
This scenario is particularly useful for testing high availability configurations, such as:
Clustered applications running across multiple VMs
VMs with automatic restart policies
Applications with cross-VM resilience mechanisms
Recovery Strategies
The plugin implements two recovery strategies:
Automated Recovery: If the VM is managed by a VirtualMachine resource with runStrategy: Always, the plugin will wait for KubeVirt’s controller to automatically recreate the VMI.
Manual Recovery: If automatic recovery doesn’t occur within the timeout period, the plugin will attempt to manually recreate the VMI using the saved state from before the deletion.
Recovery Time Metrics in Krkn Telemetry
Krkn tracks three key recovery time metrics for each affected VMI:
pod_rescheduling_time - The time (in seconds) that the Kubernetes cluster took to reschedule the VMI after it was deleted. This measures the cluster’s scheduling efficiency and includes the time from VMI deletion until the replacement VMI is scheduled on a node.
pod_readiness_time - The time (in seconds) the VMI took to become ready after being scheduled. This measures VMI startup time, including container image pulls, VM boot process, and readiness probe success.
total_recovery_time - The total amount of time (in seconds) from VMI deletion until the replacement VMI became fully ready and available. This is the sum of rescheduling time and readiness time.
These metrics appear in the telemetry output under PodsStatus.recovered for successfully recovered VMIs. VMIs that fail to recover within the timeout period appear under PodsStatus.unrecovered without timing data.
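The relationship between the three metrics can be illustrated with a small calculation from event timestamps. The `RecoveryTimes` class and `recovery_times` function are hypothetical; only the three metric names come from the telemetry described above.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTimes:
    pod_rescheduling_time: float  # seconds from deletion until the replacement is scheduled
    pod_readiness_time: float     # seconds from scheduling until the replacement is ready

    @property
    def total_recovery_time(self) -> float:
        """Sum of rescheduling time and readiness time."""
        return self.pod_rescheduling_time + self.pod_readiness_time


def recovery_times(deleted_at: float, scheduled_at: float, ready_at: float) -> RecoveryTimes:
    """Derive the metrics from three timestamps expressed in seconds."""
    return RecoveryTimes(
        pod_rescheduling_time=scheduled_at - deleted_at,
        pod_readiness_time=ready_at - scheduled_at,
    )


if __name__ == "__main__":
    times = recovery_times(deleted_at=100.0, scheduled_at=103.5, ready_at=120.0)
    print(times, "total:", times.total_recovery_time)
```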
Krkn supports rollback for KubeVirt VM Outage Scenario. For more details, please refer to the Rollback Scenarios documentation.
Limitations
The scenario currently supports deleting a single VMI at a time
If VM spec changes during the outage window, the manual recovery may not reflect those changes
The scenario doesn’t simulate partial VM failures (e.g., VM freezing) - only complete VM outage
Troubleshooting
If the scenario fails, check the following:
Ensure KubeVirt/CNV is properly installed in your cluster
Verify that the target VMI exists and is running
Check that your credentials have sufficient permissions to delete and create VMIs
Examine the logs for specific error messages
How to Run KubeVirt VM Outage Scenarios
Choose your preferred method to run KubeVirt VM outage scenarios:
KubeVirt VM Outage Scenario in Kraken
The kubevirt_vm_outage scenario in Kraken enables users to simulate VM-level disruptions by deleting a Virtual Machine Instance (VMI) to test resilience and recovery capabilities.
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
    chaos_scenarios:
        - kubevirt_vm_outage:
            - scenarios/kubevirt/kubevirt-vm-outage.yaml
        - pod_disruption_scenarios:
            - scenarios/pod-kill.yaml
        - node_scenarios:
            - scenarios/node-reboot.yaml
        - kubevirt_vm_outage:  # Same type can appear multiple times
            - scenarios/kubevirt/kubevirt-vm-outage-2.yaml
Run
python run_kraken.py --config config/config.yaml
This scenario deletes a VMI matching the namespace and name on a Kubernetes/OpenShift cluster.
Run
If you enable Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable so the chaos injection container connects to it automatically.
$ podman run \
--name=<container_name> \
--net=host \
--pull=always \
--env-host=true \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:kubevirt-outage
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option, you have to set each environment variable on the podman command line, like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) \
--name=<container_name> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:kubevirt-outage
$ docker run \
-e <VARIABLE>=<value> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:kubevirt-outage
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v ~/kubeconfig:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
Or on the command line, for example:
-e <VARIABLE>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
| Parameter | Description | Type | Default |
|---|---|---|---|
| NAMESPACE | VMI Namespace to target | string | "" |
| VM_NAME | VMI name to delete, supports regex | string | "" |
| TIMEOUT | Timeout to wait for VMI to start running again, will fail if timeout is hit | number | 60 |
| KILL_COUNT | Number of VMIs to kill (will perform serially) | number | 1 |
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
Scenario specific parameters: (be sure to scroll to right)
| Parameter | Description | Type | Required | Default | Possible Values |
|---|---|---|---|---|---|
| --namespace | VMI Namespace to target | string | Yes | default | |
| --vm-name | Name of the VM to delete | string | Yes | | |
| --timeout | Time that scenario will wait for VM to come back | number | No | 60 | |
| --kill-count | Number of VMIs to kill (will perform serially) | number | No | 1 | |
Behavior Notes
VM recovery: After krkn deletes the VM, the KubeVirt controller automatically recreates the VMI unless runStrategy is set to Manual. The --timeout parameter controls how long krkn waits for the VM to come back before reporting failure.
ManagedCluster scenarios leverage ManifestWorks to inject faults into the ManagedClusters.
The following ManagedCluster chaos scenarios are supported:
managedcluster_start_scenario: Scenario to start the ManagedCluster instance.
managedcluster_stop_scenario: Scenario to stop the ManagedCluster instance.
managedcluster_stop_start_scenario: Scenario to stop and then start the ManagedCluster instance.
start_klusterlet_scenario: Scenario to start the klusterlet of the ManagedCluster instance.
stop_klusterlet_scenario: Scenario to stop the klusterlet of the ManagedCluster instance.
stop_start_klusterlet_scenario: Scenario to stop and start the klusterlet of the ManagedCluster instance.
ManagedCluster scenarios can be injected by placing the ManagedCluster scenarios config files under managedcluster_scenarios option in the Kraken config. Refer to managedcluster_scenarios_example config file.
managedcluster_scenarios:
  - actions:                        # ManagedCluster chaos scenarios to be injected
      - managedcluster_stop_start_scenario
    managedcluster_name: cluster1   # ManagedCluster on which scenario has to be injected; can set multiple names separated by comma
    # label_selector:               # When managedcluster_name is not specified, a ManagedCluster with matching label_selector is selected for ManagedCluster chaos scenario injection
    instance_count: 1               # Number of managedcluster to perform action/select that match the label selector
    runs: 1                         # Number of times to inject each scenario under actions (will perform on same ManagedCluster each time)
    timeout: 420                    # Duration to wait for completion of ManagedCluster scenario injection
                                    # For OCM to detect a ManagedCluster as unavailable, have to wait 5*leaseDurationSeconds
                                    # (default leaseDurationSeconds = 60 sec)
  - actions:
      - stop_start_klusterlet_scenario
    managedcluster_name: cluster1
    # label_selector:
    instance_count: 1
    runs: 1
    timeout: 60
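The timeout comment in the first scenario is an arithmetic constraint: OCM needs 5 * leaseDurationSeconds before it reports a ManagedCluster as unavailable. A small sketch of that check follows; the function names are illustrative, only the factor 5 and the 60-second default come from the comment above.

```python
def min_detection_wait(lease_duration_seconds: int = 60, multiplier: int = 5) -> int:
    """Seconds OCM needs before it marks a ManagedCluster as unavailable."""
    return multiplier * lease_duration_seconds


def timeout_margin(timeout: int, lease_duration_seconds: int = 60) -> int:
    """Seconds left in the scenario timeout once detection has happened (negative if too short)."""
    return timeout - min_detection_wait(lease_duration_seconds)
```

With the defaults, detection takes 300 seconds, so the timeout of 420 in the first scenario leaves a 120-second margin.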
14 - Network Chaos NG Scenarios
This scenario introduces a new infrastructure to refactor and port the current implementation of the network chaos plugins.
All plugins must implement the AbstractNetworkChaosModule abstract class in order to be instantiated and run by the Network Chaos NG plugin.
This abstract class declares two main abstract methods:
run(self, target: str, kubecli: KrknTelemetryOpenshift, error_queue: queue.Queue = None) is the entrypoint for each Network Chaos module.
If the module is configured to run in parallel, error_queue must not be None.
target: the name of the resource (Pod, Node, etc.) that will be targeted by the scenario
kubecli: the KrknTelemetryOpenshift instance the scenario needs to access the krkn-lib methods
error_queue: a queue that the plugin uses to push the errors raised during the execution of parallel modules
get_config(self) -> (NetworkChaosScenarioType, BaseNetworkChaosConfig) returns the common subset of settings shared by all the scenarios (BaseNetworkChaosConfig) and the type of Network Chaos Scenario that is running (Pod Scenario or Node Scenario)
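To make the contract concrete, here is a self-contained sketch of a module that follows the two method signatures described above. The three classes at the top are local stand-ins written for this example (the real ones live in krkn and have more fields), kubecli is typed as a plain object, and NoopNodeModule is hypothetical: it records its target instead of injecting a fault.

```python
import abc
import enum
import queue
from dataclasses import dataclass
from typing import List, Optional, Tuple


class NetworkChaosScenarioType(enum.Enum):  # stand-in for krkn's enum
    Node = 1
    Pod = 2


@dataclass
class BaseNetworkChaosConfig:  # stand-in: only a few of the common fields
    id: str
    test_duration: int
    execution: str


class AbstractNetworkChaosModule(abc.ABC):  # stand-in for krkn's abstract class
    @abc.abstractmethod
    def run(self, target: str, kubecli: object,
            error_queue: Optional[queue.Queue] = None) -> None: ...

    @abc.abstractmethod
    def get_config(self) -> Tuple[NetworkChaosScenarioType, BaseNetworkChaosConfig]: ...


class NoopNodeModule(AbstractNetworkChaosModule):
    """Hypothetical module that records its target instead of injecting a fault."""

    def __init__(self, config: BaseNetworkChaosConfig) -> None:
        self.config = config
        self.seen: List[str] = []

    def run(self, target, kubecli, error_queue=None):
        try:
            if not target:
                raise ValueError("target must not be empty")
            self.seen.append(target)
        except Exception as exc:
            if error_queue is None:
                raise  # serial run: let the caller see the exception
            error_queue.put(str(exc))  # parallel run: report through the queue

    def get_config(self):
        return NetworkChaosScenarioType.Node, self.config
```

The try/except in run shows one way to honor the error_queue rule: errors are raised directly in serial mode and pushed to the queue in parallel mode.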
BaseNetworkChaosConfig base module configuration
The base class that contains the common parameters shared by all the Network Chaos NG modules.
id: the string name of the Network Chaos NG module
wait_duration: if there is more than one network module config in the same config file, the plugin waits wait_duration seconds before running the next one
test_duration: the duration of the scenario in seconds
label_selector: the selector used to target the resource
instance_count: if greater than 0, randomly picks instance_count elements from the targets selected by the filters
execution: if the selector matches more than one target, the scenario can target the resources either in serial or in parallel
namespace: the namespace where the scenario workloads will be deployed
service_account: optional service account for the scenario workload (empty string uses the cluster default)
taints: list of taints for which tolerations need to be created. Example: ["node-role.kubernetes.io/master:NoSchedule"]
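The interaction between instance_count and execution can be sketched as follows. This is not krkn's implementation: pick_targets and execute are hypothetical helpers, and returning every candidate when instance_count is 0 or exceeds the number of candidates is an assumption.

```python
import queue
import random
import threading
from typing import Callable, List, Optional


def pick_targets(candidates: List[str], instance_count: int,
                 rng: Optional[random.Random] = None) -> List[str]:
    """Randomly keep instance_count candidates; keep all when the count is 0 or too large."""
    rng = rng or random.Random()
    if 0 < instance_count < len(candidates):
        return rng.sample(candidates, instance_count)
    return list(candidates)


def execute(targets: List[str], run: Callable[[str, queue.Queue], None],
            execution: str = "serial") -> List[str]:
    """Run the scenario on each target and return the errors pushed to the queue."""
    errors: queue.Queue = queue.Queue()
    if execution == "parallel":
        threads = [threading.Thread(target=run, args=(t, errors)) for t in targets]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
    elif execution == "serial":
        for target in targets:
            run(target, errors)
    else:
        raise ValueError(f"unknown execution mode: {execution!r}")
    collected = []
    while not errors.empty():
        collected.append(errors.get())
    return collected
```

A shared queue lets parallel workers report failures without raising inside their threads, which is why the queue is mandatory in parallel mode.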
14.2 - Node Interface Down
Brings one or more network interfaces down on a target node for a configurable duration, then restores them. Can be used to simulate network partitions, NIC failures, or loss of connectivity at the node level.
How to Run Node Interface Down Scenarios
Choose your preferred method to run node interface down scenarios:
- id: node_interface_down
  image: quay.io/krkn-chaos/krkn-network-chaos:latest
  wait_duration: 0
  test_duration: 60
  label_selector: "node-role.kubernetes.io/worker="
  instance_count: 1
  execution: parallel
  namespace: default
  # scenario specific settings
  target: ""
  interfaces: []
  recovery_time: 30
  taints: []
For the common module settings please refer to the documentation.
target: the node name to target (used when label_selector is not set)
interfaces: a list of network interface names to bring down (e.g. ["eth0", "bond0"]). Leave empty to auto-detect the node’s default interface
recovery_time: seconds to wait after bringing the interface(s) back up before continuing. Set to 0 to skip the recovery wait
Usage
To enable node interface down scenarios, edit the kraken config file: go to the kraken -> chaos_scenarios section of the YAML structure, add a new element to the list named network_chaos_ng_scenarios, then add the desired scenario pointing to the scenario yaml file.
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
$ podman run --name=<container_name> --net=host --pull=always --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-interface-down
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-interface-down
OR
$ docker run -e <VARIABLE>=<value> --net=host --pull=always -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-interface-down
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
TIP: Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
ex.)
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
| Parameter | Description | Default |
|---|---|---|
| TOTAL_CHAOS_DURATION | Duration in seconds to keep the interface(s) down | 60 |
| RECOVERY_TIME | Seconds to wait after bringing the interface(s) back up | 0 |
| NODE_SELECTOR | Label selector to choose target nodes. If not specified, a schedulable node will be chosen at random | "" |
| NODE_NAME | The node name to target (used when label selector is not set) | |
| INSTANCE_COUNT | Restricts the number of nodes selected by the label selector | 1 |
| EXECUTION | Execution mode for multiple nodes: serial or parallel | parallel |
| INTERFACES | Comma-separated list of interface names to bring down (e.g. eth0 or eth0,bond0). Leave empty to auto-detect the default interface | "" |
| NAMESPACE | Namespace where the chaos workload pod will be deployed | default |
| TAINTS | List of taints for which tolerations need to be created. Example: ["node-role.kubernetes.io/master:NoSchedule"] | "" |
| SERVICE_ACCOUNT | Optional service account for the chaos workload pod | "" |
NOTE In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts. For example:
Seconds to wait after bringing the interface(s) back up before continuing
false
0
| Argument | Type | Description | Required | Default |
|---|---|---|---|---|
| --node-selector | string | Label selector to choose target nodes | false | node-role.kubernetes.io/worker= |
| --node-name | string | Node name to target (used when node-selector is not set) | false | |
| --namespace | string | Namespace where the chaos workload pod will be deployed | false | default |
| --instance-count | number | Number of nodes to target from those matching the selector | false | 1 |
| --execution | enum | Execution mode when targeting multiple nodes: serial or parallel | false | parallel |
| --interfaces | string | Comma-separated list of interface names to bring down. Leave empty to auto-detect the default interface | false | |
| --image | string | The chaos workload container image | false | quay.io/redhat-chaos/krkn-ng-tools:latest |
| --taints | string | List of taints for which tolerations need to be created | false | |
14.3 - Node Network Chaos
Injects network degradation (latency, packet loss, bandwidth restriction) into a target node’s network interfaces using Linux tc (traffic control) rules. Unlike node-network-filter which blocks specific ports via iptables, this module shapes traffic at the interface level. Includes safety checks for existing tc rules on the node.
How to Run Node Network Chaos Scenarios
Choose your preferred method to run node network chaos scenarios:
Configuration
- id: node_network_chaos
  image: "quay.io/krkn-chaos/krkn-network-chaos:latest"
  wait_duration: 1
  test_duration: 60
  label_selector: ""
  service_account: ""
  instance_count: 1
  execution: parallel
  namespace: default
  # scenario specific settings
  target: "<node_name>"
  interfaces: []
  ingress: true
  egress: true
  latency: ""       # empty string to skip; or e.g. 100ms (units: us, ms, s)
  loss: 10          # percentage (no % symbol)
  bandwidth: 1gbit  # supported units: bit, kbit, mbit, gbit, tbit
  force: false
  taints: []
For the common module settings please refer to the documentation.
latency: network latency to inject. Format: integer followed by us (microseconds), ms (milliseconds), or s (seconds). Example: 100ms. Set to empty string to skip.
loss: packet loss percentage as a plain integer (no % symbol). Example: 10 means 10% packet loss. Set to empty string to skip.
bandwidth: bandwidth limit. Format: integer followed by bit, kbit, mbit, gbit, or tbit. Example: 100mbit. Set to empty string to skip.
interfaces: list of network interface names to target. Leave empty to auto-detect the node’s default interface.
ingress: apply rules to incoming traffic (default: true)
egress: apply rules to outgoing traffic (default: true)
target: the node name to target (used when label_selector is not set)
force: by default (false), if the target node already has tc rules configured, the scenario aborts with a warning to avoid damaging cluster networking. Set to true to override existing rules. A 10-second warning delay is inserted before proceeding. Use with caution.
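The value formats for latency, loss and bandwidth can be checked before a scenario file is handed to krkn. The sketch below encodes only the formats listed above; validate_shaping is a hypothetical helper, and rejecting a loss value above 100 is an assumption based on it being a percentage.

```python
import re
from typing import List, Union

LATENCY = re.compile(r"[0-9]+(us|ms|s)")
BANDWIDTH = re.compile(r"[0-9]+(bit|kbit|mbit|gbit|tbit)")
LOSS = re.compile(r"[0-9]+")


def validate_shaping(latency: str = "", loss: Union[int, str] = "",
                     bandwidth: str = "") -> List[str]:
    """Return a list of format problems; an empty string means the setting is skipped."""
    problems = []
    if latency != "" and not LATENCY.fullmatch(latency):
        problems.append(f"latency must be an integer followed by us, ms or s: {latency!r}")
    loss_text = str(loss)  # YAML may deliver loss as an int (loss: 10)
    if loss_text != "" and not (LOSS.fullmatch(loss_text) and int(loss_text) <= 100):
        problems.append(f"loss must be a plain integer percentage: {loss_text!r}")
    if bandwidth != "" and not BANDWIDTH.fullmatch(bandwidth):
        problems.append(f"bandwidth must be an integer followed by bit, kbit, mbit, gbit or tbit: {bandwidth!r}")
    return problems
```

For the configuration shown above, validate_shaping(latency="", loss=10, bandwidth="1gbit") returns an empty list.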
Usage
To enable node network chaos scenarios, edit the kraken config file: go to the kraken -> chaos_scenarios section of the YAML structure, add a new element to the list named network_chaos_ng_scenarios, then add the desired scenario pointing to the scenario yaml file.
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
When force is set to false (default), the scenario will check if the target node already has complex tc queueing disciplines configured. If existing rules are detected, the scenario aborts to prevent damaging cluster networking. Only set force: true if you understand the implications of overriding existing traffic control rules.
Run
python run_kraken.py --config config/config.yaml
Not yet supported
node_network_chaos is not currently available as a krkn-hub container image. Use the Krkn tab to run this scenario directly.
Not yet supported
node_network_chaos is not currently available via krknctl. Use the Krkn tab to run this scenario directly.
Creates iptables rules on one or more nodes to block incoming and outgoing traffic on a port in the node network interface. Can be used to block network based services connected to the node or to block inter-node communication.
How to Run Node Network Filter Scenarios
Choose your preferred method to run node network filter scenarios:
- id: node_network_filter
  wait_duration: 300
  test_duration: 100
  label_selector: "kubernetes.io/hostname=ip-10-0-39-182.us-east-2.compute.internal"
  instance_count: 1
  execution: parallel
  namespace: 'default'
  # scenario specific settings
  ingress: false
  egress: true
  target: node-name
  interfaces: []
  protocols:
    - tcp
  ports:
    - 2049
  taints: []
  service_account: ""
For the common module settings please refer to the documentation.
ingress: filters incoming traffic on one or more ports
egress: filters outgoing traffic on one or more ports
target: the node name (if label_selector is not set)
interfaces: network interfaces used for outgoing traffic when egress is enabled (same semantics as krknctl and krkn-hub)
ports: ports that incoming and/or outgoing filtering applies to (depending on ingress / egress)
protocols: the IP protocols to filter (tcp and udp)
taints: list of taints for which tolerations are created. Example: ["node-role.kubernetes.io/master:NoSchedule"]
service_account: optional service account for the scenario workload (empty string uses the default)
Usage
To enable node network filter scenarios, edit the kraken config file: go to the kraken -> chaos_scenarios section of the YAML structure, add a new element to the list named network_chaos_ng_scenarios, then add the desired scenario pointing to the scenario yaml file.
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
  chaos_scenarios:
    - network_chaos_ng_scenarios:
        - scenarios/kube/node-network-filter.yml
    - pod_disruption_scenarios:
        - scenarios/pod-kill.yaml
    - node_scenarios:
        - scenarios/node-reboot.yaml
    - network_chaos_ng_scenarios:  # Same type can appear multiple times
        - scenarios/kube/node-network-filter-2.yml
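Because chaos_scenarios is a list of single-key mappings rather than one mapping, the same scenario type can appear more than once. The sketch below flattens such a list into (type, file) pairs. scenario_plan is a hypothetical helper, and walking the list in the order written is an assumption that the text implies but does not state.

```python
from typing import Dict, List, Tuple


def scenario_plan(chaos_scenarios: List[Dict[str, List[str]]]) -> List[Tuple[str, str]]:
    """Flatten the chaos_scenarios list into (scenario_type, file) pairs, preserving order."""
    plan = []
    for entry in chaos_scenarios:
        for scenario_type, files in entry.items():
            for path in files:
                plan.append((scenario_type, path))
    return plan


# The same structure as the YAML above, written as Python literals.
chaos_scenarios = [
    {"network_chaos_ng_scenarios": ["scenarios/kube/node-network-filter.yml"]},
    {"pod_disruption_scenarios": ["scenarios/pod-kill.yaml"]},
    {"node_scenarios": ["scenarios/node-reboot.yaml"]},
    {"network_chaos_ng_scenarios": ["scenarios/kube/node-network-filter-2.yml"]},
]
```

A single mapping could not hold network_chaos_ng_scenarios twice, since mapping keys must be unique; the list form is what makes the repetition legal.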
Examples
Please refer to the use cases section for some real usage scenarios.
Run
python run_kraken.py --config config/config.yaml
Run
$ podman run --name=<container_name> --net=host --pull=always --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-network-filter
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-network-filter
OR
$ docker run -e <VARIABLE>=<value> --net=host --pull=always -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-network-filter
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
TIP: Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
ex.)
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
| Parameter | Description | Type | Default |
|---|---|---|---|
| TOTAL_CHAOS_DURATION | set chaos duration (in sec) as desired | number | 60 |
| NODE_SELECTOR | defines the node selector for choosing target nodes. If not specified, one schedulable node in the cluster will be chosen at random. If multiple nodes match the selector, all of them will be subjected to stress. | string | "" |
| NODE_NAME | the node name to target (if label selector not selected) | string | |
| INSTANCE_COUNT | restricts the number of selected nodes by the selector | number | "1" |
| EXECUTION | sets the execution mode of the scenario on multiple nodes, can be parallel or serial | enum | "parallel" |
| INGRESS | sets the network filter on incoming traffic, can be true or false | boolean | false |
| EGRESS | sets the network filter on outgoing traffic, can be true or false | boolean | false |
| INTERFACES | a list of comma separated names of network interfaces (e.g. eth0 or eth0,eth1,eth2) to filter for outgoing traffic | string | "" |
| PORTS | a list of comma separated port numbers (e.g. 8080 or 8080,8081,8082) to filter for both outgoing and incoming traffic | string | "" |
| PROTOCOLS | a list of comma separated protocols to filter (tcp, udp or both) | string | |
| TAINTS | List of taints for which tolerations need to be created. Example: ["node-role.kubernetes.io/master:NoSchedule"] | string | [] |
| SERVICE_ACCOUNT | optional service account for the Node Network Filter workload | string | "" |
NOTE In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts. For example:
krknctl marks --ingress and --egress as required flags (you should pass both). Values: at least one of --ingress or --egress must be true; both may be true to filter incoming and outgoing traffic.
| Argument | Type | Description | Required | Default Value |
|---|---|---|---|---|
| --chaos-duration | number | Chaos duration in seconds | false | 60 |
| --node-selector | string | Node label selector (format: key=value) | false | |
| --node-name | string | Specific node name to target (alternative to node-selector) | false | |
| --namespace | string | Namespace where the scenario container is deployed | false | default |
| --instance-count | number | Number of nodes to target when using node-selector | false | 1 |
| --execution | enum | Execution mode: parallel or serial | false | parallel |
| --ingress | boolean | Filter incoming traffic (true / false) | true | |
| --egress | boolean | Filter outgoing traffic (true / false) | true | |
| --interfaces | string | Network interfaces for outgoing traffic (comma-separated, e.g. eth0,eth1). Optional; empty uses workload defaults | false | |
| --ports | string | Network ports to filter traffic (comma-separated, e.g., 8080,8081,8082) | true | |
| --image | string | The network chaos injection workload container image | false | quay.io/krkn-chaos/krkn-network-chaos:latest |
| --protocols | string | Network protocols to filter: tcp, udp, or tcp,udp | false | tcp |
| --taints | string | Comma-separated taints (tolerations are derived for the workload). Same notation as elsewhere in Network Chaos NG docs, e.g. node-role.kubernetes.io/master:NoSchedule | false | |
| --service-account | string | Service account for the workload (optional) | false | |
Parameter Format Details
Node Selection:
--node-selector: Label selector in format key=value (e.g., node-role.kubernetes.io/worker=)
--node-name: Specific node name (e.g., ip-10-0-1-100.ec2.internal)
Specify either --node-selector OR --node-name, not both
When using --node-selector, use --instance-count to limit the number of selected nodes
Port Format:
Single port: 8080
Multiple ports: 8080,8081,8082 (comma-separated, no spaces)
Protocol Format:
Valid values: tcp, udp, tcp,udp, or udp,tcp
Default: tcp
Interface Format:
Applies to egress (outgoing) filtering, matching the scenario image metadata
Single interface: eth0
Multiple interfaces: eth0,eth1,eth2 (comma-separated, no spaces)
May be left empty when not needed for your egress rules
Taints Format:
Comma-separated Kubernetes taints; the workload gets matching tolerations
Examples: node-role.kubernetes.io/master:NoSchedule or key=value:NoSchedule when the taint includes a value
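The two taint notations above (with and without a value) map naturally onto Kubernetes toleration fields. The sketch below shows that mapping under the usual Kubernetes convention: operator Exists when the taint has no value, Equal when it does. How the workload actually derives its tolerations is not described here, so treat parse_taints and taint_to_toleration as illustrative helpers.

```python
from typing import Dict, List


def taint_to_toleration(taint: str) -> Dict[str, str]:
    """Convert 'key:Effect' or 'key=value:Effect' into a toleration mapping."""
    key_part, separator, effect = taint.strip().rpartition(":")
    if not separator or not key_part or not effect:
        raise ValueError(f"expected key[=value]:Effect, got {taint!r}")
    key, has_value, value = key_part.partition("=")
    toleration = {"key": key, "effect": effect}
    if has_value:
        toleration.update(operator="Equal", value=value)
    else:
        toleration["operator"] = "Exists"
    return toleration


def parse_taints(raw: str) -> List[Dict[str, str]]:
    """Split a comma-separated taint string and convert each entry."""
    return [taint_to_toleration(item) for item in raw.split(",") if item.strip()]
```

An empty string yields an empty list, which matches leaving the taints parameter unset.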
Usage Notes
Node targeting: This scenario targets nodes (not pods) and creates iptables rules on the target node(s) to filter network traffic
Ingress/Egress: Pass both flags; at least one must be true. Both may be true to filter incoming and outgoing traffic
Execution modes:
parallel: Applies network filtering to all selected nodes simultaneously
serial: Applies network filtering to nodes one at a time
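The format rules and usage notes above can be collected into a single pre-flight check. This is a sketch rather than krknctl's own validation: validate_filter_args is a hypothetical helper, and the 1 to 65535 port range is an assumption based on the valid range for TCP and UDP ports.

```python
import re
from typing import List


def validate_filter_args(ingress: bool, egress: bool, ports: str,
                         protocols: str = "tcp", node_selector: str = "",
                         node_name: str = "") -> List[str]:
    """Return the rule violations for a node network filter invocation."""
    problems = []
    if not (ingress or egress):
        problems.append("at least one of --ingress or --egress must be true")
    if node_selector and node_name:
        problems.append("specify either --node-selector or --node-name, not both")
    for port in ports.split(","):  # comma-separated, no spaces
        if not re.fullmatch(r"[0-9]+", port) or not 1 <= int(port) <= 65535:
            problems.append(f"invalid port: {port!r}")
    if protocols not in ("tcp", "udp", "tcp,udp", "udp,tcp"):
        problems.append(f"invalid protocols: {protocols!r}")
    return problems
```

A space after a comma, as in "8080, 8081", is reported as an invalid port because the list format does not allow spaces.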
Injects network degradation (latency, packet loss, bandwidth restriction) into a target pod’s network interfaces using Linux tc (traffic control) rules. Unlike pod-network-filter which blocks specific ports via iptables, this module shapes traffic at the interface level.
How to Run Pod Network Chaos Scenarios
Choose your preferred method to run pod network chaos scenarios:
Configuration
- id: pod_network_chaos
  image: "quay.io/krkn-chaos/krkn-network-chaos:latest"
  wait_duration: 1
  test_duration: 60
  label_selector: ""
  service_account: ""
  instance_count: 1
  execution: parallel
  namespace: default
  # scenario specific settings
  target: "<pod_name>"
  interfaces: []
  ingress: true
  egress: true
  latency: ""       # empty string to skip; or e.g. 100ms (units: us, ms, s)
  loss: 10          # percentage (no % symbol)
  bandwidth: 1gbit  # supported units: bit, kbit, mbit, gbit, tbit
  taints: []
For the common module settings please refer to the documentation.
latency: network latency to inject. Format: integer followed by us (microseconds), ms (milliseconds), or s (seconds). Example: 100ms. Set to empty string to skip.
loss: packet loss percentage as a plain integer (no % symbol). Example: 10 means 10% packet loss. Set to empty string to skip.
bandwidth: bandwidth limit. Format: integer followed by bit, kbit, mbit, gbit, or tbit. Example: 100mbit. Set to empty string to skip.
interfaces: list of network interface names to target. Leave empty to auto-detect the pod’s default interface.
ingress: apply rules to incoming traffic (default: true)
egress: apply rules to outgoing traffic (default: true)
target: the pod name to target (used when label_selector is not set)
Usage
To enable pod network chaos scenarios, edit the kraken config file: go to the kraken -> chaos_scenarios section of the YAML structure, add a new element to the list named network_chaos_ng_scenarios, then add the desired scenario pointing to the scenario yaml file.
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
Creates iptables rules on one or more pods to block incoming and outgoing traffic on a port in the pod network interface. Can be used to block network based services connected to the pod or to block inter-pod communication.
How to Run Pod Network Filter Scenarios
Choose your preferred method to run pod network filter scenarios:
- id: pod_network_filter
  wait_duration: 300
  test_duration: 100
  label_selector: "app=label"
  instance_count: 1
  execution: parallel
  namespace: 'default'
  # scenario specific settings
  ingress: false
  egress: true
  target: 'pod-name'
  interfaces: []
  protocols:
    - tcp
  ports:
    - 80
  taints: []
For the common module settings please refer to the documentation.
ingress: filters the incoming traffic on one or more ports. If set one or more network interfaces must be specified
egress : filters the outgoing traffic on one or more ports.
target: the pod name (if label_selector not set)
interfaces: a list of network interfaces where the incoming traffic will be filtered
ports: the list of ports that will be filtered
protocols: the ip protocols to filter (tcp and udp)
taints: list of taints for which tolerations need to be created. Example: ["node-role.kubernetes.io/master:NoSchedule"]
Usage
To enable pod network filter scenarios, edit the kraken config file: go to the kraken -> chaos_scenarios section of the YAML structure, add a new element to the list named network_chaos_ng_scenarios, then add the desired scenario pointing to the scenario yaml file.
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
  chaos_scenarios:
    - network_chaos_ng_scenarios:
        - scenarios/kube/pod-network-filter.yml
    - pod_disruption_scenarios:
        - scenarios/pod-kill.yaml
    - node_scenarios:
        - scenarios/node-reboot.yaml
    - network_chaos_ng_scenarios:  # Same type can appear multiple times
        - scenarios/kube/pod-network-filter-2.yml
Examples
Please refer to the use cases section for some real usage scenarios.
Run
python run_kraken.py --config config/config.yaml
Run
$ podman run --name=<container_name> --net=host --pull=always --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:z -d quay.io/krkn-chaos/krkn-hub:pod-network-filter
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v <path-to-kube-config>:/home/krkn/.kube/config:z -d quay.io/krkn-chaos/krkn-hub:pod-network-filter
OR
$ docker run -e <VARIABLE>=<value> --net=host --pull=always -v <path-to-kube-config>:/home/krkn/.kube/config:z -d quay.io/krkn-chaos/krkn-hub:pod-network-filter
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
TIP: Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
ex.)
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
| Parameter | Description | Type | Default |
|---|---|---|---|
| TOTAL_CHAOS_DURATION | set chaos duration (in sec) as desired | number | 60 |
| POD_SELECTOR | defines the pod selector for choosing target pods. If multiple pods match the selector, all of them will be subjected to stress. | string | "" |
| POD_NAME | the pod name to target (if POD_SELECTOR not specified) | string | |
| INSTANCE_COUNT | restricts the number of selected pods by the selector | number | "1" |
| EXECUTION | sets the execution mode of the scenario on multiple pods, can be parallel or serial | enum | "parallel" |
| INGRESS | sets the network filter on incoming traffic, can be true or false | boolean | false |
| EGRESS | sets the network filter on outgoing traffic, can be true or false | boolean | true |
| INTERFACES | a list of comma separated names of network interfaces (e.g. eth0 or eth0,eth1,eth2) to filter for outgoing traffic | string | "" |
| PORTS | a list of comma separated port numbers (e.g. 8080 or 8080,8081,8082) to filter for both outgoing and incoming traffic | string | "" |
| PROTOCOLS | a list of comma separated network protocols (tcp, udp or both of them e.g. tcp,udp) | string | "tcp" |
| NAMESPACE | namespace where the scenario container will be deployed | string | default |
| IMAGE | the network chaos injection workload container image | string | quay.io/krkn-chaos/krkn-network-chaos:latest |
| TAINTS | List of taints for which tolerations need to be created. Example: ["node-role.kubernetes.io/master:NoSchedule"] | string | [] |
| SERVICE_ACCOUNT | optional service account for the Pod Network Filter workload | string | "" |
NOTE In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts. For example:
Injects network degradation into a KubeVirt Virtual Machine Instance (VMI) by shaping traffic on the VM's tap interface inside the virt-launcher network namespace. Supports configurable bandwidth limiting, latency injection, and packet loss. Unlike node or pod network chaos, this scenario targets the tap device that connects QEMU to the bridge, so only the specific VMI is affected without disrupting OVN's BFD heartbeats or other workloads on the same node.
How to Run VMI Network Chaos Scenarios
Choose your preferred method to run VMI network chaos scenarios:
For the common module settings please refer to the documentation.
target: regex to match VMI names within the namespace (e.g. "<vmi-name-prefix>-.*" or ".*" for all)
namespace: namespace containing the target VMIs (required; also supports regex to match multiple namespaces)
interfaces: list of tap interface names to target. Leave empty to auto-detect the tap device in the virt-launcher network namespace
ingress: shape incoming traffic to the VM
egress: shape outgoing traffic from the VM
latency: artificial network latency added to packets (e.g. "100ms", "500ms")
loss: percentage of packets to drop (e.g. "10" for 10%, "50" for 50%)
bandwidth: maximum throughput cap (e.g. "100mbit", "1gbit", "500kbit")
Note
At least one of latency, loss, or bandwidth should be set. Setting all three simultaneously compounds the degradation.
Catastrophic Configurations
The following combinations produce the most impactful chaos:
Complete network degradation (maximum chaos):
latency: "2000ms"
loss: "50"
bandwidth: "1mbit"
Combines severe latency with heavy packet loss and near-complete bandwidth exhaustion.
DNS blackout via latency (cascading failures):
latency: "5000ms"
loss: "0"
bandwidth: ""
5-second latency causes DNS timeouts across every service in the VM, producing cascading failures without a hard cut.
Bandwidth starvation:
latency: ""
loss: "0"
bandwidth: "100kbit"
Throttles the VMI to 100 kbit/s — enough to keep connections alive but too slow for most application traffic.
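To get a feel for what these bandwidth caps mean in practice, the following sketch estimates the minimum transfer time of a payload under a given cap. It assumes decimal units (1 kbit = 1000 bits per second, as tc defines them) and ignores latency, packet loss and protocol overhead, so real transfers will be slower. transfer_seconds is a hypothetical helper.

```python
import re

# Bits per second for each supported unit suffix (decimal multiples).
UNITS = {"bit": 1, "kbit": 10**3, "mbit": 10**6, "gbit": 10**9, "tbit": 10**12}


def transfer_seconds(payload_bytes: int, bandwidth: str) -> float:
    """Lower bound on the time needed to move payload_bytes through the cap."""
    match = re.fullmatch(r"([0-9]+)(bit|kbit|mbit|gbit|tbit)", bandwidth)
    if match is None:
        raise ValueError(f"unsupported bandwidth value: {bandwidth!r}")
    bits_per_second = int(match.group(1)) * UNITS[match.group(2)]
    return payload_bytes * 8 / bits_per_second
```

Under the 100kbit cap a 1 MB response (1,000,000 bytes) needs at least 80 seconds, and under the 1mbit cap at least 8 seconds.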
Usage
To enable VMI network chaos scenarios, edit the kraken config file: go to the kraken -> chaos_scenarios section of the YAML structure, add a new element to the list named network_chaos_ng_scenarios, then add the desired scenario pointing to the scenario yaml file.
$ podman run --name=<container_name> --net=host --pull=always --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:vmi-network-chaos
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:vmi-network-chaos
OR
$ docker run -e <VARIABLE>=<value> --net=host --pull=always -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:vmi-network-chaos
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
TIP: Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
ex.)
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter
Description
Type
Default
TOTAL_CHAOS_DURATION
Chaos duration in seconds
number
120
NAMESPACE
Namespace containing the target VMIs (required)
string
VMI_NAME
Regex to match VMI names (e.g. virt-server-.* or .* for all)
string
.*
LABEL_SELECTOR
Label selector to filter VMIs (e.g. app=myapp)
string
""
INSTANCE_COUNT
Maximum number of VMIs to target
number
1
EXECUTION
Execution mode: serial or parallel
enum
serial
INGRESS
Shape incoming traffic to the VM
boolean
true
EGRESS
Shape outgoing traffic from the VM
boolean
true
INTERFACES
Comma-separated tap interface names (empty to auto-detect)
string
""
LATENCY
Artificial latency added to packets (e.g. 100ms, 500ms)
string
""
LOSS
Packet loss percentage (e.g. 10 for 10%)
string
""
BANDWIDTH
Maximum throughput cap (e.g. 100mbit, 1gbit)
string
""
WAIT_DURATION
Seconds to wait before running the next scenario in the same file
number
300
IMAGE
Network chaos injection workload image
string
quay.io/krkn-chaos/krkn-network-chaos:latest
TAINTS
List of taints for which tolerations are created (e.g. ["node-role.kubernetes.io/master:NoSchedule"])
string
[]
SERVICE_ACCOUNT
Optional service account for the scenario workload
string
""
NOTE In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts. For example:
Injects iptables-based network filtering into a KubeVirt Virtual Machine Instance (VMI) by applying INPUT and OUTPUT rules inside the virt-launcher network namespace via nsenter. Supports port and protocol-specific filtering so you can selectively block DNS, SSH, HTTP, or any other traffic without cutting all connectivity. The tap interface (tap0) is targeted directly so only the specific VMI is isolated, leaving OVN's BFD heartbeats and other node workloads unaffected.
How to Run VMI Network Filter Scenarios
Choose your preferred method to run VMI network filter scenarios:
For the common module settings please refer to the documentation.
target: regex to match VMI names within the namespace (e.g. "<vmi-name-prefix>-.*" or ".*" for all)
namespace: namespace containing the target VMIs (required; also supports regex to match multiple namespaces)
interfaces: list of tap interface names to target. Leave empty to auto-detect the tap device in the virt-launcher network namespace
ingress: apply iptables DROP rules to incoming traffic
egress: apply iptables DROP rules to outgoing traffic
ports: list of ports to block (omit or leave empty to block all ports)
protocols: list of IP protocols to filter — tcp, udp, or both (defaults to ["tcp", "udp"])
Note
ports and protocols are optional. When ports is omitted or empty, all traffic on the specified protocols is blocked — equivalent to full network isolation.
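The way `ingress`, `egress`, `ports`, and `protocols` combine can be pictured as a cross product of rules. The following Python sketch is an assumption-laden illustration, not krkn's implementation: `build_drop_rules` is a hypothetical helper, and the mapping of ingress to the INPUT chain and egress to the OUTPUT chain is assumed from the description above.

```python
from itertools import product


def build_drop_rules(interface="tap0", ingress=True, egress=True,
                     ports=None, protocols=None):
    """Return iptables command strings for the requested filter (illustrative)."""
    protocols = protocols or ["tcp", "udp"]  # documented default
    chains = []
    if ingress:
        chains.append(("INPUT", "-i"))   # assumption: ingress maps to INPUT
    if egress:
        chains.append(("OUTPUT", "-o"))  # assumption: egress maps to OUTPUT
    rules = []
    for (chain, iface_flag), proto in product(chains, protocols):
        base = f"iptables -A {chain} {iface_flag} {interface} -p {proto}"
        if ports:
            # One rule per port when a port list is given.
            rules += [f"{base} --dport {port} -j DROP" for port in ports]
        else:
            # No ports: drop everything for this protocol.
            rules.append(f"{base} -j DROP")
    return rules


if __name__ == "__main__":
    for rule in build_drop_rules(ports=[53], protocols=["udp"]):
        print(rule)
```

With no ports and the default protocols, the sketch yields four rules (two chains times two protocols), which corresponds to the full-isolation case described in the note.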
Catastrophic Configurations
Full network isolation (most catastrophic):
ingress: true
egress: true
# no ports or protocols: blocks all TCP and UDP
Blocking DNS (port 53) causes every service inside the VM that resolves hostnames to fail with timeouts. Cascading failures across the application stack without a hard cut — often the most realistic chaos scenario.
Kills HTTP/HTTPS traffic only — tests application resilience without taking the entire VM offline.
Usage
To enable VMI network filter scenarios edit the kraken config file, go to the section kraken -> chaos_scenarios of the yaml structure
and add a new element to the list named network_chaos_ng_scenarios then add the desired scenario
pointing to the scenario yaml file.
$ podman run --name=<container_name> --net=host --pull=always --env-host=true -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:vmi-network-filter
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:vmi-network-filter
OR
$ docker run -e <VARIABLE>=<value> --net=host --pull=always -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:vmi-network-filter
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
TIP: Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
ex.)
export <parameter_name>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Parameter
Description
Type
Default
TOTAL_CHAOS_DURATION
Chaos duration in seconds
number
120
NAMESPACE
Namespace containing the target VMIs (required)
string
VMI_NAME
Regex to match VMI names (e.g. virt-server-.* or .* for all)
string
.*
LABEL_SELECTOR
Label selector to filter VMIs (e.g. app=myapp)
string
""
INSTANCE_COUNT
Maximum number of VMIs to target
number
1
EXECUTION
Execution mode: serial or parallel
enum
serial
INGRESS
Apply DROP rules to incoming traffic
boolean
true
EGRESS
Apply DROP rules to outgoing traffic
boolean
true
INTERFACES
Comma-separated tap interface names (empty to auto-detect)
string
""
PORTS
Comma-separated port numbers to block (empty = all ports)
string
""
PROTOCOLS
Comma-separated protocols to filter: tcp, udp, or both
string
tcp,udp
WAIT_DURATION
Seconds to wait before running the next scenario in the same file
number
300
IMAGE
Network chaos injection workload image
string
quay.io/krkn-chaos/krkn-network-chaos:latest
TAINTS
List of taints for which tolerations are created (e.g. ["node-role.kubernetes.io/master:NoSchedule"])
string
[]
SERVICE_ACCOUNT
Optional service account for the scenario workload
string
""
NOTE In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts. For example:
Scenario to introduce network latency, packet loss, and bandwidth restriction in the Node's host network interface. The purpose of this scenario is to observe faults caused by random variations in the network.
How to Run Network Chaos Scenarios
Choose your preferred method to run network chaos scenarios:
network_chaos:                                    # Scenario to create an outage by simulating random variations in the network.
  duration: 300                                   # In seconds - duration network chaos will be applied.
  node_name:                                      # Comma separated node names on which scenario has to be injected.
  label_selector: node-role.kubernetes.io/master  # When node_name is not specified, a node with matching label_selector is selected for running the scenario.
  instance_count: 1                               # Number of nodes in which to execute network chaos.
  interfaces:                                     # List of interface on which to apply the network restriction.
  - "ens5"                                        # Interface name would be the Kernel host network interface name.
  execution: serial                               # Default: serial. Options: serial, parallel. Execute each of the egress options as a single scenario(parallel) or as separate scenario(serial).
  egress:
    latency: 500ms
    loss: 2                                       # 2% packet loss (value is a percentage, e.g. 50 = 50%)
    bandwidth: 10mbit
  image: quay.io/krkn-chaos/krkn:tools
Sample scenario config for ingress traffic shaping (using a plugin)
- id: network_chaos
  config:
    node_interface_name:                            # Dictionary with key as node name(s) and value as a list of its interfaces to test
      ip-10-0-128-153.us-west-2.compute.internal:
        - ens5
        - genev_sys_6081
    label_selector: node-role.kubernetes.io/master  # When node_interface_name is not specified, nodes with matching label_selector is selected for node chaos scenario injection
    instance_count: 1                               # Number of nodes to perform action/select that match the label selector
    kubeconfig_path: ~/.kube/config                 # Path to kubernetes config file. If not specified, it defaults to ~/.kube/config
    execution_type: parallel                        # Execute each of the ingress options as a single scenario(parallel) or as separate scenario(serial).
    network_params:
      latency: 500ms
      loss: '2'                                     # 2% packet loss (value is a percentage, must be quoted)
      bandwidth: 10mbit
    wait_duration: 120
    test_duration: 60
    image: quay.io/krkn-chaos/krkn:tools
Note: For ingress traffic shaping, ensure that your node doesn’t have any IFB interfaces already present. The scenario relies on creating IFBs to do the shaping, and they are deleted at the end of the scenario.
Steps
Pick the nodes to introduce the network anomaly either from node_name or label_selector.
Verify interface list in one of the nodes or use the interface with a default route, as test interface, if no interface is specified by the user.
Set traffic shaping config on node’s interface using tc and netem.
Wait for the duration time.
Remove traffic shaping config on node’s interface.
Remove the job that spawned the pod.
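The shaping and cleanup steps above can be sketched as the `tc` invocations they roughly correspond to. This Python snippet is a minimal illustration under stated assumptions: `netem_commands` is a hypothetical helper, and the exact commands krkn issues (qdisc layout, handles, IFB redirection for ingress) may differ.

```python
def netem_commands(interface, latency=None, loss=None, bandwidth=None):
    """Build (add, delete) tc argument lists for a single netem qdisc."""
    opts = []
    if latency:
        opts += ["delay", latency]       # e.g. "500ms"
    if loss not in (None, ""):
        opts += ["loss", f"{loss}%"]     # loss is given as a percentage
    if bandwidth:
        opts += ["rate", bandwidth]      # e.g. "10mbit"
    if not opts:
        raise ValueError("set at least one of latency, loss, bandwidth")
    add = ["tc", "qdisc", "add", "dev", interface, "root", "netem", *opts]
    delete = ["tc", "qdisc", "del", "dev", interface, "root"]
    return add, delete


if __name__ == "__main__":
    add, delete = netem_commands("ens5", latency="500ms", loss=2, bandwidth="10mbit")
    print(" ".join(add))     # applied at the start of the chaos window
    print(" ".join(delete))  # applied after the duration elapses
```

Returning argument lists rather than one shell string keeps the values safe to pass to `subprocess.run` without shell quoting.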
How to Use Plugin Name
Add the plugin name to the list of chaos_scenarios section in the config/config.yaml file
kraken:
    kubeconfig_path: ~/.kube/config  # Path to kubeconfig
    ..
    chaos_scenarios:
        - network_chaos_scenarios:
            - scenarios/<scenario_name>.yaml
Note
You can specify multiple scenario files of the same type by adding additional paths to the list:
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
    chaos_scenarios:
        - network_chaos_scenarios:
            - scenarios/network-chaos.yaml
        - pod_disruption_scenarios:
            - scenarios/pod-kill.yaml
        - container_scenarios:
            - scenarios/container-kill.yaml
        - network_chaos_scenarios:  # Same type can appear multiple times
            - scenarios/network-chaos-2.yaml
Run
python run_kraken.py --config config/config.yaml
This scenario introduces network latency, packet loss, and bandwidth restriction in the egress traffic of a Node’s interface using tc and netem. For more information, refer to the following documentation.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run \
--name=<container_name> \
--net=host \
--pull=always \
--env-host=true \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:network-chaos
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option you’ll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
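When --env-host is unavailable, the individual -e flags can be generated from a dictionary rather than typed by hand. A minimal Python sketch; `env_flags` is a hypothetical helper and not part of krkn-hub:

```python
import shlex


def env_flags(variables: dict) -> str:
    """Render {"NAME": value} pairs as shell-safe '-e NAME=value' flags."""
    parts = []
    for name, value in variables.items():
        pair = f"{name}={value}"
        # shlex.quote adds quoting only when the pair contains unsafe characters.
        parts.append(f"-e {shlex.quote(pair)}")
    return " ".join(parts)


if __name__ == "__main__":
    print(env_flags({"TRAFFIC_TYPE": "egress", "DURATION": 60}))
```

The resulting string can be pasted into the podman or docker command line in place of `-e <VARIABLE>=<value>`.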
$ docker run \
-e <VARIABLE>=<value> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:network-chaos
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v ~/kubeconfig:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
OR on the command line like example:
-e <VARIABLE>=<value>
Note
export TRAFFIC_TYPE=egress for Egress scenarios and export TRAFFIC_TYPE=ingress for Ingress scenarios
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
Egress Scenarios
Parameter
Description
Default
DURATION
Duration in seconds during which network chaos will be applied.
300
IMAGE
Image used to disrupt network on a pod
quay.io/krkn-chaos/krkn:tools
NODE_NAME
Node name to inject faults in case of targeting a specific node; Can set multiple node names separated by a comma
""
LABEL_SELECTOR
When NODE_NAME is not specified, a node with matching label_selector is selected for running.
node-role.kubernetes.io/master
INSTANCE_COUNT
Targeted instance count matching the label selector
1
INTERFACES
List of interfaces on which to apply the network restriction.
[]
EXECUTION
Execute the egress options together as a single scenario (parallel) or each as a separate scenario (serial).
parallel
EGRESS
Dictionary of values to set network latency (latency: 50ms), packet loss (loss: 0.02), bandwidth restriction (bandwidth: 100mbit)
{bandwidth: 100mbit}
Ingress Scenarios
Parameter
Description
Default
DURATION
Duration in seconds during which network chaos will be applied.
300
IMAGE
Image used to disrupt network on a pod
quay.io/krkn-chaos/krkn:tools
TARGET_NODE_AND_INTERFACE
Dictionary with key as node name(s) and value as a list of its interfaces to test. For example: {ip-10-0-216-2.us-west-2.compute.internal: [ens5]}
""
LABEL_SELECTOR
When TARGET_NODE_AND_INTERFACE is not specified, a node with matching label_selector is selected for running.
node-role.kubernetes.io/master
INSTANCE_COUNT
Targeted instance count matching the label selector
1
EXECUTION
Used to specify whether you want to apply filters on interfaces one at a time or all at once.
parallel
NETWORK_PARAMS
latency, loss and bandwidth are the three supported network parameters to alter for the chaos test. For example: {latency: 50ms, loss: '0.02'}
""
WAIT_DURATION
Seconds to wait after the test; should be at least about twice test_duration
300
Note
For disconnected clusters, be sure to also mirror the helper image of quay.io/krkn-chaos/krkn:tools and set the mirrored image path properly
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
Selects the network chaos scenario type: ingress or egress
enum
Yes
ingress | egress
--image
Image used to disrupt network on a pod
string
No
quay.io/krkn-chaos/krkn:tools
--duration
Duration in seconds during which network chaos will be applied.
number
No
300
--label-selector
When NODE_NAME is not specified, a node with matching label_selector is selected for running.
string
No
node-role.kubernetes.io/master
--execution
Execute the egress options together as a single scenario (parallel) or each as a separate scenario (serial).
enum
No
parallel
--instance-count
Targeted instance count matching the label selector.
number
No
1
--node-name
Node name to inject faults in case of targeting a specific node; Can set multiple node names separated by a comma
string
No
--interfaces
List of interfaces on which to apply the network restriction, e.g. [eth0,eth1,eth2]
string
No
[]
--egress
Dictionary of values to set network latency (latency: 50ms), packet loss (loss: 0.02), bandwidth restriction (bandwidth: 100mbit), e.g. {bandwidth: 100mbit}
string
No
"{bandwidth: 100mbit}"
--target-node-interface
Dictionary with key as node name(s) and value as a list of its interfaces to test. For example: {ip-10-0-216-2.us-west-2.compute.internal: [ens5]}
string
No
--network-params
latency, loss and bandwidth are the three supported network parameters to alter for the chaos test. For example: {latency: 50ms, loss: 0.02}
string
No
--wait-duration
Seconds to wait after the test; should be at least about twice test_duration
number
No
300
Parameter Dependencies
--node-name: Egress only. Ignored when --traffic-type is ingress.
--network-params and --target-node-interface: Ingress only. Ignored when --traffic-type is egress.
--wait-duration: Must be at least 2× --duration to allow the network to stabilize before verification.
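The dependencies above can be checked before launching a run. The Python sketch below is a hypothetical pre-flight helper (`check_network_chaos_args` is not a krknctl function); it only restates the three rules listed and returns human-readable warnings.

```python
def check_network_chaos_args(traffic_type, duration, wait_duration,
                             node_name=None, network_params=None,
                             target_node_interface=None):
    """Return a list of warnings for flag combinations that have no effect."""
    if traffic_type not in ("ingress", "egress"):
        raise ValueError("--traffic-type must be 'ingress' or 'egress'")
    warnings = []
    if traffic_type == "ingress" and node_name:
        warnings.append("--node-name is egress only and will be ignored")
    if traffic_type == "egress":
        if network_params:
            warnings.append("--network-params is ingress only and will be ignored")
        if target_node_interface:
            warnings.append("--target-node-interface is ingress only and will be ignored")
    if wait_duration < 2 * duration:
        warnings.append("--wait-duration should be at least 2x --duration")
    return warnings


if __name__ == "__main__":
    for message in check_network_chaos_args("ingress", duration=60,
                                            wait_duration=60, node_name="node-a"):
        print("warning:", message)
```

Note that passing the documented defaults for both values (300 and 300) triggers the wait-duration warning, so one of them usually needs adjusting.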
Behavior Notes
Empty --interfaces: When left empty [], krkn auto-detects the primary network interface on the target node using the default route. If specified, each interface is validated against the node’s actual interfaces before applying chaos.
To see all available scenario options
krknctl run network-chaos --help
16 - Node Scenarios
This scenario disrupts the node(s) matching the label or node name(s) on a Kubernetes/OpenShift cluster. These scenarios are performed in one of two ways: through the cluster's cloud CLI, or through common/generic commands that work on any cluster.
Actions
The following node chaos scenarios are supported:
node_start_scenario: Scenario to start the node instance. Need access to cloud provider
node_stop_scenario: Scenario to stop the node instance. Need access to cloud provider
node_stop_start_scenario: Scenario to stop and then start the node instance. Not supported on VMware. Need access to cloud provider
node_termination_scenario: Scenario to terminate the node instance. Need access to cloud provider
node_reboot_scenario: Scenario to reboot the node instance. Need access to cloud provider
stop_kubelet_scenario: Scenario to stop the kubelet of the node instance. Need access to cloud provider
stop_start_kubelet_scenario: Scenario to stop and start the kubelet of the node instance. Need access to cloud provider
restart_kubelet_scenario: Scenario to restart the kubelet of the node instance. Can be used with generic cloud type or when you don’t have access to cloud provider
node_crash_scenario: Scenario to crash the node instance. Can be used with generic cloud type or when you don’t have access to cloud provider
stop_start_helper_node_scenario: Scenario to stop and start the helper node and check service status. Need access to cloud provider
node_block_scenario: Scenario to block inbound and outbound traffic from other nodes to a specific node for a set duration (only for Azure). Need access to cloud provider
node_disk_detach_attach_scenario: Scenario to detach and reattach disks (only for baremetals).
If the node does not recover from the node_crash_scenario injection, reboot the node to get it back to Ready state.
Note
node_start_scenario, node_stop_scenario, node_stop_start_scenario, node_termination_scenario, node_reboot_scenario and stop_start_kubelet_scenario are supported on
AWS
Azure
OpenStack
BareMetal
GCP
VMware
Alibaba
IbmCloud
IbmCloudPower
Recovery Times
In each node scenario, the end telemetry details of the run show how long each node took to stop and recover, depending on the scenario.
The details printed in telemetry:
node_name: Node name
node_id: Node id
not_ready_time: Amount of time the node took to get to a not ready state after cloud provider has stopped node
ready_time: Amount of time the node took to get to a ready state after the cloud provider reported it as started
stopped_time: Amount of time the cloud provider took to stop a node
running_time: Amount of time the cloud provider took to get a node running
terminating_time: Amount of time the cloud provider took for node to become terminated
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
    chaos_scenarios:
        - node_scenarios:
            - scenarios/node-reboot.yaml
        - pod_disruption_scenarios:
            - scenarios/pod-kill.yaml
        - container_scenarios:
            - scenarios/container-kill.yaml
        - node_scenarios:  # Same type can appear multiple times
            - scenarios/node-stop-start.yaml
Sample scenario file. You can specify multiple list items under node_scenarios; they will be run serially.
node_scenarios:
  - actions:                                      # node chaos scenarios to be injected
    - <action>                                    # Can specify multiple actions here
    node_name: <node_name>                        # node on which scenario has to be injected; can set multiple names separated by comma
    label_selector: <label>                       # when node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection; can specify multiple by a comma separated list
    exclude_label: <label>                        # if label_selector is set, will exclude nodes marked by this label from the chaos scenario
    instance_count: <instance_number>             # Number of nodes to perform action/select that match the label selector
    runs: <run_int>                               # number of times to inject each scenario under actions (will perform on same node each time)
    timeout: <timeout>                            # duration to wait for completion of node scenario injection
    duration: <duration>                          # duration to stop the node before running the start action
    cloud_type: <cloud>                           # cloud type on which Kubernetes/OpenShift runs
    parallel: <true_or_false>                     # Run action on label or node name in parallel or sequential, defaults to sequential
    kube_check: <true_or_false>                   # Run the kubernetes api calls to see if the node gets to a certain state during the node scenario
    disable_ssl_verification: <true_or_false>     # Disable SSL verification, to avoid certificate errors
AWS
Cloud setup instructions can be found here.
Sample scenario config can be found here.
The cloud type in the scenario yaml file needs to be aws
The cloud type in the scenario yaml file needs to be bm
Note
Baremetal requires setting the IPMI user and password to power on, off, and reboot nodes, using the config options bm_user and bm_password. It can either be set in the root of the entry in the scenarios config, or it can be set per machine.
If no per-machine addresses are specified, kraken attempts to use the BMC value in the BareMetalHost object. To list them, you can do ‘oc get bmh -o wide --all-namespaces’. If the BMC values are blank, you must specify them per-machine using the config option ‘bmc_addr’ as specified below.
For per-machine settings, add a “bmc_info” section to the entry in the scenarios config. Inside there, add a configuration section using the node name. In that, add per-machine settings. Valid settings are ‘bmc_user’, ‘bmc_password’, ‘bmc_addr’ and ‘disks’.
See the example node scenario or the example below.
Note
Baremetal requires oc (openshift client) be installed on the machine running Kraken.
Note
Baremetal machines are fragile. Some node actions can occasionally corrupt the filesystem if it does not shut down properly, and sometimes the kubelet does not start properly.
Docker
The Docker provider can be used to run node scenarios against kind clusters.
kind is a tool for running local Kubernetes clusters using Docker container “nodes”.
kind was primarily designed for testing Kubernetes itself, but may be used for local development or CI.
GCP
Cloud setup instructions can be found here. Sample scenario config can be found here.
The cloud type in the scenario yaml file needs to be gcp
Openstack
How to set up Openstack cli to run node scenarios is defined here.
The cloud type in the scenario yaml file needs to be openstack
The supported node level chaos scenarios on an OPENSTACK cloud are only: node_stop_start_scenario, stop_start_kubelet_scenario and node_reboot_scenario.
Note
For stop_start_helper_node_scenario, visit here to learn more about the helper node and its usage.
To execute the scenario, ensure the value for ssh_private_key in the node scenarios config file is set with the correct private key file path for ssh connection to the helper node. Ensure passwordless ssh is configured on the host running Kraken and the helper node to avoid connection errors.
Azure
Cloud setup instructions can be found here. Sample scenario config can be found here.
The cloud type in the scenario yaml file needs to be azure
Alibaba
How to set up Alibaba cli to run node scenarios is defined here.
Note
There is no “terminating” idea in Alibaba, so any scenario with terminating will “release” the node. Releasing a node is 2 steps: stopping the node and then releasing it.
The cloud type in the scenario yaml file needs to be alibaba
VMware
How to set up VMware vSphere to run node scenarios is defined here
The cloud type in the scenario yaml file needs to be vmware
IBMCloud
How to set up IBMCloud to run node scenarios is defined here
The cloud type in the scenario yaml file needs to be ibmpower or ibmcloudpower
General
Note
The node_crash_scenario and stop_kubelet_scenario scenarios are supported independent of the cloud platform.
Use ‘generic’ or do not add the ‘cloud_type’ key to your scenario if your cluster is not set up using one of the current supported cloud types.
Run
python run_kraken.py --config config/config.yaml
This scenario disrupts the node(s) matching the label on a Kubernetes/OpenShift cluster. Actions/disruptions supported are listed here
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run \
--name=<container_name> \
--net=host \
--env-host=true \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:node-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option you’ll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) \
--name=<container_name> \
--net=host \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:node-scenarios
$ docker run \
-e <VARIABLE>=<value> \
--net=host \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:node-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
Skip OpenShift-specific cluster checks (set to true for vanilla Kubernetes)
string
false
BMC_USER
Only needed for Baremetal ( bm ) - IPMI/bmc username
string
""
BMC_PASSWORD
Only needed for Baremetal ( bm ) - IPMI/bmc password
string
""
BMC_ADDR
Only needed for Baremetal ( bm ) - IPMI/bmc address
string
""
DISKS
Comma-separated list of disks for baremetal disk detach/attach scenarios
string
""
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
Node name to inject faults in case of targeting a specific node; Can set multiple node names separated by a comma
string
No
--instance-count
Targeted instance count matching the label selector
number
No
1
--runs
Iterations to perform action on a single node
number
No
1
--cloud-type
Cloud platform on top of which cluster is running, supported platforms - aws, azure, gcp, vmware, ibmcloud, bm
enum
No
aws
--kube-check
Connecting to the kubernetes api to check the node status, set to False for SNO
enum
No
true
--timeout
Duration to wait for completion of node scenario injection
number
No
180
--duration
Duration to stop the node before running the start action
number
No
120
--vsphere-ip
vSphere IP address
string
No
--vsphere-username
vSphere username
string (secret)
No
--vsphere-password
vSphere password
string (secret)
No
--aws-access-key-id
AWS Access Key Id
string (secret)
No
--aws-secret-access-key
AWS Secret Access Key
string (secret)
No
--aws-default-region
AWS default region
string
No
--bmc-user
Only needed for Baremetal ( bm ) - IPMI/bmc username
string(secret)
No
--bmc-password
Only needed for Baremetal ( bm ) - IPMI/bmc password
string(secret)
No
--bmc-address
Only needed for Baremetal ( bm ) - IPMI/bmc address
string
No
--ibmc-address
IBM Cloud URL
string
No
--ibmc-api-key
IBM Cloud API Key
string (secret)
No
--ibmc-power-address
IBM Power Cloud URL
string
No
--ibmc-cnr
IBM Cloud Power Workspace CNR
string
No
--disable-ssl-verification
Disable SSL verification, to avoid certificate errors
enum
Yes
false
--azure-tenant
Azure Tenant
string
No
--azure-client-secret
Azure Client Secret
string(secret)
No
--azure-client-id
Azure Client ID
string(secret)
No
--azure-subscription-id
Azure Subscription ID
string (secret)
No
--gcp-application-credentials
GCP application credentials file location
file
No
NOTE: The secret string types will be masked when the scenario is run
Parameter Dependencies
--node-name vs --label-selector: When --node-name is set, --label-selector is ignored. The scenario targets the named node(s) directly.
--instance-count: Only applies when using --label-selector. It limits how many of the matched nodes are targeted.
Cloud credentials: The --vsphere-*, --aws-*, --bmc-*, --ibmc-*, --azure-*, and --gcp-* parameters are only required for their respective --cloud-type value. For example, --aws-access-key-id is only needed when --cloud-type is aws.
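The precedence between --node-name, --label-selector, and --instance-count can be illustrated with a short Python sketch. `select_nodes` is a hypothetical function, and taking the first matches in order is an assumption made to keep the example deterministic; krkn's actual selection among matching nodes may differ.

```python
def select_nodes(all_nodes, node_name="", label_selector="", instance_count=1):
    """Pick target nodes; all_nodes maps node name -> dict of labels."""
    if node_name:
        # Explicit names win: the label selector and instance count are ignored.
        return [name.strip() for name in node_name.split(",") if name.strip()]
    key, _, value = label_selector.partition("=")
    matched = [
        name for name, labels in all_nodes.items()
        if key in labels and (not value or labels[key] == value)
    ]
    # Assumption: keep the first N matches in listing order.
    return matched[:instance_count]


if __name__ == "__main__":
    cluster = {
        "master-0": {"node-role.kubernetes.io/master": ""},
        "worker-0": {"node-role.kubernetes.io/worker": ""},
        "worker-1": {"node-role.kubernetes.io/worker": ""},
    }
    print(select_nodes(cluster, label_selector="node-role.kubernetes.io/worker"))
    print(select_nodes(cluster, node_name="worker-1,master-0",
                       label_selector="node-role.kubernetes.io/worker"))
```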
To see all available scenario options
krknctl run node-scenarios --help
Demo
See a demo of this scenario:
16.1 - Node Scenarios on Bare Metal
Disrupts node(s) on a bare metal Kubernetes/OpenShift cluster by driving power state through the host's BMC (IPMI). Unlike the cloud-provider node scenarios, this flow requires IPMI credentials (either default or per-machine) and the OpenShift `oc` CLI on the runner host. Supported actions are inherited from the parent [Node Scenarios](../_index.md) page (start, stop, stop_start, terminate, reboot, kubelet stop/restart, disk detach/attach, and so on).
How to Run Node Scenarios on Bare Metal
Choose your preferred method to run baremetal node scenarios:
For baremetal, set cloud_type: bm and provide IPMI credentials either at the root of the scenario entry (bmc_user / bmc_password) or per-machine inside bmc_info. If bmc_addr is omitted, Krkn falls back to the BMC value found on the matching BareMetalHost (oc get bmh -o wide --all-namespaces).
node_scenarios:
  - actions:
      - node_stop_start_scenario               # any action listed on the parent Node Scenarios page
    label_selector: node-role.kubernetes.io/worker
    instance_count: 1
    runs: 1
    timeout: 360
    duration: 120
    parallel: false
    cloud_type: bm
    kube_check: true
    bmc_user: defaultuser                      # default IPMI user; optional if every machine sets its own
    bmc_password: defaultpass                  # default IPMI password; optional if every machine sets its own
    bmc_info:                                  # per-machine overrides (optional)
      node-1:
        bmc_addr: mgmt-machine1.example.com
      node-2:
        bmc_addr: mgmt-machine2.example.com
        bmc_user: user
        bmc_password: pass
For the full set of node-scenario fields shared with other cloud providers (actions, node_name, label_selector, instance_count, etc.) see the parent Node Scenarios page.
Baremetal-specific fields
cloud_type — must be bm.
bmc_user, bmc_password — default IPMI credentials. May also be supplied via environment variables (BMC_USER, BMC_PASSWORD) — Krkn falls back to env when the YAML keys are absent.
bmc_info — per-machine overrides keyed by node name. Each entry accepts bmc_addr, bmc_user, bmc_password, and (for node_disk_detach_attach_scenario) a disks list.
For node_disk_detach_attach_scenario, bmc_info.<node>.disks is required and bmc_addr is not used.
Baremetal requires oc (OpenShift client) installed on the host running Krkn. Some node actions can occasionally corrupt the filesystem if the node does not shut down cleanly — keep recovery procedures handy.
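The credential fallback described above can be sketched in a few lines of Python. This is only an illustration of the documented precedence (per-machine bmc_info, then the root bmc_user / bmc_password keys, then the BMC_USER / BMC_PASSWORD environment variables); the function name resolve_bmc_credentials and the dictionary layout are assumptions made for this example, not Krkn's actual code.

```python
import os


def resolve_bmc_credentials(scenario, node_name, env=None):
    """Return (user, password) for node_name.

    Hypothetical helper: looks in bmc_info[node_name] first, then the
    root-level keys, then the BMC_USER / BMC_PASSWORD environment variables.
    """
    env = os.environ if env is None else env
    per_node = (scenario.get("bmc_info") or {}).get(node_name) or {}
    user = per_node.get("bmc_user") or scenario.get("bmc_user") or env.get("BMC_USER")
    password = (
        per_node.get("bmc_password")
        or scenario.get("bmc_password")
        or env.get("BMC_PASSWORD")
    )
    if not user or not password:
        raise ValueError(f"no IPMI credentials found for node {node_name!r}")
    return user, password


# Mirrors the sample scenario entry shown earlier on this page.
scenario = {
    "cloud_type": "bm",
    "bmc_user": "defaultuser",
    "bmc_password": "defaultpass",
    "bmc_info": {
        "node-1": {"bmc_addr": "mgmt-machine1.example.com"},
        "node-2": {
            "bmc_addr": "mgmt-machine2.example.com",
            "bmc_user": "user",
            "bmc_password": "pass",
        },
    },
}

print(resolve_bmc_credentials(scenario, "node-1", env={}))  # ('defaultuser', 'defaultpass')
print(resolve_bmc_credentials(scenario, "node-2", env={}))  # ('user', 'pass')
```

In this sketch node-1 inherits the root-level defaults because its bmc_info entry only sets bmc_addr, while node-2 uses its own override.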
Run
python run_kraken.py --config config/config.yaml
Run
Unlike other krkn-hub scenarios, baremetal node scenarios require a base64-encoded scenario file rather than per-parameter env vars. Author your scenario locally following the scenario syntax, then pass it to the container via SCENARIO_BASE64.
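If you prefer to prepare the SCENARIO_BASE64 value from a script rather than the shell, the encoding step can be sketched in Python as follows. The helper names encode_scenario and decode_scenario are hypothetical; the only assumption is that the container expects the same single-line output that base64 -w0 produces.

```python
import base64
import tempfile
from pathlib import Path


def encode_scenario(path):
    """Return the file contents as one unwrapped base64 line (like `base64 -w0`)."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def decode_scenario(value):
    """Reverse of encode_scenario, useful for checking what will be passed."""
    return base64.b64decode(value).decode("utf-8")


# Demonstration with a throwaway file; replace with your real scenario path.
with tempfile.TemporaryDirectory() as tmp:
    scenario_file = Path(tmp) / "baremetal_node_scenarios.yml"
    scenario_file.write_text("node_scenarios:\n  - cloud_type: bm\n")
    encoded = encode_scenario(scenario_file)

print(encoded)                   # value to export as SCENARIO_BASE64
print(decode_scenario(encoded))  # round-trips to the original YAML text
```

Decoding the value before starting the container is a cheap way to confirm that the file you encoded is the one you intended.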
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set CERBERUS_ENABLED so that the chaos injection container auto-connects.
$ podman run --name=<container_name> --net=host --pull=always --env-host=true\
-e SCENARIO_BASE64="$(base64 -w0 <scenario_file>)"\
-v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-scenarios-bm
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
$ docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always \
-e SCENARIO_BASE64="$(base64 -w0 <scenario_file>)"\
-v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-scenarios-bm
OR
$ docker run -e SCENARIO_BASE64="$(base64 -w0 <scenario_file>)"\
--net=host --pull=always -v <path-to-kube-config>:/home/krkn/.kube/config:Z -d quay.io/krkn-chaos/krkn-hub:node-scenarios-bm
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> --format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
TIP: Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container, for example by flattening it into a separate file and running chmod 444 on that file.
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario-specific variables.
| Parameter | Description | Type | Default | Required |
| --- | --- | --- | --- | --- |
| SCENARIO_BASE64 | Base64-encoded contents of a baremetal node scenario YAML (base64 -w0 baremetal_node_scenarios.yml) | string | | Yes |
| KRKN_DEBUG | When set to True, prints the decoded scenario and config files before running and enables --debug True | bool | False | No |
The contents of SCENARIO_BASE64 are validated against the node-scenarios-bm JSON schema before Krkn starts — invalid scenarios fail fast with a schema error.
NOTE: When using a custom metrics profile or alerts profile with CAPTURE_METRICS or ENABLE_ALERTS enabled, mount the metrics/alerts files from the host under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
Absolute path to the baremetal node-scenarios YAML file. krknctl base64-encodes the file and supplies it as SCENARIO_BASE64 to the container.
true
The scenario YAML must follow the baremetal node scenario schema. See the Krkn tab on this page for an annotated example and the list of supported actions.
Example
krknctl run node-scenarios-bm \
--scenario-file-path ~/krkn/scenarios/openshift/baremetal_node_scenarios.yml
Note
krknctl handles the base64 encoding for you — pass a plain filesystem path. The validation step inside the container (against config-schema.json) still applies, so invalid YAML is rejected before Krkn runs.
17 - Pod Network Scenarios
Pod outage
Scenario to block the traffic (Ingress/Egress) of a pod matching the labels for the specified duration of time to understand the behavior of the service/other services which depend on it during downtime. This helps with planning the requirements accordingly, be it improving the timeouts or tweaking the alerts etc.
With the current network policies, it is not possible to explicitly block ports that are enabled by an allowed network policy rule. This chaos scenario addresses this issue by using OVS flow rules to block ports related to the pod. It supports OpenShiftSDN and OVNKubernetes based networks.
Excluding Pods from Network Outage
The pod outage scenario now supports excluding specific pods from chaos testing using the exclude_label parameter. This allows you to target a namespace or group of pods with your chaos testing while deliberately preserving certain critical workloads.
Why Use Pod Exclusion?
This feature addresses several common use cases:
Testing resiliency of an application while keeping critical monitoring pods operational
Preserving designated “control plane” pods within a microservice architecture
Allowing targeted chaos without affecting auxiliary services in the same namespace
Enabling more precise pod selection when network policies require all related services to be in the same namespace
How to Use the exclude_label Parameter
The exclude_label parameter works alongside existing pod selection parameters (label_selector and pod_name). The system will:
Identify all pods in the target namespace
Exclude pods matching the exclude_label criteria (in format “key=value”)
Apply the existing filters (label_selector or pod_name)
Apply the chaos scenario to the resulting pod list
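The selection order above can be illustrated with a small Python sketch. It is not Krkn's implementation: the pod dictionaries, the select_pods and parse_label names, and the simple single key=value matching are assumptions chosen to keep the example short.

```python
def parse_label(label):
    """Split a "key=value" string into (key, value)."""
    key, sep, value = label.partition("=")
    if not sep or not key:
        raise ValueError(f"expected key=value, got {label!r}")
    return key, value


def select_pods(pods, label_selector=None, pod_name=None, exclude_label=None):
    """Apply exclude_label first, then label_selector or pod_name."""
    candidates = list(pods)  # step 1: all pods in the target namespace
    if exclude_label:        # step 2: drop pods carrying the exclude label
        key, value = parse_label(exclude_label)
        candidates = [p for p in candidates if p["labels"].get(key) != value]
    if label_selector:       # step 3: apply the existing filters
        key, value = parse_label(label_selector)
        candidates = [p for p in candidates if p["labels"].get(key) == value]
    elif pod_name:
        candidates = [p for p in candidates if p["name"] == pod_name]
    return [p["name"] for p in candidates]  # step 4: pods that receive chaos


# Hypothetical pods in the my-application namespace.
pods = [
    {"name": "my-service-1", "labels": {"app": "my-service"}},
    {"name": "my-service-2", "labels": {"app": "my-service", "critical": "true"}},
    {"name": "other-1", "labels": {"app": "other"}},
]

print(select_pods(pods, label_selector="app=my-service", exclude_label="critical=true"))
# ['my-service-1']
```

Because exclusion and selection are both plain filters, the resulting pod list is the same whichever of the two is applied first.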
For example, with label_selector set to app=my-service, the namespace set to my-application, and exclude_label set to critical=true, network disruption is applied to all pods with the label app=my-service in the my-application namespace, except for those that also have the label critical=true.
As another example, a scenario with exclude_label set to excluded=true that blocks ingress traffic on port 8443 for pods matching the component=ui label in the openshift-console namespace will skip any pods labeled excluded=true.
The exclude_label parameter is also supported in the pod network shaping scenarios (pod_egress_shaping and pod_ingress_shaping), allowing for the same selective application of network latency, packet loss, and bandwidth restriction.
How to Run Pod Network Scenarios
Choose your preferred method to run pod network scenarios:
- id: pod_network_outage
  config:
    namespace: openshift-console     # Required - Namespace of the pod to which filter need to be applied
    direction:                       # Optional - List of directions to apply filters
      - ingress                      # Blocks ingress traffic, Default both egress and ingress
    ingress_ports:                   # Optional - List of ports to block traffic on
      - 8443                         # Blocks 8443, Default [], i.e. all ports.
    label_selector: 'component=ui'   # Blocks access to openshift console
    exclude_label: 'critical=true'   # Optional - Pods matching this label will be excluded from the chaos
    image: quay.io/krkn-chaos/krkn:tools
Pod Network shaping
Scenario to introduce network latency, packet loss, and bandwidth restriction in the Pod’s network interface. The purpose of this scenario is to observe faults caused by random variations in the network.
Sample scenario config for egress traffic shaping (using plugin)
- id: pod_egress_shaping
  config:
    namespace: openshift-console     # Required - Namespace of the pod to which filter need to be applied.
    label_selector: 'component=ui'   # Applies traffic shaping to access openshift console.
    exclude_label: 'critical=true'   # Optional - Pods matching this label will be excluded from the chaos
    network_params:
      latency: 500ms                 # Add 500ms latency to egress traffic from the pod.
    image: quay.io/krkn-chaos/krkn:tools
Sample scenario config for ingress traffic shaping (using plugin)
- id: pod_ingress_shaping
  config:
    namespace: openshift-console     # Required - Namespace of the pod to which filter need to be applied.
    label_selector: 'component=ui'   # Applies traffic shaping to access openshift console.
    exclude_label: 'critical=true'   # Optional - Pods matching this label will be excluded from the chaos
    network_params:
      latency: 500ms                 # Add 500ms latency to ingress traffic to the pod.
    image: quay.io/krkn-chaos/krkn:tools
Steps
Pick the pods to introduce the network anomaly either from label_selector or pod_name.
Identify the pod interface name on the node.
Set traffic shaping config on pod’s interface using tc and netem.
Wait for the duration time.
Remove traffic shaping config on pod’s interface.
Remove the job that spawned the pod.
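To make the tc and netem step more concrete, the sketch below builds the kind of commands that add and later remove a netem qdisc on an interface. It is a simplified illustration, not the commands Krkn actually runs: the function name netem_commands, the loss and bandwidth parameter names, and the interface name eth0 are assumptions, and a root netem qdisc like this shapes traffic leaving the interface only.

```python
def netem_commands(interface, latency=None, loss=None, bandwidth=None):
    """Return (apply_cmd, remove_cmd) as argument lists for tc.

    latency maps to netem "delay", loss to "loss", bandwidth to "rate".
    """
    params = []
    if latency:
        params += ["delay", latency]
    if loss:
        params += ["loss", loss]
    if bandwidth:
        params += ["rate", bandwidth]
    if not params:
        raise ValueError("at least one network parameter is required")
    apply_cmd = ["tc", "qdisc", "add", "dev", interface, "root", "netem", *params]
    remove_cmd = ["tc", "qdisc", "del", "dev", interface, "root"]
    return apply_cmd, remove_cmd


# Equivalent of network_params: latency: 500ms on a hypothetical interface.
apply_cmd, remove_cmd = netem_commands("eth0", latency="500ms")
print(" ".join(apply_cmd))   # tc qdisc add dev eth0 root netem delay 500ms
print(" ".join(remove_cmd))  # tc qdisc del dev eth0 root
```

Returning argument lists rather than shell strings keeps the commands safe to pass to subprocess.run without shell quoting concerns.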
How to Use Plugin Name
Add the plugin name to the list of chaos_scenarios section in the config/config.yaml file
kraken:
  kubeconfig_path: ~/.kube/config   # Path to kubeconfig
  ..
  chaos_scenarios:
    - pod_network_scenarios:
        - scenarios/<scenario_name>.yaml
Note
You can specify multiple scenario files of the same type by adding additional paths to the list:
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
  chaos_scenarios:
    - pod_network_scenarios:
        - scenarios/pod-network.yaml
    - pod_disruption_scenarios:
        - scenarios/pod-kill.yaml
    - container_scenarios:
        - scenarios/container-kill.yaml
    - pod_network_scenarios:   # Same type can appear multiple times
        - scenarios/pod-network-2.yaml
Run
python run_kraken.py --config config/config.yaml
This scenario runs network chaos at the pod level on a Kubernetes/OpenShift cluster.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable so that the chaos injection container auto-connects.
$ podman run \
--name=<container_name> \
--net=host \
--pull=always \
--env-host=true\
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:pod-network-chaos
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option you’ll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh)\
--name=<container_name> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:pod-network-chaos
$ docker run \
-e <VARIABLE>=<value> \
--name=<container_name> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:pod-network-chaos
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v ~/kubeconfig:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
OR on the command line like example:
-e <VARIABLE>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
| Parameter | Description | Default |
| --- | --- | --- |
| NAMESPACE | Required - Namespace of the pod to which filter need to be applied | "" |
| IMAGE | Image used to disrupt network on a pod | "quay.io/krkn-chaos/krkn:tools" |
| LABEL_SELECTOR | Label of the pod(s) to target | "" |
| POD_NAME | When label_selector is not specified, pod matching the name will be selected for the chaos scenario | "" |
| EXCLUDE_LABEL | Pods matching this label will be excluded from the chaos even if they match other criteria | "" |
| INSTANCE_COUNT | Number of pods to perform action/select that match the label selector | 1 |
| TRAFFIC_TYPE | List of directions to apply filters - egress/ingress (needs to be a list) | [ingress, egress] |
| INGRESS_PORTS | Ingress ports to block (needs to be a list) | [] i.e. all ports |
| EGRESS_PORTS | Egress ports to block (needs to be a list) | [] i.e. all ports |
| WAIT_DURATION | The duration (in seconds) that the network chaos (traffic shaping, packet loss, etc.) persists on the target pods. This is the actual time window where the network disruption is active. It must be longer than TEST_DURATION to ensure the fault is active for the entire test. | 300 |
| TEST_DURATION | Duration of the test run (e.g. workload or verification) | 120 |
Note
For disconnected clusters, be sure to also mirror the helper image of quay.io/krkn-chaos/krkn:tools and set the mirrored image path properly
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
Namespace of the pod to which filter need to be applied
string
Yes
--image
Image used to disrupt network on a pod
string
No
quay.io/krkn-chaos/krkn:tools
--label-selector
When pod_name is not specified, pod matching the label will be selected for the chaos scenario
string
No
--exclude-label
Pods matching this label will be excluded from the chaos even if they match other criteria
string
No
""
--pod-name
When label_selector is not specified, pod matching the name will be selected for the chaos scenario
string
No
--instance-count
Targeted instance count matching the label selector
number
No
1
--traffic-type
List of directions to apply filters - egress/ingress ( needs to be a list )
string
No
“[ingress,egress]”
--ingress-ports
Ingress ports to block ( needs to be a list )
string
No
--egress-ports
Egress ports to block ( needs to be a list )
string
No
--wait-duration
Duration in seconds that the network chaos stays active; ensure that it is at least about twice test_duration
number
No
300
--test-duration
Duration of the test run
number
No
120
Parameter Dependencies
--ingress-ports / --egress-ports: When left empty, all ports are blocked for that traffic direction. Specify port numbers to restrict the filter to only those ports.
--wait-duration: Must be at least 2× --test-duration to allow the network to stabilize before verification.
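The duration rule above is easy to check before launching a run. The following is a minimal sketch; the function name check_durations is hypothetical and the check simply encodes the 2× relationship stated above.

```python
def check_durations(wait_duration, test_duration):
    """Return True when wait_duration is at least twice test_duration."""
    if wait_duration <= 0 or test_duration <= 0:
        raise ValueError("durations must be positive numbers of seconds")
    return wait_duration >= 2 * test_duration


print(check_durations(300, 120))  # True: the documented defaults satisfy the rule
print(check_durations(200, 120))  # False: 200 < 2 * 120
```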
To see all available scenario options
krknctl run pod-network-chaos --help
18 - Pod Scenarios
This scenario disrupts the pods matching the label or pod name in the specified namespace on a Kubernetes/OpenShift cluster, skipping any pods that match the exclude label.
Why pod scenarios are important:
Modern applications demand high availability, low downtime, and resilient infrastructure. Kubernetes provides building blocks like Deployments, ReplicaSets, and Services to support fault tolerance, but understanding how these interact during disruptions is critical for ensuring reliability. Pod disruption scenarios test this reliability under various conditions, validating that the application and infrastructure respond as expected.
Use cases of pod scenarios
1. Deleting a single pod
- **Use Case:** Simulates unplanned deletion of a single pod
- **Why It's Important:** Validates whether the ReplicaSet or Deployment automatically creates a replacement.
- **Customer Impact:** Ensures continuous service even if a pod unexpectedly crashes.
- **Recovery Timing:** Typically less than 10 seconds for stateless apps (seen in Krkn telemetry output).
- **HA Indicator:** Pod is automatically rescheduled and becomes Ready without manual intervention.
kubectl delete pod <pod-name> -n <namespace>
kubectl get pods -n <namespace> -w # watch for new pods
2. Deleting multiple pods simultaneously
- **Use Case:** Simulates a larger failure event, such as a node crash or AZ outage.
- **Why It's Important:** Tests whether the system has enough resources and policies to recover gracefully.
- **Customer Impact:** If all pods of a service fail, user experience is directly impacted.
- **HA Indicator:** Application can continue functioning from other replicas across zones/nodes.
3. Pod Eviction (Soft Disruption)
- **Use Case:** Triggered by Kubernetes itself during node upgrades or scaling down.
- **Why It's Important:** Ensures graceful termination and restart elsewhere without user impact.
- **Customer Impact:** Should be zero if readiness/liveness probes and PDBs are correctly configured.
- **HA Indicator:** Rolling disruption does not take down the whole application.
</krkn-hub-scenario>
## How to know if it is highly available
- ***Multiple Replicas Exist:*** Confirmed by checking `kubectl get deploy -n <namespace>` and seeing more than one replica.
- ***Pods Distributed Across Nodes/Availability Zones:*** Using `topologySpreadConstraints` or observing pod distribution in `kubectl get pods -o wide`. See [Health Checks](../../krkn/health-checks.md) for real-time visibility into the impact of chaos scenarios on application availability and performance.
- ***Service Uptime Remains Unaffected:*** During chaos test, verify app availability (synthetic probes, Prometheus alerts, etc).
- ***Recovery Is Automatic:*** No manual intervention needed to restore service.
- ***Krkn Telemetry Indicators:*** End of run data includes recovery times, pod reschedule latency, and service downtime which are vital metrics for assessing HA.
## Excluding Pods from Disruption
Use `exclude_label` to designate the safe pods in a group while the rest of the pods in a namespace are subjected to chaos. Some frequent use cases are:
- Disrupt the backend pods while leaving the highly available database replicas untouched.
- Inject faults in the application layer without stopping the infrastructure/monitoring pods.
- Run a rolling disruption experiment while leaving control-plane or system-critical components unaffected.
**Format:**
```yaml
exclude_label: "key=value"
```
Mechanism:
Pods are selected based on namespace_pattern + label_selector or name_pattern.
Before deletion, the pods that match exclude_label are removed from the list.
The remaining pods are subjected to chaos.
Example: Protect the Leader While Other etcd Replicas Are Killed
By default, pod scenarios target all pods matching the namespace and label selectors regardless of which node they run on. However, you can narrow down the scope to only affect pods running on specific nodes using two options:
Option 1: Using Node Label Selector
Target pods running on nodes with specific labels (e.g., control-plane nodes, worker nodes, nodes in a specific zone).
Format:
node_label_selector: "key=value"
Use Cases:
Test resilience of control-plane workloads by disrupting pods only on master/control-plane nodes
Simulate zone-specific failures by targeting nodes in a particular availability zone
Test worker node failures without affecting control-plane components
Pods are selected based on namespace_pattern + label_selector or name_pattern
The selection is further filtered to only include pods running on the specified nodes
If exclude_label is also specified, it’s applied after node filtering
The remaining pods are subjected to chaos
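The filtering order above (pod selection, then node filtering, then exclusion) can be sketched as follows. This is an illustration under stated assumptions rather than Krkn's code: the select_targets and matches names, the in-memory pod and node dictionaries, the use of re.search for the patterns, and the single key=value label matching are all choices made for this example.

```python
import re


def matches(labels, selector):
    """True when labels contain the key=value pair in selector."""
    key, _, value = selector.partition("=")
    return labels.get(key) == value


def select_targets(pods, nodes, namespace_pattern, label_selector=None,
                   name_pattern=".*", node_label_selector=None,
                   node_names=None, exclude_label=None):
    # 1. namespace_pattern + label_selector, or name_pattern
    namespace_re = re.compile(namespace_pattern)
    selected = [p for p in pods if namespace_re.search(p["namespace"])]
    if label_selector:
        selected = [p for p in selected if matches(p["labels"], label_selector)]
    else:
        name_re = re.compile(name_pattern)
        selected = [p for p in selected if name_re.search(p["name"])]
    # 2. keep only pods running on the requested nodes
    if node_label_selector:
        allowed = {n for n, labels in nodes.items() if matches(labels, node_label_selector)}
        selected = [p for p in selected if p["node"] in allowed]
    elif node_names:
        selected = [p for p in selected if p["node"] in set(node_names)]
    # 3. exclude_label is applied after node filtering
    if exclude_label:
        selected = [p for p in selected if not matches(p["labels"], exclude_label)]
    return [p["name"] for p in selected]


# Hypothetical cluster state.
nodes = {
    "master-0": {"node-role.kubernetes.io/control-plane": ""},
    "worker-0": {"node-role.kubernetes.io/worker": ""},
}
pods = [
    {"namespace": "kube-system", "name": "scheduler-a", "node": "master-0",
     "labels": {"k8s-app": "kube-scheduler"}},
    {"namespace": "kube-system", "name": "scheduler-b", "node": "master-0",
     "labels": {"k8s-app": "kube-scheduler", "critical": "true"}},
    {"namespace": "kube-system", "name": "scheduler-c", "node": "worker-0",
     "labels": {"k8s-app": "kube-scheduler"}},
    {"namespace": "default", "name": "web", "node": "master-0",
     "labels": {"k8s-app": "kube-scheduler"}},
]

targets = select_targets(
    pods, nodes,
    namespace_pattern="^kube-system$",
    label_selector="k8s-app=kube-scheduler",
    node_label_selector="node-role.kubernetes.io/control-plane=",
    exclude_label="critical=true",
)
print(targets)  # ['scheduler-a']
```

Here scheduler-c is dropped by the node filter, scheduler-b by exclude_label, and web by the namespace pattern, leaving a single target.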
Recovery Time Metrics in Krkn Telemetry
Krkn tracks three key recovery time metrics for each affected pod:
pod_rescheduling_time - The time (in seconds) that the Kubernetes cluster took to reschedule the pod after it was killed. This measures the cluster’s scheduling efficiency and includes the time from pod deletion until the replacement pod is scheduled on a node.
pod_readiness_time - The time (in seconds) the pod took to become ready after being scheduled. This measures application startup time, including container image pulls, initialization, and readiness probe success.
total_recovery_time - The total amount of time (in seconds) from pod deletion until the replacement pod became fully ready and available to serve traffic. This is the sum of rescheduling time and readiness time.
These metrics appear in the telemetry output under PodsStatus.recovered for successfully recovered pods. Pods that fail to recover within the timeout period appear under PodsStatus.unrecovered without timing data.
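As a rough illustration of how these fields relate, the sketch below checks that total_recovery_time equals the sum of the other two metrics and reports the slowest pod. The three metric names and the recovered / unrecovered split come from the description above; the surrounding dictionary layout, the pod_name key, and the summarize_recovery function are assumptions for the example and may not match the exact telemetry JSON.

```python
def summarize_recovery(pods_status, tolerance=1e-6):
    """Summarize a PodsStatus-like dict with recovered and unrecovered lists."""
    recovered = pods_status.get("recovered", [])
    unrecovered = pods_status.get("unrecovered", [])
    for pod in recovered:
        expected = pod["pod_rescheduling_time"] + pod["pod_readiness_time"]
        if abs(pod["total_recovery_time"] - expected) > tolerance:
            raise ValueError(f"inconsistent timings for {pod['pod_name']}")
    slowest = max(recovered, key=lambda p: p["total_recovery_time"], default=None)
    return {
        "recovered": len(recovered),
        "unrecovered": len(unrecovered),
        "slowest_pod": slowest["pod_name"] if slowest else None,
        "max_total_recovery_time": slowest["total_recovery_time"] if slowest else None,
    }


# Hypothetical telemetry excerpt (times in seconds).
pods_status = {
    "recovered": [
        {"pod_name": "api-1", "pod_rescheduling_time": 1.5,
         "pod_readiness_time": 4.0, "total_recovery_time": 5.5},
        {"pod_name": "api-2", "pod_rescheduling_time": 2.0,
         "pod_readiness_time": 8.25, "total_recovery_time": 10.25},
    ],
    "unrecovered": [{"pod_name": "api-3"}],
}

summary = summarize_recovery(pods_status)
print(summary)
```

A non-zero unrecovered count is usually the first thing to look at, since those pods carry no timing data at all.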
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
  chaos_scenarios:
    - pod_disruption_scenarios:
        - scenarios/pod-kill.yaml
        - scenarios/etcd-kill.yaml
    - container_scenarios:
        - scenarios/container-kill.yaml
    - node_scenarios:
        - scenarios/node-reboot.yaml
    - pod_disruption_scenarios:   # Same type can appear multiple times
        - scenarios/pod-kill-2.yaml
You can then create the scenario file with the following contents:
# yaml-language-server: $schema=../plugin.schema.json
- id: kill-pods
  config:
    namespace_pattern: ^kube-system$
    label_selector: k8s-app=kube-scheduler
    krkn_pod_recovery_time: 120
    # Not needed by default, but can be used if you want to target pods on specific nodes
    # Option 1: Target pods on nodes with specific labels [master/worker nodes]
    node_label_selector: node-role.kubernetes.io/control-plane=   # Target control-plane nodes (works on both k8s and openshift)
    exclude_label: 'critical=true'   # Optional - Pods matching this label will be excluded from the chaos
    # Option 2: Target pods of specific nodes (testing mixed node types)
    node_names:
      - ip-10-0-31-8.us-east-2.compute.internal     # Worker node 1
      - ip-10-0-48-188.us-east-2.compute.internal   # Worker node 2
      - ip-10-0-14-59.us-east-2.compute.internal    # Master node 1
Please adjust the schema reference to point to the schema file. This file will give you code completion and documentation for the available options in your IDE.
Pod Chaos Scenarios
The following are the components of Kubernetes/OpenShift for which a basic chaos scenario config exists today.
Kills random pods running in the OpenShift system namespaces.
✔️
Run
python run_kraken.py --config config/config.yaml
This scenario disrupts the pods matching the label in the specified namespace on a Kubernetes/OpenShift cluster.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable so that the chaos injection container auto-connects.
$ podman run \
--name=<container_name> \
--net=host \
--pull=always \
--env-host=true\
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:pod-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option you’ll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh)\
--name=<container_name> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:pod-scenarios
$ docker run \
-e <VARIABLE>=<value> \
--name=<container_name> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:pod-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v ~/kubeconfig:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
OR on the command line like example:
-e <VARIABLE>=<value>
See list of variables that apply to all scenarios here that can be used/set in addition to these scenario specific variables
| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| NAMESPACE | Targeted namespace in the cluster (supports regex) | string | `openshift-.*` |
| POD_LABEL | Label of the pod(s) to target | string | "" |
| EXCLUDE_LABEL | Pods matching this label will be excluded from the chaos even if they match other criteria | string | "" |
| NAME_PATTERN | Regex pattern to match the pods in NAMESPACE when POD_LABEL is not specified | string | `.*` |
| DISRUPTION_COUNT | Number of pods to disrupt | number | 1 |
| KILL_TIMEOUT | Timeout to wait for the target pod(s) to be removed in seconds | number | 180 |
| EXPECTED_RECOVERY_TIME | Fails if the disrupted pods do not recover within the timeout set | number | 120 |
| NODE_LABEL_SELECTOR | Label of the node(s) to target | string | "" |
| NODE_NAMES | Name of the node(s) to target. Example: ["worker-node-1","worker-node-2","master-node-1"] | string | [] |
Note
Set the NAMESPACE environment variable to openshift-.* to pick and disrupt pods randomly in the OpenShift system namespaces. DAEMON_MODE can also be enabled to disrupt pods every x seconds in the background to check reliability.
Note
In case of using custom metrics profile or alerts profile when CAPTURE_METRICS or ENABLE_ALERTS is enabled, mount the metrics profile from the host on which the container is run using podman/docker under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
Targeted namespace in the cluster ( supports regex )
string
No
openshift-*
--pod-label
Label of the pod(s) to target ex. “app=test”
string
No
--exclude-label
Pods matching this label will be excluded from the chaos even if they match other criteria
string
No
""
--name-pattern
Regex pattern to match the pods in NAMESPACE when POD_LABEL is not specified
string
No
.*
--disruption-count
Number of pods to disrupt
number
No
1
--kill-timeout
Timeout to wait for the target pod(s) to be removed in seconds
number
No
180
--expected-recovery-time
Fails if the disrupted pods do not recover within the timeout set
number
No
120
--node-label-selector
Label of the node(s) to target
string
No
""
--node-names
Name of the node(s) to target. Example: [“worker-node-1”,“worker-node-2”,“master-node-1”]
string
No
[]
Behavior Notes
Recovery monitoring: After disrupting pods, krkn monitors for recovery up to --expected-recovery-time seconds. If any pods remain unrecovered after the timeout, the scenario reports failure.
To see all available scenario options
krknctl run pod-scenarios --help
19 - Power Outage Scenarios
This scenario shuts down Kubernetes/OpenShift cluster for the specified duration to simulate power outages, brings it back online and checks if it’s healthy.
How to Run Power Outage Scenarios
Choose your preferred method to run power outage scenarios:
Power Outage/ Cluster shut down scenario can be injected by placing the shut_down config file under cluster_shut_down_scenario option in the kraken config. Refer to cluster_shut_down_scenario config file.
cluster_shut_down_scenario:   # Scenario to stop all the nodes for specified duration and restart the nodes.
  runs: 1                     # Number of times to execute the cluster_shut_down scenario.
  shut_down_duration: 120     # Duration in seconds to shut down the cluster.
  cloud_type: aws             # Cloud type on which Kubernetes/OpenShift runs.
How to Use Plugin Name
Add the plugin name to the list of chaos_scenarios section in the config/config.yaml file
kraken:
  kubeconfig_path: ~/.kube/config   # Path to kubeconfig
  ..
  chaos_scenarios:
    - cluster_shut_down_scenarios:
        - scenarios/<scenario_name>.yaml
Note
You can specify multiple scenario files of the same type by adding additional paths to the list:
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
  chaos_scenarios:
    - cluster_shut_down_scenarios:
        - scenarios/power-outage.yaml
    - pod_disruption_scenarios:
        - scenarios/pod-kill.yaml
    - node_scenarios:
        - scenarios/node-reboot.yaml
    - cluster_shut_down_scenarios:   # Same type can appear multiple times
        - scenarios/power-outage-2.yaml
Run
python run_kraken.py --config config/config.yaml
This scenario shuts down Kubernetes/OpenShift cluster for the specified duration to simulate power outages, brings it back online and checks if it’s healthy. More information can be found here
Right now, power outage and cluster shutdown are one and the same. We originally created this scenario to stop all the nodes and then start them back up, the way a customer would shut their cluster down.
In a real-life chaos scenario, though, we figured this was close to what would happen if the power went out on the AWS side and all of our EC2 nodes were stopped or powered off.
We looked at whether the AWS CLI had a way to forcefully (not gracefully) power off the nodes. It does not currently support this, so this scenario is as close as we can get to “pulling the plug”.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable so that the chaos injection container auto-connects.
$ podman run \
--name=<container_name> \
--net=host \
--pull=always \
--env-host=true\
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:power-outages
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option you’ll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh)\
--name=<container_name> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:power-outages
$ docker run \
-e <VARIABLE>=<value> \
--name=<container_name> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:power-outages
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v ~/kubeconfig:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See the list of variables that apply to all scenarios here; they can be set in addition to these scenario-specific variables.
When using a custom metrics profile or alerts profile with CAPTURE_METRICS or ENABLE_ALERTS enabled, mount the profile from the host running the container (with podman/docker) at /home/krkn/kraken/config/metrics-aggregated.yaml or /home/krkn/kraken/config/alerts, respectively.
Cloud platform on top of which cluster is running, supported platforms - aws, azure, gcp, vmware, ibmcloud, bm
enum
No
aws
--timeout
Time in seconds to wait for each node to be stopped or running after the cluster comes back
number
No
180
--shutdown-duration
Duration in seconds to shut down the cluster
number
No
1200
--vsphere-ip
vSphere IP address
string
No
--vsphere-username
vSphere username
string (secret)
No
--vsphere-password
vSphere password
string (secret)
No
--aws-access-key-id
AWS Access Key Id
string (secret)
No
--aws-secret-access-key
AWS Secret Access Key
string (secret)
No
--aws-default-region
AWS default region
string
No
--bmc-user
Only needed for Baremetal (bm) - IPMI/bmc username
string (secret)
No
--bmc-password
Only needed for Baremetal (bm) - IPMI/bmc password
string (secret)
No
--bmc-address
Only needed for Baremetal (bm) - IPMI/bmc address
string
No
--ibmc-address
IBM Cloud URL
string
No
--ibmc-api-key
IBM Cloud API Key
string (secret)
No
--azure-tenant
Azure Tenant
string
No
--azure-client-secret
Azure Client Secret
string (secret)
No
--azure-client-id
Azure Client ID
string (secret)
No
--azure-subscription-id
Azure Subscription ID
string (secret)
No
--gcp-application-credentials
GCP application credentials file location
file
No
NOTE: Values of the secret string types are masked when the scenario is run.
To see all available scenario options
krknctl run power-outages --help
20 - PVC Scenario
Scenario to fill up a given PersistentVolumeClaim by creating a temp file on the PVC from a pod associated with it. The purpose of this scenario is to fill up a volume to understand faults caused by the application using this volume.
How to Run PVC Scenarios
Choose your preferred method to run PVC scenarios:
pvc_scenario:
  pvc_name: <pvc_name>          # Name of the target PVC.
  pod_name: <pod_name>          # Name of the pod where the PVC is mounted. It will be ignored if the pvc_name is defined.
  namespace: <namespace_name>   # Namespace where the PVC is.
  fill_percentage: 50           # Target percentage to fill up the PVC. Value must be higher than current percentage. Valid values are between 0 and 99.
  duration: 60                  # Duration in seconds for the fault.
Steps
Get the pod name where the PVC is mounted.
Get the volume name mounted in the container pod.
Get the container name where the PVC is mounted.
Get the mount path where the PVC is mounted in the pod.
Get the PVC capacity and current used capacity.
Calculate file size to fill the PVC to the target fill_percentage.
Connect to the pod.
Create a temp file kraken.tmp with random data on the mount path:
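The size calculation in the steps above can be sketched as follows. This is a minimal illustration, not krkn's actual implementation: the function name `fill_file_size_kb` and the choice of KiB as the unit are assumptions made for the example.

```python
def fill_file_size_kb(capacity_kb: int, used_kb: int, fill_percentage: int) -> int:
    """Return how many KiB must be written so the PVC reaches fill_percentage.

    Hypothetical helper for illustration only; krkn's real code may differ.
    """
    # The scenario config states that valid values are between 0 and 99.
    if not 0 <= fill_percentage <= 99:
        raise ValueError("fill_percentage must be between 0 and 99")
    target_used_kb = capacity_kb * fill_percentage // 100
    # The target must be higher than the current usage, otherwise there is nothing to fill.
    if target_used_kb <= used_kb:
        raise ValueError("fill_percentage must be higher than the current usage")
    return target_used_kb - used_kb


# A 10 GiB volume with 2 GiB already used, filled to 50%:
size_kb = fill_file_size_kb(10 * 1024 * 1024, 2 * 1024 * 1024, 50)
print(size_kb)  # 3145728 KiB, i.e. 3 GiB
```

Rejecting a target at or below the current usage mirrors the rule in the scenario config that fill_percentage must be higher than the current percentage.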
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
  chaos_scenarios:
    - pvc_scenarios:
        - scenarios/pvc-fill.yaml
    - pod_disruption_scenarios:
        - scenarios/pod-kill.yaml
    - container_scenarios:
        - scenarios/container-kill.yaml
    - pvc_scenarios:  # Same type can appear multiple times
        - scenarios/pvc-fill-2.yaml
Run
python run_kraken.py --config config/config.yaml
This scenario fills up a given PersistentVolumeClaim by creating a temp file on the PVC from a pod associated with it. The purpose of this scenario is to fill up a volume to understand faults caused by the application using this volume. For more information, refer to the following documentation.
Run
If you enable Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable so that the chaos injection container autoconnects.
$ podman run \
--name=<container_name> \
--net=host \
--pull=always \
--env-host=true \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:pvc-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option you’ll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) \
--name=<container_name> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:pvc-scenarios
$ docker run \
-e <VARIABLE>=<value> \
--name=<container_name> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:pvc-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v ~/kubeconfig:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
Or on the command line, for example:
-e <VARIABLE>=<value>
If both PVC_NAME and POD_NAME are defined, the POD_NAME value is overridden by the Mounted By: value from the PVC definition.
See the list of variables that apply to all scenarios here; they can be set in addition to these scenario-specific variables.
Parameter
Description
Type
Default
PVC_NAME
Targeted PersistentVolumeClaim in the cluster (if null, POD_NAME is required)
string
POD_NAME
Targeted pod in the cluster (if null, PVC_NAME is required)
string
NAMESPACE
Targeted namespace in the cluster (required)
string
FILL_PERCENTAGE
Targeted percentage to be filled up in the PVC
number
50
DURATION
Duration in seconds with the PVC filled up
number
60
BLOCK_SIZE
Block size in bytes for the dd command used to fill the PVC
number
102400
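When --env-host is not available, the variables in the table above have to be passed one by one as -e flags. The sketch below assembles such a command line from a dictionary. The helper name `podman_run_args` and the sample values (`my-pvc`, `my-namespace`, `/path/to/kubeconfig`) are hypothetical; the flags and image name are the ones shown in the commands above.

```python
import shlex


def podman_run_args(image: str, kubeconfig: str, env: dict[str, str]) -> list[str]:
    """Build a `podman run` argument list with one -e flag per variable.

    Hypothetical helper for illustration; it only assembles the arguments
    and does not start a container.
    """
    args = ["podman", "run", "--net=host", "--pull=always"]
    for name, value in env.items():
        args += ["-e", f"{name}={value}"]
    args += ["-v", f"{kubeconfig}:/home/krkn/.kube/config:Z", "-d", image]
    return args


args = podman_run_args(
    "containers.krkn-chaos.dev/krkn-chaos/krkn-hub:pvc-scenarios",
    "/path/to/kubeconfig",  # placeholder path
    {"PVC_NAME": "my-pvc", "NAMESPACE": "my-namespace", "FILL_PERCENTAGE": "80"},
)
# shlex.join quotes each argument so the result can be pasted into a shell.
print(shlex.join(args))
```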
Note
When using a custom metrics profile or alerts profile with CAPTURE_METRICS or ENABLE_ALERTS enabled, mount the profile from the host running the container (with podman/docker) at /home/krkn/kraken/config/metrics-aggregated.yaml or /home/krkn/kraken/config/alerts, respectively.
Targeted PersistentVolumeClaim in the cluster (if null, POD_NAME is required)
string
No
--pod-name
Targeted pod in the cluster (if null, PVC_NAME is required)
string
No
--namespace
Targeted namespace in the cluster (required)
string
Yes
--fill-percentage
Targeted percentage to be filled up in the PVC
number
No
50
--duration
Duration in seconds with the PVC filled up
number
No
60
Parameter Dependencies
--pvc-name vs --pod-name: At least one is required. If both are set, --pvc-name takes precedence and --pod-name is ignored.
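The precedence rule above can be expressed as a small check. This is a sketch of the documented behavior only; the function name `resolve_target` and its return shape are assumptions, not part of krknctl.

```python
from typing import Optional, Tuple


def resolve_target(pvc_name: Optional[str], pod_name: Optional[str]) -> Tuple[str, str]:
    """Return which option drives the scenario, as (kind, name).

    Hypothetical helper: --pvc-name wins when both are set, and at least
    one of the two must be provided.
    """
    if pvc_name:
        return ("pvc", pvc_name)
    if pod_name:
        return ("pod", pod_name)
    raise ValueError("at least one of --pvc-name or --pod-name is required")


print(resolve_target("data-pvc", "web-0"))  # ('pvc', 'data-pvc')
print(resolve_target(None, "web-0"))        # ('pod', 'web-0')
```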
Behavior Notes
Automatic cleanup: After --duration expires, krkn automatically deletes the temporary fill file from the PVC.
PVC requirements: The target PVC must be in Bound state and mounted to an active pod. The scenario locates the mount path by inspecting the pod’s volume mounts.
To see all available scenario options
krknctl run pvc-scenarios --help
21 - Service Disruption Scenarios
Using this type of scenario configuration one is able to delete crucial objects in a specific namespace, or a namespace matching a certain regex string.
How to Run Service Disruption Scenarios
Choose your preferred method to run service disruption scenarios:
namespace: Specific namespace or regex style namespace of what you want to delete. Gets all namespaces if not specified; set to "" if you want to use the label_selector field.
Set to ‘^.*$’ and label_selector to "" to randomly select any namespace in your cluster.
label_selector: Label on the namespace you want to delete. Set to "" if you are using the namespace variable.
delete_count: Number of namespaces to kill in each run. Based on matching namespace and label specified, default is 1.
runs: Number of runs/iterations to kill namespaces, default is 1.
sleep: Number of seconds to wait between each iteration/count of killing namespaces. Defaults to 10 seconds if not set
This scenario selects one or more namespaces, depending on the configuration, kills all of the object types below in that namespace, and waits for them to be Running in the post action:
Services
Daemonsets
Statefulsets
Replicasets
Deployments
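How the namespace and delete_count settings interact can be sketched as follows. This is an illustration under stated assumptions, not krkn's code: the function name `select_namespaces` is hypothetical, and the use of `re.search` for matching is an assumption about how the pattern is applied.

```python
import random
import re
from typing import List, Optional


def select_namespaces(
    all_namespaces: List[str],
    pattern: str,
    delete_count: int = 1,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick up to delete_count namespaces whose names match the regex pattern.

    Hypothetical helper; matching with re.search is an assumption.
    """
    rng = rng or random.Random()
    regex = re.compile(pattern)
    matching = [ns for ns in all_namespaces if regex.search(ns)]
    # Never ask for more namespaces than actually matched.
    return rng.sample(matching, min(delete_count, len(matching)))


namespaces = ["default", "openshift-etcd", "openshift-apiserver", "kube-system"]
picked = select_namespaces(namespaces, "^openshift-.*$", delete_count=1)
print(picked)  # one of the two openshift-* namespaces, chosen at random
```

With the pattern `^.*$` every namespace matches, which corresponds to the random selection of any namespace described above.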
How to Use Plugin Name
Add the plugin name to the list of chaos_scenarios section in the config/config.yaml file
kraken:
  kubeconfig_path: ~/.kube/config  # Path to kubeconfig
  ..
  chaos_scenarios:
    - service_disruption_scenarios:
        - scenarios/<scenario_name>.yaml
Note
You can specify multiple scenario files of the same type by adding additional paths to the list:
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
  chaos_scenarios:
    - service_disruption_scenarios:
        - scenarios/service-disruption.yaml
    - pod_disruption_scenarios:
        - scenarios/pod-kill.yaml
    - container_scenarios:
        - scenarios/container-kill.yaml
    - service_disruption_scenarios:  # Same type can appear multiple times
        - scenarios/service-disruption-2.yaml
Run
python run_kraken.py --config config/config.yaml
This scenario deletes main objects within a namespace in your Kubernetes/OpenShift cluster. More information can be found here.
Run
If you enable Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable so that the chaos injection container autoconnects.
$ podman run \
--name=<container_name> \
--net=host \
--pull=always \
--env-host=true \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:service-disruption-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option you’ll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) \
--name=<container_name> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:service-disruption-scenarios
$ docker run \
-e <VARIABLE>=<value> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:service-disruption-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v ~/kubeconfig:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:<scenario>
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
Or on the command line, for example:
-e <VARIABLE>=<value>
See the list of variables that apply to all scenarios here; they can be set in addition to these scenario-specific variables.
Parameter
Description
Type
Default
LABEL_SELECTOR
Label of the namespace to target. Set this parameter only if NAMESPACE is not set
string
""
NAMESPACE
Name of the namespace you want to target. Set this parameter only if LABEL_SELECTOR is not set
string
"openshift-etcd"
SLEEP
Number of seconds to wait before polling to see if namespace exists again
number
15
DELETE_COUNT
Number of namespaces to kill in each run, based on matching namespace and label specified
number
1
RUNS
Number of runs to execute the action
number
1
Note
When using a custom metrics profile or alerts profile with CAPTURE_METRICS or ENABLE_ALERTS enabled, mount the profile from the host running the container (with podman/docker) at /home/krkn/kraken/config/metrics-aggregated.yaml or /home/krkn/kraken/config/alerts, respectively.
Label of the namespace to target. Set this parameter only if NAMESPACE is not set
string
No
--delete-count
Number of namespaces to kill in each run, based on matching namespace and label specified
number
No
1
--runs
Number of runs to execute the action
number
No
1
Behavior Notes
No automatic recovery: After krkn deletes the services, they are not automatically recreated. Services will only come back if managed by a controller (e.g. Helm release, operator, or GitOps pipeline). Verify your recovery mechanism before running this scenario.
To see all available scenario options
krknctl run service-disruption-scenarios --help
22 - Service Hijacking Scenario
Service Hijacking Scenarios aim to simulate fake HTTP responses from a workload targeted by a Service already deployed in the cluster. This scenario is executed by deploying a custom-made web service and modifying the target Service selector to direct traffic to this web service for a specified duration.
It employs a time-based test plan from the scenario configuration file, which specifies the behavior of resources during the chaos scenario as follows:
The scenario will focus on the service_name within the service_namespace,
substituting the selector with a randomly generated one, which is added as a label in the mock service manifest.
This allows multiple scenarios to be executed in the same namespace, each targeting different services without causing conflicts.
The newly deployed mock web service will expose a service_target_port,
which can be either a named or numeric port based on the service configuration.
This ensures that the Service correctly routes HTTP traffic to the mock web service during the chaos run.
Each step will last for duration seconds from the deployment of the mock web service in the cluster.
For each HTTP resource, defined as a top-level YAML property of the plan
(it could be a specific resource, e.g., /list/index.php, or a path-based resource typical in MVC frameworks),
one or more HTTP request methods can be specified. Both standard and custom request methods are supported.
During this time frame, the web service will respond with:
status: The HTTP status code to be returned.
mime_type: The MIME type (can be standard or custom).
payload: The response body to be returned to the client.
At the end of the step duration, the web service will proceed to the next step (if available) until
the global chaos_duration concludes. At this point, the original service will be restored,
and the custom web service and its resources will be undeployed.
NOTE: Some clients (e.g., cURL, jQuery) may optimize queries using lightweight methods (like HEAD or OPTIONS)
to probe API behavior. If these methods are not defined in the test plan, the web service may respond with
a 405 or 404 status code. If you encounter unexpected behavior, consider this use case.
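The time-based plan described above can be illustrated with a small lookup that returns the step active at a given number of seconds after the mock web service was deployed. This is a sketch of the documented behavior, not the code of the krkn web service; the function name `active_step` is hypothetical, and letting the last step continue once all durations have elapsed is an assumption about how the plan is evaluated.

```python
from typing import Dict, List, Optional


def active_step(steps: List[Dict], elapsed: float) -> Optional[Dict]:
    """Return the step that is active `elapsed` seconds after deployment.

    Hypothetical helper: steps run back to back, each for its `duration`;
    once every duration has passed, the last step stays active.
    """
    if not steps:
        return None
    boundary = 0.0
    for step in steps:
        boundary += step["duration"]
        if elapsed < boundary:
            return step
    return steps[-1]


# Steps for one HTTP method of one resource.
get_steps = [
    {"duration": 15, "status": 500, "mime_type": "application/json",
     "payload": '{"status":"internal server error"}'},
    {"duration": 15, "status": 201, "mime_type": "application/json",
     "payload": '{"status":"resource created"}'},
]

for t in (0, 14, 15, 29):
    print(t, active_step(get_steps, t)["status"])
```

With these two 15-second steps, requests in the first 15 seconds receive a 500 response and later requests receive a 201.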
How to Run Service Hijacking Scenarios
Choose your preferred method to run service hijacking scenarios:
service_target_port: http-web-svc  # The port of the service to be hijacked (can be named or numeric, based on the workload and service configuration).
service_name: nginx-service  # The name of the service that will be hijacked.
service_namespace: default  # The namespace where the target service is located.
image: quay.io/krkn-chaos/krkn-service-hijacking:v0.1.3  # Image of the krkn web service to be deployed to receive traffic.
chaos_duration: 30  # Total duration of the chaos scenario in seconds.
privileged: True  # True or false if need privileged securityContext to run
plan:
  - resource: "/list/index.php"  # Specifies the resource or path to respond to in the scenario. For paths, both the path and query parameters are captured but ignored. For resources, only query parameters are captured.
    steps:  # A time-based plan consisting of steps can be defined for each resource.
      GET:  # One or more HTTP methods can be specified for each step. Note: Non-standard methods are supported for fully custom web services (e.g., using NONEXISTENT instead of POST).
        - duration: 15  # Duration in seconds for this step before moving to the next one, if defined. Otherwise, this step will continue until the chaos scenario ends.
          status: 500  # HTTP status code to be returned in this step.
          mime_type: "application/json"  # MIME type of the response for this step.
          payload: |  # The response payload for this step.
            {"status":"internal server error"}
        - duration: 15
          status: 201
          mime_type: "application/json"
          payload: |
            {
              "status":"resource created"
            }
      POST:
        - duration: 15
          status: 401
          mime_type: "application/json"
          payload: |
            {
              "status": "unauthorized"
            }
        - duration: 15
          status: 404
          mime_type: "text/plain"
          payload: "not found"
How to Use Plugin Name
Add the plugin name to the list of chaos_scenarios section in the config/config.yaml file
kraken:
  kubeconfig_path: ~/.kube/config  # Path to kubeconfig
  ..
  chaos_scenarios:
    - service_hijacking_scenarios:
        - scenarios/<scenario_name>.yaml
Note
You can specify multiple scenario files of the same type by adding additional paths to the list:
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
  chaos_scenarios:
    - service_hijacking_scenarios:
        - scenarios/service-hijack.yaml
    - pod_disruption_scenarios:
        - scenarios/pod-kill.yaml
    - network_chaos_scenarios:
        - scenarios/network-chaos.yaml
    - service_hijacking_scenarios:  # Same type can appear multiple times
        - scenarios/service-hijack-2.yaml
Run
python run_kraken.py --config config/config.yaml
This scenario reroutes traffic intended for a target service to a custom web service that is automatically deployed by Krkn.
This web service responds with user-defined HTTP statuses, MIME types, and bodies.
For more details, please refer to the following documentation.
Run
Unlike other krkn-hub scenarios, this one requires a specific configuration due to its unique structure.
You must set up the scenario in a local file following the scenario syntax,
and then pass this file’s base64-encoded content to the container via the SCENARIO_BASE64 variable.
If you enable Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs.
Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED
environment variable so that the chaos injection container autoconnects.
$ podman run --name=<container_name> \
-e SCENARIO_BASE64="$(base64 -w0 <scenario_file>)" \
-v <path_to_kubeconfig>:/home/krkn/.kube/config:Z containers.krkn-chaos.dev/krkn-chaos/krkn-hub:service-hijacking
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option you’ll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ export SCENARIO_BASE64="$(base64 -w0 <scenario_file>)"
$ docker run $(./get_docker_params.sh) --name=<container_name> \
--net=host --pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:service-hijacking
OR
$ docker run --name=<container_name> -e SCENARIO_BASE64="$(base64 -w0 <scenario_file>)" \
--net=host --pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:service-hijacking
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See the list of variables that apply to all scenarios here; they can be set in addition to these scenario-specific variables.
Parameter
Description
SCENARIO_BASE64
Base64 encoded service-hijacking scenario file. Note that the -w0 option in the command substitution SCENARIO_BASE64="$(base64 -w0 <scenario_file>)" is mandatory in order to remove line breaks from the base64 command output
A sample scenario file can be found here; you’ll need to customize it based on the response codes you want for API calls.
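For environments where `base64 -w0` is not available, the same single-line encoding can be produced with Python's standard library, since `base64.b64encode` does not insert line breaks. The sketch below is only an illustration: the helper name `encode_scenario` and the sample file content are assumptions, not part of krkn-hub.

```python
import base64
import tempfile
from pathlib import Path


def encode_scenario(path: str) -> str:
    """Return the file content base64-encoded on a single line.

    Hypothetical helper; b64encode emits no line breaks, which matches
    the effect of `base64 -w0`.
    """
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


# Write a small stand-in scenario file for the demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as handle:
    handle.write("service_name: nginx-service\nchaos_duration: 30\n")

encoded = encode_scenario(handle.name)
print(encoded)
```

The resulting string can then be exported as SCENARIO_BASE64 or passed with -e, as in the commands above.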
Note
When using a custom metrics profile or alerts profile with CAPTURE_METRICS or ENABLE_ALERTS enabled, mount the profile from the host running the container (with podman/docker) at /home/krkn/kraken/config/metrics-aggregated.yaml or /home/krkn/kraken/config/alerts, respectively.
The absolute path of the scenario file compiled following the documentation
file_base64
Yes
A sample scenario file can be found here; you’ll need to customize it based on the response codes you want for API calls.
Note
Note that the -w0 option in the command substitution SCENARIO_BASE64="$(base64 -w0 <scenario_file>)" is mandatory in order to remove line breaks from the base64 command output
To see all available scenario options
krknctl run service-hijacking --help
23 - Syn Flood Scenarios
This scenario generates a substantial amount of TCP traffic directed at one or more Kubernetes services within
the cluster to test the server’s resiliency under extreme traffic conditions.
It can also target hosts outside the cluster by specifying a reachable IP address or hostname.
This scenario leverages the distributed nature of Kubernetes clusters to instantiate multiple instances
of the same pod against a single host, significantly increasing the effectiveness of the attack.
The configuration also allows for the specification of multiple node selectors, enabling Kubernetes to schedule
the attacker pods on a user-defined subset of nodes to make the test more realistic.
The attacker container source code is available here.
How to Run Syn Flood Scenarios
Choose your preferred method to run syn flood scenarios:
packet-size: 120  # hping3 packet size
window-size: 64  # hping3 TCP window size
duration: 10  # chaos scenario duration
namespace: default  # namespace where the target service(s) are deployed
target-service: target-svc  # target service name (if set target-service-label must be empty)
target-port: 80  # target service TCP port
target-service-label: ""  # target service label, can be used to target multiple target at the same time
                          # if they have the same label set (if set target-service must be empty)
number-of-pods: 2  # number of attacker pod instantiated per each target
image: quay.io/krkn-chaos/krkn-syn-flood  # syn flood attacker container image
attacker-nodes:  # this will set the node affinity to schedule the attacker node. Per each node label selector
                 # can be specified multiple values in this way the kube scheduler will schedule the attacker pods
                 # in the best way possible based on the provided labels. Multiple labels can be specified
  kubernetes.io/hostname:
    - host_1
    - host_2
  kubernetes.io/os:
    - linux
How to Use Plugin Name
Add the plugin name to the list of chaos_scenarios section in the config/config.yaml file
kraken:
  kubeconfig_path: ~/.kube/config  # Path to kubeconfig
  ..
  chaos_scenarios:
    - syn_flood_scenarios:
        - scenarios/<scenario_name>.yaml
Note
You can specify multiple scenario files of the same type by adding additional paths to the list:
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
  chaos_scenarios:
    - syn_flood_scenarios:
        - scenarios/syn-flood.yaml
    - pod_disruption_scenarios:
        - scenarios/pod-kill.yaml
    - network_chaos_scenarios:
        - scenarios/network-chaos.yaml
    - syn_flood_scenarios:  # Same type can appear multiple times
        - scenarios/syn-flood-2.yaml
Run
python run_kraken.py --config config/config.yaml
Syn Flood scenario
This scenario simulates a user-defined surge of TCP SYN requests directed at one or more services deployed within the cluster or an external target reachable by the cluster.
For more details, please refer to the following documentation.
Run
If you enable Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable so that the chaos injection container autoconnects.
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option you’ll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
TIP: Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
Or on the command line, for example:
-e <VARIABLE>=<value>
See the list of variables that apply to all scenarios here; they can be set in addition to these scenario-specific variables.
Parameter
Description
Default
PACKET_SIZE
The size in bytes of the SYN packet
120
WINDOW_SIZE
The TCP window size between packets in bytes
64
TOTAL_CHAOS_DURATION
The number of seconds the chaos will last
120
NAMESPACE
The namespace containing the target service and where the attacker pods will be deployed
default
TARGET_SERVICE
The service name (or the hostname/IP address in case an external target will be hit) that will be affected by the attack. Must be empty if TARGET_SERVICE_LABEL will be set
TARGET_PORT
The TCP port that will be targeted by the attack
TARGET_SERVICE_LABEL
The label that will be used to select one or more services. Must be left empty if TARGET_SERVICE variable is set
NUMBER_OF_PODS
The number of attacker pods that will be deployed
2
IMAGE
The container image that will be used to perform the scenario
quay.io/krkn-chaos/krkn-syn-flood:latest
NODE_SELECTORS
The node selectors are used to guide the cluster on where to deploy attacker pods. You can specify one or more labels in the format key=value;key=value2 (even using the same key) to choose one or more node categories. If left empty, the pods will be scheduled on any available node, depending on the cluster’s capacity.
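The key=value;key=value2 format described for NODE_SELECTORS can be parsed into a mapping from each label to its list of values, the same shape as the attacker-nodes section of the scenario file. The sketch below illustrates the format only; the function name `parse_node_selectors` is hypothetical and krkn-hub's actual parsing may differ.

```python
from typing import Dict, List


def parse_node_selectors(value: str) -> Dict[str, List[str]]:
    """Parse 'key=value;key=value2' into {key: [value, value2]}.

    Hypothetical helper for illustration; repeated keys accumulate values.
    """
    selectors: Dict[str, List[str]] = {}
    for pair in value.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate an empty string or a trailing ';'
        key, separator, label_value = pair.partition("=")
        if not separator or not key or not label_value:
            raise ValueError(f"invalid node selector: {pair!r}")
        selectors.setdefault(key, []).append(label_value)
    return selectors


parsed = parse_node_selectors(
    "kubernetes.io/hostname=host_1;kubernetes.io/hostname=host_2;kubernetes.io/os=linux"
)
print(parsed)
# {'kubernetes.io/hostname': ['host_1', 'host_2'], 'kubernetes.io/os': ['linux']}
```

An empty string yields an empty mapping, which corresponds to leaving the variable unset so that pods can be scheduled on any available node.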
NOTE: When using a custom metrics profile or alerts profile with CAPTURE_METRICS or ENABLE_ALERTS enabled, mount the profile from the host running the container (with podman/docker) at /home/krkn/kraken/config/metrics-aggregated.yaml or /home/krkn/kraken/config/alerts, respectively.
The namespace containing the target service and where the attacker pods will be deployed
string
No
default
--target-service
The service name (or the hostname/IP address in case an external target will be hit) that will be affected by the attack. Must be empty if TARGET_SERVICE_LABEL will be set
string
No
--target-port
The TCP port that will be targeted by the attack
number
Yes
--target-service-label
The label that will be used to select one or more services. Must be left empty if TARGET_SERVICE variable is set
string
No
--number-of-pods
The number of attacker pods that will be deployed
number
No
2
--image
The container image that will be used to perform the scenario
string
No
quay.io/krkn-chaos/krkn-syn-flood:latest
--node-selectors
The node selectors are used to guide the cluster on where to deploy attacker pods. You can specify one or more labels in the format key=value;key=value2 (even using the same key) to choose one or more node categories. If left empty, the pods will be scheduled on any available node, depending on the cluster’s capacity.
string
No
To see all available scenario options
krknctl run syn-flood --help
24 - Time Scenarios
Using this type of scenario configuration, one is able to change the time and/or date of the system for pods or nodes.
How to Run Time Scenarios
Choose your preferred method to run time scenarios:
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
  chaos_scenarios:
    - time_scenarios:
        - scenarios/time-skew.yaml
    - pod_disruption_scenarios:
        - scenarios/pod-kill.yaml
    - node_scenarios:
        - scenarios/node-reboot.yaml
    - time_scenarios:  # Same type can appear multiple times
        - scenarios/time-skew-2.yaml
Run
python run_kraken.py --config config/config.yaml
This scenario skews the date and time of the nodes and pods matching the label on a Kubernetes/OpenShift cluster. More information can be found here.
Run
If you enable Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos, and set the CERBERUS_ENABLED environment variable so that the chaos injection container autoconnects.
$ podman run \
--name=<container_name> \
--net=host \
--pull=always \
--env-host=true \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:time-scenarios
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option you’ll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) \
--name=<container_name> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:time-scenarios
$ docker run \
-e <VARIABLE>=<value> \
--name=<container_name> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:time-scenarios
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs exit code which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
example:
export <parameter_name>=<value>
See the list of variables that apply to all scenarios here; they can be set in addition to these scenario-specific variables.
| Parameter | Description | Default |
| --- | --- | --- |
| OBJECT_TYPE | Object to target. Supported options: pod, node | pod |
| LABEL_SELECTOR | Label of the container(s) or nodes to target | k8s-app=etcd |
| ACTION | Action to run. Supported actions: skew_time, skew_date | skew_date |
| OBJECT_NAME | List of the names of pods or nodes you want to skew (optional parameter) | [] |
| CONTAINER_NAME | Container in the specified pod to target in case the pod has multiple containers running. A random container is picked if empty | "" |
| NAMESPACE | Namespace of the pods you want to skew; needs to be set only if setting a specific pod name | "" |
Note
When using a custom metrics profile or alerts profile with CAPTURE_METRICS or ENABLE_ALERTS enabled, mount the profile from the host on which the container runs, using podman/docker, under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.
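When --env-host is not available, the scenario-specific variables listed above have to be passed one by one with -e. The following is a minimal Python sketch of building those arguments. The helper `env_flags` is hypothetical, the defaults are copied from the parameter list above, and the validation only reflects the supported values stated there.

```python
# Defaults and supported values copied from the parameter list above.
TIME_SCENARIO_DEFAULTS = {
    "OBJECT_TYPE": "pod",
    "LABEL_SELECTOR": "k8s-app=etcd",
    "ACTION": "skew_date",
}
SUPPORTED = {
    "OBJECT_TYPE": {"pod", "node"},
    "ACTION": {"skew_time", "skew_date"},
}


def env_flags(overrides: dict[str, str]) -> list[str]:
    """Return -e VARIABLE=value arguments for podman/docker run."""
    merged = {**TIME_SCENARIO_DEFAULTS, **overrides}
    flags = []
    for name, value in merged.items():
        allowed = SUPPORTED.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}, got {value!r}")
        flags += ["-e", f"{name}={value}"]
    return flags
```

For example, `env_flags({"ACTION": "skew_time"})` yields a list that can be spliced into a `podman run` or `docker run` argument list ahead of the image name.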
Action to run. Supported actions: skew_time or skew_date | enum | No | skew_date
--object-names | List of the names of pods or nodes you want to skew | string | No
--container-name | Container in the specified pod to target in case the pod has multiple containers running. Random container is picked if empty | string | No
--namespace | Namespace of the pods you want to skew, need to be set only if setting a specific pod name | string | No
To see all available scenario options
krknctl run time-scenarios --help
25 - Zone Outage Scenarios
Scenario to create outage in a targeted zone in the public cloud to understand the impact on both Kubernetes/OpenShift control plane as well as applications running on the worker nodes in that zone.
There are 2 ways these scenarios run:
For AWS, it tweaks the network ACL of the zone to simulate the failure. This stops both ingress and egress traffic for all the nodes in that zone for the specified duration, after which the ACL is reverted to its previous state.
For GCP, it finds the nodes (master, worker, and infra) in the zone you want to target, stops them for the set duration, and then starts them back up. It works this way because any edit to a node requires the node to be stopped first, so editing the network the way the AWS scenario does would still require stopping the nodes.
How to Run Zone Outage Scenarios
Choose your preferred method to run zone outage scenarios:
Zone outage can be injected by placing the zone_outage config file under zone_outages option in the kraken config. Refer to zone_outage_scenario config file for the parameters that need to be defined.
zone_outage:                       # Scenario to create an outage of a zone by tweaking network ACL.
  cloud_type: aws                  # Cloud type on which Kubernetes/OpenShift runs. aws is the only platform supported currently for this scenario.
  duration: 600                    # Duration in seconds after which the zone will be back online.
  vpc_id:                          # Cluster virtual private network to target.
  subnet_id: [subnet1, subnet2]    # List of subnet-id's to deny both ingress and egress traffic.
Note
vpc_id and subnet_id can be obtained from the cloud web console by selecting one of the instances in the targeted zone ( us-west-2a for example ).
zone_outage:                       # Scenario to create an outage of a zone by stopping its nodes.
  cloud_type: gcp                  # Cloud type on which Kubernetes/OpenShift runs.
  duration: 600                    # Duration in seconds after which the zone will be back online.
  zone: <zone>                     # Zone of nodes to stop and then restart after the duration ends.
  kube_check: True                 # Run Kubernetes API calls to see if the node gets to a certain state during the scenario.
Note
Targeting multiple subnets causes downtime in multiple zones, which might impact cluster health, especially if those zones have control plane components deployed.
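Because the AWS and GCP variants take different keys, a quick check before a run can catch a half-filled scenario file. The following is a minimal Python sketch operating on the mapping a YAML loader would return for the files above. The function `check_zone_outage` is hypothetical, and treating every key from the two sample configs as required is an assumption of this sketch, not documented Kraken behavior.

```python
# Keys each cloud_type needs, taken from the two sample configs above.
# Treating them all as required is an assumption of this sketch.
REQUIRED_KEYS = {
    "aws": {"duration", "vpc_id", "subnet_id"},
    "gcp": {"duration", "zone"},
}


def check_zone_outage(config: dict) -> list[str]:
    """Return the problems found in a zone_outage mapping (empty if none)."""
    scenario = config.get("zone_outage") or {}
    cloud = scenario.get("cloud_type")
    if cloud not in REQUIRED_KEYS:
        return [f"unsupported cloud_type: {cloud!r}"]
    problems = []
    for key in sorted(REQUIRED_KEYS[cloud]):
        # A key left blank in YAML loads as None; also reject empty strings/lists.
        if scenario.get(key) in (None, "", []):
            problems.append(f"missing value for {key}")
    duration = scenario.get("duration")
    if duration is not None and (not isinstance(duration, int) or duration <= 0):
        problems.append("duration must be a positive number of seconds")
    return problems
```

An empty list means the mapping has a value for every key used in the matching sample; the sample AWS file as printed above would be flagged for `vpc_id` until a value is filled in.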
AWS - Debugging steps in case of failures
If a failure occurs during the steps that revert the network ACL to allow traffic and bring back the cluster nodes in the zone, the nodes in that zone will be in the NotReady condition. Here is how to fix it:
OpenShift by default deploys the nodes in different zones for fault tolerance, for example us-west-2a, us-west-2b, us-west-2c. The cluster is associated with a virtual private network, and each zone has its own subnet with a network ACL that defines the ingress and egress traffic rules at the zone level, unlike security groups, which apply at the instance level.
From the cloud web console, select one of the instances in the zone which is down and go to the subnet_id specified in the config.
Look at the network ACL associated with the subnet; you will see both ingress and egress traffic being denied, which is expected because Kraken deliberately injects it.
Kraken only switches the network ACL and keeps the original (default) network ACL around. Switching back to the default network ACL from the drop-down menu will return the nodes in the targeted zone to the Ready state.
GCP - Debugging steps in case of failures
If a failure occurs during the steps that bring back the cluster nodes in the zone, the nodes in that zone will be in the NotReady condition. Here is how to fix it:
From the GCP web console, select one of the instances in the zone which is down.
Kraken only stops the nodes, so you just have to select the stopped nodes and START them. This will return the nodes in the targeted zone to the Ready state.
How to Use Plugin Name
Add the plugin name to the list of chaos_scenarios section in the config/config.yaml file
kraken:
  kubeconfig_path: ~/.kube/config    # Path to kubeconfig
  ..
  chaos_scenarios:
    - zone_outages_scenarios:
        - scenarios/<scenario_name>.yaml
Note
You can specify multiple scenario files of the same type by adding additional paths to the list:
You can also combine multiple different scenario types in the same config.yaml file. Scenario types can be specified in any order, and you can include the same scenario type multiple times:
kraken:
  chaos_scenarios:
    - zone_outages_scenarios:
        - scenarios/zone-outage.yaml
    - pod_disruption_scenarios:
        - scenarios/pod-kill.yaml
    - node_scenarios:
        - scenarios/node-reboot.yaml
    - zone_outages_scenarios:  # Same type can appear multiple times
        - scenarios/zone-outage-2.yaml
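To see at a glance which scenario files a combined config references, the list can be grouped by type. The following is a minimal Python sketch for inspection only; it says nothing about how Kraken itself iterates the list. The `config` value is the combined config above written as the structure a YAML loader would return, and `files_by_type` is a hypothetical helper.

```python
from collections import defaultdict

# The combined config above, as the Python structure a YAML loader would return.
config = {
    "kraken": {
        "chaos_scenarios": [
            {"zone_outages_scenarios": ["scenarios/zone-outage.yaml"]},
            {"pod_disruption_scenarios": ["scenarios/pod-kill.yaml"]},
            {"node_scenarios": ["scenarios/node-reboot.yaml"]},
            {"zone_outages_scenarios": ["scenarios/zone-outage-2.yaml"]},
        ]
    }
}


def files_by_type(cfg: dict) -> dict[str, list[str]]:
    """Collect scenario files per type, merging repeated types in list order."""
    grouped = defaultdict(list)
    for entry in cfg["kraken"]["chaos_scenarios"]:
        for scenario_type, paths in entry.items():
            grouped[scenario_type].extend(paths)
    return dict(grouped)
```

Here `files_by_type(config)["zone_outages_scenarios"]` contains both zone outage files, which makes a repeated scenario type easy to spot.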
Run
python run_kraken.py --config config/config.yaml
This scenario disrupts a targeted zone in the public cloud by blocking egress and ingress traffic to understand the impact on both the Kubernetes/OpenShift control plane and the applications running on the worker nodes in that zone. More information is documented here.
Run
If enabling Cerberus to monitor the cluster and pass/fail the scenario post chaos, refer to the docs. Make sure to start it before injecting the chaos and set the CERBERUS_ENABLED environment variable for the chaos injection container to autoconnect.
$ podman run \
--name=<container_name> \
--net=host \
--pull=always \
--env-host=true \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:zone-outages
$ podman logs -f <container_name or container_id> # Streams Kraken logs
$ podman inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
Note
--env-host: This option is not available with the remote Podman client, including Mac and Windows (excluding WSL2) machines.
Without the --env-host option you'll have to set each environment variable on the podman command line like -e <VARIABLE>=<value>
$ docker run $(./get_docker_params.sh) \
--name=<container_name> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:zone-outages
$ docker run \
-e <VARIABLE>=<value> \
--name=<container_name> \
--net=host \
--pull=always \
-v <path-to-kube-config>:/home/krkn/.kube/config:Z \
-d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:zone-outages
$ docker logs -f <container_name or container_id> # Streams Kraken logs
$ docker inspect <container-name or container-id> \
--format "{{.State.ExitCode}}" # Outputs the exit code, which can be considered as pass/fail for the scenario
Tip
Because the container runs with a non-root user, ensure the kube config is globally readable before mounting it in the container. You can achieve this with the following commands:
kubectl config view --flatten > ~/kubeconfig && chmod 444 ~/kubeconfig && docker run $(./get_docker_params.sh) --name=<container_name> --net=host --pull=always -v ~/kubeconfig:/home/krkn/.kube/config:Z -d containers.krkn-chaos.dev/krkn-chaos/krkn-hub:<scenario>
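Before mounting the file, it can be worth confirming that the read bit for "others" is actually set, since that is what lets the non-root user in the container read it. The following is a minimal Python sketch using only the standard library; the helper name `is_world_readable` is hypothetical.

```python
import os
import stat


def is_world_readable(path: str) -> bool:
    """True if the 'others' read bit is set, as chmod 444 would ensure."""
    mode = os.stat(os.path.expanduser(path)).st_mode
    return bool(mode & stat.S_IROTH)
```

For example, `is_world_readable("~/kubeconfig")` should return `True` after the `chmod 444` step above.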
Supported parameters
The following environment variables can be set on the host running the container to tweak the scenario/faults being injected:
Example if --env-host is used:
export <parameter_name>=<value>
Or on the command line, for example:
-e <VARIABLE>=<value>
See the list of variables that apply to all scenarios here; they can be set in addition to these scenario-specific variables.
When using a custom metrics profile or alerts profile with CAPTURE_METRICS or ENABLE_ALERTS enabled, mount the profile from the host on which the container runs, using podman/docker, under /home/krkn/kraken/config/metrics-aggregated.yaml and /home/krkn/kraken/config/alerts.