Model Evasion

Overview

Model evasion crafts inputs that cause an ML model to produce incorrect predictions while appearing unchanged (or minimally changed) to a human observer. The canonical example: adding carefully calculated pixel perturbations to an image so a classifier labels a panda as a gibbon — while the image still looks like a panda to human eyes.

Evasion is a test-time attack — the model's weights are not modified. The attacker manipulates only the input data. This makes evasion attacks relevant wherever ML models make security decisions: malware classifiers, spam filters, intrusion detection systems, fraud detection, and content moderation.

Szegedy et al. (2013) first demonstrated that neural networks are vulnerable to small, often imperceptible perturbations, which they found with an L-BFGS-based optimization. Goodfellow et al. (2014) attributed the phenomenon to the largely linear behavior of models in high-dimensional input spaces and introduced FGSM, a fast single-step attack method.

ATLAS Mapping

Tactic: AML.TA0001 - ML Attack Staging
Tactic: AML.TA0011 - Impact
Technique: AML.T0015 - Evade AI Model
Technique: AML.T0043 - Craft Adversarial Data

Prerequisites

Access to the model's predictions (at minimum, the predicted class or score)
For white-box attacks: access to model weights and architecture (to compute gradients)
For black-box attacks: ability to query the model repeatedly and observe outputs
For transfer attacks: access to a similar model or training data to build a proxy

Attack Taxonomy

Access level	Attack type	Requires
White-box	Gradient-based (FGSM, PGD, C&W)	Full model access
Black-box (score)	Query-based (ZOO, NES, Square Attack)	Prediction scores
Black-box (decision)	Decision-based (Boundary Attack, HopSkipJump)	Only final class label
Transfer	Proxy model attack	Similar model or data

Techniques

FGSM (Fast Gradient Sign Method)

The simplest and fastest gradient-based attack. Computes the gradient of the loss with respect to the input, then adds a perturbation in the direction that maximizes the loss:

x_adv = x + epsilon * sign(gradient_of_loss(x, y))

Where:
- x = original input
- y = true label
- epsilon = perturbation magnitude (controls visibility vs. effectiveness)
- sign() = element-wise sign function

Properties:
- Single-step (one gradient computation), so very fast
- Perturbation bounded by epsilon in L-infinity norm
- Often insufficient against adversarially trained models
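As a minimal sketch of the FGSM update rule, the following pure-Python example attacks a toy logistic-regression model. The weights W, the bias B, the input x, and epsilon = 0.25 are made-up values chosen for illustration, not parameters of any real classifier.

```python
import math

# Hypothetical toy model: logistic regression with fixed, made-up weights.
W = [2.0, -3.0, 1.0]
B = 0.5

def predict_proba(x):
    """Probability of class 1 under the toy logistic model."""
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))

def loss_gradient(x, y):
    """Gradient of binary cross-entropy w.r.t. the input: (p - y) * W."""
    p = predict_proba(x)
    return [(p - y) * w for w in W]

def sign(v):
    """Sign of a scalar: -1, 0, or 1."""
    return (v > 0) - (v < 0)

def fgsm(x, y, epsilon):
    """x_adv = x + epsilon * sign(grad), clipped to the valid range [0, 1]."""
    grad = loss_gradient(x, y)
    return [min(1.0, max(0.0, xi + epsilon * sign(g))) for xi, g in zip(x, grad)]

x = [0.6, 0.3, 0.4]  # clean input whose true label is 1
y = 1
x_adv = fgsm(x, y, epsilon=0.25)

# The clean input is scored above 0.5 (class 1), the adversarial one below 0.5.
print(predict_proba(x), predict_proba(x_adv))
```

Every feature moves by exactly epsilon, which is why the perturbation is bounded in the L-infinity norm rather than in L2.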

PGD (Projected Gradient Descent)

An iterative version of FGSM. Takes multiple smaller gradient steps and projects back onto the epsilon-ball after each step:

x_0 = x + random_perturbation (within epsilon)
for i in 1..N:
    x_i = x_{i-1} + alpha * sign(gradient_of_loss(x_{i-1}, y))
    x_i = clip(x_i, x - epsilon, x + epsilon)
x_adv = x_N

Properties:
- Stronger than FGSM due to iterative refinement
- Considered the "standard" white-box attack for evaluating robustness
- Computational cost scales with number of iterations
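The PGD loop can be sketched in pure Python against the same kind of toy logistic-regression model. The weights, the input, and the values epsilon = 0.25, alpha = 0.05, and 20 steps are assumptions picked for illustration; the extra clip to [0, 1] assumes inputs are scaled to that range.

```python
import math
import random

# Hypothetical toy model: logistic regression with fixed, made-up weights.
W = [2.0, -3.0, 1.0]
B = 0.5

def predict_proba(x):
    """Probability of class 1 under the toy logistic model."""
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))

def loss_gradient(x, y):
    """Gradient of binary cross-entropy w.r.t. the input: (p - y) * W."""
    p = predict_proba(x)
    return [(p - y) * w for w in W]

def sign(v):
    return (v > 0) - (v < 0)

def clip_unit(values):
    """Keep every feature inside the valid input range [0, 1]."""
    return [min(1.0, max(0.0, v)) for v in values]

def pgd(x, y, epsilon, alpha, steps, seed=0):
    rng = random.Random(seed)
    # Random start inside the epsilon-ball around x.
    x_adv = clip_unit([xi + rng.uniform(-epsilon, epsilon) for xi in x])
    for _ in range(steps):
        grad = loss_gradient(x_adv, y)
        # Small signed gradient step that increases the loss.
        x_adv = [v + alpha * sign(g) for v, g in zip(x_adv, grad)]
        # Project back onto the L-infinity ball, then onto the valid range.
        x_adv = [min(xi + epsilon, max(xi - epsilon, v)) for xi, v in zip(x, x_adv)]
        x_adv = clip_unit(x_adv)
    return x_adv

x = [0.6, 0.3, 0.4]  # clean input whose true label is 1
x_adv = pgd(x, y=1, epsilon=0.25, alpha=0.05, steps=20)
print(predict_proba(x), predict_proba(x_adv))
```

On a linear toy model PGD ends up at the same corner of the epsilon-ball as FGSM; the iterative refinement matters for non-linear models, where the gradient direction changes between steps.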

C&W (Carlini & Wagner)

Optimization-based attack that minimizes the perturbation size while ensuring misclassification. Formulated as an optimization problem:

minimize ||delta||_p + c * loss(x + delta, target_class)

Properties:
- Produces smaller perturbations than FGSM/PGD
- Can target specific output classes (targeted attack)
- Slower: requires iterative optimization per sample
- Effective against many defenses that stop FGSM/PGD
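The following is a heavily simplified sketch of the optimization idea, not the full C&W attack: it runs plain gradient descent on ||delta||_2^2 + c * max(logit(x + delta) + kappa, 0) for a toy binary linear model, pushing a class-1 input toward class 0. The full attack additionally uses a tanh change of variables, an adaptive optimizer, and a binary search over c, all omitted here. The weights and the values of c, kappa, lr, and steps are illustrative assumptions.

```python
import math

# Hypothetical toy model: class 1 if logit(x) > 0, otherwise class 0.
W = [2.0, -3.0, 1.0]
B = 0.5

def logit(x):
    return sum(w * xi for w, xi in zip(W, x)) + B

def cw_l2_sketch(x, c=1.0, kappa=0.1, lr=0.005, steps=200):
    """Gradient descent on ||delta||^2 + c * max(logit(x + delta) + kappa, 0)."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        # The hinge term only contributes while the target margin is not met.
        hinge_active = logit(x_adv) + kappa > 0
        grad = [2.0 * di + (c * w if hinge_active else 0.0)
                for di, w in zip(delta, W)]
        delta = [di - lr * g for di, g in zip(delta, grad)]
    return delta

x = [0.6, 0.3, 0.4]  # clean input classified as class 1
delta = cw_l2_sketch(x)
x_adv = [xi + di for xi, di in zip(x, delta)]
l2_norm = math.sqrt(sum(d * d for d in delta))

# The adversarial logit is negative (class 0), and the L2 norm is smaller
# than that of an L-infinity perturbation with epsilon = 0.25 on every feature.
print(logit(x), logit(x_adv), l2_norm)
```

The perturbation concentrates on the features with the largest weights instead of moving every feature by the same amount, which is how the optimization reaches a smaller L2 norm.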

Transfer Attacks

Adversarial examples generated against one model often transfer to different models trained on similar data. This enables black-box attacks:

Train or obtain a proxy model similar to the target
Generate adversarial examples against the proxy (using white-box methods)
Submit the adversarial examples to the target model
The perturbations often transfer — the target misclassifies them too

Transfer works best when:
- Proxy and target share similar architectures
- Both were trained on similar data distributions
- The adversarial perturbation exploits features common to both models
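The transfer steps can be illustrated with two toy linear models. Both weight vectors below are invented: the proxy is deliberately similar to, but not identical with, the target. The attacker computes gradients only on the proxy and uses the target solely to check the result.

```python
import math

# Hypothetical models: the attacker knows PROXY but not TARGET.
PROXY = {"w": [1.8, -2.5, 1.2], "b": 0.4}
TARGET = {"w": [2.0, -3.0, 1.0], "b": 0.5}

def logit(model, x):
    return sum(w * xi for w, xi in zip(model["w"], x)) + model["b"]

def predict(model, x):
    """Hard label, the only thing the attacker observes from the target."""
    return 1 if logit(model, x) > 0 else 0

def sign(v):
    return (v > 0) - (v < 0)

def fgsm_on_proxy(x, y, epsilon):
    """White-box FGSM computed entirely against the proxy model."""
    p = 1.0 / (1.0 + math.exp(-logit(PROXY, x)))
    grad = [(p - y) * w for w in PROXY["w"]]
    return [min(1.0, max(0.0, xi + epsilon * sign(g))) for xi, g in zip(x, grad)]

x = [0.6, 0.3, 0.4]  # clean input whose true label is 1
x_adv = fgsm_on_proxy(x, y=1, epsilon=0.25)

# Submit both inputs to the target and compare the returned labels.
print(predict(TARGET, x), predict(TARGET, x_adv))
```

The example transfers because the two weight vectors agree in sign on every feature, a toy stand-in for models that have learned similar features.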

Security-Relevant Evasion Scenarios

Malware Classification Evasion

ML-based malware detectors (AV engines, EDR solutions) classify files as malicious or benign based on learned features. Evasion techniques:

Feature-space attacks: modify bytes that do not affect execution (for example, unused header fields or section padding in a PE file) to shift the file's feature representation while preserving malicious functionality
Append attacks: add benign content to the end of a malicious file to shift static analysis features
Functionality-preserving mutations: reorder independent instructions, insert NOPs, or change register allocation to evade static signatures and learned features while preserving the program's runtime behavior
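To illustrate why appended content shifts static features, the sketch below uses byte entropy as a single stand-in feature and a made-up threshold of 7.0 bits per byte as a toy detector. Real detectors combine many features, and the "payload" here is just a uniform byte pattern, not an executable.

```python
import math
from collections import Counter

def byte_entropy(data):
    """Shannon entropy of the byte distribution, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def toy_detector(data, threshold=7.0):
    """Hypothetical detector: flags high-entropy (packed-looking) content."""
    return byte_entropy(data) > threshold

# Stand-in for packed content: every byte value appears equally often.
payload = bytes(range(256)) * 4

# Append low-entropy padding; the original bytes are left untouched.
padded = payload + b"\x00" * 2048

print(byte_entropy(payload), toy_detector(payload))
print(byte_entropy(padded), toy_detector(padded))
```

The original bytes are still present at the start of the padded file, which is what makes the change functionality-preserving; only the whole-file statistic moves.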

Network Intrusion Detection Evasion

ML-based IDS/IPS systems classify network traffic as normal or malicious. Evasion modifies traffic features:

Adjust packet timing to change flow statistics
Fragment payloads across packets differently
Modify benign header fields to shift the feature vector
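The effect of timing changes on flow statistics can be sketched with a single feature. The toy detector below is an assumption for illustration: it flags flows whose inter-arrival times are nearly constant (coefficient of variation below 0.1), as periodic beaconing would be. Adding random jitter to the send times moves the flow out of that region.

```python
import random
import statistics

def inter_arrival_times(timestamps):
    """Gaps between consecutive packet timestamps."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]

def coefficient_of_variation(timestamps):
    gaps = inter_arrival_times(timestamps)
    return statistics.pstdev(gaps) / statistics.mean(gaps)

def toy_beacon_detector(timestamps, threshold=0.1):
    """Hypothetical detector: very regular timing looks like beaconing."""
    return coefficient_of_variation(timestamps) < threshold

# One packet per second, perfectly regular.
regular = [float(i) for i in range(51)]

# Same average rate, but each gap is jittered by up to +/- 0.5 seconds.
rng = random.Random(0)
jittered = [0.0]
for _ in range(50):
    jittered.append(jittered[-1] + 1.0 + rng.uniform(-0.5, 0.5))

print(toy_beacon_detector(regular), toy_beacon_detector(jittered))
```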

Spam/Phishing Filter Evasion

Text classifiers for spam detection can be evaded with:

Character substitution (homoglyphs: "Clìck hère")
Invisible Unicode characters between words
Image-based text (bypasses text-only classifiers)
Adversarial token insertions that shift classification without changing human-perceived meaning
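The first two techniques can be demonstrated against a deliberately naive keyword filter. The filter and the phrase it looks for are hypothetical. The normalize helper shows why the evasion works: the text differs from the keyword only by combining marks and invisible format characters. It does not handle cross-script homoglyphs such as Cyrillic letters, which need a separate confusables mapping.

```python
import unicodedata

def naive_keyword_filter(text):
    """Hypothetical filter: flags messages containing an exact phrase."""
    return "click here" in text.lower()

def normalize(text):
    """Decompose characters, then drop combining marks (Mn) and
    invisible format characters (Cf) such as the zero-width space."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed
        if unicodedata.category(ch) not in ("Mn", "Cf")
    )

accented = "Cl\u00ecck h\u00e8re"        # accented look-alike letters
zero_width = "Cl\u200bick he\u200bre"    # zero-width spaces inside words

for message in (accented, zero_width):
    # Raw text slips past the filter; normalized text is caught.
    print(naive_keyword_filter(message), naive_keyword_filter(normalize(message)))
```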

Testing Tools

Adversarial Robustness Toolbox (ART)

ART is IBM's comprehensive library for adversarial ML. It implements attack methods, defenses, and robustness evaluation.

# Adversarial Robustness Toolbox
# https://github.com/Trusted-AI/adversarial-robustness-toolbox
pip install adversarial-robustness-toolbox

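# FGSM attack example
# keras_model (a trained Keras model) and x_test (inputs scaled to [0, 1])
# are placeholders that must be defined beforehand.
from art.attacks.evasion import FastGradientMethod
from art.estimators.classification import KerasClassifier

# Wrap the target model; clip_values is the valid input range
classifier = KerasClassifier(model=keras_model, clip_values=(0, 1))

# Create FGSM attack (eps is the perturbation budget)
attack = FastGradientMethod(estimator=classifier, eps=0.3)

# Generate adversarial examples
x_adv = attack.generate(x=x_test)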

# PGD attack example
from art.attacks.evasion import ProjectedGradientDescent

attack = ProjectedGradientDescent(
    estimator=classifier,
    eps=0.3,
    eps_step=0.01,
    max_iter=40
)
x_adv = attack.generate(x=x_test)

Foolbox

# Foolbox
# https://github.com/bethgelab/foolbox
pip install foolbox

import foolbox as fb

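# Wrap a PyTorch model (pytorch_model, images, and labels are placeholders;
# the model should be in eval mode and images scaled to the given bounds)
fmodel = fb.PyTorchModel(pytorch_model.eval(), bounds=(0, 1))

# Run an L-infinity PGD attack at a single perturbation budget
attack = fb.attacks.PGD()
_, advs, success = attack(fmodel, images, labels, epsilons=[0.03])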

Detection Methods

Input preprocessing: apply transformations (JPEG compression, spatial smoothing, bit-depth reduction) that tend to destroy small adversarial perturbations while preserving clean image content, then check whether the model's output changes
Ensemble detection: run the input through multiple models; adversarial examples that fool one model may not fool others, so disagreement between models is a warning sign
Statistical detection: compare the model's predictions on the original input and on a processed copy (e.g., feature squeezing) and flag inputs whose predictions differ by more than a threshold
Gradient-magnitude monitoring: flag inputs for which the gradient of the loss with respect to the input is unusually large (not to be confused with gradient masking, a defense artifact that hides useful gradients from the attacker)
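A feature-squeezing style check can be sketched as follows. The 16-feature linear model, the 1-bit squeezing, and the 0.3 threshold are all assumptions chosen so the effect is easy to see; in practice the squeezer and the threshold would be tuned on clean validation data, and clean inputs rarely sit exactly on the squeezed grid.

```python
import math

# Hypothetical toy model over 16 features scaled to [0, 1].
W = [1.0, -1.0] * 8
B = -4.0

def predict_proba(x):
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))

def squeeze_bit_depth(x, bits=1):
    """Quantize every feature to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return [round(v * levels) / levels for v in x]

def is_suspicious(x, threshold=0.3):
    """Flag inputs whose prediction changes a lot after squeezing."""
    gap = abs(predict_proba(x) - predict_proba(squeeze_bit_depth(x)))
    return gap > threshold

x_clean = [1.0, 0.0] * 8
# FGSM-style perturbation with epsilon = 0.3: each feature moves against
# the sign of its weight, which lowers the class-1 score.
x_adv = [v - 0.3 * w for v, w in zip(x_clean, W)]

print(predict_proba(x_clean), is_suspicious(x_clean))
print(predict_proba(x_adv), is_suspicious(x_adv))
```

Squeezing maps the perturbed features back to their clean values, so the prediction on the squeezed copy disagrees sharply with the prediction on the raw adversarial input.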

Mitigation Strategies

Adversarial training: include adversarial examples in the training set (generally the most effective empirical defense, but it increases training cost and may reduce clean accuracy)
Certified defenses: randomized smoothing and other provable methods that guarantee robustness within a bounded perturbation region
Input preprocessing: apply non-differentiable transformations to break gradient-based attacks (limited effectiveness against adaptive attackers, who can approximate the gradient through the transformation)
Model ensembles: combine predictions from diverse models to reduce transferability
Defense in depth: combine ML with rule-based detection; don't rely solely on a single ML classifier for security decisions
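The prediction step of randomized smoothing can be sketched as a majority vote over Gaussian-noised copies of the input. The base model, sigma = 0.25, and the sample count are illustrative assumptions. A real certified defense additionally converts a statistical lower bound on the top-class vote share into a guaranteed L2 radius; that certification step is omitted here.

```python
import random

# Hypothetical base classifier: a fixed linear model.
W = [2.0, -3.0, 1.0]
B = 0.5

def base_classify(x):
    return 1 if sum(w * xi for w, xi in zip(W, x)) + B > 0 else 0

def smoothed_classify(x, sigma=0.25, n_samples=1000, seed=0):
    """Majority vote of the base classifier under Gaussian input noise.

    Returns the winning class and its share of the votes.
    """
    rng = random.Random(seed)
    votes = [0, 0]
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[base_classify(noisy)] += 1
    top = 0 if votes[0] > votes[1] else 1
    return top, votes[top] / n_samples

x = [0.6, 0.3, 0.4]
label, share = smoothed_classify(x)
print(label, share)
```

A large vote share means the prediction is stable under noise around x, and that stability is the quantity certification procedures build on.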