File Identification & Triage
Overview
File identification is the first step in malware analysis: determining what
a sample actually is before any deeper work begins. Malware often disguises
its file type (e.g., a PE executable renamed to .pdf), so file extensions
cannot be trusted. This file covers file type detection, hashing for IOC
generation, fuzzy hashing for similarity analysis, metadata extraction, YARA
scanning, and a triage workflow.
File Type Identification
file Command
The file command identifies file types by examining magic bytes (file
signatures), not extensions.
# Identify file type
file sample.exe
# Output: sample.exe: PE32+ executable (GUI) x86-64, for MS Windows, 6 sections
file sample.pdf
# Output: sample.pdf: PE32 executable (GUI) Intel 80386, for MS Windows
# (Malware disguised as PDF)
# Check multiple files
file *
# Show MIME type
file -i sample.exe
# Output: sample.exe: application/x-dosexec; charset=binary
# (recent releases of file report application/vnd.microsoft.portable-executable instead)
# Don't follow symlinks
file -h sample
Common Magic Bytes
| File Type | Magic Bytes (hex) | ASCII |
|---|---|---|
| PE (EXE/DLL) | 4D 5A | MZ |
| ELF (Linux) | 7F 45 4C 46 | .ELF |
| PDF | 25 50 44 46 | %PDF |
| ZIP/DOCX/XLSX | 50 4B 03 04 | PK.. |
| Mach-O (macOS) | CF FA ED FE | (non-printable) |
| RAR | 52 61 72 21 | Rar! |
| GZIP | 1F 8B | (non-printable) |
| Java class | CA FE BA BE | (non-printable) |
# View raw hex bytes of a file
xxd sample.exe | head -5
# Check just the magic bytes
xxd -l 16 sample.exe
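The magic-byte check can also be scripted. The following is a minimal sketch, not a replacement for file: it only knows the signatures listed in the table above, and the names MAGIC_SIGNATURES, identify_magic, and identify_file are illustrative.

```python
# Signatures copied from the magic-byte table in this section.
# Longer signatures come first so a short prefix cannot shadow a longer one.
MAGIC_SIGNATURES = [
    (bytes.fromhex("7F454C46"), "ELF"),
    (bytes.fromhex("25504446"), "PDF"),
    (bytes.fromhex("504B0304"), "ZIP/DOCX/XLSX"),
    (bytes.fromhex("CFFAEDFE"), "Mach-O"),
    (bytes.fromhex("52617221"), "RAR"),
    (bytes.fromhex("CAFEBABE"), "Java class"),
    (bytes.fromhex("4D5A"), "PE (EXE/DLL)"),
    (bytes.fromhex("1F8B"), "GZIP"),
]


def identify_magic(header: bytes) -> str:
    """Return the table label whose signature the header starts with."""
    for signature, label in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return label
    return "unknown"


def identify_file(path: str) -> str:
    """Read the first 16 bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return identify_magic(handle.read(16))
```

Calling identify_file("sample.pdf") on a renamed executable would return "PE (EXE/DLL)", since only the content is inspected and the extension is ignored.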
Detect It Easy (DIE)
DIE provides deeper identification including compiler, linker, packer, and protector detection.
# Detect It Easy
# https://github.com/horsicq/Detect-It-Easy
diec sample.exe
# Deep scan (more thorough analysis)
diec -d sample.exe
# Heuristic scan
diec -u sample.exe
# Output as JSON
diec -j sample.exe
# Recursive scan (for archives)
diec -r archive.zip
Hashing
Cryptographic Hashes
A cryptographic hash gives each sample a stable identifier for IOC sharing, VirusTotal lookups, and tracking across incidents. Changing even a single byte of the file produces a different hash.
# MD5 (fast, widely used for IOCs, not collision-resistant)
md5sum sample.exe
# SHA-1 (still common in older IOC feeds, also not collision-resistant)
sha1sum sample.exe
# SHA-256 (preferred for modern IOCs)
sha256sum sample.exe
# All hashes at once
md5sum sample.exe && sha1sum sample.exe && sha256sum sample.exe
# Hash with radare2
# radare2
# https://github.com/radareorg/radare2
rahash2 -a md5,sha1,sha256 sample.exe
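When hashes need to be collected from a script rather than the shell, Python's standard hashlib module can compute all three digests in a single pass over the file. This is a minimal sketch; the function name hash_file and the 1 MiB chunk size are arbitrary choices.

```python
import hashlib


def hash_file(path: str, chunk_size: int = 1 << 20) -> dict[str, str]:
    """Compute MD5, SHA-1, and SHA-256 of a file while reading it once."""
    digests = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    with open(path, "rb") as handle:
        # Read in chunks so large samples are never loaded fully into memory.
        while chunk := handle.read(chunk_size):
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}
```

The returned dictionary maps each algorithm name to its hex digest, which can be written straight into an IOC record.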
Fuzzy Hashing (ssdeep)
ssdeep computes context-triggered piecewise hashes (CTPH) that can detect similarity between files — useful for identifying malware variants.
# ssdeep
# https://github.com/ssdeep-project/ssdeep
# Generate a fuzzy hash
ssdeep sample.exe
# Directory mode — computes hashes of all provided files and compares each against the others
ssdeep -d sample1.exe sample2.exe
# Generate hashes for multiple files
ssdeep *.exe > hashes.txt
# Compare a file against a hash database
ssdeep -m hashes.txt new_sample.exe
# Pretty matching mode — compare all files in a directory (recursive)
ssdeep -p -r samples/
# Threshold: only show matches at or above a score (0-100); -r is needed to descend into a directory
ssdeep -r -t 50 -d samples/
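Scores range from 0 (no detectable similarity) to 100 (identical or nearly identical content). As a rule of thumb, scores above 50 suggest related samples (variants, recompiled versions), but the cutoff is a heuristic and is best tuned against samples whose relationship is already known.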
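A file of saved hashes such as hashes.txt can be inspected without the ssdeep binary. The sketch below parses the blocksize:hash:hash,"filename" lines that ssdeep writes. It then applies the precondition ssdeep uses before scoring: two signatures are only compared when their block sizes are equal or differ by a factor of two, which is why files of very different sizes score 0. The names FuzzyHash, parse_ssdeep_line, load_hashes, and comparable are illustrative, and the code does not compute similarity scores.

```python
from typing import Iterable, NamedTuple


class FuzzyHash(NamedTuple):
    block_size: int
    chunk: str
    double_chunk: str
    filename: str


def parse_ssdeep_line(line: str) -> FuzzyHash:
    """Parse one 'blocksize:hash:hash,"filename"' line of ssdeep output."""
    signature, _, filename = line.strip().partition(",")
    block_size, chunk, double_chunk = signature.split(":", 2)
    return FuzzyHash(int(block_size), chunk, double_chunk, filename.strip('"'))


def load_hashes(lines: Iterable[str]) -> list[FuzzyHash]:
    """Parse all signature lines, skipping blanks and the 'ssdeep,...' header."""
    return [
        parse_ssdeep_line(line)
        for line in lines
        if line.strip() and not line.startswith("ssdeep,")
    ]


def comparable(a: FuzzyHash, b: FuzzyHash) -> bool:
    """True when the block sizes are equal or one is double the other."""
    return (
        a.block_size == b.block_size
        or a.block_size == 2 * b.block_size
        or b.block_size == 2 * a.block_size
    )
```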
Import Hashing (imphash)
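The import hash (imphash) is an MD5 digest computed over the library and function names in a PE file's import table, in the order they appear. Samples built from the same source code with the same toolchain typically share an imphash even when the rest of the file differs. Packed samples often expose only the packer's small import table, so a shared imphash can also point to a common packer rather than a common codebase.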
# pefile
# https://github.com/erocarrera/pefile
import pefile
pe = pefile.PE('sample.exe')
print('Imphash:', pe.get_imphash())
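To make the computation concrete, here is a simplified sketch of how an imphash is derived from a list of (library, function) pairs. It assumes every import is by name. Real implementations such as pefile also handle imports by ordinal, so for actual samples pe.get_imphash() remains the right call. The function name imphash_from_imports is illustrative.

```python
import hashlib


def imphash_from_imports(imports: list[tuple[str, str]]) -> str:
    """MD5 over 'lib.func' entries, lowercased and kept in import-table order."""
    entries = []
    for library, function in imports:
        lib = library.lower()
        # Common library extensions are dropped before hashing.
        for extension in (".dll", ".ocx", ".sys"):
            if lib.endswith(extension):
                lib = lib[: -len(extension)]
                break
        entries.append(f"{lib}.{function.lower()}")
    return hashlib.md5(",".join(entries).encode("ascii")).hexdigest()
```

Because the entries are joined in import-table order, two binaries importing the same functions in a different order produce different imphashes.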
Metadata Extraction
exiftool
# ExifTool
# https://github.com/exiftool/exiftool
# Extract all metadata
exiftool sample.exe
# Key fields to look for:
# - CompanyName, ProductName — may reveal legitimate software being abused
# - OriginalFileName — the intended filename (often differs from actual)
# - TimeStamp — compilation timestamp (can be faked)
# - FileDescription — embedded description
# Extract metadata from Office documents
exiftool document.docx
# Look for author, creation/modification dates, embedded macros
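The compilation timestamp that exiftool reports for a PE file comes from the TimeDateStamp field of the COFF header. As a sketch of where that value lives, the function below reads it directly from the raw bytes. The name pe_compile_time is illustrative, and like the exiftool value, the result is only as trustworthy as the sample: the field can be forged.

```python
import struct
from datetime import datetime, timezone


def pe_compile_time(data: bytes) -> datetime:
    """Return the COFF TimeDateStamp of a PE image as a UTC datetime."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MZ header")
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    if len(data) < e_lfanew + 12:
        raise ValueError("truncated COFF header")
    # After the 4-byte signature: Machine (2 bytes), NumberOfSections (2 bytes),
    # then the 4-byte TimeDateStamp (seconds since the Unix epoch).
    (timestamp,) = struct.unpack_from("<I", data, e_lfanew + 8)
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)
```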
binwalk
binwalk scans files for embedded file signatures — useful for finding hidden payloads inside documents, images, or firmware.
# binwalk
# https://github.com/ReFirmLabs/binwalk
# Scan for embedded signatures
binwalk sample.bin
# Extract embedded files
binwalk -e sample.bin
# Recursive extraction (unpack nested archives)
binwalk -Me sample.bin
# Search for a raw byte sequence (here 4D 5A, the MZ header)
binwalk -R "\x4d\x5a" sample.bin
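A raw search for 4D 5A produces many false positives, because those two bytes occur by chance in most large files. One way to narrow the hits, sketched below, is to keep only the offsets where the e_lfanew field of the candidate header points at a valid PE signature. The function name find_embedded_pe is illustrative, and this is a rough filter rather than a full PE validator.

```python
import struct


def find_embedded_pe(data: bytes) -> list[int]:
    """Return offsets of 'MZ' headers whose e_lfanew points at 'PE\\0\\0'."""
    offsets = []
    start = data.find(b"MZ")
    while start != -1:
        lfanew_pos = start + 0x3C
        if lfanew_pos + 4 <= len(data):
            (e_lfanew,) = struct.unpack_from("<I", data, lfanew_pos)
            pe_pos = start + e_lfanew
            # An out-of-range slice is empty, so bogus offsets simply fail here.
            if data[pe_pos:pe_pos + 4] == b"PE\x00\x00":
                offsets.append(start)
        start = data.find(b"MZ", start + 1)
    return offsets
```

An offset of 0 means the file itself is a PE. Any other offset suggests an executable embedded inside another container, such as a document or image.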
YARA Rules
YARA rules match patterns in files for malware classification and hunting.
# YARA
# https://github.com/VirusTotal/yara
# Scan a file with a YARA rule
yara rule.yar sample.exe
# Scan with multiple rule files
yara rules/*.yar sample.exe
# Scan recursively
yara -r rule.yar samples/
# Print matching strings
yara -s rule.yar sample.exe
# Print metadata from matching rules
yara -m rule.yar sample.exe
# Fast scan mode
yara -f rule.yar sample.exe
Basic YARA Rule Structure
rule MalwareExample
{
    meta:
        author = "analyst"
        description = "Detects example malware family"
        date = "2026-02-11"
    strings:
        $s1 = "cmd.exe /c" ascii
        $s2 = "powershell" ascii nocase
        $hex1 = { 4D 5A 90 00 03 00 00 00 }
        $url = /https?:\/\/[a-z0-9\-\.]+\.(com|net|org)/ ascii
    condition:
        // Every declared string must be referenced, or YARA rejects the rule
        // with an "unreferenced string" error.
        uint16(0) == 0x5A4D and
        filesize < 500KB and
        2 of ($s*) and
        ($hex1 at 0 or $url)
}
Key YARA concepts:
- uint16(0) == 0x5A4D — checks for MZ header at offset 0
- filesize — constrains file size
- N of ($pattern*) — requires N matches from a set
- ascii / wide — string encoding
- nocase — case-insensitive matching
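To show what these concepts mean at the byte level, the sketch below re-implements three clauses of the example rule's condition in plain Python: the MZ check, the size limit, and 2 of ($s*). It is an illustration only, does not use the YARA engine, and the function name matches_example_rule is hypothetical.

```python
import struct


def matches_example_rule(data: bytes) -> bool:
    """Approximate the MZ, filesize, and '2 of ($s*)' clauses of the rule."""
    if len(data) < 2:
        return False
    # uint16(0) reads a little-endian 16-bit integer at offset 0, so the
    # bytes 4D 5A ("MZ") are read as the integer 0x5A4D.
    (first_word,) = struct.unpack_from("<H", data, 0)
    is_mz = first_word == 0x5A4D
    small_enough = len(data) < 500 * 1024  # filesize < 500KB
    s1 = b"cmd.exe /c" in data             # $s1: ascii, case-sensitive
    s2 = b"powershell" in data.lower()     # $s2: ascii nocase
    # "2 of ($s*)" counts distinct strings that matched, not occurrences.
    return is_mz and small_enough and (int(s1) + int(s2)) >= 2
```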
Triage Workflow
Sample received
│
├── file → identify type (PE, ELF, document, script?)
├── sha256sum → generate hash → check VirusTotal
├── ssdeep → compare to known samples
├── exiftool → extract metadata
├── diec → identify compiler/packer
│
├── Is it packed? → Yes → proceed to unpacking
│                 → No  → proceed to deeper static analysis
│
└── Document initial IOCs (hashes, filenames, embedded strings)
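The first steps of this workflow can be chained into one script. The sketch below computes the SHA-256 in Python and then runs file, ssdeep, exiftool, and diec with their default arguments if they are installed, collecting the raw output for the analyst to review. The function name triage, the 60-second timeout, and the report layout are assumptions. VirusTotal lookups and the packed/unpacked decision are left to the analyst.

```python
import hashlib
import os
import shutil
import subprocess

# External tools from the workflow, each run with its default arguments.
TOOLS = ["file", "ssdeep", "exiftool", "diec"]


def triage(path: str) -> dict[str, str]:
    """Collect a SHA-256 and the raw output of each available triage tool."""
    # An absolute path cannot be mistaken for a command-line option.
    path = os.path.abspath(path)
    with open(path, "rb") as handle:
        report = {"sha256": hashlib.sha256(handle.read()).hexdigest()}
    for tool in TOOLS:
        if shutil.which(tool) is None:
            report[tool] = "(tool not installed)"
            continue
        try:
            result = subprocess.run(
                [tool, path],
                capture_output=True,
                text=True,
                errors="replace",  # tool output may not be valid UTF-8
                timeout=60,
            )
            report[tool] = result.stdout.strip()
        except subprocess.TimeoutExpired:
            report[tool] = "(timed out)"
    return report
```

Running triage("sample.exe") yields one dictionary entry per step, which can be saved alongside the sample as the initial IOC record.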