File Identification & Triage

Overview

File identification is the first step in malware analysis: establish what a sample actually is before any deeper work. Malware often disguises its file type (e.g., a PE executable renamed to .pdf), so file extensions cannot be trusted. This file covers file type detection, hashing for IOC generation, fuzzy hashing for similarity analysis, metadata extraction, YARA scanning, and a triage workflow that ties these steps together.

File Type Identification

file Command

The file command identifies file types by examining magic bytes (file signatures), not extensions.

# Identify file type
file sample.exe
# Output: sample.exe: PE32+ executable (GUI) x86-64, for MS Windows, 6 sections

file sample.pdf
# Output: sample.pdf: PE32 executable (GUI) Intel 80386, for MS Windows
# (Malware disguised as PDF)

# Check multiple files
file *

# Show MIME type (GNU/Linux; on macOS use -I instead)
file -i sample.exe
# Output: sample.exe: application/x-dosexec; charset=binary

# Don't follow symlinks
file -h sample

Common Magic Bytes

File Type         Magic Bytes (hex)   ASCII
PE (EXE/DLL)      4D 5A               MZ
ELF (Linux)       7F 45 4C 46         .ELF
PDF               25 50 44 46         %PDF
ZIP/DOCX/XLSX     50 4B 03 04         PK..
Mach-O (macOS)    CF FA ED FE
RAR               52 61 72 21         Rar!
GZIP              1F 8B
Java class        CA FE BA BE

# View raw hex bytes of a file
xxd sample.exe | head -5

# Check just the magic bytes
xxd -l 16 sample.exe
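
The signature check from the table above can also be sketched in a few lines of Python. This is a minimal illustration, not how the file command works internally (it uses a much larger signature database); the names MAGIC_SIGNATURES, identify_magic, and identify_file are hypothetical.

```python
# Signatures copied from the magic-bytes table; labels are illustrative.
# Longer signatures are listed first so the more specific prefix wins.
MAGIC_SIGNATURES = [
    (b"\x7fELF", "ELF"),
    (b"%PDF", "PDF"),
    (b"PK\x03\x04", "ZIP/DOCX/XLSX"),
    (b"\xcf\xfa\xed\xfe", "Mach-O"),
    (b"Rar!", "RAR"),
    (b"\xca\xfe\xba\xbe", "Java class"),
    (b"MZ", "PE (EXE/DLL)"),
    (b"\x1f\x8b", "GZIP"),
]


def identify_magic(header):
    """Return a coarse type label for the leading bytes of a file."""
    for signature, label in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return label
    return "unknown"


def identify_file(path):
    """Read the first 16 bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return identify_magic(handle.read(16))


if __name__ == "__main__":
    # Content that starts with "MZ" is reported as PE, whatever the file is named.
    print(identify_magic(b"MZ\x90\x00\x03\x00\x00\x00"))
```

Because only the content is inspected, a PE renamed to invoice.pdf is still labelled as a PE.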

Detect It Easy (DIE)

DIE provides deeper identification including compiler, linker, packer, and protector detection.

# Detect It Easy
# https://github.com/horsicq/Detect-It-Easy
diec sample.exe

# Deep scan (more thorough analysis)
diec -d sample.exe

# Heuristic scan
diec -u sample.exe

# Output as JSON
diec -j sample.exe

# Recursive scan (for archives)
diec -r archive.zip

Hashing

Cryptographic Hashes

Hash values uniquely identify a sample for IOC sharing, VirusTotal lookups, and tracking across incidents.

# MD5 (fast, widely used for IOCs, not collision-resistant)
md5sum sample.exe

# SHA-1 (legacy, also not collision-resistant)
sha1sum sample.exe

# SHA-256 (preferred for modern IOCs)
sha256sum sample.exe

# All hashes at once
md5sum sample.exe && sha1sum sample.exe && sha256sum sample.exe

# Hash with radare2
# radare2
# https://github.com/radareorg/radare2
rahash2 -a md5,sha1,sha256 sample.exe
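
When hashing needs to be scripted, all three digests can be computed in one pass over the file with Python's standard hashlib module. This is a minimal sketch; the helper names hash_stream and hash_file and the chunk size are assumptions made for illustration.

```python
import hashlib
import io


def hash_stream(stream, chunk_size=65536):
    """Compute MD5, SHA-1, and SHA-256 of a binary stream in a single pass."""
    digests = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    # Read in chunks so large samples are not loaded into memory at once.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        for digest in digests.values():
            digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}


def hash_file(path):
    """Hash a file on disk; returns a dict keyed by algorithm name."""
    with open(path, "rb") as handle:
        return hash_stream(handle)


if __name__ == "__main__":
    # In-memory demo input; use hash_file("sample.exe") for a real sample.
    for name, value in hash_stream(io.BytesIO(b"abc")).items():
        print(f"{name}: {value}")
```

Reading the sample once matters mostly for large files or slow storage; the resulting values are the same as those printed by md5sum, sha1sum, and sha256sum.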

Fuzzy Hashing (ssdeep)

ssdeep computes context-triggered piecewise hashes (CTPH) that can detect similarity between files — useful for identifying malware variants.

# ssdeep
# https://github.com/ssdeep-project/ssdeep

# Generate a fuzzy hash
ssdeep sample.exe

# Directory mode — computes hashes of all provided files and compares each against the others
ssdeep -d sample1.exe sample2.exe

# Generate hashes for multiple files
ssdeep *.exe > hashes.txt

# Compare a file against a hash database
ssdeep -m hashes.txt new_sample.exe

# Pretty matching mode — compare all files in a directory (recursive)
ssdeep -p -r samples/

# Threshold: only show matches above a score (0-100); -r is needed to descend into a directory
ssdeep -t 50 -d -r samples/

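Match scores range from 0 (no detectable similarity) to 100 (identical or near-identical content). Scores above 50 typically indicate related samples such as variants or recompiled versions; treat that threshold as a rule of thumb and confirm any relationship with deeper analysis.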
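
To apply a score threshold after the fact, the output of ssdeep's matching modes can be filtered with a short script. The sketch below assumes each match is printed as "<file A> matches <file B> (<score>)"; verify that against the output of your ssdeep version. The function name parse_matches and the sample file names and scores are hypothetical.

```python
import re

# Assumed line format of ssdeep matching output:
#   "<file A> matches <file B> (<score>)"
MATCH_LINE = re.compile(r"^(?P<left>.+) matches (?P<right>.+) \((?P<score>\d{1,3})\)$")


def parse_matches(output, min_score=50):
    """Return (left, right, score) tuples with score >= min_score, highest first."""
    matches = []
    for line in output.splitlines():
        found = MATCH_LINE.match(line.strip())
        if found and int(found["score"]) >= min_score:
            matches.append((found["left"], found["right"], int(found["score"])))
    return sorted(matches, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    # Made-up output for illustration only.
    sample_output = "a.exe matches b.exe (88)\na.exe matches c.exe (35)\n"
    for left, right, score in parse_matches(sample_output):
        print(f"{score:3d}  {left} ~ {right}")
```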

Import Hashing (imphash)

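The import hash (imphash) is an MD5 digest computed over the names of the libraries and functions listed in a PE file's import table, in the order they appear. Samples built from the same source code with the same toolchain typically share an imphash even when the rest of the file content differs.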

# pefile
# https://github.com/erocarrera/pefile
import pefile

pe = pefile.PE('sample.exe')
print('Imphash:', pe.get_imphash())
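
To make the idea concrete, the sketch below approximates how an imphash is derived from an ordered list of imports: each library name is lowercased and stripped of a .dll, .ocx, or .sys extension, joined to the lowercased function name, and the comma-separated result is hashed with MD5. It is a simplification that handles named imports only (imports by ordinal need extra lookup logic); use pe.get_imphash() for real samples. The function name imphash_from_imports and the example import lists are hypothetical.

```python
import hashlib


def imphash_from_imports(imports):
    """Approximate imphash from ordered (dll_name, function_name) pairs."""
    entries = []
    for dll_name, function_name in imports:
        library = dll_name.lower()
        stem, dot, extension = library.rpartition(".")
        # Drop common PE library extensions so "KERNEL32.dll" becomes "kernel32".
        if dot and extension in ("dll", "ocx", "sys"):
            library = stem
        entries.append(f"{library}.{function_name.lower()}")
    return hashlib.md5(",".join(entries).encode("ascii")).hexdigest()


if __name__ == "__main__":
    first = [("KERNEL32.dll", "CreateFileA"), ("USER32.dll", "MessageBoxA")]
    reordered = list(reversed(first))
    # Import order is part of the hash, so these two values differ.
    print(imphash_from_imports(first))
    print(imphash_from_imports(reordered))
```

Because import order feeds into the digest, relinking with a different import order changes the imphash even when the set of imported functions is identical.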

Metadata Extraction

exiftool

# ExifTool
# https://github.com/exiftool/exiftool

# Extract all metadata
exiftool sample.exe

# Key fields to look for:
# - CompanyName, ProductName — may reveal legitimate software being abused
# - OriginalFileName — the intended filename (often differs from actual)
# - TimeStamp — compilation timestamp (can be faked)
# - FileDescription — embedded description

# Extract metadata from Office documents
exiftool document.docx

# Look for author, creation/modification dates, embedded macros

binwalk

binwalk scans files for embedded file signatures — useful for finding hidden payloads inside documents, images, or firmware.

# binwalk
# https://github.com/ReFirmLabs/binwalk

# Scan for embedded signatures
binwalk sample.bin

# Extract embedded files
binwalk -e sample.bin

# Recursive extraction (unpack nested archives)
binwalk -Me sample.bin

# Search for a raw byte sequence (here the "MZ" bytes of a PE header)
binwalk -R "\x4d\x5a" sample.bin
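
The core idea of a signature scan, looking for known magic bytes at every offset rather than only at the start of the file, can be sketched as follows. This illustrates the concept and is not binwalk's implementation; the names find_signatures and SIGNATURES and the synthetic blob are hypothetical.

```python
def find_signatures(data, signatures):
    """Return sorted (offset, label) pairs for every occurrence of each signature."""
    hits = []
    for label, magic in signatures.items():
        offset = data.find(magic)
        while offset != -1:
            hits.append((offset, label))
            offset = data.find(magic, offset + 1)
    return sorted(hits)


# A small, illustrative signature set.
SIGNATURES = {"PE": b"MZ", "ZIP": b"PK\x03\x04", "PDF": b"%PDF"}

if __name__ == "__main__":
    # Synthetic blob: a PDF header, zero padding, then an embedded "MZ" marker.
    blob = b"%PDF-1.7\n" + b"\x00" * 23 + b"MZ\x90\x00"
    for offset, label in find_signatures(blob, SIGNATURES):
        print(f"0x{offset:08x}  {label}")
```

Short signatures such as the two-byte MZ marker match by chance in large files, so every hit should be treated as a candidate to validate rather than as proof of an embedded executable.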

YARA Rules

YARA rules match patterns in files for malware classification and hunting.

# YARA
# https://github.com/VirusTotal/yara

# Scan a file with a YARA rule
yara rule.yar sample.exe

# Scan with multiple rule files
yara rules/*.yar sample.exe

# Scan recursively
yara -r rule.yar samples/

# Print matching strings
yara -s rule.yar sample.exe

# Print metadata from matching rules
yara -m rule.yar sample.exe

# Fast scan mode
yara -f rule.yar sample.exe

Basic YARA Rule Structure

rule MalwareExample
{
    meta:
        author = "analyst"
        description = "Detects example malware family"
        date = "2026-02-11"

    strings:
        $s1 = "cmd.exe /c" ascii
        $s2 = "powershell" ascii nocase
        $hex1 = { 4D 5A 90 00 03 00 00 00 }
        $url = /https?:\/\/[a-z0-9\-\.]+\.(com|net|org)/ ascii

    condition:
        // Every declared string must be referenced here, or YARA rejects the rule
        uint16(0) == 0x5A4D and
        $hex1 at 0 and
        filesize < 500KB and
        2 of ($s*) and
        $url
}

Key YARA concepts:
- uint16(0) == 0x5A4D: checks for the MZ header at offset 0
- filesize: constrains file size
- N of ($pattern*): requires N matches from a set
- ascii / wide: string encoding
- nocase: case-insensitive matching
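
To show what these condition terms evaluate, the sketch below re-implements three of them in plain Python: the little-endian uint16(0) header check, the filesize limit, and the "2 of ($s*)" count with one case-sensitive and one nocase string. It illustrates the semantics only and does not replace YARA; all function names are hypothetical.

```python
import struct


def uint16_le(data, offset):
    """Little-endian 16-bit read, like YARA's uint16(offset); None if out of range."""
    if offset + 2 > len(data):
        return None
    return struct.unpack_from("<H", data, offset)[0]


def count_matching(data, patterns):
    """Count how many (pattern, nocase) pairs occur at least once in data."""
    lowered = data.lower()
    return sum(
        1
        for pattern, nocase in patterns
        if (pattern.lower() in lowered if nocase else pattern in data)
    )


def matches_example(data):
    """Evaluate the header, size, and string-count checks against raw bytes."""
    strings = [(b"cmd.exe /c", False), (b"powershell", True)]
    return (
        uint16_le(data, 0) == 0x5A4D              # bytes 4D 5A ("MZ") at offset 0
        and len(data) < 500 * 1024                # filesize < 500KB
        and count_matching(data, strings) >= 2    # 2 of ($s*)
    )


if __name__ == "__main__":
    sample = b"MZ" + b"\x00" * 16 + b"cmd.exe /c PowerShell -enc"
    print(matches_example(sample))
```

The bytes 4D 5A read as a little-endian 16-bit integer give 0x5A4D, which is why the constant in the rule appears byte-swapped relative to the magic-bytes table.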

Triage Workflow

Sample received
    │
    ├── file → identify type (PE, ELF, document, script?)
    ├── sha256sum → generate hash → check VirusTotal
    ├── ssdeep → compare to known samples
    ├── exiftool → extract metadata
    ├── diec → identify compiler/packer
    │
    ├── Is it packed? → Yes → proceed to unpacking
    │                → No → proceed to deeper static analysis
    │
    └── Document initial IOCs (hashes, filenames, embedded strings)
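
The workflow above can be sketched as a small script that collects initial IOCs into one record. It is a minimal outline under stated assumptions: the helper names run_tool and triage are hypothetical, the external tools are called only if they are found on PATH, and the -b flags (brief output for file, bare filenames for ssdeep) are additions not used elsewhere in this file. Run it only inside an isolated analysis environment.

```python
import hashlib
import json
import shutil
import subprocess
import sys
from pathlib import Path


def run_tool(args, timeout=60):
    """Run an external tool if it is installed; return its stdout or None."""
    if shutil.which(args[0]) is None:
        return None
    try:
        result = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip()


def triage(path):
    """Collect basic identification data for one sample into a dict."""
    sample = Path(path)
    data = sample.read_bytes()
    return {
        "filename": sample.name,
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "starts_with_mz": data[:2] == b"MZ",
        # External tools are optional; a missing tool yields None.
        "file": run_tool(["file", "-b", str(sample)]),
        "ssdeep": run_tool(["ssdeep", "-b", str(sample)]),
        "diec": run_tool(["diec", str(sample)]),
    }


if __name__ == "__main__":
    if len(sys.argv) > 1:
        print(json.dumps(triage(sys.argv[1]), indent=2))
```

Keeping the record as a plain dict makes it easy to serialize to JSON and attach to case notes; the packed-or-not decision is left to the analyst reading the diec output.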

References

Tools