Assembly for Malware Analysts: x86, x64 & ARM in Ghidra and Binary Ninja

Assembly is the language between high-level code and raw machine bytes. When you open a sample in Ghidra or Binary Ninja, what you see is the disassembler’s reconstruction of that language. This guide teaches you to read it fluently — not just to understand each instruction, but to recognize the patterns that malware authors use: API resolvers, XOR decryptors, persistence loops, and anti-debug tricks.

Coverage spans x86 (32-bit), x64 (64-bit), and ARM/AArch64 (embedded, mobile, and modern Windows on Arm targets).

Related posts in this blog: Understanding and Attacking EDRs, EDR Bypass Roadmap, Anti-Debugging Techniques, Windows API Attack Surface

Why Assembly for Malware Analysis?

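Modern malware arrives stripped of symbols, packed, and obfuscated. Decompilers help, but they can mislead: they reconstruct intent from behavior, and when the behavior is adversarial, the reconstruction drifts. Raw disassembly sits much closer to the truth, because it is a near one-to-one rendering of the bytes the CPU executes. It is not infallible (anti-disassembly tricks such as overlapping instructions or junk bytes can desynchronize the listing), but it leaves far less room to be fooled.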

Specifically, assembly literacy lets you:

  • Identify API calls even when the import table is empty (PEB walking, GetProcAddress chains)
  • Recognize crypto primitives by their bitwise patterns (XOR loops, XTEA key schedules, Salsa20 quarter-rounds)
  • Spot anti-analysis tricks before they fire (RDTSC timing, IsDebuggerPresent checks, NtQueryInformationProcess calls)
  • Understand shellcode that decompilers handle poorly: it has no PE header, no sections, no symbols

Getting Started — What to Expect

Learning assembly for the first time feels like having the rug pulled out: no types, no function names, no meaningful variable names — everything is registers, offsets, and flags. The cognitive load is real, but it drops fast once the patterns click.

The Mental Model Shift

In C you write x = a + b. In assembly you first load a into a register, add b to it, and the result sits in the same register. The instruction stream is completely flat — there is no notion of scope, type, or lifetime beyond what the calling convention imposes.

The most important shift: think in state, not abstractions. At any point in a function you can ask: what is in EAX right now? What does [EBP-8] hold? Where did ESP go? Building this running state machine in your head is the core skill the job requires.

What Is Actually Hard

  • Registers carry context that changes line-by-line. A register can hold a loop counter on one line and a pointer on the next. There is no IDE tooltip to tell you which it is right now.
  • Flags are invisible shared state. CMP EAX, EBX sets flags, and then ten instructions later a JL reads them. Other instructions between the compare and the branch can also modify flags — beginners miss this constantly.
  • Obfuscation looks syntactically identical to normal code. A dead XOR, a fake loop, a JMP to the very next instruction — nothing in the syntax signals “this is junk.”
  • Calling conventions are implicit. Nothing in the binary says “this is cdecl.” You have to infer it from how the caller prepares arguments and how the callee tears down.
  • Pointer arithmetic and integer arithmetic are indistinguishable. ADD EAX, 4 could be advancing a pointer by one int or incrementing a counter by four. Only context tells you which.

What Clicks Surprisingly Quickly

  • Most of the code you read is built from about 20 mnemonics: MOV, PUSH/POP, CMP/TEST, JE/JNE/JL/JG, CALL/RET, XOR/AND/OR, ADD/SUB, LEA, INC/DEC. Master these and, as a rule of thumb, you can read around 80 % of what you will encounter.
  • Prologues and epilogues are boilerplate. After a few sessions you will recognise push ebp / mov ebp, esp / sub esp, N in under a second and jump straight to the logic that follows.
  • CFG loops are always the same shape. A back-edge in the control-flow graph is a loop — full stop. Train your eye on the graph view and you stop reading instructions linearly and start reading structure.
  • XOR decryptors look identical everywhere. Load byte, XOR, store byte, increment counter, compare to length, branch back. Once you recognise the shape you will spot it in any binary within seconds.
  • The PEB walk is copy-pasted across malware families. FS:[0x30] (x86) or GS:[0x60] (x64) followed by three or four chained dereferences is the same code in hundreds of samples.
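
The XOR decryptor shape from the list above (load byte, XOR, store byte, increment counter, compare to length, branch back) can be modelled in a few lines of Python. This is a minimal sketch, not code from any specific sample: the function name, the single-byte key 0x5A, and the test string are illustrative assumptions.

```python
def xor_decrypt(buf: bytes, key: int) -> bytes:
    """Single-byte XOR loop, written step by step to mirror the assembly shape."""
    out = bytearray(buf)      # working copy of the encrypted buffer
    i = 0                     # the counter register (often ECX)
    while i < len(out):       # cmp counter, length / jb loop_start
        b = out[i]            # load byte
        b ^= key & 0xFF       # xor with the key
        out[i] = b            # store byte
        i += 1                # inc counter
    return bytes(out)


# Hypothetical data: "encrypt" a string with key 0x5A, then decrypt it again.
encrypted = bytes(c ^ 0x5A for c in b"kernel32.dll")
print(xor_decrypt(encrypted, 0x5A))   # b'kernel32.dll'
```

Because XOR is its own inverse, the same loop both encrypts and decrypts, which is why a single routine usually serves both purposes in a sample.
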

| Stage | Focus | Suggested exercise |
|---|---|---|
| 1. Foundation | x86 registers, the stack, MOV, PUSH/POP, CALL/RET | Hand-trace a cdecl “hello world” step-by-step in Ghidra’s listing view |
| 2. Control flow | CMP, TEST, Jcc, loops, switch-jump tables | Find a counted loop in any open-source binary; label the counter, body, and exit |
| 3. Conventions | cdecl vs stdcall vs x64 ABI; argument location rules | Identify argument-passing in five Win32 API calls (CreateFile, VirtualAlloc, etc.) |
| 4. Patterns | XOR decryptors, PEB walks, anti-debug idioms | Analyse a CTF reversing challenge from pwn.college or crackmes.one |
| 5. x64 | Shadow space, RIP-relative addressing, R8–R15 | Repeat stages 1–4 on a 64-bit Windows binary |
| 6. ARM | RISC philosophy, conditional execution suffix, Thumb | Analyse a simple Android .so from an open-source APK |

Tools to Have Ready Before You Open a Sample

| Tool | Purpose | Free? |
|---|---|---|
| Ghidra (NSA) | Full disassembler + decompiler; the best free starting point | Yes |
| Binary Ninja | Fast UI, excellent MLIL/HLIL layers, great scripting API | Trial / paid |
| x64dbg | Dynamic debugger for Windows x86/x64; pairs with Ghidra for static+dynamic | Yes |
| PE-bear | PE header inspector: understand the binary’s imports and sections before loading it | Yes |
| CFF Explorer | Import table, overlay, and resource inspector | Yes |
| FLOSS (Mandiant) | Extracts obfuscated and stack-built strings without executing the binary | Yes |
| Detect-It-Easy | Packer and compiler fingerprinting: tells you what unpacking you need first | Yes |

Beginner trap to avoid: Do not start dynamic analysis (running the sample in a debugger) before you have done at least a pass of static analysis (Ghidra/Binary Ninja). Dynamic analysis is powerful but dangerous — malware can detect the debugger and feed you a decoy execution path. Static first, dynamic second.


VSCode Setup for Assembly Practice

Reading assembly in a disassembler is one skill; writing it to build intuition is another. VSCode with NASM gives you a lightweight environment to experiment with snippets without spinning up a full VM.

Essential Extensions

Install these four extensions from the VSCode Marketplace (Ctrl+Shift+X):

| Extension ID | What it does |
|---|---|
| 13xforever.language-x86-64-assembly | Syntax highlighting for x86/x64 NASM, MASM, GAS, and AT&T syntax |
| OrangeX4.vscode-masm-run | Adds run/build buttons for MASM/NASM files directly in the editor |
| usernamehw.errorlens | Inline error display; useful when nasm outputs errors with line numbers |
| streetsidesoftware.code-spell-checker | Optional but saves you from typo-driven bugs in label names |

Install all four in one shot from the terminal:

code --install-extension 13xforever.language-x86-64-assembly
code --install-extension OrangeX4.vscode-masm-run
code --install-extension usernamehw.errorlens
code --install-extension streetsidesoftware.code-spell-checker

Installing NASM

Windows:

  1. Download the NASM installer from nasm.us — pick the latest win64 .exe
  2. Run the installer; tick “Add to PATH”
  3. Verify in a new terminal: nasm --version

You also need a linker. The easiest option on Windows is to install the free GoLink linker or use the MinGW ld that ships with Git for Windows:

# Check both are on PATH
nasm --version   # e.g. NASM version 2.16.x
ld   --version   # GNU ld (part of MinGW / binutils)

Linux / WSL:

sudo apt install nasm build-essential   # Debian / Ubuntu
sudo dnf install nasm gcc               # Fedora / RHEL

Your First Assembly File

Create hello.asm and paste this x64 Linux snippet (works in WSL):

; hello.asm — x64 Linux, NASM syntax
; Assemble: nasm -f elf64 hello.asm && ld -o hello hello.o && ./hello

section .data
    msg  db "hello, asm", 10   ; 10 = newline
    len  equ $ - msg

section .text
    global _start

_start:
    mov rax, 1          ; syscall: write
    mov rdi, 1          ; fd: stdout
    mov rsi, msg        ; buffer address
    mov rdx, len        ; byte count
    syscall

    mov rax, 60         ; syscall: exit
    xor rdi, rdi        ; status: 0
    syscall

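For Windows (x64, NASM syntax, calling the Windows API directly), create hello_win.asm. The link command in the header comment is the MSVC linker, run from a Visual Studio developer prompt; with GoLink or MinGW ld the command line differs: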

; hello_win.asm — x64 Windows, NASM syntax, links against kernel32
; Assemble+link:
;   nasm -f win64 hello_win.asm -o hello_win.obj
;   link /subsystem:console /entry:main hello_win.obj kernel32.lib

extern  ExitProcess
extern  GetStdHandle
extern  WriteConsoleA

section .data
    msg     db "hello, asm", 13, 10
    msglen  equ $ - msg
    written dq 0

section .text
    global main

main:
    sub     rsp, 40                 ; shadow space + alignment

    mov     rcx, -11                ; STD_OUTPUT_HANDLE
    call    GetStdHandle
    mov     rcx, rax                ; hConsole

    lea     rdx, [rel msg]          ; lpBuffer
    mov     r8d, msglen             ; nNumberOfCharsToWrite
    lea     r9,  [rel written]      ; lpNumberOfCharsWritten
    mov     qword [rsp+32], 0       ; lpReserved (5th arg, stored just above the 32-byte shadow space)
    call    WriteConsoleA

    xor     rcx, rcx
    call    ExitProcess

Tip for analysts: The Windows snippet demonstrates the x64 Microsoft ABI in action — shadow space, register arguments in RCX/RDX/R8/R9, and a stack-passed fifth argument. It is more instructive than the Linux version if your target is Windows malware.
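
To make the argument rule concrete, here is a small Python sketch that maps an argument number to where the caller places it under the Microsoft x64 convention: the first four integer or pointer arguments go in RCX, RDX, R8, R9, and later ones sit on the stack above the 32-byte shadow space, as seen at the moment of the CALL. The function name is hypothetical, and floating-point arguments (which use XMM0–XMM3) are deliberately left out.

```python
REG_ARGS = ["RCX", "RDX", "R8", "R9"]


def win64_arg_location(n: int) -> str:
    """Location of integer/pointer argument n (1-based) just before the CALL."""
    if n < 1:
        raise ValueError("argument index is 1-based")
    if n <= 4:
        return REG_ARGS[n - 1]
    # Shadow space occupies [RSP+0x00..0x1F]; stack arguments start right above it.
    offset = 0x20 + 8 * (n - 5)
    return f"[RSP+0x{offset:X}]"


for n in range(1, 7):
    print(n, win64_arg_location(n))
```

When a call site stores a value to [RSP+0x20] shortly before a CALL, that store is very likely the fifth argument.
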

Build Task (tasks.json)

Create .vscode/tasks.json in your project root so Ctrl+Shift+B assembles and links automatically:

{
  "version": "2.0.0",
  "tasks": [
    {
      "label": "NASM — build (Linux/WSL elf64)",
      "type": "shell",
      "command": "nasm -f elf64 ${file} -o ${fileDirname}/${fileBasenameNoExtension}.o && ld -o ${fileDirname}/${fileBasenameNoExtension} ${fileDirname}/${fileBasenameNoExtension}.o",
      "group": { "kind": "build", "isDefault": true },
      "presentation": { "reveal": "always", "panel": "shared" },
      "problemMatcher": {
        "owner": "nasm",
        "fileLocation": ["absolute"],
        "pattern": {
          "regexp": "^(.+):(\\d+):\\s+(.+)$",
          "file": 1, "line": 2, "message": 3
        }
      }
    },
    {
      "label": "Run assembled binary",
      "type": "shell",
      "command": "${fileDirname}/${fileBasenameNoExtension}",
      "group": "test",
      "dependsOn": "NASM — build (Linux/WSL elf64)",
      "presentation": { "reveal": "always", "panel": "shared" }
    }
  ]
}

After saving, press Ctrl+Shift+B while any .asm file is active to assemble it. NASM errors appear inline in the editor via ErrorLens.

Debugging with x64dbg

x64dbg is the go-to Windows debugger for malware analysis and also the best way to step through your hand-written assembly:

  1. Download x64dbg and extract it — no install needed
  2. Right-click the .exe your NASM build produced → Open with x64dbg
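  3. By default x64dbg first pauses at the system breakpoint; press F9 (Run) once to reach the entry-point breakpoint (main in the Windows example; the Linux _start build is an ELF and will not load in x64dbg)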
  4. Use F7 (step into) and F8 (step over) to trace execution
  5. Watch the Registers panel on the right — every instruction updates it live

Workflow for learning: Write a small snippet in VSCode, build it, open the output in x64dbg, and step through it. Watching RSP change on every PUSH/POP and seeing RAX set to your expected value after a calculation is the fastest way to build register intuition.

VSCode + x64dbg shortcut: Add an x64dbg open task to tasks.json so pressing a keybinding launches the debugger directly on the built binary, saving the manual drag-and-drop step.


Assembly 101 — How to Read Assembler Code

Before drilling into registers and instructions, you need to parse the notation. This section teaches you to decode any line the disassembler shows you.

Anatomy of One Line

Every assembly line has up to four parts:

[label:]   mnemonic   [operand1[, operand2[, operand3]]]   [; comment]

| Part | Optional? | Example | Meaning |
|---|---|---|---|
| Label | Yes | loop_start: | Named address; the target of jumps and calls |
| Mnemonic | No | MOV | The operation the CPU performs |
| Operands | Most mnemonics need 1–2 | EAX, 5 | What the operation acts on |
| Comment | Yes | ; i = 0 | Human annotation, ignored by the assembler |

xor_loop:           MOV   AL,  [ESI + ECX]   ; load byte from buffer
; ↑ label           ↑mnem ↑dst ↑src          ↑ comment

Intel syntax rule (used by Ghidra, Binary Ninja, and NASM):

Destination is always the left operand.

MOV EAX, 5 means “put 5 into EAX”, not “put EAX into 5”. Every instruction follows this convention: left = where the result lands, right = the source.

Intel vs AT&T Syntax

You will encounter both in the wild. Ghidra and Binary Ninja default to Intel; GDB, objdump, and the GNU assembler default to AT&T on x86.

| Feature | Intel (NASM / MASM) | AT&T (GAS / GDB) |
|---|---|---|
| Operand order | dst, src | src, dst (reversed) |
| Register names | EAX | %eax (prefixed with %) |
| Immediates | 5 | $5 (prefixed with $) |
| Memory reference | [EAX] | (%eax) (uses parentheses) |
| Size suffix | DWORD PTR [EAX] | movl (%eax): letter suffix on the mnemonic (b=byte, w=word, l=long/dword, q=qword) |
| Example | mov eax, [ebx + 8] | movl 8(%ebx), %eax |

If you see % before register names and $ before numbers, you are reading AT&T — flip the operand order mentally.

Practical tip: Ghidra’s listing is Intel-style out of the box. In the GNU tools you can switch to Intel with set disassembly-flavor intel in GDB or objdump -M intel. Most analysts stay on Intel.

Parsing a Memory Reference

Square brackets in Intel syntax mean “dereference this address” — the same as *ptr in C.

[  base  +  index * scale  +  displacement  ]

| Component | What it is | Example |
|---|---|---|
| base | A register holding the start address | EBX |
| index | An optional register acting as offset | ECX |
| scale | Multiplier for index: 1, 2, 4, or 8 | 4 (size of int) |
| displacement | A constant byte offset | 8 |

Decode each piece of [EBX + ECX*4 + 8] in English:

EBX          →  base address (start of an array)
ECX * 4      →  index × sizeof(int) — the Nth element
+ 8          →  skip 8 bytes past the start (e.g., past a struct header)
Result       →  ptr->array[N], where an int array starts 8 bytes into the structure
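
The same decoding can be checked numerically. The sketch below computes base + index * scale + displacement with 32-bit wraparound; the function name and the register values are hypothetical, chosen only to show the arithmetic.

```python
def effective_address(base=0, index=0, scale=1, disp=0, bits=32):
    """Compute base + index*scale + disp, wrapped to the address width."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (base + index * scale + disp) & ((1 << bits) - 1)


# [EBX + ECX*4 + 8] with EBX = 0x00403000 and ECX = 3 (made-up values)
print(hex(effective_address(base=0x00403000, index=3, scale=4, disp=8)))  # 0x403014

# [EBP - 4] with EBP = 0x0012FF70: a negative displacement
print(hex(effective_address(base=0x0012FF70, disp=-4)))                   # 0x12ff6c
```
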

Common patterns you will see constantly:

[EBP - 4]          ; local variable #1 (4 bytes below frame pointer)
[EBP + 8]          ; first function argument (cdecl / stdcall)
[EAX]              ; *(ptr)  — simple dereference
[EAX + 0x3C]       ; ptr->field_at_offset_0x3C  (e.g. PE header offset)
[EAX + ECX]        ; ptr[i]  — byte array element
[EAX + ECX*4]      ; ptr[i]  — int array element (4 bytes each)

Reading a Sequence — Building Mental State

Assembly has no scope, no types, no variable names. Reading it means running a tiny virtual machine in your head. For every line, ask three questions:

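  1. Which registers change? Usually just the destination operand, but watch for implicit writes: MUL and DIV write EDX:EAX, PUSH/POP/CALL/RET move ESP, and string instructions advance ESI/EDI/ECX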
  2. Which flags change? — arithmetic and compare instructions update flags; MOV and LEA do not
  3. Does memory get read or written? — any operand in [ ] touches memory

Work through a sequence by tracking register values as a table:

; Trace these five instructions top-to-bottom
mov  eax, 10        ; 1
mov  ecx, 3         ; 2
mul  ecx            ; 3 — EDX:EAX = EAX * ECX
sub  eax, 2         ; 4
push eax            ; 5
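
| Step | Instruction | EAX | ECX | EDX | ESP | Memory |
|---|---|---|---|---|---|---|
| start | | ? | ? | ? | 0xFF | |
| 1 | mov eax, 10 | 10 | ? | ? | 0xFF | |
| 2 | mov ecx, 3 | 10 | 3 | ? | 0xFF | |
| 3 | mul ecx | 30 | 3 | 0 | 0xFF | |
| 4 | sub eax, 2 | 28 | 3 | 0 | 0xFF | |
| 5 | push eax | 28 | 3 | 0 | 0xFB | [0xFB] = 28 |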

The table discipline forces you to track exactly what each instruction does without skipping ahead — the most common beginner mistake.
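
If you want to check a hand trace, the same five instructions can be replayed in Python. This is a rough sketch rather than a real emulator: each instruction is modelled by hand, the unknown starting values ("?") are assumed to be 0, and the toy ESP value 0xFF is taken from the table.

```python
MASK32 = 0xFFFFFFFF


def trace():
    """Replay the five-instruction sequence and record the state after each step."""
    regs = {"eax": 0, "ecx": 0, "edx": 0, "esp": 0xFF}
    mem = {}
    rows = []

    def snap(label):
        rows.append((label, regs["eax"], regs["ecx"], regs["edx"], regs["esp"]))

    regs["eax"] = 10
    snap("mov eax, 10")
    regs["ecx"] = 3
    snap("mov ecx, 3")
    product = regs["eax"] * regs["ecx"]          # mul ecx: full 64-bit product
    regs["eax"] = product & MASK32               # low half goes to EAX
    regs["edx"] = (product >> 32) & MASK32       # high half goes to EDX (implicit write)
    snap("mul ecx")
    regs["eax"] = (regs["eax"] - 2) & MASK32
    snap("sub eax, 2")
    regs["esp"] = (regs["esp"] - 4) & MASK32     # push: ESP moves first...
    mem[regs["esp"]] = regs["eax"]               # ...then the value is stored
    snap("push eax")
    return rows, mem


rows, mem = trace()
for row in rows:
    print(row)
print(mem)
```
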

Worked Example — Trace Five Lines

Here is a real-world snippet from a malware loader. Read it cold, then check the annotations:

00401020  mov  eax, [ebp + 8]    ; (1)
00401023  test eax, eax          ; (2)
00401025  jz   00401040          ; (3)
00401027  mov  ecx, [eax + 0x3C] ; (4)
0040102A  add  ecx, eax          ; (5)

Line by line:

| # | Instruction | What it does | Mental note |
|---|---|---|---|
| 1 | mov eax, [ebp+8] | Load the first argument into EAX | EAX = arg1 (likely a pointer) |
| 2 | test eax, eax | AND EAX with itself: sets ZF if EAX is zero, no write | null-check on the pointer |
| 3 | jz 00401040 | Jump to 0x401040 if ZF=1 (EAX was zero) | if (arg1 == NULL) goto error |
| 4 | mov ecx, [eax + 0x3C] | Read a DWORD 60 bytes into the struct EAX points at | 0x3C is the e_lfanew field of a DOS header; this is reading the PE offset |
| 5 | add ecx, eax | ECX = ECX + EAX (base + offset) | ECX now points to the PE signature / IMAGE_NT_HEADERS |

The five lines implement IMAGE_NT_HEADERS *nt = (IMAGE_NT_HEADERS*)(base + base->e_lfanew) — a pattern found in virtually every PE parser and loader you will encounter in malware analysis.
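
The same lookup is easy to reproduce on a file or memory dump. Below is a minimal Python sketch that reads e_lfanew at offset 0x3C and checks the PE signature. The helper name and the 0x100-byte synthetic image are assumptions for illustration; real samples need bounds checks that this sketch omits.

```python
import struct


def nt_headers_offset(image: bytes) -> int:
    """Return the offset of IMAGE_NT_HEADERS inside a PE image."""
    if image[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew: little-endian 32-bit value at offset 0x3C of the DOS header
    (e_lfanew,) = struct.unpack_from("<I", image, 0x3C)
    if image[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    return e_lfanew


# Synthetic image: just enough bytes to satisfy the two signature checks.
image = bytearray(0x100)
image[0:2] = b"MZ"
struct.pack_into("<I", image, 0x3C, 0x80)
image[0x80:0x84] = b"PE\x00\x00"

print(hex(nt_headers_offset(bytes(image))))   # 0x80
```
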

Key takeaway: You do not need to know every instruction before you start reading. You need the three questions (what changes, what flags, what memory?) and the habit of building the register table as you go. The patterns — null checks, struct field access, PE walking — repeat endlessly once you recognise them the first time.


CPU Registers — The Fast Lane

Registers are the CPU’s own ultra-fast memory — typically 8 to 32 slots, each holding one word of data. Every computation happens in registers; RAM is just slow storage the CPU ferries values to and from.

x86 General-Purpose Registers

On x86 (32-bit), eight general-purpose registers each hold a 32-bit (4-byte) value. Each register also exposes sub-word aliases that address smaller portions without extra instructions:

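| Full (32-bit) | Low 16-bit | High byte (bits 8–15) | Low byte (bits 0–7) | Primary convention |
|---|---|---|---|---|
| EAX | AX | AH | AL | Return value; arithmetic accumulator |
| EBX | BX | BH | BL | Base register (historical name); callee-saved |
| ECX | CX | CH | CL | Loop counter; LOOP, REP, shift counts (CL) |
| EDX | DX | DH | DL | Extended return (EDX:EAX); I/O port |
| ESI | SI | (none) | (none) | Source index for string ops |
| EDI | DI | (none) | (none) | Destination index for string ops |
| ESP | SP | (none) | (none) | Stack pointer; always points to the top of the stack |
| EBP | BP | (none) | (none) | Frame pointer; anchors local variable base |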

Ghidra / Binary Ninja tip: When you see [EBP - 0x8], that is a local variable 8 bytes below the frame pointer. When you see [EBP + 0x8], that is the first function argument (cdecl convention).

Register Bit Layout (EAX as example)

EAX (32-bit): bits 31–0
AX (16-bit):  bits 15–0
AH:           bits 15–8
AL:           bits 7–0
Bits 31–16 have no alias of their own.

x64 Extensions

x64 extends every 32-bit register to 64 bits and adds eight new registers. The naming convention prefixes R for the full 64-bit form:

| x64 (64-bit) | x86 alias (low 32) | Low 16 | Low 8 | Convention |
|---|---|---|---|---|
| RAX | EAX | AX | AL | Return value |
| RBX | EBX | BX | BL | Callee-saved |
| RCX | ECX | CX | CL | Arg 1 (Windows) |
| RDX | EDX | DX | DL | Arg 2 (Windows) |
| RSI | ESI | SI | SIL | Arg 2 (Linux); callee-saved (Windows) |
| RDI | EDI | DI | DIL | Arg 1 (Linux); callee-saved (Windows) |
| RSP | ESP | SP | SPL | Stack pointer |
| RBP | EBP | BP | BPL | Frame pointer (optional in x64) |
| R8–R9 | R8D–R9D | R8W–R9W | R8B–R9B | Arg 3–4 (Windows); Arg 5–6 (Linux); caller-saved |
| R10–R11 | R10D–R11D | R10W–R11W | R10B–R11B | Caller-saved scratch |
| R12–R15 | R12D–R15D | R12W–R15W | R12B–R15B | Callee-saved |

Critical x64 gotcha: Writing to a 32-bit sub-register (e.g. EAX) zero-extends into the 64-bit register (RAX). Writing to a 16-bit or 8-bit sub-register does not. This catches many analysts off-guard when reading decompiler output.

mov eax, 1      ; RAX = 0x0000000000000001  (upper 32 bits zeroed!)
mov ax,  1      ; RAX unchanged except low 16 bits
mov al,  1      ; RAX unchanged except low  8 bits
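
The zero-extension rule is easy to get wrong from memory, so here is a small Python model of what a write to each sub-register width does to the full 64-bit RAX. The function name is hypothetical and the high-byte registers (AH and friends) are not modelled.

```python
MASK64 = (1 << 64) - 1


def write_subreg(rax: int, value: int, width: int) -> int:
    """Return the new 64-bit RAX after writing `value` to its low `width` bits."""
    if width == 64:
        return value & MASK64
    if width == 32:
        return value & 0xFFFFFFFF                 # upper 32 bits are zeroed
    if width in (8, 16):
        mask = (1 << width) - 1
        return (rax & ~mask & MASK64) | (value & mask)   # upper bits preserved
    raise ValueError("width must be 8, 16, 32, or 64")


rax = 0xFFFFFFFFFFFFFFFF
print(hex(write_subreg(rax, 1, 32)))   # 0x1
print(hex(write_subreg(rax, 1, 16)))   # 0xffffffffffff0001
print(hex(write_subreg(rax, 1, 8)))    # 0xffffffffffffff01
```
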

ARM / AArch64 Registers

ARM uses a load-store architecture: unlike x86, arithmetic instructions can only operate on registers, never directly on memory. Data must be explicitly loaded into a register first.

ARM (32-bit) registers:

| Register | Alias | Role |
|---|---|---|
| R0–R3 | | Function arguments 1–4; return value in R0 |
| R4–R11 | | General purpose; callee-saved |
| R12 | IP | Intra-procedure-call scratch register |
| R13 | SP | Stack pointer |
| R14 | LR | Link Register; holds return address |
| R15 | PC | Program Counter; current instruction address |
| CPSR | | Current Program Status Register (flags) |

AArch64 (64-bit) registers:

| Register | Width | Role |
|---|---|---|
| X0–X7 | 64-bit | Function arguments 1–8; return in X0 |
| X8 | 64-bit | Indirect result location / syscall number (Linux) |
| X9–X15 | 64-bit | Caller-saved temporaries |
| X16–X17 | 64-bit | Intra-procedure-call scratch |
| X18 | 64-bit | Platform reserved (TEB on Windows ARM64) |
| X19–X28 | 64-bit | Callee-saved |
| X29 | 64-bit | Frame pointer (FP) |
| X30 | 64-bit | Link register (LR) |
| SP | 64-bit | Stack pointer (not a general register) |
| PC | 64-bit | Program counter (not directly writeable) |
| W0–W30 | 32-bit each | 32-bit aliases of the X registers |

The Instruction Pointer

The instruction pointer is the CPU’s “current position” register:

| Architecture | Register | Notes |
|---|---|---|
| x86 | EIP | Cannot be read directly; modified by JMP, CALL, RET. Position-independent code recovers it with CALL $+5; POP EAX |
| x64 | RIP | Readable with LEA RAX, [RIP+0]; used for RIP-relative addressing |
| ARM32 | PC (R15) | Readable and writable; writing to PC is a branch |
| AArch64 | PC | Not directly writeable; only modified by branch instructions |

Reversing tip: In x64 binaries, you will constantly see patterns like lea rax, [rip + 0x1234]. This is RIP-relative addressing — the operand is relative to the next instruction’s address. Ghidra and Binary Ninja both resolve these to absolute addresses automatically.
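
When you need to resolve a RIP-relative operand by hand (for example in a raw shellcode dump), the arithmetic is: target = address of the instruction + its length + the signed 32-bit displacement. A minimal sketch follows; the function name and the load address 0x140001000 are hypothetical, and the 7-byte length matches the usual encoding of lea rax, [rip + disp32].

```python
MASK64 = (1 << 64) - 1


def rip_relative_target(insn_addr: int, insn_len: int, disp32: int) -> int:
    """Resolve a RIP-relative operand: next-instruction address + signed disp32."""
    disp32 &= 0xFFFFFFFF
    if disp32 >= 0x80000000:          # reinterpret as a signed 32-bit value
        disp32 -= 1 << 32
    return (insn_addr + insn_len + disp32) & MASK64


# lea rax, [rip + 0x1234] located at a made-up address, 7 bytes long
print(hex(rip_relative_target(0x140001000, 7, 0x1234)))       # 0x14000223b

# A negative displacement of -7 points back at the instruction itself
print(hex(rip_relative_target(0x140001000, 7, 0xFFFFFFF9)))   # 0x140001000
```
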

EFLAGS / RFLAGS — The Status Word

Every comparison and arithmetic operation updates individual bits in the flags register. Conditional jumps then branch based on these bits.

| Flag | Bit | Set when… | Common instructions that set it | Jump / branch that reads it |
|---|---|---|---|---|
| CF | 0 | Carry/borrow out of the MSB (unsigned overflow) | ADD, SUB, SHL/SHR, CLC/STC, MUL | JB/JNAE (CF=1 → unsigned below); JAE/JNB (CF=0 → unsigned above-or-equal) |
| PF | 2 | Low byte of result has even parity | Most arithmetic and logic ops | JP/JPE (PF=1); JNP/JPO (PF=0). Rare in integer code; mostly appears after floating-point compares, where it signals an unordered result |
| AF | 4 | Carry from bit 3 to bit 4 (BCD arithmetic) | ADD, SUB, INC, DEC | Not tested by Jcc; consumed by DAA/DAS. Almost never seen outside legacy x86 |
| ZF | 6 | Result is zero | CMP, TEST, AND, OR, XOR, ADD, SUB, INC, DEC | JE/JZ (ZF=1 → equal/zero); JNE/JNZ (ZF=0 → not equal). The most-used flag in disassembly |
| SF | 7 | Result is negative (sign bit is 1) | Most arithmetic and logic ops | JS (SF=1); JNS (SF=0); combined with OF for JL/JG |
| OF | 11 | Signed overflow: result too large for the signed type | ADD, SUB, IMUL, NEG, INC, DEC | JO (OF=1); JNO (OF=0); paired with SF for JL (SF≠OF) and JGE (SF=OF) |
| DF | 10 | Direction for string ops (0 = forward / increment, 1 = backward / decrement) | CLD clears it; STD sets it | Not a Jcc flag; implicitly consumed by REP MOVS, REP STOS, SCAS. Malware sometimes sets DF=1 before REP STOSD to wipe memory backwards |
| IF | 9 | Interrupts enabled | STI sets; CLI clears | Not testable from user mode; kernel/driver context only |

Analyst tip — ZF is king: In practice, ZF is the flag you will track most often. TEST EAX, EAX / JNZ is the universal “is this value non-null?” idiom. CMP EAX, 0 / JE is “did this function return 0 (error/false)?”. If you can only track one flag, track ZF.
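
To see how the flags in the table drive signed versus unsigned branches, the following Python sketch computes ZF, SF, CF, and OF for a 32-bit CMP and then evaluates the JL and JB conditions. It is a simplified model with hypothetical helper names; PF and AF are left out.

```python
def cmp_flags(a: int, b: int, bits: int = 32) -> dict:
    """Flags produced by CMP a, b (computes a - b and discards the result)."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    a &= mask
    b &= mask
    r = (a - b) & mask
    return {
        "ZF": int(r == 0),
        "SF": int(bool(r & sign)),
        "CF": int(a < b),                            # unsigned borrow
        "OF": int(bool((a ^ b) & (a ^ r) & sign)),   # signed overflow
    }


def jl(flags):   # signed less-than: SF != OF
    return flags["SF"] != flags["OF"]


def jb(flags):   # unsigned below: CF = 1
    return flags["CF"] == 1


f = cmp_flags(0xFFFFFFFF, 1)      # -1 vs 1 when signed, 4294967295 vs 1 when unsigned
print(f, "JL taken:", jl(f), "JB taken:", jb(f))
```

The same bit pattern 0xFFFFFFFF is "less than 1" for JL and "not below 1" for JB, which is why the choice of Jcc tells you whether the original variable was signed.
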

The ARM equivalent is the CPSR (Current Program Status Register) / NZCV flags in AArch64:

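| ARM Flag | x86 Equivalent | Meaning |
|---|---|---|
| N | SF | Negative result |
| Z | ZF | Zero result |
| C | CF | Carry (for subtraction and CMP the sense is inverted relative to x86: C=1 means no borrow) |
| V | OF | Overflow |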

Memory and Addressing Modes

The Memory Map

A typical 32-bit Windows user-mode process looks like this:

Windows 32-bit Process: Virtual Address Space

0x00000000   NULL guard / unmapped
0x00010000   .text   code (RX)
             .data   initialised globals (RW)
             .rdata  read-only data, const strings
             .bss    uninitialised globals
             Heap    malloc / HeapAlloc, grows upward
             · · ·
             Stack   local vars, return addresses, grows downward
0x7FFFFFFF   ──── user / kernel boundary ────
             Kernel space, not accessible from user mode
0xFFFFFFFF   top of addressable space

On 64-bit Windows the user-mode range extends to 0x00007FFFFFFFFFFF. The structure is the same but the addresses are much larger. The kernel occupies the upper half of the virtual address space.

Addressing Mode Syntax

Intel syntax (used by Ghidra and Binary Ninja by default) wraps memory references in square brackets:

; Direct (absolute address)
mov eax, [0x402000]          ; load 4 bytes from address 0x402000

; Register indirect
mov eax, [ebx]               ; load 4 bytes from address stored in EBX

; Base + displacement
mov eax, [ebp + 8]           ; first argument in a cdecl frame
mov eax, [ebp - 4]           ; first local variable

; Base + Index * Scale + Displacement  (SIB byte)
mov eax, [ebx + ecx*4 + 8]  ; array element: base + index*sizeof(int) + offset

ARM uses different syntax but the concept is identical:

; ARM32 — load/store
LDR  R0, [R1]          ; R0 = *(R1)
LDR  R0, [R1, #8]      ; R0 = *(R1 + 8)
LDR  R0, [R1, R2]      ; R0 = *(R1 + R2)
LDR  R0, [R1, R2, LSL #2] ; R0 = *(R1 + R2<<2)  — array index
STR  R0, [R1]          ; *(R1) = R0
STMFD SP!, {R4-R7, LR} ; push multiple registers onto stack (PUSH equivalent)
LDMFD SP!, {R4-R7, PC} ; pop saved registers; loading the saved LR value into PC returns to the caller

The Stack — How Functions Think

The stack grows downward on all common architectures: pushing a value decrements the stack pointer and writes the value at the new address.

x86 Stack Layout

x86 Stack Layout: cdecl frame

(higher addresses)
· · · previous frames · · ·
EBP + 4·(N+1)   arg N
· · ·
EBP + 0x0C      arg 2
EBP + 0x08      arg 1
EBP + 0x04      return address
EBP + 0x00      saved EBP (old EBP, saved by callee)   ← EBP points here
EBP − 0x04      local var 1
EBP − 0x08      local var 2
· · ·
EBP − 4·N       local var N                            ← ESP (grows downward)
                (unallocated stack space)
(lower addresses)

Key rules:

  • EBP is the stable anchor: once the prologue sets it, it stays fixed until the epilogue. All local variables and arguments are addressed relative to it.
  • ESP moves freely as values are pushed/popped. Compilers often omit EBP in optimized code (frame-pointer omission / -fomit-frame-pointer) and use ESP-relative addressing instead.
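
The layout above can be reproduced by simulating what the caller and the prologue do to ESP. The Python sketch below is illustrative only: the starting ESP, the argument values, and the string placeholders for the return address and saved EBP are assumptions.

```python
def build_frame(args, n_locals, esp=0x0012FF80):
    """Simulate a cdecl call plus 'push ebp / mov ebp, esp / sub esp, 4*N'."""
    mem = {}

    def push(value):
        nonlocal esp
        esp -= 4              # the stack grows toward lower addresses
        mem[esp] = value

    for arg in reversed(args):    # cdecl pushes arguments right to left
        push(arg)
    push("return address")        # pushed implicitly by CALL
    push("saved EBP")             # push ebp
    ebp = esp                     # mov ebp, esp
    esp -= 4 * n_locals           # sub esp, 4*N reserves the locals
    return ebp, esp, mem


ebp, esp, mem = build_frame([111, 222], n_locals=2)
print(hex(ebp), hex(esp))
print("arg 1 at [EBP+8]   =", mem[ebp + 8])
print("arg 2 at [EBP+0xC] =", mem[ebp + 0xC])
```
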

Function Prologue and Epilogue

Most functions you see in a disassembler begin and end with boilerplate code to set up and tear down the stack frame; optimized builds and small leaf functions may trim or skip it.

x86 standard prologue:

push  ebp          ; save caller's frame pointer
mov   ebp, esp     ; establish new frame pointer
sub   esp, 0x28    ; reserve 0x28 (40) bytes for local variables
push  ebx          ; callee-saved registers that this function uses
push  esi
push  edi

x86 standard epilogue:

pop   edi          ; restore callee-saved registers (reverse order)
pop   esi
pop   ebx
mov   esp, ebp     ; collapse stack frame
pop   ebp          ; restore caller's frame pointer
ret                ; pop return address into EIP

The leave instruction is shorthand for mov esp, ebp; pop ebp. You will see it often in GCC output:

leave              ; equivalent: mov esp, ebp; pop ebp
ret

x64 prologue (Windows):

push  rbp
mov   rbp, rsp
push  rbx             ; callee-saved (nonvolatile) registers are pushed first
push  r12
push  r13
push  r14
sub   rsp, 0x40       ; locals + shadow space (0x20), which must sit at the bottom of the frame

In x64, many compilers omit the frame pointer entirely and address locals relative to RSP:

sub   rsp, 0x58       ; allocate stack space for locals + shadow space
; locals start at [rsp+0x20]; [rsp+0]..[rsp+0x1F] is shadow space for callees
add   rsp, 0x58       ; epilogue: collapse frame
ret

ARM32 prologue/epilogue:

; Prologue — push callee-saved regs and LR onto stack
PUSH  {R4, R5, R6, R7, LR}
SUB   SP, SP, #0x10    ; allocate 16 bytes for locals

; Epilogue — restore and return (loading LR into PC branches back)
ADD   SP, SP, #0x10
POP   {R4, R5, R6, R7, PC}

Writing PC from a pop is ARM’s atomic “restore and return” — it simultaneously restores registers and jumps to the saved LR value.


Core Instructions in Depth

Data Movement

| Instruction | Example | Effect |
|---|---|---|
| MOV | mov eax, 5 | EAX ← 5 |
| MOV | mov eax, [ebx] | EAX ← memory at EBX |
| MOV | mov [eax], ebx | memory at EAX ← EBX |
| LEA | lea eax, [ebx+4] | EAX ← address EBX+4 (no memory read) |
| MOVZX | movzx eax, byte [ebx] | Load byte, zero-extend to 32 bits |
| MOVSX | movsx eax, byte [ebx] | Load byte, sign-extend to 32 bits |
| XCHG | xchg eax, ebx | Swap EAX ↔ EBX (with a memory operand the swap is atomic; the LOCK is implicit) |
| PUSH | push eax | ESP -= 4; [ESP] ← EAX |
| POP | pop eax | EAX ← [ESP]; ESP += 4 |

LEA trick: Compilers routinely abuse LEA for fast arithmetic. lea eax, [eax + eax*4] computes EAX * 5 without a multiply instruction. When you see LEA with no obvious pointer, think “fast multiply or multi-operand add.”

ARM equivalents:

; ARM32
MOV   R0, #5          ; R0 = 5 (immediate)
MOV   R0, R1          ; R0 = R1
LDR   R0, [R1]        ; R0 = *(R1)       — equivalent to x86 MOV reg, [reg]
STR   R0, [R1]        ; *(R1) = R0       — equivalent to x86 MOV [reg], reg
LDRB  R0, [R1]        ; load byte (zero-extended)
LDRSB R0, [R1]        ; load byte (sign-extended)
ADR   R0, label       ; R0 = address of label  (LEA equivalent)

Arithmetic

add   eax, 5           ; EAX += 5
sub   eax, ebx         ; EAX -= EBX
imul  eax, ecx, 7      ; EAX = ECX * 7  (signed multiply, 3-operand form)
mul   ecx              ; EDX:EAX = EAX * ECX  (unsigned; high bits in EDX!)
idiv  ecx              ; EAX = quotient, EDX = remainder of EDX:EAX / ECX  (signed; CDQ usually precedes it)
inc   eax              ; EAX++  (does NOT set CF — common gotcha)
dec   eax              ; EAX--
neg   eax              ; EAX = -EAX  (two's complement negation)

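Malware pattern (multiplies with odd constants): Malware authors sometimes use MUL or IMUL with unusual constants as a cheap hash function or address offset calculation. A multiply followed by an add and then a memory dereference may be a hash-table lookup, but ordinary array-of-structs indexing compiles to the same shape, so check whether the constant looks like a plausible element size before concluding it is a hash.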

ARM32 arithmetic:

ADD   R0, R1, R2       ; R0 = R1 + R2
ADD   R0, R0, #4       ; R0 += 4
SUB   R0, R1, R2       ; R0 = R1 - R2
MUL   R0, R1, R2       ; R0 = R1 * R2  (low 32 bits)
UMULL R0, R1, R2, R3   ; R1:R0 = R2 * R3  (64-bit unsigned result)
RSB   R0, R1, #0       ; R0 = 0 - R1  (negate; NEG exists only as an assembler alias for this reverse-subtract)

Bitwise & Shift

and   eax, 0xFF        ; mask — keep only low byte
or    eax, 0x04        ; set bit 2
xor   eax, eax         ; EAX = 0  (fastest zero idiom; also clears CF/OF)
xor   eax, key         ; encrypt/decrypt a value with key (most common malware op)
not   eax              ; bitwise complement
shl   eax, 3           ; logical shift left  3 ≡ multiply by 8
shr   eax, 1           ; logical shift right 1 ≡ unsigned divide by 2
sar   eax, 1           ; arithmetic shift right (preserves sign bit)
rol   eax, 4           ; rotate left  4 bits (used in hash functions / crypto)
ror   eax, 4           ; rotate right 4 bits
bswap eax              ; reverse byte order (endian swap)
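
Python integers are unbounded, so reproducing these 32-bit operations in a script (for example when re-implementing a sample's hash routine) needs explicit masking. The helpers below are a minimal sketch with hypothetical names; they model only the result value, not the flags.

```python
MASK32 = 0xFFFFFFFF


def shr(x, n):
    """Logical shift right: zeros shift in from the left."""
    return (x & MASK32) >> n


def sar(x, n):
    """Arithmetic shift right: the sign bit is replicated."""
    x &= MASK32
    if x & 0x80000000:
        x -= 1 << 32              # reinterpret as a negative number
    return (x >> n) & MASK32


def rol(x, n):
    """Rotate left by n bits within 32 bits."""
    n %= 32
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32


def ror(x, n):
    """Rotate right by n bits, expressed as a left rotation."""
    return rol(x, 32 - (n % 32))


def bswap(x):
    """Reverse the byte order of a 32-bit value."""
    return int.from_bytes((x & MASK32).to_bytes(4, "little"), "big")


print(hex(shr(0x80000000, 1)), hex(sar(0x80000000, 1)))   # 0x40000000 0xc0000000
print(hex(rol(0x12345678, 4)), hex(ror(0x12345678, 4)))   # 0x23456781 0x81234567
print(hex(bswap(0x12345678)))                             # 0x78563412
```
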

xor reg, reg is the canonical “zero a register” idiom. xor eax, eax encodes in 2 bytes versus 5 bytes for mov eax, 0. You will see it constantly, typically zeroing a return value or a loop counter.

ARM32 bitwise:

AND   R0, R1, R2       ; R0 = R1 & R2
ORR   R0, R1, R2       ; R0 = R1 | R2  (note: ORR not OR)
EOR   R0, R1, R2       ; R0 = R1 ^ R2  (XOR)
MVN   R0, R1           ; R0 = ~R1  (NOT + move)
LSL   R0, R1, #3       ; R0 = R1 << 3
LSR   R0, R1, #1       ; R0 = R1 >> 1 (logical)
ASR   R0, R1, #1       ; R0 = R1 >> 1 (arithmetic)
ROR   R0, R1, #4       ; R0 = rotate_right(R1, 4)

; ARM's barrel shifter lets you combine shift with any data op:
ADD   R0, R1, R2, LSL #2   ; R0 = R1 + (R2 << 2)  — all in one instruction!

Comparison and Flags

CMP subtracts two values and discards the result — only flags are updated. TEST ANDs two values and discards the result. Neither instruction writes a register.

cmp   eax, 0           ; sets ZF if EAX==0, SF if EAX<0
test  eax, eax         ; exactly like `cmp eax, 0` but 1 byte shorter

cmp   eax, ebx
jl    less_label       ; signed: jump if EAX < EBX  (SF != OF)
jb    below_label      ; unsigned: jump if EAX < EBX (CF=1)

test  eax, 0x01        ; test bit 0
jnz   odd_label        ; jump if bit 0 was set

ARM32 comparisons:

CMP   R0, R1           ; flags = R0 - R1 (discards result)
TST   R0, #0x01        ; flags = R0 & 0x01
CMN   R0, R1           ; flags = R0 + R1 (compare negative)

; ARM conditionals are unique: any instruction can be conditional!
MOVEQ R0, #1           ; R0 = 1 ONLY if Z flag is set  (x86 needs a Jcc)
ADDNE R2, R2, #4       ; R2 += 4 ONLY if Z flag is clear

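This conditional execution is a key differentiator of classic 32-bit ARM (A32): instead of a cmp + jcc + branch target, a short if-else can be two unconditional + two conditional instructions with no branch at all. Thumb-2 code needs an IT block to do the same, and AArch64 dropped per-instruction predication in favor of conditional-select instructions such as CSEL.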

Control Flow

; Unconditional
jmp   label            ; EIP = label
call  label            ; push return address (the next instruction); EIP = label
ret                    ; EIP = [ESP]; ESP += 4
ret   8                ; EIP = [ESP]; ESP += 12  (stdcall — also pops 2 dwords of args)

; Conditional jumps (check after CMP/TEST)
je  / jz   label       ; jump if ZF=1  (equal / zero)
jne / jnz  label       ; jump if ZF=0  (not equal / not zero)
jl  / jnge label       ; signed less than         (SF!=OF)
jle / jng  label       ; signed less-than-or-equal (ZF=1 or SF!=OF)
jg  / jnle label       ; signed greater than       (ZF=0 and SF=OF)
jge / jnl  label       ; signed greater-than-or-equal (SF=OF)
jb  / jnae label       ; unsigned below            (CF=1)
ja  / jnbe label       ; unsigned above            (CF=0 and ZF=0)

; Loop
loop  label            ; ECX--; jump if ECX!=0
loope label            ; ECX--; jump if ECX!=0 AND ZF=1

ARM32 branches:

B     label            ; unconditional branch (x86 JMP)
BL    label            ; branch with link: saves the return address (next instruction) into LR (x86 CALL)
BX    LR               ; branch to address in LR — function return (x86 RET)
BLX   R0               ; branch-with-link to address in R0 — indirect call

; Conditional branches
BEQ   label            ; branch if Z=1
BNE   label            ; branch if Z=0
BLT   label            ; branch if N!=V  (signed less than)
BGT   label            ; branch if Z=0 and N=V (signed greater than)
BLO   label            ; branch if C=0  (unsigned lower)
BHI   label            ; branch if C=1 and Z=0 (unsigned higher)

String Operations

x86 has a family of bulk-memory instructions that operate on ESI/EDI and auto-increment/decrement them based on the DF flag. Combined with the REP prefix they form efficient memory loops.

cld                    ; clear DF — direction = forward (ESI/EDI increment)
std                    ; set DF   — direction = backward (decrement)

rep  movsb             ; copy ECX bytes from [ESI] to [EDI]
rep  stosd             ; fill ECX dwords at [EDI] with EAX (memset-like)
repe  cmpsb            ; compare up to ECX bytes at [ESI] vs [EDI]; stops at the first mismatch (memcmp-like)
repne scasb            ; scan [EDI] for the byte in AL; stops on a match or when ECX reaches 0

; Common Ghidra/BN patterns for these:
; rep movsd  ->  memcpy(edi, esi, ecx*4)
; rep stosd  ->  memset(edi, eax, ecx*4)   (EAX is usually 0 = bzero)

Shellcode pattern: REP MOVSD/STOSD shows up in PE loaders embedded in shellcode — copying sections into allocated memory or zeroing the BSS.
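
To make the SCASB scan concrete, here is a small Python model of the inlined strlen idiom built on it (AL = 0, ECX = -1, scan, then NOT ECX and DEC ECX). It is a sketch under those assumptions, not a full emulator, and the function name is hypothetical.

```python
def repne_scasb_strlen(memory: bytes, edi: int) -> int:
    """Model `repne scasb` with AL=0 and ECX=-1, then `not ecx; dec ecx`."""
    ecx = 0xFFFFFFFF               # ECX starts at -1 so the scan is effectively unbounded
    al = 0                         # byte being searched for (the NUL terminator)
    while ecx != 0:
        byte = memory[edi]
        edi += 1                   # DF=0, so EDI moves forward
        ecx = (ecx - 1) & 0xFFFFFFFF
        if byte == al:             # REPNE stops once the comparison sets ZF
            break
    return ((~ecx) & 0xFFFFFFFF) - 1   # not ecx; dec ecx

print(repne_scasb_strlen(b"kernel32.dll\x00padding", 0))
```

The final NOT/DEC pair is the part that looks odd in a listing: ECX has counted every scanned byte including the terminator, so inverting it and subtracting one recovers the string length.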


Calling Conventions

Calling conventions define: where arguments go, who cleans the stack, and which registers must be preserved across a call.

x86 cdecl and stdcall

          cdecl                   stdcall
          ─────────────────────   ──────────────────────
args      pushed right to left    pushed right to left
cleanup   CALLER cleans stack     CALLEE cleans (RET n)
return    EAX (small values)      EAX
          EDX:EAX (64-bit)        EDX:EAX (64-bit)
saved     EBX, ESI, EDI, EBP     EBX, ESI, EDI, EBP

Spotting cdecl vs stdcall in a disassembler:

  • cdecl: after CALL, you see add esp, N — the caller cleaning up N bytes of arguments
  • stdcall: the CALL target ends with RET N — callee cleans its own arguments
; cdecl call: add(3, 7)
push  7
push  3
call  _add
add   esp, 8        ; caller pops 2 x 4-byte args

; stdcall call: MessageBoxA(NULL, "hi", "cap", 0)
push  0
push  offset caption
push  offset text
push  0
call  MessageBoxA   ; MessageBoxA does: ret 0x10  (cleans 16 bytes itself)

x64 Microsoft ABI

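On 64-bit Windows, the first four integer/pointer arguments go in registers. The callee never pops arguments: the caller owns the argument area and usually reserves it once in its prologue instead of adjusting RSP around every call.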

Arg 1 -> RCX     (or XMM0 if float)
Arg 2 -> RDX     (or XMM1)
Arg 3 -> R8      (or XMM2)
Arg 4 -> R9      (or XMM3)
Arg 5+ -> stack (above shadow space)

The shadow space (also called “home space”) is 32 bytes (4 x 8) that the caller must always allocate on the stack before a call, even if the function takes fewer than 4 arguments. The callee may spill its register arguments into this space.

; x64 call: CreateFileA(name, GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL)
; Assumes RSP is 8 mod 16 here, as it is at function entry
sub   rsp, 0x38         ; shadow space (0x20) + 3 stack args (0x18); RSP is now 16-byte aligned
mov   rcx, rax          ; arg1 = filename
mov   edx, 0x80000000   ; arg2 = GENERIC_READ
xor   r8d, r8d          ; arg3 = 0 (share mode)
xor   r9d, r9d          ; arg4 = NULL (security attrs)
; arg5-arg7 go on stack at [rsp+0x20], [rsp+0x28], [rsp+0x30]
mov   dword [rsp+0x20], 3          ; arg5 = OPEN_EXISTING
mov   dword [rsp+0x28], 0x80       ; arg6 = FILE_ATTRIBUTE_NORMAL
mov   qword [rsp+0x30], 0          ; arg7 = NULL (template handle)
call  CreateFileA
add   rsp, 0x38
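
The size of that SUB RSP immediate follows from simple arithmetic. The sketch below assumes a function that pushes nothing in its prologue, so RSP is 8 mod 16 at entry; win64_call_frame_size is a hypothetical helper name, not a tool API.

```python
def win64_call_frame_size(num_args: int, local_bytes: int = 0) -> int:
    """Estimate the `sub rsp, N` immediate for a function making one call.

    Assumption: no registers are pushed in the prologue, so RSP % 16 == 8
    at entry (the CALL that got us here pushed an 8-byte return address).
    """
    slots = max(num_args, 4)            # shadow space always covers 4 register args
    size = slots * 8 + local_bytes
    size += (8 - size % 16) % 16        # pad so RSP is 16-byte aligned at the CALL
    return size

for n in (2, 5, 6, 7):
    print(n, "args ->", hex(win64_call_frame_size(n)))
```

A call with four or fewer arguments gives the familiar 0x28, while a seven-argument call such as CreateFileA needs 0x38 so that the stack arguments at [rsp+0x20] through [rsp+0x30] stay inside the reserved area.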

x86 fastcall and thiscall

Two more conventions appear constantly in Windows binaries — especially those compiled with MSVC.

__fastcall passes the first two integer arguments in ECX and EDX (skipping the stack for them), with the rest pushed right-to-left. The callee cleans the stack.

; __fastcall: myfunc(3, 7, 99)
mov   ecx, 3      ; arg1 → ECX
mov   edx, 7      ; arg2 → EDX
push  99          ; arg3 on stack (right-to-left)
call  myfunc      ; callee does: ret 4  (cleans only arg3)

Recognition tip: if you see MOV ECX, value and MOV EDX, value before a CALL and there is no ADD ESP, N after it, you are likely in __fastcall.

__thiscall is MSVC’s calling convention for C++ member functions. The hidden this pointer goes in ECX; remaining arguments are pushed right-to-left; the callee cleans.

; C++: obj->method(42)
mov   ecx, obj_ptr    ; ECX = this  ← the telltale sign
push  42              ; first explicit arg
call  MyClass_method  ; callee does: ret 4

C++ recognition shortcut: When you see MOV ECX, [some_ptr] immediately before a CALL, you are probably looking at a C++ method call (virtual or non-virtual). If the call target is loaded through the object, typically MOV EAX, [ECX] followed by CALL [EAX + N], it is a vtable dispatch: the first dereference fetches the vtable pointer, and N selects the slot. Follow that pointer to find the virtual function table.

x64 System V (Linux)

Linux and macOS use a different ABI:

Arg 1 -> RDI
Arg 2 -> RSI
Arg 3 -> RDX
Arg 4 -> RCX
Arg 5 -> R8
Arg 6 -> R9
Arg 7+ -> stack
Callee-saved: RBX, R12-R15, RBP
No shadow space required
Syscall number -> RAX; invoke with SYSCALL instruction (Linux syscalls pass arg 4 in R10, not RCX)

Calling Convention Comparison — At a Glance

Use this table when you need to quickly identify which convention a binary uses and reconstruct the argument list from the disassembly:

| Convention | Arg 1 | Arg 2 | Arg 3 | Arg 4 | Arg 5+ | Stack cleanup | Callee must preserve | Common context |
|---|---|---|---|---|---|---|---|---|
| cdecl | stack | stack | stack | stack | stack | Caller (ADD ESP, N after CALL) | EBX, ESI, EDI, EBP | C functions, GCC x86 default, printf-style varargs |
| stdcall | stack | stack | stack | stack | stack | Callee (RET N) | EBX, ESI, EDI, EBP | Win32 API (WINAPI / PASCAL macros) |
| fastcall | ECX | EDX | stack | stack | stack | Callee (RET N) | EBX, ESI, EDI, EBP | MSVC /Gr flag, Windows kernel internal functions |
| thiscall | ECX (this) | stack | stack | stack | stack | Callee (RET N) | EBX, ESI, EDI, EBP | MSVC C++ non-virtual & virtual methods |
| x64 Windows | RCX | RDX | R8 | R9 | stack (above shadow) | Caller | RBX, RBP, RDI, RSI, R12–R15, XMM6–XMM15 | All 64-bit Windows code; RCX = this in C++ |
| x64 System V | RDI | RSI | RDX | RCX | R8, R9, then stack | Caller | RBX, RBP, R12–R15 | Linux, macOS x64; RDI = this in C++ |
| ARM32 AAPCS | R0 | R1 | R2 | R3 | stack | Caller | R4–R11, SP | Android NDK, iOS (older), embedded ARM |
| AArch64 AAPCS64 | X0 | X1 | X2 | X3 | X4–X7, then stack | Caller | X19–X28, X29 (FP), SP; X30 (LR) is saved by any non-leaf function | Apple Silicon, Android ARM64, Windows on Arm |

Identifying the convention from disassembly:

| Clue | Convention |
|---|---|
| ADD ESP, N immediately after CALL | cdecl: caller cleaning N bytes |
| RET N inside the callee | stdcall or thiscall: callee cleaning N bytes |
| MOV ECX, ptr then CALL with no ADD ESP after | thiscall: ECX is this |
| MOV ECX, val; MOV EDX, val before a CALL | fastcall: first two args in registers |
| MOV RCX, …; MOV RDX, …; MOV R8D, … before CALL | x64 Windows ABI |
| MOV RDI, …; MOV RSI, …; MOV RDX, … before CALL | x64 System V (Linux/macOS) |
| MOV R0, …; MOV R1, …; BL func | ARM32 AAPCS |

ARM Assembly

ARM vs x86 Philosophy

| Aspect | x86 / x64 | ARM |
|---|---|---|
| Architecture | CISC: complex, variable-length instructions | RISC: uniform 32-bit instructions (mostly) |
| Memory operands | Allowed in arithmetic: ADD EAX, [EBX] | Never: only load/store instructions (LDR/STR and relatives) touch memory |
| Instruction size | 1–15 bytes | 4 bytes (ARM) / 2 or 4 bytes (Thumb) |
| Condition codes | Jcc branches, plus CMOVcc and SETcc | Almost any ARM32 instruction can be conditional |
| Barrel shifter | Separate shift instructions | Built-in: ADD R0, R1, R2, LSL #2 |
| Endianness | Always little-endian | Configurable (usually little-endian) |

ARM Registers Deep Dive

The ARM calling convention (AAPCS) assigns specific roles to registers, and a disassembler may show them by number rather than by alias. You must know them:

Saved registers (must be preserved):
  R4  R5  R6  R7  R8  R9  R10 R11(FP)

Scratch / argument registers (caller-saved):
  R0  R1  R2  R3

Special:
  R12 = IP  (intra-procedure scratch; used by PLT stubs on Linux)
  R13 = SP  (stack pointer — never use for anything else)
  R14 = LR  (link register — holds return address after BL)
  R15 = PC  (program counter; reads as current instruction +8 in ARM mode, +4 in Thumb)

ARM PC offset gotcha: In ARM32 mode, reading PC gives the address of the current instruction +8 (not +4). This is a pipeline artifact from ARM’s 3-stage pipeline. Ghidra and Binary Ninja compensate automatically, but if you calculate addresses manually, remember the offset.

Key ARM Instructions

; ── Load / Store ─────────────────────────────────────────────
LDR   R0, [R1]          ; 32-bit load
LDRH  R0, [R1]          ; 16-bit load, zero-extend
LDRB  R0, [R1]          ; 8-bit load, zero-extend
LDRSB R0, [R1]          ; 8-bit load, sign-extend
STR   R0, [R1]          ; 32-bit store
STRB  R0, [R1, #3]      ; byte store with offset

; Pre-indexing (update base before access)
LDR   R0, [R1, #4]!     ; R0 = *(R1+4); R1 += 4

; Post-indexing (update base after access)
LDR   R0, [R1], #4      ; R0 = *R1; R1 += 4  -- very common in loops

; Multiple-register transfer (callee save/restore)
STMFD SP!, {R4-R11, LR} ; push R4..R11 and LR
LDMFD SP!, {R4-R11, PC} ; pop R4..R11 and jump to saved LR

; ── Branching ────────────────────────────────────────────────
B     func              ; jump
BL    func              ; call (saves address of next instruction to LR)
BX    LR                ; return (branch to address in LR)
BLX   R0                ; indirect call (bit 0 of R0 selects ARM or Thumb state)

; ── Data Processing ──────────────────────────────────────────
MOV   R0, #0xFF         ; R0 = 255
MOVW  R0, #0x1234       ; R0 = 0x1234  (16-bit immediate, ARMv6T2+)
MOVT  R0, #0x5678       ; R0[31:16] = 0x5678  (upper 16 bits)
; Together: MOVW/MOVT pair loads a full 32-bit constant
; This is the ARM equivalent of x86 `mov eax, imm32`

MRS   R0, CPSR          ; read flags/mode register
MSR   CPSR_f, R0        ; write flags field of CPSR
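
When a disassembler fails to fold a MOVW/MOVT pair into one constant, you can recombine it by hand. A minimal Python sketch of the two instructions' effect on a 32-bit register (the function names simply mirror the mnemonics and are illustrative):

```python
def movw(imm16: int) -> int:
    """MOVW Rd, #imm16: write the low half and zero the upper 16 bits."""
    return imm16 & 0xFFFF

def movt(reg: int, imm16: int) -> int:
    """MOVT Rd, #imm16: write the upper 16 bits and keep the low half."""
    return ((imm16 & 0xFFFF) << 16) | (reg & 0xFFFF)

r0 = movw(0x1234)
r0 = movt(r0, 0x5678)
print(hex(r0))
```

Constants built this way are often addresses of strings or configuration blobs, so the combined value is worth looking up in the memory map.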

Thumb and Thumb-2 Mode

ARM processors can switch between ARM mode (4-byte instructions) and Thumb mode (2-byte instructions). Thumb shrinks code noticeably at a small performance cost, which matters for embedded and mobile malware.

Detection in disassemblers:

  • Thumb mode functions have their symbol address OR’d with 1 (e.g., 0x00008001 instead of 0x00008000)
  • Ghidra and Binary Ninja auto-detect and display the right instruction set
  • BX Rn with the LSB of Rn set = switch to Thumb; clear = switch to ARM

Thumb-2 (ARMv6T2 / Cortex-A) extends Thumb with 32-bit instructions, giving near-ARM performance with compact encoding. Most 32-bit ARM code on Android and iOS, malware included, is compiled as Thumb-2.

; Thumb (16-bit): two-register ops reuse the destination as the first source
PUSH  {R4, LR}          ; save
MOV   R0, #5
ADD   R0, R1            ; R0 += R1  (Thumb: only 2-register form)
POP   {R4, PC}          ; restore and return

; Thumb-2 (32-bit encodings; first halfword starts 0xE8xx, 0xF0xx, 0xF8xx...)
MOVW  R0, #0xABCD       ; 16-bit immediate carried in a 32-bit Thumb-2 encoding
MOVT  R0, #0x1234       ; R0 = 0x1234ABCD

AArch64 (ARM64)

AArch64 is a complete redesign — not backward compatible with ARM32. Used in Apple Silicon, Raspberry Pi 4+, and Windows on Arm.

; Registers: X0-X30 (64-bit), W0-W30 (low 32 bits), SP, PC
; No condition codes on most instructions (unlike ARM32)
; Shifted-register operands still exist: ADD X0, X1, X2, LSL #2 and LDR X0, [X1, X2, LSL #3]

; Load / store
LDR   X0, [X1]          ; 64-bit load
LDR   W0, [X1]          ; 32-bit load (zero-extends into X0)
LDRB  W0, [X1]          ; byte load
STP   X29, X30, [SP, #-16]!  ; store pair (typical frame setup)
LDP   X29, X30, [SP], #16    ; load pair (typical frame teardown)

; Arithmetic
ADD   X0, X1, X2        ; X0 = X1 + X2
ADD   X0, X1, #8        ; X0 = X1 + 8
MUL   X0, X1, X2        ; X0 = X1 * X2  (low 64 bits)

; Branching
BL    func              ; call (saves PC+4 to X30/LR)
RET                     ; return via X30/LR (NOT ret like x86 — no stack pop)
BR    X0                ; indirect branch (x86: jmp rax)
BLR   X0               ; indirect call  (x86: call rax)

; Conditionals (fused compare-and-branch)
CBZ   X0, label         ; branch if X0 == 0  (no CMP needed)
CBNZ  X0, label         ; branch if X0 != 0
TBZ   X0, #3, label     ; branch if bit 3 of X0 == 0

ARM Calling Convention (AAPCS)

ARM32 (AAPCS):

Arguments 1-4 : R0  R1  R2  R3
Arguments 5+  : stack (pushed right-to-left)
Return value  : R0  (64-bit: R1:R0)
Callee-saved  : R4-R11, SP
Caller-saved  : R0-R3, R12, LR
Stack         : 8-byte aligned at public interfaces

AArch64 (AAPCS64):

Arguments 1-8 : X0-X7
Arguments 9+  : stack
Return value  : X0  (128-bit: X1:X0)
Callee-saved  : X19-X28, X29(FP), SP  (X30/LR is saved by any non-leaf function)
Caller-saved  : X0-X17  (X18 is the platform register on some OSes)
Stack         : 16-byte aligned always

Reading Disassembly in Ghidra and Binary Ninja

Function Prologue Recognition

When you open a binary in Ghidra or Binary Ninja, every function begins with a recognizable setup sequence. Train your eye to skip past it instantly:

; Classic x86 frame setup — skim past this
55                push ebp
89 E5             mov  ebp, esp
83 EC 20          sub  esp, 0x20
53                push ebx
56                push esi
57                push edi
; <- HERE is where the actual logic starts

In Ghidra the decompiler view (Window → Decompile, or Ctrl+E) collapses this to nothing: you see int local_24; int local_20; as declarations. In Binary Ninja, the HLIL (High-Level IL) also hides the prologue, while the disassembly view and the lower IL levels show it raw.

Ghidra listing: decode_payload
Listing (disassembly)
decode_payload
0040107b  55          PUSH  EBP                  ; save caller frame
0040107c  89 e5       MOV   EBP,ESP
0040107e  83 ec 28    SUB   ESP,0x28             ; 40 bytes of locals
00401081  53          PUSH  EBX
00401082  56          PUSH  ESI
00401083  57          PUSH  EDI
00401084  8b 75 08    MOV   ESI,[EBP + param_1]  ; <- logic starts here
00401087  8b 7d 0c    MOV   EDI,[EBP + param_2]
0040108a  8b 4d 10    MOV   ECX,[EBP + param_3]
Decompiler (Ghidra)
void decode_payload(byte *buf,byte *key,int len)
{
  int local_24;
  int local_20;
  /* prologue variables auto-declared above */
  /* ↓ actual logic the analyst cares about */
  local_24 = 0;
  while (local_24 < len) {
    ...
  }
}

Key takeaway: the row at 00401084 is where the prologue ends and the real function body begins. Everything above it is bookkeeping.
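
The same byte signature can be used to find function starts in a raw blob that has no symbols. This is a rough Python sketch, assuming 32-bit x86 code; find_x86_prologues is a hypothetical helper, and because it is pure byte matching it will also report hits inside data.

```python
# push ebp; mov ebp, esp has two valid encodings for the MOV:
#   55 89 E5  (typical of GCC/Clang output)
#   55 8B EC  (typical of MSVC output)
PROLOGUES = (bytes.fromhex("5589e5"), bytes.fromhex("558bec"))

def find_x86_prologues(code: bytes, base: int = 0) -> list:
    """Return addresses where a classic frame-pointer prologue begins."""
    hits = []
    for pattern in PROLOGUES:
        start = 0
        while (idx := code.find(pattern, start)) != -1:
            hits.append(base + idx)
            start = idx + 1
    return sorted(hits)

blob = bytes.fromhex("90905589e583ec20c3cc558bec5dc3")
print([hex(a) for a in find_x86_prologues(blob, base=0x401000)])
```

Treat the result as a list of candidates to feed into the disassembler, not as ground truth: optimized code often omits the frame pointer entirely.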

x64 prologue without frame pointer (common in MSVC /O2):

48 83 EC 58       sub  rsp, 0x58   ; allocate 88 bytes
; NO push rbp — rbp may be used as a general register!
; Shadow space for callees at [rsp]..[rsp+0x1F]; outgoing stack args from [rsp+0x20]; locals above those

Recognizing Loops

Every loop in high-level code becomes one of two patterns in assembly:

Top-test loop (while / for):

loop_start:
  cmp   ecx, 0
  je    loop_end         ; exit if done
  ; body
  dec   ecx
  jmp   loop_start
loop_end:

Bottom-test loop (do-while — optimized form):

loop_body:
  ; body  (always executes at least once)
  dec   ecx
  jnz   loop_body        ; jump back while ECX != 0

Tip: In Ghidra graph view, a loop appears as a block with a back-edge arrow pointing upward. Binary Ninja draws conditional edges in green (taken) and red (not taken) and unconditional edges in blue; whichever edge closes the cycle is the back edge. Any upward-pointing edge is a loop candidate.

Ghidra graph view: xor_loop (CFG)
entry:
  XOR ECX,ECX
  JMP check
check (loop header):
  CMP ECX,len
  JGE loop_end          ; loop condition true -> body, false -> loop_end
body:
  MOVZX EAX,[ESI+ECX]
  XOR EAX,key
  MOV [ESI+ECX],AL
  INC ECX
  JMP check             ; back-edge
loop_end:
  RET

ARM32 loop pattern:

; Classic counted loop: for (i=10; i>0; i--)
MOV   R2, #10          ; counter
loop_top:
  ; ... body using R0, R1 ...
  SUBS  R2, R2, #1     ; R2 -= 1; update flags (S suffix)
  BNE   loop_top       ; branch while R2 != 0

The S suffix on ARM instructions causes them to update NZCV flags — this is how ARM avoids a separate CMP before every branch.

Recognizing Conditionals

if / else:

; if (eax == 0) { A } else { B }
test  eax, eax
jnz   else_branch      ; if eax != 0, skip the 'if' body
; --- true branch (A) ---
; ...
jmp   end_if
else_branch:
; --- false branch (B) ---
; ...
end_if:

In Ghidra graph view: the block that ends in the conditional jump has two outgoing edges, one for the taken branch and one for the fall-through. The merge point is where both paths reconverge.

Ghidra graph view: if/else branch (CFG)
condition:
  TEST EAX,EAX
  JNZ else_branch
true_branch (if (eax == 0) body):
  MOV EAX,1
  JMP end_if
else_branch (else body):
  MOV EAX,0
end_if (merge point, both paths rejoin):
  RET

switch statement:

cmp   eax, 5
ja    default_case      ; value > 5 (unsigned): jump to default
jmp   [eax*4 + jump_table]  ; indirect jump through table
jump_table:
dd    case_0, case_1, case_2, case_3, case_4, case_5

Ghidra recognizes jump tables and labels each case. Binary Ninja uses MLIL’s switch construct. If neither tool resolves a JMP [EAX*4 + addr], you are dealing with an obfuscated or dynamically computed jump table.
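
When the tool does not resolve the table, you can read it out yourself. The bounds check gives the entry count (CMP EAX, 5 with JA means six cases) and the scale factor gives the entry width. A minimal Python sketch, assuming 32-bit little-endian absolute addresses; read_jump_table and the sample addresses are hypothetical.

```python
import struct

def read_jump_table(image: bytes, table_offset: int, num_cases: int) -> list:
    """Decode `num_cases` little-endian 32-bit case targets starting at table_offset."""
    return list(struct.unpack_from(f"<{num_cases}I", image, table_offset))

# Hypothetical image: 8 bytes of code followed by a 6-entry table
targets = [0x401020, 0x401030, 0x401040, 0x401050, 0x401060, 0x401070]
image = b"\x90" * 8 + struct.pack("<6I", *targets)

for case, addr in enumerate(read_jump_table(image, 8, 6)):
    print(f"case {case} -> {addr:#x}")
```

Each recovered address can then be marked as code in the disassembler, which usually lets the decompiler rebuild the switch.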

The Decompiler View

Ghidra and Binary Ninja both ship decompilers that convert disassembly to C-like pseudo-code. This is a reconstruction — the type information is guessed. Common pitfalls:

| What you see in decompiler | What it really means |
|---|---|
| *(int *)(param_1 + 0x3c) | Structure field access: the decompiler doesn’t know the struct |
| uVar1 = uVar2 ^ uVar3 | Likely XOR cipher: look at the key value |
| do { ... } while (iVar1 != 0) | A bottom-test loop (do-while) |
| FUN_00401234(...) | Unnamed function: rename it after analysis |
| DAT_00403000 | A global variable: check cross-references |
| (code *)DAT_... | Indirect function call: possible shellcode dispatch table |

Common Malware Patterns

XOR Decryption Loop

The simplest and most common obfuscation. Spot it by a loop with an XOR instruction and a byte-size memory reference:

; x86: XOR decrypt: for (i=0; i<len; i++) buf[i] ^= key[i]   (key stream as long as the buffer)
xor   ecx, ecx          ; i = 0
xor_loop:
  movzx eax, byte [esi + ecx]   ; load ciphertext byte
  movzx edx, byte [edi + ecx]   ; load key byte
  xor   eax, edx                ; decrypt
  mov   [esi + ecx], al         ; store plaintext
  inc   ecx
  cmp   ecx, dword [ebp - 4]    ; compare to length
  jl    xor_loop
; ARM32 equivalent
MOV   R2, #0            ; i = 0
xor_loop:
  LDRB  R0, [R4, R2]   ; load ciphertext byte
  LDRB  R1, [R5, R2]   ; load key byte
  EOR   R0, R0, R1     ; XOR
  STRB  R0, [R4, R2]   ; store
  ADD   R2, R2, #1
  CMP   R2, R6         ; compare to length
  BLT   xor_loop

In Ghidra: Look for the Decompile window showing bVar = *(byte *)(buf + i) ^ *(byte *)(key + i % keyLen) inside a for-loop. Cross-reference the buffer to see where the decrypted data is used next — that reveals the payload type.

Ghidra listing + decompiler: xor_decrypt (split view)
Listing
xor_loop
004010a0  0f b6 04 0e  MOVZX EAX,[ESI+ECX*1]
004010a4  0f b6 14 0f  MOVZX EDX,[EDI+ECX*1]
004010a8  33 c2        XOR   EAX,EDX           ; <- XOR decrypt
004010aa  88 04 0e     MOV   [ESI+ECX],AL
004010ad  41           INC   ECX
004010ae  3b 4d fc     CMP   ECX,[EBP-0x4]
004010b1  7c ed        JL    xor_loop
Decompiler
void xor_decrypt(byte*buf,byte*key,int len)
{
  int i;
  i = 0;
  while (i < len) {
    buf[i] =
      buf[i] ^ key[i];
    i = i + 1;
  }
  return;
}
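
Once you have identified the loop and recovered the key, it is usually quicker to decrypt the blob offline than to step through it. Below is a minimal Python sketch of the repeating-key variant (key[i % keylen]); xor_decrypt here is an illustrative reimplementation, and the sample key and buffer are made up.

```python
def xor_decrypt(buf: bytes, key: bytes) -> bytes:
    """XOR every byte of buf with the key, repeating the key as needed."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(buf))

# XOR is its own inverse, so the same function encrypts and decrypts.
key = b"\x13\x37"
ciphertext = xor_decrypt(b"http://example.invalid/payload", key)
print(xor_decrypt(ciphertext, key))
```

If the listing shows the key pointer indexed with the same counter as the buffer and no modulo, the key is as long as the data; passing a key of equal length to this function covers that case too.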

PEB Walking — API Resolution Without Imports

Shellcode and injected payloads cannot have an import table. They must resolve Win32 API addresses at runtime by walking the Process Environment Block (PEB):

; x86 PEB walk to find kernel32.dll base address
mov   eax, fs:[0x30]      ; EAX = &PEB  (on x86 Windows FS points to the TEB; PEB pointer is at TEB+0x30)
mov   eax, [eax + 0x0C]   ; EAX = PEB->Ldr
mov   eax, [eax + 0x14]   ; EAX = Ldr->InMemoryOrderModuleList.Flink (1st entry: the EXE itself)
mov   eax, [eax]          ; 2nd entry (normally ntdll.dll)
mov   eax, [eax]          ; 3rd entry (normally kernel32.dll)
mov   eax, [eax + 0x10]   ; EAX = DllBase (0x18 in LDR_DATA_TABLE_ENTRY, minus 0x08 for InMemoryOrderLinks)

On x64, the PEB is at gs:[0x60] (not fs):

; x64 PEB walk
mov   rax, gs:[0x60]      ; RAX = &PEB64
mov   rax, [rax + 0x18]   ; PEB->Ldr
mov   rax, [rax + 0x20]   ; Ldr->InMemoryOrderModuleList.Flink
...

In Ghidra: You will see PTR fs:0x30 highlighted in blue (a special segment reference). Binary Ninja annotates the PEB structure fields automatically with the Windows types plugin loaded.

Ghidra listing: PEB walk (x86 shellcode)
resolve_kernel32
00000000  64 a1 30 00 00 00  MOV EAX,FS:[0x30]   ; EAX = &PEB (read from the TEB)
00000006  8b 40 0c           MOV EAX,[EAX+0xc]   ; EAX = PEB.Ldr
00000009  8b 40 14           MOV EAX,[EAX+0x14]  ; EAX = InMemoryOrderModuleList.Flink (EXE entry)
0000000c  8b 00              MOV EAX,[EAX]       ; EAX = ntdll list entry
0000000e  8b 00              MOV EAX,[EAX]       ; EAX = kernel32 list entry
00000010  8b 40 10           MOV EAX,[EAX+0x10]  ; EAX = kernel32.dll base (DllBase)
00000013  89 45 f8           MOV [EBP-0x8],EAX   ; save kernel32 base to local

Anti-Debugging via RDTSC

RDTSC reads the CPU’s timestamp counter (a free-running tick count, not nanoseconds) into EDX:EAX. Malware calls it twice and checks whether the delta is larger than expected, because single-stepping and breakpoints inflate the delay.

rdtsc                  ; first read: EDX:EAX = TSC
mov   esi, eax         ; save low 32 bits
; ... some code ...
rdtsc                  ; second read
sub   eax, esi         ; delta in EAX
cmp   eax, 0x10000     ; threshold
jg    debugger_detected

Ghidra listing: RDTSC anti-debug check
timing_check
004010c0  0f 31           RDTSC                    ; read TSC -> EDX:EAX
004010c2  89 45 f8        MOV   [EBP-0x8],EAX      ; save t1 low dword
004010c5  e8 96 00 00 00  CALL  some_work
004010ca  0f 31           RDTSC                    ; read TSC -> EDX:EAX (t2)
004010cc  2b 45 f8        SUB   EAX,[EBP-0x8]      ; delta = t2 - t1
004010cf  3d 00 00 01 00  CMP   EAX,0x10000        ; threshold ~65 k cycles
004010d4  7f 06           JG    debugger_detected  ; jump if too slow
004010d6  e8 b4 02 00 00  CALL  real_payload
004010db  c3              RET
debugger_detected
004010dc  e8 2f 03 00 00  CALL  decoy_payload      ; mislead analyst
004010e1  c3              RET

ARM equivalent: 32-bit ARM code reads the cycle counter with MRC p15, 0, Rt, c9, c13, 0 (Performance Monitors Cycle Count Register, PMCCNTR, normally privileged); AArch64 code reads the virtual counter with MRS Xt, CNTVCT_EL0.
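
Because the x86 check only uses the low 32 bits of the counter and a signed jump, its behavior at the edges is worth modelling before you decide how to patch it. A minimal Python sketch of that arithmetic; the helper names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(value: int) -> int:
    """Reinterpret a 32-bit pattern as a signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def low32_delta(t1_low: int, t2_low: int) -> int:
    """Model `sub eax, esi` on the low dwords of two RDTSC reads."""
    return (t2_low - t1_low) & MASK32

def jg_taken(delta: int, threshold: int) -> bool:
    """`cmp eax, threshold; jg`: signed comparison."""
    return to_signed32(delta) > to_signed32(threshold)

def ja_taken(delta: int, threshold: int) -> bool:
    """`cmp eax, threshold; ja`: unsigned comparison."""
    return (delta & MASK32) > (threshold & MASK32)

# The subtraction survives a wrap of the low dword between the two reads.
print(hex(low32_delta(0xFFFFFF00, 0x00000100)))
# A huge delta (long pause at a breakpoint) looks negative to the signed JG.
print(jg_taken(0x80000001, 0x10000), ja_taken(0x80000001, 0x10000))
```

The second case is a practical quirk: a check written with JG can miss very long pauses, while one written with JA catches them.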

Shellcode Stub Pattern

Position-independent shellcode must locate its own base address (since it doesn’t know where it will be injected). The classic x86 technique:

call  get_eip          ; push EIP of next instruction
get_eip:
pop   ebx              ; EBX = address of get_eip label
sub   ebx, 5           ; EBX = shellcode base (account for CALL encoding)
; Now EBX + offset = address of any data/code inside the shellcode
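
This stub has a very regular encoding: CALL with a zero displacement is E8 00 00 00 00, and POP r32 is a single byte from 58 to 5F. A rough Python sketch of a scanner built on that, assuming 32-bit x86 shellcode; find_call_pop_getpc is a hypothetical helper and matches on bytes only.

```python
REG32 = ("EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI")

def find_call_pop_getpc(code: bytes) -> list:
    """Find `call $+5` immediately followed by `pop r32`; return (offset, register)."""
    needle = b"\xE8\x00\x00\x00\x00"          # call rel32 with displacement 0
    hits = []
    start = 0
    while (idx := code.find(needle, start)) != -1:
        nxt = idx + len(needle)
        if nxt < len(code) and 0x58 <= code[nxt] <= 0x5F:   # pop r32 = 58+r
            hits.append((idx, REG32[code[nxt] - 0x58]))
        start = idx + 1
    return hits

# nop; call $+5; pop ebx; sub ebx, 5
stub = bytes.fromhex("90e8000000005b83eb05")
print(find_call_pop_getpc(stub))
```

The register it reports is the one to follow: later instructions that add constants to it are addressing data embedded in the shellcode.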

x64 version using RIP-relative LEA:

lea   rbx, [rip]       ; RBX = address of NEXT instruction
; Or more commonly just use [RIP + offset] directly for data references

ARM32: PC-relative loads are the native mechanism:

LDR   R0, [PC, #offset]  ; load value from (PC+8) + offset
; The assembler resolves `offset` so the literal pool is addressed correctly

Quick Reference Tables

x86 / x64 Jump Instructions

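| Signed | Unsigned | Condition | Flags (signed / unsigned) |
|---|---|---|---|
| JE / JZ | JE / JZ | Equal / Zero | ZF=1 |
| JNE / JNZ | JNE / JNZ | Not equal | ZF=0 |
| JL / JNGE | JB / JNAE | Less / Below | SF!=OF / CF=1 |
| JLE / JNG | JBE / JNA | Less-or-equal / Below-or-equal | ZF=1 or SF!=OF / CF=1 or ZF=1 |
| JG / JNLE | JA / JNBE | Greater / Above | ZF=0 and SF=OF / CF=0 and ZF=0 |
| JGE / JNL | JAE / JNB | Greater-or-equal / Above-or-equal | SF=OF / CF=0 |
| JS | | Sign | SF=1 |
| JO | | Overflow | OF=1 |
| JP / JPE | | Parity even | PF=1 |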

ARM Condition Codes (suffix on any instruction)

| Suffix | Meaning | Flags |
|---|---|---|
| EQ | Equal | Z=1 |
| NE | Not equal | Z=0 |
| GT | Signed greater than | Z=0 and N=V |
| LT | Signed less than | N!=V |
| GE | Signed greater-or-equal | N=V |
| LE | Signed less-or-equal | Z=1 or N!=V |
| HI | Unsigned higher | C=1 and Z=0 |
| LO (also CC) | Unsigned lower | C=0 |
| HS (also CS) | Unsigned higher-or-same | C=1 |
| LS | Unsigned lower-or-same | C=0 or Z=1 |
| MI | Minus / negative | N=1 |
| PL | Plus / positive or zero | N=0 |
| VS | Overflow set | V=1 |
| VC | Overflow clear | V=0 |
| AL | Always (default) | |

Register Cheatsheet — What You See in the Disassembler

| Group (color in this article) | x86 | x64 | ARM32 | AArch64 |
|---|---|---|---|---|
| Green: general purpose | EAX EBX ECX EDX ESI EDI | RAX RBX RCX RDX RSI RDI R8-R15 | R0–R11 | X0-X18 |
| Orange: stack/frame/PC | ESP EBP EIP | RSP RBP RIP | SP LR PC | SP LR PC FP |
| Blue: ARM-specific | | | R0-R15 CPSR | X0-X30 NZCV |
| Amber: flags | ZF CF SF OF PF AF DF | ZF CF SF OF PF AF DF | N Z C V | N Z C V |

Size Specifiers — Operand Widths

When a memory operand is ambiguous, the assembler requires an explicit size keyword. These appear constantly in disassembly and tell you the width of the data being read or written:

| Keyword | Bits | Bytes | x86 register aliases | Typical use in disassembly |
|---|---|---|---|---|
| BYTE PTR | 8 | 1 | AL, BL, CL, DL (and R8B–R15B in x64) | Character data, byte flags, single-byte XOR: mov byte ptr [eax], 0 |
| WORD PTR | 16 | 2 | AX, BX, CX, DX (and R8W–R15W in x64) | UTF-16 characters, port I/O, 16-bit struct fields: mov ax, word ptr [ebx+2] |
| DWORD PTR | 32 | 4 | EAX, EBX, ECX, EDX, ESI, EDI | Local int, pointer on x86, HANDLE: cmp dword ptr [ebp-4], 0 |
| QWORD PTR | 64 | 8 | RAX, RBX, RCX, … (x64 only) | Pointer on x64, LONGLONG, timestamp: mov rax, qword ptr [rsi+8] |
| XMMWORD PTR | 128 | 16 | XMM0–XMM15 | SIMD data, AES-NI cipher rounds, fast memcpy: movdqu xmm0, xmmword ptr [rdi] |
| YMMWORD PTR | 256 | 32 | YMM0–YMM15 | AVX bulk operations; rare in malware but present in optimised crypto libraries |

Width as a clue: A loop that accesses BYTE PTR [ESI + ECX] is processing raw bytes — likely a string, shellcode buffer, or cipher stream. Switch to DWORD PTR and you are almost certainly processing an array of integers, pointers, or 32-bit hash values.

C ↔ Assembly Equivalents

Use this table to mentally “lift” disassembly back to high-level intent before reaching for the decompiler:

| C / C++ pattern | x86 / x64 assembly | Notes |
|---|---|---|
| int x = 0; | xor eax, eax; mov [ebp-4], eax | XOR zero is 2 bytes; mov eax, 0 is 5 bytes, so compilers usually pick XOR |
| if (x == 0) | test eax, eax; je label | TEST is shorter than cmp eax, 0; same ZF/SF result |
| if (x != 0) | test eax, eax; jnz label | The “is this pointer null?” pattern |
| if (x < 0) | test eax, eax; js label | Testing sign flag directly; no CMP needed |
| x++ | inc eax | INC leaves CF unchanged: a subtle source of bugs in flag-dependent code |
| x-- | dec eax | Same: DEC leaves CF unchanged |
| x += n | add eax, n | |
| x -= n | sub eax, n | |
| x *= 2 | shl eax, 1 or add eax, eax | |
| x *= 4 | shl eax, 2 or lea eax, [eax*4] | LEA form does not set flags |
| x *= 5 | lea eax, [eax + eax*4] | Classic LEA trick for non-power-of-two multiply; no MUL instruction |
| x *= 9 | lea eax, [eax + eax*8] | Same pattern; watch for these in hash functions |
| x /= 2 (unsigned) | shr eax, 1 | Signed code uses sar eax, 1, usually with a fix-up so negatives round toward zero |
| return x; | mov eax, <value>; ret | Integer and pointer return values are in EAX (x86) / RAX (x64) |
| return (struct large) | Write through the hidden pointer (e.g. [RCX] on x64 Windows, [RDI] on System V), then ret | Large structs returned via hidden pointer passed as extra arg |
| *ptr | [eax], e.g. mov eax, [eax] | Dereference: the pointer value is already in the register |
| ptr->field | [eax + offset] | offset is the struct field’s byte offset from the base |
| array[i] | [base + index*scale] | SIB addressing: scale matches element size (4 for int[]) |
| array[i].field | [base + index*scale + field_offset] | Full SIB + displacement |
| memset(p, 0, n) | xor eax, eax + rep stosd | ECX = count in dwords; EDI = destination |
| memcpy(dst, src, n) | rep movsd (or rep movsb) | ECX = count; ESI = source; EDI = destination |
| strlen(s) | repne scasb with AL=0 | Scans [EDI] for the null byte while ECX counts down from -1; length = NOT ECX, minus 1 |
| x & mask | and eax, mask | Also isolates a single bit when mask is a power of two |
| x \| flag | or eax, flag | Setting a bit without clearing others |
| x ^ key | xor eax, key | The single most common malware operation: encryption, decryption, hash mixing |
| ~x | not eax | Bitwise complement; also seen as neg eax; dec eax |
| (int)(char)x | movsx eax, byte ptr [ebx] | Sign-extend byte to 32-bit int |
| (unsigned int)(unsigned char)x | movzx eax, byte ptr [ebx] | Zero-extend byte: the safe widening idiom |
| switch (x) | jmp [eax*4 + table_addr] | Indirect jump through a jump table |
| virtual->method() | mov ecx, this; mov eax, [ecx]; call [eax + N] | vtable dispatch: first dereference gets the vtable, second gets the slot |
| GetProcAddress(…) reimplemented | hash loop over export names → call [eax] | No import table entry; common in shellcode and packer stubs |

Common Malware Indicator Patterns

Memorise these. When you see one in a binary, treat it as a high-confidence signal of a specific technique:

| What you see in disassembly | What it almost certainly means | How to confirm |
|---|---|---|
| MOV EAX, FS:[0x30] or MOV RAX, GS:[0x60] | PEB access: reading the process environment block for module list, image base, or heap | Followed by chained dereferences ([EAX+0x0C], [EAX+0x14], …) into the Ldr module list |
| XOR on a byte-granularity loop over a buffer | XOR decryption / encryption of embedded payload or config blob | The decrypted buffer is subsequently called into or passed to a second function |
| CALL EAX / JMP EAX or CALL [EAX + N] after a hash-compare loop | Dynamically resolved API call: import table is empty or missing | Trace EAX backwards to a GetProcAddress re-implementation walking the export table |
| RDTSC … work … RDTSC; CMP EAX, threshold; JG | Timing-based anti-debug check | Two RDTSC with same-register subtract; threshold is usually somewhere between 0x10000 and 0x100000 |
| CALL $+5; POP EBX; SUB EBX, 5 (x86) or LEA RBX, [RIP] (x64) | PIC self-location: shellcode finding its own load address | Followed by EBX + offset references to embedded data/code within the shellcode |
| PUSH 0x40; PUSH 0x3000; PUSH size; PUSH 0; CALL VirtualAlloc | RWX memory allocation: staging area for injected shellcode | 0x40 = PAGE_EXECUTE_READWRITE, 0x3000 = MEM_COMMIT \| MEM_RESERVE; look for a subsequent write then transfer of control. NtAllocateVirtualMemory is the six-argument native equivalent |
| MOV EAX, [EBX + 0x3C]; MOV EAX, [EBX + EAX + 0x78] (EBX = module base) | PE export directory walk: hand-rolled GetProcAddress | Classic shellcode technique; 0x3C = e_lfanew (PE header offset) in the DOS header, 0x78 = export directory RVA in PE32 headers (0x88 for PE32+) |
| CPUID → check vendor string or bit 31 of ECX | Hypervisor / VM detection | Malware aborts or switches to benign path when it detects a sandbox |
| MOV EAX, 0x564D5868 ('VMXh'); MOV DX, 0x5658; IN EAX, DX | VMware backdoor I/O port detection | Often wrapped in an SEH try/except: an exception means no VMware; success means VM |
| MOV EAX, LARGE FS:[0x0] (x86 SEH chain head) | SEH chain manipulation: installing a custom exception handler | Malware uses SEH to catch intentional exceptions and redirect control flow |
| INT 3 blocks or 0xCC byte padding inside function body | Debugger trap or anti-attach bait | Malware scans its own code pages for 0xCC bytes inserted by software breakpoints |
| REP STOSD zeroing a region → MOV of bytes → CALL into it | Self-copying / decrypting shellcode followed by execution | The classic stager pattern: payload written to zeroed RWX memory, then jumped into |
| MOV ECX, [ESI] immediately before CALL | C++ method call (this in ECX): thiscall convention | Trace ESI back to a heap allocation or a global object to identify the class |
| MOV EAX, [EAX + 0x20] then name-hash loop | Kernel32 export hash walk (0x20 = AddressOfNames in the export directory) | Compare the hash constant against known hash lists (e.g., ROR-13: 0xEC0E4E8E = LoadLibraryA, 0x7C0DFCAA = GetProcAddress) |
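
To match a hash constant you found against an API name, reimplement the hash and run it over candidate export names. The sketch below assumes the widely used ROR-13 additive hash over the ASCII name without the terminator; other families rotate by different amounts or fold in the DLL name, so treat the function as one candidate among several. The short CANDIDATES list is illustrative.

```python
def ror32(value: int, count: int) -> int:
    """Rotate a 32-bit value right by `count` bits."""
    value &= 0xFFFFFFFF
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF

def ror13_hash(name: str) -> int:
    """ROR-13 additive hash: rotate the accumulator, then add each character."""
    h = 0
    for ch in name.encode("ascii"):
        h = ror32(h, 13)
        h = (h + ch) & 0xFFFFFFFF
    return h

# Illustrative candidate list; in practice, use every export name of the target DLL.
CANDIDATES = ["LoadLibraryA", "GetProcAddress", "VirtualAlloc", "ExitProcess"]
lookup = {ror13_hash(name): name for name in CANDIDATES}

for constant in (0xEC0E4E8E, 0x7C0DFCAA):
    print(hex(constant), "->", lookup.get(constant, "unknown"))
```

Building the lookup table from a real export list turns every opaque 32-bit constant in the resolver into a named API in one pass.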
Written on June 23, 2026

