Introduction
In this publication, we delve into Hex-Rays microcode and explore techniques for manipulating the generated CTree to deobfuscate and annotate decompiled code. The final section includes a practical example demonstrating how to annotate a custom import table for malware analysis.
This guide is meant to help reverse engineers and malware analysts better understand the internal structures used during IDA's function decompilation. We advise keeping an eye on the Hex-Rays SDK that can be found under IDA PRO’s plugins directory, all the structures discussed below are sourced from it.
Architecture
Hex-Rays decompiles a function through a multistage process starting with the disassembled code of a function:
-
Assembly code to microcode:
It does a conversion of the assembly instructions that are stored in aninsn_t
structure to microcode instructions represented by aminsn_t
structure -
CTree generation:
From the optimized microcode, Hex-Rays generates the Abstract Syntax Tree(AST), its nodes are either statements (cinsn_t
) or expressions (cexpr_t
); note that bothcinsn_t
andcexpr_t
inherit from thecitem_t
structure
Microcode
Microcode is an intermediate language (IL) used by Hex-Rays, generated by lifting the assembly code of a binary. This has multiple advantages, one of which is that it is processor-independent.
The following screenshot displays the assembly and decompiled code, alongside its microcode extracted using Lucid, a tool that facilitates microcode visualization.
We can access the MBA (microcode block array) through the cfunc_t
structure of a decompiled function with the MBA field.
Tip: we get the cfunc_t
of a decompiled function with the ida_hexrays.decompile
.
mba_t
is an array of micro blocks mblock_t
, the first block represents the entry point of the function and the last one represents the end. Micro blocks (mblock_t
) are structured in a double linked list, we can access the next / previous block with nextb
/prevb
fields respectively. Each mblock_t
includes a double linked list of microcode instructions minsn_t
, accessed by the field head
for the first instruction of the block and tail
for the last instruction of the block. The mblock_t
structure is depicted in the following code snippet.
class mblock_t
{
//...
public:
mblock_t *nextb; ///< next block in the doubly linked list
mblock_t *prevb; ///< previous block in the doubly linked list
uint32 flags; ///< combination of \ref MBL_ bits
ea_t start; ///< start address
ea_t end; ///< end address
minsn_t *head; ///< pointer to the first instruction of the block
minsn_t *tail; ///< pointer to the last instruction of the block
mba_t *mba;
A microcode instruction minsn_t
is a double linked list, each microcode instruction contains 3 operands: left, right, and destination. We can access the next/previous microcode instruction of the same block with next
/prev
fields; the opcode field is an enumeration (mcode_t
) of all the microinstruction opcodes, for example, the m_mov
enum represents the mov
opcode.
class minsn_t
{
//...
public:
mcode_t opcode; ///< instruction opcode enumeration
int iprops; ///< combination of \ref IPROP_ bits
minsn_t *next; ///< next insn in doubly linked list. check also nexti()
minsn_t *prev; ///< prev insn in doubly linked list. check also previ()
ea_t ea; ///< instruction address
mop_t l; ///< left operand
mop_t r; ///< right operand
mop_t d; ///< destination operand
//...
enum mcode_t
{
m_nop = 0x00, // nop // no operation
m_stx = 0x01, // stx l, {r=sel, d=off} // store register to memory
m_ldx = 0x02, // ldx {l=sel,r=off}, d // load register from memory
m_ldc = 0x03, // ldc l=const, d // load constant
m_mov = 0x04, // mov l, d // move
m_neg = 0x05, // neg l, d // negate
m_lnot = 0x06, // lnot l, d // logical not
//...
};
Each operand is of type mop_t
, depending on the type (accessed with the t
field) it can hold registers, immediate values, and even nested microcode instructions. As an example, the following is the microcode of a function with multiple nested instructions:
class mop_t
{
public:
/// Operand type.
mopt_t t;
union
{
mreg_t r; // mop_r register number
mnumber_t *nnn; // mop_n immediate value
minsn_t *d; // mop_d result (destination) of another instruction
stkvar_ref_t *s; // mop_S stack variable
ea_t g; // mop_v global variable (its linear address)
int b; // mop_b block number (used in jmp,call instructions)
mcallinfo_t *f; // mop_f function call information
lvar_ref_t *l; // mop_l local variable
mop_addr_t *a; // mop_a variable whose address is taken
char *helper; // mop_h helper function name
char *cstr; // mop_str utf8 string constant, user representation
mcases_t *c; // mop_c cases
fnumber_t *fpc; // mop_fn floating point constant
mop_pair_t *pair; // mop_p operand pair
scif_t *scif; // mop_sc scattered operand info
};
#...
}
/// Instruction operand types
typedef uint8 mopt_t;
const mopt_t
mop_z = 0, ///< none
mop_r = 1, ///< register (they exist until MMAT_LVARS)
mop_n = 2, ///< immediate number constant
mop_str = 3, ///< immediate string constant (user representation)
#...
The microcode generation progresses through various maturity levels, also referred to as optimization levels. The initial level, MMAT_GENERATED
, involves the direct translation of assembly code into microcode. The final optimization level before generating the CTree is MMAT_LVARS
.
enum mba_maturity_t
{
MMAT_ZERO, ///< microcode does not exist
MMAT_GENERATED, ///< generated microcode
MMAT_PREOPTIMIZED, ///< preoptimized pass is complete
MMAT_LOCOPT, ///< local optimization of each basic block is complete.
///< control flow graph is ready too.
MMAT_CALLS, ///< detected call arguments
MMAT_GLBOPT1, ///< performed the first pass of global optimization
MMAT_GLBOPT2, ///< most global optimization passes are done
MMAT_GLBOPT3, ///< completed all global optimization. microcode is fixed now.
MMAT_LVARS, ///< allocated local variables
};
Microcode traversal example
The following Python code is used as an example of how to traverse and print the microcode instructions of a function, it traverses the microcode generated at the first maturity level (MMAT_GENERATED
).
import idaapi
import ida_hexrays
import ida_lines
MCODE = sorted([(getattr(ida_hexrays, x), x) for x in filter(lambda y: y.startswith('m_'), dir(ida_hexrays))])
def get_mcode_name(mcode):
"""
Return the name of the given mcode_t.
"""
for value, name in MCODE:
if mcode == value:
return name
return None
def parse_mop_t(mop):
if mop.t != ida_hexrays.mop_z:
return ida_lines.tag_remove(mop._print())
return ''
def parse_minsn_t(minsn):
opcode = get_mcode_name(minsn.opcode)
ea = minsn.ea
text = hex(ea) + " " + opcode
for mop in [minsn.l, minsn.r, minsn.d]:
text += ' ' + parse_mop_t(mop)
print(text)
def parse_mblock_t(mblock):
minsn = mblock.head
while minsn and minsn != mblock.tail:
parse_minsn_t(minsn)
minsn = minsn.next
def parse_mba_t(mba):
for i in range(0, mba.qty):
mblock_n = mba.get_mblock(i)
parse_mblock_t(mblock_n)
def main():
func = idaapi.get_func(here()) # Gets the function at the current cursor
maturity = ida_hexrays.MMAT_GENERATED
mbr = ida_hexrays.mba_ranges_t(func)
hf = ida_hexrays.hexrays_failure_t()
ida_hexrays.mark_cfunc_dirty(func.start_ea)
mba = ida_hexrays.gen_microcode(mbr, hf, None, ida_hexrays.DECOMP_NO_WAIT, maturity)
parse_mba_t(mba)
if __name__ == '__main__':
main()
The script's output is presented below: on the left, the printed microcode in the console, and on the right, the assembly code by IDA:
CTree
In this section, we'll dive into the core elements of Hex-Rays CTree structure, then proceed to a practical example demonstrating how to annotate a custom import table of malware that loads APIs dynamically.
For a better understanding, we will be leveraging the following plugin (hrdevhelper) that allows us to view the CTree nodes in IDA as a graph.
citem_t
is an abstract class that is the base for both cinsn_t
and cexpr_t
, it holds common info like the address, item type and label while also featuring constants like is_expr
, contains_expr
that can be used to know the type of the object:
struct citem_t
{
ea_t ea = BADADDR; ///< address that corresponds to the item. may be BADADDR
ctype_t op = cot_empty; ///< item type
int label_num = -1; ///< label number. -1 means no label. items of the expression
///< types (cot_...) should not have labels at the final maturity
///< level, but at the intermediate levels any ctree item
///< may have a label. Labels must be unique. Usually
///< they correspond to the basic block numbers.
mutable int index = -1; ///< an index in cfunc_t::treeitems.
///< meaningful only after print_func()
//...
The item type accessed with the op
field indicates the type of the node, expression nodes are prefixed with cot_
and the statements nodes are prefixed with cit_
, example cot_asg
indicates that the node is an assignment expression while cit_if
indicates that the node is a condition (if) statement.
Depending on the type of the statement node, a cinsn_t
can have a different attribute for example if the item type is cit_if
we can access the detail of the condition node through the cif
field, as seen in the below snippet, cinsn_t
is implemented using a union. Note that a cblock_t
is a block statement which is a list of cinsn_t
statements, we can find this type for example at the beginning of a function or after a conditional statement.
struct cinsn_t : public citem_t
{
union
{
cblock_t *cblock; ///< details of block-statement
cexpr_t *cexpr; ///< details of expression-statement
cif_t *cif; ///< details of if-statement
cfor_t *cfor; ///< details of for-statement
cwhile_t *cwhile; ///< details of while-statement
cdo_t *cdo; ///< details of do-statement
cswitch_t *cswitch; ///< details of switch-statement
creturn_t *creturn; ///< details of return-statement
cgoto_t *cgoto; ///< details of goto-statement
casm_t *casm; ///< details of asm-statement
};
//...
In the example below, the condition node of type cit_if
has two child nodes: the left one is of type cit_block
which represents the "True" branch and the right is the condition to evaluate, which is a call to a function, a third child is missing as the condition does not have a "False" branch.
The following is a graph showcasing the statement node cit_if
Find the associated decompilation for the above CTree:
The same logic applies to expressions nodes cexpr_t
, depending on the node type, different attributes are available, as an example, a node of type cot_asg
has children nodes accessible with the fields x
and y
.
struct cexpr_t : public citem_t
{
union
{
cnumber_t *n; ///< used for \ref cot_num
fnumber_t *fpc; ///< used for \ref cot_fnum
struct
{
union
{
var_ref_t v; ///< used for \ref cot_var
ea_t obj_ea; ///< used for \ref cot_obj
};
int refwidth; ///< how many bytes are accessed? (-1: none)
};
struct
{
cexpr_t *x; ///< the first operand of the expression
union
{
cexpr_t *y; ///< the second operand of the expression
carglist_t *a;///< argument list (used for \ref cot_call)
uint32 m; ///< member offset (used for \ref cot_memptr, \ref cot_memref)
///< for unions, the member number
};
union
{
cexpr_t *z; ///< the third operand of the expression
int ptrsize; ///< memory access size (used for \ref cot_ptr, \ref cot_memptr)
};
};
//...
Finally the cfunc_t
structure holds information related to the decompiled function, the function address, the microcode block array, and the CTree accessed with the entry_ea
, mba
and body
fields respectively.
struct cfunc_t
{
ea_t entry_ea; ///< function entry address
mba_t *mba; ///< underlying microcode
cinsn_t body; ///< function body, must be a block
//...
CTree traversal example
The provided Python code serves as a mini recursive visitor of a CTree, note that it does not handle all node types, the last section will describe how to use the Hex-Rays built-in visitor class ctree_visitor_t
. To begin, we obtain the cfunc
of the function using ida_hexrays.decompile
and access its CTree via the body
field.
Next, we check if the node(item) is an expression or a statement. Finally, we can parse the type through the op
field and explore its child nodes.
import idaapi
import ida_hexrays
OP_TYPE = sorted([(getattr(ida_hexrays, x), x) for x in filter(lambda y: y.startswith('cit_') or y.startswith('cot_'), dir(ida_hexrays))])
def get_op_name(op):
"""
Return the name of the given mcode_t.
"""
for value, name in OP_TYPE:
if op == value:
return name
return None
def explore_ctree(item):
print(f"item address: {hex(item.ea)}, item opname: {item.opname}, item op: {get_op_name(item.op)}")
if item.is_expr():
if item.op == ida_hexrays.cot_asg:
explore_ctree(item.x) # left side
explore_ctree(item.y) # right side
elif item.op == ida_hexrays.cot_call:
explore_ctree(item.x)
for a_item in item.a: # call parameters
explore_ctree(a_item)
elif item.op == ida_hexrays.cot_memptr:
explore_ctree(item.x)
else:
if item.op == ida_hexrays.cit_block:
for i_item in item.cblock: # list of statement nodes
explore_ctree(i_item)
elif item.op == ida_hexrays.cit_expr:
explore_ctree(item.cexpr)
elif item.op == ida_hexrays.cit_return:
explore_ctree(item.creturn.expr)
def main():
cfunc = ida_hexrays.decompile(here())
ctree = cfunc.body
explore_ctree(ctree)
if __name__ == '__main__':
main()
Displayed below is the output of the traversal script executed on the start
function of a BLISTER sample:
Practical example: annotating the custom import table of a malware sample
Now that we've gained insights into the architecture and structures of the generated CTree, let's delve into a practical application and explore how to automate the annotation of a custom import table of malware.
Hex-Rays provides a utility class ctree_visitor_t
that can be used to traverse and modify the CTree, two important virtual methods to know are:
visit_insn
: to visit a statementvisit_expr
: to visit an expression
For this example, the same BLISTER sample is used; after locating the function that gets Windows APIs addresses by hash at address 0x7FF8CC3B0926(in the .rsrc section), adding the enumeration to the IDB and applying the enum type to its parameter, we create a class that inherits from ctree_visitor_t
, as we are interested in expressions, we will be overriding visit_expr
only.
The idea is to locate a cot_call
node(1) of the function that resolves APIs by passing the obj_ea
address of node’s first child to the function idc.get_name
which will return the function name.
if expr.op == idaapi.cot_call:
if idc.get_name(expr.x.obj_ea) == self.func_name:
#...
Next retrieve the enum of the hash by accessing the right parameter of the call node(2), in our case parameter 3.
carg_1 = expr.a[HASH_ENUM_INDEX]
api_name = ida_lines.tag_remove(carg_1.cexpr.print1(None)) # Get API name
The next step is to locate the variable that has been assigned the address value of the WinAPI function. To do that we first need to locate the cot_asg
node(3), parent of the call node by using the find_parent_of
method under cfunc.body
of the decompiled function.
asg_expr = self.cfunc.body.find_parent_of(expr) # Get node parent
Finally, we can access the first child node(4) under the cot_asg
node, which is of type cot_var
and get the current variable name, the Hex-Rays API ida_hexrays.rename_lvar
is used to rename the new variable with the Windows API name taken from the enum parameter.
This process can ultimately save a significant amount of time for an analyst. Instead of spending time on relabeling variables, they can direct their attention to the core functionality. An understanding of how CTrees work can contribute to the development of more effective plugins, enabling the handling of more complex obfuscations.
For a complete understanding and context of the example, please find the entire code below:
import idaapi
import ida_hexrays
import idc
import ida_lines
import random
import string
HASH_ENUM_INDEX = 2
def generate_random_string(length):
letters = string.ascii_letters
return "".join(random.choice(letters) for _ in range(length))
class ctree_visitor(ida_hexrays.ctree_visitor_t):
def __init__(self, cfunc):
ida_hexrays.ctree_visitor_t.__init__(self, ida_hexrays.CV_FAST)
self.cfunc = cfunc
self.func_name = "sub_7FF8CC3B0926"# API resolution function name
def visit_expr(self, expr):
if expr.op == idaapi.cot_call:
if idc.get_name(expr.x.obj_ea) == self.func_name:
carg_1 = expr.a[HASH_ENUM_INDEX]
api_name = ida_lines.tag_remove(
carg_1.cexpr.print1(None)
) # Get API name
expr_parent = self.cfunc.body.find_parent_of(expr) # Get node parent
# find asg node
while expr_parent.op != idaapi.cot_asg:
expr_parent = self.cfunc.body.find_parent_of(expr_parent)
if expr_parent.cexpr.x.op == idaapi.cot_var:
lvariable_old_name = (
expr_parent.cexpr.x.v.getv().name
) # get name of variable
ida_hexrays.rename_lvar(
self.cfunc.entry_ea, lvariable_old_name, api_name
) # rename variable
return 0
def main():
cfunc = idaapi.decompile(idc.here())
v = ctree_visitor(cfunc)
v.apply_to(cfunc.body, None)
if __name__ == "__main__":
main()
Conclusion
Concluding our exploration into Hex-Rays microcode and CTree generation, we've gained practical techniques for navigating the complexities of malware obfuscation. The ability to modify Hex-Rays pseudo code allows us to cut through obfuscation like Control Flow Obfuscation, remove dead code, and many more. The Hex-Rays C++ SDK emerges as a valuable resource, offering well-documented guidance for future reference.
We hope that this guide will be helpful to fellow researchers and any avid learner, please find all the scripts in our research repository.