Salim Bitam

Introduction to Hex- Rays decompilation internals

In this publication, we delve into Hex-Rays microcode and explore techniques for manipulating the generated CTree to deobfuscate and annotate decompiled code.

25 min readMalware analysis
Introduction to Hex-Rays decompilation internals

Introduction

In this publication, we delve into Hex-Rays microcode and explore techniques for manipulating the generated CTree to deobfuscate and annotate decompiled code. The final section includes a practical example demonstrating how to annotate a custom import table for malware analysis.

This guide is meant to help reverse engineers and malware analysts better understand the internal structures used during IDA's function decompilation. We advise keeping an eye on the Hex-Rays SDK that can be found under IDA PRO’s plugins directory, all the structures discussed below are sourced from it.

Architecture

Hex-Rays decompiles a function through a multistage process starting with the disassembled code of a function:

  1. Assembly code to microcode:
    It does a conversion of the assembly instructions that are stored in an insn_t structure to microcode instructions represented by a minsn_t structure

  2. CTree generation:
    From the optimized microcode, Hex-Rays generates the Abstract Syntax Tree(AST), its nodes are either statements (cinsn_t) or expressions (cexpr_t); note that both cinsn_t and cexpr_t inherit from the citem_t structure

Microcode

Microcode is an intermediate language (IL) used by Hex-Rays, generated by lifting the assembly code of a binary. This has multiple advantages, one of which is that it is processor-independent.

The following screenshot displays the assembly and decompiled code, alongside its microcode extracted using Lucid, a tool that facilitates microcode visualization.

We can access the MBA (microcode block array) through the cfunc_t structure of a decompiled function with the MBA field.

Tip: we get the cfunc_t of a decompiled function with the ida_hexrays.decompile.

mba_t is an array of micro blocks mblock_t, the first block represents the entry point of the function and the last one represents the end. Micro blocks (mblock_t) are structured in a double linked list, we can access the next / previous block with nextb/prevb fields respectively. Each mblock_t includes a double linked list of microcode instructions minsn_t, accessed by the field head for the first instruction of the block and tail for the last instruction of the block. The mblock_t structure is depicted in the following code snippet.

class mblock_t
{
//...
public:
  mblock_t *nextb;              ///< next block in the doubly linked list
  mblock_t *prevb;              ///< previous block in the doubly linked list
  uint32 flags;                 ///< combination of \ref MBL_ bits
  ea_t start;                   ///< start address
  ea_t end;                     ///< end address
  minsn_t *head;                ///< pointer to the first instruction of the block
  minsn_t *tail;                ///< pointer to the last instruction of the block
  mba_t *mba;

A microcode instruction minsn_t is a double linked list, each microcode instruction contains 3 operands: left, right, and destination. We can access the next/previous microcode instruction of the same block with next/prev fields; the opcode field is an enumeration (mcode_t) of all the microinstruction opcodes, for example, the m_mov enum represents the mov opcode.

class minsn_t
{
//...
public:
  mcode_t opcode;       ///< instruction opcode enumeration
  int iprops;           ///< combination of \ref IPROP_ bits
  minsn_t *next;        ///< next insn in doubly linked list. check also nexti()
  minsn_t *prev;        ///< prev insn in doubly linked list. check also previ()
  ea_t ea;              ///< instruction address
  mop_t l;              ///< left operand
  mop_t r;              ///< right operand
  mop_t d;              ///< destination operand
 //...

enum mcode_t
{
  m_nop    = 0x00, // nop                       // no operation
  m_stx    = 0x01, // stx  l,    {r=sel, d=off} // store register to memory     
  m_ldx    = 0x02, // ldx  {l=sel,r=off}, d     // load register from memory    
  m_ldc    = 0x03, // ldc  l=const,     d       // load constant
  m_mov    = 0x04, // mov  l,           d       // move                        
  m_neg    = 0x05, // neg  l,           d       // negate
  m_lnot   = 0x06, // lnot l,           d       // logical not
//...
};

Each operand is of type mop_t, depending on the type (accessed with the t field) it can hold registers, immediate values, and even nested microcode instructions. As an example, the following is the microcode of a function with multiple nested instructions:

class mop_t
{
	public:
	  /// Operand type.
	  mopt_t t;
	union
	  {
	    mreg_t r;           // mop_r   register number
	    mnumber_t *nnn;     // mop_n   immediate value
	    minsn_t *d;         // mop_d   result (destination) of another instruction
	    stkvar_ref_t *s;    // mop_S   stack variable
	    ea_t g;             // mop_v   global variable (its linear address)
	    int b;              // mop_b   block number (used in jmp,call instructions)
	    mcallinfo_t *f;     // mop_f   function call information
	    lvar_ref_t *l;      // mop_l   local variable
	    mop_addr_t *a;      // mop_a   variable whose address is taken
	    char *helper;       // mop_h   helper function name
	    char *cstr;         // mop_str utf8 string constant, user representation
	    mcases_t *c;        // mop_c   cases
	    fnumber_t *fpc;     // mop_fn  floating point constant
	    mop_pair_t *pair;   // mop_p   operand pair
	    scif_t *scif;       // mop_sc  scattered operand info
	  };
	#...
}

/// Instruction operand types
typedef uint8 mopt_t;
const mopt_t
  mop_z   = 0,  ///< none
  mop_r   = 1,  ///< register (they exist until MMAT_LVARS)
  mop_n   = 2,  ///< immediate number constant
  mop_str = 3,  ///< immediate string constant (user representation)
  #...

The microcode generation progresses through various maturity levels, also referred to as optimization levels. The initial level, MMAT_GENERATED, involves the direct translation of assembly code into microcode. The final optimization level before generating the CTree is MMAT_LVARS.

enum mba_maturity_t
{
  MMAT_ZERO,         ///< microcode does not exist
  MMAT_GENERATED,    ///< generated microcode
  MMAT_PREOPTIMIZED, ///< preoptimized pass is complete
  MMAT_LOCOPT,       ///< local optimization of each basic block is complete.
                     ///< control flow graph is ready too.
  MMAT_CALLS,        ///< detected call arguments
  MMAT_GLBOPT1,      ///< performed the first pass of global optimization
  MMAT_GLBOPT2,      ///< most global optimization passes are done
  MMAT_GLBOPT3,      ///< completed all global optimization. microcode is fixed now.
  MMAT_LVARS,        ///< allocated local variables
};

Microcode traversal example

The following Python code is used as an example of how to traverse and print the microcode instructions of a function, it traverses the microcode generated at the first maturity level (MMAT_GENERATED).

import idaapi
import ida_hexrays
import ida_lines


MCODE = sorted([(getattr(ida_hexrays, x), x) for x in filter(lambda y: y.startswith('m_'), dir(ida_hexrays))])

def get_mcode_name(mcode):
    """
    Return the name of the given mcode_t.
    """
    for value, name in MCODE:
        if mcode == value:
            return name
    return None


def parse_mop_t(mop):
    if mop.t != ida_hexrays.mop_z:
        return ida_lines.tag_remove(mop._print())
    return ''


def parse_minsn_t(minsn):
    opcode = get_mcode_name(minsn.opcode)
    ea = minsn.ea
    
    text = hex(ea) + " " + opcode
    for mop in [minsn.l, minsn.r, minsn.d]:
        text += ' ' + parse_mop_t(mop)
    print(text)
    
    
def parse_mblock_t(mblock):
    minsn = mblock.head
    while minsn and minsn != mblock.tail:
        parse_minsn_t(minsn)
        minsn = minsn.next
    

def parse_mba_t(mba):
    for i in range(0, mba.qty):
        mblock_n = mba.get_mblock(i)
        parse_mblock_t(mblock_n)


def main():
    func = idaapi.get_func(here()) # Gets the function at the current cursor
    maturity = ida_hexrays.MMAT_GENERATED
    mbr = ida_hexrays.mba_ranges_t(func)
    hf = ida_hexrays.hexrays_failure_t()
    ida_hexrays.mark_cfunc_dirty(func.start_ea)
    mba = ida_hexrays.gen_microcode(mbr, hf, None, ida_hexrays.DECOMP_NO_WAIT, maturity)
    parse_mba_t(mba)


if __name__ == '__main__':
    main()

The script's output is presented below: on the left, the printed microcode in the console, and on the right, the assembly code by IDA:

CTree

In this section, we'll dive into the core elements of Hex-Rays CTree structure, then proceed to a practical example demonstrating how to annotate a custom import table of malware that loads APIs dynamically.

For a better understanding, we will be leveraging the following plugin (hrdevhelper) that allows us to view the CTree nodes in IDA as a graph.

citem_t is an abstract class that is the base for both cinsn_t and cexpr_t, it holds common info like the address, item type and label while also featuring constants like is_expr, contains_expr that can be used to know the type of the object:

struct citem_t
{
  ea_t ea = BADADDR;      ///< address that corresponds to the item. may be BADADDR
  ctype_t op = cot_empty; ///< item type
  int label_num = -1;     ///< label number. -1 means no label. items of the expression
                          ///< types (cot_...) should not have labels at the final maturity
                          ///< level, but at the intermediate levels any ctree item
                          ///< may have a label. Labels must be unique. Usually
                          ///< they correspond to the basic block numbers.
  mutable int index = -1; ///< an index in cfunc_t::treeitems.
                          ///< meaningful only after print_func()
//...

The item type accessed with the op field indicates the type of the node, expression nodes are prefixed with cot_ and the statements nodes are prefixed with cit_, example cot_asg indicates that the node is an assignment expression while cit_if indicates that the node is a condition (if) statement.

Depending on the type of the statement node, a cinsn_t can have a different attribute for example if the item type is cit_if we can access the detail of the condition node through the cif field, as seen in the below snippet, cinsn_t is implemented using a union. Note that a cblock_t is a block statement which is a list of cinsn_t statements, we can find this type for example at the beginning of a function or after a conditional statement.

struct cinsn_t : public citem_t
{
  union
  {
    cblock_t *cblock;   ///< details of block-statement
    cexpr_t *cexpr;     ///< details of expression-statement
    cif_t *cif;         ///< details of if-statement
    cfor_t *cfor;       ///< details of for-statement
    cwhile_t *cwhile;   ///< details of while-statement
    cdo_t *cdo;         ///< details of do-statement
    cswitch_t *cswitch; ///< details of switch-statement
    creturn_t *creturn; ///< details of return-statement
    cgoto_t *cgoto;     ///< details of goto-statement
    casm_t *casm;       ///< details of asm-statement
  };
//...

In the example below, the condition node of type cit_if has two child nodes: the left one is of type cit_block which represents the "True" branch and the right is the condition to evaluate, which is a call to a function, a third child is missing as the condition does not have a "False" branch.

The following is a graph showcasing the statement node cit_if

Find the associated decompilation for the above CTree:

The same logic applies to expressions nodes cexpr_t, depending on the node type, different attributes are available, as an example, a node of type cot_asg has children nodes accessible with the fields x and y.

struct cexpr_t : public citem_t
{
  union
  {
    cnumber_t *n;     ///< used for \ref cot_num
    fnumber_t *fpc;   ///< used for \ref cot_fnum
    struct
    {
      union
      {
        var_ref_t v;  ///< used for \ref cot_var
        ea_t obj_ea;  ///< used for \ref cot_obj
      };
      int refwidth;   ///< how many bytes are accessed? (-1: none)
    };
    struct
    {
      cexpr_t *x;     ///< the first operand of the expression
      union
      {
        cexpr_t *y;   ///< the second operand of the expression
        carglist_t *a;///< argument list (used for \ref cot_call)
        uint32 m;     ///< member offset (used for \ref cot_memptr, \ref cot_memref)
                      ///< for unions, the member number
      };
      union
      {
        cexpr_t *z;   ///< the third operand of the expression
        int ptrsize;  ///< memory access size (used for \ref cot_ptr, \ref cot_memptr)
      };
    };
//...

Finally the cfunc_t structure holds information related to the decompiled function, the function address, the microcode block array, and the CTree accessed with the entry_ea, mba and body fields respectively.

struct cfunc_t
{
  ea_t entry_ea;             ///< function entry address
  mba_t *mba;                   ///< underlying microcode
  cinsn_t body;              ///< function body, must be a block
//...

CTree traversal example

The provided Python code serves as a mini recursive visitor of a CTree, note that it does not handle all node types, the last section will describe how to use the Hex-Rays built-in visitor class ctree_visitor_t. To begin, we obtain the cfunc of the function using ida_hexrays.decompile and access its CTree via the body field.

Next, we check if the node(item) is an expression or a statement. Finally, we can parse the type through the op field and explore its child nodes.

import idaapi
import ida_hexrays

OP_TYPE = sorted([(getattr(ida_hexrays, x), x) for x in filter(lambda y: y.startswith('cit_') or y.startswith('cot_'), dir(ida_hexrays))])


def get_op_name(op):
    """
    Return the name of the given mcode_t.
    """
    for value, name in OP_TYPE:
        if op == value:
            return name
    return None


def explore_ctree(item):
        print(f"item address: {hex(item.ea)}, item opname: {item.opname}, item op: {get_op_name(item.op)}")
        if item.is_expr():
            if item.op == ida_hexrays.cot_asg:
                explore_ctree(item.x) # left side
                explore_ctree(item.y) # right side

            elif item.op == ida_hexrays.cot_call:
                explore_ctree(item.x)
                for a_item in item.a: # call parameters
                    explore_ctree(a_item)

            elif item.op == ida_hexrays.cot_memptr:
                explore_ctree(item.x)
        else:
            if item.op == ida_hexrays.cit_block:
                for i_item in item.cblock: # list of statement nodes
                    explore_ctree(i_item)

            elif item.op == ida_hexrays.cit_expr:
                explore_ctree(item.cexpr)
                
            elif item.op == ida_hexrays.cit_return:
                explore_ctree(item.creturn.expr)
            

def main():
    cfunc = ida_hexrays.decompile(here())
    ctree = cfunc.body
    explore_ctree(ctree)


if __name__ == '__main__':
    main()

Displayed below is the output of the traversal script executed on the start function of a BLISTER sample:

Practical example: annotating the custom import table of a malware sample

Now that we've gained insights into the architecture and structures of the generated CTree, let's delve into a practical application and explore how to automate the annotation of a custom import table of malware.

Hex-Rays provides a utility class ctree_visitor_t that can be used to traverse and modify the CTree, two important virtual methods to know are:

  • visit_insn: to visit a statement
  • visit_expr: to visit an expression

For this example, the same BLISTER sample is used; after locating the function that gets Windows APIs addresses by hash at address 0x7FF8CC3B0926(in the .rsrc section), adding the enumeration to the IDB and applying the enum type to its parameter, we create a class that inherits from ctree_visitor_t, as we are interested in expressions, we will be overriding visit_expr only.

The idea is to locate a cot_call node(1) of the function that resolves APIs by passing the obj_ea address of node’s first child to the function idc.get_name which will return the function name.

   if expr.op == idaapi.cot_call:
            if idc.get_name(expr.x.obj_ea) == self.func_name:
		#...

Next retrieve the enum of the hash by accessing the right parameter of the call node(2), in our case parameter 3.

    carg_1 = expr.a[HASH_ENUM_INDEX]
    api_name = ida_lines.tag_remove(carg_1.cexpr.print1(None))  # Get API name

The next step is to locate the variable that has been assigned the address value of the WinAPI function. To do that we first need to locate the cot_asg node(3), parent of the call node by using the find_parent_of method under cfunc.body of the decompiled function.

    asg_expr = self.cfunc.body.find_parent_of(expr)  # Get node parent

Finally, we can access the first child node(4) under the cot_asg node, which is of type cot_var and get the current variable name, the Hex-Rays API ida_hexrays.rename_lvar is used to rename the new variable with the Windows API name taken from the enum parameter.

This process can ultimately save a significant amount of time for an analyst. Instead of spending time on relabeling variables, they can direct their attention to the core functionality. An understanding of how CTrees work can contribute to the development of more effective plugins, enabling the handling of more complex obfuscations.

For a complete understanding and context of the example, please find the entire code below:

import idaapi
import ida_hexrays
import idc
import ida_lines
import random
import string

HASH_ENUM_INDEX = 2


def generate_random_string(length):
    letters = string.ascii_letters
    return "".join(random.choice(letters) for _ in range(length))


class ctree_visitor(ida_hexrays.ctree_visitor_t):
    def __init__(self, cfunc):
        ida_hexrays.ctree_visitor_t.__init__(self, ida_hexrays.CV_FAST)
        self.cfunc = cfunc
        self.func_name = "sub_7FF8CC3B0926"# API resolution function name

    def visit_expr(self, expr):
        if expr.op == idaapi.cot_call:
            if idc.get_name(expr.x.obj_ea) == self.func_name:
                carg_1 = expr.a[HASH_ENUM_INDEX]
                api_name = ida_lines.tag_remove(
                    carg_1.cexpr.print1(None)
                )  # Get API name
                expr_parent = self.cfunc.body.find_parent_of(expr)  # Get node parent

                # find asg node
                while expr_parent.op != idaapi.cot_asg:
                    expr_parent = self.cfunc.body.find_parent_of(expr_parent)

                if expr_parent.cexpr.x.op == idaapi.cot_var:
                    lvariable_old_name = (
                        expr_parent.cexpr.x.v.getv().name
                    )  # get name of variable
                    ida_hexrays.rename_lvar(
                        self.cfunc.entry_ea, lvariable_old_name, api_name
                    ) # rename variable
        return 0


def main():
    cfunc = idaapi.decompile(idc.here())
    v = ctree_visitor(cfunc)
    v.apply_to(cfunc.body, None)


if __name__ == "__main__":
    main()

Conclusion

Concluding our exploration into Hex-Rays microcode and CTree generation, we've gained practical techniques for navigating the complexities of malware obfuscation. The ability to modify Hex-Rays pseudo code allows us to cut through obfuscation like Control Flow Obfuscation, remove dead code, and many more. The Hex-Rays C++ SDK emerges as a valuable resource, offering well-documented guidance for future reference.

We hope that this guide will be helpful to fellow researchers and any avid learner, please find all the scripts in our research repository.

Resources