How to emulate an executable with Medusa and Python (part 0)

Motivations

Emulating an executable can be a useful way to analyze it, since it allows us to:

  • control what the target can access by managing memory, API, etc;
  • modify the execution on-the-fly;
  • monitor its behavior;
  • and so on.

In this case, the target is ls and the host machine runs Windows. I'm perfectly aware it'd be easier to install cygwin (which I already did), but the aim here is to show how we can emulate a program, not to get rid of dir. ls is a good example to demonstrate how to emulate an executable with Medusa because this kind of executable requires a lot of interaction with the system. You can find the version of ls that was used here.

This example requires at least the git version 6f084016826604d387ad6106e3e64e4f7aff48aa of Medusa to work. <spoiler_alert>The final script is available at this address</spoiler_alert>.

How does it work?

Processor and memory

Medusa relies on YAML files to describe each instruction. Most of them also contain a specific field named semantic, for instance for the instruction add:

- opcode: 0x00
  mnemonic: add
  operand: [ Eb, Gb ]
  update_flags: [ cf, pf, af, zf, sf, of ]
  semantic: *add

The semantic field references the YAML anchor add, which gives us:

add: &add |
  alloc_var('res', op0.bit);
  res.val = op0.val + op1.val;
  call('overflow_flag_add');
  call('carry_flag_add');
  call('sign_flag');
  call('zero_flag');
  call('parity_flag');
  call('adjust_flag');
  op0.val = res.val;
  free_var('res');

This code is actually written in Python because it's easy to parse with the ast module. But Medusa is written in C++, so we use a Python script to convert this code into C++ that calls the Medusa API to build Expression classes. An Expression class represents our IR (intermediate representation); it can express behaviors such as assignment, unary operation, binary operation, etc. Thus, from a given instruction, we can retrieve its behavior and then apply it to a processor and memory context (since both are required to actually run a program).
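To give a feel for why the ast module makes this semantic code easy to consume, here is a minimal sketch that parses a shortened version of the add semantic and lists its statements. It is plain Python 3.8+ and independent of Medusa: the describe and summarize helpers are hypothetical names made up for this example, and the real conversion script (which emits C++) is not shown here.

```python
import ast

# Shortened copy of the add semantic shown above.
SEMANTIC = """
alloc_var('res', op0.bit);
res.val = op0.val + op1.val;
call('carry_flag_add');
op0.val = res.val;
free_var('res');
"""

def describe(node):
    """Render the small subset of expression nodes used above as text."""
    if isinstance(node, ast.Attribute):
        return describe(node.value) + "." + node.attr
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        return "(%s + %s)" % (describe(node.left), describe(node.right))
    raise NotImplementedError(type(node).__name__)

def summarize(source):
    """Return one tuple per statement: an assignment or a helper call."""
    out = []
    for stmt in ast.parse(source).body:
        if isinstance(stmt, ast.Assign):
            out.append(("assign", describe(stmt.targets[0]), describe(stmt.value)))
        elif isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call):
            call = stmt.value
            out.append(("call", call.func.id, [describe(a) for a in call.args]))
    return out

for entry in summarize(SEMANTIC):
    print(entry)
```

Each tuple maps naturally to one IR node (an assignment or a call to a helper), which is roughly the kind of information a generator needs before it can emit code.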

Alright, this is cool, but not enough.

Environment

If only the processor and memory are emulated, what will happen when we try to execute our program? It's likely to get stuck on the first imported API (or syscall or whatever). Since there's neither a loader nor any loaded module, the program will try to use an unmapped address and then stop. To address this issue, it's important to fill these imported addresses (like a loader would do) to give the program something to actually execute. There are different solutions to handle this.

  • Load the required modules (recursively); pro: the emulation is closer to reality; con: it's slower and more complicated, and you'll end up at a syscall execution, which is usually more challenging to handle.
  • Write a unique dummy value for each entry and set up a breakpoint on each of these values; once a breakpoint is hit, you're free to modify the processor and memory contexts to emulate the function behavior; pro: easier to implement, quicker, finer control over the execution; con: a bit hackish, and some malware could rely on loaded modules to retrieve information.

At the time of writing, only the second method is handled by Medusa.
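The second method can be illustrated with a toy model, independent of Medusa. In this sketch, the ImportHooks class, the FAKE_BASE address and the dictionary-based CPU and import table are all assumptions made for the example; it only shows the idea of patching each import slot with a unique dummy address and dispatching on it.

```python
FAKE_BASE = 0xF0000000  # hypothetical, assumed to be outside any mapped region
PTR_SIZE = 8

class ImportHooks(object):
    """Toy model of dummy-address hooking; not Medusa's actual implementation."""

    def __init__(self):
        self.handlers = {}          # dummy address -> (symbol name, handler)
        self.next_addr = FAKE_BASE

    def hook(self, import_table, name, handler):
        """Patch the import slot for `name` with a fresh dummy address."""
        addr = self.next_addr
        self.next_addr += PTR_SIZE
        import_table[name] = addr
        self.handlers[addr] = (name, handler)
        return addr

    def on_fetch(self, cpu):
        """Call before executing at cpu['rip']; returns the hooked name or None."""
        entry = self.handlers.get(cpu["rip"])
        if entry is None:
            return None
        name, handler = entry
        handler(cpu)
        return name

def fake_getpid(cpu):
    cpu["rax"] = 1234  # arbitrary value chosen for the example

got = {"getpid": 0, "malloc": 0}  # stand-in for the real import table
hooks = ImportHooks()
hooks.hook(got, "getpid", fake_getpid)

cpu = {"rip": got["getpid"], "rax": 0}  # the program jumped through the slot
hooked = hooks.on_fetch(cpu)
print(hooked, cpu["rax"])
```

Because every slot gets its own dummy value, the address alone is enough to tell which imported symbol was called.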

Pydusa

Pydusa is the Python binding for Medusa. Even though not all of the API is accessible yet, it will be enough for this tutorial.

Executable loading

First step is to load the executable:

import sys
import pydusa

exec_path = sys.argv[1]
db_path = exec_path + '.mdb'

pydusa.log_to_stdout()

core = pydusa.Medusa()
core.open_exe(exec_path, db_path, False)
core.wait_for_tasks()
doc = core.document

The beginning is pretty straightforward. Let's take a look at the parameters of the method Medusa.open_exe: the first parameter is the executable path; the next parameter is the path for a database (this feature is still under development, but we have to use it anyway to avoid breaking the method signature later); the last parameter is a boolean which tells whether you want Medusa to try to analyze the executable or not. Since we only want to emulate it, we don't actually need the analysis, so False saves us time. :) Finally, we call the method Medusa.wait_for_tasks to be sure there's no task left in the queue.

Now we can initialize the emulator:

exe = pydusa.Execution(doc)
exe.init([exec_path.replace('\\', '/'), '-l', 'C:\\'], [ ], '')
exe.set_emulator('interpreter')

This class requires a Document instance, then we have to call the method Execution.init to initialize the program arguments, the environment and the current working directory. In this sample, we set argv[0] to the current program name, replacing \ with / because of the difference between Windows and UNIX paths. We also pass the option -l to get more details and finally tell ls to list files from C:\. This time, we'll rely on the interpreter emulator, but you're free to try the LLVM back-end.

As mentioned previously, we need to handle imported functions carefully. To do so, we'll register a basic handler that displays the symbol name. That way, we'll be able to add a specific handler for each called symbol:

def on_unk_api(cpu, mem, ad):
    global exe
    print 'unhandled function: %s' % exe.get_hook_name()

    return STOP

for lbl in doc.labels:
    if not lbl.is_imp:
        continue
    print '[!] default handler for imported symbol: "%s"' % lbl.name
    exe.hook_fn(lbl.name, on_unk_api)

Handlers will be explained later, don't worry.

You may want to also display information for each instruction, so we add a hook for each instruction:

def on_insn(cpu, mem, ad):
    global core

    print '_' * 80
    print cpu
    insn = core.disasm_cur_insn(cpu, mem)
    print core.fmt_cell(ad, insn)

    return CONTINUE

exe.hook_insn(on_insn)

Execution.hook_insn slows down the execution, so use it with care.

Finally, we can start the execution:

start_addr = doc.get_label_addr('start')
exe.execute(start_addr)

Execution.execute takes a pydusa.Address as parameter instead of a simple int.

Imported functions

__libc_start_main

The program should not run far... As previously mentioned, an executable relies (mostly) on imported functions. For instance, almost every Linux executable has to call the function __libc_start_main in order to run constructors / destructors and other initialization. It's also responsible for calling the real main function (the one you write in C/C++). To reproduce its behavior, we'll simply redirect the RIP register to the main function and ignore both constructor and destructor routines.

def on_libc_start_main(cpu, mem, ad):
    global exe
    # ref: https://refspecs.linuxbase.org/LSB_3.0.0/LSB-PDA/LSB-PDA.junk/baselib---libc-start-main-.html
    main      = exe.get_func_param('system_v', 0)
    argc      = exe.get_func_param('system_v', 1)
    ubp_av    = exe.get_func_param('system_v', 2)
    init      = exe.get_func_param('system_v', 3)
    fini      = exe.get_func_param('system_v', 4)
    rtld_fini = exe.get_func_param('system_v', 5)
    stack_end = exe.get_func_param('system_v', 6)

    print '__libc_start_main(%016x, %d, %016x, %016x, %016x, %016x, %016x)' % (main, argc, ubp_av, init, fini, rtld_fini, stack_end)

    # there's no set_func_param function yet
    cpu.rdi = argc
    cpu.rsi = ubp_av
    cpu.rdx = 0 # environ is nullptr

    cpu.rip   = main

    return BREAK

First we have to define our own __libc_start_main function, named on_libc_start_main; it's simply a Python function here. Now let's retrieve the parameters. You could retrieve them using the processor and memory contexts, but here it's easier to use the method Execution.get_func_param since we know the calling convention of this function. The exe variable is declared as global to get the one which was previously instantiated. Here, all parameters are retrieved and printed; of course, it's not mandatory. Unfortunately, there's no set_func_param method to set parameters for the main function, so these parameters are set manually. RIP now contains the address of the main function to emulate a jump instruction. Finally we return BREAK to tell the execution engine to stop the execution of this basic block. Actually, CONTINUE will also work since this function is called using a call or jmp instruction, which marks the end of the current basic block.
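For readers unfamiliar with the System V AMD64 calling convention, the lookup behind a helper such as get_func_param can be sketched as follows. This is a standalone illustration, not Medusa's code: sysv_param is a hypothetical name, only integer and pointer parameters are covered, and the hook is assumed to fire at function entry, when the return address is still on top of the stack.

```python
import struct

# Integer and pointer parameters go through these registers first.
SYSV_INT_REGS = ("rdi", "rsi", "rdx", "rcx", "r8", "r9")

def sysv_param(regs, stack, index):
    """Return integer/pointer parameter `index` at function entry.

    `stack` holds the bytes starting at rsp, so stack[0:8] is the return
    address pushed by the call instruction.
    """
    if index < len(SYSV_INT_REGS):
        return regs[SYSV_INT_REGS[index]]
    offset = 8 + (index - len(SYSV_INT_REGS)) * 8
    return struct.unpack_from("<Q", stack, offset)[0]

# Register values taken from the trace of the call to __libc_start_main.
regs = {"rdi": 0x408010, "rsi": 3, "rdx": 0xBEDD4FA8,
        "rcx": 0x411820, "r8": 0x411810, "r9": 0}
# Made-up stack content: a return address followed by one stack parameter.
stack = struct.pack("<QQ", 0x4022D5, 0xBEDD4F98)

print([hex(sysv_param(regs, stack, i)) for i in range(7)])
```

The first six values come straight from registers; only the seventh parameter requires a memory read.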

Once the fake function is defined, it's important to register it as a callback for the targeted function, which requires a call to Execution.hook_fn:

assert(exe.hook_fn('__libc_start_main', on_libc_start_main))

Internally, this method will write a dummy address in the GOT location for __libc_start_main and then set a breakpoint on this dummy address.

xor              ebp, ebp
mov              r9, rdx
pop              rsi
mov              rdx, rsp
and              rsp, 0xFFFFFFFFFFFFFFF0
push             rax
push             rsp
mov              r8, 0x00411810
mov              rcx, 0x00411820
mov              rdi, 0x00408010
call             0x004022D0
________________________________________________________________________________
rax: 0000000000000000 rbx:    0000000000000000 rcx: 0000000000411820 rdx: 00000000bedd4fa8
rsi: 0000000000000003 rdi:    0000000000408010 rsp: 00000000bedd4f88 rbp: 0000000000000000
r8:  0000000000411810 r9:     0000000000000000 r10: 0000000000000000 r11: 0000000000000000
r12: 0000000000000000 r13:    0000000000000000 r14: 0000000000000000 r15: 0000000000000000
rip: 00000000004022d0 rflags: cPazstido
xmm0:  00000000000000000000000000000000 xmm1:  00000000000000000000000000000000
xmm2:  00000000000000000000000000000000 xmm3:  00000000000000000000000000000000
xmm4:  00000000000000000000000000000000 xmm5:  00000000000000000000000000000000
xmm6:  00000000000000000000000000000000 xmm7:  00000000000000000000000000000000
xmm8:  00000000000000000000000000000000 xmm9:  00000000000000000000000000000000
xmm10: 00000000000000000000000000000000 xmm11: 00000000000000000000000000000000
xmm12: 00000000000000000000000000000000 xmm13: 00000000000000000000000000000000
xmm14: 00000000000000000000000000000000 xmm15: 00000000000000000000000000000000
cs: 0000 ds: 0000 es: 0000 ss: 0000 fs: 0038 gs: 0000

jmp              qword [__libc_start_main]
__libc_start_main(0000000000408010, 3, 00000000bedd4fa8, 0000000000411820, 0000000000411810, 0000000000000000, 0000000000000000)

A simple ls -l C:\ requires 43 functions to be handled. It would be boring to detail all of them, so let's focus on the most interesting ones.

fwrite_unlocked

fwrite_unlocked is used by ls to write to stdout, so it's important to emulate it correctly.

def on_fwrite_unlocked(cpu, mem, ad):
    global exe, OUTPUT
    # ref: http://linux.die.net/man/3/fwrite_unlocked
    ptr    = exe.get_func_param('system_v', 0)
    size   = exe.get_func_param('system_v', 1)
    n      = exe.get_func_param('system_v', 2)
    stream = exe.get_func_param('system_v', 3)

    exe.ret_from_func('system_v', 4)

    ptr_s = mem.read_utf8(ptr)

    print 'fwrite_unlocked(%016x["%s"], %016x, %016x, %016x)' % (ptr, ptr_s, size, n, stream)

    OUTPUT += ptr_s[:size * n]

    cpu.rax = size * n
    return BREAK

As seen previously, we can rely on exe to get the parameters. However, this time we need to read the actual string pointed to by ptr. MemoryContext.read_utf8 will do that for us: it simply tries to read a null-terminated string using a pointer and returns a Python str object. If it fails to read the string, None is returned instead. This fake function also needs special handling for the return address. It wasn't required for __libc_start_main because the RIP register was changed manually. In this case, we need to emulate a pop rip to return from the call; sometimes we also need to clean the stack (e.g. stdcall calling convention) to emulate the return from the function correctly. Once again, Execution.ret_from_func provides a way to do this; it requires the calling convention and the number of parameters. The last parameter is only required for the stdcall calling convention since the callee is responsible for cleaning the stack. This sample passes a valid number of parameters for the sake of correctness, but it's not mandatory.
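The "pop rip" part can be sketched in a few lines of standalone Python 3. The emulate_ret function, the flat bytearray used as memory and the addresses are all made up for the illustration; Medusa's ret_from_func may work differently internally.

```python
import struct

def emulate_ret(regs, memory, stack_bytes_to_clean=0):
    """Emulate `ret` (or `ret imm16` when stack_bytes_to_clean is non-zero)."""
    rsp = regs["rsp"]
    regs["rip"] = struct.unpack_from("<Q", memory, rsp)[0]  # pop the return address
    regs["rsp"] = rsp + 8 + stack_bytes_to_clean            # callee cleanup, if any

memory = bytearray(0x100)                        # tiny flat memory starting at address 0
struct.pack_into("<Q", memory, 0x80, 0x401234)   # made-up return address
regs = {"rip": 0xF0000000, "rsp": 0x80}

emulate_ret(regs, memory)
print(hex(regs["rip"]), hex(regs["rsp"]))
```

With a caller-cleanup convention such as System V, stack_bytes_to_clean stays at 0, which is why the parameter count is optional there.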

To be able to print the final result properly, we store every write to stdout in a global variable named OUTPUT.

__overflow

def on___overflow(cpu, mem, ad):
    global exe, OUTPUT

    f = exe.get_func_param('system_v', 0)
    c = exe.get_func_param('system_v', 1)

    exe.ret_from_func('system_v', 2)

    print '__overflow(%016x, %08x)' % (f, c)

    cpu.rax = c

    OUTPUT += chr(c)
    return BREAK

Its definition is simple, however I wasted so much time figuring out that it's a putchar-like function... Tricky name.

malloc

Probably one of the most cumbersome functions to implement because it requires:

  • memory allocation for the memory context,
  • an allocator,
  • and compatibility with the other memory management functions (e.g. realloc, free, etc).

alloc_addr = 0xB0000000
alloc_pos  = 0

def on_malloc(cpu, mem, ad):
    global exe, alloc_addr, alloc_pos, alloced

    size = exe.get_func_param('system_v', 0)

    exe.ret_from_func('system_v', 1)

    ptr_alloc = alloc_addr + alloc_pos
    print("malloc(0x%08x) = 0x%016x" % (size, ptr_alloc))

    size = (size + 0x1000) & 0xFFFFF000 # Pages alignment
    mem.alloc(ptr_alloc, size, R_W)
    alloc_pos += size

    cpu.rax = ptr_alloc
    return BREAK

This implementation is unoptimized and really simple: alloc_addr is the base address of the heap and alloc_pos is the current heap offset. MemoryContext.alloc allows this function to allocate memory; it requires an address (here, the current heap address), a size (here, the requested size rounded up to whole pages) and a flag for the memory protection (here, read and write).
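As a standalone illustration of the same idea, here is a minimal bump allocator in plain Python. The BumpAllocator class and page_align helper are hypothetical names. Note one difference with the handler above: (size + 0x1000) & 0xFFFFF000 reserves an extra page when the size is already page-aligned, whereas this sketch uses the usual round-up formula and simply enforces a minimum of one page.

```python
PAGE_SIZE = 0x1000

def page_align(size):
    """Round size up to a multiple of PAGE_SIZE (0 stays 0)."""
    return (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)

class BumpAllocator(object):
    """Hands out page-aligned blocks and never reuses freed memory."""

    def __init__(self, base):
        self.base = base
        self.pos = 0
        self.sizes = {}  # pointer -> requested size, handy for realloc

    def malloc(self, size):
        ptr = self.base + self.pos
        self.pos += max(page_align(size), PAGE_SIZE)
        self.sizes[ptr] = size
        return ptr

    def free(self, ptr):
        self.sizes.pop(ptr, None)  # only forget the block

heap = BumpAllocator(0xB0000000)
first = heap.malloc(0x10)
second = heap.malloc(0x1000)
third = heap.malloc(0x1001)
print(hex(first), hex(second), hex(third))
```

Keeping the requested size per pointer is what makes a later realloc handler possible: it can allocate a new block and copy the right number of bytes.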

__xstat

This function is really important to emulate ls with the option -l, since it retrieves file information.

def on___xstat(cpu, mem, ad):
    global exe
    # ref: https://refspecs.linuxfoundation.org/LSB_3.0.0/LSB-PDA/LSB-PDA/baselib-xstat-1.html
    ver      = exe.get_func_param('system_v', 0)
    path     = exe.get_func_param('system_v', 1)
    stat_buf = exe.get_func_param('system_v', 2)

    exe.ret_from_func('system_v', 3)

    path_str = mem.read_utf8(path)

    print '__xstat(%d, %016x["%s"], %016x)' % (ver, path, path_str, stat_buf)

    # Rough size of struct stat, summed field by field; adjust it for your libc.
    STAT_SIZE = 4 + 8 + 8 + 4 + 4 + 4 + 4 + 4 + 8 + 8 + 8 + 4 + 4 + 4

    mem.write(stat_buf, '\x00' * STAT_SIZE)

    # Fill each 32-bit slot with its own offset to help locate the fields.
    for i in range(STAT_SIZE // 4):
        mem.write_u32(stat_buf + i * 4, i * 4)

    st_mode = 0755
    if os.path.isdir(path_str):
        st_mode |=  040000 # S_IFDIR
    else:
        st_mode |= 0100000 # S_IFREG

    st_size = os.path.getsize(path_str)

    mem.write_u32(stat_buf + 0x18, st_mode)
    mem.write_u64(stat_buf + 0x30, st_size)

    cpu.rax = 0 # success
    return BREAK

The good news is that the design of this API makes the caller provide the memory buffer stat_buf, so no allocation is needed. The bad news is: how do we figure out the field offsets of struct stat correctly? Structures are not yet handled by Medusa (work in progress). To overcome this issue, you may want to simply write dummy values bound to the offset, that's the purpose of the for loop. It appears that 0x18 is st_mode and 0x30 is st_size in this case, but this structure can change a lot (especially on Linux), so be careful with your own version.
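Another way to double-check such offsets outside of Medusa is to describe the structure with ctypes and let Python compute the layout. The sketch below assumes the glibc layout of struct stat on x86-64 Linux and a little-endian host; the class name StatX86_64 and the padding field names are made up, and other libc versions or architectures will need a different field list.

```python
import ctypes

class StatX86_64(ctypes.Structure):
    """struct stat as laid out by glibc on x86-64 Linux (assumed target)."""
    _fields_ = [
        ("st_dev",     ctypes.c_uint64),
        ("st_ino",     ctypes.c_uint64),
        ("st_nlink",   ctypes.c_uint64),
        ("st_mode",    ctypes.c_uint32),
        ("st_uid",     ctypes.c_uint32),
        ("st_gid",     ctypes.c_uint32),
        ("pad0",       ctypes.c_uint32),
        ("st_rdev",    ctypes.c_uint64),
        ("st_size",    ctypes.c_int64),
        ("st_blksize", ctypes.c_int64),
        ("st_blocks",  ctypes.c_int64),
        ("st_atim",    ctypes.c_int64 * 2),   # struct timespec
        ("st_mtim",    ctypes.c_int64 * 2),
        ("st_ctim",    ctypes.c_int64 * 2),
        ("reserved",   ctypes.c_int64 * 3),
    ]

print(hex(StatX86_64.st_mode.offset), hex(StatX86_64.st_size.offset))

# Build the raw bytes a handler could write into the emulated buffer.
st = StatX86_64()
st.st_mode = 0o040755   # S_IFDIR | 0755
st.st_size = 4096
raw = bytes(st)
```

The computed offsets match the 0x18 and 0x30 found with the dummy-value trick, and raw can be written in a single memory write instead of one write per field.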

getpwuid

This API is used to convert a UID to a real name.

PW_ADDR = None

def on_getpwuid(cpu, mem, ad):
    global exe, PW_ADDR

    uid = exe.get_func_param('system_v', 0)

    exe.ret_from_func('system_v', 1)

    print 'getpwuid(%d)' % uid

    if not PW_ADDR:
        ##  struct passwd {
        ##    char *pw_name;
        ##    char *pw_passwd;
        ##    uid_t pw_uid;
        ##    gid_t pw_gid;
        ##    time_t pw_change;
        ##    char *pw_class;
        ##    char *pw_gecos;
        ##    char *pw_dir;
        ##    char *pw_shell;
        ##    time_t pw_expire;
        ##  };

        PW_ADDR = 0x90000000

        mem.alloc(PW_ADDR, 0x200, R_W)
        mem.write(PW_ADDR, '\x00' * 0x200)

        PW_NAME = PW_ADDR + 8 + 8 + 4 + 4 + 4 + 8 + 8 + 8 + 8 + 4
        mem.write(PW_NAME, 'wisk')
        mem.write_u64(PW_ADDR, PW_NAME)

    cpu.rax = PW_ADDR
    return BREAK

What makes this API difficult to implement is its old design: this function returns a pointer allocated from nowhere. To solve this issue, a memory allocation is needed. PW_ADDR is a dummy address which will hold the structure passwd. To allocate this memory only once, a global variable is used. Not a clean way to code, but it does the job. It's actually easier to perform one allocation for both the structure passwd and the content of the field pw_name. The user name is hard-coded because I'm lazy, and so is the group name. :)
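The "one allocation for both the structure and the string" trick can be sketched on its own with the struct module. The build_passwd_blob helper is a hypothetical name, and struct_size defaults to 0x40 only because that is the sum used for PW_NAME in the handler above; it is not a value taken from a header.

```python
import struct

def build_passwd_blob(base_addr, name, struct_size=0x40):
    """Return (bytes, name address): a zeroed passwd followed by `name`.

    Only pw_name (the first field, a char *) is filled in; every other
    field is left at zero.
    """
    blob = bytearray(struct_size)
    name_addr = base_addr + struct_size
    struct.pack_into("<Q", blob, 0, name_addr)   # pw_name -> string below
    blob += name.encode("ascii") + b"\x00"       # string stored right after
    return bytes(blob), name_addr

blob, name_addr = build_passwd_blob(0x90000000, "wisk")
print(hex(name_addr), len(blob))
```

The resulting bytes can then be copied to the chosen base address in one write, and the same base address is returned to the caller as the passwd pointer.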

getopt_long

This function is important to implement if you want ls to understand arguments passed during initialization.

def on_getopt_long(cpu, mem, ad):
    global exe, doc, GETOPT_IDX
    # ref: http://linux.die.net/man/3/getopt_long
    argc      = exe.get_func_param('system_v', 0)
    argv      = exe.get_func_param('system_v', 1)
    optstring = exe.get_func_param('system_v', 2)
    longopts  = exe.get_func_param('system_v', 3)
    longindex = exe.get_func_param('system_v', 4)

    exe.ret_from_func('system_v', 5)

    argv_list = []
    for i in range(argc):
        arg_ptr = mem.read_u64(argv + i * 8)
        arg_str = mem.read_utf8(arg_ptr)
        argv_list.append(arg_str)

    optstring_str = mem.read_utf8(optstring)

    res = None

    try:
        # ref: https://docs.python.org/2/library/getopt.html
        opts, args = getopt.getopt(argv_list[GETOPT_IDX:], optstring_str)
        print opts, args

        if len(opts) == 0:
            res = -1
        else:
            for opt, val in opts:
                GETOPT_IDX += 1
                res = ord(opt[1])

    except getopt.GetoptError as err:
        res = ord('?')

    res_str = None
    if res == -1:
        res_str = '-1'
    else:
        res_str = '\'%c\'' % res

    print 'getopt_long(%d, %016x["%s"], %016x["%s"], %016x, %016x) = %s' %\
        (argc, argv, ';'.join(argv_list), optstring, optstring_str, longopts, longindex, res_str)

    cpu.rax = res & 0xffffffff


    optind_addr = doc.get_label_addr('optind')
    mem.write_u32(optind_addr.offset, GETOPT_IDX)

    return BREAK

This implementation uses the Python version of getopt; however, it's important to handle the update of the global variable optind. A global variable GETOPT_IDX holds the current index; at the end of this function, the address of optind is retrieved with the method Document.get_label_addr and updated with MemoryContext.write_u32.
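The core of this logic can be tried outside the emulator. The following sketch is a simplified variant, not the handler above: next_option is a hypothetical helper that returns the first remaining option together with the updated index, and it deliberately ignores grouped short options (-la), options taking an argument and long options.

```python
import getopt

def next_option(argv, optstring, optind):
    """Return (result, new optind), mimicking one call to getopt."""
    try:
        opts, _args = getopt.getopt(argv[optind:], optstring)
    except getopt.GetoptError:
        return ord('?'), optind + 1      # unknown option
    if not opts:
        return -1, optind                # no option left
    name, _value = opts[0]
    return ord(name[1]), optind + 1      # name looks like '-l'

argv = ['ls', '-l', 'C:\\']
first = next_option(argv, 'l', 1)
second = next_option(argv, 'l', first[1])
print(first, second)
```

The second call returns -1 and leaves the index on the first non-option argument, which is the value the emulated program expects to find in optind.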

__sprintf_chk

__sprintf_chk is a hardened version of sprintf. Variadic functions are a pain to handle, and this implementation may not work in specific cases, for instance if a floating point value is formatted.

def on___sprintf_chk(cpu, mem, ad):
    global exe
    # ref: http://refspecs.linux-foundation.org/LSB_4.0.0/LSB-Core-generic/LSB-Core-generic/libc---sprintf-chk-1.html
    buf    = exe.get_func_param('system_v', 0)
    flag   = exe.get_func_param('system_v', 1)
    buflen = exe.get_func_param('system_v', 2)
    fmt    = exe.get_func_param('system_v', 3)

    exe.ret_from_func('system_v', 0)

    fmt_str = mem.read_utf8(fmt)

    param_no = 4
    param = []

    for i in range(len(fmt_str)):
        c = fmt_str[i]
        if c == '%':
            p = fmt_str[i + 1]
            if p == '*':
                param.append(exe.get_func_param('system_v', param_no))
                param_no += 1
                p = fmt_str[i + 2]

            if p == 's':
                p_ptr = exe.get_func_param('system_v', param_no)
                param.append(mem.read_utf8(p_ptr))
                param_no += 1
                continue

            param.append(exe.get_func_param('system_v', param_no))
            param_no += 1

    res = fmt_str % tuple(param)

    print '__sprintf_chk(%016x, %d, %016x, %016x["%s"], ...) = "%s"' % (buf, flag, buflen, fmt, fmt_str, res)

    mem.write(buf, res + '\x00')
    cpu.rax = len(res) # number of characters written, excluding the trailing NUL

    return BREAK
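The format-string walk in on___sprintf_chk can also be sketched with a regular expression, which makes %% and * easier to deal with. This is a standalone illustration covering only a subset of printf (no floating point, no %n, only the h and l length modifiers); collect_args, next_int and read_str are hypothetical names standing in for successive get_func_param calls and for a string read from emulated memory.

```python
import re

# flags, width, precision, length modifier and conversion (subset of printf)
SPEC = re.compile(r"%(?:%|[-+ #0]*(\*|\d+)?(?:\.(\*|\d+))?[hl]?([diouxXcs]))")

def collect_args(fmt, next_int, read_str):
    """Gather the values a format string consumes, in order."""
    args = []
    for m in SPEC.finditer(fmt):
        if m.group(0) == "%%":
            continue                      # literal percent, no parameter
        if m.group(1) == "*":
            args.append(next_int())       # width passed as a parameter
        if m.group(2) == "*":
            args.append(next_int())       # precision passed as a parameter
        value = next_int()
        args.append(read_str(value) if m.group(3) == "s" else value)
    return tuple(args)

# Made-up parameters: a width, a pointer to a string, and an integer.
params = iter([6, 0x1000, 42])
strings = {0x1000: "wisk"}

fmt = "%*s %d%%"
args = collect_args(fmt, lambda: next(params), strings.get)
result = fmt % args
print(repr(result))
```

As in the handler, the final formatting is delegated to Python's % operator, so conversions that Python does not understand still need special care.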

Memory layout

It's worth noting that when Medusa allocates memory, it fills it with the value 0xfa. The bright side is that it makes accesses to uninitialized memory easier to find by grepping for fafa...; however, it may interfere with the execution.
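Instead of grepping the logs, the same check can be done programmatically on a value read from a register or from memory. This is a small Python 3 sketch with a hypothetical helper name; keep in mind that a legitimate value made only of 0xfa bytes would be reported as well.

```python
FILL_BYTE = 0xFA

def looks_uninitialized(value, size):
    """True if every byte of a `size`-byte integer equals the fill byte."""
    pattern = int.from_bytes(bytes([FILL_BYTE]) * size, "little")
    return value == pattern

print(looks_uninitialized(0xFAFAFAFAFAFAFAFA, 8))
print(looks_uninitialized(0x408010, 8))
```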

stdout

Last but not least, stdout is actually a pointer to a global structure of type _IO_FILE. Even if this structure is opaque, some functions may try to access some of its fields. Luckily, zeroing out this structure seems to be enough to make these functions happy.

exe.mem.alloc(0xd0002000, 0x100, R_W)
exe.mem.write(0xd0002000, '\x00' * 0x100)
exe.mem.write_u64(doc.get_label_addr('stdout').offset, 0xd0002000)

Result

Here is the final result. Since the debug prints (ftrace ersatz) are far too long to include in full, the output is truncated:

total 4526895530976
drwxr-xr-x        None wisk iki          0 2015-12-07 20:38 $Recycle.Bin
drwxr-xr-x        None wisk iki          0 2015-12-07 20:38 $WINDOWS.~BT
drwxr-xr-x        None wisk iki          0 2015-12-07 20:38 $Windows.~WS
-rwxr-xr-x        None wisk iki          1 2015-12-07 20:38 BOOTNXT
drwxr-xr-x        None wisk iki          0 2015-12-07 20:38 Chocolatey
drwxr-xr-x        None wisk iki          0 2015-12-07 20:38 Documents and Settings
drwxr-xr-x        None wisk iki       4096 2015-12-07 20:38 Flashtool
drwxr-xr-x        None wisk iki          0 2015-12-07 20:38 MSOCache
drwxr-xr-x        None wisk iki       4096 2015-12-07 20:38 MinGW
drwxr-xr-x        None wisk iki          0 2015-12-07 20:38 NVIDIA
drwxr-xr-x        None wisk iki          0 2015-12-07 20:38 NvidiaLogging
drwxr-xr-x        None wisk iki          0 2015-12-07 20:38 OneDriveTemp
drwxr-xr-x        None wisk iki          0 2015-12-07 20:38 PerfLogs
drwxr-xr-x        None wisk iki      20480 2015-12-07 20:38 Program Files
drwxr-xr-x        None wisk iki      32768 2015-12-07 20:38 Program Files (x86)
drwxr-xr-x        None wisk iki      12288 2015-12-07 20:38 ProgramData
drwxr-xr-x        None wisk iki       4096 2015-12-07 20:38 Python27
drwxr-xr-x        None wisk iki       4096 2015-12-07 20:38 Python33
drwxr-xr-x        None wisk iki       4096 2015-12-07 20:38 Qt
drwxr-xr-x        None wisk iki          0 2015-12-07 20:38 Recovery
drwxr-xr-x        None wisk iki       4096 2015-12-07 20:38 System Volume Information
drwxr-xr-x        None wisk iki          0 2015-12-07 20:38 Temp
drwxr-xr-x        None wisk iki       4096 2015-12-07 20:38 Users
drwxr-xr-x        None wisk iki      28672 2015-12-07 20:38 Windows
-rwxr-xr-x        None wisk iki     398156 2015-12-07 20:38 bootmgr
drwxr-xr-x        None wisk iki       4096 2015-12-07 20:38 cygwin
drwxr-xr-x        None wisk iki       4096 2015-12-07 20:38 cygwin64
-rwxr-xr-x        None wisk iki 2569613312 2015-12-07 20:38 hiberfil.sys
-rwxr-xr-x        None wisk iki 4294967296 2015-12-07 20:38 pagefile.sys
-rwxr-xr-x        None wisk iki  268435456 2015-12-07 20:38 swapfile.sys
drwxr-xr-x        None wisk iki          0 2015-12-07 20:38 symbols

Conclusions

Software emulation is both complex and fun to do. Focusing only on the processor is usually not enough (unless you only want to emulate shellcode or stuff like that), that's why Medusa tries to provide a large set of APIs to handle most situations. Of course, there's still stuff to do, and the next developments on emulation will focus on helpers for both heap allocation and format strings. Next time we'll see how to use the LLVM back-end to improve speed and target something more exciting. :)

Acknowledgements

  • gg: who helped me a lot with x86 semantics. :)
