This is a tutorial on how to reverse engineer shellcode in malware with Radare2. Spoilers!
MalwareTech posted a small challenge on Twitter: reverse engineer the shellcode embedded in a malware sample. I thought this was a great opportunity to write a small tutorial on how to do this with Radare2 on a Mac.
You can download the sample here. After you unpack the archive, you will find a single .exe file with readme instructions saying that this is a static analysis challenge (so no debugging).
You can run the executable to get the MD5 hash of the correct flag:
Let’s launch Radare and see what’s inside of the executable.
radare2 shellcode1.exe
-- r2 talks to you. tries to make you feel well.
[0x00402270]>
Radare loads the binary and places you at the entry point of the program. You can verify that by running ie:
[0x00402270]> ie
[Entrypoints]
vaddr=0x00402270 paddr=0x00001670 baddr=0x00400000 laddr=0x00000000 haddr=0x00000108 type=program
1 entrypoints
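The fields in that output are related by simple arithmetic, which is handy later when an address has to be found in the file on disk. The sketch below is only an illustration: the helper names are hypothetical, and the conversion assumes both addresses lie in the same section, so that the distance between virtual address and file offset is constant.

```c
#include <stdint.h>

/* Relative virtual address: distance of vaddr from the image base (baddr). */
static uint32_t rva(uint32_t vaddr, uint32_t baddr)
{
    return vaddr - baddr;
}

/* File offset of vaddr, derived from one known (vaddr, paddr) pair.
 * Assumption: both addresses belong to the same section. */
static uint32_t vaddr_to_paddr(uint32_t vaddr,
                               uint32_t known_vaddr,
                               uint32_t known_paddr)
{
    return vaddr - known_vaddr + known_paddr;
}
```

With the values printed by ie, rva(0x00402270, 0x00400000) gives 0x2270, and an address 0x10 bytes past the entry point maps to file offset 0x1680.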
The next step is to analyze the binary. This is where Radare differs from Hopper and IDA: you have to start the analysis manually. Analysis may take a while, so the radare developers decided not to run it by default. Anyway, type aaa:
[0x00402270]> aaa
[ WARNING : block size exceeding max block size at 0x00401000
[+] Try changing it with e anal.bb.maxsize
[x] Analyze all flags starting with sym. and entry0 (aa)
[x] Analyze function calls (aac)
[x] Analyze len bytes of instructions for references (aar)
[x] Use -AA or aaaa to perform additional experimental analysis.
[x] Constructing a function name for fcn.* and sym.func.* functions (aan)
After the analysis is complete, radare groups sections, functions, symbols, strings, etc. under flags. You can see those by typing fs:
[0x00402270]> fs
0 19 * strings
1 11 * symbols
2 10 * sections
3 9 * relocs
4 9 * imports
5 1 * resources
6 4 * functions
Let’s take a look at the functions radare discovered. afl stands for analyzed functions list:
[0x00402270]> afl
0x00401000 1 1022 sym.shellcode1.exe__MD5Transform_MD5__CAXQAKQAE_Z
0x00401e50 5 160 sym.shellcode1.exe__Encode_MD5__CAXPAEPAKI_Z
0x00401ef0 5 117 sym.shellcode1.exe__Decode_MD5__CAXPAKPAEI_Z
0x00401f70 1 22 sym.shellcode1.exe___0MD5__QAE_XZ
0x00401f90 1 70 sym.shellcode1.exe__Init_MD5__QAEXXZ
0x00401fe0 10 263 sym.shellcode1.exe__Update_MD5__QAEXPAEI_Z
0x004020f0 4 161 sym.shellcode1.exe__Final_MD5__QAEXXZ
0x004021a0 5 74 sym.shellcode1.exe__writeToString_MD5__QAEXXZ
0x004021f0 1 51 sym.shellcode1.exe__digestMemory_MD5__QAEPADPAEH_Z
0x00402230 1 60 sym.shellcode1.exe__digestString_MD5__QAEPADPAD_Z
0x00402270 1 175 entry0
0x00402326 1 6 sub.ntdll.dll_memset_326
0x0040232c 1 6 sub.ntdll.dll_memcpy_32c
0x00402332 1 6 sub.ntdll.dll_sprintf_332
0x00402338 1 6 sub.ntdll.dll_strlen_338
Interesting, so here we have a number of MD5-related functions and the main function, labeled entry0. Since we are looking for shellcode, it is likely to be kept in data. Let’s take a look at the strings with the iz command:
[0x00402270]> iz
000 0x00001830 0x00403030 23 24 (.rdata) ascii We've been compromised!
001 0x00001848 0x00403048 4 5 (.rdata) ascii %02x
002 0x000018d2 0x004030d2 11 12 (.rdata) ascii ExitProcess
003 0x000018e0 0x004030e0 12 13 (.rdata) ascii VirtualAlloc
004 0x000018f0 0x004030f0 9 10 (.rdata) ascii HeapAlloc
005 0x000018fc 0x004030fc 14 15 (.rdata) ascii GetProcessHeap
006 0x0000190c 0x0040310c 12 13 (.rdata) ascii KERNEL32.dll
007 0x0000191c 0x0040311c 6 7 (.rdata) ascii memset
008 0x00001926 0x00403126 6 7 (.rdata) ascii memcpy
009 0x00001930 0x00403130 7 8 (.rdata) ascii sprintf
010 0x0000193a 0x0040313a 6 7 (.rdata) ascii strlen
011 0x00001942 0x00403142 9 10 (.rdata) ascii ntdll.dll
012 0x0000194e 0x0040314e 11 12 (.rdata) ascii MessageBoxA
013 0x0000195a 0x0040315a 10 11 (.rdata) ascii USER32.dll
014 0x00001984 0x00403184 4 20 (.rdata) utf32le \n\n㆘㇀ blocks=Basic Latin,Kanbun,CJK Strokes
015 0x000019f6 0x004031f6 135 272 (.rdata) utf16le \a\b\t桳汥捬摯ㅥ攮數㼀〿䑍䀵兀䕁塀Z䐿捥摯䁥䑍䀵䍀塁䅐偋䕁䁉Z䔿据摯䁥䑍䀵䍀塁䅐偅䭁䁉Z䘿湩污䵀㕄䁀䅑塅婘㼀湉瑩䵀㕄䁀䅑塅婘㼀䑍吵慲獮潦浲䵀㕄䁀䅃兘䭁䅑䁅Z唿摰瑡䁥䑍䀵兀䕁偘䕁䁉Z搿杩獥䵴浥牯䁹䑍䀵兀䕁䅐偄䕁䁈Z搿杩獥却牴湩䁧䑍䀵兀䕁䅐偄䑁婀㼀牷瑩呥卯牴湩䁧䑍䀵兀䕁塘Z blocks=Basic Latin,CJK Unified Ideographs,Hangul Compatibility Jamo,CJK Unified Ideographs Extension A,CJK Symbols and Punctuation
000 0x00001c40 0x00404040 9 11 (.data) utf8 2b\n:ۚB*bb blocks=Basic Latin,Arabic
001 0x00001c4b 0x0040404b 5 6 (.data) ascii z"*iJ
000 0x00001e58 0x00405058 424 424 (.rsrc) ascii <assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">\r\n <trustInfo xmlns="urn:schemas-microsoft-com:asm.v3">\r\n <security>\r\n <requestedPrivileges>\r\n <requestedExecutionLevel level="asInvoker" uiAccess="false"></requestedExecutionLevel>\r\n </requestedPrivileges>\r\n </security>\r\n </trustInfo>\r\n</assembly>PAPADDINGXXPADDINGPADDINGXXPADDINGPADDINGXXPADDINGPADDINGXXPADDINGPADDINGXXPAD
We see the “We've been compromised!” string and also some gibberish @ 0x00001984. This is interesting because if the task is to find shellcode, in the simplest case we know to look for two things:
- An encoded string, which is unlikely to be plain ASCII, so radare won’t be able to display it cleanly.
- The shellcode itself, which is likely to be stored as raw machine code and will again show up as gibberish.
Let’s take a look at the main function. We should already be at its offset, but for the sake of practice, note that you can move to any function by typing s followed by the function name or address:
[0x00402270]> s 0x00402270
To disassemble the function, type pdf:
[0x00402270]> pdf
/ (fcn) entry0 175
| entry0 ();
| ; var int local_a0h @ ebp-0xa0
| ; var int local_9ch @ ebp-0x9c
| ; var int local_98h @ ebp-0x98
| ; var int local_4h @ ebp-0x4
| 0x00402270 55 push ebp
| 0x00402271 8bec mov ebp, esp
| 0x00402273 81eca0000000 sub esp, 0xa0
| 0x00402279 56 push esi
| 0x0040227a 8d8d68ffffff lea ecx, [local_98h]
| 0x00402280 e8ebfcffff call sym.shellcode1.exe___0MD5__QAE_XZ
| 0x00402285 6a10 push 0x10 ; 16
| 0x00402287 6a00 push 0
| 0x00402289 ff1508304000 call dword [sym.imp.KERNEL32.dll_GetProcessHeap] ; 0x403008
| 0x0040228f 50 push eax
| 0x00402290 ff1504304000 call dword [sym.imp.KERNEL32.dll_HeapAlloc] ; 0x403004
| 0x00402296 8945fc mov dword [local_4h], eax
| 0x00402299 8b45fc mov eax, dword [local_4h]
| 0x0040229c c70040404000 mov dword [eax], str.2b__:__B_bb ; [0x404040:4]=0x3a0a6232 ; "2b\n:\u06daB*bb\x1az\"*iJ\x9ar\xa2iR\xaa\x9a\xa2i2z\x92i*\u0082bzJ\xa2\x9a\xeb"
| 0x004022a2 6840404000 push str.2b__:__B_bb ; 0x404040 ; "2b\n:\u06daB*bb\x1az\"*iJ\x9ar\xa2iR\xaa\x9a\xa2i2z\x92i*\u0082bzJ\xa2\x9a\xeb"
| 0x004022a7 e88c000000 call sub.ntdll.dll_strlen_338 ; size_t strlen(const char *s)
| 0x004022ac 83c404 add esp, 4
| 0x004022af 8b4dfc mov ecx, dword [local_4h]
| 0x004022b2 894104 mov dword [ecx + 4], eax
| 0x004022b5 6a40 push 0x40 ; '@' ; 64
| 0x004022b7 6800100000 push 0x1000
| 0x004022bc 6a0d push 0xd ; 13
| 0x004022be 6a00 push 0
| 0x004022c0 ff1500304000 call dword [sym.imp.KERNEL32.dll_VirtualAlloc] ; 0x403000
| 0x004022c6 898560ffffff mov dword [local_a0h], eax
| 0x004022cc 6a0d push 0xd ; 13
| 0x004022ce 6868404000 push 0x404068
| 0x004022d3 8b9560ffffff mov edx, dword [local_a0h]
| 0x004022d9 52 push edx
| 0x004022da e84d000000 call sub.ntdll.dll_memcpy_32c ; void *memcpy(void *s1, const void *s2, size_t n)
| 0x004022df 83c40c add esp, 0xc
| 0x004022e2 8b75fc mov esi, dword [local_4h]
| 0x004022e5 ff9560ffffff call dword [local_a0h]
| 0x004022eb 6840404000 push str.2b__:__B_bb ; 0x404040 ; "2b\n:\u06daB*bb\x1az\"*iJ\x9ar\xa2iR\xaa\x9a\xa2i2z\x92i*\u0082bzJ\xa2\x9a\xeb"
| 0x004022f0 8d8d68ffffff lea ecx, [local_98h]
| 0x004022f6 e835ffffff call sym.shellcode1.exe__digestString_MD5__QAEPADPAD_Z
| 0x004022fb 898564ffffff mov dword [local_9ch], eax
| 0x00402301 6a30 push 0x30 ; '0' ; 48
| 0x00402303 6830304000 push str.We_ve_been_compromised ; 0x403030 ; "We've been compromised!"
| 0x00402308 8b8564ffffff mov eax, dword [local_9ch]
| 0x0040230e 50 push eax
| 0x0040230f 6a00 push 0
| 0x00402311 ff1514304000 call dword [sym.imp.USER32.dll_MessageBoxA] ; 0x403014 ; "L1"
| 0x00402317 6a00 push 0
\ 0x00402319 ff150c304000 call dword [sym.imp.KERNEL32.dll_ExitProcess] ; 0x40300c
[0x00402270]>
The program starts with a call to the MD5 class constructor (the ??0 prefix in the mangled name, shown by radare as ___0MD5, marks a constructor):
0x00402280 e8ebfcffff call sym.shellcode1.exe___0MD5__QAE_XZ
This is not very interesting for now. The next set of instructions makes two calls to Windows memory management functions, GetProcessHeap and HeapAlloc:
0x00402285 6a10 push 0x10 ; 16
0x00402287 6a00 push 0
0x00402289 ff1508304000 call dword [sym.imp.KERNEL32.dll_GetProcessHeap] ; 0x403008
0x0040228f 50 push eax
0x00402290 ff1504304000 call dword [sym.imp.KERNEL32.dll_HeapAlloc] ; 0x403004
The MSDN reference for GetProcessHeap says it doesn’t take any arguments and returns a handle to the calling process’s heap.
HANDLE WINAPI GetProcessHeap(void);
Then there is a call to HeapAlloc, which has the following definition:
LPVOID WINAPI HeapAlloc(
_In_ HANDLE hHeap,
_In_ DWORD dwFlags,
_In_ SIZE_T dwBytes
);
Looking at the parameters pushed onto the stack, HeapAlloc was called with the following arguments:
LPVOID WINAPI HeapAlloc(
_In_ HANDLE hHeap, /* the handle returned by GetProcessHeap */
_In_ DWORD dwFlags, /* 0 */
_In_ SIZE_T dwBytes /* 0x10 = 16 bytes */
);
Let’s see what this heap allocation is used for:
0x00402296 8945fc mov dword [local_4h], eax
0x00402299 8b45fc mov eax, dword [local_4h]
0x0040229c c70040404000 mov dword [eax], str.2b__:__B_bb
We save the pointer to the memory block returned by HeapAlloc in local_4h and then store the address of str.2b__:__B_bb in the first four bytes of that block (the address held in eax). Generally, with DEP, HeapAlloc returns non-executable memory. Let’s verify that DEP is in fact enabled:
[0x00402270]> iI~nx
nx true
This means that, for now, we can assume this is the encoded string rather than code. Let’s pull it out. I printed all the bytes up to and including the null terminator 0x00 as a C array, so it will be easier to write a decoder later.
[0x00402270]> pc 39 @ 0x404040
#define _BUFFER_SIZE 39
const uint8_t buffer[39] = {
0x32, 0x62, 0x0a, 0x3a, 0xdb, 0x9a, 0x42, 0x2a, 0x62, 0x62,
0x1a, 0x7a, 0x22, 0x2a, 0x69, 0x4a, 0x9a, 0x72, 0xa2, 0x69,
0x52, 0xaa, 0x9a, 0xa2, 0x69, 0x32, 0x7a, 0x92, 0x69, 0x2a,
0xc2, 0x82, 0x62, 0x7a, 0x4a, 0xa2, 0x9a, 0xeb, 0x00
};
[0x00402270]>
We still need to find the shellcode responsible for decoding. A little lower we see the VirtualAlloc call:
0x004022b5 6a40 push 0x40 ; '@' ; 64
0x004022b7 6800100000 push 0x1000
0x004022bc 6a0d push 0xd ; 13
0x004022be 6a00 push 0
0x004022c0 ff1500304000 call dword [sym.imp.KERNEL32.dll_VirtualAlloc] ; 0x403000
With the following signature:
LPVOID WINAPI VirtualAlloc(
_In_opt_ LPVOID lpAddress, /* 0 */
_In_ SIZE_T dwSize, /* 0xd = 13 bytes */
_In_ DWORD flAllocationType, /* 0x1000 = MEM_COMMIT */
_In_ DWORD flProtect /* 0x40 = PAGE_EXECUTE_READWRITE */
);
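When reading pushed constants like 0x40, it helps to have the documented protection values at hand. The table below is a minimal sketch of such a lookup; the struct and the helper find_protection are hypothetical names for illustration, while the numeric values are the documented Windows memory protection constants.

```c
#include <stddef.h>
#include <stdint.h>

/* One documented flProtect value, its name, and whether it permits execution. */
struct protection {
    uint32_t value;
    const char *name;
    int executable;
};

static const struct protection protections[] = {
    {0x01, "PAGE_NOACCESS",          0},
    {0x02, "PAGE_READONLY",          0},
    {0x04, "PAGE_READWRITE",         0},
    {0x08, "PAGE_WRITECOPY",         0},
    {0x10, "PAGE_EXECUTE",           1},
    {0x20, "PAGE_EXECUTE_READ",      1},
    {0x40, "PAGE_EXECUTE_READWRITE", 1},
    {0x80, "PAGE_EXECUTE_WRITECOPY", 1},
};

/* Hypothetical helper: return the table entry for a base protection value,
 * or NULL when the value is not one of the constants listed above. */
static const struct protection *find_protection(uint32_t flProtect)
{
    for (size_t i = 0; i < sizeof protections / sizeof protections[0]; i++) {
        if (protections[i].value == flProtect) {
            return &protections[i];
        }
    }
    return NULL;
}
```

Looking up 0x40 yields the PAGE_EXECUTE_READWRITE entry with the executable field set, which is the property that matters for spotting shellcode staging.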
The PAGE_EXECUTE_READWRITE protection makes the allocated pages executable as well as writable, which means this allocation will likely hold the decoder. Let’s see:
0x004022cc 6a0d push 0xd ; 13
0x004022ce 6868404000 push 0x404068
0x004022d3 8b9560ffffff mov edx, dword [local_a0h]
0x004022d9 52 push edx
0x004022da e84d000000 call sub.ntdll.dll_memcpy_32c ; void *memcpy(void *s1, const void *s2, size_t n)
0x004022df 83c40c add esp, 0xc
0x004022e2 8b75fc mov esi, dword [local_4h]
0x004022e5 ff9560ffffff call dword [local_a0h]
Here we see memcpy being called to copy 13 bytes from memory location 0x404068 into the freshly allocated executable memory. Before the jump, esi is loaded with the pointer to our heap block (local_4h), and the copied code is then called @ 0x004022e5.
Let’s see what those 13 bytes look like:
[0x00402270]> px 13 @ 0x404068
- offset - 0 1 2 3 4 5 6 7 8 9 A B C D E F 0123456789ABCDEF
0x00404068 8b3e 8b4e 04c0 440f ff05 e2f9 c3 .>.N..D......
It definitely looks like machine code. Radare can disassemble an arbitrary number of bytes at a given address with pD, here pD 13 @ 0x404068:
And here is our decoder. It loads the pointer to the string into edi and the length of the string into ecx, and then rotates each byte left by 5 bits. This is a little annoying, as C doesn’t have a built-in rotate operator, so let’s just reuse the decoder as inline assembly.
GCC inline assembly uses AT&T syntax by default, so we have to change the disassembly flavor. By the way, you can chain commands in radare with ;.
[0x00402270]> e asm.syntax=att; pD 13 @ 0x404068
; DATA XREF from 0x004022ce (entry0)
0x00404068 8b3e movl 0(%esi), %edi
0x0040406a 8b4e04 movl 4(%esi), %ecx ; [0x4:4]=-1 ; 4
.-> 0x0040406d c0440fff05 rolb $5, -1(%edi, %ecx)
`=< 0x00404072 e2f9 loop 0x40406d
0x00404074 c3 retl
Now on to writing the decoder. The idea is to load the pointer to the encoded string into edi and the buffer size into ecx, then decode the string in place, so we don’t have to rewrite the assembly.
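__asm__ __volatile__ (
"movl %0, %%edi\n"
"movl %1, %%ecx\n"
"1:\n"
"rolb $5, -1(%%edi, %%ecx)\n"
"loop 1b\n"
: /* We are rewriting in place, so no outputs */
: "r"(buffer), "r"(_BUFFER_SIZE)
: "edi", "ecx", "cc", "memory" /* registers and state the code modifies */
);
If you are not familiar with inline assembly, let me clarify the example a little. %0 and %1 are the operands, numbered in the order they are listed. Because the placeholders are written with %, we have to use %% for registers. The label is the local numeric label 1, and loop 1b jumps back to it, which avoids naming a label after the loop instruction itself. After the assembly template there are three sections separated by :. The first is for outputs, of which we have none. The second is for inputs, where we supply our buffer and its size. The third lists what the code clobbers (edi, ecx, the flags and memory), so the compiler does not place the inputs in those registers or assume the buffer is unchanged. Since the buffer is decoded in place, it must not be declared const as it is in the pc output above.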
Since I am using a Mac, I need to specify that the code should be compiled as 32-bit; otherwise GCC will target 64-bit registers and the code won’t work: gcc -o decode decode.c -m32.
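For comparison, the same transformation can be modelled in portable C, which sidesteps the 32-bit requirement. This is a sketch rather than the solution used here: decode_ctx, rol8 and decode_in_place are hypothetical names, and the struct only mirrors the pointer and length fields that the shellcode reads through esi (its real offsets of 0 and 4 apply to the 32-bit target only).

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the block entry0 fills in before calling the shellcode:
 * a pointer to the encoded bytes followed by their length. */
struct decode_ctx {
    uint8_t *data;  /* read by the shellcode as [esi]     */
    uint32_t len;   /* read by the shellcode as [esi + 4] */
};

/* Rotate an 8-bit value left by n bits (n is taken modulo 8). */
static uint8_t rol8(uint8_t value, unsigned n)
{
    n &= 7u;
    return (uint8_t)((value << n) | (value >> ((8u - n) & 7u)));
}

/* Same effect as the rolb/loop pair: walk from the last byte down to the
 * first, rotating each one left by 5 bits. Unlike the loop instruction,
 * this does nothing when len is 0. */
static void decode_in_place(struct decode_ctx *ctx)
{
    for (uint32_t i = ctx->len; i > 0; i--) {
        ctx->data[i - 1] = rol8(ctx->data[i - 1], 5);
    }
}
```

To use it, copy the bytes from the pc dump into a writable (non-const) array, point data at it, set len to 38 (the strlen result the program itself stores), and print the array as a string after calling decode_in_place.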
Now all that’s left is to run it and get the flag.
[Spoiler] Here is the code for the complete solution: click me