aboutsummaryrefslogtreecommitdiffgithub log msg author committer range
blob: 97aeaf6d3cda563ec48dce0563a7669991d6213b (plain)
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332  +++ date = "2018-12-08T00:00:00-05:00" draft = false title = "Autograding Gameboy Advance DMA Transfers" description = "A tale of using (abusing?) Linux virtual memory features to autograde GBA DMA for CS 2110" hasmath = true +++ In CS 2110 at Georgia Tech, we use the Gameboy Advance to demonstrate how C code can interact directly with hardware. On a high level, students create games by reading button states directly from memory-mapped registers and write directly to a special region of memory called the "video buffer," which the video controller continuously reads and draws on the screen. But my favorite example of students poking with the GBA hardware introduces them to an important real-life example: [Direct Memory Access (DMA)][2]! The GBA has a DMA controller which amounts to a fast memcpy() in hardware which some additional switches to flip. ([See tonc for more details on GBA DMA][1].) Students use DMA as follows: c // provided boilerplate typedef struct { const volatile void *src; volatile void *dst; volatile uint32_t cnt; } DMA_CONTROLLER; #define DMA ((volatile DMA_CONTROLLER *) 0x040000B0) #define DMA_DST_INC (0 << 21) #define DMA_DST_DEC (1 << 21) #define DMA_DST_FIX (2 << 21) #define DMA_SRC_INC (0 << 23) #define DMA_SRC_DEC (1 << 23) #define DMA_SRC_FIX (2 << 23) #define DMA_ON (1 << 31) // STUDENT CODE: // Turns the whole screen red void redscreen(void) { volatile u16 color = RED; DMA[3].src = &color; DMA[3].dst = videoBuffer; // dimensions of gba screen DMA[3].cnt = DMA_ON | DMA_SRC_FIX | DMA_DST_INC | (240 * 160); }  Notice that the student code above which invokes DMA does not call a function we can mock out in an autograder, like how a student's malloc() could call a fake sbrk() --- it's just hitting memory at some weird address. You might think: well this is easy you dingus, just make the DMA macro point to some chunk of memory which you can check after their code returns! However, GBA DMA normally stops the CPU while it makes the copy, and that would not. Let's look at a less trivial example of DMA:  void redsquare(int width, int height) { volatile u16 color = RED; for (int row = 0; row < height; row++) { DMA[3].src = &color; DMA[3].dst = videoBuffer + row * 240; // dimensions of gba screen DMA[3].cnt = DMA_ON | DMA_SRC_FIX | DMA_DST_INC | width; } }  If we used the "magic DMA buffer" idea mentioned above, later DMA transfers would clobber src/dst/cnt values from earlier transfers. That is, we would only see the DMA results for the last row of the red square. So that ain't gonna work. Consequently, before this semester, we never autograded DMA. Instead, we would grep students' code for for loops to see if they were manipulating the videoBuffer pixel-by-pixel[^^1] without DMA and then would make sure the emulator showed the right image. But my goal in 2110 is to autograde as much as possible (especially because TAs would be grading the timed lab during or right before their final exams), so I decided to take this opportunity to flex so hard I caused permanent muscle damage. Abusing Virtual Memory to Solve Your Life Problems ================================================== What's a good way to make a memory access halt execution of a program, much like how setting the DMA_ON bit in the control register makes the DMA controller halt the CPU while the copy runs? A page fault! We can make DMA accesses pagefault by making the DMA macro point to some page marked as unwritable in the page table. Then our page fault handler runs and can read what they wrote! Except it's not that simple: when an instruction pagefaults, it doesn't actually _run_, so we can't see what students wrote to our fake DMA page. In other words, we can see what address the student tried to hit, but not what they wrote (yet). So instead, our page fault handler needs to mark the fake DMA page as writable in the page table, so that when the hardware re-tries the faulting instruction, it actually has an effect. But this re-introduces the original problem: how do we know what they wrote? Indeed, if our page fault handler makes the fake DMA page subsequently writable, then later DMA accesses won't pagefault, meaning we'll miss them and they'll clobber the written data in the fake DMA page. We're back to square one! My solution was to use two fake DMA pages and alternate between them. The DMA macro always points to an unwritable page, but when an instruction faults, we change the page it's trying to hit to writable. (Changing the page to which the DMA macro points will not change the page that the faulting instruction hits when the hardware retries the instruction because the old page will still be in some register, unchanged.) This solves our problem because every DMA access will pagefault, and we can see the data that the user wrote on DMA access $t$ when DMA access $t+1$ pagefaults. So the pseudocode for the tester would look like: page fault handler: make sure this is a fake DMA-related page fault read what the student wrote to the fake DMA page on the last faulting access. log it mark the page to which the DMA macro points as writable so the faulting instruction can run set the DMA macro to point to the alternate page and set it as unwritable tester code: set up page fault handler run student code check fake DMA page for data written by student on the final access exit A visualization for all this follows: {{< figure src="/img/blog/autograding-gba-dma/dma2.png" alt="DMA virtual memory hack diagram" width="100%" >}} On Linux, you can map pages with mmap(), change their permissions with mprotect(), and catch page faults by registering a SIGSEGV handler. Simulating DMA ============== Simulating DMA seems simple enough. When the the segfault handler or cleanup code reads what the student wrote to the fake DMA page, we see if they set the DMA_ON bit in the control register, and if they did, simulate it! But there's a complication here too. Returning to the earlier example, c void redscreen(void) { volatile u16 color = RED; DMA[3].src = &color; DMA[3].dst = videoBuffer; DMA[3].cnt = DMA_ON | DMA_SRC_FIX | 38400; }  we see the student (correctly) sets the source register to the address of a variable on the stack. But we can only check to see that they set DMA_ON in the control register _after_ their function returns because there are no later fake DMA segfaults --- this is true for the last DMA access in general. But accessing memory in the stack frame of a function which has returned is undefined behavior in C, because it's gonna get clobbered by subsequent function calls! Yikes! Speculative Copying, aka Copy Like a Madman ------------------------------------------- The instant the user sets the DMA_ON bit in the control register, we need to take a snapshot of memory around where the source register points. However, we have a few problems with when this copy needs to happen: 1. We only know they're writing to the control register, not that they're setting the DMA_ON bit 2. Similarly, we don't know how much they're copying 3. We don't how if they're using source decrement or source increment. (Source decrement means subtracting 2, or 4 if 32-bit DMAing is enabled, from the source address instead of adding after each one of the N transfers.) We can fix all three of these issues by simply copying super aggressively when the user sets the control register, even if we don't know what flags they set or size they requested. In particular, if they copy more than 38,400 pixels (shorts), they're doing the problem wrong anyway (since that's bigger than the actual videoBuffer!), so that provides an upper bound. And to handle source decrement, we can copy the 76,800 bytes on the left-hand and right-hand side of the source pointer, just in case. But now we have a new problem: what if, for example, they're using source decrement and they set the source pointer near the end of the heap? Then we're gonna segfault when trying to "snapshot" memory. The hack I found is to fork() and then copy until failure. Pseudocode: zero out source buffer fork() copy left-hand side of src pointer byte-by-byte until I segfault, stopping at the max copy size fork() copy right-hand side of src pointer byte-by-byte until I segfault, stopping at the max copy size wait() wait() So if _they_ copy too far, instead of a segfault in the mystery tester, they'll get black pixels (note the zeroing out of the source buffer at the top). This approach seems like it would be slow because of the sheer amount of copying, but it actually seems to perform pretty well. 🤷 The Autograding Part ==================== Now that we've got GBA DMA to _work_ on x86, we still have to autograde it. How? Let's make up an assignment in which the student simply draws an image in the top left of the GBA screen. That is, they have to complete:  void drawImage3(const u16 *image, int width, int height) { // Student writes this code for (int r = 0; r < height; r++) { DMA[3].src = image + width * r; DMA[3].dst = videoBuffer + 240 * r; DMA[3].cnt = DMA_ON | width; } }  What exactly is "right" here? Is this  void drawImage3(const u16 *image, int width, int height) { for (int r = 0; r < height; r++) for (int c = 0; c < width; c++) videoBuffer[240 * r + c] = image[width * r + c]; }  "right"? Probably not if the assignment is about DMA, but it deserves some partial credit if it works. How about this  void drawImage3(const u16 *image, int width, int height) { for (int r = 0; r < height; r++) { for (int c = 0; c < width; c++) { DMA[3].src = image + width * r + c; DMA[3].dst = videoBuffer + 240 * r + c; DMA[3].cnt = DMA_ON | 1; } } }  or this  void drawImage3(const u16 *image, int width, int height) { for (int r = height - 1; r >= 0; r++) { DMA[3].src = image + width * r + width - 1; DMA[3].dst = videoBuffer + 240 * r + width - 1; DMA[3].cnt = DMA_ON | DMA_SRC_DEC | DMA_DST_DEC | width; } }  strange example? I settled on the following: 1. Require they use DMA to draw pixels on the screen 2. Require they use the minimum number of DMA calls possible without violating #1 3. For each test input, have two separate tests: 1. One simulates DMA and tests that the videoBuffer is correct. This awards credit for not using DMA or misusing it (e.g. the single-pixel DMA example above) but still getting it working 2. The other compares the log of DMA calls with the expected, but does its best to allow perfectly valid cases like the double-decrementing one right above even though it does everything backwards from the simpler approach Regions of the Screen --------------------- For the second test above, I ended up taking the log of their DMA calls and converting each call to a generalized form of some ordered quadruplet $(i\_\text{start}, i\_\text{end}, v\_\text{start}, v\_\text{end}) \in \mathbb{Z}^4$, where the first two coordinates are the start/end indices in the image of the copy, and the next two are the same except for the videoBuffer. Then I normalized each copy in both the actual and expected lists of copies using the following two steps: First, if $i\_\text{start} > i\_\text{end}$ for any transfer $(i\_\text{start}, i\_\text{end}, v\_\text{start}, v\_\text{end})$, then I converted the quadruplet to $(i\_\text{end}, i\_\text{start}, v\_\text{end}, v\_\text{start})$ since that represents the same copy. Then I sorted the list of copies by start index. After normalizing both the actual and expected copies, I can compare them copy-by-copy in order and display a (hopefully) helpful error message to the student on the first one that does not match. Example Tester ============== If you're interested, [the GitHub repository for all this DMA stuff][4] has a complete example of a tester which uses the strategies I rambled about above. Beware, the student half, which runs a GBA emulator, requires the CS 2110 GBA toolchain, but that's not really the fun part --- the grader is! [^^1]: One student infamously evaded our grep checking by [unrolling their loop][3] by hand. That's hundreds of lines and we loved it [1]: https://www.coranac.com/tonc/text/dma.htm [2]: https://en.wikipedia.org/wiki/Direct_memory_access [3]: https://en.wikipedia.org/wiki/Loop_unrolling [4]: https://github.com/ausbin/dma-autograder-template/