Now that we’ve learned what a translator looks like and how to build one, it’s time to run one and actually watch it work. The best way to do this is good old-fashioned gdb, as follows (using some of the examples from last time).

1
2
3
4
5
6
7
[root@gfs-i8c-01 xlator_example]# gdb glusterfs
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
...
(gdb) r --debug -f my.vol /play/import
Starting program: /usr/sbin/glusterfs --debug -f my.vol /play/import
...
[2011-11-23 11:23:16.495516] I [fuse-bridge.c:2971:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.13

If you get to this point, your glusterfs client process is already running. You can go to another window to see the mountpoint, do file operations, etc.

[root@gfs-i8c-01 ~]# df /play/import
Filesystem           1K-blocks      Used Available Use% Mounted on
/root/xlator_example/my.vol
                     114506240   2643968 106045568   3% /play/import
[root@gfs-i8c-01 ~]# ls /play/import
a_file
[root@gfs-i8c-01 ~]# cat /play/import/a_file
hello

Now let’s interrupt the process and see where we are.

1
2
3
4
5
6
7
8
9
10
11
^C
Program received signal SIGINT, Interrupt.
0x0000003a0060b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) info threads
  5 Thread 0x7fffeffff700 (LWP 27206)  0x0000003a002dd8c7 in readv ()
   from /lib64/libc.so.6
  4 Thread 0x7ffff50e3700 (LWP 27205)  0x0000003a0060b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3 Thread 0x7ffff5f02700 (LWP 27204)  0x0000003a0060b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2 Thread 0x7ffff6903700 (LWP 27203)  0x0000003a0060f245 in sigwait ()
   from /lib64/libpthread.so.0
* 1 Thread 0x7ffff7957700 (LWP 27196)  0x0000003a0060b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

Like any non-toy server, this one has multiple threads. What are they all doing? Honestly, even I don’t know. Thread 1 turns out to be in event_dispatch_epoll, which means it’s the one handling all of our network I/O. Note that with socket multi-threading patch this will change, with one thread in socket_poller per connection. Thread 2 is in glusterfs_sigwaiter which means signals will be isolated to that thread. Thread 3 is in syncenv_task, so it’s a worker process for synchronous requests such as those used by the rebalance and repair code. Thread 4 is in janitor_get_next_fd, so it’s waiting for a chance to close no-longer-needed file descriptors on the local filesystem. (I admit I had to look that one up, BTW.) Lastly, thread 5 is in fuse_thread_proc, so it’s the one fetching requests from our FUSE interface. You’ll often see many more threads than this, but it’s a pretty good basic set. Now, let’s set a breakpoint so we can actually watch a request.

1
2
3
4
(gdb) b rot13_writev
Breakpoint 1 at 0x7ffff50e4f0b: file rot-13.c, line 119.
(gdb) c
Continuing.

At this point we go into our other window and do something that will involve a write.

1
2
3
4
5
6
7
[root@gfs-i8c-01 ~]# echo goodbye > /play/import/another_file
(back to the first window)
[Switching to Thread 0x7fffeffff700 (LWP 27206)]
 
Breakpoint 1, rot13_writev (frame=0x7ffff6e4402c, this=0x638440, fd=0x7ffff409802c,
    vector=0x7fffe8000cd8, count=1, offset=0, iobref=0x7fffe8001070) at rot-13.c:119
119             rot_13_private_t *priv = (rot_13_private_t *)this->private;

Remember how we built with debugging symbols enabled and no optimization? That will be pretty important for the next few steps. As you can see, we’re in rot13_writev, with several parameters.

  • frame is our always-present frame pointer for this request. Also, frame->local will point to any local data we created and attached to the request ourselves.
  • this is a pointer to our instance of the rot-13 translator. You can examine it if you like to see the name, type, options, parent/children, inode table, and other stuff associated with it.
  • fd is a pointer to a file-descriptor object (fd_t, not just a file-descriptor index which is what most people use “fd” for). This in turn points to an inode object (inode_t) and we can associate our own rot-13-specific data with either of these.
  • vector and count together describe the data buffers for this write, which we’ll get to in a moment.
  • offset is the offset into the file at which we’re writing.
  • iobref is a buffer-reference object, which is used to track the life cycle of buffers containing read/write data. If you look closely, you’ll notice that vector[0].iov_base points to the same address as iobref->iobrefs[0].ptr, which should give you some idea of the inter-relationships between vector and iobref.

OK, now what about that vector? We can use it to examine the data being written, like this.

(gdb) p vector[0]
$2 = {iov_base = 0x7ffff7936000, iov_len = 8}
(gdb) x/s 0x7ffff7936000
0x7ffff7936000:  "goodbye\n"

It’s not always safe to view this data as a string, because it might just as well be binary data, but since we’re generating the write this time it’s safe and convenient. With that knowledge, let’s step through things a bit.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
(gdb) s
120             if (priv->encrypt_write)
(gdb)
121                     rot13_iovec (vector, count);
(gdb)
rot13_iovec (vector=0x7fffe8000cd8, count=1) at rot-13.c:57
57              for (i = 0; i < count; i++) {
(gdb)
58                      rot13 (vector[i].iov_base, vector[i].iov_len);
(gdb)
rot13 (buf=0x7ffff7936000 "goodbye\n", len=8) at rot-13.c:45
45              for (i = 0; i < len; i++) {
(gdb)
46                      if (buf[i] >= 'a' && buf[i] <= 'z')
(gdb)
47                              buf[i] = 'a' + ((buf[i] - 'a' + 13) % 26);

Here we’ve stepped into rot13_iovec, which iterates through our vector calling rot13, which in turn iterates through the characters in that chunk doing the rot-13 operation if/as appropriate. This is pretty straightforward stuff, so let’s skip to the next interesting bit.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
(gdb) fin
Run till exit from #0  rot13 (buf=0x7ffff7936000 "goodbye\n", len=8) at rot-13.c:47
rot13_iovec (vector=0x7fffe8000cd8, count=1) at rot-13.c:57
57              for (i = 0; i < count; i++) {
(gdb) fin
Run till exit from #0  rot13_iovec (vector=0x7fffe8000cd8, count=1) at rot-13.c:57
rot13_writev (frame=0x7ffff6e4402c, this=0x638440, fd=0x7ffff409802c,
    vector=0x7fffe8000cd8, count=1, offset=0, iobref=0x7fffe8001070) at rot-13.c:123
123             STACK_WIND (frame,
(gdb) b 129
Breakpoint 2 at 0x7ffff50e4f35: file rot-13.c, line 129.
(gdb) b rot13_writev_cbk
Breakpoint 3 at 0x7ffff50e4db3: file rot-13.c, line 106.
(gdb) c

So we’ve set breakpoints on both the callback and the statement following the STACK_WIND. Which one will we hit first?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Breakpoint 3, rot13_writev_cbk (frame=0x7ffff6e4402c, cookie=0x7ffff6e440d8,
    this=0x638440, op_ret=8, op_errno=0, prebuf=0x7fffefffeca0,
    postbuf=0x7fffefffec30) at rot-13.c:106
106             STACK_UNWIND_STRICT (writev, frame, op_ret, op_errno, prebuf, postbuf);
(gdb) bt
#0  rot13_writev_cbk (frame=0x7ffff6e4402c, cookie=0x7ffff6e440d8, this=0x638440,
    op_ret=8, op_errno=0, prebuf=0x7fffefffeca0, postbuf=0x7fffefffec30)
    at rot-13.c:106
#1  0x00007ffff52f1b37 in posix_writev (frame=0x7ffff6e440d8,
    this=<value optimized out>, fd=<value optimized out>,
    vector=<value optimized out>, count=1, offset=<value optimized out>,
    iobref=0x7fffe8001070) at posix.c:2217
#2  0x00007ffff50e513e in rot13_writev (frame=0x7ffff6e4402c, this=0x638440,
    fd=0x7ffff409802c, vector=0x7fffe8000cd8, count=1, offset=0,
    iobref=0x7fffe8001070) at rot-13.c:123

Surprise! We’re in rot13_writev_cbk now, called (indirectly) while we’re still in rot13_writev before STACK_WIND returns (still at rot-13.c:123). If you did any request cleanup here, then you need to be careful about what you do in the remainder of rot13_writev because data may have been freed etc. It’s tempting to say you should just do the cleanup in rot13_writev after the STACK_WIND, but that’s not valid because it’s also possible that some other translator returned without calling STACK_UNWIND – i.e. before rot13_writev is called, so then it would be the one getting null-pointer errors instead. To put it another way, the callback and the return from STACK_WIND can occur in either order or even simultaneously on different threads. Even if you were to use reference counts, you’d have to make sure to use locking or atomic operations to avoid races, and it’s not worth it. Unless you really understand the possible flows of control and know what you’re doing, it’s better to do cleanup in the callback and nothing after STACK_WIND.

At this point all that’s left is a STACK_UNWIND and a return. The STACK_UNWIND invokes our parent’s completion callback, and in this case our parent is FUSE so at that point the VFS layer is notified of the write being complete. Finally, we return through several levels of normal function calls until we come back to fuse_thread_proc, which waits for the next request.

So that’s it. For extra fun, you might want to repeat this exercise by stepping through some other call – stat or setxattr might be good choices – but you’ll have to use a translator that actually implements those calls to see much that’s interesting. Then you’ll pretty much know everything I knew when I started writing my first for-real translators, and probably even a bit more. I hope you’ve enjoyed this series, or at least found it useful, and if you have any suggestions for other topics I should cover please let me know (via comments or email, IRC or Twitter).

Of interest

News to note—just for you