Using DTrace to find block sizes of ZFS, NFS, and iSCSI

A member of our Storage Team, Cloud Engineer Goran Mekic, shared that the documentation and books on DTrace were not terribly helpful in figuring out which function he should trace. Goran said he would have been extremely happy to have a blog post like the one below before starting his task, so he offered to share the process, because the process matters even more than the measurement itself.

So in that spirit of sharing, you can download these steps and walk through Goran's instructions on finding block sizes of ZFS, NFS, and iSCSI using DTrace below.

The first step is finding the right probe. Sometimes you can simply filter the list of existing probes and hope the output contains the right one.

dtrace -l | grep zfs | grep write

76274        fbt               zfs                         zfs_write entry
76275        fbt               zfs                         zfs_write return

Let's find that function in the source code. To speed up the search, it is recommended to use ripgrep instead of grep.

cd /usr/src
rg '^zfs_write'
sys/contrib/openzfs/module/zfs/zfs_vnops.c
406:zfs_write(znode_t *zp, zfs_uio_t *uio, int ioflag, cred_t *cr)

sys/contrib/openzfs/module/os/linux/zfs/zfs_vnops_os.c
344:zfs_write_simple(znode_t *zp, const void *data, size_t len,

sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c
648:zfs_write_simple(znode_t *zp, const void *data, size_t len,

Note that I searched for lines where "zfs_write" appears at the beginning of the line. That is because FreeBSD's style(9) puts the function name of a definition at the start of a line, like this:

/*
 * Write the bytes to a file.
 *
 *    IN:     zp      - znode of file to be written to.
 *            uio     - structure supplying write location, range info,
 *                      and data buffer.
 *            ioflag  - O_APPEND flag set if in append mode.
 *                      O_DIRECT flag; used to bypass page cache.
 *            cr      - credentials of caller.
 *
 *    OUT:    uio     - updated offset and range.
 *
 *    RETURN: 0 if success
 *            error code if failure
 *
 * Timestamps:
 *    ip - ctime|mtime updated if byte count > 0
 */
int
zfs_write(znode_t *zp, zfs_uio_t *uio, int ioflag, cred_t *cr)
{

}

Searching only for lines that start with zfs_write reduces the number of matches ripgrep returns. We see that the second argument, or args[1] in DTrace, is of type zfs_uio_t. In most cases, when you see a _t type you are actually dealing with a "typedef struct <name> { ... } <name>_t" kind of construct, hence the following command to find it.

rg '\} zfs_uio_t'
sys/contrib/openzfs/lib/libspl/include/sys/uio.h
76:} zfs_uio_t;

sys/contrib/openzfs/include/os/linux/spl/sys/uio.h
83:} zfs_uio_t;

sys/contrib/openzfs/include/os/freebsd/spl/sys/uio.h
44:} zfs_uio_t;

When we peek inside FreeBSD's uio.h we find the following:

typedef struct zfs_uio {
    struct uio    *uio;
} zfs_uio_t;

Then we find “struct uio”:

rg 'struct uio \{'
share/man/man9/uio.9
41:struct uio {

share/doc/papers/fsinterface/fsinterface.ms
724:struct uio {

share/doc/papers/fsinterface/slides.t
116:struct uio {

sys/sys/uio.h
55:struct uio {

In the last file, this is the definition of the structure:

struct uio {
    struct  iovec *uio_iov;      /* scatter/gather list */
    int     uio_iovcnt;          /* length of scatter/gather list */
    off_t   uio_offset;          /* offset in target object */
    ssize_t uio_resid;           /* remaining bytes to process */
    enum    uio_seg uio_segflg;  /* address space */
    enum    uio_rw uio_rw;       /* operation */
    struct  thread *uio_td;      /* owner */
};

What we want is uio_resid; according to the comment, that is the number of bytes remaining to process. So our probe looks like this:

fbt:zfs:zfs_write:entry
{
    @zfsw = quantize(args[1]->uio->uio_resid);
}
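
The probe above is written as a script fragment. If you just want to try it interactively, one way (not the only one) is to run it as a one-liner:

dtrace -n 'fbt:zfs:zfs_write:entry { @zfsw = quantize(args[1]->uio->uio_resid); }'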

This will give us the distribution of write sizes. On a test machine that is otherwise idle, the only process writing anything should be

dd if=/dev/random of=/tmp/output.dd bs=16k

By writing random data we ensure that no cache is going to interfere with our measurement. The output of the probe while running dd is the following:

As there are other processes writing on the same machine, we get more than just 16k writes; one way to filter those out is shown below.
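
If you want to narrow the measurement down to the dd process alone, a predicate on execname can be added. This variant is a sketch and not part of the original measurement:

fbt:zfs:zfs_write:entry
/ execname == "dd" /
{
    @zfsw = quantize(args[1]->uio->uio_resid);
}

Let's try to do the same for NFS. First, we need to set up NFS.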

cat /etc/rc.conf
mountd_enable="YES"
mountd_flags="-r"
nfs_server_enable="YES"
nfsv4_server_enable="YES"
rpcbind_enable="YES"

You will need to (re)start the mountd, nfsd, and rpcbind services, create a ZFS dataset, and share it over NFS:

service nfsd restart
service rpcbind restart
service mountd restart

zfs create zroot/var/nfs
zfs set sharenfs=on zroot/var/nfs
mount -t nfs -o rsize=8192,wsize=8192 127.0.0.1:/var/nfs /mnt

As the default ZFS block size is 4kB, I chose 8kB on the NFS client to see what would happen. To find the probe, let's try the same technique we used for ZFS:

dtrace -l | grep nfs | grep write

51671        fbt            kernel                      nfsvno_write entry
51672        fbt            kernel                      nfsvno_write return

I must point out that we were really lucky searching for the probes. If luck is on your side, embrace it with both hands and forget about more elaborate techniques! Now, let’s see what the actual function looks like.

rg '^nfsvno_write'
sys/fs/nfsserver/nfs_nfsdport.c
1124:nfsvno_write(struct vnode *vp, off_t off, int retlen, int *stable,

The third argument, or args[2] in DTrace, looks like what we need. Let's try this:

fbt:kernel:nfsvno_write:entry
{
    @nfsw = quantize(args[2]);
}

And we get the following distribution:

Great! No matter the ZFS preferred block size of 4kB or the dd bs=16k parameter, NFS respects the wsize=8192 mount option. If we combine the probes into a DTrace script like this:

fbt:zfs:zfs_write:entry
{
    @zfsw = quantize(args[1]->uio->uio_resid);
}

fbt:kernel:nfsvno_write:entry
{
    @nfsw = quantize(args[2]);
}

With the same dd command we get the following output:

We see that NFS and ZFS align perfectly, down to the last digit. This is all great in the laboratory, but in production you probably have multiple datasets and NFS clients. Let's find out which dataset the data was actually written to. To do that we will use the first argument of zfs_write, as the comment indicates that is the destination. I already showed how you can use ripgrep and an editor to figure out the arguments, but now let's use a shortcut. DTrace comes with a print() function which can tell you what kind of struct the first argument is:

dtrace -n 'fbt:zfs:zfs_write:entry { print(args[0]); exit(0); }'
CPU     ID                    FUNCTION:NAME
53  76274                  zfs_write:entry znode_t * 0xfffff80be9e0b1d0

Nice, we know it's a pointer, but what's inside? Add * in front of "args" and find out:

dtrace -n 'fbt:zfs:zfs_write:entry { print(*args[0]); exit(0); }'
CPU     ID                    FUNCTION:NAME
43  76274                  zfs_write:entry znode_t {
    uint64_t z_id = 0x290200
    kmutex_t z_lock = {
        struct lock_object lock_object = {
            const char *lo_name = 0xffffffff8267282a
            u_int lo_flags = 0x22710000
            u_int lo_data = 0
            struct witness *lo_witness = 0
        }
        volatile uintptr_t sx_lock = 0x1
    }
    krwlock_t z_parent_lock = {
        struct lock_object lock_object = {
            const char *lo_name = 0xdeadc0dedeadc0de
            u_int lo_flags = 0xdeadc0de
            u_int lo_data = 0xdeadc0de
            struct witness *lo_witness = 0xdeadc0dedeadc0de
        }
        volatile uintptr_t sx_lock = 0xdeadc0dedeadc0de
    }
    krwlock_t z_name_lock = {
        struct lock_object lock_object = {
            const char *lo_name = 0xdeadc0dedeadc0de
            u_int lo_flags = 0xdeadc0de
            u_int lo_data = 0xdeadc0de
            struct witness *lo_witness = 0xdeadc0dedeadc0de
        }
        volatile uintptr_t sx_lock = 0xdeadc0dedeadc0de
    }
    zfs_dirlock_t *z_dirlocks = 0xdeadc0dedeadc0de
    zfs_rangelock_t z_rangelock = {
        avl_tree_t rl_tree = {
            struct avl_node *avl_root = 0
            int (*)() avl_compar = zfs.ko`zfs_rangelock_compare
            size_t avl_offset = 0x8
            ulong_t avl_numnodes = 0
        }
        kmutex_t rl_lock = {
            struct lock_object lock_object = {
                const char *lo_name = 0xffffffff826824bd
                u_int lo_flags = 0x22710000
                u_int lo_data = 0
                struct witness *lo_witness = 0
            }
            volatile uintptr_t sx_lock = 0x1
        }
        zfs_rangelock_cb_t *rl_cb = 0xffffffff823f1810
        void *rl_arg = 0xfffff801637dfae0
    }
    boolean_t z_unlinked = 0
    boolean_t z_atime_dirty = 0
    boolean_t z_zn_prefetch = 0xdeadc0de
    boolean_t z_is_sa = 0x1
    boolean_t z_is_ctldir = 0xdeadc0de
    boolean_t z_suspended = 0xdeadc0de
    uint_t z_blksz = 0x10c00
    uint_t z_seq = 0x7a473d
    uint64_t z_mapcnt = 0
    uint64_t z_dnodesize = 0x200
    uint64_t z_size = 0x10ae1
    uint64_t z_pflags = 0x40800000004
    uint32_t z_sync_cnt = 0
    uint32_t z_sync_writes_cnt = 0
    uint32_t z_async_writes_cnt = 0
    mode_t z_mode = 0x81a4
    kmutex_t z_acl_lock = {
        struct lock_object lock_object = {
            const char *lo_name = 0xffffffff8264b421
            u_int lo_flags = 0x22710000
            u_int lo_data = 0
            struct witness *lo_witness = 0
        }
        volatile uintptr_t sx_lock = 0x1
    }
    zfs_acl_t *z_acl_cached = 0
    krwlock_t z_xattr_lock = {
        struct lock_object lock_object = {
            const char *lo_name = 0xffffffff8264b431
            u_int lo_flags = 0x22710000
            u_int lo_data = 0
            struct witness *lo_witness = 0
        }
        volatile uintptr_t sx_lock = 0x1
    }
    nvlist_t *z_xattr_cached = 0
    uint64_t z_xattr_parent = 0
    uint64_t z_projid = 0
    list_node_t z_link_node = {
        struct list_node *list_next = 0xfffff802f3600c50
        struct list_node *list_prev = 0xfffff802af923a80
    }
    sa_handle_t *z_sa_hdl = 0xfffff8029f5dca68
    struct zfsvfs *z_zfsvfs = 0xfffff80146e92000
    vnode_t *z_vnode = 0xfffff802c89bac40
    char *z_cached_symlink = 0
    uint64_t z_uid = 0x3e9
    uint64_t z_gid = 0x3e9
    uint64_t z_gen = 0x3b71a
    uint64_t [2] z_atime = [ 0x64031489, 0x37bba788 ]
    uint64_t z_links = 0x1
}

Now that’s a lot, but with some knowledge, you can figure out that vnode_t is what you need next:

dtrace -n 'fbt:zfs:zfs_write:entry { print(*args[0]->z_vnode); exit(0); }'

struct mount *v_mount = 0xfffffe025cb56100

dtrace -n 'fbt:zfs:zfs_write:entry { print(*args[0]->z_vnode->v_mount); exit(0); }'

struct statfs mnt_stat = {

    char [1024] f_mntonname = [ "/usr/home" ]

}

What we need in the end is:

fbt:zfs:zfs_write:entry
{
    mount = stringof(args[0]->z_vnode->v_mount->mnt_stat.f_mntonname);
    @["zfs", mount] = quantize(args[1]->uio->uio_resid);
}

Following the same principle of finding the vnode_t and getting to f_mntonname, the whole DTrace script is:

fbt:kernel:nfsvno_write:entry
{
    mount = stringof(args[0]->v_mount->mnt_stat.f_mntonname);
    @["nfs", mount] = quantize(args[2]);
}

fbt:zfs:zfs_write:entry
{
    mount = stringof(args[0]->z_vnode->v_mount->mnt_stat.f_mntonname);
    @["zfs", mount] = quantize(args[1]->uio->uio_resid);
}
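
To actually run the combined script, one option is to save it to a file and pass that file to dtrace with the -s flag; the file name below is just an example:

dtrace -s blocksize.d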

If you look at zfs_read and nfsvno_read, you'll notice they have the same arguments as their write counterparts, which means you can just replace "write" with "read" and you have your ZFS and NFS read probes. For iSCSI we need to do a lot more detective work. First, let's prepare a ZFS dataset.
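
The preparation itself is not shown in the original steps; for a zvol whose name matches the one appearing in the output later on, the creation command would look roughly like this (the 10G size is only an assumption):

zfs create -V 10G zroot/var/vm1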

Let’s try to use zvol_log_write.

cd /usr/src
rg '^zvol_log_write'
sys/contrib/openzfs/module/zfs/zvol.c
530:zvol_log_write(zvol_state_t *zv, dmu_tx_t *tx, uint64_t offset, uint64_t size, int sync)

So we need args[3] for the size, and something in zv to identify the zvol.

dtrace -n 'fbt:zfs:zvol_log_write:entry { print(*args[0]); exit(0); }'
CPU     ID                    FUNCTION:NAME
62  76290             zvol_log_write:entry zvol_state_t {
    char [256] zv_name = [ "zroot/var/vm1" ]

So the final probe is:

fbt:zfs:zvol_log_write:entry
{
    @[args[0]->zv_name] = quantize(args[3]);
}

The command used to write to zvol is:

dd if=/dev/random of=/dev/zvol/zroot/var/vm1 bs=16k

Let’s configure iSCSI first.

cat /etc/ctl.conf
portal-group pg0 {
    discovery-auth-group no-authentication
    listen 0.0.0.0:3260
}

target iqn.2012-06.com.example:target0 {
    alias "Example target"
    auth-group no-authentication
    portal-group pg0
    lun 0 {
        path /dev/zvol/zroot/var/vm1
    }
}

service ctld enable
service ctld start

It is advisable to use another machine as the client, so that your client and server results do not interfere with each other.

service iscsid enable
service iscsid start
iscsictl -A -p <ip> -t iqn.2012-06.com.example:target0
gpart create -s gpt da0
gpart add -t freebsd-ufs /dev/da0
newfs -U /dev/da0p1
mkdir -p /media/da0p1
mount /dev/da0p1 /media/da0p1

You will get a /dev/daN device; replace da0 above with whatever device you got. For iSCSI, the first mistake I made was assuming the probe would come from the fbt:iscsi module. That was wrong! I needed fbt:cfiscsi. When you are unsure which probe is the right one, you can do a little test: tell DTrace to count which probes fire the most while you generate load.
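
A minimal sketch of such a counting test, assuming only that the interesting function names contain "iscsi" (run it while writing to the target and look for the entries with the highest counts):

dtrace -n 'fbt::*iscsi*:entry { @[probemod, probefunc] = count(); }'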

cd /usr/src
rg '^cfiscsi_receive_callback'
sys/cam/ctl/ctl_frontend_iscsi.c
327:cfiscsi_receive_callback(struct icl_pdu *request)

Using print() like before, we get our probe. It is a bit more special than the ones we used before because of how iSCSI works: if there is no data carried in the request, ip_data_len will be zero, so we filter out those calls.

fbt:cfiscsi:cfiscsi_receive_callback:entry
/ args[0]->ip_data_len > 0 /
{
    @ = quantize(args[0]->ip_data_len);
}

Combining the probes gives us:

fbt:cfiscsi:cfiscsi_receive_callback:entry
/ args[0]->ip_data_len > 0 /
{
    @iscsiw = quantize(args[0]->ip_data_len);
}

fbt:zfs:zvol_log_write:entry
{
    @zvolw[args[0]->zv_name] = quantize(args[3]);
}

When we do the math, we can conclude that the results are pretty close, and that the difference was probably caused by the rounding of the quantize() function, which groups values into power-of-two buckets:

262144 * 1965 = 515112960
1048576 * 492 = 515899392
difference    =    786432

A similar combination of counting probes, grepping for "zvol", reading a bit of source code, and printing arguments gives the following probe for zvol reads:

fbt:zfs:zvol_geom_bio_start:entry
{
    zv = (zvol_state_t *)args[0]->bio_to->private;
    @[zv->zv_name] = quantize(args[0]->bio_length);
}

To make it easier, you can "pkg install automount" and "service devd restart" so that the iSCSI target is mounted as soon as /dev/daN is available.

Sometimes you just can't find the proper probe, and that's the case with iSCSI reads. All the probes that fire for reads also fire for writes, so I couldn't find a way to distinguish the two.

And there you have it! I hope this is helpful.
