How do you know if a separate ZIL log device will help your ZFS performance?
A while back, I had a workload which I suspected had a fair amount of synchronous writes on a ZFS file system. The general recommendation for this case is to use a separate ZIL log, preferably on a device which has a nonvolatile write cache or a write-optimized SSD. However, once you add a separate log to a pool, you cannot easily remove it. So before you make that move, it is a better idea to look at your workload and answer the question, "how much ZIL write activity is generated by my workload?"
I posed that question to Roch, and quickly received a short dtrace script which basically sums the size of the data being sent to the ZIL over an interval. I took that one-liner and transformed it into a more useful tool I call zilstat.
|Usage: zilstat.ksh [gMt][-l linecount] [-p poolname] [interval [count]]|
-M # print numbers as megabytes (base 10)
-t # print timestamp
-p poolname # only look at poolname
-l linecount # print header every linecount lines (default=only once)
interval in seconds or "txg" for transaction group commit intervals
note: "txg" only appropriate when -p poolname is used
count will limit the number of intervals reported
zilstat.ksh # default output, 1 second samples
zilstat.ksh 10 # 10 second samples
zilstat.ksh 10 6 # print 6 x 10 second samples
zilstat.ksh -p rpool # show ZIL stats for rpool only
N-Bytes - data bytes written to ZIL over the interval
N-Bytes/s - data bytes/s written to ZIL over ther interval
N-Max-Rate - maximum data rate during any 1-second sample
B-Bytes - buffer bytes written to ZIL over the interval
B-Bytes/s - buffer bytes/s written to ZIL over ther interval
B-Max-Rate - maximum buffer rate during any 1-second sample
ops - number of synchronous iops per interval
<=4kB - number of synchronous iops <= 4kBytes per interval
4-32kB - number of synchronous iops 4-32kBytes per interval
>=32kB - number of synchronous iops >= 32kBytes per interval
note: data bytes are actual data, total bytes counts buffer size
The main question to be answered by zilstat is, “how much write activity is going to the ZIL?” If you run your workload and the results remain all balls (zeros), then it is a good indication that your workload would not benefit from a separate log because the workload does not generate synchronous writes. However, if you are seeing a lot of activity, then perhaps your workload would benefit from a separate log. In general, NFS and database servers are the beneficiaries of a separate log.
Here is a sample output from my desktop. I apologize if this wraps on your browser.
# ./zilstat.ksh -t 60
TIME N-Bytes N-Bytes/s N-Max-Rate B-Bytes B-Bytes/s B-Max-Rate ops <=4kB 4-32kB >=32kB
2009 Feb 6 14:25:22 0 0 0 0 0 0 0 0 0 0
2009 Feb 6 14:26:22 287368 4789 283064 618496 10308 524288 7 0 2 5
2009 Feb 6 14:27:22 10304 171 10304 94208 1570 94208 2 0 0 2
2009 Feb 6 14:28:22 223120 3718 213152 618496 10308 544768 7 0 2 5
2009 Feb 6 14:29:22 0 0 0 0 0 0 0 0 0 0
2009 Feb 6 14:30:22 3336 55 1264 28672 477 12288 7 7 0 0
2009 Feb 6 14:31:22 0 0 0 0 0 0 0 0 0 0
2009 Feb 6 14:32:22 149768 2496 145480 294912 4915 282624 8 1 2 5
2009 Feb 6 14:33:22 248 4 248 4096 68 4096 1 1 0 0
From this sample you can see that the amount of data being synchronously written by the applications is rather modest. This data is fit into buffers which is what is actually sent to storage. For my activity this afternoon, the maximum rate of buffer bytes written peaked at 544,768 in a one-second sample. I really didn't notice this as a performance impact, mostly because it is my desktop machine, and it already performs pretty well. Your workload might be very different and zilstat might be able to give you an idea of how much data would be written to a separate log. Clearly, if the numbers are mostly zero, most of the time, then a separate log will likely not help the performance of your workload.
In some cases, you may want to see the ZIL activity per transaction group (txg) commit. This will provide a better idea of the separate log device size. To see per-txg measurements, use the string "txg" instead of a number for the interval. You will also need to specify a pool with the txg option because there may be multiple active pools, and I'm not sure that there is a reasonable way to present that data concisely. If you want to look at multiple pools with the txg option concurrently, just run two instances of zilstat. Here is an example of txg output.
|# ./zilstat.ksh -p rpool txg|
waiting for txg commit...
txg N-Bytes N-Bytes/s N-Max-Rate B-Bytes B-Bytes/s B-Max-Rate ops <=4kB 4-32kB >=32kB
3873100 0 0 0 0 0 0 0 0 0 0
3873101 0 0 0 0 0 0 0 0 0 0
3873102 24160 805 24160 45056 1501 45056 6 1 5 0
3873103 0 0 0 0 0 0 0 0 0 0
I find that using the txg view with timestamps is very useful. However, the output is too long to fit nicely on a web page example.
Note that the time interval for txg commits is variable. By design, this should be no more than approximately 30 seconds. However, you may see them occur more frequently. For example, if you are using the Time Slider feature of OpenSolaris, then when ZFS takes snapshots it forces a txg commit. This is another reason why recording timestamps with the -t option is useful.
Note: a packaging bug in OpenSolaris may affect your ability to run DTrace scripts. If you run zilstat and DTrace complains about "Invalid library dependency in /usr/lib/dtrace/iscsit.d: /usr/lib/dtrace/iscsi.d" then you will need to install the SUNWiscsitgt package. See CR 6855460 for details.