oracle-pwrite-zfs-latency.d: Detect slow sync writes on ZFS

The first DTrace integration in the Oracle 12c database has been reverse engineered in great detail by Andrey Nikolaev. Finally we have good documentation of how the V$KERNEL_IO_OUTLIER view really works.

The standalone kernel_io_outlier.d DTrace script works great even on Solaris 10 and pre-12c Oracle databases, but only on raw devices and ASM. Oracle pushes ASM for good reasons, but sometimes one would like to use a file system.

We have some databases on ZFS, so it bothered me that the script does not work there. Therefore I tried to write a similar script for ZFS.

ZFS Queuing

I assume raw devices were also chosen as the first target because it is simpler to DTrace a pwrite() syscall end to end there than on ZFS. ZFS is an awesome file system that simplifies management and scaling a lot, but analyzing performance issues can easily become complex.
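
For comparison, end-to-end tracing of a pwrite() on a raw device is basically just a matter of timing the syscall itself. A minimal sketch of that idea (not part of the script; the execname == "oracle" filter is just an assumption for a typical Oracle installation):

#!/usr/sbin/dtrace -s
#pragma D option quiet

/* Time every pwrite()/pwrite64() issued by Oracle processes. */
syscall::pwrite*:entry
/execname == "oracle"/
{
        self->ts = timestamp;
}

/* Aggregate the end-to-end syscall latency (ns) per process. */
syscall::pwrite*:return
/self->ts/
{
        @lat[pid, curpsinfo->pr_psargs] = quantize(timestamp - self->ts);
        self->ts = 0;
}

dtrace:::END
{
        printa(@lat);
}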

ZFS adds some additional queues. Every I/O is transferred into the ZIO pipeline, where it can get compressed, checksummed, encrypted, deduplicated, etc. A lot is going on in the pipeline, but some DTrace analysis and source code study show that the sum of all zio_wait() execution times basically represents the time needed to bring the data of the pwrite() to stable disk storage. So if the lower I/O stack (e.g. the sd driver or the hard disks) causes higher latency, it should be visible in the zio_wait() times.
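
The following is a rough sketch of that measurement idea, not the actual oracle-pwrite-zfs-latency.d script: it sums the zio_wait() time that accumulates inside a single pwrite() syscall. Note that fbt probes are an unstable interface and may differ between Solaris releases.

#!/usr/sbin/dtrace -s
#pragma D option quiet

syscall::pwrite*:entry
/execname == "oracle"/
{
        self->pwrite_ts = timestamp;
        self->zio_wait_total = 0;
}

/* zio_wait() blocks until the issued ZIO has completed. */
fbt::zio_wait:entry
/self->pwrite_ts/
{
        self->zio_wait_ts = timestamp;
}

fbt::zio_wait:return
/self->zio_wait_ts/
{
        self->zio_wait_total = self->zio_wait_total +
            (timestamp - self->zio_wait_ts);
        self->zio_wait_ts = 0;
}

/* Report pwrites that took longer than 100 ms. */
syscall::pwrite*:return
/self->pwrite_ts && timestamp - self->pwrite_ts > 100000000/
{
        printf("%-28s pid %-6d total %8d us  zio_wait %8d us\n",
            curpsinfo->pr_psargs, pid,
            (timestamp - self->pwrite_ts) / 1000,
            self->zio_wait_total / 1000);
}

syscall::pwrite*:return
/self->pwrite_ts/
{
        self->pwrite_ts = 0;
        self->zio_wait_total = 0;
}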

Brendan Gregg wrote an excellent blog post about file system latency which also covers the ZIO pipeline.

The DTrace script: oracle-pwrite-zfs-latency.d

The script oracle-pwrite-zfs-latency.d provides the following information:

root@host# ./oracle-pwrite-zfs-latency.d 
2013 Oct  4 12:04:17
 TIMESTAMP         ZPOOL       PROCESS_NAME      PID              FILE_NAME TOTAL_LATENCY ZIO_IN ZIO_PIPELINE ZIO_WAIT ZIO_OUT
 ---------         -----       ------------      ---              --------- ------------- ------ ------------ -------- -------
1380888131  oracle_pool1    ora_ckpt_ABC1QA    21342    o1_mf_875rj35k_.ctl        108471     89    108363      107841    18
1380888244  oracle_pool2    ora_ckpt_ABC2PR    17576    o1_mf_71626spv_.ctl        125376     87    125267      124683    21
1380888257  oracle_pool2    ora_lgwr_ABC4QA    26177  o1_mf_1_7qx7zbsm_.log        162499     60    162380      161621    59
  • TOTAL_LATENCY: Total time in microseconds the I/O spent in the kernel.
  • TOTAL_LATENCY = ZIO_IN + ZIO_PIPELINE + ZIO_OUT
  • ZIO_IN: Time in microseconds spent until the I/O reaches the ZIO pipeline
  • ZIO_PIPELINE: Time in microseconds the I/O spent in the ZIO pipeline
  • ZIO_WAIT: Time in microseconds spent in zio_wait() within the ZIO pipeline; most likely this is the time spent waiting for the physical I/O to complete
  • ZIO_OUT: Time in microseconds spent after the I/O has left the ZIO pipeline

By default the script reports pwrites() that take longer than 100 ms. All “oracle” processes are traced; usually you care most about the database log writer (ora_lgwr_*). If ZIO_WAIT is far lower than ZIO_PIPELINE, the time is spent inside ZFS itself rather than waiting for the disks, so maybe ZFS is doing something stupid. Further analysis can be done e.g. with ziostacks.d or spasync.d.
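
If you do not have those scripts at hand, a crude first step (my own sketch, not taken from the scripts mentioned above) is to sample on-CPU kernel stacks while the slow writes are happening and look for ZFS/ZIO functions near the top:

#!/usr/sbin/dtrace -s
#pragma D option quiet

/* Sample on-CPU kernel stacks at 997 Hz (arg0 != 0 means kernel context). */
profile-997
/arg0/
{
        @stacks[stack()] = count();
}

tick-10s
{
        /* Print only the 10 hottest kernel stacks after 10 seconds. */
        trunc(@stacks, 10);
        printa(@stacks);
        exit(0);
}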

The script was tested on Solaris 10 and 11.1.