The first DTrace integration in the Oracle 12c database has finally been reverse engineered in great detail by Andrey Nikolaev. At last we have good documentation of how the V$KERNEL_IO_OUTLIER view really works.
The standalone kernel_io_outlier.d DTrace script works great even on Solaris 10 and pre-12c Oracle databases, but only on raw devices and ASM. Oracle promotes ASM for good reasons, but sometimes one prefers to use a filesystem.
We have some databases on ZFS, so I was jealous that the script does not work with ZFS. Therefore I tried to write a similar script for ZFS.
ZFS Queuing
I assume raw devices were also chosen as the first target because it is simpler to DTrace a pwrite() syscall end to end there than on ZFS. ZFS is an awesome file system that simplifies management and scaling a lot, but analyzing performance issues can easily become complex.
ZFS adds some additional queues. Every I/O is handed to the ZIO pipeline, where it may get compressed, checksummed, encrypted, deduplicated, and so on. A lot goes on in the pipeline, but some DTrace analysis and source code study show that the sum of all zio_wait() execution times essentially represents the time needed to bring the data of the pwrite() to stable disk storage. So if the lower I/O stack (e.g. the sd driver or the hard disks) causes higher latency, it should show up in the zio_wait() times.
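To illustrate the idea, here is a minimal sketch (not the actual oracle-pwrite-zfs-latency.d script) that sums the zio_wait() time spent inside each pwrite() of an Oracle process. It uses the syscall and fbt providers; probe availability and the exact syscall name (pwrite vs. pwrite64) depend on the Solaris release, so treat it as an illustration rather than a drop-in tool:

#!/usr/sbin/dtrace -s
#pragma D option quiet

/* Sketch: per-pwrite sum of zio_wait() time for Oracle processes. */

syscall::pwrite:entry
/execname == "oracle" || substr(execname, 0, 4) == "ora_"/
{
        self->pwrite_ts = timestamp;    /* pwrite() entry time */
        self->zio_wait_sum = 0;         /* accumulated zio_wait() time */
}

fbt::zio_wait:entry
/self->pwrite_ts/
{
        self->zw_ts = timestamp;        /* zio_wait() entry time */
}

fbt::zio_wait:return
/self->zw_ts/
{
        self->zio_wait_sum += timestamp - self->zw_ts;
        self->zw_ts = 0;
}

syscall::pwrite:return
/self->pwrite_ts/
{
        printf("%-16s pid %-6d total %8d us  zio_wait %8d us\n",
            execname, pid,
            (timestamp - self->pwrite_ts) / 1000,
            self->zio_wait_sum / 1000);
        self->pwrite_ts = 0;
        self->zio_wait_sum = 0;
}

If the zio_wait() portion accounts for nearly the whole pwrite() latency, the time is being spent waiting for the lower I/O stack rather than inside ZFS itself.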
Brendan Gregg wrote an excellent blog post about file system latency which also covers the ZIO pipeline.
The DTrace script: oracle-pwrite-zfs-latency.d
The script oracle-pwrite-zfs-latency.d provides the following information:
root@host# ./oracle-pwrite-zfs-latency.d
2013 Oct 4 12:04:17
TIMESTAMP   ZPOOL         PROCESS_NAME     PID    FILE_NAME              TOTAL_LATENCY  ZIO_IN  ZIO_PIPELINE  ZIO_WAIT  ZIO_OUT
---------   -----         ------------     ---    ---------              -------------  ------  ------------  --------  -------
1380888131  oracle_pool1  ora_ckpt_ABC1QA  21342  o1_mf_875rj35k_.ctl           108471      89        108363    107841       18
1380888244  oracle_pool2  ora_ckpt_ABC2PR  17576  o1_mf_71626spv_.ctl           125376      87        125267    124683       21
1380888257  oracle_pool2  ora_lgwr_ABC4QA  26177  o1_mf_1_7qx7zbsm_.log         162499      60        162380    161621       59
- TOTAL_LATENCY: Total time in microseconds the I/O spent in the kernel.
- TOTAL_LATENCY = ZIO_IN + ZIO_PIPELINE + ZIO_OUT (ZIO_WAIT is not a separate term, since it is contained in ZIO_PIPELINE; see the check below)
- ZIO_IN: Time in microseconds spent until the I/O reaches the ZIO pipeline
- ZIO_PIPELINE: Time in microseconds the I/O spent in the ZIO pipeline
- ZIO_WAIT: Time in microseconds spent in zio_wait() within the ZIO pipeline; most likely this is time spent waiting for the physical I/O to complete
- ZIO_OUT: Time in microseconds spent after the I/O has left the ZIO pipeline
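As a quick sanity check with the first sample row above:

89 + 108363 + 18 = 108470 us, which matches TOTAL_LATENCY = 108471 us to within 1 us (presumably lost to truncation when converting to whole microseconds); ZIO_WAIT (107841 us) is not added, because it is already part of ZIO_PIPELINE.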
By default the script reports pwrites() which take longer than 100 ms. All "oracle" processes are traced; usually you are most interested in the database log writer (ora_lgwr_*). If ZIO_WAIT is far lower than ZIO_PIPELINE, then maybe ZFS is doing something stupid. Further analysis can be done e.g. with ziostacks.d or spasync.d.
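For illustration, here is a hedged sketch of how such a threshold and the latency breakdown could be expressed. It is not the logic of oracle-pwrite-zfs-latency.d itself: the real script presumably instruments the actual pipeline entry and exit functions, while this sketch uses the first zio_wait() entry and the last zio_wait() return as rough proxies for the pipeline boundaries:

#!/usr/sbin/dtrace -s
#pragma D option quiet

/* Sketch: report pwrites of Oracle processes slower than 100 ms, broken down
 * into time before, inside, and after the ZIO pipeline. first_zio/last_zio
 * are only rough proxies for the real pipeline boundaries. */

syscall::pwrite:entry
/execname == "oracle" || substr(execname, 0, 4) == "ora_"/
{
        self->pw = timestamp;
        self->first_zio = 0;
        self->last_zio = 0;
}

fbt::zio_wait:entry
/self->pw && self->first_zio == 0/
{
        self->first_zio = timestamp;    /* proxy for pipeline entry */
}

fbt::zio_wait:return
/self->pw/
{
        self->last_zio = timestamp;     /* proxy for pipeline exit */
}

syscall::pwrite:return
/self->pw && self->last_zio && (timestamp - self->pw) > 100000000/
{
        printf("%-16s zio_in %6d us  zio_pipeline %8d us  zio_out %6d us\n",
            execname,
            (self->first_zio - self->pw) / 1000,
            (self->last_zio - self->first_zio) / 1000,
            (timestamp - self->last_zio) / 1000);
}

syscall::pwrite:return
/self->pw/
{
        self->pw = 0;
        self->first_zio = 0;
        self->last_zio = 0;
}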
The script was tested on Solaris 10 and 11.1.