How to troubleshoot slow Puppet Facter runs

Facter is Puppet’s component for discovering information about the hardware and the operating system where the Puppet agents runs. The many auto-discovered variables (“facts”) are convenient, but facter is also a potential source for slow puppet agent executions. This post shows some tricks for troubleshooting the performance of facter.

With the simple unix utility time, you can measure the full execution time of facter. For example to know if an agent runs on a virtual hardware, the is_virtual “fact” is populated:

# time facter is_virtual
false

real    0m4.008s
user    0m0.853s
sys     0m2.594s

In this example we see facter needs four seconds to discover is_virtual=false. But what happens in this four long seconds? The option --timing shows how long each fact needs to get discovered. It also shows that facter uses several other facts to resolve is_virtual, because depending on these facts other CLI commands are used for the discovery.

# facter --timing is_virtual
kernel: 7.74ms
operatingsystem: 18.32ms
osfamily: 20.16ms
macaddress: 77.64ms
hardwareisa: 19.04ms
hardwaremodel: 29.89ms
architecture: 30.20ms
virtual: 0.69ms
prtdiag: 1062.60ms
virtual: 1094.83ms
virtual: 4.07ms
is_virtual: 1130.92ms
false

From this output we see that more than ten facts are resolved but only the prtdiag and virtual fact need longer than one second for discovery. This information is already helpful for a Puppet facter developer. But for a regular user, who only likes to understand why facter runs slower on some hardware types, it is likely not simple enough.

Facter basically just parses the output of CLI commands, so if such a command runs slow, whole facter runs slow. The performance of the ruby code which calls these commands, usually does not have a big impact on the overall facter performance.

How to measure the execution time of the CLI commands which are called by facter? One option would be to improve the --timing feature. But as I am on a platform which ships with Dtrace (Solaris), it is easier to write a simple Dtrace script. The following script measures the execution time of all binaries which are called by the facter ruby process:

dtrace scriptfacter-timing.d
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
#!/usr/sbin/dtrace -s

#pragma D option bufpolicy=fill
#pragma D option bufsize=1m
#pragma D option quiet

BEGIN
{
printf("\n\n= Dtrace statistics =\n");
printf("\nRuby facter PID: %d\n\n", $target);
printf("PID\truntime (ms)\tcommand\n");
printf("---\t------------\t-------\n");
}

proc:::exec-success
/ppid == $target/
{
starttime[curpsinfo->pr_pid] = timestamp;
}

proc:::exit
/starttime[curpsinfo->pr_pid]/
{
starttime[curpsinfo->pr_pid] = timestamp - starttime[curpsinfo->pr_pid];
printf("%d\t%d\t\t%s\n", curpsinfo->pr_pid,starttime[curpsinfo->pr_pid] / 1000000,curpsinfo->pr_psargs);
}

You need to call the Dtrace script as followed:

# dtrace -s facter-timing.d -c '/usr/local/bin/facter --timing is_virtual'
...

= Dtrace statistics =

Ruby facter PID: 9195

PID     runtime (ms)    command
---     ------------    -------
9200    405             netstat -p -f inet
9233    58              /sbin/ifconfig -a
9196    1               /usr/bin/uname -s
9234    128             /usr/perl5/bin/perl /usr/bin/kstat cpu_info
9198    57              /sbin/ifconfig -a
9219    1               /usr/bin/uname -p
9238    47              /usr/sbin/zoneadm list -cp
9243    1               /usr/bin/uname -m
9247    12              /sbin/zonename
9248    1059            /usr/platform/SUNW,SPARC-Enterprise/sbin/prtdiag
9197    1               /usr/bin/uname -v
9231    16              sh -c /usr/sbin/prtconf 2>/dev/null
9221    1088            sh -c /usr/sbin/prtdiag 2>/dev/null
9201    58              /sbin/ifconfig -a
9207    2               /usr/sbin/virtinfo -ap

In this output a sysadmin can easily see which commands are slow to run. For example prtdiag needs more than one second and also netstat -p -f inet needs almost 0.5 seconds. Actually I wrote the script when we faced the bug that virtinfo -ap needed more than twenty seconds for a simple execution. This script identified the issue quickly.

Share Comments