Operf and Opreport outputs

Discussion:

Michael Petlan

2017-06-16 11:22:02 UTC

Hello,
Many days ago you had helped me a lot in using Oprofile in Virtual Machine. That time, I also had a query regarding
getting information from 'opreport --symbols --debuginfo' command.

Hello,

yeah, I remember...

Actually, I am profiling a file1.c file with the following commands...
1. gcc -g file1.c -o file1
2. operf ./file1 --events=CPU_CLK_UNHALTED
3. opreport --symbols --debuginfo
samplesÂ %Â Â Â Â Â Â Â linenr infoÂ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â image nameÂ Â Â Â Â Â Â Â Â Â Â Â Â Â symbol name
2 Â Â Â Â Â 12.5000Â rsaimp.c:16Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â rsaimpÂ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â MM
2Â Â Â Â Â Â Â 12.5000Â rsaimp.c:34Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â rsaimpÂ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â main
When I execute commands 2 and 3 multiple times, each time I gwt the output for symbole name MM (which is UDF in the
program) but I don't get output for main every time.

This is kind of random, see below.

Additionally,'' the sample % also vary every time I file operf command. I do not understand the working of operf that
how it gives different outputs for the same program.

How _counting_ works:

Each time an event happens in CPU (in our case -- CPU_CLK_UNHALTED -- one
CPU tick), its counter is incremented by one. This way, you can get the
whole number of how many times the event happened:

$ ocount ls
Events were actively counted for 1863267 nanoseconds.
Event counts (actual) for /bin/ls:
Event Count % time counted
CPU_CLK_UNHALTED 2,672,309 100.00

This means that there were 2.672 million CPU ticks necessary to execute
`ls`.

The tool for counting is *ocount*.

How _profiling_ works:

For profiling, events are counted the same way as when counting, but only
once per N event occurrences a "sample" is taken. A sample is a piece of
information about where in the code it happened, what was the context and
potentially call-stack, etc. This sample is saved into the oprofile_data
dir structure. When you finish profiling, opreport generates some overview
from the samples recorded.

Of course, taking sample causes some overhead. Thus, you cannot take it
each time the event counter increments. For this purpose, there is "sample
rate" value which divides the event counter and takes a sample only once
per "sample rate" occurrences. Thus, it is pretty random, which event
occurrence is picked for a sample.

Of course, your main() consumes CPU ticks and the counter is incremented,
but sometimes, one of these incrementations produces sample, sometimes not,
thus the function is missing.

You have 2 samples on each of the line in the report. That is really a low
number, so it is very probable that sometimes the samples hit main() and
sometimes not.

$ operf ls
Profiling done.

$ opreport
Using /root/oprofile_data/samples/ for samples directory.
CPU: Core 2, speed 2267 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
CPU_CLK_UNHALT...|
samples| %|
------------------
--> 37 100.000 ls
CPU_CLK_UNHALT...|
samples| %|
------------------
27 72.9730 kallsyms
5 13.5135 libc-2.19.so
2 5.4054 ld-2.19.so
1 2.7027 ls
1 2.7027 ext4
1 2.7027 libattr.so.1.1.0

You see that out of (estimated) 2.5-3.5 mil of event occurrences (which is
the scale `ls` usually needs on my PC), I got only 37 samples.

The sample rate for each event are in `ophelp` as "min count".

Why did I get 37 samples? My default setting of default event is:

$ ophelp -d
CPU_CLK_UNHALTED:100000:0:1:1
-------------------^^

This means that there is 1 sample per 100000 event occurrences.

You can tweak this value a bit, but note that oprofile won't let you set
the value to be too low, due to the overhead reasons.

Let's try 10000 (sampling 10 times more often):

$ operf -e CPU_CLK_UNHALTED:10000 ls
WARNING: Lost samples detected! See /root/oprofile_data/samples/operf.log for details.
Lowering the sampling rate may reduce or eliminate lost samples.
See the '--events' option description in the operf man page for help.
Profiling done.

$ opreport
Using /root/oprofile_data/samples/ for samples directory.
WARNING: Lost samples detected! See /root/oprofile_data/samples/operf.log for details.
CPU: Core 2, speed 2267 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 10000
CPU_CLK_UNHALT...|
samples| %|
------------------
429 100.000 ls
CPU_CLK_UNHALT...|
samples| %|
------------------
313 72.9604 kallsyms
55 12.8205 ld-2.19.so
34 7.9254 libc-2.19.so
17 3.9627 ls
8 1.8648 ext4
2 0.4662 binfmt_misc

You see that both operf and opreport warned about lost samples which means
that the overhead in kernel was too high and some samples were lost. But
probably not that many, since you see that I got 429 samples instead of
37 which is very roughly 10 times more.

Profiling is not for exact measurement of how many CPU ticks happened in
a function, it is rather designed for relative comparison across your code.

If you are getting too few samples, try to decrease the sample rate as I
did in the example. Have a look at `ophelp` what is the lowest value for
your event.

Also please note that the min sample rates are rough and depend on the load,
so under low load, lower rates can be used without issues. When you profile
systemwide on all CPUs, it's very high chance that you'll run into an overhead
and lose many samples...

I hope I explained it a bit.

Cheers,
Michael

Thanks,
Dhara Buch

dhara buch

2017-06-19 11:22:02 UTC

Permalink

Hello,

Thanks a lot for the detailed explanation. I could understand the working
of operf and opreport.

Unlike operf, ocount counts each occurrence of the monitored event. In such
a case, why ocount also gives varying values of CPU_CLK_UNHALTED even when
the monitored code doesn't get changed. Is it because of background
processes? If so, then can we get event count for only monitored command?

Thank you,

Hello,
Many days ago you had helped me a lot in using Oprofile in Virtual

Machine. That time, I also had a query regarding

getting information from 'opreport --symbols --debuginfo' command.

Hello,
yeah, I remember...

symbol name

2 12.5000 rsaimp.c:16 rsaimp MM
2 12.5000 rsaimp.c:34 rsaimp

main

When I execute commands 2 and 3 multiple times, each time I gwt the

output for symbole name MM (which is UDF in the

program) but I don't get output for main every time.

This is kind of random, see below.

Additionally,'' the sample % also vary every time I file operf command.

I do not understand the working of operf that

how it gives different outputs for the same program.

Each time an event happens in CPU (in our case -- CPU_CLK_UNHALTED -- one
CPU tick), its counter is incremented by one. This way, you can get the
$ ocount ls
Events were actively counted for 1863267 nanoseconds.
Event Count % time counted
CPU_CLK_UNHALTED 2,672,309 100.00
This means that there were 2.672 million CPU ticks necessary to execute
`ls`.
The tool for counting is *ocount*.
For profiling, events are counted the same way as when counting, but only
once per N event occurrences a "sample" is taken. A sample is a piece of
information about where in the code it happened, what was the context and
potentially call-stack, etc. This sample is saved into the oprofile_data
dir structure. When you finish profiling, opreport generates some overview
from the samples recorded.
Of course, taking sample causes some overhead. Thus, you cannot take it
each time the event counter increments. For this purpose, there is "sample
rate" value which divides the event counter and takes a sample only once
per "sample rate" occurrences. Thus, it is pretty random, which event
occurrence is picked for a sample.
Of course, your main() consumes CPU ticks and the counter is incremented,
but sometimes, one of these incrementations produces sample, sometimes not,
thus the function is missing.
You have 2 samples on each of the line in the report. That is really a low
number, so it is very probable that sometimes the samples hit main() and
sometimes not.
$ operf ls
Profiling done.
$ opreport
Using /root/oprofile_data/samples/ for samples directory.
CPU: Core 2, speed 2267 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
CPU_CLK_UNHALT...|
samples| %|
------------------
--> 37 100.000 ls
CPU_CLK_UNHALT...|
samples| %|
------------------
27 72.9730 kallsyms
5 13.5135 libc-2.19.so
2 5.4054 ld-2.19.so
1 2.7027 ls
1 2.7027 ext4
1 2.7027 libattr.so.1.1.0
You see that out of (estimated) 2.5-3.5 mil of event occurrences (which is
the scale `ls` usually needs on my PC), I got only 37 samples.
The sample rate for each event are in `ophelp` as "min count".
$ ophelp -d
CPU_CLK_UNHALTED:100000:0:1:1
-------------------^^
This means that there is 1 sample per 100000 event occurrences.
You can tweak this value a bit, but note that oprofile won't let you set
the value to be too low, due to the overhead reasons.
$ operf -e CPU_CLK_UNHALTED:10000 ls
WARNING: Lost samples detected! See /root/oprofile_data/samples/operf.log for details.
Lowering the sampling rate may reduce or eliminate lost samples.
See the '--events' option description in the operf man page for help.
Profiling done.
$ opreport
Using /root/oprofile_data/samples/ for samples directory.
WARNING: Lost samples detected! See /root/oprofile_data/samples/operf.log for details.
CPU: Core 2, speed 2267 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 10000
CPU_CLK_UNHALT...|
samples| %|
------------------
429 100.000 ls
CPU_CLK_UNHALT...|
samples| %|
------------------
313 72.9604 kallsyms
55 12.8205 ld-2.19.so
34 7.9254 libc-2.19.so
17 3.9627 ls
8 1.8648 ext4
2 0.4662 binfmt_misc
You see that both operf and opreport warned about lost samples which means
that the overhead in kernel was too high and some samples were lost. But
probably not that many, since you see that I got 429 samples instead of
37 which is very roughly 10 times more.
Profiling is not for exact measurement of how many CPU ticks happened in
a function, it is rather designed for relative comparison across your code.
If you are getting too few samples, try to decrease the sample rate as I
did in the example. Have a look at `ophelp` what is the lowest value for
your event.
Also please note that the min sample rates are rough and depend on the load,
so under low load, lower rates can be used without issues. When you profile
systemwide on all CPUs, it's very high chance that you'll run into an overhead
and lose many samples...
I hope I explained it a bit.
Cheers,
Michael

Thanks,
Dhara Buch

Michael Petlan

2017-06-19 12:21:10 UTC

Permalink

Hello,
Thanks a lot for the detailed explanation. I could understand the working of operf and opreport.
Unlike operf, ocount counts each occurrence of the monitored event. In such a case, why ocount also gives varying values of CPU_CLK_UNHALTED even when the monitored code doesn't get
changed. Is it because of background processes? If so, then can we get event count for only monitored command?

Hi Dhara,

it is kind of because of the background processes, but not in the way that ocount
would count also their CPU consumption. If you don't use '-s' option, ocount will
count only what your command does.

However, it counts either kernelspace, userspace or both (default if you are root).
This can be configured by the event's flags:

Both:
-e CPU_CLK_UNHALTED:0:1:1

Only kernelspace (used by the command, i.e. syscalls, etc):
-e CPU_CLK_UNHALTED:0:1:0

Only userspace (i.e. syscalls are excluded):
-e CPU_CLK_UNHALTED:0:0:1

(the format is (ocount): event:umask:kernelspace:userspace,
for operf it is: event:umask:sample_rate:kernelspace:userspace)

The numbers are slightly different each run because the process does not live in
vacuum. It is usually dynamically linked, it calls libraries, kernel switches
contexts, etc. All these things cause the differences. Also, the CPU optimizes
the execution with floating success, depending on what else runs there with it.
If the process is interrupted many times, the CPU consumption will be higher.

I have a sample code which I linked dynamically and statically. While the dyn.
linked executable always takes around 670,260 cycles, the statically linked one
takes around 380,202 and seems to be more steady. Since it is very simple code,
you can see that the linker-related stuff took more than 40% of the CPU cycles).

Executing a binary is so complex problem that there is no surprise that those
measurements give different values every time.

........

TIP:

You can use either papi library or the built-in kernel support for counting
just a particular part of code, e.g. a function. I have attached one example
which uses perfevents kernel background. I have two functions for factorial
counting, one is slightly less effective than the other. I count cycles and
instructions for both. When executing repeatedly, you can see, that the inst.
count is the same all the time (at least on my laptop), but the cycles count
differs even for just the stuff in the functions which are very simple.

cc -o perf-hello perf-hello.c myperf.c
./perf-hello

Ineffective method:
PERF_COUNT_HW_CPU_CYCLES = 32192
PERF_COUNT_HW_INSTRUCTIONS = 20681

More effective method:
PERF_COUNT_HW_CPU_CYCLES = 15931
PERF_COUNT_HW_INSTRUCTIONS = 19103

........

Cheers,
Michael

Thank you,

Hello,
Many days ago you had helped me a lot in using Oprofile in Virtual Machine. That time, I also had a query regarding
getting information from 'opreport --symbols --debuginfo' command.

Hello,
yeah, I remember...

Actually, I am profiling a file1.c file with the following commands...
1. gcc -g file1.c -o file1
2. operf ./file1 --events=CPU_CLK_UNHALTED
3. opreport --symbols --debuginfo
samples %        linenr info                 image name               symbol name
2       12.5000 rsaimp.c:16                 rsaimp                   MM
2        12.5000 rsaimp.c:34                 rsaimp                   main
When I execute commands 2 and 3 multiple times, each time I gwt the output for symbole name MM (which is UDF in the
program) but I don't get output for main every time.

This is kind of random, see below.

Additionally,'' the sample % also vary every time I file operf command. I do not understand the working of operf that
how it gives different outputs for the same program.

Each time an event happens in CPU (in our case -- CPU_CLK_UNHALTED -- one
CPU tick), its counter is incremented by one. This way, you can get the
$ ocount ls
Events were actively counted for 1863267 nanoseconds.
Event Count % time counted
CPU_CLK_UNHALTED 2,672,309 100.00
This means that there were 2.672 million CPU ticks necessary to execute
`ls`.
The tool for counting is *ocount*.
For profiling, events are counted the same way as when counting, but only
once per N event occurrences a "sample" is taken. A sample is a piece of
information about where in the code it happened, what was the context and
potentially call-stack, etc. This sample is saved into the oprofile_data
dir structure. When you finish profiling, opreport generates some overview
from the samples recorded.
Of course, taking sample causes some overhead. Thus, you cannot take it
each time the event counter increments. For this purpose, there is "sample
rate" value which divides the event counter and takes a sample only once
per "sample rate" occurrences. Thus, it is pretty random, which event
occurrence is picked for a sample.
Of course, your main() consumes CPU ticks and the counter is incremented,
but sometimes, one of these incrementations produces sample, sometimes not,
thus the function is missing.
You have 2 samples on each of the line in the report. That is really a low
number, so it is very probable that sometimes the samples hit main() and
sometimes not.
$ operf ls
Profiling done.
$ opreport
Using /root/oprofile_data/samples/ for samples directory.
CPU: Core 2, speed 2267 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
CPU_CLK_UNHALT...|
samples| %|
------------------
--> 37 100.000 ls
CPU_CLK_UNHALT...|
samples| %|
------------------
27 72.9730 kallsyms
5 13.5135 libc-2.19.so
2 5.4054 ld-2.19.so
1 2.7027 ls
1 2.7027 ext4
1 2.7027 libattr.so.1.1.0
You see that out of (estimated) 2.5-3.5 mil of event occurrences (which is
the scale `ls` usually needs on my PC), I got only 37 samples.
The sample rate for each event are in `ophelp` as "min count".
$ ophelp -d
CPU_CLK_UNHALTED:100000:0:1:1
-------------------^^
This means that there is 1 sample per 100000 event occurrences.
You can tweak this value a bit, but note that oprofile won't let you set
the value to be too low, due to the overhead reasons.
$ operf -e CPU_CLK_UNHALTED:10000 ls
WARNING: Lost samples detected! See /root/oprofile_data/samples/operf.log for details.
Lowering the sampling rate may reduce or eliminate lost samples.
See the '--events' option description in the operf man page for help.
Profiling done.
$ opreport
Using /root/oprofile_data/samples/ for samples directory.
WARNING: Lost samples detected! See /root/oprofile_data/samples/operf.log for details.
CPU: Core 2, speed 2267 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 10000
CPU_CLK_UNHALT...|
samples| %|
------------------
429 100.000 ls
CPU_CLK_UNHALT...|
samples| %|
------------------
313 72.9604 kallsyms
55 12.8205 ld-2.19.so
34 7.9254 libc-2.19.so
17 3.9627 ls
8 1.8648 ext4
2 0.4662 binfmt_misc
You see that both operf and opreport warned about lost samples which means
that the overhead in kernel was too high and some samples were lost. But
probably not that many, since you see that I got 429 samples instead of
37 which is very roughly 10 times more.
Profiling is not for exact measurement of how many CPU ticks happened in
a function, it is rather designed for relative comparison across your code.
If you are getting too few samples, try to decrease the sample rate as I
did in the example. Have a look at `ophelp` what is the lowest value for
your event.
Also please note that the min sample rates are rough and depend on the load,
so under low load, lower rates can be used without issues. When you profile
systemwide on all CPUs, it's very high chance that you'll run into an overhead
and lose many samples...
I hope I explained it a bit.
Cheers,
Michael

Thanks,
Dhara Buch

dhara buch

2017-07-06 09:04:43 UTC

Permalink

Hello,

I tried the perf-hello.c file and found the results as you explained. It is
very helpful to me. Now, for PERF_COUNT_HW_BRANCH_MISSES, I am getting
varying values every time unlike PERF_COUNT_HW_INSTRUCTIONS.

To get count for only defined process, I am trying to set perf_event_attr
structure variables sample_type as PERF_SAMPLE_BRANCH_STACK and
branch_sample_type as PERF_SAMPLE_BRANCH_USER. By doing so, I am getting
error no. 5 by perf_event_open. What may be the problem?

Dhara Buch

Hello,
Thanks a lot for the detailed explanation. I could understand the

working of operf and opreport.

Unlike operf, ocount counts each occurrence of the monitored event. In

such a case, why ocount also gives varying values of CPU_CLK_UNHALTED even
when the monitored code doesn't get

changed. Is it because of background processes? If so, then can we get

event count for only monitored command?
Hi Dhara,
it is kind of because of the background processes, but not in the way that ocount
would count also their CPU consumption. If you don't use '-s' option, ocount will
count only what your command does.
However, it counts either kernelspace, userspace or both (default if you are root).
-e CPU_CLK_UNHALTED:0:1:1
-e CPU_CLK_UNHALTED:0:1:0
-e CPU_CLK_UNHALTED:0:0:1
(the format is (ocount): event:umask:kernelspace:userspace,
for operf it is: event:umask:sample_rate:kernelspace:userspace)
The numbers are slightly different each run because the process does not live in
vacuum. It is usually dynamically linked, it calls libraries, kernel switches
contexts, etc. All these things cause the differences. Also, the CPU optimizes
the execution with floating success, depending on what else runs there with it.
If the process is interrupted many times, the CPU consumption will be higher.
I have a sample code which I linked dynamically and statically. While the dyn.
linked executable always takes around 670,260 cycles, the statically linked one
takes around 380,202 and seems to be more steady. Since it is very simple code,
you can see that the linker-related stuff took more than 40% of the CPU cycles).
Executing a binary is so complex problem that there is no surprise that those
measurements give different values every time.
........
You can use either papi library or the built-in kernel support for counting
just a particular part of code, e.g. a function. I have attached one example
which uses perfevents kernel background. I have two functions for factorial
counting, one is slightly less effective than the other. I count cycles and
instructions for both. When executing repeatedly, you can see, that the inst.
count is the same all the time (at least on my laptop), but the cycles count
differs even for just the stuff in the functions which are very simple.
cc -o perf-hello perf-hello.c myperf.c
./perf-hello
PERF_COUNT_HW_CPU_CYCLES = 32192
PERF_COUNT_HW_INSTRUCTIONS = 20681
PERF_COUNT_HW_CPU_CYCLES = 15931
PERF_COUNT_HW_INSTRUCTIONS = 19103
........
Cheers,
Michael

Thank you,

Hello,
Many days ago you had helped me a lot in using Oprofile in

Virtual Machine. That time, I also had a query regarding

getting information from 'opreport --symbols --debuginfo'

command.

Hello,
yeah, I remember...

Actually, I am profiling a file1.c file with the following

commands...

1. gcc -g file1.c -o file1
2. operf ./file1 --events=CPU_CLK_UNHALTED
3. opreport --symbols --debuginfo
samples % linenr info image

name symbol name

2 12.5000 rsaimp.c:16

rsaimp MM

2 12.5000 rsaimp.c:34

rsaimp main

When I execute commands 2 and 3 multiple times, each time I gwt

the output for symbole name MM (which is UDF in the

program) but I don't get output for main every time.

This is kind of random, see below.

Additionally,'' the sample % also vary every time I file operf

command. I do not understand the working of operf that

how it gives different outputs for the same program.

Each time an event happens in CPU (in our case -- CPU_CLK_UNHALTED

-- one

CPU tick), its counter is incremented by one. This way, you can

get the

$ ocount ls
Events were actively counted for 1863267 nanoseconds.
Event Count % time

counted

CPU_CLK_UNHALTED 2,672,309 100.00
This means that there were 2.672 million CPU ticks necessary to

execute

`ls`.
The tool for counting is *ocount*.
For profiling, events are counted the same way as when counting,

but only

once per N event occurrences a "sample" is taken. A sample is a

piece of

information about where in the code it happened, what was the

context and

potentially call-stack, etc. This sample is saved into the

oprofile_data

dir structure. When you finish profiling, opreport generates some

overview

from the samples recorded.
Of course, taking sample causes some overhead. Thus, you cannot

take it

each time the event counter increments. For this purpose, there is

"sample

rate" value which divides the event counter and takes a sample

only once

per "sample rate" occurrences. Thus, it is pretty random, which

event

occurrence is picked for a sample.
Of course, your main() consumes CPU ticks and the counter is

incremented,

but sometimes, one of these incrementations produces sample,

sometimes not,

thus the function is missing.
You have 2 samples on each of the line in the report. That is

really a low

number, so it is very probable that sometimes the samples hit

main() and

sometimes not.
$ operf ls
Profiling done.
$ opreport
Using /root/oprofile_data/samples/ for samples directory.
CPU: Core 2, speed 2267 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted)

with a unit mask of 0x00 (Unhalted core cycles) count 100000

(which is

the scale `ls` usually needs on my PC), I got only 37 samples.
The sample rate for each event are in `ophelp` as "min count".
$ ophelp -d
CPU_CLK_UNHALTED:100000:0:1:1
-------------------^^
This means that there is 1 sample per 100000 event occurrences.
You can tweak this value a bit, but note that oprofile won't let

you set

the value to be too low, due to the overhead reasons.
$ operf -e CPU_CLK_UNHALTED:10000 ls
WARNING: Lost samples detected! See /root/oprofile_data/samples/operf.log

for details.

Lowering the sampling rate may reduce or eliminate lost samples.
See the '--events' option description in the operf man page for

help.

Profiling done.
$ opreport
Using /root/oprofile_data/samples/ for samples directory.
WARNING: Lost samples detected! See /root/oprofile_data/samples/operf.log

for details.

CPU: Core 2, speed 2267 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted)

with a unit mask of 0x00 (Unhalted core cycles) count 10000

which means

that the overhead in kernel was too high and some samples were

lost. But

probably not that many, since you see that I got 429 samples

instead of

37 which is very roughly 10 times more.
Profiling is not for exact measurement of how many CPU ticks

happened in

a function, it is rather designed for relative comparison across

your code.

If you are getting too few samples, try to decrease the sample

rate as I

did in the example. Have a look at `ophelp` what is the lowest

value for

your event.
Also please note that the min sample rates are rough and depend on

the load,

so under low load, lower rates can be used without issues. When

you profile

systemwide on all CPUs, it's very high chance that you'll run into

an overhead

and lose many samples...
I hope I explained it a bit.
Cheers,
Michael

Thanks,
Dhara Buch

Michael Petlan

2017-07-07 21:08:32 UTC

Permalink

Hi Dhara,

Try to compare:

PERF_COUNT_HW_BRANCH_INSTRUCTIONS against PERF_COUNT_HW_BRANCH_MISSES.

The first one is how many branching instructions have been used, the second is
how many branch predictions were wrong. I am always getting almost the same
values of _BRANCH_INSTRUCTIONS, the values of _BRANCH_MISSES differ. That
makes sense, since from some constant number of branch instructions (these
are conditional jumps, loops, etc), some were mispredicted. The branch
prediction works pretty well, but it is not 100% successful. I am getting
around 6% of missed branches:

PERF_COUNT_HW_BRANCH_INSTRUCTIONS = 4541
PERF_COUNT_HW_BRANCH_MISSES = 272

but another run slightly differs in _BRANCH_MISSES while still having the
same number of branch instructions:

PERF_COUNT_HW_BRANCH_INSTRUCTIONS = 4541
PERF_COUNT_HW_BRANCH_MISSES = 248

It seems that the "more effective" factorial computation function not only
consumes slightly less instructions and cycles, but it also consumes
slightly less branch instructions and the branch prediction works slightly
better there:

My results look like this:

Ineffective method:
PERF_COUNT_HW_BRANCH_INSTRUCTIONS = 4541
PERF_COUNT_HW_BRANCH_MISSES = 258

More effective method:
PERF_COUNT_HW_BRANCH_INSTRUCTIONS = 4334
PERF_COUNT_HW_BRANCH_MISSES = 97

This is expectable and expected.

To get count for only defined process,

You are counting only the process and, also, in my example, you are counting
just the code between myperf_event_restart() and myperf_event_stop() functions.

The only dust you see in the measurements is that kernel might suspend your
process and schedule some other process in meanwhile, which may invalidate
the caches or spoil the branch guessing (I don't know how exactly this works)
--> so you cannot expect the metrics such as branch misses, cycles, etc. to
be 100% deterministic from your point of view.

I amÂ trying to set perf_event_attr structure variables sample_type
asÂ PERF_SAMPLE_BRANCH_STACK and branch_sample_type asÂ PERF_SAMPLE_BRANCH_USER. By doing so, I am getting error no. 5 by
perf_event_open. What may be the problem?

These variables are for something completely different. See the manpage for
"perf_event_open". The syscall returns error, because you misconfigured the
perf_event_attr structure to invalid values.

As I've already said: You don't need to tweak anything else, you already are
counting just the process you want.

Dhara Buch

Cheers,
Michael

varying values of CPU_CLK_UNHALTED even when the monitored code doesn't get

changed. Is it because of background processes? If so, then can we get event count for only monitored

command?
Hi Dhara,
it is kind of because of the background processes, but not in the way that ocount
would count also their CPU consumption. If you don't use '-s' option, ocount will
count only what your command does.
However, it counts either kernelspace, userspace or both (default if you are root).
-e CPU_CLK_UNHALTED:0:1:1
-e CPU_CLK_UNHALTED:0:1:0
-e CPU_CLK_UNHALTED:0:0:1
(the format is (ocount): event:umask:kernelspace:userspace,
for operf it is: event:umask:sample_rate:kernelspace:userspace)
The numbers are slightly different each run because the process does not live in
vacuum. It is usually dynamically linked, it calls libraries, kernel switches
contexts, etc. All these things cause the differences. Also, the CPU optimizes
the execution with floating success, depending on what else runs there with it.
If the process is interrupted many times, the CPU consumption will be higher.
I have a sample code which I linked dynamically and statically. While the dyn.
linked executable always takes around 670,260 cycles, the statically linked one
takes around 380,202 and seems to be more steady. Since it is very simple code,
you can see that the linker-related stuff took more than 40% of the CPU cycles).
Executing a binary is so complex problem that there is no surprise that those
measurements give different values every time.
........
You can use either papi library or the built-in kernel support for counting
just a particular part of code, e.g. a function. I have attached one example
which uses perfevents kernel background. I have two functions for factorial
counting, one is slightly less effective than the other. I count cycles and
instructions for both. When executing repeatedly, you can see, that the inst.
count is the same all the time (at least on my laptop), but the cycles count
differs even for just the stuff in the functions which are very simple.
cc -o perf-hello perf-hello.c myperf.c
./perf-hello
Â Â Â Â PERF_COUNT_HW_CPU_CYCLES = 32192
Â Â Â Â PERF_COUNT_HW_INSTRUCTIONS = 20681
Â Â Â Â PERF_COUNT_HW_CPU_CYCLES = 15931
Â Â Â Â PERF_COUNT_HW_INSTRUCTIONS = 19103
........
Cheers,
Michael

Thank you,
Â Â Â Â > Hello,
Â Â Â Â >
Â Â Â Â > Many days ago you had helped me a lot in using Oprofile in Virtual Machine. That time, I also had a

query regarding

Â Â Â Â > getting information from 'opreport --symbols --debuginfo' command.
Â Â Â Â Hello,
Â Â Â Â yeah, I remember...
Â Â Â Â >
Â Â Â Â > Actually, I am profiling a file1.c file with the following commands...
Â Â Â Â >
Â Â Â Â > 1. gcc -g file1.c -o file1
Â Â Â Â >
Â Â Â Â > 2. operf ./file1 --events=CPU_CLK_UNHALTED
Â Â Â Â >
Â Â Â Â > 3. opreport --symbols --debuginfo
Â Â Â Â >
Â Â Â Â >
Â Â Â Â > samplesÂ %Â Â Â Â Â Â Â linenr infoÂ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â image nameÂ Â Â Â Â Â Â Â Â Â Â Â Â Â symbol name
Â Â Â Â >
Â Â Â Â > 2 Â Â Â Â Â 12.5000Â rsaimp.c:16Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â rsaimpÂ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â MM
Â Â Â Â > 2Â Â Â Â Â Â Â 12.5000Â rsaimp.c:34Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â rsaimpÂ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â main
Â Â Â Â >
Â Â Â Â > When I execute commands 2 and 3 multiple times, each time I gwt the output for symbole name MM

(which is UDF in the

Â Â Â Â > program) but I don't get output for main every time.
Â Â Â Â >
Â Â Â Â This is kind of random, see below.
Â Â Â Â > Additionally,'' the sample % also vary every time I file operf command. I do not understand the

working of operf that

Â Â Â Â > how it gives different outputs for the same program.
Â Â Â Â >
Â Â Â Â Each time an event happens in CPU (in our case -- CPU_CLK_UNHALTED -- one
Â Â Â Â CPU tick), its counter is incremented by one. This way, you can get the
Â Â Â Â $ ocount ls
Â Â Â Â Events were actively counted for 1863267 nanoseconds.
Â Â Â Â Â Â Â Â Â EventÂ Â Â Â Â Â Â Â Â Â CountÂ Â Â Â Â Â Â Â Â Â % time counted
Â Â Â Â Â Â Â Â Â CPU_CLK_UNHALTEDÂ Â Â Â Â 2,672,309Â Â Â Â Â Â Â Â 100.00
Â Â Â Â This means that there were 2.672 million CPU ticks necessary to execute
Â Â Â Â `ls`.
Â Â Â Â The tool for counting is *ocount*.
Â Â Â Â For profiling, events are counted the same way as when counting, but only
Â Â Â Â once per N event occurrences a "sample" is taken. A sample is a piece of
Â Â Â Â information about where in the code it happened, what was the context and
Â Â Â Â potentially call-stack, etc. This sample is saved into the oprofile_data
Â Â Â Â dir structure. When you finish profiling, opreport generates some overview
Â Â Â Â from the samples recorded.
Â Â Â Â Of course, taking sample causes some overhead. Thus, you cannot take it
Â Â Â Â each time the event counter increments. For this purpose, there is "sample
Â Â Â Â rate" value which divides the event counter and takes a sample only once
Â Â Â Â per "sample rate" occurrences. Thus, it is pretty random, which event
Â Â Â Â occurrence is picked for a sample.
Â Â Â Â Of course, your main() consumes CPU ticks and the counter is incremented,
Â Â Â Â but sometimes, one of these incrementations produces sample, sometimes not,
Â Â Â Â thus the function is missing.
Â Â Â Â You have 2 samples on each of the line in the report. That is really a low
Â Â Â Â number, so it is very probable that sometimes the samples hit main() and
Â Â Â Â sometimes not.
Â Â Â Â $ operf ls
Â Â Â Â Profiling done.
Â Â Â Â $ opreport
Â Â Â Â Using /root/oprofile_data/samples/ for samples directory.
Â Â Â Â CPU: Core 2, speed 2267 MHz (estimated)
Â Â Â Â Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted

core cycles) count 100000

Â Â Â Â CPU_CLK_UNHALT...|
Â Â Â Â Â samples|Â Â Â %|
Â Â Â Â ------------------
Â Â Â Â Â -->Â Â 37 100.000 ls
Â Â Â Â Â Â Â Â CPU_CLK_UNHALT...|
Â Â Â Â Â Â Â Â Â samples|Â Â Â %|
Â Â Â Â Â Â Â Â Â Â Â Â Â ------------------
Â Â Â Â Â Â Â Â Â Â Â 27 72.9730 kallsyms
Â Â Â Â Â Â Â Â Â Â Â Â 5 13.5135 libc-2.19.so
Â Â Â Â Â Â Â Â Â Â Â Â 2Â 5.4054 ld-2.19.so
Â Â Â Â Â Â Â Â Â Â Â Â 1Â 2.7027 ls
Â Â Â Â Â Â Â Â Â Â Â Â 1Â 2.7027 ext4
Â Â Â Â Â Â Â Â Â Â Â Â 1Â 2.7027 libattr.so.1.1.0
Â Â Â Â You see that out of (estimated) 2.5-3.5 mil of event occurrences (which is
Â Â Â Â the scale `ls` usually needs on my PC), I got only 37 samples.
Â Â Â Â The sample rate for each event are in `ophelp` as "min count".
Â Â Â Â $ ophelp -d
Â Â Â Â CPU_CLK_UNHALTED:100000:0:1:1
Â Â Â Â -------------------^^
Â Â Â Â This means that there is 1 sample per 100000 event occurrences.
Â Â Â Â You can tweak this value a bit, but note that oprofile won't let you set
Â Â Â Â the value to be too low, due to the overhead reasons.
Â Â Â Â $ operf -e CPU_CLK_UNHALTED:10000 ls
Â Â Â Â WARNING: Lost samples detected! See /root/oprofile_data/samples/operf.log for details.
Â Â Â Â Lowering the sampling rate may reduce or eliminate lost samples.
Â Â Â Â See the '--events' option description in the operf man page for help.
Â Â Â Â Profiling done.
Â Â Â Â $ opreport
Â Â Â Â Using /root/oprofile_data/samples/ for samples directory.
Â Â Â Â WARNING: Lost samples detected! See /root/oprofile_data/samples/operf.log for details.
Â Â Â Â CPU: Core 2, speed 2267 MHz (estimated)
Â Â Â Â Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted

core cycles) count 10000

Â Â Â Â CPU_CLK_UNHALT...|
Â Â Â Â Â samples|Â Â Â %|
Â Â Â Â ------------------
Â Â Â Â Â Â Â 429 100.000 ls
Â Â Â Â Â Â Â CPU_CLK_UNHALT...|
Â Â Â Â Â Â Â Â samples|Â Â Â %|
Â Â Â Â Â Â Â Â ------------------
Â Â Â Â Â Â Â Â Â Â Â 313 72.9604 kallsyms
Â Â Â Â Â Â Â Â Â Â Â Â 55 12.8205 ld-2.19.so
Â Â Â Â Â Â Â Â Â Â Â Â 34Â 7.9254 libc-2.19.so
Â Â Â Â Â Â Â Â Â Â Â Â 17Â 3.9627 ls
Â Â Â Â Â Â Â Â Â Â Â Â 8Â 1.8648 ext4
Â Â Â Â Â Â Â Â Â Â Â Â 2Â 0.4662 binfmt_misc
Â Â Â Â You see that both operf and opreport warned about lost samples which means
Â Â Â Â that the overhead in kernel was too high and some samples were lost. But
Â Â Â Â probably not that many, since you see that I got 429 samples instead of
Â Â Â Â 37 which is very roughly 10 times more.
Â Â Â Â Profiling is not for exact measurement of how many CPU ticks happened in
Â Â Â Â a function, it is rather designed for relative comparison across your code.
Â Â Â Â If you are getting too few samples, try to decrease the sample rate as I
Â Â Â Â did in the example. Have a look at `ophelp` what is the lowest value for
Â Â Â Â your event.
Â Â Â Â Also please note that the min sample rates are rough and depend on the load,
Â Â Â Â so under low load, lower rates can be used without issues. When you profile
Â Â Â Â systemwide on all CPUs, it's very high chance that you'll run into an overhead
Â Â Â Â and lose many samples...
Â Â Â Â I hope I explained it a bit.
Â Â Â Â Cheers,
Â Â Â Â Michael
Â Â Â Â > Thanks,
Â Â Â Â >
Â Â Â Â > Dhara Buch
Â Â Â Â >
Â Â Â Â >