Discussion:
Potential deadlock in operf when using --pid
(too old to reply)
大平怜
2015-09-30 22:07:19 UTC
Permalink
Raw Message
Hello again,

When using --pid, I have occasionally seen operf does not end by hitting ^C
once. By hitting ^C multiple times, operf ends with error messages:

-----
operf --events=CPU_CLK_UNHALTED:100000000:0:1:1 --pid `pgrep -f
CassandraDaemon`
Kernel profiling is not possible with current system config.
Set /proc/sys/kernel/kptr_restrict to 0 to collect kernel samples.
operf: Press Ctl-c or 'kill -SIGINT 18042' to stop profiling
operf: Profiler started
^C^Cwaitpid for operf-record process failed: Interrupted system call
^Cwaitpid for operf-read process failed: Interrupted system call
Error running profiler
-----

I am using the master branch in the git repository.

Here is what I found:
The operf-read process was waiting for a read of a sample ID from the
operf-record process to return:
(gdb) bt
#0 0x00007fd90e0fa480 in __read_nocancel ()
at ../sysdeps/unix/syscall-template.S:81
#1 0x0000000000412999 in read (__nbytes=8, __buf=0x7ffddc82b620,
__fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:44
#2 __handle_fork_event (event=0x98e860) at operf_utils.cpp:125
#3 OP_perf_utils::op_write_event (event=***@entry=0x98e860,
sample_type=<optimized out>) at operf_utils.cpp:834
#4 0x0000000000417250 in operf_read::convertPerfData (
this=***@entry=0x648000 <operfRead>) at operf_counter.cpp:1147
#5 0x000000000040a4cb in convert_sample_data () at operf.cpp:947
#6 0x0000000000407482 in _run () at operf.cpp:625
#7 main (argc=4, argv=0x7ffddc82be48) at operf.cpp:1539

The operf-record process was waiting for a write of sample data to the
operf-read process to complete. Why did the write of the sample data
block? My guess is that the sample_data_pipe was full:
(gdb) bt
#0 0x00007fbe0dc9f4e0 in __write_nocancel ()
at ../sysdeps/unix/syscall-template.S:81
#1 0x000000000040cd0e in OP_perf_utils::op_write_output (output=6,
buf=0x7fbe0e4e5140, size=32) at operf_utils.cpp:989
#2 0x000000000040d605 in OP_perf_utils::op_get_kernel_event_data (
md=0xd3c7f0, pr=***@entry=0xd07900) at operf_utils.cpp:1443
#3 0x000000000041bc12 in operf_record::recordPerfData (this=0xd07900)
at operf_counter.cpp:846
#4 0x00000000004098b8 in start_profiling () at operf.cpp:402
#5 0x0000000000406305 in _run () at operf.cpp:596
#6 main (argc=4, argv=0x7ffdc6fcde58) at operf.cpp:1539

As a result, when I hit ^C, the operf main process sent SIGUSR1 to the
operf-record process, in which the write returned with EINTR and simply got
retried. Since the operf-record process did not end, the operf main process
waited at waitpid(2) forever.

Do you think my guess makes sense? What would be a fundamental solution?
Simply extending the pipe size would not be appropriate....


Regards,
Rei Odaira
William Cohen
2015-10-01 20:42:14 UTC
Permalink
Raw Message
Post by 大平怜
Hello again,
Hi,

I tried to replicate this on my local machine, but haven't seen it occur yet. How often does it happen? Also does it make a difference when the event sampling rate is changed? There is just one monitored process and it isn't spawning new processes?
Post by 大平怜
-----
operf --events=CPU_CLK_UNHALTED:100000000:0:1:1 --pid `pgrep -f CassandraDaemon`
Kernel profiling is not possible with current system config.
Set /proc/sys/kernel/kptr_restrict to 0 to collect kernel samples.
operf: Press Ctl-c or 'kill -SIGINT 18042' to stop profiling
operf: Profiler started
^C^Cwaitpid for operf-record process failed: Interrupted system call
^Cwaitpid for operf-read process failed: Interrupted system call
Error running profiler
-----
I am using the master branch in the git repository.
(gdb) bt
#0 0x00007fd90e0fa480 in __read_nocancel ()
at ../sysdeps/unix/syscall-template.S:81
#1 0x0000000000412999 in read (__nbytes=8, __buf=0x7ffddc82b620,
__fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:44
#2 __handle_fork_event (event=0x98e860) at operf_utils.cpp:125
sample_type=<optimized out>) at operf_utils.cpp:834
#4 0x0000000000417250 in operf_read::convertPerfData (
#5 0x000000000040a4cb in convert_sample_data () at operf.cpp:947
#6 0x0000000000407482 in _run () at operf.cpp:625
#7 main (argc=4, argv=0x7ffddc82be48) at operf.cpp:1539
(gdb) bt
#0 0x00007fbe0dc9f4e0 in __write_nocancel ()
at ../sysdeps/unix/syscall-template.S:81
#1 0x000000000040cd0e in OP_perf_utils::op_write_output (output=6,
buf=0x7fbe0e4e5140, size=32) at operf_utils.cpp:989
#2 0x000000000040d605 in OP_perf_utils::op_get_kernel_event_data (
#3 0x000000000041bc12 in operf_record::recordPerfData (this=0xd07900)
at operf_counter.cpp:846
#4 0x00000000004098b8 in start_profiling () at operf.cpp:402
#5 0x0000000000406305 in _run () at operf.cpp:596
#6 main (argc=4, argv=0x7ffdc6fcde58) at operf.cpp:1539
As a result, when I hit ^C, the operf main process sent SIGUSR1 to the operf-record process, in which the write returned with EINTR and simply got retried. Since the operf-record process did not end, the operf main process waited at waitpid(2) forever.
Do you think my guess makes sense? What would be a fundamental solution? Simply extending the pipe size would not be appropriate....
I am wondering if there are any other nuggets of information that can be gathered by using "--verbose debug,misc" and other "--verbose" variations on the operf command line. It would be worthwhile to take a close look at the code in operf.cpp and see how ctl-c is being handled. There could be a problem with the order that things are shutdown, causing the problem. I noticed around line 406 and of operf.cpp there is the following code:

catch (const runtime_error & re) {
/* If the user does ctl-c, the operf-record process may get interrupted
* in a system call, causing problems with writes to the sample data pipe.
* So we'll ignore such errors unless the user requests debug info.
*/
if (!ctl_c || (cverb << vmisc)) {
cerr << "Caught runtime_error: " << re.what() << endl;
exit_code = EXIT_FAILURE;
}
goto fail_out;
}

-Will




------------------------------------------------------------------------------
大平怜
2015-10-01 23:32:09 UTC
Permalink
Raw Message
Hi Will,

How about the attached test program? This almost always causes the problem
in my environment.
gcc -o oprofile_multithread_test oprofile_multithread_test.c -lpthread
./oprofile_multithread_test
Usage: oprofile_multithread_test <number of spawns> <number of threads>
<number of operations per thread>
./oprofile_multithread_test -1 16 100000
In this example, oprofile_multithread_test spawns threads infinitely but
runs maximum 16 threads simultaneously. Each thread performs addition
100000 times and then completes. Please use ^C to end this program if you
specify -1 to the number of spawns.

If you profile this program with operf --pid, I expect you will not be able
to finish operf by a single ^C.


Regards,
Rei Odaira
Post by 大平怜
Hello again,
When using --pid, I have occasionally seen operf does not end by hitting
Hi,
I tried to replicate this on my local machine, but haven't seen it occur
yet. How often does it happen? Also does it make a difference when the
event sampling rate is changed? There is just one monitored process and it
isn't spawning new processes?
Post by 大平怜
-----
operf --events=CPU_CLK_UNHALTED:100000000:0:1:1 --pid `pgrep -f
CassandraDaemon`
Post by 大平怜
Kernel profiling is not possible with current system config.
Set /proc/sys/kernel/kptr_restrict to 0 to collect kernel samples.
operf: Press Ctl-c or 'kill -SIGINT 18042' to stop profiling
operf: Profiler started
^C^Cwaitpid for operf-record process failed: Interrupted system call
^Cwaitpid for operf-read process failed: Interrupted system call
Error running profiler
-----
I am using the master branch in the git repository.
The operf-read process was waiting for a read of a sample ID from the
(gdb) bt
#0 0x00007fd90e0fa480 in __read_nocancel ()
at ../sysdeps/unix/syscall-template.S:81
#1 0x0000000000412999 in read (__nbytes=8, __buf=0x7ffddc82b620,
__fd=<optimized out>) at
/usr/include/x86_64-linux-gnu/bits/unistd.h:44
Post by 大平怜
#2 __handle_fork_event (event=0x98e860) at operf_utils.cpp:125
sample_type=<optimized out>) at operf_utils.cpp:834
#4 0x0000000000417250 in operf_read::convertPerfData (
#5 0x000000000040a4cb in convert_sample_data () at operf.cpp:947
#6 0x0000000000407482 in _run () at operf.cpp:625
#7 main (argc=4, argv=0x7ffddc82be48) at operf.cpp:1539
The operf-record process was waiting for a write of sample data to the
operf-read process to complete. Why did the write of the sample data
Post by 大平怜
(gdb) bt
#0 0x00007fbe0dc9f4e0 in __write_nocancel ()
at ../sysdeps/unix/syscall-template.S:81
#1 0x000000000040cd0e in OP_perf_utils::op_write_output (output=6,
buf=0x7fbe0e4e5140, size=32) at operf_utils.cpp:989
#2 0x000000000040d605 in OP_perf_utils::op_get_kernel_event_data (
#3 0x000000000041bc12 in operf_record::recordPerfData (this=0xd07900)
at operf_counter.cpp:846
#4 0x00000000004098b8 in start_profiling () at operf.cpp:402
#5 0x0000000000406305 in _run () at operf.cpp:596
#6 main (argc=4, argv=0x7ffdc6fcde58) at operf.cpp:1539
As a result, when I hit ^C, the operf main process sent SIGUSR1 to the
operf-record process, in which the write returned with EINTR and simply got
retried. Since the operf-record process did not end, the operf main process
waited at waitpid(2) forever.
Post by 大平怜
Do you think my guess makes sense? What would be a fundamental
solution? Simply extending the pipe size would not be appropriate....
I am wondering if there are any other nuggets of information that can be
gathered by using "--verbose debug,misc" and other "--verbose" variations
on the operf command line. It would be worthwhile to take a close look at
the code in operf.cpp and see how ctl-c is being handled. There could be a
problem with the order that things are shutdown, causing the problem. I
catch (const runtime_error & re) {
/* If the user does ctl-c, the operf-record
process may get interrupted
* in a system call, causing problems with writes
to the sample data pipe.
* So we'll ignore such errors unless the user requests debug info.
*/
if (!ctl_c || (cverb << vmisc)) {
cerr << "Caught runtime_error: " << re.what() << endl;
exit_code = EXIT_FAILURE;
}
goto fail_out;
}
-Will
大平怜
2015-10-02 15:10:21 UTC
Permalink
Raw Message
Sorry, I sent wrong source code. I am attaching the right one.


Regards,
Rei Odaira
Post by 大平怜
Hi Will,
How about the attached test program? This almost always causes the
problem in my environment.
gcc -o oprofile_multithread_test oprofile_multithread_test.c -lpthread
./oprofile_multithread_test
Usage: oprofile_multithread_test <number of spawns> <number of threads>
<number of operations per thread>
./oprofile_multithread_test -1 16 100000
In this example, oprofile_multithread_test spawns threads infinitely but
runs maximum 16 threads simultaneously. Each thread performs addition
100000 times and then completes. Please use ^C to end this program if you
specify -1 to the number of spawns.
If you profile this program with operf --pid, I expect you will not be
able to finish operf by a single ^C.
Regards,
Rei Odaira
Post by 大平怜
Hello again,
When using --pid, I have occasionally seen operf does not end by
hitting ^C once. By hitting ^C multiple times, operf ends with error
Hi,
I tried to replicate this on my local machine, but haven't seen it occur
yet. How often does it happen? Also does it make a difference when the
event sampling rate is changed? There is just one monitored process and it
isn't spawning new processes?
Post by 大平怜
-----
operf --events=CPU_CLK_UNHALTED:100000000:0:1:1 --pid `pgrep -f
CassandraDaemon`
Post by 大平怜
Kernel profiling is not possible with current system config.
Set /proc/sys/kernel/kptr_restrict to 0 to collect kernel samples.
operf: Press Ctl-c or 'kill -SIGINT 18042' to stop profiling
operf: Profiler started
^C^Cwaitpid for operf-record process failed: Interrupted system call
^Cwaitpid for operf-read process failed: Interrupted system call
Error running profiler
-----
I am using the master branch in the git repository.
The operf-read process was waiting for a read of a sample ID from the
(gdb) bt
#0 0x00007fd90e0fa480 in __read_nocancel ()
at ../sysdeps/unix/syscall-template.S:81
#1 0x0000000000412999 in read (__nbytes=8, __buf=0x7ffddc82b620,
__fd=<optimized out>) at
/usr/include/x86_64-linux-gnu/bits/unistd.h:44
Post by 大平怜
#2 __handle_fork_event (event=0x98e860) at operf_utils.cpp:125
sample_type=<optimized out>) at operf_utils.cpp:834
#4 0x0000000000417250 in operf_read::convertPerfData (
#5 0x000000000040a4cb in convert_sample_data () at operf.cpp:947
#6 0x0000000000407482 in _run () at operf.cpp:625
#7 main (argc=4, argv=0x7ffddc82be48) at operf.cpp:1539
The operf-record process was waiting for a write of sample data to the
operf-read process to complete. Why did the write of the sample data
Post by 大平怜
(gdb) bt
#0 0x00007fbe0dc9f4e0 in __write_nocancel ()
at ../sysdeps/unix/syscall-template.S:81
#1 0x000000000040cd0e in OP_perf_utils::op_write_output (output=6,
buf=0x7fbe0e4e5140, size=32) at operf_utils.cpp:989
#2 0x000000000040d605 in OP_perf_utils::op_get_kernel_event_data (
#3 0x000000000041bc12 in operf_record::recordPerfData (this=0xd07900)
at operf_counter.cpp:846
#4 0x00000000004098b8 in start_profiling () at operf.cpp:402
#5 0x0000000000406305 in _run () at operf.cpp:596
#6 main (argc=4, argv=0x7ffdc6fcde58) at operf.cpp:1539
As a result, when I hit ^C, the operf main process sent SIGUSR1 to the
operf-record process, in which the write returned with EINTR and simply got
retried. Since the operf-record process did not end, the operf main process
waited at waitpid(2) forever.
Post by 大平怜
Do you think my guess makes sense? What would be a fundamental
solution? Simply extending the pipe size would not be appropriate....
I am wondering if there are any other nuggets of information that can be
gathered by using "--verbose debug,misc" and other "--verbose" variations
on the operf command line. It would be worthwhile to take a close look at
the code in operf.cpp and see how ctl-c is being handled. There could be a
problem with the order that things are shutdown, causing the problem. I
catch (const runtime_error & re) {
/* If the user does ctl-c, the operf-record
process may get interrupted
* in a system call, causing problems with writes
to the sample data pipe.
* So we'll ignore such errors unless the user
requests debug info.
*/
if (!ctl_c || (cverb << vmisc)) {
cerr << "Caught runtime_error: " << re.what() << endl;
exit_code = EXIT_FAILURE;
}
goto fail_out;
}
-Will
William Cohen
2015-10-02 19:19:33 UTC
Permalink
Raw Message
Post by 大平怜
Sorry, I sent wrong source code. I am attaching the right one.
Regards,
Rei Odaira
Hi Will,
How about the attached test program? This almost always causes the problem in my environment.
gcc -o oprofile_multithread_test oprofile_multithread_test.c -lpthread
./oprofile_multithread_test
Usage: oprofile_multithread_test <number of spawns> <number of threads> <number of operations per thread>
./oprofile_multithread_test -1 16 100000
In this example, oprofile_multithread_test spawns threads infinitely but runs maximum 16 threads simultaneously. Each thread performs addition 100000 times and then completes. Please use ^C to end this program if you specify -1 to the number of spawns.
If you profile this program with operf --pid, I expect you will not be able to finish operf by a single ^C.
Regards,
Rei Odaira
Hi,

I was able to get the failure mentioned with the example code on a rhel7 machine.

$ /usr/local/bin/operf --events=CPU_CLK_UNHALTED:100000000:0:1:1 --pid `pgrep oprofile_multi`
operf: Press Ctl-c or 'kill -SIGINT 8976' to stop profiling
operf: Profiler started
Unable to collect samples for forked process 9569. Process may have ended before recording could be started.
Unable to collect samples for forked process 9570. Process may have ended before recording could be started.
Unable to collect samples for forked process 9571. Process may have ended before recording could be started.
Unable to collect samples for forked process 9572. Process may have ended before recording could be started.
Unable to collect samples for forked process 9573. Process may have ended before recording could be started.
Unable to collect samples for forked process 9574. Process may have ended before recording could be started.
Unable to collect samples for forked process 9576. Process may have ended before recording could be started.
Unable to collect samples for forked process 9577. Process may have ended before recording could be started.
Unable to collect samples for forked process 9578. Process may have ended before recording could be started.
Unable to collect samples for forked process 9579. Process may have ended before recording could be started.
Unable to collect samples for forked process 9580. Process may have ended before recording could be started.
Unable to collect samples for forked process 9581. Process may have ended before recording could be started.
Unable to collect samples for forked process 9582. Process may have ended before recording could be started.
Unable to collect samples for forked process 9583. Process may have ended before recording could be started.
Unable to collect samples for forked process 9584. Process may have ended before recording could be started.
Unable to collect samples for forked process 9585. Process may have ended before recording could be started.
Unable to collect samples for forked process 9586. Process may have ended before recording could be started.
Unable to collect samples for forked process 9587. Process may have ended before recording could be started.
Unable to collect samples for forked process 9588. Process may have ended before recording could be started.
Unable to collect samples for forked process 9589. Process may have ended before recording could be started.
Unable to collect samples for forked process 9590. Process may have ended before recording could be started.
Unable to collect samples for forked process 9591. Process may have ended before recording could be started.
Unable to collect samples for forked process 9592. Process may have ended before recording could be started.
Unable to collect samples for forked process 9593. Process may have ended before recording could be started.
Unable to collect samples for forked process 9594. Process may have ended before recording could be started.
Unable to collect samples for forked process 9595. Process may have ended before recording could be started.
Unable to collect samples for forked process 9596. Process may have ended before recording could be started.
Unable to collect samples for forked process 9597. Process may have ended before recording could be started.
Unable to collect samples for forked process 9598. Process may have ended before recording could be started.
Unable to collect samples for forked process 9599. Process may have ended before recording could be started.
Unable to collect samples for forked process 9600. Process may have ended before recording could be started.
Unable to collect samples for forked process 9601. Process may have ended before recording could be started.
^C^Cwaitpid for operf-record process failed: Interrupted system call
^Cwaitpid for operf-read process failed: Interrupted system call
Error running profiler

Threads are being created and destroyed very often in the example code. It took multiple times to get operf to connect up to all the threads. Many times I get messages like the following:

$ /usr/local/bin/operf --events=CPU_CLK_UNHALTED:100000000:0:1:1 --pid `pgrep oprofile_multi`
!!!! No samples collected !!!
The target program/command ended before profiling was started.
operf record init failed
usage: operf [ options ] [ --system-wide | --pid <pid> | [ command [ args ] ] ]
Error running profiler

However, it operf gets started, it seems to reliably fail with the ctl-c.

-Will
Post by 大平怜
Hello again,
Hi,
I tried to replicate this on my local machine, but haven't seen it occur yet. How often does it happen? Also does it make a difference when the event sampling rate is changed? There is just one monitored process and it isn't spawning new processes?
-----
operf --events=CPU_CLK_UNHALTED:100000000:0:1:1 --pid `pgrep -f CassandraDaemon`
Kernel profiling is not possible with current system config.
Set /proc/sys/kernel/kptr_restrict to 0 to collect kernel samples.
operf: Press Ctl-c or 'kill -SIGINT 18042' to stop profiling
operf: Profiler started
^C^Cwaitpid for operf-record process failed: Interrupted system call
^Cwaitpid for operf-read process failed: Interrupted system call
Error running profiler
-----
I am using the master branch in the git repository.
(gdb) bt
#0 0x00007fd90e0fa480 in __read_nocancel ()
at ../sysdeps/unix/syscall-template.S:81
#1 0x0000000000412999 in read (__nbytes=8, __buf=0x7ffddc82b620,
__fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:44
#2 __handle_fork_event (event=0x98e860) at operf_utils.cpp:125
sample_type=<optimized out>) at operf_utils.cpp:834
#4 0x0000000000417250 in operf_read::convertPerfData (
#5 0x000000000040a4cb in convert_sample_data () at operf.cpp:947
#6 0x0000000000407482 in _run () at operf.cpp:625
#7 main (argc=4, argv=0x7ffddc82be48) at operf.cpp:1539
(gdb) bt
#0 0x00007fbe0dc9f4e0 in __write_nocancel ()
at ../sysdeps/unix/syscall-template.S:81
#1 0x000000000040cd0e in OP_perf_utils::op_write_output (output=6,
buf=0x7fbe0e4e5140, size=32) at operf_utils.cpp:989
#2 0x000000000040d605 in OP_perf_utils::op_get_kernel_event_data (
#3 0x000000000041bc12 in operf_record::recordPerfData (this=0xd07900)
at operf_counter.cpp:846
#4 0x00000000004098b8 in start_profiling () at operf.cpp:402
#5 0x0000000000406305 in _run () at operf.cpp:596
#6 main (argc=4, argv=0x7ffdc6fcde58) at operf.cpp:1539
As a result, when I hit ^C, the operf main process sent SIGUSR1 to the operf-record process, in which the write returned with EINTR and simply got retried. Since the operf-record process did not end, the operf main process waited at waitpid(2) forever.
Do you think my guess makes sense? What would be a fundamental solution? Simply extending the pipe size would not be appropriate....
catch (const runtime_error & re) {
/* If the user does ctl-c, the operf-record process may get interrupted
* in a system call, causing problems with writes to the sample data pipe.
* So we'll ignore such errors unless the user requests debug info.
*/
if (!ctl_c || (cverb << vmisc)) {
cerr << "Caught runtime_error: " << re.what() << endl;
exit_code = EXIT_FAILURE;
}
goto fail_out;
}
-Will
------------------------------------------------------------------------------
_______________________________________________
oprofile-list mailing list
https://lists.sourceforge.net/lists/listinfo/oprofile-list
------------------------------------------------------------------------------
大平怜
2016-06-25 00:06:24 UTC
Permalink
Raw Message
I forgot to include oprofile-list.


Regards,
Rei Odaira

---------- Forwarded message ----------
From: 倧平怜 <***@gmail.com>
Date: 2016-06-24 19:04 GMT-05:00
Subject: Re: Potential deadlock in operf when using --pid
To: William Cohen <***@redhat.com>


Hi,

Now I come back to this problem.

I am attaching two patches to solve it. They apply to the latest master in
the git repository. Please apply oprofile_avoid_deadlock_1.patch first and
then oprofile_avoid_deadlock_2.patch. The first patch is just for
refactoring the code to make the second patch possible, and the second
patch is for the actual problem solving.

So, the problem is that the operf-record process is blocked on a write to
the sample output pipe when the pipe is full, while the operf-read process,
which is the consumer of the pipe, cannot read from the pipe because it is
blocked on a read from the comm pipe, to which the operf-record process is
supposed to write. Please refer to the previous emails for the detailed
problem description.

|<--sample output pipe--|
operf-read |-------comm pipe------>| operf-record
|<------comm pipe-------|

My solution is to make the sample output pipe non-blocking on the
operf-record side. When the sample output pipe is full, instead of waiting
on it, the operf-record process polls both the sample output pipe and the
(read-end of) comm pipe. If there is a message in the comm pipe from the
operf-read process, the operf-record process handles it and writes a
response to the comm pipe, so that the operf-read process can finish being
blocked on the read from the comm pipe and eventually can consume the
sample output pipe.

There are some other ways to solve this problem, so any comments are
welcome. With my patches, you can finish profiling
oprofile_multithread_test by a single ctrl-c. They passed all the tests in
oprofile-tests.


Regards,
Rei Odaira
Post by 大平怜
Post by 大平怜
Sorry, I sent wrong source code. I am attaching the right one.
Regards,
Rei Odaira
Hi Will,
How about the attached test program? This almost always causes the
problem in my environment.
Post by 大平怜
gcc -o oprofile_multithread_test oprofile_multithread_test.c
-lpthread
Post by 大平怜
./oprofile_multithread_test
Usage: oprofile_multithread_test <number of spawns> <number of
threads> <number of operations per thread>
Post by 大平怜
./oprofile_multithread_test -1 16 100000
In this example, oprofile_multithread_test spawns threads infinitely
but runs maximum 16 threads simultaneously. Each thread performs addition
100000 times and then completes. Please use ^C to end this program if you
specify -1 to the number of spawns.
Post by 大平怜
If you profile this program with operf --pid, I expect you will not
be able to finish operf by a single ^C.
Post by 大平怜
Regards,
Rei Odaira
Hi,
I was able to get the failure mentioned with the example code on a rhel7 machine.
$ /usr/local/bin/operf --events=CPU_CLK_UNHALTED:100000000:0:1:1 --pid
`pgrep oprofile_multi`
operf: Press Ctl-c or 'kill -SIGINT 8976' to stop profiling
operf: Profiler started
Unable to collect samples for forked process 9569. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9570. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9571. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9572. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9573. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9574. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9576. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9577. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9578. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9579. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9580. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9581. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9582. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9583. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9584. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9585. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9586. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9587. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9588. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9589. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9590. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9591. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9592. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9593. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9594. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9595. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9596. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9597. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9598. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9599. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9600. Process may have ended
before recording could be started.
Unable to collect samples for forked process 9601. Process may have ended
before recording could be started.
^C^Cwaitpid for operf-record process failed: Interrupted system call
^Cwaitpid for operf-read process failed: Interrupted system call
Error running profiler
Threads are being created and destroyed very often in the example code. It
took multiple times to get operf to connect up to all the threads. Many
$ /usr/local/bin/operf --events=CPU_CLK_UNHALTED:100000000:0:1:1 --pid
`pgrep oprofile_multi`
!!!! No samples collected !!!
The target program/command ended before profiling was started.
operf record init failed
usage: operf [ options ] [ --system-wide | --pid <pid> | [ command [ args ] ] ]
Error running profiler
However, it operf gets started, it seems to reliably fail with the ctl-c.
-Will
Post by 大平怜
Hello again,
When using --pid, I have occasionally seen operf does not end
by hitting ^C once. By hitting ^C multiple times, operf ends with error
Post by 大平怜
Hi,
I tried to replicate this on my local machine, but haven't seen
it occur yet. How often does it happen? Also does it make a difference
when the event sampling rate is changed? There is just one monitored
process and it isn't spawning new processes?
Post by 大平怜
-----
operf --events=CPU_CLK_UNHALTED:100000000:0:1:1 --pid `pgrep
-f CassandraDaemon`
Post by 大平怜
Kernel profiling is not possible with current system config.
Set /proc/sys/kernel/kptr_restrict to 0 to collect kernel
samples.
Post by 大平怜
operf: Press Ctl-c or 'kill -SIGINT 18042' to stop profiling
operf: Profiler started
^C^Cwaitpid for operf-record process failed: Interrupted
system call
Post by 大平怜
^Cwaitpid for operf-read process failed: Interrupted system
call
Post by 大平怜
Error running profiler
-----
I am using the master branch in the git repository.
The operf-read process was waiting for a read of a sample ID
(gdb) bt
#0 0x00007fd90e0fa480 in __read_nocancel ()
at ../sysdeps/unix/syscall-template.S:81
#1 0x0000000000412999 in read (__nbytes=8,
__buf=0x7ffddc82b620,
Post by 大平怜
__fd=<optimized out>) at
/usr/include/x86_64-linux-gnu/bits/unistd.h:44
Post by 大平怜
#2 __handle_fork_event (event=0x98e860) at operf_utils.cpp:125
sample_type=<optimized out>) at operf_utils.cpp:834
#4 0x0000000000417250 in operf_read::convertPerfData (
operf_counter.cpp:1147
Post by 大平怜
#5 0x000000000040a4cb in convert_sample_data () at
operf.cpp:947
Post by 大平怜
#6 0x0000000000407482 in _run () at operf.cpp:625
#7 main (argc=4, argv=0x7ffddc82be48) at operf.cpp:1539
The operf-record process was waiting for a write of sample
data to the operf-read process to complete. Why did the write of the
Post by 大平怜
(gdb) bt
#0 0x00007fbe0dc9f4e0 in __write_nocancel ()
at ../sysdeps/unix/syscall-template.S:81
#1 0x000000000040cd0e in OP_perf_utils::op_write_output
(output=6,
Post by 大平怜
buf=0x7fbe0e4e5140, size=32) at operf_utils.cpp:989
#2 0x000000000040d605 in
OP_perf_utils::op_get_kernel_event_data (
Post by 大平怜
#3 0x000000000041bc12 in operf_record::recordPerfData
(this=0xd07900)
Post by 大平怜
at operf_counter.cpp:846
#4 0x00000000004098b8 in start_profiling () at operf.cpp:402
#5 0x0000000000406305 in _run () at operf.cpp:596
#6 main (argc=4, argv=0x7ffdc6fcde58) at operf.cpp:1539
As a result, when I hit ^C, the operf main process sent
SIGUSR1 to the operf-record process, in which the write returned with EINTR
and simply got retried. Since the operf-record process did not end, the
operf main process waited at waitpid(2) forever.
Post by 大平怜
Do you think my guess makes sense? What would be a
fundamental solution? Simply extending the pipe size would not be
appropriate....
Post by 大平怜
I am wondering if there are any other nuggets of information
that can be gathered by using "--verbose debug,misc" and other "--verbose"
variations on the operf command line. It would be worthwhile to take a
close look at the code in operf.cpp and see how ctl-c is being handled.
There could be a problem with the order that things are shutdown, causing
the problem. I noticed around line 406 and of operf.cpp there is the
Post by 大平怜
catch (const runtime_error & re) {
/* If the user does ctl-c, the
operf-record process may get interrupted
Post by 大平怜
* in a system call, causing problems
with writes to the sample data pipe.
Post by 大平怜
* So we'll ignore such errors unless
the user requests debug info.
Post by 大平怜
*/
if (!ctl_c || (cverb << vmisc)) {
cerr << "Caught runtime_error: "
<< re.what() << endl;
Post by 大平怜
exit_code = EXIT_FAILURE;
}
goto fail_out;
}
-Will
------------------------------------------------------------------------------
Post by 大平怜
_______________________________________________
oprofile-list mailing list
https://lists.sourceforge.net/lists/listinfo/oprofile-list
William Cohen
2016-06-29 15:22:20 UTC
Permalink
Raw Message
Post by 大平怜
I forgot to include oprofile-list.
Regards,
Rei Odaira
Thanks for the patches. Right now I am at Red Hat Summit and away from the computer much of the time this week. However, I will look at the patches this week.

-Will
Post by 大平怜
---------- Forwarded message ----------
Date: 2016-06-24 19:04 GMT-05:00
Subject: Re: Potential deadlock in operf when using --pid
Hi,
Now I come back to this problem.
I am attaching two patches to solve it. They apply to the latest master in the git repository. Please apply oprofile_avoid_deadlock_1.patch first and then oprofile_avoid_deadlock_2.patch. The first patch is just for refactoring the code to make the second patch possible, and the second patch is for the actual problem solving.
So, the problem is that the operf-record process is blocked on a write to the sample output pipe when the pipe is full, while the operf-read process, which is the consumer of the pipe, cannot read from the pipe because it is blocked on a read from the comm pipe, to which the operf-record process is supposed to write. Please refer to the previous emails for the detailed problem description.
|<--sample output pipe--|
operf-read |-------comm pipe------>| operf-record
|<------comm pipe-------|
My solution is to make the sample output pipe non-blocking on the operf-record side. When the sample output pipe is full, instead of waiting on it, the operf-record process polls both the sample output pipe and the (read-end of) comm pipe. If there is a message in the comm pipe from the operf-read process, the operf-record process handles it and writes a response to the comm pipe, so that the operf-read process can finish being blocked on the read from the comm pipe and eventually can consume the sample output pipe.
There are some other ways to solve this problem, so any comments are welcome. With my patches, you can finish profiling oprofile_multithread_test by a single ctrl-c. They passed all the tests in oprofile-tests.
Regards,
Rei Odaira
Post by 大平怜
Sorry, I sent wrong source code. I am attaching the right one.
Regards,
Rei Odaira
Hi Will,
How about the attached test program? This almost always causes the problem in my environment.
gcc -o oprofile_multithread_test oprofile_multithread_test.c -lpthread
./oprofile_multithread_test
Usage: oprofile_multithread_test <number of spawns> <number of threads> <number of operations per thread>
./oprofile_multithread_test -1 16 100000
In this example, oprofile_multithread_test spawns threads infinitely but runs maximum 16 threads simultaneously. Each thread performs addition 100000 times and then completes. Please use ^C to end this program if you specify -1 to the number of spawns.
If you profile this program with operf --pid, I expect you will not be able to finish operf by a single ^C.
Regards,
Rei Odaira
Hi,
I was able to get the failure mentioned with the example code on a rhel7 machine.
$ /usr/local/bin/operf --events=CPU_CLK_UNHALTED:100000000:0:1:1 --pid `pgrep oprofile_multi`
operf: Press Ctl-c or 'kill -SIGINT 8976' to stop profiling
operf: Profiler started
Unable to collect samples for forked process 9569. Process may have ended before recording could be started.
Unable to collect samples for forked process 9570. Process may have ended before recording could be started.
Unable to collect samples for forked process 9571. Process may have ended before recording could be started.
Unable to collect samples for forked process 9572. Process may have ended before recording could be started.
Unable to collect samples for forked process 9573. Process may have ended before recording could be started.
Unable to collect samples for forked process 9574. Process may have ended before recording could be started.
Unable to collect samples for forked process 9576. Process may have ended before recording could be started.
Unable to collect samples for forked process 9577. Process may have ended before recording could be started.
Unable to collect samples for forked process 9578. Process may have ended before recording could be started.
Unable to collect samples for forked process 9579. Process may have ended before recording could be started.
Unable to collect samples for forked process 9580. Process may have ended before recording could be started.
Unable to collect samples for forked process 9581. Process may have ended before recording could be started.
Unable to collect samples for forked process 9582. Process may have ended before recording could be started.
Unable to collect samples for forked process 9583. Process may have ended before recording could be started.
Unable to collect samples for forked process 9584. Process may have ended before recording could be started.
Unable to collect samples for forked process 9585. Process may have ended before recording could be started.
Unable to collect samples for forked process 9586. Process may have ended before recording could be started.
Unable to collect samples for forked process 9587. Process may have ended before recording could be started.
Unable to collect samples for forked process 9588. Process may have ended before recording could be started.
Unable to collect samples for forked process 9589. Process may have ended before recording could be started.
Unable to collect samples for forked process 9590. Process may have ended before recording could be started.
Unable to collect samples for forked process 9591. Process may have ended before recording could be started.
Unable to collect samples for forked process 9592. Process may have ended before recording could be started.
Unable to collect samples for forked process 9593. Process may have ended before recording could be started.
Unable to collect samples for forked process 9594. Process may have ended before recording could be started.
Unable to collect samples for forked process 9595. Process may have ended before recording could be started.
Unable to collect samples for forked process 9596. Process may have ended before recording could be started.
Unable to collect samples for forked process 9597. Process may have ended before recording could be started.
Unable to collect samples for forked process 9598. Process may have ended before recording could be started.
Unable to collect samples for forked process 9599. Process may have ended before recording could be started.
Unable to collect samples for forked process 9600. Process may have ended before recording could be started.
Unable to collect samples for forked process 9601. Process may have ended before recording could be started.
^C^Cwaitpid for operf-record process failed: Interrupted system call
^Cwaitpid for operf-read process failed: Interrupted system call
Error running profiler
$ /usr/local/bin/operf --events=CPU_CLK_UNHALTED:100000000:0:1:1 --pid `pgrep oprofile_multi`
!!!! No samples collected !!!
The target program/command ended before profiling was started.
operf record init failed
usage: operf [ options ] [ --system-wide | --pid <pid> | [ command [ args ] ] ]
Error running profiler
However, it operf gets started, it seems to reliably fail with the ctl-c.
-Will
Post by 大平怜
Hello again,
Hi,
I tried to replicate this on my local machine, but haven't seen it occur yet. How often does it happen? Also does it make a difference when the event sampling rate is changed? There is just one monitored process and it isn't spawning new processes?
-----
operf --events=CPU_CLK_UNHALTED:100000000:0:1:1 --pid `pgrep -f CassandraDaemon`
Kernel profiling is not possible with current system config.
Set /proc/sys/kernel/kptr_restrict to 0 to collect kernel samples.
operf: Press Ctl-c or 'kill -SIGINT 18042' to stop profiling
operf: Profiler started
^C^Cwaitpid for operf-record process failed: Interrupted system call
^Cwaitpid for operf-read process failed: Interrupted system call
Error running profiler
-----
I am using the master branch in the git repository.
(gdb) bt
#0 0x00007fd90e0fa480 in __read_nocancel ()
at ../sysdeps/unix/syscall-template.S:81
#1 0x0000000000412999 in read (__nbytes=8, __buf=0x7ffddc82b620,
__fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:44
#2 __handle_fork_event (event=0x98e860) at operf_utils.cpp:125
sample_type=<optimized out>) at operf_utils.cpp:834
#4 0x0000000000417250 in operf_read::convertPerfData (
#5 0x000000000040a4cb in convert_sample_data () at operf.cpp:947
#6 0x0000000000407482 in _run () at operf.cpp:625
#7 main (argc=4, argv=0x7ffddc82be48) at operf.cpp:1539
(gdb) bt
#0 0x00007fbe0dc9f4e0 in __write_nocancel ()
at ../sysdeps/unix/syscall-template.S:81
#1 0x000000000040cd0e in OP_perf_utils::op_write_output (output=6,
buf=0x7fbe0e4e5140, size=32) at operf_utils.cpp:989
#2 0x000000000040d605 in OP_perf_utils::op_get_kernel_event_data (
#3 0x000000000041bc12 in operf_record::recordPerfData (this=0xd07900)
at operf_counter.cpp:846
#4 0x00000000004098b8 in start_profiling () at operf.cpp:402
#5 0x0000000000406305 in _run () at operf.cpp:596
#6 main (argc=4, argv=0x7ffdc6fcde58) at operf.cpp:1539
As a result, when I hit ^C, the operf main process sent SIGUSR1 to the operf-record process, in which the write returned with EINTR and simply got retried. Since the operf-record process did not end, the operf main process waited at waitpid(2) forever.
Do you think my guess makes sense? What would be a fundamental solution? Simply extending the pipe size would not be appropriate....
catch (const runtime_error & re) {
/* If the user does ctl-c, the operf-record process may get interrupted
* in a system call, causing problems with writes to the sample data pipe.
* So we'll ignore such errors unless the user requests debug info.
*/
if (!ctl_c || (cverb << vmisc)) {
cerr << "Caught runtime_error: " << re.what() << endl;
exit_code = EXIT_FAILURE;
}
goto fail_out;
}
-Will
------------------------------------------------------------------------------
_______________________________________________
oprofile-list mailing list
https://lists.sourceforge.net/lists/listinfo/oprofile-list
------------------------------------------------------------------------------
Attend Shape: An AT&T Tech Expo July 15-16. Meet us at AT&T Park in San
Francisco, CA to explore cutting-edge tech and listen to tech luminaries
present their vision of the future. This family event has something for
everyone, including kids. Get more information and register today.
http://sdm.link/attshape
_______________________________________________
oprofile-list mailing list
https://lists.sourceforge.net/lists/listinfo/oprofile-list
Loading...