[minicoredumper] [External] Infinite loop while executing read_remote
Jos Hulzink (Ellips B.V.)
jos.hulzink at ellips.com
Wed Aug 4 10:00:03 CEST 2021
> From: John Ogness <john.ogness at linutronix.de>
> Sent: Tuesday, 3 August 2021 22:01
> To: Jos Hulzink (Ellips B.V.) <jos.hulzink at ellips.com>;
>
> Some questions from me:
>
> Is there anything special about this binary?
Well... It is a computer vision application that easily consumes 60 GB+ of memory,
burns all cores on the latest Intel i7 processors, uses GPU-accelerated machine learning
on 3080Ti cards, has some real-time threads, uses zero-copy XDP sockets on 10 Gbit
Ethernet adapters and runs on the PREEMPT_RT kernel...
> Is it dynamically linked?
yes:
linux-vdso.so.1 (0x00007ffd68dbb000)
libbpf.so.0 => /usr/lib/libbpf.so.0 (0x00007f4186f5c000)
libc10.so => /Ellips/Lib/libtorch/lib/libc10.so (0x00007f4186cde000)
libtorch_cpu.so => /Ellips/Lib/libtorch/lib/libtorch_cpu.so (0x00007f4175fa6000)
libtorch_cuda.so => /Ellips/Lib/libtorch/lib/libtorch_cuda.so (0x00007f411d0f9000)
libnvidia-ml.so.1 => /usr/lib/libnvidia-ml.so.1 (0x00007f411ca70000)
libgomp.so.1 => /Ellips/Lib/libtorch/lib/libgomp.so.1 (0x00007f411c84b000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f411c674000)
libc.so.6 => /lib/libc.so.6 (0x00007f411c4b2000)
libm.so.6 => /lib/libm.so.6 (0x00007f411c371000)
libncurses.so.6 => /lib/libncurses.so.6 (0x00007f411c300000)
libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f411c2e6000)
libpthread.so.0 => /lib/libpthread.so.0 (0x00007f411c2c6000)
libelf.so.1 => /lib/libelf.so.1 (0x00007f411c2a9000)
libz.so.1 => /lib/libz.so.1 (0x00007f411c28c000)
/lib64/ld-linux-x86-64.so.2 (0x00007f4186f9b000)
librt.so.1 => /lib/librt.so.1 (0x00007f411c282000)
libdl.so.2 => /lib/libdl.so.2 (0x00007f411c27d000)
libcudart-3f3c6934.so.11.0 => /Ellips/Lib/libtorch/lib/libcudart-3f3c6934.so.11.0 (0x00007f411bffb000)
libc10_cuda.so => /Ellips/Lib/libtorch/lib/libc10_cuda.so (0x00007f411bdbe000)
libnvToolsExt-24de1d56.so.1 => /Ellips/Lib/libtorch/lib/libnvToolsExt-24de1d56.so.1 (0x00007f411bbb4000)
> Do you use special link options?
-no-pie and -Wl,-z,now everywhere; for libtorch only: -Wl,--no-as-needed -ltorch_cuda -Wl,--as-needed
> Do you use a linker other than ld, the GNU linker?
We use gold to reduce link times: our application contains about 1100 source files, leading to an executable of 760 MB with debug info.
> Are you stripping your binary? If yes, with what options?
Yes, with --strip-debug.
> Are you using any special minicoredumper features, such as
> linking with libminicoredumper?
No, we actually disabled those features during the build of minicoredumper.
> Are you specifying any files in the "dump_by_name" option of the recept
> file?
No, only the 'required' [vdso] entry is there.
> Is this application running in any special environments? Such as
> memory-bound containers or memory-limited process resources? If yes, is
> it possible that these limits are being hit?
The application runs as root, and the relevant ulimits are set to unlimited.
The system runs without swap and with mlockall enabled, so it is definitely limited by physical memory.
However, I am quite confident that this limit was not the issue on at least a few of the occasions where we saw this behaviour.
I did once see an out-of-memory exception in the application, but in that case minicoredumper crashed as well.
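
For anyone trying to reproduce this environment, here is a minimal C sketch of what "ulimits set to unlimited" and "mlockall enabled" amount to at the system-call level. It is not taken from the application or from minicoredumper: the function names are hypothetical, and the choice of which limits count as "relevant" (RLIMIT_CORE, RLIMIT_MEMLOCK, RLIMIT_AS) is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Hypothetical helper: returns 1 if the soft limit for `resource` is
 * unlimited, 0 if it is finite, and -1 if getrlimit() fails. */
int rlimit_is_unlimited(int resource)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0)
        return -1;
    return rl.rlim_cur == RLIM_INFINITY ? 1 : 0;
}

/* Hypothetical helper: print the state of the limits assumed to matter
 * here (core size, locked memory, address space). */
void report_limits(FILE *out)
{
    assert(out != NULL);
    fprintf(out, "RLIMIT_CORE unlimited:    %d\n",
            rlimit_is_unlimited(RLIMIT_CORE));
    fprintf(out, "RLIMIT_MEMLOCK unlimited: %d\n",
            rlimit_is_unlimited(RLIMIT_MEMLOCK));
    fprintf(out, "RLIMIT_AS unlimited:      %d\n",
            rlimit_is_unlimited(RLIMIT_AS));
}

/* Lock all current and future pages of the calling process into RAM.
 * Returns 0 on success or a negative errno value on failure. With no
 * swap configured, locked pages can only be satisfied from physical
 * memory. */
int lock_all_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return -errno;
    return 0;
}
```

Calling report_limits(stderr) and lock_all_memory() early in the process shows whether it really runs with the limits described above; lock_all_memory() typically fails with a negative errno when RLIMIT_MEMLOCK is finite and the process lacks the privilege to exceed it.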
Kind regards,
Jos