How to troubleshoot a hung/stuck JBoss container process?

How to troubleshoot a hung/stuck Java process?

There are several tools available on Linux to troubleshoot this.

1) Enable thread dumps in the JBoss configuration file (generally run.conf). Make sure this is done ahead of time so that you can take a thread dump whenever you need one. To take a dump, simply issue the following command (kill -3 sends SIGQUIT, which makes the JVM print a thread dump instead of exiting):
# kill -3 <pid>
You should see the dump on the JVM's console. How do you capture it in a file? Try it on your own, and have some fun with that. ;-)
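
A single dump rarely tells the whole story; comparing a few dumps taken some seconds apart usually shows which threads are really stuck and which are just passing through. Here is a minimal sketch of that idea, assuming a HotSpot JVM that prints its dump on SIGQUIT. The function name thread_dump_series and its defaults (3 dumps, 5 seconds apart) are made up for illustration.

```shell
# Send SIGQUIT to a JVM several times so consecutive thread dumps can be compared.
# Arguments: pid, number of dumps (default 3), seconds between dumps (default 5).
thread_dump_series() {
  pid=$1
  count=${2:-3}
  interval=${3:-5}
  i=1
  while [ "$i" -le "$count" ]; do
    # kill -3 sends SIGQUIT; the JVM answers by printing a thread dump on its stdout
    kill -3 "$pid" || return 1
    # pause between dumps, but not after the last one
    [ "$i" -lt "$count" ] && sleep "$interval"
    i=$((i + 1))
  done
}

# Usage (hypothetical pid):
#   thread_dump_series 3423 3 5
```

The dumps still land wherever the JVM's standard output goes, so this only spaces them out; it does not capture them.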

2) jmap    

  • Using this tool you can monitor the process's memory footprint. Keep running the command below and watch the heap-related parameters, which may give some insight into the stuck PIDs.

# jmap -F -heap <PID>
Attaching to process ID 3423, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 24.65-b04

using thread-local object allocation.
Parallel GC with 8 thread(s)

Heap Configuration:
   MinHeapFreeRatio = 0
   MaxHeapFreeRatio = 100
   MaxHeapSize      = 8589934592 (8192.0MB)
   NewSize          = 1310720 (1.25MB)
   MaxNewSize       = 17592186044415 MB
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 2147483648 (2048.0MB)
   MaxPermSize      = 2147483648 (2048.0MB)
   G1HeapRegionSize = 0 (0.0MB)

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 2738356224 (2611.5MB)
   used     = 1995271560 (1902.839241027832MB)
   free     = 743084664 (708.660758972168MB)
   72.8638422756206% used
From Space:
   capacity = 62390272 (59.5MB)
   used     = 54670224 (52.13758850097656MB)
   free     = 7720048 (7.3624114990234375MB)
   87.62619916130515% used
To Space:
   capacity = 62390272 (59.5MB)
   used     = 0 (0.0MB)
   free     = 62390272 (59.5MB)
   0.0% used
PS Old Generation
   capacity = 5726797824 (5461.5MB)
   used     = 498643080 (475.54309844970703MB)
   free     = 5228154744 (4985.956901550293MB)
   8.707188472941628% used
PS Perm Generation
   capacity = 2147483648 (2048.0MB)
   used     = 81746800 (77.95982360839844MB)
   free     = 2065736848 (1970.0401763916016MB)
   3.80663201212883% used

42057 interned Strings occupying 4723296 bytes.
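
The full output is a lot to read when you run it again and again. As a rough sketch, the helper below boils the "Heap Usage" part down to one "% used" line per space. heap_usage_summary is a hypothetical name, not part of the JDK, and it assumes the Parallel GC layout printed by the JVM version shown above; other collectors label their spaces differently.

```shell
# Summarize `jmap -heap` output read from stdin: one "<space> <percent used>" line each.
heap_usage_summary() {
  LC_ALL=C awk '
    # remember the most recent space heading
    /^(Eden|From|To) Space:/        { name = $0; sub(/:.*$/, "", name); next }
    /^PS (Old|Perm) Generation/     { name = $0; sub(/[ \t]+$/, "", name); next }
    # lines such as "   72.8638422756206% used"
    /% used/ {
      pct = $1
      sub(/%$/, "", pct)
      printf "%s %.1f%%\n", name, pct
    }
  '
}

# Usage:
#   jmap -F -heap <PID> | heap_usage_summary
```

Run in a loop with a sleep in between, this makes it easier to spot a generation that keeps filling up.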



3)  jstack

  • It captures what each thread is currently doing: running, waiting on something, polling for a resource, etc.

# jstack -l <pid>
For example, it may give something like this:
"xxxjmsContainer-1" prio=10 tid=0x00007fb3a84b3800 nid=0x207b waiting for monitor entry [0x00007fb3bc6ac000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.net.InetAddress.getLocalHost(InetAddress.java:1455)
        - waiting to lock <0x000000060133b6c8> (a java.lang.Object)

or
"Agent DNS Service 3590" daemon prio=10 tid=0x00007f3228234000 nid=0x8639 in Object.wait() [0x00007f30f9558000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:503)
        at java.net.InetAddress.checkLookupTable(InetAddress.java:1363)
        - locked <0x0000000600d49400> (a java.util.HashMap)
        at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1280)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1246)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1223)
        at java.net.InetAddress.getHostFromNameService(InetAddress.java:607)
        at java.net.InetAddress.getHostName(InetAddress.java:532)
        at java.net.InetAddress.getHostName(InetAddress.java:504)
        at com.wily.introscope.agent.dns.DnsQueryProviderDefault.getDnsHostNameByIPAddr(DnsQueryProviderDefault.java:62)
        at com.wily.introscope.agent.dns.DnsServiceExecutor$3.call(DnsServiceExecutor.java:265)
        at com.wily.introscope.agent.dns.DnsServiceExecutor$3.call(DnsServiceExecutor.java:262)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - <0x000000062dd2b428> (a java.util.concurrent.ThreadPoolExecutor$Worker)
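
Two small things help when reading a big dump: a count of threads per state, and a way to match a thread in the dump with what the OS tools show. On Linux the nid in jstack output is the native thread id in hex, while ps, top -H and strace use decimal. The helpers below are only a sketch and the function names are invented for this post.

```shell
# Count threads per state in a thread dump read from stdin, most common first.
thread_state_tally() {
  LC_ALL=C awk '
    /java\.lang\.Thread\.State:/ { n[$2]++ }
    END { for (s in n) print n[s], s }
  ' | sort -k1,1nr -k2,2
}

# Convert a jstack nid (hex, e.g. 0x207b) to the decimal thread id used by ps/top/strace.
nid_to_tid() { printf '%d\n' "$1"; }

# Convert a decimal thread id back to the hex form to search for in the dump.
tid_to_nid() { printf '0x%x\n' "$1"; }

# Usage:
#   jstack -l <pid> | thread_state_tally
#   nid_to_tid 0x207b
```

With the decimal id in hand, the blocked thread from the first example can be handed straight to strace -p.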


4) strace

  • Find the PID of the hanging/stuck process
  • Attach strace to this PID
# strace -p <pid>
If it is a parent PID, it will show something like this:
Process 3423 attached - interrupt to quit

futex(0x7f7c9d5659d0, FUTEX_WAIT, 3426, NULL

FUTEX_WAIT means the parent PID is blocked, waiting for one of its children to wake it up. In a JVM these children are threads (lightweight processes, LWPs) rather than separate processes. Let's find them.

# ps -efL | grep <Parent-pid> | grep -v grep | awk '{print$4}'
This command lists all the child PIDs (the LWP column of ps -efL). If there are a great many of them, something is probably wrong: threads are getting stuck on some resource, so the parent never gets a response back from them and keeps spawning new ones until it reaches its threshold.
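
The ids in that fourth column are thread (LWP) ids, and the kernel exposes the same list under /proc. As a sketch, assuming a Linux /proc filesystem, the helpers below list and count them without grepping through ps output, which can also match unrelated lines. list_thread_ids and count_threads are illustrative names.

```shell
# List the thread (LWP) ids of a process, one per line, from /proc/<pid>/task.
list_thread_ids() {
  ls "/proc/$1/task"
}

# Count the threads of a process.
count_threads() {
  list_thread_ids "$1" | wc -l
}

# Usage:
#   count_threads <Parent-pid>
#   list_thread_ids <Parent-pid>
```

Watching count_threads over a few minutes shows whether the number of threads is really growing or just high.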

# strace -p <child-pid>

Process 29770 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
poll([{fd=1028, events=POLLIN|POLLERR}], 1, 500) = 0 (Timeout)
poll([{fd=1028, events=POLLIN|POLLERR}], 1, 500) = 0 (Timeout)
futex(0x7efb9c561854, FUTEX_WAIT_PRIVATE, 1636407, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7efb9c561828, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7efb9c2ecb54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7efb9c2ecb50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7efb9c2ecb28, FUTEX_WAKE_PRIVATE, 1) = 1
poll([{fd=1028, events=POLLIN|POLLERR}], 1, 500) = 0 (Timeout)
poll([{fd=1028, events=POLLIN|POLLERR}], 1, 500) = 0 (Timeout)
poll([{fd=1028, events=POLLIN|POLLERR}], 1, 500) = 1 ([{fd=1028, revents=POLLIN}])
recvfrom(1028, "\0\0\0$\0$7c3c757b-3f2e-11e4-8f0b-27"..., 65535, 0, {sa_family=AF_INET6, sin6_port=htons(45179), inet_pton(AF_INET6, "::ffff:192.168.0.76", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 4096
poll([{fd=1028, events=POLLIN|POLLERR}], 1, 500) = 0 (Timeout)
poll([{fd=1028, events=POLLIN|POLLERR}], 1, 500) = 0 (Timeout)

In this case, the child is not able to resolve the host name and keeps polling...
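
When the trace scrolls by too fast to read, a tally of the calls makes a pattern like the repeated poll above stand out. A rough sketch follows; strace_syscall_tally is a made-up name, and it expects plain strace -p output without the [pid N] prefixes that strace -f adds.

```shell
# Tally the syscalls in an strace log read from stdin, most frequent first.
strace_syscall_tally() {
  LC_ALL=C awk '
    # a syscall line starts with the call name followed by "("
    /^[a-z_][a-z0-9_]*\(/ {
      name = $0
      sub(/\(.*/, "", name)
      count[name]++
    }
    END { for (name in count) print count[name], name }
  ' | sort -k1,1nr -k2,2
}

# Usage: save a trace with -o, stop it with Ctrl-C, then summarize it
#   strace -p <child-pid> -o trace.log
#   strace_syscall_tally < trace.log
```

strace also has a built-in -c option that prints a summary when it detaches; the helper is mainly handy for logs you have already saved.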

Hope the above tricks help to narrow down the problem.

Keep troubleshooting...
