How to troubleshoot a hung/stuck JBoss container (Java) process?
There are several tools available on Linux for troubleshooting this.
1) kill -3 (thread dump)
- Enable thread dumps in the JBoss configuration file (generally run.conf). Make sure this is enabled in advance so that you can take a thread dump whenever required. To take the dump, simply issue the following command, which sends SIGQUIT to the JVM:
# kill -3 <pid>
You should see the dump on the console (the JVM's standard output). How do you capture it in a file? Try it on your own. Have some fun with that. ;-)
2) jmap
- Using this tool you can monitor the process's memory footprint. Keep running the command below and watch the heap-related values, which may give some insight into the stuck PID.
# jmap -F -heap <PID>
Attaching to process ID 3423, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 24.65-b04
using thread-local object allocation.
Parallel GC with 8 thread(s)
Heap Configuration:
   MinHeapFreeRatio = 0
   MaxHeapFreeRatio = 100
   MaxHeapSize      = 8589934592 (8192.0MB)
   NewSize          = 1310720 (1.25MB)
   MaxNewSize       = 17592186044415 MB
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 2147483648 (2048.0MB)
   MaxPermSize      = 2147483648 (2048.0MB)
   G1HeapRegionSize = 0 (0.0MB)
Heap Usage:
PS Young Generation
Eden Space:
   capacity = 2738356224 (2611.5MB)
   used     = 1995271560 (1902.839241027832MB)
   free     = 743084664 (708.660758972168MB)
   72.8638422756206% used
From Space:
   capacity = 62390272 (59.5MB)
   used     = 54670224 (52.13758850097656MB)
   free     = 7720048 (7.3624114990234375MB)
   87.62619916130515% used
To Space:
   capacity = 62390272 (59.5MB)
   used     = 0 (0.0MB)
   free     = 62390272 (59.5MB)
   0.0% used
PS Old Generation
   capacity = 5726797824 (5461.5MB)
   used     = 498643080 (475.54309844970703MB)
   free     = 5228154744 (4985.956901550293MB)
   8.707188472941628% used
PS Perm Generation
   capacity = 2147483648 (2048.0MB)
   used     = 81746800 (77.95982360839844MB)
   free     = 2065736848 (1970.0401763916016MB)
   3.80663201212883% used
42057 interned Strings occupying 4723296 bytes.
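When the command is run repeatedly, the "% used" lines are usually the quickest thing to compare between runs. As a rough sketch, the output can be reduced to one line per heap region. Here heap_summary is a hypothetical helper name, and the filter assumes the jmap -heap layout shown above, where every "% used" line follows the label of its space or generation.

```shell
# heap_summary: read "jmap -heap" output on stdin and print one
# "<region> <percent used>" line per heap region.
# Assumes the layout shown above (label line first, "% used" line last).
heap_summary() {
  awk '
    # Remember the most recent region label, e.g. "Eden Space:" or "PS Old Generation".
    /(Space:|Generation)[ \t]*$/ {
      label = $0
      sub(/^[ \t]+/, "", label)
      sub(/[ \t]+$/, "", label)
      sub(/:$/, "", label)
      next
    }
    # A "% used" line closes the region; print it rounded to one decimal.
    /% used[ \t]*$/ { printf "%s %.1f%%\n", label, $1 + 0 }
  '
}
```

It could then be used in a loop, for example: while true; do jmap -heap <pid> | heap_summary; sleep 10; done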
3) jstack
- It captures what each thread is currently doing: waiting on something, polling for a resource, etc.
# jstack -l <pid>
For example, it may give something like this:
"xxxjmsContainer-1" prio=10 tid=0x00007fb3a84b3800 nid=0x207b waiting for monitor entry [0x00007fb3bc6ac000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.net.InetAddress.getLocalHost(InetAddress.java:1455)
        - waiting to lock <0x000000060133b6c8> (a java.lang.Object)
or
"Agent DNS Service 3590" daemon prio=10 tid=0x00007f3228234000 nid=0x8639 in Object.wait() [0x00007f30f9558000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:503)
        at java.net.InetAddress.checkLookupTable(InetAddress.java:1363)
        - locked <0x0000000600d49400> (a java.util.HashMap)
        at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1280)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1246)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1223)
        at java.net.InetAddress.getHostFromNameService(InetAddress.java:607)
        at java.net.InetAddress.getHostName(InetAddress.java:532)
        at java.net.InetAddress.getHostName(InetAddress.java:504)
        at com.wily.introscope.agent.dns.DnsQueryProviderDefault.getDnsHostNameByIPAddr(DnsQueryProviderDefault.java:62)
        at com.wily.introscope.agent.dns.DnsServiceExecutor$3.call(DnsServiceExecutor.java:265)
        at com.wily.introscope.agent.dns.DnsServiceExecutor$3.call(DnsServiceExecutor.java:262)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
   Locked ownable synchronizers:
        - <0x000000062dd2b428> (a java.util.concurrent.ThreadPoolExecutor$Worker)
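A full dump can contain hundreds of such stacks, so it may help to first count how many threads are in each state before reading individual traces. The following is a minimal sketch. thread_state_counts is a hypothetical helper name, and it relies only on the "java.lang.Thread.State:" lines that jstack prints, as in the example above.

```shell
# thread_state_counts: read a jstack dump on stdin and print
# "<count> <state>" lines, most frequent state first.
thread_state_counts() {
  awk '
    # The word after "java.lang.Thread.State:" is the state, e.g. BLOCKED.
    /java\.lang\.Thread\.State:/ { n[$2]++ }
    END { for (s in n) print n[s], s }
  ' | sort -rn
}
```

For example: jstack -l <pid> | thread_state_counts. A large and growing BLOCKED count between two dumps is a hint to look at what those threads are waiting to lock.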
4) strace
- Find the PID of the hanging/stuck process.
- Attach strace to this PID:
# strace -p <pid>
If it is the parent PID, it will show something like this:
Process 3423 attached - interrupt to quit
futex(0x7f7c9d5659d0, FUTEX_WAIT, 3426, NULL
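FUTEX_WAIT means the parent PID is blocked, waiting for one of its children to notify it (the value 3426 here looks like the ID of the child it is waiting on). Let's find its children. Note that for a JVM these "children" are really threads (LWPs) of the same process, which is why ps is run with -L below; the fourth column of its output is the LWP ID.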
# ps -efL | grep <Parent-pid> | grep -v grep | awk '{print$4}'
This command shows all the child PIDs (thread IDs). If there are plenty of them, it suggests that threads are getting stuck on some resource: the parent gets no response back from its children and keeps spawning new ones until it reaches its threshold. Attach strace to one of the children:
# strace -p <child-pid>
Process 29770 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
poll([{fd=1028, events=POLLIN|POLLERR}], 1, 500) = 0 (Timeout)
poll([{fd=1028, events=POLLIN|POLLERR}], 1, 500) = 0 (Timeout)
futex(0x7efb9c561854, FUTEX_WAIT_PRIVATE, 1636407, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7efb9c561828, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7efb9c2ecb54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7efb9c2ecb50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7efb9c2ecb28, FUTEX_WAKE_PRIVATE, 1) = 1
poll([{fd=1028, events=POLLIN|POLLERR}], 1, 500) = 0 (Timeout)
poll([{fd=1028, events=POLLIN|POLLERR}], 1, 500) = 0 (Timeout)
poll([{fd=1028, events=POLLIN|POLLERR}], 1, 500) = 1 ([{fd=1028, revents=POLLIN}])
recvfrom(1028, "\0\0\0$\0$7c3c757b-3f2e-11e4-8f0b-27"..., 65535, 0, {sa_family=AF_INET6, sin6_port=htons(45179), inet_pton(AF_INET6, "::ffff:192.168.0.76", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 4096
poll([{fd=1028, events=POLLIN|POLLERR}], 1, 500) = 0 (Timeout)
poll([{fd=1028, events=POLLIN|POLLERR}], 1, 500) = 0 (Timeout)
In this case, the child is not able to resolve the host name and keeps polling...
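To pick which child to attach to, the jstack output can be tied back to the IDs listed by ps -efL: on Linux, the nid field in a jstack thread header is the thread's LWP ID written in hexadecimal. A small sketch of the conversion follows. nid_to_lwp is a hypothetical helper name, and the nid value is taken from the BLOCKED thread in the jstack example above purely for illustration.

```shell
# nid_to_lwp: convert a jstack "nid" (hex, e.g. 0x207b) to the decimal
# thread ID that ps -efL and strace -p use.
nid_to_lwp() {
  printf '%d\n' "$1"
}

nid_to_lwp 0x207b   # prints 8315
```

The result can be passed straight to strace, for example: strace -p "$(nid_to_lwp 0x207b)"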
Hope the above tricks help to narrow down the situation.
Keep troubleshooting...