I was wondering what tools people use when debugging problems with programs or interactions between multiple programs/systems?
Also, if you could give a brief description of what kind of problems the tools are best at detecting, that would be great. I'm trying to get better at finding the cause of problems, especially those that involve multiple programs or systems (mostly UNIX, but Windows tools are welcome too).
Some tools that people have already suggested to me are: pfiles, ptree, pstack, netstat, strace, dtrace, tcpdump.
I've used netstat and ptree before, but I'm not too familiar with the others.
ptree shows you the currently running processes & PIDs in a tree of process -> sub-processes...
netstat shows you all the currently open ports...
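If you just want a quick yes/no on whether something is accepting connections on a given port (the kind of thing you'd otherwise eyeball in netstat output), a minimal Python sketch could look like the one below. The function name `is_port_open` is my own invention, and it only covers TCP over IPv4.

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

Calling something like `is_port_open("localhost", 8080)` (the port here is just a placeholder) tells you whether a server is reachable, but not which process owns the port; netstat or lsof is still needed for that.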
Obviously, looking at the log files for the programs you're debugging is a good place to start, and there are also debuggers like gdb (assuming you have access to the program's source code).
"electric fence" is a malloc replacement library for detecting many problems with allocated memory, such as overrun and use after free.
"valgrind" is very good at spotting memory leaks, many kinds of use before initialisation and many other useful things. A companion tool called helgrind attempts to spot where you're using variables across multiple threads without appropriate locking. There are several other 'grinders' listed on the home page.
"gprof" is a profiler for finding where your code spends the most time, and thus which parts are potential candidates for optimisation (or at least further analysis).
"gcov" provides test coverage information to help you locate functionality which hasn't been tested.
"wireshark" is the ultimate $0 tool for diagnosing network protocol issues. If your TCP/UDP program is messing up, this is what you need.
"lsof" provides you with a list of all open files.
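To give a feel for the kind of information lsof (or pfiles) reports, here is a rough Python sketch that lists the open file descriptors of one process by reading /proc/<pid>/fd. This is Linux-only (Solaris lays out /proc differently), the helper name `open_files` is made up for illustration, and you can only inspect processes you have permission to read.

```python
import os

def open_files(pid):
    """Map each open file descriptor of `pid` to what it refers to (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink to the file, socket or pipe behind the descriptor.
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return result

if __name__ == "__main__":
    for fd, target in sorted(open_files(os.getpid()).items()):
        print(fd, target)
```

Sockets and pipes show up as names like `socket:[12345]`, so this is handy for spotting leaked descriptors or a log file that a process never opened.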
Here's the list of troubleshooting commands I found yesterday and started playing around with:
- netstat - Print network connections, routing tables, interface statistics, masquerade connections, and multicast memberships.
- pfiles - Report fstat(2) and fcntl(2) information for all open files in a process.
- pmap - Displays the address space map.
- dmesg - Collect system diagnostic messages to form error log.
- ps -ejH - Display ps output in a process/sub-process tree structure.
- strace - Trace system calls and signals for a specified program.
- tcpdump - Dump traffic on a network.
- ptree - Display ps output in a process/sub-process tree structure.
- dtrace - DTrace dynamic tracing compiler and tracing utility.
- pstack - Displays the process stack.
- psig - List the signal actions and handlers of a process.
- pldd - Lists the linked libraries associated with the process.
- pflags - Displays the /proc-related flags for each lwp in the process.
- pcred - Shows the effective and real UIDs and GIDs associated with the process.
- truss - Traces library and system calls and signal activity for a given process. This can be very useful in seeing where a program is choking.
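As an illustration of what a tool like pcred reports, here is a small Python sketch that reads the real and effective UIDs and GIDs of a process from /proc/<pid>/status. It assumes Linux (the status file format is Linux-specific), and the function name `credentials` is just something I picked.

```python
import os

def credentials(pid):
    """Return real/effective UIDs and GIDs of `pid` from /proc/<pid>/status (Linux only)."""
    creds = {}
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            key, _, value = line.partition(":")
            if key in ("Uid", "Gid"):
                # The four fields are: real, effective, saved set, filesystem.
                real, effective = value.split()[:2]
                creds[key.lower()] = {"real": int(real), "effective": int(effective)}
    return creds

if __name__ == "__main__":
    print(credentials(os.getpid()))
```

Checking which user a process is really running as is often the quickest way to explain a "permission denied" that only shows up after deployment.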
I also found some system log file locations:
- /var/log/messages - General message and system related stuff.
- /var/log/kern.log - Kernel logs.
- /var/log/cron.log - Crond logs (cron job).
- /var/svc/log - Services controlled through SMF.
- /var/adm/messages - General system logs.
- /dev/msglog - rc script output (default location).
- /dev/sysmsg - Console messages (default location).
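Since several of these logs can be large, a first pass can be automated. The sketch below is one possible way to pull out suspicious lines from any log files you pass on the command line. The keyword list is purely an assumption on my part and should be tuned to what your components actually write.

```python
import re
import sys

# Assumed keywords; adjust to whatever your components actually log.
ERROR_PATTERN = re.compile(r"error|fail|exception|fatal|denied", re.IGNORECASE)

def find_suspect_lines(lines):
    """Yield (line_number, text) for every line that matches ERROR_PATTERN."""
    for number, line in enumerate(lines, start=1):
        if ERROR_PATTERN.search(line):
            yield number, line.rstrip("\n")

if __name__ == "__main__":
    for path in sys.argv[1:]:
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with open(path, errors="replace") as log:
            for number, text in find_suspect_lines(log):
                print(f"{path}:{number}: {text}")
```

The output format (path, line number, text) mirrors what grep -n gives you, so it is easy to jump to the surrounding context afterwards.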
DTrace is also available on OS X and, more recently, on FreeBSD. tcpdump is also available on a variety of different platforms; I believe it's even mentioned in the BSD TCP/IP Illustrated book.
Here are some Java tools I found, but I'm not sure how useful some of them are.
- jps - Java Virtual Machine Process Status Tool. A 'ps' command for Java processes.
- jstack - Prints a stack trace of each thread of a specified Java process.
- jstat - Displays performance statistics for a JVM.
- jdb - Java Debugger.
- jmap - Prints shared object memory maps or heap memory details of a given process or core file or a remote debug server.
- jinfo - Prints Java configuration information for a given Java process or core file or a remote debug server.
On Linux, also take a look at ltrace, which is like strace but traces all dynamic library calls. You get a much higher-level view than with strace (instead of seeing write() you see printf(), for instance).
I could give better input if you provided some example problems you'd want to solve; then I could offer advice on what I'd do for each.
Well we have an automated build system that builds and deploys components and runs some tests on them. I'm one of the guys that looks after that system.
Originally Posted by brewbuck
When the compiling fails, it's no problem finding the cause.
When the deployment & tests fail, it's a nightmare for me to find the cause, since we have dozens of different components ranging from C++ to Java to Perl to Web stuff, and I know next to nothing about how each team's code works or how their tests work; so finding the cause of failures is pretty hard.
Ideally the teams responsible for the particular build that's failing should look at it and fix it themselves; but since they might be busy with other priorities, that doesn't always happen.
So basically I need to become better at diagnosing the reason why programs fail to start up, run properly, install properly, or communicate with other servers properly, i.e. by looking at logs and checking which processes are running. Debugging the code is up to the developers that wrote it.
Most of our software runs on Solaris and a bit on Linux. So those are the 2 I'm most interested in.
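For the "fails to start up" case, one rough starting point is to wrap the start command so that you always record how it ended: exit code, terminating signal, and the last few lines of stderr. Here is a minimal Python sketch of that idea. The function name `run_and_report`, the 60-second default timeout and the five-line tail are all arbitrary choices of mine, not something any of the tools above provide.

```python
import subprocess
import sys

def run_and_report(command, timeout=60):
    """Run `command` (a list of arguments) and summarise how it ended."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError:
        # The executable itself is missing or not on the PATH.
        return {"status": "not found", "returncode": None, "stderr_tail": []}
    except subprocess.TimeoutExpired:
        return {"status": "timed out", "returncode": None, "stderr_tail": []}
    if proc.returncode < 0:
        # On POSIX a negative return code means the process died from a signal.
        status = f"killed by signal {-proc.returncode}"
    elif proc.returncode == 0:
        status = "ok"
    else:
        status = "failed"
    return {
        "status": status,
        "returncode": proc.returncode,
        "stderr_tail": proc.stderr.splitlines()[-5:],
    }

if __name__ == "__main__":
    print(run_and_report(sys.argv[1:] or [sys.executable, "--version"]))
```

Knowing whether a component exited with an error, was killed by a signal, or never started at all usually narrows down which of the tools in this thread (logs, truss/strace, pstack, netstat) is worth reaching for next.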