Coredump-level backtraces and duplicate detection
Unattended servers and poorly maintained machines are
vulnerable to having their disks filled by ABRT reports of repeated
crashes. Moreover, repeated crashes occasionally slip into Red Hat
Bugzilla, wasting developer time.
We are creating a system which detects that two coredumps are
caused by the same software bug without using debugging
symbols. Run immediately after a crash is detected, this system
saves disk space, as well as users' and maintainers' effort, by
recognizing duplicates that would otherwise be detected only after a
long process.
Coredump-level duplicate detection can be used to build a scalable
crash collection server. By recognizing duplicates between various
users' crashes, many coredumps can be deleted and server disk usage
will be much more predictable.
Reason for existence
Issues this project is addressing:
- When a user is hit by the same bug in an application or library
  multiple times (consider daemons automatically respawned via D-Bus
  and systemd after every crash) without processing and deleting the
  associated ABRT crash report, duplicate coredumps fill up the
  system's hard drive. ABRT stops filling the space only when
  less than 4 GB remain free. Typical victims:
  - unattended machines (esp. servers where a monitoring service is
    not set up to monitor ABRT crashes)
  - crashes of programs running under less exposed users
    - some daemons run under their own user, and crash
      reports pile up for that user account
    - root crashes when the administrator is not checking ABRT
      reports for root
  - improperly maintained machines (inexperienced Fedora users
    ignoring ABRT)
- A single user occasionally reports multiple almost-identical
  crashes to a single component in Red Hat Bugzilla. No duplicate
  detection can be 100% correct, so this will still happen. However,
  introducing another level of duplicate detection lowers the number
  of such cases.
Opportunities this project is pursuing:
- Coredump-level duplicate detection is one of the fundamental
  pieces of any scalable crash collection server. Instead of storing
  and processing thousands of almost-identical coredumps (typically
  between 500 kB and 500 MB in size) from thousands of users, we can
  merge them instantly. This enables a crash server to survive
  the floods of reports caused by popular crashes.
Objectives
- Create a system which detects that two coredumps are caused by
  the same software bug without using debugging symbols. Do it
  quickly enough that the check can run at crash time. Do it
  generally enough that coredumps are marked as identical even when
  the application and related libraries have been updated (patched,
  rebuilt) in the newer coredump.
- Integrate the system into ABRT so it runs at the post-create
  event of C/C++ crashes.
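The "generally enough" requirement can be pictured with a toy function fingerprint that ignores what changes on every rebuild, such as absolute addresses. This is only an illustrative sketch, not the algorithm chosen by the project, which this page leaves open. The helper name `fingerprint_function`, the tab-separated objdump-style input (address, raw bytes, instruction), and the choice of hashing the mnemonic sequence are all assumptions.

```python
import hashlib

def fingerprint_function(disassembly_lines):
    """Hash the sequence of instruction mnemonics of one function.

    Hypothetical helper, not part of ABRT or btparser. Addresses, raw
    bytes and operands are ignored because they tend to change when a
    package is rebuilt.
    """
    mnemonics = []
    for line in disassembly_lines:
        # Assumed format: "<address>:\t<raw bytes>\t<mnemonic> <operands>"
        fields = line.split("\t")
        if len(fields) < 3 or not fields[2].strip():
            continue  # not an instruction line
        mnemonics.append(fields[2].split()[0])
    return hashlib.sha1(" ".join(mnemonics).encode("utf-8")).hexdigest()

# The same function at different addresses in two builds.
old_build = [
    "  4004d6:\t55\tpush   %rbp",
    "  4004d7:\t48 89 e5\tmov    %rsp,%rbp",
    "  4004da:\tc3\tretq",
]
new_build = [
    "  401136:\t55\tpush   %rbp",
    "  401137:\t48 89 e5\tmov    %rsp,%rbp",
    "  40113a:\tc3\tretq",
]
print(fingerprint_function(old_build) == fingerprint_function(new_build))  # prints True
```

A fingerprint this coarse would also collide for unrelated functions that share a mnemonic sequence, which is one reason the project plans to compare several approaches on real Fedora builds.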
Outcomes
- btparser extensions
  - btparser will be extended to handle normalization of
    coredump-level backtraces
  - btparser's thread metrics need to allow computing a distance
    between coredump-level backtraces
- abrt-action-generate-canonical-backtrace (git)
  - a command-line tool to generate a backtrace from a
    coredump
  - open questions: does it obsolete the abrt-action-analyze-c
    plugin? should we get rid of UUID?
  - gets a backtrace with offsets of functions in binaries (via
    gdb)
  - gets paths to binaries referenced in the backtrace
    (eu-unstrip)
  - gets assembly output of functions referenced in the
    backtrace from the binaries
  - generates fingerprints for functions from the assembly
    output
  - stores a backtrace with build ids, offsets, and
    fingerprints to a crash report directory
- duplicate detection routine in abrtd daemon
  - replace function is_crash_a_dup in abrt-handle-event.c
  - load a coredump-level backtrace, run the btparser
    normalization on it, trim it to a reasonable number of
    frames
  - use btparser metrics to compute distance between
    coredumps (trimmed crash threads)
  - find the right balance between thread trimming and distance
    threshold to consider coredumps as duplicates
- a system to find the best function fingerprint algorithm by
  testing various approaches on multiple versions of all Fedora
  components
  - faf-funfin-analyze-builds
  - faf-funfin-analyze-build
  - faf-funfin-analyze-binary
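The duplicate check outlined for abrtd (trim the crash threads, measure their distance, compare against a threshold) can be sketched as follows. This is an assumption-laden illustration, not btparser's API: frames are plain strings (for example function fingerprints), Levenshtein edit distance stands in for whichever btparser metric ends up being used, and `MAX_FRAMES` and `THRESHOLD` are placeholder values that would need the tuning mentioned in the list.

```python
MAX_FRAMES = 8     # placeholder trim length, not a tuned value
THRESHOLD = 0.25   # placeholder cutoff for the normalized distance

def levenshtein(a, b):
    """Edit distance between two sequences of frames."""
    previous = list(range(len(b) + 1))
    for i, frame_a in enumerate(a, start=1):
        current = [i]
        for j, frame_b in enumerate(b, start=1):
            cost = 0 if frame_a == frame_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def is_duplicate(thread_a, thread_b,
                 max_frames=MAX_FRAMES, threshold=THRESHOLD):
    """Trim both crash threads to their top frames and compare them."""
    a, b = thread_a[:max_frames], thread_b[:max_frames]
    longest = max(len(a), len(b))
    if longest == 0:
        return False  # nothing to compare; never merge empty backtraces
    return levenshtein(a, b) / longest <= threshold

crash_1 = ["fp_abort", "fp_raise", "fp_parse", "fp_main"]
crash_2 = ["fp_abort", "fp_raise", "fp_parse", "fp_start"]
print(is_duplicate(crash_1, crash_2))  # prints True (distance 1 of 4 frames)
```

Dividing by the length of the longer trimmed thread keeps the threshold meaningful for both short and long backtraces. A shorter trim makes the check more tolerant of differences deep in the stack, which is the balance the last item of the routine refers to.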
People
- Team: Martin Milata, Karel Klíč
- External help: Jan Kratochvíl
- Send information to: Jirka Moskovčák, Radek Vokál
Timeline
- Project start date: 2011-07-22
- Planned finish date: 2012-04-30
Dependencies
Btparser receives bug fixes and improvements.
Faf receives extensions and bug fixes.
ABRT receives extensions and bug fixes.
Contingency plan
None.
Documentation
Crashing thread
Jan Kratochvíl: bfd_core_file_failing_signal is used to determine
the signal for a core file, not for a thread. The signal is stored
for every thread in elf_prstatus.pr_info. The kernel fills in the
signal for every thread, even those which didn't crash.
In theory, multiple threads can crash, as other threads might crash
after the first one and before the kernel stops them.
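To make the per-thread signal concrete, the sketch below walks the ELF notes of a core file's PT_NOTE segment and reads pr_info.si_signo from each NT_PRSTATUS note (one note per thread). It assumes a little-endian Linux core with 4-byte note padding, where pr_info is the first member of elf_prstatus and starts with three 32-bit integers (si_signo, si_code, si_errno). Locating the PT_NOTE segment in the file is left out, and the helper names `iter_notes` and `thread_signals` are hypothetical.

```python
import struct

NT_PRSTATUS = 1  # note type that carries struct elf_prstatus

def _align4(n):
    """Round n up to a multiple of 4 (assumed note padding)."""
    return (n + 3) & ~3

def iter_notes(segment):
    """Yield (name, type, desc) for each note in raw PT_NOTE bytes."""
    offset = 0
    while offset + 12 <= len(segment):
        namesz, descsz, ntype = struct.unpack_from("<III", segment, offset)
        offset += 12
        name = segment[offset:offset + namesz].rstrip(b"\0")
        offset += _align4(namesz)
        desc = segment[offset:offset + descsz]
        offset += _align4(descsz)
        yield name, ntype, desc

def thread_signals(segment):
    """Return si_signo from every thread's NT_PRSTATUS note, in file order."""
    signals = []
    for name, ntype, desc in iter_notes(segment):
        if name == b"CORE" and ntype == NT_PRSTATUS and len(desc) >= 12:
            # pr_info is the first member: si_signo, si_code, si_errno
            si_signo, _si_code, _si_errno = struct.unpack_from("<iii", desc, 0)
            signals.append(si_signo)
    return signals
```

Because the kernel fills in the signal for every thread, the returned list alone does not single out the thread that crashed; it only shows what each thread's note records.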
Homepage