Automated Debugging for Data-Intensive Scalable Computing
- Technical Advantages
- BIGSIFT improves the accuracy of fault localizability by several orders of magnitude (10³ to 10⁷×) compared to Titian data provenance
- Improves performance by up to 66x compared to Delta Debugging
- Able to localize fault-inducing data within 62% of the original job running time for each faulty output
- Technical Applications
- Debugging for Data-Intensive Scalable Computing (DISC) systems
- Detailed Technical Description
- Researchers at UCLA have developed a new faulty data localization approach called BIGSIFT, which combines insights from automated fault isolation in software engineering and data provenance in database systems to find a minimum set of failure-inducing inputs. BIGSIFT redefines data provenance for the purpose of debugging using a test oracle function and implements several unique optimizations, specifically geared towards the iterative nature of automated debugging workloads.
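For illustration only, the sketch below shows the kind of user-supplied test oracle such a debugger relies on, written against a toy Apache Spark word-count job. The oracle predicate, threshold, and input path are hypothetical and are not taken from BIGSIFT's actual API.

```scala
import org.apache.spark.sql.SparkSession

object TestOracleSketch {
  // Hypothetical test oracle: flags an aggregated output record as faulty when
  // its count falls outside a plausible range. The threshold is illustrative only.
  def testOracle(record: (String, Int)): Boolean =
    record._2 < 0 || record._2 > 1000000

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("TestOracleSketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Toy word-count job standing in for a DISC workload; the input path is a placeholder.
    val counts = sc.textFile("hdfs://example/input/logs")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // The oracle partitions the output into passing and failing records; a debugger
    // like BIGSIFT would then search for the minimal subset of input records that
    // still produces the failing outputs.
    val faultyOutputs = counts.filter(testOracle)
    faultyOutputs.collect().foreach(println)

    spark.stop()
  }
}
```

In use, each record flagged by the oracle is treated as a faulty output and traced back through the job's data-flow operators toward a minimal set of failure-inducing input records.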
- *Abstract
-
UCLA researchers in the Department of Computer Science have developed BIGSIFT, a new faulty data localization approach that combines insights from automated fault isolation in software engineering and data provenance in database systems to find a minimum set of failure-inducing inputs.
- *Principal Investigators
-
Name: Muhammad Ali Gulzar
Department: Computer Science
Name: Miryung Kim
Department: Computer Science
- Other
-
State Of Development
BIGSIFT is ready to be used with DISC systems.
Background
Data-Intensive Scalable Computing (DISC) systems draw valuable insights from massive data sets to help make business decisions and scientific discoveries. As on other software development platforms, developers must deal with program errors and incorrect inputs that require debugging. When errors arise (e.g., program crashes or outlier results), developers often have to go through a lengthy and expensive process of manual trial-and-error debugging to identify a subset of the input data that reproduces the problem.
Current approaches such as Data Provenance (DP) and Delta Debugging (DD) are not suitable for debugging DISC workloads because 1) DD does not consider the semantics of data-flow operators and thus cannot prune input records known to be irrelevant; 2) DD’s search strategy is iterative (illustrated in the sketch below), which is prohibitively expensive for the large datasets processed by DISC workloads; 3) DP over-approximates the scope of failure-inducing inputs by assuming that all intermediate inputs mapping to the same key contribute to the erroneous output.
For complex DISC systems, it is therefore crucial to equip developers with toolkits that can better pinpoint the root cause of an error.
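For context, the following is a generic sketch of the classic ddmin-style delta debugging loop; it is not BIGSIFT's algorithm, and the `fails` predicate is a placeholder for re-running the full job on a candidate input subset and applying the test oracle to its output.

```scala
object DeltaDebuggingSketch {
  // Classic ddmin-style minimization over input records (generic sketch, not
  // BIGSIFT's algorithm). `fails` stands for re-executing the whole DISC job on a
  // candidate subset and checking the test oracle; every call is a full, costly run.
  def ddmin[A](input: Vector[A], fails: Vector[A] => Boolean, n: Int = 2): Vector[A] = {
    if (input.size <= 1) return input
    val chunkSize = math.max(1, input.size / n)
    val chunks = input.grouped(chunkSize).toVector

    // 1) Reduce to a failing subset, if any single chunk still fails.
    chunks.find(fails) match {
      case Some(chunk) => ddmin(chunk, fails, 2)
      case None =>
        // 2) Reduce to a failing complement, if removing one chunk still fails.
        val complements = chunks.indices.map(i => chunks.patch(i, Nil, 1).flatten)
        complements.find(fails) match {
          case Some(rest)             => ddmin(rest, fails, math.max(n - 1, 2))
          // 3) Otherwise refine the partition until chunks are single records.
          case None if n < input.size => ddmin(input, fails, math.min(input.size, 2 * n))
          case None                   => input
        }
    }
  }
}
```

Because every call to `fails` re-executes the job, the number of iterations dominates the debugging cost; reducing that cost is the overhead BIGSIFT's provenance-guided pruning and debugging-specific optimizations target.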
Related Materials
Tech ID/UC Case
29154/2018-151-0
Related Cases
2018-151-0
- Country/Region
- United States
