%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
This is a PROMISE Software Engineering Repository data set made publicly
available in order to encourage repeatable, verifiable, refutable, and/or
improvable predictive models of software engineering.
If you publish material based on PROMISE data sets then, please
follow the acknowledgment guidelines posted on the PROMISE repository
web page http://promise.site.uottawa.ca/SERepository .
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1. Title: Class-level data for KC1
This data set includes a {DEF,NODEF} attribute (DL) to indicate
defectiveness. DL is equal to DEF if the module is in the top 5% of the
defect-count ranking, and NODEF otherwise.
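As a hypothetical sketch (the exact ranking procedure, including tie handling, is not published with the data set), the labeling rule described above might look like this:

```python
def label_top5(defect_counts):
    """Assign 'DEF' to modules in the top 5% of the defect-count ranking
    and 'NODEF' to all others. `defect_counts` maps module name -> defects.
    Ties at the cutoff are broken arbitrarily by sort order (an assumption)."""
    ranked = sorted(defect_counts, key=defect_counts.get, reverse=True)
    cutoff = max(1, round(len(ranked) * 0.05))  # at least one DEF module
    top = set(ranked[:cutoff])
    return {m: ("DEF" if m in top else "NODEF") for m in defect_counts}
```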
2. Sources
(a) Creator: A. Gunes Koru
(b) Date: February 21, 2005
(c) Contact: gkoru AT umbc DOT edu Phone: +1 (410) 455 8843
3. Donor: A. Gunes Koru
4. Past Usage: This data set was used for:
A. Gunes Koru and Hongfang Liu, "An Investigation of the Effect
of Module Size on Defect Prediction Using Static Measures", PROMISE -
Predictive Models in Software Engineering Workshop, ICSE 2005,
May 15th 2005, Saint Louis, Missouri, US.
We used several machine learning algorithms to predict the defective
modules in five NASA products, namely, CM1, JM1, KC1, KC2, and PC1.
A set of static measures were used as predictor variables. While doing
so, we observed that a large portion of the modules were small, as
measured by lines of code (LOC). When we experimented on the data
subsets created by partitioning according to module size, we obtained
higher prediction performance for the subsets that include larger
modules. We also performed defect prediction using class-level data
for KC1 rather than method-level data. In this case, the use of class-level
data resulted in improved prediction performance compared to using
method-level data. These findings suggest that quality assurance activities
can be guided even better if defect predictions are made by using
data that belong to larger modules.
5. Features:
The descriptions of the features are taken from
http://mdp.ivv.nasa.gov/mdp_glossary.html
Feature Used as the Response Variable:
======================================
DL: Defect level. DEF if the class is in the Top 5% in defect ranking, NODEF
otherwise.
Features at Class Level Originally
==================================
PERCENT_PUB_DATA: The percentage of a class's data that is public or
protected. In general, lower values indicate greater encapsulation. It is
a measure of encapsulation.
ACCESS_TO_PUB_DATA: The amount of times that a class's public and protected
data is accessed. In general, lower values indicate greater encapsulation.
It is a measure of encapsulation.
COUPLING_BETWEEN_OBJECTS: The number of distinct non-inheritance-related
classes on which a class depends. If a class that is heavily dependent on
many classes outside of its hierarchy is introduced into a library, all the
classes upon which it depends need to be introduced as well. This may be
acceptable, especially if the classes which it references are already part
of a class library and are even more fundamental than the specified class.
DEPTH: The level for a class. For instance, if a parent has one child the
depth for the child is two. Depth indicates at what level a class is located
within its class hierarchy. In general, inheritance increases when depth
increases.
LACK_OF_COHESION_OF_METHODS: For each data field in a class, the percentage
of the methods in the class using that data field; the percentages are
averaged and then subtracted from 100%. The LCOM metric indicates the degree
of cohesion. If the percentage is low, the class is cohesive; if it is high,
the class may lack cohesion and could be split into separate classes that
will individually have greater cohesion.
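The LACK_OF_COHESION_OF_METHODS computation described above can be sketched as follows (a minimal illustration, assuming the field-to-method usage mapping is already known):

```python
def lcom_percent(field_usage, n_methods):
    """For each data field, compute the percentage of the class's methods
    that use it; average those percentages and subtract the average
    from 100%. `field_usage` maps field name -> set of methods using it."""
    percentages = [100.0 * len(methods) / n_methods
                   for methods in field_usage.values()]
    return 100.0 - sum(percentages) / len(percentages)
```

For example, a class with four methods where field `x` is used by all four (100%) and field `y` by two (50%) averages to 75%, giving an LCOM of 25% (relatively cohesive).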
NUM_OF_CHILDREN: The number of classes derived from a specified class.
DEP_ON_CHILD: Whether a class is dependent on a descendant.
FAN_IN: This is a count of calls by higher modules.
RESPONSE_FOR_CLASS: A count of methods implemented within a class plus the
number of methods accessible to an object class due to inheritance. In
general, lower values indicate greater polymorphism.
WEIGHTED_METHODS_PER_CLASS: A count of methods implemented within a class
(rather than all methods accessible within the class hierarchy). In general,
lower values indicate greater polymorphism.
Features Transformed to Class Level (Originally at Method Level)
================================================================
Transformation was achieved by taking the min, max, sum, and avg values
over all the methods in a class. Therefore, this data set includes four
features for each of the following features, which were originally at the
method level but were transformed to the class level. For example, LOC_BLANK
yields minLOC_BLANK, maxLOC_BLANK, avgLOC_BLANK, and sumLOC_BLANK.
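The method-to-class aggregation described above can be sketched as follows (the function and column names here are illustrative, not taken from the original data pipeline):

```python
from collections import defaultdict

def to_class_level(method_rows, metric):
    """Aggregate a method-level metric to the class level via min, max,
    avg, and sum. `method_rows` is a list of (class_name, value) pairs."""
    groups = defaultdict(list)
    for cls, value in method_rows:
        groups[cls].append(value)
    return {
        cls: {
            f"min{metric}": min(vals),
            f"max{metric}": max(vals),
            f"avg{metric}": sum(vals) / len(vals),
            f"sum{metric}": sum(vals),
        }
        for cls, vals in groups.items()
    }
```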
LOC_BLANK: Lines with only white space or no text content.
BRANCH_COUNT: This metric is the number of branches for each module.
Branches are defined as those edges that exit from a decision node.
The greater the number of branches in a program's modules, the more
testing resources are required.
LOC_CODE_AND_COMMENT: Lines that contain both code and comment.
LOC_COMMENTS: The number of comment lines in a module.
CYCLOMATIC_COMPLEXITY: A measure of the complexity of a module's
decision structure. It is the number of linearly independent paths.
DESIGN_COMPLEXITY: Design complexity is a measure of a module's decision
structure as it relates to calls to other modules. This quantifies the
testing effort related to integration.
ESSENTIAL_COMPLEXITY: Essential complexity is a measure of the degree to
which a module contains unstructured constructs.
LOC_EXECUTABLE: Source lines of code that contain only code and white space.
HALSTEAD_CONTENT: Complexity of a given algorithm independent of the
language used to express the algorithm.
HALSTEAD_DIFFICULTY: Level of difficulty in the program.
HALSTEAD_EFFORT: Estimated mental effort required to develop the program.
HALSTEAD_ERROR_EST: Estimated number of errors in the program.
HALSTEAD_LENGTH: This is a Halstead metric that includes the total number
of operator occurrences and total number of operand occurrences.
HALSTEAD_LEVEL: Level at which the program can be understood.
HALSTEAD_PROG_TIME: Estimated amount of time to implement the algorithm.
HALSTEAD_VOLUME: This is a Halstead metric that gives the minimum
number of bits required for coding the program.
NUM_OPERANDS: The total number of operands: variables and identifiers,
constants (numeric literals and strings), and function names when used
during calls.
NUM_UNIQUE_OPERANDS: The number of unique operands: variables and
identifiers, constants (numeric literals and strings), and function names
when used during calls.
NUM_UNIQUE_OPERATORS: Number of unique operators.
LOC_TOTAL: Total Lines of Code.
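The glossary above does not give formulas for the Halstead features, but they are conventionally derived from the operator and operand counts listed here. A sketch using the classic Halstead definitions (an assumption about how the NASA tool computed them):

```python
import math

def halstead(n1, n2, N1, N2):
    """Classic Halstead measures from n1 unique operators, n2 unique
    operands, N1 total operator occurrences, N2 total operand occurrences."""
    length = N1 + N2                      # HALSTEAD_LENGTH
    volume = length * math.log2(n1 + n2)  # HALSTEAD_VOLUME (bits)
    difficulty = (n1 / 2) * (N2 / n2)     # HALSTEAD_DIFFICULTY
    return {
        "length": length,
        "volume": volume,
        "difficulty": difficulty,
        "level": 1 / difficulty,               # HALSTEAD_LEVEL
        "effort": difficulty * volume,         # HALSTEAD_EFFORT
        "time_sec": difficulty * volume / 18,  # HALSTEAD_PROG_TIME
        "errors": volume / 3000,               # HALSTEAD_ERROR_EST
        "content": volume / difficulty,        # HALSTEAD_CONTENT
    }
```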