Data

nasa_numeric

active
ARFF
Publicly available Visibility: public Uploaded 06-10-2014 by Joaquin Vanschoren

0 likes downloaded by 2 people , 3 total downloads 0 issues 0 downvotes


1. Title/Topic: COCOMO NASA 2 / Software cost estimation
2. Sources:
-- 93 NASA projects from different centers
for projects from the following years:
n year
--- ----
1 1971
1 1974
2 1975
2 1976
10 1977
4 1978
19 1979
11 1980
13 1982
7 1983
7 1984
6 1985
8 1986
2 1987
Collected by
Jairus Hihn, JPL, NASA, Manager SQIP Measurement &
Benchmarking Element
Phone (818) 354-1248 (Jairus.M.Hihn@jpl.nasa.gov)
-- Donor: Tim Menzies (tim@menzies.us)
-- Date: Feb 8 2006
3. Past Usage
None with this specific data set. But for older work on similar data, see:
1. "Validation Methods for Calibrating Software Effort
Models", T. Menzies and D. Port and Z. Chen and
J. Hihn and S. Stukes, Proceedings ICSE 2005,
http://menzies.us/pdf/04coconut.pdf
-- Results
-- Given background knowledge on 60 prior projects,
a new cost model can be tuned to local data using
as little as 20 new projects.
-- A very simple calibration method (COCONUT) can
achieve PRED(30)=7% or PRED(20)=50% (after 20 projects).
These are results seen in 30 repeats of an incremental
cross-validation study.
-- Two cost models are compared; one based on just
lines of code and one using over a dozen "effort
multipliers". Just using lines of code loses 10 to 20
PRED(N) points.
3.1 Additional Usage:
2. "Feature Subset Selection Can Improve Software Cost Estimation Accuracy"
Zhihao Chen, Tim Menzies, Dan Port and Barry Boehm
Proceedings PROMISE Workshop 2005,
http://www.etechstyle.com/chen/papers/05fsscocomo.pdf
P02, P03, P04 are used in this paper.
-- Results
-- To the best of our knowledge, this is the first report
of applying feature subset selection (FSS)
to software effort data.
-- FSS can dramatically improve cost estimation.
-- T-tests applied to the results show that, in all of
our data sets, removing attributes improves performance
without increasing the variance in model behavior.
4. Relevant Information
The COCOMO software cost model measures effort in calendar months
of 152 hours (and includes development and management hours).
COCOMO assumes that effort grows more than linearly with
software size; i.e. months = a * KSLOC^b * c. Here, "a" and "b" are
domain-specific parameters; "KSLOC" is estimated directly or
computed from a function point analysis; and "c" is the product
of over a dozen "effort multipliers". I.e.
months = a * (KSLOC^b) * (EM1 * EM2 * EM3 * ...)
The effort multipliers are as follows:
increase  | acap | analysts capability
these to  | pcap | programmers capability
decrease  | aexp | application experience
effort    | modp | modern programming practices
          | tool | use of software tools
          | vexp | virtual machine experience
          | lexp | language experience
----------+------+---------------------------
          | sced | schedule constraint
----------+------+---------------------------
decrease  | stor | main memory constraint
these to  | data | data base size
decrease  | time | time constraint for cpu
effort    | turn | turnaround time
          | virt | machine volatility
          | cplx | product complexity
          | rely | required software reliability
In COCOMO I, the exponent on KSLOC was a single value ranging from
1.05 to 1.2. In COCOMO II, the exponent "b" was divided into a
constant, plus the sum of five "scale factors" which modeled
issues such as "have we built this kind of system before?". The
COCOMO II effort multipliers are similar but COCOMO II dropped one
of the effort multiplier parameters; renamed some others; and
added a few more (for "required level of reuse", "multiple-site
development", and "schedule pressure").
The effort multipliers fall into three groups: those that are
positively correlated with effort; those that are
negatively correlated with effort; and a third group
containing just schedule information. In COCOMO I, "sced" has a
U-shaped correlation with effort; i.e. giving programmers either
too much or too little time to develop a system can be
detrimental.
The numeric values of the effort multipliers are:
      very                                very     extra    productivity
      low      low      nominal  high     high     high     range
---------------------------------------------------------------------------
acap  1.46     1.19     1.00     0.86     0.71              2.06
pcap  1.42     1.17     1.00     0.86     0.70              2.03
aexp  1.29     1.13     1.00     0.91     0.82              1.57
modp  1.24     1.10     1.00     0.91     0.82              1.51
tool  1.24     1.10     1.00     0.91     0.83              1.49
vexp  1.21     1.10     1.00     0.90                       1.34
lexp  1.14     1.07     1.00     0.95                       1.20
sced  1.23     1.08     1.00     1.04     1.10
stor                    1.00     1.06     1.21     1.56     -1.56
data           0.94     1.00     1.08     1.16              -1.23
time                    1.00     1.11     1.30     1.66     -1.66
turn           0.87     1.00     1.07     1.15              -1.32
virt           0.87     1.00     1.15     1.30              -1.49
rely  0.75     0.88     1.00     1.15     1.40              -1.87
cplx  0.70     0.85     1.00     1.15     1.30     1.65     -2.36
These were learnt by Barry Boehm after a regression analysis of the
projects in the COCOMO I data set.
@Book{boehm81,
Author = "B. Boehm",
Title = "Software Engineering Economics",
Publisher = "Prentice Hall",
Year = 1981}
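The effort equation above can be sketched in a few lines of Python. This is only an
illustration, not Boehm's calibration procedure: the defaults a=3.2 and b=1.05 are
assumed values (the figures commonly quoted for an "organic" COCOMO I project, with b
at the low end of the 1.05 to 1.2 range mentioned above), the function name
cocomo_effort is hypothetical, and only three multipliers from the table are included.

```python
from math import prod

# Multiplier values copied from the COCOMO I table above; only three
# attributes and three ratings each are included to keep the sketch short.
EFFORT_MULTIPLIERS = {
    "acap": {"low": 1.19, "nominal": 1.00, "high": 0.86},
    "rely": {"low": 0.88, "nominal": 1.00, "high": 1.15},
    "cplx": {"low": 0.85, "nominal": 1.00, "high": 1.15},
}


def cocomo_effort(ksloc, ratings, a=3.2, b=1.05):
    """Return months = a * KSLOC^b * (EM1 * EM2 * ...).

    a and b are ASSUMED defaults, not values taken from this data set.
    """
    c = prod(EFFORT_MULTIPLIERS[name][rating] for name, rating in ratings.items())
    return a * ksloc ** b * c


# Two hypothetical 10-KSLOC projects: one all-nominal, one with weaker
# analysts, higher required reliability and higher complexity.
nominal = cocomo_effort(10, {"acap": "nominal", "rely": "nominal", "cplx": "nominal"})
harder = cocomo_effort(10, {"acap": "low", "rely": "high", "cplx": "high"})
print(round(nominal, 1), round(harder, 1))
```

Because the multipliers combine by multiplication, the second project costs
1.19 * 1.15 * 1.15 times the nominal effort regardless of the values chosen for a and b.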
The last column of the above table shows max(EM)/min(EM), i.e.
the overall effect of a single effort multiplier (a minus sign
marks the multipliers whose higher ratings increase effort). For
example, increasing "acap" (analysts capability) from very low to
very high will most decrease effort, while increasing "rely"
(required reliability) from very low to very high will most
increase effort.
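The ratio in the last column, max(EM)/min(EM), can be recomputed directly from the
multiplier values. The short sketch below does this for three rows of the table; the
names MULTIPLIERS and productivity_range are illustrative only, and the sign used in
the table is ignored.

```python
# Multiplier values for three attributes, copied from the table above,
# ordered from the lowest rating to the highest.
MULTIPLIERS = {
    "acap": [1.46, 1.19, 1.00, 0.86, 0.71],
    "rely": [0.75, 0.88, 1.00, 1.15, 1.40],
    "cplx": [0.70, 0.85, 1.00, 1.15, 1.30, 1.65],
}


def productivity_range(values):
    """Ratio of the largest to the smallest multiplier of one attribute."""
    return max(values) / min(values)


for name, values in MULTIPLIERS.items():
    print(name, round(productivity_range(values), 2))
# prints:
# acap 2.06
# rely 1.87
# cplx 2.36
```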
There is much more to COCOMO than the above description. The
COCOMO II text is over 500 pages long and offers
all the details needed to implement data capture and analysis of
COCOMO in an industrial context.
@Book{boehm00b,
Author = "Barry Boehm and Ellis Horowitz and Ray Madachy and
Donald Reifer and Bradford K. Clark and Bert Steece
and A. Winsor Brown and Sunita Chulani and Chris Abts",
Title = "Software Cost Estimation with Cocomo II",
Publisher = "Prentice Hall",
Year = 2000,
isbn = "0130266922"}
Included in that book is not just an effort model but other
models for schedule, risk, use of COTS, etc. However, most
(perhaps all) of the validation work on COCOMO has focused on the effort
model.
@article{chulani99,
author = "S. Chulani and B. Boehm and B. Steece",
title = "Bayesian Analysis of Empirical Software Engineering
Cost Models",
journal = "IEEE Transactions on Software Engineering",
volume = 25,
number = 4,
month = "July/August",
year = "1999"}
The value of an effort predictor can be reported many ways
including MMRE and PRED(N). MMRE and PRED are computed from the
relative error, or RE, which is the relative size of the
difference between the actual and estimated value:
RE.i = (estimate.i - actual.i) / (actual.i)
Given a data set of size "D", a "Train" set of size
"X=|Train| <= D", and a "Test" set of size "T=D-|Train|", the
mean magnitude of the relative error, or MMRE, is the average of
the absolute values of the relative errors over the "T" items in
the "Test" set, expressed as a percentage; i.e.
MRE.i = abs(RE.i)
MMRE = 100/T * (MRE.1 + MRE.2 + ... + MRE.T)
PRED(N) reports the percentage of estimates that were
within N% of the actual values:
count=0
for(i=1;i<=T;i++) do if (MRE.i <= N/100) then count++ fi done
PRED(N) = 100/T * count
For example, PRED(30)=50% means that half the estimates are
within 30% of the actual. Shepperd and Schofield comment that
"MMRE is fairly conservative with a bias against overestimates
while Pred(25) will identify those prediction systems that are
generally accurate but occasionally wildly inaccurate".
@article{shepperd97,
author="M. Shepperd and C. Schofield",
title="Estimating Software Project Effort Using Analogies",
journal="IEEE Transactions on Software Engineering",
volume=23,
number=12,
month="November",
year=1997,
note="Available from
\url{http://www.utdallas.edu/~rbanker/SE_XII.pdf}"}
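As a minimal sketch of the MMRE and PRED(N) definitions above, the following Python
follows the formulas term by term. The function names and the four actual/estimate
pairs are hypothetical and are not drawn from the NASA data.

```python
def mre(actual, estimate):
    """Magnitude of relative error: abs((estimate - actual) / actual)."""
    return abs(estimate - actual) / actual


def mmre(actuals, estimates):
    """Mean MRE over the test set, expressed as a percentage."""
    errors = [mre(a, e) for a, e in zip(actuals, estimates)]
    return 100 * sum(errors) / len(errors)


def pred(n, actuals, estimates):
    """Percentage of estimates whose MRE is at most n percent."""
    hits = sum(1 for a, e in zip(actuals, estimates) if mre(a, e) <= n / 100)
    return 100 * hits / len(actuals)


# Hypothetical test set of T=4 projects (effort in months).
actuals = [100, 200, 400, 800]
estimates = [110, 150, 600, 800]

print(mmre(actuals, estimates))      # mean of 0.10, 0.25, 0.50, 0.00, as a percentage
print(pred(30, actuals, estimates))  # 3 of 4 estimates are within 30%
print(pred(20, actuals, estimates))  # 2 of 4 estimates are within 20%
```

In this example a single poor estimate (MRE of 0.50) accounts for more than half of
the MMRE, while PRED(30) only records that it missed the 30% threshold.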
5. Number of instances: 93
6. Number of attributes: 24
- 15 standard COCOMO-I discrete attributes in the range Very_Low to
Extra_High
- 7 others describing the project;
- one lines of code measure,
- one goal field being the actual effort in person months.
7. Attribute information:
Unique id
project name
category of application
flight or ground system?
which nasa center?
year of development
development mode
cocomo attributes: described above in section 4
equivalent physical 1000 lines of source code
development effort in months (one month = 152 hours and includes development and management hours)
8. Missing attributes: none
9. Distribution of class values
# development months
== ==================
46 0 - 499
28 500 - 999
7 1000 - 1499
3 1500 - 1999
3 2000 - 2499
3 2500 - 2999
0 3000 - 3999
1 4000 - 4499
1 4500 - 4999
0 5000 - 7999
1 8000

act_effort (target) | numeric | 74 unique values 0 missing | |

recordnumber | numeric | 93 unique values 0 missing | |

projectname | nominal | 8 unique values 0 missing | |

cat2 | nominal | 14 unique values 0 missing | |

forg | nominal | 2 unique values 0 missing | |

center | nominal | 5 unique values 0 missing | |

year | numeric | 14 unique values 0 missing | |

mode | nominal | 3 unique values 0 missing | |

rely | nominal | 4 unique values 0 missing | |

data | nominal | 4 unique values 0 missing | |

cplx | nominal | 5 unique values 0 missing | |

time | nominal | 4 unique values 0 missing | |

stor | nominal | 4 unique values 0 missing | |

virt | nominal | 3 unique values 0 missing | |

turn | nominal | 4 unique values 0 missing | |

acap | nominal | 3 unique values 0 missing | |

aexp | nominal | 4 unique values 0 missing | |

pcap | nominal | 3 unique values 0 missing | |

vexp | nominal | 4 unique values 0 missing | |

lexp | nominal | 4 unique values 0 missing | |

modp | nominal | 5 unique values 0 missing | |

tool | nominal | 5 unique values 0 missing | |

sced | nominal | 3 unique values 0 missing | |

equivphyskloc | numeric | 79 unique values 0 missing |

Third quartile of skewness among attributes of the numeric type | 4.12
Third quartile of standard deviation of attributes of the numeric type | 885.35
First quartile of kurtosis among attributes of the numeric type | -0.88
Standard deviation of the number of distinct values among attributes of the nominal type | 2.54
Average number of distinct values among the attributes of the nominal type | 4.55
First quartile of skewness among attributes of the numeric type | -0.04
First quartile of standard deviation of attributes of the numeric type | 9.56
Second quartile (Median) of kurtosis among attributes of the numeric type | 10.07
Second quartile (Median) of means among attributes of the numeric type | 359.22
Second quartile (Median) of skewness among attributes of the numeric type | 1.93
The minimal number of distinct values among attributes of the nominal type | 2
Second quartile (Median) of standard deviation of attributes of the numeric type | 80.91
The maximum number of distinct values among attributes of the nominal type | 14
Third quartile of kurtosis among attributes of the numeric type | 22.46