Data

pbcseq

active
ARFF
Publicly available Visibility: public Uploaded 29-09-2014 by Joaquin Vanschoren


Primary Biliary Cirrhosis
This data set is a follow-up to the original PBC data set, as discussed
in appendix D of Fleming and Harrington, Counting Processes and Survival
Analysis, Wiley, 1991. An analysis based on the enclosed data is found in
Murtaugh PA. Dickson ER. Van Dam GM. Malinchoc M. Grambsch PM.
Langworthy AL. Gips CH. "Primary biliary cirrhosis: prediction of short-term
survival based on repeated patient visits." Hepatology. 20(1.1):126-34, 1994.
Quoting from F&H. "The following pages contain the data from the Mayo Clinic
trial in primary biliary cirrhosis (PBC) of the liver conducted between 1974
and 1984. A description of the clinical background for the trial and the
covariates recorded here is in Chapter 0, especially Section 0.2. A more
extended discussion can be found in Dickson, et al., Hepatology 10:1-7 (1989)
and in Markus, et al., N Eng J of Med 320:1709-13 (1989).
"A total of 424 PBC patients, referred to Mayo Clinic during that ten-year
interval, met eligibility criteria for the randomized placebo controlled
trial of the drug D-penicillamine. The first 312 cases in the data set
participated in the randomized trial and contain largely complete data. The
additional 112 cases did not participate in the clinical trial, but consented
to have basic measurements recorded and to be followed for survival. Six of
those cases were lost to follow-up shortly after diagnosis, so the data here
are on an additional 106 cases as well as the 312 randomized participants.
Missing data items are denoted by `.'. "
The F&H data set contains only baseline measurements of the laboratory
parameters. This data set contains multiple laboratory results, but
only on the first 312 patients. Some baseline data values in this file
differ from the original PBC file, for instance, the data errors in
prothrombin time and age which were discovered after the original analysis,
during research work on dfbeta residuals. (These two data points are
discussed in F&H, figure 4.6.7.) Another major difference is that
there was significantly more follow-up for many of the patients at the
time this data set was assembled.
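Because each patient contributes several rows (one per visit), analyses that need a single baseline record have to pick it out of the visit sequence. The following is a minimal sketch of that grouping step, assuming rows have already been reduced to hypothetical `(case_number, day, serum_bilirubin)` tuples; the values are made up for illustration and are not taken from the file.

```python
from collections import defaultdict

# Made-up rows for illustration only: (case_number, day, serum_bilirubin).
visits = [
    (1, 0, 14.5),
    (1, 192, 21.3),
    (2, 0, 1.1),
    (2, 182, 0.8),
    (2, 365, 1.0),
]

def group_by_case(rows):
    """Collect each patient's visits, sorted by day since enrollment."""
    grouped = defaultdict(list)
    for case, day, value in rows:
        grouped[case].append((day, value))
    return {case: sorted(v) for case, v in grouped.items()}

def baseline(rows):
    """Return the earliest visit (smallest day) for each patient."""
    return {case: v[0] for case, v in group_by_case(rows).items()}

by_case = group_by_case(visits)
first = baseline(visits)
```

Here `first` maps each case number to its earliest `(day, value)` pair, which is roughly the information the baseline-only F&H file would hold for that variable.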
One "feature" of the data deserves special comment. The last
observation before death or liver transplant often has many more
missing covariates than other data rows. The original clinical
protocol for these patients specified visits at 6 months, 1 year, and
annually thereafter. At these protocol visits lab values were
obtained for a large pre-specified battery of tests. "Extra" visits,
often undertaken because of worsening medical condition, did not
necessarily have all this lab work. The missing values are thus
potentially informative, and violate the usual "missing at random"
(MCAR or MAR) assumptions made in analyses. Because of
the earlier published results on the Mayo PBC risk score, however, the
5 variables involved in that computation were usually obtained, i.e.,
age, bilirubin, albumin, prothrombin time, and edema score.
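One way to check whether missingness concentrates in final visits is to compare the missing fraction in each patient's last recorded row with the fraction in earlier rows. The sketch below does this on made-up data; as a simplification it uses the last visit of every patient regardless of status, whereas the text refers specifically to the last observation before death or transplant. The tuple layout and the function name are assumptions for illustration.

```python
# Made-up visits for illustration: (case_number, day, cholesterol).
# None marks a missing lab value.
visits = [
    (1, 0, 261.0), (1, 192, None),
    (2, 0, 302.0), (2, 182, 310.0), (2, 365, None),
    (3, 0, 176.0), (3, 180, 180.0),
]

def missing_rate_by_position(rows):
    """Fraction of missing values in last visits versus earlier visits."""
    # Find the day of each patient's final recorded visit.
    last_day = {}
    for case, day, _ in rows:
        last_day[case] = max(day, last_day.get(case, day))

    counts = {"last": [0, 0], "earlier": [0, 0]}  # [missing, total]
    for case, day, value in rows:
        key = "last" if day == last_day[case] else "earlier"
        counts[key][0] += value is None
        counts[key][1] += 1
    return {k: m / n for k, (m, n) in counts.items() if n}

rates = missing_rate_by_position(visits)
```

A markedly higher rate under `"last"` than under `"earlier"` would be consistent with the informative missingness described above, though it would not by itself establish the mechanism.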
```
Variables:
case number
number of days between registration and the earlier of death,
transplantation, or study analysis time
status: 0=alive, 1=transplanted, 2=dead
drug: 1= D-penicillamine, 0=placebo
age in days, at registration
sex: 0=male, 1=female
day: number of days between enrollment and this visit date, remaining
values on the line of data refer to this visit.
presence of ascites: 0=no 1=yes
presence of hepatomegaly 0=no 1=yes
presence of spiders 0=no 1=yes
presence of edema 0=no edema and no diuretic therapy for edema;
.5 = edema present without diuretics, or edema resolved by diuretics;
1 = edema despite diuretic therapy
serum bilirubin in mg/dl
serum cholesterol in mg/dl
albumin in gm/dl
alkaline phosphatase in U/liter
SGOT in U/ml (serum glutamic-oxaloacetic transaminase, the enzyme name
has subsequently changed to "AST" in the medical literature)
platelets per cubic ml / 1000
prothrombin time in seconds
histologic stage of disease
```
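The codings listed above can be turned into readable values with a small lookup step. This is a minimal sketch, not the dataset's official loader: the function names, the dictionary keys, and the choice of 365.25 days per year for converting age are all assumptions, and the sample values are made up.

```python
# Code tables taken from the variable list above.
STATUS = {0: "alive", 1: "transplanted", 2: "dead"}
DRUG = {1: "D-penicillamine", 0: "placebo"}
SEX = {0: "male", 1: "female"}

def parse_value(token):
    """Convert a raw token to float, treating '.' as a missing value."""
    token = token.strip()
    return None if token == "." else float(token)

def decode(case, days, status, drug, age_days, sex):
    """Decode the per-patient fields; age is converted from days to years."""
    return {
        "case": int(case),
        "followup_days": int(days),
        "status": STATUS[int(status)],
        "drug": DRUG[int(drug)],
        # Assumption: 365.25 days per year.
        "age_years": round(float(age_days) / 365.25, 1),
        "sex": SEX[int(sex)],
    }

# Made-up values for illustration only.
record = decode("1", "400", "2", "1", "20000", "1")
```

`parse_value` applies to the per-visit laboratory fields, where the description says missing items are written as `.`.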
Information about the dataset\
CLASSTYPE: numeric\
CLASSINDEX: 3

Feature | Type | Unique values | Missing values |
---|---|---|---|
status (target) | numeric | 3 | 0 |
case_number | numeric | 312 | 0 |
number_of_days | numeric | 305 | 0 |
drug | nominal | 2 | 0 |
age | numeric | 308 | 0 |
sex | nominal | 2 | 0 |
day | nominal | 1024 | 0 |
presence_of_asictes | nominal | 2 | 60 |
presence_of_hepatomegaly | nominal | 2 | 61 |
presence_of_spiders | nominal | 2 | 58 |
presence_of_edema | numeric | 3 | 0 |
serum_bilirubin | numeric | 193 | 0 |
serum_cholesterol | numeric | 375 | 821 |
albumin | numeric | 254 | 0 |
alkaline_phosphatase | numeric | 1263 | 60 |
SGOT | numeric | 418 | 0 |
platelets | numeric | 414 | 73 |
prothrombin_time | numeric | 78 | 0 |
histologic_stage_of_disease | numeric | 4 | 0 |
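The unique-value and missing-value counts listed for each feature can be recomputed from a column of parsed values. The sketch below assumes missing entries have already been mapped to `None`; the function name and the sample column are hypothetical.

```python
def summarize(column):
    """Count distinct non-missing values and missing entries in a column."""
    present = [v for v in column if v is not None]
    return {"unique": len(set(present)), "missing": len(column) - len(present)}

# Made-up column for illustration only.
spiders = [0, 1, None, 1, 0, None]
summary = summarize(spiders)
```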

Description | Value |
---|---|
Second quartile (Median) of means among attributes of the numeric type. | 122.67 |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 3 | |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.DecisionStump | |
Second quartile (Median) of mutual information between the nominal attributes and the target attribute. | |
Kappa coefficient achieved by the landmarker weka.classifiers.trees.DecisionStump | |
Minimal mutual information between the nominal attributes and the target attribute. | |
Second quartile (Median) of skewness among attributes of the numeric type. | 0.87 |
Maximum mutual information between the nominal attributes and the target attribute. | |
The minimal number of distinct values among attributes of the nominal type. | 2 |
Second quartile (Median) of standard deviation of attributes of the numeric type. | 78.44 |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1 | |
Number of attributes needed to optimally describe the class (under the assumption of independence among attributes). Equals ClassEntropy divided by MeanMutualInformation. | |
The maximum number of distinct values among attributes of the nominal type. | 1024 |
Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1 | |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .00001 | |
Third quartile of kurtosis among attributes of the numeric type. | 23.65 |
Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1 | |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2 | |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W | |
Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2 | |
Third quartile of mutual information between the nominal attributes and the target attribute. | |
Error rate achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W | |
Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2 | |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .0001 | |
Third quartile of skewness among attributes of the numeric type. | 3.93 |
Kappa coefficient achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W | |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3 | |
First quartile of kurtosis among attributes of the numeric type. | -0.62 |
Third quartile of standard deviation of attributes of the numeric type. | 681.17 |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W | |
Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3 | |
Average mutual information between the nominal attributes and the target attribute. | |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 1 | |
Error rate achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W | |
Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3 | |
An estimate of the amount of irrelevant information in the attributes regarding the class. Equals (MeanAttributeEntropy - MeanMutualInformation) divided by MeanMutualInformation. | |
First quartile of mutual information between the nominal attributes and the target attribute. | |
Kappa coefficient achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W | |
Standard deviation of the number of distinct values among attributes of the nominal type. | 417.23 |
Average number of distinct values among the attributes of the nominal type. | 172.33 |
First quartile of skewness among attributes of the numeric type. | 0.04 |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W | |
First quartile of standard deviation of attributes of the numeric type. | 0.91 |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 2 | |
Error rate achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W | |
Kappa coefficient achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W | |
Second quartile (Median) of kurtosis among attributes of the numeric type. | 2.69 |
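The meta-features on distinct values among nominal attributes can be cross-checked against the per-feature counts listed earlier on this page (drug, sex, day, presence_of_asictes, presence_of_hepatomegaly, presence_of_spiders). The sketch below reproduces the reported values; the use of the sample standard deviation (n - 1 denominator) is an inference from the numbers matching, not something the page states.

```python
from statistics import mean, stdev

# Unique-value counts of the six nominal attributes, as listed on this page.
distinct_counts = {
    "drug": 2,
    "sex": 2,
    "day": 1024,
    "presence_of_asictes": 2,
    "presence_of_hepatomegaly": 2,
    "presence_of_spiders": 2,
}

values = list(distinct_counts.values())
summary = {
    "min": min(values),
    "max": max(values),
    "mean": round(mean(values), 2),
    # Sample standard deviation (n - 1 denominator); an assumption.
    "stdev": round(stdev(values), 2),
}
```

The single high-cardinality attribute `day` (typed as nominal here) accounts for almost all of the spread.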