Titanic

Status: active. Format: ARFF. Visibility: public. Uploaded 16-10-2017 by Joaquin Vanschoren.

0 likes downloaded by 9 people , 16 total downloads 0 issues 0 downvotes

Author: Frank E. Harrell Jr., Thomas Cason
Source: [Vanderbilt Biostatistics](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html)
Please cite:

The original Titanic dataset, describing the survival status of individual passengers on the Titanic. The titanic data does not contain information about the crew, but it does contain actual ages for half of the passengers. The principal source for data about Titanic passengers is the Encyclopedia Titanica. The datasets used here were begun by a variety of researchers. One of the original sources is Eaton & Haas (1994), Titanic: Triumph and Tragedy, Patrick Stephens Ltd, which includes a passenger list created by many researchers and edited by Michael A. Findlay.

Thomas Cason of UVa greatly updated and improved the titanic data frame using the Encyclopedia Titanica and created the dataset here. Some duplicate passengers have been dropped, many errors corrected, many missing ages filled in, and new variables created.

For more information about how this dataset was constructed, see:
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3info.txt

### Attribute information
The variables in our extracted dataset are pclass, survived, name, age, embarked, home.dest, room, ticket, boat, and sex. pclass is the passenger class (1st, 2nd, 3rd) and serves as a proxy for socio-economic class. Age is in years, and some infants have fractional values.

The titanic2 data frame has no missing data and includes records for the crew, but age is dichotomized as adult vs. child. These data were obtained from Robert Dawson, Saint Mary's University, E-mail. The titanic2 variables are pclass, age, sex, and survived.

These data frames are useful for demonstrating many of the functions in Hmisc, as well as binary logistic regression analysis using the Design library. For more details and references, see Simonoff, Jeffrey S (1997): The "unusual episode" and a second statistics course. J Statistics Education, Vol. 5 No. 1.

| Attribute | Type | Unique values | Missing values |
|---|---|---|---|
| survived (target) | nominal | 2 | 0 |
| pclass | numeric | 3 | 0 |
| name | string | 1307 | 0 |
| sex | nominal | 2 | 0 |
| age | numeric | 98 | 263 |
| sibsp | numeric | 7 | 0 |
| parch | numeric | 8 | 0 |
| ticket | string | 929 | 0 |
| fare | numeric | 281 | 1 |
| cabin | string | 186 | 1014 |
| embarked | nominal | 3 | 2 |
| boat | string | 27 | 823 |
| body | numeric | 121 | 1188 |
| home.dest | string | 369 | 564 |
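
The per-attribute counts above (unique values and missing values) can be reproduced with a few lines of Python. The following is a minimal sketch, not the code OpenML itself runs: the function name `summarize_columns` is hypothetical, rows are assumed to be plain dictionaries, and both `None` and the ARFF missing-value marker `?` are assumed to count as missing. The three-column toy rows are made up for illustration and are not records from the dataset.

```python
def summarize_columns(rows, missing_markers=(None, "?")):
    """Return {column: (n_unique, n_missing)} for a list of dict rows.

    Assumption: a value counts as missing if it equals one of
    `missing_markers`; missing values are not counted as unique values.
    """
    unique = {}
    missing = {}
    for row in rows:
        for column, value in row.items():
            unique.setdefault(column, set())
            missing.setdefault(column, 0)
            if value in missing_markers:
                missing[column] += 1
            else:
                unique[column].add(value)
    return {column: (len(unique[column]), missing[column]) for column in unique}


# Made-up toy rows (NOT taken from the Titanic data).
toy_rows = [
    {"sex": "female", "age": 29.0, "embarked": "S"},
    {"sex": "male", "age": None, "embarked": "S"},
    {"sex": "male", "age": 0.75, "embarked": "?"},
    {"sex": "female", "age": 29.0, "embarked": "C"},
]

toy_summary = summarize_columns(toy_rows)
for column, (n_unique, n_missing) in toy_summary.items():
    print(f"{column}: {n_unique} unique values, {n_missing} missing")
```

Applied to the full dataset, the same counting logic would be expected to yield figures of the kind listed above, provided missing values are encoded the way this sketch assumes.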

| Value | Meta-feature |
|---|---|
| 2.04 | Second quartile (Median) of skewness among attributes of the numeric type. |
| 7.73 | Second quartile (Median) of standard deviation of attributes of the numeric type. |
| 0.02 | Minimal mutual information between the nominal attributes and the target attribute. |
| 22.91 | Third quartile of kurtosis among attributes of the numeric type. |
| 0.21 | Maximum mutual information between the nominal attributes and the target attribute. |
| 2 | The minimal number of distinct values among attributes of the nominal type. |
| 3 | The maximum number of distinct values among attributes of the nominal type. |
| 0.21 | Third quartile of mutual information between the nominal attributes and the target attribute. |
| 3.98 | Third quartile of skewness among attributes of the numeric type. |
| -1.27 | First quartile of kurtosis among attributes of the numeric type. |
| 63.24 | Third quartile of standard deviation of attributes of the numeric type. |
| 0.58 | Standard deviation of the number of distinct values among attributes of the nominal type. |
| 0.02 | First quartile of mutual information between the nominal attributes and the target attribute. |
| -0.08 | First quartile of skewness among attributes of the numeric type. |
| 0.12 | Average mutual information between the nominal attributes and the target attribute. |
| 0.86 | First quartile of standard deviation of attributes of the numeric type. |
| 8.09 | An estimate of the amount of irrelevant information in the attributes regarding the class. Equals (MeanAttributeEntropy - MeanMutualInformation) divided by MeanMutualInformation. |
| 2.33 | Average number of distinct values among the attributes of the nominal type. |
| 10.1 | Second quartile (Median) of kurtosis among attributes of the numeric type. |
| 8.34 | Number of attributes needed to optimally describe the class (under the assumption of independence among attributes). Equals ClassEntropy divided by MeanMutualInformation. |
| 16.09 | Second quartile (Median) of means among attributes of the numeric type. |
| 0.12 | Second quartile (Median) of mutual information between the nominal attributes and the target attribute. |
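
Two of the meta-features listed above are defined as ratios: the irrelevant-information estimate, (MeanAttributeEntropy - MeanMutualInformation) / MeanMutualInformation, and the number of attributes needed to describe the class, ClassEntropy / MeanMutualInformation. The sketch below shows one way these could be computed for nominal attributes. It is an illustration under stated assumptions, not the implementation behind this page: the function and key names are hypothetical, entropy is taken in bits (both ratios are independent of the logarithm base as long as one base is used throughout), and the toy columns are made up rather than drawn from the dataset.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two aligned discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def mi_meta_features(nominal_columns, target):
    """Compute the two ratio meta-features from the formulas quoted above.

    `nominal_columns` maps a (hypothetical) column name to its values.
    Raises ValueError if the mean mutual information is zero, because
    both ratios are undefined in that case.
    """
    columns = list(nominal_columns.values())
    mean_entropy = sum(entropy(col) for col in columns) / len(columns)
    mean_mi = sum(mutual_information(col, target) for col in columns) / len(columns)
    if mean_mi == 0:
        raise ValueError("mean mutual information is zero; ratios are undefined")
    return {
        "noise_to_signal_ratio": (mean_entropy - mean_mi) / mean_mi,
        "equivalent_number_of_attributes": entropy(target) / mean_mi,
    }


# Made-up toy data: column "a" determines the target, column "b" is unrelated.
toy_target = [0, 0, 1, 1]
toy_columns = {
    "a": ["x", "x", "y", "y"],
    "b": ["p", "q", "p", "q"],
}

toy_features = mi_meta_features(toy_columns, toy_target)
print(toy_features)
```

In the toy data, one attribute carries one full bit about the target and the other carries none, so the mean mutual information is 0.5 bits; both attribute entropies and the class entropy are 1 bit, which gives a ratio of 1.0 for the first meta-feature and 2.0 for the second.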