active ARFF Publicly available Visibility: public Uploaded 05-12-2019 by Florian Pargent

Trip record data provided by the New York City Taxi and Limousine Commission (TLC). The dataset contains TLC trips of the green line in December 2016. The data was downloaded on 03.11.2018. For a description of all variables in the dataset, check out the TLC homepage. The variable 'tip_amount' was chosen as the target variable. The variable 'total_amount' is ignored by default, because otherwise the target could be predicted deterministically. The date variables 'lpep_pickup_datetime' and 'lpep_dropoff_datetime' (ignored by default) could be used to compute additional time features. This version contains only trips with 'payment_type' == 1 (credit card), because tips are not included for most other payment types. The variables 'trip_distance' and 'fare_amount' were also removed to increase the importance of the categorical features 'PULocationID' and 'DOLocationID'.
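As a minimal sketch of the additional time features mentioned above, the two datetime strings of a trip could be parsed and turned into a pickup hour, a weekday, and a trip duration. The timestamp layout `%Y-%m-%d %H:%M:%S`, the function name `time_features`, and the feature names are assumptions made for illustration; they are not part of the dataset.

```python
from datetime import datetime

# Assumed timestamp layout; adjust it if the strings in the ARFF file differ.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def time_features(pickup: str, dropoff: str) -> dict:
    """Derive simple time features from one trip's pickup and dropoff strings."""
    start = datetime.strptime(pickup, TIMESTAMP_FORMAT)
    end = datetime.strptime(dropoff, TIMESTAMP_FORMAT)
    return {
        "pickup_hour": start.hour,            # 0 to 23
        "pickup_weekday": start.weekday(),    # Monday = 0, Sunday = 6
        "is_weekend": start.weekday() >= 5,   # Saturday or Sunday
        "duration_minutes": (end - start).total_seconds() / 60.0,
    }


# Made-up trip for illustration, not a row taken from the dataset.
example = time_features("2016-12-03 23:50:00", "2016-12-04 00:05:30")
print(example)
```

Features of this kind would be computed from 'lpep_pickup_datetime' and 'lpep_dropoff_datetime' before those two columns are dropped or ignored.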

15 features

tip_amount (target): numeric, 1811 unique values, 0 missing
VendorID: nominal, 2 unique values, 0 missing
lpep_pickup_datetime: string, 505885 unique values, 0 missing
lpep_dropoff_datetime: string, 505577 unique values, 0 missing
store_and_fwd_flag: nominal, 2 unique values, 0 missing
RatecodeID: nominal, 5 unique values, 0 missing
PULocationID: nominal, 233 unique values, 0 missing
DOLocationID: nominal, 259 unique values, 0 missing
passenger_count: numeric, 10 unique values, 0 missing
extra: nominal, 5 unique values, 0 missing
mta_tax: nominal, 3 unique values, 0 missing
tolls_amount: numeric, 105 unique values, 0 missing
improvement_surcharge: nominal, 3 unique values, 0 missing
total_amount: numeric, 5377 unique values, 0 missing
trip_type: nominal, 2 unique values, 0 missing

19 properties

Number of instances (rows) of the dataset.
Number of attributes (columns) of the dataset.
Number of distinct values of the target attribute (if it is nominal).
Number of missing values in the dataset.
Number of instances with at least one value missing.
Number of numeric attributes.
Number of nominal attributes.
Percentage of nominal attributes.
Percentage of instances belonging to the most frequent class.
Number of instances belonging to the most frequent class.
Percentage of instances belonging to the least frequent class.
Number of instances belonging to the least frequent class.
Number of binary attributes.
Percentage of binary attributes.
Percentage of instances having missing values.
Percentage of missing values.
Average class difference between consecutive instances.
Percentage of numeric attributes.
Number of attributes divided by the number of instances.
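
A few of the properties described above can be illustrated with a small sketch. The function name `dataset_properties`, the dictionary keys, and the toy rows are hypothetical, and the percentage of missing values is assumed here to be taken over all cells (rows times columns); the platform's exact definitions may differ.

```python
def dataset_properties(rows, numeric_columns):
    """Compute simple dataset properties.

    rows: list of dicts mapping column name to value, with None marking a missing value.
    numeric_columns: set of column names treated as numeric.
    """
    n_instances = len(rows)
    columns = list(rows[0].keys())
    n_attributes = len(columns)
    n_missing = sum(value is None for row in rows for value in row.values())
    n_rows_missing = sum(any(v is None for v in row.values()) for row in rows)
    n_numeric = sum(column in numeric_columns for column in columns)
    return {
        "instances": n_instances,
        "attributes": n_attributes,
        "missing_values": n_missing,
        "instances_with_missing": n_rows_missing,
        "numeric_attributes": n_numeric,
        "pct_numeric_attributes": 100.0 * n_numeric / n_attributes,
        "pct_instances_with_missing": 100.0 * n_rows_missing / n_instances,
        # Assumption: percentage is relative to the total number of cells.
        "pct_missing_values": 100.0 * n_missing / (n_instances * n_attributes),
        # Number of attributes divided by the number of instances.
        "dimensionality": n_attributes / n_instances,
    }


# Toy rows for illustration only; the real dataset has no missing values.
toy_rows = [
    {"tip_amount": 2.0, "passenger_count": 1, "VendorID": "2"},
    {"tip_amount": 0.0, "passenger_count": 2, "VendorID": "1"},
    {"tip_amount": 1.5, "passenger_count": None, "VendorID": "2"},
    {"tip_amount": 3.0, "passenger_count": 1, "VendorID": "2"},
]
props = dataset_properties(toy_rows, {"tip_amount", "passenger_count"})
print(props)
```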

1 task

0 runs - estimation_procedure: 50 times Clustering