KDDCup99_full


Status: active. Format: ARFF. Visibility: public. Uploaded 07-10-2014 by Joaquin Vanschoren.


Author:
Source: Unknown - Date unknown
Please cite:

Datasets from ACM KDD Cup (http://www.sigkdd.org/kddcup/index.php)

Data set for KDD Cup 1999
Modified by TunedIT (converted to ARFF format)
http://www.sigkdd.org/kddcup/index.php?section=1999&method=info

This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, The Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment. The training and test datasets are also available in the UC Irvine KDD archive.

KDD Cup 1999: Tasks

This document is adapted from the paper "Cost-based Modeling and Evaluation for Data Mining With Application to Fraud and Intrusion Detection: Results from the JAM Project" by Salvatore J. Stolfo, Wei Fan, Wenke Lee, Andreas Prodromidis, and Philip K. Chan.

Intrusion Detector Learning

Software to detect network intrusions protects a computer network from unauthorized users, including perhaps insiders. The intrusion detector learning task is to build a predictive model (i.e., a classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections.

The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion detection. A standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, was provided. The 1999 KDD intrusion detection contest uses a version of this dataset.

Lincoln Labs set up an environment to acquire nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks.

The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records.

A connection is a sequence of TCP packets starting and ending at some well-defined times, between which data flows to and from a source IP address to a target IP address under some well-defined protocol. Each connection is labeled as either normal, or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes.

Attacks fall into four main categories:

* DOS: denial-of-service, e.g., SYN flood;
* R2L: unauthorized access from a remote machine, e.g., guessing password;
* U2R: unauthorized access to local superuser (root) privileges, e.g., various "buffer overflow" attacks;
* probing: surveillance and other probing, e.g., port scanning.

It is important to note that the test data is not from the same probability distribution as the training data, and it includes specific attack types not in the training data. This makes the task more realistic. Some intrusion experts believe that most novel attacks are variants of known attacks and the "signature" of known attacks can be sufficient to catch novel variants. The datasets contain a total of 24 training attack types, with an additional 14 types in the test data only.

Derived Features

Stolfo et al. defined higher-level features that help in distinguishing normal connections from attacks. There are several categories of derived features.

The "same host" features examine only the connections in the past two seconds that have the same destination host as the current connection, and calculate statistics related to protocol behavior, service, etc.

The similar "same service" features examine only the connections in the past two seconds that have the same service as the current connection.

"Same host" and "same service" features are together called time-based traffic features of the connection records.

Some probing attacks scan the hosts (or ports) using a much larger time interval than two seconds, for example once per minute. Therefore, connection records were also sorted by destination host, and features were constructed using a window of 100 connections to the same host instead of a time window. This yields a set of so-called host-based traffic features.

Unlike most of the DOS and probing attacks, there appear to be no sequential patterns that are frequent in records of R2L and U2R attacks. This is because the DOS and probing attacks involve many connections to some host(s) in a very short period of time, but the R2L and U2R attacks are embedded in the data portions of packets, and normally involve only a single connection.

Useful algorithms for mining the unstructured data portions of packets automatically are an open research question. Stolfo et al. used domain knowledge to add features that look for suspicious behavior in the data portions, such as the number of failed login attempts. These features are called "content" features.

A complete listing of the set of features defined for the connection records is given in the three tables below. The data schema of the contest dataset is available in machine-readable form.

| feature name | description | type |
|---|---|---|
| duration | length (number of seconds) of the connection | continuous |
| protocol_type | type of the protocol, e.g., tcp, udp, etc. | discrete |
| service | network service on the destination, e.g., http, telnet, etc. | discrete |
| src_bytes | number of data bytes from source to destination | continuous |
| dst_bytes | number of data bytes from destination to source | continuous |
| flag | normal or error status of the connection | discrete |
| land | 1 if connection is from/to the same host/port; 0 otherwise | discrete |
| wrong_fragment | number of "wrong" fragments | continuous |
| urgent | number of urgent packets | continuous |

Table 1: Basic features of individual TCP connections.

| feature name | description | type |
|---|---|---|
| hot | number of "hot" indicators | continuous |
| num_failed_logins | number of failed login attempts | continuous |
| logged_in | 1 if successfully logged in; 0 otherwise | discrete |
| num_compromised | number of "compromised" conditions | continuous |
| root_shell | 1 if root shell is obtained; 0 otherwise | discrete |
| su_attempted | 1 if "su root" command attempted; 0 otherwise | discrete |
| num_root | number of "root" accesses | continuous |
| num_file_creations | number of file creation operations | continuous |
| num_shells | number of shell prompts | continuous |
| num_access_files | number of operations on access control files | continuous |
| num_outbound_cmds | number of outbound commands in an ftp session | continuous |
| is_hot_login | 1 if the login belongs to the "hot" list; 0 otherwise | discrete |
| is_guest_login | 1 if the login is a "guest" login; 0 otherwise | discrete |

Table 2: Content features within a connection suggested by domain knowledge.

| feature name | description | type |
|---|---|---|
| count | number of connections to the same host as the current connection in the past two seconds | continuous |
| | Note: The following features refer to these same-host connections. | |
| serror_rate | % of connections that have "SYN" errors | continuous |
| rerror_rate | % of connections that have "REJ" errors | continuous |
| same_srv_rate | % of connections to the same service | continuous |
| diff_srv_rate | % of connections to different services | continuous |
| srv_count | number of connections to the same service as the current connection in the past two seconds | continuous |
| | Note: The following features refer to these same-service connections. | |
| srv_serror_rate | % of connections that have "SYN" errors | continuous |
| srv_rerror_rate | % of connections that have "REJ" errors | continuous |
| srv_diff_host_rate | % of connections to different hosts | continuous |

Table 3: Traffic features computed using a two-second time window.

http://www.sigkdd.org/kddcup
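
As a rough illustration of the "same host" time-based traffic features described above, the following Python sketch computes `count`, `serror_rate`, `same_srv_rate`, and `diff_srv_rate` over a two-second window. It is not the procedure used by Stolfo et al.; the `Connection` record, the `same_host_features` function, the choice to include the current connection in its own window, the set of flags treated as SYN errors, and the use of fractions in [0, 1] rather than percentages are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass

# Assumption: these flag values are treated as "SYN" errors.
SYN_ERROR_FLAGS = {"S0", "S1", "S2", "S3"}


@dataclass
class Connection:
    """Hypothetical minimal connection record for this sketch."""
    time: float      # connection time in seconds
    dst_host: str    # destination host identifier
    service: str     # e.g. "http", "telnet"
    flag: str        # connection status flag, e.g. "SF", "S0", "REJ"


def same_host_features(connections, window=2.0):
    """Compute same-host traffic features for connections sorted by time.

    Each row covers connections to the same destination host seen within
    the last `window` seconds, including the current connection itself.
    """
    recent = deque()
    rows = []
    for conn in connections:
        # Drop connections that fell out of the time window.
        while recent and conn.time - recent[0].time > window:
            recent.popleft()
        recent.append(conn)

        same_host = [c for c in recent if c.dst_host == conn.dst_host]
        count = len(same_host)  # at least 1, so the divisions are safe
        serror = sum(c.flag in SYN_ERROR_FLAGS for c in same_host)
        same_srv = sum(c.service == conn.service for c in same_host)
        rows.append({
            "count": count,
            "serror_rate": serror / count,
            "same_srv_rate": same_srv / count,
            "diff_srv_rate": 1 - same_srv / count,
        })
    return rows


example = [
    Connection(0.0, "a", "http", "SF"),
    Connection(0.5, "a", "http", "S0"),
    Connection(1.0, "b", "ftp", "SF"),
    Connection(3.0, "a", "telnet", "S0"),
]
features = same_host_features(example)
```

The "same service" features would follow the same pattern with the filter on `service` instead of `dst_host`, and the host-based features would replace the time window with the last 100 connections to the same host.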

42 features

| feature | type | unique values | missing |
|---|---|---|---|
| label (target) | nominal | 23 | 0 |
| duration | numeric | 9883 | 0 |
| protocol_type | nominal | 3 | 0 |
| service | nominal | 70 | 0 |
| flag | nominal | 11 | 0 |
| src_bytes | numeric | 7195 | 0 |
| dst_bytes | numeric | 21493 | 0 |
| land | nominal | 2 | 0 |
| wrong_fragment | numeric | 3 | 0 |
| urgent | numeric | 6 | 0 |
| hot | numeric | 30 | 0 |
| num_failed_logins | numeric | 6 | 0 |
| logged_in | nominal | 2 | 0 |
| lnum_compromised | numeric | 98 | 0 |
| lroot_shell | numeric | 2 | 0 |
| lsu_attempted | numeric | 3 | 0 |
| lnum_root | numeric | 93 | 0 |
| lnum_file_creations | numeric | 42 | 0 |
| lnum_shells | numeric | 3 | 0 |
| lnum_access_files | numeric | 10 | 0 |
| lnum_outbound_cmds | numeric | 1 | 0 |
| is_host_login | nominal | 2 | 0 |
| is_guest_login | nominal | 2 | 0 |
| count | numeric | 512 | 0 |
| srv_count | numeric | 512 | 0 |
| serror_rate | numeric | 96 | 0 |
| srv_serror_rate | numeric | 87 | 0 |
| rerror_rate | numeric | 89 | 0 |
| srv_rerror_rate | numeric | 76 | 0 |
| same_srv_rate | numeric | 101 | 0 |
| diff_srv_rate | numeric | 95 | 0 |
| srv_diff_host_rate | numeric | 72 | 0 |
| dst_host_count | numeric | 256 | 0 |
| dst_host_srv_count | numeric | 256 | 0 |
| dst_host_same_srv_rate | numeric | 101 | 0 |
| dst_host_diff_srv_rate | numeric | 101 | 0 |
| dst_host_same_src_port_rate | numeric | 101 | 0 |
| dst_host_srv_diff_host_rate | numeric | 76 | 0 |
| dst_host_serror_rate | numeric | 101 | 0 |
| dst_host_srv_serror_rate | numeric | 100 | 0 |
| dst_host_rerror_rate | numeric | 101 | 0 |
| dst_host_srv_rerror_rate | numeric | 101 | 0 |
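
The `label` attribute has 23 distinct values, while the description groups attacks into four main categories. A minimal Python sketch of that grouping follows. The mapping is the one commonly used for the KDD Cup 1999 training labels and is not part of this page, so it should be checked against the label values actually present in the ARFF file; the names `ATTACK_CATEGORY` and `category`, and the stripping of a trailing period from labels, are assumptions made for the example.

```python
# Assumed mapping from individual training attack labels to the four
# categories named in the description (DOS, R2L, U2R, probing).
ATTACK_CATEGORY = {
    # denial-of-service
    "back": "DOS", "land": "DOS", "neptune": "DOS",
    "pod": "DOS", "smurf": "DOS", "teardrop": "DOS",
    # unauthorized access from a remote machine
    "ftp_write": "R2L", "guess_passwd": "R2L", "imap": "R2L",
    "multihop": "R2L", "phf": "R2L", "spy": "R2L",
    "warezclient": "R2L", "warezmaster": "R2L",
    # unauthorized access to local superuser privileges
    "buffer_overflow": "U2R", "loadmodule": "U2R",
    "perl": "U2R", "rootkit": "U2R",
    # surveillance and other probing
    "ipsweep": "probing", "nmap": "probing",
    "portsweep": "probing", "satan": "probing",
}


def category(label):
    """Map a raw label to 'normal', one of the four categories, or 'unknown'."""
    # Some distributions of the data write labels with a trailing period.
    key = label.strip().rstrip(".").lower()
    if key == "normal":
        return "normal"
    return ATTACK_CATEGORY.get(key, "unknown")
```

Labels outside the mapping fall back to `"unknown"` rather than raising, which seems the safer default given that the test data contains attack types absent from the training data.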

62 properties

| property | value |
|---|---|
| Number of instances (rows) of the dataset. | 4898431 |
| Number of attributes (columns) of the dataset. | 42 |
| Number of distinct values of the target attribute (if it is nominal). | 23 |
| Number of missing values in the dataset. | 0 |
| Number of instances with at least one value missing. | 0 |
| Number of numeric attributes. | 34 |
| Number of nominal attributes. | 8 |
| First quartile of skewness among attributes of the numeric type. | 1.69 |
| Mean of means among attributes of the numeric type. | 118.6 |
| First quartile of standard deviation of attributes of the numeric type. | 0.04 |
| Average class difference between consecutive instances. | 1 |
| Average mutual information between the nominal attributes and the target attribute. | 0.5 |
| Second quartile (Median) of entropy among attributes. | 0.59 |
| Entropy of the target attribute values. | 1.49 |
| An estimate of the amount of irrelevant information in the attributes regarding the class. Equals (MeanAttributeEntropy - MeanMutualInformation) divided by MeanMutualInformation. | 0.31 |
| Second quartile (Median) of kurtosis among attributes of the numeric type. | 51.26 |
| Number of attributes divided by the number of instances. | 0 |
| Average number of distinct values among the attributes of the nominal type. | 14.38 |
| Second quartile (Median) of means among attributes of the numeric type. | 0.06 |
| Number of attributes needed to optimally describe the class (under the assumption of independence among attributes). Equals ClassEntropy divided by MeanMutualInformation. | 2.95 |
| Mean skewness among attributes of the numeric type. | 267.9 |
| Second quartile (Median) of mutual information between the nominal attributes and the target attribute. | 0.42 |
| Percentage of instances belonging to the most frequent class. | 57.32 |
| Mean standard deviation of attributes of the numeric type. | 46700.21 |
| Second quartile (Median) of skewness among attributes of the numeric type. | 6.92 |
| Number of instances belonging to the most frequent class. | 2807886 |
| Minimal entropy among attributes. | 0 |
| Percentage of binary attributes. | 9.52 |
| Second quartile (Median) of standard deviation of attributes of the numeric type. | 0.31 |
| Maximum entropy among attributes. | 1.87 |
| Minimum kurtosis among attributes of the numeric type. | -1.9 |
| Percentage of instances having missing values. | 0 |
| Third quartile of entropy among attributes. | 1.17 |
| Maximum kurtosis among attributes of the numeric type. | 3533299.05 |
| Minimum of means among attributes of the numeric type. | 0 |
| Percentage of missing values. | 0 |
| Third quartile of kurtosis among attributes of the numeric type. | 31538.88 |
| Maximum of means among attributes of the numeric type. | 1834.62 |
| Minimal mutual information between the nominal attributes and the target attribute. | 0 |
| Percentage of numeric attributes. | 80.95 |
| Third quartile of means among attributes of the numeric type. | 0.76 |
| Maximum mutual information between the nominal attributes and the target attribute. | 1.33 |
| The minimal number of distinct values among attributes of the nominal type. | 2 |
| Percentage of nominal attributes. | 19.05 |
| Third quartile of mutual information between the nominal attributes and the target attribute. | 1.01 |
| The maximum number of distinct values among attributes of the nominal type. | 70 |
| Minimum skewness among attributes of the numeric type. | -2.77 |
| First quartile of entropy among attributes. | 0 |
| Third quartile of skewness among attributes of the numeric type. | 161.54 |
| Maximum skewness among attributes of the numeric type. | 1807.57 |
| Minimum standard deviation of attributes of the numeric type. | 0 |
| First quartile of kurtosis among attributes of the numeric type. | 0.84 |
| Third quartile of standard deviation of attributes of the numeric type. | 3.88 |
| Maximum standard deviation of attributes of the numeric type. | 941431.07 |
| Percentage of instances belonging to the least frequent class. | 0 |
| Number of instances belonging to the least frequent class. | 2 |
| First quartile of means among attributes of the numeric type. | 0 |
| Standard deviation of the number of distinct values among attributes of the nominal type. | 23.67 |
| Average entropy of the attributes. | 0.66 |
| Number of binary attributes. | 4 |
| First quartile of mutual information between the nominal attributes and the target attribute. | 0 |
| Mean kurtosis among attributes of the numeric type. | 420300.45 |
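
Several of the properties above are defined by simple formulas, so they can be recomputed as a sanity check. The Python sketch below shows one way to do that. The function names are invented for the example, and entropy is assumed to be measured in bits (base-2 logarithm), which this page does not state. Recomputing the two ratios from the rounded values listed here (1.49, 0.5, 0.66) gives roughly 2.98 and 0.32 rather than the listed 2.95 and 0.31, presumably because the page shows rounded inputs.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Entropy of the label distribution, in bits (assumed base 2)."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def majority_class_percentage(labels):
    """Percentage of instances belonging to the most frequent class."""
    counts = Counter(labels)
    return 100.0 * max(counts.values()) / len(labels)


def equivalent_number_of_attributes(class_ent, mean_mutual_info):
    """ClassEntropy divided by MeanMutualInformation."""
    return class_ent / mean_mutual_info


def noise_to_signal_ratio(mean_attr_entropy, mean_mutual_info):
    """(MeanAttributeEntropy - MeanMutualInformation) / MeanMutualInformation."""
    return (mean_attr_entropy - mean_mutual_info) / mean_mutual_info


# Majority class share recomputed from the two counts listed above.
majority_share = round(100.0 * 2807886 / 4898431, 2)

# Ratios recomputed from the rounded values listed above.
approx_equivalent_attrs = equivalent_number_of_attributes(1.49, 0.5)
approx_noise_ratio = noise_to_signal_ratio(0.66, 0.5)
```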

6 tasks

1 runs - estimation_procedure: 10-fold Crossvalidation - evaluation_measure: predictive_accuracy - target_feature: label
0 runs - estimation_procedure: 10 times 10-fold Crossvalidation - evaluation_measure: predictive_accuracy - target_feature: label
3 runs - estimation_procedure: Interleaved Test then Train - target_feature: label
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
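
One of the tasks above uses the "Interleaved Test then Train" estimation procedure, in which each instance is first used to test the current model and only then used to update it. The Python sketch below illustrates the idea with a deliberately trivial majority-class learner. The class `MajorityClassLearner`, the function `prequential_accuracy`, and the `predict`/`learn` interface are assumptions made for the example and do not reflect how the runs listed here were produced.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner: always predicts the most frequent label so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, features):
        # Before any training instance has been seen there is nothing to predict.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, features, label):
        self.counts[label] += 1


def prequential_accuracy(stream, learner):
    """Interleaved test-then-train: test on each instance, then train on it."""
    correct = 0
    total = 0
    for features, label in stream:
        if learner.predict(features) == label:
            correct += 1
        learner.learn(features, label)
        total += 1
    return correct / total if total else 0.0


# Tiny hand-made stream; the feature dictionaries are left empty because
# the toy learner ignores them.
stream = [({}, "normal"), ({}, "normal"), ({}, "smurf"), ({}, "normal")]
accuracy = prequential_accuracy(stream, MajorityClassLearner())
```

Because every instance is tested before it is learned from, no separate hold-out set is needed, which suits a dataset of this size processed as a stream.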