
BitcoinHeist_Ransomware

active ARFF Publicly available Visibility: public Uploaded 24-06-2020 by Joaquin Vanschoren
BitcoinHeist Ransomware Dataset

Akcora, C.G., Li, Y., Gel, Y.R. and Kantarcioglu, M., 2019. BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain. IJCAI-PRICAI 2020.

We have downloaded and parsed the entire Bitcoin transaction graph from January 2009 to December 2018. Using a time interval of 24 hours, we extracted daily transactions on the network and formed the Bitcoin graph. We filtered out the network edges that transfer less than 0.3 bitcoin, since ransom amounts are rarely below this threshold.

Ransomware addresses are taken from three widely adopted studies: Montreal, Princeton and Padua. Please see the BitcoinHeist article for references.

On the heterogeneous Bitcoin network, in each 24-hour snapshot we extract the following six features for an address: income, neighbors, weight, length, count, loop.

In 24 ransomware families, at least one address appears in more than one 24-hour time window. CryptoLocker has 13 addresses that appear more than 100 times each. The CryptoLocker address 1LXrSb67EaH1LGc6d6kWHq8rgv4ZBQAcpU appears most often, 420 times. Four addresses have conflicting ransomware labels between the Montreal and Padua datasets. The APT (Montreal) and Jigsaw (Padua) ransomware families have two and one P2SH addresses (which start with 3), respectively. All other addresses are ordinary addresses that start with 1.

Features:
address: String. Bitcoin address.
year: Integer. Year.
day: Integer. Day of the year. 1 is the first day, 365 is the last day.
length: Integer.
weight: Float.
count: Integer.
looped: Integer.
neighbors: Integer.
income: Integer. Satoshi amount (1 bitcoin is 100 million satoshis).
label: Category String. Name of the ransomware family (e.g., Cryptxxx, cryptolocker) or white (i.e., not known to be ransomware).

Our graph features are designed to quantify specific transaction patterns.
Loop is intended to count how many transactions i) split their coins; ii) move these coins in the network by using different paths; and finally iii) merge them in a single address. Coins at this final address can then be sold and converted to fiat currency.

Weight quantifies the merge behavior (i.e., the transaction has more input addresses than output addresses), where coins in multiple addresses are each passed through a succession of merging transactions and accumulated in a final address. Similar to weight, the count feature is designed to quantify the merging pattern. However, the count feature represents information on the number of transactions, whereas the weight feature represents information on the amount of the transactions (what percent of these transactions' output?).

Length is designed to quantify mixing rounds on Bitcoin, where transactions receive and distribute similar amounts of coins in multiple rounds with newly created addresses to hide the coin origin.

White Bitcoin addresses are capped at 1K per day (Bitcoin has 800K addresses daily). Note that although we are certain about ransomware labels, we do not know whether all white addresses are in fact unrelated to ransomware. Compared to non-ransomware addresses, ransomware addresses exhibit more pronounced right skewness in the distributions of feature values.
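The units used in the feature list (income in satoshis, day as a day-of-year index, the 0.3 bitcoin edge filter, and the white label) can be made concrete with a few helpers. This is a minimal sketch, not the authors' code: all function names are hypothetical, and checking the threshold on a single amount only illustrates the unit conversion, since the description applies the filter to network edges.

```python
from datetime import date, timedelta

SATOSHIS_PER_BITCOIN = 100_000_000   # 1 bitcoin is 100 million satoshis
EDGE_THRESHOLD_SATOSHIS = 30_000_000  # 0.3 bitcoin, the edge filter in the description


def satoshi_to_btc(satoshis: int) -> float:
    """Convert an income value in satoshis to bitcoin."""
    return satoshis / SATOSHIS_PER_BITCOIN


def meets_edge_threshold(satoshis: int) -> bool:
    """True if an amount is at least 0.3 bitcoin (compared in integer satoshis)."""
    return satoshis >= EDGE_THRESHOLD_SATOSHIS


def day_of_year_to_date(year: int, day: int) -> date:
    """Map the (year, day) pair to a calendar date; day 1 is January 1."""
    return date(year, 1, 1) + timedelta(days=day - 1)


def is_ransomware(label: str) -> bool:
    """Binary view of the label: anything other than 'white' is a ransomware family."""
    return label != "white"
```

Comparing in integer satoshis avoids floating-point rounding at the threshold. Keep in mind that white only means "not known to be ransomware", so the binary view inherits that uncertainty.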

10 features

label (target): nominal, 29 unique values, 0 missing
address: string, 2631095 unique values, 0 missing
year: numeric, 8 unique values, 0 missing
day: numeric, 365 unique values, 0 missing
length: numeric, 73 unique values, 0 missing
weight: numeric, 785669 unique values, 0 missing
count: numeric, 11572 unique values, 0 missing
looped: numeric, 10168 unique values, 0 missing
neighbors: numeric, 814 unique values, 0 missing
income: numeric, 1866365 unique values, 0 missing
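Each row describes one address in one 24-hour window, so the same address can occur in several rows, as the description notes for some ransomware families. The sketch below counts such repeats. It is only an illustration: the `AddressDay` record and the `appearances` helper are hypothetical names, the field types follow the feature list in the description, and the three rows are made up rather than taken from the dataset.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AddressDay:
    """One row: the features of one address in one 24-hour window."""
    address: str
    year: int
    day: int
    length: int
    weight: float
    count: int
    looped: int
    neighbors: int
    income: int  # satoshis
    label: str


def appearances(rows):
    """Return the number of rows (24-hour windows) per address."""
    return Counter(row.address for row in rows)


# Made-up rows for illustration only; addresses and labels are placeholders.
rows = [
    AddressDay("addr_A", 2016, 10, 2, 0.5, 1, 0, 2, 100_000_000, "white"),
    AddressDay("addr_B", 2016, 10, 4, 1.0, 3, 0, 1, 50_000_000, "exampleFamily"),
    AddressDay("addr_B", 2016, 11, 6, 0.25, 2, 1, 2, 80_000_000, "exampleFamily"),
]

counts = appearances(rows)
# Keep only addresses seen in more than one window.
repeated = {address: n for address, n in counts.items() if n > 1}
print(repeated)  # {'addr_B': 2}
```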

19 properties

Number of instances (rows) of the dataset: 2916697
Number of attributes (columns) of the dataset: 10
Number of distinct values of the target attribute (if it is nominal): 29
Number of missing values in the dataset: 0
Number of instances with at least one value missing: 0
Number of numeric attributes: 8
Number of nominal attributes: 1
Percentage of instances belonging to the least frequent class: 0
Number of instances belonging to the least frequent class: 1
Number of binary attributes: 0
Percentage of binary attributes: 0
Percentage of instances having missing values: 0
Percentage of missing values: 0
Average class difference between consecutive instances: 1
Percentage of numeric attributes: 80
Number of attributes divided by the number of instances: 0
Percentage of nominal attributes: 10
Percentage of instances belonging to the most frequent class: 98.58
Number of instances belonging to the most frequent class: 2875284
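Several of these properties are derived from the raw counts, and the values shown as 0 are small ratios rounded down. The following sketch recomputes them from the counts listed above. The helper name `pct` and the two-decimal rounding are assumptions chosen because they reproduce the listed values, not a statement of how the site computes them.

```python
instances = 2916697   # rows
attributes = 10       # columns
numeric = 8           # numeric attributes
nominal = 1           # nominal attributes
majority = 2875284    # instances in the most frequent class
minority = 1          # instances in the least frequent class


def pct(part, whole):
    """Percentage of part in whole, rounded to two decimals (assumed rounding)."""
    return round(100 * part / whole, 2)


majority_pct = pct(majority, instances)      # 98.58
minority_pct = pct(minority, instances)      # 0.0 (about 0.00003 before rounding)
numeric_pct = pct(numeric, attributes)       # 80.0
nominal_pct = pct(nominal, attributes)       # 10.0
dimensionality = round(attributes / instances, 2)  # 0.0

# Rows outside the most frequent class.
other_rows = instances - majority            # 41413
```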

6 tasks

0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering