{ "data_id": "578", "name": "kdd_coil_7", "exact_name": "kdd_coil_7", "version": 1, "version_label": null, "description": "**Author**: \n**Source**: Unknown - Date unknown \n**Please cite**: \n\n%%%%%%%%%%%%%%%%%%%\nData-Description %\n%%%%%%%%%%%%%%%%%%%\n\nCOIL 1999 Competition Data\n\nData Type\n\nmultivariate\n\nAbstract\n\nThis data set is from the 1999 Computational Intelligence and Learning\n(COIL) competition. The data contains measurements of river chemical\nconcentrations and algae densities.\n\nSources\n\nOriginal Owner\n\n[1]ERUDIT\nEuropean Network for Fuzzy Logic and Uncertainty Modelling\nin Information Technology\n\nDonor\n\nJens Strackeljan\nTechnical University Clausthal\nInstitute of Applied Mechanics\nGraupenstr. 3, 38678 Clausthal-Zellerfeld, Germany\n[2]tmjs@itm.tu-clausthal.de\n\nDate Donated: September 9, 1999\n\nData Characteristics\n\nThis data comes from a water quality study where samples were taken\nfrom sites on different European rivers over a period of approximately\none year. These samples were analyzed for various chemical substances\nincluding: nitrogen in the form of nitrates, nitrites and ammonia,\nphosphate, pH, oxygen, chloride. In parallel, algae samples were\ncollected to determine the algae population distributions.\n\nOther Relevant Information\n\nThe competition involved the prediction of algal frequency\ndistributions on the basis of the measured concentrations of the\nchemical substances and the global information concerning the season\nwhen the sample was taken, the river size and its flow velocity. The\ncompetition [3]instructions contain additional information on the\nprediction task.\n\nData Format\n\nThere are a total of 340 examples each containing 17 values. The first\n11 values of each data set are the season, the river size, the fluid\nvelocity and 8 chemical concentrations which should be relevant for\nthe algae population distribution. 
The last 8 values of each example\nare the distribution of different kinds of algae. These 8 kinds are\nonly a very small part of the whole community, but for the competition\nwe limited the number to 7. The value 0.0 means that the frequency is\nvery low. The data set also contains some empty fields which are\nlabeled with the string XXXXX.\n\nThe training data are saved in the file: analysis.data (ASCII format).\n\nTable 1: Structure of the file analysis.data\n\nA ... K a ... g\nCC[1,1] ... CC[1,11] AG[1,1] ... AG[1,7]\n... ... ... ...\nCC[200,1] ... CC[200,11] AG[200,1] ... AG[200,7]\n\nExplanation:\nCC[i,j]: Chemical concentration or river characteristic\nAG[i,j]: Algal frequency\n\nThe chemical parameters are labeled as A, ..., K. The columns of the\nalgae are labeled as a, ..., g.\n\nPast Usage\n\n[4]The Third (1999) International COIL Competition Home Page\n_________________________________________________________________\n\n\n[5]The UCI KDD Archive\n[6]Information and Computer Science\n[7]University of California, Irvine\nIrvine, CA 92697-3425\n\nLast modified: October 13, 1999\n\nReferences\n\n1. http:\/\/www.erudit.de\/\n2. mailto:tmjs@itm.tu-clausthal.de\n3. file:\/\/localhost\/research\/ml\/datasets\/uci\/raw\/data\/ucikdd\/coil\/instructions.txt\n4. http:\/\/www.erudit.de\/erudit\/activities\/ic-99\/index.htm\n5. http:\/\/kdd.ics.uci.edu\/\n6. http:\/\/www.ics.uci.edu\/\n7. http:\/\/www.uci.edu\/\n\n%%%%%%%%%%%%%%%%%%%\nTask-Description %\n%%%%%%%%%%%%%%%%%%%\n\n\nThird International Competition\n\nProtecting rivers and streams by monitoring chemical concentrations and\nalgae communities.\n\n\nIntelligent Techniques for Monitoring Water Quality using chemical\nindicators and algae population\n\nRecent years have been characterised by increasing concern at the\nimpact man is having on the environment.\nThe impact on the environment of toxic waste, from a wide variety\nof manufacturing processes, is well known. 
More recently, however,\nit has become clear that the more subtle effects of nutrient level\nand chemical balance changes arising from farming land run-off and\nsewage water treatment also have a serious, but indirect, effect on\nthe states of rivers, lakes and even the sea. In temperate climates\nacross the world summers are characterized by numerous reports of excessive\nsummer algae growth resulting in poor water clarity, mass deaths of\nriver fish from reduced oxygen levels and the closure of recreational\nwater facilities on account of the toxic effects of this annual algal bloom.\nReducing the impact of these man-made changes in river nutrient levels\nhas stimulated much biological research with the aim of identifying\nthe crucial chemical control variables for the biological\nprocesses.\n\nThe data used in this problem comes from one such study.\nDuring the research study water quality samples were\ntaken from sites on different European rivers over a period of\napproximately one year. These samples were analyzed for various\nchemical substances including: nitrogen in the form of nitrates,\nnitrites and ammonia, phosphate, pH, oxygen, chloride.\nIn parallel, algae samples were collected to determine the algae population\ndistributions. It is well known that the dynamics of the\nalgae community is determined by the external chemical\nenvironment with one or more factors being predominant.\nWhile the chemical analysis is cheap and easily\nautomated, the biological part involves microscopic examination,\nrequires trained manpower and is therefore both\nexpensive and slow.\n\nDiatoms like Cymbella are major contributors to primary production\nthroughout the world. 
The diatom reacts with\nlarge sensitivity to even small changes in acidity.\n\nOver a three and a half billion year history algae have evolved and\nadapted as primary plant colonizers of almost\nevery known habitat in terrestrial and aquatic environments.\nThey respond very rapidly to man-made environment changes.\n\n\n\nThe relationship between the chemical and biological features is\ncomplex and can be expected to need the application of advanced\ntechniques. Typical of such real-life problems, the particular\ndata set for the problem contains a mixture of (fuzzy) qualitative\nvariables and numerical measurement values, with much of the data\nbeing incomplete.\n\nThe competition task is the prediction of algal frequency distributions\non the basis of the measured concentrations of the chemical\nsubstances and the global information concerning the season when the sample\nwas taken, the river size and its flow velocity. The two last variables\nare given as linguistic variables.\n\n340 data sets were taken and each contains 17 values. The\nfirst 11 values of each data set are the season, the river\nsize, the fluid velocity and 8 chemical concentrations which\nshould be relevant for the algae population distribution.\nThe last 8 values of each data set are the distribution of\ndifferent kinds of algae. These 8 kinds are only a very small\npart of the whole community, but for the competition we limited\nthe number to 7. 
The value 0.0 means that the frequency is very low.\nThe data set also contains some empty fields which are labeled\nwith the string XXXXX.\n\nEach participant in the competition receives 200 complete data sets\n(training data) and 140 data sets (evaluation data) containing only\nthe 11 values of the river descriptions and the chemical concentrations.\n\nThis training data is to be used in obtaining\na 'model' providing a prediction of the algal distributions associated\nwith the evaluation data.\n\n\n\nThe training data are saved in the file:\n\nanalysis.txt (ASCII format).\n\nStructure of the file analysis.txt\n\nA ... K a ... g\nCC1,1 ... CC1,11 AG1,1 ... AG1,7\n.... ... ... ...\n\n\nCC200,1 ... CC200,11 AG200,1 ... AG200,7\n\n\nExplanation:\nCCi,j: Chemical concentration j=1,...,11\nAGi,k: Algal frequency k=1,...,7\n\n\nThe chemical parameters are labeled as A, ..., K.\nThe columns of the algae are labeled as a, ..., g.\n\n\nEvaluation data are saved in file eval.txt (ASCII format).\n\n\nTable 2: Structure of the file eval.*\nA ... K\nCC1,1 ... CC1,11\n\n..... ...\n\nCC140,1 ... CC140,11\n\n_____________________________________________________________\n\nObjective\n\nThe objective of the competition is to provide a prediction\nmodel on the basis of the training data. Having obtained this\nprediction model, each participant must provide the solution\nin the form of the results of applying this model to the\nevaluation data. The results obtained in this way should\ncorrespond to the results of the evaluation data\n(which are known to the organizer). 
The criterion used to evaluate\nthe results is given below.\nAll 7 Algae frequency distributions must be determined.\nFor this purpose any number of partial models may be developed.\n\n_____________________________________________________________\n\nJudgment of the results\n\nTo judge the results, the sum of squared errors will be calculated.\nThe following Table describes the results of a particular participant.\n\nMatrix of results\na ... g\n\nRes1,1 ... Res1,7\n\n.... ...\n\nRes140,1 ... Res140,7\n\n\nAll solutions that lead to the smallest total error will\nbe regarded as winners of the contest.\n\n\n\nInformation about the dataset\nCLASSTYPE: numeric\nCLASSINDEX: last\n\nALGAE #: 7\/7", "format": "ARFF", "uploader": "Joaquin Vanschoren", "uploader_id": 2, "visibility": "public", "creator": "ERUDIT", "contributor": null, "date": "2014-10-03 21:53:26", "update_comment": null, "last_update": "2014-10-03 21:53:26", "licence": "Public", "status": "active", "error_message": null, "url": "https:\/\/www.openml.org\/data\/download\/52756\/kdd_coil_7.arff", "default_target_attribute": "algae_7", "row_id_attribute": null, "ignore_attribute": null, "runs": 12, "suggest": { "input": [ "kdd_coil_7", "%%%%%%%%%%%%%%%%%%% Data-Description % %%%%%%%%%%%%%%%%%%% COIL 1999 Competition Data Data Type multivariate Abstract This data set is from the 1999 Computational Intelligence and Learning (COIL) competition. The data contains measurements of river chemical concentrations and algae densities. 
Sources Original Owner [1]ERUDIT European Network for Fuzzy Logic and Uncertainty Modelling in Information Technology Donor Jens Strackeljan Technical University Clausthal Institute of Applied Mechanics Gra " ], "weight": 5 }, "qualities": { "NumberOfInstances": 316, "NumberOfFeatures": 12, "NumberOfClasses": 0, "NumberOfMissingValues": 56, "NumberOfInstancesWithMissingValues": 34, "NumberOfNumericFeatures": 9, "NumberOfSymbolicFeatures": 3, "Quartile3AttributeEntropy": null, "RandomTreeDepth1ErrRate": null, "EquivalentNumberOfAtts": null, "MaxNominalAttDistinctValues": 4, "MinSkewnessOfNumericAtts": -0.8925428557617799, "PercentageOfInstancesWithMissingValues": 10.759493670886076, "Quartile3KurtosisOfNumericAtts": 11.871663857364009, "AutoCorrelation": -1.6171428571428574, "RandomTreeDepth1Kappa": null, "J48.00001.AUC": null, "MaxSkewnessOfNumericAtts": 4.168799512604892, "MinStdDevOfNumericAtts": 0.5936665297676558, "PercentageOfMissingValues": 1.4767932489451476, "Quartile3MeansOfNumericAtts": 88.64669098164408, "CfsSubsetEval_DecisionStumpAUC": null, "RandomTreeDepth2AUC": null, "J48.00001.ErrRate": null, "MaxStdDevOfNumericAtts": 187.1034971461759, "MinorityClassPercentage": null, "PercentageOfNumericFeatures": 75, "Quartile3MutualInformation": null, "CfsSubsetEval_DecisionStumpErrRate": null, "RandomTreeDepth2ErrRate": null, "J48.00001.Kappa": null, "MeanAttributeEntropy": null, "MinorityClassSize": null, "PercentageOfSymbolicFeatures": 25, "Quartile3SkewnessOfNumericAtts": 2.837873958330288, "CfsSubsetEval_DecisionStumpKappa": null, "RandomTreeDepth2Kappa": null, "J48.0001.AUC": null, "MeanKurtosisOfNumericAtts": 6.324065585162889, "NaiveBayesAUC": null, "Quartile1AttributeEntropy": null, "Quartile3StdDevOfNumericAtts": 85.41093476439406, "CfsSubsetEval_NaiveBayesAUC": null, "RandomTreeDepth3AUC": null, "J48.0001.ErrRate": null, "MeanMeansOfNumericAtts": 45.99860347296734, "NaiveBayesErrRate": null, "Quartile1KurtosisOfNumericAtts": 1.270256425908787, 
"REPTreeDepth1AUC": null, "CfsSubsetEval_NaiveBayesErrRate": null, "RandomTreeDepth3ErrRate": null, "J48.0001.Kappa": null, "MeanMutualInformation": null, "NaiveBayesKappa": null, "Quartile1MeansOfNumericAtts": 5.485071656050955, "REPTreeDepth1ErrRate": null, "CfsSubsetEval_NaiveBayesKappa": null, "RandomTreeDepth3Kappa": null, "J48.001.AUC": null, "MeanNoiseToSignalRatio": null, "NumberOfBinaryFeatures": 0, "Quartile1MutualInformation": null, "REPTreeDepth1Kappa": null, "CfsSubsetEval_kNN1NAUC": null, "StdvNominalAttDistinctValues": 0.5773502691896258, "J48.001.ErrRate": null, "MeanNominalAttDistinctValues": 3.3333333333333335, "Quartile1SkewnessOfNumericAtts": 0.07954703685202424, "REPTreeDepth2AUC": null, "CfsSubsetEval_kNN1NErrRate": null, "kNN1NAUC": null, "J48.001.Kappa": null, "MeanSkewnessOfNumericAtts": 1.6128138976941189, "Quartile1StdDevOfNumericAtts": 2.2455579597037474, "REPTreeDepth2ErrRate": null, "CfsSubsetEval_kNN1NKappa": null, "kNN1NErrRate": null, "MajorityClassPercentage": null, "MeanStdDevOfNumericAtts": 47.87667752395588, "Quartile2AttributeEntropy": null, "REPTreeDepth2Kappa": null, "ClassEntropy": null, "kNN1NKappa": null, "MajorityClassSize": null, "MinAttributeEntropy": null, "Quartile2KurtosisOfNumericAtts": 4.066602607862747, "Quartile2MeansOfNumericAtts": 12.857054607508534, "REPTreeDepth3AUC": null, "DecisionStumpAUC": null, "MaxAttributeEntropy": null, "MinKurtosisOfNumericAtts": 0.5091930284655253, "Quartile2MutualInformation": null, "REPTreeDepth3ErrRate": null, "DecisionStumpErrRate": null, "MaxKurtosisOfNumericAtts": 19.012237079945105, "MinMeansOfNumericAtts": 2.1775316455696205, "Quartile2SkewnessOfNumericAtts": 2.036018797279895, "REPTreeDepth3Kappa": null, "DecisionStumpKappa": null, "MaxMeansOfNumericAtts": 160.45271044585988, "MinMutualInformation": null, "Quartile2StdDevOfNumericAtts": 18.433566919446708, "RandomTreeDepth1AUC": null, "Dimensionality": 0.0379746835443038, "MaxMutualInformation": null, 
"MinNominalAttDistinctValues": 3, "PercentageOfBinaryFeatures": 0 }, "tags": [ { "tag": "uci", "uploader": "24659" }, { "tag": "study_239", "uploader": "0" } ], "features": [ { "name": "algae_7", "index": "11", "type": "numeric", "distinct": "61", "missing": "0", "target": "1", "min": "0", "max": "32", "mean": "2", "stdev": "5" }, { "name": "season", "index": "0", "type": "nominal", "distinct": "4", "missing": "0", "distr": [] }, { "name": "river_size", "index": "1", "type": "nominal", "distinct": "3", "missing": "0", "distr": [] }, { "name": "fluid_velocity", "index": "2", "type": "nominal", "distinct": "3", "missing": "0", "distr": [] }, { "name": "concentration_1", "index": "3", "type": "numeric", "distinct": "96", "missing": "2", "min": "6", "max": "10", "mean": "8", "stdev": "1" }, { "name": "concentration_2", "index": "4", "type": "numeric", "distinct": "103", "missing": "2", "min": "2", "max": "13", "mean": "9", "stdev": "2" }, { "name": "concentration_3", "index": "5", "type": "numeric", "distinct": "272", "missing": "16", "min": "0", "max": "392", "mean": "41", "stdev": "44" }, { "name": "concentration_4", "index": "6", "type": "numeric", "distinct": "283", "missing": "2", "min": "0", "max": "12", "mean": "3", "stdev": "2" }, { "name": "concentration_5", "index": "7", "type": "numeric", "distinct": "270", "missing": "2", "min": "5", "max": "932", "mean": "160", "stdev": "187" }, { "name": "concentration_6", "index": "8", "type": "numeric", "distinct": "252", "missing": "2", "min": "1", "max": "429", "mean": "59", "stdev": "70" }, { "name": "concentration_7", "index": "9", "type": "numeric", "distinct": "286", "missing": "7", "min": "1", "max": "559", "mean": "118", "stdev": "101" }, { "name": "concentration_8", "index": "10", "type": "numeric", "distinct": "194", "missing": "23", "min": "0", "max": "110", "mean": "13", "stdev": "18" } ], "nr_of_issues": 0, "nr_of_downvotes": 0, "nr_of_likes": 0, "nr_of_downloads": 0, "total_downloads": 0, "reach": 0, 
"reuse": 13, "impact_of_reuse": 0, "reach_of_reuse": 1, "impact": 13 }