578
kdd_coil_7
1
**Author**:
**Source**: Unknown - Date unknown
**Please cite**:
%%%%%%%%%%%%%%%%%%%
Data-Description %
%%%%%%%%%%%%%%%%%%%
COIL 1999 Competition Data
Data Type
multivariate
Abstract
This data set is from the 1999 Computational Intelligence and Learning
(COIL) competition. The data contains measurements of river chemical
concentrations and algae densities.
Sources
Original Owner
[1]ERUDIT
European Network for Fuzzy Logic and Uncertainty Modelling
in Information Technology
Donor
Jens Strackeljan
Technical University Clausthal
Institute of Applied Mechanics
Graupenstr. 3, 38678 Clausthal-Zellerfeld, Germany
[2]tmjs@itm.tu-clausthal.de
Date Donated: September 9, 1999
Data Characteristics
This data comes from a water quality study where samples were taken
from sites on different European rivers of a period of approximately
one year. These samples were analyzed for various chemical substances
including: nitrogen in the form of nitrates, nitrites and ammonia,
phosphate, pH, oxygen, chloride. In parallel, algae samples were
collected to determine the algae population distributions.
Other Relevant Information
The competition involved the prediction of algal frequency
distributions on the basis of the measured concentrations of the
chemical substances and the global information concerning the season
when the sample was taken, the river size and its flow velocity. The
competition [3]instructions contain additional information on the
prediction task.
Data Format
There are a total of 340 examples each containing 17 values. The first
11 values of each data set are the season, the river size, the fluid
velocity and 8 chemical concentrations which should be relevant for
the algae population distribution. The last 8 values of each example
are the distribution of different kinds of algae. These 8 kinds are
only a very small part of the whole community, but for the competition
we limited the number to 7. The value 0.0 means that the frequency is
very low. The data set also contains some empty fields which are
labeled with the string XXXXX.
The training data are saved in the file: analysis.data (ASCII format).
Table 1: Structure of the file analysis.data
A
K
a
g
CC[1,1]
CC[1,11]
AG[1,1]
AG[1,7]
CC[200,1]
CC[200,11]
AG[200,1]
AG[200,7]
Explanation:
CC[i,j]: Chemical concentration or river characteristic
AG[i,j]: Algal frequency
The chemical parameters are labeled as A, ..., K. The columns of the
algaes are labeled as a, ..,g.
Past Usage
[4]The Third (1999) International COIL Competition Home Page
_________________________________________________________________
[5]The UCI KDD Archive
[6]Information and Computer Science
[7]University of California, Irvine
Irvine, CA 92697-3425
Last modified: October 13, 1999
References
1. http://www.erudit.de/
2. mailto:tmjs@itm.tu-clausthal.de
3. file://localhost/research/ml/datasets/uci/raw/data/ucikdd/coil/instructions.txt
4. http://www.erudit.de/erudit/activities/ic-99/index.htm
5. http://kdd.ics.uci.edu/
6. http://www.ics.uci.edu/
7. http://www.uci.edu/
%%%%%%%%%%%%%%%%%%%
Task-Description %
%%%%%%%%%%%%%%%%%%%
Third International Competition
Protecting rivers and streams by monitoring chemical concentrations and
algae communities.
Intelligent Techniques for Monitoring Water Quality using chemical
indicators and algae population
Recent years have been characterised by increasing concern at the
impact man is having on the environment.
The impact on the environment of toxic waste, from a wide variety
of manufacturing processes, is well known. More recently, however,
it has become clear that the more subtle effects of nutrient level
and chemical balance changes arising from farming land run-off and
sewage water treatment also have a serious, but indirect, effect on
the states of rivers, lakes and even the sea. In temperate climates
across the world summers are characterized by numerous reports excessive
summer algae growth resulting in poor water clarity, mass deaths of
river fish from reduced oxygen levels and the closure of recreational
water facilities on account of the toxic effects of this annual algal bloom.
Reducing the impact of these man-made changes in river nutrient levels
has stimulated much biological research with the aim of identifying
the crucial chemical control variables for the biological
processes.
The data used in this problem comes from one such study.
During the research study water quality samples were
taken from sites on different European rivers of a period of
approximately one year. These samples were analyzed for various
chemical substances including: nitrogen in the form of nitrates,
nitrites and ammonia, phosphate, pH, oxygen, chloride.
In parallel, algae samples were collected to determine the algae population
distributions. It is well known that the dynamics of the
algae community is determined by external chemical
environment with one or more factors being predominant.
While the chemical analysis is cheap and easily
automated, the biological part involves microscopic examination,
requires trained manpower and is therefore both
expensive and slow.
Diatoms like Cymbella are major contributors to primary production
throughout the world. The diatom reacts with
large sensitivity to even small changes in acidity .
Over a three and half billion year history algae have evolved and
adapted as primary plant colonizers of almost
every known habitant in terrestrial and aquatic environments.
They respond very rapidly to man-made environment changes.
The relationship between the chemical and biological features is
complex and can be expected to need the application of advanced
techniques. Typical of such real-life problems, the particular
data set for the problem contains a mixture of (fuzzy) qualiative
variables and numerical measurement values, with much of the data
being incomplete.
The competition task is the prediction of algal frequency distributions
on the basis of the measured concentrations of the chemical
substances and the global information concerning the season when the sample
was taken, the river size and its flow velocity. The two last variables
are given as linguistic variables.
340 data sets were taken and each contain 17 values. The
first 11 values of each data set are the season, the river
size, the fluid velocity and 8 chemical concentrations which
should be relevant for the algae population distribution.
The last 8 values of each data set are the distribution of
different kinds of algae. These 8 kinds are only a very small
part of the whole community, but for the competition we limited
the number to 7. The value 0.0 means that the frequency is very low.
The data set also contains some empty fields which are labeled
with the string XXXXX.
Each participant in the competition receives 200 complete data sets
(training data) and 140 data sets (evaluation data) containing only
the 11 values of the river descriptions and the chemical concentrations.
This training data is to be used in obtainin
a 'model' providing a prediction of the algal distributions associated
with the evaluation data.
The training data are saved in the file:
analysis.txt (ASCII format).
Structure of the file analysis.txt
A K a g
CC1,1 ... CC1,11 AG1,1 ... AG1,7
.... ... ... ...
CC200,1 ... CC200,11 AG240,1 ... AG240,7
Explanation:
CCi,j: Chemical concentration j=1,..11
AGi,k: Algal frequency k=1...7
The chemical parameters are labeled as A, ..., K.
The columns of the algaes are labeled as a, ..,g.
Evaluation data are saved in file eval.txt (ASCII format).
Table 2: Structure of the file eval.*
A K
CC1,1 ... CC1,11
..... ...
CC140,1 ... CC140,11
_____________________________________________________________
Objective
The objective of the competition is to provide a prediction
model on basis of the training data. Having obtained this
prediction model, each participant must provide the solution
in the form of the results of applying this model to the
evaluation data. The results obtained in this way should
correspond to the results of the evaluation data
(which are known to the organizer). The criteria used to evaluate
the results is given below.
All 7 Algae frequency distributions must be determined.
For this purpose any number of partial models may be developed.
_____________________________________________________________
Judgment of the results
To judge the results, the sum of squared errors will be calculated.
The following Table describes the results of a particular participant.
Matrix of results
a g
Res1,1 ... Res1,7
.... ...
Res140,1 Res140,7
All solutions that lead to a smallest total error will
be regarded as winner of the contest.
Information about the dataset
CLASSTYPE: numeric
CLASSINDEX: last
ALGAE #: 7/7
1
ARFF
ERUDIT 1999-09-09 2014-10-03T21:53:26
English Public https://api.openml.org/data/v1/download/52756/kdd_coil_7.arff
https://openml1.win.tue.nl/datasets/0000/0578/dataset_578.pq 52756 algae_7 https://archive.ics.uci.edu/ml/citation_policy.html ChemistryLife Scienceuci public https://archive.ics.uci.edu/ml/datasets/Coil+1999+Competition+Data https://openml1.win.tue.nl/datasets/0000/0578/dataset_578.pq active
2020-11-20 20:13:10 145dc6adf2b647cc6ba9128c6374dc32