This training data is a subset of the Wisconsin Diagnostic Breast Cancer database. Each training example in this data set consists of 30 continuous-valued attributes based on features computed from a digitized image of a fine needle aspirate of a breast mass. These features describe characteristics of the cell nuclei present in the image. The classification of each example is whether the sample was malignant (m) or benign (b).

There are 10 features computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)

The attributes in each training example are as follows:

1-10:  the average value of the 10 features above (over all cell nuclei in the image)
11-20: the standard error for the 10 features over all cell nuclei
21-30: the "worst" value over all cell nuclei

I've provided several different subsets:

bc-data-s    small set (20 examples)
bc-data-m1   medium set (100 examples)
bc-data-l1   large set (200 examples)

and of course, there is a list of the attribute names, named: bc-names

FYI, here's how I generated these data sets:

- the original data set consists of 569 examples
- I split the data set into two parts, with 2/3 of the examples in a "training data set" and 1/3 in a "testing data set" (which I'll use for testing your solution)
- I generated random subsets of the "training data set"
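
For reference, here is a minimal Python sketch of that generation procedure. It assumes the original 569 examples live one per line in a file (called wdbc.data here); the file name, exact column layout, and random seeds are placeholders for illustration, not the ones actually used to build the provided subsets.

    import random

    def load_examples(path):
        """Read one example per non-empty line (column layout is irrelevant here)."""
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]

    def split_train_test(examples, train_fraction=2/3, seed=0):
        """Shuffle, then put 2/3 of the examples in training and 1/3 in testing."""
        rng = random.Random(seed)
        shuffled = examples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        return shuffled[:cut], shuffled[cut:]

    def random_subset(training, size, seed):
        """Draw a random subset (without replacement) of the training data."""
        rng = random.Random(seed)
        return rng.sample(training, size)

    if __name__ == "__main__":
        examples = load_examples("wdbc.data")     # 569 examples in the original
        train, test = split_train_test(examples)  # roughly 379 training, 190 testing
        bc_data_s  = random_subset(train, 20, seed=1)    # small set
        bc_data_m1 = random_subset(train, 100, seed=2)   # medium set
        bc_data_l1 = random_subset(train, 200, seed=3)   # large set
        print(len(train), len(test), len(bc_data_s), len(bc_data_m1), len(bc_data_l1))

The only point that matters for your solution is that the provided subsets are drawn from the training half of the split, so the held-out testing examples never appear in them.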