CSCI4390-6390 Assign5

Assign5: Bayes Classifier

Due Date: Oct 31, before midnight (11:59:59PM)

Dataset

Download the Steel Industry Energy Consumption Dataset from the UCI Machine Learning repository. Extract the Steel_industry_data.csv datafile. Parse and store the data as a data matrix, keeping only the 6 continuous attributes (see datafile or link above for names/descriptions). Your data matrix will thus have 35040 points in 6 dimensions. In addition, record the last attribute (load type) for each point; we will use it as the class label.

Randomly shuffle all points, then use 80% of the data for training and the remaining 20% for testing. You must do this with sklearn.model_selection.train_test_split, using 42 as the random_state.
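As a minimal sketch of the split described above, the snippet below runs train_test_split on stand-in random data. The array X, the labels y, and the placeholder class names are assumptions for illustration only; in the assignment they would be the 6-dimensional data matrix and the load type column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 100 points in 6 dimensions with 3 placeholder classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = rng.choice(["c1", "c2", "c3"], size=100)

# shuffle=True is the default; it is written out here to make the shuffling explicit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

print(X_train.shape, X_test.shape)
```

Fixing random_state makes the shuffle reproducible, so repeated runs give the same training and testing points.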

Part I: Bayes Classifier

Implement both the full Bayes classifier in Algo 18.1, and the naive Bayes classifier in Algo 18.2.
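For orientation, the decision rules of the two classifiers are commonly written as below. This is a sketch in assumed notation (class $c_i$ with $n_i$ of the $n$ training points, $d$ attributes, estimated mean $\hat{\boldsymbol{\mu}}_i$ and covariance $\hat{\boldsymbol{\Sigma}}_i$); check it against Algo 18.1 and Algo 18.2 before relying on it.

```latex
% Prior estimated from class frequencies (assumed notation)
\hat{P}(c_i) = \frac{n_i}{n}

% Full Bayes: one multivariate normal per class
\hat{y} = \arg\max_{c_i} \; f\!\left(\mathbf{x} \mid \hat{\boldsymbol{\mu}}_i, \hat{\boldsymbol{\Sigma}}_i\right) \hat{P}(c_i)

% Naive Bayes: attributes treated as independent within each class
\hat{y} = \arg\max_{c_i} \; \hat{P}(c_i) \prod_{j=1}^{d} f\!\left(x_j \mid \hat{\mu}_{ij}, \hat{\sigma}_{ij}^{2}\right)
```

The only difference between the two rules is the class-conditional density: the naive version is equivalent to using a diagonal covariance matrix for each class.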

Students in CSCI6390 should, in addition, implement the K-nearest-neighbor classifier and present the best results obtained over different values of K.

Estimate the parameters using the training data, and report accuracy on the testing set. You must report the total accuracy as well as the class-specific accuracy and recall values (see Eq 22.3 and 22.4 for the latter two). Report these values for all methods.
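One possible way to compute these measures from a confusion matrix is sketched below. It assumes that class-specific accuracy means the fraction of points predicted as a class that truly belong to it (i.e., precision) and that recall means the fraction of points truly in a class that are predicted as it; verify both against Eq 22.3 and 22.4. The function name per_class_report is hypothetical.

```python
import numpy as np

def per_class_report(y_true, y_pred):
    """Return (classes, total accuracy, class-specific accuracy, recall).

    Assumed definitions (check against Eq 22.3 and 22.4):
      class-specific accuracy = correct / number predicted as the class
      recall                  = correct / number truly in the class
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    index = {c: i for i, c in enumerate(classes)}

    # Confusion matrix: rows are true classes, columns are predicted classes.
    N = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        N[index[t], index[p]] += 1

    correct = np.diag(N)
    predicted = N.sum(axis=0)   # column sums: points predicted as each class
    actual = N.sum(axis=1)      # row sums: points truly in each class

    accuracy = correct.sum() / N.sum()
    # A class that is never predicted (or never occurs) yields NaN, not an error.
    with np.errstate(divide="ignore", invalid="ignore"):
        class_acc = np.where(predicted > 0, correct / predicted, np.nan)
        recall = np.where(actual > 0, correct / actual, np.nan)
    return classes, accuracy, class_acc, recall

# Tiny made-up example with placeholder labels.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]
classes, accuracy, class_acc, recall = per_class_report(y_true, y_pred)
print(classes, accuracy, class_acc, recall)
```

The same function can be called on the test-set predictions of each method so that all results are reported in the same form.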

You may use scipy.stats.multivariate_normal.pdf to compute the normal probability density function.
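The snippet below is a small illustration of how this function can be called; the mean, covariance, and points are made-up values, not taken from the dataset. It also shows the logpdf variant, which may help avoid numerical underflow when densities are very small, and a 1-D cov argument, which SciPy interprets as the diagonal of the covariance matrix.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Made-up parameters for illustration (assumption): a 2-D normal distribution.
mu = np.zeros(2)
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
X = np.array([[0.0, 0.0],
              [1.0, -1.0]])

# Density of each row of X under N(mu, Sigma); the result has shape (2,).
dens = multivariate_normal.pdf(X, mean=mu, cov=Sigma)

# Log-density of the same points.
log_dens = multivariate_normal.logpdf(X, mean=mu, cov=Sigma)

# A 1-D cov is treated as a diagonal covariance, which corresponds to the
# product of independent univariate normals.
variances = np.array([2.0, 1.0])
dens_diag = multivariate_normal.pdf(X, mean=mu, cov=variances)
dens_prod = norm.pdf(X, loc=mu, scale=np.sqrt(variances)).prod(axis=1)

print(dens, log_dens, dens_diag, dens_prod)
```

Passing all test points at once, one call per class, is usually much faster than evaluating the density point by point in a Python loop.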

Part II: Questions

Submit your solutions to the following questions:

Chapter 18: Q3

What to submit

Submit your notebook named as assign5.ipynb.

Policy on Academic Honesty

You are free to discuss how to tackle the assignment, but all coding must be your own. Any AI tool use must be declared. Any student caught violating the academic honesty principle (e.g., code similarity or failure to disclose AI tools) will get an automatic F grade in the course and will be referred to the dean of students for disciplinary action.