Data Mining Project 1: Data Preprocessing
Released on Sept 17
Due on Oct 3 at 11:55pm

Specification

In this project, students are to write programs that apply data preprocessing and feature selection techniques to gene expression datasets for cancers/diseases. A small example of such a dataset is the colon cancer dataset, which contains 62 samples collected from colon-cancer patients; among those samples, 40 are tumor biopsies labeled “positive” and 22 are normal tissue biopsies labeled “negative”. For simplicity, in this project we assume that each such dataset has just two classes. Each tuple (row) in the data consists of the readings for the genes followed by the class, which is the last column. Each gene is an attribute. The columns are separated by “,”, a commonly used format in data mining. The dataset can be found on pilot under “Projects” as p1colon.txt. Other datasets may be provided in the same folder later.

Note: Your programs must be able to handle other datasets. That means that your program will need to go through the data once to determine the number of samples (rows) and the number of attributes. The instructor plans to test your program using other datasets. In your discussions and reports, refer to the genes as g1, …, gN, in left-to-right order.

Your project should address the following tasks:

Task 1. Discretize the genes using equi-density binning with NumIntervals=k intervals for each of the genes, and select the top-K genes using info gain (see Task 3).
Task 2. Similarly, discretize the genes using equi-width binning with NumIntervals=k intervals for each of the genes, and select the top-K genes using info gain (see Task 3).
Task 3. Compute the information gain produced by each of the two binnings from Tasks 1 and 2.

If K is larger than the number of available attributes, then all attributes in the dataset are selected. Your program should read three command line arguments: nameDatafile, k, K.
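The three tasks can be illustrated with a minimal Python sketch. It is not a reference solution: all function names are hypothetical, and the equi-density convention used here (each bin's upper bound is the largest value falling in that bin, with at least k samples per gene) is only one of several reasonable choices. Bins follow the (lb, ub] convention, so a value equal to a cut point goes into the lower bin.

```python
import math
from bisect import bisect_left
from collections import Counter

def equal_width_cuts(values, k):
    """k-1 interior cut points splitting [min, max] into k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_density_cuts(values, k):
    """k-1 cut points so each bin holds roughly len(values)/k samples.

    Assumption: len(values) >= k. Ties between samples can still make
    the bin sizes unequal.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k - 1] for i in range(1, k)]

def discretize(values, cuts):
    """Map each value to a bin index 0..k-1, using (lb, ub] intervals."""
    return [bisect_left(cuts, v) for v in values]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(bins, labels):
    """Entropy of the classes minus the weighted entropy within each bin."""
    n = len(labels)
    by_bin = {}
    for b, y in zip(bins, labels):
        by_bin.setdefault(b, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in by_bin.values())
    return entropy(labels) - remainder

def top_genes(columns, labels, k, K, make_cuts):
    """Rank genes by info gain; return up to K (gene_number, gain) pairs.

    columns[j] holds the readings of gene g(j+1). If K exceeds the number
    of genes, the slice simply returns all of them.
    """
    scored = []
    for number, col in enumerate(columns, start=1):
        bins = discretize(col, make_cuts(col, k))
        scored.append((number, info_gain(bins, labels)))
    scored.sort(key=lambda pair: pair[1], reverse=True)  # stable for ties
    return scored[:K]
```

Calling top_genes(columns, labels, k, K, equal_density_cuts) would cover Task 1, and passing equal_width_cuts instead would cover Task 2; both rely on info_gain for Task 3.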
The data file should be located in the same folder as the executable. The executable should be called DWBinning. It should produce the following output files: edensitybins.txt, edensitydata.txt, ewidthbins.txt, and ewidthdata.txt.

In the edensitybins.txt file, you should have the following information for each of the K selected genes: gene number; infogain=ig4thisgene; (bin 1 lb, bin 1 ub] (bin1C1count, bin1C2count); …; (bin k lb, bin k ub] (binkC1count, binkC2count). Use one line for each gene’s info. The genes should be listed in decreasing info gain order. Similarly for the ewidthbins.txt file.

In the edensitydata.txt file, you should have the discretized data for the first K genes. Use 0, 1, 2, …, k-1 as the values representing the bins, with 0 for the leftmost bin, and map the original data into discretized data. You should keep the class for each tuple but ignore genes after gene K. The genes should be listed in the same order as given in the xxxbins.txt files. Similarly for the ewidthdata.txt file.

The first line below is a made-up example line for the edensitybins.txt file, and the next line is a made-up example line for the edensitydata.txt file.

g1670; Info Gain=0.435072; Bins: (-, 35.959] (2,3); ( 35.959,+] (18, 29)
1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, positive

Here k=2, and K=11.
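As a rough illustration of the input and output formats described above, the following Python sketch reads a comma-separated dataset in a single pass and builds one line for each kind of output file. The helper names are hypothetical, and the exact spacing and number formatting of the output lines are assumptions modeled on the made-up example, not requirements stated in the specification.

```python
import csv
from bisect import bisect_left

def load_dataset(path):
    """Read a comma-separated file in one pass.

    Returns (columns, labels): columns[j] lists the readings of gene g(j+1),
    and labels holds the class from the last column of each row.
    """
    columns, labels = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            *readings, label = [cell.strip() for cell in row]
            if not columns:
                columns = [[] for _ in readings]  # one list per gene
            for col, cell in zip(columns, readings):
                col.append(float(cell))
            labels.append(label)
    return columns, labels

def format_bins_line(gene_number, gain, cuts, values, labels, classes):
    """Build one line of a *bins.txt file for a single gene.

    cuts are the k-1 interior cut points; classes fixes the order in which
    the per-class counts are printed (C1 first, then C2).
    """
    k = len(cuts) + 1
    counts = [[0] * len(classes) for _ in range(k)]
    for v, y in zip(values, labels):
        counts[bisect_left(cuts, v)][classes.index(y)] += 1
    bounds = ["-"] + [f"{c:g}" for c in cuts] + ["+"]
    parts = [
        f"({bounds[i]}, {bounds[i + 1]}] ({','.join(map(str, counts[i]))})"
        for i in range(k)
    ]
    return f"g{gene_number}; Info Gain={gain:.6f}; Bins: " + "; ".join(parts)

def format_data_line(bin_ids, label):
    """Build one line of a *data.txt file: the bin ids followed by the class."""
    return ", ".join(str(b) for b in bin_ids) + ", " + label
```

In a full program, the three command line arguments (nameDatafile, k, K) could be taken from sys.argv, with k and K converted to integers before use.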

 
