KNN Validation
I decided that about 2% of the data needed to be imputed, that is, replaced with re-estimated values. To make sure my imputation model was correct, I needed to validate it. Validation means making predictions on a dataset whose target values we already know; that way we can compare the algorithm's estimates to the real data and see where it needs improvement.
I began by making train and test datasets. The test set has 7041 rows, the same number of rows in the full dataset that are missing both max and min temperature values, so the validation simulates using the algorithm on a dataset of that size.
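A minimal sketch of that split, assuming a hypothetical `temps` DataFrame with `station_id`, `date`, `tmax`, and `tmin` columns (the actual column names in my data may differ):

```python
import pandas as pd

# A minimal sketch, assuming `temps` is a DataFrame of daily station
# readings with hypothetical columns station_id, date, tmax, tmin.
complete = temps.dropna(subset=["tmax", "tmin"])

# Sample 7041 rows, the same size as the set of rows missing both
# values, and hide their known temperatures for validation.
test = complete.sample(n=7041, random_state=0)
train = complete.drop(test.index)

truth = test[["tmax", "tmin"]].copy()  # keep the real answers
test_masked = test.assign(tmax=float("nan"),
                          tmin=float("nan"))  # simulate missingness
```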
I then applied the functions that, for each row, find stations reporting on the same date, as well as stations reporting in the same month, and save their values.
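The two matching rules could look roughly like this; `same_date_values` and `same_month_values` are hypothetical names, and I'm assuming any geographic filtering for "nearby" stations happens before this step:

```python
# Hedged sketch of the two matching rules. Assumes the `train` frame
# from above with a datetime64 `date` column.
def same_date_values(row, train):
    """Temperatures from other stations reporting on the exact same date."""
    match = train[(train["date"] == row["date"]) &
                  (train["station_id"] != row["station_id"])]
    return match[["tmax", "tmin"]]

def same_month_values(row, train):
    """Temperatures from other stations reporting in the same calendar month."""
    match = train[(train["date"].dt.month == row["date"].month) &
                  (train["station_id"] != row["station_id"])]
    return match[["tmax", "tmin"]]
```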
Each row of the missing dataset returned a certain number of nearby stations. I made histograms showing how many nearby stations were returned per row. Intuitively, the algorithm that looked for the same month rather than the same date returned more stations per row.
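As a sketch of how those histograms might be produced, assuming the helper functions above:

```python
import matplotlib.pyplot as plt

# Per-row counts of nearby stations returned by each matching rule.
counts_date = [len(same_date_values(r, train)) for _, r in test_masked.iterrows()]
counts_month = [len(same_month_values(r, train)) for _, r in test_masked.iterrows()]

fig, axes = plt.subplots(1, 2, sharey=True, figsize=(10, 4))
axes[0].hist(counts_date, bins=30)
axes[0].set_title("Nearby stations per row: same date")
axes[1].hist(counts_month, bins=30)
axes[1].set_title("Nearby stations per row: same month")
for ax in axes:
    ax.set_xlabel("stations returned")
axes[0].set_ylabel("rows")
plt.tight_layout()
plt.show()
```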
For every row of missing data, I took both the mean and the median of the nearby stations' temperatures, just to see whether one provided any useful information over the other.
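In pandas terms, computing both summaries per row might look like this (shown for max temperature only; min works the same way):

```python
# For each masked row, summarize the neighbors' max temperatures two
# ways, using the assumed same-date helper sketched above.
rows = []
for _, row in test_masked.iterrows():
    neighbors = same_date_values(row, train)
    rows.append({"tmax_mean": neighbors["tmax"].mean(),
                 "tmax_median": neighbors["tmax"].median()})
preds = pd.DataFrame(rows, index=test_masked.index)
```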
Since the validation is done on a random sample of the data each time, the calculated SSE differs from run to run. Running through the process several times, however, shows which trends are stable.
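Comparing the two summaries by SSE, under the same assumed frames as above:

```python
# SSE of each summary against the held-out truth; a new random test
# sample gives a new SSE, so only trends that survive repeated runs
# are trusted.
sse_mean = ((preds["tmax_mean"] - truth["tmax"]) ** 2).sum()
sse_median = ((preds["tmax_median"] - truth["tmax"]) ** 2).sum()
print(f"SSE using mean:   {sse_mean:.1f}")
print(f"SSE using median: {sse_median:.1f}")
```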
In the real KNN, I will use the mean. I will also provide residual histograms of both mean-based predictions, same date and same month. You can see that the same-date stations have a much tighter concentration around 0, meaning the predictions were more accurate.
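A sketch of those residual histograms, where `preds_date` and `preds_month` are assumed Series of mean-based predictions from each matching rule:

```python
# Residual = predicted - actual; a tight spike at 0 means accurate
# predictions. preds_date and preds_month are hypothetical mean-based
# tmax predictions from the same-date and same-month rules.
fig, axes = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(10, 4))
axes[0].hist(preds_date - truth["tmax"], bins=40)
axes[0].set_title("Residuals: same-date stations")
axes[1].hist(preds_month - truth["tmax"], bins=40)
axes[1].set_title("Residuals: same-month stations")
for ax in axes:
    ax.set_xlabel("predicted - actual")
axes[0].set_ylabel("rows")
plt.tight_layout()
plt.show()
```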