Machine Learning | k-Nearest Neighbors in Practice

Today we'll implement the k-nearest neighbors (kNN) method by hand.

As usual, we'll implement it in two ways. First, we build the simplest possible kNN directly from the principle of the method.
```python
from sklearn.datasets import make_classification
import numpy as np
from collections import Counter


def euclideanDistance_two_loops(train_X, test_X):
    """Brute-force Euclidean distance: one pair of samples at a time."""
    num_test = test_X.shape[0]
    num_train = train_X.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            test_line = test_X[i]
            train_line = train_X[j]
            temp = np.subtract(test_line, train_line)
            temp = np.power(temp, 2)
            dists[i][j] = np.sqrt(temp.sum())
    return dists


def euclideanDistance_no_loops(train_X, test_X):
    """Euclidean distance via matrix operations, no loops (derivation below)."""
    num_test = test_X.shape[0]
    num_train = train_X.shape[0]

    # ||C_j||^2 for every training sample, tiled to shape (num_test, num_train)
    sum_train = np.power(train_X, 2)
    sum_train = sum_train.sum(axis=1)
    sum_train = sum_train * np.ones((num_test, num_train))

    # ||P_i||^2 for every test sample, transposed into a (num_test, 1) column
    sum_test = np.power(test_X, 2)
    sum_test = sum_test.sum(axis=1)
    sum_test = sum_test * np.ones((1, sum_train.shape[0]))
    sum_test = sum_test.T

    # dist(i, j)^2 = ||P_i||^2 + ||C_j||^2 - 2 * P_i . C_j
    sum = sum_train + sum_test - 2 * np.dot(test_X, train_X.T)
    dists = np.sqrt(sum)
    return dists


def predict_labels(dists, labels, k=1):
    """Predict by majority vote among the k nearest training samples."""
    num_test = dists.shape[0]
    y_pred = []
    for i in range(num_test):
        index = np.argsort(dists[i])  # training samples sorted by distance
        index = index[:k]             # keep only the k nearest
        closest_y = labels[index]
        name, _ = Counter(closest_y).most_common(1)[0]
        y_pred.append(name)
    return y_pred


def getAccuracy(y_pred, y):
    """Fraction of test samples predicted correctly."""
    num_correct = np.sum(y_pred == y)
    accuracy = float(num_correct) / len(y)
    return accuracy


if __name__ == "__main__":
    x, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                               n_redundant=0, n_clusters_per_class=1,
                               n_classes=3)
    trainData = x[:800]
    trainLabel = y[:800]
    testData = x[800:]
    testLabel = y[800:]

    dists = euclideanDistance_no_loops(trainData, testData)
    y_pred = predict_labels(dists, trainLabel, k=3)
    accuracy = getAccuracy(y_pred, testLabel)
    print(accuracy)
```
The idea behind computing Euclidean distances without loops:

Let the test-set matrix P have size \(M \times D\) and the training-set matrix C have size \(N \times D\), where D is the dimensionality of each sample vector and M, N are the numbers of samples.

Let \(P_i\) denote the i-th row of P, \(P_i = [P_{i1}, P_{i2}, \dots, P_{iD}]\), and \(C_j\) the j-th row of C, \(C_j = [C_{j1}, C_{j2}, \dots, C_{jD}]\).
First, compute the distance \(\mathrm{dist}(i, j)\) between \(P_i\) and \(C_j\):

\[
\begin{aligned}
\mathrm{dist}(i,j) &= \sqrt{(P_{i1}-C_{j1})^2 + (P_{i2}-C_{j2})^2 + \cdots + (P_{iD}-C_{jD})^2} \\
&= \sqrt{\left(P_{i1}^2 + \cdots + P_{iD}^2\right) + \left(C_{j1}^2 + \cdots + C_{jD}^2\right) - 2\left(P_{i1}C_{j1} + \cdots + P_{iD}C_{jD}\right)} \\
&= \sqrt{\|P_i\|^2 + \|C_j\|^2 - 2\,P_i C_j^{\mathsf T}}
\end{aligned}
\]
This generalizes to a formula for the i-th row of the distance matrix:

\[
\mathrm{dist}(i) = \sqrt{\|P_i\|^2 \cdot [\,1, 1, \dots, 1\,]_{1 \times N} + \left[\|C_1\|^2, \|C_2\|^2, \dots, \|C_N\|^2\right] - 2\,P_i C^{\mathsf T}}
\]
And further to the whole distance matrix, with the square root taken elementwise:

\[
\mathrm{dist} = \sqrt{\begin{bmatrix}\|P_1\|^2 \\ \|P_2\|^2 \\ \vdots \\ \|P_M\|^2\end{bmatrix} [\,1, \dots, 1\,]_{1 \times N} + \begin{bmatrix}1 \\ \vdots \\ 1\end{bmatrix}_{M \times 1} \left[\|C_1\|^2, \dots, \|C_N\|^2\right] - 2\,P C^{\mathsf T}}
\]
This yields all pairwise distances between samples without any loops.
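As a quick sanity check, here is a minimal sketch (assuming the two distance functions defined above are in scope; the dataset sizes are arbitrary choices) verifying that the loop-free version agrees with the brute-force version:

```python
import numpy as np
from sklearn.datasets import make_classification

# Small random dataset just for the comparison.
x, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, n_classes=3)
train_X, test_X = x[:80], x[80:]

d_loops = euclideanDistance_two_loops(train_X, test_X)
d_vec = euclideanDistance_no_loops(train_X, test_X)
print(np.allclose(d_loops, d_vec))  # expected to print True
```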
The classification accuracy obtained this way varies from run to run, since make_classification draws a fresh random dataset each time (no random_state is fixed).
Next, we use scikit-learn's built-in k-nearest neighbors implementation to do the same classification:
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification

x, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, n_classes=3)

x_train = x[:800]
y_train = y[:800]
x_test = x[800:]
y_test = y[800:]

clf = KNeighborsClassifier(n_neighbors=4)
clf.fit(x_train, y_train)

print(clf.score(x_test, y_test))
```
The classification accuracy from this method is printed by clf.score; as before, the exact number depends on the randomly generated dataset.
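The value n_neighbors=4 above was picked by hand. For a more principled choice of k, a minimal sketch using scikit-learn's cross_val_score could look like this (assuming x_train and y_train from the block above; the candidate range 1 to 10 is an arbitrary assumption):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Score several candidate k values with 5-fold cross-validation
# on the training split, then keep the best-scoring one.
for k in range(1, 11):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(clf, x_train, y_train, cv=5)
    print(k, scores.mean())
```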
That's all for today. Thanks for reading.