Machine Learning | k-Nearest Neighbors in Practice

Today, we'll implement the k-nearest neighbors algorithm by hand.

As usual, we'll implement it in two ways. First, let's build the simplest possible k-nearest neighbors classifier directly from the algorithm's principles.

from sklearn.datasets import make_classification
import numpy as np
from collections import Counter


'''
The most straightforward approach: compute Euclidean distances with two loops
'''
def euclideanDistance_two_loops(train_X, test_X):
    num_test = test_X.shape[0]
    num_train = train_X.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            test_line = test_X[i]
            train_line = train_X[j]
            temp = np.subtract(test_line, train_line)
            temp = np.power(temp, 2)
            dists[i][j] = np.sqrt(temp.sum())
    return dists

'''
Compute Euclidean distances with matrix operations and no loops;
see the derivation below
'''
def euclideanDistance_no_loops(train_X, test_X):

    num_test = test_X.shape[0]
    num_train = train_X.shape[0]

    # ||C_j||^2 for every training sample, tiled to shape (num_test, num_train)
    sum_train = np.power(train_X, 2)
    sum_train = sum_train.sum(axis=1)
    sum_train = sum_train * np.ones((num_test, num_train))

    # ||P_i||^2 for every test sample, as a (num_test, 1) column
    sum_test = np.power(test_X, 2)
    sum_test = sum_test.sum(axis=1)
    sum_test = sum_test * np.ones((1, num_test))
    sum_test = sum_test.T

    # ||P_i||^2 + ||C_j||^2 - 2 * P_i . C_j for every pair (i, j)
    squared = sum_train + sum_test - 2 * np.dot(test_X, train_X.T)
    dists = np.sqrt(squared)
    return dists

'''
Predict: take the k nearest training samples and vote by majority
'''
def predict_labels(dists, labels, k=1):
    num_test = dists.shape[0]
    y_pred = []
    for i in range(num_test):
        index = np.argsort(dists[i])  # training samples sorted by distance
        index = index[:k]             # keep the k nearest
        closest_y = labels[index]
        name, _ = Counter(closest_y).most_common(1)[0]  # majority vote
        y_pred.append(name)
    return y_pred

'''
Accuracy
'''
def getAccuracy(y_pred, y):
    num_correct = np.sum(y_pred == y)
    accuracy = float(num_correct) / len(y)
    return accuracy


if __name__ == "__main__":
    # 1000 samples in three classes
    x, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                               n_redundant=0, n_clusters_per_class=1, n_classes=3)
    trainData = x[:800]
    trainLabel = y[:800]
    testData = x[800:]
    testLabel = y[800:]
    dists = euclideanDistance_no_loops(trainData, testData)
    y_pred = predict_labels(dists, trainLabel, k=3)
    accuracy = getAccuracy(y_pred, testLabel)
    print(accuracy)
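To see what the voting step in predict_labels actually does, here is a minimal sketch for a single test sample; the distances and labels below are made up purely for illustration:

import numpy as np
from collections import Counter

# Hypothetical distances from one test sample to five training samples
row = np.array([2.3, 0.4, 1.7, 0.5, 3.1])
labels = np.array([0, 1, 2, 1, 0])

k = 3
nearest = np.argsort(row)[:k]      # indices of the 3 smallest distances: [1, 3, 2]
votes = Counter(labels[nearest])   # Counter({1: 2, 2: 1})
print(votes.most_common(1)[0][0])  # majority label: 1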

The principle behind computing Euclidean distances without loops:

Let the test matrix \(P\) be of size \(M \times D\) and the training matrix \(C\) of size \(N \times D\), where \(D\) is the dimension of the sample vectors and \(M\) and \(N\) are the numbers of samples.

Let \(P_i\) be the \(i\)-th row of \(P\), \(P_i = [P_{i1}, P_{i2}, \ldots, P_{iD}]\), and \(C_j\) the \(j\)-th row of \(C\), \(C_j = [C_{j1}, C_{j2}, \ldots, C_{jD}]\).

First, compute the distance \(\mathrm{dist}(i, j)\) between \(P_i\) and \(C_j\):

\[
\mathrm{dist}(i, j) = \sqrt{\sum_{k=1}^{D}(P_{ik} - C_{jk})^2} = \sqrt{\|P_i\|^2 + \|C_j\|^2 - 2\, P_i C_j^{\mathsf{T}}}
\]

From this, we can generalize to a formula for the \(i\)-th row of the distance matrix:

\[
\mathrm{dists}[i] = \sqrt{\|P_i\|^2 \cdot \mathbf{1}_{1 \times N} + \left(\|C_1\|^2, \|C_2\|^2, \ldots, \|C_N\|^2\right) - 2\, P_i C^{\mathsf{T}}}
\]

And further to the whole distance matrix:

\[
\mathrm{dists} = \sqrt{\begin{pmatrix}\|P_1\|^2\\ \vdots\\ \|P_M\|^2\end{pmatrix} \mathbf{1}_{1 \times N} + \mathbf{1}_{M \times 1}\left(\|C_1\|^2, \ldots, \|C_N\|^2\right) - 2\, P C^{\mathsf{T}}}
\]

This yields all pairwise distances between test and training samples without any loops.
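As a quick sanity check, the vectorized formula should agree with the naive two-loop computation up to floating-point error. A minimal sketch with made-up toy data:

import numpy as np

rng = np.random.default_rng(0)
train_X = rng.normal(size=(5, 2))  # 5 "training" points in 2-D
test_X = rng.normal(size=(3, 2))   # 3 "test" points

# Naive: Euclidean distance for every (test, train) pair
loops = np.array([[np.sqrt(np.sum((t - c) ** 2)) for c in train_X]
                  for t in test_X])

# Vectorized: sqrt(||P_i||^2 + ||C_j||^2 - 2 * P C^T), as derived above
vec = np.sqrt(np.sum(test_X ** 2, axis=1)[:, None]
              + np.sum(train_X ** 2, axis=1)[None, :]
              - 2 * test_X @ train_X.T)

print(np.allclose(loops, vec))  # True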

The classification accuracy obtained this way:

[screenshot: printed accuracy]

Next, let's use the k-nearest neighbors implementation from the sklearn library to do the same classification:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification

# 1000 samples in three classes
x, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, n_classes=3)
# Split the samples into training and test data
x_train = x[:800]
y_train = y[:800]
x_test = x[800:]
y_test = y[800:]

clf = KNeighborsClassifier(n_neighbors=4)
# Train the classifier
clf.fit(x_train, y_train)
# Evaluate accuracy on the test data
print(clf.score(x_test, y_test))
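The choice of n_neighbors matters. One common way to pick k is cross-validation; here is a minimal sketch using sklearn's GridSearchCV (the search range k = 1..10 and the 5-fold setting are just assumptions for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

x, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, n_classes=3)
x_train, y_train = x[:800], y[:800]
x_test, y_test = x[800:], y[800:]

# Try k = 1..10 with 5-fold cross-validation on the training data
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": list(range(1, 11))}, cv=5)
search.fit(x_train, y_train)

print(search.best_params_)           # the k that scored best in cross-validation
print(search.score(x_test, y_test))  # test accuracy with that k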

The classification accuracy obtained by this method:

[screenshot: printed accuracy]

That's all for today. Thanks for reading!