Exploring the Sea of AI -- Keras Tutorial: Image Data Preprocessing (Part 2)

Preface

The tf.keras.preprocessing module is used to convert raw data into a fixed format that is ready for training, such as a tf.data.Dataset.

flow method

from tensorflow.keras.preprocessing.image import ImageDataGenerator
ImageDataGenerator.flow(
x,
y=None,
batch_size=32,
shuffle=True,
sample_weight=None,
seed=None,
save_to_dir=None,
save_prefix="",
save_format="png",
subset=None,
)

This method generates batches of data from in-memory arrays and offers a few auxiliary options. The shuffle, sample_weight, seed, and subset arguments were covered earlier; save_to_dir, save_prefix, and save_format are used together to write the generated (augmented) images to disk.
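As a minimal sketch of flow on in-memory NumPy arrays (the array shapes, labels, and the horizontal_flip augmentation below are illustrative assumptions, not part of the original text):

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# A toy dataset: 8 random 32x32 RGB images with integer labels
x = np.random.rand(8, 32, 32, 3).astype("float32")
y = np.arange(8)

# horizontal_flip is just one example of an augmentation setting
datagen = ImageDataGenerator(horizontal_flip=True)
gen = datagen.flow(x, y, batch_size=4, shuffle=True, seed=42)

# The iterator yields (images, labels) tuples of batch_size items
xb, yb = next(gen)
print(xb.shape, yb.shape)  # (4, 32, 32, 3) (4,)
```

Because flow returns an iterator, it can be passed directly to model.fit, which will draw augmented batches from it on the fly.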

flow_from_dataframe method

ImageDataGenerator.flow_from_dataframe(
dataframe,
directory=None,
x_col="filename",
y_col="class",
weight_col=None,
target_size=(256, 256),
color_mode="rgb",
classes=None,
class_mode="categorical",
batch_size=32,
shuffle=True,
seed=None,
save_to_dir=None,
save_prefix="",
save_format="png",
subset=None,
interpolation="nearest",
validate_filenames=True,
**kwargs
)

This method consumes a pandas DataFrame and yields standardized batches. directory is the path that contains the image files named in the x_col column (it is an input path, not an output path); if it is None, x_col must contain absolute paths. The blog post at https://medium.com/@vijayabhaskar96/tutorial-on-keras-imagedatagenerator-with-flow-from-dataframe-8bd5776e45c1 describes an alternative way to use the latest version of this method: first run pip uninstall keras-preprocessing and then pip install git+https://github.com/keras-team/keras-preprocessing.git on the command line, then import the class with from keras_preprocessing.image import ImageDataGenerator.

An example that uses this method is shown below; the dataset can be downloaded here: https://www.kaggle.com/c/cifar-10/data

import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Activation, MaxPooling2D, Dropout, Flatten, Dense
from tensorflow.keras import optimizers
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Paths are relative to the extracted Kaggle CIFAR-10 data
df = pd.read_csv("./train.csv")
# validation_split reserves 25% of the rows for the validation subset
datagen = ImageDataGenerator(rescale=1./255, validation_split=0.25)
train_generator = datagen.flow_from_dataframe(
    dataframe=df, directory="./train_imgs", x_col="id", y_col="label",
    class_mode="categorical", target_size=(32, 32), batch_size=32,
    subset="training")
valid_generator = datagen.flow_from_dataframe(
    dataframe=df, directory="./train_imgs", x_col="id", y_col="label",
    class_mode="categorical", target_size=(32, 32), batch_size=32,
    subset="validation")

model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same', input_shape=(32, 32, 3)))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

model.compile(optimizers.RMSprop(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])

STEP_SIZE_TRAIN = train_generator.n // train_generator.batch_size
STEP_SIZE_VALID = valid_generator.n // valid_generator.batch_size
# fit_generator is deprecated; model.fit accepts generators directly
model.fit(train_generator,
          steps_per_epoch=STEP_SIZE_TRAIN,
          validation_data=valid_generator,
          validation_steps=STEP_SIZE_VALID,
          epochs=10)

flow_from_directory method

ImageDataGenerator.flow_from_directory(
directory,
target_size=(256, 256),
color_mode="rgb",
classes=None,
class_mode="categorical",
batch_size=32,
shuffle=True,
seed=None,
save_to_dir=None,
save_prefix="",
save_format="png",
follow_links=False,
subset=None,
interpolation="nearest",
)

This method reads data directly from a directory and generates batches, but the directory must follow a specific layout: one subdirectory per class, with that class's images inside it. An example layout is shown below; more complete sample code is available at https://gist.github.com/fchollet/0830affa1f7f19fd47b06d4cf89ed44d :

data/
    train/
        dogs/
            dog001.jpg
            dog002.jpg
            ...
        cats/
            cat001.jpg
            cat002.jpg
            ...
    validation/
        dogs/
            dog001.jpg
            dog002.jpg
            ...
        cats/
            cat001.jpg
            cat002.jpg
            ...
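A minimal runnable sketch of flow_from_directory is shown below. It first builds a tiny directory tree in the layout above (the class names dogs/cats, image sizes, and file counts are illustrative assumptions), then reads batches from it:

```python
import os
import tempfile
import numpy as np
from PIL import Image
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Build a tiny train/ tree with one subdirectory per class,
# filled with random placeholder images (assumed layout for the demo)
root = tempfile.mkdtemp()
for cls in ("dogs", "cats"):
    cls_dir = os.path.join(root, "train", cls)
    os.makedirs(cls_dir)
    for i in range(3):
        img = (np.random.rand(48, 48, 3) * 255).astype("uint8")
        Image.fromarray(img).save(os.path.join(cls_dir, f"{cls}{i:03d}.jpg"))

train_generator = ImageDataGenerator(rescale=1./255).flow_from_directory(
    os.path.join(root, "train"),
    target_size=(32, 32),      # every image is resized to 32x32
    class_mode="categorical",  # labels are inferred from subdirectory names
    batch_size=2,
)

# Class indices are assigned in alphanumeric order of the folder names
print(train_generator.class_indices)  # {'cats': 0, 'dogs': 1}
xb, yb = next(train_generator)
print(xb.shape, yb.shape)  # (2, 32, 32, 3) (2, 2)
```

Note that no label file is needed: the subdirectory names themselves act as the class labels, which is what makes the fixed layout above mandatory.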