memo: Scene Recognition by Caffe

Introduction

　In this page, I perform scene recognition using the Caffe library. Starting from the pre-trained model that Caffe provides and fine-tuning it on scene images, the recognition accuracy reaches about 95%.

Computation Environment

　I used a g2.2xlarge instance on Amazon EC2, which is equipped with a GPU.

Dataset

　I trained the CNN on the LSP15 dataset, which is available from this page. The dataset consists of the following 15 directories:

MITcoast
MITforest
MIThighway
MITinsidecity
MITmountain
MITopencountry
MITstreet
MITtallbuilding
bedroom
CALsuburb
industrial
kitchen
livingroom
PARoffice
store

The name of each directory represents a scene category. Each directory contains about 200 to 300 images belonging to that category.

Data Augmentation

　To augment the dataset, I added mirrored (flipped) copies of the images to it. The images were then split into two groups, "train" and "test". Each image is 256 $\times$ 256 pixels with 3 channels. The number of images in each category is as follows:

label	name	train images	test images
0	MITcoast	610	100
1	MIThighway	440	70
2	MITmountain	630	100
3	MITstreet	490	80
4	MITforest	550	90
5	MITinsidecity	520	80
6	MITopencountry	690	110
7	MITtallbuilding	600	100
8	bedroom	360	60
9	CALsuburb	400	60
10	industrial	520	80
11	kitchen	360	60
12	livingroom	490	80
13	PARoffice	360	60
14	store	540	90
	total	7560	1220
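
The augmentation and split described above can be sketched as follows. This is a minimal illustration and not the script used for this page: the pure-Python pixel grid, the helper names, the test fraction and the random seed are all assumptions. Only the "_flipped" naming is taken from the list files shown later.

```python
import random

def mirror(image):
    """Return the horizontally mirrored copy of a row-major pixel grid."""
    return [list(reversed(row)) for row in image]

def flipped_name(filename):
    """Name a mirrored copy as the list files do: xxx.jpg -> xxx_flipped.jpg."""
    stem, ext = filename.rsplit(".", 1)
    return f"{stem}_flipped.{ext}"

def split_train_test(filenames, test_fraction=0.14, seed=0):
    """Shuffle file names and split them into (train, test) lists.

    test_fraction and seed are hypothetical; the page does not state them.
    """
    names = sorted(filenames)
    random.Random(seed).shuffle(names)
    n_test = round(len(names) * test_fraction)
    return names[n_test:], names[:n_test]

# Ten placeholder originals plus their mirrored copies.
originals = [f"MITcoast_image_{i:04d}.jpg" for i in range(1, 11)]
augmented = originals + [flipped_name(n) for n in originals]
train, test = split_train_test(augmented)
```

With real images, a library such as Pillow or OpenCV would do the flipping; the nested-list version only shows what "mirror" means here.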

Dataset for Caffe

　Caffe requires the following directories and files:

a directory which contains training images
a directory which contains test images
a text file listing the file name and label of each training image
a text file listing the file name and label of each test image

In my environment, they are put in the following paths:

/home/ubuntu/data/caffe_256_15/train/
/home/ubuntu/data/caffe_256_15/test/
/home/ubuntu/data/caffe_256_15/train.txt
/home/ubuntu/data/caffe_256_15/test.txt

The contents of the file "test.txt" are as follows:

MITstreet_image_0179_flipped.jpg 3
MITtallbuilding_image_0173_flipped.jpg 7
MITcoast_image_0126.jpg 0
store_image_0158_flipped.jpg 14
MITinsidecity_image_0102_flipped.jpg 5
MITforest_image_0200_flipped.jpg 4
industrial_image_0189_flipped.jpg 10
MITcoast_image_0142.jpg 0
kitchen_image_0019_flipped.jpg 11
bedroom_image_0210_flipped.jpg 8
bedroom_image_0116_flipped.jpg 8
livingroom_image_0008_flipped.jpg 12
kitchen_image_0051_flipped.jpg 11
MITstreet_image_0167_flipped.jpg 3
MITcoast_image_0315.jpg 0
....

The contents of the file "train.txt" are as follows:

industrial_image_0190.jpg 10
CALsuburb_image_0103_flipped.jpg 9
bedroom_image_0022_flipped.jpg 8
MITopencountry_image_0222.jpg 6
MITstreet_image_0040.jpg 3
MIThighway_image_0053_flipped.jpg 1
livingroom_image_0063_flipped.jpg 12
store_image_0106_flipped.jpg 14
industrial_image_0144.jpg 10
kitchen_image_0085_flipped.jpg 11
bedroom_image_0040.jpg 8
MIThighway_image_0088_flipped.jpg 1
industrial_image_0264.jpg 10
bedroom_image_0117_flipped.jpg 8
MITcoast_image_0021_flipped.jpg 0
...
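
Each line of these list files is a file name followed by its integer label. A minimal sketch of how such files could be generated is shown below. The label numbering comes from the table in the section "Data Augmentation", and the category is read from the part of the file name before "_image_"; the function names are hypothetical.

```python
# Label numbering taken from the table in the section "Data Augmentation".
CATEGORIES = [
    "MITcoast", "MIThighway", "MITmountain", "MITstreet", "MITforest",
    "MITinsidecity", "MITopencountry", "MITtallbuilding", "bedroom",
    "CALsuburb", "industrial", "kitchen", "livingroom", "PARoffice", "store",
]
LABELS = {name: label for label, name in enumerate(CATEGORIES)}

def list_line(filename):
    """Return '<file name> <label>' for a name like 'store_image_0158.jpg'."""
    category = filename.split("_image_")[0]
    return f"{filename} {LABELS[category]}"

def write_list_file(path, filenames):
    """Write one '<file name> <label>' line per image to path."""
    with open(path, "w") as f:
        for name in filenames:
            f.write(list_line(name) + "\n")
```

For example, list_line("MITstreet_image_0179_flipped.jpg") gives the first line of "test.txt" shown above.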

After storing the images listed in "test.txt" and "train.txt" in the directories "test" and "train" respectively, I ran a conversion script to create the dataset for Caffe. The script outputs "test_leveldb" and "train_leveldb", which are the inputs for Caffe.
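
One way to do this conversion is Caffe's convert_imageset tool. The sketch below only builds and runs the command lines; the location of the tool under the Caffe build directory and the output directory names relative to the working directory are assumptions, and this is not necessarily the script used for this page.

```python
import subprocess

# Assumed location of the Caffe build; adjust to the local installation.
CONVERT_TOOL = "/home/ubuntu/buildspace/caffe-master/build/tools/convert_imageset"
DATA_ROOT = "/home/ubuntu/data/caffe_256_15"

def convert_command(split):
    """Build the convert_imageset call for one split ("train" or "test")."""
    return [
        CONVERT_TOOL,
        "--backend=leveldb",         # write LevelDB instead of the default LMDB
        f"{DATA_ROOT}/{split}/",     # root folder; list-file names are relative to it
        f"{DATA_ROOT}/{split}.txt",  # one "<file name> <label>" pair per line
        f"{split}_leveldb",          # output database directory
    ]

def build_databases(run=subprocess.check_call):
    """Convert both splits; pass a different run callable for a dry run."""
    for split in ("train", "test"):
        run(convert_command(split))
```

Calling build_databases(print) prints the two commands without executing them.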

Definition of CNN

　I defined the structure of the CNN in the file "model/scene_recognition/train_val.prototxt". The file is based on "/home/ubuntu/buildspace/caffe-master/models/bvlc_reference_caffenet/train_val.prototxt". My file differs from the original in the layers "data" and "fc8": I replaced the layer "fc8" with a new layer "scene_fc8" and, following the explanation in this page, modified the parameters of "scene_fc8".
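
To give an idea of what the replaced layer may look like, the following sketch renders an "InnerProduct" layer named "scene_fc8" with 15 outputs, one per scene category. It is not a copy of my file: the learning-rate multipliers and fillers follow the convention of Caffe's fine-tuning example and are assumptions, as is the `layer { ... }` syntax of the Caffe version.

```python
def fine_tuned_fc_layer(name="scene_fc8", bottom="fc7", num_output=15):
    """Render a prototxt definition for the renamed final layer."""
    return f"""layer {{
  name: "{name}"
  type: "InnerProduct"
  bottom: "{bottom}"
  top: "{name}"
  param {{ lr_mult: 10 decay_mult: 1 }}  # weights: assumed multiplier
  param {{ lr_mult: 20 decay_mult: 0 }}  # bias: assumed multiplier
  inner_product_param {{
    num_output: {num_output}  # one output per scene category
    weight_filler {{ type: "gaussian" std: 0.01 }}
    bias_filler {{ type: "constant" value: 0 }}
  }}
}}
"""

layer_text = fine_tuned_fc_layer()
```

Renaming the layer matters because Caffe copies pre-trained weights by layer name: a layer with a new name starts from freshly initialized weights. Layers that take "fc8" as their input, such as "loss" and "accuracy", would have to refer to the new name as well.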

Definition of Solver

　The solver used for training the CNN is defined in the text file "model/scene_recognition/solver.prototxt", which is based on the file "models/bvlc_reference_caffenet/solver.prototxt".
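
A solver file is a flat list of "key: value" settings. The sketch below renders one from a Python dictionary. The values of "net", "test_interval" and "max_iter" and the name part "scene_train" of the snapshot prefix follow from this page; every other value, including the snapshot directory, is an assumption for illustration.

```python
ENUM_KEYS = {"solver_mode"}  # enum fields are written without quotes

def render_solver(settings):
    """Render a dict of solver settings as prototxt text."""
    lines = []
    for key, value in settings.items():
        if isinstance(value, str) and key not in ENUM_KEYS:
            value = f'"{value}"'
        lines.append(f"{key}: {value}")
    return "\n".join(lines) + "\n"

SOLVER = {
    # Values that follow from this page.
    "net": "model/scene_recognition/train_val.prototxt",
    "test_interval": 500,
    "max_iter": 80000,
    "snapshot_prefix": "model/scene_recognition/scene_train",  # directory assumed
    "solver_mode": "GPU",
    # Assumed values, not taken from the actual solver file.
    "test_iter": 61,
    "base_lr": 0.001,
    "lr_policy": "step",
    "gamma": 0.1,
    "stepsize": 20000,
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "snapshot": 10000,
}

solver_text = render_solver(SOLVER)
```

A lower "base_lr" than in the reference solver is a common choice for fine-tuning, since most layers start from weights that are already trained.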

Training

　I ran a script to train the CNN. The pre-trained model that Caffe provides, "models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel", is passed as the argument of the command option "-weights", so that the script fine-tunes the pre-trained model on the current dataset.
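
The training step amounts to one call of the caffe command-line tool with a solver file and the pre-trained weights. A minimal sketch that builds this call is given below; the location of the tool is an assumption, and the paths are the ones mentioned on this page.

```python
import subprocess

# Assumed location of the Caffe build; adjust to the local installation.
CAFFE_TOOL = "/home/ubuntu/buildspace/caffe-master/build/tools/caffe"

def train_command(solver, weights):
    """Build the 'caffe train' call that fine-tunes from existing weights."""
    return [CAFFE_TOOL, "train", "-solver", solver, "-weights", weights]

command = train_command(
    "model/scene_recognition/solver.prototxt",
    "models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel",
)

def run_training():
    """Start the training; this takes a long time on a real dataset."""
    subprocess.check_call(command)
```

Weights are copied into the layers whose names match those in the pre-trained model, so the layer "scene_fc8" is trained from scratch while the other layers are fine-tuned.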

Result

In the accuracy plot, the x-axis indicates the iteration number in units of 500 iterations and the y-axis the recognition accuracy. Because the total number of iterations is 80,000 and the solver is configured to output the accuracy once every 500 iterations, the maximum value on the x-axis is 160 (= 80,000/500). The recognition accuracy reaches about 95%.
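
The accuracy values for such a plot can be collected from the training log. The following sketch assumes that the log contains lines of the form "Test net output #0: accuracy = ..." and that the first test is run at iteration 0, which is Caffe's default behaviour; the function name is hypothetical.

```python
import re

# Matches the accuracy reported by the test net in a Caffe training log.
ACCURACY = re.compile(r"Test net output #\d+: accuracy = ([0-9.eE+-]+)")

def accuracy_curve(log_text, test_interval=500):
    """Return (iteration, accuracy) pairs in the order they were logged."""
    values = [float(m.group(1)) for m in ACCURACY.finditer(log_text)]
    return [(index * test_interval, acc) for index, acc in enumerate(values)]

# Index of the last point on the x-axis for 80,000 iterations.
LAST_INDEX = 80000 // 500
```

The index of each pair in the returned list is the x-axis value described above, and the first element of the pair is the corresponding iteration number.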

Construction of Classifier

　After the training, the file "scene_train_iter_80000.caffemodel" is created. It contains the weights of the fine-tuned CNN. To construct a classifier from this model file, one more file is needed, named "deploy.prototxt". It is made from the file "model/scene_recognition/train_val.prototxt" by the following procedure:

Remove the layer "data" and add four lines that describe the input dimensions in its place.
Remove the layers "loss" and "accuracy", and add a new layer in their place.

The four lines that replace the layer "data" mean the following:

input_dim: 20 --- batch size
input_dim: 3 --- channel number
input_dim: 227 --- height of an image
input_dim: 227 --- width of an image
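
The two changes can be sketched as text fragments. The sketch below renders the input description and a replacement output layer. The `input: "data"` line and the "Softmax" layer named "prob" follow the deploy file of Caffe's reference model; that my file uses exactly these is an assumption.

```python
def deploy_header(batch_size=20, channels=3, height=227, width=227):
    """Render the input description that replaces the layer "data"."""
    dims = (batch_size, channels, height, width)
    return 'input: "data"\n' + "".join(f"input_dim: {d}\n" for d in dims)

def prob_layer(bottom="scene_fc8"):
    """Render a softmax output layer that replaces "loss" and "accuracy"."""
    return f"""layer {{
  name: "prob"
  type: "Softmax"
  bottom: "{bottom}"
  top: "prob"
}}
"""

header_text = deploy_header()
prob_text = prob_layer()
```

The softmax layer turns the 15 outputs of "scene_fc8" into probabilities that sum to 1, which is what a classifier reads.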

The code to classify images is implemented in a file named "classifier.py". With it, I can now classify images.
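
A minimal sketch of such a classifier, using the Python interface of Caffe, is given below. It is not the original "classifier.py": the file locations, the optional mean file and the helper names are assumptions. Caffe and NumPy are imported inside the function so that the label helper can be used without them.

```python
# Label order taken from the table in the section "Data Augmentation".
CATEGORIES = [
    "MITcoast", "MIThighway", "MITmountain", "MITstreet", "MITforest",
    "MITinsidecity", "MITopencountry", "MITtallbuilding", "bedroom",
    "CALsuburb", "industrial", "kitchen", "livingroom", "PARoffice", "store",
]

def top_k(probabilities, k=3):
    """Return the k most probable (category, probability) pairs."""
    ranked = sorted(enumerate(probabilities), key=lambda p: p[1], reverse=True)
    return [(CATEGORIES[i], float(p)) for i, p in ranked[:k]]

def classify(image_path,
             model_def="deploy.prototxt",                # assumed location
             weights="scene_train_iter_80000.caffemodel",  # assumed location
             mean_file=None):                            # optional .npy mean image
    """Classify one image with the fine-tuned network."""
    import numpy as np
    import caffe

    caffe.set_mode_gpu()
    # Per-channel mean, if a mean image in (channel, height, width) order is given.
    mean = np.load(mean_file).mean(1).mean(1) if mean_file else None
    net = caffe.Classifier(model_def, weights, mean=mean,
                           channel_swap=(2, 1, 0),  # RGB -> BGR
                           raw_scale=255,           # [0, 1] -> [0, 255]
                           image_dims=(256, 256))
    image = caffe.io.load_image(image_path)
    probabilities = net.predict([image])[0]  # averaged over crops by default
    return top_k(probabilities)
```

Calling classify with the path of a scene image returns the three most probable categories with their probabilities.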

Sunday, September 6, 2015