Sunday, September 6, 2015

Scene Recognition by Caffe


Introduction


 In this post, I perform scene recognition with the Caffe library. I show that fine-tuning the pre-trained model that Caffe provides on scene images achieves a recognition accuracy of about 95%.

Computation Environment


 I used a g2.2xlarge instance on Amazon EC2, which is equipped with a GPU.

Dataset


 I trained the CNN using the dataset LSP15 from this page. The dataset consists of the following 15 directories:
  1. MITcoast
  2. MITforest
  3. MIThighway
  4. MITinsidecity
  5. MITmountain
  6. MITopencountry
  7. MITstreet
  8. MITtallbuilding
  9. bedroom
  10. CALsuburb
  11. industrial
  12. kitchen
  13. livingroom
  14. PARoffice
  15. store
The name of each directory represents a scene category. Each directory contains about 200 to 300 images belonging to that category.


Data Augmentation


 To augment the dataset, I added mirrored copies of the images to it. The images were then split into two groups, "train" and "test." Each image is 256 $\times$ 256 pixels with 3 channels. The number of images in each category is as follows:

label  name             number of train  number of test
0      MITcoast         610              100
1      MIThighway       440              70
2      MITmountain      630              100
3      MITstreet        490              80
4      MITforest        550              90
5      MITinsidecity    520              80
6      MITopencountry   690              110
7      MITtallbuilding  600              100
8      bedroom          360              60
9      CALsuburb        400              60
10     industrial       520              80
11     kitchen          360              60
12     livingroom       490              80
13     PARoffice        360              60
14     store            540              90
total                   7560             1220
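
The augmentation and split described above can be sketched in plain Python. This is only a minimal illustration, not the script used for this post: the image is a nested list of pixel values instead of a real image file, and the names `flipped_name`, `mirror_rows` and `split_train_test`, the test fraction, and the random seed are all hypothetical. Only the "_flipped" suffix is taken from the file names listed in this post.

```python
import random


def flipped_name(name):
    """Return the file name used for the mirrored copy of an image."""
    stem, dot, ext = name.rpartition(".")
    return stem + "_flipped" + dot + ext


def mirror_rows(pixels):
    """Mirror an image left to right; pixels is a list of rows."""
    return [list(reversed(row)) for row in pixels]


def split_train_test(names, test_fraction=0.14, seed=0):
    """Shuffle the names and split them into (train, test) lists.

    test_fraction and seed are assumed values, not taken from the post.
    """
    shuffled = sorted(names)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Ten hypothetical original images plus their mirrored copies.
originals = ["MITcoast_image_%04d.jpg" % i for i in range(1, 11)]
augmented = originals + [flipped_name(n) for n in originals]
train, test = split_train_test(augmented, test_fraction=0.2)
print(len(train), len(test))
```

With a real image library, `mirror_rows` would be replaced by that library's horizontal flip.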

Dataset for Caffe


 Caffe requires the following directories and files:
  1. a directory which contains training images
  2. a directory which contains test images
  3. a text file listing the names and labels of the training images
  4. a text file listing the names and labels of the test images
In my environment, they are located at the following paths:
  1. /home/ubuntu/data/caffe_256_15/train/
  2. /home/ubuntu/data/caffe_256_15/test/
  3. /home/ubuntu/data/caffe_256_15/train.txt
  4. /home/ubuntu/data/caffe_256_15/test.txt
The contents of the file "test.txt" are as follows:
MITstreet_image_0179_flipped.jpg 3
MITtallbuilding_image_0173_flipped.jpg 7
MITcoast_image_0126.jpg 0
store_image_0158_flipped.jpg 14
MITinsidecity_image_0102_flipped.jpg 5
MITforest_image_0200_flipped.jpg 4
industrial_image_0189_flipped.jpg 10
MITcoast_image_0142.jpg 0
kitchen_image_0019_flipped.jpg 11
bedroom_image_0210_flipped.jpg 8
bedroom_image_0116_flipped.jpg 8
livingroom_image_0008_flipped.jpg 12
kitchen_image_0051_flipped.jpg 11
MITstreet_image_0167_flipped.jpg 3
MITcoast_image_0315.jpg 0
....
The contents of the file "train.txt" are as follows:
industrial_image_0190.jpg 10
CALsuburb_image_0103_flipped.jpg 9
bedroom_image_0022_flipped.jpg 8
MITopencountry_image_0222.jpg 6
MITstreet_image_0040.jpg 3
MIThighway_image_0053_flipped.jpg 1
livingroom_image_0063_flipped.jpg 12
store_image_0106_flipped.jpg 14
industrial_image_0144.jpg 10
kitchen_image_0085_flipped.jpg 11
bedroom_image_0040.jpg 8
MIThighway_image_0088_flipped.jpg 1
industrial_image_0264.jpg 10
bedroom_image_0117_flipped.jpg 8
MITcoast_image_0021_flipped.jpg 0
...
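
Each line of these files is a file name followed by its integer label. A minimal sketch of how such lines could be generated is given here, assuming that the category is the part of the file name before "_image_" and that the labels follow the order of the table in the Data Augmentation section. The names `LABELS`, `label_of` and `list_line` are hypothetical.

```python
# Category names in label order (0 to 14), copied from the label table.
LABELS = [
    "MITcoast", "MIThighway", "MITmountain", "MITstreet", "MITforest",
    "MITinsidecity", "MITopencountry", "MITtallbuilding", "bedroom",
    "CALsuburb", "industrial", "kitchen", "livingroom", "PARoffice", "store",
]


def label_of(file_name):
    """Look up the label from the category prefix of the file name."""
    category = file_name.split("_image_")[0]
    return LABELS.index(category)


def list_line(file_name):
    """Format one line of train.txt or test.txt."""
    return "%s %d" % (file_name, label_of(file_name))


print(list_line("MITstreet_image_0179_flipped.jpg"))
# prints: MITstreet_image_0179_flipped.jpg 3
```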
After storing the images listed in the files "test.txt" and "train.txt" in the directories "test" and "train" respectively, I ran a script to create the dataset for Caffe. It outputs "test_leveldb" and "train_leveldb," which are the inputs for Caffe.
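
The conversion script is not included in this text. One plausible form, assuming it wraps Caffe's `convert_imageset` tool with the LevelDB backend, is sketched below. The tool location `build/tools/convert_imageset` and the place where the databases are written are assumptions; the function only assembles the command and does not run it.

```python
DATA_ROOT = "/home/ubuntu/data/caffe_256_15"
# Assumed location of the tool, relative to the Caffe root directory.
CONVERT_TOOL = "build/tools/convert_imageset"


def convert_command(split):
    """Build the convert_imageset call for the "train" or "test" split."""
    # The image directory needs a trailing slash because the tool
    # prepends it to each file name read from the list file.
    image_dir = DATA_ROOT + "/" + split + "/"
    list_file = DATA_ROOT + "/" + split + ".txt"
    db_name = split + "_leveldb"  # written to the current directory
    return [CONVERT_TOOL, "--backend=leveldb", image_dir, list_file, db_name]


for split in ("train", "test"):
    print(" ".join(convert_command(split)))
```

To actually run the conversion, each command list could be passed to `subprocess.check_call` from the Caffe root directory.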

Definition of CNN


 I defined the structure of the CNN in the file "model/scene_recognition/train_val.prototxt." The file is based on "/home/ubuntu/buildspace/caffe-master/models/bvlc_reference_caffenet/train_val.prototxt." My file differs from the original in the layers "data" and "fc8": I replaced the layer "fc8" with a new layer, "scene_fc8," and modified its parameters in accordance with the explanation in this page.
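
The layer definition is not included in this text. As a rough sketch, the new layer could look like the output of the helper below. It assumes that the layer takes "fc7" as its input, as "fc8" does in CaffeNet, that it has 15 outputs (one per category), and that it uses larger learning-rate multipliers than the pre-trained layers, as Caffe's fine-tuning example does. The multipliers 10 and 20 and the filler settings are assumptions, not values confirmed by this post, and `scene_fc8_layer` is a hypothetical name.

```python
def scene_fc8_layer(num_output=15, weight_lr_mult=10, bias_lr_mult=20):
    """Render a prototxt definition for the new last layer (assumed values)."""
    lines = [
        'layer {',
        '  name: "scene_fc8"',
        '  type: "InnerProduct"',
        '  bottom: "fc7"',
        '  top: "scene_fc8"',
        '  param { lr_mult: %d decay_mult: 1 }' % weight_lr_mult,
        '  param { lr_mult: %d decay_mult: 0 }' % bias_lr_mult,
        '  inner_product_param {',
        '    num_output: %d' % num_output,
        '    weight_filler { type: "gaussian" std: 0.01 }',
        '    bias_filler { type: "constant" value: 0 }',
        '  }',
        '}',
    ]
    return "\n".join(lines)


print(scene_fc8_layer())
```

Renaming the layer matters because Caffe copies pre-trained weights by layer name: a layer whose name does not appear in the pre-trained model is initialized from its fillers. The layers "loss" and "accuracy" must then take "scene_fc8" as their input instead of "fc8."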

Definition of Solver


 The solver file used for training the CNN is based on the file "models/bvlc_reference_caffenet/solver.prototxt." Its path is "model/scene_recognition/solver.prototxt."
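
The solver file is not included in this text. The sketch below renders only the fields whose values are stated or implied elsewhere in this post: the network file, the test interval of 500, the 80,000 iterations, and the snapshot prefix implied by the name of the resulting model file (its directory part, if any, is unknown). The remaining fields that a working solver needs, such as `test_iter`, `base_lr`, `lr_policy`, `momentum`, `weight_decay` and `snapshot`, are not given in the post and are deliberately left out. `render_solver` is a hypothetical helper.

```python
def render_solver(fields):
    """Render (name, value) pairs as lines of a solver.prototxt file."""
    lines = []
    for name, value in fields:
        if isinstance(value, str):
            value = '"%s"' % value  # string values are quoted in prototxt
        lines.append("%s: %s" % (name, value))
    return "\n".join(lines)


# Only values stated or implied in the post; the rest must be added.
known_fields = [
    ("net", "model/scene_recognition/train_val.prototxt"),
    ("test_interval", 500),
    ("max_iter", 80000),
    ("snapshot_prefix", "scene_train"),
]
print(render_solver(known_fields))
```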

Training


 I ran a script to train the CNN. The pre-trained model that Caffe provides, "models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel," is passed as the argument of the command-line option "-weights." The script fine-tunes the pre-trained model on the current dataset.
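
The training script is not included in this text. A minimal sketch of the call it presumably makes, assuming the standard `caffe train` command, is shown below. The tool location `build/tools/caffe` and the `-gpu 0` option are assumptions; the paths are the ones given in this post. The function only assembles the command.

```python
SOLVER = "model/scene_recognition/solver.prototxt"
WEIGHTS = "models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel"


def train_command(solver=SOLVER, weights=WEIGHTS, gpu=0):
    """Build the fine-tuning command (tool path and GPU id are assumed)."""
    return [
        "build/tools/caffe", "train",
        "-solver", solver,    # training settings
        "-weights", weights,  # pre-trained model to start from
        "-gpu", str(gpu),
    ]


print(" ".join(train_command()))
```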

Result


In the plot of the result, the x-axis indicates the iteration number in units of 500 iterations and the y-axis the recognition accuracy. Because the total number of iterations is 80,000 and the solver outputs the accuracy once every 500 iterations, the maximum value of the x-axis is 160 (= 80,000/500). The recognition accuracy reaches about 95%.
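
The relation between an x-axis value and the iteration number can be written out as a small sketch. The names below are hypothetical; the two constants come from the text above.

```python
MAX_ITER = 80000      # total number of iterations
TEST_INTERVAL = 500   # the accuracy is output once per this many iterations


def iteration_at(x):
    """Convert an x-axis value of the plot to an iteration number."""
    return x * TEST_INTERVAL


last_x = MAX_ITER // TEST_INTERVAL
print(last_x, iteration_at(last_x))
# prints: 160 80000
```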

Construction of Classifier


 After the training, the file "scene_train_iter_80000.caffemodel" is created. It stores the fine-tuned CNN. To construct a classifier from this model file, a network definition file named "deploy.prototxt" is needed. It is made from the file "model/scene_recognition/train_val.prototxt" by the following procedure.
  1. Remove the layer "data," and add the four lines shown below.
  2. Remove the layers "loss" and "accuracy," and add a new output layer in their place.
The four lines that replace the layer "data" mean:
  1. input_dim: 20 --- batch size
  2. input_dim: 3 --- channel number
  3. input_dim: 227 --- height of an image
  4. input_dim: 227 --- width of an image
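
As a rough sketch, the parts of "deploy.prototxt" that differ from "train_val.prototxt" could be generated as follows. Caffe orders the dimensions of an input blob as batch size, channels, height, width. The `input: "data"` line that names the input blob and the Softmax layer "prob" added at the end are assumptions based on how Caffe's own deploy files of that period are written; the layer actually added in this post is not included in this text. `deploy_header` and `PROB_LAYER` are hypothetical names.

```python
def deploy_header(batch_size=20, channels=3, height=227, width=227):
    """Render the lines that replace the layer "data"."""
    lines = ['input: "data"']  # assumed name of the input blob
    for dim in (batch_size, channels, height, width):
        lines.append("input_dim: %d" % dim)
    return "\n".join(lines)


# Assumed output layer that turns the scores into class probabilities.
PROB_LAYER = """layer {
  name: "prob"
  type: "Softmax"
  bottom: "scene_fc8"
  top: "prob"
}"""

print(deploy_header())
print(PROB_LAYER)
```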
The code that classifies an image is implemented in the file "classifier.py." With it, I can now classify images.
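
The contents of "classifier.py" are not included in this text. A minimal sketch of such a classifier, assuming Caffe's Python interface (pycaffe) and its `caffe.Classifier` class, might look like the following. The function names, the image size passed as `image_dims`, and the omission of mean subtraction are assumptions. In practice, the same mean that was used during training should be passed through the `mean` argument.

```python
# Category names in label order (0 to 14), copied from the label table.
LABELS = [
    "MITcoast", "MIThighway", "MITmountain", "MITstreet", "MITforest",
    "MITinsidecity", "MITopencountry", "MITtallbuilding", "bedroom",
    "CALsuburb", "industrial", "kitchen", "livingroom", "PARoffice", "store",
]


def top_label(probabilities):
    """Return (label, category name) of the most probable class."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best, LABELS[best]


def classify(image_path, deploy_file, model_file):
    """Classify one image file; requires pycaffe to be installed."""
    import caffe  # imported here so that top_label works without Caffe

    net = caffe.Classifier(
        deploy_file, model_file,
        image_dims=(256, 256),   # images are 256 x 256 in this dataset
        raw_scale=255,           # load_image returns values in [0, 1]
        channel_swap=(2, 1, 0),  # RGB to BGR, as CaffeNet expects
    )
    image = caffe.io.load_image(image_path)
    # predict returns one row of class probabilities per input image.
    probabilities = net.predict([image], oversample=True)[0]
    return top_label(probabilities)
```

A call would then look like `classify("some_image.jpg", "deploy.prototxt", "scene_train_iter_80000.caffemodel")`, where the image name is a placeholder.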
