Automatic Image Caption Generation in Python with Google Colaboratory and a Pretrained Model

"I want an easy way to try out image recognition."

This post is written for readers with that goal.

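This time, we will generate image captions.

Image caption generation takes an image as input and outputs text that describes its content.

Roughly speaking, the technique is a model that combines a CNN with an LSTM.
A CNN (convolutional neural network) extracts a feature vector from the image, and an LSTM (long short-term memory) text generation model produces a sentence conditioned on that feature vector. The result is a description of the image.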
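To make the "generate a sentence word by word" idea concrete, here is a toy sketch of a greedy decoding loop. It is not a neural network: a hypothetical lookup table (`NEXT_WORD_SCORES`) stands in for the trained LSTM, and it looks only at the previous word, whereas a real decoder also uses the image feature vector and its hidden state. All names and numbers are made up for illustration.

```python
# Toy illustration only: a lookup table stands in for the trained LSTM decoder.
# Every name and score below is hypothetical.
NEXT_WORD_SCORES = {
    "<start>": {"a": 0.9, "the": 0.1},
    "a": {"cat": 0.6, "dog": 0.4},
    "cat": {"sitting": 0.7, "<end>": 0.3},
    "sitting": {"<end>": 1.0},
}


def greedy_decode(scores, max_len=20):
    """Repeatedly pick the highest-scoring next word until <end> or max_len."""
    words = ["<start>"]
    for _ in range(max_len):
        candidates = scores.get(words[-1])
        if not candidates:
            break  # no known continuation for this word
        next_word = max(candidates, key=candidates.get)
        words.append(next_word)
        if next_word == "<end>":
            break
    return words


caption = greedy_decode(NEXT_WORD_SCORES)
print(" ".join(caption))  # <start> a cat sitting <end>
```

The `max_len` cap plays the same role as a maximum sequence length in a real decoder: it guarantees the loop stops even if `<end>` is never chosen.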

We use the pretrained model published on the page below to generate the captions, with Google Colaboratory as the execution environment. That page also describes how to set up the environment.

pytorch-tutorial/tutorials/03-advanced/image_captioning at master · yunjey/pytorch-tutorial


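Google Colaboratory is a Python environment that runs on Google's virtual machines.
Anyone with a Google account can use it for free, and I consider it an excellent environment for machine learning and similar work.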

Preparing Google Colaboratory
Installing the tools from Git
Preparing the data
Output

Preparing Google Colaboratory

・Create a Google account.

・Open Google Drive and click "New" → "More" → "Google Colaboratory". This launches Colaboratory.

・Once Colaboratory has started, enter the following code in a Colaboratory cell and run it.
This mounts Google Drive.

from google.colab import drive
drive.mount('/content/drive')


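・After you run it, you are prompted for an authorization code. Open the URL shown after "Go to this URL in a browser", choose your Google account, and copy the authorization code that is displayed. Paste it into the input field and press Enter. Google Drive is now mounted.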

Installing the tools from Git

・Move to My Drive with the cd command.

cd /content/drive/My Drive
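Depending on when the notebook is run, the Drive folder under the mount point may be named "MyDrive" rather than "My Drive". The following is a small sketch for finding whichever one exists; the helper name `find_drive_root` and the candidate folder names are assumptions for illustration, not part of Colaboratory's API.

```python
from pathlib import Path


def find_drive_root(mount_point="/content/drive", names=("MyDrive", "My Drive")):
    """Return the first existing Drive folder under mount_point, or None."""
    for name in names:
        candidate = Path(mount_point) / name
        if candidate.is_dir():
            return candidate
    return None


root = find_drive_root()
print(root if root else "Drive does not appear to be mounted")
```

If a folder is found, that is the path to pass to the cd command.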


Next, install the tools.

!git clone https://github.com/pdollar/coco.git


Then move to coco/PythonAPI.

cd coco/PythonAPI/


Run the make command.

!make


Build and install the Python package.

!python setup.py build
!python setup.py install


Return to My Drive with the cd command.

cd ../../


・Clone the tool needed for image captioning from GitHub.

!git clone https://github.com/yunjey/pytorch-tutorial.git


Move to the image_captioning folder of the cloned repository.

cd pytorch-tutorial/tutorials/03-advanced/image_captioning/


Install the Python libraries that the tool requires.

!pip install -r requirements.txt
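After the install, it can be useful to confirm that the modules the caption script later in this post imports are actually available. The sketch below uses the standard library's `importlib.util.find_spec`; the module list is taken from that script's import lines, and the helper name `missing_modules` is hypothetical.

```python
from importlib.util import find_spec

# Top-level module names taken from the import lines of the caption script.
MODULES = ["torch", "torchvision", "PIL", "matplotlib", "numpy", "googletrans"]


def missing_modules(names=MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


print(missing_modules())
```

Any name that is printed would need to be installed with pip before the script can run.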


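Preparing the data

In Google Drive (opened here in Google Chrome), create the models, png, and data folders. The commands and the script in this post refer to them by paths relative to the image_captioning folder, so that is presumably where they belong. Skip any folder that already exists.

Download the pretrained model data.
Download the two model files from the page below.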

pretrained_model.zip (shared via Dropbox)

・encoder-5-3000.pkl
・decoder-5-3000.pkl

Save them in the models folder created earlier.

Next, download the text (vocabulary) data.
Download it from the page below.

vocap.zip (shared via Dropbox)

・vocab.pkl

Save this file in the data folder created earlier. Preparation is now complete.
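A quick way to confirm the files ended up where the scripts expect them is to check for each path. This is a minimal sketch: the file names come from the steps above, while the helper name `missing_files` and the assumption that the current directory is the image_captioning folder are mine.

```python
from pathlib import Path

# File names from the steps above; base_dir is assumed to be the
# image_captioning folder that the notebook has moved into.
REQUIRED_FILES = [
    "models/encoder-5-3000.pkl",
    "models/decoder-5-3000.pkl",
    "data/vocab.pkl",
]


def missing_files(base_dir=".", required=REQUIRED_FILES):
    """Return the required relative paths that do not exist under base_dir."""
    base = Path(base_dir)
    return [rel for rel in required if not (base / rel).is_file()]


missing = missing_files()
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All model and vocabulary files are in place.")
```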

You can check that everything works by running sample.py.

!python sample.py --image=png/example.png


sample.py accepts only a single image file. To handle multiple images, I wrote a program that reads every image in the png folder, generates a caption for each one, and translates the result into Japanese with googletrans.
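Before the full program, the folder scan by itself can be sketched as follows. This variant filters by file extension so that stray non-image files in the folder are skipped; the helper name `list_images` and the set of extensions are assumptions for illustration, not something the original program does.

```python
from pathlib import Path

# Assumed set of image formats; adjust to whatever the folder contains.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def list_images(folder="png", suffixes=IMAGE_SUFFIXES):
    """Return image files in folder, sorted by path; empty list if folder is absent."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )


for path in list_images():
    print(path)
```

Sorting keeps the captions in a predictable order from run to run.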

The source code is below.

import glob
import pickle

import matplotlib.pyplot as plt
import numpy as np
import torch
from PIL import Image
from googletrans import Translator
from torchvision import transforms

from build_vocab import Vocabulary  # not called directly, but needed so pickle can rebuild the vocab object
from model import EncoderCNN, DecoderRNN


# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def load_image(image_path, transform=None):
  image = Image.open(image_path).convert('RGB')
  image = image.resize([224, 224], Image.LANCZOS)
    
  if transform is not None:
      image = transform(image).unsqueeze(0)
    
  return image

def main(args, vocab):
  # Image preprocessing
  transform = transforms.Compose([
      transforms.ToTensor(), 
      transforms.Normalize((0.485, 0.456, 0.406), 
                            (0.229, 0.224, 0.225))])

  # Build models
  encoder = EncoderCNN(args['embed_size']).eval()  # eval mode (batchnorm uses moving mean/variance)
  decoder = DecoderRNN(args['embed_size'], args['hidden_size'], len(vocab), args['num_layers'])
  encoder = encoder.to(device)
  decoder = decoder.to(device)

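  # Load the trained model parameters (map_location lets GPU-saved weights load on a CPU runtime)
  encoder.load_state_dict(torch.load(args['encoder_path'], map_location=device))
  decoder.load_state_dict(torch.load(args['decoder_path'], map_location=device))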

  # Prepare an image
  image = load_image(args['image'], transform)
  image_tensor = image.to(device)
    
  # Generate a caption from the image (no gradients are needed for inference)
  with torch.no_grad():
      feature = encoder(image_tensor)
      sampled_ids = decoder.sample(feature)
  sampled_ids = sampled_ids[0].cpu().numpy()          # (1, max_seq_length) -> (max_seq_length)
    
  # Convert word_ids to words
  sampled_caption = []
  for word_id in sampled_ids:
      word = vocab.idx2word[word_id]
      sampled_caption.append(word)
      if word == '<end>':
          break
  sentence = ' '.join(sampled_caption)
    
  # Print out the image and the generated caption

  image = Image.open(args['image'])
  plt.imshow(np.asarray(image))
  plt.show()
    
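  # Note: this assumes a googletrans release whose translate() is synchronous
  translator = Translator()
  # Drop the <start>/<end> marker tokens. str.strip('<start>') removes a set of
  # characters rather than the literal token, so filter the word list instead.
  sentence = ' '.join(w for w in sampled_caption if w not in ('<start>', '<end>'))
  res = translator.translate(sentence, dest='ja')
  print(res.text)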
    
if __name__ == '__main__':

  args = {
      'image': '',
      'encoder_path': 'models/encoder-5-3000.pkl',
      'decoder_path': 'models/decoder-5-3000.pkl',
      'vocab_path': 'data/vocab.pkl',
      'embed_size': 256,
      'hidden_size': 512,
      'num_layers': 1,
  }

  #print(args)

  # Load vocabulary wrapper
  with open(args['vocab_path'], 'rb') as f:
      vocab = pickle.load(f)
    
  # Caption every file in the png folder, in a stable (sorted) order
  for image_path in sorted(glob.glob('png/*')):
      args['image'] = image_path
      main(args, vocab)
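When removing the <start> and <end> markers from a generated caption, note that str.strip takes a set of characters to trim from both ends, not a literal token. The sketch below compares chained strip calls with whole-word filtering; the function names and the sample caption are hypothetical.

```python
def strip_markers_naive(sentence):
    # Chained str.strip: each call removes any of the listed characters
    # from both ends of the string, not the literal token.
    return sentence.strip('<start>').strip('<end>')


def strip_markers(sentence, markers=('<start>', '<end>')):
    """Drop marker tokens by comparing whole words."""
    return ' '.join(word for word in sentence.split() if word not in markers)


# Hypothetical caption that reached the length limit before <end> was produced.
caption = '<start> a man on a street'
print(repr(strip_markers_naive(caption)))  # ' a man on a str'
print(repr(strip_markers(caption)))        # 'a man on a street'
```

With the naive version, the trailing letters of "street" are eaten because they happen to appear in the marker strings. Comparing whole words avoids that.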

Output

The results were as follows.
The images come from the following website.
Free stock photo site Pakutaso (www.pakutaso.com)

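The captions do not describe the input images perfectly, but the model does seem to pick up fairly fine details, such as telling apart men, women, cats, and birds, and recognizing a person's posture.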

As this shows, resources such as Google Colaboratory make it easy to try out image recognition, so do give it a try.