Android SpeechRecognizerで常時オフライン音声認識をする

2019-07-04 2021-07-20

cryptocat

Androidでの音声認識

Androidには"OK Google"で会話できるGoogleアシスタントがありますが、音声をテキストに変換する"Speech-To-Text"(STT)機能があります。

キーボードの音声入力で使えますが、SpeechRecognizerというAPIで自分のアプリに組み込む事ができます。

GoogleアシスタントもSpeechRecognizerも通常はオンラインで音声認識させますが、SpeechRecognizerは端末にオフライン用の音声認識モデルをダウンロードしておく事でオフラインでも音声認識を行う事ができます。

こちらの方の記事を参考にしました。

Android Speech Recognizerを使いこなす

連続音声認識っぽくなったAndroid SpeechRecognizer速報

サンプル仕様

今回、音声認識精度のテスト用にSpeechRecognizerで常時音声認識するサンプルを作成しました。

IDE:Android Studio 3.4.1
言語:Kotlin
動作確認:Android 6.0/9.0

動作仕様

アプリ起動後音声認識開始(マイクのパーミッションが必要)
起動時にマイクのパーミッションが無ければリクエストする
音声認識結果をTextViewに10個まで表示
音声認識後、再度音声認識開始(ループ)

プロジェクトはGitHubに公開してあります。

SpeechRecognizerSample

事前準備

オフラインで音声認識するには事前にオフライン用の音声認識モデルをダウンロードしておく必要があります。

オフラインの音声認識メニューのすべてのタブから日本語を選択してインストールしてください。

これが無いとオフライン音声認識が失敗します。

Android6.0

設定 -> 言語と入力 -> Google音声入力 -> オフラインの音声認識

Android9.0

画面と階層が異なります。

設定 -> システム -> 言語と入力 -> 仮想キーボード -> Google音声入力 -> オフラインの音声認識

コード

RECORD_AUDIOのパーミッションが必要です。

オンラインで音声認識を行う場合はINTERNETのパーミッションも必要です。

<?xml version="1.0" encoding="utf-8"?>
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.cryptocat.speechrecognizersample" >
    <uses-permission android:name="android.permission.RECORD_AUDIO" />
    <!--<uses-permission android:name="android.permission.INTERNET" />-->

    <application
        android:allowBackup="true"
        android:icon="@mipmap/ic_launcher"
        android:label="@string/app_name"
        android:roundIcon="@mipmap/ic_launcher_round"
        android:supportsRtl="true"
        android:theme="@style/AppTheme" >
        <activity android:name=".MainActivity" >
            <intent-filter>
                <action android:name="android.intent.action.MAIN" />

                <category android:name="android.intent.category.LAUNCHER" />
            </intent-filter>
        </activity>
    </application>

</manifest>

レイアウトはTextViewを置いてるだけです。

<?xml version="1.0" encoding="utf-8"?>
<androidx.constraintlayout.widget.ConstraintLayout
    xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:tools="http://schemas.android.com/tools"
    xmlns:app="http://schemas.android.com/apk/res-auto"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    tools:context=".MainActivity">

    <TextView
            android:id="@+id/textViewSpeech"
            android:layout_width="0dp"
            android:layout_height="0dp"
            android:textFontWeight="20"
            app:layout_constraintStart_toStartOf="parent" android:layout_marginStart="8dp"
            app:layout_constraintTop_toTopOf="parent" android:layout_marginTop="8dp"
            app:layout_constraintBottom_toBottomOf="parent" android:layout_marginBottom="8dp"
            app:layout_constraintEnd_toEndOf="parent" android:layout_marginEnd="8dp" android:textSize="36sp"/>
</androidx.constraintlayout.widget.ConstraintLayout>

apply plugin: 'com.android.application'

apply plugin: 'kotlin-android'

apply plugin: 'kotlin-android-extensions'

android {
    compileSdkVersion 29
    buildToolsVersion "29.0.0"
    defaultConfig {
        applicationId "com.cryptocat.speechrecognizersample"
        minSdkVersion 23
        targetSdkVersion 29
        versionCode 1
        versionName "1.0"
        testInstrumentationRunner "androidx.test.runner.AndroidJUnitRunner"
    }
    buildTypes {
        release {
            minifyEnabled false
            proguardFiles getDefaultProguardFile('proguard-android-optimize.txt'), 'proguard-rules.pro'
        }
    }
}

dependencies {
    implementation fileTree(dir: 'libs', include: ['*.jar'])
    implementation "org.jetbrains.kotlin:kotlin-stdlib-jdk7:$kotlin_version"
    implementation 'androidx.appcompat:appcompat:1.0.2'
    implementation 'androidx.core:core-ktx:1.0.2'
    implementation 'androidx.constraintlayout:constraintlayout:1.1.3'
    testImplementation 'junit:junit:4.12'
    androidTestImplementation 'androidx.test:runner:1.2.0'
    androidTestImplementation 'androidx.test.espresso:espresso-core:3.2.0'
    implementation 'org.jetbrains.kotlinx:kotlinx-coroutines-core:1.0.1'
}

オンラインで音声認識を行う場合はspeechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_PREFER_OFFLINE, true)を削除します。

package com.cryptocat.speechrecognizersample

import android.Manifest
import android.content.Intent
import android.content.pm.PackageManager
import android.os.Bundle
import android.speech.RecognitionListener
import android.speech.RecognizerIntent
import android.speech.SpeechRecognizer
import android.util.Log
import androidx.appcompat.app.AppCompatActivity
import androidx.core.app.ActivityCompat
import androidx.core.content.ContextCompat
import kotlinx.android.synthetic.main.activity_main.*
import kotlinx.coroutines.GlobalScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import java.util.*

class MainActivity : AppCompatActivity() {
    private val TAG = "RecognitionListener"
    private val RECORD_REQUEST_CODE = 101

    private var mSpeechRecognizer : SpeechRecognizer? = null
    private var speechRecognizerIntent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH)

    private var listItems = mutableListOf("")

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)

        listItems.removeAt(0)

        speechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, Locale.getDefault().getLanguage())
        speechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM)
        speechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_CALLING_PACKAGE, this.packageName)
        speechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_PREFER_OFFLINE, true)
        speechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_PARTIAL_RESULTS, true)
        speechRecognizerIntent.putExtra(RecognizerIntent.EXTRA_MAX_RESULTS, 1)
        mSpeechRecognizer = SpeechRecognizer.createSpeechRecognizer(applicationContext)
        mSpeechRecognizer?.setRecognitionListener(object : RecognitionListener {
            override fun onReadyForSpeech(params: Bundle?) {
                Log.d(TAG, "onReadyForSpeech")
            }

            override fun onRmsChanged(rmsdB: Float) {
                Log.d(TAG, "onRmsChanged")
            }

            override fun onBufferReceived(buffer: ByteArray?) {
                Log.d(TAG, "onBufferReceived")
            }

            override fun onBeginningOfSpeech() {
                Log.d(TAG, "onBeginningOfSpeech")
            }

            override fun onEndOfSpeech() {
                Log.d(TAG, "onEndOfSpeech")
            }

            override fun onError(error: Int) {
                var errorCode = ""
                when (error) {
                    SpeechRecognizer.ERROR_AUDIO -> errorCode = "Audio recording error"
                    SpeechRecognizer.ERROR_CLIENT -> errorCode = "Other client side errors"
                    SpeechRecognizer.ERROR_INSUFFICIENT_PERMISSIONS -> errorCode = "Insufficient permissions"
                    SpeechRecognizer.ERROR_NETWORK -> errorCode = "Network related errors"
                    SpeechRecognizer.ERROR_NETWORK_TIMEOUT -> errorCode = "Network operation timed out"
                    SpeechRecognizer.ERROR_NO_MATCH -> errorCode = "No recognition result matched"
                    SpeechRecognizer.ERROR_RECOGNIZER_BUSY -> errorCode = "RecognitionService busy"
                    SpeechRecognizer.ERROR_SERVER -> errorCode = "Server sends error status"
                    SpeechRecognizer.ERROR_SPEECH_TIMEOUT -> errorCode = "No speech input"
                }
                Log.d("RecognitionListener", "onError:" + errorCode)
                try {
                    GlobalScope.launch {
                        runOnUiThread {
                            mSpeechRecognizer?.cancel()
                        }
                        delay(1000)
                        runOnUiThread {
                            mSpeechRecognizer?.startListening(speechRecognizerIntent)
                        }
                    }
                } catch (ex: Exception) {

                }
            }

            override fun onEvent(eventType: Int, params: Bundle?) {
                Log.d(TAG, "onEvent")
            }

            override fun onPartialResults(partialResults: Bundle) {
                Log.d(TAG, "onPartialResults")
                val result = partialResults.getStringArrayList(android.speech.SpeechRecognizer.RESULTS_RECOGNITION)
                Log.i(TAG, "onPartialResults" + result.toString())
            }

            override fun onResults(results: Bundle) {
                Log.d(TAG, "onResults")

                val result = results.getStringArrayList(android.speech.SpeechRecognizer.RESULTS_RECOGNITION)

                Log.i(TAG, "onResults" + result.toString())
                listItems.add(result.toString())
                if (listItems.count() > 10){
                    listItems.removeAt(0)
                }
                var text = ""
                listItems.forEach {
                    text += it + "\n"
                }
                try {
                    GlobalScope.launch {
                        runOnUiThread {
                            textViewSpeech.text = text
                        }
                        runOnUiThread {
                            mSpeechRecognizer?.startListening(speechRecognizerIntent)
                        }
                    }
                } catch (ex: Exception) {

                }
            }
        })

        val permission = ContextCompat.checkSelfPermission(this, Manifest.permission.RECORD_AUDIO)
        if (permission != PackageManager.PERMISSION_GRANTED) {
            ActivityCompat.requestPermissions(this, arrayOf(Manifest.permission.RECORD_AUDIO), RECORD_REQUEST_CODE)
        }
        else {
            Log.i(TAG, "Permission is granted")
            mSpeechRecognizer?.startListening(speechRecognizerIntent)
            Log.i(TAG, "Start listening")
        }
    }

    override fun onRequestPermissionsResult(requestCode: Int, permissions: Array<out String>, grantResults: IntArray) {
        if (requestCode == RECORD_REQUEST_CODE) {
            if (grantResults.isEmpty() || grantResults[0] != PackageManager.PERMISSION_GRANTED){
                Log.i(TAG, "Permission has been denied by user")
            }
            else{
                Log.i(TAG, "Permission has been granted by user")
                mSpeechRecognizer?.startListening(speechRecognizerIntent)
                Log.i(TAG, "Start listening")
            }
        }
    }
}

ライフサイクルでは特に処理していませんが、onPauseでmSpeechRecognizer?.cancel()やonResumeでmSpeechRecognizer?.startListening(speechRecognizerIntent)してやるとよいと思います。

音声認識結果やエラー時はピロンと音がしますので、うるさい時は音量をミュートしてください。

この記事を書いた人

cryptocat

組み込み系がメインのソフトウェアエンジニア扱える言語・環境： C VBA C# (.Net Framework、Xamarin) Java (Android) kotlin (Android) Swift (iOS) JavaScript Python 勉強中: HTML CSS Ruby on Rails PHP 所持している資格：エンベデッドシステムスペシャリスト(ES) 応用情報技術者(AP) 基本情報技術者(FE) 情報セキュリティマネジメント AWS Certified Cloud Practitioner(CLF) AWS Certified Solution Architect Associate(SAA)