SpeechKit Recognition gRPC API: Recognizer

A set of methods for voice recognition.

- `RecognizeStreaming`: Expects audio in real time.

Calls

RecognizeStreaming

Expects audio in real time.

rpc RecognizeStreaming (stream StreamingRequest) returns (stream StreamingResponse)

StreamingRequest

- `Event` (oneof: `session_options`, `chunk`, `silence_chunk`, or `eou`)
  - `session_options` (StreamingOptions): Session options. Must be the first message from the user.
  - `chunk` (AudioChunk): Chunk with audio data.
  - `silence_chunk` (SilenceChunk): Chunk with silence.
  - `eou` (Eou): Request to end the current utterance. Works only with an external EOU detector.
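The message order above can be sketched as follows. This is a minimal illustration using plain dicts in place of the classes generated from the SpeechKit `.proto` files; the function name `streaming_requests` and the dict shapes are assumptions for the sketch, not part of the API.

```python
def streaming_requests(session_options, audio_chunks, use_external_eou=False):
    """Yield requests in the order the server expects:
    session options first, then audio chunks, optionally an explicit EOU."""
    # The first message must carry the session options.
    yield {"session_options": session_options}
    # Each subsequent message carries one chunk of audio data.
    for data in audio_chunks:
        yield {"chunk": {"data": data}}
    # An explicit Eou message is honored only with the external EOU classifier.
    if use_external_eou:
        yield {"eou": {}}

requests = list(
    streaming_requests({"recognition_model": {}},
                       [b"\x00\x01", b"\x02\x03"],
                       use_external_eou=True)
)
```

A real client would pass such a generator to the `RecognizeStreaming` stub, which consumes request messages as the audio arrives.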

StreamingOptions

- `recognition_model` (RecognitionModelOptions): Configuration for the speech recognition model.
- `eou_classifier` (EouClassifierOptions): Configuration for the end-of-utterance detection model.
- `recognition_classifier` (RecognitionClassifierOptions): Configuration for classifiers applied on top of speech recognition.

RecognitionModelOptions

- `model` (string): Reserved for future use; do not set.
- `audio_format` (AudioFormatOptions): Format of the input audio.
- `text_normalization` (TextNormalizationOptions): Text normalization options.
- `language_restriction` (LanguageRestrictionOptions): Possible languages in the audio.
- `audio_processing_type` (enum AudioProcessingType): How to process the audio data (in real time, after all data is received, etc.). Default is `REAL_TIME`.

AudioFormatOptions

- `AudioFormat` (oneof: `raw_audio` or `container_audio`)
  - `raw_audio` (RawAudio): Audio without a container.
  - `container_audio` (ContainerAudio): Audio wrapped in a container.

RawAudio

- `audio_encoding` (enum AudioEncoding): Type of audio encoding.
  - `LINEAR16_PCM`: 16-bit signed little-endian samples (linear PCM).
- `sample_rate_hertz` (int64): PCM sample rate.
- `audio_channel_count` (int64): PCM channel count. Currently, only single-channel audio is supported in real-time recognition.
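For `LINEAR16_PCM`, every sample occupies 2 bytes per channel, so chunk size follows directly from the sample rate and the chunk duration. A small sketch (the 100 ms chunk duration is an illustrative choice, not a requirement of the API):

```python
BYTES_PER_SAMPLE = 2  # LINEAR16_PCM: 16-bit signed little-endian samples

def chunk_size_bytes(sample_rate_hertz, chunk_duration_ms, audio_channel_count=1):
    """Number of bytes in a raw PCM chunk of the given duration."""
    samples = sample_rate_hertz * chunk_duration_ms // 1000
    return samples * BYTES_PER_SAMPLE * audio_channel_count

# 100 ms of 16 kHz mono audio:
size = chunk_size_bytes(16000, 100)  # 3200 bytes
```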

ContainerAudio

- `container_audio_type` (enum ContainerAudioType): Type of audio container.
  - `WAV`: 16-bit signed little-endian samples (linear PCM) in a WAV container.
  - `OGG_OPUS`: Data encoded with the Opus audio codec in an OGG container.
  - `MP3`: Data encoded as MPEG-1/2 Layer III (MP3).

TextNormalizationOptions

- `text_normalization` (enum TextNormalization): Normalization mode.
  - `TEXT_NORMALIZATION_ENABLED`: Enable normalization.
  - `TEXT_NORMALIZATION_DISABLED`: Disable normalization.
- `profanity_filter` (bool): Profanity filter (default: false).
- `literature_text` (bool): Rewrite text in literary style (default: false).
- `phone_formatting_mode` (enum PhoneFormattingMode): Phone number formatting mode.
  - `PHONE_FORMATTING_MODE_DISABLED`: Disable phone formatting.

LanguageRestrictionOptions

- `restriction_type` (enum LanguageRestrictionType): How to interpret the language list.
  - `WHITELIST`: Allow list. The incoming audio may contain only the listed languages.
  - `BLACKLIST`: Deny list. The incoming audio must not contain the listed languages.
- `language_code` ([]string)

EouClassifierOptions

- `Classifier` (oneof: `default_classifier` or `external_classifier`): Type of EOU classifier.
  - `default_classifier` (DefaultEouClassifier): EOU classifier provided by SpeechKit (the default).
  - `external_classifier` (ExternalEouClassifier): EOU is triggered by explicit messages from the user.

DefaultEouClassifier

- `type` (enum EouSensitivity): EOU sensitivity. Currently two levels: a faster one with more errors, and a more conservative one (the default).
- `max_pause_between_words_hint_ms` (int64): Hint for the maximum pause between words. The EOU detector can use this information to distinguish end of utterance from slow speech (like "one... two... three").

ExternalEouClassifier

Empty.

RecognitionClassifierOptions

- `classifiers` ([]RecognitionClassifier): List of classifiers to use.

RecognitionClassifier

- `classifier` (string): Classifier name.
- `triggers` ([]enum TriggerType): Types of responses that classification results are attached to.

AudioChunk

- `data` (bytes): Bytes with audio data.

SilenceChunk

- `duration_ms` (int64): Duration of the silence chunk, in milliseconds.
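A client that does its own voice-activity detection can send a silence chunk instead of a run of zero-valued samples, saving bandwidth. A minimal sketch, with the message modeled as a dict and the function name `silence_chunk_for` being a hypothetical helper:

```python
def silence_chunk_for(samples_skipped, sample_rate_hertz):
    """Represent a run of skipped silent PCM samples as a SilenceChunk dict
    (a stand-in for the protobuf message)."""
    duration_ms = samples_skipped * 1000 // sample_rate_hertz
    return {"silence_chunk": {"duration_ms": duration_ms}}

# One second of silence at 16 kHz:
msg = silence_chunk_for(16000, 16000)
```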

Eou

Empty.

StreamingResponse

- `session_uuid` (SessionUuid): Session identifier.
- `audio_cursors` (AudioCursors): Progress of the streaming recognition session: how much data has been received, final and partial times, and so on.
- `response_wall_time_ms` (int64): Wall-clock time on the server side: the moment the server wrote the results to the stream.
- `Event` (oneof: `partial`, `final`, `eou_update`, `final_refinement`, `status_code`, or `classifier_update`)
  - `partial` (AlternativeUpdate): Partial results; the server sends them regularly once enough audio data has been received. This is the current text estimate from `final_time_ms` to `partial_time_ms`; it may change as new data arrives.
  - `final` (AlternativeUpdate): Final results; the recognition is now fixed up to `final_time_ms`. Currently, a final is sent only when an EOU event is triggered; this may change in future releases.
  - `eou_update` (EouUpdate): Sent after the EOU classifier fires. The server sends a final, then an EouUpdate with the time of the EOU; before each eou_update the server sends a final with the same time, and there can be several finals before an eou_update.
  - `final_refinement` (FinalRefinement): For each final, if normalization is enabled, the server sends the normalized text (or the result of other advanced post-processing). Final normalization introduces additional latency.
  - `status_code` (StatusCode): Status messages, sent by the server at a fixed interval (keep-alive).
  - `classifier_update` (RecognitionClassifierUpdate): Result of a triggered classifier.
- `channel_tag` (string): Tag to distinguish audio channels.
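Handling the response event oneof can be sketched as follows. Responses are modeled here as dicts keyed by whichever oneof field is set; real code would call `WhichOneof("Event")` on the protobuf message. The replace-last-final strategy for refinements is a simplification for the sketch.

```python
def handle_response(response, transcript):
    """Dispatch one StreamingResponse event into a running transcript."""
    if "partial" in response:
        pass  # Unstable text; useful for live captions, safe to discard.
    elif "final" in response:
        alternatives = response["final"]["alternatives"]
        if alternatives:
            transcript.append(alternatives[0]["text"])  # top hypothesis
    elif "final_refinement" in response:
        refined = response["final_refinement"]["normalized_text"]["alternatives"]
        if refined:
            transcript[-1] = refined[0]["text"]  # swap in normalized text
    elif "status_code" in response:
        pass  # keep-alive

transcript = []
handle_response({"final": {"alternatives": [{"text": "hello world"}]}}, transcript)
handle_response({"final_refinement": {"final_index": 0,
                                      "normalized_text": {"alternatives": [{"text": "Hello, world."}]}}},
                transcript)
```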

SessionUuid

- `uuid` (string): Internal session identifier.
- `user_request_id` (string): User session identifier.

AudioCursors

- `received_data_ms` (int64): Amount of audio the server has received. This cursor advances after each audio chunk is received.
- `reset_time_ms` (int64): Input stream reset data.
- `partial_time_ms` (int64): How much audio has been processed, including trimmed silences. This cursor advances once the server has received enough data to update the recognition results.
- `final_time_ms` (int64): Time of the last final. This cursor advances when the server decides that the recognition from the start of the audio up to `final_time_ms` will not change anymore; usually this event follows EOU detection (but this may change in the future).
- `final_index` (int64): Index of the last final the server sent. Incremented after each new final.
- `eou_time_ms` (int64): Estimated time of EOU. Updated after each new EOU is sent. For the external classifier, this equals `received_data_ms` at the moment the EOU event arrives; for the internal classifier, it is an estimate. The time is not exact and has the same guarantees as word timings.
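The cursor descriptions suggest an ordering: finalized audio never runs ahead of partially recognized audio, which never runs ahead of received audio. A sanity-check sketch under that assumption (it is an inference from the field descriptions, not a documented guarantee):

```python
def cursors_consistent(c):
    """Check the ordering the field descriptions imply:
    0 <= final_time_ms <= partial_time_ms <= received_data_ms."""
    return 0 <= c["final_time_ms"] <= c["partial_time_ms"] <= c["received_data_ms"]

ok = cursors_consistent({"received_data_ms": 5200,
                         "partial_time_ms": 5000,
                         "final_time_ms": 3100})
```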

AlternativeUpdate

- `alternatives` ([]Alternative): List of hypotheses for the time frame.
- `channel_tag` (string)

Alternative

- `words` ([]Word): Words in the time frame.
- `text` (string): Text in the time frame.
- `start_time_ms` (int64): Start of the time frame.
- `end_time_ms` (int64): End of the time frame.
- `confidence` (double): Hypothesis confidence. Currently not used.
- `languages` ([]LanguageEstimation): Distribution over possible languages.

Word

- `text` (string): Word text.
- `start_time_ms` (int64): Estimated word start time, in milliseconds.
- `end_time_ms` (int64): Estimated word end time, in milliseconds.

LanguageEstimation

- `language_code` (string): Language code in ISO 639-1 format.
- `probability` (double): Estimated language probability.

EouUpdate

- `time_ms` (int64): Estimated EOU time.

FinalRefinement

- `final_index` (int64): Index of the final for which the server sends additional information.
- `Type` (oneof: `normalized_text`): Type of refinement.
  - `normalized_text` (AlternativeUpdate): Normalized text instead of the raw one.
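Since refinements arrive separately from the finals they refine, a client can correlate them via `final_index`. A bookkeeping sketch with messages modeled as dicts; the helper names (`on_final`, `best_text`, etc.) are illustrative, not part of the API:

```python
finals = {}       # final_index -> raw recognized text
refinements = {}  # final_index -> normalized text

def on_final(index, update):
    """Record the top hypothesis of a final AlternativeUpdate."""
    finals[index] = update["alternatives"][0]["text"]

def on_final_refinement(msg):
    """Record the normalized text carried by a FinalRefinement."""
    update = msg["normalized_text"]
    refinements[msg["final_index"]] = update["alternatives"][0]["text"]

def best_text(index):
    """Prefer normalized text once the (slower) refinement has arrived."""
    return refinements.get(index, finals.get(index))

on_final(0, {"alternatives": [{"text": "twenty five dollars"}]})
on_final_refinement({"final_index": 0,
                     "normalized_text": {"alternatives": [{"text": "$25"}]}})
```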

StatusCode

- `code_type` (enum CodeType): Code type.
- `message` (string): Human-readable message.

RecognitionClassifierUpdate

- `window_type` (enum WindowType): Response window type.
- `start_time_ms` (int64): Start time of the audio segment used for classification.
- `end_time_ms` (int64): End time of the audio segment used for classification.
- `classifier_result` (RecognitionClassifierResult): Result of a dictionary-based classifier.

RecognitionClassifierResult

- `classifier` (string): Name of the triggered classifier.
- `highlights` ([]PhraseHighlight): List of highlights, i.e., the parts of the phrase that determined the classification result.
- `labels` ([]RecognitionClassifierLabel): Classifier predictions.

PhraseHighlight

- `text` (string): Text transcription of the highlighted audio segment.
- `start_time_ms` (int64): Start time of the highlighted audio segment.
- `end_time_ms` (int64): End time of the highlighted audio segment.

RecognitionClassifierLabel

- `label` (string): Label of the class predicted by the classifier.
- `confidence` (double): Prediction confidence.