Keyword Detection integration

Keyword Detection (KD), also known as Voice Activation or Sound Trigger, is a feature that triggers a speech recognition engine when a predefined keyphrase (keyword) is successfully detected. Offloading the keyphrase detection algorithm to an embedded processing environment (for example, a dedicated DSP) reduces system power consumption while the system listens for an utterance.

The terms Voice Activation and Keyphrase Detection are often used interchangeably to describe end-to-end system-level use cases that include:

  • Keyphrase detection algorithm

  • Keyphrase enrollment (parametrization of keyphrase detection algorithm)

  • Management of an audio stream that is used to transport utterances

  • Steps made to reduce system-level power consumption

  • System wakeup on keyphrase detection

The term Keyphrase Detector component typically refers to a firmware processing component that implements an algorithm for keyphrase detection in an audio stream.

The term speech audio stream indicates that the stream is primarily used to deliver data to the automatic speech recognition (ASR) algorithm. The term voice audio stream typically indicates that the recipient of the audio data is a human.

Depending on system-level requirements for the keyphrase detection algorithm and the speech recognition engine, different policies for keyphrase buffering and voice data streaming may be applied. This document covers the reference implementation available in SOF. The following sections cover the functional scope.

Note

Currently, SOF implements the Keyphrase Detector component with a reference trigger function that allows testing of E2E flow by detecting a rapid volume change.
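
The idea of a trigger based on a rapid volume change can be illustrated with a short sketch. This is not the SOF detector code: the function names, the frame length, and the energy ratio threshold below are all hypothetical choices made for illustration only.

```python
def frame_energy(frame):
    """Mean absolute amplitude of one frame of signed PCM samples."""
    return sum(abs(s) for s in frame) / len(frame)


def detect_volume_jump(samples, frame_len=160, ratio=8.0, floor=1.0):
    """Return the index of the first frame whose energy jumps sharply.

    Hypothetical illustration: a frame triggers when its energy exceeds
    `ratio` times the energy of the previous frame. `floor` keeps the
    comparison meaningful when the previous frame is digital silence.
    Returns None when no frame triggers.
    """
    previous = None
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        energy = frame_energy(samples[start:start + frame_len])
        if previous is not None and energy > ratio * max(previous, floor):
            return start // frame_len
        previous = energy
    return None


quiet = [10, -10] * 240      # three frames of low-level signal
loud = [8000, -8000] * 80    # one frame of loud signal
trigger_frame = detect_volume_jump(quiet + loud)
```

With the sample data above, the loud frame follows three quiet frames, so the sketch reports the fourth frame (index 3) as the trigger point.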

Timing sequence

@startuml

scale max 1024 width

footer: timeline not to scale 
robust "Speech application" as App
concise "Audio Stream" as Audio

App is idle
Audio is "Preceding"

@App
0 is idle
+180 is Processing

@Audio
0 is Keyphrase
@0 <-> @100 : keyphrase length - L1
@100 <-> @+80 : detection\ntime - L2
@180 <-> @+80 : burst \ntransmission time - L3
Audio@180 -> App@180 : notification
@260 <-> @+60 : safety \nmargin - L4
100 is Command
+200 is Following
@enduml

Figure 38 Basic diagram for a timing sequence

A keyphrase is preceded by a period of silence and is followed by a user command. To balance power savings and user experience, the host system (CPU) is activated only if a keyphrase is detected. To reduce the number of false triggers for user commands, the keyphrase can be sent to the host for additional (second-stage) verification. This requires the FW to buffer the keyphrase in memory. The keyphrase is transmitted to the host as fast as possible (faster than real time) to reduce the latency of the system response.

End-to-end flows

@startuml

scale max 1024 width

participant "Userspace component" as usr
participant "Audio driver" as drv
participant "FW infrastructure" as fw
participant "Data transfer to Host" as dma
participant "Keyword detection algorithm" as kda
participant "Data transfer to DSP" as gpdma

box "Linux User/Kernel space" #LavenderBlush
	participant usr
	participant drv
end box

box "DSP" #LightBlue
	participant fw
	participant dma
	participant kda
	participant gpdma
end box

activate fw

drv -> fw : Setup audio topology \n (Speech Capture & Keyword Detection pipes)
usr -> drv : Prepare & Open PCM capture \n(snd_pcm_open/snd_pcm_hw_params)
drv -> fw : Stream Open & Preparation
drv -> fw : HW Params
group optional (depends on keyword detection algorithm implementation)
 usr -> drv : Send keyword detection algorithm parameters \n (snd_ctl_elem_tlv_write)
 drv -> fw : Send keyword detection algorithm parameters
 fw -> kda : Send keyword detection algorithm parameters
end

drv ->drv : DAPM power up event
drv -> fw : HW Params for Keyphrase Detection Pipeline
usr -> drv : Trigger start (alsamixer)
drv -> fw : Keyword detection algorithm & buffer manager triggered

fw -> fw : Keyphrase Buffer Manager \nin acquisition mode
fw -> gpdma 

activate gpdma

fw -> kda : keyphrase detection enabled

activate kda

usr -> drv : Trigger start (snd_pcm_read)

note over usr
Speech application indefinitely 
waits for data.
end note 

ref over usr, drv, fw , gpdma, kda, dma  
Speech Capture pipeline is not transmitting data to Host system
Host system may enter the low power state
end ref

loop keyword detection algorithm \nexecuted on DSP
 kda <- gpdma 
end

hnote over kda : keyword is detected

fw <-- kda : FW event on keyword detection
fw -> kda : keyword detection disabled

deactivate kda 

fw -> fw : Keyphrase Buffer Manager \nin drain mode
drv <-- fw : notification on keyword detection
'drv -> fw : enable data transmission to Host \n(Capture[Speech] pipeline to Host is running)
usr <-- drv : notification on keyword detection (optional)
gpdma -> dma 

activate dma

ref over dma 
Sending a burst of historic data (approx. 2 s)
with detected keyword for
second stage verification on host.
end ref

gpdma <-- dma 

deactivate dma

usr <-- drv : snd_pcm_read completed 

fw -> fw : Keyphrase Buffer Manager \nin passthrough mode 

loop Realtime capture
 usr -> drv : snd_pcm_read
 gpdma -> dma 
 activate dma
 gpdma <-- dma 

 deactivate dma
 usr <-- drv : snd_pcm_read completed 
end 

ref over usr 
User space optionally performs second stage keyword verification.
end ref

usr -> drv : Trigger stop (alsamixer)
drv ->drv : DAPM power down event
drv -> fw : Stop Keyphrase Detection algorithm pipeline
usr -> drv : Trigger stop (snd_pcm_drop / snd_pcm_free)
drv -> fw : Close Speech capture stream
fw -> gpdma 

deactivate gpdma

ref over usr, drv, fw , gpdma, kda, dma  
The flow can be repeated for next user command starting from snd_pcm_open()
end ref

deactivate fw
@enduml

Figure 39 E2E flow for SW/FW components

The fundamental assumption for the flow is that the keyphrase detection sequence is controlled by the user space component (application) that opens and closes the speech audio stream. The audio topology must be set up before the speech stream is opened. An optional sequence customizes the behavior of the keyword detection algorithm by sending run-time parameters.

During the Stream Open and Preparation phase, HW parameters are sent to the DAI and configuration parameters are passed from the topology to the FW components. The audio driver uses DAPM event handlers to control the Keyphrase Detector node of the FW topology graph. Once the keyphrase is detected, a notification is sent to the driver. At the same time, an internal FW event triggers the draining of buffered audio data to the host in burst mode. Once the buffer is drained, the speech capture pipeline works as a passthrough capture until it is closed by the user space application.
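
The three buffer manager modes named in the flow (acquisition, drain, passthrough) can be modeled with a toy sketch. The class and method names are hypothetical, and the model simplifies one point on purpose: draining is treated as instantaneous, whereas in the FW it runs while new audio keeps arriving.

```python
from collections import deque


class KeyphraseBufferSketch:
    """Toy model of the acquisition, drain, and passthrough modes."""

    def __init__(self, history_frames):
        # A bounded deque drops the oldest frame once the history is full.
        self.history = deque(maxlen=history_frames)
        self.mode = "acquisition"
        self.to_detector = []  # frames routed to the keyphrase detector
        self.to_host = []      # frames delivered to the host

    def on_frame(self, frame):
        if self.mode == "acquisition":
            # Keep history and feed the detector; the host receives nothing.
            self.history.append(frame)
            self.to_detector.append(frame)
        else:
            # Passthrough: captured frames go straight to the host.
            self.to_host.append(frame)

    def on_keyphrase_detected(self):
        # Drain the buffered history to the host, oldest frame first.
        self.mode = "drain"
        while self.history:
            self.to_host.append(self.history.popleft())
        self.mode = "passthrough"


kpb = KeyphraseBufferSketch(history_frames=3)
for frame in ["f0", "f1", "f2", "f3"]:
    kpb.on_frame(frame)
kpb.on_keyphrase_detected()
kpb.on_frame("f4")
```

In this run the history holds only the three most recent frames, so the host receives f1, f2, and f3 as the burst, followed by f4 in passthrough mode.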

FW topology

@startuml

scale max 1024 width

skinparam rectangle {
   backgroundColor<<dai>> #6fccdd
   backgroundColor<<dma>> #f6ed80
   backgroundColor<<stream>> #d6d6de
   borderColor<<stream>> #d6d6de
   borderColor<<ppl>> #a1a1ca

   backgroundColor<<event>> #f05772
   stereotypeFontColor<<event>> #ffffff
   fontColor<<event>> #ffffff

   backgroundColor<<cpu>> #f0f0f0
}


together {
rectangle "MIC HW" as dmic #DDDDDD

rectangle "Speech Capture Pipeline" as ppl_1 <<FW pipeline >>{
 rectangle "MIC DAI" as dai_1 <<dai>>
 rectangle "Keyphrase Buffer Manager" as kpb
 dai_1 -> kpb : 2ch/16kHz/16bit
 rectangle "Host" as host
 }

}

rectangle "Keyphrase Detector Pipeline" as ppl_2 <<FW pipeline >>{
 rectangle "Channel selector" as sel
 rectangle "Keyphrase detection algorithm" as wov
 sel -> wov : 1ch/16kHz/16bit
}

rectangle "Host System" as hsys {
 rectangle "Host Memory" as hmem #DDDDDD
}

dmic -> dai_1
kpb -> host
kpb -> sel : 2ch/16kHz/16bit
host -> hmem : 2ch/16kHz/16bit
wov ..> kpb : FW events
wov ..> hsys : FW notifications
@enduml

Figure 40 Basic diagram for FW components topology

The diagram above provides an overview of FW and HW components that play a role in keyphrase detection flows. The components are organized in pipelines:

  1. Speech capture pipeline

    1. The DMIC DAI configures the HW interface to capture data from microphones.

    2. The Keyphrase Buffer Manager is responsible for managing the data captured by microphones. This includes control of an internal buffer for incoming data and routing of incoming audio samples. The audio buffer with historic audio data is implemented as a cyclic buffer. While listening for a keyphrase, the component stores incoming data in an internal buffer and copies it to a sink that leads toward the keyword detector component. On successful detection of a keyphrase, the buffer is drained during a burst transmission to the host. Once the buffer is drained, the component works as a passthrough component on the capture pipeline.

    3. The host component configures transport (over DMA) to the host system. The component is responsible for transmitting data from local memory (FW accessible) to remote (host CPU accessible) memory.

  2. Keyphrase detector pipeline

    1. The channel selector is responsible for providing a single channel on input to the keyphrase detection algorithm. The decision of which channel to select is made by the platform integrator. The component can accept parameters from a topology file.

    2. The keyphrase detection algorithm accepts audio frames and reports whether a keyphrase is detected. Note that the FW infrastructure allows a FW event to be sent to the Keyphrase Buffer Manager component when a keyphrase is detected. The component also sends a notification to the audio driver and supports large parameters.
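
The channel selector step can be sketched as follows for the 2ch/16bit to 1ch/16bit case shown in the topology diagram. The function name and parameters are hypothetical, and the sketch assumes interleaved, little-endian, signed 16-bit samples; it is not the FW component itself.

```python
import struct


def select_channel(interleaved, channels=2, selected=0):
    """Extract one channel from interleaved signed 16-bit PCM bytes.

    Assumes little-endian samples laid out frame by frame
    (ch0, ch1, ch0, ch1, ...).
    """
    sample_size = 2  # bytes per 16-bit sample
    if len(interleaved) % (channels * sample_size):
        raise ValueError("input must contain whole frames")
    if not 0 <= selected < channels:
        raise ValueError("selected channel out of range")
    count = len(interleaved) // sample_size
    samples = struct.unpack("<%dh" % count, interleaved)
    mono = samples[selected::channels]  # every `channels`-th sample
    return struct.pack("<%dh" % len(mono), *mono)


# Three stereo frames: left carries 1, 2, 3 and right carries -1, -2, -3.
stereo = struct.pack("<6h", 1, -1, 2, -2, 3, -3)
left = select_channel(stereo, selected=0)
right = select_channel(stereo, selected=1)
```

Which channel to pass as `selected` corresponds to the platform integrator's decision described above.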

KPBM state diagram

The state diagram below presents all possible keyphrase buffer manager states as well as the valid relationships between them.

@startuml
[*] --> KPB_DISABLED:  Start \nor\n [IPC] free \nmessage \nfrom any state
KPB_DISABLED: Starting state of KPB - \nNo action has been taken yet
KPB_DISABLED--> KPB_CREATED: [IPC] \nnew component
KPB_DISABLED-[#0000FF]-> KPB_DISABLED: [IPC] \nreset

KPB_CREATED : New KPB component has been created
KPB_CREATED --> KPB_PREPARING: [IPC] \npcm params
KPB_CREATED -[#0000FF]-> KPB_CREATED : [IPC] \nreset

KPB_PREPARING: Prepare Key Phrase Buffer component.
KPB_PREPARING-> KPB_STATE_RUN: Success
KPB_PREPARING-> KPB_PREPARING: Failure
KPB_PREPARING-[#0000FF]-> KPB_PREPARING: [IPC] \nreset

KPB_STATE_RUN: KPB is prepared and ready.
KPB_STATE_RUN-[#0000FF]-> KPB_PREPARING: [IPC] \nreset
KPB_STATE_RUN---> KPB_STATE_INIT_DRAINING: [EVENT] \nkey phrase detected
KPB_STATE_RUN-> KPB_STATE_BUFFERING: Start \nbuffering

KPB_STATE_BUFFERING: Buffer incoming samples in the \ninternal history buffer
KPB_STATE_BUFFERING-> KPB_STATE_RUN: Done
KPB_STATE_BUFFERING-> KPB_STATE_INIT_DRAINING: Done
KPB_STATE_BUFFERING-> KPB_STATE_DRAINING: Done
KPB_STATE_BUFFERING-[#0000FF]-> KPB_STATE_RESETTING: [IPC] \nreset

KPB_STATE_INIT_DRAINING: KPB received detection event
KPB_STATE_INIT_DRAINING-[#0000FF]-> KPB_PREPARING: [IPC] \nreset
KPB_STATE_INIT_DRAINING--> KPB_STATE_DRAINING: Draining task starts
KPB_STATE_INIT_DRAINING--> KPB_STATE_BUFFERING: Start \nbuffering

KPB_STATE_DRAINING: KPB is draining internal history buffer \nto the client's buffer
KPB_STATE_DRAINING-->KPB_STATE_HOST_COPY: Draining done
KPB_STATE_DRAINING-[#0000FF]-> KPB_STATE_RESETTING: [IPC] \nreset
KPB_STATE_DRAINING--> KPB_STATE_BUFFERING: Start \nbuffering

KPB_STATE_RESETTING: KPB is preparing itself for the reset
KPB_STATE_RESETTING-->KPB_STATE_RESET_FINISHING

KPB_STATE_RESET_FINISHING: KPB is finishing reset sequence
KPB_STATE_RESET_FINISHING->KPB_PREPARING: Reset done

KPB_STATE_HOST_COPY: KPB is copying real time \nstream into client's buffer
@enduml

Figure 41 Keyphrase buffer manager state diagram
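
The transitions drawn in the state diagram can be written down as a lookup table, which makes it easy to check whether a sequence of states is valid. The table below is transcribed from the diagram only; the dictionary and the helper function are hypothetical names and are not taken from the FW sources.

```python
# Allowed next states per state, transcribed from the diagram.
ALLOWED = {
    "KPB_DISABLED": {"KPB_DISABLED", "KPB_CREATED"},
    "KPB_CREATED": {"KPB_CREATED", "KPB_PREPARING"},
    "KPB_PREPARING": {"KPB_PREPARING", "KPB_STATE_RUN"},
    "KPB_STATE_RUN": {
        "KPB_PREPARING", "KPB_STATE_INIT_DRAINING", "KPB_STATE_BUFFERING"},
    "KPB_STATE_BUFFERING": {
        "KPB_STATE_RUN", "KPB_STATE_INIT_DRAINING",
        "KPB_STATE_DRAINING", "KPB_STATE_RESETTING"},
    "KPB_STATE_INIT_DRAINING": {
        "KPB_PREPARING", "KPB_STATE_DRAINING", "KPB_STATE_BUFFERING"},
    "KPB_STATE_DRAINING": {
        "KPB_STATE_HOST_COPY", "KPB_STATE_RESETTING", "KPB_STATE_BUFFERING"},
    "KPB_STATE_RESETTING": {"KPB_STATE_RESET_FINISHING"},
    "KPB_STATE_RESET_FINISHING": {"KPB_PREPARING"},
    "KPB_STATE_HOST_COPY": set(),
}


def is_valid_path(states):
    """Check that each consecutive pair of states is an allowed transition."""
    for current, following in zip(states, states[1:]):
        # The diagram lets an [IPC] free message return to KPB_DISABLED
        # from any state.
        if following == "KPB_DISABLED":
            continue
        if following not in ALLOWED[current]:
            return False
    return True


detection_path = [
    "KPB_DISABLED", "KPB_CREATED", "KPB_PREPARING", "KPB_STATE_RUN",
    "KPB_STATE_INIT_DRAINING", "KPB_STATE_DRAINING", "KPB_STATE_HOST_COPY",
]
path_ok = is_valid_path(detection_path)
```

The example path follows a detection from component creation through draining to real-time host copy, which the diagram permits.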

Latency & buffering

This section covers the calculations needed to properly configure the keyphrase buffer size. The symbols used in the formula below are depicted above; see Basic diagram for a timing sequence.

Note

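The formula for the size of a keyphrase buffer: ( L1 + L2 + L3 + L4 ) * sampling rate * number of channels * bit depth = Size [Kb], with L1 to L4 expressed in seconds and the sampling rate in kHz. The sampling rate factor converts the durations into sample counts.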

Specifically:

  1. L1 is defined as the length of a keyphrase, including any preceding or trailing silence. The value depends highly on the keyphrase itself and on the detection algorithm requirements.

  2. L2 is the sum of the algorithmic (processing) latency of the detection algorithm and the additional time needed to execute the other components in the pipelines and to prepare and send notifications.

  3. L3 is the time required to send already-buffered data to the host. Typically, a Write Pointer (WP) indicates where data coming from the microphones is written to the keyphrase buffer. The keyphrase buffer is organized as a cyclic buffer, and the WP advances at a regular rate as data arrives from the mics. The Read Pointer (RP) indicates the offset in the buffer from which data is fetched to the host. To start a burst transmission, the RP is set to the position WP - “history depth”. The history depth is defined in the FW or is passed from topology. The RP moves faster than the WP because draining is executed as a background task. The draining phase lasts until the RP again reaches the WP, which moves at a regular (slower) rate. This marks the end of the L3 period; from then on, the RP follows the WP at the rate at which data becomes available in the DAI DMA buffer. Implementation note: “history depth” may be updated on-the-fly during the draining phase if new data is captured in the meantime.

  4. L4 is a safety margin that could be absorbed into any of the periods defined above. It is defined explicitly to make sure it is included in the calculation. The length of L4 depends on the audio frame size processed by the detector, the detector compute time, the output audio format, the keyphrase buffer size, and so on.
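
The buffer size calculation and the length of the draining phase can be worked through with a small sketch. It assumes that the durations L1 to L4 are converted into samples using the stream's sampling rate, and that draining runs at a constant multiple of the real-time rate. The function names, the L1 to L4 values, and the drain speedup are hypothetical; only the 2ch/16kHz/16bit format comes from the topology diagram.

```python
def keyphrase_buffer_bytes(l1_ms, l2_ms, l3_ms, l4_ms,
                           rate_hz=16000, channels=2, bits=16):
    """Bytes needed to hold L1 + L2 + L3 + L4 of audio (integer ms inputs)."""
    total_ms = l1_ms + l2_ms + l3_ms + l4_ms
    frames = (total_ms * rate_hz + 999) // 1000  # round up to whole frames
    return frames * channels * (bits // 8)


def drain_time_ms(history_ms, drain_speedup):
    """Time until the RP catches up with the WP.

    The RP starts `history_ms` behind the WP and advances `drain_speedup`
    times faster than real time while the WP advances at real time, so
    the gap closes at (drain_speedup - 1) times real time.
    """
    if drain_speedup <= 1:
        raise ValueError("draining must be faster than real time")
    return history_ms / (drain_speedup - 1)


# Hypothetical durations in milliseconds, adding up to 2 s of audio.
size_bytes = keyphrase_buffer_bytes(1500, 100, 300, 100)
catch_up_ms = drain_time_ms(history_ms=2000, drain_speedup=5)
```

With these example values the buffer needs 128000 bytes, and a drain running five times faster than real time catches up with the write pointer after 500 ms.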