December 4, 2022

This article was published as a part of the Data Science Blogathon.

Using conversational context in ASR systems has proven to be an effective way to improve conversational ASR performance, and several methods have been proposed as a result. However, existing methods share a drawback: the recognition hypothesis of the current utterance may be biased by inevitable errors in the historical recognition results. This is problematic and needs to be resolved. To address this problem, researchers have proposed an audio-textual cross-modal representation extractor, which we will explore in this article.

Now, let’s begin…

Highlights

An audio-textual cross-modal representation extractor is proposed. It consists of two pre-trained single-modal encoders, the speech model Wav2Vec2.0 and the language model RoBERTa, together with a cross-modal encoder that is meant to learn contextual representations directly from the preceding speech.

Some input tokens and input sequences of each modality are randomly masked. Then a modal-missing or token-missing prediction is performed, with a modal-level CTC loss, on the cross-modal encoder.

When training the conversational ASR system, the extractor is frozen to extract the textual representation of the previous speech; the extracted representation is then fed to the ASR decoder as context via the attention mechanism.

Textual representations extracted from the current and previous speech are sent to the ASR decoder module, which reduces the relative CER by 16% on three Mandarin conversational datasets (MagicData, DDT, and HKUST).

What was the problem with the existing methods?

As we discussed briefly in the introduction, existing methods usually rely on extracting relevant information from the transcripts of the preceding speech in the conversation. However, during inference, the hypotheses of the preceding utterances, rather than the ground-truth transcripts, are used to extract these representations. As a result, inaccuracies in the historical ASR hypotheses may introduce new errors when the current utterance is recognized, which is the issue that needs to be resolved.

Now that we have looked at the problem, let’s look at the proposed method, which aims to solve the above challenge.

Figure 1 shows a diagram of the proposed conversational ASR system. It introduces a cross-modal representation extractor that takes advantage of the pre-trained speech model Wav2Vec2.0 and the language model RoBERTa, together with a cross-modal encoder, to extract contextual information from speech.

Figure 1: Diagram showing the proposed method, where speech sequences are the input.

The proposed conversational ASR system is trained in the following two phases:

Stage 1: In the first stage, the contextual representation extractor is trained, as shown in Figure 2. Text and audio embeddings are derived from the paired transcripts and speech using a text encoder and a speech encoder, respectively. The obtained embeddings are then sent to the cross-modal encoder to extract the cross-modal representations. Using multitask learning, the representation extractor learns the correlation between paired speech and transcripts at different levels of data granularity.
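
To make Stage 1 concrete, here is a minimal PyTorch-style sketch of a single training step. Everything in it is a placeholder assumption: the module names (speech_encoder, text_encoder, cross_modal_encoder, mlm_head, ctc_head), the batch fields, and the loss weighting are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def stage1_step(speech_encoder, text_encoder, cross_modal_encoder,
                mlm_head, ctc_head, optimizer, batch, ctc_weight=1.0):
    """One illustrative Stage-1 update on a paired (speech, transcript) batch.
    All module names and batch fields here are hypothetical placeholders."""
    audio_emb = speech_encoder(batch["waveform"])        # (B, T_a, D) frame features
    text_emb = text_encoder(batch["masked_token_ids"])   # (B, T_t, D) token features

    # Fuse both modalities into a cross-modal contextual representation.
    fused = cross_modal_encoder(audio_emb, text_emb)     # (B, T_a + T_t, D)
    audio_part = fused[:, :audio_emb.size(1)]            # positions aligned with audio frames
    text_part = fused[:, audio_emb.size(1):]             # positions aligned with text tokens

    # Multitask objective (placeholder weighting): recover the masked tokens
    # and apply a CTC-style loss on the audio-aligned positions.
    mlm_loss = F.cross_entropy(mlm_head(text_part).transpose(1, 2),
                               batch["mlm_targets"], ignore_index=-100)
    log_probs = ctc_head(audio_part).log_softmax(-1).transpose(0, 1)   # (T_a, B, V)
    ctc_loss = F.ctc_loss(log_probs, batch["char_targets"],
                          batch["audio_lengths"], batch["char_lengths"])

    loss = mlm_loss + ctc_weight * ctc_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```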

Stage 2: The text encoder in the cross-modal representation extractor is discarded in the second stage. Instead, the extractor learns contextual representations from speech alone. During training and testing of the ASR module, the contextual representations are incorporated into the decoder of the ASR module via the attention mechanism.
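
A hedged sketch of Stage-2 extraction is shown below: since the text encoder is discarded, a dummy (zero) embedding stands in for the text input, so the contextual representation comes from speech alone. The function, tensor shapes, and the zero-valued dummy are assumptions for illustration, not the paper's code.

```python
import torch

@torch.no_grad()
def extract_context_from_speech(speech_encoder, cross_modal_encoder, waveform,
                                num_text_positions=20):
    """Stage-2 style extraction: no transcript is available, so a dummy (zero)
    embedding stands in for the discarded text encoder. The dummy construction
    and the number of text positions are assumptions made for this sketch."""
    audio_emb = speech_encoder(waveform)                      # (B, T_a, D)
    dummy_text = torch.zeros(audio_emb.size(0), num_text_positions,
                             audio_emb.size(-1), device=audio_emb.device)
    fused = cross_modal_encoder(audio_emb, dummy_text)
    # Keep the text-aligned positions as the "textual" contextual representation.
    return fused[:, audio_emb.size(1):]                       # (B, num_text_positions, D)
```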

Contextual Representation Extractor

Figure 2: Diagram showing the proposed contextual representation extractor. (source: arxiv)

In addition, some of the input tokens and sequences of each modality are randomly masked. Then a modal-missing or token-missing prediction is performed with a modal-level CTC loss on the cross-modal encoder. In this way, the model captures the bi-directional context dependence in a particular modality and the relationship between the two modalities.
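As a rough illustration of the masking step, a simple random-masking helper could look like the sketch below; the 15% masking rate and the zero fill value are assumptions that may differ from the paper's setting.

```python
import torch

def random_mask(emb, mask_prob=0.15, mask_value=0.0):
    """Randomly mask positions of a (B, T, D) embedding sequence by replacing
    them with a constant value; returns the masked sequence and the mask so a
    prediction head can be trained to recover the missing positions."""
    mask = torch.rand(emb.shape[:2], device=emb.device) < mask_prob   # (B, T)
    masked = emb.masked_fill(mask.unsqueeze(-1), mask_value)
    return masked, mask

# Example: mask roughly 15% of audio frames and text tokens independently.
audio_emb, text_emb = torch.randn(2, 100, 512), torch.randn(2, 20, 512)
masked_audio, audio_mask = random_mask(audio_emb)
masked_text, text_mask = random_mask(text_emb)
```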

Additionally, when training the conversational ASR system, the extractor is frozen to extract textual representations of the previous speech; the extracted representation is then fed to the ASR decoder as context via the attention mechanism.

In the following, we will briefly discuss each component:

i) Speech Encoder: The speech encoder consists of a pre-trained speech representation model, the Wav2Vec2.0 large model trained on WenetSpeech, followed by a linear layer.
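
For illustration only, a speech encoder of this shape could be assembled with the Hugging Face Transformers library roughly as below. The checkpoint name is a public placeholder (the WenetSpeech-trained Wav2Vec2.0-large model used in the paper may not be available under this name), and the projection size is an assumption.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SpeechEncoder(nn.Module):
    def __init__(self, checkpoint="facebook/wav2vec2-large-xlsr-53", out_dim=512):
        super().__init__()
        # Placeholder checkpoint; the paper uses a Wav2Vec2.0-large model
        # trained on WenetSpeech, which may differ from this public one.
        self.wav2vec = Wav2Vec2Model.from_pretrained(checkpoint)
        self.proj = nn.Linear(self.wav2vec.config.hidden_size, out_dim)

    def forward(self, waveform):
        # waveform: (B, num_samples) raw 16 kHz audio
        hidden = self.wav2vec(waveform).last_hidden_state   # (B, T_a, hidden_size)
        return self.proj(hidden)                            # (B, T_a, out_dim)
```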

ii) Text Encoder: The pre-trained language model RoBERTa-wwm-ext is used as the text encoder; it is trained on in-house text data including news, encyclopedia articles, and web question-answer data.
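
Similarly, here is a hedged sketch of the text encoder using the publicly available hfl/chinese-roberta-wwm-ext checkpoint as a stand-in; the paper's model is trained on in-house data and may differ.

```python
from transformers import AutoTokenizer, AutoModel

# "hfl/chinese-roberta-wwm-ext" is a public stand-in; the paper's RoBERTa-wwm-ext
# is trained on in-house news, encyclopedia, and web Q&A text.
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
text_encoder = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

inputs = tokenizer("今天天气怎么样", return_tensors="pt")
text_emb = text_encoder(**inputs).last_hidden_state   # (1, T_t, 768) token embeddings
```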

iii) Cross-Modal Encoder (CME): The cross-modal encoder consists of three transformer blocks. The speech embedding (A) and the text embedding (T), obtained from the speech encoder and the text encoder respectively, are sent to the CME to obtain a high-dimensional cross-modal contextual representation.
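
A minimal CME sketch with three standard Transformer encoder blocks is given below; plain self-attention blocks and the embedding dimensions are simplifying assumptions, and the inputs are assumed to be already projected to a common dimension.

```python
import torch
import torch.nn as nn

class CrossModalEncoder(nn.Module):
    """Three Transformer blocks over the concatenation of the speech
    embedding A and the text embedding T (a simplifying sketch)."""
    def __init__(self, d_model=512, nhead=8, num_layers=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, audio_emb, text_emb):
        # audio_emb (A): (B, T_a, d_model); text_emb (T): (B, T_t, d_model)
        fused = torch.cat([audio_emb, text_emb], dim=1)
        return self.blocks(fused)     # cross-modal contextual representation

# Toy usage with random tensors standing in for the encoder outputs.
cme = CrossModalEncoder()
out = cme(torch.randn(2, 100, 512), torch.randn(2, 20, 512))   # (2, 120, 512)
```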

iv) Contextual ASR Model:

a) Conformer Encoder: The Conformer combines self-attention with convolution for ASR tasks, learning the interaction of global information through a self-attention mechanism and the representation of local features through a convolutional neural network (CNN), which leads to better performance. The Conformer blocks are stacked together as the encoder of the ASR model, where each Conformer block includes a convolution layer (CONV), a multi-head self-attention layer (MHSA), and a feed-forward layer (FFN).
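
For illustration, a simplified Conformer-style block could be written as follows; the half-step feed-forward modules and relative positional encoding of the full Conformer are omitted, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Simplified Conformer block: multi-head self-attention (MHSA), a
    depthwise convolution module (CONV), and a feed-forward layer (FFN).
    Full Conformer blocks also use half-step FFNs and relative positional
    encoding, which are omitted in this sketch."""
    def __init__(self, d_model=256, nhead=4, kernel_size=15, ffn_dim=1024):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2,
                      groups=d_model),                      # depthwise convolution
            nn.BatchNorm1d(d_model),
            nn.SiLU(),
        )
        self.ffn = nn.Sequential(nn.Linear(d_model, ffn_dim), nn.SiLU(),
                                 nn.Linear(ffn_dim, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x):                                   # x: (B, T, d_model)
        x = x + self.mhsa(self.norm1(x), self.norm1(x), self.norm1(x))[0]
        x = x + self.conv(self.norm2(x).transpose(1, 2)).transpose(1, 2)
        return x + self.ffn(self.norm3(x))
```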

b) Contextual Decoder: The contextual decoder consists of a transformer with an additional cross-attention layer. First, text embeddings of the current speech as well as the previous speech are generated; for this, the speech to be processed is sent to the extractor together with a dummy embedding. Then the current text embedding is combined with the previous context embedding to obtain the final contextual embedding. The obtained contextual embeddings are fed into each decoder block so that the decoder can learn the contextual information extracted by the representation extractor. Finally, the output of the last layer of the decoder is used to predict the character probabilities through a softmax function.
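
To illustrate how the contextual embeddings could be injected, here is a hedged sketch of one decoder block with an extra cross-attention over the contextual embedding; the block ordering and dimensions are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ContextualDecoderBlock(nn.Module):
    """Transformer decoder block with an additional cross-attention over the
    contextual embedding from the representation extractor (a sketch)."""
    def __init__(self, d_model=256, nhead=4, ffn_dim=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.src_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ctx_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)  # extra layer
        self.ffn = nn.Sequential(nn.Linear(d_model, ffn_dim), nn.ReLU(),
                                 nn.Linear(ffn_dim, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, y, enc_out, ctx_emb, tgt_mask=None):
        # y: (B, T_y, d) decoder states; enc_out: Conformer encoder output;
        # ctx_emb: contextual embedding of the current and previous utterance.
        y = y + self.self_attn(self.norms[0](y), self.norms[0](y),
                               self.norms[0](y), attn_mask=tgt_mask)[0]
        y = y + self.src_attn(self.norms[1](y), enc_out, enc_out)[0]
        y = y + self.ctx_attn(self.norms[2](y), ctx_emb, ctx_emb)[0]   # context fusion
        return y + self.ffn(self.norms[3](y))

# Toy usage: character probabilities would come from a softmax over the final output.
block = ContextualDecoderBlock()
out = block(torch.randn(2, 10, 256), torch.randn(2, 120, 256), torch.randn(2, 40, 256))
```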

Results

1. Effect of Acoustic Contextual Representation: Considering the results shown in rows 1, 3, and 5 of Table 1, it can be inferred that the proposed method improves recognition accuracy even when only the contextual representation of the current utterance Ai is extracted. Speech recognition performance improves further after simultaneously incorporating the contextual representations of the previous utterance Ai-1 and the current utterance Ai.

It should be noted that in Table 1 below, AcousticCur refers to the model that uses the text embedding of the current sentence, AcousticCon refers to the model that uses the text embeddings of both the current and the previous sentence, and ExtLM denotes the external language models used in ASR decoding.

Table 1: CER comparison of different end-to-end models on three Mandarin conversational datasets. (source: arxiv)

2. Effect of the pre-trained Wav2Vec2.0 model: As shown in Table 2, even though the pre-trained Wav2Vec2.0 model improves recognition accuracy compared to the baseline model, the proposed method (AcousticCon) gives significantly better results. This indicates that the proposed model utilizes the representational capability of the pre-trained model and successfully extracts cross-modal textual representations.

Table 2: CER results of the baseline, the pre-trained Wav2Vec2.0 model, and the proposed AcousticCon model. (source: arxiv)

3. Length and location of historical information: The performance of the ASR model is generally influenced by the location and length of the historical speech utterances used to extract contextual representations. The effects of history length and location were examined on the HKUST and MagicData sets. From Table 3, it can be inferred that the closer a sentence is to the current sentence, the more useful it is for improving the recognition accuracy of the current sentence. However, inputting the textual features of the previous two sentences together does not produce better results. This may be due to the decoder's inability to learn appropriate attention over longer historical context.

In the following table, AcousticConone refers to AcousticCon with the previous sentence, and AcousticContwo refers to AcousticCon with the previous two sentences.

Table 3: Comparison of length and location based on historical information. (source: arxiv)

Conclusion

To sum it up, in this article, we learned the following:

1. Contextual information can be leveraged to boost the performance of conversational ASR systems. In light of this, a cross-modal representation extractor is proposed for learning contextual information from speech and using the resulting representations for conversational ASR through attention mechanisms.

2. The cross-modal representation extractor consists of two pre-trained single-modal encoders, a pre-trained speech model (Wav2Vec2.0) and a language model (RoBERTa), which extract high-level latent features from speech and the associated transcript (text), and a cross-modal encoder whose purpose is to learn the relationship between speech and text.

3. When training the conversational ASR system, the extractor is frozen to extract the textual representation of the previous speech; the extracted representation is then fed to the ASR decoder as context via the attention mechanism.

4. Textual representations extracted from the current and previous speech are sent to the ASR decoder module, which reduces the relative CER by 16% on three Mandarin conversational datasets (MagicData, DDT, and HKUST) and outperforms both the vanilla Conformer model and the CVAE-Conformer model.

That ends this article. Thanks for reading. If you have any questions or concerns, please post them in the comments section below. Happy Learning!

The media shown in this article is not owned by Analytics Vidhya and is used at the sole discretion of the author.
