|
# Generating EEG features from Acoustic features |
|
|
|
Gautam Krishna |
|
|
|
_Brain Machine Interface Lab_ |
|
|
|
_The University of Texas at Austin_ |
|
|
|
Austin, Texas |
|
|
|
Co Tran |
|
|
|
_Brain Machine Interface Lab_ |
|
|
|
_The University of Texas at Austin_ |
|
|
|
Austin, Texas |
|
|
|
Mason Carnahan* |
|
|
|
_Brain Machine Interface Lab_ |
|
|
|
_The University of Texas at Austin_ |
|
|
|
Austin, Texas |
|
|
|
###### Abstract |
|
|
|
In this paper we demonstrate predicting electroencephalography (EEG) features from acoustic features using a recurrent neural network (RNN) based regression model and a generative adversarial network (GAN). We predict various types of EEG features from acoustic features. We compare our results with the previously studied problem of speech synthesis using EEG. Our results demonstrate that EEG features can be generated from acoustic features with lower root mean square error (RMSE) and normalized RMSE values than generating acoustic features from EEG features (i.e., speech synthesis using EEG) when tested on the same data sets.
|
|
|
electroencephalography (EEG), deep learning |
|
|
|
## I Introduction |
|
|
|
Electroencephalography (EEG) is a non-invasive way of measuring the electrical activity of the human brain. EEG sensors are placed on the scalp of a subject to obtain the EEG recordings. The references [1, 2, 3] demonstrate that EEG features can be used to perform isolated and continuous speech recognition, where EEG signals recorded while subjects were speaking or listening are translated to text using automatic speech recognition (ASR) models. In [4] the authors demonstrated synthesizing speech from invasive electrocorticography (ECoG) signals using deep learning models. Similarly, in [2, 5] the authors demonstrated synthesizing speech from EEG signals using deep learning models, and reported results for different types of EEG feature sets. Speech synthesis and speech recognition using EEG features might help restore speech for people with speaking disabilities or people who are not able to speak.
|
|
|
In this paper we investigate whether it is possible to predict EEG features from acoustic features. This problem can be formulated as the inverse problem of EEG based speech synthesis, in which acoustic features are predicted from EEG features, as demonstrated in [2, 5]. Predicting EEG features or signatures from unique acoustic patterns might help in better understanding how the human brain processes speech perception and production. Recording EEG signals in a laboratory is a time consuming and expensive process which requires specialized EEG sensors and amplifiers. Because it is much easier to record speech or audio signals, a computer model which can generate EEG features from acoustic features might also help speed up EEG data collection, especially for the task of collecting EEG data for speech recognition experiments.
|
|
|
In [6] the authors demonstrated medical time series generation using conditional generative adversarial networks [7] for toy data sets. Other related work includes [8], where the authors demonstrated generating EEG for a motor task using Wasserstein generative adversarial networks [9]. Similarly, in [10] the authors generate synthetic EEG using various generative models for the task of steady state visual evoked potential classification. In [11] the authors demonstrated EEG data augmentation for the task of emotion recognition. Our work focuses only on generating EEG features from acoustic features.
|
|
|
We first performed experiments using the model used by the authors in [5], and then performed experiments using generative adversarial networks (GAN) [12]. In this work we predict the various EEG feature sets introduced by the authors in [2] from acoustic features extracted from the speech of the subjects, as well as from acoustic features extracted from the utterances that the subjects were listening to.
|
|
|
Our results demonstrate that predicting EEG features from acoustic features seems to be easier than predicting acoustic features from EEG features, as the root mean square error (RMSE) values at test time were much lower for predicting EEG features from acoustic features than for its inverse problem when tested on the same data sets. To the best of our knowledge, this is the first time predicting EEG features from acoustic features has been demonstrated using deep learning models.
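
The two error measures used for the comparison can be illustrated with a minimal sketch. The text does not state how the RMSE is normalized, so dividing by the range of the reference values is an assumption made here for illustration only; the function names and the toy values are likewise hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences of floats."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def normalized_rmse(y_true, y_pred):
    """RMSE divided by the range of the reference values (assumed convention)."""
    value_range = max(y_true) - min(y_true)
    if value_range == 0:
        raise ValueError("reference values have zero range")
    return rmse(y_true, y_pred) / value_range


# Toy values standing in for one EEG feature over four time steps.
real_eeg = [0.0, 1.0, 2.0, 3.0]
fake_eeg = [0.5, 1.0, 1.5, 3.0]
print(rmse(real_eeg, fake_eeg), normalized_rmse(real_eeg, fake_eeg))
```

Normalizing makes errors comparable between targets with different scales, which matters when comparing EEG prediction against MFCC prediction.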
|
|
|
## II Regression and GAN model |
|
|
|
The regression model we used in this work is very similar to the one used by the authors in [5]. We used the exact training parameters used by the authors in [5] for batch size, number of training epochs, learning rate, etc. In [5] the authors used only gated recurrent unit (GRU) [13] layers in their model, but in this work we also performed experiments using bidirectional GRU (Bi-GRU) layers, where the outputs of a forward GRU layer and a backward GRU layer are concatenated to produce the output of the Bi-GRU layer. The architecture of our regression model is described in Figure 1. The model takes acoustic features, namely mel-frequency cepstral coefficients (MFCC) of dimension 13, as input and outputs EEG features of a specific dimension at every time step. The dimension of the output EEG features depends on the EEG feature set used during training, as each EEG feature set has a different dimension. The time distributed dense layer in the model has a number of hidden units equal to the dimension of the EEG feature set used. The mean squared error (MSE) was used as the regression loss function for the model. Figure 4 shows the training convergence for the regression model when Bi-GRU layers were used. There were two Bi-GRU layers with 256 and 128 hidden units respectively.
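
The concatenation of forward and backward GRU outputs can be made concrete with a small, untrained sketch in plain Python. It is not the authors' implementation. The weight initialization, the toy layer width of 4 units (instead of 256 or 128) and the update convention `h = z * h_prev + (1 - z) * candidate` are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def init_gru(in_dim, units, rng):
    """Random weights for one GRU direction: W acts on the input, U on the state."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {g: (mat(units, in_dim), mat(units, units), [0.0] * units) for g in "zrh"}


def gru_step(params, x, h):
    """One GRU update with update gate z, reset gate r and candidate state."""
    def gate(name, state, activation):
        w, u, b = params[name]
        return [activation(a + c + d)
                for a, c, d in zip(matvec(w, x), matvec(u, state), b)]
    z = gate("z", h, sigmoid)
    r = gate("r", h, sigmoid)
    candidate = gate("h", [ri * hi for ri, hi in zip(r, h)], math.tanh)
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h, candidate)]


def run_gru(params, sequence, units):
    """Return the hidden state at every time step."""
    h, outputs = [0.0] * units, []
    for x in sequence:
        h = gru_step(params, x, h)
        outputs.append(h)
    return outputs


def bi_gru(fwd, bwd, sequence, units):
    """Concatenate forward and time-aligned backward states at each step."""
    forward = run_gru(fwd, sequence, units)
    backward = run_gru(bwd, sequence[::-1], units)[::-1]  # re-align with time
    return [f + b for f, b in zip(forward, backward)]  # 2 * units per step


rng = random.Random(0)
T, MFCC_DIM, UNITS = 5, 13, 4  # toy sizes; only MFCC_DIM = 13 comes from the text
mfcc = [[rng.gauss(0.0, 1.0) for _ in range(MFCC_DIM)] for _ in range(T)]
fwd, bwd = init_gru(MFCC_DIM, UNITS, rng), init_gru(MFCC_DIM, UNITS, rng)
out = bi_gru(fwd, bwd, mfcc, UNITS)
print(len(out), len(out[0]))  # T time steps, 2 * UNITS features per step
```

Because the two directions are concatenated, a Bi-GRU layer with 256 hidden units passes 512 features per time step to the next layer, and the layer with 128 hidden units passes 256.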
|
|
|
A generative adversarial network (GAN) [12] consists of two networks, the generator model and the discriminator model, which are trained simultaneously. The generator model learns to generate data from a latent space, and the discriminator model evaluates whether the data generated by the generator is fake or is from the true data distribution. The training objective of the generator is to fool the discriminator. The main motivation for performing experiments using a GAN is that in a GAN the loss function is learned, whereas in regression a fixed loss function (MSE) is used. However, GAN models are extremely difficult to train.
|
|
|
Our generator model, as shown in Figure 2, consists of two Bi-GRU layers with 256 and 128 hidden units respectively, followed by a time distributed dense layer with hidden units equal to the dimension of the EEG feature set. During training, real MFCC features of dimension 13 from the training set are fed into the generator model, and the generator outputs a vector of dimension equal to the EEG feature set dimension, which can be considered as fake EEG.
|
|
|
The discriminator model, as described in Figure 3, consists of two single layered Bi-GRUs with 256 and 128 hidden units connected in parallel. At each training step a pair of inputs is fed into the discriminator. The discriminator takes (real MFCC features, fake EEG) and (real MFCC features, real EEG) pairs. The outputs of the two parallel Bi-GRUs are concatenated and then fed to a GRU layer with 128 hidden units. The output at the last time step of the GRU layer is fed into a dense layer with a sigmoid activation function.
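
The data flow through the discriminator can be traced with a short shape calculation. This is a sketch under stated assumptions. The text does not say which parallel branch receives which input, so the 256-unit Bi-GRU is assumed here to read the MFCC sequence and the 128-unit Bi-GRU the EEG sequence. The EEG dimension of 30 and the 300 time steps are hypothetical placeholders. The merged width does not depend on the branch assignment.

```python
def bi_gru(shape, units):
    """Per-step output of a Bi-GRU: forward and backward states concatenated."""
    time_steps, _ = shape
    return (time_steps, 2 * units)


def concat(shape_a, shape_b):
    """Concatenate two sequences along the feature axis."""
    if shape_a[0] != shape_b[0]:
        raise ValueError("sequences must have the same number of time steps")
    return (shape_a[0], shape_a[1] + shape_b[1])


def gru_last_step(shape, units):
    """A unidirectional GRU that keeps only its final hidden state."""
    return (units,)


def dense(shape, units):
    """A dense layer replaces the last axis with `units` outputs."""
    return shape[:-1] + (units,)


def discriminator_shapes(time_steps, eeg_dim, mfcc_dim=13):
    mfcc_branch = bi_gru((time_steps, mfcc_dim), 256)  # assumed branch assignment
    eeg_branch = bi_gru((time_steps, eeg_dim), 128)    # assumed branch assignment
    merged = concat(mfcc_branch, eeg_branch)
    summary = gru_last_step(merged, 128)
    probability = dense(summary, 1)  # sigmoid output: real or fake pair
    return {
        "mfcc_branch": mfcc_branch,
        "eeg_branch": eeg_branch,
        "merged": merged,
        "summary": summary,
        "probability": probability,
    }


# eeg_dim=30 and time_steps=300 are hypothetical values, not taken from the text.
for name, shape in discriminator_shapes(time_steps=300, eeg_dim=30).items():
    print(name, shape)
```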
|
|
|
In order to define the loss functions for our generator and discriminator models, let us first define a few terms. Let \(P_{s_{f}}\) be the sigmoid output of the discriminator for the (real MFCC features, fake EEG) input pair and let \(P_{s_{e}}\) be the sigmoid output of the discriminator for the (real MFCC features, real EEG) input pair during training. We then define the loss function of the generator as \(-\log(P_{s_{f}})+0.5\,(\mathrm{realEEG}-\mathrm{fakeEEG})^{2}\) and the loss function of the discriminator as \(-\log(P_{s_{e}})-\log(1-P_{s_{f}})\). The weights of the Bi-GRU layers in the generator model were initialized with the weights of the regression model for easier training. At test time, the trained generator model takes acoustic features (MFCC) from the test set as input and produces EEG features as output.
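
The two loss functions can be evaluated directly for given discriminator outputs, as in the following minimal sketch. The text does not say how the squared difference is reduced over time steps and feature dimensions, so taking the mean is an assumption here. The clipping constant `EPS` is also an assumption, added only to keep the logarithm finite.

```python
import math

EPS = 1e-7  # assumed clipping constant, not taken from the text


def _clip(p):
    """Keep a probability strictly inside (0, 1) so that log() stays finite."""
    return min(max(p, EPS), 1.0 - EPS)


def generator_loss(p_sf, real_eeg, fake_eeg):
    """-log(P_sf) + 0.5 * mean squared difference between real and fake EEG."""
    mse = sum((r - f) ** 2 for r, f in zip(real_eeg, fake_eeg)) / len(real_eeg)
    return -math.log(_clip(p_sf)) + 0.5 * mse


def discriminator_loss(p_se, p_sf):
    """-log(P_se) - log(1 - P_sf)."""
    return -math.log(_clip(p_se)) - math.log(1.0 - _clip(p_sf))


# Toy values: the discriminator is undecided about the fake pair.
g_loss = generator_loss(0.5, real_eeg=[1.0, 2.0], fake_eeg=[1.0, 3.0])
d_loss = discriminator_loss(p_se=0.9, p_sf=0.1)
print(g_loss, d_loss)
```

The generator loss falls as \(P_{s_{f}}\) approaches 1, that is, as the discriminator is fooled, while the squared-error term keeps the generated EEG close to the real EEG.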
|
|
|
Figure 6 shows the generator model training loss and Figure 7 shows the discriminator model training loss. The GAN model was trained for 200 epochs using the Adam optimizer with a batch size of 32.
|
|
|
Fig. 1: Regression Model |
|
|
|
Fig. 2: Generator in GAN Model |
|
|
|
## III Data Sets used for performing experiments |
|
|
|
We used the data set used by the authors in [5] for performing experiments. The data set contains simultaneous speech and EEG recordings for four subjects. For each subject we used 80% of the data as the training set, 10% as the validation set and the remaining 10% as the test set. This was the main data set used in this work for comparisons. More details of the data set are covered in [5]. We will refer to this data set as data set A in this paper.
|
|
|
We also performed some experiments using the data set used by the authors in [2]. For this data set we did not perform experiments for each subject; instead we used 80% of the total data as the training set, 10% as the validation set and the remaining 10% as the test set. More details of the data set are covered in [2]. We will refer to this data set as data set B in this paper. The train-test split was done randomly.
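
A random 80/10/10 split of this kind can be sketched as follows. The function name, the fixed seed and the choice to split by recording index are assumptions for illustration. The text does not describe the exact splitting procedure.

```python
import random


def split_indices(n_items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle item indices and cut them into train, validation and test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility (assumed)
    n_train = round(n_items * train_frac)
    n_val = round(n_items * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains, about 10%
    return train, val, test


train, val, test = split_indices(100)
print(len(train), len(val), len(test))
```

For data set A the same function would be applied once per subject. For data set B it would be applied once to the pooled recordings.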
|
|
|
The EEG data in these data sets were recorded using wet EEG electrodes. In total 32 EEG sensors were used, including one electrode as ground, as shown in Figure 5. The Brain Products ActiChamp EEG amplifier was used in the experiments to collect data.
|
|
|
## IV EEG feature extraction details |
|
|
|
We followed the same preprocessing methods used by the authors in [1, 2, 3, 5] for preprocessing the EEG and speech data.
|
|
|
EEG signals were sampled at 1000 Hz and a fourth order IIR band pass filter with cutoff frequencies of 0.1 Hz and 70 Hz was applied. A notch filter with a center frequency of 60 Hz was used to remove power line noise. EEGLAB's [14] independent component analysis (ICA) toolbox was used to remove biological signal artifacts such as electrocardiography (ECG), electromyography (EMG) and electrooculography (EOG) from the EEG signals. We then extracted the three EEG feature sets explained by the authors in [2]. The details of each EEG feature set are covered in [2]. Each EEG feature set was extracted at a sampling frequency of 100 Hz for each EEG channel [3].
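
The effect of the 60 Hz notch step can be illustrated with a second-order (biquad) notch filter in plain Python. This is a sketch, not the filter used in the experiments. The text gives neither the notch design nor its bandwidth, so the biquad form and the quality factor `q=30` are assumptions. The band pass filter and the ICA step are not covered here.

```python
import math


def notch_coefficients(f0, fs, q):
    """Biquad notch centered at f0 Hz, normalized so that a[0] == 1."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = [1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0]
    a = [1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0]
    return b, a


def apply_biquad(b, a, x):
    """Direct form I: y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))


FS = 1000.0  # EEG sampling rate from the text
b, a = notch_coefficients(60.0, FS, q=30.0)  # q is an assumed value
t = [n / FS for n in range(2000)]
hum = apply_biquad(b, a, [math.sin(2.0 * math.pi * 60.0 * s) for s in t])
slow = apply_biquad(b, a, [math.sin(2.0 * math.pi * 10.0 * s) for s in t])
# Compare the second half only, after the filter transient has decayed.
print(rms(hum[1000:]), rms(slow[1000:]))
```

The 60 Hz component is suppressed almost completely, while a 10 Hz component inside the EEG band passes essentially unchanged.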
|
|
|
The recorded speech signal was sampled at 16 kHz. We extracted mel-frequency cepstral coefficients (MFCC) of dimension 13 as features for the speech signal. The MFCC features were also sampled at 100 Hz, the same as the sampling frequency of the EEG features.
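
Extracting both feature streams at 100 Hz means that each MFCC frame has a matching EEG feature frame, which the per-time-step regression relies on. The arithmetic can be checked with a short sketch. The helper names are hypothetical, and window length and padding are ignored because the text does not specify them.

```python
def hop_length(sample_rate_hz, feature_rate_hz):
    """Number of raw samples between consecutive feature frames."""
    if sample_rate_hz % feature_rate_hz != 0:
        raise ValueError("feature rate must divide the sample rate")
    return sample_rate_hz // feature_rate_hz


def n_frames(n_samples, hop):
    """Number of complete hops in a signal (windowing and padding ignored)."""
    return n_samples // hop


AUDIO_SR, EEG_SR, FEATURE_RATE = 16_000, 1_000, 100  # values from the text
audio_hop = hop_length(AUDIO_SR, FEATURE_RATE)  # 160 audio samples per frame
eeg_hop = hop_length(EEG_SR, FEATURE_RATE)      # 10 EEG samples per frame

seconds = 3  # hypothetical utterance length
audio_frames = n_frames(seconds * AUDIO_SR, audio_hop)
eeg_frames = n_frames(seconds * EEG_SR, eeg_hop)
print(audio_hop, eeg_hop, audio_frames, eeg_frames)
```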
|
|
|
Fig. 4: Bi-GRU training loss convergence |
|
|
|
Fig. 5: EEG channel locations for the cap used in our experiments |
|
|
|
Fig. 3: Discriminator in GAN Model |
|
|
|
Fig. 6: Generator training loss |