From here (https://bit.ly/2MAHjRd, Learning the Speech Front-end With Raw Waveform CLDNNs)? Or can someone hint at how to implement this in TensorFlow?
> Our time convolution layer is shown in Figure 1a. First, we take a small window of the raw waveform of length M samples, and convolve the raw waveform with a set of P filters. If we assume each convolutional filter has length N and we stride the convolutional filter by 1, the output from the convolution will be (M − N + 1) × P in time × frequency. Next, we pool the filterbank output in time (thereby discarding short term phase information), over the entire time length of the output signal, to produce 1 × P outputs. Finally, we apply a rectified nonlinearity, followed by a stabilized logarithm compression, to produce a frame-level feature vector at time t, i.e., x_t ∈ ℝ^P. We then shift the window around the raw waveform by a small amount (i.e., 10ms) and repeat this time convolution to produce a set of time-frequency frames at 10ms intervals.
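The quoted steps map quite directly onto TensorFlow ops. Here is a minimal sketch (not the authors' code) assuming TensorFlow 2.x, max pooling over the time axis, and an illustrative stabilizer constant; the window/filter sizes in the example are also just assumptions:

```python
import tensorflow as tf

def time_conv_frontend(window, filters, stabilizer=0.01):
    """Sketch of the time-convolution front-end from the quote.

    window:  [batch, M] raw waveform window of M samples
    filters: [N, 1, P] tensor of P convolutional filters of length N
    Returns a [batch, P] frame-level feature vector.
    """
    x = tf.expand_dims(window, -1)                           # [batch, M, 1]
    y = tf.nn.conv1d(x, filters, stride=1, padding="VALID")  # [batch, M-N+1, P]
    y = tf.reduce_max(y, axis=1)                             # pool over entire time length -> [batch, P]
    y = tf.nn.relu(y)                                        # rectified nonlinearity
    return tf.math.log(y + stabilizer)                       # stabilized log compression

# Illustrative sizes: M=400 samples (25 ms at 16 kHz), N=160, P=40 filters
M, N, P = 400, 160, 40
filt = tf.random.normal([N, 1, P])
wave = tf.random.normal([2, M])
feat = time_conv_frontend(wave, filt)
print(feat.shape)  # (2, 40)
```

To get frames at 10 ms intervals as described, you would slide this window over the full waveform (e.g. with `tf.signal.frame`) and apply the same filters to each window.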
Same question here. Has anyone dealt with something like this? Or with a CNN+RNN combination?