\documentclass{llncs}
\usepackage{amsbsy}
\input{psfig.sty}
\begin{document}
\pagestyle{empty}
\mainmatter % start of the contributions
\title{Speaker Identification \\* Using Kalman Cepstral Coefficients\thanks{The work was supported
by the Ministry of Education of the Czech Republic, project no.
MSM235200004, and by the Grant Agency of the Czech Republic,
project no. 102/96/K087.}}
\author{Zden\v{e}k \v{S}venda and Vlasta Radov\'a}
\institute{University of West Bohemia, Department of
Cybernetics,\\
Univerzitn\'\i\ 22, 306 14 Plze\v{n}, Czech Republic\\
\email{$\{$svendaz, radova$\}$@kky.zcu.cz}}
\maketitle
\begin{abstract}
In this paper an approach to speaker identification based on an
estimation of parameters of a linear speech-production model is
presented. The estimation is based on the discrete Kalman
estimator. It is generally supposed that the vocal tract can be
modeled by a system with constant parameters over short intervals.
Taking this assumption into account we can derive a special form
of the discrete Kalman estimator for the model of speech
production. The parameters of the vocal tract model obtained by
this Kalman estimation were then used to compute a new type of
cepstral coefficients which we call Kalman cepstral
coefficients (KCCs). These coefficients were used in
text-independent speaker identification experiments based on
discrete vector quantization. The achieved results were then
compared with results obtained using LPC-derived cepstral
coefficients (LPCCs). The experiments were performed in a closed
group of 591 speakers (312 male, 279 female).
\end{abstract}
\section{Introduction}
There are many methods of signal processing. One set of methods
assumes that the combination of the glottal cords, vocal tract, and
radiation can be modeled by a simple system with a single
all-pole transfer function whose parameters are unknown. The
parameters of such a model then have to be estimated. For this
purpose either a very popular method of linear predictive coding
\cite{sveradbib:mamzharam96}, \cite{sveradbib:psu95} or methods
based on a Kalman estimator \cite{sveradbib:macjai85} can be used.
Another set of signal processing methods analyzes the signal in the
frequency domain, for example using a filterbank. In this way, the
energy in particular frequency bands can be obtained. Mel-frequency
cepstral coefficients can be derived from this energy
\cite{sveradbib:radsve99}.
\section{Model of Speech Production}
The speech waveform is an acoustic pressure wave that originates
from voluntary physiological movements of anatomical structures
such as the vocal cords, vocal tract, nasal cavity, tongue and
lips. Human speech production can be modeled by a linear filter
where the glottal cords, vocal tract, and radiation are
individually modeled as linear filters
\cite{sveradbib:mamzharam96}, \cite{sveradbib:psu95}. The input of
the total filter is either a quasi-periodic impulse sequence for
voiced sounds or a random noise sequence for unvoiced sounds, with
a gain factor $G$ set to control the intensity of the excitation.
The glottal cords can be modeled by a second-order low-pass filter
$G(z)$. The vocal tract model $V(z)$ can be described as an all-pole
model, where each pole of this model corresponds to a formant or
resonance frequency of the sound. The radiation model $R(z)$
describes the air pressure at the lips and can be reasonably
approximated by a first-order backward difference. Combining the
glottal pulse, vocal tract, and radiation yields a single all-pole
transfer function given by
\begin{equation} \label{eq:sveradrov1}
H\left(z\right)=G\left(z\right)V\left(z\right)R\left(z\right)=\frac{G}{1+\sum\limits_{i=1}^{Q}a_{i}z^{-i}}\;,
\end{equation}
where $a_{i}$ are the unknown parameters of the model. With this transfer
function, we get a difference equation for synthesising the speech
samples $y(k)$ as
\begin{equation} \label{eq:sveradrov2}
y\left(k\right)=-\sum_{i=1}^{Q}a_{i}y\left(k-i\right)+Gu\left(k\right)\;,
\end{equation}
where $u(k)$ is the input of the filter.
It can be noted that $y(k)$ is predicted as a linear combination
of the previous $Q$ samples. Therefore, the speech production
model is often called the \textit{linear prediction} (LP) model,
or the \textit{autoregressive model}.
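The difference equation (\ref{eq:sveradrov2}) can be sketched directly. The following Python fragment is a minimal illustration of the synthesis recursion; the function name and the impulse excitation are our choices, not from the paper:

```python
import numpy as np

def synthesize_speech(a, u, G=1.0):
    """Synthesize speech samples y(k) from the all-pole model (2):
    y(k) = -sum_{i=1}^{Q} a_i * y(k - i) + G * u(k)."""
    Q = len(a)
    y = np.zeros(len(u))
    for k in range(len(u)):
        # linear combination of the previous Q samples
        for i in range(1, Q + 1):
            if k - i >= 0:
                y[k] -= a[i - 1] * y[k - i]
        # excitation term
        y[k] += G * u[k]
    return y

# A single pole at z = 0.5 (i.e. a_1 = -0.5) driven by a unit impulse
# yields the impulse response 1, 0.5, 0.25, ...
print(synthesize_speech([-0.5], [1.0, 0.0, 0.0]))
```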
\section{Feature Extraction}
\label{sect:sveradsec3}
Feature extraction can be divided into two steps: first, a set of
predictor coefficients has to be obtained, and then this set has to
be transformed into a feature vector.
In practice, the predictor coefficients $\{a_{i}\}$ describing the
autoregressive model must be computed from the speech signal.
Since speech is a time-varying signal and the vocal-tract
configuration changes over time, an accurate set of predictor
coefficients is adaptively determined over short intervals (10--30
ms) called frames, during which time-invariance is assumed. The
gain $G$ is usually ignored to allow the parameterizations to be
independent of the signal intensity \cite{sveradbib:mamzharam96},
\cite{sveradbib:psu95}.
\subsection{Linear Predictive Coding}
\label{sect:sveradsec3.1} One of the standard methods for
calculating the predictor coefficients is the \textit{autocorrelation} method.
The predictor coefficients can be obtained by solving the equation
\begin{equation} \label{eg:sveradrov3}
\boldsymbol{R}\boldsymbol{a}=\boldsymbol{r}\;,
\end{equation}
where $\boldsymbol{R}$ is the Toeplitz autocorrelation matrix,
$\boldsymbol{a}$ is the vector of predictor coefficients, and
$\boldsymbol{r}$ is the vector of autocorrelation coefficients. It
is assumed that the speech samples are identically zero outside
the frame of interest. A computationally efficient algorithm known
as the Levinson-Durbin recursion can be used to solve this
equation \cite{sveradbib:psu95}.
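A minimal Python sketch of the Levinson-Durbin recursion follows. It uses the sign convention of (\ref{eq:sveradrov1}), i.e. it returns the coefficients $a_i$ of $1+\sum a_i z^{-i}$; the function name and interface are our assumptions:

```python
import numpy as np

def levinson_durbin(r, Q):
    """Solve the normal equations of the autocorrelation method by
    the Levinson-Durbin recursion.  r[0..Q] are autocorrelation
    coefficients; returns the predictor coefficients a_1..a_Q of
    1 + sum_i a_i z^{-i} and the final prediction-error energy."""
    r = np.asarray(r, dtype=float)
    a = np.zeros(Q + 1)
    a[0] = 1.0
    E = r[0]                              # prediction-error energy
    for m in range(1, Q + 1):
        # reflection coefficient for order m
        k = -(r[m] + np.dot(a[1:m], r[m - 1:0:-1])) / E
        a_prev = a.copy()
        a[m] = k
        # order-update of the lower coefficients
        a[1:m] = a_prev[1:m] + k * a_prev[m - 1:0:-1]
        E *= (1.0 - k * k)
    return a[1:], E
```

For an AR(1) signal with autocorrelation $r(m)=0.5^m$ the recursion returns $a_1=-0.5$, $a_2=0$, as expected from the single pole at $z=0.5$.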
\subsection{The Discrete Kalman Estimator}
\label{sect:sveradsec3.2}
In state space, a system can
generally be described by the equation system
\begin{equation} \label{eq:sveradrov4}
\begin{array}{rcl}
x_{k+1} & = & A_{k}x_{k}+B_{k}y_{k}+\Gamma_{k}\xi_{k+1}\;,
\\
y_{k+1} & = & C_{k}x_{k}+D_{k}y_{k}+\Delta_{k}\xi_{k+1}\;,
\end{array}
\end{equation}
where $x_{k}$ is an internal, unmeasurable part of the state, $y_{k}$
is an external, measurable part of the state, $A_{k}$, $B_{k}$, $C_{k}$
and $D_{k}$ are matrices describing the relations among the vectors
$x_{k}$, $y_{k}$, $x_{k+1}$ and $y_{k+1}$, the term
$\Gamma_{k}\xi_{k+1}$ represents system errors, and the term
$\Delta_{k}\xi_{k+1}$ represents additive noise in the
observation process ($\xi_{k+1}$ is white noise).
In an arbitrary step $k$, the state of the system is not known
exactly. The estimate of the actual state is given by the normal
distribution
\begin{equation} \label{eq:sveradrov5}
\left[
\begin{array}{c}
x_{k+1}
\\
y_{k+1}
\end{array}
\right]
\sim N
\left\{
\left[
\begin{array}{c}
\widehat{x}_{k+1}
\\
\widehat{y}_{k+1}
\end{array}
\right],
\left[
\begin{array}{cc}
A_{k}P_{k}A_{k}^{T}+\Gamma_{k}\Gamma_{k}^{T} & A_{k}P_{k}C_{k}^{T}+\Gamma_{k}\Delta_{k}^{T}
\\
C_{k}P_{k}A_{k}^{T}+\Delta_{k}\Gamma_{k}^{T} & C_{k}P_{k}C_{k}^{T}+\Delta_{k}\Delta_{k}^{T}
\end{array}
\right]
\right\}\;,
\end{equation}
where $\widehat{x}_{k}$ and $\widehat{y}_{k}$ are the estimates of
$x_{k}$ and $y_{k}$ in step $k$. On the assumption that the vector
of unknown parameters $x$ is random with known mean $\mu_{0}$ and
covariance $P_{0}$, and independent of the distribution of the
initial state $s_{0}$, the optimal estimate $\mu_{k}$ of the vector
of parameters can be obtained by the recursive algorithm
\begin{equation} \label{eq:sveradrov6}
\begin{array}{rcl}
K_{k+1} & = &
\left(
A_{k}P_{k}C_{k}^{T}+\Gamma_{k}\Delta_{k}^{T}
\right)
\left(
C_{k}P_{k}C_{k}^{T}+\Delta_{k}\Delta_{k}^{T}
\right)^{-1}\;,
\\
\mu_{k+1} & = & A_{k}\mu_{k}+B_{k}y_{k}+K_{k+1}
\left(
y_{k+1}-C_{k}\mu_{k}-D_{k}y_{k}
\right)\;,
\\
P_{k+1} & = &
\left(
A_{k}P_{k}A_{k}^{T}+\Gamma_{k}\Gamma_{k}^{T}
\right)-K_{k+1}
\left(
C_{k}P_{k}A_{k}^{T}+\Delta_{k}\Gamma_{k}^{T}
\right)\;,
\end{array}
\end{equation}
with initial conditions $\mu_{0}$ and $P_{0}$. $K_{k}$ represents the
so-called Kalman gain, $\mu_{k}$ the best a posteriori estimate,
and $P_{k}$ the covariance of the estimate given the observations.
\subsection{A Special Form of the Discrete Kalman Estimator for the Model of Speech Production}
The linear model of speech production (\ref{eq:sveradrov2}) is a
system with time-invariant parameters over a short interval (see
above). This means the transition matrix $A_{k}$ is the identity
over this interval and the matrices $B_{k}$ and $\Gamma_{k}$ are
zero. It can be seen from (\ref{eq:sveradrov2}) that the matrix
$D_{k}$ is zero as well. Taking these assumptions into account, the
linear model of speech production in state representation can be
described by the equation system
\begin{equation} \label{eq:sveradrov7}
\begin{array}{rcl}
x_{k+1} & = & x_{k}=x\;,
\\
y\left(k\right) & = &
C\left(y_{k}\right)x+\Delta_{k}\xi_{k+1}\;,
\end{array}
\end{equation}
where $x=\left[a_{1}, a_{2}, \ldots, a_{Q}\right]^T$,
$y_{k}=\left[y\left(k-1\right), y\left(k-2\right), \ldots,
y\left(k-Q\right)\right]^T$, and
$C\left(y_{k}\right)=\left[-y_{k}\right]^T$. If we take into
account forms of the matrices $A_{k}$, $B_{k}$, $C_{k}$, $D_{k}$
and $\Gamma_{k}$, we can derive a special form of Kalman estimator
(\ref{eq:sveradrov6}) for the linear model of speech production
\begin{equation} \label{eq:sveradrov8}
\begin{array}{rcl}
K_{k+1} & = & -P_{k}y_{k}
\left(
y_{k}^{T}P_{k}y_{k}+\Delta_{k}^{2}
\right)^{-1}\;,
\\
\mu_{k+1} & = &
\left(
I+K_{k+1}y_{k}^{T}
\right)
\mu_{k}+K_{k+1}y\left(k+1\right)\;,
\\
P_{k+1} & = &
\left(
I+K_{k+1}y_{k}^{T}
\right)P_{k}\;,
\end{array}
\end{equation}
where $I$ is the identity matrix.
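A minimal Python sketch of recursion (\ref{eq:sveradrov8}) may clarify the procedure. The regressor vector $y_k$ holds the $Q$ previous samples, and the excitation $Gu(k)$ plays the role of the observation noise $\Delta_k\xi_{k+1}$; the function name and the fixed noise level \texttt{delta} are our assumptions:

```python
import numpy as np

def kalman_predictor(y, Q, delta=1.0, p0=5e6):
    """Estimate the predictor coefficients a_1..a_Q with the special
    form (8) of the discrete Kalman estimator.  The initial
    conditions follow the experiments below: P0 = 5e6 * I, mu0 = 0."""
    y = np.asarray(y, dtype=float)
    mu = np.zeros(Q)                 # current estimate of x = [a_1..a_Q]
    P = p0 * np.eye(Q)
    I = np.eye(Q)
    for k in range(Q, len(y)):
        yk = y[k - Q:k][::-1]        # y_k = [y(k-1), ..., y(k-Q)]
        # Kalman gain: K = -P y_k (y_k^T P y_k + Delta^2)^{-1}
        K = -P @ yk / (yk @ P @ yk + delta ** 2)
        # state and covariance updates of (8)
        mu = (I + np.outer(K, yk)) @ mu + K * y[k]
        P = (I + np.outer(K, yk)) @ P
    return mu
```

On a synthetic signal generated by (\ref{eq:sveradrov2}) with known coefficients and white-noise excitation, the estimate converges to the true coefficients as the number of samples grows.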
\subsection{Cepstrum}
\label{sect:sveradsec3.3} In many speaker identification
applications the predictor coefficients are transformed into
feature vectors consisting of so-called \textit{LPC-derived
cepstral coefficients} (LPCCs). A recursive relation between the
LPC-derived cepstral coefficients and the predictor coefficients
is given as \cite{sveradbib:mamzharam96}, \cite{sveradbib:psu95}
\begin{equation} \label{eq:sveradrov9}
c\left(k\right)=-a_{k}-\sum_{i=1}^{k-1}\left(\frac{i}{k}\right)c\left(i\right)a_{k-i}\;.
\end{equation}
We used this formula to compute the so-called \textit{Kalman
cepstral coefficients} (KCCs) from the predictor coefficients
produced by the Kalman estimator (described in Sect.
\ref{sect:sveradsec3.2}).
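Recursion (\ref{eq:sveradrov9}) can be sketched in a few lines of Python; the same formula yields LPCCs or KCCs depending on how the predictor coefficients were estimated. The function name and the zero-padding of $a_k$ for $k>Q$ are our assumptions:

```python
def lpc_to_cepstrum(a, n_ceps):
    """Compute cepstral coefficients c(1)..c(n_ceps) from predictor
    coefficients a_1..a_Q via recursion (9):
        c(k) = -a_k - sum_{i=1}^{k-1} (i/k) c(i) a_{k-i}.
    For k > Q the coefficients a_k are taken as zero."""
    Q = len(a)
    c = [0.0] * (n_ceps + 1)         # c[0] is unused
    for k in range(1, n_ceps + 1):
        ak = a[k - 1] if k <= Q else 0.0
        # a_{k-i} is a[k - i - 1] with zero-based indexing;
        # terms with k - i > Q vanish because a_{k-i} = 0 there
        c[k] = -ak - sum((i / k) * c[i] * a[k - i - 1]
                         for i in range(1, k) if k - i <= Q)
    return c[1:]
```

As a sanity check, a single pole at $z=0.5$ (i.e. $a_1=-0.5$) gives the known cepstrum $c(k)=0.5^{k}/k$.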
\section{Experiments}
All experiments described in this section were performed with the
following initial conditions: $P_{0}=5\,000\,000\,I$ ($I$ is the
identity matrix) and $\mu_{0}=\left[0,\ldots,0\right]^{T}$.
\subsection{Reconstructed Spectrum}
\label{sect:sveradsec4.1} The predictor coefficients $\{a_{i}\}$
allow the signal spectrum to be computed. Using the substitution
$z=e^{j\omega}$ in the transfer function (\ref{eq:sveradrov1}), we
obtain $|H\left(e^{j\omega}\right)|$, which represents the spectral
envelope of the speech. Fig. \ref{fig:sveradfig1} compares the
reconstructed spectra (dashed line), with the predictor coefficients
$\{a_{i}\}$ estimated by algorithm (\ref{eq:sveradrov8}), against
the spectra evaluated by the Fourier transform (solid line).
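The envelope evaluation can be sketched as follows (a minimal Python illustration; the size of the frequency grid is arbitrary):

```python
import numpy as np

def spectral_envelope(a, G=1.0, n_points=256):
    """Evaluate |H(e^{j*omega})| of (1) on a grid of normalized
    frequencies omega in [0, pi], i.e. the spectral envelope
    implied by the predictor coefficients a_1..a_Q."""
    omega = np.linspace(0.0, np.pi, n_points)
    # denominator 1 + sum_i a_i e^{-j*i*omega}
    denom = 1.0 + sum(ai * np.exp(-1j * (i + 1) * omega)
                      for i, ai in enumerate(a))
    return omega, np.abs(G / denom)
```

For a single pole at $z=0.5$, the envelope equals $2$ at $\omega=0$ and $2/3$ at $\omega=\pi$, matching $|1+a_1e^{-j\omega}|^{-1}$.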
\begin{figure}
\centering
\begin{minipage}[c]{0.5\textwidth}
\centerline{\psfig{figure=a.eps,height=4cm}}
\end{minipage}%
\begin{minipage}[c]{0.5\textwidth}
\centerline{\psfig{figure=e.eps,height=4cm}}
\end{minipage}
\caption{Spectrum of the vowel $a$ (left) and $e$ (right) spoken by a female speaker}
\label{fig:sveradfig1}
\end{figure}
\subsection{Speaker Identification}
\label{sect:sveradsec4.2}
\subsubsection{Database.}
The database consisted of speech signals obtained from 591 speakers
(312 male, 279 female). Every speaker spoke a different set of
short Czech sentences and isolated words during one session. The
speech signal was transferred through a telephone channel, sampled
at an 8 kHz sampling rate, and stored in an 8-bit $\mu$-law format.
Before further processing, the 8-bit $\mu$-law digitized samples
were converted to linear 16-bit PCM samples. About 40 s of the
obtained speech were regarded as training data for each speaker
and were used to form a reference model. The remaining speech data
(on average 30 s per speaker) were used for identification tests.
\subsubsection{Feature Extraction.}
The speech signal was windowed by a 16 ms Hamming window (128
samples) and was not pre-emphasised. For each 16 ms segment, a
12th-order feature vector of either LPCCs or KCCs was formed. The
feature vectors were then mean-normalised, even though this is not
necessary because the training and testing conditions were the
same. Segments of silence were removed from the speech data before
the feature extraction using an adaptive energy-threshold
speech/silence detector.
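The segmentation and normalisation steps above can be sketched as follows. Non-overlapping frames are our simplifying assumption, since the frame shift is not stated:

```python
import numpy as np

def frame_signal(signal, frame_len=128):
    """Cut the signal into non-overlapping 16 ms frames (128 samples
    at 8 kHz) and apply a Hamming window to each frame."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    window = np.hamming(frame_len)
    return [signal[i * frame_len:(i + 1) * frame_len] * window
            for i in range(n_frames)]

def mean_normalise(vectors):
    """Subtract the mean feature vector from every feature vector."""
    X = np.asarray(vectors, dtype=float)
    return X - X.mean(axis=0)
```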
\subsubsection{Vector Quantization.}
Vector quantization is an approach used to reduce an
extensive data set. The data set $X=\left\{x_{i}|i=1,\ldots,
N\right\}$, where $N$ is the number of vectors in the set $X$, is
mapped onto a finite set of $M$ $\left(M\ll N\right)$ codebook
vectors $W=\left\{w_{i}|i=1,\ldots,M\right\}$. Each vector is
assigned to the nearest codebook vector. As a result, every
codebook is composed of centroid vectors (means) representing
nonoverlapping regions in the feature space. Every codebook
obtained in this way is included in the reference database. During
an identification phase each sequence of input vectors of an
unknown speaker is quantized using $K$ codebooks corresponding to
$K$ different speakers. The unknown speaker is then identified as
the reference speaker with the minimum average distortion. This
method of speaker identification is described in more detail in
\cite{sveradbib:radsve99}.
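The identification rule can be illustrated with a short Python sketch. The squared Euclidean distortion measure and the dictionary interface are our assumptions; codebook training itself (e.g. by a k-means-style clustering) is omitted:

```python
import numpy as np

def avg_distortion(vectors, codebook):
    """Average distortion of an input sequence against one codebook:
    each vector is quantized to its nearest codebook vector and the
    squared Euclidean distances are averaged."""
    total = 0.0
    for v in vectors:
        total += min(float(np.sum((v - w) ** 2)) for w in codebook)
    return total / len(vectors)

def identify(vectors, codebooks):
    """Identify the unknown speaker as the reference speaker whose
    codebook yields the minimum average distortion."""
    return min(codebooks,
               key=lambda spk: avg_distortion(vectors, codebooks[spk]))
```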
\subsubsection{Experimental Results.}
In our experiments we formed codebooks of either 80 or 320
vectors. The experimental results are summarised in Table
\ref{tab:sveradtab1}.
\begin{table}[!htbp]
\begin{center}
\caption{Identification results}
\label{tab:sveradtab1}
\begin{tabular}{l r}
\begin{tabular}{|l|c|c|}
\hline
\multicolumn{3}{|c|}{Codebook size 80}
\\ \hline
Coefficients & \# correct & Correct [\%]
\\ \hline
LPCCs & 578 & 97.80
\\ \hline
KCCs & 576 & 97.46
\\ \hline
\end{tabular}
&
\begin{tabular}{|l|c|c|}
\hline
\multicolumn{3}{|c|}{Codebook size 320}
\\ \hline
Coefficients & \# correct & Correct [\%]
\\ \hline
LPCCs & 582 & 98.48
\\ \hline
KCCs & 582 & 98.48
\\ \hline
\end{tabular}
\end{tabular}
\end{center}
\end{table}
As the table shows, in the case of the codebook of 80 vectors,
slightly better results were obtained when the LPCCs were used, but
the difference is insignificant. In the second case, using the
codebook of 320 vectors, the results are identical.
\section{Conclusion}
The results of our experiments show that predictor coefficient
calculation based on the discrete Kalman estimator is usable in
speech processing. It can be used for the reconstruction of
signal spectra (see Sect. \ref{sect:sveradsec4.1}) as well as for
speaker identification applications (see Sect.
\ref{sect:sveradsec4.2}). The handicap of this algorithm is its
very high computational cost.
\begin{thebibliography}{4}
\bibitem{sveradbib:macjai85}
Mack G. A., Jain V. K.: A Compensated-Kalman Speech Parameter
Estimator. IEEE Signal Processing Magazine (1985)
\bibitem{sveradbib:mamzharam96}
Mammone R. J., Zhang X., Ramachandran R. P.: Robust Speaker
Recognition. IEEE Signal Processing Magazine (1996)
\bibitem{sveradbib:psu95}
Psutka J.: Communication with Computer by Speech. Academia, Prague
(1995) (in Czech)
\bibitem{sveradbib:radsve99}
Radov\'a V., \v{S}venda Z.: An Approach to Speaker Recognition
Based on Vector Quantization. First International Conference on
Advanced Engineering Design, Prague (1999)
\end{thebibliography}
\end{document}