REAPER: Robust Epoch And Pitch EstimatoR

This is a speech processing system. The reaper program uses the EpochTracker class to simultaneously estimate the location of voiced-speech "epochs" or glottal closure instants (GCI), voicing state (voiced or unvoiced) and fundamental frequency (F0 or "pitch"). We define the local (instantaneous) F0 as the inverse of the time between successive GCI.
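The definition of local F0 can be illustrated with a few lines of Python. This is only a sketch of the arithmetic, not code from REAPER; the function name `local_f0` and the sample GCI times are made up for illustration.

```python
def local_f0(gci_times):
    """Return one F0 value (Hz) per interval between successive GCI times (seconds)."""
    f0 = []
    for prev, curr in zip(gci_times, gci_times[1:]):
        period = curr - prev
        if period <= 0:
            raise ValueError("GCI times must be strictly increasing")
        f0.append(1.0 / period)  # local F0 is the inverse of the inter-GCI time
    return f0


# Hypothetical GCI times: two 10 ms periods followed by a 12.5 ms period.
gci = [0.100, 0.110, 0.120, 0.1325]
print(local_f0(gci))
```

The result has one entry fewer than the number of GCIs, because each F0 value belongs to the interval between two of them.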

This code was developed by David Talkin at Google. It is not an official Google product (experimental or otherwise); it is just code that happens to be owned by Google.

Downloading and Building reaper

cd convenient_place_for_repository
git clone https://github.com/google/REAPER.git
cd REAPER
mkdir build   # In the REAPER top-level directory
cd build
cmake ..
make

reaper will now be in convenient_place_for_repository/REAPER/build/reaper

You may want to add that directory to your PATH environment variable or move reaper to your favorite bin directory.

Example:

To compute F0 (pitch) and pitchmark (GCI) tracks and write them out as ASCII files:

reaper -i /tmp/bla.wav -f /tmp/bla.f0 -p /tmp/bla.pm -a

Input Signals:

As written, the input stage expects 16-bit, signed integer samples. Any reasonable sample rate may be used, but rates below 16 kHz introduce increasingly coarse quantization of the results, and higher rates incur a quadratic increase in computational requirements without gaining much in output accuracy.
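A quick way to check a file against these expectations before running reaper is to inspect its header with Python's standard `wave` module. This is a convenience sketch, not part of REAPER. The helper names are hypothetical, and the 16 kHz default is used here only as a warning threshold taken from the paragraph above, not as a hard limit enforced by the program.

```python
import wave


def describe_wav(path):
    """Return (sample_width_bytes, sample_rate_hz, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        return wf.getsampwidth(), wf.getframerate(), wf.getnchannels()


def looks_suitable(path, min_rate_hz=16000):
    """True if the file holds 16-bit samples at or above the given rate.

    min_rate_hz is an assumed threshold for this sketch, not a REAPER parameter.
    """
    width, rate, _ = describe_wav(path)
    return width == 2 and rate >= min_rate_hz


# Example (path taken from the command above):
# print(describe_wav("/tmp/bla.wav"))
```

In a PCM WAV file, a sample width of 2 bytes corresponds to 16-bit signed integers.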

While REAPER is fairly robust to recording quality, it is designed for use with studio-quality speech signals, such as those recorded for concatenation text-to-speech systems. For best results, avoid phase distortion, such as that introduced by some close-talking microphones or by well-intended recording-studio filtering, including rumble removal. A rumble filter is provided within REAPER as the recommended (default) high-pass pre-filtering option, and is implemented as a symmetric FIR filter that introduces no phase distortion.

The help text (-h) provided by the reaper program describes various output options, including debug output of some of the feature signals. Of special interest is the residual waveform, which may be used to check for the expected waveshape. (The residual has a .resid filename extension.) During non-nasalized, open vocal tract vocalizations (such as /a/), each period should show a somewhat noisy version of the derivative of the idealized glottal flow. If the computed residual deviates radically from this ideal, the Hilbert transform option (-t) might improve matters.

The REAPER Algorithm:

The process can be broken down into the following phases:

  • Signal Conditioning
  • Feature Extraction
  • Lattice Generation
  • Dynamic Programming
  • Backtrace and Output Generation

Signal Conditioning

DC bias and low-frequency noise are removed by high-pass filtering, and the signal is converted to floating point. If the input is known to have phase distortion that is impacting tracker performance, a Hilbert transform, optionally done at this point, may improve performance.
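The high-pass step can be sketched as a symmetric (linear-phase) FIR filter applied to the samples after conversion to floating point. The windowed-sinc design, the Hann window, the tap count and the cutoff in the test values are all assumptions chosen for illustration; REAPER's actual filter design is not described here. The optional Hilbert transform is omitted.

```python
import math


def highpass_kernel(cutoff_hz, sample_rate_hz, num_taps=101):
    """Symmetric high-pass FIR kernel built as (unit impulse - windowed-sinc low-pass)."""
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd so the kernel has a center tap")
    fc = cutoff_hz / sample_rate_hz  # normalized cutoff, cycles per sample
    mid = num_taps // 2
    lowpass = []
    for n in range(num_taps):
        k = n - mid
        sinc = 2.0 * fc if k == 0 else math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        hann = 0.5 - 0.5 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        lowpass.append(sinc * hann)
    gain = sum(lowpass)
    lowpass = [c / gain for c in lowpass]  # unity gain at DC
    kernel = [-c for c in lowpass]
    kernel[mid] += 1.0  # spectral inversion: the kernel now sums to zero
    return kernel


def filter_centered(samples, kernel):
    """Convolve and compensate the kernel delay so the output aligns with the input."""
    mid = len(kernel) // 2
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for j, c in enumerate(kernel):
            idx = i + mid - j
            if 0 <= idx < len(samples):  # treat samples outside the signal as zero
                acc += c * samples[idx]
        out.append(acc)
    return out
```

Because the kernel is symmetric about its center tap, every frequency is delayed by the same amount, and removing that fixed delay leaves the waveshape undistorted in phase.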

Feature Extraction

The following feature signals are derived from the conditioned input:

  • Linear Prediction residual: This is computed using the autocorrelation method and continuous interpolation of the filter coefficients. It is checked for the expected polarity (negative impulses), and inverted, if necessary.
  • Amplitude-normalized prediction residual: The normalization factor is based on the running, local RMS.
  • Pseudo-probability of voicing: This is based on a local measure of low-frequency energy normalized by the peak energy in the utterance.
  • Pseudo-probability of voicing onset: Based on a forward delta of lowpassed energy.
  • Pseudo-probability of voicing offset: Based on a backward delta of lowpassed energy.
  • Graded GCI candidates: Each negative peak in the normalized residual is compared with the local RMS. Peaks exceeding a threshold are selected as GCI candidates, and then graded by a weighted combination of peak amplitude, skewness, and sharpness. Each of the resulting candidates is associated with the other feature values that occur closest in time to the candidate.
  • Normalized cross-correlation functions (NCCF) for each GCI candidate: The correlations are computed on a weighted combination of the speech signal and its LP residual. The correlation reference window for each GCI candidate impulse is centered on the impulse, and correlations are computed for all lags in the expected pitch period range.
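As an illustration of the last feature, the following sketch computes an NCCF for one candidate, with the reference window centered on the candidate's sample index. It is a simplification under stated assumptions: it correlates a single signal rather than a weighted combination of speech and LP residual, and the function names, window length and lag range are hypothetical.

```python
import math


def nccf(signal, center, window, lag):
    """Normalized cross-correlation of a window centered at `center` with the same window shifted by `lag`."""
    start = center - window // 2
    if start < 0 or start + lag + window > len(signal):
        raise ValueError("window or lagged window falls outside the signal")
    ref = signal[start:start + window]
    shifted = signal[start + lag:start + lag + window]
    dot = sum(a * b for a, b in zip(ref, shifted))
    energy = math.sqrt(sum(a * a for a in ref) * sum(b * b for b in shifted))
    return dot / energy if energy > 0.0 else 0.0


def best_lag(signal, center, window, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the highest NCCF value."""
    scores = {lag: nccf(signal, center, window, lag) for lag in range(min_lag, max_lag + 1)}
    return max(scores, key=scores.get)
```

For a strictly periodic signal, the NCCF approaches 1.0 at a lag equal to the period, which is why its maxima serve as period evidence.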

Lattice Generation

Each GCI candidate (pulse) is set into a lattice structure that links preceding and following pulses that occur within the minimum and maximum pitch period limits being considered for the utterance. These links establish all of the period hypotheses that will be considered for the pulse. Each hypothesis is scored on "local" evidence derived from the NCCF and peak quality measures. Each pulse is also assigned an unvoiced hypothesis, which is also given a score based on the available local evidence. The lattice is checked and, if necessary, modified to ensure that each pulse has at least one voiced and one unvoiced hypothesis preceding and following it, to maintain continuity for the dynamic programming to follow. (Note that the "scores" are used as costs during dynamic programming, so that low scores encourage selection of hypotheses.)
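The linking step alone can be sketched as follows. The function name and data layout are assumptions for illustration, and the sketch leaves out hypothesis scoring, the unvoiced hypotheses and the continuity check described above.

```python
def build_lattice(pulse_times, min_period, max_period):
    """Map each pulse index to the indices of later pulses between min_period and max_period away.

    pulse_times must be sorted in increasing order; all values are in seconds.
    """
    links = {i: [] for i in range(len(pulse_times))}
    for i, t_i in enumerate(pulse_times):
        for j in range(i + 1, len(pulse_times)):
            period = pulse_times[j] - t_i
            if period > max_period:
                break  # times are sorted, so every later pulse is further away still
            if period >= min_period:
                links[i].append(j)  # one voiced period hypothesis: pulse i -> pulse j
    return links


# Hypothetical candidate times, with period limits of 2.5 ms and 25 ms.
print(build_lattice([0.000, 0.002, 0.010, 0.020, 0.050], 0.0025, 0.025))
```

A pulse with an empty link list has no voiced hypothesis following it, which is the kind of gap the lattice check has to repair.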

Dynamic Programming

For each pulse in the utterance:
  For each period hypothesis following the pulse:
    For each period hypothesis preceding the pulse:
      Score the transition cost of connecting the periods.  Choose the
      minimum overall cost (cumulative+local+transition) preceding
      period hypothesis, and save its cost and a backpointer to it.
      The costs of making a voicing state change are modulated by the
      probability of voicing onset and offset.  The cost of
      voiced-to-voiced transition is based on the delta F0 that
      occurs, and the cost of staying in the unvoiced state is a
      constant system parameter.
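The loop above has the shape of a standard minimum-cost path search with backpointers. The sketch below shows that shape under several simplifying assumptions that are not taken from REAPER: every hypothesis at one pulse may connect to every hypothesis at the next, the voiced-to-voiced cost is the absolute log ratio of the two F0 values, and the weights are made-up constants.

```python
import math


def transition_cost(prev_f0, next_f0, p_onset, p_offset,
                    delta_weight=1.0, switch_cost=1.0, unvoiced_cost=0.1):
    """Cost of connecting two hypotheses; an F0 of 0.0 means unvoiced. All weights are hypothetical."""
    if prev_f0 > 0.0 and next_f0 > 0.0:
        return delta_weight * abs(math.log(next_f0 / prev_f0))  # penalize F0 change
    if prev_f0 == 0.0 and next_f0 == 0.0:
        return unvoiced_cost  # constant cost of staying unvoiced
    if prev_f0 == 0.0:
        return switch_cost * (1.0 - p_onset)  # onset is cheaper when onset is likely
    return switch_cost * (1.0 - p_offset)  # offset is cheaper when offset is likely


def best_path(stages):
    """stages: list of dicts with 'hyps' = [(f0, local_cost), ...], 'p_onset', 'p_offset'.

    Returns the F0 of the chosen hypothesis at each stage.
    """
    cum = [local for _, local in stages[0]["hyps"]]
    back = []
    for t in range(1, len(stages)):
        prev_hyps, stage = stages[t - 1]["hyps"], stages[t]
        new_cum, pointers = [], []
        for f0, local in stage["hyps"]:
            options = [cum[k] + transition_cost(pf0, f0, stage["p_onset"], stage["p_offset"])
                       for k, (pf0, _) in enumerate(prev_hyps)]
            k_best = min(range(len(options)), key=options.__getitem__)
            new_cum.append(options[k_best] + local)  # cumulative + transition + local
            pointers.append(k_best)  # backpointer to the best preceding hypothesis
        cum = new_cum
        back.append(pointers)
    k = min(range(len(cum)), key=cum.__getitem__)
    path = [k]
    for pointers in reversed(back):  # follow backpointers from the last stage
        k = pointers[k]
        path.append(k)
    path.reverse()
    return [stages[t]["hyps"][k][0] for t, k in enumerate(path)]
```

Keeping only the best predecessor for each hypothesis is what makes the search linear in the number of pulses rather than exponential in the number of possible paths.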

Backtrace and Output Generation

Starting at the last peak in the utterance, the lowest cost period candidate ending on that peak is found. This is the starting point for backtracking. The backpointers to the best preceding period candidates are then followed backwards through the utterance. As each "best candidate" is found, the time location of the terminal peak is recorded, along with the F0 corresponding to the period, or 0.0 if the candidate is unvoiced. Instead of simply taking the inverse of the period between GCI estimates as F0, the system refers back to the NCCF for that GCI, and takes the location of the NCCF maximum closest to the GCI-based period as the actual period. The output array of F0 and estimated GCI location is then time-reversed for final output.
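The period refinement described above can be sketched as picking, among the local maxima of a candidate's NCCF, the one whose lag is closest to the GCI-based period. The function names and the dictionary layout are assumptions for illustration, as is the fallback to the GCI-based period when no interior maximum exists.

```python
def refine_period(nccf_by_lag, gci_period_samples):
    """Return the lag of the NCCF local maximum closest to the GCI-based period.

    nccf_by_lag maps lag (in samples) to NCCF value.
    """
    lags = sorted(nccf_by_lag)
    peaks = [lag for prev, lag, nxt in zip(lags, lags[1:], lags[2:])
             if nccf_by_lag[lag] > nccf_by_lag[prev] and nccf_by_lag[lag] >= nccf_by_lag[nxt]]
    if not peaks:
        return gci_period_samples  # assumed fallback: keep the GCI-based period
    return min(peaks, key=lambda lag: abs(lag - gci_period_samples))


def f0_from_period(period_samples, sample_rate_hz):
    """Convert a period in samples to F0 in Hz."""
    return sample_rate_hz / period_samples


# Hypothetical NCCF values with local maxima at lags 80 and 83.
nccf_values = {78: 0.2, 79: 0.5, 80: 0.9, 81: 0.6, 82: 0.3, 83: 0.4, 84: 0.1}
period = refine_period(nccf_values, gci_period_samples=81)
print(period, f0_from_period(period, 16000))
```

Choosing the nearest maximum rather than the global one keeps the refined period consistent with the GCI pair selected by the dynamic programming.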