Department of Electrical Engineering - Columbia University


ELEN E4896 - Spring 2014

MUSIC SIGNAL PROCESSING


Practicals

This page contains descriptions and instructions for weekly practical sessions.


Weds 2014-04-30: Unmixing

[unmixing image]

This week we will experiment with separating musical sources by using Nonnegative Matrix Factorization (NMF) to decompose a short-time Fourier transform magnitude (i.e., a spectrogram) into a dictionary of spectral slices and their time activations, from which time-frequency masks can be built to isolate sources. We will be using nmf_kl_sparse_v.m, an implementation of KL-based NMF separation using Virtanen's approach to sparsity, taken from Graham Grindlay's NMFlib. You can download all the pieces for this practical in prac13.zip, and you can read about the underlying algorithm in: T. Virtanen, Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria, IEEE Tr. Audio, Speech & Lang. Proc., 15(3):1066-1074, March 2007.
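
For reference, the heart of KL-based NMF is just a pair of multiplicative updates. The sketch below is a plain Matlab illustration, without the sparsity and temporal-continuity terms that nmf_kl_sparse_v.m adds, showing the basic iteration on a nonnegative matrix V such as a spectrogram magnitude:

% Minimal KL-NMF sketch: V is any nonnegative matrix (e.g. a spectrogram magnitude)
V = rand(257, 500); r = 40;                 % stand-in data; in practice V = abs(D)
[nf, nt] = size(V);
W = rand(nf, r); H = rand(r, nt);
for iter = 1:100
  % multiplicative updates that reduce the KL divergence between V and W*H
  H = H .* (W' * (V ./ (W*H + eps))) ./ (sum(W,1)' * ones(1, nt));
  W = W .* ((V ./ (W*H + eps)) * H') ./ (ones(nf,1) * sum(H,2)');
end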

The Matlab script below leads you through how to use these tools and data:

% Load piano example
[d,sr] = wavread('gould-wtc1.wav');

% Can we separate first two notes from rest of first arpeggio?
% We'll use stft.m, which is like specgram but has istft.m
fftlen = 512;
D = stft(d, fftlen);
% Make the axis values and plot as spectrogram
tt = [0:size(D,2)]*(fftlen/2)/sr;
ff = [0:fftlen/2]/fftlen*sr;
subplot(411)
imagesc(tt, ff, 20*log10(abs(D))); axis xy; caxis([-40 20]); axis([0 10 0 2000])
% gap between 2nd and 3rd fundamentals is about 370 Hz - which STFT bin?
370/sr*fftlen
% Say bin 18
% Try separating low and high parts with STFT masking:
DR = D; 
DR(18:end,:) = 0;
subplot(412)
imagesc(tt, ff, 20*log10(abs(DR))); axis xy; caxis([-40 20]); axis([0 10 0 2000])
% Resynthesize
dr = istft(DR);
soundsc(dr(1:10*sr),sr);
% Just lowest notes, but pretty muffled
% How does the remaining part sound?  Try complementary HP mask:
DR = D; 
DR(1:17,:) = 0;
imagesc(tt, ff, 20*log10(abs(DR))); axis xy; caxis([-40 20]); axis([0 10 0 2000])
dr = istft(DR);
soundsc(dr(1:10*sr),sr);
% Just sounds high-pass, can still hear all notes

% NMF separation of spectrogram
% Use Virtanen's KL-based objective to factor STFT magnitude
% Fit 40 terms
r = 40;
[W,H] = nmf_kl_sparse_v(abs(D), r);
% Compare to original
subplot(412)
imagesc(tt, ff, 20*log10(abs(W*H))); axis xy; caxis([-40 20]); axis([0 10 0 2000])
% Full reconstruction looks pretty good.  Let's check components
% Since component ordering is arbitrary, attempt to sort by pitch
I = order_dict(W);
% Now plot dictionary terms
subplot(413)
imagesc(1:r, ff, 20*log10(W(:,I))); axis xy; caxis([-40 20]); axis([0, r, 0 2000])
% Components mostly have clear correspondence to fundamentals
% Now look at activations: 
subplot(414)
imagesc(tt,1:r, 20*log10(H(I,:))); axis xy; caxis([-30 30]); axis([0 10 0, r])
% You can sort-of see different notes, but quite noisy

% Try partial reconstruction with just lowest components
X = I(1:15);
% Have a look at resulting spectrogram (mask)
subplot(412)
imagesc(tt, ff, 20*log10(abs(W(:,X)*H(X,:)))); axis xy; caxis([-40 20]); axis([0 10 0 2000])
% Now reconstruct from that part of the spectrogram
dr = resynth_nmf(D,W,H,X);
soundsc(dr(1:10*sr), sr);
% Not bad at separating lowest notes, doesn't sound muffled

% What about remainder?
X = I(16:r);
imagesc(tt, ff, 20*log10(abs(W(:,X)*H(X,:)))); axis xy; caxis([-40 20]); axis([0 10 0 2000])
dr = resynth_nmf(D,W,H,X);
soundsc(dr(1:10*sr), sr);
% Some removal of lower notes, but mostly still there

% Now try NMF with sparsity penalty enabled
[W1,H1] = nmf_kl_sparse_v(abs(D),r,'alpha', 1);
subplot(412)
imagesc(tt, ff, 20*log10(abs(W1*H1))); axis xy; caxis([-40 20]); axis([0 10 0 2000])
I1 = order_dict(W1);
subplot(413)
imagesc(1:r, ff, 20*log10(W1(:,I1))); axis xy; caxis([-40 20]); axis([0, r, 0 2000])
% Dictionary pretty similar
subplot(414)
imagesc(tt,1:r, 20*log10(H1(I1,:))); axis xy; caxis([-30 30]); axis([0 10 0, r])
% Activations are much more sparse
X1 = I1(1:15);
subplot(412)
imagesc(tt, ff, 20*log10(abs(W1(:,X1)*H1(X1,:)))); axis xy; caxis([-40 20]); axis([0 10 0 2000])
dr1 = resynth_nmf(D,W1,H1,X1);
soundsc(dr1(1:10*sr),sr)
% Good separation of low notes
X1 = I1(16:r);
subplot(412)
imagesc(tt, ff, 20*log10(abs(W1(:,X1)*H1(X1,:)))); axis xy; caxis([-40 20]); axis([0 10 0 2000])
dr1 = resynth_nmf(D,W1,H1,X1);
soundsc(dr1(1:10*sr),sr)
% And now high notes have low notes mostly removed!

Things to investigate:

  • The nature of the decomposition naturally depends on the rank of the modeling (i.e., the number of rows and columns in H and W respectively). What happens when you use significantly higher, or significantly lower, ranks?
  • The "alpha" parameter controls the extent to which the sparsity penalty is applied (which tends to drive elements in H to zero). What happens as you vary its value?
  • Piano is a good match to the "separable" spectrogram model assumed by NMF. What happens if you try some different music, perhaps including less stable sounds such as voice?
  • Our scheme for selecting components is pretty crude. Can you improve results by manually selecting components, or by coming up with a different scheme to select them? How does one component sound, and how many do you need to make a plausible separated sound? Can you remove artifacts such as the click of the piano keys?
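
As a starting point for the alpha bullet, the hedged sketch below simply loops over a few alpha values (the range is an arbitrary choice), refits the NMF, and plays the low-component reconstruction; it assumes D, sr, r, and the prac13 functions from the script above are still in the workspace:

% Sweep the sparsity weight and compare low-component resyntheses
for alpha = [0 0.1 1 10]
  [Wa, Ha] = nmf_kl_sparse_v(abs(D), r, 'alpha', alpha);
  Ia = order_dict(Wa);
  da = resynth_nmf(D, Wa, Ha, Ia(1:15));
  disp(['alpha = ', num2str(alpha)]);
  soundsc(da(1:10*sr), sr); pause(11);   % listen to 10 s of each in turn
end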

Weds 2014-04-23: Fingerprinting

[fingerprinting image]

Our practical investigation of fingerprinting will use a re-implementation of the Shazam fingerprint system that I put together (see also the newer version including a compiled Matlab target at audfprint). You can look at the explanation and examples on that web page to see how it works, but we will use a more recent, more efficient version of the code: prac12.zip. We will also use pre-built hash tables, since populating the hash table takes about 5 sec/track, which adds up for the artist20 database of 1413 tracks (6 albums from each of 20 artists) we are using today. We have three tables to play with: HTA20-20hps.mat (30MB) is the largest and most detailed, generated with 20 hashes/sec. HTA20-10hps.mat (17MB) is smaller since it only recorded about 10 hashes/sec for the reference items, and HTA20-10hps-20c.mat (13MB) saved only a maximum of 20 tracks per hash (instead of 100), giving it a much smaller RAM footprint of 80 MB compared to 400 MB for the other two.
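
Before running the script, it may help to see roughly what landmark2hash does. The sketch below is purely illustrative: the example values and field widths are hypothetical assumptions, and the real packing in landmark2hash.m may differ.

% Illustrative 20-bit hash packing from landmark rows L = [t f1 f2 dt]
% (hypothetical field widths and values -- check landmark2hash.m for the real ones)
L = [5 33 41 9];                   % one made-up landmark row
f1 = mod(L(:,2), 2^8);             % start frequency bin, 8 bits
df = mod(L(:,3) - L(:,2), 2^6);    % frequency difference, 6 bits
dt = mod(L(:,4), 2^6);             % time difference, 6 bits
hashes = f1*2^12 + df*2^6 + dt;    % packed values in 0 .. 2^20 - 1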

The Matlab script below leads you through how to use these tools and data:

% Set up commands
addpath('mp3readwrite');  % mp3 reading

% Calculate fingerprints for some audio
[d,sr] = mp3read('http://labrosa.ee.columbia.edu:8013/beatles/Revolver/01-Taxman.mp3');
% ("Let It Be" is not in this database!)
% Find the landmark pairs
L = find_landmarks(d,sr);
size(L)
L(1:10,:)
% Each row of L is {start-time-col start-freq-row end-freq-row delta-time}
% in the quantized units of the time-frequency cells
% Visualize them superimposed on the spectrogram (zoom in on first 20 secs)
show_landmarks(d,sr,L)
axis([0 20 0 4000])

% We can convert each pair into a single, 20 bit hash:
H = landmark2hash(L);
size(H)
H(1:10,:)
% Each row of H is {track_id start_time hash_val}, where the track_id defaults to 0
% For building the database, we'd store the track_id and start_time keyed by the hash_val

% Try loading the precalculated database
global HashTable HashTableCounts
load HTA20-10hps-20c
whos
% The simple hash table has 1048576 columns (one for each possible 20 bit value)
% Each column consists of 32 bit values; the top 17 bits are the track_id
% and the bottom 15 bits are the time offset within the track.

% We can see which tracks contain any given hash (3rd column of H):
HashTable(:,H(1,3)+1)
% (we add 1 to the hash value to avoid trying to access array element 0)
% Zero entries in the HashTable are where it's empty.
% We get the track indices by dividing by 16384; Names{} converts the 
% track indices into the actual tracks recorded in the hash table
Names{HashTable(1:5,H(1,3)+1)/16384}
% "Taxman" is there, although it's not the only one.  However, if we try another hash:
Names{HashTable(1:5,H(2,3)+1)/16384}
% .. we get a different set of tracks, with Taxman as the only repeat.
% match_query is the main routine to search the hash table (see fingerprinting web page)
R = match_query(d, sr)
% Returns 4 columns: the track ID, the number of "filtered" matching hashes (after checking for consistent timing), 
% the actual relative timing (in 32 ms steps) of the best skew, and the total number of raw matching hashes.

% You can use fingerprints to examine the relationship between two "related" tracks:
[d1,sr] = mp3read('http://labrosa.ee.columbia.edu:8013/depeche_mode/Speak_and_Spell/11-Just_Can_t_Get_Enough.mp3');
[d2,sr] = mp3read('http://labrosa.ee.columbia.edu:8013/depeche_mode/Speak_and_Spell/16-Just_Can_t_Get_Enough_Schizo_Mix_.mp3');
bestlinalign(d1,sr,d2,sr);
% bestlinalign attempts to find a linear time warp between the hashes in two files.
% It also shows the scatter of how one track appears in the other.
% Note the section between 160 and 180 s which is skewed differently from the rest

% We can evaluate the fingerprinter by generating a set of queries from random positions in a few true tracks:
% (we're grabbing the soundfiles across the network).  
% The default is 30s queries with no added white noise:
tru = 100:100:1400;
[Q,SR] = gen_random_queries(addprefixsuffix(Names(tru),'http://labrosa.ee.columbia.edu:8013/','.mp3'));
% If we run the fingerprint query on these, we should get the "tru" track_ids back
% eval_fprint just runs match_query for each element of Q and returns the top hit
% You can truncate and add noise too; here, truncate to 4 sec, but very little noise (high SNR)
Ttrunc = 4.0;
SNR = 60;
[S,R,QT] = eval_fprint(Q,SR,tru, Ttrunc, SNR);
R
% You should see most of the first row matching the values of tru, but sometimes not.
% You can examine a match in more detail. QT returns the truncated/noised queries.
% Show the matching landmarks per the fingerprinting web page
[dm,srm] = illustrate_match(QT{2},SR,addprefixsuffix(Names,'http://labrosa.ee.columbia.edu:8013/','.mp3'));
% Listen to both query and candidate match in stereo:
soundsc([dm,QT{2}],SR)
% (When it's wrong, it's completely wrong)

% To build your own database, use clear_hashtable() then add_tracks(list)

Things to investigate:

  • The second column of the return from match_query (and eval_fprint) is the number of matching hashes. Do you notice the trend here for bad matches?
  • The fourth argument to eval_fprint truncates the length of the queries (in sec). How long do they have to be to get reliable retrieval?
  • The fifth argument to eval_fprint adds white noise at that SNR to the queries. Try a value as low as zero, or negative. Does it hurt? Can you compensate with longer queries? (A sketch of a length/SNR sweep appears after this list.)
  • In addition to added noise, the fingerprinter should be able to handle spectral modifications (e.g. filter([1 1], [1 -2*0.9*cos(.1) 0.9^2], Q{1}) ). Can you generate test queries modified along these lines (listen to see how they sound), and see how it affects performance?
  • What happens if you add two query waveforms together, e.g. Q{1}+0.2*Q{2} ?
  • The denser hash table, HTA20-20hps.mat, needs at least 500MB of RAM, but if you can load it, try it out and see how much better it does.
  • For a given query, how does the number of matching hashes vary with the precise alignment of the query in the soundfile? I.e., what do you get with match_query([zeros(K,1); Q{i}], SR) for different values of K (the number of prepended samples of silence)?
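
For the query-length and SNR questions above, the hedged sketch below sweeps both and scores each setting by how often the top hit equals the true track; it assumes Q, SR, and tru from the script above, and the accuracy measure here is just an illustrative choice:

% Sweep query length and SNR, scoring by how often the top hit matches tru
for Ttrunc = [2 4 8 16]
  for SNR = [60 10 0 -5]
    [S, R, QT] = eval_fprint(Q, SR, tru, Ttrunc, SNR);
    acc = mean(R(1,:) == tru);
    fprintf('len=%2d s  SNR=%3d dB  top-1 accuracy=%.2f\n', Ttrunc, SNR, acc);
  end
end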

Weds 2014-04-16: Incipit Matching

[incipits image]

This week we will use the Echo Nest Analyze API to find "incipits" -- fragments of music from the beginnings of phrases -- and search for them within a database of recordings. We'll be working in Matlab. The MATLAB script below is mostly self-explanatory. You can download the scripts in prac11.zip and data in AllIncipits.mat (43MB).

% Set up commands
addpath('mp3readwrite');  % mp3 reading

% Run the Echo Nest Analyze on an MP3 file
ENA = en_analyze('test.mp3',1)
% Look at the results - plot the per-segment chroma using the segment times
axs = subplot(311)
plot_chroma(ENA.pitches, ENA.segment);
% Compare to spectrogram
[d,sr] = mp3read('test.mp3',0,1,2);
ax = subplot(312)
specgram(d,512,sr);
caxis([-30 30]);
colormap(1-gray);
% (linkaxes will let you scroll the panes in sync)
axs = [axs,ax];
linkaxes(axs,'x')
axis([0 30 0 5000]);
% Superimpose the segment start times to check they make sense
overplot_times(ENA.segment,'y');
% Now look at the beat times
overplot_times(ENA.beat,'r');
% .. and the bar times
overplot_times(ENA.bar,'b');
% .. and the sections (major phrase breaks)
overplot_times(ENA.section,'g');

% Do the sections make sense?  Listen to 10 s excerpts
soundsc(seltime(d,sr,ENA.section(1),ENA.section(1)+10),sr);
soundsc(seltime(d,sr,ENA.section(2),ENA.section(2)+10),sr);
% sometimes...

% Calculate the beat-synchronous chroma from the segments
BC = en_beatchroma(ENA);
ax = subplot(313);
plot_chroma(BC,ENA.beat);
axs = [axs,ax];
linkaxes(axs,'x');

% We define incipits as the first N beats after each 
% section (starting from the nearest bar division), 
% and represent them with beat-chroma matrices
[In,St,En,Bt] = make_incipits(ENA);
subplot(321)
In1 = squeeze(In(1,:,:));
imagesc(In1); axis xy
title('1st incipit of test track')
% Incipits are key-normalized, so may not exactly match raw beat-chroma
% St, En are the actual start and end times
% listen to audio
soundsc(seltime(d,sr,St(1),En(1)),sr);
% compare to chroma resynthesis
% (make sure synthesis times start from 0)
soundsc(synthesize_chroma(In1,ENA.beat(Bt(1)+[0:31])-ENA.beat(Bt(1)),sr),sr)

% AllIncipits.mat contains incipits from 8000+ tracks of uspop2002
AI = load('AllIncipits.mat');
% Calculate the distance to all of them
% Incipits are stored in AI.Incipits as unravelled vectors of 384 (=12x32) values
% Just match on the first 16 beats (of 32) - a smaller space
mb = 16;
nchr = 12;
dist = sqrt(sum((AI.Incipits(:,1:mb*nchr) - ...
                 repmat(reshape(In1(:,1:mb),1,mb*nchr),size(AI.Incipits,1),1)).^2,2));
% Sort by distance
[vv,xx] = sort(dist);
% Plot the most similar one
ix = xx(1);
subplot(323)
imagesc(reshape(AI.Incipits(ix,:),12,32)); axis xy
% Which track is it?
title(['Incipit from ',AI.Names{AI.Tracks(ix)},' @ time ',num2str(AI.Starts(ix))], ...
      'Interpreter','none')
% ('Interpreter' stops it trying to translate underscores)
% Listen to the chroma resynth to see if it's similar
soundsc(synthesize_chroma(reshape(AI.Incipits(ix,:),12,32),0.35,sr),sr)
% Download & listen to the original audio
[d2,sr2] = mp3read(['http://labrosa.ee.columbia.edu:8013/',AI.Names{AI.Tracks(ix)},'.mp3']);
soundsc(seltime(d2,sr2,AI.Starts(ix),AI.Ends(ix)),sr2);
% Similar?
% You can also get the full EN Analysis structure for each track via wget (or similar) to
% ['http://labrosa.ee.columbia.edu:8013/ENA/', AI.Names{AI.Tracks(ix)}, '.mat']

Things to try:

  • Does the Echo Nest Analyze time segmentation make sense? You can try adding "blips" corresponding to beat, bar, and section times to your original audio (using beat_play.m from the beat-tracking practical) and see if they sound right.
  • Can you find any particularly close matches to any examples (i.e. where dist is small)? Does the chroma sound similar? Does the audio sound similar? Why might these be different?
  • Try modifying the chroma values, e.g. with the power-law compression we tried last week (a sketch appears after this list). Does that improve the matching?
  • Do you find certain "degenerate" matching patches? What do you think makes them come up so often?
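
For the compression suggestion above, here is a hedged sketch (the exponent value is an arbitrary choice) that recomputes the incipit distances on power-law compressed chroma, assuming AI, In1, mb, and nchr from the script above:

% Recompute the incipit distance with power-law compressed chroma
p = 0.25;                      % compression exponent -- try other values
AIc = AI.Incipits(:,1:mb*nchr) .^ p;
qc  = reshape(In1(:,1:mb), 1, mb*nchr) .^ p;
distc = sqrt(sum((AIc - repmat(qc, size(AIc,1), 1)).^2, 2));
[vv, xx] = sort(distc);
AI.Names{AI.Tracks(xx(1))}     % closest match under compressed features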

Weds 2014-04-09: Chord Recognition

[chroma image]

This week we will train and use a simple Hidden Markov Model to do chord recognition. We will be using precomputed chroma features along with the ground-truth chord labels for the Beatles opus that were created by Chris Harte of Queen Mary, University of London. This practical is all run in Matlab.

Here's an example of using the Matlab scripts. You can download all the scripts as well as the features and labels for the Beatles data in prac10.zip.

TrainFileList = textread('trainfilelist.txt','%s');
% Load beat-synchronous chroma for "Let It Be" - item 135
[Chroma,Times] = load_chroma(TrainFileList{135});
% Resynthesize with Shepard tones
SR = 16000;
X = synthesize_chroma(Chroma,Times,SR);
% Listen to first 20 seconds
soundsc(X(1:20*SR),SR)
% Somewhat recognizable

% Train Gaussian models for each chord from whole training set
[Models,Transitions,Priors] = train_chord_models(TrainFileList);
% Look at the means of the 25 learned models (nochord + 12 major + 12 minor)
for i = 1:25; MM(:,i) = Models(i).mean'; end
imagesc(MM)
% label axes
chromlabels = 'C|C#|D|D#|E|F|F#|G|G#|A|A#|B';
set(gca, 'YTick', 1:12);
set(gca, 'YTickLabel', chromlabels);
keylabels = '-|C|C#|D|D#|E|F|F#|G|G#|A|A#|B|c|c#|d|d#|e|f|f#|g|g#|a|a#|b';
set(gca, 'XTick', 1:25);
set(gca, 'XTickLabel', keylabels);

% Try recognizing chords in Let It Be (which was in the train set, so cheating)
[HypChords, LHoods] = recognize_chords(Chroma,Models,Transitions,Priors);
% We can look at the best (Viterbi) path overlaid on the per-frame log likelihoods
imagesc(max(-10,log10(LHoods)));
colormap(1-gray)
colorbar
hold on; plot(HypChords+1,'-r'); hold off
% Look just at the first hundred beats
axis([0 100 0.5 25.5])
xlabel('time / beats');
ylabel('chord');
set(gca,'YTick',1:25);
set(gca,'YTickLabel',keylabels);
% Compare the Viterbi (HMM) path to the simple most-likely model for each frame
[Val,Idx] = max(LHoods);
hold on; plot(Idx, 'og'); hold off
% The HMM transition matrix makes it more likely to stay in any given state, 
% thus it smooths the chord sequence (eliminates single-frame chords)

% Evaluate accuracy compared to ground-truth
TrueChords = load_labels(TrainFileList{135});
% Add the true labels to the plot
hold on; plot(TrueChords+1, '.y'); hold off
legend('Viterbi','Best','True')

% HypChords and TrueChords are simple vectors of labels in range 0..24.
% What is the average accuracy for this track?
mean(HypChords==TrueChords)
% 71.5% - pretty good!
% For reference, the best per-frame model, without the HMM, gives
mean(Idx-1 == TrueChords)  % subtract 1 to convert indices 1..25 into chords 0..24
% 44.9% - nowhere near as good

% To get the full confusion matrix (rows=true, cols=recognized as):
[S,C] = score_chord_recognition(HypChords,TrueChords);
imagesc(C);
set(gca,'XTick',1:25);
set(gca,'XTickLabel',keylabels);
set(gca,'YTick',1:25);
set(gca,'YTickLabel',keylabels);
% Most common confusion is F being recognized as C.

% What do the true chords sound like when rendered as Shepard tones?
LabelChroma = labels_to_chroma(TrueChords);
% .. creates a simple chroma array with canonical triads for each chord
X2 = synthesize_chroma(LabelChroma,Times,SR);
soundsc(X2(1:20*SR),SR)

% Compare "target" chroma, actual chroma, and both true and hypothesized labels
subplot(311)
imagesc(LabelChroma);
axis xy
set(gca, 'YTick', [1 3 5 8 10 12]'); set(gca, 'YTickLabel', 'C|D|E|G|A|B');
subplot(312)
imagesc(Chroma);
axis xy
set(gca, 'YTick', [1 3 5 8 10 12]'); set(gca, 'YTickLabel', 'C|D|E|G|A|B');
subplot(313)
plot(1:length(TrueChords),TrueChords,'o',1:length(HypChords),HypChords,'.r')
legend('True','Hyp')
set(gca,'YTick',[1 3 5 8 10 13 15 17 20 22]); set(gca,'YTickLabel','C|D|E|G|A|c|d|e|g|a');
axis([0 length(TrueChords) 0 25])
colormap hot
% This gives the picture above

% Evaluate recognition over entire test set
TestFileList = textread('testfilelist.txt','%s');
[S,C] = test_chord_models(TestFileList,Models,Transitions,Priors);
% Overall recognition accuracy = 57.7%

Things to try:

  • Try training on different numbers of training tracks (e.g. TrainFileList(1:50)). How does this affect performance?
  • Look at the variation in accuracy on the different items in the test set. How much does it vary? Can you explain this variation from your knowledge of the different tracks?
  • If you listen to the resynthesis of the recognized chords (i.e., labels_to_chroma(recognize_chords(...))), can you hear why the recognizer made mistakes, i.e., do the mistaken chords make sense?
  • load_chroma includes simple feature normalization, to make the maximum value of each chroma vector equal to 1. You can probably improve performance by modifying this e.g. by applying a compressive nonlinearity such as Chroma.^0.25. See how this affects performance.
  • The Hidden Markov Model smoothing relies on the large self-loop probabilities of the Transitions matrix. See what happens if you replace Transitions with a flat matrix, e.g. ones(25,25) (which is equivalent to not using an HMM and just taking the most-likely model for each frame), or some mixture between that and a large self-loop probability, e.g. eye(25)+0.1*ones(25,25). (A sketch appears after this list.)
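
As a concrete starting point for the last item, the hedged sketch below rescores the test set with a flat transition matrix and with a mildly self-looped one. Row normalization is included for safety; whether it is strictly required depends on the implementation. It assumes Models, Priors, and TestFileList from the script above.

% Compare HMM smoothing against a flat transition matrix on the test set
FlatTrans = ones(25,25) / 25;                            % no sequential constraint
[S0, C0] = test_chord_models(TestFileList, Models, FlatTrans, Priors);
SelfLoop = eye(25) + 0.1*ones(25,25);
SelfLoop = SelfLoop ./ repmat(sum(SelfLoop,2), 1, 25);   % renormalize rows
[S1, C1] = test_chord_models(TestFileList, Models, SelfLoop, Priors);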

Notes: The code includes and makes use of gaussian_prob.m, viterbi_path.m, and normalise.m, all from Kevin Murphy's wonderful HMM Toolbox.
You can calculate your own beat-synchronous chroma features with the chrombeatftrs.m routine, part of my cover song detection system.

This week's practical is the basis of Mini Project 3.


Weds 2014-04-02: Beat tracking

[beat_track image]

As promised, we now move on from the real-time processing of Pd to do some offline analysis using Matlab. (In fact, beat tracking in real time is an interesting and worthwhile problem, but doing it offline is much simpler.)

I've put together a cut-down version of my dynamic programming beat tracker for us to play with. It includes:

  • beat_onset.m to calculate the onset strength function;
  • beat_tempo.m to estimate the global tempo from the onset function;
  • beat_simple.m - the core of the dynamic programming search;
  • beat_track.m - a wrapper around these three functions to go from audio file to beat times.

There are also a number of helper/utility functions:

  • beat_plot.m to plot beat times on top of a spectrogram;
  • beat_play.m to make a sound file indicating where the beats were placed;
  • beat_ground_truth.m to read in a file of the McKinney/Moelants ground truth and filter for the dominant tempo;
  • beat_score.m to compare two beat tracks;
  • beat_test.m to run the beat tracker on a list of files and score each one;
  • beat_demo.m - a script that shows an example of using these functions.

All these functions are available in prac09.zip. The collection of 20 example excerpts and human tapping data that McKinney and Moelants donated for MIREX 2006 (some 50 MB) is separately available as mirex06examples.zip.
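
Before diving into beat_simple.m, it may help to see the dynamic-programming recurrence it is built around in isolation. The sketch below is illustrative only (variable names and the alpha value are assumptions, not necessarily those used in the real code): each frame's cumulated score is its onset strength plus the best predecessor score, penalized for deviating from the ideal beat period.

% Illustrative DP beat-tracking recurrence
% o: onset-strength envelope (e.g. from beat_onset); p: beat period in frames
%    (derived from the beat_tempo estimate)
alpha = 400;                            % weight on inter-beat spacing consistency
cscore = o(:)'; back = zeros(size(cscore));
for t = round(1.5*p):length(cscore)
  prange = max(1, t-round(2*p)) : t-round(p/2);   % candidate previous beat frames
  txcost = -alpha * (log((t - prange)/p)).^2;     % penalty for deviating from p
  [vv, ix] = max(cscore(prange) + txcost);
  cscore(t) = o(t) + vv;
  back(t) = prange(ix);
end
% pick the best final beat and trace back through back() to recover the beat frames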

Here are some things to try:

  • Go through the steps in beat_demo to understand the basic pieces.
  • Investigate one of the examples where the beat tracker doesn't do so well. What went wrong? How consistent are the human results?
  • The one parameter of the DP beat tracker is alpha, the balance between local onset strength and inter-onset-time consistency. A larger alpha results in a more rigid tempo track, better able to bridge over rhythmically ambiguous regions, but less able to cope with small variations in tempo. Can you improve the baseline performance by changing the value of alpha (e.g., the default value set at the top of beat_simple)?
  • The tempo estimation is essentially independent of the beat tracking. Sometimes you can improve the agreement with human tappers simply by "biasing" the tempo picking differently. See if you can improve performance by modifying the bpmmean parameter in beat_tempo.
  • The onset function in beat_onset is very simple and unweighted. Can you perhaps weight different frequency bands to improve performance?
  • The beat tracker sometimes has a systematic shift relative to the humans (it's often early). The hh and tt returns of beat_score give a histogram of where system beats occur relative to human ground truth. How much can you improve system performance by systematically shifting every beat time (e.g., beats = beats + 0.1)?
  • What happens if the tempo varies within a single performance? Maybe you can find an example like that on your computer - try it out! (You might need mp3read.m).

Weds 2014-03-26: Autotune

[autotune image]

Given pitch tracking and pitch modification, we can now put the two together to move the pitch towards a target derived from the current input pitch: this is autotune, in which a singer's pitch is shifted to the nearest exact note to compensate for imperfect intonation.
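
The core numeric operation of autotune, snapping a detected pitch to the nearest equal-tempered semitone, is easy to write down. Here is a hedged Matlab sketch of the idea (the Pd patch below does this with control objects, and sigmund~ reports pitch directly in fractional MIDI note units):

% Snap a detected pitch to the nearest semitone of the A440 equal-tempered scale
f_in  = 250;                                % example detected pitch, Hz
midi  = 69 + 12*log2(f_in/440);             % continuous MIDI note number
f_out = 440 * 2^((round(midi) - 69)/12)     % nearest exact semitone, in Hz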

We can use sigmund both to track the singing pitch, and to analyze the voice into sinusoids which we can then resynthesize after possibly changing the pitch. We'll use the following Pd patches:

  • autotune.pd - the main patch, shown above
  • sigmundosc~.pd - a patch to take sigmund's "tracks" message stream and resynthesize into audio
  • sineosc~.pd - the individual sinusoid oscillator, modified from last time to accept frequency scaling messages
  • my_lop.pd - a simple one-pole low-pass filter for control (not audio) values
  • playsound~.pd and grapher.pd from last time.

You can download all these patches in prac08.zip.

Loading a sound file then playing it into the patch should generate a close copy of the original voice, but quantized to semitone pitches. The "pitch smoothing" slider controls how abruptly the pitch moves between notes. Try it on some voice files, such as the Marvin Gaye voice.wav, the query-by-singing example 00014.wav, or my pitch sweep ahh.wav. You can also try it on live input by hooking up the adc~ instead of the soundfile playback, but you will probably need to use headphones to avoid feedback.

Here are some things to investigate:

  1. What is the purpose of the moses 0 block in the middle of the patch? If you remove it, and hook the raw pitch input directly into the quantizer, what changes?
  2. What is the left-hand output of the moses block achieving? Can you hear the difference if it is disconnected?
  3. Occasionally you may hear glitches in the output pitch. What is causing these? How might they be avoided?
  4. The "center" tones are defined by the A440 reference of the default MIDI scale. However, the query-by-humming example seems to be misaligned. Can you add a tuning offset to allow the input pitch to be systematically offset prior to quantization? Does this help? Could it be automated?
  5. Currently, the quantization just takes the nearest semitone. How could you make it select only the "white notes" of a major scale, i.e. skipping every 2nd semitone except between E/F and B/C?
  6. The current system generates singing with a mostly steady pitch, but singers often add vibrato (pitch modulation at around 6 Hz). How could we add vibrato with the autotune?
  7. Can you impose an entirely unrelated pitch (e.g. a monotone, or a steady slow vibrato) on an input signal?

Weds 2014-03-12: Pitch tracking

[sigmund image]

Miller Puckette (author of Pd) created a complex pitch tracking object called sigmund~. This week we'll investigate its use and function.

You will use the following Pd patches:

You can download these in prac07.zip.

sigmund~ operates in several different modes - as a raw pitch tracker, as a note detector/segmenter, and as a sinusoid tracker. We'll try each mode.

  • In the test_sigmund patch, the left-most path runs sigmund~ in raw pitch tracking mode, echoing the pitch and energy it detects with a sawtooth oscillator. Try running this with the mic input (adc~). How well can it track your voice? What kind of errors does it make?
  • If instead you hook up the playsound~ unit, you can feed an existing soundfile into the tracker. Try voice.wav, speech.wav, and piano.wav. How does it do?
  • The second path runs sigmund~ in note-detecting mode i.e. it first detects the raw pitch, then tries to convert that into a series of note events. We have fed these to our analog synthesizer patch so we can hear them as distinct notes. Try this for the voice and piano.
  • Note tracking depends on a lot of thresholds, which can be set via messages to sigmund~. Try varying the "growth" (minimum level increase, in dB, before a new note is reported) and "vibrato" (maximum pitch deviation, in semitones, before a new note is reported) using the provided sliders. Can you improve the note extraction? How do the best settings depend on the material?
  • The third branch uses sigmund~ as a sinusoid tracker (analogous to SPEAR), then connects the output to a set of sinusoidal oscillators to reconstruct a version of the sound. Try feeding different sounds into this: how does it do? You can vary the cross-fade time of the oscillators with the "fadetime" slider - what does this do?

Note that Mini project 2 is based on this practical and the following one (on autotuning).


Weds 2014-03-05: Reverb

[reverb image]

For this week's practical you will examine a reverberation algorithm, trying to understand the link between the algorithm's controls and components and the subjective experience of the reverberation. We will be working with the algorithm in the rev2~ reverberator patch that comes with Pd, although we'll be modifying our own version of it. It's based on the design described in the 1982 Stautner and Puckette paper.
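
To relate the patch to the underlying structure, here is a hedged Matlab sketch of a four-delay-line feedback network in the spirit of the Stautner and Puckette design; the delay lengths, feedback gain, and mixing matrix are illustrative choices, not necessarily the values used in rev2~.

% Four delay lines with feedback through a +/-1 mixing matrix (illustrative values)
fs = 44100;
dlens = round([29.7 37.1 41.1 43.7]/1000 * fs);      % four "unrelated" delay lengths
A = (1/sqrt(2)) * [0 1 1 0; -1 0 0 -1; 1 0 0 -1; 0 1 -1 0];  % orthogonal mixing matrix
g = 0.83;                                             % feedback gain ("liveness")
x = [1; zeros(fs-1, 1)];                              % impulse input, 1 s long
y = zeros(size(x));
M = max(dlens); dl = zeros(M, 4); wp = 1;
for n = 1:length(x)
  rp = mod(wp - 1 - dlens, M) + 1;                    % read position for each line
  d = dl(sub2ind(size(dl), rp(:), (1:4)'));           % the four delayed samples
  y(n) = sum(d);                                      % mono "wet" output
  dl(wp, :) = (x(n) + g * (A * d))';                  % write input plus mixed feedback
  wp = mod(wp, M) + 1;
end
soundsc(x + 0.5*y, fs)                                % listen: dry impulse + reverb tail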

You will use the following Pd patches:

You can download them all in prac06.zip.

The main test harness allows you to adjust the control parameters of the reverb patch, and to feed in impulses, short tone bursts of different durations, or sound files. You can also sample the impulse response and write it out to a wave file, to be analyzed by an external program.

  • Try varying the main reverb controls: Liveness, Crossover Frequency, HF Damping, as well as the balance between "wet" and "dry" sound. How does each control change the sound? How does its audible effect depend on the kind of sound you use as test input?
  • Try writing out the impulse response of a setting to a sound file (using the "write -wave ..." message at the bottom right) and load it into Matlab (or another sound editor) to look in more detail at its waveform and spectrogram. Does it look like you expect? How do the controls affect these objective properties of the signal?
  • Open the my_rev2 subpatch (shown to the right). The main work is done by four delay lines, implemented with the four delwrite~/delread~ pairs. You can vary the actual delays of these four lines. How does changing them affect the sound? Are the default values critical?
  • The other main part of the reverberator is the early echoes, implemented in the early-reflect subpatch. If you open that subpatch, you'll find a fader to choose between including those early echoes, or just feeding the direct path into the delay lines. How does this affect the sound?
  • The reverberator's mixing matrix is implemented with the 4x4 block of *~ 1 and *~ -1 elements in the lower left. Try flipping some of these; what happens?

Here are some sound files you can use to try out the reverberator: voice.wav, guitar.wav, drums.wav, drums2.wav.


Wed 2014-02-26: LPC

[lpcanalysis image]

This week we will experiment with LPC analysis/synthesis in Pd using the lpc~ and lpcreson~ units by Edward Kelly and Nicolas Chetry, which are included with the Pd-extended package.

The patch lpc.pd uses these units to perform LPC analysis and synthesis, including taking the LPC filters based on one sound and applying them to a different sound (cross-synthesis). The main patch handles loading, playing, and selecting the soundfiles, and uses the additional patches loadsoundfile.pd to read in sound files, playloop.pd to play or loop a sound file, and audiosel~.pd to allow selecting one of several audio streams. The LPC part is done by the lpcanalysis subpatch of the main patch, shown to the right. It "pre-emphasizes" both voice and excitation to boost high frequencies, then applies lpc~ to the voice signal to generate a set of LPC filter coefficients and a residual. The [myenv] subpatch then calculates the energy (envelope) of the residual, and the excitation is scaled to reflect this envelope. This excitation, along with the original filter coefficients (delayed by one block to match the delay introduced by [myenv]), is passed to lpreson~, which simply implements the all-pole filter to impose the voice's formant structure on the excitation. A final de-emphasis re-balances low and high frequencies. You can download these patches in prac05.zip.

The entire lpcanalysis subpatch is applied to overlapping windows of 1024 samples at half the sampling rate of the parent patch (i.e. 1024/22050 = 46.4 ms) thanks to the block~ unit, a special feature of Pd which allows subpatches to have different blocking etc. from their parents. On coming out of the subpatch, Pd will overlap-add as necessary, so we apply a final tapered window (from the $0-hanning array) to the outputs. tabreceive~ repeatedly reads the hanning window on every frame.
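
If you want to prototype the same cross-synthesis offline, the hedged Matlab sketch below mirrors the frame-by-frame processing of the lpcanalysis subpatch using the Signal Processing Toolbox lpc function; the window length, order, and gain-matching rule are assumptions, and v and e stand for mono column vectors (voice and excitation) at the same sampling rate fs.

% Offline LPC cross-synthesis sketch: voice v shapes excitation e (columns, rate fs)
p = 22; N = 1024; hop = N/2; win = hanning(N);
nfr = floor((min(length(v), length(e)) - N) / hop);
y = zeros(nfr*hop + N, 1);
for m = 0:nfr-1
  ix = m*hop + (1:N)';
  vf = win .* v(ix);  ef = win .* e(ix);
  if ~any(vf), continue; end                % skip silent frames
  a = lpc(vf, p);                           % all-pole model of the voice frame
  res = filter(a, 1, vf);                   % whitened voice residual
  gain = norm(res) / (norm(ef) + eps);      % scale excitation to residual energy
  yf = filter(1, a, gain * ef);             % impose voice formants on the excitation
  y(ix) = y(ix) + win .* yf;                % overlap-add with a tapered window
end
soundsc(y, fs)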

Here is a list of options for things to investigate:

  1. Run the basic cross-synthesis function by loading sm1-44k.wav as the voice input, and Juno106_Strings.wav as the excitation input. (You'll probably need to enable the dsp on the output~ unit and turn up the volume.) Notice how the strings sound can still be understood as speech.
  2. Try selecting the noise input as excitation (using the audiosel~ panel) instead. This lets you hear the information being carried by the LPC filters in its most neutral form.
  3. Try using some other sounds as source and/or excitation (they should be short, mono WAV files sampled at 44 kHz). What happens if you swap the original two sound files?
  4. Try varying the LPC Filter Order (also called the model order) with the number box (click and drag up and down to change the number). How does it affect the synthesis sound?
  5. Change the patching to the output~ unit to listen to the "residual" output from the lpcanalysis subpatch. How does the residual sound change with the model order?
  6. The subpatch uses the residual's envelope (extracted by [myenv]) to scale the excitation. If you disable this (e.g. by patching the excitation directly into the input of lpreson~), how does this affect the sound?
  7. The LPC analysis relies on Pd's block~ unit to manage the block size, overlap-add, and sample rate scaling. You can try varying this to see e.g. how the sound varies if you use a longer block size and/or different sampling rate. (Remember to regenerate the hanning window array if you change the block size). Be warned, however: I have been able to crash both Pd and the audio drivers with these settings.
  8. The second output from the [pd lpcanalysis] block is the residual (the whitened excitation). You can't feed this directly back in as excitation, because that introduces a delay-free loop. But you can delay it by one analysis block, using the [pack~] - [unpack~] trick as is used inside the patch. You could then, for instance, turn this into a buzz-hiss-style excitation with [expr~ $v1>0], then use it as excitation. How does that sound?
  9. One weakness with this setup is that the excitation may not have a "flat" spectrum to provide sufficient high-frequency energy for the LPC filters to "shape". One approach to this is to use a second lpc~ unit to extract a (whitened) residual for the excitation too (the actual LPC coefficients from the excitation are discarded). Try this out and see how it affects the sound.
  10. In principle, LPC cross-synthesis can be used for time scaling, simply by slowing down the frame rate between analysis and synthesis (or possibly by interpolating the LPC parameters). The framework in lpcanalysis won't support this, however, as it uses LPC parameters extracted in real time from the speech waveform. To do timescaling, you'd need to implement something more like the way we did sinusoidal synthesis last week, where the parameters are written to table, then read back as needed on synthesis. You should, however, be able to implement both analysis and synthesis stages within Pd, without needing to read or write the parameters to an external file.


Weds 2014-02-19: Sinusoidal Synthesis

[mypartial image]

This week we use Pd to perform additive synthesis by controlling the frequencies and magnitudes of a bank of sinewave oscillators based on parameters read from an analysis file written by the SPEAR program we saw in class.

The main additive patch instantiates a bank of 32 oscillators, and provides the controls to load an analysis file, to trigger playback, and to modify the time and frequency scale. The actual parsing of the SPEAR file is provided by loadspearfile, and the individual sinusoid partials are rendered by mypartial. Here are some analysis files for notes from a violin, trumpet, and guitar. You can download all these files in prac04.zip.

You can experiment with playing back each of these, turning individual partials on or off, and adjusting the time and frequency scales. When the analysis file contains more harmonics than the number of oscillators (32), some of the sinusoids are dropped from the synthesis. You can identify (roughly) which harmonic is which by the average frequency and summed-up magnitude of each harmonic, which gets displayed on each individual [mypartial] patch in additive.pd.
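
For comparison, the same additive resynthesis is easy to sketch offline in Matlab. The example below is hedged: it assumes the partials have already been read into a cell array P, with each P{k} a matrix of [time(s), frequency(Hz), amplitude] breakpoints (for instance via the spearread.m mentioned in item 5 below), each with at least two breakpoints at increasing times.

% Additive resynthesis from partial breakpoint tracks P{k} = [t(s) f(Hz) amp]
fs = 22050;
dur = max(cellfun(@(pp) pp(end,1), P));
t = (0:round(dur*fs)-1)'/fs;
y = zeros(size(t));
for k = 1:length(P)
  tk = P{k}(:,1); fk = P{k}(:,2); ak = P{k}(:,3);   % breakpoint times must increase
  f = interp1(tk, fk, t, 'linear', 0);     % frequency envelope (0 outside the track)
  a = interp1(tk, ak, t, 'linear', 0);     % amplitude envelope (0 outside the track)
  ph = 2*pi*cumsum(f)/fs;                  % integrate frequency to get phase
  y = y + a .* sin(ph);
end
soundsc(y, fs)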

Here are the things to try:

  1. Using the keybd patch, make the sound resynthesize under keyboard control, modifying its pitch to reflect the key pressed.
  2. Try adding inharmonicity to the sound i.e. stretching the frequencies not by a constant factor, but according to some other rule. For instance, what happens if you add a constant offset to each frequency? What happens if you scale only the frequencies above a certain minimum?
  3. Add vibrato (slow frequency modulation) to the resynthesized note.
  4. Implement looping sustain, where the parameters cycle within some subportion of the analysis data while a key is held down (to provide sustain), then proceed through to the end of the sample once the key is released.
  5. Try making a sinusoidal analysis of a sound of your own (using SPEAR) and see how well it works for other sounds, perhaps a spoken word. Remember that the system will only resynthesize the first 32 sinusoids it reads from file, so be careful to prune out the less important tracks before you "export" from SPEAR. (You can also modify the SPEAR analysis files in Matlab using spearread.m and spearwrite.m).

Note: if you modify the mypartial patch, it's probably a good idea to close and reopen the parent additive patch. I'm not sure how good Pd is at keeping multiple instantiations in sync when one instance is edited. Re-opening the parent patch will re-instantiate all the mypartials, so they should all get updated.


Weds 2014-02-12: Analog synthesis

[monosynth image]

This week we'll experiment with simulating an analog synthesizer with Pd.

The Pd analog synth simulator consists of several patches:

  • oscillator~.pd implements a band-limited square wave with pulse-width modulation, including a low-frequency oscillator (LFO) to modulate the pulse width.
  • adsr0~.pd provides a simple ADSR envelope generator. We use two of these, one for the amplitude and one for the filter.
  • voice~.pd includes an [oscillator~] and two [adsr0~]s along with [vcf~] to provide a complete synthesizer voice.
  • demo_voice.pd is a simple main patch that routes events from the keyboard to the [voice~] and provides sliders to control the various parameters.

You can download all these patches along with some support functions in the zip file prac03.zip. Load demo_voice.pd into Pd, and the synth should run.

Things to investigate:

  1. Analog synth programming: You could choose one of the sounds from the Juno-106 examples page, or you could download and run the stand-alone demo version of the Loomer Aspect analog synth emulator that I demo'd in class (you need a MIDI keyboard to drive it, but you can use the open-source Virtual MIDI Piano Keyboard). Although our basic synth simulation is simple (and only monophonic), it includes all the basic components you should need to make the full range of classic analog synth sounds. By starting from one of the Aspect presets (many of which sound amazing), you can actually see how the sound is put together as a basis for duplicating it (although you might need to consult the Aspect manual to fully decipher how things are connected together: most controls have up to three inputs which are summed).
  2. Additional oscillators: One thing Aspect and the Juno have that's not in the basic synth is multiple oscillators, including a "sub-oscillator", for instance a square wave tuned an octave below the main oscillator. Can you modify the simple synth to include this? How does it sound? You might want to also deploy an LFO, perhaps using the lfo.pd patch.
  3. Modified filter: Another difference with the Juno is that its variable filter is a 24 dB/octave low-pass, much steeper than the single vcf (voltage controlled filter) unit in the patch. You could improve this by adding multiple filters in cascade; how does this affect the sound?
  4. Improved ADSR: The existing ADSR generates linear amplitude envelopes, but these don't sound as good as nonlinear envelopes, for instance in dB or quartic mappings. Miller's book discusses this in chapter 4; see if you can change the amplitude mapping from the ADSR to improve the sound.
  5. Drum sounds: You can also create drum sounds with analog synthesizers, but you usually need a noise source instead of, or in addition to, an oscillator. Try using the "noise~" unit to broaden the range of sounds from the synthesizer, and see if you can get some convincing drum sounds (or other interesting sounds). A good drum sound might also want some pitch variation, perhaps generated by an additional ADSR unit.
  6. Velocity sensitivity: One big musical limitation of our synthesizer is that it has no dynamic control - no variation of the sound based on how hard you strike the keys. With the computer keyboard as controller, we're pretty much stuck, but I have a couple of USB musical keyboards as well. If you want to try playing with one of these, they will give you a "velocity" value out of the keybd unit. You could then use this to modulate the amplitude, and maybe even the timbre, of the synth.

This week's practical is the basis of mini-project 1, for which you will be submitting a short report in two weeks' time.


Weds 2014-02-05: Filtering

[demo_filters screenshot]

Last week we looked at a fairly complex structure built in Pd. This week, we'll back up a bit and play with some simple filters within Pd.

Pd provides a range of built-in filtering objects: [lop~], [hip~], [bp~] (and its sample-rate-controllable twin [vcf~]), the more general [biquad~], and the elemental [rpole~], [cpole~] and [rzero~], [czero~] (see the filters section of the Floss Pd manual, and the Subtractive Synthesis chapter of Johannes Kreidler's Pd Tutorial).

The patch demo_filters.pd provides a framework to play with these filter types. It uses playsound~.pd to allow you to play (and loop) a WAV file, select4~.pd to select one of 4 audio streams via "radio buttons", and plotpowspec~.pd (slightly improved from last week) to plot the running FFT spectrum, as well as the [output~] built-in from Pd-extended. You can download all these patches in prac02.zip.

Try listening to the various sound inputs provided by the top [select4~] through the different filters provided by the bottom [select4~]. Try changing the cutoff frequency with the slider (as well as the Q of the bpf with the number box); listen to the results and look at the short-time spectrum. You can try loading the speech and guitar samples into the [playsound~] unit to see how filtering affects "real" sounds (click the button at the bottom right of [playsound~] to bring up the file selection dialog; click the button on the left to play the sound, and check the toggle to make it loop indefinitely).

Here are a few further experiments you could try:

  1. The standard [lop~] and [hip~] units are single-pole and thus give a fairly gentle roll-off. Try repeating them (i.e., applying several instances of the filter in series) to get a steeper filter. How does that sound?
  2. Notch (bandstop) filters are noticeably absent from this lineup. Of course, [biquad~] can be used to create an arbitrary pole/zero filter, and the extended unit [notch] can generate the appropriate parameters to make a notch filter. See if you can get one going, and see what it sounds like.
  3. Pd-extended includes a large number of more complex filter units, including [lpN_TYPE~], where N is the order, and TYPE is butt (Butterworth), cheb (Chebyshev), bess (Bessel), or crit (for critical damping). There are also a number of more exotic things like [resofilt~] and [comb~]. You can try these to see how they sound.
  4. Changing a filter's cutoff in real-time can have a dramatic sound, particularly for bandpass and notch. See if you can rig up a low-frequency modulator for the cutoff frequency control. (You can look at how I added modulated cutoff in the body of my modified karpluckvib~.pd from last week's practical).
  5. Some well known instances of modulated filtering are guitar "pedal" effects like wah-wah and chorus. We'll be talking about these in more detail later in the semester, but according to this image from page 57 of Udo Zoelzer's DAFx book, a set of bandpass filters with complementary motion gives a good wah-wah. Can you build this?
  6. Multiple voices singing approximately the same notes lead to rapidly-moving notches in frequency due to phase cancellations. A "chorus" effect can make a single voice sound like a chorus by introducing a complex pattern of moving notches. Can you build something like that? What is the effect of making the motion of the different notches synchronized or incoherent (different frequencies)?
  7. The Karplus-Strong plucked string algorithm (slide 13 of last week's slides) is actually just an IIR comb filter -- a set of highly-resonant poles at the harmonics of the fundamental frequency -- whose input is the initial wave shape followed by zeros. How could you simulate such a filter with these primitives? (A Matlab sketch of the comb-filter interpretation appears after this list.)
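
To hear the comb-filter interpretation in item 7 directly, here is a hedged Matlab sketch (parameter values are illustrative) that feeds a short noise burst into a feedback comb whose loop delay sets the pitch:

% Karplus-Strong as an IIR comb filter: y[n] = x[n] + g*y[n-N]
fs = 44100; f0 = 220;
N = round(fs/f0);                      % loop delay in samples sets the pitch
g = 0.995;                             % per-cycle decay (a loop lowpass would go here)
x = [2*rand(N,1)-1; zeros(2*fs,1)];    % noise burst followed by silence
a = [1; zeros(N-1,1); -g];             % denominator: 1 - g*z^-N
y = filter(1, a, x);
soundsc(y, fs)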

Weds 2014-01-29: Plucked string

[plucked_string screenshot]

This week's practical looks at the Karplus-Strong plucked string simulation in Pure Data (Pd). The general pattern of these weekly practical sessions is to give you a piece of code to start with, then ask you to investigate some aspects, by using and changing the code. However, the areas to investigate are left somewhat open, in the hope that we'll each discover different things -- that we can then share.

We start with demo_karpluck.pd, my wrapper around Loomer's karpluck~.pd. In addition to the keybd.pd patch used to provide MIDI-like controls from the computer keyboard, this one also uses grapher~.pd to provide an oscilloscope-like time-domain waveform plot, and plotpowspec~.pd (based on the original by Ron Weiss) to provide a smoothed Fourier transform view. You can download all these patches in prac01.zip, then open the demo_karpluck.pd patch to run the demo.

This patch provides three main controls for the sound of the plucked string:

  • the "Click width", which determines the duration of the noise burst fed into the digital waveguide to initiate the note;
  • the "Decay constant", which scales the overall amplitude of the reflected pulse in each cycle; and
  • the "Feedback lowpass", which is an additional attenuation applied to higher frequencies.

To get going, try playing with these three parameters to see what kinds of "real" sounds you can approximate. Can you put "physical" interpretations on these parameters?

Here are some things to try:

  1. In addition to playing with the exposed controls, you can always modify the "innards" of the karpluck patch. For instance, the feedback lowpass is implemented with a lop~ object in the body~ subpatch of karpluck~. Can you change the order of this filter? Or its type (see lp2_butt~)? Can you explain these effects?
  2. Can you set a slower attack and decay for the envelope of the noise burst generated by the excitation~ subpatch of karpluck~? How does that affect the sound?
  3. Adding a low-frequency modulation of around 4 Hz over a range +/- 500 Hz to the feedback lowpass can give a nice, subtle vibrato/tremolo effect. Can you do this? You can use [osc~ 4] to generate a 4 Hz low-frequency modulation waveform, but you'll then need [metro 20] and [snapshot~] to sample it every 20 ms and convert it into a "control" (hollow box) signal.
  4. One of the nice things about the waveguide physical models is that they can support different kinds of interaction. For instance, on a real string you can get harmonics by damping (touching) exactly half way along its length. You might be able to approximate something like this by having a variable, nonlinear attenuation at a fixed position in the delay line (a linear attenuation would simply commute and not matter where you put it). Can you implement something like this?
  5. The string model is the simplest waveguide model. Instruments like clarinet and trumpet can also be simulated, but need a nonlinear energy input simulating the reed or lips (see for instance this paper on clarinet modeling by Gary Scavone). I haven't yet found a good implementation of this for Pd, but it would be an interesting thing to look for and/or try.

Credit: To be credited with completing this practical, you'll need to show one of the teaching staff a working instance of one of these suggestions.


Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.
Dan Ellis <dpwe@ee.columbia.edu>
Last updated: Thu May 01 09:38:27 AM EDT 2014