Thursday, February 02, 2006

late post

This post should have been up by last Sunday at the latest.

Last time I wrote that I was going to test whether our method works on simulator-created data that doesn't include any noise or reverberation. I added a small part to the energy method that calculates the ratio of the energy above the threshold value to all the energy in each frame. If the ratio is less than 20 dB, I treat the frame as useless and do not use it to find the orientation. All said and done, I got near-perfect results. I say near perfect because I found 2 "minor" problems.
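The frame check described above can be sketched roughly as follows. This is only an illustration, not the actual code: it assumes each frame is a list of per-bin energies and that "above the threshold" means the bins at or above a cutoff index. The names `energy_ratio_db`, `usable_frames`, `cutoff_bin` and `min_ratio_db` are made up for the example. Since a ratio of part to whole cannot exceed 0 dB, the limit is left as a parameter, and the -20 dB in the usage lines is an assumed value.

```python
import math


def energy_ratio_db(bin_energies, cutoff_bin):
    """Ratio (in dB) of the energy at or above cutoff_bin to the total frame energy."""
    total = sum(bin_energies)
    above = sum(bin_energies[cutoff_bin:])
    if total <= 0.0 or above <= 0.0:
        return float("-inf")  # silent frame, or nothing above the cutoff
    return 10.0 * math.log10(above / total)


def usable_frames(frames, cutoff_bin, min_ratio_db):
    """Indices of frames whose energy ratio reaches min_ratio_db (assumed limit)."""
    return [i for i, frame in enumerate(frames)
            if energy_ratio_db(frame, cutoff_bin) >= min_ratio_db]


# Hypothetical frames: the second one has almost no energy above the cutoff.
frames = [[1.0, 1.0, 1.0, 1.0], [1000.0, 1000.0, 1.0, 1.0]]
kept = usable_frames(frames, cutoff_bin=2, min_ratio_db=-20.0)
```

Only the frames listed in `kept` would then be passed on to the orientation estimate.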

1) When we are close to one set of microphones and looking in exactly the opposite direction (that is, the microphones we are aiming at are considerably distant), I got a bias of almost 10 degrees.

2) In files that worked really, really well, there was usually one frame that would screw up, giving an error of 10-20 degrees, but only in that one frame. This was usually the same frame in each simulated data set (remember, I use the same clean speech file), so I think it is caused by some special speech case.

So after these results, I started thinking about what to do next and came up with the following ideas:

1) check to see what is causing the bias I described.
2) check to see what is causing the problem described in 2.
3) try to set up the discriminator without the clean speech
4) try to come up with ways to deal with the reverberant energies
5) try to come up with ways to use the energy method without position data.

After discussing these with Prof. Silverman, I decided to attack the first problem. My first guess was that it was caused by the simulator. The orientation data that the simulator uses suggests that the very high frequencies aren't directed straight ahead but slightly to the left or to the right, so I thought that at longer distances this had caused the bias. I was dramatically wrong.

In the original energy method, after interpolating the microphone energies over angle, a front-to-back ratio is taken. I found that when we are close to the microphone we are aiming at, we screw up when we do not take this ratio; however, when we are far away from it, we screw up when we do take this ratio. The reason has to do with the heights of the microphones. I have set up the problem and I'll explain it in my next post. It's complicated, and it's not something to do on the computer but something that has to be worked out with pen and paper.

coming soon...!

Monday, January 23, 2006

Real problems - reverberant energies and a different corner effect.

The previous post is basically useless in terms of direct contribution to my research. However, the things I've been going through since the last post have really opened my mind and given me some invaluable experience in how to react to unexpected results. Basically, I always have to investigate the reasoning behind the results, and when making observations or coming to conclusions, I always have to support my points with clear quantitative data.

The energy method, as my professor proposed it, is not working because of one major and one minor problem. The major problem is that when the talker is close to a reverberant wall, the magnitude of the reverberant signals becomes comparable to that of the original signals at distant microphones, and the distance-squared compensation of the received signal messes up the energy values. We get peaks at different locations.
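A toy calculation illustrates why the squared compensation can move the peak. It rests on assumptions that are not from the simulator: direct energy falls off as 1/d^2, the reverberant energy is a constant floor at every microphone, and `direct_gain` stands in for the talker's directivity toward a microphone. The function name and all the numbers are made up for illustration.

```python
def compensated_energy(direct_gain, distance, reverb_energy):
    """Energy after distance-squared compensation under the toy model.

    received    = direct_gain / distance**2 + reverb_energy
    compensated = received * distance**2
                = direct_gain + reverb_energy * distance**2
    """
    received = direct_gain / distance ** 2 + reverb_energy
    return received * distance ** 2


# Microphone the talker faces, 1 m away (full directivity gain).
front = compensated_energy(direct_gain=1.0, distance=1.0, reverb_energy=0.05)

# Microphone behind the talker, 5 m away (weak direct path).
back = compensated_energy(direct_gain=0.2, distance=5.0, reverb_energy=0.05)
```

In this toy case the reverberant floor is multiplied by 25 at the distant microphone, so `back` comes out larger than `front` even though the talker faces the near microphone.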

The minor problem is that when looking directly at the corners (especially from locations far from the corners), the microphones that are slightly off from the corners get pretty much the same frequencies, and since they are closer to the talker their magnitudes are slightly larger. The squared compensation doesn't help make the energies at the corners bigger, so we basically get no peaks within 5 degrees of error but get almost all the peaks within 20 degrees of error.

I'll be attacking the major problem first. To do this, last week I used parts of the code of the simulator my professor developed to get simulated HMA files separately for the direct and the reverberant speech, given a clean speech file. This week I'll start by

1) testing, using the direct speech HMA files, whether the energy method works without any reverberation for different types of speech (the different types of speech are a completely different area that I hope to talk about sometime later).

2) using the reverberant speech HMA files to observe the reverberant energies at different positions and orientations.

That's it for now.

Thursday, January 12, 2006

First Results and First Problems

Well, I have worked on the data I was telling you about, and there were some very weird results. I then took some other data, compared them, did some analysis and found this weird thing called the "corner effect", due to reverberant energy. Here is a copy of the document I originally wrote for Prof. Silverman but then didn't even give to him, since it's no big deal; the problem is very visible. However, so that I can keep track of what I'm doing in the future, I will post it here:
--------------------------------------------0---------------------------------------

I have used 2 sets of 5 recordings and processed them using the energy method code using different input parameters. Before explaining the procedure, the results and ideas about them, I think I have to explain the structure of the recordings.

Below is an approximate (lousy/definitely not accurate) aerial layout of the room.

In the method the height hasn’t been considered as a factor, so I’ll not take that into account but it should be noted that not all sets of microphones are at the same height.

The first set of data (from now on referred to as old files or old data) is HFS standing at the 5 locations that are marked with red and saying nearly the same sentence looking at pt 1. The second set of data (from now on referred to as new files or new data) is AL standing at the same 5 locations but looking at pt 2. So in total we have 10 HMA files, 2 at each location, with exactly opposite orientations. For each HMA file, I ran the energy method for 8 different cutoff frequencies, starting from 0 Hz and incrementing by 500 Hz each time up to 3500 Hz, both with and without spectral subtraction. For each run I get 6 percentage values: the 5, 10 and 20 degrees tolerance percentages from the processing of raw and smoothed data. The correct angle is 45 degrees for the new data and 225 degrees for the old data, and I’m looking at the polar plots with a cutoff frequency of 2000 Hz.
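The tolerance percentages mentioned above can be computed along these lines. This is a minimal sketch, assuming one estimated angle per frame in degrees and assuming the error is measured as the shortest distance around the circle; the function names and the sample estimates are made up for illustration.

```python
def angular_error(estimate, true_angle):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = abs(estimate - true_angle) % 360.0
    return min(diff, 360.0 - diff)


def tolerance_percentages(estimates, true_angle, tolerances=(5.0, 10.0, 20.0)):
    """Percentage of frames whose estimate lies within each tolerance."""
    n = len(estimates)
    return {
        tol: 100.0 * sum(1 for e in estimates
                         if angular_error(e, true_angle) <= tol) / n
        for tol in tolerances
    }


# Hypothetical per-frame estimates for a talker whose true angle is 45 degrees.
result = tolerance_percentages([44.0, 50.0, 30.0, 250.0], true_angle=45.0)
```

Running this once on the raw peaks and once on the smoothed peaks would give the 6 values per run.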

My original aim was to observe the strength of the energy method as a function of cutoff frequency, spectral subtraction (with and without) and the position of the talker. However, I’ve encountered a phenomenon which I’ll refer to as the corner effect. The corner effect is the result of reverberant energies due to short reflections at the corners. I’m hoping it will become clear when I analyze the data for 3 different points.

By the way, to pursue my original aim, I took new data today with the talker facing the midpoint of the horizontal wall and the empty space across from it, on a line that divides the room in half vertically.

Closest to Orientation Point:

Right now, I am looking at the old file from location 1 looking at pt1 and the new file from location 5 looking at pt2. The old data gives terrible results, going up to 20% with 20 degrees tolerance and less than 5% with 5 and 10 degrees of tolerance in raw energies that were spectrally subtracted. All other percentages are practically zero. The new data is slightly better, going up to 60% without and up to 50% with spectral subtraction when energies aren’t smoothed. When they are smoothed, we get 0’s everywhere. The only comparison we can make to understand what’s going on is between the percentages found by maximizing the raw energies in the spectrally subtracted data (since everything else is zero in the old data). There are 2 important points to notice: 1) in the old data the 20-degree tolerance percentage is much lower than in the new data and 2) in the old data the percentages vary roughly exponentially with the tolerance while in the new data they hardly vary at all. When I looked at the polar plot of the old data, I saw that most of the maximum energies are around 170 degrees and only a few go above 200.

Polar plots of the new data show some interesting results. In the new data we have around 60 frames that we calculate. The raw energies peak very near the orientation point for the first 30 frames. However, in those 30 frames the smoothed energies show a peak around 330 degrees, which probably means that a lot of reverberant energy reflected from the horizontal wall has accumulated. The second half is even more interesting. Raw energies start to peak around 290 degrees, then around 170 degrees, and finally they peak at 260 degrees, while the smoothed energy peaks around 250 degrees for the entire second half. This is clearly the effect of compensating the energies by the distance squared, which blows up the reverberant energies (even from across the room) beyond what they should be.

Middle of the room (location 3)

So this is a comparison of the old data and the new data recorded at location 3. The old data is very unsuccessful without spectral subtraction. With spectral subtraction, both raw and smoothed energies yield up to 40% within +/-20 degrees tolerance. The method works only with cutoff frequencies above 1500 Hz, and the percentages vary linearly with the tolerance values. The success of the new data varies with the cutoff frequency, both with and without spectral subtraction. I’ll look at the results with spectral subtraction at frequencies above 1500 Hz. Raw data at 20 degrees tolerance varies around 70% and smoothed data varies around 80%. The difference from the old data is that the percentages vary roughly exponentially with the tolerance.

In the old data the angle of the maximum raw energy starts out pretty randomly for the first 10 frames, then settles around 200 for the next 20, then 15 of them are around 170 with some random ones between 180 and 260, the next 10 are around 200, and finally the remaining ones are either around 180 or 225. The angle of the maximum smoothed energy starts around 250, then the next 20 are around 200, then 15 of them are between 180 and 225, then 10 of them are around 200 and 5 of them around 225.

The new data gives clearer results. In smoothed energies, we get 55 peaks around 30 degrees and then 10 around 250 degrees. The raw data gives the same 55 peaks around 30 degrees but the next 10 are around 150 degrees. I think that to get rid of the peaks in these different places we have to use cutoff frequencies that are as large as possible (because of the directionality) as a first step. We will still get the peaks at a wrong angle, but at least it’ll be one wrong angle instead of many wrong angles. To see whether that’s true, I ran the same file with a cutoff frequency of 3500 Hz. It didn’t work as well as I thought, but instead of getting peaks at 250 and 150 I got the 10 peaks around 310 degrees, which is still weird.

Farthest

For the farthest one I’m only going to take a look at the new file, with the talker looking at pt2. From what I’ve written so far, I see that it is clearly enough to look at the new files, as they are better data.

Comparing no spectral subtraction to with spectral subtraction, I see that although the 20-degree values are slightly better in the without case, the 5- and 10-degree values are zero in the without case while we at least get something in the with case. So this is my second point: spectral subtraction is probably good for getting around the corner effect.

When I looked at the polar plot, as usual at the 2000 Hz cutoff frequency, almost all the raw and the smoothed max energies were around 30 degrees, with only 1 above 35. To check that I am correct in my first point, I ran the same data at 3500 Hz, which improves the percentages at 5 and 10 degrees tolerance. This time more of them are closer to 45 degrees.
-----------------------------------------0------------------------------------------

That's it. I hope there will be more to come by this weekend.

Sunday, January 08, 2006

Finally, some action

After all the previous posts about endless papers, this is one with some action in it. Professor Silverman gave me the code he wrote for the energy method, which determines the talker orientation from the microphone array data. The code consists of a loop that does spectral subtraction (according to S F Boll's paper described in the previous post) and calculates the energy above a given cutoff frequency for each microphone. The minimum energy among all the microphones in each frame is subtracted from all the other microphones in the same frame to correct for some reverberation. Then this data is interpolated to give an energy value for all angles in increments of 0.1 degrees. The front-to-back ratio of these energies is taken. Finally the data is low-pass filtered to make it smoother. The angle with the greatest front-to-back ratio is taken to be the orientation angle.
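The steps after the per-microphone energies are computed can be sketched for a single frame as below. This is not Professor Silverman's code, just a rough illustration under several assumptions: the microphone directions as seen from the talker (`mic_angles`, in degrees from 0 up to but not including 360) are known, the interpolation is linear, the front-to-back ratio divides the energy at each angle by the energy 180 degrees away, and a circular moving average stands in for the low-pass filter. All function names, the `eps` guard and the default step of 1 degree (the post uses 0.1) are choices made for the example.

```python
def interpolate_circular(mic_angles, energies, step):
    """Linearly interpolate energies at mic_angles onto 0, step, 2*step, ..."""
    pts = sorted(zip(mic_angles, energies))
    # Wrap the last and first microphones around the circle.
    pts = ([(pts[-1][0] - 360.0, pts[-1][1])] + pts
           + [(pts[0][0] + 360.0, pts[0][1])])
    n = int(round(360.0 / step))
    out, j = [], 0
    for k in range(n):
        a = k * step
        while pts[j + 1][0] < a:
            j += 1
        (a0, e0), (a1, e1) = pts[j], pts[j + 1]
        t = 0.0 if a1 == a0 else (a - a0) / (a1 - a0)
        out.append(e0 + t * (e1 - e0))
    return out


def smooth_circular(values, width):
    """Circular moving average, a simple stand-in for the low-pass filter."""
    n, half = len(values), width // 2
    return [sum(values[(k + d) % n] for d in range(-half, half + 1))
            / (2 * half + 1) for k in range(n)]


def estimate_orientation(mic_angles, mic_energies, step=1.0, width=5, eps=1e-9):
    """Return the angle (degrees) with the largest smoothed front-to-back ratio."""
    floor = min(mic_energies)                      # minimum-energy subtraction
    corrected = [e - floor for e in mic_energies]
    energy = interpolate_circular(mic_angles, corrected, step)
    n = len(energy)
    back = n // 2                                  # index offset of 180 degrees
    ratio = [energy[k] / (energy[(k + back) % n] + eps) for k in range(n)]
    smoothed = smooth_circular(ratio, width)
    best = max(range(n), key=lambda k: smoothed[k])
    return best * step


# Hypothetical frame: four microphones, the loudest one at 45 degrees.
angle = estimate_orientation([45.0, 135.0, 225.0, 315.0], [5.0, 3.0, 1.0, 3.0])
```

Spectral subtraction and the cutoff-frequency energy calculation would come before this step and are left out here.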

Last week, I first tried to understand the code. One problem was that the method and the testing were implemented in the same .m file. I like to work the opposite way, so I first tried to get the spectral subtraction code out. While doing that I ended up writing my own code, and the result was different from the professor's. After trying to figure out what was going on, I found a missing if statement in mine; I corrected it and it became unbelievably good. I also took out the method as a function with input and output parameters.

Currently we have 5 recordings from the microphone array. In all of them the talker is facing the same corner, and the talker positions lie on the same line, a diagonal from the corner the talker is facing to the opposite corner, with nearly uniform distance between them.

Finally, yesterday I wrote test code to evaluate the 5 recordings with and without spectral subtraction and for different cutoff frequencies. I came in today and realized that I had messed up the variable names in the test code, so I'll run the program right after posting this and get the baseline results.

Thursday, December 29, 2005

More Papers(2)

Here are the other 2 papers that I read over this week.

Suppression of Acoustic Noise in Speech Using Spectral Subtraction (S F Boll): This is truly a great paper written in 1979. The method in this paper is the one my advisor uses to subtract the noise from the signals. To suppress the noise, the paper suggests taking a part of the non-speech signal and creating a signal that has the same statistical characteristics over the full length of the speech signal. Now of course there are issues with the statistical assumptions. The first one is that the noise is assumed to be locally stationary, meaning that its statistical properties are the same during the speech and non-speech segments. If there is a change, then the noise spectrum has to be recalculated. Basically the procedure is: take the filtered and digitized speech; window it as half-overlapped data buffers (a Hanning window is used; since it's half overlapped, the signal can be perfectly reconstructed); calculate the magnitude spectra of the windowed data; subtract the spectral noise bias measured during non-speech activity; zero out the resulting negative amplitudes; apply secondary noise suppression (explained in the next paragraph); form the time waveform and overlap-add it to the previous data.

The secondary noise suppression methods are used to reduce the error between the predicted noise and the real noise. The spectral subtraction process actually applies a filter with a frequency response of [1-(predicted_noise_freq_response / freq_response_of_speech)]. The methods are:
1) time averaging the freq_response of the speech source over a period of time where the speech is assumed to be stationary.
2) half-wave rectification, which is adding the magnitude of the filter to the filter itself and dividing the sum by 2. What this does is set the noisy signal magnitude to zero at the frequencies where it is less than the predicted noise magnitude. The advantage of doing this is that the noise floor is reduced by the freq response of the noise. The disadvantage is that the cases where the combined noise and speech magnitude is less than the predicted noise magnitude are basically lost.
3) Residual noise reduction - after half-wave rectification, for the frequencies where the estimated noise magnitude is less than the real noise magnitude during non-speech activity, you look at adjacent frames and replace the value with the smallest one.
4) Additional Signal Attenuation during Non-Speech Activity: If for a certain frame, over all frequencies, the ratio of the subtracted signal to the estimated noise is less than -12 dB, that frame is considered non-speech activity. In those frames what you do is just attenuate the signal by 30 dB.
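The subtraction filter, the half-wave rectification and the non-speech test described above can be sketched on magnitude spectra like this. It is a rough illustration, not Boll's reference code or the professor's implementation: the inputs are assumed to be lists of per-bin magnitudes for one frame, "over all frequencies" is taken to mean a plain average of the per-bin ratios, and every function and parameter name is made up for the example.

```python
import math


def spectral_subtract(noisy_mag, noise_mag):
    """Subtract the noise estimate per bin and zero out negative results."""
    return [max(x - mu, 0.0) for x, mu in zip(noisy_mag, noise_mag)]


def rectified_filter(noisy_mag, noise_mag):
    """Filter H = 1 - noise/noisy per bin, half-wave rectified as (H + |H|) / 2."""
    gains = []
    for x, mu in zip(noisy_mag, noise_mag):
        h = 1.0 - mu / x if x > 0.0 else 0.0
        gains.append((h + abs(h)) / 2.0)
    return gains


def is_nonspeech_frame(subtracted_mag, noise_mag, threshold_db=-12.0):
    """True if the mean subtracted-to-noise ratio is below threshold_db."""
    ratios = [s / mu for s, mu in zip(subtracted_mag, noise_mag) if mu > 0.0]
    mean_ratio = sum(ratios) / len(ratios) if ratios else 0.0
    if mean_ratio <= 0.0:
        return True
    return 20.0 * math.log10(mean_ratio) < threshold_db


# Hypothetical 3-bin frame and noise estimate.
noisy = [4.0, 1.0, 2.0]
noise = [1.0, 2.0, 2.0]
cleaned = spectral_subtract(noisy, noise)
gains = rectified_filter(noisy, noise)
filtered = [g * x for g, x in zip(gains, noisy)]
```

Applying the rectified filter to the noisy magnitudes gives the same values as subtracting and zeroing the negatives, which is why the two descriptions are interchangeable.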

This is a great paper and I'm really looking forward to seeing how my professor implemented the spectral subtraction in the code.

The Generalized Correlation Method for Estimation of Time Delay (C H Knapp, G C Carter): Another very fundamental paper from the 1970's. The basics of this paper are pretty simple. Before using the cross-correlation method on the two input signals, prefilters are applied that aim to emphasize the signal at high-SNR frequencies and suppress the signal at low-SNR frequencies. The approaches presented here are the Roth Processor, the SCOT Processor, PHAT (which I think is widely used) and the Eckart Filter. The mathematics behind it isn't that simple, but there's no use in explaining it right now.

That's basically it for this holiday week. Tomorrow I'm going back to Providence (I have been in Boston since last Saturday), and I'll look at the code and try to relate the method suggested in the draft paper to the code itself until Monday. So my next post will probably be a week from now, when I'll probably have some results in hand.
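As a small illustration of the prefiltering idea, here is a sketch of the PHAT variant in plain Python. It makes some simplifying assumptions: both signals have the same length, the delay is circular (real recordings would need zero-padding), and a slow direct DFT is used so that no external library is needed. The function names and the test signal are made up for the example.

```python
import cmath


def dft(x, inverse=False):
    """Direct O(n^2) discrete Fourier transform; fine for short examples."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat_delay(x1, x2, eps=1e-12):
    """Delay of x2 relative to x1, in samples, from the PHAT-weighted correlation."""
    n = len(x1)
    spec1, spec2 = dft(x1), dft(x2)
    cross = [b * a.conjugate() for a, b in zip(spec1, spec2)]
    # PHAT weighting: keep only the phase of the cross-spectrum.
    weighted = [c / max(abs(c), eps) for c in cross]
    corr = [v.real for v in dft(weighted, inverse=True)]
    lag = max(range(n), key=lambda i: corr[i])
    return lag if lag <= n // 2 else lag - n


# Hypothetical test signal and a copy delayed (circularly) by 3 samples.
x1 = [0.0] * 16
x1[2], x1[3], x1[5] = 1.0, -0.5, 0.25
x2 = x1[-3:] + x1[:-3]
delay = gcc_phat_delay(x1, x2)
```

Because the weighting throws away the magnitudes, the correlation collapses to a sharp peak at the delay, which is what makes PHAT attractive in reverberant rooms.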