Dr Tim Brookes
About
Biography
I joined the University as Lecturer in Audio in January 1997, having previously been a Research Associate in the Department of Electronics at the University of York, developing real-time transputer-based systems for perceptually-inspired audio analysis. Prior to that I worked in Nottingham as a software developer at Business Systems Computing and as an assistant recording engineer at Square Dance Studios. A long time ago, my MSc project, MIDIBox, led to the development of the commercial product MIDI Creator (which, in May 2022, Look Mum No Computer turned into a MIDI toilet). Sometimes I make music too.
Research
Research interests
My research interests are on the engineering side of psychoacoustics: measuring, modelling and exploiting the relationships between the physical characteristics of sound and the corresponding perception evoked in listeners. I am particularly interested in the development of systems to predict and/or optimise the perceived quality of audio.
Research projects
Postgraduate and funded projects supervised and managed include the following.
- 2001-2004 Towards a Spatial Ear Trainer
- 2001-2005 An Onset-Guided Spatial Analyser for Binaural Audio
- 2002-2005 Perceptually Motivated Measurement of Spatial Sound Attributes for audio-based information systems
- 2003-2008 The Development of SAALTS: A Spatial Audio Attribute Listener Training System
- 2006-2010 The role of head movement in the analysis of spatial impression
- 2004-2010 Perceptual Considerations in Audio Morphing
- 2007-2010 A Psychoacoustic Engineering Approach to Machine Sound Source Separation in Reverberant Environments
- 2006-2011 Towards the automatic assessment of spatial quality in the reproduced sound environment
- 2006-2011 Spatial Audio Creative Engineering Network (SpACE-Net)
- 2008-2012 Listener Response to Different Types of Loudspeaker Directivity
- 2010-2015 The Effect of Head Movement on the Perception of Source Elevation
- 2011-2015 Auditory Adaptation
- 2011-2015 Audio Un-mixing
- 2012-2016 Microphone Quality Metering & Enhancement
- 2013-2016 Metering the Perceived Quality of Mixed Music
- 2017-2018 Distance Adjustments in Loudness Metering
- 2013-2019 S3A: Future Spatial Audio for Immersive Listener Experience at Home
- 2016-2019 Audio Commons: an Ecosystem for Creative Reuse of Audio Content
- 2019-2023 Timbral Characteristics of Off-Axis Microphone Response
Supervision
Postgraduate research supervision
I have supervised fourteen PhD students and also served as internal examiner to a further fourteen and external examiner to eight.
NB. For the foreseeable future I will not be taking on any new PhD students.
Teaching
I currently teach on the following modules.
- TON1027 Acoustics and Psychoacoustics
- TONP017 Professional Training Year Module
- TON3014 Technical Project
Publications
Data accompanying the paper "Evaluation of Spatial Audio Reproduction Methods (Part 2): Analysis of Listener Preference".
This is the dataset used for the accompanying paper "Automatic text clustering for audio attribute elicitation experiment responses".
Simulations of the human hearing system can help in a number of research fields; including work with the speech and hearing impaired as well as improving the accuracy of speech recognition systems and the naturalness of speech synthesis technology. The results from psychoacoustic experiments carried out over the last few decades have enabled models of the human peripheral hearing system to be developed. Conventionally, analyses such as the Fast Fourier Transform are used to analyze speech and other sounds to establish the acoustic cues which are important for human perception. Such analyses can be shown to be inappropriate in a number of ways. Additional insights could be gained into the importance of various acoustic cues if analyses based on hearing models were used. This paper describes an implementation of a real-time spectrograph based on a contemporary model of the peripheral human hearing system, executing on a network of T9000 transputers. The differences between it and conventional spectrographs are illustrated by means of test signals and speech sounds. © 1998 Elsevier Science B.V. All rights reserved.
Understanding the way in which listeners move their heads must be part of any objective model for evaluating and reproducing the sonic experience of space. Head movement is part of the listening experience because it allows listeners to sense the spatial distribution of sound. In the first experiment, the head positions of subjects were recorded when they were asked to evaluate perceived source location, apparent source width, envelopment, and timbre of synthesized stimuli. Head motion was larger when judging source width than when judging direction or timbre. In the second experiment, head movement was observed in natural listening activities such as concerts, movies, and video games. Because the statistics of movement were similar to those observed in the first experiment, laboratory results can be used as the basis of an objective model of spatial behavior. The results were based on 10 subjects.
This research introduces a novel technique for capturing binaural signals for objective evaluation of spatial impression; the technique allows for simulation of the head movement that is typical in a range of listening activities. A subjective listening test showed that the amount of head movement made was larger when listeners were rating perceived source width and envelopment than when rating source direction and timbre, and that the locus of ear positions corresponding to the pattern of head movement formed a bounded sloped path – higher towards the rear and lower towards the front. Based on these findings, a signal capture system was designed comprising a sphere with multiple microphones, mounted on a torso. Evaluation of its performance showed that a perceptual model incorporating this capture system is capable of perceptually accurate prediction of source direction based on interaural time and level differences (ITD and ILD), and of spatial impression based on interaural cross-correlation coefficient (IACC). Investigation into appropriate parameter derivation and interpolation techniques determined that 21 pairs of spaced microphones were sufficient to measure ITD, ILD and IACC across the sloped range of ear positions.
To improve the experience of listening to reproduced audio, it is beneficial to determine the differences between listening to a live performance and a recording. An experiment was performed in which three live performances (a jazz duet, a jazz-rock quintet, and a brass quintet) were captured and simultaneously replayed over a nine-channel with-height surround sound system. Experienced and inexperienced listeners moved freely between the live performance and the reproduction and described the difference in listening experience. In subsequent group discussions, the experienced listeners produced twenty-nine categories using some terms that are not commonly found in the current spatial audio literature. The inexperienced listeners produced five categories that overlapped with the experienced group terms but that were not as detailed.
From the early days of reproduced sound, engineers have sought to reproduce the spatial properties of sound fields, leading to the development of a range of technologies. Two-channel stereo has been prevalent for many years; however, systems with a higher number of discrete channels (including rear and height loudspeakers) are becoming more common and, recently, there has been a move towards loudspeaker-agnostic methods using audio objects. Perceptual evaluation, and perceptually-informed objective measurement, of alternative reproduction systems can inform further development and steer future innovations. It is important, therefore, that any gaps in the field of perceptual evaluation and measurement are identified and that future work aims to fill those gaps. A standard research paradigm in the field is identification of the perceptual attributes of a stimulus set, facilitating controlled listening tests and leading to the development of predictive models. There have been numerous studies that aim to discover the perceptual attributes of reproduced spatial sound, leading to more than fifty descriptive terms. However, a literature review revealed the following key problems: (i) there is little agreement on exact definitions, nor on the relative importance of each attribute; (ii) there may be important attributes that have not yet been identified (e.g. attributes arising from differences between real and reproduced audio, or pertaining to new 3D or object-based methods); and (iii) there is no model of overall spatial quality based directly on the important attributes. 
Consequently, the authors contend that future research should focus on: (i) ascertaining which attributes of reproduced spatial audio are most important to listeners; (ii) identifying any important attributes currently missing; (iii) determining the relationships between the important attributes and listener preference; (iv) modelling overall spatial quality in terms of the important perceptual attributes; and (v) modelling these perceptual attributes in terms of their physical correlates.
This research aims, ultimately, to develop a system for the objective evaluation of spatial impression, incorporating the finding from a previous study that head movements are naturally made in its subjective evaluation. A spherical binaural capture model, comprising a head-sized sphere with multiple attached microphones, has been proposed. Research already conducted found significant differences in interaural time and level differences, and cross-correlation coefficient, between this spherical model and a head and torso simulator. It is attempted to lessen these differences by adding to the sphere a torso and simplified pinnae. Further analysis of the head movements made by listeners in a range of listening situations determines the range of head positions that needs to be taken into account. Analyses of these results inform the optimum positioning of the microphones around the sphere model.
Reverberation is a problem for source separation algorithms. Because the precedence effect allows human listeners to suppress the perception of reflections arising from room boundaries, numerous computational models have incorporated the precedence effect. However, relatively little work has been done on using the precedence effect in source separation algorithms. This paper compares several precedence models and their influence on the performance of a baseline separation algorithm. The models were tested in a variety of reverberant rooms and with a range of mixing parameters. Although there was a large difference in performance among the models, the one that was based on interaural coherence and onset-based inhibition produced the greatest performance improvement. There is a trade-off between selecting reliable cues that correspond closely to free-field conditions and maximizing the proportion of the input signals that contributes to localization. For optimal source separation performance, it is necessary to adapt the dynamic component of the precedence model to the acoustic conditions of the room.
Test recordings can facilitate evaluation of a microphone's characteristics but there is currently no standard or experimentally validated method for making recordings to compare the perceptual characteristics of microphones. This paper evaluates previously used recording methods, concluding that, of these, the most appropriate approach is to record multiple microphones simultaneously. However, perceived differences between recordings made with microphones in a multi-microphone array might be due to (i) the characteristics of the microphones and/or (ii) the different locations of the microphones. Listening tests determined the maximum acceptable size of a multi-microphone array to be 150 mm in diameter, but the diameter must be reduced to no more than 100 mm if the microphones to be compared are perceptually very similar.
Spatial audio processes (SAPs) commonly encountered in consumer audio reproduction systems are known to generate a range of impairments to spatial quality. Two listening tests (involving two listening positions, six 5-channel audio recordings, and 48 SAPs) indicate that the degree of quality degradation is determined largely by the nature of the SAP but that the effect of a particular SAP can depend on program material and on listening position. Combining off-center listening with another SAP can reduce spatial quality significantly compared to auditioning that SAP centrally. These findings, and the associated listening test data, can guide the development of an artificial-listener-based spatial audio quality evaluation system.
Previous research has indicated that the relationship between the interaural cross-correlation coefficient (IACC) of a narrow-band sound and its perceived auditory source width is dependent on its frequency. However, this dependency has not been investigated in sufficient detail for researchers to be able to properly model it in order to produce a perceptually relevant IACC-based model of auditory source width. A series of experiments has therefore been conducted to investigate this frequency dependency in a controlled manner, and to derive an appropriate model. Three main factors were discovered in the course of these experiments. First, the nature of the frequency dependency of the perceived auditory source width of stimuli with an IACC of 1 was determined, and an appropriate mathematical model was derived. Second, the loss of perceived temporal detail at high frequencies, caused by the breakdown of phase locking in the ear, was found to be relevant, and the model was modified accordingly using rectification and a low-pass filter. Finally, it was found that there was a further frequency dependency at low frequencies, and a method for modeling this was derived. The final model was shown to predict the experimental data well. (c) 2005 Acoustical Society of America.
This research extends the study of head movements during listening by including various listening tasks where the listeners evaluate spatial impression and timbre, in addition to the more common task of judging source location. Subjective tests were conducted in which the listeners were allowed to move their heads freely whilst listening to various types of sound and asked to evaluate source location, apparent source width, envelopment, and timbre. The head movements were recorded with a head tracker attached to the listener’s head. From the recorded data, the maximum range of movement, mean position and speed, and maximum speed were calculated along each axis of translational and rotational movement. The effects of various independent variables, such as the attribute being evaluated, the stimulus type, the number of repetitions, and the simulated source location were examined through statistical analysis. The results showed that whilst there were differences between the head movements of individual subjects, across all listeners the range of movement was greatest when evaluating source width and envelopment, less when localising sources, and least when judging timbre. In addition, the range and speed of head movement were reduced for transient signals compared to longer musical or speech phrases. Finally, in most cases for the judgement of spatial attributes, head movement was in the direction of the source.
Linear regression is commonly used in the audio industry to create objective measurement models that predict subjective data. For any model development, the measure used to evaluate the accuracy of the prediction is important. The most common measures assume a linear relationship between the subjective data and the prediction, though in the early stages of model development this is not always the case. Measures based on rank ordering (such as Spearman’s test), can alternatively be used. Spearman’s test, however, does not consider the variance of the subjective data. This paper presents a method of incorporating the subjective variance into the Spearman’s rank ordering test using Monte Carlo simulations, and shows how this can be beneficial in the development of predictive models.
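The Monte Carlo approach described above might be sketched roughly as follows; `mc_spearman` and its parameters are illustrative names, not the paper's actual implementation. Each iteration resamples the subjective scores from their reported means and standard deviations and recomputes the rank correlation against the model predictions, so the subjective variance is reflected in a distribution of Spearman's rho values rather than a single number.

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks.
    No tie correction; the resampled scores are continuous, so ties
    are vanishingly unlikely."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

def mc_spearman(pred, mean_scores, score_sd, n_iter=2000, seed=0):
    """Monte Carlo Spearman test (illustrative sketch): draw each
    subjective score from a normal distribution with its reported mean
    and standard deviation, then compute rho against the predictions.
    Returns the array of rho values across iterations."""
    rng = np.random.default_rng(seed)
    rhos = np.empty(n_iter)
    for i in range(n_iter):
        sampled = rng.normal(mean_scores, score_sd)  # perturb each score
        rhos[i] = spearman_rho(pred, sampled)
    return rhos
```

The spread of the returned distribution indicates how robust the rank ordering is to the subjective uncertainty: well-separated means with small standard deviations yield rho values tightly clustered near the point estimate, while noisy scores produce a wide spread.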
This research incorporates the nature of head movement made in listening activities into the development of a quasi-binaural acoustical measurement technique for the evaluation of spatial impression. A listening test was conducted where head movements were tracked whilst the subjects rated the perceived source width, envelopment, source direction and timbre of a number of stimuli. It was found that the extent of head movements was larger when evaluating source width and envelopment than when evaluating source direction and timbre. It was also found that the locus of ear positions corresponding to these head movements formed a bounded sloped path, higher towards the rear and lower towards the front. This led to the concept of a signal capture device comprising a torso-mounted sphere with multiple microphones. A prototype was constructed and used to measure three binaural parameters related to perceived spatial impression - interaural time and level differences (ITD and ILD) and interaural cross-correlation coefficient (IACC). Comparison of the prototype measurements to those made with a rotating Head and Torso Simulator (HATS) showed that the prototype could be perceptually accurate for the prediction of source direction using ITD and ILD, and for the prediction of perceived spatial impression using IACC. Further investigation into parameter derivation and interpolation methods indicated that 21 pairs of discretely spaced microphones were sufficient to measure the three binaural parameters across the sloped range of ear positions identified in the listening test.
Measurements that attempt to predict the perceived spatial impression of musical signals in concert halls typically are conducted by calculating the interaural cross-correlation coefficient (IACC) of an impulse response. The causes of interaural decorrelation are investigated and it is found that decorrelation is affected by frequency-dependent interaural time and level differences and by variations in these over time. It is found that the IACC of impulsive and of narrowband tonal signals can be very different from each other in a wide range of acoustical environments, due to the differences in the spectral content and the duration of the signals. From this, it is concluded that measurements made of impulsive signals are unsuitable for attempting to predict the perceived spatial impression of musical signals. It is suggested that further work is required to develop a set of test signals that is representative of a wide range of musical stimuli.
An algorithm is described which detects auditory onsets quickly in arbitrary binaural audio streams. Aspects of the precedence effect are implemented to speed up computation, and to increase the usability of the output. The onset detector is tested with a number of binaural signals. Onsets that are suitable for spatial auditory processing are found reliably. This will allow spatial feature extraction to be performed.
It is desirable to determine which of the many different spatial audio reproduction systems listeners prefer, and the perceptual attributes that are most important to listener experience, so that future systems can be perceptually optimized. A paired comparison preference rating experiment was performed alongside a free elicitation task for eight reproduction methods (consumer and professional systems with a wide range of expected quality) and seven program items (representative of potential broadcast material). The experiment was performed by groups of experienced and inexperienced listeners. Thurstone Case V modeling was used to produce preference scales. Both listener groups preferred systems with increased spatial content; nine- and five-channel systems were most preferred. The use of elicited attributes was analyzed alongside the preference ratings, resulting in an approximate hierarchy of attribute importance: three attributes (amount of distortion, output quality, and bandwidth) were found to be important for differentiating systems where there was a large preference difference; sixteen were always important (most notably enveloping and horizontal width); and seven were used alongside small preference differences.
The spatial quality of automotive audio systems is often compromised due to their non-ideal listening environments. Automotive audio systems need to be developed quickly due to industry demands. A suitable perceptual model could evaluate the spatial quality of automotive audio systems with similar reliability to formal listening tests but take less time. Such a model is developed in this research project by adapting an existing model of spatial quality for automotive audio use. The requirements for the adaptation were investigated in a literature review. A perceptual model called QESTRAL was reviewed, which predicts the overall spatial quality of domestic multichannel audio systems. It was determined that automotive audio systems are likely to be impaired in terms of spatial attributes that were not considered in developing the QESTRAL model, but metrics are available that might predict these attributes. To establish whether the QESTRAL model in its current form can accurately predict the overall spatial quality of automotive audio systems, MUSHRA listening tests using headphone auralisation with head tracking were conducted to collect results to be compared against predictions by the model. Based on guideline criteria, the model in its current form could not accurately predict the overall spatial quality of automotive audio systems. To improve prediction performance, the QESTRAL model was recalibrated and modified using existing metrics of the model, those that were proposed from the literature review, and newly developed metrics. The most important metrics for predicting the overall spatial quality of automotive audio systems included those that are interaural cross-correlation (IACC) based, relate to localisation of the frontal audio scene, and account for the perceived scene width in front of the listener. Modifying the model for automotive audio systems did not invalidate its use for domestic audio systems.
The resulting model predicts the overall spatial quality of 2- and 5-channel automotive audio systems with a cross-validation performance of R^2 = 0.85 and root-mean-square error (RMSE) = 11.03%.
A number of problems have recently come to light whilst attempting to perform perceptually relevant computational analysis of binaural recordings made within enclosed spaces. In particular, it is not possible to extract reliable information for auditory source width or listener envelopment without accounting for the time-domain properties of the stimulus. A new method for performing computational spatial analysis entails computing the running interaural cross-correlation of the binaural signal whilst employing an adaptive filter to perform basic dereverberation, hence gaining an amplitude characteristic of the source stream. Early experimental results indicate that this new technique yields an indication of auditory spatial attributes which is more reliable than that attainable previously.
In the context of devising a spatial ear-training system, a study into the perceptual construct ‘ensemble depth’ was conducted. Based on the findings of a pilot study into the auditory effects of early reflection (ER) pattern characteristics, exemplary stimuli were created. Changes were highly controlled to allow unidimensional variation of the intended quality. To measure the psychological structure of the stimuli and hence to evaluate the success of the simulation, Multidimensional Scaling (MDS) techniques were employed. Supplementary qualitative data were collected to assist with the analyses of the perceptual (MDS) spaces. Results show (1) that syllabicity of source material (rather than ER design) is crucial to depth hearing and (2) that unidimensionality was achieved, thus suggesting the stimuli to be suitable for training purposes.
Listeners are more sensitive to timbral differences when comparing stimuli side-by-side than temporally-separated. The contributions of auditory memory and spectral compensation to this effect are unclear. A listening test examined the role of auditory memory in timbral discrimination, across retention intervals (RIs) of up to 40 s. For timbrally complex music stimuli discrimination accuracy was good across all RIs, but there was increased sensitivity to onset spectrum, which decreased with increasing RI. Noise stimuli showed no onset sensitivity but discrimination performance declined with RIs of 40 s. The difference between program types may suggest different onset sensitivity and memory encoding (categorical vs non-categorical). The onset bias suggests that memory effects should be measured prior to future investigation of spectral compensation.
Timbral qualities of loudspeakers and rooms are often compared in listening tests involving short listening periods. Outside the laboratory, listening occurs over a longer time course. In a study by Olive et al. (1995) smaller timbral differences between loudspeakers and between rooms were reported when comparisons were made over longer versus shorter time periods. This is a form of timbral adaptation, a decrease in sensitivity to timbre over time. The current study confirms this adaptation and establishes that it is not due to response bias but may be due to timbral memory, specific mechanisms compensating for transmission channel acoustics, or attentional factors. Modifications to listening tests may be required where tests need to be representative of listening outside of the laboratory.
There are a wide variety of spatial audio reproduction systems available, from a single loudspeaker to many spatially distributed loudspeakers. An important factor in the selection, development, or optimization of such systems is listener preference, and the important perceptual characteristics that contribute to this. An experiment was performed to determine the attributes that contribute to listener preference for a range of spatial audio reproduction methods. Experienced and inexperienced listeners made preference ratings for combinations of seven program items replayed over eight reproduction systems, and reported the reasons for their judgments. Automatic text clustering reduced redundancy in the responses by approximately 90%, facilitating subsequent group discussions that produced clear attribute labels, descriptions, and scale end-points. Twenty-seven and twenty-four attributes contributed to preference for the experienced and inexperienced listeners respectively. The two sets of attributes contain a degree of overlap (ten attributes from the two sets were closely related); however, the experienced listeners used more technical terms whilst the inexperienced listeners used more broad descriptive categories.
Reverberation continues to present a major problem for sound source separation algorithms. However, humans demonstrate a remarkable robustness to reverberation and many psychophysical and perceptual mechanisms are well documented. The precedence effect is one of these mechanisms; it aids our ability to localize sounds in reverberation. Despite this, relatively little work has been done on incorporating the precedence effect into automated source separation. Furthermore, no work has been carried out on adapting a precedence model to the acoustic conditions under test and it is unclear whether such adaptation, analogous to the perceptual Clifton effect, is even necessary. Hence, this study tests a previously proposed binaural separation/precedence model in real rooms with a range of reverberant conditions. The precedence model inhibitory time constant and inhibitory gain are varied in each room in order to establish the necessity for adaptation to the acoustic conditions. The paper concludes that adaptation is necessary and can yield significant gains in separation performance. Furthermore, it is shown that the initial time delay gap and the direct-to-reverberant ratio are important factors when considering this adaptation.
A new tool for speech analysis is presented, operating in real-time and incorporating the analysing power of a contemporary auditory model to produce the familiar display of the speech spectrograph. This ‘auditory spectrograph’ is used to analyse English consonant sounds and the results are compared with conventional wide and narrow band spectrograms. The auditory analyses are found to attach more visual weight to the acoustic cues associated with speech production and perception, and features that are either difficult or impossible to distinguish on conventional spectrograms are clarified.
A measurement model based on the interaural cross-correlation coefficient (IACC) that attempts to predict the perceived source width of a range of auditory stimuli is currently under development. It is necessary to combine the predictions of this model with measurements of interaural time difference (ITD) to allow the model to provide its output on a meaningful scale and to allow integration of results across frequency. A detailed subjective experiment was undertaken using narrow-band stimuli with a number of centre frequencies, IACCs and ITDs. Subjects were asked to indicate the perceived position of the left and right boundaries of a number of these stimuli by altering the ITD of a pair of white noise comparison stimuli. It is shown that an existing IACC-based model provides a poor prediction of the subjective results but that modifications to the model significantly increase its accuracy.
A model based on the interaural cross-correlation coefficient (IACC) has been developed that aims to predict the perceived source width of a wide range of sounds. The following factors differentiate it from more commonly used IACC-based measurements: the use of a running measurement to quantify variations in width over time; half-wave rectification and low pass filtering of the input signal to mimic the breakdown of phase locking in the ear; compensation for the frequency and loudness dependency of perceived width; combination of a model of perceived location with a model of perceived width; and conversion of the results to an intuitive scale. Objective and subjective methods have been used to evaluate the accuracy and limitations of the resulting measurement model.
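As a rough illustration of the quantities the model above is built on, the sketch below computes a basic IACC (the maximum normalised cross-correlation of the two ear signals over lags of ±1 ms) together with a crude half-wave-rectification-plus-low-pass stage of the kind used to mimic the breakdown of phase locking. The function names, the one-pole filter, and the smoothing constant are illustrative simplifications, not the published model.

```python
import numpy as np

def iacc(left, right, fs, max_lag_ms=1.0):
    """Interaural cross-correlation coefficient: maximum of the
    normalised cross-correlation over lags of +/- max_lag_ms."""
    n = len(left)
    max_lag = int(fs * max_lag_ms / 1000)
    norm = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2))
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            c = np.sum(left[lag:] * right[:n - lag])
        else:
            c = np.sum(left[:n + lag] * right[-lag:])
        best = max(best, abs(c) / norm)
    return best

def halfwave_lowpass(x, alpha=0.1):
    """Crude inner-ear stage: half-wave rectification followed by a
    one-pole low-pass filter, mimicking the loss of fine temporal
    structure at high frequencies (alpha is an illustrative
    smoothing constant, not a calibrated value)."""
    y = np.maximum(x, 0.0)
    out = np.empty_like(y)
    acc = 0.0
    for i, v in enumerate(y):
        acc += alpha * (v - acc)
        out[i] = acc
    return out
```

Identical ear signals give an IACC of 1 (maximally narrow image), while independent signals give values near 0; a running version of the model applies this measurement over successive short windows to track variations in width over time.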
For subjective experimentation on 3D audio systems, suitable programme material is needed. A large-scale recording session was performed in which four ensembles were recorded with a range of existing microphone techniques (aimed at mono, stereo, 5.0, 9.0, 22.0, ambisonic, and headphone reproduction) and a novel 48-channel circular microphone array. Further material was produced by remixing and augmenting pre-existing multichannel content. To mix and monitor the programme items (which included classical, jazz, pop and experimental music, and excerpts from a sports broadcast and a film soundtrack), a flexible 3D audio reproduction environment was created. Solutions to the following challenges were found: level calibration for different reproduction formats; bass management; and adaptable signal routing from different software and file formats.
A system for morphing the softness and brightness of two sounds independently from their other perceptual or acoustic attributes was coded. The system is an extension of a previous one, which morphed brightness only and was based on the Spectral Modelling Synthesis additive/residual model. A Multidimensional Scaling analysis, of listener responses to paired comparisons of stimuli generated by the morpher, showed movement in three perceptually-orthogonal directions. These directions were labelled in a subsequent verbal elicitation experiment which found that the effects of the brightness and softness controls were perceived as intended. A Timbre Morpher, adjusting additional timbral attributes with perceptually-meaningful controls, can now be considered for further work.
In order to undertake controlled investigations into perceptual effects that relate to the interaural cross-correlation coefficient, experiment stimuli that meet a tight set of criteria are required. The requirements of each stimulus are that it is narrow band, normally has a constant cross-correlation coefficient over time, and can be altered to cover the full range of values of cross-correlation coefficient, including specified variations over time if required. Stimuli created using a technique based on amplitude modulation are found to meet these criteria, and their use in a number of subjective experiments is described.
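The amplitude-modulation technique itself is not reproduced here, but the goal can be illustrated with a different, widely used textbook construction: mixing two independent unit-variance noises with sum/difference weights so that the resulting pair has a specified correlation coefficient ρ. All names below are illustrative assumptions, and this is explicitly not the method described in the abstract.

```python
import numpy as np

def correlated_noise_pair(rho, n_samples, rng):
    """Generate two noise signals whose wideband correlation coefficient
    is rho, by mixing two independent unit-variance noises with
    sum/difference weights. A common textbook construction; NOT the
    amplitude-modulation technique described in the paper."""
    n1 = rng.standard_normal(n_samples)
    n2 = rng.standard_normal(n_samples)
    a = np.sqrt((1.0 + rho) / 2.0)
    b = np.sqrt((1.0 - rho) / 2.0)
    left = a * n1 + b * n2
    right = a * n1 - b * n2
    return left, right
```

Since E[left·right] = (1+ρ)/2 − (1−ρ)/2 = ρ and both outputs have unit variance, the pair's correlation coefficient is ρ by construction.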
Object-based audio can be used to customize, personalize, and optimize audio reproduction depending on the specific listening scenario. To investigate and exploit the benefits of object-based audio, a framework for intelligent metadata adaptation was developed. The framework uses detailed semantic metadata that describes the audio objects, the loudspeakers, and the room. It features an extensible software tool for real-time metadata adaptation that can incorporate knowledge derived from perceptual tests and/or feedback from perceptual meters to drive adaptation and facilitate optimal rendering. One use case for the system is demonstrated through a rule-set (derived from perceptual tests with experienced mix engineers) for automatic adaptation of object levels and positions when rendering 3D content to two- and five-channel systems.
To improve the search functionality of online sound effect libraries, timbral information could be extracted using perceptual models, and added as metadata, allowing users to filter results by timbral characteristics. This paper identifies the timbral attributes that end-users commonly search for, to indicate the attributes that might usefully be modelled for automatic metadata generation. A literature review revealed 1187 descriptors that were subsequently reduced to a hierarchy of 145 timbral attributes. This hierarchy covered the timbral characteristics of source types and modifiers including musical instruments, speech, environmental sounds, and sound recording and reproduction systems. A part-manual, part-automated comparison between the hierarchy and a freesound.org search history indicated that the timbral attributes hardness, depth, and brightness occur in searches most frequently.
Whilst it is possible to create exciting, immersive listening experiences with current spatial audio technology, the required systems are generally difficult to install in a standard living room. However, in any living room there is likely to already be a range of loudspeakers (such as mobile phones, tablets, laptops, and so on). “Media device orchestration” (MDO) is the concept of utilising all available devices to augment the reproduction of a media experience. In this demonstration, MDO is used to augment low channel count renderings of various programme material, delivering immersive three-dimensional audio experiences.
The “spectral compensation effect” (Watkins, 1991) describes a decrease in perceptual sensitivity to spectral modifications caused by the transmission channel (e.g., loudspeakers, listening rooms). Few studies have examined this effect: its extent and perceptual mechanisms are not confirmed. The extent to which compensation affects the perception of sounds colored by loudspeakers and other channels should be determined. This compensation has been mainly studied with speech. Evidence suggests that speech engages special perceptual mechanisms, so compensation might not occur with non-speech sounds. The current study provides evidence of compensation for spectrum in non-speech tests: channel coloration was reduced by approximately 20%.
Computational auditory models that predict the perceived location of sound sources in terms of azimuth are already available, yet little has been done to predict perceived elevation. Interaural time and level differences, the primary cues in horizontal localisation, do not resolve source elevation, resulting in the ‘Cone of Confusion’. In natural listening, listeners can make head movements to resolve such confusion. To mimic the dynamic cues provided by head movements, a multiple microphone sphere was created, and a hearing model was developed to predict source elevation from the signals captured by the sphere. The prototype sphere and hearing model proved effective in both horizontal and vertical localisation. The next stage of this research will be to rigorously test a more physiologically accurate capture device.
Envelopment is an important attribute of listener preference for spatial audio reproduction. Object-based audio offers the possibility of altering the rendering of an audio scene in order to modify or maintain perceptual attributes - including envelopment - if the relationships between attributes and mix parameters are known. In a method of adjustment experiment, mixing engineers were asked to produce mixes of four program items at low, medium, and high levels of envelopment, in 2-channel, 5-channel, and 22-channel reproduction systems. The participants could vary a range of level, position, and equalization parameters that can be modified in object-based audio systems. The parameters could be varied separately for different semantic object categories. Nine parameters were found to have significant relationships with envelopment; parameters relating to the horizontal and vertical spread of sources were shown to be most important. A follow-on experiment demonstrated that these parameters can be adjusted to produce a range of envelopment levels in other program items.
When listening test subjects are required to rate changes in a single attribute, but also hear changes in other attributes, their ratings can become skewed by “dumping bias.” To assess the influence of dumping bias on timbral “clarity” ratings, listeners were asked to rate stimuli: (i) in terms of clarity only; and (ii) in terms of clarity, warmth, fullness, and brightness. Clarity ratings of type (i) showed (up to 20%) larger interquartile ranges than those of type (ii). It is concluded that in single-attribute timbral rating experiments, statistical noise—potentially resulting from dumping bias—can be reduced by allowing listeners to rate additional attributes either simultaneously or beforehand.
Previous studies give contradicting evidence as to the importance of head movements in localisation. In this study head movements were shown to increase localisation response accuracy in elevation and azimuth. For elevation, it was found that head movement improved localisation accuracy in some cases and that when pinna cues were impeded the significance of head movement cues was increased. For azimuth localisation, head movement reduced front-back confusions. There was also evidence that head movement can be used to enhance static cues for azimuth localisation. Finally, it appears that head movement can increase the accuracy of listeners’ responses by enabling an interaction between auditory and visual cues.
Auditory width measurements based on the interaural cross-correlation coefficient (IACC) are often used in the field of concert hall acoustics. However, there are a number of problems with such measurements, including large variations around the centre of a room and a limited range of values at low frequencies. This paper explores how some of these problems can be solved by applying the IACC in a more perceptually valid manner and using it as part of a more complete hearing model. It is proposed that measurements based on the IACC may match the perceived width of stimuli more accurately if a source signal is measured rather than an impulse response, and when factors such as frequency and loudness are taken into account. Further developments are considered, including methods to integrate the results calculated in different frequency bands, and the temporal response of spatial perception.
Significant amounts of user-generated audio content, such as sound effects, musical samples and music pieces, are uploaded to online repositories and made available under open licenses. Moreover, a constantly increasing amount of multimedia content, originally released with traditional licenses, is becoming public domain as its license expires. Nevertheless, the creative industries are not yet using much of all this content in their media productions. There is still a lack of familiarity and understanding of the legal context of all this open content, but there are also problems related with its accessibility. A big percentage of this content remains unreachable either because it is not published online or because it is not well organised and annotated. In this paper we present the Audio Commons Initiative, which is aimed at promoting the use of open audio content and at developing technologies with which to support the ecosystem composed by content repositories, production tools and users. These technologies should enable the reuse of this audio material, facilitating its integration in the production workflows used by the creative industries. This is a position paper in which we describe the core ideas behind this initiative and outline the ways in which we plan to address the challenges it poses.
The ideal binary mask (IBM) is widely considered to be the benchmark for time–frequency-based sound source separation techniques such as computational auditory scene analysis (CASA). However, it is well known that binary masking introduces objectionable distortion, especially musical noise. This can make binary masking unsuitable for sound source separation applications where the output is auditioned. It has been suggested that soft masking reduces musical noise and leads to a higher quality output. A previously defined soft mask, the ideal ratio mask (IRM), is found to have similar properties to the IBM, may correspond more closely to auditory processes, and offers additional computational advantages. Consequently, the IRM is proposed as the goal of CASA. To further support this position, a number of studies are reviewed that show soft masks to provide superior performance to the IBM in applications such as automatic speech recognition and speech intelligibility. A brief empirical study provides additional evidence demonstrating the objective and perceptual superiority of the IRM over the IBM.
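In time–frequency masking terms, both masks can be written down directly from the target and interferer spectrogram magnitudes. The sketch below is illustrative; the 0 dB local criterion and the energy-ratio form of the IRM are common defaults, not necessarily the exact definitions used in every study reviewed.

```python
import numpy as np

def ideal_binary_mask(target_mag, interf_mag, lc_db=0.0):
    """IBM: pass a time-frequency cell (mask = 1) where the local
    target-to-interferer ratio exceeds the local criterion in dB."""
    snr_db = 20.0 * np.log10((target_mag + 1e-12) / (interf_mag + 1e-12))
    return (snr_db > lc_db).astype(float)

def ideal_ratio_mask(target_mag, interf_mag):
    """IRM: soft mask giving each cell the target's share of the energy,
    avoiding the abrupt on/off switching that causes musical noise."""
    t2, i2 = target_mag ** 2, interf_mag ** 2
    return t2 / (t2 + i2 + 1e-12)
```

Where the target dominates a cell the two masks agree closely; where target and interferer energies are comparable the IRM attenuates gently instead of switching, which is the behaviour credited with reducing musical noise.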
In a previous study it was discovered that listeners normally make head movements attempting to evaluate source width and envelopment as well as source location. To accommodate this finding in the development of an objective measurement model for spatial impression, two capture models were introduced and designed in this research, based on binaural techniques: 1) a rotating Head And Torso Simulator (HATS), and 2) a sphere with multiple microphones. As an initial study, measurements of interaural time difference (ITD), level difference (ILD) and cross-correlation coefficient (IACC) made with the HATS were compared with those made with a sphere containing two microphones. The magnitude of the differences was judged in a perceptually relevant manner by comparing them with the just-noticeable differences (JNDs) of these parameters. The results showed that the differences were generally not negligible, implying the necessity of enhancement of the sphere model, possibly by introducing equivalents of the pinnae or torso. An exception was the case of IACC, where the reference of JND specification affected the perceptual significance of its difference between the two models.
Auditory adaptation is thought to reduce the perceptual impact of static spectral energy and increase sensitivity to spectral change. Research suggests that this adaptation helps listeners to extract stable speech cues across different talkers, despite inter-talker spectral variations caused by differing vocal tract acoustics. This adaptation may also be involved in compensation for distortions caused by transmission channels more generally (e.g. distortions caused by the room or loudspeaker through which a sound has passed). The magnitude of this adaptation and its ecological importance have not been established. The physiological and psychological mechanisms behind adaptation are also not well understood. The current research aimed to confirm that adaptation to transmission channel spectrum occurs when listening to speech produced through two types of transmission channel: loudspeakers and rooms. The loudspeaker is analogous to the vocal tract of a talker, imparting resonances onto a sound source which reaches the listener both directly and via reflections. The room-affected speech however, reaches the listener only via reflections – there is no direct path. Larger adaptation to the spectrum of the room was found, compared to adaptation to the spectrum of the loudspeaker. It appears that when listening to speech, mechanisms of adaptation to room reflections, and adaptation to loudspeaker/vocal tract spectrum, may be different.
A system for morphing the warmth of a sound independently from its other timbral attributes was coded, building on previous work morphing brightness only (1), and morphing brightness and softness (2). The new warmth-softness-brightness morpher was perceptually validated using a series of listening tests. A Multidimensional Scaling analysis of listener responses to paired-comparisons showed perceptually orthogonal movement in two dimensions within a warmth-morphed and everything-else-morphed stimulus set. A verbal elicitation experiment showed that listeners’ descriptive labeling of these dimensions was as intended. A further ‘quality control’ experiment provided evidence that no ‘hidden’ timbral attributes were altered in parallel with the intended ones. A complete timbre morpher can now be considered for further work, and evaluated using the tri-stage procedure documented here.
A system for morphing the brightness of two sounds independently from their other perceptual or acoustic attributes was coded, based on the Spectral Modelling Synthesis additive/residual model. A Multidimensional Scaling analysis of listener responses showed that the brightness control was perceptually independent from the other controls used to adjust the morphed sound. A Timbre Morpher, adjusting additional timbral attributes with perceptually meaningful controls, can now be considered for further work.
The IoSR is responsible for world-class research in audio-related subject areas, and offers postgraduate research-based MPhil and PhD programmes, as well as being home to the world-famous Tonmeister™ BMus undergraduate degree course in Music & Sound Recording. Since the creation of the Institute of Sound Recording (IoSR) in 1998 it has become known internationally as a leading centre for research in psychoacoustic engineering, with world-class facilities and with significant funding from research councils (in particular EPSRC) and from industry (we have successfully completed projects in collaboration with Adrian James Acoustics, Bang & Olufsen, BBC R&D, Genelec, Harman-Becker, Institut für Rundfunktechnik, Meridian Audio, Nokia, Pharos Communications and Sony BPE). Additionally, the IoSR was a founding partner in the EPSRC-funded Digital Music Research Network (DMRN) and Spatial Audio Creative Engineering Network (SpACE-Net). We are interested in human perception of audio quality, primarily of high-fidelity music signals. Our work combines elements of acoustics, digital signal processing, psychoacoustics (theoretical and experimental), psychology, sound synthesis, software engineering, statistical analysis and user-interface design, with an understanding of the aesthetics of sound and music. One particular focus of our work is the development of tools to predict the perceived audio quality of a given soundfield or audio signal. If, for example, a new concert hall, hi-fi or audio codec is being designed, it is important to know how each candidate prototype would be rated by human listeners and how it would compare to other products which may be in competition. Traditional acoustic and electronic measurements (e.g. RT60, SNR, THD) can give some indication but a truly representative assessment requires lengthy listening tests with a panel of skilled human listeners. Such tests are time-consuming, costly and often logistically difficult. 
The tools that we are developing will describe the quality of the prototype without the need for human listeners. An introduction to our research will be given by the Director of Research, Dr Tim Brookes, followed by demonstrations and posters from our postgraduate researchers. We welcome those working in industry and academia to attend the presentation and to discuss our recent findings and overall research goals.
Separation of underdetermined audio mixtures is often performed in the Time-Frequency (TF) domain by masking each TF element according to its target-to-mixture ratio. This work uses sigmoidal functions to map the target-to-mixture ratio to mask values. The series of functions used encompasses the ratio mask and an approximation of the binary mask. Mixtures are chosen to represent a range of different amounts of TF overlap, then separated and evaluated using objective measures. PEASS results show that improved interferer suppression and artifact scores can be achieved using softer masking than that applied by binary or ratio masks. The improvement in these scores gives an improved overall perceptual score; this observation is repeated at multiple TF resolutions.
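Such a family of masks can be sketched as a logistic function of the local ratio in dB, with a slope parameter controlling the hardness of the masking. The sketch below (function name and parameterisation are illustrative assumptions, not the paper's exact function family) is written in terms of the target-to-interferer ratio for algebraic convenience: a slope of ln(10)/20 then reproduces the magnitude ratio mask t/(t+i) exactly, while very large slopes approach the binary mask.

```python
import numpy as np

def sigmoidal_mask(tir_db, slope):
    """Map target-to-interferer ratio (dB) to a mask value in (0, 1) via a
    logistic function; 'slope' sets the hardness of the masking.
    slope = ln(10)/20 reproduces the magnitude ratio mask t/(t+i) exactly,
    and large slopes approach a 0 dB-criterion binary mask.
    Illustrative sketch, not the paper's exact function family."""
    return 1.0 / (1.0 + np.exp(-slope * np.asarray(tir_db, dtype=float)))
```

To see the ratio-mask equivalence: with r = 20·log10(t/i), the logistic 1/(1 + e^(−r·ln10/20)) equals 1/(1 + i/t) = t/(t+i).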
Experiments were undertaken to elicit the perceived effects of head-position-dependent variations in the interaural cross-correlation coefficient of a range of signals. A graphical elicitation experiment showed that the variations in the IACC strongly affected the perceived width and depth of the reverberant environment, as well as the perceived width and distance of the sound source. A verbal experiment gave similar results, and also indicated that the head-position-dependent IACC variations caused changes in the perceived spaciousness and envelopment of the stimuli.
A number of metrics have been proposed in the literature to assess sound source separation algorithms. The addition of convolutional distortion raises further questions about the assessment of source separation algorithms in reverberant conditions as reverberation is shown to undermine the optimality of the ideal binary mask (IBM) in terms of signal-to-noise ratio (SNR). Furthermore, with a range of mixture parameters common across numerous acoustic conditions, SNR–based metrics demonstrate an inconsistency that can only be attributed to the convolutional distortion. This suggests the necessity for an alternate metric in the presence of convolutional distortion, such as reverberation. Consequently, a novel metric—dubbed the IBM ratio (IBMR)—is proposed for assessing source separation algorithms that aim to calculate the IBM. The metric is robust to many of the effects of convolutional distortion on the output of the system and may provide a more representative insight into the performance of a given algorithm.
Typically, measurements that aim to predict perceived spatial impression of music signals in concert halls are performed by calculating the interaural cross-correlation coefficient (IACC) of a binaurally-recorded impulse response. Previous research, however, has shown that this can lead to results very different from those obtained if a musical input signal is used. The reasons for this discrepancy were investigated, and it was found that the overall duration of the source signal, its onset and offset times, and the magnitude and rate of any spectral fluctuations, have a very strong effect on the IACC. Two test signals, synthesised to be representative of a wide range of musical stimuli, can extend the external validity of traditional IACC-based measurements.
The attributes contributing to the differences perceived between microphones (when auditioning recordings made with those microphones) are not clear from previous research. Consideration of technical specifications and expert opinions indicated that recording five programme items with eight studio and two MEMS microphones could allow determination of the attributes related to the most prominent inter-microphone differences. Pairwise listening comparisons between the resulting 50 recordings, followed by multi-dimensional scaling analysis, revealed up to five salient dimensions per programme item; seventeen corresponding pairs of recordings were selected exemplifying the differences across those dimensions. Direct elicitation and panel discussions on the seventeen pairs identified a hierarchy of 40 perceptual attributes. An attribute contribution experiment on the 31 lowest-level attributes in the hierarchy allowed them to be ordered by degree of contribution and showed brightness, harshness, and clarity to always contribute highly to perceived inter-microphone differences. This work enables the future development of objective models to predict these important attributes.
Head movement has been shown to significantly improve localisation response accuracy in elevation. It is unclear from previous research whether this is due to static cues created once the head has reached a new stationary position or dynamic cues created through the act of moving the head. In this experiment listeners were asked to report the location of loudspeakers placed on vertical planes at four different azimuth angles (0°, 36°, 72°, 108°) with no head movement. Static elevation response accuracy was significantly more accurate for sources away from the median plane. This finding, combined with the statement that listeners orient to face the source when localising, suggests that dynamic cues are the cause of improved localisation through head movement.
The International Telecommunication Union Radiocommunication Sector (ITU-R) Recommendation BS.1770 for measuring perceived loudness in multichannel content has become the de facto standard in the audio industry, and the de jure standard for digital broadcasting. Although its frequency weighting accounts for the acoustic effects of the head, the model is insensitive to source-listener distance, an important localisation cue for sound objects. Subjective tests were conducted to investigate the effect of perceived auditory distance on the loudness evoked by noise, speech, music and environmental sounds. Based on the variations found, an adaptation of the ITU-R algorithm is proposed and evaluated against the listeners' judgements. The resulting loudness-level differences fell within the confidence intervals of the level differences indicated by the listeners at the source-listener distances most common in living rooms.
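For reference, the final integration stage of BS.1770 combines the per-channel mean squares z_i of the K-weighted channel signals with channel weights G_i as L = −0.691 + 10·log10(Σ G_i·z_i), in LKFS. The sketch below shows only that stage; the K-weighting filters and the gating introduced in later revisions are omitted, and the function name is an assumption.

```python
import numpy as np

def bs1770_integrate(channel_mean_squares, weights=None):
    """ITU-R BS.1770 integration stage: loudness in LKFS from the
    per-channel mean square z_i of the K-weighted channel signals,
    L = -0.691 + 10 * log10(sum_i G_i * z_i).
    K-weighting filtering and gating are omitted from this sketch."""
    z = np.asarray(channel_mean_squares, dtype=float)
    g = np.ones_like(z) if weights is None else np.asarray(weights, dtype=float)
    return -0.691 + 10.0 * np.log10(np.sum(g * z))
```

A distance-sensitive adaptation of the kind proposed could, for instance, apply a distance-dependent gain to each channel's z_i before this summation.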
There are many spatial audio reproduction systems currently in domestic use (e.g. mono, stereo, surround sound, sound bars, and headphones). In an experiment, pairwise preference magnitude ratings for a range of such systems were collected from trained and untrained listeners. The ratings were analysed using internal preference mapping to: (i) uncover the principal perceptual dimensions of listener preference; (ii) label the dimensions based on the important perceptual attributes; and (iii) observe differences between trained and untrained listeners. To aid with labelling the dimensions, perceptual attributes were elicited alongside the preference ratings and were analysed by: (i) considering a metric derived from the frequency of use of each attribute and the magnitude of the related preference judgements; and (ii) observing attribute use for comparisons between specific methods. The first preference dimension accounted for the vast majority of the variance in ratings; it was related to multiple important attributes, including those associated with spatial capability and freedom from distortion. All participants exhibited a preference for reproduction methods that were positively correlated with the first dimension (most notably 5-, 9-, and 22-channel surround sound). The second dimension accounted for only a very small proportion of the variance, and appeared to separate the headphone method from the other methods. The trained and untrained listeners generally showed opposite preferences in the second dimension, suggesting that trained listeners have a higher preference for headphone reproduction than untrained listeners.
In order to take head movement into account in objective evaluation of perceived spatial impression (including source direction), a suitable binaural capture device is required. A signal capture system was suggested that consisted of a head-sized sphere containing multiple pairs of microphones which, in comparison to a rotating head and torso simulator (HATS), has the potential for improved measurement speed and the capability to measure time varying systems, albeit at the expense of some accuracy. The error introduced by using a relatively simple sphere compared to a more physically accurate HATS was evaluated in terms of three binaural parameters related to perceived spatial impression – interaural time and level differences (ITD and ILD) and interaural cross-correlation coefficient (IACC). It was found that whilst the error in the IACC measurements was perceptually negligible when the sphere was mounted on a torso, the differences in measured ITD and ILD values between the sphere-with-torso and HATS were not perceptually negligible. However, it was found that the sphere-with-torso could give accurate predictions of source location based on ITD and ILD, through the use of a look-up table created from known ITD-ILD-direction mappings. Therefore the validity of the multi-microphone sphere-with-torso as a binaural signal capture device for perceptually relevant measurements of source direction (based on ITD and ILD) and spatial impression (based on IACC) was demonstrated.
This paper presents some preliminary results from an ongoing study into methods for the training of listeners in subjective evaluation of spatial sound reproduction. Exemplary stimuli were created illustrating two spatial attributes: individual source width and source distance. Changes in each of the two attributes were highly controlled in an attempt to allow uni-dimensional variation of their perceptual effects. The stimuli were validated with the help of an experienced listening panel and then used to instruct naïve listeners. By comparing the listeners' performances at ranking a number of stimuli before and after the training sessions the effectiveness of the adopted method was quantified.
Hardness is the most commonly searched timbral attribute within freesound.org, a commonly used online sound effect repository. A perceptual model of hardness was developed to enable the automatic generation of metadata to facilitate hardness-based filtering or sorting of search results. A training dataset was collected of 202 stimuli with 32 sound source types, and perceived hardness was assessed by a panel of listeners. A multilinear regression model was developed on six features: maximum bandwidth, attack centroid, midband level, percussive-to-harmonic ratio, onset strength, and log attack time. This model predicted the hardness of the training data with R2 = 0.76. It predicted hardness within a new dataset with R2 = 0.57, and predicted the rank order of individual sources perfectly, after accounting for the subjective variance of the ratings. Its performance exceeded that of human listeners.
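The multilinear form of such a model is simply an intercept plus a weighted sum of the six features, fitted by ordinary least squares. The sketch below illustrates that fitting step on synthetic data; the function names, feature values, and coefficients are illustrative assumptions, not the published dataset or model.

```python
import numpy as np

def fit_multilinear(features, ratings):
    """Ordinary least-squares fit of rating ~ intercept + w . features.
    Returns the coefficient vector [intercept, w1, ..., wk]."""
    X = np.hstack([np.ones((features.shape[0], 1)), features])
    coef, *_ = np.linalg.lstsq(X, ratings, rcond=None)
    return coef

def predict_multilinear(coef, features):
    """Predict ratings for new feature rows with a fitted model."""
    X = np.hstack([np.ones((features.shape[0], 1)), features])
    return X @ coef
```

In practice the features would be the six acoustic measures named above (maximum bandwidth, attack centroid, midband level, percussive-to-harmonic ratio, onset strength, log attack time), each computed from the audio, and the ratings the panel's hardness scores.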
Previously-obtained data, quantifying the degree of quality degradation resulting from a range of spatial audio processes (SAPs), can be used to build a regression model of perceived spatial audio quality in terms of previously developed spatially and timbrally relevant metrics. A generalizable model thus built, employing just five metrics and two principal components, performs well in its prediction of the quality of a range of program types degraded by a multitude of SAPs commonly encountered in consumer audio reproduction, auditioned at both central and off-center listening positions. Such a model can provide a correlation to listening test data of r = 0.89, with a root mean square error (RMSE) of 11%, making its performance comparable to that of previous audio quality models and making it a suitable core for an artificial-listener-based spatial audio quality evaluation system.
Spectrum is an important factor in determining timbral clarity. An experiment where listeners rate the changes in timbral clarity resulting from spectral equalisation (EQ) can provide insight into the relationship between EQ and the clarity of string instruments. Overall, higher frequencies contribute to clarity more positively than lower ones, but the relationship is programme-item-dependent. Fundamental frequency and spectral slope both appear to be important. Change in harmonic centroid (or dimensionless spectral centroid) correlates well with change in clarity, more so than octave band boosted/cut, harmonic number boosted/cut, or other variations on the spectral centroid.
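The harmonic centroid referred to here is the spectral centroid expressed in units of harmonic number rather than Hz, making it dimensionless. A minimal sketch, assuming amplitude weighting (power weighting is a common alternative) and an illustrative function name:

```python
import numpy as np

def harmonic_centroid(harmonic_amps):
    """Amplitude-weighted mean harmonic number of a harmonic tone.
    For a purely harmonic tone this equals the Hz-valued spectral
    centroid divided by the fundamental frequency, hence dimensionless."""
    a = np.asarray(harmonic_amps, dtype=float)
    n = np.arange(1, len(a) + 1)
    return float(np.sum(n * a) / np.sum(a))
```

Boosting upper harmonics raises this centroid, and the study found changes in it to track the rated changes in clarity.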
Collection of text data is an integral part of descriptive analysis, a method commonly used in audio quality evaluation experiments. Where large text data sets will be presented to a panel of human assessors (e.g., to group responses that have the same meaning), it is desirable to reduce redundancy as much as possible in advance. Text clustering algorithms have been used to achieve such a reduction. A text clustering algorithm was tested on a dataset for which manual annotation by two experts was also collected. The comparison between the manual annotations and automatically-generated clusters enabled evaluation of the algorithm. While the algorithm could not match human performance, it could produce a similar grouping with a significant redundancy reduction (approximately 48%).
Measurements of the spatial attributes of auditory environments or sound reproduction systems commonly consider only a single receiver position. However, it is known that humans make use of head movement to help make sense of auditory scenes, especially when the physical cues are ambiguous. Results are summarised from a three-year research project which aimed to develop a practical binaural-based measurement system that takes head movements into account. Firstly, the head movements made by listeners in various situations were investigated; this showed that a wide range of head movements are made when evaluating source width and envelopment, and that minimal head movements are made when evaluating timbre. Secondly, the effect of using a simplified sphere model containing two microphones instead of a head and torso simulator was evaluated, and methods were derived to minimise the errors in measured cues for spatial perception that were caused by the simplification of the model. Finally, the results of the two earlier stages were combined to create a multi-microphone sphere that can be used to measure spatial attributes incorporating head movements in a perceptually-relevant manner, and which allows practical and rapid measurements to be made.
A multiple-microphone-sphere-based localisation model has been developed that predicts source location by modelling the cues given by head movement. In order to inform improvements to this model, a series of experiments was devised to investigate the impact of head movement cues on the localisation response accuracy of human listeners. It was shown that head movements improve elevation localisation response accuracy for noise sources. When pinna cues are impaired the significance of head movement cues increases. The improved localisation resulting from head movement is due to dynamic cues available during the period of movement, and not to improved static cues available once the head is turned to face the sound source. Head movements improve elevation localisation to a similar degree for band-limited sources with differing centre frequencies (500 Hz, 2 kHz and 6 kHz), which indicates that both dynamic ILDs and dynamic ITDs are used. Head movements do not improve elevation response accuracy for programme items with less than an octave bandwidth. Head movements improve elevation response accuracy to a greater degree for sources further away from the equatorial plane.
Signal-processing algorithms that are meant to evoke a certain subjective effect often have to be perceptually equalized so that any unwanted artifacts are, as far as possible, eliminated. They can then be said to exhibit “unidimensionality of perceived variation.” Aiming to design a method that allows unidimensionality of perceived variation to be verified, established sensory evaluation approaches are examined in terms of their suitability for detailed, undistorted profiling and hence reliable validation of an algorithm’s subjective effects. It is found that a procedure combining multidimensional scaling with supplementary verbal elicitation constitutes the most appropriate approach. In the context of validating a signal-processing method intended to produce a specific spatial effect, this procedure is evaluated and some shortcomings are identified. However, following refinements, it is concluded that these can be overcome through additional data collection and analysis, resulting in a multistage hybrid validation technique.
Loudness measurements are often necessary in psychoacoustic research and legally required in broadcasting. However, existing loudness models have not been widely tested with new multichannel audio systems. A trained listening panel used the method of adjustment to balance the loudnesses of eight reproduction methods: low-quality mono, mono, stereo, 5-channel, 9-channel, 22-channel, ambisonic cuboid, and headphones. Seven programme items were used, including music, sport, and a film soundtrack. The results were used to test loudness models including simple energy-based metrics, variants of ITU-R BS.1770, and complex psychoacoustically motivated models. The mean differences between the perceptual results and model predictions were statistically insignificant for all but the simplest model. However, some weaknesses in the model predictions were highlighted.
This paper reports recent progress towards the development of a spatial ear trainer. A study into the perceptual construct of 'ensemble width' (i.e. the lateral spacing of the outer sources contained within an auditory scene) was conducted. With the help of a novel surround panner, exemplary stimuli were created. Changes were highly controlled to enable unidimensional variation of the intended qualitative effect. To assess the success of the simulation, a subjective experiment was designed based on Multidimensional Scaling (MDS) techniques and completed by an experienced listening panel. Additional verbal and non-verbal data were collected so as to facilitate analysis of the perceptual (MDS) space. Results show that unidimensionality was achieved, thus suggesting the stimuli to be suitable for training purposes.
The ITU-R BS.1770 multichannel loudness algorithm performs a sum of channel energies with weighting coefficients based on azimuth and elevation angles of arrival of the audio signal. In its current version, these coefficients were estimated based on binaural summation gains and not on subjective directional loudness. Also, the algorithm lacks directional weights for wider elevation angles (|φ| > 30°). A listening test with broadband stimuli was conducted to collect subjective data on directional effects. The results were used to calculate a new set of directional weights. A modified version of the loudness algorithm with these estimated weights was tested against its benchmark using the collected data, and using program material rendered to reproduction systems with different loudspeaker configurations. The modified algorithm performed better than the benchmark, particularly with reproduction systems with more loudspeakers positioned out of the horizontal plane.
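The weighted energy summation at the core of the algorithm can be sketched as follows. This is an illustration only: BS.1770's K pre-filtering and gating stages are omitted, so the values are not true LKFS, and the toy channel layout is an assumption for the example.

```python
import numpy as np

def bs1770_style_loudness(channels, weights):
    """Weighted channel-energy sum after the pattern of ITU-R BS.1770.
    The K pre-filter and gating stages are omitted for brevity, so the
    result is illustrative rather than true LKFS."""
    energies = [np.mean(np.asarray(x) ** 2) for x in channels]
    total = sum(w * e for w, e in zip(weights, energies))
    return -0.691 + 10.0 * np.log10(total)

# Toy example: five identical noise channels.
rng = np.random.default_rng(1)
x = rng.normal(scale=0.1, size=48000)
chans = [x] * 5

w_flat = [1.0] * 5                        # no directional weighting
w_1770 = [1.0, 1.0, 1.0, 1.41, 1.41]      # surrounds weighted as in BS.1770

L_flat = bs1770_style_loudness(chans, w_flat)
L_1770 = bs1770_style_loudness(chans, w_1770)
```

Extending the algorithm to elevated loudspeakers amounts to supplying additional entries in the weight vector, which is exactly what the directional-weight estimation described above provides.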
Automated separation of the constituent signals of complex mixtures of sound has made significant progress over the last two decades. Unfortunately, completing this task in real rooms, where echoes and reverberation are prevalent, continues to present a significant challenge. Conversely, humans demonstrate a remarkable robustness to reverberation. An overview is given of a project that set out to model some of the aspects of human auditory perception in order to improve the efficacy of machine sound source separation in real rooms. Using this approach, the models that were developed achieved a significant improvement in separation performance. The project also showed that existing models of human auditory perception are markedly incomplete, and work is currently being undertaken to model additional aspects that had previously been neglected. Work completed so far has shown that an even greater improvement in separation performance will be possible. The work could have many applications, including intelligent hearing aids and intelligent security cameras, and could be incorporated into many other products that perform automated listening tasks, such as speech recognition, speech enhancement, noise reduction and medical transcription.
This research incorporates the nature of head movement made in listening activities into the development of a quasi-binaural acoustical measurement technique for the evaluation of spatial impression. A listening test was conducted where head movements were tracked whilst the subjects rated the perceived source width, envelopment, source direction and timbre of a number of stimuli. It was found that the extent of head movements was larger when evaluating source width and envelopment than when evaluating source direction and timbre. It was also found that the locus of ear positions corresponding to these head movements formed a bounded sloped path, higher towards the rear and lower towards the front. This led to the concept of a signal capture device comprising a torso-mounted sphere with multiple microphones. A prototype was constructed and used to measure three binaural parameters related to perceived spatial impression - interaural time and level differences (ITD and ILD) and interaural cross-correlation coefficient (IACC). Comparison of the prototype measurements to those made with a rotating Head and Torso Simulator (HATS) showed that the prototype could be perceptually accurate for the prediction of source direction using ITD and ILD, and for the prediction of perceived spatial impression using IACC. Further investigation into parameter derivation and interpolation methods indicated that 21 pairs of discretely spaced microphones were sufficient to measure the three binaural parameters across the sloped range of ear positions identified in the listening test.
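Of the three binaural parameters, the IACC can be sketched using its conventional definition: the maximum of the normalised cross-correlation of the two ear signals over lags of ±1 ms. This is an illustrative implementation on synthetic signals, not the measurement system itself:

```python
import numpy as np

def iacc(left, right, fs, max_lag_ms=1.0):
    """Interaural cross-correlation coefficient: maximum of the normalised
    cross-correlation of the two ear signals over lags of +/-1 ms."""
    max_lag = int(fs * max_lag_ms / 1000.0)
    norm = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2))
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            c = np.dot(left[lag:], right[:len(right) - lag])
        else:
            c = np.dot(left[:lag], right[-lag:])
        best = max(best, abs(c) / norm)
    return best

# Identical ear signals give IACC of 1; independent noise gives a low value.
fs = 8000
rng = np.random.default_rng(2)
coherent = rng.normal(size=4000)
iacc_same = iacc(coherent, coherent, fs)
iacc_diff = iacc(coherent, rng.normal(size=4000), fs)
```

Low IACC is conventionally associated with greater perceived spatial impression, which is why it serves as the spatial-impression predictor in the comparison described above.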
A new publicly available dataset of microphone impulse responses (IRs) has been generated. The dataset covers 25 microphones, including a Class-1 measurement microphone, plus polar pattern variations for 7 of the microphones. Microphones were included having: omnidirectional, cardioid, supercardioid and bidirectional polar patterns; condenser, moving-coil and ribbon transduction types; single and dual diaphragms; multiple body and head basket shapes; small and large diaphragms; and end-address and side-address designs. Using a custom-developed computer-controlled precision turntable, IRs were captured quasi-anechoically at incident angles from 0° to 355° in steps of 5°, and at source-to-microphone distances of 0.5 m, 1.25 m and 5 m. The resulting dataset is suitable for perceptual and objective studies related to the incident-angle-dependent response of microphones, as well as for the development of tools for predicting and emulating on- and off-axis microphone characteristics. The captured IRs allow generation of frequency response plots with a degree of detail not commonly available in manufacturer-supplied data sheets, and are also particularly well suited to harmonic distortion analysis.
Binary masking is a common technique for separating target audio from an interferer. Its use is often justified by the high signal-to-noise ratio achieved. However, the mask can introduce musical noise artefacts, limiting its perceptual performance and that of techniques that use it. Three mask-processing techniques, involving adding noise or cepstral smoothing, are tested and the processed masks are compared to the ideal binary mask using the perceptual evaluation for audio source separation (PEASS) toolkit. Each processing technique's parameters are optimised before the comparison is made. Each technique is found to improve the overall perceptual score of the separation. Results show a trade-off between interferer suppression and artefact reduction.
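For reference, the ideal binary mask that the processed masks are compared against can be sketched as follows. This is a minimal illustration with a 0 dB local criterion; the toy spectrogram values are invented for the example:

```python
import numpy as np

def ideal_binary_mask(target_mag, interferer_mag, lc_db=0.0):
    """Ideal binary mask: keep a time-frequency cell when the target's
    magnitude exceeds the interferer's by the local criterion (in dB)."""
    ratio_db = 20.0 * np.log10(target_mag / np.maximum(interferer_mag, 1e-12))
    return (ratio_db > lc_db).astype(float)

# Toy magnitude spectrograms (frequency bins x time frames).
target = np.array([[2.0, 0.1],
                   [1.0, 3.0]])
interferer = np.array([[1.0, 1.0],
                       [2.0, 0.5]])

mask = ideal_binary_mask(target, interferer)
separated = mask * (target + interferer)   # mask applied to a toy additive mixture
```

The all-or-nothing cell selection is precisely what creates the isolated spectro-temporal fragments heard as musical noise, which the mask-processing techniques above attempt to mitigate.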
Interest in spatial audio has increased due to the availability of multichannel reproduction systems for the home and car. Various timbral ear training systems have been presented, but relatively little work has been carried out into training in spatial attributes of reproduced sound. To demonstrate that such a training system is truly useful, it is necessary to show that learned skills are transferable to different settings. Issues relating to the transfer of training are examined; a recent study conducted by the authors is discussed in relation to the level of transfer shown by participants, and a new study is proposed that aims to optimise the transfer of training to different environments.
At the University of Surrey (Guildford, UK), we have brought together research groups in different disciplines, with a shared interest in audio, to work on a range of collaborative research projects. In the Centre for Vision, Speech and Signal Processing (CVSSP) we focus on technologies for machine perception of audio scenes; in the Institute of Sound Recording (IoSR) we focus on research into human perception of audio quality; the Digital World Research Centre (DWRC) focusses on the design of digital technologies; while the Centre for Digital Economy (CoDE) focusses on new business models enabled by digital technology. This interdisciplinary view, across different traditional academic departments and faculties, allows us to undertake projects which would be impossible for a single research group. In this poster we will present an overview of some of these interdisciplinary projects, including projects in spatial audio, sound scene and event analysis, and creative commons audio.
Computational auditory models that predict the perceived location of sound sources in terms of azimuth are already available, yet little has been done to predict perceived elevation. Interaural time and level differences, the primary cues in horizontal localisation, do not resolve source elevation, resulting in the ‘Cone of Confusion’. In natural listening, listeners can make head movements to resolve such confusion. To mimic the dynamic cues provided by head movements, a multiple microphone sphere was created, and a hearing model was developed to predict source elevation from the signals captured by the sphere. The prototype sphere and hearing model proved effective in both horizontal and vertical localisation. The next stage of this research will be to rigorously test a more physiologically accurate capture device.
The challenge of installing and setting up dedicated spatial audio systems can make it difficult to deliver immersive listening experiences to the general public. However, the proliferation of smart mobile devices and the rise of the Internet of Things mean that there are increasing numbers of connected devices capable of producing audio in the home. "Media device orchestration" (MDO) is the concept of utilizing an ad hoc set of devices to deliver or augment a media experience. In this paper, the concept is evaluated by implementing MDO for augmented spatial audio reproduction using object-based audio with semantic metadata. A thematic analysis of positive and negative listener comments about the system revealed three main categories of response: perceptual, technical, and content-dependent aspects. MDO performed particularly well in terms of immersion/envelopment, but the quality of listening experience was partly dependent on loudspeaker quality and listener position. Suggestions for further development based on these categories are given.
A new method is presented for examining the spatial attributes of a sound recorded within a room. A binaural recording is converted into a running representation of instantaneous lateral angle. This conversion is performed in a way that is influenced strongly by the workings of the human auditory system. Auditory onset detection takes place alongside the lateral angle conversion. These routines are combined to form a powerful analytical tool for examining the spatial features of the binaural recording. Exemplary signals are processed and discussed in this paper. Further work will be required to validate the system, and to compare it against existing auditory analysis techniques.
Auralisation is the process of rendering virtual sound fields. It is used in areas including: acoustic design, defence, gaming and audio research. As part of a PhD project concerned with the influence of loudspeaker directivity on the perception of reproduced sound, a fully-computed auralisation system has been developed. For this, acoustic modelling software is used to synthesise and extract binaural impulse responses of virtual rooms. The resulting audio is played over headphones and allows listeners to experience the excerpt being reproduced within the synthesised environment. The main advance with this system is that impulse responses are calculated for a number of head positions, which allows the listeners to move when listening to the recreated sounds. This allows for a much more realistic simulation, and makes it especially useful for conducting subjective experiments on sound reproduction systems and/or acoustical environments which are either unavailable or impractical to create. Hence, it greatly increases the range and type of experiments that can be undertaken at Surrey. The main components of the system are described, together with the results from a validation experiment which demonstrate that this system provides similar results to experiments conducted previously using loudspeakers in an anechoic chamber.
Brightness is one of the most common timbral descriptors used for searching audio databases, and is also the timbral attribute of recorded sound that is most affected by microphone choice, making a brightness prediction model desirable for automatic metadata generation. A model, sensitive to microphone-related as well as source-related brightness, was developed based on a novel combination of the spectral centroid and the ratio of the total magnitude of the signal above 500 Hz to that of the full signal. This model performed well on training data (r = 0.922). Validating it on new data showed a slight gradient error but good linear correlation across source types and overall (r = 0.955). On both training and validation data, the new model outperformed metrics previously used for brightness prediction.
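The two signal measures named above can be sketched as follows. How the published model weights and combines them is not reproduced here, and the toy spectra are invented for illustration:

```python
import numpy as np

def spectral_centroid(mags, freqs):
    """Amplitude-weighted mean frequency of a magnitude spectrum (Hz)."""
    return np.sum(freqs * mags) / np.sum(mags)

def high_ratio(mags, freqs, cutoff=500.0):
    """Ratio of total spectral magnitude above the cutoff to that of the
    full spectrum (a value between 0 and 1)."""
    return np.sum(mags[freqs > cutoff]) / np.sum(mags)

# Toy spectra: a dull sound dominated by low partials, and a bright one
# dominated by high partials.
freqs = np.array([100.0, 300.0, 1000.0, 3000.0])
dull = np.array([1.0, 1.0, 0.1, 0.01])
bright = np.array([0.1, 0.2, 1.0, 1.0])

sc_dull, sc_bright = spectral_centroid(dull, freqs), spectral_centroid(bright, freqs)
hr_dull, hr_bright = high_ratio(dull, freqs), high_ratio(bright, freqs)
```

A brighter spectrum scores higher on both measures; combining them lets the model separate source-related brightness from the high-frequency emphasis or roll-off introduced by a microphone.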