Professor John Illingworth
Academic and research departments
About
Biography
John Illingworth originally hails from Bishop Auckland, County Durham.
Educated at the Universities of Birmingham (BSc in Physics) and Oxford (DPhil in Experimental Particle Physics).
Member of academic staff at the University of Surrey since 1986. Currently Professor of Machine Vision.
Publications
The field of Action Recognition has seen a large increase in activity in recent years. Much of the progress has been through incorporating ideas from single-frame object recognition and adapting them for temporal-based action recognition. Inspired by the success of interest points in the 2D spatial domain, their 3D (space-time) counterparts typically form the basic components used to describe actions, and in action recognition the features used are often engineered to fire sparsely. This is to ensure that the problem is tractable; however, this can sacrifice recognition accuracy as it cannot be assumed that the optimum features in terms of class discrimination are obtained from this approach. In contrast, we propose to initially use an overcomplete set of simple 2D corners in both space and time. These are grouped spatially and temporally using a hierarchical process, with an increasing search area. At each stage of the hierarchy, the most distinctive and descriptive features are learned efficiently through data mining. This allows large amounts of data to be searched for frequently reoccurring patterns of features. At each level of the hierarchy, the mined compound features become more complex, discriminative, and sparse. This results in fast, accurate recognition with real-time performance on high-resolution video. As the compound features are constructed and selected based upon their ability to discriminate, their speed and accuracy increase at each level of the hierarchy. The approach is tested on four state-of-the-art data sets: the popular KTH data set, to provide a comparison with other state-of-the-art approaches; the Multi-KTH data set, to illustrate performance at simultaneous multiaction classification, despite no explicit localization information being provided during training; and finally the recent Hollywood and Hollywood2 data sets, which provide challenging complex actions taken from commercial movie sequences.
For all four data sets, the proposed hierarchical approach outperforms all other methods reported thus far in the literature and can achieve real-time operation.
In this chapter, we present a generic classifier for detecting spatio-temporal interest points within video, the premise being that, given an interest point detector, we can learn a classifier that duplicates its functionality and which is both accurate and computationally efficient. This means that interest point detection can be achieved independent of the complexity of the original interest point formulation. We extend the naive Bayesian classifier of Randomised Ferns to the spatio-temporal domain and learn classifiers that duplicate the functionality of common spatio-temporal interest point detectors. Results demonstrate accurate reproduction of results with a classifier that can be applied exhaustively to video at frame-rate, without optimisation, in a scanning window approach. © 2010, IGI Global.
This paper presents a generic method for recognising and localising human actions in video based solely on the distribution of interest points. The use of local interest points has shown promising results in both object and action recognition. While previous methods classify actions based on the appearance and/or motion of these points, we hypothesise that the distribution of interest points alone contains the majority of the discriminatory information. Motivated by its recent success in rapidly detecting 2D interest points, the semi-naive Bayesian classification method of Randomised Ferns is employed. Given a set of interest points within the boundaries of an action, the generic classifier learns the spatial and temporal distributions of those interest points. This is done efficiently by comparing sums of responses of interest points detected within randomly positioned spatio-temporal blocks within the action boundaries. We present results on the largest and most popular human action dataset using a number of interest point detectors, and demonstrate that the distribution of interest points alone can perform as well as approaches that rely upon the appearance of the interest points.
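The semi-naive Bayesian fern idea used in the two abstracts above can be sketched in a few lines. This is an illustrative reconstruction, not the published implementation: plain feature vectors stand in for spatio-temporal block responses, and the class, fern, and bit counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

class FernClassifier:
    """Semi-naive Bayesian classifier: groups of binary tests ('ferns')
    are modelled jointly; independence is assumed only between ferns."""

    def __init__(self, n_ferns, n_bits, n_classes, n_features):
        self.n_ferns, self.n_bits = n_ferns, n_bits
        # each fern bit compares responses at a randomly chosen feature pair
        self.pairs = rng.integers(0, n_features, size=(n_ferns, n_bits, 2))
        # smoothed outcome counts, indexed (fern, outcome, class)
        self.counts = np.ones((n_ferns, 2 ** n_bits, n_classes))

    def _outcomes(self, x):
        # one integer outcome per fern, built from its binary comparisons
        bits = x[self.pairs[..., 0]] > x[self.pairs[..., 1]]
        return bits.dot(1 << np.arange(self.n_bits))

    def train(self, x, label):
        self.counts[np.arange(self.n_ferns), self._outcomes(x), label] += 1

    def predict(self, x):
        # P(outcome | class) per fern; classify by summed log-likelihood
        p = self.counts / self.counts.sum(axis=1, keepdims=True)
        logp = np.log(p[np.arange(self.n_ferns), self._outcomes(x)])
        return int(np.argmax(logp.sum(axis=0)))
```

Training increments outcome counts per class; prediction multiplies per-fern posteriors, which is why the classifier stays cheap enough to scan exhaustively over video.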
The use of sparse invariant features to recognise classes of actions or objects has become common in the literature. However, features are often "engineered" to be both sparse and invariant to transformation, and it is assumed that they provide the greatest discriminative information. To tackle activity recognition, we propose learning compound features that are assembled from simple 2D corners in both space and time. Each corner is encoded in relation to its neighbours and, from an overcomplete set (in excess of 1 million possible features), compound features are extracted using data mining. The final classifier, consisting of sets of compound features, can then be applied to recognise and localise an activity in real-time while providing superior performance to other state-of-the-art approaches (including those based upon sparse feature detectors). Furthermore, the approach requires only weak supervision in the form of class labels for each training sequence. No ground truth position or temporal alignment is required during training.
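The data-mining step, searching an overcomplete feature set for frequently reoccurring combinations, can be illustrated with a minimal Apriori-style miner. The transaction encoding and support threshold here are hypothetical stand-ins for the encoded corner neighbourhoods used in the paper.

```python
from itertools import combinations
from collections import Counter

def mine_frequent(transactions, min_support, max_size=3):
    """Apriori-style search for feature combinations that recur in at
    least `min_support` transactions (candidate compound features)."""
    items = {i for t in transactions for i in t}
    frequent = {}
    current = [frozenset([i]) for i in sorted(items)]
    size = 1
    while current and size <= max_size:
        counts = Counter()
        for t in transactions:
            tset = set(t)
            for cand in current:
                if cand <= tset:
                    counts[cand] += 1
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # grow surviving candidates by one item for the next level
        keys = list(level)
        current = {a | b for a, b in combinations(keys, 2) if len(a | b) == size + 1}
        size += 1
    return frequent
```

Each level only extends combinations that already met the support threshold, which is what keeps the search over millions of possible compound features tractable.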
There is a clear trend in the use of robots to accomplish services that can help humans. In this paper, robots acting in urban environments are considered for the task of person guiding. Nowadays, it is common to have ubiquitous sensors integrated within the buildings, such as camera networks, and wireless communications like 3G or WiFi. Such infrastructure can be directly used by robotic platforms. The paper shows how combining the information from the robots and the sensors allows tracking failures to be overcome, by being more robust under occlusion, clutter, and lighting changes. The paper describes the algorithms for tracking with a set of fixed surveillance cameras and the algorithms for position tracking using the signal strength received by a wireless sensor network (WSN). Moreover, an algorithm to obtain estimations on the positions of people from cameras on board robots is described. The estimates from all these sources are then combined using a decentralized data fusion algorithm to provide an increase in performance. This scheme is scalable and can handle communication latencies and failures. We present results of the system operating in real time on a large outdoor environment, including 22 nonoverlapping cameras, WSN, and several robots. © Institut Mines-Télécom and Springer-Verlag 2012.
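The decentralized fusion of position estimates from cameras, WSN and robots can be illustrated with a standard information-form combination of independent Gaussian estimates. This is a textbook sketch, not the paper's actual algorithm, which additionally handles latency and communication failures.

```python
import numpy as np

def fuse_estimates(means, covariances):
    """Fuse independent Gaussian position estimates in information form:
    information matrices (inverse covariances) simply add, so each sensor
    can contribute its estimate without a central coordinator."""
    info = sum(np.linalg.inv(np.asarray(c)) for c in covariances)
    info_mean = sum(np.linalg.inv(np.asarray(c)) @ np.asarray(m)
                    for m, c in zip(means, covariances))
    cov = np.linalg.inv(info)
    return cov @ info_mean, cov
```

Because the fused result depends only on sums of per-sensor terms, the sensors can exchange and accumulate those terms in any order, which is the property decentralized schemes exploit.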
Within the field of action recognition, features and descriptors are often engineered to be sparse and invariant to transformation. While sparsity makes the problem tractable, it is not necessarily optimal in terms of class separability and classification. This paper proposes a novel approach that uses very dense corner features that are spatially and temporally grouped in a hierarchical process to produce an overcomplete compound feature set. Frequently reoccurring patterns of features are then found through data mining, designed for use with large data sets. The novel use of the hierarchical classifier allows real-time operation, while the approach is demonstrated to handle camera motion, scale, human appearance variations, occlusions and background clutter. The classification performance outperforms other state-of-the-art action recognition algorithms on three datasets: KTH, multi-KTH, and Hollywood. Multiple action localisation is performed without groundtruth localisation data, using only weak supervision of class labels for each training sequence. The Hollywood dataset contains complex realistic actions from movies; the approach outperforms the published accuracy on this dataset and also achieves real-time performance. ©2009 IEEE.
Often within the field of tracking people, only fixed cameras are used. This can mean that when the illumination of the image changes or object occlusion occurs, the tracking can fail. We propose an approach that uses three simultaneous separate sensors. The fixed surveillance cameras track objects of interest across cameras by incrementally learning relationships between regions of the image. Cameras and laser rangefinder sensors onboard robots also provide an estimate of the person's position. Moreover, the signal strength of mobile devices carried by the person can be used to estimate his position. The estimates from all these sources are then combined using data fusion to provide an increase in performance. We present results of the fixed-camera-based tracking operating in real time on a large outdoor environment of over 20 non-overlapping cameras. Moreover, the tracking algorithms for robots and wireless nodes are described. A decentralized data fusion algorithm for combining all this information is presented.
This study proposes a new algorithm for cylinder and conic surface extraction. The algorithm exploits pairs of surface patches to generate potential curved surface parameters, which are in turn clustered using an unsupervised technique. This algorithm has the desirable property of being able to work with sparse, as well as dense, depth data, and avoids any restrictive assumption that the data is presented in a 2D image format. It is shown that the proposed algorithm is successful even for quite complicated depth images.
Helmholtz Stereopsis (HS) has recently been explored as a promising technique for capturing shape of objects with unknown reflectance. So far, it has been widely applied to objects of smooth geometry and piecewise uniform Bidirectional Reflectance Distribution Function (BRDF). Moreover, for nonconvex surfaces the inter-reflection effects have been completely neglected. We extend the method to surfaces which exhibit strong texture, nontrivial geometry and are possibly nonconvex. The problem associated with these surface features is that Helmholtz reciprocity is apparently violated when point-based measurements are used independently to establish the matching constraint as in the standard HS implementation. We argue that the problem is avoided by computing radiance measurements on image regions corresponding exactly to projections of the same surface point neighbourhood with appropriate scale. The experimental results demonstrate the success of the novel method proposed on real objects.
We address the problem of reliable real-time 3D-tracking of multiple objects which are observed in multiple wide-baseline camera views. Establishing the spatio-temporal correspondence is a problem with combinatorial complexity in the number of objects and views. In addition, vision-based tracking suffers from the ambiguities introduced by occlusion, clutter and irregular 3D motion. We present a discrete relaxation algorithm for reducing the intrinsic combinatorial complexity by pruning the decision tree based on unreliable prior information from independent 2D-tracking for each view. The algorithm improves the reliability of spatio-temporal correspondence by simultaneous optimisation over multiple views in the case where 2D-tracking in one or more views is ambiguous. Application to the 3D reconstruction of human movement, based on tracking of skin-coloured regions in three views, demonstrates considerable improvement in reliability and performance. The results demonstrate that the optimisation over multiple views gives correct 3D reconstruction and object labeling in the presence of incorrect 2D-tracking whilst maintaining real-time performance.
In this paper the problem of image feature extraction is considered with emphasis on developing methods which are resilient in the presence of data contamination. The issue of robustness of estimation procedures has received considerable attention in the statistics community [1-3] but its results are only recently being applied to specific image analysis tasks [4-7]. In this paper we show how the design of robust methods applies to image description tasks posed within a statistical hypothesis testing and parameter estimation framework. The methodology is illustrated by applying it to finding robust, optimal estimation kernels for line detection and edge detection. We then discuss the relationship of these optimal solutions to both the well established Hough Transform technique and the standard estimation kernels developed in the statistics literature. The application of standard robust kernels to image analysis tasks is illustrated by two examples which involve circular arc detection in gray-level imagery and planar surface segmentation in depth data. Robust methods are found to be effective general tools for generating 2D and 3D image descriptions. © 1994 J.C. Baltzer AG, Science Publishers.
In this paper we present a strategy for the problem of exploring an unknown 2D environment. Existing techniques can be methodical, goal oriented or non-reactive to additional knowledge received at each new viewpoint. We present an approach which is not goal driven, but rather seeks new unseen areas to view and explore. The novelty of the strategy presented is the use of a view-improvement technique along with an optimal viewpoint planning method for the calculation and selection of the next-best-viewpoint. The strategy is designed for a sensor system with a limited field-of-view. Example explorations are presented and we demonstrate that the strategy finds new areas to view without exhaustive searching.
Machine Learning for Human Motion Analysis: Theory and Practice highlights the development of robust and effective vision-based motion understanding systems.
A new surface based approach to implicit surface polygonisation is introduced. This is applied to the reconstruction of 3D surface models of complex objects from multiple range images. Geometric fusion of multiple range images into an implicit surface representation was presented in previous work. This paper introduces an efficient algorithm to reconstruct a triangulated model of a manifold implicit surface: a local 3D constraint is derived which defines the Delaunay surface triangulation of a set of points on a manifold surface in 3D space. The `marching triangles' algorithm uses the local 3D constraint to reconstruct a Delaunay triangulation of an arbitrary topology manifold surface. Computational and representational costs are both a factor of 3-5 lower than previous volumetric approaches such as marching cubes.
This paper addresses the problem of reconstructing an integrated 3D model from multiple 2.5D range images. A novel integration algorithm is presented based on a continuous implicit surface representation. This is the first reconstruction algorithm to use operations in 3D space only. The algorithm is guaranteed to reconstruct the correct topology of surface features larger than the range image sampling resolution. Reconstruction of triangulated models from multi-image data sets is demonstrated for complex objects. Performance characterization of existing range image integration algorithms is addressed in the second part of this paper. This comparison defines the relative computational complexity and geometric limitations of existing integration algorithms.
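The idea of integrating range images through a continuous implicit surface can be illustrated in one dimension: each view contributes a signed distance along a ray, a weighted average gives the fused field, and its zero crossing is the reconstructed surface. A minimal sketch (grid resolution and weights are illustrative, not the paper's algorithm):

```python
import numpy as np

def fused_zero_crossing(depths, weights, z_grid):
    """Weighted average of per-view signed distances along one ray;
    the fused surface lies where the averaged field crosses zero."""
    d = np.array([w * (depth - z_grid) for depth, w in zip(depths, weights)])
    f = d.sum(axis=0) / sum(weights)          # fused signed distance field
    i = np.argmin(np.abs(f))                  # grid sample nearest the crossing
    # linearly interpolate between the two samples bracketing the sign change
    j = i if f[i] > 0 else i - 1
    t = f[j] / (f[j] - f[j + 1])
    return z_grid[j] + t * (z_grid[j + 1] - z_grid[j])
```

With per-view weights reflecting measurement confidence, the zero crossing lands at the weighted mean of the individual depth estimates, which is the noise-averaging behaviour implicit-surface fusion relies on.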
We have measured charged particle pair production in two-photon scattering at the e+e- storage ring PETRA. While the main source of such events is the production of lepton pairs, the presence of an additional process is clearly indicated by the measured invariant mass distribution of the two particles and their angular distributions. We determine that the excess is mainly due to the decay f0(1270)→π+π-. We derive a width Γ(f0→γγ)=3.2±0.2±0.6 keV (statistical and systematic). © 1981 Springer-Verlag.
We present an analysis of ρ0ρ0 production by two photons in the ρ0ρ0 invariant mass range from 1.2 to 2.0 GeV. From a study of the angular correlations in the process γγ→ρ0ρ0→π+π-π+π- we exclude a dominant contribution from JP=0- or 2- states. The data indicate sizeable contributions from JP=0+ for four-pion masses M4π below 1.7 GeV. The data are also well described by a model with isotropic production and uncorrelated isotropic decay of the ρ0's. The cross section stays high below the nominal ρ0ρ0 threshold, i.e. M4π
A method for word recognition based on the use of hidden Markov models (HMMs) is described. An evaluation of its performance is presented using a test set of real printed documents that have been subjected to severe photocopy and fax transmission distortions. A comparison with a commercial OCR package highlights the inherent advantages of a segmentation-free recognition strategy when the word images are severely distorted, as well as the importance of using contextual knowledge. The HMM method makes only one quarter of the number of word errors made by the commercial package when tested on word images taken from faxed pages. © 1998 Springer-Verlag Berlin Heidelberg.
Shape, in both 2D and 3D, provides a primary cue for object recognition and the Hough transform (P.V.C. Hough, U.S. Patent 3,069,654, 1962) is a heuristic procedure that has received considerable attention as a shape-analysis technique. The literature that covers application of the Hough transform is vast; however, there have been few analyses of its behavior. We believe that one of the reasons for this is the lack of a formal mathematical definition. This paper presents a formal definition of the Hough transform that encompasses a wide variety of algorithms that have been suggested in the literature. It is shown that the Hough transform can be represented as the integral of a function that represents the data points with respect to a kernel function that is defined implicitly through the selection of a shape parameterization and a parameter-space quantization. The kernel function has dual interpretations as a template in the feature space and as a point-spread function in the parameter space. A novel and powerful result that defines the relationship between parameter-space quantization and template shapes is provided. A number of interesting implications of the formal definition are discussed. It is shown that the Radon transform is an incomplete formalism for the Hough transform. We also illustrate that the Hough transform has the general form of a generalized maximum-likelihood estimator, although the kernel functions used in estimators tend to be smoother. These observations suggest novel ways of implementing Hough-like algorithms, and the formal definition forms the basis of work for optimizing aspects of Hough transform performance (see J. Princen et al., Proc. IEEE 3rd Internat. Conf. Comput. Vis., 1990, pp. 427-435). © 1992 Kluwer Academic Publishers.
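The kernel view of the Hough transform, where each data point spreads a vote along the curve it induces in parameter space, reduces for straight lines to the familiar accumulator below. Bin counts and ranges are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def hough_lines(points, n_theta=180, rho_max=64.0, n_rho=129):
    """Accumulate votes for the line parameterization
    rho = x*cos(theta) + y*sin(theta); each point votes along the
    sinusoid (its kernel) in the quantized parameter space."""
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rho_edges = np.linspace(-rho_max, rho_max, n_rho + 1)
    acc = np.zeros((n_theta, n_rho), dtype=int)
    for x, y in points:
        rho = x * np.cos(thetas) + y * np.sin(thetas)
        bins = np.digitize(rho, rho_edges) - 1
        ok = (bins >= 0) & (bins < n_rho)
        acc[np.nonzero(ok)[0], bins[ok]] += 1
    return acc, thetas, rho_edges
```

Collinear points vote into the same (theta, rho) cell, so the accumulator peak identifies the line; the cell size is exactly the parameter-space quantization that, per the paper, implicitly defines the kernel.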
We address the problem of anomaly detection in machine perception. The concept of domain anomaly is introduced as distinct from the conventional notion of anomaly used in the literature. We propose a unified framework for anomaly detection which exposes the multifaceted nature of anomalies and suggest effective mechanisms for identifying and distinguishing each facet as instruments for domain anomaly detection. The framework draws on the Bayesian probabilistic reasoning apparatus which clearly defines concepts such as outlier, noise, distribution drift, novelty detection (object, object primitive), rare events, and unexpected events. Based on these concepts we provide a taxonomy of domain anomaly events. One of the mechanisms helping to pinpoint the nature of anomaly is based on detecting incongruence between contextual and noncontextual sensor(y) data interpretation. The proposed methodology has wide applicability. It underpins in a unified way the anomaly detection applications found in the literature. To illustrate some of its distinguishing features, here the domain anomaly detection methodology is applied to the problem of anomaly detection for a video annotation system.
In this paper a new technique is introduced for automatically building recognisable, moving 3D models of individual people. A set of multiview colour images of a person is captured from the front, sides and back by one or more cameras. Model-based reconstruction of shape from silhouettes is used to transform a standard 3D generic humanoid model to approximate a person’s shape and anatomical structure. Realistic appearance is achieved by colour texture mapping from the multiview images. The results show the reconstruction of a realistic 3D facsimile of the person suitable for animation in a virtual world. The system is inexpensive and is reliable for large variations in shape, size and clothing. This is the first approach to achieve realistic model capture for clothed people and automatic reconstruction of animated models. A commercial system based on this approach has recently been used to capture thousands of models of the general public.
Measurements of R, sphericity and thrust are presented for c.m. energies between 12 and 31.6 GeV. A possible contribution of a tt̄ continuum can be ruled out for c.m. energies between 16 and 31 GeV. © 1980 Springer-Verlag.
We have observed τ production in e+e- annihilation at centre-of-mass energies between 12 and 31.6 GeV with cross sections in agreement with the QED τ-pair cross section. Branching ratios for τ decay have been measured and are consistent with the world averages. We have determined the cutoff parameters of QED Λ+ (Λ-) to be > 73 GeV (82 GeV) and have obtained an upper limit on the τ lifetime of 1.4 × 10^-12 s (95% CL). © 1980.
High-energy e+e--annihilation events obtained in the TASSO detector at PETRA have been used to determine the spin of the gluon in the reaction e+e- → qqg. We analysed angular correlations between the three jet axes. While vector gluons are consistent with the data (55% confidence limit), scalar gluons are disfavoured by 3.8 standard deviations, corresponding to a confidence level of about 10-4. Our conclusion is free of possible biases due to uncertainties in the fragmentation process or in determining the qqg kinematics from the observed hadrons. © 1980.
The process e+e-→ π0 + anything has been measured at c.m. energies of 14 and 34 GeV for π0 energies between 0.5 and 4 GeV. The ratio of π0 to π± production for π momenta between 0.5 and 1.5 GeV/c is measured to be 2σ(π0)/ [σ(π+) + σ(π-)] = 1.3 ± 0.4 (1.2 ± 0.4) at 14 (34) GeV. The scaled cross section (s/μ)dσ/dx when compared with lower energy (4.9-7.4 GeV) π0 data indicates a substantial scaling violation. © 1982.
Production of proton-antiproton pairs by two-photon scattering has been observed at the electron-positron storage ring PETRA. A total of eight proton-antiproton pairs have been identified using the time-of-flight technique. We have measured a total cross section of 4.5 ± 0.8 nb in the photon-photon c.m. energy range 2.0-2.6 GeV. © 1982.
We have studied the properties of hadron production in photon-photon scattering with tagged photons at the e+e- storage ring PETRA. A tail in the pT distribution of particles consistent with pT^-4 has been observed. We show that this tail cannot be due to the hadronic part of the photon. Selected events with high pT particles are found to be consistent with a two-jet structure as expected from a point-like coupling of the photons to quarks. The lowest-order cross section predicted for γγ → qq, σ = 3 Σ e_q^4 · σ(γγ → μμ), is approached from above by the data at large transverse momenta. © 1981.
A fine energy scan has been performed to search for narrow states in e+e- annihilation at c.m. energies between 29.90 and 31.46 GeV. No such state has been observed. The 90% confidence upper limit on the leptonic decay width times the hadronic decay branching ratio is ΓeeBh
Hadron production by e+e- annihilation has been studied for c.m. energies W between 13 and 31.6 GeV. As a function of 1n W the charged particle multiplicity grows faster at high energy than at lower energies. This is correlated with a rise in the plateau of the rapidity distribution. The cross section sdσ/dx is found to scale within ±30% for x > 0.2 and 5 ≤ W ≤ 31.6 GeV. © 1980.
Hadron jets produced in e+e- annihilation between 13 GeV and 31.6 GeV in c.m. at PETRA are analyzed. The transverse momentum of the jets is found to increase strongly with c.m. energy. The broadening of the jets is not uniform in azimuthal angle around the quark direction but tends to yield planar events with large and growing transverse momenta in the plane and smaller transverse momenta normal to the plane. The simple qq collinear jet picture is ruled out. The observation of planar events shows that there are three basic particles in the final state. Indeed, several events with three well-separated jets of hadrons are observed at the highest energies. This occurs naturally when the outgoing quark radiates a hard noncollinear gluon, i.e., e+e- → qqg with the quarks and the gluons fragmenting into hadrons with limited transverse momenta. © 1979.
We have observed e+e- → hadrons at c.m. energies of 13 GeV and 17 GeV at PETRA using the TASSO detector. We find R(13 GeV) = 5.6 ± 0.7 and R(17 GeV) = 4.0 ± 0.7. The additional systematic uncertainty is 20%. Comparing inclusive charged hadron spectra, we observe scaling between 5 GeV and 17 GeV for x = p/pbeam > 0.2; however, the 13 GeV cross section is above the 17 GeV cross section for smaller x. This may be due to copious bb̄ production. The events become increasingly jet-like at high energies, as evidenced by a shrinking sphericity distribution with increasing energy. © 1979.
This paper presents a layered animation framework which uses displacement maps for efficient representation and animation of highly detailed surfaces. The model consists of three layers: a skeleton; low-resolution control model; and a displacement map image. The novel aspects of this approach are an automatic closed-form solution for displacement map generation and animation of the layered displacement map model. This approach provides an efficient representation of complex geometry which allows realistic deformable animation with multiple levels-of-detail. The representation enables compression, efficient transmission and level-of-detail control for animated models.
The detection and recognition of objects from image data is a difficult problem that is closely related to problems of segmentation and stable and reliable feature detection. Feature detection is dependent on a number of factors such as the resolution at which data is sensed. In normal vision systems, the sensor is static with no ability to pan or zoom. However, with the advent of active robot vision heads such as GETAFIX, there is the ability to pan and zoom onto areas of interest. In this article, the use of GETAFIX for object recognition by automatic active panning, zooming and focusing is considered. This is demonstrated by conducting experiments for the case of detecting cylindrical 3D objects in table-top scenes.
The frequency response of the filter consists of two independent parts. The first is a prolate spheroidal sequence that is dependent on the polar radius. The second is a cosine function of the polar angle. The product of these two parts constitutes a 2-D filtering function. The frequency characteristics of the new filter are similar to that of the 2-D Cartesian separable filter which is defined in terms of two prolate spheroidal sequences. However, in contrast to the 2-D Cartesian separable filter, the position and direction of the new filter in the frequency domain is easy to control. Some applications of the new filter in texture processing, such as generation of synthetic texture, estimation of texture orientation, feature extraction, and texture segmentation, are discussed.
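A rough numerical sketch of such a polar-separable frequency response is given below. For simplicity, a raised-cosine radial window stands in for the prolate spheroidal sequence of the paper; only the separable structure H(r, φ) = R(r)·cos^n(φ − θ0), with its easily controlled position r0 and direction θ0, is faithful.

```python
import numpy as np

def polar_filter(size, r0, bandwidth, theta0, n):
    """Frequency response separable in polar coordinates:
    a radial band-select term times a steerable angular cosine term.
    The raised-cosine radial profile is an illustrative stand-in for
    the prolate spheroidal sequence used in the paper."""
    fy, fx = np.meshgrid(np.arange(size) - size // 2,
                         np.arange(size) - size // 2, indexing='ij')
    r = np.hypot(fx, fy)
    phi = np.arctan2(fy, fx)
    radial = np.where(np.abs(r - r0) < bandwidth,
                      0.5 * (1 + np.cos(np.pi * (r - r0) / bandwidth)), 0.0)
    angular = np.abs(np.cos(phi - theta0)) ** n
    return radial * angular
```

Changing r0 shifts the passband radially and changing theta0 rotates it, which is the control over position and direction that the Cartesian-separable construction lacks.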
One of the most important challenges for face recognition algorithms is dealing with large variability due to facial expression. This paper presents an approach for the expression classification of 3D face scans. The proposed method is based on modelling local deformations, which are calculated as the surface change between a neutral face and a face with expression. These deformations are used to train a multiclass/multi-feature LDA classifier. On an unseen face, local deformations are calculated automatically using a face with neutral expression as a reference. It is shown that the results obtained are comparable with other similar approaches, with the advantage that no manual intervention is required for the classification process.
Deformable surface fitting methods have been widely used to establish dense correspondence across different 3D objects of the same class. Dense correspondence is a critical step in constructing morphable face models for face recognition. In this paper a mainstream method for constructing dense correspondences is evaluated on 912 3D face scans from the Face Recognition Grand Challenge (FRGC) V1 database. A number of modifications to the standard deformable surface approach are introduced to overcome limitations identified in the evaluation. Proposed modifications include multi-resolution fitting, adaptive correspondence search range and enforcing symmetry constraints. The modified deformable surface approach is validated on the 912 FRGC 3D face scans and is shown to overcome limitations of the standard approach which resulted in gross fitting errors. The modified approach halves the rms fitting error, with 98% of points within 0.5 mm of their true position compared to 67% with the standard approach. © 2006 IEEE.
An important aspect of any scientific discipline is the objective and independent comparison of algorithms which perform common tasks. In image analysis this problem has been neglected. In this paper we present the results and conclusions of a comparison of four Hough Transform (HT) based line-finding algorithms on a range of realistic images from the industrial domain. We introduce the line detection problem and show the role of the Hough Transform in it. The basic idea underlying the Hough Transform is presented, followed by a brief description of each of the four HT-based methods considered in our work. The experimental evaluation and comparison of the four methods is given, and a final section offers our conclusions on the merits and deficiencies of each of the four methods.
A novel method of automatic threshold selection based on a simple image statistic is proposed. The method avoids the analysis of complicated image histograms. The properties of the algorithm are presented and experimentally verified on computer generated and real world images.
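As an illustration of selecting a threshold from a simple image statistic rather than histogram analysis, here is the classical Ridler-Calvard iterative scheme, which needs only group means computed directly from pixel values; the paper's actual statistic may differ.

```python
import numpy as np

def iterative_threshold(pixels, tol=0.5):
    """Ridler-Calvard style selection: repeatedly set the threshold to
    the midpoint of the means of the two groups it induces.  Operates
    directly on pixel values, with no histogram construction."""
    pixels = np.asarray(pixels, dtype=float)
    t = pixels.mean()                      # start from the global mean
    while True:
        lo, hi = pixels[pixels <= t], pixels[pixels > t]
        new_t = 0.5 * (lo.mean() + hi.mean())
        if abs(new_t - t) < tol:
            return new_t
        t = new_t
```

On a bimodal image the iteration settles between the two modes after a handful of passes, each pass being a single scan over the pixels.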
We present an algorithm for automatic localization of landmarks on 3D faces. An active shape model, ASM, is used as a statistical joint location model for configurations of facial features. The ASM is adapted to individual faces via a guided search whereby landmark specific shape index models are matched to local surface patches. The algorithm is trained and tested on 912 3D face images from the face recognition grand challenge dataset. Results demonstrate that the automatic procedure successfully and reliably locates landmarks and, compared with an iterative closest point (ICP) algorithm, reduces the mean error for location of landmarks by nearly a half. ©2008 The Institution of Engineering and Technology.
A method for the recognition of hand-printed numerals using hidden Markov models is described. The method involves the representation of 2D images of a character with two 1D models, one for the pixel columns of the image and the other for the rows. Various normalisations are applied to both the training and test data to reduce variations between characters within a class, resulting in a corresponding improvement in classification performance. In our latest experiments, a character recognition rate of over 93% was achieved on digit strings of variable length.
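Scoring a row or column observation sequence against a per-class HMM reduces to the forward algorithm; classification then takes the model with the highest likelihood. A minimal sketch with arbitrary illustrative model parameters, not those trained in the paper:

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """P(observation sequence | model) via the forward algorithm.
    pi: initial state probabilities, A[i, j]: transition i -> j,
    B[i, k]: probability of emitting symbol k from state i."""
    alpha = pi * B[:, obs[0]]              # initialise with the first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # propagate and absorb next symbol
    return alpha.sum()
```

To classify a digit image one would score its column sequence (and row sequence) under each digit's model and return the argmax; the recursion costs O(T·N^2) rather than summing over all state paths.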