Security and Surveillance Cameras are Uniquely Positioned to be Enhanced with More Intelligence
- By Paul Gallagher
- Jul 24, 2020
Capturing depth in addition to the usual two dimensions is already a must-have feature for many systems. Depth-sensing information combined with computing power enables functionality such as face-based unlocking, while additional AI can distinguish between people, animals, and vehicles as well as recognize familiar faces.
Depth Carries Critical Information
3D imaging is inspired by the most complex imaging device: the human eye. We have natural depth perception capabilities that help us navigate our world.
Many of today's devices translate the 3D world into a 2D image using 2D image recognition-based computer vision. The limitation of 2D technology is that a flat image does not convey how far apart objects are within a given scene.
Many cameras use a passive infrared (PIR) motion detector to wake the camera or alert on activity. The PIR sensor only senses thermal motion changes in the scene. Changing the sensitivity setting only changes the amount of thermal change required to trigger an alarm, not how near the object must be to the camera. As a result, a truck on the street or a pet near the camera can cause false alarms. Too often, to remove these false alarms, we end up turning the sensitivity down to the point where the system can no longer detect human presence.
Computer vision motion detection allows for greater analysis of the scene and for identification of the subject through advanced features such as person detection or facial recognition. This requires the full camera pipeline to run and process the scene to determine whether objects of interest are present. Each camera therefore consumes a non-trivial amount of processing power, and these systems can be fooled by photographs of human subjects. A further issue is reliability, which depends heavily on the specifics of the algorithms used.
Using 3D imaging systems, one can add distance or depth information to each pixel of the RGB image and accurately measure the distances between objects in a complete scene in real time. When a device captures images in 3D, it knows not only the color and shape information of the flat image but also the position and size of the objects within it.
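As an illustration of what per-pixel depth enables, the following minimal sketch back-projects two pixels of a depth map into 3D points and measures the separation between them, assuming a simple pinhole camera model; the intrinsics and depth values are purely illustrative.

```python
import numpy as np

def pixel_to_point(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into a 3D point (pinhole model)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Illustrative intrinsics for a 1920x1080 sensor (assumed values)
fx = fy = 1000.0          # focal length in pixels
cx, cy = 960.0, 540.0     # principal point
p1 = pixel_to_point(700, 600, 2.4, fx, fy, cx, cy)    # e.g. a person at 2.4 m
p2 = pixel_to_point(1400, 580, 7.8, fx, fy, cx, cy)   # e.g. a vehicle at 7.8 m

print("separation between objects: %.2f m" % np.linalg.norm(p1 - p2))
```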
A 3D camera can detect how far the subject is from the camera as well as classify what the subject is, whether a person, a pet, or a truck, with fewer false positives. Only once a verified human subject gets close enough does the system proceed with biometric identification.
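A minimal sketch of such distance-gated logic is shown below, assuming a hypothetical 3D-aware detector that returns a class label and a median depth per detection; the function and the range thresholds are illustrative, not a specific product's API.

```python
UNLOCK_RANGE_M = 1.5   # illustrative threshold for proceeding to face identification
ALERT_RANGE_M = 6.0    # ignore activity beyond this distance (e.g. trucks on the street)

def handle_frame(detections):
    """detections: list of (label, median_depth_m) pairs from an assumed 3D-aware detector."""
    for label, depth_m in detections:
        if label != "person" or depth_m > ALERT_RANGE_M:
            continue                      # pets, vehicles, and distant objects are ignored
        if depth_m <= UNLOCK_RANGE_M:
            return "run_face_identification"
        return "raise_presence_alert"
    return "idle"

print(handle_frame([("truck", 12.0), ("person", 4.2)]))   # -> raise_presence_alert
```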
A 3D imaging solution can also boost anti-spoofing capabilities, particularly against attempts to defeat facial recognition systems using a photo, whether a hard copy or an image on the display of a smartphone.
Depth perception enables 3D camera applications in a variety of scenarios such as human presence, motion tracking, activity monitoring, smart alerts, physical access, device access, and gesture recognition.
How is 3D Imaging Implemented?
There are various active and passive 3D imaging solutions available ranging from structured light (IR dot-projectors and IR cameras) and stereoscopic camera systems, to time-of-flight dedicated sensors.
Active systems use various methods for spatially or temporally modulating light, such as time-of-flight and structured light. Passive methods include stereo, depth from focus, and light field, where ambient or broad fixed illumination is trained on the object.
In both active and passive 3D imaging systems, reflected light from the object being illuminated is captured by a CMOS-based camera to generate a depth map of the structure, and then a 3D model of the object.
Standard Passive Stereo Vision
This approach relies on ambient light and uses two cameras located a fixed distance apart to capture a scene from two different positions. Using triangulation, depth information is extracted from the left-right image disparity by matching features in both images. The range of depth a stereo system can accurately detect is determined by the sensor, the lens, and the distance (baseline) between the two cameras. The closer the cameras are to each other, the nearer the usable depth range.
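As a sketch of the underlying triangulation, depth for a rectified stereo pair follows the pinhole relationship Z = f·B/d, where f is the focal length in pixels, B the baseline between the cameras, and d the left-right disparity in pixels; the numbers below are illustrative only.

```python
def stereo_depth(disparity_px, focal_px, baseline_m):
    """Depth from left-right disparity for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        return float("inf")               # no valid match (e.g. textureless region)
    return focal_px * baseline_m / disparity_px

# Illustrative numbers: 1000 px focal length, 6 cm baseline
for d in (40, 10, 2):
    print(f"disparity {d:3d} px -> depth {stereo_depth(d, 1000.0, 0.06):.2f} m")
# A wider baseline pushes the usable range farther out; a narrower one favors near range.
```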
Changes in the distance between the two cameras after deployment, whether due to thermal expansion or to the unit being dropped or struck, will degrade depth accuracy. Passive stereo only finds depth at points where the two cameras can ‘see’ the same point in the scene. When there are regions in the scene that are homogeneous in response (e.g. flat, single-color walls), there are no distinct points for the system to converge on. This means there are regions within the scene where no depth information can be determined.
Active stereo. To provide depth in the regions where passive stereo provides no depth information, a patterned light can be used. In addition to the stereo cameras, a light source that projects a pattern of light onto the scene is included. The pattern reflects off the parts of the scene that are homogeneous in response, essentially adding artificial texture, which the stereo system can then detect.
Structured light. In addition to the camera itself, a structured light system adds a patterned light source/projector that illuminates the scene with a specific pattern. The pattern of this illumination is distorted by the surface of the object, and from this distortion a depth map of the object can be calculated through triangulation. Like stereo, the solution relies on knowing the exact distance between the camera and, in this case, the light source rather than a second camera.
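The triangulation is the same as in stereo if the projector is treated as a virtual second camera; a minimal sketch under that assumption is shown below, with the focal length, camera-to-projector baseline, and observed pattern disparity chosen purely for illustration.

```python
def structured_light_depth(pattern_disparity_px, focal_px, camera_projector_baseline_m):
    """Depth from the disparity between where a projected feature is seen by the camera
    and where the projector emitted it, treating the projector as a virtual camera."""
    if pattern_disparity_px <= 0:
        return float("inf")
    return focal_px * camera_projector_baseline_m / pattern_disparity_px

# Illustrative: 800 px focal length, 5 cm camera-to-projector baseline
print(f"{structured_light_depth(16, 800.0, 0.05):.2f} m")   # -> 2.50 m
```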
Typically, this is not the same camera used for the 2D color image, because the pattern of light would be visible. Instead, a separate monochrome sensor is used, coupled to a patterned light source working in the near-IR range, beyond the wavelengths a person can see.
Because these systems rely on capturing light reflected off objects, they tend to have problems in sunlight, when parts of the scene have too much light on them to distinguish where the pattern falls from where there is no patterned light.
Since structured light is dependent upon the distance between the camera and the light source, its accuracy will be impacted if this distance changes. The ranging limit also depends on how bright the source illumination is and on the reflectivity of the objects in the scene it impinges upon.
Time of Flight (ToF) is a technique for calculating the distance between the camera and the object by measuring the time it takes the projected infrared light to travel from the camera, bounce off the object surface, and return to the sensor. Because the speed of light is constant, by analyzing the phase shift between the emitted and returned light, the processor can calculate the distance to the object and reconstruct the scene.
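For the common indirect (phase-measuring) ToF case, the distance follows d = c·Δφ / (4π·f_mod), where Δφ is the measured phase shift and f_mod the modulation frequency of the emitted light; a minimal sketch with an illustrative 20 MHz modulation frequency is shown below.

```python
import math

C = 299_792_458.0   # speed of light, m/s

def itof_distance(phase_shift_rad, mod_freq_hz):
    """Indirect ToF: distance from the phase shift of an amplitude-modulated IR signal.
    d = c * delta_phi / (4 * pi * f_mod); unambiguous only up to c / (2 * f_mod)."""
    return C * phase_shift_rad / (4.0 * math.pi * mod_freq_hz)

f_mod = 20e6   # illustrative 20 MHz modulation frequency
print(f"ambiguity range: {C / (2 * f_mod):.2f} m")                        # ~7.49 m before the phase wraps
print(f"phase shift pi/2 -> {itof_distance(math.pi / 2, f_mod):.2f} m")   # ~1.87 m
```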
Typically, the light is a single pulse that illuminates the entire scene at the same time. As this system is also generating light, it can be subject to the same scene lighting issues described for structured light.
Additionally, the ToF sensor pixel typically has a very different architecture from a conventional color sensor pixel, and so it requires a special sensor just to detect depth. ToF systems have the same ranging limitations based on the intensity of the generated light as described for structured light.
Alternatively, a direct ToF system can use a time-to-digital converter to analyze a histogram of the returned light signal from objects in the scene, as is the case for many of the proximity sensors used in smartphones and gesture recognition devices.
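A minimal sketch of that direct-ToF idea is shown below: photon arrival times are binned into a histogram, the peak bin gives the round-trip time, and distance follows d = c·t/2; the 250 ps bin width and the synthetic histogram are assumptions for illustration.

```python
import numpy as np

C = 299_792_458.0        # speed of light, m/s
BIN_WIDTH_S = 250e-12    # illustrative 250 ps time-to-digital converter resolution

def dtof_distance(histogram):
    """Direct ToF: take the peak bin of a photon arrival-time histogram and
    convert the round-trip time to distance, d = c * t / 2."""
    peak_bin = int(np.argmax(histogram))
    round_trip_s = peak_bin * BIN_WIDTH_S
    return C * round_trip_s / 2.0

# Synthetic histogram: background counts plus a return pulse around bin 40 (~1.5 m)
rng = np.random.default_rng(0)
hist = rng.poisson(2, size=128)
hist[40] += 60
print(f"estimated distance: {dtof_distance(hist):.2f} m")
```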
LIDAR. A problem with the ToF approach is that the sensor will receive more than just the intended reflections from the pulse of light; it will also receive off-angle reflections and multiple reflections. LIDAR addresses this issue by not illuminating the whole scene at the same time but by sending a pulsed beam to a specific point in the scene. This way only a single return pulse is captured at any one time. Because LIDAR uses a single beam, its depth range can be much greater than that of ToF or structured light.
It's important to note that active solutions using structured light and time of flight measure depth but do not necessarily produce conventional 2D images. On the other hand, since multi-camera solutions such as stereo and depth-by-defocus depend on light intensity data, they can only natively record 2D images and cannot directly measure depth.
Additionally, the solutions that generate light produce much lower depth resolutions, due to either the finite number of dots that a structured light system can generate or the size of the specialty pixels the ToF and LIDAR systems use. For most solutions, the depth resolution of these active illumination approaches is 1 megapixel or less.
All these implementations use complex hardware that can influence the industrial design choices of a security camera and significantly increase its bill of materials cost. They include multiple RGB or IR cameras, specialty sensors, and require a relatively high level of computation.
A new and simpler approach was recently introduced with DepthIQ™, a passive single-sensor solution developed by AIRY3D, which uses diffraction to measure depth directly through an optical encoding mask applied on a conventional 2D CMOS sensor. Together with image depth processing software, DepthIQ™ converts the 2D color or monochrome sensor into a 3D sensor that generates both 2D images and depth maps which are inherently correlated, resulting in depth per pixel. DepthIQ™ promises to enable high-quality depth sensing at a much lower cost, with very lightweight processing, and without limitation on the resolution of the sensor.
When choosing between imaging systems for security camera applications, several factors must be considered.
Depth performance and anti-spoofing capabilities. Even though biometric technology increases the security of the systems that use it, those systems are prone to spoofing attacks in which fraudulent biometrics are presented. Using 3D cameras is one major anti-spoofing approach: the system calculates the depth of the scanned face to determine its authenticity.
Tests to assess depth information in face authentication techniques and evaluate their capability in distinguishing fake biometrics have been previously performed in the industry. The best results were obtained by 3D solutions capable of detecting a subject’s real face in contrast to a flat picture of a subject’s face, a subject’s face displayed on a screen, or a video of the subject, regardless of the subject’s head orientation, facial movement, and distance from the camera.
The depth range of the 3D camera is also an important factor, particularly where occlusion zones (“blind spots”) that limit the ability to collect 3D range information are concerned. With stereo solutions, one can set the range of depth detection (near & far) but all stereoscopic imaging systems and active illumination systems have inherent dead zones with little or no 3D information. In particular, some objects in the foreground may be visible to only one of the two cameras.
Structured light and ToF implementations are also affected by geometrical occlusion zones in the scene, where one object blocks the light from reaching a second object. It should be noted, however, that the baseline between stereo cameras is effectively replaced in a ToF system by the separation between the ToF light source and the ToF receiver, so an inherent geometrical limitation remains for objects in the foreground. The difference between a ToF system and a stereo system is that foreground depth accuracy is limited by the response time of the return-signal phase detection (or time-to-digital conversion in direct ToF systems).
In contrast to the challenges with accurate detection of nearby objects in the two-device geometries described above (stereo, structured light, and ToF), DepthIQ uses a single CMOS image sensor and has no built-in occlusion zones. The geometric design of the monolithically integrated encoding/diffraction mask is optimized to provide the greatest possible depth sensitivity, accounting for both the numerical aperture of the objective lens and the chief ray angle of the microlens array patterned on the pixel array of the image sensor.
Sunlight interference. As mentioned, light sources such as sunlight or reflections off shiny objects can saturate cameras. This limitation becomes especially problematic for systems that rely on structured light and time of flight for 3D image capture. The main mitigation strategy for bright sunlight in structured light and ToF systems is a narrow bandpass optical filter that limits the spectral range of the image to a wavelength region near 940 nm or 850 nm, chosen to be compatible with the artificial light source, typically a vertical cavity surface emitting laser (VCSEL). This adds cost and, while it improves performance in sunlight, such systems still struggle under bright sunlight. Conversely, low light levels can produce noisy images that confound the depth computation algorithms.
The same limitations do not apply to DepthIQ and stereoscopic techniques, since both fundamentally rely on some form of A-to-B image disparity (that is, left-right or up-down angular separation). Generally speaking, stereoscopic and DepthIQ solutions perform well in sunlit scenes, unless large areas of the image are badly overexposed.
Computation, power consumption, footprint and cost. Very low power consumption in a compact, low profile form factor is essential in embedded computing applications.
Passive sensing mechanisms use less power than active depth-sensing camera systems. Active solutions require a high level of computation, which is taxing on the image signal processor. In contrast, DepthIQ provides perfectly aligned 2D and 3D data, and most of the computational burden is eliminated because the physics of light diffraction contributes the disparity information on a pixel-by-pixel basis. This makes the extraction of depth information from the raw 2D image somewhat analogous to the demosaicing of a Bayer-pattern color filter array.
The number of components used weighs heavily on the footprint of the solution and the associated cost. Traditional 3D solutions require multiple components: either two image sensors in a stereo camera, or a single sensor with a structured light projector, or a time-of-flight detector accompanied by one or two light sources. In comparison, DepthIQ uses a single existing 2D CMOS image sensor to generate both the 2D image and the 3D depth map.
With the increasing use of smart cameras for security and surveillance in residential settings, commercial facilities, and public spaces, no single solution will perfectly fit all applications. There may be cases where a specific application requires a combination of two or more 3D solutions for best results.
For example, a stereo solution could be used in a long-range application such as airport security in conjunction with DepthIQ. In this hybrid scenario, the DepthIQ-enabled sensors will serve to enhance the RGB sensors to improve close depth and extend far depth, fill in occlusion zones, simplify computation, and monitor drift. This opens up cost-effective, low-compute extended 3D capabilities in physical security applications wherever conventional CMOS image sensors are in use today.