## Image Formation, Cameras, and Lenses

Image formation depends on

- Lighting conditions
- Scene geometry
- Surface properties
- Camera optics
- Sensor properties

### Image Processing Pipeline

The sequence of image processing operations applied by the camera’s image signal processor (ISP) to convert a RAW image into a regular JPG/PNG.

- Lens
- CFA
- Analog Front-end → RAW image (mosaiced, linear, 12-bit)
- White balance
- CFA demoisaicing
- Denoising
- Colour transforms
- Tone reproduction
- Compression → final RGB image (non-linear, 8-bit)

### Examples

#### Reflection

Surface reflection depends on viewing $(θ_{v},ϕ_{v})$ and illumination $(θ_{i},ϕ_{i})$ direction along with the Bidirectional Reflection Distribution Function: $BRDF(θ_{v},ϕ_{v},θ_{i},ϕ_{i})=πρ_{d} $

All angles are spherical coordinates w.r.t. the normal line of the surface.

A Lambertian (matte) surface is one which appears the same brightness from all directions. A mirror (specular) surface is one where all incident light is reflected in one direction $(θ_{v},ϕ_{v})=(θ_{r},ϕ_{r})$

#### Cameras

*All scene points contribute to all sensor pixels*

As a result, the image is really blurry.

*Pinhole camera*

The image here is flipped, but no longer blurry. Roughly, each scene point contributes to one sensor. Pinhole camera means you need to get the right size of pinhole. If the pinhole is too big, then many directions are averaged, blurring the image. If the pinhole is too small, then diffraction becomes a factor, also blurring the image.

A few perspective ‘tricks’ arise out of the pinhole

- Size is inversely proportional to distance
- Parallel lines meet at a point (vanishing point on the horizon)

Side note, pinhole cameras are really slow because only a small amount of light actually makes it through the pinhole. As a result, we have lenses

#### Lenses

The role of a lens is to capture more light while preserving, as much as possible, the abstraction of an ideal pinhole camera, the thin lens equation

$z_{′}1 −z1 =f1 $

where $f$ is the focal point, $z_{′}$ is the distance to the image plane, and $z$ is the distance to the object.

Focal length can be thought of as the distance behind the lens where incoming rays parallel to the optical axis converge to a single point.

Different effects can be explained using physics phenomena. Vignettes are simply light that reaches one lens but not the other in a compound lens. Chromatic aberration happens because the index of refraction depends on wavelength of the light so not all colours can be in equal focus.

Similarities with the human eye

- pupil is analogous to the pinhole/aperture
- retina is analogous to the film/digital sensor

#### Weak Perspective

Only accurate when object is small/distant. Useful for recognition

$P= xyz projects to a 2D image pointP_{′}=[x_{′}y_{′} ]wherem=z_{0}f_{′} ,x_{′}=mx,y_{′}=my$

#### Orthographic Projection

$P= xyz projects to a 2D image pointP_{′}=[x_{′}y_{′} ]wherex_{′}=x,y_{′}=y$

#### Perspective Projection

$P= xyz projects to a 2D image pointP_{′}=[x_{′}y_{′} ]wherem=z_{0}f_{′} ,x_{′}=f_{′}zx ,y_{′}=f_{′}zy $

## Image as functions

### Grayscale images

2D function $I(x,y)$ where the domain is $(x,y)∈([1,width],[1,height])$ and the range is $I(x,y)∈[0,255]∈Z$

### Point Processing

Apply a single mathematical operation to each individual pixel

## Linear Filters

Let $I(x,y)$ be an $n×n$ digital image. Let $F(x,y)$ be the $m×m$ filter or kernel. For convenience, assume $m$ is odd and $m<n$. Let $k=⌊2m ⌋$, we call $k$ the half-width.

For a correlation, we then compute the new image $I_{′}(x,y)$ as follows: $I_{′}(x,y)=∑_{j=−k}∑_{i=−k}F(i,j)I(x+i,y+j)$

For a convolution, we then compute the new image $I_{′}(x,y)$ as follows: $I_{′}(x,y)=∑_{j=−k}∑_{i=−k}F(i,j)I(x−i,y−j)$

A convolution is just the correlation with the filter rotated 180 degrees. We denote a convolution with the $⊗$ symbol.

In general,

- Correlation: measures similarity between two signals. In our case, this would mean similarity between a filter and an image patch it is being applied to
- Convolution: measures the effect one signal has on another signal

Each pixel in the output image is a linear combination of the central pixel and its neighbouring pixels in the original image. This results in $m_{2}×n_{2}$ computations. When $m≈n$, then this is a $O(n_{4})$

Low pass filters filter out high frequences, high pass filters filter out low frequencies.

### Properties of Linear Filters

- Superposition: distributive law applies to convolution. Let $F_{1}$ and $F_{2}$ be digital filters. Then $(F_{1}+F_{2})⊗I(x,y)=F_{1}⊗I(x,y)+F_{2}⊗I(x,y)$
- Scaling. Let $F$ be a digital filter and let $k$ be a scalar. $(kF)⊗I(x,y)=F⊗(kI(x,y))=k(F⊗I(x,y))$
- Shift invariance: output is local (doesn’t depend on absolute position in image)

**Characterization Theorem**: Any operation is linear if it satisfies both superposition and scaling. Any linear, shift-invariant operation can be expressed as a convolution.

## Non-linear filters

- Median Filter (take the median value of the pixels under the filter), effective at reducing certain kinds of noise, such as impulse noise (‘salt and pepper’ noise or ‘shot’ noise)
- Bilateral Filter (edge-preserving filter). Effectively smooths out the image but keeps the sharp edges, good for denoising. Weights of neighbour at a spacial offset $(x,y)$ from the center pixel $I(X,Y)$ given by a product $exp_{−2σx+y}exp_{−2σ(I(X+x,Y+y)−I(X,Y))}$. We call the first half of the product the
*domain kernel*(which is essentially a Gaussian) and the second half the*range kernel*(which depends on location in the image). - ReLU. $x$ for all $x>0$, $0$ otherwise.

## Sampling

Images are a discrete (read: sampled) representation of a continuous world.

An image suggests a 2D surface which can be grayscale or colour. We note that in the continuous case that

$i(x,y)$ is a real-valued function of real spatial variables, $x$ and $y$. It is bounded above and below, meaning $0≤i(x,y)≤M$ where $M$ is the maximum brightness. It is bounded in extent, meaning that $x$ and $y$ do not span the entirety of the reals.

Images can also be considered a function of time. Then, we write $i(x,y,t)$ where $t$ is the temporal variable. We can also explicitly state the dependence of brightness on wavelength explicit, $i(x,y,t,λ)$ where $λ$ is a spectral variable.

We denote the discrete image with a capital I as $I(x,y)$. So when we go from continuous to discrete, how do we sample?

- Point sampling is useful for theoretical development
- Area-based sampling occurs in practice

We also quantize the brightness into a finite number of equivalence classes. These values are called gray-levels

$I(x,y)⟹⌊Mi(x,y) (2_{n}−1)+0.5⌋$

Typically, $n=8$ giving us $0≤n≤256$

Is it possible to recover $i(x,y)$ from $I(x,y)$? In the case when the continuous is equal to the discrete, this is possible (e.g. a completely flat image). However, if there is a discontinuouty that doesn’t fall at a precise integer, we cannot recover the original continuous image.

A bandlimit is the maximum *spatial frequency* of an image. The audio equivalent of this is audio frequency, the upper limit of human hearing is about 20kHz which is the human hearing bandlimit.

Aliasing is the idea that we don’t have have enough samples to properly reconstruct the original signal so we construct a lower frequency (fidelity) version.

We can reduce aliasing artifacts by doing

- Oversampling: sample more than you think you need and average
- Smoothing before sampling: reduce image frequency

This creates funky patterns on discrete images called moire patterns. This happens in film too (temporal aliasing), this is why wheels sometimes look like they go backwards.

The fundamental result (Sampling Theorem): For bandlimited signals, if you sample regularly at or above twice the maximum frequency (called the Nyquist Rate), then you can reconstruct the original signal exactly.

- Oversampling: nothing bad happens! Just wasted bits.
- Undersampling: things are missing and there are artifacts (things that shouldn’t be there)

### Shrinking Images

We can’t shrink an image simply by taking every second pixel

Artifacts will appear:

- Small phenomena can look bigger (moire patterns in checkerboards and striped shirts)
- Fast phenomena can look slower (wagon wheels rolling the wrong way)

We can sub-sample by using Gaussian pre-filtering (Gaussian blur first then throw away every other column/row). Practically, for ever image reduction of a half, smooth by $σ=1$