Visual Perception and Aesthetics



We perceive some things as "beautiful", and other things as "ugly". Research shows most of this is universal. If we know why we react and what the factors are, it might help us make stronger, more successful images (whether we're going for 'beauty' or 'ugliness').

Things we like fall into two main categories:

Human - there are roughly four subcategories:
Anatomy - bodies, faces, eyes... The reason is obvious: members of a tribe or family needed to stay close for survival.

Shiny - we evolved to like the shininess of eyes, teeth, tongue, hair, juicy fruit, water, and that translates to other things like metals, jewels, silk...

Translucent - skin, flesh, teeth, food, which translates to other materials and objects; tropical fish, semiprecious stones, frosted glass, flowers, clouds etc...

Incandescent - this could be because of the caustics in eyes, back-lit skin and hair...
It could also be because we're daytime creatures, and so prefer light to dark, in which case this category might fit better under the next heading. In any case, we like any glowing substance, any strong color or light.

Non-Human - for example trees, coral, mountains, waves, lightning, fire, clouds, wide vistas, intricate patterns (snake skin, birds, butterflies)...
(As some categories overlap, and many types of objects vary, some objects fit in more than one category: for instance opals.)

These preferences are evolution's way of strengthening our bonding, helping us recognize good food, making us prefer a wide view of our surroundings, etc.
Some find dangerous things beautiful, others don't - this simply shows the difference between 'adventurers' and 'nurturers'.









I'll focus on the 'Non-Human'. It's obvious why we would like the 'Human', and easy to use this - even without knowing it - in art. In fact we do all the time.

Not so obvious is why we sometimes feel pleasure just staring at a dried tree stump or a Japanese rock garden.
If we can find out exactly what about the stump and the rocks is causing this, we should be able to apply it to any image.
And no, answers like "they are natural", "they have aesthetically pleasing textures", or "their positive and negative spaces are balanced" really aren't answers at all.




A saccade is a special kind of eye movement, usually automatic, that rotates the eyeball a very short distance in a very short time. Here is a representation of the saccades an eye may make when looking at this image; note the 'jerky' quality of the path.

In the 1950s, researchers showed that if saccades are eliminated - so that an image is 'frozen' with respect to the retina - one stops seeing the image within 1-3 seconds, and basically goes blind. The same thing happens when the visible field is too uniform (all one color). This is a "blank field".

But the opposite is just as bad: an "aggressive field". An example: a wall with thousands of regular spots on it, the distance between spots about 15 mm. From a distance of 4 meters the space between the spots subtends roughly 0.2°, about the size of an average saccade. Looking at it soon gets extremely uncomfortable. 1960s Op Art was based on this. If taken too far this can cause nausea, or even seizures in some people. See if you enjoy looking at this for any length of time:



And just to clarify, I'm not talking about any old polka-dot pattern, as in this dress. The image of this dress is, in contrast to the above pattern, very varied, and does not have too many separate high-contrast foci.
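
As an aside, the angle between neighbouring spots in the wall example can be checked with basic trigonometry. This is only a sketch: the function name visual_angle_deg is made up for illustration, and the 15 mm spacing and 4 m viewing distance are simply the figures from the example above. The result is a little over 0.2°.

```python
import math

def visual_angle_deg(size_m, distance_m):
    """Angle in degrees subtended by something size_m wide seen from distance_m away."""
    return math.degrees(2 * math.atan(size_m / (2 * distance_m)))

spot_spacing_m = 0.015      # 15 mm between spots (from the example)
viewing_distance_m = 4.0    # 4 meters from the wall (from the example)

angle = visual_angle_deg(spot_spacing_m, viewing_distance_m)
print(f"{angle:.3f} degrees between neighbouring spots")
```

Halving the viewing distance roughly doubles the angle, which is why the same wall can feel tolerable up close and unpleasant from across the room.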


















So too little contrast is bad, and too much is bad. But medium contrast isn't the answer either; the above pattern would be almost as bad if it were low contrast, or had fewer dots (see 'Composition' below). So what's the attribute we're looking for?


Change, the spice of life

What can a tree stump, a rust stain, a coral reef and a sunset have in common - what can make them all aesthetically appealing to us?

Change, but more than that: Changes of Change.

We prefer variation. 
But we like varying variation even more.
And we like varying variations of variations even more...  And so on.
Let's call it Levels of Variation.

We could think of LoVs as being similar to harmonics in music - L0 would be without any variation, a flat line;



L1 would be adding a modulation to L0, perhaps a sine wave;



L2 would be adding a higher frequency to that sine wave, and so on.





You could also think of them as equivalent to the hierarchical levels of a Subdivision Surface in a 3D program.
Each higher LoV represents a (higher resolution) change or modulation to the preceding lower level.  (Of course the number of possible kinds of modulation is practically infinite.)
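
The harmonics analogy can be played with in a few lines of code. The sketch below is my own illustration, not a definition of LoVs: it builds a curve by stacking sine waves, where each level has three times the frequency and a third of the amplitude of the one before (the factor 3 is an arbitrary assumption), and then counts how often the curve changes direction. The names lov_curve and count_turning_points are hypothetical.

```python
import math

def lov_curve(levels, samples=256):
    """Return one period of a curve built from `levels` stacked sine waves.

    levels=0 gives the flat line (L0). Each further level adds a sine with
    three times the frequency and one third the amplitude of the previous one.
    """
    curve = []
    for i in range(samples):
        x = i / samples
        value = 0.0
        for level in range(1, levels + 1):
            frequency = 3 ** (level - 1)
            value += math.sin(2 * math.pi * frequency * x) / frequency
        curve.append(value)
    return curve

def count_turning_points(curve):
    """Count how often the curve switches between rising and falling."""
    turns = 0
    previous_step = 0.0
    for a, b in zip(curve, curve[1:]):
        step = b - a
        if step == 0.0:
            continue  # flat stretch: no direction to compare
        if previous_step != 0.0 and (step > 0) != (previous_step > 0):
            turns += 1
        previous_step = step
    return turns

for levels in range(4):
    print("L%d: %d turning points" % (levels, count_turning_points(lov_curve(levels))))
```

The turning-point count is only a crude stand-in for "interest", but it shows the flat line having none and each added level giving the eye more changes to follow.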


Here's Christian Bale from Equilibrium - on the left heavy smudging to smooth the details - removing the higher frequencies. On the right minimal retouching (just to remove a heavy grain). The curves below represent what part of this face might look like in cross-section.

Note that the image, and the curve, on the right is by far the more interesting, in the primal basic sense I use the word here - the eyes are instinctively more attracted to it.



Another attribute that can raise interest is Breaks - discontinuities, a change so abrupt it can't be defined as a modulation. Breaks have Borders.
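
To make the difference between a modulation and a Break concrete, here is a rough sketch that scans one row of brightness values and flags any step between neighbouring pixels that is larger than a threshold. The function name, the sample values and the 0.25 threshold are all assumptions chosen for illustration; where a steep modulation ends and a Break begins is a judgment call.

```python
def find_breaks(values, threshold=0.25):
    """Return the indices i where the jump from values[i] to values[i + 1] exceeds threshold."""
    return [
        i
        for i in range(len(values) - 1)
        if abs(values[i + 1] - values[i]) > threshold
    ]

# A gentle ramp (a modulation), then an abrupt jump (a Break), then another ramp.
scan_line = [0.10, 0.12, 0.15, 0.17, 0.80, 0.82, 0.85]

print(find_breaks(scan_line))
```

The index it reports marks the Border: everything on one side belongs to one area, everything on the other side to the next.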

I think images can be classified according to how many LoVs and Breaks they have (and the size of those Breaks, and the shape of the Border between them). I also think the most important aspect of these is the LoVs.
Here's an example:

L0, no Break: an image consisting of all pixels the same color.  Extremely boring to look at.

L1: a very low frequency modulation, across the whole image.  Strangely, with such a small change, it's now a lot more pleasant to look at.

L2: a variation on the variation, but NOT covering all of the previous level - it could, but that leads to less multi-leveled variation, and so to a less interesting image.


L3 (in fact my sloppy brushstrokes add a bit of L4 as well): again, not covering all of the previous level.  It’s looking much better.



A Break containing an L1 texture. This doesn't improve it that much; the L1 needs to be modulated into a higher level.


Estimated L4 or 5: an image with a huge amount of interesting detail.


This landscape, from which the above image was cropped, I estimate is mostly L3 or 4 at this resolution, and the Borders between the many Breaks are very varied in shape. High L, many Breaks, varying Borders = a more interesting image.

On the right I've simplified it as much as I could without losing its identity, making it mostly L1 and 2. Look at them both for a minute or so. Which one of these could hold your attention longer, and why?



Good cg is realistic, bad cg isn't.
This doesn't mean a cg image necessarily and always has to be realistic, like a photograph - but in practice, it turns out that's usually the direction it should be taken. Even cartoony images must follow the laws of perspective and light, and have a logic to the design and anatomy, even if it's only internal.
(As you may have guessed, I'm not partial to abstract art, especially in cg.)

As discussed above, every successful image has some or all of these attributes:
1. Detail (complexity)
2. Variation of this detail
3. Breaks between different areas

But how to apply it? The simplest way is to look at reality. Reality has an infinity of this.  As a rule, cg is the direct opposite. This is why I hold reality up as the standard.
Below is an image deliberately created to be unreal and ugly.  You may often see very similar images from beginners. Note how confusing, flat and uninteresting it is, how alien to our depth perception.  Hardly any edge is defined; there's lots of detail and many Breaks - but no LoVs.   This leads to a huge compositional problem, mainly due to poor distribution of the contrasts, the foci of attention, as mentioned in the Composition section.
How do we fix this horror?   The easiest way is to simply make it more realistic.


1. Shading
Add shading to the objects in the scene; this raises the LoV and improves depth cueing.



2. Shadows and texture complexity
Don't confuse a strong texture with a complex one. All textures must be above a certain level of complexity, OR they must be very muted and subtle. NEVER use a simplistic texture on its own with very high contrasts, or strong colors. A single layer of fractal noise just isn't enough - as described above, we want more than that. Note that I placed the green bulge texture in the bump channel instead, and removed it completely from the smaller cube.

It is best to use real photo textures as a starting point; there are plenty of them on this site.



4. Soft shadows, details
A good general rule is: the more varied detail the better.   For instance, any surface should usually have a different texture at the top than at the bottom, to the left and to the right... keywords are SUBTLE and VARIATIONS.

5. Reflections, GI etc
Still ugly colors, but that's easily fixed, and otherwise it's much improved. Next we should improve the composition, by moving the foci around, by playing with the light, placement, camera angle, adding or removing objects etc. The strongest focus is off-center like it should be (upper edge of the biggest box), but almost too close to the edge. The second focus, the brightest part of the wall, is definitely too close to the edge, leading the eye right out. There's another focus near the lower right hand corner doing the same thing there. I'd suggest moving the blue disc to the right, and darkening the upper part of the wall. Other than that, the composition is okay.


Here's a perfectly symmetrical image, with only two foci of exactly equal weight, close together.

As an image it fails to hold our attention for very long, mostly because it soon becomes tiresome to look at. 
On the right, see a rough indication (in red) of how the eyes will probably move across the image. 
They become 'caught' in the center, hardly venturing outside - there just isn't anything to lead them there, and to 'stick to' once there.  And if the eyes do leave the center, they stay glued to the edges and corners of the image.  This is uncomfortable.
(I'm guessing because here our visual system can't do what it's supposed to do very efficiently - gather information.)

A similarly bad image would result from using only one focus, even if not centered. 


But how about moving the 2 foci further apart, to lead the eye around more real estate?

Even worse - this configuration will quickly tire the eyes.  (Try it; enlarge it and look at it intently for a minute.)

Six similar foci are almost as bad, as you can see on the right. In fact, the more foci, the worse it gets, seems to be the rule - see the 'aggressive field' above.




So let's try something else.  Maybe 2 slanted?  A little bit better - now the eye goes in a triangle between the foci and some of the corners. 

But even better is lowering the amplitude of one focus, giving it less 'weight' to the eye.  

Let's add more foci...


After much experimenting, I came up with this:


One of many possible configurations that would be close to the theoretical ideal.
Here are the guidelines:
1.  All foci of descending weight; no two exactly the same. This is more important the more important the focus.
2.  All foci off-center, with respect to the positive and negative spaces created between them. The spacing of the foci should vary - the negative and positive spaces should be as varied as the foci themselves, with no two of exactly the same weight, size and shape.
3.  None of the main foci too close to the edges of the image, which could lead the eye outside of the frame (a well-known no-no in composition).
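
The guidelines above can be turned into a rough checklist in code. This is only a sketch under my own assumptions: foci are given as hypothetical (x, y, weight) tuples with x and y running from 0 to 1 across the image, the thresholds edge_margin and tolerance are arbitrary, and the spacing test is a crude proxy for guideline 2 rather than a real measure of positive and negative space.

```python
import math
from itertools import combinations

def check_foci(foci, edge_margin=0.1, tolerance=0.05):
    """Return a list of guideline violations for foci given as (x, y, weight)."""
    problems = []

    # Guideline 1: no two foci of the same weight.
    for (i, a), (j, b) in combinations(enumerate(foci), 2):
        if abs(a[2] - b[2]) < tolerance:
            problems.append(f"foci {i} and {j} have nearly equal weight")

    # Guideline 2 (crude proxy): the distances between foci should all differ.
    spacings = [math.dist(a[:2], b[:2]) for a, b in combinations(foci, 2)]
    for d1, d2 in combinations(spacings, 2):
        if abs(d1 - d2) < tolerance:
            problems.append("two spacings between foci are nearly equal")

    # Guideline 3: no focus too close to the edge of the frame.
    for i, (x, y, _weight) in enumerate(foci):
        if min(x, 1 - x, y, 1 - y) < edge_margin:
            problems.append(f"focus {i} is too close to the edge")

    return problems

symmetrical = [(0.4, 0.5, 1.0), (0.6, 0.5, 1.0)]
varied = [(0.3, 0.35, 1.0), (0.7, 0.6, 0.6), (0.55, 0.45, 0.3)]

print(check_foci(symmetrical))
print(check_foci(varied))
```

The symmetrical pair is flagged for its equal weights, while the varied trio passes all three checks. Passing does not make a composition good, of course; it only means none of these particular mistakes was made.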

All this makes it possible for the eye to travel widely around the picture, going around many times along many different paths.  (As you can see, for this purpose triangular and curved paths are more useful than square-rectangular, which explains why too many verticals-horizontals can kill a composition.)
This 'off-center' issue is sometimes referred to as 'The Rule of Thirds', or its more precise implementation, 'The Golden Section'.
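
For anyone who wants numbers, the two rules put their guide lines at slightly different places. The sketch below computes both for one side of an image; the function names and the 1920 x 1080 frame are assumptions for illustration only.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # the golden ratio, about 1.618

def thirds_lines(length):
    """Positions of the two Rule of Thirds lines along a side of the given length."""
    return (length / 3, 2 * length / 3)

def golden_lines(length):
    """Positions of the two Golden Section lines along a side of the given length."""
    return (length - length / PHI, length / PHI)

width, height = 1920, 1080  # hypothetical frame size in pixels

print("thirds, x:", [round(v) for v in thirds_lines(width)])
print("golden, x:", [round(v) for v in golden_lines(width)])
print("thirds, y:", [round(v) for v in thirds_lines(height)])
print("golden, y:", [round(v) for v in golden_lines(height)])
```

The Golden Section lines sit a little closer to the center than the thirds lines (at about 38% and 62% of the side instead of 33% and 67%); either way the focus ends up off-center.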



© Steven Ståhlberg 1998, 1999, 2000, 2002, 2003, 2004, 2005