Sunday, August 29, 2010

Photo library data mining

I've been very happy with my DSLR, and I really enjoy the photos it produces compared to my previous point-and-shoots. I've been thinking of getting another fast prime lens for low-light indoor shots and for walking around. I have a 50mm prime, and although it takes wonderful portraits, it acts as a short telephoto lens on my crop sensor, so I often back into walls or have trouble capturing the entire scene.

To decide which lens to get, I looked at what I could learn from the EXIF data of the approximately 9,000 photos I've added to my library over the past 8 years or so. In particular, is there a zoom (that is, a focal length) I tend to prefer? Of course, the zoom is intimately tied to the camera itself and its capabilities -- a camera that's limited to a particular zoom range forces me to take photos within that range no matter what (and to correct by walking back and forth to frame the shot, which naturally isn't captured in the EXIF). Nonetheless, I thought this would be a fun and interesting exercise, so let's go mining.


To start, I had to extract the EXIF data from all my photos and put it in a format I could explore. I used the excellent exiftool command-line utility, which displays the full EXIF information. I was only interested in a few EXIF fields, specifically:

  • Camera name/model
  • Zoom range supported by the camera
  • Zoom at which the photo was taken
I wanted this data displayed on a single line per photo, so I used the following Bash script (saved as exinfo.sh):

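#!/bin/bash
# Print "file, camera, focal length, lens" for the photo given as $1.

exiftool "$1" | awk -v fname="$1" -F " : " '
BEGIN {
  camera = "";
  zoom = "";
  lens = "";
}
{
  # exiftool pads tag names with spaces; trim both the tag and the value.
  gsub(/ *$/, "", $1);
  gsub(/^ */, "", $2);
  if ($1 == "Camera Model Name") {
    camera = $2;
  }
  # If a tag appears more than once, the last occurrence wins.
  if ($1 == "Focal Length") {
    zoom = $2;
  }
  if ($1 == "Lens") {
    lens = $2;
  }
}
END {
  printf("%s, %s, %s, %s\n", fname, camera, zoom, lens);
}'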

The script basically pulls out the EXIF lines that start with "Camera Model Name", "Focal Length", and "Lens" and displays their corresponding values, along with the name of the file, on a single line.

I then ran the script over my photo library, all of which resides under a single directory, with the following command:
$ find . -iname "*.jpg" -print0 | xargs -0 -I % ./exinfo.sh "%" > data.txt

Note that I ran this command on Windows (in Cygwin), and I used "-iname" to match both *.jpg and *.JPG files.

The script took a few hours to run (exiftool, for all its awesome qualities, is not particularly fast when started once per file), and in the end I had all the data I needed in data.txt.
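
Much of that run time probably goes into starting a fresh exiftool process for every photo. As a rough sketch of an alternative (not what I ran), a single recursive exiftool call can write the same fields as CSV. The function name exif_to_csv and the output file data.csv are made up for illustration, and I'm assuming FocalLength35efl is the composite tag behind the "Focal Length" line that includes the 35 mm equivalent.

```bash
# Sketch: one exiftool process for the whole tree instead of one per file.
# exif_to_csv is an illustrative name, not part of the original run.
exif_to_csv() {
  local root="${1:-.}"
  # -r recurses into subdirectories, -ext jpg matches .jpg and .JPG,
  # -csv prints one row per file with SourceFile as the first column.
  exiftool -r -ext jpg -csv -Model -FocalLength35efl -Lens "$root"
}

# Usage (not executed here): exif_to_csv . > data.csv
```

The CSV quoting also sidesteps the problem of commas inside camera or lens names, which the plain comma-separated output of the script above does not handle.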

The first interesting part was the focal length of each photo, which is reported in the following format by exiftool:

4.6 mm (35 mm equivalent: 27.2 mm)

I actually wanted the 35 mm equivalent value as a whole number (in this case, 27.2 mm becomes 27). I extracted it in Excel with the following formula, which assumes the value above is in cell B2; note that INT rounds down rather than to the nearest integer:
=INT(MID(B2, FIND(": ",B2) + 2,FIND(" mm)", B2) - FIND(": ", B2) - 2))
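
The same extraction can be sketched on the command line, for anyone who would rather stay out of Excel. This is a minimal version under the assumption that the string always has the "(35 mm equivalent: N mm)" shape shown above; equiv35 is a hypothetical helper name.

```bash
# Sketch: pull the 35 mm equivalent out of an exiftool focal-length string.
# equiv35 is a hypothetical helper name.
equiv35() {
  # Capture the digits between "equivalent: " and " mm)" and drop any
  # fractional part, which mirrors what INT does for positive values.
  # Prints nothing if the string has no 35 mm equivalent.
  printf '%s\n' "$1" |
    sed -n -E 's/.*equivalent: ([0-9]+)(\.[0-9]+)? mm\).*/\1/p'
}

equiv35 "4.6 mm (35 mm equivalent: 27.2 mm)"   # prints 27
```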

Lastly, I produced a histogram of the 35 mm equivalent data.

[Figure: histogram of 35 mm equivalent focal lengths across the photo library]

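The counts behind a histogram like this can also be tallied without a spreadsheet. The sketch below assumes the input is one whole-number focal length per line; focal_histogram is a hypothetical name, and the four sample values are invented purely to show the output format.

```bash
# Sketch: count photos per focal length and print each share of the total.
# Reads one integer focal length per line on stdin.
focal_histogram() {
  sort -n | uniq -c | awk '
    { count[NR] = $1; focal[NR] = $2; total += $1 }
    END {
      for (i = 1; i <= NR; i++)
        printf("%d mm: %d (%.0f%%)\n", focal[i], count[i], 100 * count[i] / total)
    }'
}

# Made-up sample input, not data from my library.
printf '%s\n' 27 35 27 40 | focal_histogram
# 27 mm: 2 (50%)
# 35 mm: 1 (25%)
# 40 mm: 1 (25%)
```
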
The histogram seems to indicate that:
  • 17% of my photos were taken at 27 mm (35 mm equivalent)
  • 34% of my photos were taken in the 35-40 mm range (35 mm equivalent)
    • This was the largest coherent cluster in the entire library
Because my Nikon D90 has a 1.5x crop factor, a 27 mm equivalent corresponds to an actual focal length of 18mm, and the 35-40 mm range corresponds to roughly 23-27 mm. Intuitively, these numbers make sense to me -- I tend to strongly prefer wide-angle shots over zoomed-in ones in general.
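
The conversion is just a division by the crop factor. Here is a small sketch of that arithmetic; actual_focal is a hypothetical name, and the default of 1.5 is the D90 crop factor assumed above.

```bash
# Sketch: convert a 35 mm equivalent focal length back to the actual focal
# length for a given crop factor.
actual_focal() {
  # $1 = 35 mm equivalent in mm, $2 = crop factor (defaults to 1.5)
  awk -v eq="$1" -v crop="${2:-1.5}" 'BEGIN { printf("%.1f\n", eq / crop) }'
}

actual_focal 27   # 18.0
actual_focal 35   # 23.3
actual_focal 40   # 26.7
```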

Nikon makes a number of non-fisheye prime lenses in this range, but many of them are either too wide for my taste (10 or 14mm), very expensive, or of pretty poor quality. The lens that seems to combine good quality with a reasonable price and fits nicely in the range above is the 24mm prime. I'll want to refine this further by weighting the zoom range by the capabilities of each camera, but as a first cut I'm pretty happy with the data, and it agrees with my intuition.

Another nice side effect of this analysis is that it reminded me of all the cameras I've used over the years.
I really liked my PowerShot A70 (and only "upgraded" to the A75 because my A70 broke and there was no way to replace it, since Canon had discontinued the A70). The SD800 was a quantum leap in that it gave me a wider angle (27 vs. 34 mm) and image stabilization (so I could take photos without flash in low light). Nowadays, I only use my Nikon.
