I wanted to find the durations of a bunch of MP4 files located out on the net – durations for the introduction videos for the top Kickstarter projects.
But I wanted to do this quickly, and downloading all those MP4 files would take too long. A little research revealed that MP4 files set up for streaming have their metadata (the moov atom) at the beginning of the file.
Now I need a way to read just the metadata, without getting the entire file.
More research revealed that I can use curl and dd to get just the first bytes of a file. For some reason ‘curl -r’ (a byte-range request) didn’t work for me.
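For comparison, here is a minimal Python 3 sketch of the same "read only the first bytes" idea. The function name first_bytes and the default of 512 bytes are my own choices for illustration. It assumes the server starts sending the file from the beginning, and the operating system may already have buffered somewhat more than n bytes by the time the connection is closed.

```python
import urllib.request


def first_bytes(url, n=512):
    """Return at most the first n bytes of the resource at url."""
    with urllib.request.urlopen(url) as resp:
        # read(n) pulls at most n bytes of the body; leaving the
        # with-block closes the connection without reading the rest.
        return resp.read(n)
```

Called as first_bytes(video_url), this should hand back roughly what the curl and dd pair produces.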
So now we’re ready to go.
I made a file that had one Kickstarter project URL per line. Here are a couple of them:
http://www.kickstarter.com/projects/formlabs/form-1-an-affordable-professional-3d-printer
http://www.kickstarter.com/projects/1523379957/oculus-rift-step-into-the-game
This command loads each Kickstarter project page and extracts the URL-encoded download link for the project’s introductory video, if there is one:
$ cat ks-urls | xargs -Ifoo sh -c "curl -s foo|grep link |grep http://www.kickstarter.com/swf/kickplayer.swf |cut -d '&' -f 5| sed -e 's/amp;file=//g' " > ks-video-urls
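To make the cut and sed steps easier to follow, here is a rough Python 3 equivalent for a single line of page source. The function name extract_file_param is hypothetical, and returning None for lines with fewer than five fields is a simplification of what cut does.

```python
def extract_file_param(line):
    """Mimic `cut -d '&' -f 5 | sed 's/amp;file=//g'` on one line of HTML."""
    fields = line.split("&")
    if len(fields) < 5:
        # Fewer than five '&'-separated fields: treat as "no video link".
        return None
    # Field 5 (index 4) looks like 'amp;file=<encoded URL>' because the
    # page source escapes '&' as '&amp;'; strip that prefix.
    return fields[4].replace("amp;file=", "")
```

The result is still URL-encoded, just like the contents of ks-video-urls.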
Now we need to URL-decode the URLs:
$ cat ks-video-urls | python -c 'import sys, urllib; print urllib.unquote_plus(sys.stdin.read())' > ks-decoded-video-urls
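The one-liner above uses Python 2 syntax. If you only have Python 3, a sketch of the same decoding step could look like this (decode_lines is a name I made up; unquote_plus lives in urllib.parse in Python 3):

```python
from urllib.parse import unquote_plus


def decode_lines(text):
    """URL-decode each non-empty line of text and return them as a list."""
    return [unquote_plus(line) for line in text.splitlines() if line.strip()]
```

Decoding line by line also avoids the extra blank line that printing the whole input at once leaves at the end of the file.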
Now we get the durations from the video URLs. You’ll need Python, pip, and virtualenvwrapper installed. We make a Python virtual environment and install the hsaudiotag module to decode the MP4 metadata:
$ mkvirtualenv mp4
$ pip install hsaudiotag
$ cat ks-decoded-video-urls| xargs -Ifoo sh -c "curl -s foo| dd count=1 2>/dev/null | python -c 'import sys, StringIO; from hsaudiotag import mp4; s=StringIO.StringIO(sys.stdin.read()); print mp4.File(s).duration'" > ks-video-durations
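This command uses curl and dd to keep only the first 512-byte block of each MP4 file: dd’s default block size is 512 bytes, so count=1 stops after one block, and curl stops downloading once dd closes the pipe.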
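To show what is actually being read out of those first bytes, here is a minimal Python 3 sketch that finds the mvhd atom inside moov and computes the duration from its timescale and duration fields. This is not how hsaudiotag is implemented; it is my own simplified parser, the names iter_atoms and mp4_duration are made up, and it assumes the moov atom really is near the start of the data.

```python
import struct


def iter_atoms(data, start, end):
    """Yield (type, payload_start, payload_end) for atoms in data[start:end]."""
    pos = start
    while pos + 8 <= end:
        size, kind = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:
            # A size of 1 means a 64-bit size follows the type field.
            if pos + 16 > end:
                return
            size = struct.unpack_from(">Q", data, pos + 8)[0]
            header = 16
        elif size == 0:
            # A size of 0 means the atom runs to the end of the data.
            size = end - pos
        if size < header:
            return  # malformed atom, stop scanning
        # Clamp to what we have: we only hold a prefix of the file.
        yield kind, pos + header, min(pos + size, end)
        pos += size


def mp4_duration(data):
    """Return the movie duration in seconds, or None if mvhd is not found."""
    for kind, start, end in iter_atoms(data, 0, len(data)):
        if kind != b"moov":
            continue
        for sub, s, e in iter_atoms(data, start, end):
            if sub != b"mvhd" or e - s < 20:
                continue
            if data[s] == 1:
                # Version 1: 64-bit timestamps, then timescale and duration.
                if e - s < 32:
                    return None
                timescale, duration = struct.unpack_from(">IQ", data, s + 20)
            else:
                # Version 0: 32-bit timestamps, then timescale and duration.
                timescale, duration = struct.unpack_from(">II", data, s + 12)
            return duration / timescale if timescale else None
    return None
```

The duration field is stored in timescale units, so dividing the two gives seconds. If the moov atom sits at the end of the file instead, this returns None for a 512-byte prefix.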
Now we analyze the durations using a simple R script. I am on a Mac, so I use Homebrew to install R:
$ brew install gfortran
$ brew install R
$ R -q -e "x <- read.csv('ks-video-durations', header = F); summary(x); sprintf('standard deviation: %f', sd(x[ , 1]))"
Output for the top 100 Kickstarter technology projects (by amount raised) - all numbers are in seconds:
Min. : 52.0
1st Qu.:145.8
Median :183.5
Mean :203.3
3rd Qu.:246.5
Max. :583.0
[1] "standard deviation: 90.14273"
The average duration of the videos for the top 100 Kickstarter technology projects is 203.3 seconds, or about 3.39 minutes (3 minutes 23 seconds).
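If you would rather not install R, the same summary can be sketched with Python 3's statistics module. The function name summarize is my own, and I am assuming the "inclusive" quantile method is the one that corresponds to R's default, so treat small differences in the quartiles as possible.

```python
import statistics


def summarize(durations):
    """Return min, quartiles, mean, max and sample standard deviation."""
    q1, median, q3 = statistics.quantiles(durations, n=4, method="inclusive")
    return {
        "min": min(durations),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(durations),
        "q3": q3,
        "max": max(durations),
        "sd": statistics.stdev(durations),  # sample standard deviation, like R's sd()
    }
```

To use it on the file from earlier, read ks-video-durations into a list of floats (one value per line) and pass that list to summarize.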
Thanks to: