Kelvin Tan - Solr/Elasticsearch Consultant - Download KhanAcademy videos with a PHP crawler

Posted by Kelvin on 08 Oct 2011 at 07:35 pm | Tagged as: programming, PHP

At the moment (October 2011), there's no simple way to download all videos from a playlist from KhanAcademy.org.

This simple PHP crawler script changes that. 🙂

What it does is downloads the videos (from archive.org) to a subfolder, numbering and naming the videos with the respective titles (not the gibberish titles that archive.org has assigned them). Additionally, through the use of wget –continue, the crawler has auto-resume support, so even if your computer crashes in the middle of a crawl, you don't need to start all over again.

Usage

Usage is like this, assuming the script is named downkhan.php:

php downkhan.php {folder} {urls.txt}
php downkhan.php history history.txt

where folder is the subdirectory to save the videos in, and urls.txt is a list of urls obtained by running a regex on http://www.khanacademy.org/#browse.

Regex

The regex used was

href="(.*?)".*?><span.*?>(.*?)</span>

urls

Here is a few lines of a urls.txt file:

http://www.khanacademy.org/video/scale-of-earth-and--sun?playlist=Cosmology+and+Astronomy|Scale of Earth and  Sun
http://www.khanacademy.org/video/scale-of-solar-system?playlist=Cosmology+and+Astronomy|Scale of Solar System
http://www.khanacademy.org/video/scale-of-distance-to-closest-stars?playlist=Cosmology+and+Astronomy|Scale of Distance to Closest Stars

Here's a list of what I've created so far:

http://www.supermind.org/code/history.txt
http://www.supermind.org/code/biology.txt
http://www.supermind.org/code/finance.txt
http://www.supermind.org/code/cosmology.txt
http://www.supermind.org/code/healthcare.txt
http://www.supermind.org/code/linearalgebra.txt
http://www.supermind.org/code/statistics.txt

script code

And here's the script:

&lt;?php 
$args = $_SERVER['argv'];
$folder = $args[1];
$file = $args[2];
 
$arr = explode("\n", trim(file_get_contents(getcwd()."/".$file)));
$urls = array();
foreach($arr as $k) {
  $split = explode("|", $k);
  $urls[$split[0]] = $split[1];
}
 
 
mkdir($folder);
chdir($folder);
$counter = 0;
 
foreach($urls as $url=>$title) {
  $counter++;
 
  echo "Fetching $url\n";
  $html = '';
  while(!$html) $html = fetch_url($url);
  $vid = get_match("/<a href=\"(http:\/\/www.archive.org.*?)\"/", $html);
  $outfile = "$counter. $title.mp4";
 
  `wget --continue $vid -O "$outfile"`;  
}
 
function get_match($pattern, $s) {
  preg_match($pattern, $s, $matches);
  if($matches) {
    return $matches[1];
  } else return NULL;
}
 
function fetch_url($url)
{
    $curl_handle = curl_init(); // initialize curl handle
    curl_setopt($curl_handle, CURLOPT_URL, $url); // set url to post to
    curl_setopt($curl_handle, CURLOPT_FAILONERROR, 1);
    curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($curl_handle, CURLINFO_TOTAL_TIME, 20);
    curl_setopt($curl_handle, CURLOPT_FOLLOWLOCATION, 1); // allow redirects
    curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1); // return into a variable
    curl_setopt($curl_handle, CURLOPT_HTTPHEADER, array('Accept: */*', 'User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows)'));
    $result = curl_exec($curl_handle); // run the whole process
    if (curl_exec($curl_handle) === false) {
        echo 'Curl error: ' . curl_error($curl_handle);
    }
    curl_close($curl_handle);
    return $result;
}
 
function rel2abs($rel, $base)
{
    /* return if already absolute URL */
    if (parse_url($rel, PHP_URL_SCHEME) != '') return $rel;
 
    /* queries and anchors */
    if ($rel[0] == '#' || $rel[0] == '?') return $base . $rel;
 
    /* parse base URL and convert to local variables:
 $scheme, $host, $path */
    extract(parse_url($base));
 
    /* remove non-directory element from path */
    $path = preg_replace('#/[^/]*$#', '', $path);
 
    /* destroy path if relative url points to root */
    if ($rel[0] == '/') $path = '';
 
    /* dirty absolute URL */
    $abs = "$host$path/$rel";
 
    /* replace '//' or '/./' or '/foo/../' with '/' */
    $re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
    for ($n = 1; $n > 0; $abs = preg_replace($re, '/', $abs, -1, $n)) {
    }
 
    /* absolute URL is ready! */
    return $scheme . '://' . $abs;
}

No Comments »

Supermind Search Consulting Blog Solr - Elasticsearch - Big Data

Download KhanAcademy videos with a PHP crawler

Usage

Regex

urls

script code

Supermind Search Consulting Blog
Solr - Elasticsearch - Big Data