Download KhanAcademy videos with a PHP crawler
Posted by Kelvin on 08 Oct 2011 at 07:35 pm | Tagged as: programming, PHP
At the moment (October 2011), there's no simple way to download all videos from a playlist from KhanAcademy.org.
This simple PHP crawler script changes that. 🙂
The script downloads the videos (from archive.org) to a subfolder, numbering each one and naming it with its proper title (not the gibberish name that archive.org has assigned it). Additionally, because it calls wget --continue, the crawler has auto-resume support, so even if your computer crashes in the middle of a crawl, you don't need to start all over again.
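As a rough sketch of the naming and resume step, the output name and the wget command could be built like this. The helper name build_wget_command, the "/" replacement, and the example video URL are assumptions for illustration, not part of the original script.

```php
<?php
// Hypothetical helper: build the numbered output name and the wget command.
function build_wget_command($counter, $title, $videoUrl) {
  // Assumption: replace "/" so a title cannot point outside the target folder.
  $outfile = $counter . '. ' . str_replace('/', '-', $title) . '.mp4';
  // escapeshellarg() quotes each argument so spaces and quotes in titles are safe.
  return 'wget --continue ' . escapeshellarg($videoUrl) . ' -O ' . escapeshellarg($outfile);
}

// Placeholder URL, not a real archive.org path.
$cmd = build_wget_command(1, 'Scale of Earth and Sun', 'http://www.archive.org/download/example/example.mp4');
echo $cmd, "\n";
```

Because the file name depends only on the position in the list and the title, running the same command again after an interruption targets the same file, which is what lets wget pick up where it left off.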
Usage
Usage is like this, assuming the script is named downkhan.php:
php downkhan.php {folder} {urls.txt}
php downkhan.php history history.txt
where folder is the subdirectory to save the videos in, and urls.txt is a list of urls obtained by running a regex on http://www.khanacademy.org/#browse.
Regex
The regex used was
href="(.*?)".*?><span.*?>(.*?)</span>
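For illustration, here is one way that regex could be applied in PHP to produce the url|title lines the crawler expects. The sample markup and the helper name extract_url_lines are assumptions; the real browse page may be structured differently (for example, with relative links).

```php
<?php
// Hypothetical sample markup; the real browse page may differ.
$html = <<<HTML
<li><a href="http://www.khanacademy.org/video/scale-of-earth-and--sun?playlist=Cosmology+and+Astronomy"><span class="title">Scale of Earth and Sun</span></a></li>
<li><a href="http://www.khanacademy.org/video/scale-of-solar-system?playlist=Cosmology+and+Astronomy"><span class="title">Scale of Solar System</span></a></li>
HTML;

// Hypothetical helper: turn every regex match into one "url|title" line.
function extract_url_lines($html) {
  preg_match_all('#href="(.*?)".*?><span.*?>(.*?)</span>#', $html, $matches, PREG_SET_ORDER);
  $lines = array();
  foreach ($matches as $m) {
    // $m[1] is the link target, $m[2] is the text inside the span.
    $lines[] = $m[1] . '|' . trim($m[2]);
  }
  return $lines;
}

echo implode("\n", extract_url_lines($html)), "\n";
```

Redirecting that output to a file gives something in the same shape as the urls.txt files described here.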
urls
Here are a few lines of a urls.txt file:
http://www.khanacademy.org/video/scale-of-earth-and--sun?playlist=Cosmology+and+Astronomy|Scale of Earth and Sun
http://www.khanacademy.org/video/scale-of-solar-system?playlist=Cosmology+and+Astronomy|Scale of Solar System
http://www.khanacademy.org/video/scale-of-distance-to-closest-stars?playlist=Cosmology+and+Astronomy|Scale of Distance to Closest Stars
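Each line is a URL and a title separated by a pipe. A minimal sketch of reading that format into an ordered url => title map, tolerating blank or malformed lines, might look like this (the helper name parse_url_list and the sample input are assumptions for illustration):

```php
<?php
// Hypothetical helper: parse "url|title" lines into an ordered url => title map.
function parse_url_list($text) {
  $urls = array();
  // \R matches any line ending, so Windows-style files work too.
  foreach (preg_split('/\R/', trim($text)) as $line) {
    // A limit of 2 keeps any later "|" characters inside the title.
    $parts = explode('|', trim($line), 2);
    if (count($parts) < 2 || $parts[0] === '') continue; // skip blank or malformed lines
    $urls[$parts[0]] = $parts[1];
  }
  return $urls;
}

$sample = "http://www.khanacademy.org/video/scale-of-solar-system?playlist=Cosmology+and+Astronomy|Scale of Solar System\r\n"
        . "\n"
        . "not a valid line\n";
$parsed = parse_url_list($sample);
print_r($parsed);
```

Since PHP arrays keep insertion order, the position of a line in the file determines the number the video gets when it is saved.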
Here's a list of what I've created so far:
http://www.supermind.org/code/history.txt
http://www.supermind.org/code/biology.txt
http://www.supermind.org/code/finance.txt
http://www.supermind.org/code/cosmology.txt
http://www.supermind.org/code/healthcare.txt
http://www.supermind.org/code/linearalgebra.txt
http://www.supermind.org/code/statistics.txt
script code
And here's the script:
<?php
// Usage: php downkhan.php {folder} {urls.txt}
$args = $_SERVER['argv'];
$folder = $args[1];
$file = $args[2];

// Each line of the list file is "url|title".
$arr = explode("\n", trim(file_get_contents(getcwd()."/".$file)));
$urls = array();
foreach($arr as $k) {
  $split = explode("|", trim($k), 2);
  if(count($split) < 2) continue; // skip blank or malformed lines
  $urls[$split[0]] = $split[1];
}

if(!is_dir($folder)) mkdir($folder); // the folder already exists when resuming
chdir($folder);

$counter = 0;
foreach($urls as $url=>$title) {
  $counter++;
  echo "Fetching $url\n";

  // Retry a few times instead of looping forever on a dead link.
  $html = '';
  for($try = 0; !$html && $try < 5; $try++) $html = fetch_url($url);
  if(!$html) {
    echo "Giving up on $url\n";
    continue;
  }

  $vid = get_match("/<a href=\"(http:\/\/www\.archive\.org.*?)\"/", $html);
  if(!$vid) {
    echo "No archive.org link found for $url\n";
    continue;
  }

  $outfile = "$counter. $title.mp4";
  // escapeshellarg() keeps spaces and quotes in titles from breaking the command
  shell_exec("wget --continue ".escapeshellarg($vid)." -O ".escapeshellarg($outfile));
}

// Return the first capture group of $pattern in $s, or NULL if there is no match.
function get_match($pattern, $s) {
  preg_match($pattern, $s, $matches);
  if($matches) {
    return $matches[1];
  }
  else return NULL;
}

// Fetch a page with curl; returns the body, or false on failure.
function fetch_url($url) {
  $curl_handle = curl_init(); // initialize curl handle
  curl_setopt($curl_handle, CURLOPT_URL, $url); // set url to fetch
  curl_setopt($curl_handle, CURLOPT_FAILONERROR, 1);
  curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 5);
  curl_setopt($curl_handle, CURLOPT_TIMEOUT, 20); // overall timeout in seconds
  curl_setopt($curl_handle, CURLOPT_FOLLOWLOCATION, 1); // allow redirects
  curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1); // return into a variable
  curl_setopt($curl_handle, CURLOPT_HTTPHEADER, array('Accept: */*', 'User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows)'));
  $result = curl_exec($curl_handle); // run the request once
  if ($result === false) {
    echo 'Curl error: ' . curl_error($curl_handle) . "\n";
  }
  curl_close($curl_handle);
  return $result;
}

function rel2abs($rel, $base) {
  /* return if already absolute URL */
  if (parse_url($rel, PHP_URL_SCHEME) != '') return $rel;

  /* queries and anchors */
  if ($rel[0] == '#' || $rel[0] == '?') return $base . $rel;

  /* parse base URL and convert to local variables: $scheme, $host, $path */
  extract(parse_url($base));

  /* remove non-directory element from path */
  $path = preg_replace('#/[^/]*$#', '', $path);

  /* destroy path if relative url points to root */
  if ($rel[0] == '/') $path = '';

  /* dirty absolute URL */
  $abs = "$host$path/$rel";

  /* replace '//' or '/./' or '/foo/../' with '/' */
  $re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
  for ($n = 1; $n > 0; $abs = preg_replace($re, '/', $abs, -1, $n)) { }

  /* absolute URL is ready! */
  return $scheme . '://' . $abs;
}