What is OSTEP

I’m taking the undergrad OS course at UW-Madison this semester (CS537: Operating System).
Our professor is Remzi H. Arpaci-Dusseau, who is apparently very well known in the OS community (at least that’s my impression from the Google search results).

We are using a textbook called Operating Systems: Three Easy Pieces (OSTEP), written by Remzi. Remzi thinks that textbooks should be free (cf. his blog), and that’s what he did with OSTEP: it is available online as separate PDFs, free to download and read.

However, I’m just too lazy to download tens of files or read them in the browser (which is absolutely painful).

So I wrote a script to download all the up-to-date chapters as PDFs and stitch them together into a single PDF with a table of contents.

Here is how

It is easy to notice that the links to the chapters of the book share a pattern that distinguishes them from the other links on the page.

They all:

  • end with the extension .pdf
  • have a sibling called small, which contains the chapter number
  • have a parent called td, which is a cell in the chapter list table

Therefore we just need to find all those links and sort them by chapter number to get all the separate PDFs.

To achieve that, we use BeautifulSoup to parse the website source:

from bs4 import BeautifulSoup
import requests
base_url = 'http://pages.cs.wisc.edu/~remzi/OSTEP/'
html = requests.get(base_url)
soup = BeautifulSoup(html.text, 'html.parser')
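for link in soup.find_all('a'):
    value = link.get('href')
    # Chapter links are the ones whose href ends in .pdf
    if value and value.endswith('.pdf'):
        # The enclosing table cell holds the chapter number in a <small> tag
        parent = link.find_parent("td")
        # PDF links outside the table have no <td> parent, so skip them
        chap_num = parent.find_all("small") if parent is not None else []
        if chap_num:
            print(str(chap_num[0].contents[0]) + " " + base_url + value)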

And this should give you output like:

∅ everette @ LER0ever-Desktop [Code/Python/ostep] → python exp.py
3 http://pages.cs.wisc.edu/~remzi/OSTEP/dialogue-virtualization.pdf
12 http://pages.cs.wisc.edu/~remzi/OSTEP/dialogue-vm.pdf
25 http://pages.cs.wisc.edu/~remzi/OSTEP/dialogue-concurrency.pdf
35 http://pages.cs.wisc.edu/~remzi/OSTEP/dialogue-persistence.pdf
4 http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-intro.pdf
13 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-intro.pdf
26 http://pages.cs.wisc.edu/~remzi/OSTEP/threads-intro.pdf
36 http://pages.cs.wisc.edu/~remzi/OSTEP/file-devices.pdf
1 http://pages.cs.wisc.edu/~remzi/OSTEP/dialogue-threeeasy.pdf
5 http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-api.pdf
14 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-api.pdf
27 http://pages.cs.wisc.edu/~remzi/OSTEP/threads-api.pdf
37 http://pages.cs.wisc.edu/~remzi/OSTEP/file-disks.pdf
2 http://pages.cs.wisc.edu/~remzi/OSTEP/intro.pdf
6 http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-mechanisms.pdf
15 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-mechanism.pdf
28 http://pages.cs.wisc.edu/~remzi/OSTEP/threads-locks.pdf
38 http://pages.cs.wisc.edu/~remzi/OSTEP/file-raid.pdf
7 http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-sched.pdf
16 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-segmentation.pdf
29 http://pages.cs.wisc.edu/~remzi/OSTEP/threads-locks-usage.pdf
39 http://pages.cs.wisc.edu/~remzi/OSTEP/file-intro.pdf
8 http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-sched-mlfq.pdf
17 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-freespace.pdf
30 http://pages.cs.wisc.edu/~remzi/OSTEP/threads-cv.pdf
40 http://pages.cs.wisc.edu/~remzi/OSTEP/file-implementation.pdf
9 http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-sched-lottery.pdf
18 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-paging.pdf
31 http://pages.cs.wisc.edu/~remzi/OSTEP/threads-sema.pdf
41 http://pages.cs.wisc.edu/~remzi/OSTEP/file-ffs.pdf
10 http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-sched-multi.pdf
19 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-tlbs.pdf
32 http://pages.cs.wisc.edu/~remzi/OSTEP/threads-bugs.pdf
42 http://pages.cs.wisc.edu/~remzi/OSTEP/file-journaling.pdf
11 http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-dialogue.pdf
20 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-smalltables.pdf
33 http://pages.cs.wisc.edu/~remzi/OSTEP/threads-events.pdf
43 http://pages.cs.wisc.edu/~remzi/OSTEP/file-lfs.pdf
21 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-beyondphys.pdf
34 http://pages.cs.wisc.edu/~remzi/OSTEP/threads-dialogue.pdf
44 http://pages.cs.wisc.edu/~remzi/OSTEP/file-ssd.pdf
22 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-beyondphys-policy.pdf
45 http://pages.cs.wisc.edu/~remzi/OSTEP/file-integrity.pdf
23 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-vax.pdf
46 http://pages.cs.wisc.edu/~remzi/OSTEP/file-dialogue.pdf
24 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-dialogue.pdf
47 http://pages.cs.wisc.edu/~remzi/OSTEP/dialogue-distribution.pdf
48 http://pages.cs.wisc.edu/~remzi/OSTEP/dist-intro.pdf
49 http://pages.cs.wisc.edu/~remzi/OSTEP/dist-nfs.pdf
50 http://pages.cs.wisc.edu/~remzi/OSTEP/dist-afs.pdf
51 http://pages.cs.wisc.edu/~remzi/OSTEP/dist-dialogue.pdf

∅ everette @ LER0ever-Desktop [Code/Python/ostep] → 
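
The listing above comes out in the order the links appear on the page, not in chapter order. A minimal sketch of the sorting step, assuming the (number, file name) pairs are collected into a list instead of being printed; the `pairs` sample and the `sort_by_chapter` helper are hypothetical names used only for illustration:

```python
base_url = 'http://pages.cs.wisc.edu/~remzi/OSTEP/'

# A few (chapter number, file name) pairs in page order, copied from the output above.
pairs = [
    ('3', 'dialogue-virtualization.pdf'),
    ('12', 'dialogue-vm.pdf'),
    ('4', 'cpu-intro.pdf'),
    ('1', 'dialogue-threeeasy.pdf'),
    ('2', 'intro.pdf'),
]

def sort_by_chapter(pairs):
    """Sort by numeric chapter number (hypothetical helper, not part of the script)."""
    # Convert to int so that '12' sorts after '3'; as plain strings, '12' < '3'.
    return sorted(pairs, key=lambda pair: int(pair[0]))

for num, name in sort_by_chapter(pairs):
    print(num, base_url + name)
```

Sorting on the integer value avoids the usual string-sort surprise where chapter 12 lands before chapter 2.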

Stitch them together

In order to build a single PDF with the table of contents in the right position, I tried various PDF tools for concatenating PDFs, and pdftk turned out to work pretty well.

The limitation is that I can only insert TOC.pdf into the single big PDF file instead of using the PDF’s outline feature. (I guess that’s the price you pay unless you buy the electronic version for $10.)
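
As a possible workaround, pdftk has an `update_info` operation that reads the same text format its `dump_data` operation prints, including bookmark entries. A minimal sketch of generating that text, assuming each chapter's title and page count are already known; the `chapters` sample values and the `bookmark_info` helper are hypothetical and not part of the final script:

```python
def bookmark_info(chapters):
    """Build pdftk-style bookmark text from (title, page_count) pairs.

    Hypothetical helper: page counts are assumed to be known in advance.
    """
    lines = []
    page = 1  # page numbers in the bookmark data are 1-based
    for title, page_count in chapters:
        lines += [
            'BookmarkBegin',
            'BookmarkTitle: ' + title,
            'BookmarkLevel: 1',
            'BookmarkPageNumber: ' + str(page),
        ]
        page += page_count  # the next chapter starts right after this one
    return '\n'.join(lines) + '\n'

# Made-up titles and page counts, for illustration only.
chapters = [('Preface', 4), ('Contents', 10), ('A Dialogue on the Book', 4)]
print(bookmark_info(chapters))
```

Saved as, say, bookmarks.txt, the text could then be applied with something like `pdftk OSTEP.pdf update_info bookmarks.txt output OSTEP-outlined.pdf`; both file names here are placeholders.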

So the final script, which processes the webpage, downloads the separate PDFs and combines them, is as follows:

#!/usr/bin/env python

import requests, os, subprocess, shutil
from bs4 import BeautifulSoup
from collections import OrderedDict

base_url = 'http://pages.cs.wisc.edu/~remzi/OSTEP/'
html = requests.get(base_url)
bs = BeautifulSoup(html.text, 'html.parser')
links = {}
for link in bs.find_all('a'):
    value = link.get('href')
    if value and value.endswith('pdf'):
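        parent = link.find_parent("td")
        # Links outside the chapter table have no <td> parent; treat them as unnumbered
        chapter_num = parent.find_all("small") if parent is not None else []
        if chapter_num: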
            num = chapter_num[0].contents[0]
            key = "%02d" % (int(num),)
            links[key] = value
            print(str(num) + ": " + base_url + value)
        else:
            # 00x is smaller than 0x
            if value == 'preface.pdf':
                links['000'] = value
            if value == 'toc.pdf':
                links['001'] = value
ordered = OrderedDict(sorted(links.items()))

out_path = 'OSTEP'
if not os.path.exists(out_path):
    os.makedirs(out_path)
for chpt,resource in ordered.items():
    response = requests.get(base_url + resource)
    with open('{0}/{1}_{2}'.format(out_path, chpt, resource), 'wb') as f:
        f.write(response.content)
        print(chpt + "_" + resource + " downloaded")
owd = os.getcwd()
os.chdir(out_path)
# The combined file is written to the parent directory, so remove a stale copy there
if os.path.isfile("../OSTEP.pdf"):
    os.remove("../OSTEP.pdf")
print("combining all into one ...")
proc = subprocess.Popen('pdftk *.pdf cat output ../OSTEP.pdf', shell=True)
proc.wait()
os.chdir(owd)
shutil.rmtree("OSTEP")
print("done")
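
The ordering trick in the script relies on plain string comparison: the three-character keys '000' and '001' sort before every two-digit key, and the numbered file names sort the same way, which is the order the `*.pdf` glob is expected to hand to pdftk. A small sketch to check this, using a hypothetical `links` sample rather than the real scraped data:

```python
# Hypothetical sample in the same shape as the script's `links` dict.
links = {
    '10': 'cpu-sched-multi.pdf',
    '02': 'intro.pdf',
    '001': 'toc.pdf',
    '01': 'dialogue-threeeasy.pdf',
    '000': 'preface.pdf',
}

# Strings compare character by character, so '000' < '001' < '01' < '02' < '10'.
ordered_keys = sorted(links)
print(ordered_keys)

# File names as the script writes them: "<key>_<resource>".
file_names = sorted('{0}_{1}'.format(key, links[key]) for key in links)
print(file_names)
```

The zero padding is what makes this work: without it, '10' would sort before '2'.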

Here we go. But still, please support Remzi by buying a hardcopy textbook or an electronic version.

It’s certainly worth the price!