
Download and stitch together OSTEP book

LER0ever
April 16th, 2018 · 1 min read

What is OSTEP

I’m taking the undergraduate OS course at UW-Madison this semester (CS537: Operating System).
Our professor is Remzi H. Arpaci-Dusseau, who is apparently very well known in the OS community (at least that’s my impression from the Google search results).

We are using a textbook called Operating Systems: Three Easy Pieces (OSTEP), written by Remzi. Remzi thinks that textbooks should be free (cf. his blog), and that’s what he did with OSTEP: it is available online as separate PDFs, free to download and read.

However, I’m just too lazy to download tens of files or read them in the browser (which is absolutely painful).

So I wrote a script to download all the up-to-date chapters as PDFs and stitch them together into one single PDF with a table of contents.

Here is how

It is easy to notice that the links to the book’s chapters share a pattern that distinguishes them from the other links on the page.

They all:

  • end with the extension .pdf
  • have a sibling element called small, which contains the chapter number
  • have a parent called td, which is a cell in the chapter list table

Therefore we just need to search through all those links and sort them by chapter number to get all the separate PDFs.

To achieve that, we use BeautifulSoup to parse the website source:

from bs4 import BeautifulSoup
import requests

base_url = 'http://pages.cs.wisc.edu/~remzi/OSTEP/'
html = requests.get(base_url)
soup = BeautifulSoup(html.text, 'html.parser')

for link in soup.find_all('a'):
    value = link.get('href')
    # Only links that point at a PDF are chapter candidates
    if value and value.endswith('.pdf'):
        # Chapter links sit in a table cell next to a <small> chapter number
        parent = link.find_parent("td")
        chap_num = parent.find_all("small") if parent else []
        if chap_num:
            print(str(chap_num[0].contents[0]) + " " + base_url + value)

This should give you output like:

∅ everette @ LER0ever-Desktop [Code/Python/ostep] → python exp.py
3 http://pages.cs.wisc.edu/~remzi/OSTEP/dialogue-virtualization.pdf
12 http://pages.cs.wisc.edu/~remzi/OSTEP/dialogue-vm.pdf
25 http://pages.cs.wisc.edu/~remzi/OSTEP/dialogue-concurrency.pdf
35 http://pages.cs.wisc.edu/~remzi/OSTEP/dialogue-persistence.pdf
4 http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-intro.pdf
13 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-intro.pdf
26 http://pages.cs.wisc.edu/~remzi/OSTEP/threads-intro.pdf
36 http://pages.cs.wisc.edu/~remzi/OSTEP/file-devices.pdf
1 http://pages.cs.wisc.edu/~remzi/OSTEP/dialogue-threeeasy.pdf
5 http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-api.pdf
14 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-api.pdf
27 http://pages.cs.wisc.edu/~remzi/OSTEP/threads-api.pdf
37 http://pages.cs.wisc.edu/~remzi/OSTEP/file-disks.pdf
2 http://pages.cs.wisc.edu/~remzi/OSTEP/intro.pdf
6 http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-mechanisms.pdf
15 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-mechanism.pdf
28 http://pages.cs.wisc.edu/~remzi/OSTEP/threads-locks.pdf
38 http://pages.cs.wisc.edu/~remzi/OSTEP/file-raid.pdf
7 http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-sched.pdf
16 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-segmentation.pdf
29 http://pages.cs.wisc.edu/~remzi/OSTEP/threads-locks-usage.pdf
39 http://pages.cs.wisc.edu/~remzi/OSTEP/file-intro.pdf
8 http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-sched-mlfq.pdf
17 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-freespace.pdf
30 http://pages.cs.wisc.edu/~remzi/OSTEP/threads-cv.pdf
40 http://pages.cs.wisc.edu/~remzi/OSTEP/file-implementation.pdf
9 http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-sched-lottery.pdf
18 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-paging.pdf
31 http://pages.cs.wisc.edu/~remzi/OSTEP/threads-sema.pdf
41 http://pages.cs.wisc.edu/~remzi/OSTEP/file-ffs.pdf
10 http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-sched-multi.pdf
19 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-tlbs.pdf
32 http://pages.cs.wisc.edu/~remzi/OSTEP/threads-bugs.pdf
42 http://pages.cs.wisc.edu/~remzi/OSTEP/file-journaling.pdf
11 http://pages.cs.wisc.edu/~remzi/OSTEP/cpu-dialogue.pdf
20 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-smalltables.pdf
33 http://pages.cs.wisc.edu/~remzi/OSTEP/threads-events.pdf
43 http://pages.cs.wisc.edu/~remzi/OSTEP/file-lfs.pdf
21 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-beyondphys.pdf
34 http://pages.cs.wisc.edu/~remzi/OSTEP/threads-dialogue.pdf
44 http://pages.cs.wisc.edu/~remzi/OSTEP/file-ssd.pdf
22 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-beyondphys-policy.pdf
45 http://pages.cs.wisc.edu/~remzi/OSTEP/file-integrity.pdf
23 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-vax.pdf
46 http://pages.cs.wisc.edu/~remzi/OSTEP/file-dialogue.pdf
24 http://pages.cs.wisc.edu/~remzi/OSTEP/vm-dialogue.pdf
47 http://pages.cs.wisc.edu/~remzi/OSTEP/dialogue-distribution.pdf
48 http://pages.cs.wisc.edu/~remzi/OSTEP/dist-intro.pdf
49 http://pages.cs.wisc.edu/~remzi/OSTEP/dist-nfs.pdf
50 http://pages.cs.wisc.edu/~remzi/OSTEP/dist-afs.pdf
51 http://pages.cs.wisc.edu/~remzi/OSTEP/dist-dialogue.pdf

∅ everette @ LER0ever-Desktop [Code/Python/ostep] →
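
Note that the chapters come out in the order the links appear in the page source, not in chapter order, so they still need sorting. Here is a minimal sketch of that step, using a few pairs copied from the output above. The `pairs` list and the `sort_by_chapter` helper are made up for illustration and are not part of the actual script:

```python
# Hypothetical helper: sort (chapter, file name) pairs numerically by chapter.
BASE_URL = 'http://pages.cs.wisc.edu/~remzi/OSTEP/'

# A few (chapter number, file name) pairs in the order the scraper printed them.
pairs = [
    ('3', 'dialogue-virtualization.pdf'),
    ('12', 'dialogue-vm.pdf'),
    ('4', 'cpu-intro.pdf'),
    ('1', 'dialogue-threeeasy.pdf'),
    ('2', 'intro.pdf'),
]


def sort_by_chapter(items):
    """Return the pairs ordered by chapter number, compared as integers."""
    return sorted(items, key=lambda item: int(item[0]))


for chapter, name in sort_by_chapter(pairs):
    print(chapter, BASE_URL + name)
```

Comparing as integers matters here: compared as plain strings, '12' would sort before '2'.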

Stitch them together

In order to build a single PDF with the table of contents in the right position, I tried various tools for concatenating PDFs, and pdftk turned out to work pretty well.

The limitation is that I can only insert toc.pdf as ordinary pages inside the single big PDF instead of using the PDF’s outline feature. (I guess that’s the price you pay unless you buy the electronic version for $10.)
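
As a rough sketch of the concatenation step, the pdftk command line can be assembled as an argument list instead of relying on the shell to expand `*.pdf`. The helper name `build_pdftk_command` and the sample file names are assumptions for illustration; only the `pdftk ... cat output ...` form itself is what I actually use:

```python
import subprocess


def build_pdftk_command(inputs, output):
    """Build the argv for: pdftk <inputs...> cat output <output>."""
    if not inputs:
        raise ValueError("need at least one input PDF")
    # Sorting by name mirrors what a shell glob does in the C locale,
    # so zero-padded prefixes decide the page order.
    return ['pdftk'] + sorted(inputs) + ['cat', 'output', output]


# Hypothetical file names, prefixed with a sortable key.
cmd = build_pdftk_command(
    ['01_dialogue-threeeasy.pdf', '000_preface.pdf', '001_toc.pdf'],
    'OSTEP.pdf',
)
print(' '.join(cmd))

# To actually run it (requires pdftk to be installed and the files to exist):
# subprocess.run(cmd, check=True)
```

Passing a list avoids `shell=True` and makes the page order explicit, at the cost of having to list the files yourself.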

So the final script, which processes the webpage, downloads the separate PDFs and combines them, is as follows:

#!/usr/bin/env python

import requests, os, subprocess, shutil
from bs4 import BeautifulSoup
from collections import OrderedDict

base_url = 'http://pages.cs.wisc.edu/~remzi/OSTEP/'
html = requests.get(base_url)
bs = BeautifulSoup(html.text, 'html.parser')
links = {}
for link in bs.find_all('a'):
    value = link.get('href')
    if value and value.endswith('.pdf'):
        parent = link.find_parent("td")
        chapter_num = parent.find_all("small") if parent else []
        if chapter_num:
            num = chapter_num[0].contents[0]
            # Zero-pad so that sorting the keys as strings gives chapter order
            key = "%02d" % (int(num),)
            links[key] = value
            print(str(num) + ": " + base_url + value)
        else:
            # 00x sorts before 0x, so preface and TOC end up in front
            if value == 'preface.pdf':
                links['000'] = value
            if value == 'toc.pdf':
                links['001'] = value
ordered = OrderedDict(sorted(links.items()))

# Download every chapter into a temporary folder, prefixed with its sort key
out_path = 'OSTEP'
if not os.path.exists(out_path):
    os.makedirs(out_path)
for chpt, resource in ordered.items():
    response = requests.get(base_url + resource)
    with open('{0}/{1}_{2}'.format(out_path, chpt, resource), 'wb') as f:
        f.write(response.content)
    print(chpt + resource + " downloaded")
owd = os.getcwd()
os.chdir(out_path)
# The combined file is written one level up, so remove a stale copy there
if os.path.isfile("../OSTEP.pdf"):
    os.remove("../OSTEP.pdf")
print("combining all into one ...")
# The shell expands *.pdf in sorted order, which is why the key prefix matters
proc = subprocess.Popen('pdftk *.pdf cat output ../OSTEP.pdf', shell=True)
proc.wait()
os.chdir(owd)
shutil.rmtree("OSTEP")
print("done")
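
A quick note on the '000' and '001' keys in the script: the ordering relies on plain string comparison. The snippet below is only a small illustration of that trick with a handful of made-up chapter numbers, not part of the script itself:

```python
# Keys as the script builds them: chapters are zero-padded to two digits,
# while the preface and table of contents get three-digit keys.
keys = ['%02d' % n for n in (10, 2, 1)] + ['001', '000']

# String sorting compares character by character, so '000' and '001'
# come before '01', and '02' comes before '10'.
print(sorted(keys))  # ['000', '001', '01', '02', '10']
```

This only holds while chapter numbers stay below 100; beyond that the padding width would have to grow.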

Here we go, but still, please support Remzi by buying a hardcopy textbook or an electronic version.

It’s certainly worth the price!
