📚 The CoCalc Library - books, templates and other resources
License: OTHER
% LaTeX source for ``Think OS:1% A Brief Introduction to Operating Systems''2% Copyright 2015 Allen B. Downey.34% License: Creative Commons Attribution-NonCommercial 3.0 Unported License.5% http://creativecommons.org/licenses/by-nc/3.0/6%78%\documentclass[10pt,b5paper]{book}9\documentclass[12pt]{book}10\usepackage[width=5.5in,height=8.5in,11hmarginratio=3:2,vmarginratio=1:1]{geometry}1213% for some of these packages, you might have to install14% texlive-latex-extra (in Ubuntu)1516\usepackage[T1]{fontenc}17\usepackage{textcomp}18\usepackage{mathpazo}19%\usepackage{pslatex}2021\usepackage{url}22\usepackage{fancyhdr}23\usepackage{fancyvrb}24\usepackage{graphicx}25\usepackage{subfig}26\usepackage{amsmath}27\usepackage{amsthm}28%\usepackage{amssymb}29\usepackage{makeidx}30\usepackage{setspace}31\usepackage{upquote}32\usepackage{listings}33\usepackage{color}3435\title{Think OS}36\author{Allen B. Downey}3738\newcommand{\thetitle}{Think OS: A Brief Introduction to Operating Systems}39\newcommand{\theversion}{0.7.3}404142% Commands that control the appearance of the listings43\definecolor{light-gray}{gray}{0.95}4445\lstset{basicstyle=\tt, frame=single,46backgroundcolor=\color{light-gray}, escapeinside={(*}{*)},47numbers=left, numberstyle=\tiny, numbersep=10pt}484950\makeindex5152\begin{document}5354\frontmatter555657%%% EXERCISE5859\newtheoremstyle{exercise}% name of the style to be used60{\topsep}% measure of space to leave above the theorem. E.g.: 3pt61{\topsep}% measure of space to leave below the theorem. E.g.: 3pt62{}% name of font to use in the body of the theorem63{0pt}% measure of space to indent64{\bfseries}% name of head font65{}% punctuation between head and body66{ }% space after theorem head; " " = normal interword space67{}% Manually specify head6869\theoremstyle{exercise}70\newtheorem{exercise}{Exercise}[chapter]7172\sloppy73%\setlength{\topmargin}{-0.375in}74%\setlength{\oddsidemargin}{0.0in}75%\setlength{\evensidemargin}{0.0in}7677% Uncomment these to center on 8.5 x 1178%\setlength{\topmargin}{0.625in}79%\setlength{\oddsidemargin}{0.875in}80%\setlength{\evensidemargin}{0.875in}8182%\setlength{\textheight}{7.2in}8384\setlength{\headsep}{3ex}85\setlength{\parindent}{0.0in}86\setlength{\parskip}{1.7ex plus 0.5ex minus 0.5ex}87\renewcommand{\baselinestretch}{1.02}8889% see LaTeX Companion page 6290\setlength{\topsep}{-0.0\parskip}91\setlength{\partopsep}{-0.5\parskip}92\setlength{\itemindent}{0.0in}93\setlength{\listparindent}{0.0in}9495% see LaTeX Companion page 2696% these are copied from /usr/local/teTeX/share/texmf/tex/latex/base/book.cls97% all I changed is afterskip9899\makeatletter100101\renewcommand{\section}{\@startsection102{section} {1} {0mm}%103{-3.5ex \@plus -1ex \@minus -.2ex}%104{0.7ex \@plus.2ex}%105{\normalfont\Large\bfseries}}106\renewcommand\subsection{\@startsection {subsection}{2}{0mm}%107{-3.25ex\@plus -1ex \@minus -.2ex}%108{0.3ex \@plus .2ex}%109{\normalfont\large\bfseries}}110\renewcommand\subsubsection{\@startsection {subsubsection}{3}{0mm}%111{-3.25ex\@plus -1ex \@minus -.2ex}%112{0.3ex \@plus .2ex}%113{\normalfont\normalsize\bfseries}}114115% The following line adds a little extra space to the column116% in which the Section numbers appear in the table of contents117\renewcommand{\l@section}{\@dottedtocline{1}{1.5em}{3.0em}}118\setcounter{tocdepth}{1}119120\newcommand{\adjustpage}[1]{\enlargethispage{#1\baselineskip}}121122123% Note: the following command seems to cause problems for Acroreader124% on Windows, so for now I am overriding 
it.125%\newcommand{\clearemptydoublepage}{126% \newpage{\pagestyle{empty}\cleardoublepage}}127\newcommand{\clearemptydoublepage}{\cleardoublepage}128129%\newcommand{\blankpage}{\pagestyle{empty}\vspace*{1in}\newpage}130\newcommand{\blankpage}{\vspace*{1in}\newpage}131132% HEADERS133134\renewcommand{\chaptermark}[1]{\markboth{#1}{}}135\renewcommand{\sectionmark}[1]{\markright{\thesection\ #1}{}}136137\lhead[\fancyplain{}{\bfseries\thepage}]%138{\fancyplain{}{\bfseries\rightmark}}139\rhead[\fancyplain{}{\bfseries\leftmark}]%140{\fancyplain{}{\bfseries\thepage}}141\cfoot{}142143\pagestyle{fancyplain}144145146% turn off the rule under the header147%\setlength{\headrulewidth}{0pt}148149% the following is a brute-force way to prevent the headers150% from getting transformed into all-caps151\renewcommand\MakeUppercase{}152153% Exercise environment154\newtheoremstyle{myex}% name155{9pt}% Space above156{9pt}% Space below157{}% Body font158{}% Indent amount (empty = no indent, \parindent = para indent)159{\bfseries}% Thm head font160{}% Punctuation after thm head161{0.5em}% Space after thm head: " " = normal interword space;162% \newline = linebreak163{}% Thm head spec (can be left empty, meaning `normal')164165\theoremstyle{myex}166167168169%--title page--------------------------------------------------170\pagebreak171\thispagestyle{empty}172173\begin{flushright}174\vspace*{2.0in}175176\begin{spacing}{3}177{\huge Think OS}\\178{\Large A Brief Introduction to Operating Systems}179\end{spacing}180181\vspace{0.25in}182183Version \theversion184185\vspace{1in}186187188{\Large189Allen B. Downey\\190}191192193\vspace{0.5in}194195{\Large Green Tea Press}196197{\small Needham, Massachusetts}198199%\includegraphics[width=1in]{figs/logo1.eps}200\vfill201202\end{flushright}203204205%--copyright--------------------------------------------------206\pagebreak207\thispagestyle{empty}208209{\small210Copyright \copyright ~2015 Allen B. Downey.211212213\vspace{0.2in}214215\begin{flushleft}216Green Tea Press \\2179 Washburn Ave \\218Needham MA 02492219\end{flushleft}220221Permission is granted to copy, distribute, and/or modify this document222under the terms of the Creative Commons Attribution-NonCommercial 3.0 Unported223License, which is available at \url{http://creativecommons.org/licenses/by-nc/3.0/}.224225The original form of this book is \LaTeX\ source code. Compiling this226code has the effect of generating a device-independent representation227of a textbook, which can be converted to other formats and printed.228229The \LaTeX\ source for this book is available from230\url{http://greenteapress.com/thinkos}.231232The cover for this book is based on a photo by Paul Friel233(\url{http://flickr.com/people/frielp/}), who made it available under234the Creative Commons Attribution license. The original photo235is at \url{http://flickr.com/photos/frielp/11999738/}.236237\vspace{0.2in}238239} % end small240241242\chapter{Preface}243\label{preface}244245In many computer science programs, Operating Systems is an advanced246topic. By the time students take it, they know how to program247in C, and they have probably taken a class in Computer Architecture.248Usually the goal of the class is to expose students to the design249and implementation of operating systems, with the implied assumption250that some of them will do research in this area, or write part of251an OS.252253This book is intended for a different audience, and it has different254goals. 
I developed it for a class at Olin College called Software255Systems.256257Most students taking this class learned to program in Python,258so one of the goals is to help them learn C.259For that part of the class, I use Griffiths and Griffiths, {\it Head260First C}, from O'Reilly Media. This book is meant to complement261that one.262263Few of my students will ever write an operating system, but many of264them will write low-level applications in C or work on embedded265systems. My class includes material from operating systems, networks,266databases, and embedded systems, but it emphasizes the topics267programmers need to know.268269This book does not assume that you have studied Computer Architecture.270As we go along, I will explain what we need.271272If this book is successful, it should give you a better understanding273of what is happening when programs run, and what you can do to make274them run better and faster.275276Chapter 1 explains some of the differences between compiled and277interpreted languages, with some insight into how compilers work.278Recommended reading: {\it Head First C} Chapter 1.279280Chapter 2 explains how the operating system uses processes to281protect running programs from interfering with each other.282283Chapter 3 explains virtual memory and address translation.284Recommended reading: {\it Head First C} Chapter 2.285286Chapter 4 is about file systems and data streams.287Recommended reading: {\it Head First C} Chapter 3.288289Chapter 5 describes how numbers, letters, and other values are290encoded, and presents the bitwise operators.291292Chapter 6 explains how to use dynamic memory management, and how293it works.294Recommended reading: {\it Head First C} Chapter 6.295296Chapter 7 is about caching and the memory hierarchy.297298Chapter 8 is about multitasking and scheduling.299300Chapter 9 is about POSIX threads and mutexes.301Recommended reading: {\it Head First C} Chapter 12 and302{\it Little Book of Semaphores} Chapters 1 and 2.303304Chapter 10 is about POSIX condition variables and the producer/consumer305problem. Recommended reading: {\it Little Book of Semaphores} Chapters 3306and 4.307308Chapter 11 is about using POSIX semaphores and implementing semaphores in C.309310\section*{A note on this draft}311312The current version of this book is an early draft. While I am313working on the text, I have not yet included the figures. So314there are a few places where, I'm sure, the explanation will be315greatly improved when the figures are ready.316317318\section{Using the code}319\label{code}320321Example code for this book is available from322\url{https://github.com/AllenDowney/ThinkOS}. Git is a version323control system that allows you to keep track of the files that324make up a project. A collection of files under Git's control is325called a {\bf repository}. GitHub is a hosting service that provides326storage for Git repositories and a convenient web interface.327\index{repository}328\index{Git}329\index{GitHub}330331The GitHub homepage for my repository provides several ways to332work with the code:333334\begin{itemize}335336\item You can create a copy of my repository337on GitHub by pressing the {\sf Fork} button. If you don't already338have a GitHub account, you'll need to create one. After forking, you'll339have your own repository on GitHub that you can use to keep track340of code you write while working on this book. 
Then you can341clone the repo, which means that you copy the files342to your computer.343\index{fork}344345\item Or you could clone346my repository. You don't need a GitHub account to do this, but you347won't be able to write your changes back to GitHub.348\index{clone}349350\item If you don't want to use Git at all, you can download the files351in a Zip file using the button in the lower-right corner of the352GitHub page.353354\end{itemize}355356357\section*{Contributor List}358359360If you have a suggestion or correction, please send email to361{\tt downey@allendowney.com}. If I make a change based on your362feedback, I will add you to the contributor list363(unless you ask to be omitted).364\index{contributors}365366If you include at least part of the sentence the367error appears in, that makes it easy for me to search. Page and368section numbers are fine, too, but not quite as easy to work with.369Thanks!370371\small372373\begin{itemize}374375\item I am grateful to the students in Software Systems at Olin376College, who tested an early draft of this book in Spring 2014.377They corrected many errors and made many helpful suggestions.378I appreciate their pioneering spirit!379380\item James P Giannoules spotted a copy-and-paste error.381382\item Andy Engle knows the difference between GB and GiB.383384\item Aashish Karki noted some broken syntax.385386% ENDCONTRIB387388\end{itemize}389390Other people who found typos and errors include391Jim Tyson, Donald Robertson, Jeremy Vermast, Yuzhong Huang, Ian Hill.392393\normalsize394395\clearemptydoublepage396397% TABLE OF CONTENTS398399\tableofcontents400401\clearemptydoublepage402403% inline syntax formatting404\newcommand{\vb}{\verb}%}405406% START THE BOOK407\mainmatter408409410\chapter{Compilation}411412413\section{Compiled and interpreted languages}414415People often describe programming languages as either compiled or interpreted.416``Compiled'' means that programs are translated into machine language and417then executed by hardware; ``interpreted'' means that programs418are read and executed by a software interpreter.419Usually C is considered a compiled language and Python is420considered an interpreted language. But the distinction is not always421clear-cut.422423First, many languages can be either compiled or interpreted. For424example, there are C interpreters and Python compilers.425Second, there are languages like Java that use a hybrid426approach, compiling programs into an intermediate language and then427running the translated program in an interpreter. Java uses an428intermediate language called Java bytecode, which is similar to429machine language, but it is executed by a software interpreter, the430Java virtual machine (JVM).431432So being compiled or interpreted is not an intrinsic433characteristic of a language; nevertheless, there are some general434differences between compiled and interpreted languages.435436437\section{Static types}438439Many interpreted languages support dynamic types, but compiled440languages are usually limited to static types. In a statically-typed441language, you can tell by looking at the program what type each442variable refers to. In a dynamically-typed language,443you don't always know the type of a variable until the444program is running. 
In general, {\bf static} refers to things that445happen at compile time (while a program is being compiled), and {\bf dynamic} refers to things that happen446at run time (while a program is running).447448For example, in Python you can write a function like this:449450\begin{verbatim}451def add(x, y):452return x + y453\end{verbatim}454455Looking at this code, you can't tell what type {\tt x} and {\tt y}456will refer to at run time. This function might be called several457times, each time with values with different types. Any values that458support the addition operator will work; any other types will cause an459exception or {\bf runtime error}.460461In C you would write the same function like this:462463\begin{verbatim}464int add(int x, int y) {465return x + y;466}467\end{verbatim}468469The first line of the function includes {\bf type declarations} for the470parameters and the return value: {\tt x} and {\tt y} are declared to471be integers, which means that we can check at compile time472whether the addition operator is legal for this type (it is). The473return value is also declared to be an integer.474475Because of these declarations, when this function is called elsewhere476in the program, the compiler can check whether the arguments provided477have the right type, and whether the return value is used correctly.478479These checks happen before the program starts executing, so errors can480be found earlier. More importantly, errors can be found in parts481of the program that have never run. Furthermore, these checks don't482have to happen at run time, which is one of the reasons compiled483languages generally run faster than interpreted languages.484485Declaring types at compile time also saves space. In dynamic486languages, variable names are stored in memory while the program runs,487and they are often accessible by the program. For example, in Python488the built-in function {\tt locals} returns a dictionary that contains489variable names and their values. Here's an example in a Python490interpreter:491492\begin{verbatim}493>>> x = 5494>>> print locals()495{'x': 5, '__builtins__': <module '__builtin__' (built-in)>,496'__name__': '__main__', '__doc__': None, '__package__': None}497\end{verbatim}498499This shows that the name of the variable is stored in memory while500the program is running (along with some other values that are part501of the default runtime environment).502503In compiled languages, variable names exist at compile-time but not at504run time. The compiler chooses a location for each variable and505records these locations as part of the compiled program.\footnote{This506is a simplification; we will go into more detail later.} The507location of a variable is called its {\bf address}. At run time, the508value of each variable is stored at its address, but the names of the509variables are not stored at all (unless they are added by the compiler510for purposes of debugging).511512%This difference is reflected in the way people draw diagrams for513%different languages. In Python, every variable is a reference to514%a value, so the usual diagram shows a name with an arrow pointing515%to its value. 
In C, a variable is the name of a location in memory516%that stores a value, so the usual diagram is a name next to a517%box that contains a value.518519%\begin{figure}520% descriptive.py521%\centerline{\includegraphics[width=2.5in]{figs/variables.pdf}}522%\caption{Diagrams that represent variables in Python (left) and523%C (right).}524%\label{variables}525%\end{figure}526527528529\section{The compilation process}530531As a programmer, you should have a mental model of what happens532during compilation. If you understand the process, it will help533you interpret error messages, debug your code, and avoid534common pitfalls.535536The steps of compilation are:537538\begin{enumerate}539540\item Preprocessing: C is one of several languages that include541{\bf preprocessing directives} that take effect before the program is542compiled. For example, the \verb"#include" directive causes the543source code from another file to be inserted at the location of the544directive.545546\item Parsing: During parsing, the compiler reads the source code and547builds an internal representation of the program, called an548{\bf abstract syntax tree}. Errors549detected during this step are generally syntax errors.550551\item Static checking: The compiler checks whether variables and552values have the right type, whether functions are called with the553right number and type of arguments, etc. Errors detected during554this step are sometimes called {\bf static semantic} errors.555556\item Code generation: The compiler reads the internal representation557of the program and generates machine code or byte code.558559\item Linking: If the program uses values and functions defined in a560library, the compiler has to find the appropriate library and561include the required code.562563\item Optimization: At several points in the process, the compiler564can transform the program to generate code that runs faster or565uses less space. Most optimizations are simple changes that eliminate566obvious waste, but some compilers perform sophisticated analyses and567transformations.568569\end{enumerate}570571Normally when you run {\tt gcc}, it runs all of these steps and572generates an executable file. For example, here is a minimal C573program:574575\begin{verbatim}576#include <stdio.h>577int main()578{579printf("Hello World\n");580}581\end{verbatim}582583If you save this code in a file called584{\tt hello.c}, you can compile and run it like this:585586\begin{verbatim}587$ gcc hello.c588$ ./a.out589\end{verbatim}590591By default, {\tt gcc} stores the executable code in a file592called {\tt a.out} (which originally stood for ``assembler output'').593The second line runs the executable. The prefix \verb"./" tells594the shell to look for it in the current directory.595596It is usually a good idea to use the {\tt -o} flag to provide a597better name for the executable:598599\begin{verbatim}600$ gcc hello.c -o hello601$ ./hello602\end{verbatim}603604605\section{Object code}606607The {\tt -c} flag tells {\tt gcc} to compile the program and608generate machine code, but not to link it or generate an executable:609610\begin{verbatim}611$ gcc hello.c -c612\end{verbatim}613614The result is a file named {\tt hello.o}, where the {\tt o} stands for615{\bf object code}, which is the compiled program. Object code is not616executable, but it can be linked into an executable.617618The UNIX command {\tt nm} reads an object file and generates619information about the names it defines and uses. 
For example:620621\begin{verbatim}622$ nm hello.o6230000000000000000 T main624U puts625\end{verbatim}626627This output indicates that {\tt hello.o} defines the name {\tt main}628and uses a function named {\tt puts}, which stands for ``put string''.629In this example, {\tt gcc} performs an optimization by replacing630{\tt printf}, which is a large and complicated function, with631{\tt puts}, which is relatively simple.632633You can control how much optimization {\tt gcc} does with634the {\tt -O} flag. By default, it does very little optimization, which635can make debugging easier. The option {\tt -O1} turns on the most636common and safe optimizations. Higher numbers turn on additional637optimizations that require longer compilation time.638639In theory, optimization should not change the behavior of the program,640other than to speed it up. But if your program has a subtle bug,641you might find that optimization makes the bug appear or disappear.642It is usually a good idea to turn off optimization while you are developing643new code. Once the program is working and passing appropriate tests,644you can turn on optimization and confirm that the tests still pass.645646647\section{Assembly code}648649Similar to the {\tt -c} flag, the {\tt -S} flag tells {\tt gcc}650to compile the program and generate assembly code, which is basically651a human-readable form of machine code.652653\begin{verbatim}654$ gcc hello.c -S655\end{verbatim}656657The result is a file named {\tt hello.s}, which might look something like658this:659660\begin{verbatim}661.file "hello.c"662.section .rodata663.LC0:664.string "Hello World"665.text666.globl main667.type main, @function668main:669.LFB0:670.cfi_startproc671pushq %rbp672.cfi_def_cfa_offset 16673.cfi_offset 6, -16674movq %rsp, %rbp675.cfi_def_cfa_register 6676movl $.LC0, %edi677call puts678movl $0, %eax679popq %rbp680.cfi_def_cfa 7, 8681ret682.cfi_endproc683.LFE0:684.size main, .-main685.ident "GCC: (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3"686.section .note.GNU-stack,"",@progbits687\end{verbatim}688689{\tt gcc} is usually configured to generate code for the machine you690are running on, so for me it generates x86 assembly language,691which runs on a wide variety of processors from Intel, AMD, and692others. If you are running on a different architecture, you might693see different code.694695696\section{Preprocessing}697698Taking another step backward through the compilation process, you699can use the {\tt -E} flag to run the preprocessor only:700701\begin{verbatim}702$ gcc hello.c -E703\end{verbatim}704705The result is the output from the preprocessor. In this example,706it contains the included code from {\tt stdio.h}, and all the files707included from {\tt stdio.h}, and all the files included from those708files, and so on. On my machine, the total is more than 800 lines709of code. Since almost every C program includes {\tt stdio.h}, those710800 lines of code get compiled a lot. If, like many C programs,711you also include {\tt stdlib.h}, the result is more than 1800 lines712of code.713714715\section{Understanding errors}716717Now that we know the steps in the compilation process, it is easier718to understand error messages. 
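
One way to get comfortable with these messages is to break {\tt hello.c}
on purpose, one mistake at a time, and recompile; the exact wording of
each message depends on your version of {\tt gcc}.  For instance, here
is a variant (not part of the book's example code) that misspells the
header name:

\begin{verbatim}
#include <stdioo.h>    /* misspelled on purpose */

int main()
{
    printf("Hello World\n");
}
\end{verbatim}

Deleting the semicolon, or calling {\tt printff} instead of
{\tt printf}, are two more mistakes worth trying.
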
For example, if there is an error719in a \verb"#include" directive, you'll get a message from the720preprocessor:721722\begin{verbatim}723hello.c:1:20: fatal error: stdioo.h: No such file or directory724compilation terminated.725\end{verbatim}726727If there's a syntax error, you get a message from the compiler:728729\begin{verbatim}730hello.c: In function 'main':731hello.c:6:1: error: expected ';' before '}' token732\end{verbatim}733734If you use a function that's not defined in any of the standard735libraries, you get a message from the linker:736737\begin{verbatim}738/tmp/cc7iAUbN.o: In function `main':739hello.c:(.text+0xf): undefined reference to `printff'740collect2: error: ld returned 1 exit status741\end{verbatim}742743{\tt ld} is the name of the UNIX linker, so named because ``loading''744is another step in the compilation process that is closely related745to linking.746747Once the program starts, C does very little runtime checking,748so there are only a few runtime errors you are likely to see. If you749divide by zero, or perform another illegal floating-point operation, you750will get a ``Floating point exception.'' And if you try to read or751write an incorrect location in memory, you will get a ``Segmentation752fault.''753754% TODO: -Wall and lint755756757\chapter{Processes}758759\section{Abstraction and virtualization}760761Before we talk about processes, I want to define a few words:762763\begin{itemize}764765\item Abstraction: An abstraction is a simplified representation766of something complicated. For example, if you drive a car, you767understand that when you turn the wheel left, the car goes left,768and vice versa. Of course, the steering wheel is connected to769a sequence of mechanical and (often) hydraulic systems that turn770the wheels, and the wheels interact with the road in ways that771can be complex, but as a driver, you normally don't have to think772about any of those details. You can get along very well with773a simple mental model of steering. Your mental model is an774abstraction.775776Similarly, when you use a web browser, you understand that when777you click on a link, the browser displays the page the link refers778to. The software and network communication that make that possible779are complex, but as a user, you don't have to know the780details.781782A large part of software engineering is designing abstractions like783these that allow users and other programmers to use powerful784and complicated systems without having to know about the details785of their implementation.786787\item Virtualization: An important kind of abstraction is788virtualization, which is the process of creating a desirable789illusion.790791For example, many public libraries participate in inter-library792collaborations that allow them to borrow books from each other.793When I request a book, sometimes the book is on the shelf at my794local library, but other times it has to be transferred from another795collection. Either way, I get a notification when it is available796for pickup. I don't need to know where it came from, and I don't797need to know which books my library has. As a whole, the system798creates the illusion that my library has every book in the world.799800The collection physically located at my local library might be small,801but the collection available to me virtually includes every book802in the inter-library collaboration.803804As another example, most computers are only connected to one805network, but that network is connected to others, and so on. 
What806we call the Internet is a collection of networks and a set of807protocols that forward packets from one network to the next.808From the point of view of a user or programmer, the system behaves809as if every computer on the Internet is connected to every other810computer. The number of physical connections is small, but the811number of virtual connections is very large.812813\end{itemize}814815The word ``virtual'' is often used in the context of a virtual816machine, which is software that creates the illusion of a dedicated817computer running a particular operating system, when in reality818the virtual machine might be running, along with many other virtual819machines, on a computer running a different operating system.820821In the context of virtualization, we sometimes call what is822really happening ``physical'', and what is virtually happening823either ``logical'' or ``abstract.''824825826\section{Isolation}827828One of the most important principles of engineering is isolation:829when you are designing a system with multiple components, it is usually830a good idea to isolate them from each other so that a change in one831component doesn't have undesired effects on other components.832833One of the most important goals of an operating system is to isolate834each running program from the others so that programmers don't have to835think about every possible interaction. The software object that836provides this isolation is a {\bf process}.837838A process is a software object that represents a running program.839I mean ``software object'' in the sense of object-oriented programming;840in general, an object contains data and provides methods841that operate on the data. A process is an object that contains the842following data:843844\begin{itemize}845846\item The text of the program, usually a sequence of847machine language instructions.848849\item Data associated with the program, including static data (allocated850at compile time) and dynamic data (allocated at run time).851852\item The state of any pending input/output operations. For example,853if the process is waiting for data to be read from disk or for a854packet to arrive on a network, the status of these operations is855part of the process.856857\item The hardware state of the program, which includes data stored858in registers, status information, and the program counter, which859indicates which instruction is currently executing.860861\end{itemize}862863Usually one process runs one program, but it is also possible for864a process to load and run a new program.865866It is also possible, and common, to run the same program in more than one867process. In that case, the processes share the same program text868but generally have different data and hardware states.869870Most operating systems provide a fundamental set of capabilities871to isolate processes from each other:872873\begin{itemize}874875\item Multitasking: Most operating systems have the ability to876interrupt a running process at almost any time, save its hardware877state, and then resume the process later. In general, programmers878don't have to think about these interruptions. The program behaves879as if it is running continuously on a dedicated processor,880except that the time between instructions is unpredictable.881882\item Virtual memory: Most operating systems create the883illusion that each process has its own chunk of memory, isolated884from all other processes. 
Again, programmers generally don't885have to think about how virtual memory works; they can proceed886as if every program has a dedicated chunk of memory.887888\item Device abstraction: Processes running on the same computer share889the disk drive, the network interface, the graphics card, and other890hardware. If processes interacted with this hardware directly,891without coordination, chaos would ensue. For example, network data892intended for one process might be read by another. Or multiple893processes might try to store data in the same location on a hard894drive. It is up to the operating system to maintain order by895providing appropriate abstractions.896897\end{itemize}898899As a programmer, you don't need to know much about how these900capabilities are implemented. But if you are901curious, you will find a lot of interesting things902going on under the metaphorical hood. And if you know what's903going on, it can make you a better programmer.904905906\section{UNIX processes}907\label{unixps}908909While I write this book, the process I910am most aware of is my text editor, emacs. Every once in a while911I switch to a terminal window, which is a window running a UNIX shell912that provides a command-line interface.913914When I move the mouse, the window manager wakes up, sees that the915mouse is over the terminal window, and wakes up the terminal.916The terminal wakes up the shell.917If I type {\tt make} in the shell, it creates a918new process to run Make, which creates another process to run LaTeX919and then another process to display the results.920921If I need to look something up, I might switch to another desktop,922which wakes up the window manager again. If I click on the icon for a923web browser, the window manager creates a process to run the web924browser. Some browsers, like Chrome, create a new process for each925window and each tab.926927And those are just the processes I am aware of. At the same time928there are many other processes running in the {\bf background}.929Many of them are performing operations related to the operating930system.931932The UNIX command {\tt ps} prints information about running processes.933If you run it in a terminal, you might see something like this:934935\begin{verbatim}936PID TTY TIME CMD9372687 pts/1 00:00:00 bash9382801 pts/1 00:01:24 emacs93924762 pts/1 00:00:00 ps940\end{verbatim}941942The first column is the unique numerical process ID. The second943column is the terminal that created the process;944``TTY'' stands for teletypewriter, which was the original945mechanical terminal.946947The third column is the total processor time used by the process,948in hours, minutes, and seconds.949The last column is the name of the running program. In950this example, {\tt bash} is the name of the shell that interprets951the commands I type in the terminal, emacs is my text editor, and952ps is the program generating this output.953954By default, {\tt ps} lists only the processes associated with955the current terminal. If you use the {\tt -e} flag, you get every956process (including processes belonging to other users, which is957a security flaw, in my opinion).958959On my system there are currently 233 processes.960Here are some of them:961962\begin{verbatim}963PID TTY TIME CMD9641 ? 00:00:17 init9652 ? 00:00:00 kthreadd9663 ? 00:00:02 ksoftirqd/09674 ? 00:00:00 kworker/0:09688 ? 00:00:00 migration/09699 ? 00:00:00 rcu_bh97010 ? 00:00:16 rcu_sched97147 ? 00:00:00 cpuset97248 ? 00:00:00 khelper97349 ? 00:00:00 kdevtmpfs97450 ? 00:00:00 netns97551 ? 
00:00:00 bdi-default
   52 ?        00:00:00 kintegrityd
   53 ?        00:00:00 kblockd
   54 ?        00:00:00 ata_sff
   55 ?        00:00:00 khubd
   56 ?        00:00:00 md
   57 ?        00:00:00 devfreq_wq
\end{verbatim}

{\tt init} is the first process created when the operating system
starts.  It creates many of the other processes, and then sits idle
until the processes it created are done.

{\tt kthreadd} is a process the operating system uses to create new
{\bf threads}.  We'll talk more about threads later, but for now you
can think of a thread as a kind of process.  The {\tt k} at the
beginning stands for {\bf kernel}, which is the part of the operating
system responsible for core capabilities like creating threads.  The
extra {\tt d} at the end stands for {\bf daemon}, which is another
name for processes like this that run in the background and provide
operating system services.  In this context, ``daemon'' is used in the
sense of a helpful spirit, with no connotation of evil.

Based on the name, you can infer that {\tt ksoftirqd} is also a kernel
daemon; specifically, it handles software interrupt requests, or
``soft IRQ''.

{\tt kworker} is a worker process created by the kernel to do some
kind of processing for the kernel.

There are often multiple processes running these kernel services.
On my system at the moment, there are 8 {\tt ksoftirqd} processes
and 35 {\tt kworker} processes.

I won't go into more details about the other processes, but if you
are interested you can search for more information about them.
You should run {\tt ps} on your system and compare your results
to mine.

%TODO: using gdb here?


\chapter{Virtual memory}

\section{A bit of information theory}

A {\bf bit} is a binary digit; it is also a unit of information.  If
you have one bit, you can specify one of two possibilities, usually
written 0 and 1.  If you have two bits, there are 4 possible
combinations, 00, 01, 10, and 11.  In general, if you have $b$ bits,
you can indicate one of $2^b$ values.  A {\bf byte} is 8 bits, so it
can hold one of 256 values.

Going in the other direction, suppose you want to store a letter
of the alphabet.  There are 26 letters, so how many bits do you
need?  With 4 bits, you can specify one of 16 values, so that's
not enough.  With 5 bits, you can specify up to 32 values, so
that's enough for all the letters, with a few values left over.

In general, if you want to specify one of $N$ values, you should
choose the smallest value of $b$ so that $2^b \ge N$.  Taking the
log base 2 of both sides yields $b \ge \log_2 N$.

Suppose I flip a coin and tell you the outcome.  I have given
you one bit of information.  If I roll a six-sided die and tell
you the outcome, I have given you $\log_2 6$ bits of information.
And in general, if the probability of the outcome is 1 in $N$,
then the outcome contains $\log_2 N$ bits of information.

Equivalently, if the probability of the outcome is $p$, then
the information content is $-\log_2 p$.  This quantity is called
the {\bf self-information} of the outcome.  It measures
how surprising the outcome is, which is why it is also called
{\bf surprisal}.  If your horse has only one chance in 16 of winning,
and he wins, you get 4 bits of information (along with the
payout).  But if the favorite wins 75\% of the time, the news
of the win contains only 0.42 bits.

Intuitively, unexpected news carries a lot of information;
conversely, if there is something you were already confident of,
confirming it contributes only a small amount of information.

For several topics in this book, we will need to be comfortable
converting back and forth between the number of bits, $b$, and the
number of values they can encode, $N = 2^b$.

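If you want to check these calculations yourself, here is a short
program (not part of the book's example code) that computes the
smallest $b$ such that $2^b \ge N$, and the self-information of an
outcome with probability $p$.  It uses {\tt log2} from the C math
library, so it has to be linked with {\tt -lm}.

\begin{verbatim}
#include <math.h>
#include <stdio.h>

/* Smallest number of bits b such that 2^b >= n. */
int bits_needed(double n)
{
    return (int) ceil(log2(n));
}

/* Self-information, in bits, of an outcome with probability p. */
double self_information(double p)
{
    return -log2(p);
}

int main()
{
    printf("26 letters:         %d bits\n", bits_needed(26));
    printf("die roll:           %.3f bits\n", self_information(1.0 / 6));
    printf("1-in-16 long shot:  %.3f bits\n", self_information(1.0 / 16));
    printf("75%% favorite:       %.3f bits\n", self_information(0.75));
    return 0;
}
\end{verbatim}

If you save this as {\tt bits.c} (the name is arbitrary), you can
compile and run it with {\tt gcc bits.c -o bits -lm} and {\tt ./bits};
the last two lines should report 4 bits for the long shot and about
0.42 bits for the favorite.
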
\section{Memory and storage}

While a process is running, most of its data is held in {\bf main
memory}, which is usually some kind of random access memory (RAM).
On most current computers, main memory is {\bf volatile}, which means
that when the computer shuts down, the contents of main memory are
lost.  A typical desktop computer has 2--8 GiB of memory.  GiB stands
for ``gibibyte,'' which is $2^{30}$ bytes.

If the process reads and writes files, those files are usually stored
on a hard disk drive (HDD) or solid state drive (SSD).  These storage
devices are {\bf non-volatile}, so they are used for long-term
storage.  Currently a typical desktop computer has an HDD with a
capacity of 500 GB to 2 TB.  GB stands for ``gigabyte,'' which is
$10^9$ bytes.  TB stands for ``terabyte,'' which is $10^{12}$ bytes.

You might have noticed that I used the binary unit GiB for the size of
main memory and the decimal units GB and TB for the size of the HDD.
For historical and technical reasons, memory is measured in binary
units, and disk drives are measured in decimal units.  In this book I
will be careful to distinguish binary and decimal units, but you
should be aware that the word ``gigabyte'' and the abbreviation GB are
often used ambiguously.

In casual use, the term ``memory'' is sometimes used for HDDs and SSDs
as well as RAM, but the properties of these devices are very
different, so we will need to distinguish them.  I will use
{\bf storage} to refer to HDDs and SSDs.


\section{Address spaces}

Each byte in main memory is specified by an integer {\bf physical
address}.  The set of valid physical addresses is called the physical
{\bf address space}.  It usually runs from 0 to $N-1$, where $N$ is
the size of main memory.  On a system with 1 GiB of physical memory,
the highest valid address is $2^{30}-1$, which is 1,073,741,823 in
decimal, or 0x3fff ffff in hexadecimal (the prefix 0x indicates a
hexadecimal number).

However, most operating systems provide {\bf virtual memory}, which
means that programs never deal with physical addresses, and don't
have to know how much physical memory is available.

Instead, programs work with {\bf virtual addresses}, which are
numbered from 0 to $M-1$, where $M$ is the number of valid virtual
addresses.  The size of the virtual address space is determined by the
operating system and the hardware it runs on.

You have probably heard people talk about 32-bit and 64-bit systems.
These terms indicate the size of the registers, which is usually also
the size of a virtual address.  On a 32-bit system, virtual addresses
are 32 bits, which means that the virtual address space runs from 0 to
0xffff ffff.  The size of this address space is $2^{32}$ bytes, or 4
GiB.

On a 64-bit system, the size of the virtual address space is $2^{64}$
bytes, or $2^4 \cdot 1024^6$ bytes.
That's 16 exbibytes, which is1121about a billion times bigger than current physical memories. It might1122seem strange that a virtual address space can be so much bigger1123than physical memory, but we will see soon how that works.11241125When a program reads and writes values in memory, it generates virtual1126addresses. The hardware, with help from the operating system,1127translates to physical addresses before accessing main memory. This1128translation is done on a per-process basis, so even if two processes1129generate the same virtual address, they would map to different1130locations in physical memory.11311132Thus, virtual memory is one important way the operating system1133isolates processes from each other. In general, a process cannot1134access data belonging to another process, because there is no1135virtual address it can generate that maps to physical memory1136allocated to another process.1137113811391140\section{Memory segments}11411142The data of a running process is organized into five segments:11431144\begin{itemize}11451146\item The {\bf code segment} contains the program text; that is, the1147machine language instructions that make up the program.11481149\item The {\bf static segment} contains immutable values, like string1150literals. For example, if your program contains the string1151{\tt "Hello, World"}, those characters will be stored in the1152static segment.11531154\item The {\bf global segment} contains global variables and local variables that are declared {\tt static}.11551156\item The {\bf heap segment} contains chunks of memory allocated1157at run time, most often by calling the C library function1158{\tt malloc}.11591160\item The {\bf stack segment} contains the call stack, which is a1161sequence of stack frames. Each time a function is called, a stack1162frame is allocated to contain the1163parameters and local variables of the function. When the function1164completes, its stack frame is removed from the stack.11651166\end{itemize}11671168The arrangement of these segments is determined partly by the1169compiler and partly by the operating system. The details vary1170from one system to another, but in the most common arrangement:11711172\begin{itemize}11731174\item The text segment is near the ``bottom'' of memory, that is,1175at addresses near 0.11761177\item The static segment is often just above the text segment, that is,1178at higher addresses.11791180\item The global segment is often just above the static segment.11811182\item The heap is often above the global segment. As it expands,1183it grows up toward larger addresses.11841185\item The stack is near the top of memory; that is, near the1186highest addresses in the virtual address space. 
As the1187stack expands, it grows down toward smaller addresses.11881189\end{itemize}11901191%TODO: Figure out how to handle the code that is in both ExercisesInC1192% and the repo for the book.11931194To determine the layout of these segments on your system, try running1195this program, which is in {\tt aspace.c} in the repository for this1196book (see Section~\ref{code}).11971198\begin{verbatim}1199#include <stdio.h>1200#include <stdlib.h>12011202int global;12031204int main ()1205{1206int local = 5;1207void *p = malloc(128);1208char *s = "Hello, World";12091210printf ("Address of main is %p\n", main);1211printf ("Address of global is %p\n", &global);1212printf ("Address of local is %p\n", &local);1213printf ("p points to %p\n", p);1214printf ("s points to %p\n", s);1215}1216\end{verbatim}12171218{\tt main} is the name of a function; when it is used as a variable,1219it refers to the address of the first machine language instruction1220in {\tt main}, which we expect to be in the text segment.12211222{\tt global} is a global variable, so we expect it to be in the1223global segment. {\tt local} is a local variable, so we expect it1224to be on the stack.12251226{\tt s} refers to a ``string literal", which is a string that appears1227as part of the program (as opposed to a string that is read from a file,1228input by a user, etc.). We expect the location of the string to be1229in the static segment (as opposed to the pointer, {\tt s}, which is1230a local variable).12311232{\tt p} contains an address returned by {\tt malloc}, which allocates1233space in the heap. ``malloc'' stands for ``memory allocate.''12341235The format sequence \verb"%p" tells {\tt printf} to format each1236address as a ``pointer'', so it displays the results in hexadecimal.12371238When I run this program, the output looks like this (I added spaces1239to make it easier to read):12401241\begin{verbatim}1242Address of main is 0x 40057d1243Address of global is 0x 60104c1244Address of local is 0x7ffe6085443c1245p points to 0x 16c30101246s points to 0x 4006a412471248\end{verbatim}12491250As expected, the address of {\tt main} is the lowest, followed by1251the location of the string literal. The location of1252{\tt global} is next, then the address {\tt p} points to.1253The address of {\tt local} is much bigger.12541255The largest address has 12 hexadecimal digits. Each hex digit1256corresponds to 4 bits, so it is a 48-bit address. That suggests1257that the usable part of the virtual address space is $2^{48}$ bytes.12581259As an exercise, run this program on your computer and compare your1260results to mine. Add a second call to {\tt malloc} and check whether1261the heap on your system grows up (toward larger addresses). Add a1262function that prints the address of a local variable, and check1263whether the stack grows down.126412651266\section{Static local variables}12671268Local variables on the stack are sometimes called {\bf automatic},1269because they are allocated automatically when a function is called,1270and freed automatically when the function returns.12711272In C there is another kind of local variable, called {\bf static},1273which is allocated in the global segment. 
It is initialized when1274the program starts and keeps its value from one function call to1275the next.12761277For example, the following function keeps track of how many times1278it has been called.12791280\begin{verbatim}1281int times_called()1282{1283static int counter = 0;1284counter++;1285return counter;1286}1287\end{verbatim}12881289The keyword {\tt static} indicates that {\tt counter} is a static1290local variable. The initialization happens only once, when the program1291starts.12921293If you add this function to {\tt aspace.c} you can confirm that1294{\tt counter} is allocated in the global segment along with global1295variables, not in the stack.129612971298\section{Address translation}1299\label{address_translation}13001301How does a virtual address (VA) get translated to a physical address (PA)?1302The basic mechanism is simple, but a simple1303implementation would be too slow and take too much space. So actual1304implementations are a bit more complicated.13051306\begin{figure}1307\centerline{\includegraphics[width=3in]{figs/address_translation.pdf}}1308\caption{Diagram of the address translation process.}1309\label{addtrans}1310\end{figure}13111312Most processors provide a memory management unit (MMU) that sits between the CPU and main memory. The MMU performs fast translation between VAs and PAs.13131314\begin{enumerate}13151316\item When a program reads or writes a variable, the CPU generates a1317VA.13181319\item The MMU splits the VA into two parts, called the page number and1320the offset. A ``page'' is a chunk of memory; the size of a page1321depends on the operating system and the hardware, but common sizes1322are 1--4 KiB.13231324\item The MMU looks up the page number in the translation lookaside buffer (TLB) and gets the corresponding physical page number. Then it combines1325the physical page number with the offset to produce a PA.13261327\item The PA is passed to main memory, which reads or writes the given1328location.13291330\end{enumerate}13311332The TLB contains cached copies of data from the page table (which is stored in kernel memory). The page table contains the mapping from virtual page numbers to physical page numbers. Since each process has its own page table, the TLB has to make sure it only uses entries from the page table of the process that's running.13331334Figure~\ref{addtrans} shows a diagram of this process.1335To see how it all works, suppose that the VA is 32 bits and the physical memory is 1 GiB, divided into 1 KiB pages.13361337\begin{itemize}13381339\item Since 1 GiB is $2^{30}$ bytes and 1 KiB is $2^{10}$ bytes, there1340are $2^{20}$ physical pages, sometimes called ``frames.''13411342\item The size of the virtual address space is $2^{32}$ B and the size1343of a page is $2^{10}$ B, so there are $2^{22}$ virtual pages.13441345\item The size of the offset is determined by the page size. In this1346example the page size is $2^{10}$ B, so it takes 10 bits to specify1347a byte on a page.13481349\item If a VA is 32 bits and the offset is 10 bits, the remaining135022 bits make up the virtual page number.13511352\item Since there are $2^{20}$ physical pages, each physical page1353number is 20 bits. Adding in the 10 bit offset, the resulting1354PAs are 30 bits.13551356\end{itemize}13571358So far this all seems feasible. But let's think about how big a page1359table might have to be. 
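
Before we do, here is a sketch (mine, not from the book's repository)
of what this translation looks like in C, using the numbers from the
example and the simplest possible page table, a flat array indexed by
virtual page number:

\begin{verbatim}
#include <stdio.h>

#define PAGE_BITS  10                 /* 1 KiB pages */
#define VPN_BITS   22                 /* 32-bit virtual addresses */
#define NUM_VPAGES (1 << VPN_BITS)    /* 2^22 virtual pages */

/* Entry i holds the physical page number for virtual page i.
   A real entry would also hold permission and validity bits. */
unsigned page_table[NUM_VPAGES];

unsigned translate(unsigned va)
{
    unsigned vpn = va >> PAGE_BITS;                 /* top 22 bits */
    unsigned offset = va & ((1 << PAGE_BITS) - 1);  /* low 10 bits */
    unsigned ppn = page_table[vpn];                 /* 20-bit frame number */
    return (ppn << PAGE_BITS) | offset;             /* 30-bit physical address */
}

int main()
{
    page_table[3] = 17;   /* pretend virtual page 3 maps to frame 17 */
    printf("VA 0x%x maps to PA 0x%x\n", 0xc05u, translate(0xc05u));
    return 0;
}
\end{verbatim}

The page table in this sketch is just one big array, which raises the
question of how much memory such an array needs.
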
The simplest implementation of a page1360table is an array with one entry for each virtual page.1361Each entry would contain a physical page number, which is 20 bits1362in this example, plus some additional information about each1363frame. So we expect 3--4 bytes per entry. But with $2^{22}$ virtual pages,1364the page table would require $2^{24}$ bytes, or 16 MiB.13651366And since we need a page table for each process, a system running1367256 processes would need $2^{32}$ bytes, or 4 GiB, just for page tables!1368And that's just with 32-bit virtual addresses. With 48- or 64-bit1369VAs, the numbers are ridiculous.13701371Fortunately, we don't actually need that much space, because1372most processes don't use even a small fraction of their1373virtual address space. And if a process doesn't use a virtual1374page, we don't need an entry in the page table for it.13751376Another way to say the same thing is that page tables are ``sparse'',1377which implies that the simple implementation, an array of page1378table entries, is a bad idea. Fortunately, there are several1379good implementations for sparse arrays.13801381One option is a multilevel page table, which is what many operating1382systems, including Linux, use. Another option is an associative table, where each entry includes both the virtual page number and the physical page number. Searching an associative table can be slow in software, but in hardware we1383can search the entire table in parallel, so associative arrays are1384often used to represent the page table entries in the TLB.13851386You can read more about these implementations at1387\url{http://en.wikipedia.org/wiki/Page_table}; you might find the1388details interesting. But the fundamental idea is that page tables are1389sparse, so we have to choose a good implementation for sparse arrays.13901391I mentioned earlier that the operating system can interrupt a running1392process, save its state, and then run another process. This mechanism1393is called a {\bf context switch}. Since each process has its own1394page table, the operating system has to work with the MMU to make1395sure each process gets the right page table. In older machines,1396the page table information in the MMU had to be replaced during every1397context switch, which was expensive. In newer systems, each page1398table entry in the MMU includes the process ID, so page tables from1399multiple processes can be in the MMU at the same time.1400140114021403\chapter{Files and file systems}14041405When a process completes (or crashes), any data stored in main1406memory is lost. But data stored on a hard disk drive (HDD) or1407solid state drive (SSD) is ``persistent;'' that is, it survives1408after the process completes, even if the computer shuts down.14091410Hard disk drives are complicated. Data is stored in blocks, which1411are laid out in sectors, which make up tracks, which are arranged1412in concentric circles on platters.14131414Solid state drives are simpler in one sense, because blocks are1415numbered sequentially, but they raise a different complication: each1416block can be written a limited number of times before it becomes1417unreliable.14181419As a programmer, you don't want to deal with these complications.1420What you want is an appropriate abstraction of persistent storage1421hardware. 
The most common abstraction is called a ``file system.''14221423Abstractly:14241425\begin{itemize}14261427\item A ``file system'' is a mapping from each file's name to its contents.1428If you think of the names as keys, and the contents as values,1429a file system is a kind of key-value database1430(see \url{https://en.wikipedia.org/wiki/Key-value_database}).14311432\item A ``file'' is a sequence of bytes.14331434\end{itemize}14351436File names are usually strings, and they are usually ``hierarchical'';1437that is, the string specifies a path from a top-level directory (or1438folder), through a series of subdirectories, to a specific file.14391440The primary difference between the abstraction and the underlying1441mechanism is that files are byte-based and persistent storage is1442block-based. The operating system translates byte-based file operations1443in the C library into block-based operations on storage devices.1444Typical block sizes are 1--8 KiB.14451446For example, the following code opens a file and reads the first byte:14471448\begin{verbatim}1449FILE *fp = fopen("/home/downey/file.txt", "r");1450char c = fgetc(fp);1451fclose(fp);1452\end{verbatim}14531454When this code runs:14551456\begin{enumerate}14571458\item {\tt fopen} uses the filename to find the top-level directory,1459called \verb"/", the subdirectory {\tt home}, and the1460sub-subdirectory {\tt downey}.14611462\item It finds the file named {\tt file.txt} and ``opens'' it for1463reading, which means it creates a data structure that represents the1464file being read. Among other things, this data structure1465keeps track of how much of the file has been read, called the ``file1466position''.14671468In DOS, this data structure is called a File Control Block, but I1469want to avoid that term because in UNIX it means something else. In1470UNIX, there seems to be no good name for it. It is an entry in the1471open file table, so I will call it an OpenFileTableEntry.14721473\item When we call {\tt fgetc}, the operating system checks whether1474the next character of the file is already in memory. If so, it1475reads the next character, advances the file position, and returns1476the result.14771478\item If the next character is not in memory, the operating1479system issues an I/O request to get the next block. Disk drives are1480slow, so a process waiting for a block from disk is usually1481interrupted so another process can run until the data arrives.14821483\item When the I/O operation is complete, the new block of data is1484stored in memory, and the process resumes. It reads the first1485character and stores it as a local variable.14861487\item When the process closes the file, the operating system completes1488or cancels any pending operations, removes data stored in1489memory, and frees the OpenFileTableEntry.14901491\end{enumerate}14921493The process for writing a file is similar, but there are some1494additional steps. Here is an example that opens a file for1495writing and changes the first character.14961497\begin{verbatim}1498FILE *fp = fopen("/home/downey/file.txt", "w");1499fputc('b', fp);1500fclose(fp);1501\end{verbatim}15021503When this code runs:15041505\begin{enumerate}15061507\item Again, {\tt fopen} uses the filename to find the file. 
If it1508does not already exist, it creates a new file and adds an entry in1509the parent directory, {\tt /home/downey}.15101511\item The operating system creates an OpenFileTableEntry that1512indicates that the file is open for writing, and sets the file1513position to 0.15141515\item {\tt fputc} attempts to write (or re-write) the first byte of1516the file. If the file already exists, the operating system has to1517load the first block into memory. Otherwise it allocates a new1518block in memory and requests a new block on disk.15191520\item After the block in memory is modified, it might not be copied1521back to the disk right away. In general, data written to a file is1522``buffered'', which means it is stored in memory and only written to1523disk when there is at least one block to write.15241525\item When the file is closed, any buffered data is written to disk1526and the OpenFileTableEntry is freed.15271528\end{enumerate}15291530To summarize, the C library provides the abstraction of a file1531system that maps from file names to streams of bytes. This abstraction1532is built on top of storage devices that are actually organized1533in blocks.153415351536\section{Disk performance}15371538\newcommand{\mus}{$\mu$s~}15391540I mentioned earlier that disk drives are slow. On current HDDs, the1541average time to read a block from disk to memory might be 5--25 ms1542(see \url{https://en.wikipedia.org/wiki/Hard_disk_drive_performance_characteristics}).1543SSDs are faster, taking 25 \mus to read a 4 KiB block and 250 \mus to1544write one (see \url{http://en.wikipedia.org/wiki/Ssd#Controller}).15451546To put these numbers in perspective, let's compare them to the clock1547cycle of the CPU. A processor with clock rate 2 GHz completes one1548clock cycle every 0.5 ns. The time to get a byte from memory to1549the CPU is typically around 100 ns. If the processor completes one1550instruction per clock cycle, it would complete 200 instructions1551while waiting for a byte from memory.15521553In one microsecond, it would complete 2000 instructions,1554so while waiting 25 \mus for a byte from an SSD, it would complete 50,000.15551556In one millisecond, it would complete 2,000,000 instructions,1557so while waiting 20 ms for a byte from a HDD, it might complete155840 million. If there's nothing for the CPU to do while it waits,1559it would be idle. That's why the operating system generally1560switches to another process while it is waiting for data from disk.15611562The gap in performance between main memory and persistent storage is1563one of the major challenges of computer system design. Operating1564systems and hardware provide several features intended to ``fill in''1565this gap:15661567\begin{itemize}15681569\item Block transfers: The time it takes to load a single byte from1570disk is 5--25 ms. By comparison, the additional time to load an 81571KiB block is negligible. So systems generally try to read large1572blocks each time they access the disk.15731574\item Prefetching: Sometimes the operating system can predict that a1575process will read a block and start loading it before it is1576requested. For example, if you open a file and read the first1577block, there is a good chance you will go on to read the second1578block. 
The operating system might start loading additional blocks1579before they are requested.15801581\item Buffering: As I mentioned, when you write a file, the operating1582system stores the data in memory and only writes it to disk later.1583If you modify the block several times while it is in memory, the1584system only has to write it to disk once.15851586\item Caching: If a process has used a block recently, it is likely to1587use it again soon. If the operating system keeps a copy of the1588block in memory, it can handle future requests at memory speed.15891590\end{itemize}15911592Some of these features are also implemented in hardware. For example,1593some disk drives provide a cache that stores recently-used blocks,1594and many disk drives read more than one block at a time, even if only1595one is requested.15961597These mechanisms generally improve the performance of1598programs, but they don't change the behavior. Usually programmers1599don't have to think about them, with two exceptions: (1) if the1600performance of a program is unexpectedly bad, you might have to know1601something about these mechanisms to diagnose the problem, and (2)1602when data is buffered, it can be harder to debug a program. For1603example, if a program prints a value and then crashes, the value1604might not appear, because it might be in a buffer. Similarly, if a1605program writes data to disk and then the computer loses power, the1606data might be lost if it is in a cache and not yet on disk.160716081609\section{Disk metadata}16101611The blocks that make up a file might be arranged contiguously on1612disk, and file system performance is generally better if they are,1613but most operating systems don't require contiguous allocation.1614They are free to place a block anywhere on disk, and they use1615various data structures to keep track of them.16161617In many UNIX file systems, that data structure is called an ``inode,''1618which stands for ``index node''. More generally, information about1619files, including the location of their blocks, is called ``metadata''.1620(The content of the file is data, so information about the file is1621data about data, hence ``meta''.)16221623Since inodes reside on disk along with the rest of the data, they are1624designed to fit neatly into disk blocks. A UNIX inode contains1625information about a file, including the user ID of the file owner;1626permission flags indicating who is allowed to read, write, or execute1627it; and timestamps that indicate when it was last modified and1628accessed. In addition, it contains block numbers for the first 121629blocks that make up the file.16301631If the block size is 8 KiB, the first 12 blocks make up 96 KiB.1632On most systems, that's big enough for a large majority of files,1633but it's definitely not big enough for all of them. That's1634why the inode also contains a pointer to an ``indirection block'',1635which contains nothing but pointers to other blocks.16361637The number of pointers in an indirection block depends on the sizes of1638the blocks and the block numbers, but it is often 1024. With 10241639block numbers and 8 KiB blocks, an indirection block can address 81640MiB. That's big enough for all but the largest files, but still not1641big enough for all.16421643That's why the inode also contains a pointer to a ``double indirection1644block'', which contains pointers to indirection blocks. 
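
To make these capacities concrete, here is a minimal sketch (assuming
8 KiB blocks and 1024 block numbers per indirection block, as in the
example above; real file systems vary) that computes how much data the
direct blocks and a single indirection block can address:

\begin{verbatim}
#include <stdio.h>

int main(void)
{
    long block_size = 8192;        /* 8 KiB blocks */
    long ptrs_per_block = 1024;    /* block numbers per indirection block */

    long direct = 12 * block_size;              /* the first 12 blocks */
    long single = ptrs_per_block * block_size;  /* one indirection block */

    printf("direct blocks reach %ld KiB\n", direct / 1024);
    printf("single indirection reaches %ld MiB\n", single / (1024 * 1024));

    /* each additional level of indirection multiplies the reach by 1024 */
    return 0;
}
\end{verbatim}

It prints 96 KiB and 8 MiB, matching the figures above.
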
With 1024 indirection blocks, we can address 8 GiB.

And if that's not big enough, there is (finally) a triple indirection
block, which contains pointers to double indirection blocks, yielding
a maximum file size of 8 TiB.  When UNIX inodes were designed, that
seemed big enough to serve for a long time.  But that was a long time
ago.

As an alternative to indirection blocks, some file systems, like FAT,
use a File Allocation Table that contains one entry for each block,
called a ``cluster'' in this context.  A root directory contains a
pointer to the first cluster in each file.  The FAT entry for each
cluster points to the next cluster in the file, similar to a linked
list.  For more details, see
\url{http://en.wikipedia.org/wiki/File_Allocation_Table}.


\section{Block allocation}

File systems have to keep track of which blocks belong to each file;
they also have to keep track of which blocks are available for use.
When a new file is created, the file system finds an available
block and allocates it.  When a file is deleted, the file system
makes its blocks available for re-allocation.

The goals of the block allocation system are:

\begin{itemize}

\item Speed: Allocating and freeing blocks should be fast.

\item Minimal space overhead: The data structures used by the allocator
should be small, leaving as much space as possible for data.

\item Minimal fragmentation: If some blocks are left unused, or some
are only partially used, the unused space is called
``fragmentation''.

\item Maximum contiguity: Data that is likely to be used at the same
time should be physically contiguous, if possible, to improve
performance.

\end{itemize}

It is hard to design a file system that achieves all of these
goals, especially since file system performance depends on
``workload characteristics'' like file sizes, access
patterns, etc.  A file system that is well tuned for one workload
might not perform as well for another.

For this reason, most operating systems support several kinds of file
systems, and file system design is an active area of research and
development.  In the last decade, Linux systems have migrated
from ext2, which was a conventional UNIX file system, to ext3,
a ``journaling'' file system intended to improve speed and
contiguity, and more recently to ext4, which can handle larger files
and file systems.  Within the next few years, there might be
another migration to the B-tree file system, Btrfs.


\section{Everything is a file?}

The file abstraction is really a ``stream of bytes'' abstraction,
which turns out to be useful for many things, not just file systems.

One example is the UNIX pipe, which is a simple form of inter-process
communication.  Processes can be set up so that output from one
process is taken as input into another process.  For the first
process, the pipe behaves like a file open for writing, so it
can use C library functions like {\tt fputs} and {\tt fprintf}.
For the second process, the pipe behaves like a file open for
reading, so it uses {\tt fgets} and {\tt fscanf}.

Network communication also uses the stream of bytes abstraction.
A UNIX socket is a data structure that represents a communication
channel between processes on different computers (usually).
Again,1721processes can read data from and write data to a socket using1722``file'' handling functions.17231724Reusing the file abstraction makes life easier for programmers, since1725they only have to learn one API (application program interface).1726It also makes programs more versatile, since a program intended to1727work with files can also work with data coming from pipes and other1728sources.17291730% TODO: gprof here?173117321733\chapter{More bits and bytes}17341735\section{Representing integers}17361737You probably know that computers represent numbers in1738base 2, also known as binary. For positive numbers, the binary1739representation is straightforward; for example, the representation1740for $5_{10}$ is $b101$.17411742For negative numbers, the most obvious representation uses1743a sign bit to indicate whether a number is positive or negative.1744But there is another representation, called ``two's complement''1745that is much more common because it is easier to work with1746in hardware.17471748To find the two's complement of a negative number, $-x$, find1749the binary representation of $x$, flip all the bits, and add 1.1750For example, to represent $-5_{10}$, start with the representation1751of $5_{10}$, which is $b0000 0101$ if we write the 8-bit version.1752Flipping all the bits and adding 1 yields $b1111 1011$.17531754In two's complement, the leftmost bit acts like a sign bit;1755it is 0 for positive numbers and 1 for negative numbers.17561757To convert from an 8-bit number to 16-bits, we have to add1758more 0's for a positive number and add 1's for a negative number.1759In effect, we have to copy the sign bit into the new bits.1760This process is called ``sign extension''.17611762In C all integer types are signed (able to represent positive and1763negative numbers) unless you declare them {\tt unsigned}. The1764difference, and the reason this declaration is important, is that1765operations on unsigned integers don't use sign extension.176617671768\section{Bitwise operators}17691770People learning C are sometimes confused1771about the bitwise operators \verb"&" and \verb"|". These1772operators treat integers as bit vectors and compute logical1773operations on corresponding bits.17741775For example, \verb"&" computes the AND operation, which yields17761 if both operands are 1, and 0 otherwise. Here is an example1777of \verb"&" applied to two 4-bit numbers:1778%1779\begin{verbatim}178011001781& 10101782----178310001784\end{verbatim}1785%1786In C, this means that the expression \verb"12 & 10" has the1787value 8.17881789Similarly, \verb"|" computes the OR operation, which yields17901 if either operand is 1, and 0 otherwise.1791%1792\begin{verbatim}179311001794| 10101795----179611101797\end{verbatim}1798%1799So the expression \verb"12 | 10" has the value 14.18001801Finally, \verb"^" computes the XOR operation, which yields18021 if either operand is 1, but not both.1803%1804\begin{verbatim}180511001806^ 10101807----180801101809\end{verbatim}1810%1811So the expression \verb"12 ^ 10" has the value 6.18121813Most commonly, \verb"&" is used to clear a set of bits from1814a bit vector, \verb"|" is used to set bits, and \verb"^"1815is used to flip, or ``toggle'' bits. 
Here are the details:18161817{\bf Clearing bits}: For any value $x$, $x \& 0$ is 0, and $x \& 1$ is $x$.1818So if you AND a vector with 3, it1819selects only the two rightmost bits, and sets the rest to 0.1820%1821\begin{verbatim}1822xxxx1823& 00111824----182500xx1826\end{verbatim}1827%1828In this context, the value 3 is called a ``mask'' because it1829selects some bits and masks the rest.18301831{\bf Setting bits}: Similarly, for any $x$, $x | 0$ is x, and $x | 1$ is $1$.1832So if you OR a vector with 3, it sets the rightmost1833bits, and leaves the rest alone:1834%1835\begin{verbatim}1836xxxx1837| 00111838----1839xx111840\end{verbatim}1841%1842{\bf Toggling bits}: Finally, if you XOR a vector with 3, it flips the1843rightmost bits and leaves the rest alone. As an exercise, see if you1844can compute the two's complement of 12 using \verb"^". Hint: what's1845the two's complement representation of -1?18461847% (12 ^ -1) + 118481849C also provides shift operators, {\tt <<} and {\tt >>}, which shift1850bits left and right. Each left shift doubles a number, so1851{\tt 5 << 1} is 10, and {\tt 5 << 2} is 20. Each right shift1852divides by two (rounding down), so {\tt 5 >> 1} is 2 and1853{\tt 2 >> 1} is 1.185418551856\section{Representing floating-point numbers}18571858Floating-point numbers are represented using the binary1859version of scientific notation. In decimal notation, large1860numbers are written as the product of a coefficient and 10 raised1861to an exponent. For example, the speed of light in m/s is1862approximately $2.998 \cdot 10^8$.18631864Most computers use the IEEE standard for floating-point1865arithmetic. The C type {\tt float} usually corresponds1866to the 32-bit IEEE standard; {\tt double} usually corresponds1867to the 64-bit standard.18681869In the 32-bit standard, the leftmost bit is the sign bit, $s$.1870The next 8 bits are the exponent, $q$, and the last 23 bits are1871the coefficient, $c$. The value of a floating-point number is1872%1873\[ (-1)^s c \cdot 2^q \]1874%1875Well, that's almost correct, but there's one more wrinkle.1876Floating-point numbers are usually normalized so that there is1877one digit before the point. For example, in base 10, we prefer1878$2.998 \cdot 10^8$ rather than $2998 \cdot 10^5$ or any other1879equivalent expression. In base 2, a normalized number always1880has the digit 1 before the binary point. Since the digit in1881this location is always 1, we can save space by leaving it1882out of the representation.18831884For example, the integer representation of $13_{10}$ is $b1101$.1885In floating point, that's $1.101 \cdot 2^3$, so the exponent1886is 3 and the part of the coefficient that would be stored1887is 101 (followed by 20 zeros).18881889Well, that's almost correct, but there's one more wrinkle.1890The exponent is stored with a ``bias''. In the 32-bit standard,1891the bias is 127, so the exponent 3 would be stored as 130.18921893To pack and unpack floating-point numbers in C, we can use a1894union and bitwise operations. 
Here's an example:1895%1896\begin{verbatim}1897union {1898float f;1899unsigned int u;1900} p;19011902p.f = -13.0;1903unsigned int sign = (p.u >> 31) & 1;1904unsigned int exp = (p.u >> 23) & 0xff;19051906unsigned int coef_mask = (1 << 23) - 1;1907unsigned int coef = p.u & coef_mask;19081909printf("%d\n", sign);1910printf("%d\n", exp);1911printf("0x%x\n", coef);1912\end{verbatim}1913%1914This code is in {\tt float.c} in the repository for this1915book (see Section~\ref{code}).19161917The union allows us to store a floating-point value using1918{\tt p.f} and then read it as an unsigned integer using1919{\tt p.u}.19201921To get the sign bit, we shift the bits to the right 311922places and then use a 1-bit mask to select only the1923rightmost bit.19241925To get the exponent, we shift the bits 23 places, then select the1926rightmost 8 bits (the hexadecimal value {\tt 0xff} has eight 1's).19271928To get the coefficient, we need to extract the 23 rightmost bits1929and ignore the rest. We do that by making a mask with 1s in the193023 rightmost places and 0s on the left. The easiest way to do that1931is by shifting 1 to the left by 23 places and then subtracting 1.19321933The output of this program is:1934%1935\begin{verbatim}19361193713019380x5000001939\end{verbatim}1940%1941As expected, the sign bit for a negative number is 1. The exponent1942is 130, including the bias. And the coefficient, which I printed in1943hexadecimal, is $101$ followed by 20 zeros.19441945As an exercise, try assembling or disassembling a {\tt double}, which1946uses the 64-bit standard. See1947\url{http://en.wikipedia.org/wiki/IEEE_floating_point}.194819491950\section{Unions and memory errors}19511952There are two common uses of C unions. One, which we saw in the1953previous section, is to access the binary representation of data.1954Another is to store heterogeneous data. For example, you could1955use a union to represent a number that might be an integer, float,1956complex, or rational number.19571958However, unions are error-prone. It is up to you, as the programmer,1959to keep track of what type of data is in the union; if you write1960a floating-point value and then interpret it as an integer, the result1961is usually nonsense.19621963Actually, the same thing can happen if you read a location in memory1964incorrectly. One way that can happen is if you read past the end of1965an array.19661967To see what happens, I'll start with a function that allocates an1968array on the stack and fills it with the numbers from 0 to 99.19691970\begin{verbatim}1971void f1() {1972int i;1973int array[100];19741975for (i=0; i<100; i++) {1976array[i] = i;1977}1978}1979\end{verbatim}19801981Next I'll define a function that creates a smaller array and1982deliberately accesses elements before the beginning and after1983the end:19841985\begin{verbatim}1986void f2() {1987int x = 17;1988int array[10];1989int y = 123;19901991printf("%d\n", array[-2]);1992printf("%d\n", array[-1]);1993printf("%d\n", array[10]);1994printf("%d\n", array[11]);1995}1996\end{verbatim}19971998If I call {\tt f1} and then {\tt f2}, I get these results:19992000\begin{verbatim}20011720021232003982004992005\end{verbatim}20062007The details here depend on the compiler, which arranges variables2008on the stack. From these results, we can infer that the2009compiler put {\tt x} and {\tt y} next to each other, ``below''2010the array (at a lower address). 
And when we read past the
array, it looks like we are getting values that were left on
the stack by the previous function call.

In this example, all of the variables are integers, so it is
relatively easy to figure out what is going on.  But in general
when you read beyond the bounds of an array, the values you
read might have any type.  For example, if I change {\tt f1}
to make an array of floats, the results are:

\begin{verbatim}
17
123
1120141312
1120272384
\end{verbatim}

The latter two values are what you get if you interpret a
floating-point value as an integer.  If you encountered this output
while debugging, you would have a hard time figuring out what's
going on.


\section{Representing strings}

Related issues sometimes come up with strings.  First, remember
that C strings are null-terminated.  When you allocate space
for a string, don't forget the extra byte at the end.

Also, the letters {\it and numbers} in C strings are
encoded in ASCII.  The ASCII codes for the digits ``0'' through ``9''
are 48 through 57, {\it not} 0 through 9.  The ASCII code 0 is the NUL
character that marks the end of a string.  And the ASCII codes 1
through 9 are special characters used in some communication protocols.
ASCII code 7 is a bell; on some terminals, printing it makes a sound.

The ASCII code for the letter ``A'' is 65; the code for
``a'' is 97.  Here are those codes in binary:

\begin{verbatim}
65 = b0100 0001
97 = b0110 0001
\end{verbatim}

A careful observer will notice that they differ by a single
bit.  And this pattern holds for the rest of the letters; the
sixth bit (counting from the right) acts as a ``case bit'', 0 for
upper-case letters and 1 for lower-case letters.

As an exercise, write a function that takes a string and converts
from lower-case to upper-case by flipping the sixth bit.  As a challenge,
you can make a faster version by reading the string 32 or 64 bits
at a time, rather than one character at a time.  This optimization
is made easier if the length of the string is a multiple of 4 or
8 bytes.

If you read past the end of a string, you are likely to see
strange characters.  Conversely, if you write a string and
then accidentally read it as an int or float, the results
will be hard to interpret.

For example, if you run:

\begin{verbatim}
char array[] = "allen";
float *p = (float *) array;
printf("%f\n", *p);
\end{verbatim}

You will find that the ASCII representation of the first 4 characters
of my name, interpreted as a single-precision floating-point number,
is 69779713878800585457664.

% TODO: assert here?
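
Here is a small example (hypothetical, not from the book's repository)
that demonstrates the difference between a digit character and its
numeric value, and the ``case bit'' described above:

\begin{verbatim}
#include <stdio.h>

int main(void)
{
    char c = '5';

    /* the character '5' is ASCII code 53, not the number 5 */
    printf("'%c' has ASCII code %d\n", c, c);
    printf("its numeric value is %d\n", c - '0');

    /* 'A' is 65 and 'a' is 97; the difference is the sixth bit,
       which has value 32 */
    printf("'a' - 'A' = %d\n", 'a' - 'A');

    return 0;
}
\end{verbatim}
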
\chapter{Memory management}

C provides 4 functions for dynamic memory allocation:

\begin{itemize}

\item {\tt malloc}, which takes an integer size, in bytes, and returns
a pointer to a newly-allocated chunk of memory with (at least) the
given size.  If it can't satisfy the request, it returns
the special pointer value NULL.

\item {\tt calloc}, which is the same as {\tt malloc} except that
it also clears the newly allocated chunk; that
is, it sets all bytes in the chunk to 0.

\item {\tt free}, which takes a pointer to a previously allocated
chunk and deallocates it; that is, it makes the space available for
future allocation.

\item {\tt realloc}, which takes a pointer to a previously allocated
chunk and a new size.  It allocates a chunk of memory with the new
size, copies data from the old chunk to the new, frees the old chunk,
and returns a pointer to the new chunk.

\end{itemize}

This API is notoriously error-prone and unforgiving.  Memory management
is one of the most challenging parts of designing large software systems,
which is why most modern languages provide higher-level memory
management features like garbage collection.


\section{Memory errors}

The C memory management API is a bit like Jasper Beardly, a minor
character on the animated television program {\it The Simpsons};
in a few episodes, he appears as a strict substitute teacher who
imposes corporal punishment --- a ``paddlin''' --- for all infractions.

Here are some of the things a program can do that deserve a paddling:

\begin{itemize}

\item If you access (read or write) any chunk that has not been
allocated, that's a paddling.

\item If you free an allocated chunk and then access it, that's
a paddling.

\item If you try to free a chunk that has not been allocated,
that's a paddling.

\item If you free the same chunk more than once, that's a paddling.

\item If you call {\tt realloc} with a chunk that was not allocated,
or was allocated and then freed, that's a paddling.

\end{itemize}

It might not sound difficult to follow these rules, but in a large
program a chunk of memory might be allocated in one part of the
program, used in several other parts, and freed in yet another
part.  So changes in one part of the program can require changes
in many other parts.

Also, there might be many aliases, or references to the same allocated
chunk, in different parts of the program.  The chunk should not be
freed until all references to the chunk are no longer in use.
Getting this right often requires careful analysis across all parts
of the program, which is difficult and contrary to fundamental
principles of good software engineering.

Ideally, every function that allocates memory should include, as part
of the documented interface, information about how that memory is supposed
to be freed.  Mature libraries often do this well, but in the real world,
software engineering practice often falls short of this ideal.

To make matters worse, memory errors can be difficult
to find because the symptoms are unpredictable.  For example:

\begin{itemize}

\item If you read a value from an unallocated chunk, the system
{\em might} detect the error, trigger a runtime error called a
``segmentation fault'', and stop the program.
Or, the program might read unallocated memory without detecting the error; in that case, the value it gets is whatever happened to be stored at the accessed location, which is unpredictable, and might be different each time the program runs.21682169\item If you write a value to an unallocated chunk, and don't get a segmentation fault, things are even worse. After you write a value to an invalid location, a long time might pass before it is read and causes problems. At that point it will be very difficult to find the source of the problem.21702171\end{itemize}21722173And things can be even worse than that! One of the most common2174problems with C-style memory management is that the data structures2175used to implement {\tt malloc} and {\tt free} (which we will see soon)2176are often stored along with the allocated chunks. So if you2177accidentally write past the end of a dynamically-allocated chunk, you2178are likely to mangle these data structures. The system usually won't2179detect the problem until later, when you call {\tt malloc} or2180{\tt free}, and those functions fail in some inscrutable way.21812182One conclusion you should draw from this is that safe memory2183management requires design and discipline. If you write a library2184or module that allocates memory, you should also provide an2185interface to free it, and memory management should be part of2186the API design from the beginning.21872188If you use a library that allocates memory, you should be disciplined2189in your use of the API. For example, if the library provides2190functions to allocate and deallocate storage, you should use those2191functions and not, for example, call {\tt free} on a chunk you did not2192{\tt malloc}. And you should avoid keeping multiple references to the2193same chunk in different parts of your program.21942195Often there is a trade-off between safe memory management and performance.2196For example, the most common source of memory errors is writing2197beyond the bounds of an array. The obvious remedy for this problem2198is bounds checking; that is, every access to the array should check2199whether the index is out of bounds. High-level libraries that provide2200array-like structures usually perform bounds checking. But C arrays2201and most low-level libraries do not.220222032204\section{Memory leaks}2205\label{leak}22062207There is one more memory error that may or may not deserve a paddling.2208If you allocate a chunk of memory and never free it, that's a ``memory2209leak''.22102211For some programs, memory leaks are ok. For example, if your program2212allocates memory, performs computations on it, and then exits, it is2213probably not necessary to free the allocated memory. When the program2214exits, all of its memory is deallocated by the operating system.2215Freeing memory immediately before exiting might feel more responsible,2216but it is mostly a waste of time.22172218But if a program runs for a long time and leaks memory, its total2219memory use will increase indefinitely. At that point, a few things2220might happen:22212222\begin{itemize}22232224\item At some point, the system runs out of physical memory. On2225systems without virtual memory, the next call to {\tt malloc} will2226fail, returning NULL.22272228\item On systems with virtual memory, the operating system can move2229another process's pages from memory to disk and then allocate2230more space to the leaking process. 
I explain this mechanism2231in Section~\ref{paging}.22322233\item There might be a limit on the amount of space a single2234process can allocate; beyond that, {\tt malloc} returns NULL.22352236\item Eventually, a process might fill its virtual address space (or2237the usable part). After that, there are no more addresses to2238allocate, so {\tt malloc} returns NULL.22392240\end{itemize}22412242If {\tt malloc} returns NULL, but you persist and access2243the chunk you think you allocated, you get a segmentation fault.2244For this reason, it is considered good style to check the result from2245{\tt malloc} before using it. One option is to add a condition like2246this after every {\tt malloc} call:22472248\begin{verbatim}2249void *p = malloc(size);2250if (p == NULL) {2251perror("malloc failed");2252exit(-1);2253}2254\end{verbatim}22552256{\tt perror} is declared in {\tt stdio.h}; it prints2257an error message and additional information about the last error2258that occurred.22592260{\tt exit}, which is declared in {\tt stdlib.h}, causes the process2261to terminate. The argument is a status code that indicates how2262the process terminated. By convention, status code 0 indicates normal2263termination and -1 indicates an error condition. Sometimes other2264codes are used to indicate different error conditions.22652266Error-checking code can be a nuisance, and it makes programs2267harder to read. You can mitigate these problems by wrapping library2268function calls and their error-checking code in your own2269functions. For example, here is a {\tt malloc} wrapper that checks2270the return value.22712272\begin{verbatim}2273void *check_malloc(int size)2274{2275void *p = malloc (size);2276if (p == NULL) {2277perror("malloc failed");2278exit(-1);2279}2280return p;2281}2282\end{verbatim}22832284Because memory management is so difficult, most large programs, like2285web browsers, leak memory. To see which programs on your system are2286using the most memory, you can use the UNIX utilities {\tt ps} and2287{\tt top}.228822892290% TODO: using Valgrind here?229122922293\section{Implementation}22942295When a process starts, the system allocates space for the text segment2296and statically allocated data, space for the stack, and space for the2297heap, which contains dynamically allocated data.22982299Not all programs allocate data dynamically, so the initial size of the2300heap might be small or zero. Initially the heap contains only one2301free chunk.23022303When {\tt malloc} is called, it checks whether it can find a free2304chunk that's big enough. If not, it has to request more memory2305from the system. The function that does that is {\tt sbrk},2306which sets the ``program break'', which you can think of as a pointer2307to the end of the heap.23082309When {\tt sbrk} is called, the OS allocates new pages of physical2310memory, updates the process's page table, and sets the program2311break.23122313In theory, a program could call {\tt sbrk} directly (without using2314{\tt malloc}) and manage the heap itself. But {\tt malloc} is easier2315to use and, for most memory-use patterns, it runs fast and uses memory2316efficiently.23172318To implement the memory management API (that is, the functions2319{\tt malloc}, {\tt free}, {\tt calloc}, and {\tt realloc}),2320most Linux systems use {\tt ptmalloc},2321which is based on {\tt dlmalloc}, written by Doug Lea. 
A short paper2322that describes key elements of the implementation is2323available at \url{http://gee.cs.oswego.edu/dl/html/malloc.html}.23242325For programmers, the most important elements to be aware of are:23262327\begin{itemize}23282329\item The run time of {\tt malloc} does not usually depend on the2330size of the chunk, but might depend on how many free chunks there2331are. {\tt free} is usually fast, regardless of the number of2332free chunks. Because {\tt calloc} clears every byte in the chunk,2333the run time depends on chunk size (as well as the number of free2334chunks).23352336{\tt realloc} is sometimes fast, if the new size is smaller than the2337current size, or if space is available to expand the existing chunk.2338If not, it has to copy data from the old chunk to the new; in that2339case, the run time depends on the size of the old chunk.23402341\item Boundary tags: When {\tt malloc} allocates a chunk, it adds2342space at the beginning and end to store information about the chunk,2343including its size and the state (allocated or free). These bits of2344data are called ``boundary tags''. Using these tags, {\tt malloc}2345can get from any chunk to the previous chunk and the next chunk in2346memory. In addition, free chunks are chained into a doubly-linked2347list; each free chunk contains pointers to the next and previous2348chunks in the ``free list''.23492350The boundary tags and free list pointers make up {\tt malloc}'s2351internal data structures. These data structures are interspersed with2352program data, so it is easy for a program error to damage them.23532354\item Space overhead: Boundary tags and free list pointers take up2355space. The minimum chunk size on most systems is 16 bytes. So for2356very small chunks, {\tt malloc} is not space efficient. If your2357program requires large numbers of small structures, it might be more2358efficient to allocate them in arrays.23592360\item Fragmentation: If you allocate and free chunks with varied2361sizes, the heap will tend to become fragmented. That is, the free2362space might be broken into many small pieces. Fragmentation wastes2363space; it also slows the program down by making memory caches less2364effective.23652366\item Binning and caching: The free list is sorted by size into bins,2367so when {\tt malloc} searches for a chunk with a particular size, it2368knows what bin to search in. If you free a chunk and then2369immediately allocate a chunk with the same size, {\tt malloc} will2370usually be fast.23712372\end{itemize}237323742375\chapter{Caching}23762377%TODO: move this model of computer hardware to Chapter 12378%TODO: talk about CPU modes, either here or in Chapter 123792380\section{How programs run}23812382In order to understand caching, you have to understand how computers2383execute programs. For a deep understanding of this topic, you should2384study computer architecture. My goal in this chapter is to provide2385a simple model of program execution.23862387When a program starts, the code (or text) is usually on a hard disk2388or solid state drive. The operating system creates a new process to2389run the program, then the ``loader''2390copies the text from storage into main memory and starts the program by2391calling {\tt main}.23922393While the program is running, most of its data is stored in main2394memory, but some of the data is in registers, which are2395small units of memory on the CPU. 
These registers include:

\begin{itemize}

\item The program counter, or PC, which contains the address (in
memory) of the next instruction in the program.

\item The instruction register, or IR, which contains the machine code
instruction currently executing.

\item The stack pointer, or SP, which contains the address of the
stack frame for the current function, which contains its parameters
and local variables.

\item General-purpose registers that hold the data the program is
currently working with.

\item A status register, or flag register, that contains information
about the current computation.  For example, the flag register
usually contains a bit that is set if the result of the previous
operation was zero.

\end{itemize}

When a program is running, the CPU executes the following steps,
called the ``instruction cycle'':

\begin{itemize}

\item Fetch: The next instruction is fetched from memory and stored
in the instruction register.

\item Decode: Part of the CPU, called the ``control unit'', decodes
the instruction and sends signals to the other parts of
the CPU.

\item Execute: Signals from the control unit cause the appropriate
computation to occur.

\end{itemize}

Most computers can execute a few hundred different instructions,
called the ``instruction set''.  But most instructions fall
into a few general categories:

\begin{itemize}

\item Load: Transfers a value from memory to a register.

\item Arithmetic/logic: Loads operands from registers, performs
a mathematical operation, and stores the result in a register.

\item Store: Transfers a value from a register to memory.

\item Jump/branch: Changes the program counter, causing the flow
of execution to jump to another location in the program.  Branches
are usually conditional, which means that they check a flag
in the flag register and jump only if it is set.

\end{itemize}

Some instruction sets, including the ubiquitous x86, provide
instructions that combine a load and an arithmetic operation.

During each instruction cycle, one instruction is read from the
program text.  In addition, about half of the instructions in a
typical program load or store data.  And therein
lies one of the fundamental problems of computer architecture: the
``memory bottleneck''.

In current computers, a typical core is capable of executing an
instruction in less than 1 ns.  But the time it takes to transfer
data to and from memory is about 100 ns.  If the CPU has to wait
100 ns to fetch the next instruction, and another 100 ns to load
data, it would complete instructions 200 times slower than what's
theoretically possible.  For many computations, memory is the
speed-limiting factor, not the CPU.


\section{Cache performance}

The solution to this problem, or at least a partial solution, is
caching.
A ``cache'' is a small, fast memory that is physically close
to the CPU, usually on the same chip.

%TODO: Clean up these paragraphs

Actually, current computers typically have several levels of cache:
the Level 1 cache, which is the smallest and fastest, might be
1--2 MiB with access times near 1 ns; the Level 2 cache might have
access times near 4 ns, and the Level 3 cache might take 16 ns.

When the CPU loads a value from memory, it stores a copy in the cache.
If the same value is loaded again, the CPU gets the cached copy
and doesn't have to wait for memory.

Eventually the cache gets full.  Then, in order to bring something
new in, we have to kick something out.  So if the CPU loads a value
and then loads it again much later, it might not be in cache any more.

The performance of many programs is limited by the effectiveness
of the cache.  If the instructions and data needed by the CPU are
usually in cache, the program can run close to the full speed of
the CPU.  If the CPU frequently needs data that are not in cache,
the program is limited by the speed of memory.

The cache ``hit rate'', $h$, is the fraction of memory accesses that
find data in cache; the ``miss rate'', $m$, is the fraction of memory
accesses that have to go to memory.  If the time to process a cache
hit is $T_h$ and the time for a cache miss is $T_m$, the average time
for each memory access is
%
\[ h T_h + m T_m \]
%
Equivalently, we could define the ``miss penalty'' as the extra
time to process a cache miss, $T_p = T_m - T_h$.  Then the average access
time is
%
\[ T_h + m T_p \]
%
When the miss rate is low, the average access time can be close to
$T_h$.  That is, the program can perform as if memory ran at
cache speeds.


\section{Locality}

When a program reads a byte for the first time, the cache usually
loads a ``block'' or ``line'' of data that includes the requested
byte and some of its neighbors.  If the program goes on to read one
of the neighbors, it will already be in cache.

As an example, suppose the block size is 64 B;
you read a string with length 64, and the first
byte of the string happens to fall at the beginning of a block.  When
you load the first byte, you incur a miss penalty, but
after that the rest of the string will be in cache.  After
reading the whole string, the hit rate will be 63/64, about 98\%.
If the string spans two blocks, you would incur 2 miss penalties.  But
even then the hit rate would be 62/64, or almost 97\%.  If you then
read the same string again, the hit rate would be 100\%.

On the other hand, if the program jumps around unpredictably,
reading data from scattered locations in memory, and seldom
accessing the same location twice, cache performance would be
poor.

The tendency of a program to use the same data more than once is
called ``temporal locality''.  The tendency to use data in nearby
locations is called ``spatial locality''.  Fortunately, many
programs naturally display both kinds of locality:

\begin{itemize}

\item Most programs contain blocks of code with no jumps or
branches.
Within these blocks, instructions run2541sequentially, so the access pattern has2542spatial locality.25432544\item In a loop, programs execute the same instructions many2545times, so the access pattern has temporal locality.25462547\item The result of one instruction is often used immediately as2548an operand of the next instruction, so the data access pattern2549has temporal locality.25502551\item When a program executes a function, its parameters and local2552variables are stored together on the stack; accessing these values2553has spatial locality.25542555\item One of the most common processing patterns is to read or write2556the elements of an array sequentially; this pattern also has2557spatial locality.25582559\end{itemize}25602561The next section explores the relationship2562between a program's access pattern and cache performance.256325642565\section{Measuring cache performance}25662567When I was a graduate student at U.C. Berkeley I was a teaching2568assistant for Computer Architecture with Brian Harvey. One of my2569favorite exercises involved a program that iterates through an array2570and measures the average time to read and write an element. By2571varying the size of the array, it is possible to infer the size2572of the cache, the block size, and some other attributes.25732574My modified version of this program is in the {\tt cache} directory2575of the repository for this2576book (see Section~\ref{code}).25772578The important part of the program is this loop:25792580\begin{verbatim}2581iters = 0;2582do {2583sec0 = get_seconds();25842585for (index = 0; index < limit; index += stride)2586array[index] = array[index] + 1;25872588iters = iters + 1;2589sec = sec + (get_seconds() - sec0);25902591} while (sec < 0.1);2592\end{verbatim}25932594The inner {\tt for} loop traverses the array. {\tt limit}2595determines how much of the array it traverses; {\tt stride}2596determines how many elements it skips over. For example, if2597{\tt limit} is 16 and {\tt stride} is 4, the loop would access2598elements 0, 4, 8, and 12.25992600{\tt sec} keeps track of the total CPU time used by the inner loop.2601The outer loop runs until {\tt sec} exceeds 0.1 seconds, which is2602long enough that we can compute the average time with sufficient2603precision.26042605\verb"get_seconds" uses the system call \verb"clock_gettime",2606converts to seconds, and returns the result as a {\tt double}:26072608\begin{verbatim}2609double get_seconds(){2610struct timespec ts;2611clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);2612return ts.tv_sec + ts.tv_nsec / 1e9;2613}2614\end{verbatim}26152616\begin{figure}2617% graph_cache_data.py2618\centerline{\includegraphics[width=3in]{figs/cache_data.pdf}}2619\caption{Average miss penalty as a function of array size and stride.}2620\label{cachedata}2621\end{figure}26222623To isolate the time to access the elements of the array,2624the program runs a second loop that is almost identical except2625that the inner loop doesn't touch the array; it always increments2626the same variable:26272628\begin{verbatim}2629iters2 = 0;2630do {2631sec0 = get_seconds();26322633for (index = 0; index < limit; index += stride)2634temp = temp + index;26352636iters2 = iters2 + 1;2637sec = sec - (get_seconds() - sec0);26382639} while (iters2 < iters);2640\end{verbatim}26412642The second loop runs the same number of iterations as the first.2643After each iteration, it {\em subtracts} the elapsed time from2644{\tt sec}. 
When the loop completes, {\tt sec} contains the total
time for all array accesses, minus the total time it took to increment
{\tt temp}.  This difference is the total miss penalty incurred by
all accesses.  Finally, we divide by the number of accesses to
get the average miss penalty per access, in ns:

\begin{verbatim}
sec * 1e9 / iters / limit * stride
\end{verbatim}

If you compile and run {\tt cache.c} you should see output like this:

\begin{verbatim}
Size: 4096 Stride: 8 read+write: 0.8633 ns
Size: 4096 Stride: 16 read+write: 0.7023 ns
Size: 4096 Stride: 32 read+write: 0.7105 ns
Size: 4096 Stride: 64 read+write: 0.7058 ns
\end{verbatim}

If you have Python and {\tt matplotlib} installed, you can use
\verb"graph_data.py" to graph the results.  Figure~\ref{cachedata}
shows the results when I ran it on a Dell Optiplex 7010.
Notice that the array size and stride are reported in
bytes, not number of array elements.

Take a minute to consider this graph, and see what you can infer
about the cache.  Here are some things to think about:

\begin{itemize}

\item The program reads through the array many times, so it has plenty
of temporal locality.  If the entire array fits in cache, we expect
the average miss penalty to be near 0.

\item When the stride is 4 bytes, we read every element of the array,
so the program has plenty of spatial locality.  If the block size is
big enough to contain 64 elements, for example, the hit rate would
be 63/64, even if the array does not fit in cache.

\item If the stride is equal to the block size (or greater), the
spatial locality is effectively zero, because each time we read a
block, we only access one element.  In that case we expect to see
the maximum miss penalty.

\end{itemize}

In summary, we expect good cache performance if the array is smaller
than the cache size {\em or} if the stride is smaller than the block
size.  Performance only degrades if the array is bigger than the
cache {\em and} the stride is large.

In Figure~\ref{cachedata}, cache performance is good, for all strides,
as long as the array is less than $2^{22}$ B.  We can infer that the
cache size is near 4 MiB; in fact, according to the specs, it is 3
MiB.

When the stride is 8, 16, or 32 B, cache performance is good.  At 64 B
it starts to degrade, and for larger strides the average miss
penalty is about 9 ns.  We can infer that the block size is near 128 B.

Many processors use ``multi-level caches'' that include a small,
fast cache and a bigger, slower cache.  In this example, it looks
like the miss penalty increases a little when the array size is bigger
than $2^{14}$ B, so it's possible that this processor also has a 16 KiB
cache with an access time less than 1 ns.


\section{Programming for cache performance}

Memory caching is implemented in hardware, so most of the time
programmers don't need to know much about it.  But if you know how
caches work, you can write programs that use them more effectively.

For example, if you are working with a large array, it might be
faster to traverse the array once, performing several operations with
each element, rather than traversing the array several times.

If you are working with a 2-D array, it might be stored as an array
of rows.  If you traverse through the elements, it would be faster
to go row-wise, with stride equal to the element size, rather
than column-wise, with stride equal to the row length.
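
For example, here is a sketch of the two traversal orders (a
hypothetical example, not part of the book's {\tt cache} program).
In C, a 2-D array is stored row by row, so the row-wise loop touches
contiguous memory and the column-wise loop jumps a whole row on
every access:

\begin{verbatim}
#define N 2000

static double a[N][N];

/* row-wise: the inner loop has stride sizeof(double), 8 B */
void add_rowwise(void)
{
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] += 1.0;
}

/* column-wise: the inner loop has stride N * 8 B, so there is
   little spatial locality */
void add_colwise(void)
{
    int i, j;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            a[i][j] += 1.0;
}
\end{verbatim}

On most hardware, the first version is noticeably faster for large
{\tt N}, for the reasons described in this chapter.
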
Linked data structures don't always exhibit spatial locality, because
the nodes aren't necessarily contiguous in memory.  But if you allocate
many nodes at the same time, they are usually co-located in the heap.
Or, even better, if you allocate an array of nodes all at once, you
know they will be contiguous.

Recursive strategies like mergesort often have good cache behavior
because they break big arrays into smaller pieces and then work
with the pieces.  Sometimes these algorithms can be tuned to take
advantage of cache behavior.

For applications where performance is critical, it is possible
to design algorithms tailored to the size of the cache, the block size,
and other hardware characteristics.  Algorithms like that are
called ``cache-aware''.  The obvious drawback of cache-aware
algorithms is that they are hardware-specific.


\section{The memory hierarchy}

At some point during this chapter, a question like the following
might have occurred to you: ``If caches are so much faster than
main memory, why not make a really big cache and forget about
memory?''

Without going too far into computer architecture, there are two
reasons: electronics and economics.  Caches are fast because they are
small and close to the CPU, which minimizes delays due to capacitance
and signal propagation.  If you make a cache big, it will be slower.

Also, caches take up space on the processor chip, and bigger chips are
more expensive.  Main memory is usually dynamic random-access memory
(DRAM), which uses only one transistor and one capacitor per bit, so
it is possible to pack more memory into the same amount of space.  But
this way of implementing memory is slower than the way caches are
implemented.

Also, main memory is usually packaged in a dual in-line memory module
(DIMM) that includes 16 or more chips.  Several small chips are cheaper
than one big one.

The trade-off between speed, size, and cost is the fundamental reason
for caching.  If there were one memory technology that was fast,
big, and cheap, we wouldn't need anything else.

The same principle applies to storage as well as memory.  Solid state
drives (SSD) are fast, but they are more expensive than hard drives
(HDD), so they tend to be smaller.  Tape drives are even slower than
hard drives, but they can store large amounts of data relatively
cheaply.

The following table shows typical access times, sizes, and
costs for each of these technologies.

\vspace{0.1in}
\begin{center}
\begin{tabular}{| l | l | l | l |}
\hline
Device & Access & Typical & Cost \\
       & time   & size    &      \\ \hline
Register & 0.5 ns & 256 B & ? \\ \hline
Cache & 1 ns & 2 MiB & ? \\ \hline
DRAM & 100 ns & 4 GiB & \$10 / GiB \\ \hline
SSD & 10 \mus & 100 GiB & \$1 / GiB \\ \hline
HDD & 5 ms & 500 GiB & \$0.25 / GiB \\ \hline
Tape & minutes & 1--2 TiB & \$0.02 / GiB \\ \hline
\end{tabular}
\end{center}
\vspace{0.1in}
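
As a rough sketch based only on the prices in the table, here is what
it would cost to store 1 TiB at each level:

\begin{verbatim}
#include <stdio.h>

int main(void)
{
    /* approximate cost per GiB, from the table above */
    double dram = 10.0, ssd = 1.0, hdd = 0.25, tape = 0.02;
    double gib = 1024;      /* 1 TiB = 1024 GiB */

    printf("DRAM: $%.0f\n", dram * gib);    /* about $10,000 */
    printf("SSD:  $%.0f\n", ssd * gib);     /* about $1000 */
    printf("HDD:  $%.0f\n", hdd * gib);     /* about $250 */
    printf("Tape: $%.0f\n", tape * gib);    /* about $20 */

    return 0;
}
\end{verbatim}
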
The number and size of registers depend on details of the
architecture.  Current computers have about 32 general-purpose
registers, each storing one ``word''.  On a 32-bit computer, a word
is 32 bits or 4 B.  On a 64-bit computer, a word is 64 bits or 8 B.
So the total size of the register file is 100--300 B.

The cost of registers and caches is hard to quantify.  They contribute
to the cost of the chips they are on, but consumers don't see that
cost directly.

For the other numbers in the table, I looked at the specifications for
typical hardware for sale from online computer hardware stores.  By
the time you read this, these numbers will be obsolete, but they give
you an idea of what the performance and cost gaps looked like at one
point in time.

These technologies make up the ``memory hierarchy'' (note that this
use of ``memory'' also includes storage).  Each
level of the hierarchy is bigger and slower than the one above it.
And in some sense, each level acts as a cache for the one below
it.  You can think of main memory as a cache for programs and data
that are stored permanently on SSDs and HDDs.  And if you are working
with very large datasets stored on tape, you could use hard drives
to cache one subset of the data at a time.


\section{Caching policy}

The memory hierarchy suggests a framework for thinking about
caching.  At every level of the hierarchy, we have to address
four fundamental questions of caching:

\begin{itemize}

\item Who moves data up and down the hierarchy?  At the top of the
hierarchy, register allocation is usually done by the compiler.
Hardware on the CPU handles the memory cache.  Users implicitly move
data from storage to memory when they execute programs and open
files.  But the operating system also moves data back and forth
between memory and storage.  At the bottom of the hierarchy,
administrators move data explicitly between disk and tape.

\item What gets moved?  In general, block sizes are small at the top
of the hierarchy and bigger at the bottom.  In a memory cache, a
typical block size is 128 B.  Pages in memory might be 4 KiB, but
when the operating system reads a file from disk, it might read 10s
or 100s of blocks at a time.

\item When does data get moved?  In the most basic cache, data gets
moved into cache when it is used for the first time.  But many
caches use some kind of ``prefetching'', meaning that data is
loaded before it is explicitly requested.  We have already seen
one form of prefetching: loading an entire block when only part of
it is requested.

\item Where in the cache does the data go?  When the cache is full, we
can't bring anything in without kicking something out.  Ideally,
we want to keep data that will be used again soon and replace data
that won't.

\end{itemize}

The answers to these questions make up the ``cache policy''.
Near the top of the hierarchy, cache policies tend to be simple
because they have to be fast and they are implemented in hardware.
Near the bottom of the hierarchy, there is more time to make decisions,
and well-designed policies can make a big difference.

Most cache policies are based on the principle that history repeats
itself; if we have information about the recent past, we can use it to
predict the immediate future.  For example, if a block of data has
been used recently, we expect it to be used again soon.  This
principle suggests a replacement policy called ``least recently
used,'' or LRU, which removes from the cache a block of data that
has not been used recently.
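
Hardware caches only approximate LRU, but as an illustration, here is
a minimal sketch (hypothetical, not drawn from any particular system)
of an LRU replacement policy that keeps one timestamp per cache slot:

\begin{verbatim}
#define NSLOTS 8

struct slot {
    int key;            /* which block is stored in this slot */
    int valid;          /* is the slot in use? */
    long last_used;     /* logical time of the most recent access */
};

static struct slot cache[NSLOTS];
static long now;

/* record an access to the block in slot i */
void touch(int i)
{
    cache[i].last_used = ++now;
}

/* choose a slot to evict: an empty slot if there is one,
   otherwise the least recently used slot */
int choose_victim(void)
{
    int i, victim = 0;

    for (i = 0; i < NSLOTS; i++) {
        if (!cache[i].valid)
            return i;
        if (cache[i].last_used < cache[victim].last_used)
            victim = i;
    }
    return victim;
}
\end{verbatim}
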
For more on this topic, see2869\url{http://en.wikipedia.org/wiki/Cache_algorithms}.287028712872\section{Paging}2873\label{paging}28742875In systems with virtual memory, the operating system can move2876pages back and forth between memory and storage. As I mentioned2877in Section~\ref{leak}, this mechanism is called ``paging'' or2878sometimes ``swapping''.28792880Here's how the process works:28812882\begin{enumerate}28832884\item Suppose Process A calls {\tt malloc} to allocate a chunk. If there2885is no free space in the heap with the requested size, {\tt malloc} calls2886{\tt sbrk} to ask the operating system for more memory.28872888\item If there is a free page in physical memory, the operating system2889adds it to the page table for Process A, creating a new range of valid2890virtual addresses.28912892\item If there are no free pages, the paging system chooses a ``victim2893page'' belonging to Process B. It copies the contents of the victim2894page from memory to disk, then it modifies the page table for Process2895B to indicate that this page is ``swapped out''.28962897\item Once the data from Process B is written, the page can be reallocated2898to Process A. To prevent Process A from reading Process B's data, the2899page should be cleared.29002901\item At this point the call to {\tt sbrk} can return, giving {\tt malloc}2902additional space in the heap. Then {\tt malloc} allocates the requested2903chunk and returns. Process A can resume.29042905\item When Process A completes, or is interrupted, the scheduler might2906allow Process B to resume. When Process B accesses a page that has been swapped out, the memory management unit notices that the page is ``invalid'' and causes an interrupt.29072908\item When the operating system handles the interrupt, it sees that2909the page is swapped out, so it transfers the page back from disk to2910memory.29112912\item Once the page is swapped in, Process B can resume.29132914\end{enumerate}29152916When paging works well, it can greatly improve the utilization of2917physical memory, allowing more processes to run in less space.2918Here's why:29192920\begin{itemize}29212922\item Most processes don't use all of their allocated memory. Many2923parts of the text segment are never executed, or execute once and2924never again. Those pages can be swapped out without causing any2925problems.29262927\item If a program leaks memory, it might leave allocated space behind2928and never access it again. By swapping those pages out, the2929operating system can effectively plug the leak.29302931\item On most systems, there are processes like daemons that sit idle2932most of the time and only occasionally ``wake up'' to respond to2933events. While they are idle, these processes can be swapped out.29342935\item A user might have many windows open, but only a few are active2936at a time. The inactive processes can be swapped out.29372938\item Also, there might be many processes running the same program.2939These processes can share the same text and static segments, avoiding the need to keep multiple copies in physical memory.29402941\end{itemize}29422943If you add up the total memory allocated to all processes, it can2944greatly exceed the size of physical memory, and yet the system can2945still behave well.29462947Up to a point.29482949When a process accesses a page that's swapped out, it has to get the2950data back from disk, which can take several milliseconds. The2951delay is often noticeable. 
If you leave a window idle for a long2952time and then switch back to it, it might start slowly,2953and you might hear the disk drive working while pages are2954swapped in.29552956Occasional delays like that might be acceptable, but if you have too2957many processes using too much space, they start to interfere with each2958other. When Process A runs, it evicts the pages Process B needs.2959Then when B runs, it evicts the pages A needs. When this happens,2960both processes slow to a crawl and the system can become unresponsive.2961This scenario is called ``thrashing''.29622963In theory, operating systems could avoid thrashing by detecting an2964increase in paging and blocking or killing processes until the system2965is responsive again. But as far as I can tell, most systems don't do2966this, or don't do it well; it is often left to users to limit their2967use of physical memory or try to recover when thrashing occurs.296829692970\chapter{Multitasking}29712972In many current systems, the CPU contains multiple cores, which means2973it can run several processes at the same time. In addition, each core2974is capable of ``multitasking'', which means it can switch from one2975process to another quickly, creating the illusion that many processes2976are running at the same time.29772978The part of the operating system that implements multitasking is2979the ``kernel''. In a nut or seed, the kernel is the innermost2980part, surrounded by a shell. In an operating system, the kernel2981is the lowest level of software, surrounded by several other2982layers, including an interface called a ``shell.'' Computer2983scientists love extended metaphors.29842985At its most basic, the kernel's job is to2986handle interrupts. An ``interrupt'' is an event that stops the2987normal instruction cycle and causes the flow of execution to jump to a2988special section of code called an ``interrupt handler''.29892990%TODO: put new vocab in bold and add glossaries29912992A {\bf hardware interrupt} is caused when a device sends a signal to the2993CPU. For example, a network interface might cause an interrupt when2994a packet of data arrives, or a disk drive might cause an interrupt2995when a data transfer is complete. Most systems also have timers that2996cause interrupts at regular intervals, or after an elapsed time.29972998A {\bf software interrupt} is caused by a running program. For example, if2999an instruction cannot complete for some reason, it might trigger an3000interrupt so the condition can be handled by the operating system.3001Some floating-point errors, like division by zero, are handled3002using interrupts.30033004When a program needs to access a hardware device,3005it makes a {\bf system call}, which is similar to a function call,3006except that instead of jumping to the beginning of the function,3007it executes a special instruction that triggers an interrupt, causing3008the flow of execution to jump to the kernel. The kernel reads the3009parameters of the system call, performs the requested operation,3010and then resumes the interrupted process.301130123013\section{Hardware state}30143015Handling interrupts requires cooperation between hardware and3016software. When an interrupt occurs, there might be several3017instructions running on the CPU, data stored in registers, and3018other {\bf hardware state}.30193020Usually the hardware is responsible for bringing the CPU3021to a consistent state; for example, every instruction should either3022complete or behave as if it never started. 
No instruction should3023be left half complete. Also, the hardware is responsible for3024saving the program counter (PC), so the kernel knows where to3025resume.30263027Then, usually, it is the responsibility of the interrupt handler3028to save the rest of the hardware state before it does anything that3029might modify it, and then restore the saved state before the interrupted3030process resumes.30313032Here is an outline of this sequence of events:30333034\begin{enumerate}30353036\item When the interrupt occurs, the hardware saves the program3037counter in a special register and jumps to the appropriate interrupt3038handler.30393040\item The interrupt handler stores the program counter and the3041status register in memory, along with the contents of any data3042registers it plans to use.30433044\item The interrupt handler runs whatever code is needed to handle3045the interrupt.30463047\item Then it restores the contents of the saved registers. Finally,3048it restores the program counter of the interrupted process, which3049has the effect of jumping back to the interrupted instruction.30503051\end{enumerate}30523053If this mechanism works correctly, there is generally no way for3054the interrupted process to know there was an interrupt, unless3055it detects the change in time between instructions.305630573058\section{Context switching}30593060Interrupt handlers can be fast because they don't have to save the3061entire hardware state; they only have to save registers they are3062planning to use.30633064But when an interrupt occurs, the kernel does not always resume the3065interrupted process. It has the option of switching to another3066process. This mechanism is called a ``context switch''.30673068In general, the kernel doesn't know which registers a process will3069use, so it has to save all of them. Also, when it switches to a new3070process, it might have to clear data stored in the memory management3071unit (see3072Section~\ref{address_translation}). And after the context switch, it3073might take some time for the new process to load data into the cache.3074For these reasons, context switches are relatively slow, on the order3075of thousands of cycles, or a few microseconds.30763077In a multi-tasking system, each process is allowed to run for a short3078period of time called a ``time slice'' or ``quantum''. During3079a context switch, the kernel sets a hardware timer that causes3080an interrupt at the end of the time slice. When the interrupt3081occurs, the kernel can switch to another process or allow the3082interrupted process to resume. The part of the operating system3083that makes this decision is the ``scheduler''.308430853086\section{The process life cycle}30873088When a process is created, the operating system allocates a3089data structure that contains information about the process, called3090a ``process control block'' or PCB. 
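The details differ from system to system (in Linux the corresponding
structure is called \verb"task_struct"), but a highly simplified PCB
might look like the following sketch. The field names here are chosen
for illustration only.

\begin{verbatim}
#include <stdint.h>

typedef enum { RUNNING, READY, BLOCKED, DONE } ProcessState;

typedef struct {
    int pid;                /* process id */
    ProcessState state;     /* see the list below */
    uintptr_t saved_pc;     /* program counter saved at the last interrupt */
    uintptr_t saved_sp;     /* saved stack pointer */
    uintptr_t page_table;   /* address space information */
    int exit_status;        /* filled in when the process is done */
} PCB;
\end{verbatim}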
Among other things, the3091PCB keeps track of the process state, which is one of:30923093\begin{itemize}30943095\item Running, if the process is currently running on a core.30963097\item Ready, if the process could be running, but isn't, usually because3098there are more runnable processes than cores.30993100\item Blocked, if the process cannot run because it is waiting for3101a future event like network communication or a disk read.31023103\item Done, if the process has completed, but has exit status3104information that has not been read yet.31053106\end{itemize}31073108Here are the events that cause a process to transition from one state to another:31093110\begin{itemize}31113112\item A process is created when the running program executes a system3113call like {\tt fork}. At the end of the system call, the new3114process is usually ready. Then the scheduler might resume the3115original process (the ``parent'') or start the new process (the3116``child'').31173118\item When a process is started or resumed by the scheduler, its state3119changes from ready to running.31203121\item When a process is interrupted and the scheduler chooses not3122to let it resume, its state changes from running to ready.31233124\item If a process executes a system call that cannot complete3125immediately, like a disk request, it becomes blocked3126and the scheduler usually chooses another process.31273128\item When an operation like a disk request completes, it causes an3129interrupt. The interrupt handler figures out which process was3130waiting for the request and switches its state from3131blocked to ready. Then the scheduler may or may not choose to3132resume the unblocked process.31333134\item When a process calls {\tt exit}, the interrupt handler stores3135the exit code in the PCB and changes the process's state to done.31363137\end{itemize}313831393140\section{Scheduling}31413142As we saw in Section~\ref{unixps} there might be hundreds of3143processes on a computer, but usually most of them are blocked. Most3144of the time, there are only a few processes that are ready or running.3145When an interrupt occurs, the scheduler decides which process to start3146or resume.31473148On a workstation or laptop, the primary goal of the scheduler is to3149minimize response time; that is, the computer should respond quickly3150to user actions. Response time is also important on a server, but in3151addition the scheduler might try to maximize throughput, which is the3152number of requests that complete per unit of time.31533154Usually the scheduler doesn't have much information about what3155processes are doing, so its decisions are based on a few3156heuristics:31573158\begin{itemize}31593160\item Processes might be limited by different resources. A process3161that does a lot of computation is probably CPU-bound, which means that3162its run time depends on how much CPU time it gets. A process that3163reads data from a network or disk might be I/O-bound, which means that3164it would run faster if data input and output went faster, but would not3165run faster with more CPU time. Finally, a process that interacts with3166the user is probably blocked, most of the time, waiting for user actions.31673168The operating system can sometimes classify processes based on their3169past behavior, and schedule them accordingly. For example, when an3170interactive process is unblocked, it should probably run immediately,3171because a user is probably waiting for a reply. 
On the other hand,3172a CPU-bound process that has been running for a long time might be3173less time-sensitive.31743175\item If a process is likely to run for a short time and then make3176a blocking request, it should probably run immediately, for two reasons:3177(1) if the request takes some time to complete, we should start it as soon3178as possible, and (2) it is better for a long-running process to wait3179for a short one, rather than the other way around.31803181As an analogy, suppose you are making an apple pie. The crust takes31825 minutes to prepare, but then it has to chill for half an hour. It takes318320 minutes to prepare the filling. If you prepare the crust first,3184you can prepare the filling while the crust is chilling, and you can3185finish the pie in 35 minutes. If you prepare the filling first, the3186process takes 55 minutes.31873188\end{itemize}31893190Most schedulers use some form of priority-based scheduling,3191where each process has a priority that can be adjusted up or down3192over time. When the scheduler runs, it chooses the runnable process3193with the highest priority.31943195Here are some of the factors that determine a process's priority:31963197\begin{itemize}31983199\item A process usually starts with a relatively high priority so it3200starts running quickly.32013202\item If a process makes a request and blocks before its time slice is3203complete, it is more likely to be interactive or I/O-bound, so its3204priority should go up.32053206\item If a process runs for an entire time slice, it is more likely to3207be long-running and CPU-bound, so its priority should go down.32083209\item If a task blocks for a long time and then becomes ready, it3210should get a priority boost so it can respond to whatever it was3211waiting for.32123213\item If process A is blocked waiting for process B, for example if3214they are connected by a pipe, the priority of process B should go3215up.32163217\item The system call {\tt nice} allows a process to decrease (but not3218increase) its own priority, allowing programmers to pass explicit3219information to the scheduler.32203221\end{itemize}32223223For most systems running normal workloads, scheduling algorithms3224don't have a substantial effect on performance. Simple scheduling3225policies are usually good enough.322632273228\section{Real-time scheduling}32293230However, for programs that interact with the real world, scheduling3231can be very important. For example, a program that reads data from3232sensors and controls motors might have to complete recurring tasks at3233some minimum frequency and react to external events with some maximum3234response time. These requirements are often expressed in terms of3235``tasks'' that must be completed before ``deadlines''.32363237Scheduling tasks to meet deadlines is called ``real-time3238scheduling''. 
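To make this concrete, here is a sketch of a periodic task on Linux;
it is illustrative only and not part of the code for this book. It
asks the kernel for a real-time priority (which usually requires
administrator privileges) and then uses an absolute deadline to run
once every 10~ms.

\begin{verbatim}
#include <sched.h>
#include <stdio.h>
#include <time.h>

void do_work(void)
{
    /* read sensors, update motors, ... */
}

int main()
{
    /* Request a real-time scheduling policy; fall back if it fails. */
    struct sched_param param = { .sched_priority = 50 };
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
        perror("sched_setscheduler");
    }

    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    while (1) {
        do_work();

        /* Sleep until the next absolute deadline, 10 ms from the last one. */
        next.tv_nsec += 10 * 1000 * 1000;
        if (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec += 1;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
    return 0;
}
\end{verbatim}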
For some applications, a general-purpose operating
system like Linux can be modified to handle real-time scheduling.
These modifications might include:

\begin{itemize}

\item Providing richer APIs for controlling task priorities.

\item Modifying the scheduler to guarantee that the process with
highest priority runs within a fixed amount of time.

\item Reorganizing interrupt handlers to guarantee
a maximum completion time.

\item Modifying locks and other synchronization mechanisms (coming up
in the next chapter) to allow a high-priority task to preempt a
lower-priority task.

\item Choosing an implementation of dynamic memory allocation that
guarantees a maximum completion time.

\end{itemize}

For more demanding applications, especially in domains where real-time
response is a matter of life and death, ``real-time operating
systems'' provide specialized capabilities, often with much simpler
designs than general-purpose operating systems.

%TODO: kernel mode? signals? user and system time


\chapter{Threads}

When I mentioned threads in Section~\ref{unixps}, I said that a thread
is a kind of process. Now I will provide a more careful explanation.

When you create a process, the operating system creates a new address
space, which includes the text segment, static segment, and heap; it
also creates a new ``thread of execution'', which includes the program
counter and other hardware state, and the call stack.

The processes we have seen so far are ``single-threaded'', which means
that only one thread of execution runs in each address space. In this
chapter, you will learn about ``multi-threaded'' processes that have
multiple threads running in the same address space.

Within a single process, all threads share the same text segment, so
they run the same code. But different threads often run different parts
of the code.

And they share the same static segment, so if one thread changes a
global variable, other threads see the change. They also share the heap,
so threads can share dynamically-allocated chunks.

But each thread has its own stack, so threads can call functions without
interfering with each other. Usually threads don't access each
other's local variables (and sometimes they can't).

The example code for this chapter is in the repository for this book,
in a directory named {\tt counter}. For information on downloading
this code, see Section~\ref{code}.


\section{Using Pthreads}

The most popular threading standard used with C is POSIX Threads, or
Pthreads for short. The POSIX standard defines a thread model and an
interface for creating and controlling threads. Most versions of UNIX
provide an implementation of Pthreads.

Using Pthreads is like using most C libraries:

\begin{itemize}

\item You include header files at the beginning of your
program.

\item You write code that calls functions defined by Pthreads.

\item When you compile the program, you link it with the Pthread library.

\end{itemize}

For my examples, I include the following headers:

\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <semaphore.h>
\end{verbatim}

The first two are standard; the third is for Pthreads and
the fourth is for semaphores.
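Before looking at the wrappers used in the rest of this chapter, it
might help to see a complete program. The following minimal example is
not from the book's repository; it creates one thread, which prints a
message, and then waits for it.

\begin{verbatim}
#include <stdio.h>
#include <pthread.h>

/* The function where the new thread begins execution. */
void *hello(void *arg)
{
    printf("hello from a child thread\n");
    return NULL;
}

int main()
{
    pthread_t thread;

    int n = pthread_create(&thread, NULL, hello, NULL);
    if (n != 0) {
        fprintf(stderr, "pthread_create failed\n");
        return 1;
    }

    pthread_join(thread, NULL);    /* wait for the child to finish */
    return 0;
}
\end{verbatim}

You can compile this example the same way as the examples below, as
long as you link with the Pthread library.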
To compile with the Pthread library in {\tt gcc}, you can use the
{\tt -l} option on the command line:

\begin{verbatim}
gcc -g -O2 -o array array.c -lpthread
\end{verbatim}

This compiles a source file named {\tt array.c} with debugging info
and optimization, links with the Pthread library, and generates an
executable named {\tt array}.


\section{Creating threads}

The Pthread function that creates threads is called \verb"pthread_create".
The following function shows how to use it:

\begin{verbatim}
pthread_t make_thread(void *(*entry)(void *), Shared *shared)
{
    int n;
    pthread_t thread;

    n = pthread_create(&thread, NULL, entry, (void *)shared);
    if (n != 0) {
        perror("pthread_create failed");
        exit(-1);
    }
    return thread;
}
\end{verbatim}

\verb"make_thread" is a wrapper I wrote to make
\verb"pthread_create" easier to use, and to provide error-checking.

%TODO: Define a newcommand like \py to make verb easier to use, and
% get rid of the \_ hockey sticks

The return type of \verb"make_thread" is \verb"pthread_t",
which you can think of as an id or ``handle'' for the new thread.

If {\tt pthread\_create} succeeds, it returns 0 and \verb"make_thread"
returns the handle of the new thread.
If an error occurs, {\tt pthread\_create}
returns an error code and \verb"make_thread" prints an error message
and exits.

The parameters of \verb"make_thread" take some
explaining. Starting with the second, {\tt Shared}
is a structure I defined to contain values shared between threads.
The following {\tt typedef} statement creates the new type:

\begin{verbatim}
typedef struct {
    int counter;
} Shared;
\end{verbatim}

In this case, the only shared variable is {\tt counter}.
{\tt make\_shared} allocates
space for a {\tt Shared} structure and initializes the contents:

\begin{verbatim}
Shared *make_shared()
{
    Shared *shared = check_malloc(sizeof (Shared));
    shared->counter = 0;
    return shared;
}
\end{verbatim}

Now that we have a shared data structure, let's get back to
\verb"make_thread".
The first parameter is a pointer to a function that takes
a {\tt void} pointer and returns a {\tt void} pointer. If the syntax
for declaring this type makes your eyes bleed, you are not alone.
Anyway, the purpose of this parameter is to specify the function where
the execution of the new thread will begin. By convention, this
function is named {\tt entry}:

\begin{verbatim}
void *entry(void *arg)
{
    Shared *shared = (Shared *) arg;
    child_code(shared);
    pthread_exit(NULL);
}
\end{verbatim}

The parameter of {\tt entry} has to be declared as a {\tt void}
pointer, but in this program we know that it is really a pointer to a
{\tt Shared} structure, so we can typecast it accordingly and then
pass it along to {\tt child\_code}, which does the real work.

As a simple example, \verb"child_code" prints the value of
the shared counter and increments it.

\begin{verbatim}
void child_code(Shared *shared)
{
    printf("counter = %d\n", shared->counter);
    shared->counter++;
}
\end{verbatim}

When {\tt child\_code} returns, {\tt entry} invokes
\verb"pthread_exit", which can be used to pass a value to the thread
that joins with this thread.
In this case, the child has nothing to
say, so we pass {\tt NULL}.

Finally, here is the code that creates the child threads:

\begin{verbatim}
int i;
pthread_t child[NUM_CHILDREN];

Shared *shared = make_shared(1000000);

for (i=0; i<NUM_CHILDREN; i++) {
    child[i] = make_thread(entry, shared);
}
\end{verbatim}

\verb"NUM_CHILDREN" is a compile-time constant that determines
the number of child threads. {\tt child} is an array of
thread handles.


\section{Joining threads}

When one thread wants to wait for another thread to complete,
it invokes {\tt pthread\_join}.
Here is my wrapper for {\tt pthread\_join}:

\begin{verbatim}
void join_thread(pthread_t thread)
{
    int ret = pthread_join(thread, NULL);
    if (ret != 0) {
        perror("pthread_join failed");
        exit(-1);
    }
}
\end{verbatim}

The parameter is the handle of the thread you want to wait for.
All the wrapper does is call {\tt pthread\_join} and check the
result.

Any thread can join any other thread, but in the most common pattern
the parent thread creates and joins all child threads.
Continuing the example from the previous section, here's the
code that waits on the children:

\begin{verbatim}
for (i=0; i<NUM_CHILDREN; i++) {
    join_thread(child[i]);
}
\end{verbatim}

This loop waits for the children one at a time in the order they
were created. There is no guarantee that the child threads complete
in that order, but this loop works correctly even if they don't. If one
of the children is late, the loop might have to wait, and other children
might complete in the meantime. But regardless, the loop exits
only when all children are done.

If you have downloaded the repository for this book (see
Section~\ref{code}), you'll find this example in
{\tt counter/counter.c}. You can compile and run it like this:

\begin{verbatim}
$ make counter
gcc -Wall counter.c -o counter -lpthread
$ ./counter
\end{verbatim}

When I ran it with 5 children, I got the following output:

\begin{verbatim}
counter = 0
counter = 0
counter = 1
counter = 0
counter = 3
\end{verbatim}

When you run it, you will probably get different results. And if
you run it again, you might get different results each time. What's
going on?


\section{Synchronization errors}

The problem with the previous program is that the children
access the shared variable, {\tt counter}, without synchronization,
so several threads can read the same value of {\tt counter} before
any threads increment it.

Here is a sequence of events that could explain the output in the
previous section:

\begin{verbatim}
Child A reads 0
Child B reads 0
Child C reads 0
Child A prints 0
Child B prints 0
Child A sets counter=1
Child D reads 1
Child D prints 1
Child C prints 0
Child B sets counter=2
Child C sets counter=3
Child E reads 3
Child E prints 3
Child D sets counter=4
Child E sets counter=5
\end{verbatim}

Each time you run the program, threads might be interrupted at different
points, or the scheduler might choose different threads to run, so
the sequence of events, and the results, will be different.

Suppose we want to impose some order.
For example, we might want3553each thread to read a different value of {\tt counter} and increment3554it, so that the value of {\tt counter} reflects the number of3555threads that have executed \verb"child_code".35563557To enforce that requirement, we can use a ``mutex'', which is3558an object that guarantees ``mutual exclusion'' for a block of code;3559that is, only one thread can execute the block at a time.35603561I have written a small module called {\tt mutex.c} that provides3562mutex objects. I'll show you how to use it first; then I'll explain3563how it works.35643565Here's a version of \verb"child_code" that uses a mutex to synchronize3566threads:35673568\begin{verbatim}3569void child_code(Shared *shared)3570{3571mutex_lock(shared->mutex);3572printf("counter = %d\n", shared->counter);3573shared->counter++;3574mutex_unlock(shared->mutex);3575}3576\end{verbatim}35773578Before any thread can access {\tt counter}, it has to ``lock''3579the mutex, which has the effect of barring all other threads.3580Suppose Thread A has locked the mutex and is in the3581middle of \verb"child_code". If Thread B arrives and3582executes \verb"mutex_lock", it blocks.35833584When Thread A is done, it executes \verb"mutex_unlock",3585which allows Thread B to proceed. In effect, the threads3586line up to execute \verb"child_code" one at a time, so they3587can't interfere with each other. When I run this code with35885 children, I get:35893590\begin{verbatim}3591counter = 03592counter = 13593counter = 23594counter = 33595counter = 43596\end{verbatim}35973598And that satisfies the requirements. In order for this solution to3599work, I have to add the Mutex to the Shared struct:36003601\begin{verbatim}3602typedef struct {3603int counter;3604Mutex *mutex;3605} Shared;3606\end{verbatim}36073608And initialize it in \verb"make_shared"36093610\begin{verbatim}3611Shared *make_shared(int end)3612{3613Shared *shared = check_malloc(sizeof(Shared));3614shared->counter = 0;3615shared->mutex = make_mutex(); //-- this line is new3616return shared;3617}3618\end{verbatim}36193620The code in this section is in \verb"counter_mutex.c".3621The definition of {\tt Mutex} is in {\tt mutex.c}, which I3622explain in the next section.3623362436253626\section{Mutex}36273628My definition of {\tt Mutex} is a wrapper for a type called3629\verb"pthread_mutex_t", which is defined in the POSIX threads API.36303631To create a POSIX mutex, you have to allocate space for a3632\verb"pthread_mutex_t" type and then call \verb"pthread_mutex_init".36333634One of the problems with this API is that \verb"pthread_mutex_t"3635behaves like a structure, so if you pass it as an argument, it makes a3636copy, which makes the mutex behave incorrectly. To avoid that, you have to3637pass \verb"pthread_mutex_t" by address.36383639My code makes it easier to get that right. 
It defines a3640type, {\tt Mutex}, which is just a more readable name for3641\verb"pthread_mutex_t":36423643\begin{verbatim}3644#include <pthread.h>36453646typedef pthread_mutex_t Mutex;3647\end{verbatim}36483649Then it defines \verb"make_mutex", which allocates space and3650initializes the mutex:36513652\begin{verbatim}3653Mutex *make_mutex()3654{3655Mutex *mutex = check_malloc(sizeof(Mutex));3656int n = pthread_mutex_init(mutex, NULL);3657if (n != 0) perror_exit("make_lock failed");3658return mutex;3659}3660\end{verbatim}36613662The return value is a pointer, which you can pass around as an3663argument without causing unwanted copying.36643665The functions to lock and unlock the mutex are simple wrappers3666for POSIX functions:36673668\begin{verbatim}3669void mutex_lock(Mutex *mutex)3670{3671int n = pthread_mutex_lock(mutex);3672if (n != 0) perror_exit("lock failed");3673}36743675void mutex_unlock(Mutex *mutex)3676{3677int n = pthread_mutex_unlock(mutex);3678if (n != 0) perror_exit("unlock failed");3679}3680\end{verbatim}36813682This code is in {\tt mutex.c} and the header file {\tt mutex.h}.368336843685\chapter{Condition variables}3686\label{csem}36873688Many simple synchronization problems can be solved using mutexes3689as shown in the previous chapter. In this chapter I introduce a3690bigger challenge, the well-known ``Producer-Consumer problem'', and3691a new tool to solve it, the condition variable.36923693\section{The work queue}3694\label{queue}36953696In some multi-threaded programs, threads are organized to perform3697different tasks. Often they communicate with each other using a queue,3698where some threads, called ``producers'', put data into the queue3699and other threads, called ``consumers'', take data out.37003701For example, in applications with a graphical user interface, there3702might be one thread that runs the GUI, responding to user events,3703and another thread that processes user requests. In that case,3704the GUI thread might put requests into a queue and the ``back end''3705thread might take requests out and process them.37063707To support this organization, we need a queue implementation that is3708``thread safe'', which means that both threads (or more than two) can3709access the queue at the same time. And we need to handle the special3710cases when the queue is empty and, if the size of the queue is3711bounded, when the queue is full.37123713I'll start with a simple queue that is not thread safe, then we'll see3714what goes wrong and fix it. The code for this example is in the3715repository for this book, in a folder called {\tt queue}. The file3716{\tt queue.c} contains a basic implementation of a circular buffer,3717which you can read about at3718\url{https://en.wikipedia.org/wiki/Circular_buffer}.37193720Here's the structure definition:37213722\begin{verbatim}3723typedef struct {3724int *array;3725int length;3726int next_in;3727int next_out;3728} Queue;3729\end{verbatim}37303731{\tt array} is the array that contains the elements of the queue.3732For this example the elements are ints, but more generally3733they would be structures that contain user events, items of work, etc.37343735{\tt length} is the length of the array. 
\verb"next_in" is an
index into the array that indicates where the next element should be
added; similarly, \verb"next_out" is the index of the next element
that should be removed.

\verb"make_queue" allocates space for this structure and initializes
the fields:

\begin{verbatim}
Queue *make_queue(int length)
{
    Queue *queue = (Queue *) malloc(sizeof(Queue));
    queue->length = length;
    queue->array = (int *) malloc(length * sizeof(int));
    queue->next_in = 0;
    queue->next_out = 0;
    return queue;
}
\end{verbatim}

The initial value for \verb"next_out" needs some explaining.
Since the queue is initially empty, there is no next element to
remove, so \verb"next_out" is invalid. Setting
\verb"next_out == next_in" is a special case that indicates
that the queue is empty, so we can write:

\begin{verbatim}
int queue_empty(Queue *queue)
{
    return (queue->next_in == queue->next_out);
}
\end{verbatim}

Now we can add elements to the queue using \verb"queue_push":

\begin{verbatim}
void queue_push(Queue *queue, int item) {
    if (queue_full(queue)) {
        perror_exit("queue is full");
    }

    queue->array[queue->next_in] = item;
    queue->next_in = queue_incr(queue, queue->next_in);
}
\end{verbatim}

If the queue is full, \verb"queue_push" prints an error message
and exits. I will explain \verb"queue_full" soon.

If the queue is not full, \verb"queue_push" inserts the new
element and then increments \verb"next_in" using \verb"queue_incr":

\begin{verbatim}
int queue_incr(Queue *queue, int i)
{
    return (i+1) % queue->length;
}
\end{verbatim}

When the index, {\tt i}, gets to the end of the array, it wraps around
to 0. And that's where we run into a tricky part. If we keep adding
elements to the queue, eventually \verb"next_in" wraps around and catches
up with \verb"next_out". But if \verb"next_in == next_out", we would
incorrectly conclude that the queue was empty.

To avoid that, we define another special case to indicate that the
queue is full:

\begin{verbatim}
int queue_full(Queue *queue)
{
    return (queue_incr(queue, queue->next_in) == queue->next_out);
}
\end{verbatim}

If incrementing \verb"next_in" lands on \verb"next_out", that means
we can't add another element without making the queue seem empty. So
we stop one element before the ``end'' (keeping in mind that the end of
the queue can be anywhere, not necessarily the end of the array).

Now we can write \verb"queue_pop", which removes and returns the next
element from the queue:

\begin{verbatim}
int queue_pop(Queue *queue) {
    if (queue_empty(queue)) {
        perror_exit("queue is empty");
    }

    int item = queue->array[queue->next_out];
    queue->next_out = queue_incr(queue, queue->next_out);
    return item;
}
\end{verbatim}

If you try to pop from an empty queue, \verb"queue_pop" prints
an error message and exits.


\section{Producers and consumers}
\label{prodcon}

Now let's make some threads to access this queue.
Here's the
producer code:

\begin{verbatim}
void *producer_entry(void *arg) {
    Shared *shared = (Shared *) arg;

    for (int i=0; i<QUEUE_LENGTH-1; i++) {
        printf("adding item %d\n", i);
        queue_push(shared->queue, i);
    }
    pthread_exit(NULL);
}
\end{verbatim}

Here's the consumer code:

\begin{verbatim}
void *consumer_entry(void *arg) {
    int item;
    Shared *shared = (Shared *) arg;

    for (int i=0; i<QUEUE_LENGTH-1; i++) {
        item = queue_pop(shared->queue);
        printf("consuming item %d\n", item);
    }
    pthread_exit(NULL);
}
\end{verbatim}

Here's the parent code that starts the threads and waits for them:

\begin{verbatim}
pthread_t child[NUM_CHILDREN];

Shared *shared = make_shared();

child[0] = make_thread(producer_entry, shared);
child[1] = make_thread(consumer_entry, shared);

for (int i=0; i<NUM_CHILDREN; i++) {
    join_thread(child[i]);
}
\end{verbatim}

And finally here's the shared structure that contains the queue:

\begin{verbatim}
typedef struct {
    Queue *queue;
} Shared;

Shared *make_shared()
{
    Shared *shared = check_malloc(sizeof(Shared));
    shared->queue = make_queue(QUEUE_LENGTH);
    return shared;
}
\end{verbatim}

The code we have so far is a good starting place, but it has
several problems:

\begin{itemize}

\item Access to the queue is not thread safe. Different threads
could access {\tt array}, \verb"next_in", and \verb"next_out"
at the same time and leave the queue in a broken, ``inconsistent''
state.

\item If the consumer is scheduled first, it finds the queue empty,
prints an error message, and exits. We would rather have the consumer
block until the queue is not empty. Similarly, we would like the
producer to block if the queue is full.

\end{itemize}

In the next section, we solve the first problem with a {\tt Mutex}.
In the following section, we solve the second problem with condition
variables.


\section{Mutual exclusion}

We can make the queue thread safe with a mutex. This version
of the code is in \verb"queue_mutex.c".

First we add a {\tt Mutex} pointer to the queue structure:

\begin{verbatim}
typedef struct {
    int *array;
    int length;
    int next_in;
    int next_out;
    Mutex *mutex;    //-- this line is new
} Queue;
\end{verbatim}

And initialize the {\tt Mutex} in \verb"make_queue":

\begin{verbatim}
Queue *make_queue(int length) {
    Queue *queue = (Queue *) malloc(sizeof(Queue));
    queue->length = length;
    queue->array = (int *) malloc(length * sizeof(int));
    queue->next_in = 0;
    queue->next_out = 0;
    queue->mutex = make_mutex();    //-- new
    return queue;
}
\end{verbatim}

Next we add synchronization code to \verb"queue_push":

\begin{verbatim}
void queue_push(Queue *queue, int item) {
    mutex_lock(queue->mutex);    //-- new
    if (queue_full(queue)) {
        mutex_unlock(queue->mutex);    //-- new
        perror_exit("queue is full");
    }

    queue->array[queue->next_in] = item;
    queue->next_in = queue_incr(queue, queue->next_in);
    mutex_unlock(queue->mutex);    //-- new
}
\end{verbatim}

Before checking whether the queue is full, we have to lock
the {\tt Mutex}.
If the queue is full, we have to unlock
the {\tt Mutex} before exiting; otherwise the thread would leave
it locked and no other threads could proceed.

The synchronization code for \verb"queue_pop" is similar:

\begin{verbatim}
int queue_pop(Queue *queue) {
    mutex_lock(queue->mutex);
    if (queue_empty(queue)) {
        mutex_unlock(queue->mutex);
        perror_exit("queue is empty");
    }

    int item = queue->array[queue->next_out];
    queue->next_out = queue_incr(queue, queue->next_out);
    mutex_unlock(queue->mutex);
    return item;
}
\end{verbatim}

Note that the other {\tt Queue} functions, \verb"queue_full",
\verb"queue_empty", and \verb"queue_incr", do not try to lock
the mutex. Any thread that calls these functions is required to
lock the mutex first; this requirement is part of the documented
interface for these functions.

With this additional code, the queue is thread safe; if you run it, you
should not see any synchronization errors. But it is likely
that the consumer will exit at some point because the queue is
empty, or the producer will exit because the queue is full,
or both.

The next step is to add condition variables.


\section{Condition variables}

A condition variable is a data structure associated with a condition;
it allows threads to block until the condition becomes true. For
example, \verb"queue_pop" might want to check whether the queue is
empty and, if so, wait for a condition like ``queue not empty''.

Similarly, \verb"queue_push" might want to check whether the queue is
full and, if so, block until it is not full.

I'll handle the first condition here, and you will have a chance to
handle the second condition as an exercise.

First we add a condition variable to the {\tt Queue} structure:

\begin{verbatim}
typedef struct {
    int *array;
    int length;
    int next_in;
    int next_out;
    Mutex *mutex;
    Cond *nonempty;    //-- new
} Queue;
\end{verbatim}

And initialize it in \verb"make_queue":

\begin{verbatim}
Queue *make_queue(int length)
{
    Queue *queue = (Queue *) malloc(sizeof(Queue));
    queue->length = length;
    queue->array = (int *) malloc(length * sizeof(int));
    queue->next_in = 0;
    queue->next_out = 0;
    queue->mutex = make_mutex();
    queue->nonempty = make_cond();    //-- new
    return queue;
}
\end{verbatim}

Now in \verb"queue_pop", if we find the queue empty, we don't
exit; instead we use the condition variable to block:

\begin{verbatim}
int queue_pop(Queue *queue) {
    mutex_lock(queue->mutex);
    while (queue_empty(queue)) {
        cond_wait(queue->nonempty, queue->mutex);    //-- new
    }

    int item = queue->array[queue->next_out];
    queue->next_out = queue_incr(queue, queue->next_out);
    mutex_unlock(queue->mutex);
    return item;
}
\end{verbatim}

\verb"cond_wait" is complicated, so let's take it slow.
The first argument is the condition variable; in this case,
the condition we are waiting for is ``queue not empty''. The second
argument is the mutex that protects the queue.

When the thread that locked the mutex calls \verb"cond_wait", it
unlocks the mutex and then blocks. This is important.
If
\verb"cond_wait" did not unlock the mutex before blocking, no
other thread would be able to access the queue, no more items
could be added, and the queue would always be empty.

So while the consumer is blocked on {\tt nonempty}, the producer can
run. Let's see what happens when the producer runs \verb"queue_push":

\begin{verbatim}
void queue_push(Queue *queue, int item) {
    mutex_lock(queue->mutex);
    if (queue_full(queue)) {
        mutex_unlock(queue->mutex);
        perror_exit("queue is full");
    }
    queue->array[queue->next_in] = item;
    queue->next_in = queue_incr(queue, queue->next_in);
    mutex_unlock(queue->mutex);
    cond_signal(queue->nonempty);    //-- new
}
\end{verbatim}

Just as before, \verb"queue_push" locks the {\tt Mutex} and checks
whether the queue is full. Assuming it is not, \verb"queue_push" adds
a new element to the queue and then unlocks the {\tt Mutex}.

But before returning, it does one more thing: it ``signals'' the
condition variable {\tt nonempty}.

Signalling a condition variable usually indicates that the condition is
true. If there are no threads waiting
on the condition variable, the signal has no effect.

If there are threads waiting on the condition variable, one of them
gets unblocked and resumes execution of \verb"cond_wait". But before
the awakened thread can return from \verb"cond_wait", it has
to wait for and lock the {\tt Mutex} again.

Now go back to \verb"queue_pop" and see what happens when the thread
returns from \verb"cond_wait". It loops back to the top of the while
loop and checks the condition again. I'll explain why in just a
second, but for now let's assume that the condition is true; that is,
the queue is not empty.

When the consumer thread exits the while loop, we know two things: (1)
the condition is true, so there is at least one item in the queue, and
(2) the {\tt Mutex} is locked, so it is safe to access the queue.

After removing an item, \verb"queue_pop" unlocks the mutex
and returns.

In the next section I'll show you how my {\tt Cond} code works, but first I
want to answer two frequently asked questions:

\begin{itemize}

\item Why is \verb"cond_wait" inside a while loop rather than an if
statement; that is, why do we have to check the condition again after
returning from \verb"cond_wait"?

The primary reason you have to re-check the condition is the possibility
of an intercepted signal. Suppose Thread A is waiting on {\tt nonempty}.
Thread B adds an item to the queue and signals {\tt nonempty}. Thread
A wakes up and tries to lock the mutex, but before it gets the chance,
Evil Thread C swoops in, locks the mutex, pops the item from the
queue, and unlocks the mutex. Now the queue is empty again, but
Thread A is not blocked any more. Thread A could lock the mutex and
return from \verb"cond_wait". If Thread A does not check the condition
again, it would try to pop an element from an empty queue, and probably
cause an error.

\item The other question that comes up when people learn about condition
variables is ``How does the condition variable know what condition it
is associated with?''

This question is understandable because there is no explicit connection
between a {\tt Cond} structure and the condition it relates to.
The4146connection is implicit in the way it is used.41474148Here's one way to think of it: the condition associated with a Cond4149is the thing that is false when you call \verb"cond_wait" and true4150when you call \verb"cond_signal".41514152\end{itemize}41534154Because threads have to check the condition when they return from4155\verb"cond_wait", it is not strictly necessary to call \verb"cond_signal"4156only when the condition is true. If you have reason to think the4157condition {\em might} be true, you could call \verb"cond_signal" as4158a suggestion that now is a good time to check.415941604161\section{Condition variable implementation}41624163The Cond structure I used in the previous section is a wrapper4164for a type called \verb"pthread_cond_t", which is defined in the POSIX4165threads API. It is very similar to Mutex, which is a wrapper for4166\verb"pthread_mutex_t". Both wrappers are defined in {\tt utils.c} and4167{\tt utils.h}.41684169Here's the typedef:41704171\begin{verbatim}4172typedef pthread_cond_t Cond;4173\end{verbatim}41744175\verb"make_cond" allocates space, initializes the condition variable,4176and returns a pointer:41774178\begin{verbatim}4179Cond *make_cond() {4180Cond *cond = check_malloc(sizeof(Cond));4181int n = pthread_cond_init(cond, NULL);4182if (n != 0) perror_exit("make_cond failed");41834184return cond;4185}4186\end{verbatim}41874188And here are the wrappers for \verb"cond_wait" and \verb"cond_signal".41894190\begin{verbatim}4191void cond_wait(Cond *cond, Mutex *mutex) {4192int n = pthread_cond_wait(cond, mutex);4193if (n != 0) perror_exit("cond_wait failed");4194}41954196void cond_signal(Cond *cond) {4197int n = pthread_cond_signal(cond);4198if (n != 0) perror_exit("cond_signal failed");4199}4200\end{verbatim}42014202At this point there should be nothing too surprising there.4203420442054206\chapter{Semaphores in C}42074208Semaphores are a good way to learn about synchronization, but4209they are not as widely used, in practice, as mutexes and4210condition variables.42114212Nevertheless, there are some synchronization problems that can be4213solved simply with semaphores, yielding solutions that are more4214demonstrably correct.42154216This chapter presents a C API for working with semaphores and4217my code for making it easier to work with. And it presents4218a final challenge: can you write an implementation of a semaphore4219using mutexes and condition variables?42204221The code for this chapter is in directory {\tt semaphore} in the4222repository for this book (see Section~\ref{code}).422342244225\section{POSIX Semaphores}42264227A semaphore is a data structure used to help threads work together4228without interfering with each other.42294230The POSIX standard specifies an interface for semaphores;4231it is not part of Pthreads, but most UNIXes4232that implement Pthreads also provide semaphores.42334234POSIX semaphores have type {\tt sem\_t}.4235As usual, I put a wrapper around {\tt sem\_t}4236to make it safer and easier to use. 
My wrapper is in {\tt sem.h}:42374238\begin{verbatim}4239typedef sem_t Semaphore;42404241Semaphore *make_semaphore(int value);4242void semaphore_wait(Semaphore *sem);4243void semaphore_signal(Semaphore *sem);4244\end{verbatim}42454246{\tt Semaphore} is a synonym for \verb"sem_t", but I find it more4247readable, and the capital letter reminds me to treat it like an4248object and pass it by pointer.42494250The implementation of these functions is in {\tt sem.c}:42514252\begin{verbatim}4253Semaphore *make_semaphore(int value)4254{4255Semaphore *sem = check_malloc(sizeof(Semaphore));4256int n = sem_init(sem, 0, value);4257if (n != 0) perror_exit("sem_init failed");4258return sem;4259}4260\end{verbatim}42614262{\tt make\_semaphore} takes the initial value of the semaphore4263as a parameter. It allocates space for a Semaphore, initializes4264it, and returns a pointer to {\tt Semaphore}.42654266{\tt sem\_init} returns 0 if it succeeds and -1 if anything goes4267wrong. One nice thing about using wrapper functions is that you can4268encapsulate the error-checking code, which makes the code that uses4269these functions more readable.42704271Here is the implementation of \verb"semaphore_wait":42724273\begin{verbatim}4274void semaphore_wait(Semaphore *sem)4275{4276int n = sem_wait(sem);4277if (n != 0) perror_exit("sem_wait failed");4278}4279\end{verbatim}42804281And here is \verb"semaphore_signal":42824283\begin{verbatim}4284void semaphore_signal(Semaphore *sem)4285{4286int n = sem_post(sem);4287if (n != 0) perror_exit("sem_post failed");4288}4289\end{verbatim}42904291I prefer to call this operation ``signal'' rather than ``post'',4292although both terms are common.42934294Here's an example that shows how to use a semaphore as a mutex:42954296\begin{verbatim}4297Semaphore *mutex = make_semaphore(1);42984299semaphore_wait(mutex);4300// protected code goes here4301semaphore_signal(mutex);4302\end{verbatim}43034304When you use a semaphore as a mutex, you usually4305initialize it to 1 to indicate4306that the mutex is unlocked; that is, one thread can4307pass the semaphore without blocking.43084309Here I am using the variable name {\tt mutex} to indicate that4310the semaphore is being used as a mutex. 
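A semaphore initialized to 0 can also be used for something a mutex is
not designed to do: making one thread wait until another thread
announces an event. The following sketch is not from the book's
repository; it assumes it is compiled and linked with the {\tt sem.h}
wrappers described above (and the utility code they depend on).

\begin{verbatim}
#include <stdio.h>
#include <pthread.h>
#include "sem.h"

Semaphore *ready;    /* initialized to 0: the event has not happened yet */

void *child(void *arg)
{
    printf("child: setting things up\n");
    semaphore_signal(ready);     /* announce that setup is done */
    return NULL;
}

int main()
{
    ready = make_semaphore(0);

    pthread_t thread;
    pthread_create(&thread, NULL, child, NULL);

    semaphore_wait(ready);       /* blocks until the child signals */
    printf("parent: the child is ready\n");

    pthread_join(thread, NULL);
    return 0;
}
\end{verbatim}

Initializing the semaphore to 0 means the parent blocks until the
child has signalled at least once.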
But remember that the behavior4311of a semaphore is not the same as a Pthread mutex.431243134314\section{Producers and consumers with semaphores}43154316Using these semaphore wrapper functions, we can4317write a solution to the Producer-Consumer problem from4318Section~\ref{prodcon}.4319The code in this section is in \verb"queue_sem.c".43204321%TODO: Get this code into ExercisesInC/examples/queue43224323Here's the new definition of {\tt Queue}, replacing the mutex4324and condition variables with semaphores:43254326\begin{verbatim}4327typedef struct {4328int *array;4329int length;4330int next_in;4331int next_out;4332Semaphore *mutex; //-- new4333Semaphore *items; //-- new4334Semaphore *spaces; //-- new4335} Queue;4336\end{verbatim}43374338And here's the new version of \verb"make_queue":43394340\begin{verbatim}4341Queue *make_queue(int length)4342{4343Queue *queue = (Queue *) malloc(sizeof(Queue));4344queue->length = length;4345queue->array = (int *) malloc(length * sizeof(int));4346queue->next_in = 0;4347queue->next_out = 0;4348queue->mutex = make_semaphore(1);4349queue->items = make_semaphore(0);4350queue->spaces = make_semaphore(length-1);4351return queue;4352}4353\end{verbatim}43544355{\tt mutex} is used to guarantee exclusive access to the queue;4356the initial value is 1, so the mutex is4357initially unlocked.43584359{\tt items} is the number of items in the queue, which is also the number4360of consumer threads that can execute \verb"queue_pop" without blocking.4361Initially there are no items in the queue.43624363{\tt spaces} is the number of empty spaces in the queue, which is the4364number of producer threads that can execute \verb"queue_push" without4365blocking. Initially the number of spaces is the capacity of the queue,4366which is {\tt length-1}, as explained in Section~\ref{queue}.43674368Here is the new version of \verb"queue_push", which is run by4369producer threads:43704371\begin{verbatim}4372void queue_push(Queue *queue, int item) {4373semaphore_wait(queue->spaces);4374semaphore_wait(queue->mutex);43754376queue->array[queue->next_in] = item;4377queue->next_in = queue_incr(queue, queue->next_in);43784379semaphore_signal(queue->mutex);4380semaphore_signal(queue->items);4381}4382\end{verbatim}43834384Notice that \verb"queue_push" doesn't have to call4385\verb"queue_full" any more; instead, the semaphore keeps track of4386how many spaces are available and blocks producers if the queue4387is full.43884389Here is the new version of \verb"queue_pop":43904391\begin{verbatim}4392int queue_pop(Queue *queue) {4393semaphore_wait(queue->items);4394semaphore_wait(queue->mutex);43954396int item = queue->array[queue->next_out];4397queue->next_out = queue_incr(queue, queue->next_out);43984399semaphore_signal(queue->mutex);4400semaphore_signal(queue->spaces);44014402return item;4403}4404\end{verbatim}44054406This solution is explained, using pseudo-code, in Chapter 4 of4407{\it The Little Book of Semaphores}.44084409Using the code in the repo for this book, you should be able to compile4410and run this solution. You should be able to run:44114412\begin{verbatim}4413$ make queue_sem4414$ ./queue_sem4415\end{verbatim}4416441744184419\section{Make your own semaphores}4420\label{makeyourown}44214422Any problem that can be solved with semaphores can also be solved4423with condition variables and mutexes. 
One way to prove that's true
is to implement a semaphore using condition variables and mutexes.

Before you go on, you might want to try this as an exercise: write
functions that implement the semaphore API in {\tt sem.h}
using condition variables and mutexes. You can put your
solution in {\tt mysem.c} and {\tt mysem.h} in the repo for this book,
and you'll find my solution in \verb"mysem_soln.c" and
\verb"mysem_soln.h".

If you have trouble getting started, you can use the following
structure definition, from my solution, as a hint:

\begin{verbatim}
typedef struct {
    int value, wakeups;
    Mutex *mutex;
    Cond *cond;
} Semaphore;
\end{verbatim}

% TODO: Include Property 3 (it's in LBoS).

{\tt value} is the value of the semaphore. {\tt wakeups} counts
the number of pending signals; that is, the number of threads
that have been woken but have not yet resumed execution. The reason
for wakeups is to make sure that our semaphores have
Property 3, described in {\it The Little Book of Semaphores}.

{\tt mutex} provides exclusive access to {\tt value} and
{\tt wakeups}; {\tt cond} is the condition variable threads
wait on if they wait on the semaphore.

Here is the initialization code for this structure:

\begin{verbatim}
Semaphore *make_semaphore(int value)
{
    Semaphore *semaphore = check_malloc(sizeof(Semaphore));
    semaphore->value = value;
    semaphore->wakeups = 0;
    semaphore->mutex = make_mutex();
    semaphore->cond = make_cond();
    return semaphore;
}
\end{verbatim}


\subsection{Semaphore implementation}

Here is my implementation of semaphores using POSIX mutexes
and condition variables:

\begin{verbatim}
void semaphore_wait(Semaphore *semaphore)
{
    mutex_lock(semaphore->mutex);
    semaphore->value--;

    if (semaphore->value < 0) {
        do {
            cond_wait(semaphore->cond, semaphore->mutex);
        } while (semaphore->wakeups < 1);
        semaphore->wakeups--;
    }
    mutex_unlock(semaphore->mutex);
}
\end{verbatim}

When a thread waits on the semaphore, it has to lock the mutex
before it decrements {\tt value}. If the value of the semaphore
becomes negative, the thread blocks until a ``wakeup'' is
available. While it is blocked, the mutex is unlocked, so another
thread can signal.

Here is the code for \verb"semaphore_signal":

\begin{verbatim}
void semaphore_signal(Semaphore *semaphore)
{
    mutex_lock(semaphore->mutex);
    semaphore->value++;

    if (semaphore->value <= 0) {
        semaphore->wakeups++;
        cond_signal(semaphore->cond);
    }
    mutex_unlock(semaphore->mutex);
}
\end{verbatim}

Again, a thread has to lock the mutex before it increments
{\tt value}. If the semaphore was negative, that means threads
are waiting, so the signalling thread increments {\tt wakeups} and
signals the condition variable.

At this point one of the waiting threads might wake up, but the
mutex is still locked until the signalling thread unlocks it.

At that point, one of the waiting threads returns from \verb"cond_wait"
and checks whether a wakeup is still available. If not, it
loops and waits on the condition variable again. If so, it
decrements {\tt wakeups}, unlocks the mutex, and exits.

One thing about this solution that might not be obvious is the use of
a {\tt do...while} loop. Can you figure out why it is not a
more conventional {\tt while} loop?
What would go wrong?

The problem is that with a {\tt while} loop this implementation would
not have Property 3. It would be possible for a thread to signal and
then run around and catch its own signal.

With the {\tt do...while} loop, it is guaranteed\footnote{Well,
almost. It turns out that a well-timed spurious wakeup (see
\url{http://en.wikipedia.org/wiki/Spurious_wakeup}) can violate this
guarantee.} that when a thread signals, one of the waiting threads
will get the signal, even if the signalling thread runs around and
gets the mutex before one of the waiting threads resumes.

\end{document}