Preventing Directory Traversal Vulnerabilities in Python

Directory traversal 101

If you develop or maintain any software that allows a user to request files from a server (e.g. a web server), you need to watch out for directory traversal attacks. They work like this: Imagine your server is configured to serve files from the directory /var/srv/my-site (henceforth known as the “document root”). If someone requests http://HOSTNAME/index.html, the server knows to search the document root for a file named “index.html” and serve up /var/srv/my-site/index.html.

What if the attacker requests something more clever than index.html? For example, what if the attacker requests something like http://HOSTNAME/../../../home/user/passwords.txt? The .. is a relative path component that tells the server to move up one directory in the path. If your code isn’t checking for input like this, your server will try and resolve /var/srv/my-site/../../../home/user/passwords.txt to /home/user/passwords.txt. The attacker can use relative paths to “break out” of the document root and extract any files that the web server can read.

How not to do it

On more than one occasion, I’ve seen code that looks like this:

import os.path

DOCUMENT_ROOT = "/var/srv/my-site"


def validate_requested_file_path(requested_path: str):
    absolute_document_root = os.path.abspath(DOCUMENT_ROOT)
    absolute_requested_path = os.path.abspath(requested_path)

    if not absolute_requested_path.startswith(absolute_document_root):
        raise ForbiddenPathError(
            f"The requested path '{requested_path}' is forbidden"
        )

At a glance, this looks good. The requested path and document root are passed to os.path.abspath(), which resolves any relative components or symlinks and returns an absolute, canonical path. A canonical path is the simplest absolute path to a given file. For example, the canonical path of /var/srv/../file.txt is /var/file.txt. Canonicalizing two paths allows them to be accurately compared. Well done, right?

This approach handles most attacks, but there’s an edge case it misses. Imagine that your directory structure is set so that the document root is /var/srv/my-site and your X.509 certificate is stored in /var/srv/my-site-certificates.

$ ls -l /var/srv
drwxr-xr-x 2 webserver webserver    4096 May 17  2021 my-site/
drwxr-xr-x 2 webserver webserver    4096 May 17  2021 my-site-certificates/

Do you see the issue? If the attacker requests http://HOSTNAME/../my-site-certificates/my-site.key, the web server will resolve the requested path to /var/srv/my-site-certificates/my-site.key. The code uses string comparison to check if /var/srv/my-site-certificates starts with /var/srv/my-site. Guess what? It does.

A safer way

I’m a big fan of using Path objects from Python’s pathlib library instead of os.path functions to handle file paths. As luck would have it, Path objects offer a safer way to detect attempted directory traversals.

from pathlib import Path

DOCUMENT_ROOT = Path("/var/srv/my-site")


def validate_requested_file_path(requested_path: Path):
    if DOCUMENT_ROOT.resolve() not in requested_path.resolve().parents:
        raise ForbiddenPathError(
            f"The requested path '{requested_path}' is forbidden"
        )

First, we call Path.resolve() for both the DOCUMENT_ROOT and requested_path. By canonicalizing the paths, we can have confidence that we’re comparing apples to apples. Next, the Path.parents() method returns a sequence containing every ancestor directory of requested_path. Checking that DOCUMENT_ROOT is an ancestor of requested_path is more precise than using string comparison and leaves little room for error.

Caveats

Symlinks and exceptions

Path.resolve() resolves symlinks, which may or may not be desirable for your application. Furthermore, if any infinite loops are encountered while resolving the path, a RuntimeError will be raised. If either of these behaviors is undesirable, an alternate solution using os.path.abspath can be seen below. Note, however, that because os.path.abspath does not resolve symlinks, the paths will be absolute but may not be canonical. In these cases, an attacker may still be able to write files outside of the document root. I would recommend the implementation above for most applications; the implementation below has been added for completeness.

from pathlib import Path
from os.path import abspath

# WARNING: os.path.abspath does not resolve symlinks
DOCUMENT_ROOT = Path(abspath("/var/srv/my-site"))


def validate_requested_file_path(requested_path: Path):
    if DOCUMENT_ROOT not in Path(abspath(requested_path)).parents:
        raise ForbiddenPathError(
            f"The requested path '{requested_path}' is forbidden"
        )

Limitations

It should be noted that either method presented here is only secure if the attacker has no way of manipulating the directory structure on the server. If the attacker can create symlinks or move directories, then more rigorous checks are needed to prevent directory traversal, but that’s a topic for a future post.