Master Thesis: [PDF] Systematic Analysis of the PDF Landscape



The main goal of this thesis is the collection and analysis of PDF files. For this purpose, an existing Python-based tool should be refactored and extended.

The thesis consists of three parts:

  • PDF Collector: In this module, a methodology for collecting PDF files in the wild will be established. The goal is to collect PDFs from different areas, e.g., PDFs published on websites, malicious PDFs from public databases, or governmental PDFs, which are usually protected. The methodology will be implemented and a worldwide scan will be run.
  • PDF Analyzer: In this module, one main question will be answered: “Which features are used in the collected (malicious and benign) PDF files?”
    • First, based on an analysis of the PDF specification, a catalog of security-relevant features will be created.
    • Second, an analysis methodology will be developed by comparing different methods, e.g., a RegEx-based analysis that does not parse the PDF and a parsing-based analysis.
    • Third, all previously collected PDF files will be analyzed and details regarding relevant features will be extracted.
  • Systematization of Knowledge
    • All previously collected results will be systematized and key insights will be highlighted.
    • Potential security problems will also be summarized and their causes will be analyzed.
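
To illustrate the RegEx-based analysis mentioned for the PDF Analyzer, the following is a minimal sketch that counts selected PDF name objects in the raw bytes of a file without parsing it. The feature list and the function name scan_features are assumptions chosen for illustration only; the actual catalog is meant to be derived from the PDF specification as part of the thesis.

```python
import re
from typing import Dict

# Illustrative catalog (an assumption, not the catalog of the thesis):
# PDF name objects that are often considered security-relevant.
FEATURE_NAMES = [
    b"JavaScript",
    b"OpenAction",
    b"Launch",
    b"EmbeddedFile",
    b"URI",
    b"SubmitForm",
]

# A PDF name ends at whitespace, a delimiter, or the end of the data.
# The lookahead prevents /URI from also matching a longer name such as /URIx.
_NAME_END = rb"(?=[\s/<>\[\]()%{}]|$)"

PATTERNS = {
    name.decode("ascii"): re.compile(rb"/" + re.escape(name) + _NAME_END)
    for name in FEATURE_NAMES
}


def scan_features(data: bytes) -> Dict[str, int]:
    """Count how often each cataloged name occurs in the raw bytes."""
    return {name: len(pattern.findall(data)) for name, pattern in PATTERNS.items()}
```

Such a scan is fast, but it cannot see names inside compressed streams or names written with #xx hex escapes, which is one reason to compare it against a parsing-based analysis.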


Your task is to write a crawler for one website and to determine automatically whether each PDF document is signed. Stick to the best practices for crawling and scraping:
  • Write a PDF crawler that scans the website.
  • Download all PDF files and all ZIP files containing PDF documents.
  • Determine which of the downloaded PDFs are signed and which are not.
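
The steps above could be approached roughly as in the following sketch, which collects links to PDF and ZIP files from an HTML page and applies a byte-level heuristic to decide whether a PDF looks signed. All names (LinkCollector, looks_signed, signed_members) are hypothetical, and the heuristic is an assumption: it only searches for the typical entries of a signature dictionary (/ByteRange together with /Type /Sig or /SubFilter), does not validate any signature, and may miss signatures stored in compressed object streams.

```python
import io
import re
import zipfile
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs of <a> links that point to .pdf or .zip files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(self.base_url, href)
        # Compare only the path so that query strings do not hide the extension.
        if urlparse(url).path.lower().endswith((".pdf", ".zip")):
            self.links.append(url)


# Heuristic markers of a signature dictionary (assumption, see lead-in).
_BYTE_RANGE = re.compile(rb"/ByteRange\s*\[")
_SIG_HINT = re.compile(rb"/Type\s*/Sig\b|/SubFilter\s*/")


def looks_signed(data: bytes) -> bool:
    """Return True if the bytes look like a PDF containing a signature."""
    if b"%PDF-" not in data[:1024]:
        return False
    return bool(_BYTE_RANGE.search(data)) and bool(_SIG_HINT.search(data))


def signed_members(zip_bytes: bytes) -> list:
    """Return the names of PDF files inside a ZIP archive that look signed."""
    signed = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".pdf"):
                continue
            if looks_signed(archive.read(info)):
                signed.append(info.filename)
    return signed
```

Downloading is deliberately left out of the sketch. A polite crawler would additionally honor robots.txt (for example with urllib.robotparser from the standard library), limit its request rate, and stay within the target website.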
Send your source code and a list of all signed files to


  • Python
  • Lecture Message-Level Security


Supervision:  Christian Mainka, Vladislav Mladenov, Simon Rohlmann


Start date: immediately