I believe that in the field of computer science there are two main types of professionals, without discrediting either of them. There are those who tinker first, and then there are those of us who eagerly dive into the documentation (and tinker later, …a lot). I love reading, so I’m one of those who turns to the docs first. I value good documentation both for diving into a new concept and for coming back to it later. I don’t trust my memory beyond what’s absolutely necessary because I’m well aware of its fallibility: I’ve had enough page faults on the way to the kitchen, forgetting why I went there in the first place. That’s why Damian Conway’s quote resonates so well with me:

“Documentation is a love letter that you write to your future self”

Also, obviously, I love programming. I would treat everything as code if I could, so why not documentation? This idea is not mine; it’s called Documentation as Code (or doc-as-code) and has quite a few advocates on the web. As soon as I got the opportunity, I prepared a Proof of Concept for doc-as-code, and now that I work as a Tech Lead, I’m going to ramp up the level of evangelism in my team. For me, doc-as-code is a comfortable and natural way to prepare documentation, and a much more auditable one. It’s not just talk: this blog is written as doc-as-code, and the vast majority of the documentation I produce at work (except for quick notes and meeting minutes) is also generated with that doc-as-code PoC. I write the documents in Markdown, and automation takes care of publishing them wherever they are needed.

Test your documentation

One of the great advantages of treating documentation as code is that it can be tested. All code can and should be tested. What tests should technical documentation pass? The question is open-ended, and in my case I’m answering it iteratively. For the initial version we decided to audit the content against a set of compliance rules. Not just a linter (there are already good linters for Markdown), but something to ensure that a document has the format we want and the sections we require and find beneficial. For example, right after the first heading there should be a note stating that the document is auto-generated and discouraging direct edits (any changes will disappear in the next run of the CD pipeline). Since I couldn’t find anything to test documentation as code, I simply started building it. This project, which I named Dactester, is still in development and has brought up a series of quite interesting needs. One of the most important ones was: if I want to automate testing a Markdown document using Python, how do I represent it in memory?
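
To make the idea concrete, here is a minimal sketch of what such a compliance check could look like. It is not Dactester’s actual code: the notice wording, the helper function and the file name are all assumptions for illustration.

import re

# Assumed wording of the auto-generated notice; the real rule may differ
AUTOGEN_NOTICE = "This document is auto-generated. Do not edit it directly."

def first_heading_is_followed_by_notice(lines):
    """Check that the auto-generated notice appears right after the first Markdown heading."""
    for position, line in enumerate(lines):
        if re.match(r"^#{1,6}\s", line):
            # Look at the next couple of lines so a single blank line is tolerated
            following = lines[position + 1:position + 3]
            return any(AUTOGEN_NOTICE in candidate for candidate in following)
    return False  # a document without headings also fails the check

with open("README.md", "r") as docfile:  # placeholder path for the document under test
    assert first_heading_is_followed_by_notice(docfile.readlines())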

Representing the Document: The Naive Solution 1.0

A Markdown document is a plain text file with its own syntax that lets interpreters render it, but at the data level it’s just plain text. A plain text file is a collection of lines, so my first implementation represents the document as a list of plain text lines. The code follows an object-oriented programming (OOP) approach and includes just enough methods to provide the minimal functional interface that Dactester requires.

The code

""" Represent a markdown document file """
__author__      = "Abian G. Rodriguez"
__version__     = "0.1"
__status__      = "PoC"

# A class representing a .md document and providing methods to handle it: read it, parse it, etc.

class MdDoc:
    """
    Class abstraction of a markdown document file.
    """

    def __init__(self):
        self.doclines = []

    def load_from_file(self, filepath):
        """
        Read a markdown file and load its lines into memory.
        Each line keeps its trailing newline character.
        """
        self.doclines = []
        try:
            with open(filepath, 'r') as sourcefile:
                for line in sourcefile:
                    self.doclines.append(line)
        except FileNotFoundError:
            self.doclines = []

    def get_doc_lines(self):
        """
        Return the list of document lines, or -1 if the document is empty.
        """
        if len(self.doclines) > 0:
            return self.doclines
        else:
            return -1

    def get_line_number(self, linestring):
        """
        Get the line number of a given line.
        If two or more lines are equal, only the first one is returned

        Parameters:
            linestring (str): The exact string representing a line to match

        Returns:
            linenumber (int): The number of the line found, or -1 if not found or document is empty
        """
        if len(self.doclines) > 0:
            if linestring in self.doclines:
                return self.doclines.index(linestring) + 1
            else:
                return -1
        else:
            return -1
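
For completeness, this is roughly how the class is meant to be used; the file name and the searched line here are just placeholders.

doc = MdDoc()
doc.load_from_file("example.md")                # placeholder file name
lines = doc.get_doc_lines()                     # list of lines, or -1 if the document is empty
position = doc.get_line_number("# My title\n")  # remember: lines keep their trailing newline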

It works, and that’s enough for a first approach, but not for me: I’d like to know whether it’s efficient. I’ve always admired those who squeezed the hardware to the maximum instead of just jumping to the next task regardless of the resources their code consumed. I grew up admiring the people who could squeeze polygons and effects out of the early 32-bit consoles, people capable of running Quake on the Sega Saturn.

Performance

To determine whether it’s efficient, in this initial approach I’m interested in how execution times vary with the size of the Markdown document, which is the only variable input. When running the entire Dactester process, the quantity and type of tests applied to the documents also matter, but for now I’ll focus solely on the document and its implementation. I’m interested in the three main operations performed by the MdDoc class, which will be the three tests I conduct:

  1. Read and load a document into memory.
  2. Loop through the list, performing an operation (typically a string comparison).
  3. Search for a specific line and return its line number.

To measure performance I’m going to use timeit from Python’s standard library. It returns the execution time in seconds for the snippet or callable passed as a parameter. By the way, I’m just getting started with performance testing, so if my methods aren’t entirely accurate or contain errors, feel free to send suggestions for improvement or comments. Remember you can contact me.
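
The snippet below sketches how the three measurements could be set up with timeit. The file name, the substring and the searched line are placeholders, not the actual test data.

import timeit

SMALL_DOC = "small.md"  # placeholder path for one of the test documents
REPEATS = 10_000        # repetition count used for the measurements below

def load(path):
    doc = MdDoc()
    doc.load_from_file(path)
    return doc

def load_and_loop(path, needle):
    # Loop through every line, performing a string comparison on each one
    return [line for line in load(path).get_doc_lines() if needle in line]

def load_and_find(path, linestring):
    # Find a specific line and return its line number
    return load(path).get_line_number(linestring)

print(timeit.timeit(lambda: load(SMALL_DOC), number=REPEATS))
print(timeit.timeit(lambda: load_and_loop(SMALL_DOC, "TODO"), number=REPEATS))
print(timeit.timeit(lambda: load_and_find(SMALL_DOC, "# Some heading\n"), number=REPEATS))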

Results

I’ve run each of the basic tests with three different document sizes: a small Markdown document with 50 lines, a medium one with 500, and a large one with 5,000, so each document has ten times more lines than the previous one. timeit returns the total time taken to run the code number times, where number is an integer parameter. In my case, each result is the total time for 10,000 executions, since the code is simple and relatively small and I want a repetition count high enough to make even minimal differences between document sizes visible. And now, here are the test results.

@@@@ Document load tests
Load test for small doc took: 1.9765488000120968 seconds
=======================================
Load test for medium doc took: 2.843191900057718 seconds
=======================================
Load test for big doc took: 10.372771300142631 seconds
=======================================
@@@@ END of Document load tests

@@@@ Load Document and loop through content tests
Load & Loop test for small doc took: 2.015927999978885 seconds
=======================================
Load & Loop test for medium doc took: 2.0034616000484675 seconds
=======================================
Load & Loop test for big doc took: 2.1118023002054542 seconds
=======================================
@@@@ END of Load Document and loop through content tests

@@@@ Load Document, find line and return line number tests
Load & Loop test for SMALL doc (line near start) took: 1.9678535000421107 seconds
=======================================
Load & Loop test for SMALL doc (line near the middle) took: 1.9765974001493305 seconds
=======================================
Load & Loop test for SMALL doc (line near EoF) took: 1.9668072001077235 seconds
=======================================
Load & Loop test for MEDIUM doc (line near start) took: 1.9576516000088304 seconds
=======================================
Load & Loop test for MEDIUM doc (line near the middle) took: 1.9541629999876022 seconds
=======================================
Load & Loop test for MEDIUM doc (line near EoF) took: 2.0762366999406368 seconds
=======================================
Load & Loop test for BIG doc (line near start) took: 1.9916893998160958 seconds
=======================================
Load & Loop test for BIG doc (line near the middle) took: 2.003709400072694 seconds
=======================================
Load & Loop test for BIG doc (line near EoF) took: 1.978114299941808 seconds
=======================================
@@@@ END of Load Document, find line and return line number tests

The operation with the most significant difference is loading the document into memory: the large document takes roughly five times as long as the small one. This makes sense because the loading is done with .append(), which, despite having a constant (amortized O(1)) time complexity per call, is called once per line, n times for a document with n lines. Consequently, loading a document into memory with MdDoc is O(n) overall.

For the load-and-loop operation the times are quite similar across sizes, although, curiously, the medium document consistently takes the same time as the small one or slightly less. This has happened in every test run and raises a new question that I’ll have to dig into another time. The per-line comparison in MdDoc is done with value in string, which for strings becomes a call to str.__contains__(). According to its CPython implementation, it uses a variant of the Boyer-Moore-Horspool algorithm: sublinear, O(n/m), in the best case, O(n) on average, and O(nm) in the worst case. Based on the test results, it does seem to approach something sublinear.

The operation that returns the line number of a given line shows a duration that approaches constant time. In theory, based on the information I’ve found, list.index() has linear time complexity, O(n). The fact that the measured behavior approaches O(1) may simply be because the documents aren’t large enough for the difference to become visible.

So, the three main operations are on average O(n): not highly efficient, but not terribly inefficient either. It reassures me that in my tests the operation times are nearly constant for the medium document, which is close in size to the documents we work with on my team. For now, it will suffice.

However, I’m not settling. Among the project notes there is now a measurable reason to improve the efficiency of the MdDoc class.
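
One possible direction, just a sketch and not something MdDoc currently does, would be to pay the O(n) cost once at load time and build a dictionary from line content to line number, so lookups stop scanning the whole list:

class IndexedMdDoc(MdDoc):
    """Variant of MdDoc that trades memory for O(1) average line-number lookups."""

    def load_from_file(self, filepath):
        super().load_from_file(filepath)
        # Map each line to its first occurrence, matching MdDoc.get_line_number(),
        # which only ever reports the first match.
        self.lineindex = {}
        for number, line in enumerate(self.doclines, start=1):
            self.lineindex.setdefault(line, number)

    def get_line_number(self, linestring):
        return self.lineindex.get(linestring, -1)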

Dactester action

As I mentioned, I couldn’t find anything to automate the testing of doc-as-code, so I had to create it myself. I’ve decided to publish the same GitHub Action that I use in our team’s documentation CI/CD pipeline, under the very original name of dactester-action, in case someone finds it interesting after reading this post. Sharing is caring.

Kudos