Wednesday, October 15, 2014

Mathematical Expression Parsing and Evaluation Made Easy


Disclaimer

This post is about how to parse and evaluate mathematical expressions. I assume there are several libraries built to enable you to do this already. I'm not particularly well versed in regular expressions, but I assume many implementations probably provide functions for this as well. This post, however, aims to explain how mathematical expression parsing and evaluation works even if you probably don't want to implement it yourself so as to provide you with a new level of understanding.

I should also mention that the code presented below will likely not be the most capable of all mathematical expression evaluators, nor the most performant. It is however, a neat way to think of the problem and trivializes parsing and evaluation of mathematical expressions.

Finally, there might be flaws in my logic. I am human after all. The algorithm as presented below seems to work for all test cases I have given it, but supposing I was stuck in a rut when coming up with test cases I might have categorically missed cases where my algorithm pathologically fails. If so, I am sorry. Hopefully, the idea presented should come through clearly enough regardless.

Introduction

As a programmer, you most likely know what a mathematical expression is. Hell, as a middle school graduate you probably know what a mathematical expression is (although I admit that if you are not regularly performing math in your adult life the terminology might have slipped your mind by now).

Simply put, a mathematical expression is a set of terms combined by a set of operations. Something as simple as this is an expression:

a + b * c

And something as complex (or more) as this is an expression:

(ex * 2x+1) / x!

The naïve approach to expression parsing would be to simply step through a string and evaluate. The problem faced when doing that approach is the order of operations. The order of operations observed is generally PEMDAS, meaning parenthesis are always evaluated first (writer's note; parenthesis are not mathematical operations in and of themselves, merely a way to escape the order of operations otherwise imposed in evaluation); exponents are evaluated second; multiplication and division are evaluated third, but have equal standing in relation to each other; addition and subtraction are evaluated last, but also have equal standing in relation to each other. Operations having equal standing simply means that those operations are executed in the order they are encountered from left to right since there is no internal preference between them.

I should note that sometimes there are ways to stylize a mathematical expression so as to escape PEMDAS. Commonly, division is used in such a way;

a + b
c  -  b

unrolls into (a + b) / (c - d) rather than a + (b / c) – d despite the lack of parenthesis. We are not going to be handling these cases as code complexity would swell by quite a bit and obscure the relatively simple principles that lie underneath the approach taken in this post.

The idea

As was mentioned before, some algorithms might seek to just parse a string left to right in order to evaluate a mathematical expression. However, this might break the order of operations. Some other algorithms parse left to right but try to be more aware of the context and might jump forwards and backwards in the string in order to evaluate the expression correctly. This is a needlessly obtuse approach even if it might work. In fact, it is probably a similar approach to the one I am about to present but without the use of a very important data structure that aids the functioning of the algorithm immensely.

Consider the following expression:

(a + b) + c * d^x

If we de-construct the terms into a logical order of binary operations we should get the following:

A = a + b
B = d^x
C = c * A
D = A + C

Where the result will be stored in D. Thinking about this a little harder we can see that we only use at most two evaluated operands at a time. We are also recursively evaluating each term. With this in mind, that same sequence can be expressed as a diagram:



Hopefully by now, the use of data structure should be evident; It is the trusty old binary tree. However, within the this field of science it is most often referred to as a syntax tree. Evaluating the expression is no more than traversing the tree in order and performing the operations on the operands given. The astute might have noticed that the trees can become quite unbalanced depending on the expression. However, this does not matter in the slightest for our purposes. I will explain why shortly. In order to do that though, we need to address the question of how the tree is constructed in the first place, and then parsed.

In essence, for a given node and a string containing an expression we need to parse for the least significant operation (discounting the contents inside parenthesis) and then split the string containing the expression at that operation into a left side and a right side while the operation itself is stored at the current node, making it an operation node. We then recursively perform this logic on the left and right child nodes of the current node, giving the left and right string split as input to those recursive calls respectively. Note that, as mentioned before, parenthesis is not an operation per se, which is why we are not parsing for it at this stage. Because of this, we are only looking for the EMDAS in PEMDAS. I will come back to the question of why we are reversing the search for operations as it is related to how we traverse the tree and evaluate its contents.

If we can not locate any operation in the expression we are processing then the expression might be a parenthesis. In that case, remove the outer parenthesis and make another recursive call to our logic given our current node and newly formed string as input.

If an input expression neither has an operation in it, nor is a parenthesis, then it must be a value. This brings us to the third and final case to test for, and constitutes our stopping condition; the current node is marked as a value node. Simply put, try to convert the expression to a type your of choice (usually an integer or a real value). If the conversion succeeds, store the value in the current node. If the conversion fails then maybe it is a symbolic constant. In that case look the value up and store it in the current node. Otherwise, something is amiss and the algorithm generates an error.

Since we are constructing the tree without having to leave the current node and then insert a new one by traversing the tree down a branch this is an O(1) operation. In total however, we process n items, where n is the number of operations and operands combined, making the construction of the tree an O(n) operation in total.

Evaluating the tree is nothing more than traversing the tree by requesting the value stored at a node. If the node is a value node, it simply returns the value that is stored at that node without asking any further questions. If, on the other hand, the node is an operation node it will have to query its left and right child nodes for their values and storing the results inside a left-hand-side variable and a right-hand-side variable respectively. Finally it returns the combined LHS and RHS based on what operation is stored inside the node.

Knowing this, the answer to why we are parsing for the operations in reverse order, i.e. addition/subtraction, then multiplication/division, and finally exponents, should be fairly obvious as the  topmost (as opposed to the root of the tree) operations stored in the tree are the first operations performed during a traversal of the tree. That means that the most significant operations must be stored at the top in order for them to take precedence over lesser significant operations. This is done by adding them to the tree last. However, reverse search does not end there. We also have to parse the expression string from right to left so that operations of the same rank are evaluated left to right. PEMDAS sets up an exception to the left-to-right rule; exponents (many applications screw this exception up, such as Excel, GNU Octave, and my first draft of this article). Exponents are parsed right-to-left, meaning the expression 3^4^5 evaluates to 3^(4^5), not (3^4)^5 as would be the case with 3/4*5.

Since we have to move through all of the elements in our tree to evaluate the expression we get an O(n) time complexity. This brings me back to the initial concern about unbalanced trees and why it is not a problem for our purposes. Constructing the tree is an O(n) operation, while evaluating the expression (traversing the tree) is also an O(n) operation, meaning the entire process is O(n) + O(n) or O(2n), simplified to O(n).

The implementation

For this post I will be using C++, but feel free to implement this in any language you like. The logic still stays the same regardless. The general outline of the data structure used:

class Expression
{
private:
    std::string m_expression;
    float       m_result;
    Node        *m_root;
private:
    bool   IsTermBalanced( void ) const;
    bool   IsBraceBalanced( void ) const;
    void   SanitizeExpression( void );
    void   DestroyTermTree(Node *node);
    void   GenerateTermTree(Node *& node, const std::string &expression);
    size_t FindOperationLeftToRight(const std::string &operations, const std::string &expression) const;
    size_t FindOperationRightToLeft(const std::string &operations, const std::string &expression) const;
       
public:
          Expression( void );
          ~Expression( void );

    bool  SetExpression(const std::string &expression);
   
    float Evaluate( void ) const;
};

As a side node, I was arguing with myself about how to implement this algorithm, mainly regarding the somewhat dubious question of whether the most elegant implementation is a class with functions, or just a collection of functions. I came to the conclusions that just using functions would be incredibly elegant, yet somewhat at odds with the design philosophy as championed by C++, at which point I decided to implement the algorithm as a class.

The presented binary tree is somewhat unorthodox. Some nodes are values, some nodes are operations. This might pose some difficulty when implementing the tree given certain languages, but there should be ways to work around this. In C++ we can solve the problem using polymorphism. In C, we could simply make a generic node structure and attach a state flag indicating whether the node is a value or an operation.

struct Node
{
    Node *left;
    Node *right;
   
    virtual ~Node( void ) {}
    virtual float Evaluate( void ) const = 0;
};
struct Operation : public Node
{
    char operation;
   
    float Evaluate( void ) const;
};
struct Value : public Node
{
    float value;
   
    float Evaluate( void ) const;
};

Note, that the end user will never actually handle this data structure. Making it local to a compilation unit makes sense, meaning you can forward declare it in your header file while actually declaring it in you source file.

Since we described the topic by covering the creation of the tree first we will do so here as well. The function is simply called SetExpression, but the main workhorse is the recursive call to GenerateTermTree.

bool Expression::SetExpression(const std::string &expression)
{
    m_expression = expression;
    SanitizeExpression();
    bool retVal = IsBraceBalanced() && IsTermBalanced();
    if (!retVal) {
        m_expression.clear();
    }
    DestroyTermTree(m_root);
    GenerateTermTree(m_root, m_expression);
    return retVal;
}

void Expression::GenerateTermTree(Node *& node, const std::string &expression)
{
    const std::string Pemdas[] = { "+-", "*/", "^" }; // reverse the order
   
    size_t opIndex = std::string::npos;
    char op;
   
    for (int i = 0; i < 3; ++i) {
        if (Pemdas[i] != "^") {
             opIndex = FindOperationRightToLeft(Pemdas[i], expression);
        } else {
             opIndex = FindOperationLeftToRight(Pemdas[i], expression);
        }
        if (opIndex != std::string::npos) {
            op = expression[opIndex];
            break;
        }
    }
   
    if (opIndex != std::string::npos) {
           
        Operation *opNode = new Operation;
        opNode->operation = op;
        opNode->left = NULL;
        opNode->right = NULL;
       
        std::string lexpr = expression.substr(0, opIndex);
        std::string rexpr = expression.substr(opIndex + 1, expression.size() - opIndex - 1);
       
        GenerateTermTree(opNode->left, lexpr);
        GenerateTermTree(opNode->right, rexpr);
       
        node = opNode;
    } else if (expression[0] == '(' && expression[expression.size() - 1] == ')') {
               
        GenerateTermTree(node, expression.substr(1, expression.size() - 2));
               
    } else { // STOPPING CONDITION
       
        Value *valNode = new Value;
        valNode->left = NULL;
        valNode->right = NULL;
        valNode->value = atof(expression.c_str());
        node = valNode;
       
    }
}

You may have noticed that this implementation performs some functions that we never mentioned in the previous section since they are not inherently associated with the data structure or algorithm. As such I will not go into exact detail about how these functions do what they are supposedly doing, but I will give a brief overview.

  • SanitizeString simply removes all kinds of whitespaces from the expression string.
  • IsBraceBalanced makes sure that each opening parenthesis corresponds to a closing parenthesis.
  • IsTermBalanced makes sure that each operation has exactly two operands.
  • FindOperation linearly parses the string looking for one of the specified operators, stopping at the first one. The trick to this function is to increment a counter for every opening brace encountered and decrement that same counter for every closing brace encountered. While the counter is not 0, skip detecting operators. The variants *LeftToRight and *RightToLeft searches for the characters incrementing from the first to the last character in the expression or decrementing from the last to the first character respectively.
  • DestroyTermTree frees the memory used by the expression data structure.

And now the evaluation. Most of the code is done in the node's virtual Evaluate function, while the tree's Evaluate merely calls upon the root node's Evaluate function:

float Operation::Evaluate( void ) const
{
    float lval = left->Evaluate();
    float rval = right->Evaluate();
    float result;
   
    switch (operation) {
        case '+':
            result = lval + rval;
            break;
        case '-':
            result = lval - rval;
            break;
        case '*':
            result = lval * rval;
            break;
        case '/':
            result = lval / rval;
            break;
        case '^':
            result = pow(lval, rval);
            break;
    }
   
    return result;
}

float Value::Evaluate( void ) const
{
    return value;
}

float Expression::Evaluate( void ) const
{
    float result = 0.0f;
    if (m_root != NULL) {
        result = m_root->Evaluate();
    }
    return result;
}


With this we have now covered the bulk of the algorithm as well as the data structure, but your implementation could just as well look very different. There is still some room for improvement though.

Further improvements

Reading this post, and being somewhat well versed in mathematics, you might go 'Hold on! What about unary operators!?' Your concern is legitimate. The tree structure does not fit perfectly for unary operators, or functions for that matter. It could however quite trivially be extended to handle these cases. On the one hand, there is the unary operator -, as in -x. In reality though you can substitute this for (0-x). Other unary operators such as ! are harder to handle, but not impossible. Unary operators and functions could be implemented by adding a new node type that does not branch (or for simplicity's sake consistently only leads down the left or right pointer to another node).

The code as presented above does not have support for symbolic constants, making that an obvious point of improvement. That should not be more difficult than doing a string lookup if the conversion to a floating point number fails. However, that will require another string to floating point value conversion function as atof will always return a valid value (0.0 on failure).

No comments:

Post a Comment