Email is one of the main ways our customers communicate with us. We receive hundreds of travel booking emails daily, and all of them contain time expressions that we need to turn into a computer-readable format.
Humans naturally talk about “tomorrow afternoon” or “early next Monday morning”. These expressions are both contextual and ambiguous: the exact date of “next Monday” or “tomorrow” depends on what day it is today. Furthermore, “early morning” means different things to different people, although some common overlap can probably be agreed upon. Computers tend to prefer well-defined and unambiguous times; for instance, 1530961500 is the Unix timestamp for July 7th 2018, 11:05 (UTC). Note that the time zone, something most of us don’t usually actively think about, is also meaningful in this context.
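To make the two ideas above concrete, here is a minimal sketch using only the Python standard library. It converts the timestamp from the text into a readable date and resolves “tomorrow” and “next Monday” against a reference date. The helper names `resolve_tomorrow` and `resolve_next_monday` are made up for this illustration, the fixed +02:00 offset is an assumption standing in for a real time zone database, and “next Monday” is read here as the first Monday strictly after the reference date, which is only one of several possible interpretations.

```python
from datetime import datetime, timedelta, timezone

TS = 1530961500  # the Unix timestamp mentioned in the text

# The same instant rendered in two time zones.
utc_time = datetime.fromtimestamp(TS, tz=timezone.utc)
# Assumption: a fixed +02:00 offset as a stand-in for Central European Summer Time.
local_time = utc_time.astimezone(timezone(timedelta(hours=2)))


def resolve_tomorrow(reference):
    """Return the calendar date one day after the reference datetime."""
    return (reference + timedelta(days=1)).date()


def resolve_next_monday(reference):
    """Return the first Monday strictly after the reference datetime."""
    # weekday() is 0 for Monday; the result is always between 1 and 7 days ahead.
    days_ahead = (0 - reference.weekday() - 1) % 7 + 1
    return (reference + timedelta(days=days_ahead)).date()


print(utc_time.isoformat())
print(local_time.isoformat())
print(resolve_tomorrow(utc_time))
print(resolve_next_monday(utc_time))
```

The same words resolve to different dates as soon as the reference date changes, which is exactly why a parser needs to know “today” before it can answer.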
Parsing time expressions into structured, computer-readable data is therefore challenging. Many solutions exist, but they are either too simplistic or problematic to use in a Python setup. We therefore wrote a pure Python library for time parsing. ctparse is an MIT-licensed library built on straightforward concepts. It allows parsing complex expressions efficiently and can easily be adjusted for domain-specific use cases: https://github.com/comtravo/ctparse.
In many ways ctparse is similar to duckling, albeit admittedly with a significantly smaller scope for the time being. ctparse implements a regular-expression and rule-based system for parsing time and date expressions. There is also a statistical model to rank different parses and favour reasonable solutions over others. Whilst still at an early stage, ctparse currently outperforms duckling in parsing date/time expressions from email booking requests, both in terms of speed and accuracy.
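The general shape of a regular-expression and rule-based parser with a ranking step can be sketched in a few lines. This is a toy illustration and not ctparse's implementation: the names `RULES` and `parse_candidates` are hypothetical, the three rules cover only a tiny fraction of real expressions, and the score (the share of the input text covered by the match) is a crude stand-in for a trained statistical model.

```python
import re
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def _tomorrow(match, ref):
    return ref + timedelta(days=1)


def _weekday(match, ref):
    # First occurrence of the named weekday strictly after the reference date.
    target = WEEKDAYS.index(match.group("day").lower())
    days_ahead = (target - ref.weekday() - 1) % 7 + 1
    return ref + timedelta(days=days_ahead)


def _iso_date(match, ref):
    return date(int(match.group("y")), int(match.group("m")), int(match.group("d")))


# Each rule pairs a regular expression with a function that resolves a match.
RULES = [
    ("tomorrow", re.compile(r"\btomorrow\b", re.I), _tomorrow),
    ("weekday",
     re.compile(r"\b(?:next\s+)?(?P<day>" + "|".join(WEEKDAYS) + r")\b", re.I),
     _weekday),
    ("iso_date", re.compile(r"\b(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})\b"), _iso_date),
]


def parse_candidates(text, ref):
    """Return (score, rule name, resolved date) tuples, best score first."""
    candidates = []
    for name, pattern, resolve in RULES:
        for match in pattern.finditer(text):
            try:
                resolved = resolve(match, ref)
            except ValueError:
                continue  # e.g. "2018-13-45" matches the pattern but is not a date
            # Toy score: fraction of the input text covered by the match.
            score = len(match.group(0)) / len(text)
            candidates.append((score, name, resolved))
    return sorted(candidates, key=lambda c: c[0], reverse=True)


reference = date(2018, 7, 7)  # a Saturday
for candidate in parse_candidates("flight next Monday", reference):
    print(candidate)
```

Separating pattern matching, resolution and ranking keeps each rule small, and new domain-specific expressions can be supported by appending to the rule list.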
For more details, have a look at my talk about ctparse at the previous PyData Berlin conference. In the talk I lay out the basic concepts and ideas behind building the PCFG-inspired (probabilistic context-free grammar) parser, discuss in detail some of the more challenging algorithmic building blocks, and demonstrate how Python is actually a very good choice for implementing such a system.
Have a look at ctparse on GitHub and let us know what you think: https://github.com/comtravo/ctparse.