Event image

Abstract:

We show how a combination of type theory, prefix codes, nondeterministic automata and determinization to streaming string transformers yields a regular expression parser that executes in worst-case linear time (for fixed regular expression) and is optimally streaming: it outputs bits as soon as those are semantically uniquely determined. Preliminary tests indicate that their straightforward rendition in C operates at a sustained 100+ Mbps rate on complex regular expressions (on a commodity PC) for large data sets (e.g. 300 GB/file); this seems to be significantly faster than widely used subgroup matching tools, including Google’s RE2, which yield less semantic information and are observed to operate at 2 to 20 Mbps.

We speculate that expressive regular expression processing corresponding to subgroup matching and Perl-style substitution may be pushed to execute at 1 Gbps per core without employing machine-specific or hardware-oriented tricks.

Joint work (partly published at ICTAC 2014, partly unpublished) with Bjørn Bugge Grathwohl and Ulrik Terp Rasmussen at DIKU, the Department of Computer Science at U Copenhagen.

Bio:

Fritz Henglein is professor of programming languages and systems at DIKU, the Department of Computer Science at the University of Copenhagen, where he heads the section on algorithms and programming languages and the HIPERFIT research center on functional high-performance computing for finance.

His research interests are in semantic, logical and algorithmic aspects of programming languages, specifically type inference, type-based program analysis, algorithmic functional programming and domain-specific languages, and the application of programming language technology in enterprise systems, health care processes and finance.