Module Pcre

Perl Compatible Regular Expressions for OCaml

7.4.1 - homepage

Exceptions

type error =
| Partial

String only matched the pattern partially

| BadPartial

Pattern contains items that cannot be used together with partial matching.

| BadPattern of string * int

BadPattern (msg, pos): the regular expression is malformed. The reason is in msg, and the position of the error in the pattern is in pos.

| BadUTF8

UTF8 string being matched is invalid

| BadUTF8Offset

Gets raised when a UTF8 string being matched with offset is invalid.

| MatchLimit

The maximum allowed number of match attempts with backtracking or recursion was reached during matching. Any function that calls the matching engine may raise this error.

| RecursionLimit
| WorkspaceSize

Raised by pcre_dfa_exec when the provided workspace array is too small. See the documentation on pcre_dfa_exec for details on workspace array sizing.

| InternalError of string

InternalError msg: the C library exhibits unknown/undefined behaviour. The reason is in msg.

exception Error of error

Exception indicating PCRE errors.

exception Backtrack

Backtrack used in callout functions to force backtracking.

exception Regexp_or of string * error

Regexp_or (pat, error) gets raised for sub-pattern pat by regexp_or if it failed to compile.
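The following is a minimal sketch of how these exceptions might be handled. It assumes the library is linked as the findlib package pcre; the helper name compile_checked is hypothetical and not part of the module.

```ocaml
(* Hypothetical helper: compile a pattern and turn a malformed-pattern
   error into a result value instead of an exception. *)
let compile_checked (pat : string) : (Pcre.regexp, string * int) result =
  try Ok (Pcre.regexp pat)
  with Pcre.Error (Pcre.BadPattern (msg, pos)) -> Error (msg, pos)

let () =
  match compile_checked "(unclosed" with
  | Ok _ -> print_endline "compiled"
  | Error (msg, pos) -> Printf.printf "bad pattern at %d: %s\n" pos msg
```

Other constructors of error (for example MatchLimit) are deliberately not caught here, so they still propagate as Pcre.Error.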

Compilation and runtime flags and their conversion functions

type icflag

Internal representation of compilation flags

type irflag

Internal representation of runtime flags

type cflag = [
| `CASELESS
| `MULTILINE
| `DOTALL
| `EXTENDED
| `ANCHORED
| `DOLLAR_ENDONLY
| `EXTRA
| `UNGREEDY
| `UTF8
| `NO_UTF8_CHECK
| `NO_AUTO_CAPTURE
| `AUTO_CALLOUT
| `FIRSTLINE
]

Compilation flags

val cflags : cflag list -> icflag

cflags cflag_list converts a list of compilation flags to their internal representation.

val cflag_list : icflag -> cflag list

cflag_list cflags converts internal representation of compilation flags to a list.

type rflag = [
| `ANCHORED
| `NOTBOL
| `NOTEOL
| `NOTEMPTY
| `PARTIAL
| `DFA_RESTART
]

Runtime flags

val rflags : rflag list -> irflag

rflags rflag_list converts a list of runtime flags to their internal representation.

val rflag_list : irflag -> rflag list

rflag_list rflags converts internal representation of runtime flags to a list.
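As a small illustration of the conversion functions, the sketch below converts flag lists to their internal representation once and reuses the result. It assumes the pcre findlib package is available; the pattern and subject are made up for the example.

```ocaml
(* Convert a list of compilation flags once and reuse the result. *)
let icf : Pcre.icflag = Pcre.cflags [`CASELESS; `MULTILINE]

(* Compile with the preconverted flags. *)
let rex = Pcre.regexp ~iflags:icf "^hello"

(* With `MULTILINE, ^ also matches after the newline; with `CASELESS,
   "HELLO" matches "hello". *)
let matched = Pcre.pmatch ~rex "say\nHELLO"

(* Convert the internal representation back to a list of flags. *)
let flags_back : Pcre.cflag list = Pcre.cflag_list icf

(* Runtime flags are converted in the same way. *)
let irf : Pcre.irflag = Pcre.rflags [`NOTEMPTY]
```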

Information on the PCRE-configuration (build-time options)

val version : string

Version information

Version of the PCRE-C-library

val config_utf8 : bool

Indicates whether UTF8-support is enabled

val config_newline : char

Character used as newline

val config_link_size : int

Number of bytes used for internal linkage of regular expressions

val config_match_limit : int

Default limit for calls to internal matching function

val config_match_limit_recursion : int

Default recursion limit for calls to the internal matching function

val config_stackrecurse : bool

Indicates use of stack recursion in matching function
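A short sketch that prints some of these build-time values, assuming the pcre findlib package is linked. The output depends on how the underlying C library was built, so none is shown.

```ocaml
(* Print a few build-time properties of the linked PCRE C library. *)
let () =
  Printf.printf "PCRE version: %s\n" Pcre.version;
  Printf.printf "UTF-8 support: %b\n" Pcre.config_utf8;
  Printf.printf "default match limit: %d\n" Pcre.config_match_limit;
  Printf.printf "default recursion limit: %d\n"
    Pcre.config_match_limit_recursion;
  Printf.printf "stack recursion: %b\n" Pcre.config_stackrecurse
```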

Information on patterns

type firstbyte_info = [
| `Char of char
| `Start_only
| `ANCHORED
]

Information on matching of "first chars" in patterns

type study_stat = [
| `Not_studied
| `Studied
| `Optimal
]

Information on the study status of patterns

type regexp

Compiled regular expressions

val options : regexp -> icflag
val size : regexp -> int
val studysize : regexp -> int
val capturecount : regexp -> int
val backrefmax : regexp -> int
val namecount : regexp -> int
val nameentrysize : regexp -> int
val names : regexp -> string array
val firstbyte : regexp -> firstbyte_info
val firsttable : regexp -> string option
val lastliteral : regexp -> char option
val study_stat : regexp -> study_stat
val get_stringnumber : regexp -> string -> int
val get_match_limit : regexp -> int option
val get_match_limit_recursion : regexp -> int option

Compilation of patterns

type chtables

Alternative set of char tables for pattern matching

val maketables : unit -> chtables

Generates new set of char tables for the current locale.

val regexp : ?⁠study:bool -> ?⁠limit:int -> ?⁠limit_recursion:int -> ?⁠iflags:icflag -> ?⁠flags:cflag list -> ?⁠chtables:chtables -> string -> regexp

regexp ?study ?limit ?limit_recursion ?iflags ?flags ?chtables pattern compiles pattern with flags when given, with iflags otherwise, and with char tables chtables. If study is true, the resulting regular expression will be studied. If limit is specified, it caps the amount of recursion and backtracking; only values lower than the builtin default take effect. If this limit is exceeded, MatchLimit will be raised during matching.

parameter study

default = true

parameter limit

default = no extra limit other than default

parameter limit_recursion

default = no extra limit_recursion other than default

parameter iflags

default = no extra flags

parameter flags

default = ignored

parameter chtables

default = builtin char tables

returns

the regular expression.

For detailed documentation on how you can specify PERL-style regular expressions (= patterns), please consult the PCRE-documentation ("man pcrepattern") or PERL-manuals.

see http://www.perl.com


val regexp_or : ?⁠study:bool -> ?⁠limit:int -> ?⁠limit_recursion:int -> ?⁠iflags:icflag -> ?⁠flags:cflag list -> ?⁠chtables:chtables -> string list -> regexp

regexp_or ?study ?limit ?limit_recursion ?iflags ?flags ?chtables patterns works like regexp, but combines patterns as alternatives (or-patterns) into one regular expression.
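A minimal sketch of compiling patterns with regexp and regexp_or, assuming the pcre findlib package is linked. The patterns, the limit value and the helper name try_match are illustrative assumptions only.

```ocaml
(* Compile with a list of compilation flags. *)
let colour = Pcre.regexp ~flags:[`CASELESS] "colou?r"

(* Combine several patterns as alternatives. *)
let any_fruit = Pcre.regexp_or ["apple"; "banana"; "cherry"]

(* Compile with a lowered match limit; matching may then raise
   Pcre.Error Pcre.MatchLimit on subjects that need heavy backtracking. *)
let guarded = Pcre.regexp ~limit:1000 "(a+)+$"

(* Hypothetical helper: None means the limit was exceeded. *)
let try_match (subj : string) : bool option =
  try Some (Pcre.pmatch ~rex:guarded subj)
  with Pcre.Error Pcre.MatchLimit -> None

let () =
  Printf.printf "%b %b\n"
    (Pcre.pmatch ~rex:colour "COLOR")
    (Pcre.pmatch ~rex:any_fruit "one banana")
```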

val quote : string -> string

Subpattern extraction

type substrings

Information on substrings after pattern matching

val get_subject : substrings -> string
val num_of_subs : substrings -> int
val get_substring : substrings -> int -> string
val get_substring_ofs : substrings -> int -> int * int
val get_substrings : ?⁠full_match:bool -> substrings -> string array
val get_opt_substrings : ?⁠full_match:bool -> substrings -> string option array
val get_named_substring : regexp -> string -> substrings -> string
val get_named_substring_ofs : regexp -> string -> substrings -> int * int
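The accessors above carry no description in this listing, so the sketch below relies on their signatures and on the usual PCRE convention that substring 0 is the whole match. It assumes the pcre findlib package; the pattern and subject are made up.

```ocaml
(* A pattern with two named groups (Python-style (?P<name>...) syntax). *)
let rex = Pcre.regexp "(?P<key>\\w+)=(?P<value>\\d+)"

(* exec returns the substrings of the first match; it raises Not_found
   when there is no match. *)
let subs = Pcre.exec ~rex "set width=80 now"

let whole = Pcre.get_substring subs 0                 (* "width=80" *)
let key = Pcre.get_named_substring rex "key" subs     (* "width" *)
let value = Pcre.get_substring subs 2                 (* "80" *)

(* Start and end offsets of the whole match within the subject. *)
let (start_ofs, end_ofs) = Pcre.get_substring_ofs subs 0   (* (4, 12) *)
```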

Callouts

type callout_data = {
callout_number : int;

Callout number

substrings : substrings;

Substrings matched so far

start_match : int;

Subject start offset of current match attempt

current_position : int;

Subject offset of current match pointer

capture_top : int;

Number of the highest captured substring so far

capture_last : int;

Number of the most recently captured substring

pattern_position : int;

Offset of next match item in pattern string

next_item_length : int;

Length of next match item in pattern string

}
type callout = callout_data -> unit

Type of callout functions

Callouts are referred to in patterns as "(?Cn)" where "n" is a callout_number ranging from 0 to 255. Substrings captured so far are accessible as usual via substrings. You will have to consider capture_top and capture_last to know about the current state of valid substrings.

By raising exception Backtrack within a callout function, the user can force the pattern matching engine to backtrack to other possible solutions. Other exceptions will terminate matching immediately and return control to OCaml.
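The sketch below illustrates both behaviours described above: a callout that merely observes the match, and one that rejects every candidate by raising Backtrack. It assumes the pcre findlib package; the pattern, subject and function names are illustrative.

```ocaml
let callout_hits = ref 0

(* Observing callout: counts how often callout number 1 is reached
   and never forces backtracking. *)
let count_callout (data : Pcre.callout_data) : unit =
  if data.Pcre.callout_number = 1 then incr callout_hits

(* "(?C1)" marks the point in the pattern where the callout is invoked. *)
let found = Pcre.pmatch ~pat:"\\d+(?C1)" ~callout:count_callout "abc 123"

(* Vetoing callout: every path to a successful match passes through the
   callout, so raising Backtrack each time makes the match fail. *)
let vetoed =
  Pcre.pmatch ~pat:"\\d+(?C1)" ~callout:(fun _ -> raise Pcre.Backtrack)
    "abc 123"
```

Here found is true and vetoed is false; the exact number of callout invocations depends on the matching engine and is not relied upon.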

Matching of patterns and subpattern extraction

val pcre_exec : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠callout:callout -> string -> int array
val pcre_dfa_exec : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠callout:callout -> ?⁠workspace:int array -> string -> int array

pcre_dfa_exec ?iflags ?flags ?rex ?pat ?pos ?callout ?workspace subj invokes the "alternative" DFA matching function.

returns

an array of offsets that describe the position of matched subpatterns in the string subj starting at position pos with pattern pat when given, regular expression rex otherwise. The array also contains additional workspace needed by the match engine. Uses flags when given, the precompiled iflags otherwise. Requires a sufficiently-large workspace array. Callouts are handled by callout.

Note that the returned array of offsets is quite different from those returned by pcre_exec et al. The motivating use case for the DFA match function is to be able to restart a partial match with N additional input segments. Because the match function/workspace does not store segments seen previously, the offsets returned when a match completes will refer only to the matching portion of the last subject string provided. Thus, returned offsets from this function should not be used to support extracting captured submatches. If you need to capture submatches from a series of inputs incrementally matched with this function, you'll need to concatenate those inputs that yield a successful match here and re-run the same pattern against that single subject string.

Aside from an absolute minimum of 20, PCRE does not provide any guidance regarding the size of workspace array needed by any given pattern. Therefore, it is wise to appropriately handle the possible WorkspaceSize error. If raised, you can allocate a new, larger workspace array and begin the DFA matching process again.

parameter iflags

default = no extra flags

parameter flags

default = ignored

parameter rex

default = matches whitespace

parameter pat

default = ignored

parameter pos

default = 0

parameter callout

default = ignore callouts

parameter workspace

default = fresh array of length 20

raises Not_found

if the pattern match has failed

raises Error

Partial if the pattern has matched partially; a subsequent exec call with the same pattern and workspace (adding the DFA_RESTART flag) can be made to either further advance or complete the partial match.

raises Error

WorkspaceSize if the workspace array is too small to accommodate the DFA state required by the supplied pattern
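The advice on handling WorkspaceSize can be sketched as a retry loop that doubles the workspace. This is only one possible strategy; the helper name dfa_offsets, the starting size and the upper bound are assumptions made for the example, and the pcre findlib package is assumed to be linked.

```ocaml
(* Hypothetical helper: run the DFA matcher, growing the workspace when
   it turns out to be too small. Returns None when there is no match. *)
let dfa_offsets (rex : Pcre.regexp) (subj : string) : int array option =
  let max_size = 1 lsl 16 in   (* arbitrary upper bound for this sketch *)
  let rec attempt size =
    let workspace = Array.make size 0 in
    try Some (Pcre.pcre_dfa_exec ~rex ~workspace subj) with
    | Not_found -> None
    | Pcre.Error Pcre.WorkspaceSize when size < max_size ->
        (* Start over with a fresh, larger workspace. *)
        attempt (2 * size)
  in
  attempt 20   (* 20 is the documented absolute minimum *)
```

Once the bound is reached, the WorkspaceSize error is no longer caught and propagates to the caller. Partial matches (Error Partial) are not handled in this sketch.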

val exec : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠callout:callout -> string -> substrings
val exec_all : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠callout:callout -> string -> substrings array
val next_match : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠callout:callout -> substrings -> substrings
val extract : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠full_match:bool -> ?⁠callout:callout -> string -> string array

extract ?iflags ?flags ?rex ?pat ?pos ?full_match ?callout subj

returns

the array of substrings that match subj starting at position pos, using pattern pat when given, regular expression rex otherwise. Uses flags when given, the precompiled iflags otherwise. It includes the full match at index 0 when full_match is true, the captured substrings only when it is false. Callouts are handled by callout. If a subpattern did not capture a substring, the empty string is returned in the corresponding position instead.

parameter iflags

default = no extra flags

parameter flags

default = ignored

parameter rex

default = matches whitespace

parameter pat

default = ignored

parameter pos

default = 0

parameter full_match

default = true

parameter callout

default = ignore callouts

raises Not_found

if pattern does not match.
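A short usage sketch for extract, assuming the pcre findlib package; the pattern, subject and the wrapper name extract_safe are made up for illustration.

```ocaml
(* Full match at index 0, followed by each captured group. *)
let with_full = Pcre.extract ~pat:"(\\d+)-(\\d+)" "range 10-20"
(* with_full = [| "10-20"; "10"; "20" |] *)

(* Captured groups only. *)
let groups_only =
  Pcre.extract ~pat:"(\\d+)-(\\d+)" ~full_match:false "range 10-20"
(* groups_only = [| "10"; "20" |] *)

(* Hypothetical wrapper that turns Not_found into None. *)
let extract_safe ~(pat : string) (subj : string) : string array option =
  try Some (Pcre.extract ~pat subj) with Not_found -> None
```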

val extract_opt : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠full_match:bool -> ?⁠callout:callout -> string -> string option array

extract_opt ?iflags ?flags ?rex ?pat ?pos ?full_match ?callout subj

returns

the array of optional substrings that match subj starting at position pos, using pattern pat when given, regular expression rex otherwise. Uses flags when given, the precompiled iflags otherwise. It includes Some full_match_str at index 0 when full_match is true, Some captured-substrings only when it is false. Callouts are handled by callout. If a subpattern did not capture a substring, None is returned in the corresponding position instead.

parameter iflags

default = no extra flags

parameter flags

default = ignored

parameter rex

default = matches whitespace

parameter pat

default = ignored

parameter pos

default = 0

parameter full_match

default = true

parameter callout

default = ignore callouts

raises Not_found

if pattern does not match.
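The difference between extract_opt and extract shows up when a group does not take part in the match. A minimal sketch, assuming the pcre findlib package and an illustrative pattern:

```ocaml
(* Only the second alternative matches, so group 1 captures nothing. *)
let opt = Pcre.extract_opt ~pat:"(a)|(b)" "b"
(* opt = [| Some "b"; None; Some "b" |] *)

(* extract reports the same unset group as an empty string. *)
let plain = Pcre.extract ~pat:"(a)|(b)" "b"
(* plain = [| "b"; ""; "b" |] *)
```

extract_opt is the safer choice when an empty capture and a missing capture need to be told apart.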

val extract_all : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠full_match:bool -> ?⁠callout:callout -> string -> string array array

extract_all ?iflags ?flags ?rex ?pat ?pos ?full_match ?callout subj

returns

an array of arrays of all matching substrings that match subj starting at position pos, using pattern pat when given, regular expression rex otherwise. Uses flags when given, the precompiled iflags otherwise. It includes the full match at index 0 of the extracted string arrays when full_match is true, the captured substrings only when it is false. Callouts are handled by callout.

parameter iflags

default = no extra flags

parameter flags

default = ignored

parameter rex

default = matches whitespace

parameter pat

default = ignored

parameter pos

default = 0

parameter full_match

default = true

parameter callout

default = ignore callouts

raises Not_found

if pattern does not match.
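A usage sketch for extract_all, assuming the pcre findlib package; the key=value pattern and the name as_assoc are illustrative only.

```ocaml
(* One inner array per match: full match first, then the groups. *)
let pairs = Pcre.extract_all ~pat:"(\\w)=(\\d)" "a=1 b=2"
(* pairs = [| [| "a=1"; "a"; "1" |]; [| "b=2"; "b"; "2" |] |] *)

(* Turn the matches into an association list of (key, value). *)
let as_assoc : (string * string) list =
  Array.to_list (Array.map (fun m -> (m.(1), m.(2))) pairs)
```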

val extract_all_opt : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠full_match:bool -> ?⁠callout:callout -> string -> string option array array

extract_all_opt ?iflags ?flags ?rex ?pat ?pos ?full_match ?callout subj

returns

an array of arrays of all optional matching substrings that match subj starting at position pos, using pattern pat when given, regular expression rex otherwise. Uses flags when given, the precompiled iflags otherwise. It includes Some full_match_str at index 0 of the extracted string arrays when full_match is true, Some captured_substrings only when it is false. Callouts are handled by callout. If a subpattern did not capture a substring, None is returned in the corresponding position instead.

parameter iflags

default = no extra flags

parameter flags

default = ignored

parameter rex

default = matches whitespace

parameter pat

default = ignored

parameter pos

default = 0

parameter full_match

default = true

parameter callout

default = ignore callouts

raises Not_found

if pattern does not match.

val pmatch : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠callout:callout -> string -> bool

String substitution

type substitution

Information on substitution patterns

val subst : string -> substitution

subst str converts the string str representing a substitution pattern to the internal representation

The contents of the substitution string str can be normal text mixed with any of the following (mostly as in PERL):

  • $[0-9]+ - a "$" immediately followed by an arbitrary number. "$0" stands for the name of the executable, any other number for the n-th backreference.
  • $& - the whole matched pattern
  • $` - the text before the match
  • $' - the text after the match
  • $+ - the last group that matched
  • $$ - a single "$"
  • $! - delimiter which does not appear in the substitution. Can be used to separate "$[0-9]+" from an immediately following other number.

val replace : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠itempl:substitution -> ?⁠templ:string -> ?⁠callout:callout -> string -> string

replace ?iflags ?flags ?rex ?pat ?pos ?itempl ?templ ?callout subj replaces all substrings of subj matching pattern pat when given, regular expression rex otherwise, starting at position pos with the substitution string templ when given, itempl otherwise. Uses flags when given, the precompiled iflags otherwise. Callouts are handled by callout.

parameter iflags

default = no extra flags

parameter flags

default = ignored

parameter rex

default = matches whitespace

parameter pat

default = ignored

parameter pos

default = 0

parameter itempl

default = empty string

parameter templ

default = ignored

parameter callout

default = ignore callouts

raises Failure

if there are backreferences to nonexistent subpatterns.
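A minimal sketch of replace with a substitution string and with a precompiled substitution, assuming the pcre findlib package; patterns and subjects are made up for the example.

```ocaml
(* Backreferences in the template: $1 and $2 are the captured groups. *)
let swapped =
  Pcre.replace ~pat:"(\\w+)@(\\w+)" ~templ:"$2 at $1" "user@host"
(* swapped = "host at user" *)

(* Convert the template once with subst and reuse it via ~itempl.
   $& stands for the whole match. *)
let bracket : Pcre.substitution = Pcre.subst "<$&>"
let bracketed = Pcre.replace ~pat:"\\d+" ~itempl:bracket "a1b22"
(* bracketed = "a<1>b<22>" *)
```

Precompiling with subst is mainly worthwhile when the same template is applied many times.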

val qreplace : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠templ:string -> ?⁠callout:callout -> string -> string

qreplace ?iflags ?flags ?rex ?pat ?pos ?templ ?callout subj replaces all substrings of subj matching pattern pat when given, regular expression rex otherwise, starting at position pos with the string templ. Uses flags when given, the precompiled iflags otherwise. Callouts are handled by callout.

parameter iflags

default = no extra flags

parameter flags

default = ignored

parameter rex

default = matches whitespace

parameter pat

default = ignored

parameter pos

default = 0

parameter templ

default = ignored

parameter callout

default = ignore callouts
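Unlike replace, qreplace inserts templ as a plain string. A short sketch of that difference, assuming the pcre findlib package:

```ocaml
(* The "$1" in the template is inserted literally, not interpreted
   as a backreference. *)
let literal = Pcre.qreplace ~pat:"\\d+" ~templ:"$1" "a1b22"
(* literal = "a$1b$1" *)
```

This makes qreplace a reasonable choice when the replacement text comes from untrusted or arbitrary input.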

val substitute_substrings : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠callout:callout -> subst:(substrings -> string) -> string -> string

substitute_substrings ?iflags ?flags ?rex ?pat ?pos ?callout ~subst subj replaces all substrings of subj matching pattern pat when given, regular expression rex otherwise, starting at position pos with the result of function subst applied to the substrings of the match. Uses flags when given, the precompiled iflags otherwise. Callouts are handled by callout.

parameter iflags

default = no extra flags

parameter flags

default = ignored

parameter rex

default = matches whitespace

parameter pat

default = ignored

parameter pos

default = 0

parameter callout

default = ignore callouts
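A usage sketch for substitute_substrings, assuming the pcre findlib package; the function name swap and the key=value pattern are illustrative.

```ocaml
(* Build the replacement from the captured groups of each match. *)
let swap (subs : Pcre.substrings) : string =
  Pcre.get_substring subs 2 ^ "=" ^ Pcre.get_substring subs 1

let swapped_all =
  Pcre.substitute_substrings ~pat:"(\\w+)=(\\w+)" ~subst:swap "a=1 b=2"
(* swapped_all = "1=a 2=b" *)
```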

val substitute : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠callout:callout -> subst:(string -> string) -> string -> string

substitute ?iflags ?flags ?rex ?pat ?pos ?callout ~subst subj replaces all substrings of subj matching pattern pat when given, regular expression rex otherwise, starting at position pos with the result of function subst applied to the match. Uses flags when given, the precompiled iflags otherwise. Callouts are handled by callout.

parameter iflags

default = no extra flags

parameter flags

default = ignored

parameter rex

default = matches whitespace

parameter pat

default = ignored

parameter pos

default = 0

parameter callout

default = ignore callouts
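When only the matched text is needed, substitute takes a plain string function. A minimal sketch, assuming the pcre findlib package:

```ocaml
(* Apply a string function to every match. *)
let shouted =
  Pcre.substitute ~pat:"\\w+" ~subst:String.uppercase_ascii "hello world"
(* shouted = "HELLO WORLD" *)
```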

val replace_first : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠itempl:substitution -> ?⁠templ:string -> ?⁠callout:callout -> string -> string

replace_first ?iflags ?flags ?rex ?pat ?pos ?itempl ?templ ?callout subj replaces the first substring of subj matching pattern pat when given, regular expression rex otherwise, starting at position pos with the substitution string templ when given, itempl otherwise. Uses flags when given, the precompiled iflags otherwise. Callouts are handled by callout.

parameter iflags

default = no extra flags

parameter flags

default = ignored

parameter rex

default = matches whitespace

parameter pat

default = ignored

parameter pos

default = 0

parameter itempl

default = empty string

parameter templ

default = ignored

parameter callout

default = ignore callouts

raises Failure

if there are backreferences to nonexistent subpatterns.
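The _first variants stop after the first match. A short comparison with replace, assuming the pcre findlib package and a made-up subject:

```ocaml
(* Only the first "o" is replaced. *)
let first_only = Pcre.replace_first ~pat:"o" ~templ:"0" "foo boo"
(* first_only = "f0o boo" *)

(* replace rewrites every occurrence. *)
let all = Pcre.replace ~pat:"o" ~templ:"0" "foo boo"
(* all = "f00 b00" *)
```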

val qreplace_first : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠templ:string -> ?⁠callout:callout -> string -> string

qreplace_first ?iflags ?flags ?rex ?pat ?pos ?templ ?callout subj replaces the first substring of subj matching pattern pat when given, regular expression rex otherwise, starting at position pos with the string templ. Uses flags when given, the precompiled iflags otherwise. Callouts are handled by callout.

parameter iflags

default = no extra flags

parameter flags

default = ignored

parameter rex

default = matches whitespace

parameter pat

default = ignored

parameter pos

default = 0

parameter templ

default = ignored

parameter callout

default = ignore callouts

val substitute_substrings_first : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠callout:callout -> subst:(substrings -> string) -> string -> string

substitute_substrings_first ?iflags ?flags ?rex ?pat ?pos ?callout ~subst subj replaces the first substring of subj matching pattern pat when given, regular expression rex otherwise, starting at position pos with the result of function subst applied to the substrings of the match. Uses flags when given, the precompiled iflags otherwise. Callouts are handled by callout.

parameter iflags

default = no extra flags

parameter flags

default = ignored

parameter rex

default = matches whitespace

parameter pat

default = ignored

parameter pos

default = 0

parameter callout

default = ignore callouts

val substitute_first : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠callout:callout -> subst:(string -> string) -> string -> string

substitute_first ?iflags ?flags ?rex ?pat ?pos ?callout ~subst subj replaces the first substring of subj matching pattern pat when given, regular expression rex otherwise, starting at position pos with the result of function subst applied to the match. Uses flags when given, the precompiled iflags otherwise. Callouts are handled by callout.

parameter iflags

default = no extra flags

parameter flags

default = ignored

parameter rex

default = matches whitespace

parameter pat

default = ignored

parameter pos

default = 0

parameter callout

default = ignore callouts

Splitting

val split : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠max:int -> ?⁠callout:callout -> string -> string list

split ?iflags ?flags ?rex ?pat ?pos ?max ?callout subj splits subj into a list of at most max strings, using as delimiter pattern pat when given, regular expression rex otherwise, starting at position pos. Uses flags when given, the precompiled iflags otherwise. If max is zero, trailing empty fields are stripped. If it is negative, it is treated as arbitrarily large. If neither pat nor rex is specified, leading whitespace will be stripped! Should behave exactly as in PERL. Callouts are handled by callout.

parameter iflags

default = no extra flags

parameter flags

default = ignored

parameter rex

default = matches whitespace

parameter pat

default = ignored

parameter pos

default = 0

parameter max

default = 0

parameter callout

default = ignore callouts
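The rules for max and for the default delimiter can be illustrated with a few calls. This is a sketch that assumes the pcre findlib package and Perl-like behaviour as described above; the subjects are made up.

```ocaml
(* max defaults to 0, so trailing empty fields are stripped,
   while inner empty fields are kept. *)
let fields = Pcre.split ~pat:"," "a,b,,c,,"
(* fields = ["a"; "b"; ""; "c"] *)

(* At most two strings: the remainder stays in the last one. *)
let limited = Pcre.split ~pat:"," ~max:2 "a,b,c"
(* limited = ["a"; "b,c"] *)

(* Neither pat nor rex: split on whitespace, leading whitespace stripped. *)
let words = Pcre.split "  the quick  fox"
(* words = ["the"; "quick"; "fox"] *)
```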

val asplit : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠max:int -> ?⁠callout:callout -> string -> string array
type split_result =
| Text of string

Text part of split string

| Delim of string

Delimiter part of split string

| Group of int * string

Subgroup of matched delimiter (subgroup_nr, subgroup_str)

| NoGroup

Unmatched subgroup

Result of a Pcre.full_split

val full_split : ?⁠iflags:irflag -> ?⁠flags:rflag list -> ?⁠rex:regexp -> ?⁠pat:string -> ?⁠pos:int -> ?⁠max:int -> ?⁠callout:callout -> string -> split_result list

full_split ?iflags ?flags ?rex ?pat ?pos ?max ?callout subj splits subj into a list of at most max elements of type "split_result", using as delimiter pattern pat when given, regular expression rex otherwise, starting at position pos. Uses flags when given, the precompiled iflags otherwise. If max is zero, trailing empty fields are stripped. If it is negative, it is treated as arbitrarily large. Should behave exactly as in PERL. Callouts are handled by callout.

parameter iflags

default = no extra flags

parameter flags

default = ignored

parameter rex

default = matches whitespace

parameter pat

default = ignored

parameter pos

default = 0

parameter max

default = 0

parameter callout

default = ignore callouts
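A minimal sketch of consuming the result of full_split by pattern matching on split_result. It assumes the pcre findlib package; the helper name texts_and_delims is hypothetical.

```ocaml
(* Hypothetical helper: separate the text parts from the delimiters,
   ignoring subgroup information. *)
let texts_and_delims (parts : Pcre.split_result list)
    : string list * string list =
  List.fold_right
    (fun part (texts, delims) ->
      match part with
      | Pcre.Text s -> (s :: texts, delims)
      | Pcre.Delim s -> (texts, s :: delims)
      | Pcre.Group (_, _) | Pcre.NoGroup -> (texts, delims))
    parts ([], [])

let (texts, delims) = texts_and_delims (Pcre.full_split ~pat:"," "a,b")
(* texts = ["a"; "b"], delims = [","] *)
```

Group and NoGroup elements only appear when the delimiter pattern contains capturing groups, which this example avoids.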

Additional convenience functions

val foreach_line : ?⁠ic:Stdlib.in_channel -> (string -> unit) -> unit

foreach_line ?ic f applies f to each line in the input channel ic until the end-of-file is reached.

parameter ic

default = stdin

val foreach_file : string list -> (string -> Stdlib.in_channel -> unit) -> unit

foreach_file filenames f opens each file in the list filenames for input and applies f to each filename and the corresponding channel. Channels are closed after each operation (even when exceptions occur - they get reraised afterwards!).
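The two functions combine naturally. The sketch below counts the lines of each file; the helper name count_lines is hypothetical, and the pcre findlib package is assumed to be linked.

```ocaml
(* Hypothetical helper: return (filename, number of lines) per file. *)
let count_lines (filenames : string list) : (string * int) list =
  let counts = ref [] in
  Pcre.foreach_file filenames (fun name ic ->
    let n = ref 0 in
    (* Read the channel opened by foreach_file line by line. *)
    Pcre.foreach_line ~ic (fun _line -> incr n);
    counts := (name, !n) :: !counts);
  List.rev !counts
```

Because foreach_file closes each channel itself, the callback does not need to call close_in.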

UNSAFE STUFF - USE WITH CAUTION!

val unsafe_pcre_exec : irflag -> regexp -> pos:int -> subj_start:int -> subj:string -> int array -> callout option -> unit

unsafe_pcre_exec flags rex ~pos ~subj_start ~subj offset_vector callout. You should read the C-source to know what happens. If you do not understand it - don't use this function!

val make_ovector : regexp -> int * int array

make_ovector regexp calculates the tuple (subgroups2, ovector) which is the number of subgroup offsets and the offset array.

val unsafe_pcre_dfa_exec : irflag -> regexp -> pos:int -> subj_start:int -> subj:string -> int array -> callout option -> workspace:int array -> unit

unsafe_pcre_dfa_exec flags rex ~pos ~subj_start ~subj offset_vector callout ~workspace. You should read the C-source to know what happens. If you do not understand it - don't use this function!