Regex for web extraction: positive lookahead issues
Below is an example of the data I'm using. I've read a number of posts on this topic and have been trying for a while on regex101.
botinfo[-]: source ip:[10.1.1.100] target host:[centos70-1] target os:[centos 7.0] description:[http connection request] details:[10.1.1.101 - - [28/may/2013:12:24:08 +0000] "get /math/html.mli http/1.0" 404 3567 "-" "-" ] phase: [access] service:[web]
The goal is to have two capture groups: one for the tag (e.g. source ip, target host, description, etc.) and one for the content contained in the outermost square brackets.
It's the "outermost" part that's getting me, because the content of the details tag has square brackets in it.
Here is my current progress on the regex, using the /g flag:
\s?([^:]+):\[(.*?(?=\]\s.*?:\[))\]
This handles everything except the edge case (it's more complex than needed because I've been fiddling with it, trying to get the edge case to work).
My current lookahead, (\]\s.*?:\[), is meant, at a high level, to match the closing bracket followed by the next tag. The issue is that it fails at the last match, because there is no following tag.
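To make the failure at the last match concrete, here is a minimal Python sketch. The shortened `sample` string is a hypothetical stand-in with no nested brackets, and the `tweaked` pattern is an assumed adjustment that is not taken from the post: it simply lets the lookahead also accept a closing bracket at the end of the input. It only addresses the missing final tag and says nothing about nested brackets.

```python
import re

# Shortened, hypothetical sample with no nested brackets,
# used only to isolate the "last field" problem.
sample = "phase:[access] service:[web]"

# Pattern from the question: the lookahead demands a following " tag:[".
original = re.compile(r"\s?([^:]+):\[(.*?(?=\]\s.*?:\[))\]")

# Assumed tweak (not from the post): also accept "]" at the end of the input.
tweaked = re.compile(r"\s?([^:]+):\[(.*?(?=\]\s.*?:\[|\]\s*$))\]")

original_pairs = original.findall(sample)
tweaked_pairs = tweaked.findall(sample)

print(original_pairs)  # the final "service" field is absent
print(tweaked_pairs)   # both fields are present
```

With the original pattern, nothing follows `service:[web]`, so the lookahead can never succeed for that field and it is dropped.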
Edit: an example of successful output was requested. Using the data provided, the goal is to have two capture groups resulting in these pairs:
Match 1
1. `source ip`
2. `10.1.1.100`
Match 2
1. `target host`
2. `centos70-1`
Match 3
1. `target os`
2. `centos 7.0`
Match 4
1. `description`
2. `http connection request`
Match 5
1. `details`
2. `10.1.1.101 - - [28/may/2013:12:24:08 +0000] "get /math/html.mli http/1.0" 404 3567 "-" "-" `
Match 6
1. `phase`
2. `access`
Match 7
1. `service`
2. `web`
Heavily inspired by this answer on nested patterns, I ended up with the following regex (demo here):
\s*([\w ]+):\s*(\[((?>[^[\]]+|(?2))*)\])
The main idea is to repeat the match of brackets as many times as possible (if an opening or closing bracket is found, recurse with (?2)). The data you're looking for is in fact in the first and third capture groups; the second group captures the brackets only so that the recursion can happen properly.
Details on the regex:
\s*
Match (and discard) spaces before the field name.
([\w ]+):
Capture the field name (everything before the :).
\s*
Again discard spaces, this time before the field content.
(\[
Start of the second capture group; match a literal [.
((?>[^[\]]+
Start of the third capture group and of an atomic group (blocking backtracking to avoid an infinite loop), which should match anything that is not a bracket.
|(?2))
If a bracket is found, try rematching the whole second group.
*)
Repeat the atomic group with its alternation zero or more times to handle nested brackets; end of the third capture group.
\])
The last bracket to match, and the end of the second capture group, which is the group used in the alternation inside the atomic group.
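Python's built-in `re` module does not support (?2)-style recursion, so the pattern above cannot be pasted into it as-is. As a rough sketch of the same "outermost brackets" idea, the hypothetical helper `parse_fields` below swaps the recursion for an explicit depth counter. The helper name, the `TAG` pattern, and the approach are illustrative assumptions, not part of the answer.

```python
import re

# Matches the field name and its opening bracket, mirroring the
# `\s*([\w ]+):\s*` prefix of the recursive pattern.
TAG = re.compile(r"\s*([\w ]+):\s*\[")

def parse_fields(line):
    """Return (tag, content) pairs, where content spans the outermost brackets."""
    pairs = []
    pos = 0
    while True:
        m = TAG.search(line, pos)
        if m is None:
            break
        depth = 1          # we are just past the opening bracket
        i = m.end()
        while i < len(line) and depth:
            if line[i] == "[":
                depth += 1  # nested opening bracket
            elif line[i] == "]":
                depth -= 1  # closes the innermost open bracket
            i += 1
        if depth:           # unbalanced brackets: stop rather than guess
            break
        pairs.append((m.group(1), line[m.end():i - 1]))
        pos = i             # resume after the matching closing bracket
    return pairs

line = ('botinfo[-]: source ip:[10.1.1.100] target host:[centos70-1] '
        'target os:[centos 7.0] description:[http connection request] '
        'details:[10.1.1.101 - - [28/may/2013:12:24:08 +0000] '
        '"get /math/html.mli http/1.0" 404 3567 "-" "-" ] '
        'phase: [access] service:[web]')

pairs = parse_fields(line)
for tag, content in pairs:
    print(repr(tag), repr(content))
```

Because the scan resumes only after the matching closing bracket, the bracketed timestamp inside details is never mistaken for the end of the field.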