Regex for web extraction. Positive lookahead issues
Below is an example of the data I'm using. I've read a number of posts involving this topic and have tried for a while on regex101.
    botinfo[-]: source ip:[10.1.1.100] target host:[centos70-1] target os:[centos 7.0] description:[http connection request] details:[10.1.1.101 - - [28/may/2013:12:24:08 +0000] "get /math/html.mli http/1.0" 404 3567 "-" "-" ] phase: [access] service:[web]

The goal is to have 2 capture groups: one for the tag (e.g. source ip, target host, description, etc.) and one for the content contained in the outermost square brackets.
It's the "outermost" part that's getting me, because the content of the details tag has square brackets in it.
Here is my current progress on said regex, using the /g flag:
    \s?([^:]+):\[(.*?(?=\]\s.*?:\[))\]

This handles everything except the edge case (it's more complex than needed because I've been fiddling with it, trying to get the edge case to work).
My current lookahead, (\]\s.*?:\[), is at a high level meant to match the closing bracket followed by the start of the next tag. The issue is that it fails at the last match, because there is no following tag.
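To make the failure concrete, here is a minimal Python sketch that runs the pattern from the question over the sample line. It assumes Python's standard `re` module, where `finditer` stands in for the /g flag. The names `SAMPLE`, `QUESTION_PATTERN` and `question_pairs` are made up for this illustration and are not part of the original post.

```python
import re

# Sample log line from the question (assumed to be a single line of input).
SAMPLE = (
    'botinfo[-]: source ip:[10.1.1.100] target host:[centos70-1] '
    'target os:[centos 7.0] description:[http connection request] '
    'details:[10.1.1.101 - - [28/may/2013:12:24:08 +0000] '
    '"get /math/html.mli http/1.0" 404 3567 "-" "-" ] '
    'phase: [access] service:[web]'
)

# The pattern from the question, unchanged.
QUESTION_PATTERN = re.compile(r"\s?([^:]+):\[(.*?(?=\]\s.*?:\[))\]")

# finditer returns every non-overlapping match, like the /g flag.
question_pairs = [(m.group(1), m.group(2))
                  for m in QUESTION_PATTERN.finditer(SAMPLE)]

for tag, value in question_pairs:
    print(repr(tag), "->", repr(value))

# The lookahead needs "] <something>:[" after the value, so the final
# value "web" (followed only by "]" and end of input) is never captured.
values = [value for _, value in question_pairs]
print("web captured:", "web" in values)
```

The first pair comes out as expected, but the last value can never match, because the lookahead requires another `:[` after the closing bracket.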
Edit: an example of successful output was requested. Using the data provided, the goal is to have 2 capture groups resulting in these pairs:
Match 1
1. `source ip`
2. `10.1.1.100`

Match 2
1. `target host`
2. `centos70-1`

Match 3
1. `target os`
2. `centos 7.0`

Match 4
1. `description`
2. `http connection request`

Match 5
1. `details`
2. `10.1.1.101 - - [28/may/2013:12:24:08 +0000] "get /math/html.mli http/1.0" 404 3567 "-" "-" `

Match 6
1. `phase`
2. `access`

Match 7
1. `service`
2. `web`
Heavily inspired by this answer on nested patterns, I ended up with this regex (demo here):
    \s*([\w ]+):\s*(\[((?>[^[\]]+|(?2))*)\])

The main idea is to repeat the match of brackets as many times as possible (whenever a nested bracket is found, recurse via (?2)). The data you're looking for is in fact in the first and third capture groups; the second one captures the brackets so that the recursion can happen properly.
Details on the regex:

- `\s*` : match (and discard) spaces before the field
- `([\w ]+):` : capture the field name (everything before the `:`)
- `\s*` : again discard spaces, this time before the field's value
- `(\[` : start of the second capture group, and match a literal `[`
- `((?>[^[\]]+` : start of the third capture group and of an atomic group (blocking backtracking to avoid an infinite loop), which should match anything that is not a bracket
- `|(?2))` : if a bracket is found, try rematching the whole second group
- `*)` : repeat 0 or more times the atomic group, with its alternation for nested brackets, and end the third capture group
- `\])` : our last bracket to match, and end of the second capture group, which is the one used in the alternation of the atomic group
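Python's standard `re` module does not support subroutine calls such as (?2), so the regex above cannot be used there as written. As a rough equivalent, the same balanced-bracket idea can be sketched with an explicit depth counter in place of the recursion. This is a swapped-in technique, not the regex from the answer, and the names `FIELD_START`, `extract_fields` and `SAMPLE` are illustrative only.

```python
import re

# Sample log line from the question (assumed to be a single line of input).
SAMPLE = (
    'botinfo[-]: source ip:[10.1.1.100] target host:[centos70-1] '
    'target os:[centos 7.0] description:[http connection request] '
    'details:[10.1.1.101 - - [28/may/2013:12:24:08 +0000] '
    '"get /math/html.mli http/1.0" 404 3567 "-" "-" ] '
    'phase: [access] service:[web]'
)

# Field name, ':', optional spaces, then the opening bracket.
# Mirrors the "\s*([\w ]+):\s*\[" prefix of the answer's regex.
FIELD_START = re.compile(r"([\w ]+):\s*\[")


def extract_fields(line):
    """Return (tag, content) pairs, where content is everything inside
    the outermost pair of square brackets following each tag."""
    pairs = []
    pos = 0
    while True:
        m = FIELD_START.search(line, pos)
        if m is None:
            break
        depth = 1          # we are just past the opening '['
        i = m.end()
        while i < len(line) and depth > 0:
            if line[i] == "[":
                depth += 1   # nested opening bracket
            elif line[i] == "]":
                depth -= 1   # closes the innermost open bracket
            i += 1
        if depth != 0:
            break          # unbalanced brackets: stop rather than guess
        # i is one past the matching ']', so the content ends at i - 1.
        pairs.append((m.group(1).strip(), line[m.end():i - 1]))
        pos = i
    return pairs


pairs = extract_fields(SAMPLE)
for tag, content in pairs:
    print(repr(tag), "->", repr(content))
```

On the sample line this yields the seven tag/content pairs listed in the question, with the `details` content keeping its inner brackets. In an engine that supports recursion (PCRE, for instance), the single regex from the answer remains the more compact option.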