Regex for web extraction: positive lookahead issues


Below is an example of the data I'm using. I've read a number of posts involving this topic, and have tried for a while on regex101.

botinfo[-]: source ip:[10.1.1.100] target host:[centos70-1] target os:[centos 7.0] description:[http connection request] details:[10.1.1.101 - - [28/may/2013:12:24:08 +0000] "get /math/html.mli http/1.0" 404 3567 "-" "-" ] phase: [access] service:[web] 

The goal is to have 2 capture groups: one for the tag (e.g. source ip, target host, description, etc.) and one for the content contained in the outermost square brackets.

It's the "outermost" that's getting me, because the content of the details tag has square brackets in it.

Here is the current progress on said regex, using the /g flag:

\s?([^:]+):\[(.*?(?=\]\s.*?:\[))\] 

This handles everything except the edge case (it's more complex than needed because I've been fiddling with it trying to get the edge case to work).

My current lookahead (\]\s.*?:\[), at a high level, is meant to match the closing bracket and the start of the next tag. The issue is that it fails at the last match, because there is no following tag.
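
One possible way around this is to let the lookahead accept the end of the line as an alternative to a following tag. The sketch below uses Python rather than regex101, and the pattern is an adaptation, not the original regex: the lookahead is moved after the closing bracket, \s* is allowed after the colon (the sample has "phase: [access]"), and the names LINE and PATTERN are made up for illustration. It assumes no bracketed value contains text that itself looks like "] tag:[".

```python
import re

# Sample line from the question (assumed to be a single line of input).
LINE = ('botinfo[-]: source ip:[10.1.1.100] target host:[centos70-1] '
        'target os:[centos 7.0] description:[http connection request] '
        'details:[10.1.1.101 - - [28/may/2013:12:24:08 +0000] '
        '"get /math/html.mli http/1.0" 404 3567 "-" "-" ] '
        'phase: [access] service:[web] ')

# Group 1: tag name (no colons or brackets).
# Group 2: lazily matched content, ended by a "]" that is followed either by
#          the next "tag:[" or by optional whitespace and the end of the line.
PATTERN = re.compile(
    r'\s?([^:\[\]]+):\s*\[(.*?)\]'
    r'(?=\s+[^:\[\]]+:\s*\[|\s*$)'
)

pairs = PATTERN.findall(LINE)

for tag, value in pairs:
    print(repr(tag), repr(value))
```

The `\s*$` branch is what lets the final `service:[web]` pair match even though no tag follows it.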


Edit: an example of successful output was requested. Using the data provided, the goal is to have 2 capture groups resulting in these pairs:

Match 1
1.  `source ip`
2.  `10.1.1.100`

Match 2
1.  `target host`
2.  `centos70-1`

Match 3
1.  `target os`
2.  `centos 7.0`

Match 4
1.  `description`
2.  `http connection request`

Match 5
1.  `details`
2.  `10.1.1.101 - - [28/may/2013:12:24:08 +0000] "get /math/html.mli http/1.0" 404 3567 "-" "-" `

Match 6
1.  `phase`
2.  `access`

Match 7
1.  `service`
2.  `web`

Heavily inspired by this answer on nested patterns, I ended up with this regex (demo here):

\s*([\w ]+):\s*(\[((?>[^[\]]+|(?2))*)\]) 

The main idea is to repeat the match of brackets as many times as possible (if an opening or closing bracket is found, repeat (?2)). The data you're looking for is in fact in the first and third capture groups; the second one captures the brackets so that the recursion can happen properly.

Details on the regex:

  • \s* match (and discard) spaces before the field name
  • ([\w ]+): capture the field name (everything before the :)
  • \s* again discard spaces, this time before the opening bracket
  • (\[ start of the second capture group and match a literal [
  • ((?>[^[\]]+ start of the third capture group and of an atomic group (blocking backtracking to avoid an infinite loop), which matches anything that is not a bracket
  • |(?2)) if a bracket is found, try re-matching the whole second group
  • *) repeat the atomic group (the alternation that handles nested brackets) 0 or more times, and end the third capture group
  • \]) match our last bracket and end the second capture group, which is the one reused by the alternation inside the atomic group.
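
The pattern above relies on the subroutine call (?2), which PCRE supports but Python's standard-library re module does not. As a rough stand-in for the recursion, here is a sketch that locates each field start with a plain regex and then counts bracket depth by hand. This is a swapped-in technique, not the regex from the answer; parse_fields, FIELD_START and LINE are hypothetical names, and the code assumes brackets in the input are balanced.

```python
import re

# Same job as \s*([\w ]+):\s*\[ in the pattern above: field name, colon,
# optional spaces, then the opening bracket of the value.
FIELD_START = re.compile(r'\s*([\w ]+):\s*\[')


def parse_fields(line):
    """Return (tag, content) pairs; content is the text inside the
    outermost brackets, nested brackets included."""
    pairs = []
    pos = 0
    while True:
        m = FIELD_START.search(line, pos)
        if m is None:
            break
        depth = 1          # we are just past the opening bracket
        i = m.end()
        while i < len(line) and depth > 0:
            if line[i] == '[':
                depth += 1
            elif line[i] == ']':
                depth -= 1
            i += 1
        if depth != 0:
            break          # unbalanced brackets: stop rather than guess
        # i is one past the matching closing bracket.
        pairs.append((m.group(1).strip(), line[m.end():i - 1]))
        pos = i
    return pairs


LINE = ('botinfo[-]: source ip:[10.1.1.100] target host:[centos70-1] '
        'target os:[centos 7.0] description:[http connection request] '
        'details:[10.1.1.101 - - [28/may/2013:12:24:08 +0000] '
        '"get /math/html.mli http/1.0" 404 3567 "-" "-" ] '
        'phase: [access] service:[web] ')

for tag, content in parse_fields(LINE):
    print(repr(tag), repr(content))
```

The depth counter plays the role of the (?2) recursion: every nested [ must be closed before the outermost ] ends the value, so the details field keeps its inner timestamp brackets.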
