Implemented solution
This is really an implementation of glenn's answer, with the addition of a user-defined function, stringSplit(), to use in place of the Gawk 3.1.6 builtin split(), which does not support the optional fourth argument we need ([seps], an array that receives the separators). Gawk 3.1.6 does support the similar-purpose optional third argument to match(), which the code also needs, but [seps] is not available until Gawk 4.0.0.
# stringSplit(str,fld,rx,[sep])
#   Split string on regex delimiter, preserving the regex separators. Gawk 3.1.6
#   equivalent of the builtin split() function of later versions, which adds
#   support for an optional 4th argument ([seps]), an array to hold the
#   matched separator strings.
#   Arguments:
#     str    string to split
#     fld    array of the resulting fields
#     rx     regular expression (regex) to split on
#     [sep]  optional array of separator strings matching the regex
#   Returns:
#     number of fields stored in fld, or 0 if rx never matches
#   Revised:
#     20140117 docsalvage
#
function stringSplit(str,fld,rx,sep,    searchstr,searchndx,match1,matchn,matches) {
    searchstr = str                     # copy of str to use in while() loop
    searchndx = match(searchstr, rx)    # index in searchstr where rx (regex) found
    match1    = searchndx               # preserve result of first match attempt
    matchn    = 1                       # match number (index in array of matches)
    matches   = 0                       # number of fields returned by split()
    #
    # RLENGTH > 0 is more reliable than searchndx > 0: it also stops on an
    # empty match, which would otherwise never advance through searchstr
    while (RLENGTH > 0) {
        # save match
        sep[matchn] = substr(searchstr, searchndx, RLENGTH)
        #
        # match() only searches from beginning so give it just remainder of str
        searchstr = substr(searchstr, searchndx + RLENGTH)
        #
        # printf("sep[%2d]: %s, searchndx: %2d, RLENGTH: %2d, searchstr: %s\n", matchn, sep[matchn], searchndx, RLENGTH, searchstr)
        #
        # search for next rx
        searchndx = match(searchstr, rx)
        matchn = matchn + 1
    }
    #
    if (match1) matches = split(str,fld,rx)
    #
    return matches
}
BEGIN {
    print
    print "Test of:"
    print " stringSplit()"
    print
    #
    str = "[[link|label label]][[link]] @tag more text some text with @anothertag and [[another|link]]"
    rx = "[][][][]|@[[:alnum:]]+"
    #
    # fld - array of fields
    # sep - array of separators
    #
    tags = 0
    matches = stringSplit(str,fld,rx,sep)
    #
    # arrayDebug("fld",fld)
    # arrayDebug("sep",sep)
    # print
    #
    print "Results:"
    printf " "
    # per glenn jackman answer at
    # http://unix.stackexchange.com/questions/109491/the-ere-regex-to-split-string-between-a-delimiter-and-end-of-word
    for (i=1; i<=matches; i++) {
        printf "(%s)", fld[i]
        # 3-argument match() is a gawk extension: m[1] receives the tag name
        if (match(sep[i], /^@(.+)/, m)) { printf "(%s)", m[1]; ++tags }
    }
    #
    print
    print
    print "Summary:"
    printf(" %d matches + %d tags = %d printed using regex(rx): %s\n on string(str): %s\n", matches, tags, matches + tags, rx, str)
    print
}
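
To make the relationship between the two arrays concrete: sep[i] is the text that sat between fld[i] and fld[i+1]. The following is only a minimal sketch of that idea, not the code above. It assumes a hypothetical function name, splitKeep(), and a hypothetical shell wrapper, run_demo; it fills both arrays in a single match()/substr() pass without calling split(), and unlike stringSplit() it returns 1 field (the whole string) when the regex never matches. It uses only plain awk features, so it should not depend on any particular Gawk version.

```shell
run_demo() {
    awk '
    # splitKeep(str, fld, rx, sep): hypothetical single-pass variant.
    # Fills fld[1..n] with the fields and sep[1..n-1] with the matched
    # separators, so that sep[i] sits between fld[i] and fld[i+1].
    # Returns n, the number of fields.
    function splitKeep(str, fld, rx, sep,    n) {
        n = 0
        # stop when nothing matches, or on an empty match (RLENGTH == 0)
        while (match(str, rx) && RLENGTH > 0) {
            n++
            fld[n] = substr(str, 1, RSTART - 1)     # text before the match
            sep[n] = substr(str, RSTART, RLENGTH)   # the match itself
            str = substr(str, RSTART + RLENGTH)     # keep only the remainder
        }
        fld[++n] = str                              # trailing field
        return n
    }
    BEGIN {
        n = splitKeep("one @tag two @x three", fld, "@[a-z]+", sep)
        for (i = 1; i <= n; i++) {
            printf "fld[%d]=<%s>", i, fld[i]
            if (i < n) printf " sep[%d]=<%s>", i, sep[i]
            printf "\n"
        }
    }'
}
run_demo
```

For the sample string this prints three lines: fld[1]=<one > sep[1]=<@tag>, then fld[2]=< two > sep[2]=<@x>, then fld[3]=< three>.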