Filtrando inválido utf8

49

Eu tenho um arquivo de texto em uma codificação desconhecida ou mista. Eu quero ver as linhas que contêm uma seqüência de bytes que não é válida UTF-8 (canalizando o arquivo de texto em algum programa). Equivalentemente, eu quero filtrar as linhas que são válidas UTF-8. Em outras palavras, estou procurando grep [notutf8] .

Uma solução ideal seria portátil, curta e generalizável para outras codificações, mas se você achar que a melhor maneira é assar na definição de UTF-8 , vá em frente.

    
por Gilles 27.01.2011 / 01:13

6 respostas

32

Se você quiser usar grep , você pode fazer:

grep -axv '.*' file

nas localidades UTF-8 para obter as linhas que têm pelo menos uma seqüência UTF-8 inválida (isso funciona com o GNU Grep, pelo menos).

    
por 19.12.2014 / 23:14
32

Eu acho que você provavelmente quer iconv . É para converter entre conjuntos de códigos e suporta um número absurdo de formatos. Por exemplo, para remover qualquer coisa que não seja válida em UTF-8, você pode usar:

iconv -c -t UTF-8 < input.txt > output.txt

Sem a opção -c, ele relatará problemas na conversão para stderr, portanto, com a direção do processo, você pode salvar uma lista deles. Outra maneira seria tirar as coisas que não são UTF8 e, em seguida,

diff input.txt output.txt

para uma lista de onde as alterações foram feitas.

    
por 27.01.2011 / 04:43
20

Editar: Eu consertei um erro de digitação no regex. Ele precisava de um '\ x80' não \ 80 .

O regex para filtrar os formulários UTF-8 inválidos, para a adesão estrita ao UTF-8, é o seguinte

perl -l -ne '/
 ^( ([\x00-\x7F])              # 1-byte pattern
   |([\xC2-\xDF][\x80-\xBF])   # 2-byte pattern
   |((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
   |((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))       # 4-byte pattern
  )*$ /x or print'

Saída (das linhas principais do Teste 1 ):

Codepoint
=========  
00001000  Test=1 mode=strict               valid,invalid,fail=(1000,0,0)          
0000E000  Test=1 mode=strict               valid,invalid,fail=(D800,800,0)          
0010FFFF  mode=strict  test-return=(0,0)   valid,invalid,fail=(10F800,800,0)          

Q. Como criar dados de teste para testar uma regex que filtra Unicode inválido?
A. Crie seu próprio algoritmo de teste UTF-8 e quebre suas regras ...
Catch-22 .. Mas então, como você testa seu algoritmo de teste?

O regex, acima, foi testado (usando iconv como referência) para todo valor inteiro de 0x00000 a 0x10FFFF . Esse valor superior é o o valor inteiro máximo de um Codepoint Unicode

De acordo com esta página da wikipedia UTF-8 .

  • O UTF-8 codifica cada um dos 1.112.064 pontos de código no conjunto de caracteres Unicode, usando de um a quatro bytes de 8 bits

Esse numeber (1,112,064) equivale a um intervalo 0x000000 a 0x10F7FF , que é 0x0800 tímido do valor inteiro máximo real para o ponto de código Unicode mais alto: 0x10FFFF

Este bloco de inteiros está faltando no espectro Unicode Codepoints, devido à necessidade de a codificação UTF-16 ir além de sua intenção de design original através de um sistema chamado pares substitutos . Um bloco de 0x0800 inteiros foi reservado para ser usado pelo UTF-16 .. Esse bloco abrange o intervalo 0x00D800 to 0x00DFFF . Nenhuma dessas inteters são valores Unicode legais e, portanto, são valores UTF-8 inválidos.

No Teste 1 , o regex foi testado em cada número no intervalo de Pontos de código Unicode e corresponde exatamente aos resultados de iconv . 0x010F7FF valores válidos e 0x000800 valores inválidos.

No entanto, o problema surge agora, * Como o regex manipula Out-Of-Range UTF-8 Value; acima de 0x010FFFF (o UTF-8 pode ser estendido para 6 bytes, com um valor inteiro máximo de 0x7FFFFFFF ?
Para gerar os valores necessários * de byte UTF-8 não-unicode , usei o seguinte comando:

  perl -C -e 'print chr 0x'$hexUTF32BE

Para testar sua validade (de alguma forma), usei o Gilles' UTF-8 regex ...

  perl -l -ne '/
   ^( [
[[ "$(locale charmap)" != "UTF-8" ]] && { echo "ERROR: locale must be UTF-8, but it is $(locale charmap)"; exit 1; }

# Testing the UTF-8 regex
#
# Tests to check that the observed byte-ranges (above) have
#  been  accurately observed and included in the test code and final regex. 
# =========================================================================
: 2 bytes; B2=0 #  run-test=1   do-not-test=0
: 3 bytes; B3=0 #  run-test=1   do-not-test=0
: 4 bytes; B4=0 #  run-test=1   do-not-test=0 

:   regex; Rx=1 #  run-test=1   do-not-test=0

           ((strict=16)); mode[$strict]=strict # iconv -f UTF-16BE  then iconv -f UTF-32BE beyond 0xFFFF)
           ((   lax=32)); mode[$lax]=lax       # iconv -f UTF-32BE  only)

          # modebits=$strict
                  # UTF-8, in relation to UTF-16 has invalid values
                  # modebits=$strict automatically shifts to modebits=$lax
                  # when the tested integer exceeds 0xFFFF
          # modebits=$lax 
                  # UTF-8, in relation to UTF-32, has no restrictione


           # Test 1 Sequentially tests a range of Big-Endian integers
           #      * Unicode Codepoints are a subset ofBig-Endian integers            
           #        ( based on 'iconv' -f UTF-32BE -f UTF-8 )    
           # Note: strict UTF-8 has a few quirks because of UTF-16
                    #    Set modebits=16 to "strictly" test the low range

             Test=1; modebits=$strict
           # Test=2; modebits=$lax
           # Test=3
              mode3wlo=$(( 1*4)) # minimum chars * 4 ( '4' is for UTF-32BE )
              mode3whi=$((10*4)) # minimum chars * 4 ( '4' is for UTF-32BE )


#########################################################################  

# 1 byte  UTF-8 values: Nothing to do; no complexities.

#########################################################################

#  2 Byte  UTF-8 values:  Verifying that I've got the right range values.
if ((B2==1)) ; then  
  echo "# Test 2 bytes for Valid UTF-8 values: ie. values which are in range"
  # =========================================================================
  time \
  for d1 in {194..223} ;do
      #     bin       oct  hex  dec
      # lo  11000010  302   C2  194
      # hi  11011111  337   DF  223
      B2b1=$(printf "%0.2X" $d1)
      #
      for d2 in {128..191} ;do
          #     bin       oct  hex  dec
          # lo  10000000  200   80  128
          # hi  10111111  277   BF  191
          B2b2=$(printf "%0.2X" $d2)
          #
          echo -n "${B2b1}${B2b2}" |
            xxd -p -u -r  |
              iconv -f UTF-8 >/dev/null || { 
                echo "ERROR: Invalid UTF-8 found: ${B2b1}${B2b2}"; exit 20; }
          #
      done
  done
  echo

  # Now do a negated test.. This takes longer, because there are more values.
  echo "# Test 2 bytes for Invalid values: ie. values which are out of range"
  # =========================================================================
  # Note: 'iconv' will treat a leading  \x00-\x7F as a valid leading single,
  #   so this negated test primes the first UTF-8 byte with values starting at \x80
  time \
  for d1 in {128..193} {224..255} ;do 
 #for d1 in {128..194} {224..255} ;do # force a valid UTF-8 (needs $B2b2) 
      B2b1=$(printf "%0.2X" $d1)
      #
      for d2 in {0..127} {192..255} ;do
     #for d2 in {0..128} {192..255} ;do # force a valid UTF-8 (needs $B2b1)
          B2b2=$(printf "%0.2X" $d2)
          #
          echo -n "${B2b1}${B2b2}" |
            xxd -p -u -r |
              iconv -f UTF-8 2>/dev/null && { 
                echo "ERROR: VALID UTF-8 found: ${B2b1}${B2b2}"; exit 21; }
          #
      done
  done
  echo
fi

#########################################################################

#  3 Byte  UTF-8 values:  Verifying that I've got the right range values.
if ((B3==1)) ; then  
  echo "# Test 3 bytes for Valid UTF-8 values: ie. values which are in range"
  # ========================================================================
  time \
  for d1 in {224..239} ;do
      #     bin       oct  hex  dec
      # lo  11100000  340   E0  224
      # hi  11101111  357   EF  239
      B3b1=$(printf "%0.2X" $d1)
      #
      if   [[ $B3b1 == "E0" ]] ; then
          B3b2range="$(echo {160..191})"
          #     bin       oct  hex  dec  
          # lo  10100000  240   A0  160  
          # hi  10111111  277   BF  191
      elif [[ $B3b1 == "ED" ]] ; then
          B3b2range="$(echo {128..159})"
          #     bin       oct  hex  dec  
          # lo  10000000  200   80  128  
          # hi  10011111  237   9F  159
      else
          B3b2range="$(echo {128..191})"
          #     bin       oct  hex  dec
          # lo  10000000  200   80  128
          # hi  10111111  277   BF  191
      fi
      # 
      for d2 in $B3b2range ;do
          B3b2=$(printf "%0.2X" $d2)
          echo "${B3b1} ${B3b2} xx"
          #
          for d3 in {128..191} ;do
              #     bin       oct  hex  dec
              # lo  10000000  200   80  128
              # hi  10111111  277   BF  191
              B3b3=$(printf "%0.2X" $d3)
              #
              echo -n "${B3b1}${B3b2}${B3b3}" |
                xxd -p -u -r  |
                  iconv -f UTF-8 >/dev/null || { 
                    echo "ERROR: Invalid UTF-8 found: ${B3b1}${B3b2}${B3b3}"; exit 30; }
              #
          done
      done
  done
  echo

  # Now do a negated test.. This takes longer, because there are more values.
  echo "# Test 3 bytes for Invalid values: ie. values which are out of range"
  # =========================================================================
  # Note: 'iconv' will treat a leading  \x00-\x7F as a valid leading single,
  #   so this negated test primes the first UTF-8 byte with values starting at \x80
  #
  # real     26m28.462s \ 
  # user     27m12.526s  | stepping by 2
  # sys      13m11.193s /
  #
  # real    239m00.836s \
  # user    225m11.108s  | stepping by 1
  # sys     120m00.538s /
  #
  time \
  for d1 in {128..223..1} {240..255..1} ;do 
 #for d1 in {128..224..1} {239..255..1} ;do # force a valid UTF-8 (needs $B2b2,$B3b3) 
      B3b1=$(printf "%0.2X" $d1)
      #
      if   [[ $B3b1 == "E0" ]] ; then
          B3b2range="$(echo {0..159..1} {192..255..1})"
         #B3b2range="$(> {192..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
      elif [[ $B3b1 == "ED" ]] ; then
          B3b2range="$(echo {0..127..1} {160..255..1})"
         #B3b2range="$(echo {0..128..1} {160..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
      else
          B3b2range="$(echo {0..127..1} {192..255..1})"
         #B3b2range="$(echo {0..128..1} {192..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
      fi
      for d2 in $B3b2range ;do
          B3b2=$(printf "%0.2X" $d2)
          echo "${B3b1} ${B3b2} xx"
          #
          for d3 in {0..127..1} {192..255..1} ;do
         #for d3 in {0..128..1} {192..255..1} ;do # force a valid UTF-8 (needs $B2b1)
              B3b3=$(printf "%0.2X" $d3)
              #
              echo -n "${B3b1}${B3b2}${B3b3}" |
                xxd -p -u -r |
                  iconv -f UTF-8 2>/dev/null && { 
                    echo "ERROR: VALID UTF-8 found: ${B3b1}${B3b2}${B3b3}"; exit 31; }
              #
          done
      done
  done
  echo

fi

#########################################################################

#  Brute force testing in the Astral Plane will take a VERY LONG time..
#  Perhaps selective testing is more appropriate, now that the previous tests 
#     have panned out okay... 
#  
#  4 Byte  UTF-8 values:
if ((B4==1)) ; then  
  echo "# Test 4 bytes for Valid UTF-8 values: ie. values which are in range"
  # ==================================================================
  # real    58m18.531s \
  # user    56m44.317s  | 
  # sys     27m29.867s /
  time \
  for d1 in {240..244} ;do
      #     bin       oct  hex  dec
      # lo  11110000  360   F0  240
      # hi  11110100  364   F4  244  -- F4 encodes some values greater than 0x10FFFF;
      #                                    such a sequence is invalid.
      B4b1=$(printf "%0.2X" $d1)
      #
      if   [[ $B4b1 == "F0" ]] ; then
        B4b2range="$(echo {144..191})" ## f0 90 80 80  to  f0 bf bf bf
        #     bin       oct  hex  dec          010000  --  03FFFF 
        # lo  10010000  220   90  144  
        # hi  10111111  277   BF  191
        #                            
      elif [[ $B4b1 == "F4" ]] ; then
        B4b2range="$(echo {128..143})" ## f4 80 80 80  to  f4 8f bf bf
        #     bin       oct  hex  dec          100000  --  10FFFF 
        # lo  10000000  200   80  128  
        # hi  10001111  217   8F  143  -- F4 encodes some values greater than 0x10FFFF;
        #                                    such a sequence is invalid.
      else
        B4b2range="$(echo {128..191})" ## fx 80 80 80  to  f3 bf bf bf
        #     bin       oct  hex  dec          0C0000  --  0FFFFF
        # lo  10000000  200   80  128          0A0000
        # hi  10111111  277   BF  191
      fi
      #
      for d2 in $B4b2range ;do
          B4b2=$(printf "%0.2X" $d2)
          #
          for d3 in {128..191} ;do
              #     bin       oct  hex  dec
              # lo  10000000  200   80  128
              # hi  10111111  277   BF  191
              B4b3=$(printf "%0.2X" $d3)
              echo "${B4b1} ${B4b2} ${B4b3} xx"
              #
              for d4 in {128..191} ;do
                  #     bin       oct  hex  dec
                  # lo  10000000  200   80  128
                  # hi  10111111  277   BF  191
                  B4b4=$(printf "%0.2X" $d4)
                  #
                  echo -n "${B4b1}${B4b2}${B4b3}${B4b4}" |
                    xxd -p -u -r  |
                      iconv -f UTF-8 >/dev/null || { 
                        echo "ERROR: Invalid UTF-8 found: ${B4b1}${B4b2}${B4b3}${B4b4}"; exit 40; }
                  #
              done
          done
      done
  done
  echo "# Test 4 bytes for Valid UTF-8 values: END"
  echo
fi

########################################################################
# There is no test (yet) for negated range values in the astral plane. #  
#                           (all negated range values must be invalid) #
#  I won't bother; This was mainly for me to ge the general feel of    #     
#   the tests, and the final test below should flush anything out..    #
# Traversing the intire UTF-8 range takes quite a while...             #
#   so no need to do it twice (albeit in a slightly different manner)  #
########################################################################

################################
### The construction of:    ####
###  The Regular Expression ####
###      (de-construction?) ####
################################

#     BYTE 1                BYTE 2       BYTE 3      BYTE 4 
# 1: [\x00-\x7F]
#    ===========
#    ([\x00-\x7F])
#
# 2: [\xC2-\xDF]           [\x80-\xBF]
#    =================================
#    ([\xC2-\xDF][\x80-\xBF])
# 
# 3: [\xE0]                [\xA0-\xBF]  [\x80-\xBF]   
#    [\xED]                [\x80-\x9F]  [\x80-\xBF]
#    [\xE1-\xEC\xEE-\xEF]  [\x80-\xBF]  [\x80-\xBF]
#    ==============================================
#    ((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))
#
# 4  [\xF0]                [\x90-\xBF]  [\x80-\xBF]  [\x80-\xBF]    
#    [\xF1-\xF3]           [\x80-\xBF]  [\x80-\xBF]  [\x80-\xBF]
#    [\xF4]                [\x80-\x8F]  [\x80-\xBF]  [\x80-\xBF]
#    ===========================================================
#    ((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))
#
# The final regex
# ===============
# 1-4:  (([\x00-\x7F])|([\xC2-\xDF][\x80-\xBF])|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})))
# 4-1:  (((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|([\xC2-\xDF][\x80-\xBF])|([\x00-\x7F]))


#######################################################################
#  The final Test; for a single character (multi chars to follow)     #  
#   Compare the return code of 'iconv' against the 'regex'            #
#   for the full range of 0x000000 to 0x10FFFF                        #
#                                                                     #     
#  Note; this script has 3 modes:                                     #
#        Run this test TWICE, set each mode Manually!                 #     
#                                                                     #     
#     1. Sequentially test every value from 0x000000 to 0x10FFFF      #     
#     2. Throw a spanner into the works! Force random byte patterns   #     
#     2. Throw a spanner into the works! Force random longer strings  #     
#        ==============================                               #     
#                                                                     #     
#  Note: The purpose of this routine is to determine if there is any  #
#        difference how 'iconv' and 'regex' handle the same data      #  
#                                                                     #     
#######################################################################
if ((Rx==1)) ; then
  # real    191m34.826s
  # user    158m24.114s
  # sys      83m10.676s
  time { 
  invalCt=0
  validCt=0
   failCt=0
  decBeg=$((0x00110000)) # incement by decimal integer
  decMax=$((0x7FFFFFFF)) # incement by decimal integer
  # 
  for ((CPDec=decBeg;CPDec<=decMax;CPDec+=13247)) ;do
      ((D==1)) && echo "=========================================================="
      #
      # Convert decimal integer '$CPDec' to Hex-digits; 6-long  (dec2hex)
      hexUTF32BE=$(printf '%0.8X\n' $CPDec)  # hexUTF32BE

      # progress count  
      if (((CPDec%$((0x1000)))==0)) ;then
          ((Test>2)) && echo
          echo "$hexUTF32BE  Test=$Test mode=${mode[$modebits]}            "
      fi
      if   ((Test==1 || Test==2 ))
      then # Test 1. Sequentially test every value from 0x000000 to 0x10FFFF
          #
          if   ((Test==2)) ; then
              bits=32
              UTF8="$( perl -C -e 'print chr 0x'$hexUTF32BE |
                perl -l -ne '/^(  [perl -l -ne '/
 ^( ([\x00-\x7F])              # 1-byte pattern
   |([\xC2-\xDF][\x80-\xBF])   # 2-byte pattern
   |((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
   |((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))       # 4-byte pattern
  )*$ /x or print'
0-7] | [0-7][0-7] | [0-7][0-7]{2} | [0-7][0-7]{3} | [0-3][0-7]{4} | [4-5][0-7]{5} )*$/x and print' |xxd -p )" UTF8="${UTF8%0a}" [[ -n "$UTF8" ]] \ && rcIco32=0 || rcIco32=1 rcIco16= elif ((modebits==strict && CPDec<=$((0xFFFF)))) ;then bits=16 UTF8="$( echo -n "${hexUTF32BE:4}" | xxd -p -u -r | iconv -f UTF-16BE -t UTF-8 2>/dev/null)" \ && rcIco16=0 || rcIco16=1 rcIco32= else bits=32 UTF8="$( echo -n "$hexUTF32BE" | xxd -p -u -r | iconv -f UTF-32BE -t UTF-8 2>/dev/null)" \ && rcIco32=0 || rcIco32=1 rcIco16= fi # echo "1 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE " # # # if ((${rcIco16}${rcIco32}!=0)) ;then # 'iconv -f UTF-16BE' failed produce a reliable UTF-8 if ((bits==16)) ;then ((D==1)) && echo "bits-$bits rcIconv: error $hexUTF32BE .. 'strict' failed, now trying 'lax'" # iconv failed to create a 'srict' UTF-8 so # try UTF-32BE to get a 'lax' UTF-8 pattern UTF8="$( echo -n "$hexUTF32BE" | xxd -p -u -r | iconv -f UTF-32BE -t UTF-8 2>/dev/null)" \ && rcIco32=0 || rcIco32=1 #echo "2 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE " if ((rcIco32!=0)) ;then ((D==1)) && echo -n "bits-$bits rcIconv: Cannot gen UTF-8 for: $hexUTF32BE" rcIco32=1 fi fi fi # echo "3 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE " # # # if ((rcIco16==0 || rcIco32==0)) ;then # 'strict(16)' OR 'lax(32)'... 'iconv' managed to generate a UTF-8 pattern ((D==1)) && echo -n "bits-$bits rcIconv: pattern* $hexUTF32BE" ((D==1)) && if [[ $bits == "16" && $rcIco32 == "0" ]] ;then echo " .. 'lax' UTF-8 produced a pattern" else echo fi # regex test if ((modebits==strict)) ;then #rxOut="$(echo -n "$UTF8" |perl -l -ne '/^(([\x00-\x7F])|([\xC2-\xDF][\x80-\xBF])|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})))*$/ or print' )" rxOut="$(echo -n "$UTF8" | perl -l -ne '/^( ([\x00-\x7F]) # 1-byte pattern |([\xC2-\xDF][\x80-\xBF]) # 2-byte pattern |((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern |((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})) # 4-byte pattern )*$ /x or print' )" else if ((Test==2)) ;then rx="$(echo -n "$UTF8" |perl -l -ne '/^([
Codepoint
=========  
00001000  Test=1 mode=strict               valid,invalid,fail=(1000,0,0)          
0000E000  Test=1 mode=strict               valid,invalid,fail=(D800,800,0)          
0010FFFF  mode=strict  test-return=(0,0)   valid,invalid,fail=(10F800,800,0)          
0-7]|[0-7][0-7]|[0-7][0-7]{2}|[0-7][0-7]{3}|[0-3][0-7]{4}|[4-5][0-7]{5})*$/ and print')" [[ "$UTF8" != "$rx" ]] && rxOut="$UTF8" || rxOut= rx="$(echo -n "$rx" |sed -e "s/\(..\)/ /g")" else rxOut="$(echo -n "$UTF8" |perl -l -ne '/^([
  perl -C -e 'print chr 0x'$hexUTF32BE
0-7]|[0-7][0-7]|[0-7][0-7]{2}|[0-7][0-7]{3}|[0-3][0-7]{4}|[4-5][0-7]{5})*$/ or print' )" fi fi if [[ "$rxOut" == "" ]] ;then ((D==1)) && echo " rcRegex: ok" rcRegex=0 else ((D==1)) && echo -n "bits-$bits rcRegex: error $hexUTF32BE .. 'strict' failed," ((D==1)) && if [[ "12" == *$Test* ]] ;then echo # " (codepoint) Test $Test" else echo fi rcRegex=1 fi fi # elif [[ $Test == 2 ]] then # Test 2. Throw a randomizing spanner into the works! # Then test the arbitary bytes ASIS # hexLineRand="$(echo -n "$hexUTF32BE" | sed -re "s/(.)(.)(.)(.)(.)(.)(.)(.)/\n\n\n\n\n\n\n/" | sort -R | tr -d '\n')" # elif [[ $Test == 3 ]] then # Test 3. Test single UTF-16BE bytes in the range 0x00000000 to 0x7FFFFFFF # echo "Test 3 is not properly implemented yet.. Exiting" exit 99 else echo "ERROR: Invalid mode" exit fi # # if ((Test==1 || Test=2)) ;then if ((modebits==strict && CPDec<=$((0xFFFF)))) ;then ((rcIconv=rcIco16)) else ((rcIconv=rcIco32)) fi if ((rcRegex!=rcIconv)) ;then [[ $Test != 1 ]] && echo if ((rcRegex==1)) ;then echo "ERROR: 'regex' ok, but NOT 'iconv': ${hexUTF32BE} " else echo "ERROR: 'iconv' ok, but NOT 'regex': ${hexUTF32BE} " fi ((failCt++)); elif ((rcRegex!=0)) ;then # ((invalCt++)); echo -ne "$hexUTF32BE exit-codes $${rcIco16}${rcIco32}=,$rcRegex\t: $(printf "%0.8X\n" $invalCt)\t$hexLine$(printf "%$(((mode3whi*2)-${#hexLine}))s")\r" ((invalCt++)) else ((validCt++)) fi if ((Test==1)) ;then echo -ne "$hexUTF32BE " "mode=${mode[$modebits]} test-return=($rcIconv,$rcRegex) valid,invalid,fail=($(printf "%X" $validCt),$(printf "%X" $invalCt),$(printf "%X" $failCt)) \r" else echo -ne "$hexUTF32BE $rx mode=${mode[$modebits]} test-return=($rcIconv,$rcRegex) val,inval,fail=($(printf "%X" $validCt),$(printf "%X" $invalCt),$(printf "%X" $failCt))\r" fi fi done } # End time fi exit
0-7] # 1-byte pattern |[0-7][0-7] # 2-byte pattern |[0-7][0-7]{2} # 3-byte pattern |[0-7][0-7]{3} # 4-byte pattern |[0-3][0-7]{4} # 5-byte pattern |[4-5][0-7]{5} # 6-byte pattern )*$ /x or print'

A saída de 'perl's print chr' coincide com a filtragem do regex de Gilles. Um reforça a validade do outro. Eu não posso usar iconv porque ele só lida com o subconjunto Padrão Unicode válido do padrão UTF-8 (original) mais amplo ...

Os nunbers envolvidos são bastante grandes, então eu testei as varreduras de topo de gama, de baixo de gama e várias varreduras por incrementos como 11111, 13579, 33333, 53441 ... Os resultados combinam Portanto, tudo o que resta é testar o regex contra esses valores fora do intervalo de estilo UTF-8 (inválido para Unicode e, portanto, também inválido para o próprio UTF-8 estrito).

Aqui estão os módulos de teste:

  perl -l -ne '/
   ^( [
[[ "$(locale charmap)" != "UTF-8" ]] && { echo "ERROR: locale must be UTF-8, but it is $(locale charmap)"; exit 1; }

# Testing the UTF-8 regex
#
# Tests to check that the observed byte-ranges (above) have
#  been  accurately observed and included in the test code and final regex. 
# =========================================================================
: 2 bytes; B2=0 #  run-test=1   do-not-test=0
: 3 bytes; B3=0 #  run-test=1   do-not-test=0
: 4 bytes; B4=0 #  run-test=1   do-not-test=0 

:   regex; Rx=1 #  run-test=1   do-not-test=0

           ((strict=16)); mode[$strict]=strict # iconv -f UTF-16BE  then iconv -f UTF-32BE beyond 0xFFFF)
           ((   lax=32)); mode[$lax]=lax       # iconv -f UTF-32BE  only)

          # modebits=$strict
                  # UTF-8, in relation to UTF-16 has invalid values
                  # modebits=$strict automatically shifts to modebits=$lax
                  # when the tested integer exceeds 0xFFFF
          # modebits=$lax 
                  # UTF-8, in relation to UTF-32, has no restrictione


           # Test 1 Sequentially tests a range of Big-Endian integers
           #      * Unicode Codepoints are a subset ofBig-Endian integers            
           #        ( based on 'iconv' -f UTF-32BE -f UTF-8 )    
           # Note: strict UTF-8 has a few quirks because of UTF-16
                    #    Set modebits=16 to "strictly" test the low range

             Test=1; modebits=$strict
           # Test=2; modebits=$lax
           # Test=3
              mode3wlo=$(( 1*4)) # minimum chars * 4 ( '4' is for UTF-32BE )
              mode3whi=$((10*4)) # minimum chars * 4 ( '4' is for UTF-32BE )


#########################################################################  

# 1 byte  UTF-8 values: Nothing to do; no complexities.

#########################################################################

#  2 Byte  UTF-8 values:  Verifying that I've got the right range values.
if ((B2==1)) ; then  
  echo "# Test 2 bytes for Valid UTF-8 values: ie. values which are in range"
  # =========================================================================
  time \
  for d1 in {194..223} ;do
      #     bin       oct  hex  dec
      # lo  11000010  302   C2  194
      # hi  11011111  337   DF  223
      B2b1=$(printf "%0.2X" $d1)
      #
      for d2 in {128..191} ;do
          #     bin       oct  hex  dec
          # lo  10000000  200   80  128
          # hi  10111111  277   BF  191
          B2b2=$(printf "%0.2X" $d2)
          #
          echo -n "${B2b1}${B2b2}" |
            xxd -p -u -r  |
              iconv -f UTF-8 >/dev/null || { 
                echo "ERROR: Invalid UTF-8 found: ${B2b1}${B2b2}"; exit 20; }
          #
      done
  done
  echo

  # Now do a negated test.. This takes longer, because there are more values.
  echo "# Test 2 bytes for Invalid values: ie. values which are out of range"
  # =========================================================================
  # Note: 'iconv' will treat a leading  \x00-\x7F as a valid leading single,
  #   so this negated test primes the first UTF-8 byte with values starting at \x80
  time \
  for d1 in {128..193} {224..255} ;do 
 #for d1 in {128..194} {224..255} ;do # force a valid UTF-8 (needs $B2b2) 
      B2b1=$(printf "%0.2X" $d1)
      #
      for d2 in {0..127} {192..255} ;do
     #for d2 in {0..128} {192..255} ;do # force a valid UTF-8 (needs $B2b1)
          B2b2=$(printf "%0.2X" $d2)
          #
          echo -n "${B2b1}${B2b2}" |
            xxd -p -u -r |
              iconv -f UTF-8 2>/dev/null && { 
                echo "ERROR: VALID UTF-8 found: ${B2b1}${B2b2}"; exit 21; }
          #
      done
  done
  echo
fi

#########################################################################

#  3 Byte  UTF-8 values:  Verifying that I've got the right range values.
if ((B3==1)) ; then  
  echo "# Test 3 bytes for Valid UTF-8 values: ie. values which are in range"
  # ========================================================================
  time \
  for d1 in {224..239} ;do
      #     bin       oct  hex  dec
      # lo  11100000  340   E0  224
      # hi  11101111  357   EF  239
      B3b1=$(printf "%0.2X" $d1)
      #
      if   [[ $B3b1 == "E0" ]] ; then
          B3b2range="$(echo {160..191})"
          #     bin       oct  hex  dec  
          # lo  10100000  240   A0  160  
          # hi  10111111  277   BF  191
      elif [[ $B3b1 == "ED" ]] ; then
          B3b2range="$(echo {128..159})"
          #     bin       oct  hex  dec  
          # lo  10000000  200   80  128  
          # hi  10011111  237   9F  159
      else
          B3b2range="$(echo {128..191})"
          #     bin       oct  hex  dec
          # lo  10000000  200   80  128
          # hi  10111111  277   BF  191
      fi
      # 
      for d2 in $B3b2range ;do
          B3b2=$(printf "%0.2X" $d2)
          echo "${B3b1} ${B3b2} xx"
          #
          for d3 in {128..191} ;do
              #     bin       oct  hex  dec
              # lo  10000000  200   80  128
              # hi  10111111  277   BF  191
              B3b3=$(printf "%0.2X" $d3)
              #
              echo -n "${B3b1}${B3b2}${B3b3}" |
                xxd -p -u -r  |
                  iconv -f UTF-8 >/dev/null || { 
                    echo "ERROR: Invalid UTF-8 found: ${B3b1}${B3b2}${B3b3}"; exit 30; }
              #
          done
      done
  done
  echo

  # Now do a negated test.. This takes longer, because there are more values.
  echo "# Test 3 bytes for Invalid values: ie. values which are out of range"
  # =========================================================================
  # Note: 'iconv' will treat a leading  \x00-\x7F as a valid leading single,
  #   so this negated test primes the first UTF-8 byte with values starting at \x80
  #
  # real     26m28.462s \ 
  # user     27m12.526s  | stepping by 2
  # sys      13m11.193s /
  #
  # real    239m00.836s \
  # user    225m11.108s  | stepping by 1
  # sys     120m00.538s /
  #
  time \
  for d1 in {128..223..1} {240..255..1} ;do 
 #for d1 in {128..224..1} {239..255..1} ;do # force a valid UTF-8 (needs $B2b2,$B3b3) 
      B3b1=$(printf "%0.2X" $d1)
      #
      if   [[ $B3b1 == "E0" ]] ; then
          B3b2range="$(echo {0..159..1} {192..255..1})"
         #B3b2range="$(> {192..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
      elif [[ $B3b1 == "ED" ]] ; then
          B3b2range="$(echo {0..127..1} {160..255..1})"
         #B3b2range="$(echo {0..128..1} {160..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
      else
          B3b2range="$(echo {0..127..1} {192..255..1})"
         #B3b2range="$(echo {0..128..1} {192..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
      fi
      for d2 in $B3b2range ;do
          B3b2=$(printf "%0.2X" $d2)
          echo "${B3b1} ${B3b2} xx"
          #
          for d3 in {0..127..1} {192..255..1} ;do
         #for d3 in {0..128..1} {192..255..1} ;do # force a valid UTF-8 (needs $B2b1)
              B3b3=$(printf "%0.2X" $d3)
              #
              echo -n "${B3b1}${B3b2}${B3b3}" |
                xxd -p -u -r |
                  iconv -f UTF-8 2>/dev/null && { 
                    echo "ERROR: VALID UTF-8 found: ${B3b1}${B3b2}${B3b3}"; exit 31; }
              #
          done
      done
  done
  echo

fi

#########################################################################

#  Brute force testing in the Astral Plane will take a VERY LONG time..
#  Perhaps selective testing is more appropriate, now that the previous tests 
#     have panned out okay... 
#  
#  4 Byte  UTF-8 values:
if ((B4==1)) ; then  
  echo "# Test 4 bytes for Valid UTF-8 values: ie. values which are in range"
  # ==================================================================
  # real    58m18.531s \
  # user    56m44.317s  | 
  # sys     27m29.867s /
  time \
  for d1 in {240..244} ;do
      #     bin       oct  hex  dec
      # lo  11110000  360   F0  240
      # hi  11110100  364   F4  244  -- F4 encodes some values greater than 0x10FFFF;
      #                                    such a sequence is invalid.
      B4b1=$(printf "%0.2X" $d1)
      #
      if   [[ $B4b1 == "F0" ]] ; then
        B4b2range="$(echo {144..191})" ## f0 90 80 80  to  f0 bf bf bf
        #     bin       oct  hex  dec          010000  --  03FFFF 
        # lo  10010000  220   90  144  
        # hi  10111111  277   BF  191
        #                            
      elif [[ $B4b1 == "F4" ]] ; then
        B4b2range="$(echo {128..143})" ## f4 80 80 80  to  f4 8f bf bf
        #     bin       oct  hex  dec          100000  --  10FFFF 
        # lo  10000000  200   80  128  
        # hi  10001111  217   8F  143  -- F4 encodes some values greater than 0x10FFFF;
        #                                    such a sequence is invalid.
      else
        B4b2range="$(echo {128..191})" ## fx 80 80 80  to  f3 bf bf bf
        #     bin       oct  hex  dec          0C0000  --  0FFFFF
        # lo  10000000  200   80  128          0A0000
        # hi  10111111  277   BF  191
      fi
      #
      for d2 in $B4b2range ;do
          B4b2=$(printf "%0.2X" $d2)
          #
          for d3 in {128..191} ;do
              #     bin       oct  hex  dec
              # lo  10000000  200   80  128
              # hi  10111111  277   BF  191
              B4b3=$(printf "%0.2X" $d3)
              echo "${B4b1} ${B4b2} ${B4b3} xx"
              #
              for d4 in {128..191} ;do
                  #     bin       oct  hex  dec
                  # lo  10000000  200   80  128
                  # hi  10111111  277   BF  191
                  B4b4=$(printf "%0.2X" $d4)
                  #
                  echo -n "${B4b1}${B4b2}${B4b3}${B4b4}" |
                    xxd -p -u -r  |
                      iconv -f UTF-8 >/dev/null || { 
                        echo "ERROR: Invalid UTF-8 found: ${B4b1}${B4b2}${B4b3}${B4b4}"; exit 40; }
                  #
              done
          done
      done
  done
  echo "# Test 4 bytes for Valid UTF-8 values: END"
  echo
fi

########################################################################
# There is no test (yet) for negated range values in the astral plane. #  
#                           (all negated range values must be invalid) #
#  I won't bother; This was mainly for me to ge the general feel of    #     
#   the tests, and the final test below should flush anything out..    #
# Traversing the intire UTF-8 range takes quite a while...             #
#   so no need to do it twice (albeit in a slightly different manner)  #
########################################################################

################################
### The construction of:    ####
###  The Regular Expression ####
###      (de-construction?) ####
################################

#     BYTE 1                BYTE 2       BYTE 3      BYTE 4 
# 1: [\x00-\x7F]
#    ===========
#    ([\x00-\x7F])
#
# 2: [\xC2-\xDF]           [\x80-\xBF]
#    =================================
#    ([\xC2-\xDF][\x80-\xBF])
# 
# 3: [\xE0]                [\xA0-\xBF]  [\x80-\xBF]   
#    [\xED]                [\x80-\x9F]  [\x80-\xBF]
#    [\xE1-\xEC\xEE-\xEF]  [\x80-\xBF]  [\x80-\xBF]
#    ==============================================
#    ((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))
#
# 4  [\xF0]                [\x90-\xBF]  [\x80-\xBF]  [\x80-\xBF]    
#    [\xF1-\xF3]           [\x80-\xBF]  [\x80-\xBF]  [\x80-\xBF]
#    [\xF4]                [\x80-\x8F]  [\x80-\xBF]  [\x80-\xBF]
#    ===========================================================
#    ((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))
#
# The final regex
# ===============
# 1-4:  (([\x00-\x7F])|([\xC2-\xDF][\x80-\xBF])|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})))
# 4-1:  (((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|([\xC2-\xDF][\x80-\xBF])|([\x00-\x7F]))


#######################################################################
#  The final Test; for a single character (multi chars to follow)     #  
#   Compare the return code of 'iconv' against the 'regex'            #
#   for the full range of 0x000000 to 0x10FFFF                        #
#                                                                     #     
#  Note; this script has 3 modes:                                     #
#        Run this test TWICE, set each mode Manually!                 #     
#                                                                     #     
#     1. Sequentially test every value from 0x000000 to 0x10FFFF      #     
#     2. Throw a spanner into the works! Force random byte patterns   #     
#     2. Throw a spanner into the works! Force random longer strings  #     
#        ==============================                               #     
#                                                                     #     
#  Note: The purpose of this routine is to determine if there is any  #
#        difference how 'iconv' and 'regex' handle the same data      #  
#                                                                     #     
#######################################################################
if ((Rx==1)) ; then
  # real    191m34.826s
  # user    158m24.114s
  # sys      83m10.676s
  time { 
  invalCt=0
  validCt=0
   failCt=0
  decBeg=$((0x00110000)) # incement by decimal integer
  decMax=$((0x7FFFFFFF)) # incement by decimal integer
  # 
  for ((CPDec=decBeg;CPDec<=decMax;CPDec+=13247)) ;do
      ((D==1)) && echo "=========================================================="
      #
      # Convert decimal integer '$CPDec' to Hex-digits; 6-long  (dec2hex)
      hexUTF32BE=$(printf '%0.8X\n' $CPDec)  # hexUTF32BE

      # progress count  
      if (((CPDec%$((0x1000)))==0)) ;then
          ((Test>2)) && echo
          echo "$hexUTF32BE  Test=$Test mode=${mode[$modebits]}            "
      fi
      if   ((Test==1 || Test==2 ))
      then # Test 1. Sequentially test every value from 0x000000 to 0x10FFFF
          #
          if   ((Test==2)) ; then
              bits=32
              UTF8="$( perl -C -e 'print chr 0x'$hexUTF32BE |
                perl -l -ne '/^(  [%pre%0-7]
                                | [0-7][0-7]
                                | [0-7][0-7]{2}
                                | [0-7][0-7]{3}
                                | [0-3][0-7]{4}
                                | [4-5][0-7]{5}
                               )*$/x and print' |xxd -p )"
              UTF8="${UTF8%0a}"
              [[ -n "$UTF8" ]] \
                    && rcIco32=0 || rcIco32=1
                       rcIco16=

          elif ((modebits==strict && CPDec<=$((0xFFFF)))) ;then
              bits=16
              UTF8="$( echo -n "${hexUTF32BE:4}" |
                xxd -p -u -r |
                  iconv -f UTF-16BE -t UTF-8 2>/dev/null)" \
                    && rcIco16=0 || rcIco16=1  
                       rcIco32=
          else
              bits=32
              UTF8="$( echo -n "$hexUTF32BE" |
                xxd -p -u -r |
                  iconv -f UTF-32BE -t UTF-8 2>/dev/null)" \
                    && rcIco32=0 || rcIco32=1
                       rcIco16=
          fi
          # echo "1 mode=${mode[$modebits]}-$bits  rcIconv: (${rcIco16},${rcIco32})  $hexUTF32BE "
          #
          #
          #
          if ((${rcIco16}${rcIco32}!=0)) ;then
              # 'iconv -f UTF-16BE' failed produce a reliable UTF-8
              if ((bits==16)) ;then
                  ((D==1)) &&           echo "bits-$bits rcIconv: error    $hexUTF32BE .. 'strict' failed, now trying 'lax'"
                  #  iconv failed to create a  'srict' UTF-8 so   
                  #      try UTF-32BE to get a   'lax' UTF-8 pattern    
                  UTF8="$( echo -n "$hexUTF32BE" |
                    xxd -p -u -r |
                      iconv -f UTF-32BE -t UTF-8 2>/dev/null)" \
                        && rcIco32=0 || rcIco32=1
                  #echo "2 mode=${mode[$modebits]}-$bits  rcIconv: (${rcIco16},${rcIco32})  $hexUTF32BE "
                  if ((rcIco32!=0)) ;then
                      ((D==1)) &&               echo -n "bits-$bits rcIconv: Cannot gen UTF-8 for: $hexUTF32BE"
                      rcIco32=1
                  fi
              fi
          fi
          # echo "3 mode=${mode[$modebits]}-$bits  rcIconv: (${rcIco16},${rcIco32})  $hexUTF32BE "
          #
          #
          #
          if ((rcIco16==0 || rcIco32==0)) ;then
              # 'strict(16)' OR 'lax(32)'... 'iconv' managed to generate a UTF-8 pattern  
                  ((D==1)) &&       echo -n "bits-$bits rcIconv: pattern* $hexUTF32BE"
                  ((D==1)) &&       if [[ $bits == "16" && $rcIco32 == "0" ]] ;then 
                  echo " .. 'lax' UTF-8 produced a pattern"
              else
                  echo
              fi
               # regex test
              if ((modebits==strict)) ;then
                 #rxOut="$(echo -n "$UTF8" |perl -l -ne '/^(([\x00-\x7F])|([\xC2-\xDF][\x80-\xBF])|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})))*$/ or print' )"
                                     rxOut="$(echo -n "$UTF8" |
                  perl -l -ne '/^( ([\x00-\x7F])             # 1-byte pattern
                                  |([\xC2-\xDF][\x80-\xBF])  # 2-byte pattern
                                  |((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))  # 3-byte pattern
                                  |((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))        # 4-byte pattern
                                 )*$ /x or print' )"
               else
                  if ((Test==2)) ;then
                      rx="$(echo -n "$UTF8" |perl -l -ne '/^([%pre%0-7]|[0-7][0-7]|[0-7][0-7]{2}|[0-7][0-7]{3}|[0-3][0-7]{4}|[4-5][0-7]{5})*$/ and print')"
                      [[ "$UTF8" != "$rx" ]] && rxOut="$UTF8" || rxOut=
                      rx="$(echo -n "$rx" |sed -e "s/\(..\)/ /g")"  
                  else 
                      rxOut="$(echo -n "$UTF8" |perl -l -ne '/^([%pre%0-7]|[0-7][0-7]|[0-7][0-7]{2}|[0-7][0-7]{3}|[0-3][0-7]{4}|[4-5][0-7]{5})*$/ or print' )"
                  fi
              fi
              if [[ "$rxOut" == "" ]] ;then
                ((D==1)) &&           echo "        rcRegex: ok"
                  rcRegex=0
              else
                  ((D==1)) &&           echo -n "bits-$bits rcRegex: error    $hexUTF32BE .. 'strict' failed,"
                  ((D==1)) &&           if [[  "12" == *$Test* ]] ;then 
                                            echo # "  (codepoint) Test $Test" 
                                        else
                                            echo
                                        fi
                  rcRegex=1
              fi
          fi
          #
      elif [[ $Test == 2 ]]
      then # Test 2. Throw a randomizing spanner into the works! 
          #          Then test the  arbitary bytes ASIS
          #
          hexLineRand="$(echo -n "$hexUTF32BE" |
            sed -re "s/(.)(.)(.)(.)(.)(.)(.)(.)/\n\n\n\n\n\n\n/" |
              sort -R |
                tr -d '\n')"
          # 
      elif [[ $Test == 3 ]]
      then # Test 3. Test single UTF-16BE bytes in the range 0x00000000 to 0x7FFFFFFF
          #
          echo "Test 3 is not properly implemented yet.. Exiting"
          exit 99 
      else
          echo "ERROR: Invalid mode"
          exit
      fi
      #
      #
      if ((Test==1 || Test=2)) ;then
          if ((modebits==strict && CPDec<=$((0xFFFF)))) ;then
              ((rcIconv=rcIco16))
          else
              ((rcIconv=rcIco32))
          fi
          if ((rcRegex!=rcIconv)) ;then
              [[ $Test != 1 ]] && echo
              if ((rcRegex==1)) ;then
                  echo "ERROR: 'regex' ok, but NOT 'iconv': ${hexUTF32BE} "
              else
                  echo "ERROR: 'iconv' ok, but NOT 'regex': ${hexUTF32BE} "
              fi
              ((failCt++));
          elif ((rcRegex!=0)) ;then
            # ((invalCt++)); echo -ne "$hexUTF32BE  exit-codes $${rcIco16}${rcIco32}=,$rcRegex\t: $(printf "%0.8X\n" $invalCt)\t$hexLine$(printf "%$(((mode3whi*2)-${#hexLine}))s")\r"
              ((invalCt++)) 
          else
              ((validCt++)) 
          fi
          if   ((Test==1)) ;then
              echo -ne "$hexUTF32BE "    "mode=${mode[$modebits]}  test-return=($rcIconv,$rcRegex)   valid,invalid,fail=($(printf "%X" $validCt),$(printf "%X" $invalCt),$(printf "%X" $failCt))          \r"
          else 
              echo -ne "$hexUTF32BE $rx mode=${mode[$modebits]} test-return=($rcIconv,$rcRegex)  val,inval,fail=($(printf "%X" $validCt),$(printf "%X" $invalCt),$(printf "%X" $failCt))\r"
          fi
      fi
  done
  } # End time
fi
exit
0-7] # 1-byte pattern |[0-7][0-7] # 2-byte pattern |[0-7][0-7]{2} # 3-byte pattern |[0-7][0-7]{3} # 4-byte pattern |[0-3][0-7]{4} # 5-byte pattern |[4-5][0-7]{5} # 6-byte pattern )*$ /x or print'
    
por 03.05.2011 / 20:50
6

Eu acho uconv (em icu-devtools package no Debian) útil para inspecionar dados UTF-8:

$ print '\xE9 \xe9 \u20ac \ud800\udc00 \U110000' |
    uconv --callback escape-c -t us
\xE9 \xE9 \u20ac \xED\xA0\x80\xED\xB0\x80 \xF4\x90\x80\x80

(O \x s ajuda a identificar os caracteres inválidos (exceto o falso positivo introduzido voluntariamente com um literal \xE9 acima)).

(muitos outros usos agradáveis).

    
por 20.12.2014 / 00:46
3

O Python tem uma função unicode integrada desde a versão 2.0.

#!/usr/bin/env python2
import sys
for line in sys.stdin:
    try:
        unicode(line, 'utf-8')
    except UnicodeDecodeError:
        sys.stdout.write(line)

No Python 3, unicode foi dobrado em str . Ele precisa ser passado por um objeto semelhante a bytes , aqui o buffer objetos para o descritores padrão .

#!/usr/bin/env python3
import sys
for line in sys.stdin.buffer:
    try:
        str(line, 'utf-8')
    except UnicodeDecodeError:
        sys.stdout.buffer.write(line)
    
por 20.12.2014 / 00:05
0

Eu me deparei com um problema semelhante (detalhe na seção "Contexto") e cheguei com a seguinte solução ftfy_line_by_line.py :

#!/usr/bin/env python3
import ftfy, sys
with open(sys.argv[1], mode='rt', encoding='utf8', errors='replace') as f:
  for line in f:
    sys.stdout.buffer.write(ftfy.fix_text(line).encode('utf8', 'replace'))
    #print(ftfy.fix_text(line).rstrip().decode(encoding="utf-8", errors="replace"))

Usando codificar + substituir + ftfy para corrigir automaticamente Mojibake e outras correções.

Contexto

Coletei > 10GiB CSV de metadados do sistema de arquivos básico usando o seguinte script gen_basic_files_metadata.csv.sh , executando essencialmente :

find "${path}" -type f -exec stat --format="%i,%Y,%s,${hostname},%m,%n" "{}" \;

O problema que tive foi com codificação inconsistente de nomes de arquivos em sistemas de arquivos, causando UnicodeDecodeError ao processar mais com aplicativos python (csvsql para ser mais específico).

Por isso, apliquei acima do script ftfy e foi necessário

Por favor, note que ftfy é muito lento, processando os > 10GiB:

real    147m35.182s
user    146m14.329s
sys     2m8.713s

enquanto sha256sum para comparação:

real    6m28.897s
user    1m9.273s
sys     0m6.210s

na CPU Intel Core i7-3520M a 2.90GHz + 16GiB RAM (e dados na unidade externa)

    
por 25.05.2017 / 16:11