According to Unicode® Standard Annex #31 (Unicode Identifier and Pattern Syntax), the ID_Start and ID_Continue character categories are derived from the Unicode General_Category (read the UnicodeData File Format and Unicode Character Database articles in full).
See Table 2. Properties for Lexical Classes for Identifiers (summary):
ID_Start characters are derived from the Unicode General_Category of uppercase letters, lowercase letters, titlecase letters, modifier letters, other letters, letter numbers, plus Other_ID_Start, minus Pattern_Syntax and Pattern_White_Space code points.
- In set notation:
[[:L:][:Nl:][:Other_ID_Start:]--[:Pattern_Syntax:]--[:Pattern_White_Space:]]
ID_Continue characters include ID_Start characters, plus characters having the Unicode General_Category of nonspacing marks, spacing combining marks, decimal number, connector punctuation, plus Other_ID_Continue, minus Pattern_Syntax and Pattern_White_Space code points.
- In set notation:
[[:ID_Start:][:Mn:][:Mc:][:Nd:][:Pc:][:Other_ID_Continue:]--[:Pattern_Syntax:]--[:Pattern_White_Space:]]
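The General_Category part of the two set expressions above can be sketched in Python with the standard `unicodedata` module. This is only an approximation under stated assumptions: `unicodedata` does not expose Other_ID_Start, Other_ID_Continue, Pattern_Syntax or Pattern_White_Space, so those terms are left out, and the results reflect whatever Unicode version your Python build ships with. The function names are hypothetical.

```python
import unicodedata

# General_Category values named in the ID_Start definition: [:L:] and [:Nl:]
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# Extra categories that ID_Continue adds on top of ID_Start
ID_CONTINUE_EXTRA = {"Mn", "Mc", "Nd", "Pc"}


def is_id_start_approx(ch: str) -> bool:
    """Approximate ID_Start using General_Category only.

    Assumption: Other_ID_Start, Pattern_Syntax and Pattern_White_Space
    are ignored because unicodedata does not provide them.
    """
    return unicodedata.category(ch) in ID_START_CATEGORIES


def is_id_continue_approx(ch: str) -> bool:
    """Approximate ID_Continue using General_Category only (same caveats)."""
    return (
        is_id_start_approx(ch)
        or unicodedata.category(ch) in ID_CONTINUE_EXTRA
    )


if __name__ == "__main__":
    for ch in ["a", "1", "_", "-", "\u2167"]:
        print(repr(ch), unicodedata.category(ch),
              is_id_start_approx(ch), is_id_continue_approx(ch))
```

For an exact answer, the four omitted properties would have to be loaded from the Unicode Character Database files for the version you target.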
We can see references to Other_ID_Start, Other_ID_Continue, Pattern_Syntax and Pattern_White_Space here; for example:
The exact list of characters covered by the Other_ID_Start and Other_ID_Continue properties depends on the version of Unicode. For more information, see Unicode Standard Annex #44, “Unicode Character Database” [UAX44].
Parse UnicodeData.txt, applying valid regex(es) built from the set notation above. Apply them to the appropriate version of UnicodeData.txt, which you can find by browsing the Index of /Public.
http://unicode.org/Public/5.0.0/ucd/UnicodeData.txt
(The 5.0.0 segment of the URL is the Unicode version; replace it with the version you need.)
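A minimal sketch of the parsing step, assuming the documented UnicodeData.txt layout: semicolon-separated fields with the code point in field 0, the name in field 1 and the General_Category in field 2, and large blocks given as a pair of `<..., First>` / `<..., Last>` lines. The function name `parse_categories` and the sample lines are illustrative, not taken from the original answer. Note that UnicodeData.txt only supplies the General_Category terms; the Other_ID_* and Pattern_* properties live in other UCD files (PropList.txt) and are not handled here.

```python
# General_Category values from the ID_Start set notation: [:L:] and [:Nl:]
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def parse_categories(lines, wanted):
    """Return (first, last) code point ranges whose General_Category is in `wanted`.

    `lines` is any iterable of UnicodeData.txt lines, e.g. an open file.
    """
    ranges = []
    range_start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.endswith(", First>"):
            # Remember where the block starts; the matching Last line closes it.
            range_start = code_point
            continue
        if name.endswith(", Last>") and range_start is not None:
            first, range_start = range_start, None
        else:
            first = code_point
        if category in wanted:
            ranges.append((first, code_point))
    return ranges


# Hand-written sample lines in the UnicodeData.txt field layout.
SAMPLE = [
    "0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "9FBB;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
]

if __name__ == "__main__":
    for first, last in parse_categories(SAMPLE, ID_START_CATEGORIES):
        print(f"U+{first:04X}..U+{last:04X}")
```

With a downloaded copy of the file, the same function can be called as `parse_categories(open("UnicodeData.txt", encoding="utf-8"), ID_START_CATEGORIES)`.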