Por fim, pesquisando sobre o erro, encontrei alguns artigos sugerindo que no MySQL o conjunto de caracteres utf8
tem bugs / danos cerebrais e não deve ser usado.
Em sua substituição, deve-se usar o conjunto de caracteres utf8mb4
.
De No MySQL, nunca use “utf8”. Use "utf8mb4"
MySQL’s “utf8” isn’t UTF-8.
The “utf8” encoding only supports three bytes per character. The real UTF-8 encoding — which everybody uses, including you — needs up to four bytes per character.
MySQL developers never fixed this bug. They released a workaround in 2010: a new character set called “utf8mb4”.
In short:
- MySQL’s “utf8mb4” means “UTF-8”.
- MySQL’s “utf8” means “a proprietary character encoding”. This encoding can’t encode many Unicode characters.I’ll make a sweeping statement here: all MySQL and MariaDB users who are currently using “utf8” should actually use “utf8mb4”. Nobody should ever use “utf8”.
De Como suportar Unicode completo em bancos de dados MySQL
Turns out MySQL’s utf8 charset only partially implements proper UTF-8 encoding. It can only store UTF-8-encoded symbols that consist of one to three bytes; encoded symbols that take up four bytes aren’t supported.
As shown above, this behavior can lead to data loss, but it gets worse — it can result in security vulnerabilities. Here are some examples, all of which were discovered after publishing this write-up:
- PHP object injection vulnerability in WordPress < 3.6.1, leading to remote >code execution in combination with certain WordPress plugins
- Email authentication bypass in Phabricator Stored XSS in WordPress 4.1.2
- Remote command execution in the Joomla! CMS
TL;DR MySQL’s
utf8
encoding is awkwardly named, as it’s different from proper UTF-8 encoding. It doesn’t offer full Unicode support, which can lead to data loss or security vulnerabilities.