We are currently investigating how to support Chinese characters.
We have already done quite some investigation ourselves, but we would like to check with the gurus here.

----------------------------------------------------------------------------
Current environment
----------------------------------------------------------------------------
CREATE TABLE `shipment_ref` ( fields with int, varchar (up to 10000 length), blob, decimal, smallint, tinyint, timestamp )

PRIMARY KEY, KEY
) ENGINE=InnoDB AUTO_INCREMENT=15186641 DEFAULT CHARSET=latin1


SHOW GLOBAL VARIABLES LIKE 'character_set%';
character_set_client latin1
character_set_connection latin1
character_set_database latin1
character_set_filesystem binary
character_set_results latin1
character_set_server latin1
character_set_system utf8
character_sets_dir /usr/share/mysql/charsets/


SHOW SESSION VARIABLES LIKE 'character_set%';
character_set_client utf8
character_set_connection utf8
character_set_database latin1
character_set_filesystem binary
character_set_results utf8
character_set_server latin1
character_set_system utf8
character_sets_dir /usr/share/mysql/charsets/

Version: 5.1.73-1+deb6u1-log (Debian)
Compiled for: debian-linux-gnu (x86_64)

----------------------------------------------------------------------------
New environment
----------------------------------------------------------------------------
CREATE TABLE `shipment_ref` ( fields with int, varchar (up to 4000 length), blob, decimal, smallint, tinyint, timestamp) COLLATE utf8_unicode_ci NOT NULL
PRIMARY KEY, KEY
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

SHOW GLOBAL VARIABLES LIKE 'character_set%';
character_set_client utf8
character_set_connection utf8
character_set_database utf8
character_set_filesystem binary
character_set_results utf8
character_set_server utf8
character_set_system utf8
character_sets_dir /usr/share/mysql/charsets/

SHOW SESSION VARIABLES LIKE 'character_set%';
character_set_client utf8
character_set_connection utf8
character_set_database utf8
character_set_filesystem binary
character_set_results utf8
character_set_server utf8
character_set_system utf8
character_sets_dir /usr/share/mysql/charsets/

Version: 5.6.20-68.0-56-log Percona XtraDB Cluster (GPL), Release 25.7, wsrep_25.7.r4126
Compiled for: debian-linux-gnu (x86_64)

The reason to move to Percona is to use the cluster for high availability and scalability.
We plan to set up an active-active-active Percona cluster behind a load balancer; a basic proof of concept was successful.

The reason to move to a new database version is that the utf8mb4 character set is available in 5.6, whereas 5.1 does not support it.
The utf8mb4 proof of concept is still under investigation, hence this post.

As you can see, the current environment is set up with the latin1 character set.
The reason to use utf8mb4 is that we need to support Simplified Chinese, Traditional Chinese and Cyrillic characters, while keeping the characters we already store in latin1.

Now comes what is, in my opinion, a hell of a job (and you can probably guess the question):
We need to convert our complete database to the new character set / collation. What is the best approach?


I've read the blog on charcoll (http://mysql.rjweb.org/doc.php/charcoll).
Nevertheless, this change is quite big for us.
The blog dates back to 2013; conceptually probably nothing has changed, but this area is evolving quickly.
Therefore I'd like to ask some questions, so we can put our heads together and find the most appropriate approach and solution.

Any help from you and other fellow forum contributors would be greatly appreciated!

Character set
I think it's best to set to "UTF-8 Unicode, utf8mb4_general_ci"
This is how the character set is named in MySQL.

The global variables need to be set as follows (the collation, utf8mb4_general_ci or another one, goes into the separate collation_* variables). I think this is OK, any remarks?
character_set_client utf8mb4
character_set_connection utf8mb4
character_set_database utf8mb4
character_set_filesystem binary
character_set_results utf8mb4
character_set_server utf8mb4
character_set_system utf8
character_sets_dir /usr/share/mysql/charsets/
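
If I read the documentation correctly, the character_set_* variables take a character set name (utf8mb4), the collation goes into the matching collation_* variables, and character_set_system always stays utf8. A minimal sketch of how I would check and set this on a running 5.6 node, assuming utf8mb4_unicode_ci as a placeholder collation (the permanent settings would still have to go into my.cnf):

```sql
-- Confirm that the server knows utf8mb4 and list the collations it offers.
SHOW CHARACTER SET LIKE 'utf8mb4';
SHOW COLLATION WHERE Charset = 'utf8mb4';

-- Server-wide defaults, used e.g. when a database is created without an
-- explicit character set. These do not survive a server restart.
SET GLOBAL character_set_server = utf8mb4;
SET GLOBAL collation_server     = utf8mb4_unicode_ci;

-- Verify the result.
SHOW GLOBAL VARIABLES LIKE 'character_set%';
SHOW GLOBAL VARIABLES LIKE 'collation%';
```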


I don't know for sure which collation to use, any advice?
- utf8mb4_general_ci
- utf8mb4_bin
- utf8mb4_unicode_ci
- utf8mb4_unicode_520_ci
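
To compare the candidates, I would run the same comparison under each collation. This is only a sketch: the literals are arbitrary test values, and as far as I understand from the documentation, general_ci and unicode_ci are expected to disagree on whether 'ß' equals 'ss'.

```sql
-- The client must send the literals as UTF-8 for this to be meaningful.
SET NAMES utf8mb4;

-- 1 means "equal under this collation", 0 means "different".
SELECT
  _utf8mb4 'ß' COLLATE utf8mb4_general_ci     = _utf8mb4 'ss' AS general_ci,
  _utf8mb4 'ß' COLLATE utf8mb4_unicode_ci     = _utf8mb4 'ss' AS unicode_ci,
  _utf8mb4 'ß' COLLATE utf8mb4_unicode_520_ci = _utf8mb4 'ss' AS unicode_520_ci,
  _utf8mb4 'ß' COLLATE utf8mb4_bin            = _utf8mb4 'ss' AS binary_cmp;
```

The same pattern with our own sample data (Chinese, Cyrillic, accented Latin) should show which collation sorts and compares the way our application expects.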


Do we need to change the connection strings in our applications?
In other words, do we need to set specific session parameters when building the connection from the client programs?
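
What I have in mind at the session level is roughly the following, issued right after connecting. This is a sketch with utf8mb4_unicode_ci as a placeholder collation; whether our drivers can do the equivalent through the connection string is part of the question.

```sql
-- Sets character_set_client, character_set_connection and
-- character_set_results for this session in one statement.
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Verify what the session actually uses.
SHOW SESSION VARIABLES LIKE 'character_set%';
SHOW SESSION VARIABLES LIKE 'collation_connection';
```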

What is the best approach?
Set up the 3 Percona cluster databases with the correct character sets and
- fix the current data while dumping it from the current environment,
- or fix the current data while loading it into the new environment.

or
Set up a new Percona cluster with the same character sets and collation as the current environment.
Dump the current environment, load the databases into the Percona cluster, and update the variables and tables afterwards.

or
Dump the current environment, fix the current environment, dump it again, and set up the Percona cluster with that second dump.
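
For the variants that fix the tables after loading, the per-table step I am thinking of looks like this. It is a sketch under two assumptions: the stored bytes are genuine latin1 (not UTF-8 that was mislabelled as latin1), and `shipping` is a hypothetical name for one of our databases.

```sql
-- Change the database default, which applies to tables created later.
ALTER DATABASE shipping CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Rebuild one table and transcode its character columns from latin1 to utf8mb4.
-- BLOB columns hold binary data and are not transcoded.
ALTER TABLE shipping.shipment_ref
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Check the resulting column definitions, since column types or index
-- prefixes may need adjusting when characters take up to 4 bytes.
SHOW CREATE TABLE shipping.shipment_ref;
```

Each ALTER TABLE rebuilds the whole table, so the largest tables would dominate the runtime.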


As our current character set is latin1, do you foresee problems?
I know that Oracle has a program, csscan, to check whether your current character set can be converted to the new character set.
Is there such a tool for MySQL? (I've googled, but could not find one.)
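
Lacking such a tool, the closest I could come up with is a manual scan: first an inventory of all character columns, then a search for rows that contain bytes outside plain ASCII, since those are the only values a conversion can change. A sketch, where `id` and `ref_value` are hypothetical column names:

```sql
-- Inventory: every column that has a character set, with its collation.
SELECT table_schema, table_name, column_name, column_type,
       character_set_name, collation_name
FROM information_schema.columns
WHERE character_set_name IS NOT NULL
  AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema')
ORDER BY table_schema, table_name, ordinal_position;

-- Rows where ref_value contains at least one byte with the high bit set.
-- HEX() yields two hex digits per byte; a first digit of 8-F means non-ASCII.
SELECT id, ref_value, HEX(ref_value) AS raw_bytes
FROM shipment_ref
WHERE HEX(ref_value) REGEXP '^(..)*[89A-F]'
LIMIT 100;
```

The second query is a full table scan, so on our larger tables it would have to run off-peak or against a copy.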

Are there more things to think about than just setting and / or fixing the encoding on the database objects?
Are there settings at the operating system or programming level (Java, JMX, web services) that we should take into consideration?

Please keep in mind that we have quite a big database (no rocket science, but quite some rows).
To sum up: 9 databases, 172 tables, 230 million records, 100 GB.

Our customers can accept a total downtime of at most 5 hours during the night.
Is that optimistic?
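
As a rough yardstick: if the whole 5-hour window were available for the load alone, 100 GB works out to about 5.7 MB per second and 230 million records to about 12,800 rows per second, before any index rebuilds or verification. To see where the time would go, I would start from the size per database, for example:

```sql
-- Size and approximate row count per database.
-- table_rows is only an estimate for InnoDB tables.
SELECT table_schema,
       COUNT(*) AS table_count,
       ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb,
       SUM(table_rows) AS approx_rows
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
  AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema')
GROUP BY table_schema
ORDER BY size_gb DESC;

-- Sustained throughput needed to move everything within 5 hours.
SELECT ROUND(100 * 1024 / (5 * 3600), 1) AS mb_per_second,
       ROUND(230000000 / (5 * 3600))     AS rows_per_second;
```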

If you have any questions or remarks, please share them; any help would be appreciated.