Inserting emojis that uses 4-byte utf-8 does not work with MySQL

Description

Adding an emoji(common when using mobile devices) that uses 4 bytes in the editor does not work. Behaviour differs from tool to tool. For example in Lessons tool when pasting the "grinning face with smiling eyes" from http://apps.timwhitlock.info/emoji/tables/unicode in to the editor user will be presented with the error:

Unable to save or update item. Error was: java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\x81</...' for column 'html' at row 1

The correct solution to this is to convert from utf8 to utf8mb4(https://mathiasbynens.be/notes/mysql-utf8mb4).

Please feel free to edit to change to the correct components, I am not sure how to label this issue.

Activity

Show:

Sam Ottenhoff September 13, 2016 at 12:53 PM

Confirmed on nightly instances: I can't post a message to Forums with a 4-byte emoji to Sakai 11. To Sakai master the emoji is stripped as expected and my post succeeds.

Sam Ottenhoff August 8, 2016 at 11:27 AM

Yes, let's backport this fix to 10.x. Can someone confirm this works on Sakai 11?

Matthew Buckett August 8, 2016 at 11:20 AM
Edited

We just had a failure locally on 10.x because an email for the email archive tool contained an emoji and it failed to get inserted into the DB.

Matthew Jones April 1, 2016 at 12:56 PM
Edited

It looks like as of MySQL 5.7.7 the large_prefix is default to be on, and the old prefix size (767) is deprecated. So I think this plan is

Sakai 11, have a property that will remove 4-byte utf-8 characters (or convert to something else) on by default. Shouldn't be needed for Oracle though.

In Sakai 12+ maybe have a complete conversion for people running with this property or MySQL 5.7.7+.

Matthew Jones March 24, 2016 at 8:38 AM
Edited

Yeah, that sounds like a good quick fix for 11. What would we replace them with, A space? Some other character? It looks pretty straightforward. But I'd take the patch if they had something in production too.

http://stackoverflow.com/questions/12013341/removing-characters-of-a-specific-unicode-range-from-a-string

Some patterns like this might be enough too, if we also don't care about a replacement.
byte[] converttoBytes = string.getBytes("UTF-8");
string = new String(converttoBytes, "UTF-8");

There is an emoji library (https://github.com/vdurmont/emoji-java) that can remove emojis but I know there's characters outside of the emoji range that would also cause problems.

This article is the more appropriate one for this case.
http://stackoverflow.com/questions/12867000/how-to-remove-surrogate-characters-in-java

Fixed

Details

Priority

Affects versions

Fix versions

Components

Assignee

Reporter

Property addition/change required

Yes

Environment

MySQL
Created January 9, 2015 at 4:35 AM
Updated April 23, 2019 at 3:29 PM
Resolved May 5, 2016 at 2:46 PM