Improving the word-breaker (SMF patch)

I ran into a nice suggestion from a long time ago (~12 years):

Quote from: mrb on January 23, 2013, 02:52:20 AM

SMF breaks up long words by inserting a space every 79 characters (it is a space in a <span> with a negative margin). Example: here are 120 'a' characters:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

When copying/pasting the chars, the space is visible at the 80th position, which is very annoying...

Instead, SMF should insert the standardized <wbr> tag (word break opportunity) already recognized by most browsers. In theory <wbr> is identical to U+200B (ZERO-WIDTH SPACE) but this is false; for example the current Chrome version on Linux (Version 23.0.1271.97) replaces U+200B with '#' when copying & pasting to a non-UTF8 application, whereas <wbr> is nicely invisible...

Maybe newer members haven't encountered this particular issue before, but I know exactly what the above poster is complaining about, and I'm sure that many other people do, too.

Basically (with some exceptions that aren't worth getting into), if you post an unbroken sequence of 80 or more letters, digits, underscores and/or periods, then it'll automatically get divided into chunks of 79 characters each (or fewer, in the case of the final chunk), with a "breaker" inserted between them.
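
As a rough illustration of that chunking (a simplified sketch, not SMF's actual code: break_long_words() is a made-up helper, it ignores the lookbehind that SMF uses, and it only handles single-byte text), here's mrb's 120-character example run through the same 80/79 rule:

```php
<?php
// Hypothetical helper (not part of SMF): for any run of $limit or more
// letters, digits, underscores and periods, put $breaker between chunks
// of ($limit - 1) characters.
function break_long_words(string $text, int $limit = 80, string $breaker = ' '): string
{
    return preg_replace_callback(
        '~[\w.]{' . $limit . ',}~',
        function (array $m) use ($limit, $breaker): string {
            // str_split() yields chunks of ($limit - 1) bytes; the last one may be shorter.
            return implode($breaker, str_split($m[0], $limit - 1));
        },
        $text
    );
}

$broken = break_long_words(str_repeat('a', 120));
$chunks = explode(' ', $broken);

echo strlen($chunks[0]), ' + space + ', strlen($chunks[1]), "\n"; // 79 + space + 41
```

A 79-character run is left alone, because the pattern only matches runs of 80 or more.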

So, for example, if you were to post the SHA-512 of the text "How would you like to suck my balls, Mr. Garrison?", then instead of it appearing like this:

It would appear like this:

(See that tiny space at around the two-thirds point? Between dbcc and 9e06?)

That "breaker" space will, among other things, cause problems with copy-pasting (because there's now a space in the content that the author didn't intend for there to be), and will also affect double-click selecting, like this:

So, what I've done is add a new kind of "breaker" to SMF (to go along with the three existing variations, one of which gets picked based on which browser the BBCode parser thinks it's producing markup for). This new breaker avoids the problems described above and is used by default (but I've left the older breakers accessible via a context variable, in case theymos wishes to selectively flip between the new and the old behavior).

Here's the diff:

Code:

--- baseline/Sources/Subs.php 2011-09-17 21:59:55.000000000 +0000
+++ modified/Sources/Subs.php 2025-02-16 23:39:26.000000000 +0000
@@ -1860,24 +1860,27 @@
if (!empty($modSettings['fixLongWords']) && $modSettings['fixLongWords'] > 5)
{
// This is SADLY and INCREDIBLY browser dependent.
if ($context['browser']['is_gecko'] || $context['browser']['is_konqueror'])
$breaker = ' ';
// Opera...
elseif ($context['browser']['is_opera'])
$breaker = ' ';
// Internet Explorer...
else
$breaker = ' ';

+ if ($context['bbc_use_modern_breaker'] ?? true)
$breaker = '<wbr>';
+
// PCRE will not be happy if we don't give it a short.
$modSettings['fixLongWords'] = (int) min(65535, $modSettings['fixLongWords']);

// The idea is, find words xx long, and then replace them with xx + space + more.
if (strlen($data) > $modSettings['fixLongWords'])
{
// This is done in a roundabout way because $breaker has "long words" :P.
$data = strtr($data, array($breaker => '< >', '&nbsp;' => $context['utf8'] ? "\xC2\xA0" : "\xA0"));
$data = preg_replace(
'~(?<=[>;:!? ' . $non_breaking_space . '\]()]|^)([\w\.]{' . $modSettings['fixLongWords'] . ',})~e' . ($context['utf8'] ? 'u' : ''),
"preg_replace('/(.{" . ($modSettings['fixLongWords'] - 1) . '})/' . ($context['utf8'] ? 'u' : '') . "', '\\\$1< >', '\$1')",
$data);
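
To make the copy-paste difference concrete, here's a small sketch (not part of the patch). It assumes the new breaker is a <wbr> tag and that an old breaker is a space wrapped in a negatively-margined <span>; the style values below are only illustrative. "Copying the rendered text" is approximated with strip_tags(), and render() is a made-up helper:

```php
<?php
// Assumed breakers: the span's style values are illustrative, not quoted from SMF.
$old_breaker = '<span style="margin: 0 -0.5ex 0 0;"> </span>';
$new_breaker = '<wbr>';

// 128 characters, standing in for a SHA-512 hex digest.
$hash_like = str_repeat('0123456789abcdef', 8);

// Hypothetical helper: insert the breaker after the 79th character only,
// which is all a 128-character word needs.
function render(string $word, string $breaker): string
{
    return substr($word, 0, 79) . $breaker . substr($word, 79);
}

// Stripping the tags roughly models what ends up on the clipboard.
$copied_old = strip_tags(render($hash_like, $old_breaker));
$copied_new = strip_tags(render($hash_like, $new_breaker));

var_dump($copied_old === $hash_like); // bool(false): a stray space sits at the 80th position
var_dump($copied_new === $hash_like); // bool(true): the text is contiguous again
```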

When doing these sorts of fixes, I always hem and haw on whether or not it makes sense to keep using the older behavior on old posts, and only use the newer behavior on new posts...

I guess, I don't much like the idea of posts changing in ways that the author didn't account for (for example, I hate that one of my own posts was mangled by the wordfilter at some point after I authored it). I (mostly) lean toward wanting to keep old posts displaying as they did at the time they were authored. For example, in the post I quoted from above, mrb is demonstrating the problem he's describing, as in, if you try to copy-paste his example-sequence, then you'll get the result he's talking about: 79 characters, followed by a space, followed by 41 characters. But, after this fix is applied, if you then tried to copy-paste mrb's example, you'll find that you get 120 contiguous characters, so it'll look (from the perspective of someone reading that post cold) like mrb must have been confused or mistaken when he constructed that example.

I dunno, maybe I'm just overthinking things, but, in case theymos feels that the old behavior is worth preserving, here are two additional diffs that will make sure (at least, in the two places that are important, I think) that old posts won't be affected by this fix:

Code:

--- baseline/Sources/Display.php 2011-02-07 16:45:09.000000000 +0000
+++ modified/Sources/Display.php 2025-02-17 01:11:58.000000000 +0000
@@ -878,24 +878,26 @@
else
{
$memberContext[$message['ID_MEMBER']]['can_view_profile'] = allowedTo('profile_view_any') || ($message['ID_MEMBER'] == $ID_MEMBER && allowedTo('profile_view_own'));
$memberContext[$message['ID_MEMBER']]['is_topic_starter'] = $message['ID_MEMBER'] == $context['topic_starter_id'];
}

$memberContext[$message['ID_MEMBER']]['ip'] = $message['posterIP'];

// Do the censor thang.
censorText($message['body']);
censorText($message['subject']);

+ $context['bbc_use_modern_breaker'] = (int)$message['ID_MSG'] >= 65073000;
+
// Run BBC interpreter on the message.
$message['body'] = parse_bbc($message['body'], $message['smileysEnabled'], $message['ID_MSG']);

// Compose the memory eat- I mean message array.
$output = array(
'attachment' => loadAttachmentContext($message['ID_MSG']),
'alternate' => $counter % 2,
'id' => $message['ID_MSG'],
'href' => $scripturl . '?topic=' . $topic . '.msg' . $message['ID_MSG'] . '#msg' . $message['ID_MSG'],
'link' => '<a href="' . $scripturl . '?topic=' . $topic . '.msg' . $message['ID_MSG'] . '#msg' . $message['ID_MSG'] . '">' . $message['subject'] . '</a>',
'member' => &$memberContext[$message['ID_MEMBER']],
'icon' => $message['icon'],

Code:

--- baseline/Sources/Profile.php 2013-10-21 19:01:11.000000000 +0000
+++ modified/Sources/Profile.php 2025-02-17 01:12:02.000000000 +0000
@@ -1479,24 +1479,26 @@
}

// Start counting at the number of the first message displayed.
$counter = $reverse ? $context['start'] + $maxIndex + 1 : $context['start'];
$context['posts'] = array();
$board_ids = array('own' => array(), 'any' => array());
while ($row = mysql_fetch_assoc($request))
{
// Censor....
censorText($row['body']);
censorText($row['subject']);

+ $context['bbc_use_modern_breaker'] = (int)$row['ID_MSG'] >= 65073000;
+
// Do the code.
$row['body'] = parse_bbc($row['body'], $row['smileysEnabled'], $row['ID_MSG']);

// And the array...
$context['posts'][$counter += $reverse ? -1 : 1] = array(
'body' => $row['body'],
'counter' => $counter,
'category' => array(
'name' => $row['cname'],
'id' => $row['ID_CAT']
),
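
Both of those diffs do the same thing, so here's the gating logic on its own as a sketch. use_modern_breaker() and the constant name are hypothetical (the patch just inlines the comparison); the cutoff ID is the one from the diffs, and the sample IDs are arbitrary:

```php
<?php
// Cutoff taken from the diffs above: messages with this ID or higher get the new breaker.
const MODERN_BREAKER_FIRST_MSG = 65073000;

// Hypothetical helper (the patch inlines this comparison instead).
function use_modern_breaker($id_msg): bool
{
    // Row values from mysql_fetch_assoc() arrive as strings, hence the cast.
    return (int) $id_msg >= MODERN_BREAKER_FIRST_MSG;
}

$context = array();
foreach (array('1000', '65072999', '65073000') as $id_msg)
{
    $context['bbc_use_modern_breaker'] = use_modern_breaker($id_msg);
    echo $id_msg, ' => ', $context['bbc_use_modern_breaker'] ? 'new' : 'old', " breaker\n";
}

// Code paths that never set the flag fall back to the new behavior,
// mirroring the "?? true" check in the Subs.php diff.
unset($context['bbc_use_modern_breaker']);
var_dump($context['bbc_use_modern_breaker'] ?? true); // bool(true)
```

So anything that calls parse_bbc() without setting the flag first (previews, signatures, and so on) would get the new breaker, and only the two patched places would ever fall back to the old one.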