Code point \u0005 is not a valid character in XML

XML can be tricky, and it plays a heavy role in FreshPorts. Since the early days, converting to XML and then loading into the database has been a priority. This choice means the database-loading code doesn’t need to know much about the source of the data. It also means we can change the data source without modifying the database-loading code.

You might ask: why aren’t you using JSON?

FreshPorts predates JSON by several years.

Today I found an issue caused by some characters in a commit. I am not blaming the committer. There is no fault there. This is about the code and how it can be improved to not fall over.

The commit email

Looking at the email file on disk, I found this:

Author: pkubaj
Date: Wed Jan  8 21:36:57 2020
New Revision: 522460
URL: https://svnweb.freebsd.org/changeset/ports/522460

Log:
  multimedia/obs-studio: fix build on powerpc64E
  

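That E after 64, that’s it. It is not a letter: it stands for a non-printing control character (Ctrl-E, code point U+0005) at the end of the subject line.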

The commit in the repo

The commit was on multimedia/obs-studio at Wed Jan 8 21:36:57 2020 UTC. Viewing that commit within the repo, nothing seems amiss.

Not seeing it there makes me question: is that website front end hiding the issue?

svn log

Let’s try viewing the svn log:

[dan@pkg01:~/ports/head] $ svn log -r 522460 
------------------------------------------------------------------------
r522460 | pkubaj | 2020-01-08 21:36:57 +0000 (Wed, 08 Jan 2020) | 10 lines

multimedia/obs-studio: fix build on powerpc64

Merge upstream commit to use GCC's SSE->AltiVec translation. Since it depends on compiling with GCC, it only works on ELFv1. Hopefully it will be possible to build it on ELFv2 in the future.

Also use luajit only where it's actually available. Since it's optional, the port builds anyway.

PR:		243199
Approved by:	yuri (maintainer)
MFH:		2020Q1 (fix build blanket)

------------------------------------------------------------------------
[dan@pkg01:~/ports/head] $ 

OK, we don’t see it there, but if I pipe the output through less, I see it:

[dan@pkg01:~/ports/head] $ svn log -r 522460 | less
------------------------------------------------------------------------
r522460 | pkubaj | 2020-01-08 21:36:57 +0000 (Wed, 08 Jan 2020) | 10 lines

multimedia/obs-studio: fix build on powerpc64^E

Merge upstream commit to use GCC's SSE->AltiVec translation. Since it depends on compiling with GCC, it only works on ELFv1. Hopefully it will be possible to build it on ELFv2 in the future.

Also use luajit only where it's actually available. Since it's optional, the port builds anyway.

PR:             243199
Approved by:    yuri (maintainer)
MFH:            2020Q1 (fix build blanket)

------------------------------------------------------------------------
(END)

That’s interesting. Piping the output through more gave similar results.
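A pager such as less displays control characters in caret notation, which is why the stray byte only became visible once the output was piped. Below is a minimal sketch of the same check without a pager. It assumes the stray character is a single ENQ byte (octal 005), as the ^E suggests, and sample_subject is a hypothetical stand-in for the real svn log output:

```shell
#!/bin/sh
# Hypothetical stand-in for the first line of the svn log output:
# the commit subject followed by one ENQ byte (octal 005, Ctrl-E).
sample_subject() {
    printf 'multimedia/obs-studio: fix build on powerpc64\005\n'
}

# cat -v prints control characters in caret notation, much as less does,
# so the trailing byte appears as ^E.
sample_subject | cat -v

# od -c dumps every byte, so the 005 is visible there as well.
sample_subject | od -c
```

Against the real repository, the equivalent would presumably be svn log -r 522460 | cat -v.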

The error

The error I see is:


Code point \u0005 is not a valid character in XML at /usr/local/lib/perl5/site_perl/FreshPorts/process_svn_mail.pm line 183.

The XML generated stops short: it terminates with the LOG message, which is the text that contains the non-printing character.

What’s next?

I’m going to consult with others and see how to fix this. The code in question is Perl and I’m not the best at that.

Solved with this patch

--- ingress/modules/trunk/process_svn_mail.pm	2019/01/02 21:26:27	5200
+++ ingress/modules/trunk/process_svn_mail.pm	2020/01/09 17:16:10	5244
@@ -52,6 +52,11 @@
 
 	$Log = &GetLog($message);
 
+	# re https://news.freshports.org/2020/01/09/code-point-u0005-is-not-a-valid-character-in-xml/
+	#    https://twitter.com/FreshPorts/status/1215286202691211264
+	#	
+	$Log =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;
+
 #print "log: '$Log'\n";
 
 	if ($Log eq '') {

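The most relevant hint came from Garrett Wollman: “Just look at the XML spec, it says exactly which control characters are allowed.” Based on that, I found the suggestion for the code I used above. The character class lists everything XML 1.0 permits (tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF), and the substitution deletes any character outside that set.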

Very nice. Very simple.
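The character class from the patch can be tried on the command line before touching the module. This is only a sketch: strip_invalid_xml_chars is a hypothetical helper name, not part of FreshPorts, and it assumes the input is UTF-8 and that perl is on the PATH.

```shell
#!/bin/sh
# strip_invalid_xml_chars: hypothetical helper, not part of FreshPorts.
# Reads UTF-8 text on stdin and deletes every character outside the set
# XML 1.0 allows, using the same character class as the patch.
# -CSD tells perl to treat stdin and stdout as UTF-8, so the class is
# matched against characters rather than raw bytes.
strip_invalid_xml_chars() {
    perl -CSD -pe 's/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g'
}
```

Running printf 'powerpc64\005\n' | strip_invalid_xml_chars | cat -v should then show the subject without the trailing ^E, while tabs and newlines pass through untouched.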

I implemented it on the dev server. It worked. Progressing through test, stage, and on to prod, it kept working.

Thank you.
