Fatal UnicodeDecodeError in bblfsh with a badly UTF-8 encoded file


Yes, I know, “garbage in, garbage out!”, but…

I use PGA as the source for a training corpus. PGA contains an archive of a Java project, ABPlayer: 33bf4f54ccfdaf527041d8b7cacc805594cd4bba.siva
This is a backup of the GitHub project https://github.com/winkstu/ABPlayer.

This project contained (and still contains) a Java source file, ABPlayer/libs/OneXListviewLibrary/src/me/maxwin/view/XListViewFooter.java, which is badly encoded.

Running the file command on it outputs: Java source, ISO-8859 text

Unfortunately, this file makes bblfsh crash, and I don’t know how to avoid it because a second exception is raised while the first one is being handled.

Here is an extract with the concerned lines:

public class UnicodeDecodeError {
	public void setText(String text){

And here is a simple Python script to reproduce the behavior:

import bblfsh

client = bblfsh.BblfshClient("")
try:
    print('client.parse UnidecodeEncodeError.java')
    ctx = client.parse('UnidecodeEncodeError.java')
except SystemError:
    print('!!!parse exit SystemError!')
except UnicodeDecodeError:
    print('!!!parse exit UnicodeDecodeError!')

And the trace:

client.parse UnidecodeEncodeError.java
Traceback (most recent call last):
  File "/home/mesnard/.local/lib/python3.6/site-packages/bblfsh/client.py", line 41, in _ensure_utf8
    return text.decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 80: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "eval_badEncodedComment.py", line 6, in <module>
    ctx = client.parse('UnidecodeEncodeError.java')
  File "/home/mesnard/.local/lib/python3.6/site-packages/bblfsh/client.py", line 80, in parse
    contents = self._get_contents(contents, filename)
  File "/home/mesnard/.local/lib/python3.6/site-packages/bblfsh/client.py", line 52, in _get_contents
    contents = BblfshClient._ensure_utf8(contents)
  File "/home/mesnard/.local/lib/python3.6/site-packages/bblfsh/client.py", line 43, in _ensure_utf8
    raise NonUTF8ContentException("Content must be UTF-8, ASCII or Base64 encoded")
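
As far as I can tell from the trace, the UnicodeDecodeError is caught inside the client and a different exception is raised from within that handler, so an except UnicodeDecodeError clause in the calling script never fires. The sketch below reproduces that pattern without bblfsh. The NonUTF8ContentException class here is a local stand-in named after the one in the trace, and ensure_utf8 and try_parse are hypothetical helpers, not the library's code.

```python
class NonUTF8ContentException(Exception):
    """Local stand-in for the exception class named in the trace."""


def ensure_utf8(data: bytes) -> str:
    # Same shape as the trace: the decode error is caught and a
    # different exception is raised from inside the handler.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        raise NonUTF8ContentException(
            "Content must be UTF-8, ASCII or Base64 encoded")


def try_parse(data: bytes) -> str:
    try:
        ensure_utf8(data)
        return "ok"
    except UnicodeDecodeError:
        # Never reached for bad input: the original error is only
        # available as the __context__ of the exception raised later.
        return "caught UnicodeDecodeError"
    except NonUTF8ContentException as exc:
        return "caught NonUTF8ContentException (context: %s)" % (
            type(exc.__context__).__name__)


if __name__ == "__main__":
    print(try_parse(b"public class A {}"))
    print(try_parse(b"// \xcc("))  # 0xcc followed by a non-continuation byte
```

If this matches what the client does, catching the second exception type (or a broad Exception) in the calling script should be enough to skip the file instead of aborting the run.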

I used bblfshd 2.16.1

My workaround for now is to exclude this archive from my training corpus.
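
Rather than excluding the whole archive, it might be enough to check each file's bytes before handing them to the parser. This is a minimal sketch under assumptions: decode_source is a hypothetical helper, and the ISO-8859-1 fallback is only a guess based on the file command's output (the real encoding of the file may be something else). Whether the decoded text can then be passed to client.parse through its contents argument is also an assumption based on the parameter name visible in the trace.

```python
from typing import Optional


def decode_source(data: bytes,
                  fallback: Optional[str] = "iso-8859-1") -> Optional[str]:
    """Return the text of a source file, or None if it should be skipped.

    Tries UTF-8 first. On failure, decodes with the fallback encoding,
    or returns None when no fallback is given.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        if fallback is None:
            return None  # caller skips this file
        # ISO-8859-1 maps every byte to a character, so this cannot fail,
        # but the resulting text may be wrong if the guess is wrong.
        return data.decode(fallback)


if __name__ == "__main__":
    raw = b"// caf\xe9\npublic class A {}"
    print(decode_source(raw))                 # re-decoded text
    print(decode_source(raw, fallback=None))  # None: skip the file
```

With fallback=None only the offending file is dropped, which keeps the rest of the project in the corpus.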

Thanks for your advice