Difference between UAST modes: native, annotated, semantic

Bblfsh project provides a self-hosted server for source code parsing that for a given file can outputs AST representation using of the different modes that control the amount of normalization of the tree structure.

What is the difference between those modes, and when I should choose one mode over another?

It’s easier to observe modes as the results of the specific transformation sequences.
Let’s describe the transformations list first:

  1. Preprocess stage normalizes native AST in such ways:

    • changes type key to uast.KeyType
    • restructures positional information
    • fixes any issues with native AST structure
  2. PreprocessCode stage runs code-assisted transformations after the Preprocess stage:

    • fixes node tokens
    • fixes node positional information based on the source
  3. Normalize stage:

    • converts a known native AST structures to a canonicalized high-level AST representation (UAST).
    • It is executed after PreprocessCode and before the Annotations stage.
  4. Annotations stage:

    • applies UAST role annotations and is executed after Normalize transformation, or after PreprocessCode if Normalize is disabled.
    • It also changes token key to uast.KeyToken.
    • It should not be done in the Preprocess stage, because Semantic annotations are easier on clean native AST.
  5. Namespace for native AST nodes of this language. Only enabled in Semantic mode.

    • Namespace will be set at the end of the pipeline, thus all transforms can use type names without the driver namespace.

Now we can “construct” modes from transformations sequences:

  1. ModeNative does not require any transformations. In this case the tree is equal to the one that is returned from a language AST generator:
    So, AST == AST in ModeNative

  2. ModePreprocessed
    AST → PreprocessPreprocessCode = AST in ModePreprocessed

  3. ModeAnnotated
    AST → PreprocessPreprocessCodeAnnotations = AST in ModeAnnotated

  4. ModeSemantic
    AST → PreprocessPreprocessCodeNormalizeAnnotationsNamespace = AST in ModeSemantic

Reference: code documentation

3 Likes

Thanks for explaining it, @lwsanty!

But I think the answer is a bit too low-level and may not answer questions that end users might have in mind.

For example, the Preprocess and PreprocessCode stages are split only in the code to properly isolate internal interfaces. For the user, it’s a single stage.

Same is true for Namespace. Although it’s technically a separate stage, it can be considered a part of the Semantic, regardless of the fact that it runs at the end. It’s an implementation detail that may change in the future.

Also, the answer explains what are the stages, but not why they exist and what is the difference.

In other words, why should a user choose one parsing mode over the other?

1 Like

I believe first it is worthwhile to give an intuition for the difference between:

  • Syntax: in an analogy with natural language, whether the words of a phrase may be correctly sorted.
  • Semantics: in our analogy, which meaning does a group of phrases carry when put together in a paragraph (order matters, for example).

Now let us jump to my understanding of the differences between each mode of AST output:

  • Native AST: it is close to the raw tree generated by one compiler (the one Babelfish uses internally in each driver) for a given language. It is highly language-dependent: nodes representing import dependencies for Java can have a different structure than nodes representing dependencies for Python, relative position of nodes (topology of the tree) may vary from language to language, e.g. argument nodes for functions can appear in different places with respect to the function call node (as children, hanging from another common node, hanging from the call node itself, …). Linking to @lwsanty’s explanation, the nodes of this tree could have incomplete positional information, but of course they are useful for research. Moreover, native ASTs have (almost) the guarantee that they are reversible to code and there is a bijection between a syntactically correct file (with some normalizations such as comment deletion, cleaning of indentation, etc) and its AST.

  • Annotated AST: original native AST topology is preserved and this is still a highly language-dependent tree (for example @type key of most nodes is still something dependent on the compiler’s naming style). However, nodes are packed with some valuable information:

    • roles: keywords corresponding to the function of each node in the code inside the @role key of a node: Argument, Call, Callee, Function, If, Import, …
    • positions (what does this node correspond to in the code?) for each node, given in a language independent notation (@pos key inside a node, with @type marked as uast:Positions).
      Example of positions annotation
       '@pos': { '@type': "uast:Positions",
                      start: { '@type': "uast:Position",
                         offset: 15,
                         line: 1,
                         col: 16,
                      },
                      end: { '@type': "uast:Position",
                         offset: 19,
                         line: 1,
                         col: 20,
                      },
               }
      

    These trees can be used to perform queries where we rely on the specific topology of the tree for a given language and annotations can be useful in our query: e.g. give me all the integer literals for this file written in Python.

  • Semantic AST: topology of the tree is rearranged so relative order of nodes carries the same meaning from one language to another (for certain constructs). Also @type value of some nodes is homogenized across languages with values such as uast:Function, uast:Identifier, uast:Argument, etc. It intends to capture as much information as possible as the original tree, but giving assurances with respect to the standardization of certain language constructs. Aim of this type of tree is performing queries (over the normalized constructs) in a language agnostic way: e.g. give me the imports used in these projects that could be written in Go, Python, C++, Java, etc.

The transformations mentioned by @lwsanty exist, to the best of my knowledge, because as I said structure of native ASTs is very heterogeneous across languages, and we need several steps to get to a point where we can abstract some information in an homogeneous manner.

2 Likes