  • Using ANTLR4 in IntelliJ IDEA: a tutorial

    10k+ reads 2016-12-16 20:01:23

    Antlr is a powerful Java-based language-recognition tool that stands out among tools of its kind for its concise syntax and fast execution. When you need to develop a domain language, one that may be as complex as Excel formulas or as simple as the example in this article (arithmetic only), you can consider using Antlr to process it.


    Antlr in brief

    1. ANTLR (ANother Tool for Language Recognition) is a language tool that provides a framework for building recognizers, compilers, and interpreters from grammar descriptions containing actions written in Java, C++, or C#. Parsing computer languages has become a very common task, and the theory and tools in this area have matured over roughly 40 years of development; using a recognizer generator such as Antlr to recognize, parse, and build compilers is easier than coding by hand, and the resulting programs are easier to maintain.
    2. There are many language-recognition tools, for example the famous Lex and YACC; Linux ships their open-source counterparts, Flex and Bison. In the Java community, besides Antlr there are also JavaCC and SableCC.
    3. Like most such tools, Antlr describes languages with context-free grammars. The latest Antlr is an LL(*)-based recognizer. From a user-defined context-free grammar, Antlr automatically generates a lexer and a parser (and, in ANTLR 3, a tree parser; ANTLR 4 instead generates parse-tree listeners and visitors).

    What Antlr can do

    Processing programming languages

    Recognizing and processing programming languages is Antlr's primary job. Language processing is a heavy and complex task, so to simplify it, compiler construction generally splits the work into a front end and a back end. The front end covers lexical analysis, syntax analysis, semantic analysis, and intermediate-code generation; the back end covers target-code generation and optimization.

    Antlr takes on all of the front-end work. Using Antlr's grammar syntax you define the tokens and syntax rules of your target language, and Antlr generates the lexer and parser automatically. Furthermore, if you specify abstract-syntax-tree rules in the grammar, Antlr can build the AST while generating the parser; finally, a tree walker traverses the AST to perform semantic analysis and intermediate-code generation. With Antlr's strong support, the whole job becomes much lighter.

    Text processing

    For text processing, the first thing that comes to mind is regular expressions. With Antlr's lexer generator you can easily do everything a regular expression can do; beyond that, Antlr also handles jobs that are hard for regular expressions, such as recognizing properly paired left and right parentheses.
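    A quick way to see this limitation, independent of ANTLR: matching balanced parentheses needs a counter (or recursion), which a single classic regular expression cannot express. The sketch below is illustrative Python; the helper names are made up for this example.

```python
import re

def balanced(text: str) -> bool:
    """Check that '(' and ')' pair up; this needs a counter, which a
    classic regular expression cannot provide."""
    depth = 0
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:          # a ')' appeared before its '('
                return False
    return depth == 0

# A fixed regex only handles one level of nesting:
one_level = re.compile(r'^\([^()]*\)$')

print(balanced('(a+(b*3))'))           # True
print(balanced('a+(b*3'))              # False
print(bool(one_level.match('((x))')))  # False: the regex misses the nested case
```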


    Installing and using Antlr in IDEA

    1. Install the "ANTLR v4 grammar plugin" under Settings - Plugins.
    2. Create a new Maven project and add the ANTLR4 plugin and runtime dependencies to pom.xml. Prefer a recent version; if you do not know the latest version number, search for "maven antlr4".
        <dependencies>
    
            <dependency>
                <groupId>org.antlr</groupId>
                <artifactId>antlr4-runtime</artifactId>
                <version>4.5.3</version>
            </dependency>
        </dependencies>
        <build>
            <plugins>
                <plugin>
                    <groupId>org.antlr</groupId>
                    <artifactId>antlr4-maven-plugin</artifactId>
                    <version>4.3</version>
                    <executions>
                        <execution>
                            <id>antlr</id>
                            <goals>
                                <goal>antlr4</goal>
                            </goals>
                            <phase>none</phase>
                        </execution>
                    </executions>
                    <configuration>
                        <outputDirectory>src/test/java</outputDirectory>
                        <listener>true</listener>
                        <treatWarningsAsErrors>true</treatWarningsAsErrors>
                    </configuration>
                </plugin>
            </plugins>
        </build>

    antlr4-maven-plugin generates the Java code, while antlr4-runtime is the library needed at run time. The plugin's phase is set to none so that ANTLR4 is not invoked as part of the Maven lifecycle; remove that setting if you want the grammar regenerated on every build.

    3. We define the simplest possible domain language, a small arithmetic example, and use it to walk through Antlr in detail. First create an Antlr grammar file under src\main\java; grammar files usually use the .g4 extension, so name it Demo.g4.

    Grammar definition

    In the grammar file Demo.g4 we define the grammar of arithmetic expressions using Antlr's syntax. The file starts with the grammar keyword, which names the grammar; the name must be the same as the file name:

    grammar Demo;

    For simplicity, assume our custom language accepts only a single arithmetic expression: the whole program consists of one statement, and a statement consists of an expression or a newline, as shown in Listing 1:

    Listing 1. Program and statement

    prog: stat 
    ; 
    stat: expr 
      |NEWLINE 
    ;

    In Anltr, operator precedence is expressed by nesting grammar rules. Addition and subtraction bind more loosely than multiplication and division, so an expression expr is defined as multiplication expressions multExpr joined by the additive operators ('+'|'-'); likewise, parentheses bind tighter than multiplication, so a multiplication expression multExpr is built from atomic operands atom joined by the multiplicative operators ('*'|'/'). The full definition is shown in Listing 2:

    Listing 2. Expressions

    expr : multExpr (('+'|'-') multExpr)* 
    ; 
    multExpr : atom (('*'|'/') atom)* 
    ; 
    atom:  '(' expr ')' 
          | INT  
       | ID  
    ;
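    The rule nesting above is exactly how precedence falls out of the grammar. As an illustration only (plain Python, not ANTLR-generated code), a hand-written recursive-descent evaluator with the same expr / multExpr / atom shape shows why '*' binds tighter than '+':

```python
import re

def evaluate(src: str) -> int:
    """Evaluate an arithmetic expression with the same rule nesting as the
    grammar: expr -> multExpr -> atom. Integers only, '/' is floor division."""
    tokens = re.findall(r'\d+|[()+\-*/]', src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():            # expr : multExpr (('+'|'-') multExpr)* ;
        value = mult_expr()
        while peek() in ('+', '-'):
            value = value + mult_expr() if eat() == '+' else value - mult_expr()
        return value

    def mult_expr():       # multExpr : atom (('*'|'/') atom)* ;
        value = atom()
        while peek() in ('*', '/'):
            value = value * atom() if eat() == '*' else value // atom()
        return value

    def atom():            # atom : '(' expr ')' | INT ;
        if peek() == '(':
            eat()
            value = expr()
            eat()          # consume the ')'
            return value
        return int(eat())

    return expr()

print(evaluate('1+2*3'))    # 7: '*' wins because it sits in the deeper rule
print(evaluate('(1+2)*3'))  # 9: parentheses reach the deepest rule, atom
```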

    Finally we need the lexical definitions. In Antlr, parser rules and lexer rules are distinguished by the first character of the rule name: parser rule names start with a lowercase letter, lexer rule names with an uppercase one. The arithmetic expressions use four kinds of tokens: ID, an identifier naming a variable; INT, an integer constant; the newline NEWLINE; and whitespace WS, which is skipped during processing (skip() is a method of the generated lexer class). See Listing 3:

    Listing 3. Token definitions

    ID:('a'..'z'|'A'..'Z')+;
    INT:'0'..'9'+;
    NEWLINE:'\r'?'\n';
    WS:(' '|'\t'|'\n'|'\r')+{skip();};

    Antlr supports many target languages and can emit the generated recognizers in Java, C#, C, Python, JavaScript, and more; Java is the default. In ANTLR 4 the target is chosen when invoking the tool (e.g. -Dlanguage=Python3), rather than through an options {language=...;} block as in ANTLR 3. Our example targets Java.

    The complete Demo.g4 file:

    grammar Demo;
    
    //parser
    prog:stat
    ;
    stat:expr|NEWLINE
    ;
    
    expr:multExpr(('+'|'-')multExpr)*
    ;
    multExpr:atom(('*'|'/')atom)*
    ;
    atom:'('expr')'
        |INT
        |ID
    ;
    
    //lexer
    ID:('a'..'z'|'A'..'Z')+;
    INT:'0'..'9'+;
    NEWLINE:'\r'?'\n';
    WS:(' '|'\t'|'\n'|'\r')+{skip();};
    

    Running ANTLR

    1. Right-click Demo.g4, choose Configure ANTLR, and set the output path.

    2. Right-click Demo.g4 and choose Generate ANTLR Recognizer, then look at the generated files.
       Demo.tokens assigns numeric codes to the symbols used in the grammar and can be ignored. DemoLexer is the lexer Antlr generated, and DemoParser the generated parser.

    3. Invoke the parser. Create a new Main.java.

    The complete Main.java:

    import org.antlr.v4.runtime.CommonTokenStream;
    import org.antlr.v4.runtime.ANTLRInputStream;
    
    public class Main {
    
        public static void run(String expr) throws Exception{
    
            // wrap each input string in an ANTLRInputStream
            ANTLRInputStream in = new ANTLRInputStream(expr);
    
            // build the lexer from the stream; the lexer's job is to produce tokens
            DemoLexer lexer = new DemoLexer(in);
    
            // build a token stream from the lexer
            CommonTokenStream tokens = new CommonTokenStream(lexer);
    
            // build the parser from the token stream; lexing and parsing are now wired up
            DemoParser parser = new DemoParser(tokens);
    
            // finally invoke the start rule, prog, to validate the expression
            parser.prog();
        }
    
        public static void main(String[] args) throws Exception{
    
            String[] testStr={
                    "2",
                    "a+b+3",
                    "(a-b)+3",
                    "a+(b*3"
            };
    
            for (String s:testStr){
                System.out.println("Input expr:"+s);
                run(s);
            }
        }
    }
    4. Run Main.java.
       When a legal expression is entered, the parser produces no output, meaning the input was accepted. When the expression violates the grammar, e.g. "a + (b * 3", the parser prints line 0:-1 mismatched input '' expecting ')', telling you it expected a right parenthesis but found the end of input.

    Visualizing the grammar

    1. Open the ANTLR Preview tool window.
    2. In Demo.g4, select a parser rule name such as expr, right-click the selection, and choose Test Rule expr.
    3. In ANTLR Preview choose input and type an expression such as a+b*c+4/2; the parse tree is shown graphically.
  • The ANTLR Mega Tutorial

    2020-06-06 07:09:24

    Parsers are powerful tools, and with ANTLR you can write all kinds of parsers usable from many different languages.

    In this complete tutorial we are going to:

    • explain the basics: what a parser is and what it can be used for
    • see how to set up ANTLR for use from JavaScript, Python, Java, and C#
    • discuss how to test your parser
    • present the most advanced and useful features of ANTLR: you will learn everything needed to parse all possible languages
    • show tons of examples

    Maybe you have read tutorials that were too complicated or too partial, and that seemed to assume you already know how to use a parser. This is not that kind of tutorial. We just expect you to know how to code and how to use a text editor or an IDE. That's it.

    At the end of this tutorial:

    • you will be able to write a parser to recognize different formats and languages
    • you will be able to create all the rules needed to build a lexer and a parser
    • you will know how to deal with the common problems you will encounter
    • you will understand errors, and you will know how to avoid them by testing your grammar.

    In other words, we will start from the very beginning, and by the end you will have learned everything you could possibly need to know about ANTLR.

    ANTLR Mega Tutorial: giant list of content

    What is ANTLR?

    ANTLR is a parser generator, a tool that helps you create parsers. A parser takes a piece of text and transforms it into an organized structure, such as an abstract syntax tree (AST). You can think of the AST as a story describing the content of the code, or as its logical representation created by putting its pieces together.

    [Figure: graphical representation of an AST for the Euclidean algorithm]

    What you need to do to get an AST:

    1. define a lexer and parser grammar
    2. invoke ANTLR: it will generate a lexer and a parser in your target language (e.g. Java, Python, C#, JavaScript)
    3. use the generated lexer and parser: invoke them, passing the code to recognize, and they return an AST

    So you need to start by defining a lexer and parser grammar for the thing you are analyzing. Usually the "thing" is a language, but it could also be a data format, a diagram, or any kind of structure that is represented with text.

    Are regular expressions not enough?

    If you are like the typical programmer, you may ask yourself: why can't I just use a regular expression? Regular expressions are quite useful, for instance when you want to find a number in a string of text, but they also have many limitations.

    The most obvious one is the lack of recursion: you cannot find a (regular) expression inside another one, unless you code it by hand for each level. Something that quickly becomes unmaintainable. But the larger problem is that it does not really scale: if you put even a few regular expressions together, you will create a fragile mess that is hard to maintain.

    Using regular expressions is not that easy

    Have you ever tried to parse HTML with a regular expression? It is a terrible idea; besides risking the summoning of Cthulhu, more importantly it does not really work. Don't you believe me? Let's see: you want to find the elements of a table, so you try a regular expression like <table>(.*?)</table>. Brilliant! You did it! Unless somebody adds attributes such as style, id, or class to their table. No matter, you go for <table.*?>(.*?)</table>, but what you actually cared about is the data inside the table, so you then need to parse tr and td, and they too are full of tags.

    So you need to eliminate those as well. And somebody even dares to use comments like <!-- my comment -->. Comments can be used everywhere, and they are not easy to treat with regular expressions. Are they?

    So you ban the Internet from using comments in HTML: problem solved.

    Or, alternatively, you use ANTLR, whichever seems simpler to you.

    ANTLR vs. writing your own parser by hand

    Okay, you are convinced you need a parser, but why use a parser generator like ANTLR instead of building your own?

    The main advantage of ANTLR is productivity

    If you actually have to work with a parser all the time, because your language or format is evolving, you need to be able to keep the pace, and you cannot do that if you also have to deal with the details of implementing a parser. Since you are not parsing for parsing's sake, you must have the chance to concentrate on accomplishing your goals, and ANTLR makes it much easier to do that rapidly and cleanly.

    Second, once you have defined a grammar, you can ask ANTLR to generate multiple parsers in different languages. For example, you can get a parser in C# and one in JavaScript, to parse the same language in a desktop application and in a web application.

    Some people argue that writing a parser by hand lets you make it faster and produce better error messages. There is some truth in this, but in my experience parsers generated by ANTLR are always fast enough. If you really need to, you can tweak the grammar and improve both performance and error handling by working on it. And you can do that once you are happy with your grammar.

    Table of contents

    Two small notes:

    • in the companion repository of this tutorial you will find all the code, with testing, even where we do not show it in the article
    • the examples will be in different languages, but the knowledge generally applies to any language

    Setup

    1. Setup ANTLR
    2. JavaScript setup
    3. Python setup
    4. Java setup
    5. C# setup

    Beginner

    1. Lexers
    2. Creating a grammar
    3. Designing a data format
    4. Lexer rules
    5. Parser rules
    6. Mistakes and adjustments

    Intermediate

    1. Setting up the chat project with JavaScript
    2. Antlr.js
    3. HtmlChatListener.js
    4. Working with a listener
    5. Solving ambiguities with semantic predicates
    6. Continuing the chat in Python
    7. The Python way of working with a listener
    8. Testing with Python
    9. Parsing markup
    10. Lexical modes
    11. Parser grammars

    Advanced

    1. The markup project in Java
    2. The main App.java
    3. Transforming code with ANTLR
    4. Joy and pain of transforming code
    5. Advanced testing
    6. Dealing with expressions
    7. Parsing spreadsheets
    8. The spreadsheet project in C#
    9. Excel is doomed
    10. Testing everything

    Closing remarks

    1. Tips and tricks
    2. Conclusions

    Setup

    In this section we prepare our development environment to use ANTLR: the parser generator tool, the supporting tools, and the runtimes for each language.

    1. Setup ANTLR

    ANTLR is actually made up of two main parts: the tool used to generate the lexer and parser, and the runtime needed to run them.

    Only you, the language engineer, will need the tool, while the runtime will be included in the final software using your language.

    The tool is always the same no matter which language you are targeting: it is a Java program that you need on your development machine. The runtime instead is different for every language, and it must be available both to the developer and to the user.

    The only requirement for the tool is that you have installed at least Java 1.7. To install the Java program you need to download the latest version from the official site, which at the moment is:

    http://www.antlr.org/download/antlr-4.6-complete.jar

    Instructions

    1. copy the downloaded tool where you usually put third-party Java libraries (e.g. /usr/local/lib or C:\Program Files\Java\lib)
    2. add the tool to your CLASSPATH, for example in your startup script (e.g. .bash_profile)
    3. (optional) also add aliases to your startup script to simplify the use of ANTLR

    Executing the instructions on Linux / Mac OS

    // 1.
    sudo cp antlr-4.6-complete.jar /usr/local/lib/
    // 2. and 3.
    // add this to your .bash_profile
    export CLASSPATH=".:/usr/local/lib/antlr-4.6-complete.jar:$CLASSPATH"
    // simplify the use of the tool to generate lexer and parser
    alias antlr4='java -Xmx500M -cp "/usr/local/lib/antlr-4.6-complete.jar:$CLASSPATH" org.antlr.v4.Tool'
    // simplify the use of the tool to test the generated code
    alias grun='java org.antlr.v4.gui.TestRig'

    Executing the instructions on Windows

    // 1.
    Go to System Properties dialog > Environment variables
    -> Create or append to the CLASSPATH variable
    // 2. and 3. Option A: use doskey
    doskey antlr4=java org.antlr.v4.Tool $*
    doskey grun =java org.antlr.v4.gui.TestRig $*
    // 2. and 3. Option B: use batch files
    // create antlr4.bat
    java org.antlr.v4.Tool %*
    // create grun.bat
    java org.antlr.v4.gui.TestRig %*
    // put them in the system path or any of the directories included in %path%

    Typical workflow

    When you use ANTLR you start by writing a grammar, a file with extension .g4 which contains the rules of the language that you are analyzing. You then use the antlr4 program to generate the files that your program will actually use, such as the lexer and the parser.

    antlr4 <options> <grammar-file-g4>

    There are a couple of important options you can specify when running antlr4.

    First, you can specify the target language, to generate a parser in Python or JavaScript or any other target different from Java (which is the default). The other options are used to generate a visitor or a listener (do not worry if you do not know what these are, we are going to explain later).

    By default only the listener is generated, so to create the visitor you use the -visitor command-line option, and -no-listener if you do not want to generate the listener. There are also the opposite options, -no-visitor and -listener, but they correspond to the default values.

    antlr4 -visitor <Grammar-file>

    You can test your grammar with a small utility named TestRig (although, as we have seen, it is usually aliased as grun).

    grun <grammar-name> <rule-to-test> <input-filename(s)>

    The filename(s) are optional; you can instead analyze input typed on the console.

    If you want to use the testing tool you need to generate a Java parser, even if your program is written in another language. You do this simply by selecting a different option with antlr4.

    Grun is useful when testing the first draft of your grammar manually. As the grammar becomes more stable, you may want to move on to automated tests (we will see how to write them).

    Grun also has a few useful options: -tokens, to show the tokens it detects, and -gui, to generate an image of the AST.

    2. JavaScript setup

    You can put your grammar in the same folder as your JavaScript files. The file containing the grammar must have the same name as the grammar, which must be declared at the top of the file.

    In the following example the name is Chat and the file is Chat.g4.

    We can create the corresponding JavaScript parser simply by specifying the correct option with the ANTLR4 Java program.

    antlr4 -Dlanguage=JavaScript Chat.g4

    Note that the option is case-sensitive, so pay attention to the uppercase 'S'. If you make a mistake you will receive a message like the following.

    error(31):  ANTLR cannot generate Javascript code as of version 4.6

    ANTLR can be used both with node.js and in the browser. For the browser you need to use webpack or require.js. If you do not know how to use either of the two, you can look at the official documentation for help, or read an antlr tutorial on the web. We are going to use node.js, for which you can install the ANTLR runtime simply with the following standard command.

    npm install antlr4

    3. Python setup

    When you have your grammar, put it in the same folder as your Python files. The file must have the same name as the grammar, which must be declared at the top of the file. In the following example the name is Chat and the file is Chat.g4.

    We can create the corresponding Python parser simply by specifying the correct option with the ANTLR4 Java program. For Python you also need to pay attention to the Python version, 2 or 3.

    antlr4 -Dlanguage=Python3 Chat.g4

    The runtime is available from PyPI, so you can install it with pip.

    pip install antlr4-python3-runtime

    Again, you just have to remember to specify the proper python version.

    4. Java setup

    You can set up your Java project with ANTLR by hand. Or you can be a civilized person and use Gradle or Maven.

    You can also look into the ANTLR plugins for your IDE.

    4.1 Java setup with Gradle

    This is how I typically set up my Gradle project.

    I use a Gradle plugin to invoke ANTLR, and the IDEA plugin to generate the configuration for IntelliJ IDEA.

    dependencies {
      antlr "org.antlr:antlr4:4.5.1"
      compile "org.antlr:antlr4-runtime:4.5.1"
      testCompile 'junit:junit:4.12'
    }
     
    generateGrammarSource {
        maxHeapSize = "64m"
        arguments += ['-package', 'me.tomassetti.mylanguage']
        outputDirectory = new File("generated-src/antlr/main/me/tomassetti/mylanguage".toString())
    }
    compileJava.dependsOn generateGrammarSource
    sourceSets {
        generated {
            java.srcDir 'generated-src/antlr/main/'
        }
    }
    compileJava.source sourceSets.generated.java, sourceSets.main.java
     
    clean{
        delete "generated-src"
    }
     
    idea {
        module {
            sourceDirs += file("generated-src/antlr/main")
        }
    }

    I put my grammars under src/main/antlr/, and the gradle configuration makes sure they are generated in the directory corresponding to their package. For example, if I want the parser to be in the package me.tomassetti.mylanguage, it has to be generated into generated-src/antlr/main/me/tomassetti/mylanguage.

    At this point I can simply run:

    # Linux/Mac
    ./gradlew generateGrammarSource
     
    # Windows
    gradlew generateGrammarSource

    And I get the lexer and parser generated from my grammar(s).

    Then I can also run:

    # Linux/Mac
    ./gradlew idea
     
    # Windows
    gradlew idea

    And I have an IDEA project ready to be opened.

    4.2 Java setup with Maven

    First of all we specify in our POM that we need antlr4-runtime as a dependency. We will also use a Maven plugin to run ANTLR through Maven.

    We can also specify whether ANTLR must generate a visitor or a listener. To do that we define a couple of corresponding properties.

    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
     
      [..]
     
      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <antlr4.visitor>true</antlr4.visitor>
        <antlr4.listener>true</antlr4.listener>
      </properties>  
     
      <dependencies>
        <dependency>
          <groupId>org.antlr</groupId>
          <artifactId>antlr4-runtime</artifactId>
          <version>4.6</version>
        </dependency>
       [..]
      </dependencies>
     
      <build>
        <plugins>
          [..]
          <!-- Plugin to compile the g4 files ahead of the java files
               See https://github.com/antlr/antlr4/blob/master/antlr4-maven-plugin/src/site/apt/examples/simple.apt.vm
               Except that the grammar does not need to contain the package declaration as stated in the documentation (I do not know why)
               To use this plugin, type:
                 mvn antlr4:antlr4
               In any case, Maven will invoke this plugin before the Java source is compiled
            -->
          <plugin>
            <groupId>org.antlr</groupId>
            <artifactId>antlr4-maven-plugin</artifactId>
            <version>4.6</version>                
            <executions>
              <execution>
                <goals>
                  <goal>antlr4</goal>
                </goals>            
              </execution>
            </executions>
          </plugin>
          [..]
        </plugins>
      </build>
    </project>

    Now you have to put the *.g4 files of your grammar under src/main/antlr4/me/tomassetti/examples/MarkupParser.

    Once you have written your grammars, you just run mvn package and all the magic happens: ANTLR is invoked, it generates the lexer and the parser, and these are compiled together with the rest of your code.

    // use mvn to generate the package
    mvn package

    If you have never used Maven, you can look at the official ANTLR documentation for the Java target or the Maven website to get started.

    There is a clear advantage in using Java for developing ANTLR grammars: there are plugins for several IDEs, and Java is the language the main developer of the tool actually works in. So there are tools, like org.antlr.v4.gui.TestRig, that can be easily integrated into your workflow and that are very useful if you want to easily visualize the AST of an input.

    5. C# setup

    There is support for .NET Framework and Mono 3.5, but there is no support for .NET Core. We are going to use Visual Studio to create our ANTLR project, because there is a nice extension for Visual Studio, created by the same author as the C# target, called ANTLR Language Support. You can install it by going to Tools -> Extensions and Updates. This extension will automatically generate the parser, lexer, and visitor/listener when you build your project.

    Furthermore, the extension will allow you to create a new grammar file using the well-known menu for adding a new item. Last but not least, you can set the options to generate a listener/visitor directly in the properties of each grammar file.

    Alternatively, if you prefer to use an editor, you need to use the usual Java tool to generate everything. You can do that by indicating the right language. In this example the grammar is called Spreadsheet.

    antlr4 -Dlanguage=CSharp Spreadsheet.g4

    Note that the 'S' in CSharp is uppercase.

    You will still need the ANTLR4 runtime for your project, and you can install it with the good old nuget.

    Beginner

    In this section we lay the foundations you need to use ANTLR: what lexers and parsers are, the syntax to define them in a grammar, and the strategies you can use to create one. We also see the first examples to show how to use what you have learned. You can come back to this section if you do not remember how ANTLR works.

    6. Lexers

    Before looking at parsers, we first need to look at lexers, also known as tokenizers. They are basically the first stepping stone toward a parser, and of course ANTLR allows you to build them too. A lexer transforms the individual characters into tokens, the atoms that the parser uses to create the logical structure.

    Imagine this process applied to a natural language such as English. You are reading the single characters, putting them together until they make a word, and then you combine the different words to form a sentence.

    Let's look at the following example, and imagine that we are trying to parse a mathematical operation.

    437 + 734

    The lexer scans the text and finds '4', '3', '7', and then the space ' '. So it knows that the first characters actually represent a number. Then it finds a '+' symbol, so it knows that it represents an operator, and lastly it finds another number.

    How does it know that? Because we tell it to.

    /*
     * Parser Rules
     */
     
    operation  : NUMBER '+' NUMBER ;
     
    /*
     * Lexer Rules
     */
     
    NUMBER     : [0-9]+ ;
     
    WHITESPACE : ' ' -> skip ;

    This is not a complete grammar, but we can already see that lexer rules are all uppercase, while parser rules are all lowercase. Technically the rule about case applies only to the first character of the name, but usually they are all uppercase or all lowercase for clarity.

    Rules are typically written in this order: first the parser rules and then the lexer ones, although logically they are applied in the opposite order. It is also important to remember that lexer rules are analyzed in the order in which they appear, and they can be ambiguous.

    The typical example is the identifier: in many programming languages it can be any string of letters, but certain combinations, such as "class" or "function", are forbidden because they indicate a class or a function. So the order of the rules solves the ambiguity by using the first match, and that is why the tokens identifying keywords such as class or function are defined first, while the one for the identifier is put last.
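    This first-match policy is easy to simulate. The following Python sketch is a toy, not how ANTLR is implemented internally; it breaks ties between equal-length matches by rule order, which is why swapping the keyword and identifier rules loses the keyword:

```python
import re

def first_match(text, rules):
    """Among rules whose match has the same length, the one listed first
    wins; this is why keyword rules must precede the identifier rule."""
    candidates = [(name, re.match(p, text)) for name, p in rules]
    candidates = [(n, m.group()) for n, m in candidates if m]
    longest = max(len(t) for _, t in candidates)
    return next((n, t) for n, t in candidates if len(t) == longest)

rules = [('FUNCTION', r'function'), ('ID', r'[a-z]+')]
print(first_match('function', rules))   # ('FUNCTION', 'function')
print(first_match('functions', rules))  # ('ID', 'functions')

# With the identifier rule first, the keyword is never recognized:
rules_bad = [('ID', r'[a-z]+'), ('FUNCTION', r'function')]
print(first_match('function', rules_bad))  # ('ID', 'function')
```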

    The basic syntax of a rule is easy: there is a name, a colon, the definition of the rule, and a terminating semicolon.

    The definition of NUMBER contains a typical range of digits and a '+' symbol to indicate that one or more matches are allowed. These are all very typical indications with which I assume you are familiar; if not, you can read more about the syntax of regular expressions.

    The most interesting part is at the end, the lexer rule that defines the WHITESPACE token. It is interesting because it shows how to indicate to ANTLR to ignore something. Consider how ignoring whitespace simplifies parser rules: if we could not say to ignore WHITESPACE, we would have to include it between every single subrule of the parser, to let the user put spaces where they want. Like this:

    operation  : WHITESPACE* NUMBER WHITESPACE* '+' WHITESPACE* NUMBER;

    And the same typically applies to comments: they can appear everywhere and we do not want to handle them specifically in every single piece of our grammar, so we just ignore them (at least while parsing).

    7. Creating a grammar

    Now that we have seen the basic syntax of a rule, we can take a look at the two different approaches to defining a grammar: top-down and bottom-up.

    Top-down approach

    This approach consists of starting from the general organization of a file written in your language.

    What are the main sections of the file? What is their order? What is contained in each section?

    For example, a Java file can be divided into three sections:

    • package declaration
    • imports
    • type definitions

    This approach works best when you already know the language or format that you are designing a grammar for. It is probably the strategy preferred by people with a good theoretical background, or people who prefer to start with the "big plan".

    When using this approach, you start by defining the rule representing the whole file. It will probably include other rules, to represent the main sections. You then define those rules, and you move from the most general, abstract rules to the low-level, practical ones.

    Bottom-up approach

    The bottom-up approach consists of focusing on the small elements first: defining how the tokens are captured, how the basic expressions are defined, and so on. Then we move to higher-level constructs until we define the rule representing the whole file.

    I personally prefer to start from the bottom, the basic items, which are analyzed with the lexer. And then you grow naturally from there to the structure, which is dealt with by the parser. This approach permits focusing on a small piece of the grammar, building a grammar for it, making sure it works as expected, and then moving on to the next bit.

    This approach mimics the way we learn. Furthermore, there is the advantage of starting with real code, which is actually quite common among many languages. In fact, most languages have things like identifiers, comments, whitespace, etc. Obviously you might have to tweak something, for example a comment in HTML is functionally the same as a comment in C#, but it has different delimiters.

    The disadvantage of a bottom-up approach is that the parser is the thing you actually care about. You were not asked to build a lexer, you were asked to build a parser that can provide a specific functionality. So by starting on the last part, the lexer, you might end up doing some refactoring if you do not already know how the rest of the program will work.

    8. Designing a data format

    Designing a grammar for a new language is difficult. You have to create a language that is simple and intuitive for the user, but also unambiguous so as to keep the grammar manageable. It must be concise, clear, and natural, and it should not get in the way of the user.

    So we are starting with something limited: a grammar for a simple chat program.

    Let's start with a better description of our objective:

    • there will be no paragraphs, so we can use newlines as separators between messages
    • we want to allow emoticons, mentions, and links; we are not going to support HTML tags
    • since our chat is going to be for annoying teenagers, we want to allow users an easy way to shout and to format the color of the text.

    Finally teenagers can shout, and all in pink. What a time to be alive.

    9. Lexer rules

    We start by defining the lexer rules for our chat language. Remember that the lexer rules actually sit at the end of the file.

    /*
     * Lexer Rules
     */
     
    fragment A          : ('A'|'a') ;
    fragment S          : ('S'|'s') ;
    fragment Y          : ('Y'|'y') ;
    fragment H          : ('H'|'h') ;
    fragment O          : ('O'|'o') ;
    fragment U          : ('U'|'u') ;
    fragment T          : ('T'|'t') ;
     
    fragment LOWERCASE  : [a-z] ;
    fragment UPPERCASE  : [A-Z] ;
     
    SAYS                : S A Y S ;
     
    SHOUTS              : S H O U T S;
     
    WORD                : (LOWERCASE | UPPERCASE | '_')+ ;
     
    WHITESPACE          : (' ' | '\t') ;
     
    NEWLINE             : ('\r'? '\n' | '\r')+ ;
     
    TEXT                : ~[\])]+ ;

    In this example we use rule fragments: they are reusable building blocks for lexer rules. You define them and then refer to them in a lexer rule. If you define them but do not include them in a lexer rule, they simply have no effect.

    We define a fragment for the letters we want to use in keywords. Why is that? Because we want to support case-insensitive keywords. Apart from avoiding the repetition of characters, fragments are also used when dealing with floating-point numbers, to avoid repeating the digit range before and after the dot/comma, as in the following example.

    fragment DIGIT : [0-9] ;
    NUMBER         : DIGIT+ ([.,] DIGIT+)? ;
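    For readers more comfortable with regular expressions, the NUMBER rule above corresponds directly to a regex; a quick Python check (illustrative only):

```python
import re

# Regex equivalent of:  fragment DIGIT : [0-9] ;  NUMBER : DIGIT+ ([.,] DIGIT+)? ;
NUMBER = re.compile(r'[0-9]+([.,][0-9]+)?$')

print(bool(NUMBER.match('42')))    # True
print(bool(NUMBER.match('3.14')))  # True
print(bool(NUMBER.match('3,14')))  # True: comma accepted as separator
print(bool(NUMBER.match('3.')))    # False: digits are required after the dot
```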

    The TEXT token shows how to capture everything except the characters listed after the tilde ('~'). We exclude the closing square bracket ']', but since it is the character used to identify the end of a set of characters, we have to escape it by preceding it with a backslash '\'.

    The newline rule is formulated this way because operating systems actually indicate a newline in different ways: some include a carriage return ('\r'), some a newline ('\n') character, and some a combination of the two.
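    The same rule rendered as a regex makes it easy to verify that every operating-system convention is covered (illustrative Python):

```python
import re

# Regex equivalent of:  NEWLINE : ('\r'? '\n' | '\r')+ ;
NEWLINE = re.compile(r'(\r?\n|\r)+$')

# Unix, Windows, old Mac, and mixed sequences all match:
for sample in ('\n', '\r\n', '\r', '\r\n\n'):
    print(bool(NEWLINE.match(sample)))  # True for every convention
```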

    10. Parser rules

    We continue with the parser rules, the rules our program will interact with most directly.

    /*
     * Parser Rules
     */
     
    chat                : line+ EOF ;
     
    line                : name command message NEWLINE;
     
    message             : (emoticon | link | color | mention | WORD | WHITESPACE)+ ;
     
    name                : WORD ;
     
    command             : (SAYS | SHOUTS) ':' WHITESPACE ;
                                            
    emoticon            : ':' '-'? ')'
                        | ':' '-'? '('
                        ;
     
    link                : '[' TEXT ']' '(' TEXT ')' ;
     
    color               : '/' WORD '/' message '/';
     
    mention             : '@' WORD ;

    The first interesting part is message, not so much for what it contains but for the structure it represents. We are saying that a message can be any of the listed rules, in any order. This is a simple way to solve the problem of dealing with whitespace without repeating it every time. Since we, as users, find whitespace irrelevant, we see something like WORD WORD mention, but the parser actually sees WORD WHITESPACE WORD WHITESPACE mention WHITESPACE.

    Another way of dealing with whitespace, when you cannot get rid of it, is more advanced: lexical modes. Basically they allow you to specify two lexer parts: one for the structured part, another for the simple text. This is useful for parsing things like XML or HTML. We are going to show it later.

    The command rule is obvious; you just have to notice that there cannot be a space between the two command options and the colon, but one WHITESPACE is required after. The emoticon rule shows another notation for multiple choices: you can use the pipe character '|' without the parentheses. We support only two emoticons, happy and sad, with or without the middle line.

    Something that could be considered a bug, or a poor implementation, is the link rule; as we already said, TEXT in fact captures everything apart from certain special characters. You may want to allow only WORD and WHITESPACE inside the parentheses, or to force a correct format for a link inside the square brackets. On the other hand, this allows the user to make a mistake in writing the link without making the parser complain.

    You have to remember that the parser cannot check semantics

    For instance, it cannot know whether the WORD indicating the color actually represents a valid color. That is, it does not know that it is wrong to use "dog" but right to use "red". This must be checked by the logic of the program, which knows which colors are available. You have to find the right balance for dividing enforcement between the grammar and your own code.

    The parser should only check the syntax. So the rule of thumb is: when in doubt, let the parser pass the content up to your program. Then, in your program, you check the semantics and make sure that the rule actually has a proper meaning.
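    A minimal sketch of where that balance usually lands: the grammar accepts any WORD as a color, and a post-parse check in the program rejects invalid ones. The palette and function name here are hypothetical:

```python
# Hypothetical palette; a real application would define its own.
VALID_COLORS = {'red', 'green', 'blue', 'pink'}

def check_color(word: str) -> bool:
    """Semantic check that runs *after* parsing: the grammar happily
    accepts /dog/.../, so the program has to reject it here."""
    return word.lower() in VALID_COLORS

print(check_color('red'))  # True: syntactically and semantically fine
print(check_color('dog'))  # False: parses fine, fails the semantic check
```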

    Let's look at the rule color: it can include a message, and it itself can also be part of a message; this ambiguity will be solved by the context in which the rule is used.

    11. Mistakes and adjustments

    Before trying our new grammar we have to add a name for it at the beginning of the file. The name must be the same as the file name, which should have the .g4 extension.

    grammar Chat;

    You can find instructions on how to install everything for your platform in the official documentation. After everything is installed, we create the grammar, compile the generated Java code, and then we run the testing tool.

    // lines preceded by $ are commands
    // > are input to the tool
    // - are output from the tool
    $ antlr4 Chat.g4
    $ javac Chat*.java
    // grun is the testing tool, Chat is the name of the grammar, chat the rule that we want to parse
    $ grun Chat chat
    > john SAYS: hello @michael this will not work
    // CTRL+D on Linux, CTRL+Z on Windows
    > CTRL+D/CTRL+Z
    - line 1:0 mismatched input 'john SAYS: hello @michael this will not work\n' expecting WORD

    Okay, it does not work. Why is it expecting WORD? It's right there! Let's try to find out, using the option -tokens to make grun show the tokens it recognizes.

    $ grun Chat chat -tokens
    > john SAYS: hello @michael this will not work
    - [@0,0:44='john SAYS: hello @michael this will not work\n',<TEXT>,1:0]
    - [@1,45:44='<EOF>',<EOF>,2:0]

    So it only sees the TEXT token. But we put it at the end of the grammar, so what happens? The problem is that the lexer always tries to match the longest possible token, and all this text is one valid TEXT token. How do we solve this? There are many ways; the first, of course, is just getting rid of that token. But for now we are going to look at the second-easiest one.
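    The "longest possible token" policy (maximal munch) can be simulated in a few lines of Python. This toy lexer is not ANTLR's actual algorithm, but it shows why the catch-all TEXT rule swallows the whole line even though WORD is listed first:

```python
import re

RULES = [
    ('WORD', r'[a-zA-Z_]+'),
    ('TEXT', r'[^\])]+'),   # the catch-all:  TEXT : ~[\])]+ ;
]

def lex_one(text):
    """Longest match wins; rule order only breaks exact-length ties."""
    best = None
    for name, pattern in RULES:
        m = re.match(pattern, text)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(lex_one('john SAYS: hello'))  # ('TEXT', 'john SAYS: hello')
print(lex_one('hello'))             # ('WORD', 'hello'), ties go to the earlier rule
```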

    [..]
     
    link                : TEXT TEXT ;
     
    [..]
     
    TEXT                : ('['|'(') ~[\])]+ (']'|')');

    We have changed the problematic token to make it include a preceding parenthesis or square bracket. Note that this is not exactly the same thing, because it would allow two series of parentheses or square brackets. But it is a first step, and we are learning here, after all.

    Let's check whether it works:

    $ grun Chat chat -tokens
    > john SAYS: hello @michael this will not work
    - [@0,0:3='john',<WORD>,1:0]
    - [@1,4:4=' ',<WHITESPACE>,1:4]
    - [@2,5:8='SAYS',<SAYS>,1:5]
    - [@3,9:9=':',<':'>,1:9]
    - [@4,10:10=' ',<WHITESPACE>,1:10]
    - [@5,11:15='hello',<WORD>,1:11]
    - [@6,16:16=' ',<WHITESPACE>,1:16]
    - [@7,17:17='@',<'@'>,1:17]
    - [@8,18:24='michael',<WORD>,1:18]
    - [@9,25:25=' ',<WHITESPACE>,1:25]
    - [@10,26:29='this',<WORD>,1:26]
    - [@11,30:30=' ',<WHITESPACE>,1:30]
    - [@12,31:34='will',<WORD>,1:31]
    - [@13,35:35=' ',<WHITESPACE>,1:35]
    - [@14,36:38='not',<WORD>,1:36]
    - [@15,39:39=' ',<WHITESPACE>,1:39]
    - [@16,40:43='work',<WORD>,1:40]
    - [@17,44:44='\n',<NEWLINE>,1:44]
    - [@18,45:44='<EOF>',<EOF>,2:0]

    Using the -gui option we can also get a nice, easier-to-understand graphical representation.

    [Figure: the dots in mid-air represent the whitespace.]

    This works, but it is not very smart, nice, or organized. Don't worry, later we will see a better way. One positive aspect of this solution is that it lets us show another trick.

    TEXT                : ('['|'(') .*? (']'|')');

    This is an equivalent formulation of the TEXT token: the '.' matches any character, '*' says that the preceding match can be repeated any number of times, and '?' indicates that the preceding match is non-greedy. That is to say, the preceding subrule matches everything except what follows it, allowing the closing parenthesis or square bracket to be matched.
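    Greedy vs. non-greedy is easy to see with plain regular expressions, which use the same '?' notation (illustrative Python, echoing the earlier <table> example):

```python
import re

html = '<table>one</table> junk <table>two</table>'

# Greedy: .* runs to the LAST </table>, producing a single over-long match.
print(re.findall(r'<table>(.*)</table>', html))
# ['one</table> junk <table>two']

# Non-greedy: .*? stops at the FIRST </table>, producing the two tables.
print(re.findall(r'<table>(.*?)</table>', html))
# ['one', 'two']
```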

    Intermediate

    In this section we see how to use ANTLR in your programs, the libraries and functions you need to use, how to test your parsers, and so on. We see what a listener is and how to use one. We also build upon our knowledge of the basics by looking at more advanced concepts, such as semantic predicates. While our projects here are mainly in JavaScript and Python, the concepts are generally applicable to every language. You can come back to this section when you need to remember how to get your project organized.

    12. Setting up the chat project with JavaScript

    In the previous sections we saw, piece by piece, how to build a grammar for a chat program. Now let's copy that grammar into the same folder as our JavaScript files.

    grammar Chat;
     
    /*
     * Parser Rules
     */
     
    chat                : line+ EOF ;
     
    line                : name command message NEWLINE ;
     
    message             : (emoticon | link | color | mention | WORD | WHITESPACE)+ ;
     
    name                : WORD WHITESPACE;
     
    command             : (SAYS | SHOUTS) ':' WHITESPACE ;
                                            
    emoticon            : ':' '-'? ')'
                        | ':' '-'? '('
                        ;
     
    link                : TEXT TEXT ;
     
    color               : '/' WORD '/' message '/';
     
    mention             : '@' WORD ;
     
     
    /*
     * Lexer Rules
     */
     
    fragment A          : ('A'|'a') ;
    fragment S          : ('S'|'s') ;
    fragment Y          : ('Y'|'y') ;
    fragment H          : ('H'|'h') ;
    fragment O          : ('O'|'o') ;
    fragment U          : ('U'|'u') ;
    fragment T          : ('T'|'t') ;
     
    fragment LOWERCASE  : [a-z] ;
    fragment UPPERCASE  : [A-Z] ;
     
    SAYS                : S A Y S ;
     
    SHOUTS              : S H O U T S ;
     
    WORD                : (LOWERCASE | UPPERCASE | '_')+ ;
     
    WHITESPACE          : (' ' | '\t')+ ;
     
    NEWLINE             : ('\r'? '\n' | '\r')+ ;
     
    TEXT                : ('['|'(') ~[\])]+ (']'|')');

    We can create the corresponding JavaScript parser simply by specifying the correct option with the ANTLR4 Java program.

    antlr4 -Dlanguage=JavaScript Chat.g4

    Now you will find some new files in the folder, with names such as ChatLexer.js and ChatParser.js, and there are also *.tokens files, none of which contain anything interesting for us, unless you want to understand the inner workings of ANTLR.

    The file you want to look at is ChatListener.js. You are not going to modify anything in it, but it contains the methods and functions that we will override with our own listener. We are not going to modify it because changes would be overwritten every time the grammar is regenerated.

    Looking into it you can see several enter/exit functions, a pair for each of our parser rules. These functions will be invoked when a piece of code matching the rule is encountered. This is the default implementation of the listener, and it allows you to just override the functions that you need on your derived listener and leave the rest as-is.

    var antlr4 = require('antlr4/index');
     
    // This class defines a complete listener for a parse tree produced by ChatParser.
    function ChatListener() {
        antlr4.tree.ParseTreeListener.call(this);
        return this;
    }
     
    ChatListener.prototype = Object.create(antlr4.tree.ParseTreeListener.prototype);
    ChatListener.prototype.constructor = ChatListener;
     
    // Enter a parse tree produced by ChatParser#chat.
    ChatListener.prototype.enterChat = function(ctx) {
    };
     
    // Exit a parse tree produced by ChatParser#chat.
    ChatListener.prototype.exitChat = function(ctx) {
    };
     
    [..]

    The alternative to creating a Listener is creating a Visitor. The main differences are that you can neither control the flow of a listener nor return anything from its functions, while with a visitor you can do both. So if you need to control how the nodes of the AST are entered, or to gather information from several of them, you probably want to use a visitor. This is useful, for example, for code generation, where some of the information needed to create the new source code is spread across many parts. Both the listener and the visitor use depth-first search.

    A depth-first search means that when a node is visited, its children will be visited, and if one of the child nodes has its own children, those will be visited before continuing with the other children of the first node. The following image will make the concept easier to understand.
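    The enter/exit ordering of a depth-first walk can be sketched in a few lines. This toy walker is not the real ParseTreeWalker, and the tree shape is invented, but it fires events in the same order a listener would see them:

```python
# A tiny tree: each node is (name, [children]), shaped like a chat line.
tree = ('chat', [
    ('line', [('name', []), ('message', [])]),
])

events = []

def walk(node):
    """Depth-first walk firing enter/exit events, like a parse-tree walker."""
    name, children = node
    events.append('enter ' + name)
    for child in children:
        walk(child)            # all descendants before the node's exit
    events.append('exit ' + name)

walk(tree)
print(events)
# ['enter chat', 'enter line', 'enter name', 'exit name',
#  'enter message', 'exit message', 'exit line', 'exit chat']
```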

    So, in the case of a listener, an enter event is fired at the first encounter with the node, and an exit one after all of its children have been exited. In the following image you can see an example of which functions will be fired when a listener encounters a line node (for simplicity only the functions related to line are shown).

    For a standard visitor the behavior is analogous, except of course that only a single visit event is fired for every node. In the following image you can see an example of which functions will be fired when a visitor encounters a line node (for simplicity only the functions related to line are shown).

    Remember that this is true for the default implementation of a visitor, and it works by returning the children of each node in every function. If you override a method of the visitor, it is your responsibility to make it continue the journey, or to stop it right there.

    13. Antlr.js

    It is finally time to see what a typical ANTLR program looks like.

    const http = require('http');
    const antlr4 = require('antlr4/index');
    const ChatLexer = require('./ChatLexer');
    const ChatParser = require('./ChatParser');
    const HtmlChatListener = require('./HtmlChatListener').HtmlChatListener;
     
    http.createServer((req, res) => {
       
       res.writeHead(200, {
           'Content-Type': 'text/html',        
       });
     
       res.write('<html><head><meta charset="UTF-8"/></head><body>');
       
       var input = "john SHOUTS: hello @michael /pink/this will work/ :-) \n";
       var chars = new antlr4.InputStream(input);
       var lexer = new ChatLexer.ChatLexer(chars);
       var tokens  = new antlr4.CommonTokenStream(lexer);
       var parser = new ChatParser.ChatParser(tokens);
       parser.buildParseTrees = true;   
       var tree = parser.chat();   
       var htmlChat = new HtmlChatListener(res);
       antlr4.tree.ParseTreeWalker.DEFAULT.walk(htmlChat, tree);
       
       res.write('</body></html>');
       res.end();
     
    }).listen(1337);

    At the beginning of the main file we import (using require) the necessary libraries and files: antlr4 (the runtime) and our generated parser, plus the listener that we are going to see later.

    For simplicity we get the input from a string, while in a real scenario it would come from an editor.

    Lines 16-19 show the foundation of every ANTLR program: you create the stream of characters from the input, you give it to the lexer, and it transforms them into tokens, which are then interpreted by the parser.

    It is useful to stop here for a moment and reflect on this: the lexer works on the characters of the input, a copy of the input to be precise, while the parser works on the tokens generated by the lexer. The lexer does not work on the input directly, and the parser does not even see the characters.

    This is important to remember in case you need to do something advanced, like manipulating the input. In this case the input is a string, but of course it could be any stream of content.

    Line 20 is redundant, since the option already defaults to true, but that could change in future versions of the runtime, so it is better to specify it.

    Then, on line 21, we set the root node of the tree to the chat rule. You invoke the parser by specifying a rule, which is typically the first rule. However, you can actually invoke any rule directly, such as color.

    Once we get the AST from the parser, we typically want to process it using a listener or a visitor. In this case we specify a listener. Our particular listener takes one parameter: the response object. We want to use it to put some text into the response to send to the user. After setting up the listener, we finally walk the tree with it.

    14. HtmlChatListener.js

    We continue by looking at the listener of our chat project.

    const antlr4 = require('antlr4/index');
    const ChatLexer = require('./ChatLexer');
    const ChatParser = require('./ChatParser');
    var ChatListener = require('./ChatListener').ChatListener;
     
    HtmlChatListener = function(res) {
        this.Res = res;    
        ChatListener.call(this); // inherit default listener
        return this;
    };
     
    // inherit default listener
    HtmlChatListener.prototype = Object.create(ChatListener.prototype);
    HtmlChatListener.prototype.constructor = HtmlChatListener;
     
    // override default listener behavior
    HtmlChatListener.prototype.enterName = function(ctx) {          
        this.Res.write("<strong>");    
    };
     
    HtmlChatListener.prototype.exitName = function(ctx) {      
        this.Res.write(ctx.WORD().getText());
        this.Res.write("</strong> ");
    }; 
     
    HtmlChatListener.prototype.exitEmoticon = function(ctx) {      
        var emoticon = ctx.getText();        
        
        if(emoticon == ':-)' || emoticon == ':)')
        {
            this.Res.write("🙂");
        }
        
        if(emoticon == ':-(' || emoticon == ':(')
        {
            this.Res.write("🙁");
        }
    }; 
     
    HtmlChatListener.prototype.enterCommand = function(ctx) {          
        if(ctx.SAYS() != null)
            this.Res.write(ctx.SAYS().getText() + ':' + '<p>');
     
        if(ctx.SHOUTS() != null)
            this.Res.write(ctx.SHOUTS().getText() + ':' + '<p style="text-transform: uppercase">');
    };
     
    HtmlChatListener.prototype.exitLine = function(ctx) {              
        this.Res.write("</p>");
    };
     
    exports.HtmlChatListener = HtmlChatListener;

    After the requires function calls we make our HtmlChatListener to extend ChatListener. The interesting stuff starts at line 17.

    The ctx argument is an instance of a specific class context for the node that we are entering/exiting. So for enterName is NameContext , for exitEmoticon is EmoticonContext , etc. This specific context will have the proper elements for the rule, that would make possible to easily access the respective tokens and subrules. For example, NameContext will contain fields like WORD() and WHITESPACE(); CommandContext will contain fields like WHITESPACE() , SAYS() and SHOUTS().

    These functions, enter* and exit*, are called by the walker every time the corresponding nodes are entered or exited while it traverses the AST that represents the program. A listener allows you to execute some code, but it's important to remember that you can't stop the walker or skip the execution of the functions.
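The walker's behavior can be modeled as a depth-first traversal that fires enter/exit callbacks around each node's children. A minimal sketch in Python (a toy model, not ANTLR's ParseTreeWalker; node and method names are made up for illustration):

```python
# Toy walker: nodes are (name, children) tuples; the walker fires
# enter_<name> before visiting the children and exit_<name> after.
def walk(listener, node):
    name, children = node
    getattr(listener, 'enter_' + name, lambda: None)()
    for child in children:
        walk(listener, child)
    getattr(listener, 'exit_' + name, lambda: None)()

class Recorder:
    """Records the order in which callbacks fire."""
    def __init__(self):
        self.calls = []
    def enter_line(self): self.calls.append('enter_line')
    def exit_line(self): self.calls.append('exit_line')
    def enter_name(self): self.calls.append('enter_name')
    def exit_name(self): self.calls.append('exit_name')

tree = ('line', [('name', [])])
```

Walking tree with a Recorder yields enter_line, enter_name, exit_name, exit_line: the same nesting order ANTLR guarantees for the generated enter*/exit* methods.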

    In enterName, we start by printing a strong tag because we want the name to be bold; then in exitName we take the text from the token WORD and close the tag. Note that we ignore the WHITESPACE token: nothing says that we have to show everything. In this case we could have done everything in either the enter or the exit function.

    In the function exitEmoticon we simply transform the emoticon text into an emoji character. We get the text of the whole rule because there are no tokens defined for this parser rule. In enterCommand, instead, there could be either of two tokens, SAYS or SHOUTS, so we check which one is defined. We then alter the following text, transforming it to uppercase if it's a SHOUTS. Note that we close the p tag at the exit of the line rule, because the command, semantically speaking, alters all the text of the message.
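The emoticon-to-emoji mapping used in exitEmoticon can be factored into a plain lookup. A hypothetical helper (not part of the generated listener), shown here in Python:

```python
# Map chat emoticons to emoji; unknown text passes through unchanged.
EMOJI = {
    ':-)': '\U0001F642', ':)': '\U0001F642',  # slightly smiling face
    ':-(': '\U0001F641', ':(': '\U0001F641',  # slightly frowning face
}

def emoticon_to_emoji(text):
    return EMOJI.get(text, text)
```

A dictionary lookup also makes it trivial to add more emoticons later, without growing a chain of if statements.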

    All we have to do now is launch the script with node antlr.js and point our browser at its address, usually http://localhost:1337/, and we will be greeted with the following image.

    So all is good, we just have to add all the different listeners to handle the rest of the language. Let's start with color and message .

    15. Working with a Listener

    We have seen how to start defining a listener. Now let's see how to evolve it into a complete, robust listener. Let's start by adding support for color and checking the results of our hard work.

    HtmlChatListener.prototype.enterColor = function(ctx) {     
        var color = ctx.WORD().getText();         
        this.Res.write('<span style="color: ' + color + '">');        
    };
     
    HtmlChatListener.prototype.exitColor = function(ctx) {          
        this.Res.write("</span>");    
    }; 
     
    HtmlChatListener.prototype.exitMessage = function(ctx) {             
        this.Res.write(ctx.getText());
    };
     
    exports.HtmlChatListener = HtmlChatListener;
     

    Except that it doesn't work. Or maybe it works too much: we are writing some parts of the message twice (“this will work”): first when we check the specific nodes, children of message, and then at the end.

    Luckily, with Javascript we can dynamically alter objects, so we can take advantage of this fact to change the *Context objects themselves.

    HtmlChatListener.prototype.enterColor = function(ctx) {
        var color = ctx.WORD().getText();
        ctx.text = '<span style="color: ' + color + '">';
    };
     
    HtmlChatListener.prototype.exitColor = function(ctx) {         
        ctx.text += ctx.message().text;    
        ctx.text += '</span>';
    };
     
    HtmlChatListener.prototype.exitEmoticon = function(ctx) {      
        var emoticon = ctx.getText();        
        
        if(emoticon == ':-)' || emoticon == ':)')
        {        
            ctx.text = "🙂";
        }
        
        if(emoticon == ':-(' || emoticon == ':(')
        {          
            ctx.text = "🙁";
        }
    }; 
     
    HtmlChatListener.prototype.exitMessage = function(ctx) {                
        var text = '';
     
        for (var index = 0; index <  ctx.children.length; index++ ) {
            if(ctx.children[index].text != null)
                text += ctx.children[index].text;
            else
                text += ctx.children[index].getText();
        }
     
        if(ctx.parentCtx instanceof ChatParser.ChatParser.LineContext == false)
        {
            ctx.text = text;        
        }
        else
        {
            this.Res.write(text);
            this.Res.write("</p>");
        }
    };

    Only the modified parts are shown in the snippet above. We add a text field to every node that transforms its text, and then at the exit of every message we print the text if it's the primary message, the one that is a direct child of the line rule. If it's a message that is also a child of color, we add the text field to the node we are exiting and let color print it. We check this by looking at the parent node, to see whether it's an instance of the object LineContext. This is also further evidence of how each ctx argument corresponds to the proper type.

    In the loop over ctx.children we can see another field of every node of the generated tree: children, which obviously contains the child nodes. You can observe that if a text field exists we add it to the proper variable; otherwise we use the usual function to get the text of the node.
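The accumulation strategy in exitMessage — prefer a synthesized text attribute when a listener has set one, otherwise fall back to the node's own text — can be sketched with plain objects (a toy model, not the ANTLR context classes):

```python
class Node:
    """Stand-in for a parse-tree child: `raw` is the matched text,
    and listeners may attach a transformed `text` attribute."""
    def __init__(self, raw, text=None):
        self.raw = raw
        if text is not None:
            self.text = text
    def getText(self):
        return self.raw

def gather_text(children):
    out = ''
    for child in children:
        # Prefer the transformed text a listener attached, if any.
        out += child.text if hasattr(child, 'text') else child.getText()
    return out
```

This mirrors the hasattr-style check in the listener: nodes that were never transformed simply contribute their original text.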

    16. Solving Ambiguities with Semantic Predicates

    So far we have seen how to build a parser for a chat language in Javascript. Let's continue working on this grammar but switch to Python. Remember that all code is available in the repository. Before that, we have to solve an annoying problem: the TEXT token. The solution we have is terrible, and furthermore, if we tried to get the text of the token we would have to trim the edges, parentheses or square brackets. So, what can we do?

    We can use a particular feature of ANTLR called semantic predicates. As the name implies, they are expressions that produce a boolean value. They selectively enable or disable the following rule and thus permit solving ambiguities. Another reason they can be used is to support different versions of the same language, for instance a version with a new construct and an old one without it.

    Technically they are part of the larger group of actions, which allow embedding arbitrary code into the grammar. The downside is that the grammar is no longer language independent, since the code in the action must be valid for the target language. For this reason, it's usually considered a good idea to use semantic predicates only when they can't be avoided, and to leave most of the code to the visitor/listener.

    link                : '[' TEXT ']' '(' TEXT ')';
     
    TEXT                : {self._input.LA(-1) == ord('[') or self._input.LA(-1) == ord('(')}? ~[\])]+ ;

    We restored link to its original formulation, but we added a semantic predicate to the TEXT token, written inside curly brackets and followed by a question mark. We use self._input.LA(-1) to check the character before the current one; if this character is a square bracket or an open parenthesis, we activate the TEXT token. It's important to repeat that this must be valid code in our target language: it's going to end up in the generated lexer or parser, in our case in ChatLexer.py.

    This matters not just for the syntax itself, but also because different targets might have different fields or methods; for instance, LA returns an int in Python, so we have to convert the char to an int.
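Since the Python runtime returns integer code points from LA, the comparison has to go through ord. A small self-contained illustration of the check (la_value stands in for self._input.LA(-1); this is not the real lexer):

```python
# LA(-1) in the Python target yields an integer code point,
# so bracket characters must be compared via ord().
def is_after_delimiter(la_value):
    return la_value == ord('[') or la_value == ord('(')
```

Comparing la_value directly against the character '[' would always be False in Python 3, which is exactly the kind of target-specific pitfall the predicate syntax forces you to handle.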

    Let's look at the equivalent form in other languages.

    // C#. Notice that is .La and not .LA
    TEXT : {_input.La(-1) == '[' || _input.La(-1) == '('}? ~[\])]+ ;
    // Java
    TEXT : {_input.LA(-1) == '[' || _input.LA(-1) == '('}? ~[\])]+ ;
    // Javascript
    TEXT : {this._input.LA(-1) == '[' || this._input.LA(-1) == '('}? ~[\])]+ ;

    If you want to test for the preceding token, you can use _input.LT(-1), but you can only do that in parser rules. For example, you may want to enable the mention rule only if it is preceded by a WHITESPACE token.

    // C#
    mention: {_input.Lt(-1).Type == WHITESPACE}? '@' WORD ;
    // Java
    mention: {_input.LT(-1).getType() == WHITESPACE}? '@' WORD ;
    // Python
    mention: {self._input.LT(-1).text == ' '}? '@' WORD ;
    // Javascript
    mention: {this._input.LT(-1).text == ' '}? '@' WORD ;

    17. Continuing the Chat in Python

    Before seeing the Python example, we must modify our grammar and put the TEXT token before the WORD one. Otherwise ANTLR might assign the incorrect token in cases where the characters between parentheses or brackets are all valid for WORD, for instance if it were [this](link).

    Using ANTLR in Python is not more difficult than with any other platform; you just need to pay attention to the version of Python, 2 or 3.

    antlr4 -Dlanguage=Python3 Chat.g4

    That's it. So when you have run the command, inside the directory of your Python project there will be a newly generated parser and lexer. You may find it interesting to look at ChatLexer.py and in particular the function TEXT_sempred (sempred stands for semantic predicate).

    def TEXT_sempred(self, localctx:RuleContext, predIndex:int):
        if predIndex == 0:
            return self._input.LA(-1) == ord('[') or self._input.LA(-1) == ord('(')

    You can see our predicate right in the code. This also means that you have to check that the correct libraries, for the functions used in the predicate, are available to the lexer.

    18. The Python Way of Working with a Listener

    The main file of a Python project is very similar to a Javascript one, mutatis mutandis, of course. That is to say, we have to adapt libraries and functions to the proper version for a different language.

    import sys
    from antlr4 import *
    from ChatLexer import ChatLexer
    from ChatParser import ChatParser
    from HtmlChatListener import HtmlChatListener
     
    def main(argv):
        input = FileStream(argv[1])
        lexer = ChatLexer(input)
        stream = CommonTokenStream(lexer)
        parser = ChatParser(stream)
        tree = parser.chat()
     
        output = open("output.html","w")
        
        htmlChat = HtmlChatListener(output)
        walker = ParseTreeWalker()
        walker.walk(htmlChat, tree)
            
        output.close()      
     
    if __name__ == '__main__':
        main(sys.argv)

    We have also changed the input and output to files; this avoids the need to launch a server in Python and the problem of using characters that are not supported in the terminal.

    import sys
    from antlr4 import *
    from ChatParser import ChatParser
    from ChatListener import ChatListener
     
    class HtmlChatListener(ChatListener) :
        def __init__(self, output):
            self.output = output
            self.output.write('<html><head><meta charset="UTF-8"/></head><body>')
     
        def enterName(self, ctx:ChatParser.NameContext) :
            self.output.write("<strong>") 
     
        def exitName(self, ctx:ChatParser.NameContext) :
            self.output.write(ctx.WORD().getText()) 
            self.output.write("</strong> ") 
     
        def enterColor(self, ctx:ChatParser.ColorContext) :
            color = ctx.WORD().getText()
            ctx.text = '<span style="color: ' + color + '">'        
     
        def exitColor(self, ctx:ChatParser.ColorContext):         
            ctx.text += ctx.message().text
            ctx.text += '</span>'
     
        def exitEmoticon(self, ctx:ChatParser.EmoticonContext) : 
            emoticon = ctx.getText()
     
            if emoticon == ':-)' or emoticon == ':)' :
                ctx.text = "🙂"
        
            if emoticon == ':-(' or emoticon == ':(' :
                ctx.text = "🙁"
     
        def enterLink(self, ctx:ChatParser.LinkContext):
            ctx.text = '<a href="%s">%s</a>' % (ctx.TEXT()[1], (ctx.TEXT()[0]))
     
        def exitMessage(self, ctx:ChatParser.MessageContext):
            text = ''
     
            for child in ctx.children:
                if hasattr(child, 'text'):
                    text += child.text
                else:
                    text += child.getText()
            
            if isinstance(ctx.parentCtx, ChatParser.LineContext) is False:
                ctx.text = text
            else:    
                self.output.write(text)
                self.output.write("</p>") 
     
        def enterCommand(self, ctx:ChatParser.CommandContext):
            if ctx.SAYS() is not None :
                self.output.write(ctx.SAYS().getText() + ':' + '<p>')
     
            if ctx.SHOUTS() is not None :
                self.output.write(ctx.SHOUTS().getText() + ':' + '<p style="text-transform: uppercase">')    
     
        def exitChat(self, ctx:ChatParser.ChatContext):
            self.output.write("</body></html>")

    Apart from enterLink, where we introduce support for links, there is nothing new. Though you might notice that Python syntax is cleaner and, while having dynamic typing, it is not loosely typed like Javascript: the different types of *Context objects are explicitly written out. If only Python tools were as easy to use as the language itself. But of course we cannot just fly over Python like this, so we also introduce testing.

    19. Testing with Python

    While Visual Studio Code has a very nice extension for Python that also supports unit testing, we are going to use the command line for the sake of compatibility.

    python3 -m unittest discover -s . -p ChatTests.py

    That's how you run the tests, but before that we have to write them. Actually, even before that, we have to write an ErrorListener to manage the errors that we could find. While we could simply read the text output by the default error listener, there is an advantage in using our own implementation, namely that we can more easily control what happens.

    import sys
    from antlr4 import *
    from ChatParser import ChatParser
    from ChatListener import ChatListener
    from antlr4.error.ErrorListener import *
    import io
     
    class ChatErrorListener(ErrorListener):
     
        def __init__(self, output):
            self.output = output        
            self._symbol = ''
        
        def syntaxError(self, recognizer, offendingSymbol, line, column, msg, e):        
            self.output.write(msg)
            self._symbol = offendingSymbol.text
     
        @property        
        def symbol(self):
            return self._symbol

    Our class derives from ErrorListener and we simply have to implement syntaxError. We also add a property symbol to easily check which symbol might have caused an error.
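The pattern — record the last offending symbol so tests can assert on it — can be reduced to a few lines. A toy model of the idea (not the ANTLR ErrorListener API; method and field names here are invented):

```python
class CapturingListener:
    """Stores error messages and the last offending symbol."""
    def __init__(self):
        self.messages = []
        self._symbol = ''

    def syntax_error(self, offending_symbol, msg):
        self.messages.append(msg)
        self._symbol = offending_symbol

    @property
    def symbol(self):
        return self._symbol
```

An empty symbol after parsing means no error was reported, which is exactly what the tests below rely on.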

    from antlr4 import *
    from ChatLexer import ChatLexer
    from ChatParser import ChatParser
    from HtmlChatListener import HtmlChatListener
    from ChatErrorListener import ChatErrorListener
    import unittest
    import io
     
    class TestChatParser(unittest.TestCase):
     
        def setup(self, text):        
            lexer = ChatLexer(InputStream(text))        
            stream = CommonTokenStream(lexer)
            parser = ChatParser(stream)
            
            self.output = io.StringIO()
            self.error = io.StringIO()
     
            parser.removeErrorListeners()        
            errorListener = ChatErrorListener(self.error)
            parser.addErrorListener(errorListener)  
     
            self.errorListener = errorListener              
            
            return parser
            
        def test_valid_name(self):
            parser = self.setup("John ")
            tree = parser.name()               
        
            htmlChat = HtmlChatListener(self.output)
            walker = ParseTreeWalker()
            walker.walk(htmlChat, tree)              
     
            # let's check that there aren't any symbols in errorListener         
            self.assertEqual(len(self.errorListener.symbol), 0)
     
        def test_invalid_name(self):
            parser = self.setup("Joh-")
            tree = parser.name()               
        
            htmlChat = HtmlChatListener(self.output)
            walker = ParseTreeWalker()
            walker.walk(htmlChat, tree)              
     
            # let's check the symbol in errorListener
            self.assertEqual(self.errorListener.symbol, '-')
     
    if __name__ == '__main__':
        unittest.main()

    The setup method is used to ensure that everything is properly set up; after creating the parser we also set up our ChatErrorListener, but first we remove the default one, otherwise it would still output errors on the standard output. We are listening to errors in the parser, but we could also catch errors generated by the lexer. It depends on what you want to test. You may want to check both.

    The two proper test methods check for a valid and an invalid name. The checks are linked to the property symbol that we have previously defined: if it's empty, everything is fine; otherwise it contains the symbol that created the error. Notice that in test_valid_name there is a space at the end of the input string, because we have defined the rule name to end with a WHITESPACE token.

    20. Parsing Markup

    ANTLR can parse many things, including binary data; in that case tokens are made up of non-printable characters. But a more common problem is parsing markup languages such as XML or HTML. Markup is also a useful format to adopt for your own creations, because it allows you to mix unstructured text content with structured annotations. Such documents fundamentally represent a form of smart document, containing both text and structured data. The technical term that describes them is island languages. This type is not restricted to markup, and sometimes it's a matter of perspective.

    For example, you may have to build a parser that ignores preprocessor directives. In that case, you have to find a way to distinguish proper code from directives, which obey different rules.

    In any case, the problem with parsing such languages is that there is a lot of text that we don't actually have to parse but cannot ignore or discard, because the text contains useful information for the user and is a structural part of the document. The solution is lexical modes, a way to parse structured content inside a larger sea of free text.

    21. Lexical Modes

    We are going to see how to use lexical modes, by starting with a new grammar.

    lexer grammar MarkupLexer;
     
    OPEN                : '[' -> pushMode(BBCODE) ;
    TEXT                : ~('[')+ ;
     
    // Parsing content inside tags
    mode BBCODE;
     
    CLOSE               : ']' -> popMode ;
    SLASH               : '/' ;
    EQUALS              : '=' ;
    STRING              : '"' .*? '"' ;
    ID                  : LETTERS+ ;
    WS                  : [ \t\r\n] -> skip ;
     
    fragment LETTERS    : [a-zA-Z] ;

    Looking at the first line you could notice a difference: we are defining a lexer grammar, instead of the usual (combined) grammar. This is required because you can use lexical modes only in a lexer grammar, not in a combined grammar. The rest is not surprising; as you can see, we are defining a sort of BBCode markup, with tags delimited by square brackets.

    In the OPEN, mode BBCODE, and CLOSE lines you will find basically all that you need to know about lexical modes: you define one or more tokens that can delimit the different modes and activate them.

    The default mode is already implicitly defined; if you need to define your own, you simply use mode followed by a name. Other than for markup languages, lexical modes are typically used to deal with string interpolation, where a string literal can contain more than simple text, for example arbitrary expressions.
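Conceptually the lexer keeps a stack of modes: pushMode activates a new token set on top of the current one, and popMode returns to the previous one. A minimal sketch of that stack (a toy model, not the generated lexer):

```python
class ModeStack:
    """Toy model of the lexer's mode stack behind pushMode/popMode."""
    DEFAULT_MODE = 0

    def __init__(self):
        self._modes = [self.DEFAULT_MODE]

    @property
    def mode(self):
        return self._modes[-1]  # current mode = top of the stack

    def push_mode(self, mode):
        self._modes.append(mode)

    def pop_mode(self):
        if len(self._modes) > 1:  # never pop the default mode
            self._modes.pop()

BBCODE = 1
```

Because it is a stack, modes can nest: seeing '[' pushes BBCODE, seeing ']' pops it, and the lexer ends up back in the default mode for the free text that follows.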

    When we used a combined grammar we could define tokens implicitly: that is what we did when we used a string like '=' in a parser rule. Now that we are using separate lexer and parser grammars we cannot do that, which means that every single token has to be defined explicitly. So we have definitions like SLASH or EQUALS, which typically could just be used directly in a parser rule. The concept is simple: in the lexer grammar we need to define all tokens, because they cannot be defined later in the parser grammar.

    22. Parser Grammars

    We look at the other side of a lexer grammar, so to speak.

    parser grammar MarkupParser;
     
    options { tokenVocab=MarkupLexer; }
     
    file        : element* ;
     
    attribute   : ID '=' STRING ;
     
    content     : TEXT ;
     
    element     : (content | tag) ;
     
    tag         : '[' ID attribute? ']' element* '[' '/' ID ']' ;

    On the first line we define a parser grammar. Since the tokens we need are defined in the lexer grammar, we need to use an option to tell ANTLR where it can find them. This is not necessary in combined grammars, since the tokens are defined in the same file.

    There are many other options available; see the documentation.

    There is almost nothing else to add, except that we define a content rule so that we can more easily manage the text that we find later in the program.

    I just want to point out that, as you can see, we don't need to explicitly use the tokens every time (e.g. SLASH); instead we can use the corresponding text (e.g. '/').

    ANTLR will automatically transform the text into the corresponding token, but this can happen only if the tokens are already defined. In short, it is as if we had written:

    tag : OPEN ID attribute? CLOSE element* OPEN SLASH ID CLOSE ;

    But we could not have used the implicit way if we hadn't already explicitly defined the tokens in the lexer grammar. Another way to look at this is: when we define a combined grammar, ANTLR defines for us all the tokens that we have not explicitly defined ourselves. When we use a separate lexer and parser grammar, we have to define every token ourselves explicitly. Once we have done that, we can use them in every way we want.

    Before moving to actual Java code, let's see the AST for a sample input.

    You can easily notice that the element rule is sort of transparent: where you would expect to find it there is always going to be a tag or content. So why did we define it? There are two advantages: it avoids repetition in our grammar and simplifies managing the results of the parsing. We avoid repetition because if we did not have the element rule we would have to repeat (content|tag) everywhere it is used. What if one day we add a new type of element? In addition, it simplifies the processing of the AST, because it makes both the nodes representing tag and content extend a common ancestor.

    Advanced

    In this section we deepen our understanding of ANTLR. We will look at more complex examples and situations we may have to handle in our parsing adventures. We will learn how to perform more advanced testing, to catch more bugs and ensure a better quality for our code. We will see what a visitor is and how to use it. Finally, we will see how to deal with expressions and the complexity they bring.

    You can come back to this section when you need to deal with complex parsing problems.

    23. The Markup Project in Java

    You can follow the instructions in Java Setup or just copy the antlr-java folder of the companion repository. Once the file pom.xml is properly configured, this is how you build and execute the application.

    // use mvn to generate the package
    mvn package
    // every time you need to execute the application
    java -cp target/markup-example-1.0-jar-with-dependencies.jar me.tomassetti.examples.MarkupParser.App

    As you can see, it isn't any different from a typical Maven project, although it's indeed more complicated than a typical Javascript or Python project. Of course, if you use an IDE you don't need to do anything different from your typical workflow.

    24. The Main App.java

    We are going to see how to write a typical ANTLR application in Java.

    package me.tomassetti.examples.MarkupParser;
    import org.antlr.v4.runtime.*;
    import org.antlr.v4.runtime.tree.*;
     
    public class App 
    {
        public static void main( String[] args )
        {
            ANTLRInputStream inputStream = new ANTLRInputStream(
                "I would like to [b][i]emphasize[/i][/b] this and [u]underline [b]that[/b][/u] ." +
                "Let's not forget to quote: [quote author=\"John\"]You're wrong![/quote]");
            MarkupLexer markupLexer = new MarkupLexer(inputStream);
            CommonTokenStream commonTokenStream = new CommonTokenStream(markupLexer);
            MarkupParser markupParser = new MarkupParser(commonTokenStream);
     
            MarkupParser.FileContext fileContext = markupParser.file();                
            MarkupVisitor visitor = new MarkupVisitor();                
            visitor.visit(fileContext);        
        }
    }

    At this point the main Java file should not come as a surprise; the only new development is the visitor. Of course, there are the obvious little differences in the names of the ANTLR classes and such. This time we are building a visitor, whose main advantage is the chance to control the flow of the program. While we are still dealing with text, we don't want to display it; we want to transform it from pseudo-BBCode to pseudo-Markdown.

    25. Transforming Code with ANTLR

    The first issue to deal with in our translation from pseudo-BBCode to pseudo-Markdown is a design decision. Our two languages are different, and frankly neither of the two original ones is that well designed.

    BBCode was created as a safety precaution, to make it possible to disallow the use of HTML while giving some of its power to users. Markdown was created to be an easy to read and write format that could be translated into HTML. So they both mimic HTML, and you can actually use HTML in a Markdown document. Let's start to look at how messy a real conversion would be.

    package me.tomassetti.examples.MarkupParser;
     
    import org.antlr.v4.runtime.*;
    import org.antlr.v4.runtime.misc.*;
    import org.antlr.v4.runtime.tree.*;
     
    public class MarkupVisitor extends MarkupParserBaseVisitor<String>
    {
        @Override
        public String visitFile(MarkupParser.FileContext context)
        {
             visitChildren(context);
             
             System.out.println("");
             
             return null;
        }
        
        @Override
        public String visitContent(MarkupParser.ContentContext context)
        {
            System.out.print(context.TEXT().getText());
            
            return visitChildren(context);
        }
    }

    The first version of our visitor prints all the text and ignores all the tags.

    You can see how to control the flow, either by calling visitChildren, or any other visit* function, and deciding what to return. We just need to override the methods that we want to change; otherwise, the default implementation behaves like visitContent: it visits the children nodes and allows the visitor to continue. Just like for a listener, the argument is the proper context type. If you want to stop the visitor, just return null, as in visitFile.

    26. Joy and Pain of Transforming Code

    Transforming code, even at a very simple level, comes with some complications. Let's start easy with some basic visitor methods.

    @Override
    public String visitContent(MarkupParser.ContentContext context)    
    {          
        return context.getText();        
    }    
     
    @Override
    public String visitElement(MarkupParser.ElementContext context)
    {
        if(context.parent instanceof MarkupParser.FileContext)
        {
            if(context.content() != null)            
                System.out.print(visitContent(context.content()));            
            if(context.tag() != null)
                System.out.print(visitTag(context.tag()));
        }    
     
        return null;
    }

    Before looking at the main method, let's look at the supporting ones. First, we have changed visitContent by making it return its text instead of printing it. Second, we have overridden visitElement so that it prints the text of its child, but only if it's a top element, not inside a tag. In both cases, it achieves this by calling the proper visit* method; it knows which one to call because it checks whether it actually has a tag or content node.

    @Override
    public String visitTag(MarkupParser.TagContext context)    
    {
        String text = "";
        String startDelimiter = "", endDelimiter = "";
     
        String id = context.ID(0).getText();
        
        switch(id)
        {
            case "b":
                startDelimiter = endDelimiter = "**";                
            break;
            case "u":
                startDelimiter = endDelimiter = "*";                
            break;
            case "quote":
                String attribute = context.attribute().STRING().getText();
                attribute = attribute.substring(1,attribute.length()-1);
                startDelimiter = System.lineSeparator() + "> ";
                endDelimiter = System.lineSeparator() + "> " + System.lineSeparator() + "> – "
                             + attribute + System.lineSeparator();
            break;
        } 
     
        text += startDelimiter;
     
        for (MarkupParser.ElementContext node: context.element())
        {                
            if(node.tag() != null)
                text += visitTag(node.tag());
            if(node.content() != null)
                text += visitContent(node.content());                
        }        
        
        text += endDelimiter;
        
        return text;        
    }

    visitTag contains more code than any other method, because a tag can also contain other elements, including other tags, which have to be managed themselves and thus cannot simply be printed. We save the content of the ID at the beginning of the method; of course we don't need to check that the corresponding end tag matches, because the parser ensures that, as long as the input is well formed.

    The first complication starts with the u case: as often happens when transforming one language into a different one, there isn't a perfect correspondence between the two. While BBCode tries to be a smarter and safer replacement for HTML, Markdown wants to accomplish the same objective as HTML: to create a structured document. So BBCode has an underline tag, while Markdown does not.

    So we have to make a decision

    Do we want to discard the information, directly print HTML, or something else? We choose something else and convert the underline to an italic. That might seem completely arbitrary, and indeed there is an element of choice in this decision. But the conversion forces us to lose some information, and both are used for emphasis, so we choose the closest thing in the new language.
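The decisions so far can be summarized as a tag-to-delimiter table. A Python sketch mirroring the b and u cases of visitTag (a hypothetical helper, not the actual Java visitor):

```python
# BBCode tag id -> (start, end) Markdown delimiters.
DELIMITERS = {
    'b': ('**', '**'),  # bold stays bold
    'u': ('*', '*'),    # underline becomes italic: a deliberate, lossy choice
}

def wrap(tag_id, inner):
    start, end = DELIMITERS.get(tag_id, ('', ''))
    return start + inner + end
```

Unknown tags fall through with empty delimiters, so their inner text survives even when we have no Markdown equivalent for them.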

    The quote case forces us to make another choice. We can't maintain the information about the author of the quote in a structured way, so we choose to print the information in a way that will make sense to a human reader.

    In the loop over context.element() we do our “magic”: we visit the children and gather their text, then we close with the endDelimiter. Finally, we return the text that we have created.

    That's how the visitor works

    1. every top element visits each child
      • if it's a content node, it directly returns the text
      • if it's a tag , it sets up the correct delimiters and then checks its children. It repeats this process for each child and then returns the gathered text
    2. it prints the returned text
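The steps above can be sketched with plain recursion. In this toy model (not the generated visitor) a tag is a (start, end, children) tuple and content is just a string:

```python
def visit(node):
    if isinstance(node, str):
        return node                        # content: return the text directly
    start, end, children = node            # tag: wrap the gathered children
    return start + ''.join(visit(c) for c in children) + end

# "[b]emphasized [i]nested[/i][/b]" after delimiter mapping:
doc = ('**', '**', ['emphasized ', ('*', '*', ['nested'])])
```

The recursion is exactly why nested tags need no special handling: each tag only concatenates whatever its children return, delimited or not.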

    It's obviously a simple example, but it shows how much freedom you have in managing the visitor once you have launched it. Together with the patterns that we have seen at the beginning of this section, you can see all of the options: return null to stop the visit, return the result of the children to continue, or return something to perform an action ordered at a higher level of the tree.

    27. Advanced Testing

    The use of lexical modes permits handling the parsing of island languages, but it complicates testing.

    We are not going to show MarkupErrorListener.java because we did not change it; if you need it, you can see it in the repository.

    You can run the tests by using the following command.

    mvn test

    Now we are going to look at the test code. We are skipping the setup part because it is also obvious: we just copy the process seen in the main file, but we add our error listener to intercept the errors.

    // private variables inside the class AppTest
    private MarkupErrorListener errorListener;
    private MarkupLexer markupLexer;
     
    public void testText()
    {
        MarkupParser parser = setup("anything in here");
     
        MarkupParser.ContentContext context = parser.content();        
        
        assertEquals("",this.errorListener.getSymbol());
    }
     
    public void testInvalidText()
    {
        MarkupParser parser = setup("[anything in here");
     
        MarkupParser.ContentContext context = parser.content();        
        
        assertEquals("[",this.errorListener.getSymbol());
    }
     
    public void testWrongMode()
    {
        MarkupParser parser = setup("author=\"john\"");                
     
        MarkupParser.AttributeContext context = parser.attribute(); 
        TokenStream ts = parser.getTokenStream();        
        
        assertEquals(MarkupLexer.DEFAULT_MODE, markupLexer._mode);
        assertEquals(MarkupLexer.TEXT,ts.get(0).getType());        
        assertEquals("author=\"john\"",this.errorListener.getSymbol());
    }
     
    public void testAttribute()
    {
        MarkupParser parser = setup("author=\"john\"");
        // we have to manually push the correct mode
        this.markupLexer.pushMode(MarkupLexer.BBCODE);
     
        MarkupParser.AttributeContext context = parser.attribute(); 
        TokenStream ts = parser.getTokenStream();        
        
        assertEquals(MarkupLexer.ID,ts.get(0).getType());
        assertEquals(MarkupLexer.EQUALS,ts.get(1).getType());
        assertEquals(MarkupLexer.STRING,ts.get(2).getType()); 
        
        assertEquals("",this.errorListener.getSymbol());
    }
     
    public void testInvalidAttribute()
    {
        MarkupParser parser = setup("author=/\"john\"");
        // we have to manually push the correct mode
        this.markupLexer.pushMode(MarkupLexer.BBCODE);
        
        MarkupParser.AttributeContext context = parser.attribute();        
        
        assertEquals("/",this.errorListener.getSymbol());
    }

    The first two methods are exactly as before: we simply check that there are no errors, or that there is the correct one because the input itself is erroneous. On lines 30-32 things start to get interesting: the issue is that by testing the rules one by one we don't give the parser the chance to switch automatically to the correct mode. So it always remains in the DEFAULT_MODE, which in our case makes everything look like TEXT. This obviously makes the correct parsing of an attribute impossible.

    The same lines also show how you can check the current mode that you are in, and the exact type of the tokens that are found by the parser, which we use to confirm that indeed everything goes wrong in this case.

    While we could use a string of text that triggers the correct mode each time, that would intertwine the testing with several pieces of code, which is a no-no. So the solution is seen on line 39: we trigger the correct mode manually. Once you have done that, you can see that our attribute is recognized correctly.

    28. Dealing with Expressions

    So far we have written simple parser rules; now we are going to see one of the most challenging parts in analyzing a real (programming) language: expressions. While rules for statements are usually larger, they are quite simple to deal with: you just need to write a rule that encapsulates the structure with all the different optional parts. For instance, a for statement can include all other kinds of statements, but we can simply include them with something like statement*. An expression, instead, can be combined in many different ways.

    An expression usually contains other expressions. For example, the typical binary expression is composed of an expression on the left, an operator in the middle and another expression on the right. This can lead to ambiguities. Think, for example, of the expression 5 + 3 * 2: for ANTLR this expression is ambiguous because there are two ways to parse it. It could either parse it as 5 + (3 * 2) or as (5 + 3) * 2.

    Until this moment we have avoided the problem simply because markup constructs surround the object on which they are applied. So there is no ambiguity in choosing which one to apply first: it's the most external. Imagine if this expression were written as:

    <add>
        <int>5</int>
        <mul>
            <int>3</int>
            <int>2</int>
        </mul>
    </add>

    That would make it obvious to ANTLR how to parse it.

    Expression rules that refer to themselves like this are called left-recursive rules. You might say: just parse whatever comes first. The problem with that is semantic: the addition comes first, but we know that multiplication has precedence over addition. Traditionally the way to solve this problem was to create a complex cascade of specific expressions like this:

    expression     : addition;
    addition       : multiplication ('+' multiplication)* ;
    multiplication : atom ('*' atom)* ;
    atom           : NUMBER ;
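
    To see what this cascade buys us, here is a hedged hand-written sketch (the names are ours, not ANTLR output) of the recursive-descent evaluator such a grammar implies: each rule defers to the next, so atoms bind tightest and additions loosest.

```java
// Hand-written sketch of the cascade: expression -> addition ->
// multiplication -> atom, evaluating as it parses.
public class CascadeEval {
    private final String src;
    private int pos;

    CascadeEval(String src) { this.src = src; }

    static double eval(String src) { return new CascadeEval(src).addition(); }

    // addition : multiplication ('+' multiplication)* ;
    private double addition() {
        double value = multiplication();
        while (pos < src.length() && src.charAt(pos) == '+') {
            pos++;
            value += multiplication();
        }
        return value;
    }

    // multiplication : atom ('*' atom)* ;
    private double multiplication() {
        double value = atom();
        while (pos < src.length() && src.charAt(pos) == '*') {
            pos++;
            value *= atom();
        }
        return value;
    }

    // atom : NUMBER ;
    private double atom() {
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
        return Double.parseDouble(src.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(eval("5+3*2")); // 11.0
    }
}
```

    Because addition calls multiplication before consuming any '+', the multiplication is always evaluated first, which is exactly the precedence we want.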

    This way ANTLR would know to search first for a number, then for multiplications and finally for additions. This is cumbersome and also counterintuitive, because the last expression is the first to be actually recognized. Luckily ANTLR4 can create a similar structure automatically, so we can use a much more natural syntax.

    expression : expression '*' expression
               | expression '+' expression                      
               | NUMBER
               ;

    In practice ANTLR considers the order in which we defined the alternatives to decide the precedence. By writing the rule in this way we are telling ANTLR that multiplication has precedence over addition.

    29. Parsing Spreadsheets

    Now we are prepared to create our last application, in C#. We are going to build the parser for an Excel-like application. In practice, we want to manage the expressions you write in the cells of a spreadsheet.

    grammar Spreadsheet;
     
    expression          : '(' expression ')'                        #parenthesisExp
                        | expression (ASTERISK|SLASH) expression    #mulDivExp
                        | expression (PLUS|MINUS) expression        #addSubExp
                        | <assoc=right>  expression '^' expression  #powerExp
                        | NAME '(' expression ')'                   #functionExp
                        | NUMBER                                    #numericAtomExp
                        | ID                                        #idAtomExp
                        ;
     
    fragment LETTER     : [a-zA-Z] ;
    fragment DIGIT      : [0-9] ;
     
    ASTERISK            : '*' ;
    SLASH               : '/' ;
    PLUS                : '+' ;
    MINUS               : '-' ;
     
    ID                  : LETTER DIGIT ;
     
    NAME                : LETTER+ ;
     
    NUMBER              : DIGIT+ ('.' DIGIT+)? ;
     
    WHITESPACE          : ' ' -> skip;

    With all the knowledge you have acquired so far, everything should be clear except possibly three things:

    1. why the parentheses are there,
    2. what the stuff on the right is,
    3. that thing on line 6.

    The parentheses come first because their only role is to give the user a way to override the precedence of operators, if they need to do so. This graphical representation of the AST should make it clear.

    The things on the right are labels; they are used to make ANTLR generate specific functions for the visitor or listener. So there will be a VisitFunctionExp, a VisitPowerExp, etc. This makes it possible to avoid using one giant visitor for the expression rule.

    The expression relative to exponentiation is different because there are two possible ways to group two sequential expressions of the same type. The first is to execute the one on the left first and then the one on the right; the second is the inverse. This is called associativity. Usually the one that you want is left associativity, which is the default option. Nonetheless exponentiation is right-associative, so we have to signal this to ANTLR.
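
    The practical difference is easy to check with numbers. A quick sketch comparing the two groupings of 2 ^ 3 ^ 2 (plain Java arithmetic, no parser involved):

```java
public class Associativity {
    public static void main(String[] args) {
        // right-associative, as <assoc=right> tells ANTLR: 2 ^ (3 ^ 2)
        double right = Math.pow(2, Math.pow(3, 2));
        // left-associative, the default grouping: (2 ^ 3) ^ 2
        double left = Math.pow(Math.pow(2, 3), 2);
        System.out.println(right); // 512.0
        System.out.println(left);  // 64.0
    }
}
```

    The two groupings give different results, which is why the grammar must state which one is intended.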

    Another way to look at this is: if there are two expressions of the same type, which one has the precedence, the left one or the right one? Again, an image is worth a thousand words.

    We also have support for functions, for alphanumeric variables that represent cells, and for real numbers.

    30. The Spreadsheet Project in C#

    You just need to follow the C# setup: install a NuGet package for the runtime and an ANTLR4 extension for Visual Studio. The extension will automatically generate everything whenever you build your project: parser, listener and/or visitor.

    After you have done that, you can also add grammar files just by using the usual menu Add -> New Item. Do exactly that to create a grammar called Spreadsheet.g4 and put in it the grammar we have just created. Now let's see the main Program.cs.

    using System;
    using Antlr4.Runtime;
     
    namespace AntlrTutorial
    {
        class Program
        {
            static void Main(string[] args)
            {
                string input = "log(10 + A1 * 35 + (5.4 - 7.4))";
     
                AntlrInputStream inputStream = new AntlrInputStream(input);
                SpreadsheetLexer spreadsheetLexer = new SpreadsheetLexer(inputStream);
                CommonTokenStream commonTokenStream = new CommonTokenStream(spreadsheetLexer);
                SpreadsheetParser spreadsheetParser = new SpreadsheetParser(commonTokenStream);
     
                SpreadsheetParser.ExpressionContext expressionContext = spreadsheetParser.expression();
                SpreadsheetVisitor visitor = new SpreadsheetVisitor();
                
                Console.WriteLine(visitor.Visit(expressionContext));
            }
        }
    }

    There is not much to say, apart from the fact that, of course, you have to pay attention to yet another slight variation in the naming of things: pay attention to the casing. For instance, AntlrInputStream in the C# program was ANTLRInputStream in the Java program.

    You can also notice that, this time, we output the result of our visitor on the screen instead of writing it to a file.

    31. Excel is Doomed

    We are going to take a look at our visitor for the Spreadsheet project.

    public class SpreadsheetVisitor : SpreadsheetBaseVisitor<double>
    {
        private static DataRepository data = new DataRepository();
     
        public override double VisitNumericAtomExp(SpreadsheetParser.NumericAtomExpContext context)
        {            
            return double.Parse(context.NUMBER().GetText(), System.Globalization.CultureInfo.InvariantCulture);
        }
     
        public override double VisitIdAtomExp(SpreadsheetParser.IdAtomExpContext context)
        {
            String id = context.ID().GetText();
     
            return data[id];
        }
     
        public override double VisitParenthesisExp(SpreadsheetParser.ParenthesisExpContext context)
        {
            return Visit(context.expression());
        }
     
        public override double VisitMulDivExp(SpreadsheetParser.MulDivExpContext context)
        {
            double left = Visit(context.expression(0));
            double right = Visit(context.expression(1));
            double result = 0;
     
            if (context.ASTERISK() != null)
                result = left * right;
            if (context.SLASH() != null)
                result = left / right;
     
            return result;
        }
     
        [..]
     
        public override double VisitFunctionExp(SpreadsheetParser.FunctionExpContext context)
        {
            String name = context.NAME().GetText();
            double result = 0;
     
            switch(name)
            {
                case "sqrt":
                    result = Math.Sqrt(Visit(context.expression()));
                    break;
     
                case "log":
                    result = Math.Log10(Visit(context.expression()));
                    break;
            }
     
            return result;
        }
    }

    VisitNumericAtomExp and VisitIdAtomExp return the actual numbers that are represented either by the literal number or by the variable. In a real scenario DataRepository would contain methods to access the data in the proper cell, but in our example it is just a Dictionary with some keys and numbers. The other methods actually work in the same way: they visit/call the contained expression(s). The only difference is what they do with the results.

    Some perform an operation on the result, the binary operations combine two results in the proper way, and finally VisitParenthesisExp just reports the result higher up the chain. Math is simple, when it's done by a computer.

    32. Testing Everything

    Up until now we have only tested the parser rules; that is to say, we have tested only whether we have created the correct rule to parse our input. Now we are also going to test the visitor functions. This is the ideal chance because our visitor returns values that we can check individually. On other occasions, for instance if your visitor prints something to the screen, you may want to rewrite the visitor to write to a stream. Then, at testing time, you can easily capture the output.

    We are not going to show SpreadsheetErrorListener.cs because it's the same as the previous one we have already seen; if you need it, you can see it on the repository.

    To perform unit testing in Visual Studio you need to create a specific project inside the solution. You can choose different formats; we opt for the xUnit version. To run the tests there is an aptly named section “TEST” on the menu bar.

    [Fact]
    public void testExpressionPow()
    {
        setup("5^3^2");
     
        PowerExpContext context = parser.expression() as PowerExpContext;
     
        CommonTokenStream ts = (CommonTokenStream)parser.InputStream;   
     
        Assert.Equal(SpreadsheetLexer.NUMBER, ts.Get(0).Type);
        Assert.Equal(SpreadsheetLexer.T__2, ts.Get(1).Type);
        Assert.Equal(SpreadsheetLexer.NUMBER, ts.Get(2).Type);
        Assert.Equal(SpreadsheetLexer.T__2, ts.Get(3).Type);
        Assert.Equal(SpreadsheetLexer.NUMBER, ts.Get(4).Type); 
    }
     
    [Fact]
    public void testVisitPowerExp()
    {
        setup("4^3^2");
     
        PowerExpContext context = parser.expression() as PowerExpContext;
     
        SpreadsheetVisitor visitor = new SpreadsheetVisitor();
        double result = visitor.VisitPowerExp(context);
     
        Assert.Equal(double.Parse("262144"), result);
    }
     
    [..]
     
    [Fact]
    public void testWrongVisitFunctionExp()
    {
        setup("logga(100)");
     
        FunctionExpContext context = parser.expression() as FunctionExpContext;
        
        SpreadsheetVisitor visitor = new SpreadsheetVisitor();
        double result = visitor.VisitFunctionExp(context);
     
        CommonTokenStream ts = (CommonTokenStream)parser.InputStream;
     
        Assert.Equal(SpreadsheetLexer.NAME, ts.Get(0).Type);
        Assert.Equal(null, errorListener.Symbol);
        Assert.Equal(0, result);
    }
     
    [Fact]
    public void testCompleteExp()
    {
        setup("log(5+6*7/8)");
     
        ExpressionContext context = parser.expression();
     
        SpreadsheetVisitor visitor = new SpreadsheetVisitor();
        double result = visitor.Visit(context);
     
        Assert.Equal("1.01072386539177", result.ToString(System.Globalization.CultureInfo.GetCultureInfo("en-US").NumberFormat));            
    }

    The first test function is similar to the ones we have already seen; it checks that the correct tokens are selected. On lines 11 and 13 you may be surprised to see that weird token type; this happens because we didn't explicitly create one for the '^' symbol, so one was automatically created for us. If you need to, you can see all the tokens by looking at the *.tokens file generated by ANTLR.

    On line 25 we visit our test node and get the result, which we check on line 27. It's all very simple because our visitor is simple, and while unit testing should always be easy and made up of small parts, it really can't be easier than this.

    The only thing to pay attention to is related to the format of the number. It's not a problem here, but look at line 59, where we test the result of a whole expression: there we need to make sure that the correct format is selected, because different countries use different symbols as the decimal mark.

    There are some things that depend on the cultural context.

    If your computer were already set to the American English culture this wouldn't be necessary, but to guarantee correct test results for everybody we have to specify it. Keep that in mind if you are testing things that are culture-dependent, such as the grouping of digits, temperatures, etc.
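
    A minimal illustration of the issue, here in Java (the same concern applies to the C# CultureInfo calls above): the decimal mark changes with the locale, so a parsed or formatted number is only comparable if you pin the culture.

```java
import java.util.Locale;

public class CultureDemo {
    public static void main(String[] args) {
        double value = 1.5;
        // the decimal mark is a period in the US...
        System.out.println(String.format(Locale.US, "%.2f", value));      // 1.50
        // ...but a comma in, for instance, Germany
        System.out.println(String.format(Locale.GERMANY, "%.2f", value)); // 1,50
    }
}
```

    This is why the test pins the en-US number format explicitly before comparing strings.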

    On lines 44-46 you see that when we check for the wrong function the parser actually works. That's because “logga” is indeed syntactically valid as a function name, but it's not semantically correct. The function “logga” doesn't exist, so our program doesn't know what to do with it. So when we visit it we get 0 as a result. As you recall, this was our choice: since we initialize the result to 0 and we don't have a default case in VisitFunctionExp, if there is no matching function the result remains 0. A possible alternative would be to throw an exception.

    Final Remarks

    In this section we see tips and tricks that never came up in our examples but can be useful in your programs. We also suggest some resources you may find useful if you want to know more about ANTLR, both the practice and the theory, or if you need to deal with the most complex problems.

    33. Tips and Tricks

    Let's see a few tricks that could be useful from time to time. These were never needed in our examples, but they have been quite useful in other scenarios.

    Catchall Rule

    The first one is the ANY lexer rule. This is simply a rule in the following format.

    ANY : . ;

    This is a catchall rule that should be put at the end of your grammar. It matches any character that didn't find its place during lexing. So creating this rule can help you during development, when your grammar still has many holes that could cause distracting error messages. It's even useful in production, where it acts as a canary in the mine: if it shows up in your program, you know that something is wrong.

    Channels

    There is also something that we haven't talked about: channels. Their typical use case is handling comments. You don't really want to check for comments inside every statement or expression, so you usually throw them away with -> skip. But there are some cases where you may want to preserve them, for instance if you are translating a program into another language. When this happens you use channels. There is already one called HIDDEN that you can use, but you can declare more of them at the top of your lexer grammar.

    channels { UNIQUENAME }
    // and you use them this way
    COMMENTS : '//' ~[\r\n]+ -> channel(UNIQUENAME) ;

    Rule Element Labels

    There is another use of labels besides distinguishing among different cases of the same rule. They can be used to give a specific name, usually but not always of semantic value, to a common rule or to parts of a rule. The format is label=rule, to be used inside another rule.

    expression : left=expression (ASTERISK|SLASH) right=expression ;

    This way left and right become fields in the ExpressionContext nodes, and instead of using context.expression(0) you can refer to the same entity using context.left.

    Problematic Tokens

    In many real languages some symbols are reused in different ways, some of which may lead to ambiguities. A commonly problematic example is the angle bracket, used both for bitshift expressions and to delimit parameterized types.

    // bitshift expression, it assigns to x the value of y shifted by three bits
    x = y >> 3;
    // parameterized types, it defines x as a list of dictionaries
    List<Dictionary<string, int>> x;

    The natural way of defining the bitshift operator token is as a single token of two angle brackets, '>>'. But this might lead to confusing a nested parameterized definition with the bitshift operator, as in the second example shown above. While a simple way of solving the problem would be to use semantic predicates, an excessive number of them would slow down the parsing phase. The solution is to avoid defining the bitshift operator token and instead to use the angle bracket twice in the parser rule, so that the parser itself can choose the best candidate for every occasion.

    // from this
    RIGHT_SHIFT : '>>';
    expression : ID RIGHT_SHIFT NUMBER;
    // to this
    SHIFT : '>' ;
    expression : ID SHIFT SHIFT NUMBER;

    34. Conclusions

    We have learned a lot today:

    • what a lexer and a parser are
    • how to create lexer and parser rules
    • how to use ANTLR to generate parsers in Java, C#, Python and JavaScript
    • the fundamental kinds of problems you will encounter while parsing and how to solve them
    • how to understand errors
    • how to test your parsers

    That's all you need to know to use ANTLR on your own. And I mean it literally: you may want to know more, but now you have a solid basis to explore on your own.

    Where to look if you need more information about ANTLR:

    Also, the book is the only place where you can find answers to questions like these:

    ANTLR v4 is the result of a minor detour (twenty-five years) I took in graduate
    school. I guess I'm going to have to change my motto slightly.

    Why program by hand in five days what you can spend twenty-five years of your
    life automating?

    We worked quite hard to build the largest tutorial on ANTLR: the mega-tutorial! A post over 13,000 words long, or more than 30 pages, that tries to answer all your questions about ANTLR. Missing something? Contact us and let us know, we are here to help.

    Translated from: https://www.javacodegeeks.com/2017/03/antlr-mega-tutorial.html

  • Getting Started with ANTLR v4: Tutorial and Examples

    2015-04-30 21:30:05

    1 Rediscovering ANTLR

    I first learned about ANTLR years ago while studying the source code of the Apache Derby database. Seeing the SQL parsing layer was the first time I saw compiler theory applied in a real project; I was amazed, but could only admire it from afar. I had tried ANTLR once before following some online material, and since the introductions said ANTLR v4 is much easier to use, curiosity got me to try it again. The following introduction is mainly based on The Definitive ANTLR 4 Reference, written by the author of ANTLR.

    1.1 An Overview of ANTLR

    When we implement a language, we need to build an application that reads sentences and reacts to the elements of the input. If the application computes or executes the sentence, we call it an interpreter; calculators, configuration-file readers and the Python interpreter are all interpreters. If we transform sentences into another language, we call it a translator; Java-to-C# translators and compilers are translators. Either way, the application must first recognize all valid sentences, phrases and sub-phrases. A program that recognizes a language is called a parser, or syntax analyzer. Our focus is how to implement our own parser for a target language, such as a DSL, a configuration file, a custom SQL dialect, and so on.

    1.2 Metaprogramming

    Writing a parser by hand is extremely tedious, which is why we have ANTLR. We only need to write an ANTLR grammar file describing the syntax of the language to be parsed, and ANTLR automatically generates a parser for it. In other words, ANTLR is a program that writes programs. The concept of metaprogramming often comes up when learning LISP or Ruby macros; the ANTLR grammar language we use to declare our language is a meta-language.

    1.3 The Parsing Process

    For simplicity, we split parsing into two stages, mirroring how our brain reads text. When we read a sentence, in the first stage the brain subconsciously groups characters into words and recognizes their meaning, like looking them up in a dictionary. In the second stage, the brain recognizes the structure of the sentence from the recognized words. The first stage is called lexical analysis, performed by a lexer, which groups symbols (tokens) into token classes (token types). The second stage is the actual parser; by default ANTLR builds a parse tree, also called a syntax tree. The figure below shows the parsing of a simple assignment expression:
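
    The two stages can be illustrated with a toy tokenizer (the token names are invented for the example; a real ANTLR lexer is generated from the grammar): it groups the characters of an assignment like "sp = 100;" into tokens, and drops the whitespace just as an ANTLR skip rule would.

```java
import java.util.ArrayList;
import java.util.List;

public class ToyLexer {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; } // dropped, like skip
            int start = i;
            if (Character.isLetter(c)) {
                while (i < input.length() && Character.isLetter(input.charAt(i))) i++;
                tokens.add("ID(" + input.substring(start, i) + ")");
            } else if (Character.isDigit(c)) {
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                tokens.add("INT(" + input.substring(start, i) + ")");
            } else {
                tokens.add("SYM(" + c + ")");
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("sp = 100;"));
        // [ID(sp), SYM(=), INT(100), SYM(;)]
    }
}
```

    The parser then consumes this token stream, not the raw characters, to build the tree.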



    The leaves of the syntax tree are the input tokens, while the inner nodes are the phrase names that group their children; a linear sentence is really a serialization of its syntax tree. The benefits of generating a syntax tree are:

    1) A tree structure is easy to traverse and process, is easily understood by programmers, and is convenient for application code to process further.

    2) Multiple interpreting or translating applications can reuse one parser. ANTLR also supports embedding application code into the grammar, like traditional parser generators.

    3) For translators that need multiple passes because of computation dependencies, the syntax tree is very useful: instead of invoking the parser several times, we can efficiently traverse the syntax tree several times.

    1.4 A Deeper Look at ANTLR

    The parser ANTLR generates is a recursive-descent parser, a kind of top-down parser. As the name implies, recursive descent means that parsing recurses from the root of the syntax tree down to the leaves (tokens); a nice property is that the call graph of the code matches the tree nodes. Taking the assignment-expression example again, the recursive-descent parser code looks roughly like this:



    Assign is simple: it reads the input characters in order without making any choice. The root node Stat is more complex, because it has several alternatives. When parsing, it must look ahead some characters before it can decide which branch to take; sometimes it even has to read the whole input before it can predict correctly, and ANTLR silently handles all of this for us!

    1.5 Applications on the Parse Tree

    Internally, ANTLR's data structures share data wherever possible to save memory, a consideration also seen in Nginx and in String implementations. As shown below, the leaf nodes of the parse tree point to tokens in the token stream, and the start/end character indexes in each token point into the character stream rather than copying substrings. Characters such as whitespace that are not associated with any token are simply discarded by the lexer.



    ANTLR generates a context object for each rule, recording all the information gathered during recognition. ANTLR provides two traversal mechanisms: listeners and visitors. A listener is fully automatic: ANTLR drives the depth-first traversal and we only handle the events. A visitor gives us control over the traversal: we decide whether to explicitly call the visit methods of child nodes.



    1.6 New Features in ANTLR v4

    I haven't used it in depth yet, so I don't know the v4 features well; here is an excerpt from "A summary of ANTLR v4's new features and differences from v3":

    1) Lower learning curve. Compared with v3, v4 focuses more on parsing languages in a way closer to natural language. For example, for operator precedence, the alternative listed first has the highest precedence.

    2) Clearer layering and easier maintenance. The visitor and listener patterns separate parsing from application code; the new import feature lets lexers and parsers become shared, reusable components.

    3) A new algorithm. The LL(*) algorithm is improved into the new Adaptive LL(*) algorithm, which analyzes the grammar dynamically at run time, whereas LL(*) had to analyze the grammar statically, considering every syntactic possibility.

    4) New features, such as rewriting the input stream, sending tokens on different channels, island grammars and associativity, which make it easier and more flexible to process the parsed objects in an application.

    5) Performance. In v3 both the parsing code and the application code were generated; v4 separates the parsing from the application code, so developers control the implementation and performance of the application code themselves. The new algorithm, according to the official guide, costs some speed, so ANTLR provides SLL(*)/LL(*) switches controllable through the API.

    2 Preparation

    2.1 Installing the IDE Plugin

    I'm using IntelliJ IDEA, so I searched Plugins for "ANTLR v4 grammar plugin" and restarted IDEA. To use ANTLR outside an IDE, download the ANTLR package, which is written in Java; every operation done in IDEA below can also be performed manually from the command line.



    2.2 A Practical Little Example

    I didn't find a really good example online, but the introductory example in chapter 3 of The Definitive ANTLR 4 Reference is nice, so let's implement it. What does this handy little program do? It translates short[] data = {1, 2, 3} into string data = "\u0001\u0002\u0003".
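
    The transformation itself, stripped of any parsing, can be sketched like this (a hypothetical helper, just to show the target output the listener will produce):

```java
public class ShortArrayTranslator {
    // Turns a short[] initializer into a string literal made of
    // unicode escapes, mirroring the tutorial's target transformation.
    static String translate(short[] data) {
        StringBuilder sb = new StringBuilder("\"");
        for (short v : data) {
            sb.append(String.format("\\u%04x", v));
        }
        return sb.append("\"").toString();
    }

    public static void main(String[] args) {
        System.out.println("string data = " + translate(new short[]{1, 2, 3}) + ";");
    }
}
```

    Running it prints the translated declaration; the real program computes the same text, but driven by the parse tree of the original declaration.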

    3 Implementing the Parser

    3.1 Writing the .g4 File

    Create a file with the g4 extension; only then does right-clicking the file show the ANTLR plugin menu.



    3.2 Generating the Code

    Right-clicking the .g4 file shows the ANTLR plugin's two menu entries, one for configuring the parameters of the ANTLR generation tool (each has a command-line counterpart) and one for triggering file generation. First open the configuration menu and point the output directory at main/java or test/java. Note: ANTLR automatically creates the package folders from the Package/namespace setting; you don't need to create them beforehand.



    Then click the generate menu entry, and the following code is generated into the configured directory:


    4 Building the Application Code

    With the generated parser, we can build something fun on top of it.

    4.1 The ANTLR Runtime

    Before writing application code, we must add the ANTLR runtime. Our generated parser is really just a bunch of callback hooks; the generic parsing machinery lives in the ANTLR runtime package. With Maven, the ANTLR v4 dependency is:

    <dependency>
        <groupId>org.antlr</groupId>
        <artifactId>antlr4-runtime</artifactId>
        <version>4.5</version>
    </dependency>

    4.2 Application Code

    Using what we learned earlier, we implement a Listener to do the translation. Then, in main(), we wire up the lexer and the parser, together with the streams and the syntax tree that connect them.



     

  • ANTLR Tutorial

    ANTLR Tutorial

    ANTLR stands for ANother Tool for Language Recognition. The tool can generate a compiler or interpreter for any computer language. Besides the obvious uses, such as parsing a genuinely "big" programming language like Java, PHP or SQL, it also helps with smaller, more common tasks.

    It is useful whenever you need to evaluate expressions unknown at compile time, or to parse unusual user input or files. Of course, a custom hand-written parser could be created for any of these tasks, but that usually costs much more time and effort. A little knowledge of a good parser generator can turn these time-consuming tasks into quick and easy exercises.

    This post starts with a small example of ANTLR's usefulness, then explains what ANTLR is and how it works. Finally, we show how to compile a simple "Hello word!" language into an abstract syntax tree. The post also shows how to add error handling and how to test the language.

    The next post shows how to create a real expression language.

    Real-World Examples

    ANTLR seems popular in the open-source world. Among others, Apache Camel, Apache Lucene, Apache Hadoop, Groovy and Hibernate all use it; they all need parsers for custom languages. For example, Hibernate uses ANTLR to parse its query language HQL.

    All of these are big frameworks, so they are more likely than small applications to need domain-specific languages. A list of smaller projects using ANTLR can be found on its showcase list. We also found a Stack Overflow discussion on the topic.

    To see where ANTLR is useful and how it saves time, try estimating the following requirements:

    • Add a formula calculator to an accounting system. It computes the value of formulas such as (10 + 80)*sales_tax.
    • Add an extended search field to a recipe search engine. It searches for recipes matching expressions such as (chicken and orange) or (no meat and carrot).

    Our safe estimate was a day and a half each, including documentation, tests and integration into the project. If you face similar requirements and made a higher estimate, ANTLR is worth a look.

    Overview

    ANTLR is a code generator. It takes a so-called grammar file as input and generates two classes: a lexer and a parser.

    The lexer runs first and splits the input into pieces called tokens. Each token represents a more or less meaningful piece of the input. The token stream is passed to the parser, which does all the necessary work; the parser is responsible for building the abstract syntax tree, interpreting the code or translating it into some other form.

    The grammar file contains everything ANTLR needs to generate a correct lexer and parser: whether it should generate Java or Python classes, whether the parser builds an abstract syntax tree, emits assembly code or interprets the code directly, and so on. As this tutorial shows how to build an abstract syntax tree, we ignore the other options in what follows.

    Most importantly, the grammar file describes how to split the input into tokens and how to build the tree from the tokens. In other words, the grammar file contains lexer rules and parser rules.

    Each lexer rule describes one token:

    TokenName: regular expression;

    Parser rules are more complicated. The most basic version looks like a lexer rule:

    ParserRuleName: regular expression;

    They may contain modifiers that specify special transformations on the input, mark root and child elements in the resulting abstract syntax tree, or attach actions executed when the rule is applied. Almost all of the work is usually done inside parser rules.

    Infrastructure

    First we show tools that make ANTLR development easier. Of course, nothing described in this chapter is required; all the examples work with just Maven, a text editor and an internet connection.

    The ANTLR project made a standalone IDE, an Eclipse plugin and an IDEA plugin. We did not find a NetBeans plugin.

    ANTLRWorks

    The standalone IDE is called ANTLRWorks. Download it from the project download page. ANTLRWorks is a single jar file; run it with the java -jar antlrworks-1.4.3.jar command.

    The IDE has more features and is more stable than the Eclipse plugin.

    Eclipse Plugin

    Download and unpack ANTLR v3 from the ANTLR download page. Then install the ANTLR plugin from the Eclipse Marketplace:

    Go to Preferences and configure the ANTLR v3 installation directory:

    To test the configuration, download the sample grammar file and open it in Eclipse. It opens in the ANTLR editor, which has three tabs:

    • Grammar – a text editor with syntax highlighting, code completion and so on.
    • Interpreter – compiles a test expression into a syntax tree; it may produce results different from the generated parser, and tends to throw failed-predicate exceptions on correct expressions.
    • Railroad view – draws nice diagrams of your lexer and parser rules.

    Empty Project – Maven Configuration

    This chapter shows how to add ANTLR to a Maven project. If you use Eclipse and have not installed the m2eclipse plugin yet, install it from the http://download.eclipse.org/technology/m2e/releases update site; it will make your life much easier.

    Creating the Project

    Create a new Maven project and specify maven-archetype-quickstart on the "Select archetype" screen. Without Eclipse, the command mvn archetype:generate achieves the same.

    Dependencies

    Add the ANTLR dependency to pom.xml:

    <dependency>
      <groupId>org.antlr</groupId>
      <artifactId>antlr</artifactId>
      <version>3.3</version>
      <type>jar</type>
      <scope>compile</scope>
    </dependency>
    Note: since ANTLR has no history of backward compatibility, it is best to specify the version you need.

    The Plugin

    The ANTLR Maven plugin runs in the generate-sources phase and generates the lexer and parser Java classes from grammar (.g) files. Add it to pom.xml:

    <plugin>
      <groupId>org.antlr</groupId>
      <artifactId>antlr3-maven-plugin</artifactId>
      <version>3.3</version>
      <executions>
        <execution>
          <id>run antlr</id>
          <phase>generate-sources</phase>
          <goals>
            <goal>antlr</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
    Create the src/main/antlr3 folder; the plugin expects all grammar files in it.

    The generated files are placed in the target/generated-sources/antlr3 directory. Since this directory is not on the default Maven build path, we add it with the build-helper-maven-plugin:

    <plugin>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>build-helper-maven-plugin</artifactId>
      <executions>
        <execution>
          <id>add-source</id>
          <phase>generate-sources</phase>
          <goals>
            <goal>add-source</goal>
          </goals>
          <configuration>
            <sources>
              <source>${basedir}/target/generated-sources/antlr3</source>
            </sources>
          </configuration>
        </execution>
      </executions>
    </plugin>

    If you use Eclipse, you must update the project configuration: right-click the project -> 'Maven' -> 'Update project configuration'.

    Test It

    Invoke Maven to test the project configuration: right-click the project -> 'Run As' -> 'Maven generate-sources'. Alternatively, use the mvn generate-sources command.

    The build should succeed. The console output should contain the antlr3-maven-plugin output:

    [INFO] --- antlr3-maven-plugin:3.3:antlr (run antlr) @ antlr-step-by-step ---
    [INFO] ANTLR: Processing source directory C:\meri\ANTLR\workspace\antlr-step-by-step\src\main\antlr3
    [INFO] No grammars to process
    ANTLR Parser Generator  Version 3.3 Nov 30, 2010 12:46:29

    It should be followed by the build-helper-maven-plugin output:

    [INFO] --- build-helper-maven-plugin:1.7:add-source (add-source) @ antlr-step-by-step ---
    [INFO] Source directory: C:\meri\ANTLR\workspace\antlr-step-by-step\target\generated-sources\antlr3 added.

    The result of this phase is on GitHub, tagged 001-configured_antlr.

    Hello Word

    We will create the simplest possible language parser – a hello word parser. It builds a small abstract syntax tree from a single expression: "Hello word!".

    We use it to show how to create a grammar file and generate the ANTLR classes from it. Then we show how to use the generated files and how to create a unit test.

    The First Grammar File

    The antlr3-maven-plugin searches the src/main/antlr3 directory for grammar files. It creates a new package for each subdirectory and generates the parser and lexer classes into it. Since we want the classes generated into the org.meri.antlr_step_by_step.parsers package, we must create the src/main/antlr3/org/meri/antlr_step_by_step/parsers directory.

    The grammar name and the file name must be the same, and the file must have the .g suffix. In addition, every grammar file begins with a grammar name declaration. Our S001HelloWord grammar begins with the following line:

    grammar S001HelloWord;

    The declaration is always followed by the generator options. We are working on a Java project and want to compile expressions into an abstract syntax tree:

    options {
        // antlr will generate java lexer and parser
        language = Java;
        // generated parser should create abstract syntax tree
        output = AST;
    }

    ANTLR does not generate a package declaration on top of the generated classes. We have to enforce it with the @parser::header and @lexer::header blocks, which must follow the options block:

    @lexer::header {
      package org.meri.antlr_step_by_step.parsers;
    }
    
    @parser::header {
      package org.meri.antlr_step_by_step.parsers;
    }

    Each grammar file must have at least one lexer rule, and every lexer rule must begin with an uppercase letter. We have two rules: the first defines a salutation token, the second an endsymbol token. The salutation must be 'Hello word' and the end symbol must be '!'.

    SALUTATION:'Hello word';   
    ENDSYMBOL:'!';

    Likewise, each grammar file must have at least one parser rule, and every parser rule must begin with a lowercase letter. We have only one parser rule: any expression in our language must consist of a salutation followed by an end symbol.

    expression : SALUTATION ENDSYMBOL;

    注意:语法文件元素的顺序是固定的。 如果更改它,则antlr插件将失败。
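    上面的词法规则和解析器规则组合起来,只接受唯一的句子“Hello word!”。作为示意,可以用一段独立的纯 Java 草图模拟生成的识别器的整体行为(其中 ToyHelloWordRecognizer 这个类名是为演示虚构的,并非 ANTLR 生成的代码):

```java
// 纯 Java 示意:与 S001HelloWord 语法等价的手写识别器。
// ToyHelloWordRecognizer 为演示虚构的类名,并非 ANTLR 生成的代码。
public class ToyHelloWordRecognizer {

    // 对应词法规则 SALUTATION : 'Hello word';
    static final String SALUTATION = "Hello word";
    // 对应词法规则 ENDSYMBOL : '!';
    static final String ENDSYMBOL = "!";

    // 对应解析器规则 expression : SALUTATION ENDSYMBOL;
    public static boolean accepts(String input) {
        return (SALUTATION + ENDSYMBOL).equals(input);
    }
}
```

    真正由 ANTLR 生成的词法分析器会先把输入切分为令牌,再由解析器按规则匹配;这个草图只演示“该语法接受的语言只有一个句子”这一事实。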

    生成词法分析器和解析器

    从命令行使用 mvn generate-sources 命令,或在 Eclipse 中生成词法分析器和解析器:

    • 右键单击该项目。
    • 点击“运行方式”。
    • 单击“ Maven生成源”。

    Antlr 插件将创建 target/generated-sources/antlr3/org/meri/antlr_step_by_step/parsers 文件夹,并将 S001HelloWordLexer.java 和 S001HelloWordParser.java 文件放入其中。

    使用Lexer和Parser

    最后,我们创建编译器类。 它只有一种公共方法,该方法:

    • 调用生成的词法分析器将输入拆分为令牌,
    • 调用生成的解析器以根据令牌构建AST,
    • 将结果AST树打印到控制台中,
    • 返回抽象语法树。

    编译器位于S001HelloWordCompiler类中:

    public CommonTree compile(String expression) {
      try {
        //lexer splits input into tokens
        ANTLRStringStream input = new ANTLRStringStream(expression);
        TokenStream tokens = new CommonTokenStream(new S001HelloWordLexer(input));

        //parser generates abstract syntax tree
        S001HelloWordParser parser = new S001HelloWordParser(tokens);
        S001HelloWordParser.expression_return ret = parser.expression();

        //acquire parse result
        CommonTree ast = (CommonTree) ret.tree;
        printTree(ast);
        return ast;
      } catch (RecognitionException e) {
        throw new IllegalStateException("Recognition exception is never thrown, only declared.");
      }
    }

    注意:不必担心在S001HelloWordParser.expression()方法上声明的RecognitionException异常。 它永远不会被抛出。

    测试它

    在本章结束时,我们将使用一个针对新编译器的小测试用例。 创建S001HelloWordTest类:

    public class S001HelloWordTest {
     /**
      * Abstract syntax tree generated from "Hello word!" should have an 
      * unnamed root node with two children. First child corresponds to 
      * salutation token and second child corresponds to end symbol token.
      * 
      * Token type constants are defined in generated S001HelloWordParser 
      * class.
      */
     @Test
     public void testCorrectExpression() {
      //compile the expression
      S001HelloWordCompiler compiler = new S001HelloWordCompiler();
      CommonTree ast = compiler.compile("Hello word!");
      CommonTree leftChild = ast.getChild(0);
      CommonTree rightChild = ast.getChild(1);
    
      //check ast structure
      assertEquals(S001HelloWordParser.SALUTATION, leftChild.getType());
      assertEquals(S001HelloWordParser.ENDSYMBOL, rightChild.getType());
     }
    
    }

    测试将成功通过。 它将抽象语法树打印到控制台:

    0 null
      -- 4 Hello word
      -- 5 !
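    打印出的格式可以这样理解:根节点是类型为 0、文本为 null 的节点,每个子节点占一行,以“--”为前缀,显示令牌类型编号(SALUTATION 为 4,ENDSYMBOL 为 5)和令牌文本。下面是一个模拟这种输出格式的独立 Java 草图(Node、render 均为演示虚构,并非 ANTLR 的 CommonTree API):

```java
import java.util.ArrayList;
import java.util.List;

// 纯 Java 示意:重现上面 printTree 输出的格式 ——
// 根行打印“类型 文本”,每个子节点行缩进并加 "-- " 前缀。
public class ToyTreePrinter {

    static class Node {
        final int type;          // 令牌类型编号(如 SALUTATION 为 4)
        final String text;       // 令牌文本
        final List<Node> children = new ArrayList<>();
        Node(int type, String text) { this.type = type; this.text = text; }
    }

    static String render(Node root) {
        StringBuilder sb = new StringBuilder();
        sb.append(root.type).append(' ').append(root.text).append('\n');
        for (Node child : root.children) {
            sb.append("  -- ").append(child.type)
              .append(' ').append(child.text).append('\n');
        }
        return sb.toString();
    }
}
```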

    IDE中的语法

    在编辑器中打开S001HelloWord.g并进入解释器选项卡。

    • 在左上方视图中突出显示表达式规则。
    • 在右上方的视图中输入“Hello word!”。
    • 按左上角的绿色箭头。

    解释器将生成解析树:

    复制语法

    本教程中的每个新语法都基于先前的语法。 我们汇总了将旧语法复制到新语法所需的步骤列表。 使用它们将OldGrammar复制到NewGrammar:

    错误处理

    任何任务都离不开适当的错误处理。生成的 ANTLR 类会尽可能地从错误中恢复。它们确实会向控制台报告错误,但没有现成的 API 可以以编程方式获取语法错误。

    如果我们只构建命令行编译器,那可能很好。 但是,假设我们正在为我们的语言构建GUI,或将结果用作其他工具的输入。 在这种情况下,我们需要对所有生成的错误进行API访问。

    在本章的开头,我们将尝试使用默认错误处理并为其创建测试用例。 然后,我们将添加一个简单的错误处理,只要发生第一个错误,该处理就会抛出异常。 最后,我们将转向“真实”解决方案。 它将在内部列表中收集所有错误并提供访问它们的方法。

    作为副产品,本章介绍了如何:

    默认错误处理

    首先,我们将尝试解析各种不正确的表达式。 目的是了解默认的ANTLR错误处理行为。 我们将根据每个实验创建测试用例。 所有测试用例都位于S001HelloWordExperimentsTest类中。

    表达式 1:Hello word?

    结果树与正确的树非常相似:

    0 null
      -- 4 Hello word
      -- 5 <missing ENDSYMBOL>

    控制台输出包含错误:

    line 1:10 no viable alternative at character '?'
    line 1:11 missing ENDSYMBOL at '<eof>'

    测试用例:以下测试用例顺利通过。不会抛出异常,且抽象语法树的节点类型与正确表达式中的相同。

    @Test
     public void testSmallError() {
      //compile the expression
      S001HelloWordCompiler compiler = new S001HelloWordCompiler();
      CommonTree ast = compiler.compile("Hello word?");
    
      //check AST structure
      assertEquals(S001HelloWordParser.SALUTATION, ast.getChild(0).getType());
      assertEquals(S001HelloWordParser.ENDSYMBOL, ast.getChild(1).getType());
     }

    表达式 2:Bye!

    结果树与正确的树非常相似:

    0 null
      -- 4 <missing SALUTATION>
      -- 5 !

    控制台输出包含错误:

    line 1:0 no viable alternative at character 'B'
    line 1:1 no viable alternative at character 'y'
    line 1:2 no viable alternative at character 'e'
    line 1:3 missing SALUTATION at '!'

    测试用例:以下测试用例顺利通过。不会抛出异常,且抽象语法树的节点类型与正确表达式中的相同。

    @Test
     public void testBiggerError() {
      //compile the expression
      S001HelloWordCompiler compiler = new S001HelloWordCompiler();
      CommonTree ast = compiler.compile("Bye!");
    
      //check AST structure
      assertEquals(S001HelloWordParser.SALUTATION, ast.getChild(0).getType());
      assertEquals(S001HelloWordParser.ENDSYMBOL, ast.getChild(1).getType());
     }

    表达式 3:Incorrect Expression

    结果树只有根节点,没有子节点:

    0

    控制台输出包含很多错误:

    line 1:0 no viable alternative at character 'I'
    line 1:1 no viable alternative at character 'n'
    line 1:2 no viable alternative at character 'c'
    line 1:3 no viable alternative at character 'o'
    line 1:4 no viable alternative at character 'r'
    line 1:5 no viable alternative at character 'r'
    line 1:6 no viable alternative at character 'e'
    line 1:7 no viable alternative at character 'c'
    line 1:8 no viable alternative at character 't'
    line 1:9 no viable alternative at character ' '
    line 1:10 no viable alternative at character 'E'
    line 1:11 no viable alternative at character 'x'
    line 1:12 no viable alternative at character 'p'
    line 1:13 no viable alternative at character 'r'
    line 1:14 no viable alternative at character 'e'
    line 1:15 no viable alternative at character 's'
    line 1:16 no viable alternative at character 's'
    line 1:17 no viable alternative at character 'i'
    line 1:18 no viable alternative at character 'o'
    line 1:19 no viable alternative at character 'n'
    line 1:20 mismatched input '<EOF>' expecting SALUTATION

    测试用例:我们终于找到了一个导致树结构不同的表达式。

    @Test
     public void testCompletelyWrong() {
      //compile the expression
      S001HelloWordCompiler compiler = new S001HelloWordCompiler();
      CommonTree ast = compiler.compile("Incorrect Expression");
    
      //check AST structure
      assertEquals(0, ast.getChildCount());
     }

    Lexer中的错误处理

    每个词法分析器规则 RULE 都对应生成的词法分析器中的一个 mRULE 方法。例如,我们的语法有两个规则:

    SALUTATION:'Hello word';   
    ENDSYMBOL:'!';

    生成的词法分析器中就有两个相应的方法:

    public final void mSALUTATION() throws RecognitionException {
        // ...
    }
    
    public final void mENDSYMBOL() throws RecognitionException {
        // ...
    }

    根据抛出的异常类型,词法分析器可能尝试从错误中恢复,也可能不会。但无论哪种情况,每个错误最终都会进入 reportError(RecognitionException e) 方法。生成的词法分析器继承了它:

    public void reportError(RecognitionException e) {
      displayRecognitionError(this.getTokenNames(), e);
     }

    结果:我们必须在lexer中更改reportError或displayRecognitionError方法。

    解析器中的错误处理

    我们的语法只有一个解析器规则 expression:

    expression : SALUTATION ENDSYMBOL;

    该表达式对应于生成的解析器中的expression()方法:

    public final expression_return expression() throws RecognitionException {
      //initialization
      try {
        //parsing
      }
      catch (RecognitionException re) {
        reportError(re);
        recover(input,re);
        retval.tree = (Object) adaptor.errorNode(input, retval.start, input.LT(-1), re);
      } finally {
      }
      //return result;
    }

    如果发生错误,解析器将:

    • 向控制台报告错误,
    • 从错误中恢复
    • 将错误节点(而不是普通节点)添加到抽象语法树。

    解析器中的错误报告比lexer中的错误报告稍微复杂一些:

    /** Report a recognition problem.
      *
      *  This method sets errorRecovery to indicate the parser is recovering
      *  not parsing.  Once in recovery mode, no errors are generated.
      *  To get out of recovery mode, the parser must successfully match
      *  a token (after a resync).  So it will go:
      *
      *   1. error occurs
      *   2. enter recovery mode, report error
      *   3. consume until token found in resynch set
      *   4. try to resume parsing
      *   5. next match() will reset errorRecovery mode
      *
      *  If you override, make sure to update syntaxErrors if you care about that.
      */
     public void reportError(RecognitionException e) {
      // if we've already reported an error and have not matched a token
      // yet successfully, don't report any errors.
      if ( state.errorRecovery ) {
       return;
      }
      state.syntaxErrors++; // don't count spurious
      state.errorRecovery = true;
    
      displayRecognitionError(this.getTokenNames(), e);
     }
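    上面注释里描述的“恢复模式”可以用一个独立的 Java 草图示意:报告第一个错误后进入恢复模式,其间的后续错误被忽略,直到成功匹配一个令牌才退出(RecoveryModeDemo 为演示虚构的类,并非 ANTLR 的真实实现):

```java
// 纯 Java 示意:解析器错误报告中的“恢复模式”状态机。
public class RecoveryModeDemo {

    private boolean errorRecovery = false;
    private int syntaxErrors = 0;

    public void reportError(String message) {
        if (errorRecovery) {
            return;               // 已在恢复模式:忽略后续错误
        }
        syntaxErrors++;           // 只统计“真正的”错误
        errorRecovery = true;     // 进入恢复模式
    }

    public void matchSucceeded() {
        errorRecovery = false;    // 成功匹配一个令牌,退出恢复模式
    }

    public int getSyntaxErrors() { return syntaxErrors; }
}
```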

    这次我们有两个可能的选择:

    • 通过自己的处理替换解析器规则方法中的catch子句,
    • 覆盖解析器方法。

    更改解析器中的 catch 子句

    Antlr提供了两种方法来更改解析器中生成的catch子句。 我们将创建两个新的语法,每个都演示一种方法。 在这两种情况下,我们都会使解析器在第一个错误时退出。

    首先,我们可以在新的 S002HelloWordWithErrorHandling 语法的解析器规则后面添加 catch 子句:

    expression : SALUTATION ENDSYMBOL;
    catch [RecognitionException e] {
      //Custom handling of an exception. Any java code is allowed.
      throw new S002HelloWordError(":(", e);
    }

    当然,我们必须把 S002HelloWordError 异常的导入添加到标头块中:

    @parser::header {
      package org.meri.antlr_step_by_step.parsers;
    
      //add imports (see full line on Github)
      import ... .S002HelloWordWithErrorHandlingCompiler.S002HelloWordError;
    }

    编译器类与以前几乎相同。 它声明了新的异常:

    public class S002HelloWordWithErrorHandlingCompiler extends AbstractCompiler {
    
      public CommonTree compile(String expression) {
        // no change here
      }
    
      @SuppressWarnings("serial")
      public static class S002HelloWordError extends RuntimeException {
        public S002HelloWordError(String arg0, Throwable arg1) {
          super(arg0, arg1);
        }
      }
    }

    然后,ANTLR将用我们自己的处理方式替换表达式规则方法中的默认catch子句:

    public final expression_return expression() throws RecognitionException {
      //initialization
      try {
        //parsing
      }
      catch (RecognitionException re) {
        //Custom handling of an exception. Any java code is allowed.
        throw new S002HelloWordError(":(", re); 
      } finally {
      }
      //return result;
    }

    像往常一样,语法、编译器类和测试类都可以在 Github 上找到。

    或者,我们可以把 @rulecatch 块放在标头块和第一个词法分析器规则之间。S003HelloWordWithErrorHandling 语法演示了此方法:

    //change error handling in all parser rules
    @rulecatch {
      catch (RecognitionException e) {
        //Custom handling of an exception. Any java code is allowed.
        throw new S003HelloWordError(":(", e);
      }
    }

    我们必须将S003HelloWordError异常的导入添加到标头块中:

    @parser::header {
      package org.meri.antlr_step_by_step.parsers;
    
      //add imports (see full line on Github)
      import ... .S003HelloWordWithErrorHandlingCompiler.S003HelloWordError;
    }

    编译器类与前面的情况完全相同。 ANTLR将替换所有解析器规则中的默认catch子句:

    public final expression_return expression() throws RecognitionException {
      //initialization
      try {
        //parsing
      }
      catch (RecognitionException re) {
        //Custom handling of an exception. Any java code is allowed.
        throw new S003HelloWordError(":(", re); 
      } finally {
      }
      //return result;
    }

    同样,语法、编译器类和测试类也在 Github 上提供。

    不幸的是,这种方法有两个缺点。首先,它只对解析器有效,对词法分析器无效。其次,默认的报告和恢复功能本来就以合理的方式工作:它会尝试从错误中恢复,而一旦开始恢复,就不会再产生新的错误;只有当解析器不处于错误恢复模式时,才会生成错误消息。

    我们喜欢此功能,因此我们决定仅更改错误报告的默认实现。


    将方法和字段添加到生成的类

    我们会将所有词法分析器/解析器错误存储在私有列表中。 此外,我们将在生成的类中添加两个方法:

    • hasErrors –如果发生至少一个错误,则返回true,
    • getErrors –返回所有生成的错误。

    在@members块内添加了新的字段和方法:

    @lexer::members {
      //everything you need to add to the lexer
    }
    
    @parser::members {
      //everything you need to add to the parser
    }

    members 块必须放在标头块和第一个词法分析器规则之间。该示例的语法为 S004HelloWordWithErrorHandling:

    //add new members to generated lexer
    @lexer::members {
      //add new field
      private List<RecognitionException> errors = new ArrayList <RecognitionException> ();
      
      //add new method
      public List<RecognitionException> getAllErrors() {
        return new ArrayList<RecognitionException>(errors);
      }
    
      //add new method
      public boolean hasErrors() {
        return !errors.isEmpty();
      }
    }
    
    //add new members to generated parser
    @parser::members {
      //add new field
      private List<RecognitionException> errors = new ArrayList <RecognitionException> ();
      
      //add new method
      public List<RecognitionException> getAllErrors() {
        return new ArrayList<RecognitionException>(errors);
      }
    
      //add new method
      public boolean hasErrors() {
        return !errors.isEmpty();
      }
    }

    生成的词法分析器和解析器都会包含在 members 块中编写的所有字段和方法。
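    @members 块注入的字段和方法组合出来的行为,可以用下面这个独立的 Java 草图示意。为了脱离 ANTLR 运行库,这里用 String 代替 RecognitionException;ErrorCollector 这个类名也是为演示虚构的:

```java
import java.util.ArrayList;
import java.util.List;

// 纯 Java 示意:收集错误并提供 hasErrors/getAllErrors 访问的最小类。
public class ErrorCollector {

    private final List<String> errors = new ArrayList<>();

    // 对应被覆盖的 reportError:把错误记录下来而不只是打印
    public void reportError(String e) {
        errors.add(e);
    }

    public boolean hasErrors() {
        return !errors.isEmpty();
    }

    // 返回副本,避免调用方修改内部列表
    public List<String> getAllErrors() {
        return new ArrayList<>(errors);
    }
}
```

    getAllErrors 返回内部列表的副本,调用方无法借此修改词法分析器或解析器的内部状态。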

    覆盖生成的方法

    要覆盖生成的方法,做法与添加新方法相同:把它写进 @members 块中。

    //override generated method in lexer
    @lexer::members {
      //override method
      public void reportError(RecognitionException e) {
        errors.add(e);
        displayRecognitionError(this.getTokenNames(), e);
      }
    }
    
    //override generated method in parser
    @parser::members {
      //override method
      public void reportError(RecognitionException e) {
        errors.add(e);
        displayRecognitionError(this.getTokenNames(), e);
      }
    }

    现在,reportError 方法将覆盖词法分析器和解析器中的默认行为。

    收集编译器中的错误

    最后,我们必须更改编译器类。 新版本将在输入解析阶段之后收集所有错误:

    private List<RecognitionException> errors = new ArrayList<RecognitionException>();
    
    public CommonTree compile(String expression) {
      try {
    
        ... init lexer ...
      
        ... init parser ...
        ret = parser.expression();
    
        //collect all errors
        if (lexer.hasErrors())
          errors.addAll(lexer.getAllErrors());
      
        if (parser.hasErrors())
          errors.addAll(parser.getAllErrors());
      
        //acquire parse result
        ... as usually ...
      } catch (RecognitionException e) {
        ...
      }
    }
      
    /**
    * @return all errors found during last run
    */
    public List<RecognitionException> getAllErrors() {
      return errors;
    }

    必须在解析器完成工作之后再收集词法分析器的错误:词法分析器由解析器调用,在此之前不会产生任何错误。像往常一样,我们把语法、编译器类和测试类放在 Github 上。

    下载antlr分步项目的标记003-S002-to-S004HelloWordWithErrorHandling ,以查找同一java项目中的所有三种错误处理方法。

    参考:我们的 JCG 合作伙伴 Maria Jurcovicova 在 This is Stuff 博客上的 ANTLR 教程 – Hello word。


    翻译自: https://www.javacodegeeks.com/2012/04/antlr-tutorial-hello-word.html
