
Re: Lexical analysis




Till and Kolyang wrote:

> we are just implementing a lexical analysis for 
> our Isabelle CASL parser (up to now, we have used
> Isabelle's standard lexical analysis, but this
> is not exactly what is needed for CASL).
> 
> Now there are the following questions and problems:
> 
> - Are the following recognized as complete TOKENs or not?
>   ->  ->?  =e=  {}  ×
>   If they are, they should be listed together with
>   < * ? ! and / on p. C-10, if not, they should be listed
>   together with : :? ::= etc. on p. C-10.

=e= is not allowed in any single TOKEN, since it mixes SIGNS and
WORDS. 

{} is allowed, and could be added to the (non-exhaustive!) list of
examples of allowed tokens using special characters.

× (a multiplication sign in ISO Latin-1) was indeed intended to be
allowed as a complete token, to be consistent with the treatment of *.
-> and ->? were intended to be allowed too, for the same reason.  If
this turns out to be too problematic with some parsers (e.g., for
look-ahead reasons), these symbols may have to be rejected as complete
TOKENs after all, but then allowing * may seem somewhat ad hoc.
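
To make the distinction concrete, here is a minimal sketch of a scanner
that splits input into maximal runs of word characters or sign
characters.  The two character classes are assumptions made for
illustration (letters, digits and the prime as word characters, every
other non-blank character as a sign character); they are not taken from
the CASL Summary, and split_tokens and is_single_token are hypothetical
names.

```python
import re

# Assumed character classes (hypothetical, not the CASL definitions):
# word characters are ASCII letters, digits and the prime; any other
# non-blank character counts as a sign character.
WORD_RUN = r"[A-Za-z0-9']+"
SIGN_RUN = r"[^A-Za-z0-9'\s]+"
TOKEN_RE = re.compile(f"{WORD_RUN}|{SIGN_RUN}")


def split_tokens(text):
    """Split text into maximal runs of word or sign characters."""
    return TOKEN_RE.findall(text)


def is_single_token(text):
    """True if the whole text is scanned as exactly one run."""
    return split_tokens(text) == [text]


if __name__ == "__main__":
    for candidate in ["->", "->?", "=e=", "{}", "\u00d7"]:
        print(candidate, split_tokens(candidate), is_single_token(candidate))
```

Under these assumptions ->, ->?, {} and × each come out as one run of
sign characters, while =e= is split into three runs because it mixes
the two classes.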

> - A NUMBER can simultaneously be a WORDS.
>   We currently resolve this by scanning it as a NUMBER,
>   and adding
>      TOKEN ::= WORDS | SIGNS | DOT-WORDS | NUMBER
>   and perhaps also
>      SIMPLE-ID ::= WORDS | NUMBER
>   but the latter seems not to be very useful.
>   By the way, "5a5" and "'5a" are recognized as WORDS as well,
>   was this intended, or should a WORDS start with a letter?

A NUMBER is only used in a VERSION in a LIB-NAME, I think, which is
only used at the LIBRARY level.  So perhaps one might restrict its
recognition to such contexts? - see my previous message concerning URL
and PATH, also with regard to the following questions:

> - There is no syntax for PATH and URL. Should we follow some
>   international standard here? If so, which one and how to obtain
>   a precise description of the syntax?
> 
> - A WORDS can (probably - see above) simultaneously be a PATH.
>   We currently resolve this by scanning it as a WORDS,
>   and adding
>      LIB-ID ::= URL | PATH | WORDS 
> 
> - More seriously, x/zero is currently recognized as a PATH,
>   but within a TERM, it should be recognized as three lexical 
>   tokens, namely WORDS SIGNS WORDS. There is no way to distinguish
>   these cases at the lexical level, and by the longest match rule,
>   we always get a PATH.
>   Moreover, probably other SIGNS (like ".") will be allowed in a PATH,
>   leading to similar problems.
>   One way out would be to disallow SIGNS in a PATH, and reintroduce
>   the necessary SIGNS, such as "/" and ".", via the grammar. But this would
>   allow PATHs to be written interspersed with spaces, such as
>     CASLdir / examples / file1 . casl
>   while we probably would like to force the user to write
>     CASLdir/examples/file1.casl
>   The other possibility would be to require a PATH to be quoted, e.g.
>     "CASLdir/examples/file1.casl"
> 
>   The same problem also occurs for URLs.

That is exactly why I think that you may have to regard parsing
libraries as a separate process from parsing individual specifications
(although some parsing technologies can cope with context-sensitive
lexical analysis).
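
As a rough illustration of what separate (or context-sensitive)
scanning could look like, the sketch below runs the same longest-match
scanner with two different rule sets: one for the library level, where
PATH is a token class, and one for individual specifications, where it
is not.  The PATH pattern, the character classes and the names
LIBRARY_RULES, SPEC_RULES and scan are all assumptions made up for this
example; CASL does not (yet) define a syntax for PATH.

```python
import re

# Assumed token shapes (hypothetical, for illustration only).
WORDS = r"[A-Za-z0-9']+"
SIGNS = r"[^A-Za-z0-9'\s]+"
# Assumed PATH shape: word segments joined by "/" or ".", no spaces.
PATH = r"[A-Za-z0-9']+(?:[/.][A-Za-z0-9']+)+"

# PATH is only a token class at the library level.
LIBRARY_RULES = [("PATH", PATH), ("WORDS", WORDS), ("SIGNS", SIGNS)]
SPEC_RULES = [("WORDS", WORDS), ("SIGNS", SIGNS)]


def scan(text, rules):
    """Longest-match scanner; on ties the earlier rule wins."""
    compiled = [(kind, re.compile(pattern)) for kind, pattern in rules]
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None
        for kind, regex in compiled:
            m = regex.match(text, pos)
            if m and (best is None or m.end() > best[1]):
                best = (kind, m.end())
        # WORDS and SIGNS together cover every non-blank character,
        # so some rule always matches here.
        kind, end = best
        tokens.append((kind, text[pos:end]))
        pos = end
    return tokens


if __name__ == "__main__":
    print(scan("x/zero", LIBRARY_RULES))
    print(scan("x/zero", SPEC_RULES))
```

With the library rules, x/zero is scanned as a single PATH by the
longest match rule; with the specification rules, the same input
yields WORDS SIGNS WORDS, as required within a TERM.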

> We also have two problems with the grammar:
> 
> - There is no syntax for TOKEN-PLACES on page C-5 bottom.

I.e., Appendix C.2.2 (for the benefit of those reading the web pages,
please always include the relevant section number when citing the CASL
Summary).

It seems this bug also occurs in Appendix B.2: TOKEN-PLACES is used,
but not defined!  (Appendix C was derived from Appendix B, so it's not
too surprising that the same bug is in both places - but one might have
expected it to have been noticed sooner: it was there also in v0.99...)

>   We assume that the syntax is
>      TOKEN-PLACES ::= PLACE ... PLACE TOKEN PLACE ... PLACE
>                     | TOKEN PLACE ... PLACE
>                     | PLACE ... PLACE TOKEN 
>                     | TOKEN

No: that would restrict mixfix notation to infix, prefix, and postfix,
prohibiting "outfix" notation such as {|__|} (used, e.g., in the
examples in Appendix E.2.1-2), which you were surely not intending to
do?
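
A small check makes the restriction visible.  The sketch below tests a
mixfix identifier, given as a list of already-scanned items, against
the four alternatives proposed above, which together amount to
"exactly one TOKEN, with any number of PLACEs on either side".  It
assumes that {| and |} are each scanned as one TOKEN and that a PLACE
is written __; the function name is hypothetical.

```python
PLACE = "__"


def matches_proposed_token_places(items):
    """True if items fit the proposed grammar: exactly one TOKEN,
    surrounded by zero or more PLACEs on each side."""
    token_count = sum(1 for item in items if item != PLACE)
    return token_count == 1


if __name__ == "__main__":
    print(matches_proposed_token_places(["__", "+", "__"]))    # infix
    print(matches_proposed_token_places(["-", "__"]))          # prefix
    print(matches_proposed_token_places(["{|", "__", "|}"]))   # outfix
```

The infix and prefix cases are accepted, but the outfix identifier
{|__|} is rejected, since it contains two TOKENs around a single PLACE.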

>   (if there is not exactly one TOKEN in a TOKEN-PLACES,
>    it becomes unclear where to attach the components of
>    the compound id).

It might well be desirable to restrict the syntax of MIXFIX-ID in
Appendix C.2.2 so that there cannot be more than one list of component
IDs.

I see that the ASF+SDF grammar (for v0.99 - I haven't yet updated to
Mark's v1.0 ASF+SDF grammar) at
ftp://ftp.brics.dk/Projects/CoFI/Documents/CASL/SyntaxExamples/ASF+SDF/
(see the files ZCasl-BasicItems.syn, ZCasl-Gen.syn) insists that any
components come right at the end of a MIXFIX-ID.  I'm afraid that this
was a simplifying assumption that I suggested to Bjarke while he was
developing the ASF+SDF grammar, and it should have been removed later
(assuming that one does indeed want to be able to declare symbols such
as __<[Elem]__).  Mea culpa!

Suggestions for a corrected grammar for MIXFIX-ID are welcome...

> - The production
>      SIMPLE-TERM ::= ID | ....
>   has to be replaced by 
>      SIMPLE-TERM ::= TOKEN-ID | ....
>   because a MIXFIX-ID should not be a legal SIMPLE-TERM
>   (and we would run into ambiguity problems).

I'm inclined to agree with you - this would also match the abstract
syntax.  There's no point in allowing something that can never be used
in a well-formed CASL spec, it's merely a superfluous generality in
the concrete grammar, and I suggest that it should be removed in the
next release of the CASL Summary.

> Greetings and happy new year,
> Till and Kolyang

Thanks for your work on these parsing issues!  I hope that the above
initial response will allow you to proceed - especially as I'm just
about to leave for a trip, and won't be back here until mid-January...

Best regards,

-- Peter
_________________________________________________________
Dr. Peter D. Mosses             International Fellow  (*)

Computer Science Laboratory     mailto:mosses@csl.sri.com
SRI International               phone: +1 (650)  859-2200
333 Ravenswood Avenue           fax:   +1 (650)  859-2844
Menlo Park, CA 94025, USA       http://www.brics.dk/~pdm/

(*) on leave from DAIMI & BRICS, University of Aarhus, DK
    also affiliated to CS Department, Stanford University
_________________________________________________________