WG15 Defect Report Ref: 9945-2-42
Topic: Regular Expression Issues


This is an approved interpretation of 9945-2:1993.

.

Last update: 1997-05-20


								9945-2-42

 _____________________________________________________________________________


	Topic:			Regular Expression Issues
	Relevant Sections:	2.8.3


Defect Report:
-----------------------

(from Andrew Hume Doug McIlroy)

RE Issues

Issue C

     The scheme	used for defining REs and their	semantics is
based  on  the	historical notion that primitive REs match a
single character. This was broken with the  introduction  of
CEs,  as  they	may be multiple	characters in length.  These
were added through bracket expressions so  as  to  cause  as
little impact to the rest of the specification.

     This had two side-effects.	 The first  is	interactions
with the rest of the semantics.	 The second is that the	his-
torical	and careful difference in semantics between BREs and
EREs,  which  generally	 allowed BREs to be implemented	with
polynomial-time	algorithms, was	erased.	 Both BREs and	EREs
now  in	 principle require algorithms with runtimes that are
exponential in the size	of the	pattern.   For	example,  on
line  2860  (2.8.3.2),	a  bracket  expression is defined to
match a	character or  CE.  Unfortunately,  if  a  duplicated
bracket	expression contains multibyte CEs, then	(exponential
runtime) backtracking could be necessary to find the longest
match.

     The main problem with 2.8.3.1 is that it is unclear how
to  interpret  matching	 a character or	CE against a string.
How is the string interpreted? As a sequence of	 characters,
a sequence of CEs, or as a mixture?
	  ________________________________________

 [9] On	line 2860 (2.8.3.2), a bracket expression is defined
     to	 match	a  single CE. Yet, line	2890 allows matching
     any ``character or	CE''.

Proposed Solution:

     Change the	phrase ``character or CE'' to ``CE''.

Rationale:

     The wording suggests a nonexistent	distinction.
	  ________________________________________

 [10] On  line	2860  (2.8.3.2),  a  bracket  expression  is
     defined  to  match	a single CE. Yet, on lines 2949-2952
     and lines 2957-2969,  frequent  reference	is  made  to
     ``characters''.

Proposed Solution:

     Change lines 2949-2952 and	 lines	2957-2969  to  refer
only to	CEs, and not to	characters.

Rationale:

     Same as previous rationale.
	  ________________________________________

 [11] Within 2.8.3.2, the term ``expression'' is used  often
     without  a	qualifier, and in these	cases its meaning is
     unclear.  Examples	include	lines 2886 and 2891.

Proposed Solution:

     The apparent intended meaning, namely  CEs,  should  be
used.


WG15 response for 9945-2:1993 
-----------------------------------


Part 9
Since the standard is unclear as to how strings are broken up into 
a series of collating elements (see interpretation #40) it is also
unclear how to interpret matching a character or collating element 
against a string, and as such no conformance distinction can be made
between alternative implementations based on this.  This is being 
referred to the sponsor. 

Part10

The standard is unclear on this issue, and no conformance
distinction can be made between alternative implementations
based on this.  This is being referred to the sponsor.

Concerns about the wording of this part of the standard have
been forwarded to the sponsor.

Part 11
The reference to "expressions" on page 80, line 2886 refers to the definition
on page 79, lines 2864-2866, and conforming implementations must conform to 
this.  

 
Rationale for Interpretation:
-----------------------------
None.

Proposed resolution forwarded: Aug 11 1995
Finalized: Sept 12 1995