Network Working Group                                             K. Yee
Request for Comments: 9999                           Foresight Institute
Category: Experimental                                        March 1998

[DRAFT 4, Tue 24 Mar 02:56:41 EST 1998]

                   Text-Search Fragment Identifiers      

Status of this Memo

   This memo defines an Experimental Protocol for the Internet
   community.  This memo does not specify an Internet standard of any
   kind.  Discussion and suggestions for improvement are requested.
   Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (1998).  All Rights Reserved.

1.  Introduction

   The Uniform Resource Locator (or URL; see RFC 1738 [1]) is a string
   format for referencing a resource on the Internet.  The character
   "#" is used on the World-Wide Web to delimit a URL from a "fragment
   identifier" which might follow it.  A fragment identifier specifies
   a particular view on, or portion of, the resource being referenced.

   In the case where the resource has a "text/html" representation,
   current practice on the World-Wide Web is to permit the use of an
   HTML anchor name as a fragment identifier (see RFC 1866 [2] section
   7.4).  Specifically, RFC 1866 allows such a fragment identifier to
   refer to an <A> element with a matching NAME attribute, while more
   recent W3C Recommendations for HTML also allow such a fragment
   identifier to refer to any element with a matching ID attribute.

   While this permits reference within an HTML document to a fragment
   that the document author has explicitly marked beforehand with a
   named anchor, it would also be useful to refer to text where no
   anchor was provided by the author.  In particular, that ability would
   help support third-party annotations and backward link traversal.
   This RFC proposes a new kind of fragment identifier based on
   matching text in the referenced document.  A text-search fragment
   identifier may refer to short phrases in the text of documents in
   plain text or SGML format (where the latter includes HTML, XML, and
   other applications of SGML).

   As many fragment specification schemes (FSSes) might be possible,
   this RFC also proposes a general framework for new FSSes that avoids
   conflict with fragment identifiers that refer to HTML anchor names.


Yee                          Experimental                       [Page 1]
RFC 9999           Text-Search Fragment Identifiers           March 1998


2.  Textual Content of a Document

   For the purposes of searching, any document is treated as a sequence
   of words.  A word is an unbroken sequence of letters or digits, where
   accented letters are translated to their unaccented counterparts and
   apostrophes (ASCII 39) are ignored.  Any sequence of one or more
   characters that are neither word characters nor apostrophes forms a
   break between words.

   If the target document is in an SGML format, all SGML tags are
   ignored and all entities are expanded when considering its textual
   content.  Because tags are ignored rather than treated as
   separators, an SGML tag immediately surrounded by word characters
   does not constitute a word break.

   For example, the plain text excerpt:

       Hey!  That 'quick' brown fox didn't jump over the lazy dog?

   is treated as the sequence of words:

       Hey, That, quick, brown, fox, didnt, jump, over, the, lazy, dog

   The HTML excerpt:

       <p>Es <em>brillig</em> war.  Die schlichten Toven
       <br>Wirrten und wimmelten in <a href="waben.html">Waben</a>;
       <br>Und aller-m&uuml;msige Burggoven
       <br>Die mohmem R&auml;th' ausgraben<sup>1</sup>.

   is treated as the sequence of words:
     
       Es, brillig, war, Die, schlichten, Toven, Wirrten, und,
       wimmelten, in, Waben, Und, aller, mumsige, Burggoven, Die,
       mohmem, Rath, ausgraben1
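
   The rules above can be sketched in Python as follows.  This is a
   minimal illustration rather than a normative algorithm.  The
   function name document_words, the simple regular expression used
   to strip tags, the use of HTML entity names for expansion, and the
   restriction of word characters to ASCII letters and digits are all
   assumptions made for the example.

```python
import html
import re
import unicodedata

TAG = re.compile(r"<[^>]*>")        # naive tag pattern (assumption)
WORD = re.compile(r"[A-Za-z0-9]+")  # ASCII letters and digits only


def document_words(text, sgml=False):
    """Return the word sequence of a document as a list of strings."""
    if sgml:
        # Tags are dropped without leaving a separator behind, so a tag
        # between two word characters does not break the word.
        text = TAG.sub("", text)
        # Expand entities such as &uuml; (HTML entity names assumed).
        text = html.unescape(text)
    # Decompose accented letters and discard the combining marks.
    text = unicodedata.normalize("NFD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    # Apostrophes (ASCII 39) are ignored rather than treated as breaks.
    text = text.replace("'", "")
    return WORD.findall(text)
```

   Applied to the two excerpts above (with sgml=True for the HTML
   excerpt), this sketch yields the word sequences listed there.  A
   real implementation would need a proper SGML parser to handle
   comments, marked sections, and attribute values containing ">".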


3.  Extended Fragment Identifiers

   Fragment identifiers for HTML documents, as given in RFC 1866, refer
   to an <A> element by matching its NAME attribute.  Since the value of
   the NAME attribute must be an SGML NAME token, such "anchor fragment
   identifiers" must always begin with a letter.

   This RFC proposes one new fragment specification scheme (FSS);
   other FSSes may be devised in the future.  Fragment identifiers in
   such new schemes could be termed "extended fragment identifiers"
   (EFIs).  To avoid conflict with the existing namespace for fragment
   identifiers, it is proposed that every EFI begin with a short
   keyword identifying its fragment specification scheme (an FSSID),
   enclosed in colons (ASCII 58).


   Thus, the proposed syntax for a fragment identifier is as follows:

      fragid             =  anchor-fragid | extended-fragid

      anchor-fragid      =  sgml-name
      sgml-name          =  alpha *namechar
      namechar           =  alpha | digit | "-" | "."

      extended-fragid    =  ":" fss-id ":" fss-specific-part
      fss-id             =  1*alphadigit
      alphadigit         =  alpha | digit
      fss-specific-part  =  *uchar

   The following definitions are taken from RFC 1738 [1]:

      alpha          =  "A" | "B" | ... | "Z" | "a" | "b" | ... | "z"
      digit          =  "0" | "1" | ... | "9"
      safe           =  "$" | "-" | "_" | "." | "+"
      extra          =  "!" | "*" | "'" | "(" | ")" | ","

      hex            =  digit | "A" | ... | "F" | "a" | ... | "f"
      escape         =  "%" hex hex

      unreserved     =  alpha | digit | safe | extra
      uchar          =  unreserved | escape

   Note that although no limit on the length of the fss-specific-part is
   imposed here, a fragment identifier will often be part of the value
   of an HREF attribute in an HTML document, and the SGML declaration
   for HTML limits the length of an attribute value (in RFC 1866 [2]
   section 9.5, this limit is 1024 characters).
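
   As an illustration of the syntax above, the following Python
   sketch classifies a fragment identifier (the text after "#") as an
   anchor fragment identifier or an extended fragment identifier.
   Because the former must begin with a letter and the latter with a
   colon, the two cases cannot collide.  The function name
   classify_fragid and the shape of its return value are assumptions
   made for the example.

```python
import re

# anchor-fragid = alpha *namechar
ANCHOR = re.compile(r"[A-Za-z][A-Za-z0-9.-]*")
# uchar = unreserved | escape, as taken from RFC 1738
UCHAR = r"(?:[A-Za-z0-9$\-_.+!*'(),]|%[0-9A-Fa-f]{2})"
# extended-fragid = ":" fss-id ":" fss-specific-part
EXTENDED = re.compile(r":([A-Za-z0-9]+):(" + UCHAR + r"*)")


def classify_fragid(fragid):
    """Classify a fragment identifier.

    Returns ("anchor", name), ("extended", fss_id, specific_part),
    or None when the string matches neither production.
    """
    if ANCHOR.fullmatch(fragid):
        return ("anchor", fragid)
    m = EXTENDED.fullmatch(fragid)
    if m:
        return ("extended", m.group(1), m.group(2))
    return None
```

   A processor that does not recognise the returned FSSID can at
   least tell that the identifier is not an HTML anchor name.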


4.  Text-Search Fragment Specification Scheme

   The FSSID for the text-search FSS described here is "words".  The
   FSS-specific part consists of a sequence of words separated by
   hyphens (ASCII 45).  Embedded in this string there must be exactly
   one pair of parentheses that encloses one or more whole words.  More
   precisely, the syntax is:

      words-fragid   =  ":words:" *precontext targetphrase *postcontext

      precontext     =  word "-"
      targetphrase   =  "(" word *[ "-" word ] ")"
      postcontext    =  "-" word

      word           =  1*alphadigit
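
   The grammar above maps directly onto a regular expression.  The
   following Python sketch splits a text-search fragment identifier
   into its context and target words.  The names WORDS_FRAGID and
   parse_words_fragid are assumptions made for the example.

```python
import re

WORDS_FRAGID = re.compile(
    r":words:"
    r"((?:[A-Za-z0-9]+-)*)"                  # *precontext
    r"\(([A-Za-z0-9]+(?:-[A-Za-z0-9]+)*)\)"  # targetphrase
    r"((?:-[A-Za-z0-9]+)*)"                  # *postcontext
)


def parse_words_fragid(fragid):
    """Split a ":words:" identifier into (pre, target, post) word lists.

    Returns None when the string does not match the grammar, for
    example when it has no parentheses or more than one pair.
    """
    m = WORDS_FRAGID.fullmatch(fragid)
    if m is None:
        return None
    pre, target, post = m.groups()
    return (
        [w for w in pre.split("-") if w],
        target.split("-"),
        [w for w in post.split("-") if w],
    )
```

   Rejecting malformed identifiers outright, as this sketch does, is
   one possible design choice; this memo does not prescribe any error
   handling for identifiers that fail to parse.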


   The semantics of such a fragment identifier are as follows.  A
   case-sensitive match for the entire sequence of words, ignoring the
   parentheses, is to be sought within the target document's word
   sequence.  Should the entire sequence of words occur multiple times,
   only the first occurrence is considered.  The "target phrase"
   actually referred to by the fragment identifier is the portion
   enclosed in parentheses.  The words before and after the parentheses
   serve as context to distinguish multiple occurrences of the target
   phrase in the target document.

   For example, any of the following fragment identifiers could refer
   to the word "distinguish" in the previous paragraph.

      :words:(distinguish)

      :words:serve-as-context-to-(distinguish)

      :words:context-to-(distinguish)-multiple-occurrences
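
   The matching semantics described in this section can be sketched
   in Python as follows.  The sketch assumes that the target document
   has already been reduced to a list of words and that the fragment
   identifier has already been split into precontext, target phrase,
   and postcontext word lists.  The function name find_target and its
   return convention are assumptions made for the example.

```python
def find_target(doc_words, pre, target, post):
    """Locate the target phrase in a document's word sequence.

    Returns (start, end) indices such that doc_words[start:end] is the
    target phrase, or None when the full sequence does not occur.
    """
    sequence = pre + target + post
    n = len(sequence)
    for i in range(len(doc_words) - n + 1):
        # List comparison is exact, so the match is case-sensitive.
        if doc_words[i:i + n] == sequence:
            start = i + len(pre)
            return (start, start + len(target))  # first occurrence only
    return None


# A short word sequence used only to exercise the function.
words = ("serve as context to distinguish multiple occurrences "
         "of the target phrase in the target document").split()

print(find_target(words, ["context", "to"], ["distinguish"], []))
print(find_target(words, [], ["target"], []))            # first "target"
print(find_target(words, [], ["target"], ["document"]))  # second "target"
```

   The last two calls show how a context word selects a later
   occurrence of a target phrase that appears more than once.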


5.  Recommended Application Behaviour

   The key words "MUST", "SHOULD", and "MAY" used in this section are
   to be interpreted as described in RFC 2119 [3].

   The intended application areas for text-search fragment identifiers
   are critical discussion, third-party review of documents, and so on.
   The text-search fragment identifier semantics were chosen to provide
   some robustness despite minor changes to the target document, so
   that such identifiers can better support document annotation.
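
   One way an annotation tool might generate such an identifier is
   sketched below in Python.  The sketch adds context words one at a
   time until the first match of the whole sequence is the intended
   occurrence.  Keeping the context short leaves fewer words that a
   later edit to the document could disturb.  The function names and
   the strategy of alternating between preceding and following
   context are assumptions made for the example; this memo does not
   prescribe how identifiers are to be constructed.

```python
def first_match(doc_words, sequence):
    """Return the index of the first exact occurrence, or None."""
    n = len(sequence)
    for i in range(len(doc_words) - n + 1):
        if doc_words[i:i + n] == sequence:
            return i
    return None


def build_words_fragid(doc_words, start, end):
    """Build a ":words:" identifier for doc_words[start:end].

    Assumes 0 <= start < end <= len(doc_words) and that every word
    consists of ASCII letters and digits only.
    """
    before = after = 0
    while True:
        lo, hi = start - before, end + after
        if first_match(doc_words, doc_words[lo:hi]) == lo:
            pre = "".join(w + "-" for w in doc_words[lo:start])
            target = "-".join(doc_words[start:end])
            post = "".join("-" + w for w in doc_words[end:hi])
            return ":words:" + pre + "(" + target + ")" + post
        # Not yet unambiguous: add one context word, alternating sides
        # where the document boundaries allow it.
        if before <= after and lo > 0:
            before += 1
        elif hi < len(doc_words):
            after += 1
        else:
            before += 1
```

   The loop always terminates, because in the worst case the context
   grows to cover the whole document, which matches at position zero.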

   A hyperlink URL which includes a text-search fragment identifier
   SHOULD be interpreted by document processors and user agents to mean
   a fine-grained hypertext link to the indicated target phrase.  The
   REL and REV attributes of such a hyperlink (see RFC 1866 [2] section
   5.7.3) can be used to give a disposition with respect to the target
   phrase.  For argumentation, for example, the disposition keywords
   "support", "issue", "query", and "comment" are suggested.
   
   When such a link is traversed, an interactive user agent conforming
   to this specification MUST enable the user to discover the precise
   location of the words of the target phrase in the document in some
   fashion.  (Merely scrolling the display to the approximate location
   of the target phrase is not sufficient.)  For example, the words of
   the target phrase could be highlighted, or symbols could be inserted
   into the displayed text of the document to point out the target
   phrase; such indication MAY be shown by default or shown at the
   request of the user.  It is not necessary to indicate the context
   words surrounding the target phrase in any particular fashion.


   User agents MAY indicate the target phrase of a link pointing into a
   displayed document even when the displayed document was reached by
   means other than traversing that link.  If a user agent allows
   backward traversal of such hyperlinks, the source anchor for the
   backward link SHOULD be provided at or near the target phrase.  When
   the target phrase indicated by a text-search fragment identifier is
   not found in the target document, an interactive user agent SHOULD
   display a warning to notify the user that it was not found.


6.  Security Considerations

   The proposed text-search fragment specification scheme embeds a small
   excerpt from the document into the fragment identifier.  Therefore it
   is important to keep in mind that a URL into the text of a sensitive
   document might contain sensitive text, and storage of personal URL
   collections or histories would need to be protected.

   A text-search fragment identifier only specifies the way a user agent
   should treat data it has already received, and reveals no new access
   to a document other than the words in the fragment itself.  However,
   note that a text-search fragment identifier could be used to call
   attention to a mistake or a phrase otherwise made inconspicuous
   (e.g., by its colouring or small typeface), in a manner which will
   undoubtedly make some document authors uncomfortable.


7.  References

   [1]  T. Berners-Lee, L. Masinter, M. McCahill, Editors, "Uniform
        Resource Locators (URL)", RFC 1738, December 1994.

   [2]  T. Berners-Lee, D. Connolly, "Hypertext Markup Language - 2.0",
        RFC 1866, November 1995.

   [3]  S. Bradner, "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.


8.  Author's Address

   Ka-Ping Yee
   St. Paul's United College
   Westmount Road North
   Waterloo, Ontario, Canada, R2M 5H9

   Phone:  +1 519-725-8008
   Fax:    +1 519-885-6364
   E-Mail: ping@lfw.org


9.  Full Copyright Statement

   Copyright (C) The Internet Society (1998).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published and
   distributed, in whole or in part, without restriction of any kind,
   provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

