Network Working Group                                             K. Yee
Request for Comments: 9999                           Foresight Institute
Category: Experimental                                        March 1998

               [DRAFT 4, Tue 24 Mar 02:56:41 EST 1998]

                   Text-Search Fragment Identifiers

Status of this Memo

   This memo defines an Experimental Protocol for the Internet
   community. This memo does not specify an Internet standard of any
   kind. Discussion and suggestions for improvement are requested.
   Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (1998). All Rights Reserved.

1. Introduction

   The Uniform Resource Locator (or URL; see RFC 1738 [1]) is a string
   format for referencing a resource on the Internet. The character "#"
   is used on the World-Wide Web to delimit a URL from a "fragment
   identifier" which might follow it. A fragment identifier specifies a
   particular view on, or portion of, the resource being referenced.

   In the case where the resource has a "text/html" representation,
   current practice on the World-Wide Web is to permit the use of an
   HTML anchor name as a fragment identifier (see RFC 1866 [2] section
   7.4). Specifically, RFC 1866 allows such a fragment identifier to
   refer to an element with a matching NAME attribute, while more
   recent W3C Recommendations for HTML also allow such a fragment
   identifier to refer to any element with a matching ID attribute.

   While this permits reference within an HTML document to a fragment
   that the document author has explicitly marked beforehand with a
   named anchor, it would also be useful to refer to text where no
   anchor was provided by the author. In particular, that ability would
   help support third-party annotations and backward link traversal.

   This RFC proposes a new kind of fragment identifier based on
   matching text in the referenced document. A text-search fragment
   identifier may refer to short phrases in the text of documents in
   plain text or SGML format (where the latter includes HTML, XML, and
   other applications of SGML).

   As many fragment specification schemes (FSSes) might be possible,
   this RFC also proposes a general framework for new FSSes that avoids
   conflict with fragment identifiers that refer to HTML anchor names.

2. Textual Content of a Document

   For the purposes of searching, any document is treated as a sequence
   of words. A word is an unbroken sequence of letters or digits, where
   accented letters are translated to their unaccented counterparts and
   apostrophes (ASCII 39) are ignored. Any sequence of one or more
   other non-word characters forms a break between words.

   If the target document is in an SGML format, all SGML tags are
   ignored and entities expanded when considering its textual content.
   Note that SGML tags immediately surrounded by word characters do not
   constitute word breaks.

   For example, the plain text excerpt:

      Hey! That 'quick' brown fox didn't jump over the lazy dog?

   is treated as the sequence of words:

      Hey, That, quick, brown, fox, didnt, jump, over, the, lazy, dog

   The HTML excerpt:

      Es brillig war. Die schlichten Toven
      Wirrten und wimmelten in
      Waben;
      Und aller-mümsige Burggoven
      Die mohmem Räth' ausgraben1.

   is treated as the sequence of words:

      Es, brillig, war, Die, schlichten, Toven, Wirrten, und, wimmelten,
      in, Waben, Und, aller, mumsige, Burggoven, Die, mohmem, Rath,
      ausgraben1

3. Extended Fragment Identifiers

   Fragment identifiers for HTML documents, as given in RFC 1866, refer
   to an element by matching its NAME attribute. Since the value of the
   NAME attribute must be an SGML NAME token, such "anchor fragment
   identifiers" must always begin with a letter.

   This RFC proposes one new fragment specification scheme (FSS), and
   various other new FSSes may also be devised in the future. Fragment
   identifiers in such new schemes could be termed "extended fragment
   identifiers" (EFIs). To avoid conflict with the existing namespace
   for fragment identifiers, it is proposed that all EFIs begin with a
   short keyword indicating their fragment specification scheme (an
   FSSID) enclosed in colons (ASCII 58).

   Thus, the proposed syntax for a fragment identifier is as follows:

      fragid            = anchor-fragid | extended-fragid
      anchor-fragid     = sgml-name
      sgml-name         = alpha *namechar
      namechar          = alpha | digit | "-" | "."
      extended-fragid   = ":" fss-id ":" fss-specific-part
      fss-id            = 1*alphadigit
      alphadigit        = alpha | digit
      fss-specific-part = *uchar

   The following definitions are taken from RFC 1738 [1]:

      alpha      = "A" | "B" | ... | "Z" | "a" | "b" | ... | "z"
      digit      = "0" | "1" | ... | "9"
      safe       = "$" | "-" | "_" | "." | "+"
      extra      = "!" | "*" | "'" | "(" | ")" | ","
      hex        = digit | "A" | ... | "F" | "a" | ... | "f"
      escape     = "%" hex hex
      unreserved = alpha | digit | safe | extra
      uchar      = unreserved | escape

   Note that although no limit on the length of the fss-specific-part
   is imposed here, a fragment identifier will often be part of the
   value of an HREF attribute in an HTML document, and the SGML
   declaration for HTML limits the length of an attribute value (in
   RFC 1866 [2] section 9.5, this limit is 1024 characters).

4. Text-Search Fragment Specification Scheme

   The FSSID for the text-search FSS described here is "words". The
   FSS-specific part consists of a sequence of words separated by
   hyphens (ASCII 45). Embedded in this string there must be exactly
   one pair of parentheses that encloses one or more whole words. More
   precisely, the syntax is:

      words-fragid = ":words:" *precontext targetphrase *postcontext
      precontext   = word "-"
      targetphrase = "(" word *[ "-" word ] ")"
      postcontext  = "-" word
      word         = 1*alphadigit

   The semantics of such a fragment identifier are as follows. A
   case-sensitive match for the entire sequence of words, ignoring the
   parentheses, is to be sought within the target document's word
   sequence. Should the entire sequence of words occur multiple times,
   only the first occurrence is considered.

   The "target phrase" actually referred to by the fragment identifier
   is the portion enclosed in parentheses. The words before and after
   the parentheses serve as context to distinguish multiple occurrences
   of the target phrase in the target document.

   For example, any of the following fragment identifiers could refer
   to the word "distinguish" in the previous paragraph.

      :words:(distinguish)
      :words:serve-as-context-to-(distinguish)
      :words:context-to-(distinguish)-multiple-occurrences

5. Recommended Application Behaviour

   The key words "MUST", "SHOULD", and "MAY" used in this section are
   to be interpreted as described in RFC 2119 [3].
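
   As an informal, non-normative illustration of the matching semantics
   of section 4, on which the behaviour recommended in this section
   relies, the following Python sketch parses a "words" fragment
   identifier and searches a document's word sequence for it. The
   function names (words_of, parse_words_fragid, find_target) are
   hypothetical and not part of this proposal. The sketch assumes plain
   text input (SGML tags and entities are not handled), assumes that
   "unaccenting" can be approximated by Unicode decomposition, and does
   not decode "%" escapes.

```python
import re
import unicodedata

# A word in the fragment grammar is 1*alphadigit (ASCII letters and digits).
_WORD = r"[A-Za-z0-9]+"

# words-fragid = ":words:" *precontext targetphrase *postcontext
_FRAGID = re.compile(
    rf":words:((?:{_WORD}-)*)\(({_WORD}(?:-{_WORD})*)\)((?:-{_WORD})*)"
)


def words_of(text):
    """Split plain text into words following the rules of section 2.

    Assumption: accents are removed by NFKD decomposition and dropping
    combining marks; only ASCII letters and digits count as word characters.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(
        ch for ch in decomposed
        if ch != "'" and not unicodedata.combining(ch)  # apostrophes are ignored
    )
    return re.findall(_WORD, stripped)


def parse_words_fragid(fragid):
    """Return (precontext, target, postcontext) as lists of words."""
    m = _FRAGID.fullmatch(fragid)
    if m is None:
        raise ValueError(f"not a words fragment identifier: {fragid!r}")
    pre = [w for w in m.group(1).split("-") if w]
    target = m.group(2).split("-")
    post = [w for w in m.group(3).split("-") if w]
    return pre, target, post


def find_target(doc_words, fragid):
    """Return the (start, end) word indices of the target phrase, or None.

    The whole sequence (context plus target) is matched case-sensitively,
    and only its first occurrence is considered.
    """
    pre, target, post = parse_words_fragid(fragid)
    needle = pre + target + post
    for i in range(len(doc_words) - len(needle) + 1):
        if doc_words[i:i + len(needle)] == needle:
            start = i + len(pre)
            return start, start + len(target)
    return None


if __name__ == "__main__":
    paragraph = (
        "The words before and after the parentheses serve as context to "
        "distinguish multiple occurrences of the target phrase in the "
        "target document."
    )
    doc = words_of(paragraph)
    for fragid in (
        ":words:(distinguish)",
        ":words:serve-as-context-to-(distinguish)",
        ":words:context-to-(distinguish)-multiple-occurrences",
    ):
        print(fragid, find_target(doc, fragid))
```

   The returned indices refer to positions in the word sequence, not to
   character offsets. A result of None corresponds to the case in which
   the target phrase is not found in the target document.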

   The intended application areas for text-search fragment identifiers
   are critical discussion, third-party review of documents, and so on.
   The text-search fragment identifier semantics were chosen to provide
   some robustness despite minor changes to the target document, in
   order that it might better support document annotation.

   A hyperlink URL which includes a text-search fragment identifier
   SHOULD be interpreted by document processors and user agents to mean
   a fine-grained hypertext link to the indicated target phrase. The
   REL and REV attributes of such a hyperlink (see RFC 1866 [2] section
   5.7.3) can be used to give a disposition with respect to the target
   phrase. For argumentation, for example, the disposition keywords
   "support", "issue", "query", and "comment" are suggested.

   When such a link is traversed, an interactive user agent conforming
   to this specification MUST enable the user to discover the precise
   location of the words of the target phrase in the document in some
   fashion. (Merely scrolling the display to the approximate location
   of the target phrase is not sufficient.) For example, the words of
   the target phrase could be highlighted, or symbols could be inserted
   into the displayed text of the document to point out the target
   phrase; such indication MAY be shown by default or shown at the
   request of the user. It is not necessary to indicate the context
   words surrounding the target phrase in any particular fashion.

   User agents MAY indicate the target phrase of a link pointing into a
   displayed document even when the displayed document was reached by
   means other than traversing that link.

   If a user agent allows backward traversal of such hyperlinks, the
   source anchor for the backward link SHOULD be provided at or near
   the target phrase.
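
   One way a user agent might locate the target phrase precisely enough
   to highlight it, or to insert symbols around it, is sketched below
   in Python. This is an illustration under stated assumptions, not a
   prescribed implementation: the names words_with_spans, locate_target
   and mark are hypothetical, the input is plain text only, accents are
   removed by Unicode decomposition, and the fragment identifier is
   assumed to have been split already into its context and target
   words.

```python
import unicodedata


def words_with_spans(text):
    """Return a list of (word, start, end) character spans for plain text.

    Assumption: a character belongs to a word if, after removing accents
    via NFKD decomposition, it consists only of ASCII letters or digits.
    """
    tokens = []
    chars, start, end = [], None, None
    for i, ch in enumerate(text):
        base = "".join(
            c for c in unicodedata.normalize("NFKD", ch)
            if not unicodedata.combining(c)
        )
        if ch == "'" or not base:
            continue  # apostrophes and bare combining marks never break a word
        if all(c.isascii() and c.isalnum() for c in base):
            if start is None:
                start = i
            chars.append(base)
            end = i + 1
        elif chars:
            tokens.append(("".join(chars), start, end))
            chars, start, end = [], None, None
    if chars:
        tokens.append(("".join(chars), start, end))
    return tokens


def locate_target(text, pre, target, post):
    """Return (start, end) character offsets of the target phrase, or None.

    `target` is assumed to be non-empty, as the fragment syntax requires.
    Only the first occurrence of the full word sequence is considered.
    """
    tokens = words_with_spans(text)
    words = [w for w, _, _ in tokens]
    needle = list(pre) + list(target) + list(post)
    for i in range(len(words) - len(needle) + 1):
        if words[i:i + len(needle)] == needle:
            first = tokens[i + len(pre)]
            last = tokens[i + len(pre) + len(target) - 1]
            return first[1], last[2]
    return None


def mark(text, span, left="[[", right="]]"):
    """Insert marker symbols around the located target phrase."""
    start, end = span
    return text[:start] + left + text[start:end] + right + text[end:]


if __name__ == "__main__":
    text = "Hey! That 'quick' brown fox didn't jump over the lazy dog?"
    span = locate_target(text, ["brown", "fox"], ["didnt", "jump"], ["over"])
    if span is None:
        print("warning: target phrase not found")
    else:
        print(mark(text, span))
```

   Only the target phrase is marked; the context words are used for
   matching but are not indicated. The span is reported in offsets into
   the original text, so punctuation and apostrophes inside the target
   phrase remain part of the marked region.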

   When the target phrase indicated by a text-search fragment
   identifier is not found in the target document, an interactive user
   agent SHOULD display a warning condition to notify the user that it
   was not found.

6. Security Considerations

   The proposed text-search fragment specification scheme embeds a
   small excerpt from the document into the fragment identifier.
   Therefore it is important to keep in mind that a URL into the text
   of a sensitive document might contain sensitive text, and storage of
   personal URL collections or histories would need to be protected.

   A text-search fragment identifier only specifies the way a user
   agent should treat data it has already received, and reveals no new
   access to a document other than the words in the fragment itself.
   However, note that a text-search fragment identifier could be used
   to call attention to a mistake or a phrase otherwise made
   inconspicuous (e.g. by its colouring or small typeface), in a manner
   which will undoubtedly make some document authors uncomfortable.

7. References

   [1] T. Berners-Lee, L. Masinter, M. McCahill, Editors, "Uniform
       Resource Locators (URL)", RFC 1738, December 1994.

   [2] T. Berners-Lee, D. Connolly, "Hypertext Markup Language - 2.0",
       RFC 1866, November 1995.

   [3] S. Bradner, "Key words for use in RFCs to Indicate Requirement
       Levels", BCP 14, RFC 2119, March 1997.

8. Author's Address

   Ka-Ping Yee
   St. Paul's United College
   Westmount Road North
   Waterloo, Ontario, Canada, R2M 5H9

   Phone: +1 519-725-8008
   Fax: +1 519-885-6364
   E-Mail: ping@lfw.org

9. Full Copyright Statement

   Copyright (C) The Internet Society (1998). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.