< draft-ietf-morg-fuzzy-search-02.txt   draft-ietf-morg-fuzzy-search-03.txt >
Message Organization Working Group T. Sirainen Message Organization Working Group T. Sirainen
Internet-Draft January 23, 2010 Internet-Draft November 16, 2010
Intended status: Standards Track Intended status: Standards Track
Expires: July 27, 2010 Expires: May 20, 2011
IMAP4 Extension for Fuzzy Search IMAP4 Extension for Fuzzy Search
draft-ietf-morg-fuzzy-search-02 draft-ietf-morg-fuzzy-search-03
Abstract Abstract
This document describes an IMAP protocol extension enabling server to This document describes an IMAP protocol extension enabling server to
perform searches with inexact matching and assigning relevancy scores perform searches with inexact matching and assigning relevancy scores
for matched messages. for matched messages.
Note Note
A revised version of this draft document will be submitted to the RFC A revised version of this draft document will be submitted to the RFC
editor as a Proposed Standard for the Internet Community. Discussion editor as a Proposed Standard for the Internet Community. Discussion
and suggestions for improvement are requested, and should be sent to and suggestions for improvement are requested, and should be sent to
morg@ietf.org. morg@ietf.org.
Status of this Memo Status of this Memo
This Internet-Draft is submitted to IETF in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF). Note that other groups may also distribute
other groups may also distribute working documents as Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at This Internet-Draft will expire on May 20, 2011.
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on July 27, 2010.
Copyright Notice Copyright Notice
Copyright (c) 2010 IETF Trust and the persons identified as the Copyright (c) 2010 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the BSD License. described in the Simplified BSD License.
1. Conventions used in this document 1. Conventions used in this document
In examples, "C:" indicates lines sent by a client that is connected In examples, "C:" indicates lines sent by a client that is connected
to a server. "S:" indicates lines sent by the server to the client. to a server. "S:" indicates lines sent by the server to the client.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [Kwds]. document are to be interpreted as described in RFC 2119 [Kwds].
skipping to change at page 2, line 37 skipping to change at page 2, line 31
When humans perform searches in IMAP clients, they typically want to When humans perform searches in IMAP clients, they typically want to
see the most relevant search results first. IMAP servers are able to see the most relevant search results first. IMAP servers are able to
do this in the most efficient way when they're free to internally do this in the most efficient way when they're free to internally
decide how searches should match messages. This document describes a decide how searches should match messages. This document describes a
new SEARCH=FUZZY extension that provides such functionality. new SEARCH=FUZZY extension that provides such functionality.
3. The FUZZY Search Key 3. The FUZZY Search Key
FUZZY search key takes another search key as its argument. Server is FUZZY search key takes another search key as its argument. Server is
allowed to perform all matching in an implementation-defined manner allowed to perform all matching in an implementation-defined manner
for this search key. Typically this would be used to search for for this search key, including ignoring the active comparator as
defined by [RFC5255]. Typically this would be used to search for
strings, for example: strings, for example:
C: A01 SEARCH FUZZY (SUBJECT "IMAP break") C: A1 SEARCH FUZZY (SUBJECT "IMAP break")
S: * SEARCH 1 5 10 S: * SEARCH 1 5 10
S: A01 OK Search completed. S: A1 OK Search completed.
Besides matching messages with subject "IMAP break", the above search Besides matching messages with subject "IMAP break", the above search
may also match messages with subjects "broken IMAP", "IMAP is may also match messages with subjects "broken IMAP", "IMAP is
broken", or anything else the server decides that might be a good broken", or anything else the server decides that might be a good
match. match.
This example does a fuzzy SUBJECT search, but a non-fuzzy FROM
search:
C: A2 SEARCH FUZZY SUBJECT work FROM user@example.com
S: * SEARCH 1 4
S: A2 OK Search completed.
It is implementation-defined how server handles multiple separate
FUZZY search keys.
4. Relevancy Scores for Search Results 4. Relevancy Scores for Search Results
Servers SHOULD assign a search relevancy score for each matched Servers SHOULD assign a search relevancy score for each matched
message when the FUZZY search key is given. Relevancy scores are message when the FUZZY search key is given. Relevancy scores are
given in range 1-100, where 100 is the highest relevancy. The given in range 1-100, where 100 is the highest relevancy. The
relevancy scores SHOULD use the full 1-100 range, so that clients can relevancy scores SHOULD use the full 1-100 range, so that clients can
show them to users in a meaningful way, such as a percentage value. show them to users in a meaningful way, such as a percentage value.
As the name already tells, relevancy scores specify how relevant to As the name already tells, relevancy scores specify how relevant to
the search the matched message is. It's not necessarily the same as the search the matched message is. It's not necessarily the same as
how precisely the message matched. For example a message whose how precisely the message matched. For example a message whose
subject matches fuzzily the search string might get a higher subject matches fuzzily the search string might get a higher
relevancy score than a message whose body had the exact string in the relevancy score than a message whose body had the exact string in the
middle of a sentence. middle of a sentence. When multiple search keys are matched fuzzily,
it's server-dependent on how the relevancy score is calculated.
If server advertises the ESEARCH capability as defined by [ESEARCH], If server also advertises the ESEARCH capability as defined by
the relevancy scores can be retrieved using the new RELEVANCY return [ESEARCH], the relevancy scores can be retrieved using the new
option for SEARCH: RELEVANCY return option for SEARCH:
C: A02 SEARCH RETURN (RELEVANCY ALL) FUZZY TEXT "Helo" C: B1 SEARCH RETURN (RELEVANCY ALL) FUZZY TEXT "Helo"
S: * ESEARCH (TAG "A02") ALL 1,5,10 RELEVANCY (4 99 42) S: * ESEARCH (TAG "B2") ALL 1,5,10 RELEVANCY (4 99 42)
S: A02 OK Search completed. S: B1 OK Search completed.
The RELEVANCY return option MUST NOT be used unless FUZZY search key The RELEVANCY return option MUST NOT be used unless FUZZY search key
is also given. is also given. Note that SEARCH results aren't sorted by relevancy,
SORT is needed for that.
5. Fuzzy matching with non-string search keys 5. Fuzzy matching with non-string search keys
Fuzzy matching is not limited to just string matching. All search Fuzzy matching is not limited to just string matching. All search
keys SHOULD be matched fuzzily, although what exactly that means for keys SHOULD be matched fuzzily, although what exactly that means for
different search keys is left up to server implementations to decide different search keys is left up to server implementations to decide
-- including deciding that fuzzy matching is meaningless for a -- including deciding that fuzzy matching is meaningless for a
particular key, and falling back to exact matching. Some suggestions particular key, and falling back to exact matching. Some suggestions
are given below. are given below.
skipping to change at page 4, line 15 skipping to change at page 4, line 23
Flags: Server could return messages that don't have the specified Flags: Server could return messages that don't have the specified
flags, but with a lower relevancy score. flags, but with a lower relevancy score.
UIDs, sequences, modification sequences: These are examples of keys UIDs, sequences, modification sequences: These are examples of keys
for which exact matching is probably what makes sense. for which exact matching is probably what makes sense.
Alternatively, a server might choose, for instance, to expand a UID Alternatively, a server might choose, for instance, to expand a UID
range by 5% on each side. range by 5% on each side.
6. Extensions to SORT 6. Extensions to SORT
If server advertises the SORT capability as defined by [SORT], the If server also advertises the SORT capability as defined by [SORT],
results can be sorted by the new RELEVANCY sort criteria: the results can be sorted by the new RELEVANCY sort criteria:
C: A03 SORT (RELEVANCY) UTF-8 FUZZY SUBJECT "Helo" C: C1 SORT (RELEVANCY) UTF-8 FUZZY SUBJECT "Helo"
S: * SORT 5 10 1 S: * SORT 5 10 1
S: A03 OK Sort completed. S: C1 OK Sort completed.
The message with the highest score is returned first. As with The message with the highest score is returned first. As with
RELEVANCY return option, RELEVANCY sort criteria MUST NOT be used RELEVANCY return option, RELEVANCY sort criteria MUST NOT be used
unless FUZZY search key is also given. unless FUZZY search key is also given.
If server advertises the ESORT capability as defined by [CONTEXT], If server also advertises the ESORT capability as defined by
the relevancy scores can be retrieved using the new RELEVANCY return [CONTEXT], the relevancy scores can be retrieved using the new
option for SORT: RELEVANCY return option for SORT:
C: A04 SORT RETURN (RELEVANCY ALL) (RELEVANCY) FUZZY TEXT "Helo" C: C2 SORT RETURN (RELEVANCY ALL) (RELEVANCY) UTF-8 FUZZY TEXT "Helo"
S: * ESEARCH (TAG "A04") ALL 5,10,1 RELEVANCY (99 42 4) S: * ESEARCH (TAG "C2") ALL 5,10,1 RELEVANCY (99 42 4)
S: A04 OK Sort completed. S: C2 OK Sort completed.
To limit the number of returned messages, use the PARTIAL return
option. For example this returns the 10 most relevant messages:
C: C3 SORT RETURN (PARTIAL 1:10) (RELEVANCY) UTF-8 FUZZY TEXT "World"
S: * ESEARCH (TAG "C3") PARTIAL (1:10 42,9,34,13,15,4,2,7,23,82)
S: C3 OK Sort completed.
7. Formal Syntax 7. Formal Syntax
The following syntax specification uses the augmented Backus-Naur The following syntax specification uses the augmented Backus-Naur
Form (BNF) as described in [ABNF]. It includes definitions from Form (BNF) as described in [ABNF]. It includes definitions from
[RFC3501], [IMAP-ABNF] and [SORT]. [RFC3501], [IMAP-ABNF] and [SORT].
capability =/ "SEARCH=FUZZY" capability =/ "SEARCH=FUZZY"
score = 1*3DIGIT score = 1*3DIGIT
skipping to change at page 5, line 24 skipping to change at page 5, line 30
search-return-data =/ "RELEVANCY" SP score-list search-return-data =/ "RELEVANCY" SP score-list
;; Conforms to <search-return-data>, from [IMAP-ABNF] ;; Conforms to <search-return-data>, from [IMAP-ABNF]
search-return-opt =/ "RELEVANCY" search-return-opt =/ "RELEVANCY"
;; Conforms to <search-return-opt>, from [IMAP-ABNF] ;; Conforms to <search-return-opt>, from [IMAP-ABNF]
sort-key =/ "RELEVANCY" sort-key =/ "RELEVANCY"
8. Security Considerations 8. Security Considerations
This document is believed not to have any security implications. Implementation of this extension might enable a denial-of-service
attack if the implementation isn't careful to prevent them. Fuzzy
search engines are often complex with non-obvious disk space, memory
and/or CPU usage patterns. Implementors should test at least the
behavior of large messages that contain very long words and/or unique
random strings. Also very long search keys might cause excessive
memory or CPU usage.
Invalid input may also be problematic. For example if the search
engine takes UTF-8 stream as input, it might fail more or less badly
when illegal UTF-8 sequences are fed to it from a message whose
character set was claimed to be UTF-8. This could be avoided by
validating all the input and, for example, replacing illegal UTF-8
sequences with the Unicode replacement character (U+FFFD).
Search relevancy rankings might be susceptible to "poisoning" by
smart attackers using certain keywords or hidden markup (e.g. HTML)
in their messages to boost the rankings. This can't be fully
prevented by servers, so clients should prepare for it by at least
allowing user to see all the search results, rather than hide results
below a certain score.
9. IANA Considerations 9. IANA Considerations
IMAP4 capabilities are registered by publishing a standards track or IMAP4 capabilities are registered by publishing a standards track or
IESG approved experimental RFC. The registry is currently located IESG approved experimental RFC. The registry is currently located
at: at:
http://www.iana.org/assignments/imap4-capabilities http://www.iana.org/assignments/imap4-capabilities
This document defines the X-DRAFT-I02-SEARCH=FUZZY [[anchor7: Note to This document defines the X-DRAFT-I03-SEARCH=FUZZY [[anchor7: Note to
RFC Editor: fix before publication]] IMAP capability. IANA is RFC Editor: fix before publication]] IMAP capability. IANA is
requested to add it to the registry. requested to add it to the registry.
10. Acknowledgements 10. Acknowledgements
Alexey Melnikov, Zoltan Ordogh and Barry Leiba have helped with this Alexey Melnikov, Zoltan Ordogh, Barry Leiba, Cyrus Daboo and Dave
document. Cridland have helped with this document.
11. Normative References 11. Normative References
[ABNF] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax [ABNF] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", RFC 5234, January 2008. Specifications: ABNF", RFC 5234, January 2008.
[CONTEXT] Cridland, D. and C. King, "Contexts for IMAP4", RFC 5267, [CONTEXT] Cridland, D. and C. King, "Contexts for IMAP4", RFC 5267,
July 2008. July 2008.
[ESEARCH] Melnikov, A. and D. Cridland, "IMAP4 Extension to SEARCH [ESEARCH] Melnikov, A. and D. Cridland, "IMAP4 Extension to SEARCH
skipping to change at page 6, line 22 skipping to change at page 6, line 44
[IMAP-ABNF] [IMAP-ABNF]
Melnikov, A. and C. Daboo, "Collected Extensions to IMAP4 Melnikov, A. and C. Daboo, "Collected Extensions to IMAP4
ABNF", RFC 4466, April 2006. ABNF", RFC 4466, April 2006.
[Kwds] Bradner, S., "Key words for use in RFCs to Indicate [Kwds] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", RFC 2119, March 1997. Requirement Levels", RFC 2119, March 1997.
[RFC3501] Crispin, M., "INTERNET MESSAGE ACCESS PROTOCOL - VERSION [RFC3501] Crispin, M., "INTERNET MESSAGE ACCESS PROTOCOL - VERSION
4rev1", RFC 3501, March 2003. 4rev1", RFC 3501, March 2003.
[RFC5255] Newman, C., Gulbrandsen, A., and A. Melnikov, "Internet
Message Access Protocol Internationalization", RFC 5255,
June 2008.
[SORT] Crispin, M. and K. Murchison, "Internet Message Access [SORT] Crispin, M. and K. Murchison, "Internet Message Access
Protocol - SORT and THREAD Extensions", RFC 5256, Protocol - SORT and THREAD Extensions", RFC 5256,
June 2008. June 2008.
Author's Address Author's Address
Timo Sirainen Timo Sirainen
Email: tss@iki.fi Email: tss@iki.fi
 End of changes. 24 change blocks. 
40 lines changed or deleted 78 lines changed or added

This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/