CDNI                                                              W. Jin
Internet-Draft                                                     M. Li
Intended status: Informational                             B. Khasnabish
Expires: January 31, 2013                                ZTE Corporation
                                                           July 30, 2012


              Content De-duplication for CDNi Optimization
          draft-jin-cdni-content-deduplication-optimization-02

Abstract

   The recent explosive growth of content delivery/distribution
   networks (CDNs) and of their interconnection is causing unintended
   repetition of content storage in the same dCDN.  This can be avoided
   by using a suitable de-duplication mechanism.  This document
   explores the scenarios that create the problem, and then discusses
   approaches to eliminate the duplicated transmission of the same
   content from uCDN(s) to a dCDN in CDNi networks.  To implement the
   optimization, some enhancements to the CDNi metadata model and
   interface are required.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 31, 2013.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Deployment Scenarios
     2.1.  Scenario 1
     2.2.  Scenario 2
     2.3.  Scenario 3
   3.  Content Naming for CDNi
     3.1.  Uniqueness
     3.2.  Ownership
   4.  CDNi Content De-duplication Optimization Implementation
     4.1.  Constant URL
     4.2.  Content Naming Mechanism
     4.3.  Description of Content De-duplication
       4.3.1.  Pre-Positioned Content Acquisition
       4.3.2.  Dynamic Content Acquisition
       4.3.3.  Content Removal
   5.  Security Considerations
   6.  IANA Considerations
   7.  Acknowledgments
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Authors' Addresses

1.  Introduction

   In some CDNi deployments, a dCDN may cache the same content from the
   same Content Service Provider (CSP) more than once.  For example,
   the CSP may have agreements with two authoritative CDNs, both of
   which may be upstream CDNs (uCDNs) of the same dCDN.  Cascading of
   CDNs may result in a similar scenario: a top-layer uCDN establishes
   connections with two intermediate-layer uCDNs, and both connect to
   the same bottom-layer dCDN.

   In such scenarios, the dCDN may receive the content via 'push' from
   one uCDN through a pre-position procedure, and may then also request
   to download the same content from another uCDN, either through
   another pre-position procedure or upon a user's content request.
   Storing the same content multiple times not only wastes the dCDN's
   memory or storage resources, it also wastes the transmission
   bandwidth used to deliver the same content repeatedly.  Therefore,
   it is necessary to avoid delivering the same content from different
   uCDNs to the dCDN repeatedly.

   This draft lists a set of scenarios that may cause repeated delivery
   of the same content, and then discusses a feasible solution for
   content de-duplication.

   To address the content repetition problem, several issues need to be
   considered:

   *  How the dCDN can detect content repetition.

   *  How to avoid content repetition when one or more uCDNs select one
      dCDN to deliver the same content to multiple User Agents.

   This document provides a detailed analysis of these issues.  An
   optimized mechanism is needed to de-duplicate content in a CDNi
   network.  Implementing such an optimization may require enhancements
   to the CDNi metadata model and interface.

1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   This document reuses the terminology defined in:

      [draft-ietf-cdni-problem-statement-08],

      [draft-ietf-cdni-requirements-03],

      [draft-ietf-cdni-framework-01], and

      [draft-ietf-cdni-use-cases-09].

   Resource Id: a metadata object (e.g., a partial or whole URL, or
   another format) which is generated by the uCDN and identifies the
   storage of the content in the uCDN.

   Content Id: a metadata object (e.g., a URN) which uniquely
   identifies the content in the scope of CDNi.

2.  Deployment Scenarios

   This section illustrates several CDNi deployment scenarios that
   typically lead to duplicated content in the same server.

2.1.  Scenario 1

   As depicted in Figure 1, both CDN-A and CDN-B establish
   interconnections with CDN-C, which acts as a dCDN.  Thus, CDN-C will
   cache content for both CDN-A and CDN-B.  When CDN provider A and CDN
   provider B both have agreements with the same CSP for content
   delivery, CDN-C may be required by CDN-A and CDN-B separately to
   retrieve and cache the same content from the CSP.  CDN-C is then
   likely to end up storing duplicate copies.

   Because the location of content in a CDN is normally assigned by the
   CDN itself, the URLs of the same content are likely to differ
   between CDNs.  The URL of the content alone is therefore not enough
   to determine whether the content to be retrieved and cached is the
   same.

                       +-------+
                       |  CSP  |
                       +-------+
                        /      \
             ,--,--,--./        \,--,--,--.
          ,-'          `-.    ,-'          `-.
         ( CDN Provider A )  ( CDN Provider B )
          `-.  (CDN-A) ,-'    `-.  (CDN-B) ,-'
             `--'--'--'          `--'--'--'
                    \\            //
                     \\,--,--,--.//
                    ,-'          `-.
                   ( CDN Provider C )
                    `-.  (CDN-C) ,-'
                       `--'--'--'
                           |
                     +------------+
                     | User Agent |
                     +------------+

    === CDN Interconnect

     Figure 1 Interconnected CDNs with the same CSP and with one dCDN

2.2.  Scenario 2

   Now consider the case of cascaded CDNs, as depicted in Figure 2.
   The top-layer upstream CDN-A has a direct contract with the CSP and
   interconnects with two middle-layer CDNs (CDN-B and CDN-C), both of
   which are interconnected with the same bottom-layer downstream CDN
   (CDN-D).  Consequently, there are two possible delivery paths for
   CDN-D to cache the content of the CSP: one is CDN-A -> CDN-B ->
   CDN-D, and the other is CDN-A -> CDN-C -> CDN-D.  CDN-D may be asked
   to cache the same content by upstream CDNs (CDN-B and CDN-C) on
   different paths.  If the URL of the content is changed by CDN-B or
   CDN-C, CDN-D cannot recognize that the content to be cached is the
   same, which may lead to duplicated content being stored.

                       ,--,--,--.
                    ,-'          `-.     +-------+
                   ( CDN Provider A )----|  CSP  |
                    `-.  (CDN-A) ,-'     +-------+
                       //-'--'-\\
           ,--,--,--.//          \\,--,--,--.
        ,-'          `-.        ,-'          `-.
       ( CDN Provider B )      ( CDN Provider C )
        `-.  (CDN-B) ,-'        `-.  (CDN-C) ,-'
           `--'--'--'              `--'--'--'
                   \\              //
                     \\,--,--,--.//
                    ,-'          `-.
                   ( CDN Provider D )
                    `-.  (CDN-D) ,-'
                       `--'--'--'
                           |
                     +------------+
                     | User Agent |
                     +------------+

    === CDN Interconnect

                        Figure 2 Cascaded CDNs

2.3.  Scenario 3

   As depicted in Figure 3, two interconnected CDNs, CDN-A (uCDN) and
   CDN-B (dCDN), both have contracts with the CSP.  CDN-B plays two
   roles at the same time: downstream CDN of CDN-A and Authoritative
   CDN of the CSP.  When an end-user of CDN-A initiates a content
   request to the CSP, CDN-A decides that CDN-B should be the serving
   CDN and redirects the request to CDN-B.  If CDN-B does not yet have
   a local copy of the requested content (cache miss), CDN-B ingests
   the content from CDN-A.  Normally CDN-B, as an Authoritative CDN, is
   very likely to have already cached this content from the origin
   server.
   If CDN-B cannot recognize the requested content as that same
   content, the content will be retrieved and cached again.

                         +-------+
                         |  CSP  |
                         +-------+
                        /         \
             ,--,--,--./           \,--,--,--.
          ,-'          `-.       ,-'          `-.
         ( CDN Provider A )=====( CDN Provider B )
          `-.  (CDN-A) ,-'       `-.  (CDN-B) ,-'
             `--'--'--'             `--'--'--'
                                        |
                                  +------------+
                                  | User Agent |
                                  +------------+

    === CDN Interconnect

             Figure 3 Interconnected CDNs with the same CSP

3.  Content Naming for CDNi

   CDNs have their own content naming mechanisms, most of which are
   independent of one another because they use different algorithms,
   such as hash algorithms.  Consequently, for the same content
   distributed by two CDNs, the corresponding content identifiers are
   likely to be quite different.  [I-D.ietf-cdni-requirements-03]
   treats information about CDN content naming as intra-CDN
   information, and the CDNI solution MUST NOT require intra-CDN
   information to be exposed to other CDNs for effective and efficient
   delivery of the content.

   Therefore, a uniform content naming mechanism is needed for CDNi
   networks.  This mechanism, which can be implemented by the CDNI
   Metadata Distribution Protocol, may have the following properties.

3.1.  Uniqueness

   The CDNi content naming mechanism must guarantee the uniqueness of
   content identification.  Although URLs are widely used to identify
   network resources, they are not well suited to content
   identification in a CDNi network where content de-duplication needs
   to be taken into consideration.  Many cache systems use URL matching
   to detect repeated files with the same name and so avoid content
   repetition, but this will probably fail in the CDNi case, because
   the CDNs involved use different forwarding mechanisms.
   User-originated requests are always snooped by DPI (deep packet
   inspection) devices before being transmitted to the origin server,
   whereas the requests received by a dCDN are always redirected by one
   or more uCDNs.  Since there is no guarantee that URLs will not be
   changed during the redirection process, a new type of object needs
   to be defined to represent a unique content identifier.

   For a CSP's content that is distributed into different
   interconnected CDNs, the related metadata objects may differ in many
   cases.  For example, in Figure 2, when both CDN-A and CDN-B delegate
   the delivery of the same CSP's content, content metadata such as (a)
   content description, (b) access policy, and (c) security policy may
   not be exactly the same in both cases.  Such metadata is therefore
   not suitable for content identification.  Consequently, a new type
   of metadata object needs to be defined that uniquely identifies the
   same content.

3.2.  Ownership

   The CDNi content naming mechanism should embody the ownership of
   content identification.  Typically, a CDN provider may have
   contracts with many CSPs for delivering their content, as well as
   operate its own content delivery network.  However, much of the
   content published by these CSPs may be very similar, and some of it
   may even be exactly the same.  The problem is therefore how to
   determine whether these identical copies are from the same CSP, and
   how the interconnected CDNs can verify that the content is
   identical.  (Note: Such copies are pointed to by the same content
   identification only if they are from the same content source.)  For
   a traditional (non-interconnected) CDN, it is not a problem to
   distinguish them via its internal content naming mechanism.
   When a CDN interconnects with other CDNs, the situation becomes more
   complicated, because a CDN acting as a dCDN lacks awareness of the
   CSP's content.

4.  CDNi Content De-duplication Optimization Implementation

4.1.  Constant URL

   In general, a URL-based mechanism can be used to implement content
   de-duplication in a CDNi network.  As discussed in Section 2, the
   URL for a given content item may differ from uCDN to uCDN and is
   likely to be changed in the redirection process.  The draft
   [I-D.ietf-cdni-framework-01] uses an agreement between a pair of
   interconnected CDNs to configure a specific URL; however, this
   method is not flexible enough to support de-duplication in a complex
   CDNi network.  A feasible proposal is therefore to specify a
   mechanism for the CDNi network which guarantees that the same CSP
   content cached in different uCDNs is identified by the same URL, and
   that the URL is unchanged, or at least that its path part remains
   the same, in the redirection process.  The main problem with this
   mechanism is its lack of resilience, and we prefer the alternative
   mechanism introduced in Section 4.2 below.

4.2.  Content Naming Mechanism

   This section provides a detailed description of the CDNi content
   naming mechanism using the CDNi Metadata Protocol.

   In general, a CSP's content, as well as its copies cached in
   interconnected CDNs, is delivered to numerous End-Users.  The draft
   [I-D.ietf-cdni-framework-01] assigns each copy an identifier in the
   form of a URL in which a "CDN-Domain" is embedded; this is used to
   distinguish whether a download request is from an end-user or from
   a uCDN.  We use the term Resource Identifier to represent the
   storage location of the content copies in interconnected CDNs.
   However, unlike the usage in the draft [I-D.ietf-cdni-framework-01],
   in this document the CDNi content naming mechanism specifies
   Resource Identifiers as relating only to content in uCDNs.  Taking
   the example in Section 2.3, we use Resource Identifier A to point to
   content originated through uCDN A.

   Although the Resource Identifier is able to identify a content item,
   the uniqueness of content identification cannot be guaranteed,
   because the Resource Identifier identifies the storage of the
   content in the uCDN and will therefore change during redirection
   between different uCDNs.  To resolve this, we introduce the term
   Content Identifier: an identifier, similar in usage to a URN, that
   is associated with the Resource Identifier and uniquely identifies
   the content.  Note that the Content Identifier MUST be globally
   unique.

   Figure 4 shows a metadata model that can be used for maintaining the
   relationship between the two types of Identifiers.  Using this
   model, the dCDN is able to uniquely identify the targeted content
   and route requests towards it.

                   +----------------------+
                   |  Content Identifier  |
                   +----------------------+
                      /       |        \
                     /        |         \
      +--------------+ +--------------+     +--------------+
      |   Resource   | |   Resource   |     |   Resource   |
      | Identifier 1 | | Identifier 2 | ... | Identifier n |
      +--------------+ +--------------+     +--------------+

   Figure 4 Metadata Model for Maintaining Relationship among Multi-IDs

   Note: We need to develop an authoritative 'Entity' for creating and
   maintaining the Content Identifier.

4.3.  Description of Content De-duplication

   This section discusses the details of the solutions that use the
   CDNi content naming mechanism for content de-duplication.  The
   Content Identifier can be defined as a metadata object.
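
   As a non-normative illustration, the metadata model of Figure 4
   could be represented roughly as shown below.  This is a minimal
   sketch and not a definition of the CDNi metadata interface: the
   names ContentRecord, ContentIndex, bind and lookup, as well as the
   example URN and URLs, are assumptions made for this example only.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class ContentRecord:
    """One Content Identifier and the Resource Identifiers bound to it."""
    content_id: str  # globally unique, URN-like (hypothetical format)
    resource_ids: Set[str] = field(default_factory=set)


class ContentIndex:
    """Hypothetical dCDN-side store for the Figure 4 model."""

    def __init__(self) -> None:
        self._records: Dict[str, ContentRecord] = {}

    def bind(self, content_id: str, resource_id: str) -> ContentRecord:
        """Bind a uCDN-specific Resource Identifier to a Content Identifier."""
        record = self._records.setdefault(content_id, ContentRecord(content_id))
        record.resource_ids.add(resource_id)
        return record

    def lookup(self, content_id: str) -> Optional[ContentRecord]:
        """Return the record for a Content Identifier, or None if unknown."""
        return self._records.get(content_id)


index = ContentIndex()
# Two uCDNs name the same content with different Resource Identifiers.
index.bind("urn:example:csp1:movie42", "http://cdn-a.example/obj/9f3c")
index.bind("urn:example:csp1:movie42", "http://cdn-b.example/v/movie42.mp4")
record = index.lookup("urn:example:csp1:movie42")
```

   Keying the index by Content Identifier rather than by URL is what
   allows two differently named copies to resolve to a single record.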
   The metadata object can be distributed from uCDN to dCDN, with or
   without the actual content item, for storage in the dCDN if the uCDN
   wants to pre-position the content item in the dCDN before the actual
   user request.  The Content Identifier metadata object is bound to
   the Resource Identifier, which identifies the storage of the content
   in the uCDN, and together they form a content identification model
   for the content item.  By using this content identification model,
   an interconnected CDN is able to detect content repetition.  The
   interconnected CDN must update the content status synchronously;
   from the content status it can determine whether the resource copy
   is cached or not.

   The following subsections present several procedures for avoiding
   CDNi content repetition.

4.3.1.  Pre-Positioned Content Acquisition

   The flow in Figure 5 illustrates how two uCDNs successively
   pre-position the same content in the dCDN.  In this flow, the
   content to be pre-positioned in the dCDN is identified by different
   Resource Identifiers corresponding to uCDN A and uCDN B.
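
   The dCDN-side decision at the heart of this flow can be sketched as
   follows.  This is an illustrative sketch only, under stated
   assumptions: the class DownstreamCDN, the method pre_position, the
   acquire callable standing in for the acquisition interface, and the
   returned strings "acquired" and "already-cached" (standing in for
   the two meanings of the OK reply) are all hypothetical and are not
   defined by this document.

```python
from typing import Callable, Dict, List, Set


class DownstreamCDN:
    """Hypothetical dCDN that de-duplicates pre-positioned content."""

    def __init__(self, acquire: Callable[[str], bytes]) -> None:
        self._acquire = acquire                    # fetch by Resource Identifier
        self._bindings: Dict[str, Set[str]] = {}   # Content Id -> Resource Ids
        self._store: Dict[str, bytes] = {}         # Content Id -> cached copy

    def pre_position(self, content_id: str, resource_id: str) -> str:
        """Handle a pre-position request and report the action taken."""
        if content_id in self._store:
            # Repetition detected: keep the single copy, bind the new name.
            self._bindings[content_id].add(resource_id)
            return "already-cached"
        # No copy yet: acquire from the requesting uCDN, then update status.
        self._store[content_id] = self._acquire(resource_id)
        self._bindings[content_id] = {resource_id}
        return "acquired"


fetched: List[str] = []


def fake_acquire(resource_id: str) -> bytes:
    """Stand-in for content acquisition; records what was fetched."""
    fetched.append(resource_id)
    return b"content-bytes"


dcdn = DownstreamCDN(fake_acquire)
first = dcdn.pre_position("urn:example:csp1:movie42",
                          "http://cdn-a.example/obj/9f3c")
second = dcdn.pre_position("urn:example:csp1:movie42",
                           "http://cdn-b.example/v/movie42.mp4")
```

   In this sketch the second request triggers no transfer: only the
   Resource Identifier supplied by the first uCDN is ever fetched.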

     +--------+                 +--------+                  +--------+
     |  dCDN  |                 | uCDN A |                  | uCDN B |
     +--------+                 +--------+                  +--------+
         |   Pre-position Request   |                           |
         |<-------------------------|                           |
+----------------+                  |                           |
|  Detection of  |                  |                           |
|    content     |                  |                           |(1)
|   repetition   |                  |                           |
+----------------+                  |                           |
         |            OK            |                           |
         |------------------------->|                           |
         |                          |                           |
         |   Acquisition Request    |                           |
         |------------------------->|                           |
         |                          |                           |
         |       Content Data       |                           |(2)
         |<-------------------------|                           |
+----------------+                  |                           |
| Update content |                  |                           |
|     status     |                  |                           |
+----------------+                  |                           |
         |                 Pre-position Request                 |
         |<-----------------------------------------------------|
+----------------+                  |                           |
|  Detection of  |                  |                           |
|    content     |                  |                           |(3)
|   repetition   |                  |                           |
+----------------+                  |                           |
         |                          OK                          |
         |----------------------------------------------------->|
         |                          |                           |

             Figure 5 Acquisition of Pre-Positioned Content

   The steps illustrated in the figure are as follows:

   1.  uCDN A requests that the dCDN pre-position a particular content
       item, identified by its Resource Identifier and Content
       Identifier.  On receiving the message, the dCDN uses the Content
       Identifier to look up its stored content identification model to
       see whether the same content item is cached.  Because the
       metadata does not exist, the dCDN replies to uCDN A with an OK
       message indicating that no such copy has been cached and that
       content pre-positioning is required.

   2.  The dCDN acquires the content from uCDN A.  Once the content is
       pre-positioned, the dCDN updates the content status and
       maintains the binding between the Resource Identifier and the
       Content Identifier metadata.

   3.  uCDN B requests that the dCDN pre-position the same content
       item, identified by its Resource Identifier and Content
       Identifier.
       On receiving the message, the dCDN uses the Content Identifier
       to look up its stored content identification model.  Because
       such metadata exists, the dCDN determines that the content is
       already cached.  The dCDN then replies to uCDN B with an OK
       message indicating that the same copy has been cached and that
       the pre-position request should be cancelled.  The dCDN locally
       binds the new Resource Identifier provided by uCDN B to the
       Content Identifier.

4.3.2.  Dynamic Content Acquisition

   The following flows illustrate how the dCDN performs content
   de-duplication in the cases of a cache miss and a cache hit, without
   content pre-positioning.

 +----------+                  +------+                     +------+
 | end-user |                  | dCDN |                     | uCDN |
 +----------+                  +------+                     +------+
      |      Content Request      |                            |
      |------------------------------------------------------->|
      |    Content Redirection    |                            |
      |<-------------------------------------------------------|(1)
      |      Content Request      |                            |
      |-------------------------->|                            |
      |                           |   Content Id Acquisition   |
      |                           |<-------------------------->|
      |                   +----------------+                   |
      |                   |  Detection of  |                   |
      |                   |    content     |                   |(2)
      |                   |   repetition   |                   |
      |                   +----------------+                   |
      |                           |    Acquisition Request     |
      |                           |--------------------------->|
      |                           |        Content Data        |
      |                           |<---------------------------|
      |                   +----------------+                   |(3)
      |                   | Update content |                   |
      |                   |     status     |                   |
      |                   +----------------+                   |
      |       Content Data        |                            |
      |<--------------------------|                            |
      |                           |                            |

         Figure 6 Dynamic Content Acquisition (cache miss case)

   The steps illustrated in the figure are as follows:

   1.  A content request originated by an end-user is received at the
       uCDN.  The uCDN processes the request and recognizes that the
       end-user is best served by a dCDN.  The uCDN therefore redirects
       the request to the dCDN by sending a redirection response to the
       end-user, who then requests the content from the dCDN.  The uCDN
       encapsulates the Resource Identifier of the requested content
       item in the redirection response.

   2.  On receiving the request, the dCDN uses the Resource Identifier
       pointing to the requested resource to fetch the corresponding
       Content Identifier from the uCDN.  The dCDN then uses the
       Content Identifier to look up its stored content identification
       model to check whether the content item is cached.  Because such
       metadata does not exist, the dCDN determines that this is a
       cache miss, and therefore the content needs to be downloaded
       from the uCDN before being delivered to the end-user.

   3.  The dCDN acquires the requested content from the uCDN.  Once the
       content is cached, the dCDN updates the content status and
       maintains the binding between the Resource Identifier and the
       Content Identifier metadata.  The dCDN then delivers the content
       data to the end-user.

 +----------+                  +------+                     +------+
 | end-user |                  | dCDN |                     | uCDN |
 +----------+                  +------+                     +------+
      |      Content Request      |                            |
      |------------------------------------------------------->|
      |    Content Redirection    |                            |
      |<-------------------------------------------------------|(1)
      |      Content Request      |                            |
      |-------------------------->|                            |
      |                           |   Content Id Acquisition   |
      |                           |<-------------------------->|
      |                   +----------------+                   |
      |                   |  Detection of  |                   |
      |                   |    content     |                   |(2)
      |                   |   repetition   |                   |
      |                   +----------------+                   |
      |       Content Data        |                            |
      |<--------------------------|                            |
      |                           |                            |

         Figure 7 Dynamic Content Acquisition (cache hit case)

   The steps illustrated in the figure are as follows:

   Steps 1 and 2 are exactly the same as steps 1 and 2 of Figure 6,
   except that this time the dCDN determines that this is a cache hit,
   because a record exists in the corresponding Content Identifier
   metadata.  This flow differs from the one in Figure 6 only in that
   dynamic content acquisition (step 3) is not triggered, since the
   content has already been cached by the dCDN.

4.3.3.  Content Removal

   The following flow illustrates how the dCDN removes the content
   resource under the control of the uCDN.

 +----------+                  +------+                     +------+
 | end-user |                  | dCDN |                     | uCDN |
 +----------+                  +------+                     +------+
      |                           |      Content Removal       |
      |                           |<---------------------------|
      |                   +----------------+                   |
      |                   |   Removal of   |                   |
      |                   |    content     |                   |(1)
      |                   +----------------+                   |
      |                           |                            |
      |                   +----------------+                   |
      |                   |   Removal of   |                   |
      |                   |   associated   |                   |
      |                   |    metadata    |                   |(2)
      |                   +----------------+                   |
      |                           |             OK             |
      |                           |--------------------------->|
      |                           |                            |

                      Figure 8 Removal of Content

   A premise is that the content copy to be removed has already been
   cached in the dCDN from the uCDN.  The steps illustrated in the
   figure are as follows:

   1.  The uCDN requests that the dCDN remove a content resource
       identified by the Content Identifier, because of the deployment
       policy or the expiration of the content's lifetime.  The dCDN
       then removes the content resource accordingly.  Note that if one
       dCDN connects to two uCDNs, only the uCDN that delivered the
       content item to the dCDN is authorized to send the removal
       command to delete the content.  For example, suppose that CDN A
       and CDN B are uCDNs of CDN C, and that either CDN A delivers a
       content item to CDN C via the pre-positioning procedure or CDN C
       acquires the content item from CDN A via the content acquisition
       procedure.  CDN A is then authorized to delete the content,
       whereas CDN B will not have re-sent the content to CDN C if the
       steps in the sections above are applied.

   2.  Once the content resource is removed, the dCDN updates the
       content status and removes the Content Identifier metadata.  It
       then replies to the uCDN with an OK response.

5.  Security Considerations

   To be added later.

6.  IANA Considerations

   This document has no IANA Considerations.

7.  Acknowledgments

   The authors would like to thank Francois Le Faucheur and Ramki
   Krishnam for valuable input and discussions.

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

8.2.  Informative References

   [I-D.ietf-cdni-framework]
              Peterson, L. and B. Davie, "Framework for CDN
              Interconnection", April 2012.

   [I-D.ietf-cdni-problem-statement]
              Niven-Jenkins, B., Faucheur, F., and N. Bitar, "Content
              Distribution Network Interconnection (CDNI) Problem
              Statement", May 2012.

   [I-D.ietf-cdni-requirements]
              Leung, K. and Y. Lee, "Content Distribution Network
              Interconnection (CDNI) Requirements", December 2011.

   [I-D.narten-iana-considerations-rfc2434bis]
              Narten, T. and H. Alvestrand, "Guidelines for Writing an
              IANA Considerations Section in RFCs",
              draft-narten-iana-considerations-rfc2434bis-09 (work in
              progress), March 2008.

   [RFC2629]  Rose, M., "Writing I-Ds and RFCs using XML", RFC 2629,
              June 1999.

   [RFC3552]  Rescorla, E. and B. Korver, "Guidelines for Writing RFC
              Text on Security Considerations", BCP 72, RFC 3552,
              July 2003.

Authors' Addresses

   WeiYi Jin
   ZTE Corporation
   Nanjing, 210012
   China

   Phone: +86 025-52871364
   Email: jin.weiyi@zte.com.cn


   Mian Li
   ZTE Corporation
   Nanjing, 210012
   China

   Phone: +86 025-88014641
   Email: li.mian@zte.com.cn


   Bhumip Khasnabish
   ZTE Corporation
   New Jersey, 07960
   USA

   Phone: +001-781-752-8003
   Email: bhumip.khasnabish@zteusa.com