Website Parse Templates Working Group                       Av. Manukyan
Internet-Draft                                              Ar. Manukyan
Intended status: Informational                               Ar. Mailyan
Expires: October 17, 2008                                   Al. Sayadyan
                                                       WebsiteParser.com
                                                          April 15, 2008

              Internet Content Description Language (ICDL)
                         Website Parse Templates
             draft-manukyan-icdl-website-parse-templates-00

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on October 17, 2008.

Copyright Notice

Copyright (C) The IETF Trust (2008).

Abstract

This document defines general concepts and terminology for the Internet Content Description Language (ICDL), which is used to provide web crawlers with information on web site structure and content.

Manukyan, et al.        Expires October 17, 2008                [Page 1]
Internet-Draft        ICDL Website Parse Templates            April 2008

Table of Contents

   1.  Introduction
   2.  ICDL File Structure
       2.1.  Ontology
       2.2.  Template
       2.3.  URLs
   3.  Security Considerations
   4.  IANA Considerations
   5.  References
       5.1.  Normative References
       5.2.  Informative References
   Authors' Addresses
   Full Copyright Statement
   Disclaimer of Validity
   Intellectual Property

1. Introduction

ICDL is a specification for describing web site structure and content to web crawlers. It provides crawlers with web page templates that let them parse web site content more accurately and co-ordinate the attributes of the same object when they are used on different pages of the same web site. A web page structure and content description based on the ICDL standard is referred to as an ICDL file (with the ".icdl" extension).

ICDL is compatible with existing Semantic Web [3] concepts defined by the World Wide Web Consortium [4] (Resource Description Framework, RDF [1], and Web Ontology Language, OWL [5]) and with the Universal Networking Language (UNL) [6] specifications.

2. ICDL File Structure

An ICDL file consists of the following sections:

   Ontology - enumeration and definition of all concepts used in a given web site;

   Templates - associations between the web site's structural elements and the defined concepts;

   URLs - URLs or URL patterns that comply with the described templates.

A single ICDL file refers to a single host, while a single host may have several ICDL files describing its structure. The host should be specified at the beginning of the corresponding ICDL file:

   [ICDL example not preserved]
2.1. Ontology

The ontology section defines all concepts used in the web site. Defined concepts must be enclosed within tags. Both the ontology name and the language used to describe the concepts must be specified. For example, the concept "artist", which inherits from "person" and may have the attributes "name", "image", "bio", "track", "video", etc., has the following ICDL representation:

   [ICDL example not preserved]

The ontology name can be any comprehensible string, but the language must be one of the supported language types, e.g. "icdl:ontology", "owl", "unl:uws". The "inherit" tag expresses an inheritance relation between two concepts; the "has" tag expresses an attribute relation. Each defined concept has a default attribute, the object identifier (id), which web crawlers use to co-ordinate the same object's attributes across different pages of the same web site.

The ICDL standard provides the following predefined concepts, which are common to all kinds of web sites:

   "Menu" - navigation bar/menu

   "Logo" - design element/logo

   "Content" - element that contains the main textual content of the page

   "Advertisement" - advertisement/banner

   "External Link" - element that contains external links

2.2. Template

An ICDL template describes the web site's HTML structure and content using the concepts defined in the ontology section. The template should be enclosed within tags. E.g.
the artist web page may have a certain structure that is described by the following simple ICDL template:

   [ICDL example not preserved]

The template name can be any comprehensible string, but the language must be one of the supported language types, e.g. "icdl:template", "rdf", "unl:expression". HTML structure elements represented in ICDL templates are identified by "xpaths" [2] or "tag IDs". A web page may contain structured repeatable content (a "repeatable block") included in one main structural element (a "container"). If a specified structural element is already described by another template, the "reference" tag can be used to point to that "template block" as follows:

   [ICDL example not preserved]

This makes it possible to create hierarchical relations between ICDL templates, so that web crawlers can use the specified reference(s) to identify the same object on different pages of a given web site.

2.3. URLs

This section defines the URLs or URL patterns that correspond to the described ICDL templates. Listed URLs/URL patterns should be enclosed within tags as follows:

   [ICDL example not preserved]

"template" indicates which of the described ICDL templates corresponds to the listed URLs/URL patterns. Regular expressions [7] are used to describe URL patterns. The concepts needed for a URL pattern definition (such as "name" and "id") must be defined beforehand in the ontology section.

3. Security Considerations

The syntax rules for ICDL file creation that are documented here pose no direct risk to computers and networks.
However, people can use these rules to build web site parse templates that are inaccurate or even deliberately misleading, which results in incorrect parsing of the web site's structured content. Systems that are based on ICDL website parse templates need to consider issues related to their accuracy and validity as part of their design and implementation, and users of such systems need to consider the design and implementation assumptions.

4. IANA Considerations

This document has no actions for IANA.

5. References

5.1. Normative References

   [1] A. Swartz, "application/rdf+xml Media Type Registration", RFC 3870, September 2004

   [2] J. Boyer, M. Hughes and J. Reagle, "XML-Signature XPath Filter 2.0", RFC 3653, December 2003

5.2. Informative References

   [3] I. Herman, "W3C Semantic Web Activity", http://www.w3.org/2001/sw/, April 2008

   [4] I. Jacobs, "About the World Wide Web Consortium (W3C)", http://www.w3.org/Consortium/, February 2008

   [5] G. Schreiber, M. Dean, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness, P. F. Patel-Schneider and L. A. Stein, "OWL Web Ontology Language Reference/W3C Recommendation", http://www.w3.org/TR/owl-ref/, February 2004

   [6] "Introduction of the UNL", http://www.unl.ru/introduction.html

   [7] J. Goyvaerts, "Regular Expression Tutorial", http://www.regular-expressions.info/index.html, November 2007

Authors' Addresses

   Avet Manukyan
   WebsiteParser.com
   Email: contact@websiteparser.com
   URI: http://www.websiteparser.com

   Armen Manukyan
   WebsiteParser.com
   Email: contact@websiteparser.com

   Arthur Mayilyan
   WebsiteParser.com
   Email: contact@websiteparser.com

   Alexander Sayadyan
   WebsiteParser.com
   Email: contact@websiteparser.com

Full Copyright Statement

Copyright (C) The IETF Trust (2007).
This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

This document may not be modified, and derivative works of it may not be created, except to publish it as an RFC and to translate it into languages other than English.

Disclaimer of Validity

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.
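
As an informal illustration of Section 2.3, the following minimal Python sketch shows one way a crawler might match a URL path against regular-expression URL patterns to decide which template applies. It is not ICDL syntax and is not taken from this document: the pattern strings, the template names "artist_page" and "track_page", and the helper resolve_template are all hypothetical. The named groups "name" and "id" only mirror the concepts that Section 2.3 mentions as examples.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical (pattern, template name) pairs. Neither the patterns nor the
# template names are defined by the ICDL draft; they are assumptions made
# purely for illustration.
URL_PATTERNS = [
    (re.compile(r"^/artist/(?P<name>[a-z0-9-]+)/(?P<id>\d+)$"), "artist_page"),
    (re.compile(r"^/track/(?P<id>\d+)$"), "track_page"),
]


def resolve_template(path: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (template name, captured concept values) for the first
    pattern that matches the path, or None if no pattern matches."""
    for pattern, template in URL_PATTERNS:
        match = pattern.match(path)
        if match:
            # groupdict() maps each named group (e.g. "name", "id") to the
            # text it captured from the URL path.
            return template, match.groupdict()
    return None


if __name__ == "__main__":
    print(resolve_template("/artist/some-band/42"))
    print(resolve_template("/about"))
```

In this sketch the captured "id" value could serve as the object identifier that lets a crawler relate the same object across pages; whether an implementation derives the identifier from the URL in this way is an assumption, not something this document specifies.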