Holistic Schema Mappings for XML-on-RDBMS

Please download to get full document.

View again

of 16
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Document Description
Holistic Schema Mappings for XML-on-RDBMS Priti Patil and Jayant R. Haritsa Database Systems Lab, SERC, Indian Institute of Science, Bangalore , INDIA Abstract. When hosting XML information on relational
Document Share
Document Transcript
Holistic Schema Mappings for XML-on-RDBMS Priti Patil and Jayant R. Haritsa Database Systems Lab, SERC, Indian Institute of Science, Bangalore , INDIA Abstract. When hosting XML information on relational backends, a mapping has to be established between the schemas of the information source and the target storage repositories. A rich body of recent literature exists for mapping isolated components of XML Schema to their relational counterparts, especially with regard to table configurations. In this paper, we present the Elixir system for designing industrialstrength mappings for real-world applications. Specifically, it produces an information-preserving holistic mapping that transforms the complete XML world-view (XML schema with constraints, XML documents XQuery queries including triggers and views) into a full-scale relational mapping (table definitions, integrity constraints, indices, triggers and views) that is tuned to the application workload. A key design feature of Elixir is that it performs all its mapping-related optimizations in the XML source space, rather than in the relational target space. Further, unlike the XML mapping tools of commercial database systems, which rely heavily on user inputs, Elixir takes a principled cost-based approach to automatically find an efficient relational mapping. A prototype of Elixir is operational and we quantitatively demonstrate its functionality and efficacy on a variety of real-life XML schemas. 1 Introduction For persistently storing information from XML sources, there are primarily two technological choices available: A specialized native XML store (e.g. Tamino [25], Natix [11], Timber [10]), or a standard relational engine (e.g. IBM DB2 [20], Oracle [24], MS-SQL Server [22]). From a pragmatic viewpoint, the latter approach brings with it the benefits of highly-functional, efficient and mature technology. Therefore, a rich body of literature has emerged in the last five years on the mechanics of hosting XML documents on relational backends. Specifically, there have been several proposals for generating efficient mappings between XML schema (e.g. DTDs [17] or XML Schema [29]) and relational schema. A common feature of much of this work is that it has focused on isolated components of the relational schema, typically the table configurations. However, viable XMLto-relational systems that intend to support real-world applications will need to provide an information-preserving holistic mapping that transforms the complete XML world-view (XML schema with constraints, XML documents, XQuery Currently with IBM India Software Lab. M.L. Lee, K.L. Tan, and V. Wuwongse (Eds.): DASFAA 2006, LNCS 3882, pp , c Springer-Verlag Berlin Heidelberg 2006 742 P. Patil and J.R. Haritsa queries including triggers and views) into a full-scale relational schema (table definitions, integrity constraints, indices, triggers and views). In this paper, we address this issue by presenting a system called ELIXIR (Establishing holistic schemas for XML In Rdbms) which produces such industrial-strength XMLto-RDBMS mappings. By taking a principled cost-based approach to mapping design, Elixir automatically delivers efficient mappings that are tuned to the XML application. This is in marked contrast to the XML mapping tools currently provided by commercial database systems, wherein the user is expected to play a significant role in the design and the tuning is largely manual. For example, in DB2 s XML Extender, the user needs to have intimate knowledge of the application to specify mapping of each XML node to either a table or a column using the Document Access Definition (DAD) medium [20]. A novel feature of Elixir is that it performs all its mapping-related optimizations in the XML source space, rather than in the relational target space. The evaluation of the quality of these optimizations is done at the target database engine, and the feedback is used to guide the optimization process in the XML space, in an iterative manner, resulting in a dynamically-derived mapping that is tuned to the application. This approach is based on our observation that an organic understanding of the XML source can result in more informed choices from a performance perspective as a case in point, making index choices at the XML source and then mapping them to relational equivalents proves to be substantially better than directly using the relational engine s index advisor, which is the current industrial practice [6]. An additional benefit of source-based index choices is that the knowledge can be used to guide the XQuery-to-SQL translation during query processing, consistent with the observation in [12] that schema decomposition and query translation are interdependent and should therefore be handled in an integrated manner. A related feature of Elixir is its integrated approach to producing efficient holistic schemas for example, the choice of indices is affected by the XML constraints. This integration ensures that all the interactions between the XML inputs and the effects of these inputs on the relational outputs are automatically taken into account during the optimization process. Currently, a prototype of Elixir is operational on the DB2 relational engine [20], and can be easily ported to any standard RDBMS. The prototype is implemented in Ocamlc (Objective Caml) [23], a strongly-typed functional programming language, and has been successfully evaluated on a variety of realworld and synthetic XML schemas [29] for representative XQuery [2] queries. To make our objectives concrete, a sample fragment of inputs from an XML banking application and a relational mapping derived from Elixir for these inputs is shown in Figure 1. To the best of our knowledge, Elixir is the first system to aim towards delivering industrial-strength mappings for XML-to-RDBMS. In the remainder of this paper, we describe its highlights the complete technical details are available in [14]. Holistic Schema Mappings for XML-on-RDBMS 743 XMLSchema xs:element name= country type= countrytype minoccurs= 0 maxoccurs= unbounded xs:key name= acc-num-key xs:selector xpath= .//account / xs:field xpath= ./sav-acc-num./check-acc-num / /xs:key xs:keyref name= cust-acc refer= acc-num-key xs:selector xpath= .//customer / xs:field xpath= ./acc-num / xs:keyref /xs:element XMLDocuments bank country name india /name customer cust-id 1 /cust-id acc-num 101 /acc-num /customer city account sav-acc-num 101 /sav-acc-num balance /balance /account /city /country /bank XML Query workload FOR $cust IN //customer FOR $acc IN //account WHERE ($cust/acc-num = $acc/sav-acc-num OR $cust/acc-num = $acc/check-acc-num) AND $cust/cust-id = 1000 return balance $acc/balance /balance # Frequency XQuery Triggers CREATE TRIGGER Increment-Counter AFTER INSERT OF //Customer CREATE TRIGGER NewCityTrigger AFTER INSERT OF /bank/country/city XMLViews CREATE VIEW imp cust AS FOR $cust IN //customer FOR $acc IN //account WHERE ($cust/acc-num = $acc/sav-acc-num OR $cust/acc-num = $acc/check-acc-num) AND $acc/balance return acc-num $cust/acc-num /acc-num balance $acc/balance /balance (a) Input Tables CREATE TABLE Customer (Cust-id-key INTEGER PRIMARY KEY, id INTEGER NOT NULL, name VARCHAR(25),); CREATE TABLE Account (Acc-id-key INTEGER PRIMARY KEY, ); Relational keys equivalent to XML keys ALTER TABLE Account ADD CONSTRAINT Acc-key UNIQUE (sav-or-check-acc-num, parent-country); ALTER TABLE Customer ADD CONSTRAINT Acc-fkey FOREIGN KEY (acc-num, parent-country) REFERENCES Account(sav-or-check-acc-num, parent-country); Recommended Indices CREATE INDEX name-index ON Customer(name); CREATE INDEX acc-num-index ON Account (sav-or-check-acc-num, parent-country); SQL Triggers CREATE TRIGGER Increment-Counter AFTER INSERT ON Customer REFERENCING NEW AS new_row FOR EACH ROW BEGIN ATOMIC UPDATE Branch-office SET Acc-counter = Acc-counter + 1 WHERE Branch-office.Id = new_row.branch END Stored Procedure CREATE PROCEDURE NewCityTrigger() BEGIN Send-mail(cust-name, city-name, ) END Relational views CREATE VIEW imp cust AS (SELECT C.acc-num, A.balance FROM Customer C, Account A WHERE C.acc-num = A.sav-or-check-acc-num AND A.balance 10000) Fig. 1. Example Elixir Mapping (b) Output 744 P. Patil and J.R. Haritsa 2 Architecture of Elixir System The overall architecture of the Elixir system is depicted in Figure 2. Given an XML schema, a set of documents valid under this schema, and the user query workload, the system first creates an equivalent canonical fully-normalized initial XML schema [9], corresponding to an extremely fine-grained relational mapping, and in the rest of the procedure attempts to design more efficient schemas by merging relations of this initial schema. Summary statistical information of the documents for the canonical schema is collected using the StatsCollector module. The estimated runtime cost of the XML workload, after translation to SQL, on this schema is determined by accessing the relational engine s query optimizer. Subsequently, the original XML schema is transformed in a variety of ways using various schema transformations, the relational runtime costs for each of these new schemas is evaluated, and the transformed schema with the lowest cost is identified. This whole process is repeated with the new XML schema, and the iteration continues until the cost cannot be improved with any of the transformed schemas. The choice of transformations is conditional on their adhering to the constraints specified in the XML schema, and this is ensured by the Translation Module. In each iteration, the Index Processor component selects the set of XML path-indices that fit within the disk space budget (measured with respect to the equivalent relational indices), and deliver the greatest reduction in the query runtime cost. These path indices are then converted to an equivalent set of XML Schema with keys XML Documents Disk Budget XQuery Workload XQuery Triggers XQuery Views Initial Schema Index Processor Additional XQuery Workload XML Trigger Processor XML View Processor Stats Collector Path Indices SQL Triggers Relational Views XML Data Statistics Schema Transformation Module transformed schema Translation Module Relational tables, keys, indexes, statistics and SQL Workload XQuery Rewriting Cost Relational Optimizer Efficient Relational configuration consisting of table, keys, indices, SQL triggers, Relational views Stored Procedures Fig. 2. Architecture of the Elixir system Holistic Schema Mappings for XML-on-RDBMS 745 relational indices. The XQuery queries are also rewritten to benefit from the path indices, with the query rewriting based on the concept of path equivalence classes [16] of XML Schema. The XML Trigger Processor is responsible for handling all XML triggers it maps each trigger to either an equivalent SQL trigger, or if it is not mappable (as discussed in Section 5), represents it with a stored procedure that can be called by the middleware at runtime. To account for the cost of the non-mappable triggers, queries equivalent to these triggers are added to the input query workload. Finally, the XML View Processor maps XML views and materialized XML views specified by the user to relational views and materialized query tables, respectively. To implement the above architecture, we have consciously attempted, wherever possible, to incorporate the ideas and systems previously presented in the literature. Specifically, for schema transformations, we leverage the LegoDB framework [3], with its associated FleXMap search tool [15] and StatiX [9] statistics tool; the Index Processor component is based on the XIST path-index selection technique [16]; and, the DB2 relational engine [20] is used as the backend. In the following sections, we discuss in detail the generation of the various components of the holistic relational schema, including Table Configurations, Key Constraints, Indices, Triggers and Views. 3 Generating Constraint-Preserving Relations XML Schema supports a rich set of integrity and cardinality constraints. The Translation Module takes an XML schema with such constraints as input and produces a constraint-preserving equivalent relational schema. For example, XML Schema supports three integrity constraints: unique, key and keyref, with similar semantics to their relational counterparts unique ensures no duplication among non-null values; key ensures all values are unique and non-null; and keyref ensures reference to XML nodes. Due to hierarchical data model of XML, context is also specified for integrity constraints to define the different sets of nodes to be distinguished. Using the syntax of [5], example constraints for the sample bank.xml document shown in Figure 3 are given below: acc-num-key: (//country,(.//account, {sav-acc-num check-acc-num})) Within a country (here country is a context ), each account is uniquely identified by a savings or checking account number. cust-acc: (//country,(./customer,{acc-num})) KEYREF acc-num-key Within a country, each customer refers to a savings or checking account number by acc-num. An obvious way of supporting XML constraints in an RDBMS is to use triggered procedures, but this is highly inefficient [8], and should therefore only be used for those constraints (such as cardinality constraints) that do not have a relational equivalent. Specifically, the XML key and keyref constraints should 746 P. Patil and J.R. Haritsa bank country name india /name customer cust-id 1 /cust-id acc-num 101 /acc-num /customer city name bangalore /name state karnataka /state head-office /head-office branch-office /branch-office atm /atm account sav-acc-num 101 /sav-acc-num balance /balance /account account check-acc-num 102 /check-acc-num balance 645634 /balance /account /city /country /bank Fig. 3. Sample XML Document (bank.xml) TABLE Account( Acc-id-key INT, sav-acc-num INT, check-acc-num INT, balance INT, parent-city INT) (a) Using LegoDB mapping TABLE Account( Acc-id-key INT, sav-or-check-acc-num INT, parent-country INT, acc-num-flag INT, balance INT, parent-city INT) (b) Inclusion of relational key Fig. 4. Generating relational keys for XML key acc-num-key be mapped to relational key and foreign-key constructs. We have developed a three-step algorithm for implementing this mapping this technique is superficially similar to the X2R storage mapping algorithm [7], but a crucial difference is that they tailor the schema to fit the key constraints, thereby risking efficiency, whereas we take the opposite approach of integrating the key constraints with an efficient schema. Specifically, Elixir starts by converting the XML schema into the schema tree representation proposed in FleXMap [15]. Then, in the first step, subtrees corresponding to different paths that need to be mapped to a single column are associated, with the need for association determined from the XML keys. For example, for acc-num-key, the subtrees corresponding to sav-acc-num and checkacc-num have to be associated. In the next step, the XML-to-relational mapping procedure proposed in [3] is extended to create table configurations in the presence of the associated trees. After mapping the XML schema to tables, the final step is to incorporate the relational keys that are equivalent to the original XML keys. An example output for the initial generic mapping of Figure 4(a)) is shown in Figure 4(b). Here, the elements sav-acc-num and check-acc-num are mapped to a single column sav-or-check-acc-num, and an additional column, acc-num-flag, is created for identifying the account number type. Further, since the context element for acc-num-key is country, which is not an immediate parent of Account, a parent-country column, which refers to country-id-key, is added to distinguish between different contexts. Similarly we can define an equivalent relational foreign key for the cust-acc XML keyref. Specifically, create the following relation: TABLE Customer (Cust-id-key INT, Cust-id INT, Name STRING, Address STRING, Acc-num STRING, parent-country INT) Holistic Schema Mappings for XML-on-RDBMS 747 * (Account) account, balance savacc-num checkacc-num (SAccount) account, * account (CAccount) savacc-num balance checkacc-num, balance Fig. 5. Invalid union distribution due to acc-num-key constraint where the foreign key is {Acc-num, parent-country}, referring to the key attribute pair of the Account relation. Cost-based strategies, such as those proposed in [3], explore the optimization space by applying various transformations to the XML schema (which exploit the standard rules of regular expressions in XML Schema for unions and repetitions), and evaluating the costs of the corresponding relational configurations. Elixir restricts the mapping search space to only constraint-valid schema trees by filtering out the invalid schema transformations. For example, consider the union account = sav-acc-num check-acc-num shown before and after distribution in Figure 5. The corresponding relational configuration will have account numbersstoredintworelationsasfollows: TABLE SAccount(SAcc-id-key INT, sav-acc-num INT, balance INT, parent-city INT) TABLE CAccount(CAcc-id-key INT, check-acc-num INT, balance INT, parent-city INT) Here our goal is to map the XML key and keyref in the form of primary key and foreign key, respectively. However, according to the acc-num-key constraint, sav-acc-num and check-acc-num should be mapped to a single column, in order to define the relational key, thereby rendering the union distribution invalid. This example shows that not all relational configurations obtained by schema transformations are valid. Thus, while exploring the search space of relational configurations, we should explore only the space of valid configurations. The simple solution for this is to carry out the transformation on the schema tree and then check if relational keys equivalent to the given XML constraints can be defined on the resulting relational configuration. If it is not possible then that relational configuration can be ignored, otherwise it should be evaluated for the given query workload. However, this solution results in considerable unnecessary work, which can be avoided if we can detect the invalidity schema transformations before carrying out the schema transformation. For example, assume that union t 1 t 2 is being distributed, where t 1 and t 2 are subtrees of the schema tree. Now we will try to analyze the cause for invalidation. Note that both the subtrees, corresponding to sav-acc-num (t 1 )andcheck-acc-num (t 2 ), are on the same field path of the acc-num-key constraint. Thus, if the union distribution of this tree i.e. t 1 t 2 is distributed, then in the resulting configuration, t 1 and t 2 will be mapped to different relations. In general, if subtrees t 1 and t 2 are both on the same field path, then union distribution of t 1 t 2 748 P. Patil and J.R. Haritsa is invalid. The complete set of rules to detect when schema transformations are invalid w.r.t. XML schema constraints is given in [14]. A useful side-effect of incorporating the constraints during the schema design process is that the mapping process completes faster due to the reduction of the optimization search space. 4 Index Selection in Elixir We move on in this section to a different component of the holistic mapping, namely deciding on the best choice of relational indices, given a disk space budget. As mentioned earlier, Elixir takes the approach of finding a good set of indices in the XML space and then mapping them to equivalent indices in the relational space. This is in marked contrast to current industrial practice [6], where the index advisor of the relational engine is used to propose a good set of indices after the schema mapping has been carried out. For finding good XML indices, we leverage the recentl
Search Related
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks