AnCora-Verb: A Lexical Resource for the Semantic Annotation of Corpora

Juan Aparicio, Mariona Taulé, Ma. Antònia Martí
CLiC, Centre de Llenguatge i Computació - University of Barcelona
Gran Via de les Corts Catalanes 585, 08007 Barcelona
{juanaparicio, mtaule, amarti}

Abstract

In this paper we present two large-scale verbal lexicons, AnCora-Verb-Ca for Catalan and AnCora-Verb-Es for Spanish, which are the basis for the semantic annotation of the AnCora corpora with arguments and thematic roles. In the AnCora-Verb lexicons, the mapping between syntactic functions, arguments and thematic roles of each verbal predicate is established taking into account the verbal semantic class and the diathesis alternations in which the predicate can participate. Each verbal predicate is related to one or more semantic classes, which are basically differentiated according to the four event classes (accomplishments, achievements, states and activities) and to the diathesis alternations in which a verb can occur. AnCora-Verb-Es contains a total of 1965 different verbs corresponding to 3671 senses, and AnCora-Verb-Ca contains 2151 verbs and 4513 senses. These figures correspond to the total of 500,000 words contained in each corpus, AnCora-Ca and AnCora-Es. The lexicons and the annotated corpora constitute the richest linguistic resources of this kind freely available for Spanish and Catalan. The large amount of linguistic information contained in both resources should be of great interest for computational applications and linguistic studies. Currently, a consulting interface for these lexicons is available at ( ).

1. Introduction

In this paper we present two large-scale verbal lexicons, AnCora-Verb-Ca for Catalan and AnCora-Verb-Es for Spanish, which are the basis for the semantic annotation of the AnCora corpora with arguments and thematic roles.
At present, AnCora (Martí et al., 2008; Taulé et al., 2008) is the largest multilevel annotated corpus of Spanish and Catalan, consisting of 500,000 words for each language, mostly from newspaper articles. AnCora is annotated with morphological (PoS), syntactic (constituents and functions) and semantic (argument structure and thematic roles, semantic class, named entities and WordNet senses) information.

In the AnCora-Verb lexicons, the mapping between syntactic functions, arguments and thematic roles of each verbal predicate is established taking into account the verbal semantic class and the diathesis alternations in which the predicate can participate. Each verbal predicate is related to one or more semantic classes, depending on its senses.

The main goal of this paper is to present the content of these lexicons and their resulting projection onto the AnCora corpora (section 2). A quantitative analysis of the data is also presented (section 3). Finally, the main conclusions are drawn in section 4.

2. AnCora-Verb Lexicons

The AnCora-Verb lexicons were obtained by deriving, for each sense of each verb, all the syntactic schemata in which the verbal predicate appears in the AnCora corpora (Taulé et al., 2008). From this information, the mapping from syntactic functions to thematic roles, and the corresponding argument position, was encoded in the lexicons entirely by hand. The semantic properties used in the characterization of predicates are based on the lexical decomposition proposal of Rappaport-Hovav & Levin (1998), from which the concept of Lexical Semantic Structure (LSS) has been taken. For the characterization of the argument structure, we follow the PropBank annotation system (Palmer et al., 2005) [1].
In this respect, we follow the lines laid down by Kingsbury et al. (2002) in the construction of VerbNet.

In the AnCora-Verb lexicons, each predicate is related to one or more semantic classes (LSS), depending on its senses. These classes are basically differentiated according to the four event classes, namely accomplishments (A), achievements (B), states (C) and activities (D), and to the diathesis alternations in which a verb can occur.

Figure 1 shows the full information associated with the entry reforzar ‘to reinforce’: the lemma (reforzar), the different senses with their corresponding semantic classes (in this case LSS1.1 and LSS2.2), the mapping between syntactic function and thematic role (for instance, SUJ Arg0##CAU), and the diathesis alternations in which the verb occurs (in this case, ANTICAUSATIVA ‘inchoative’). As we can observe, the expression of the causative-inchoative alternation entails an argument crossing: the affected object appears as direct object in the causative structure (CD Arg1##TEM) and as subject in the inchoative structure (SUJ Arg1##TEM). Furthermore, the expression of this alternation also involves an aspectual change, since the causative reading corresponds to an accomplishment (LSS1.1) and the inchoative reading to an achievement (LSS2.2). Finally, examples are also included.

[1] The arguments selected by the verb are numbered incrementally (Arg0, Arg1, Arg2, Arg3, Arg4), expressing their degree of proximity to the predicate. Adjuncts are labelled ArgM. The list of thematic roles consists of 20 different labels: AGT (Agent), AGI (Induced Agent), CAU (Cause), EXP (Experiencer), SRC (Source), PAT (Patient), TEM (Theme), ATR (Attribute), BEN (Beneficiary), EXT (Extension), INS (Instrument), LOC (Locative), TMP (Time), MNR (Manner), ORI (Origin), DES (Goal), FIN (Purpose), EIN (Initial State), EFI (Final State) and ADV (Adverbial).
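The entry structure described above (a lemma, a base frame with its semantic class and function-to-role mapping, and one frame per diathesis alternation) can be sketched as a small data structure. This is only an illustrative Python sketch: the class names `Frame` and `Entry`, the field names and the helper `parse_label` are assumptions made for this example and do not reflect the actual file format of the AnCora-Verb lexicons.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


def parse_label(label: str) -> Tuple[str, str]:
    """Split a label such as 'Arg0##CAU' into (argument, thematic role)."""
    argument, role = label.split("##")
    return argument, role


@dataclass
class Frame:
    """One syntactic-semantic frame of a verb sense (hypothetical layout)."""
    lss: str                  # semantic class, e.g. "LSS1.1"
    event_class: str          # class code, e.g. "A1"
    mapping: Dict[str, str]   # syntactic function -> "ArgN##ROLE"

    def function_of(self, role: str) -> Optional[str]:
        """Return the syntactic function that realizes a thematic role."""
        for function, label in self.mapping.items():
            if parse_label(label)[1] == role:
                return function
        return None


@dataclass
class Entry:
    """A lexical entry: lemma, sense, base frame and alternation frames."""
    lemma: str
    sense: str
    base: Frame
    alternations: Dict[str, Frame] = field(default_factory=dict)


# The entry for reforzar, using only values given in the text and Figure 1.
reforzar = Entry(
    lemma="reforzar",
    sense="01",
    base=Frame("LSS1.1", "A1", {"SUJ": "Arg0##CAU", "CD": "Arg1##TEM"}),
    alternations={
        "ANTICAUSATIVA": Frame("LSS2.2", "B2", {"SUJ": "Arg1##TEM"}),
    },
)

# The argument crossing: the theme moves from direct object to subject.
print(reforzar.base.function_of("TEM"))
print(reforzar.alternations["ANTICAUSATIVA"].function_of("TEM"))
```

Keeping the alternation as a separate frame makes the change of semantic class (LSS1.1 to LSS2.2) explicit, which mirrors the aspectual change discussed in the text.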
reforzar - 01
  LSS1.1 (A1)
    SUJ Arg0##CAU
    CD Arg1##TEM
    EX: “La subida en dos décimas de la tasa de paro reforzó la tendencia al alza” [2]
  +ANTICAUSATIVA
  LSS2.2 (B2)
    SUJ Arg1##TEM
    EX: “Si dos neuronas se activan, sus conexiones se refuerzan” [3]

Figure 1: Lexical entry of reforzar ‘to reinforce’ in AnCora-Verb-Es

[2] ‘The rise of two tenths in the unemployment rate reinforced the upward trend.’
[3] ‘If two neurons are activated, their connections are reinforced.’

In order to guarantee coherence and quality, and to ensure the correct mapping between arguments, thematic roles, syntactic functions and LSS, inter-annotator agreement tests were carried out while the verbal lexicons were being built. After a first proposal of verb classes and their corresponding arguments and theta-roles, a group of seven trained linguists elaborated a subset of 30 verbal entries. The resulting entries were compared, the disagreements discussed and the verb classes modified when necessary. This process was applied to several subsets of 30 verbs until no relevant disagreements arose.

Disagreements were mainly due to differences in class assignment (LSS), and therefore also in thematic role assignment. For example, in Spanish, a verb in a passive (‘pasiva refleja’) or inchoative (‘anticausativa’) construction can appear with the pronoun se, and it is not always easy to decide which of the two is the correct interpretation, even though the consequences are very different. If we opt for the passive reading, the Arg0 is an Agent, whereas if we choose the inchoative reading the Arg0 is a Causer. The identification of multiwords, for instance the treatment of light verbs, is also especially problematic, mainly when it is necessary to decide whether a given structure corresponds to a verb and its complements or to an idiom (tener + ganas vs. tener_ganas, ‘to need’ or ‘to want’).

Next we present the 13 semantic classes that have been used for the characterization of verbal predicates:

Accomplishments (A)

A1: Transitive-causative class
LSS1.1 [x CAUSE [BECOME [y <STATE>]]]
Arg0##CAU
Arg1##TEM
Diatheses: [+Inchoative] [+Resultative]
Spanish verbs: abrir ‘to open’, causar ‘to cause’, cerrar ‘to close’, romper ‘to break’…
Catalan verbs: afectar ‘to affect’, convertir ‘to turn into’, omplir ‘to fill’…

A2: Transitive-agentive class
LSS1.2 [[x DO-SOMETHING] CAUSE [BECOME [y <STATE>]]]
Arg0##AGT
Arg1##PAT
Diatheses: [+Passive]
Spanish verbs: comer ‘to eat’, escribir ‘to write’…
Catalan verbs: afirmar ‘to affirm’, llegir ‘to read’…

A3.1: Ditransitive-agentive locative class
LSS1.3.1 [[x DO-SOMETHING] CAUSE [BECOME [y <PLACE> z]]]
Arg0##AGT
Arg1##PAT
Arg2##LOC
Diatheses: [+Passive]
Spanish verbs: colocar ‘to place’, dejar ‘to leave’…
Catalan verbs: moure ‘to move’, posar ‘to put’…

A3.2: Ditransitive-agentive beneficiary class
LSS1.3.2 [[x DO-SOMETHING] CAUSE [BECOME [y <PLACE> z]]]
Arg0##AGT
Arg1##PAT
Arg2##BEN
Diatheses: [+Passive]
Spanish verbs: dar ‘to give’, decir ‘to tell’…
Catalan verbs: enviar ‘to send’, vendre ‘to sell’…

Achievements (B)

B1: Unaccusative-motion class
LSS2.1 [BECOME [y <PLACE>]]
Arg1##TEM/PAT
Spanish verbs: llegar ‘to arrive’, salir ‘to go_out’…
Catalan verbs: entrar ‘to go_in’, venir ‘to come’…

B2: Unaccusative-state class
LSS2.2 [BECOME [y <STATE>]]
Arg1##TEM/PAT
Arg2##EFI
Spanish verbs: crecer ‘to grow’, florecer ‘to bloom’…
Catalan verbs: enfonsar-se ‘to collapse’…

States (C)

C1: Existence-state class
LSS3.1 [x <STATE> y]
Arg1##TEM
Arg2##LOC
Spanish verbs: estar ‘to be’, existir ‘to exist’…
Catalan verbs: haver-hi ‘there_is/are’…

C2: Attributive-state class
LSS3.2 [x <STATE> y]
Arg1##TEM
Arg2##ATR
Spanish verbs: ser ‘to be’, tener ‘to have’…
Catalan verbs: estar ‘to be’, tenir ‘to have’…

C3: Scalar-state class
LSS3.3 [x <STATE> y]
Arg1##TEM
Arg2##EXT
Spanish verbs: medir ‘to measure’, pesar ‘to weigh’…
Catalan verbs: costar ‘to cost’, durar ‘to last’…

C4: Beneficiary-state class
LSS3.4 [x <STATE> y]
Arg1##TEM
Arg2##BEN/EXP
Spanish verbs: gustar ‘to like’, parecer ‘to seem’…
Catalan verbs: agradar ‘to like’, preocupar ‘to worry’…

Activities (D)

D1: Agentive-inergative class
LSS4.1 [x ACT <MANNER/INSTRUMENT>]
Arg0##AGT
Spanish verbs: caminar ‘to walk’, nadar ‘to swim’…
Catalan verbs: jugar ‘to play’, navegar ‘to sail’…

D2: Experiencer-inergative class
LSS4.2 [x ACT <MANNER/INSTRUMENT>]
Arg0##EXP
Spanish verbs: dormir ‘to sleep’, soñar ‘to dream’…
Catalan verbs: respirar ‘to breathe’…

D3: Source-inergative class
LSS4.3 [x ACT <MANNER/INSTRUMENT>]
Arg0##SRC
Spanish verbs: roncar ‘to snore’, sudar ‘to sweat’…
Catalan verbs: cridar ‘to shout’, plorar ‘to cry’…

2.1. Automatic Annotation

The AnCora-Verb lexicons were used for the semiautomatic tagging of the AnCora corpora with arguments, thematic roles and semantic classes. A set of manually written rules automatically mapped part of the information declared in these lexicons onto the syntactic structure (Martí et al., 2007). We defined three different types of rules, according to the kind of information they are based on:

a) Rules based on a specific function or morphosyntactic property. For example, if the predicate is associated with the verbal morpheme ‘PASS’ (passive voice), then its subject has the argument position Arg1 and the thematic role patient (SUJ-Arg1-PAT).
b) Rules based on the semantic properties of the predicates.
For instance, when predicates are monosemic, the mapping between syntactic function, argument and thematic role, as well as the assignment of the semantic class, is carried out directly. In the case of polysemic verbs, the mapping can be partial because only the unambiguous information is assigned automatically.
c) Rules based on the type of adverb or prepositional multiword appearing in a specific constituent. For instance, if the prepositional multiword a_causa_de (‘because_of’) or the adverb aún (‘still’, ‘yet’) in Spanish appears in an adverbial complement (function = CC), then it is automatically assigned the argument and thematic role ArgM-CAU (an adjunct argument with the thematic role cause) or ArgM-TMP (an adjunct argument with the thematic role temporal), respectively.

We applied these rules in decreasing order of generality, that is, we first applied the more general rules of type a), then the type c) rules and, finally, the type b) rules. In the automatic annotation process we obtained either full annotations, containing information about both the arguments and the thematic roles, or partial annotations with only arguments or only thematic roles. This procedure automatically annotates 60% [4] of the expected arguments and thematic roles with a fairly low error rate (below 2%) (Martí et al., 2007). Given the high quality of the results obtained, we claim that this methodology is well suited to the semi-automatic approach to corpus annotation and saves a significant amount of manual effort. Afterwards, we manually completed the thematic role annotation in order to guarantee the accuracy required for the final resource. The Catalan corpus, AnCora-Ca, is already complete for its 500,000 words, while the manual semantic checking so far covers 100,000 words of the Spanish corpus, AnCora-Es. The Spanish corpus will be completed at the end of this year.

[4] Of which 94% corresponds to full annotations and 6% to partial annotations.

3. Quantitative Analysis of Data

The Spanish lexicon, AnCora-Verb-Es, contains a total of 1965 different verbs (corresponding to 3671 senses) and the Catalan lexicon, AnCora-Verb-Ca, contains 2151 verbs (corresponding to 4513 senses).

Table 1 shows the distribution of these verb senses across semantic classes for both languages. The average number of senses per lemma is 1.86 for Spanish and 2.09 for Catalan.

Table 1 shows that the semantic class with the highest number of different verbs, in both languages, is by far the transitive-agentive class (A2), followed by the unaccusative-state class (B2) and the transitive-causative class (A1). It should be noted that the B2 class also includes the passive or inchoative constructions that derive from other classes (A1, A2 and A3) as the result of a diathesis alternation. For instance, the passive alternation of the verbal predicate verificar ‘to verify’ (from the A2 semantic class) is annotated as B2 (see Figure 2). The expression of most alternations entails an aspectual change, which necessarily implies a change of semantic class.
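The average number of senses per lemma follows directly from the lexicon sizes, so it can be recomputed as a quick check. The minimal Python sketch below does this; the names `LEXICON_SIZES`, `senses_per_lemma` and `truncate` are illustrative assumptions, and the sketch assumes that the published averages were truncated to two decimals instead of rounded.

```python
import math

# Lexicon sizes as reported in the text: (verbs, senses).
LEXICON_SIZES = {
    "AnCora-Verb-Es": (1965, 3671),
    "AnCora-Verb-Ca": (2151, 4513),
}


def senses_per_lemma(verbs: int, senses: int) -> float:
    """Average number of senses per verbal lemma."""
    return senses / verbs


def truncate(value: float, decimals: int = 2) -> float:
    """Cut a value to a fixed number of decimals without rounding."""
    factor = 10 ** decimals
    return math.floor(value * factor) / factor


for name, (verbs, senses) in LEXICON_SIZES.items():
    average = senses_per_lemma(verbs, senses)
    print(f"{name}: {average:.4f} (truncated: {truncate(average)})")
```

Under the truncation assumption the sketch reproduces the reported 1.86 and 2.09; conventional rounding would give 1.87 and 2.10 instead.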
[Bar chart: number of verbs in each semantic class (A1, A2, A31, A32, B1, B2, C1, C2, C3, C4, D1, D2, D3) for Catalan and Spanish]

Table 1: Verbs associated to each semantic class

verificar - 01
  LSS1.2 (A2)
    SUJ Arg0##AGT
    CD Arg1##PAT
    EX: “(…) verificar la responsabilitat dels actuals directius” [5]
  +PASSIVA
  LSS2.2 (B2)
    SUJ Arg1##PAT
    EX: “(…) que es verifiqui l’honradesa dels càrrecs publics” [6]

Figure 2: Lexical entry of verificar ‘to verify’ in AnCora-Verb-Ca

[5] ‘(…) verifying the responsibility of the current managers’
[6] ‘(…) that the honesty of public office holders is verified’

Next we present the figures corresponding to the projection of the AnCora-Verb-Es and AnCora-Verb-Ca lexicons onto the AnCora-Es and AnCora-Ca corpora respectively (see Table 2 and Table 3).

For the quantitative analysis of the data we have taken into account the 500,000 fully annotated words for Catalan and a subset of 100,000 words for Spanish. These figures correspond to the total amount of semantically annotated data that has been manually checked.

The Spanish subset comprises a total of 11,061 verbal tokens, corresponding to 2613 senses (Table 2). The Catalan subset comprises a total of 48,319 verbal tokens, corresponding to 4102 senses (Table 3). In this context, by senses we mean the number of different lemmata associated with each verbal class.

Spanish Semantic Classes
LSS     Tokens    Lemmata (senses)    %
A1      485       192                 4.38
A2      3833      886                 34.65
A31     210       46                  1.90
A32     755       82                  6.82
B1      681       184                 6.16
B2      1406      756                 12.71
C1      736       160                 6.65
C2      2299      110                 20.78
C3      4         1                   0.04
C4      195       33                  1.76
D1      428       149                 3.87
D2      15        5                   0.14
D3      14        9                   0.13
Total   11,061    2613                100

Table 2: Figures corresponding to a corpus sample of 100,000 words fully annotated

Even though the number of annotated words is higher for Catalan than for Spanish, if we compare the data in terms of percentages the results are parallel.
In both languages the verbal classes with the highest number of occurrences are: the transitive-agentive class (A2), with 3833 in Spanish and 8165 in Catalan; the attributive-state class (C2), with 2299 in Spanish and 8117 in Catalan; and the unaccusative-state class (B2), with 1406 in Spanish and 7631 in Catalan. Together they account for 68.14% and 70.18% of the total verbal occurrences annotated in AnCora-Es and AnCora-Ca respectively. If we take into account the number of different lemmata associated with each of these three classes, we can observe that A2 and B2 [7] present by far the highest numbers of different types (A2: 886 for Spanish and 1311 for Catalan; B2: 756 and 1496 respectively), while the C2 class has the lowest number (only 110 for Spanish and 115 for Catalan). This means that the A2 and B2 classes are more sparsely distributed than C2.

Catalan Semantic Classes
LSS     Tokens    Lemmata (senses)    %
A1      1826      311                 3.78
A2      8165      1311                37.59
A31     1001      77                  2.07
A32     4129      140                 8.55
B1      2368      273                 4.90
B2      7631      1496                15.79
C1      3208      182                 6.64
C2      8117      115                 16.80
C3      139       18                  0.29
C4      561       30                  1.16
D1      1143      138                 2.37
D2      20        5                   0.04
D3      11        6                   0.02
Total   48,319    4,102               100

Table 3: Figures corresponding to 500,000 words fully annotated

Verbs belonging to the A32, C1, B1, A1 and D1 semantic classes represent slightly more than 25% of the total verbal predicates occurring in the corpora: 27.88% for Spanish and 26.24% for Catalan. The remaining verbal classes (A31, C4, C3, D2 and D3) represent 3.97% and 3.58% of the total verbal occurrences in the AnCora-Es and AnCora-Ca corpora respectively.

In order to get more information about how verbal predicates are distributed in each semantic class, we have obtained the frequency of the 10 most frequent lemmata for each class and their percentage with respect to the class total (see Table 4 for Spanish and Table 5 for Catalan). Notice that despite the difference in corpus size, the percentages overlap to a great extent.
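The percentage columns of Tables 2 and 3 can be recomputed from the token counts, which gives a simple internal consistency check. The Python sketch below is only an illustration: the values are transcribed from the two tables as printed here, and the names `token_sum` and `mismatches`, as well as the tolerance of 0.02 percentage points, are assumptions made for this example.

```python
# (tokens, printed percentage) per class, transcribed from Tables 2 and 3.
SPANISH = {
    "A1": (485, 4.38), "A2": (3833, 34.65), "A31": (210, 1.90),
    "A32": (755, 6.82), "B1": (681, 6.16), "B2": (1406, 12.71),
    "C1": (736, 6.65), "C2": (2299, 20.78), "C3": (4, 0.04),
    "C4": (195, 1.76), "D1": (428, 3.87), "D2": (15, 0.14),
    "D3": (14, 0.13),
}
CATALAN = {
    "A1": (1826, 3.78), "A2": (8165, 37.59), "A31": (1001, 2.07),
    "A32": (4129, 8.55), "B1": (2368, 4.90), "B2": (7631, 15.79),
    "C1": (3208, 6.64), "C2": (8117, 16.80), "C3": (139, 0.29),
    "C4": (561, 1.16), "D1": (1143, 2.37), "D2": (20, 0.04),
    "D3": (11, 0.02),
}
PRINTED_TOTALS = {"Spanish": 11061, "Catalan": 48319}


def token_sum(table):
    """Sum of the token counts over all classes of a table."""
    return sum(tokens for tokens, _ in table.values())


def mismatches(table, total, tolerance=0.02):
    """Classes whose printed percentage differs from 100 * tokens / total."""
    return [
        label
        for label, (tokens, printed) in table.items()
        if abs(100 * tokens / total - printed) > tolerance
    ]


for name, table in (("Spanish", SPANISH), ("Catalan", CATALAN)):
    total = PRINTED_TOTALS[name]
    print(name, "sum of tokens:", token_sum(table), "printed total:", total)
    print(name, "classes to re-check:", mismatches(table, total))
```

With the values as transcribed here, the Spanish table is internally consistent, whereas the Catalan A2 row is flagged and the Catalan token counts add up to 38,319, which is 10,000 short of the printed total. A count of 18,165 for A2 would reconcile both the total and the 37.59% share, but this is an inference from the arithmetic and not a figure stated in the source.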
However, this overlap does not hold for all verbs.

Tables 4 and 5 show that, for example, in the attributive-state class (C2) and the beneficiary-state class (C4), the 10 most frequent lemmata represent 83.9% and 75.89% of the total verbal tokens of these classes for Spanish, and 88.94% (C2) and 92.8% (C4) for Catalan. The same subset in the scalar-state class (C3) in Catalan also covers 92.8% of the total class tokens. Therefore, the state classes have few verbal types, but these occur very frequently in both corpora. In fact, the verb ser (‘to be’) is the one with the highest frequency in both languages. At the opposite extreme we find the unaccusative-state class (B2), in which the 10 most frequent lemmata represent only 8.8% for Spanish and 14.34% for Catalan.

[7] Note that the passive constructions are also included in this class.

LSS    Spanish verbs [8]    %
A1     provocar:18; convertir:15; poner:14; hacer:13; abrir:13; mejorar:13; reducir:12; afectar:12; quemar:11; aumentar:10    27.0
A2     hacer:122; ver:98; saber:93; querer:76; hablar:42; creer:41; lograr:39; mantener:38; intentar:38; pensar:37    16.2
A31    poner:31; señalar:23; alcanzar:19; sacar:13; colocar:9; llevar:7; introducir:6; contemplar:6; situar:6; tirar:6    60.0
A32    decir:205; dar:50; permitir:37; pedir:28; explicar:28; presentar:24; asegurar:23; afirmar:23; indicar:22; ofrecer:19    60.7
B1     llegar:69; caer:30; salir:29; entrar:26; ocurrir:23; aparecer:19; nacer:17; acudir:14; producir:13; registrar:13    37.1
B2     convertir:30; considerar:12; hacer:12; conducir:12; utilizar:10; conocer:10; llamar:10; ver:10; aumentar:9; abrir:9    8.8
C1     haber:120; estar:57; existir:38; vivir:33; tratar:23; morir:22; acabar:20; encontrar:17; referir:15; contar:13    48.6
C2     ser:1367; tener:209; estar:188; quedar:40; suponer:27; resultar:24; mostrar:20; llevar:19; ver:18; sentir:17    83.9
C4     parecer:64; servir:17; suceder:16; quedar:14; gustar:10; pasar:9; importar:8; ir:6; llegar:5; faltar:5    75.8
D1     trabajar:32; ir:32; hacer:16; volver:15; pasar:13; echar:11; dar:10; actuar:10; regresar:10; huir:10    37.1

Table 4: The 10 most frequent verbs for each semantic class in Spanish

[8] We have not considered the C3, D2 and D3 semantic classes because they have fewer than 6 different lemmata per class.

It is important to highlight that in nine of the thirteen Catalan semantic classes (A31, A32, C1, C2, C3, C4, D1, D2 and D3) the 10 most frequent lemmata cover more than 50% of the total verbal occurrences of the class. In the Spanish subset, the number of such classes is eight (A31, A32, C1, C2, C3, C4, D2 and D3). Only four classes in Catalan (A1, A2, B1 and B2) and five in Spanish (A1, A2, B1, B2 and D1) are below 50%, probably because they are also the classes with the largest number of different verbal lemmata, and therefore the most sparsely distributed.
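The coverage percentages in Table 4 are the summed frequencies of the ten listed lemmata divided by the total number of tokens of the class in Table 2. The short Python sketch below recomputes this for three Spanish classes; the data are transcribed from the two tables, and the names `TOP10_TOKENS`, `CLASS_TOKENS` and `top10_coverage` are illustrative assumptions.

```python
# Frequencies of the 10 most frequent lemmata, transcribed from Table 4.
TOP10_TOKENS = {
    "A31": [31, 23, 19, 13, 9, 7, 6, 6, 6, 6],
    "B2": [30, 12, 12, 12, 10, 10, 10, 10, 9, 9],
    "C2": [1367, 209, 188, 40, 27, 24, 20, 19, 18, 17],
}

# Total tokens per class in the Spanish subset, transcribed from Table 2.
CLASS_TOKENS = {"A31": 210, "B2": 1406, "C2": 2299}


def top10_coverage(label: str) -> float:
    """Percentage of a class's tokens covered by its 10 most frequent lemmata."""
    return 100 * sum(TOP10_TOKENS[label]) / CLASS_TOKENS[label]


for label in TOP10_TOKENS:
    print(f"{label}: {top10_coverage(label):.1f}%")
```

For these three classes the recomputed values agree with the percentages in Table 4 (60.0, 8.8 and 83.9), which illustrates the contrast between a class dominated by a few lemmata (C2) and a sparsely distributed one (B2).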