Syntactic dependencies correspond to word pairs with high mutual information
Richard Futrell, Peng Qian, E. Gibson, Evelina Fedorenko, I. Blank
Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019)
DOI: 10.18653/v1/W19-7703
Citations: 22
Abstract
How is syntactic dependency structure reflected in the statistical distribution of words in corpora? Here we give empirical evidence and theoretical arguments for what we call the Head–Dependent Mutual Information (HDMI) Hypothesis: that syntactic heads and their dependents correspond to word pairs with especially high mutual information, an information-theoretic measure of strength of association. In support of this idea, we estimate mutual information between word pairs in dependencies based on an automatically-parsed corpus of 320 million tokens of English web text, finding that the mutual information between words in dependencies is robustly higher than a controlled baseline consisting of non-dependent word pairs. Next, we give a formal argument which derives the HDMI Hypothesis from a probabilistic interpretation of the postulates of dependency grammar. Our study also provides some useful empirical results about mutual information in corpora: we find that maximum-likelihood estimates of mutual information between raw word-forms are biased even at our large sample size, and we find that there is a general decay of mutual information between part-of-speech tags with distance.
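To make the estimation procedure concrete, below is a minimal sketch (not the authors' code) of the plug-in, maximum-likelihood estimate of mutual information I(H;D) = sum over (h,d) of p(h,d) log [ p(h,d) / (p(h) p(d)) ], computed from a list of (head, dependent) word pairs. The parsing step and the distance-matched baseline of non-dependent pairs are assumed to be supplied elsewhere; the input format and function names here are illustrative only.

```python
# Minimal sketch: plug-in (maximum-likelihood) mutual information estimate, in bits,
# between head and dependent word forms, given a list of (head, dependent) pairs.
# As the abstract notes, this ML estimator is biased upward for sparse word-form
# distributions, even at large sample sizes.
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(H; D) from a list of (head, dependent) pairs."""
    n = len(pairs)
    joint = Counter(pairs)                     # counts of (head, dependent) pairs
    heads = Counter(h for h, _ in pairs)       # marginal counts of heads
    deps = Counter(d for _, d in pairs)        # marginal counts of dependents
    mi = 0.0
    for (h, d), c in joint.items():
        p_hd = c / n
        p_h = heads[h] / n
        p_d = deps[d] / n
        mi += p_hd * math.log2(p_hd / (p_h * p_d))
    return mi

# Hypothetical usage: compare dependency pairs against a baseline of
# non-dependent pairs matched for linear distance within the same sentences.
dep_pairs = [("ate", "dog"), ("dog", "the"), ("ate", "apple"), ("apple", "an")]
print(mutual_information(dep_pairs))
```

The same estimator can be run over part-of-speech tags instead of raw word forms, which is the lower-variance setting in which the paper reports the decay of mutual information with linear distance.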