Paper / Presentation Title: “Taking Disagreements into Consideration: Documenting and Reporting Human Annotation Variability in Privacy Policy Analysis,” forthcoming in Information Research and for presentation at iConference 2025.
About the Researcher
Yuanye Ma’s research focuses on user-centered privacy, exploring how individuals understand privacy and how their perceptions and behaviors are shaped by different methods of privacy communication and technological design.
She is broadly interested in examining complex social issues arising from the use of information and communication technologies. Specifically, she wants to apply data science to address challenges such as health equity and the ways people look for information about their health conditions.
About the Research
AI and machine learning (ML) processes are designed to seek a singular “truth” or standard. To reach this standard, they optimize various steps of the pipeline, including how data sets are annotated, often removing outliers through majority consensus and treating high agreement among annotators as a quality indicator. Similarly, when models make predictions, they typically present only the most probable outcome, eliminating disagreement to uphold a singular “truth.”
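To make the aggregation step concrete, the sketch below shows how a common majority-vote scheme collapses several annotators’ labels for a privacy policy segment into a single “gold” label, discarding any disagreement in the process. The segment names, label categories, and data are hypothetical illustrations, not the setup used in the paper.

```python
from collections import Counter

# Hypothetical annotations: three annotators each label a privacy policy
# segment with the data practice they believe it describes.
annotations = {
    "segment_1": ["First Party Collection", "First Party Collection", "Third Party Sharing"],
    "segment_2": ["Data Retention", "Data Security", "Data Retention"],
}

def majority_label(labels):
    """Collapse annotator labels into a single label by majority vote.

    Minority views are simply dropped, which is exactly the information
    loss that motivates documenting annotation variability instead.
    """
    label, _count = Counter(labels).most_common(1)[0]
    return label

gold_standard = {seg: majority_label(labels) for seg, labels in annotations.items()}
print(gold_standard)
# {'segment_1': 'First Party Collection', 'segment_2': 'Data Retention'}
```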
However, certain data, including legal and policy texts, are inherently prone to ambiguity and open to different interpretations. Privacy policy documents are among such data, where a multitude of interpretations can exist. We argue that the ongoing evolution of privacy technologies and practices should account for genuine differences in how people interpret privacy policy documents.
In this study, we introduce a new approach to realistically report how well an automated approach mirrors human judgments, rather than relying on gold standards constructed to increase agreement on online privacy policy corpora. Our results show that the new measure complements the standard practice of reporting Fleiss’ kappa. We conclude with four specific recommendations that would provide a more realistic assessment of how well automated models replicate human judgment.
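For reference, Fleiss’ kappa, the standard agreement statistic that the new measure is said to complement, can be computed as in the sketch below. The three-annotator, two-category ratings are invented for illustration, and the statsmodels library is assumed to be available; this is not the paper’s dataset or proposed measure.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: rows are privacy policy segments, columns are
# annotators, values are categories (0 = "not data sharing", 1 = "data sharing").
ratings = np.array([
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 0, 1],
])

# aggregate_raters converts the subjects-by-raters matrix into the
# subjects-by-categories count table that Fleiss' kappa expects.
table, categories = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa: {kappa:.3f}")
```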