Inside the Motamot Dataset: Annotation & Quality Control for Bangla NLP

Table of Links

Abstract
I. INTRODUCTION
II. RELATED WORKS
III. BACKGROUND STUDY
IV. CORPUS CREATION
V. IMPLEMENTATION DETAILS
VI. RESULT ANALYSIS & DISCUSSION
VII. FUTURE RESEARCH DIRECTIONS
VIII. CONCLUSION AND REFERENCES

A. Data Collection

We named the dataset “Motamot” [15], which is মতামত in Bengali and “Opinion” in English. It was compiled from a range of online newspapers covering political events and conversations during Bangladeshi elections. Data collection involved scraping articles and opinion pieces from reputable news sources to obtain a diverse and representative sample of political discourse. “Motamot” gives a broad view of the opinions and conversations that shape Bangladesh’s political environment.

B. Data Attributes

The dataset comprises six attributes. The “source link” gives the URL of the article or news source, and the “newspaper name” identifies the outlet that published it. The “published date” records when the article was published, providing temporal context.

Each article has a “headline,” a concise title or summary of its content. The “short description” is a brief excerpt summarizing the main points or arguments of the article. Finally, the “sentiment” attribute assigns a sentiment label (Positive or Negative) to each article, which supports sentiment analysis and classification.

C. Annotation Process

The dataset was annotated manually by a dedicated team of six student annotators. Annotators analyzed each article’s content to assign an appropriate sentiment label. This involved examining the tone, linguistic complexity, and contextual subtleties of each article so that the opinions expressed about political issues were classified correctly.

D. Annotation Guideline

The annotation guidelines were developed to ensure uniformity and precision in sentiment labeling. They give annotators clear criteria for identifying sentiment expressions in the text, with examples illustrating each sentiment category.

For instance, text expressing support, agreement, or optimism towards political figures or policies was categorized as “Positive,” while text conveying criticism, disagreement, or dissatisfaction was labeled “Negative.” This guidance kept annotations consistent and helped ensure that sentiments were captured and categorized accurately.

E. Dataset Statistics

Figure 1 gives an overview of the dataset’s composition and scope, with statistics for each subset. The dataset is divided into three subsets: train (80%), test (10%), and validation (10%).

F. Annotation Quality Control Process

The Fleiss’ kappa [16] inter-rater reliability coefficient was computed over the six annotators to evaluate the consistency of the annotation approach. The high kappa score of 0.87 indicates strong agreement among annotators on sentiment labeling, reflecting the method’s reliability. This level of agreement suggests that the quality control techniques were effective in keeping sentiment labels consistent and dependable across the dataset.

G. Challenges Faced and Limitations

The data collection process faced challenges including biased news reporting, sentiment ambiguity, and language variations. Maintaining dataset integrity required careful data selection criteria and precise annotation guidelines, and quality control measures were employed to reduce the impact of these challenges. The dataset includes only “Positive” and “Negative” sentiment labels, which may limit the granularity of sentiment analysis. Gender, location, and political biases were noted and considered during annotation, though addressing them comprehensively remains challenging.

H. Availability and Usage

The “Motamot” [15] dataset is available in CSV format, making it easily accessible and compatible with a wide range of research tools and platforms.

:::info
Authors: Fatema Tuj Johora Faria, Mukaffi Bin Moin, Rabeya Islam Mumu, Md Mahabubul Alam Abir, Abrar Nawar Alfy, Mohammad Shafiul Alam
:::

:::info
This paper is available on arXiv under CC BY 4.0 license.
:::
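
As a rough illustration of the CSV format and the 80/10/10 partition described in Sections E and H, the sketch below reads rows and splits them into train, validation, and test subsets. It makes several assumptions that the text does not confirm: the header names simply reuse the attribute names from Section B, the split is a seeded random shuffle (the paper does not say how the partition was produced), and the sample rows are generated placeholders, not real Motamot records.

```python
import csv
import io
import random

# Assumed header names, copied from the attribute list in Section B.
# The released CSV may spell its columns differently.
FIELDS = ["source link", "newspaper name", "published date",
          "headline", "short description", "sentiment"]


def load_rows(csv_file):
    """Read all rows from an open CSV file and check the assumed columns exist."""
    reader = csv.DictReader(csv_file)
    missing = [f for f in FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def split_rows(rows, seed=0):
    """Shuffle with a fixed seed, then cut 80% train / 10% validation / rest test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Build a placeholder CSV in memory (hypothetical data, for illustration only).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(FIELDS)
for i in range(20):
    label = "Positive" if i % 2 == 0 else "Negative"
    writer.writerow([f"https://example.com/article-{i}", "Example Daily",
                     "2024-01-01", f"Headline {i}", f"Description {i}", label])
buffer.seek(0)

rows = load_rows(buffer)
train, val, test = split_rows(rows, seed=42)
print(len(train), len(val), len(test))
```

To use this with the real file, replace the in-memory buffer with `open(path, newline="", encoding="utf-8")` and adjust `FIELDS` to the actual header row. A stratified split that preserves the Positive/Negative ratio in each subset may be preferable, but that is a design choice beyond what the text states.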

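The agreement check in Section F can be sketched as follows. This is a minimal, standard-formula implementation of Fleiss' kappa in plain Python, not the authors' code. The label names come from the text; the `ratings` below are invented toy values for six hypothetical annotators and do not reproduce the reported score of 0.87.

```python
from collections import Counter

LABELS = ("Positive", "Negative")  # the two sentiment labels named in the text


def to_count_table(ratings, labels=LABELS):
    """Turn per-article label lists into per-article counts for each label."""
    table = []
    for item in ratings:
        counts = Counter(item)
        unknown = set(counts) - set(labels)
        if unknown:
            raise ValueError(f"unexpected labels: {unknown}")
        table.append([counts[label] for label in labels])
    return table


def fleiss_kappa(table):
    """Fleiss' kappa for a table of shape (items x categories) of rater counts."""
    if not table:
        raise ValueError("need at least one rated item")
    n_items = len(table)
    n_raters = sum(table[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in table):
        raise ValueError("every item must have the same number of ratings")

    n_categories = len(table[0])
    total = n_items * n_raters
    # Share of all ratings that fell into each category.
    p_j = [sum(row[j] for row in table) / total for j in range(n_categories)]
    # Observed agreement for each item.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in table]
    p_bar = sum(p_i) / n_items       # mean observed agreement
    p_e = sum(p * p for p in p_j)    # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1 - p_e)


# Toy example: four articles, six hypothetical annotators each.
ratings = [
    ["Positive"] * 6,
    ["Negative"] * 6,
    ["Positive"] * 5 + ["Negative"],
    ["Negative"] * 5 + ["Positive"],
]
kappa = fleiss_kappa(to_count_table(ratings))
print(f"Fleiss' kappa: {kappa:.3f}")
```

Values near 1 indicate agreement well above chance, and values near 0 indicate agreement no better than chance. Libraries such as statsmodels also provide a Fleiss' kappa routine, which may be a better choice than hand-rolled code for real analyses.
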
View original source — Hacker Noon ↗
