ArmanPersoNERCorpus is the first manually-annotated Persian Named-Entity (NE) dataset. We are releasing it only for academic research use.
The dataset includes 250,015 tokens and 7,682 Persian sentences in total. It is available in 3 folds of training and test sets.
Each file contains one token per line, together with its manually annotated named-entity tag. Sentences are separated from one another by a blank line.
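As a rough illustration of the file layout described above, the following Python sketch groups lines into sentences. It assumes that the token and its tag are separated by whitespace and that an empty line marks a sentence boundary; the function name read_sentences and the file name in the usage note are hypothetical and not part of the release.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, tag) pairs


def read_sentences(lines: Iterable[str]) -> List[Sentence]:
    """Group 'token tag' lines into sentences (hypothetical helper).

    Assumption: token and tag are separated by whitespace, and an
    empty line ends the current sentence.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # Empty line: close the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last run of whitespace so the tag is the final field.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            raise ValueError(f"expected 'token tag', got: {line!r}")
        current.append((parts[0], parts[1]))
    # The last sentence may not be followed by an empty line.
    if current:
        sentences.append(current)
    return sentences
```

A file could then be read with something like read_sentences(open("train_fold1.txt", encoding="utf-8")), where the file name is only a placeholder for one of the fold files.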
NER tags are in IOB format. According to the instructions provided to the annotators, NEs are categorized into six classes: person, organization (such as banks, ministries, embassies, teams, nationalities, networks and publishers), location (such as cities, villages, rivers, seas, gulfs, deserts and mountains), facility (such as schools, universities, research centers, airports, railways, bridges, roads, harbors, stations, hospitals, parks, zoos and cinemas), product (such as books, newspapers, TV shows, movies, airplanes, ships, cars, theories, laws, agreements and religion), and event (such as wars, earthquakes, national holidays, festivals and conferences). All remaining tokens are tagged as other.
It is worth noting that annotating was a challenging task as tokens were categorized according to the context. For instance, “congress” is classified differently in the following two named-entity samples: “US Congress” (organization) and “Third International Congress of Marine Science” (event).
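To make the IOB scheme concrete, the sketch below collapses a sequence of IOB tags into entity spans. It assumes the common convention of a "B-" prefix for the first token of an entity, "I-" for the following tokens, and "O" for other. The label strings used here ("org", "event") are placeholders chosen for illustration; the exact tag spellings in the released files may differ. The function iob_to_spans is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (label, start, end) spans from IOB tags; end is exclusive.

    Assumption: tags look like 'B-<label>', 'I-<label>' or 'O'.
    An 'I-' tag that does not continue the current entity is treated
    leniently as the start of a new entity.
    """
    spans: List[Tuple[str, int, int]] = []
    label: Optional[str] = None
    start = 0
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        # Close the open entity when the current tag does not extend it.
        if label is not None and not continues:
            spans.append((label, start, i))
            label = None
        # Open a new entity on 'B-', or on a stray 'I-'.
        if prefix == "B" or (prefix == "I" and label is None):
            label, start = name, i
    # Close an entity that runs to the end of the sentence.
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# The two "congress" samples, with placeholder tag names.
tokens = ["US", "Congress"]
tags = ["B-org", "I-org"]
for label, start, end in iob_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

Under these assumptions, "US Congress" comes out as a single organization span, while the same token "Congress" inside an event name would be covered by an event span instead.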