Capisco project

This project developed a tool that helps scholars identify and select resources within a digital library corpus. Access to such a corpus is currently available via text-based search over full text and metadata.

Our Capisco system analyzes documents by the semantics of their content. Traditional access to digitized document collections is primarily via string-based search over the documents' full text and metadata. Such a text-based search identifies documents purely by lexicographical analysis. Research questions and areas of scholarly interest, however, can rarely be described by simple textual keywords; instead, they encompass larger concepts.

Capisco avoids the need for complete semantic document markup using ontologies by leveraging an automatically generated Concept-in-Context (CiC) network. The network is seeded by a priori analysis of Wikipedia texts and identification of semantic metadata, implementing an annotate-to-wikipedia (A2W) approach. Capisco disambiguates the meaning of terms in the documents from their context and identifies the relevant CiC concepts.
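As a rough illustration of context-based disambiguation, the following Python sketch picks the candidate concept whose related terms overlap most with the words around an ambiguous term. It is not Capisco's actual algorithm: the tiny network, the concept identifiers, and the `disambiguate` function are all hypothetical and exist only to show the general idea.

```python
# Toy Concept-in-Context network (hypothetical data, not the real CiC network):
# each ambiguous term maps to candidate concepts, and each concept lists
# context words that tend to appear near it.
CIC_NETWORK = {
    "jaguar": {
        "Jaguar_(animal)": {"cat", "jungle", "prey", "species"},
        "Jaguar_Cars": {"car", "engine", "british", "motor"},
    },
}


def disambiguate(term, context_words, network=CIC_NETWORK):
    """Return the candidate concept whose context words overlap most with
    the words surrounding `term`, or None if no candidate matches."""
    candidates = network.get(term.lower(), {})
    context = {w.lower() for w in context_words}
    best, best_score = None, 0
    for concept, related in candidates.items():
        score = len(related & context)  # simple overlap count
        if score > best_score:
            best, best_score = concept, score
    return best


sentence = "The jaguar stalked its prey through the jungle".split()
print(disambiguate("jaguar", sentence))  # Jaguar_(animal)
```

A simple overlap count is used here only because it is easy to follow; any other scoring of term context could take its place.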

We further developed means to use the results of this semantic analysis and disambiguation while retaining the existing keyword-based search and lexicographic index. The output of the semantic analysis (performed offline) is engineered to be suitable for direct import into existing digital library metadata and index structures, so it can be incorporated without architectural modifications.
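The integration idea can be sketched in Python as follows, assuming that the offline analysis yields a list of concept identifiers per document. The record layout, the `concept:` prefix, and the `build_index` helper are illustrative assumptions, not Capisco's actual data format or any particular digital library's API. The concept identifiers are stored as one more metadata field, so the same unmodified indexing routine serves both keyword and concept lookups.

```python
from collections import defaultdict


def build_index(docs):
    """Build a plain inverted index: lowercased token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field_text in fields.values():
            for token in field_text.lower().split():
                index[token].add(doc_id)
    return index


# Existing records with metadata and full text (made-up example data).
docs = {
    "doc1": {"title": "Big cats", "text": "the jaguar hunts at night"},
    "doc2": {"title": "Road test", "text": "the jaguar has a new engine"},
}

# Assumed output of the offline semantic analysis: concept ids per document.
concepts = {
    "doc1": ["concept:Jaguar_(animal)"],
    "doc2": ["concept:Jaguar_Cars"],
}

# Import step: add the concept ids as an extra metadata field.
for doc_id, ids in concepts.items():
    docs[doc_id]["concepts"] = " ".join(ids)

index = build_index(docs)  # the indexing code itself is unchanged
print(sorted(index["jaguar"]))               # keyword search: both documents
print(sorted(index["concept:jaguar_cars"]))  # concept search: only doc2
```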

Capisco was developed in collaboration with the HathiTrust Research Center.

Project Contact

Annika Hinze (hinze@waikato.ac.nz)

 

Relevant Academic Publications

Hinze, A., Bainbridge, D., Wilkins, R., Taube-Schock, C., & Downie, J. S. (2018). Seeding strategies for semantic disambiguation. In Proc 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2018) (pp. 343-344). Fort Worth, Texas: ACM. doi:10.1145/3197026.3203874

Hinze, A., Bainbridge, D., Cunningham, S. J., Taube-Schock, C., Matamua, R., Downie, J. S., & Rasmussen, E. (2018). Capisco: low-cost concept-based access to digital libraries. International Journal on Digital Libraries, Online First, 1-28. doi:10.1007/s00799-018-0232-3

Hinze, A., Coleman, M., Cunningham, S. J., & Bainbridge, D. (2016). Semantic Bookworm: mining literary resources revisited. In Proc 16th ACM/IEEE-CS on Joint Conference on Digital Libraries (pp. 227-228). Newark, NJ, USA: ACM. doi:10.1145/2910896.2925444

Hinze, A., Bainbridge, D., Cunningham, S. J., & Downie, J. S. (2016). Low-cost semantic enhancement to digital library metadata and indexing: simple yet effective strategies. In Proc 16th ACM/IEEE-CS on Joint Conference on Digital Libraries (pp. 93-102). Newark, NJ, USA: ACM. doi:10.1145/2910896.2910910

Hinze, A., Taube-Schock, C., Bainbridge, D., Cunningham, S. J., & Downie, J. S. (2015). Introducing Capisco: a semantically-enhanced search and discovery system for large-scale text corpora. ACM SIGWEB Newsletter, (Autumn 2015), 1-14. doi:10.1145/2833219.2833223

Cunningham, S. J., Hinze, A. M., Bainbridge, D., Taube-Schock, C., & Ryan, T. (2014). Building heritage document collections for Pacific Island nations using semantic-enriched search. In Proceedings of the Samoa Conference III. Sāmoa: National University of Sāmoa. Retrieved from http://samoanstudies.ws/publications/proceedings-of-the-samoa-conference-iii/