Bridging Text Embeddings for Unconventional Linguistic Contexts

Authors

  • Pramath Parashar, Data Science Specialist, BHP Minerals Service Company

DOI:

https://doi.org/10.63282/3050-9262.IJAIDSML-V7I1P105

Keywords:

Text Embeddings, Word2vec, Symbolic Knowledge Representation, Synthetic Corpus Generation, Narrative Tropes, TV Tropes, Clustering Analysis, Skip-Gram Model, Semantic Similarity, Non-Linguistic Text Processing

Abstract

This paper extends word embedding techniques to extract contextual information from atypical textual sources. Using a corpus derived from the TV Tropes dataset, it introduces two distinct models: an N-Grams Permutation approach and a Database-Like Reinforcement methodology. A comparative evaluation demonstrates that the synthetically generated corpus, when processed by these models, improves accuracy by up to 45.2% over human-curated linguistic representations for this specialized application of Natural Language Processing.

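Illustrative Sketch

The abstract names a concrete pipeline (synthetic corpus construction from TV Tropes records, skip-gram Word2Vec training, similarity-based evaluation), but this page does not reproduce the implementation. The snippet below is only a minimal, hypothetical sketch of how such a pipeline could look: the toy data, the helper names (reinforcement_sentences, ngram_permutation_sentences), and the gensim parameters are all assumptions, not the author's code.

    # Hypothetical sketch only: neither the data nor these helpers come from
    # the paper; they illustrate the kind of pipeline the abstract describes.
    from itertools import permutations
    from gensim.models import Word2Vec  # assumes gensim >= 4.x (vector_size kwarg)

    # Toy records mimicking a TV Tropes-style symbolic database:
    # each record links a work to the tropes it exhibits (invented data).
    trope_records = [
        ("WorkA", ["ChekhovsGun", "HerosJourney", "RedHerring"]),
        ("WorkB", ["HerosJourney", "MentorArchetype"]),
        ("WorkC", ["RedHerring", "ChekhovsGun"]),
    ]

    def reinforcement_sentences(records, repeats=3):
        # "Database-Like Reinforcement" (assumed reading): repeat each
        # work/trope association so frequent co-occurrences dominate the
        # skip-gram context windows.
        for work, tropes in records:
            for _ in range(repeats):
                yield [work] + tropes

    def ngram_permutation_sentences(records, n=2):
        # "N-Grams Permutation" (assumed reading): emit every ordered
        # n-tuple of co-occurring tropes as a short synthetic sentence.
        for _, tropes in records:
            for gram in permutations(tropes, n):
                yield list(gram)

    corpus = list(reinforcement_sentences(trope_records)) + \
             list(ngram_permutation_sentences(trope_records))

    # Skip-gram model (sg=1), as named in the keywords; parameters are guesses.
    model = Word2Vec(sentences=corpus, vector_size=50, window=5, sg=1, min_count=1)

    # Semantic similarity between two tropes in the learned embedding space.
    print(model.wv.similarity("ChekhovsGun", "RedHerring"))

The paper's reported comparison is between embeddings trained on such synthetically generated corpora and embeddings built from human-curated text; the sketch above only indicates where the two named corpus-construction strategies would plug in.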

Published

2026-01-14

Issue

Vol. 7 No. 1 (2026)

Section

Articles

How to Cite

Parashar P. Bridging Text Embeddings for Unconventional Linguistic Contexts. IJAIDSML [Internet]. 2026 Jan. 14 [cited 2026 Jan. 23];7(1):21-5. Available from: https://ijaidsml.org/index.php/ijaidsml/article/view/398