phone +7 (3412) 91 60 92

Archive of Issues


Russia Yekaterinburg
Year
2021
Volume
31
Issue
1
Pages
132-148
<<
>>
Section Computer science
Title Relevance ranking for proximity full-text search based on additional indexes with multi-component keys
Author(-s) Veretennikov A.B.a
Affiliations Ural Federal Universitya
Abstract The problem of proximity full-text search is considered. If a search query contains high-frequently occurring words, then multi-component key indexes deliver improvement of the search speed in comparison with ordinary inverted indexes. It was shown that we can increase the search speed up to 130 times in cases when queries consist of high-frequently occurring words. In this paper, we are investigating how the multi-component key indexes architecture affects the quality of the search. We consider several well-known methods of relevance ranking; these methods are of different authors. Using these methods we perform the search in the ordinary inverted index and then in the index that is enhanced with multi-component key indexes. The results show that with multi-component key indexes we obtain search results that are very near in terms of relevance ranking to the search results that are obtained by means of ordinary inverted indexes.
Keywords full-text search, search engines, relevance ranking, inverted indexes, proximity search, three-component key indexes
UDC 519.683.5
MSC 68P20, 68P10
DOI 10.35634/vm210110
Received 11 October 2020
Language Russian
Citation Veretennikov A.B. Relevance ranking for proximity full-text search based on additional indexes with multi-component keys, Vestnik Udmurtskogo Universiteta. Matematika. Mekhanika. Komp'yuternye Nauki, 2021, vol. 31, issue 1, pp. 132-148.
References
  1. Sadakane K. Fast algorithms for k-word proximity search, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2001, vol. 84, issue 9, pp. 2311-2318.
  2. Gall M., Brost G. K-word proximity search on encrypted data, 30th International Conference on Advanced Information Networking and Applications Workshops (WAINA), 2016, pp. 365-372. https://doi.org/10.1109/WAINA.2016.104
  3. Zipf G.K. Selected studies of the principle of relative frequency in language, Harvard University Press, 1932.
  4. Veretennikov A.B. Proximity full-text search by means of additional indexes with multi-component keys: in pursuit of optimal performance, Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2018. Communications in Computer and Information Science, vol. 1003, Cham: Springer, 2019, pp. 111-130. https://doi.org/10.1007/978-3-030-23584-0_7
  5. Miller R.B. Response time in man-computer conversational transactions, Proceedings of the December 9-11, 1968, fall joint computer conference, part I on - AFIPS '68 (Fall, part I), San Francisco, California, 1968, vol. 33, pp. 267-277. https://doi.org/10.1145/1476589.1476628
  6. Veretennikov A.B. Proximity full-text search with response time guarantee by means of three component keys, Bulletin of the South Ural State University. Series “Computational Mathematics and Software Engineering”, 2018, vol. 7, issue 1, pp. 60-77 (in Russian). https://doi.org/10.14529/cmse180105
  7. Veretennikov A.B. Proximity full-text search with a response time guarantee by means of additional indexes, Intelligent Systems and Applications: Proceedings of the 2018 Intelligent Systems Conference, Cham: Springer, 2018, pp. 936-954. https://doi.org/10.1007/978-3-030-01054-6_66
  8. Veretennikov A.B. An efficient algorithm for three-component key index construction, Vestnik Udmurtskogo Universiteta. Matematika. Mekhanika. Komp'yuternye Nauki, 2019, vol. 29, issue 1, pp. 117-132 (in Russian). https://doi.org/10.20537/vm190111
  9. Anh V.N., de Kretser O., Moffat A. Vector-space ranking with effective early termination, Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '01, 2001, pp. 35-42. https://doi.org/10.1145/383952.383957
  10. Garcia S., Williams H.E., Cannane A. Access-ordered indexes, ACSC '04 Proceedings of the 27th Australasian Conference on Computer Science, Dunedin, New Zealand, 2004, pp. 7-14.
  11. Lin J., Trotman A. Anytime ranking for impact-ordered indexes, Proceedings of the 2015 International Conference on The Theory of Information Retrieval, 2015, pp. 301-304. https://doi.org/10.1145/2808194.2809477
  12. Anh V.N., Moffat A. Pruned query evaluation using pre-computed impacts, Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '06, 2006, pp. 372-379. https://doi.org/10.1145/1148170.1148235
  13. Veretennikov A.B. Efficient full-text search by means of additional indexes of frequently used words, Sistemy Upravleniya i Informatsionnye Tekhnologii, 2016, vol. 66, issue 4, pp. 52-60 (in Russian).
  14. Zobel J., Moffat A. Inverted files for text search engines, ACM Computing Surveys, 2006, vol. 38, issue 2, article 6. https://doi.org/10.1145/1132956.1132959
  15. Yang Y., Ning H. Block linked list index structure for large data full text retrieval, 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), 2017, pp. 2123-2128. https://doi.org/10.1109/FSKD.2017.8393099
  16. Williams H.E., Zobel J., Bahle D. Fast phrase querying with combined indexes, ACM Transactions on Information Systems (TOIS), 2004, vol. 22, issue 4, pp. 573-594. https://doi.org/10.1145/1028099.1028102
  17. Crane M., Culpepper J., Lin J., Mackenzie J., Trotman A. A comparison of document-at-a-time and score-at-a-time query evaluation, Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 2017, pp. 201-210. https://doi.org/10.1145/3018661.3018726
  18. Robertson S.E., Walker S. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval, SIGIR '94, London: Springer, 1994, pp. 232-241. https://doi.org/10.1007/978-1-4471-2099-5_24
  19. Salton G., Yang C.S. On the specification of term values in automatic indexing, Journal of Documentation, 1973, vol. 29, issue 4, pp. 351-372.
  20. Yan H., Shi S., Zhang F., Suel T., Wen J.-R. Efficient term proximity search with term-pair indexes, Proceedings of the 19th ACM international conference on Information and knowledge management - CIKM '10, Toronto, 2010, pp. 1229-1238. https://doi.org/10.1145/1871437.1871593
  21. Brin S., Page L. Reprint of: The anatomy of a large-scale hypertextual web search engine, Computer Networks, 2012, vol. 56, issue 18, pp. 3825-3833. https://doi.org/10.1016/j.comnet.2012.10.007
  22. Lu X., Moffat A., Culpepper J.S. Efficient and effective higher order proximity modeling, Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, 2016, pp. 21-30. https://doi.org/10.1145/2970398.2970404
  23. Clarke C.L.A., Cormack G.V., Tudhope E.A. Relevance ranking for one to three term queries, Information Processing and Management, 2000, vol. 36, issue 2, pp. 291-311. https://doi.org/10.1016/S0306-4573(99)00017-5
  24. Song R., Taylor M.J., Wen J.-R., Hon H.-W., Yu Y. Viewing term proximity from a different perspective, Advances in Information Retrieval: Proceedings of 30th European Conference on IR Research, Berlin: Springer, 2008, pp. 346-357. https://doi.org/10.1007/978-3-540-78646-7_32
  25. Trotman A., Puurula A., Burgess B. Improvements to BM25 and language models examined, Proceedings of the 2014 Australasian Document Computing Symposium on - ADCS '14, 2014, pp. 58-65. https://doi.org/10.1145/2682862.2682863
  26. Dang V., Bendersky M., Croft W.B. Two-stage learning to rank for information retrieval, Advances in Information Retrieval: Proceedings of 35th European Conference on IR Research, Berlin: Springer, 2013, pp. 423-434. https://doi.org/10.1007/978-3-642-36973-5_36
  27. Silva S.D.N., de Moura E.S., Calado P.P., da Silva A.S. Effective lightweight learning-to-rank method using unified term impacts, IEEE Access, 2020, vol. 8, pp. 70420-70437. https://doi.org/10.1109/ACCESS.2020.2986943
  28. Mitra B., Diaz F., Craswell N. Learning to match using local and distributed representations of text for web search, Proceedings of the 26th International Conference on World Wide Web, Republic and Canton of Geneva, Switzerland, 2017, pp. 1291-1299. https://doi.org/10.1145/3038912.3052579
  29. Silva T.P.C., de Moura E.S., Cavalcanti J.M.B., da Silva A.S., de Carvalho M.G., Gonçalves M.A. An evolutionary approach for combining different sources of evidence in search engines, Information Systems, 2009, vol. 34, issue 2, pp. 276-289. https://doi.org/10.1016/j.is.2008.07.003
  30. Levenshtein V.I. Binary codes capable of correcting deletions, insertions, and reversals, Soviet Physics Doklady, 1966, vol. 10, issue 8, pp. 707-710.
  31. Makhoul J., Kubala F., Schwartz R., Weischedel R. Performance measures for information extraction, Proceedings of DARPA Broadcast News Workshop, 1999, pp. 249-252.
  32. Järvelin K., Kekäläinen J. Cumulated gain-based evaluation of IR techniques, ACM Transactions on Information Systems, 2002, vol. 20, issue 4, pp. 422-446.
  33. Liu T.-Y. Learning to rank for information retrieval, Berlin: Springer, 2011. https://doi.org/10.1007/978-3-642-14267-3
  34. Büttcher S., Clarke C., Soboroff I. The TREC 2006 terabyte track, Proceedings of the Fifteenth Text REtrieval Conference, TREC 2006, 2006, pp. 128-141.
  35. Fox C. A stop list for general text, ACM SIGIR Forum, 1989, vol. 24, issue 1-2, pp. 19-35. https://doi.org/10.1145/378881.378888
  36. Jansen B.J., Spink A., Saracevic T. Real life, real users, and real needs: a study and analysis of user queries on the web, Information Processing and Management, 2000, vol. 36, issue 2, pp. 207-227. https://doi.org/10.1016/S0306-4573(99)00056-4
  37. Jansen B.J., Spink A. How are we searching the World Wide Web? A comparison of nine search engine transaction logs, Information Processing and Management, 2006, vol. 42, pp. 248-263. https://doi.org/10.1016/j.ipm.2004.10.007
Full text
<< Previous article
Next article >>