Section
|
Computer science
|
Title
|
Relevance ranking for proximity full-text search based on additional indexes with multi-component keys
|
Author(-s)
|
Veretennikov A.B.a
|
Affiliations
|
Ural Federal Universitya
|
Abstract
|
The problem of proximity full-text search is considered. If a search query contains high-frequently occurring words, then multi-component key indexes deliver improvement of the search speed in comparison with ordinary inverted indexes. It was shown that we can increase the search speed up to 130 times in cases when queries consist of high-frequently occurring words. In this paper, we are investigating how the multi-component key indexes architecture affects the quality of the search. We consider several well-known methods of relevance ranking; these methods are of different authors. Using these methods we perform the search in the ordinary inverted index and then in the index that is enhanced with multi-component key indexes. The results show that with multi-component key indexes we obtain search results that are very near in terms of relevance ranking to the search results that are obtained by means of ordinary inverted indexes.
|
Keywords
|
full-text search, search engines, relevance ranking, inverted indexes, proximity search, three-component key indexes
|
UDC
|
519.683.5
|
MSC
|
68P20, 68P10
|
DOI
|
10.35634/vm210110
|
Received
|
11 October 2020
|
Language
|
Russian
|
Citation
|
Veretennikov A.B. Relevance ranking for proximity full-text search based on additional indexes with multi-component keys, Vestnik Udmurtskogo Universiteta. Matematika. Mekhanika. Komp'yuternye Nauki, 2021, vol. 31, issue 1, pp. 132-148.
|
References
|
- Sadakane K. Fast algorithms for k-word proximity search, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2001, vol. 84, issue 9, pp. 2311-2318.
- Gall M., Brost G. K-word proximity search on encrypted data, 30th International Conference on Advanced Information Networking and Applications Workshops (WAINA), 2016, pp. 365-372. https://doi.org/10.1109/WAINA.2016.104
- Zipf G.K. Selected studies of the principle of relative frequency in language, Harvard University Press, 1932.
- Veretennikov A.B. Proximity full-text search by means of additional indexes with multi-component keys: in pursuit of optimal performance, Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2018. Communications in Computer and Information Science, vol. 1003, Cham: Springer, 2019, pp. 111-130. https://doi.org/10.1007/978-3-030-23584-0_7
- Miller R.B. Response time in man-computer conversational transactions, Proceedings of the December 9-11, 1968, fall joint computer conference, part I on - AFIPS '68 (Fall, part I), San Francisco, California, 1968, vol. 33, pp. 267-277. https://doi.org/10.1145/1476589.1476628
- Veretennikov A.B. Proximity full-text search with response time guarantee by means of three component keys, Bulletin of the South Ural State University. Series “Computational Mathematics and Software Engineering”, 2018, vol. 7, issue 1, pp. 60-77 (in Russian). https://doi.org/10.14529/cmse180105
- Veretennikov A.B. Proximity full-text search with a response time guarantee by means of additional indexes, Intelligent Systems and Applications: Proceedings of the 2018 Intelligent Systems Conference, Cham: Springer, 2018, pp. 936-954. https://doi.org/10.1007/978-3-030-01054-6_66
- Veretennikov A.B. An efficient algorithm for three-component key index construction, Vestnik Udmurtskogo Universiteta. Matematika. Mekhanika. Komp'yuternye Nauki, 2019, vol. 29, issue 1, pp. 117-132 (in Russian). https://doi.org/10.20537/vm190111
- Anh V.N., de Kretser O., Moffat A. Vector-space ranking with effective early termination, Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '01, 2001, pp. 35-42. https://doi.org/10.1145/383952.383957
- Garcia S., Williams H.E., Cannane A. Access-ordered indexes, ACSC '04 Proceedings of the 27th Australasian Conference on Computer Science, Dunedin, New Zealand, 2004, pp. 7-14.
- Lin J., Trotman A. Anytime ranking for impact-ordered indexes, Proceedings of the 2015 International Conference on The Theory of Information Retrieval, 2015, pp. 301-304. https://doi.org/10.1145/2808194.2809477
- Anh V.N., Moffat A. Pruned query evaluation using pre-computed impacts, Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '06, 2006, pp. 372-379. https://doi.org/10.1145/1148170.1148235
- Veretennikov A.B. Efficient full-text search by means of additional indexes of frequently used words, Sistemy Upravleniya i Informatsionnye Tekhnologii, 2016, vol. 66, issue 4, pp. 52-60 (in Russian).
- Zobel J., Moffat A. Inverted files for text search engines, ACM Computing Surveys, 2006, vol. 38, issue 2, article 6. https://doi.org/10.1145/1132956.1132959
- Yang Y., Ning H. Block linked list index structure for large data full text retrieval, 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), 2017, pp. 2123-2128. https://doi.org/10.1109/FSKD.2017.8393099
- Williams H.E., Zobel J., Bahle D. Fast phrase querying with combined indexes, ACM Transactions on Information Systems (TOIS), 2004, vol. 22, issue 4, pp. 573-594. https://doi.org/10.1145/1028099.1028102
- Crane M., Culpepper J., Lin J., Mackenzie J., Trotman A. A comparison of document-at-a-time and score-at-a-time query evaluation, Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 2017, pp. 201-210. https://doi.org/10.1145/3018661.3018726
- Robertson S.E., Walker S. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval, SIGIR '94, London: Springer, 1994, pp. 232-241. https://doi.org/10.1007/978-1-4471-2099-5_24
- Salton G., Yang C.S. On the specification of term values in automatic indexing, Journal of Documentation, 1973, vol. 29, issue 4, pp. 351-372.
- Yan H., Shi S., Zhang F., Suel T., Wen J.-R. Efficient term proximity search with term-pair indexes, Proceedings of the 19th ACM international conference on Information and knowledge management - CIKM '10, Toronto, 2010, pp. 1229-1238. https://doi.org/10.1145/1871437.1871593
- Brin S., Page L. Reprint of: The anatomy of a large-scale hypertextual web search engine, Computer Networks, 2012, vol. 56, issue 18, pp. 3825-3833. https://doi.org/10.1016/j.comnet.2012.10.007
- Lu X., Moffat A., Culpepper J.S. Efficient and effective higher order proximity modeling, Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, 2016, pp. 21-30. https://doi.org/10.1145/2970398.2970404
- Clarke C.L.A., Cormack G.V., Tudhope E.A. Relevance ranking for one to three term queries, Information Processing and Management, 2000, vol. 36, issue 2, pp. 291-311. https://doi.org/10.1016/S0306-4573(99)00017-5
- Song R., Taylor M.J., Wen J.-R., Hon H.-W., Yu Y. Viewing term proximity from a different perspective, Advances in Information Retrieval: Proceedings of 30th European Conference on IR Research, Berlin: Springer, 2008, pp. 346-357. https://doi.org/10.1007/978-3-540-78646-7_32
- Trotman A., Puurula A., Burgess B. Improvements to BM25 and language models examined, Proceedings of the 2014 Australasian Document Computing Symposium on - ADCS '14, 2014, pp. 58-65. https://doi.org/10.1145/2682862.2682863
- Dang V., Bendersky M., Croft W.B. Two-stage learning to rank for information retrieval, Advances in Information Retrieval: Proceedings of 35th European Conference on IR Research, Berlin: Springer, 2013, pp. 423-434. https://doi.org/10.1007/978-3-642-36973-5_36
- Silva S.D.N., de Moura E.S., Calado P.P., da Silva A.S. Effective lightweight learning-to-rank method using unified term impacts, IEEE Access, 2020, vol. 8, pp. 70420-70437. https://doi.org/10.1109/ACCESS.2020.2986943
- Mitra B., Diaz F., Craswell N. Learning to match using local and distributed representations of text for web search, Proceedings of the 26th International Conference on World Wide Web, Republic and Canton of Geneva, Switzerland, 2017, pp. 1291-1299. https://doi.org/10.1145/3038912.3052579
- Silva T.P.C., de Moura E.S., Cavalcanti J.M.B., da Silva A.S., de Carvalho M.G., Gonçalves M.A. An evolutionary approach for combining different sources of evidence in search engines, Information Systems, 2009, vol. 34, issue 2, pp. 276-289. https://doi.org/10.1016/j.is.2008.07.003
- Levenshtein V.I. Binary codes capable of correcting deletions, insertions, and reversals, Soviet Physics Doklady, 1966, vol. 10, issue 8, pp. 707-710.
- Makhoul J., Kubala F., Schwartz R., Weischedel R. Performance measures for information extraction, Proceedings of DARPA Broadcast News Workshop, 1999, pp. 249-252.
- Järvelin K., Kekäläinen J. Cumulated gain-based evaluation of IR techniques, ACM Transactions on Information Systems, 2002, vol. 20, issue 4, pp. 422-446.
- Liu T.-Y. Learning to rank for information retrieval, Berlin: Springer, 2011. https://doi.org/10.1007/978-3-642-14267-3
- Büttcher S., Clarke C., Soboroff I. The TREC 2006 terabyte track, Proceedings of the Fifteenth Text REtrieval Conference, TREC 2006, 2006, pp. 128-141.
- Fox C. A stop list for general text, ACM SIGIR Forum, 1989, vol. 24, issue 1-2, pp. 19-35. https://doi.org/10.1145/378881.378888
- Jansen B.J., Spink A., Saracevic T. Real life, real users, and real needs: a study and analysis of user queries on the web, Information Processing and Management, 2000, vol. 36, issue 2, pp. 207-227. https://doi.org/10.1016/S0306-4573(99)00056-4
- Jansen B.J., Spink A. How are we searching the World Wide Web? A comparison of nine search engine transaction logs, Information Processing and Management, 2006, vol. 42, pp. 248-263. https://doi.org/10.1016/j.ipm.2004.10.007
|
Full text
|
|