Faster text search with hybrid indexing

Auer, Eric

Auer, E. (2013). Faster text search with hybrid indexing. Poster presented at the 23rd Meeting of Computational Linguistics in the Netherlands (CLIN 2013), Enschede, The Netherlands.

Item is Released

Basic

Item Permalink: https://hdl.handle.net/11858/00-001M-0000-000E-780E-B
Version Permalink: https://hdl.handle.net/11858/00-001M-0000-000E-7810-3

Genre: Poster

Files

poster-trova-lucene-clin-2013.pdf (Preprint), 2MB

File Permalink:
https://hdl.handle.net/11858/00-001M-0000-000E-780D-D

Name:
poster-trova-lucene-clin-2013.pdf

Description:
Faster text search with hybrid indexing: A4 PDF file. Actual poster presented in A0 during CLIN 2013.

OA-Status:

Visibility:
Public

MIME-Type / Checksum:
application/pdf / [MD5]

Copyright Date:
-

Copyright Info:
-

License:
http://creativecommons.org/licenses/by/3.0/

Creators

Creators:
Auer, Eric¹, Author

Affiliations:
¹ The Language Archive, MPI for Psycholinguistics, Max Planck Society, ou_530892

Content

Free keywords: Trova, annotation content search, Lucene, PostgreSQL, N-gram, hash, fingerprint, index, substring search

Abstract: Growing amounts of annotation data in The Language Archive (TLA) make it necessary to speed up search significantly in order to keep response times user-friendly. Unlike keyword-oriented web search engines, the Trova and CQL Search services at TLA allow searching for arbitrary exact substrings and (at lower speed) even regular expressions, not just whole words. To achieve search that is both fast and versatile, a combination of indexes is used. Word, substring and regular expression queries are analyzed, yielding the substrings and other properties that a tier (or file) must contain in order to hold any hit for the query at all. Those properties are then either hash-mapped to fixed-size bit vectors (fingerprints) for PostgreSQL-based filtering, or expressed as sets of N-grams (up to a fixed length) for filtering with Lucene N-gram indexes. Both methods aim to quickly find a small list of candidate tiers: all tiers that may contain hits, and not many more. As Lucene has no native support for substring search, our system uses a fast but accurate N-gram-based approximation. We present details of the implemented algorithm and describe the improvements in response times achieved. We were able to speed up most steps (opening indexes, defining a search domain, gathering candidates, finding hits and collecting hit details), and a typical benchmark session now completes in a fraction of the time used by the already powerful previous implementation.
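
The two candidate filters described in the abstract can be illustrated with a small in-memory sketch. It is not the Trova implementation: the real system filters with PostgreSQL and Lucene, while this sketch uses plain Python dictionaries. The N-gram length (3), the fingerprint width (64 bits), the use of CRC32 as the hash, and all function names are assumptions made for illustration only. For simplicity it uses N-grams of exactly one length rather than "up to a fixed length".

```python
import zlib
from typing import Dict, Iterable, List, Set, Tuple

N = 3          # assumed N-gram length, not taken from the poster
FP_BITS = 64   # assumed fingerprint width, not taken from the poster

Index = Dict[str, Tuple[Set[str], int]]


def ngrams(text: str, n: int = N) -> Set[str]:
    """Return the set of all character N-grams of length n in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def fingerprint(grams: Iterable[str], bits: int = FP_BITS) -> int:
    """Hash-map each N-gram to one bit of a fixed-size bit vector."""
    fp = 0
    for gram in grams:
        fp |= 1 << (zlib.crc32(gram.encode("utf-8")) % bits)
    return fp


def build_index(tiers: Dict[str, str]) -> Index:
    """Precompute the N-gram set and the fingerprint of every tier."""
    index: Index = {}
    for tier_id, text in tiers.items():
        grams = ngrams(text)
        index[tier_id] = (grams, fingerprint(grams))
    return index


def candidates_by_ngrams(index: Index, query: str) -> List[str]:
    """Tiers whose N-gram set contains every N-gram of the query."""
    required = ngrams(query)
    return [tid for tid, (grams, _) in index.items() if required <= grams]


def candidates_by_fingerprint(index: Index, query: str) -> List[str]:
    """Tiers whose fingerprint has every bit of the query fingerprint set."""
    q = fingerprint(ngrams(query))
    return [tid for tid, (_, fp) in index.items() if fp & q == q]


def search(tiers: Dict[str, str], index: Index, query: str) -> List[str]:
    """Filter to candidates first, then verify each with an exact scan."""
    return [tid for tid in candidates_by_ngrams(index, query)
            if query in tiers[tid]]
```

Both filters can only produce false positives, never false negatives: a tier containing the query as a substring necessarily contains all of the query's N-grams, and therefore has all of the corresponding fingerprint bits set. A tier such as "aba bab" passes the N-gram filter for the query "abab" without containing it, which is why the final exact check in `search` is still needed. Queries shorter than N yield no N-grams, so in this sketch every tier becomes a candidate for them.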

Details

Language(s): eng - English

Dates: Created: 2013-01-14

Publication Status: Not specified

Pages: -

Publishing info: -

Table of Contents: -

Rev. Type: -

Identifiers: -

Degree: -

Event

Title: 23rd Meeting of Computational Linguistics in the Netherlands (CLIN 2013)

Place of Event: Enschede, The Netherlands

Start-/End Date: 2013-01-18 - 2013-01-18
