English
 
Help Privacy Policy Disclaimer
  Advanced SearchBrowse

Item

ITEM ACTIONSEXPORT
  Faster text search with hybrid indexing

Auer, E. (2013). Faster text search with hybrid indexing. Poster presented at the 23rd Meeting of Computational Linguistics in the Netherlands (CLIN 2013), Enschede, The Netherlands.

Item is

Files

show Files
hide Files
:
poster-trova-lucene-clin-2013.pdf (Preprint), 2MB
Name:
poster-trova-lucene-clin-2013.pdf
Description:
Faster text search with hybrid indexing: A4 PDF file. Actual poster presented in A0 during CLIN 2013.
OA-Status:
Visibility:
Public
MIME-Type / Checksum:
application/pdf / [MD5]
Technical Metadata:
Copyright Date:
-
Copyright Info:
-

Locators

show

Creators

show
hide
 Creators:
Auer, Eric1, Author           
Affiliations:
1The Language Archive, MPI for Psycholinguistics, Max Planck Society, ou_530892              

Content

show
hide
Free keywords: Trova annotation content search Lucene PostgreSQL N-gram hash fingerprint index substring search
 Abstract: Growing amounts of annotation data in The Language Archive make it necessary to significantly speed up search to keep response times user friendly. Unlike keyword oriented web search engines, the Trova and CQL Search services at TLA allow searching for arbitrary exact substrings and (at lower speed) even regular expressions, not just whole words. To achieve both fast and versatile search, a combination of indexes is used. Word, substring and regular expression search queries are analyzed, yielding information about substrings and other properties which must be present in a tier (or file) so that tier can contain a hit for the query in question at all. Those properties are then either hash-mapped to fixed size bit vectors (fingerprints) for PostgreSQL based filtering or expressed as sets of N-grams (up to a fixed length) for filtering with Lucene N-gram indexes. Both methods aim to quickly find a small list of candidate tiers, containing all (but not much more) tiers which may contain hits. As Lucene has no native support for substring search, our system uses a fast but accurate N-gram based approximation. We present details of the implemented algorithm and elaborate the improvements in response times achieved. We were able to speed up most steps (of: opening indexes, defining a search domain, gathering candidates, finding hits and collecting hit details) and a typical benchmark session now completes in a fraction of the time used by the already powerful previous implementation.

Details

show
hide
Language(s): eng - English
 Dates: 2013-01-14
 Publication Status: Not specified
 Pages: -
 Publishing info: -
 Table of Contents: -
 Rev. Type: -
 Identifiers: -
 Degree: -

Event

show
hide
Title: the 23rd Meeting of Computational Linguistics in the Netherlands (CLIN 2013)
Place of Event: Enschede, The Netherlands
Start-/End Date: 2013-01-18 - 2013-01-18

Legal Case

show

Project information

show

Source

show