Query Matching Guide

Learn how to percolate lucene Queries to generate custom events based on business policies This guide covers:

  • lucene syntax

  • matching queries

1. Prerequisites

To complete this guide, you need:

  • less than 15 minutes

  • an IDE

  • JDK 1.8+ installed with JAVA_HOME configured appropriately

  • Apache Maven 3.5.3+

  • The completed greeter application from the Event Aggregation Guide

2. Architecture

In this guide, we expand on the initial stream that was created as part of the Event Aggregation Guide. We cover how to raise custom alerts based on lucene matching query criterion.

Archi

3. Solution

We recommend that you follow the instructions in the next sections and create the application step by step. However, you can go right to the completed example.

4. Logisland job setup

Our application will add a new feature to the previous app (Event Aggregation Guide) to match again some particuliar criterion. This is the heart of the complex event processing as you can define your own logic to build some new information over your data.

  1. First stream converts apache logs to typed records (please note the use of ConvertFieldsType processor)

  2. Second the sql stream we will compute every x seconds, the top twenty src_ip for connections count as new top_client_metrics Record containing connections_count and avg_bytes_out fields.

  3. Third stream creates alerts records within the MatchQuery processor which matches incomming records (from logisland_aggregations) against some criteria to eventualy produce some alerts to send them to logisland_alerts

# match threshold queries
- stream: aggregation_threshold_matching_stream
  component: com.hurence.logisland.stream.spark.KafkaRecordStreamParallelProcessing
  configuration:
    kafka.input.topics: logisland_aggregations
    kafka.output.topics: logisland_alerts
    kafka.input.topics.serializer: com.hurence.logisland.serializer.JsonSerializer
    kafka.output.topics.serializer: com.hurence.logisland.serializer.JsonSerializer
    processorConfigurations:

    # a parser that produce alerts from lucene queries
    - processor: match_query
      component: com.hurence.logisland.processor.MatchQuery
      configuration:
        numeric.fields: bytes_out,connections_count
        too_much_bandwidth: avg_bytes_out:[25000 TO 5000000]
        too_many_connections: connections_count:[150 TO 300]
        output.record.type: threshold_alert

In this stream for example a new record of type too_much_bandwidth would be created if one record match the condition avg_bytes_out:[25000 TO 5000000]. Please read the the following paragraph on Lucene Query Syntax to learn more.

5. Lucene Query Syntax

Here are some query examples demonstrating the query syntax.

Keyword matching

  • Search for word "foo" in the title field : title:foo

  • Search for phrase "foo bar" in the title field : title:"foo bar"

  • Search for phrase "foo bar" in the title field AND the phrase "quick fox" in the body field : title:"foo bar" AND body:"quick fox"

  • Search for either the phrase "foo bar" in the title field AND the phrase "quick fox" in the body field, or the word "fox" in the title field : (title:"foo bar" AND body:"quick fox") OR title:fox

  • Search for word "foo" and not "bar" in the title field : title:foo -title:bar

Wildcard matching

  • Search for any word that starts with "foo" in the title field : title:foo*

  • Search for any word that starts with "foo" and ends with bar in the title field : title:foo*bar

Note that Lucene doesn’t support using a * symbol as the first character of a search.

Proximity matching

Lucene supports finding words are a within a specific distance away.

  • Search for "foo bar" within 4 words from each other : "foo bar"~4

Note that for proximity searches, exact matches are proximity zero, and word transpositions (bar foo) are proximity 1.

A query such as "foo bar"~10000000 is an interesting alternative to foo AND bar.

Whilst both queries are effectively equivalent with respect to the documents that are returned, the proximity query assigns a higher score to documents for which the terms foo and bar are closer together.

Range searches

Range Queries allow one to match documents whose field(s) values are between the lower and upper bound specified by the Range Query. Range Queries can be inclusive or exclusive of the upper and lower bounds. Sorting is done lexicographically.

mod_date:[20020101 TO 20030101]

Solr’s built-in field types are very convenient for performing range queries on numbers without requiring padding.

6. Launch the script

For this tutorial we will handle some apache logs with a splitText parser and send them to Elastiscearch Connect a shell to your logisland container to launch the following streaming jobs.

docker exec -i -t logisland-quickstarts_logisland_1 bin/logisland.sh --conf conf/core/match-queries.yml

7. Check the alert outputs

To see alert records flowing through kafka, run the following command :

sudo docker exec -ti logisland-quickstarts_kafka_1 /opt/kafka/bin/kafka-console-consumer.sh  \\
    --bootstrap-server kafka:9092 --topic logisland_alerts

You should see the apache log events such as :

@TODO put here a sample alert output