Logisland - Event Aggregation
Learn how to aggregate metrics over time windows of events.
This guide covers:
- Avro schema validation
- the event aggregation stream
1. Prerequisites
To complete this guide, you need:
- less than 15 minutes
- an IDE
- JDK 1.8+ installed with JAVA_HOME configured appropriately
- Apache Maven 3.5.3+ (you can verify both as shown below)
- the completed application from the Getting Started Guide
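To verify the JDK and Maven prerequisites from a terminal, the standard version commands are enough (the exact version strings depend on your installation):
java -version
echo $JAVA_HOME
mvn -version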
2. Architecture
In this guide, we expand on the initial log parsing stream that was created as part of the Getting Started Guide. We cover how to generate time window metrics on some HTTP traffic (Apache log records).
3. Solution
We recommend that you follow the instructions in the next sections and create the application step by step. However, you can go right to the completed example.
Clone the Git repository: git clone https://github.com/hurence/logisland-quickstarts.git, or download an archive.
The solution is located in the conf/core/event-aggregation.yml file.
This guide assumes you already have the completed application from the getting-started guide.
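If you cloned the repository, the solution file can be inspected directly from a shell (the path is the one given above):
cd logisland-quickstarts
cat conf/core/event-aggregation.yml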
4. Logisland job setup
Our application adds a new stream to the one from the Getting Started Guide: a SQL stream that uses a KafkaRecordStreamSQLAggregator. This stream defines the input/output topic names as well as the serializers and the Avro schema.
The Avro schema is set for the input topic and must be the same as the Avro schema of the output topic of the stream that produces the records (please refer to the Getting Started Guide):
{
"version":1,
"type": "record",
"name": "com.hurence.logisland.record.apache_log",
"fields": [
{ "name": "record_errors", "type": [ {"type": "array", "items": "string"},"null"] },
{ "name": "record_raw_key", "type": ["string","null"] },
{ "name": "record_raw_value", "type": ["string","null"] },
{ "name": "record_id", "type": ["string"] },
{ "name": "record_time", "type": ["long"] },
{ "name": "record_type", "type": ["string"] },
{ "name": "src_ip", "type": ["string","null"] },
{ "name": "http_method", "type": ["string","null"] },
{ "name": "bytes_out", "type": ["long","null"] },
{ "name": "http_query", "type": ["string","null"] },
{ "name": "http_version","type": ["string","null"] },
{ "name": "http_status", "type": ["string","null"] },
{ "name": "identd", "type": ["string","null"] },
{ "name": "user", "type": ["string","null"] }
]
}
The most important part of the KafkaRecordStreamSQLAggregator is its sql.query property, which defines the query to apply to the incoming records over the given time window.
The following SQL query will be applied on a sliding window of 10 seconds of records.
SELECT count(*) AS connections_count,
avg(bytes_out) AS avg_bytes_out,
src_ip,
first(record_time) as record_time
FROM logisland_events
GROUP BY src_ip
ORDER BY connections_count DESC
LIMIT 20
This query treats the logisland_events topic as a SQL table and creates up to 20 output Records with the fields connections_count, avg_bytes_out, src_ip and record_time. The first(record_time) expression ensures that the created records carry the effective input event time (not the processing time). For example, if a given src_ip appears in three events of a window with bytes_out values of 100, 200 and 300, the corresponding output record has connections_count = 3 and avg_bytes_out = 200.
- stream: metrics_by_host
  component: com.hurence.logisland.stream.spark.KafkaRecordStreamSQLAggregator
  type: stream
  documentation: a stream that aggregates apache log records over a time window using SQL
  configuration:
    kafka.input.topics: logisland_events
    kafka.output.topics: logisland_aggregations
    kafka.error.topics: logisland_errors
    kafka.input.topics.serializer: com.hurence.logisland.serializer.KryoSerializer
    kafka.output.topics.serializer: com.hurence.logisland.serializer.KryoSerializer
    kafka.error.topics.serializer: com.hurence.logisland.serializer.JsonSerializer
    kafka.metadata.broker.list: sandbox:9092
    kafka.zookeeper.quorum: sandbox:2181
    kafka.topic.autoCreate: true
    kafka.topic.default.partitions: 2
    kafka.topic.default.replicationFactor: 1
    window.duration: 10
    avro.input.schema: >
      { "version": 1,
        "type": "record",
        "name": "com.hurence.logisland.record.apache_log",
        "fields": [
          { "name": "record_errors", "type": [ {"type": "array", "items": "string"}, "null"] },
          { "name": "record_raw_key", "type": ["string","null"] },
          { "name": "record_raw_value", "type": ["string","null"] },
          { "name": "record_id", "type": ["string"] },
          { "name": "record_time", "type": ["long"] },
          { "name": "record_type", "type": ["string"] },
          { "name": "src_ip", "type": ["string","null"] },
          { "name": "http_method", "type": ["string","null"] },
          { "name": "bytes_out", "type": ["long","null"] },
          { "name": "http_query", "type": ["string","null"] },
          { "name": "http_version", "type": ["string","null"] },
          { "name": "http_status", "type": ["string","null"] },
          { "name": "identd", "type": ["string","null"] },
          { "name": "user", "type": ["string","null"] } ]}
    sql.query: >
      SELECT count(*) AS connections_count,
             avg(bytes_out) AS avg_bytes_out,
             src_ip,
             first(record_time) AS record_time
      FROM logisland_events
      GROUP BY src_ip
      ORDER BY connections_count DESC
      LIMIT 20
    max.results.count: 1000
    output.record.type: top_client_metrics
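Because kafka.topic.autoCreate is set to true, the input and output topics are created automatically when the stream starts. If you want to double-check that logisland_events and logisland_aggregations exist, one possible check (assuming the kafka-topics.sh tool shipped in the quickstart's Kafka container and a Kafka version that accepts --bootstrap-server, i.e. 2.2+) is:
sudo docker exec -ti logisland-quickstarts_kafka_1 /opt/kafka/bin/kafka-topics.sh --bootstrap-server kafka:9092 --list
On older Kafka versions, replace --bootstrap-server kafka:9092 with --zookeeper and your ZooKeeper quorum.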
Here we compute, every 10 seconds (the window.duration), the top twenty src_ip by connection count.
The result of the query is pushed into the logisland_aggregations topic as new top_client_metrics Records containing the connections_count and avg_bytes_out fields.
5. Launch the script
For this tutorial we will handle some Apache logs with a splitText parser and send them to Elasticsearch. Connect a shell to your logisland container to launch the following streaming jobs.
sudo docker exec -i -t logisland-quickstarts_logisland_1 bin/logisland.sh --conf conf/core/event-aggregation.yml
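To confirm that the job started correctly, you can follow the container output with a standard Docker command (this assumes the Logisland process logs to the container's stdout):
sudo docker logs -f logisland-quickstarts_logisland_1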
6. See what we have in return
Logisland handles the parsing of the log lines and structures them as Records; the aggregation Records computed from them are sent to another Kafka topic, logisland_aggregations, which you can consume directly:
sudo docker exec -ti logisland-quickstarts_kafka_1 /opt/kafka/bin/kafka-console-consumer.sh \
    --bootstrap-server kafka:9092 --topic logisland_aggregations
You should see the aggregation records, such as:
@TODO put here a sample aggregation output
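If the consumer prints nothing, the aggregation records may have been produced before the consumer attached; kafka-console-consumer.sh accepts a --from-beginning flag to replay the topic from its start:
sudo docker exec -ti logisland-quickstarts_kafka_1 /opt/kafka/bin/kafka-console-consumer.sh \
    --bootstrap-server kafka:9092 --topic logisland_aggregations --from-beginning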