Getting Started Guide

Learn how to create a Hello World Logisland app. This guide covers:

  • Bootstrapping an application

  • Injecting some logs

  • Parsing them

  • Inspecting the resulting events

1. Prerequisites

To complete this guide, you need:

  • less than 15 minutes

  • a simple text editor

  • docker & docker-compose installed and running

2. Architecture

In this guide, we create a straightforward application that parses Apache logs from the logisland_raw topic with a regex and sends structured events to the logisland_events topic.

Architecture

This guide also covers how to test access to Kafka.

3. Solution

We recommend that you follow the instructions in the next sections and create the application step by step. However, you can go right to the completed example.

Clone the Git repository: git clone https://github.com/hurence/logisland-quickstarts.git, or download an archive.

The solution is located in the conf/core/parse-apache-logs.yml file.

4. Bootstrapping Logisland

Our app is pretty straightforward: it reads raw Apache log lines from a Kafka topic, parses these lines into Logisland Records and sends these records to another Kafka topic.

4.1. Configure the streaming app

The easiest way to create a new Logisland project is to edit a config file in the conf/core folder. The following is the content of conf/core/parse-apache-logs.yml; open it in any text editor if you want to make some adjustments.

version: 1.1.2
documentation: LogIsland analytics main config file. Put here every engine or component config

engine:
  component: com.hurence.logisland.engine.spark.KafkaStreamProcessingEngine
  configuration:
    spark.app.name: GettingStarted
    spark.master: local[4]
    spark.driver.memory: 2G
    spark.streaming.batchDuration: 200
    spark.streaming.kafka.maxRatePerPartition: 3000
    spark.streaming.timeout: -1
    spark.ui.port: 4050
  streamConfigurations:

    - stream: parsing_stream
      component: com.hurence.logisland.stream.spark.KafkaRecordStreamParallelProcessing
      configuration:
        kafka.input.topics: logisland_raw
        kafka.output.topics: logisland_events
        kafka.error.topics: logisland_errors
        kafka.input.topics.serializer: none
        kafka.output.topics.serializer: com.hurence.logisland.serializer.JsonSerializer
        kafka.error.topics.serializer: com.hurence.logisland.serializer.JsonSerializer
        kafka.metadata.broker.list: kafka:9092
        kafka.zookeeper.quorum: zookeeper:2181
        kafka.topic.autoCreate: true
        kafka.topic.default.partitions: 4
        kafka.topic.default.replicationFactor: 1
      processorConfigurations:

        # a parser that produces events from an Apache log REGEX
        - processor: apache_parser
          component: com.hurence.logisland.processor.SplitText
          configuration:
            record.type: apache_log
            value.regex: (\S+)\s+(\S+)\s+(\S+)\s+\[([\w:\/]+\s[+\-]\d{4})\]\s+"(\S+)\s+(\S+)\s*(\S*)"\s+(\S+)\s+(\S+)
            value.fields: src_ip,identd,user,record_time,http_method,http_query,http_version,http_status,bytes_out

An Engine is needed to handle the stream processing. This configuration file defines a stream processing job setup. The first section configures the Spark engine (we will use a KafkaStreamProcessingEngine, documented at ../plugins.html#kafkastreamprocessingengine) to run in local mode with 4 CPU cores and 2G of RAM.

Inside this engine we run a Kafka stream of processing, so we set up the input/output topics and the Kafka/Zookeeper hosts. Here the stream reads all the logs sent to the logisland_raw topic and pushes the processing output into the logisland_events topic.

We can also define serializers to marshal all records from and to the topics.

5. Setting up Underlying services

Apache Kafka is pitched as a Distributed Streaming Platform. In Kafka lingo, Producers continuously generate data (streams) and Consumers are responsible for processing, storing and analysing it. Kafka Brokers are responsible for ensuring that, in a distributed scenario, the data can flow from Producers to Consumers without any inconsistency. A set of Kafka brokers and another piece of software called ZooKeeper constitute a typical Kafka deployment.

Let’s start with a single broker instance. In the core directory create your docker-compose.yml with the following content:

version: '3'
services:
  zookeeper:
    image: wurstmeister/zookeeper
    ports:
      - "2181:2181"
    networks:
      - logisland

  kafka:
    image: wurstmeister/kafka:0.10.2.1
    ports:
      - "9092:9092"
    environment:
      KAFKA_ADVERTISED_HOST_NAME: kafka
      KAFKA_CREATE_TOPICS: "test:1:1"
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./data:/tmp/data
    networks:
      - logisland

  logisland:
    image: hurence/logisland:latest
    entrypoint:
      - "tail"
      - "-f"
      - "bin/logisland.sh"
    ports:
      - '4050:4050'
    volumes:
      - ./conf:/opt/logisland/conf
    environment:
      KAFKA_HOME: /opt/kafka_2.11-0.10.2.2
      KAFKA_BROKERS: kafka:9092
      ZK_QUORUM: zookeeper:2181
      REDIS_CONNECTION: redis:6379
    networks:
      - logisland

  loggen:
    image: hurence/loggen:latest
    networks:
      - logisland
    environment:
      LOGGEN_NUM: 500
      KAFKA_BROKERS: kafka:9092

  redis:
    hostname: redis
    image: 'redis:latest'
    ports:
      - '6379:6379'
    networks:
      - logisland

volumes:
  kafka-home:

networks:
  logisland:

Start the Zookeeper, Kafka, Redis, Loggen and Logisland containers with the following command: docker-compose up -d
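
Before launching the job, you can quickly check that the services came up. Running docker-compose ps from the directory containing docker-compose.yml lists the containers and their state:

docker-compose ps

Every service should be reported as Up.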

6. Launch Logisland Application

Now you can run the job inside the logisland container:

sudo docker exec -ti logisland-quickstarts_logisland_1 ./bin/logisland.sh --conf conf/core/parse-apache-logs.yml

This will launch a Logisland Stream application waiting for data in a Kafka topic that will be automatically created.

The last logs should look something like:

Detected spark version 2.4.0.
We'll automatically plug in engine jar /opt/logisland-1.1.2/lib/engines/logisland-engine-spark_2_3-1.1.2.jar
Listening for transport dt_socket at address: 5005
log4j:WARN No such property [datePattern] in org.apache.log4j.RollingFileAppender.
log4j:WARN File option not set for appender [rolling].
log4j:WARN Are you using FileAppender instead of ConsoleAppender?
2019-05-29 17:01:06 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:ERROR No output stream or file set for the appender named [rolling].
2019-05-29 17:01:07 INFO  StreamProcessingRunner:43 - starting StreamProcessingRunner

██╗      ██████╗  ██████╗   ██╗███████╗██╗      █████╗ ███╗   ██╗██████╗
██║     ██╔═══██╗██╔════╝   ██║██╔════╝██║     ██╔══██╗████╗  ██║██╔══██╗
██║     ██║   ██║██║  ███╗  ██║███████╗██║     ███████║██╔██╗ ██║██║  ██║
██║     ██║   ██║██║   ██║  ██║╚════██║██║     ██╔══██║██║╚██╗██║██║  ██║
███████╗╚██████╔╝╚██████╔╝  ██║███████║███████╗██║  ██║██║ ╚████║██████╔╝
╚══════╝ ╚═════╝  ╚═════╝   ╚═╝╚══════╝╚══════╝╚═╝  ╚═╝╚═╝  ╚═══╝╚═════╝   v1.1.2
2019-03-19 16:08:47 INFO  StreamProcessingRunner:95 - awaitTermination for engine 1
2019-03-19 16:08:47 WARN  SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.
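
Once the application is running, you can check that the Kafka topics have been created (kafka.topic.autoCreate is set to true in the configuration). One way, assuming the default Compose container name used throughout this guide, is to list the topics from the kafka container:

sudo docker exec -ti logisland-quickstarts_kafka_1 /opt/kafka/bin/kafka-topics.sh --zookeeper zookeeper:2181 --list

Topics such as logisland_raw and logisland_events should appear in the list.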

7. Log mining

Now that the plumbing is up and running, let's see what happens.

7.1. Inject some Apache logs into Kafka

Logs are being sent to the logisland_raw Kafka topic to simulate a log collection system. There is nothing to do here: the loggen container already does it for you by generating random Apache logs.
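
If you want to make sure the generator is indeed running, you can tail its container logs from the directory containing docker-compose.yml:

docker-compose logs -f loggen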

You could also send the first 4000 lines of the NASA HTTP access log from July 1995 to the logisland_raw Kafka topic, using the Kafka console scripts available in the kafka container:

sudo docker exec -ti logisland-quickstarts_kafka_1 sh -c "cat /tmp/data/access.log | /opt/kafka/bin/kafka-console-producer.sh --broker-list kafka:9092 --topic logisland_raw"

If you want to see what is flowing into the logisland_raw topic, just run the following:

sudo docker exec -ti logisland-quickstarts_kafka_1 /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic logisland_raw

It should output something like:

68.25.236.19 - - [19/Jun/2019:15:05:36 +0200] "GET /app/main/posts HTTP/1.0" 200 4989 "http://hunter-clark.org/main/faq.php" "Mozilla/5.0 ...
64.252.158.21 - - [19/Jun/2019:15:05:36 +0200] "POST /search/tag/list HTTP/1.0" 200 5029 "http://lambert.com/" "Mozilla/5.0 ...
77.185.192.58 - - [19/Jun/2019:15:05:36 +0200] "GET /wp-content HTTP/1.0" 200 4976 "http://simmons.com/" "Mozilla/5.0 ...
157.242.110.206 - - [19/Jun/2019:15:05:36 +0200] "DELETE /wp-content HTTP/1.0" 200 4913 "http://gonzalez-collins.com/post/" "Mozilla/5.0 ...
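
If you prefer a bounded sample over a continuous stream, the console consumer also accepts the --from-beginning and --max-messages options, for example:

sudo docker exec -ti logisland-quickstarts_kafka_1 /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic logisland_raw --from-beginning --max-messages 10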

7.2. See what we have in return

Logisland handles the parsing of the log lines and structures them as Records before sending them to the logisland_events Kafka topic. You can consume this topic to inspect them:

sudo docker exec -ti logisland-quickstarts_kafka_1 /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic logisland_events

You should see Apache log events such as:

{
  "id" : "e80c7088-6676-4b56-9aaf-30cfbe1a4367",
  "type" : "apache_log",
  "creationDate" : 804571706000,
  "fields" : {
    "src_ip" : "166.79.67.111",
    "record_id" : "e80c7088-6676-4b56-9aaf-30cfbe1a4367",
    "http_method" : "GET",
    "record_value" : "166.79.67.111 - - [01/Jul/1995:00:08:26 -0400] \"GET /images/NASA-logosmall.gif HTTP/1.0\" 200 786",
    "http_query" : "/images/NASA-logosmall.gif",
    "bytes_out" : "786",
    "identd" : "-",
    "http_version" : "HTTP/1.0",
    "http_status" : "200",
    "record_time" : 804571706000,
    "user" : "-",
    "record_type" : "apache_log"
  }
}
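
If jq is installed on the machine where you run the consumer (an assumption: it is not part of the quickstart images), you can pipe the output through it to extract a single field from each event, for example:

sudo docker exec -i logisland-quickstarts_kafka_1 /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic logisland_events --max-messages 5 | jq '.fields.http_status'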

8. What’s going on behind the scenes

Open the following URL in your browser to get some metrics about the Spark Streaming throughput: http://localhost:4050/streaming/

Spark Streaming

The main metric here is the lag, or scheduling delay, which you want to keep as low as possible (near zero), or at least not constantly increasing; otherwise it is a sign that you need to scale your app, for instance by increasing the partitioning of the Kafka topics and the number of cores processing your data simultaneously. This is a huge topic in itself.
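
For instance, the two relevant knobs already present in conf/core/parse-apache-logs.yml are the Spark core count and the default partition count of the auto-created topics; the values below are only illustrative:

spark.master: local[8]
kafka.topic.default.partitions: 8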

Another metric is the processing time, which gives some insight into the processing capabilities of our setup. This is where Logisland's scalability shows: add more Kafka topic partitions and Spark processors and you will see how a more intense workload can be handled.

9. Stop the solution

Stop the Zookeeper, Kafka, Redis, Loggen and Logisland containers with the following command: docker-compose down
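
If you also want to remove the named volume declared in docker-compose.yml, add the -v flag:

docker-compose down -v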

10. What’s next?

This guide covered the creation of an application using Logisland. However, there is much more. We recommend continuing the journey with the event aggregation guide, where you learn how to aggregate metrics from your data.