Flume Installation and Usage

Flume Overview

  • A log collection and aggregation tool
  • Aggregates collected log data into HDFS for storage
  • Version used here: Flume 1.9.0

Flume Components

  • source: the data source (the data to be collected)
  • channel: a temporary buffer for in-flight data, usually held in memory
  • sink: the destination store, e.g. HDFS
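
The three components are wired together in an agent's `.properties` file. As a minimal sketch (the quickstart-style netcat/logger example from the Flume documentation; the agent name `a1`, host, and port are illustrative):

```sh
# a1.properties -- smallest useful agent: netcat source -> memory channel -> logger sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# source: read lines from a TCP socket
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# channel: buffer events in memory
a1.channels.c1.type = memory

# sink: print events to the Flume log (handy for smoke tests)
a1.sinks.k1.type = logger

# wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```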

Installation

  • Upload the installation archive

  • Extract it

    ```sh
    tar zxvf apache-flume-1.9.0-bin.tar.gz
    sudo mv apache-flume-1.9.0-bin /usr/
    ```
  • Configure environment variables in ~/.bash_profile

    ```sh
    JAVA_HOME=/usr/jdk1.8.0_231
    HADOOP_HOME=/usr/hadoop-3.2.1
    FLUME_HOME=/usr/apache-flume-1.9.0-bin
    PATH=$FLUME_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin:$PATH

    export JAVA_HOME
    export HADOOP_HOME
    export FLUME_HOME
    export PATH
    ```

    Note: run `source ~/.bash_profile` afterwards so the new variables take effect in the current shell.
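
    A quick sanity check, assuming the PATH above is active in the current shell:

    ```sh
    # should print the Flume 1.9.0 version banner
    flume-ng version
    ```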

  • Basic Flume configuration, $FLUME_HOME/conf/flume-env.sh

    ```sh
    $ cp flume-env.sh.template flume-env.sh
    $ vi flume-env.sh
    ```

    Set JAVA_HOME (line 22 of the template):

    ```sh
    export JAVA_OME=/usr/jdk1.8.0_231
    ```
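
    The same file can optionally tune the agent JVM; the template ships a commented `JAVA_OPTS` line for this. A minimal sketch, assuming the defaults are otherwise fine (heap sizes are illustrative):

    ```sh
    # optional: enlarge the agent's JVM heap (illustrative values)
    export JAVA_OPTS="-Xms100m -Xmx2000m"
    ```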
  • Resolve the guava jar conflict. Flume 1.9.0 ships guava-11.0.2 while Hadoop 3.2.1 ships guava-27.0-jre; leaving the old jar in place makes the HDFS sink fail at startup with `java.lang.NoSuchMethodError`, so replace Flume's copy with Hadoop's newer one:

    ```sh
    cd /usr/apache-flume-1.9.0-bin/lib/
    ll guava-11.0.2.jar
    -rw-rw-r-- 1 hadoop hadoop 1648200 Sep 13 2018 guava-11.0.2.jar

    cd /usr/hadoop-3.2.1/share/hadoop/common/lib/
    ll guava-27.0-jre.jar
    -rw-r--r-- 1 hadoop hadoop 2747878 Sep 10 2019 guava-27.0-jre.jar

    rm -rf /usr/apache-flume-1.9.0-bin/lib/guava-11.0.2.jar
    cp /usr/hadoop-3.2.1/share/hadoop/common/lib/guava-27.0-jre.jar /usr/apache-flume-1.9.0-bin/lib/
    ```
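
    After the swap, Flume's lib directory should contain only the newer guava; a quick check:

    ```sh
    # expect a single entry: guava-27.0-jre.jar
    ls /usr/apache-flume-1.9.0-bin/lib/ | grep guava
    ```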

Implementing Data Synchronization

  • Requirements

    • Collect data from the crawler server
  • Steps

    • Start the data collection (crawler) service
    • Start the HDFS service and make sure HDFS is readable and writable
    • Configure the agent (source, channel, sink)
    • Use the TAILDIR source, a feature added in Flume 1.7+: it records read positions in a JSON file, so no data is lost across restarts (see the position file example after the agent configuration below)
    • Write a shell launch script and run it
  • Configuration file

    ```sh
    cd /usr/apache-flume-1.9.0-bin/
    mkdir myconf
    cd myconf/
    vi flume-taildir-memory-hdfs.properties
    ```

Create the agent configuration file `flume-taildir-memory-hdfs.properties` with the following content:

```sh
# Name the components on this agent
hdfs_agent.sources = r1
hdfs_agent.sinks = k1
hdfs_agent.channels = c1

# Describe/configure the source
hdfs_agent.sources.r1.type = TAILDIR
hdfs_agent.sources.r1.filegroups = f1
hdfs_agent.sources.r1.filegroups.f1 = /home/hadoop/spider/data/collect/.*\.log
hdfs_agent.sources.r1.positionFile = /home/hadoop/spider/data/.flume/taildir_position.json

# Describe the sink
hdfs_agent.sinks.k1.type = hdfs
hdfs_agent.sinks.k1.hdfs.path = hdfs://hadoop:9000/flume/hdfs_filegroups_source/%Y-%m-%d/
hdfs_agent.sinks.k1.hdfs.rollInterval = 3600
hdfs_agent.sinks.k1.hdfs.rollSize = 1048576
hdfs_agent.sinks.k1.hdfs.rollCount = 0
hdfs_agent.sinks.k1.hdfs.filePrefix = log_file_%H
hdfs_agent.sinks.k1.hdfs.fileSuffix = .log
hdfs_agent.sinks.k1.hdfs.fileType = DataStream
hdfs_agent.sinks.k1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in memory
hdfs_agent.channels.c1.type = memory
hdfs_agent.channels.c1.capacity = 1000
hdfs_agent.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
hdfs_agent.sources.r1.channels = c1
hdfs_agent.sinks.k1.channel = c1
```

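For reference, the TAILDIR source persists its read offsets in the `positionFile` configured above. Once the agent has processed data, the file holds one JSON record per tailed file; the values below are illustrative, not captured output:

```sh
# /home/hadoop/spider/data/.flume/taildir_position.json (illustrative)
[{"inode":2496272,"pos":12345,"file":"/home/hadoop/spider/data/collect/a.log"}]
```
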
Create the HDFS directory:

```sh
hdfs dfs -mkdir -p /flume/hdfs_filegroups_source/
```
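
A quick check, assuming HDFS is up (the path matches the sink configuration above):

```sh
# the directory should be listed and writable by the user running Flume
hdfs dfs -ls /flume/
```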
  • Write the launch script: create a `mysbin` directory under the installation directory and add `start_taildir_memory_hdfs.sh` with the following content:

    ```sh
    $ cd /usr/apache-flume-1.9.0-bin/
    $ mkdir mysbin
    $ cd mysbin
    $ vi start_taildir_memory_hdfs.sh
    ```

    ```sh
    #!/bin/bash

    # resolve the Flume installation root from this script's location
    ROOT_PATH=$(dirname $(dirname $(readlink -f $0)))
    cd $ROOT_PATH

    # start the agent named hdfs_agent with the TAILDIR -> memory -> HDFS config
    bin/flume-ng agent --conf ./conf/ -f myconf/flume-taildir-memory-hdfs.properties -Dflume.root.logger=INFO,console -n hdfs_agent
    ```

    Make the script executable:

    ```sh
    chmod 755 start_taildir_memory_hdfs.sh
    ```

    Run the `start_taildir_memory_hdfs.sh` script in the background:

    ```sh
    nohup ./start_taildir_memory_hdfs.sh &
    ```
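
    Once the agent is running, a few sanity checks may help; these are suggestions beyond the original walkthrough, with paths following the configuration above:

    ```sh
    # confirm the agent JVM is up
    pgrep -f flume-taildir-memory-hdfs

    # follow the console log captured by nohup
    tail -f nohup.out

    # after new .log files appear in the watched directory,
    # events should show up in HDFS under today's date
    hdfs dfs -ls /flume/hdfs_filegroups_source/$(date +%Y-%m-%d)/

    # Flume 1.x ships no stop script; stop the agent by killing its process
    kill $(pgrep -f flume-taildir-memory-hdfs)
    ```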
Author: 江风引雨 · Published 2020-07-22 · Updated 2023-01-10 · License: CC BY 4.0