syntax = "proto3";
package parroter;

service ParrotService {
      rpc say(ParrotRequest) returns (ParrotResponse) {}
}

message ParrotRequest {
      string msg = 1;
}

message ParrotResponse {
      string msg = 1;
      int32  count = 2;
}

Generate Ruby code from the .proto file.

grpc_tools_ruby_protoc -Iproto --ruby_out=lib --grpc_out=lib proto/parroter.proto

Simply requiring the generated files is enough to build a gRPC client and server, but I would prefer to keep the interface separate from the implementation.

As described in Building Microservices using gRPC on Ruby – Shiladitya Mandal – Software Developer, generating a private gem seems like a good approach. The directory layout ended up as follows.

$ tree .
.
├── Gemfile
├── LICENSE.txt
├── README.md
├── Rakefile
├── bin
│   ├── console
│   └── setup
├── lib
│   ├── parroter
│   │   └── version.rb
│   ├── parroter.rb
│   ├── parroter_pb.rb
│   └── parroter_services_pb.rb
├── parroter.gemspec
└── proto
    └── parroter.proto

https://github.com/kotaroito/grpc-parroter-service

gRPC Server

Using the private gem I created, I built a gRPC server with https://shiladitya-bits.github.io/Building-Microservices-from-scratch-using-gRPC-on-Ruby as a reference.

Gemfile

source 'https://rubygems.org'

gem 'parroter',:git => "https://github.com/kotaroito/grpc-parroter-service",:branch => 'master'
gem 'grpc', '1.7.0.pre1'

bin/start_server.rb

#!/usr/bin/env ruby

require 'grpc'
require 'parroter_services_pb'

class ParrotServer
  class << self
    def start
      start_grpc_server
    end

    private
    def start_grpc_server
      @server = GRPC::RpcServer.new
      @server.add_http2_port("0.0.0.0:50052", :this_port_is_insecure)
      @server.handle(ParrotService)
      @server.run_till_terminated
    end
  end
end

class ParrotService < Parroter::ParrotService::Service
  def initialize
    @count = {}
  end

  def say(parrot_req, _unused_call)
    p parrot_req
    Parroter::ParrotResponse.new(msg: parrot_req.msg, count: count_msg(parrot_req.msg))
  end

  private

  def count_msg(msg)
    @count[msg] = 0 unless @count[msg]
    @count[msg] += 1
  end
end

ParrotServer.start
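
GRPC::RpcServer handles requests on a thread pool, so a plain Hash counter like @count could in principle be updated from several threads at once. The following is a minimal sketch of a mutex-guarded counter, assuming that concern applies here. The class name MessageCounter is hypothetical and not part of the original service.

```ruby
# Hypothetical helper: counts how often each message was seen,
# guarding the Hash with a Mutex so concurrent callers cannot race.
class MessageCounter
  def initialize
    @counts = Hash.new(0) # missing keys start at 0
    @lock = Mutex.new
  end

  # Increments the counter for msg and returns the new value.
  def increment(msg)
    @lock.synchronize { @counts[msg] += 1 }
  end
end

counter = MessageCounter.new
threads = 4.times.map do
  Thread.new { 100.times { counter.increment('Hello gRPC.') } }
end
threads.each(&:join)
puts counter.increment('Hello gRPC.')
```

In the service above, count_msg could delegate to such an object instead of touching @count directly.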

bin/test_parrot_service

#!/usr/bin/env ruby
require 'grpc'
require 'parroter_services_pb'

def test_single_call
  stub = Parroter::ParrotService::Stub.new('0.0.0.0:50052', :this_channel_is_insecure)
  req = Parroter::ParrotRequest.new(msg: 'Hello gRPC.')
  resp_obj = stub.say(req)
  p resp_obj
end

test_single_call

After starting the gRPC server, running the client a few times gives the following.

$ bundle exec bin/test_parrot_service
<Parroter::ParrotResponse: msg: "Hello gRPC.", count: 1>

$ bundle exec bin/test_parrot_service
<Parroter::ParrotResponse: msg: "Hello gRPC.", count: 2>

The count goes up with each call.

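Assorted things I was curious about

Now that I know a gRPC server can be written in Ruby, I looked into a few things I was curious about.

As of October 2017, GRPC::RpcServer, which ships with the "grpc" gem, appears to be the only server option (if you are writing it in Ruby). There is no comprehensive documentation, so reading the source on GitHub seems to be the best way to learn the configuration options.

Reading https://github.com/grpc/grpc/blob/master/src/ruby/lib/grpc/generic/rpc_server.rb#L205-L210 suggests that options such as the thread pool_size and poll_period can be configured.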
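
As a rough sketch of how those options might be passed, assuming the keyword arguments pool_size and poll_period named in rpc_server.rb: the values and the helper name build_parrot_server below are made up for illustration, so check the source for the defaults and exact semantics in your gem version.

```ruby
# Assumed values for illustration only; tune them for your workload.
SERVER_OPTIONS = {
  pool_size: 10,  # size of the server's thread pool
  poll_period: 1  # poll period in seconds (see rpc_server.rb for its exact meaning)
}.freeze

# Hypothetical helper that builds a server with the options above.
# The grpc gem is required lazily, only when a server is actually built.
def build_parrot_server(options = SERVER_OPTIONS)
  require 'grpc'
  GRPC::RpcServer.new(**options)
end
```

The returned server would then be used exactly like @server in the listing above (add_http2_port, handle, run_till_terminated).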

Interceptor

The gRPC counterpart to Rack middleware appears to be called an Interceptor. It was not in version 1.6.7 of the grpc gem, but it has been implemented since 1.7.0.pre1. Note that it is marked as an EXPERIMENTAL API.

https://github.com/grpc/grpc/blob/master/src/ruby/lib/grpc/generic/rpc_server.rb#L200-L203

It is extremely bare-bones, but I tried using an Interceptor.

class ParrotServer
  class << self
    def start
      start_grpc_server
    end

    private
    def start_grpc_server
      @server = GRPC::RpcServer.new(interceptors:[HelloInterceptor.new])
      @server.add_http2_port("0.0.0.0:50052", :this_port_is_insecure)
      @server.handle(ParrotService)
      @server.run_till_terminated
    end
  end
end

class ParrotService < Parroter::ParrotService::Service
   ...snip...
end

class HelloInterceptor < ::GRPC::ServerInterceptor
  def request_response(request:, call:, method:)
    p "Received request/response call at method #{method}" \
      " with request #{request} for call #{call}"
    call.output_metadata[:interc] = 'from_request_response'
    p "[GRPC::Ok] (#{method.owner.name}.#{method.name})"
    yield
  end
end

ParrotServer.start
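
To make the Rack middleware analogy concrete, here is a plain-Ruby sketch of how an interceptor can wrap a handler with yield. It does not use the grpc gem and is not how the gem implements interceptors internally. LoggingInterceptor and run_with_interceptors are hypothetical names.

```ruby
# Plain-Ruby illustration of the interceptor idea; no grpc gem involved.
class LoggingInterceptor
  attr_reader :log

  def initialize
    @log = []
  end

  # Runs code before and after the wrapped handler, like Rack middleware.
  def request_response(request:)
    @log << "before #{request}"
    response = yield
    @log << "after #{response}"
    response
  end
end

# Wraps the handler block in each interceptor; the first one ends up outermost.
def run_with_interceptors(interceptors, request, &handler)
  chain = interceptors.reverse.inject(handler) do |inner, interceptor|
    -> { interceptor.request_response(request: request) { inner.call } }
  end
  chain.call
end

interceptor = LoggingInterceptor.new
result = run_with_interceptors([interceptor], 'Hello gRPC.') { 'Hello gRPC.'.upcase }
puts result
puts interceptor.log
```

The key point is that the interceptor must yield and return the handler's result, which is what HelloInterceptor above does with its final yield.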

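Thoughts

What I like about gRPC is that the interface can be declared explicitly, and that doing so is easy. All the more so because I have experience with JSON Schema...

There is no detailed documentation at the moment, but as long as you are willing to read the source code, I get the feeling that a production-ready gRPC server can be written in Ruby.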


2017-10-13

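Installing Presto on Mac OS and trying it out

presto

After BigQuery and Athena, this time I tried Presto, so here are my notes.

What is Presto?

tug.red

Installation and configuration

2.1. Deploying Presto — Presto 0.185 Documentation describes installing from a tarball, but Presto can also be installed with brew. Java 1.8 or later is required.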

brew install presto

This installs version 0.185 (as of October 2017). The configuration is almost entirely the default, though as far as I recall I disabled one option in jvm.config.

/usr/local/Cellar/presto/0.185/libexec/etc/jvm.config

-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError

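Start the server with:

presto-server run

Connecting to MySQL

One of Presto's selling points is that, as long as a Connector exists (or you write one), you can query any data source directly with SQL. I'll try MySQL, which is the easiest option (for me).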

/usr/local/Cellar/presto/0.185/libexec/etc/catalog/mysql.properties

connector.name=mysql
connection-url=jdbc:mysql://localhost:3306
connection-user=root
connection-password=

With the setup done, install the CLI and connect to Presto.

2.2. Command Line Interface — Presto 0.185 Documentation

./presto --server localhost:8080 --catalog mysql

Below are the results of actually running queries against the world database.

presto:world> use world;
presto:world> SHOW TABLES;
      Table
-----------------
 city
 country
 countrylanguage
(3 rows)

Query 20171013_134538_00010_ra2dz, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
0:00 [3 rows, 71B] [11 rows/s, 271B/s]

presto:world> SELECT * FROM city LIMIT 10;
 id |                name                 | countrycode |       district       | population
----+-------------------------------------+-------------+----------------------+------------
  1 | Kabul                               | AFG         | Kabol                |    1780000
  2 | Qandahar                            | AFG         | Qandahar             |     237500
  3 | Herat                               | AFG         | Herat                |     186800
  4 | Mazar-e-Sharif                      | AFG         | Balkh                |     127800
  5 | Amsterdam                           | NLD         | Noord-Holland        |     731200
  6 | Rotterdam                           | NLD         | Zuid-Holland         |     593321
  7 | Haag                                | NLD         | Zuid-Holland         |     440900
  8 | Utrecht                             | NLD         | Utrecht              |     234323
  9 | Eindhoven                           | NLD         | Noord-Brabant        |     201843
 10 | Tilburg                             | NLD         | Noord-Brabant        |     193238
(10 rows)

Query 20171013_134553_00011_ra2dz, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
0:01 [4.08K rows, 0B] [5.3K rows/s, 0B/s]

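Thoughts

Querying MySQL from Presto was relatively easy.

There is also 5.4. Hive Connector — Presto 0.185 Documentation, which apparently allows queries against logs stored in Amazon S3. (I meant to try it, but gave up because Hadoop, Hive, and so on require a lot of preparation.)

For actual production use, setup and operations look like a lot of work, so a managed service such as Amazon EMR seems like the better choice.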

2017-10-06

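About Amazon Athena

aws athena

I have only played with it casually, so I want to briefly summarize its features and use cases.

What are its features?

Data in Amazon S3 can be analyzed with standard SQL
No servers to set up or manage
A schema definition is required, but data does not need to be loaded in advance
The SQL engine is Presto

The Amazon Athena FAQ on the AWS site (よくある質問 - Amazon Athena | AWS) covers a lot more.

What are the use cases?

A slide by an AWS Solutions Architect explains this clearly.

f:id:kotaroito2002:20171006091349p:plain

Source: Presto ベースのマネージドサービス Amazon Athena

Presto, which Athena uses as its SQL engine, supports a wide variety of data sources, but Athena targets only S3 (I believe), so I imagine the typical usage is to store logs in S3 that are rarely referenced but might be needed someday, and then analyze them quickly with Athena when that time comes.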

2017-10-05

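What makes BigQuery so impressive?

This has probably been discussed to death already, but things don't stick unless I summarize them myself, so I'll write up the key points, partly as a way of organizing the documentation.

What is BigQuery?

Google Cloud Platform introduces it as "BigQuery is Google's fully managed, petabyte scale, low cost analytics data warehouse.", and the primary use case it has in mind is interactive, ad hoc, trial-and-error querying of large-scale data.

The cases described in Technical White Paper: An Inside Look at Google BigQuery are more concrete.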

Can you imagine how Google handles this kind of Big Data during daily operations? Just to give you an idea, consider the following scenarios:

What if a director suddenly asks, “Hey, can you give me yesterday’s number of impressions for AdWords display ads – but only in the Tokyo region?”.
Or, “Can you quickly draw a graph of AdWords traffic trends for this particular region and for this specific time interval in a day?”

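For an ordinary web service this would be one thing, but achieving it at Google's scale is quite hard. Google Search, for example, deals with websites from all over the world...

So Google developed a technology called Dremel internally, and BigQuery is what exposes it to the outside world.

What makes BigQuery so impressive?

In a nutshell: even a full scan of a dataset in the 100-billion-row class, with no advance preparation whatsoever, returns results in tens of seconds.

Hearing that fact alone didn't click with me at first, but reading Anatomy of a BigQuery Query | Google Cloud Big Data and Machine Learning Blog | Google Cloud Platform makes it clear that BigQuery is operating on a different level.

See the article for details, but running a regular expression match against a 100-billion-row table and returning the result within 30 seconds requires, at a minimum, the following: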

About 330 100MB/sec dedicated hard-drives to read 1TB of data
A 330 Gigabit network to shuffle the 1.25 TB of data
3,300 cores to uncompress 1TB of data and process 100 billion regular expressions at 1 μsec per

All of these resources are running at the same time.
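
The quoted figures can be sanity-checked with simple arithmetic. The sketch below assumes the 30-second budget mentioned above and decimal units (1 TB = 10^12 bytes); the method names are made up for this example.

```ruby
# Back-of-the-envelope check of the quoted figures.
# Assumption: everything must finish within a 30-second budget.
BUDGET_SEC = 30.0

# Drives needed to read 1 TB at 100 MB/s per drive.
def drives_needed(bytes: 1e12, drive_bytes_per_sec: 100e6)
  bytes / drive_bytes_per_sec / BUDGET_SEC
end

# Network bandwidth in Gbit/s needed to shuffle 1.25 TB.
def network_gbps(bytes: 1.25e12)
  bytes * 8 / 1e9 / BUDGET_SEC
end

# Cores needed for 100 billion regex matches at 1 microsecond each.
def cores_needed(rows: 100e9, sec_per_row: 1e-6)
  rows * sec_per_row / BUDGET_SEC
end

puts drives_needed.round  # on the order of 330 drives
puts network_gbps.round   # on the order of 330 Gbit/s
puts cores_needed.round   # on the order of 3,300 cores
```

Each result is the total work divided by the per-unit rate and then by the time budget, which is where the "about 330" and "3,300" in the article come from.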

Being able to use resources on this scale with no advance preparation, just by pressing a button... BigQuery is formidable.

A look inside BigQuery

This raises the question of how on earth it is achieved.

Reading Technical White Paper: An Inside Look at Google BigQuery and Dremel: Interactive Analysis of Web-Scale Datasets gives a glimpse of BigQuery's internals. According to the white paper, its unprecedented level of performance is supported by two core technologies.

1. Columnar Storage

f:id:kotaroito2002:20171005083340p:plain Source: Dremel: Interactive Analysis of Web-Scale Datasets

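BigQuery uses columnar storage. The idea itself is a common design in databases built for data warehousing. What sets BigQuery (or Dremel) apart is that it harnesses the computing power of thousands of servers and offers it as a cloud service.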
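
As a toy illustration of the columnar idea (not BigQuery's actual storage format), the sketch below turns row-oriented records into one array per column. The data and the to_columns helper are made up for this example.

```ruby
# Toy data: three records with three fields each.
ROWS = [
  { id: 1, name: 'Kabul',     population: 1_780_000 },
  { id: 2, name: 'Amsterdam', population: 731_200 },
  { id: 3, name: 'Rotterdam', population: 593_321 }
].freeze

# Converts row-oriented records into one array per column.
def to_columns(rows)
  rows.each_with_object(Hash.new { |h, k| h[k] = [] }) do |row, columns|
    row.each { |field, value| columns[field] << value }
  end
end

columns = to_columns(ROWS)
# An aggregate over one field only has to touch that column's array.
puts columns[:population].sum
```

A query that reads only population never has to look at id or name, which is the main reason analytic databases favor this layout.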

2. Tree Architecture

The design challenge is to distribute query execution across thousands of machines and then aggregate the final result within seconds. Dremel apparently uses a tree architecture for this.

f:id:kotaroito2002:20171005085043p:plain

Source: BigQuery under the hood | Google Cloud Big Data and Machine Learning Blog | Google Cloud Platform

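The leaves of the tree are called "slots" and are responsible for reading data and performing computation, while the branches are called "mixers" and are responsible for aggregation.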
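
The division of labor between slots and mixers can be sketched as a small tree aggregation. This is only a toy model that runs sequentially in one process, whereas the real system runs slots in parallel across many machines. All names and parameters here are hypothetical.

```ruby
# Toy model of the tree: leaves scan their shard, mixers merge partial results.

# A "slot" counts the rows in its shard that match a pattern.
def slot_count(shard, pattern)
  shard.count { |row| row.match?(pattern) }
end

# A "mixer" only adds up the partial counts from its children.
def mixer_sum(partial_counts)
  partial_counts.sum
end

# Splits rows into shards, runs one slot per shard, then aggregates in two levels.
def tree_count(rows, pattern, shard_size: 2, fanout: 2)
  partials = rows.each_slice(shard_size).map { |shard| slot_count(shard, pattern) }
  intermediate = partials.each_slice(fanout).map { |group| mixer_sum(group) }
  mixer_sum(intermediate) # root mixer
end

rows = %w[Kabul Qandahar Herat Amsterdam Rotterdam Haag Utrecht]
puts tree_count(rows, /dam\z/)
```

Because each mixer only sees small partial results, the amount of data moving up the tree stays small no matter how many rows the slots scanned.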

Handling nested and repeated data

As described in Loading Data | BigQuery | Google Cloud Platform and Querying Nested and Repeated Fields in Legacy SQL | BigQuery | Google Cloud Platform, BigQuery can handle JSON with nested and repeated fields.

f:id:kotaroito2002:20171005232751p:plain
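
To show what "nested" and "repeated" mean in practice, here is a small sketch with a made-up record: address is a nested object and phones is a repeated field. The flatten_phones helper is hypothetical and only mimics, in plain Ruby, the idea of producing one flat row per repeated element.

```ruby
require 'json'

# Hypothetical record with a nested object and a repeated field.
RECORD_JSON = <<~JSON
  {
    "name": "Alice",
    "address": { "city": "Tokyo", "country": "JP" },
    "phones": [
      { "type": "home", "number": "000-0000" },
      { "type": "work", "number": "111-1111" }
    ]
  }
JSON

# Produces one flat row per element of the repeated field.
def flatten_phones(record)
  record.fetch('phones').map do |phone|
    {
      'name' => record['name'],
      'city' => record.dig('address', 'city'),
      'phone_type' => phone['type'],
      'phone_number' => phone['number']
    }
  end
end

flatten_phones(JSON.parse(RECORD_JSON)).each { |row| puts row.to_json }
```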

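Summary, or rather, thoughts

All I did was read the documentation and summarize it, but I feel I now understand a little of what goes on inside BigQuery. In short, it is simply amazing.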


2017-09-30

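Loading Rails logs into BigQuery or S3 with fluentd

fluentd bigquery

I wanted a chance to actually try GCP BigQuery and AWS Athena, so I'll load Rails logs into BigQuery and S3 with fluentd. The goal is simply to try them out; whether this is practical is a question I'll set aside for now.

Note that I have fluentd (td-agent) 0.14.21 installed on OS X.

Rails fluent logger

By default Rails writes its logs to the log directory, so they need to be made available to fluentd.

Following the documentation, Collecting and Analyzing Ruby on Rails Logs | Fluentd, I'll use lograge and act-fluent-logger-rails.

The configuration below is almost exactly what the documentation describes.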

config/application.rb

  class Application < Rails::Application
    config.log_level = :info
    config.logger = ActFluentLoggerRails::Logger.new
    config.lograge.enabled = true
    config.lograge.formatter = Lograge::Formatters::Json.new
    config.lograge.custom_options = lambda do |event|
      exceptions = %w(controller action format id)
      {
        params: event.payload[:params].except(*exceptions)
      }
    end
  end

custom_options is used to include the request parameters in the log output.
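
The effect of the except call in custom_options can be sketched without Rails. The example below uses Hash#reject instead, so that it runs without ActiveSupport; the sample params and the loggable_params helper are made up for illustration.

```ruby
# Keys that lograge already reports separately, as in the config above.
EXCLUDED_KEYS = %w[controller action format id].freeze

# Returns the request params without the excluded keys.
# Hash#reject is used so the sketch runs without ActiveSupport.
def loggable_params(params)
  params.reject { |key, _value| EXCLUDED_KEYS.include?(key) }
end

params = { 'controller' => 'books', 'action' => 'show', 'id' => '6', 'q' => 'ruby' }
puts loggable_params(params)
```

Only the q parameter survives here, which keeps the params field in each log line limited to what the request actually carried.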

config/fluent-logger.yml

development:
  fluent_host:   '127.0.0.1'
  fluent_port:   24224
  tag:           'foo'
  messages_type: 'string'

That completes the application configuration.

fluentd + BigQuery

Next, send the logs to BigQuery using fluent-plugin-bigquery.

/etc/td-agent/td-agent.conf

<match foo>
  @type parser
  key_name messages
  format json
  tag rails
</match>

<filter rails>
  @type record_transformer
  remove_keys location 
</filter>

<match rails>
  @type copy
  <store>
    @type stdout
  </store>
  <store>
    @type bigquery
    method insert
    auth_method json_key
    json_key /path/to/json_key

    auto_create_table true

    project your-project-id
    dataset rails_playground
    table   logs
    schema [
      {"name": "method", "type": "STRING"},
      {"name": "path", "type": "STRING"},
      {"name": "format", "type": "STRING"},
      {"name": "controller", "type": "STRING"},
      {"name": "action", "type": "STRING"},
      {"name": "status", "type": "INTEGER"},
      {"name": "duration", "type": "FLOAT"},
      {"name": "view", "type": "FLOAT"},
      {"name": "db", "type": "FLOAT"},
      {"name": "params", "type": "STRING"}
    ]
  </store>
</match>
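
Conceptually, the first two blocks of this configuration parse the JSON string held in the messages field into a structured record and then drop the location key. The following plain-Ruby sketch mimics those two steps under that reading. It is not fluentd code, and the sample record and helper names are made up.

```ruby
require 'json'

# Rough equivalent of the parser step: take the JSON string stored under
# key_name ("messages") and turn it into a structured record.
def parse_messages(record, key_name: 'messages')
  JSON.parse(record.fetch(key_name))
end

# Rough equivalent of record_transformer with remove_keys.
def remove_keys(record, keys)
  record.reject { |key, _value| keys.include?(key) }
end

# Made-up input in the shape act-fluent-logger-rails is configured to send.
raw = { 'messages' => '{"method":"GET","path":"/books","status":200,"location":null}' }
puts remove_keys(parse_messages(raw), ['location'])
```

After these two steps the record has the flat method, path, and status fields that the BigQuery schema expects.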

The BigQuery schema is configured as well.

f:id:kotaroito2002:20170922090630p:plain

Here is the result of actually running a query.

f:id:kotaroito2002:20170922091321p:plain

fluentd + S3

Next, I'll store the logs in S3 with fluentd. Start with the td-agent configuration; following Amazon S3 Output Plugin | Fluentd is all it takes.

<match pattern>
  @type s3

  aws_key_id YOUR_AWS_KEY_ID
  aws_sec_key YOUR_AWS_SECRET_KEY
  s3_bucket YOUR_S3_BUCKET_NAME
  s3_region ap-northeast-1


  path kotaroito/rails-playground/dt=%Y-%m-%d/
  s3_object_key_format %{path}%{time_slice}_%{hostname}_%{index}.%{file_extension}

  <buffer tag,time>
    @type file
    path /var/log/td-agent/s3
    timekey 60
    timekey_wait 1m
    timekey_use_utc true # use utc
  </buffer>

  format json
  include_time_key true
</match>

To speed up verification, timekey is set to 60 (seconds). With this in place, logs are saved to S3 as follows.

{"method":"GET","path":"/books","format":"html","controller":"BooksController","action":"index","status":200,"duration":184.47,"view":156.19,"db":1.52,"params":{},"time":"2017-09-27T14:00:11Z"}
{"method":"GET","path":"/books/6","format":"html","controller":"BooksController","action":"show","status":200,"duration":43.6,"view":37.93,"db":0.34,"params":{},"time":"2017-09-27T14:00:21Z"}
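
Since each line is a standalone JSON object, the files can also be inspected locally before any query engine is involved. Below is a minimal sketch using two abbreviated lines in the same shape as the output above. The helper names are hypothetical.

```ruby
require 'json'

# Two lines in the same shape as the sample output above (fields abbreviated).
LOG_LINES = <<~LOG
  {"method":"GET","path":"/books","status":200,"duration":184.47}
  {"method":"GET","path":"/books/6","status":200,"duration":43.6}
LOG

# Parses newline-delimited JSON into an array of hashes.
def parse_log(text)
  text.each_line.map { |line| JSON.parse(line) }
end

# Returns the path of the slowest request.
def slowest_path(entries)
  entries.max_by { |entry| entry['duration'] }['path']
end

puts slowest_path(parse_log(LOG_LINES))
```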

Athena

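Just storing logs in S3 isn't much fun, so I'll try Athena, which I've been curious about for a while.

Amazon Athena (serverless interactive query service) | AWS

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there is no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.

Athena provides a tutorial in Getting Started — User Guide; working through it first gives you a feel for the service.

The first step is to run CREATE TABLE: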

CREATE EXTERNAL TABLE IF NOT EXISTS default.rails_logs (
  `method` string,
  `path` string,
  `format` string,
  `controller` string,
  `action` string,
  `status` int,
  `duration` float,
  `view` float,
  `db` float,
  `params` map<string,string>,
  `time` string 
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://your-bucket-name/path_to_log_dir/'
TBLPROPERTIES ('has_encrypted_data'='false')

After that, just run a query.

SELECT * FROM default."rails_logs" limit 10

f:id:kotaroito2002:20170929085434p:plain

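See, easy, right?

I'd like to say so, but after digging around, there is no dream scenario where you "just drop logs in S3 and Athena takes care of the rest." If you want to run it properly, there seems to be a lot to think about: partitioning, log format, storing query results, and so on.

qiita.com

For logs that aren't looked at day to day, stored in S3 and analyzed ad hoc when the need arises, I think Athena could be one option. (Whether Athena is the best answer for other use cases is something I won't be able to judge until I study it more.)

Summary, or rather, thoughts

With lograge and act-fluent-logger-rails, loading Rails logs into BigQuery and S3 was easy. However, lograge does not cover exception logging (per its FAQ), so a separate mechanism such as an exception tracking service (airbrake.io, for example) seems necessary.

I'd like to explore a bit more what use cases Athena fits.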

2017-09-17

Installing td-agent on Mac OS X

td-agent

Basically, just follow the documentation and install from the dmg.

docs.fluentd.org

I had to look up a few things myself, so I'm leaving them here as FAQ-style notes.

How do I start and stop it?

sudo launchctl start td-agent
sudo launchctl stop td-agent

How do I install plugins?

Use the fluent-gem command bundled with td-agent:

sudo /opt/td-agent/embedded/bin/fluent-gem install fluent-plugin-parser