[B! parquet] kimutansk's bookmarks

Read few parquet files at the same time in Spark

kimutansk 2017/04/14

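So for text files, multiple paths are given as a single comma-separated string, while for parquet and orc they are passed as variable-length arguments... And passing a comma-separated string to parquet does fail, as expected.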
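The difference between the two calling conventions can be sketched in Python. This is a minimal illustration, assuming a PySpark-style API in which the text reader accepts one comma-separated string while the parquet reader takes each path as a separate argument. `split_paths` is a hypothetical helper, and the Spark calls appear only in comments so the block runs without Spark installed.

```python
from typing import List


def split_paths(comma_separated: str) -> List[str]:
    """Split a comma-separated path string into individual paths."""
    return [p.strip() for p in comma_separated.split(",") if p.strip()]


paths = split_paths("/data/a.parquet, /data/b.parquet")
print(paths)

# With a SparkSession named `spark` (assumed here, not created):
#   spark.read.parquet(*paths)                               # one argument per path
#   spark.sparkContext.textFile("/data/a.txt,/data/b.txt")   # one comma-separated string
```

Unpacking the list with `*paths` is what turns the comma-separated form into the variable-length form that the parquet reader expects.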

Using Apache Parquet Data Files with CDH | 6.3.x | Cloudera Documentation

Apache Parquet is a columnar storage format available to any component in the Hadoop ecosystem, regardless of the data processing framework, data model, or programming language. The Parquet file format incorporates several features that support data warehouse-style operations: Columnar storage layout - A query can examine and perform calculations on all values for a column while reading only a sma

kimutansk 2016/08/04

It's nice to be able to see at a glance how to read it...

Cannot saveAsParquetFile from a RDD of case class

kimutansk 2015/11/18

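A DataFrame carries a schema to begin with, so it can be written out as Parquet as-is. This looks like the way to go. But it seems a value object class has to be specified beforehand...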
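Why a value object class is needed up front can be illustrated with a small sketch: the schema (field names and types) is read from the class definition rather than from the data. The example below is an analogy only, assuming a Python dataclass as a stand-in for a Scala case class. `Person` and `schema_of` are hypothetical names, not part of Spark.

```python
from dataclasses import dataclass, fields


@dataclass
class Person:  # hypothetical value object, standing in for a Scala case class
    name: str
    age: int


def schema_of(cls) -> list:
    """Return (field name, type name) pairs read from the dataclass definition."""
    return [
        (f.name, f.type if isinstance(f.type, str) else f.type.__name__)
        for f in fields(cls)
    ]


print(schema_of(Person))  # [('name', 'str'), ('age', 'int')]
```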

[SPARK-3368] Spark cannot be used with Avro and Parquet - ASF JIRA

kimutansk 2015/11/18

Hm. So Parquet can't be written directly from a Spark RDD? If so, that looks fairly troublesome.

Apache Spark User List - Kafka->HDFS to store as Parquet format

kimutansk 2015/11/18

So this is how to write Parquet with an explicit schema from a Spark batch application. Since the key is Void, does that mean effectively any RDD will work?

parquet-compatibility/parquet-testdata/tpch at master · Parquet/parquet-compatibility

kimutansk 2014/11/21

On Parquet schemas: there are files here with a .schema extension. Can a schema be defined with these...? I wonder.

Parquet

parquet-compatibility/parquet-compat/src/test/java/parquet/compat/test/ConvertUtils.java at master · Parquet/parquet-compatibility

kimutansk 2014/11/14

This is code that converts data to Parquet... It converts CSV as-is, but what happens to things like the header information...?

Cloudera Blog

We are excited to announce the acquisition of Octopai, a leading data lineage and catalog platform that provides data discovery and governance for enterprises to enhance their data-driven decision making. Cloudera’s mission since its inception has been to empower organizations to transform all their data to deliver trusted, valuable, and predictive insights. With AI and […]

kimutansk 2014/10/28

So generating it from an Avro schema is the basic approach.

Parquet
data
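As a rough illustration of deriving a Parquet schema from an Avro schema, the sketch below converts a flat Avro record definition into Parquet message-type text. It is a simplified assumption-laden example, not the parquet-avro implementation: it handles only the primitive types listed in the table and nullable unions such as `["null", "int"]`, and the `User` record and `to_parquet_schema` function are hypothetical.

```python
import json

# Simplified primitive mapping (assumption: only these Avro types are handled).
AVRO_TO_PARQUET = {
    "boolean": "boolean",
    "int": "int32",
    "long": "int64",
    "float": "float",
    "double": "double",
    "string": "binary",
    "bytes": "binary",
}

# Hypothetical Avro record schema used for illustration.
USER_AVRO = """
{"type": "record", "name": "User", "fields": [
  {"name": "name", "type": "string"},
  {"name": "age", "type": ["null", "int"]}
]}
"""


def to_parquet_schema(avro_json: str) -> str:
    """Render a flat Avro record schema as Parquet message-type text."""
    avro = json.loads(avro_json)
    lines = ["message %s {" % avro["name"]]
    for field in avro["fields"]:
        ftype = field["type"]
        repetition = "required"
        if isinstance(ftype, list):
            # Assumes a two-branch union; a "null" branch makes the field optional.
            if "null" in ftype:
                repetition = "optional"
            ftype = [t for t in ftype if t != "null"][0]
        annotation = " (UTF8)" if ftype == "string" else ""
        lines.append(
            "  %s %s %s%s;" % (repetition, AVRO_TO_PARQUET[ftype], field["name"], annotation)
        )
    lines.append("}")
    return "\n".join(lines)


print(to_parquet_schema(USER_AVRO))
```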

Dremel made simple with Parquet

Columnar storage is a popular technique to optimize analytical workloads in parallel RDBMs. The performance and compression benefits for storing and processing large amounts of data are well documented in academic literature as well as several commercial analytical databases. The goal is to keep I/O to a minimum by reading from a disk only the data required for the query. Using Parquet at Twitter,

kimutansk 2014/10/24

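So in Parquet, nested data is represented this way, using repetition levels (where a repeat occurs) and definition levels (the nesting level down to which a value is defined).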
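A small sketch can make the two levels concrete. It assumes a schema with two nested repeated fields (a repeated group `level1` containing a repeated string `level2`), so each record is modeled as a list of lists of strings. `shred` is a hypothetical function written for this illustration, not Parquet's own code, and it handles only this one schema shape.

```python
def shred(records):
    """Flatten list-of-lists records into (value, repetition, definition) triples.

    Repetition level: 0 starts a new record, 1 starts a new level1 entry,
    2 continues the current level2 list.
    Definition level: 0 means level1 is empty, 1 means level1 exists but
    level2 is empty, 2 means an actual value is present.
    """
    out = []
    for record in records:
        if not record:
            out.append((None, 0, 0))
            continue
        for i, inner in enumerate(record):
            outer_rep = 0 if i == 0 else 1
            if not inner:
                out.append((None, outer_rep, 1))
                continue
            for j, value in enumerate(inner):
                out.append((value, outer_rep if j == 0 else 2, 2))
    return out


records = [
    [["a", "b", "c"], ["d", "e", "f", "g"]],
    [["h"], ["i", "j"]],
]
for triple in shred(records):
    print(triple)
# First lines printed: ('a', 0, 2), ('b', 2, 2), ('c', 2, 2), ('d', 1, 2)
```

Because the levels record where each list starts and how deep each value is defined, the nested records can be rebuilt from the flat column without storing any explicit list boundaries.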

RCFile，Parquet，ORCFile

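In the past two months, Cloudera/Twitter and Hortonworks have each released a separate columnar file format: Parquet and ORCFile. This article first reviews RCFile, then gives a rough overview of what Parquet and ORCFile have in common and where they differ. Detailed code-level differences will be covered in later posts. RCFile recap: RCFile stands for Record Columnar File and is a storage format that can be used from Hive. It is designed in particular to perform well on distributed storage such as HDFS and S3. On storage such as HDFS/S3, data is basically distributed so that the load is even across machines. Because of this, as in conventional columnar storage formats, simply by column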

kimutansk 2014/10/23

Columnar file formats are hard to make sense of at first glance, so... I see.

Parquet Hadoop Summit 2013

Parquet is a columnar storage format for Hadoop data. It was developed by Twitter and Cloudera to optimize storage and querying of large datasets. Parquet provides more efficient compression and I/O compared to traditional row-based formats by storing data by column. Early results show a 28% reduction in storage size and up to a 114% improvement in query performance versus the original Thrift form

kimutansk 2013/11/04

Parquet, a columnar storage format for Hadoop that also supports nested data. So it was built jointly by Cloudera and Twitter.

Twitter: Efforts to improve the data analytics platform - ワザノバ | wazanova.jp

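https://www.facebook.com/photo.php?v=10151697364230687&set=vb.9445547199&type=2&theater Twitter's Analytics infrastructure team presents a case study of its work to improve the data analytics platform. 1) Background: Many teams inside Twitter need analysis data on the activity of users who send and consume 400 million tweets per day, each from its own perspective and in various usage patterns, so both the volume and the data dependencies have become very large and complex. The Analytics infrastructure is at a scale of several Hadoop clusters of 1,000 nodes each. Beyond reducing storage footprint and I/O, the team is working on other ways to raise processing speed. 2) Parquet ("for Hadoop, a columnar storage for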

kimutansk 2013/10/28

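So this is OSS from Twitter that goes as far as including a common serving layer for a lambda architecture combining batch processing and speed processing. Something to look forward to.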

Twitter open-sources the official release of "Parquet", columnar storage for Hadoop

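Technology that stores data column-wise to improve read performance and enable fast analytics is drawing attention under names such as "columnar database", "columnar storage", and "column-oriented data store". Twitter has open-sourced version 1.0 of "Parquet", which brings that technology to Hadoop storage to enable even faster analytics on Hadoop. The release was on July 30, about a month ago, so I was a little late to notice it, but I could not find any other Japanese-language article, so I would like to introduce it here. To explain what kind of software Parquet is, let me quote a somewhat long description from Twitter's blog. Parquet is an open-source columnar storage format for Hadoop. Its goal is to provide a state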

kimutansk 2013/09/03

So columnar storage for Hadoop has arrived. If data can be fetched column by column, many products should benefit.
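The benefit of fetching data column by column can be sketched with plain Python structures. This is a toy model, assuming in-memory lists in place of files on disk. The sample rows and the `to_columns` helper are hypothetical, and real formats such as Parquet add encoding and compression on top of this layout.

```python
# Hypothetical sample data in row-oriented form.
rows = [
    {"user": "a", "country": "JP", "clicks": 3},
    {"user": "b", "country": "US", "clicks": 5},
    {"user": "c", "country": "JP", "clicks": 2},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: summing clicks walks every row, touching all stored values.
values_touched_row_layout = sum(len(row) for row in rows)
# Column layout: the same sum reads only the values of the "clicks" column.
values_touched_column_layout = len(columns["clicks"])
total_clicks = sum(columns["clicks"])

print(values_touched_row_layout, values_touched_column_layout, total_clicks)  # 9 3 10
```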

kimutansk's bookmarks about parquet (13)
