This document discusses the evolution of Hadoop and its use cases in the adtech industry. It describes how Hadoop was initially used primarily for batch processing via Hive and MapReduce. Over time, improvements like Tez, Presto, and Impala enabled faster interactive SQL queries on big data. The document also outlines how the Hadoop ecosystem is now used for real-time log collection, reporting, model generation, and more across the entire adtech stack. Key recent developments discussed include improvements in Hive like LLAP that enable sub-second SQL and ACID transactions, as well as tools like Cloudbreak for deploying Hadoop clusters in the cloud.
This document discusses Hortonworks Data Platform (HDP) updates and releases. It notes that HDP will have more frequent releases of components like Spark, Hive, and Ambari, while having longer release cycles for core Hadoop components. HDP 2.5 is highlighted as including interactive Hive queries using LLAP, enterprise Spark support in Zeppelin notebooks, real-time applications support in Storm and HBase/Phoenix, streamlined operations using Ambari, and dynamic security with Atlas and Ranger integration.
Security is one of the fundamental features for enterprise adoption. Specifically, for SQL users, row/column-level access control is important. However, when a cluster is used as a data warehouse accessed by various user groups in different ways, it is difficult to guarantee consistent data governance. In this talk, we focus on SQL users and discuss how to provide row/column-level access control with common access control rules throughout the whole cluster across various SQL engines, e.g., Apache Spark 2.1, Apache Spark 1.6, and Apache Hive 2.1. If any of the rules change, all engines are updated consistently in near real time. Technically, we enable the Spark Thrift Server to work with the identity given by the JDBC connection and take advantage of the Hive LLAP daemon as a shared, secured processing engine. We demonstrate row-level filtering, column-level filtering, and various column maskings in Apache Spark with Apache Ranger. We use Apache Ranger as a single point of security control.
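As a minimal illustration of the JDBC-identity mechanism described above (a sketch, not the speakers' actual demo): the hostname, port, user, and table below are assumptions, and pyhive stands in as the Python Thrift client. Because the Spark Thrift Server speaks the HiveServer2 protocol, the username on the connection is the identity Ranger evaluates its row/column policies against.

```python
# Hypothetical setup: Spark Thrift Server at sts.example.com:10016, a table
# `customers` covered by Ranger policies, pyhive installed.
from pyhive import hive

# The username on the Thrift/JDBC connection is the identity that Ranger
# row-filter and column-masking policies are evaluated against.
conn = hive.Connection(host="sts.example.com", port=10016, username="alice")
cur = conn.cursor()

# If a masking policy covers `ssn` for alice's group, the engine rewrites
# the query so masked values come back instead of raw ones.
cur.execute("SELECT name, ssn FROM customers LIMIT 10")
for row in cur.fetchall():
    print(row)
```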
The document discusses how Hadoop can be used for interactive and real-time data analysis. It notes that the amount of digital data is growing exponentially and will reach 40 zettabytes by 2020. Traditional data systems are struggling to manage this new data. Hadoop provides a solution by tying together inexpensive servers to act as one large computer for processing big data using various Apache projects for data access, governance, security and operations. Examples show how Hadoop can be used to analyze real-time streaming data from sensors on trucks to monitor routes, vehicles and drivers.
An Overview on Optimization in Apache Hive: Past, Present, Future - DataWorks Summit
Apache Hive has been continuously evolving to support a broad range of use cases, bringing it beyond its batch processing roots to its current support for interactive queries with sub-second response times using LLAP. However, the development of its execution internals is not sufficient to guarantee efficient performance, since poorly optimized queries can create a bottleneck in the system. Hence, each release of Hive has included new features for its optimizer aimed to generate better plans and deliver improvements to query execution. In this talk, we present the development of the optimizer since its initial release. We describe its current state and how Hive leverages the latest Apache Calcite features to generate the most efficient execution plans. We show numbers demonstrating the improvements brought to Hive performance, and we discuss future directions for the next-generation Hive optimizer, which include an enhanced cost model, materialized views support, and complex query decorrelation.
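To make the optimizer's work tangible, here is a small sketch of inspecting the plan the Calcite-based CBO produces; the host, credentials, and the `sales` table are assumptions, and pyhive stands in for any HiveServer2 client.

```python
# Hypothetical HiveServer2 endpoint and table; pyhive installed.
from pyhive import hive

cur = hive.Connection(host="hs2.example.com", port=10000, username="alice").cursor()

# hive.cbo.enable toggles the Calcite-based cost-based optimizer.
cur.execute("SET hive.cbo.enable=true")

# EXPLAIN prints the operator tree the planner chose, one line per row.
cur.execute("EXPLAIN SELECT region, SUM(amount) FROM sales GROUP BY region")
for (line,) in cur.fetchall():
    print(line)
```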
This document discusses enabling Apache Zeppelin and Spark for data science in the enterprise. It outlines current issues with Zeppelin and Spark integration including secure data access, multi-tenancy, and fault tolerance. It then describes how using Livy Server as a session management service solves these issues by providing secure, isolated sessions for each user. The document concludes by covering near term improvements like session management and long term goals like controlled sharing and model deployment.
This document provides an agenda and overview for a hands-on introductory course on Spark and Zeppelin. The agenda includes a quick demo, overview of Spark and Zeppelin, a 1 hour lab, discussion of Spark 2.0 features, and a Q&A session. The overview sections explain key Spark concepts like RDDs, DataFrames, and MLlib as well as how Spark SQL, Streaming, and GraphX work. It also introduces the Apache Zeppelin notebook platform and Hortonworks Data Platform sandbox for experimenting with Spark and Hadoop technologies.
Apache Hive 2.0 provides major new features for SQL on Hadoop such as:
- HPL/SQL, which adds procedural SQL capabilities like loops and branches.
- LLAP which enables sub-second queries through persistent daemons and in-memory caching.
- Using HBase as the metastore which speeds up query planning times for queries involving thousands of partitions.
- Improvements to Hive on Spark and the cost-based optimizer.
- Many bug fixes and under-the-hood improvements were also made while maintaining backwards compatibility where possible.
Using Apache Hadoop and related technologies as a data warehouse has been an area of interest since the early days of Hadoop. In recent years Hive has made great strides towards enabling data warehousing by expanding its SQL coverage, adding transactions, and enabling sub-second queries with LLAP. But data warehousing requires more than a full-powered SQL engine. Security, governance, data movement, workload management, monitoring, and user tools are required as well. These functions are being addressed by other Apache projects such as Ranger, Atlas, Falcon, Ambari, and Zeppelin. This talk will examine how these projects can be assembled to build a data warehousing solution. It will also discuss features and performance work going on in Hive and the other projects that will enable more data warehousing use cases. These include use cases like data ingestion using merge, support for OLAP cubing queries via Hive’s integration with Druid, expanded SQL coverage, replication of data between data warehouses, advanced access control options, data discovery, and user tools to manage, monitor, and query the warehouse.
Speaker
Alan Gates, Co-founder, Hortonworks
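To make the ingestion-using-merge use case from the abstract above concrete, here is a hedged sketch of Hive's ACID MERGE; the table and column names are invented, and the target is assumed to be a transactional ORC table.

```python
# Hypothetical tables: a transactional target `dim_customer` and a staging
# table `staging_customer`; pyhive used as the HiveServer2 client.
from pyhive import hive

merge_sql = """
MERGE INTO dim_customer t
USING staging_customer s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET email = s.email, city = s.city
WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.email, s.city)
"""

cur = hive.Connection(host="hs2.example.com", port=10000).cursor()
cur.execute(merge_sql)  # upsert: update the matches, insert the rest
```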
Double Your Hadoop Hardware Performance with SmartSense - Hortonworks
Hortonworks SmartSense provides proactive recommendations that improve cluster performance, security and operations. And since 30% of issues are configuration related, Hortonworks SmartSense makes an immediate impact on Hadoop system performance and availability, in some cases boosting hardware performance by two times. Learn how SmartSense can help you increase the efficiency of your Hadoop hardware, through customized cluster recommendations.
View the on-demand webinar: https://hortonworks.com/webinar/boosts-hadoop-hardware-performance-2x-smartsense/
The document discusses how Apache Ambari can be used to streamline Hadoop DevOps. It describes how Ambari can be used to provision, manage, and monitor Hadoop clusters. It highlights new features in Ambari 2.4 like support for additional services, role-based access control, management packs, and Grafana integration. It also covers how Ambari supports automated deployment and cluster management using blueprints.
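As a sketch of the blueprint-driven deployment mentioned above: the Ambari host, credentials, and the abbreviated blueprint body are assumptions, but the endpoint and the mandatory X-Requested-By header follow Ambari's REST API.

```python
# Hypothetical Ambari server; blueprint trimmed to two components for brevity.
import requests

AMBARI = "http://ambari.example.com:8080/api/v1"
AUTH = ("admin", "admin")
HEADERS = {"X-Requested-By": "ambari"}  # Ambari rejects POSTs without it

blueprint = {
    "Blueprints": {"stack_name": "HDP", "stack_version": "2.5"},
    "host_groups": [{
        "name": "master",
        "cardinality": "1",
        "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"}],
    }],
}

# Register the blueprint; a cluster-creation request would then reference it.
r = requests.post(f"{AMBARI}/blueprints/hdp-small",
                  json=blueprint, auth=AUTH, headers=HEADERS)
r.raise_for_status()
```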
This document summarizes the Hortonworks Data Platform (HDP). HDP is a fully integrated big data platform that includes Apache Hadoop, HBase, Hive, Pig and other projects. It addresses the challenges of integrating and managing open source projects by providing certified, tested distributions with extensive quality assurance. HDP also includes tools for management and monitoring clusters through Ambari, and for data integration through Talend Open Studio.
This document discusses security requirements and solutions for Apache Spark production deployments. It covers authenticating users with Kerberos/AD, authorizing access to Spark jobs and data with Ranger, auditing access, and encrypting data at rest and in motion. It provides examples of configuring Kerberos authentication for Spark, using Ranger to control authorization to HDFS and SparkSQL, and demonstrates dynamic row filtering and masking of sensitive data in SparkSQL queries based on user policies.
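A small sketch of the Kerberos side of such a deployment: submitting a Spark job with a principal and keytab so YARN can renew tickets for long-running work. The principal, keytab path, and script name are assumptions.

```python
# Hypothetical service principal and keytab path.
import subprocess

subprocess.run([
    "spark-submit",
    "--master", "yarn",
    "--deploy-mode", "cluster",
    # Allows ticket renewal for long-running jobs on a kerberized cluster.
    "--principal", "etl@EXAMPLE.COM",
    "--keytab", "/etc/security/keytabs/etl.keytab",
    "my_job.py",
], check=True)
```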
With its large install base in production, the Storm 1.x line has proven itself as a stable and reliable workhorse that scales well horizontally. Much has been learnt from evolving the 1.x line that we can now leverage to build the next generation execution engine. Under the STORM-2284 umbrella, we are working hard to bring you this new engine which is being redesigned at a fundamental level for Storm 2.0. The goal is to dramatically improve performance and enhance Storm's abilities without breaking compatibility.
This improved vertical scaling will help meet the needs of the growing user base by delivering more performance with less hardware.
In this talk, we will take an in-depth look at the existing and proposed designs for Storm's threading model and the messaging subsystem. We will also do a quick run-down of the major proposed improvements and share some early results from the work in progress.
Speaker
Roshan Naik, Senior MTS, Hortonworks
Connecting the Drops with Apache NiFi & Apache MiNiFi - DataWorks Summit
Demand for increased capture of information to drive analytic insights into an organization's assets and infrastructure is growing at unprecedented rates. However, as data volumes soar, providing seamless ingestion pipelines becomes operationally complex as the number of data sources and types expands.
This talk will focus on the efforts of the Apache NiFi community, including the MiNiFi subproject, an agent-based architecture, and its relation to the core Apache NiFi project. MiNiFi is focused on providing a platform that meets and adapts to where data is born while preserving the core tenets of NiFi: provenance, security, and command and control. These capabilities provide versatile avenues for the bi-directional exchange of information across data and control planes while dealing with the constraints of operating at opposite ends of the scale spectrum, tackling the first and last miles of dataflow management.
We will highlight ongoing and new efforts in the community to provide greater flexibility with deployment and configuration management of flows. Versioned flows provide greater operational flexibility and serve as a powerful foundation to orchestrate the collection and transmission from the point of data's inception through to its transmission to consumers and processing systems.
The document summarizes the results of a study that evaluated the performance of different Platform-as-a-Service offerings for running SQL on Hadoop workloads. The study tested Amazon EMR, Google Cloud DataProc, Microsoft Azure HDInsight, and Rackspace Cloud Big Data using the TPC-H benchmark at various data sizes up to 1 terabyte. It found that at 1TB, lower-end systems had poorer performance. In general, HDInsight running on D4 instances and Rackspace Cloud Big Data on dedicated hardware had the best scalability and execution times. The study provides insights into the performance, scalability, and price-performance of running SQL on Hadoop in the cloud.
The document discusses the past, present, and future of Apache Hadoop YARN. It describes how YARN started as a sub-project of Hadoop to improve its resource management capabilities. Today, YARN is central to modern data architectures, providing centralized resource management and scheduling. Going forward, YARN aims to better support containers, simplified APIs, treating services as first-class citizens, and enhance its user experience.
This document discusses Apache NiFi and how it was used to create a new composable data flow system for Schlumberger in just 10 man hours. The previous system was very complex, took over 100 man years to create, and was difficult to change. NiFi allows for easy visualization of the data flow, debugging of issues, and rapid creation of new processors. It also enables quick testing of data flows using curated test data sets and live data in Docker containers. Next steps discussed include further exploring use cases for rig data ingestion with NiFi to provide data provenance and understand the chain of custody of data as it moves through the system.
LLAP (Live Long and Process) is the newest query acceleration engine for Hive 2.0, which entered GA in 2017. LLAP brings to light a new set of trade-offs and optimizations that allow for efficient and secure multi-user BI systems in the cloud. In this talk, we discuss the specifics of building a modern BI engine within those boundaries, designed to be fast and cost-effective on the public cloud. The LLAP cache focuses on speeding up common BI query patterns in the cloud while avoiding most of the operational overhead of administering a caching layer: it stays automatically coherent, evicts intelligently, and supports file formats from text to ORC. We also explore the possibilities of combining the cache with a transactional storage layer that supports online UPDATEs and DELETEs without full data reloads. LLAP by itself, as a relational data layer, extends the same caching and security advantages to any other data processing framework. We overview the structure of such a hybrid system, where both Hive and Spark use LLAP to provide SQL query acceleration on the cloud with new, improved concurrent query support and production-ready tools and UI.
Speaker
Sergey Shelukin, Member of Technical Staff, Hortonworks
Supporting Financial Services with a More Flexible Approach to Big Data - Hortonworks
The document discusses how Hortonworks Data Platform (HDP) enables a modern data architecture with Apache Hadoop. HDP provides a common data set stored in HDFS that can be accessed through various applications for batch, interactive, and real-time processing. This allows organizations to store all their data in one place and access it simultaneously through multiple means. YARN is the architectural center of HDP and enables this modern data architecture. HDP also provides enterprise capabilities like security, governance, and operations to make Hadoop suitable for business use.
The document summarizes Apache Phoenix and its past, present, and future as a SQL interface for HBase. It describes Phoenix's architecture and key features like secondary indexes, joins, aggregations, and transactions. Recent releases added functional indexes, the Phoenix Query Server, and initial transaction support. Future plans include improvements to local indexes, integration with Calcite and Hive, and adding JSON and other SQL features. The document aims to provide an overview of Phoenix's capabilities and roadmap for building a full-featured SQL layer over HBase.
This document provides an overview of Apache NiFi 1.0 and discusses its new enhancements, including a modernized UI with a complete interface redesign, multitenant authorization capabilities, zero master clustering, and foundational work for software development lifecycles. It also outlines NiFi's use for data flow management and integration with downstream systems.
Zeppelin has become a popular way to unlock the value of the data lake thanks to its user interface and appeal to business users. These business users ask their IT departments for access to Zeppelin. Enterprise IT departments want to help their business users, but they have several enterprise concerns, such as enterprise security, integration with their corporate LDAP/AD, scalability in a multi-user environment, and integration with Ranger and Kerberos. This session will walk through these enterprise concerns and how they can be handled with Zeppelin.
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet - DataWorks Summit
This document summarizes a benchmark study of file formats for Hadoop, including Avro, JSON, ORC, and Parquet. It found that for tables with many common strings, Avro with Snappy compression provided good performance. For other tables, ORC with Zlib compression generally performed best. The document outlines the characteristics and performance of each format for different use cases like full table scans, column projections, predicate pushdown, and metadata access. It concludes by recommending experimenting with the open benchmark suite and formats.
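A small-scale way to reproduce the comparison yourself, sketched under assumptions: a local PySpark install, an `events.json` input, and the external spark-avro package for the Avro writer.

```python
# Hypothetical input file and output paths; measure sizes and scan times after.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-bench").getOrCreate()
df = spark.read.json("events.json")  # JSON as the row-oriented baseline

# Write the same rows in each format, then compare on-disk size, full-scan
# time, and column-projection time per the study's methodology.
df.write.mode("overwrite").format("orc").save("/tmp/bench/orc")
df.write.mode("overwrite").format("parquet").save("/tmp/bench/parquet")
df.write.mode("overwrite").format("avro").save("/tmp/bench/avro")  # needs spark-avro
```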
This document discusses ways to troubleshoot slow Hadoop jobs using metrics, logging, and tracing. It describes how to use the Ambari metrics system and Grafana dashboards to monitor metrics for clusters. It also explains how to leverage Hadoop logs and the YARN Application Timeline Service for logging and correlation across workloads. Finally, it presents Apache Zeppelin and analyzers for Hive, Tez, and YARN as tools for ad-hoc analysis to diagnose issues.
The document discusses Apache NiFi and its role in the Hadoop ecosystem. It provides an overview of NiFi, describes how it can be used to integrate with Hadoop components like HDFS, HBase, and Kafka. It also discusses how NiFi supports stream processing integrations and outlines some use cases. The document concludes by discussing future work, including improving NiFi's high availability, multi-tenancy, and expanding its ecosystem integrations.
View the recording:
http://hortonworks.com/webinar/accelerating-real-time-data-ingest-hadoop/
Hadoop didn’t disrupt the data center. The exploding amounts of data did. But, let’s face it, if you can’t move your data to Hadoop, then you can’t use it in Hadoop. The experts from Hortonworks, the #1 leader in Hadoop development, and Attunity, a leading data management software provider, cover:
- How to ingest your most valuable data into Hadoop using Attunity Replicate
- About how customers are using Hortonworks DataFlow (HDF) powered by Apache NiFi
- How to combine the real-time change data capture (CDC) technology with connected data platforms from Hortonworks
We discuss how Attunity Replicate and Hortonworks Data Flow (HDF) work together to move data into Hadoop.
Apache Hadoop YARN is the modern Distributed Operating System. It enables the Hadoop compute layer to be a common resource-management platform that can host a wide variety of applications. Multiple organizations are able to leverage YARN in building their applications on top of Hadoop without themselves repeatedly worrying about resource management, isolation, multi-tenancy issues etc.
In this talk, we’ll first hit the ground with the current status of Apache Hadoop YARN – how it is faring today in deployments large and small. We will cover different types of YARN deployments, in different environments and at different scales.
We'll then move on to the exciting present & future of YARN – features that are further strengthening YARN as the first-class resource-management platform for datacenters running enterprise Hadoop. We’ll discuss the current status as well as the future promise of features and initiatives like 10x scheduler throughput improvements, Docker container support on YARN, native support for long-running services (alongside applications) without any changes, seamless application upgrades, fine-grained isolation for multi-tenancy using CGroups on disk & network resources, powerful scheduling features like application priorities and intra-queue preemption across applications, and operational enhancements including insights through Timeline Service V2, a new web UI, and better queue management.
Speaker:
Sunil Govindan, Senior Software Engineer, Hortonworks
Rohith Sharma K S, Senior Software Engineer, Hortonworks
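For a feel of the operational insights mentioned in the abstract, here is a sketch of polling the ResourceManager's REST API; the RM address is an assumption and authentication is omitted for brevity.

```python
# Hypothetical ResourceManager; endpoints are YARN's documented /ws/v1 API.
import requests

rm = "http://rm.example.com:8088/ws/v1/cluster"

metrics = requests.get(f"{rm}/metrics").json()["clusterMetrics"]
print("apps running:", metrics["appsRunning"])
print("containers allocated:", metrics["containersAllocated"])

# Per-queue capacities, useful when working with the queue-management
# features the talk covers.
scheduler = requests.get(f"{rm}/scheduler").json()
```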
In this webinar, we discuss how Apache Hive, which users have known well since Hadoop 1.x, has been sped up in recent years: the execution engine's move from MapReduce to Tez; ORC, a columnar file format with built-in indexes; vectorization, which makes the most of modern CPUs; query plan optimization by the cost-based optimizer built on Apache Calcite; and LLAP, which delivers sub-second query response. Each of these features can be enabled with just a few lines of configuration or a few commands, but in this session we explain what mechanisms are at work behind them and how they are implemented.
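The "few lines of configuration" the webinar alludes to might look like the following sketch; the HiveServer2 host and table names are assumptions, and the values shown are typical rather than prescriptive.

```python
# Hypothetical HiveServer2 endpoint; settings applied per session via pyhive.
from pyhive import hive

cur = hive.Connection(host="hs2.example.com", port=10000).cursor()
for stmt in (
    "SET hive.execution.engine=tez",               # MapReduce -> Tez
    "SET hive.vectorized.execution.enabled=true",  # process rows in batches
    "SET hive.cbo.enable=true",                    # Calcite cost-based optimizer
    "SET hive.llap.execution.mode=all",            # route work to LLAP daemons
):
    cur.execute(stmt)

# ORC is chosen per table rather than per session:
cur.execute("CREATE TABLE logs_orc STORED AS ORC AS SELECT * FROM logs_raw")
```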
The story of how to figure out what to measure, and how you can benchmark it. This slide deck presents the idea of benchmarking; it does not cover specific commercial or open-source benchmark tools.
Dynamic Resource Allocation in Apache Spark - Yuta Imai
Dynamic resource allocation in Apache Spark allows executors to be dynamically added or removed based on the workload of applications. Extra executors are added when applications have pending tasks to help balance workload, and idle executors are removed to free resources for other applications. The dynamic allocation policies control when executors are requested or removed based on factors like pending tasks and executor idle time. An external shuffle service is also used to improve shuffle performance.
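A sketch of the corresponding configuration (the values are illustrative, and the external shuffle service is assumed to be running on the worker nodes):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-allocation-demo")
    .config("spark.dynamicAllocation.enabled", "true")
    # Required so shuffle files outlive executors that get removed.
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    # How long an executor may sit idle before it is released.
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .getOrCreate()
)
```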
The document discusses various options for getting data from Kafka into Hadoop, including Camus, Flume, Spark Streaming, and Storm. It provides information on how each works and their advantages and disadvantages. The presenter has 15 years of experience moving data and is now a Cloudera engineer working on projects like Flume, Sqoop, and Kafka.
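At its core, the simplest of those options reduces to a consume-and-append loop, sketched here with assumed broker, topic, NameNode, and path names (kafka-python and the `hdfs` WebHDFS client are also assumptions; the heavier tools add batching, retries, and delivery guarantees on top of this).

```python
from kafka import KafkaConsumer
from hdfs import InsecureClient

consumer = KafkaConsumer("events", bootstrap_servers=["broker1:9092"])
client = InsecureClient("http://namenode:50070", user="etl")

# append=True assumes the target file already exists; create it on first run.
with client.write("/data/raw/events.log", append=True) as writer:
    for msg in consumer:
        writer.write(msg.value + b"\n")  # msg.value is raw bytes
```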
LLAP enables sub-second analytical queries in Hive by running query fragments directly in memory on compute nodes using a long-running daemon process. It provides high performance scans and execution through an in-memory columnar cache shared across queries. LLAP queries are coordinated independently by Tez while utilizing Hive operators for processing and Tez for data transfers. It improves upon traditional MapReduce and Tez by keeping intermediate query results in memory rather than writing to disk.
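One way to see the daemon effect is to run the same query in container mode and in LLAP mode and compare wall-clock time; the endpoint and `web_logs` table are assumptions, and timings will obviously vary.

```python
import time
from pyhive import hive

cur = hive.Connection(host="hs2.example.com", port=10000).cursor()

for mode in ("container", "llap"):
    cur.execute(f"SET hive.execution.mode={mode}")
    start = time.time()
    cur.execute("SELECT COUNT(*) FROM web_logs WHERE status = 404")
    cur.fetchall()
    print(mode, round(time.time() - start, 2), "s")
```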
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG - skumpf
The document discusses real-time processing in Hadoop using the Hortonworks Data Platform (HDP). It provides an overview of using HDP for real-time streaming analytics in a logistics scenario. Example applications and architectures are presented, including using Kafka for ingesting sensor data, Storm for stream processing, and HBase for real-time querying. Demos will also illustrate integrating predictive analytics into streaming scenarios.
Storm Demo Talk - Colorado Springs, May 2015 - Mac Moore
The document discusses real-time processing capabilities in Hadoop and Hortonworks Data Platform (HDP). It begins with an introduction to Hortonworks and an overview of real-time streaming architectures on HDP. It then demonstrates streaming capabilities with and without predictive analytics additions. The document highlights how HDP provides a centralized architecture and open data platform to enable real-time and batch processing of any type of data for analytics applications.
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop - Hortonworks
How can you simplify the management and monitoring of your Hadoop environment, and ensure IT can focus on the right business priorities supported by Hadoop? Take a look at this presentation to learn how.
The document discusses real-time processing in Hadoop and provides an overview of streaming architectures using the Hortonworks Data Platform (HDP). It includes two demos, the first showing a basic streaming scenario and the second integrating predictive analytics. The document aims to introduce HDP's capabilities for real-time streaming and predictive analytics and demonstrate them through examples relevant to logistics companies.
This document provides an overview of real-time processing capabilities on Hortonworks Data Platform (HDP). It discusses how a trucking company uses HDP to analyze sensor data from trucks in real-time to monitor for violations and integrate predictive analytics. The company collects data using Kafka and analyzes it using Storm, HBase and Hive on Tez. This provides real-time dashboards as well as querying of historical data to identify issues with routes, trucks or drivers. The document explains components like Kafka, Storm and HBase and how they enable a unified YARN-based architecture for multiple workloads on a single HDP cluster.
Internet of Things Crash Course Workshop at Hadoop Summit - DataWorks Summit
This document provides an overview of how a trucking company can use Hortonworks Data Platform (HDP) to gain insights from real-time streaming data generated by sensors in its trucks. The company wants to monitor trucks for locations, violations, and other events. HDP allows the company to ingest streaming data from trucks using Kafka and analyze it in real-time with Storm for alerts or serve it to applications with HBase. The company can also run interactive queries on historical data with Hive and Tez. All of this is run on a single HDP cluster for consistent governance, security, and operations across batch and real-time workloads.
Hortonworks - What's Possible with a Modern Data Architecture? - Hortonworks
This is Mark Ledbetter's presentation from the September 22, 2014 Hortonworks webinar “What’s Possible with a Modern Data Architecture?” Mark is vice president for industry solutions at Hortonworks. He has more than twenty-five years' experience in the software industry, with a focus on retail and supply chain.
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics... - VMware Tanzu
SpringOne Platform 2016
Speaker: Ian Fyfe; Director, Product Marketing, Hortonworks
Apache Hadoop is the most powerful and popular platform for ingesting, storing and processing enormous amounts of “big data”. However, due to its original roots as a batch processing system, doing interactive business analytics with Hadoop has historically suffered from slow response times, or forced business analysts to extract data summaries out of Hadoop into separate data marts. This talk will discuss the different options for implementing speed-of-thought business analytics and machine learning tools directly on top of Hadoop including Apache Hive on Tez, Apache Hive on LLAP, Apache HAWQ and Apache MADlib.
Apache Ambari is a single framework for IT administrators to provision, manage and monitor a Hadoop cluster. Apache Ambari 1.7.0 is included with Hortonworks Data Platform 2.2.
In this 30-minute webinar, Hortonworks Product Manager Jeff Sposetti and Apache Ambari committer Mahadev Konar discussed new capabilities including:
- Improvements to Ambari core, such as support for ResourceManager HA
- Extensions to the Ambari platform, introducing Ambari Administration and Ambari Views
- Enhancements to Ambari Stacks: dynamic configuration recommendations and validations via a "Stack Advisor"
This document summarizes Hortonworks' Hadoop distribution called Hortonworks Data Platform (HDP). It discusses how HDP provides a comprehensive data management platform built around Apache Hadoop and YARN. HDP includes tools for storage, processing, security, operations and accessing data through batch, interactive and real-time methods. The document also outlines new capabilities in HDP 2.2 like improved engines for SQL, Spark and streaming and expanded deployment options.
Trafodion – an enterprise-class SQL based on Hadoop - Krishna-Kumar
Trafodion is a joint HP Labs and HP-IT research project to develop an enterprise-class SQL-on-Hadoop DBMS engine that specifically targets operational workloads as opposed to analytic workloads. "Operational SQL" describes workloads previously known as OLTP (online transaction processing) workloads and operational data store (ODS) workloads, but expands that definition from the broad range of enterprise-level transactional applications (ERP, CRM, etc.) to include the new transactions generated by social and mobile data interactions and observations, and the new mixing of structured and semi-structured data.
Apache Atlas provides metadata services and a centralized metadata repository for Hadoop platforms. It aims to enable data governance across structured and unstructured data through hierarchical taxonomies. Upcoming features include expanded dataset lineage tracking and integration with Apache Kafka and Ranger for dynamic access policy management. Challenges of big data management include scaling traditional tools to handle large volumes of entities and metadata, and Atlas addresses this through its decentralized and metadata-driven approach.
Cloudy with a chance of Hadoop - real world considerations - DataWorks Summit
Over the last eighteen months, we have seen significant adoption of Hadoop eco-system centric big data processing in Microsoft Azure and Amazon AWS. In this talk we present some of the lessons learned and architectural considerations for cloud-based deployments including security, fault tolerance and auto-scaling.
We look at how Hortonworks Data Cloud and Cloudbreak can automate that scaling of Hadoop clusters, showing how it can react dynamically to workloads, and what that can deliver in cost-effective Hadoop-in-cloud deployments.
Hortonworks and Red Hat Webinar - Part 2 - Hortonworks
Learn more about creating reference architectures that optimize the delivery of the Hortonworks Data Platform. You will hear more about Hive and JBoss Data Virtualization security, and you will also see in action how to combine sentiment data from Hadoop with data from traditional relational sources.
Apache Phoenix Query Server - PhoenixCon 2016 - Josh Elser
This document discusses Apache Phoenix Query Server, which provides a client-server abstraction for Apache Phoenix using Apache Calcite's Avatica sub-project. It allows Phoenix to have thin clients by offloading computational resources to query servers running on Hadoop clusters. This enables non-Java clients through a standardized HTTP API. The query server implementation uses HTTP, Protocol Buffers for serialization, and common libraries like Jetty and Dropwizard Metrics. It aims to simplify Phoenix client development and improve performance and scalability.
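A thin-client sketch against such a query server, assuming the phoenixdb Python package and a server on the default port 8765 (the host and table are invented):

```python
import phoenixdb

# Avatica serializes statements over HTTP/protobuf; no fat JDBC client needed.
conn = phoenixdb.connect("http://pqs.example.com:8765/", autocommit=True)
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name VARCHAR)")
cur.execute("UPSERT INTO users VALUES (1, 'alice')")
cur.execute("SELECT * FROM users")
print(cur.fetchall())
```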
The document discusses bringing multi-tenancy to Apache Zeppelin through the use of Apache Livy. Livy is an open-source REST interface that allows interacting with Spark from anywhere and enables features like multi-user sessions and security. It improves on previous versions of interactive analysis in Zeppelin by allowing custom user sessions through Livy and improving security and isolation between users through mechanisms like SPNEGO and impersonation. The integration of Livy provides multi-tenancy, security, and isolation for interactive analysis in Zeppelin.
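The per-user session flow looks roughly like this sketch of Livy's REST API; the Livy host and user name are assumptions, and proxyUser requires impersonation to be enabled on the server.

```python
import requests, time

LIVY = "http://livy.example.com:8998"

# One interactive Spark session, created on behalf of a specific user.
sess = requests.post(f"{LIVY}/sessions",
                     json={"kind": "pyspark", "proxyUser": "alice"}).json()
sid = sess["id"]

# Wait until the session is ready, then run a statement inside it.
while requests.get(f"{LIVY}/sessions/{sid}").json()["state"] != "idle":
    time.sleep(2)

stmt = requests.post(f"{LIVY}/sessions/{sid}/statements",
                     json={"code": "sc.parallelize(range(100)).sum()"}).json()
```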
Hadoop & DevOps: Better Together - by Maxime Lanciaux.
From deployment automation with tools (like Jenkins, Git, Maven, Ambari, Ansible) to full automation with monitoring on HDP 2.5+.
Introduction to the Hortonworks YARN Ready Program - Hortonworks
The recently launched YARN Ready Program will accelerate multi-workload Hadoop in the Enterprise. The program enables developers to integrate new and existing applications with YARN-based Hadoop. We will cover:
- the program and its benefits
- why it is important to customers
- tools and guides to help you get started
- technical resources to support you
- marketing recognition you can leverage