Big Data - IBM

Hadoop Ecosystem and IBM BigInsights
Rafie Tarabay [email protected] [email protected]

When do you have Big Data?
You have Big Data when you have at least one of the following three characteristics:
Variety: Manage and benefit from diverse data types and data structures
Velocity: Analyze streaming data and large volumes of persistent data
Volume: Scale from terabytes to zettabytes

BI vs Big Data Analysis
BI:

Business users determine what question to ask, then IT structures the data to answer that question.
Sample BI tasks: monthly sales reports, profitability analysis, customer surveys
Big Data approach: IT delivers a platform to enable creative discovery, then business explores what questions could be asked

Sample Big Data tasks: brand sentiment, product strategy, maximum asset utilization

Data representation formats used for Big Data
Common data representation formats used for big data include:
Row- or record-based encodings: flat files / text files, CSV and delimited files, Avro / SequenceFile, JSON
Other formats: XML, YAML
Column-based storage formats: RC / ORC files, Parquet
NoSQL databases

What are the Parquet, RC/ORC, and Avro file formats?
Parquet
Parquet is a columnar storage format that allows compression schemes to be specified on a per-column level.
Offers better write performance by storing metadata at the end of the file.

Provides the best results in benchmark performance tests.
RC/ORC file formats
Developed to support Hive; use a columnar storage format.
Provide basic statistics such as min, max, sum, and count on columns.
Avro
Avro data files are a compact, efficient binary format.

NoSQL Databases
NoSQL is a new way of handling a variety of data. A NoSQL database can handle millions of queries per second while a normal RDBMS can handle only thousands of queries per second, and both follow the CAP theorem.
Types of NoSQL datastores:
Key-value stores: Memcached, Redis, and Riak
Column stores: HBase and Cassandra
Document stores: MongoDB, CouchDB, Cloudant, and MarkLogic

Graph stores: Neo4j and Sesame

CAP Theorem
The CAP theorem states that in the presence of a network partition, one has to choose between consistency and availability.
* Consistency: every read receives the most recent write or an error.
* Availability: every request receives a (non-error) response, without a guarantee that it contains the most recent write.
HBase and MongoDB ---> CP (favor consistency over availability)
Cassandra and CouchDB ---> AP (favor availability over consistency)
Traditional relational DBMSs are CA (they support consistency and availability, but not network partitions).

Timeline for Hadoop

Hadoop
Apache Hadoop Stack
The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, which is the MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code to the nodes to process the data in parallel.

Hadoop HDFS [IBM has an alternative file system for Hadoop named GPFS]:

Where Hadoop stores data: a file system that spans all the nodes in a Hadoop cluster; it links together the file systems on many local nodes to make them into one large file system that spans all the data nodes of the cluster.

Hadoop MapReduce v1: an implementation for large-scale data processing. The MapReduce engine consists of:
- JobTracker: receives client application jobs and sends orders to the TaskTrackers that are as near to the data as possible.
- TaskTracker: exists on the cluster's nodes to receive orders from the JobTracker.

YARN (the newer version of MapReduce): each cluster has a ResourceManager, and each data node runs a NodeManager. For each job, one slave node acts as the ApplicationMaster, monitoring resources, tasks, etc.

Advantages and disadvantages of Hadoop
Hadoop is good for:
processing massive amounts of data through parallelism

handling a variety of data (structured, unstructured, semi-structured)
using inexpensive commodity hardware
Hadoop is not good for:
processing transactions (random access)
work that cannot be parallelized

fast access to data
processing lots of small files
intensive calculations with small amounts of data

What hardware is not used for Hadoop?
RAID

Linux Logical Volume Manager (LVM)
Solid-state disks (SSD)

HDFS
Hadoop Distributed File System (HDFS) principles

Distributed, scalable, fault tolerant, high throughput
Data access through MapReduce
Files split into blocks (aka splits)
3 replicas for each piece of data by default
Can create, delete, and copy, but cannot update
Designed for streaming reads, not random access
Data locality is an important concept: processing data on or near the physical storage to decrease transmission of data

HDFS: architecture
Master / Slave architecture
NameNode
Manages the file system namespace and metadata

Regulates access to files by clients

DataNode

Many DataNodes per cluster
Manages storage attached to the nodes
Periodically reports status to the NameNode
Data is stored across multiple nodes

Nodes and components will fail, so for reliability data is replicated across multiple nodes.
[Diagram: blocks a, b, c, d of File1 replicated across the DataNodes]

Hadoop HDFS: read and write files from HDFS
Create a sample text file on Linux
# echo My First Hadoop Lesson > test.txt
List system files (Linux) to ensure the file was created
# ls -lt
List current files in your home directory in HDFS

# hadoop fs -ls /
Create a new directory in HDFS named test
# hadoop fs -mkdir test
Load the test.txt file into Hadoop HDFS
# hadoop fs -put test.txt test/
View the contents of the HDFS file test.txt
# hadoop fs -cat test/test.txt

hadoop fs - Command Reference
ls: Lists the contents of the directory specified by path, showing the names, permissions, owner, size, and modification date for each entry.
lsr: Behaves like -ls, but recursively displays entries in all subdirectories of path.

du: Shows disk usage, in bytes, for all files which match path; filenames are reported with the full HDFS protocol prefix.
dus: Like -du, but prints a summary of disk usage of all files/directories in the path.
mv: Moves the file or directory indicated by src to dest, within HDFS.
cp: Copies the file or directory identified by src to dest, within HDFS.
rm: Removes the file or empty directory identified by path.
rmr: Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path).
put: Copies the file or directory from the local file system identified by localSrc to dest within HDFS.

copyFromLocal: Identical to -put.
moveFromLocal: Copies the file or directory from the local file system identified by localSrc to dest within HDFS, then deletes the local copy on success.
get [-crc]: Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
getmerge: Retrieves all files that match the path src in HDFS and copies them to a single, merged local file localDest.

cat: Displays the contents of filename on stdout.
copyToLocal: Identical to -get.
moveToLocal: Works like -get, but deletes the HDFS copy on success.
mkdir: Creates a directory named path in HDFS. Creates any parent directories in path that are missing (like mkdir -p in Linux).

stat [format]: Prints information about path. format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
tail [-f]: Shows the last 1KB of file on stdout.
chmod [-R] mode,mode,... ...: Changes the file permissions associated with one or more objects identified by path. Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a (all) if no scope is specified, and does not apply an umask.
chown [-R] [owner][:[group]] ...: Sets the owning user and/or group for files or directories identified by path. Sets the owner recursively if -R is specified.
chgrp [-R] group ...: Sets the owning group for files or directories identified by path. Sets the group recursively if -R is specified.
help: Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd.
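A quick sketch of a few of these commands applied to the test directory created earlier (the owner/group hdfs:hadoop is only an example; output depends on your cluster):
# hadoop fs -stat "%n %b %r %o %y" test/test.txt    (filename, size, replication, block size, modification date)
# hadoop fs -chmod -R 750 test                      (restrict permissions on the test directory, recursively)
# hadoop fs -chown -R hdfs:hadoop test              (change owner and group)
# hadoop fs -tail test/test.txt                     (show the last 1KB of the file)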

YARN
Sometimes called MapReduce 2.0, YARN decouples scheduling capabilities from the data processing component.
Hadoop clusters can now run interactive querying and streaming data applications simultaneously.
Separating HDFS from MapReduce with YARN makes the Hadoop environment more suitable for operational applications that can't wait for batch jobs to finish.

HBASE

HBase is a NoSQL column-family database that runs on top of Hadoop HDFS (it is the default Hadoop database).
It can handle large tables with billions of rows and millions of columns, with fault tolerance and horizontal scalability.
The HBase concept was inspired by Google's BigTable.
Schema does not need to be defined up front

Supports high-performance random read/write applications
Data is stored in HBase table(s)
Tables are made of rows and columns
Rows are stored in order by row key

Query data using get/put/scan only (a short hbase shell sketch follows).
For more information: https://www.tutorialspoint.com/hbase/index.htm
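As a minimal illustration of those operations, a short hbase shell session (the table, column family, and values are made up for the example):
# hbase shell
hbase> create 'employee', 'info'                      # table with one column family
hbase> put 'employee', 'row1', 'info:name', 'Rafie'   # write one cell
hbase> get 'employee', 'row1'                          # read one row by key
hbase> scan 'employee', {LIMIT => 10}                  # scan the first 10 rows
hbase> disable 'employee'
hbase> drop 'employee'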

PIG
Apache Pig is used for querying data stored in Hadoop clusters.
It allows users to write complex MapReduce transformations using a high-level scripting language called Pig Latin.
Pig translates the Pig Latin script into MapReduce tasks using its Pig Engine component, so that it can be executed within YARN against a dataset stored in HDFS.
Programmers need not write complex Java code for MapReduce tasks; they can use Pig Latin instead.

Apache Pig provides nested data types such as tuples, bags, and maps that are missing from MapReduce, along with built-in operators such as joins, filters, ordering, etc.
Apache Pig can handle structured, unstructured, and semi-structured data; a small Pig Latin sketch follows.
For more information: https://www.tutorialspoint.com/apache_pig/apache_pig_overview.htm
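A minimal Pig Latin sketch of the kind of transformation described above, assuming a tab-delimited HDFS file named employees.txt with name, department, and salary columns (the file and field names are illustrative only):
grunt> emp = LOAD 'employees.txt' USING PigStorage('\t') AS (name:chararray, dept:chararray, salary:int);
grunt> high = FILTER emp BY salary > 40000;                   -- keep well-paid employees
grunt> by_dept = GROUP high BY dept;                          -- one bag of employees per department
grunt> counts = FOREACH by_dept GENERATE group AS dept, COUNT(high) AS cnt;
grunt> ordered = ORDER counts BY cnt DESC;
grunt> DUMP ordered;                                          -- run the MapReduce job and print results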

Hive
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage, queried using SQL syntax.
It allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements.
The Hive shell, JDBC, and ODBC are supported.

Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase.
For more information: https://www.tutorialspoint.com/hive/hive_introduction.htm

Create databases/tables in Hive
hive> CREATE DATABASE IF NOT EXISTS userdb;
hive> SHOW DATABASES;
hive> DROP DATABASE IF EXISTS userdb;
hive> DROP DATABASE IF EXISTS userdb CASCADE;   (also drops the database's tables)

hive> CREATE TABLE IF NOT EXISTS employee (id int, name String, salary String, destination String)
      COMMENT 'Employee details'
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;
hive> ALTER TABLE employee RENAME TO emp;
hive> ALTER TABLE employee CHANGE name ename String;
hive> ALTER TABLE employee CHANGE salary salary Double;
hive> ALTER TABLE employee ADD COLUMNS (dept STRING COMMENT 'Department name');
hive> DROP TABLE IF EXISTS employee;
hive> SHOW TABLES;

Select, views
hive> SELECT * FROM employee WHERE Id=1205;
hive> SELECT * FROM employee WHERE Salary>=40000;
hive> SELECT 20+30 ADD FROM temp;

hive> SELECT * FROM employee WHERE Salary>40000 && Dept='TP';
hive> SELECT round(2.6) FROM temp;
hive> SELECT floor(2.6) FROM temp;
hive> SELECT ceil(2.6) FROM temp;

hive> CREATE VIEW emp_30000 AS SELECT * FROM employee WHERE salary>30000;
hive> DROP VIEW emp_30000;

Index, Order by, Group by, Join
hive> CREATE INDEX index_salary ON TABLE employee(salary) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
hive> DROP INDEX index_salary ON employee;
hive> SELECT Id, Name, Dept FROM employee ORDER BY DEPT;
hive> SELECT Dept, count(*) FROM employee GROUP BY DEPT;
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM CUSTOMERS c JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c

      LEFT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c
      FULL OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

Java example for Hive JDBC
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLOrderBy {
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");   // register the Hive JDBC driver
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");   // get connection
      Statement stmt = con.createStatement();   // create statement
      ResultSet res = stmt.executeQuery("SELECT * FROM employee ORDER BY DEPT");   // execute statement
      System.out.println(" ID \t Name \t Salary \t Designation \t Dept ");
      while (res.next()) {
         System.out.println(res.getInt(1) + " " + res.getString(2) + " " + res.getDouble(3) + " "
            + res.getString(4) + " " + res.getString(5));
      }
      con.close();
   }
}
$ javac HiveQLOrderBy.java
$ java HiveQLOrderBy

Phoenix
Apache Phoenix is an open source, massively parallel, relational database engine supporting OLTP for Hadoop, using Apache HBase as its backing store.
Phoenix provides a JDBC driver that hides the intricacies of the NoSQL store, enabling users to create, delete, and alter SQL tables, views, indexes, and sequences; insert and delete rows singly and in bulk; and query data through SQL.
Phoenix compiles queries and other statements into native NoSQL store APIs rather than using MapReduce, enabling the building of low-latency applications on top of NoSQL stores.

Apache Phoenix is a good choice for low-latency workloads and mid-size tables (1M - 100M rows).
Apache Phoenix is faster than Hive and Impala.
Phoenix main features:
Supports transactions
Supports user-defined functions
Supports secondary indexes
Supports view syntax
A small SQL sketch via Phoenix's sqlline client follows.

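A minimal sketch of Phoenix usage through its sqlline.py client (the table name, columns, and the localhost ZooKeeper quorum are illustrative assumptions):
$ sqlline.py localhost:2181
0: jdbc:phoenix:localhost> CREATE TABLE IF NOT EXISTS employee (id BIGINT NOT NULL PRIMARY KEY, name VARCHAR, salary DECIMAL);
0: jdbc:phoenix:localhost> UPSERT INTO employee VALUES (1, 'Rafie', 50000);   -- insert/update in one statement
0: jdbc:phoenix:localhost> CREATE INDEX idx_salary ON employee (salary);      -- secondary index
0: jdbc:phoenix:localhost> SELECT name FROM employee WHERE salary > 40000;
0: jdbc:phoenix:localhost> !quit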
Solr
Solr (enterprise search engine)
Solr is used to build search applications that deliver high performance, with support for execution of parallel SQL queries. It was built on top of Lucene (a full-text search engine).
Solr can be used along with Hadoop to search large volumes of text-centric data. Beyond search, Solr can also be used for storage. Like other NoSQL databases, it is a non-relational data storage and processing technology.
Supports full-text search; it utilizes RAM, not the CPU.
PDF and Word document indexing

Auto-suggest, stop words, synonyms, etc.
Supports replication
Communicate with the search server via HTTP (it can even return JSON, native PHP/Ruby/Python); see the query sketch below.
Index directly from the database with custom queries
For more information: https://www.tutorialspoint.com/apache_solr/apache_solr_overview.htm
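To illustrate the HTTP interface, a couple of example requests against a local Solr instance (the core name "products" and the field names are assumptions for the example):
$ curl "http://localhost:8983/solr/products/select?q=*:*&rows=5&wt=json"            (first 5 documents, returned as JSON)
$ curl "http://localhost:8983/solr/products/select?q=name:laptop&fl=id,name,price"  (full-text search on the name field)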

Elasticsearch-Hadoop (ES-Hadoop)
ES-Hadoop connects Hadoop to Elasticsearch, a real-time full-text search (FTS) and analytics engine.
It connects the massive data storage and deep processing power of Hadoop with the real-time search and analytics of Elasticsearch.
The ES-Hadoop connector lets you get quick insight from your big data and makes working in the Hadoop ecosystem even better.
ES-Hadoop lets you index Hadoop data into the Elastic Stack to take full advantage of the speedy Elasticsearch engine and beautiful Kibana visualizations.

With ES-Hadoop, you can easily build dynamic, embedded search applications to serve your Hadoop data or perform deep, low-latency analytics using full-text and geospatial queries and aggregations.
ES-Hadoop lets you easily move data bi-directionally between Elasticsearch and Hadoop while exposing HDFS as a repository for long-term archival.

SQOOP

Import data from relational database tables into HDFS
Export data from HDFS into relational database tables
Sqoop works with all databases that have a JDBC connection; the JDBC driver JAR files should exist in $SQOOP_HOME/lib.
It uses MapReduce to import and export the data.
Imported data can be stored as text files or binary files, or loaded into HBase or Hive (a short sketch of both directions follows).
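A hedged sketch of both directions, import and export (the MySQL connection string, table names, and paths are assumptions; a full import-all-tables example appears later in the demo section):
> sqoop import --connect jdbc:mysql://localhost:3306/retail_db \
    --username retail_dba --password cloudera \
    --table customers --target-dir /user/cloudera/customers -m 1
> sqoop export --connect jdbc:mysql://localhost:3306/retail_db \
    --username retail_dba --password cloudera \
    --table daily_summary --export-dir /user/cloudera/summary -m 1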

Oozie
Apache Oozie is a scheduler system to run and manage Hadoop jobs in a distributed environment. It allows multiple complex jobs to be combined and run in sequential order to achieve a bigger task. Within a sequence of tasks, two or more jobs can also be programmed to run in parallel.
One of the main advantages of Oozie is that it is tightly integrated with the Hadoop stack, supporting various Hadoop jobs such as Hive, Pig, and Sqoop, as well as system-specific jobs such as Java and shell.
Oozie has three types of jobs:
Oozie Workflow Jobs: these are represented as Directed Acyclic Graphs (DAGs) to

specify a sequence of actions to be executed.
Oozie Coordinator Jobs: these consist of workflow jobs triggered by time and data availability.
Oozie Bundle: this can be referred to as a package of multiple coordinator and workflow jobs.
For more information: https://www.tutorialspoint.com/apache_oozie/apache_oozie_introduction.htm

R-Hadoop

RHadoop
RHadoop is a collection of five R packages that allow users to manage and analyze data with Hadoop:
rhdfs: connect HDFS to R
rhbase: connect HBase to R
rmr2: enable R to perform statistical analysis using MapReduce
ravro: enable R to read and write Avro files, locally and from HDFS
plyrmr: enable the user to perform common data manipulation operations, as found in plyr and reshape2, on data sets stored in Hadoop

SPARK
Spark with Hadoop 2+
Spark is an alternative in-memory framework to MapReduce.
Supports general workloads as well as streaming, interactive queries, and machine learning, providing performance gains.

Spark jobs can be written in Scala, Python, or Java; APIs are available for all three.
Run the Spark Scala shell with spark-shell and the Spark Python shell with pyspark.
Apache Spark was the world record holder in 2014 for sorting: it sorted 100 TB of data on 207 machines in 23 minutes, whereas Hadoop MapReduce took 72 minutes on 2100 machines.

Spark libraries
Spark SQL: a Spark module for structured data processing, with in-memory processing at its core. Using Spark SQL, you can read data from any structured source, such as JSON, CSV, Parquet, Avro, SequenceFiles, JDBC, Hive, etc. Example:

scala> sqlContext.sql("SELECT * FROM src").collect
scala> hiveContext.sql("SELECT * FROM src").collect
Spark Streaming: write applications to process streaming data in Java or Scala. Receives data from Kafka, Flume, HDFS/S3, Kinesis, and Twitter; pushes data out to HDFS, databases, and dashboards.
MLlib: Spark 2+ has a new optimized library supporting machine learning functions on a cluster, based on the new DataFrame-based API in the spark.ml package.
GraphX: API for graphs and graph-parallel computation.

Flume
Flume was created to let you flow a data stream from a source into Hadoop (note that HDFS files do not support update by default). The source of the data stream can be:

TCP traffic on a port
Logs that are constantly appended, e.g. a web server's log file
Twitter feeds
A minimal Flume agent configuration is sketched below.
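As a hedged sketch, a minimal Flume agent that tails a web server log and writes it to HDFS (the agent name a1, the log path, and the HDFS target directory are assumptions for the example):
# cat > tail-to-hdfs.conf <<'EOF'
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/cloudera/weblogs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOF
# flume-ng agent --name a1 --conf-file tail-to-hdfs.conf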

Kafka
Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data and enables you to pass messages from one endpoint to another. Kafka is suitable for both offline and online message consumption. Kafka messages are persisted on disk and replicated within the cluster to prevent data loss. Kafka is built on top of the ZooKeeper synchronization service. It integrates very well with Apache Storm and Spark for real-time streaming data analysis.
Messages are persisted in a topic; consumers can subscribe to one or more topics and consume all the messages in those topics.
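A hedged sketch of the publish-subscribe flow using Kafka's console tools (the broker/ZooKeeper addresses and the topic name are assumptions, and exact script names and flags vary slightly between Kafka versions):
# kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic clickstream
# kafka-console-producer.sh --broker-list localhost:9092 --topic clickstream          (type messages, one per line)
# kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic clickstream --from-beginning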

Knox
Hadoop clusters are unsecured by default and anyone can call them; we can block direct access to Hadoop using Knox.
The Knox Gateway is a REST API gateway for interacting with Hadoop clusters.
Knox allows control, integration, monitoring, and automation of administrative and analytical tasks.
Knox provides authentication using LDAP and Active Directory; an example request through the gateway follows.
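For example, WebHDFS can be reached through the gateway instead of hitting the NameNode directly (the host, port 8443, the topology name "default", and the guest credentials are assumptions that depend on your Knox configuration):
$ curl -ik -u guest:guest-password \
    'https://knox-host:8443/gateway/default/webhdfs/v1/?op=LISTSTATUS'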

Slider (supports long-running distributed services)
YARN resource management and scheduling work well for batch workloads, but not for interactive or real-time data processing services.
Apache Slider extends YARN to support long-running distributed services on a Hadoop cluster; it supports restart after process failure and Live Long and Process (LLAP).
Applications can be stopped then started; the distribution of the deployed application across the YARN cluster is persisted.

This enables best-effort placement close to the previous locations.
Applications which remember the previous placement of data (such as HBase) can exhibit fast start-up times from this feature.
YARN itself monitors the health of the "YARN containers" hosting parts of the deployed application, and notifies the Slider manager application of container failure.

Slider then asks YARN for a new container, into which Slider deploys a replacement for the failed component, keeping the size of managed applications consistent with the specified configuration.
Slider implements all its functionality through YARN APIs and the existing application shell scripts; the goal was minimal code changes and minimal impact on existing applications.

ZooKeeper

ZooKeeper is a distributed coordination service that manages large sets of nodes. On any partial failure, clients can connect to any node to receive correct, up-to-date information.
Services that depend on ZooKeeper: HBase, MapReduce, and Flume.
ZNode
A znode is a file that persists in memory on the ZooKeeper servers and can be updated by any node in the cluster.
Applications can synchronize their tasks across the distributed cluster by updating their status in a ZooKeeper znode, which then informs the rest of the cluster of a specific node's status change.
ZNode shell commands: create, delete, exists, getChildren, getData, setData, getACL, setACL, sync
Watch events

Any node in the cluster can register to be informed of changes to a specific znode (a watch). Watches are one-time triggers and always ordered; the client sees the watched event before the new znode data.
ZNode watch events: NodeChildrenChanged, NodeCreated, NodeDataChanged, NodeDeleted
A short zkCli.sh sketch follows.
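A minimal illustration of the znode commands and watches using ZooKeeper's command-line client (the /app1 path and its data are made up for the example):
# zkCli.sh -server localhost:2181
[zk: localhost:2181(CONNECTED) 0] create /app1 "config-v1"    (create a znode with initial data)
[zk: localhost:2181(CONNECTED) 1] get /app1 watch              (read data and set a one-time watch)
[zk: localhost:2181(CONNECTED) 2] set /app1 "config-v2"        (triggers NodeDataChanged for watchers)
[zk: localhost:2181(CONNECTED) 3] ls /                         (list child znodes of the root)
[zk: localhost:2181(CONNECTED) 4] delete /app1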

Ambari
Ambari (GUI tool to manage Hadoop)
Ambari Views are developed by Hortonworks. Ambari is a GUI tool you can use to create (install) and manage the entire Hadoop cluster. You can keep expanding the cluster by adding nodes, and monitor its health, space utilization, etc. through Ambari.
Ambari Views exist more to help users use the installed components/services such as Hive, Pig, and the Capacity Scheduler: to see the cluster load, manage YARN workloads, provision cluster resources, manage files, etc.
There is another GUI tool named HUE, developed by Cloudera.

IBM BIG-INSIGHTS V4.0
How to get IBM

BigInsights?
IBM BigInsights for Apache Hadoop Offering Suite:
BigInsights Quick Start Edition
IBM Open Platform with Apache Hadoop
Elite Support for IBM Open Platform with Apache Hadoop

Apache Hadoop Stack: HDFS, YARN, MapReduce, Ambari, HBase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, ZooKeeper, OpenJDK, Knox, Slider
Big SQL - 100% ANSI-compliant, high-performance, secure SQL engine

BigSheets - spreadsheet-like interface for discovery & visualization

Big R - advanced statistical & data mining
Machine Learning with Big R - machine learning algorithms applied to Hadoop data sets

Advanced Text Analytics - visual tooling for annotation and automated text extraction

IBM BigInsights v4

BigInsights Analyst Module
BigInsights Data Scientist Module
BigInsights Enterprise Management Module
BigInsights for Apache Hadoop
* Paid support for IBM Open Platform with Apache Hadoop is required for the BigInsights modules
Enterprise Management - enhanced cluster & resource management & GPFS (POSIX-compliant) file system
Governance Catalog
Cognos BI, InfoSphere Streams, Watson Explorer, Data Click

Pricing & licensing

Free / Free / Yearly subscription for support / Node-based pricing

BigInsights Main Services
GPFS Filesystem

IBM Spectrum Symphony [Adaptive MapReduce]
Big SQL
BigSheets
Text Analytics

Big R

Details
IBM BigInsights Enterprise Management modules
- GPFS FPO
- Adaptive MapReduce

GPFS FPO
Also known as IBM Spectrum Scale

IBM GPFS (distributed filesystem): an HDFS alternative
What is expected from GPFS?
High scale, high performance, high availability, data integrity
Same data accessible from different computers
Logical isolation: filesets are separate filesystems inside a filesystem
Physical isolation: filesets can be put in separate storage pools
Enterprise features (quotas, security, ACLs, snapshots, etc.)

GPFS-FPO

GPFS: General Parallel File System; FPO: File Placement Optimizer

Hadoop file system (HDFS) vs. IBM GPFS file system (GPFS-FPO):
- HDFS files can only be accessed with Hadoop APIs, so standard applications cannot use them. GPFS: any application can access and use it with all the commands used on Windows/Unix.
- HDFS: you should define the size of disk space allocated to HDFS filesystems. GPFS: no need to define the size of disk space allocated to GPFS.
- HDFS does not handle security/access control. GPFS supports access control at the file and disk level.
- HDFS does not replicate metadata and has a single point of failure in the NameNode. GPFS: the distributed metadata feature eliminates any single point of failure (metadata is replicated just like data).
- HDFS must load all the metadata into memory to work. GPFS: metadata doesn't need to be read into memory before the filesystem is available.
- HDFS deals with small numbers of large files only. GPFS deals with large numbers of files of any size, enables mixing of multiple storage types, and supports write-intensive applications.
- HDFS allows concurrent reads but only one writer, with no policy-based archiving. GPFS allows concurrent reads and writes by multiple programs and policy-based archiving: data can be migrated automatically to tape if nobody has touched it for a given period (so as the data ages, it is automatically migrated to less expensive storage).

HDFS vs. GPFS for Hadoop

Make a file available to Hadoop:
HDFS: hadoop fs -copyFromLocal /local/source/path /hdfs/target/path
GPFS/UNIX: cp /source/path /target/path

HDFS: hadoop fs -mv path1/ path2/
GPFS/regular UNIX: mv path1/ path2/

HDFS: diff <(hadoop fs -cat file1) <(hadoop fs -cat file2)
GPFS/regular UNIX: diff file1 file2

IBM Spectrum Symphony [Adaptive MapReduce]
What is Platform Symphony?
Adaptive MapReduce (Platform Symphony): while Hadoop clusters normally run one job at a time, Platform Symphony is designed for concurrency, allowing up to 300 job trackers to run on a single cluster at the same time, with agile reallocation of resources based on real-time changes to job priorities.

Details
IBM BigInsights Analyst modules
- Big SQL
- BigSheets

BIG SQL
What is Big SQL
- Industry-standard SQL query interface for BigInsights data
- New Hadoop query engine derived from decades of IBM R&D investment in RDBMS technology, including database parallelism and query optimization
Why Big SQL

- Easy on-ramp to Hadoop for SQL professionals
- Support familiar SQL tools / applications (via JDBC and ODBC drivers)
What operations are supported
- Create tables / views. Store data in DFS, HBase, or the Hive warehouse
- Load data into tables (from local files, remote files, RDBMSs)
- Query data (project, restrict, join, union, a wide range of sub-queries, built-in functions, UDFs, etc.)
- GRANT / REVOKE privileges, create roles, create column masks and row permissions
- Transparently join / union data between Hadoop and RDBMSs in a single query
- Collect statistics and inspect the detailed data access plan
- Establish workload management controls
- Monitor Big SQL usage

Why choose Big SQL instead of Hive and other vendors?
Performance

Application portability & integration
Data shared with the Hadoop ecosystem
Comprehensive file format support
Superior enablement of IBM and third-party software
Modern MPP runtime
Powerful SQL query rewriter
Cost-based optimizer
Optimized for concurrent user throughput
Results not constrained by memory
Rich SQL
Comprehensive SQL support

IBM SQL PL compatibility
Extensive analytic functions
Federation
Distributed requests to multiple data sources within a single SQL statement
Main data sources supported: DB2 LUW, Teradata, Oracle, Netezza, Informix, SQL Server
Enterprise features
Advanced security/auditing
Resource and workload management
Self-tuning memory management

Comprehensive monitoring

BIG SHEETS
What can you do with BigSheets?
Model big data collected from various sources in spreadsheet-like structures
Filter and enrich content with built-in functions

Combine data in different workbooks
Visualize results through spreadsheets and charts
Export data into common formats (if desired)

Details
IBM BigInsights Data Scientist modules
- Text Analytics
- Big R

Text Analytics
Approach for text analytics

[Workflow diagram: Analysis, Rule development, Performance tuning, Production]
Analysis: access sample input documents, locate examples of the information to be extracted, label snippets, find clues
Rule development: develop and test extractors; create extractors that meet requirements; verify that appropriate data is being extracted
Performance tuning: profile extractors, refine rules for runtime performance, compile modules
Production: export extractors

Big R
Limitations of open source R
R was originally created as a single-user tool
Not naturally designed for parallelism

Cannot easily leverage modern multi-core CPUs
Key take-away: open source R is a powerful tool; however, it has limited functionality in terms of parallelism and memory, thereby bounding the ability to analyze big data.
Big data > RAM
R is designed to run in-memory on a shared-memory machine

Constrains the size of the data that you can reasonably work with (memory capacity limitation)
Forces R users to use smaller datasets; sampling can lead to inaccurate or sub-optimal analysis
[Diagram: R approach - only a sample of all available information is analyzed; Big R approach - all available information is analyzed]

Advantages of Big R
Full integration of R into BigInsights Hadoop

Scalable data processing
Can use existing R assets (code and CRAN packages)
Wide class of algorithms, and growing

What does Big R syntax look like?
Use the "airline" dataset, which contains scheduled flights in the US from 1987 to 2009:

Compute the mean departure delay for each airline on a monthly basis.

Simple Big R example
# Connect to BigInsights
> bigr.connect(host="192.168.153.219", user="bigr", password="bigr")
# Construct a bigr.frame to access a large data set
> air <- bigr.frame(dataSource="DEL", dataPath="airline_demo.csv", )
# Filter flights delayed by 15+ mins at departure or arrival
> airSubset <- air[air$Cancelled == 0 & (air$DepDelay >= 15 | air$ArrDelay >= 15),
                   c("UniqueCarrier", "Origin", "Dest", "DepDelay", "ArrDelay", "CRSElapsedTime")]
# What percentage of flights were delayed overall?
> nrow(airSubset) / nrow(air)
[1] 0.2269586

# What are the longest flights?
> bf <- sort(air, by = air$Distance, decreasing = T)
> bf <- bf[, c("Origin", "Dest", "Distance")]
> head(bf, 3)
  Origin Dest Distance
1    HNL  JFK     4983
2    EWR  HNL     4962
3    HNL  EWR     4962

BIG DATA DEMO
Examples
STEPS
To work on your own machine
Download/Install
Download

the VMware Cloudera CDH image
Install MySQL Workbench to control MySQL using a GUI interface: http://www.devopsservice.com/install-mysql-workbench-on-ubuntu-14-04-and-centos-6/

To work on IBM Cloud
How

to install Hue 3 on IBM BigInsights 4.0 to explore Big Data: http://gethue.com/how-to-install-hue-3-on-ibm-biginsights-4-0-to-explore-big-data/

USING IBM CLOUD
How to work with Hadoop on IBM Cloud
Log in to IBM Cloud (Bluemix) and search for Hadoop. You will get two results, Lite (free) and Subscription (with cost); choose Analytics

Engine.

Using the Cloudera VM
Example 1: Copy a file from the local disk to HDFS
Using the command line:
hadoop fs -put /HD PATH/temperature.csv /Hadoop Path/temp
Or using the HUE GUI.

Example 2: Use Sqoop to move MySQL DB tables into the Hadoop file system inside the Hive directory
> sqoop import-all-tables \
    -m 1 \
    --connect jdbc:mysql://localhost:3306/retail_db \
    --username=retail_dba \
    --password=cloudera \
    --compression-codec=snappy \
    --as-parquetfile \
    --warehouse-dir=/user/hive/warehouse \
    --hive-import

The -m parameter sets the number of .parquet files.
/user/hive/warehouse is the default Hive path.
To view tables after the move to HDFS:
> hadoop fs -ls /user/hive/warehouse/
To get the actual Hive tables path, open a terminal, type hive, then run the command:
set hive.metastore.warehouse.dir;

Hive sample query 1
To view current Hive tables:
show tables;
Run a SQL command on Hive tables:
select c.category_name, count(order_item_quantity) as count
from order_items oi
inner join products p on oi.order_item_product_id = p.product_id
inner join categories c on c.category_id = p.product_category_id
group by c.category_name

order by count desc
limit 10;

Hive sample query 2
select p.product_id, p.product_name, r.revenue
from products p
inner join (
    select oi.order_item_product_id, sum(cast(oi.order_item_subtotal as float)) as revenue
    from order_items oi
    inner join orders o on oi.order_item_order_id = o.order_id
    where o.order_status <> 'CANCELED' and o.order_status <> 'SUSPECTED_FRAUD'
    group by order_item_product_id
) r on p.product_id = r.order_item_product_id
order by r.revenue desc

limit 10;

Cloudera Impala
Impala is an open source Massively Parallel Processing (MPP) SQL engine. Impala doesn't require data to be moved or transformed prior to processing, so it can handle Hive tables and give a performance gain (Impala is 6 to 69 times faster than Hive).

To refresh Impala so it picks up new Hive tables, run the following:
impala-shell
invalidate metadata;
To list the available tables:
show tables;
Now you can run the same Hive SQL commands!

Example 3: Copy the temperature.csv file from the local disk to a new HDFS directory temp, then load this file into a new Hive table

hadoop fs -mkdir -p /user/cloudera/temp
hadoop fs -put /var/www/html/temperature.csv /user/cloudera/temp
Create a Hive table based on the CSV file:
hive> CREATE DATABASE weather;
hive> CREATE EXTERNAL TABLE IF NOT EXISTS weather.temperature (
        place STRING COMMENT 'place',
        year INT COMMENT 'Year',
        month STRING COMMENT 'Month',
        temp FLOAT COMMENT 'temperature')
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
      STORED AS TEXTFILE
      LOCATION '/user/cloudera/temp/';

Example 4: Using HUE, create an Oozie workflow to move data from MySQL/CSV files to Hive
Step 1: get the virtual machine IP using ifconfig
Step 2: navigate to http://IP:8888 to get the HUE login screen (cloudera/cloudera)
Step 3: open Oozie: Workflows > Editors > Workflows, then click the Create button
Oozie icons:
Hive Script (old)

Hive2 Script (new)
Sqoop
Shell
SSH

HDFS FS

Simple Oozie workflow:
1) Delete the HDFS folder
2) Copy the MySQL table as a text file to HDFS
3) Create a Hive table based on this text file
Add a workflow schedule
Set up workflow settings

A workflow can contain variables; to define a new variable use ${Variable}.
Sometimes you need to define the Hive libpath in HUE to work with Hive:
oozie.libpath : /user/oozie/share/lib/hive
