Originally developed by eBay and then contributed to the open source community, Apache Kylin™ is an open-source, distributed analytical data warehouse that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop and Spark. It supports extremely large-scale datasets and can query huge tables with sub-second latency.
The key to Kylin's sub-second latency is pre-calculation: Kylin pre-computes the measures of a data cube, built on a star-schema model, across combinations of dimensions, saves the results in HBase, and then exposes query interfaces such as JDBC, ODBC, and a RESTful API to serve queries in near real time.
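For instance, once a cube has been built, you can submit SQL through the RESTful API. Below is a minimal sketch, assuming Kylin's default port 7070 and the default ADMIN/KYLIN account; replace <kylin-host> with your server address:

# Query the learn_kylin project over Kylin's RESTful API
# (the Authorization value is the base64 encoding of "ADMIN:KYLIN")
curl -X POST http://<kylin-host>:7070/kylin/api/query \
  -H "Authorization: Basic QURNSU46S1lMSU4=" \
  -H "Content-Type: application/json" \
  -d '{"sql": "select count(*) from kylin_sales", "project": "learn_kylin"}'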
Restart the Kylin server to flush the cache.
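A minimal sketch, assuming the Kylin installation path used later in this document:

/usr/local/service/kylin/bin/kylin.sh stop
/usr/local/service/kylin/bin/kylin.sh start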
Log in to the Kylin web UI with the default username and password (ADMIN/KYLIN), select the learn_kylin project from the project drop-down list in the top-left corner, select the sample cube named kylin_sales_cube, click Actions > Build, and select an end date later than January 1, 2014 (to cover all 10,000 sample records).
Click Monitor to view the build progress until it reaches 100%.
Click Insight to execute SQL queries; for example:
select part_dt, sum(price) as total_sold, count(distinct seller_id) as sellers
from kylin_sales
group by part_dt
order by part_dt
Set the kylin.env.hadoop-conf-dir property in kylin.properties so that Kylin can locate the Hadoop client configuration required to run Spark on Yarn.
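For example (the exact directory depends on your cluster; /etc/hadoop/conf is a common default):

kylin.env.hadoop-conf-dir=/etc/hadoop/conf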
Check the Spark configuration.
Kylin embeds a Spark binary (v2.1.2) in $KYLIN_HOME/spark, and all Spark properties prefixed with kylin.engine.spark-conf. can be managed in $KYLIN_HOME/conf/kylin.properties. These properties are extracted and applied when a submitted Spark job is executed; for example, if you configure kylin.engine.spark-conf.spark.executor.memory=4G, Kylin will use --conf spark.executor.memory=4G as a parameter when executing spark-submit.
Before you run Spark cubing, it is recommended that you review these configurations and customize them for your cluster. Below is the recommended configuration, with Spark dynamic resource allocation enabled:
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=cluster
kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true
kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=1
kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=1000
kylin.engine.spark-conf.spark.dynamicAllocation.executorIdleTimeout=300
kylin.engine.spark-conf.spark.yarn.queue=default
kylin.engine.spark-conf.spark.driver.memory=2G
kylin.engine.spark-conf.spark.executor.memory=4G
kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
kylin.engine.spark-conf.spark.executor.cores=1
kylin.engine.spark-conf.spark.network.timeout=600
kylin.engine.spark-conf.spark.shuffle.service.enabled=true
#kylin.engine.spark-conf.spark.executor.instances=1
kylin.engine.spark-conf.spark.eventLog.enabled=true
kylin.engine.spark-conf.spark.hadoop.dfs.replication=2
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress=true
kylin.engine.spark-conf.spark.hadoop.mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
kylin.engine.spark-conf.spark.eventLog.dir=hdfs\:///kylin/spark-history
kylin.engine.spark-conf.spark.history.fs.logDirectory=hdfs\:///kylin/spark-history
## Uncomment for HDP
#kylin.engine.spark-conf.spark.driver.extraJavaOptions=-Dhdp.version=current
#kylin.engine.spark-conf.spark.yarn.am.extraJavaOptions=-Dhdp.version=current
#kylin.engine.spark-conf.spark.executor.extraJavaOptions=-Dhdp.version=current
To run on the Hortonworks platform, you need to specify hdp.version as a Java option for the Yarn containers; therefore, uncomment the last three lines of the configuration above in kylin.properties.
In addition, to avoid repeatedly uploading the Spark jars to Yarn, you can upload them once manually and then configure the jar's HDFS path. The HDFS path must be a fully qualified path name.
jar cv0f spark-libs.jar -C $KYLIN_HOME/spark/jars/ .
hadoop fs -mkdir -p /kylin/spark/
hadoop fs -put spark-libs.jar /kylin/spark/
Then, reference the uploaded jar in kylin.properties as follows:
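A sketch of the entry, using Spark's spark.yarn.archive property through Kylin's kylin.engine.spark-conf. prefix; the NameNode host and port below are placeholders, so substitute your cluster's fully qualified HDFS address:

# Fully qualified HDFS path to the uploaded Spark jars (host/port are placeholders)
kylin.engine.spark-conf.spark.yarn.archive=hdfs://<namenode-host>:8020/kylin/spark/spark-libs.jar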
The kylin.engine.spark-conf.* parameters can also be overwritten at the cube or project level, which gives you more flexibility.
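For example, to give a single memory-hungry cube larger executors, you could add an entry like the following on that cube's "Configuration Overwrites" page (the value is illustrative):

kylin.engine.spark-conf.spark.executor.memory=8G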
Create and modify a sample cube.
Run sample.sh to create a sample cube, and then start the Kylin server:
/usr/local/service/kylin/bin/sample.sh
/usr/local/service/kylin/bin/kylin.sh start
After Kylin starts, access the Kylin web UI and edit the kylin_sales cube: on the "Advanced Setting" page, change Cube Engine from MapReduce to Spark (Beta).
Click Next to enter the "Configuration Overwrites" page, and click +Property to add the kylin.engine.spark.rdd-partition-cut-mb property with a value of 500.
The sample cube has two memory-hungry measures, COUNT DISTINCT and TOPN(100). When the source data is small, their estimated size is much larger than their actual size, which causes more RDD partitions to be split and slows down the build; 500 (MB) is a reasonable value here. Click Next and Save to save the cube.
For cubes without TOPN, please keep the default configuration.
Build a cube with Spark.
Click Build and select the current date as the end date. Kylin will generate a build job on the "Monitor" page, in which the 7th step is Spark cubing. The job engine executes the steps in sequence.
When Kylin executes this step, you can monitor the status in the Yarn resource manager. Click the "Application Master" link to open the web UI of Spark, which will display the progress and details of each stage.
After all the steps complete successfully, the cube status becomes "Ready", and you can run queries against it.