Amazon Web Services (AWS) Databricks
This topic explains how to deploy Unravel on Amazon Web Services (AWS) Databricks.
Create AWS components
Install Unravel
Configure and restart Unravel
Configure Databricks with Unravel
Install a compatible version of MySQL server and database.
On CentOS 6:
wget https://dev.mysql.com/get/mysql80-community-release-el6-1.noarch.rpm
sudo yum install yum-utils
sudo rpm -ivh mysql80-community-release-el6-1.noarch.rpm
sudo yum-config-manager --disable mysql80-community
sudo yum-config-manager --enable mysql57-community
sudo yum install mysql-community-server
On CentOS 7:
wget https://dev.mysql.com/get/mysql80-community-release-el7-1.noarch.rpm
sudo rpm -ivh mysql80-community-release-el7-1.noarch.rpm
sudo yum-config-manager --disable mysql80-community
sudo yum-config-manager --enable mysql57-community
sudo yum install mysql-community-server
On SELinux:
If you are installing MySQL on an SELinux host and are not using the default datadir, see Deploying Unravel on security-enhanced Linux.
Note
PostgreSQL is supported from Unravel version 4.6.1.6.
You can either use the Unravel bundled PostgreSQL or external PostgreSQL version 12.
Install Unravel. The Unravel bundled PostgreSQL is automatically installed.
Run the following commands in the given sequence to set up and connect the bundled PostgreSQL:
sudo /usr/local/unravel/bin/emdb_enable.sh
sudo /etc/init.d/unravel_all.sh restart
Download PostgreSQL 12.
sudo yum install https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
Run the following commands to install:
sudo yum install postgresql12-server
sudo /usr/pgsql-12/bin/postgresql-12-setup initdb
sudo systemctl start postgresql-12
sudo systemctl enable postgresql-12
Review the Virtual Private Cloud (VPC) Peering options to connect Databricks with the Unravel VM.
Workspace | VPC Peering Options |
---|---|
Workspace and Unravel VM are in the same VPC | - |
Workspace VPC is in a different Region | Use VPC Peering: |
Workspace VPC is in a different AWS account | Use VPC Peering: |
Install the following Unravel prerequisites on the EC2 instance:
Install ntpd.
sudo su -
yum install ntp
ntpd -u ntp:ntp
Prepare the data disk. Set permissions for Unravel and symlink Unravel's directories to the /srv mount.
mkdir -p /srv/local/unravel
chmod -R 755 /srv/local
ln -s /srv/local/unravel /usr/local/unravel
chmod 755 /usr/local/unravel
Install MySQL if not done already.
yum install mysql
Install the Databricks File System (DBFS) command-line interface.
sudo bash
yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum install python-pip
pip install databricks-cli
Note
You can test the connectivity using the DBFS command-line interface. If you encounter errors such as Error: ValueError: Timeout value connect was Timeout, reinstall the DBFS command-line interface using Python virtualenv as follows:
pip install databricks-cli
yum install python3
virtualenv -p /usr/bin/python3 mypy3
source mypy3/bin/activate
pip install databricks-cli
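The connectivity test mentioned in the note can be sketched as a small shell function. This is a minimal sketch, not part of Unravel: check_dbfs is a hypothetical helper name, and it assumes the CLI has already been configured with a workspace URL and token (for example with databricks configure --token).

```shell
# check_dbfs is a hypothetical helper, not shipped with Unravel or Databricks.
# It assumes the CLI was configured beforehand (databricks configure --token).
check_dbfs() {
  # Fail early with a distinct exit code if the dbfs command is not on PATH.
  if ! command -v dbfs >/dev/null 2>&1; then
    echo "dbfs command not found; install databricks-cli first" >&2
    return 127
  fi
  # Listing the DBFS root exercises both network access and the token.
  dbfs ls dbfs:/
}
```

Calling check_dbfs and inspecting its exit status gives a quick pass or fail signal before you continue with the Unravel installation.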
Download the latest RPM for Cloud.
Install the RPM using the following command:
sudo rpm -ivh cloud_rpm 2> /tmp/rpm-install-log.txt
Edit /usr/local/unravel/etc/unravel.properties. Set the following properties:
com.unraveldata.cluster.type=DB
com.unraveldata.tagging.enabled=true
If these properties are not present, add them to the file.
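The "set it if present, add it if missing" rule for unravel.properties can be scripted. The following is a minimal sketch under stated assumptions: set_prop is a hypothetical helper (not an Unravel tool), and it assumes each property sits on its own line in key=value form.

```shell
# set_prop FILE KEY VALUE
# Hypothetical helper: replaces the line starting with KEY= if it exists,
# otherwise appends KEY=VALUE at the end of FILE.
set_prop() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)"
  awk -v k="$key" -v v="$value" '
    BEGIN { found = 0 }
    # Plain string match on the line prefix, so dots in the key are literal.
    index($0, k "=") == 1 { print k "=" v; found = 1; next }
    { print }
    END { if (!found) print k "=" v }
  ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps owner and mode of FILE
  rm -f "$tmp"
}
```

For example, set_prop /usr/local/unravel/etc/unravel.properties com.unraveldata.cluster.type DB followed by the same call for com.unraveldata.tagging.enabled with the value true applies both settings. Back up the file first, since the helper rewrites it in place.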
Using MySQL, create a database and user for Unravel. Enter the MySQL admin login password when prompted:
mysql> CREATE DATABASE unravel_mysql_prod;
mysql> CREATE USER '<Unravel database user>'@'<MySQL server name>' IDENTIFIED BY '<Unravel database password>';
mysql> GRANT ALL PRIVILEGES ON unravel_mysql_prod.* TO '<Unravel database user>'@'<MySQL server name>';
Configure MySQL in /usr/local/unravel/etc/unravel.properties.
unravel.jdbc.username=<Unravel database user>
unravel.jdbc.password=<Unravel database password>
unravel.jdbc.url=jdbc:mysql://<MySQL server name>:3306/unravel_mysql_prod
unravel.jdbc.url.params=useSSL=true&requireSSL=false
Install the MySQL JDBC connector driver in the Unravel classpath.
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.47.tar.gz -O /tmp/mysql-connector-java-5.1.47.tar.gz
tar xvzf /tmp/mysql-connector-java-5.1.47.tar.gz -C /tmp
sudo mkdir -p /usr/local/unravel/share/java
sudo cp /tmp/mysql-connector-java-5.1.47/mysql-connector-java-5.1.47.jar /usr/local/unravel/share/java
sudo chown unravel:unravel /usr/local/unravel/share/java/mysql-connector-java-5.1.47.jar
Create database and tables for Unravel.
/usr/local/unravel/dbin/db_schema_upgrade.sh
Note
PostgreSQL is supported from Unravel version 4.6.1.6.
Run psql and create a database and user for Unravel. Enter the psql admin login password when prompted:
sudo -u postgres psql
create database unravel;
create user unravel with encrypted password 'unraveldata';
grant all privileges on database unravel to unravel;
ALTER USER unravel WITH SUPERUSER;
alter user unravel with createdb createrole inherit replication bypassrls;
grant connect on database unravel to unravel;
grant usage on schema public to unravel;
grant all privileges on all tables in schema public to unravel;
grant all privileges on all sequences in schema public to unravel;
alter default privileges in schema public grant all privileges on tables to unravel;
alter default privileges in schema public grant all privileges on sequences to unravel;
grant pg_read_server_files to unravel;
grant pg_write_server_files to unravel;
grant pg_execute_server_program to unravel;
Allow user/server to connect to the database.
Option 1: If PostgreSQL is installed on the same server as Unravel, add the following line in /var/lib/pgsql/12/data/pg_hba.conf as the first line of the IPv4 local connections section:
host all unravel 127.0.0.1/32 md5
Option 2: If PostgreSQL is installed on a different server, do the following:
Add the following line in /var/lib/pgsql/12/data/pg_hba.conf:
host all unravel <Unravel Server Internal IP Address>/32 md5
Update /var/lib/pgsql/12/data/postgresql.conf and ensure listen_addresses = '*' is set so that PostgreSQL listens on all network interfaces.
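Both options add a pg_hba.conf line of the same shape, differing only in the client address. As a minimal sketch, the line can be generated with a small function that also rejects input that does not look like an IPv4 address. pg_hba_entry is a hypothetical helper name, and the check is deliberately loose (it does not validate that each octet is at most 255).

```shell
# pg_hba_entry IP
# Hypothetical helper: prints a pg_hba.conf line that lets the "unravel"
# database user connect from a single IPv4 address with md5 authentication.
pg_hba_entry() {
  local ip="$1"
  # Loose shape check only: four dot-separated groups of 1 to 3 digits.
  if ! printf '%s' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    echo "not an IPv4 address: $ip" >&2
    return 1
  fi
  printf 'host    all    unravel    %s/32    md5\n' "$ip"
}
```

pg_hba_entry 127.0.0.1 produces the line for Option 1, and passing the Unravel server's internal IP address produces the line for Option 2. The output still has to be placed in pg_hba.conf by hand, because its position relative to other rules matters.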
Add the following properties in /usr/local/unravel/etc/unravel.properties:
unravel.jdbc.username=unravel
unravel.jdbc.password=unraveldata
unravel.jdbc.url=jdbc:postgresql://127.0.0.1:5432/unravel
Restart PostgreSQL.
sudo systemctl restart postgresql-12.service
Install the PostgreSQL JDBC connector driver in the Unravel classpath.
sudo mkdir -p /usr/local/unravel/share/java
sudo wget https://jdbc.postgresql.org/download/postgresql-42.2.18.jar -O /usr/local/unravel/share/java/postgresql-42.2.18.jar
sudo chown unravel:unravel /usr/local/unravel/share/java/postgresql-42.2.18.jar
Test database access and update the database schema.
/usr/local/unravel/install_bin/db_access.sh
/usr/local/unravel/dbin/db_schema_upgrade.sh
In Databricks, go to Workspace > Admin Console > Access Control and enable Personal Access Tokens. See Enable token-based authentication.
Go to Workspace > User Settings > Access Tokens and click Generate New Token. See Authenticate using Databricks personal access tokens. Set the lifetime of the token to indefinite.
Install Unravel agents on the Workspace and update the Unravel configuration with the Workspace details. Refer to Running the databricks_setup.sh script.
Note
Run the following commands only if the Databricks command-line interface is installed using Python virtualenv.
sudo bash
virtualenv -p /usr/bin/python3 mypy3
source mypy3/bin/activate
/usr/local/unravel/install_bin/databricks_setup.sh --add-workspace -i <Workspace ID> -n <Workspace name> -t <Workspace token> -r https://<Workspace instance> -p <workspace_tier> -u <Unravel DNS or IP Address>:4043
Restart all Unravel services.
service unravel_all.sh restart
Using a supported web browser (see the compatibility matrix for AWS Databricks), navigate to http://unravel-host:3000 and log in with username admin and password unraveldata.
In your Databricks workspace, update the following tabs under Advanced Options for every cluster (Automated/Interactive) that you want to monitor:
Spark
Copy the following snippet to Spark > Spark Conf. Replace <Unravel DNS or IP Address>. This snippet is also generated by the Databricks setup script on Unravel.
Note
For spark-submit jobs, click Configure spark-submit and copy the following snippet into the Set Parameters > Parameters text box as spark-submit parameters. Replace <Unravel DNS or IP Address>.
"--conf", "spark.eventLog.enabled=true",
"--conf", "spark.eventLog.dir=dbfs:/databricks/unravel/eventLogs/",
"--conf", "spark.unravel.shutdown.delay.ms=300",
"--conf", "spark.unravel.server.hostport=<Unravel DNS or IP Address>:4043",
"--conf", "spark.executor.extraJavaOptions= -Dcom.unraveldata.client.rest.request.timeout.ms=1000 -Dcom.unraveldata.client.rest.conn.timeout.ms=1000 -javaagent:/dbfs/databricks/unravel/unravel-agent-pack-bin/btrace-agent.jar=config=executor,libs=spark-2.3",
"--conf", "spark.driver.extraJavaOptions= -Dcom.unraveldata.client.rest.request.timeout.ms=1000 -Dcom.unraveldata.client.rest.conn.timeout.ms=1000 -javaagent:/dbfs/databricks/unravel/unravel-agent-pack-bin/btrace-agent.jar=config=driver,script=StreamingProbe.btclass,libs=spark-2.3"
spark.eventLog.enabled true
spark.eventLog.dir dbfs:/databricks/unravel/eventLogs/
spark.unravel.server.hostport <Unravel DNS or IP Address>:4043
spark.unravel.shutdown.delay.ms 300
spark.executor.extraJavaOptions -Dcom.unraveldata.client.rest.request.timeout.ms=1000 -Dcom.unraveldata.client.rest.conn.timeout.ms=1000 -javaagent:/dbfs/databricks/unravel/unravel-agent-pack-bin/btrace-agent.jar=config=executor,libs=spark-2.3
spark.driver.extraJavaOptions -Dcom.unraveldata.client.rest.request.timeout.ms=1000 -Dcom.unraveldata.client.rest.conn.timeout.ms=1000 -javaagent:/dbfs/databricks/unravel/unravel-agent-pack-bin/btrace-agent.jar=config=driver,script=StreamingProbe.btclass,libs=spark-2.3
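To avoid hand-editing the placeholder for each cluster, the Spark Conf snippet above can be generated from the Unravel address. This is a minimal sketch: unravel_spark_conf is a hypothetical helper name, and the property values are copied from the snippet above with only the host substituted.

```shell
# unravel_spark_conf HOST
# Hypothetical helper: prints the Spark Conf lines shown above, with
# <Unravel DNS or IP Address> replaced by HOST.
unravel_spark_conf() {
  local host="$1"
  local agent="/dbfs/databricks/unravel/unravel-agent-pack-bin/btrace-agent.jar"
  local timeouts="-Dcom.unraveldata.client.rest.request.timeout.ms=1000 -Dcom.unraveldata.client.rest.conn.timeout.ms=1000"
  cat <<EOF
spark.eventLog.enabled true
spark.eventLog.dir dbfs:/databricks/unravel/eventLogs/
spark.unravel.server.hostport ${host}:4043
spark.unravel.shutdown.delay.ms 300
spark.executor.extraJavaOptions ${timeouts} -javaagent:${agent}=config=executor,libs=spark-2.3
spark.driver.extraJavaOptions ${timeouts} -javaagent:${agent}=config=driver,script=StreamingProbe.btclass,libs=spark-2.3
EOF
}
```

Running unravel_spark_conf with your Unravel DNS name or IP address prints six lines that can be pasted into Spark > Spark Conf.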
Logging
Select DBFS as Destination, and copy the following as Cluster Log Path.
dbfs:/cluster-logs/
Init Scripts
In the Init Scripts tab, set Destination to DBFS. Copy the following as the Init script path and click Add.
dbfs:/databricks/unravel/unravel-db-sensor-archive/dbin/install-unravel.sh