
H2O

Open Source Fast Scalable Machine Learning Platform.



For any question not answered in this file or in H2O-3 Documentation, please use:

  • Ask on GitHub
  • Ask on Stack Overflow
  • Ask on Gitter

H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark. H2O provides implementations of many popular algorithms such as Generalized Linear Models (GLM), Gradient Boosting Machines (including XGBoost), Random Forests, Deep Neural Networks, Stacked Ensembles, Naive Bayes, Generalized Additive Models (GAM), Cox Proportional Hazards, K-Means, PCA, Word2Vec, as well as a fully automatic machine learning algorithm (H2O AutoML).

H2O is extensible so that developers can add data transformations and custom algorithms of their choice and access them through all of those clients. H2O models can be downloaded and loaded into H2O memory for scoring, or exported into POJO or MOJO format for extremely fast scoring in production. More information can be found in the H2O User Guide.
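
For a flavor of the client API, here is a minimal Python sketch (illustrative only; the tiny inline dataset and parameter choices are made up for the example, and a local Java install is assumed) that trains a GBM and exports it as a MOJO:

    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    h2o.init()  # start a local H2O cluster, or attach to one that is already running

    # a tiny made-up frame, just to have something to train on
    df = h2o.H2OFrame({
        "x1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "x2": [0.5, 1.5, 0.7, 1.9, 0.3, 2.1, 0.8, 1.2, 0.4, 1.7],
        "y":  [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    })
    df["y"] = df["y"].asfactor()  # treat the response as categorical

    model = H2OGradientBoostingEstimator(ntrees=5, min_rows=2)
    model.train(x=["x1", "x2"], y="y", training_frame=df)

    # export the trained model as a MOJO for fast production scoring
    print(model.download_mojo(path="."))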

H2O-3 (this repository) is the third incarnation of H2O, and the successor to H2O-2.

Table of Contents

1. Downloading H2O-3

While most of this README is written for developers who do their own builds, most H2O users just download and use a pre-built version. If you are a Python or R user, the easiest way to install H2O is via PyPI or Anaconda (for Python) or CRAN (for R):

Python

pip install h2o

R

install.packages("h2o")

For the latest stable, nightly, Hadoop (or Spark / Sparkling Water) releases, or the stand-alone H2O jar, please visit: https://h2o.ai/download

More info on downloading & installing H2O is available in the H2O User Guide.
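
After installing, a quick way to confirm that the client works is to start (and then shut down) a local H2O instance; for example, in Python (assuming Java is available locally):

    import h2o

    h2o.init()                               # start a local H2O instance, or attach to one already running
    h2o.cluster().show_status()              # print basic cluster information
    h2o.cluster().shutdown(prompt=False)     # stop the local instance when done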

2. Open Source Resources

Most people interact with four primary open source resources: GitHub (which you’ve already found), GitHub Issues (for bug reports and issue tracking), Stack Overflow (for H2O code/software-specific questions), and h2ostream (a Google Group / email discussion forum for questions not suitable for Stack Overflow). There is also a Gitter H2O developer chat group; however, for archival purposes and to maximize accessibility, we’d prefer that standard H2O Q&A be conducted on Stack Overflow.

2.1 Issue Tracking and Feature Requests

You can browse and create new issues in our GitHub repository: https://github.com/h2oai/h2o-3

  • You can browse and search for issues without logging in to GitHub:
    1. Click the Issues tab at the top of the page
    2. Apply filters to search for particular issues
  • To create an issue (either a bug report or a feature request), log in to GitHub, open the Issues tab, and click New issue.

2.2 List of H2O Resources

3. Using H2O-3 Artifacts

Every nightly build publishes R, Python, Java, and Scala artifacts to a build-specific repository. In particular, you can find Java artifacts in the maven/repo directory.

Here is an example snippet of a gradle build file using h2o-3 as a dependency. Replace x, y, z, and nnnn with valid numbers.

// h2o-3 dependency information
def h2oBranch = 'master'
def h2oBuildNumber = 'nnnn'
def h2oProjectVersion = "x.y.z.${h2oBuildNumber}"

repositories {
  // h2o-3 dependencies
  maven {
    url "https://s3.amazonaws.com/h2o-release/h2o-3/${h2oBranch}/${h2oBuildNumber}/maven/repo/"
  }
}

dependencies {
  compile "ai.h2o:h2o-core:${h2oProjectVersion}"
  compile "ai.h2o:h2o-algos:${h2oProjectVersion}"
  compile "ai.h2o:h2o-web:${h2oProjectVersion}"
  compile "ai.h2o:h2o-app:${h2oProjectVersion}"
}

Refer to the latest H2O-3 bleeding edge nightly build page for information about installing nightly build artifacts.

Refer to the h2o-droplets GitHub repository for a working example of how to use Java artifacts with gradle.

Note: Stable H2O-3 artifacts are periodically published to Maven Central (click here to search) but may substantially lag behind H2O-3 Bleeding Edge nightly builds.

4. Building H2O-3

Getting started with H2O development requires JDK 1.8+, Node.js, Gradle, Python and R. We use the Gradle wrapper (called gradlew) to ensure up-to-date local versions of Gradle and other dependencies are installed in your development directory.

4.1. Before building

Building H2O requires a properly set up R environment with the required packages and a Python environment with the following packages:

grip
tabulate
requests
wheel

To install these packages, you can use pip or conda. If you have trouble installing them on Windows, please follow the Setup on Windows section of this guide.

(Note: It is recommended to use a virtual environment such as VirtualEnv to install all packages.)
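
As a quick sanity check (an illustrative snippet, not part of the build itself), you can confirm that these build-time Python dependencies are importable:

    # quick sanity check that the build-time Python dependencies resolve
    import grip
    import requests
    import tabulate
    import wheel
    print("All build-time Python packages are importable.")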

4.2. Building from the command line (Quick Start)

To build H2O from the repository, perform the following steps.

Recipe 1: Clone fresh, build, skip tests, and run H2O

# Build H2O
git clone https://github.com/h2oai/h2o-3.git
cd h2o-3
./gradlew build -x test

You may encounter problems: e.g. npm missing. Install it:
brew install npm

# Start H2O
java -jar build/h2o.jar

# Point browser to http://localhost:54321

Recipe 3: Pull, clean, build, and run tests

git pull
./gradlew syncSmalldata
./gradlew syncRPackages
./gradlew clean
./gradlew build

Recipe 4: Just building the docs

./gradlew clean && ./gradlew build -x test && (export DO_FAST=1; ./gradlew dist)
open target/docs-website/h2o-docs/index.html

Recipe 5: Building using a Makefile

The root of the git repository contains a Makefile with convenient shortcuts for frequent build targets used in development. To build h2o.jar while skipping tests and the building of alternative assemblies, execute

make

To build h2o.jar using the minimal assembly, run

make minimal

The minimal assembly is well suited for development of H2O machine learning algorithms. It doesn’t bundle some heavyweight dependencies (like Hadoop), so using it saves build time and avoids downloading large libraries from Maven repositories.

4.3. Setup on Windows

Step 1: Download and install WinPython.

From the command line, validate that python points to the newly installed version (for example, with where python). Update the PATH environment variable with the WinPython path.

Step 2: Install required Python packages:
pip install grip tabulate requests wheel
Step 3: Install JDK

Install Java 1.8+ and add the JDK's bin directory containing java.exe (for example, C:\Program Files\Java\jdk1.8.0_65\bin) to PATH in Environment Variables. To make sure the command prompt is detecting the correct Java version, run:

javac -version

The CLASSPATH variable also needs to be set to the lib subfolder of the JDK:

CLASSPATH=/<path>/<to>/<jdk>/lib
Step 4. Install Node.js

Install Node.js and add the installation directory C:\Program Files\nodejs, which must include node.exe and npm.cmd, to PATH if it is not already there.

Step 6b. Validate Cygwin

If Cygwin is already installed, remove the Python packages or ensure that Native Python is before Cygwin in the PATH variable.

Step 7. Update or validate the Windows PATH variable to include R, Java JDK, Cygwin.
Step 9. Run the top-level gradle build:
cd h2o-3
./gradlew.bat build

If you encounter errors, run again with --stacktrace for more details on missing dependencies.

4.4. Setup on OS X

If you don’t have Homebrew, we recommend installing it. It makes package management for OS X easy.

Step 1. Install JDK

Install Java 1.8+. To make sure the command prompt is detecting the correct Java version, run:

javac -version
Step 2. Install Node.js:

Using Homebrew:

brew install node

Otherwise, install from the NodeJS website.

Step 4. Install python and the required packages:

Install python:

brew install python

pip is installed alongside Homebrew python; if it is missing, install it:

python -m ensurepip --upgrade

Next install required packages:

sudo pip install wheel requests tabulate  
Step 5. Git Clone h2o-3

OS X should already have Git installed. To download and update the h2o-3 source code:

git clone https://github.com/h2oai/h2o-3
Step 6. Run the top-level gradle build:
cd h2o-3
./gradlew build

Note: on a regular machine it may take a very long time (about an hour) to run all the tests.

If you encounter errors, run again with --stacktrace for more details on missing dependencies.

4.5. Setup on Ubuntu 14.04

Step 1. Install Node.js
curl -sL https://deb.nodesource.com/setup_0.12 | sudo bash -
sudo apt-get install -y nodejs
Step 4. Git Clone h2o-3

If you don’t already have a Git client:

sudo apt-get install git

Download and update the h2o-3 source code:

git clone https://github.com/h2oai/h2o-3
Step 5. Run the top-level gradle build:
cd h2o-3
./gradlew build

If you encounter errors, run again using --stacktrace for more instructions on missing dependencies.

Make sure that you are not running as root, since bower will reject such a run.

4.6. Setup on Ubuntu 13.10

Step 1. Install Node.js
curl -sL https://deb.nodesource.com/setup_16.x | sudo bash -
sudo apt-get install -y nodejs
Steps 2-4. Follow steps 2-4 for Ubuntu 14.04 (above)

Install local R packages:

R -e 'install.packages(c("RCurl","jsonlite","statmod","devtools","roxygen2","testthat"), dependencies=TRUE, repos="http://cran.rstudio.com/")'

Clone and build H2O:

git clone https://github.com/h2oai/h2o-3.git
cd h2o-3
./gradlew syncSmalldata
./gradlew syncRPackages
./gradlew build -x test



<a name="Launching"></a>

## 5. Launching H2O after Building

To start the H2O cluster locally, execute the following on the command line:

    java -jar build/h2o.jar

A list of available start-up JVM and H2O options (e.g. `-Xmx`, `-nthreads`, `-ip`), is available in the [H2O User Guide](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/starting-h2o.html#from-the-command-line).
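
Once the cluster is up, client libraries can attach to it rather than launching their own; for example, a minimal Python sketch:

    import h2o

    # attach to the cluster started above instead of starting a new in-process one
    h2o.connect(ip="localhost", port=54321)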

<a name="BuildingHadoop"></a>
#### Precautions to take when leveraging secure impersonation

*  The target use case for secure impersonation is applications or services that pre-authenticate a user and then use (in this case) the h2odriver on behalf of that user.  H2O's Steam is a perfect example: authenticate the user in a web application over SSL, then impersonate that user when creating the H2O YARN container.
*  The proxy user should have limited permissions in the Hadoop cluster; this means no permissions to access data or make API calls.  In this way, if it's compromised it would only have the power to impersonate a specific subset of the users in the cluster and only from specific machines.
*  Use the `hadoop.proxyuser.<proxyusername>.hosts` property whenever possible or practical.
*  Don't give the proxy user's password or keytab to any user you don't want to be able to impersonate another user (which is generally *any* user).  The point of impersonation is not to allow users to impersonate each other.  See the first bullet for the typical use case.
*  Limit user logon to the machine the proxying is occurring from whenever practical.
*  Make sure the keytab used to login the proxy user is properly secured and that users can't login as that id (via `su`, for instance)
*  Never set `hadoop.proxyuser.<proxyusername>.{users,groups}` to `*` or to `hdfs`, `yarn`, etc.  Allowing any user to impersonate hdfs, yarn, or any other important user/group should be done with extreme caution and carefully analyzed before it's allowed.

#### Risks with secure impersonation

*  The id performing the impersonation can be compromised like any other user id.
*  Setting any `hadoop.proxyuser.<proxyusername>.{hosts,groups,users}` property to '*' can greatly increase exposure to security risk.
*  When users aren't authenticated before being impersonated by the driver (unlike Steam, which authenticates them via a secure web app/API), auditability of the process/system is difficult.


The following example diff adds the `h2o-persist-hdfs` module to the `h2o-app` assembly and switches the HDFS persistence layer from the CDH `hadoop-client` to a MapR build:

    $ git diff
    diff --git a/h2o-app/build.gradle b/h2o-app/build.gradle
    index af3b929..097af85 100644
    --- a/h2o-app/build.gradle
    +++ b/h2o-app/build.gradle
    @@ -8,5 +8,6 @@ dependencies {
       compile project(":h2o-algos")
       compile project(":h2o-core")
       compile project(":h2o-genmodel")
    +  compile project(":h2o-persist-hdfs")
     }

    diff --git a/h2o-persist-hdfs/build.gradle b/h2o-persist-hdfs/build.gradle
    index 41b96b2..6368ea9 100644
    --- a/h2o-persist-hdfs/build.gradle
    +++ b/h2o-persist-hdfs/build.gradle
    @@ -2,5 +2,6 @@ description = "H2O Persist HDFS"

     dependencies {
       compile project(":h2o-core")
    -  compile("org.apache.hadoop:hadoop-client:2.0.0-cdh4.3.0")
    +  compile("org.apache.hadoop:hadoop-client:2.4.1-mapr-1408")
    +  compile("org.json:org.json:chargebee-1.0")
     }

<a name="Sparkling"></a>
## 7. Sparkling Water

Sparkling Water combines two open-source technologies: Apache Spark and the H2O Machine Learning platform.  It makes H2O’s library of advanced algorithms, including Deep Learning, GLM, GBM, K-Means, and Distributed Random Forest, accessible from Spark workflows. Spark users can select the best features from either platform to meet their Machine Learning needs.  Users can combine Spark's RDD API and Spark MLLib with H2O’s machine learning algorithms, or use H2O independently of Spark for the model building process and post-process the results in Spark.
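
As a rough illustration (treat this as a sketch: the exact entry points vary between Sparkling Water versions, and `spark_df` is assumed to be an existing Spark DataFrame), PySparkling exposes the bridge roughly like this:

    from pysparkling import H2OContext

    hc = H2OContext.getOrCreate()         # start H2O inside the running Spark application
    h2o_frame = hc.asH2OFrame(spark_df)   # hand a Spark DataFrame to H2O as an H2OFrame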

**Sparkling Water Resources**:

* [Download page for pre-built packages](http://h2o.ai/download/)
* [Sparkling Water GitHub repository](https://github.com/h2oai/sparkling-water)  
* [README](https://github.com/h2oai/sparkling-water/blob/master/README.md)
* [Developer documentation](https://github.com/h2oai/sparkling-water/blob/master/DEVEL.md)

<a name="Documentation"></a>
## 8. Documentation

### Documentation Homepage

The main H2O documentation is the [H2O User Guide](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html).  Visit <http://docs.h2o.ai> for the top-level introduction to documentation on H2O projects.


### Generate REST API documentation

To generate the REST API documentation, use the following commands:

    cd ~/h2o-3
    cd py
    python ./generate_rest_api_docs.py  # to generate Markdown only
    python ./generate_rest_api_docs.py --generate_html  --github_user GITHUB_USER --github_password GITHUB_PASSWORD # to generate Markdown and HTML

The default location for the generated documentation is `build/docs/REST`.

If the build fails, try `gradlew clean`, then `git clean -f`.

### Bleeding edge build documentation

Documentation for each bleeding edge nightly build is available on the [nightly build page](http://s3.amazonaws.com/h2o-release/h2o/master/latest.html).


<a name="Citing"></a>
## 9. Citing H2O

If you use H2O as part of your workflow in a publication, please cite your H2O resource(s) using the following BibTeX entry:

### H2O Software

	@Manual{h2o_package_or_module,
	    title = {package_or_module_title},
	    author = {H2O.ai},
	    year = {year},
	    month = {month},
	    note = {version_information},
	    url = {resource_url},
	}

**Formatted H2O Software citation examples**:

- H2O.ai (Oct. 2016). _Python Interface for H2O_, Python module version 3.10.0.8. [https://github.com/h2oai/h2o-3](https://github.com/h2oai/h2o-3).
- H2O.ai (Oct. 2016). _R Interface for H2O_, R package version 3.10.0.8. [https://github.com/h2oai/h2o-3](https://github.com/h2oai/h2o-3).
- H2O.ai (Oct. 2016). _H2O_, H2O version 3.10.0.8. [https://github.com/h2oai/h2o-3](https://github.com/h2oai/h2o-3).
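
The exact version string to cite can be read from the installed client; for example, in Python:

    import h2o
    print(h2o.__version__)   # e.g. "3.10.0.8"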

### H2O Booklets

H2O algorithm booklets are available at the [Documentation Homepage](http://docs.h2o.ai/h2o/latest-stable/index.html).

	@Manual{h2o_booklet_name,
	    title = {booklet_title},
	    author = {list_of_authors},
	    year = {year},
	    month = {month},
	    url = {link_url},
	}

**Formatted booklet citation examples**:

Arora, A., Candel, A., Lanford, J., LeDell, E., and Parmar, V. (Oct. 2016). _Deep Learning with H2O_. <http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/DeepLearningBooklet.pdf>.

Click, C., Lanford, J., Malohlava, M., Parmar, V., and Roark, H. (Oct. 2016). _Gradient Boosted Models with H2O_. <http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/GBMBooklet.pdf>.

<a name="Community"></a>
## 10. Community

H2O has been built by a great many contributors over the years, both within H2O.ai (the company) and the greater open source community.  You can begin to contribute to H2O by answering [Stack Overflow](http://stackoverflow.com/questions/tagged/h2o) questions or [filing bug reports](https://github.com/h2oai/h2o-3/issues).  Please join us!


### Team & Committers

SriSatish Ambati Cliff Click Tom Kraljevic Tomas Nykodym Michal Malohlava Kevin Normoyle Spencer Aiello Anqi Fu Nidhi Mehta Arno Candel Josephine Wang Amy Wang Max Schloemer Ray Peck Prithvi Prabhu Brandon Hill Jeff Gambera Ariel Rao Viraj Parmar Kendall Harris Anand Avati Jessica Lanford Alex Tellez Allison Washburn Amy Wang Erik Eckstrand Neeraja Madabhushi Sebastian Vidrio Ben Sabrin Matt Dowle Mark Landry Erin LeDell Andrey Spiridonov Oleg Rogynskyy Nick Martin Nancy Jordan Nishant Kalonia Nadine Hussami Jeff Cramer Stacie Spreitzer Vinod Iyengar Charlene Windom Parag Sanghavi Navdeep Gill Lauren DiPerna Anmol Bal Mark Chan Nick Karpov Avni Wadhwa Ashrith Barthur Karen Hayrapetyan Jo-fai Chow Dmitry Larko Branden Murray Jakub Hava Wen Phan Magnus Stensmo Pasha Stetsenko Angela Bartz Mateusz Dymczyk Micah Stubbs Ivy Wang Terone Ward Leland Wilkinson Wendy Wong Nikhil Shekhar Pavel Pscheidl Michal Kurka Veronika Maurerova Jan Sterba Jan Jendrusak Sebastien Poirier Tomáš Frýda Ard Kelmendi Yuliia Syzon Adam Valenta Marek Novotny


<a name="Advisors"></a>
## Advisors

Scientific Advisory Council

Stephen Boyd Rob Tibshirani Trevor Hastie


Systems, Data, FileSystems and Hadoop

Doug Lea Chris Pouliot Dhruba Borthakur


<a name="Investors"></a>
## Investors

Jishnu Bhattacharjee, Nexus Venture Partners Anand Babu Periasamy Anand Rajaraman Ash Bhardwaj Rakesh Mathur Michael Marks Egbert Bierman Rajesh Ambati


Articles

  • coming soon...