TEZ-4682: [Cloud] Tez AM docker image#456

Open
Aggarwal-Raghav wants to merge 3 commits into apache:master from Aggarwal-Raghav:TEZ-4682

Conversation

@Aggarwal-Raghav
Contributor

No description provided.

@Aggarwal-Raghav
Contributor Author

Aggarwal-Raghav commented Jan 24, 2026

@abstractdog, I was able to start DAGAppMaster with ZooKeeper locally. Attaching the container logs: docker_logs.txt

docker run -d \
        --name tez-am \
        -p 10001:10001 \
        -e TEZ_FRAMEWORK_MODE="STANDALONE_ZOOKEEPER" apache/tez-am:1.0.0-SNAPSHOT
brew install zookeeper
zkServer start

But this PR has a lot of open items and I need some advice on the following:

  1. Is the docker directory inside tez-dist fine, or should I create a separate sub-module for the Dockerfile-related code, to be executed after the tez-dist module?
  2. This image will presumably be run with ZK + K8s + S3. The question is whether we need a Hadoop tarball inside this image, just in case, for some third-party jars etc. If my understanding is correct, it shouldn't be there, but I've kept it for now. Will remove it if you say so.
  3. In DAGAppMaster#main() there are a lot of env variables, which I have mocked for now in entrypoint.sh. I'll try to improve this (suggestions are welcome here).
  4. My tez-site.xml is not getting picked up from the classpath by
    Configuration conf = new Configuration();
    Will debug that.
  5. Is there any way to test this AM container without YARN by running some job?

@abstractdog
Contributor

abstractdog commented Jan 26, 2026

Very good, let me check this in detail sometime this week. In the meantime, here are some pointers in response to your questions:

  1. I believe we can follow Apache Hive in this area, feel free to do something like here: https://github.com/apache/hive/tree/master/packaging

  2. We should keep the Hadoop jars. Even if the k8s environment is no longer a Hadoop/YARN environment, Tez depends heavily on Hadoop at both compile time and runtime, and this is something we don't intend to break in the short or medium term.

  3. I'll check it. What we should really be clear about is e.g.

# 3. NodeManager Details
export NM_HOST=${NM_HOST:-"localhost"}
export NM_PORT=${NM_PORT:-"12345"}

There is no YARN NodeManager in a k8s environment, so anyone reading entrypoint.sh should see a clear distinction between the env vars that are actually needed and the legacy/backward-compatible ones. That's what should be handled with care, in my opinion.

  4. Okay.

  5. Yes. Given that neither Tez containers (TEZ-4665) nor LLAP containers (HIVE-29411) are implemented yet, we cannot successfully run a whole DAG, but we can get to a point where at least a DAG is successfully submitted from Hive to this AM container. To make this happen, I believe we need an HS2 container (see the Hive instructions for the dockerized setup) to be able to find this Tez AM container, so most probably we need to stop using tez.local.mode=true for this experiment.
    UPDATE: after creating HIVE-29419 for a separate TezAM image in Hive, testing this AM could be as simple as opening a TezClient to the AM container and submitting a DAG (with documentation attached).
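
One way the distinction asked for in point 3 might look in entrypoint.sh is sketched below. This is only an illustration, not the script from this PR: the helper names require_env and default_legacy_yarn_env are hypothetical, and which variables count as required versus legacy is an assumption based on the names mentioned in this thread.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of the PR's entrypoint.sh.

# Variables the AM genuinely needs: fail fast when one of them is missing.
require_env() {
  for name in "$@"; do
    # Indirectly read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "ERROR: required environment variable $name is not set" >&2
      return 1
    fi
  done
}

# Legacy YARN-era variables: there is no NodeManager in k8s, so these are
# placeholders kept only for backward compatibility (assumed grouping).
default_legacy_yarn_env() {
  export NM_HOST="${NM_HOST:-localhost}"
  export NM_PORT="${NM_PORT:-12345}"
}
```

An entrypoint could then call `require_env TEZ_FRAMEWORK_MODE || exit 1` followed by `default_legacy_yarn_env` before launching the AM, so a reader sees at a glance which group each variable belongs to.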

@Aggarwal-Raghav
Contributor Author

Aggarwal-Raghav commented Jan 27, 2026

Thanks for the pointers, @abstractdog.

  1. Yes, the implementation is reminiscent of Hive (TBH, pom.xml, build-docker.sh, and some parts of the Dockerfile are taken from Hive to some extent).
  2. For a basic startup of the Tez AM without the Hadoop jars, I didn't observe any issue. The Tez tarball contains a few Hadoop jars, and I think they and their transitive dependencies are sufficient for the Tez AM to act as a client of Hadoop services (but I have a commit ready in case we later want to remove the Hadoop tarball).
  3. No update. I believe a code change in DAGAppMaster is required for the segregation.
  4. Raised TEZ-4685: DagAppMaster is not picking tez-site.xml from classpath in zookeeper mode #458

A few additional things:

  1. DAGAppMaster#serviceInit() => DAGAppMaster#createTaskSchedulerManager tries to connect to the ResourceManager even in ZooKeeper mode. I think we shouldn't use the YARN scheduler and could maybe move to YuniKorn (we use it with Spark internally). Let me know how to proceed with this. For now, should I raise a PR that skips it if ZK mode is enabled?
2026-01-27 19:13:06,207 INFO zookeeper.ZkAMRegistry: Added AMRecord to zkpath /tez-external-sessions/tez_am/server/application_1769280834537_0000
2026-01-27 19:13:06,208 INFO app.DAGAppMaster: Added AMRecord: {hostName=2d0733bd53ae, externalId=tez-session-, hostIp=172.17.0.2, port=10001, computeName=default-compute, appId=application_1769280834537_0000} to registry..
2026-01-27 19:13:06,210 INFO rm.TaskSchedulerManager: Creating YARN TaskScheduler: org.apache.tez.dag.app.rm.DagAwareYarnTaskScheduler
2026-01-27 19:13:06,253 INFO conf.Configuration: resource-types.xml not found
2026-01-27 19:13:06,253 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2026-01-27 19:13:06,259 INFO Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2026-01-27 19:13:06,259 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2026-01-27 19:13:06,263 INFO rm.DagAwareYarnTaskScheduler: scheduler initialized with maxRMHeartbeatInterval:1000 reuseEnabled:true reuseRack:true reuseAny:false localityDelay:250 preemptPercentage:10 preemptMaxWaitTime:60000 numHeartbeatsBetweenPreemptions:3 idleContainerMinTimeout:5000 idleContainerMaxTimeout:10000 sessionMinHeldContainers:0
2026-01-27 19:13:06,267 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8030
2026-01-27 19:13:07,572 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2026-01-27 19:13:08,580 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2026-01-27 19:13:09,588 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2026-01-27 19:13:10,595 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
  2. Disable tez.am.ui, as it also uses the YARN RM proxy.

@Aggarwal-Raghav
Contributor Author

DAGAppMaster#serviceInit() => DAGAppMaster#createTaskSchedulerManager tries to connect to the ResourceManager even in ZooKeeper mode. I think we shouldn't use the YARN scheduler and could maybe move to YuniKorn (we use it with Spark internally). Let me know how to proceed with this. For now, should I raise a PR that skips it if ZK mode is enabled?

Using tez.local.mode=true solves this, as it will use LocalTaskScheduler.
The DAG is up and ready:
[Screenshot 2026-02-13 at 12 16 32 AM]

Aggarwal-Raghav changed the title from "[DRAFT] [WIP] TEZ-4682: [Cloud] Tez AM docker image" to "TEZ-4682: [Cloud] Tez AM docker image" on Feb 15, 2026

@Aggarwal-Raghav
Contributor Author

@abstractdog, can you please help with the review?

@abstractdog
Contributor

let me get back to this next week

@Aggarwal-Raghav
Contributor Author

sure

mvn clean install -DskipTests -Pdocker,tools
```

2. Install zookeeper in mac by:
Contributor

Can you add Ubuntu steps? It would be kind of us to make Linux users' lives easier.

Contributor

@abstractdog abstractdog Feb 26, 2026

UPDATE: can we use a dockerized ZooKeeper instead? Installing ZK on the host machine seems to go against this whole cloud/docker initiative (also, if there are problems or the ZK nodes get messed up, deleting and restarting a container feels easier and cleaner to me).
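
A dockerized ZooKeeper along these lines could be sketched as follows. This is a minimal illustration and not part of the PR: the image tag zookeeper:3.9, the container name tez-zk, and the helper names start_zk and reset_zk are all assumptions, and how the Tez AM container reaches this ZooKeeper (network and tez-site.xml settings) is not shown.

```shell
#!/bin/sh
# Sketch only: image tag, container name, and helper names are assumptions.
ZK_IMAGE="${ZK_IMAGE:-zookeeper:3.9}"
ZK_CONTAINER="${ZK_CONTAINER:-tez-zk}"

# Start a throwaway ZooKeeper container, publishing the default client port.
start_zk() {
  docker run -d --name "$ZK_CONTAINER" -p 2181:2181 "$ZK_IMAGE"
}

# Wipe all ZK state by removing the container, then start a fresh one.
reset_zk() {
  docker rm -f "$ZK_CONTAINER" >/dev/null 2>&1 || true
  start_zk
}
```

With helpers like these, recovering from messed-up ZK nodes is a single `reset_zk` call rather than a manual cleanup of a host installation.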


# Tez AM Container Environment Configuration

HADOOP_USER_NAME=tez
Contributor

Nitpick: can you order the env vars here the same way they are ordered in the entrypoint script?

tez-dist/pom.xml Outdated
Comment on lines 138 to 144
<argument>${project.basedir}/src/docker/build-docker.sh</argument>
<argument>-hadoop</argument>
<argument>${hadoop.version}</argument>
<argument>-tez</argument>
<argument>${project.version}</argument>
<argument>-repo</argument>
<argument>apache</argument>
Contributor

can you make it similar to what I can see in Hive? much less verbose, e.g.

              <arguments>
                <argument>.... .sh</argument>
                <argument>-hadoop ${hadoop.version}</argument>
                <argument>-tez ${tez.version}</argument>
              </arguments>

@Aggarwal-Raghav
Contributor Author

Thanks for the thorough review, @abstractdog. I'll address the review comments shortly; you can continue the review in the meantime. I hope you are able to build the image and start the Tez AM standalone process 😅

I still believe we can get rid of the Hadoop tarball dependency completely, as the required Hadoop jars are already part of the Tez tarball. It might unnecessarily increase the Docker image size.

Also, please suggest whether I should use eclipse-temurin:21.0.3_9-jre-ubi9-minimal or a JDK image. If we want to take a jstack or use other Java debugging tools, a JDK image is required.

@abstractdog
Contributor

abstractdog commented Feb 25, 2026

  1. I believe that as long as a simple DAG can run without adding Hadoop jars (other than what's already inside tez.tar.gz), it's fine to get rid of them. We cannot test that now, but we can still check whether the AM starts correctly, and if so, we're good.

  2. Regarding debugging tools: I would add them in the first round, and we can still optimize later. These images are not for production in the first round, so I would rather have a slightly bigger image with debugging tools than a small image that's harder to use while investigating something.

# HADOOP FETCH LOGIC #
######################
HADOOP_FILE_NAME="hadoop-$HADOOP_VERSION.tar.gz"
HADOOP_URL=${HADOOP_URL:-"https://archive.apache.org/dist/hadoop/core/hadoop-$HADOOP_VERSION/$HADOOP_FILE_NAME"}
Contributor

what about using this first:
https://dlcdn.apache.org/hadoop/common/hadoop-3.4.2/hadoop-3.4.2.tar.gz
and then fall back to archive

archive.apache.org is extremely slow for me at the moment (not for the first time), so it may be worth exploring dlcdn.apache.org.
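
The CDN-first, archive-second idea could be sketched roughly as below. This is an illustration rather than the PR's build-docker.sh: fetch_hadoop is a hypothetical helper, the default version simply mirrors the URL quoted above, and it assumes curl is available in the build environment.

```shell
#!/bin/sh
# Sketch only: fetch_hadoop is a hypothetical helper, not the PR's build-docker.sh.
HADOOP_VERSION="${HADOOP_VERSION:-3.4.2}"
HADOOP_FILE_NAME="hadoop-$HADOOP_VERSION.tar.gz"
HADOOP_CDN_URL="https://dlcdn.apache.org/hadoop/common/hadoop-$HADOOP_VERSION/$HADOOP_FILE_NAME"
HADOOP_ARCHIVE_URL="https://archive.apache.org/dist/hadoop/core/hadoop-$HADOOP_VERSION/$HADOOP_FILE_NAME"

# Try the CDN first; fall back to the archive only if the CDN download fails.
fetch_hadoop() {
  target="$1"
  for url in "$HADOOP_CDN_URL" "$HADOOP_ARCHIVE_URL"; do
    echo "Downloading $url" >&2
    # -f makes curl return non-zero on HTTP errors, so the loop can move on.
    if curl -fSL --connect-timeout 15 -o "$target" "$url"; then
      return 0
    fi
    echo "Download failed, trying next mirror" >&2
  done
  return 1
}
```

Keeping the archive URL as the second candidate matters because a given Hadoop version may not be hosted on the CDN, in which case the first attempt fails quickly and the slower archive is still used.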

docker run \
-p 10001:10001 -p 8042:8042 \
--name tez-am \
apache/tez-am:1.0.0-SNAPSHOT
Contributor

I would introduce a TEZ_VERSION env var beforehand and refer to it: this would make it clear what's going to become obsolete in this doc, and what's more permanent :)
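
As a rough sketch of that suggestion, the README steps could define the version once and derive the image reference from it. The helper names tez_am_image and run_tez_am are hypothetical, and the ports and --env-file usage are taken from the docker run command shared later in this review.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; the default matches the tag used in this thread.
TEZ_VERSION="${TEZ_VERSION:-1.0.0-SNAPSHOT}"

# Build the image reference from the single version variable.
tez_am_image() {
  echo "apache/tez-am:$TEZ_VERSION"
}

# Run the AM with the env file, so only TEZ_VERSION changes between releases.
run_tez_am() {
  docker run --rm \
    -p 10001:10001 -p 5005:5005 \
    --env-file tez.env \
    --name tez-am \
    "$(tez_am_image)"
}
```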

Comment on lines 37 to 40
docker run \
-p 10001:10001 -p 8042:8042 \
--name tez-am \
apache/tez-am:1.0.0-SNAPSHOT
Contributor

@abstractdog abstractdog Feb 25, 2026

I'm trying the steps here, and during docker run I get:

2026-02-25 15:08:52,107 ERROR app.DAGAppMaster: Error starting DAGAppMaster
java.io.FileNotFoundException: /opt/tez/tez-conf.pb (No such file or directory)
	at java.base/java.io.FileInputStream.open0(Native Method)
	at java.base/java.io.FileInputStream.open(Unknown Source)
	at java.base/java.io.FileInputStream.<init>(Unknown Source)
	at org.apache.tez.common.TezUtilsInternal.readUserSpecifiedTezConfiguration(TezUtilsInternal.java:83)
	at org.apache.tez.frameworkplugins.yarn.YarnServerFrameworkService$YarnAMExtensions.loadConfigurationProto(YarnServerFrameworkService.java:73)
	at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:2435)

Also, earlier I get this:

/entrypoint.sh: line 34: hostname: command not found

I believe this happens in the entrypoint, so it should not be related to my host machine.

Can you advise what could possibly cause these? I can debug it for sure, but maybe it's more straightforward for you, given that you're the author.

Contributor Author

Regarding /opt/tez/tez-conf.pb: I removed it in the top commit because startup was not failing for me. Please make sure you are using the same tez-site.xml as in the PR, or revert the last commit to entrypoint.sh.

The hostname one is a known issue: the command doesn't exist in the base Docker image. I'll fix it; I forgot to remove it :-(

Contributor Author

@Aggarwal-Raghav Aggarwal-Raghav Feb 25, 2026

cd into tez-dist/src/docker and run the command:

docker run --rm \
              -p 10001:10001 -p 5005:5005 \
              --env-file tez.env \
              --name tez-am \
              apache/tez-am:1.0.0-SNAPSHOT

tez-conf.pb is required in YARN mode, not in ZK mode. Please pass -e TEZ_FRAMEWORK_MODE="STANDALONE_ZOOKEEPER" or use tez.env, and make sure Hadoop is also running, or remove the following property. I was testing this with TEZ-4686.

<property>
        <name>fs.defaultFS</name>
        <value>hdfs://host.docker.internal:9000</value>
    </property>

Contributor Author

Your stack trace also suggests it is going into the default YARN mode.

Contributor

Makes sense, but I think the docker run command should contain tez.env, right? I just copy-pasted the command, and that was the problem. I can see only:

    ```bash
    docker run \
        -p 10001:10001 -p 8042:8042 \
        --name tez-am \
        apache/tez-am:1.0.0-SNAPSHOT

Contributor Author

@Aggarwal-Raghav Aggarwal-Raghav Feb 25, 2026

Yes, the README.md needs some changes; I will update it. Please let me know if you face any more issues with the tez-am startup. I'm fully available until your tez-am image works :-)

I added tez.env in point 4 but didn't update the older headings.

Contributor

Absolutely, thanks! Making the image work on my side is a crucial part of the review 🚀

@tez-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
| --- | --- | --- | --- | --- |
| +0 🆗 | reexec | 5m 11s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 🆗 | xmllint | 0m 0s | | xmllint was not available. |
| +0 🆗 | shelldocs | 0m 0s | | Shelldocs was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 ❌ | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| | | | | _ master Compile Tests _ |
| +0 🆗 | mvndep | 1m 42s | | Maven dependency ordering for branch |
| +1 💚 | mvninstall | 7m 36s | | master passed |
| +1 💚 | compile | 1m 31s | | master passed |
| +1 💚 | javadoc | 1m 7s | | master passed |
| -0 ⚠️ | patch | 1m 23s | | Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary. |
| | | | | _ Patch Compile Tests _ |
| +0 🆗 | mvndep | 0m 5s | | Maven dependency ordering for patch |
| +1 💚 | mvninstall | 2m 37s | | the patch passed |
| +1 💚 | codespell | 0m 48s | | No new issues. |
| +1 💚 | compile | 1m 34s | | the patch passed |
| +1 💚 | javac | 1m 34s | | the patch passed |
| -1 ❌ | blanks | 0m 0s | /blanks-eol.txt | The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply |
| +1 💚 | hadolint | 0m 1s | | No new issues. |
| +1 💚 | markdownlint | 0m 1s | | No new issues. |
| +1 💚 | shellcheck | 0m 1s | | No new issues. |
| +1 💚 | javadoc | 0m 59s | | the patch passed |
| | | | | _ Other Tests _ |
| +1 💚 | unit | 0m 17s | | tez-dist in the patch passed. |
| +1 💚 | unit | 57m 47s | | root in the patch passed. |
| +1 💚 | asflicense | 0m 42s | | The patch does not generate ASF License warnings. |
| | | 83m 2s | | |

| Subsystem | Report/Notes |
| --- | --- |
| Docker | ClientAPI=1.53 ServerAPI=1.53 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-456/8/artifact/out/Dockerfile |
| GITHUB PR | #456 |
| Optional Tests | dupname asflicense codespell detsecrets javac javadoc unit xmllint compile shellcheck shelldocs hadolint markdownlint |
| uname | Linux 3b8c553d0442 5.15.0-164-generic #174-Ubuntu SMP Fri Nov 14 20:25:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-home/workspace/tez-multibranch_PR-456/src/.yetus/personality.sh |
| git revision | master / 0c2a29b |
| Default Java | Ubuntu-21.0.10+7-Ubuntu-124.04 |
| Test Results | https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-456/8/testReport/ |
| Max. process+thread count | 1338 (vs. ulimit of 5500) |
| modules | C: tez-dist . U: . |
| Console output | https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-456/8/console |
| versions | git=2.43.0 maven=3.8.7 hadolint=1.18.0-0-g76eee5c codespell=2.4.1 markdownlint=0.46.0 shellcheck=0.7.1 |
| Powered by | Apache Yetus 0.15.1 https://yetus.apache.org |

This message was automatically generated.

@abstractdog
Contributor

@Aggarwal-Raghav: getting closer and closer to the finish, nice job so far!
I'm wondering whether you plan to address TEZ-4689 before or after this patch.

@Aggarwal-Raghav
Contributor Author

Let me check this in depth over the weekend, and I'll post my analysis/PR under the JIRA. Hoping it's not too late 😅

Can you please check TEZ-4686 as well? With this tez-am Docker image plus a standalone program, or with (Tez master + Hive master), a stack trace is observed (see the attachment section). I have raised a draft PR for this.

@abstractdog
Contributor

I believe this patch can be merged without TEZ-4686. With the current WIP patch of TEZ-4686, there is an issue that has to be resolved on both the Hive and the Tez side instead; I'll describe it in detail there. The point is that this AM docker initiative is good to go once the DAGAppMaster is successfully started and registered to ZK, which is the case now 🚀
