A new version of luz is now available on CRAN. luz is a high-level interface for torch. It aims to reduce the boilerplate code necessary to train torch models while being as flexible as possible so you can adapt it to run all kinds of deep learning models.
If you want to get started with luz, we recommend reading the previous blog post about the release as well as the chapter ‘Training with luz’ in the book ‘Deep Learning and Scientific Computing with R torch’.
This release adds many minor features and you can view the full changelog here. In this blog post, we’re highlighting the features we’re most excited about.
Support for Apple Silicon
As of torch v0.9.0, it is possible to run computations on the GPU of Apple Silicon-equipped Macs. However, luz would not automatically make use of the GPU, and models would instead run on the CPU.
Starting with this release, luz will automatically use the ‘mps’ device when running models on Apple Silicon computers, allowing you to benefit from the acceleration of running models on GPUs.
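If you want to check whether your installation can use this backend, torch provides backends_mps_is_available(). As a minimal sketch (assuming recent versions of torch and luz are installed):

library(torch)

# TRUE on Apple Silicon Macs when the MPS backend is available
backends_mps_is_available()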
To give an idea, running a simple CNN model on MNIST from this example for one epoch on an Apple M1 Pro chip would take 24 seconds using the GPU:
user system elapsed
19.793 1.463 24.231
Whereas on CPU it would take 60 seconds:
user system elapsed
83.783 40.196 60.253
That’s a nice acceleration!
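The timings above are the kind of output produced by system.time(). As a sketch of how such a comparison could be reproduced (assuming a model and a train_dl dataloader set up as in the linked MNIST example; passing accelerator(cpu = TRUE) is one way to request CPU execution):

# one epoch on the default device (luz picks 'mps' on Apple Silicon)
system.time(
  fitted <- model %>% fit(train_dl, epochs = 1)
)

# one epoch forced onto the CPU for comparison
system.time(
  fitted <- model %>% fit(
    train_dl,
    epochs = 1,
    accelerator = accelerator(cpu = TRUE)
  )
)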
Note that this feature is still somewhat experimental, and not every torch operation is supported on MPS. You are likely to see a warning message explaining that a CPU fallback may be used for some operators:
(W MPSFallback.mm:11) Warning: The operator 'at:****' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (function operator())
Checkpoint
The checkpointing functionality has been reworked in luz, and it is now easier to restart training runs if they fail for some unexpected reason. All that's needed is to add a resume callback when fitting the model:
# ... model definition omitted
# ...
# ...
resume <- luz_callback_resume_from_checkpoint(path = "checkpoints/")

results <- model %>% fit(
  list(x, y),
  callbacks = list(resume),
  verbose = FALSE
)
It is also now easier to save the state of the model at each epoch, or if the model has obtained better validation results. Learn more in the “Checkpointing” article.
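For instance, here is a sketch using luz_callback_model_checkpoint() to save weights whenever the validation loss improves (the validation tensors x_valid and y_valid are assumptions for illustration):

# keep only the weights that achieve the best validation loss
checkpoint <- luz_callback_model_checkpoint(
  path = "checkpoints/",
  monitor = "valid_loss",
  save_best_only = TRUE
)

results <- model %>% fit(
  list(x, y),
  valid_data = list(x_valid, y_valid),
  callbacks = list(checkpoint),
  verbose = FALSE
)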
Bug fixes
This release also includes several small bug fixes, such as respecting usage of the CPU (even when a faster device is available) and making the metrics environments more consistent.
However, there is one bug fix that we would like to highlight in this blog post. We found that the algorithm we used to accumulate the loss during training had exponential complexity; thus, if you had many steps per epoch during model training, luz would be very slow.
For example, for a dummy model running 500 steps per epoch, luz would take 61 seconds per epoch:
Epoch 1/1
Train metrics: Loss: 1.389
user system elapsed
35.533 8.686 61.201
The same model with the bug fixed now takes 5 seconds:
Epoch 1/1
Train metrics: Loss: 1.2499
user system elapsed
4.801 0.469 5.209
This bug fix results in a 10x speedup for this model. However, the speedup may vary depending on the model type: models that are faster per batch and have more iterations per epoch will benefit more from this fix.
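To give a flavor of the general idea behind such a fix (a conceptual sketch only, not luz's actual internals): keeping a running sum and step count makes loss accumulation constant-time per step, instead of re-aggregating all previous losses at every step.

# conceptual sketch: O(1) work per training step
running_sum <- 0
n_steps <- 0

update_loss <- function(loss) {
  running_sum <<- running_sum + loss
  n_steps <<- n_steps + 1
}

# mean loss reported at the end of the epoch
epoch_loss <- function() running_sum / n_steps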
Thank you so much for reading this blog post. As always, we welcome any contribution to the torch ecosystem. Feel free to open issues and suggest new features, improve documentation, or expand the code base.
Last week we announced the release of torch v0.10.0 – here’s a link to the release blog post in case you missed it.
Photo by Peter John Maridable on Unsplash
Reuse
Text and images are licensed under Creative Commons Attribution CC BY 4.0. Figures that have been reused from other sources are not covered by this license and can be identified by the note in their caption: “Image from …”.
Citation
For attribution, please cite this work as
Falbel (2023, April 17). Posit AI Blog: luz 0.4.0. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2023-04-17-luz-0-4/
BibTeX citation
@misc{luz-0-4,
  author = {Falbel, Daniel},
  title = {Posit AI Blog: luz 0.4.0},
  url = {https://blogs.rstudio.com/tensorflow/posts/2023-04-17-luz-0-4/},
  year = {2023}
}