Posit AI Blog: luz 0.4.0

A new version of luz is now available on CRAN. luz is a high-level interface for torch. It aims to reduce the boilerplate code necessary to train torch models while being as flexible as possible so you can adapt it to run all kinds of deep learning models.

If you want to get started with luz, we recommend reading the previous blog post about the release as well as the chapter ‘Training with luz’ in the book ‘Deep Learning and Scientific Computing with R torch’.

This release adds many minor features and you can view the full changelog here. In this blog post, we’re highlighting the features we’re most excited about.

Support for Apple Silicon

As of torch v0.9.0, it has been possible to run computations on the GPU of Apple Silicon-equipped Macs. However, luz would not automatically use the GPU, and instead ran models on the CPU.

Starting with this release, luz will automatically use the ‘mps’ device when running models on Apple Silicon computers, allowing you to benefit from the acceleration of running models on GPUs.
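If you want to check whether MPS is available, or override the automatic device selection (for instance, to reproduce the CPU timings below), you can do so explicitly. This is a minimal sketch, assuming a model and dataloader named `net` and `train_dl` that are placeholders, not part of this post:

```r
library(torch)
library(luz)

# Check whether the MPS backend is available on this machine.
backends_mps_is_available()

# Force CPU execution even when a faster device exists.
fitted <- net %>%
  setup(loss = nn_cross_entropy_loss(), optimizer = optim_adam) %>%
  fit(train_dl, epochs = 1, accelerator = accelerator(cpu = TRUE))
```

By default, simply omitting the `accelerator` argument lets luz pick the fastest available device.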

To give you an idea, running a simple CNN model on MNIST from this example for one epoch on an Apple M1 Pro chip takes about 24 seconds using the GPU:

  user  system elapsed 
19.793   1.463  24.231 

Whereas on the CPU it takes about 60 seconds:

  user  system elapsed 
83.783  40.196  60.253 

That’s a nice acceleration!

Note that this feature is still somewhat experimental, and not every torch operation is supported on MPS. You are likely to see warning messages explaining that execution will fall back to the CPU for some operators:

(W MPSFallback.mm:11) Warning: The operator 'at:****' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (function operator())

Checkpoint

The checkpointing functionality has been reworked in luz, and it is now easier to restart training runs that failed for some unexpected reason. All that's needed is to add a resume callback when fitting the model:

# ... model definition omitted
# ...
# ...
resume <- luz_callback_resume_from_checkpoint(path = "checkpoints/")

results <- model %>% fit(
  list(x, y),
  callbacks = list(resume),
  verbose = FALSE
)

It is also now easier to save the model state at every epoch, or whenever the model achieves better validation results. Learn more in the “Checkpointing” article.
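For instance, a checkpoint callback can be configured to keep only the weights that improve on a validation metric. Here's a minimal sketch using luz's `luz_callback_model_checkpoint()`; as above, the model definition is omitted, and the 20% validation split is an arbitrary choice for illustration:

```r
library(luz)

# Save weights to `checkpoints/` only when the validation loss improves.
checkpoint <- luz_callback_model_checkpoint(
  path = "checkpoints/",
  monitor = "valid_loss",
  save_best_only = TRUE
)

results <- model %>% fit(
  list(x, y),
  valid_data = 0.2,  # hold out 20% of the data for validation
  callbacks = list(checkpoint)
)
```

Setting `save_best_only = FALSE` would instead write a checkpoint at every epoch.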

Fixed bugs

This release also includes several small bug fixes, such as making sure luz respects the CPU device when it is explicitly requested (even when a faster device is available), and making the metrics environments more consistent.

However, there is one bug fix we would like to highlight in this blog post. We found that the algorithm we used to accumulate the loss during training had exponential complexity; thus, if you had many steps per epoch, training would be very slow.

For example, for a dummy model running for 500 steps, one epoch of training would take 61 seconds:

Epoch 1/1
Train metrics: Loss: 1.389                                                                
   user  system elapsed 
 35.533   8.686  61.201 

The same model with the bug fixed now takes 5 seconds per epoch:

Epoch 1/1
Train metrics: Loss: 1.2499                                                                                             
   user  system elapsed 
  4.801   0.469   5.209

This bug fix results in a roughly 10x speedup for this model, though the speedup may vary depending on the model type. Models whose per-batch computation is fast and that run more iterations per epoch benefit more from this fix.
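To illustrate the general class of problem (a hypothetical sketch, not luz's actual internals), one common way accumulation becomes super-linear is concatenating every per-step loss into a growing vector and re-averaging it, which does work proportional to the step count at every step. Updating a running mean instead takes constant work per step:

```r
# Hypothetical illustration -- not luz's actual implementation.

# Slow: keep every loss and re-average the growing vector each step.
slow_accumulate <- function(losses) {
  acc <- numeric(0)
  means <- numeric(length(losses))
  for (i in seq_along(losses)) {
    acc <- c(acc, losses[i])  # copies the whole vector each step
    means[i] <- mean(acc)     # re-scans all values seen so far
  }
  means
}

# Fast: update a running mean incrementally, O(1) work per step.
fast_accumulate <- function(losses) {
  running <- 0
  means <- numeric(length(losses))
  for (i in seq_along(losses)) {
    running <- running + (losses[i] - running) / i
    means[i] <- running
  }
  means
}

losses <- runif(500)
all.equal(slow_accumulate(losses), fast_accumulate(losses))  # TRUE
```

Both functions report the same running loss; only the cost per step differs, which is why the slowdown grows with the number of steps per epoch.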

Thank you so much for reading this blog post. As always, we welcome any contribution to the torch ecosystem. Feel free to open issues and suggest new features, improve documentation, or expand the code base.

Last week we announced the release of torch v0.10.0 – here’s a link to the release blog post in case you missed it.

Photo by Peter John Maridable on Unsplash

Reuse

Text and images are licensed under Creative Commons Attribution CC BY 4.0. Figures that have been reused from other sources are not covered by this license and can be identified by the note in their caption: “Image from …”.

Citation

For attribution, please cite this work as

Falbel (2023, April 17). Posit AI Blog: luz 0.4.0. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2023-04-17-luz-0-4/

BibTeX citation

@misc{luz-0-4,
  author = {Falbel, Daniel},
  title = {Posit AI Blog: luz 0.4.0},
  url = {https://blogs.rstudio.com/tensorflow/posts/2023-04-17-luz-0-4/},
  year = {2023}
}
