In this blog post I'm going to explain how you can use VisionCamera V3, TensorFlow Lite and Skia to create a funny little stickman dancing app from scratch! 🕺
While this looks insanely difficult to do, VisionCamera V3 comes with exciting new APIs that allow us to build this app without a single line of native code!
Don't want to read? Here's the full code: https://github.com/mrousavy/VisionCameraSkiaDemo
Let's get started:
Step 1: Setting up the project
- Make sure you set up your React Native development environment.
- Create a new React Native (0.71+) project, we're gonna call it `StickmanDancingApp`:

  ```sh
  npx react-native init StickmanDancingApp
  ```

- Run the app with `yarn ios`.
Step 2: Add the Camera
To add Camera functionality to our React Native app, we're going to use react-native-vision-camera.
- Install `react-native-vision-camera` from npm:

  ```sh
  yarn add react-native-vision-camera@rc
  cd ios && pod install
  ```

  Note: As of August 2023, VisionCamera V3 is still in "RC" (beta) mode. If VisionCamera V3 has already been released, you don't need the `@rc` tag in this command. See this issue for updates.

- Open `ios/StickmanDancingApp/Info.plist` and add the required Camera permissions:

  ```xml
  <key>NSCameraUsageDescription</key>
  <string>StickmanDancingApp needs Camera access.</string>
  ```

- Rebuild your app.
- Now that we installed VisionCamera, let's use it! In `App.tsx`, ask for permissions:

  ```tsx
  function App() {
    const [hasPermission, setHasPermission] = useState(false)

    useEffect(() => {
      Camera.requestCameraPermission().then((p) =>
        setHasPermission(p === 'granted')
      )
    }, [])

    // ...
  }
  ```
- Then select the Camera device you want to use. Your phone has multiple Camera devices, and for this example we just want to use the front-facing camera (which is pretty much always a `wide-angle` camera):

  ```tsx
  function App() {
    // ...
    const devices = useCameraDevices('wide-angle-camera')
    const device = devices.front
    // ...
  }
  ```
- And finally, we want to render the Camera:

  ```tsx
  function App() {
    // ...
    return (
      <View style={StyleSheet.absoluteFill}>
        {!hasPermission && <Text style={styles.text}>No Camera Permission.</Text>}
        {hasPermission && device != null && (
          <Camera
            style={StyleSheet.absoluteFill}
            device={device}
            isActive={true}
          />
        )}
      </View>
    )
  }
  ```
- And voila, we can see the Camera appear on screen!
Step 3: Enable Frame Processors
VisionCamera has a feature called "Frame Processors", which allows us to create JavaScript functions that get called for every Frame the camera "sees". In this function we get access to the native GPU Frame buffer, allowing us to run any kind of processing on it, such as ML inference, face detection, or pose detection.
Since the Camera streams frames at a very high refresh rate (e.g. 30 FPS, 60 FPS, or even 120 FPS), we need to make sure those functions are as fast as possible. To achieve this, VisionCamera uses a lot of advanced GPU techniques and a highly optimized C++ implementation to avoid any expensive buffer copies.
This feature is only enabled if you install react-native-worklets-core, a library for creating small JavaScript functions ("worklets") that can run on a separate thread, focused on specific high-performance tasks. This is the same concept as in react-native-reanimated.
- Install `react-native-worklets-core`:

  ```sh
  yarn add react-native-worklets-core
  cd ios
  pod install
  ```

  You should see this appear in the output logs of the `pod install` command:

  ```
  [VisionCamera] react-native-worklets-core found, Frame Processors enabled!
  ```

  If VisionCamera doesn't detect react-native-worklets-core, make sure to investigate why, otherwise Frame Processors are not enabled!
- Add the `'worklet'` compiler plugin to your `babel.config.js`:

  ```js
  module.exports = {
    // ...
    plugins: [
      ['react-native-worklets-core/plugin'],
      // ...
    ],
  }
  ```
- Restart your bundler:

  ```sh
  yarn start --reset-cache
  ```
- And finally, add a simple Frame Processor that just logs to the console:

  ```tsx
  function App() {
    // ...
    const frameProcessor = useFrameProcessor((frame) => {
      'worklet'
      console.log(`Received a ${frame.width} x ${frame.height} Frame!`)
    }, [])

    // ...
    <Camera
      // ...
      frameProcessor={frameProcessor}
    />
    // ...
  }
  ```

  You should see logs appear in your Metro console:

  ```
  Received a 3048 x 2096 Frame!
  Received a 3048 x 2096 Frame!
  Received a 3048 x 2096 Frame!
  ...
  ```
In this Frame Processor we have access to the current Camera Frame. Every operation is fully synchronous, meaning if you take too long to finish your processing, a frame might get dropped. At 60 FPS, your Frame Processor should finish its execution within 16.66 ms.
We can use utilities such as `runAsync` or `runAtTargetFps` to perform asynchronous or throttled processing.
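To give you an idea of how these look, here's a minimal sketch (both helpers are exported by VisionCamera V3, but double-check the exact signatures against the version you're using):

```tsx
import { useFrameProcessor, runAtTargetFps, runAsync } from 'react-native-vision-camera'

function App() {
  // ...
  const frameProcessor = useFrameProcessor((frame) => {
    'worklet'

    // Throttled: only run this block ~2 times per second,
    // no matter how fast the Camera streams Frames
    runAtTargetFps(2, () => {
      'worklet'
      console.log(`Throttled work on a ${frame.width} x ${frame.height} Frame`)
    })

    // Asynchronous: run heavy work without blocking the Camera pipeline,
    // so new Frames don't get dropped while we're still busy
    runAsync(frame, () => {
      'worklet'
      console.log('Heavy work running asynchronously')
    })
  }, [])
  // ...
}
```

Roughly speaking, `runAtTargetFps` is useful for expensive work that doesn't need to run on every single Frame, while `runAsync` keeps the synchronous Camera pipeline from being blocked by long-running work.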
Step 4: Implement pose detection using TensorFlow Lite
To implement the pose detection, we're going to use TensorFlow Lite. TensorFlow Lite is a framework to run standardized machine learning models on mobile devices. While you could also build this with other frameworks such as PyTorch, MediaPipe, MLKit or Apple Vision, we're using TensorFlow Lite because it is easier to integrate with React Native.
To use TensorFlow Lite models in React Native we're going to use react-native-fast-tflite, a library built by me. It uses a highly optimized C++ implementation that lets you use shared ArrayBuffers without copying them, which is not only faster, but also allows us to use it inside a VisionCamera Frame Processor.
- Install `react-native-fast-tflite` from npm:

  ```sh
  yarn add react-native-fast-tflite
  cd ios && pod install
  ```
- Include `.tflite` files in your JS bundle by adding the extension to `assetExts` in your `metro.config.js`:

  ```js
  const {getDefaultConfig, mergeConfig} = require('@react-native/metro-config')

  const config = {
    resolver: {
      assetExts: ['tflite', ...getDefaultConfig(__dirname)?.resolver?.assetExts],
    },
  }

  module.exports = mergeConfig(getDefaultConfig(__dirname), config)
  ```
- (Optional) If you want to use the CoreML GPU Delegate, add this to your `ios/Podfile`:

  ```ruby
  $EnableCoreMLDelegate=true
  ```
- Now we need to find a good pose-detection model. Here are some things to look out for:
  - The model needs to be a TensorFlow Lite model (`.tflite` extension).
  - The model should ideally use `uint8`/`int8` instead of `float`s for its input type, for faster execution.
  - Models often come in an accurate and a fast variant. Download both and compare their speed to pick the one that fits your use case.
  - The model should ideally support the CoreML GPU delegate for faster execution.

  I'm going to use the MoveNet Pose Detection (int8) model from Google (tfhub.dev), which comes in two variants: "thunder" (more accurate) and "lightning" (faster).
- Drag `lite-model-movenet-singlepose-lightning-tflite-int8-4.tflite` into your `src/assets/` folder (I'm going to rename it to `pose-detection-fast.tflite`).

- Load the model:

  ```tsx
  const plugin = useTensorflowModel(require('./assets/pose-detection-fast.tflite'))
  ```
- (Optional) We should use the CoreML GPU delegate for better performance if the model supports it:

  ```tsx
  const plugin = useTensorflowModel(..., 'core-ml')
  ```
Understanding inputs/outputs
If you're new to TensorFlow, the APIs can be quite confusing. Here's a short TL;DR:
TensorFlow uses "tensors" as its inputs and outputs. They are simply arrays of numbers: you pass in an array of numbers, and you receive an array of numbers. The way you interpret the data is up to you.
If you open your model in Netron you can inspect the types and sizes of your inputs and outputs:
In this case, we can see that there is exactly one input tensor, and one output tensor.
The input type is `uint8`, and it is `1 x 192 x 192 x 3` (= 110,592) values big. The `1` means a batch of one image, `192` is the image width, `192` is the image height, and `3` is the number of values per pixel (`[R, G, B]`).
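To make the memory layout concrete, here's a tiny sketch of the index math. This is plain arithmetic, not any library API, and `inputIndex` is just an illustrative helper:

```ts
// The [1, 192, 192, 3] uint8 input tensor is a flat array of
// 1 * 192 * 192 * 3 = 110,592 bytes, laid out row by row.
const WIDTH = 192
const HEIGHT = 192
const CHANNELS = 3 // [R, G, B]

// Flat index of channel `c` of the pixel at (x, y)
function inputIndex(x: number, y: number, c: number): number {
  return (y * WIDTH + x) * CHANNELS + c
}

// e.g. the index of the green value (c = 1) of the pixel at (10, 20):
const greenIndex = inputIndex(10, 20, 1) // 11551
```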
The output type is `float32`, and it is `1 x 1 x 17 x 3` (= 51) values big. By reading the description on tfhub.dev, we can understand that the output values can be read as:
- `i`: The Y coordinate of this point
- `i + 1`: The X coordinate of this point
- `i + 2`: The confidence score of this point

..where `i = k * 3` and `k` (between 0 and 16) is the index of one landmark (e.g. nose, elbow, knee, ..).
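As a purely illustrative example, here's how you could turn that flat output tensor into named keypoints. `parseKeypoints`, the `Keypoint` type and `KEYPOINT_NAMES` are my own helpers (not part of any library); the name order follows MoveNet's documented keypoint ordering, which also matches the line indexes we'll use for the skeleton later:

```ts
// The 17 MoveNet keypoints, in output order
const KEYPOINT_NAMES = [
  'nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear',
  'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
  'left_wrist', 'right_wrist', 'left_hip', 'right_hip',
  'left_knee', 'right_knee', 'left_ankle', 'right_ankle',
]

interface Keypoint {
  name: string
  x: number // normalized 0..1
  y: number // normalized 0..1
  confidence: number
}

// `output` is the single float32 output tensor (51 values)
export function parseKeypoints(output: ArrayLike<number>): Keypoint[] {
  'worklet' // so it can be called from inside a Frame Processor
  const keypoints: Keypoint[] = []
  for (let k = 0; k < KEYPOINT_NAMES.length; k++) {
    const i = k * 3
    keypoints.push({
      name: KEYPOINT_NAMES[k],
      y: output[i],
      x: output[i + 1],
      confidence: output[i + 2],
    })
  }
  return keypoints
}
```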
Step 5: Calling the TensorFlow Lite Pose Detection Model
Now that we understand the input data of our model, we need to prepare it accordingly:
- Cameras usually stream Frames in `YUV` or `PRIVATE` formats. Since this model is trained on RGB images, we need to configure the Camera to stream in an RGB format:

  ```tsx
  <Camera
    // ...
    pixelFormat="rgb"
  />
  ```
- Since the model expects 192x192 frames, we need to downsize the Camera Frame first. Let's create a simple JS-based resize plugin:

  ```tsx
  import type { Frame } from 'react-native-vision-camera'

  // Cache the array buffer to avoid it being constantly re-allocated
  const CACHE_ID = '__cachedArrayForResizer'
  function getArrayFromCache(size: number): Int8Array {
    'worklet'
    if (global[CACHE_ID] == null || global[CACHE_ID].length != size) {
      global[CACHE_ID] = new Int8Array(size)
    }
    return global[CACHE_ID]
  }

  // Resize any Frame to the target width and height in RGB format.
  export function resize(frame: Frame, width: number, height: number): Int8Array {
    'worklet'
    const inputWidth = frame.width
    const inputHeight = frame.height
    const arrayData = frame.toArrayBuffer()

    const outputSize = width * height * 3 // 3 for RGB
    const outputFrame = getArrayFromCache(outputSize)

    for (let y = 0; y < height; y++) {
      for (let x = 0; x < width; x++) {
        // Find the closest pixel from the source image
        const srcX = Math.floor((x / width) * inputWidth)
        const srcY = Math.floor((y / height) * inputHeight)

        // Compute the source and destination index
        const srcIndex = (srcY * inputWidth + srcX) * 4 // 4 for BGRA
        const destIndex = (y * width + x) * 3 // 3 for RGB

        // Convert from BGRA to RGB
        outputFrame[destIndex] = arrayData[srcIndex + 2] // R
        outputFrame[destIndex + 1] = arrayData[srcIndex + 1] // G
        outputFrame[destIndex + 2] = arrayData[srcIndex] // B
      }
    }

    return outputFrame
  }
  ```

  This assumes that the input frame is in BGRA format. On Android, RGB Frames might be in RGB or RGBA format instead of BGRA, so make sure to adjust this accordingly. Also, you can implement this as a native Frame Processor Plugin to improve its performance, e.g. by using OpenCV, Accelerate or Metal.

- And finally, let's call the TensorFlow Lite plugin with our 192x192 Frame:

  ```tsx
  function App() {
    // ...
    const plugin = ...

    const frameProcessor = useFrameProcessor((frame) => {
      'worklet'
      if (plugin.state === "loaded") {
        const resized = resize(frame, 192, 192)
        const outputs = plugin.model.runSync([resized])
        console.log(`Received ${outputs.length} outputs!`)
      }
    }, [plugin])
    // ...
  }
  ```

  Perfect! We can see our TensorFlow Lite model running and returning 1 output tensor on each execution. Let's start drawing stuff!
Step 6: Setting up Skia
To draw stuff to a screen we're going to use Skia, an open-source cross-platform drawing library powered by Metal on iOS and OpenGL on Android.
- Let's install react-native-skia from npm:

  ```sh
  yarn add @shopify/react-native-skia
  cd ios && pod install
  ```

  You should see this appear in the output logs of the `pod install` command:

  ```
  [VisionCamera] react-native-skia found, Skia Frame Processors enabled!
  ```

  If VisionCamera doesn't detect react-native-skia, make sure to investigate why, otherwise Skia Frame Processors are not enabled!
- Instead of `useFrameProcessor`, now use `useSkiaFrameProcessor`:

  ```tsx
  function App() {
    // ...
    const paint = Skia.Paint()
    paint.setColor(Skia.Color('red'))

    const frameProcessor = useSkiaFrameProcessor((frame) => {
      'worklet'
      // ...
      const rect = Skia.XYWHRect(150, 150, 50, 50)
      frame.drawRect(rect, paint)
    }, [plugin])
    // ...
  }
  ```

  You should see a red rectangle appear on the screen.
- Now that we got our Skia context set up, let's start drawing the skeleton:

  ```tsx
  function App() {
    // ...

    // Pairs of keypoint indexes in the TFLite output that we want to connect with lines
    const lines = [
      // left shoulder -> elbow
      5, 7,
      // right shoulder -> elbow
      6, 8,
      // left elbow -> wrist
      7, 9,
      // right elbow -> wrist
      8, 10,
      // left hip -> knee
      11, 13,
      // right hip -> knee
      12, 14,
      // left knee -> ankle
      13, 15,
      // right knee -> ankle
      14, 16,
      // left hip -> right hip
      11, 12,
      // left shoulder -> right shoulder
      5, 6,
      // left shoulder -> left hip
      5, 11,
      // right shoulder -> right hip
      6, 12,
    ]

    // White paint for the lines
    const paint = Skia.Paint()
    paint.setStrokeWidth(5)
    paint.setColor(Skia.Color('white'))

    const frameProcessor = useSkiaFrameProcessor((frame) => {
      'worklet'
      const resized = resize(frame, 192, 192)
      const outputs = plugin.model.runSync([resized])
      const output = outputs[0] // one output tensor

      const frameWidth = frame.width
      const frameHeight = frame.height

      // remove everything from the Frame, making it black
      frame.clear(Skia.Color('black'))

      // draw each line of the skeleton
      for (let i = 0; i < lines.length; i += 2) {
        const from = lines[i]
        const to = lines[i + 1]

        const confidence = output[from * 3 + 2]
        if (confidence > 0.5) {
          frame.drawLine(
            output[from * 3 + 1] * frameWidth,
            output[from * 3] * frameHeight,
            output[to * 3 + 1] * frameWidth,
            output[to * 3] * frameHeight,
            paint,
          )
        }
      }
    }, [plugin, paint])
    // ...
  }
  ```
- To make things look a bit happier, let's also draw an emoji as a face:

  ```tsx
  function App() {
    // ...
    const EMOJI_SIZE = 30
    // font file downloaded from the internet
    const emojiFont = useFont(require('./assets/NotoEmoji-Medium.ttf'), EMOJI_SIZE)

    const frameProcessor = useSkiaFrameProcessor((frame) => {
      'worklet'
      // ...
      if (emojiFont != null) {
        const faceConfidence = output[2]
        if (faceConfidence > 0.5) {
          const noseY = output[0] * frame.height + EMOJI_SIZE * 0.3
          const noseX = output[1] * frame.width - EMOJI_SIZE / 2
          frame.drawText('😄', noseX, noseY, paint, emojiFont)
        }
      }
    }, [plugin, paint, emojiFont])
    // ...
  }
  ```
- And voila! We can dance in front of the Camera without it looking too terrible 🥳 (Bear with me here, my dancing skills are far from perfect)
It has been an insane two-year-long journey for me to get to this point. VisionCamera V3 is changing the mobile camera game by bringing easy-to-use realtime Frame processing (+ TFLite) and drawing (+ Skia) to React Native. 💪
Compared to other solutions, this uses significantly less code, is highly optimized for GPU processing, and benefits from fast-refresh/hot-reload so you don't have to rebuild your app when you change the color of the skeleton, apply a Skia shader, or even entirely swap out the TFLite model - it's all just a ⌘ + S away! 🤩
If you liked this post or appreciate what I'm doing with VisionCamera, TFLite, Worklets and Skia please consider sponsoring me on GitHub so I can invest into dance lessons! 🕺
Here's the full code for the app we just built: https://github.com/mrousavy/VisionCameraSkiaDemo