Detecting text in a live video scene with Vision and AVFoundation is a valuable skill for iOS developers building apps with real-time image processing or OCR capabilities. This post provides hands-on guidance on combining AVFoundation's video capture features with Vision's text recognition, so that an app can automatically extract and analyze text from a video feed. That combination is especially useful in fields such as accessibility, document scanning, or interactive media, and the post offers both technical insights and practical code examples to help you implement text detection in your own projects.
In this post, we will walk through creating a sample iOS app that detects speed limit traffic signs. At the end of the post, you will find a GitHub repository link to the source code.
iOS speed limit sign detection app
The app detects text in the camera feed and filters for values that could correspond to a speed limit:

The View
The main ContentView receives the possible speed value detected in the feed and simply draws a speed limit traffic sign with that value on it:
struct ContentView: View {
    @State private var detectedSpeed: String = ""

    var body: some View {
        ZStack {
            CameraView(detectedSpeed: $detectedSpeed)
                .edgesIgnoringSafeArea(.all)
            trafficLimitSpeedSignalView(detectedSpeed)
        }
    }

    func trafficLimitSpeedSignalView(_ detectedText: String) -> some View {
        if !detectedText.isEmpty {
            return AnyView(
                ZStack {
                    Circle()
                        .stroke(Color.red, lineWidth: 15)
                        .frame(width: 150, height: 150)
                    Circle()
                        .fill(Color.white)
                        .frame(width: 140, height: 140)
                    Text("\(detectedText)")
                        .font(.system(size: 80, weight: .heavy))
                        .foregroundColor(.black)
                }
            )
        } else {
            return AnyView(EmptyView())
        }
    }
}
CameraView is where all the magic takes place.
The key framework types used are the following:
- AVCaptureSession: Captures video data from the device camera.
- VNRecognizeTextRequest: Part of Apple’s Vision framework used for Optical Character Recognition (OCR) to recognize text in images.
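Before wiring Vision into a live camera feed, it helps to see the recognition request in isolation. The following is a minimal sketch that runs VNRecognizeTextRequest against a single still image; the function name recognizeTextOnce(in:) is purely illustrative and not part of the sample app:

import Vision
import CoreGraphics

// Illustrative sketch: standalone text recognition on a single image.
func recognizeTextOnce(in cgImage: CGImage) {
    let request = VNRecognizeTextRequest { request, error in
        guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
        // Each observation offers several candidates, ranked by confidence.
        let lines = observations.compactMap { $0.topCandidates(1).first?.string }
        print(lines.joined(separator: "\n"))
    }
    request.recognitionLevel = .accurate // .fast is the alternative for real-time use.

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}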
CameraView
CameraView is the component that uses the device's camera to capture video and processes the video feed to detect and extract speed values from any visible text (e.g., road signs with speed limits).

struct CameraView: UIViewControllerRepresentable {
    @Binding var detectedSpeed: String

The UIViewControllerRepresentable structure integrates a UIViewController (specifically, a camera view controller) into a SwiftUI-based application. SwiftUI is the future, but it does not yet cover this use case natively, so we bridge to UIKit here.

The @Binding var detectedSpeed property is a binding to a string that will hold the speed detected in the camera feed. Changes to this property wrapper update the main ContentView.
func makeCoordinator() -> Coordinator {
    return Coordinator(detectedSpeed: $detectedSpeed)
}

func makeUIViewController(context: Context) -> UIViewController {
    let controller = UIViewController()
    let captureSession = AVCaptureSession()
    captureSession.sessionPreset = .high

    guard let videoDevice = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .back),
          let videoInput = try? AVCaptureDeviceInput(device: videoDevice) else {
        return controller
    }
    captureSession.addInput(videoInput)

    let previewLayer = AVCaptureVideoPreviewLayer(session: captureSession)
    previewLayer.videoGravity = .resizeAspectFill
    previewLayer.frame = controller.view.layer.bounds
    controller.view.layer.addSublayer(previewLayer)

    let videoOutput = AVCaptureVideoDataOutput()
    videoOutput.setSampleBufferDelegate(context.coordinator, queue: DispatchQueue(label: "videoQueue"))
    captureSession.addOutput(videoOutput)

    // startRunning() blocks, so it is started off the main thread
    // (GlobalManager is presumably a custom global actor defined in the project).
    Task { @GlobalManager in
        captureSession.startRunning()
    }
    return controller
}

func updateUIViewController(_ uiViewController: UIViewController, context: Context) { }
Next, the methods in CameraView:

- makeCoordinator(): Creates an instance of the Coordinator class that manages the camera's data output and processes the video feed.
- makeUIViewController(context:): Sets up and configures the camera session:
  - AVCaptureSession: A session that manages the input from the camera and the output used to process the captured video.
  - AVCaptureDevice: Selects the device's rear camera (.back).
  - AVCaptureDeviceInput: Creates an input from the rear camera.
  - AVCaptureVideoPreviewLayer: Displays a live preview of the video feed on the screen.
  - AVCaptureVideoDataOutput: Captures video frames for processing by the Coordinator class.
  - captureSession.startRunning(): Starts the video capture (see the permission sketch after this list).
- updateUIViewController(_:context:): Required by the UIViewControllerRepresentable protocol but left empty here, because no updates to the view controller are needed after the initial setup.
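One detail the setup above glosses over is camera authorization: startRunning() will only produce a live preview if the user has granted camera access, and the app must declare NSCameraUsageDescription in its Info.plist. The following is a minimal sketch of a pre-flight check; it is not part of the sample app, just one possible approach:

import AVFoundation
import Foundation

// Illustrative sketch: check or request camera permission before starting the session.
func startSessionIfAuthorized(_ session: AVCaptureSession) {
    switch AVCaptureDevice.authorizationStatus(for: .video) {
    case .authorized:
        // startRunning() blocks, so keep it off the main thread.
        DispatchQueue.global(qos: .userInitiated).async {
            session.startRunning()
        }
    case .notDetermined:
        AVCaptureDevice.requestAccess(for: .video) { granted in
            guard granted else { return }
            DispatchQueue.global(qos: .userInitiated).async {
                session.startRunning()
            }
        }
    default:
        break // .denied or .restricted: nothing to start.
    }
}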
class Coordinator: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    @Binding var detectedSpeed: String

    init(detectedSpeed: Binding<String>) {
        _detectedSpeed = detectedSpeed
    }

    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        recognizeText(in: pixelBuffer)
    }

    private func recognizeText(in image: CVPixelBuffer) {
        let textRequest = VNRecognizeTextRequest { [weak self] (request, error) in
            guard let self,
                  let observations = request.results as? [VNRecognizedTextObservation],
                  let topCandidate = self.getCandidate(from: observations) else {
                return
            }
            if let speedCandidate = Int(topCandidate),
               (10...130).contains(speedCandidate),
               speedCandidate % 10 == 0 {
                print("Speed candidate: \(speedCandidate)")
                // The delegate runs on "videoQueue"; UI state must be updated on the main thread.
                DispatchQueue.main.async {
                    self.detectedSpeed = "\(speedCandidate)"
                }
            }
        }
        let requestHandler = VNImageRequestHandler(cvPixelBuffer: image, options: [:])
        try? requestHandler.perform([textRequest])
    }

    private func getCandidate(from observations: [VNRecognizedTextObservation]) -> String? {
        var candidates = [String]()
        for observation in observations {
            for candidate in observation.topCandidates(10) {
                if candidate.confidence > 0.9,
                   let speedCandidate = Int(candidate.string),
                   (10...130).contains(speedCandidate),
                   speedCandidate % 10 == 0 {
                    candidates.append(candidate.string)
                }
            }
        }
        // Custom Array extension from the repository: picks the most frequent candidate.
        return candidates.firstMostCommonItemRepeated()
    }
}
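getCandidate(from:) calls firstMostCommonItemRepeated(), a custom Array extension that lives in the repository and is not shown in the post. Assuming it returns the element that appears most often, and only when it actually repeats, a minimal sketch of such a helper could look like this (the repository's implementation may differ):

extension Array where Element: Hashable {
    // Illustrative sketch: returns the most frequent element,
    // but only if it appears more than once.
    func firstMostCommonItemRepeated() -> Element? {
        let counts = reduce(into: [Element: Int]()) { $0[$1, default: 0] += 1 }
        guard let (item, count) = counts.max(by: { $0.value < $1.value }), count > 1 else {
            return nil
        }
        return item
    }
}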
The Coordinator class is responsible for handling the video feed and extracting relevant text (i.e., speed values) from the camera image.

- Properties:
  - @Binding var detectedSpeed: String: A binding used to publish the detected speed from the camera feed.
- Methods:
  - captureOutput(_:didOutput:from:): This delegate method is called whenever a new video frame is captured. It gets the pixel buffer from the frame and passes it to recognizeText(in:) to detect text.
  - recognizeText(in:): Uses the Vision framework (VNRecognizeTextRequest) to perform text recognition on the captured video frame. The recognized text is checked for a valid speed value (a number between 10 and 130, divisible by 10). If a valid speed is detected, the detectedSpeed binding is updated to show it.
  - getCandidate(from:): Processes the recognized text candidates and selects the most likely speed value based on:
    - High confidence (over 90%).
    - Speed range (10 to 130, divisible by 10).
    - Returning the most common value when several candidates are found.
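Running VNRecognizeTextRequest on every captured frame with default settings can be expensive. The sample keeps things simple, but for real-time use the request exposes a few tuning options worth knowing; the values below are illustrative assumptions, not part of the original code:

import Vision
import CoreGraphics

// Illustrative tuning of the recognition request for a live video feed.
let tunedRequest = VNRecognizeTextRequest { request, _ in
    // Handle [VNRecognizedTextObservation] exactly as in recognizeText(in:).
}
tunedRequest.recognitionLevel = .fast        // Trade accuracy for throughput.
tunedRequest.usesLanguageCorrection = false  // Speed limits are plain digits.
// Only inspect the center of the frame (normalized coordinates, lower-left origin).
tunedRequest.regionOfInterest = CGRect(x: 0.25, y: 0.25, width: 0.5, height: 0.5)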
Conclusions
This example only dips our toes into the broad possibilities of the Vision framework, which is not limited to text detection: shapes and other features can be recognized as well.
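For example, shape detection follows the same request and handler pattern as text recognition; here is a brief sketch using VNDetectRectanglesRequest (not part of the sample app):

import Vision
import CoreVideo

// Illustrative sketch: the same handler pattern applied to shape detection.
func detectRectangles(in pixelBuffer: CVPixelBuffer) {
    let request = VNDetectRectanglesRequest { request, _ in
        guard let rectangles = request.results as? [VNRectangleObservation] else { return }
        for rectangle in rectangles {
            // boundingBox is in normalized coordinates (0...1, lower-left origin).
            print("Rectangle at \(rectangle.boundingBox), confidence \(rectangle.confidence)")
        }
    }
    request.maximumObservations = 5 // Limit the number of returned rectangles.

    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
    try? handler.perform([request])
}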
You can find the source code used for this post in the accompanying GitHub repository. You can also try this implementation in Car Clip Camera, an app available on the App Store.
References
- AVFoundation, Apple Developer Documentation
- Vision, Apple Developer Documentation
- Car Clip Camera, App Store