Most likely, you have faced a situation where you're enjoying the seamless flow of an application—for instance, while making a train or hotel reservation. Then, suddenly—bam!—a never-ending form appears, disrupting the experience. I'm not saying that filling out such forms is irrelevant for the business—quite the opposite. However, as an app owner, you may notice in your analytics a significant drop in user conversions at this stage.

In this post, I want to introduce a more seamless and user-friendly text input option to improve the experience of filling out multiple fields in a form.

Base project

To help you understand this topic better, we’ll start with a video presentation. Next, we’ll analyze the key parts of the code. You can also download the complete code from the repository linked below.

To begin entering text, long-press the desired text field. When the bottom line turns orange, speech-to-text mode has been activated. Release your finger once you see the text correctly transcribed. If the transcribed text is valid, the line turns green; otherwise, it turns red.

Let's dig into the code...

The view is built around a language picker, a crucial feature: it lets you choose the language you will dictate in, which matters most when a form contains several text fields. (A sketch of the LocaleManager behind it follows the code.)

import SwiftUI

struct VoiceRecorderView: View {
    @StateObject private var localeManager = appSingletons.localeManager
    @State var name: String = ""
    @State var surname: String = ""
    @State var age: String = ""
    @State var email: String = ""
    var body: some View {
        Form {
            Section {
                Picker("Select language", selection: $localeManager.localeIdentifier) {
                    ForEach(localeManager.locales, id: \.self) { Text($0).tag($0) }
                }
                .pickerStyle(SegmentedPickerStyle())
            }

            Section {
                TextFieldView(textInputValue: $name,
                              placeholder: "Name:",
                              invalidFormatMessage: "Text must be greater than 6 characters!") { textInputValue in
                    textInputValue.count > 6
                }
                
                TextFieldView(textInputValue: $surname,
                              placeholder: "Surname:",
                              invalidFormatMessage: "Text must be longer than 6 characters!") { textInputValue in
                    textInputValue.count > 6
                }
                TextFieldView(textInputValue: $age,
                              placeholder: "Age:",
                              invalidFormatMessage: "Age must be between 18 and 65") { textInputValue in
                    if let number = Int(textInputValue) {
                        return number >= 18 && number <= 65
                    }
                    return false
                }
            }
            
            Section {
                TextFieldView(textInputValue: $email,
                              placeholder: "Email:",
                              invalidFormatMessage: "Must be a valid email address") { textInputValue in
                    let emailRegex = #"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"#
                    let emailPredicate = NSPredicate(format: "SELF MATCHES %@", emailRegex)
                    return emailPredicate.evaluate(with: textInputValue)
                }
            }   
        }
        .padding()
    }
}
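
Note that the view reads its language list from a shared LocaleManager held in an appSingletons container, which this post does not show. Here is a minimal sketch of what it needs to expose; the hardcoded identifiers are only an assumption, and the real implementation ships with the repository:

import SwiftUI

// Hypothetical sketch of the shared LocaleManager referenced above.
final class LocaleManager: ObservableObject {
    // Identifiers offered by the segmented picker.
    let locales = ["en-US", "es-ES", "fr-FR"]

    // Identifier currently selected by the user.
    @Published var localeIdentifier: String = "en-US"

    // Locale handed to SFSpeechRecognizer when recording starts.
    func getCurrentLocale() -> Locale {
        Locale(identifier: localeIdentifier)
    }
}

// Minimal stand-in for the app-wide singleton container.
struct AppSingletons {
    let localeManager = LocaleManager()
}
let appSingletons = AppSingletons()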

For every text field, we need a binding variable to hold the text field’s value, a placeholder for guidance, and an error message to display when the acceptance criteria function is not satisfied.
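
As a quick illustration, this is how a hypothetical phone-number field could plug in its own rule ($phone and the digit check are made up for this example; only the trailing closure differs from the fields above):

TextFieldView(textInputValue: $phone,
              placeholder: "Phone:",
              invalidFormatMessage: "Phone must contain exactly 9 digits") { textInputValue in
    // Accept only a 9-digit numeric string.
    textInputValue.count == 9 && textInputValue.allSatisfy(\.isNumber)
}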

When we examine the TextFieldView, we see that it is essentially a text field enhanced with additional features to improve user-friendliness.

import SwiftUI

struct TextFieldView: View {
    
    @State private var isPressed = false
    
    @State private var borderColor = Color.gray
    @StateObject private var localeManager = appSingletons.localeManager

    @Binding var textInputValue: String
    let placeholder: String
    let invalidFormatMessage: String?
    var isValid: (String) -> Bool = { _ in true }
    
    var body: some View {
        VStack(alignment: .leading) {
            if !textInputValue.isEmpty {
                Text(placeholder)
                    .font(.caption)
            }
            TextField(placeholder, text: $textInputValue)
                .accessibleTextField(text: $textInputValue, isPressed: $isPressed)
                .overlay(
                    Rectangle()
                        .frame(height: 2)
                        .foregroundColor(borderColor),
                    alignment: .bottom
                )
                .onChange(of: textInputValue) { _, newValue in
                    borderColor = getColor(text: newValue, isPressed: isPressed)
                }
                .onChange(of: isPressed) {
                    borderColor = getColor(text: textInputValue, isPressed: isPressed)
                }
            if !textInputValue.isEmpty,
               !isValid(textInputValue),
                let invalidFormatMessage {
                Text(invalidFormatMessage)
                    .foregroundColor(Color.red)
            }
        }
    }
    
    func getColor(text: String, isPressed: Bool) -> Color {
        guard !isPressed else { return Color.orange }
        guard !text.isEmpty else { return Color.gray }
        return isValid(text) ? Color.green : Color.red
    }
    
}

The key point in the above code is the modifier .accessibleTextField, where all the magic of converting voice to text happens. We have encapsulated all speech-to-text functionality within this modifier.

import SwiftUI

extension View {
    func accessibleTextField(text: Binding<String>, isPressed: Binding<Bool>) -> some View {
        self.modifier(AccessibleTextField(text: text, isPressed: isPressed))
    }
}

struct AccessibleTextField: ViewModifier {
    @StateObject private var viewModel = VoiceRecorderViewModel()
    
    @Binding var text: String
    @Binding var isPressed: Bool
    // Serializes the gesture callbacks so recording starts and stops exactly once per press.
    private let lock = NSLock()

    func body(content: Content) -> some View {
        content
            .onChange(of: viewModel.transcribedText) {
                guard viewModel.transcribedText != "" else { return }
                self.text = viewModel.transcribedText
            }
            .simultaneousGesture(
                DragGesture(minimumDistance: 0)
                    .onChanged { _ in
                        lock.withLock {
                            if !isPressed {
                                isPressed = true
                                viewModel.startRecording(locale: appSingletons.localeManager.getCurrentLocale())
                            }
                        }
                    }
                    .onEnded { _ in
                        lock.withLock {
                            if isPressed {
                                isPressed = false
                                viewModel.stopRecording()
                            }
                        }
                    }
            )
    }
}

The voice-to-text functionality lives in VoiceRecorderViewModel. In the view, recording starts when the user presses down on the field and stops when the finger is released. The transcribed text is then forwarded upward through the text binding.

Finally, here is the view model that handles the transcription:

import Foundation
import AVFoundation
import Speech

class VoiceRecorderViewModel: ObservableObject {
    @Published var transcribedText: String = ""
    @Published var isRecording: Bool = false
    
    private let audioSession = AVAudioSession.sharedInstance()
    private var recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
    private var recognitionTask: SFSpeechRecognitionTask?
    private var audioEngine = AVAudioEngine()
    
    var speechRecognizer: SFSpeechRecognizer?

    func startRecording(locale: Locale) {
        do {
            self.speechRecognizer = SFSpeechRecognizer(locale: locale)

            recognitionTask?.cancel()
            recognitionTask = nil
            // A recognition request cannot be reused after endAudio(), so create
            // a fresh one for every recording session.
            recognitionRequest = SFSpeechAudioBufferRecognitionRequest()

            try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
            try audioSession.setActive(true, options: .notifyOthersOnDeactivation)

            guard let recognizer = speechRecognizer, recognizer.isAvailable else {
                transcribedText = "Reconocimiento de voz no disponible para el idioma seleccionado."
                return
            }
            
            let inputNode = audioEngine.inputNode
            let recordingFormat = inputNode.outputFormat(forBus: 0)
            inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { buffer, when in
                self.recognitionRequest.append(buffer)
            }
            
            audioEngine.prepare()
            try audioEngine.start()
            
            recognitionTask = recognizer.recognitionTask(with: recognitionRequest) { result, error in
                if let result = result {
                    // Results can arrive on a background queue; publish on the main thread.
                    DispatchQueue.main.async {
                        self.transcribedText = result.bestTranscription.formattedString
                    }
                }
            }
            
            isRecording = true
        } catch {
            transcribedText = "Error al iniciar la grabación: \(error.localizedDescription)"
        }
    }
    
    func stopRecording() {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        recognitionRequest.endAudio()
        recognitionTask?.cancel()
        isRecording = false
    }
}
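
One prerequisite the view model leaves out: speech recognition and microphone capture both require user permission. The app's Info.plist must declare NSSpeechRecognitionUsageDescription and NSMicrophoneUsageDescription, and it is good practice to request authorization before the first recording. A minimal sketch:

import Speech

// Ask for speech-recognition permission up front. The callback can arrive on a
// background queue, so hop back to the main queue before touching UI state.
func requestSpeechAuthorization() {
    SFSpeechRecognizer.requestAuthorization { status in
        DispatchQueue.main.async {
            switch status {
            case .authorized:
                print("Speech recognition authorized")
            case .denied, .restricted, .notDetermined:
                print("Speech recognition unavailable: \(status)")
            @unknown default:
                break
            }
        }
    }
}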

Key Components

  1. Properties:

    • @Published var transcribedText: Holds the real-time transcribed text, allowing SwiftUI views to bind and update dynamically.
    • @Published var isRecording: Indicates whether the application is currently recording.
    • audioSession, recognitionRequest, recognitionTask, audioEngine, speechRecognizer: These manage the audio session, microphone capture, and speech recognition.
  2. Speech Recognition Workflow:

    • SFSpeechRecognizer: Recognizes and transcribes speech from audio input for a specified locale; the snippet after this list shows how to check which locales are supported.
    • SFSpeechAudioBufferRecognitionRequest: Provides an audio buffer for speech recognition tasks.
    • AVAudioEngine: Captures microphone input.
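
If you want the picker to offer only languages the device can actually transcribe, SFSpeechRecognizer can report them. A quick way to inspect the list:

import Speech

// Identifiers of every locale speech recognition supports on this device.
let supported = SFSpeechRecognizer.supportedLocales()
    .map(\.identifier)
    .sorted()
print(supported)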

Conclusions

I encourage you to download the project from the following GitHub repository and start playing with this great technology.

References

  • Speech (Apple Developer Documentation)

Copyright © 2024-2025 JaviOS. All rights reserved