8000 GitHub - diptiaswath/phonemeRecognizer
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

diptiaswath/phonemeRecognizer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Speech Recognition: Phoneme Classification

Executive Summary

A practical implementation of phoneme classification achieving 74% accuracy on real-world speech data. The system processes raw audio into mel-spectrograms, uses context-aware frame analysis, and employs a deep neural network architecture.

Core Challenges

  • Phonemes span multiple time segments
  • Natural speech has uneven phoneme distribution
  • Speaker variations and background noise
  • Large-scale data processing requirements

Technical Implementation

77F8

Audio Processing

Raw audio processing involves:

  • 25ms frames with 10ms stride
  • Fourier Transform conversion
  • Mel-scale filtering for 40 features
  • Context window of 85 frames (k=42)
  • Input shape: torch.Size([3400])

Dataset Scale

  • Training: 14,542 utterances (18.4M frames)
  • Validation: 2,200 utterances (1.5M frames)
  • Testing: 2,200 utterances (1.6M frames)

Memory Management

Total dataset: ~251.77GB
Batch size: 1024 frames (~93MB)

Optimized DataLoader implementation includes:

  • Multi-threaded loading
  • GPU memory pinning
  • Selective data shuffling

Neural Network Architecture

Layer Structure:
Input (3400) → 1024102451225612864Output (71)

Each layer contains:

  • Linear transformation
  • BatchNorm
  • ReLU activation

Performance Metrics

Test Accuracy: 74.09%
Validation Accuracy: 73.70%
Training Accuracy: 42.53%

Key Findings include:

  • Overall recognition rate of ~74% for identifying 71 distinct speech sounds
  • Misclassifications occur mostly between similar-sounding phonemes
  • Clear distinctions maintained between very different sounds

Future Optimizations

  • Reduce dropout rates
  • Adjust batch normalization momentum
  • Test alternative learning rate schedules

Deep Dives

https://teal-morning-55a.notion.site/Understanding-Speech-Recognition-A-Hands-On-Guide-to-Phoneme-Classification-174fc6aa90f48091a278d99ccfb4eecf

References

  • Deep Learning for AI, Carnegie Mellon University
  • PyTorch Documentation
  • Deep Learning with PyTorch (Stevens, E., et al., 2020)

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published
0