
TfL Journey Expense Calculator
An AI-powered expense automation tool that transforms manual transport reimbursement calculations into an intelligent, production-ready workflow
Motivation
For an employee claiming transport reimbursement, calculating monthly travel expenses is a time-consuming nightmare:
- Manually matching journey dates with work days across multiple months
- Cross-referencing invoices with actual work days
- Calculating costs from multiple TfL statements
- Repeating this process every month or every few months

This app eliminates that hassle entirely. It's not just another CRUD app: it solves a real problem, and it was built to demonstrate real-world AI application development with proper observability, security, and performance optimization.
Key Features
- Multi-format file support: Accepts CSV invoices, PDF statements, or images with intelligent format detection
- AI-powered extraction: Uses Google Gemini to intelligently parse and understand transport data with structured output validation
- Visual date selection: Simple, intuitive calendar UI to select days you actually worked
- Automatic calculation: Instantly computes total reimbursement amount with itemized breakdowns
- Smart summaries: Generates clear, exportable expense reports ready for finance teams
- Multi-invoice support: Handles multiple TfL statements seamlessly across different months
- Production observability: Full-stack tracing with Langfuse for debugging and monitoring AI workflows
- Privacy-first design: No user data is stored; all processing is stateless and in-memory, so nothing persists on the server after a calculation
Technologies Used
- Frontend: React 19, TypeScript, Vite, PDF.js, Tesseract.js
- Backend: Node.js, Express, Google Gemini AI (gemini-2.5-flash-lite)
- Observability: Langfuse, OpenTelemetry
- DevOps: pnpm, Doppler, Concurrently
Architecture
Frontend Layer
The React frontend handles file uploads, date selection, and result visualization:
- FileUpload Component: Drag-and-drop interface supporting CSV, PDF, and image files
- Calendar Component: Interactive date picker for selecting work days
- SummaryReport Component: Displays itemized expenses with total calculations
- Client-side processing: PDF.js for PDF text extraction, Tesseract.js for OCR on images
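The calculation behind the summary report is not shown in this write-up. As a rough sketch, assuming each extracted expense has a `date` and an `amount` and the calendar yields a list of `YYYY-MM-DD` strings, the total could be computed as follows. The names `Expense`, `ReimbursementSummary`, and `calculateReimbursement` are hypothetical, not taken from the project.

```typescript
// Hypothetical shape of one extracted journey charge.
interface Expense {
  date: string; // YYYY-MM-DD
  amount: number; // pounds, positive
}

interface ReimbursementSummary {
  items: Expense[]; // journeys that fall on a selected work day
  totalPence: number; // integer pence, avoids floating-point drift
  total: string; // formatted with two decimals
}

function calculateReimbursement(
  expenses: Expense[],
  workDays: string[],
): ReimbursementSummary {
  const selected = new Set(workDays);
  // Keep only journeys made on selected work days, in date order.
  const items = expenses
    .filter((e) => selected.has(e.date))
    .sort((a, b) => a.date.localeCompare(b.date));
  // Sum in integer pence so rounding errors cannot accumulate.
  const totalPence = items.reduce(
    (sum, e) => sum + Math.round(e.amount * 100),
    0,
  );
  return { items, totalPence, total: (totalPence / 100).toFixed(2) };
}
```

Summing in pence rather than pounds is a design choice for this sketch: it keeps the itemized breakdown and the total consistent to the penny.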


Backend Layer
The Express server acts as a secure proxy, protecting API keys and handling AI requests:
- Secure API proxy: All Gemini API calls go through the backend, ensuring zero client-side API key exposure
- CORS protection: Only configured origins can access the API
- Rate limiting: 30 requests/minute per IP to prevent abuse
- Error handling: Comprehensive error handling with proper HTTP status codes
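The per-IP limit above can be illustrated with a small fixed-window counter. This is a minimal sketch in plain TypeScript, not the project's middleware: the write-up does not say which rate-limiting library (if any) is used, and `FixedWindowRateLimiter` is a hypothetical name.

```typescript
// Hypothetical fixed-window limiter: at most `limit` requests per `windowMs` for each key (e.g. client IP).
class FixedWindowRateLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit = 30, windowMs = 60000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed; false means the caller should answer with HTTP 429.
  allow(key: string, now: number = Date.now()): boolean {
    const current = this.windows.get(key);
    if (!current || now - current.start >= this.windowMs) {
      // First request from this key, or the previous window has expired: start a new window.
      this.windows.set(key, { start: now, count: 1 });
      return true;
    }
    if (current.count >= this.limit) {
      return false;
    }
    current.count += 1;
    return true;
  }
}
```

In an Express handler, the key would typically be the client IP, and a `false` result would map to a 429 response. The sketch never evicts old entries, so a long-running server would also need to prune the map periodically.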
AI Processing Layer
The core intelligence lives in the Gemini service:
- Multi-modal processing: Handles text (CSV), PDFs, and images (via OCR)
- Structured extraction: Uses Gemini's structured output with JSON schema validation
- Intelligent parsing: Understands TfL invoice formats, handles date inheritance, and filters out non-journey entries
- Chunking strategy: Large PDFs are split into 2-4 page chunks for parallel processing
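Date inheritance and non-journey filtering can be illustrated without a model. The sketch below is not the project's parser; it assumes an illustrative layout in which a date header starts with `DD/MM/YYYY` and journey lines end with a `£` amount, and real TfL statements may be laid out differently. `parseStatementLines` and the three regular expressions are hypothetical.

```typescript
interface Expense {
  date: string; // YYYY-MM-DD
  amount: number;
}

// Assumed, illustrative formats; real TfL statements may use different layouts.
const DATE_HEADER = /^(\d{2})\/(\d{2})\/(\d{4})\b/;
const AMOUNT_AT_END = /£(\d+\.\d{2})\s*$/;
const NON_JOURNEY = /\b(cap|capped|total|payment|top up|refund|credit|adjustment)\b/i;

function parseStatementLines(text: string): Expense[] {
  const expenses: Expense[] = [];
  let currentDate: string | null = null;

  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (line === "") continue;

    const dateMatch = DATE_HEADER.exec(line);
    if (dateMatch) {
      // A date header: remember it so the following journey lines can inherit it.
      currentDate = `${dateMatch[3]}-${dateMatch[2]}-${dateMatch[1]}`;
    }
    // Skip caps, payments, refunds and similar non-journey lines.
    if (NON_JOURNEY.test(line)) continue;

    const amountMatch = AMOUNT_AT_END.exec(line);
    if (amountMatch && currentDate !== null) {
      expenses.push({ date: currentDate, amount: Number(amountMatch[1]) });
    }
  }
  return expenses;
}
```

A journey line that appears before any date header is dropped rather than assigned a guessed date.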
Installation and Setup
Prerequisites
- Node.js (v18+)
- pnpm (or npm)
- Google Gemini API key (Get one free)
- Langfuse account (optional but recommended for observability)
Clone the Repository
git clone https://github.com/crypticseeds/tfL-journey-expense-calculator.git
cd tfL-journey-expense-calculator
Install Dependencies
pnpm install
Environment Configuration
Create a .env file:
# Required: Google Gemini API Key
GEMINI_API_KEY=your-gemini-api-key-here
# Optional: Langfuse for AI observability (recommended)
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_BASE_URL=https://cloud.langfuse.com
# Backend configuration
PORT=3001
FRONTEND_ORIGIN=http://localhost:3000
Running the Application
Development Mode:
pnpm run dev:all
The application will be available at:
- Frontend: http://localhost:3000
- Backend API: http://localhost:3001
Using Doppler for Production Secret Management:
doppler login
doppler setup
doppler run -- pnpm run dev:all
Key Features Deep Dive
Feature 1: Multi-Format Document Processing
How It Works
The system intelligently detects file types and routes them to appropriate processors:
- CSV files: Directly parsed using a custom parser that handles TfL statement formats
- PDF files: Extracted using PDF.js, preserving line structure and spatial relationships
- Images: Processed through Tesseract.js OCR, then sent to Gemini for extraction
The key innovation is preserving document structure (especially for PDFs) to maintain date context that gets inherited by subsequent journey entries.
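The routing step could look roughly like this. It is a sketch under assumptions: the write-up does not show how detection is done, so the MIME-type-then-extension order, the list of image extensions, and the name `detectFileKind` are all hypothetical.

```typescript
type FileKind = "csv" | "pdf" | "image" | "unsupported";

// Decide which processor a file should go to, using the MIME type first and the extension as a fallback.
function detectFileKind(fileName: string, mimeType: string): FileKind {
  const mime = mimeType.toLowerCase();
  const ext = fileName.toLowerCase().split(".").pop() ?? "";

  if (mime === "application/pdf" || ext === "pdf") return "pdf";
  if (mime === "text/csv" || ext === "csv") return "csv";
  if (mime.startsWith("image/") || ["png", "jpg", "jpeg", "webp"].includes(ext)) {
    return "image";
  }
  return "unsupported";
}
```

Checking the extension as well as the MIME type helps with CSV files, which some browsers and operating systems report under a spreadsheet MIME type instead of `text/csv`.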
Code Example
// PDF text extraction with line structure preservation
// Assumed shape of the Line type, which the original excerpt references but does not define
type LinePart = { x: number; str: string };
type Line = { y: number; parts: LinePart[] };

function rebuildPageTextWithLineBreaks(textContent: {
  items?: Array<{ str?: string; transform?: number[] }>;
}): string {
  const items = textContent.items ?? [];
  const lines: Line[] = [];
  const yThreshold = 2; // tolerance for grouping into same visual line
  // Group text items by Y position to preserve line structure
  for (const item of items) {
    // transform is a 6-element matrix: index 4 is the X offset, index 5 is the Y offset
    const tr = item.transform ?? [0, 0, 0, 0, 0, 0];
    const y = tr[5] ?? 0;
    let line = lines.find((l) => Math.abs(l.y - y) <= yThreshold);
    if (!line) {
      line = { y, parts: [] };
      lines.push(line);
    }
    line.parts.push({ x: tr[4] ?? 0, str: String(item.str ?? "") });
  }
  // Sort top-to-bottom (PDF Y coordinates grow upward), then left-to-right within each line
  lines.sort((a, b) => b.y - a.y);
  return lines
    .map((l) =>
      l.parts
        .sort((a, b) => a.x - b.x)
        .map((p) => p.str)
        .join(" "),
    )
    .join("\n");
}
Feature 2: AI-Powered Structured Extraction
How It Works
Google Gemini processes the extracted text with a carefully crafted prompt that:
- Understands TfL statement formats (contactless/Oyster)
- Handles date inheritance (dates printed once, followed by multiple journeys)
- Filters out non-journey entries (caps, payments, refunds)
- Returns structured JSON with validated date and amount fields
The system uses Gemini's structured output feature with a JSON schema, ensuring type safety and validation.
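Structured output constrains the shape of the response, but a defensive check on the parsed JSON is still cheap. The sketch below is an assumption about how such a check could look, not the project's `travelDataSchema`, which is not shown in this write-up. `isValidExpense` and `validateTravelData` are hypothetical names.

```typescript
interface Expense {
  date: string; // YYYY-MM-DD
  amount: number; // positive
}

const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

function isValidExpense(value: unknown): value is Expense {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as { date?: unknown; amount?: unknown };
  if (typeof candidate.date !== "string" || !ISO_DATE.test(candidate.date)) return false;
  // Reject strings that look like dates but are not real calendar days, such as 2024-02-30.
  const parsed = new Date(`${candidate.date}T00:00:00Z`);
  if (Number.isNaN(parsed.getTime()) || parsed.toISOString().slice(0, 10) !== candidate.date) {
    return false;
  }
  return (
    typeof candidate.amount === "number" &&
    Number.isFinite(candidate.amount) &&
    candidate.amount > 0
  );
}

// Parse the model's JSON text and keep only well-formed entries.
function validateTravelData(jsonText: string): Expense[] {
  let data: unknown;
  try {
    data = JSON.parse(jsonText);
  } catch {
    return []; // not valid JSON at all
  }
  if (typeof data !== "object" || data === null) return [];
  const expenses = (data as { expenses?: unknown }).expenses;
  return Array.isArray(expenses) ? expenses.filter(isValidExpense) : [];
}
```

Dropping malformed entries silently is one option; logging them or surfacing a warning to the user would be a reasonable alternative when a missed journey matters.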
Code Example
const PROMPT = `
You are an expert data extraction agent for TfL contactless/Oyster statements.
Extract EVERY individual journey charge from this document.
CRITICAL:
- Dates are often printed once followed by multiple journey lines;
subsequent journeys inherit the most recent date header.
- Output date as YYYY-MM-DD. Output amount as a positive number.
- Strictly IGNORE non-journey lines: cap, capped, daily cap, weekly cap,
total, payment, auto top up, refund, credit, adjustment.
Return ONLY JSON: { "expenses": [ { "date": "YYYY-MM-DD", "amount": 0.00 } ] }
`;
const response = await generateContent({
  model: "gemini-2.5-flash-lite",
  contents: {
    parts: [{ text: PROMPT }, { text: documentText }],
  },
  config: {
    responseMimeType: "application/json",
    responseSchema: travelDataSchema,
    thinkingConfig: { thinkingBudget: 4096 },
  },
});
Feature 3: Performance Optimization with Parallel Processing
How It Works
Large PDFs are intelligently chunked and processed in parallel:
- Chunking strategy: PDFs with >4 pages are split into 2-4 page chunks
- Parallel processing: Up to 3 chunks processed concurrently
- File-level parallelization: Up to 3 files processed simultaneously
- Result merging: Chunks are merged with date inheritance logic preserved
This optimization reduced processing time for an 8-page PDF from 58.63 seconds to 12.58 seconds—a 78.5% improvement.
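How the page ranges themselves are built is not shown in this write-up. A minimal sketch, assuming 1-based page numbers and the thresholds described in this section, could look like this. `pagesPerChunkFor` and `buildPageChunks` are hypothetical names.

```typescript
// Choose the chunk size using the thresholds described in this section.
function pagesPerChunkFor(numPages: number): number {
  if (numPages <= 4) return Math.max(numPages, 1); // single chunk for small PDFs
  if (numPages <= 8) return 2; // medium PDFs
  return 4; // large PDFs
}

// Split 1-based page numbers into consecutive chunks.
function buildPageChunks(numPages: number): number[][] {
  const size = pagesPerChunkFor(numPages);
  const chunks: number[][] = [];
  for (let start = 1; start <= numPages; start += size) {
    const chunk: number[] = [];
    for (let page = start; page < start + size && page <= numPages; page++) {
      chunk.push(page);
    }
    chunks.push(chunk);
  }
  return chunks;
}
```

With these thresholds, an 8-page PDF becomes four 2-page chunks, so a concurrency limit of 3 means it is processed in two rounds.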
Code Example
// Determine chunking strategy
let pagesPerChunk: number;
if (pdf.numPages <= 4) {
  pagesPerChunk = pdf.numPages; // Single chunk for small PDFs
} else if (pdf.numPages <= 8) {
  pagesPerChunk = 2; // 2 pages per chunk for medium PDFs
} else {
  pagesPerChunk = 4; // 4 pages per chunk for large PDFs
}
// Process chunks in parallel (limit to 3 concurrent)
const MAX_CONCURRENT_CHUNKS = 3;
// Wrap each chunk in a function so its request only starts when its batch runs.
// Calling processPdfChunk directly inside map() would start every request at once.
const chunkTasks = chunks.map((chunkPages, chunkIndex) => () =>
  processPdfChunk(
    chunkPages,
    chunkIndex,
    chunks.length,
    model,
    onProgressUpdate,
  ),
);
// Process with concurrency limit, one batch at a time
for (let i = 0; i < chunkTasks.length; i += MAX_CONCURRENT_CHUNKS) {
  const batch = chunkTasks.slice(i, i + MAX_CONCURRENT_CHUNKS);
  const batchResults = await Promise.all(batch.map((task) => task()));
  chunkResults.push(...batchResults);
}
Technical Challenges
Challenge 1: PDF Structure Preservation
Problem: PDFs don't preserve logical line structure—text is extracted as individual positioned elements. TfL statements print dates once, followed by multiple journey entries that inherit that date. Without preserving line structure, date context is lost.
Solution: Implemented a Y-position-based line grouping algorithm that reconstructs visual lines by grouping text items within a 2-pixel Y-threshold. This preserves the document's visual structure, allowing the AI to correctly identify date headers and their associated journeys.
// Group text items by Y position to preserve line structure
const yThreshold = 2;
for (const item of items) {
  const tr = item.transform ?? [0, 0, 0, 0, 0, 0];
  const x = tr[4] ?? 0;
  const y = tr[5] ?? 0;
  let line = lines.find((l) => Math.abs(l.y - y) <= yThreshold);
  if (!line) {
    line = { y, parts: [] };
    lines.push(line);
  }
  line.parts.push({ x, str: String(item.str ?? "") });
}
Challenge 2: Performance Bottleneck with Large PDFs
Problem: Using Langfuse tracing, we identified that a single 8-page PDF was taking 58.63 seconds to process, with the Gemini API call consuming 97% of the time (57.03 seconds). The entire document was being sent in a single large API call.

Solution: Implemented a multi-pronged optimization strategy:
- Reduced thinking budget: From 32,768 to 4,096 tokens (20-40% faster for structured extraction)
- PDF chunking: Split large PDFs into 2-4 page chunks, processed in parallel
- File-level parallelization: Process up to 3 files concurrently
- Model optimization: Switched from gemini-2.5-pro to gemini-2.5-flash-lite for faster inference
Results:
- Before: 58.63 seconds for an 8-page PDF
- After: 12.58 seconds for the same PDF
- 78.5% latency reduction (46.05 seconds saved)
- 4.7x speedup for single large PDFs
- 3x speedup for multiple file processing
The key insight was that Langfuse tracing made the bottleneck immediately visible—without the detailed trace showing the exact breakdown, we wouldn't have known that chunking and parallelization would provide the biggest gains.

Challenge 3: Date Inheritance Across Chunks
Problem: When splitting PDFs into chunks, date headers might appear at the end of one chunk while their associated journeys appear in the next chunk. This breaks date inheritance logic.
Solution: Implemented a merging strategy that:
- Tracks the last known date from each chunk
- Applies date inheritance during chunk merging
- Uses a fallback heuristic parser if AI extraction fails
- Validates dates to ensure they're within reasonable ranges
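The first two points of the merging strategy can be sketched as follows. The data shapes are assumptions made for illustration: each chunk is taken to report its entries (with `date` set to `null` when a journey appeared before any date header in that chunk) plus the last date header it saw. `ChunkEntry`, `ChunkResult`, and `mergeChunkResults` are hypothetical names, not the project's.

```typescript
interface ChunkEntry {
  date: string | null; // null when the journey appeared before any date header in its chunk
  amount: number;
}

interface ChunkResult {
  chunkIndex: number;
  entries: ChunkEntry[];
  lastDate: string | null; // most recent date header seen in the chunk, even if no journey followed it
}

interface Expense {
  date: string;
  amount: number;
}

function mergeChunkResults(results: ChunkResult[]): Expense[] {
  const merged: Expense[] = [];
  let carriedDate: string | null = null;

  // Parallel chunks may finish out of order, so restore document order first.
  const ordered = [...results].sort((a, b) => a.chunkIndex - b.chunkIndex);

  for (const chunk of ordered) {
    for (const entry of chunk.entries) {
      const date = entry.date ?? carriedDate;
      if (date === null) continue; // no date context at all: skip rather than guess
      merged.push({ date, amount: entry.amount });
    }
    // Carry this chunk's last date header forward into the next chunk.
    carriedDate = chunk.lastDate ?? carriedDate;
  }
  return merged;
}
```

The sketch covers the case from the problem statement: a date header that closes one chunk is applied to the undated journeys that open the next one.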
Performance Optimization
Implemented Optimizations
Caching Strategy
Caching is not implemented in the current version, but the architecture leaves room for it at several levels:
- Client-side: File contents are kept in memory during processing
- Backend: Could cache extracted data per file hash
- AI responses: Could cache Gemini responses for identical inputs
Database Optimization
No database is used: all processing is stateless and in-memory, and files are processed on demand and discarded after calculation. This design choice prioritizes simplicity and privacy, so sensitive financial information never persists on servers.
Frontend Performance
- Lazy loading: Components are loaded on-demand
- Progress indicators: Real-time progress updates during file processing
- Error boundaries: Graceful error handling with user-friendly messages
Performance Metrics
Processing Time Improvements:
- Single 8-page PDF: 58.63s → 12.58s (78.5% reduction)
- Multiple files: Sequential → 3x parallel processing
- API call latency: 20-40% reduction through thinking budget optimization
Cost Optimization:
- Model switch: gemini-2.5-pro → gemini-2.5-flash-lite (faster, cheaper)
- Thinking budget: 32,768 → 4,096 tokens (87.5% reduction)
- Parallel processing: Reduced total API time despite more calls
Option 1: Vercel + Railway
- Frontend: Deploy React app to Vercel
- Backend: Deploy Express server to Railway
- Environment variables: Configure in each platform's dashboard
Security Checklist for Production
- Use HTTPS for all connections
- Set FRONTEND_ORIGIN to your actual domain
- Rotate API keys regularly
- Enable Langfuse for monitoring
- Set up alerts for API usage spikes
- Implement request logging and monitoring
- Use Doppler or similar for secret management
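Some of these checklist items can be verified automatically at startup. The function below is a hypothetical sketch, not part of the project: the variable names match the .env example earlier in this write-up, but `AppEnv`, `checkProductionConfig`, and the specific rules are assumptions.

```typescript
// Hypothetical view of the environment; names match the .env example in this write-up.
interface AppEnv {
  GEMINI_API_KEY?: string;
  FRONTEND_ORIGIN?: string;
  LANGFUSE_PUBLIC_KEY?: string;
  LANGFUSE_SECRET_KEY?: string;
}

// Returns a list of human-readable problems; an empty list means the checks passed.
function checkProductionConfig(env: AppEnv): string[] {
  const problems: string[] = [];
  if (!env.GEMINI_API_KEY) {
    problems.push("GEMINI_API_KEY is missing");
  }
  const origin = env.FRONTEND_ORIGIN ?? "";
  if (!origin.startsWith("https://")) {
    problems.push("FRONTEND_ORIGIN should be an https:// origin in production");
  }
  if (origin.includes("localhost")) {
    problems.push("FRONTEND_ORIGIN still points at localhost");
  }
  if (!env.LANGFUSE_PUBLIC_KEY || !env.LANGFUSE_SECRET_KEY) {
    problems.push("Langfuse keys are not set, so tracing is unavailable");
  }
  return problems;
}
```

A server could run this once at startup and refuse to boot, or just log the problems, depending on how strict the deployment needs to be.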
Future Enhancements
Planned Features
- Google Calendar integration: Automatically detect and select work days from Google Calendar events, eliminating manual date selection
- Smart workday presets: Quick-select options for typical work schedules (Monday-Friday, custom weekday patterns, or specific days of the week)
- Automated email report generation
- Batch processing for multiple months at once
Lessons Learned
Technical Insights
- Observability is crucial for AI systems: Langfuse tracing revealed the exact bottleneck (97% of time in Gemini API calls), which wouldn't have been obvious without detailed traces.
- Structured outputs are game-changers: Gemini's structured output feature eliminated the need for complex JSON parsing and validation, reducing errors significantly.
- Chunking strategy matters: The optimal chunk size (2-4 pages) balances API call overhead with parallelization benefits. Too small = too many API calls, too large = slow sequential processing.
- Thinking budget optimization: For structured extraction tasks, reducing thinking budget from 32K to 4K tokens provided 20-40% speedup with no accuracy loss.
Best Practices Discovered
- Always trace AI workflows: Without Langfuse, we wouldn't have identified the performance bottleneck
- Design for parallelization: The chunking architecture enables horizontal scaling
- Security-first architecture: Backend proxy pattern protects API keys and enables rate limiting
- Progressive enhancement: The app works without Langfuse, but observability adds significant value
What I'd Do Differently
- Start with observability: Would integrate Langfuse from day one, not as an afterthought
- More comprehensive testing: Would write tests alongside implementation, not after
- Better error messages: Would provide more specific error messages for different failure modes
- Performance testing earlier: Would identify bottlenecks before they become problems
Resources
Documentation
GitHub Repository: tfl-journey-expense-calculator
Conclusion
The TfL Journey Expense Calculator demonstrates that AI can solve real-world problems when combined with production-grade engineering practices. By focusing on observability, performance optimization, and security, we built a tool that's not just a prototype, but a production-ready application.
The project showcases the importance of:
- Problem-first thinking: Starting with user pain points, not technology
- Production observability: Langfuse tracing was instrumental in identifying and solving performance bottlenecks
- Performance optimization: Data-driven optimization (78.5% latency reduction) through careful measurement and iteration
- Security architecture: Enterprise-grade API key protection and rate limiting