500 Internal Server Error occurs at random moments

Issue Overview

I have a Vercel-deployed Next.js (v14.2.16) application that frequently encounters 500 errors on two public pages. These pages fetch data from Firebase Firestore using the Firebase Admin SDK.

Key Observations

  1. The errors seem to resolve automatically upon refreshing the page.
  2. These errors are not being caught by any error handlers in the application, including:
    • Middleware
    • try/catch blocks
    • Firebase Admin SDK initialization
    • Database access functions
  3. This suggests that the app may be failing before rendering starts.
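One way an error can bypass every one of those handlers yet still crash the request is an un-awaited promise: its rejection is not bound to any enclosing try/catch, so it surfaces later as exactly the kind of "Unhandled Rejection" the Vercel log reports. A minimal sketch with hypothetical function names (not the app's actual code):

```typescript
// Hypothetical sketch: a rejection from a promise that is started but never
// awaited is not bound to the enclosing try/catch, so it surfaces later as
// an "Unhandled Rejection" instead of a handled error.
async function failingWrite(): Promise<void> {
  throw new Error("4 DEADLINE_EXCEEDED: Deadline exceeded");
}

// Fire-and-forget: the catch block below is never reached.
export async function renderFireAndForget(): Promise<string> {
  try {
    void failingWrite(); // BUG: missing `await`; the rejection escapes this scope
    return "ok";
  } catch {
    return "caught";
  }
}

// Awaited: the same failure is now caught where it happened.
export async function renderAwaited(): Promise<string> {
  try {
    await failingWrite();
    return "ok";
  } catch {
    return "caught";
  }
}
```

If any Firestore call in the app follows the first pattern, no request-scoped handler will ever see its failure.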

Behavior Details

  • Reproducibility: The errors are inconsistent and unpredictable. They sometimes occur when directly accessing a URL or refreshing the page after a delay.

  • Vercel Logs: The logs show an error with the message:

    Unhandled Rejection: Error: 4 DEADLINE_EXCEEDED: Deadline exceeded after 76.714s,LB pick: 0.001s,remote_addr=142.250.31.95:443
    at callErrorFromStatus (/var/task/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/call.js:31:19)
    at Object.onReceiveStatus (/var/task/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/client.js:193:76)
    at Object.onReceiveStatus (/var/task/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)
    at Object.onReceiveStatus (/var/task/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)
    at /var/task/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/resolving-call.js:129:78
    at process.processTicksAndRejections (node:internal/process/task_queues:77:11) for call at
    at ServiceClientImpl.makeUnaryRequest (/var/task/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/client.js:161:32)
    at ServiceClientImpl.<anonymous> (/var/task/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/make-client.js:105:19)
    at /var/task/node_modules/@google-cloud/firestore/build/src/v1/firestore_client.js:237:29
    at /var/task/node_modules/google-gax/build/src/normalCalls/timeout.js:44:16
    at repeat (/var/task/node_modules/google-gax/build/src/normalCalls/retries.js:82:25)
    at /var/task/node_modules/google-gax/build/src/normalCalls/retries.js:125:13
    at OngoingCallPromise.call (/var/task/node_modules/google-gax/build/src/call.js:67:27)
    at NormalApiCaller.call (/var/task/node_modules/google-gax/build/src/normalCalls/normalApiCaller.js:34:19)
    at /var/task/node_modules/google-gax/build/src/createApiCall.js:112:30
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    Caused by: Error
    at _firestore._traceUtil.startActiveSpan (/var/task/node_modules/@google-cloud/firestore/build/src/write-batch.js:438:27)
    at DisabledTraceUtil.startActiveSpan (/var/task/node_modules/@google-cloud/firestore/build/src/telemetry/disabled-trace-util.js:16:16)
    at WriteBatch.commit (/var/task/node_modules/@google-cloud/firestore/build/src/write-batch.js:436:43)
    at /var/task/node_modules/@google-cloud/firestore/build/src/reference/document-reference.js:396:18
    at DisabledTraceUtil.startActiveSpan (/var/task/node_modules/@google-cloud/firestore/build/src/telemetry/disabled-trace-util.js:16:16)
    at DocumentReference.update (/var/task/node_modules/@google-cloud/firestore/build/src/reference/document-reference.js:390:43)
    at d (/var/task/.next/server/chunks/175.js:6:413155)
    at o (/var/task/.next/server/chunks/175.js:1:35598)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async /var/task/node_modules/next/dist/compiled/next-server/app-page.runtime.prod.js:16:418 {
    code: 4,
    details: 'Deadline exceeded after 76.714s,LB pick: 0.001s,remote_addr=142.250.31.95:443',
    metadata: Metadata { internalRepr: Map(0) {}, options: {} },
    note: 'Exception occurred in retry method that was not classified as transient'}
    Node.js process exited with exit status: 128. The logs above can help with debugging the issue.
    
  • Execution Timing:
    Despite the error message indicating “Deadline exceeded after 76.714s,” the execution duration shown in the logs is just 32ms. This suggests the error might be caused by cumulative retries or network-layer issues.
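The stack trace is worth a close read here: it ends in `WriteBatch.commit` via `DocumentReference.update`, i.e. a Firestore write, not the page's data fetch. One plausible explanation (an assumption, not confirmed by the logs) is a fire-and-forget write such as a view counter started during render: Vercel suspends the instance once the response is sent, the gRPC deadline timer keeps counting wall-clock time while the instance is frozen, and the rejection then fires at the start of a later, otherwise fast invocation. That would reconcile the 32ms duration with the 76.714s deadline and explain why no request-scoped handler catches it. A dependency-free sketch of the mitigation, with illustrative names only:

```typescript
// Dependency-free sketch (assumption: a background write such as a view
// counter is started during render). Collect every background write and
// settle it before returning, so nothing is left pending when the
// serverless runtime freezes the instance.
type Write = () => Promise<void>;

export async function handleRequest(
  render: () => string,
  backgroundWrites: Write[],
): Promise<string> {
  // Attach a .catch so a failed write is logged rather than escaping as an
  // "Unhandled Rejection" that crashes a later invocation.
  const pending = backgroundWrites.map((w) =>
    w().catch((err) => console.error("background write failed:", err)),
  );
  const html = render();
  await Promise.all(pending); // settle before the function returns
  return html;
}
```

On Vercel specifically, `waitUntil` from the `@vercel/functions` package is another way to let a background write outlive the response safely, though whether it applies here depends on how the writes are issued.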

Error Handling

Every function in the application includes a try/catch block to handle potential errors, but none of them log anything or appear to execute in this case.
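Because an unhandled rejection belongs to no request scope, only a process-level hook can observe it; middleware, route-level try/catch, and error boundaries never run for it. In Next.js 14 such a hook could live in `instrumentation.ts` (assumption: `experimental.instrumentationHook` is enabled in `next.config.js`). It can only log the failure, not recover the request:

```typescript
// Sketch: a process-level hook is the only place an unhandled rejection can
// be observed; middleware, route try/catch, and React error boundaries run
// in request scope and never see it. In Next.js 14 this could live in
// instrumentation.ts (assumption: experimental.instrumentationHook is
// enabled in next.config.js). It only logs; it cannot recover the request.
export function register(): void {
  process.on("unhandledRejection", (reason) => {
    console.error("[unhandledRejection]", reason);
  });
}
```

At minimum this would attach an application-side log to each occurrence, which should make the failing call site easier to trace.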

Vercel Log Snapshot

From the log, the execution duration is just 32ms, contradicting the stated deadline of 76.714s.


Environment Details

  • Next.js: 14.2.16
  • Firebase: 10.13.2
  • Firebase Admin: 12.7.0

Request for Help

Any ideas on what might be causing this issue and potential fixes? I’m open to suggestions for improving error handling or mitigating these random failures.

Hi, @mysr-io! Welcome to the Vercel Community :smile:

Thanks for the detailed post.

Here are some troubleshooting tips to help resolve the 500 errors:

  1. Review Firebase Admin SDK Initialization:
    Ensure you’re initializing the Firebase Admin SDK correctly and only once. Place this initialization in a separate file and import it where needed.

  2. Implement Firestore Connection Pooling:
    Consider adding connection pooling to your Firestore client to manage connections more efficiently. This can help prevent connection-related timeouts.

  3. Implement Retry Logic:
    Create a utility function to retry Firestore operations. This can help handle temporary network issues or service disruptions.

  4. Adjust Vercel Function Timeout:
    Check your vercel.json file and consider increasing the serverless function timeout. This may help with operations that are taking longer than expected.

  5. Implement Caching:
    Use a caching mechanism to reduce the number of Firestore queries. This can improve performance and reduce the likelihood of timeouts.

  6. Error Boundary Implementation:
    Implement an Error Boundary component in your React application to catch and handle unexpected errors gracefully.

  7. Enhance Logging:
    Implement more comprehensive logging throughout your application, especially around Firestore operations. Consider using a service like Sentry or LogRocket for advanced error tracking and monitoring.

  8. Check Firestore Security Rules:
    Review your Firestore security rules to ensure they’re not inadvertently blocking access to your data.

  9. Verify Environment Variables:
    Double-check that all necessary environment variables (like Firebase credentials) are correctly set in your Vercel project settings.

  10. Analyze Network Requests:
    Use browser developer tools to analyze network requests on the problematic pages. Look for any failed requests or unusually long loading times.

  11. Review Data Fetching Methods:
    Ensure you’re using appropriate data fetching methods for your Next.js version (e.g., getServerSideProps, getStaticProps, or React Server Components for Next.js 13+).

  12. Check for Memory Leaks:
    Investigate potential memory leaks in your application, especially in components that manage their own state or set up event listeners.

  13. Analyze Serverless Function Cold Starts:
    Be aware of cold start times for serverless functions. Consider implementing strategies to keep your functions warm if cold starts are causing issues.

Let us know how you get on! :smile:

  1. Firebase Admin SDK is already initialised as you mentioned.
  2. Firestore itself does not natively support connection pooling. However, I have already optimized Firestore usage and avoided unnecessary overhead by reusing a single Firestore instance across the application.
  3. I am unable to troubleshoot the issue because its source cannot be traced. To address this, I plan to implement global retry logic around all database queries.
  4. I adjusted the serverless function timeout, but the error was produced just 30ms after the URL was hit, which makes me doubt that the timeout setting affects the issue at all. I suspect a connection is not being properly released after an initial request, and when I refresh the page, that connection's timeout error is surfaced.
  5. I can’t implement a caching mechanism because the page content changes constantly and the page is dynamically rendered.
  6. I have already implemented an error boundary component, but it is not catching this error. Additionally, none of the other error-handling methods are able to catch it either.
  7. I have already added logging at various points of execution, but none of it is being emitted, not even the log in the middleware.

All other potential causes have already been validated, as I have been stuck with this issue for over two months. It is challenging to find a solution because the issue is also difficult to reproduce.
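The global retry logic mentioned in point 3 could be sketched as a generic wrapper that retries only transient gRPC status codes; the codes, attempt count, and delays below are illustrative defaults, not values from the app. One caveat: the log above says this particular error was "not classified as transient", so retries may mask the symptom rather than fix the root cause.

```typescript
// Sketch of the global retry logic mentioned in point 3: wrap any Firestore
// call and retry only transient gRPC status codes. Codes, attempt count,
// and delays are illustrative defaults, not values from the app.
const RETRYABLE_CODES = new Set([4, 14]); // DEADLINE_EXCEEDED, UNAVAILABLE

export async function withRetry<T>(
  op: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      const code = (err as { code?: number }).code;
      if (code === undefined || !RETRYABLE_CODES.has(code)) throw err;
      // Exponential backoff: 200ms, 400ms, 800ms, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}
```

Usage would look like `withRetry(() => db.collection("pages").doc(id).get())`, where `db`, `"pages"`, and `id` are hypothetical names standing in for the app's actual query.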

Hi @pawlean, do you have any idea why this might be happening?

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.