Week 6: Completing Gemini Integration and Changing the Detection Workflow


I am an AI/ML enthusiast with a strong passion for bridging technology and social impact. I love solving complex problems whether it's solving confusion on any AI/ML concept or building LLM systems for real-world applications.

This week, I focused on completing the Gemini agent integration by implementing the missing pieces from previous weeks. The goal was to create a seamless workflow where Gemini can intelligently call DeepForest tools, execute them, and provide comprehensive analysis combining both AI vision capabilities and precise object detection results.

The main challenges were:

  • Tool Execution: Converting Gemini's tool calls into actual DeepForest predictions

  • Result Integration: Feeding DeepForest outputs back to Gemini for final analysis

The Primary API Call: Tool Decision and Execution

The first API call handles the intelligent decision-making process. The system prompt instructs Gemini to analyze the user query and automatically call the DeepForest tool when it would enhance the analysis. The _handle_tool_call method became the critical bridge between the conversational AI and the computer vision system. This method orchestrates the complex workflow of parameter extraction, cache validation, tool execution, and result integration:

def _handle_tool_call(self, message, image_data: np.ndarray, image_path: str) -> dict:
    """
    Orchestrate DeepForest tool execution with intelligent caching integration
    """
    tool_call = message.tool_calls[0]
    arguments = json.loads(tool_call.function.arguments)

    # Extract file metadata using centralized manager
    image_hash, file_extension = FileManager.validate_and_extract_info(image_path)

    # Create normalized parameters using centralized management
    params = DetectionParameters.from_arguments(arguments, file_extension)

    # Use cache manager for decision logic
    should_run, reason = self.cache_manager.should_run_detection(image_hash, params)

    if should_run:
        print(f"Running detection: {reason}")

        deepforest_args = params.to_deepforest_args(image_data)
        summary_text, annotated_image_array, json_output = self.deepforest_predictor.predict_objects(**deepforest_args)

        # Update cache with results
        self.cache_manager.update_cache(image_hash, params, summary_text, annotated_image_array, json_output)

    else:
        print(f"Using cached results: {reason}")
        summary_text = self.cache_manager.get_detection_summary(params.model_names)
        json_output = self.cache_manager.cached_predictions["predictions_json_str"]

    response = {
        "role": "tool",
        "content": json.dumps({
            "summary": summary_text,
            "detections_json": json_output if json_output else "[]"
        }),
        "name": tool_call.function.name,
        "tool_call_id": tool_call.id
    }
    return response

How it works

  1. Extract the tool call
    The function starts by reading the first tool_call from the message. This contains the arguments passed for detection.

  2. Get image metadata
    Using FileManager.validate_and_extract_info, it extracts a unique hash of the image along with the file extension. This hash is used as a cache key to identify whether the same image has been processed before.

  3. Build normalized parameters
    The DetectionParameters class converts raw arguments into a standardized format that DeepForest understands. This ensures consistency in how parameters are passed to the detection model.

  4. Check the cache
    The cache_manager.should_run_detection method decides whether to run a new detection or reuse cached results. It compares the image hash and parameters with existing records.

  5. Run detection if needed

    • If no cached result exists (or parameters differ), the function calls self.deepforest_predictor.predict_objects.

    • It generates a summary, annotated image, and JSON output of detections.

    • These results are then stored in the cache for future use.

  6. Reuse cached results if available
    If the cache already has the results for this image and parameter set, the function retrieves them directly instead of running DeepForest again.

  7. Return response
    Finally, the function packages the results (summary and detections) into a response dictionary with the necessary metadata (tool_call_id, name, etc.) for downstream usage.

Follow-up Analysis: Contextual Reasoning with Detection Results

Once DeepForest completes its analysis, the agent has both the detection data and an annotated image. Inside primary_api_call(), a follow-up message passes this context to Gemini for a comprehensive analysis:

if self.cache_manager.cached_predictions["annotated_image_array"] is not None:
    json_data_preview = ""
    if self.cache_manager.cached_predictions["predictions_json_str"]:
        try:
            parsed_data = json.loads(self.cache_manager.cached_predictions["predictions_json_str"])
            if parsed_data:
                sample_detection = parsed_data[0] if len(parsed_data) > 0 else {}
                json_data_preview = f"Sample detection format: {json.dumps(sample_detection, indent=2)}"
        except (json.JSONDecodeError, TypeError, KeyError, IndexError):
            json_data_preview = "JSON detection data is available for formatting."

    follow_up_prompt = (
        "The DeepForest tool has completed its analysis and you now have access to detailed detection data. "
        "Below is the image with the detected objects annotated. "
        f"IMPORTANT: You have the complete detection data in JSON format including exact coordinates. {json_data_preview} "
        "You can reformat this data into any format the user requests (JSON, CSV, tables, etc.). "
        "Use both your computer vision analysis of the annotated image AND the precise detection data "
        "to provide comprehensive ecological insights and answer the user's original question."
    )

    simulated_user_msg = [{"type": "text", "text": follow_up_prompt}]
    annotated_image_b64 = encode_image_to_base64_url(self.cache_manager.cached_predictions["annotated_image_array"])
    simulated_user_msg.append({"type": "image_url", "image_url": {"url": annotated_image_b64}})

    self.messages.append({"role": "user", "content": simulated_user_msg})

If predictions_json_str exists, the code tries to parse it and formats the first detection as a small preview. The preview started as a proof of concept to see how the workflow behaves, and it stayed small because of context window limitations: a large JSON output or a full annotated dataset could easily exceed the agent's token limit, causing truncation or failure. The annotated image is then converted to a Base64-encoded URL, and a simulated user message containing both the prompt text and the image is appended to self.messages. This lets the agent continue processing as if the user had provided this combined input.
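The preview step can be isolated into a small helper, sketched here under assumptions: build_json_preview is a hypothetical name, and the field names in the usage example are illustrative rather than the agent's confirmed JSON schema.

```python
import json

FALLBACK = "JSON detection data is available for formatting."


def build_json_preview(predictions_json_str: str) -> str:
    """Return a prompt fragment showing only the first detection."""
    if not predictions_json_str:
        return ""
    try:
        parsed = json.loads(predictions_json_str)
    except (json.JSONDecodeError, TypeError):
        # Unparseable data: tell the model data exists without showing it.
        return FALLBACK
    if not parsed:
        return ""
    if isinstance(parsed, list):
        # One detection is enough to communicate the record format.
        return f"Sample detection format: {json.dumps(parsed[0], indent=2)}"
    return FALLBACK


# Usage with made-up detections:
detections = json.dumps([
    {"xmin": 10, "ymin": 20, "xmax": 50, "ymax": 80, "label": "tree", "score": 0.9},
    {"xmin": 60, "ymin": 15, "xmax": 95, "ymax": 70, "label": "tree", "score": 0.8},
])
preview = build_json_preview(detections)
```

Only the first record reaches the prompt, so the preview stays roughly the same size whether the image contains two detections or two thousand.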

Adding Previous Conversation History with Detection Data in _add_detection_context_to_messages

The function looks at the last 10 messages in the conversation (openai_messages) to find any tool outputs from deepforest_predict_objects. Each relevant message is parsed for JSON content. If detections_json exists and is not empty, it is marked as available. The latest valid detection data is stored in latest_json_data. If detection data is found, a system message is generated with a preview of the JSON data (first 200 characters).
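A minimal sketch of that scan follows. The helper names are hypothetical, and the message shape (role, name, content keys on plain dicts) is assumed from the tool response shown earlier in this post rather than taken from the actual implementation.

```python
import json
from typing import Optional


def find_latest_detection_json(openai_messages: list, window: int = 10) -> Optional[str]:
    """Scan the last `window` messages for non-empty DeepForest tool output."""
    latest = None
    for message in openai_messages[-window:]:
        if message.get("role") != "tool":
            continue
        if message.get("name") != "deepforest_predict_objects":
            continue
        try:
            payload = json.loads(message.get("content", ""))
        except (json.JSONDecodeError, TypeError):
            continue  # skip tool messages whose content is not valid JSON
        if not isinstance(payload, dict):
            continue
        detections = payload.get("detections_json")
        if not detections:
            continue
        if not isinstance(detections, str):
            detections = json.dumps(detections)
        if detections.strip() == "[]":
            continue
        latest = detections  # later messages overwrite earlier ones
    return latest


def build_context_message(latest_json_data: Optional[str], preview_chars: int = 200) -> Optional[dict]:
    """Build the system message carrying a short preview of the detection data."""
    if latest_json_data is None:
        return None
    preview = latest_json_data[:preview_chars]
    return {"role": "system", "content": f"Detection data contains: {preview}..."}
```

Iterating in order and overwriting on each hit means the most recent valid detection wins, which matches the "latest valid detection data" behavior described above.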

Major Changes from Removing the predict_image Method

Following my mentor's advice that DeepForest primarily uses predict_tile for most use cases, I removed the predict_image method and all associated logic for choosing between prediction methods in the src/deepforest_agent/tools/deepforest_tools.py file.

# Before: Complex method selection
if file_extension == '.tif' or patch_size != 400:
    use_predict_tile = True
else:
    use_predict_image = True

# After: Simplified to always use predict_tile
current_predictions = model.predict_tile(
    image=image_data_array,
    patch_size=patch_size,
    patch_overlap=patch_overlap,
    iou_threshold=iou_threshold,
    thresh=thresh
)

The file extension parameter and its dependencies were also removed since standardizing on predict_tile made extension-based decision making obsolete.

Most critically, I fixed a bug in _convert_gradio_to_openai_messages() where the original image wasn't being attached to user messages in the conversation. Because image_path was passed separately, the original image never made it into the converted messages; only the annotated image was sent, and only after DeepForest execution. I restructured the message conversion so that the original image is always included with the last user message for proper visual analysis.

# Fixed: Include image with the last user message
for i, message in enumerate(gradio_history):
    if message["role"] == "user":
        # If this is the last user message and we have an image, include it
        if i == len(gradio_history) - 1 and image_path:
            content_blocks = [
                {"type": "text", "text": content},
                {"type": "image_url", "image_url": {"url": data_url, "detail": "auto"}}
            ]
            openai_messages.append({"role": "user", "content": content_blocks})
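For context, a self-contained version of this conversion might look like the sketch below. It assumes Gradio's "messages" history format with string content, uses a hypothetical function name, and differs from the excerpt in one respect: it locates the last user message explicitly instead of assuming it is the final entry in the history.

```python
from typing import Optional


def convert_history(gradio_history: list, data_url: Optional[str] = None) -> list:
    """Convert Gradio-style history to OpenAI-style messages.

    The image (already encoded as a data URL) is attached to the last
    user message only; every other message is passed through unchanged.
    """
    last_user_index = max(
        (i for i, m in enumerate(gradio_history) if m.get("role") == "user"),
        default=None,
    )
    openai_messages = []
    for i, message in enumerate(gradio_history):
        role, content = message["role"], message["content"]
        if role == "user" and i == last_user_index and data_url:
            content_blocks = [
                {"type": "text", "text": content},
                {"type": "image_url", "image_url": {"url": data_url, "detail": "auto"}},
            ]
            openai_messages.append({"role": "user", "content": content_blocks})
        else:
            openai_messages.append({"role": role, "content": content})
    return openai_messages
```

Attaching the image to a single message keeps earlier turns text-only, so the image payload is not repeated for every user turn in the history.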

Finally, I modified _add_detection_context_to_messages to provide the complete detection JSON dataset instead of just a 200-character preview.

# Before: Limited preview
context = f"Detection data contains: {latest_json_data[:200]}..."

# After: Complete dataset
context_message = {
    "role": "system", 
    "content": (
        f"FULL DETECTION DATA AVAILABLE:\n"
        f"Complete JSON Detection Data: {json_data}\n"
        f"You have access to ALL detection coordinates, confidence scores, and labels."
    )
}

New Workflow After the Changes

Permalink: Commit 9990690

After the changes, the new workflow begins with the user providing text or an image through the Gradio UI. The input flows into GeminiAgent.model_response(), where the system computes an image hash via FileManager and resets the cache if a new image is detected. Then, conversation context is built: previous detection data is added if available, a system message is injected, and the original image is attached to the last user message. The request is sent to Gemini-2.0-flash, which either returns a direct text response or makes a tool call. If a tool call occurs, _handle_tool_call() normalizes parameters, checks the cache, and either retrieves cached results or runs new detection with DeepForestPredictor. The results (summary, annotated image, and JSON) are cached and formatted into a tool response. A follow-up prompt is then added, instructing the model to integrate the detection data into its reasoning. The LLM generates the final response, which is returned along with the annotated image. Finally, the Gradio UI updates and displays the results to the user.
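The control flow of this two-call pattern can be sketched independently of any SDK. Everything here is a simplification for illustration: run_turn is a hypothetical name, the model and the tool are injected as plain callables, and responses are plain dicts rather than the client library's response objects.

```python
import json
from typing import Callable


def run_turn(
    messages: list,
    call_model: Callable[[list], dict],
    run_tool: Callable[[dict], dict],
) -> str:
    """Ask the model; if it requests the tool, run it and ask again."""
    first = call_model(messages)
    tool_calls = first.get("tool_calls")
    if not tool_calls:
        # Direct text answer: no detection was needed.
        return first["content"]

    call = tool_calls[0]  # like the original, only the first call is handled
    result = run_tool(json.loads(call["arguments"]))

    # Record the model's tool request, the tool output, and the follow-up prompt.
    messages.append({"role": "assistant", "content": None, "tool_calls": tool_calls})
    messages.append({
        "role": "tool",
        "name": call["name"],
        "tool_call_id": call["id"],
        "content": json.dumps(result),
    })
    messages.append({
        "role": "user",
        "content": "Use the detection data to answer the original question.",
    })
    return call_model(messages)["content"]
```

Passing the model and tool in as callables makes the flow easy to exercise with stubs, without network access or a loaded detection model.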

Next Week's Steps

Moving forward, the next development phase will transition from commercial APIs to open source models, based on mentor feedback emphasizing the importance of open source solutions to the project's mission. This shift requires a multi-agent architecture: HuggingFace models typically struggle with structured output generation and complex tool orchestration compared to commercial APIs like Gemini, so specialized agents will handle different tasks instead of a single model handling every capability. The existing DeepForest tool implementation will be enhanced with additional utilities, and the new architecture will provide better control over model selection for specific tasks.

Google Summer of Code Blogs

Part 5 of 10

This blog series is a tracker of my progress as a GSoC 2025 contributor under NumFOCUS. I’ll be noting down the gist of my implementations, useful resources, bugs I’ve faced and how I solved them, along with my early plans, ideas, and reflections.
