Small files

Note

These reports are available only for Cloudera Distribution of Apache Hadoop (CDH) and Cloudera Data Platform (CDP) platforms and require HDFS administrator privileges. If you can't grant HDFS administrator privileges to the user unravel, refer to configuring FSimage.

A single mapper accesses each small file, so a large number of small files can lead to a large number of mappers. Mappers are costly to run and drive up the cost of your apps. This report helps you identify users who create or use an excessive number of small files.

You can use this information to take corrective action, such as:

  • Combine multiple small files into larger files.

  • Notify, limit, or block users who create or use an excessive number of small files.

Taking these actions:

  • Corrects existing performance degradation and prevents it in the future.

  • Lowers your costs to run apps.
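To see why combining files lowers the mapper count, the following minimal sketch estimates mappers under a deliberately simplified model: at least one mapper per file, and one mapper per split for larger files. The function name `estimate_mappers` and the 128 MB split size are assumptions made for illustration, and the model ignores input formats that combine small files into one split.

```python
import math


def estimate_mappers(file_sizes_mb, split_size_mb=128):
    """Rough mapper estimate: one per file, or one per split for files larger than a split.

    split_size_mb=128 is an assumed value, not something this report reads or sets.
    """
    return sum(max(1, math.ceil(size / split_size_mb)) for size in file_sizes_mb)


small_files = [1] * 10_000        # 10,000 files of 1 MB each
combined = [sum(small_files)]     # the same data stored as one 10,000 MB file

print(estimate_mappers(small_files))  # 10000
print(estimate_mappers(combined))     # 79
```

Under these assumptions the same 10,000 MB of data needs roughly 127 times fewer mappers once it is combined, which is the effect the corrective actions above aim for.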

The tab opens with the last successfully generated report, if any, sorted in descending order of the total number of small files in each directory. The report's parameters are listed above the table headings. You can search the table by path; any path that matches or contains the search string is displayed.

All reports, whether scheduled or ad hoc, are archived. Successful reports can be viewed or downloaded from the Report Archives tab.

Configuring the Small files report
  1. Stop Unravel.

    <Unravel installation directory>/unravel/manager stop
    
  2. Set the Small files report properties as follows:

    <Unravel installation directory>/unravel/manager config properties set <property> <value>
    For example:
    /opt/unravel-install/unravel/manager config properties set com.unraveldata.ngui.sfhivetable.schedule.interval 2d

    Refer to Small files and Small files and Files reports for the complete list of properties that can be configured for the Small files report.

  3. Apply the changes.

    <Unravel installation directory>/unravel/manager config apply
    
  4. Start Unravel.

    <Unravel installation directory>/unravel/manager start
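The four steps above can be sketched as a small helper that builds the same `manager` commands in order. This is only an illustration: the helper names `build_config_commands` and `apply_property` are hypothetical, the installation directory is taken from the example in step 2, and the sketch defaults to a dry run that prints the commands instead of executing them.

```python
import subprocess
from pathlib import PurePosixPath


def build_config_commands(install_dir, prop, value):
    """Return the stop / set / apply / start commands as argument lists."""
    manager = str(PurePosixPath(install_dir) / "unravel" / "manager")
    return [
        [manager, "stop"],
        [manager, "config", "properties", "set", prop, value],
        [manager, "config", "apply"],
        [manager, "start"],
    ]


def apply_property(install_dir, prop, value, dry_run=True):
    """Print the commands (dry run) or run them in order, stopping at the first failure."""
    for command in build_config_commands(install_dir, prop, value):
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)


# Dry run with the property and value from the example in step 2.
apply_property(
    "/opt/unravel-install",
    "com.unraveldata.ngui.sfhivetable.schedule.interval",
    "2d",
)
```

Passing `dry_run=False` would execute the commands, so review the printed sequence first.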
Generating Small Files report
  1. Click the New Report button to generate a new report. The parameters are:

    • Cluster: In a multi-cluster setup, you can select the cluster for which you want to generate the report.

    • Minimum File Size (bytes) / Maximum File Size (bytes): Only files whose size falls between the specified minimum and maximum are counted in the report.

    • Minimum # of Small Files: The minimum number of files in a directory that must match the size criteria above. Only directories that meet this threshold are included in the report.

    • # Directories to Show: This is the maximum number of directories to display.

    • Advanced Options:

      • Min parent directory depth: The minimum directory depth at which to start, counted from the root. For example, 0 = root and 1 = the root's children (/one).

      • Max parent directory depth: The maximum directory depth at which to end, counted from the root. For example, 1 = the root's children (/one) and 2 = the root's grandchildren (/one/two).

      • Drill down sub-directories: Determines how files are accounted for in the file system hierarchy. Yes (default): each file's size is attributed to all of its ancestor directories. No: each file's size is attributed only to its immediate parent directory.

      Note

      Min parent directory depth and Max parent directory depth must be between 0 and 50.

  2. Click Run to generate the report.

    The progress of the report generation is shown at the top of the page.

    A light green bar appears when the report succeeds, and the results are displayed. If report generation fails, the bar is light red and the New Report button turns orange.

    Click the download CSV button to download the report that is currently displayed.

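The report parameters described in step 1 can be read as a filter over a list of file paths and sizes. The sketch below is an interpretation of those descriptions, not Unravel's implementation: the function names are hypothetical, paths are assumed to be absolute, and the size range is assumed to be inclusive at both ends.

```python
from collections import Counter
from pathlib import PurePosixPath


def directory_depth(path):
    """Depth as described above: 0 = root, 1 = /one, 2 = /one/two."""
    return len(path.parts) - 1


def small_files_report(files, min_size, max_size, min_small_files, dirs_to_show,
                       min_depth=0, max_depth=50, drill_down=True):
    """Return (directory, small file count, total bytes) rows, largest count first."""
    counts, sizes = Counter(), Counter()
    for file_path, size in files:
        if not (min_size <= size <= max_size):   # inclusive range is an assumption
            continue
        parent = PurePosixPath(file_path).parent
        # Drill down "Yes": attribute the file to every ancestor; "No": parent only.
        dirs = [parent] + (list(parent.parents) if drill_down else [])
        for directory in dirs:
            if min_depth <= directory_depth(directory) <= max_depth:
                counts[str(directory)] += 1
                sizes[str(directory)] += size
    rows = [(d, n, sizes[d]) for d, n in counts.items() if n >= min_small_files]
    rows.sort(key=lambda row: (-row[1], row[0]))  # descending by small file count
    return rows[:dirs_to_show]


files = [("/one/two/a", 10), ("/one/two/b", 20), ("/one/c", 30),
         ("/one/big", 10**9), ("/x/d", 5)]

print(small_files_report(files, 0, 1000, 1, 10))
# [('/', 4, 65), ('/one', 3, 60), ('/one/two', 2, 30), ('/x', 1, 5)]
print(small_files_report(files, 0, 1000, 1, 10, drill_down=False))
# [('/one/two', 2, 30), ('/one', 1, 30), ('/x', 1, 5)]
```

With drill down enabled, the same file contributes to several rows, so the counts across rows are not meant to be summed.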
Scheduling Small files report
  1. Click Schedule to generate the report on a regular basis. Set the report parameters (see Generating Small Files report) and provide the following additional scheduling details:

    • Schedule Name: Name of the schedule.

    • Schedule to Run: Select one of the following schedule options from the drop-down, then set the time using the hours and minutes drop-downs:

      • Daily

      • Weekdays (Sun-Sat)

      • Every two weeks

      • Every month

    • Notification: Provide an email address to receive a notification when a report is generated.

  2. Click Schedule.