www.webdeveloper.com

Thread: Manipulating very large arrays

  1. #1
    Join Date
    Sep 2008
    Posts
    95

    Manipulating very large arrays

    Hello,

    Background: Consider spreadsheet grids with related data in 5,000 to 50,000 cells. Now consider 1000 of these spreadsheet grids ('documents') and a GUI to manipulate and research their data.

    Requirement: I need to be able to filter those documents 'on-the-fly' to show, for example, some data from "Column B" where "Column A" = some_var. This search is performed frequently depending on the user's activity.

    Problem: The issue appears to be size and performance when searching -- 40,000 cells add up to >40MB (???) in PHP according to memory_get_usage().

    So far I have tried various implementations...the best of which I've noted below:

    1. To minimize the size of the arrays, I have tried smaller 'chunks' -- but this requires iterating over the chunks, which adds more processing time.
    2. Iterating over the chunks using 'foreach' and 'if' statements to find matching results (5 seconds / 40,000 cells).
    3. array_filter() gives the best results (about 0.3 seconds / 40,000 cells using our required criteria), but over 1000 documents that's no good (a rough sketch of this approach is below).
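    A minimal sketch of the array_filter() approach on a single 'document' (the column keys 'A'/'B' and the $someVar value are placeholders, not the actual schema):
    PHP Code:
    <?php
    // $document is one grid: an array of rows, each row an associative array.
    $someVar = 'X';

    // keep only the rows where "Column A" matches $someVar...
    $matches = array_filter($document, function ($row) use ($someVar) {
        return isset($row['A']) && $row['A'] === $someVar;
    });

    // ...then pull "Column B" out of the matching rows
    $columnB = array_column($matches, 'B');
    ?>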


    I have also considered storing/indexing results in another table, but the results for the same search are likely to change as documents are updated, and they are specific to each user depending on his/her permissions within the system. So the storage space required, plus updating the table for every user whenever a document's data or permissions change, would be resource intensive.

    To facilitate other required functionality, it would be awesome to be able to keep each spreadsheet grid in an array 'document' so it can be managed quickly and handled as if it were a spreadsheet file. We've tried creating separate tables in the database for each spreadsheet column but found it limiting and extremely slow for some of the more complex tasks we need to perform on the data.


    Am I going about this the wrong way? Any suggestions or discussion on this would be appreciated. Thanks!

  2. #2
    Join Date
    Aug 2004
    Location
    Ankh-Morpork
    Posts
    19,321
    Yes, arrays are memory hogs to begin with in PHP, and even if they weren't, that much data would still require a lot of memory.

    Not sure how feasible this is for your app, but I would consider a methodology that would load the spreadsheet data into a database (maybe using SQLite or MongoDB if you don't want to go with a full-bore relational database), and then do your data manipulation via DB operations -- something a database is optimized for.
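    A minimal sketch of that idea with SQLite via PDO -- the file name, table layout, and column names here are assumptions, just to show the shape of it:
    PHP Code:
    <?php
    // load one parsed grid into SQLite, then filter it with an indexed query
    $db = new PDO('sqlite:/path/to/documents.sqlite');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $db->exec('CREATE TABLE IF NOT EXISTS cells
               (doc_id INTEGER, row_num INTEGER, col_a TEXT, col_b TEXT)');
    $db->exec('CREATE INDEX IF NOT EXISTS idx_doc_col_a ON cells (doc_id, col_a)');

    // bulk-insert the document's rows inside a transaction for speed
    $insert = $db->prepare('INSERT INTO cells (doc_id, row_num, col_a, col_b)
                            VALUES (?, ?, ?, ?)');
    $db->beginTransaction();
    foreach ($rows as $i => $row) { // $rows = the parsed spreadsheet grid
        $insert->execute(array($docId, $i, $row['A'], $row['B']));
    }
    $db->commit();

    // "Column B where Column A = some_var" becomes a plain indexed query
    $select = $db->prepare('SELECT col_b FROM cells WHERE doc_id = ? AND col_a = ?');
    $select->execute(array($docId, $someVar));
    $columnB = $select->fetchAll(PDO::FETCH_COLUMN);
    ?>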
    "Please give us a simple answer, so that we don't have to think, because if we think, we might find answers that don't fit the way we want the world to be."
    ~ Terry Pratchett in Nation

    eBookworm.us

  3. #3
    Join Date
    Dec 2002
    Location
    Seattle, WA
    Posts
    1,843
    You could try to use fread() in a pseudo-buffering state. This will help keep memory usage low while "preparing" the spreadsheet for multi-dimensional array conversion. Rough example:
    PHP Code:
    <?php
    $fp = fopen($filename, 'rb');
    $spreadsheet = array();
    $eor = "\r\n"; // end of row identifier

    if ($fp !== false)
    {
        $overflow = '';

        while (!feof($fp) && ($buffer = fread($fp, 1024)) !== false)
        {
            $buffer = $overflow . $buffer; // add any overflow back into the buffer...

            // test if we have the "tell" for a row...
            if (strpos($buffer, $eor) !== false)
            {
                // split the stack: the last piece may be an incomplete row
                $rows = explode($eor, $buffer);
                $overflow = array_pop($rows);

                foreach ($rows as $row)
                {
                    // process $row into $spreadsheet...
                }
            }
            else
            {
                // couldn't find the $eor...
                // prepare for next pass...
                $overflow = $buffer;
            }
        }
        fclose($fp);
        // any data left in $overflow is the final row (no trailing $eor)
    }
    ?>
    Another possible solution would be to do a "multi-threaded" request, or what has been dubbed on the client side as "web workers".
    This method splits the load of a one-thread process across two or more threads. Even on a single-core / single-CPU setup, this has benefits for keeping memory and CPU usage low.
    You will need to split the script so that one part does all the reading and converting and the other does all the processing on the multi-dimensional array, and use either the sockets, stream, or cURL extensions to "load" the "parser" script. "Processing" script example:
    PHP Code:
    <?php
    // "load" the web worker via http so that the parser script runs.
    // parser.php stores the data in a file, so we don't need to do
    // anything with the "return value" of file_get_contents().
    if (file_get_contents('http://example.com/path/to/parser.php') !== false)
    {
        include 'path/to/ss.data.php';
    }
    // process $spreadsheet...
    ?>
    "parser" script example:
    PHP Code:
    <?php
    $fp = fopen($filename, 'rb');
    $spreadsheet = array();
    $eor = "\r\n"; // end of row identifier

    if ($fp !== false)
    {
        $overflow = '';

        while (!feof($fp) && ($buffer = fread($fp, 1024)) !== false)
        {
            $buffer = $overflow . $buffer; // add any overflow back into the buffer...

            // test if we have the "tell" for a row...
            if (strpos($buffer, $eor) !== false)
            {
                // split the stack: the last piece may be an incomplete row
                $rows = explode($eor, $buffer);
                $overflow = array_pop($rows);

                foreach ($rows as $row)
                {
                    // split $row into fields and insert into $spreadsheet
                }
            }
            else
            {
                // couldn't find the $eor...
                // prepare for next pass...
                $overflow = $buffer;
            }
        }
        fclose($fp);

        // store $spreadsheet in a "known" location...
        // i like to use var_export() for storing data...
        file_put_contents('path/to/ss.data.php',
            '<'.'?php $spreadsheet = ' . var_export($spreadsheet, true) . '; ?'.'>');

        // you could also serialize() $spreadsheet into a plain data file
        // and unserialize() it in the "processing" script instead
        #file_put_contents('path/to/ss.data', serialize($spreadsheet));

        exit;
    }
    ?>

  4. #4
    Join Date
    Feb 2014
    Location
    Canada
    Posts
    155
    I experienced a similar problem in a graphing application I am just about to finish, in which data was being read from very large tables across networks and servers. Some of the processing was complex and I didn't know how to do all of it in T-SQL alone. My solution was to use batch files, SQL and Windows Task Scheduler to create smaller, more manageable tables (there are downloadable applications for this, depending on the DBMS you're using, if you don't want to create one yourself). Often, the tables were created through a series of joins, which reduced the complexity needed, so graphs could be retrieved or generated on the fly a lot faster. I then created a small index describing what each table contains for future users.

    In your case, the only time I would ever use PHP to do all the processing is if the data set and arrays are very small; otherwise you're going to wait a while. The most straightforward approach, and the one I'd go with first, would be to use a DBMS; which one you choose is up to you, based on your needs and restrictions. The reason is that a DBMS is designed to handle processing of large data sets and is accessible via PHP's built-in database functions. If you cannot use a DBMS, then your next options would be parsing the data in chunks, cURL, or multi-threading.
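    For example, something along these lines with mysqli (credentials, table and column names are placeholders) streams rows from the server one at a time instead of buffering the whole result set, so memory stays flat even on large tables:
    PHP Code:
    <?php
    // MYSQLI_USE_RESULT tells mysqli to stream rows as they are fetched
    // rather than loading the entire result set into memory first
    $db = new mysqli('localhost', 'user', 'pass', 'spreadsheets');

    $result = $db->query(
        "SELECT col_b FROM cells WHERE doc_id = 42 AND col_a = 'some_var'",
        MYSQLI_USE_RESULT
    );
    while ($row = $result->fetch_assoc()) {
        // handle one row at a time...
    }
    $result->close();
    $db->close();
    ?>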

  5. #5
    Join Date
    Sep 2008
    Posts
    95
    Thank you all for your suggestions.

    So far, I have tried NogDog's approach. I have tested queries on 1.1M rows, which results in query times consistently around 0.8s -- not fast enough, unfortunately. I have also tried creating a "view" in the database but, as expected, the query took the same amount of time.

    ShrineDesigns' multi-threaded request proposal is interesting. I had also considered a temp table to hold query data. Is this a similar approach?

    Error404, I appreciate the suggestions. I'm not familiar with the approach, so I'll need to research the method a bit more.

    I'll let you know what I find...

  6. #6
    Join Date
    Sep 2008
    Posts
    95
    I've been trying different options and have run into a (basic) question... I feel like I'm overlooking something obvious...

    When I use phpMyAdmin to query the database (SELECT * FROM `myInfo` WHERE `indexedFieldID2` IN ( 1,2,3 ) AND `someField` = 'X'), it says "Showing rows 0 - 29 ( 15,986 total, Query took 0.0039 sec)". Somehow it knows there were 15,986 rows (out of 2 million) that met my query criteria...and it figured that out in 0.0039 seconds. That's great!

    The problem is that I need to query the 2 million rows again, using an ID field from those 15,986 rows in the WHERE clause. If I return all 15,986 results from the first query, it takes far too long. Is there a way to use the results of the first query in a second query without returning them?

    I thought the following INNER JOIN might work, but it takes 1.3 seconds, which is still too long (since the query is run frequently):
    SELECT * FROM `myInfo` AS m INNER JOIN `myInfo` AS e ON e.indexedFieldID1 = m.parentID WHERE e.`indexedFieldID2` IN (1,2,3) AND e.`someField` = 'X' AND m.`someField` = 'Y'

    The table has an index on indexedFieldID1 and indexedFieldID2.

    Having given the other options a bit more thought, I think NogDog's idea would work best for my particular needs in the long run, given the other peripheral activities required. I just need to find a way to improve the query times.

    Is there another way (a better-formed query, perhaps, or a temporary table, or some other database feature) that would be better suited for this?

    [maybe this question should be moved to the SQL section of the forum?]

  7. #7
    Join Date
    Aug 2004
    Location
    Ankh-Morpork
    Posts
    19,321
    Have you EXPLAINed the slow query to see whether any indexes need to be added or changed, or whether anything is causing a full table scan?
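    For example (a hypothetical snippet assuming a $pdo connection -- you can also just prefix the query with EXPLAIN in phpMyAdmin's SQL tab):
    PHP Code:
    <?php
    // run EXPLAIN on the slow query from post #6 and dump the plan;
    // the "type", "key" and "rows" columns show whether an index is used
    $plan = $pdo->query(
        "EXPLAIN SELECT * FROM myInfo AS m
         INNER JOIN myInfo AS e ON e.indexedFieldID1 = m.parentID
         WHERE e.indexedFieldID2 IN (1,2,3)
           AND e.someField = 'X' AND m.someField = 'Y'"
    )->fetchAll(PDO::FETCH_ASSOC);
    print_r($plan);
    ?>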
    "Please give us a simple answer, so that we don't have to think, because if we think, we might find answers that don't fit the way we want the world to be."
    ~ Terry Pratchett in Nation

    eBookworm.us

  8. #8
    Join Date
    Feb 2014
    Location
    Canada
    Posts
    155
    A few options for your SQL queries.

    - Do you need to retrieve all of the columns from the table? If not, specifying the exact ones may help out a bit.
    - For your inner join, instead of joining with the entire table, have you tried using a sub-query that retrieves only the relevant part of that table?
    - You may want to consider a temporary table that already has the result of the join(s), so you would only need to search for the appropriate ID.

    You can have the results of one query be used in another; however, it takes a bit more SQL code and I'm not sure it would be as fast as you need.
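    Rough, untested sketches of the sub-query and temporary-table ideas, reusing the table/column names from post #6 (MySQL syntax, $pdo connection assumed):
    PHP Code:
    <?php
    // 1. sub-query: join against only the relevant slice of the table
    $sql = "SELECT m.*
            FROM myInfo AS m
            INNER JOIN (
                SELECT indexedFieldID1
                FROM myInfo
                WHERE indexedFieldID2 IN (1, 2, 3) AND someField = 'X'
            ) AS e ON e.indexedFieldID1 = m.parentID
            WHERE m.someField = 'Y'";
    $rows = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);

    // 2. temporary table: run the first query once, reuse it for later look-ups
    //    (a TEMPORARY table lives only for the current connection/session)
    $pdo->exec("CREATE TEMPORARY TABLE tmp_matches AS
                SELECT indexedFieldID1
                FROM myInfo
                WHERE indexedFieldID2 IN (1, 2, 3) AND someField = 'X'");
    $rows = $pdo->query("SELECT m.*
                         FROM myInfo AS m
                         JOIN tmp_matches AS t ON t.indexedFieldID1 = m.parentID
                         WHERE m.someField = 'Y'")->fetchAll(PDO::FETCH_ASSOC);
    ?>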

    With Microsoft SQL Server, you can see the full execution plan and find out which part of your query is taking the largest share of time and processing. I've used phpMyAdmin but not extensively -- is there a similar feature?

  9. #9
    Join Date
    Sep 2008
    Posts
    95

    resolved

    Hello,

    Thank you both for your suggestions.

    Thank you, NogDog, for the idea to EXPLAIN the query. I did just that in phpMyAdmin and noticed that it was not using the index I had set up. In fact, the query 'type' was 'ALL' and no index was selected. I'm not sure if there's a better way of identifying the best index to use, but with multiple indexes and some trial and error I was able to reduce the query time to 0.11s, which is sufficient for our needs.
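    (For anyone reading later: a composite index covering the filter columns is the kind of index the optimizer can pick instead of a full scan. A hypothetical example only -- not necessarily the exact index used here:)
    PHP Code:
    <?php
    // hypothetical composite index matching the WHERE clause from post #6,
    // so EXPLAIN's "type" is no longer ALL
    $pdo->exec("ALTER TABLE myInfo
                ADD INDEX idx_field2_some (indexedFieldID2, someField)");
    ?>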

    Since the query is used frequently, I'll use Error404's idea for a temporary table and store the results for future look-ups.

    I appreciate your help!
    Last edited by speghettiCode; 04-21-2014 at 01:52 PM. Reason: updated post icon to 'resolved'

  10. #10
    Join Date
    Aug 2004
    Location
    Ankh-Morpork
    Posts
    19,321
    Another satisfied customer.

    PS: Don't forget to mark the thread resolved -- if it is -- via the "Thread Tools" options at the top of the thread display.
    "Please give us a simple answer, so that we don't have to think, because if we think, we might find answers that don't fit the way we want the world to be."
    ~ Terry Pratchett in Nation

    eBookworm.us
