www.webdeveloper.com

Thread: Getting delimiter from a line

  1. #1
    Join Date
    Sep 2006
    Posts
    655

    Question Getting delimiter from a line

    Hi

    Can someone please tell me the best way to find the delimiter in a given line (ignoring spaces)?

    For convenience, we can assume the email address is the anchor if necessary: the very next character after the email (other than a space) is our delimiter. But there may be cases where the email address comes last and no delimiter follows it.


    Some examples are:

    Ex 1:
    Code:
    jon, doe, abc@gmail.com, 996655
    Ex 2:
    Code:
    abc@gmail.com; doe; ;996655
    Ex 3:
    Code:
    jon# doe# 996655# abc@gmail.com
    Ex 4:
    Code:
    jon doe 96655
    Ex 5:
    Code:
    jon doe 996655 abc@gmail.com
    Ex 6:
    Code:
    jon;doe;abc@gmail.com;996655;
    In Ex 4 and Ex 5 above, it should report that no delimiter was found.

    Any help is appreciated.

    Thanks
    Last edited by phantom007; 06-20-2013 at 09:05 AM.

  2. #2
    Join Date
    Aug 2004
    Location
    Ankh-Morpork
    Posts
    19,334
    So what would the expected response be for:
    Code:
    foo.bar#example.com#"I like using ""#"" or ""."" as a CSV delimiter."
    "Please give us a simple answer, so that we don't have to think, because if we think, we might find answers that don't fit the way we want the world to be."
    ~ Terry Pratchett in Nation

    eBookworm.us

  3. #3
    Join Date
    Sep 2006
    Posts
    655
    Quote Originally Posted by NogDog View Post
    So what would the expected response be for:
    Code:
    foo.bar#example.com#"I like using ""#"" or ""."" as a CSV delimiter."

    Hi, thanks for your reply.

    When I saved that line of yours and opened it in my spreadsheet program, specifying # as the delimiter, it split the line up very nicely. See the screenshot:

    http://i.imgur.com/K8uEBlC.png

    How are they doing it?
    Last edited by phantom007; 06-20-2013 at 09:36 AM.

  4. #4
    Join Date
    Aug 2004
    Location
    Ankh-Morpork
    Posts
    19,334
    Quote Originally Posted by cancer10 View Post
    Hi, thanks for your reply.

    When I saved that line of yours and opened it in my spreadsheet program, specifying # as the delimiter, it split the line up very nicely. See the screenshot:

    http://i.imgur.com/K8uEBlC.png

    How are they doing it?
    Ahh...but how did you know "#" was the delimiter and not "."? ("#" is the correct answer in this case if you want valid CSV format based on the quoting, but what if you remove all the quotes?)

    So what I'm ultimately pushing toward here in order to nail down the actual requirement, is how do you determine which is the correct separator character in any given line of text, especially if that text contains two or more candidate special characters?

    Once we have a precise requirement, the code may then become self-evident. For instance, if the requirement is simply to break the text into separate fields using any of ".,;|#" as separators, you might use preg_split() with a simple character class as the separator. If the result must always be 3 fields and using only one separator chosen from a set of possible separators, you might have to loop through a set of possible separators (perhaps using foreach() on an array of separator characters) until you find one that gives you 3 fields (or returns an error if you exhaust the possible separators with no result with 3 fields).
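Both ideas from the post above can be sketched roughly as follows. The function names `splitOnAny()` and `findSeparator()` are made up for illustration, and the expected field count of 3 is only the example value mentioned in the post; treat this as a sketch under those assumptions rather than a finished solution.

```php
<?php
// Approach 1: split on ANY of the candidate characters at once.
// (Hypothetical helper name; fields are trimmed of surrounding spaces.)
function splitOnAny(string $line): array
{
    // Inside a character class, '.', '|' and '#' are all literal characters.
    return array_map('trim', preg_split('/[.,;|#]/', $line));
}

// Approach 2: try each candidate in turn until one yields the expected
// number of fields. Returns the separator, or false if none fits.
function findSeparator(string $line, int $expectedFields = 3, array $candidates = array('.', ',', ';', '|', '#'))
{
    foreach ($candidates as $candidate) {
        if (count(explode($candidate, $line)) === $expectedFields) {
            return $candidate;
        }
    }
    return false; // exhausted all candidates without a match
}

print_r(splitOnAny('jon# doe# 996655'));      // three fields: jon, doe, 996655
var_dump(findSeparator('jon# doe# 996655'));  // string(1) "#"
var_dump(findSeparator('jon doe 96655'));     // bool(false)
```

Note that the second approach depends on the order of the candidate list: the first candidate that happens to produce the expected count wins, even if another candidate would too.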

  5. #5
    Join Date
    Sep 2006
    Posts
    655
    Hello NogDog

    Thanks for your reply.

    The idea is to take the very NEXT character (other than white space) after the email field and assume it is the delimiter. Here are the possible cases:

    Case 1: Email is in the first column. We get the delimiter by taking the very next character after the email (other than white space).

    Case 2: Email is in a middle column. We get the delimiter by taking the very next character after the email (other than white space).

    Case 3: Email is in the last column. We get the delimiter by taking the character before the email (other than white space).

    Case 4: Email is the only column. No delimiter is required.

    Case 5: Email does not exist. We show an error, since the email is mandatory here.

    So the question is: how do we achieve this? With a regex pattern? If so, what would that regex be?

  6. #6
    Join Date
    Aug 2004
    Location
    Ankh-Morpork
    Posts
    19,334
    This is looking promising, but you should add some more test cases, including negative tests to see if it really works.

    PHP Code:
    <?php

    function getDelimiter($str, $debug = false)
    {
        $email = '\w[^@\s]*@[^@\s]+\w';
        $regex = '/(^|\S)\s*'.$email.'\s*($|\S)/';
        if(preg_match($regex, $str, $matches)) {
            if($debug) {
                echo "<pre>".var_export($matches,1)."</pre>\n";
            }
            foreach($matches as $match) {
                if(strlen($match) == 1) {
                    return $match;
                }
            }
            return false;
        }
        else {
            if($debug) {
                echo "<p>Nope</p>\n";
            }
            return false;
        }
    }

    $data = array(
        'foo.bar@example.com# foo# bar',
        'foo #foo.bar@example.com #bar',
        'foo # bar # foo.bar@example.com'
    );
    foreach($data as $test) {
        $delimiter = getDelimiter($test, true);
        echo "<p>Delimiter for '$test' is '$delimiter'.</p>\n";
    }
    Last edited by NogDog; 06-21-2013 at 12:33 AM.

  7. #7
    Join Date
    Sep 2006
    Posts
    655
    Thanks once again.

    Can I use the following regex for the email instead of the one in your code?

    $email = '/([+a-zA-Z0-9._-]+@[a-zA-Z0-9._-]{2,}\.[a-zA-Z]{2,6})/i';

    I am not sure whether it is more powerful than the one you used; it's just that yours is confusing to me.

    Please suggest.

    Thanks
    Last edited by phantom007; 06-21-2013 at 01:01 AM.
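One possible way to plug that stricter email pattern into the earlier `getDelimiter()` approach, as a sketch only. Because the `$email` fragment is concatenated into a larger pattern, its own `/.../i` delimiters and capturing parentheses have to be dropped, and the `i` flag moves to the outer pattern. The wrapper name `delimiterNearEmail()` is hypothetical.

```php
<?php
// Hypothetical wrapper that mirrors the structure of getDelimiter() above.
function delimiterNearEmail(string $line)
{
    // Email fragment WITHOUT its own delimiters, flags, or capturing group.
    $email = '[+a-zA-Z0-9._-]+@[a-zA-Z0-9._-]{2,}\.[a-zA-Z]{2,6}';
    // Group 1: non-space character before the email (or start of line).
    // Group 2: non-space character after the email (or end of line).
    $regex = '/(^|\S)\s*' . $email . '\s*($|\S)/i';
    if (!preg_match($regex, $line, $matches)) {
        return false; // no email found
    }
    // Prefer the character after the email, then fall back to the one before it.
    foreach (array($matches[2], $matches[1]) as $candidate) {
        if ($candidate !== '') {
            return $candidate;
        }
    }
    return false;
}

var_dump(delimiterNearEmail('jon, doe, abc@gmail.com, 996655')); // string(1) ","
var_dump(delimiterNearEmail('jon# doe# 996655# abc@gmail.com')); // string(1) "#"
```

This sketch does not filter out letters or digits, so a line with no real delimiter still returns whatever non-space character sits next to the email.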

  8. #8
    Join Date
    Sep 2006
    Posts
    655
    I also noticed a problem when there are no delimiters at all:

    Code:
    abc@gmail.com  John Mathew
    The above returns J as the delimiter. Perhaps ignoring a-zA-Z0-9 would do the trick?


    And for the following string, it returns T as the delimiter:

    Code:
    aaa@gmail.in;CHARIOT Tichel;
    Thanks
    Last edited by phantom007; 06-21-2013 at 05:51 AM.

  9. #9
    Join Date
    Sep 2006
    Posts
    655
    One more question: what if the delimiter is a non-UTF-8 character? Is there any way to detect that?
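As a starting point for that question, one option is to check whether the line is valid UTF-8 before looking for the delimiter. This is a minimal sketch that assumes the mbstring extension is available; the helper name `isValidUtf8()` and the sample strings are invented for illustration.

```php
<?php
// Hypothetical helper: true if the whole line is well-formed UTF-8.
// Requires the mbstring extension.
function isValidUtf8(string $line): bool
{
    return mb_check_encoding($line, 'UTF-8');
}

// 0xA6 is the "broken bar" in ISO-8859-1; as a lone byte it is not valid UTF-8.
$latin1 = "jon\xA6doe\xA6abc@gmail.com";
$utf8   = "jon;doe;abc@gmail.com";

var_dump(isValidUtf8($latin1)); // bool(false)
var_dump(isValidUtf8($utf8));   // bool(true)
```

A failed check only says the line is not UTF-8; it does not say which encoding it actually is, so converting it would still need that information from the data source.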

  10. #10
    Join Date
    Aug 2004
    Location
    Ankh-Morpork
    Posts
    19,334
    I'm starting to think the only viable solution -- outside of requiring the data source use one specific delimiter, preferably with proper CSV formatting -- is to create a "white list" of allowed delimiters and test against each one until you get the correct number of fields.
    PHP Code:
    $delims = array(',', ';', '|', '#');
    $delimiter = null;
    foreach($delims as $delim) {
        $parts = explode($delim, $text);
        if(count($parts) == 3) { // or whatever correct value is
            $delimiter = $delim;
            break;
        }
    }

  11. #11
    Join Date
    Sep 2006
    Posts
    655
    Hi

    Thanks for the reply and for taking the trouble to code it, but I am not sure how to integrate that new code of yours.

    BTW, here is another example:

    Code:
    $str = '000020;ACTIVE;AU VIEUX CAMPEUR;;48 RUE DES ECOLES; ;75005;PARIS;president JACQUES YVES DE RORTHAYS;AF14;M.;Jean-Jacques DENUAU;06 08 16 65 62;jow.mathew@gmail.in;F004;Melle;Magali SUREDA;06 86 48 23 30;sudes.waraka@asics.fr;Melle;Anne-Charlotte MICHELET;hello-abc@get.in';

    $a = getDelimiter($str);

    echo $a; // returns 5, which is incorrect. It should return ;



    I am not sure whether it's too hard to implement the following logic using regex.


    Case 1: Get the next character (other than a-zA-Z0-9.\r\n\f and white space) after the first email address found in a given line.

    Case 2: If no character is found in Case 1, get the character before the email (other than a-zA-Z0-9.\r\n\f and white space).

    Case 3: If no character is found in Case 1 or Case 2, return false.


    Please help.

  12. #12
    Join Date
    Aug 2004
    Location
    Ankh-Morpork
    Posts
    19,334
    PHP Code:
    <?php
    /**
     * Try to figure out what the delimiter is by looking for email address
     * @return string (false if not found)
     * @param  string $str   string to search
     * @param  bool   $debug whether to output debug info (default to false)
     */
    function getDelimiter($str, $debug=false)
    {
        static $regex = '/(^|\S)\s*[^()<>@,;:\\".\[\] \000-\031][^()<>@,;:\\"\[\] \000-\031]*@[^()<>@,;:\\"\[\] \000-\031]*[^()<>@,;:\\".\[\] \000-\031]+\s*(\S|$)/';
        if(preg_match($regex, $str, $matches)) {
            if($debug) {
                echo "<pre>Debug: found email:\n".var_export($matches,1)."</pre>\n";
            }
            for($ix=1; $ix<=2; $ix++) {
                if(!empty($matches[$ix]) and preg_match('/\W/', $matches[$ix])) {
                    return $matches[$ix];
                }
            }
        }
        return false;
    }

    // test it:
    $test = array(
        '000020;ACTIVE;AU VIEUX CAMPEUR;;48 RUE DES ECOLES; ;75005;PARIS;president JACQUES YVES DE RORTHAYS;AF14;M.;Jean-Jacques DENUAU;06 08 16 65 62;jow.mathew@gmail.in;F004;Melle;Magali SUREDA;06 86 48 23 30;sudes.waraka@asics.fr;Melle;Anne-Charlotte MICHELET;hello-abc@get.in',
        'aaa@gmail.in;CHARIOT Tichel',
        'jon# doe# 996655# abc@gmail.com',
        'jon doe 996655 abc@gmail.com'
    );

    foreach($test as $str) {
        echo "<pre>$str:\n";
        $result = getDelimiter($str, true);
        echo var_export($result,1)."</pre>\n";
    }

  13. #13
    Join Date
    Sep 2006
    Posts
    655
    Looks good so far, except that it returns false when the delimiter is a tab.

    Code:
    $test = array(
    "jon	doe	abc@gmail.com	996655"
    );
    Returns
    Code:
    jon	doe	abc@gmail.com	996655
    false

    I will be doing more tests and will come back to you if I find more issues.

    Can you also please tell me what exactly the regex in your code is doing?

    Thanks for your hard work.
    Last edited by phantom007; 06-23-2013 at 12:15 AM.

  14. #14
    Join Date
    Aug 2004
    Location
    Ankh-Morpork
    Posts
    19,334
    You want tabs, too? You said to skip white-space, which normally includes tabs. If you tell me you also want underscores as delimiters, I may have to send somebody to rough you up a bit.

  15. #15
    Join Date
    Sep 2006
    Posts
    655
    Sorry if I bothered you, but I think it's normal for CSV files to use tabs (\t) as delimiters, so that should be valid. When I said white space, I actually meant the space produced by the spacebar key on the keyboard.

    Thanks, and sorry once again.

    PS: could you please explain what exactly your regex is doing?
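For the tab requirement, one possible variant is to skip only the space character (rather than all white space) around the email, so that a tab can be returned as the delimiter. This is a sketch under several assumptions: the function name `getDelimiterWithTabs()` is invented, the email pattern is a simplified stand-in rather than the one posted earlier in the thread, and the excluded characters `,;|#` come from the white list suggested in post #10.

```php
<?php
// Hypothetical variant: only ' ' is skipped, so "\t" can be the delimiter.
function getDelimiterWithTabs(string $str)
{
    // Simplified email pattern (an assumption): no '@', white space,
    // or white-listed delimiter characters on either side of the '@'.
    $email = '[^@\s,;|#]+@[^@\s,;|#]+';
    // Group 1: non-space character before the email (or start of line).
    // Group 2: non-space character after the email (or end of line).
    $regex = '/(^|[^ ]) *' . $email . ' *([^ ]|$)/';
    if (preg_match($regex, $str, $matches)) {
        // Check the character after the email first, then the one before it.
        foreach (array(2, 1) as $ix) {
            // Reject letters, digits, '.', and line-break characters.
            if ($matches[$ix] !== '' && preg_match('/[^a-zA-Z0-9.\r\n\f]/', $matches[$ix])) {
                return $matches[$ix];
            }
        }
    }
    return false; // no email, or no usable delimiter next to it
}

var_dump(getDelimiterWithTabs("jon\tdoe\tabc@gmail.com\t996655") === "\t"); // bool(true)
var_dump(getDelimiterWithTabs('jon# doe# 996655# abc@gmail.com'));          // string(1) "#"
var_dump(getDelimiterWithTabs('jon doe 996655 abc@gmail.com'));             // bool(false)
```

A known limitation of this sketch: a delimiter outside the assumed white list (for example `:`) would be absorbed into the email match, so the list inside `$email` has to cover every delimiter you expect.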
