What is the language of this word

Java, 3416 bytes, 62%

This is my solution. I analyzed the list of given words and found the 60 most common bigrams and trigrams for each language. I then check those n-grams against the word and choose the language whose n-grams match most often (a trigram match counts double).

public class Classificator {
     
    String[][] triGr = {
            {"ing","ion","ent","tio","ted","nce","ter","res","ati","con","ess","ate","pro","ain","est","ons","men","ect","red","rea","com","ere","ers","nte","ine","her","ble","ist","tin","for","per","der","ear","str","ght","pre","ver","int","nde","the","igh","ive","sta","ure","end","enc","ned","ste","dis","ous","all","and","anc","ant","oun","ten","tra","are","sed","cti"},
            {"sch","che","ver","gen","ten","cht","ich","ein","ste","ter","hen","nde","nge","ach","ere","ung","den","sse","ers","and","eit","ier","ren","sen","ges","ang","ben","rei","est","nen","nte","men","aus","der","ent","hei","her","lle","ern","ert","uch","ine","ehe","auf","lie","tte","ige","ing","hte","mme","end","wei","len","hre","rau","ite","bes","ken","cha","ebe"},
            {"ent","are","ato","nte","ett","ere","ion","chi","con","one","men","nti","gli","pre","ess","att","tto","par","per","sta","tra","zio","and","iam","end","ter","res","est","nto","tta","acc","sci","cia","ver","ndo","amo","ant","str","tro","ssi","pro","era","eri","nta","der","ate","ort","com","man","tor","rat","ell","ale","gio","ont","col","tti","ano","ore","ist"},
            {"sze","ere","meg","ett","gye","ele","ond","egy","enn","ott","tte","ete","unk","ban","tem","agy","zer","esz","tet","ara","nek","hal","dol","mon","art","ala","ato","szt","len","men","ben","kap","ent","min","ndo","eze","sza","isz","fog","kez","ind","ten","tam","nak","fel","ene","all","asz","gon","mar","zem","szo","tek","zet","elm","het","eve","ssz","hat","ell"}

                    };
    static String[][] biGr = {
        {"in","ed","re","er","es","en","on","te","ng","st","nt","ti","ar","le","an","se","de","at","ea","co","ri","ce","or","io","al","is","it","ne","ra","ro","ou","ve","me","nd","el","li","he","ly","si","pr","ur","th","di","pe","la","ta","ss","ns","nc","ll","ec","tr","as","ai","ic","il","us","ch","un","ct"},
        {"en","er","ch","te","ge","ei","st","an","re","in","he","ie","be","sc","de","es","le","au","se","ne","el","ng","nd","un","ra","ar","nt","ve","ic","et","me","ri","li","ss","it","ht","ha","la","is","al","eh","ll","we","or","ke","fe","us","rt","ig","on","ma","ti","nn","ac","rs","at","eg","ta","ck","ol"},
        {"re","er","to","ar","en","te","ta","at","an","nt","ra","ri","co","on","ti","ia","or","io","in","st","tt","ca","es","ro","ci","di","li","no","ma","al","am","ne","me","le","sc","ve","sa","si","tr","nd","se","pa","ss","et","ic","na","pe","de","pr","ol","mo","do","so","it","la","ce","ie","is","mi","cc"},
        {"el","en","sz","te","et","er","an","me","ta","on","al","ar","ha","le","gy","eg","re","ze","em","ol","at","ek","es","tt","ke","ni","la","ra","ne","ve","nd","ak","ka","in","am","ad","ye","is","ok","ba","na","ma","ed","to","mi","do","om","be","se","ag","as","ez","ot","ko","or","cs","he","ll","nn","ny"}

                    };

    public int guess(String word) {

        if (word.length() < 3) {
            return 4; // most words shorter than 3 characters in the list are Hungarian
        }
        int score[] = { 0, 0, 0, 0 };
        for (int i = 0; i < 4; i++) {
            for (String s : triGr[i]) {
                if (word.contains(s)) {
                    score[i] = score[i] + 2;
                }
            }
            for (String s : biGr[i]) {
                if (word.contains(s)) {
                    score[i] = score[i] + 1;
                }
            }
        }
        int v = -1;
        int max = 0;
        for (int i = 0; i < 4; i++) {
            if (score[i] > max) {
                max = score[i];
                v = i;
            }
        }
        v++;
        return v == 0 ? 5 : v; // 5 means no n-gram matched for any language
    }
}

And this is my test case:

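import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Test {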

    Map<String, List<Integer>> words = new HashMap<String, List<Integer>>();

    boolean validate(String word, Integer lang) {
        List<Integer> langs = words.get(word);
        return langs.contains(lang);
    }

    public static void main(String[] args) throws FileNotFoundException {

        FileReader reader = new FileReader("list.txt");
        BufferedReader buf = new BufferedReader(reader);
        Classificator cl = new Classificator();
        Test test = new Test();
        buf.lines().forEach(x -> test.process(x));
        int guess = 0, words = 0;
        for (String word : test.words.keySet()) {
            int lang = cl.guess(word);
            if (lang==0){
                continue;
            }
            boolean result = test.validate(word, lang);
            words++;
            if (result) {
                guess++;
            }
        }
        System.out.println(guess+ " "+words+ "    "+(guess*100f/words));
    }

    private void process(String x) {
        String arr[] = x.split("\\s+"); // split on any run of whitespace
        String word = arr[0].trim();
        List<Integer> langs = words.get(word);
        if (langs == null) {
            langs = new ArrayList<Integer>();
            words.put(word, langs);
        }
        langs.add(Integer.parseInt(arr[1].trim()));

    }

}
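The answer does not show how the n-gram tables were built. A minimal sketch of one way to derive such a table, assuming a plain frequency count over one language's words, could look like the following. The class name `NGramCounter`, the method `topNGrams`, and the sample words are hypothetical and not part of the original solution.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NGramCounter {

    // Returns the `limit` most frequent n-grams of length `n` across `words`.
    // Ties are broken alphabetically so the result is deterministic.
    public static List<String> topNGrams(List<String> words, int n, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            // Slide a window of width n over the word and count each substring.
            for (int i = 0; i + n <= word.length(); i++) {
                counts.merge(word.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> grams = new ArrayList<>(counts.keySet());
        grams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(grams.subList(0, Math.min(limit, grams.size())));
    }

    public static void main(String[] args) {
        // Hypothetical sample; real input would be all words of one language from list.txt.
        List<String> sample = List.of("testing", "resting", "nesting");
        System.out.println(topNGrams(sample, 3, 2)); // prints [est, ing]
        System.out.println(topNGrams(sample, 2, 3)); // prints [es, in, ng]
    }
}
```

Calling `topNGrams(words, 3, 60)` and `topNGrams(words, 2, 60)` once per language would produce rows shaped like those in `triGr` and `biGr`, although the exact tie-breaking the author used is unknown.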

It’s the same phrase over and over: Little Cloud. These are the chapter titles of a novel by the Catalan writer Joan-Lluís Lluís, El navegant.

  • Catalan

  • French

  • Caló? Rom? (per context on the novel; also, thanks /u/SlyReference, tiknó is small in Carpathian Romani; also, Maruts are some Hindu deities identified with clouds; so, for sure some Romani language)

  • Latin

  • Modern Greek (thanks to /u/nonneb) Greek (modern? Classic, probably?)

  • Azeri (thanks, /u/Andthatwasthestory) ?

  • Polish (thanks, /u/Norrius) Polish? (because the l)

  • Maltese, although it should be sħaba żgħira (thanks, /u/airelivre) Maltese?

  • Hebrew (thanks, /u/TheClimor)

  • Aymara (thanks, /u/uninanx) ?

  • Hungarian (thanks, /u/nonneb) Hungarian? (because the weird umlaut)

  • Lenape (per context on the novel, confirmed via google)

  • Japanese (thanks, /u/XyloPlayer) Japanese (per the hiragana, or katakana, or whatever that is)

  • Nengone (thanks, /u/julynarsil) ?

  • Breton (I googled it)

  • Occitan (per context on the novel, confirmed via wikipedia)

  • German

  • Traditional Chinese (thanks to /u/Crixs) Mandarin?

  • Inuktitut (thanks, /u/airelivre) Inuk Inuttut / Inutittut (at least cloud is nuvujak) ? Sounds like an Eskimo language

  • Plautdietsch / Mennonite Low German (some detective work was needed to confirm it) ?

  • Kinyarwanda (thanks, /u/All_Is_Not_Self) ?

  • Arabic (thanks, /u/Doktahh)

  • Yindjibarndi (thanks, /u/julynarsil) Some north India language?

  • Yiddish (thanks, /u/nonneb) Yiddish? As 9 is Hebrew…

  • Hawaiian (thanks, /u/bulaisen) ?

  • Galician Portuguese?

  • Belarusian (thanks, /u/Veturi) Russian? Ukrainian? …?

  • Latvian (thanks, /u/tschlafer) ?

  • Western Dakota (thanks, /u/julynarsil) ?

  • Urdu (thanks, /u/Doktahh)

  • Romansch (thanks, /u/julynarsil) Romansch? Tsch and ü are Germanic, and Nüvla is quite similar to Catalan núvol

  • Vietnamese (thanks, /u/BuffaloStapler) Vietnamese? (the only language I know with those diacritics)

  • Corsican (thanks, /u/airelivre), although it might also be Sicilian or another southern Italian language (thanks, /u/gort818): small cloud would be nuvulina, feminine, but as the nickname of a male it might have become nuvulinu. Sard? Sicilian? Romanian?

  • Daungwurrung or a related Kulin language (thanks, /u/JDFidelius) ?

  • Korean (thanks, /u/XyloPlayer) Korean, I guess.

  • Norwegian (thanks, /u/XyloPlayer) Scots??? it’s like little sky…

  • ? Quileute, probably (because the second dot on diï and also kwa might mean little) An African language?

  • Kazakh (thanks, /u/Norrius) Russian? Ukrainian? …?

  • Sardinian (thanks, /u/airelivre) Portuguese? A Portuguese creole? Any other Romance language?

  • Georgian (thanks, /u/Andthatwasthestory) Georgian?

  • Ladin (thanks, /u/airelivre) Some creole or Romance language?

  • Yoruba (thanks, /u/Andthatwasthestory) Some African language?

  • Armenian (thanks, /u/Andthatwasthestory) Armenian?

  • Maori (thanks, /u/All_Is_Not_Self) ?

  • ?

  • Basque (I’ve googled it) Basque? (because the ñ)

  • Persian (thanks, /u/Andthatwasthestory) Persian? (as Arabic and Urdu are already there)

  • ?

  • Welsh (thanks, /u/Andthatwasthestory) Welsh? (per the w instead of normal vowels)

  • Some Kanak language, no idea which one, probably the one spoken in Nouméa, maybe Ndrumbea

  • (Edit) NOTE: Small Cloud is the nickname of the main character, who is a male.

    I work with Dutch, French and English documents.
    In my experience, Office DOES NOT detect the language correctly. When I write a document in the system language, everything is fine: spelling and grammar are checked. But the language is always set automatically to the system language, even when the two other languages are installed in the system and in the Office language options.

    Even while I write this text, every word is underlined in red, so Chrome does not detect the language either.

    The system language is Dutch, and this problem has always existed. Whatever I try, I have to select all, set the language manually, and then run the spelling check.

    Looping through the languages makes no sense if the detection is not right. It seems to me that the language detection and the spelling and grammar checking have been on stand-by since MS Office 2007, almost a decade. See here.

    Whether this has to do with the fact that Dutch is a ‘small’ language, I don’t know.
    If there were a way to «set language» for the current document, a simple start-up macro would do the job. So far I have not found code that does this, except this simple macro I wrote:

    Sub setlng()
        ' Set the proofing language for the whole document
        Selection.WholeStory
        With Selection
            ' UCase makes the comparison case-insensitive (nl, Nl, NL, ...)
            Select Case UCase(InputBox("What's your language? (NL = Nederlands, FR = Français, EN = English, DE = Deutsch)"))
                Case "NL"
                    .LanguageID = wdDutch
                Case "FR"
                    .LanguageID = wdFrench
                Case "EN"
                    .LanguageID = wdEnglishUS
                Case "DE"
                    .LanguageID = wdGerman
            End Select

            Application.CheckLanguage = True
        End With

    End Sub

    Clearly, since MS Office was written in English, you have to use the ENGLISH name for your language instead of the language’s own name for itself, which would be more logical…

    I’m very curious whether people who live in Azerbaijan ever find their language: "Selection.LanguageID = wdAzeriCyrillic" … hm…


    GLOBAL ENGLISH
    English Quiz
    Karuna Olga Yurievna

    Question 1. How many words did William Shakespeare use?
    a) 300
    b) 3,000
    c) 30,000

    Question 2. How many native words are there in the English language?
    a) 70%
    b) 50%
    c) 30%

    Question 3. Which English word has the most definitions?
    a) set
    b) get
    c) have

    Question 4. What language did William the Conqueror use?
    a) French
    b) English
    c) German

    Question 5. Which is the most common letter in English?
    a) e
    b) a
    c) i

    Question 6. Which is the least common letter in English?
    a) x
    b) q
    c) z

    Question 7. What is the capital of Canada?
    a) Montreal
    b) Ottawa
    c) Adelaide

    Question 8. The British ask for the bill in a restaurant at the end of the meal. What do Americans ask for?
    a) the check
    b) the receipt
    c) the script

    Question 9. In British English it's called a mobile. What is it called in the US?
    a) a handy
    b) a cell phone
    c) a portable phone

    Question 10. Which word is used more in American English than in British English?
    a) mom
    b) mum
    c) mummy

    Question 11. What is the capital of Australia?
    a) Sydney
    b) Canberra
    c) Melbourne

    Question 12. What is the capital city of New Zealand?
    a) Sydney
    b) Oakland
    c) Wellington

    Question 13. In Cockney, "I don't Adam and Eve you" means
    a) I don't love you
    b) I don't understand you
    c) I don't believe you

    Question 14. Which of these drink words was borrowed from Arabic?
    a) wine
    b) juice
    c) alcohol

    Question 15. What language is the word sauna from?
    a) Swedish
    b) Dutch
    c) Finnish

    Question 16. Which famous fast food came from Germany?
    a) pizza
    b) hamburger
    c) sandwich

    Question 17. What language is the word robot from?
    a) Czech
    b) Polish
    c) Hungarian

    Question 18. What country are hara-kiri, kimono, and karate from?
    a) China
    b) Japan
    c) Spain

    Question 19. Which of the following English words are not French borrowings (заимствования)?
    a) table, wardrobe, chair
    b) army, battle, peace
    c) father, king, pig

    Question 20. What country are the words opera, soprano, concerto, and piano from?
    a) Italy
    b) Spain
    c) Portugal

    Question 21. How many new words are added to the English vocabulary each year?
    a) about 50
    b) about 300
    c) about 500

    Question 22. Where do the majority of computer terms come from? (Examples: Web, PC, video, on-screen, chat)
    a) the UK
    b) the USA
    c) Australia

    Question 23. Which word is most frequently used in conversation?
    a) yes
    b) no
    c) I

    Question 24. Which words are most frequently used in written English?
    a) boy, girl, love
    b) money, business, bank
    c) a, the, and

    Question 25. What do the British say before the meal?
    a) Bon appétit!
    b) Bless you!
    c) Nothing

    Question 26. What is the correct question tag in this polite request? "Open the window, __ you?"
    a) will
    b) do
    c) please

    Question 27. If someone says "Cheerio", what do they mean?
    a) Goodbye
    b) Hello
    c) Thank you

    Question 28. What should you say if someone sneezes (чихает)?
    a) How is it going?
    b) Bless you!
    c) Can I help you?

    Question 29. What would you say if you wanted to sit down in a busy place?
    a) Excuse me, is this seat busy?
    b) Let me take this seat, please.
    c) Excuse me, is this seat taken?

    Question 30. What is a polite response to "Thank you very much"?
    a) Of course!
    b) The same to you!
    c) You're welcome!

    Question 31. What do you say in a shop if you only want to look and not buy?
    a) I'm just browsing.
    b) I'm just viewing.
    c) I'm just shoplifting.

    Question 32. To tell someone who you are on the phone, which of the following is the most natural?
    a) It's Tom
    b) I'm Tom
    c) Tom speaking

    Question 33. The sentence "Madam, I'm Adam" is spelled the same from left to right and from right to left. It is…
    a) an anagram
    b) a palindrome
    c) a puzzle

    Keys
    a) b) c)

    What Is the Definition of Word?

    «The trouble with words,» said British dramatist Dennis Potter, «is that you never know whose mouths they’ve been in.»


    A word is a speech sound or a combination of sounds, or its representation in writing, that symbolizes and communicates a meaning and may consist of a single morpheme or a combination of morphemes.

    The branch of linguistics that studies word structures is called morphology. The branch of linguistics that studies word meanings is called lexical semantics.

    Etymology

    From Old English, «word»

    Examples and Observations

    • «[A word is the] smallest unit of grammar that can stand alone as a complete utterance, separated by spaces in written language and potentially by pauses in speech.»
      -David Crystal, The Cambridge Encyclopedia of the English Language. Cambridge University Press, 2003
    • «A grammar . . . is divided into two major components, syntax and morphology. This division follows from the special status of the word as a basic linguistic unit, with syntax dealing with the combination of words to make sentences, and morphology with the form of words themselves.» -R. Huddleston and G. Pullum, The Cambridge Grammar of the English Language. Cambridge University Press, 2002
    • «We want words to do more than they can. We try to do with them what comes to very much like trying to mend a watch with a pickaxe or to paint a miniature with a mop; we expect them to help us to grip and dissect that which in ultimate essence is as ungrippable as shadow. Nevertheless there they are; we have got to live with them, and the wise course is to treat them as we do our neighbours, and make the best and not the worst of them.»
      -Samuel Butler, The Note-Books of Samuel Butler, 1912
    • Big Words
      «A Czech study . . . looked at how using big words (a classic strategy for impressing others) affects perceived intelligence. Counter-intuitively, grandiose vocabulary diminished participants’ impressions of authors’ cerebral capacity. Put another way: simpler writing seems smarter.»
      -Julie Beck, «How to Look Smart.» The Atlantic, September 2014
    • The Power of Words
      «It is obvious that the fundamental means which man possesses of extending his orders of abstractions indefinitely is conditioned, and consists in general in symbolism and, in particular, in speech. Words, considered as symbols for humans, provide us with endlessly flexible conditional semantic stimuli, which are just as ‘real’ and effective for man as any other powerful stimulus.»
    • Virginia Woolf on Words
      «It is words that are to blame. They are the wildest, freest, most irresponsible, most un-teachable of all things. Of course, you can catch them and sort them and place them in alphabetical order in dictionaries. But words do not live in dictionaries; they live in the mind. If you want proof of this, consider how often in moments of emotion when we most need words we find none. Yet there is the dictionary; there at our disposal are some half-a-million words all in alphabetical order. But can we use them? No, because words do not live in dictionaries, they live in the mind. Look once more at the dictionary. There beyond a doubt lie plays more splendid than Antony and Cleopatra; poems lovelier than the ‘Ode to a Nightingale’; novels beside which Pride and Prejudice or David Copperfield are the crude bunglings of amateurs. It is only a question of finding the right words and putting them in the right order. But we cannot do it because they do not live in dictionaries; they live in the mind. And how do they live in the mind? Variously and strangely, much as human beings live, ranging hither and thither, falling in love, and mating together.»
      -Virginia Woolf, «Craftsmanship.» The Death of the Moth and Other Essays, 1942
    • Word Word
      «Word Word [1983: coined by US writer Paul Dickson]. A non-technical, tongue-in-cheek term for a word repeated in contrastive statements and questions: ‘Are you talking about an American Indian or an Indian Indian?’; ‘It happens in Irish English as well as English English.'»
      -Tom McArthur, The Oxford Companion to the English Language. Oxford University Press, 1992
